
Update notes of version 0.6.0

Naibo Wang edited this page Dec 25, 2023 · 1 revision

Update Notes

  1. Click Interaction in Browser: During task design, clicking on an element in the browser will now automatically highlight and map it for easier debugging. This applies to all browser-related operations, including JavaScript command debugging and automatic element marking for conditional branches.
  2. Double-Click for Dynamic Debugging: In task design, double-clicking an operation will test run it for dynamic debugging, displaying the results in real-time in the browser.
  3. Speed Optimization: Data extraction operations are now significantly faster, especially those without additional tasks like executing JavaScript or downloading images.
  4. Dynamic XPath and Code Modification Using eval: Any XPath or JavaScript code snippet can now incorporate expressions directly from the Python environment using eval("expression_value"), eliminating the need for intermediate storage variables. For instance:
    • Define a variable a using the exec option in a custom operation:
      self.a = 1
    • In an XPath data extraction operation, use the following to represent /html/body/div[1]:
      /html/body/div[eval("self.a")]
    • Change the value of a using the exec option:
      self.a = self.a + 1
    • The XPath for data extraction will now correspond to /html/body/div[2].
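The substitution described in item 4 can be pictured with a minimal Python sketch. It is not EasySpider's actual implementation: the helper `expand_eval` and the `TaskContext` class are hypothetical names introduced here, and the sketch assumes each eval("...") token is simply replaced by the string form of its evaluated expression.

```python
import re

# Hypothetical pattern for eval("...") tokens embedded in an XPath string.
_EVAL_TOKEN = re.compile(r'eval\("([^"]*)"\)')


class TaskContext:
    """Stand-in for the object that `self` refers to in exec/eval snippets."""


def expand_eval(template, ctx):
    """Replace every eval("expr") token with str(expr), evaluated with self=ctx."""
    def substitute(match):
        # Evaluate the captured expression with `self` bound to the context.
        return str(eval(match.group(1), {"self": ctx}))
    return _EVAL_TOKEN.sub(substitute, template)


ctx = TaskContext()
xpath = '/html/body/div[eval("self.a")]'

exec("self.a = 1", {"self": ctx})           # the first exec step
first = expand_eval(xpath, ctx)             # "/html/body/div[1]"

exec("self.a = self.a + 1", {"self": ctx})  # the second exec step
second = expand_eval(xpath, ctx)            # "/html/body/div[2]"
```

Because the expression is re-evaluated on every use, the same XPath text follows the variable as it changes between iterations.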

This is particularly useful for scenarios where there is no "next page" button, and pages must be turned by clicking different page numbers. A detailed tutorial and example task file (290.json) will be released soon.

  5. External Code File for Exec and Eval: Users can now write Python code in an IDE like VSCode and input outside:myCode.py in the task input box. The program will execute the code from myCode.py in the EasySpider directory. This is suitable for scenarios requiring extensive code that benefits from an IDE.

Note that EasySpider supports custom Python functions, importing external Python packages, and using try...except for exception handling.
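As an illustration of the note above, a file such as myCode.py might look like the following sketch. The function name, the sample input, and the choice of the json package are assumptions made for illustration only; how EasySpider exposes its own state to the file is not shown here.

```python
# myCode.py (hypothetical contents; all names below are illustrative only)
import json  # importing a package, as the note above says is supported


def parse_price(raw):
    """Convert a scraped price string such as '$1,299.00' to a float."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned)


try:
    price = parse_price("$1,299.00")
except ValueError:
    # Fall back to a sentinel value instead of aborting the whole task.
    price = -1.0

summary = json.dumps({"price": price})
```

Wrapping the conversion in try...except keeps one malformed value from stopping a long-running collection.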

  1. Input text (including batch text inputs) can also use the keyword eval("Python code") to input values generated dynamically by the Python program during task execution. Similarly, the keyword JS("return JavaScript code") inputs text generated dynamically by JavaScript (the JavaScript code cannot contain line breaks). For example, to input "current month/2023", use JS("return new Date().getMonth()+1")/2023, which inputs 12/2023 when the task runs in December 2023.

  2. Handling Multi-Layer Nested iframes: The experience is the same as with no iframes, but XPath should be set to locate elements only within the specified iframe. Thus, a generic XPath like //body will only target the body tag of the first iframe layer.

  3. Post-Data Extraction Paging Prompt: After designing a data extraction operation, the browser console will suggest whether to add paging. Specifying the paging button location automatically generates a data extraction operation with paging functionality.

  4. Batch Text Input Feature: Automatically generates a loop operation with a text list.

  5. Option to Store Extracted Data as a New Row: If set to 'no', the data isn't stored as a new row but temporarily held until another data extraction operation creates a new row. This is suitable for linked list scenarios: Issue #35, Issue #189

  6. Pause Function in Custom Operations: Allows pausing the program, useful when a captcha or other interactive page appears.

  7. Refresh Page Function in Custom Operations.

  8. Send Email Feature in Custom Operations.

  9. Alert Dialog Handling in Click Element Operations: Choose to accept or dismiss alerts.

  10. Optimizations for Parallel Execution: For browser executions with user information, the user directory is now copied before execution to solve parallel execution issues. Multiple task executions or command line programs can be run in parallel. After task completion, the copied user information folder is automatically deleted (if the program is exited manually, delete the TempUserDataFolder directory yourself).

  11. Automatic Operation Naming: Operations are automatically named based on the scenario, eliminating the need to manually rename operations. Examples include default names for click and move operations based on the text value of the element, loop operations named according to loop type, and automatic renaming when switching custom operations/loops/conditional branches.

  12. Single Element Loop Optimization: For loops like continuously clicking a pagination button, the unchanged content check can be limited to a single element instead of the entire page.

  13. Default File Download Location: Now set to the task folder.

  14. New Conditional Branches Added to the Right Side.

  15. Right-Click Menu in Flowchart: Enables trial run (debug run), copy, cut, delete elements, and adjust the order of conditional branches.

  16. A close hint is now shown at the bottom right of the operation prompt box, which is useful when a QR code is occluded during login. Click the "×" at the bottom right to close the operation panel.

  17. Custom Pause/Control Keys When Saving Tasks: Different programs can use different keys to pause/continue.

  18. Maximize Browser Window Option When Saving Tasks.

  19. Data Overwrite Mode When Writing Data: Each execution of the same task ID will delete the original file and recollect data (requires static file name setting).

  20. MySQL Database Writing: When encountering duplicate data, ignore and continue running. Suitable for scenarios where inserting duplicate data is undesirable (requires setting the database table's primary key to specific fields; otherwise, by EasySpider's design, the primary key is an auto-increment ID, so no row is ever treated as a duplicate).

  21. Base64 Image Download: Handles images that require login for download (not always effective).

  22. Enhanced Exception Handling: Prevents accidental interruptions during collection; retries in case of interruption, bug fixes for history rollback.

  23. Support for automatic word wrapping of extracted data fields, such as automatic line breaks when collecting long articles.

  24. In user information mode, the browser window now remembers its position from the last task design session, instead of splitting the screen equally with the flowchart every time.

  25. Clicking on an element now supports clicking based on coordinates, which is useful for scenarios where you need to click in an empty space to close some dialog or popup window. For example, if the coordinates of the space are (10, 10), you would write point(10, 10) in the element's XPath field to represent a click at the web page coordinates (10, 10).

  26. You can choose whether to remove duplicate data after data collection is completed. Note that deduplication runs at the end of the task, so exiting in the middle of task execution will prevent it!

  27. When looping through a non-fixed/fixed cycle list, text list, or URL list, you can now set an option to skip the first n iterations. This is useful when a task was interrupted midway and you do not want to restart from the beginning (official version).

  28. When executing a task, you can manually specify the task ID. In this case, clicking the "Execute Directly" or "Get ID" button does not generate a new task ID but uses the specified ID instead. If the specified task ID already exists, the task's invocation file is overwritten. This is suitable when, after modifying the task workflow, you want to keep appending to files in the original task ID folder rather than start with a new task ID.

  29. ddddocr Library Upgrade.

  30. UI Update.

  31. Chrome Browser Upgrade to Version 120.
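The point(x, y) syntax from the coordinate-click item above could be recognized with a small parser such as the one below. This is a sketch under the assumption that the XPath field holds exactly one point(...) expression with numeric arguments; `parse_point` is a hypothetical helper, not part of EasySpider.

```python
import re

# Matches "point(x, y)" with optional whitespace and numeric coordinates.
_POINT = re.compile(
    r"^\s*point\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(xpath_field):
    """Return (x, y) as floats if the field is a point(...) expression, else None."""
    match = _POINT.match(xpath_field)
    if match is None:
        return None  # treat the field as an ordinary XPath
    return float(match.group(1)), float(match.group(2))


coords = parse_point("point(10, 10)")  # coordinate click
plain = parse_point("//body")          # ordinary XPath, not a coordinate click
```

Returning None for anything that is not a point(...) expression lets the caller fall back to normal XPath handling.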
