Web Vision Agent Documentation

Overview

The Web Vision Agent is a web browsing automation tool that combines visual understanding with web interaction. It uses the smolagents library together with a vision language model to navigate web pages and interact with them.

Architecture

Core Components

  1. Vision Callback (vision_callback.py)

    • Handles screenshot capture after each agent action
    • Processes and stores visual information for the agent
    • Updates agent observations with current URL information
  2. Browser Tools (browser_tools.py)

    • Provides a suite of web interaction tools:
      • Navigation: navigate_to(), go_back()
      • Interaction: click_element(), scroll_page()
      • Search: search_item_ctrl_f(), get_text_content()
      • Utility: close_popups(), wait_for_load()
  3. Agent Creation (agent.py)

    • Creates and configures the web vision agent
    • Sets up vision model (Qwen2-VL-72B by default)
    • Configures browsing tools and callbacks
    • Provides system prompts for agent behavior
  4. Web Interface (app.py & templates/)

    • Flask-based web interface for controlling the agent
    • Allows users to input tasks and view results
    • Displays agent responses in real-time
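
As an illustration of the vision callback described in component 1, the following is a minimal sketch and not the contents of vision_callback.py. The `Step` class is a hypothetical stand-in for the agent's per-action record, and `driver` is assumed to be a Selenium WebDriver, which exposes `get_screenshot_as_png()` and `current_url`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for the agent's per-action step record."""
    observations: Optional[str] = None
    screenshots: List[bytes] = field(default_factory=list)


def vision_callback(step: Step, driver) -> None:
    """Attach a screenshot and the current URL to the step.

    Assumption: `driver` behaves like a Selenium WebDriver, i.e. it has
    `get_screenshot_as_png()` and `current_url`.
    """
    # Capture the page as PNG bytes so the vision model can inspect it.
    step.screenshots = [driver.get_screenshot_as_png()]

    # Append the current URL to whatever the step has already observed.
    url_info = f"Current url: {driver.current_url}"
    if step.observations:
        step.observations = f"{step.observations}\n{url_info}"
    else:
        step.observations = url_info
```

Keeping only the latest screenshot on each step is a design choice of this sketch; the real callback may store or prune images differently.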

Setup and Configuration

Environment Variables

  • FIREWORKS_API_KEY: Required for the vision language model
  • FLASK_SECRET_KEY: Used for Flask session security
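
A minimal sketch of reading these two variables at startup is shown below. The function name `load_settings` and the returned dictionary keys are hypothetical, and the sketch reads from `os.environ` only; since python-dotenv is listed as a dependency, the application may also load a .env file first, which is not shown here.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented environment variables (hypothetical helper).

    FIREWORKS_API_KEY is documented as required, so a missing value is an
    error. How the app treats a missing FLASK_SECRET_KEY is not documented,
    so this sketch simply returns None for it.
    """
    env = os.environ if env is None else env

    api_key = env.get("FIREWORKS_API_KEY")
    if not api_key:
        raise RuntimeError("FIREWORKS_API_KEY is not set")

    return {
        "fireworks_api_key": api_key,
        "flask_secret_key": env.get("FLASK_SECRET_KEY"),
    }
```

Failing at startup when the API key is missing gives a clearer error than a rejected model request later on.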

Dependencies

helium
selenium
smolagents
flask
python-dotenv
Pillow

Usage Instructions

  1. Starting the Application

    python app.py

    This starts the Flask server on port 5000.

  2. Using the Web Interface

    • Navigate to http://localhost:5000
    • Enter your task in the input field
    • Click "Run Agent" to execute the task
    • View results in the result area below
  3. Example Tasks

    "Go to huggingface.co and tell me what's on the homepage"
    "Navigate to GitHub trending and find the top repository"
    "Search for a specific topic on Wikipedia and summarize the content"
    

Browser Tools Reference

Navigation

navigate_to("website.com")  # Navigate to a URL
go_back()  # Go back one page
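
The example above passes a bare host name to navigate_to(). How the tool turns that into a full URL is not documented here; the sketch below shows one plausible normalization step. The helper name `normalize_url` and the default of `https` are assumptions.

```python
from urllib.parse import urlsplit


def normalize_url(url: str, default_scheme: str = "https") -> str:
    """Return `url` with a scheme, adding `default_scheme` if none is given.

    Hypothetical helper: the real navigate_to() may behave differently.
    """
    url = url.strip()
    if not url:
        raise ValueError("URL must not be empty")

    # "website.com" has no scheme, so prepend the default one.
    if "://" not in url:
        url = f"{default_scheme}://{url}"

    # Reject inputs that still have no host part, such as "https:///path".
    if not urlsplit(url).netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return url
```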

Page Interaction

click_element("Sign up")  # Click any element
click_element("Learn more", element_type="link")  # Click a link
scroll_page("down", pixels=800)  # Scroll down
scroll_page("up", pixels=800)  # Scroll up
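
Scrolling by a pixel count is commonly implemented by running `window.scrollBy` in the page. The sketch below illustrates that approach and is not the code in browser_tools.py. The helper name `scroll_by` and the explicit `driver` argument are assumptions; `execute_script` is the Selenium WebDriver method for running JavaScript.

```python
def scroll_by(driver, direction: str = "down", pixels: int = 800) -> str:
    """Scroll the page vertically by `pixels` (hypothetical helper).

    Assumption: `driver` is a Selenium WebDriver or exposes the same
    `execute_script(script)` method.
    """
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    if pixels < 0:
        raise ValueError("pixels must be non-negative")

    # A negative vertical offset scrolls towards the top of the page.
    offset = pixels if direction == "down" else -pixels
    driver.execute_script(f"window.scrollBy(0, {offset});")
    return f"Scrolled {direction} by {pixels} pixels"
```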

Content Access

search_item_ctrl_f("text")  # Find text on page
get_text_content()  # Get page text
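
One way a Ctrl+F style search can be built on Selenium is to match elements by their text with XPath and scroll the chosen match into view. The sketch below follows that idea under stated assumptions: `find_on_page`, `xpath_literal`, and the `nth_result` parameter are hypothetical names, and the actual search_item_ctrl_f() may work differently.

```python
def xpath_literal(text: str) -> str:
    """Quote `text` as an XPath string literal, handling embedded quotes."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Text contains both quote types: join the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def find_on_page(driver, text: str, nth_result: int = 1) -> str:
    """Scroll to the nth element whose text contains `text`.

    Assumption: `driver` is a Selenium WebDriver (find_elements and
    execute_script are its real methods; "xpath" is the By.XPATH value).
    """
    if nth_result < 1:
        raise ValueError("nth_result must be at least 1")

    query = f"//*[contains(text(), {xpath_literal(text)})]"
    elements = driver.find_elements("xpath", query)
    if nth_result > len(elements):
        raise LookupError(
            f"Match {nth_result} not found ({len(elements)} matches in total)"
        )

    # Bring the selected match into the visible part of the page.
    driver.execute_script(
        "arguments[0].scrollIntoView(true);", elements[nth_result - 1]
    )
    return f"Found {len(elements)} matches for {text!r}; scrolled to match {nth_result}."
```

Quoting the search text through `xpath_literal` avoids a broken query when the text itself contains quote characters.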

Utility Functions

close_popups()  # Close modal windows
wait_for_load(5.0)  # Wait for page load
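
Whether wait_for_load(5.0) sleeps for a fixed time or polls the page is not stated here. As an illustration of the polling approach, the sketch below checks `document.readyState` until it reports `complete` or a timeout expires. The helper name `wait_until_loaded` and its parameters are assumptions.

```python
import time


def wait_until_loaded(driver, timeout: float = 5.0, poll_interval: float = 0.1) -> bool:
    """Poll document.readyState until the page is loaded or time runs out.

    Assumption: `driver` is a Selenium WebDriver or exposes the same
    `execute_script(script)` method. Returns True if the page reached
    the "complete" state, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A `complete` ready state does not guarantee that JavaScript-rendered content has appeared, so dynamic pages may still need an extra wait.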

Vision Capabilities

The agent uses screenshots to:

  • Understand page layout and content
  • Verify successful navigation
  • Identify clickable elements
  • Read and extract text content
  • Process visual information for decision making

Best Practices

  1. Task Writing

    • Be specific about what you want the agent to do
    • Break complex tasks into smaller steps
    • Include clear success criteria
  2. Performance

    • Allow time for page loads between actions
    • Use wait_for_load() after navigation
    • Verify results through screenshots
  3. Error Handling

    • The agent will automatically retry failed actions
    • Pop-ups are handled automatically
    • Navigation errors are reported clearly
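
How the agent retries failed actions is not described in this document. For code that calls browser actions directly, a generic retry wrapper can be sketched as follows; the helper name `retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    Hypothetical helper, not the agent's own retry logic. The exception
    from the final failed attempt is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            # Give the page time to settle before the next attempt.
            time.sleep(delay)
```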

Limitations

  1. Authentication

    • Cannot handle login-required pages
    • Avoid tasks requiring authentication
  2. Dynamic Content

    • May need extra wait time for JavaScript content
    • Some dynamic elements might not be immediately visible
  3. Browser Support

    • Uses the Chrome/Chromium browser
    • Requires a graphical environment

Troubleshooting

  1. Common Issues

    • Elements not found: Use wait_for_load() or increase wait time
    • Pop-ups blocking: Try close_popups()
    • Page not loading: Check URL format and connectivity
  2. Debug Tips

    • Check browser screenshots for visual state
    • Review current URL in observations
    • Use get_text_content() to verify page content

Development

To extend the agent's capabilities:

  1. Add new tools to browser_tools.py
  2. Update agent system prompt in agent.py
  3. Add new routes to app.py for additional features
  4. Enhance frontend in templates/index.html
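
As an illustration of step 1, a new tool can be as small as a function that reads one property from the browser. The tool `get_page_title` below is hypothetical. It takes the driver as an explicit argument to stay self-contained, whereas the existing tools are called without one, and registering it with the agent depends on how browser_tools.py and agent.py are structured.

```python
def get_page_title(driver) -> str:
    """Return the title of the page currently open in the browser.

    Hypothetical tool. Assumption: `driver` is a Selenium WebDriver,
    whose `title` attribute holds the text of the page's <title>.
    """
    title = driver.title
    return title if title else "(page has no title)"
```

Returning a short, descriptive string rather than None gives the agent something readable to reason about in its observations.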

Security Notes

  • API keys should be kept secure in environment variables
  • The agent runs in a controlled environment
  • Avoid sending sensitive information through the interface
  • The browser runs with the Chrome sandbox disabled for compatibility, which reduces isolation between visited pages and the host