Web Vision Agent Documentation

Overview

The Web Vision Agent is a browser automation tool that pairs a vision language model with web interaction tools. Built on the smolagents library, it uses screenshots of each page to decide how to navigate and interact with it.

Architecture

Core Components

Vision Callback (vision_callback.py)
- Handles screenshot capture after each agent action
- Processes and stores visual information for the agent
- Updates agent observations with current URL information
Browser Tools (browser_tools.py)
- Provides a suite of web interaction tools:
  - Navigation: navigate_to(), go_back()
  - Interaction: click_element(), scroll_page()
  - Search: search_item_ctrl_f(), get_text_content()
  - Utility: close_popups(), wait_for_load()
Agent Creation (agent.py)
- Creates and configures the web vision agent
- Sets up the vision model (Qwen2-VL-72B by default)
- Configures browsing tools and callbacks
- Provides system prompts for agent behavior
Web Interface (app.py & templates/)
- Flask-based web interface for controlling the agent
- Allows users to input tasks and view results
- Displays agent responses in real-time
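
The source of vision_callback.py is not reproduced in this document, so the following is only a sketch of the behaviour described above: capture a screenshot after an action and append the current URL to the step's observations. It assumes a Selenium-style driver (get_screenshot_as_png() and current_url are real WebDriver members); the Step class and the record_observation name are hypothetical stand-ins for the agent's own step object and callback.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for the agent's per-action memory step."""
    number: int
    observations: Optional[str] = None
    screenshots: List[bytes] = field(default_factory=list)


def record_observation(step: Step, driver) -> None:
    """Attach a screenshot and the current URL to the step.

    `driver` is assumed to behave like a Selenium WebDriver.
    """
    # Selenium drivers return the screenshot as raw PNG bytes.
    step.screenshots = [driver.get_screenshot_as_png()]

    # Append the URL so the model knows where it is after the action.
    url_info = f"Current url: {driver.current_url}"
    if step.observations is None:
        step.observations = url_info
    else:
        step.observations += "\n" + url_info
```

In the project itself the callback would presumably be registered with the agent so that it runs after every action; how agent.py wires that up is not shown here.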

Setup and Configuration

Environment Variables

- FIREWORKS_API_KEY: Required for the vision language model
- FLASK_SECRET_KEY: Used for Flask session security
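
One way to read these variables and fail early when the API key is missing is sketched below. The function name load_settings and the returned dictionary keys are illustrative, not taken from the project. Since python-dotenv is a dependency, the app may well call load_dotenv() first to populate the environment from a .env file, but that is an assumption.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the two documented variables, failing fast if the API key is absent."""
    env = os.environ if env is None else env

    api_key = env.get("FIREWORKS_API_KEY")
    if not api_key:
        raise RuntimeError("FIREWORKS_API_KEY is not set")

    # Assumption: a missing FLASK_SECRET_KEY is tolerated and reported as None,
    # leaving the caller to decide whether to refuse to start.
    return {
        "fireworks_api_key": api_key,
        "flask_secret_key": env.get("FLASK_SECRET_KEY"),
    }
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise in tests.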

Dependencies

helium
selenium
smolagents
flask
python-dotenv
Pillow
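
A quick way to confirm that the dependencies are importable before starting the app is sketched below. It uses only the standard library, and the helper name missing_packages is illustrative. Note that two of the packages are imported under a different name than the one they are installed under.

```python
import importlib.util

# Distribution name -> importable module name. The last two differ:
# python-dotenv is imported as `dotenv`, Pillow as `PIL`.
REQUIRED = {
    "helium": "helium",
    "selenium": "selenium",
    "smolagents": "smolagents",
    "flask": "flask",
    "python-dotenv": "dotenv",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the distribution names whose module cannot be found."""
    return [
        dist for dist, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
```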

Usage Instructions

Starting the Application
```
python app.py
```
This starts the Flask server on port 5000.
Using the Web Interface
- Navigate to http://localhost:5000
- Enter your task in the input field
- Click "Run Agent" to execute the task
- View results in the result area below

Example Tasks

"Go to huggingface.co and tell me what's on the homepage"
"Navigate to GitHub trending and find the top repository"
"Search for a specific topic on Wikipedia and summarize the content"

Browser Tools Reference

Navigation

navigate_to("website.com")  # Navigate to a URL
go_back()  # Go back one page
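
The example above passes a bare hostname ("website.com"), which a browser driver normally cannot open without a scheme. Whether navigate_to() adds one itself is not stated here; the sketch below shows one way such normalization could be done. The helper name normalize_url and the https default are assumptions.

```python
from urllib.parse import urlparse


def normalize_url(url: str, default_scheme: str = "https") -> str:
    """Return `url` with a scheme, raising ValueError if no host can be found."""
    url = url.strip()
    if "://" not in url:
        # Bare hostnames such as "website.com" get the default scheme.
        url = f"{default_scheme}://{url}"

    if not urlparse(url).netloc:
        raise ValueError(f"not a usable URL: {url!r}")
    return url
```

For a local server, pass default_scheme="http", for example normalize_url("localhost:5000", default_scheme="http").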

Page Interaction

click_element("Sign up")  # Click any element
click_element("Learn more", element_type="link")  # Click a link
scroll_page("down", pixels=800)  # Scroll down
scroll_page("up", pixels=800)  # Scroll up
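
As a rough illustration of how a scrolling tool like this could be built on a Selenium-style driver, the sketch below maps the direction to a signed pixel offset. The documented scroll_page() takes no driver argument; passing the driver explicitly is an assumption made to keep the example self-contained, and the return message is illustrative.

```python
def scroll_page(driver, direction: str, pixels: int = 800) -> str:
    """Scroll the window vertically and describe what was done."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    if pixels <= 0:
        raise ValueError("pixels must be positive")

    # window.scrollBy takes a signed offset: negative values scroll up.
    offset = pixels if direction == "down" else -pixels
    driver.execute_script("window.scrollBy(0, arguments[0]);", offset)
    return f"Scrolled {direction} by {pixels} pixels"
```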

Content Access

search_item_ctrl_f("text")  # Find text on page
get_text_content()  # Get page text
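
The two tools can be combined conceptually: once the page text is available, a Ctrl+F style search is a case-insensitive scan that returns each match with some surrounding context. The sketch below works on a plain string and is not the project's implementation of search_item_ctrl_f(); the name find_in_text and the context parameter are hypothetical.

```python
import re
from typing import List


def find_in_text(page_text: str, query: str, context: int = 30) -> List[str]:
    """Return a snippet around every case-insensitive match of `query`."""
    if not query:
        return []

    # re.escape keeps characters such as "." or "?" in the query literal.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [
        page_text[max(0, m.start() - context):m.end() + context].strip()
        for m in pattern.finditer(page_text)
    ]
```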

Utility Functions

close_popups()  # Close modal windows
wait_for_load(5.0)  # Wait for page load
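
A common way to implement a load wait with a Selenium-style driver is to poll document.readyState until it reports "complete". The sketch below assumes the 5.0 in the example is a timeout in seconds. The explicit driver argument, the poll interval, and the boolean return value are all assumptions, not the project's actual signature.

```python
import time


def wait_for_load(driver, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll document.readyState until it is 'complete' or `timeout` elapses.

    Returns True if the page finished loading in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A "complete" ready state only means the initial document and its resources have loaded. Content rendered later by JavaScript may still be missing, so dynamic pages can need a further wait.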

Vision Capabilities

The agent uses screenshots to:

- Understand page layout and content
- Verify successful navigation
- Identify clickable elements
- Read and extract text content
- Process visual information for decision making

Best Practices

Task Writing
- Be specific about what you want the agent to do
- Break complex tasks into smaller steps
- Include clear success criteria
Performance
- Allow time for page loads between actions
- Use wait_for_load() after navigation
- Verify results through screenshots
Error Handling
- The agent will automatically retry failed actions
- Pop-ups are handled automatically
- Navigation errors are reported clearly
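
How the agent retries failed actions is not described in this document. For scripts that call the browser tools directly, a generic retry wrapper along the following lines can serve a similar purpose. The name with_retries and its defaults are illustrative only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay)  # Give the page time to settle before retrying.
```

For example, with_retries(lambda: click_element("Sign up"), attempts=3) would repeat the click up to three times, pausing one second between attempts.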

Limitations

Authentication
- Cannot handle login-required pages
- Avoid tasks requiring authentication
Dynamic Content
- May need extra wait time for JavaScript content
- Some dynamic elements might not be immediately visible
Browser Support
- Uses Chrome/Chromium browser
- Requires graphical environment

Troubleshooting

Common Issues
- Elements not found: Use wait_for_load() or increase wait time
- Pop-ups blocking: Try close_popups()
- Page not loading: Check URL format and connectivity
Debug Tips
- Check browser screenshots for visual state
- Review current URL in observations
- Use get_text_content() to verify page content

Development

To extend the agent's capabilities:

- Add new tools to browser_tools.py
- Update agent system prompt in agent.py
- Add new routes to app.py for additional features
- Enhance frontend in templates/index.html
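
As an illustration of the first step, the sketch below shows what a new tool function might look like. The tool itself (list_links) is hypothetical and uses only the standard library. smolagents tools are typically plain functions wrapped with its @tool decorator, which expects type hints and an Args section in the docstring; the decorator is left out here so the example runs without the library installed.

```python
from html.parser import HTMLParser
from typing import List


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# In the project this function would presumably carry smolagents' @tool
# decorator and be added to the tool list configured in agent.py.
def list_links(page_html: str) -> str:
    """Lists the link targets found in an HTML page, one per line.

    Args:
        page_html: Raw HTML source of the page to inspect.
    """
    collector = _LinkCollector()
    collector.feed(page_html)
    return "\n".join(collector.links)
```

Returning a string rather than a list keeps the output easy for the model to read as an observation.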

Security Notes

- API keys should be kept secure in environment variables
- The agent runs in a controlled environment
- Avoid sending sensitive information through the interface
- The browser runs without its sandbox for compatibility, so only point the agent at sites you trust
