The Web Vision Agent is a web browsing automation tool that combines visual understanding capabilities with web interaction. It uses the smolagents library along with vision language models to navigate and interact with web pages intelligently.
- Vision Callback (vision_callback.py)
  - Handles screenshot capture after each agent action
  - Processes and stores visual information for the agent
  - Updates agent observations with current URL information
- Browser Tools (browser_tools.py)
  - Provides a suite of web interaction tools:
    - Navigation: navigate_to(), go_back()
    - Interaction: click_element(), scroll_page()
    - Search: search_item_ctrl_f(), get_text_content()
    - Utility: close_popups(), wait_for_load()
- Agent Creation (agent.py)
  - Creates and configures the web vision agent
  - Sets up vision model (Qwen2-VL-72B by default)
  - Configures browsing tools and callbacks
  - Provides system prompts for agent behavior
- Web Interface (app.py & templates/)
  - Flask-based web interface for controlling the agent
  - Allows users to input tasks and view results
  - Displays agent responses in real-time
- FIREWORKS_API_KEY: Required for the vision language model
- FLASK_SECRET_KEY: Used for Flask session security
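A startup check can catch a missing variable before the agent or the Flask app is created. The sketch below is illustrative rather than the project's own code: the helper name `missing_env_vars` is hypothetical, and treating FLASK_SECRET_KEY as mandatory is an assumption (the list above only marks FIREWORKS_API_KEY as required).

```python
import os

# Variables named in this README. Treating both as mandatory is an assumption.
EXPECTED_VARS = ("FIREWORKS_API_KEY", "FLASK_SECRET_KEY")


def missing_env_vars(env=None):
    """Return the expected variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` at the top of app.py and aborting on a non-empty result would surface configuration problems early. If a .env file is used, it would need to be loaded (for example with python-dotenv) before the check runs.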
helium
selenium
smolagents
flask
python-dotenv
Pillow
- Starting the Application
python app.py
This starts the Flask server on port 5000.
- Using the Web Interface
  - Navigate to http://localhost:5000
  - Enter your task in the input field
  - Click "Run Agent" to execute the task
  - View results in the result area below
- Example Tasks
  - "Go to huggingface.co and tell me what's on the homepage"
  - "Navigate to GitHub trending and find the top repository"
  - "Search for a specific topic on Wikipedia and summarize the content"
navigate_to("website.com") # Navigate to a URL
go_back() # Go back one page
click_element("Sign up") # Click any element
click_element("Learn more", element_type="link") # Click a link
scroll_page("down", pixels=800) # Scroll down
scroll_page("up", pixels=800) # Scroll up
search_item_ctrl_f("text") # Find text on page
get_text_content() # Get page text
close_popups() # Close modal windows
wait_for_load(5.0) # Wait for page load
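To give a sense of what one of these tools might look like internally, here is a minimal sketch of a scroll helper. It is not the code in browser_tools.py: it takes the browser driver as an explicit argument (the real tool is called without one), and it assumes only that the driver exposes Selenium's `execute_script()` method.

```python
def scroll_page(driver, direction="down", pixels=800):
    """Scroll the current page vertically and return a short status message.

    `driver` is assumed to behave like a Selenium WebDriver, i.e. it has an
    execute_script(script) method that runs JavaScript in the page.
    """
    if direction not in ("down", "up"):
        raise ValueError("direction must be 'down' or 'up'")
    if pixels <= 0:
        raise ValueError("pixels must be positive")

    # Positive offsets scroll down, negative offsets scroll up.
    offset = pixels if direction == "down" else -pixels
    driver.execute_script(f"window.scrollBy(0, {offset});")
    return f"Scrolled {direction} by {pixels} pixels"
```

Returning a plain-text message lets the agent read the outcome of the action as an observation on its next step.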
The agent uses screenshots to:
- Understand page layout and content
- Verify successful navigation
- Identify clickable elements
- Read and extract text content
- Process visual information for decision making
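The screenshot step can be pictured as a callback that runs after each action, attaches an image of the page, and notes the current URL. The sketch below is an assumption-laden illustration, not the contents of vision_callback.py: `StepRecord` is a hypothetical stand-in for the agent's per-step memory object, and the driver is assumed to offer Selenium's `get_screenshot_as_png()` method and `current_url` property.

```python
class StepRecord:
    """Hypothetical stand-in for the agent's per-step memory object."""

    def __init__(self, step_number, observations=None):
        self.step_number = step_number
        self.observations = observations  # free-text notes shown to the model
        self.screenshots = []             # raw PNG bytes for this step


def capture_step(step, driver):
    """Attach a fresh screenshot and the current URL to `step`."""
    # Selenium returns the screenshot as PNG-encoded bytes.
    step.screenshots = [driver.get_screenshot_as_png()]

    # Append the URL to any observation text the step already carries.
    url_note = f"Current URL: {driver.current_url}"
    if step.observations:
        step.observations = f"{step.observations}\n{url_note}"
    else:
        step.observations = url_note
    return step
```

Keeping only the latest screenshot per step is a design choice made here to bound memory use; a real callback might also decode the bytes into an image object before handing them to the model.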
- Task Writing
  - Be specific about what you want the agent to do
  - Break complex tasks into smaller steps
  - Include clear success criteria
- Performance
  - Allow time for page loads between actions
  - Use wait_for_load() after navigation
  - Verify results through screenshots
- Error Handling
  - The agent will automatically retry failed actions
  - Pop-ups are handled automatically
  - Navigation errors are reported clearly
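This README does not say how the agent retries failed actions. For scripts that call the tools directly, one generic pattern is a small retry wrapper with a pause between attempts. The helper below is a hypothetical sketch of that pattern, not the agent's internal mechanism.

```python
import time


def run_with_retries(action, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` is exhausted.

    `sleep` is injectable so the pause can be replaced in tests.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # give the page time to settle before retrying
    raise last_error
```

A call such as `run_with_retries(lambda: click_element("Sign up"))` would retry a click that fails because the page has not finished rendering.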
- Authentication
  - Cannot handle login-required pages
  - Avoid tasks requiring authentication
- Dynamic Content
  - May need extra wait time for JavaScript content
  - Some dynamic elements might not be immediately visible
- Browser Support
  - Uses Chrome/Chromium browser
  - Requires graphical environment
- Common Issues
  - Elements not found: Use wait_for_load() or increase wait time
  - Pop-ups blocking: Try close_popups()
  - Page not loading: Check URL format and connectivity
- Debug Tips
  - Check browser screenshots for visual state
  - Review current URL in observations
  - Use get_text_content() to verify page content
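For the "check URL format" step, a small validator can rule out malformed input before blaming connectivity. The function below is a hypothetical helper, not part of the project. It assumes that a bare host such as website.com should default to https, which may differ from how navigate_to() treats it.

```python
from urllib.parse import urlparse


def normalize_url(raw, default_scheme="https"):
    """Return a URL with an explicit http(s) scheme, or raise ValueError."""
    url = raw.strip()
    if not url:
        raise ValueError("URL is empty")

    # Bare hosts such as "website.com" get the default scheme prepended.
    if "://" not in url:
        url = f"{default_scheme}://{url}"

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a usable web URL: {raw!r}")
    return url
```

If this raises, the problem is the URL itself; if it returns cleanly and the page still fails to load, connectivity is the next thing to check.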
To extend the agent's capabilities:
- Add new tools to browser_tools.py
- Update agent system prompt in agent.py
- Add new routes to app.py for additional features
- Enhance frontend in templates/index.html
- API keys should be kept secure in environment variables
- The agent runs in a controlled environment
- Avoid sending sensitive information through the interface
- Browser runs without sandbox for compatibility