A sophisticated web browsing AI agent with advanced multi-model visual understanding capabilities. This agent can autonomously navigate web pages, understand visual content, and perform complex web-based tasks using state-of-the-art vision language models.
- 🖼️ Visual Web Understanding
- 🌐 Autonomous Web Navigation
- 🔄 Multi-Stage AI Processing
- 🎯 Task-Oriented Interaction
- 🔍 Advanced Content Analysis
**Vision Language Models**

- **Primary Model: Qwen2-VL-72B**
  - Used for: Visual understanding, task processing, and decision making
  - Capabilities: Processes screenshots, understands page layouts, reads text content
- **Secondary Model: Dobby Unhinged (Optional)**
  - Used for: Response refinement and personality injection
  - Adds a unique, engaging tone to agent responses
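How these two models are wired up is not shown in this README; below is a minimal sketch, assuming both are served through Fireworks' OpenAI-compatible endpoint and wrapped with smolagents' `OpenAIServerModel`. The Qwen2-VL model id is taken from the API example later in this document; the Dobby model id is a placeholder, not the real identifier.

```python
import os

from smolagents import OpenAIServerModel

# Primary vision model (model id taken from the API example below).
qwen_model = OpenAIServerModel(
    model_id="accounts/fireworks/models/qwen2-vl-72b-instruct",
    api_base="https://api.fireworks.ai/inference/v1",  # Fireworks' OpenAI-compatible endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Optional secondary model used only to restyle the final answer.
# The model id below is a placeholder -- substitute the actual Dobby Unhinged id.
dobby_model = OpenAIServerModel(
    model_id="accounts/fireworks/models/your-dobby-unhinged-model-id",  # hypothetical
    api_base="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)
```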
- **Browser Tools** (`browser_tools.py`; see the example sketch after this list)
  - Navigation: `navigate_to()`, `go_back()`
  - Interaction: `click_element()`, `scroll_page()`
  - Search: `search_item_ctrl_f()`, `get_text_content()`
  - Utility: `close_popups()`, `wait_for_load()`
- **Vision Callback** (`vision_callback.py`)
  - Manages screenshot capture
  - Processes visual information
  - Updates agent observations
- **Web Interface** (`app.py`)
  - Flask-based API endpoints
  - Real-time agent interaction
  - Task execution and monitoring
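The browser tools listed above are presumably thin wrappers around helium/Selenium calls. Here is a minimal sketch of how two of them might be written as smolagents tools; the actual implementations in `browser_tools.py` may differ in names, signatures, and return values.

```python
import helium
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from smolagents import tool


@tool
def navigate_to(url: str) -> str:
    """Navigates the browser to the given URL.

    Args:
        url: Full URL of the page to open, e.g. "https://huggingface.co".
    """
    helium.go_to(url)
    return f"Navigated to {url}"


@tool
def close_popups() -> str:
    """Closes any visible modal or pop-up by sending the Escape key to the page."""
    driver = helium.get_driver()
    ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent ESC to dismiss pop-ups"
```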
Install the dependencies:

```bash
pip install "smolagents[all]" helium selenium python-dotenv flask flask-cors
```
Set the required environment variables:

```
FIREWORKS_API_KEY=your_fireworks_api_key
FLASK_SECRET_KEY=your_secret_key
```
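Since `python-dotenv` is among the dependencies, these values are presumably read from a `.env` file at startup; a minimal sketch of how that might be done:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads FIREWORKS_API_KEY and FLASK_SECRET_KEY from .env into the environment

fireworks_api_key = os.environ["FIREWORKS_API_KEY"]
flask_secret_key = os.environ["FLASK_SECRET_KEY"]
```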
- Start the server: `python app.py`
- Access the interface at `http://localhost:5000`
- Select your preferred model:
  - Qwen2-VL-72B (Default): Best for detailed visual analysis
  - Dobby Unhinged: Adds personality to responses
"Go to huggingface.co and tell me what's on the homepage"
"Navigate to GitHub trending and find the top repository"
"Search for specific content and analyze the visual layout"
Executes the web vision agent on the specified task.
Request Body:

```json
{
  "task": "Your task description",
  "model_id": "accounts/fireworks/models/qwen2-vl-72b-instruct",
  "max_steps": 20
}
```

Response:

```json
{
  "success": true,
  "result": "Task execution result"
}
```
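A hedged example of calling this endpoint from Python with `requests` (not listed in the install command above, so install it separately if needed). The `/run_agent` path is an assumption; substitute whatever route `app.py` actually defines.

```python
import requests

# NOTE: "/run_agent" is a placeholder path -- use the route actually defined in app.py.
response = requests.post(
    "http://localhost:5000/run_agent",
    json={
        "task": "Go to huggingface.co and tell me what's on the homepage",
        "model_id": "accounts/fireworks/models/qwen2-vl-72b-instruct",
        "max_steps": 20,
    },
    timeout=600,  # browsing tasks can take several minutes
)
response.raise_for_status()
print(response.json()["result"])
```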
Returns the current status of the agent system.
- **Task Input**
  - User submits a task through the web interface
  - Task is processed by the selected model
- **Visual Processing**
  - Agent navigates to the webpage
  - Screenshots are captured after each action
  - Qwen2-VL analyzes the visual content
- **Decision Making**
  - Model determines next actions based on:
    - Visual analysis
    - Task requirements
    - Previous actions
- **Response Generation**
  - Primary model (Qwen2-VL) generates the base response
  - If Dobby is selected, the response is refined with personality
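Putting the pipeline above together, the wiring might look roughly like the sketch below. It assumes a smolagents `CodeAgent`, a step callback that attaches screenshots (matching the Vision Callback component), and a plain chat-completion pass for the optional Dobby refinement; `qwen_model`, `dobby_model`, and the tools come from the earlier sketches, and all names here are illustrative rather than taken from the repository.

```python
from io import BytesIO

import helium
from PIL import Image
from smolagents import CodeAgent


def capture_screenshot(memory_step, agent) -> None:
    """Step callback: attach a screenshot of the current page to the step's observations."""
    png = helium.get_driver().get_screenshot_as_png()
    memory_step.observations_images = [Image.open(BytesIO(png))]


# qwen_model, dobby_model, navigate_to, close_popups come from the earlier sketches.
agent = CodeAgent(
    tools=[navigate_to, close_popups],    # plus the other browser tools
    model=qwen_model,                     # Qwen2-VL-72B drives navigation decisions
    step_callbacks=[capture_screenshot],  # screenshot after every action
    max_steps=20,
)


def run_task(task: str, use_dobby: bool = False) -> str:
    """Run a browsing task; optionally restyle the final answer with Dobby."""
    base_answer = str(agent.run(task))
    if not use_dobby:
        return base_answer
    # Personality pass: ask the secondary model to rewrite the factual answer.
    refined = dobby_model([{
        "role": "user",
        "content": [{"type": "text",
                     "text": "Rewrite this answer in your own voice, keeping the facts intact:\n"
                             + base_answer}],
    }])
    return refined.content
```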
- Add new tools to `browser_tools.py`
- Update the agent system prompt in `agent.py`
- Add routes to `app.py` for new features
- Enhance the frontend in `templates/index.html`
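As an example of the first extension point, a new tool might look like the sketch below; the tool name and behavior are made up for illustration.

```python
# Hypothetical new tool for browser_tools.py: report the URL the browser is on.
import helium
from smolagents import tool


@tool
def get_current_url() -> str:
    """Returns the URL of the page the browser is currently showing."""
    return helium.get_driver().current_url
```

Registering it is then just a matter of adding it to the agent's `tools` list and, if useful, exposing it through a new route in `app.py`.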
- Use `wait_for_load()` after navigation
- Handle dynamic content with appropriate delays
- Use `close_popups()` for modal windows
- Verify actions through screenshots
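A small illustration of that ordering inside one agent step; the import and the argument lists are assumptions, so check `browser_tools.py` for the real signatures.

```python
# Assumed import of the project's tools; argument values are illustrative.
from browser_tools import navigate_to, wait_for_load, close_popups, get_text_content

navigate_to("https://github.com/trending")
wait_for_load()            # let scripts and lazy-loaded content settle first
close_popups()             # dismiss cookie banners or modals that would block clicks
text = get_text_content()  # only then read or interact with the page
```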
- **Authentication**
  - Cannot handle login-required pages
  - Avoid tasks requiring authentication
- **Dynamic Content**
  - May need extra wait time for JavaScript-rendered content
  - Some dynamic elements might need special handling
- **Browser Support**
  - Uses a Chrome/Chromium browser
  - Requires a graphical environment
- **Elements Not Found**
  - Solution: Use `wait_for_load()` or increase the wait time
  - Check the screenshot to verify element visibility
- **Pop-ups Blocking**
  - Solution: Use `close_popups()`
  - Verify pop-up closure in screenshots
- **Page Loading Issues**
  - Solution: Check the URL format
  - Verify connectivity
  - Increase wait times for heavy pages
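For the "Elements Not Found" case, one hedged pattern is a small retry helper built on the existing tools; `click_with_retry` is hypothetical and not part of the repository.

```python
import time

# Assumed import of the project's tools; exact signatures may differ.
from browser_tools import click_element, wait_for_load


def click_with_retry(target: str, attempts: int = 3, delay: float = 2.0) -> None:
    """Hypothetical helper: retry a click, giving slow or dynamic pages time to render."""
    for _ in range(attempts):
        try:
            click_element(target)
            return
        except Exception:
            wait_for_load()    # wait for late-arriving elements
            time.sleep(delay)  # then back off before the next attempt
    raise RuntimeError(f"Could not click '{target}' after {attempts} attempts")
```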
- API keys must be stored in environment variables
- Browser runs in controlled environment
- Avoid sending sensitive information
- Regular security updates recommended
Feel free to submit issues and enhancement requests!
MIT License - Feel free to use and modify as needed.