A sophisticated web browsing AI agent with advanced multi-model visual understanding capabilities. This agent can autonomously navigate web pages, understand visual content, and perform complex web-based tasks using state-of-the-art vision language models.
- 🖼️ Visual Web Understanding
- 🌐 Autonomous Web Navigation
- 🔄 Multi-Stage AI Processing
- 🎯 Task-Oriented Interaction
- 🔍 Advanced Content Analysis
**Vision Language Models**

- **Primary Model: Qwen2-VL-72B**
  - Used for: Visual understanding, task processing, and decision making
  - Capabilities: Processes screenshots, understands page layouts, reads text content
- **Secondary Model: Dobby Unhinged (Optional)**
  - Used for: Response refinement and personality injection
  - Adds a unique, engaging tone to agent responses
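How these two models are wired up is not shown in this README; below is a minimal sketch, assuming both are served through Fireworks' OpenAI-compatible endpoint and wrapped with smolagents' `OpenAIServerModel`. The Qwen2-VL model id is taken from the API example later in this document; the Dobby model id is a placeholder, not the real identifier.

```python
import os

from smolagents import OpenAIServerModel

# Primary vision model (model id taken from the API example below).
qwen_model = OpenAIServerModel(
    model_id="accounts/fireworks/models/qwen2-vl-72b-instruct",
    api_base="https://api.fireworks.ai/inference/v1",  # Fireworks' OpenAI-compatible endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Optional secondary model used only to restyle the final answer.
# The model id below is a placeholder -- substitute the actual Dobby Unhinged id.
dobby_model = OpenAIServerModel(
    model_id="accounts/fireworks/models/your-dobby-unhinged-model-id",  # hypothetical
    api_base="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)
```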
- **Browser Tools** (`browser_tools.py`; see the example sketch after this list)
  - Navigation: `navigate_to()`, `go_back()`
  - Interaction: `click_element()`, `scroll_page()`
  - Search: `search_item_ctrl_f()`, `get_text_content()`
  - Utility: `close_popups()`, `wait_for_load()`
- **Vision Callback** (`vision_callback.py`)
  - Manages screenshot capture
  - Processes visual information
  - Updates agent observations
- **Web Interface** (`app.py`)
  - Flask-based API endpoints
  - Real-time agent interaction
  - Task execution and monitoring
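The browser tools listed above are presumably thin wrappers around helium/Selenium calls. Here is a minimal sketch of how two of them might be written as smolagents tools; the actual implementations in `browser_tools.py` may differ in names, signatures, and return values.

```python
import helium
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from smolagents import tool


@tool
def navigate_to(url: str) -> str:
    """Navigates the browser to the given URL.

    Args:
        url: Full URL of the page to open, e.g. "https://huggingface.co".
    """
    helium.go_to(url)
    return f"Navigated to {url}"


@tool
def close_popups() -> str:
    """Closes any visible modal or pop-up by sending the Escape key to the page."""
    driver = helium.get_driver()
    ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent ESC to dismiss pop-ups"
```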
Install the dependencies:

```bash
pip install "smolagents[all]" helium selenium python-dotenv flask flask-cors
```
Set the required environment variables:

```
FIREWORKS_API_KEY=your_fireworks_api_key
FLASK_SECRET_KEY=your_secret_key
```
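Since `python-dotenv` is among the dependencies, these values are presumably read from a `.env` file at startup; a minimal sketch of how that might be done:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads FIREWORKS_API_KEY and FLASK_SECRET_KEY from .env into the environment

fireworks_api_key = os.environ["FIREWORKS_API_KEY"]
flask_secret_key = os.environ["FLASK_SECRET_KEY"]
```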
- Start the server: `python app.py`
- Access the interface at `http://localhost:5000`
- Select your preferred model:
  - Qwen2-VL-72B (Default): Best for detailed visual analysis
  - Dobby Unhinged: Adds personality to responses
"Go to huggingface.co and tell me what's on the homepage"
"Navigate to GitHub trending and find the top repository"
"Search for specific content and analyze the visual layout"
Executes the web vision agent on the specified task.
Request Body:

```json
{
  "task": "Your task description",
  "model_id": "accounts/fireworks/models/qwen2-vl-72b-instruct",
  "max_steps": 20
}
```

Response:

```json
{
  "success": true,
  "result": "Task execution result"
}
```
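A hedged example of calling this endpoint from Python with `requests` (not listed in the install command above, so install it separately if needed). The `/run_agent` path is an assumption; substitute whatever route `app.py` actually defines.

```python
import requests

# NOTE: "/run_agent" is a placeholder path -- use the route actually defined in app.py.
response = requests.post(
    "http://localhost:5000/run_agent",
    json={
        "task": "Go to huggingface.co and tell me what's on the homepage",
        "model_id": "accounts/fireworks/models/qwen2-vl-72b-instruct",
        "max_steps": 20,
    },
    timeout=600,  # browsing tasks can take several minutes
)
response.raise_for_status()
print(response.json()["result"])
```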
Returns the current status of the agent system.
- **Task Input**
  - User submits a task through the web interface
  - Task is processed by the selected model
- **Visual Processing**
  - Agent navigates to the webpage
  - Screenshots are captured after each action
  - Qwen2-VL analyzes the visual content
- **Decision Making**
  - Model determines next actions based on:
    - Visual analysis
    - Task requirements
    - Previous actions
- **Response Generation**
  - Primary model (Qwen2-VL) generates the base response
  - If Dobby is selected, the response is refined with personality
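Putting the pipeline above together, the wiring might look roughly like the sketch below. It assumes a smolagents `CodeAgent`, a step callback that attaches screenshots (matching the Vision Callback component), and a plain chat-completion pass for the optional Dobby refinement; `qwen_model`, `dobby_model`, and the tools come from the earlier sketches, and all names here are illustrative rather than taken from the repository.

```python
from io import BytesIO

import helium
from PIL import Image
from smolagents import CodeAgent


def capture_screenshot(memory_step, agent) -> None:
    """Step callback: attach a screenshot of the current page to the step's observations."""
    png = helium.get_driver().get_screenshot_as_png()
    memory_step.observations_images = [Image.open(BytesIO(png))]


# qwen_model, dobby_model, navigate_to, close_popups come from the earlier sketches.
agent = CodeAgent(
    tools=[navigate_to, close_popups],    # plus the other browser tools
    model=qwen_model,                     # Qwen2-VL-72B drives navigation decisions
    step_callbacks=[capture_screenshot],  # screenshot after every action
    max_steps=20,
)


def run_task(task: str, use_dobby: bool = False) -> str:
    """Run a browsing task; optionally restyle the final answer with Dobby."""
    base_answer = str(agent.run(task))
    if not use_dobby:
        return base_answer
    # Personality pass: ask the secondary model to rewrite the factual answer.
    refined = dobby_model([{
        "role": "user",
        "content": [{"type": "text",
                     "text": "Rewrite this answer in your own voice, keeping the facts intact:\n"
                             + base_answer}],
    }])
    return refined.content
```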
- Add new tools to `browser_tools.py`
- Update the agent system prompt in `agent.py`
- Add routes to `app.py` for new features
- Enhance the frontend in `templates/index.html`
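As an example of the first extension point, a new tool might look like the sketch below; the tool name and behavior are made up for illustration.

```python
# Hypothetical new tool for browser_tools.py: report the URL the browser is on.
import helium
from smolagents import tool


@tool
def get_current_url() -> str:
    """Returns the URL of the page the browser is currently showing."""
    return helium.get_driver().current_url
```

Registering it is then just a matter of adding it to the agent's `tools` list and, if useful, exposing it through a new route in `app.py`.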
- Use `wait_for_load()` after navigation
- Handle dynamic content with appropriate delays
- Use `close_popups()` for modal windows
- Verify actions through screenshots
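A small illustration of that ordering inside one agent step; the import and the argument lists are assumptions, so check `browser_tools.py` for the real signatures.

```python
# Assumed import of the project's tools; argument values are illustrative.
from browser_tools import navigate_to, wait_for_load, close_popups, get_text_content

navigate_to("https://github.com/trending")
wait_for_load()            # let scripts and lazy-loaded content settle first
close_popups()             # dismiss cookie banners or modals that would block clicks
text = get_text_content()  # only then read or interact with the page
```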
- **Authentication**
  - Cannot handle login-required pages
  - Avoid tasks requiring authentication
- **Dynamic Content**
  - May need extra wait time for JavaScript-rendered content
  - Some dynamic elements might need special handling
- **Browser Support**
  - Uses a Chrome/Chromium browser
  - Requires a graphical environment
- **Elements Not Found**
  - Solution: Use `wait_for_load()` or increase the wait time
  - Check the screenshot to verify element visibility
- **Pop-ups Blocking**
  - Solution: Use `close_popups()`
  - Verify pop-up closure in screenshots
- **Page Loading Issues**
  - Solution: Check the URL format
  - Verify connectivity
  - Increase wait times for heavy pages
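For the "Elements Not Found" case, one hedged pattern is a small retry helper built on the existing tools; `click_with_retry` is hypothetical and not part of the repository.

```python
import time

# Assumed import of the project's tools; exact signatures may differ.
from browser_tools import click_element, wait_for_load


def click_with_retry(target: str, attempts: int = 3, delay: float = 2.0) -> None:
    """Hypothetical helper: retry a click, giving slow or dynamic pages time to render."""
    for _ in range(attempts):
        try:
            click_element(target)
            return
        except Exception:
            wait_for_load()    # wait for late-arriving elements
            time.sleep(delay)  # then back off before the next attempt
    raise RuntimeError(f"Could not click '{target}' after {attempts} attempts")
```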
- API keys must be stored in environment variables
- Browser runs in controlled environment
- Avoid sending sensitive information
- Regular security updates recommended
Feel free to submit issues and enhancement requests!
MIT License - Feel free to use and modify as needed.