CloudCorpRecords/dobby

Web Vision Agent

Overview

A web-browsing AI agent with multi-model visual understanding. It navigates web pages autonomously, interprets their visual content from screenshots, and performs multi-step web-based tasks using vision language models.

Key Features

  • 🖼️ Visual Web Understanding
  • 🌐 Autonomous Web Navigation
  • 🔄 Multi-Stage AI Processing
  • 🎯 Task-Oriented Interaction
  • 🔍 Advanced Content Analysis

Architecture

Core Components

  1. Vision Language Models

    • Primary Model: Qwen2-VL-72B

      • Used for: Visual understanding, task processing, and decision making
      • Capabilities: Processes screenshots, understands layouts, reads text content
    • Secondary Model: Dobby Unhinged (Optional)

      • Used for: Response refinement and personality injection
      • Adds a unique, engaging tone to agent responses
  2. Browser Tools (browser_tools.py)

    • Navigation: navigate_to(), go_back()
    • Interaction: click_element(), scroll_page()
    • Search: search_item_ctrl_f(), get_text_content()
    • Utility: close_popups(), wait_for_load()
  3. Vision Callback (vision_callback.py)

    • Manages screenshot capture
    • Processes visual information
    • Updates agent observations
  4. Web Interface (app.py)

    • Flask-based API endpoints
    • Real-time agent interaction
    • Task execution and monitoring
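
The README does not show the contents of browser_tools.py, so the following is only a minimal sketch of the tool idea, assuming each tool is a plain function that changes browser state and returns a short text observation for the model to read. `BrowserState` is a hypothetical in-memory stand-in for a real browser session, and the `state` parameter is an assumption; the real navigate_to() and go_back() may have different signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BrowserState:
    """Hypothetical stand-in for a real browser session (illustration only)."""
    history: List[str] = field(default_factory=list)

    @property
    def current_url(self) -> Optional[str]:
        # The last visited URL, or None if nothing has been opened yet.
        return self.history[-1] if self.history else None


def navigate_to(state: BrowserState, url: str) -> str:
    """Record a visit and return a text observation (assumed signature)."""
    state.history.append(url)
    return f"Navigated to {url}"


def go_back(state: BrowserState) -> str:
    """Drop the current page and report the page that is now active."""
    if len(state.history) < 2:
        return "No previous page to go back to"
    state.history.pop()
    return f"Went back to {state.current_url}"
```

Returning a string from every tool gives the model something to reason over even when a step fails, instead of raising an exception mid-task.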

Setup

Prerequisites

pip install "smolagents[all]" helium selenium python-dotenv flask flask-cors

Environment Variables

FIREWORKS_API_KEY=your_fireworks_api_key
FLASK_SECRET_KEY=your_secret_key
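
As a rough sketch of how these two variables might be checked at startup, the helper below reads them from the process environment and fails early if either is missing. The function name `load_settings` and the fail-fast behavior are assumptions; the README does not say how app.py reads its configuration.

```python
import os
from typing import Dict, Mapping

# The two variables named in this section.
REQUIRED_KEYS = ("FIREWORKS_API_KEY", "FLASK_SECRET_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

If the variables are kept in a .env file, calling `load_dotenv()` from python-dotenv before `load_settings()` copies them into the process environment.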

Usage

Web Interface

  1. Start the server:
python app.py
  2. Access the interface at http://localhost:5000
  3. Select your preferred model:
    • Qwen2-VL-72B (Default): Best for detailed visual analysis
    • Dobby Unhinged: Adds personality to responses

Example Tasks

"Go to huggingface.co and tell me what's on the homepage"
"Navigate to GitHub trending and find the top repository"
"Search for specific content and analyze the visual layout"

API Endpoints

POST /api/agent/run

Runs the web vision agent on the task given in the request body.

Request Body:

{
    "task": "Your task description",
    "model_id": "accounts/fireworks/models/qwen2-vl-72b-instruct",
    "max_steps": 20
}

Response:

{
    "success": true,
    "result": "Task execution result"
}
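
A minimal client sketch for this endpoint, using only the standard library. It assumes the server runs at http://localhost:5000 as in the Usage section and accepts a JSON body with a `Content-Type: application/json` header. The helper names and the 300-second timeout are illustrative choices, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address from the Usage section
DEFAULT_MODEL = "accounts/fireworks/models/qwen2-vl-72b-instruct"


def build_run_request(task, model_id=DEFAULT_MODEL, max_steps=20):
    """Build (but do not send) the POST request for /api/agent/run."""
    payload = {"task": task, "model_id": model_id, "max_steps": max_steps}
    return urllib.request.Request(
        BASE_URL + "/api/agent/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_task(task, timeout=300.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_run_request(task)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `run_task("Go to huggingface.co and tell me what's on the homepage")` should return a dictionary with the `success` and `result` fields shown above.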

GET /api/agent/status

Returns the current status of the agent system.

Model Processing Pipeline

  1. Task Input

    • User submits task through web interface
    • Task is processed by selected model
  2. Visual Processing

    • Agent navigates to webpage
    • Screenshots are captured after each action
    • Qwen2-VL analyzes visual content
  3. Decision Making

    • Model determines next actions based on:
      • Visual analysis
      • Task requirements
      • Previous actions
  4. Response Generation

    • Primary model (Qwen2-VL) generates base response
    • If Dobby is selected, response is refined with personality
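
The four stages above can be sketched as a single loop. This is not the project's agent code: `decide`, `act`, `capture`, and `refine` are hypothetical callables standing in for the vision model, the browser tools, the screenshot callback, and the optional Dobby refinement step, and the `("action", ...)` / `("answer", ...)` tuple convention is an assumption made for the example.

```python
def run_pipeline(task, decide, act, capture, refine=None, max_steps=20):
    """Drive a decide -> act -> capture loop until the model answers.

    decide(task, observations, actions) returns ("action", value) to keep
    going or ("answer", text) to finish. All callables are hypothetical.
    """
    observations = []  # one screenshot (or description) per completed action
    actions = []       # actions taken so far, passed back to the model
    answer = "Stopped after reaching max_steps"

    for _ in range(max_steps):
        kind, value = decide(task, observations, actions)
        if kind == "answer":
            answer = value
            break
        act(value)                          # perform the browser action
        actions.append(value)
        observations.append(capture())      # screenshot after each action

    # Optional second stage: restyle the base response.
    return refine(answer) if refine else answer
```

Passing the full list of observations and previous actions to `decide` mirrors the three inputs listed under Decision Making.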

Development Guidelines

Adding New Features

  1. Add new tools to browser_tools.py
  2. Update agent system prompt in agent.py
  3. Add routes to app.py for new features
  4. Enhance frontend in templates/index.html

Best Practices

  1. Use wait_for_load() after navigation
  2. Handle dynamic content with appropriate delays
  3. Use close_popups() for modal windows
  4. Verify actions through screenshots
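
One way to apply these practices consistently is to wrap them in a single helper that always runs the steps in the same order. The sketch below takes the tools as arguments because their real signatures are not documented here; `open_page` and `take_screenshot` are hypothetical names.

```python
def open_page(url, navigate_to, wait_for_load, close_popups, take_screenshot):
    """Navigate, wait, dismiss pop-ups, then return a screenshot to verify.

    The four callables are passed in; their signatures are assumptions.
    """
    navigate_to(url)
    wait_for_load()           # practice 1: wait after navigation
    close_popups()            # practice 3: clear modal windows
    return take_screenshot()  # practice 4: verify the result visually
```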

Limitations

  1. Authentication

    • Cannot handle login-required pages
    • Avoid tasks requiring authentication
  2. Dynamic Content

    • May need extra wait time for JavaScript
    • Some dynamic elements might need special handling
  3. Browser Support

    • Uses Chrome/Chromium browser
    • Requires graphical environment

Troubleshooting

Common Issues

  1. Elements Not Found

    • Solution: Use wait_for_load() or increase wait time
    • Check screenshot to verify element visibility
  2. Pop-ups Blocking

    • Solution: Use close_popups()
    • Verify pop-up closure in screenshots
  3. Page Loading Issues

    • Solution: Check URL format
    • Verify connectivity
    • Increase wait times for heavy pages
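
For the "increase wait time" advice, a polling helper is often more reliable than a single fixed delay. The sketch below is a generic standard-library pattern, not something taken from this project; `wait_until` and its parameters are hypothetical, and `condition` would be whatever check fits the page, such as "the element is visible".

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it is truthy or timeout seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Raising `timeout` for heavy pages keeps fast pages fast, since the loop returns as soon as the condition holds.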

Security Notes

  • API keys must be stored in environment variables
  • Browser runs in controlled environment
  • Avoid sending sensitive information
  • Regular security updates recommended

Contributing

Feel free to submit issues and enhancement requests!

License

MIT License - Feel free to use and modify as needed.