A video analysis tool that combines a vision model (such as Llama 3.2 11B Vision) with OpenAI's Whisper to describe what happens in a video. It extracts key frames, has the vision model describe each one, and then combines those per-frame details with the audio transcript, if one is available, into a single description of the video.
- Features
- Requirements
- Usage
- Design
- Project Structure
- Configuration
- Output
- Uninstallation
- License
- Contributing
- Can run completely locally, with no cloud services or API keys needed
- Or, leverage any OpenAI API compatible LLM service (OpenRouter, OpenAI, etc.) for speed and scale
- Intelligent key frame extraction from videos
- High-quality audio transcription using OpenAI's Whisper
- Frame analysis using Ollama and the Llama 3.2 11B Vision model
- Natural language descriptions of video content
- Automatic handling of poor quality audio
- Detailed JSON output of analysis results
- Highly configurable through command line arguments or a config file
The system operates in three stages:
- Frame Extraction & Audio Processing
  - Uses OpenCV to extract key frames
  - Processes audio using Whisper for transcription
  - Handles poor quality audio with confidence checks
- Frame Analysis
  - Analyzes each frame using a vision LLM
  - Each analysis includes context from previous frames
  - Maintains chronological progression
  - Uses the frame_analysis.txt prompt template
- Video Reconstruction
  - Combines frame analyses chronologically
  - Integrates the audio transcript
  - Uses the first frame to set the scene
  - Creates a comprehensive video description
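The three stages above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: `usable_transcript`, `analyze_frames`, `build_description_prompt`, the `describe_frame` callback, and the 0.5 confidence threshold are all hypothetical names and values chosen for the example. The vision model is replaced by a stub so the sketch runs without Ollama.

```python
from typing import Callable, List, Optional

# Hypothetical threshold; the real tool's confidence check may differ.
MIN_TRANSCRIPT_CONFIDENCE = 0.5


def usable_transcript(text: str, confidence: float) -> Optional[str]:
    """Stage 1 (audio): drop transcripts that are empty or low confidence."""
    if not text.strip() or confidence < MIN_TRANSCRIPT_CONFIDENCE:
        return None
    return text.strip()


def analyze_frames(
    frames: List[str],
    describe_frame: Callable[[str, str], str],
) -> List[str]:
    """Stage 2: analyze frames in order, passing earlier notes as context."""
    analyses: List[str] = []
    for frame in frames:
        context = "\n".join(analyses)  # everything said about earlier frames
        analyses.append(describe_frame(frame, context))
    return analyses


def build_description_prompt(analyses: List[str], transcript: Optional[str]) -> str:
    """Stage 3: combine frame notes (first frame sets the scene) and transcript."""
    if not analyses:
        raise ValueError("at least one frame analysis is required")
    parts = [f"Scene (first frame): {analyses[0]}"]
    parts += [f"Frame {i}: {note}" for i, note in enumerate(analyses[1:], start=1)]
    if transcript is not None:
        parts.append(f"Transcript: {transcript}")
    return "\n".join(parts)


def stub_vision_model(frame: str, context: str) -> str:
    """Stand-in for the vision LLM call; reports how much context it saw."""
    seen = len(context.splitlines()) if context else 0
    return f"{frame} described with {seen} earlier note(s)"


if __name__ == "__main__":
    notes = analyze_frames(["frame_0.jpg", "frame_1.jpg"], stub_vision_model)
    print(build_description_prompt(notes, usable_transcript("hello", 0.9)))
```

Passing the accumulated notes into each call is one simple way to give every frame analysis the context of earlier frames while keeping the results in chronological order.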
- Python 3.11 or higher
- FFmpeg (required for audio processing)
- When running LLMs locally (not needed when using a hosted service such as OpenRouter):
  - At least 16GB of RAM (32GB recommended)
  - A GPU with at least 12GB of VRAM, or an Apple M-series machine with at least 32GB of memory
- Clone the repository:
git clone https://github.com/byjlw/video-analyzer.git
cd video-analyzer
- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install the package:
pip install . # For regular installation
# OR
pip install -e . # For development installation
- Install FFmpeg:
- Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y ffmpeg
- macOS:
brew install ffmpeg
- Windows:
choco install ffmpeg
- Install Ollama following the instructions at ollama.ai
- Pull the default vision model:
ollama pull llama3.2-vision
- Start the Ollama service:
ollama serve
If you want to use OpenAI-compatible APIs (like OpenRouter or OpenAI) instead of Ollama:
- Get an API key from your provider.
- Configure via command line:
# For OpenRouter
video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://openrouter.ai/api/v1 --model meta-llama/llama-3.2-11b-vision-instruct
# For OpenAI
video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://api.openai.com/v1 --model gpt-4o-mini
Or add to config/config.json (set api_url to https://api.openai.com/v1 to use OpenAI instead of OpenRouter):
{
  "clients": {
    "default": "openai_api",
    "openai_api": {
      "api_key": "your-api-key",
      "api_url": "https://openrouter.ai/api/v1"
    }
  }
}
Note: With OpenRouter, you can use Llama 3.2 11B Vision for free by appending :free to the model name (for example, meta-llama/llama-3.2-11b-vision-instruct:free).
video-analyzer/
├── config/
│   └── default_config.json
├── prompts/
│   └── frame_analysis/
│       ├── frame_analysis.txt
│       └── describe.txt
├── output/             # Generated during runtime
├── video_analyzer/     # Package source code
└── setup.py            # Package installation configuration
For detailed information about the project's design and implementation, including how to make changes, see docs/DESIGN.md.
For detailed usage instructions and all available options, see docs/USAGES.md.
# Local analysis with Ollama (default)
video-analyzer video.mp4
# Cloud analysis with OpenRouter
video-analyzer video.mp4 \
--client openai_api \
--api-key your-key \
--api-url https://openrouter.ai/api/v1 \
--model meta-llama/llama-3.2-11b-vision-instruct:free
# Analysis with custom prompt
video-analyzer video.mp4 \
--prompt "What activities are happening in this video?" \
--whisper-model large
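When the analyzer is driven from a script, the same commands can be assembled programmatically. The sketch below only builds the argument list from the flags shown above. `build_command` is a hypothetical helper, not part of the package, and actually running the result assumes `video-analyzer` is on your `PATH`.

```python
import shlex
from typing import List, Optional


def build_command(
    video: str,
    client: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    model: Optional[str] = None,
    prompt: Optional[str] = None,
    whisper_model: Optional[str] = None,
) -> List[str]:
    """Return an argv list for video-analyzer using only the flags shown above."""
    argv = ["video-analyzer", video]
    options = {
        "--client": client,
        "--api-key": api_key,
        "--api-url": api_url,
        "--model": model,
        "--prompt": prompt,
        "--whisper-model": whisper_model,
    }
    for flag, value in options.items():
        if value is not None:  # omit flags the caller did not set
            argv += [flag, value]
    return argv


if __name__ == "__main__":
    argv = build_command(
        "video.mp4",
        prompt="What activities are happening in this video?",
        whisper_model="large",
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(argv))
    # To execute it: subprocess.run(argv, check=True)
```

Building a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or punctuation.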
**Video Summary**\n\nDuration: 5 minutes and 67 seconds\n\nThe video begins with a person with long blonde hair, wearing a pink t-shirt and yellow shorts, standing in front of a black plastic tub or container on wheels. The ground appears to be covered in wood chips.\n\nAs the video progresses, the person remains facing away from the camera, looking down at something inside the tub. Their left hand is resting on their hip, while their right arm hangs loosely by their side. There are no new objects or people visible in this frame, but there appears to be some greenery and possibly fruit scattered around the ground behind the person.\n\nThe black plastic tub on wheels is present throughout the video, and the wood chips covering the ground remain consistent with those seen in Frame 0. The person's pink t-shirt matches the color of the shirt worn by the person in Frame 0.\n\nAs the video continues, the person remains stationary, looking down at something inside the tub. There are no significant changes or developments in this frame.\n\nThe key continuation point is to watch for the person to pick up an object from the tub and examine it more closely.\n\n**Key Continuation Points:**\n\n* The person's pink t-shirt matches the color of the shirt worn by the person in Frame 0.\n* The black plastic tub on wheels is also present in Frame 0.\n* The wood chips covering the ground are consistent with those seen in Frame 0.
The tool uses a cascading configuration system with command line arguments taking highest priority, followed by user config (config/config.json), and finally the default config. See docs/USAGES.md for detailed configuration options.
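The priority order described above can be illustrated with a small layered merge. This is a sketch of the general technique only: `merge_config` is a hypothetical function, and the tool's own loader may behave differently (for example in how it treats nested keys). The key names follow the config.json example in this README.

```python
from typing import Any, Dict


def merge_config(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Merge layers left to right; later layers win, nested dicts merge recursively."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are objects: merge them instead of replacing.
                merged[key] = merge_config(merged[key], value)
            else:
                merged[key] = value
    return merged


if __name__ == "__main__":
    # Lowest priority first: default config, then user config, then CLI arguments.
    default_config = {
        "clients": {
            "default": "openai_api",
            "openai_api": {"api_url": "https://api.openai.com/v1"},
        }
    }
    user_config = {"clients": {"openai_api": {"api_url": "https://openrouter.ai/api/v1"}}}
    cli_args = {"clients": {"openai_api": {"api_key": "your-key"}}}
    print(merge_config(default_config, user_config, cli_args))
```

Listing the layers from lowest to highest priority keeps the rule easy to read: whatever comes last wins, so command line arguments override the user config, which overrides the defaults.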
The tool generates a JSON file (analysis.json) containing:
- Metadata about the analysis
- Audio transcript (if available)
- Frame-by-frame analysis
- Final video description
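To explore the result without assuming its exact schema, you can list the top-level keys of the file. The sketch below is illustrative: `summarize_analysis` is a hypothetical helper, and the `output/analysis.json` path is an assumption based on the `output/` directory in the project structure.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_analysis(path: Path) -> Dict[str, str]:
    """Map each top-level key in the JSON file to the type name of its value."""
    with path.open(encoding="utf-8") as handle:
        data: Any = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    analysis_path = Path("output/analysis.json")  # assumed location
    if analysis_path.exists():
        for key, kind in summarize_analysis(analysis_path).items():
            print(f"{key}: {kind}")
    else:
        print(f"{analysis_path} not found; run video-analyzer first")
```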
To uninstall the package:
pip uninstall video-analyzer
Apache License
We welcome contributions! Please see docs/CONTRIBUTING.md for detailed guidelines on how to:
- Review the project design
- Propose changes through GitHub Discussions
- Submit pull requests