A comprehensive system for crawling technical documentation, extracting keywords, generating embeddings, and providing an interactive question-answering interface using RAG (Retrieval-Augmented Generation).
This project consists of three main components:
- A documentation crawler that fetches and processes web content
- A keyword extraction system that identifies technical terms and definitions
- A Streamlit-based UI for interactive question-answering
This is still a work in progress. The system does what it is supposed to do, but it is poorly optimized and slow: there is a lot of redundancy, and batch processing and rate limiting are implemented manually, which makes them inefficient.
The crawler component is responsible for:
- Fetching documentation from specified websites
- Processing content into manageable chunks
- Generating embeddings for each chunk
- Storing processed data in Supabase
Key features (a batching and rate-limiting sketch follows this list):
- Parallel processing with rate limiting
- Support for sitemap.xml and recursive crawling
- Batch processing for API calls
- Error handling and retry mechanisms
- Queue-based crawling system
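A minimal sketch of how the batched, rate-limited embedding calls might look; the batch size, delay, and model name are illustrative assumptions rather than the project's exact settings:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def embed_chunks(
    chunks: list[str], batch_size: int = 100, delay: float = 1.0
) -> list[list[float]]:
    """Embed text chunks in batches, pausing between batches as a crude rate limit."""
    embeddings: list[list[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-small",  # assumed model; returns 1536-dimension vectors
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
        await asyncio.sleep(delay)  # wait between batches to stay under API rate limits
    return embeddings
```

The actual crawler layers retries, error handling, and queue-based progress tracking on top of a loop like this.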
This component:
- Processes documentation chunks to identify technical terms
- Extracts definitions for each term
- Uses GPT-4 for accurate keyword identification
- Stores keyword-definition pairs in Supabase
Features (an extraction sketch follows this list):
- Batch processing for efficient API usage
- Robust error handling
- Progress tracking
- Parallel processing capabilities
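A rough sketch of what a single extraction call could look like; the prompt wording and JSON shape are assumptions, not the project's actual prompt:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_keywords(chunk: str) -> dict[str, str]:
    """Ask the model for technical terms and definitions found in a documentation chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # matches the LLM_MODEL default shown in the configuration below
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract technical terms from the given documentation chunk. "
                    "Reply with a JSON object mapping each term to a one-sentence definition."
                ),
            },
            {"role": "user", "content": chunk},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```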
This is the main component of the system. It uses a combination of RAG, keyword extraction, and a conversation interface to answer questions about the documentation.
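As a sketch of the RAG flow, the example below embeds a question, retrieves similar chunks from Supabase, and answers from that context. The `match_site_pages` RPC name and the column names are hypothetical placeholders, not confirmed parts of the project:

```python
import os

from openai import OpenAI
from supabase import create_client

openai_client = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])


def answer_question(question: str) -> str:
    """Embed the question, retrieve similar chunks, and answer from that context."""
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    # "match_site_pages" stands in for a Postgres function doing a pgvector similarity search.
    matches = supabase.rpc(
        "match_site_pages",
        {"query_embedding": query_embedding, "match_count": 5},
    ).execute()
    context = "\n\n".join(row["content"] for row in matches.data)

    completion = openai_client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": "Answer using only the provided documentation context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```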
Prerequisites:
- Python 3.8+
- Supabase account
- OpenAI API access
- Nomic API access (for embeddings visualization)
Create a `.env` file with the following variables (a loading sketch follows the block):
```
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_key
OPENAI_API_KEY=your_openai_key
OPENROUTER_API_KEY=your_openrouter_key
OPENROUTER_API_BASE=your_openrouter_base
NOMIC_API_KEY=your_nomic_key
LLM_MODEL=gpt-4o-mini
```
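Assuming the scripts use python-dotenv (an assumption; the project may load configuration differently), reading these values could look like this:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read the .env file into the process environment

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_SERVICE_KEY = os.environ["SUPABASE_SERVICE_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")  # falls back to the documented default
```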
- Clone the repository
- Install dependencies:
```bash
pip install -r requirements.txt
```
The Supabase database uses three tables (a sketch of inserting rows follows the schema):

Documentation chunks:
- url: string (primary key)
- chunk_number: integer
- title: string
- summary: string
- content: text
- metadata: jsonb
- embedding: vector(1536)

Keywords:
- url: string (foreign key)
- chunk_number: integer
- keyword: string
- definition: text

Crawl queue:
- url: string (primary key)
- status: string (pending/completed/failed)
- attempts: integer
- last_attempt: timestamp
- error_message: text
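A sketch of writing rows into these tables with the Supabase Python client; the table names `site_pages` and `keywords` are placeholders rather than confirmed names from the project:

```python
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

# Store one processed documentation chunk (table name is a placeholder).
supabase.table("site_pages").insert({
    "url": "https://example.com/docs/page",
    "chunk_number": 0,
    "title": "Example page",
    "summary": "Short summary of the chunk",
    "content": "Full chunk text...",
    "metadata": {"source": "example"},
    "embedding": [0.0] * 1536,  # 1536-dimension vector from the embedding model
}).execute()

# Store one keyword-definition pair extracted from that chunk.
supabase.table("keywords").insert({
    "url": "https://example.com/docs/page",
    "chunk_number": 0,
    "keyword": "RAG",
    "definition": "Retrieval-Augmented Generation: answering with retrieved context.",
}).execute()
```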
To run the crawler:

```bash
python crawler.py [url1] [url2] ...
```
If no URLs are provided, it will use the default URLs in the script.
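The fallback to default URLs amounts to something like this sketch (the default URL is a placeholder):

```python
import sys

DEFAULT_URLS = ["https://docs.example.com"]  # placeholder defaults


def main() -> None:
    # Use URLs passed on the command line, otherwise fall back to the defaults.
    urls = sys.argv[1:] or DEFAULT_URLS
    for url in urls:
        print(f"Crawling {url}")


if __name__ == "__main__":
    main()
```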
To extract keywords:

```bash
python extract_keywords.py
```
This will process all unprocessed chunks in the database.
To launch the UI:

```bash
streamlit run streamlit_ui.py
```
Access the UI through your browser at http://localhost:8501.
Crawler:
- Automatic sitemap detection (sketched after this list)
- Recursive crawling capability
- Parallel processing
- Rate limiting
- Error handling and retries
- Queue-based processing
- Content chunking
- Embedding generation
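For illustration, sitemap detection with a fallback to recursive crawling might look roughly like this (URL handling simplified):

```python
import xml.etree.ElementTree as ET

import requests


def discover_urls(base_url: str) -> list[str]:
    """Return URLs from sitemap.xml if present, otherwise fall back to the base URL."""
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    try:
        response = requests.get(sitemap_url, timeout=10)
        response.raise_for_status()
        root = ET.fromstring(response.content)
    except (requests.RequestException, ET.ParseError):
        return [base_url]  # no usable sitemap: start recursive crawling from the base URL

    # Sitemap page URLs live in <loc> elements under the sitemaps.org namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns) if loc.text]
```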
Keyword extraction:
- GPT-4 powered extraction
- Batch processing
- Progress tracking
- Error handling
- Parallel processing
Streamlit UI:
- Real-time streaming responses (a chat sketch follows this list)
- Source filtering
- Conversation history
- RAG-based answers
- Markdown support
- System message display
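A minimal sketch of the streaming chat loop with conversation history in Streamlit; the retrieval step and source filtering are omitted here, and the model name is an assumption:

```python
import streamlit as st
from openai import OpenAI

client = OpenAI()

st.title("Documentation Q&A")

# Conversation history lives in session state so it survives reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask about the documentation"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    with st.chat_message("assistant"):
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=st.session_state.messages,
            stream=True,
        )
        # st.write_stream renders tokens as they arrive and returns the full text.
        answer = st.write_stream(
            chunk.choices[0].delta.content or "" for chunk in stream
        )
    st.session_state.messages.append({"role": "assistant", "content": answer})
```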
Rate limiting and batch processing are implemented throughout the system (a retry-with-backoff sketch follows this list):
- Configurable batch sizes
- Adjustable rate limits
- Automatic retries
- Error handling
- Progress tracking
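A retry helper with exponential backoff, in the spirit of the manual retry handling described above (illustrative only):

```python
import asyncio
import random


async def call_with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry an async API call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:  # narrow this to the API's rate-limit/timeout errors in practice
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

A wrapper like this would sit around the OpenAI and Supabase calls made by the crawler and extractor, with the batch size and delays exposed as configuration.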