A comprehensive system for crawling technical documentation, extracting keywords, generating embeddings, and providing an interactive question-answering interface using RAG (Retrieval-Augmented Generation).
This project consists of three main components:
- A documentation crawler that fetches and processes web content
- A keyword extraction system that identifies technical terms and definitions
- A Streamlit-based UI for interactive question-answering
This is still a work in progress. The system does what it is supposed to do, but it is poorly optimized and slow: there is a lot of redundancy, and batch processing and rate limiting are implemented manually, which makes them inefficient.
The crawler component is responsible for:
- Fetching documentation from specified websites
- Processing content into manageable chunks
- Generating embeddings for each chunk
- Storing processed data in Supabase
Key features (a batching and rate-limiting sketch follows this list):
- Parallel processing with rate limiting
- Support for sitemap.xml and recursive crawling
- Batch processing for API calls
- Error handling and retry mechanisms
- Queue-based crawling system
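A minimal sketch of how the batched, rate-limited embedding calls might look; the batch size, delay, and model name are illustrative assumptions rather than the project's exact settings:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def embed_chunks(
    chunks: list[str], batch_size: int = 100, delay: float = 1.0
) -> list[list[float]]:
    """Embed text chunks in batches, pausing between batches as a crude rate limit."""
    embeddings: list[list[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-small",  # assumed model; returns 1536-dimension vectors
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
        await asyncio.sleep(delay)  # wait between batches to stay under API rate limits
    return embeddings
```

The actual crawler layers retries, error handling, and queue-based progress tracking on top of a loop like this.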
This component:
- Processes documentation chunks to identify technical terms
- Extracts definitions for each term
- Uses GPT-4 for accurate keyword identification
- Stores keyword-definition pairs in Supabase
Features (an extraction sketch follows this list):
- Batch processing for efficient API usage
- Robust error handling
- Progress tracking
- Parallel processing capabilities
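A rough sketch of what a single extraction call could look like; the prompt wording and JSON shape are assumptions, not the project's actual prompt:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_keywords(chunk: str) -> dict[str, str]:
    """Ask the model for technical terms and definitions found in a documentation chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # matches the LLM_MODEL default shown in the configuration below
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract technical terms from the given documentation chunk. "
                    "Reply with a JSON object mapping each term to a one-sentence definition."
                ),
            },
            {"role": "user", "content": chunk},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```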
This is the main component of the system. It uses a combination of RAG, keyword extraction, and a conversation interface to answer questions about the documentation.
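As a sketch of the RAG flow, the example below embeds a question, retrieves similar chunks from Supabase, and answers from that context. The `match_site_pages` RPC name and the column names are hypothetical placeholders, not confirmed parts of the project:

```python
import os

from openai import OpenAI
from supabase import create_client

openai_client = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])


def answer_question(question: str) -> str:
    """Embed the question, retrieve similar chunks, and answer from that context."""
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    # "match_site_pages" stands in for a Postgres function doing a pgvector similarity search.
    matches = supabase.rpc(
        "match_site_pages",
        {"query_embedding": query_embedding, "match_count": 5},
    ).execute()
    context = "\n\n".join(row["content"] for row in matches.data)

    completion = openai_client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": "Answer using only the provided documentation context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```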
Prerequisites:
- Python 3.8+
- Supabase account
- OpenAI API access
- Nomic API access (for embeddings visualization)
Create a `.env` file with the following variables (a loading sketch follows the block):
```
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_key
OPENAI_API_KEY=your_openai_key
OPENROUTER_API_KEY=your_openrouter_key
OPENROUTER_API_BASE=your_openrouter_base
NOMIC_API_KEY=your_nomic_key
LLM_MODEL=gpt-4o-mini
```
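Assuming the scripts use python-dotenv (an assumption; the project may load configuration differently), reading these values could look like this:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read the .env file into the process environment

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_SERVICE_KEY = os.environ["SUPABASE_SERVICE_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o-mini")  # falls back to the documented default
```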
- Clone the repository
- Install dependencies:
```bash
pip install -r requirements.txt
```
The Supabase database uses three tables (a sketch of inserting rows follows the schema):

Documentation chunks:
- url: string (primary key)
- chunk_number: integer
- title: string
- summary: string
- content: text
- metadata: jsonb
- embedding: vector(1536)

Keywords:
- url: string (foreign key)
- chunk_number: integer
- keyword: string
- definition: text

Crawl queue:
- url: string (primary key)
- status: string (pending/completed/failed)
- attempts: integer
- last_attempt: timestamp
- error_message: text
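A sketch of writing rows into these tables with the Supabase Python client; the table names `site_pages` and `keywords` are placeholders rather than confirmed names from the project:

```python
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

# Store one processed documentation chunk (table name is a placeholder).
supabase.table("site_pages").insert({
    "url": "https://example.com/docs/page",
    "chunk_number": 0,
    "title": "Example page",
    "summary": "Short summary of the chunk",
    "content": "Full chunk text...",
    "metadata": {"source": "example"},
    "embedding": [0.0] * 1536,  # 1536-dimension vector from the embedding model
}).execute()

# Store one keyword-definition pair extracted from that chunk.
supabase.table("keywords").insert({
    "url": "https://example.com/docs/page",
    "chunk_number": 0,
    "keyword": "RAG",
    "definition": "Retrieval-Augmented Generation: answering with retrieved context.",
}).execute()
```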
To run the crawler:

```bash
python crawler.py [url1] [url2] ...
```
If no URLs are provided, it will use the default URLs in the script.
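The fallback to default URLs amounts to something like this sketch (the default URL is a placeholder):

```python
import sys

DEFAULT_URLS = ["https://docs.example.com"]  # placeholder defaults


def main() -> None:
    # Use URLs passed on the command line, otherwise fall back to the defaults.
    urls = sys.argv[1:] or DEFAULT_URLS
    for url in urls:
        print(f"Crawling {url}")


if __name__ == "__main__":
    main()
```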
To extract keywords:

```bash
python extract_keywords.py
```
This will process all unprocessed chunks in the database.
To launch the UI:

```bash
streamlit run streamlit_ui.py
```
Access the UI through your browser at http://localhost:8501.
Crawler:
- Automatic sitemap detection (sketched after this list)
- Recursive crawling capability
- Parallel processing
- Rate limiting
- Error handling and retries
- Queue-based processing
- Content chunking
- Embedding generation
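For illustration, sitemap detection with a fallback to recursive crawling might look roughly like this (URL handling simplified):

```python
import xml.etree.ElementTree as ET

import requests


def discover_urls(base_url: str) -> list[str]:
    """Return URLs from sitemap.xml if present, otherwise fall back to the base URL."""
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    try:
        response = requests.get(sitemap_url, timeout=10)
        response.raise_for_status()
        root = ET.fromstring(response.content)
    except (requests.RequestException, ET.ParseError):
        return [base_url]  # no usable sitemap: start recursive crawling from the base URL

    # Sitemap page URLs live in <loc> elements under the sitemaps.org namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns) if loc.text]
```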
Keyword extraction:
- GPT-4 powered extraction
- Batch processing
- Progress tracking
- Error handling
- Parallel processing
Streamlit UI:
- Real-time streaming responses (a chat sketch follows this list)
- Source filtering
- Conversation history
- RAG-based answers
- Markdown support
- System message display
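A minimal sketch of the streaming chat loop with conversation history in Streamlit; the retrieval step and source filtering are omitted here, and the model name is an assumption:

```python
import streamlit as st
from openai import OpenAI

client = OpenAI()

st.title("Documentation Q&A")

# Conversation history lives in session state so it survives reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask about the documentation"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    with st.chat_message("assistant"):
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=st.session_state.messages,
            stream=True,
        )
        # st.write_stream renders tokens as they arrive and returns the full text.
        answer = st.write_stream(
            chunk.choices[0].delta.content or "" for chunk in stream
        )
    st.session_state.messages.append({"role": "assistant", "content": answer})
```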
Rate limiting and batch processing are implemented throughout the system (a retry-with-backoff sketch follows this list):
- Configurable batch sizes
- Adjustable rate limits
- Automatic retries
- Error handling
- Progress tracking
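A retry helper with exponential backoff, in the spirit of the manual retry handling described above (illustrative only):

```python
import asyncio
import random


async def call_with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry an async API call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:  # narrow this to the API's rate-limit/timeout errors in practice
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

A wrapper like this would sit around the OpenAI and Supabase calls made by the crawler and extractor, with the batch size and delays exposed as configuration.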