A robust web crawler using DeepSeek AI for extracting structured data from websites, with comprehensive error logging and validation.
```mermaid
graph TD
    A[Web Page] --> B[AsyncWebCrawler]
    B --> C[HTML Content]
    C --> D[LLM Extraction]
    D --> E[Structured Data]
    E --> F[Validation]
    F --> G[CSV Output]
    H[Logger] --> I[Console Output]
    H --> J[Log Files]
    B -.-> H
    D -.-> H
    F -.-> H
```
The project implements a comprehensive logging system, which is crucial for LLM-based web scraping: because LLM responses are inherently unpredictable and scraping pipelines are brittle, extensive error logging is essential.
- **Unique Error Tracking** (see the sketch below)
  - Each error gets a unique ID with timestamp
  - Errors can be traced across the entire scraping process
  - Related errors are linked for debugging
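As a rough illustration of the idea (the helper name `new_error_id` is hypothetical, not the project's actual function), an ID in the `YYYYMMDD_HHMMSS_uniqueid` format could be built like this:

```python
import uuid
from datetime import datetime

def new_error_id() -> str:
    """Build an error ID of the form YYYYMMDD_HHMMSS_uniqueid."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    unique = uuid.uuid4().hex[:8]  # short random suffix keeps IDs unique within a second
    return f"{timestamp}_{unique}"

# Attach the same ID to every log line produced by one failed operation
error_id = new_error_id()
print(error_id)  # e.g. 20240101_120000_3f9a1c2b
```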
- **Structured Log Format**

  ```
  [TIMESTAMP] [LEVEL] [COMPONENT] Message
  =====================================
  Error ID: YYYYMMDD_HHMMSS_uniqueid
  Error: Detailed error message
  Traceback: Stack trace with proper indentation
  =====================================
  ```
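A minimal sketch of how such a block could be emitted with the standard `logging` module; `log_error_block` and the component name are illustrative, not the project's real API:

```python
import logging
import traceback

def log_error_block(logger: logging.Logger, component: str, error_id: str, exc: Exception) -> None:
    """Emit one delimited error block in the structured format shown above."""
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    indented_tb = "\n".join("    " + line for line in tb.rstrip().splitlines())
    logger.error(
        "[%s] =====================================\n"
        "Error ID: %s\n"
        "Error: %s\n"
        "Traceback:\n%s\n"
        "=====================================",
        component, error_id, exc, indented_tb,
    )

logging.basicConfig(level=logging.INFO, format="[%(asctime)s] [%(levelname)s] %(message)s")
logger = logging.getLogger("crawler")
try:
    raise ValueError("LLM returned malformed JSON")
except ValueError as e:
    log_error_block(logger, "PARSE", "20240101_120000_3f9a1c2b", e)
```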
- **Log Rotation** (sketched below)
  - Daily log files with date stamps
  - 10MB size limit per file
  - Keeps last 5 rotated files
  - UTF-8 encoding for international text
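A sketch of how such a handler could be configured with the standard library, assuming the `logs/` directory and filename pattern described later in this README:

```python
import logging
from logging.handlers import RotatingFileHandler
from datetime import date
from pathlib import Path

LOG_DIR = Path("logs")
LOG_DIR.mkdir(exist_ok=True)

# Daily file name with a date stamp; size-based rotation keeps each file under 10 MB
handler = RotatingFileHandler(
    LOG_DIR / f"crawler_{date.today():%Y-%m-%d}.log",
    maxBytes=10 * 1024 * 1024,  # 10MB size limit per file
    backupCount=5,              # keep the last 5 rotated files (.1 .. .5)
    encoding="utf-8",           # international text
)
handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))

logger = logging.getLogger("crawler")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.addHandler(logging.StreamHandler())  # console output as well
```

A long-running process would additionally need to reopen the file when the date changes (for example via `TimedRotatingFileHandler`); that detail is omitted here.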
- **Contextual Logging** (example below)
  - Operation context (FETCH, PARSE, VALIDATE)
  - Performance metrics (timing, counts)
  - Data samples for debugging
  - Success/failure statistics
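One common way to stamp every message with an operation context is a `LoggerAdapter`; this is only a sketch of the idea, not the project's actual implementation:

```python
import logging

logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(component)s] %(message)s",
                    level=logging.INFO)

def component_logger(component: str) -> logging.LoggerAdapter:
    """Return a logger that tags every record with an operation context."""
    return logging.LoggerAdapter(logging.getLogger("crawler"), {"component": component})

fetch_log = component_logger("FETCH")
fetch_log.info("Fetched %d pages in %.2fs", 12, 3.41)        # performance metrics
fetch_log.debug("First chars of HTML: %r", "<html>...")       # data sample for debugging

validate_log = component_logger("VALIDATE")
validate_log.info("Validated %d/%d records successfully", 118, 120)  # success/failure stats
```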
- **LLM Unpredictability**
  - LLM responses can vary significantly
  - Need to track extraction patterns
  - Identify common failure modes
  - Tune prompts based on failures
- **Web Scraping Challenges**
  - Sites change structure frequently
  - Network issues are common
  - Rate limiting needs monitoring
  - CSS selectors can break
- **Data Validation** (see the sketch below)
  - Track missing required fields
  - Monitor data quality
  - Identify pattern mismatches
  - Catch schema violations
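A minimal sketch of what such a validation step could look like; the field names (`title`, `price`) are placeholders, not the project's real schema:

```python
import logging

logger = logging.getLogger("crawler")
REQUIRED_FIELDS = ("title", "price")  # placeholder schema, not the real one

def validate_record(record: dict) -> bool:
    """Return True if the record has every required field; log what is missing."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        logger.warning("VALIDATE: record rejected, missing fields: %s (record=%r)",
                       ", ".join(missing), record)
        return False
    return True

records = [{"title": "Widget", "price": "9.99"}, {"title": "", "price": None}]
valid = [r for r in records if validate_record(r)]
print(f"{len(valid)}/{len(records)} records passed validation")
```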
- **Performance Optimization** (timing sketch below)
  - Track timing per operation
  - Monitor resource usage
  - Identify bottlenecks
  - Guide optimization efforts
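For example, a small context manager can log how long each operation takes; the names here are illustrative:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="[%(asctime)s] [%(levelname)s] %(message)s")
logger = logging.getLogger("crawler")

@contextmanager
def timed(operation: str):
    """Log the duration of one operation so slow steps show up in the logs."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.2fs", operation, time.perf_counter() - start)

with timed("FETCH https://example.com"):
    time.sleep(0.1)  # stand-in for the real network call
```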
Web scraping is inherently brittle, even with advanced LLM capabilities. Common challenges include:
- Website structure changes
- Dynamic content loading
- Anti-bot measures
- Rate limiting
- Network instability
- Character encoding issues
- International text handling
The logging system helps identify and debug these issues quickly, but regular maintenance and updates will still be necessary.
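Logging alone does not make a failed request succeed; transient network errors and rate limits usually also warrant retries with backoff. A generic sketch of that idea (not the project's actual retry logic):

```python
import asyncio
import logging
import random

logger = logging.getLogger("crawler")

async def fetch_with_retries(fetch, url: str, attempts: int = 3):
    """Call an async fetch function, retrying with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await fetch(url)
        except Exception as exc:  # in real code, catch the specific network/rate-limit errors
            if attempt == attempts:
                logger.error("FETCH failed permanently for %s: %s", url, exc)
                raise
            delay = 2 ** attempt + random.random()
            logger.warning("FETCH attempt %d/%d failed for %s (%s); retrying in %.1fs",
                           attempt, attempts, url, exc, delay)
            await asyncio.sleep(delay)
```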
1. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```
2. **Set up environment variables:**

   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```
3. **Run the crawler:**

   ```bash
   python main.py
   ```
4. **Monitor logs:**
   - Check console for real-time updates
   - Review `logs/crawler_YYYY-MM-DD.log` for details
   - Use error IDs to track specific issues
Logs are stored in the `logs` directory:

- Daily files: `logs/crawler_YYYY-MM-DD.log`
- Rotated files: `logs/crawler_YYYY-MM-DD.log.1`, `.2`, etc.
- Debug logs include raw content samples
- Error logs include full stack traces
- **Monitor Logs Regularly** (a sample log-analysis sketch follows this item)
  - Check for recurring errors
  - Look for pattern changes
  - Monitor success rates
  - Track performance metrics
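A quick way to spot recurring errors is to count the error messages recorded in the daily files; this is just one possible approach, assuming the log layout described above:

```python
import re
from collections import Counter
from pathlib import Path

error_counts = Counter()
for log_file in Path("logs").glob("crawler_*.log*"):  # includes rotated .1, .2, ... files
    for line in log_file.read_text(encoding="utf-8").splitlines():
        match = re.search(r"Error: (.+)", line)
        if match:
            error_counts[match.group(1)] += 1  # group identical error messages

for message, count in error_counts.most_common(10):
    print(f"{count:4d}  {message}")
```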
- **Update Selectors and Prompts**
  - Use log data to refine CSS selectors
  - Adjust LLM prompts based on failures
  - Update validation rules as needed
  - Monitor site structure changes
- **Performance Tuning**
  - Use timing data to optimize
  - Adjust batch sizes if needed
  - Fine-tune retry strategies
  - Balance speed vs. reliability
- **Error Response**
  - Use error IDs for tracking
  - Check related errors
  - Review context data
  - Update code based on patterns
Remember: The key to successful LLM-based scraping is not just writing good extraction code, but having comprehensive logging to understand and respond to failures quickly.