# SynSearch

## Overview
SynSearch is a Python-based research paper analysis system that combines NLP techniques, clustering algorithms, and scientific text processing. It helps researchers analyze and summarize large collections of scientific literature.

## πŸ“š Table of Contents
1. [Core Features](#core-features)
2. [System Architecture](#system-architecture)
3. [Installation](#installation)
4. [Configuration](#configuration)
5. [Usage Guide](#usage-guide)
6. [Development](#development)
7. [Testing](#testing)
8. [Performance Optimization](#performance-optimization)
9. [Troubleshooting](#troubleshooting)

## Core Features

### πŸ“– Document Processing
- **Multi-Dataset Support**
  - XL-Sum dataset integration
  - ScisummNet dataset processing
  - Custom dataset handling capabilities

### 🧠 Advanced Text Processing
- **Domain-Specific Processing**
  - Scientific text preprocessing
  - Legal document handling
  - Metadata extraction
  - URL and special character normalization

### πŸ”„ Data Pipeline
- **Robust Data Loading**
  - Batch processing support
  - Progress tracking
  - Automatic validation
  - Performance optimization

### 🎯 Clustering & Analysis
- **Dynamic Clustering** (see the sketch below)
  - HDBSCAN implementation
  - Silhouette score calculation
  - Cluster quality metrics
  - Adaptive cluster size
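
A minimal illustrative sketch of this clustering approach, using the open-source `hdbscan` and `scikit-learn` packages; the embedding matrix below is placeholder data, not SynSearch's internal pipeline:

```python
import hdbscan
import numpy as np
from sklearn.metrics import silhouette_score

# Placeholder embeddings; in practice these come from the embedding generator
embeddings = np.random.rand(100, 768)

# Cluster with HDBSCAN (settings mirror the configuration example below)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric="euclidean")
labels = clusterer.fit_predict(embeddings)

# Evaluate cluster quality with the silhouette score, ignoring noise points (-1)
mask = labels != -1
if mask.any() and len(set(labels[mask])) > 1:
    print("Silhouette score:", silhouette_score(embeddings[mask], labels[mask]))
```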

### πŸ“Š Summarization
- **Hybrid Summarization System**
  - Multiple summarization styles:
    - Technical summaries
    - Concise overviews
    - Detailed analyses
  - Batch processing support
  - GPU acceleration

## System Architecture

### Directory Structure
```
synsearch/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ api/             # API integrations
β”‚   β”œβ”€β”€ preprocessing/   # Text preprocessing
β”‚   β”œβ”€β”€ clustering/      # Clustering algorithms
β”‚   β”œβ”€β”€ summarization/   # Summary generation
β”‚   β”œβ”€β”€ utils/           # Utility functions
β”‚   └── visualization/   # Visualization tools
β”œβ”€β”€ tests/               # Test suite
β”œβ”€β”€ config/              # Configuration files
β”œβ”€β”€ data/                # Dataset storage
β”œβ”€β”€ logs/                # Log files
β”œβ”€β”€ cache/               # Cache storage
└── outputs/             # Generated outputs
```

### Key Components

#### 1. Data Management
- `DataLoader`: Handles dataset loading and validation
- `DataPreparator`: Prepares and preprocesses text data
- `DataValidator`: Ensures data quality and format

#### 2. Text Processing
- `TextPreprocessor`: Handles text cleaning and normalization
- `DomainAgnosticPreprocessor`: Generic text preprocessing
- `EnhancedDataLoader`: Optimized data loading

#### 3. Analysis
- `ClusterManager`: Manages document clustering
- `EnhancedEmbeddingGenerator`: Generates text embeddings
- `HybridSummarizer`: Multi-style text summarization

## Installation

### Prerequisites
- Python 3.8 or higher
- CUDA-compatible GPU (optional; see the quick check below)
- 8GB RAM minimum (16GB recommended)
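
To confirm the optional GPU is usable, a quick check assuming PyTorch (which the CUDA settings below rely on):

```python
# Optional: verify that PyTorch detects the GPU
import torch

print(torch.cuda.is_available())
```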

### Setup Steps
```bash
# Clone repository
git clone https://github.com/stochastic-sisyphus/synsearch.git
cd synsearch

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download required datasets
make download-data

# Initialize system
python -m src.initialization
```

## Configuration

### Basic Configuration (config/config.yaml)
```yaml
data:
  input_path: "data/raw"
  output_path: "data/processed"
  scisummnet_path: "data/scisummnet"
  batch_size: 32

preprocessing:
  min_length: 100
  max_length: 1000
  validation:
    min_words: 50

embedding:
  model_name: "bert-base-uncased"
  dimension: 768
  max_seq_length: 512
  batch_size: 32
  device: "cuda"

clustering:
  algorithm: "hdbscan"
  min_cluster_size: 5
  min_samples: 3
  metric: "euclidean"

summarization:
  model_name: "t5-base"
  max_length: 150
  min_length: 50
  batch_size: 16
```
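
As a quick sanity check that the file parses as intended, a minimal sketch using PyYAML; the path and keys mirror the example above:

```python
from pathlib import Path

import yaml  # PyYAML

# Load the configuration shown above
config = yaml.safe_load(Path("config/config.yaml").read_text())

print(config["embedding"]["model_name"])         # "bert-base-uncased"
print(config["clustering"]["min_cluster_size"])  # 5
```
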
### Advanced Settings
- Performance optimization
- Cache management
- Logging configuration
- Visualization options

## Usage Guide

### Basic Usage
```python
from src.main import main

# Run complete pipeline
main()
```

### Custom Pipeline
```python
from src.api.arxiv_api import ArxivAPI
from src.preprocessing.domain_agnostic_preprocessor import DomainAgnosticPreprocessor
from src.embedding_generator import EnhancedEmbeddingGenerator
from src.clustering.cluster_manager import ClusterManager

# Initialize components
api = ArxivAPI()
preprocessor = DomainAgnosticPreprocessor()
embedding_generator = EnhancedEmbeddingGenerator(model_name="bert-base-uncased")
cluster_manager = ClusterManager(config)  # config as loaded in the Configuration section

# Fetch and preprocess papers
papers = api.search("quantum computing", max_results=50)
processed_texts = preprocessor.preprocess_texts([p['text'] for p in papers])

# Generate embeddings and cluster the papers
embeddings = embedding_generator.generate_embeddings(processed_texts)
clusters, metrics = cluster_manager.perform_clustering(embeddings)
```
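
To carry the pipeline through to summaries, a hedged sketch follows. `HybridSummarizer` is listed among the key components above, but the import path, constructor arguments, `summarize` signature, and the structure of `clusters` here are illustrative assumptions, not the confirmed API:

```python
from src.summarization.hybrid_summarizer import HybridSummarizer  # import path assumed

# Assumes `clusters` maps cluster IDs to lists of document indices
summarizer = HybridSummarizer(model_name="t5-base")  # constructor args assumed
for cluster_id, doc_indices in clusters.items():
    cluster_texts = [processed_texts[i] for i in doc_indices]
    # Style names mirror the feature list: technical, concise, detailed
    summary = summarizer.summarize(cluster_texts, style="concise")  # signature assumed
    print(f"Cluster {cluster_id}: {summary}")
```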

## Development

### Environment Setup
- Use a Python 3.8+ virtual environment
- Install development dependencies: `pip install -r requirements-dev.txt`
- Set up pre-commit hooks: `pre-commit install`

### Code Style
- Follow PEP 8 guidelines
- Use type hints
- Document using the Google docstring format (example below)
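
A short example of the expected style (illustrative only; the function is hypothetical):

```python
from typing import List

def preprocess_texts(texts: List[str]) -> List[str]:
    """Clean and normalize a batch of documents.

    Args:
        texts: Raw input documents.

    Returns:
        The cleaned documents, one per input.
    """
    return [text.strip().lower() for text in texts]
```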

## πŸ“ Project Structure
### Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests
4. Submit a pull request

## Testing

### Running Tests
```bash
# Run all tests
pytest tests/

# Run specific test category
pytest tests/test_preprocessor.py
pytest tests/test_clustering.py
```

## πŸ“ License
### Test Coverage
- Unit tests for all components
- Integration tests for pipelines
- Performance benchmarks

## Performance Optimization

### Automatic Optimization
- Batch size optimization
- Worker count adjustment
- GPU utilization
- Memory management (heuristic sketch below)
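
The snippet below illustrates the kind of heuristic involved; it is a sketch assuming PyTorch, not SynSearch's actual optimizer:

```python
import torch

def pick_batch_size(base: int = 32) -> int:
    """Heuristic sketch: scale batch size with available GPU memory."""
    if not torch.cuda.is_available():
        return base
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    # Rough rule of thumb: double the batch size per ~8 GB of GPU memory
    return base * max(1, int(total_gb // 8))

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = pick_batch_size()
```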

### Caching System
- Embedding cache (sketch below)
- Dataset cache
- Results cache
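
A minimal sketch of how an embedding cache of this kind can work, assuming content-hash keys and `.npy` files under `cache/`; it illustrates the idea rather than the exact implementation:

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("cache/embeddings")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_embedding(text: str, embed_fn) -> np.ndarray:
    """Return a cached embedding if present, else compute and store it."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vector = embed_fn(text)  # e.g. a call into EnhancedEmbeddingGenerator
    np.save(path, vector)
    return vector
```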

## Troubleshooting

### Common Issues
1. Memory errors
   - Reduce batch size
   - Enable disk caching
2. GPU errors
   - Check CUDA installation
   - Reduce model size
3. Dataset loading issues
   - Verify paths
   - Check file permissions

### Logging
- Logs are stored in `logs/synsearch.log` (setup sketch below)
- Debug level logging available
- Performance metrics tracking
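
A minimal sketch of wiring Python's standard `logging` module to that file; the format string is an assumption:

```python
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/synsearch.log",
    level=logging.DEBUG,  # debug-level logging, as noted above
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

logging.getLogger("synsearch").info("Pipeline started")
```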

## License
[License information pending]

## Contributors
- @stochastic-sisyphus

## Contact
[Contact information pending]
