This project automatically extracts and analyzes information about AI models from academic papers. It processes both PDF and LaTeX sources, extracts text and images, and uses large language models (Claude and GPT-4) to answer specific questions about the models described in each paper.
- Paper acquisition from sources like arXiv
- Content extraction from PDF and LaTeX files
- Text and image analysis using advanced AI models (Claude and GPT-4)
- Information extraction for various model fields (e.g., parameters, training compute, dataset size)
- Reasoning and calculation based on extracted information
- User interface for validation and results viewing
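As an illustration of the "reasoning and calculation" step, the sketch below derives a training-compute estimate from two extracted fields. It uses the widely cited rule of thumb C ≈ 6 · N · D for dense transformer models (N parameters, D training tokens). This is an assumption for illustration; the project's own calculation may differ. The function name and the dictionary keys are hypothetical.

```python
def estimate_training_flops(num_parameters: float, num_training_tokens: float) -> float:
    """Rough training compute in FLOPs using the C ~= 6 * N * D rule of thumb.

    This approximation applies to dense transformers and is an assumption
    made for this sketch, not necessarily the project's method.
    """
    if num_parameters <= 0 or num_training_tokens <= 0:
        raise ValueError("parameters and tokens must be positive")
    return 6.0 * num_parameters * num_training_tokens


# Hypothetical values, standing in for fields extracted from a paper.
extracted = {"parameters": 1.5e9, "dataset_size_tokens": 3.0e11}

flops = estimate_training_flops(
    extracted["parameters"], extracted["dataset_size_tokens"]
)
print(f"Estimated training compute: {flops:.2e} FLOPs")
```

Keeping the calculation in a small pure function makes it easy to check during validation, since the inputs and the formula are both visible.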
- Clone the repository
- Install the required dependencies:
pip install -r requirements.txt
- Set up your environment variables:
Create a .env file in the root directory and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
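For reference, here is a minimal sketch of how a script could read that .env file using only the standard library. The helper names `load_env_file` and `require_api_key` are hypothetical, and the project itself may load the file differently (for example through a dependency listed in requirements.txt).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file (hypothetical helper).

    Blank lines, '#' comments, and lines without '=' are skipped.
    Variables already set in the environment are not overwritten.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        if key:
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, failing early if it is missing."""
    value = os.environ.get(name, "")
    if not value or value == "your_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `load_env_file()` followed by `require_api_key()` at startup surfaces a missing or placeholder key before any paper is downloaded.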
Run the main script to process a paper:
python main.py
The script will download the paper, extract information, and present a user interface for validation and viewing results.
- src/: Contains the main source code
  - paper_acquisition/: Handles downloading papers
  - content_extraction/: Processes PDF and LaTeX files
  - information_extraction/: Analyzes text and images
  - reasoning/: Performs calculations and reasoning on extracted data
  - user_interface/: Provides GUI for validation and results viewing
- tests/: Contains unit tests
- data/: Stores downloaded papers and extracted data
- config/: Contains configuration files, including questions.yaml
- PaperDownloader: Downloads papers from sources like arXiv
- PDFProcessor and LaTeXProcessor: Extract content from papers
- TextAnalyzer and ImageAnalyzer: Analyze extracted content
- PromptingSystem: Manages interactions with AI models for information extraction
- ReasoningCalculator: Performs final calculations and reasoning
- ValidationInterface and ResultsViewer: Provide user interfaces for interaction
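The components above suggest a linear pipeline: download, extract, analyze, then reason. The sketch below shows one way such stages could be chained. It is not the project's actual API: the `Pipeline` class, its field names, and the stub stages are all hypothetical, and each stage is modeled as a plain callable so the example runs without network access or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Pipeline:
    """Hypothetical wiring of the stages; not the project's real classes."""

    download: Callable[[str], str]             # role of PaperDownloader: paper id -> local path
    extract: Callable[[str], str]              # role of PDFProcessor / LaTeXProcessor: path -> text
    analyze: Callable[[str], Dict[str, str]]   # role of TextAnalyzer + PromptingSystem: text -> raw answers
    reason: Callable[[Dict[str, str]], Dict[str, float]]  # role of ReasoningCalculator

    def run(self, paper_id: str) -> Dict[str, float]:
        path = self.download(paper_id)
        text = self.extract(path)
        answers = self.analyze(text)
        return self.reason(answers)


# Stub stages with made-up data, standing in for the real components.
pipeline = Pipeline(
    download=lambda paper_id: f"data/{paper_id}.pdf",
    extract=lambda path: "The model has 7e9 parameters.",
    analyze=lambda text: {"parameters": "7e9"} if "7e9" in text else {},
    reason=lambda answers: {field: float(value) for field, value in answers.items()},
)

result = pipeline.run("example-paper")
print(result)
```

Passing each stage in as a callable keeps the stages independently testable, which fits the unit tests kept under tests/.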
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.