Welcome to the ML File Classifier project! This application is designed to classify files based on their content rather than just their filenames. It supports various file types and uses both rule-based and machine learning classifiers to enhance accuracy. The classifier is structured to handle large volumes of files efficiently, making it scalable and suitable for production environments.
- Content-Based Classification: Extracts text from files and classifies them based on the actual content.
- Support for Multiple File Types: Handles PDFs, images (JPEG, PNG), Word documents, Excel files, plain text, and RTF files.
- Machine Learning Classifier: Incorporates a trained ML model alongside rule-based classification for improved accuracy.
- Scalable Architecture: Modular codebase designed for easy maintenance and scalability.
- Synthetic Data Generation: Generates synthetic data for training the ML model, facilitating expansion to new industries.
- Comprehensive Logging and Debugging: Detailed logging for easier troubleshooting and monitoring.
- Rate Limiting: Implements rate limiting to prevent abuse and ensure fair usage.
- Extensible Design: Easy to add support for new file types and document categories.
## Table of Contents

- Project Structure
- Getting Started
  - Prerequisites
  - Installation
  - Environment Variables
- Usage
  - Running the Application
  - API Endpoints
  - Testing the Classifier
  - Training the ML Model
- Project Components
  - Application Entry Point (`app.py`)
  - Classifiers
    - Base Classifier (`base_classifier.py`)
    - Rule-Based Classifier (`rule_based_classifier.py`)
    - Machine Learning Classifier (`ml_classifier.py`)
  - Text Extraction (`text_extractor.py`)
  - Synthetic Data Generation (`synthetic_data_generator.py`)
  - Model Training (`train_model.py`)
  - Utilities
    - Text Preprocessing (`text_preprocessing.py`)
    - File Utilities (`file_utils.py`)
- Running Tests
- Future Improvements
- License
## Project Structure

```
├── Dockerfile
├── README.md
├── files
│   ├── bank_statement_1.pdf
│   ├── bank_statement_2.pdf
│   ├── bank_statement_3.pdf
│   ├── drivers_licence_2.jpg
│   ├── drivers_license_1.jpg
│   ├── drivers_license_3.jpg
│   ├── invoice_1.pdf
│   ├── invoice_2.pdf
│   └── invoice_3.pdf
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── app.py
│   ├── classifiers
│   │   ├── __init__.py
│   │   ├── base_classifier.py
│   │   ├── ml_classifier.py
│   │   └── rule_based_classifier.py
│   ├── data_generation
│   │   ├── __init__.py
│   │   └── synthetic_data_generator.py
│   ├── extraction
│   │   ├── __init__.py
│   │   └── text_extractor.py
│   ├── training
│   │   ├── templates
│   │   │   ├── bank_statement_templates.json
│   │   │   ├── drivers_license_templates.json
│   │   │   └── invoice_templates.json
│   │   └── train_model.py
│   └── utils
│       ├── __init__.py
│       ├── file_utils.py
│       └── text_preprocessing.py
└── tests
    ├── __init__.py
    ├── classifiers
    │   ├── test_ml_classifier.py
    │   └── test_rule_based_classifier.py
    ├── conftest.py
    ├── extraction
    │   └── test_text_extractor.py
    ├── test_app.py
    └── utils
        └── test_text_preprocessing.py
```
## Getting Started

### Prerequisites

- Python 3.9+
- Tesseract OCR: Required for text extraction from images.
- System Dependencies: Libraries for processing various file types.

Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev \
    pkg-config poppler-utils libmagic1 \
    libpoppler-cpp-dev python3-dev
```

macOS (using Homebrew):

```bash
brew install tesseract libmagic poppler
```
### Installation

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd join-the-siege
  ```

- Set up a virtual environment and install dependencies:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
### Environment Variables

Create a `.env` file in the project root to configure environment variables:

```bash
FLASK_APP=src.app
FLASK_ENV=development
CLASSIFIER_TYPE=ml_based  # Options: 'rule_based', 'ml_based'
LOG_LEVEL=INFO
```
## Usage

### Running the Application

- Activate the virtual environment:

  ```bash
  source venv/bin/activate
  ```

- Run the Flask app:

  ```bash
  flask run
  ```

The application will start on http://127.0.0.1:5000/.
### API Endpoints

- `GET /health`

  Health check endpoint.

  Response:

  ```json
  { "status": "healthy" }
  ```

- `POST /classify_file`

  Upload a file to classify.

  Parameters:

  - `file`: The file to classify.

  Responses:

  - `200 OK`: Returns the classification result and processing time.

    ```json
    { "file_class": "invoice", "processing_time": 0.234 }
    ```

  - `400 Bad Request`: The file is missing or invalid.
  - `429 Too Many Requests`: The rate limit has been exceeded.
  - `500 Internal Server Error`: An unexpected error occurred.
### Testing the Classifier

Submit a file for classification:

```bash
curl -X POST -F 'file=@path_to_file' http://127.0.0.1:5000/classify_file
```

Check health status:

```bash
curl http://127.0.0.1:5000/health
```
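The same requests can be made from Python. A minimal sketch, assuming the `requests` package is installed and using one of the sample files bundled in `files/`:

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# Classify one of the bundled sample files.
with open("files/invoice_1.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/classify_file", files={"file": f})

print(response.status_code)   # 200 on success
print(response.json())        # e.g. {"file_class": "invoice", "processing_time": 0.234}
```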
### Training the ML Model

Before running the application with the ML classifier, you need to train the machine learning model.

- Generate synthetic data and train the model:

  ```bash
  python src/training/train_model.py
  ```

  This script generates synthetic data using predefined templates and trains a BERT-based classifier.

- Model and Tokenizer Files: The trained model and tokenizer are saved in the `src/models/document_classifier` directory.
## Project Components

### Application Entry Point (`app.py`)

Located at `src/app.py`, this file initializes the Flask application, configures logging, sets up rate limiting, and defines the API endpoints.

- Rate Limiting: Prevents abuse by limiting the number of requests per user.
- Logging: Provides detailed logs for debugging and monitoring.
- Health Check Endpoint: Allows monitoring tools to verify the application's health.
- TODOs for Future Improvements: Includes placeholders for enhancements like authentication, async processing, and caching.

Implementation notes:

- The application uses the Flask-Limiter library for rate limiting.
- Logging is configured based on environment variables for flexibility.
- The classifier is initialized based on the `CLASSIFIER_TYPE` environment variable, allowing easy switching between rule-based and ML classifiers (see the sketch below).
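A minimal sketch of how this switch might look inside `app.py`; the helper name `create_classifier` and the exact class names are assumptions, not the project's confirmed API:

```python
import logging
import os

from src.classifiers.ml_classifier import MLClassifier
from src.classifiers.rule_based_classifier import RuleBasedClassifier

# Logging level comes from the environment, defaulting to INFO.
logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logger = logging.getLogger(__name__)

def create_classifier():
    """Select a classifier implementation based on CLASSIFIER_TYPE."""
    classifier_type = os.getenv("CLASSIFIER_TYPE", "rule_based")
    logger.info("Initializing %s classifier", classifier_type)
    if classifier_type == "ml_based":
        return MLClassifier()
    return RuleBasedClassifier()

classifier = create_classifier()
```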
### Classifiers

#### Base Classifier (`base_classifier.py`)

An abstract base class that defines the interface for classifiers.

- Abstract Method: Defines the `classify` method that must be implemented by subclasses.
- Standardization: Ensures all classifiers adhere to the same interface.
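A minimal sketch of such an interface, assuming `classify` takes extracted text and returns a label (the real signature may also return confidence scores or debug data):

```python
from abc import ABC, abstractmethod

class BaseClassifier(ABC):
    """Common interface that every document classifier must implement."""

    @abstractmethod
    def classify(self, text: str) -> str:
        """Return the predicted document class for the given text."""
        raise NotImplementedError
```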
#### Rule-Based Classifier (`rule_based_classifier.py`)

Implements a simple rule-based approach using keyword matching.

- Keyword Matching: Searches for specific keywords in the extracted text to determine the document type.
- Logging: Provides debug information about the classification process, aiding in troubleshooting.
- Quick Initialization: Does not require extensive setup or training, making it lightweight.
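A sketch of the keyword-matching idea; the keyword lists and scoring are illustrative, not the project's actual rule set:

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative keyword table, not the project's real rules.
KEYWORDS = {
    "bank_statement": ["account number", "statement period", "opening balance"],
    "invoice": ["invoice number", "amount due", "bill to"],
    "drivers_license": ["driver's license", "date of birth", "licence number"],
}

class RuleBasedClassifier:
    def classify(self, text: str) -> str:
        text_lower = text.lower()
        # Count keyword hits per class and pick the best-scoring label.
        scores = {
            label: sum(keyword in text_lower for keyword in keywords)
            for label, keywords in KEYWORDS.items()
        }
        logger.debug("Keyword scores: %s", scores)
        best_label, best_score = max(scores.items(), key=lambda item: item[1])
        return best_label if best_score > 0 else "unknown"
```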
#### Machine Learning Classifier (`ml_classifier.py`)

Uses a pre-trained Transformer model (BERT) for classification.

- Transformer Model: Utilizes a fine-tuned BERT model for better accuracy in classification tasks.
- Confidence Scores: Provides confidence levels for predictions, allowing for more informed decisions.
- Debug Information: Includes detailed logging and debug data for each classification, helpful for monitoring and improving the model.
- GPU Support: Automatically uses GPU if available, enhancing performance for large-scale deployments.

Implementation notes:

- The classifier loads the model and tokenizer from the `models/document_classifier` directory.
- Input texts are preprocessed and tokenized before being fed into the model.
- Predictions include top labels with their corresponding confidence scores.
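A minimal inference sketch with the `transformers` and `torch` libraries, assuming the fine-tuned model lives in `src/models/document_classifier` (the path follows the training section above; the real class adds preprocessing and debug output):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "src/models/document_classifier"

# Load the fine-tuned model and tokenizer, using the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).to(device)
model.eval()

def predict(text: str) -> dict:
    """Return the top label and its confidence score for a piece of extracted text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    confidence, label_id = torch.max(probs, dim=-1)
    return {
        "file_class": model.config.id2label[label_id.item()],
        "confidence": confidence.item(),
    }
```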
### Text Extraction (`text_extractor.py`)

Located at `src/extraction/text_extractor.py`, this module handles extracting text from various file types.

- PDF: Extracts text from PDF documents using `pypdf`.
- Images (JPEG, PNG): Uses Tesseract OCR to extract text from images.
- Word Documents: Extracts text from `.docx` files using `docx2txt`.
- Excel Spreadsheets: Reads data from `.xlsx` files using `openpyxl`.
- RTF Files: Converts RTF files to plain text using `striprtf`.
- Plain Text and CSV Files: Reads and decodes text files directly.
- MIME Type Detection: Determines the file type based on MIME type for accurate processing.
- Error Handling: Includes robust error handling and logging for troubleshooting extraction issues.
- Logging: Provides detailed logs at each step of the extraction process.
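A condensed sketch of MIME-type dispatch for a few of the supported types; the function name and exact signatures are illustrative, and the real module covers more formats and adds error handling:

```python
import io

import docx2txt
import pytesseract
from PIL import Image
from pypdf import PdfReader

def extract_text(data: bytes, mime_type: str) -> str:
    """Route the raw bytes to the right extractor based on the detected MIME type."""
    if mime_type == "application/pdf":
        reader = PdfReader(io.BytesIO(data))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if mime_type in ("image/jpeg", "image/png"):
        return pytesseract.image_to_string(Image.open(io.BytesIO(data)))
    if mime_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        return docx2txt.process(io.BytesIO(data))
    if mime_type in ("text/plain", "text/csv"):
        return data.decode("utf-8", errors="ignore")
    raise ValueError(f"Unsupported MIME type: {mime_type}")
```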
### Synthetic Data Generation (`synthetic_data_generator.py`)

Generates synthetic documents for training the ML model.

- Randomized Data: Uses the `faker` library to generate realistic and diverse data points.
- Templates: Loads templates from JSON files to structure the synthetic data, ensuring consistency.
- Multiple Document Types: Generates synthetic data for bank statements, invoices, and driver's licenses.
- Metadata Inclusion: Stores metadata alongside text content for potential use in advanced training scenarios.

Implementation notes:

- The module can generate a customizable number of samples.
- Synthetic data helps in training the ML model when real data is scarce or sensitive.
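A sketch of the template-plus-`faker` approach for one document type; the template structure and placeholder names are assumptions, while the real templates live under `src/training/templates`:

```python
import json
import random

from faker import Faker

fake = Faker()

def generate_invoice_samples(template_path: str, n_samples: int = 100) -> list[dict]:
    """Fill invoice templates with randomized values and return labeled samples."""
    with open(template_path) as f:
        # Assumed format: a list of strings with named placeholders.
        templates = json.load(f)

    samples = []
    for _ in range(n_samples):
        template = random.choice(templates)
        text = template.format(
            invoice_number=fake.bothify("INV-#####"),
            name=fake.name(),
            amount=f"${fake.pyfloat(min_value=10, max_value=5000, right_digits=2):.2f}",
        )
        samples.append({"text": text, "label": "invoice", "metadata": {"source": "synthetic"}})
    return samples
```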
### Model Training (`train_model.py`)

Trains a BERT-based model using the synthetic dataset.

- Data Preparation: Splits data into training and validation sets, ensuring a fair evaluation.
- Model Configuration: Sets up training arguments optimized for document classification tasks.
- Metrics Calculation: Computes accuracy, precision, recall, and F1 score for comprehensive evaluation.
- Early Stopping: Uses callbacks to prevent overfitting by stopping training when no improvement is observed.
- Model Saving: Saves the trained model, tokenizer, and label mappings for future use.

Implementation notes:

- The script increases the dataset size and adjusts training parameters for better performance.
- Utilizes the `transformers` library for model training and management.
- The trained model is stored in a designated directory for consistency.
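A condensed training sketch using the `transformers` `Trainer` API. The hyperparameters and the `samples` variable (output of the synthetic data step) are illustrative, and argument names such as `evaluation_strategy` vary slightly between `transformers` releases:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"
OUTPUT_DIR = "src/models/document_classifier"
LABELS = ["bank_statement", "drivers_license", "invoice"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {"accuracy": accuracy_score(labels, preds), "precision": precision, "recall": recall, "f1": f1}

# `samples` is assumed to come from the synthetic data generator: [{"text": ..., "label": ...}, ...]
label2id = {label: i for i, label in enumerate(LABELS)}
dataset = Dataset.from_list([{"text": s["text"], "label": label2id[s["label"]]} for s in samples])
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)
splits = dataset.train_test_split(test_size=0.2)

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
```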
### Utilities

#### Text Preprocessing (`text_preprocessing.py`)

Provides functions for cleaning and preparing text data.

- Lowercasing: Converts all text to lowercase for uniformity.
- Special Character Removal: Removes URLs, non-alphabetic characters, and extraneous symbols.
- Whitespace Normalization: Cleans up extra spaces to streamline the text.
- Alphabetic Filtering: Extracts alphabetic characters from alphanumeric strings to focus on meaningful content.
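A sketch of the cleaning steps described above; the regular expressions are illustrative, and the real function may differ in order or detail:

```python
import re

def preprocess_text(text: str) -> str:
    """Normalize extracted text before it is passed to a classifier."""
    text = text.lower()                        # lowercasing
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep alphabetic characters only
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return text
```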
#### File Utilities (`file_utils.py`)

Contains utility functions for file handling, including MIME type validation.

- MIME Type Checking: Validates if a file is of an allowed type using the `python-magic` library.
- Supported MIME Types: Defined in `ALLOWED_MIME_TYPES` for easy modification and extension.
- File Rewinding: Ensures the file pointer is reset after reading for consistent processing.

Implementation notes:

- Centralizes file validation logic, making it easier to maintain and update supported file types.
- Enhances security by preventing processing of disallowed or potentially harmful file types.
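A sketch of MIME-type validation with `python-magic`; the `ALLOWED_MIME_TYPES` values shown are an illustrative subset of what the module defines:

```python
import magic

# Illustrative subset; the real set is defined in file_utils.py.
ALLOWED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "text/plain",
}

def is_allowed_file(file_storage) -> bool:
    """Check an uploaded file (e.g. a Flask FileStorage) against the allowed MIME types."""
    header = file_storage.read(2048)  # sniff only the first few kilobytes
    file_storage.seek(0)              # rewind so later reads see the whole file
    mime_type = magic.from_buffer(header, mime=True)
    return mime_type in ALLOWED_MIME_TYPES
```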
## Running Tests

Tests are located in the `tests` directory and cover various components of the application. Run them with:

```bash
pytest
```

- Classifiers: Ensures both the rule-based and ML classifiers function as expected.
- Text Extraction: Verifies that text is correctly extracted from supported file types.
- Utilities: Tests utility functions like text preprocessing and file validation.
- Application Routes: Checks the API endpoints for correct responses and error handling.
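A minimal example of what a route test might look like, assuming `conftest.py` provides a Flask test-client fixture named `client` (the fixture name is an assumption):

```python
def test_health_endpoint(client):
    """The /health endpoint should report a healthy status."""
    response = client.get("/health")
    assert response.status_code == 200
    assert response.get_json() == {"status": "healthy"}
```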
## Future Improvements

- Asynchronous Processing: Implement Celery or another task queue to handle large files asynchronously, or move to FastAPI for native async support.
- Caching: Add caching mechanisms to store results of frequently classified documents, improving performance.
- Authentication: Implement API key authentication and role-based access control for enhanced security.
- Monitoring and Metrics: Integrate tools like Prometheus and Grafana for monitoring performance and resource usage.
- Error Handling Enhancements: Add more granular error handling, retry mechanisms, and user-friendly error messages.
- Scalability: Deploy the application using a production-grade WSGI server like Gunicorn, and consider containerization with Docker plus orchestration with Kubernetes.