mAIl is a Python-based tool designed to classify email messages (in .eml and .msg formats) into categories like Spam, Phishing, Malicious, or Safe. It uses large language models (LLMs) for classification and provides a detailed output that includes content keywords, reasoning, and a certainty level (0-100%).
- Classifies .eml and .msg email formats.
- Uses Ollama and LangChain to process emails and classify content.
- Outputs classifications, content keywords, reasoning, and a certainty level.
- Supports multiple email files passed as command-line arguments.
- Prints the results as structured JSON.
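As a rough illustration of the output fields listed above, one classified email could be modeled as a small validated record. This is only a sketch: the class name `ClassificationResult` and the `CATEGORIES` tuple are hypothetical and are not taken from the project's source. The field names follow the sample output shown later in this README.

```python
from dataclasses import dataclass, field

# Categories named in this README; the real tool may define them differently.
CATEGORIES = ("Spam", "Phishing", "Malicious", "Safe")


@dataclass
class ClassificationResult:
    """Hypothetical container for the result of classifying one email."""

    file: str
    classification: str
    certainty_level: int  # percentage, 0-100
    tags: list[str] = field(default_factory=list)
    reason: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the ranges described in this README.
        if self.classification not in CATEGORIES:
            raise ValueError(f"unknown category: {self.classification!r}")
        if not 0 <= self.certainty_level <= 100:
            raise ValueError("certainty_level must be between 0 and 100")
```

Validating at construction time is one way to catch malformed LLM replies early; whether mAIl does this is not stated here.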
- Clone the repository:
git clone https://github.com/srozb/mAIl.git
cd mAIl
- Create a virtual environment using pipenv:
pipenv install
- Ensure the Python version is >= 3.9.
- Make sure you have access to an Ollama daemon. You can verify this with:
ollama run <model_name> 'Why is the sky blue?'
You can classify multiple .eml or .msg files by running the following command:
python main.py email1.eml email2.msg email3.eml
The tool will process all the files and print a structured JSON output with classification results, content keywords, and reasoning for each email.
Sample Command:
python main.py ./data/sample_emails/test_email.eml ./data/sample_emails/test_email.msg
Sample Output:
[
  {
    "file": "test_email.eml",
    "email_metadata": {
      "subject": "Invoice for last month",
      "from": "[email protected]",
      "to": "[email protected]"
    },
    "classification": "Safe",
    "certainty_level": 97,
    "tags": ["invoice", "billing"],
    "reason": "The email appears to be a legitimate billing communication."
  },
  {
    "file": "test_email.msg",
    "email_metadata": {
      "subject": "Your account has been compromised",
      "from": "[email protected]",
      "to": "[email protected]"
    },
    "classification": "Phishing",
    "certainty_level": 95,
    "tags": ["account", "compromised", "reset password"],
    "reason": "The email contains language indicative of phishing attempts."
  }
]
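Because the results are plain JSON, they can be post-processed with a few lines of Python. The sketch below assumes the output has the shape of the sample above; the function name `flag_suspicious` and the 80% threshold are illustrative choices, not part of mAIl.

```python
import json


def flag_suspicious(raw_json: str, min_certainty: int = 80) -> list[dict]:
    """Return entries not classified as Safe whose certainty meets the threshold."""
    results = json.loads(raw_json)
    return [
        entry
        for entry in results
        if entry["classification"] != "Safe"
        and entry["certainty_level"] >= min_certainty
    ]


# Trimmed-down copy of the sample output, keeping only the fields used here.
sample = """[
  {"file": "test_email.eml", "classification": "Safe", "certainty_level": 97},
  {"file": "test_email.msg", "classification": "Phishing", "certainty_level": 95}
]"""

for entry in flag_suspicious(sample):
    print(entry["file"], entry["classification"], entry["certainty_level"])
# prints: test_email.msg Phishing 95
```

If the tool writes only JSON to standard output (an assumption), the same function could be applied to the redirected output of `python main.py`.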
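To use a specific model, pass its name with -m:
python main.py -m gemma2:35b email1.eml email2.msg
To use a specific Ollama host, pass its URL with -H:
python main.py -H http://your-ollama-host:11435 email1.eml email2.msg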
You can also set the OLLAMA_HOST environment variable directly before running the script. If no host is provided, the default behavior is used.
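One plausible way to combine these settings is to let the command-line flag win over the environment variable and fall back to a default otherwise. The following is a minimal sketch under that assumption, not mAIl's actual code: the function `resolve_host`, the long option `--host`, and the precedence order are hypothetical, and `DEFAULT_HOST` is Ollama's usual local address, assumed here as the fallback.

```python
import argparse

# Ollama's usual local address; assumed here as the fallback value.
DEFAULT_HOST = "http://localhost:11434"


def resolve_host(argv: list[str], env: dict[str, str]) -> str:
    """Pick the host: -H flag first, then OLLAMA_HOST, then the default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-H", "--host", default=None)  # --host is hypothetical
    parser.add_argument("files", nargs="*")
    args = parser.parse_args(argv)
    return args.host or env.get("OLLAMA_HOST") or DEFAULT_HOST
```

In a real script this would be called as `resolve_host(sys.argv[1:], dict(os.environ))`; passing both in as arguments keeps the function easy to test.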
- src/
  - email_parser.py: Contains logic for parsing .eml and .msg email files.
  - classifier.py: Contains logic for classifying email content using an LLM model.
  - utils.py: Contains utility functions like saving the classification results.
- data/: Example email files to test the tool.
- main.py: Main script to classify emails passed as arguments.
- README.md: Project documentation.
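To give a feel for what the parsing step involves, the sketch below extracts the metadata fields shown in the sample output from a raw .eml message. Note the substitution: it uses Python's standard-library `email` package rather than eml-parser, which is what the project lists as a dependency, and the function name `extract_metadata` is hypothetical.

```python
from email import message_from_string, policy


def extract_metadata(raw: str) -> dict[str, str]:
    """Pull subject, sender, and recipient headers out of a raw RFC 5322 message."""
    msg = message_from_string(raw, policy=policy.default)
    return {
        "subject": str(msg["subject"] or ""),
        "from": str(msg["from"] or ""),
        "to": str(msg["to"] or ""),
    }


# A tiny hand-written message used only for illustration.
raw_email = (
    "From: alice@example.com\r\n"
    "To: bob@example.com\r\n"
    "Subject: Invoice for last month\r\n"
    "\r\n"
    "Please find the invoice attached.\r\n"
)

print(extract_metadata(raw_email))
```

Missing headers come back as empty strings, so downstream code does not need to handle `None`.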
- Python 3.9 or later
- pipenv for managing dependencies
- python-magic for file type detection
- eml-parser for parsing .eml files
- extract-msg for parsing .msg files
- LangChain and Ollama for email classification using large language models (LLMs)
Benchmark scripts and results are located in the benchmark/ directory.
The directory includes scripts to:
- Automatically run benchmarks across multiple models and datasets.
- Parse logs and generate a comprehensive markdown table summarizing the results.
Check the detailed benchmark results in benchmark/benchmark.md. This document provides performance comparisons for different models, including classification accuracy and inference times.
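The kind of summary described above can be sketched in a few lines: compute accuracy from predicted and expected labels, then render one markdown row per model. This is an illustration only; the function names, the column headings, and the input shapes are assumptions, and the actual scripts in benchmark/ may work quite differently.

```python
def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def to_markdown(rows: list[tuple[str, float, float]]) -> str:
    """Render (model, accuracy, mean seconds per email) rows as a markdown table."""
    lines = [
        "| Model | Accuracy | Mean inference time (s) |",
        "| --- | --- | --- |",
    ]
    for model, acc, seconds in rows:
        lines.append(f"| {model} | {acc:.1%} | {seconds:.2f} |")
    return "\n".join(lines)
```

For example, `to_markdown([("placeholder-model", accuracy(preds, labels), mean_seconds)])` would produce a one-row table, where `preds`, `labels`, and `mean_seconds` come from your own benchmark run.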
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository.
- Create a feature branch (git checkout -b feature-branch).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Open a Pull Request.
- LangChain and Ollama for their amazing LLM-powered solutions.
- The Python community for creating the useful libraries used in this project.