A scalable pipeline for processing, indexing, and querying multimodal documents.

Ever needed to take 8000 PDFs, 2000 videos, and 500 spreadsheets and feed them to an LLM as a knowledge base? Well, MMORE is here to help you!
To install all dependencies, run:

```bash
pip install -e '.[all]'
```

To install only processor-related dependencies, run:

```bash
pip install -e '.[processor]'
```

To install only RAG-related dependencies, run:

```bash
pip install -e '.[rag]'
```
```python
from mmore.process.processors.pdf_processor import PDFProcessor
from mmore.process.processors.base import ProcessorConfig
from mmore.type import MultimodalSample

pdf_file_paths = ["examples/sample_data/pdf/calendar.pdf"]
out_file = "results/example.jsonl"

pdf_processor_config = ProcessorConfig(custom_config={"output_path": "results"})
pdf_processor = PDFProcessor(config=pdf_processor_config)

# args: file paths, fast mode (True/False), number of workers
result_pdf = pdf_processor.process_batch(pdf_file_paths, True, 1)

MultimodalSample.to_jsonl(out_file, result_pdf)
```
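The call above writes one JSON object per line. As a minimal sketch for loading the results back (assuming standard JSON Lines output; the exact MMORE sample schema may differ):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

out_file = Path("results/example.jsonl")
if out_file.exists():
    samples = load_jsonl(out_file)
    print(f"Loaded {len(samples)} processed samples")
```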
```bash
sudo apt update
sudo apt install -y ffmpeg libsm6 libxext6 chromium-browser libnss3 \
    libgconf-2-4 libxi6 libxrandr2 libxcomposite1 libxcursor1 libxdamage1 \
    libxext6 libxfixes3 libxrender1 libasound2 libatk1.0-0 libgtk-3-0 libreoffice
```
Refer to the uv installation guide for detailed instructions.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/swiss-ai/mmore
cd mmore
uv sync
```

For CPU-only installation, use:

```bash
uv sync --extra cpu
```
Activate the virtual environment before running commands:

```bash
python -m venv .venv
source .venv/bin/activate
```

Alternatively, prepend each command with `uv run`:
```bash
# Run processing
python -m mmore process --config-file examples/process/config.yaml

# Run indexer
python -m mmore index --config-file examples/index/config.yaml

# Run RAG
python -m mmore rag --config-file examples/rag/api/rag_api.yaml
```
Note: For manual installation without Docker, refer to the section below.
Follow the official Docker installation guide.

```bash
docker build . --tag mmore
```

To build for CPU-only platforms (this results in a smaller image):

```bash
docker build --build-arg PLATFORM=cpu -t mmore .
```

Then run the container:

```bash
docker run -it -v ./test_data:/app/test_data mmore
```
Note: The `test_data` folder is mapped to `/app/test_data` inside the container, corresponding to the default path in `examples/process_config.yaml`.
```bash
# Run processing
mmore process --config-file examples/process/config.yaml

# Run indexer
mmore index --config-file examples/index/config.yaml

# Run RAG
mmore rag --config-file examples/rag/api/rag_api.yaml
```
To launch the MMORE pipeline, follow the specialised instructions in the docs.
- **Input Documents**: Upload your multimodal documents (PDFs, videos, spreadsheets, and more) into the pipeline.
- **Process**: Extracts and standardizes text, metadata, and multimedia content from diverse file formats. Easily extensible: you can add your own processors to handle new file types. Supports fast processing for specific types.
- **Index**: Organizes extracted data into a hybrid retrieval-ready vector store, combining dense and sparse indexing through Milvus. Your vector DB can also be remotely hosted, in which case you only have to provide a standard API.
- **RAG**: Uses the indexed documents inside a Retrieval-Augmented Generation (RAG) system that provides a LangChain interface. Plug in any LLM with a compatible interface, or add new ones through an easy-to-use interface. Supports API hosting or local inference.
- **Evaluation** (coming soon): An easy way to evaluate the performance of your RAG system using Ragas.
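To illustrate the hybrid-retrieval idea behind the indexing step, dense (embedding similarity) and sparse (keyword) scores for the same candidates can be fused after normalization. This is a generic sketch of score fusion, not MMORE's or Milvus's actual API:

```python
def minmax(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted fusion of normalized dense and sparse retrieval scores."""
    d, s = minmax(dense), minmax(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

# Three candidate documents, each scored by both retrievers
fused = hybrid_scores(dense=[0.9, 0.2, 0.5], sparse=[1.0, 4.0, 2.0])
best = max(range(len(fused)), key=fused.__getitem__)
```

Tuning `alpha` trades off semantic (dense) against exact-keyword (sparse) matching; production systems often use rank-based fusion such as RRF instead.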
See the `/docs` directory for additional details on each module and hands-on tutorials on parts of the pipeline.
| Category | File Types | Supported Device | Fast Mode |
|---|---|---|---|
| Text Documents | DOCX, MD, PPTX, XLSX, TXT, EML | CPU | ✅ |
| PDFs | PDF | GPU/CPU | ✅ |
| Media Files | MP4, MOV, AVI, MKV, MP3, WAV, AAC | GPU/CPU | ✅ |
| Web Content (TBD) | Webpages | GPU/CPU | ✅ |
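The table above maps naturally to extension-based dispatch. A hypothetical sketch of such routing (the real processor classes live under `mmore.process.processors`; the category names here are illustrative):

```python
from pathlib import Path

# Category per extension, mirroring the supported-formats table above
CATEGORIES = {
    ".docx": "text", ".md": "text", ".pptx": "text",
    ".xlsx": "text", ".txt": "text", ".eml": "text",
    ".pdf": "pdf",
    ".mp4": "media", ".mov": "media", ".avi": "media",
    ".mkv": "media", ".mp3": "media", ".wav": "media", ".aac": "media",
}

def categorize(path):
    """Return the processing category for a file, or None if unsupported."""
    return CATEGORIES.get(Path(path).suffix.lower())

print(categorize("examples/sample_data/pdf/calendar.pdf"))  # pdf
```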
We welcome contributions to improve the current state of the pipeline. Feel free to:

- Open an issue to report a bug or request a new feature
- Open a pull request to fix a bug or add a new feature

Ongoing feature work and known bugs are tracked in the [Issues] tab.
Don't hesitate to star the project ⭐ if you find it interesting! (You would be our star.)
This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
This project is part of the OpenMeditron initiative developed in the LiGHT lab at EPFL/Yale/CMU Africa, in collaboration with the SwissAI initiative. Thank you, Scott Mahoney and Mary-Anne Hartley.