Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

yijunx/omniparse

OmniParse

Important

OmniParse is a comprehensive parsing tool designed to convert any unstructured document, media file, or website into structured markdown. Whether you're dealing with documents, tables, images, videos, audio files, or web pages, OmniParse ensures your data is parsed and cleaned to a high standard before it is passed to any downstream LLM use case, such as advanced RAG.

Features

✅ Supports 15+ file types
✅ Convert Documents, Multimedia, Web pages to high-quality structured markdown
✅ Table Extraction, Image Extraction/Captioning, Audio/Video Transcription, Web page Crawling
✅ Easily Deployable using Docker and Skypilot
✅ CPU/GPU compatible
✅ Batch processing for handling multiple files at once
✅ Comprehensive logging and error handling for robust performance

Supported Types

| Type | Supported Extensions |
|---|---|
| Plaintext | .eml, .html, .md, .msg, .rst, .rtf, .txt, .xml |
| Documents | .doc, .docx, .epub, .odt, .pdf, .ppt, .pptx |
| Table | .csv, .xlsx |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .heic |
| Video | .mp4, .mkv, .avi, .mov |
| Audio | .mp3, .wav, .aac |
| Web | dynamic webpages, http://.com |
| Crawl | dynamic webpages, http://.com |
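
As a rough illustration of how the file-based rows of the table could be used before sending anything to the server, the sketch below maps a file extension to its type name. The `SUPPORTED_TYPES` dictionary and the `classify` helper are hypothetical names introduced here for illustration; they are not part of OmniParse, and the Web and Crawl rows are left out because they are not extension based.

```python
from pathlib import Path

# Extension groups copied from the Supported Types table.
# SUPPORTED_TYPES and classify() are illustrative helpers, not OmniParse APIs.
SUPPORTED_TYPES = {
    "Plaintext": {".eml", ".html", ".md", ".msg", ".rst", ".rtf", ".txt", ".xml"},
    "Documents": {".doc", ".docx", ".epub", ".odt", ".pdf", ".ppt", ".pptx"},
    "Table": {".csv", ".xlsx"},
    "Images": {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic"},
    "Video": {".mp4", ".mkv", ".avi", ".mov"},
    "Audio": {".mp3", ".wav", ".aac"},
}


def classify(path):
    """Return the table's type name for a file path, or None if unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for type_name, extensions in SUPPORTED_TYPES.items():
        if suffix in extensions:
            return type_name
    return None
```

A call such as `classify("slides.pptx")` returns the type name for that extension, and an unknown extension returns `None`, which makes it easy to skip unsupported files up front.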

Installation

To install OmniParse, clone the repository:

git clone https://github.com/adithya-s-k/omniparse
cd omniparse

Create a Virtual Environment:

conda create -n omniparse-venv python=3.10
conda activate omniparse-venv

Install Dependencies:

poetry install
# or
pip install -e .

Usage

Run the Server:

python server.py

Install the client:

pip install omniparse_client

Example usage:

from omniparse_client import OmniParse

# Initialize the parser
parser = OmniParse(
    base_url="http://localhost:8000",
    api_key="op-...",  # get the api key from dev.omniparse.com
    verbose=True,
    language="en",
)

# Parse a document
document = parser.load_data('path/to/document.pdf')

# Convert to markdown
parser.save_to_markdown(document)
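
The feature list mentions batch processing, but the example above handles a single file. The sketch below shows one way to loop over a folder on the client side. It assumes only the `load_data` method shown in the example; `parse_directory` and its `extensions` parameter are hypothetical names introduced here and are not part of `omniparse_client`.

```python
from pathlib import Path


def parse_directory(parser, directory, extensions=(".pdf", ".docx", ".pptx")):
    """Parse every matching file under `directory` with `parser.load_data`.

    `parser` is any object exposing load_data(path), such as the OmniParse
    client from the example. Returns (results, errors): `results` maps each
    file path to its parsed document, `errors` maps each failed path to the
    error message.
    """
    results, errors = {}, {}
    for path in sorted(Path(directory).rglob("*")):
        # Skip directories and files whose extension was not requested.
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        try:
            results[str(path)] = parser.load_data(str(path))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[str(path)] = str(exc)
    return results, errors
```

With the parser from the example, `results, errors = parse_directory(parser, "path/to/folder")` collects the parsed documents and leaves failed files in `errors` for a later retry.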

License

OmniParse is licensed under the Apache License. See LICENSE for more information.

Acknowledgement

Surya-OCR, Texify: Big thanks to VikParuchuri for creating awesome open-source OCR models, which have been used extensively in this project.

Contact

For any inquiries, please contact us at [email protected]
