PDF intelligence platform combining IBM Docling for document processing, LlamaIndex for data structuring, and Streamlit for a powerful UI. Extract, analyze, and interact with documents using AI capabilities.

lesteroliver911/docling-pdf-processor

Docling PDF Processor w/ Streamlit

A simple UI wrapper around Docling for document processing. I built this to make document analysis more accessible and thought others might find it useful.

Inspired by Docling and its integration with LlamaIndex.

What This Does

  • Processes PDFs using Docling's document analysis
  • Extracts text and tables, and performs OCR
  • Presents results in a clean Streamlit interface
  • Handles multi-page documents and complex tables
  • Makes document processing accessible to non-technical users
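
Since pandas appears in the requirements, extracted tables can plausibly be handed over as DataFrames. The sketch below assumes `document` is a converted Docling document exposing a `tables` list whose items offer `export_to_dataframe()`; the helper names `tables_to_frames` and `table_shapes` are hypothetical and not taken from this repository.

```python
def tables_to_frames(document):
    """Return one DataFrame per table found in a converted Docling document.

    Assumption: `document.tables` is iterable and each table provides
    `export_to_dataframe()`.
    """
    return [table.export_to_dataframe() for table in document.tables]


def table_shapes(frames):
    """Summarise each table as (rows, columns) for a quick overview."""
    return [frame.shape for frame in frames]
```

Complex layouts may still yield merged or split cells, so it is worth inspecting each DataFrame before relying on it.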

Setup

git clone https://github.com/lesteroliver911/docling-pdf-processor.git
cd docling-pdf-processor
pip install -r requirements.txt
streamlit run main.py

How It Works

The app combines three frameworks:

  • Docling: Document conversion and analysis, including OCR and table structure recognition
  • LlamaIndex: Framework for structuring and indexing document data
  • Streamlit: Simple web interface

Key functions:

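from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter

# Setting up the document processor
def initialize_converter():
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # enable OCR
    pipeline_options.do_table_structure = True  # enable table structure recognition
    return DocumentConverter(...)  # arguments elided here; see main.py

# Processing PDFs
def process_pdf(uploaded_file, doc_converter):
    """Handle conversion and extraction; return markdown and multimodal content."""
    ...  # body elided here; see main.py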
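
The snippet above leaves out the `DocumentConverter` arguments and the body of `process_pdf`. As a rough sketch of how those pieces might fit together (not necessarily what `main.py` does), the code below passes the pipeline options through `format_options`, writes the uploaded bytes to a temporary file, and exports the result as Markdown. The names `build_converter`, `save_upload`, and `pdf_to_markdown` are hypothetical.

```python
import tempfile
from pathlib import Path


def build_converter(do_ocr=True, do_table_structure=True):
    """Create a Docling converter with OCR and table structure enabled."""
    # Imported lazily so the helpers below work without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = do_ocr
    options.do_table_structure = do_table_structure
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def save_upload(data: bytes, suffix: str = ".pdf") -> Path:
    """Write uploaded bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
    return Path(tmp.name)


def pdf_to_markdown(data: bytes, converter) -> str:
    """Convert PDF bytes to Markdown, removing the temporary file afterwards."""
    path = save_upload(data)
    try:
        result = converter.convert(path)
        return result.document.export_to_markdown()
    finally:
        path.unlink(missing_ok=True)
```

With a Streamlit upload, the bytes would typically come from `uploaded_file.getvalue()`, for example `pdf_to_markdown(uploaded_file.getvalue(), build_converter())`.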

Configuration

You can adjust a few settings in the code:

  • OMP_NUM_THREADS: Number of CPU threads (default: 4)
  • IMAGE_RESOLUTION_SCALE: Scale factor for image resolution; higher values give sharper images (default: 2.0)
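
A minimal sketch of how these two settings are commonly wired up, assuming the thread count is set through the environment before Docling is imported and the scale is applied through the pipeline options' `images_scale` field. The helper `apply_image_scale` is hypothetical, and the real code may organise this differently.

```python
import os

# Assumption: set the thread count before heavy numeric libraries are imported.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", "4")

IMAGE_RESOLUTION_SCALE = 2.0  # default stated in this README


def apply_image_scale(pipeline_options, scale=IMAGE_RESOLUTION_SCALE):
    """Request page images rendered at `scale` times the base resolution."""
    pipeline_options.images_scale = scale
    pipeline_options.generate_page_images = True
    return pipeline_options
```

Higher scale values increase memory use and processing time, so raising the value is a trade-off rather than a free improvement.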

Requirements

docling
llama-index
streamlit
pandas
python-dotenv
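
To confirm the install worked before launching the app, a small check such as the following can help. It is only a sketch: the mapping from package names to import names is an assumption based on how these packages are usually imported, and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Package name (as in requirements.txt) -> top-level import name.
REQUIRED_MODULES = {
    "docling": "docling",
    "llama-index": "llama_index",
    "streamlit": "streamlit",
    "pandas": "pandas",
    "python-dotenv": "dotenv",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All requirements importable.")
```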

Using the App

  1. Upload a PDF
  2. Check out the three tabs:
    • AI Preview: Quick look at the content
    • Extracted Content: Full text and structure
    • Document Analysis: Page-by-page breakdown
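
The upload-then-tabs flow described above could be laid out roughly as follows. This is an illustrative sketch rather than the app's actual code: `render_app`, `preview`, and the `convert` callback are hypothetical, and the third tab shows only a placeholder statistic instead of a real page-by-page breakdown.

```python
TAB_NAMES = ["AI Preview", "Extracted Content", "Document Analysis"]


def preview(markdown: str, max_chars: int = 500) -> str:
    """Shorten the extracted Markdown for the quick-look tab."""
    if len(markdown) <= max_chars:
        return markdown
    return markdown[:max_chars].rstrip() + "..."


def render_app(convert):
    """Render the uploader and three tabs; `convert` maps PDF bytes to Markdown."""
    import streamlit as st  # imported lazily so `preview` works without Streamlit

    uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
    if uploaded is None:
        return
    markdown = convert(uploaded.getvalue())
    preview_tab, content_tab, analysis_tab = st.tabs(TAB_NAMES)
    with preview_tab:
        st.markdown(preview(markdown))
    with content_tab:
        st.markdown(markdown)
    with analysis_tab:
        # Placeholder only; a real breakdown would iterate over pages.
        st.write(f"{len(markdown.splitlines())} lines of extracted Markdown")
```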

Notes

  • Works best with clearly formatted PDFs
  • Table extraction might need tweaking for complex layouts
  • OCR can be slow on large documents
  • Docling provides robust document processing - check the Docling documentation for more features
  • LlamaIndex integration adds document structuring capabilities - see the LlamaIndex Docling reader docs
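
For the LlamaIndex side, a minimal sketch of loading a PDF through the Docling reader might look like this. It assumes the separate `llama-index-readers-docling` package is installed (it is not listed in the requirements above), and `load_pdf_as_documents` is a hypothetical helper.

```python
def load_pdf_as_documents(pdf_path, reader=None):
    """Load a PDF into LlamaIndex documents using a Docling-backed reader."""
    if reader is None:
        # Assumption: provided by the llama-index-readers-docling package.
        from llama_index.readers.docling import DoclingReader

        reader = DoclingReader()
    return reader.load_data(pdf_path)
```

Accepting an optional `reader` keeps the helper easy to test and lets callers pass a reader configured with their own options.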

Feel free to use this code, modify it, or suggest improvements. You can find me on LinkedIn if you want to discuss Python, AI, or document processing.
