Open Contracts (Demo)

The Free and Open Source Document Analytics Platform



What Does it Do?

OpenContracts is an Apache-2.0-licensed enterprise document analytics tool. It provides several key features:

Manage Documents - Manage document collections (Corpuses)
Layout Parser - Automatically extracts layout features from PDFs
Automatic Vector Embeddings - generated for uploaded PDFs and extracted layout blocks
Pluggable microservice analyzer architecture - to let you analyze documents and automatically annotate them
Human Annotation Interface - to manually annotate documents, including multi-page annotations.
LlamaIndex Integration - Use our vector stores (powered by pgvector) and any manual or automatically annotated features to let an LLM intelligently answer questions.
Data Extract - ask multiple questions across hundreds of documents using complex LLM-powered querying behavior. Our sample implementation uses LlamaIndex + Marvin.
Custom Data Extract - Custom data extract pipelines can be used on the frontend to query documents in bulk.
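
To make the vector-store features above more concrete, here is a minimal, dependency-free sketch of hybrid retrieval: filter on structured metadata first, then rank the survivors by embedding similarity. The `Block` class, the `hybrid_search` function and the three-dimensional embeddings are hypothetical stand-ins for illustration only. They are not the OpenContracts models or API, and a real deployment would push both steps into Postgres via pgvector instead of doing them in Python.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a stored layout block and its embedding."""
    corpus: str
    text: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(blocks, corpus, query_embedding, top_k=2):
    """Filter on structured metadata, then rank by vector similarity."""
    candidates = [b for b in blocks if b.corpus == corpus]
    ranked = sorted(
        candidates,
        key=lambda b: cosine_similarity(b.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: the third block has a matching embedding but the wrong corpus.
blocks = [
    Block("leases", "Rent is due monthly.", [0.9, 0.1, 0.0]),
    Block("leases", "Tenant may sublet.", [0.1, 0.9, 0.0]),
    Block("ndas", "Rent-free period.", [0.9, 0.1, 0.0]),
]
results = hybrid_search(blocks, "leases", [1.0, 0.0, 0.0], top_k=1)
print(results[0].text)
```

The metadata filter runs before the similarity ranking, so a block from another corpus is never returned even when its embedding is an equally good match.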

Key Docs

We recommend you browse our docs via our MkDocs site. You can also view the docs in the repo:

Quickstart Guide - You'll probably want to get started quickly. Setting up locally should be pretty painless if you're already running Docker.
Basic Walkthrough - Check out the walkthrough to step through basic usage of the application for document and annotation management.
PDF Annotation Data Format Overview - You may be interested in how we map text to PDFs visually and the underlying data format we're using.
Django + Pgvector Powered Hybrid Vector Database - We've used the latest open source tooling for vector storage in Postgres to make it almost trivially easy to combine structured metadata and vector embeddings with an API-powered application.
LlamaIndex Integration Walkthrough - We wrote a wrapper for our backend database and vector store to make it simple to load our parsed annotations, embeddings and text into LlamaIndex. Even better, if you have additional annotations in the document, the LLM can access those too.
Write Custom Data Extractors - Custom data extract tasks (which can use LlamaIndex or can be totally bespoke) are automatically loaded and displayed on the frontend to let users select how to ask questions and extract data from documents.
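
As a rough illustration of the "automatically loaded" extractor idea described above, the sketch below uses a plain decorator-based registry. Everything in it (`EXTRACTORS`, `register_extractor`, `run_extract`, the toy keyword extractor) is hypothetical and written for this example. The actual OpenContracts mechanism is described in the custom data extractor docs and may work quite differently.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a display name to an extractor function.
# Each extractor takes (document_text, question) and returns an answer string.
EXTRACTORS: Dict[str, Callable[[str, str], str]] = {}


def register_extractor(name: str):
    """Decorator that adds a function to the registry under a display name."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register_extractor("First sentence containing keyword")
def keyword_sentence(document_text: str, question: str) -> str:
    """Toy extractor: return the first sentence that contains the keyword."""
    for sentence in document_text.split("."):
        if question.lower() in sentence.lower():
            return sentence.strip()
    return ""


def run_extract(extractor_name, documents, question):
    """Apply one registered extractor to every document in bulk."""
    extractor = EXTRACTORS[extractor_name]
    return {doc_id: extractor(text, question) for doc_id, text in documents.items()}


documents = {
    "lease.pdf": "The term is five years. Rent is due monthly.",
    "nda.pdf": "Information is confidential. No rent applies.",
}
answers = run_extract("First sentence containing keyword", documents, "rent")
for doc_id, answer in answers.items():
    print(doc_id, "->", answer)
```

A UI could build its extractor picker from the keys of such a registry, so adding a new extractor would only require defining and decorating a function.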

Architecture and Data Flows at a Glance

Core Data Standard

The core idea here - besides providing a platform to analyze contracts - is an open and standardized architecture that makes data extremely portable. Powering this is a set of data standards to describe the text and layout blocks on a PDF page.
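
To give a feel for what a per-page text-and-layout description can look like, here is a minimal sketch loosely modeled on PAWLS-style token layouts, where each page carries its dimensions and a list of tokens with bounding boxes. The class names, field names and helper functions are assumptions made for this example, not the project's actual schema. See the PDF Annotation Data Format Overview for the real format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Token:
    """One word on a page: top-left corner, size and text (assumed fields)."""
    x: float
    y: float
    width: float
    height: float
    text: str


@dataclass
class Page:
    index: int
    width: float
    height: float
    tokens: list[Token] = field(default_factory=list)


def build_text_layer(page: Page):
    """Return the page text plus the (start, end) character span of each token."""
    parts, spans, cursor = [], [], 0
    for token in page.tokens:
        spans.append((cursor, cursor + len(token.text)))
        parts.append(token.text)
        cursor += len(token.text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def bounding_box(page: Page, start: int, end: int):
    """Union box (x0, y0, x1, y1) of tokens overlapping characters [start, end)."""
    _, spans = build_text_layer(page)
    hits = [t for t, (s, e) in zip(page.tokens, spans) if s < end and e > start]
    if not hits:
        return None
    return (
        min(t.x for t in hits),
        min(t.y for t in hits),
        max(t.x + t.width for t in hits),
        max(t.y + t.height for t in hits),
    )


page = Page(index=0, width=612, height=792, tokens=[
    Token(72, 72, 40, 12, "Governing"),
    Token(116, 72, 20, 12, "Law"),
    Token(72, 90, 30, 12, "applies"),
])
text, spans = build_text_layer(page)
start = text.index("Governing Law")
box = bounding_box(page, start, start + len("Governing Law"))
print(text)
print(box)
```

Because each token records both its text and its position, a character span in the plain text can be mapped back to a region on the page, which is what lets an annotation be drawn over the PDF.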

Robust PDF Processing Pipeline

We have a robust, horizontally scalable PDF processing pipeline that generates our standardized data consistently for PDF inputs (we're working on adding additional formats).

Special thanks to Nlmatics and nlm-ingestor for powering the layout parsing and extraction.

Limitations

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch.

Adding OCR and ingestion for other enterprise documents is a priority.

Acknowledgements

Special thanks to AllenAI's PAWLS project and Nlmatics' nlm-ingestor. They've pioneered a number of features and flows, and we are using their code in some parts of the application.
