Docs2KG

A Human-LLM Collaborative Approach to Unified Knowledge Graph Construction from Heterogeneous Documents

Installation

The package is published on PyPI as Docs2KG.

You can install it via:

pip install Docs2KG

python -m spacy download en_core_web_sm

Setup and Tutorial

Detailed setup instructions and a tutorial can be found in the documentation.

There are two ways to run the package:

- Import the package and hook it into your own code.
- Run the package from the command line.

Command Line

# first, set the CONFIG_FILE environment variable to your local configuration file
export CONFIG_FILE=config.yml # or any other path to the configuration file
docs2kg # this command prints how to use the package

# we currently support the following commands
docs2kg process-document your_input_file --agent-name phi3.5 --agent-type ollama --project-id your_project_id
docs2kg batch-process your_input_dir --agent-name phi3.5 --agent-type ollama --project-id your_project_id
docs2kg list-formats # list all the supported formats

Usage: docs2kg [OPTIONS] COMMAND [ARGS]...

  Docs2KG - Document to Knowledge Graph conversion tool.

  Supports multiple document formats: PDF, DOCX, HTML, and EPUB.

Options:
  -c, --config PATH  Path to the configuration file (default: ./config.yml)
  --help             Show this message and exit.

Commands:
  batch-process     Process all supported documents in a directory.
  list-formats      List all supported document formats.
  neo4j             Load data to Neo4j database.
  process-document  Process a single document file.

Usage: docs2kg process-document [OPTIONS] FILE_PATH

  Process a single document file.

  FILE_PATH: Path to the document file (PDF, DOCX, HTML, or EPUB)

Options:
  -p, --project-id TEXT  Project ID for the knowledge graph construction
  -n, --agent-name TEXT  Name of the agent to use for NER extraction
  -t, --agent-type TEXT  Type of the agent to use for NER extraction
  --help                 Show this message and exit.

Usage: docs2kg neo4j [OPTIONS] PROJECT_ID

  Load data to Neo4j database.

Options:
  -m, --mode [import|export|load|docker_start|docker_stop]
                                  Mode of operation (import or export)
  -u, --neo4j-uri TEXT            URI for the Neo4j database
  -U, --neo4j-user TEXT           Username for the Neo4j database
  -P, --neo4j-password TEXT       Password for the Neo4j database
  -r, --reset_db                  Reset the database before loading data
  --help
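
If you want to drive the command line from your own Python code, the following is a minimal sketch that assembles the argument list for `docs2kg process-document` using only the options shown in the help output above. The helper name `build_process_command` is hypothetical and not part of Docs2KG. It assumes that `--config` is a top-level option and therefore goes before the subcommand, as the `docs2kg [OPTIONS] COMMAND [ARGS]...` usage line suggests.

```python
import shlex
from typing import List, Optional


def build_process_command(
    file_path: str,
    project_id: Optional[str] = None,
    agent_name: Optional[str] = None,
    agent_type: Optional[str] = None,
    config_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `docs2kg process-document` (hypothetical helper)."""
    command = ["docs2kg"]
    if config_path is not None:
        # Assumption: --config belongs to the top-level command,
        # so it is placed before the subcommand.
        command += ["--config", config_path]
    command += ["process-document", file_path]
    if project_id is not None:
        command += ["--project-id", project_id]
    if agent_name is not None:
        command += ["--agent-name", agent_name]
    if agent_type is not None:
        command += ["--agent-type", agent_type]
    return command


if __name__ == "__main__":
    cmd = build_process_command(
        "your_input_file",
        project_id="your_project_id",
        agent_name="phi3.5",
        agent_type="ollama",
    )
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To actually run it (requires Docs2KG to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file paths contain spaces.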

Motivation

Digesting diverse unstructured documents into a unified knowledge graph poses two main challenges:

- How can the documents be digitized?
  - With dual-path data processing:
    - Image-based documents, such as scanned PDFs and images, can be processed through layout analysis, OCR, and similar techniques. Docling and MinerU focus on this part.
    - Native digital documents, such as ebooks, DOCX, and HTML, can be processed with programmatic parsers.
  - A robust solution to this challenge looks likely to arrive soon.
- How can a high-quality unified knowledge graph be constructed with less effort?
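
The dual-path idea described above can be illustrated with a small routing function. This is only a sketch and not Docs2KG's actual implementation: the function name `choose_path`, the extension sets, and the path labels are all hypothetical, and sending a non-scanned PDF to the parser path is an assumption made for illustration.

```python
from pathlib import Path

# Hypothetical extension sets, chosen for illustration only.
IMAGE_BASED = {".png", ".jpg", ".jpeg", ".tiff"}
NATIVE_DIGITAL = {".docx", ".html", ".htm", ".epub"}


def choose_path(file_path: str, pdf_is_scanned: bool = False) -> str:
    """Return which processing path a document would take.

    "layout_ocr" stands for layout analysis plus OCR,
    "parser" stands for a programmatic parser.
    """
    suffix = Path(file_path).suffix.lower()
    if suffix in IMAGE_BASED:
        return "layout_ocr"
    if suffix in NATIVE_DIGITAL:
        return "parser"
    if suffix == ".pdf":
        # Assumption: a PDF can fall on either side, so the caller
        # states whether it is a scanned (image-based) PDF.
        return "layout_ocr" if pdf_is_scanned else "parser"
    raise ValueError(f"Unsupported file type: {suffix or file_path}")


if __name__ == "__main__":
    for name in ["report.docx", "scan.png", "book.epub"]:
        print(name, "->", choose_path(name))
```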

Many current tools focus on the first challenge but overlook the second.

To construct a high-quality unified knowledge graph with less effort, we propose Docs2KG.

We adopt both bottom-up and top-down approaches to construct the unified knowledge graph and its ontology with the help of LLMs.
We organise the knowledge graph from three aspects:
- MetaKG: the knowledge about all documents, like the author, the publication date, etc.
- LayoutKG: the knowledge about the layout of the documents, like title, subtitle, section, etc.
- SemanticKG: the knowledge about the content of the documents, like entities, relations, etc.
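
To make the three aspects concrete, here is a minimal sketch of how they might be held together for one document. The class name `UnifiedKG` and its field layout are hypothetical and are not Docs2KG's data model. They only mirror the descriptions above: metadata, layout elements, and semantic entities and relations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UnifiedKG:
    """Hypothetical container for the three aspects of one document."""

    # MetaKG: knowledge about the document itself (author, date, ...).
    meta: Dict[str, str] = field(default_factory=dict)
    # LayoutKG: (element type, text) pairs such as ("title", "...").
    layout: List[Tuple[str, str]] = field(default_factory=list)
    # SemanticKG: (entity text, entity label) pairs.
    entities: List[Tuple[str, str]] = field(default_factory=list)
    # SemanticKG: (head, relation, tail) triples.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


if __name__ == "__main__":
    kg = UnifiedKG(meta={"publication_date": "2024"})
    kg.layout.append(("title", "Docs2KG"))
    kg.entities.append(("Neo4j", "Software"))
    kg.relations.append(("Docs2KG", "loads_into", "Neo4j"))
    print(kg)
```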
We provide a human-LLM collaborative interface which allows humans to review and enhance the generated knowledge graph.
- The updated ontology, entity list, and relation list in turn help the KG construction LLM agent generate better results in the next iteration.
- The resulting knowledge graph can be used in downstream applications such as RAG.
- Link for the human-LLM collaborative interface: Docs2KG
- After the annotation, metrics to evaluate the quality of the automatic construction will be provided.
  - How many entities are correctly extracted by each method?
  - How many relations are correctly extracted by each method?
  - Contribution and retention of each method in the final knowledge graph, including human annotation.

Example of the interface. You only need to register to access it freely.

Development

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements.dev.txt

pip install -e .

Citation

If you find this package useful, please consider citing our work:

@misc{sun2024docs2kg,
    title = {Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models},
    author = {Qiang Sun and Yuanyi Luo and Wenxiao Zhang and Sirui Li and Jichunyang Li and Kai Niu and Xiangrui Kong and Wei Liu},
    year = {2024},
    eprint = {2406.02962},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
