We build a Graph RAG System specifically for the medical domain.
Check our paper here: https://arxiv.org/html/2408.04187v1
-
conda env create -f medgraphrag.yml
-
export OPENAI_API_KEY = your OPENAI_API_KEY
-
python run.py -simple True (now using ./dataset_ex/report_0.txt as RAG doc, "What is the main symptom of the patient?" as the prompt, change the prompt in run.py as you like.)
Top-level Private data (user-provided): we used MIMIC IV dataset as the private data.
Medium-level Books and Papers: We used MedC-K as the medium-level data. The dataset sources from S2ORC. Only those papers with PubMed IDs are deemed as medical-related and used during pretraining. The book is listed in this repo as MedicalBook.xlsx, due to licenses, we cannot release raw content. For reproducing, pls buy and process the books.
Bottom-level Dictionary data: We used Unified Medical Language System (UMLS) as the bottom level data. To access it, you'll need to create an account and apply for usage. It is free and approval is typically fast.
In the code, we use the 'trinity' argument to enable the hierarchy graph linking function. If set to True, you must also provide a 'gid' (graph ID) to specify which graphs the top-level should link to. UMLS is largely structured as a graph, so minimal effort is required to construct it. However, MedC-K must be constructed as graph data. There are several methods you can use, such as the approach we used to process the top-level in this repo (open-source LLMs are recommended to keep costs down), or you can opt for non-learning-based graph construction algorithms (faster, cheaper, and generally noisier)
Recognizing that accessing and processing all the data mentioned may be challenging, we are working to provide simpler example dataset to demonstrate functionality. Currently, we are using the mimic_ex here here as the Top-level data, which is the processed smaller dataset derived from MIMIC. For Medium-level and Bottom-level data, we are in the process of identifying suitable alternatives to simplify the implementation, welcome for any recommendations.
-
conda env create -f medgraphrag.yml
-
prepare neo4j and LLM (using ChatGPT here for an example), you need to export:
export OPENAI_API_KEY = your OPENAI_API_KEY
export NEO4J_URL= your NEO4J_URL
export NEO4J_USERNAME= your NEO4J_USERNAME
export NEO4J_PASSWORD= your NEO4J_PASSWORD
-
Download mimic_ex here, put that under your data path, like ./dataset/mimic_ex
-
python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex(where you put the dataset) -grained_chunk True -ingraphmerge True -construct_graph True
-
put your prompt to ./prompt.txt
-
python run.py -dataset mimic_ex -data_path ./dataset/mimic_ex(where you put the dataset) -inference True
We are building on CAMEL, an awesome framework for construcing multi-agent pipeline.
@article{wu2024medical,
title={Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation},
author={Wu, Junde and Zhu, Jiayuan and Qi, Yunli},
journal={arXiv preprint arXiv:2408.04187},
year={2024}
}