Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses. Download this csv file to get started!
- [Feb 2023] PrimeKG is published in Nature Scientific Data.
- [Jun 2022] PrimeKG crosses 5,000 downloads on Harvard Dataverse!
- [Apr 2022] PrimeKG is live on bioRxiv and Harvard Dataverse!
- Diverse coverage of diseases: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
- Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.
- Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors that come from medical authorities such as Mayo Clinic, Orphanet, DrugBank, and others.
- Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
- Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.
To install the dependencies required to run the PrimeKG code, use pip:
pip install -r requirements.txt
Alternatively, create a conda environment:
conda env create --name PrimeKG --file=environments.yml
For a quick start in Python, you can download the raw data files in .csv format directly from Harvard Dataverse or load PrimeKG using the following community dataloaders.
Download PrimeKG from Harvard Dataverse using the following bash command. You can replace kg.csv with any file path.
wget -O kg.csv https://dataverse.harvard.edu/api/access/datafile/6180620
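Where wget is not available, the same download can be sketched with Python's standard library. This is a minimal sketch, not part of the PrimeKG codebase: the helper names build_datafile_url and download_primekg are hypothetical, and only the file ID 6180620 and the URL pattern are taken from the command above.

```python
import urllib.request

# URL pattern copied from the wget command above.
DATAVERSE_ACCESS_URL = "https://dataverse.harvard.edu/api/access/datafile/{file_id}"


def build_datafile_url(file_id: int) -> str:
    """Return the Dataverse access URL for a given file ID."""
    return DATAVERSE_ACCESS_URL.format(file_id=file_id)


def download_primekg(dest: str = "kg.csv", file_id: int = 6180620) -> str:
    """Download the file to `dest` and return the local path (hypothetical helper)."""
    url = build_datafile_url(file_id)
    path, _headers = urllib.request.urlretrieve(url, dest)
    return path


# Uncomment to download the full knowledge graph:
# download_primekg("kg.csv")
```

As with the wget command, the destination path can be changed by passing a different `dest`.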
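You can use the following code to load PrimeKG and inspect its data.
import pandas as pd
# low_memory=False reads the whole file before inferring column dtypes
primekg = pd.read_csv('kg.csv', low_memory=False)
# Select every edge that has a disease node at either end
primekg.query('y_type=="disease"|x_type=="disease"')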
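The disease query above can be generalized to any node type, and the same two columns give a quick overview of how edges are distributed across node types. The sketch below assumes only the x_type and y_type columns used in that query. The function names are hypothetical, and the four-row sample is made up for illustration rather than taken from PrimeKG.

```python
import pandas as pd


def edges_touching(kg: pd.DataFrame, node_type: str) -> pd.DataFrame:
    """Return the edges whose x or y endpoint has the given node type."""
    mask = (kg["x_type"] == node_type) | (kg["y_type"] == node_type)
    return kg[mask]


def type_pair_counts(kg: pd.DataFrame) -> pd.Series:
    """Count edges for each (x_type, y_type) combination, largest first."""
    return kg.groupby(["x_type", "y_type"]).size().sort_values(ascending=False)


# Made-up sample standing in for the real kg.csv contents.
sample = pd.DataFrame({
    "x_type": ["drug", "disease", "gene/protein", "drug"],
    "y_type": ["disease", "gene/protein", "gene/protein", "drug"],
})

disease_edges = edges_touching(sample, "disease")
pair_counts = type_pair_counts(sample)
```

On the real graph, the same calls would be `edges_touching(primekg, "disease")` and `type_pair_counts(primekg)`.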
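To load PrimeKG with the Therapeutics Data Commons (TDC) dataloader, install PyTDC:
pip install PyTDC
Then, in Python:
from tdc.resource import PrimeKG
# Load PrimeKG, using ./data as the local data directory
data = PrimeKG(path = './data')
# Retrieve the feature table for drug nodes
drug_feature = data.get_features(feature_type = 'drug')
# Convert the knowledge graph to a NetworkX graph
data.to_nx()
# List the nodes of type disease
data.get_node_list(type = 'disease')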
To load PrimeKG with PyKEEN, install the package:
pip install pykeen
Then check that the dataset is available:
import pykeen.datasets
pykeen.datasets.has_dataset('primekg')
All persistent identifiers and weblinks to download the 20 primary data resources used to build PrimeKG are provided in the Data Records section of our article. We also list the exact filenames downloaded from each resource for easy corroboration.
We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.
Database | Processing scripts | Expected script output |
---|---|---|
Bgee | bgee.py | anatomy_gene.csv |
Comparative Toxicogenomics Database | ctd.py | exposure_data.csv |
DisGeNET | - | curated_gene_disease_associations.tsv |
DrugBank | drugbank_drug_drug.py | drug_drug.csv |
DrugBank | parsexml_drugbank.ipynb, Parsed_feature.ipynb | 12 drug feature files |
DrugBank | drugbank_drug_protein.py | drug_protein.csv |
Drug Central | drugcentral_queries.txt | drug_disease.csv |
Drug Central | drugcentral_feature.Rmd | dc_features.csv |
Entrez Gene | ncbigene.py | protein_go_associations.csv |
Gene Ontology | go.py | go_terms_info.csv, go_terms_relations.csv |
Human Phenotype Ontology | hpo.py, hpo_obo_parser.py | hp_terms.csv, hp_parents.csv, hp_references.csv |
Human Phenotype Ontology | hpoa.py | disease_phenotype_pos.csv, disease_phenotype_neg.csv |
MONDO | mondo.py, mondo_obo_parser.py | mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv |
Reactome | reactome.py | reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv |
SIDER | sider.py | sider.csv |
UBERON | uberon.py | uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv |
UMLS | umls.py, map_umls_mondo.py | umls_mondo.csv |
UMLS | umls.ipynb | umls_def_disorder_2021.csv, umls_def_disease_2021.csv |
The code to harmonize datasets and construct PrimeKG is available at build_graph.ipynb. Simply run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts mentioned above. This notebook produces all three versions of PrimeKG: kg_raw.csv, kg_giant.csv, and the complete version kg.csv.
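Before running the notebook, it can help to confirm that the processing outputs listed in the table are in place. The sketch below is one way to do that. It assumes all outputs sit in a single flat directory, which may not match the repository's actual layout, and the helper name missing_outputs is hypothetical. The filenames are a subset of the "Expected script output" column.

```python
import tempfile
from pathlib import Path

# Subset of the expected outputs from the table above; extend as needed.
EXPECTED_OUTPUTS = (
    "anatomy_gene.csv",
    "exposure_data.csv",
    "drug_drug.csv",
    "drug_protein.csv",
    "drug_disease.csv",
    "sider.csv",
    "umls_mondo.csv",
)


def missing_outputs(data_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected filenames that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration on a temporary directory containing one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sider.csv").write_text("placeholder")
    demo_missing = missing_outputs(tmp, ["sider.csv", "drug_drug.csv"])
```

An empty list from `missing_outputs("path/to/outputs")` means every listed file was found.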
The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.
If you find PrimeKG useful, cite our work:
@article{chandak2022building,
title={Building a knowledge graph to enable precision medicine},
author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
journal={Nature Scientific Data},
doi={https://doi.org/10.1038/s41597-023-01960-3},
URL={https://www.nature.com/articles/s41597-023-01960-3},
year={2023}
}
PrimeKG is hosted on Harvard Dataverse with the following persistent identifier: https://doi.org/10.7910/DVN/IXA7BM. When Dataverse is under maintenance, PrimeKG datasets cannot be retrieved. This happens rarely; please check the status on the Dataverse website.
The PrimeKG codebase is released under the MIT license. For individual dataset usage, please refer to the license provided on each dataset's website.