Highlight news and updates in README
ayushnoori authored Jan 1, 2024
1 parent 7985e67 commit 1961c67
Showing 1 changed file with 140 additions and 139 deletions.
279 changes: 140 additions & 139 deletions README.md
@@ -15,6 +15,146 @@ biological scales. We accompany PrimeKG’s graph structure with text descriptio
diseases to enable multimodal analyses. Download [this CSV file](https://dataverse.harvard.edu/api/access/datafile/6180620)
to get started!

## News and Updates
- [Dec 2023] PrimeKG is extended to improve coverage of OMIM data.

<details><summary>Details:</summary>

### December 2023 update

In December 2023, we prepared an updated version of PrimeKG that includes complete entries from the Online Mendelian
Inheritance in Man (OMIM) database in a standardized data format.

#### Changes to PrimeKG
As discussed in [issue #9](https://github.com/mims-harvard/PrimeKG/issues/9), OMIM phenotypes and genes were
not fully included in prior versions of PrimeKG. For more details, see
[this pull request](https://github.com/mims-harvard/PrimeKG/pull/12).

To extend PrimeKG with a new data source and include edges between existing nodes in the knowledge graph,
we devised a standardized data format (see [PR#207](https://github.com/mims-harvard/TDC/pull/207) in mims-harvard/TDC)
that is used for all data sources and matches the format of the published PrimeKG edge list.

#### Summary
* `datasets/processing_scripts/omim_tools.py` script contains functions to process OMIM data.
* `datasets/omim/` folder should store OMIM datasets.
* `datasets/omim/omim-api.ipynb` notebook is the OMIM API wrapper, which is used to download OMIM entries (note that
an API key is required).
* `knowledge_graph/append_omim.ipynb` notebook is used to append OMIM entries to PrimeKG.
* `scripts/utils.py` includes scripts that are used across multiple data sources.

#### OMIM Database
Many OMIM phenotype entries were already included in PrimeKG through MONDO; however, some OMIM information was
still missing. Thus, we add scripts and notebooks to cover OMIM gene, phenotype, and phenotypic series
(see [here](https://www.omim.org/help/faq#1_13)) entries, and to enable regular updates.

#### NCBI Gene
* OMIM gene entries are linked to NCBI Gene entries via new edges in the KG.

#### Human Phenotype Ontology
* HPO-OMIM edges are added to PrimeKG.

#### MONDO
* MONDO-OMIM edges are added to PrimeKG.

#### Statistics

New nodes and edges added:
```text
# of new edges: 612282
# of new node: 32866
```
Updated edge count by `display_relation`:
```text
display_relation
associated with 581387
linked to 26784
members 4111
```
Updated edge count by `relation`:
```text
relation
mim_disease 9599
mim_gene 16636
mim_phenotype 574128
mim_phenotypic_series 4111
mim_phenotypic_series_disease 549
phenotype_map 7259
```
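
As a quick sanity check, each of the two breakdowns above should sum to the reported number of new edges. A minimal sketch in Python, with the counts copied from the statistics above (the variable names are illustrative and not part of the PrimeKG codebase):

```python
# Counts copied from the statistics reported above.
new_edges = 612282

by_display_relation = {
    "associated with": 581387,
    "linked to": 26784,
    "members": 4111,
}

by_relation = {
    "mim_disease": 9599,
    "mim_gene": 16636,
    "mim_phenotype": 574128,
    "mim_phenotypic_series": 4111,
    "mim_phenotypic_series_disease": 549,
    "phenotype_map": 7259,
}

# Each breakdown covers the same set of new edges,
# so each should add up to the overall total.
display_total = sum(by_display_relation.values())
relation_total = sum(by_relation.values())

print(display_total == new_edges, relation_total == new_edges)
```
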
</details>
- [July 2023] PrimeKG construction scripts are updated to include primary source data releases up to July 2023. Note that the files published on Harvard DataVerse remain unchanged; however, we provide new scripts and updated links should users wish to build their own current version of PrimeKG.
<details><summary>Details:</summary>
### July 2023 update
In July 2023, this repository was updated to rebuild PrimeKG and update the knowledge graph to include database releases up to July 2023. Note that the files published on Harvard DataVerse remain unchanged; however, we provide new scripts and updated links should users wish to build their own current version of PrimeKG. For more details, see [this pull request](https://github.com/mims-harvard/PrimeKG/pull/11).
17 scripts in `datasets/processing_scripts/` were re-run or updated to build a new version of PrimeKG, while `datasets/feature_construction/` scripts may remain out of date. Re-run or updated primary data sources include Bgee, Comparative Toxicogenomics Database, DisGeNET, DrugBank, DrugCentral, NCBI Gene, Gene Ontology, Human Phenotype Ontology, MONDO, Reactome, SIDER, UBERON, and UMLS.
For more information, see `datasets/primary_data_resources.sh`. Changes include the following:
#### General
Created script to automatically create directory structure, pull data, and run all necessary processing and feature extraction steps.
* Fixed broken environment construction script.
* Script automatically creates required directories.
* Added commands to retrieve gene names, details, and NCBI ID to UniProt ID mapping from [www.genenames.org](http://www.genenames.org/), then output to `vocab/gene_names.csv` and `vocab/gene_map.csv`.
#### Bgee
* 58405/5257181 gold quality calls with expression rank < 25000 now specify cell type in a particular tissue (_e.g._, UBERON:0000473 ∩ CL:0000089, which denotes germ line stem cell in testis).
* These rows are dropped in `bgee.py`.
* URL updated to [here](https://www.bgee.org/ftp/current/download/calls/expr_calls/Homo_sapiens_expr_advanced.tsv.gz).
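
The row filter described above can be sketched as follows. This is an illustration only, not the code in `bgee.py`: the column names (`Anatomical entity ID`, `Call quality`, `Expression rank`) are assumed from the Bgee advanced calls file layout, and the gene IDs and rank values in the sample are made up.

```python
import csv
import io

# Intersection sign used by Bgee for "cell type in tissue" entities.
INTERSECTION = "\u2229"

# Hypothetical sample rows; only the two ontology IDs in the second row
# come from the example in the text.
SAMPLE_TSV = (
    "Gene ID\tAnatomical entity ID\tCall quality\tExpression rank\n"
    "GENE_A\tUBERON:0000473\tgold quality\t1200.5\n"
    "GENE_B\tUBERON:0000473 " + INTERSECTION + " CL:0000089\tgold quality\t830.0\n"
    "GENE_C\tUBERON:0002107\tsilver quality\t500.0\n"
    "GENE_D\tUBERON:0002107\tgold quality\t31000.0\n"
)


def keep_row(row, max_rank=25000):
    """Keep gold-quality, well-ranked calls that name a single anatomical entity."""
    is_gold = row["Call quality"] == "gold quality"
    is_ranked = float(row["Expression rank"]) < max_rank
    is_single_entity = INTERSECTION not in row["Anatomical entity ID"]
    return is_gold and is_ranked and is_single_entity


reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
kept = [row["Gene ID"] for row in reader if keep_row(row)]
print(kept)
```

Checking for the intersection sign in the entity ID is one simple way to detect the cell-type-in-tissue rows; the actual script may identify them differently.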
#### Comparative Toxicogenomics Database
* URL updated to [here](https://ctdbase.org/reports/CTD_exposure_events.csv.gz).
#### DisGeNET
* No changes needed.
#### DrugBank
* Fixed paths in `parsexml_drugbank.py`. Output to new `/parsed` subdirectory. Removed extraneous lines in `Parsed_feature.ipynb`.
* :white_check_mark: Successfully ran `drugbank_drug_drug.py` and `drugbank_drug_protein.py`.
* :warning: `parsexml_drugbank.py` and `Parsed_feature.ipynb` may need updates.
#### DrugCentral
* Modified `drugcentral_queries.txt` to work on O2, the Harvard Medical School high-performance computing cluster.
* :warning: `drugcentral_feature.Rmd` may need updates.
#### NCBI Gene
* No changes needed.
#### Gene Ontology
* Used `-L` flag to follow redirects. No other changes needed.
#### Human Phenotype Ontology
* Used `-L` flag to follow redirects. No other changes needed to `hpo.py`.
* Updated `hpoa.py` to replace old column names with new column names.
#### MONDO
* Added check for NoneType values in external references (line 29).
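
A guard of this kind might look like the sketch below. The term structure (a dict with an optional `xrefs` list), the placeholder IDs, and the function name are all hypothetical; the actual MONDO processing script may represent terms differently.

```python
def external_references(term):
    """Return a term's external references, or an empty list if none are recorded."""
    xrefs = term.get("xrefs")
    if xrefs is None:
        # Without this check, iterating over a missing xrefs field raises a TypeError.
        return []
    return list(xrefs)


# Hypothetical terms: one with references, one with an explicit None, one with no field.
terms = [
    {"id": "EXAMPLE:1", "xrefs": ["OTHER:10"]},
    {"id": "EXAMPLE:2", "xrefs": None},
    {"id": "EXAMPLE:3"},
]

counts = [len(external_references(term)) for term in terms]
print(counts)
```
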
#### Reactome
* No changes needed.
#### SIDER
* No changes needed.
#### UBERON
* Checked for NA values, dropped two obsolete terms (UBERON:0039300 and UBERON:0039302) not marked as obsolete in the source file.
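
The cleanup step above could be sketched as follows. Only the two obsolete identifiers come from the text; the record layout, the sample terms, and the function name are assumptions made for illustration.

```python
# The two terms named above: obsolete, but not flagged as such in the source file.
UNMARKED_OBSOLETE = {"UBERON:0039300", "UBERON:0039302"}


def clean_terms(terms):
    """Drop terms with a missing ID or name, and the known unmarked obsolete terms."""
    cleaned = []
    for term in terms:
        if not term.get("id") or not term.get("name"):
            continue  # NA value in a required field
        if term["id"] in UNMARKED_OBSOLETE:
            continue
        cleaned.append(term)
    return cleaned


# Hypothetical input records.
terms = [
    {"id": "UBERON:EXAMPLE1", "name": "example tissue"},
    {"id": "UBERON:0039300", "name": "placeholder name"},
    {"id": "UBERON:EXAMPLE2", "name": None},
    {"id": "UBERON:0039302", "name": "placeholder name"},
]

kept_ids = [term["id"] for term in clean_terms(terms)]
print(kept_ids)
```
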
#### UMLS
* UMLS data pulled and paths updated for 2023 data.
* :warning: `umls.ipynb` may need updates.
</details>
- [Feb 2023] PrimeKG is [published](https://www.nature.com/articles/s41597-023-01960-3) in Nature Scientific Data.
- [Jun 2022] PrimeKG crosses 5,000 downloads on Harvard Dataverse!
- [Apr 2022] PrimeKG is live on [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.05.01.489928v1) and [Harvard Dataverse](https://doi.org/10.7910/DVN/IXA7BM)!
## Table of Contents
- [Unique Features of PrimeKG](#unique-features-of-primekg)
- [Environment Setup](#environment-setup)
@@ -156,144 +296,5 @@ identifier [https://doi.org/10.7910/DVN/IXA7BM](https://doi.org/10.7910/DVN/IXA7
maintenance, PrimeKG datasets cannot be retrieved. That happens rarely; please check the status on
[the Dataverse website](https://dataverse.harvard.edu/).

## License
The PrimeKG codebase and associated tools are released under the MIT license. Please note that this license refers specifically to the PrimeKG software and is distinct from any licenses governing the PrimeKG dataset itself. For individual dataset usage, refer to the respective dataset licenses available on each data source's website.
