This repository contains all the materials for our "Machine learning with biomedical ontologies" manuscript. We provide the Jupyter Notebooks to reproduce our experimental results and the benchmark datasets based on predicting protein-protein interactions. Furthermore, we make a set of slides available (as PDF and source code in LaTeX Beamer) that may be useful for teaching or presentations.
We provide several Jupyter notebooks. The notebooks include:
- data.ipynb -- data generation and preprocessing
- semantic-similarity.ipynb -- predicting protein--protein interactions using semantic similarity measures
- onto2vec_opa2vec.ipynb -- axiom embeddings using Onto2Vec and embeddings for axioms plus annotation properties using OPA2Vec
- graph-walks.ipynb -- embedding ontologies using random walks on graphs
- graph-transe.ipynb -- embedding ontologies using translating embeddings on graphs
- elembedding.ipynb -- ontology embedding that approximates the ontology models (EL Embeddings)
We provide two benchmark datasets for protein--protein interaction prediction task. The datasets can be downloaded using the following link:
Two benchmark datasets for evaluating machine learning methods on the task of predicting protein--protein interaction networks. The original data was downloaded from StringDB database of protein--protein interactions and Gene Ontology Resource. This archive includes:
- Protein--protein interactions for human and yeast organisms
- Gene Ontology in OBO and OWL format
- Gene Ontology Annotations for human and yeast proteins
- Protein aliases files with ID mappings between StringDB proteins and other databases.
We filter out interactions with confidence score less than 700 and consider them to be symmetric. We randomly split the datasets into 80/20% training/testing sets by the number of interactions and use 20% of the training set as a validation set.
Please install the following software to run our notebooks:
- Groovy 2.0+
- Raptor RDF Syntax Library
- Python 3.6+
- Install python dependencies with:
pip install -r requirements.txt
- Load submodules with:
git submodule update --init --recursive
Run jupyter notebook
and then open the notebook files.
Method | Raw Hits@10 | Filtered Hits@10 | Raw Hits@10 | Filtered Hits@100 | Raw Mean Rank | Filtered Mean Rank | Raw AUC | Filtered AUC |
---|---|---|---|---|---|---|---|---|
TransE | 0.06 | 0.13 | 0.32 | 0.40 | 1125.4 | 1074.8 | 0.82 | 0.83 |
SimResnik | 0.09 | 0.17 | 0.38 | 0.48 | 757.8 | 706.9 | 0.86 | 0.87 |
SimLin | 0.08 | 0.15 | 0.33 | 0.41 | 875.4 | 824.5 | 0.84 | 0.85 |
SiameseNN | 0.06 | 0.17 | 0.46 | 0.68 | 674.27 | 622.20 | 0.89 | 0.90 |
SiameseNN (Ont) | 0.08 | 0.19 | 0.50 | 0.72 | 543.56 | 491.56 | 0.91 | 0.92 |
EL Embeddings | 0.08 | 0.17 | 0.44 | 0.62 | 451.29 | 394.04 | 0.92 | 0.93 |
Onto2Vec | 0.08 | 0.15 | 0.35 | 0.48 | 641.1 | 587.9 | 0.79 | 0.80 |
OPA2Vec | 0.06 | 0.13 | 0.39 | 0.58 | 523.3 | 466.6 | 0.87 | 0.88 |
Random walk | 0.06 | 0.13 | 0.31 | 0.40 | 612.6 | 587.4 | 0.87 | 0.88 |
Node2Vec | 0.07 | 0.15 | 0.36 | 0.46 | 589.1 | 522.4 | 0.87 | 0.88 |
Method | Raw Hits@10 | Filtered Hits@10 | Raw Hits@10 | Filtered Hits@100 | Raw Mean Rank | Filtered Mean Rank | Raw AUC | Filtered AUC |
---|---|---|---|---|---|---|---|---|
TransE | 0.05 | 0.11 | 0.24 | 0.29 | 3960.4 | 3890.6 | 0.78 | 0.79 |
SimResnik | 0.05 | 0.09 | 0.25 | 0.30 | 1933.6 | 1864.4 | 0.88 | 0.89 |
SimLin | 0.04 | 0.08 | 0.20 | 0.23 | 2287.9 | 2218.7 | 0.86 | 0.87 |
SiameseNN | 0.05 | 0.15 | 0.41 | 0.64 | 1881.10 | 1808.77 | 0.90 | 0.89 |
SiameseNN (Ont) | 0.05 | 0.13 | 0.38 | 0.59 | 1838.31 | 1766.34 | 0.89 | 0.89 |
EL Embeddings | 0.01 | 0.02 | 0.22 | 0.26 | 1679.72 | 1637.65 | 0.90 | 0.90 |
Onto2Vec | 0.05 | 0.08 | 0.24 | 0.31 | 2434.6 | 2391.2 | 0.77 | 0.77 |
OPA2Vec | 0.03 | 0.07 | 0.23 | 0.26 | 1809.7 | 1767.6 | 0.86 | 0.88 |
Random walk | 0.04 | 0.10 | 0.28 | 0.34 | 1942.6 | 1958.6 | 0.85 | 0.86 |
Node2Vec | 0.03 | 0.07 | 0.22 | 0.28 | 1860.5 | 1813.1 | 0.86 | 0.87 |
To add your own results to the benchmark, please send us a pull request with a link to the source repository that contains the code to reproduce the results. Alternatively, please create an issue on the issue tracker and we will add your results.
We provides slides that can be used to present some of this work. The slides have been created as part of an Ontology Tutorial that was developed and taught over several years at various events. All methods in the slides are also implemented with examples in our Jupyter Notebooks.
- Introduction
- Ontologies and Graphs -- basic introduction to ontologies, Description Logic, and how they can give rise to graph-based representations
- Semantic Similarity -- different semantic similarity measures on ontologies
- Ontology Embeddings -- methods to generate embeddings for ontologies, including syntactic, graph-based, and model-based approaches.
- OWLAPI: Reference library to process OWL ontologies, supports most OWL reasoners.
- funowl: Python library to process OWL ontologies.
- owlready2: Python library to process OWL ontologies.
- Apache Jena: RDF library with OWL support.
- rdflib: Python RDF library with OWL support, in particular infixowl.
- Protege: Comprehensive ontology editor and knowledge engineering environment.
- ELK: Very fast reasoner for the OWL 2 EL profile with polynomial worst-case time complexity.
- HermiT: Automated reasoner supporting most of OWL axioms with exponential worst-case complexity.
- Pellet: OWL reasoner supporting most of the OWL constructs and supporting several additional features.
- OBOGraphs: Syntactic conversion of ontologies to graphs, targeted at OBO ontologies.
- Onto2Graph: Semantic conversion of OWL ontologies to graphs using automated reasoning, following the axiom patterns of the OBO Relation Ontology.
- Semantic Measures Library: Comprehensive Java library to compute semantic similarity measures over ontologies.
- sematch: Python library to compute semantic similarity on knowledge graphs.
- DiShIn: Python library to compute semantic similarity on ontologies.
- OWL2Vec: Graph-based ontology embedding method that combines generation of graphs from ontologies, random walks on the generated graphs, and generation of embeddings using Word2Vec. Supports most OWL axioms.
- DL2Vec: Graph-based ontology embedding method that combines generation of graphs from ontologies, random walks on the generated graphs, and generation of embeddings using Word2Vec. Supports most OWL axioms.
- Walking RDF & OWL: Graph-based ontology embedding method that combines generation of graphs from ontologies, random walks on the generated graphs, and generation of embeddings using Word2Vec. Supports only the ontology taxonomy.
- RDF2Vec (pyRDF2Vec): Method to embed RDF graphs.
- Node2Vec: Method to embed graphs using biased random walks.
- PyKEEN and BioKEEN: Toolkit for generating knowledge graph embeddings using several different approaches.
- OpenKE: Library and toolkit for generating knowledge graph embeddings
- PyTorch Geometric: Library for graph neural networks which can be used to generate graph embeddings.
- OWL2Vec*: Graph-based ontology embedding method that combines generation of graphs from ontologies, random walks on the generated graphs, and generation of embeddings using Word2Vec. Supports OWL 2 axioms and annotation axioms. Extension of OWL2Vec.
- Onto2Vec: Embeddings based on treating logical axioms as a text corpus.
- OPA2Vec: Embeddings that combine logical axioms with annotation properties.
- EL Embeddings: Embeddings that approximate the interpretation function and preserve semantics for intersection, existential quantifiers, and bottom class.
- DeepGO and DeepPheno: Implement ontology-based hierarchical classifiers for function and phenotype prediction. The hierarchical classification modules are generic and can be used with other ontologies and applications.
If you like our work, please cite our paper:
@article{machine-learning-with-ontologies,
author = {Kulmanov, Maxat and Smaili, Fatima Zohra and Gao, Xin and Hoehndorf, Robert},
title = {Semantic similarity and machine learning with ontologies},
journal = {Briefings in Bioinformatics},
year = {2020},
month = {10},
abstract = {Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.},
issn = {1477-4054},
doi = {10.1093/bib/bbaa199},
url = {https://doi.org/10.1093/bib/bbaa199},
note = {bbaa199},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbaa199/33875255/bbaa199.pdf},
}