This folder contains a wide variety of scripts that were used to gather data, train the classifiers, test the classifiers, and report the results.
VarSight.py contains the main command-line interface for running VarSight after the models have been trained. Please refer to the main README for usage instructions.
The following files are primarily used to pre-process data into easy-to-manage formats in Python.
- CodiDumpUtil.py - This script contains helper functions for parsing a filtered Codicem JSON. Its main purpose is to load a Codicem file and pull out the fields requested for use by the classifiers.
- ExomiserUtil.py - This script contains helper functions for parsing prioritized variants from Exomiser.
- HPOUtil.py - This script contains helper functions for calculating the gene rankings based on the cosine score from the Human Phenotype Ontology (HPO) terms.
- OntologyUtil.py - This script contains helper functions for ontologies. It is primarily support code for HPOUtil.py.
- PVPUtil.py - This script contains helper functions for parsing prioritized variants from DeepPVP.
- PhenGenUtil.py - This script contains helper functions for parsing prioritized variants from Phen-Gen.
- PyxisMapUtil.py - This script contains helper functions for retrieving gene rankings based on the PyxisMap ranks using the HPO terms.
- SummaryDBUtil.py - This script contains helper functions for parsing the database dump files containing metadata for the Undiagnosed Diseases Network, including sample IDs, HPO terms, and reported primary variants. Note: these files are not available on GitHub because they contain Personal Health Information (PHI).
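The HPO-based ranking in HPOUtil.py can be illustrated with a small sketch. Assume each patient and each gene is represented as a set of HPO term IDs, treated as binary vectors over terms, and genes are ranked by cosine similarity to the patient's set. The function and variable names below are illustrative, not the actual HPOUtil.py implementation:

```python
import math

def cosine_score(patient_terms, gene_terms):
    """Cosine similarity between two sets of HPO term IDs,
    treating each set as a binary vector over all terms."""
    if not patient_terms or not gene_terms:
        return 0.0
    overlap = len(patient_terms & gene_terms)
    return overlap / math.sqrt(len(patient_terms) * len(gene_terms))

def rank_genes(patient_terms, gene_annotations):
    """Return gene names sorted from best to worst cosine score."""
    scored = [(cosine_score(patient_terms, terms), gene)
              for gene, terms in gene_annotations.items()]
    return [gene for score, gene in sorted(scored, reverse=True)]

# Toy example: a patient with seizures and developmental delay
patient = {"HP:0001250", "HP:0001263"}
genes = {
    "SCN1A": {"HP:0001250", "HP:0001263", "HP:0002069"},
    "BRCA1": {"HP:0003002"},
}
print(rank_genes(patient, genes))  # SCN1A ranks above BRCA1
```

The same ordering over all annotated genes is what gives each candidate variant's gene a rank that the classifiers can consume as a feature.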
The following files perform the core workhorse training, testing, and reporting of results from VarSight:
- TestLearners.py - This command-line tool contains the primary test functions for gathering data, loading data, cleaning/reformatting data, training the models, testing the models, and reporting results. It supports two sub-routines:
  - `gather` - This sub-routine will pre-gather ranks from PyxisMap and HPOUtil for use during analysis.
  - `analyze` - This sub-routine performs the actual analysis and writes the results to rendered .tex files that are automatically pulled into the LaTeX paper. Here is a breakdown of the options available in `analyze` mode (the original paper results used `-g`):
        usage: TestLearners.py analyze [-h] [-R] [-P] [-e | -g | -r]

        optional arguments:
          -h, --help         show this help message and exit
          -R, --regenerate   regenerate all test results even if already available
          -P, --path-only    only uses pathogenic and likely pathogenic variants as true positives
          -e, --exact-mode   use the dev-specified hyperparameters (single-execution)
          -g, --grid-mode    perform a grid search for the best hyperparameters (long multi-execution)
          -r, --random-mode  perform a random search for the best hyperparameters (short multi-execution)
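The option layout above (independent boolean flags plus a mode group where only one of `-e`/`-g`/`-r` may be given) maps directly onto Python's `argparse`. The sketch below shows how such a sub-command could be declared; the flag names mirror the usage text, but this is an illustration, not the actual TestLearners.py source:

```python
import argparse

parser = argparse.ArgumentParser(prog="TestLearners.py")
subparsers = parser.add_subparsers(dest="command")

analyze = subparsers.add_parser("analyze")
analyze.add_argument("-R", "--regenerate", action="store_true",
                     help="regenerate all test results even if already available")
analyze.add_argument("-P", "--path-only", action="store_true",
                     help="only use pathogenic and likely pathogenic variants as true positives")

# -e, -g, and -r each select a hyperparameter strategy, so they are mutually exclusive
mode = analyze.add_mutually_exclusive_group()
mode.add_argument("-e", "--exact-mode", action="store_true",
                  help="use the dev-specified hyperparameters (single-execution)")
mode.add_argument("-g", "--grid-mode", action="store_true",
                  help="perform a grid search for the best hyperparameters (long multi-execution)")
mode.add_argument("-r", "--random-mode", action="store_true",
                  help="perform a random search for the best hyperparameters (short multi-execution)")

args = parser.parse_args(["analyze", "-g"])
print(args.grid_mode)  # True
```

With this structure, passing both `-g` and `-r` on the same invocation is rejected by `argparse` automatically.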
The following files were used to generate data and/or figures for the paper and supplementary documents:
- SupplementGen.py - This script parses the Codicem filter JSON file and creates some .dot files that `graphviz` converts into figures for the supplementary document. These figures visualize the filtering process that was applied by Codicem before returning results to analysts. We use the variants that pass this filter for training and testing the classifiers that are a part of the core paper.
- TestLearners.py - Refer to the "Training/Testing Scripts" section above.