GitHub - sheridancbio/cBioPortal-new-study-assistant: A tool to help match attributes in a new study to attributes in existing studies on the cBioPortal

cBioPortal-new-study-assistant

Introduction

This tool helps curators bring clinical data from a new study into cBioPortal. It reads the new study's clinical data and compares it with existing studies in the cBioPortal database to find attributes that match ones already in use. The ultimate goal is to help the curator normalize the new study data relative to existing cBioPortal studies.
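This README does not spell out how attributes are matched. Since python-levenshtein is among the required packages, edit distance between attribute names is presumably part of it. The following is a minimal pure-Python sketch of that idea only, not the script's actual implementation; the function names, the length normalization, and the 0.8 threshold are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical names."""
    a, b = a.upper(), b.upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def closest_matches(new_attr, known_attrs, threshold=0.8):
    """Return (similarity, name) pairs at or above threshold, best first."""
    scored = ((name_similarity(new_attr, known), known) for known in known_attrs)
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

For example, closest_matches("TUMOR_STAGE", ["TUMOR_STAGE_2009", "SEX"], threshold=0.6) keeps only the first candidate, because "SEX" shares too few characters with the query.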

Required Python packages

NumPy
SciPy (recent version required)
pandas
seaborn
python-levenshtein (available through Anaconda)
PyLaTeX (required only for the '--output_pdf' option)
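Before a run that can take several minutes, it may help to confirm that the packages above are importable. The sketch below uses only the standard library. The import names are the usual ones for these packages (python-levenshtein is imported as Levenshtein, PyLaTeX as pylatex); the helper name missing_packages is hypothetical and not part of the script.

```python
import importlib.util

# Display name -> import name for each package listed above.
REQUIRED = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "python-levenshtein": "Levenshtein",
}
# Only needed when the report is written as a PDF.
OPTIONAL = {"PyLaTeX": "pylatex"}


def missing_packages(modules):
    """Return the display names whose import name cannot be found."""
    return [name for name, module in modules.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    print("Missing required packages:", missing_packages(REQUIRED) or "none")
    print("Missing optional packages:", missing_packages(OPTIONAL) or "none")
```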

Running the script

The default mode of the script selects a random study from the portal and searches the other studies on the portal for matching attributes. The current version typically takes around 10 minutes to run and needs internet access to download data from cBioPortal. Alternatively, you can clone the datahub repository and run the tool on that local data (see the --datahub_path option).

Default example:

python new_study_assistant.py

A more practical use of the tool is to test raw data from a new study against the existing cBioPortal data. Currently the tool assumes that the raw study data has a single header line. Example raw data from the acyc_mda_2015 study is provided in the acyc_mda_2015 folder in this repository for reference.

Example using the acyc_mda_2015 raw data:

python new_study_assistant.py --study_to_drop='acyc_mda_2015' --new_study_path='./acyc_mda_2015/raw_data_clinical.txt' > similarity_output.txt

Options available

--new_study_path PATH

Path to the raw study data file that you want to analyze. Currently the code assumes that the header is a single line containing only attribute names. A file with a multi-line header will likely cause the program to fail.
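Because of the single-line header assumption, a file exported in cBioPortal's own clinical data format may need its leading metadata rows removed first. The sketch below assumes those rows start with "#" and that the file is tab-delimited; both are assumptions about the input, and the helper names are hypothetical rather than part of the script.

```python
def split_header(lines):
    """Separate leading '#' metadata rows from the remaining lines."""
    lines = list(lines)
    n_meta = 0
    while n_meta < len(lines) and lines[n_meta].startswith("#"):
        n_meta += 1
    return lines[:n_meta], lines[n_meta:]


def attribute_names(lines, delimiter="\t"):
    """Return the attribute names from the first non-metadata line."""
    _, body = split_header(lines)
    if not body:
        raise ValueError("no attribute-name line found")
    return body[0].rstrip("\r\n").split(delimiter)


def write_single_header_copy(src_path, dst_path):
    """Write a copy of src_path without the leading '#' metadata rows."""
    with open(src_path, encoding="utf-8") as src:
        _, body = split_header(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(body)
```

The copy written by write_single_header_copy could then be passed to --new_study_path.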

--study_to_drop STUDY_ID

Excludes a study (specified by STUDY_ID) from the analysis. This is useful when analyzing a study that is already on cBioPortal.

--specific_study STUDY_ID

Use this option to run the analysis on a specific study (specified by STUDY_ID) that is already in cBioPortal.

--output_pdf

Use this flag to write the report as a PDF instead of HTML (requires PyLaTeX).

--datahub_path PATH

Path to a local copy of the datahub. If this path is not specified, the script downloads data via the API instead.
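For readers who want to wrap or script the tool, the options above could be declared with argparse roughly as follows. This is a sketch built from the option descriptions in this README, not a copy of the parser in new_study_assistant.py; the defaults and the data_source helper are assumptions.

```python
import argparse


def build_parser():
    """Declare the five documented options; defaults here are assumptions."""
    parser = argparse.ArgumentParser(
        description="Match attributes of a new study against cBioPortal studies.")
    parser.add_argument("--new_study_path", metavar="PATH", default=None,
                        help="raw study data file with a single header line")
    parser.add_argument("--study_to_drop", metavar="STUDY_ID", default=None,
                        help="study to exclude from the analysis")
    parser.add_argument("--specific_study", metavar="STUDY_ID", default=None,
                        help="existing cBioPortal study to analyze")
    parser.add_argument("--output_pdf", action="store_true",
                        help="write the report as a PDF (requires PyLaTeX)")
    parser.add_argument("--datahub_path", metavar="PATH", default=None,
                        help="local datahub copy; the API is used if omitted")
    return parser


def data_source(args):
    """Report where study data would come from for the parsed options."""
    return "datahub" if args.datahub_path else "api"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("Data source:", data_source(args))
```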

Output

Similar attributes that are detected in the test study are printed to the screen. The prefix "NEW_STUDY_" is added to each attribute in the test study to distinguish it from attributes already on cBioPortal. The text output can be redirected to a file by appending "> FILENAME" to the python command.
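The prefix matters because a new study often reuses a name that already exists on the portal, such as AGE. A small sketch of the convention, useful when post-processing the redirected text output; the helper names are hypothetical and the script's own handling may differ.

```python
PREFIX = "NEW_STUDY_"


def tag_new_study(attributes):
    """Prefix each attribute name from the test study."""
    return [PREFIX + name for name in attributes]


def is_new_study(name):
    """True if the name refers to an attribute of the test study."""
    return name.startswith(PREFIX)


def strip_prefix(name):
    """Recover the original attribute name; other names pass through."""
    return name[len(PREFIX):] if is_new_study(name) else name


if __name__ == "__main__":
    portal = ["AGE", "SEX"]
    combined = portal + tag_new_study(["AGE", "GRADE"])
    # Both AGE attributes stay distinguishable after merging the lists.
    print([name for name in combined if is_new_study(name)])
```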

Running the script also produces several output files:

report.html - HTML report containing the attribute matches and several figures.
dendrogram.png - An image showing the dendrogram obtained for similarity detection based on attribute values.
n_attribute_distribution.png - An image showing the number of attributes contained in the test study relative to all studies on the cBioPortal.
n_common_attribute_distribution.png - An image showing the number of attributes in common with studies on the cBioPortal.
n_unique_attribute_distribution.png - An image showing the number of unique attributes in the test study relative to all studies on the cBioPortal.
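The counts behind the last two figures can be thought of as set operations on attribute names. The sketch below is one plausible reading, assuming "common" means shared with a given study and "unique" means absent from every existing study; the function name and the study identifiers in the usage note are hypothetical.

```python
def attribute_overlap(new_attrs, existing):
    """Compare a new study's attribute names with existing studies.

    existing maps a study id to an iterable of attribute names.
    Returns (common_counts, unique): the number of shared attributes per
    study, and the sorted new-study attributes found in no existing study.
    """
    new = set(new_attrs)
    common_counts = {study: len(new & set(attrs))
                     for study, attrs in existing.items()}
    seen_anywhere = set().union(*existing.values())
    unique = sorted(new - seen_anywhere)
    return common_counts, unique
```

With existing = {"study_a": ["AGE", "SEX"], "study_b": ["AGE", "OS_STATUS"]} and a new study holding AGE, SEX and GRADE, the function reports two attributes in common with study_a, one with study_b, and GRADE as the only unique attribute.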
