Personalized-cancer-diagnosis

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Business Problem

Description

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Memorial Sloan Kettering Cancer Center (MSKCC)

Download training_variants.zip and training_text.zip from Kaggle.

Context:

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

Problem statement:

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Source/Useful Links

Some articles and reference blogs about the problem statement

Real-world/Business objectives and constraints.

* No low-latency requirement.
* Interpretability is important.
* Errors can be very costly.
* The probability of a data point belonging to each class is needed.

Data Overview

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files: one contains information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
Both data files share a common column called ID.
Data file information:
- training_variants (ID, Gene, Variation, Class)
- training_text (ID, Text)
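
As a rough illustration of how the two files relate through ID, here is a minimal standard-library sketch. It assumes training_variants is a plain CSV and that each training_text record sits on one line as `ID||Text` after an `ID,Text` header. The function names (`read_variants`, `read_text`, `merge_on_id`) are hypothetical and not taken from the notebooks, and the second text record is a placeholder.

```python
import csv

def read_variants(lines):
    """Parse training_variants: a regular CSV with a header row."""
    return {int(row["ID"]): row for row in csv.DictReader(lines)}

def read_text(lines):
    """Parse training_text: an 'ID,Text' header, then 'ID||Text' records."""
    records = {}
    for line in list(lines)[1:]:  # skip the header line
        line = line.rstrip("\n")
        if not line:
            continue
        id_part, _, text = line.partition("||")
        records[int(id_part)] = text
    return records

def merge_on_id(variants, texts):
    """Inner join of the two tables on the shared ID column."""
    return [
        {**variants[i], "Text": texts[i]}
        for i in sorted(variants.keys() & texts.keys())
    ]

# Tiny in-memory stand-ins for the real files (assumption: one record per line).
variant_lines = [
    "ID,Gene,Variation,Class",
    "0,FAM58A,Truncating Mutations,1",
    "1,CBL,W802*,2",
]
text_lines = [
    "ID,Text",
    "0||Cyclin-dependent kinases (CDKs) regulate ...",
    "1||(placeholder text for the second record)",
]
merged = merge_on_id(read_variants(variant_lines), read_text(text_lines))
```

With the real files, the same functions could be fed open file handles instead of lists.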

Example Data Point

training_variants

ID,Gene,Variation,Class
0,FAM58A,Truncating Mutations,1
1,CBL,W802*,2
2,CBL,Q249E,2
...

training_text

ID,Text
0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cyclin M silencing phenocopies CDK10 silencing in increasing c-Raf and in conferring tamoxifen resistance to breast cancer cells. CDK10/cyclin M phosphorylates ETS2 in vitro, and in cells it positively controls ETS2 degradation by the proteasome. ETS2 protein levels are increased in cells derived from a STAR patient, and this increase is attributable to decreased cyclin M levels. Altogether, our results reveal an additional regulatory mechanism for ETS2, which plays key roles in cancer and development. They also shed light on the molecular mechanisms underlying STAR syndrome.Cyclin-dependent kinases (CDKs) play a pivotal role in the control of a number of fundamental cellular processes (1). The human genome contains 21 genes encoding proteins that can be considered as members of the CDK family owing to their sequence similarity with bona fide CDKs, those known to be activated by cyclins (2). Although discovered almost 20 y ago (3, 4), CDK10 remains one of the two CDKs without an identified cyclin partner. 
This knowledge gap has largely impeded the exploration of its biological functions. CDK10 can act as a positive cell cycle regulator in some cells (5, 6) or as a tumor suppressor in others (7, 8). CDK10 interacts with the ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2) transcription factor and inhibits its transcriptional activity through an unknown mechanism (9). CDK10 knockdown derepresses ETS2, which increases the expression of the c-Raf protein kinase, activates the MAPK pathway, and induces resistance of MCF7 cells to tamoxifen (6). ...

A genetic mutation can be classified into one of nine classes, so this is a multi-class classification problem.

Performance Metric

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation

Metric(s):

Multi class log-loss
Confusion matrix
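
To make the two metrics concrete, here is a small sketch in plain Python. It assumes class labels run from 1 to 9 and that each prediction is a list of nine probabilities; the function names are illustrative only. Predicting 1/9 for every class gives a log loss of ln 9 (about 2.197), which is a useful reference point when reading the scores reported for the models.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: class labels 1..K; probs: one list of K probabilities per sample.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label - 1], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p_true)
    return total / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] = count of samples with true class i+1 predicted as j+1."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix

n_classes = 9
y_true = [1, 2, 2, 7]
uniform = [[1 / n_classes] * n_classes for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)  # equals ln(9)
```

The share of misclassified points is the sum of the off-diagonal entries of the confusion matrix divided by the number of samples.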

Summary of single features

Here is a comparison of our models.

| Vectorization | Feature | Model | Log loss | Misclassified points |
|---|---|---|---|---|
| -- | -- | Random | 2.59 | -- |
| BoW | Gene | Logistic regression | 1.19 | -- |
| BoW | Variation | Logistic regression | 1.68 | -- |
| BoW | Text | Logistic regression | 1.20 | -- |
| BoW | Text | Logistic regression | 1.20 | -- |

Stacking the three types of features

| Vectorization | Features | Model | Log loss | Misclassified points |
|---|---|---|---|---|
| BoW | All 3 | MultinomialNB | 1.19 | 35% |
| BoW | All 3 | KNN | 0.98 | 33% |
| BoW | All 3 | Logistic regression with balanced class weight | 1.15 | 31% |
| BoW | All 3 | Linear SVM | 1.05 | 32% |
| BoW | All 3 | Random forests | 1.20 | 41% |
| Response coding | All 3 | Random forests | 1.28 | 45% |
| BoW | All 3 | Stacking classifier | 1.99 | 50% |
| BoW | All 3 | Maximum voting classifier | 1.38 | 35% |
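
The results above list response coding as an alternative to BoW. As a hedged sketch of the general idea (not necessarily what the notebooks do), each categorical value such as a gene name is replaced by a vector of smoothed class probabilities estimated on the training split. The Laplace smoothing constant `alpha`, the uniform fallback for unseen values, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_response_coding(values, labels, n_classes=9, alpha=1.0):
    """Map each category value to smoothed P(class | value) over classes 1..K."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, class_counts in counts.items():
        total = sum(class_counts.values())
        table[value] = [
            (class_counts[k] + alpha) / (total + alpha * n_classes)
            for k in range(1, n_classes + 1)
        ]
    return table

def transform_response_coding(values, table, n_classes=9):
    """Encode values; unseen categories fall back to a uniform vector."""
    uniform = [1 / n_classes] * n_classes
    return [table.get(value, uniform) for value in values]

# Fit on training data only, then reuse the table for CV and test splits.
table = fit_response_coding(["CBL", "CBL", "FAM58A"], [2, 2, 1])
encoded = transform_response_coding(["CBL", "UNSEEN_GENE"], table)
```

Fitting the table on the training split alone keeps class information from the CV and test splits out of the features.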

TFIDF

Summary of single features (TFIDF)

Here is a comparison of our models.

| Vectorization | Feature | Model | Log loss | Misclassified points |
|---|---|---|---|---|
| Tfidf | Gene | Logistic regression | 1.32 | -- |
| Tfidf | Variation | Logistic regression | 1.74 | -- |
| Tfidf | Text | Logistic regression | 1.12 | -- |
| Tfidf | Text | Logistic regression | 1.20 | -- |
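
For readers unfamiliar with TF-IDF, the following is a minimal sketch of one common smoothed definition: term count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. Whether the notebooks use exactly this weighting, and the whitespace tokenisation used here, are assumptions; the toy documents are placeholders.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["cdk10 cyclin kinase", "cyclin mutation"]
vectors = tfidf(docs)
```

Terms that occur in fewer documents (here `cdk10`) receive a larger weight than terms shared by all documents (here `cyclin`).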

Stacking the three types of features

| Vectorization | Features | Model | Log loss | Misclassified points |
|---|---|---|---|---|
| Tfidf | All 3 | MultinomialNB | 1.35 | 41% |
| Tfidf | All 3 | Logistic regression without balanced class weight | 1.34 | 37% |
| Tfidf | All 3 | Logistic regression with balanced class weight | 1.14 | 37% |
| Tfidf | All 3 | Linear SVM | 1.05 | 32% |
| Tfidf | All 3 | Random forests | 1.07 | 34% |
| Tfidf | All 3 | Linear SVM | 1.44 | 41% |
| Tfidf | All 3 | Stacking classifier | 2.06 | 69% |
| Tfidf | All 3 | Maximum voting classifier | 1.42 | 42% |
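
Several rows compare logistic regression with and without balanced class weights. A common way to compute such weights is `n_samples / (n_classes * class_count)`, the heuristic scikit-learn documents for `class_weight="balanced"`. The sketch below shows that formula in plain Python; that the notebooks rely on this exact heuristic is an assumption, and the label list is a toy example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }

# Toy labels: class 1 is three times as frequent as class 2.
weights = balanced_class_weights([1, 1, 1, 2])
```

Rare classes get weights above 1 and frequent classes get weights below 1, so errors on rare classes cost more during training.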

Taking the CV logloss to below 1

Log loss of CV below 1

Here is a comparison of our models.

| Model | Train log loss | Test log loss | CV log loss | Misclassified points |
|---|---|---|---|---|
| Logistic regression with weight balancing | 0.488 | 0.55 | 0.541 | 12% |
| Logistic regression without weight balancing | 0.488 | 0.55 | 0.541 | 12% |
| Linear SVM | 0.463 | 0.551 | 0.544 | 11% |
| Naive Bayes | 0.811 | 0.821 | 0.77 | 23% |
| KNN | 0.028 | 0.029 | 0.024 | 0.3% |
| RF | 0.337 | 0.527 | 0.509 | 16% |
