
ez-prediction

  • Sometimes it is desirable to use clinical information such as age, gender, and clinical indices, or high-throughput experimental data, to predict disease outcome, i.e. to perform so-called molecular diagnosis.

  • Traditional machine learning methods work well for these purposes. This repo contains scripts for clinical prediction, all implemented as R Markdown notebooks.

  • Before you perform such machine learning, you should first visualize your data with PCA, MDS, or hierarchical clustering, colored by sample labels (a minimal sketch follows this list).

    • In some cases, for example tumor samples versus tumor-adjacent normal samples, the classes form highly distinct clusters; supervised learning is then expected to reach very high accuracy and may not even be necessary.
    • Supervised learning is more useful for identifying subtle differences, such as tumor samples from patients with good versus bad prognosis, or even subtler ones, such as plasma samples from cancer patients versus healthy donors.
    • The sample data here is COAD tumor and paired tumor-adjacent normal tissue data from TCGA. As described below, the two groups are quite distinct, and accuracy on the test set should be near 100%. One would never perform such an analysis in real practice; this data is only used to illustrate how to use some machine learning packages in R.
  • Several R packages are required:

    • edgeR: for data normalization and identification of differentially expressed genes
    • caret: for dataset splitting
    • pROC: for performance evaluation
    • glmnet: for (regularized) logistic regression
    • e1071: for SVM
    • randomForest: for random forest
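
  • A minimal visualization sketch, assuming `logcpm` is a genes × samples log-CPM matrix (produced as in the preprocessing sketch below) and `labels` is a factor of sample classes; both names are placeholders:

```r
# PCA on samples: transpose so rows are samples, columns are genes
pca <- prcomp(t(logcpm))

# Plot the first two principal components, colored by sample label
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(labels), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(labels),
       col = seq_along(levels(labels)), pch = 19)
```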

Prepare input data

Preprocessing
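
  • The notebooks contain the details; as a rough sketch, count data can be normalized with edgeR along these lines (`counts` and `labels` are placeholder names, and the notebooks may differ in detail):

```r
library(edgeR)

# counts: raw count matrix (genes x samples); labels: factor of sample classes
dge <- DGEList(counts = counts, group = labels)

# Drop genes with too few reads to be informative
keep <- filterByExpr(dge)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# TMM normalization, then log2-CPM values for downstream learning
dge <- calcNormFactors(dge)
logcpm <- cpm(dge, log = TRUE)
```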

Dataset splitting

  • See notebooks/splitting.Rmd
  • If you want samples with the same label to be distributed evenly between the training and testing sets, perform stratified splitting, as in the sketch below.
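
  • A minimal stratified split with caret, building on the preprocessing sketch above (the 80/20 ratio and the variable names are illustrative):

```r
library(caret)

# createDataPartition samples within each level of the factor,
# so class proportions are preserved in both subsets
set.seed(42)
train_idx <- createDataPartition(labels, p = 0.8, list = FALSE)

x <- t(logcpm)                 # samples x genes
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]
y_train <- labels[train_idx]
y_test  <- labels[-train_idx]
```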

Model training

Feature selection
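
  • The differential genes identified with edgeR can serve as features. A hedged sketch, reusing `dge` from the preprocessing sketch (in a real analysis, selection should be performed on training samples only to avoid information leakage; the cutoff of 50 genes is an arbitrary illustrative choice):

```r
library(edgeR)

# Estimate dispersions and test for differential expression
# between the two groups (classic exact test)
dge <- estimateDisp(dge)
et <- exactTest(dge)

# Keep the top-ranked genes as features
top_genes <- rownames(topTags(et, n = 50)$table)
x_train_sel <- x_train[, top_genes]
x_test_sel  <- x_test[, top_genes]
```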

Model fitting

  • Note that R packages typically distinguish regression tasks from classification tasks by the data type of the response variable. This is different from sklearn, which provides separate APIs for classification and regression.
  • If your response is a factor, the model will perform classification; if your response is a numeric vector, it will perform regression.
  • So for classification, make sure your response variable is a factor.
  • Logistic regression, SVM, random forest, or gradient boosting?
    • Any of them can work; a minimal fitting sketch follows this list.
    • If you emphasize interpretability rather than performance, use logistic regression.
    • If you want your model to tolerate dirty data (minimal preprocessing), run fast, and still perform well, use a tree-based method (random forest or gradient boosting).
    • SVM is also a good choice in most situations.
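
  • A minimal fitting sketch with glmnet and randomForest, continuing from the splits above (the factor response makes both perform classification):

```r
library(glmnet)
library(randomForest)

# Lasso-regularized logistic regression with cross-validated lambda;
# family = "binomial" makes this a binary classifier
cvfit <- cv.glmnet(as.matrix(x_train_sel), y_train,
                   family = "binomial", alpha = 1)
prob_glmnet <- predict(cvfit, newx = as.matrix(x_test_sel),
                       s = "lambda.min", type = "response")[, 1]

# Random forest: the factor response triggers classification
rf <- randomForest(x = x_train_sel, y = y_train)
prob_rf <- predict(rf, x_test_sel, type = "prob")[, 2]
```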

Parameter tuning

  • See notebooks/tune.Rmd
  • Default parameters usually work quite well in most situations.
  • If you want to tune parameters, use K-fold cross-validation, or leave-one-out cross-validation if the sample size is very small (see the sketch below).
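
  • A hedged tuning sketch with caret (5 folds and the random forest method are illustrative choices):

```r
library(caret)

# 5-fold cross-validation; use method = "LOOCV" instead
# if the sample size is very small
ctrl <- trainControl(method = "cv", number = 5)

tuned <- train(x = x_train_sel, y = y_train,
               method = "rf", trControl = ctrl, tuneLength = 5)
tuned$bestTune  # selected parameter values (mtry for random forest)
```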

Performance evaluation

  • notebooks/performance.Rmd
  • In the binary case, for each sample the model outputs P(y_i = 1 | X_i, Model).
  • See https://en.wikipedia.org/wiki/Confusion_matrix and https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  • Also see https://people.inf.elte.hu/kiss/11dwhdm/roc.pdf
  • Some aliases:
    • sensitivity = recall = TPR
    • FPR = 1 - specificity
    • precision = PPV
  • We calculate the metrics below from the known labels y_i (binary values in {0, 1}) and the predicted probabilities P(y_i = 1 | X_i, Model).
  • To calculate recall and precision, we must specify a cutoff, and different cutoffs give different FPR, recall, and precision values. That is, every (FPR, TPR) pair, i.e. (1 - specificity, sensitivity), is a point on the ROC (receiver operating characteristic) curve, and every (recall, precision) pair is a point on the precision-recall curve (PRC).
  • Traversing all possible cutoffs over the whole validation set yields a single ROC curve and a single PRC curve, and hence a single AUROC value and a single AUPRC value.
  • For clinical applications, AUROC seems to be reported in most publications; sensitivity and specificity are sometimes reported as well, along with confidence intervals. As there are many possible (1 - specificity, sensitivity) pairs, we often take the point closest to the top-left corner (closest.topleft in pROC) or the point that maximizes sensitivity + specificity (youden in pROC); see the sketch below.
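
  • A minimal pROC sketch over the test set predictions from the fitting sketch above:

```r
library(pROC)

roc_obj <- roc(response = y_test, predictor = prob_rf)

auc(roc_obj)     # AUROC
ci.auc(roc_obj)  # confidence interval for the AUROC

# Operating point maximizing sensitivity + specificity;
# use best.method = "closest.topleft" for the top-left criterion
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```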
