Merge branch 'master' into dev

xiexincn · Aug 13, 2020 · fabef54
2 parents 78bf8c4 + 8f4b8cb
Showing 26 changed files with 437 additions and 778 deletions.
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -3,3 +3,7 @@ v<0.0.1>, <08/03/2020> -- Initial release & minor example fix.
 v<0.0.2>, <08/04/2020> -- Set up autotests.
 v<0.0.2>, <08/05/2020> -- Enable read the docs.
 v<0.0.2>, <08/06/2020> -- Add models and remove task parameter from model invocation.
+v<0.0.3>, <08/10/2020> -- Massive code refactoring.
+v<0.0.3>, <08/11/2020> -- Add GPU support.
+v<0.0.3>, <08/12/2020> -- Add support for image and clinical notes.
+
diff --git a/README.rst b/README.rst
@@ -66,7 +66,7 @@ Python Library for Healthcare AI (PyHealth)
 
 -----
 
-**Development Status**: **As of 08/04/2020, PyHealth is under active development and in its alpha stage. Please follow, star, and fork to get the latest functions**!
+**Development Status**: **As of 08/12/2020, PyHealth is under active development and in its alpha stage. Please follow, star, and fork to get the latest functions**!
 
 
 **PyHealth** is a comprehensive and flexible **Python library** for **healthcare AI**, designed for both **ML researchers** and **medical practitioners**.
@@ -75,7 +75,6 @@ PyHealth makes many important healthcare tasks become accessible, such as **phen
 **ICU length stay forecasting**, etc. Running these prediction tasks with deep learning models can be as short as 10 lines of code.
 
 
-
 PyHealth comes with three major modules: (i) *data preprocessing module*; (ii) *learning module*
 and (iii) *evaluation module*. Typically, one can run the data prep module to prepare the data, then feed to the learning module for prediction, and finally assess
 the result with the evaluation module.
@@ -89,6 +88,7 @@ PyHealth is featured for:
 
 * **Unified APIs, detailed documentation, and interactive examples** across various datasets and algorithms.
 * **Advanced models**\ , including **latest deep learning models** and **classical machine learning models**.
+* **Wide coverage**, supporting **sequence data**, **image data**, and **text data** like clinical notes.
 * **Optimized performance with JIT and parallelization** when possible, using `numba <https://github.com/numba/numba>`_ and `joblib <https://github.com/joblib/joblib>`_.
 * **Customizable modules and flexible design**: each module may be turned on/off or totally replaced by custom functions. The trained models can be easily exported and reloaded for fast exexution and deployment.
 
@@ -99,15 +99,18 @@ PyHealth is featured for:
 
 
        # load pre-processed CMS dataset
-       from pyhealth.data.expdata_generator import cms as cms_expdata_generator
+       from pyhealth.data.expdata_generator import sequencedata as expdata_generator
 
-       cur_dataset = cms_expdata_generator(exp_id=exp_id, sel_task='phenotyping')
-       cur_dataset.get_exp_data()
+       expdata_id = '2020.0810.data.mortality.mimic'
+       cur_dataset = expdata_generator(exp_id=exp_id)
+       cur_dataset.get_exp_data(sel_task='mortality', )
        cur_dataset.load_exp_data()
 
        # initialize the model for training
-       from pyhealth.models.lstm import LSTM
-       clf = LSTM(exp_id)  # LSTM related parameters can be set here
+       from pyhealth.models.sequence.lstm import LSTM
+       # enable GPU
+       clf = LSTM(expmodel_id=expmodel_id, n_batchsize=20, use_gpu=True,
+           n_epoch=100, gpu_ids='0,1')
        clf.fit(cur_dataset.train, cur_dataset.valid)
 
        # load the best model for inference
@@ -116,9 +119,10 @@ PyHealth is featured for:
        pred_results = clf.get_results()
 
        # evaluate the model
-       from pyhealth import evaluation
-       evaluator = evaluation.__dict__['phenotyping']
-       r = evaluator(pred_results['hat_y'], pred_results['y'])
+       from pyhealth.evaluation.evaluator import func
+       r = func(pred_results['hat_y'], pred_results['y'])
+       print(r)
+
 
 
 **Citing PyHealth**\ :
@@ -240,27 +244,27 @@ EHU-Claim            CMS               DE-SynPUF: CMS 2008-2010 Data Entrepreneu
 
 You may download the above datasets at the links. The structure of the generated datasets can be found in datasets folder:
 
-* \\datasets\\cms\\x_datat\\...csv
+* \\datasets\\cms\\x_data\\...csv
 * \\datasets\\cms\\y_data\\phenotyping.csv
 * \\datasets\\cms\\y_data\\mortality.csv
 
-The processed datasets (X,y) should be put in x_data, y_data correspondingly, to be appropriately digested by deep learning models.
+The processed datasets (X,y) should be put in x_data, y_data correspondingly, to be appropriately digested by deep learning models. We include some sample datasets under \\datasets folder.
 
 **(ii) Machine Learning and Deep Learning Models** :
 
-===================  ================  ======================================================================================================  =====  ========================================
+===================  ================  ========================================  ======================================================================================================  =====  ========================================
 Type                 Abbr              Algorithm                                                                                               Year   Ref
-===================  ================  ======================================================================================================  =====  ========================================
-Classical Models     LogisticReg       Logistic Regression                                                                                     N/A
-Classical Models     XGBoost           XGBoost: A scalable tree boosting system                                                                2016   [#Chen2016Xgboost]_
-Neural Networks      LSTM              Long short-term memory                                                                                  1997   [#Hochreiter1997Long]_
-Neural Networks      GRU               Gated recurrent unit                                                                                    2014   [#Cho2014Learning]_
-Neural Networks      RETAIN            RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism         2016   [#Choi2016RETAIN]_
-Neural Networks      Dipole            Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks  2017   [#Ma2017Dipole]_
-Neural Networks      tLSTM             Patient Subtyping via Time-Aware LSTM Networks                                                          2017   [#Baytas2017tLSTM]_
-Neural Networks      RAIM              RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data                     2018   [#Xu2018RAIM]_
-Neural Networks      StageNet          StageNet: Stage-Aware Neural Networks for Health Risk Prediction                                        2020   [#Gao2020StageNet]_
-===================  ================  ======================================================================================================  =====  ========================================
+===================  ================  ========================================  ======================================================================================================  =====  ========================================
+Classical Models     LogisticReg       pyhealth.models.sequence.lr               Logistic Regression                                                                                     N/A
+Classical Models     XGBoost           pyhealth.models.sequence.lr.xgboost       XGBoost: A scalable tree boosting system                                                                2016   [#Chen2016Xgboost]_
+Neural Networks      LSTM              pyhealth.models.sequence.lstm             Long short-term memory                                                                                  1997   [#Hochreiter1997Long]_
+Neural Networks      GRU               pyhealth.models.sequence.gru              Gated recurrent unit                                                                                    2014   [#Cho2014Learning]_
+Neural Networks      RETAIN            pyhealth.models.sequence.retain           RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism         2016   [#Choi2016RETAIN]_
+Neural Networks      Dipole            pyhealth.models.sequence.dipole           Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks  2017   [#Ma2017Dipole]_
+Neural Networks      tLSTM             pyhealth.models.sequence.tlstm            Patient Subtyping via Time-Aware LSTM Networks                                                          2017   [#Baytas2017tLSTM]_
+Neural Networks      RAIM              pyhealth.models.sequence.raim             RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data                     2018   [#Xu2018RAIM]_
+Neural Networks      StageNet          pyhealth.models.sequence.stagenet         StageNet: Stage-Aware Neural Networks for Health Risk Prediction                                        2020   [#Gao2020StageNet]_
+===================  ================  ========================================  ======================================================================================================  =====  ========================================
 
 Examples of running ML and DL models can be found below, or directly at \\examples\\learning_examples\\
 
@@ -356,8 +360,12 @@ scripts to generate the customized datasets.
 Quick Start for Running Predictive Models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-`"examples/learning_models/lstm_cms_example.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/lstm_cms_example.py>`_
-demonstrates the basic API of using LSTM for phenotyping prediction. **It is noted that the API across all other algorithms are consistent/similar**.
+
+Before running examples, you need the datasets. Please download from the GitHub repository `"datasets" <https://github.com/yzhao062/PyHealth/tree/master/datasets>`_.
+You can either unzip them manually or running our script `"00_extract_data_run_before_learning.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/00_extract_data_run_before_learning.py>`_
+
+`"examples/learning_models/example_sequence_gpu_mortality.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/example_sequence_gpu_mortality.py>`_
+demonstrates the basic API of using GRU for mortality prediction. **It is noted that the API across all other algorithms are consistent/similar**.
 
 **If you do not have the preprocessed datasets yet, download the \\datasets folder (cms.zip and mimic.zip) from PyHealth repository, and run \\examples\\learning_models\\extract_data_run_before_learning.py to prepare/unzip the datasets.**
 
@@ -367,10 +375,11 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # load pre-processed CMS dataset
-      from pyhealth.data.expdata_generator import cms as cms_expdata_generator
+      from pyhealth.data.expdata_generator import sequencedata as expdata_generator
 
-      cur_dataset = cms_expdata_generator(exp_id=exp_id, sel_task='phenotyping')
-      cur_dataset.get_exp_data()
+      expdata_id = '2020.0810.data.mortality.mimic'
+      cur_dataset = expdata_generator(exp_id=exp_id)
+      cur_dataset.get_exp_data(sel_task='mortality', )
       cur_dataset.load_exp_data()
 
 
@@ -379,8 +388,10 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # initialize the model for training
-      from pyhealth.models.lstm import LSTM
-      clf = LSTM(exp_id)
+      from pyhealth.models.sequence.lstm import LSTM
+      # enable GPU
+      clf = LSTM(expmodel_id=expmodel_id, n_batchsize=20, use_gpu=True,
+          n_epoch=100, gpu_ids='0,1')
       clf.fit(cur_dataset.train, cur_dataset.valid)
 
 #. Load the best shot of the training, predict on the test datasets
@@ -398,9 +409,9 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # evaluate the model
-      from pyhealth import evaluation
-      evaluator = evaluation.__dict__['phenotyping']
-      r = evaluator(pred_results['hat_y'], pred_results['y'])
+      from pyhealth.evaluation.evaluator import func
+      r = func(pred_results['hat_y'], pred_results['y'])
+      print(r)
 
 
 
@@ -417,7 +428,6 @@ Blueprint & Development Plan
 The long term goal of PyHealth is to become a comprehensive healthcare AI toolkit that supports
 beyond EHR data, but also the images and clinical notes.
 
-- The support of image datasets and clinical notes
 - The compatibility and the support of OMOP format datasets
 - Model persistence (save, load, and portability)
 - The release of a benchmark paper with PyHealth

diff --git a/datasets/image.zip b/datasets/image.zip
diff --git a/docs/example.rst b/docs/example.rst
@@ -70,8 +70,11 @@ scripts to generate the customized datasets.
 Quick Start for Running Predictive Models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-`"examples/learning_models/lstm_cms_example.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/lstm_cms_example.py>`_
-demonstrates the basic API of using LSTM for phenotyping prediction. **It is noted that the API across all other algorithms are consistent/similar**.
+Before running examples, you need the datasets. Please download from the GitHub repository `"datasets" <https://github.com/yzhao062/PyHealth/tree/master/datasets>`_.
+You can either unzip them manually or running our script `"00_extract_data_run_before_learning.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/00_extract_data_run_before_learning.py>`_
+
+`"examples/learning_models/example_sequence_gpu_mortality.py" <https://github.com/yzhao062/pyhealth/blob/master/examples/learning_models/example_sequence_gpu_mortality.py>`_
+demonstrates the basic API of using GRU for mortality prediction. **It is noted that the API across all other algorithms are consistent/similar**.
 
 **If you do not have the preprocessed datasets yet, download the \\datasets folder (cms.zip and mimic.zip) from PyHealth repository, and run \\examples\\learning_models\\extract_data_run_before_learning.py to prepare/unzip the datasets.**
 
@@ -81,10 +84,11 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # load pre-processed CMS dataset
-      from pyhealth.data.expdata_generator import cms as cms_expdata_generator
+      from pyhealth.data.expdata_generator import sequencedata as expdata_generator
 
-      cur_dataset = cms_expdata_generator(exp_id=exp_id, sel_task='phenotyping')
-      cur_dataset.get_exp_data()
+      expdata_id = '2020.0810.data.mortality.mimic'
+      cur_dataset = expdata_generator(exp_id=exp_id)
+      cur_dataset.get_exp_data(sel_task='mortality', )
       cur_dataset.load_exp_data()
 
 
@@ -93,8 +97,10 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # initialize the model for training
-      from pyhealth.models.lstm import LSTM
-      clf = LSTM(exp_id)
+      from pyhealth.models.sequence.lstm import LSTM
+      # enable GPU
+      clf = LSTM(expmodel_id=expmodel_id, n_batchsize=20, use_gpu=True,
+          n_epoch=100, gpu_ids='0,1')
       clf.fit(cur_dataset.train, cur_dataset.valid)
 
 #. Load the best shot of the training, predict on the test datasets
@@ -112,7 +118,7 @@ demonstrates the basic API of using LSTM for phenotyping prediction. **It is not
    .. code-block:: python
 
       # evaluate the model
-      from pyhealth import evaluation
-      evaluator = evaluation.__dict__['phenotyping']
-      r = evaluator(pred_results['hat_y'], pred_results['y'])
+      from pyhealth.evaluation.evaluator import func
+      r = func(pred_results['hat_y'], pred_results['y'])
+      print(r)
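
The updated evaluation snippet calls `func(pred_results['hat_y'], pred_results['y'])`, but this diff does not show what `func` computes. The sketch below is only an illustration of the calling pattern for a binary task such as mortality. The function name `evaluate_binary`, the `threshold` parameter, the choice of metrics, and the sample values are all assumptions and are not taken from PyHealth.

```python
from typing import Dict, Sequence


def evaluate_binary(hat_y: Sequence[float], y: Sequence[int],
                    threshold: float = 0.5) -> Dict[str, float]:
    """Hypothetical stand-in evaluator: accuracy and AUROC for binary labels."""
    if len(hat_y) != len(y):
        raise ValueError("hat_y and y must have the same length")

    # Split predicted scores by true label.
    pos = [p for p, t in zip(hat_y, y) if t == 1]
    neg = [p for p, t in zip(hat_y, y) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    # Accuracy at a fixed decision threshold.
    correct = sum(int(p >= threshold) == t for p, t in zip(hat_y, y))
    accuracy = correct / len(y)

    # AUROC as the fraction of (positive, negative) pairs ranked correctly;
    # tied scores count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auroc = wins / (len(pos) * len(neg))

    return {"accuracy": accuracy, "auroc": auroc}


# Made-up predictions in the same dict layout the snippet above indexes into.
pred_results = {"hat_y": [0.9, 0.2, 0.6, 0.4], "y": [1, 0, 0, 1]}
r = evaluate_binary(pred_results["hat_y"], pred_results["y"])
print(r)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a small illustration. For real test sets, a rank-based implementation such as scikit-learn's `roc_auc_score` would be the more usual choice.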