Lightweight Python implementation of Normative Modelling with Gaussian Processes, LOESS & Centiles approaches.

For a more advanced implementation, see the Python librairie PCNtoolkit.

Installation

To install pynm:

$ pip install pynm

Alternatively, for development purposes, clone this repository and run:

$ git clone https://github.com/ppsp-team/PyNM
$ cd PyNM
$ python setup.py develop

All code for PyNM is written in Python (Python>=3.5). See requirements.txt for a full list of dependencies.

Usage

usage: pynm [-h] --pheno_p PHENO_P --out_p OUT_P [--confounds CONFOUNDS]
            [--conf CONF] [--score SCORE] [--group GROUP] [--method METHOD]
            [--num_epochs NUM_EPOCHS] [--n_inducing N_INDUCING]
            [--batch_size BATCH_SIZE] [--length_scale LENGTH_SCALE] [--nu NU]

optional arguments:
  -h, --help                        show this help message and exit
  
  --pheno_p PHENO_P                 Path to phenotype data. Data must be in a .csv file.
  
  --out_p OUT_P                     Path to output directory.
  
  --confounds CONFOUNDS             List of confounds to use in the GP model.The list must
                                    formatted as a string with commas between confounds,
                                    each confound must be a column name from the phenotype
                                    .csv file. Categorical confounds must be denoted by
                                    with C(): e.g. 'C(SEX)' for column name 'SEX'. Default
                                    value is 'age'.
                                    
  --conf CONF                       Single numerical confound to use in LOESS & centile
                                    models. Must be a column name from the phenotype .csv
                                    file. Default value is 'age'.
                                    
  --score SCORE                     Response variable for all models. Must be a column
                                    title from phenotype .csv file. Default value is 'score'.
                                    
  --group GROUP                     Column name from the phenotype .csv file that
                                    distinguishes probands from controls. The column must
                                    be encoded with str labels using 'PROB' for probands
                                    and 'CTR' for controls or with int labels using 1 for
                                    probands and 0 for controls. Default value is 'group'.
                                    
  --method METHOD                   Method to use for the GP model. Can be set to
                                    'auto','approx' or 'exact'. In 'auto' mode, the exact
                                    model will be used for datasets smaller than 1000 data
                                    points. SVGP is used for the approximate model. 
                                    See documentation for details. Default value is 'auto'.
                                    
  --num_epochs NUM_EPOCHS           Number of training epochs for SVGP model. 
                                    See documentation for details. Default value is 20.
                                    
  --n_inducing N_INDUCING           Number of inducing points for SVGP model. 
                                    See documentation for details. Default value is 500.
  --batch_size BATCH_SIZE           Batch size for training and predicting from SVGP
                                    model. See documentation for details. Default value is 256.
  --length_scale LENGTH_SCALE       Length scale of Matern kernel for exact model. 
                                    See documentation for details. Default value is 1.
                                    
  --nu NU                           Nu of Matern kernel for exact and SVGP model.
  
  --train_sample TRAIN_SAMPLE       On what subset to train the model, can be 'controls',
                                    'manual', or a value in (0,1]. Default value is 'controls'.

Documentation

All the functions have the classical Python DocStrings that you can summon with help(). You can also see the tutorials for documented examples.

Training sample

By default, the models are fit on all the controls in the dataset and prediction is then done on the entire dataset. The residuals (scores of the normative model) are then calculated as the difference between the actual value and predicted value for each subject. This paradigm is not meant for situations in which the residuals will then be used in a prediction setting, since any train/test split stratified by proband/control will have information from the training set leaked into the test data.

In order to avoid contaminating the test set, in a prediction setting it is important to fit the normative model on a subset of the controls and then leave those out. This is implemented in PyNM with the --train_sample flag. It can be used in three ways:

Number in (0,1]
- This is simplest usage that defines the sample size, PyNM will then select a random sample of the controls and use those as a training group.
- The subjects used in the sample are recorded in the column 'train_sample' of the resulting PyNM.data object. Subjects used in the training sample are encoded as 1s, and the rest as 0s.
'manual'
- It is also possible to specify exactly which subjects to use as a training group by providing a column in the input data labeled 'train_sample' encoded the same way.
'controls'
- This is the default setting that will fit the model on all the controls.

Gaussian Process Model

References

Original papers with Gaussian Processes (GP):

Marquand et al. Biological Psychiatry 2016 (doi:10.1016/j.biopsych.2015.12.023)
Marquand et al. Molecular Psychiatry 2019 (doi:10.1038/s41380-019-0441-1)

Example of use of the LOESS approach:

Lefebvre et al. Front. Neurosci. 2018 (doi:10.3389/fnins.2018.00662)
Maruani et al. Front. Psychiatry 2019 (doi:10.3389/fpsyt.2019.00011)

For the Centiles approach, see Bethlehem et al. Communications Biology 2020 (doi:10.1038/s42003-020-01212-9) with the R implementation here.

How to report errors

If you spot any bugs 🪲? Check out the open issues to see if we're already working on it. If not, open up a new issue and we will check it out when we can!

How to contribute

Thank you for considering contributing to our project! Before getting involved, please review our contribution guidelines.

Support

This work is supported by Compute Canada, IVADO, and FRQS.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Installation

Usage

Documentation

Training sample

Gaussian Process Model

References

How to report errors

How to contribute

Support

Files

README.md

Latest commit

History

README.md

File metadata and controls

Installation

Usage

Documentation

Training sample

Gaussian Process Model

References

How to report errors

How to contribute

Support