Representative marker selection, clustering, visualization, and binomial differential expression pipeline reported in Yuan et al. 2018, Mizrak et al. 2019, Levitin et al. 2019, and Szabo, Levitin et al. 2019.
The pipeline comes in two flavors: marker selection with fixed bin widths (as in Yuan et al. 2018 and Mizrak et al. 2019) or an updated procedure with rolling windows and a scaled dropout score (as in Levitin et al. 2019 and Szabo, Levitin et al. 2019).
This pipeline requires Python >= 3.6 and the following packages:
- scipy >= 1.1
- numpy
- pandas
- scikit-learn
- statsmodels
- matplotlib
- seaborn
- umap-learn
- phenograph == 1.5.2
The easiest way to set up a Python environment for cluster_diffex is with Anaconda.
conda create -n cluster_diffex_p36 python=3.6 scikit-learn statsmodels seaborn
# older anaconda
source activate cluster_diffex_p36
# XOR newer anaconda
conda activate cluster_diffex_p36
# Install UMAP
conda install -c conda-forge umap-learn
# Install phenograph
pip install git+
- Optionally, install loompy
pip install -U loompy
- Optionally, install dmaps. First make sure you have cmake. On Debian-based distributions, do:
sudo apt install cmake
XOR on OSX with homebrew do:
brew install cmake
XOR something else for a different OS/package manager.
Then, install dmaps:
pip install git+
Once you have set up the environment, clone this repository and install.
git clone git@github.com:simslab/cluster_diffex2018.git
cd cluster_diffex2018
pip install .
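After installation, a quick sanity check of the environment can save a failed run later. The snippet below is a minimal sketch, not part of the repository: it checks the interpreter version and reports which of the required packages cannot be found. The module names are the import names of the packages listed above (`sklearn` for scikit-learn, `umap` for umap-learn); the helper name `missing_modules` is made up for this example, and package versions are not checked.

```python
import importlib.util
import sys

# Import names of the required packages (these differ from the install
# names for scikit-learn -> sklearn and umap-learn -> umap).
REQUIRED_MODULES = [
    "scipy", "numpy", "pandas", "sklearn", "statsmodels",
    "matplotlib", "seaborn", "umap", "phenograph",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 6):
    print("Python >= 3.6 is required, found", sys.version.split()[0])

missing = missing_modules(REQUIRED_MODULES)
print("missing packages:", ", ".join(missing) if missing else "none")
```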
A typical run of the pipeline might look like:
python scripts/ --count-matrix UMI_MATRIX.txt -o OUTDIR -p PREFIX
where UMI_MATRIX.txt is a whitespace-delimited gene-by-cell UMI count matrix with two leading columns of gene attributes:
ENSEMBL_ID GENE_NAME UMICOUNT_CELL0 UMICOUNT_CELL1 ...
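To make the expected input layout concrete, here is a small sketch of a parser for a matrix in this format. It is an illustration only, not the pipeline's own loader: the function name `read_umi_matrix` and the placeholder gene ids are made up, and it assumes the file has no header line and contains integer counts.

```python
import io


def read_umi_matrix(handle):
    """Parse lines of the form: ENSEMBL_ID GENE_NAME count_cell0 count_cell1 ...

    Returns (gene_ids, gene_names, counts), where counts is a list of
    per-gene lists of integer UMI counts (one entry per cell).
    """
    gene_ids, gene_names, counts = [], [], []
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()  # splits on any run of whitespace
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected two gene columns plus counts")
        row = [int(x) for x in fields[2:]]
        if counts and len(row) != len(counts[0]):
            raise ValueError(f"line {line_no}: inconsistent number of cells")
        gene_ids.append(fields[0])
        gene_names.append(fields[1])
        counts.append(row)
    return gene_ids, gene_names, counts


# Tiny in-memory example with placeholder gene ids and names.
example = io.StringIO(
    "ENSG_EXAMPLE_1 GENE_A 0 2 1\n"
    "ENSG_EXAMPLE_2 GENE_B 0 0 0\n"
)
ids, names, umi = read_umi_matrix(example)
print(len(ids), "genes x", len(umi[0]), "cells")
```

With a real file, the same function can be called as `read_umi_matrix(open("UMI_MATRIX.txt"))`.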
To see more options (such as setting k for the k-nearest-neighbors graph, using a preselected list of markers from a file, setting thresholds for marker gene selection, using the older scoring scheme, or visualizing with tSNE or dmaps), run:
python scripts/ -h
In particular, to use the old scoring scheme, add the flag --unscaled-score.
The pipeline produces the following outputs:
- The dropout curve with marker genes colored in green.
- A record of the absolute threshold ('t') and the adaptive threshold. The smaller of the two is used as the cutoff for marker selection.
- Indices of the marker gene rows in the original count matrix after all rows containing only zeros have been removed.
- ENSEMBL_ID and GENE_NAME for the selected genes.
- Coordinates for the UMAP embedding of cells (determined using Spearman's correlation distance on marker genes).
- Plot of the UMAP embedding of cells (determined using Spearman's correlation distance on marker genes).
- Integer labels for Phenograph clusters (determined using Spearman's correlation distance on marker genes). -1 indicates an unclustered cell.
- Number of neighbors used for clustering (k) and final modularity of the clustering (Q).
- Plot of the UMAP embedding of cells (determined using Spearman's correlation distance) colored by Phenograph cluster.
- For a cluster CLUSTER_ID with N cells in the cluster and M cells in the rest of the dataset, a table of gene ids, names, count in cluster, count out of cluster, fdr_bh-corrected p-values, and effect sizes from a binomial test for binary upregulation. Ordered by effect size (decreasing).
- For a cluster CLUSTER_ID with N cells in the cluster and M cells in the rest of the dataset, a table of gene ids, names, count in cluster, count out of cluster, fdr_bh-corrected p-values, and effect sizes from a binomial test for binary downregulation. Ordered by negative effect size.
- Heatmap of normalized expression for the top differentially expressed genes (by a binomial test) in each cluster.
- JSON file of command line arguments.
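The upregulation tables described above combine a binomial test with Benjamini-Hochberg (fdr_bh) correction. The sketch below shows one common way to set up such a test using only the standard library; it is not taken from the pipeline's code. Assumptions of this sketch: a gene counts as "on" in a cell when its UMI count is nonzero, the out-of-cluster detection rate (with a pseudocount) serves as the null probability, and the effect size is the log2 ratio of detection rates. All function names are hypothetical. In practice, `scipy.stats.binom` and `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` offer the same building blocks.

```python
import math


def binom_sf_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def binary_upregulation(detected_in, n_in, detected_out, n_out):
    """Rows of (gene index, count in, count out, adjusted p, effect size).

    detected_in[g] / detected_out[g]: number of cells inside / outside the
    cluster with a nonzero count for gene g; n_in / n_out: cell totals.
    """
    pvals, effects = [], []
    for k_in, k_out in zip(detected_in, detected_out):
        # Pseudocounts keep both rates strictly between 0 and 1 (assumption).
        p_in = (k_in + 1) / (n_in + 2)
        p_out = (k_out + 1) / (n_out + 2)
        pvals.append(binom_sf_ge(k_in, n_in, p_out))
        effects.append(math.log2(p_in / p_out))
    qvals = benjamini_hochberg(pvals)
    rows = list(zip(range(len(pvals)), detected_in, detected_out, qvals, effects))
    rows.sort(key=lambda r: r[4], reverse=True)  # decreasing effect size
    return rows


# Toy example: 3 genes, 10 cells in the cluster, 100 cells outside it.
table = binary_upregulation([9, 2, 5], 10, [10, 20, 50], 100)
for gene, k_in, k_out, q, effect in table:
    print(gene, k_in, k_out, f"{q:.3g}", f"{effect:.2f}")
```

A downregulation test follows the same pattern with the lower tail, P(X <= k), in place of the upper tail.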