scDoRI is a deep learning model for single-cell multiome data (RNA + ATAC in the same cell) that infers enhancer-mediated gene regulatory networks (eGRNs). By combining an encoder–decoder approach with mechanistic constraints (enhancer–gene links, TF binding logic), scDoRI learns topics that group co-accessible peaks, their cis-linked genes, and upstream activator and repressor TFs, all while scaling to large datasets via mini-batches.
- 🔄 Unified approach: a single model for dimensionality reduction + eGRN inference
- 🧠 Learns topics that represent cell-state-specific regulatory programs
- 🧬 Continuous eGRN modelling: each cell is a mixture of topics, allowing the study of changes in GRNs. No need for predefined clusters
- 🧰 Scalable to large datasets via mini-batch training
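As a toy illustration of what "each cell is a mixture of topics" can mean, the sketch below blends per-topic TF-to-gene scores using one cell's topic proportions. It is not scDoRI's actual computation: the function name `mix_topic_grns`, the placeholder TF and gene labels, and the scores are all hypothetical.

```python
def mix_topic_grns(topic_weights, topic_grns):
    """Blend per-topic TF->gene scores using one cell's topic weights.

    topic_weights: one non-negative number per topic (normalised here).
    topic_grns:    one dict per topic mapping (tf, gene) -> score.
    """
    total = sum(topic_weights)
    mixed = {}
    for weight, grn in zip(topic_weights, topic_grns):
        for edge, score in grn.items():
            # Each topic contributes in proportion to its share of the cell.
            mixed[edge] = mixed.get(edge, 0.0) + (weight / total) * score
    return mixed


# Hypothetical scores for two topics; labels are placeholders, not real genes.
topic_grns = [
    {("TF_A", "gene_1"): 1.0},
    {("TF_A", "gene_1"): 0.0, ("TF_B", "gene_2"): 0.5},
]

# A cell that is three parts topic 0 and one part topic 1.
cell_grn = mix_topic_grns([3, 1], topic_grns)
print(cell_grn)
```

Because the weights vary smoothly from cell to cell, the blended network does too, which is the intuition behind not needing predefined clusters.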
scDoRI expects single-cell multiome data with the following inputs:
- RNA: an AnnData `.h5ad` object with a cells × genes expression matrix
- ATAC: an AnnData `.h5ad` object with a cells × peaks accessibility matrix
  - Peaks must include genomic coordinates in `.var` (columns: `chr`, `start`, `end`)
These datasets must be paired — i.e., RNA and ATAC should come from the same cells.
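The requirements above can be checked before preprocessing with a small script. This is a minimal sketch, not part of scDoRI: the helper names `check_paired_inputs` and `load_and_check` are hypothetical, and it assumes that "paired" means both files carry the same set of cell barcodes in `.obs_names`. scDoRI's own pipeline may match cells differently.

```python
REQUIRED_PEAK_COLUMNS = ("chr", "start", "end")


def check_paired_inputs(rna, atac):
    """Return a list of problems found in a pair of AnnData-like objects.

    Only `.obs_names` and `.var.columns` are read, so this works on real
    AnnData objects as well as on lightweight stand-ins.
    """
    problems = []

    # Peaks need genomic coordinates in the ATAC `.var` table.
    columns = list(atac.var.columns)
    missing = [c for c in REQUIRED_PEAK_COLUMNS if c not in columns]
    if missing:
        problems.append(f"ATAC .var is missing columns: {missing}")

    # Paired data: both modalities should describe the same set of cells.
    rna_cells, atac_cells = set(rna.obs_names), set(atac.obs_names)
    if rna_cells != atac_cells:
        n_unshared = len(rna_cells ^ atac_cells)
        problems.append(
            f"{n_unshared} cell barcodes are not shared between RNA and ATAC"
        )

    return problems


def load_and_check(rna_path, atac_path):
    """Read both .h5ad files and run the checks above."""
    import anndata as ad  # imported lazily; only needed when reading files

    rna = ad.read_h5ad(rna_path)
    atac = ad.read_h5ad(atac_path)
    return check_paired_inputs(rna, atac)
```

An empty list from `load_and_check("rna.h5ad", "atac.h5ad")` (the file names are placeholders) means both checks passed.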
To install all dependencies for scDoRI, we recommend using Conda or Micromamba.
git clone https://github.com/saraswatmanu/scDoRI.git
cd scDoRI
conda env create -f environment.yml
conda activate scdori_env
# Install the scDoRI package
pip install . --no-deps
⚡ Note: Training is GPU-accelerated, and we strongly recommend running it on a GPU-enabled machine. Preprocessing can run on CPU, but training on large datasets with a CPU is not advised because it is slow.
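Before starting a long training run, it can help to confirm that a GPU is actually visible from the environment. The sketch below assumes the training code runs on PyTorch, which this README does not state explicitly. `pick_device` is a hypothetical helper, not a scDoRI function.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: the scdori_env environment provides PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

If this reports `cpu` on a machine that has a GPU, the installed build or driver setup is worth checking before training.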
You’ll work through the notebooks below, using two separate config files to set parameters for dataset preprocessing and training.
- Preprocessing: edit `src/scdori/pp/config.py` to specify the location of the RNA and ATAC AnnData `.h5ad` files and the motif file, and to set the number of peaks/genes/TFs to train on. Then run `docs/notebooks/preprocessing.ipynb`.
- Training: edit `src/scdori/_core/config.py` to set scDoRI hyperparameters (number of topics, learning rate, epochs, etc.) and to specify the paths to the preprocessed AnnData objects and in silico ChIP-seq files. Then run `docs/notebooks/training.ipynb`.
- Downstream: `docs/notebooks/downstream.ipynb`
The provided notebooks use the mouse gastrulation dataset from:
📄 Paper: Argelaguet et al., bioRxiv 2022 📦 Download: Dropbox link
`preprocessing_pipeline/config.py` provides flexible options:
- You can set the number of peaks, genes, and TFs to use for model training
  - 💡 Tip: Adjust based on your available GPU memory
- You can also force inclusion of specific genes or TFs, even if they aren’t highly variable
  - Useful for focusing on known regulators or genes of interest
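The idea of forcing specific genes or TFs into the feature set can be sketched as follows. This is an illustration only, not scDoRI's preprocessing code. The function `select_features` and its arguments are hypothetical, and it assumes forced features count towards the total rather than being added on top of it.

```python
def select_features(variability, n_keep, forced=()):
    """Pick `n_keep` features: forced ones first, then the most variable.

    variability: dict mapping feature name -> variability score.
    forced:      names to include regardless of their score
                 (names absent from `variability` are ignored).
    """
    # Deduplicate forced names while preserving their order.
    forced_present = [f for f in dict.fromkeys(forced) if f in variability]
    if len(forced_present) > n_keep:
        raise ValueError("more forced features than n_keep")

    # Fill the remaining slots with the highest-scoring other features.
    forced_set = set(forced_present)
    ranked = sorted(variability, key=variability.get, reverse=True)
    fill = [f for f in ranked if f not in forced_set]
    return forced_present + fill[: n_keep - len(forced_present)]


# Toy scores; "g2" is the least variable but is forced into the selection.
scores = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.7}
print(select_features(scores, 3, forced=["g2"]))  # ['g2', 'g1', 'g4']
```

Lowering `n_keep` is the knob to turn when GPU memory is tight, since the forced features are kept either way.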
📖 Full documentation and the API reference are hosted at: https://scdori.readthedocs.io/en/latest/
Includes:
- API reference (docstrings)
- In-depth method overview
- Preprocessing + training guides
- (upcoming) Customization tips
If you use scDoRI in your work, please cite our preprint/paper (coming soon). Until then, feel free to open an issue or get in touch at [email protected]