spaVAE, spaPeakVAE, spaMultiVAE, and spaLDVAE are dependency-aware deep generative models for multitasking analysis of spatial genomics data. Each model is designed for a different set of analytical tasks.
spaVAE is a negative binomial (NB) model-based variational autoencoder (VAE) with a mixture embedding of Gaussian process (GP) prior and Gaussian prior. The model is for multitasking analysis of spatially resolved transcriptomics (SRT) data, including dimensionality reduction, visualization, clustering, batch integration, denoising, differential expression, spatial imputation, and resolution enhancement.
spaPeakVAE is a variant model of spaVAE, which uses a Bernoulli decoder to characterize spatial ATAC-seq binary data. The analytical tasks in spaVAE can also be fulfilled by spaPeakVAE for spatial ATAC-seq data.
spaMultiVAE characterizes spatial multi-omics data, which profiles gene expression and surface protein intensity simultaneously. In addition to the analyses listed above, spaMultiVAE uses an NB mixture decoder to denoise background signal in proteins.
spaLDVAE and spaPeakLDVAE are spaVAE variants with a linear decoder. They also contain two latent embedding components: one follows a GP prior and the other follows a standard normal prior. These models can be used for detecting spatially variable genes and peaks.
Diagram of spaVAE (a), spaPeakVAE (a), spaMultiVAE (b), spaLDVAE (c), and spaPeakLDVAE (c) networks:
Python: 3.9.7
PyTorch: 1.11.0 (https://pytorch.org)
Scanpy: 1.9.1 (https://scanpy.readthedocs.io/en/stable)
Numpy: 1.21.5 (https://numpy.org)
Pandas: 1.4.2 (https://pandas.pydata.org)
h5py: 3.6.0 (https://pypi.org/project/h5py)
For human DLPFC dataset:
python run_spaVAE.py --data_file HumanDLPFC_151673.h5 --inducing_point_steps 6
For integrating 4 human DLPFC samples:
python run_spaVAE_Batch.py --data_file 151673_151674_151675_151676_samples_union.h5 --inducing_point_steps 6
For mouse hippocampus Slide-seq V2 dataset:
python run_spaVAE.py --data_file Mouse_hippocampus.h5 --grid_inducing_points False --inducing_point_nums 400 --loc_range 40
For spatial ATAC-seq dataset of mouse embryonic (E15.5) brain tissues in the MISAR-seq dataset:
python run_spaPeakVAE.py --data_file MISAR_seq_mouse_E15_brain_ATAC_data.h5 --inducing_point_steps 19
For spatial multi-omics Spatial-ATAC-seq data:
python run_spaMultiVAE.py --data_file Multiomics_Spatial_ATAC_Human_tonsil_data.h5 --inducing_point_steps 19
--data_file specifies the name of the input data file, which must be in h5 format. For SRT data, the spot-by-gene count matrix is stored in "X" and the 2D locations are stored in "pos". For spatial ATAC-seq data, "X" holds the spot-by-peak count matrix. For spatial multi-omics data, "X_gene" holds the spot-by-gene count matrix and "X_protein" holds the spot-by-protein count matrix.
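As an illustration of the SRT layout described above, the following sketch writes and reads back a toy h5 file with h5py. The helper names `write_srt_h5` and `read_srt_h5`, the toy data, and the file name are hypothetical and not part of this repository; the dtypes the run scripts expect are not stated here, so treat the ones below as assumptions.

```python
import os
import tempfile

import h5py
import numpy as np


def write_srt_h5(path, counts, pos):
    """Write a spot-by-gene count matrix ("X") and 2D locations ("pos")."""
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=counts)
        f.create_dataset("pos", data=pos)


def read_srt_h5(path):
    """Read "X" and "pos" back as NumPy arrays."""
    with h5py.File(path, "r") as f:
        x = np.array(f["X"])
        pos = np.array(f["pos"])
    return x, pos


# Toy data (assumption): 5 spots, 3 genes, random 2D coordinates.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(5, 3))
toy_pos = rng.uniform(0.0, 10.0, size=(5, 2))

toy_path = os.path.join(tempfile.mkdtemp(), "toy_srt.h5")
write_srt_h5(toy_path, toy_counts, toy_pos)
x, pos = read_srt_h5(toy_path)
print(x.shape, pos.shape)  # (5, 3) (5, 2)
```

For spatial multi-omics data the same pattern would apply with "X_gene" and "X_protein" in place of "X".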
--data_file: data file name.
--select_genes: number of selected genes for analysis, default = 0 means no filtering. It will use the mean-variance relationship to select informative genes.
--batch_size: mini-batch size, default = "auto", which means if sample size <= 1024 then batch size = 128, if 1024 < sample size <= 2048 then batch size = 256, if sample size > 2048 then batch size = 512.
--maxiter: number of max training iterations, default = 5000.
--train_size: proportion of the data used as the training set; the remainder is used as the validation set, default = 0.95.
--patience: patience of early stopping when using a validation set, default = 200.
--lr: learning rate, default = 1e-3 for spaVAE and spaPeakVAE, and default = 5e-3 for spaMultiVAE.
--weight_decay: weight decay coefficient, default = 1e-6.
--noise: coefficient of random Gaussian noise for the encoder, default = 0.
--dropoutE: dropout probability for encoder, default = 0.
--dropoutD: dropout probability for decoder, default = 0.
--encoder_layers: hidden layer sizes of encoder, default = [128, 64].
--GP_dim: dimension of the latent Gaussian process embedding, default = 2 for spaVAE and spaMultiVAE, and default = 4 for spaPeakVAE.
--Normal_dim: dimension of the latent standard Gaussian embedding, default = 8.
--decoder_layers: hidden layer sizes of decoder, default = [128].
--init_beta: initial coefficient of the KL loss, default = 10.
--min_beta: minimal coefficient of the KL loss, default = 4.
--max_beta: maximal coefficient of the KL loss, default = 25.
--KL_loss: desired KL_divergence value (GP and standard normal combined), default = 0.025.
--num_samples: number of samplings of the posterior distribution of latent embedding during training, default = 1.
--fix_inducing_points: fixed or trainable inducing points, default = True, which means inducing points are fixed.
--grid_inducing_points: whether to use 2D grid inducing points or k-means centroids of positions as inducing points, default = True. "True" for 2D grid, "False" for k-means centroids.
--inducing_point_steps: if using 2D grid inducing points, set the number of 2D grid steps, default = None. Needed when grid_inducing_points = True.
--inducing_point_nums: if using k-means centroids on positions, set the number of inducing points, default = None. Needed when grid_inducing_points = False.
--fixed_gp_params: kernel scale is fixed or not, default = False, which means kernel scale is trainable.
--loc_range: positional locations will be scaled to the specified range. For example, loc_range = 20 means x and y locations will be scaled to the range 0 to 20, default = 20. This value can be set larger if training is numerically unstable.
--kernel_scale: initial kernel scale, default = 20.
--model_file: file name to save weights of the model, default = model.pt
--final_latent_file: file name to output final latent representations, default = final_latent.txt.
--denoised_counts_file: file name to output denoised counts, default = denoised_mean.txt.
--device: pytorch device, default = cuda.
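The "auto" rule for --batch_size listed above can be written as a small function. This is a sketch of the stated thresholds only; the name `auto_batch_size` is hypothetical and is not taken from the run scripts.

```python
def auto_batch_size(n_samples: int) -> int:
    """Pick a mini-batch size from the sample size, per the "auto" rule."""
    if n_samples <= 1024:
        return 128
    if n_samples <= 2048:
        return 256
    return 512


for n in (1000, 1500, 4000):
    print(n, auto_batch_size(n))
```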
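To make the --loc_range, --grid_inducing_points, --inducing_point_steps, and --inducing_point_nums options more concrete, here is a minimal NumPy sketch of location scaling and the two ways of choosing inducing points. It is not the repository's implementation. All function names are hypothetical, and several details are assumptions: that each axis is min-max scaled independently, that a grid with `steps` steps has `steps + 1` points per axis, and that a plain Lloyd's k-means is an acceptable stand-in for whatever k-means routine the scripts use.

```python
import numpy as np


def scale_locations(pos, loc_range=20.0):
    """Min-max scale each coordinate axis to [0, loc_range] (assumed per-axis)."""
    pos = np.asarray(pos, dtype=float)
    lo = pos.min(axis=0)
    span = pos.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on a constant axis
    return (pos - lo) / span * loc_range


def grid_inducing_points(steps, loc_range=20.0):
    """Regular 2D grid over [0, loc_range]^2 with steps + 1 points per axis."""
    axis = np.linspace(0.0, loc_range, steps + 1)
    xx, yy = np.meshgrid(axis, axis, indexing="ij")
    return np.column_stack([xx.ravel(), yy.ravel()])


def kmeans_inducing_points(pos, n_points, n_iter=50, seed=0):
    """Centroids from a plain Lloyd's k-means on the spot locations."""
    rng = np.random.default_rng(seed)
    pos = np.asarray(pos, dtype=float)
    # Initialise centroids from randomly chosen spots (fancy indexing copies).
    centers = pos[rng.choice(len(pos), size=n_points, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every spot to every centroid.
        d2 = ((pos[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for k in range(n_points):
            members = pos[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


# Toy locations (assumption): 500 random spots on an arbitrary coordinate scale.
rng = np.random.default_rng(1)
raw_pos = rng.uniform(100.0, 900.0, size=(500, 2))
scaled = scale_locations(raw_pos, loc_range=20.0)

grid_pts = grid_inducing_points(steps=6, loc_range=20.0)
kmeans_pts = kmeans_inducing_points(scaled, n_points=10)
print(grid_pts.shape, kmeans_pts.shape)  # (49, 2) (10, 2)
```

A grid covers the tissue area uniformly, which suits spots arranged on a regular array, while k-means centroids follow the actual spot density, which may be preferable for irregularly placed beads.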
Datasets used in the study can be found at:
https://figshare.com/articles/dataset/Spatial_genomics_datasets/21623148
Tian Tian [email protected]