Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Driver gene identification #77

Closed
forrestzhaosen opened this issue Mar 8, 2023 · 15 comments
Closed

Driver gene identification #77

forrestzhaosen opened this issue Mar 8, 2023 · 15 comments
Assignees

Comments

@forrestzhaosen
Copy link

forrestzhaosen commented Mar 8, 2023

Is it possible to generate a list of genes that drive the significance of the enriched cell type?
Thanks!

@Al-Murphy
Copy link
Collaborator

Al-Murphy commented Mar 9, 2023

Hey!

Let me explain a little about how EWCE works since what you are asking doesn't really fit into EWCE's usual use cases. So EWCE uses reference single-cell/single-nucleus RNA-seq datasets to work the specificity of genes' expressions to specific cell types. In this sense, you can get these specificity values for each cell type:

#this reference dataset is from mouse cortex and hypothalamus single-cell RNA-seq data (from the Karolinska Institute).
ctd <- ewceData::ctd()
#Different cell levels can be inspected, let's go with the highest cell level groupings and look at the specificity matrix
ctd[[1]]$specificity[1:10,]
              astrocytes_ependymal endothelial-mural interneurons  microglia oligodendrocytes pyramidal CA1 pyramidal SS
Tspan12                 0.20156111        0.43344472    0.1276534 0.09963109       0.04496569   0.060313947   0.03243007
Tshz1                   0.17881643        0.22553038    0.1572263 0.20926498       0.09224942   0.009966436   0.12694604
Fnbp1l                  0.10217628        0.22373111    0.1993075 0.09898422       0.04575399   0.081178989   0.24886792
Adamts15                0.03860933        0.51199973    0.3113053 0.00000000       0.06116274   0.017480158   0.05944269
Cldn12                  0.19225487        0.11971558    0.1550142 0.05023070       0.13721201   0.130088678   0.21548392
Rxfp1                   0.00000000        0.10061558    0.5423008 0.00000000       0.06304908   0.114920073   0.17911438
2310042E22Rik           0.03910053        0.12023499    0.6568037 0.03405283       0.05650745   0.077247458   0.01605304
Sema3c                  0.04175674        0.25607623    0.5989986 0.04959018       0.02405407   0.011796175   0.01772802
Jam2                    0.30642131        0.33054824    0.2148130 0.06456381       0.04821188   0.013628951   0.02181276
Apbb1ip                 0.00000000        0.03384452    0.2095321 0.69014938       0.03752203   0.012885393   0.01606654

However, going to your cell type of interest and picking the most specific genes may not be a sensible thing to do as the actual expression levels of these genes, for example, could be very low but it might just happen that the small amount of expression that was noted came from this cell type. It is worth understanding what we mean by specificity here before trying to use this data for your own problem.

You want:

a list of genes that drive the significance of the enriched cell type

From what I understand, you are trying to get a list of genes that are specific to a cell type. This could be done by taking say the top 10% quantile of genes from this specificity matrix using a reference dataset from the region of interest (see here for documentation for creating your own). However note the caveats and possible issues with this I give above. Have a read of the original publication to get a better understanding of these concepts and how we use them in EWCE - https://www.frontiersin.org/articles/10.3389/fnins.2016.00016/full

Cheers,
Alan.

@bschilder
Copy link
Collaborator

bschilder commented Mar 9, 2023

@Al-Murphy I think what @forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.

EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosen .
One simple solution would be extend what @Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.

@NathanSkene
Copy link
Owner

NathanSkene commented Mar 9, 2023 via email

@bschilder
Copy link
Collaborator

bschilder commented Mar 9, 2023

You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this. But I don’t think there is a subset of genes driving the enrichments, in most cases

Oh right, forgot about that function! This what more what I mean. If I'm remembering correctly though, the plots were implemented in such a way that the enrichments tests had to be rerun, which isn't ideal since it doubles compute time and isn't tied directly to your first round of enrichment results (due to the stochasticity).

If am indeed remembering correctly, maybe we can add a feature to the main bootstrapping function that computes the per-gene probability and stores them in another slot within the output.

@bschilder
Copy link
Collaborator

I believe this is the function in question @NathanSkene .
Does this seem like what you're looking for @forrestzhaosen ?
https://nathanskene.github.io/EWCE/reference/generate_bootstrap_plots.html

And here is what i mean about the bootstrapping tests being redone:

for (s in seq_len(nReps)) {

Looking into the code now to see if we can just record these gene-wise probabilities during the bootstrap tests conducted by bootstrap_enrichment_test

@forrestzhaosen
Copy link
Author

Thank you all! That's really helpful.
I just tried generate_bootstrap_plots but met this ERROR:
Error in rownames<-(*tmp*, value = sprintf("Rep%s", seq_len(nReps))) :
attempt to set 'rownames' on an object with no dimensions

I couldn't figure this out by reading your source code. I am using EWCE version 1.6.
Do you get any quick idea about it?

@bschilder
Copy link
Collaborator

Thank you all! That's really helpful. I just tried generate_bootstrap_plots but met this ERROR: Error in rownames<-(*tmp*, value = sprintf("Rep%s", seq_len(nReps))) : attempt to set 'rownames' on an object with no dimensions I couldn't figure this out by reading your source code. I am using EWCE version 1.6. Do you get any quick idea about it?

@forrestzhaosen Please do include a reproducible example using the Bugs template in a new Issue. Otherwise there's not much we can do for you.

In other news, making progress on implementing the built-in gene scoring for
Looks something like this atm:
Screenshot 2023-03-09 at 15 54 18

@forrestzhaosen
Copy link
Author

Thanks! That's amazing.

@bschilder
Copy link
Collaborator

Actually, needs to be cell-type specific scores, and be based on the specificity of each gene within a cellular context.
So I think this works better.
Screenshot 2023-03-09 at 17 02 27

@bschilder
Copy link
Collaborator

bschilder commented Mar 10, 2023

Ok, so i rewrote much of generate_bootstrap_plots plots to make a number of improvments:
- Revamp wrap code into reusable subfunctions.
- Avoid resampling random genes when the gene data is stored in the bootstrap results as gene_data.
It will also tell you which of these options it's using.
- Save with ggsave instead if grDevices.
- Facet by celltype instead of generating tons of separate plots.
- Let users decide cutoff threshold with new arg adj_pval_thresh
- Now returns a named list with the plots themselves ("plot") and the paths to where they're saved ("paths") rather than just a higher-level directory path in which users had to search for the right files (and didn't ever have access to the ggplot2 objects).

Here's what the plots look like now:

Reprex

## Load the single cell data
sct_data <- ewceData::ctd()

## Set the parameters for the analysis
## Use 5 bootstrap lists for speed, for publishable analysis use >10000
reps <- 5

## Load the gene list and get human orthologs
hits <- ewceData::example_genelist()[1:100]

## Bootstrap significance test,
##  no control for transcript length or GC content
## Use pre-computed results to speed up example
full_results <- EWCE::example_bootstrap_results()
 
output <- EWCE::generate_bootstrap_plots(
    sct_data = sct_data,
    hits = hits,
    reps = reps,
    full_results = full_results,
    listFileName = "Example",
    sctSpecies = "mouse",
    genelistSpecies = "human",
    annotLevel = 1,
    save_dir = tempdir()
)

Plots

Plot 1

p1

Plot 2

p2

Plot 3

p3

Plot 4

p4

Let me know if anything seems awry with these @NathanSkene

@bschilder
Copy link
Collaborator

bschilder commented Mar 10, 2023

actually, shouldn't the y-axes be labeled "Specificity in cell type"? Since it's the specificity matrix that it ultimately being used to generate these plots (currently, and before I touched anything)

@bschilder bschilder self-assigned this Mar 10, 2023
@bschilder
Copy link
Collaborator

Note to self, I'll also revamp the generate_bootstrap_plots_for_transcriptome func to match the upgrades in generate_bootstrap_plots. This is not automatic since a lot of the original code was just a separate copy of generate_bootstrap_plots, as opposed to using subfunctions that could be applied to both. This kind of structure inflates the size of EWCE's total code and leaves room for unintended divergence between function behaviour. When I have time, I'll try to refactor some of the code to alleviate this issue.

@bschilder
Copy link
Collaborator

@Al-Murphy I've pushed the changes I've made so far to generate_bootstrap_plots so that @forrestzhaosen can start using it right away. But I'd hold off on pushing upstream to Bioc just yet as I want to finish the generate_bootstrap_plots_for_transcriptome upgrades first.

@forrestzhaosen to install the dev version of EWCE use:

remotes::install_github("NathanSkene/EWCE", dependencies = TRUE, upgrade = "always")

@forrestzhaosen
Copy link
Author

Thank you so much! It's greatly appreciated.

@bschilder
Copy link
Collaborator

Changes to the generate_bootstrap_plots_for_transcriptome function have now been pushed as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants