Driver gene identification #77

forrestzhaosen · 2023-03-08T22:52:30Z

Is it possible to generate a list of genes that drive the significance of the enriched cell type?
Thanks!

Al-Murphy · 2023-03-09T10:23:00Z

Hey!

Let me explain a little about how EWCE works since what you are asking doesn't really fit into EWCE's usual use cases. So EWCE uses reference single-cell/single-nucleus RNA-seq datasets to work the specificity of genes' expressions to specific cell types. In this sense, you can get these specificity values for each cell type:

#this reference dataset is from mouse cortex and hypothalamus single-cell RNA-seq data (from the Karolinska Institute).
ctd <- ewceData::ctd()
#Different cell levels can be inspected, let's go with the highest cell level groupings and look at the specificity matrix
ctd[[1]]$specificity[1:10,]
              astrocytes_ependymal endothelial-mural interneurons  microglia oligodendrocytes pyramidal CA1 pyramidal SS
Tspan12                 0.20156111        0.43344472    0.1276534 0.09963109       0.04496569   0.060313947   0.03243007
Tshz1                   0.17881643        0.22553038    0.1572263 0.20926498       0.09224942   0.009966436   0.12694604
Fnbp1l                  0.10217628        0.22373111    0.1993075 0.09898422       0.04575399   0.081178989   0.24886792
Adamts15                0.03860933        0.51199973    0.3113053 0.00000000       0.06116274   0.017480158   0.05944269
Cldn12                  0.19225487        0.11971558    0.1550142 0.05023070       0.13721201   0.130088678   0.21548392
Rxfp1                   0.00000000        0.10061558    0.5423008 0.00000000       0.06304908   0.114920073   0.17911438
2310042E22Rik           0.03910053        0.12023499    0.6568037 0.03405283       0.05650745   0.077247458   0.01605304
Sema3c                  0.04175674        0.25607623    0.5989986 0.04959018       0.02405407   0.011796175   0.01772802
Jam2                    0.30642131        0.33054824    0.2148130 0.06456381       0.04821188   0.013628951   0.02181276
Apbb1ip                 0.00000000        0.03384452    0.2095321 0.69014938       0.03752203   0.012885393   0.01606654

However, going to your cell type of interest and picking the most specific genes may not be a sensible thing to do as the actual expression levels of these genes, for example, could be very low but it might just happen that the small amount of expression that was noted came from this cell type. It is worth understanding what we mean by specificity here before trying to use this data for your own problem.

You want:

a list of genes that drive the significance of the enriched cell type

From what I understand, you are trying to get a list of genes that are specific to a cell type. This could be done by taking say the top 10% quantile of genes from this specificity matrix using a reference dataset from the region of interest (see here for documentation for creating your own). However note the caveats and possible issues with this I give above. Have a read of the original publication to get a better understanding of these concepts and how we use them in EWCE - https://www.frontiersin.org/articles/10.3389/fnins.2016.00016/full

Cheers,
Alan.

bschilder · 2023-03-09T12:05:14Z

@Al-Murphy I think what @forrestzhaosen is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient.

EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosen .
One simple solution would be extend what @Al-Murphy is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases.

NathanSkene · 2023-03-09T12:11:26Z

You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this. But I don’t think there is a subset of genes driving the enrichments, in most cases Sent from Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Brian M. Schilder ***@***.***> Sent: Thursday, March 9, 2023 12:05:26 PM To: NathanSkene/EWCE ***@***.***> Cc: Subscribed ***@***.***> Subject: Re: [NathanSkene/EWCE] Driver gene identification (Issue #77) This email from ***@***.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders list<https://spam.ic.ac.uk/SpamConsole/Senders.aspx> to disable email stamping for this address. @Al-Murphy<https://github.com/Al-Murphy> I think what @forrestzhaosen<https://github.com/forrestzhaosen> is asking for is actually something a bit different. i.e. Given cell-type-specific enrichment for some gene list, what are the cell-type specific genes that are most strongly driving that enrichment. Since not all genes will overlap between the gene list and the genes with the top specificity quantiles, simply taking the latter wouldn't be sufficient. EWCE doesn't currently return this kind of information, but I take your point @forrestzhaosen<https://github.com/forrestzhaosen> . One simply solution would be extend what @Al-Murphy<https://github.com/Al-Murphy> is suggesting and find the intersection between genes with the gene list and the genes in the top specificity quantiles for a given enriched celltype. Ideally, we would want to collect some stats on this across all bootstrap iterations. I don't think this would be too hard to implement, and might be quite useful in certain cases. — Reply to this email directly, view it on GitHub<#77 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AH5ZPE2CXDNV5MQHAB62HG3W3HBQNANCNFSM6AAAAAAVULGEXQ>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

bschilder · 2023-03-09T12:20:28Z

You can get the bootstrapping plots, which show the probability of each gene being there, at each position in the ranked list. There is a function in EWCE for this. But I don’t think there is a subset of genes driving the enrichments, in most cases

Oh right, forgot about that function! This what more what I mean. If I'm remembering correctly though, the plots were implemented in such a way that the enrichments tests had to be rerun, which isn't ideal since it doubles compute time and isn't tied directly to your first round of enrichment results (due to the stochasticity).

If am indeed remembering correctly, maybe we can add a feature to the main bootstrapping function that computes the per-gene probability and stores them in another slot within the output.

bschilder · 2023-03-09T14:27:40Z

I believe this is the function in question @NathanSkene .
Does this seem like what you're looking for @forrestzhaosen ?
https://nathanskene.github.io/EWCE/reference/generate_bootstrap_plots.html

And here is what i mean about the bootstrapping tests being redone:

EWCE/R/generate_bootstrap_plots.r

Line 157 in 9b5b21f

for (s in seq_len(nReps)) {

Looking into the code now to see if we can just record these gene-wise probabilities during the bootstrap tests conducted by bootstrap_enrichment_test

forrestzhaosen · 2023-03-09T15:40:51Z

Thank you all! That's really helpful.
I just tried generate_bootstrap_plots but met this ERROR:
Error in rownames<-(*tmp*, value = sprintf("Rep%s", seq_len(nReps))) :
attempt to set 'rownames' on an object with no dimensions
I couldn't figure this out by reading your source code. I am using EWCE version 1.6.
Do you get any quick idea about it?

bschilder · 2023-03-09T15:58:00Z

Thank you all! That's really helpful. I just tried generate_bootstrap_plots but met this ERROR: Error in rownames<-(*tmp*, value = sprintf("Rep%s", seq_len(nReps))) : attempt to set 'rownames' on an object with no dimensions I couldn't figure this out by reading your source code. I am using EWCE version 1.6. Do you get any quick idea about it?

@forrestzhaosen Please do include a reproducible example using the Bugs template in a new Issue. Otherwise there's not much we can do for you.

In other news, making progress on implementing the built-in gene scoring for
Looks something like this atm:

forrestzhaosen · 2023-03-09T16:06:34Z

Thanks! That's amazing.

bschilder · 2023-03-09T17:03:23Z

Actually, needs to be cell-type specific scores, and be based on the specificity of each gene within a cellular context.
So I think this works better.

bschilder · 2023-03-10T00:30:46Z

Ok, so i rewrote much of generate_bootstrap_plots plots to make a number of improvments:
- Revamp wrap code into reusable subfunctions.
- Avoid resampling random genes when the gene data is stored in the bootstrap results as gene_data.
It will also tell you which of these options it's using.
- Save with ggsave instead if grDevices.
- Facet by celltype instead of generating tons of separate plots.
- Let users decide cutoff threshold with new arg adj_pval_thresh
- Now returns a named list with the plots themselves ("plot") and the paths to where they're saved ("paths") rather than just a higher-level directory path in which users had to search for the right files (and didn't ever have access to the ggplot2 objects).

Here's what the plots look like now:

Reprex

## Load the single cell data
sct_data <- ewceData::ctd()

## Set the parameters for the analysis
## Use 5 bootstrap lists for speed, for publishable analysis use >10000
reps <- 5

## Load the gene list and get human orthologs
hits <- ewceData::example_genelist()[1:100]

## Bootstrap significance test,
##  no control for transcript length or GC content
## Use pre-computed results to speed up example
full_results <- EWCE::example_bootstrap_results()
 
output <- EWCE::generate_bootstrap_plots(
    sct_data = sct_data,
    hits = hits,
    reps = reps,
    full_results = full_results,
    listFileName = "Example",
    sctSpecies = "mouse",
    genelistSpecies = "human",
    annotLevel = 1,
    save_dir = tempdir()
)

Plots

Plot 1

Plot 2

Plot 3

Plot 4

Let me know if anything seems awry with these @NathanSkene

bschilder · 2023-03-10T00:59:12Z

actually, shouldn't the y-axes be labeled "Specificity in cell type"? Since it's the specificity matrix that it ultimately being used to generate these plots (currently, and before I touched anything)

bschilder · 2023-03-10T10:46:52Z

Note to self, I'll also revamp the generate_bootstrap_plots_for_transcriptome func to match the upgrades in generate_bootstrap_plots. This is not automatic since a lot of the original code was just a separate copy of generate_bootstrap_plots, as opposed to using subfunctions that could be applied to both. This kind of structure inflates the size of EWCE's total code and leaves room for unintended divergence between function behaviour. When I have time, I'll try to refactor some of the code to alleviate this issue.

bschilder · 2023-03-10T10:56:56Z

@Al-Murphy I've pushed the changes I've made so far to generate_bootstrap_plots so that @forrestzhaosen can start using it right away. But I'd hold off on pushing upstream to Bioc just yet as I want to finish the generate_bootstrap_plots_for_transcriptome upgrades first.

@forrestzhaosen to install the dev version of EWCE use:

remotes::install_github("NathanSkene/EWCE", dependencies = TRUE, upgrade = "always")

forrestzhaosen · 2023-03-10T22:13:09Z

Thank you so much! It's greatly appreciated.

bschilder · 2023-04-03T17:52:56Z

Changes to the generate_bootstrap_plots_for_transcriptome function have now been pushed as well.

bschilder added the enhancement label Mar 9, 2023

bschilder mentioned this issue Mar 9, 2023

Extracting genes driving the enrichment and their mean bootstrap expression #76

Closed

bschilder self-assigned this Mar 10, 2023

bschilder closed this as completed Apr 3, 2023

bschilder mentioned this issue Nov 28, 2023

How can I extract a list of genes that drive the (significant) enrichment for a particular cell type? #90

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Driver gene identification #77

Driver gene identification #77

forrestzhaosen commented Mar 8, 2023 •

edited

Loading

Al-Murphy commented Mar 9, 2023 •

edited

Loading

bschilder commented Mar 9, 2023 •

edited

Loading

NathanSkene commented Mar 9, 2023 via email

bschilder commented Mar 9, 2023 •

edited

Loading

bschilder commented Mar 9, 2023

forrestzhaosen commented Mar 9, 2023

bschilder commented Mar 9, 2023

forrestzhaosen commented Mar 9, 2023

bschilder commented Mar 9, 2023

bschilder commented Mar 10, 2023 •

edited

Loading

bschilder commented Mar 10, 2023 •

edited

Loading

bschilder commented Mar 10, 2023

bschilder commented Mar 10, 2023

forrestzhaosen commented Mar 10, 2023

bschilder commented Apr 3, 2023

Driver gene identification #77

Driver gene identification #77

Comments

forrestzhaosen commented Mar 8, 2023 • edited Loading

Al-Murphy commented Mar 9, 2023 • edited Loading

bschilder commented Mar 9, 2023 • edited Loading

NathanSkene commented Mar 9, 2023 via email

bschilder commented Mar 9, 2023 • edited Loading

bschilder commented Mar 9, 2023

forrestzhaosen commented Mar 9, 2023

bschilder commented Mar 9, 2023

forrestzhaosen commented Mar 9, 2023

bschilder commented Mar 9, 2023

bschilder commented Mar 10, 2023 • edited Loading

Reprex

Plots

Plot 1

Plot 2

Plot 3

Plot 4

bschilder commented Mar 10, 2023 • edited Loading

bschilder commented Mar 10, 2023

bschilder commented Mar 10, 2023

forrestzhaosen commented Mar 10, 2023

bschilder commented Apr 3, 2023

forrestzhaosen commented Mar 8, 2023 •

edited

Loading

Al-Murphy commented Mar 9, 2023 •

edited

Loading

bschilder commented Mar 9, 2023 •

edited

Loading

bschilder commented Mar 9, 2023 •

edited

Loading

bschilder commented Mar 10, 2023 •

edited

Loading

bschilder commented Mar 10, 2023 •

edited

Loading