-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pando Not Returning Expected GRNs and Limiting Target Genes to a Handful - Possibility of Problem with Mouse Genome? #30
Comments
Some details regarding my code include adapting the provided motif2tf data frame in Pando to decapitalize the letters of each transcription factor name after the first letter to make it compatible with BSgenome.Mmusculus.UCSC.mm10 and then subsetting the JASPAR2022 pfm object to only include transcription factors that are contained within the updated motif2tf data frame. I can edit my own motif2tf data frame object to keep transcription factors in singular fashion in the future so that I don't have to subset the pfm object in order to avoid the downstream colMaxs error, but I just wanted to keep things simple for this first iteration. Also, that last part of my code (tf network for Tcf7l2) does not even work because it's not listed as a transcription factor in the network graph (as shown in the screenshot). All this is really odd because my project revolves around showing the importance of this transcription factor in driving a lot of zonated hepatocyte activity. |
Hi @Sandman-1, |
Curious as to why the targets do not have gene names and have nomenclature like C6.... etc? I encountered the same problem. |
Yeah that's quite odd, I haven't seen that before... could you maybe give me access to some example dataset where that happens? Could imagine that there is a problem with the motif to tf matching but not sure. |
@joschif |
I can try. Here's some code I used to convert JASPAR motifs to what is currently used by Pando, maybe its of use to you: opts[['species']] <- 9606
opts[['all_versions']] <- T
collections <- c('CORE', 'CNE', 'PHYLOFACTS', 'SPLICE', 'POLII', 'FAM', 'PBM', 'PBM_HOMEO', 'PBM_HLH', 'UNVALIDATED')
names(collections) <- collections
motif_collection_list <- map(collections, function(c){
opts[['collection']] <- c
PFMatrixList <- getMatrixSet(JASPAR2020, opts)
motif2tf <- PFMatrixList@listData %>%
map_dfr(function(x){
tibble(
name = x@name,
collection = c,
tf = unlist(str_split(str_replace(x@name, '(.+)\\(.+\\)', '\\1'), '::')),
strand = x@strand,
symbol = ifelse(is.null(x@tags$symbol), '-', x@tags$symbol),
family = ifelse(is.null(x@tags$family), '-', x@tags$family)
)
}, .id='motif')
return(list(pfm=PFMatrixList, df=motif2tf))
})
motifs <- map(motif_collection_list, function(x) x$pfm) %>%
purrr::reduce(c)
motif2tf <- map_dfr(motif_collection_list, function(x) x$df, .id='collection') |
When running infer_grn() what can I enter for gene category because the target is |
|
@joschif |
I ran into the same issue with analyzing the mouse data and got very few TFs (40-50) with the same target gene named "F3". The analysis was pretty much the same as what @sylestiel did. and I prepared the motifs and motif2tf data as @joschif suggested above by changing the Taxonomy ID to 10090. Also tried the motifs from cisBP and got the same error. |
I rerun the infer_grn with
only target genes like "F3", "C2", etc were successfully run. I was not able to find the source code for infer_grn, but the "find_peaks_near_genes" function has the same log message in the source code of "find_peaks_near_genes" function, there the inbuilt data "EnsDb.Hsapiens.v93.annot.UCSC.hg38" was used
I am not sure whether this is also used in the infer_grn function. But as the genes with only one letter were successfully run, and those with more than one letter failed, (eg, F3 and F3 are the same in mouse and human, while Sorcs1 and SORCS1 are different in mouse and human), do you think this might be the problem? How would you suggest fixing it? Thanks a lot! |
Ah thanks, I think that might be the source of the problem. Have you tried running with |
@joschif Thank you for the reply!
Do you have any suggestions on what might be the problem? |
Yeah that's a known problem. The lm fromulas don't work with feature names starting with numbers. I haven't gotten around to fixing it, but I would suggest to filter these genes from the list of input features/TFs as they should be the minority and usually are not protein-coding genes anyway |
Thanks a lot! After removing the genes starting with numbers, the code was successfully run and I got a lot of TFs and targeting genes! I would still like to try the "GREAT" method. I created the mouse data as below
and then changed the names of the elementmetadata as what you have in the human data. did you also get the "EnsDb.Hsapiens.v93.annot.UCSC.hg38" data this way? If not, could you let me know how you did it? |
How can I remove the Rik genes? |
Wow, I missed a lot! Trying GRN Analysis using the Signac method then. Eager to see new results! :))) |
Have either of you gotten it to work yet? If so, I could use some help here. Thanks! |
@sylestiel
hope it also works with your data :) |
That was very helpful. It worked! Thank you very much!!!! |
Did either of you get GREAT to work with mouse data? |
Hello everybody, I am using multiome data generated from mouse, so I tried a lot to solve this problem, but no success, so It would be helpful if you could provide me the mouse genome region to enable inferring the GRN. I checked this website: http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phastCons60way/ Thanks for your helpful with the forementioned codes in your comments. |
I had the exact same issues as you. Finally, I ended up with the following approach. library(EnsDb.Mmusculus.v79) DefaultAssay(muo_data) <- "ATAC" extract gene annotations from EnsDbannotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79) change to UCSC style since the data was mapped to mm10seqlevelsStyle(annotations) <- 'UCSC' add the gene information to the objectAnnotation(muo_data) <- annotations Hope it helps! |
Hi @sylestiel I am still curious to use 'GREAT' for peak_to_gene_metod. |
Good afternoon! I have been sporadically tinkering with my GRN analysis for some time, and I am running into various problems. For starters, the number of unique target genes detected by Pando is less than 10 in my dataset, even though I have adjusted all parameters to allow for 0 correlation and a p value as insignificant as 1 for all steps. When I look at the modules, fit, and coefs slots in the SeuratPlus object, I am also seeing some transcription factors written as pure numbers without any names. Please see my code below and the attached screenshots. I don't know what's wrong, and I also don't know how to explain these problems clearly. However, I would appreciate any support and am willing to provide additional information.
`pfm <- getMatrixSet(
x = JASPAR2022,
opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE))
x <- character()
for(i in 1:length(pfm@listData)){
x[i] <- pfm@listData[[i]]@name
}
motif_2_tf <- data.frame(motif = names(pfm@listData), tf = x, origin = "JASPAR", gene_id = NA, name = NA, family = NA, symbol = NA, motif_tf = NA)
data("motif2tf")
motif2tf <- subset(motif2tf, motif %in% motif_2_tf$motif)
firstup <- function(x) {
substr(x, 1, 1) <- toupper(substr(x, 1, 1))
substr(x, 2, nchar(x)) <- tolower(substr(x, 2, nchar(x)))
x
}
motif2tf$tf <- firstup(motif2tf$tf)
TKO_int_cont_hep <- readRDS("~/TKO Multiome/Final Seurat Objects/TKO Control Hepatocytes Filtered and Integrated.rds")
TKO_int_cont_hep <- FindVariableFeatures(TKO_int_cont_hep, assay = "RNA", nfeatures = nrow(TKO_int_cont_hep@assays$RNA)) %>%
FindTopFeatures(assay = "Peaks", min.cutoff = 'q0') %>%
RunTFIDF(assay = "Peaks")
TKO_int_cont_hep <- initiate_grn(TKO_int_cont_hep, peak_assay = "Peaks", rna_assay = "RNA", exclude_exons = FALSE) %>%
find_motifs(pfm = pfm, motif_tfs = motif2tf, genome = BSgenome.Mmusculus.UCSC.mm10)
x <- TKO_int_cont_hep@grn@regions@motifs
x <- x[,unique(motif2tf$motif)]
TKO_int_cont_hep@grn@regions@motifs <- x
TKO_int_cont_hep <- infer_grn(TKO_int_cont_hep, peak_to_gene_method = "GREAT", upstream = 1e+06, downstream = 1e+06, method = "xgb", tf_cor = 0, scale = TRUE, interaction_term = "*") %>%
find_modules(p_thresh = 1, nvar_thresh = 1, min_genes_per_module = 1, rsq_thresh = 0) %>%
get_network_graph(graph_name = "full_graph", umap_method = "none") %>%
get_tf_network(graph = "full_graph", tf = "Tcf7l2")`
The text was updated successfully, but these errors were encountered: