Issue with read.data #30

asmagen · 2019-05-20T21:42:22Z

Similarly to read.preclustered.datasets, the unique commands needs to be removed from all the following code (taken from read.data function). We have already don't that for read.preclustered.datasets previously. The structure of initialize.project is that each dataset has a value of origins and experiments which can be duplicated. Applying unique onto these vectors is problematic.

dataset.labels <- rep(paste(unique(environment$origins)[environment$datasets ==
                dataset], " (", unique(environment$experiments)[environment$datasets == dataset],
                ")", sep = ""), ncol(measurements))
            origins <- rep(unique(environment$origins)[environment$datasets == dataset],
                ncol(measurements))
            experiments <- rep(unique(environment$experiments)[environment$datasets == dataset],
                ncol(measurements))

See example of how I run initialize.project:

combined <- initialize.project(datasets = c("TbetWT_Exp.1","TbetWT_Exp.2","TbetKO_Exp.1","TbetKO_Exp.2"), 
    origins = c("Tbet NKp46 WT cells","Tbet NKp46 WT cells","Tbet NKp46 KO cells","Tbet NKp46 KO cells"),
    experiments = c("Replicate 1","Replicate 2","Replicate 1","Replicate 2"),
    data.path = '/data/magenay/ILC/data',
    work.path = '/data/magenay/ILC',
    marker.genes = markers)#,clear.history=T

The text was updated successfully, but these errors were encountered:

asmagen · 2019-05-21T11:46:54Z

Here is a reproducible example (take any four datasets you have localy but don’t change the origins and experiments labels):

pooled_env <- initialize.project(datasets = c("LCMV1", "LCMV2","LCMV3", "LCMV4"),
origins = c("LCMV WT", "LCMV WT", "LCMV KO", "LCMV KO"),
experiments = c("Rep1", "Rep2","Rep1", "Rep2"),
data.path = file.path(tempdir(), "LCMV"),
work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))

pooled_env <- read.data(pooled_env)

table(combined$dataset.labels)
NA (NA) LCMV KO (Replicate 2) LCMV WT (Replicate 1)
13242 8022 5368

Mamie · 2019-05-22T13:26:35Z

@asmagen The problem after removing the unique command is that rerunning cause the length of origins, experiments, and dataset.labels are modified again, which causes error in the subsequent workflow. See the following output.

> library(robustSingleCell)
> download_LCMV()
> LCMV1 <- initialize.project(datasets = "LCMV1", 
                          origins = "CD44+ cells",
                          experiments = "Rep1",
                          data.path = file.path(tempdir(), "LCMV"),
                          work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))
> LCMV1 <- read.data(LCMV1, subsample = 500)
> LCMV1 <- get.variable.genes(LCMV1)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins                483 -none-     character
experiments            483 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels         483 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

> LCMV1 <- read.data(LCMV1, subsample = 500, rerun = T)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins             233289 -none-     character
experiments         233289 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels      233289 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

Notice how the length of origins, experiments and dataset.labels are inconsistent across the two run.

Any suggestion on how this should be resolved? The easiest fix is to just disable rerun since we are not going to use this data structure for the next update anyway.

asmagen · 2019-05-22T13:46:21Z

Good catch, thanks for letting me know. Since we need to retain compatibility with earlier versions after the SCE update, I would suggest to set a couple of additional variables (baseline.origins, baseline.experiments) to which the initialize.project is storing origins and experiments. Then read.data reads from baseline.origins, baseline.experiments and saves to origins, experiments. Basically it prevents reading and writing to the same variables, which is what is causing the problem. Lmk if that makes sense.

asmagen · 2019-05-23T14:26:19Z

Found another issue with read.data - genes.filter used to filter genes from the expression matrix is used incorrectly:
counts = counts[genes.filter,]
normalized = normalized[genes.filter,]
where the length of the genes.filter vector is inconsistent with the number of rows in the expression matrix in some instances while the right way to do it is:
counts = counts[names(genes.filter)[genes.filter==T],]
normalized = normalized[names(genes.filter)[genes.filter==T],]
I am going to write an update shortly but you @Mamie should take this into account in the SCE code.

asmagen added the bug Something isn't working label May 20, 2019

asmagen assigned Mamie May 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with read.data #30

Issue with read.data #30

asmagen commented May 20, 2019

asmagen commented May 21, 2019

Mamie commented May 22, 2019 •

edited

Loading

asmagen commented May 22, 2019

asmagen commented May 23, 2019

Issue with read.data #30

Issue with read.data #30

Comments

asmagen commented May 20, 2019

asmagen commented May 21, 2019

Mamie commented May 22, 2019 • edited Loading

asmagen commented May 22, 2019

asmagen commented May 23, 2019

Mamie commented May 22, 2019 •

edited

Loading