Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with read.data #30

Open
asmagen opened this issue May 20, 2019 · 4 comments
Open

Issue with read.data #30

asmagen opened this issue May 20, 2019 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@asmagen
Copy link
Owner

asmagen commented May 20, 2019

Similarly to read.preclustered.datasets, the unique commands needs to be removed from all the following code (taken from read.data function). We have already don't that for read.preclustered.datasets previously. The structure of initialize.project is that each dataset has a value of origins and experiments which can be duplicated. Applying unique onto these vectors is problematic.

dataset.labels <- rep(paste(unique(environment$origins)[environment$datasets ==
                dataset], " (", unique(environment$experiments)[environment$datasets == dataset],
                ")", sep = ""), ncol(measurements))
            origins <- rep(unique(environment$origins)[environment$datasets == dataset],
                ncol(measurements))
            experiments <- rep(unique(environment$experiments)[environment$datasets == dataset],
                ncol(measurements))

See example of how I run initialize.project:

combined <- initialize.project(datasets = c("TbetWT_Exp.1","TbetWT_Exp.2","TbetKO_Exp.1","TbetKO_Exp.2"), 
    origins = c("Tbet NKp46 WT cells","Tbet NKp46 WT cells","Tbet NKp46 KO cells","Tbet NKp46 KO cells"),
    experiments = c("Replicate 1","Replicate 2","Replicate 1","Replicate 2"),
    data.path = '/data/magenay/ILC/data',
    work.path = '/data/magenay/ILC',
    marker.genes = markers)#,clear.history=T
@asmagen asmagen added the bug Something isn't working label May 20, 2019
@asmagen
Copy link
Owner Author

asmagen commented May 21, 2019

Here is a reproducible example (take any four datasets you have localy but don’t change the origins and experiments labels):

pooled_env <- initialize.project(datasets = c("LCMV1", "LCMV2","LCMV3", "LCMV4"),
origins = c("LCMV WT", "LCMV WT", "LCMV KO", "LCMV KO"),
experiments = c("Rep1", "Rep2","Rep1", "Rep2"),
data.path = file.path(tempdir(), "LCMV"),
work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))

pooled_env <- read.data(pooled_env)

table(combined$dataset.labels)
NA (NA) LCMV KO (Replicate 2) LCMV WT (Replicate 1)
13242 8022 5368

@Mamie
Copy link
Collaborator

Mamie commented May 22, 2019

@asmagen The problem after removing the unique command is that rerunning cause the length of origins, experiments, and dataset.labels are modified again, which causes error in the subsequent workflow. See the following output.

> library(robustSingleCell)
> download_LCMV()
> LCMV1 <- initialize.project(datasets = "LCMV1", 
                          origins = "CD44+ cells",
                          experiments = "Rep1",
                          data.path = file.path(tempdir(), "LCMV"),
                          work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))
> LCMV1 <- read.data(LCMV1, subsample = 500)
> LCMV1 <- get.variable.genes(LCMV1)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins                483 -none-     character
experiments            483 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels         483 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

> LCMV1 <- read.data(LCMV1, subsample = 500, rerun = T)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins             233289 -none-     character
experiments         233289 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels      233289 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

Notice how the length of origins, experiments and dataset.labels are inconsistent across the two run.

Any suggestion on how this should be resolved? The easiest fix is to just disable rerun since we are not going to use this data structure for the next update anyway.

@asmagen
Copy link
Owner Author

asmagen commented May 22, 2019

Good catch, thanks for letting me know. Since we need to retain compatibility with earlier versions after the SCE update, I would suggest to set a couple of additional variables (baseline.origins, baseline.experiments) to which the initialize.project is storing origins and experiments. Then read.data reads from baseline.origins, baseline.experiments and saves to origins, experiments. Basically it prevents reading and writing to the same variables, which is what is causing the problem. Lmk if that makes sense.

@asmagen
Copy link
Owner Author

asmagen commented May 23, 2019

Found another issue with read.data - genes.filter used to filter genes from the expression matrix is used incorrectly:
counts = counts[genes.filter,]
normalized = normalized[genes.filter,]
where the length of the genes.filter vector is inconsistent with the number of rows in the expression matrix in some instances while the right way to do it is:
counts = counts[names(genes.filter)[genes.filter==T],]
normalized = normalized[names(genes.filter)[genes.filter==T],]
I am going to write an update shortly but you @Mamie should take this into account in the SCE code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants