ValueError with NaN tensors in gimVI training #2215

Open
sunericd opened this issue Jul 28, 2023 · 12 comments

@sunericd

When training the gimVI model, I am running into a ValueError in the first epoch for some datasets and not others. In all cases, the inputs are AnnData objects with raw counts for both the RNAseq and for the spatial data (dtype=float64). I have tried filtering out cells with zero counts and/or normalizing the RNAseq data but am still running into the same error. Strangely, gimVI seems to be able to train successfully on one dataset but when I remove 4 genes from the spatial data (32 -> 28 genes), it fails on that dataset. Happy to share data if that would be helpful and if there are suggestions for doing so (screenshots of basic data info below).

    import scvi
    from scvi.external import GIMVI
    
    # preprocessing of data
    spatial_adata = spatial_adata[:, spatial_adata.var_names.isin(RNAseq_adata.var_names)]
    
    # indices for filtering out zero-expression cells
    filtered_cells_spatial = (spatial_adata.X.sum(axis=1) > 1)
    filtered_cells_RNAseq = (RNAseq_adata.X.sum(axis=1) > 1)
    
    # make copies of subsets
    spatial_adata = spatial_adata[filtered_cells_spatial,:].copy()
    RNAseq_adata = RNAseq_adata[filtered_cells_RNAseq,:].copy()
    
    # setup anndata for scvi
    GIMVI.setup_anndata(spatial_adata)
    GIMVI.setup_anndata(RNAseq_adata)
        
    # train gimVI model
    model = GIMVI(RNAseq_adata, spatial_adata, **kwargs)
    model.train(200)

ValueError: Expected parameter loc (Tensor of shape (128, 10)) of distribution Normal(loc: torch.Size([128, 10]), scale: torch.Size([128, 10])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0')

Versions:

0.20.3

[Screenshots of basic data info: adata32 (32 genes) and adata28 (28 genes)]

@sunericd sunericd added the bug label Jul 28, 2023
@martinkim0
Contributor

Hi, thanks for bringing this up, and sorry you're running into this issue.

One possible source of these NaN errors is the exponential activations in the model. We've had similar issues before with other models, depending on the dataset used and on any non-default arguments passed during model initialization. Unfortunately, there's no straightforward way to modify these activations at the moment, and we cannot change them on our end because that would break reproducibility with the original manuscript.
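
To make that failure mode concrete, here is a minimal sketch in plain Python (not gimVI code) of how an exponential activation can turn a large pre-activation into `inf`, and how `inf` then becomes NaN. The helper `exp_like_tensor` is hypothetical: it only mimics the IEEE behavior of tensor arithmetic, where overflow yields `inf` instead of raising an exception.

```python
import math


def exp_like_tensor(x: float) -> float:
    """Hypothetical helper: exp() that overflows to inf, as float tensors do,
    instead of raising OverflowError like math.exp."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf


# A moderate pre-activation stays finite.
ok = exp_like_tensor(5.0)

# A large pre-activation overflows to inf. float64 overflows above ~709.78;
# float32 already overflows above ~88.7.
overflowed = exp_like_tensor(1000.0)

# Once inf is in the computation, ordinary operations produce NaN.
nan_from_subtraction = overflowed - overflowed  # inf - inf
nan_from_product = overflowed * 0.0             # inf * 0

print(math.isfinite(ok), math.isinf(overflowed))
print(math.isnan(nan_from_subtraction), math.isnan(nan_from_product))
```

Whether this is what actually happens inside gimVI for a given dataset is not confirmed here; the sketch only shows why a single extreme value can be enough to fill a whole batch with NaN.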

Would you be able to share the dataset as well as a reproducible notebook?

@sunericd
Author

sunericd commented Aug 1, 2023

Thanks for the fast response. Here is a link to a minimal notebook and data files for reproducing the errors (when I tested it today, gimVI failed on the 32-gene data and succeeded on the 28-gene data, so the behavior may vary from run to run):

https://www.dropbox.com/sh/5smbltpudpntmh4/AABrxiGL8jlNxdq3dvHOx3bga?dl=0

@martinkim0
Contributor

Thanks! I'll try to get this running and get back to you

@izabellaleahz

Hello, I am also running into the same issue. Could the steps for modifying these be posted publicly?

@james-cranley

james-cranley commented Aug 18, 2023

Any news on this? I am also experiencing the same error (mine occurs when running mvi.train()).

@james-cranley

PS: I tried the same code and the same object in an old environment (with scvi 0.2.3 installed) and it ran. The environment where I was getting this error had 1.0.3 installed.

@james-cranley

With the environment that previously gave this error activated, I ran:

pip uninstall scvi-tools
pip install scvi-tools==0.20.3

then restarted the kernel and the notebook ran without error.
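
After a downgrade like this, it can be worth confirming which version the restarted kernel actually sees. A minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version visible to this interpreter, or None if not installed.
print("scvi-tools:", installed_version("scvi-tools"))
```

Run it in the same kernel as the notebook: a stale kernel or a second environment on the path can otherwise hide the fact that the old version is still being imported.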

@lila167

lila167 commented Aug 25, 2023

With scvi-tools==1.0.3 installed, I got a similar error while training MultiVI and could fix the problem by installing scvi-tools==0.20.3.

@Accelerator-thu

One possible cause of this issue is invalid input data; I guess the input needs to be a non-negative matrix.

BTW, it would be reasonable to add an assertion in the outer modules, so that invalid input is caught early instead of surfacing as a misleading error like this one.
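
A minimal sketch of such a check, run by hand before `setup_anndata`. This is not part of scvi-tools: the function name `check_count_matrix` is hypothetical, and it assumes a dense matrix (a sparse `adata.X` would need `.toarray()` first).

```python
import numpy as np


def check_count_matrix(X, name="X"):
    """Hypothetical pre-training check (not an scvi-tools API).

    Raises ValueError if X contains NaN/inf or negative values, and
    returns the indices of rows (cells) whose counts sum to zero.
    Assumes X is dense and array-like.
    """
    X = np.asarray(X, dtype=float)
    if not np.isfinite(X).all():
        raise ValueError(f"{name} contains NaN or infinite values")
    if (X < 0).any():
        raise ValueError(f"{name} contains negative values")
    # All-zero cells are valid counts but worth flagging before training.
    return np.flatnonzero(X.sum(axis=1) == 0)


counts = np.array([[0, 3, 1], [0, 0, 0], [2, 0, 5]])
print(check_count_matrix(counts))  # indices of all-zero rows: [1]
```

Passing this check does not rule out NaN during training; it only excludes the input itself as the cause.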

@sunericd
Author

sunericd commented Nov 6, 2023

Have there been any updates on this? I have tried running different versions (0.19.0, 0.20.3, and most recently 1.0.4), but none of them work consistently for the test data and notebook. Using scanpy==1.9.6 and scvi-tools==1.0.4, I did notice that if I restart the kernel a few times, the model.train(200) call sometimes completes successfully, so perhaps the failure is due to some stochastic part of the model?

@Dana162001

Hi, I would also like to hear whether anyone has solved this problem. In my case, model.train() runs with 2 epochs, but with anything more than that I get the same ValueError:
ValueError: Expected parameter loc (Tensor of shape (128, 10)) of distribution Normal(loc: torch.Size([128, 10]), scale: torch.Size([128, 10])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',

@ktpolanski

Same story here - downgraded from 1.0.4 to 0.20.3 and MultiVI ran fine. Dropping a comment mostly to be notified if something budges on this front.

@martinkim0 martinkim0 added the P0 label Jul 12, 2024
@martinkim0 martinkim0 added this to the scvi-tools 1.2 milestone Jul 12, 2024

8 participants