
Reimplementation and Model Architecture issues #3

Open

junkeun-yi opened this issue Jul 6, 2022 · 1 comment

@junkeun-yi

Hello, I am currently reimplementing your project in PyTorch in my repository at https://github.com/junkeun-yi/SAVi-pytorch.

I have been having a few issues matching the performance of SAVi-small on the MOVi-C dataset, and wanted some clarifications on the implementation.

I noticed that in the paper (https://arxiv.org/pdf/2111.12594.pdf) the small variant uses 64 features for the encoded features, but the code suggests that it is only using 32 features, so I implemented it that way. I wanted to confirm whether this is correct.

Also, my reimplementation has repeatedly achieved subpar performance on the SAVi-small, MOVi-C experiment: my code gets 62.6 points whereas the official code achieves 64.7 points. As far as I can tell, my code is exactly the same as the original except for some convolution padding, since PyTorch does not support 'same' padding for strided or transposed convolutions.
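
For reference, here is a minimal sketch of how TF/Flax-style 'SAME' padding could be emulated for strided convolutions in PyTorch (the wrapper below is illustrative, not taken from either codebase; PyTorch's built-in `padding='same'` raises an error when `stride != 1`):

```python
import math

import torch
import torch.nn.functional as F
from torch import nn


class SamePadConv2d(nn.Conv2d):
    """Conv2d emulating TF/Flax 'SAME' padding, including the asymmetric
    (extra pixel on the bottom/right) case that arises with strides > 1."""

    def forward(self, x):
        ih, iw = x.shape[-2:]
        pad_h = self._same_pad(ih, self.kernel_size[0], self.stride[0], self.dilation[0])
        pad_w = self._same_pad(iw, self.kernel_size[1], self.stride[1], self.dilation[1])
        # F.pad takes (left, right, top, bottom) for the last two dims.
        x = F.pad(x, (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2))
        return F.conv2d(x, self.weight, self.bias, self.stride, 0, self.dilation, self.groups)

    @staticmethod
    def _same_pad(size, kernel, stride, dilation):
        # TF 'SAME' rule: output size is ceil(size / stride).
        eff_kernel = (kernel - 1) * dilation + 1
        out = math.ceil(size / stride)
        return max((out - 1) * stride + eff_kernel - size, 0)


# Example: a 5x5, stride-2 conv on a 64x64 input yields 32x32, as in TF/Flax.
conv = SamePadConv2d(3, 32, kernel_size=5, stride=2)
assert conv(torch.randn(1, 3, 64, 64)).shape == (1, 32, 32, 32)
```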

In addition, would it be possible for the authors to provide loss and evaluation curves, as well as the number of GPUs used for the different experiments? That would be very much appreciated and helpful in creating a reimplementation.

Thank you!

@tkipf
Contributor

tkipf commented Jul 19, 2022

Thank you for your comment!

First of all, thank you for sharing your PyTorch re-implementation. I will have a closer look soon; maybe I can spot some differences (apart from the ones you mention).

Regarding the small SAVi model variant: the code here is correct, but there was indeed a typo in the paper, Appendix Table A.5 -- the number of channels in the very last line (Conv 1x1) should be 32 for the small model instead of 64. The second-to-last line is indeed 64 (we implement this using an MLP after the position embedding w/ a hidden layer size of 64).
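
To make the corrected dimensions concrete, here is a minimal illustrative sketch of that encoder tail in PyTorch (the module structure, LayerNorm placement, and the learned position embedding are assumptions; only the 32/64/32 channel counts come from the correction above):

```python
import torch
from torch import nn


class EncoderTail(nn.Module):
    """Illustrative SAVi-S encoder tail: add a position embedding to the
    32-channel CNN features, then apply a per-location MLP with hidden
    size 64 and output size 32 (not 64, per the paper correction)."""

    def __init__(self, feat_dim=32, hidden_dim=64, height=16, width=16):
        super().__init__()
        # Simplified learned position embedding; the actual codebase
        # projects a coordinate grid instead.
        self.pos_embed = nn.Parameter(torch.zeros(1, height * width, feat_dim))
        self.mlp = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, hidden_dim),  # second-to-last line of Table A.5: 64
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),  # very last line (Conv 1x1): 32, not 64
        )

    def forward(self, x):
        # x: (batch, H*W, 32) flattened CNN backbone features.
        return self.mlp(x + self.pos_embed)
```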

I am unable to share training/evaluation curves right now, but I can say that they look very unsurprising: the eval FG-ARI steadily goes up over the course of training, improving quickly at first (e.g. around 85% eval FG-ARI for SAVi-S on MOVi-A after 10k steps) and then more slowly for the remainder of training (e.g. ~91% eval FG-ARI for SAVi-S on MOVi-A after 50k steps).

Maybe the difference in results can be explained by the conv padding -- you should be able to try this by changing the config in our JAX/Flax codebase.

In terms of #GPUs: you should be able to run this model on a single GPU or in a multi-GPU setting with data parallelism (which the code base supports using pmap); ultimately, GPU memory determines what you can fit on a single device. A single V100 GPU should fit both the SAVi-S and SAVi-M models.
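
For the PyTorch re-implementation, a rough sketch of the analogous data-parallel setup using DistributedDataParallel is below (the official codebase uses JAX pmap instead; the model here is a stand-in and nothing below is from either repo):

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = nn.Linear(32, 32).to(device)  # stand-in for the actual SAVi model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)

    # One illustrative step; DDP averages gradients across GPUs on backward().
    x = torch.randn(8, 32, device=device)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```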
