SDE GAN example #80

Merged: 21 commits merged into master on Feb 26, 2021

Conversation

patrick-kidger (Collaborator)

Added an SDE-GAN example. Only a draft as I'm not satisfied that it's working yet / that it's as fast as it could be.

I've added a couple other things in here too. For one, I've added a CHANGELOG.txt file. (Anything I've left out?)
I've also tweaked the citation request with the new paper, to reflect the huge amount of work we've put into this repo for that paper.

@lxuechen (Collaborator)

Thanks for the update. Is the changelog necessary, as I've been documenting things in the release notes? I'll take a look at the other stuff tomorrow.

@patrick-kidger (Collaborator, Author)

Haha, I honestly hadn't spotted the release notes. Agreed, the changelog is unnecessary.

FYI this example doesn't work yet; it still needs more tweaks. (The same was true when we wrote the paper; it took me like two months of tweaking to get it working. Hopefully we can do it quicker now, but this isn't surprising.)

@lxuechen (Collaborator)

Thanks for the update, @patrick-kidger ! I'm assuming the example works now.

I actually have further thoughts about this GAN model, though I'm quite busy before the xmas break and can't really afford to context switch between different things.

Would you mind if I go through this PR in a week's time?

Aside, build for Py3.6 seems to be failing due to numpy being incompatible after its latest release.

@patrick-kidger (Collaborator, Author)

Yup, the example works now! The most important points were to change the initialisation and to switch from Adam to SGD.
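
(For anyone reading along: a minimal sketch of the sort of initialisation tweak meant here, namely shrinking the final linear layer of each vector field so that the untrained SDE produces tame paths early in training. The helper name and scale factor below are illustrative only, not necessarily what the example actually does.)

```python
import torch

def shrink_final_layer_(module: torch.nn.Module, scale: float = 0.1):
    # Scale down the last Linear layer in-place, so that the untrained vector
    # field is small and early generator samples stay well-behaved.
    linears = [m for m in module.modules() if isinstance(m, torch.nn.Linear)]
    with torch.no_grad():
        linears[-1].weight.mul_(scale)
        linears[-1].bias.mul_(scale)
```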

I'd be very interested to hear your thoughts on SDEs/GANs. After xmas is fine; likewise no rush on this PR. It looks like SDE-GANs are probably going to be in ICML rather than ICLR, after all...

Aside, eurgh. Looks like the problem is that it's grabbing the SciPy pre-release, which requires Python 3.7+.

@lxuechen (Collaborator)

Not related to this thread in particular, but I found this during late night reading. Some food for thought regarding the log-ODE scheme.

@patrick-kidger (Collaborator, Author) commented Dec 16, 2020

Ah, thank you! I've actually come across this before. Unfortunately I don't think it composes with autograd.grad in the desired way though -- from one of the PyTorch devs: "The current vmap prototype unfortunately does not support doing a batch vjp where the input and the v vector share a batch dimension."

At some point I expect we'll revisit the log-ODE method, as I believe James wants to write a paper on it. When that happens one option would be to write a for loop over the vjp, in C++ parallelised with OpenMP.
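
(In case it's useful later, a rough Python sketch of that workaround -- the C++/OpenMP version would be the same loop without the interpreter overhead. The function below is illustrative, not code from this repo.)

```python
import torch

def batched_vjp(f, x, v):
    # x: (batch, d) inputs; v: (batch, m) cotangents sharing the batch dimension.
    # Computes v[i]^T J_f(x[i]) for each i, with one autograd.grad call per sample.
    outs = []
    for xi, vi in zip(x, v):
        xi = xi.detach().requires_grad_(True)
        yi = f(xi)
        (gi,) = torch.autograd.grad(yi, xi, grad_outputs=vi)
        outs.append(gi)
    return torch.stack(outs)
```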

@lxuechen (Collaborator) left a comment

Thanks for the example! I've only read the code but haven't played with it, so these are mostly just questions. Are there ways to simplify this example and potentially speed up running it (e.g. by reducing the time horizon)?

examples/sde_gan.py (outdated):
###################
# Sample initial noise and map it through the initial network to get the SDE's starting state x0.
init_noise = torch.randn(batch_size, self._initial_noise_size, device=ts.device)
x0 = self._initial(init_noise)
# TODO: step size
@lxuechen (Collaborator)

These step sizes seem quite large. What happens if we use a smaller step size? Is it just slower, or does it tend to not work as well?

In particular, this step size seems to be larger than the step size for generating the data. This seems a bit weird, unless I'm missing something obvious.

@patrick-kidger (Collaborator, Author) commented Dec 22, 2020

It's just slower. The step size used for generating the data needn't be connected to the step size we use in our model. (We don't output the data every step; only every 10 steps.)

This actually touches on what I've found to be the single biggest issue with neural SDEs: training them is really, really slow. Even with this example, which:

  • is a simple toy dataset, actually generated from an SDE;
  • has a large batch size;
  • has a huge step size;
  • isn't making adaptive steps;
  • is backprop'ing through the solver, without the adjoint method;
  • is on a decent GPU;
  • has the benefit of having been tweaked for a few months now.

Even this still takes a substantial amount of time to run.

I'm not really sure what the solution is, to be honest. (Lots of GPUs would be one way.) Large step sizes are just one way of ameliorating this -- I suspect the prior on model space is pretty similar to that of small step sizes.
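
To make the step-size trade-off concrete, here's a purely illustrative, self-contained sketch of a coarse fixed-step solve with torchsde. The toy SDE, the sizes and the dt below are stand-ins rather than the example's actual settings.

```python
import torch
import torchsde

class ToySDE(torch.nn.Module):
    # Stand-in for the example's generator vector fields.
    sde_type = 'ito'
    noise_type = 'general'

    def __init__(self, state_size=4, noise_size=3):
        super().__init__()
        self._drift = torch.nn.Linear(state_size, state_size)
        self._diffusion = torch.nn.Linear(state_size, state_size * noise_size)
        self._noise_size = noise_size

    def f(self, t, x):
        return self._drift(x)

    def g(self, t, x):
        return self._diffusion(x).view(x.size(0), x.size(1), self._noise_size)

batch_size, state_size, t_size = 1024, 4, 64
sde = ToySDE(state_size)
x0 = torch.randn(batch_size, state_size)
ts = torch.linspace(0, t_size - 1, t_size)
# dt=1.0 is a deliberately coarse fixed step: one Euler step per output time.
xs = torchsde.sdeint(sde, x0, ts, method='euler', dt=1.0)  # (t_size, batch, state)
```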

examples/sde_gan.py: five further review threads, resolved.
@patrick-kidger (Collaborator, Author) commented Dec 22, 2020

A few of your comments above are about the optimisation procedure. This is easily the thing that actually seems to matter most.

I didn't try every combination of SGD/Adam/Adadelta/... with every combination of weight decay, SWA, etc, but as a general summary of my observations:

  • Adam/AMSGrad give losses that jump around all over the show during training, whilst Adadelta gives very well-behaved, smoothly-varying losses. However, only SGD manages to get anything approaching good convergence towards the end of training: the others tend to get stuck at some loss away from zero. (Probably related to the theoretical lack of convergence that these optimisers exhibit.)
  • Weight decay seems to help a little bit.
  • SWA can save you from the oscillating losses that Adam gives you, helps a fair chunk with SGD, but does basically nothing with Adadelta. I've only tried uniform weight averaging, not EMA.

Judging from the GAN optimisation literature at the moment, it seems like no-one else has much idea either. SWA is obviously related to the stronger convergence guarantees that Cesaro means often get / is a trick to avoid the diffusive regime at the end of training. Weight decay and negative momentum seem related to the idea of solving a GAN as an ODE, but I suspect that's almost certainly suboptimal for min-max games, just like it's suboptimal for minimisation problems.

All in all things seem to be in a bit of a mess. It feels to me like all the basic pieces are there in the literature, and someone just needs to figure out the correct things to put together to get it working reliably.

As it stands this example represents a "good enough" trade-off between readability / efficacy / use of only standard library without custom optimisers.
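
For concreteness, the "good enough" recipe above might look something like the sketch below: plain SGD with a little weight decay, plus uniform weight averaging via torch.optim.swa_utils. The models and hyperparameters are placeholders, not the example's actual values.

```python
import torch
from torch.optim.swa_utils import AveragedModel

generator = torch.nn.Linear(4, 4)        # placeholder models
discriminator = torch.nn.Linear(4, 1)

g_optim = torch.optim.SGD(generator.parameters(), lr=2e-4, weight_decay=0.01)
d_optim = torch.optim.SGD(discriminator.parameters(), lr=1e-3, weight_decay=0.01)

# Uniform (not EMA) weight averaging, as described above.
avg_generator = AveragedModel(generator)
avg_discriminator = AveragedModel(discriminator)

for step in range(100):
    ...  # generator/discriminator losses, backward passes, optimiser steps
    avg_generator.update_parameters(generator)
    avg_discriminator.update_parameters(discriminator)
```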

Things I'd like to try, but decided were out of scope for an intentionally simple example:

  • Negative momentum. Unfortunately the default PyTorch optimizers throw an error if you try it.
  • Adabound, which interpolates between Adam and SGD. Something interpolating between Adadelta and SGD would also be interesting.
  • Using Adadelta to begin training and then switching to SGD at the end.
  • Coming up with some custom optimiser that has the right blend of the above ideas!

If you have any thoughts on the matter I'd be very interested.

@patrick-kidger patrick-kidger changed the base branch from master to dev December 24, 2020 03:50
@patrick-kidger (Collaborator, Author)

FYI even though I think this PR is done, I'm thinking of leaving it unmerged until the paper is on arXiv (so we have a better link than just the workshop paper).

Base automatically changed from dev to master January 5, 2021 12:06
@patrick-kidger (Collaborator, Author) commented Feb 9, 2021

Finally done with this PR and ready to merge it, I think. @lxuechen how does it look?

@lxuechen (Collaborator)

> Finally done with this PR and ready to merge it, I think. @lxuechen how does it look?

Looks good. If you could rebase from master to get rid of the test failures, then I think it's mostly ready to merge; otherwise I can try to do the rebase some time over the weekend. Overall, the example seems somewhat more complicated than the others, but understandably training GANs isn't easy.

Sorry again for this ultra slow response.

@patrick-kidger (Collaborator, Author)

No worries at all.

The test failure looks to be because the version number isn't updated. I can bump the number if you really want, but as we're only touching the README and adding a new example file, I think it should be fine to just merge anyway.

I've just done a small update to start using Fire and to add the license at the top.

@patrick-kidger patrick-kidger marked this pull request as ready for review February 24, 2021 16:47
@lxuechen lxuechen merged commit f965bc9 into master Feb 26, 2021
@patrick-kidger patrick-kidger deleted the dev-sde-gan-example branch February 26, 2021 22:00