In this repository we implement implicit forms of common gradient descent algorithms available in PyTorch 1.0 and compare their performance to traditional (explicit) SGD methods.
For a detailed discussion of the theoretical advantages of implicit gradient descent, see this paper. To paraphrase, a key finding is that whereas explicit methods require the learning-rate schedule to be carefully tuned to balance statistical efficiency against numerical stability, with implicit methods the stability constraint effectively vanishes: any learning rate (or sequence thereof) yields a stable procedure, allowing higher rates and therefore faster convergence. On the other hand, as we shall see, computing the implicit update is computationally expensive and in many cases intractable, so in this repo we experiment to find situations in which the trade-off is beneficial.
A little background: an explicit gradient descent update typically looks something like

$$\theta_{t+1} = \theta_t - \gamma_t \nabla F(\theta_t).$$

The implicit update instead evaluates the gradient at the new iterate:

$$\theta_{t+1} = \theta_t - \gamma_t \nabla F(\theta_{t+1}),$$

where here we choose to use a first-order expansion to approximate

$$\nabla F(\theta_{t+1}) \approx \nabla F(\theta_t) + H(\theta_t)\,(\theta_{t+1} - \theta_t),$$

where $H$ denotes the Hessian of $F$. Now we can simply rearrange to find our update rule:

$$\theta_{t+1} = \theta_t - \gamma_t \left(I + \gamma_t H(\theta_t)\right)^{-1} \nabla F(\theta_t).$$
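As a concrete illustration, here is a minimal sketch of one such linearized implicit-SGD step on a flat parameter vector. It assumes a recent PyTorch (the `torch.autograd.functional.hessian` and `torch.linalg.solve` helpers postdate 1.0), and the function and variable names are ours for illustration, not a fixed API of this repo:

```python
import torch

def implicit_sgd_step(theta, loss_fn, lr):
    """One linearized implicit-SGD step on a flat parameter vector:
        theta_new = theta - lr * (I + lr * H)^(-1) * grad F(theta),
    where H is the Hessian of F at theta. Only feasible for small M = theta.numel().
    """
    theta = theta.detach().clone().requires_grad_(True)
    loss = loss_fn(theta)
    grad, = torch.autograd.grad(loss, theta)
    H = torch.autograd.functional.hessian(loss_fn, theta.detach())
    M = theta.numel()
    A = torch.eye(M, dtype=theta.dtype) + lr * H
    # Solve the M x M linear system rather than forming the inverse explicitly.
    delta = torch.linalg.solve(A, lr * grad)
    return theta.detach() - delta

# Toy usage on a quadratic loss F(theta) = 0.5 * ||theta||^2:
theta = torch.randn(5)
theta = implicit_sgd_step(theta, lambda t: 0.5 * (t ** 2).sum(), lr=10.0)
```

On this quadratic toy problem the step reduces to shrinking `theta` by a factor of `1 + lr`, so even a learning rate of 10 remains stable, which is exactly the behaviour the implicit form is meant to provide.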
In our investigation we restrict our attention to compositions of affine functions, bounded non-linearities, and ReLU (MLP, CNN, and RNN architectures), so we can approximate the Hessian well without computing second-order partial derivatives of the loss w.r.t. the parameters (see background.pdf for details). This lets us rely on PyTorch's auto-differentiation tools, and the increase in workload over the explicit method is due exclusively to the need to invert a square matrix of size MxM, where M is the number of parameters. In addition to the straightforward SGD update, we also implement alternative implicit update rules (such as implicit Adam).
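One common way to build such a first-order Hessian surrogate is the outer product of per-sample gradients (the empirical Fisher); whether this matches the exact construction in background.pdf is an assumption on our part, but it illustrates how only `torch.autograd.grad` calls are needed:

```python
import torch

def per_sample_gradients(model, loss_fn, xs, ys):
    """Flattened gradient of the loss for each sample, stacked into an (n, M) matrix.
    Loop version for clarity; only first-order autograd calls are used."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

def outer_product_hessian(G):
    """Empirical-Fisher surrogate: H_hat = (1/n) * G^T G, built purely from
    the per-sample gradient matrix G of shape (n, M)."""
    return G.t() @ G / G.shape[0]
```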
The implicit update requires the inversion of an MxM matrix, where M is the number of parameters, making it impractical for modern convolutional architectures, which regularly comprise millions (or more) of parameters. On the other hand, we are interested in identifying if and when (i.e., for what M) the speed-up in training afforded by higher allowable learning rates can offset the computational cost of inverting such a matrix.
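To get a rough sense of that cost, the snippet below times a single M x M solve for a few values of M; this is a hypothetical micro-benchmark, and absolute numbers depend entirely on hardware and BLAS backend.

```python
import time
import torch

# Gauge how the per-step overhead of the implicit update scales with M.
for M in (1_000, 5_000, 10_000):
    A = torch.eye(M) + 0.1 * torch.randn(M, M)
    b = torch.randn(M)
    t0 = time.perf_counter()
    torch.linalg.solve(A, b)
    print(f"M = {M:>6}: {time.perf_counter() - t0:.3f} s")
```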
While training deep networks from scratch with such methods seems impractical, we'd like to investigate potential applications to transfer learning (see our repo on neural splicing for an example and discussion), where weights from a pre-trained model are loaded into a slightly modified architecture and only a small fraction of the parameters (on the order of tens of thousands) are randomly initialized and trained from the start. We hope this approach can accelerate fine-tuning in these cases.
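A sketch of the kind of setup we have in mind (the backbone choice here is hypothetical, not something this repo prescribes): freeze a pre-trained model, replace its final layer, and check how small M becomes for the implicit update.

```python
import torch
import torchvision

# Load a pre-trained backbone, freeze it, and attach a fresh head so that
# only a few thousand parameters are randomly initialized and trainable.
model = torchvision.models.resnet18(pretrained=True)
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new head, trains from scratch

M = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: M = {M}")  # 5130 for this particular choice
```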