From 9563537e86363fac2768200f5748000ec6b3a911 Mon Sep 17 00:00:00 2001
From: Ronghang Hu
Date: Sat, 26 Sep 2015 11:47:32 -0700
Subject: [PATCH] Update examples and docs

---
 docs/tutorial/solver.md                       | 28 +++++++++----------
 examples/mnist/lenet_adadelta_solver.prototxt |  2 +-
 examples/mnist/lenet_solver_adam.prototxt     |  2 +-
 examples/mnist/lenet_solver_rmsprop.prototxt  |  2 +-
 ...mnist_autoencoder_solver_adadelta.prototxt |  2 +-
 .../mnist_autoencoder_solver_adagrad.prototxt |  2 +-
 ...mnist_autoencoder_solver_nesterov.prototxt |  2 +-
 7 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/tutorial/solver.md b/docs/tutorial/solver.md
index b150f6487bc..b719f715a4b 100644
--- a/docs/tutorial/solver.md
+++ b/docs/tutorial/solver.md
@@ -8,12 +8,12 @@ The responsibilities of learning are divided between the Solver for overseeing t
 
 The Caffe solvers are:
 
-- Stochastic Gradient Descent (`SGD`),
-- AdaDelta (`ADADELTA`),
-- Adaptive Gradient (`ADAGRAD`),
-- Adam (`ADAM`),
-- Nesterov's Accelerated Gradient (`NESTEROV`) and
-- RMSprop (`RMSPROP`)
+- Stochastic Gradient Descent (`type: "SGD"`),
+- AdaDelta (`type: "AdaDelta"`),
+- Adaptive Gradient (`type: "AdaGrad"`),
+- Adam (`type: "Adam"`),
+- Nesterov's Accelerated Gradient (`type: "Nesterov"`) and
+- RMSprop (`type: "RMSProp"`)
 
 The solver
 
@@ -51,7 +51,7 @@ The parameter update $$\Delta W$$ is formed by the solver from the error gradien
 
 ### SGD
 
-**Stochastic gradient descent** (`solver_type: SGD`) updates the weights $$ W $$ by a linear combination of the negative gradient $$ \nabla L(W) $$ and the previous weight update $$ V_t $$.
+**Stochastic gradient descent** (`type: "SGD"`) updates the weights $$ W $$ by a linear combination of the negative gradient $$ \nabla L(W) $$ and the previous weight update $$ V_t $$.
 The **learning rate** $$ \alpha $$ is the weight of the negative gradient.
 The **momentum** $$ \mu $$ is the weight of the previous update.
 
@@ -113,7 +113,7 @@ If learning diverges (e.g., you start to see very large or `NaN` or `inf` loss v
 
 ### AdaDelta
 
-The **AdaDelta** (`solver_type: ADADELTA`) method (M. Zeiler [1]) is a "robust learning rate method". It is a gradient-based optimization method (like SGD). The update formulas are
+The **AdaDelta** (`type: "AdaDelta"`) method (M. Zeiler [1]) is a "robust learning rate method". It is a gradient-based optimization method (like SGD). The update formulas are
 
 $$
 \begin{align}
@@ -125,7 +125,7 @@ E[g^2]_t &= \delta{E[g^2]_{t-1} } + (1-\delta)g_{t}^2
 \end{align}
 $$
 
-and 
+and
 
 $$
 (W_{t+1})_i =
@@ -139,7 +139,7 @@ $$
 
 ### AdaGrad
 
-The **adaptive gradient** (`solver_type: ADAGRAD`) method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to "find needles in haystacks in the form of very predictive but rarely seen features," in Duchi et al.'s words.
+The **adaptive gradient** (`type: "AdaGrad"`) method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to "find needles in haystacks in the form of very predictive but rarely seen features," in Duchi et al.'s words.
 Given the update information from all previous iterations $$ \left( \nabla L(W) \right)_{t'} $$ for $$ t' \in \{1, 2, ..., t\} $$, the update formulas proposed by [1] are as follows, specified for each component $$i$$ of the weights $$W$$:
 
 $$
@@ -159,7 +159,7 @@ Note that in practice, for weights $$ W \in \mathcal{R}^d $$, AdaGrad implementa
 
 ### Adam
 
-The **Adam** (`solver_type: ADAM`), proposed in Kingma et al. [1], is a gradient-based optimization method (like SGD). This includes an "adaptive moment estimation" ($$m_t, v_t$$) and can be regarded as a generalization of AdaGrad. The update formulas are
+The **Adam** (`type: "Adam"`), proposed in Kingma et al. [1], is a gradient-based optimization method (like SGD). This includes an "adaptive moment estimation" ($$m_t, v_t$$) and can be regarded as a generalization of AdaGrad. The update formulas are
 
 $$
 (m_t)_i = \beta_1 (m_{t-1})_i + (1-\beta_1)(\nabla L(W_t))_i,\\
@@ -181,7 +181,7 @@ Kingma et al. [1] proposed to use $$\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon
 
 ### NAG
 
-**Nesterov's accelerated gradient** (`solver_type: NESTEROV`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
+**Nesterov's accelerated gradient** (`type: "Nesterov"`) was proposed by Nesterov [1] as an "optimal" method of convex optimization, achieving a convergence rate of $$ \mathcal{O}(1/t^2) $$ rather than the $$ \mathcal{O}(1/t) $$.
 Though the required assumptions to achieve the $$ \mathcal{O}(1/t^2) $$ convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al. [2].
 The weight update formulas look very similar to the SGD updates given above:
 
@@ -206,10 +206,10 @@ What distinguishes the method from SGD is the weight setting $$ W $$ on which we
 
 ### RMSprop
 
-The **RMSprop** (`solver_type: RMSPROP`), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are
+The **RMSprop** (`type: "RMSProp"`), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are
 
 $$
-(v_t)_i = 
+(v_t)_i =
 \begin{cases}
 (v_{t-1})_i + \delta, &(\nabla L(W_t))_i(\nabla L(W_{t-1}))_i > 0\\
 (v_{t-1})_i \cdot (1-\delta), & \text{else}
diff --git a/examples/mnist/lenet_adadelta_solver.prototxt b/examples/mnist/lenet_adadelta_solver.prototxt
index 776d1e06139..16176c0ffae 100644
--- a/examples/mnist/lenet_adadelta_solver.prototxt
+++ b/examples/mnist/lenet_adadelta_solver.prototxt
@@ -20,5 +20,5 @@ snapshot: 5000
 snapshot_prefix: "examples/mnist/lenet_adadelta"
 # solver mode: CPU or GPU
 solver_mode: GPU
-solver_type: ADADELTA
+type: "AdaDelta"
 delta: 1e-6
diff --git a/examples/mnist/lenet_solver_adam.prototxt b/examples/mnist/lenet_solver_adam.prototxt
index d22c5718f3f..4b5336b1a04 100644
--- a/examples/mnist/lenet_solver_adam.prototxt
+++ b/examples/mnist/lenet_solver_adam.prototxt
@@ -22,5 +22,5 @@ max_iter: 10000
 snapshot: 5000
 snapshot_prefix: "examples/mnist/lenet"
 # solver mode: CPU or GPU
-solver_type: ADAM
+type: "Adam"
 solver_mode: GPU
diff --git a/examples/mnist/lenet_solver_rmsprop.prototxt b/examples/mnist/lenet_solver_rmsprop.prototxt
index 74dadc51069..924b72d306e 100644
--- a/examples/mnist/lenet_solver_rmsprop.prototxt
+++ b/examples/mnist/lenet_solver_rmsprop.prototxt
@@ -23,5 +23,5 @@ snapshot: 5000
 snapshot_prefix: "examples/mnist/lenet_rmsprop"
 # solver mode: CPU or GPU
 solver_mode: GPU
-solver_type: RMSPROP
+type: "RMSProp"
 rms_decay: 0.98
diff --git a/examples/mnist/mnist_autoencoder_solver_adadelta.prototxt b/examples/mnist/mnist_autoencoder_solver_adadelta.prototxt
index 065647df31b..26c4084a374 100644
--- a/examples/mnist/mnist_autoencoder_solver_adadelta.prototxt
+++ b/examples/mnist/mnist_autoencoder_solver_adadelta.prototxt
@@ -16,4 +16,4 @@ snapshot: 10000
 snapshot_prefix: "examples/mnist/mnist_autoencoder_adadelta_train"
 # solver mode: CPU or GPU
 solver_mode: GPU
-solver_type: ADADELTA
+type: "AdaDelta"
diff --git a/examples/mnist/mnist_autoencoder_solver_adagrad.prototxt b/examples/mnist/mnist_autoencoder_solver_adagrad.prototxt
index cc0ed9e310a..065cdb20ddc 100644
--- a/examples/mnist/mnist_autoencoder_solver_adagrad.prototxt
+++ b/examples/mnist/mnist_autoencoder_solver_adagrad.prototxt
@@ -14,4 +14,4 @@ snapshot: 10000
 snapshot_prefix: "examples/mnist/mnist_autoencoder_adagrad_train"
 # solver mode: CPU or GPU
 solver_mode: GPU
-solver_type: ADAGRAD
+type: "AdaGrad"
diff --git a/examples/mnist/mnist_autoencoder_solver_nesterov.prototxt b/examples/mnist/mnist_autoencoder_solver_nesterov.prototxt
index 2a59fd45c8d..c95e3fe7e49 100644
--- a/examples/mnist/mnist_autoencoder_solver_nesterov.prototxt
+++ b/examples/mnist/mnist_autoencoder_solver_nesterov.prototxt
@@ -17,4 +17,4 @@ snapshot_prefix: "examples/mnist/mnist_autoencoder_nesterov_train"
 momentum: 0.95
 # solver mode: CPU or GPU
 solver_mode: GPU
-solver_type: NESTEROV
+type: "Nesterov"
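
Note for anyone updating their own solver definitions to match this change: the old upper-case enum values of `solver_type` map one-to-one onto the new string values of `type` (`SGD` -> "SGD", `ADADELTA` -> "AdaDelta", `ADAGRAD` -> "AdaGrad", `ADAM` -> "Adam", `NESTEROV` -> "Nesterov", `RMSPROP` -> "RMSProp"), as the per-file hunks above show; solver-specific fields such as `delta` and `rms_decay` keep their existing names. As a minimal sketch of the new style, a complete Adam solver prototxt might look like the following. Only the `type: "Adam"` line reflects the renamed field; the net path and the hyperparameter values are illustrative placeholders, not part of this patch.

    # Hypothetical solver definition using the new string-valued "type" field.
    # All values are placeholders except the type line, which shows the new
    # spelling of what used to be written as `solver_type: ADAM`.
    net: "examples/mnist/lenet_train_test.prototxt"
    type: "Adam"           # previously: solver_type: ADAM
    base_lr: 0.001
    momentum: 0.9          # Adam's beta_1
    momentum2: 0.999       # Adam's beta_2
    delta: 1e-8            # Adam's epsilon
    lr_policy: "fixed"
    display: 100
    max_iter: 10000
    snapshot: 5000
    snapshot_prefix: "examples/mnist/lenet"
    # solver mode: CPU or GPU
    solver_mode: GPU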