[Training] SG with Momentum Optimizer (onnx#1959)
* SG with Momentum

* Register Op

Fix

Update other docs

* Add shape inference code and polish definition

* Update docs

* Add test cases and fix several bugs

* Remove accidentally added copy

* Alpha -> alpha & Beta -> beta

* Clarify an attribute

* Fix an attribute

* Fix bug

* Fix missing attributes

* sync doc

* Remove unused domain

* sync with master

Co-authored-by: Chin Huang <[email protected]>
wschin and chinhuang007 authored Mar 11, 2020
1 parent 8d15705 commit c2fefcb
Showing 36 changed files with 830 additions and 1 deletion.
109 changes: 109 additions & 0 deletions docs/Changelog.md
@@ -15424,3 +15424,112 @@ This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set.
<dd>Allow inputs and outputs to be any kind of tensor.</dd>
</dl>

### <a name="ai.onnx.training.Momentum-1"></a>**ai.onnx.training.Momentum-1**

Compute one iteration of stochastic gradient update with momentum.
This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. As you can imagine, SG with momentum requires
several parameters:

- The learning rate "R".
- The update count "T", that is, the number of training iterations conducted so far. It should
  be zero in the first training iteration.
- An L2-norm regularization coefficient "norm_coefficient".
- A decay coefficient "alpha" applied to the previous accumulated gradient (i.e., the momentum).
- A scaling coefficient "beta" applied to the current gradient.
- An attribute "mode" selecting whether standard momentum or Nesterov's momentum should be used.

For the sake of simplicity, assume that there is only one tensor (called "X") to be optimized.
Other necessary inputs are "X"'s gradient (called "G") and "X"'s momentum (called "V"). This
Momentum operator maps all these inputs to the new value of "X" (called "X_new") and its new
momentum (called "V_new").

This operator supports two different momentum algorithms. Set the attribute "mode" to
"nesterov" if Nesterov's momentum is desired. Otherwise, set the attribute "model" to
"standard" to use standard momentum. Computation details are described subsequently.

Let "+", "-", "*", and "/" are all element-wise operations with numpy-style broadcasting.

Pseudo code for SG with standard momentum:

// Add the gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared
// values of all elements in X.
G_regularized = norm_coefficient * X + G

// In the first training iteration, beta should always be 1.
beta_adjusted = T > 0 ? beta : 1

// Compute the current momentum based on previous momentum and the current gradient.
V_new = alpha * V + beta_adjusted * G_regularized

// Update X.
X_new = X - R * V_new

Pseudo code for SG with Nesterov's momentum:

// Add the gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared
// values of all elements in X.
G_regularized = norm_coefficient * X + G

// In the first training iteration, beta should always be 1.
beta_adjusted = T > 0 ? beta : 1

// Compute the current momentum based on previous momentum and the current gradient.
V_new = alpha * V + beta_adjusted * G_regularized

// Compute final update direction and then update X.
X_new = X - R * (G_regularized + alpha * V_new)

If this operator is used to optimize multiple inputs, for example "X_1" and "X_2", the same
pseudo code extends to all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (of course, their gradients and momentums should
be concatenated too), and then the pseudo code above becomes applicable.
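
To make the two update rules concrete, here is a minimal numpy sketch of a single-tensor
step. The helper name "momentum_update" is hypothetical and not part of the operator set;
it simply transcribes the pseudo code above.

```python
import numpy as np

# Hypothetical reference helper; a direct transcription of the pseudo code above.
def momentum_update(r, t, x, g, v, norm_coefficient, alpha, beta, mode="standard"):
    # Fold the gradient of 0.5 * norm_coefficient * ||X||^2 into the gradient.
    g_regularized = norm_coefficient * x + g
    # In the first training iteration (T == 0), beta is treated as 1.
    beta_adjusted = beta if t > 0 else 1.0
    # New momentum: decayed previous momentum plus the scaled regularized gradient.
    v_new = alpha * v + beta_adjusted * g_regularized
    if mode == "standard":
        x_new = x - r * v_new
    else:  # "nesterov": step along the regularized gradient plus the decayed new momentum
        x_new = x - r * (g_regularized + alpha * v_new)
    return x_new, v_new
```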

#### Version

This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set.

#### Attributes

<dl>
<dt><tt>alpha</tt> : float (required)</dt>
<dd>The decay factor of momentum. It should be a scalar.</dd>
<dt><tt>beta</tt> : float (required)</dt>
<dd>The coefficient of gradient in computing new momentum. It should be a scalar.</dd>
<dt><tt>mode</tt> : string (required)</dt>
<dd>Its value should be either "nesterov" or "standard". The value "nesterov" leads to the use of Nesterov's momentum, while "standard" invokes the stochastic gradient method using standard momentum.</dd>
<dt><tt>norm_coefficient</tt> : float (required)</dt>
<dd>Coefficient of 0.5 * norm_coefficient * ||X||^2.</dd>
</dl>

#### Inputs (3 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The learning rate.</dd>
<dt><tt>T</tt> : T2</dt>
<dd>Update count of "X". It should be a scalar.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the current values of optimized tensors, then their gradient tensors, and finally their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the expected input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", momentum of "X_1", momentum of "X_2"].</dd>
</dl>

#### Outputs (1 - &#8734;)

<dl>
<dt><tt>outputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the new values of optimized tensors and then the new values of their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new momentum of "X_1", new momentum of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(int64)</dt>
<dd>Constrain input types to 64-bit integer scalars.</dd>
<dt><tt>T3</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float tensors.</dd>
</dl>

241 changes: 241 additions & 0 deletions docs/Operators.md
@@ -175,6 +175,7 @@
* <a href="#ai.onnx.training.Adagrad">ai.onnx.training.Adagrad</a>
* <a href="#ai.onnx.training.Gradient">ai.onnx.training.Gradient</a>
* <a href="#ai.onnx.training.GraphCall">ai.onnx.training.GraphCall</a>
* <a href="#ai.onnx.training.Momentum">ai.onnx.training.Momentum</a>

## ai.onnx (default)
### <a name="Abs"></a><a name="abs">**Abs**</a>
@@ -21529,3 +21530,243 @@ This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set.
</dl>


### <a name="ai.onnx.training.Momentum"></a><a name="ai.onnx.training.momentum">**ai.onnx.training.Momentum**</a>

Compute one iteration of stochastic gradient update with momentum.
This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. As you can imagine, SG with momentum requires
several parameters:

- The learning rate "R".
- The update count "T", that is, the number of training iterations conducted so far. It should
  be zero in the first training iteration.
- An L2-norm regularization coefficient "norm_coefficient".
- A decay coefficient "alpha" applied to the previous accumulated gradient (i.e., the momentum).
- A scaling coefficient "beta" applied to the current gradient.
- An attribute "mode" selecting whether standard momentum or Nesterov's momentum should be used.

For the sake of simplicity, assume that there is only one tensor (called "X") to be optimized.
Other necessary inputs are "X"'s gradient (called "G") and "X"'s momentum (called "V"). This
Momentum operator maps all these inputs to the new value of "X" (called "X_new") and its new
momentum (called "V_new").

This operator supports two different momentum algorithms. Set the attribute "mode" to
"nesterov" if Nesterov's momentum is desired. Otherwise, set the attribute "model" to
"standard" to use standard momentum. Computation details are described subsequently.

Let "+", "-", "*", and "/" are all element-wise operations with numpy-style broadcasting.

Pseudo code for SG with standard momentum:

// Add the gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared
// values of all elements in X.
G_regularized = norm_coefficient * X + G

// In the first training iteration, beta should always be 1.
beta_adjusted = T > 0 ? beta : 1

// Compute the current momentum based on previous momentum and the current gradient.
V_new = alpha * V + beta_adjusted * G_regularized

// Update X.
X_new = X - R * V_new

Pseudo code for SG with Nesterov's momentum:

// Add the gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared
// values of all elements in X.
G_regularized = norm_coefficient * X + G

// In the first training iteration, beta should always be 1.
beta_adjusted = T > 0 ? beta : 1

// Compute the current momentum based on previous momentum and the current gradient.
V_new = alpha * V + beta_adjusted * G_regularized

// Compute final update direction and then update X.
X_new = X - R * (G_regularized + alpha * V_new)

If this operator is used to optimize multiple inputs, for example "X_1" and "X_2", the same
pseudo code extends to all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (of course, their gradients and momentums should
be concatenated too), and then the pseudo code above becomes applicable; the
momentum_multiple example below shows this in practice.

#### Version

This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set.

#### Attributes

<dl>
<dt><tt>alpha</tt> : float (required)</dt>
<dd>The decay factor of momentum. It should be a scalar.</dd>
<dt><tt>beta</tt> : float (required)</dt>
<dd>The coefficient of gradient in computing new momentum. It should be a scalar.</dd>
<dt><tt>mode</tt> : string (required)</dt>
<dd>Its value should be either "nesterov" or "standard". The value "nesterov" leads to the use of Nesterov's momentum, while "standard" invokes the stochastic gradient method using standard momentum.</dd>
<dt><tt>norm_coefficient</tt> : float (required)</dt>
<dd>Coefficient of 0.5 * norm_coefficient * ||X||^2.</dd>
</dl>

#### Inputs (3 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The learning rate.</dd>
<dt><tt>T</tt> : T2</dt>
<dd>Update count of "X". It should be a scalar.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the current values of optimized tensors, then their gradient tensors, and finally their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the expected input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", momentum of "X_1", momentum of "X_2"].</dd>
</dl>

#### Outputs (1 - &#8734;)

<dl>
<dt><tt>outputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the new values of optimized tensors and then the new values of their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new momentum of "X_1", new momentum of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(int64)</dt>
<dd>Constrain input types to 64-bit integer scalars.</dd>
<dt><tt>T3</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float tensors.</dd>
</dl>


#### Examples

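The snippets below use the numpy reference helpers apply_momentum and apply_nesterov together
with the expect test utility; all three come from ONNX's backend test infrastructure and
assume numpy and onnx are imported. As a rough sketch consistent with the pseudo code above
(not necessarily the exact test-suite source), the two helpers compute:

```python
import numpy as np

# Sketches of the reference helpers used in the examples below, derived from the
# pseudo code in the operator description; the authoritative definitions live in
# ONNX's backend test suite.
def apply_momentum(r, t, x, g, v, norm_coefficient, alpha, beta):
    g_regularized = norm_coefficient * x + g           # fold in L2 regularization
    beta_adjusted = beta if t > 0 else 1.0             # beta is 1 in the first iteration
    v_new = alpha * v + beta_adjusted * g_regularized  # standard momentum
    x_new = x - r * v_new
    return x_new, v_new

def apply_nesterov(r, t, x, g, v, norm_coefficient, alpha, beta):
    g_regularized = norm_coefficient * x + g
    beta_adjusted = beta if t > 0 else 1.0
    v_new = alpha * v + beta_adjusted * g_regularized
    x_new = x - r * (g_regularized + alpha * v_new)    # Nesterov look-ahead step
    return x_new, v_new
```
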
<details>
<summary>momentum</summary>

```python
# Define operator attributes.
norm_coefficient = 0.001
alpha = 0.95
beta = 0.1

# Create operator.
node = onnx.helper.make_node(
    'Momentum',
    inputs=['R', 'T', 'X', 'G', 'V'],
    outputs=['X_new', 'V_new'],
    norm_coefficient=norm_coefficient,
    alpha=alpha,
    beta=beta,
    mode='standard',
    domain='ai.onnx.training'
)

# Define operator inputs.
r = np.array(0.1, dtype=np.float32) # scalar
t = np.array(0, dtype=np.int64) # scalar
x = np.array([1.2, 2.8], dtype=np.float32)
g = np.array([-0.94, -2.5], dtype=np.float32)
v = np.array([1.7, 3.6], dtype=np.float32)

# Compute expected outputs of Momentum.
x_new, v_new = apply_momentum(r, t, x, g, v,
                              norm_coefficient, alpha, beta)

# Check results.
expect(node, inputs=[r, t, x, g, v],
       outputs=[x_new, v_new], name='test_momentum',
       opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
```

</details>


<details>
<summary>momentum_multiple</summary>

```python
# Define operator attributes.
norm_coefficient = 0.001
alpha = 0.95
beta = 0.85

# Create operator.
node = onnx.helper.make_node(
    'Momentum',
    inputs=['R', 'T', 'X1', 'X2',
            'G1', 'G2', 'V1', 'V2'],
    outputs=['X1_new', 'X2_new',
             'V1_new', 'V2_new'],
    norm_coefficient=norm_coefficient,
    alpha=alpha,
    beta=beta,
    mode='standard',
    domain='ai.onnx.training'
)

# Define operator inputs.
r = np.array(0.1, dtype=np.float32) # scalar
t = np.array(0, dtype=np.int64) # scalar

x1 = np.array([1.0], dtype=np.float32)
g1 = np.array([-1.0], dtype=np.float32)
v1 = np.array([2.0], dtype=np.float32)

x2 = np.array([1.0, 2.0], dtype=np.float32)
g2 = np.array([-1.0, -3.0], dtype=np.float32)
v2 = np.array([4.0, 1.0], dtype=np.float32)

# Compute expected outputs of Momentum.
x1_new, v1_new = apply_momentum(r, t, x1, g1, v1,
                                norm_coefficient, alpha, beta)
x2_new, v2_new = apply_momentum(r, t, x2, g2, v2,
                                norm_coefficient, alpha, beta)

# Check results.
expect(node, inputs=[r, t, x1, x2, g1, g2, v1, v2],
       outputs=[x1_new, x2_new, v1_new, v2_new], name='test_momentum_multiple',
       opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
```

</details>


<details>
<summary>nesterov_momentum</summary>

```python
# Define operator attributes.
norm_coefficient = 0.01
alpha = 0.95
beta = 1.0

# Create operator.
node = onnx.helper.make_node(
    'Momentum',
    inputs=['R', 'T', 'X', 'G', 'V'],
    outputs=['X_new', 'V_new'],
    norm_coefficient=norm_coefficient,
    alpha=alpha,
    beta=beta,
    mode='nesterov',
    domain='ai.onnx.training'
)

# Define operator inputs.
r = np.array(0.1, dtype=np.float32) # scalar
t = np.array(0, dtype=np.int64) # scalar
x = np.array([1.2, 2.8], dtype=np.float32)
g = np.array([-0.94, -2.5], dtype=np.float32)
v = np.array([1.7, 3.6], dtype=np.float32)

# Compute expected outputs of Momentum in Nesterov mode.
x_new, v_new = apply_nesterov(r, t, x, g, v,
                              norm_coefficient, alpha, beta)

# Check results.
expect(node, inputs=[r, t, x, g, v],
       outputs=[x_new, v_new], name='test_nesterov_momentum',
       opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
```

</details>

