NestYogi is a novel hybrid optimization algorithm designed to enhance the generalization and convergence rates of deep learning models, particularly in facial biometric applications. Proposed in the paper "Novel Hybrid Optimization Techniques for Enhanced Generalization and Faster Convergence in Deep Learning Models: The NestYogi Approach to Facial Biometrics", NestYogi integrates the adaptive learning capabilities of the Yogi optimizer, anticipatory updates of Nesterov momentum, and the generalization power of Stochastic Weight Averaging (SWA).
DOI: https://doi.org/10.3390/math12182919
This repository provides implementations of the NestYogi optimizer in both TensorFlow and PyTorch, facilitating its integration into various deep learning frameworks for tasks such as face detection and recognition.
- Hybrid Optimization Strategy: Combines Yogi's adaptive learning rates, Nesterov momentum, and SWA for improved performance.
- Framework Compatibility: Implemented for both TensorFlow and PyTorch.
- Advanced Regularization: Supports L1 and L2 regularization, gradient clipping, and weight decay.
- Lookahead Mechanism: Optional Lookahead for stabilizing training.
- AMSGrad Variant: Option to use the AMSGrad variant for enhanced stability.
- Extensive Documentation: Detailed analysis and usage instructions based on the original research paper.
- © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This code implementation accompanies the article, which is distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
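As a quick orientation, here is a minimal sketch of how the PyTorch variant could be dropped into a standard training loop. The import path and module name are assumptions about this repository's layout (only the `NestYogi` class name and the constructor arguments described below come from this README); adjust the import to match the actual file.

```python
import torch
import torch.nn as nn

# Assumed import path; point this at the PyTorch implementation file in this repository.
from nestyogi_pytorch import NestYogi

model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = NestYogi(model.parameters(), lr=1e-3)

# A stand-in for any iterable of (inputs, targets) batches.
dataloader = [(torch.randn(32, 128), torch.randint(0, 10, (32,)))]

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```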
- Initialize State: Set up step counter, moment vectors, and any other required state variables.
- Compute Gradients: Optionally clip and decay gradients.
- Update Moments: Adjust first and second moments using gradients and specified hyperparameters.
- Bias Correction: Correct bias in moment estimates due to initialization.
- Parameter Update: Update parameters using the corrected moments and specified momentum type.
- Lookahead: Periodically sync parameters with slow-moving versions for stability.
The `__init__` method initializes the optimizer with several parameters, ensuring they are within valid ranges (a construction example follows the list):
- params: Parameters to optimize.
- lr: Learning rate, must be non-negative.
- betas: Coefficients for computing running averages of gradient and its square, each must be in [0, 1).
- eps: A small term to improve numerical stability, must be non-negative.
- l1_regularization_strength: Strength for L1 regularization, non-negative.
- l2_regularization_strength: Strength for L2 regularization, non-negative.
- initial_accumulator_value: Initial value for accumulators, used for maintaining running averages.
- activation: Can be 'sign' or 'tanh', used in computing v_delta.
- momentum_type: 'classical' or 'nesterov', determines the type of momentum used.
- weight_decay: Coefficient for weight decay, non-negative.
- amsgrad: Boolean, whether to use AMSGrad variant.
- clip_grad_norm: Value for gradient clipping, optional.
- lookahead: Boolean, whether to use the Lookahead mechanism.
- k and alpha: Parameters for Lookahead, k is the interval and alpha is the interpolation factor.
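For reference, a construction call that exercises most of these options might look like the following; the values shown are illustrative examples, not the implementation's defaults.

```python
optimizer = NestYogi(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-3,
    l1_regularization_strength=0.0,
    l2_regularization_strength=1e-4,
    initial_accumulator_value=1e-6,
    activation='sign',          # or 'tanh'
    momentum_type='nesterov',   # or 'classical'
    weight_decay=1e-4,
    amsgrad=True,
    clip_grad_norm=1.0,
    lookahead=True,
    k=5,                        # sync the slow weights every 5 steps
    alpha=0.5,                  # interpolation factor for the sync
)
```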
In the `step` method, the state is initialized if it is empty for a parameter. This includes the step counter, the first moment vector (`m`), the second moment vector (`v`), an optional maximum second moment vector (`max_v`), and, if Lookahead is enabled, a copy of the parameter (`slow_param`).
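A minimal sketch of what this lazy state initialization could look like inside `step` (an illustration consistent with the state table at the end of this README, not necessarily the exact code):

```python
if len(state) == 0:
    state['step'] = 0
    # Moment accumulators, seeded with initial_accumulator_value.
    state['m'] = torch.full_like(p, group['initial_accumulator_value'])
    state['v'] = torch.full_like(p, group['initial_accumulator_value'])
    if group['amsgrad']:
        state['max_v'] = torch.full_like(p, group['initial_accumulator_value'])
    if group['lookahead']:
        # Slow copy of the parameter used by the Lookahead mechanism.
        state['slow_param'] = p.detach().clone()
```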
If `clip_grad_norm` is set, gradients are clipped to this norm. If `weight_decay` is non-zero, it is applied to the gradients.
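A sketch of these two pre-processing steps in PyTorch terms (the exact calls in the repository may differ):

```python
# Clip the gradient norm of this parameter before the update, if requested.
if group['clip_grad_norm'] is not None:
    torch.nn.utils.clip_grad_norm_([p], group['clip_grad_norm'])

# Fold weight decay into the gradient (L2-style coupling), as described above.
if group['weight_decay'] != 0:
    grad = grad.add(p, alpha=group['weight_decay'])
```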
The bias correction terms are computed to adjust for initialization bias in the running averages:
bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
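The parameter update below also uses a `step_size` derived from the learning rate. The README does not spell out its form, so the Adam-style convention is assumed here: divide the learning rate by the first correction term and fold the second into the denominator (shown later).

```python
step_size = group['lr'] / bias_correction1
```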
If L1 or L2 regularization strengths are set, they are applied to the gradients:
- L1: Adds the L1 regularization term.
- L2: Adds the L2 regularization term.
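A sketch of the usual way such penalty terms are folded into the gradient (the sub-gradient of the L1 term and the gradient of the L2 term); the exact placement in this implementation is an assumption:

```python
if group['l1_regularization_strength'] > 0:
    # Sub-gradient of l1_strength * |p|
    grad = grad.add(p.sign(), alpha=group['l1_regularization_strength'])
if group['l2_regularization_strength'] > 0:
    # Gradient of (l2_strength / 2) * p^2, i.e. l2_strength * p
    grad = grad.add(p, alpha=group['l2_regularization_strength'])
```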
The first moment vector `m` is updated as:
m.mul_(beta1).add_(grad, alpha=1 - beta1)
The second moment vector `v` is updated differently depending on the activation function:
- If `activation` is 'sign':
v_delta = (grad.pow(2) - v).sign_()
- If `activation` is 'tanh':
v_delta = torch.tanh(10 * (grad.pow(2) - v))
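In both cases `v_delta` plays the role of the signed correction term in the Yogi rule v ← v + (1 − β₂) · v_delta · grad². The README does not show that step explicitly, so the sketch below is an assumption consistent with the Yogi update:

```python
# Yogi-style second-moment update: v <- v + (1 - beta2) * v_delta * grad^2
v.addcmul_(v_delta, grad.pow(2), value=1 - beta2)
```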
If `amsgrad` is enabled, the maximum of `v` and the existing `max_v` is taken to ensure the second moment estimate does not decrease:
torch.max(state['max_v'], v, out=state['max_v'])
v_hat = state['max_v']
Otherwise, `v_hat` is just `v`.
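The denominator `denom` used by both momentum variants below is not written out above; a typical Adam/Yogi-style form, consistent with the `step_size` convention sketched earlier, would be:

```python
import math

# Bias-corrected denominator: sqrt(v_hat) / sqrt(bias_correction2) + eps
denom = (v_hat.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
```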
The parameters are updated differently depending on the momentum type:
- Classical momentum:
p.addcdiv_(m, denom, value=-step_size)
- Nesterov momentum:
m_nesterov = m.mul(beta1).add(grad, alpha=1 - beta1)
p.addcdiv_(m_nesterov, denom, value=-step_size)
If `lookahead` is enabled, parameters are periodically interpolated with their slow-moving copies:
if group['lookahead'] and state['step'] % group['k'] == 0:
    slow_p = state['slow_param']
    slow_p.add_(p - slow_p, alpha=group['alpha'])
    p.copy_(slow_p)
Feature | Variable | Description |
---|---|---|
Learning Rate | `lr` | Controls the step size for updates. |
Betas | `betas` | Coefficients for computing running averages of the gradient and its square. |
Epsilon | `eps` | Small term to improve numerical stability. |
L1 Regularization Strength | `l1_regularization_strength` | Strength for L1 regularization. |
L2 Regularization Strength | `l2_regularization_strength` | Strength for L2 regularization. |
Initial Accumulator Value | `initial_accumulator_value` | Initial value for accumulators, used for maintaining running averages. |
Activation Function | `activation` | Function used for computing `v_delta` (`sign` or `tanh`). |
Momentum Type | `momentum_type` | Type of momentum used (`classical` or `nesterov`). |
Weight Decay | `weight_decay` | Coefficient for weight decay. |
AMSGrad | `amsgrad` | Whether to use the AMSGrad variant (`True` or `False`). |
Gradient Clipping Norm | `clip_grad_norm` | Value for gradient clipping. |
Lookahead Mechanism | `lookahead` | Whether to use the Lookahead mechanism (`True` or `False`). |
Lookahead Sync Interval | `k` | Interval for Lookahead synchronization. |
Lookahead Interpolation Factor | `alpha` | Interpolation factor for Lookahead. |
These variables are initialized and used internally within the `step` method:
State Variable | Description |
---|---|
`state['step']` | Step counter for the parameter. |
`state['m']` | First moment vector. |
`state['v']` | Second moment vector. |
`state['max_v']` | Maximum second moment vector (used if `amsgrad` is `True`). |
`state['slow_param']` | Copy of the parameter for Lookahead (used if `lookahead` is `True`). |