Releases: kyegomez/Sophia
e9
e8
The provided code for the Hutchinson estimator assumes that the input tensors are 1D. However, in many network architectures, the parameters can be multi-dimensional tensors. To handle this case, we need to modify the Hutchinson estimator to compute the dot product and Hessian-vector product correctly for multi-dimensional tensors.
# HessianEstimator is the abstract base class used by the decoupled optimizer (see release e4)
class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        # Random probe vector with the same shape as the gradient
        u = torch.randn_like(grad)
        # torch.sum(grad * u) gives the dot product for tensors of any shape, not just 1D
        grad_dot_u = torch.sum(grad * u)
        # Hessian-vector product: differentiate <grad, u> with respect to the parameter
        hessian_vector_product = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        # The elementwise product of u and the Hessian-vector product estimates the Hessian diagonal
        return u * hessian_vector_product
e11
Decoupled Sophia
Algorithmic Pseudocode for Decoupled Sophia
Create a new class DecoupledSophia that inherits from torch.optim.Optimizer.
Initialize the optimizer with the model, input data, and other necessary parameters.
Implement the step method:
If a closure is provided, compute the loss.
Iterate through the parameter groups and their parameters.
If the gradient is not available for a parameter, skip it.
Initialize the state for the parameter if it doesn't exist.
Update the biased first moment estimate.
Update the Hessian estimate every k steps using the chosen estimator.
Update the parameters using the decoupled update rule.
Implement the Hessian estimators as separate methods, e.g., hutchinson and gauss_newton_bartlett (a class sketch follows below).
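Below is a minimal sketch of how this pseudocode could be turned into a class skeleton. It assumes a HessianEstimator base class exposing an estimate(p, grad) method, and it mirrors the update rule of the repository's Sophia class; the names and default hyperparameters are illustrative, not the final API.

import torch

class DecoupledSophia(torch.optim.Optimizer):
    # Sophia variant that delegates Hessian estimation to a pluggable estimator object
    def __init__(self, params, hessian_estimator, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0, k=10, rho=1):
        self.hessian_estimator = hessian_estimator  # any object with an estimate(p, grad) method
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, k=k, rho=rho)
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state['step'], state['m'], state['h'] = 0, torch.zeros_like(p), torch.zeros_like(p)
                state['step'] += 1
                m, h = state['m'], state['h']
                # Exponential moving average of the gradient (biased first moment)
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                # Refresh the Hessian estimate every k steps via the injected estimator
                if state['step'] % group['k'] == 1:
                    h.mul_(beta2).add_(self.hessian_estimator.estimate(p, p.grad), alpha=1 - beta2)
                with torch.no_grad():
                    # Decoupled weight decay, then the same preconditioned update as the Sophia class
                    p.mul_(1 - group['lr'] * group['weight_decay'])
                    p.addcdiv_(m, h.add(group['eps']).clamp(max=group['rho']), value=-group['lr'])
        return loss

An instance would then be constructed as, for example, DecoupledSophia(model.parameters(), HutchinsonEstimator()), so swapping estimators never touches the optimizer code.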
e10
Here are five optimization suggestions for the Sophia class:
Use torch.einsum to compute the dot product in the hutchinson method.
Use torch.no_grad() to avoid unnecessary gradient computations during the parameter update.
Use in-place operations for updating the parameters.
Cache the result of group['eps'] and group['rho'] to avoid repeated computations.
Use a more efficient method to compute the softmax and loss in the gauss_newton_bartlett method.
Pseudocode
Modify the hutchinson method to use torch.einsum for the dot product.
Use torch.no_grad() in the step method during the parameter update.
Replace add_ with addcdiv_ for in-place operations in the step method.
Cache the result of group['eps'] and group['rho'] in the step method.
Compute the softmax and loss more efficiently in the gauss_newton_bartlett method.
PyTorch Python Code
import torch

class Sophia(torch.optim.Optimizer):
    def __init__(self, model, input_data, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, k=10, estimator="Hutchinson", rho=1):
        self.model = model
        self.input_data = input_data
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, k=k,
                        estimator=estimator, rho=rho)
        super(Sophia, self).__init__(params, defaults)
    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            # Cache hyperparameters once per group to avoid repeated dict lookups
            eps = group['eps']
            rho = group['rho']
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError("Sophia does not support sparse gradients")

                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p.data)
                    state['h'] = torch.zeros_like(p.data)

                m, h = state['m'], state['h']
                beta1, beta2 = group['betas']
                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad = grad.add(p.data, alpha=group["weight_decay"])

                # Exponential moving average of the gradient
                m.mul_(beta1).add_(grad, alpha=1 - beta1)

                # Refresh the diagonal Hessian estimate every k steps
                if state['step'] % group['k'] == 1:
                    if group['estimator'] == "Hutchinson":
                        hessian_estimate = self.hutchinson(p, grad)
                    elif group['estimator'] == "Gauss-Newton-Bartlett":
                        hessian_estimate = self.gauss_newton_bartlett(p, grad)
                    else:
                        raise ValueError("Invalid estimator choice")
                    h.mul_(beta2).add_(hessian_estimate, alpha=1 - beta2)

                with torch.no_grad():
                    # Decoupled weight decay followed by the preconditioned parameter update
                    p.data.add_(p.data, alpha=-group['lr'] * group['weight_decay'])
                    p.data.addcdiv_(m, h.add(eps).clamp(max=rho), value=-group['lr'])
        return loss
    def hutchinson(self, p, grad):
        # Random probe vector with the same shape as the (possibly multi-dimensional) gradient
        u = torch.randn_like(grad)
        # Flatten both tensors so einsum computes a plain dot product for tensors of any shape
        grad_dot_u = torch.einsum("i,i->", grad.reshape(-1), u.reshape(-1))
        # Requires the loss to have been backpropagated with create_graph=True so that
        # grad is still attached to the autograd graph through p
        hessian_vector_product = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        return u * hessian_vector_product
    def gauss_newton_bartlett(self, p, grad):
        B = len(self.input_data)
        logits = [self.model(xb) for xb in self.input_data]
        # Softmax over the class dimension; self.loss_function is assumed to be provided by the user
        y_hats = [torch.softmax(logit, dim=-1) for logit in logits]
        g_hat = torch.autograd.grad(
            sum(self.loss_function(logit, y_hat) for logit, y_hat in zip(logits, y_hats)) / B,
            p, retain_graph=True)[0]
        return B * g_hat * g_hat
This updated Sophia class incorporates the suggested optimizations, making the code more efficient and potentially faster.
e7
e6
e5
Research Analysis: Sophia Paper's Training Strategy
Architecture
Model: Autoregressive models on OpenWebText
Context length: 1024
Model type: Decoder-only Transformers
Model sizes: 125M (small), 355M (medium), and 770M (large)
Datasets
OpenWebText (Gokaslan & Cohen, 2019)
Baselines
Adam with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2017)
Lion (Chen et al., 2023)
Algorithmic Pseudocode
Initialize the model (GPT-2) with the desired number of parameters (small, medium, or large).
Load the OpenWebText dataset.
Set the context length to 1024.
Set the batch size to 480.
Use a cosine learning rate schedule with the final learning rate equal to 0.05 times the peak learning rate (a schedule sketch follows after this list).
Apply gradient clipping with a threshold of 1.0.
Use a fixed 2k steps of learning rate warm-up.
Train the model using the Sophia optimizer with the chosen Hessian estimator (Sophia-H or Sophia-G) and hyperparameters.
Train the model for 100K, 200K, or 400K steps.
Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.
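As a rough sketch of the schedule described above, the 2k-step warm-up, cosine decay to 0.05 times the peak learning rate, and gradient clipping at 1.0 could be wired up as follows. The tiny model, the peak learning rate, and AdamW as a stand-in optimizer are placeholders; the repository's Sophia optimizer would be dropped in instead.

import math
import torch

def make_lr_lambda(warmup_steps=2000, total_steps=100_000, final_ratio=0.05):
    # Linear warm-up to the peak LR, then cosine decay down to final_ratio * peak LR
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return final_ratio + (1 - final_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
    return lr_lambda

# Placeholders: the real setup would use the GPT-2 model and the Sophia optimizer instead
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # peak LR is illustrative
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=make_lr_lambda())

for step in range(10):  # skeleton of the training loop
    loss = model(torch.randn(480, 10)).sum()  # batch size 480 as in the setup above
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping threshold 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()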
Training Code with Hugging Face Transformers API
High-Level Architecture
Load the OpenWebText dataset from Hugging Face Datasets.
Preprocess the dataset:
Tokenize the text using a tokenizer.
Group the tokenized text into chunks of a specified sequence length.
Save the preprocessed dataset.
Algorithmic Pseudocode
Load the OpenWebText dataset.
Initialize the tokenizer.
Define a tokenize function that tokenizes the text and adds an end-of-sequence token.
Apply the tokenize function to the dataset using the map function.
Define a group_texts function that concatenates all texts and splits them into chunks of the specified sequence length.
Apply the group_texts function to the tokenized dataset using the map function.
Save the preprocessed dataset.
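A minimal sketch of this preprocessing pipeline with the Hugging Face datasets and transformers libraries, following the steps above; the sequence length matches the 1024-token context, while the save path is a placeholder.

from datasets import load_dataset
from transformers import AutoTokenizer

seq_len = 1024  # context length from the paper's setup
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(examples):
    # Append the end-of-sequence token to each document before tokenizing
    return tokenizer([t + tokenizer.eos_token for t in examples["text"]])

def group_texts(examples):
    # Concatenate all token ids, then split them into fixed-length chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated["input_ids"]) // seq_len) * seq_len
    return {k: [v[i:i + seq_len] for i in range(0, total_len, seq_len)]
            for k, v in concatenated.items()}

# May require trust_remote_code=True depending on the datasets version
raw = load_dataset("openwebtext", split="train")
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
grouped = tokenized.map(group_texts, batched=True)
grouped.save_to_disk("openwebtext-gpt2-1024")  # placeholder path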
Algorithmic Pseudocode
Load the OpenWebText dataset.
Preprocess the dataset:
Tokenize the text using a tokenizer.
Group the tokenized text into chunks of a specified sequence length.
Initialize the GPT-2 model and tokenizer.
Set up the training arguments.
Create the Trainer with the model, training arguments, and preprocessed dataset.
Train the model using the DecoupledSophia optimizer with the chosen Hessian estimator and hyperparameters.
Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.
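A hedged sketch of the training wiring with the Hugging Face Trainer: it assumes the DecoupledSophia and HutchinsonEstimator classes sketched in the earlier releases and the dataset produced by the preprocessing sketch above; paths and hyperparameters other than the 480 effective batch size and the 100K-step budget are illustrative.

from datasets import load_from_disk
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

dataset = load_from_disk("openwebtext-gpt2-1024")  # produced by the preprocessing sketch above
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")  # the ~125M "small" configuration

# DecoupledSophia and HutchinsonEstimator are assumed to follow the interfaces sketched earlier
optimizer = DecoupledSophia(model.parameters(), hessian_estimator=HutchinsonEstimator(), lr=1e-4)

args = TrainingArguments(
    output_dir="sophia-gpt2-small",     # placeholder
    max_steps=100_000,                  # 100K-step run from the setup above
    per_device_train_batch_size=8,      # 8 x 60 accumulation = effective batch size 480 on one device
    gradient_accumulation_steps=60,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    optimizers=(optimizer, None),  # pass the custom optimizer; a scheduler can be added here too
)
trainer.train()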
e4
To make Sophia decoupled, we can separate the Hessian estimation from the main optimizer. This will allow users to plug in different Hessian estimators without modifying the core optimizer code. Here's the research analysis, algorithmic pseudocode, and Python implementation for a decoupled Sophia optimizer.
Architectural Analysis
Create a base Hessian estimator class that defines the interface for all Hessian estimators.
Implement specific Hessian estimators (e.g., Hutchinson, Gauss-Newton-Bartlett) as subclasses of the base Hessian estimator class.
Modify the Sophia optimizer to accept a Hessian estimator object during initialization.
Update the optimizer's step method to use the provided Hessian estimator object for Hessian estimation.
Algorithm Pseudocode
Base Hessian Estimator
Define an abstract method estimate that takes the parameter θ and gradient as input and returns the Hessian estimate.
Hutchinson Estimator
Inherit from the base Hessian estimator class.
Implement the estimate method using the Hutchinson algorithm.
Gauss-Newton-Bartlett Estimator
Inherit from the base Hessian estimator class.
Implement the estimate method using the Gauss-Newton-Bartlett algorithm.
Decoupled Sophia Optimizer
Modify the Sophia optimizer to accept a Hessian estimator object during initialization.
Update the optimizer's step method to use the provided Hessian estimator object for Hessian estimation.
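A minimal sketch of this class hierarchy, assuming the estimate(p, grad) interface used elsewhere in these notes; the Gauss-Newton-Bartlett body is left as a stub because it depends on the model, data, and loss function.

from abc import ABC, abstractmethod
import torch

class HessianEstimator(ABC):
    # Common interface: every estimator returns a diagonal Hessian estimate shaped like p
    @abstractmethod
    def estimate(self, p, grad):
        raise NotImplementedError

class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        u = torch.randn_like(grad)
        # Hessian-vector product of a random probe; needs grad to be attached to the autograd graph
        hvp = torch.autograd.grad(torch.sum(grad * u), p, retain_graph=True)[0]
        return u * hvp

class GaussNewtonBartlettEstimator(HessianEstimator):
    def __init__(self, model, input_data, loss_function):
        # The GNB estimator needs access to the model, a mini-batch, and the loss
        self.model = model
        self.input_data = input_data
        self.loss_function = loss_function

    def estimate(self, p, grad):
        # Left as a stub here; one possible implementation appears in the Sophia class above
        raise NotImplementedError

# The decoupled optimizer then takes an estimator instance at construction time, e.g.
# DecoupledSophia(model.parameters(), hessian_estimator=HutchinsonEstimator())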
e3