🏷️chapter_backprop
In the previous sections, we used mini-batch
stochastic gradient descent to train our models.
When we implemented the algorithm,
we only worried about the calculations involved
in forward propagation through the model.
In other words, we implemented the calculations
required for the model to generate output
corresponding to some given input,
but when it came time to calculate the gradients of each of our parameters,
we invoked the `backward` function,
relying on the `autograd` module to figure out what to do.
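As a reminder, this is the pattern we have been relying on. The following is a minimal PyTorch sketch with a toy linear model and squared loss; the tensor names and the particular model are illustrative, not the implementation used earlier.

```python
import torch

# A tiny model: one weight matrix, squared loss on a single example.
W = torch.randn(3, 4, requires_grad=True)   # parameter tracked by autograd
x = torch.randn(4)                          # input example
y = torch.randn(3)                          # target

o = W @ x                                   # forward propagation
loss = ((o - y) ** 2).sum()                 # scalar objective

loss.backward()                             # autograd computes dloss/dW for us
print(W.grad.shape)                         # torch.Size([3, 4])
```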
The automatic calculation of gradients profoundly simplifies the implementation of deep learning algorithms. Before automatic differentiation, even small changes to complicated models would require recalculating lots of derivatives by hand. Even academic papers would too often have to allocate lots of page real estate to deriving update rules.
While we plan to continue relying on `autograd`,
and we have already come a long way
without ever discussing how these gradients
are calculated efficiently under the hood,
it's important that you know
how updates are actually calculated
if you want to go beyond a shallow understanding of deep learning.
In this section, we'll peel back the curtain on some of the details
of backward propagation (more commonly called backpropagation or backprop).
To convey some insight into both the techniques and how they are implemented,
we will rely on both mathematics and computational graphs
to describe the mechanics behind neural network computations.
To start, we will focus our exposition on
a simple multilayer perceptron with a single hidden layer
and $\ell_2$ regularization (weight decay).
Forward propagation refers to the calculation and storage
of intermediate variables (including outputs)
for a neural network,
in order from the input layer to the output layer.
In the following, we work in detail through the example of a deep network
with one hidden layer step by step.
This is a bit tedious but it will serve us well
when discussing what really goes on when we call `backward`.
For the sake of simplicity, let's assume
that the input example is $\mathbf{x} \in \mathbb{R}^d$
and that our hidden layer does not include a bias term.
Here, the intermediate variable is

$$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x},$$

where $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ is the weight parameter of the hidden layer.
After running the intermediate variable $\mathbf{z} \in \mathbb{R}^h$
through the activation function $\phi$,
we obtain our hidden activation vector of length $h$,

$$\mathbf{h} = \phi(\mathbf{z}).$$

The hidden variable $\mathbf{h}$ is also an intermediate variable.
Assuming that the parameters of the output layer possess only a weight $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$,
we obtain an output layer variable that is a vector of length $q$,

$$\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}.$$
Assuming the loss function is $l$ and the example label is $y$,
we can then calculate the loss term for a single data example,

$$L = l(\mathbf{o}, y).$$
According to the definition of $\ell_2$ regularization,
given the hyperparameter $\lambda$, the regularization term is

$$s = \frac{\lambda}{2} \left( \|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2 \right),$$

where the Frobenius norm of a matrix is equivalent
to the $L_2$ norm calculated after flattening the matrix into a vector.
Finally, the model's regularized loss on a given data example is

$$J = L + s.$$

We refer to $J$ as the objective function in the following discussion.
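The entire forward pass fits in a few lines of code. Here is a minimal NumPy sketch of the computation above, assuming (purely for illustration) a ReLU activation for $\phi$, a squared loss for $l$ with a vector-valued label, and small arbitrary dimensions; none of these choices are fixed by the text.

```python
import numpy as np

d, h, q = 4, 5, 3                       # input, hidden, and output dimensions (arbitrary)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(h, d))            # hidden-layer weights W^(1)
W2 = rng.normal(size=(q, h))            # output-layer weights W^(2)
x = rng.normal(size=d)                  # a single input example
y = rng.normal(size=q)                  # its label, treated here as a vector target
lam = 0.01                              # regularization hyperparameter lambda

z = W1 @ x                              # intermediate variable z = W^(1) x
h_act = np.maximum(z, 0)                # h = phi(z), with phi = ReLU (an assumption)
o = W2 @ h_act                          # output o = W^(2) h
L = 0.5 * np.sum((o - y) ** 2)          # loss term, here a squared loss (an assumption)
s = lam / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # l2 regularization term
J = L + s                               # objective function J = L + s
```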
Plotting computational graphs helps us visualize the dependencies of operators and variables within the calculation. The figure below contains the graph associated with the simple network described above. The lower-left corner signifies the input and the upper-right corner the output. Notice that the directions of the arrows (which illustrate data flow) are primarily rightward and upward.
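If you want to inspect the graph that `autograd` actually records, the following PyTorch sketch follows one path of the recorded graph from the output back toward a parameter and prints the operation names. The small network is only illustrative, and the exact class names printed depend on the PyTorch version.

```python
import torch

x = torch.randn(4)
W1 = torch.randn(5, 4, requires_grad=True)
W2 = torch.randn(3, 5, requires_grad=True)

o = W2 @ torch.relu(W1 @ x)                              # forward propagation
J = o.sum() + 0.01 * (W1.pow(2).sum() + W2.pow(2).sum()) # a scalar objective

# Follow one branch of the recorded graph from the output toward a leaf parameter.
node = J.grad_fn
while node is not None:
    print(type(node).__name__)                           # e.g. AddBackward0, SumBackward0, ...
    node = node.next_functions[0][0] if node.next_functions else None
```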
Backpropagation refers to the method of calculating
the gradients of the neural network's parameters.
In general, backpropagation calculates and stores
the gradients of the objective function
with respect to the intermediate variables and parameters of each layer,
traversing the network in order from the output layer to the input layer,
according to the chain rule from calculus.
Assume that we have functions $\mathsf{Y} = f(\mathsf{X})$ and $\mathsf{Z} = g(\mathsf{Y})$,
in which the input and the output $\mathsf{X}, \mathsf{Y}, \mathsf{Z}$ are tensors of arbitrary shapes.
By using the chain rule, we can compute the derivative of $\mathsf{Z}$ with respect to $\mathsf{X}$ via

$$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \text{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right).$$

Here we use the $\text{prod}$ operator to multiply its arguments
after the necessary operations, such as transposition and swapping input positions, have been carried out.
For vectors, this is straightforward: it is simply matrix-matrix multiplication.
For higher-dimensional tensors, we use the appropriate counterpart.
The operator $\text{prod}$ hides all the notational overhead.
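To see the chain rule at work numerically, here is a small PyTorch sketch in which the particular functions $f$ and $g$ are illustrative assumptions: `autograd` multiplies $\partial \mathsf{Z}/\partial \mathsf{Y}$ and $\partial \mathsf{Y}/\partial \mathsf{X}$ for us, and we can check the result against the product computed by hand.

```python
import torch

x = torch.randn(3, requires_grad=True)

y = x ** 2          # Y = f(X): elementwise square
z = y.sum()         # Z = g(Y): a scalar

z.backward()        # autograd applies the chain rule dZ/dX = prod(dZ/dY, dY/dX)

# By hand: dZ/dY = 1 (for every entry) and dY/dX = diag(2x), so dZ/dX = 2x.
print(torch.allclose(x.grad, 2 * x.detach()))   # True
```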
The parameters of the simple network with one hidden layer,
whose computational graph is shown in the figure above,
are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$.
The objective of backpropagation is to calculate the gradients
$\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$.
To accomplish this, we apply the chain rule and calculate, in turn,
the gradient of each intermediate variable and parameter.
The order of calculations is reversed relative to those performed in forward propagation,
since we need to start with the outcome of the compute graph and work our way towards the parameters.
The first step is to calculate the gradients of the objective function $J = L + s$
with respect to the loss term $L$ and the regularization term $s$:

$$\frac{\partial J}{\partial L} = 1 \quad \text{and} \quad \frac{\partial J}{\partial s} = 1.$$
Next, we compute the gradient of the objective function
with respect to the variable of the output layer, $\mathbf{o}$, according to the chain rule:

$$\frac{\partial J}{\partial \mathbf{o}} = \text{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q.$$
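For instance, if we assume (for illustration only) the squared loss $l(\mathbf{o}, \mathbf{y}) = \frac{1}{2}\|\mathbf{o} - \mathbf{y}\|^2$ with a vector-valued label $\mathbf{y}$, this gradient takes the familiar form

$$\frac{\partial J}{\partial \mathbf{o}} = \frac{\partial L}{\partial \mathbf{o}} = \mathbf{o} - \mathbf{y} \in \mathbb{R}^q.$$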
Next, we calculate the gradients of the regularization term with respect to both parameters:

$$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \quad \text{and} \quad \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}.$$
Now we are able to calculate the gradient $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$
of the model parameters closest to the output layer. Using the chain rule yields:

$$\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}.$$
To obtain the gradient with respect to $\mathbf{W}^{(1)}$,
we need to continue backpropagation along the output layer to the hidden layer.
The gradient $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ with respect to the hidden layer's outputs is given by

$$\frac{\partial J}{\partial \mathbf{h}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}.$$
Since the activation function $\phi$ applies elementwise,
calculating the gradient $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$ of the intermediate variable $\mathbf{z}$
requires the elementwise multiplication operator, which we denote by $\odot$:

$$\frac{\partial J}{\partial \mathbf{z}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'(\mathbf{z}).$$
Finally, we can obtain the gradient $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$
of the model parameters closest to the input layer. According to the chain rule, we get

$$\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}.$$
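Continuing the NumPy sketch from the forward propagation example above (same illustrative assumptions: ReLU activation and squared loss), the entire backward pass is just these equations transcribed, and a finite-difference check confirms one entry of the result.

```python
# Backpropagation by hand, reusing W1, W2, x, y, lam, z, h_act, o, J from the forward sketch.
dJ_do = o - y                                   # dJ/do for the squared loss
dJ_dW2 = np.outer(dJ_do, h_act) + lam * W2      # dJ/dW2 = (dJ/do) h^T + lambda W2
dJ_dh = W2.T @ dJ_do                            # dJ/dh = W2^T (dJ/do)
dJ_dz = dJ_dh * (z > 0)                         # dJ/dz = dJ/dh * phi'(z) elementwise (ReLU)
dJ_dW1 = np.outer(dJ_dz, x) + lam * W1          # dJ/dW1 = (dJ/dz) x^T + lambda W1

# Finite-difference check of a single entry of dJ/dW1.
eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps
z_p = W1_pert @ x
h_p = np.maximum(z_p, 0)
o_p = W2 @ h_p
J_p = 0.5 * np.sum((o_p - y) ** 2) + lam / 2 * (np.sum(W1_pert ** 2) + np.sum(W2 ** 2))
print(np.isclose((J_p - J) / eps, dJ_dW1[0, 0], atol=1e-4))   # True, up to finite-difference error
```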
When training networks, forward and backward propagation depend on each other. In particular, for forward propagation we traverse the compute graph in the direction of dependencies and compute all the variables on its path. These are then used for backpropagation, where the compute order on the graph is reversed. One of the consequences is that we need to retain the intermediate values until backpropagation is complete. This is also one of the reasons why backpropagation requires significantly more memory than plain inference: the gradients are themselves tensors, and we need to retain all the intermediate variables in order to invoke the chain rule. Another reason is that we typically train with minibatches containing more than one example, so more intermediate activations need to be stored.
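This retention of intermediate values is easy to observe in practice. In the following PyTorch sketch (the network is only illustrative), the saved activations are freed as soon as `backward` finishes, which is why a second call fails, and wrapping inference in `torch.no_grad()` avoids saving them in the first place.

```python
import torch

x = torch.randn(100, 10, requires_grad=True)
W = torch.randn(10, 50)

h = torch.relu(x @ W)    # the intermediate activations are saved for the backward pass
loss = h.sum()

loss.backward()          # consumes the saved intermediates, then frees them by default
try:
    loss.backward()      # the saved intermediates are gone, so this raises an error
except RuntimeError as err:
    print("second backward failed:", type(err).__name__)

with torch.no_grad():    # pure inference: no graph is recorded, no activations are kept
    h = torch.relu(x @ W)
```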
- Forward propagation sequentially calculates and stores intermediate variables within the compute graph defined by the neural network. It proceeds from input to output layer.
- Backpropagation sequentially calculates and stores the gradients of intermediate variables and parameters within the neural network in the reversed order, from the output layer to the input layer.
- When training deep learning models, forward propagation and back propagation are interdependent.
- Training requires significantly more memory than inference, since intermediate values must be retained until backpropagation is complete.
- Assume that the inputs $\mathbf{x}$ are matrices. What is the dimensionality of the gradients?
- Add a bias to the hidden layer of the model described in this chapter.
  - Draw the corresponding compute graph.
  - Derive the forward and backward propagation equations.
- Compute the memory footprint for training and inference in the model described in the current chapter.
- Assume that you want to compute second derivatives. What happens to the compute graph? Is this a good idea?
- Assume that the compute graph is too large for your GPU.
  - Can you partition it over more than one GPU?
  - What are the advantages and disadvantages over training on a smaller minibatch?