Commit
Re-rendered all figures to not have type 3 fonts for camera-ready version
duvenaud committed May 18, 2015
1 parent b3205f7 commit ab7220f
Showing 2 changed files with 6 additions and 55 deletions.
59 changes: 5 additions & 54 deletions paper/hypergrad_paper.tex
@@ -8,11 +8,11 @@
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{multirow}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
%\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage[seed=03492]{randomorder} % This will be updated with the arXiv ID numbers following the period, but not the version number.
\usepackage{arxiv}
%\usepackage[accepted]{icml2015stylefiles/icml2015}

%\usepackage{arxiv}
\usepackage[accepted]{icml2015stylefiles/icml2015}
\usepackage{textcomp}

\newcommand{\vw}{\mathbf{w}}
\newcommand{\vv}{\mathbf{v}}
@@ -154,7 +154,7 @@ \section{Hypergradients}
\label{sec:hypergradients}

Reverse-mode differentiation (RMD) has been an asset to the field of machine
learning~\citep{lecun1989backpropagation} (see the \ref{sec:appendix} for a refresher). The RMD method, known as
learning~\citep{lecun1989backpropagation} (see the appendix for a refresher). The RMD method, known as
``backpropagation'' in the deep learning community, allows the gradient of a
scalar loss with respect to its parameters to be computed in a single backward
pass.
@@ -748,57 +748,8 @@ \section*{Acknowledgments}
Thanks to Jason Rolfe for helpful feedback.
We thank Analog Devices International and Samsung Advanced Institute of Technology for their support.


\section*{Appendix: Forward vs. reverse-mode differentiation}
\label{sec:appendix}
By the chain rule, the gradient of a set of nested functions is given by the product of the individual derivatives of each function:
%
\begin{align*}
\pderiv{f_4(f_3(f_2(f_1(x))))}{x} = \pderiv{f_4}{f_3} \cdot \pderiv{f_3}{f_2} \cdot \pderiv{f_2}{f_1} \cdot \pderiv{f_1}{x}
\end{align*}
If each function has multivariate inputs and outputs, the gradients are
Jacobian matrices.

Forward- and reverse-mode differentiation differ
only in the order in which they evaluate this product.
%
Forward-mode differentiation works by multiplying gradients in the same order as
the functions are evaluated:
%
\begin{align*}
\pderiv{f_4(f_3(f_2(f_1(x))))}{x} = \pderiv{f_4}{f_3} \cdot \left( \pderiv{f_3}{f_2} \cdot \left( \pderiv{f_2}{f_1} \cdot \pderiv{f_1}{x} \right) \right)
\end{align*}
%
Reverse-mode multiplies the gradients in the opposite order, starting from the
final result:
%
\begin{align*}
\pderiv{f_4(f_3(f_2(f_1(x))))}{x} = \left( \left( \pderiv{f_4}{f_3} \cdot \pderiv{f_3}{f_2} \right) \cdot \pderiv{f_2}{f_1} \right) \cdot \pderiv{f_1}{x}
\end{align*}
%
In an optimization setting, the final result of the nested functions, $f_4$, is
a scalar, while the input $x$ and intermediate values, $f_1 - f_3$, can be
vectors. In this scenario the advantage of reverse-mode
differentiation is very clear. Let's imagine that the dimensionality of all the
intermediate vectors is $D$. In reverse mode, we start from the (scalar) output,
and multiply by the next $D \times D$ Jacobian at each step. The value we
accumulate is just a $D$-dimensional vector. In forward mode, however, we must
accumulate an entire $D \times D$ matrix at each step. But do we still have
to compute and instantiate the $D \times D$ Jacobian matrices themselves
either way? In general, yes. But in the (common) case that the vector-to-vector
functions are either elementwise operations or (reshaped) matrix multiplications, the
Jacobian matrices can actually be very sparse, and multiplication by the
Jacobian can be performed efficiently without instantiation~\cite{pearlmutter2008reverse}.
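
A minimal numerical sketch of this asymmetry (not from the paper; the tanh
layers, the norm loss, and the size $D$ are illustrative assumptions) chains
four elementwise functions, then accumulates the chain-rule product in both
orders and checks that the results agree:

import numpy as np

D = 500                                  # dimensionality of every intermediate vector
rng = np.random.default_rng(0)
x = rng.standard_normal(D)

def layer(v):
    # One elementwise function f_i and its (diagonal, hence sparse) D x D Jacobian.
    return np.tanh(v), np.diag(1.0 / np.cosh(v) ** 2)

# Forward pass through f_1 ... f_4, storing each Jacobian so the product can be
# accumulated in either order afterwards.
jacobians, v = [], x
for _ in range(4):
    v, J = layer(v)
    jacobians.append(J)
dL_dout = v / np.linalg.norm(v)          # gradient of the scalar loss L = ||f_4|| w.r.t. f_4

# Forward mode: accumulate a full D x D matrix, in the same order as evaluation.
acc = np.eye(D)
for J in jacobians:
    acc = J @ acc                        # D x D times D x D at every step
grad_forward = dL_dout @ acc

# Reverse mode: accumulate only a length-D row vector, starting from the output.
acc = dL_dout
for J in reversed(jacobians):
    acc = acc @ J                        # vector times D x D at every step
grad_reverse = acc

assert np.allclose(grad_forward, grad_reverse)

The forward-mode loop carries $O(D^2)$ state and costs $O(D^3)$ per step, while
the reverse-mode loop carries only $O(D)$ state and costs $O(D^2)$ per step
(or $O(D)$ if the diagonal structure of these particular Jacobians is exploited).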

The main drawback of reverse-mode differentiation is that intermediate values
must be maintained in memory during the forward pass. In sections
\ref{sec:reversible learning} and \ref{sec:reversible computation}, we show how
to drastically reduce the memory requirements of reverse-mode differentiation
when differentiating through the entire learning procedure.

\bibliography{references.bib}
\bibliographystyle{icml2015stylefiles/icml2015}


\end{document}

2 changes: 1 addition & 1 deletion paper/icml2015stylefiles/icml2015.sty
@@ -108,7 +108,7 @@
% change that text.
%%%%%%%%%%%%%%%%%%%%
\newcommand{\ICML@appearing}{\textit{Proceedings of the
$\mathit{31}^{st}$ International Conference on Machine Learning},
$\mathit{32}^{nd}$ International Conference on Machine Learning},
Lille, France, 2015. JMLR: W\&CP volume 37.
Copyright 2015 by the author(s).}


0 comments on commit ab7220f