This is a list of peer-reviewed representative papers on deep learning dynamics (the training/optimization dynamics of neural networks). We hope to explore the grand adventure of deep learning dynamics together with more researchers. Corrections and suggestions are welcome.
The success of deep learning is attributed to both deep network architectures and stochastic optimization. Understanding the optimization dynamics of neural networks, i.e., deep learning dynamics, is a key challenge in the theoretical foundations of deep learning and a promising way to further improve its empirical success. We regard the learning dynamics of optimization as a reductionist approach: many deep learning techniques can be analyzed and interpreted from a dynamical perspective. In the context of neural networks, learning-dynamics analysis provides new insights and theories beyond the conventional convergence analysis of stochastic optimization. A large body of related work has been published at top machine learning conferences and journals, yet a literature review of this line of research is largely missing. It is highly valuable to continuously collect and share these great works, which is exactly the purpose of this paper list. Note that the list does not cover the conventional convergence analysis of optimization or the forward dynamics of neural networks.
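To make the dynamical perspective concrete, a standard continuous-time view used in several of the listed papers (e.g., the stochastic modified equations and diffusion-theory works below) models gradient descent as a gradient flow and SGD as a stochastic differential equation. The sketch below uses our own notation (loss L, learning rate eta, gradient-noise covariance Sigma) and is only a first-order approximation; a toy simulation of this model follows the list of directions below.

```latex
% Gradient descent with a vanishing learning rate approximates the gradient flow
\frac{\mathrm{d}\theta_t}{\mathrm{d}t} = -\nabla L(\theta_t),
% while SGD is commonly modeled, to first order, by the stochastic differential equation
\mathrm{d}\theta_t = -\nabla L(\theta_t)\,\mathrm{d}t
  + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\,\mathrm{d}W_t,
% where \eta is the learning rate, \Sigma(\theta) is the covariance of the minibatch
% gradient noise, and W_t is a standard Wiener process.
```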
The paper list covers five main directions:
(1) Learning Dynamics of GD and SGD,
(2) Learning Dynamics of Momentum,
(3) Learning Dynamics of Adaptive Gradient Methods,
(4) Learning Dynamics with Training Techniques (e.g. Weight Decay, Normalization Layers, Gradient Clipping, etc.),
(5) Learning Dynamics beyond Standard Training (e.g. Self-Supervised Learning, Continual Learning, Privacy, etc.).
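As a toy illustration of direction (1), here is a minimal NumPy sketch, not taken from any of the listed papers, that compares plain gradient descent with an Euler-Maruyama simulation of the SDE model above on a one-dimensional quadratic loss; the curvature a, noise scale sigma, and all variable names are illustrative assumptions.

```python
# Minimal sketch, assuming a toy 1-D quadratic loss L(theta) = 0.5 * a * theta^2
# and a constant gradient-noise scale sigma; all names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

a = 2.0        # curvature of the toy loss
eta = 0.1      # learning rate; one SGD step corresponds to time eta in the SDE
sigma = 0.5    # assumed standard deviation of the minibatch gradient noise
steps = 1000
dt = eta       # Euler-Maruyama time step matched to one SGD step

theta_gd = 1.0   # deterministic gradient descent iterate
theta_sde = 1.0  # iterate of the SDE model of SGD

for _ in range(steps):
    # Gradient descent: discretization of the gradient flow d(theta)/dt = -grad L.
    theta_gd -= eta * a * theta_gd

    # Euler-Maruyama step of d(theta) = -grad L dt + sqrt(eta) * sigma * dW.
    z = rng.standard_normal()
    theta_sde += -a * theta_sde * dt + np.sqrt(eta) * sigma * np.sqrt(dt) * z

print(f"GD iterate after {steps} steps:  {theta_gd:.6f}")   # decays toward 0
print(f"SDE iterate after {steps} steps: {theta_sde:.6f}")  # fluctuates around 0
```

The deterministic iterate converges to the minimizer, whereas the SDE iterate keeps fluctuating around it with a stationary variance that grows with the learning rate and the noise scale, which is the kind of effect that the diffusion-based analyses listed under direction (1) quantify.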
(1) Learning Dynamics of GD and SGD

- Gradient descent only converges to minimizers. In COLT 2016. [pdf]
- Stochastic gradient descent as approximate bayesian inference. In JMLR 2017. [pdf]
- How to escape saddle points efficiently. In ICML 2017. [pdf]
- Gradient descent can take exponential time to escape saddle points. In NeurIPS 2017. [pdf]
- Gradient descent learns linear dynamical systems. In JMLR 2018. [pdf]
- A bayesian perspective on generalization and stochastic gradient descent. In ICLR 2018. [pdf]
- Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ITA 2018. [pdf]
- An alternative view: When does SGD escape local minima? In ICML 2018. [pdf]
- On the optimization of deep networks: Implicit acceleration by overparameterization. In ICML 2018. [pdf]
- Comparing Dynamics: Deep Neural Networks versus Glassy Systems. In ICML 2018. [pdf]
- How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. In NeurIPS 2018. [pdf]
- Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS 2018. [pdf]
- Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. In JMLR 2019. [pdf]
- On the diffusion approximation of nonconvex stochastic gradient descent. In AMSA 2019. [pdf]
- Gradient descent provably optimizes over-parameterized neural networks. In ICLR 2019. [pdf]
- The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In ICML 2019. [pdf]
- Gradient descent finds global minima of deep neural networks. In ICML 2019. [pdf]
- First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In NeurIPS 2019. [pdf]
- Wide neural networks of any depth evolve as linear models under gradient descent. In NeurIPS 2019. [pdf]
- On the noisy gradient descent that generalizes as sgd. In ICML 2020. [pdf]
- Stochastic gradient and Langevin processes. In ICML 2020. [pdf]
- Continuous-time Lower Bounds for Gradient-based Algorithms. In ICML 2020. [pdf]
- An empirical study of stochastic gradient descent with structured covariance noise. In AISTATS 2020. [pdf]
- Stochasticity of deterministic gradient descent: Large learning rate for multiscale objective function. In NeurIPS 2020. [pdf]
- The surprising simplicity of the early-time learning dynamics of neural networks. In NeurIPS 2020. [pdf]
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. In NeurIPS 2020. [pdf]
- A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In ICLR 2021. [pdf]
- On the origin of implicit regularization in stochastic gradient descent. In ICLR 2021. [pdf]
- Noise and fluctuation of finite learning rate stochastic gradient descent. In ICML 2021. [pdf]
- The heavy-tail phenomenon in SGD. In ICML 2021. [pdf]
- Sgd: The role of implicit regularization, batch-size and multiple-epochs. In NeurIPS 2021. [pdf]
- On the validity of modeling sgd with stochastic differential equations (sdes). In NeurIPS 2021. [pdf]
- Label noise sgd provably prefers flat global minimizers. In NeurIPS 2021. [pdf]
- Imitating deep learning dynamics via locally elastic stochastic differential equations. In NeurIPS 2021. [pdf]
- Shape matters: Understanding the implicit bias of the noise covariance. In COLT 2021. [pdf]
- Sgd with a constant large learning rate can converge to local maxima. In ICLR 2022. [pdf]
- Strength of minibatch noise in sgd. In ICLR 2022. [pdf]
- What Happens after SGD Reaches Zero Loss?--A Mathematical Framework. In ICLR 2022. [pdf]
- Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. In ICLR 2022. [pdf]
- Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect. In ICLR 2022. [pdf]
- Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. In ICML 2022. [pdf]
- Three-stage evolution and fast equilibrium for sgd with non-degenerate critical points. In ICML 2022. [pdf]
- Power-law escape rate of sgd. In ICML 2022. [pdf]
- On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems. In JMLR 2022. [pdf]
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. In NeurIPS 2022. [pdf]
- Chaotic dynamics are intrinsic to neural networks training with SGD. In NeurIPS 2022. [pdf]
- Dynamics of SGD with Stochastic Polyak Stepsizes. [pdf]
- Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis. In NeurIPS 2022. [pdf]
- Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. In NeurIPS 2022. [pdf]
- SGD with Large Step Sizes Learns Sparse Features. In ICML 2023. [pdf]
- Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width. In NeurIPS 2023. [pdf]
- On the Overlooked Structure of Stochastic Gradients. In NeurIPS 2023. [pdf]
(2) Learning Dynamics of Momentum

- A Dynamical Systems Perspective on Nesterov Acceleration. In ICML 2019. [pdf]
- Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In NeurIPS 2019. [pdf]
- Positive-negative momentum: Manipulating stochastic gradient noise to improve generalization. In ICML 2021. [pdf]
- Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives. In JMLR 2022. [pdf]
- Better SGD using Second-order Momentum. In NeurIPS 2022. [pdf]
- Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions. In NeurIPS 2022. [pdf]
(3) Learning Dynamics of Adaptive Gradient Methods

- Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML 2017. [pdf]
- Escaping saddle points with adaptive gradient methods. In ICML 2019. [pdf]
- Towards theoretically understanding why sgd generalizes better than adam in deep learning. In NeurIPS 2020. [pdf]
- Adaptive Inertia: Disentangling the effects of adaptive learning rate and momentum. In ICML 2022. [pdf]
- Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, But Sign Descent Might Be. In ICLR 2023. [pdf]
- How Does Adaptive Optimization Impact Local Neural Network Geometry?. In NeurIPS 2023. [pdf]
(4) Learning Dynamics with Training Techniques

- L2 regularization versus batch and weight normalization. In NeurIPS 2017. [pdf]
- Three mechanisms of weight decay regularization. In ICLR 2018. [pdf]
- Norm matters: efficient and accurate normalization schemes in deep networks. In NeurIPS 2018. [pdf]
- Theoretical analysis of auto rate-tuning by batch normalization. In ICLR 2019. [pdf]
- Toward understanding the importance of noise in training neural networks. In ICML 2019. [pdf]
- A quantitative analysis of the effect of batch normalization on gradient descent. In ICML 2019. [pdf]
- Why gradient clipping accelerates training: A theoretical justification for adaptivity. In ICLR 2020. [pdf]
- On the training dynamics of deep networks with L2 regularization. In NeurIPS 2020. [pdf]
- Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. In NeurIPS 2020. [pdf]
- Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network using SGD and Weight Decay. In NeurIPS 2021. [pdf]
- Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay. In NeurIPS 2022. [pdf]
- On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective. In NeurIPS 2023. [pdf]
(5) Learning Dynamics beyond Standard Training

- Variational annealing of gans: A langevin perspective. In ICML 2019. [pdf]
- Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In NeurIPS 2019. [pdf]
- On the dynamics of gradient descent for autoencoders. In AISTATS 2019. [pdf]
- Understanding the role of training regimes in continual learning. In NeurIPS 2020. [pdf]
- Layer-wise conditioning analysis in exploring the learning dynamics of dnns. In ECCV 2020. [pdf]
- Understanding self-supervised learning dynamics without contrastive pairs. In ICML 2021. [pdf]
- Differential privacy dynamics of langevin diffusion and noisy gradient descent. In NeurIPS 2021. [pdf]
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. In NeurIPS 2022. [pdf]
- Deep Active Learning by Leveraging Training Dynamics. In NeurIPS 2022. [pdf]
- Learning dynamics of deep linear networks with multiple pathways. In NeurIPS 2022. [pdf]
- Towards a Better Understanding of Representation Dynamics under TD-learning. In ICML 2023. [pdf]
- A Theoretical Analysis of the Learning Dynamics under Class Imbalance. In ICML 2023. [pdf]
- Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks. In NeurIPS 2023. [pdf]
- CORNN: Convex optimization of recurrent neural networks for rapid inference of neural dynamics. In NeurIPS 2023. [pdf]
- Mean-field Langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction. In NeurIPS 2023. [pdf]
- Loss Dynamics of Temporal Difference Reinforcement Learning. In NeurIPS 2023. [pdf]
- A Dynamical System View of Langevin-Based Non-Convex Sampling. In NeurIPS 2023. [pdf]
If you find this paper list useful for your research, you are very welcome to cite our representative works on this topic! They cover important related works and touch on fundamental issues in this line of research.
[1] ICLR 2021: SGD dynamics for flat minima selection.
@inproceedings{
xie2021diffusion,
title={A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima},
author={Zeke Xie and Issei Sato and Masashi Sugiyama},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=wXgk_iCiYGo}
}
[2] ICML 2022 (Oral): SGD and Adam dynamics for saddle-point escaping and minima selection.
@InProceedings{xie2022adaptive,
title = {Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum},
author = {Xie, Zeke and Wang, Xinrui and Zhang, Huishuai and Sato, Issei and Sugiyama, Masashi},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {24430--24459},
year = {2022},
volume = {162},
series = {Proceedings of Machine Learning Research}
}