IMPLICIT REGULARIZATION OF SGD VIA THERMOPHORESIS

Abstract

A central ingredient in the impressive predictive performance of deep neural networks is optimization via stochastic gradient descent (SGD). While some theoretical progress has been made, the effect of SGD in neural networks remains unclear, especially during the early phase of training. Here we generalize the theory of thermophoresis from statistical mechanics and show that SGD exerts an effective force that reduces the gradient variance in certain parameter subspaces. We study this effect in detail in a simple two-layer model, where the thermophoretic force acts to decrease the weight norm and the activation rate of the units. The strength of this effect is proportional to the squared learning rate and inversely proportional to the batch size, and it is strongest during the early phase of training, when the model's predictions are poor. Finally, we test our quantitative predictions with experiments on various models and datasets.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in the past decade on tasks that were out of reach prior to the era of deep learning. Yet fundamental questions remain regarding the strong performance of over-parameterized models and of optimization schemes that typically use only first-order information, such as stochastic gradient descent (SGD) and its variants. In particular, optimization via SGD is known in many cases to produce models that generalize better than those trained with full-batch optimization. To explain this, much work has focused on how SGD navigates towards so-called flat minima, which tend to generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). This has been argued via nonvacuous PAC-Bayes bounds (Dziugaite & Roy, 2017) and Bayesian evidence (Smith & Le, 2018). More recently, Wei & Schwab (2019) discuss how optimization via SGD pushes models to flatter regions within a minimal valley by decreasing the trace of the Hessian. However, these perspectives apply towards the end of training, whereas it is known that the proper treatment of hyperparameters during the early phase is vital. In particular, when training a deep network one typically starts with a large learning rate and, if possible, a small batch size. After training has progressed, the learning rate is annealed so that the model can be further trained to better fit the training set (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016a;b; You et al., 2017; Vaswani et al., 2017). Crucially, using a small learning rate during the first phase of training usually leads to poor generalization and, in practice, also results in large gradient variance (Jastrzebski et al., 2020; Faghri et al., 2020). However, limited theoretical work has been done to understand the effect of SGD during the early phase of training. Jastrzebski et al. (2020) argue for the existence of a "break-even" point on an SGD trajectory.
This point depends strongly on the hyperparameter settings. They argue that the break-even point reached with a large learning rate and small batch size tends to have a smaller leading eigenvalue of the Hessian spectrum, and that this eigenvalue sets an upper bound on the leading eigenvalue beyond this point. They also present experiments showing that SGD with a large learning rate reduces the variance of the gradient. However, their analysis focuses only on the leading eigenvalue of the Hessian spectrum and requires the strong assumption that the loss function is quadratic in the leading eigensubspace.
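The batch-size dependence underlying these arguments can be checked directly. The following minimal sketch (not from the paper; the linear-regression model and all numbers are illustrative assumptions) estimates the covariance of the minibatch gradient at a fixed parameter point and confirms that its magnitude scales as 1/B, the scaling that, combined with the squared learning rate, sets the strength of the effective force discussed above.

```python
import numpy as np

# Illustrative setup: squared loss 0.5*(x.w - y)^2 on synthetic data,
# evaluated at an arbitrary parameter point away from the optimum.
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)
w = rng.normal(size=d)

# Per-sample gradients, shape (N, d).
per_sample_grads = (X @ w - y)[:, None] * X

def minibatch_grad_var(B, trials=2000):
    """Trace of the covariance of the minibatch gradient at batch size B,
    estimated over repeated minibatches sampled with replacement."""
    grads = np.empty((trials, d))
    for t in range(trials):
        idx = rng.integers(0, N, size=B)
        grads[t] = per_sample_grads[idx].mean(axis=0)
    return grads.var(axis=0).sum()

v8, v64 = minibatch_grad_var(8), minibatch_grad_var(64)
print(f"variance ratio (B=8 vs B=64): {v8 / v64:.2f}")  # expect roughly 64/8 = 8
```

With sampling with replacement the covariance is exactly the per-sample gradient covariance divided by B, so the printed ratio fluctuates around 8; the same 1/B scaling is what makes small batches (at fixed learning rate) a stronger implicit regularizer in the picture developed in this paper.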

