IMPLICIT REGULARIZATION OF SGD VIA THERMOPHORESIS

Abstract

A central ingredient in the impressive predictive performance of deep neural networks is optimization via stochastic gradient descent (SGD). While some theoretical progress has been made, the effect of SGD in neural networks is still unclear, especially during the early phase of training. Here we generalize the theory of thermophoresis from statistical mechanics and show that there exists an effective force from SGD that pushes to reduce the gradient variance in certain parameter subspaces. We study this effect in detail in a simple two-layer model, where the thermophoretic force acts to decrease the weight norm and the activation rate of the units. The strength of this effect is proportional to the squared learning rate and inverse batch size, and it is most effective during the early phase of training, when the model's predictions are poor. Lastly, we test our quantitative predictions with experiments on various models and datasets.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in the past decade on tasks that were out of reach prior to the era of deep learning. Yet fundamental questions remain regarding the strong performance of over-parameterized models and of optimization schemes that typically involve only first-order information, such as stochastic gradient descent (SGD) and its variants. In particular, optimization via SGD is known in many cases to result in models that generalize better than those trained with full-batch optimization. To explain this, much work has focused on how SGD navigates towards so-called flat minima, which tend to generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). This has been argued via nonvacuous PAC-Bayes bounds (Dziugaite & Roy, 2017) and Bayesian evidence (Smith & Le, 2018). More recently, Wei & Schwab (2019) discuss how optimization via SGD pushes models to flatter regions within a minimal valley by decreasing the trace of the Hessian. However, these perspectives apply to models towards the end of training, whereas it is known that proper treatment of hyperparameters during the early phase is vital. In particular, when training a deep network one typically starts with a large learning rate and, if possible, a small batch size. After training has progressed, the learning rate is annealed so that the model can be further trained to better fit the training set (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016b;a; You et al., 2017; Vaswani et al., 2017). Crucially, using a small learning rate during this first phase of training usually leads to poor generalization and also results in large gradient variance in practice (Jastrzebski et al., 2020; Faghri et al., 2020). However, limited theoretical work has been done to understand the effect of SGD during the early phase of training. Jastrzebski et al. (2020) argue for the existence of a "break-even" point on an SGD trajectory.
This point depends strongly on the hyperparameter settings. They argue that the break-even point reached with a large learning rate and small batch size tends to have a smaller leading eigenvalue of the Hessian spectrum, and that this eigenvalue sets an upper bound on the leading eigenvalue beyond this point. They also present experiments showing that large-learning-rate SGD reduces the variance of the gradient. However, their analysis focuses only on the leading eigenvalue of the Hessian spectrum and requires the strong assumption that the loss function is quadratic in the leading eigensubspace. Meanwhile, Li et al. (2020) studied the simple setting of two-layer neural networks. They demonstrate that in this model, training with a large learning rate in the early phase tends to result in better generalization than training with a small learning rate. To explain this, they hypothesize a separation of features in the data: easy-to-generalize yet hard-to-fit features, and hard-to-generalize, easier-to-fit features. They argue that a model trained with a small learning rate will memorize the easy-to-generalize, hard-to-fit patterns during the first phase, and then generalize worse on the hard-to-generalize, easier-to-fit patterns, while the opposite occurs when training with a large learning rate. However, this work relies heavily on the existence of these two distinct types of features in the data and on the specific network architecture. Moreover, their analysis focuses mainly on the learning rate rather than on the effect of SGD itself. In this paper, we study the dynamics of model parameter motion during SGD training by borrowing and generalizing the theory of thermophoresis from physics. With this framework, we show that during SGD optimization, especially during the early phase of training, the activation rate of hidden nodes is reduced, as is the growth of the parameter weight norm. This effect is proportional to the squared learning rate and inverse batch size.
Thus, thermophoresis in deep learning acts as an implicit regularizer that may improve the model's ability to generalize. We first give a brief overview of the theory of thermophoresis in physics in the next section. We then generalize this theory to models beyond physics and derive the particle mass-flow dynamics microscopically, demonstrating the existence of thermophoresis and its relation to the relevant hyperparameters. Next, we focus on a simple two-layer model to study the effect of thermophoresis in detail. Notably, we find the thermophoretic force is strongest during the early phase of training. Finally, we test our theoretical predictions with a number of experiments, finding strong agreement with the theory.

2. THERMOPHORESIS IN PHYSICS

Thermophoresis, also known as the Soret effect, describes particle mass flow in response to both diffusion and a temperature gradient. The effect was first discovered in electrolyte solutions (Ludwig, 1859; Soret, 1897; Chipman, 1926). It has since been observed in other systems such as gases, colloids, biological fluids, and solids (Janek et al., 2002; Köhler & Morozov, 2016). Thermophoresis typically refers to particle diffusion in a continuum with a temperature gradient. In one method of analysis, the non-uniform steady-state density ρ is given by the "Soret equilibrium" (Eastman, 1926; Tyrell & Colledge, 1954; Wurger, 2014),

∇ρ + ρ S_T ∇T = 0 , (1)

where T is the temperature and S_T is called the Soret coefficient. In other work by de Groot & Mazur (1962), the mass flow was calculated using non-equilibrium theory. They considered two types of processes for the entropy balance: a reversible process corresponding to entropy transfer, and an irreversible process corresponding to entropy production, or dissipation. The resulting mass flow induced by diffusion and the temperature gradient was found to be

J = -D ∇ρ - ρ D_T ∇T , (2)

where D is the Einstein diffusion coefficient and D_T is defined as the thermal diffusion coefficient. Comparing with the steady state in Eq. (1) and setting the flow to zero, the Soret coefficient is simply S_T = D_T / D. The Soret coefficient can be calculated from molecular interaction potentials based on specific molecular models (Wurger, 2014).
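As a sanity check on these relations, the steady-state condition of Eq. (1) can be integrated numerically for a given temperature profile: setting the flux in Eq. (2) to zero gives dρ/dx = -ρ S_T dT/dx, and for a linear profile T(x) = T_0 + g x the density decays exponentially, ρ(x) = ρ(0) exp(-S_T g x). The following sketch verifies this numerically; the parameter values are purely illustrative, not physical:

```python
import numpy as np

# Illustrative (non-physical) parameters: Soret coefficient, temperature
# gradient g of the linear profile T(x) = T0 + g*x, and boundary density.
S_T, g, rho0 = 0.5, 2.0, 1.0

xs = np.linspace(0.0, 1.0, 100001)
dx = xs[1] - xs[0]
rho = np.empty_like(xs)
rho[0] = rho0
for i in range(len(xs) - 1):
    # Forward-Euler step of the zero-flux condition drho/dx = -rho*S_T*dT/dx,
    # with dT/dx = g for the linear profile.
    rho[i + 1] = rho[i] - rho[i] * S_T * g * dx

# Analytic Soret-equilibrium profile for comparison.
analytic = rho0 * np.exp(-S_T * g * xs)
print(np.max(np.abs(rho - analytic)))  # small discretization error
```

As expected, the density is depleted on the hot side (here S_T > 0, so ρ decreases with increasing T), matching the exponential Soret-equilibrium profile up to discretization error.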

3. THERMOPHORESIS IN GENERAL

In this section, we first study a kind of generalized random walk, with evolution equation for a particle state with coordinates q = {q_i}_{i=1,...,n} given by

q_{t+1} = q_t - η γ f(q_t, ξ) , (4)
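The variance-reducing drift described above can be illustrated with a minimal toy simulation. The setup below is our own illustrative construction, not the model analyzed in the paper: a per-sample loss ℓ(x, y; ε) = ½ c(y) x² + ε x with c(y) = exp(y), where ε plays the role of minibatch gradient noise (its standard deviation s mimics the ~1/√B scaling with batch size). The full-batch loss ½ c(y) x² is minimized at x = 0 for every y, so full-batch gradient descent never moves y; under SGD, however, the noise keeps x fluctuating, and the resulting mean force on y pushes it toward smaller c(y), i.e. toward smaller gradient variance, with a per-step drift that scales like η² s² in the quasi-stationary regime:

```python
import numpy as np

def run_sgd(eta, steps=5000, s=1.0, seed=0):
    """SGD on the toy per-sample loss l(x, y; eps) = 0.5*exp(y)*x**2 + eps*x.

    Returns the final y. Since the y-gradient 0.5*exp(y)*x**2 is always
    non-negative, y can only decrease: the noise-driven fluctuations of x
    generate a thermophoresis-like drift toward lower gradient variance.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for _ in range(steps):
        eps = rng.normal(0.0, s)   # stand-in for minibatch gradient noise
        c = np.exp(y)
        gx = c * x + eps           # per-sample gradient in x
        gy = 0.5 * c * x * x      # per-sample gradient in y; always >= 0
        x -= eta * gx
        y -= eta * gy
    return y

dy_small = run_sgd(eta=0.05)
dy_large = run_sgd(eta=0.10)
print(dy_small, dy_large)  # expect both negative; larger eta drifts further
```

Doubling the learning rate increases the drift markedly (the quasi-stationary estimate predicts a factor of four per step), consistent with the η²/B scaling stated above.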

