IMPLICIT REGULARIZATION OF SGD VIA THERMOPHORESIS

Abstract

A central ingredient in the impressive predictive performance of deep neural networks is optimization via stochastic gradient descent (SGD). While some theoretical progress has been made, the effect of SGD in neural networks is still unclear, especially during the early phase of training. Here we generalize the theory of thermophoresis from statistical mechanics and show that there exists an effective force from SGD that pushes to reduce the gradient variance in certain parameter subspaces. We study this effect in detail in a simple two-layer model, where the thermophoretic force acts to decrease the weight norm and the activation rate of the units. The strength of this effect is proportional to the squared learning rate and the inverse batch size, and it is more effective during the early phase of training, when the model's predictions are poor. Lastly, we test our quantitative predictions with experiments on various models and datasets.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in the past decade on tasks that were out of reach prior to the era of deep learning. Yet fundamental questions remain regarding the strong performance of over-parameterized models and optimization schemes that typically involve only first-order information, such as stochastic gradient descent (SGD) and its variants. In particular, optimization via SGD is known in many cases to result in models that generalize better than those trained with full-batch optimization. To explain this, much work has focused on how SGD navigates towards so-called flat minima, which tend to generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). This view is supported by nonvacuous PAC-Bayes bounds (Dziugaite & Roy, 2017) and Bayesian evidence (Smith & Le, 2018). More recently, Wei & Schwab (2019) discuss how optimization via SGD pushes models to flatter regions within a minimal valley by decreasing the trace of the Hessian. However, these perspectives apply to models towards the end of training, whereas it is known that proper treatment of hyperparameters during the early phase is vital. In particular, when training a deep network one typically starts with a large learning rate and, if possible, a small batch size. After training has progressed, the learning rate is annealed so that the model can be further trained to better fit the training set (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016b;a; You et al., 2017; Vaswani et al., 2017). Crucially, using a small learning rate during the first phase of training usually leads to poor generalization and, in practice, to large gradient variance (Jastrzebski et al., 2020; Faghri et al., 2020). However, limited theoretical work has been done to understand the effect of SGD on the early phase of training. Jastrzebski et al. (2020) argue for the existence of a "break-even" point on an SGD trajectory.
This point depends strongly on the hyperparameter settings. They argue that the break-even point reached with a large learning rate and small batch size tends to have a smaller leading eigenvalue of the Hessian spectrum, and that this eigenvalue sets an upper bound on the leading eigenvalue beyond this point. They also present experiments showing that large-learning-rate SGD reduces the variance of the gradient. However, their analysis focuses only on the leading eigenvalue of the Hessian spectrum and requires the strong assumption that the loss function restricted to the leading eigensubspace is quadratic. Meanwhile, Li et al. (2020) studied the simple setting of two-layer neural networks. They demonstrate that in this model, training with a large learning rate in the early phase tends to result in better generalization than training with a small learning rate. To explain this, they hypothesize a separation of features in the data: easy-to-generalize yet hard-to-fit features, and hard-to-generalize yet easier-to-fit features. They argue that a model trained with a small learning rate will memorize easy-to-generalize, hard-to-fit patterns during phase one, and then generalize worse on hard-to-generalize, easier-to-fit patterns, while the opposite occurs when training with a large learning rate. However, this work relies heavily on the existence of these two distinct types of features in the data and on the specific network architecture. Moreover, their analysis focuses mainly on the learning rate rather than on the effect of SGD. In this paper, we study the dynamics of model parameters during SGD training by borrowing and generalizing the theory of thermophoresis from physics. With this framework, we show that during SGD optimization, especially during the early phase of training, the activation rate of hidden nodes is reduced, as is the growth of the parameter weight norm. This effect is proportional to the squared learning rate and the inverse batch size.
Thus, thermophoresis in deep learning acts as an implicit regularization that may improve the model's ability to generalize. We first give a brief overview of the theory of thermophoresis in physics in the next section. Then we generalize this theory to models beyond physics and derive particle mass flow dynamics microscopically, demonstrating the existence of thermophoresis and its relation to relevant hyperparameters. Then we focus on a simple two-layer model to study the effect of thermophoresis in detail. Notably, we find the thermophoretic force is strongest during the early phase of training. Finally, we test our theoretical predictions with a number of experiments, finding strong agreement with the theory.

2. THERMOPHORESIS IN PHYSICS

Thermophoresis, also known as the Soret effect, describes particle mass flow in response to both diffusion and a temperature gradient. The effect was first discovered in electrolyte solutions (Ludwig, 1859; Soret, 1897; Chipman, 1926), and was later observed in other systems such as gases, colloids, biological fluids, and solids (Janek et al., 2002; Köhler & Morozov, 2016). Thermophoresis typically refers to particle diffusion in a continuum with a temperature gradient. In one method of analysis, the non-uniform steady-state density ρ is given by the "Soret equilibrium" (Eastman, 1926; Tyrell & Colledge, 1954; Wurger, 2014),
∇ρ + ρ S_T ∇T = 0 , (1)
where T is temperature and S_T is called the Soret coefficient. In other work by de Groot & Mazur (1962), the mass flow was calculated with non-equilibrium theory. They considered two types of processes for the entropy balance: a reversible process, standing for entropy transfer, and an irreversible process, corresponding to entropy production, or dissipation. The resulting mass flow induced by diffusion and the temperature gradient was found to be
J = −D ∇ρ − ρ D_T ∇T , (2)
where D is the Einstein diffusion coefficient and D_T is the thermal diffusion coefficient. Comparing with the steady state in Eq. 1 and setting the flow to zero, the Soret coefficient is simply S_T = D_T / D. The Soret coefficient can be calculated from molecular interaction potentials for specific molecular models (Wurger, 2014).
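As a quick numerical sanity check (our own illustrative sketch, not part of the cited references; the values of D and D_T and the linear temperature profile are arbitrary assumptions), one can verify that the density ρ ∝ exp(−S_T T) both satisfies the Soret-equilibrium condition of Eq. 1 and makes the flow J of Eq. 2 vanish when S_T = D_T / D:

```python
import numpy as np

# At the Soret equilibrium grad(rho) + rho * S_T * grad(T) = 0, the density
# is rho(x) ~ exp(-S_T * T(x)), and the mass flow
# J = -D grad(rho) - rho * D_T grad(T) vanishes when S_T = D_T / D.
D, D_T = 1.0, 0.3          # illustrative transport coefficients (assumed)
S_T = D_T / D              # Soret coefficient
x = np.linspace(0.0, 1.0, 2001)
T = 1.0 + 0.5 * x          # linear temperature profile (assumed for the demo)
rho = np.exp(-S_T * T)     # Soret-equilibrium density profile

grad_rho = np.gradient(rho, x)
grad_T = np.gradient(T, x)
J = -D * grad_rho - rho * D_T * grad_T
print(float(np.max(np.abs(J))))  # ~0 up to finite-difference error
```

The residual flow is zero up to the discretization error of the finite-difference gradient, confirming S_T = D_T / D.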

3. THERMOPHORESIS IN GENERAL

In this section, we first study a kind of generalized random walk whose evolution equation for a particle state with coordinates q = {q_i}_{i=1,…,n} is
q_{t+1} = q_t − η γ f(q_t, ξ) , (4)
where f is a vector function, γ and ξ are random variables, and η is a small number controlling the step size. Notice that this is a generalized inhomogeneous random walk for the particle. Before further analysis, we note that the evolution equation 4 is similar to the SGD update in machine learning, as we show in the next section. To isolate the effect of thermophoresis, we assume the random walk is unbiased, in which case
P(γ f(q, ξ) = a) = P(γ f(q, ξ) = −a) (5)
for an arbitrary vector a. Thus there is no explicit force exerted on the particle. This simplification is used to demonstrate a residual thermophoretic force in the absence of a gradient; including gradients is straightforward and corresponds to an external field that creates a bias term. We denote the probability density, which we also call the mass density, by ρ(q), and define
g_i(q) := √( ∫ γ² f_i²(q, ξ) dµ(γ, ξ) ) , (6)
so that η g_i(q) is the standard deviation of the random walk in the ith direction. From a position q, we consider a subset of coordinate indices, U ⊆ {1, …, n}, wherein
sign(f_i(q, ξ)) = sign(f_j(q, ξ)) and ∂_i g_j(q) ≥ 0 (7)
for all i, j ∈ U. We note here that indices will correspond to parameters when we study learning dynamics. The first property is necessary for our derivation; the second condition will be used at the end to conclude that each g_i decreases. To study the dynamics of the particle and its density function, we focus on the probability mass flow induced by the inhomogeneous random walk. We will show that there is always a flow from regions with larger g_i(q) to those with smaller g_i(q) for i ∈ U, which generalizes thermophoresis in physics. Since η ≪ 1, the movement of the particle will have a mean free path of η g_i(q) in the ith direction.
Therefore the random walk equation 4 becomes
q_i ← q_i − η g_i(q) ζ_i , (8)
where i = 1, …, n and ζ_i is a binary random variable with P(ζ_i = −1) = P(ζ_i = 1) = 0.5. Moreover, from Eq. 7, we also have ζ_i = ζ_j for all i, j ∈ U. Next we show that the flow projected onto the subspace U is always toward smaller g_i(q). Notice that although U can be multi-dimensional, the particle dynamics within U has a single degree of freedom, because the ζs are shared; therefore the mass flow projected onto U is also one-dimensional. For each i ∈ U, we define the average flow in this dimension as the mass that enters q_i from q_i^- minus the mass from the opposite direction q_i^+. From Eq. 8 and the assumption that η ≪ 1, only mass close to q_i will move across q_i at each step. Let the farthest mass that flows across q_i in one step start at q_i + ∆_i^+ and q_i − ∆_i^-, where ∆_i^+ and ∆_i^- are positive. ∆_i^+ and ∆_i^- are thus defined implicitly by the equations
∆_i^+ = η g_i(q + ∆^+) and ∆_i^- = η g_i(q − ∆^-) ,
respectively. Notice that if the random walk were homogeneous, we would have ∆_i^+ = ∆_i^-. In our inhomogeneous case, we have ∆_i^+ ∼ ∆_i^- ∼ η g_i(q) to leading order in η, and the next-to-leading order must be computed in order to obtain the difference between ∆_i^+ and ∆_i^-. It is straightforward to show (footnote 1) that
∆_i^+ − ∆_i^- = 2η² Σ_{j∈U} g_j(q) ∂_j g_i(q) + O(η³) . (9)
With this, we can compute the flow density J through q, defined as the mass passing through q from q + ∆^+ minus the mass from q − ∆^-, with ∆_i^+ and ∆_i^- as above for i ∈ U. We find
J = −η² √(Σ_{i∈U} g_i²(q)) Σ_{i∈U} g_i(q) ∂_i ρ(q) − η² ( Σ_{i,j∈U} g_i(q) g_j(q) ∂_j g_i(q) ) / √(Σ_{i∈U} g_i²(q)) ρ(q) + O(η³) , (10)
where the derivation can be found in Appendix A.3. This can be understood as described in Figure 1. Notice that this probability mass flow consists of two terms at order η². The first represents diffusion and the second corresponds to our goal in this section, namely the flow due to thermophoresis. By the second property of the g_i in Eq. 7, we find that the coefficient of thermophoresis (Soret coefficient), defined as
c := −η² ( Σ_{i,j∈U} g_i(q) g_j(q) ∂_j g_i(q) ) / ( 2 √(Σ_{i∈U} g_i²(q)) ) ≤ 0 , (11)
is non-positive. This means that there is an effective force exerted on a particle at position q towards the smaller-variance regime (by analogy, the colder area). The coefficient is proportional to η².
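The drift toward small g predicted by Eq. 10 can be seen directly in simulation. Below is a minimal Monte-Carlo sketch (our own illustration; the particular step-size profile g and the reflecting walls are assumptions made for the demo) of the one-dimensional case:

```python
import numpy as np

# Monte-Carlo sketch of the unbiased inhomogeneous walk q <- q - eta*g(q)*zeta
# on [0, 1] with reflecting walls.  Every step is symmetric (zeta = +/-1 with
# probability 0.5 each), yet probability mass drifts toward the region where
# the local step size g -- the analogue of temperature -- is small.
rng = np.random.default_rng(0)

def g(q):
    return 0.5 + q  # "hotter" (larger steps) on the right side of the box

eta = 0.01
q = rng.uniform(0.0, 1.0, size=4000)  # uniform start: mean position 0.5
for _ in range(20000):
    zeta = 2.0 * rng.integers(0, 2, size=q.shape) - 1.0
    q = q - eta * g(q) * zeta
    q = np.abs(q)               # reflect at the left wall
    q = 1.0 - np.abs(1.0 - q)   # reflect at the right wall
print(float(q.mean()))  # well below 0.5: mass accumulated where g is small
```

Although no step is biased, the stationary density concentrates where g is small, which is exactly the thermophoretic flow of Eq. 10.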

4.1. TWO-LAYER MODEL

To study the physics behind SGD optimization in detail, we consider the simple setting of one-hidden-layer neural networks. The network is a function f : R^M → R parameterized as
f(x; V, W, b) = V σ(Wx + b) = Σ_{i=1}^N V_i σ( Σ_{j=1}^M W_ij x_j + b_i ) . (12)
We also write f(x) for simplicity. The network has a scalar output, as is common in regression and binary classification. Here x is the network input with dimension M; W and b are the weights and biases of the first layer, with dimensions N × M and N respectively, where N is the number of nodes in the hidden layer; and σ is the ReLU activation function, σ(a) = max(0, a). The dataset is drawn i.i.d. from the data distribution, {(x, y) | (x, y) ∼ D(x, y)}. In this paper we consider two cases: either x_i ≥ 0 (footnote 2) or x_i ∼ N(0, 1) (footnote 3). Here y ∈ Y, and we denote the marginal distribution of y by D_Y. Finally, we have the loss function L : R × Y → R_+.
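Concretely, the model of Eq. 12 can be written in a few lines of NumPy (a sketch for reference, not the authors' code):

```python
import numpy as np

# The two-layer network f(x) = V sigma(Wx + b) with ReLU sigma.
# Shapes follow the paper: W is N x M, b and V are length N, output is scalar.
def relu(a):
    return np.maximum(0.0, a)

def f(x, V, W, b):
    return float(V @ relu(W @ x + b))

# Tiny hand-checkable example: relu([2, -2]) = [2, 0], so f = 1*2 + (-1)*0 = 2
V = np.array([1.0, -1.0])
W = np.array([[1.0], [-1.0]])
b = np.zeros(2)
print(f(np.array([2.0]), V, W, b))  # 2.0
```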

4.2. TRAINING

We consider optimization via SGD, where the gradient of the loss on a batch of size |B| is given by
∇L_B(V, W, b) = (1/|B|) Σ_{i=1}^{|B|} ∇_f L(f(x_i), y_i) ∇f(x_i) . (13)
In our two-layer model, we have
∇_{V_i} f(x) = σ( Σ_{j=1}^M W_ij x_j + b_i ) ,
∇_{W_ij} f(x) = V_i x_j σ′( Σ_{k=1}^M W_ik x_k + b_i ) ,
∇_{b_i} f(x) = V_i σ′( Σ_{k=1}^M W_ik x_k + b_i ) .
For an input vector x, we call hidden node i activated when σ′( Σ_{k=1}^M W_ik x_k + b_i ) = 1, or equivalently Σ_k W_ik x_k + b_i > 0. We thus define the activation rate of the network to be
σ̄ := (1/N) Σ_{i=1}^N E_x[ σ′( Σ_{k=1}^M W_ik x_k + b_i ) ] . (14)
This is an important concept to which we will return. Henceforth, we drop the index i, since the dynamical equations are invariant with respect to the node index, and write V := V_i, W_j := W_ij, and b := b_i by abuse of notation. We also denote
h_v(V, W, b) := E_x[(∇_V f(x))²] , h_w(V, W, b) := E_x[(∇_W f(x))²] , h_b(V, W, b) := E_x[(∇_b f(x))²] ,
where E_x denotes the average over the input x. We have the following property for the functions h.
Property 4.1. Given W, if V_1² ≤ V_2² and b_1 ≤ b_2, we have h_v(V_1, W, b_1) ≤ h_v(V_2, W, b_2), h_w(V_1, W, b_1) ≤ h_w(V_2, W, b_2), h_b(V_1, W, b_1) ≤ h_b(V_2, W, b_2), and σ̄(V_1, W, b_1) ≤ σ̄(V_2, W, b_2). Here, for vectors, we define a ≤ b as min(b − a) ≥ 0.
It is straightforward to see the following.
Property 4.2. In the case x_i ≥ 0, if V_1² ≤ V_2², W_1 ≤ W_2, and b_1 ≤ b_2, we have h_v(V_1, W_1, b_1) ≤ h_v(V_2, W_2, b_2), h_w(V_1, W_1, b_1) ≤ h_w(V_2, W_2, b_2), h_b(V_1, W_1, b_1) ≤ h_b(V_2, W_2, b_2), and σ̄(V_1, W_1, b_1) ≤ σ̄(V_2, W_2, b_2).
In our analysis, we focus for simplicity on binary classification tasks, where the loss is typically the binary cross-entropy
L(f, y) = −[ y ln p(f) + (1 − y) ln(1 − p(f)) ] with p(f) = 1/(1 + exp(−f)) .
We thus have
∇_f L(f, y) = p(f) − y . (15)
Substituting into Eq. 13, the mini-batch gradient becomes
∇L_B(V, W, b) = (1/|B|) Σ_{i=1}^{|B|} (p_i − y_i) ∇f(x_i) . (16)
Our results also hold straightforwardly for squared error.
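These per-parameter gradients are easy to verify numerically. The sketch below is our own code (not the authors'), written in the standard sigmoid convention p(f) = 1/(1 + e^(−f)), for which ∇_f L = p − y as in Eq. 15; it checks the analytic gradient of a single-sample loss against finite differences:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def loss(V, W, b, x, y):
    # BCE -[y log p + (1-y) log(1-p)] with p = sigmoid(f), written in the
    # numerically stable form log(1 + e^{-f}) + (1 - y) f.
    f = V @ relu(W @ x + b)
    return float(np.logaddexp(0.0, -f) + (1.0 - y) * f)

def grads(V, W, b, x, y):
    z = W @ x + b
    h = relu(z)
    act = (z > 0.0).astype(float)          # sigma'(z)
    p = 1.0 / (1.0 + np.exp(-(V @ h)))
    df = p - y                             # grad of the loss w.r.t. output f (Eq. 15)
    return df * h, np.outer(df * V * act, x), df * V * act  # dV, dW, db

rng = np.random.default_rng(1)
M, N = 5, 4
V, W, b = rng.normal(size=N), rng.normal(size=(N, M)), rng.normal(size=N)
x, y = rng.normal(size=M), 1.0
dV, dW, db = grads(V, W, b, x, y)

eps = 1e-6
e0 = np.zeros(N); e0[0] = eps
num = (loss(V + e0, W, b, x, y) - loss(V - e0, W, b, x, y)) / (2.0 * eps)
print(abs(num - dV[0]))  # tiny: analytic and numeric gradients agree
```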

5. THERMOPHORESIS IN DEEP LEARNING

In this section, we show that the parameters of the one-hidden-layer model and their dynamics approximately satisfy the criteria of the previous section, and that the biases are pushed negative and V² is suppressed during training, both effects being proportional to the squared learning rate η² and the inverse batch size 1/|B|. The gradient that drives model training is defined in Eq. 16. Because training samples are i.i.d., the variance of the gradient is
var[∇L_B(V, W, b)] = var[ (1/|B|) Σ_{i=1}^{|B|} (p_i − y_i) ∇f(x_i) ] = (1/|B|) var[ (p − y) ∇f(x) ] . (17)
The gradient has two components: p − y, corresponding to γ in equation 4, and ∇f(x), corresponding to f(q, ξ). We assume that the dataset is unbiased, in which case P(y = 0) = P(y = 1) = 0.5 and P(p − y = a) = P(p − y = −a), and that p − y and ∇f(x) are independent during the first period of training, given that the dataset is complex and cannot be fit by a linear model. It is straightforward to see that Eq. 5 is then satisfied. Next we show that V and b always lie in the set U, i.e. they satisfy the conditions of Eq. 7. First, if V_i ≥ 0, we have
∇_{V_i} f(x) = σ( Σ_{j=1}^M W_ij x_j + b_i ) ≥ 0 (18)
and
∇_{b_i} f(x) = V_i σ′( Σ_{k=1}^M W_ik x_k + b_i ) ≥ 0 . (19)
Since we also have Property 4.1, the conditions in Eq. 7 are satisfied. If V_i < 0, we consider a coordinate transform that maps V_i to Ṽ_i = −V_i; it is easy to show that Eq. 7 is again satisfied after this transform. Next we consider W. The gradient of f with respect to W_ij is the product of ∇_{b_i} f and x_j. If x_j ≥ 0 for j = 1, …, M, which is usually the case in convolutional neural networks, it is easy to show that W_ij also lies in U and that a smaller W_ij corresponds to a smaller variance, by Property 4.2. If x_j ∼ N(0, 1), on the other hand, W is excluded from U.
For the following, we only consider the case x_i ∼ N(0, 1), where
g_V(V_i, W_i, b_i) = √( (1/|B|) ∫ [ (p − y) σ( Σ_{j=1}^M W_ij x_j + b_i ) ]² dµ(x, y) ) := (1/√|B|) φ_1(W_i, b_i) ,
g_b(V_i, W_i, b_i) = √( (1/|B|) ∫ [ (p − y) V_i σ′( Σ_{j=1}^M W_ij x_j + b_i ) ]² dµ(x, y) ) := (V_i/√|B|) φ_2(W_i, b_i) , (20)
with g defined as in Eq. 6. Inserting these into Eq. 10, we find the thermophoretic flow density to be
J_t = (η²/|B|) ψ , where ψ = −( V_i φ_1 φ_2² + V_i φ_1 φ_2 ∂_b φ_1 + V_i³ φ_2² ∂_b φ_2 ) / ( 2 √(φ_1² + V_i² φ_2²) ) ρ .
This flow biases the model toward smaller b_i and smaller V_i (footnote 4), with a strength proportional to the squared learning rate η² and the inverse batch size 1/|B|. We also note that ψ can be bounded by a function multiplied by the scalar ∫ (p − y)² dµ(x, y). This scalar measures the L2 distance between the model's predictions and the sample labels, and it decreases on average during training as the predictions improve; thus thermophoresis is most effective during the early phase of training. Therefore there exists an effective force that pushes to decrease the model's activation rate, defined in Eq. 14, and to reduce the weight norm of the second layer. The strength of this force scales as
F ∝ η² / |B| . (21)
In Li & Liang (2018), Theorem 4.1 presents a linear relation between the learning rate and the number of training iterations needed to reach a target training error; combined with our results, this also sheds light on the connection between sparsity, weight norm, and generalization. Our theory can also be generalized beyond two-layer models: we have shown that there exists an effective force in deep neural networks from SGD that reduces the gradient variance, and we have derived its quantitative properties.
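The activation rate of Eq. 14 is easy to estimate by Monte Carlo, and the mechanism is transparent: with x ∼ N(0, I) each pre-activation W_i·x + b_i is Gaussian, so the negative drift of b predicted by the flow J_t directly lowers the activation rate. A sketch (our own code; the weight scale is an assumption):

```python
import numpy as np

# Monte-Carlo estimate of the activation rate in Eq. 14:
# the fraction of (unit, input) pairs with W_i . x + b_i > 0.
def activation_rate(W, b, n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, W.shape[1]))  # x ~ N(0, I)
    return float(np.mean(X @ W.T + b > 0.0))

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(100, 100))  # rows have norm ~1 (assumed scale)
print(activation_rate(W, np.zeros(100)))        # about 0.5 at b = 0
print(activation_rate(W, -0.5 * np.ones(100)))  # drops once b goes negative
```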

6. EXPERIMENTS

The essential result of the previous section is that there exists an effective force from SGD, analogous to thermophoresis, that pushes to decrease the gradient variance and, in one-hidden-layer neural networks, decreases the model's activation rate and reduces the weight norm of the second layer. The strength of the force is proportional to the squared learning rate and the inverse batch size. In this section, we present experiments to test these results; further experiments can be found in the appendix. First we consider a one-hidden-layer model with input dimension 100 and 100 hidden units. The input data x are distributed as N(0, I), where I is the identity matrix, and the label is chosen at random from {0, 1}. The batch size is set to 1 and the learning rate is varied from 0.025 to 0.1. We calculate the activation rate and the L2 norm of the vector V after each training iteration. The result for the activation rate is shown in the first row of Fig. 2. The leftmost plot shows the activation rate as a function of the true iteration count on the x-axis: the activation rate decreases during training, and the decrease is more rapid with a larger learning rate. In the middle plot we rescale the x-axis by a factor proportional to the learning rate η (footnote 5). This rescaling offsets the difference in distance traveled per iteration due to the different learning rates. Even after this rescaling, we still observe that larger learning rates decrease the activation rate faster. Finally, in the rightmost plot we rescale the x-axis by a factor proportional to the squared learning rate η². All trajectories now overlap, which matches our prediction in the previous section that the rate of decrease is proportional to η². Further experiments can be found in Appendix A.6, where we show that the evolution of the weight norm, the scaling with batch size, and other results are consistent with our theoretical predictions. We also study other models and other datasets, including CIFAR10.
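For reference, this first experiment can be reproduced in miniature with plain NumPy (a reduced sketch, not the authors' code: the sizes, iteration count, and initialization scales here are our assumptions):

```python
import numpy as np

# Miniature version of the first experiment: one hidden layer, Gaussian
# inputs, random binary labels, batch size 1, large learning rate.  We track
# the activation rate on a fixed probe set before and after training.
rng = np.random.default_rng(0)
M = N = 10
eta = 0.1
W = rng.normal(0.0, 1.0 / np.sqrt(M), size=(N, M))
b = np.zeros(N)
V = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)

X_probe = rng.normal(size=(5000, M))
def act_rate(W, b):
    return float(np.mean(X_probe @ W.T + b > 0.0))

rate_before = act_rate(W, b)
for _ in range(20000):
    x, y = rng.normal(size=M), float(rng.integers(0, 2))
    z = W @ x + b
    h = np.maximum(z, 0.0)
    act = (z > 0.0).astype(float)
    p = 1.0 / (1.0 + np.exp(-(V @ h)))
    df = p - y                      # grad of BCE w.r.t. the output (Eq. 15)
    dV, dW, db = df * h, np.outer(df * V * act, x), df * V * act
    V -= eta * dV
    W -= eta * dW
    b -= eta * db
print(rate_before, act_rate(W, b))
```

The paper's prediction is that the second number is smaller than the first, and that the size of the drop per iteration scales as η²/|B|, qualitatively matching Fig. 2 (and Fig. 3 in the appendix).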

7. CONCLUSION

In this paper we generalized the theory of thermophoresis from statistical mechanics and showed that there exists an effective thermophoretic force from SGD that pushes to reduce the gradient variance. We studied this effect in detail for a simple two-layer model, where the thermophoretic force serves to decrease the weight norm and the activation rate of the units. We found that the strength of this effect is proportional to the square of the learning rate and inversely proportional to the batch size, and that it is more effective during the early phase of training, when the model's predictions are poor. We found good agreement between our predictions and experiments on various models and datasets.

A APPENDIX

A.1 PROOF OF PROPERTY 4.1

Proof. By definition, we have
σ( Σ_{j=1}^M W_j x_j + b_1 ) ≤ σ( Σ_{j=1}^M W_j x_j + b_2 ) ,
σ′( Σ_{j=1}^M W_j x_j + b_1 ) ≤ σ′( Σ_{j=1}^M W_j x_j + b_2 ) . (22)
For h_v,
h_v(V_1, W, b_1) = E_x[ g_v²(x, V_1, W, b_1) ] = E_x[ σ²( Σ_j W_j x_j + b_1 ) ] ≤ E_x[ σ²( Σ_j W_j x_j + b_2 ) ] = h_v(V_2, W, b_2) .
Similarly, we have
h_{w_i}(V_1, W, b_1) = E_x[ V_1² x_i² σ′( Σ_k W_k x_k + b_1 ) ] ≤ E_x[ V_2² x_i² σ′( Σ_k W_k x_k + b_1 ) ] ≤ E_x[ V_2² x_i² σ′( Σ_k W_k x_k + b_2 ) ] = h_{w_i}(V_2, W, b_2) .
Clearly the inequality also holds for h_b.

A.2 DERIVATION OF EQ. 9

∆_i^+ − ∆_i^- = η g_i(q + ∆^+) − η g_i(q − ∆^-)
= η Σ_{j∈U} (∆_j^+ + ∆_j^-) ∂_j g_i(q) + O(η∆²)
= 2η² Σ_{j∈U} g_j(q) ∂_j g_i(q) + O(η³) .

A.3 DERIVATION OF EQ. 10

J = −(1/2)|∆^+| ρ(q + ∆^+) + (1/2)|∆^-| ρ(q − ∆^-)
= (1/2)|∆^+| [ ρ(q − ∆^-) − ρ(q + ∆^+) ] + (1/2)(|∆^-| − |∆^+|) ρ(q − ∆^-)
= −(1/2)|∆^+| |∆^+ + ∆^-| [ ρ(q + ∆^+) − ρ(q − ∆^-) ] / |∆^+ + ∆^-| − (1/2) (|∆^+|² − |∆^-|²) / (|∆^+| + |∆^-|) ρ(q − ∆^-)
≈ −(1/2)|∆^+| (∆^+ + ∆^-)·∇ρ(q) − (1/2) [ Σ_{i∈U} (∆_i^+ + ∆_i^-)(∆_i^+ − ∆_i^-) ] / (|∆^+| + |∆^-|) ρ(q − ∆^-)
= −η² √(Σ_{i∈U} g_i²(q)) Σ_{i∈U} g_i(q) ∂_i ρ(q) − η² ( Σ_{i,j∈U} g_i(q) g_j(q) ∂_j g_i(q) ) / √(Σ_{i∈U} g_i²(q)) ρ(q) + O(η³) .

A.4 SANITY CHECK OF GENERALIZED THEORY

If |U | = 1 and g i (q) = g i (q i ), the model will reduce to aforementioned physics model and the Soret coefficient reduces to c = η 2 2 g(q)g (q) , = [( ηg(q) 2 ) 2 ] , ≈ ∇T , where T is the effective temperature in the model. This result is consistent with thermophoresis model in physics.

A.5 SPARSITY, WEIGHT NORM AND THEIR RELATION TO GENERALIZATION

In this section, we demonstrate how sparsity is related to the Hessian norm. We first denote the model's probabilistic prediction on a C-class classification problem as
p_k^µ = exp(z_k^µ) / Σ_{l=1}^C exp(z_l^µ) , (23)
where p_k^µ is the probability assigned to label k, µ is the data index, z is the model output, and C is the total number of categories. We consider the cross-entropy loss
L(w) = −(1/B) Σ_{µ=1}^B Σ_{k=1}^C y_k^µ log p_k^µ ,
where y denotes the sample labels and p the model's probabilistic prediction, as defined above. The loss for an individual sample is L^µ = −Σ_{k=1}^C y_k^µ log p_k^µ. The gradient with respect to the model output is
(∇_z L^µ)_k = −y_k^µ + p_k^µ , (24)
and it is easy to show that the Hessian with respect to the output is
(∇²_z L^µ)_{kl} = δ_{kl} p_k^µ − p_k^µ p_l^µ . (25)

A.6 FURTHER EXPERIMENTS

The input dimension is 10 and the number of hidden nodes is also 10. x is distributed as N(0, I), where I is the identity matrix, and the label is chosen at random from {0, 1}. For the first hyperparameter setting, the batch size is 1 and the learning rate varies from 0.025 to 0.1. We calculate the activation rate and the L2 norm of the vector V after each training iteration. The results for the activation rate and the weight norm are shown in Fig. 3 and Fig. 4 respectively. Both figures contain three plots. The leftmost plots use the raw iteration count as the x-axis and show that both the activation rate and the weight norm decrease in all cases, faster with a larger learning rate. The middle plots use a rescaled iteration count, with a rescaling factor proportional to the learning rate η (footnote 5). This rescaling offsets the movement difference due to the learning-rate difference. Even after this rescaling, we still observe differences in activation rate and weight norm across learning rates.
Lastly, the rightmost plots rescale the x-axis by a factor proportional to the squared learning rate η². All trajectories now overlap, matching our prediction in the previous section that the rate of decrease is proportional to η². For the second hyperparameter setting, the learning rate is fixed at 0.05 and the batch size varies from 1 to 3. Again we calculate the activation rate and the L2 norm of the vector V after each training iteration, and we observe the same tendency as in the previous setting: both the activation rate and the weight norm decrease in all cases, with different rates in the left plots due to the batch-size discrepancy. This difference can be offset by rescaling the x-axis by a factor proportional to 1/|B|, as shown in the right plots, consistent with our theoretical prediction. Next we consider the second case discussed in the previous section. This is again a binary classification task with BCE loss; the model, however, has the form
f(x) = V σ(Wx) . (26)
The input dimension is 10 and the number of hidden nodes is also 10. x is distributed uniformly as U(0, 1) and the label is chosen at random from {0, 1}. The first setting is the same as the first setting of the previous experiment. The results for the activation rate and the weight norm are shown in Fig. 7 and Fig. 8 respectively. As before, both figures contain three plots: the leftmost, middle, and rightmost plots use the raw iteration count, the iteration count rescaled by a factor proportional to η, and the iteration count rescaled by a factor proportional to η², respectively, as the x-axis. All trajectories overlap in the rightmost plots, matching our prediction that the rate of decrease is proportional to η².
The second setting likewise fixes the learning rate at 0.05 and varies the batch size from 1 to 3. The results for the activation rate and the weight norm are shown in Fig. 9 and Fig. 10 respectively: both quantities decrease in all cases, with different rates in the left plots due to the batch-size discrepancy. This difference can be offset by rescaling the x-axis by a factor proportional to 1/|B|, as shown in the right plots, consistent with our theoretical prediction. Furthermore, we use the real image dataset CIFAR10 for the next experiment instead of artificial data. The model in this experiment has one hidden layer with 300 hidden nodes. We first fix the batch size at 1000 and vary the learning rate η; the resulting decreases in activation rate and weight norm are shown in Fig. 11 and Fig. 12 respectively. We then fix the learning rate at 0.02 and vary the batch size; the results are shown in Fig. 13 and Fig. 14 respectively. The results match our theoretical predictions; we omit a detailed analysis, as it parallels the discussion of the previous experiments.
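The axis rescaling used in the rightmost plots can be written as a one-line helper (our own reconstruction of the rule stated in the text and footnote 5, with η_ref = 0.05 and |B|_ref = 1 assumed as the reference setting):

```python
# Map a raw iteration count t run at (eta, batch) onto the reference axis,
# using the predicted thermophoretic strength F ~ eta^2 / |B|.
def rescale_iteration(t, eta, batch, eta_ref=0.05, batch_ref=1):
    return t * (eta / eta_ref) ** 2 * (batch_ref / batch)

# 500 raw iterations at eta = 0.1 cover the same thermophoretic drift as
# 2000 iterations at the reference eta = 0.05.
print(rescale_iteration(500, 0.1, 1))  # 2000.0
```

Curves that collapse under this map are exactly those whose decrease rate scales as η²/|B|.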



Footnotes.
1. A brief derivation can be found in Appendix A.2.
2. Usually the case in convolutional neural networks or in intermediate layers.
3. Often the case when the data are normalized.
4. Larger V_i if V_i < 0.
5. For example, if the raw iteration number for η = 0.05 is 1000, then its rescaled iteration number is also 1000; for η = 0.1, a rescaled iteration number of 1000 corresponds to a true iteration number of 500.



Figure 1: Diagram of mass flow in a generalized inhomogeneous random walk used in the derivation of the Soret coefficient.

Figure 2: All rows include rescaled x-axes as described in the main text. Top row: activation rate as a function of (rescaled) training iteration for different learning rates; the model is a two-layer fully-connected network with 100 hidden units, and the training data are drawn from a normal distribution. Middle row: average gradient variance as a function of (rescaled) training iteration for different learning rates in 6-layer fully-connected neural networks; training data are drawn from a normal distribution. Bottom row: same as the middle row, but for 6-layer convolutional neural networks trained on Fashion-MNIST.

Figure 3: Activation rate as a function of training iteration for different learning rates.

Figure 4: L2 norm of V as a function of training iteration for different learning rates.

Figure 5: Activation rate as a function of training iteration for different batch sizes.

Figure 7: Activation rate as a function of training iteration for different learning rates.

Figure 9: Activation rate as a function of training iteration for different batch sizes.

Figure 10: L2 norm of V as a function of training iteration for different batch sizes.

Figure 11: Activation rate as a function of training iteration for different learning rates. Dataset is CIFAR10.

Figure 12: L2 norm of V as a function of training iteration for different learning rates. Dataset is CIFAR10.

Theorem 4.1 of Li & Liang (2018) presents a linear relation between the learning rate and the number of training iterations needed to reach a target training error, for small learning rates. This implies that if one uses a learning rate k times larger, the model requires k times fewer optimization steps for the same training performance. Together with our results, this implies the following: for the same model and initialization, comparing two optimization schemes with η_1 ≤ η_2, each trained to a given training error, the activation rate for scheme 1 will be at least as large as that for scheme 2, i.e. σ̄_1 ≥ σ̄_2. Similarly, denoting the weight norms for schemes 1 and 2 by v_1 and v_2, we have v_1 ≥ v_2.

A.5 (CONTINUED)

The Hessian with respect to the model parameters then follows by the chain rule. To study the spectrum of the Hessian, we calculate its trace, which can be evaluated term by term via the chain rule; the quantities δ and h carry backward and forward information, respectively. Together with the preceding calculations and the definition of K, we derive an upper bound for the trace of the Hessian. Notice that the activation rate and the weight norm control the magnitudes of ‖σ′ W^n‖_F² and ‖W^n σ′‖_F². Therefore a smaller activation rate and weight norm lead to a tighter upper bound on the Hessian trace, and thus indicate a smaller matrix norm. This analysis connects sparsity with the Hessian norm, specifically with the Hessian trace.
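The logit-Hessian formula (∇²_z L^µ)_{kl} = δ_{kl} p_k − p_k p_l from A.5 is easy to verify numerically (our own sketch; the example logits are arbitrary):

```python
import numpy as np

# Check the softmax cross-entropy Hessian w.r.t. the logits z,
# (nabla_z^2 L)_kl = delta_kl p_k - p_k p_l, against finite differences
# of the per-sample loss L = -sum_k y_k log p_k.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hess_analytic(z):
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.2, -1.0, 0.7])
y = np.array([0.0, 1.0, 0.0])   # one-hot label
L = lambda u: -float(y @ np.log(softmax(u)))

eps = 1e-4
H_num = np.zeros((3, 3))
for k in range(3):
    for l in range(3):
        ek = np.zeros(3); ek[k] = eps
        el = np.zeros(3); el[l] = eps
        H_num[k, l] = (L(z + ek + el) - L(z + ek - el)
                       - L(z - ek + el) + L(z - ek - el)) / (4.0 * eps ** 2)
print(np.max(np.abs(H_num - hess_analytic(z))))  # ~0 up to O(eps^2)
```

Note that the trace of this Hessian is Σ_k p_k(1 − p_k), which is exactly the quantity that shrinks as predictions become confident, connecting the bound above to the early-phase behavior discussed in the main text.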

