A VIEW OF MINI-BATCH SGD VIA GENERATING FUNCTIONS: CONDITIONS OF CONVERGENCE, PHASE TRANSITIONS, BENEFIT FROM NEGATIVE MOMENTA

Abstract

Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta, and batch sizes. Our key idea is to consider the dynamics of the second moments of the model parameters for a special family of "Spectrally Expressible" approximations. This allows us to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibits several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, and explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta. We verify our theoretical predictions by extensive experiments with MNIST, CIFAR10 and synthetic problems, and find good quantitative agreement.

1. INTRODUCTION

We consider the classical mini-batch Stochastic Gradient Descent (SGD) algorithm (Robbins & Monro, 1951; Bottou & Bousquet, 2007) with momentum (Polyak, 1964):
$$w_{t+1} = w_t + v_{t+1}, \qquad v_{t+1} = -\alpha_t \nabla_w L_{B_t}(w_t) + \beta_t v_t.$$
Here, $L_B(w) = \frac{1}{b}\sum_{i=1}^{b} l(f(x_i, w), y_i)$ is the sampled loss of a model $y = f(x, w)$, computed using a pointwise loss $l(\widehat{y}, y)$ on a mini-batch $B = \{(x_i, y_i)\}_{i=1}^{b}$ of $b$ data points representing the target function $y = f^*(x)$. The momentum term $v_t$ accumulates information about gradients from previous iterations and is well known to significantly improve convergence both in general (Polyak, 1987) and for neural networks (Sutskever et al., 2013). Re-sampling the mini-batch $B_t$ at each SGD iteration $t$ creates a specific gradient noise, structured according to both the local geometry of the model $f(x, w)$ and the quality of the current approximation. In modern deep learning, $f(x, w)$ is usually very complex, and quantitative prediction of the SGD behavior is a challenging task that is far from complete.

Our goal is to obtain explicit expressions characterizing the average-case convergence of mini-batch SGD for the classical least-squares problem of minimizing a quadratic objective $L(w)$. This setup is directly related to modern neural networks trained with a quadratic loss function, since networks can often be well described, e.g., in the large-width limit (Jacot et al., 2018; Lee et al., 2019) or during the late stage of training (Fort et al., 2020), by their linearization w.r.t. the parameters $w$. A fundamental way to characterize least-squares problems is through their spectral distributions: the eigenvalues $\lambda_k$ of the Hessian and the coefficients $c_k$ of the expansion of the optimal solution $w^*$ over the Hessian eigenvectors. One can then estimate certain metrics of the problem through spectral expressions, i.e.,
explicit formulas that operate with the spectral distributions $\lambda_k, c_k$ but not with other details of the solution $w^*$ or the Hessian. A simple example is the standard stability condition for full-batch gradient descent (GD): $\alpha < 2/\lambda_{\max}$. Various exact or approximate spectral expressions are available for full-batch GD-based algorithms (Fischer, 1996) and ridge regression (Canatar et al., 2021; Wei et al., 2022). Here, we aim at obtaining spectral expressions and associated results (stability conditions, phase structure, loss asymptotics, ...) for the average train loss under mini-batch SGD.

An important feature of spectral distributions in deep learning problems is that they often obey macroscopic laws, quite commonly a power law with a long tail of eigenvalues converging to 0 (see Cui et al. (2021); Bahri et al. (2021); Kopitkov & Indelman (2020); Velikanov & Yarotsky (2021); Atanasov et al. (2021); Basri et al. (2020) and Figs. 1, 9). The typically simple form of macroscopic laws makes it possible to analyze spectral expressions theoretically and obtain fine-grained results.

As an illustration, consider full-batch GD for least squares regression on the MNIST dataset. Standard optimization results (Polyak, 1987) do not take fine spectral details into account and give either the non-strongly convex bound $L_{\mathrm{GD}}(w_t) = O(t^{-1})$ or the strongly convex bound $L_{\mathrm{GD}}(w_t) \le L(w_0)\left(\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}\right)^{2t}$. Both bounds are rather crude and agree poorly with the experimentally observed (Bordelon & Pehlevan, 2021; Velikanov & Yarotsky, 2022) loss trajectory, which can be approximately described as $L(w_t) \sim C t^{-\xi}$, $\xi \approx 0.25$ (cf. our Fig. 1). In contrast, fitting power laws to both the eigenvalues $\lambda_k$ and the coefficients $c_k$ and using the spectral expression $L_{\mathrm{GD}}(w_t) = \sum_k (1-\alpha\lambda_k)^{2t} \lambda_k c_k^2$ allows one to accurately predict both the exponent $\xi$ and the constant $C$. Accordingly, one purpose of the present paper is to investigate whether similar predictions can be made for mini-batch SGD under power-law spectral distributions.

Outline and main contributions. We develop a new, spectrum-based analytic approach to the study of mini-batch SGD. The results obtained within this approach, and its key steps, naturally divide into three parts:
1. We show that, in contrast to full-batch GD, loss trajectories of mini-batch SGD cannot be determined from the spectral properties of the problem alone. To overcome this difficulty, we propose a natural family of Spectrally Expressible (SE) approximations of the SGD dynamics that admit an analytic solution. We provide multiple justifications for these approximations, including theoretical scenarios where they are exact and empirical evidence of their accuracy in describing the optimization of models on MNIST and CIFAR10.

2. To characterize the SGD dynamics under the SE approximation, we derive explicit spectral expressions for the generating function of the sequence of loss values, $\mathcal{L}(z) \equiv \sum_t L(w_t) z^t$, and show that it decomposes into "signal" $V(z)$ and "noise" $U(z)$ generating functions. Analyzing $U(z)$, we derive a novel stability condition for mini-batch SGD in terms of the problem spectrum $\lambda_k$ alone. In the practically relevant case of a large momentum parameter $\beta \approx 1$, the stability condition simplifies to a restriction on the effective learning rate, $\alpha_{\mathrm{eff}} \equiv \frac{\alpha}{1-\beta} < \frac{2b}{\lambda_{\mathrm{crit}}}$, with a critical value $\lambda_{\mathrm{crit}}$ determined by the spectrum. Finally, we find the characteristic divergence time when the stability condition is violated.

3. Assuming power-law distributions for both the eigenvalues $\lambda_k \propto k^{-\nu}$ and the coefficient tail sums $S_k = \sum_{l \ge k} \lambda_l c_l^2 \propto k^{-\kappa}$, we show that SGD exhibits distinct "signal-dominated" and "noise-dominated" convergence regimes (previously known for SGD without momentum (Varre et al., 2021)), depending on the sign of $\kappa + 1 - 2\nu$. For both regimes we obtain power-law loss convergence rates and find the explicit constant in the leading term. Using these rates, we demonstrate a dynamical phase transition between the phases and find its characteristic transition time. Finally, we analyze optimal hyperparameters in both phases. In particular, we show that negative momenta can be beneficial in the "noise-dominated" phase but not in the "signal-dominated" phase.
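As a concrete reference point for the update rule analyzed above, the following is a minimal sketch (not the paper's code; the synthetic problem sizes and the hyperparameter values $\alpha$, $\beta$, $b$ are arbitrary illustrative choices) of mini-batch SGD with heavy-ball momentum on a linear least-squares problem:

```python
# Sketch of mini-batch SGD with momentum on a synthetic least-squares problem:
#   w_{t+1} = w_t + v_{t+1},  v_{t+1} = -alpha * grad L_{B_t}(w_t) + beta * v_t.
# Hypothetical setup: noiseless targets y = X w*, so the optimum attains zero loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 512, 32, 8            # samples, dimension, mini-batch size
alpha, beta, T = 0.05, 0.5, 400  # learning rate, momentum, number of steps

X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star

w = np.zeros(d)
v = np.zeros(d)
losses = []
for t in range(T):
    idx = rng.choice(n, size=b, replace=False)            # re-sample batch B_t
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / b           # grad of batch loss
    v = -alpha * grad + beta * v                          # momentum update
    w = w + v
    losses.append(0.5 * np.mean((X @ w - y) ** 2))        # full train loss L(w_t)

# The final loss is many orders of magnitude below the initial one.
print(losses[0], losses[-1])
```

In this interpolation setting the gradient noise vanishes at the optimum, so the iterates converge linearly; with non-realizable targets, the noise term studied in the paper would instead dominate the late-time behavior.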
We discuss related work in Appendix A and experimental details in Appendix F.
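The full-batch spectral expression quoted in the introduction, $L_{\mathrm{GD}}(w_t) = \sum_k (1-\alpha\lambda_k)^{2t} \lambda_k c_k^2$, can be checked numerically. The sketch below (assumed power-law exponents and a loss normalization carrying an overall factor 1/2; not the paper's code) runs GD directly in the Hessian eigenbasis and compares the trajectory with the closed form:

```python
# Verify the full-batch GD spectral expression on a diagonal quadratic
#   L(w) = 1/2 * sum_k lambda_k (w_k - c_k)^2,
# where c_k are the coefficients of the optimum w* in the Hessian eigenbasis.
import numpy as np

d, nu, alpha, T = 200, 1.5, 0.5, 50
lam = np.arange(1, d + 1) ** (-nu)      # power-law eigenvalues lambda_k = k^-nu
c = np.arange(1, d + 1) ** (-0.7)       # coefficients c_k (assumed power law)

# Closed form: starting from w_0 = 0, the error in mode k contracts as
# w_{t,k} - c_k = (1 - alpha*lambda_k)^t * (0 - c_k), hence
gd_loss = [0.5 * np.sum((1 - alpha * lam) ** (2 * t) * lam * c**2)
           for t in range(T)]

# The same trajectory computed by actually running gradient descent
w = np.zeros(d)
run_loss = []
for t in range(T):
    run_loss.append(0.5 * np.sum(lam * (w - c) ** 2))
    w -= alpha * lam * (w - c)          # gradient step in the eigenbasis

print(np.allclose(gd_loss, run_loss))   # the spectral formula is exact for GD
```

For mini-batch SGD no such exact per-trajectory formula exists, which is precisely the gap the SE approximations above are designed to fill.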



Our code: https://anonymous.4open.science/r/PowerLawOptimization-1401/


