NAG-GS: SEMI-IMPLICIT, ACCELERATED AND ROBUST STOCHASTIC OPTIMIZERS

Abstract

Classical machine learning models such as deep neural networks are usually trained with Stochastic Gradient Descent (SGD)-based algorithms. Classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the resulting method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to derive a step size (or learning rate) that is optimal in terms of convergence rate while ensuring the stability of NAG-GS; this is achieved by a careful analysis of the spectral radius of the iteration matrix and of the covariance matrix at stationarity with respect to all hyperparameters of our method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as the logistic regression model, residual network models on standard computer vision datasets, and Transformers on the GLUE benchmark.

1. INTRODUCTION

Nowadays, machine learning, and more particularly deep learning, has achieved promising results across a wide spectrum of AI application domains. To process large amounts of data, most competitive approaches rely on deep neural networks. Such models must be trained, and training usually amounts to solving a complex optimization problem. Fast methods are therefore needed to speed up the learning process and obtain well-trained models efficiently. In this paper, we introduce a new optimization framework for solving such problems.

Main contributions of our paper:

• We propose a new accelerated gradient method of Nesterov type for convex and non-convex stochastic optimization;
• We analyze the properties of the method both theoretically and experimentally;
• We show that our method is robust to the selection of hyperparameters, memory-efficient compared with AdamW, and competitive with baseline methods on various benchmarks.

Organization of our paper:

• Section 1.1 gives the theoretical background for our method.
• In Section 2, we propose an accelerated system of Stochastic Differential Equations (SDEs) and an appropriate solver that relies on a particular discretization of this SDE system. The resulting method, referred to as NAG-GS (Nesterov Accelerated Gradient with Gauss-Seidel splitting), is first discussed in terms of convergence in the simple but central case of quadratic functions. Moreover, we apply our method to a one-dimensional non-convex SDE, for which we provide strong numerical evidence of the superior acceleration achieved by the NAG-GS method compared to classical SDE solvers; see Appendix B.
• In Section 3, NAG-GS is tested on stochastic optimization problems of increasing complexity and dimension, from the logistic regression model to the training of large machine learning models such as ResNet20, ResNet50 and Transformers.

1.1. PRELIMINARIES

We start here with some general considerations in the deterministic setting for obtaining an accelerated Ordinary Differential Equation (ODE) that will be extended to the stochastic setting in Section 2.1. We consider iterative methods for solving the unconstrained minimization problem: $\min_{x \in V} f(x)$, where $V$ is a Hilbert space and $f : V \to \mathbb{R} \cup \{+\infty\}$ is a proper, closed, convex extended real-valued function. In the following, for simplicity, we shall consider the particular case $V = \mathbb{R}^n$ and assume $f$ is smooth on the entire space. We also suppose $V$ is equipped with the canonical inner product $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$ and the correspondingly induced norm $\|x\| = \sqrt{\langle x, x \rangle}$. Finally, we will consider in this section the class $\mathcal{S}^{1,1}_{L,\mu}$, which stands for the set of strongly convex functions with parameter $\mu > 0$ and Lipschitz-continuous gradients with constant $L > 0$. For this class of functions, it is well known that the global minimizer exists and is unique Nesterov (2018). One well-known approach to derive the Gradient Descent (GD) method is to discretize the so-called gradient flow: $\dot{x}(t) = -\nabla f(x(t)), \; t > 0$. (2) The simplest forward (explicit) Euler method with step size $\alpha_k > 0$ leads to the GD method: $x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k)$. In the terminology of numerical analysis, it is well known that this method is conditionally A-stable, and for $f \in \mathcal{S}^{1,1}_{L,\mu}$ with $0 < \mu \leq L < \infty$, the step size $\alpha_k = 1/L$ is allowed and yields a linear rate of convergence. Note that the highest rate of convergence is achieved for $\alpha_k = \frac{2}{\mu + L}$. In this case: $\|x_k - x^\star\|^2 \leq \left( \frac{Q_f - 1}{Q_f + 1} \right)^{2k} \|x_0 - x^\star\|^2$ with $Q_f = \frac{L}{\mu}$, usually referred to as the condition number of the function $f$ Nesterov (2018). One can also consider the backward (implicit) Euler method, $x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_{k+1})$, which is unconditionally A-stable.
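The effect of the step size choices above can be checked numerically. The sketch below runs forward-Euler GD on a small quadratic $f(x) = \frac{1}{2} x^T A x$; the diagonal matrix, dimension and step counts are illustrative choices, not from the paper. It verifies that $\alpha = \frac{2}{\mu + L}$ contracts faster than the conservative $\alpha = 1/L$.

```python
import numpy as np

# Toy strongly convex quadratic f(x) = 0.5 * x^T A x (illustrative setup).
mu, L, n = 1.0, 10.0, 5
A = np.diag(np.linspace(mu, L, n))  # eigenvalues span [mu, L], so Q_f = L/mu = 10

def gd(alpha, steps=50):
    """Forward (explicit) Euler on the gradient flow x'(t) = -A x(t)."""
    x = np.ones(n)
    for _ in range(steps):
        x = x - alpha * (A @ x)  # x_{k+1} = x_k - alpha * grad f(x_k)
    return np.linalg.norm(x)

# alpha = 2/(mu+L) equalizes the contraction factors |1 - alpha*lambda| at both
# ends of the spectrum, beating the conservative choice alpha = 1/L.
assert gd(2.0 / (mu + L)) < gd(1.0 / L)
```

With $\alpha = \frac{2}{\mu+L}$ every eigencomponent contracts at least by $\frac{Q_f - 1}{Q_f + 1}$ per step, matching the rate stated above.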
Hereunder, we summarize the methodology proposed by Luo & Chen (2021) to come up with a general family of accelerated gradient flows by focusing on the following simple problem: $\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T A x$ (4) for which the gradient flow in equation 2 reads simply as: $\dot{x}(t) = -A x(t), \; t > 0$, (5) where $A$ is an $n \times n$ symmetric positive definite matrix ensuring that $f \in \mathcal{S}^{1,1}_{L,\mu}$, where $\mu$ and $L$ respectively correspond to the minimum and maximum eigenvalues of the matrix $A$, which are real and positive by hypothesis. Instead of solving equation 5 directly, the authors of Luo & Chen (2021) turn to a general linear ODE system: $\dot{y}(t) = G y(t), \; t > 0$. (6) The main idea consists in seeking such a system 6 with some asymmetric block matrix $G$ that transforms the spectrum of $A$ from the real line to the complex plane and reduces the condition number from $\kappa(A) = \frac{L}{\mu}$ to $\kappa(G) = O\!\left(\sqrt{\frac{L}{\mu}}\right)$. Afterwards, accelerated gradient methods can be constructed from A-stable methods for solving equation 6 with a significantly larger step size, consequently improving the contraction rate from $O\!\left(\left(\frac{Q_f - 1}{Q_f + 1}\right)^{2k}\right)$ to $O\!\left(\left(\frac{\sqrt{Q_f} - 1}{\sqrt{Q_f} + 1}\right)^{2k}\right)$. Furthermore, to handle the convex case $\mu = 0$, the authors of Luo & Chen (2021) combine the transformation idea with a suitable time-scaling technique. In this paper we consider one transformation that relies on the embedding of $A$ into a $2 \times 2$ block matrix $G$ with a built-in rotation Luo & Chen (2021): $G_{NAG} = \begin{pmatrix} -I & I \\ (\mu I - A)/\gamma & -(\mu/\gamma) I \end{pmatrix}$ (7)
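The spectrum transformation performed by this embedding can be illustrated directly. The sketch below builds $G_{NAG}$ for a diagonal quadratic under the simplifying (illustrative) choice $\gamma = \mu$, which is an assumption for this demo rather than the paper's hyperparameter setting. With that choice, each eigenvalue $a$ of $A$ maps to the complex pair $-1 \pm i\sqrt{a - \mu}$ of modulus $\sqrt{a}$, so all real parts are equal and $\kappa(G_{NAG}) = \sqrt{\kappa(A)}$.

```python
import numpy as np

# Build the NAG embedding G_NAG for f(x) = 0.5 * x^T A x; gamma = mu is an
# illustrative fixed choice, not the paper's tuned hyperparameter.
n = 5
A = np.diag(np.linspace(1.0, 100.0, n))   # kappa(A) = L/mu = 100
mu = gamma = 1.0
I = np.eye(n)
G = np.block([[-I, I],
              [(mu * I - A) / gamma, -(mu / gamma) * I]])

eigs = np.linalg.eigvals(G)
# The spectrum moves from the real line to the complex plane: every eigenvalue
# has real part -1, i.e. a contraction rate independent of kappa(A).
assert np.allclose(eigs.real, -1.0, atol=1e-6)
# Moduli range from sqrt(mu) to sqrt(L): condition number sqrt(kappa(A)) = 10.
mods = np.abs(eigs)
assert np.isclose(mods.max() / mods.min(), np.sqrt(100.0))
```

This is exactly the mechanism that lets A-stable solvers for equation 6 take much larger steps than explicit Euler on the original gradient flow.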

