NAG-GS: SEMI-IMPLICIT, ACCELERATED AND ROBUST STOCHASTIC OPTIMIZERS

Abstract

Classical machine learning models such as deep neural networks are usually trained with Stochastic Gradient Descent (SGD)-based algorithms. Classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the resulting method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to derive a step size (or learning rate) that is optimal in terms of convergence rate while ensuring the stability of NAG-GS. This is achieved through a careful analysis of the spectral radius of the iteration matrix and of the covariance matrix at stationarity with respect to all hyperparameters of our method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as logistic regression, residual networks on standard computer vision datasets, and Transformers on the GLUE benchmark.

1. INTRODUCTION

Machine learning, and more particularly deep learning, has achieved promising results across a wide spectrum of AI application domains. In order to process large amounts of data, most competitive approaches rely on deep neural networks. Such models must be trained, and training usually amounts to solving a complex optimization problem. Fast optimization methods are therefore needed to speed up the learning process and obtain well-trained models efficiently. In this paper, we introduce a new optimization framework for solving such problems.

Main contributions of our paper:

• We propose a new accelerated gradient method of Nesterov type for convex and non-convex stochastic optimization;
• We analyze the properties of the method both theoretically and experimentally;
• We show that our method is robust to the choice of hyperparameters, more memory-efficient than AdamW, and competitive with baseline methods on various benchmarks.

Organization of our paper:

• Section 1.1 gives the theoretical background for our method.
• In Section 2, we propose an accelerated system of Stochastic Differential Equations (SDEs) and an appropriate solver that relies on a particular discretization of this system. The resulting method, referred to as NAG-GS (Nesterov Accelerated Gradient with Gauss-Seidel splitting), is first discussed in terms of convergence in the simple but central case of quadratic functions. Moreover, we apply our method to a one-dimensional non-convex SDE, for which we provide strong numerical evidence of the superior acceleration achieved by NAG-GS compared to classical SDE solvers; see Appendix B.
• In Section 3, NAG-GS is tested on stochastic optimization problems of increasing complexity and dimension, ranging from the logistic regression model to the training of large machine learning models such as ResNet20, ResNet50 and Transformers.
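To fix ideas before the formal development, the following is a minimal sketch of a semi-implicit Gauss-Seidel discretization on a deterministic quadratic. It uses a simplified accelerated flow x' = v - x, v' = -grad f(x) (the coefficients and the stochastic term of the paper's actual SDE system are omitted here, so this is an illustrative assumption, not the exact NAG-GS scheme): the x-update is solved implicitly in x, and the v-update then reuses the freshly updated x, which is the Gauss-Seidel splitting.

```python
import numpy as np

def semi_implicit_gs_quadratic(A, x0, v0, h=0.5, n_steps=200):
    """Gauss-Seidel semi-implicit scheme for the simplified flow
    x' = v - x, v' = -A x, with f(x) = 0.5 * x^T A x (illustrative sketch)."""
    x, v = x0.copy(), v0.copy()
    for _ in range(n_steps):
        # Implicit in x: solve x_{k+1} = x_k + h * (v_k - x_{k+1})
        x = (x + h * v) / (1.0 + h)
        # Gauss-Seidel: the gradient is evaluated at the updated iterate x_{k+1}
        v = v - h * (A @ x)
    return x

A = np.diag([1.0, 10.0])          # ill-conditioned quadratic
x0 = np.array([5.0, -3.0])
v0 = np.zeros(2)
x_star = semi_implicit_gs_quadratic(A, x0, v0)
print(np.linalg.norm(x_star))     # distance to the minimizer x* = 0
```

The implicit x-update is what allows comparatively large step sizes here: for this toy system the spectral radius of the iteration matrix stays below 1 even with h = 0.5, which is the kind of stability-versus-step-size trade-off analyzed rigorously in Section 2.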

