NAG-GS: SEMI-IMPLICIT, ACCELERATED AND ROBUST STOCHASTIC OPTIMIZERS

Abstract

Classical machine learning models such as deep neural networks are usually trained with Stochastic Gradient Descent (SGD) based algorithms. Classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the resulting method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to come up with an optimal step size (or learning rate) in terms of rate of convergence while ensuring the stability of NAG-GS. This is achieved by a careful analysis of the spectral radius of the iteration matrix and of the covariance matrix at stationarity with respect to all hyperparameters of our method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as the logistic regression model, residual network models on standard computer vision datasets, and Transformers on the GLUE benchmark.

1. INTRODUCTION

Nowadays, machine learning, and more particularly deep learning, has achieved promising results on a wide spectrum of AI application domains. In order to process large amounts of data, most competitive approaches rely on deep neural networks. Such models must be trained, and training usually amounts to solving a complex optimization problem. Fast methods are needed to speed up the learning process and obtain well-trained models efficiently. In this paper, we introduce a new optimization framework for solving such problems.

Main contributions of our paper:
• We propose a new accelerated gradient method of Nesterov type for convex and non-convex stochastic optimization;
• We analyze the properties of the method both theoretically and experimentally;
• We show that our method is robust to the selection of hyperparameters, memory-efficient compared with AdamW, and competitive with baseline methods in various benchmarks.

Organization of our paper:
• Section 1.1 gives the theoretical background for our method.
• In Section 2, we propose an accelerated system of Stochastic Differential Equations (SDEs) and an appropriate solver that relies on a particular discretization of this SDE system. The resulting method, referred to as NAG-GS (Nesterov Accelerated Gradient with Gauss-Seidel splitting), is first discussed in terms of convergence in the simple but central case of quadratic functions. Moreover, we apply our method to a 1-dimensional non-convex SDE for which we provide strong numerical evidence of the superior acceleration of NAG-GS compared to classical SDE solvers, see Appendix B.
• In Section 3, NAG-GS is tested on stochastic optimization problems of increasing complexity and dimension, starting from the logistic regression model up to the training of large machine learning models such as ResNet20, ResNet50 and Transformers.

1.1. PRELIMINARIES

We start here with some general considerations in the deterministic setting to obtain an accelerated Ordinary Differential Equation (ODE) that will be extended to the stochastic setting in Section 2.1. We consider iterative methods for solving the unconstrained minimization problem: min_{x∈V} f(x), where V is a Hilbert space, and f : V → R ∪ {+∞} is a proper, closed, convex extended real-valued function. In the following, for simplicity, we consider the particular case V = R^n and assume f smooth on the entire space. We also suppose V is equipped with the canonical inner product ⟨x, y⟩ = Σ_{i=1}^n x_i y_i and the induced norm ∥x∥ = √⟨x, x⟩. Finally, we consider in this section the class of functions S^{1,1}_{L,µ}, the set of strongly convex functions of parameter µ > 0 with Lipschitz-continuous gradients of constant L > 0. For this class of functions, it is well known that the global minimizer exists and is unique Nesterov (2018). One well-known approach to derive the Gradient Descent (GD) method is to discretize the so-called gradient flow: ẋ(t) = −∇f(x(t)), t > 0. (2) The simplest forward (explicit) Euler method with step size α_k > 0 leads to the GD method: x_{k+1} ← x_k − α_k ∇f(x_k). In the terminology of numerical analysis, this method is conditionally A-stable, and for f ∈ S^{1,1}_{L,µ} with 0 ≤ µ ≤ L ≤ ∞, the step size α_k = 1/L yields a linear rate of convergence. Note that the highest rate of convergence is achieved for α_k = 2/(µ+L). In this case: ∥x_k − x⋆∥² ≤ ((Q_f − 1)/(Q_f + 1))^{2k} ∥x_0 − x⋆∥², with Q_f = L/µ usually referred to as the condition number of function f Nesterov (2018). One can also consider the backward (implicit) Euler method: x_{k+1} ← x_k − α_k ∇f(x_{k+1}), (3) which is unconditionally A-stable.
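To make the rate above concrete, here is a minimal sketch of forward-Euler gradient descent on a quadratic with the step size α_k = 2/(µ+L); the matrix A and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch: forward-Euler discretization of the gradient flow for
# f(x) = 0.5 x^T A x with step size 2/(mu + L). A is an illustrative choice.
A = np.diag([1.0, 10.0])      # mu = 1, L = 10, so Q_f = 10
mu, L = 1.0, 10.0
alpha = 2.0 / (mu + L)        # step size giving the best contraction rate

x = np.ones(2)
for _ in range(50):
    x = x - alpha * (A @ x)   # x_{k+1} = x_k - alpha * grad f(x_k)

# Both eigenmodes contract by exactly (Q_f - 1)/(Q_f + 1) = 9/11 per step here.
rate = (L / mu - 1.0) / (L / mu + 1.0)
print(np.linalg.norm(x) <= rate**50 * np.linalg.norm(np.ones(2)) + 1e-9)  # True
```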
Hereunder, we summarize the methodology proposed by Luo & Chen (2021) to come up with a general family of accelerated gradient flows, focusing on the following simple problem: min_{x∈R^n} f(x) = ½ xᵀAx (4), for which the gradient flow in equation 2 reads simply as: ẋ(t) = −Ax(t), t > 0, (5) where A is an n × n symmetric positive semi-definite matrix ensuring that f ∈ S^{1,1}_{L,µ}, where µ and L respectively correspond to the minimum and maximum eigenvalues of matrix A, which are real and positive by hypothesis. Instead of solving equation 5 directly, the authors of Luo & Chen (2021) turn to a general linear ODE system: ẏ(t) = Gy(t), t > 0. (6) The main idea consists in seeking such a system 6 with some asymmetric block matrix G that transforms the spectrum of A from the real line to the complex plane and reduces the condition number from κ(A) = L/µ to κ(G) = O(√(L/µ)). Afterwards, accelerated gradient methods can be constructed from A-stable methods for solving equation 6 with a significantly larger step size, which consequently improves the contraction rate from O(((Q_f−1)/(Q_f+1))^{2k}) to O(((√Q_f−1)/(√Q_f+1))^{2k}). Furthermore, to handle the convex case µ = 0, the authors of Luo & Chen (2021) combine the transformation idea with a suitable time scaling technique. In this paper we consider one transformation that relies on the embedding of A into a 2 × 2 block matrix G with a built-in rotation Luo & Chen (2021):

G_NAG = [ −I           I
          (µI − A)/γ   −(µ/γ)I ]   (7)

where γ is a positive time scaling factor satisfying γ̇ = µ − γ, γ(t = 0) = γ_0 > 0. (8) Note that, given A positive definite, we can easily show that for the considered transformation, R(λ) < 0 for all λ ∈ σ(G), with σ(G) denoting the spectrum of G, i.e. the set of all eigenvalues of G. Further, we will denote by ρ(G) := max_{λ∈σ(G)} |λ| the spectral radius of matrix G.
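The spectral transformation of equation 7 can be checked numerically: a small sketch (with illustrative values for A, µ and γ) building G_NAG and verifying that its spectrum lies in the left half-plane.

```python
import numpy as np

# Sketch of the embedding in equation 7: build G_NAG from A and check
# that every eigenvalue has negative real part. A, mu, gamma are
# illustrative choices, not taken from the paper.
A = np.diag([1.0, 100.0])                 # mu = 1, L = 100
mu, gamma = 1.0, 1.0
n = A.shape[0]
I = np.eye(n)
G = np.block([[-I,                    I],
              [(mu * I - A) / gamma, -(mu / gamma) * I]])

eigs = np.linalg.eigvals(G)
print(np.all(eigs.real < 0))              # R(lambda) < 0 for all eigenvalues
```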
Let us now consider the NAG block matrix; with y = (x, v), the dynamical system given in equation 6 with y(0) = y_0 ∈ R^{2n} reads:

ẋ = v − x,   v̇ = (µ/γ)(x − v) − (1/γ)Ax   (9)

with initial conditions x(0) = x_0 and v(0) = v_0. Before going further, let us remark that this linear ODE can be expressed as the following second-order ODE by eliminating v:

γ ẍ + (γ + µ) ẋ + Ax = 0,   (10)

where Ax is precisely the gradient of f w.r.t. x. Thus, one can generalize this approach to any function f ∈ S^{1,1}_{L,µ} by replacing Ax with ∇f(x) within equation 7, equation 9 and equation 10. Finally, some additional and useful insights are discussed in Appendix A.

2. MODEL AND THEORY

2.1. ACCELERATED STOCHASTIC GRADIENT FLOW

In the previous section, we presented a family of accelerated gradient flows obtained by an appropriate spectral transformation G of matrix A, see equation 9. One can observe the presence of the gradient of the smooth function f at x in the second differential equation. Let us recall that Ax can be replaced by ∇f(x) for any function f ∈ S^{1,1}_{L,µ}. In the frame of this paper, f(x) may correspond to some loss function used to train neural networks. In such a setting, we assume that the gradient input ∇f(x) is contaminated by noise due to the finite-sample estimation of the gradient. The study of accelerated gradient flows is now adapted to include and model the effect of the noise; to achieve this, we consider the dynamics given in equation 6 perturbed by a general martingale process. This leads us to the following Accelerated Stochastic Gradient (ASG) flow:

dx/dt = v − x,   dv/dt = (µ/γ)(x − v) − (1/γ)Ax + dZ/dt,   (11)

which corresponds to an (accelerated) system of SDEs, where Z(t) is a continuous Itô martingale. We assume that Z(t) has the simple expression dZ = σdW, where W = (W¹, ..., Wⁿ) is a standard n-dimensional Brownian motion. As a first and simple approach, we consider the volatility parameter σ constant. In the next section, we present the discretizations considered for the ASG flow given in equation 11.
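As a point of comparison for the semi-implicit scheme of the next section, the ASG flow of equation 11 can be simulated with a naive explicit Euler–Maruyama step; all constants below are illustrative.

```python
import numpy as np

# Explicit Euler-Maruyama simulation of the ASG flow (equation 11) for
# f(x) = 0.5 x^T A x, with dZ = sigma dW. A naive baseline sketch; the
# semi-implicit NAG-GS discretization is introduced in Section 2.2.
rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0])
mu, gamma, sigma = 1.0, 1.0, 0.05
dt, steps = 0.01, 5000

x = np.array([1.0, -1.0])
v = np.zeros(2)
for _ in range(steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=2)        # Brownian increment
    x, v = (x + dt * (v - x),
            v + dt * ((mu / gamma) * (x - v) - (A @ x) / gamma) + sigma * dW)

print(np.linalg.norm(x) < 0.5)   # x fluctuates near the minimizer 0
```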

2.2. DISCRETIZATION: GAUSS-SEIDEL SPLITTING AND SEMI-IMPLICITNESS

In this section, we present the main strategy to discretize the accelerated SDE system from equation 11. The main motivation behind the discretization method is to derive integration schemes that are, in the best case, unconditionally A-stable, or conditionally A-stable with the largest possible integration step. In the classical terminology of (discrete) optimization methods, this amounts to ensuring convergence of the obtained methods with the largest possible step size and consequently improving the contraction rate (or rate of convergence). In Section 1.1, we briefly recalled that the most well-known unconditionally A-stable scheme is the backward Euler method (see equation 3), which is an implicit method and hence can achieve a faster convergence rate. However, it requires either solving a linear system or, in the case of a general convex function, computing the root of a non-linear equation, both situations leading to a high computational cost. This is the main reason why few implicit schemes are used in practice for solving high-dimensional optimization problems. Still, it is expected that an explicit scheme closer to the implicit Euler method will have good stability with a larger step size than the one offered by the forward Euler method. Motivated by the Gauss-Seidel (GS) method for solving linear systems, we consider the matrix splitting G = M + N, with M the lower triangular part of G and N = G − M, and propose the following Gauss-Seidel splitting scheme for equation 6 perturbed with noise:

(y_{k+1} − y_k)/α_k = M y_{k+1} + N y_k + (0, σ)ᵀ (W_{k+1} − W_k)/α_k   (12)

which, for G = G_NAG (see (7)), gives the following semi-implicit scheme with step size α_k > 0:

(x_{k+1} − x_k)/α_k = v_k − x_{k+1},   (v_{k+1} − v_k)/α_k = (µ/γ_k)(x_{k+1} − v_{k+1}) − (1/γ_k)Ax_{k+1} + σ(W_{k+1} − W_k)/α_k.   (13)
Note that due to the properties of Brownian motion we can simulate its values at the selected points by W_{k+1} = W_k + ΔW_k, where the ΔW_k are independent random variables with distribution N(0, α_k). From a practical point of view, we use ΔW_k = √α_k η_k, where η_k ∼ N(0, 1). Furthermore, the ODE for the parameter γ is also discretized implicitly:

(γ_{k+1} − γ_k)/α_k = µ − γ_{k+1},   γ_0 > 0.   (14)

As already mentioned, heuristically, for general f ∈ S^{1,1}_{L,µ} with µ ≥ 0, we simply replace Ax in equation 13 with ∇f(x) and obtain the following NAG-GS scheme:

(x_{k+1} − x_k)/α_k = v_k − x_{k+1},   (v_{k+1} − v_k)/α_k = (µ/γ_k)(x_{k+1} − v_{k+1}) − (1/γ_k)∇f(x_{k+1}) + σ(W_{k+1} − W_k)/α_k.   (15)

Finally, we arrive at the NAG-GS method (see Algorithm 1), where we assume that the gradient ∇f(x_{k+1}) is computed with some unknown noise.

Algorithm 1 Nesterov Accelerated Gradient with Gauss-Seidel splitting (NAG-GS).
Input: Choose a point x_0 ∈ R^n, some µ ≥ 0, γ_0 > 0. Set v_0 := x_0.
for k = 1, 2, . . . do
  Choose step size α_k > 0.
  ▷ Update parameters and state x:
  Set a_k := α_k (α_k + 1)^{−1}.
  Set γ_{k+1} := (1 − a_k)γ_k + a_k µ.
  Set x_{k+1} := (1 − a_k)x_k + a_k v_k.
  ▷ Update state v:
  Set b_k := α_k µ (α_k µ + γ_{k+1})^{−1}.
  Set v_{k+1} := (1 − b_k)v_k + b_k x_{k+1} − µ^{−1} b_k ∇f(x_{k+1}).
end for

Moreover, the step size update can be performed with different strategies; for instance, one may choose the method proposed by Nesterov (Nesterov, 2018, Method 2.2.7), which computes α_k ∈ (0, 1) such that Lα_k² = (1 − α_k)γ_k + α_k µ. Note that for γ_0 = µ, the sequences satisfy γ_k = µ and α_k = √(µ/L) for all k ≥ 0. Other strategies are discussed in Luo & Chen (2021), such as Lα_k² = γ_k(1 + α_k). Finally, one can simply choose a constant step size variant of Algorithm 1; we select and discuss this approach in Section 2.3.
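A direct transcription of Algorithm 1 with a constant step size, applied to a quadratic objective, can be sketched as follows. The coefficient µ^{−1}b_k is written as α/(αµ + γ) so the code also runs for µ = 0; the problem data and step size are illustrative choices.

```python
import numpy as np

# Sketch of Algorithm 1 (NAG-GS) with constant step size, on the
# quadratic f(x) = 0.5 x^T A x. All constants are illustrative.
def nag_gs(grad, x0, alpha, mu, gamma0, iters):
    x, v, gamma = x0.copy(), x0.copy(), gamma0
    for _ in range(iters):
        a = alpha / (alpha + 1.0)
        gamma = (1.0 - a) * gamma + a * mu           # gamma_{k+1}
        x = (1.0 - a) * x + a * v                    # x_{k+1}
        b = alpha * mu / (alpha * mu + gamma)        # b_k
        grad_coef = alpha / (alpha * mu + gamma)     # = b_k / mu for mu > 0
        v = (1.0 - b) * v + b * x - grad_coef * grad(x)
    return x

A = np.diag([0.1, 1.0, 10.0])                        # mu = 0.1, L = 10
x = nag_gs(lambda x: A @ x, np.ones(3), alpha=0.2, mu=0.1, gamma0=1.0, iters=300)
print(np.linalg.norm(x) < 1e-3)   # True: iterates contract to the minimizer 0
```

The step size 0.2 lies just under the bound of Theorem 1 for these values of µ, L and γ.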
Let us mention that fully implicit discretizations have also been considered and studied by the authors; these are briefly discussed in Appendix A.2. However, their interest is, at the moment, limited for ML applications since the obtained implicit schemes are connected to a specific family of second-order methods which are intractable for real-life ML models.

Table 1: Summary of the comparison of NAG-GS to the reference optimizer for different neural architectures (greater is better). Target metrics are ACC@1 and ACC@5 for RESNET20 and RESNET50 respectively, and the average score on GLUE for ROBERTA.

2.3. CONVERGENCE ANALYSIS IN THE QUADRATIC CASE

We mentioned in the previous section that one could choose a constant step size strategy for Algorithm 1. We propose to study how to select a maximum (constant) step size that ensures an optimal contraction rate while guaranteeing the convergence, i.e. the stability, of the NAG-GS method when used to solve the SDE system 11. Ultimately, we show that the choice of the optimal (constant) step size is mostly influenced by the values of µ, L and γ. These (hyper)parameters are central, and in order to show this, we study two key quantities, namely the spectral radius of the iteration matrix and the covariance matrix associated with the NAG-GS method summarized in Algorithm 1. Note that this theoretical study only concerns the case f(x) = ½xᵀAx. Considering the size limitation of the paper, we present below only the main theoretical result and place its proof in Appendix A.1.4:

Theorem 1 For G_NAG in equation 7, given γ ≥ µ, and assuming 0 < µ = λ_1 ≤ . . . ≤ λ_n = L < ∞: if 0 < α ≤ (µ + γ + √((µ − γ)² + 4γL))/(L − µ), then the NAG-GS method summarized in Algorithm 1 is convergent for the n-dimensional case, with n > 2.

Remark 1 It is important to mention that the theoretical result about accelerated convergence in the stochastic setting holds exactly as in the deterministic setting.
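The step-size bound of Theorem 1 can be checked numerically by building the NAG-GS iteration matrix E = (I − αM)^{−1}(I + αN) from the Gauss-Seidel splitting of G_NAG and verifying that its spectral radius stays below 1 just under the bound; the values of A, µ and γ below are illustrative.

```python
import numpy as np

# Numerical check of Theorem 1's step-size bound on an illustrative
# quadratic: spectral radius of the NAG-GS iteration matrix is < 1
# for alpha slightly below the bound.
mu, L, gamma = 0.5, 20.0, 0.5
A = np.diag([mu, 3.0, L])
n = A.shape[0]
I = np.eye(n)
Z = np.zeros((n, n))
M = np.block([[-I, Z], [(mu * I - A) / gamma, -(mu / gamma) * I]])
N = np.block([[Z, I], [Z, Z]])

alpha_max = (mu + gamma + np.sqrt((mu - gamma) ** 2 + 4 * gamma * L)) / (L - mu)
alpha = 0.99 * alpha_max
E = np.linalg.inv(np.eye(2 * n) - alpha * M) @ (np.eye(2 * n) + alpha * N)
rho = max(abs(np.linalg.eigvals(E)))
print(rho < 1.0)   # True: the method is convergent at this step size
```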
All the steps of the convergence analysis are fully detailed in Appendix A.1 and organized as follows:
• Appendices A.1.1 and A.1.2 respectively provide the full analysis of the spectral radius of the iteration matrix associated with the NAG-GS method and of the covariance matrix at stationarity w.r.t. all (hyper)parameters µ, L, γ and σ, for the case of dimension n = 2. The theoretical results obtained are summarized in Appendix A.1.3 to come up with an optimal (constant) step size in terms of contraction rate.
• Numerical tests are performed and detailed in Appendix A.1.5 to support the theoretical results obtained for the quadratic case.

3. EXPERIMENTS

We test the NAG-GS method on several neural architectures: logistic regression, transformer models for natural language processing tasks, and residual networks for computer vision tasks. Section 3.1 presents the numerical results for logistic regression. Sections 3.2 and 3.3 focus on the tests carried out on transformers and residual networks. For these two neural architectures, in order to benchmark our method as fairly as possible, we replace the reference optimizers with ours and adjust only the hyperparameters of our optimizer. We keep intact the model architectures and model hyperparameters such as dropout rate, schedule, batch size, number of training epochs, and evaluation methodology. Moreover, we carry out an ablation study on small real-world models. The results of the benchmark for transformer models and residual networks are summarized in Table 1. Furthermore, in Section 3.4, the experiments highlight the importance of updating γ during the training process. In Section 3.5 we present some preliminary experiments to study the relations between convergence and the Hessian spectrum. Moreover, in Appendix C.2 we bring some numerical evidence supporting the theoretical constraints on the optimizer parameters derived in Section 2. Implementation details are discussed in Appendix C.3.

3.1. LOGISTIC REGRESSION

In this section, we benchmark the NAG-GS method against state-of-the-art optimizers on the logistic regression training problem for the MNIST dataset LeCun et al. (2010). Since this problem is convex and non-quadratic, we consider it the natural next test case after the theoretical analysis and numerical tests of the NAG-GS method in Section 2.3 for the quadratic convex problem. In Figure 2 we present the comparison of the NAG-GS method with its competitors. We confirm numerically that NAG-GS allows a larger range of learning rate values than the SGD with Momentum and AdamW optimizers, highlighting the robustness of our method w.r.t. the selection of hyperparameters. Moreover, the results indicate that the semi-implicit nature of the NAG-GS method indeed ensures the acceleration effect through the use of larger learning rates while keeping a good model accuracy, and this holds not only for convex quadratic problems but also for non-quadratic convex ones.

3.2. TRANSFORMER MODELS

In this section we test the NAG-GS optimizer in the frame of natural language processing, for the task of fine-tuning a pretrained model on the GLUE benchmark datasets Wang et al. (2018). We use the pretrained RoBERTa Liu et al. (2019) model from Hugging Face's TRANSFORMERS Wolf et al. (2020) library. In this benchmark, the reference optimizer is AdamW Ilya et al. (2019) with a polynomial learning rate schedule. The training setup defined in Liu et al. (2019) is used for both the NAG-GS and AdamW optimizers. We search for an optimal learning rate for the NAG-GS optimizer with fixed γ and µ to get the best performance on the task at hand. Note that NAG-GS is used with a constant schedule, which makes it simpler to tune. In terms of learning rate values, the one allowed by AdamW is around 10^{-5} while NAG-GS allows a much larger value of 10^{-2}. Evaluation results on the GLUE tasks are presented in Table 3. Despite a rather restrained search space for the NAG-GS hyperparameters, it demonstrates better performance on some tasks and worse performance on others. Figure 3 shows the behavior of loss values and target metrics on GLUE (see Appendix C).

3.3. RESIDUAL NETWORKS

ResNet50. We train ResNet50 on CIFAR-10 Krizhevsky (2009) for a few learning rate values along with fixed γ and µ. We select the best overall performing parameters for our optimizer while, for the reference optimizer SGD-MW, we use the parameters reported in the literature. Also, we found that, unlike fine-tuning on Transformers, a piecewise-constant schedule gives better target metrics than a constant one. The best acc@5 performance for NAG-GS is within three percentage points of SGD-MW (see Table 1 and Figure 4b).

ResNet20. Since ResNet20 has fewer parameters than ResNet50, we carried out more intensive experiments in order to evaluate more deeply the performance of NAG-GS for computer vision tasks (residual networks in particular) and to show that NAG-GS with an appropriate choice of optimizer parameters is on par with SGD-MW (see Table 1 and Figure 4a).
The classification problem is solved using CIFAR-10 Krizhevsky (2009) . Experimental setup is the same in all experiments except optimizer and its parameters. The best test score for NAG-GS is achieved for α = 0.11, γ = 17, and µ = 0.01.

3.4. UPDATABLE SCALING FACTOR γ

According to the theory of the NAG-GS optimizer presented in Section 2, the scaling factor γ decays exponentially fast to µ and, in the case γ_0 = µ, γ remains constant along iterations. So, a natural question arises: is the update of γ necessary? Our experiments confirm that the scaling factor γ should be updated according to Algorithm 1, even in this highly non-convex setting, in order to get better metrics on test sets. We use the experimental setup for ResNet20 from Section 3.3 and search for hyperparameters for NAG-GS with updatable γ and with constant γ. The common hyperparameter optimization library OPTUNA Akiba et al. (2019) is used with a budget of 160 iterations to sample NAG-GS parameters. Figure 5 plots the evolution of the best score value along optimization time. The final difference is about 0.5, which is significant in terms of final classification error.
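The γ update of Algorithm 1 contracts γ toward µ geometrically, mirroring the continuous-time decay γ(t) = µ + (γ_0 − µ)e^{−t}; a small sketch, using the ResNet20 values (α = 0.11, γ = 17, µ = 0.01) quoted in Section 3.3:

```python
# Iterate the gamma update of Algorithm 1 with a constant step size and
# check that gamma converges to mu.
mu, gamma, alpha = 0.01, 17.0, 0.11
a = alpha / (alpha + 1.0)
for _ in range(500):
    gamma = (1.0 - a) * gamma + a * mu  # gamma_{k+1} = (1 - a_k) gamma_k + a_k mu
print(abs(gamma - mu) < 1e-8)           # True: geometric decay to mu
```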

3.5. NON-CONVEXITY AND HESSIAN SPECTRUM

Theoretical analysis of NAG-GS highlights the importance of the smallest eigenvalue of the Hessian matrix for convex and strongly convex functions. Unfortunately, the objective functions usually considered for the training of neural networks are not convex. In this section we try to address this issue. The smallest model in our experimental setup is ResNet20. However, we cannot afford to compute the Hessian matrix exactly since ResNet20 has almost 300k parameters. Instead, we use Hessian-vector products (HVP) H(x)v and apply matrix-free algorithms for finding the extreme eigenvalues. We estimate the extreme eigenvalues of the Hessian spectrum with power iterations (PI) along with the Rayleigh quotient (RQ) Golub & van Loan (2013). PI is used to get a good initial vector which is then used in the optimization of the RQ. In order to get a more useful initial vector for the estimation of the smallest eigenvalue, we apply the spectral shift H(x) − λ_max I and use the corresponding eigenvector. Figure 6 shows the extreme eigenvalues of the ResNet20 Hessian at the end of each epoch for batch size 256 in the same setup as in Section 3.3. The largest eigenvalue is strictly positive while the smallest one is negative and usually oscillates around µ = −1. It turns out that there is an island of hyperparameters in the vicinity of that µ. We report that training ResNet20 with hyperparameters from this island gives good target metrics. The domain of negative momenta is non-conventional and not well understood, to the best of our knowledge. Moreover, there are no theoretical guarantees for NAG-GS in the non-convex case with negative µ. However, Velikanov et al. (2022) report the existence of regions of convergence for SGD with negative momentum, which supports our observations. The theoretical aspects of these observations will be studied in further work.

Under review as a conference paper at ICLR 2023

Figure 5: Best acc@1 on the test set for updatable and fixed scaling factor γ during hyperparameter optimization. NAG-GS with updatable γ more frequently gives better results than with constant γ; the final difference is about 0.5.
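The matrix-free procedure described above (power iteration, Rayleigh quotient, and a spectral shift for the smallest eigenvalue) can be sketched on a toy problem. Here `hvp` is a stand-in for an autodiff Hessian-vector product; the explicit diagonal matrix exists only to fabricate the example.

```python
import numpy as np

# Matrix-free extreme-eigenvalue estimation via power iteration (PI)
# with a Rayleigh quotient (RQ), plus a spectral shift for the smallest
# eigenvalue. The "Hessian" here is an illustrative toy.
rng = np.random.default_rng(0)
H = np.diag([-1.0, 0.3, 2.0, 5.0])           # toy Hessian, eigs in [-1, 5]
hvp = lambda v: H @ v                         # Hessian-vector product oracle

def power_iteration(matvec, dim, iters=500):
    v = rng.normal(size=dim)
    for _ in range(iters):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return v @ matvec(v)                      # Rayleigh quotient estimate

lam_max = power_iteration(hvp, 4)             # dominant eigenvalue: 5.0
# The shift H - lam_max * I makes the smallest eigenvalue dominant.
lam_min = power_iteration(lambda v: hvp(v) - lam_max * v, 4) + lam_max
print(round(lam_max, 6), round(lam_min, 6))   # prints: 5.0 -1.0
```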

4. RELATED WORKS

The approach of interpreting and analyzing optimization methods from the perspective of ODE discretization is well known and widely used in practice Muehlebach & Jordan (2019); Wilson et al. (2021); Shi et al. (2021). The main advantage of this approach is that it constructs a direct correspondence between the properties of some classes of ODEs and their associated optimization methods. In particular, gradient descent and Nesterov accelerated methods are discussed in Su et al. (2014) as particular discretizations of ODEs. In the same perspective, many other optimization methods have been analyzed; we can mention the mirror descent method and its accelerated versions Krichene et al. (2015), proximal methods Attouch et al. (2019) and ADMM Franca et al. (2018). It is well known that the discretization strategy is essential for transforming a particular ODE into an efficient optimization method, Shi et al. (2019).

A ADDITIONAL REMARKS RELATED TO THEORETICAL BACKGROUND

An accelerated ODE has been presented in Section 1.1 which relied on a specific spectral transformation. In this brief appendix, we add some useful insights:
• Equation 10 is a variant of the heavy ball model with variable damping coefficients in front of ẍ and ẋ.
• Thanks to the scaling factor γ, both the convex case µ = 0 and the strongly convex case µ > 0 can be handled in a unified way.
• In continuous time, one can easily solve equation 8 as follows: γ(t) = µ + (γ_0 − µ)e^{−t}, t ≥ 0. Since γ_0 > 0, we have γ(t) > 0 for all t ≥ 0, and γ(t) converges to µ exponentially and monotonically as t → +∞. In particular, if γ_0 = µ > 0, then γ(t) = µ for all t ≥ 0. We remark here the links between the behavior of the scaling factor γ(t) and the sequence {γ_k}_{k=0}^∞ introduced by Nesterov Nesterov (2018) in his analysis of optimal first-order methods in discrete time, see (Nesterov, 2018, Lemma 2.2.3).
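The closed-form solution γ(t) = µ + (γ_0 − µ)e^{−t} of equation 8 can be verified against a fine implicit-Euler integration (the same discretization used for γ in equation 14); constants are illustrative.

```python
import numpy as np

# Check that gamma(t) = mu + (gamma_0 - mu) e^{-t} solves the scaling
# ODE gamma' = mu - gamma (equation 8), via a fine implicit Euler run.
mu, gamma0, dt, T = 0.5, 3.0, 1e-4, 5.0
gamma = gamma0
for _ in range(int(T / dt)):
    gamma = (gamma + dt * mu) / (1.0 + dt)   # implicit Euler step for eq. 8
exact = mu + (gamma0 - mu) * np.exp(-T)
print(abs(gamma - exact) < 1e-3)             # True: first-order agreement
```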
• The authors of Luo & Chen (2021) prove the exponential decay property L(t) ≤ e^{−t} L_0, t > 0, for a tailored Lyapunov function L(t) := f(x(t)) − f(x⋆) + (γ(t)/2)∥v(t) − x⋆∥², where x⋆ ∈ argmin f is a global minimizer of f. Again we note the similarity between the Lyapunov function proposed here and the estimating sequence {ϕ_k(x)}_{k=0}^∞ of function f introduced by Nesterov in his analysis of optimal first-order methods Nesterov (2018). In (Nesterov, 2018, Lemma 2.2.3), this sequence takes the form ϕ_k(x) = ϕ⋆_k + (γ_k/2)∥v_k − x∥², where γ_{k+1} := (1 − α_k)γ_k + α_k µ and v_{k+1} := (1/γ_{k+1})[(1 − α_k)γ_k v_k + α_k µ y_k − α_k ∇f(y_k)], which stand for a forward Euler discretization of equation 8 and of the second ODE of equation 9, respectively. We ask the attentive reader to keep in mind that this discussion mainly concerns the continuous-time case. A second central part of our analysis was based on the discretization of equation 9. Indeed, these discretizations ensure, together with the spectral transformation 7, the optimal convergence rates of the methods and their particular ability to handle noisy gradients.

A.1 CONVERGENCE/STABILITY ANALYSIS OF QUADRATIC CASE: DETAILS

As briefly mentioned in Section 2.3, the two key elements to come up with a maximum (constant) step size for Algorithm 1 are the study of the spectral radius of the iteration matrix associated with the NAG-GS scheme (Appendix A.1.1) and of the covariance matrix at stationarity (Appendix A.1.2) w.r.t. all the significant parameters of the scheme, namely: the step size (integration step/time step) α, the convexity parameters 0 ≤ µ ≤ L ≤ ∞ of function f, the volatility of the noise σ and the positive scaling parameter γ. Note that this theoretical study only concerns the case f(x) = ½xᵀAx. Among other results, we will show in particular that for specific intervals of µ and L, a larger step size α can be reached while preserving the A-stability of the NAG-GS method. Let us finally remark that the two following sections consider the special case n = 2; the theoretical results obtained are summarized in Appendix A.1.3 and generalized to the n-dimensional case in Appendix A.1.4 with a global convergence result for the NAG-GS method.

A.1.1 SPECTRAL RADIUS ANALYSIS

Let us assume f(x) = ½xᵀAx; since A ∈ S^n_+ by hypothesis, it is diagonalizable and can be taken as A = diag(λ_1, . . . , λ_n) without loss of generality, i.e. we work in a coordinate system composed of the eigenvectors of matrix A. Let us note that µ = λ_1 ≤ · · · ≤ λ_i ≤ · · · ≤ λ_n = L. In this setting, y = (x, v) ∈ R⁴ and the matrices M and N from the Gauss-Seidel splitting of G_NAG in equation 7 are:

M = [ −I_{2×2}       0_{2×2}
      (µI − A)/γ   −(µ/γ)I_{2×2} ]
  = [ −1   0         0      0
       0  −1         0      0
       0   0       −µ/γ     0
       0  (µ−L)/γ    0    −µ/γ ],
N = [ 0_{2×2}   I_{2×2}
      0_{2×2}   0_{2×2} ].

For the minimization of f(x) = ½xᵀAx, given the property of Brownian motion ΔW_k = W_{k+1} − W_k = √α_k η_k with η_k ∼ N(0, 1), equation 12 reads:

y_{k+1} = (I_{4×4} − αM)^{−1}(I_{4×4} + αN) y_k + (I_{4×4} − αM)^{−1} (0, σ)ᵀ √α η_k   (16)

Since matrix M is lower triangular, I_{4×4} − αM is as well and can be factorized as:

I_{4×4} − αM = DT = [ (1+α)I_{2×2}   0_{2×2}             [ I_{2×2}                        0_{2×2}
                      0_{2×2}       (1+αµ/γ)I_{2×2} ]     α(A − µI_{2×2})/(γ(1+αµ/γ))   I_{2×2} ].

Hence (I_{4×4} − αM)^{−1} = T^{−1}D^{−1}, where D^{−1} is easily computed. It remains to compute T^{−1}; T can be decomposed as T = I_{4×4} + Q with Q a nilpotent matrix such that QQ = 0_{4×4}. For such a decomposition, it is well known that:

T^{−1} = (I_{4×4} + Q)^{−1} = I_{4×4} − Q = [ I_{2×2}                       0_{2×2}
                                              α(µI_{2×2} − A)/(γ(1+τ))   I_{2×2} ],   where τ = αµ/γ.

Combining these results, equation 16 finally reads:

y_{k+1} = [ 1/(α+1)        0              α/(1+α)                0
            0              1/(α+1)        0                      α/(1+α)
            0              0              1/(1+τ)                0
            0   α(µ−L)/(γ(τ+1)(α+1))      0     α²(µ−L)/(γ(1+τ)(1+α)) + 1/(1+τ) ] y_k + (0, σ√α/(1+τ))ᵀ η_k = E y_k + (0, σ√α/(1+τ))ᵀ η_k   (18)

with E denoting the iteration matrix associated with the NAG-GS method. Hence equation 18 includes two terms: the first is the product of the iteration matrix with the current vector y_k and the second features the effect of the noise. The latter will be studied in Appendix A.1.2 from the point of view of the maximum step size of the NAG-GS method, through the key quantity of the covariance matrix.
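The factorization I − αM = DT and the inverse T^{−1} = I − Q (with Q nilpotent) used above can be checked numerically; the parameter values below are illustrative.

```python
import numpy as np

# Check the factorization I - alpha*M = D*T and T^{-1} = I - Q for
# A = diag(mu, L). Parameter values are illustrative.
mu, L, gamma, alpha = 1.0, 10.0, 2.0, 0.3
A = np.diag([mu, L])
I2, Z2 = np.eye(2), np.zeros((2, 2))
M = np.block([[-I2, Z2], [(mu * I2 - A) / gamma, -(mu / gamma) * I2]])

D = np.block([[(1 + alpha) * I2, Z2], [Z2, (1 + alpha * mu / gamma) * I2]])
Q = np.block([[Z2, Z2],
              [alpha * (A - mu * I2) / (gamma * (1 + alpha * mu / gamma)), Z2]])
T = np.eye(4) + Q

print(np.allclose(D @ T, np.eye(4) - alpha * M))     # I - alpha*M = D T
print(np.allclose(np.linalg.inv(T), np.eye(4) - Q))  # T^{-1} = I - Q
```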
Let us focus on the first term; clearly, in order to get the maximum contraction rate, we should look for the α that minimizes the spectral radius of E. Since the spectral radius is the maximum absolute value of the eigenvalues of the iteration matrix E, we start by computing them. Let us find the expressions of the λ_i ∈ σ(E), 1 ≤ i ≤ 4, satisfying det(E − λI_{4×4}) = 0 as functions of the scheme's parameters. Solving

det(E − λI_{4×4}) = 0 ≡ (γλ − γ + αλµ)(λ + αλ − 1)(γ − 2γλ + γλ² + α²λ²µ − αγλ − αλµ + Lα²λ + αγλ² + αλ²µ − α²λµ) / ((α + 1)²(γ + αµ)²) = 0

leads to the following eigenvalues:

λ_1 = γ/(γ + αµ),
λ_2 = 1/(1 + α),
λ_3 = (2γ + αγ + αµ − Lα² + α²µ)/(2(γ + αγ + αµ + α²µ)) + α√(L²α² − 2Lα²µ − 2Lαµ − 2γLα − 4γL + α²µ² + 2αµ² + 2γαµ + µ² + 2γµ + γ²)/(2(γ + αγ + αµ + α²µ)),
λ_4 = (2γ + αγ + αµ − Lα² + α²µ)/(2(γ + αγ + αµ + α²µ)) − α√(L²α² − 2Lα²µ − 2Lαµ − 2γLα − 4γL + α²µ² + 2αµ² + 2γαµ + µ² + 2γµ + γ²)/(2(γ + αγ + αµ + α²µ)).

Let us first mention some general behavior of these eigenvalues. Given γ and µ positive, we observe that:
1. λ_1 and λ_2 are positive decreasing functions w.r.t. α. Moreover, for bounded γ and µ, we have lim_{α→∞} |λ_1(α)| = 0 = lim_{α→∞} |λ_2(α)|.

2. One can show that for α ∈ [(µ+γ−2√(γL))/(L−µ), (µ+γ+2√(γL))/(L−µ)], the functions λ_3(α) and λ_4(α) are complex-valued, and one can easily show that both share the same absolute value. Note that the lower bound (µ+γ−2√(γL))/(L−µ) of the interval is negative as soon as γ ∈ [2L − µ − 2√(L² − µL), 2L − µ + 2√(L² − µL)] ⊆ R_+. Moreover, one can easily show that lim_{α→∞} |λ_3(α)| = 0 and lim_{α→∞} |λ_4(α)| = (L−µ)/µ = κ(A) − 1. The latter limit shows that the eigenvalue λ_4 plays a central role in the convergence of the NAG-GS method since it is the one that can reach the value one and violate the convergence condition, as soon as κ(A) > 2. The analysis of λ_4 also allows us to come up with a good candidate for the step size α that minimizes the spectral radius of matrix E, especially and obviously at the critical point α_max = (µ+γ+2√(γL))/(L−µ), which is positive since L ≥ µ by hypothesis. Note that the case L → µ gives some preliminary hints that the maximum step size can be almost "unbounded" in some particular cases.

Now, let us study these eigenvalues in more detail; three different scenarios must be considered:
1. For any variant of Algorithm 1 for which γ_0 = µ, we have γ = µ for all k ≥ 0 and therefore λ_1(α) = λ_2(α). Moreover, at α = (µ+γ+2√(γL))/(L−µ) = (2µ+2√(µL))/(L−µ), we can easily check that |λ_1(α)| = |λ_2(α)| = |λ_3(α)| = |λ_4(α)|. Therefore α = (2µ+2√(µL))/(L−µ) is the step size ensuring the minimal spectral radius and hence the maximum contraction rate, see Figure 7.
2. For γ < µ: one can easily check that (µ+γ+√((µ−γ)²+4γL))/(L−µ) − (µ+γ+2√(γL))/(L−µ) > 0, since (µ−γ)² > 0. Hence the second candidate for the step size α is larger than the first one, and the distance between them increases as the squared distance between γ and µ. Figure 8 shows the evolution of the absolute values of the eigenvalues of the iteration matrix E w.r.t. α for this setting.
3. For γ > µ: the analysis of this setting gives the same results as the previous point.
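The closed-form eigenvalues above can be cross-checked numerically by building E from the splitting for A = diag(µ, L); the parameter values are illustrative.

```python
import numpy as np

# Cross-check of the eigenvalues of the iteration matrix E for
# A = diag(mu, L): lambda_1 = gamma/(gamma + alpha*mu) and
# lambda_2 = 1/(1 + alpha) appear in the numerical spectrum.
mu, L, gamma, alpha = 1.0, 10.0, 2.0, 0.3
A = np.diag([mu, L])
I2, Z2 = np.eye(2), np.zeros((2, 2))
M = np.block([[-I2, Z2], [(mu * I2 - A) / gamma, -(mu / gamma) * I2]])
N = np.block([[Z2, I2], [Z2, Z2]])
E = np.linalg.inv(np.eye(4) - alpha * M) @ (np.eye(4) + alpha * N)

vals = sorted(abs(np.linalg.eigvals(E)))
lam1 = gamma / (gamma + alpha * mu)      # closed-form lambda_1
lam2 = 1.0 / (1.0 + alpha)               # closed-form lambda_2
print(any(np.isclose(v, lam1) for v in vals),
      any(np.isclose(v, lam2) for v in vals))   # True True
```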
According to Algorithm 1, $\gamma$ is either constant and equal to $\mu$, or decreasing to $\mu$ along the iterations. Hence, the case $\gamma > \mu$ will be considered for the theoretical analysis whenever $\gamma \neq \mu$. As a first summary, the detailed analysis of the eigenvalues of the iteration matrix $E$ w.r.t. the significant parameters of the NAG-GS method leads us to two candidates for the step size that minimize the spectral radius of $E$, hence ensuring the highest contraction rate possible. These results will be combined with those obtained in Appendix A.1.2, dedicated to the covariance matrix analysis.
Let us now look at the behavior of the dynamics in expectation. Given the properties of the Brownian motion, applying the expectation operator $\mathbb{E}$ on both sides of the system of SDEs 11 yields "averaged" equations that identify with the "deterministic" setting studied by Luo & Chen (2021). For that setting, the authors demonstrated that, if $0 \le \alpha \le \frac{2}{\sqrt{\kappa(A)}}$, then a Gauss-Seidel splitting-based scheme for solving equation 9 is A-stable for quadratic objectives in the deterministic setting. We conclude this section by showing that the two step size candidates derived above are higher than the limit $\frac{2}{\sqrt{\kappa(A)}}$ given in (Luo & Chen, 2021, Theorem 1). This can be intuitively understood in the case $L \to \mu$; a formal proof is given in Lemma 1.

Lemma 1 Given $\gamma > 0$, and assuming $0 < \mu < L$, then for $\gamma = \mu$ and $\gamma > \mu$ the following inequalities respectively hold:
$$\frac{2\mu + 2\sqrt{\mu L}}{L - \mu} > \frac{2}{\sqrt{\kappa(A)}}, \qquad \frac{\mu + \gamma + \sqrt{(\mu-\gamma)^2 + 4\gamma L}}{L - \mu} > \frac{2}{\sqrt{\kappa(A)}}, \qquad (21)$$
where $\kappa(A) = \frac{L}{\mu}$.
Proof: Let us start with the case $\mu = \gamma$; the first inequality from equation 21 becomes:
$$\frac{2\mu + 2\sqrt{L\mu}}{L - \mu} > \frac{2}{\sqrt{L/\mu}} \equiv (\mu + \sqrt{L\mu})\sqrt{L/\mu} > L - \mu \equiv \sqrt{\mu L} + L > L - \mu \equiv \sqrt{\mu L} > -\mu,$$
which holds for any positive $\mu$ and $L$, conditions satisfied by hypothesis.
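Lemma 1 can also be probed numerically over a grid of parameters; a small sketch (the grids are arbitrary):

```python
import math

# Step size candidates and the A-stability bound of (Luo & Chen, 2021, Theorem 1)
def alpha_eq(mu, L):            # candidate for gamma = mu
    return (2 * mu + 2 * math.sqrt(mu * L)) / (L - mu)

def alpha_gt(mu, L, g):         # candidate for gamma > mu
    return (mu + g + math.sqrt((mu - g)**2 + 4 * g * L)) / (L - mu)

def luo_chen_bound(mu, L):      # 2 / sqrt(kappa(A)) with kappa(A) = L / mu
    return 2 / math.sqrt(L / mu)

checks = []
for mu in [0.1, 0.5, 1.0, 2.0]:
    for ratio in [1.1, 2.0, 10.0, 100.0]:    # L = ratio * mu > mu
        L = ratio * mu
        bound = luo_chen_bound(mu, L)
        checks.append(alpha_eq(mu, L) > bound)
        for g in [2 * mu, 10 * mu]:          # gamma > mu
            checks.append(alpha_gt(mu, L, g) > bound)
all_hold = all(checks)
```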
For the case $\gamma > \mu$, we have:
$$\frac{\mu + \gamma + \sqrt{(\mu-\gamma)^2 + 4\gamma L}}{L - \mu} > \frac{2}{\sqrt{L/\mu}} \equiv \sqrt{(\mu-\gamma)^2 + 4\gamma L} > \frac{2}{\sqrt{L/\mu}}(L - \mu) - \gamma - \mu$$
$$\equiv (\mu-\gamma)^2 + 4\gamma L > \big(\mu + 2\sqrt{\mu/L}\,(\mu - L) + \gamma\big)^2 \equiv \gamma > \frac{-2\mu^2 + \mu^3/L + \mu^2\sqrt{\mu/L} + \mu L - \mu L\sqrt{\mu/L}}{\mu - \sqrt{\mu/L}\,(\mu - L) + L},$$
where the second inequality holds since $L \ge \mu$ and the last one since $\mu - \sqrt{\mu/L}\,(\mu-L) + L > 0$ (one can easily check this by using $L > \mu$). It remains to show that:
$$\mu > \frac{-2\mu^2 + \mu^3/L + \mu^2\sqrt{\mu/L} + \mu L - \mu L\sqrt{\mu/L}}{\mu - \sqrt{\mu/L}\,(\mu - L) + L},$$
which holds for any positive $\mu$ and $L$ (technical details are skipped; the argument mainly consists in studying the table of signs of a polynomial in $\mu$). Since $\gamma > \mu$ by hypothesis, the inequality above therefore holds for any positive $\mu$ and $L$ as well, conditions satisfied by hypothesis. This concludes the proof.

□

Furthermore, let us note that both step size candidates, namely $\frac{2\mu+2\sqrt{\mu L}}{L-\mu}$ and $\frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$ for the cases $\gamma = \mu$ and $\gamma > \mu$ respectively, show that the NAG-GS method converges in the case $L \to \mu$ with a step size that tends to $\infty$; this behavior cannot be anticipated from the upper bound given by (Luo & Chen, 2021, Theorem 1). Some simple numerical experiments are performed in Appendix A.1.5 to support this theoretical result. Finally, based on the previous discussions, let us remark that for $\alpha \in \big[\frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}, \infty\big)$ when $\gamma \neq \mu$, or $\alpha \in \big[\frac{2\mu+2\sqrt{\mu L}}{L-\mu}, \infty\big)$ when $\gamma = \mu$, we have $\rho(E(\alpha)) = |\lambda_4(\alpha)|$, and one can show that $\rho(E)$ is a strictly monotonically increasing function of $\alpha$ for all $L > \mu > 0$ and $\gamma > 0$ (see Appendix A.1.6 for the discussion).
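These claims can be checked numerically. The 2x2 block below is our reading of the NAG-GS iteration matrix (a semi-implicit Gauss-Seidel step on an $(x, v)$ pair for a scalar curvature $a$); it reproduces the trace and determinant implied by the closed-form eigenvalues above, but since equation 18 is not reproduced in this appendix, treat it as an assumption. A sketch:

```python
import numpy as np

def nag_gs_block(a, al, g, mu):
    """2x2 iteration matrix of the Gauss-Seidel step (our reading of eq. 18):
    x+ = (x + al*v)/(1+al);  v+ = (v + (al/g)*(mu - a)*x+)/(1 + al*mu/g)."""
    c = al * (mu - a) / g
    d = 1 + al * mu / g
    return np.array([[1 / (1 + al),       al / (1 + al)],
                     [c / ((1 + al) * d), (1 + c * al / (1 + al)) / d]])

def rho(al, g, mu, L):
    # 4x4 iteration matrix for A = diag(mu, L), block-diagonal up to permutation
    E = np.zeros((4, 4))
    E[:2, :2] = nag_gs_block(mu, al, g, mu)
    E[2:, 2:] = nag_gs_block(L, al, g, mu)
    return max(abs(np.linalg.eigvals(E)))

mu, L = 1.0, 5.0
g = mu                                      # gamma = mu scenario
a_c = (2 * mu + 2 * np.sqrt(mu * L)) / (L - mu)
a_crit = (mu + g + np.sqrt(g**2 - 6 * g * mu + mu**2 + 4 * g * L)) / (L - 2 * mu)

grid = np.linspace(a_c, 8 * a_c, 100)
rhos = np.array([rho(al, g, mu, L) for al in grid])
```

The assertions below reflect the analysis of this section: contraction at $\alpha_c$, spectral radius one at the critical step size, and monotone growth beyond $\alpha_c$.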

A.1.2 COVARIANCE ANALYSIS

In this section, we study the computation of the maximum step size for the NAG-GS method through the analysis of the covariance matrix at stationarity. Let us start by computing the covariance matrix obtained at iteration $k+1$ from Algorithm 1:
$$C_{k+1} = \mathbb{E}(y_{k+1} y_{k+1}^T). \qquad (22)$$
By denoting $\xi_k = \begin{pmatrix} 0 \\ \frac{\sigma\sqrt{\alpha}}{1+\tau}\eta_k \end{pmatrix}$ and replacing $y_{k+1}$ by its expression given in equation 18, equation 22 becomes:
$$C_{k+1} = \mathbb{E}\big((E y_k + \xi_k)(E y_k + \xi_k)^T\big) = \mathbb{E}\big(E y_k y_k^T E^T\big) + \mathbb{E}\big(\xi_k \xi_k^T\big), \qquad (23)$$
which holds since the expectation operator $\mathbb{E}(\cdot)$ is linear and by assuming statistical independence between $\xi_k$ and $E y_k$. On the one hand, by using again the linearity of $\mathbb{E}(\cdot)$ and since $E$ is seen as a constant by $\mathbb{E}(\cdot)$, one can show that $\mathbb{E}(E y_k y_k^T E^T) = E C_k E^T$. On the other hand, since $\eta_k \sim \mathcal{N}(0,1)$, equation 23 becomes:
$$C_{k+1} = E C_k E^T + Q, \qquad (24)$$
where $Q = \begin{pmatrix} 0_{2\times2} & 0_{2\times2} \\ 0_{2\times2} & \frac{\alpha_k\sigma^2}{(1+\tau_k)^2} I_{2\times2} \end{pmatrix}$. Let us now look at the limiting behavior of equation 24, that is $\lim_{k\to\infty} C_k$. Let $C = \lim_{k\to\infty} C_k$ be the covariance matrix reached in the asymptotic regime, also referred to as the stationary regime. Taking the limit on both sides of equation 24, $C$ satisfies
$$C = E C E^T + Q, \qquad (25)$$
which is a particular case of the discrete Lyapunov equation. To solve it, the vectorization operator $\vec{\cdot}$ is applied on both sides of equation 25, which amounts to solving the following linear system:
$$(I_{4^2\times4^2} - E \otimes E)\,\vec{C} = \vec{Q}, \qquad (26)$$
where $A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$ stands for the Kronecker product. The solution is given by:
$$C = \mathrm{vec}^{-1}\big((I_{4^2\times4^2} - E \otimes E)^{-1}\vec{Q}\big), \qquad (27)$$
where $\mathrm{vec}^{-1}$ stands for the un-vectorization operator. Let us note that, even for the 2-dimensional case considered in this section, the symbolic expression of matrix $C$ rapidly grows and cannot be written in plain form within this paper. In the following, we keep its symbolic expression.
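Equations 25-27 translate directly into code; a small sketch using NumPy's Kronecker product, with a randomly generated stable $E$ and a symmetric PSD $Q$ standing in for the actual NAG-GS matrices:

```python
import numpy as np

def stationary_covariance(E, Q):
    """Solve the discrete Lyapunov equation C = E C E^T + Q by vectorization:
    (I - kron(E, E)) vec(C) = vec(Q), then un-vectorize (equations 26-27)."""
    n = E.shape[0]
    M = np.eye(n * n) - np.kron(E, E)
    return np.linalg.solve(M, Q.reshape(-1)).reshape(n, n)

rng = np.random.default_rng(0)
E = rng.standard_normal((4, 4))
E *= 0.9 / max(abs(np.linalg.eigvals(E)))   # rescale so that rho(E) = 0.9 < 1
B = rng.standard_normal((4, 4))
Q = B @ B.T                                 # any symmetric PSD diffusion term
C = stationary_covariance(E, Q)
residual = np.linalg.norm(C - (E @ C @ E.T + Q))
```

The system is uniquely solvable precisely because $\rho(E) < 1$ keeps the eigenvalues of $E \otimes E$ away from one.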
The stationary matrix $C$ quantifies the spread of the limit of the sequence $\{y_k\}$, as a direct consequence of the Brownian motion. We now look at the directions that maximize the scattering of the points; in other words, we look for the eigenvectors and associated eigenvalues of $C$. The information required for the analysis of the step size is actually contained in the expressions of the eigenvalues $\lambda_i(C)$. The obtained eigenvalues are rational functions of the parameters of the scheme; since their numerators bring little interest for us (as supported further below), we focus on their denominators. We obtained the following expressions:
$$\lambda_i(C) = \frac{N_i(\alpha,\mu,L,\gamma,\sigma)}{D_i(\alpha,\mu,L,\gamma,\sigma)}, \quad 1 \le i \le 4, \qquad (28\text{--}31)$$
with
$$D_1 = D_4 = -L^2\alpha^3\mu - L^2\alpha^2\mu - \gamma L^2\alpha^2 + 2L\alpha^3\mu^2 + 4L\alpha^2\mu^2 + 4\gamma L\alpha^2\mu + 2L\alpha\mu^2 + 8\gamma L\alpha\mu + 2\gamma^2 L\alpha + 4\gamma L\mu + 4\gamma^2 L,$$
$$D_2 = D_3 = \alpha^3\mu^3 + 3\alpha^2\mu^3 + 3\gamma\alpha^2\mu^2 + 2\alpha\mu^3 + 8\gamma\alpha\mu^2 + 2\gamma^2\alpha\mu + 4\gamma\mu^2 + 4\gamma^2\mu.$$
One can observe that: 1. Given $\alpha, L, \mu, \gamma$ positive, the denominators of the eigenvalues $\lambda_2$ and $\lambda_3$ are positive as well, unlike those of $\lambda_1$ and $\lambda_4$, for which some vertical asymptotes may appear; the latter will be studied in more detail further below. Note that, even if some eigenvalues share the same denominator, this is not the case for the numerators; this will be illustrated in Figures 11 and 12 to ease the analysis. 2. Interestingly, the noise volatility defined by the parameter $\sigma$ does not appear within the expressions of the denominators.
This hints that these vertical asymptotes are due to the spectral radius getting close to 1 (discussed further in Appendix A.1.3). Moreover, the parameter $\sigma$ appears only within the numerators and, based on intensive numerical tests, has a pure scaling effect on the eigenvalues $\lambda_i(C)$ when studied w.r.t. $\alpha$, without modifying the trends of the curves. Let us now study in more detail the common denominator of $\lambda_1$ and $\lambda_4$ and seek the critical step size, as a function of $\gamma$, $\mu$ and $L$, at which a vertical asymptote may appear, by solving:
$$-L^2\alpha^3\mu - L^2\alpha^2\mu - \gamma L^2\alpha^2 + 2L\alpha^3\mu^2 + 4L\alpha^2\mu^2 + 4\gamma L\alpha^2\mu + 2L\alpha\mu^2 + 8\gamma L\alpha\mu + 2\gamma^2 L\alpha + 4\gamma L\mu + 4\gamma^2 L = 0$$
$$\equiv \mu(2\mu-L)\alpha^3 + (\mu+\gamma)(4\mu-L)\alpha^2 + (2\mu^2 + 8\gamma\mu + 2\gamma^2)\alpha + 4\gamma(\mu+\gamma) = 0.$$
This polynomial equation in $\alpha$ has three roots:
$$\alpha_1 = \frac{-\gamma-\mu}{\mu}, \qquad \alpha_2 = \frac{\mu+\gamma-\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu}, \qquad \alpha_3 = \frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu}.$$
First, the root $\alpha_1$ is clearly negative given $\gamma, \mu$ positive and can therefore be disregarded. Concerning $\alpha_2$ and $\alpha_3$, these are real roots as soon as:
$$\gamma^2 - 6\gamma\mu + \mu^2 + 4\gamma L \ge 0 \equiv (\gamma-\mu)^2 - 4\gamma\mu + 4\gamma L \ge 0 \equiv (\gamma-\mu)^2 \ge 4\gamma(\mu - L),$$
which is always satisfied since $\gamma > 0$ and $0 < \mu < L$ by hypothesis. Further, the study must clearly include three scenarios:
1. Scenario 1: $L - 2\mu < 0$, or equivalently $\mu > L/2$. Given $\mu$ and $\gamma$ positive by hypothesis, $\alpha_3$ is negative and can be disregarded. It remains to check whether $\alpha_2$ can be positive, which amounts to verifying whether
$$\mu + \gamma - \sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L} < 0 \equiv (\mu+\gamma)^2 < \gamma^2-6\gamma\mu+\mu^2+4\gamma L \equiv \mu < \frac{L}{2},$$
which never holds by hypothesis. Therefore, in the first scenario, there is no positive critical step size at which a vertical asymptote for the eigenvalues may appear.
2. Scenario 2: $L - 2\mu > 0$, or equivalently $\mu < L/2$. Obviously, $\alpha_3$ is positive and hence shall be considered for the analysis of the maximum step size of the NAG-GS method.
It remains to check whether $\alpha_2$ is positive, that is, whether its numerator can be positive. We have seen in the first scenario that the numerator $\mu+\gamma-\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}$ is negative as soon as $\mu < \frac{L}{2}$, which is verified by hypothesis. Therefore, only $\alpha_3$ is positive.
3. Scenario 3: $L - 2\mu = 0$. In this situation, the critical step size is located at $\infty$ and can be disregarded as a potential limitation in our study.
In summary, a potential critical and limiting step size only exists in the case $\mu < L/2$, or equivalently if $\kappa(A) > 2$. In this setting, the critical step size is positive and equal to $\alpha_{crit} = \frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu}$. Figures 9 and 10 display the evolution of the eigenvalues $\lambda_i(C)$, $1 \le i \le 4$, w.r.t. $\alpha$ for the first two scenarios, that is $\mu > L/2$ and $\mu < L/2$. For the first scenario, the parameters $\sigma, \gamma, \mu$ and $L$ have been set to $\{1, 3/2, 1, 3/2\}$ respectively; for the second scenario, to $\{1, 3/2, 1, 3\}$. As expected, one can observe in Figure 9 that no vertical asymptote is present. Furthermore, the $\lambda_i(C)$ seem to converge to some limit point as $\alpha \to \infty$; numerically, we report that this limit point is zero for all the values of $\gamma$ and $\sigma$ considered. Finally, again as predicted by the results of this section, Figure 10 shows the presence of two vertical asymptotes, for the eigenvalues $\lambda_1$ and $\lambda_4$, and none for $\lambda_2$ and $\lambda_3$. Moreover, the critical step size is approximately located at $\alpha = 6$, exactly as predicted by the theoretical formula $\alpha_{crit} = \frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu}$. Finally, one can observe that, beyond the vertical asymptotes, all the eigenvalues converge to some limit points; again, numerically, we report that this limit point is zero for all the values of $\gamma$ and $\sigma$ considered.
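The critical step size can be cross-checked against the cubic above; a sketch using the parameters of the second scenario ($\sigma, \gamma, \mu, L = 1, 3/2, 1, 3$), for which the formula predicts $\alpha_{crit} = 6$:

```python
import math

def denominator_cubic(al, mu, L, g):
    # mu(2mu-L) a^3 + (mu+g)(4mu-L) a^2 + (2mu^2 + 8 g mu + 2 g^2) a + 4 g (mu+g)
    return (mu * (2 * mu - L) * al**3 + (mu + g) * (4 * mu - L) * al**2
            + (2 * mu**2 + 8 * g * mu + 2 * g**2) * al + 4 * g * (mu + g))

def alpha_crit(mu, L, g):
    # only defined (and positive) in the scenario mu < L/2
    return (mu + g + math.sqrt(g**2 - 6 * g * mu + mu**2 + 4 * g * L)) / (L - 2 * mu)

mu, L, g = 1.0, 3.0, 1.5
ac = alpha_crit(mu, L, g)          # expected: 6, the asymptote seen in Figure 10
res = denominator_cubic(ac, mu, L, g)   # must vanish at the critical step size
```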

A.1.3 A CONCLUSION FOR THE 2-DIMENSIONAL CASE

In Appendix A.1.1 and Appendix A.1.2, several theoretical results have been derived to come up with appropriate choices of a constant step size for Algorithm 1. Key insights and interesting values for the step size have been obtained from the study of the spectral radius of the iteration matrix $E$ and through the analysis of the covariance matrix in the asymptotic regime. Let us summarize the theoretical results obtained:
• From the spectral radius analysis of the iteration matrix $E$, two scenarios have been highlighted:
1. case $\gamma = \mu$: the step size minimizing the spectral radius of $E$ is $\alpha = \frac{2\mu+2\sqrt{\mu L}}{L-\mu}$;
2. case $\gamma > \mu$: the step size minimizing the spectral radius of $E$ is $\alpha = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$.
• From the analysis of the covariance matrix $C$ at stationarity: in the case $L - 2\mu > 0$, or equivalently $\mu < L/2$, we have seen that there is a vertical asymptote for two eigenvalues of $C$ at $\alpha_{crit} = \frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu}$, leading to an intractable scattering of the limit points $\{y_k\}_{k\to\infty}$ generated by Algorithm 1. In the case $\mu > L/2$, there is no positive critical step size at which a vertical asymptote for the eigenvalues may appear.
Therefore, for quadratic functions such that $\mu > L/2$, we can safely choose either $\alpha = \frac{2\mu+2\sqrt{\mu L}}{L-\mu}$ when $\gamma = \mu$, or $\alpha = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$ when $\gamma > \mu$, to get the minimal spectral radius of the iteration matrix $E$ and hence the highest contraction rate for the NAG-GS method. For quadratic functions such that $\mu < L/2$, we must show that the NAG-GS method is stable for both step sizes. Let us denote by $\alpha_c \in \big\{\frac{2\mu+2\sqrt{\mu L}}{L-\mu}, \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}\big\}$ the step size values for the two scenarios $\gamma = \mu$ and $\gamma > \mu$. In order to prove stability, we must show that $\rho(E(\alpha_c)) < 1$. Let us start by computing $\alpha$ such that $\rho(E(\alpha)) = 1$.
As proved in Appendix A.1.6, for $\alpha \in [\alpha_c, \infty)$ we have $\rho(E(\alpha)) = -\lambda_4$ with $\lambda_4$ given in equation 20; we then have to compute $\alpha$ such that:
$$-\lambda_4 = -\frac{2\gamma + \alpha\gamma + \alpha\mu - L\alpha^2 + \alpha^2\mu - \alpha\sqrt{L^2\alpha^2 - 2L\alpha^2\mu - 2L\alpha\mu - 2\gamma L\alpha - 4\gamma L + \alpha^2\mu^2 + 2\alpha\mu^2 + 2\gamma\alpha\mu + \mu^2 + 2\gamma\mu + \gamma^2}}{2(\gamma + \alpha\gamma + \alpha\mu + \alpha^2\mu)} = 1.$$
This leads to computing the roots of a quadratic polynomial in $\alpha$, whose positive root is:
$$\alpha = \frac{\gamma + \mu + \sqrt{4L\gamma + \gamma^2 - 6\gamma\mu + \mu^2}}{L - 2\mu},$$
which, not surprisingly, identifies with $\alpha_{crit}$ from the covariance matrix analysis. Furthermore, by recalling that $\rho(E(\alpha))$ is a strictly monotonically increasing function over the interval $[\alpha_c, \infty)$, showing that $\rho(E(\alpha_c)) < 1$ is equivalent to showing that $\alpha_c$ is strictly lower than $\alpha_{crit}$. The formal proof is given in Lemma 2.

Lemma 2 Given $\gamma > 0$, and assuming $0 < \mu < L/2$, then for $\gamma = \mu$ and $\gamma > \mu$ the following inequalities respectively hold:
$$\frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu} > \frac{2\mu+2\sqrt{\mu L}}{L-\mu}, \qquad \frac{\mu+\gamma+\sqrt{\gamma^2-6\gamma\mu+\mu^2+4\gamma L}}{L-2\mu} > \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}. \qquad (36)$$

Proof: Let us focus on the case $\gamma > \mu$. Since $0 < \mu < L/2$ by hypothesis, the second inequality from equation 36 can be written as:
$$(L-\mu)\big(\gamma+\mu+\sqrt{(\gamma-\mu)^2+4\gamma(L-\mu)+4\gamma\mu}\big) - (L-2\mu)\big(\gamma+\mu+\sqrt{(\gamma-\mu)^2+4\gamma L}\big) > 0$$
$$\equiv \gamma\mu + \mu^2 + (L-\mu)\sqrt{\gamma^2+\mu^2+\gamma(4L-6\mu)} + (2\mu-L)\sqrt{(\gamma-\mu)^2+4\gamma L} > 0.$$
Given $\gamma, \mu > 0$, it remains to show that:
$$(L-\mu)\sqrt{\gamma^2+\mu^2+\gamma(4L-6\mu)} + (2\mu-L)\sqrt{(\gamma-\mu)^2+4\gamma L} > 0. \qquad (37)$$
In order to show this, we study the conditions on $\gamma$ under which the left-hand side of equation 37 is positive.
With simple manipulations, one can show that canceling the left-hand side of equation 37 boils down to canceling the following quadratic polynomial in $\gamma$:
$$(2L-3\mu)\gamma^2 + (2\mu^2 - 8L\mu + 4L^2)\gamma + 2L\mu^2 - 3\mu^3 = 0.$$
The two roots are:
$$\gamma_{1,2} = \frac{-\mu^2 - 2L^2 + 4\mu L \mp 2\sqrt{-2\mu^4 + L^4 - 4\mu L^3 + 4\mu^2 L^2 + \mu^3 L}}{2L - 3\mu},$$
which are real and distinct as soon as:
$$-2\mu^4 + L^4 - 4\mu L^3 + 4\mu^2 L^2 + \mu^3 L > 0 \equiv (L-2\mu)(L-\mu)(L^2 - \mu L - \mu^2) > 0,$$
which holds since $0 < \mu < L/2$ by hypothesis (one can easily show that $L^2 - \mu L - \mu^2$ is positive in this setting). Moreover, the denominator $2L - 3\mu$ is strictly positive since $0 < \mu < L/2$. One can check that $\gamma_1$ is negative for all $L > 0$ and $0 < \mu < L/2$ (simply show that $-\mu^2 - 2L^2 + 4\mu L$ is negative), and it can be disregarded since $\gamma$ is positive by hypothesis. Therefore, proving that equation 37 holds is equivalent to showing that:
$$\gamma > \frac{-\mu^2 - 2L^2 + 4\mu L + 2\sqrt{(L-2\mu)(L-\mu)(L^2-\mu L-\mu^2)}}{2L - 3\mu}. \qquad (38)$$
It remains to show that:
$$\mu > \frac{-\mu^2 - 2L^2 + 4\mu L + 2\sqrt{(L-2\mu)(L-\mu)(L^2-\mu L-\mu^2)}}{2L - 3\mu} \equiv 0 > \mu^2 + \sqrt{(L-2\mu)(L-\mu)(L^2-\mu L-\mu^2)} - L^2 + \mu L$$
$$\equiv L^2 - \mu L - \mu^2 > \sqrt{(L-2\mu)(L-\mu)(L^2-\mu L-\mu^2)} \equiv L^2 - \mu L - \mu^2 > (L-2\mu)(L-\mu) \equiv \mu < \frac{2}{3}L,$$
which holds by hypothesis. Since $\gamma > \mu$ by hypothesis, inequality 38 therefore holds as well, conditions satisfied by hypothesis. Finally, at $\gamma = \mu$ the right-hand side of the second inequality in equation 36 reduces to $\frac{2\mu+2\sqrt{\mu L}}{L-\mu}$, and the argument above only requires $\gamma \ge \mu > \gamma_2$; hence the first inequality in equation 36 holds as well. This concludes the proof. □

We conclude this section by discussing several important insights here-under:
• Except for $\alpha_{crit}$, we do not report significant information coming from the analysis of the $\lambda_i(C)$ for the computation of the step size and the validity of the candidates for $\alpha$, that are $\frac{2\mu+2\sqrt{\mu L}}{L-\mu}$ and $\frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$ respectively for the cases $\gamma = \mu$ and $\gamma > \mu$.
• Concerning the effect of the noise volatility $\sigma$: as mentioned earlier, the parameter $\sigma$ appears only within the numerators $N_i(\alpha,\mu,L,\gamma,\sigma)$ and, based on intensive numerical tests, this parameter has a pure scaling effect on the eigenvalues $\lambda_i(C)$ when studied w.r.t. $\alpha$. (Figures 11 and 12 display the numerators $N_i(\alpha,\mu,L,\gamma,\sigma)$ w.r.t. $\sigma$ at $\alpha = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$, with $\gamma = 3/2$, $\mu = 1$, and $L = 3/2$ resp. $L = 3$ for the scenarios $\mu > L/2$ and $\mu < L/2$.)
• The theoretical analysis summarized in this section is valid for the 2-dimensional case; we show in Appendix A.1.4 how to generalize our results to the n-dimensional case. This has no impact on our results.

A.1.4 EXTENSION TO N-DIMENSIONAL CASE

In this section we show that we can easily extend the results gathered for the 2-dimensional case in Appendix A.1.1 and Appendix A.1.2 to the n-dimensional case with $n > 2$. Let us start by recalling that, for the NAG transformation 7, the general SDE system to solve in the quadratic case is:
$$\dot{y}(t) = \begin{pmatrix} -I_{n\times n} & I_{n\times n} \\ \frac{1}{\gamma}(\mu I_{n\times n} - A) & -\frac{\mu}{\gamma} I_{n\times n} \end{pmatrix} y(t) + \begin{pmatrix} 0_{n\times 1} \\ \frac{dZ}{dt} \end{pmatrix}, \quad t > 0. \qquad (39)$$
Recall that $y = (x, v)$ with $x, v \in \mathbb{R}^n$. Let $n$ be even and consider the permutation matrix $P$ that reorders the $2n$ coordinates of $y$ so that each quadruple $(x_{2i-1}, x_{2i}, v_{2i-1}, v_{2i})$, $1 \le i \le n/2$, is grouped together. Since $P$ is a permutation matrix, $P P^T = I$, and equation 39 can equivalently be written as:
$$\dot{y}(t) = P P^T \begin{pmatrix} -I_{n\times n} & I_{n\times n} \\ \frac{1}{\gamma}(\mu I_{n\times n} - A) & -\frac{\mu}{\gamma} I_{n\times n} \end{pmatrix} P P^T y(t) + \begin{pmatrix} 0_{n\times 1} \\ \dot{Z} \end{pmatrix}$$
$$\equiv P^T \dot{y}(t) = P^T \begin{pmatrix} -I_{n\times n} & I_{n\times n} \\ \frac{1}{\gamma}(\mu I_{n\times n} - A) & -\frac{\mu}{\gamma} I_{n\times n} \end{pmatrix} P\, P^T y(t) + P^T \begin{pmatrix} 0_{n\times 1} \\ \dot{Z} \end{pmatrix}. \qquad (40)$$
Since we assumed w.l.o.g. that
$A = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with $\mu = \lambda_1 \le \dots \le \lambda_j \le \dots \le \lambda_n = L$, the permuted system takes the block-diagonal form
$$\frac{d}{dt}\begin{pmatrix} x_{2i-1} \\ x_{2i} \\ v_{2i-1} \\ v_{2i} \end{pmatrix} = \begin{pmatrix} -I_2 & I_2 \\ \frac{1}{\gamma}(\mu I_2 - A_i) & -\frac{\mu}{\gamma} I_2 \end{pmatrix}\begin{pmatrix} x_{2i-1} \\ x_{2i} \\ v_{2i-1} \\ v_{2i} \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \dot{Z}_{2i-1} \\ \dot{Z}_{2i} \end{pmatrix}, \quad 1 \le i \le m, \qquad (41)$$
which boils down to $m = \frac{n}{2}$ independent 2-dimensional SDE systems, where $A_i = \mathrm{diag}(\lambda_{2i-1}, \lambda_{2i})$ with $\lambda_1 = \mu$ and $\lambda_n = L$. Therefore, the $m$ SDE systems can be studied and theoretically solved independently with the schemes and associated step sizes presented in the previous sections. In practice, however, we tackle the full SDE system 39. Let us now use the "decoupled" structure given in equation 41 to come up with a general step size that ensures the convergence of each subsystem, and hence the convergence of the full original system given in equation 39. Let $\alpha_i$ denote the maximum step size for the $i$-th SDE system, $1 \le i \le m = n/2$. For convenience, let us consider the case $\gamma > \mu$; applying the same method as detailed in Appendix A.1.1 and Appendix A.1.2 to compute the expression of $\alpha_i$, we obtain:
$$\alpha_i = \frac{\mu + \gamma + \sqrt{(\mu-\gamma)^2 + 4\gamma\lambda_{2i}}}{\lambda_{2i} - \mu}.$$
Finally, in Theorem 1, we show that choosing $\alpha = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$ ensures the convergence of the NAG-GS method used to solve the SDE system 39 in the n-dimensional case for $n > 2$. Theorem 1 is stated in Section 2.3 and the proof is given hereunder.
Proof: Let us consider the SDE system in the form given by equation 41 and let $\alpha_i = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma\lambda_{2i}}}{\lambda_{2i}-\mu}$ be the step size selected for solving the $i$-th SDE system, $1 \le i \le m = n/2$. In order to prove the convergence of the NAG-GS method with a single step size $\alpha$ such that $0 < \alpha \le \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu}$, it suffices to show that:
$$\alpha = \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu} \le \min_{1\le i\le m=n/2} \alpha_i, \qquad (42)$$
combined with the results from Appendix A.1.3 and from Lemma 2. For proving that equation 42 holds, it is sufficient to show that for any $\lambda$ such that $0 < \mu \le \lambda \le L < \infty$ we have:
$$\frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu} \le \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma\lambda}}{\lambda-\mu}, \qquad (43)$$
which is equivalent to showing:
$$\gamma\Big(\frac{1}{L-\mu} - \frac{1}{\lambda-\mu}\Big) + \mu\Big(\frac{1}{L-\mu} - \frac{1}{\lambda-\mu}\Big) + \frac{\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu} - \frac{\sqrt{(\mu-\gamma)^2+4\gamma\lambda}}{\lambda-\mu} \le 0. \qquad (44)$$
Since $0 < \mu \le \lambda \le L < \infty$ by hypothesis, one can easily show that the first two terms of the last inequality are nonpositive. It remains to show that:
$$\frac{\sqrt{(\mu-\gamma)^2+4\gamma L}}{L-\mu} - \frac{\sqrt{(\mu-\gamma)^2+4\gamma\lambda}}{\lambda-\mu} \le 0$$
$$\equiv (-\gamma^2 - 4\gamma\lambda + 2\gamma\mu - \mu^2)L^2 + (4\gamma\lambda^2 + 2\gamma^2\mu + 2\mu^3)L + \gamma^2\lambda^2 - 2\gamma^2\lambda\mu - 2\gamma\lambda^2\mu + \lambda^2\mu^2 - 2\lambda\mu^3 \le 0.$$
Note that the coefficient of $L^2$ is easily shown to be negative; hence the last inequality is satisfied as soon as $L \le \frac{-\gamma^2\lambda + 2\gamma^2\mu + 2\gamma\lambda\mu - \lambda\mu^2 + 2\mu^3}{\gamma^2 + 4\gamma\lambda - 2\gamma\mu + \mu^2}$ or $L \ge \lambda$. The latter condition is satisfied by hypothesis, which concludes the proof.

Note that one can check that $\frac{-\gamma^2\lambda + 2\gamma^2\mu + 2\gamma\lambda\mu - \lambda\mu^2 + 2\mu^3}{\gamma^2 + 4\gamma\lambda - 2\gamma\mu + \mu^2} \le \lambda$. □

The theoretical results derived in these sections, along with the key insights, are validated in Appendix A.1.5 through numerical experiments conducted for the NAG-GS method in the quadratic case.
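The key inequality of the proof, equation 43, states that $\lambda \mapsto \frac{\mu+\gamma+\sqrt{(\mu-\gamma)^2+4\gamma\lambda}}{\lambda-\mu}$ attains its minimum over $(\mu, L]$ at $\lambda = L$; a quick numerical check (parameter values arbitrary):

```python
import numpy as np

mu, g, L = 1.0, 2.0, 20.0        # arbitrary values with 0 < mu < L and gamma > mu

def alpha_of(lam):
    # per-block step size for a 2x2 subsystem with top eigenvalue lam
    return (mu + g + np.sqrt((mu - g)**2 + 4 * g * lam)) / (lam - mu)

lams = np.linspace(1.5, L, 200)  # sample eigenvalues in (mu, L]
alphas = alpha_of(lams)
global_alpha = alpha_of(L)       # the single step size chosen in Theorem 1
```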

A.1.5 NUMERICAL TESTS FOR QUADRATIC CASE

In this section we report some simple numerical tests for the NAG-GS method (Algorithm 1) used to tackle the accelerated SDE system given in 11, where:
• The objective function is $f(x) = (x - ce)^T A (x - ce)$ with $A \in S^3_+$, $e$ the all-ones vector of dimension 3, and $c$ a positive scalar. For such a strongly convex setting, since the feasible set is $V = \mathbb{R}^3$, the minimizer $\arg\min f$ uniquely exists and is simply equal to $ce$; it will be denoted further by $x^\star$. The matrix $A$ is generated as $A = QDQ^{-1}$, where $D$ is a diagonal matrix of size 3 and $Q$ is a random orthogonal matrix. This test procedure allows us to specify the minimum and maximum eigenvalues of $A$, respectively $\mu$ and $L$, and hence to consider the two scenarios discussed in Appendix A.1.1, namely $\mu > L/2$ and $\mu < L/2$.
• The noise volatility $\sigma$ is set to 1; we report that this corresponds to a significant level of noise.
• The initial parameter $\gamma_0$ is set to $\mu$.
• Different values of the step size $\alpha$ are considered in order to empirically demonstrate the optimality of the choice $\alpha_c$ in terms of contraction rate, to validate the critical values of the step size in the case $\mu < L/2$ and, finally, to highlight the effect of the step size on the scattering of the final iterates generated by NAG-GS around the minimizer of $f$.
From a practical point of view, we consider $m = 200000$ points. For each of them, the NAG-GS method is run for a maximum number of iterations allowing to reach stationarity, and the initial state $x_0$ is generated using a standard Gaussian distribution. Since $f(x)$ is a quadratic function, it is expected that the points converge to some Gaussian distribution around the minimizer $x^\star = ce$. Furthermore, since the initial distribution is also Gaussian, it is expected that the intermediate distributions (at each iteration of the NAG-GS method) are Gaussian as well.
Therefore, in order to quantify the rate of convergence of the NAG-GS method for different values of the step size, we monitor $\|\bar{x}_k - x^\star\|$, that is, the distance between the empirical mean of the distribution at iteration $k$ and the minimizer $x^\star$ of $f$. Figures 13 and 14 respectively show the evolution of $\|\bar{x}_k - x^\star\|$ along the iterations and the final distribution of points obtained by NAG-GS at stationarity for the scenario $\mu > L/2$; for the latter, the points are projected onto the three coordinate planes for a full visualization. As expected from the theory presented in Appendix A.1.3, there is no critical $\alpha$, hence one may choose arbitrarily large values of the step size while the NAG-GS method still converges. Moreover, the choice $\alpha = \alpha_c$ gives the highest rate of convergence. Finally, one can observe that the distribution of the limit points tightens more and more around the minimizer $x^\star$ of $f$ as the chosen step size increases, as expected from the analysis of Figure 9. Hence, one may choose a very large step size $\alpha$ so that the limit points converge to $x^\star$ almost surely, but at the cost of a (much) slower convergence rate: here comes the trade-off between the convergence rate and the scattering of the limit points. Finally, Figures 15 and 16 provide similar results for the scenario $\mu < L/2$. For this scenario, the theory detailed in Appendix A.1.3 and Appendix A.1.4 predicts a critical $\alpha$ beyond which the convergence of NAG-GS is destroyed. In order to illustrate this gradually, different values of $\alpha$ have been chosen within the set $\{\alpha_c, \alpha_c/2, (\alpha_c+\alpha_{crit})/2, 0.98\,\alpha_{crit}\}$. First, one can observe that the choice $\alpha = \alpha_c$ again gives the highest rate of convergence, see Figure 15. Moreover, one can clearly see that for $\alpha \to \alpha_{crit}$, the convergence starts to fail and the spread of the limit points tends to infinity. We report that for $\alpha = \alpha_{crit}$, the NAG-GS method diverges. Again, these numerical results are fully predicted by the theory derived in the previous sections.
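The experiment can be sketched in a few lines. The update below is our reading of Algorithm 1 (semi-implicit Gauss-Seidel step with $\gamma_0 = \mu$, so $\gamma_k = \mu$ throughout); we take $f(x) = \frac{1}{2}(x-ce)^T A (x-ce)$, the $\frac{1}{2}$ normalization being our assumption so that the Hessian eigenvalues are exactly those of $A$, and we set $\sigma = 0$ to isolate the contraction behavior:

```python
import numpy as np

rng = np.random.default_rng(42)
n, c = 3, 2.0
e = np.ones(n)
mu, L = 2.0, 3.0                   # scenario mu > L/2: no critical step size
D = np.diag([mu, 2.5, L])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ D @ Q.T                    # A = Q D Q^{-1} with prescribed spectrum
x_star = c * e
grad = lambda x: A @ (x - x_star)  # gradient of 0.5*(x - ce)^T A (x - ce)

g = mu                             # gamma_0 = mu, hence gamma_k = mu for all k
a = (2 * mu + 2 * np.sqrt(mu * L)) / (L - mu)   # alpha_c, optimal contraction
sigma = 0.0                        # noiseless run to check the contraction rate

x = rng.standard_normal(n)
v = x.copy()
for _ in range(200):
    # Gauss-Seidel step: implicit in x, then reuse x_{k+1} in the v-update
    x = (x + a * v) / (1 + a)
    noise = sigma * np.sqrt(a) * rng.standard_normal(n)
    v = (v + (a * mu / g) * x - (a / g) * grad(x) + noise) / (1 + a * mu / g)
dist = np.linalg.norm(x - x_star)
```

With $\sigma > 0$ the iterates instead fluctuate around $x^\star$ with the stationary covariance studied in Appendix A.1.2.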

A.1.6 MONOTONICITY OF SPECTRAL RADIUS OF E FOR NAG-GS METHOD

Let us recall that on $[\alpha_c, \infty)$, the spectral radius $\rho(E(\alpha))$ is equal to $|\lambda_4|$; the expression of $\lambda_4$ as a function of the parameters of interest for the convergence analysis of the NAG-GS method was given in equation 20 and is recalled here for convenience:
$$\lambda_4 = \frac{2\gamma + \alpha\gamma + \alpha\mu - L\alpha^2 + \alpha^2\mu - \alpha\sqrt{\Delta}}{2(\gamma + \alpha\gamma + \alpha\mu + \alpha^2\mu)}, \qquad (46)$$
with $\Delta = L^2\alpha^2 - 2L\alpha^2\mu - 2L\alpha\mu - 2\gamma L\alpha - 4\gamma L + \alpha^2\mu^2 + 2\alpha\mu^2 + 2\gamma\alpha\mu + \mu^2 + 2\gamma\mu + \gamma^2$. Let us start by showing that $\lambda_4$ is negative on $[\alpha_c, \infty)$. Firstly, one can easily observe that the denominator of $\lambda_4$ is positive; secondly, let us compute the values of $\alpha$ at which the numerator vanishes:
$$2\gamma + \alpha\gamma + \alpha\mu - L\alpha^2 + \alpha^2\mu - \alpha\sqrt{\Delta} = 0$$
$$\equiv -4\gamma^2 - 4\alpha\gamma(\mu+\gamma) + \alpha^2(\gamma^2 - 4\gamma L + 2\gamma\mu + \mu^2) - \alpha^2(\gamma^2 - 4\gamma L + 6\gamma\mu + \mu^2) = 0 \equiv (-4\gamma\mu)\alpha^2 - 4\gamma(\mu+\gamma)\alpha - 4\gamma^2 = 0, \qquad (47)$$
which has no nonnegative root since all its coefficients are negative for $\gamma, \mu > 0$; hence the numerator of $\lambda_4$ keeps a constant sign for $\alpha \ge 0$, which is negative. To ease the analysis, let us decompose $-\lambda_4(\alpha) = t_1(\alpha) + t_2(\alpha)$ such that:
$$t_1(\alpha) = -\frac{2\gamma + \alpha\gamma + \alpha\mu - L\alpha^2 + \alpha^2\mu}{2(\gamma + \alpha\gamma + \alpha\mu + \alpha^2\mu)}, \qquad t_2(\alpha) = \frac{\alpha\sqrt{\Delta}}{2(\gamma + \alpha\gamma + \alpha\mu + \alpha^2\mu)}.$$
Let us now show that $\frac{dt_1(\alpha)}{d\alpha} > 0$ and $\frac{dt_2(\alpha)}{d\alpha} > 0$ for any $L > \mu > 0$. We first obtain:
$$\frac{dt_1(\alpha)}{d\alpha} = \frac{(2\gamma + 2\mu + 4\alpha\mu)(2\gamma + \alpha\gamma + \alpha\mu - L\alpha^2 + \alpha^2\mu)}{(2\gamma + 2\alpha\gamma + 2\alpha\mu + 2\alpha^2\mu)^2} - \frac{\gamma + \mu - 2L\alpha + 2\alpha\mu}{2\gamma + 2\alpha\gamma + 2\alpha\mu + 2\alpha^2\mu} = \frac{(L\alpha^2 + \gamma)(\gamma+\mu) + 2\alpha\gamma(L+\mu)}{2(\alpha+1)^2(\gamma+\alpha\mu)^2},$$
which is strictly positive since $L > \mu > 0$ and $\gamma > 0$ by hypothesis. Furthermore:
$$\frac{dt_2(\alpha)}{d\alpha} = \frac{(\gamma+\mu)(L-\mu)(\alpha^3 L - 3\alpha\gamma) + \alpha^2\big(L(-\gamma^2-\mu^2) + 2\gamma(L^2 - L\mu + \mu^2)\big) + \gamma\big(\gamma^2 - 2\gamma(2L-\mu) + \mu^2\big)}{2(\alpha+1)^2(\alpha\mu+\gamma)^2\sqrt{\alpha^2(L^2 - 2L\mu + \mu^2) - 2\alpha(\gamma+\mu)(L-\mu) + \gamma^2 - 2\gamma(2L-\mu) + \mu^2}}. \qquad (51)$$
The remainder of the demonstration is significantly long and technically heavy in the case $\gamma > \mu$. We therefore limit the last part of the demonstration to the case $\mu = \gamma$, for which we have shown previously that $\alpha_c = \frac{\mu+\gamma+2\sqrt{\gamma L}}{L-\mu} = \frac{2\mu+2\sqrt{\mu L}}{L-\mu}$.
In practice, with respect to the NAG-GS method summarized by Algorithm 1, $\gamma$ quickly decreases to $\mu$ and the equality $\mu = \gamma$ holds for most of the iterations of the algorithm; hence this case is the most important to detail here. The first term of the numerator of equation 51 is positive as soon as $\alpha \ge \sqrt{3\gamma/L}$. In the case $\mu = \gamma$, we determine the conditions under which the sum of the remaining terms of the numerator of equation 51 is positive, that is:
$$\alpha^2\big(L(-2\mu^2) + 2\mu(L^2 - L\mu + \mu^2)\big) + \mu\big(2\mu^2 - 2\mu(2L-\mu)\big) > 0 \equiv \alpha^2\big(L(-2\mu^2) + 2\mu(L^2 - L\mu + \mu^2)\big) > \mu\big(-2\mu^2 + 2\mu(2L-\mu)\big). \qquad (52)$$
First, one can see that both
$$L(-2\mu^2) + 2\mu(L^2 - L\mu + \mu^2) > 0 \quad \text{and} \quad \mu\big(-2\mu^2 + 2\mu(2L-\mu)\big) > 0 \qquad (53)$$
hold as soon as $L > \mu > 0$, which is satisfied by hypothesis. Therefore, these terms of the numerator of equation 51 are positive as soon as
$$\alpha > \sqrt{\frac{\mu\big(-2\mu^2 + 2\mu(2L-\mu)\big)}{L(-2\mu^2) + 2\mu(L^2 - L\mu + \mu^2)}} = \sqrt{\frac{2\mu}{L-\mu}},$$
which exists since $L > \mu > 0$ by hypothesis (the second root of equation 52 being negative). Finally, since $\alpha \in [\alpha_c, \infty)$ by hypothesis, $\frac{dt_2(\alpha)}{d\alpha}$ is positive as soon as:
$$\alpha_c > \sqrt{\frac{3\mu}{L}} \quad \text{and} \quad \alpha_c > \sqrt{\frac{2\mu}{L-\mu}} \qquad (55)$$
hold, with $\alpha_c = \frac{2\mu+2\sqrt{\mu L}}{L-\mu}$. One can easily show that both inequalities hold as soon as $L > \mu > 0$, which is satisfied by hypothesis. This concludes the proof of the strict monotonicity of $\rho(E(\alpha))$ w.r.t. $\alpha$ for $\alpha \in [\alpha_c, \infty)$, assuming $L > \mu > 0$ and $\gamma = \mu$.
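For $\gamma = \mu$, the strict monotonicity of $-\lambda_4$ on $[\alpha_c, \infty)$ can also be checked directly by finite differences; a sketch:

```python
import numpy as np

mu, L = 1.0, 5.0
g = mu                              # gamma = mu, the case treated in the proof

def neg_lam4(al):
    delta = (L**2*al**2 - 2*L*al**2*mu - 2*L*al*mu - 2*g*L*al - 4*g*L
             + al**2*mu**2 + 2*al*mu**2 + 2*g*al*mu + mu**2 + 2*g*mu + g**2)
    # clip tiny negative floating-point residue at alpha = alpha_c, where Delta = 0
    num = 2*g + al*g + al*mu - L*al**2 + al**2*mu - al*np.sqrt(max(delta, 0.0))
    return -num / (2 * (g + al*g + al*mu + al**2*mu))

a_c = (2*mu + 2*np.sqrt(mu*L)) / (L - mu)
grid = np.linspace(a_c, 20 * a_c, 400)
vals = np.array([neg_lam4(al) for al in grid])
increasing = bool(np.all(np.diff(vals) > 0))
```

At $\alpha = \alpha_c$ all four eigenvalues share the modulus $\frac{1}{1+\alpha_c}$, which gives a convenient sanity check on the left endpoint.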

A.2 FULLY-IMPLICIT SCHEME

In this section we present an iterative method based on the NAG transformation $G_{NAG}$ 7 along with a fully implicit discretization to tackle equation 4 in the stochastic setting; the resulting method is referred to as the "NAG-FI" method. We propose the following discretization for equation 6 perturbed with noise; given a step size $\alpha_k > 0$:
$$\frac{x_{k+1} - x_k}{\alpha_k} = v_{k+1} - x_{k+1}, \qquad \frac{v_{k+1} - v_k}{\alpha_k} = \frac{\mu}{\gamma_k}(x_{k+1} - v_{k+1}) - \frac{1}{\gamma_k}A x_{k+1} + \sigma\frac{W_{k+1} - W_k}{\alpha_k}. \qquad (56)$$
As done for the NAG-GS method, from a practical point of view we use $W_{k+1} - W_k = \Delta W_k = \sqrt{\alpha_k}\,\eta_k$ with $\eta_k \sim \mathcal{N}(0,1)$, by the properties of the Brownian motion. In the quadratic case, that is $f(x) = \frac{1}{2}x^T A x$, solving equation 56 is equivalent to solving:
$$\begin{pmatrix} x_k \\ v_k + \sigma\sqrt{\alpha_k}\,\eta_k \end{pmatrix} = \begin{pmatrix} (1+\alpha_k)I & -\alpha_k I \\ \frac{\alpha_k}{\gamma_k}(A - \mu I) & \big(1 + \frac{\alpha_k\mu}{\gamma_k}\big)I \end{pmatrix}\begin{pmatrix} x_{k+1} \\ v_{k+1} \end{pmatrix}, \qquad (57)$$
where $\eta_k \sim \mathcal{N}(0,1)$. Furthermore, the parameter equation 8 is again discretized implicitly:
$$\frac{\gamma_{k+1} - \gamma_k}{\alpha_k} = \mu - \gamma_{k+1}, \qquad \gamma_0 > 0.$$
As done for the NAG-GS method, heuristically, for general $f \in S^{1,1}_{L,\mu}$ with $\mu \ge 0$, we simply replace $A x_{k+1}$ in equation 56 with $\nabla f(x_{k+1})$ and obtain the following NAG-FI scheme:
$$\frac{x_{k+1} - x_k}{\alpha_k} = v_{k+1} - x_{k+1}, \qquad \frac{v_{k+1} - v_k}{\alpha_k} = \frac{\mu}{\gamma_k}(x_{k+1} - v_{k+1}) - \frac{1}{\gamma_k}\nabla f(x_{k+1}) + \sigma\frac{W_{k+1} - W_k}{\alpha_k}.$$
However, it appears that such methods have had limited empirical success when used for training neural networks, compared to well-tuned Stochastic Gradient Descent schemes, see for instance Botev et al. (2017); Zeiler (2012). To the best of our knowledge, no theoretical explanation has been brought forward to formally support these empirical observations; this will be part of our future research directions. Besides these nice preliminary theoretical results and numerical observations for small-dimensional problems, a limitation of the NAG-FI method comes from the numerical feasibility of computing the root of the non-linear equation 61, which can be very challenging in practice. We will try to address this issue in future works.
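In the quadratic case, each NAG-FI step is a single linear solve with the block matrix of equation 57; a minimal sketch with $\sigma = 0$ and constant $\gamma$ (the test problem is arbitrary):

```python
import numpy as np

n = 2
A = np.diag([1.0, 10.0])          # f(x) = 0.5 x^T A x, so mu = 1, L = 10
mu, g, al, sigma = 1.0, 1.0, 5.0, 0.0
I = np.eye(n)

# Block matrix of equation 57: M @ [x_{k+1}; v_{k+1}] = [x_k; v_k + noise]
M = np.block([[(1 + al) * I,            -al * I],
              [(al / g) * (A - mu * I), (1 + al * mu / g) * I]])

rng = np.random.default_rng(1)
x = np.array([1.0, -1.0])
v = x.copy()
for _ in range(300):
    rhs = np.concatenate([x, v + sigma * np.sqrt(al) * rng.standard_normal(n)])
    y = np.linalg.solve(M, rhs)   # fully implicit step: one linear solve
    x, v = y[:n], y[n:]
final_norm = np.linalg.norm(x)
```

Since the continuous NAG flow is stable for strongly convex quadratics, this backward-Euler-type step contracts for any $\alpha > 0$, which is the appeal of the fully implicit scheme.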

B CONVERGENCE TO THE STATIONARY DISTRIBUTION

Another way to study the convergence of the proposed algorithms is to consider the Fokker-Planck equation for the density function $\rho(t,x)$. We consider the simple case of a scalar SDE for the stochastic gradient flow (similarly to equation 11), with $f: \mathbb{R} \to \mathbb{R}$:
$$dx = -\nabla f(x)\,dt + dZ = -\nabla f(x)\,dt + \sigma\,dW, \qquad x(0) \sim \rho(0, x).$$
It is well known that the density function of $x(t) \sim \rho(t,x)$ satisfies the corresponding Fokker-Planck equation:
$$\frac{\partial\rho(t,x)}{\partial t} = \nabla\big(\rho(t,x)\nabla f(x)\big) + \frac{\sigma^2}{2}\Delta\rho(t,x). \qquad (62)$$
For equation 62 one can write down the stationary (as $t \to \infty$) distribution:
$$\rho^*(x) = \lim_{t\to\infty}\rho(t,x) = \frac{1}{Z}\exp\Big(-\frac{2}{\sigma^2}f(x)\Big), \qquad Z = \int_{x\in V}\exp\Big(-\frac{2}{\sigma^2}f(x)\Big)\,dx.$$
It is useful to compare different optimization algorithms in terms of convergence in the probability space, because it allows studying the methods in the non-convex setting. Two problems have to be addressed with this approach. Firstly, we need to specify some distance functional between the current distribution $\rho_t = \rho(t,x)$ and the stationary distribution $\rho^* = \rho^*(x)$. Secondly, we do not have access to the densities $\rho_t, \rho^*$ themselves, only to samples. For the first problem, we consider the following distance functionals between probability distributions in the scalar case:
• Kullback-Leibler divergence. Several studies dedicated to convergence in probability space are available: Arnold et al. (2001); Chewi et al. (2020); Lambert et al. (2022). We used the approach proposed in Pérez-Cruz (2008) to estimate the KL divergence between continuous distributions based on their samples.
• Wasserstein distance. The Wasserstein distance is relatively easy to compute for scalar densities. Also, it was shown that the stochastic gradient process with constant learning rate is exponentially ergodic in the Wasserstein sense, Latz (2021).
• Kolmogorov-Smirnov statistics. We used the two-sample Kolmogorov-Smirnov test for goodness of fit.
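For the quadratic $f(x) = \frac{1}{2}\mu x^2$, the stationary density above is Gaussian with variance $\frac{\sigma^2}{2\mu}$, which is easy to check by Euler-Maruyama sampling; a sketch (parameter values arbitrary; note the $O(h)$ discretization bias of the sampled variance):

```python
import numpy as np

mu, sigma, h = 1.0, 1.0, 0.01     # f(x) = 0.5*mu*x^2, grad f(x) = mu*x
steps, chains = 2000, 5000
rng = np.random.default_rng(0)

x = rng.standard_normal(chains)   # independent chains, arbitrary initialization
for _ in range(steps):
    # Euler-Maruyama step for dx = -grad f(x) dt + sigma dW
    x = x - mu * x * h + sigma * np.sqrt(h) * rng.standard_normal(chains)

target_var = sigma**2 / (2 * mu)  # variance of rho*(x) = exp(-2 f(x)/sigma^2)/Z
sample_var = x.var()
sample_mean = x.mean()
```

The many-chains setup mirrors the "bunch of different independent initializations" strategy used below for the empirical stationary distributions.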
To the best of our knowledge, an explicit formula for the stationary distribution of the Fokker-Planck equation for the ASG SDE equation 11 remains unknown. That is why we decided to draw samples from the empirical stationary distributions by integrating the corresponding SDE with the Euler-Maruyama scheme Maruyama (1955), using a small enough step size and a set of independent initializations. We tested two functions, which are presented in Figure 18. We initially generated 100 points uniformly in the function domain. Then we independently solved the initial value problem in equation 11 for each of them with the Euler-Maruyama scheme Maruyama (1955). The results of the integration are presented in Figure 19. One can see that in the relatively easy scenario (Figure 18a), NAG-GS converges to its stationary distribution faster than the gradient flow (Figure 19a), while in the hard scenario (Figure 18b), NAG-GS is more robust to large step sizes (Figure 19b).
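A minimal Euler-Maruyama integrator of the gradient flow SDE, in the spirit of this experiment; the potential f(x) = log(cosh(x)) and all step-size and noise parameters below are illustrative stand-ins, not the exact functions of Figure 18:

```python
import math
import random

def euler_maruyama(grad_f, x0, sigma, dt, n_steps, rng):
    """Integrate dx = -grad_f(x) dt + sigma dW with the Euler-Maruyama scheme."""
    x = x0
    for _ in range(n_steps):
        x += -grad_f(x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

# 100 uniformly spread initializations, each integrated independently;
# for f(x) = log(cosh(x)), grad_f(x) = tanh(x).
rng = random.Random(0)
finals = [
    euler_maruyama(math.tanh, x0, sigma=0.5, dt=0.1, n_steps=2000, rng=rng)
    for x0 in [-5.0 + 0.1 * i for i in range(100)]
]
```

The histogram of `finals` approximates ρ*(x) ∝ exp(−2 log(cosh(x))/σ²) = cosh(x)^(−8) for σ = 0.5; rerunning the same protocol with the NAG-GS iteration in place of the Euler step yields the comparison of Figure 19.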

C ADDITIONAL EXPERIMENTAL DETAILS

In this section we provide additional experimental details. In particular, we describe our experimental setup in more detail and give some further insights about NAG-GS. Our computational resources are limited to a single Nvidia DGX-1 with 8 Nvidia V100 GPUs. Almost all experiments were carried out on a single GPU; the only exception is the training of ResNet50 on ImageNet, which used all 8 GPUs.

C.1 FINE-TUNING ROBERTA ON GLUE

After completing experiments on toy models, we applied NAG-GS to the training of large neural networks that could be used in practice. We performed a grid search to come up with an optimal learning rate α for the RoBERTa model (see Table 4). It is worth mentioning that there is a stability region for NAG-GS which is visible to the naked eye; we highlighted in blue some cells of Table 4 to support this claim. Furthermore, based on these simple experiments, it appears that choosing α ∼ 10^{-1} is a good default for a wide variety of tasks. In the subsequent experiments we limited ourselves to RoBERTa's training on small GLUE tasks in order to study the relation between α and γ in the context of high-dimensional non-convex objective functions (see Figure 20). One can see that there is an oblong area in which NAG-GS converges quite well. The ridge of this area can be empirically described by the linear equation log α = a log γ + b in the logarithmic domain. In Appendix C.2, we perform a similar study for other types of models. In Section 3.5 we mentioned that the lowest eigenvalues µ of the approximated Hessian matrices evaluated during the training of the ResNet20 model were negative. Furthermore, our theoretical analysis of NAG-GS in the convex case includes some conditions on the optimizer parameters α, γ, and µ; in particular, it requires µ > 0 and γ ≥ µ. In order to bring some insights about these remarks in the non-convex setting, and inspired by Velikanov et al. (2022), we experimentally study the convergence regions of NAG-GS and sketch the phase diagrams of convergence for different projection planes, see Figure 21. We consider the same setup as in Section 3.3 and use the hyperparameter optimization library OPTUNA Akiba et al. (2019). Our preliminary experiments on RoBERTa (see Appendix C.1) show that α should be of magnitude 10^{-1}.
With the estimate of the Hessian spectrum of ResNet20, we define the following search space: α ∼ LogUniform(10^{-2}, 10^{2}), γ ∼ LogUniform(10^{-2}, 10^{2}), µ ∼ Uniform(-10, 100). We sample a fixed number of triples and train the ResNet20 model on CIFAR-10. The objective function is the top-1 classification error. We report that there is convergence almost everywhere within the search space projected onto the α-γ plane (see Figure 21). The analysis of the projections onto the α-µ and γ-µ planes brings different conclusions: there are regions of convergence for negative µ for some α < α_th and γ > γ_th. Also, as mentioned above, there is a subdomain of negative µ comparable to the domain of positive µ in terms of the target metrics. Moreover, the majority of sampled points are located in the vicinity of the band λ_min < µ < λ_max.
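The sampling of this search space is simple to reproduce. Below is a hedged sketch in which `log_uniform` mirrors the role of OPTUNA's log-scaled `suggest_float`, and `toy_objective` is a hypothetical smooth stand-in for the top-1 error (the real objective requires training ResNet20):

```python
import math
import random

def log_uniform(rng, lo, hi):
    """Draw from LogUniform(lo, hi): uniform in log-space."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def toy_objective(alpha, gamma, mu):
    """Hypothetical stand-in for the top-1 classification error (NOT the real metric):
    a smooth bowl in (log alpha, log gamma) with a weak dependence on mu."""
    return ((math.log10(alpha) + 1.0) ** 2
            + (math.log10(gamma) - 0.5) ** 2
            + 1e-3 * (mu - 1.0) ** 2)

# alpha, gamma ~ LogUniform(1e-2, 1e2); mu ~ Uniform(-10, 100), as in the text.
rng = random.Random(0)
trials = [
    (log_uniform(rng, 1e-2, 1e2), log_uniform(rng, 1e-2, 1e2), rng.uniform(-10.0, 100.0))
    for _ in range(128)
]
best = min(trials, key=lambda t: toy_objective(*t))
```

In the actual experiments, each sampled triple (α, γ, µ) is evaluated by a full training run rather than by a closed-form objective.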



CONCLUSIONS AND FURTHER WORKS

We have presented a new and theoretically motivated stochastic optimizer called NAG-GS. It stems from a semi-implicit Gauss-Seidel type discretization of a well-chosen accelerated Nesterov-like SDE. These building blocks ensure two central properties for NAG-GS: (1) the ability to accelerate the optimization process and (2) great robustness to the selection of hyperparameters, in particular the choice of the learning rate. We demonstrate these features theoretically and provide a detailed analysis of the convergence of the method in the quadratic case. Moreover, we show that NAG-GS is competitive with state-of-the-art methods on a wide variety of stochastic optimization problems of increasing complexity and dimension, from the logistic regression model to the training of large machine learning models such as ResNet20, ResNet50, and Transformers. In all tests, NAG-GS demonstrates better or competitive performance compared with standard optimizers. Further work will focus on the theoretical analysis of NAG-GS in the non-convex case and on the derivation of efficient and tractable higher-order methods based on the fully implicit discretization of the accelerated Nesterov-like SDE.

This explains why the critical α does not involve σ: the singularity is due to the spectral radius reaching the value 1.

https://github.com/user/nag-gs



Figure 4: Evaluation of NAG-GS with SGD-MW on ResNet-20 (left) and ResNet-50 (right) on CIFAR-10 and ImageNet respectively.

Figure 6: Evolution of the extreme eigenvalues (the largest and the smallest ones) during the training of ResNet20 on CIFAR-10 with NAG-GS.

Zhang et al. (2018) investigate the most appropriate discretization techniques for different classes of ODEs. A similar analysis for stochastic first-order methods is presented in Laborde & Oberman (2020); Malladi et al. (2022).

shows the evolution of the absolute values of the eigenvalues of the iteration matrix E w.r.t. α for such a setting. 2. As soon as γ < µ, one can easily show that λ_1(α) < λ_2(α). Therefore the step size α with the minimal spectral radius is such that |λ_4(α)| = |λ_2(α)|. One can show that the equality holds for α =

Figure 7: Evolution of the absolute values of λ_i w.r.t. α; µ = γ.

Figure 9: Evolution of λ_i(C) w.r.t. α for the scenario µ > L/2; σ = 1, γ = 3/2, µ = 1, L = 3/2.

Figure 10: Evolution of λ_i(C) w.r.t. α for the scenario µ < L/2; σ = 1, γ = 3/2, µ = 1, L = 3.

Figure 11: Evolution of N_i(α, µ, L, γ, σ) w.r.t. σ for the scenario µ > L/2; γ = 3/2, µ = 1, L = 3/2,

Figure 12: Evolution of N_i(α, µ, L, γ, σ) w.r.t. σ for the scenario µ < L/2; γ = 3/2, µ = 1, L = 3,

Figure 13: Evolution of ∥x_k − x^⋆∥ along iterations for the scenario µ > L/2; c = 5, γ = µ = 1, L = 1.9 and σ = 1, for α ∈ {α_c, α_c/2, 2α_c, 10α_c} with α_c = (2µ + 2√(µL))/(L − µ) = 5.29.

Two pits function: f_1(x) = (1/50)(2 log(cosh(x)) − 5).

Figure 19: Convergence in probability of Euler integration of the Gradient Flow (GF Euler) and NAG-GS for the non-convex scalar problems.

(a) Projection in XY plane. (b) Projection in Y Z plane. (c) Projection in XZ plane.

Figure 14: Initial (blue crosses) and final (red circles) distributions of points generated by the NAG-GS method for the scenario µ > L/2; c = 5, γ = µ = 1, L = 1.9 and σ = 1, for α ∈ {α_c, α_c/2, 2α_c, 10α_c} with α_c = (2µ + 2√(µL))/(L − µ) = 5.29.

Figure 15: Evolution of ∥x_k − x^⋆∥ along iterations for the scenario µ < L/2; c = 5, γ = µ = 1, L = 3 and σ = 1, for α ∈ {α_c, α_c/2, (α_c + α_crit)/2, 0.98α_crit} with α_c = (2µ + 2√(µL))/(L − µ) = 2.73 and α_crit = (µ + γ + √(γ² − 6γµ + µ² + 4γL))/(L − 2µ) = 4.83.

(a) Projection in XY plane. (b) Projection in Y Z plane. (c) Projection in XZ plane.

Figure 16: Initial (blue crosses) and final (red circles) distributions of points generated by the NAG-GS method for the scenario µ < L/2; c = 5, γ = µ = 1, L = 3 and σ = 1, for α ∈ {α_c, α_c/2, (α_c + α_crit)/2, 0.98α_crit} with α_c = (2µ + 2√(µL))/(L − µ) = 2.73 and α_crit = (µ + γ + √(γ² − 6γµ + µ² + 4γL))/(L − 2µ) = 4.83.



w.r.t. α without modifying the trends of the curves. For completeness, Figures 11 and 12 respectively show the evolution of the numerators N_i(α, µ, L, γ, σ) of the eigenvalue expressions of C given in equations 28 to 31 w.r.t. σ, for the scenarios µ > L/2 and µ < L/2. One can observe a monotonically increasing polynomial behavior of N_i(α, µ, L, γ, σ) w.r.t. σ for all 1 ≤ i ≤ 4.

one can easily see that equation 40 has the structure:

Table 4: Performance in fine-tuning on the GLUE benchmark for different learning rates α, fixed µ = 1, and factor γ ∈ {1, 1.5}. The largest metric values for the NAG-GS optimizer are highlighted in bold. The performance metric is accuracy for all tasks, except Matthews correlation for COLA and Pearson correlation for STS-B. A dash in a cell means the run is absent. Higher is better.

Table 5: Comparison of a single step duration for different optimizers with ResNet20 on CIFAR-10 (columns: OPTIMIZER, MEAN (S), VARIANCE (S), REL. MEAN, REL. VARIANCE). Adam-like optimizers have a state twice as large as momentum SGD or NAG-GS.

In our work we implemented NAG-GS in PyTorch Paszke et al. (2017) and JAX Bradbury et al. (2018); Babuschkin et al. (2020). Both implementations are used in our experiments and are available online 2. According to Algorithm 1, the size of the NAG-GS state equals the number of optimization parameters, which makes NAG-GS comparable to SGD with momentum. It is worth noting that Adam-like optimizers have a state twice as large as NAG-GS. The arithmetic complexity of NAG-GS is linear, O(n), in the number of parameters. Table 5 shows a comparison of the computational efficiency of common optimizers used in practice. Although the forward pass and gradient computations usually dominate a training step, there are settings where the efficiency of gradient updates matters (e.g., when the batch size or the number of intermediate activations is small with respect to the number of parameters).
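The state-size comparison can be made concrete with a back-of-the-envelope count of auxiliary buffers per parameter (an illustrative sketch of our own; sizes are in stored floats, excluding the parameters themselves):

```python
def optimizer_state_floats(n_params, optimizer):
    """Auxiliary optimizer state in floats: momentum SGD and NAG-GS keep one buffer
    per parameter (the velocity), while Adam-like optimizers keep two
    (first- and second-moment estimates)."""
    buffers_per_param = {"sgd-momentum": 1, "nag-gs": 1, "adamw": 2}
    return buffers_per_param[optimizer] * n_params
```

For a ResNet20-scale model with roughly 0.27M parameters, AdamW thus stores about 0.54M extra floats, versus about 0.27M for NAG-GS or momentum SGD.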


From the first equation, we get v_{k+1} = (x_{k+1} − x_k)/α_k + x_{k+1}, which we substitute into the second equation. Computing x_{k+1} is then equivalent to computing a fixed point of the operator given by the right-hand side of equation 60; hence it is also equivalent to finding the root of the function g : R^n → R^n of equation 61. In order to compute this root, we consider a classical Newton-Raphson procedure detailed in Algorithm 2 (Newton-Raphson method). In Algorithm 2, J_g(·) denotes the Jacobian of the function g of equation 61 w.r.t. u, I_n denotes the identity matrix of size n, and ∇²f designates the Hessian matrix of the objective function f. Note that the iterative method detailed in Algorithm 2 corresponds to a particular variant of second-order methods known as "Levenberg-Marquardt" Levenberg (1944); Marquardt (1963), applied to the unconstrained minimization problem min_{x∈R^n} f(x) for a twice-differentiable function f. Finally, the NAG-FI method is summarized in Algorithm 3.

Algorithm 3 NAG-FI Method

Compute the root u of equation 61 by using Algorithm 2
Set x_{k+1} = u
end for

By following a stability analysis similar to the one performed for NAG-GS, one can show that this method is unconditionally A-stable, as expected from the theory of implicit schemes. In particular, one can show that the eigenvalues of the iteration matrix are positive decreasing functions w.r.t. the step size α, thus allowing the choice of any positive value for α. Similarly, one can show that the eigenvalues of the covariance matrix at stationarity associated with the NAG-FI method are decreasing functions w.r.t. α that tend to 0 as α → ∞. This implies that Algorithm 3 is theoretically able to generate iterates that converge to arg min f almost surely, even in the stochastic setting, with a potentially quadratic rate of convergence. This theoretical result is highlighted in Figure 17, which shows the final distribution of points generated by NAG-FI in the test setup detailed in Appendix A.1.5, in the most interesting and critical scenario µ < L/2. As expected, α can be chosen as large as desired; we choose here α = 1000α_c. Moreover, for increasing values of α, the final distributions of points are more and more concentrated on arg min f. Therefore, the NAG-FI method constitutes a good basis for deriving efficient second-order methods for tackling stochastic optimization problems, which are hard to find in the current state of the art. Indeed, second-order methods, and more generally some variants of preconditioned gradient methods, have recently been proposed and used in the deep learning community, for instance for the training of neural networks.
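In the scalar case, one step of the NAG-FI method can be sketched directly: substituting v_{k+1} = (x_{k+1} − x_k)/α_k + x_{k+1} into the velocity update turns the step into a root-finding problem g(u) = 0 for u = x_{k+1}, solved below with plain Newton iterations (the potential f(x) = x⁴/4 + x²/2 and all parameter values are our own illustrative choices):

```python
import math
import random

def nag_fi_step_scalar(x, v, df, d2f, mu, gamma, alpha, sigma, rng, n_newton=20):
    """One fully implicit NAG-FI step in 1D via Newton-Raphson.

    The position update gives v(u) = (u - x)/alpha + u; plugging it into the
    velocity update yields the scalar root-finding problem g(u) = 0 with
        g(u) = v(u) - v - (alpha*mu/gamma)*(u - v(u)) + (alpha/gamma)*df(u) - noise."""
    noise = sigma * math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    vel = lambda u: (u - x) / alpha + u
    g = lambda u: (vel(u) - v - (alpha * mu / gamma) * (u - vel(u))
                   + (alpha / gamma) * df(u) - noise)
    # g'(u): d vel/du = 1/alpha + 1 and d(u - vel(u))/du = -1/alpha.
    dg = lambda u: (1.0 / alpha + 1.0) + mu / gamma + (alpha / gamma) * d2f(u)
    u = x
    for _ in range(n_newton):
        u -= g(u) / dg(u)
    return u, vel(u)

# Noiseless run on f(x) = x**4/4 + x**2/2 (df = x**3 + x, d2f = 3*x**2 + 1):
# the implicit scheme remains stable even with a very large step size.
rng = random.Random(0)
x, v = 2.0, 0.0
for _ in range(30):
    x, v = nag_fi_step_scalar(
        x, v,
        df=lambda t: t ** 3 + t,
        d2f=lambda t: 3 * t ** 2 + 1.0,
        mu=1.0, gamma=1.0, alpha=100.0, sigma=0.0, rng=rng,
    )
```

Since dg(u) > 0 whenever f is convex, g is strictly increasing and the Newton iteration is well posed; in n dimensions the division by dg becomes a linear solve with the Jacobian J_g, as in Algorithm 2.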

