FIRST-ORDER OPTIMIZATION ALGORITHMS VIA DISCRETIZATION OF FINITE-TIME CONVERGENT FLOWS

Abstract

In this paper we study the performance of several discretization algorithms for two first-order finite-time optimization flows, namely the rescaled-gradient flow (RGF) and the signed-gradient flow (SGF). These flows are non-Lipschitz or discontinuous dynamical systems that converge locally in finite time to the minima of gradient-dominated functions. We introduce three discretization methods for these first-order finite-time flows and provide convergence guarantees. We then apply the proposed algorithms to the training of neural networks and empirically test their performance on three standard datasets, namely CIFAR10, SVHN, and MNIST. Our results show that the proposed schemes converge faster than standard optimization alternatives, while achieving equivalent or better accuracy.

1. INTRODUCTION

Consider the unconstrained minimization problem for a given cost function $f : \mathbb{R}^n \to \mathbb{R}$. When $f$ is sufficiently regular, the standard algorithm in continuous time (dynamical system) is the gradient flow (GF)

$\dot{x} = F_{\mathrm{GF}}(x) \triangleq -\nabla f(x)$, (1)

with $\dot{x} \triangleq \frac{d}{dt}x(t)$. Generalizing the GF, the $q$-rescaled GF ($q$-RGF) Wibisono et al. (2016), given by

$\dot{x} = -c \, \dfrac{\nabla f(x)}{\|\nabla f(x)\|_2^{\frac{q-2}{q-1}}}$, (2)

with $c > 0$ and $q \in (1, \infty]$, has an asymptotic convergence rate $f(x(t)) - f(x^\star) = O\!\left(\frac{1}{t^{q-1}}\right)$ under mild regularity, for $\|x(0) - x^\star\| > 0$ small enough, where $x^\star \in \mathbb{R}^n$ denotes a local minimizer of $f$. However, we recently proved Romero & Benosman (2020) that the $q$-RGF, as well as our proposed $q$-signed GF ($q$-SGF)

$\dot{x} = -c \, \|\nabla f(x)\|_1^{\frac{1}{q-1}} \, \mathrm{sign}(\nabla f(x))$, (3)

where $\mathrm{sign}(\cdot)$ denotes the sign function, applied element-wise for (real-valued) vectors, are both finite-time convergent, provided that $f$ is gradient dominated of order $p \in (1, q)$. In particular, if $f$ is strongly convex, then the $q$-RGF and $q$-SGF are finite-time convergent for any $q \in (2, \infty]$, since $f$ must be gradient dominated of order $p = 2$.
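As a concrete reading of the right-hand sides of (2) and (3), the two vector fields can be sketched in a few lines of NumPy (a minimal sketch; the function names and the zero-gradient guard are our own, not from the paper):

```python
import numpy as np

def rgf_field(grad, c=1.0, q=3.0):
    """q-RGF right-hand side: -c * grad / ||grad||_2^((q-2)/(q-1))."""
    norm2 = np.linalg.norm(grad, 2)
    if norm2 == 0.0:  # the flow is non-Lipschitz at stationary points; guard the division
        return np.zeros_like(grad)
    return -c * grad / norm2 ** ((q - 2.0) / (q - 1.0))

def sgf_field(grad, c=1.0, q=3.0):
    """q-SGF right-hand side: -c * ||grad||_1^(1/(q-1)) * sign(grad)."""
    norm1 = np.linalg.norm(grad, 1)
    return -c * norm1 ** (1.0 / (q - 1.0)) * np.sign(grad)
```

Note that for $q$ close to 2 both fields reduce to (a rescaling of) the plain gradient, consistent with the GD-like behavior reported in the experiments.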

CONTRIBUTION

In this paper, we explore three discretization schemes for the q-RGF (2) and q-SGF (3) and provide some convergence guarantees using results from hybrid dynamical control theory. In particular, we explore a forward-Euler discretization of RGF/SGF, followed by an explicit Runge-Kutta discretization, and finally a novel Nesterov-like discretization. We then test their performance on both synthetic and real-world data in the context of deep learning, namely, over the well-known datasets CIFAR10, SVHN, and MNIST.

RELATED WORK

Propelled by the work of Wang & Elia (2011) and Su et al. (2014), there has been a recent and significant research effort dedicated to analyzing optimization algorithms from the perspective of dynamical systems and control theory, especially in continuous time Wibisono et al. (2016); Wilson (2018); Lessard et al. (2016); Fazlyab et al. (2017b); Scieur et al. (2017); França et al. (2018); Fazlyab et al. (2018); Taylor et al. (2018); França et al. (2019a); Orvieto & Lucchi (2019); Muehlebach & Jordan (2019). A major focus within this initiative is acceleration, both in terms of gaining new insight into more traditional optimization algorithms from this perspective, and of exploiting the interplay between continuous-time systems and their potential discretizations for novel algorithm design Muehlebach & Jordan (2019); Fazlyab et al. (2017a); Shi et al. (2018); Zhang et al. (2018); França et al. (2019b); Wilson et al. (2019). Many of these papers also focus on explicit mappings and matchings of convergence rates from the continuous-time domain into discrete time. For older work connecting ordinary differential equations (ODEs) and their numerical analysis with optimization algorithms, see Botsaris (1978a;b); Zghier (1981); Snyman (1982; 1983); Brockett (1988); Brown (1989). In Helmke & Moore (1994), the authors studied relationships between linear programming, ODEs, and general matrix theory. Further, Schropp (1995) and Schropp & Singer (2000) explored several aspects linking nonlinear dynamical systems to gradient-based optimization, including nonlinear constraints. Tools from Lyapunov stability theory are often employed for this purpose, mainly because the nonlinear systems and control theory community has already developed a rich body of work around them.
In particular, previous works typically seek asymptotically Lyapunov stable gradient-based systems with an equilibrium (stationary point) at an isolated extremum of the given cost function, thus certifying local convergence. Naturally, global asymptotic stability leads to global convergence, though such an analysis will typically require the cost function to be strongly convex everywhere. For physical systems, a Lyapunov function can often be constructed from first principles via some physically meaningful measure of energy (e.g., total energy = potential energy + kinetic energy). In optimization, the situation is somewhat similar in the sense that a suitable Lyapunov function may often be constructed by taking simple surrogates of the objective function as candidates. For instance, $V(x) \triangleq f(x) - f(x^\star)$ can be a good initial candidate. Further, if $f$ is continuously differentiable and $x^\star$ is an isolated stationary point, then another alternative is $V(x) \triangleq \|\nabla f(x)\|^2$. However, most fundamental and applied research conducted in systems and control regarding Lyapunov stability theory deals exclusively with continuous-time systems. Unfortunately, (dynamical) stability properties are generally not preserved under simple forward-Euler and sample-and-hold discretizations of dynamics and control laws Stuart & Humphries (1998). Furthermore, practical implementations of optimization algorithms in modern digital computers demand discrete time. Nonetheless, it has been extensively noted that a vast amount of general Lyapunov-based results appear to have a discrete-time equivalent. In that sense, we aim here to start from the q-RGF and q-SGF continuous flows, characterized by their Lyapunov-based finite-time convergence, and seek discretization schemes which allow us to 'shadow' the solutions of these flows in discrete time, hoping to achieve an acceleration of the discrete methods inspired by the finite-time convergence characteristics of the continuous flows.

2. OPTIMIZATION ALGORITHMS AS DISCRETE-TIME SYSTEMS

Generalizing (1), (2), and (3), consider a continuous-time algorithm (dynamical system) modeled via an ordinary differential equation (ODE)

$\dot{x} = F(x)$ (4)

for $t \geq 0$, or, more generally, a differential inclusion

$\dot{x}(t) \in \mathcal{F}(x(t))$ (5)

a.e. $t \geq 0$ (e.g., for the $q = \infty$ case), such that $x(t) \to x^\star$ as $t \to t^\star$. In the case of the q-RGF (2) and q-SGF (3) for $f$ gradient dominated of order $p \in (1, q)$, we have finite-time convergence, and thus $t^\star = t^\star(x(0)) < \infty$. Most of the popular numerical optimization schemes can be written in a state-space form (i.e., recursively), as

$X_{k+1} = F_d(k, X_k)$ (6a)
$x_k = G(X_k)$ (6b)

for $k \in \mathbb{Z}_+ \triangleq \{0, 1, 2, \ldots\}$ and a given $X_0 \in \mathbb{R}^m$ (typically $m \geq n$), where $F_d : \mathbb{Z}_+ \times \mathbb{R}^m \to \mathbb{R}^m$ and $G : \mathbb{R}^m \to \mathbb{R}^n$. Naturally, (6) can be seen as a discrete-time dynamical system constructed by discretizing (4) in time. In particular, we have $x_k \approx x(t_k)$, where $\{0 = t_0 < t_1 < t_2 < \ldots\}$ denotes a time partition and $x(\cdot)$ a solution to (4) or (5) as appropriate. Therefore, we call $X_k$ and $x_k$, respectively, the state and output at time step $k$. Whenever $F_d(k, X)$ does not depend on $k$, we will drop $k$ and thus write $F_d(X)$ instead. Whenever $m = n$, we will take $G(X) \triangleq X$ and replace $X$ and $X_k$ by $x$ and $x_k$, respectively. Example 1. The standard gradient descent (GD) algorithm

$x_{k+1} = x_k - \eta \nabla f(x_k)$ (7)

with step size (learning rate) $\eta > 0$ can be readily written in the form (6) by taking $m = n$, $F_d(x) \triangleq x - \eta \nabla f(x)$, and $G(x) \triangleq x$. • If the step sizes are adaptive, i.e., if we replace $\eta$ by a sequence $\{\eta_k\}$ with $\eta_k > 0$, then we only need to replace $F_d(k, x) \triangleq x - \eta_k \nabla f(x)$, provided that $\{\eta_k\}$ is not computed using feedback from $\{x_k\}$ (e.g., through a line search method).
• If we do wish to use feedback (and no memory past the most recent output and step size), then we can set $m = n + 1$, $G([x; \eta]) \triangleq x$, and $F_d([x; \eta]) \triangleq [F_d^{(1)}([x; \eta]); F_d^{(2)}([x; \eta])]$, where $F_d^{(1)}([x; \eta]) \triangleq x - \eta \nabla f(x)$ and $F_d^{(2)}$ is a user-defined function that dictates the updates in the step size. In particular, an open-loop (no feedback) adaptive step size $\{\eta_k\}$ may also be achieved under this scenario, provided that it is possible to write $\eta_{k+1} = F_d^{(2)}(\eta_k)$. • If we wish to use individual step sizes for each of the $n$ components of $\{x_k\}$, then it suffices to take $\eta_k$ as an $n$-dimensional vector (thus $m = 2n$), and make appropriate changes in $F_d$ and $G$. In each of these cases, GD can be seen as a forward-Euler discretization of the GF (1), i.e., $x_{k+1} = x_k + \Delta t_k F_{\mathrm{GF}}(x_k)$ with $F_{\mathrm{GF}} = -\nabla f$ and adaptive time step $\Delta t_k \triangleq t_{k+1} - t_k$ chosen as $\Delta t_k = \eta_k$. Example 2. The proximal point algorithm (PPA)

$x_{k+1} = \arg\min_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\eta_k} \|x - x_k\|^2 \right\}$ (9)

with step size $\eta_k > 0$ (open loop, for simplicity) can also be written in the form (6), by taking $m = n$, $F_d(k, x) \triangleq \arg\min_{x' \in \mathbb{R}^n} \{f(x') + \frac{1}{2\eta_k}\|x' - x\|^2\}$, and $G(x) \triangleq x$. Naturally, we need to assume sufficient regularity for $F_d(k, x)$ to exist, and we must design a consistent way to choose $F_d(k, x)$ when multiple minimizers exist in its definition. Alternatively, these conditions must be satisfied, at the very least, at every $(k, x) \in \{(0, x_0), (1, x_1), (2, x_2), \ldots\}$ for a particular chosen initial $x_0 \in \mathbb{R}^n$. Assuming sufficient regularity, we have $\nabla_x \{f(x) + \frac{1}{2\eta_k}\|x - x_k\|^2\}|_{x = x_{k+1}} = 0$, and thus

$\nabla f(x_{k+1}) + \frac{1}{\eta_k}(x_{k+1} - x_k) = 0 \iff x_{k+1} = x_k + \Delta t_k F_{\mathrm{GF}}(x_{k+1})$

with $\Delta t_k = \eta_k$, which is precisely the backward-Euler discretization of the GF (1). Example 3.
The Nesterov accelerated gradient descent (N-AGD)

$y_k = x_k + \beta_k (x_k - x_{k-1})$ (11a)
$x_{k+1} = y_k - \eta_k \nabla f(y_k)$ (11b)

with step size $\eta_k > 0$ and momentum coefficient $\beta_k > 0$ (both open loop, for simplicity) can be written in the form (6) by taking $m = 2n$,

$F_d(k, [y; x]) \triangleq \begin{bmatrix} (1 + \beta_{k+1})(y - \eta_k \nabla f(y)) - \beta_{k+1} x \\ y - \eta_k \nabla f(y) \end{bmatrix}$,

and $G([y; x]) \triangleq x$ for $y, x \in \mathbb{R}^n$. In other words, $X_k = [y_k; x_k]$. Traditionally, $\beta_k = \frac{k-1}{k+2}$, but clearly, if we set $\eta_k = \eta > 0$ and $\beta_k = \beta \in (0, 1)$ (in practice, $\eta \approx 0$ and $\beta \approx 1$), then we can drop $k$ from $F_d(k, [y; x])$. There exist a few approaches in the literature on the interpretation of N-AGD (11) as the discretization of a second-order continuous-time dynamical system, namely via a vanishing step size argument Su et al. (2014).
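One update of Example 3 in the state-space form above can be sketched in a few lines (a minimal sketch; the helper name `nagd_step` is ours, and the gradient oracle is passed in as a function):

```python
def nagd_step(y, x, grad_f, eta, beta):
    """One N-AGD step with state X_k = [y_k; x_k]:
    x_{k+1} = y_k - eta*grad_f(y_k),  y_{k+1} = x_{k+1} + beta*(x_{k+1} - x_k),
    matching F_d(k, [y; x]) above with constant eta and beta."""
    x_next = y - eta * grad_f(y)           # gradient step at the lookahead point
    y_next = x_next + beta * (x_next - x)  # momentum extrapolation
    return y_next, x_next

# Usage: f(x) = 0.5*x^2, so grad_f(x) = x, starting from y_0 = x_0 = 1.
y, x = nagd_step(1.0, 1.0, lambda v: v, eta=0.5, beta=0.9)
```

The output $x$ here is exactly $G([y; x]) = x$ of the state-space form (6).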

3. PROPOSED ALGORITHMS VIA DISCRETIZATION

In this section, we propose three classes of optimization algorithms via discretization of the q-RGF (2) and q-SGF (3). But first, we review the conditions needed to ensure finite-time convergence of these flows. Given $q \in (1, \infty]$, let $F_{q\text{-RGF}}(x)$ and $F_{q\text{-SGF}}(x)$ be defined, respectively, by the right-hand sides of (2) and (3); the hyperparameter $c > 0$ is not explicitly denoted in $F_{q\text{-RGF}}$, $F_{q\text{-SGF}}$. Next, borrowing terminology from Wilson et al. (2019), we say that $f$ (assumed continuously differentiable) is µ-gradient dominated of order $p \in (1, \infty]$ (with $\mu > 0$) near the local minimizer $x^\star$ if

$\frac{p-1}{p} \|\nabla f(x)\|^{\frac{p}{p-1}} \geq \mu^{\frac{1}{p-1}} (f(x) - f^\star)$

for every $x \in \mathbb{R}^n$ near $x = x^\star$, where $f^\star = f(x^\star)$. When $\mu > 0$ is unknown or unimportant, but known to exist, we will omit it in the previous definition. It can be proved that continuously differentiable strongly convex functions are gradient dominated of order $p = 2$. Furthermore, if $f$ is gradient dominated (of any order) w.r.t. $x^\star$, then $x^\star$ is an isolated stationary point of $f$. Remark 1. For strongly convex functions, gradient dominance of order $p = 2$ can be established. In fact, gradient dominance is usually defined exclusively for order $p = 2$, often referred to as the Polyak-Łojasiewicz (PL) inequality, which was introduced by Polyak (1963) to relax the (strong) convexity assumption commonly used to show convergence of the GD algorithm (7). The PL inequality can also be used to relax convexity assumptions of similar gradient and proximal-gradient methods Karimi et al. (2016); Attouch & Bolte (2009). Our adopted generalized notion of gradient dominance is strongly tied to the Łojasiewicz gradient inequality from real analytic geometry, established by Łojasiewicz (1963; 1965) independently and simultaneously from Polyak (1963), and generalizing the PL inequality.
More precisely, this inequality is typically written as

$\|\nabla f(x)\| \geq C \, |f(x) - f^\star|^{\theta}$ (14)

for every $x \in \mathbb{R}^n$ in a small enough open neighborhood of the stationary point $x = x^\star$, for some $C > 0$ and $\theta \in [\frac{1}{2}, 1)$. This inequality is guaranteed for analytic functions Łojasiewicz (1965). More precisely, when $x^\star$ is a local minimizer of $f$, the aforementioned relationship is explicitly given by

$C = \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}}, \qquad \theta = \frac{p-1}{p}$.

Therefore, analytic functions are always gradient dominated. However, while analytic functions are always smooth, smoothness is not required to attain gradient dominance. We are now ready to state the finite-time convergence of the q-RGF (2) and q-SGF (3). Theorem 1 (Romero & Benosman (2020)). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and µ-gradient dominated of order $p \in (1, \infty)$ near a strict local minimizer $x^\star \in \mathbb{R}^n$. Let $c > 0$ and $q \in (p, \infty]$. Then, any maximal solution $x(\cdot)$, in the sense of Filippov, to the q-RGF (2) or q-SGF (3) will converge in finite time to $x^\star$, provided that $\|x(0) - x^\star\| > 0$ is sufficiently small. More precisely, $\lim_{t \to t^\star} x(t) = x^\star$, where the convergence time $t^\star < \infty$ may depend on which flow is used, but in both cases is upper bounded by

$t^\star \leq \dfrac{\|\nabla f(x_0)\|^{\frac{1}{\theta} - \frac{1}{\theta'}}}{c \, C^{\frac{1}{\theta}} \left(1 - \frac{\theta}{\theta'}\right)}$,

where $x_0 = x(0)$, $C = \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}}$, $\theta = \frac{p-1}{p}$, and $\theta' = \frac{q-1}{q}$. In particular, given any compact and positively invariant subset $S \subset D$, both flows converge in finite time with the aforementioned convergence time upper bound (which can be tightened by replacing $D$ with $S$) for any $x_0 \in S$. Furthermore, if $D = \mathbb{R}^n$, then we have global finite-time convergence, i.e., finite-time convergence of any maximal solution (in the sense of Filippov) $x(\cdot)$ with arbitrary $x_0 \in \mathbb{R}^n$. In essence, the analysis (introduced in Romero & Benosman (2020)) consists of leveraging the gradient dominance to show that the energy function $E(t) \triangleq f(x(t)) - f^\star$ satisfies a Lyapunov-like differential inequality $\dot{E}(t) \leq -\bar{c} \, E(t)^{\alpha}$ for some $\bar{c} > 0$ and $\alpha < 1$.
The detailed proof is recalled in Appendix C for completeness.
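The settling-time bound of Theorem 1 is straightforward to evaluate numerically. A minimal sketch (the function name is ours, and the formula encodes our reading of the bound, with $C = (\frac{p}{p-1})^{\frac{p-1}{p}}\mu^{\frac{1}{p}}$, $\theta = \frac{p-1}{p}$, $\theta' = \frac{q-1}{q}$):

```python
def settling_time_bound(grad_norm0, c, mu, p, q):
    """Upper bound on the convergence time t* of q-RGF/q-SGF (Theorem 1):
    t* <= ||grad f(x0)||^(1/theta - 1/theta') / (c * C^(1/theta) * (1 - theta/theta')),
    with C = (p/(p-1))^((p-1)/p) * mu^(1/p), theta = (p-1)/p, theta' = (q-1)/q."""
    theta = (p - 1.0) / p
    theta_prime = (q - 1.0) / q
    C = (p / (p - 1.0)) ** ((p - 1.0) / p) * mu ** (1.0 / p)
    num = grad_norm0 ** (1.0 / theta - 1.0 / theta_prime)
    den = c * C ** (1.0 / theta) * (1.0 - theta / theta_prime)
    return num / den
```

For instance, with $p = 2$, $q = 4$, $\mu = c = 1$, and $\|\nabla f(x_0)\| = 1$, the bound evaluates to $3/2$.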

3.1.1. FORWARD-EULER DISCRETIZATION

First, we propose a simple forward-Euler discretization of the finite-time convergent flows:

$x_{k+1} = x_k + \eta F(x_k)$, $\eta > 0$, (17)

where $F \in \{F_{q\text{-RGF}}, F_{q\text{-SGF}}\}$. We show later, in Theorem 2, that this simple method leads, for small enough $\eta > 0$, to solutions that are ε-close to those of the finite-time flows.
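Scheme (17) is a plain fixed-step iteration. A minimal sketch (function name ours; for illustration we drive it with the GF field $F(x) = -x$, i.e., the gradient flow of $f(x) = \frac{1}{2}\|x\|^2$, though $F_{q\text{-RGF}}$ or $F_{q\text{-SGF}}$ plug in the same way):

```python
import numpy as np

def forward_euler(F, x0, eta, n_steps):
    """Iterate scheme (17): x_{k+1} = x_k + eta * F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eta * F(x)
    return x

# Usage: GF of f(x) = 0.5*||x||^2, i.e., F(x) = -x.
x_final = forward_euler(lambda x: -x, [1.0, 1.0], eta=0.1, n_steps=10)
```

Each step contracts the iterate by the factor $1 - \eta$ in this linear example.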

3.1.2. EXPLICIT RUNGE-KUTTA DISCRETIZATION

We propose to use the following discretization:

$x_{k+1} = x_k + \eta \sum_{i=1}^{K} \alpha_i F(y_k^i)$, with $y_k^1 = x_k$, $y_k^i = x_k + \eta \sum_{j=1}^{i-1} \beta_j F(y_k^j)$ for $i > 1$, and $\sum_{i=1}^{K} \alpha_i = 1$, (18)

for $\eta > 0$, $K \in \{1, 2, 3, \ldots\}$, and $F \in \{F_{q\text{-RGF}}, F_{q\text{-SGF}}\}$. This method is well known to be numerically stable under the consistency condition $\sum_{i=1}^{K} \alpha_i = 1$ Stuart & Humphries (1996). However, in our optimization framework, we want to be able to guarantee that the stable numerical solution of (18) remains close to the solution of the continuous flows. In other words, we seek arbitrarily small global error, also known as shadowing. This will be discussed in Theorem 2.
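One step of scheme (18) can be sketched as follows (function name ours; the assertion enforces the consistency condition; with $K = 2$, $\alpha = (\frac{1}{2}, \frac{1}{2})$, $\beta_1 = 1$ it reduces to Heun's method, which the usage line runs on the field $F(x) = -x$):

```python
def explicit_rk_step(F, x, eta, alphas, betas):
    """One explicit Runge-Kutta step of scheme (18):
    stages y^1 = x, y^i = x + eta * sum_{j<i} betas[j] * F(y^j) for i > 1,
    update x_next = x + eta * sum_i alphas[i] * F(y^i)."""
    assert abs(sum(alphas) - 1.0) < 1e-12, "consistency condition: sum(alphas) == 1"
    stage_fields = []
    for i in range(len(alphas)):
        y = x + eta * sum(betas[j] * stage_fields[j] for j in range(i))
        stage_fields.append(F(y))
    return x + eta * sum(a * s for a, s in zip(alphas, stage_fields))

# Usage: Heun's method (K = 2) on F(x) = -x from x = 1.
x_next = explicit_rk_step(lambda v: -v, 1.0, 0.1, alphas=[0.5, 0.5], betas=[1.0])
```

Setting `alphas=[1.0]` and `betas=[]` recovers the forward-Euler scheme (17) as the $K = 1$ special case.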

3.1.3. NESTEROV-LIKE DISCRETIZATION

First, we rewrite Nesterov's accelerated GD as

$x_{k+1} = x_k + \beta y_k - \eta \nabla f(x_k + \beta y_k)$ (19a)
$y_{k+1} = x_{k+1} - x_k$, (19b)

where $y_k$ now serves as a momentum term. We argue that Nesterov's acceleration can be interpreted as actually applying the discretization given by (19) to the GF (1), i.e., by seeing the term $-\eta \nabla f(x_k + \beta y_k)$ as a mapping applied to the GF field (1) at $x_k + \beta y_k$, namely as $\eta F_{\mathrm{GF}}(x_k + \beta y_k)$. Therefore, given any optimization flow represented by the continuous-time system $\dot{x} = F(x)$, locally convergent to a local minimizer $x^\star \in \mathbb{R}^n$ of a cost function $f : \mathbb{R}^n \to \mathbb{R}$, we can replicate Nesterov's acceleration of (1). More precisely, we obtain the algorithm

$x_{k+1} = x_k + \eta F(x_k + \beta y_k) + \beta y_k$ (20a)
$y_{k+1} = x_{k+1} - x_k$. (20b)

Based on this idea, we propose two 'Nesterov-like' discrete optimization algorithms. The first one, based on the q-RGF continuous flow, is defined as:

$x_{k+1} = x_k + \eta \left( -c \, \dfrac{\nabla f(x_k + \beta y_k)}{\|\nabla f(x_k + \beta y_k)\|_2^{\frac{q-2}{q-1}}} \right) + \beta y_k$ (21a)
$y_{k+1} = x_{k+1} - x_k$. (21b)

The second algorithm, based on the q-SGF continuous flow, is given by:

$x_{k+1} = x_k + \eta \left( -c \, \|\nabla f(x_k + \beta y_k)\|_1^{\frac{1}{q-1}} \, \mathrm{sign}(\nabla f(x_k + \beta y_k)) \right) + \beta y_k$ (22a)
$y_{k+1} = x_{k+1} - x_k$. (22b)

3.1.4. CONVERGENCE ANALYSIS

We present here some convergence results for the three proposed discretizations. The analysis, summarized in Theorem 2, is based on tools from hybrid control theory and is detailed in Appendix D. Theorem 2. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, locally $L_f$-Lipschitz, and µ-gradient dominated of order $p \in (2, \infty)$ in a compact neighborhood $S$ of a strict local minimizer $x^\star \in \mathbb{R}^n$. Let $c > 0$ and $q \in (p, \infty]$. Then, for a given initial condition $x_0 \in S$ and any maximal solution $x(t)$, $x(0) = x_0$, (in the sense of Filippov) to the q-RGF (2) or the q-SGF (3), there exists an arbitrarily small $\epsilon > 0$ such that the solutions $x_k$ of any of the discrete algorithms (17), (18), (21), or (22), with sufficiently small $\eta > 0$, are ε-close to $x(t)$, i.e., $\|x_k - x(t_k)\| \leq \epsilon$, and such that
the following convergence bound holds:

$f(x_k) - f(x^\star) \leq L_f \epsilon + \left[ (f(x_0) - f(x^\star))^{1-\alpha} - \bar{c}(1 - \alpha) \eta k \right]^{1/(1-\alpha)}$, $L_f > 0$, $k \leq k^\star$,

where $\alpha = \theta/\theta'$, $\theta = \frac{p-1}{p}$, $\theta' = \frac{q-1}{q}$, $\bar{c} = c \left( \left(\frac{p}{p-1}\right)^{\frac{p-1}{p}} \mu^{\frac{1}{p}} \right)^{1/\theta'}$, and $k^\star = \frac{(f(x_0) - f(x^\star))^{1-\alpha}}{\bar{c}(1-\alpha)\eta}$. Theorem 2 thus shows that ε-convergence of $x_k \to x^\star$ can be achieved in a finite number of steps upper bounded by $k^\star$. This is a preliminary convergence result, which is meant to show the existence of discrete solutions, obtained via the proposed discretization algorithms, that are ε-close to the continuous solutions of the finite-time flows. We also underline here that after $x_k$ reaches an ε-neighborhood of $x^\star$, we have $x_{k+1} \approx x_k$ for all $k > k^\star$, since $x^\star$ is an equilibrium point of the continuous flows; see Definition 2 in Appendix B.
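The generic Nesterov-like scheme (20) is a two-line update. A minimal sketch (helper name ours), exercised here with the GF field $F(x) = -x$ for $f(x) = \frac{1}{2}x^2$; plugging in $F_{q\text{-RGF}}$ or $F_{q\text{-SGF}}$ instead yields (21) or (22):

```python
def nesterov_like_step(x, y, F, eta, beta):
    """One step of scheme (20):
    x_{k+1} = x_k + eta * F(x_k + beta*y_k) + beta*y_k,  y_{k+1} = x_{k+1} - x_k."""
    x_next = x + eta * F(x + beta * y) + beta * y  # field evaluated at the lookahead point
    y_next = x_next - x                            # momentum term
    return x_next, y_next

# Usage: F(x) = -x (GF of f(x) = 0.5*x^2), starting from x_0 = 1, y_0 = 0.
x, y = 1.0, 0.0
for _ in range(2):
    x, y = nesterov_like_step(x, y, lambda v: -v, eta=0.1, beta=0.9)
```

Note that only the field $F$ changes between (20), (21), and (22); the momentum structure is identical.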

4.1. NUMERICAL TESTS ON AN ACADEMIC EXAMPLE

Let us first show, on a simple numerical example, that the acceleration in convergence, proven in continuous time for a certain range of the hyperparameters, can translate to some convergence acceleration in discrete time, as suggested by Theorem 2. We consider the Rosenbrock function $f : \mathbb{R}^2 \to \mathbb{R}$, given by $f(x_1, x_2) = (a - x_1)^2 + b(x_2 - x_1^2)^2$, with parameters $a, b \in \mathbb{R}$. This function admits exactly one stationary point $(x_1, x_2) = (a, a^2)$ for $b \geq 0$, and is locally strongly convex; hence, it locally satisfies gradient dominance of order $p = 2$, which allows us to select $q > 2$ in q-RGF and q-SGF to achieve finite-time convergence in continuous time. We report in Figure 1 the mean performance of all three discretizations of q-RGF and q-SGF with fixed step size, for several values of $q$, for 10 random initial conditions in $[0, 2]$. We observe for all three discretizations that, as expected from the continuous flow analysis, for $q$ close to 2, q-RGF behaves similarly to GD in terms of convergence rate, whereas for $q > 2$ the finite-time convergence in continuous time seems to translate to some acceleration in this simple discretization method. Similarly for q-SGF, $q$ closer to 2 translates to less accelerated algorithms, with a behavior similar to GD, whereas larger $q$ values lead to accelerated convergence.

We report here the results of our experiments on deep neural network (DNN) training on three well-known datasets, namely, CIFAR10, MNIST, and SVHN. We report results on CIFAR10 and SVHN in the sequel, while results on MNIST can be found in Appendix E. Note that we use the PyTorch platform to conduct all the tests reported in this paper, and do not use GPUs. We underline here that the DNNs are globally non-convex; however, one could assume at least local convexity, hence local gradient dominance of order $p = 2$; thus, we will select $q > 2$ in our experiments (see Remark 2, Appendix E, for more explanations on the choice of $q$).
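For reference, the gradient fed to all the schemes on this academic example is (a short sketch; the function name is ours):

```python
import numpy as np

def rosenbrock_grad(x, a=1.0, b=100.0):
    """Gradient of the Rosenbrock function f(x1, x2) = (a - x1)^2 + b*(x2 - x1^2)^2."""
    x1, x2 = x
    g1 = -2.0 * (a - x1) - 4.0 * b * x1 * (x2 - x1 ** 2)
    g2 = 2.0 * b * (x2 - x1 ** 2)
    return np.array([g1, g2])
```

This gradient vanishes exactly at the unique stationary point $(a, a^2)$, which is what the q-RGF and q-SGF fields (and their discretizations) converge to.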

4.2.1. EXPERIMENT ON CIFAR10

In this experiment, we use the proposed algorithms to train a VGG16 CNN model with cross-entropy loss, e.g., Simonyan & Zisserman (2015), on the CIFAR10 dataset. We divided the dataset into a training set spread over 50 batches of 1,000 images each, and a test set of 10 batches with 1,000 images each. We ran 20 epochs of training over all the training batches. Since Nesterov accelerated GD is one of the most efficient methods in DNN applications, to conduct fair comparisons we implemented our Nesterov-like discretization of q-RGF (c = 1, q = 3, η = 0.04, µ = 0.9), and the Nesterov-like discretization of q-SGF (c = 10^-3, q = 3, η = 0.04, µ = 0.9). We compare against mainstream algorithms [6], such as Nesterov's accelerated gradient descent (GD), Adam, Adaptive gradient (AdaGrad), the per-dimension learning rate method for gradient descent (AdaDelta), and Root Mean Square Propagation (RMSprop) [7]. Note that all algorithms have been implemented in their stochastic version [8], i.e., using mini-batches, with constant step size. In Figures 2 and 3 [9], we see the training loss for the different optimization algorithms. We notice that the proposed algorithms, RGF and SGF, quickly separate from the GD and RMSprop algorithms in terms of convergence speed, but also end up with an overall better performance on the test set (84% vs. 83% for GD and RMSprop). We also note that other algorithms, such as AdaGrad and AdaDelta, behave similarly to RGF in terms of convergence speed, but lag behind in terms of final performance (75% and 68%, respectively). Finally, in Figure 3, we notice that Adam is slower in terms of computation time w.r.t. SGF and RGF, with an average lag of 8 min and 80 min, respectively. However, to be fair, one has to underline that Adam is an adaptive method that incurs the extra computation and memory overhead of lower-order moment estimates and bias correction, Kingma & Ba (2015).
Furthermore, to better compare the performance of these algorithms, we report in Figure 2 the loss on the test dataset over the learning iterations. We confirm that RGF and SGF perform better than GD, RMSprop, and Adam, while avoiding the overfitting observed with AdaGrad and AdaDelta. [6] We ran several tests trying to optimally tune the parameters of each algorithm on a validation set (a tenth of the training set), and we report the best final accuracy performance we could obtain for each one. We implemented all algorithms with the same total number of iterations, so that we can compare the convergence speed of all algorithms over a common iteration range. Details of the hyper-parameter values are given in the Appendix. [7] Original references for each method can be found at: https://pytorch.org/docs/stable/optim.html [8] We want to underline here that our first tests were done in the deterministic setting; however, to compare the proposed optimization methods against the best optimization algorithms available for DNN training, we decided to also conduct comparison tests in the stochastic setting. Since the results remain qualitatively unchanged, we only report here the results for the stochastic setting. [9] To avoid overloading the figures, we report only the computation time plots of the three most competitive methods: RGF, SGF, and Adam.

4.2.2. EXPERIMENTS ON SVHN DATASET

We test the proposed algorithms to train the same VGG16 CNN model with cross-entropy loss on the SVHN dataset. We divided the dataset into a training set of 70 batches with 1,000 images each and a test set of 10 batches with 1,000 images each, and ran 20 epochs of training over all the training batches. We tested the Nesterov-like discretization of q-RGF (c = 1, q = 3, η = 0.04, µ = 0.09), and the Nesterov-like discretization of q-SGF (c = 10^-3, q = 11, η = 0.04, µ = 0.09) against Nesterov's accelerated gradient descent (GD) and Adam. From Figures 4 and 5, it is clear that RGF and SGF give a good performance in terms of convergence speed and final test performance (93%). We can also observe in Figure 5 that SGF and RGF are faster than GD, and all three methods are faster (on average 41 min for GD, 58 min for SGF, 75 min for RGF) than Adam, as expected, since Adam is an adaptive scheme with more computation steps (see our discussion of Adam in Section 4.2.1).

5. CONCLUSION

We studied connections between optimization algorithms and their continuous-time representations (dynamical systems) via discretization. We then reviewed two families of non-Lipschitz or discontinuous first-order optimization flows for continuous-time optimization, namely the q-RGF and q-SGF, whose distinguishing characteristic is their finite-time convergence. We then proposed three discretization methods for these flows, namely a forward-Euler discretization, followed by an explicit Runge-Kutta discretization, and finally a novel Nesterov-like discretization. Based on tools from hybrid systems control theory, we proved a convergence bound for these algorithms. Finally, we conducted numerical experiments on well-known deep neural network benchmarks, which showed that the proposed discrete algorithms can outperform state-of-the-art algorithms when tested on large DNN models.

A DISCONTINUOUS SYSTEMS AND DIFFERENTIAL INCLUSIONS

Recall that for an initial value problem (IVP)

$\dot{x}(t) = F(x(t))$ (24a)
$x(0) = x_0$ (24b)

with $F : \mathbb{R}^n \to \mathbb{R}^n$, the typical way to check for existence of solutions is by establishing continuity of $F$. Likewise, to establish uniqueness of the solution, we typically seek Lipschitz continuity. When $F$ is discontinuous, we may understand (24a) as the Filippov differential inclusion

$\dot{x}(t) \in K[F](x(t))$, (25)

where $K[F] : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ denotes the Filippov set-valued map given by

$K[F](x) \triangleq \bigcap_{\delta > 0} \bigcap_{\mu(S) = 0} \overline{\mathrm{co}} \, F(B_\delta(x) \setminus S)$,

where $\mu$ denotes the usual Lebesgue measure and $\overline{\mathrm{co}}$ the convex closure, i.e., the closure of the convex hull $\mathrm{co}$. For more details, see Paden & Sastry (1987). We can generalize (25) to the differential inclusion Bacciotti & Ceragioli (1999)

$\dot{x}(t) \in \mathcal{F}(x(t))$, (27)

where $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is some set-valued map. Definition 1 (Carathéodory/Filippov solutions). We say that $x : [0, \tau) \to \mathbb{R}^n$ with $0 < \tau \leq \infty$ is a Carathéodory solution to (27) if $x(\cdot)$ is absolutely continuous and (27) is satisfied a.e. in every compact subset of $[0, \tau)$. Furthermore, we say that $x(\cdot)$ is a maximal Carathéodory solution if no other Carathéodory solution $x'(\cdot)$ exists with $x = x'|_{[0,\tau)}$. If $\mathcal{F} = K[F]$, then Carathéodory solutions are referred to as Filippov solutions. For a comprehensive overview of discontinuous systems, including sufficient conditions for existence (Proposition 3) and uniqueness (Propositions 4 and 5) of Filippov solutions, see the work of Cortés (2008). In particular, it can be established that Filippov solutions to (24) exist, provided that the following assumption (Assumption 1) holds. Assumption 1 (Existence of Filippov solutions). $F : \mathbb{R}^n \to \mathbb{R}^n$ is defined almost everywhere (a.e.) and is Lebesgue-measurable in a non-empty open neighborhood $U \subset \mathbb{R}^n$ of $x_0 \in \mathbb{R}^n$. Further, $F$ is locally essentially bounded in $U$, i.e., for every point $x \in U$, $F$ is bounded a.e. in some bounded neighborhood of $x$.
More generally, Carathéodory solutions to (27) exist (now with arbitrary $x_0 \in \mathbb{R}^n$), provided that the following assumption (Assumption 2) holds. Assumption 2 (Existence of Carathéodory solutions). $\mathcal{F} : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ has nonempty, compact, and convex values, and is upper semi-continuous. Filippov & Arscott (1988) proved that, for the Filippov set-valued map $\mathcal{F} = K[F]$, Assumptions 1 and 2 are equivalent (with arbitrary $x_0 \in \mathbb{R}^n$ in Assumption 1). Uniqueness of the solution requires further assumptions. Nevertheless, we can characterize the Filippov set-valued map in a manner similar to Clarke's generalized gradient, as seen in the following proposition. Proposition 1 (Theorem 1 of Paden & Sastry (1987)). Under Assumption 1, we have

$K[F](x) = \mathrm{co}\left\{ \lim_{k \to \infty} F(x_k) : x_k \in \mathbb{R}^n \setminus (N_F \cup S) \text{ s.t. } x_k \to x \right\}$ (28)

for some (Lebesgue) zero-measure set $N_F \subset \mathbb{R}^n$ and any other zero-measure set $S \subset \mathbb{R}^n$. In particular, if $F$ is continuous at a fixed $x$, then $K[F](x) = \{F(x)\}$. For instance, for the GF (1), we have $K[-\nabla f](x) = \{-\nabla f(x)\}$ for every $x \in \mathbb{R}^n$, provided that $f$ is continuously differentiable. Furthermore, if $f$ is only locally Lipschitz continuous and regular (see Definition 3 of Appendix B), then $K[-\nabla f](x) = -\partial f(x)$, where

$\partial f(x) \triangleq \mathrm{co}\left\{ \lim_{k \to \infty} \nabla f(x_k) : x_k \in \mathbb{R}^n \setminus N_f \text{ s.t. } x_k \to x \right\}$ (29)

denotes Clarke's generalized gradient Clarke (1981) of $f$, with $N_f$ denoting the zero-measure set over which $f$ is not differentiable (Rademacher's theorem). It can be established that $\partial f$ coincides with the subgradient of $f$, provided that $f$ is convex. Therefore, the GF (1), interpreted as a Filippov differential inclusion, may also be seen as a continuous-time variant of subgradient descent methods.

B FINITE-TIME STABILITY OF DIFFERENTIAL INCLUSIONS

We are now ready to focus on extending some notions from traditional Lipschitz continuous systems to differential inclusions. Definition 2. We say that $x^\star \in \mathbb{R}^n$ is an equilibrium of (27) if $x(t) \equiv x^\star$ on some small enough non-degenerate interval is a Carathéodory solution to (27); in other words, if and only if $0 \in \mathcal{F}(x^\star)$. We say that (27) is (Lyapunov) stable at $x^\star \in \mathbb{R}^n$ if, for every $\varepsilon > 0$, there exists some $\delta > 0$ such that, for every maximal Carathéodory solution $x(\cdot)$ of (27), we have $\|x_0 - x^\star\| < \delta \implies \|x(t) - x^\star\| < \varepsilon$ for every $t \geq 0$ in the interval where $x(\cdot)$ is defined. Note that, under Assumption 2, if (27) is stable at $x^\star$, then $x^\star$ is an equilibrium of (27) Bacciotti & Ceragioli (1999). Furthermore, we say that (27) is (locally and strongly) asymptotically stable at $x^\star \in \mathbb{R}^n$ if it is stable at $x^\star$ and there exists some $\delta > 0$ such that, for every maximal Carathéodory solution $x : [0, \tau) \to \mathbb{R}^n$ of (27), if $\|x_0 - x^\star\| < \delta$ then $x(t) \to x^\star$ as $t \to \tau$. Finally, (27) is (locally and strongly) finite-time stable at $x^\star$ if it is asymptotically stable and there exist some $\delta > 0$ and $T : B_\delta(x^\star) \to [0, \infty)$ such that, for every maximal Carathéodory solution $x(\cdot)$ of (27) with $x_0 \in B_\delta(x^\star)$, we have $\lim_{t \to T(x_0)} x(t) = x^\star$. We will now construct a Lyapunov-based criterion adapted from the literature on finite-time stability of Lipschitz continuous systems. Lemma 1. Let $E(\cdot)$ be an absolutely continuous function satisfying the differential inequality

$\dot{E}(t) \leq -c \, E(t)^{\alpha}$ (30)

a.e. in $t \geq 0$, with $c, E(0) > 0$ and $\alpha < 1$. Then, there exists some $t^\star > 0$ such that $E(t) > 0$ for $t \in [0, t^\star)$ and $E(t^\star) = 0$. Furthermore, $t^\star > 0$ can be bounded by

$t^\star \leq \dfrac{E(0)^{1-\alpha}}{c(1 - \alpha)}$, (31)

with this bound tight whenever (30) holds with equality. In that case of equality, but now with $\alpha \geq 1$, we instead have $E(t) > 0$ for every $t \geq 0$, with $\lim_{t \to \infty} E(t) = 0$. This will be represented by $t^\star = \infty$, with $E(\infty) \triangleq \lim_{t \to \infty} E(t)$. Proof. Suppose that $E(t) > 0$ for every $t \in [0, T]$ with $T > 0$. Let $t^\star$ be the supremum of all such $T$'s, thus satisfying $E(t) > 0$ for every $t \in [0, t^\star)$.
We will now investigate $E(t^\star)$. First, by continuity of $E$, it follows that $E(t^\star) \geq 0$. Now, by rewriting

$\dot{E}(t) \leq -c \, E(t)^{\alpha} \iff \dfrac{d}{dt} \left[ \dfrac{E(t)^{1-\alpha}}{1 - \alpha} \right] \leq -c$,

a.e. in $t \in [0, t^\star)$, we can thus integrate to obtain $\frac{E(t)^{1-\alpha}}{1-\alpha} - \frac{E(0)^{1-\alpha}}{1-\alpha} \leq -ct$ everywhere in $t \in [0, t^\star)$, which in turn leads to

$E(t) \leq \left[ E(0)^{1-\alpha} - c(1 - \alpha)t \right]^{1/(1-\alpha)}$ (34)

and

$t \leq \dfrac{E(0)^{1-\alpha} - E(t)^{1-\alpha}}{c(1 - \alpha)} \leq \dfrac{E(0)^{1-\alpha}}{c(1 - \alpha)}$, (35)

where the last inequality follows from $E(t) > 0$ for every $t \in [0, t^\star)$. Taking the supremum in (35) then leads to the upper bound (31). Finally, we conclude that $E(t^\star) = 0$, since $E(t^\star) > 0$ is impossible, given that it would mean, due to continuity of $E$, that there exists some $T > t^\star$ such that $E(t) > 0$ for every $t \in [0, T]$, thus contradicting the construction of $t^\star$. Finally, notice that if $E$ is such that (30) holds with equality, then (34) and the first inequality in (35) hold with equality as well. The tightness of the bound (31) thus follows immediately. Furthermore, notice that if $\alpha \geq 1$ and $E$ is a tight solution to the differential inequality (30), i.e., $E(t) = [E(0)^{1-\alpha} - c(1-\alpha)t]^{1/(1-\alpha)}$, then clearly $E(t) > 0$ for every $t \geq 0$ and $E(t) \to 0$ as $t \to \infty$.

Cortés & Bullo (2005) proposed (Proposition 2.8) a Lyapunov-based criterion to establish finite-time stability of discontinuous systems, which fundamentally coincides with our Lemma 1 for the particular choice of exponent $\alpha = 0$. Their proposition was, however, directly based on Theorem 2 of Paden & Sastry (1987). Later, Cortés (2006) proposed a second-order Lyapunov criterion, which, on the other hand, fundamentally translates to $E(t) \triangleq V(x(t))$ being strongly convex. Finally, Hui et al. (2009) generalized Proposition 2.8 of Cortés & Bullo (2005) in their Corollary 3.1, to establish semistability. Indeed, these two results coincide for isolated equilibria. We now present a novel result that generalizes the aforementioned first-order Lyapunov-based results, by exploiting our Lemma 1.
More precisely, given a Lyapunov candidate function V(·), the objective is to set E(t) ≜ V(x(t)) and check that the conditions of Lemma 1 hold. To do this, and assuming V to be locally Lipschitz continuous, we first borrow and adapt from Bacciotti & Ceragioli (1999) the definition of the set-valued time derivative of V : D → R w.r.t. the differential inclusion (27), given by

V̇(x) ≜ {a ∈ R : ∃v ∈ F(x) s.t. a = p · v, ∀p ∈ ∂V(x)},

for each x ∈ D. Notice that, under Assumption 2, for Filippov differential inclusions F = K[F], the set-valued time derivative of V coincides with the set-valued Lie derivative L_F V(·). Indeed, more generally, V̇ can be seen as a set-valued Lie derivative L_F V w.r.t. the set-valued map F.

Definition 3. V(·) is said to be regular if every directional derivative, given by

V'(x; v) ≜ lim_{h→0} [V(x + h v) - V(x)] / h,

exists and is equal to

V°(x; v) ≜ lim sup_{x'→x, h→0+} [V(x' + h v) - V(x')] / h,

known as Clarke's upper generalized derivative Clarke (1981).

In practice, regularity is a fairly mild and easy-to-guarantee condition. For instance, it suffices that V is convex or continuously differentiable to ensure that it is Lipschitz and regular.

Assumption 3. V : D → R is locally Lipschitz continuous and regular, with D ⊆ R^n open.

Under Assumption 3, Clarke's generalized gradient

∂V(x) ≜ {p ∈ R^n : V°(x; v) ≥ p · v, ∀v ∈ R^n}

is non-empty for every x ∈ D, and is also given by

∂V(x) = con { lim_{k→∞} ∇V(x_k) : x_k ∈ R^n \ N_V s.t. x_k → x },

where N_V denotes the set of points in D ⊆ R^n where V is not differentiable (Rademacher's theorem) Clarke (1981). Through the following lemma (Lemma 2), we can formally establish the correspondence between the set-valued time derivative of V and the derivative of the energy function E(t) ≜ V(x(t)) associated with an arbitrary Carathéodory solution x(·) to the differential inclusion (27).

Lemma 2 (Lemma 1 of Bacciotti & Ceragioli (1999)).
Under Assumption 3, given any Carathéodory solution x : [0, τ) → R^n to (27), E(t) ≜ V(x(t)) is absolutely continuous and Ė(t) = (d/dt) V(x(t)) ∈ V̇(x(t)) a.e. in t ∈ [0, τ).

We are now ready to state and prove our Lyapunov-based sufficient condition for finite-time stability of differential inclusions.

Theorem 3. Suppose that Assumptions 2 and 3 hold for some set-valued map F : R^n ⇒ R^n and some function V : D → R, where D ⊆ R^n is an open and positively invariant neighborhood of a point x* ∈ R^n. Suppose that V is positive definite w.r.t. x* and that there exist constants c > 0 and α < 1 such that

sup V̇(x) ≤ -c V(x)^α    (41)

a.e. in x ∈ D. Then, (27) is finite-time stable at x*, with settling time upper bounded by

t* ≤ V(x_0)^(1-α) / (c(1-α)),    (42)

where x(0) = x_0. In particular, any Carathéodory solution x(·) with x(0) = x_0 ∈ D will converge in finite time to x* under the upper bound (42). Furthermore, if D = R^n, then (27) is globally finite-time stable. Finally, if V̇(x) is a singleton a.e. in x ∈ D and (41) holds with equality, then the bound (42) is tight.

Proof. By Proposition 1 of Bacciotti & Ceragioli (1999), we know that (27) is Lyapunov stable at x*. All that remains to be shown is local convergence towards x* (which must be an equilibrium) in finite time. Indeed, given any maximal solution x : [0, t*) → R^n to (27) with x(0) = x_0 ≠ x*, we know by Lemma 2 that E(t) = V(x(t)) is absolutely continuous with Ė(t) ∈ V̇(x(t)) a.e. in t ∈ [0, t*). Therefore, we have

Ė(t) ≤ sup V̇(x(t)) ≤ -c V(x(t))^α = -c E(t)^α

a.e. in t ∈ [0, t*). Since E(0) = V(x_0) > 0, given that x_0 ≠ x*, the result then follows by invoking Lemma 1 and noting that E(t*) = 0 ⟺ V(x(t*)) = 0 ⟺ x(t*) = x*.

Finite-time stability still follows without Assumption 2, provided that x* is an equilibrium of (27).
In practical terms, this means that trajectories starting arbitrarily close to x* may fail to exist; nevertheless, there exists a neighborhood D of x* over which any trajectory x(·) that does exist and starts at x(0) = x_0 ∈ D must converge in finite time to x*, with settling time upper bounded by T(x_0) (the bound still being tight when (41) holds with equality).
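To make Theorem 3 concrete, consider the illustrative example (our own toy instance, not from the paper) f(x) = ½‖x‖² with V = f - f* and the q-RGF field. Here sup V̇(x) = -c‖∇f(x)‖^(q/(q-1)) = -c(2V)^(q/(2(q-1))) holds with equality, so (41) holds with α = q/(2(q-1)) < 1 for q > 2, and the settling-time bound (42) should be (nearly) attained by a fine integration of the flow:

```python
import numpy as np

c, q = 1.0, 3.0
alpha = q / (2 * (q - 1))        # exponent in (41) for this quadratic example
F = lambda z: -c * z / np.linalg.norm(z)**((q - 2) / (q - 1))   # q-RGF field

x = np.array([1.0, -2.0])
V0 = 0.5 * np.dot(x, x)          # V(x_0) = f(x_0) - f(x*), with f* = 0
dt, t = 1e-4, 0.0
while 0.5 * np.dot(x, x) > 1e-8: # integrate the flow until V (numerically) vanishes
    x = x + dt * F(x)
    t += dt
# bound (42), with the effective constant c*2**alpha coming from (2V)**alpha
t_bound = V0**(1 - alpha) / (c * 2**alpha * (1 - alpha))
```

Since (41) holds with equality here, the measured settling time t should sit just below t_bound (the small gap comes from the nonzero stopping threshold).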

C PROOF OF THEOREM 1

Let us focus on the q-RGF (2) (the case of the q-SGF (3) follows exactly the same steps) with the candidate Lyapunov function V ≜ f - f*. Clearly, V is Lipschitz continuous and regular (given that it is continuously differentiable). Furthermore, V is positive definite w.r.t. x*. Notice that, due to the gradient dominance assumption, x* must be an isolated stationary point of f. To see this, notice that, if x* were not an isolated stationary point, then there would have to exist some x' sufficiently near x* such that x' is both a stationary point of f and satisfies f(x') > f*, since x* is a strict local minimizer of f. But then, we would have

0 = ((p-1)/p) ‖∇f(x')‖^(p/(p-1)) ≥ µ^(1/(p-1)) (f(x') - f*) > 0,

and subsequently 0 > 0, which is absurd. Therefore, F(x) ≜ -c ∇f(x)/‖∇f(x)‖^((q-2)/(q-1)) is continuous for every x ∈ D \ {x*}, for some small enough open neighborhood D of x*. Let us assume that D is positively invariant w.r.t. (2), which can be achieved, for instance, by replacing D with its intersection with some small enough strict sublevel set of f. Notice that ‖F(x)‖ = c ‖∇f(x)‖^(1/(q-1)) with q ∈ (p, ∞] ⊂ (1, ∞], i.e., 1/(q-1) ∈ [0, ∞). If q = ∞, which results in the normalized gradient flow ẋ = -∇f(x)/‖∇f(x)‖ proposed by Cortés (2006), then ‖F(x)‖ = c > 0, and thus F(x) is discontinuous at x = x*. On the other hand, if q ∈ (p, ∞) ⊂ (1, ∞), then we have F(x) → 0 as x → x*, and thus F(x) is continuous (but not Lipschitz) at x = x*. Regardless, we may freely focus exclusively on D \ {x*}, since {x*} is obviously a zero-measure set. Let F ≜ K[F]. We thus have, for each x ∈ D \ {x*},

sup V̇(x) = sup {a ∈ R : ∃v ∈ F(x) s.t. a = p · v, ∀p ∈ ∂V(x)}    (45a)
= sup {∇V(x) · v : v ∈ F(x)}    (45b)
= ∇V(x) · F(x)    (45c)
= -c ‖∇f(x)‖^(2-(q-2)/(q-1)) = -c ‖∇f(x)‖^(q/(q-1)).    (45d)

D PROOF OF THEOREM 2

To prove Theorem 2, we borrow some tools and results from hybrid control systems theory. Hybrid control systems are characterized by continuous flows with discrete jumps between the continuous flows.
They are often modeled by differential inclusions combined with discrete mappings that model the jumps between the differential inclusions. We view the optimization flows proposed here as a simple case of a hybrid system with one differential inclusion, with a possible jump or discontinuity at the optimum. Based on this, we will use the tools and results of Sanfelice & Teel (2010), which study how a certain class of hybrid systems behaves after discretization with a certain class of discretization algorithms. In other words, Sanfelice & Teel (2010) quantifies, under some conditions, how close the solutions of the discretized hybrid dynamical system are to the solutions of the original hybrid system. In this section we will denote the differential inclusion of the continuous optimization flow by F : R^n ⇒ R^n, and its discretization in time by F_d : R^n ⇒ R^n. We first recall a definition, which we adapt from the general case of jumps between multiple differential inclusions (Definition 3.2, Sanfelice & Teel (2010)) to our case of one differential inclusion or flow.

Definition 4 ((T, ε)-closeness). Given T > 0, ε > 0, and η > 0, two solutions x_t : [0, T] → R^n and x_k : {0, 1, 2, ...} → R^n are (T, ε)-close if: (a) for all t ≤ T there exists k ∈ {1, 2, ...} such that |t - kη| < ε and ‖x_t(t) - x_k(k)‖ < ε; (b) for all k ∈ {1, 2, ...} there exists t ≤ T such that |t - kη| < ε and ‖x_t(t) - x_k(k)‖ < ε.

Next, we recall Theorem 5.2 in Sanfelice & Teel (2010), adapted to our special case of a hybrid system with one differential inclusion.

Theorem 4 (Closeness of continuous and discrete solutions on compact domains). Consider the differential inclusion Ẋ(t) ∈ F(X(t)), for a given set-valued mapping F : R^m ⇒ R^m assumed to be outer semicontinuous, locally bounded, nonempty, and with convex values for every x ∈ C, for some closed set C ⊆ R^m.
Consider the discrete-time system represented by the flow F_d : R^n ⇒ R^n such that, for each compact set K ⊂ R^n, there exist ρ ∈ K_∞ and η* > 0 such that, for each x ∈ K and each η ∈ (0, η*],

F_d(x) ⊂ x + η con F(x + ρ(η)B) + ηρ(η)B.    (48)

Then, for every compact set K ⊂ R^n, every ε > 0, and every time horizon T ∈ R_{≥0}, there exists η* > 0 such that: for any η ∈ (0, η*] and any discrete solution x_k with x_k(0) ∈ K + δB, δ > 0, there exists a continuous solution x_t with x_t(0) ∈ K such that x_k and x_t are (T, ε)-close.

To prove Theorem 2, we will use Theorem 4, where we have to check that condition (48) is satisfied for the three proposed discretizations. We are now ready to prove Theorem 2. First, note that outer semicontinuity follows from the upper semicontinuity and the closedness of the Filippov differential inclusion map. Furthermore, local boundedness follows from continuity everywhere outside stationary points, which are isolated. Now, let us examine the discretization by the three proposed algorithms:

FORWARD-EULER DISCRETIZATION

The mapping F_d in this case is a singleton, given by F_d(x) ≜ x + ηF(x), where η > 0, which clearly satisfies condition (48).
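For illustration, a minimal NumPy sketch of this forward-Euler scheme, applied to both the q-RGF (2) and the q-SGF (3) on a simple strongly convex quadratic (the test function, c, q, and η here are our own illustrative choices, not the paper's tuned values):

```python
import numpy as np

def rgf_step(x, grad, c=1.0, q=3.0, eta=1e-2):
    """Forward-Euler step of the q-RGF: x+ = x - eta*c*g/||g||**((q-2)/(q-1))."""
    g = grad(x)
    n = np.linalg.norm(g)
    return x if n == 0.0 else x - eta * c * g / n**((q - 2) / (q - 1))

def sgf_step(x, grad, c=1.0, q=3.0, eta=1e-2):
    """Forward-Euler step of the q-SGF: x+ = x - eta*c*||g||**(1/(q-1))*sign(g)."""
    g = grad(x)
    n = np.linalg.norm(g)
    return x if n == 0.0 else x - eta * c * n**(1 / (q - 1)) * np.sign(g)

grad = lambda x: x               # f(x) = 0.5*||x||^2, gradient dominated with p = 2
x_r = x_s = np.array([1.0, -2.0])
for _ in range(2000):
    x_r, x_s = rgf_step(x_r, grad), sgf_step(x_s, grad)
```

Both iterates chatter in a small neighborhood of x* = 0 rather than converging exactly, which is the expected behavior of a fixed-step discretization of a finite-time flow.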

RUNGE-KUTTA DISCRETIZATION

Once again, the mapping F_d is a singleton, this time given by

F_d(x) ≜ x + η Σ_{i=1}^{K} α_i F(y_i),    (50a)
y_i = x + η Σ_{j=1}^{i-1} β_j F(y_j),    (50b)

where η, α_1, ..., α_K, β_1, ..., β_{K-1} > 0 are such that Σ_{i=1}^{K} α_i = 1. By equation (50b), one can establish a function ρ ∈ K_∞ such that, for each x_k ∈ K ⊂ R^n and each η > 0, y_i^k ∈ x_k + ηρ(η)B. Next, by equation (50a) together with the condition Σ_{i=1}^{K} α_i = 1, one can write that, for any x_k ∈ K and η > 0,

F_d(x_k) ⊂ x_k + η con F(y_i^k) ⊂ x_k + η con F(x_k + ηρ(η)B) + ηρ(η)B.
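A minimal sketch of the explicit Runge-Kutta scheme (50) with K = 2 stages, using the values α_1 = α_2 = 0.5 and β_1 = 0.09 listed in Appendix E.1, applied to the q-RGF field of a toy quadratic of our choosing:

```python
import numpy as np

def rk_step(x, F, eta=1e-2, alphas=(0.5, 0.5), betas=(0.09,)):
    """One explicit Runge-Kutta step (50): y_i = x + eta*sum_{j<i} beta_j*F(y_j),
    then x+ = x + eta*sum_i alpha_i*F(y_i), with sum(alphas) = 1."""
    Fs = []
    for i in range(len(alphas)):
        y = x + eta * sum(b * Fy for b, Fy in zip(betas[:i], Fs))
        Fs.append(F(y))
    return x + eta * sum(a * Fy for a, Fy in zip(alphas, Fs))

F = lambda z: -z / np.linalg.norm(z)**0.5   # q-RGF field for f = 0.5*||z||^2, c = 1, q = 3
x = np.array([1.0, -2.0])
for _ in range(2000):
    if np.linalg.norm(x) < 1e-3:            # stop near x* to avoid the singularity at 0
        break
    x = rk_step(x, F)
```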

NESTEROV-LIKE DISCRETIZATION

In this case the discrete-time flow F_d is defined as

F_d(x_k) = x_k + ηF(x_k + µ y_k) + µ y_k,    (51a)
y_k = x_k - x_{k-1}.    (51b)

To take into account the integral effect of the Nesterov-like discretization, we extend the continuous-time flow with the running average x̄_t ≜ (1/η) ∫_t^{t+η} x_t(s) ds, i.e.,

(d/dt)[x_t ; x̄_t] = [F(x_t) ; (x_{t+η} - x_t)/η],    (52)

and we then compare the solution of the extended continuous-time system (52) with the extended discrete-time system

x̃_{k+1} = [x_{k+1} ; x_{k+1} - x_k] = [x_k + ηF((1+µ)x_k - µx_{k-1}) + µ(x_k - x_{k-1}) ; ηF((1+µ)x_k - µx_{k-1}) + µ(x_k - x_{k-1})],

which can be rearranged as

F_d(x̃_k) = x̃_{k+1} = x̃_k + M x̃_k + η F̃(x̃_k), with M = [0 µ ; 0 µ-1] and F̃(x̃_k) = [F(x_k + µy_k) ; F(x_k + µy_k)],

which shows that, for any x̃_k ∈ K ⊂ R^{2n} and η > 0, we have (using similar recursive reasoning as above)

F_d(x̃_k) ⊂ x̃_k + η con F̃(x̃_k + ηρ(η)B) + ηρ(η)B.

Then, using Theorem 4, we conclude the (T, ε)-closeness between the continuous-time solutions of the flows F (q-RGF (2) and q-SGF (3)) and the discrete-time solutions of their respective discretizations by any of the three methods. Furthermore, for the Nesterov-like discretization, we can also conclude the (T, ε)-closeness between the integral x̄_t of the continuous-time solution and its discrete counterpart x̄_k. Finally, using the Lyapunov function V = f - f* as defined in the proof of Theorem 1, together with inequality (45), inequality (34), and a local Lipschitz bound on f, one can derive the convergence bound (23), as follows:

|f(x_k) - f(x*) - (f(x_t) - f(x*))| = |f(x_k) - f(x_t)| ≤ ε̃ ≜ L_f ε, with L_f > 0 and ε > 0,
f(x_k) - f(x*) ≤ ε̃ + (f(x_t) - f(x*)),
f(x_k) - f(x*) ≤ ε̃ + [(f(x_0) - f(x*))^(1-α) - c̄(1-α)ηk]^(1/(1-α)),

for k ≤ k*, where α = θ/θ̄, θ = (p-1)/p, θ̄ = (q-1)/q, c̄ = c [(p/(p-1))^((p-1)/p) µ^(1/p)]^(1/θ̄), and k* = (f(x_0) - f(x*))^(1-α) / (c̄(1-α)η).
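A minimal sketch of the Nesterov-like recursion (51), again on an illustrative quadratic with the q-SGF field (the values of η, µ, c, and q here are our own choices):

```python
import numpy as np

def nesterov_like(F, x0, eta=1e-3, mu=0.9, iters=5000):
    """Nesterov-like discretization (51): x_{k+1} = x_k + eta*F(x_k + mu*y_k) + mu*y_k,
    with the momentum difference y_k = x_k - x_{k-1} (and y_0 = 0)."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        y = x - x_prev
        x_prev, x = x, x + eta * F(x + mu * y) + mu * y
    return x

def F(z):
    # q-SGF field for f(z) = 0.5*||z||^2 (so grad f = z), with c = 1, q = 3
    n = np.linalg.norm(z)
    return -n**0.5 * np.sign(z) if n > 0 else np.zeros_like(z)

x = nesterov_like(F, [1.0, -2.0])
```

As with the other fixed-step schemes, the iterate settles into a small chattering neighborhood of x* = 0 rather than converging exactly.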

E ADDITIONAL DETAILS AND NUMERICAL RESULTS

In this section, we expand upon the numerical experiments discussed in the paper. In particular, we report more details on the hyper-parameter values used for the numerical tests, and report some results for the MNIST experiments.

E.1 HYPER-PARAMETER VALUES USED IN THE TESTS OF SECTION 4.1

• GD fixed step size: η = 10^-3
• RGF Euler disc. w/ fixed step size: q = 2.2, η = 10^-3
• RGF Euler disc. w/ fixed step size: q = 3, η = 10^-2
• RGF Euler disc. w/ fixed step size: q = 6, η = 10^-2
• RGF Euler disc. w/ fixed step size: q = 10, η = 10^-2
• GD Nesterov acceleration, fixed step size: η = 10^-4, µ = 0.9
• SGF Nesterov-like disc. w/ fixed step size: q = 2.2, η = 10^-4, µ = 0.9
• SGF Nesterov-like disc. w/ fixed step size: q = 3, η = 10^-3, µ = 0.9
• SGF Nesterov-like disc. w/ fixed step size: q = 6, η = 10^-3, µ = 0.9
• SGF Nesterov-like disc. w/ fixed step size: q = 10, η = 10^-2, µ = 0.09
• RGF Runge-Kutta disc. w/ fixed step size: q = 2.2, K = 2, η = 10^-2, β_1 = 0.09, α_1 = α_2 = 0.5
• RGF Runge-Kutta disc. w/ fixed step size: q = 3, K = 2, η = 10^-2, β_1 = 0.09, α_1 = α_2 = 0.5
• RGF Runge-Kutta disc. w/ fixed step size: q = 6, K = 2, η = 10^-2, β_1 = 0.09, α_1 = α_2 = 0.5
• RGF Runge-Kutta disc. w/ fixed step size: q = 10, K = 2, η = 10^-2, β_1 = 0.09, α_1 = α_2 = 0.5

In this experiment we optimize the CNN described by the PyTorch code sequence in MODEL 1, with a negative log-likelihood loss, on the MNIST dataset. We use 10 epochs of training, with 60 batches of 1000 images for training and 10 batches of 1000 images for testing. We tested Algorithm 1 RGF (c = 1, q = 3, η = 0.06, µ = 0.9) and Algorithm 2 SGF (c = 0.001, q = 2.1, η = 0.06, µ = 0.9) against Nesterov's accelerated gradient descent (GD) (η = 0.06, µ = 0.9, Nesterov=True) and Adam (η = 0.004, remaining coefficients at the nominal values of torch.optim).
We also tested other algorithms such as RMSprop, AdaGrad, and AdaDelta, but since their convergence and test performance were not competitive with GD on this experiment, we decided not to report them here, to avoid overloading the graphs. In Figures 7 and 8 we show the training loss over the training iterations, where we see that GD, RGF, and SGF outperform Adam in terms of convergence speed (a 20 s lead on average) and in terms of test performance: 98% for Adam, 98% for GD, and 99% for both RGF and SGF. RGF and SGF perform slightly better than GD in terms of convergence speed. The gain is relatively small (5 s to 10 s on average), which is expected in such a small DNN (please refer to the VGG16-CIFAR10 and VGG16-SVHN test results for larger computation-time gains). We also notice, in Figure 7, that all algorithms behave well in terms of avoiding overfitting.



Footnotes:

• Also known as closed-loop design in control-theoretic and reinforcement-learning terminology, meaning that η_k = ϕ(k, x_k) for some ϕ : Z_+ × R^n → R_+ that does not depend on {X_0, X_1, X_2, ...}. On the other hand, open-loop design can be seen as closed-loop design with ϕ(k, ·) constant for each k ∈ Z_+.
• For more modern treatments in English, see Łojasiewicz & Zurro (1999); Bolte et al. (2007).
• Note that there might be several ways of approaching this proof. For instance, one could follow the general results on stochastic approximation of set-valued dynamical systems, using the notion of perturbed solutions to differential inclusions presented in Benaïm et al. (2005).
• To avoid overloading the figures, we had to choose one flow at a time, either q-RGF or q-SGF. More results can be found in Appendix E.
• We ran multiple iterations to find the best step size for each algorithm (the best values were between 10^-4 and 10^-2, depending on the algorithm). Details of the step size for each test are given in Appendix E.
• We also tested Adaptive Gradient (AdaGrad), the per-dimension learning-rate method for gradient descent (AdaDelta), and Root Mean Square Propagation (RMSprop). However, since their performance was not competitive, we decided not to report the plots, to avoid overloading the figures.
• The notions introduced here are solely needed for rigorously dealing with the singular discontinuity at the equilibrium point of the q-RGF and q-SGF flows. The reader can skip these definitions and still intuitively follow the proofs of Theorems 1 and 2.
• A set-valued mapping F : R^n ⇒ R^n is outer semicontinuous if, for each sequence {x_i} converging to a point x ∈ R^n and each sequence y_i ∈ F(x_i) converging to a point y, it holds that y ∈ F(x). It is locally bounded if, for each x ∈ R^n, there exist compact sets K, K' ⊂ R^n such that x ∈ K and F(K) ≜ ∪_{x∈K} F(x) ⊂ K'.
• In what follows, we use the following notation: given a set A, con A denotes its convex hull, and B denotes the closed unit ball in the relevant Euclidean space.
• In all the tests, for q-RGF and q-SGF, c = 1 unless otherwise stated.




Figure 1: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF

Figure 2: Losses for several optimization algorithms-VGG-16-CIFAR10: Train loss (left), test loss (right)-We add an S before the name of an algorithm to denote its stochastic implementation.

Figure 3: Training loss vs. computation time for several optimization algorithms-VGG-16-CIFAR10

Figure 4: Losses for several optimization algorithms -CNN-SVHN: Train loss (left), test loss (right)

Figure 6: Example of the proposed discretization algorithms of finite-time q-RGF and q-SGF

Figure 7: Losses for several optimization algorithms-CNN-MNIST: Train loss (left), test loss (right)

Figure 11: Losses for several optimization algorithms-SVHN: Train loss (left), test loss (right)


E.2 HYPER-PARAMETER VALUES USED IN THE TESTS OF SECTION 4.2

Note that the description of the coefficients for each of the prior-art methods can be found at: https://pytorch.org/docs/stable/optim.html.

• GD: η = 4·10^-2, µ = 0.9, Nesterov=True
• RGF: η = 4·10^-2, µ = 0.9
• SGF: η = 4·10^-3, µ = 0.9
• ADAM: η = 8·10^-4 (remaining coefficients = nominal values)
• RMS: η = 10^-3 (remaining coefficients = nominal values)
• ADAGRAD: η = 10^-3 (remaining coefficients = nominal values)
• ADADELTA: η = 4·10^-2, ρ = 0.9, ε = 10^-6, weight decay = 0

E.3 MORE TESTS ON THE ROSENBROCK FUNCTION

Due to space limitations, we reported in the main paper only one test for q-RGF with Euler discretization, one test for q-SGF with Nesterov-like discretization, and one test for q-RGF with Runge-Kutta discretization. For the sake of completeness, we report here the remaining tests for each algorithm. One can observe in Figure 6 qualitative behavior similar to that noticed in the results of Section 4.5. The step size and other hyper-parameters for each test are given below:

• GD fixed step size: η = 10^-3
• SGF Euler disc. w/ fixed step size: q = 2.1, η = 10^-3
• SGF Euler disc. w/ fixed step size: q = 2.5, η = 10^-3
• SGF Euler disc. w/ fixed step size: q = 2.8, η = 10^-3
• SGF Euler disc. w/ fixed step size: q = 100, η = 10^-2
• GD Nesterov acceleration, fixed step size: η = 10^-4, µ = 0.9
• RGF Nesterov-like disc. w/ fixed step size: q = 2.2, η = 10^-4, µ = 0.9
• RGF Nesterov-like disc. w/ fixed step size: q = 3, η = 10^-3, µ = 0.9
• RGF Nesterov-like disc. w/ fixed step size: q = 6, η = 10^-3, µ = 0.9
• RGF Nesterov-like disc. w/ fixed step size: q = 10, η = 10^-3, µ = 0.9
• SGF Runge Kutta disc. w/fixed step size:

Remark 2 (Choice of q). The settling-time upper bound (15) decreases as q → ∞, which appears to lead to faster convergence when discretized. On the other hand, the larger q is, the stiffer the ODE, and hence the more prone to numerical instability, so q cannot be too large. Therefore, assuming p is not too large, it appears that q ∈ (p, p + δ] works best, with δ > 0 as small as needed to avoid numerical issues. For instance, if we know the cost function to be (locally) strongly convex, then we search for q slightly larger than p = 2 at first, but continue to increase it until performance deteriorates. If, on the other hand, we don't know the order p > 1, then it is currently unclear how to choose q; we will investigate this further in future work.
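As an illustration of the Euler-discretized q-SGF runs on the Rosenbrock function listed above, the following sketch uses q = 2.5 and η = 10^-3 (matching the corresponding bullet); the iteration count and starting point are our own choices:

```python
import numpy as np

def rosen(z):
    x, y = z
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosen_grad(z):
    x, y = z
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def sgf_euler(z0, c=1.0, q=2.5, eta=1e-3, iters=20000):
    """Euler-discretized q-SGF: z+ = z - eta*c*||g||**(1/(q-1))*sign(g)."""
    z = np.asarray(z0, dtype=float)
    best = rosen(z)
    for _ in range(iters):
        g = rosen_grad(z)
        n = np.linalg.norm(g)
        if n < 1e-12:
            break
        z = z - eta * c * n**(1 / (q - 1)) * np.sign(g)
        best = min(best, rosen(z))
    return z, best

z, best = sgf_euler([-1.0, 1.0])
```

With a fixed step the iterate chatters along the Rosenbrock valley rather than converging exactly, so the sketch tracks the best loss seen rather than only the last iterate.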
Furthermore, there is evidence that gradient dominance does hold locally in many deep-learning contexts (Zhou and Liang, 2017, https://arxiv.org/abs/1710.06910). Indeed, since convexity readily leads to gradient dominance of order p = ∞, it suffices that a slightly stronger form of it holds (but one weaker than strong convexity) in order to have p < ∞, and thus for us to be able to choose q > p.

In the main paper, due to space limitations, we decided to report the results of the Nesterov-like discretization only, which also seems like a fair comparison, since our Nesterov-like discretization of the q-RGF and q-SGF flows can be compared against the Nesterov implementation of the GD algorithm. However, we also wanted to test the performance of the simple Euler discretization of the proposed flows against the simple GD algorithm; to do so, we ran some extra tests on the SVHN dataset. These results are presented below. We tested the proposed Euler algorithms to train the same VGG16 CNN model with cross-entropy loss. We divided the dataset into a training set of 70 batches of 1000 images each and a test set of 10 batches of 1000 images each, and ran 20 epochs of training over all the training batches. We tested the Euler discretization of q-RGF (c = 1, q = 2.1, η = 0.04, µ = 0.9) and the Euler discretization of q-SGF (c = 10^-3, q = 2.1, η = 0.04, µ = 0.9) against gradient descent (GD) and Adam (same optimal tuning as in Section 4.2). All algorithms were implemented in their stochastic versions. In Figures 9 and 10 we can see that both algorithms, Euler q-RGF and Euler q-SGF, converge faster (a 40 min lead on average) than SGD and Adam in these tests, and reach an overall better performance on the test set.

E.6 EXPERIMENT 5: RUNGE-KUTTA DISCRETIZATION ON THE SVHN DATASET

Finally, for completeness, we also wanted to test the performance of the Runge-Kutta discretization of the proposed flows against SGD; to do so, we ran some extra tests on the SVHN dataset.
These results are presented below. We tested the proposed Runge-Kutta algorithms to train the same VGG16 CNN model with cross-entropy loss. We divided the dataset into a training set of 70 batches of 1000 images each and a test set of 10 batches of 1000 images each, and ran 20 epochs of training over all the training batches. We tested the Runge-Kutta discretization of q-RGF (c = 1, q = 2.1, K = 2, η = 10^-2, β_1 = 10^-2, α_1 = α_2 = 0.5) and the Runge-Kutta discretization of q-SGF (c = 10^-3, q = 2.1, K = 2, η = 10^-2, β_1 = 10^-2, α_1 = α_2 = 0.5) against

