A COMPUTATIONALLY EFFICIENT SPARSIFIED ONLINE NEWTON METHOD

Abstract

Second-order methods have enormous potential in improving the convergence of deep neural network (DNN) training, but are prohibitive due to their large memory and compute requirements. Furthermore, computing the matrix inverse or the Newton direction, which is needed in second-order methods, requires high precision computation for stable training as the preconditioner could have a large condition number. This paper provides a first attempt at developing computationally efficient sparse preconditioners for DNN training which can also tolerate low precision computation. Our new Sparsified Online Newton (SONew) algorithm emerges from the novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Our mathematical analysis allows us to reduce the condition number of our sparse preconditioning matrix, thus improving the stability of training with low precision. We conduct experiments on a feed-forward neural-network autoencoder benchmark, where we compare training loss of optimizers when run for a fixed number of epochs. In the float32 experiments, our methods outperform the best-performing first-order optimizers and perform comparably to Shampoo, a state-of-the-art second-order optimizer. However, our method is even more effective in low precision, where SONew finishes training considerably faster while performing comparably with Shampoo on training loss.

1. INTRODUCTION

Stochastic first-order methods, which use the negative gradient direction to update parameters, have become the standard for training deep neural networks (DNNs). Gradient-based preconditioning involves finding an update direction by multiplying the gradient with a preconditioner matrix, carefully chosen from gradients observed in previous iterations, to improve convergence. (Full-matrix) Adagrad (Duchi et al., 2011b), the online Newton method (Hazan et al., 2007), and natural gradient descent (Amari, 1998) use a full-matrix preconditioner, but computing and storing the full matrix is infeasible when there are millions of parameters. Thus, diagonal versions such as diagonal Adagrad, Adam (Kingma & Ba, 2014), and RMSprop (Hinton et al., 2012) are now widely used to train DNNs due to their scalability. Several higher-order methods have previously been applied to deep learning (Gupta et al., 2018; Anil et al., 2020; Goldfarb et al., 2020; Martens & Grosse, 2015). All these methods use Kronecker products that reduce computational and storage costs to make them feasible for training neural networks. However, these methods rely on matrix inverses or p-th roots that require high-precision arithmetic, as the matrices they deal with can have large condition numbers (Anil et al., 2020; 2022). Meanwhile, deep learning hardware accelerators have evolved towards using lower precision (bfloat16, float16, int8) (Henry et al., 2019; Jouppi et al., 2017) to reduce overall computational and memory costs and improve training performance. This calls for further research into developing efficient optimization techniques that work in low precision. Indeed there is recent work along these directions, from careful quantization of Adam to 8 bits (Dettmers et al., 2021) to optimizer-agnostic local loss optimization (Amid et al., 2022) that leverages first-order methods to match higher-order methods.
In this paper, we present a first attempt towards computationally efficient sparse preconditioners for DNN training. Regret analysis when using a preconditioner reveals that the error is bounded by two summations (see (3) below); the first summation depends on the change in the preconditioning matrix, while the second depends on the generalized gradient norm. We take the approach of minimizing the second term while regularizing two successive preconditioners to be close in the LogDet matrix divergence (Kulis et al., 2009). This technique gives us an online Newton method (Hazan et al., 2007). To make it computationally efficient, we further sparsify the preconditioner by finding a sparse approximation that is close in LogDet divergence. Thus we are consistent in using the same measure (LogDet divergence) in both the regularization and sparsification steps. This gives us our Sparsified Online Newton (SONew) method, which requires only O(n) time and memory per iteration. We achieve this by imposing structured sparsity, such as tridiagonal and banded sparsity patterns, on the preconditioner. This is unlike most existing online Newton methods, which require at least O(n^2) space and time per iteration. By making each step linear time, the SONew method can be applied to train modern DNNs. Further, for some sparsity structures our method is easily parallelized, making the overhead of computing the preconditioner negligible. We also show that introducing sparsity allows us to reduce the condition number of the problem; as a consequence, our preconditioner allows us to train DNNs even in low precision arithmetic. We establish regret bound guarantees for our algorithm in the online convex optimization framework. This involves using various properties of the LogDet divergence and its connections to other Bregman matrix divergences (Bregman, 1967), such as the von Neumann matrix divergence (Kulis et al., 2009).
We conduct experiments on an MLP Autoencoder, where we obtain better training loss compared to first-order methods. We also conduct experiments on large-scale benchmarks in Appendix A.5, and observe comparable or improved performance relative to Adam (Kingma & Ba, 2014). Our MLP experiments in limited precision arithmetic (bfloat16) show performance comparable to second-order methods, while being considerably faster.

2. BACKGROUND

The inner product between matrices is defined as ⟨A, B⟩ = Tr(A^T B), where Tr(·) denotes the matrix trace. The Frobenius norm of a matrix A is ∥A∥_F = √Tr(A^T A), while its spectral norm is ∥A∥_2 = max_x ∥Ax∥_2 / ∥x∥_2. We use I_n ∈ R^{n×n} to denote the identity matrix. We use S^n and S^n_{++} to denote the sets of symmetric and symmetric positive definite matrices, respectively. The generalized norm of a vector x ∈ R^n with respect to a matrix A ∈ S^n_{++} is defined as ∥x∥_A = √(x^T A x). We use det(A) to denote the determinant of a matrix A, and diag(A) to denote the diagonal matrix with diag(A)_ii = A_ii. We use G and Ĝ to denote a graph and a subgraph of it, with vertex set [n] = {1, . . . , n}. Let E_G denote the set of edges in graph G, and neig_G(i) the neighbours of vertex i in G. The sparsity graph/pattern of a matrix A ∈ R^{n×n} is a graph G with edges E_G = {(i, j) : A_ij ≠ 0} corresponding to the non-zero entries of A. We extensively use S^n_{++}(G) to denote the set of positive definite matrices with sparsity structure given by graph G. Given an index set I = {i_1, i_2, . . . , i_k}, we use A_II to denote the corresponding principal submatrix of A.

2.1. LOGDET MATRIX DIVERGENCE

Let ϕ : S^n_{++} → R be a strictly convex, differentiable function. Then the Bregman matrix divergence between X, Y ∈ S^n_{++} is defined as (Bregman, 1967; Kulis et al., 2009): D_ϕ(X, Y) = ϕ(X) − ϕ(Y) − Tr(∇ϕ(Y)^T (X − Y)). Since ϕ is convex, D_ϕ(X, Y) ≥ 0 for all X, Y ≻ 0. A well-known example is ϕ(X) = ∥X∥_F^2, for which the corresponding Bregman divergence D_ϕ(X, Y) = ∥X − Y∥_F^2 is the squared Frobenius norm. In this paper, we extensively use the divergence generated by the convex function ϕ(X) = −log det(X); the corresponding divergence measure D_ℓd(X, Y) is called the LogDet matrix divergence:

D_ℓd(X, Y) = −log det(XY^{-1}) + Tr(XY^{-1}) − n. (1)

The LogDet divergence is scale invariant under congruence by any invertible matrix A, i.e., D_ℓd(A^T X A, A^T Y A) = D_ℓd(X, Y). The LogDet divergence also takes the following revealing form in terms of the eigendecompositions X = V ΛV^T and Y = U ΘU^T (Kulis et al., 2009):

D_ℓd(X, Y) = Σ_i Σ_j (v_i^T u_j)^2 (λ_i/θ_j − log(λ_i/θ_j) − 1). (2)

These two properties are used in Section 3 to highlight the significance of the LogDet divergence in our algorithm.
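To make (1) and these two properties concrete, the following self-contained NumPy sketch (our own illustrative code, not from the paper) numerically checks the eigendecomposition identity (2) and the scale invariance under congruence:

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet divergence D_ld(X, Y) = -log det(X Y^{-1}) + Tr(X Y^{-1}) - n."""
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    sign, logdet = np.linalg.slogdet(M)
    return -logdet + np.trace(M) - n

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T + 4 * np.eye(4)          # random SPD matrices
B = rng.standard_normal((4, 4))
Y = B @ B.T + 4 * np.eye(4)

# Eigendecomposition form (2): sum_ij (v_i^T u_j)^2 (lam_i/th_j - log(lam_i/th_j) - 1)
lam, V = np.linalg.eigh(X)
th, U = np.linalg.eigh(Y)
r = lam[:, None] / th[None, :]
eig_form = (((V.T @ U) ** 2) * (r - np.log(r) - 1)).sum()
assert np.isclose(logdet_div(X, Y), eig_form)

# Scale invariance under congruence by an invertible matrix S
S = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # invertible w.h.p.
assert np.isclose(logdet_div(S.T @ X @ S, S.T @ Y @ S), logdet_div(X, Y))
```

The two assertions confirm that the explicit formula (1) and the spectral form (2) agree, and that congruence by S leaves the divergence unchanged.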

3. SONEW: SPARSIFIED ONLINE NEWTON METHOD

We now present our proposed SONew algorithm.

3.1. REGRET MINIMIZATION VIA LOGDET DIVERGENCE

We set up our problem in the online convex optimization (OCO) framework (Shalev-Shwartz et al., 2012; Hazan et al., 2016), where at each round the learner makes a prediction w_t in an online fashion and receives a convex loss f_t(w_t) and a gradient g_t = ∇f_t(w_t) as feedback. The goal of the learner is to reduce the regret R_T by predicting w_t so that the aggregate loss Σ_{t=1}^T f_t(w_t) is low compared to that of the best fixed point in hindsight, w* = arg min_w Σ_{t=1}^T f_t(w). Formally, the regret is

R_T(w_1, . . . , w_T) = Σ_{t=1}^T f_t(w_t) − Σ_{t=1}^T f_t(w*).

To upper bound this regret, we proceed as in Hazan et al. (2016) by analyzing the error in the iterates for the update w_{t+1} := w_t − ηX_t g_t, where X_t ∈ R^{n×n}. Then

∥w_{t+1} − w*∥²_{X_t^{-1}} = ∥w_t − ηX_t g_t − w*∥²_{X_t^{-1}} = ∥w_t − w*∥²_{X_t^{-1}} + η² g_t^T X_t g_t − 2η(w_t − w*)^T g_t.

The convexity of f_t implies that f_t(w_t) − f_t(w*) ≤ (w_t − w*)^T g_t, leading to

f_t(w_t) − f_t(w*) ≤ (1/2η)(∥w_t − w*∥²_{X_t^{-1}} − ∥w_{t+1} − w*∥²_{X_t^{-1}}) + (η/2) g_t^T X_t g_t.

Summing over all t ∈ [T] and rearranging reveals the following upper bound on the overall regret:

R_T ≤ (1/2η) ∥w_1 − w*∥²_{X_1^{-1}} + (1/2η) Σ_{t=2}^T (w_t − w*)^T (X_t^{-1} − X_{t-1}^{-1})(w_t − w*) + (η/2) Σ_{t=1}^T g_t^T X_t g_t. (3)

Since w* is unknown, finding the X_t which minimizes (3) is infeasible. So, to minimize the regret, we attempt to minimize the last term in (3) while regularizing X_t^{-1} to be "close" to X_{t-1}^{-1}. The nearness measure we choose is the LogDet matrix divergence, leading to the objective

X_t = arg min_{X ∈ S^n_{++}} g_t^T X g_t, such that D_ℓd(X, X_{t-1}) ≤ c_t, (4)

where D_ℓd is as in (1). Why do we use the LogDet divergence? From (2), due to the term λ_i/θ_j, D_ℓd(X, X_{t-1}) prioritizes matching the smaller eigenvalues of X_{t-1} with those of X, i.e., matching the larger eigenvalues of X_{t-1}^{-1} and X^{-1}. As a consequence, the LogDet divergence regularizes X by matching up its large eigenvalues with those of X_{t-1}.
For example, if the smallest and largest eigenvalues of X_{t-1} are θ_n and θ_1, then for an eigenvalue λ of X with λ > θ_1 ≥ θ_n, the penalty in (2) is higher for θ_n than for θ_1: (λ/θ_n − log(λ/θ_n) − 1) > (λ/θ_1 − log(λ/θ_1) − 1). This intuition leads us to formulate (4) as our objective. We note that there is precedent for using the LogDet divergence in the optimization literature; indeed, the celebrated BFGS algorithm (Broyden, 1967; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) can be shown to be the unique solution obtained when the LogDet divergence between successive preconditioners, subject to a secant constraint, is minimized (as shown in the beautiful 4-page paper by Fletcher (1991)). The optimization problem in (4) is convex in X since the LogDet divergence is convex in its first argument. The Lagrangian is

L(X, λ_t) = g_t^T X g_t + λ_t (D_ℓd(X, X_{t-1}) − c_t) = Tr(X g_t g_t^T) + λ_t (−log det(X X_{t-1}^{-1}) + Tr(X X_{t-1}^{-1}) − n − c_t).

Setting ∇L(X, λ_t) = 0 and using the fact that ∇ log det(X) = X^{-1}, we get the following update rule:

X_t^{-1} = X_{t-1}^{-1} + g_t g_t^T / λ_t. (5)

Note that setting c_t = 0 (equivalently, λ_t = ∞) for all t ∈ [T] in (4) results in no change to the preconditioner in any round; in this case, with X_0 = I_n, we recover online gradient descent (Zinkevich, 2003). On the other hand, setting λ_t = 1 gives the update rule of the online Newton method (Hazan et al., 2007). Our update rule differs from that of (full-matrix) Adagrad (Duchi et al., 2011b), which uses X_t^{-2} = X_{t-1}^{-2} + g_t g_t^T. Maintaining and updating X_t as in (5) is possible via the Sherman-Morrison formula, but requires O(n^2) storage and time complexity. This becomes impractical when n is of the order of millions, which is typically the case for DNNs.
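For intuition, here is a minimal dense sketch of update (5) via the Sherman-Morrison formula (our own illustration, not the paper's algorithm); it makes explicit the O(n^2) storage and per-step work that motivate sparsification:

```python
import numpy as np

def online_newton_step(w, X, g, lr=0.1, lam=1.0):
    """One dense online Newton step, sketching update (5):
    X_t^{-1} = X_{t-1}^{-1} + g g^T / lam, maintained directly on X_t
    via the Sherman-Morrison formula, then w_{t+1} = w_t - lr * X_t g."""
    Xg = X @ g
    X = X - np.outer(Xg, Xg) / (lam + g @ Xg)   # rank-1 Sherman-Morrison update
    return w - lr * (X @ g), X

n = 5
rng = np.random.default_rng(0)
w, X = np.zeros(n), np.eye(n)
Xinv = np.eye(n)                                # track X_t^{-1} separately as a check
for _ in range(3):
    g = rng.standard_normal(n)
    w, X = online_newton_step(w, X, g)
    Xinv += np.outer(g, g)                      # update (5) with lam = 1
assert np.allclose(np.linalg.inv(Xinv), X)      # both are O(n^2) per step -- the bottleneck
```

Even with the rank-1 update, the dense n × n matrix X must be stored and touched every iteration, which is what SONew's sparsity constraints avoid.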

Algorithm 1 Sparsified Online Newton (SONew) Algorithm

Inputs: λ_t := coefficient in the update (5), G := sparsity graph (banded/tridiagonal), ϵ := damping parameter, T := total number of iterations/mini-batches, η_t := step size/learning rate. Output: w_{T+1}
1: H_0 = ϵI_n, w_1 = 0
2: for t ∈ {1, . . . , T} do
3:   compute g_t = ∇f_t(w_t)
4:   H_t := H_{t-1} + P_G(g_t g_t^T / λ_t)
5:   (L, D) := SparsifiedInverse(H_t, G); X_t := L D L^T
6:   w_{t+1} := w_t − η_t X_t g_t
7: end for
8: return w_{T+1}

Algorithm 2 SparsifiedInverse: explicit solution of (11) for a banded graph G with band size b
1: function SparsifiedInverse(H, G)
2:   L := 0, D := 0
3:   L_jj := 1, ∀j ∈ [n]
4:   for j ∈ {1, . . . , n} do ▷ parallelizable
5:     Let H_{I_j j} and H_{I_j I_j} be defined as in Section 2, where I_j = {j + 1, . . . , j + b} ∩ [n]
6:     Solve for L_{I_j j} in the linear system H_{I_j I_j} L_{I_j j} = −H_{I_j j} ▷ O(b^3) time
7:     D_jj := 1/(H_jj + H_{I_j j}^T L_{I_j j})
8:   end for
9:   return L, D
10: end function

3.2. SPARSIFYING THE PRECONDITIONER

To reduce the memory required to maintain and update X_t using (5), we consider the following general problem: find a sparse X ≻ 0 with ∥X∥_0 ≤ αn, α > 1, such that the objective D_ℓd(X, (X_{t-1}^{-1} + g_t g_t^T/λ_t)^{-1}) is minimized. Due to the L_0-norm constraint, this is a non-convex problem, which makes it difficult to solve exactly. Since the L_1-norm serves as a convex relaxation of the L_0-norm, we could use it instead, resulting in the following optimization problem, also known as the graphical lasso estimator (Friedman et al., 2008):

min_{X ∈ S^n_{++}} D_ℓd(X, (X_{t-1}^{-1} + g_t g_t^T/λ_t)^{-1}) + λ ∥X∥_1.

The sparsity introduced by the L_1-norm penalty reduces memory usage. However, the time taken to solve the above problem, even with the current best methods (Bollhöfer et al., 2019; Hsieh et al., 2013; Fattahi & Sojoudi, 2019; Zhang et al., 2018), can still be too large (these methods take several minutes for a matrix of size one million), making it impractical to embed in DNN training, since preconditioning may need to be done after processing every mini-batch. In this paper, we take a different direction, where we use fixed sparsity pattern constraints specified by a fixed undirected graph G. To sparsify the solution in (5), we formulate the subproblem

X_t = arg min_{X ∈ S^n_{++}(G)} D_ℓd(X, (X_{t-1}^{-1} + g_t g_t^T/λ_t)^{-1}), (6)

where S^n_{++}(G) denotes the set of positive definite matrices with the fixed sparsity pattern given by graph G. Note that even for sparsification we use the LogDet measure; thus both steps (4) and (6) use the same measure. Algorithm 1 presents our proposed SONew method, which solves (6) in O(n) time and memory for banded matrices with band size b; in particular, a tridiagonal matrix, corresponding to a chain graph, is a banded matrix with band size 1.

Maintaining H_t ∈ S^n(G) (line 4). Solving the subproblem in (6) naively is impractical since X_{t-1}^{-1} is a dense matrix.
However, the structure of the LogDet divergence comes to the rescue; (6) can be expanded as

X_t = arg min_{X ∈ S^n_{++}(G)} −log det(X) + Tr(X(X_{t-1}^{-1} + g_t g_t^T/λ_t)). (7)

Let us define the projection onto S^n(G), P_G : R^{n×n} → R^{n×n}, as

P_G(M)_ij = M_ij if (i, j) ∈ E_G, and 0 otherwise. (8)

Note that the Tr(·) term in (7) involves only the non-zero elements of X ∈ S^n_{++}(G). Hence, (7) can be written as

X_t = arg min_{X ∈ S^n_{++}(G)} −log det(X) + Tr(X P_G(X_{t-1}^{-1} + g_t g_t^T/λ_t)). (9)

Computing the matrix X_{t-1}^{-1} can be avoided by analyzing the optimality condition of (9). Let g(X) be the objective function in (9); then the optimality condition of (9) is P_G(∇g(X)) = 0. Expanding g(X) gives

P_G(X_t^{-1}) = P_G(X_{t-1}^{-1} + g_t g_t^T/λ_t) = P_G(X_{t-1}^{-1}) + P_G(g_t g_t^T/λ_t), i.e., H_t = H_{t-1} + P_G(g_t g_t^T/λ_t). (10)

Thus we only need to maintain H_t = P_G(X_t^{-1}) ∈ S^n(G). This matrix is updated as H_t = H_{t-1} + P_G(g_t g_t^T/λ_t), which can be done in O(|E_G|) memory and time, whereas computing the matrix X_t^{-1} would cost O(n^2). In SONew, this key observation is used to maintain H_t in line 4.

Computing X_t in line 5. Now that H_t is known at every round t, we can replace P_G(X_{t-1}^{-1} + g_t g_t^T/λ_t) in (9) with H_t:

X_t = arg min_{X ∈ S^n_{++}(G)} −log det(X) + Tr(X H_t). (11)

For an arbitrary graph G, solving this subproblem can be difficult. Theorems 1 and 2 give embarrassingly parallelizable explicit solutions of subproblem (11) for tridiagonal and banded sparsity patterns; their proofs are given in Appendix A.1.

Theorem 1 (Explicit solution of (11) for a tridiagonal matrix/chain graph).
Let the sparsity structure G be a chain with edges E_G = {(j, j + 1) : j ∈ [n − 1]}, and let H ∈ S^n(G). Then the solution of (11) is given by X = LDL^T, where the unit lower triangular matrix L and the diagonal matrix D have the following non-zero entries:

L_jj = 1, L_{j+1,j} = −H_{j+1,j}/H_{j+1,j+1}, D_jj^{-1} = H_jj − H_{j+1,j}^2/H_{j+1,j+1} for j ≤ n − 1, and D_nn^{-1} = H_nn. (12)

The time and memory complexity required to compute (12) is O(n). Note that the computation of (12) is easily parallelized. This explicit solution generalizes to banded sparsity structures with band size b.

Theorem 2 (Explicit solution of (11) for banded matrices). Let the sparsity pattern G be a banded matrix of band size b, i.e., for every vertex j, let I_j = {j + 1, . . . , j + b}; then the edges are E_G = (∪_{j=1}^n {j} × I_j) ∩ {(i, j) : i ≤ n, j ≤ n}. Then X_t = LDL^T is the solution of (11), with the non-zero entries of L and D defined as follows:

L_jj = 1, L_{I_j j} = −H_{I_j I_j}^{-1} H_{I_j j}, D_jj^{-1} = H_jj − H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j}, 1 ≤ j ≤ n. (13)

Finding the above solution requires solving n linear systems of size b (which is small), as shown in Algorithm 2, and takes O((n − b + 1)b^3) flops. Since b ≪ n, the number of flops is O(n).
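Theorem 1 can be checked numerically. The sketch below (our own NumPy code; the helper names are ours) builds the explicit solution (12), verifies the optimality condition P_G(X^{-1}) = P_G(H), and applies X to a vector in O(n) without ever forming X:

```python
import numpy as np

def tridiag_solution(d, e):
    """Explicit solution (12) for the chain graph (a sketch).
    d[j] = H_jj and e[j] = H_{j+1,j} are the band of H; returns (L_sub, Dinv):
    the subdiagonal of the unit triangular L and the diagonal of D^{-1}."""
    Dinv = d.copy()
    Dinv[:-1] = d[:-1] - e ** 2 / d[1:]     # D_jj^{-1} = H_jj - H_{j+1,j}^2 / H_{j+1,j+1}
    L_sub = -e / d[1:]                      # L_{j+1,j} = -H_{j+1,j} / H_{j+1,j+1}
    return L_sub, Dinv

def apply_preconditioner(L_sub, Dinv, g):
    """Compute X g = L D L^T g in O(n), without forming X."""
    u = g.copy(); u[:-1] += L_sub * g[1:]   # u = L^T g
    u /= Dinv                               # u = D L^T g
    x = u.copy(); x[1:] += L_sub * u[:-1]   # x = L D L^T g
    return x

# Sanity check: the optimality condition of (11) is P_G(X^{-1}) = P_G(H),
# i.e. X^{-1} agrees with H on the diagonal and first off-diagonal.
rng = np.random.default_rng(0)
Gm = rng.standard_normal((5, 20))
H = Gm @ Gm.T + 1e-2 * np.eye(5)            # damped second-moment matrix
L_sub, Dinv = tridiag_solution(np.diag(H).copy(), np.diag(H, -1).copy())
L = np.eye(5); L[np.arange(1, 5), np.arange(4)] = L_sub
X = L @ np.diag(1.0 / Dinv) @ L.T
Xinv = np.linalg.inv(X)
assert np.allclose(np.diag(Xinv), np.diag(H))
assert np.allclose(np.diag(Xinv, -1), np.diag(H, -1))
g = rng.standard_normal(5)
assert np.allclose(apply_preconditioner(L_sub, Dinv, g), X @ g)
```

Each entry of L_sub and Dinv depends only on adjacent entries of the band, which is why the computation is embarrassingly parallel across j.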

3.3. ANALYSIS OF SONEW

The following theorem establishes regret guarantees for SONew in the online convex optimization framework of Section 3.1.

Theorem 3. When G is the tridiagonal/chain graph, Algorithm 1 incurs the following regret under convex losses f_t:

R_T ≤ O( C^{1/2} · T^{3/4} · ((1 + β)/(1 − β))^{1/2} · ( Σ_{i=1}^n log(1 + ∥g^{(i)}_{1:T}∥_2^2/ϵ) + Σ_{i=1}^{n−1} log(1 − β_i^2) )^{3/4} ),

where g^{(i)}_{1:T} = [(g_1)_i, . . . , (g_T)_i] denotes the gradient history of the i-th variable/parameter, G^{(i)}_∞ = ∥g^{(i)}_{1:T}∥_∞, C = max_t Σ_{i=1}^n (w_t − w*)_i^2 (G^{(i)}_∞)^2, and

β_i = ⟨g^{(i)}_{1:T}, g^{(i+1)}_{1:T}⟩ / √((ϵ + ∥g^{(i)}_{1:T}∥_2^2) · (ϵ + ∥g^{(i+1)}_{1:T}∥_2^2))

denotes the normalized dot-product between gradient histories of parameters connected in the chain graph, with β = max_{i∈[n−1]} β_i.

SONew is scale-invariant to diagonal transformations of the gradients: if the gradients sent as feedback to the OCO learner are ḡ_t = Λg_t, then the iterates for the transformed problem are w̄_t = Λ^{-1}w_t = w̄_{t-1} − η(Λ^{-1}X_{t-1}Λ^{-1})(Λg_{t-1}). This follows from the scale-invariance property of the LogDet divergence in Section 2.1, i.e., the transformed preconditioner is X̄_{t-1} = Λ^{-1}X_{t-1}Λ^{-1}. Furthermore, the regret bound derived in Theorem 3 is approximately the same for the transformed and original problems. Our regret bound is data-dependent: if β_i is nearer to 1, then the regret is smaller due to the log(1 − β_i^2) term. This effect is amplified when log(1 + ∥g^{(i)}_{1:T}∥_2^2/ϵ) is relatively low compared to |log(1 − β_i^2)|. The proof of Theorem 3 is given in Appendix A.2. We also derive a regret bound with O(√κ T^{1/2}) dependence on T in Appendix A.3, where κ bounds the condition number, κ(diag(H_t)) ≤ κ.

4. NUMERICAL STABILITY OF SONEW

The matrix H ∈ S^n(G) given as input to Algorithm 2 should have positive definite submatrices H_{J_j J_j}, J_j = I_j ∪ {j}, j ∈ [n], where I_j is as in Theorem 2; hence, it should have positive Schur complements H_jj − H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j}. But if the matrices H_{J_j J_j} and H_{I_j I_j} are ill-conditioned, then the computed value of H_jj + H_{I_j j}^T L_{I_j j} = H_jj − H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} in line 7 of Algorithm 2 can be zero or negative due to catastrophic cancellation, since finding L_{I_j j} = −H_{I_j I_j}^{-1} H_{I_j j} in floating point arithmetic can incur rounding errors. To understand this issue further, we conduct a perturbation analysis in Theorem 4, which establishes a componentwise condition number (Higham, 2002) of the optimization problem (11) with a tridiagonal sparsity structure G.

Theorem 4 (Condition number of the tridiagonal LogDet subproblem (11)). Let H ∈ S^n_{++} be such that H_ii = 1 for i ∈ [n]. Let ∆H be a symmetric perturbation such that ∆H_ii = 0 for i ∈ [n] and H + ∆H ∈ S^n_{++}. Let P_G(H) be the input to (11), where G is a chain graph. Then

κ^{ℓd}_∞ ≤ max_{i∈[n−1]} 2/(1 − β_i^2) =: κ̄^{ℓd}_∞, (14)

where β_i = H_{i,i+1} and κ^{ℓd}_∞ is the componentwise condition number of (11) for the perturbation ∆H.

So the tridiagonal LogDet problem, with inputs H as in Theorem 4, has a high condition number when 1 − β_i^2 = H_ii − H_{i,i+1}^2/H_{i+1,i+1} is low, and as a result the preconditioner X_t in SONew (Algorithm 1) has high componentwise relative errors. In SONew (Algorithm 1), the matrix H_t = P_G(Σ_{s=1}^t g_s g_s^T/λ_s) generated in line 4 could be such that Σ_{s=1}^t g_s g_s^T/λ_s need not be positive definite, and so the Schur complements H_ii − H_{i,i+1}^2/H_{i+1,i+1} can be zero, giving an infinite condition number κ^{ℓd}_∞ by Theorem 4. The following lemma describes such cases in detail for the more general banded sparsity structure.

Lemma 1 (Degenerate inputs to the banded LogDet subproblem).
Let H = P_G(GG^T), where G ∈ R^{n×T}, and let g^{(i)}_{1:T} be the i-th row of G, i.e., the gradients of parameter i over T rounds, so that H_ij = ⟨g^{(i)}_{1:T}, g^{(j)}_{1:T}⟩.

• Case 1: For the tridiagonal sparsity structure G: if g^{(j)}_{1:T} = g^{(j+1)}_{1:T}, then H_jj − H_{j,j+1}^2/H_{j+1,j+1} = 0.

• Case 2: For b > 1 in (13): if rank(H_{J_j J_j}) = rank(H_{I_j I_j}) = b, then H_jj − H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} = 0 and D_jj = ∞. If rank(H_{I_j I_j}) < b, then the inverse H_{I_j I_j}^{-1} does not exist and D_jj is not well-defined.

If GG^T = Σ_{i=1}^T g_i g_i^T is a singular matrix, then the solution to the LogDet problem might not be well-defined, as shown in Lemma 1. For instance, Case 1 can occur when preconditioning the input layer of an image-based DNN with flattened image inputs, where the j-th and (j+1)-th pixels can be highly correlated throughout the dataset. Case 2 can occur in the first b iterations of Algorithm 1, when the rank of the submatrices satisfies rank(H_{I_j I_j}) < b and ϵ = 0. We develop Algorithm 3, which is robust to degenerate inputs H, given that H_ii > 0. It finds a subgraph Ĝ of G for which (13) is well-defined. This is done by removing edges which cause the inverse H_{I_j I_j}^{-1} to be singular or the Schur complement H_jj − H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} to be low. Our ablation study in Table 4 demonstrates a noticeable improvement in performance when Algorithm 3 is used.

Theorem 5 (Numerically stable algorithm). Algorithm 3 finds a subgraph Ĝ of G such that the explicit solution (13) for Ĝ is well-defined. Furthermore, when G is a tridiagonal/chain graph, the componentwise condition number upper bound in (14) is reduced by using Algorithm 3: κ̄^Ĝ_{ℓd} < κ̄^G_{ℓd}, where κ̄^Ĝ_{ℓd} and κ̄^G_{ℓd} are defined as in Theorem 4 for the graphs Ĝ and G respectively. The proofs of Lemma 1 and Theorem 5 are given in Appendix A.4.

Algorithm 3 Numerically stable banded LogDet solution
1: Input: G – tridiagonal or banded graph; H – symmetric matrix in R^{n×n} with sparsity structure G and H_ii > 0; γ – tolerance parameter for low Schur complements.
2: Output: a subgraph Ĝ of G without any of the degenerate cases of Lemma 1, and the preconditioner X corresponding to the subgraph.
3: Let E_i = {(i, j) : (i, j) ∈ E_G} be the edges from vertex i to its neighbours in graph G.
4: Let V⁺_i = {j : i < j, (i, j) ∈ E_G} and V⁻_i = {j : i > j, (i, j) ∈ E_G} denote the positive and negative neighbourhoods of vertex i.
5: Let K = {i ∈ [n] : H_ii − H_{I_i i}^T H_{I_i I_i}^{-1} H_{I_i i} is not defined or ≤ γ}.
6: Consider a new subgraph Ĝ with edges E_Ĝ = E_G \ (∪_{i∈K} E_i ∪ (V⁺_i × V⁻_i)).
7: return X := SparsifiedInverse(Ĥ_t, Ĝ), where Ĥ_t = P_Ĝ(H_t).
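For the chain graph, the edge-dropping rule of Algorithm 3 reduces to a simple guard on each Schur complement; the sketch below is our own code (the tolerance γ and the fall-back-to-diagonal behaviour are illustrative assumptions in the spirit of Algorithm 3, not its exact pseudocode):

```python
import numpy as np

def stable_tridiag_inverse(d, e, gamma=1e-6):
    """Tridiagonal sketch of Algorithm 3: drop edge (j, j+1) whenever its
    Schur complement d[j] - e[j]**2 / d[j+1] falls at or below gamma, then
    apply the explicit solution (12) on the surviving edges."""
    n = d.shape[0]
    L = np.eye(n)
    Dinv = d.copy()
    for j in range(n - 1):
        schur = d[j] - e[j] ** 2 / d[j + 1]
        if schur <= gamma:          # degenerate edge: fall back to diagonal
            continue                # (keeps L[j+1, j] = 0 and Dinv[j] = d[j])
        L[j + 1, j] = -e[j] / d[j + 1]
        Dinv[j] = schur
    return L @ np.diag(1.0 / Dinv) @ L.T

# Case 1 of Lemma 1: two identical gradient coordinates make a Schur
# complement exactly zero; the guarded solve still returns a finite X.
g = np.array([1.0, 1.0, 2.0])           # coordinates 1 and 2 identical
H = np.outer(g, g)
X = stable_tridiag_inverse(np.diag(H).copy(), np.diag(H, -1).copy())
assert np.all(np.isfinite(X))
```

Without the guard, the zero Schur complement would produce a division by zero in D_jj, exactly the failure mode Theorem 4 and Lemma 1 describe.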

5. RELATED WORK

The online Newton method is a second-order method in the online convex optimization framework, with properties such as scale invariance (Luo et al., 2016) and logarithmic regret for exp-concave and strongly convex functions (Hazan et al., 2007; 2016). However, it has a time complexity of O(n^2), making it infeasible for large n. A diagonal version of this method, SC-Adagrad (Mukkamala & Hein, 2017), was proposed to make the online Newton method scalable for deep learning. SC-Adagrad is equivalent to setting the sparsity pattern G in equation 11 to the null graph; furthermore, setting G to the complete graph recovers the online Newton method. The introduction of the LogDet divergence measure in SONew allows us to use different sparsity graphs G, such as a banded graph with band size b, for which our preconditioning process is more computationally efficient, with a time complexity of O(b^3(n − b + 1)) compared to the online Newton method's O(n^2). As discussed in Theorem 3, SONew also utilizes correlations among the gradients of different parameters to improve convergence, in contrast to diagonal preconditioners such as SC-Adagrad. Shampoo (Gupta et al., 2018; Anil et al., 2020) uses Kronecker-factored preconditioners to reduce the memory and time complexity from O(n^2) to O(d_1^2 + d_2^2) and O(d_1^3 + d_2^3) respectively, where n = d_1 d_2 denotes the number of parameters of a linear layer of dimensions d_1 × d_2. The time complexity of matrix inversion takes a heavy toll on Shampoo's compute time even with the Kronecker product assumption on the preconditioner, whereas our method has a time complexity of O(b^3 d_1 d_2), quadratic in the dimensions of the linear layer (note that b = 1 for the tridiagonal structure). Furthermore, the precision lost in matrix inversion can be large due to the high condition number matrices occurring in training (Anil et al., 2022; 2020).
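As a rough worked example of the complexity comparison above (the constants are illustrative only, not measured):

```python
# Rough flop counts for preconditioning one d x d linear layer (n = d^2
# parameters); constant factors are illustrative, not benchmarked.
d, b = 1000, 4
shampoo_flops = 2 * d ** 3          # invert two d x d Kronecker factors: O(d1^3 + d2^3)
sonew_flops = b ** 3 * d * d        # n small b x b solves: O(b^3 * d1 * d2)
print(shampoo_flops / sonew_flops)  # -> 31.25: far fewer flops for band size 4
```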
The LogDet problem in equation 11 is closely related to Maximum Determinant Matrix Completion (MDMC) (Andersen et al., 2013; Vandenberghe et al., 2015). The MDMC problem is the dual of the LogDet problem (11), and has explicit solutions for chordal graphs (Andersen et al., 2013); thus the explicit solutions in (13) are the same as the ones proved in Andersen et al. (2013). We also note that the tridiagonal explicit solution has been used previously in KFAC (Martens & Grosse, 2015), in the context of a Gaussian graphical model interpretation of gradients. In this paper we additionally analyze the conditioning and degenerate cases of these explicit solutions (which can occur frequently in DNN training) and develop algorithms to avoid such cases. In addition, we provide regret guarantees connecting these explicit solutions to regret in the online convex optimization setup. There is prior work (Luo et al., 2016; 2019) on reducing the O(n^2) flops of the Online Newton Step (ONS) to O(n) flops using sketching. These ONS variants maintain a low-rank approximation of H_t (as in Algorithm 1), and updating it with a new gradient g_t at every iteration requires an SVD (Luo et al., 2019) or orthonormalization (Luo et al., 2016) of a tall and thin matrix in R^{n×r}, where r denotes the rank of the approximation of H_t. Our proposed method (Algorithm 1) has a more parallelizable update, H_t := H_{t-1} + P_G(g_t g_t^T), making it more suitable for DNN training.

6. EXPERIMENTAL RESULTS

In this section we describe our experiments on the Autoencoder benchmark (Schmidhuber, 2015) using the MNIST dataset (Deng, 2012). Our results on larger benchmarks are given in Appendix A.5. We compare SONew against commonly used first-order methods, including SGD (Kiefer & Wolfowitz, 1952), SGD with Momentum (Qian, 1999), Nesterov (Nesterov, 1983), Adagrad (Duchi et al., 2011a), Adam (Kingma & Ba, 2014), and RMSprop (Tieleman & Hinton, 2012). We also compare with Shampoo (Gupta et al., 2018), a state-of-the-art second-order optimizer used in practice. Computing the preconditioner at every step in Shampoo can be infeasible; instead it is computed every t steps, referred to as Shampoo(t) in the experiments. For SONew, we use an exponential moving average (EMA) for both the first-order and second-order statistics. EMA is commonly used in adaptive optimizers for deep learning training (Kingma & Ba, 2014; Qian, 1999). Let β_1, β_2 ∈ [0, 1] be the coefficients for the EMA, and let g_t ∈ R^n be the gradient at the t-th iteration/mini-batch. Let µ_t ∈ R^n be the first-order gradient statistic, and H_t ∈ S^n(G) the second-order gradient statistic as in (10). Then the following modified update rules are used in the SONew implementation:

µ_t = β_1 µ_{t-1} + (1 − β_1)g_t, H_t = β_2 H_{t-1} + (1 − β_2)P_G(g_t g_t^T).

Furthermore, µ_t is used in w_{t+1} = w_t − η_t X_t µ_t, replacing the gradient g_t, similar to Kingma & Ba (2014). As discussed in Section 3.2, we need to store only the values of H_t corresponding to the sparsity pattern; hence SONew uses O(n) space. The updates above are computed in parallel in O(1) time. Moreover, we use Algorithm 3 to make SONew numerically stable. We also use grafting (Agarwal et al., 2022), a technique for transferring step sizes between optimization algorithms: given an update v_1 of Optimizer-1 and v_2 of Optimizer-2, grafting uses the direction suggested by Optimizer-2 with the step size suggested by Optimizer-1.
The final update is given by (∥v_1∥/∥v_2∥) · v_2. Grafting has been shown to take advantage of a tuned optimizer step size and improve performance. In our case, we use Adam grafting: the Adam optimizer step size ∥v_1∥ with the SONew direction v_2/∥v_2∥. We use three sparsity patterns for SONew: a) diagonal sparsity, resulting in a diagonal preconditioner similar to adaptive first-order methods like Adam and Adagrad; b) tridiagonal sparsity, corresponding to a chain graph; and c) banded sparsity, denoted "band-k" in tables and figures for band size k. All the baselines and SONew are trained for 100 epochs on the train set of MNIST, containing 60k points. We use a standard-sized (2.72M parameters) Autoencoder with layer sizes [1000, 500, 250, 30, 250, 500, 1000] and tanh non-linearity. The batch size is fixed at 1000, and we use a learning rate schedule with a linear warmup of 5 epochs followed by linear decay towards 0. Over 2k hyperparameter configurations are searched for each experiment using a Bayesian optimization package; the search space for each optimizer is given in Appendix A.5. From the float32 experiments in Table 1, we observe that among first-order methods, diag-SONew performs the best while taking the same amount of time. Increasing the number of edges in the sparsity graph to tridiagonal or banded with band size 4 enhances performance further. Tridiag-SONew runs 4× faster than Shampoo at a marginal cost in loss, even when Shampoo updates its preconditioner only once every 20 steps. When dealing with large-scale models like ResNet-50 (He et al., 2015a), Shampoo requires updating the preconditioner much more often (Anil et al., 2019), and our method is promising in such scenarios as well; we leave a comparison of SONew with Shampoo on such large benchmarks as future work. To show the efficacy of SONew at lower precision, we conduct bfloat16 experiments. We notice in Table 2 that diag-SONew performs the best among first-order methods, just as in float32.
Moreover, SONew undergoes the least degradation in performance compared to all other optimizers. We also provide an ablation study on the effect of using Algorithm 3 on the training loss in the appendix.

Table 1: float32 experiments on the Autoencoder benchmark. "tridiag" represents tridiag-SONew, and "band-4" represents banded-SONew with band size 4. We observe that diag-SONew performs the best among all first-order methods while taking similar time. tridiag and band-4 perform significantly better than first-order methods while requiring similar linear space and being only marginally slower. Shampoo performs best but takes O(d^3) time to compute the preconditioner of a linear layer of size d × d, whereas our methods take O(d^2).
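The EMA and grafting modifications described in this section can be sketched as follows (our own illustrative code; `precond`, which applies X_t to a vector, and `adam_update`, Adam's own update vector v_1, are placeholders, not part of the paper's pseudocode):

```python
import numpy as np

def grafted_sonew_step(w, mu, g, adam_update, precond, lr, b1=0.9):
    """One SONew step with first-order EMA and Adam grafting (a sketch).
    `precond` stands in for applying the SONew preconditioner X_t, and
    `adam_update` for Adam's update v_1 -- both are assumptions here."""
    mu = b1 * mu + (1 - b1) * g                   # mu_t = b1*mu_{t-1} + (1-b1)*g_t
    v2 = precond(mu)                              # SONew direction X_t mu_t
    step = np.linalg.norm(adam_update) * v2 / np.linalg.norm(v2)
    return w - lr * step, mu                      # ||v1|| norm, v2 direction

w, mu = np.zeros(3), np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
w, mu = grafted_sonew_step(w, mu, g, adam_update=g, precond=lambda v: 2 * v, lr=0.1)
# the grafted step has the norm of Adam's update but SONew's direction
assert np.isclose(np.linalg.norm(w), 0.1 * np.linalg.norm(g))
```

The second-order EMA of H_t is analogous, applied only to the band entries kept by P_G, so the O(n) memory footprint is preserved.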

A APPENDIX

A.1 PROPERTIES OF LOGDET SUBPROBLEM

Proof of Theorem 2

The optimality condition of (11) is P_G(X^{-1}) = P_G(H). Let Z = X^{-1} = L^{-T} D^{-1} L^{-1}, so that P_G(Z) = H. From ZL = L^{-T} D^{-1} we get

Z L e_j = L^{-T} D^{-1} e_j. (15)

Let J_j = I_j ∪ {j}, and select the J_j indices of the vectors on both sides of (15):

[Z_jj, Z_{jI_j}; Z_{I_j j}, Z_{I_j I_j}] [1; L_{I_j j}] = [1/d_jj; 0].

Note that L^{-T} is an upper triangular matrix with ones on the diagonal, hence the J_j block of L^{-T} D^{-1} e_j is [1/d_jj, 0, 0, . . .]^T. Also, since P_G(Z) = H,

[Z_jj, Z_{jI_j}; Z_{I_j j}, Z_{I_j I_j}] = [H_jj, H_{jI_j}; H_{I_j j}, H_{I_j I_j}].

Substituting this into the linear equation (15) and multiplying through by d_jj gives

H_jj d_jj + d_jj H_{I_j j}^T L_{I_j j} = 1,
H_{I_j j} d_jj + d_jj H_{I_j I_j} L_{I_j j} = 0.

The theorem follows from solving these equations. Note that here we used the fact that the lower triangular halves of the matrices L and H have the same sparsity pattern, which follows from the fact that the banded graph is chordal and has the perfect elimination order [1, 2, . . . , n].
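As a numerical check of Theorem 2 and Algorithm 2 (our own NumPy sketch; the function name is ours), the optimality condition P_G(X^{-1}) = P_G(H) can be verified directly:

```python
import numpy as np

def banded_sparsified_inverse(H, b):
    """Algorithm 2 in NumPy: solve the n small (at most b x b) linear systems
    of (13) and return the dense X = L D L^T for checking purposes."""
    n = H.shape[0]
    L = np.eye(n)
    Dinv = np.empty(n)
    for j in range(n):                       # each j is independent (parallelizable)
        I = np.arange(j + 1, min(j + b + 1, n))
        HIj = H[I, j]
        LIj = -np.linalg.solve(H[np.ix_(I, I)], HIj) if I.size else HIj
        L[I, j] = LIj
        Dinv[j] = H[j, j] + HIj @ LIj        # Schur complement, as in line 7
    return L @ np.diag(1.0 / Dinv) @ L.T

# Check the optimality condition P_G(X^{-1}) = P_G(H) for band size b = 2.
rng = np.random.default_rng(0)
Gr = rng.standard_normal((6, 20))
H = Gr @ Gr.T + 1e-2 * np.eye(6)
b = 2
X = banded_sparsified_inverse(H, b)
Xinv = np.linalg.inv(X)
for k in range(b + 1):
    assert np.allclose(np.diag(Xinv, -k), np.diag(H, -k))
```

The loop confirms that X^{-1} agrees with H on every diagonal within the band, while the off-band entries of X are zero by construction.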

Proof of Theorem 1

The proof follows directly from Theorem 2 by setting the band size b to 1.

A.2 REGRET BOUND ANALYSIS

Proof of Theorem 3. To upper bound the regret we first prove the following regret decomposition lemma.

Lemma 2 (Hazan et al. (2016)). In the OCO problem setup, if a prediction $w_t \in \mathbb{R}^n$ is made at round $t$ and is updated as $w_{t+1} := w_t - \eta X_t g_t$ using a preconditioner matrix $X_t \in S_n^{++}$, then
$$R_T \le \frac{1}{2\eta}\left(\|w_1 - w^*\|^2_{X_1^{-1}} - \|w_{T+1} - w^*\|^2_{X_T^{-1}}\right) \quad (16)$$
$$+ \frac{1}{2\eta}\sum_{t=1}^{T-1}(w_{t+1} - w^*)^T(X_{t+1}^{-1} - X_t^{-1})(w_{t+1} - w^*) \quad (17)$$
$$+ \sum_{t=1}^{T}\frac{\eta}{2}\, g_t^T X_t g_t. \quad (18)$$

Proof.
$$\|w_{t+1} - w^*\|^2_{X_t^{-1}} = \|w_t - \eta X_t g_t - w^*\|^2_{X_t^{-1}} = \|w_t - w^*\|^2_{X_t^{-1}} + \eta^2 g_t^T X_t g_t - 2\eta(w_t - w^*)^T g_t$$
$$\implies 2\eta(w_t - w^*)^T g_t = \|w_t - w^*\|^2_{X_t^{-1}} - \|w_{t+1} - w^*\|^2_{X_t^{-1}} + \eta^2 g_t^T X_t g_t.$$
Using the convexity of $f_t$, $f_t(w_t) - f_t(w^*) \le (w_t - w^*)^T g_t$, where $g_t = \nabla f_t(w_t)$, and summing over $t \in [T]$,
$$R_T \le \sum_{t=1}^{T}\frac{1}{2\eta}\left(\|w_t - w^*\|^2_{X_t^{-1}} - \|w_{t+1} - w^*\|^2_{X_t^{-1}}\right) + \frac{\eta}{2}\, g_t^T X_t g_t. \quad (19)$$
The first summation can be decomposed as follows:
$$\sum_{t=1}^{T}\|w_t - w^*\|^2_{X_t^{-1}} - \|w_{t+1} - w^*\|^2_{X_t^{-1}} = \|w_1 - w^*\|^2_{X_1^{-1}} - \|w_{T+1} - w^*\|^2_{X_T^{-1}} + \sum_{t=1}^{T-1}(w_{t+1} - w^*)^T(X_{t+1}^{-1} - X_t^{-1})(w_{t+1} - w^*).$$
Substituting this identity into Equation (19) proves the lemma.

Thus $R_T \le T_1 + T_2 + T_3$, where
• $T_1 = \frac{1}{2\eta}\left(\|w_1 - w^*\|^2_{X_1^{-1}} - \|w_{T+1} - w^*\|^2_{X_T^{-1}}\right)$
• $T_2 = \frac{1}{2\eta}\sum_{t=1}^{T-1}(w_{t+1} - w^*)^T(X_{t+1}^{-1} - X_t^{-1})(w_{t+1} - w^*)$
• $T_3 = \sum_{t=1}^{T}\frac{\eta}{2}\, g_t^T X_t g_t$

The following lemmas will be used to bound $T_1, T_2, T_3$.

Lemma 3. If $G$ is the chain/tridiagonal graph and $\hat X = \arg\min_{X \in S_n(G)^{++}} D_{\ell d}(X, H^{-1})$, then the inverse $\hat X^{-1}$ has the closed form
$$(\hat X^{-1})_{ij} = \begin{cases} H_{ij} & |i-j| \le 1 \\[4pt] \dfrac{H_{i,i+1} H_{i+1,i+2} \cdots H_{j-1,j}}{H_{i+1,i+1} \cdots H_{j-1,j-1}} & i < j - 1. \end{cases} \quad (20)$$

Proof. We verify $\hat X^{-1} \hat X^{(j)} = e_j$, where $\hat X^{(j)}$ is the $j$-th column of $\hat X$. Let $\hat Y$ denote the right-hand side of Equation (20).
$$(\hat Y \hat X)_{jj} = \hat X_{jj}\hat Y_{jj} + \hat X_{j-1,j}\hat Y_{j-1,j} + \hat X_{j,j+1}\hat Y_{j,j+1} = \hat X_{jj} H_{jj} + \hat X_{j-1,j} H_{j-1,j} + \hat X_{j,j+1} H_{j,j+1} = 1.$$
The third equality uses the following alternative form of Equation (12):
$$\hat X_{ij} = \begin{cases} 0 & \text{if } j - i > 1 \\[4pt] \dfrac{-H_{i,i+1}}{H_{ii}H_{i+1,i+1} - H_{i,i+1}^2} & \text{if } j = i + 1 \\[4pt] \dfrac{1}{H_{ii}}\left(1 + \displaystyle\sum_{k \in \mathrm{neig}_G(i)} \dfrac{H_{ik}^2}{H_{ii}H_{kk} - H_{ik}^2}\right) & \text{if } i = j, \end{cases}$$
where $i \le j$. Similarly, the off-diagonals of $\hat Y \hat X$ evaluate to zero:
$$(\hat Y \hat X)_{ij} = \hat Y_{ij}\hat X_{jj} + \hat Y_{i,j-1}\hat X_{j-1,j} + \hat Y_{i,j+1}\hat X_{j+1,j} = \hat Y_{ij}\left(\hat X_{jj} + \frac{H_{j-1,j-1}}{H_{j-1,j}}\hat X_{j-1,j} + \frac{H_{j,j+1}}{H_{jj}}\hat X_{j+1,j}\right) = 0.$$

Lemma 4. Let $y \in \mathbb{R}^n$, $\beta = \max_t \max_{i \in [n-1]} (H_t)_{i,i+1}/\sqrt{(H_t)_{ii}(H_t)_{i+1,i+1}} < 1$, and $C = \sum_{i=1}^n y_i^2 (G^{(i)}_\infty)^2$, where $g^{(i)}_{1:T} = [(g_1)_i, \ldots, (g_T)_i]$ and $G^{(i)}_\infty = \|g^{(i)}_{1:T}\|_\infty$. Then
$$y^T X_t^{-1} y \le (Ct + \epsilon\|y\|_2^2)\,\frac{1+\beta}{1-\beta}.$$

Proof. Let $\tilde X_t^{-1} = \mathrm{diag}(H_t)^{-1/2} X_t^{-1}\, \mathrm{diag}(H_t)^{-1/2}$. Then
$$y^T X_t^{-1} y \le \left\|\mathrm{diag}(H_t)^{1/2} y\right\|_2^2 \left\|\tilde X_t^{-1}\right\|_2. \quad (22)$$
Using the spectral radius identity $\rho(X) \le \|X\|_\infty$ and positive definiteness of $\tilde X_t^{-1}$,
$$\left\|\tilde X_t^{-1}\right\|_2 \le \left\|\tilde X_t^{-1}\right\|_\infty \le \max_i \sum_j \left|(\tilde X_t^{-1})_{ij}\right| \le 1 + 2(\beta + \beta^2 + \ldots) \le \frac{1+\beta}{1-\beta},$$
where the third inequality uses Lemma 3. Using $(H_t)_{ii} = \|g^{(i)}_{1:t}\|_2^2 + \epsilon$ in Equation (22) gives the lemma.

Lemma 5 (Upper bound on $T_1$). $T_1 \le \frac{C + \epsilon D_2^2}{2\eta}\cdot\frac{1+\beta}{1-\beta}$, where $D_2 = \max_{t\in[T]}\|w_t - w^*\|_2$ and $C = \max_t \sum_{i=1}^n ((y^{(t)})_i)^2 (G^{(i)}_\infty)^2$.

Proof. Since $X_T$ is positive definite,
$$T_1 \le \frac{\|w_1 - w^*\|^2_{X_1^{-1}}}{2\eta} = \frac{(y^{(1)})^T X_1^{-1} y^{(1)}}{2\eta}.$$
Applying Lemma 4 proves the lemma.

Upper bounding $T_2$. Let $y_t = w_t - w^*$, $P_t = y_t^T X_t^{-1} y_t$, and $Q_t = y_t^T X_{t-1}^{-1} y_t$, collected into the arrays $P = [P_2, \ldots, P_T]$ and $Q = [Q_2, \ldots, Q_T]$. Then
$$2\eta T_2 = \sum_{t=1}^{T-1} y_{t+1}^T\left(X_{t+1}^{-1} - X_t^{-1}\right)y_{t+1} = \sum_{t=2}^{T} y_t^T\left(X_t^{-1} - X_{t-1}^{-1}\right)y_t = \sum_{t=2}^{T}(P_t - Q_t) \le \|P - Q\|_1. \quad (23)$$
In order to upper bound this, we derive here a generalized version of Pinsker's inequality for our setting.

Lemma 6.
Let $A = [A_1, \ldots, A_T]$ and $B = [B_1, \ldots, B_T]$ be two $T$-length arrays with $A_i, B_i > 0$ for all $i \in [T]$. Then
$$\|A - B\|_1^2 \le \frac{2}{3}\sum_{i=1}^T (2A_i + B_i)\, D_{KL}(B, A),$$
where the generalized KL-divergence is $D_{KL}(B, A) = \sum_{i=1}^T \left(B_i \log\frac{B_i}{A_i} - B_i + A_i\right)$.

Proof. This proof is an analogue of Pinsker's inequality for general positive arrays. The first step uses the following identity, valid for $t \ge -1$:
$$(1+t)\log(1+t) - t \ge \frac{1}{2}\cdot\frac{t^2}{1 + t/3}.$$
Let $t_i = B_i/A_i - 1 \ge -1$. Then
$$D_{KL}(B, A) = \sum_{i=1}^T A_i\left(\frac{B_i}{A_i}\log\frac{B_i}{A_i} - \frac{B_i}{A_i} + 1\right) = \sum_{i=1}^T A_i\left((t_i+1)\log(t_i+1) - t_i\right) \ge \sum_{i=1}^T A_i\,\frac{t_i^2}{2(1 + t_i/3)}.$$
Multiplying and dividing by $\sum_j A_j(1 + t_j/3)$ and applying Cauchy-Schwarz,
$$D_{KL}(B, A) \ge \frac{\left(\sum_i A_i|t_i|\right)^2}{2\sum_k A_k(1 + t_k/3)} = \frac{\|A - B\|_1^2}{\frac{2}{3}\sum_k(2A_k + B_k)},$$
since $A_i|t_i| = |A_i - B_i|$ and $\sum_k A_k(1 + t_k/3) = \frac{1}{3}\sum_k(2A_k + B_k)$.

Using the above to bound Equation (23) with $B := P$ and $A := Q$, we get
$$2\eta T_2 \le \sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\, D_{KL}(P, Q)} = \sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T D_{KL}(P_t, Q_t)}$$
$$= \sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T D_{KL}\left(y_t^T X_t^{-1} y_t,\; y_t^T X_{t-1}^{-1} y_t\right)} = \sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T D_{KL}\left(\tilde y_t^T \tilde X_t^{-1} \tilde y_t,\; \tilde y_t^T \tilde X_{t-1}^{-1} \tilde y_t\right)}$$
$$\le \sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T \|\tilde y_t\|_2^2\; D_{vN}\left(\tilde X_t^{-1}, \tilde X_{t-1}^{-1}\right)}. \quad (24)$$
Here $\tilde y_t = \Lambda_t y_t$, $\tilde X_t = \Lambda_t X_t \Lambda_t$, $\tilde X_{t-1} = \Lambda_t X_{t-1} \Lambda_t$, $D_{vN}$ is the von Neumann divergence, and the last inequality is from Lindblad (1975). We upper bound Equation (24) using the following lemma.

Lemma 7. Let $A, B$ be two positive definite matrices. Then $D_{vN}(A, B) \le \lambda_{\max}(A)\cdot D_{\ell d}(B, A)$.

Proof.
Let $A = V\Lambda V^T$ and $B = U\Theta U^T$ be eigendecompositions. Then
$$D_{vN}(A, B) = \sum_{i,j}(v_i^T u_j)^2\left(\lambda_i\log\frac{\lambda_i}{\theta_j} - \lambda_i + \theta_j\right) = \sum_{i,j}(v_i^T u_j)^2\,\lambda_i\left(\log\frac{\lambda_i}{\theta_j} - 1 + \frac{\theta_j}{\lambda_i}\right)$$
$$\le \max_i \lambda_i\,\sum_{i,j}(v_i^T u_j)^2\left(-\log\frac{\theta_j}{\lambda_i} - 1 + \frac{\theta_j}{\lambda_i}\right) = \lambda_{\max}(A)\cdot D_{\ell d}(B, A).$$

Combining the above results,
$$T_2 = \frac{1}{2\eta}\|P - Q\|_1 \le \frac{1}{2\eta}\sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T\|\tilde y_t\|_2^2\,\left\|\tilde X_t^{-1}\right\|_2 D_{\ell d}\left(\tilde X_{t-1}^{-1}, \tilde X_t^{-1}\right)}$$
$$= \frac{1}{2\eta}\sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T\|\tilde y_t\|_2^2\,\left\|\tilde X_t^{-1}\right\|_2 D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right)},$$
where the equality is due to the scale invariance of the LogDet divergence. Setting $\Lambda_t = \mathrm{diag}(H_t)^{1/2}$ and using Lemma 4,
$$T_2 \le \frac{1}{2\eta}\sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\sum_{t=2}^T\left(Ct + \epsilon\|y_t\|_2^2\right)\frac{1+\beta}{1-\beta}\, D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right)}$$
$$\le \frac{1}{2\eta}\sqrt{\frac{2}{3}\sum_{t=2}^T(2Q_t + P_t)\cdot\left(CT + \epsilon D_2^2\right)\frac{1+\beta}{1-\beta}\sum_{t=2}^T D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right)}$$
$$\le \frac{1}{2\eta}\left(CT + \epsilon D_2^2\right)\frac{1+\beta}{1-\beta}\sqrt{2T\sum_{t=2}^T D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right)}, \quad (25)$$
where $D_2 = \max_t\|y_t\|_2$; the second and third inequalities use Lemma 4 (in particular $2Q_t + P_t \le 3(CT + \epsilon D_2^2)\frac{1+\beta}{1-\beta}$). To bound the $\sum_t D_{\ell d}(X_{t-1}^{-1}, X_t^{-1})$ term, we develop the following lemma.

Lemma 8. $\mathrm{Tr}(X_t H_t) = n$ for all $t \in \{2, \ldots, T\}$ in Algorithm 1, for any $G$.

Proof. $P_G(X_t^{-1}) = H_t$ is an optimality condition of the LogDet subproblem, Equation (11). Hence
$$\mathrm{Tr}(X_t H_t) = \mathrm{Tr}\left(X_t P_G(X_t^{-1})\right) = \mathrm{Tr}\left(X_t X_t^{-1}\right) = n.$$
The second equality holds because $X_t$ follows the sparsity graph $G$.

Lemma 9.
$$\sum_{t=2}^T D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right) \le \log\frac{\det X_T^{-1}}{\det X_1^{-1}} \le \sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right),$$
where $\beta_i = \dfrac{\langle g^{(i)}_{1:T},\, g^{(i+1)}_{1:T}\rangle}{\sqrt{\left(\epsilon + \|g^{(i)}_{1:T}\|_2^2\right)\left(\epsilon + \|g^{(i+1)}_{1:T}\|_2^2\right)}}$.

Proof. The LogDet divergence is defined as
$$D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right) = -\log\frac{\det X_{t-1}^{-1}}{\det X_t^{-1}} + \mathrm{Tr}\left(X_{t-1}^{-1} X_t\right) - n.$$
The last two terms on the right-hand side can be analyzed as follows.
$$\mathrm{Tr}\left(X_{t-1}^{-1} X_t\right) - n = \mathrm{Tr}\left(X_{t-1}^{-1} X_t - X_t^{-1} X_t\right) = \mathrm{Tr}\left(\left(X_{t-1}^{-1} - X_t^{-1}\right)X_t\right) = -g_t^T X_t g_t < 0. \quad (26)$$
Thus
$$\sum_{t=2}^T D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right) \le \sum_{t=2}^T -\log\frac{\det X_{t-1}^{-1}}{\det X_t^{-1}} = \log\frac{\det X_T^{-1}}{\det X_1^{-1}}.$$
Since the LogDet divergence is nonnegative,
$$D_{\ell d}\left(X_{t-1}^{-1}, X_t^{-1}\right) = -\log\frac{\det X_{t-1}^{-1}}{\det X_t^{-1}} + \mathrm{Tr}\left(X_{t-1}^{-1} X_t\right) - n \ge 0 \implies \log\frac{\det X_t^{-1}}{\det X_{t-1}^{-1}} \ge 0, \quad (27)$$
where the implication uses Equation (26). Now, using Equation (13), which is in Cholesky decomposition form,
$$\log\left(\det X_T^{-1}\right) = \sum_{i=1}^{n-1}\log\left((H_T)_{ii} - \frac{(H_T)_{i,i+1}^2}{(H_T)_{i+1,i+1}}\right) + \log\left((H_T)_{nn}\right) = \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right) + \sum_{i=1}^n\log\left((H_T)_{ii}\right).$$
Using the above, we can expand the log-determinant difference:
$$\log\frac{\det X_T^{-1}}{\det X_1^{-1}} \le \log\frac{\det X_T^{-1}}{\det X_0^{-1}} \le \sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right), \quad (28)$$
where $X_0 = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, H_0^{-1})$; the first inequality holds since $\log\frac{\det X_1^{-1}}{\det X_0^{-1}} \ge 0$ by Equation (27).

Lemma 10 (Upper bound on $T_2$).
$$T_2 \le \frac{1}{2\eta}\left(CT + \epsilon D_2^2\right)\frac{1+\beta}{1-\beta}\sqrt{2T\left(\sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right)\right)}.$$

Proof. The proof directly follows from Equation (25) and Lemma 9.

Lemma 11 (Upper bound on $T_3$). Let $X_0 = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, H_0^{-1})$. Then
$$T_3 = \sum_{t=1}^T\frac{\eta}{2}\, g_t^T X_t g_t \le \frac{\eta}{2}\log\frac{\det X_T^{-1}}{\det X_0^{-1}} \le \frac{\eta}{2}\left(\sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right)\right),$$
where $\beta_i$ is as defined in Lemma 9.

Proof.
$$g_t^T X_t g_t = \mathrm{Tr}\left(X_t g_t g_t^T\right) = \mathrm{Tr}\left(X_t\left(X_t^{-1} - X_{t-1}^{-1}\right)\right) = n - \mathrm{Tr}\left(X_t X_{t-1}^{-1}\right) \le -\log\frac{\det X_{t-1}^{-1}}{\det X_t^{-1}}.$$
The second equality uses the optimality condition $P_G(X_t^{-1}) = H_t$ and that $X_t$ follows the tridiagonal sparsity graph. The inequality uses the property $D_{\ell d}(X_{t-1}^{-1}, X_t^{-1}) \ge 0$ of the LogDet divergence. Summing up and using Equation (28) gives the lemma.
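The closed form of Lemma 3 and the trace identity of Lemma 8 are easy to sanity-check numerically. A self-contained sketch for the chain graph, with a random SPD matrix standing in for $H_t$ (our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                 # random SPD stand-in for H_t

# Build Y = X^{-1} from Equation (20): tridiagonal entries copy H;
# beyond the band, products of off-diagonals over interior diagonals.
Y = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        if j == i:
            Y[i, j] = H[i, i]
        else:
            num = np.prod([H[k, k + 1] for k in range(i, j)])
            den = np.prod([H[k, k] for k in range(i + 1, j)])
            Y[i, j] = Y[j, i] = num / den

X = np.linalg.inv(Y)                        # should come out tridiagonal
far = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
assert np.max(np.abs(X[far])) < 1e-8        # Lemma 3: X is tridiagonal
assert abs(np.trace(X @ H) - n) < 1e-6      # Lemma 8: Tr(X H) = n
```

The trace identity holds because $X$ is supported on the tridiagonal, where $H$ and $X^{-1}$ agree.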
Putting together $T_1$, $T_2$, and $T_3$ from Lemma 5, Lemma 10, and Lemma 11, and writing $S := \sum_{i=1}^n\log\left(1 + \|g^{(i)}_{1:T}\|_2^2/\epsilon\right) + \sum_{i=1}^{n-1}\log(1 - \beta_i^2)$,
$$T_1 \le \frac{C + \epsilon D_2^2}{2\eta}\cdot\frac{1+\beta}{1-\beta}, \qquad T_2 \le \frac{1}{2\eta}\left(CT + \epsilon D_2^2\right)\frac{1+\beta}{1-\beta}\sqrt{2TS}, \qquad T_3 \le \frac{\eta}{2}\, S. \quad (30)$$
If we set
$$\eta = \frac{C^{1/2}\, T^{3/4}\left(\frac{1+\beta}{1-\beta}\right)^{1/2}}{S^{1/4}},$$
then
$$R_T \le T_1 + T_2 + T_3 \le O\left(C^{1/2}\, T^{3/4}\left(\frac{1+\beta}{1-\beta}\right)^{1/2}\left(\sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right)\right)^{3/4}\right).$$

A.3 O(T^{1/2}) REGRET UPPER BOUND

In this section we derive a regret upper bound with $O(T^{1/2})$ growth, in contrast to the $O(T^{3/4})$ obtained in A.2. In (29), $T_2 = \frac{1}{2\eta}\sum_{t=2}^T(w_t - w^*)^T(X_t^{-1} - X_{t-1}^{-1})(w_t - w^*)$ is of the order $O(T^{3/2})$, which can be reduced to $O(T)$ by upper bounding each entry of $X_t^{-1} - X_{t-1}^{-1}$ individually. The following lemma helps in constructing a telescoping argument to bound $(X_t^{-1} - X_{t-1}^{-1})_{i,j}$.

Lemma 12. Let $H, \tilde H \in S_n^{++}$ with $\tilde H = H + gg^T$ for some $g \in \mathbb{R}^n$. Then

$$\frac{\tilde H_{ij}}{\sqrt{\tilde H_{ii}\tilde H_{jj}}} - \frac{H_{ij}}{\sqrt{H_{ii}H_{jj}}} = \frac{g_i g_j}{\sqrt{\tilde H_{ii}\tilde H_{jj}}} + \frac{H_{ij}}{\sqrt{H_{ii}H_{jj}}}\left(\sqrt{\frac{H_{ii}H_{jj}}{\tilde H_{ii}\tilde H_{jj}}} - 1\right) =: \theta_{ij}.$$

Proof.
$$\frac{\tilde H_{ij}}{\sqrt{\tilde H_{ii}\tilde H_{jj}}} - \frac{H_{ij}}{\sqrt{H_{ii}H_{jj}}} = \frac{1}{\sqrt{\tilde H_{ii}\tilde H_{jj}}}\left(\tilde H_{ij} - H_{ij}\sqrt{\frac{\tilde H_{ii}\tilde H_{jj}}{H_{ii}H_{jj}}}\right) = \frac{1}{\sqrt{\tilde H_{ii}\tilde H_{jj}}}\left(g_i g_j + H_{ij}\left(1 - \sqrt{\frac{\tilde H_{ii}\tilde H_{jj}}{H_{ii}H_{jj}}}\right)\right).$$

Lemma 13. Let $H, \tilde H \in S_n^{++}$ with $\tilde H = H + gg^T$ for some $g \in \mathbb{R}^n$. Also let $\tilde Y = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, \tilde H^{-1})$ and $Y = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, H^{-1})$, where $G$ is a chain graph. Then
$$\left|\left(\tilde Y^{-1} - Y^{-1}\right)_{i,i+k}\right| \le G_\infty^2\,\kappa\,(k\beta + k + 2)\,\beta^{k-1},$$
where $i, i+k \le n$, $G_\infty = \|g\|_\infty$, and $\max_{i,j}|H_{ij}|/\sqrt{H_{ii}H_{jj}} \le \beta < 1$. Letting $\kappa(\mathrm{diag}(H))$ denote the condition number of the diagonal part of $H$, we set $\kappa := \max\left(\kappa(\mathrm{diag}(H)), \kappa(\mathrm{diag}(\tilde H))\right)$.

Proof. Write $N_{ij} = H_{ij}/\sqrt{H_{ii}H_{jj}}$ and $\tilde N_{ij} = \tilde H_{ij}/\sqrt{\tilde H_{ii}\tilde H_{jj}}$, so that $\max_{i,j}|N_{ij}|,\, \max_{i,j}|\tilde N_{ij}| \le \beta$. Using Lemma 3,
$$\left(\tilde Y^{-1} - Y^{-1}\right)_{i,i+k} = \sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\;\tilde N_{i,i+1}\cdots\tilde N_{i+k-1,i+k} - \sqrt{H_{ii}H_{i+k,i+k}}\;N_{i,i+1}\cdots N_{i+k-1,i+k}.$$
Expanding $\tilde N_{i+l,i+l+1} = N_{i+l,i+l+1} + \theta_{i+l,i+l+1}$ term by term (Lemma 12) and using $|N|, |\tilde N| \le \beta$,
$$\left|\left(\tilde Y^{-1} - Y^{-1}\right)_{i,i+k}\right| \le \sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\left(\left(\sum_{l=0}^{k-1}\left|\theta_{i+l,i+l+1}\right|\right)\beta^{k-1} + \beta^{k-1}\left|1 - \sqrt{\frac{H_{ii}H_{i+k,i+k}}{\tilde H_{ii}\tilde H_{i+k,i+k}}}\right|\right).$$
Expanding $\theta_{i+l,i+l+1}$ from Lemma 12 inside the term $\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\,|\theta_{i+l,i+l+1}|$ gives
$$\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\,\left|\theta_{i+l,i+l+1}\right| \le \sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\,\frac{|g_{i+l}\,g_{i+l+1}|}{\sqrt{\tilde H_{i+l,i+l}\tilde H_{i+l+1,i+l+1}}} + \sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\,\left|N_{i+l,i+l+1}\right|\left|1 - \sqrt{\frac{H_{i+l,i+l}H_{i+l+1,i+l+1}}{\tilde H_{i+l,i+l}\tilde H_{i+l+1,i+l+1}}}\right|.$$
Since $H_{i+l,i+l}H_{i+l+1,i+l+1} \le \tilde H_{i+l,i+l}\tilde H_{i+l+1,i+l+1}$,
$$\left|1 - \sqrt{\frac{H_{i+l,i+l}H_{i+l+1,i+l+1}}{\tilde H_{i+l,i+l}\tilde H_{i+l+1,i+l+1}}}\right| \le \max\left(1 - \frac{H_{i+l,i+l}}{\tilde H_{i+l,i+l}},\; 1 - \frac{H_{i+l+1,i+l+1}}{\tilde H_{i+l+1,i+l+1}}\right) = \max\left(\frac{g_{i+l}^2}{\tilde H_{i+l,i+l}},\; \frac{g_{i+l+1}^2}{\tilde H_{i+l+1,i+l+1}}\right).$$

Using the above, $\tilde H_{ii}/\tilde H_{jj} \le \kappa$, and $|g_i| \le G_\infty$ for all $i, j \in [n]$, gives
$$\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\,\left|\theta_{i+l,i+l+1}\right| \le G_\infty^2\,\kappa + \beta\, G_\infty^2\,\kappa \le G_\infty^2\,\kappa\,(1+\beta).$$
Thus the first part of the bound on $(\tilde Y^{-1} - Y^{-1})_{i,i+k}$ satisfies
$$\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\left(\sum_{l=0}^{k-1}\left|\theta_{i+l,i+l+1}\right|\right)\beta^{k-1} \le G_\infty^2\,\kappa\,(1+\beta)\,k\,\beta^{k-1}.$$
Also, $\beta^{k-1}\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\left|1 - \sqrt{H_{ii}H_{i+k,i+k}/(\tilde H_{ii}\tilde H_{i+k,i+k})}\right| \le \beta^{k-1}\,\kappa\, G_\infty^2$, so
$$\left|\left(\tilde Y^{-1} - Y^{-1}\right)_{i,i+k}\right| \le G_\infty^2\,\kappa\,(k\beta + k + 2)\,\beta^{k-1}.$$

Lemma 14 ($O(T)$ upper bound on $T_2$). Given that $\kappa(\mathrm{diag}(H_t)) \le \kappa$, $\|w_t - w^*\|_2 \le D_2$, and $\max_{i,j}|(H_t)_{ij}|/\sqrt{(H_t)_{ii}(H_t)_{jj}} \le \beta < 1$ for all $t \in [T]$ in Algorithm 1, the term $T_2$ in (29) satisfies
$$T_2 \le O\left(\frac{T}{2\eta(1-\beta)^2}\,(G_\infty D_2)^2\,\kappa\right).$$

Proof. Note that
$$T_2 = \frac{1}{2\eta}\sum_{t=1}^{T-1}(w_{t+1} - w^*)^T\left(X_{t+1}^{-1} - X_t^{-1}\right)(w_{t+1} - w^*) \le \frac{1}{2\eta}\sum_{t=1}^{T-1} D_2^2\left\|X_{t+1}^{-1} - X_t^{-1}\right\|_2.$$
Using $\|A\|_2 = \rho(A) \le \|A\|_\infty$ for symmetric matrices $A$, we get
$$\left\|X_{t+1}^{-1} - X_t^{-1}\right\|_2 \le \left\|X_{t+1}^{-1} - X_t^{-1}\right\|_\infty = \max_i \sum_j\left|\left(X_{t+1}^{-1} - X_t^{-1}\right)_{ij}\right| \le O\left(\frac{G_\infty^2\,\kappa}{(1-\beta)^2}\right),$$
where the last inequality uses Lemma 13 (summing $\sum_k (k\beta + k + 2)\beta^{k-1}$). Expanding $T_2$ with this bound gives the result.

Theorem 6 ($O(T^{1/2})$ regret upper bound). Setting
$$\eta = \frac{T^{1/2}\, D_2\, G_\infty\sqrt{\kappa}}{(1-\beta)\sqrt{\sum_{i=1}^n\log\left(1 + \|g^{(i)}_{1:T}\|_2^2/\epsilon\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right)}}$$
gives the following regret bound:
$$R_T \le O\left(\frac{1}{1-\beta}\, T^{1/2}\, D_2\, G_\infty\sqrt{\kappa}\,\sqrt{\sum_{i=1}^n\log\left(1 + \frac{\|g^{(i)}_{1:T}\|_2^2}{\epsilon}\right) + \sum_{i=1}^{n-1}\log\left(1 - \beta_i^2\right)}\right),$$
where $\kappa(\mathrm{diag}(H_t)) \le \kappa$, $\|w_t - w^*\|_2 \le D_2$, $\max_{i,j}|(H_t)_{ij}|/\sqrt{(H_t)_{ii}(H_t)_{jj}} \le \beta < 1$ for all $t \in [T]$, and $g_t, H_t$ are as defined in Algorithm 1.

Proof. The result follows from using (30) and Lemma 14.
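The generalized Pinsker inequality of Lemma 6, which drives the $T_2$ bound of A.2, is a purely scalar statement and can be verified numerically on arbitrary positive arrays (a quick sanity check of ours, not part of the paper):

```python
import numpy as np

def gen_pinsker_gap(A, B):
    """Return (lhs, rhs) of Lemma 6:
        ||A - B||_1^2  <=  (2/3) * sum(2A + B) * D_KL(B, A),
    where D_KL is the generalized KL divergence between positive arrays
    (not probability distributions)."""
    D_kl = np.sum(B * np.log(B / A) - B + A)
    lhs = np.sum(np.abs(A - B)) ** 2
    rhs = (2.0 / 3.0) * np.sum(2 * A + B) * D_kl
    return lhs, rhs
```

Unlike the classical inequality, the arrays need not sum to one, which is exactly what the regret analysis requires for the quadratic-form sequences $P$ and $Q$.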

A.4 NUMERICAL STABILITY

Theorem 7 (Full version of Theorem 4). Let $H \in S_n^{++}$ with $H_{ii} = 1$ for $i \in [n]$, and let $\Delta H$ be a symmetric perturbation with $\Delta H_{ii} = 0$ for $i \in [n]$ and $H + \Delta H \succ 0$. Let $\hat X = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, H^{-1})$ and $\hat X + \Delta\hat X = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}(X, (H + \Delta H)^{-1})$, where $G$ is the chain/tridiagonal sparsity graph and $S_n(G)^{++}$ denotes positive definite matrices following the sparsity pattern $G$. Then the condition number of the LogDet subproblem satisfies
$$\kappa_{\ell d} = \max_{|i-j|\le 1}\;\lim_{\epsilon\to 0}\;\sup\left\{\frac{|\Delta\hat X_{ij}|}{\epsilon\,|\hat X_{ij}|} : |\Delta H_{kl}| \le |\epsilon H_{kl}|,\; (k,l) \in E_G\right\} \le O\left(\max_{i\in[n-1]}\frac{1}{1-\beta_i^2}\right),$$
where $\beta_i = H_{i,i+1}/\sqrt{H_{ii}H_{i+1,i+1}} = H_{i,i+1}$, since $H_{ii} = 1$.

Proof. Consider the off-diagonals, for which $(\hat X + \Delta\hat X)_{i,i+1} = -(H + \Delta H)_{i,i+1}/(1 - (H + \Delta H)_{i,i+1}^2)$, i.e., $\hat X_{i,i+1} = f(H_{i,i+1})$ with $f(x) = -x/(1-x^2)$. Let $y = f(x)$, $\hat y = f(x + \Delta x)$, and $|\Delta x/x| \le \epsilon$. Using a Taylor expansion,
$$\frac{\hat y - y}{y} = \frac{x f'(x)}{f(x)}\frac{\Delta x}{x} + O\left((\Delta x)^2\right) \implies \lim_{\epsilon\to 0}\frac{|\hat y - y|}{\epsilon\,|y|} \le \left|\frac{x f'(x)}{f(x)}\right|.$$
Applying the above inequality with $x := H_{i,i+1}$ and $y := \hat X_{i,i+1}$,
$$\lim_{\epsilon\to 0}\frac{|\Delta\hat X_{i,i+1}|}{\epsilon\,|\hat X_{i,i+1}|} \le \frac{1 + H_{i,i+1}^2}{1 - H_{i,i+1}^2} \le \frac{2}{1 - H_{i,i+1}^2}. \quad (31)$$
For the diagonal, let $g(x) = x^2/(1-x^2)$ and set $y_1 = g(x_1)$, $y_2 = g(x_2)$, $\hat y_1 = g(x_1 + \Delta x_1)$, $\hat y_2 = g(x_2 + \Delta x_2)$. The same Taylor expansion argument, applied to each term, gives
$$\lim_{\epsilon\to 0}\frac{|\Delta y_1 + \Delta y_2|}{\epsilon\,(1 + y_1 + y_2)} \le \max\left(\frac{2}{1 - x_1^2},\; \frac{2}{1 - x_2^2}\right).$$
Putting $x_1 := H_{i,i+1}$, $x_2 := H_{i,i-1}$, so that $y_1 = H_{i,i+1}^2/(1 - H_{i,i+1}^2)$ and $y_2 = H_{i,i-1}^2/(1 - H_{i,i-1}^2)$, and using $\hat X_{ii} = 1 + y_1 + y_2$, results in
$$\lim_{\epsilon\to 0}\frac{|\Delta\hat X_{ii}|}{\epsilon\,|\hat X_{ii}|} \le \max\left(\frac{2}{1 - H_{i,i+1}^2},\; \frac{2}{1 - H_{i,i-1}^2}\right). \quad (32)$$
Putting together Equation (31) and Equation (32) proves the theorem.
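The key quantity in the proof is the relative conditioning $|x f'(x)/f(x)| = (1+x^2)/(1-x^2)$ of the off-diagonal map $f(x) = -x/(1-x^2)$, which blows up as $|H_{i,i+1}| \to 1$. A finite-difference check of this bound (our sanity check, not from the paper):

```python
def f(x):
    """Off-diagonal map from the proof of Theorem 7: X_hat_{i,i+1} = f(H_{i,i+1})."""
    return -x / (1.0 - x * x)

def rel_sensitivity(x, eps=1e-7):
    """Finite-difference estimate of |x f'(x) / f(x)|, the relative
    condition number of f at x; analytically it equals (1 + x^2) / (1 - x^2)."""
    return abs((f(x * (1 + eps)) - f(x)) / f(x)) / eps
```

As the check below confirms, the sensitivity stays below $2/(1-x^2)$, so the subproblem is well-conditioned exactly when the normalized off-diagonals $\beta_i$ stay bounded away from 1.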

Proof of Lemma 1

For $b = 1$: if $g^{(j)}_{1:T} = g^{(j+1)}_{1:T}$, then $H_{j,j+1} = H_{jj} = H_{j+1,j+1} = \|g^{(j)}_{1:T}\|_2^2$, thus $H_{jj} - H_{j,j+1}^2/H_{j+1,j+1} = 0$. For $b > 1$: by the Guttman rank additivity formula,
$$\mathrm{rank}\left(H_{jj} - H_{jI_j} H_{I_j I_j}^{-1} H_{I_j j}\right) = \mathrm{rank}\left(H_{J_j J_j}\right) - \mathrm{rank}\left(H_{I_j I_j}\right) = 0,$$
thus $H_{jj} - H_{jI_j} H_{I_j I_j}^{-1} H_{I_j j} = 0$. Furthermore, if $\mathrm{rank}(H) \le b$, then every $(b+1)\times(b+1)$ principal submatrix of $H$ has rank at most $b$; hence for every $j$ either $H_{I_j I_j}$ is singular or the Schur complement vanishes, and in both cases $D_{jj}$ is undefined.

Proof of Theorem 5

Let $I_i = \{j : i < j,\, (i,j) \in E_G\}$ and $I'_i = \{j : i < j,\, (i,j) \in E_{\tilde G}\}$. Let $K = \{i \in [n] : H_{ii} - H_{I_i i}^T H_{I_i I_i}^{-1} H_{I_i i}$ is undefined or $0\}$ denote the vertices removed by the algorithm; for the new graph $\tilde G$, $D_{ii} = 1/H_{ii}$ for all $i \in K$, which is well-defined since $H_{ii} > 0$. Let $\bar K = \{i \in [n] : H_{ii} - H_{I_i i}^T H_{I_i I_i}^{-1} H_{I_i i} > 0\}$. For $j \in \bar K$, let $l = \arg\min\{i : j < i,\, i \in K \cap I_j\}$ denote the nearest connected vertex above $j$ for which $D_{ll}$ is undefined or zero. By the definition of $E_{\tilde G}$ in Algorithm 3, $I'_j = \{j+1, \ldots, l-1\} \subseteq I_j$. Since $D_{jj}$ is well-defined, $H_{I_j I_j}$ is invertible, which makes it positive definite (since $H$ is PSD). Since $H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} > 0$, the Guttman rank additivity formula gives $H_{J_j J_j} \succ 0$, where $J_j = I_j \cup \{j\}$. Since $H_{J'_j J'_j}$ is a principal submatrix of $H_{J_j J_j}$, it is positive definite, and hence its Schur complement $H_{jj} - H_{I'_j j}^T H_{I'_j I'_j}^{-1} H_{I'_j j} > 0$. Thus for all $j \in [n]$, the corresponding $D_{jj}$ is well-defined in the new graph $\tilde G$. Note that for the tridiagonal graph with $H_{ii} = 1$, where $\beta_i = H_{i,i+1}$, we have $\kappa^{\tilde G}_{\ell d} = \max_{i\in[n-1]\setminus K} 1/(1-\beta_i^2) \le \max_{i\in[n-1]} 1/(1-\beta_i^2) = \kappa^{G}_{\ell d}$; the inequality is strict whenever $\arg\max_{i\in[n-1]} 1/(1-\beta_i^2) \in K$.

Effect of band size in banded-SONew. Increasing the band size in banded-SONew captures more correlation between parameters, and hence should lead to better preconditioners. We confirm this through experiments on the Autoencoder benchmark with band sizes 0 (diag-SONew), 1 (tridiag-SONew), 4, and 10.

Effect of mini-batch size. To study the effect of mini-batch size, in Table 7 we empirically compare SONew with state-of-the-art first-order methods such as Adam and RMSProp, and with the second-order method Shampoo. We see that SONew's performance does not deteriorate much when using smaller or larger batch sizes, whereas first-order methods suffer significantly. We also notice that Shampoo does not outperform SONew in these regimes.
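In code, the vertex-removal rule of Algorithm 3 amounts to falling back to the diagonal entry whenever a pivot (Schur complement) is undefined, non-positive, or tiny. A tridiagonal sketch in numpy (our own rendering of the idea; the paper's Algorithm 3 is more general and applies to any band size):

```python
import numpy as np

def stable_tridiag_precond(H, tol=1e-6):
    """Tridiagonal SONew preconditioner with a stability fix in the spirit
    of Algorithm 3 (our sketch, not the paper's code): when the pivot
    H_jj - H_{j,j+1}^2 / H_{j+1,j+1} is non-positive or tiny, drop the
    edge (j, j+1) and use the diagonal fallback d_jj = 1 / H_jj."""
    n = H.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        if j + 1 < n:
            pivot = H[j, j] - H[j, j + 1] ** 2 / H[j + 1, j + 1]
            if pivot > tol * H[j, j]:          # edge kept: safe pivot
                L[j + 1, j] = -H[j + 1, j] / H[j + 1, j + 1]
                d[j] = 1.0 / pivot
                continue
        d[j] = 1.0 / H[j, j]                   # edge dropped: diagonal fallback
    return L @ np.diag(d) @ L.T
```

The fallback keeps every $D_{jj}$ finite and positive, so the returned preconditioner remains positive definite even for degenerate statistics, e.g. when adjacent gradient coordinates are identical.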
Effect of numerical stability Algorithm 3. On tridiag-SONew and banded-4-SONew, we observe that using Algorithm 3 improves training loss; we present the results in Table 4. The large-scale numbers are given in Table 5: SONew performs comparably to or outperforms Adam on all benchmarks in both train loss and validation performance. Moreover, we plot the training and validation curves in Figure 2 and observe that SONew has an early advantage over Adam.

Hyperparameter search space. We provide the hyperparameter search space for the experiments presented in Section 6. We search over 2k hyperparameters for each experiment using a Bayesian optimization package. The search ranges are: first-order momentum term β1 ∈ [1e-1, 0.999], second-order momentum term β2 ∈ [1e-1, 0.999], learning rate ∈ [1e-7, 1e-1], ϵ ∈ [1e-10, 1e-1]. We give the optimal hyperparameter value for each experiment in Table 6. For large scale benchmark (



The full version of Theorem 4, along with its proof, is given in Appendix A.4.

CONCLUSIONS AND FUTURE WORK

In this paper we have introduced a computationally efficient sparse preconditioner. Our algorithm arises from a novel regret bound analysis using the LogDet divergence, and furthermore we make it numerically stable. Experimental results on the Autoencoder benchmark confirm the effectiveness of SONew in both float32 and bfloat16 precision. In the future, one can explore different sparsity graphs for which efficient solutions exist for the LogDet subproblem (11).



Figure 1: Comparison of SONew (tridiag, band-4) with first-order optimizers and Shampoo (second-order). Left (a): float32 training; right (b): bfloat16 training. We observe that tridiag and banded SONew converge better than Shampoo in the early stages at lower precision, and perform better than all first-order methods in both float32 and bfloat16.

where we observed significant performance improvements.

Large Scale Benchmark Comparison. To test the efficacy of SONew in deep learning, we test our method against Adam on 3 large-scale benchmarks: Resnet50 (He et al., 2015b) trained on Imagenet (Deng et al., 2009), a Vision Transformer (Dosovitskiy et al., 2020) trained on Imagenet, and a Graph Network (Battaglia et al., 2018; Godwin* et al., 2020) on the OGBG-molpcba dataset (Hu et al., 2020).

Table 2: bfloat16 experiments on Autoencoder benchmark. diag-SONew performs the best among all first-order methods, while degrading only marginally (0.26 absolute difference) compared to its float32 performance; tridiag-SONew and banded-SONew show similar behavior. Shampoo performs the best but has a considerable drop (0.70) in performance compared to float32 due to its use of matrix inverses, and is slower due to its cubic time complexity for computing preconditioners. The Shampoo implementation uses 16-bit quantization to work in the 16-bit setting, leading to further slowdown; hence its running time in bfloat16 is even higher than in float32.

The following expansion is used in the proof of Lemma 13, with $N_{ij} = H_{ij}/\sqrt{H_{ii}H_{jj}}$ and $\tilde N_{ij} = \tilde H_{ij}/\sqrt{\tilde H_{ii}\tilde H_{jj}}$:
$$\sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\;\tilde N_{i,i+1}\cdots\tilde N_{i+k-1,i+k} - \sqrt{H_{ii}H_{i+k,i+k}}\;N_{i,i+1}\cdots N_{i+k-1,i+k}$$
$$= \sqrt{\tilde H_{ii}\tilde H_{i+k,i+k}}\left(\tilde N_{i,i+1}\cdots\tilde N_{i+k-1,i+k} - N_{i,i+1}\cdots N_{i+k-1,i+k}\sqrt{\frac{H_{ii}H_{i+k,i+k}}{\tilde H_{ii}\tilde H_{i+k,i+k}}}\right).$$
Expanding $\tilde N_{i,i+1} = N_{i,i+1} + \theta_{i,i+1}$ (from Lemma 12), subsequently $\tilde N_{i+1,i+2} = N_{i+1,i+2} + \theta_{i+1,i+2}$, and so on, gives
$$\tilde N_{i,i+1}\cdots\tilde N_{i+k-1,i+k} - N_{i,i+1}\cdots N_{i+k-1,i+k}\sqrt{\frac{H_{ii}H_{i+k,i+k}}{\tilde H_{ii}\tilde H_{i+k,i+k}}}$$
$$= \theta_{i,i+1}\tilde N_{i+1,i+2}\cdots\tilde N_{i+k-1,i+k} + N_{i,i+1}\theta_{i+1,i+2}\tilde N_{i+2,i+3}\cdots\tilde N_{i+k-1,i+k} + \ldots + N_{i,i+1}\cdots N_{i+k-2,i+k-1}\theta_{i+k-1,i+k} + N_{i,i+1}\cdots N_{i+k-1,i+k}\left(1 - \sqrt{\frac{H_{ii}H_{i+k,i+k}}{\tilde H_{ii}\tilde H_{i+k,i+k}}}\right).$$

float32 experiments on Autoencoder benchmark using different band sizes. Band size 0 corresponds to diag-SONew and band size 1 to tridiag-SONew. We see the training loss improving as we increase the band size.

Table 4: bfloat16 experiments on Autoencoder benchmark with and without Algorithm 3. We observe gains in training loss when using Algorithm 3.

Table 5: Large-scale benchmarks. We compare tridiag-SONew vs Adam on the large-scale benchmarks below, reporting train cross-entropy loss and validation performance, measured as precision for the OGBG benchmark and error rate for the Resnet50 and Vision Transformer benchmarks. Columns: # model parameters, # training points, Train Loss (Adam), Train Loss (SONew), Valid. perf. (Adam), Valid. perf. (SONew).

We use a cosine learning rate schedule. For Resnet50 on Imagenet, we also search label smoothing over [0.0, 0.2]. Batch size was kept at 1024, 1024, and 512 for Resnet50, Vision Transformer, and OGBG respectively. We sweep over 100 hyperparameters in the search space for both SONew and Adam.
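For reference, a cosine schedule with warmup can be written in a few lines (a generic sketch; the exact warmup handling in these runs is our assumption):

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine learning-rate schedule: linear warmup for warmup_steps,
    then cosine decay from base_lr down to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```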

