LEARNING TO OPTIMIZE QUASI-NEWTON METHODS

Abstract

We introduce a novel machine learning optimizer called LODO, which online meta-learns an implicit inverse Hessian of the loss as a subroutine of quasi-Newton optimization. Our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn neural representations of symmetric matrix vector products, which are more flexible than those in other quasi-Newton methods. Unlike other L2O methods, ours does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify our algorithm's performance in the presence of noise, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters, and obtain competitive results against standard neural network optimizers.

1. INTRODUCTION

Many optimization algorithms, such as stochastic gradient descent (SGD) (Rosenblatt, 1958) and Adam (Kingma & Ba, 2014), have been widespread and successful in the rapid training of deep neural networks (Sun et al., 2019). Fundamentally, training is a problem of minimizing a loss which is a function of a large vector containing the weights of the network. Optimization time is a bottleneck in machine learning: the faster a network can be trained, the more computational resources are saved, so researchers have devoted great effort to creating new, faster optimizers (Jain & Kar, 2017; Metz et al., 2020; Bernstein et al., 2020; Martens & Grosse, 2015a). We present a novel algorithm drawing from the field of learning to optimize (L2O) spearheaded by Li & Malik (2016) and Andrychowicz et al. (2016). Namely, we use a meta-optimizer to learn online an implicit representation of the local inverse Hessian, which is used in a quasi-Newton method, without any L2O meta-training time on a training task distribution. Unlike other L2O algorithms, which learn to optimize before optimization (Chen et al., 2021), our algorithm Learns to Optimize During Optimization (LODO). We intend for LODO to be trained from scratch for each use case and then discarded. This way, LODO learns local features of the loss landscape at a specific point in training for a specific task, instead of only characteristics shared throughout training trajectories for a set of training tasks. Our work targets the Hessian, which varies with both the task and the point along the trajectory. Our use of linear neural networks imports the efficiency of the Newton method into our algorithm, while our use of a meta-optimizer, as in L2O, allows us to learn more powerful and general parameterizations of optimizers. Our contributions are as follows.
We show theoretically and experimentally that a simplified version of LODO correctly learns the inverse Hessian in a stochastic convex setting. We show theoretically that LODO's inverse Hessian representation is highly expressive, and experimentally that simpler alternatives perform worse. We finally demonstrate the use of LODO in a semi-realistic vision task. This paper serves as a stepping stone in the development of meta-training-free online L2O. The remainder of this paper is structured as follows. Section 2 discusses relevant background and contributions in optimization and L2O. Section 3 shows how LODO works. Section 4 provides theoretical justification for our design of LODO. Section 5 presents experiments which explore what makes LODO work and why. Section 6 discusses and summarizes our findings.

2. RELATED WORK

Research into the construction of faster optimizers has mostly fallen under two branches of work. The older branch attempts to endow SGD with adaptive capabilities, often through modifications involving calculation of the first and/or second moments of the gradient (mean and variance) using exponential moving averages (EMAs). RMSprop (Hinton et al., 2012) and Adam use the variance to normalize the step size and the mean to induce momentum. LARS (You et al., 2017) and Yogi (Zaheer et al., 2018) both modify the variance calculation, but for different reasons: to normalize layer-wise, and to control increases in effective learning rate in a slower manner, respectively. Other methods, such as the Newton method and natural gradient descent (Martens & Grosse, 2015b; George et al., 2018), precondition the step with adaptive estimates of the inverse Hessian and the inverse Fisher information matrices, respectively. The Newton method converges quickly but is vulnerable to gradient noise and impractical to implement due to the resources spent in calculating and/or inverting the high dimensional Hessian. Many researchers have developed approximations, called quasi-Newton methods, which reduce the Newton method's time and memory complexity, such as L-BFGS (Nocedal & Wright, 1999) and variants (Schraudolph et al., 2007; Parker-Holder et al., 2020; Goldfarb et al., 2020; Park & Oliva, 2019) better suited to the stochasticity and structure present in machine learning. The methods most related to our work are hypergradient methods, which online learn low-rank (Moskovitz et al., 2019), diagonal (Amid et al., 2022; Baydin et al., 2017), or Kronecker-factorized (Bae et al., 2022) preconditioner matrices to transform the gradient when choosing the step. We improve on these methods by using a more expressive class of preconditioners.
More recently, a subfield of meta-learning known as learning to optimize (L2O) has shown that deep networks can themselves be trained to perform optimization, at a speed which exceeds that of popular traditional optimizers. The aim of this effort is to leverage deep neural networks to learn faster optimizers, in hopes of further accelerating training procedures for other deep neural networks. Li & Malik (2016; 2017) and Andrychowicz et al. (2016) were among the first to successfully use backpropagation to train neural networks to map gradients to steps. Since then, many other variations of this idea have successfully produced optimizers exceeding the speed of common optimizers for narrow ranges of machine learning models (Metz et al., 2018), though theoretical analysis of these learned optimizers tends to be difficult and scarce. A major goal of L2O research is to learn a single optimizer which can generalize to be able to train a wide variety of machine learning models with speed (Lv et al., 2017). Two issues also prevent L2O optimizers from being rapidly developed experimentally. Firstly, a carefully chosen "task distribution" for the optimizer to practice on is required for the meta-learning of the L2O optimizer, playing the role analogous to the "dataset". These tasks are difficult to curate because the issue of generalization error applies; we want the test task to be similar to the task distribution. Secondly, this meta-learning of the L2O optimizer is prohibitively costly, in that it involves nested training loops, where the inner loop takes a large amount of time and memory to evaluate and backpropagate through (Metz et al., 2019). Altogether, the choice of task distribution and lengthy meta-training has been a necessary burden in L2O, and we overcome both with LODO.

3. HOW LODO WORKS

In a quasi-Newton method, the approximate solution x_t ∈ R^n is refined by x_{t+1} = x_t − αG_t g_t for some learning rate α > 0, where G_t ≈ (∇² f(x_t))^{−1} ∈ R^{n×n} is an approximation of the inverse Hessian and g_t = ∇f(x_t) ∈ R^n is the gradient computed by backpropagation through the task f. Since α = 1 produces the exact solution when f is quadratic, we set α = 1. Our algorithm approximates the inverse Hessian using a matrix G(θ_t) ∈ R^{n×n} parameterized by a vector θ_t of weights learned over time t, described later in this section. After every step t ← t + 1 using the formula x_{t+1} = x_t − G(θ_t)g_t, the loss f(x_{t+1}) is computed. Then the new gradient ∇f(x_{t+1}) at x_{t+1} is computed through backpropagation as usual, but we continue backpropagation into the step-choosing process until we find the "hypergradient" ∇_{θ_t} f(x_{t+1}) with respect to the optimizer weights θ_t, which allows us to update them as well. Thus, θ is trained so that the quasi-Newton method tries to minimize the loss upon taking a single step. In the L2O interpretation, this is equivalent to unrolling the inner loop optimization for only one step, causing severe truncation bias; the optimizer learns to greedily optimize within too short of a time horizon, thus suffering in the long term (Metz et al., 2019). As a countermeasure, we take inspiration from the momentum modification to SGD and replace g_t by the exponential moving average (1 − β) Σ_{τ=0}^{t} β^{t−τ} g_τ before input into the step generation formula, for some decay parameter β. LODO's step generation formula thus becomes x_{t+1} = x_t − (1 − β) G(θ_t) Σ_{τ=0}^{t} β^{t−τ} g_τ. The result is LODO, summarized in Algorithm 1. Our parameterization of G(θ) separates our work from other hypergradient methods; it is inspired by the FFT-style Efficient Unitary Neural Network (EUNN) (Jing et al., 2017) designed for parameter-efficient representations of unitary matrices.
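The single-step hypergradient update described above can be sketched in a few lines. Below is a minimal illustration, with G parameterized as a plain dense matrix (as in the Section 4.1 analysis, not the layered product the paper actually uses) and the meta-update done by SGD rather than Adam; the names and constants are ours, purely for illustration.

```python
import numpy as np

def lodo_step(x, G, m, grad, alpha_meta=1e-5, beta=0.9):
    """One simplified LODO step: G is a plain dense matrix learned online
    (the layered parameterization of Section 3 is replaced by a dense G,
    and the Adam meta-optimizer by SGD)."""
    g = grad(x)
    m = beta * m + (1.0 - beta) * g            # EMA of gradients ("momentum")
    x_new = x - G @ m                          # quasi-Newton style step
    # Hypergradient: backpropagate the next loss into G. For dense G,
    # d f(x_new)/dG = -grad(x_new) m^T, by the chain rule through x_new = x - G m.
    G_new = G - alpha_meta * (-np.outer(grad(x_new), m))
    return x_new, G_new, m

# Usage: a fixed quadratic f(x) = 0.5 x^T H x, so grad f(x) = H x.
rng = np.random.default_rng(0)
n = 5
Q = rng.normal(size=(n, n))
H = Q @ Q.T / n + np.eye(n)                    # positive-definite Hessian
grad = lambda x: H @ x
x, G, m = rng.normal(size=n), 0.1 * np.eye(n), np.zeros(n)
for _ in range(2000):
    x, G, m = lodo_step(x, G, m, grad)
loss = 0.5 * x @ H @ x
```

In practice the extra gradient evaluation at x_new is simply the next iteration's gradient, so each step costs one backpropagation plus the cheap preconditioner updates.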
We replace the FFT matrices in the EUNN with fixed randomly chosen sparse matrices, and write G(θ) as a large product of matrices:

G(θ) = α_0 (I 0) Ĝ(θ)^T Ĝ(θ) (I 0)^T,   Ĝ(θ) = ∏_{i=1}^{N} B(θ^{(i)}) P_i   (1)

where α_0 is a fixed initial learning rate, the P_i are randomly selected permutation matrices, and the B(θ^{(i)}) are block-diagonal matrices whose contents are listed by θ^{(i)}, where θ = (θ^{(1)}, . . . , θ^{(N)}), as illustrated in Figure 1. Every block is of size k × k for some chosen k (we use 4 for our setup). The (I 0) and (I 0)^T matrices are n × ñ and ñ × n respectively, and all other matrices are ñ × ñ, where ñ is some chosen integer larger than n; for our setup, we choose ñ = ⌊2n/k⌋k. Though G depends on both α_0 and θ, we omit α_0 in writing for brevity. By initializing each block of B(θ^{(i)}) to a random orthogonal matrix, we initialize G(θ) to a multiple of the identity, despite the input information diffusing and mixing with itself inside the layers of the network. The initialization to the identity also allows us to make the depth N large without worrying about issues with gradient magnitudes; for our setup we choose N = 16. The product Ĝ(θ) intentionally resembles the operation performed by a deep neural network with N layers, millions of hidden neurons arranged in inverted-bottleneck style, very sparse connections, and no activation functions. The random connections between neurons are intended to bring expander graph properties to the computation graph of the neural network, such that signals can diffuse and self-mix quickly without travelling long distances within the computation graph. Since we wanted LODO to be repelled by saddle points, we chose G(θ) to produce a step which has guaranteed nonnegative dot product with the gradient, so we parameterized G(θ) in a way which guarantees it to be a positive-semidefinite symmetric matrix.
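For small n, the product in Equation 1 can also be materialized densely to check its claimed properties. The sketch below (our own construction, with illustrative defaults) builds the product of block-diagonal and permutation matrices and confirms that, with orthogonal block initialization, G(θ) starts as α_0 times the identity, and that G(θ) is symmetric positive-semidefinite for any block contents.

```python
import numpy as np

def random_orthogonal(k, rng):
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal k x k block
    return q

def build_G(n, k=4, N=16, alpha0=0.1, orthogonal_init=True, seed=0):
    """Materialize Equation 1: G = alpha0 (I 0) Ghat^T Ghat (I 0)^T with
    Ghat = prod_i B(theta^(i)) P_i. Dense matrices are used purely for checking;
    the real optimizer only ever applies this product to vectors."""
    rng = np.random.default_rng(seed)
    nt = (2 * n // k) * k                           # hidden width n~ = floor(2n/k) k
    Ghat = np.eye(nt)
    for _ in range(N):
        P = np.eye(nt)[rng.permutation(nt)]         # random permutation P_i
        B = np.zeros((nt, nt))                      # block-diagonal B(theta^(i))
        for j in range(0, nt, k):
            B[j:j+k, j:j+k] = (random_orthogonal(k, rng) if orthogonal_init
                               else rng.normal(size=(k, k)) / np.sqrt(k))
        Ghat = Ghat @ B @ P
    E = np.zeros((n, nt)); E[:, :n] = np.eye(n)     # the (I 0) embedding
    return alpha0 * E @ Ghat.T @ Ghat @ E.T

G0 = build_G(10)                                    # orthogonal init: G0 = alpha0 * I
G1 = build_G(10, orthogonal_init=False, seed=1)     # generic blocks: still symmetric PSD
```

With orthogonal blocks, Ĝ(θ) is itself orthogonal, so Ĝ^T Ĝ collapses to the identity; the symmetry and positive-semidefiniteness hold regardless of the blocks, since the whole expression is a Gram product.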
Furthermore, since any positive-semidefinite symmetric matrix can be eigendecomposed into U D U^T = (U D^{1/2})(U D^{1/2})^T for unitary U and positive-semidefinite diagonal D, we chose G(θ) to represent the product of a matrix with its transpose. Since we expect the receptive field of output nodes to grow exponentially with depth, it takes O(log n) layers for all of the n inputs to interact with each other, so we intend the depth N to be some multiple of log n.

4. THEORETICAL PROPERTIES OF LODO

In this section, we illustrate LODO's behavior through its theoretical properties. To keep the analysis tractable, this section refers only to a version of LODO modified in the following ways: (i) we update θ_t using SGD with learning rate α instead of Adam; and (ii) we use no momentum (β = 0). We present two results, on desirable inverse-Hessian learning dynamics and on expressiveness; Appendix D gives some further support from the literature for desirable properties of LODO's inverse Hessian approximation error.

4.1. HESSIAN LEARNING DYNAMICS (LODO CAN LEARN THE INVERSE HESSIAN)

In this section, we show that, under certain assumptions and approximations, G(θ_t) converges to the inverse Hessian in a stochastic setting. We restrict our analysis to a version of LODO simplified from Algorithm 1, where the parameterization of G(θ_t) ∈ R^{n×n} is instead a rearrangement of the parameter vector θ_t ∈ R^{n²} and is not necessarily symmetric, though the experiment of Section 5.1 supports similar findings for the original version of LODO as well. We also work in the limit of small learning rate α. The problem setup is a noisily moving quadratic bowl:

s_t ∈ R^n,   s_t iid∼ S,   E[s_t] = 0,   x*_t = Σ_{τ=1}^{t} s_τ,   ℓ_t(x_t) = (1/2)(x_t − x*_t)^T H (x_t − x*_t)   (2)

where H is a positive-definite symmetric Hessian matrix and S is some light-tailed distribution of zero mean, for example a multivariate normal distribution N(s_t; 0, Σ). Lastly, we assume that ||I − G(θ_t)H||_2 < 1 for all t, i.e., that G(θ_t) never travels too far from H^{−1}. Under these conditions, the gradients of the loss with respect to both x_t and θ_t can be solved for, and LODO turns out to follow the training dynamics

A_{t+1} = A_t − αH b_{t+1} b_t^T H²,   b_{t+1} = A_t b_t − s_t,   (3)

where we define A_t = I − G(θ_t)H and b_t = x_t − x*_t to be the errors in the estimates of the inverse Hessian and the minimum, respectively. A full derivation is in Appendix A.1. b_t then exponentially decays to a fixed distribution, and the long-term dynamics of A_t should not be greatly affected by the quick compensating movement of b_t's distribution. Thus, to determine the movement of A_t, we approximate the dynamics along a short trajectory by fixing A_t to its value at t = t_0 to study the settled distribution of b_t; Appendix A.2 gives more rigorous justification for this approximation. We keep the A_{t+1} update but rewrite the b_{t+1} update as b_{t+1} = A_{t_0} b_t − s_t (Equation 4).
The total contribution of all s_τ for t_0 ≤ τ ≤ t to the movement of A_t in the infinite time horizon is then

αH A_{t_0} Σ_{n=0}^{∞} A_{t_0}^n ( Σ_{τ=t_0}^{t} s_τ s_τ^T ) (A_{t_0}^n)^T H²   (5)

plus some extra terms involving pairwise products between s_i and s_j (i ≠ j), and other terms involving b_{t_0}. In the long term, the average contribution of each s_τ to the total converges, by the law of large numbers, to

αH A_{t_0} Σ_{n=0}^{∞} A_{t_0}^n E[s_τ s_τ^T] (A_{t_0}^n)^T H²   (6)

where any pairwise products of s_i and s_j for i ≠ j have no effect due to independence and the mean of s_t being zero, and the effect of b_{t_0} vanishes due to small α. The expected error for this convergence decays to zero like O((t − t_0)^{−1/2}) due to the light-tailedness of s_τ (e.g. when s_τ iid∼ N(s_τ; 0, Σ)), by the central limit theorem. Expression 6 is then the step direction of the Euler method for approximating the solution to the differential equation

dA/dt = −HA Σ_{n=0}^{∞} A^n E[ss^T] (A^n)^T H²,

which can be shown to cause A to flow towards zero, such that G(θ_t) converges to H^{−1} (see footnote 2). Therefore, it is reasonable to believe that LODO learns the inverse Hessian over time, given a small enough learning rate. The Frobenius norm of the error decays faster when the magnitudes/norms of H and E[ss^T] are higher, indicating that both curvature of the Hessian and noisy movement of the quadratic bowl's minimum are good for learning the Hessian for fixed α, as long as the approximations mentioned stay accurate. One interpretation is that when the noise s is zero in a certain direction, the error b in that direction quickly decays to zero, so LODO no longer receives any data along that direction from which to learn that direction's curvature. Furthermore, if the Hessian H does not penalize deviation in some direction, then the training signal for that direction vanishes. We may also produce another analysis for when H has a direction of negative curvature, to show that LODO repels saddle points.
Here, we instead choose to parameterize G(θ_t) as a positive-definite symmetric matrix, and take the limit of small learning rate α such that G(θ_t) is effectively fixed. Then G(θ_t)H has some negative eigenvalue λ with eigenvector y (see Appendix B for proof), such that A = I − G(θ_t)H has an eigenvalue of norm greater than 1, meaning that the error vector b blows up exponentially in the y direction. Since G(θ_t)Hy = λy, left-multiplying by y^T G(θ_t)^{−1} further shows that y^T Hy < 0, implying that the direction of saddle point repulsion is one of negative curvature, which decreases the loss.
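The learning dynamics of Equation 3 can also be simulated directly to check the claimed convergence of G(θ_t) to H^{−1}. The following sketch (our own code, with arbitrary small dimensions and constants) iterates the A_t and b_t recurrences on a noisily moving bowl and tracks the Frobenius norm of A_t = I − G(θ_t)H:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, steps = 4, 2e-3, 100_000
Q = rng.normal(size=(n, n))
H = Q @ Q.T / n + np.eye(n)          # positive-definite Hessian of the bowl
A = np.eye(n) - 0.2 * H              # A_0 = I - G_0 H, with G_0 = 0.2 I
b = np.zeros(n)
err0 = np.linalg.norm(A)             # Frobenius norm of the initial error
H2 = H @ H
for _ in range(steps):
    s = 0.1 * rng.normal(size=n)     # noise s_t moving the bowl's minimum
    b_new = A @ b - s                # b_{t+1} = A_t b_t - s_t
    A = A - alpha * H @ np.outer(b_new, b) @ H2  # A_{t+1} = A_t - alpha H b_{t+1} b_t^T H^2
    b = b_new
err1 = np.linalg.norm(A)
```

With a full-rank noise covariance, the error norm shrinks substantially over the run, consistent with the flow of A towards zero; setting the noise to zero along some direction stalls learning along that direction, matching the interpretation above.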

4.2. REPRESENTABILITY (LODO CAN BE EXPRESSIVE)

In this section, we seek to show that our parameterization of G(θ) is highly expressive; LODO may represent a wider range of inverse Hessians than many other methods. In fact, we show that with an increase in parameter count by only a logarithmic factor, our fixed parameterization of G(θ) can reproduce any possible parameterization G(θ) = F^T F where vector multiplication with F is computable with some sparse linear neural network. Our parameterization is thus capable of representing inverse Hessians from (Baydin et al., 2017) and (Amid et al., 2022) with only O(log ñ) times more parameters, and from (Moskovitz et al., 2019) with only O(log² ñ) times more parameters. Definition 1 creates a function ϵ to characterize the mixing rate of random transpositions when trying to shuffle lists, while Theorem 1 uses this ϵ function to lower bound the probability that the G(θ) network in LODO can represent other linear neural networks.

Footnote 2: A flows towards zero because, substituting the eigendecomposition H = U D U^T where U U^T = I, and using B = U^T A U, we can show that the norm of B D^{−1} decreases over time:

d||B D^{−1}||²_F / dt = (d/dt) tr(D^{−2} B^T B)   (8)
= −2 tr( B^T D B Σ_{n=0}^{∞} B^n U^T E[ss^T] U (B^n)^T )   (9)
= −2 || D^{1/2} B ( Σ_{n=0}^{∞} B^n U^T E[ss^T] U (B^n)^T )^{1/2} ||²_F ≤ 0   (10)

where equality is only attained at A = 0.

Definition 1. Uniformly sample a sequence of N ñ/2 transpositions of two out of ñ elements, for integers ñ/2 ∈ N and N ∈ N, with the condition that every successive block of ñ/2 transpositions commutes internally (transpositions can be rearranged within a block). We replace each transposition with the identity operation with probability 1/2, and then compose the sequence of transpositions/identities to form a permutation. Then, we define the function ϵ(N ñ/2, ñ) such that the expected entropy of this permutation, given the original sequence but not given the locations where identities were placed, is log ñ! − ϵ(N ñ/2, ñ).

Theorem 1.
Uniformly sample permutations P_i and create block-diagonal matrices B(θ^{(i)}) where every block is 2 × 2, and whose block contents are listed by the parameters θ^{(i)}. Use these to construct the LODO subnetwork G(θ) as in Equation 1 with some depth N and hidden dimension ñ. Construct any linear neural network F with input dimension, output dimension, and number of weights per layer at most ñ, at most k incoming and at most k outgoing weights for every neuron, depth d, and any arrangement of weights. Then, with probability at least

1 − ñ! N (1/2) ϵ( ñN / (4d(⌈log₂ k⌉ + 1)), ñ ),

we can make G(θ) = F for some θ. The proof is in Appendix C. We believe that random transpositions in the style of Definition 1 are a quick way to shuffle, since the Cayley graph over the symmetric group generated by all transpositions has good expansion properties (Konstantinova & Kravchuk, 2022). In other words, we hypothesize that for large N and ñ, we have ϵ(N ñ/2, ñ) ≈ c₁ ñ e^{−c₂ N ñ/2} for some positive constants c₁ and c₂. This would imply that the probability that G(θ) can represent all possible F is at least approximately

1 − ñ! N (c₁ ñ / 2) exp( −c₂ ñN / (4d(⌈log₂ k⌉ + 1)) ),

which can be made to exceed 1 − (ñ!)^{−c} for any constant c by using a depth N growing like N ∝ d(log₂ k) log ñ, due to Stirling's approximation. Thus we believe that by being only O((log₂ k) log ñ) times deeper than F, we can make it very likely that our model G(θ) can represent all possible F.
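The mixing quantity of Definition 1 can be probed numerically for tiny ñ. The sketch below (our own Monte Carlo, using ñ = 4) composes layers of random disjoint transpositions, each independently replaced by the identity with probability 1/2, tracks the exact distribution over the resulting permutation, and averages its entropy over sampled sequences; entropy approaching log ñ! corresponds to ϵ being small.

```python
import math
import numpy as np
from itertools import permutations

n_el = 4                                       # n~ = 4: small enough to track S_4 exactly
perms = list(permutations(range(n_el)))
index = {p: i for i, p in enumerate(perms)}

def lazy_transposition(dist, a, b):
    """Mix in a transposition (a b) that is kept with probability 1/2."""
    out = 0.5 * dist
    for p, i in index.items():
        q = list(p); q[a], q[b] = q[b], q[a]   # apply the swap to permutation p
        out[index[tuple(q)]] += 0.5 * dist[i]
    return out

rng = np.random.default_rng(0)
entropies = []
for _ in range(5):                             # average over sampled sequences
    dist = np.zeros(len(perms)); dist[index[tuple(range(n_el))]] = 1.0
    for _layer in range(12):                   # N = 12 layers of n~/2 = 2 disjoint swaps
        a, b, c, d = rng.permutation(n_el)
        dist = lazy_transposition(dist, a, b)
        dist = lazy_transposition(dist, c, d)
    p = dist[dist > 0]
    entropies.append(float(-(p * np.log(p)).sum()))
avg_entropy = sum(entropies) / len(entropies)
max_entropy = math.log(math.factorial(n_el))   # log 4! ~ 3.178
```

In this toy setting the average entropy comes out close to log ñ!, consistent with the hypothesized exponential decay of ϵ in the depth.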

5. EXPERIMENTS WITH LODO

We present here a number of tasks which provide experimental evidence for the theoretical results we claim, testing a variety of optimizers on each task. Appendix G presents an additional experiment on the Rosenbrock function. Appendix E explains how we tuned each optimizer's hyperparameters for each task. Each optimizer was run 8 times for every experiment to ensure reproducibility of results, unless otherwise stated.

5.1. NOISY QUADRATIC BOWL

We use various optimizers to track the minimum of a quadratic bowl of fixed true Hessian H as its minimum is perturbed by noise at every step, to demonstrate that LODO correctly learns its inverse Hessian representation as claimed in Section 4.1. The setup of the noisy quadratic bowl is the same as in Section 4.1, and details are provided in Appendix F.1. We interpret an optimizer to be better if it can maintain a lower loss in the infinite-step limit, since error builds at a constant rate over time and the optimizer's role is to react and correct it as quickly as possible. We tested each optimizer by using it to track the minimum of the moving quadratic bowl over 100k steps. Learning curves in Figure 2 and losses in Table 5 of Appendix F.1 show that LODO tracks the minimum more accurately than all other optimizers, and that an estimator of the inverse Hessian approximation error ||I − G(θ_t)H||²_F / n decays over time. Other optimizers underperform because their preconditioners are less expressive, being either diagonal or low-rank and thus unable to capture the off-diagonal elements or non-top-few singular values of the true inverse Hessian, leading to a higher loss.

Figure 2 caption: Left: the estimator (1/100) Σ_{i=1}^{100} ||(I − G(θ_t)H)v_i||²_2 of the inverse Hessian approximation error, with random independent unit vectors v_i. Right: average training loss learning curves, smoothed by averaging over blocks of 200 steps each. The dotted line shows the theoretically best possible loss using Newton's method. A better optimizer maintains a lower loss after infinite steps, since for this task, loss is introduced over time and the optimizer serves to quickly suppress it. Left & Right: error margins indicate ±1 standard deviation between the performances of 8 optimizers at every step. The version of L-BFGS shown includes stochastic modifications from (Schraudolph et al., 2007), instead of the original from (Nocedal & Wright, 1999).
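The error estimator in the Figure 2 caption is a standard random-projection estimate of a scaled Frobenius norm, usable even when G(θ_t) is never materialized. A sketch of the idea (our own code; the matrix-vector product stands in for v ↦ (I − G(θ_t)H)v):

```python
import numpy as np

def frob_error_estimate(matvec, n, num=100, rng=None):
    """Estimate ||M||_F^2 / n from matrix-vector products alone, by averaging
    ||M v||^2 over random unit vectors v: for v uniform on the sphere,
    E ||M v||^2 = ||M||_F^2 / n. Here matvec is any function v -> M v, so M
    (e.g. I - G(theta_t) H) never needs to be materialized."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(num):
        v = rng.normal(size=n)
        v /= np.linalg.norm(v)                 # uniform random unit vector
        total += float(np.linalg.norm(matvec(v)) ** 2)
    return total / num

# Usage: compare against the exact value on an explicit random matrix.
rng = np.random.default_rng(1)
M = rng.normal(size=(50, 50))
est = frob_error_estimate(lambda v: M @ v, 50, num=200)
exact = np.linalg.norm(M) ** 2 / 50
```

The estimator is unbiased, so averaging over 100 unit vectors per measurement, as in the figure, suffices to make the curves smooth.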

5.2. IMAGE GENERATION

We design a challenge for LODO: an autoregressive image generator for MNIST. We intend to demonstrate the effectiveness of LODO at scale when used to train a semi-realistic neural network. The task is similar to training a PixelCNN (Oord et al., 2016) to generate MNIST images (Lecun et al., 1998), and is fully described in Appendix F.2. We tested LODO alongside a variety of optimizers for 300k steps, though we could not get any quasi-Newton methods to converge on this task. Learning curves in Figure 3 and losses in Table 1 show that LODO trains at a speed that is competitive with other optimizers. Figure 7 in Appendix F.2 shows some MNIST digits we generated. We study the contributions of various components of LODO to its performance by replacing components and observing the performance change, as detailed below. For example, we replace the Adam meta-optimizer with SGD to form a new version of LODO (which we call "LODO-SGD"); its performance is shown in Table 2. Other modifications are listed below.

5.2.1. EIGENDECOMPOSED ALTERNATE PARAMETERIZATION

We would like to show that the architecture LODO uses to choose the step is flexible. We modify the matrix decomposition in Section 3 to G(θ) = α_0 (I 0) Ĝ(θ)^T D Ĝ(θ) (I 0)^T by adding a learned diagonal matrix D in the middle and changing Ĝ(θ) to a product of many weighted permutations, each added to the identity matrix. The neural network which Ĝ(θ) represents now has residual connections, and the initialization is modified accordingly. Losses in Table 2 show that this version (which we call "LODO-Residuals") performs only slightly worse than LODO for the same number of steps, reflecting the flexibility in the design of LODO.
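For small n, this modified decomposition can likewise be materialized to sanity-check its properties. The sketch below (our own construction; the exact residual-layer form and initialization are illustrative assumptions) confirms that the LODO-Residuals preconditioner is symmetric positive-semidefinite and starts at α_0 I when the residual weights are zero and D = I:

```python
import numpy as np

def build_G_residual(n, N=8, alpha0=0.1, w_scale=0.0, seed=0):
    """Materialize the Section 5.2.1 variant: G = alpha0 (I 0) Ghat^T D Ghat (I 0)^T,
    where Ghat is a product of residual layers (I + w_i P_i) and D is a learned
    diagonal (fixed to I here). With all w_i = 0 and D = I, G starts as alpha0 * I."""
    rng = np.random.default_rng(seed)
    nt = 2 * n
    Ghat = np.eye(nt)
    for _ in range(N):
        P = np.eye(nt)[rng.permutation(nt)]     # permutation P_i
        w = w_scale * rng.normal()              # scalar weight on the permutation
        Ghat = Ghat @ (np.eye(nt) + w * P)      # residual layer I + w_i P_i
    D = np.eye(nt)                              # learned diagonal, initialized to I
    E = np.zeros((n, nt)); E[:, :n] = np.eye(n)
    return alpha0 * E @ Ghat.T @ D @ Ghat @ E.T

G_init = build_G_residual(6)                    # all w_i = 0: exactly alpha0 * I
G_rand = build_G_residual(6, w_scale=0.2, seed=1)  # generic weights: symmetric PSD
```

As with Equation 1, symmetry and positive-semidefiniteness come for free from the Gram-product form, provided D stays nonnegative.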

5.2.2. SIMPLER APPROXIMATE HESSIANS

We presumed that the representability result of Section 4.2 is useful because LODO's strength comes from the flexibility that G(θ) gives in configuring pairwise interactions between parameters. We therefore expect that using a simpler form of Hessian should hurt the performance of LODO. We test two simpler forms of G(θ): G(θ) = α_0 diag(θ) for θ ∈ R^n initialized to a vector of ones, similar to (Amid et al., 2022) (we call this "LODO-Diagonal"), and the even simpler G(θ) = α_0 θI for θ ∈ R initialized to 1, as in (Baydin et al., 2017) (we call this "LODO-Global"). Losses in Table 2 show that the original version of LODO performs the best, verifying our hypothesis.
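The simplest of these baselines is easy to write down in full. Below is a sketch of LODO-Global on a toy quadratic (our own code and constants, with α_0 folded into θ; LODO-Diagonal is the analogous per-coordinate version with θ a vector and elementwise updates):

```python
import numpy as np

def lodo_global_step(x, theta, grad, alpha_meta=1e-4):
    """Sketch of LODO-Global: G(theta) = theta * I, a single learned step size
    trained by hypergradient descent in the style of Baydin et al. (2017)."""
    g = grad(x)
    x_new = x - theta * g
    # d f(x_new)/d theta = -grad(x_new) . g, by the chain rule through x_new.
    theta = theta + alpha_meta * float(grad(x_new) @ g)
    return x_new, theta

# Usage: an anisotropic quadratic f(x) = 0.5 x^T H x.
H = np.diag([1.0, 2.0, 3.0, 4.0])
grad = lambda x: H @ x
x, theta = np.ones(4), 0.05
for _ in range(2000):
    x, theta = lodo_global_step(x, theta, grad)
loss = 0.5 * x @ H @ x
```

A single scalar (or even a diagonal) cannot encode the off-diagonal parameter interactions that the full G(θ) captures, which is the hypothesized reason these ablations lose to LODO in Table 2.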

5.2.3. EFFECTS OF USING EMAS OF GRADIENTS

Similarly to how momentum works for SGD, LODO's input gradients are preprocessed by accumulation into EMAs. To test our claim in Section 3 that momentum benefits LODO, we try 8 separate momentum decay rates in a logarithmic grid from no momentum to the optimal amount of momentum found (β = 0.9343), and test each decay rate once. Figure 4 shows a clear trend that, at least up to the optimal decay rate, increasing the effect of momentum improves LODO. We also try removing momentum completely (we call the modified version "LODO-No-Momentum"); results are likewise shown in Table 2.

6. DISCUSSION

LODO is a middle point between L2O methods and quasi-Newton methods, retaining significant advantages of both classes of optimizers. Via LODO, we bring ideas from both classes of optimization methods to offer potential solutions to problems common on the other side. Relative to quasi-Newton methods, LODO offers advantages associated with the use of a meta-optimizer on a neural optimizer. Crucially, LODO determines its inverse Hessian estimate using all past gradients, whereas most other quasi-Newton methods use a finite history of them. This allows LODO to retain information about the inverse Hessian for much longer than other methods, which is useful if the gradients contain enough noise that useful signals can only be obtained by accumulating information from many gradient samples. Our theory further shows that the linear neural network in LODO is optimal to a certain extent: it can probably represent all linear neural networks smaller by a logarithmic factor, allowing a huge class of inverse Hessians. Our image generation task demonstrates that LODO succeeds in a semi-realistic stochastic nonconvex task where other quasi-Newton optimizers diverge. Due to our use of L2O, LODO also has flexibility in the design of its linear neural network, which makes it amenable to further research and refinement.
Relative to L2O, LODO offers advantages associated with the restructuring of the outer and inner loop into a single loop. Most importantly, our modification to L2O alleviates the requirement for meta-training time and the training task distribution. This is at the cost of increased inner loop unrolling truncation bias, but it takes advantage of this sacrifice in resolving the need to compute second-order gradients. LODO still inherits issues of high memory usage and slow step computation from L2O methodology though. Our theory offers some understanding of how LODO learns to optimize, which is rare for L2O methods: the Hessian approximation error decays as learning progresses. We import the idea from quasi-Newton methods that the gradient of one parameter can affect the step for another, which comes from the presence of off-diagonal elements in the Hessian. As shown in Section 4.2, LODO presents an efficient way of approximating subsets of the O(n 2 ) possible pairwise parameter interactions in O(n log n) time. Such interactions are commonly ignored in the design of L2O and more mainstream optimizers, yet our image generation task demonstrates their importance, as evidenced by the improved performance of LODO over ablated versions as well as SGD.

Conclusion

Through LODO, we provide a new way of using L2O methods online without any meta-training to perform quasi-Newton optimization. We introduce the strengths and advantages of quasi-Newton methods and L2O to each other and combine them in a harmonious manner. LODO's abilities showcase the applicability of online L2O methods with nested optimizers to the training of modern neural networks. Our unique methodology serves as a stepping stone for the further development and use of L2O in quasi-Newton optimization and vice versa.

7. REPRODUCIBILITY OF RESULTS

We have described all the details of the construction of our optimizer in Section 3, and all its modified versions in Section 5.2. All the tasks used for experiments are introduced in Section 5 and are fully described in Appendices F.1 and F.2. To further ensure reproducibility, we have also listed all the hyperparameters found for each optimizer and task in Appendix E. All our training runs were repeated 8 times to verify that each measurement of empirical behavior is consistent, with low standard deviation. Furthermore, we will release the code for all our optimizers and experiments on GitHub.

A ELABORATION ON HESSIAN LEARNING DYNAMICS A.1 DERIVATION OF TRAINING DYNAMICS

This section gives a derivation of the result that, under the problem setup of Section 4.1, LODO follows the Hessian learning dynamics

A_{t+1} = A_t − αH b_{t+1} b_t^T H²,   b_{t+1} = A_t b_t − s_t,

where A_t = I − G(θ_t)H and b_t = x_t − x*_t, as long as G(θ_t) is parameterized as a dense matrix filled with the elements of θ_t and no momentum is used. We first let b_t be x_t − x*_t. The loss at time t is then

ℓ_t = (1/2) b_t^T H b_t.

The gradient is then computed to be dℓ_t/dx_t = H b_t, and the step taken produces the next parameters: x_{t+1} = x_t − G(θ_t) H b_t. Subtracting x*_{t+1} = x*_t + s_t, we get the recurrence for b_t:

x_{t+1} − x*_{t+1} = x_t − x*_t − s_t − G(θ_t)H b_t   (16)
b_{t+1} = b_t − G(θ_t)H b_t − s_t   (17)
= (I − G(θ_t)H) b_t − s_t   (18)
= A_t b_t − s_t.   (3)

The loss at time t + 1 is computed to be

ℓ_{t+1} = (1/2) b_{t+1}^T H b_{t+1}   (19)
= (1/2)(A_t b_t − s_t)^T H (A_t b_t − s_t)   (20)
= (1/2)((I − G(θ_t)H) b_t − s_t)^T H ((I − G(θ_t)H) b_t − s_t).   (21)

LODO also computes a step of θ_t using the loss on the next step. Since the elements of θ_t are just a rearrangement of the elements of G(θ_t) in our derivation, an update of θ_t can be treated instead as an update of G(θ_t). The gradient of ℓ_{t+1} with respect to G(θ_t) is then computed to be

dℓ_{t+1}/dG(θ_t) = −H((I − G(θ_t)H) b_t − s_t) b_t^T H   (22)
= −H(A_t b_t − s_t) b_t^T H   (23)
= −H b_{t+1} b_t^T H   (24)

and the step of G(θ_t) is

G(θ_{t+1}) = G(θ_t) + αH b_{t+1} b_t^T H   (25)

resulting in the recurrence for A_t:

A_{t+1} = I − G(θ_{t+1})H   (26)
= I − (G(θ_t) + αH b_{t+1} b_t^T H)H   (27)
= A_t − αH b_{t+1} b_t^T H².   (3)
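The key identity dℓ_{t+1}/dG(θ_t) = −H b_{t+1} b_t^T H can be checked numerically against finite differences, since ℓ_{t+1} is an explicit function of G. A sketch (our own verification code, on a random small instance):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Q = rng.normal(size=(n, n))
H = Q @ Q.T + np.eye(n)                  # positive-definite symmetric Hessian
G = 0.1 * rng.normal(size=(n, n))        # dense G(theta_t), arbitrary
b_t = rng.normal(size=n)                 # b_t = x_t - x*_t
s_t = rng.normal(size=n)                 # noise moving the minimum

def next_loss(Gmat):
    b_next = (np.eye(n) - Gmat @ H) @ b_t - s_t  # b_{t+1} = (I - G H) b_t - s_t
    return 0.5 * b_next @ H @ b_next             # l_{t+1} = 0.5 b_{t+1}^T H b_{t+1}

b_next = (np.eye(n) - G @ H) @ b_t - s_t
analytic = -np.outer(H @ b_next, H @ b_t)        # claimed dl_{t+1}/dG = -H b_{t+1} b_t^T H
numeric = np.zeros((n, n))
eps = 1e-6
for i in range(n):
    for j in range(n):
        Gp, Gm = G.copy(), G.copy()
        Gp[i, j] += eps; Gm[i, j] -= eps
        numeric[i, j] = (next_loss(Gp) - next_loss(Gm)) / (2 * eps)
```

Because ℓ_{t+1} is quadratic in the entries of G, the central difference here is exact up to floating-point rounding, so the two matrices agree to high precision.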

A.2 VALIDITY OF APPROXIMATION ARGUMENT

This section gives justification for the approximation in Section 4.1 of the long-term trajectory of the recurrence

A_{t+1} = A_t − αH b_{t+1} b_t^T H²   (28)
b_{t+1} = A_t b_t − s_t   (29)

by the replacement

A'_{t+1} = A'_t − αH b'_{t+1} b'^T_t H²   (30)
b'_{t+1} = A'_{t_0} b'_t − s_t   (31)

when α is small and the initial conditions at t_0 are the same: A_{t_0} = A'_{t_0} and b_{t_0} = b'_{t_0} = 0. (In the case where the noise is not bounded, a probabilistic analysis can be done instead, though we do not provide one.) To justify this approximation, we prove that the spectral norm of the long-term deviation, corrected for learning rate, is small over short distances r, in the following theorem:

Theorem 2. lim_{r→0} lim_{α→0} (1/r) ||A_{t_0+⌊r/α⌋} − A'_{t_0+⌊r/α⌋}||_2 = 0.   (32)

In other words, the local movement of A rescaled for learning rate is unaffected by our approximation when the learning rate α is small.

Proof. Our proof strategy is as follows: 1. We will first define variables to denote bounds on the movement of A and the approximation error in b. 2. We will show that these variables bound each other, and then we will combine these bounds to create a single recursive bound on the movement of A. 3. We will characterize the bound's growth, and it will turn out that A has a maximum movement speed along any trajectory of sufficiently short length. 4. Due to the slow movement of A, we can deduce that the approximation error in b increases at a bounded rate.

5.. Since approximation errors in

A are an accumulation of errors in b, we will show that deviation between the true and approximate A trajectories is quadratic in the distance along the trajectory. 6. We conclude that the approximation error vanishes for short trajectories and small learning rates. First part. We first define the maximum drift in A ϵ A,t0+∆t = max t0≤τ ≤t0+∆t ||A τ -A ′ t0 || 2 (33) up to time difference ∆t for 0 ≤ ∆t ≤ R/α for some chosen small constant trajectory length R > 0. We will pick R later. We will also define the maximum error in b ϵ b,t0+∆t = max t0≤τ ≤t0+∆t ||b τ -b ′ τ || 2 (34) up to the same time. Second part. For the bound in one direction, we have that for all τ such that t 0 ≤ τ ≤ t 0 + ∆t, ||b τ +1 -b ′ τ +1 || 2 =||A τ b τ -s τ -(A ′ t0 b ′ τ -s τ )|| 2 (35) =||A τ b τ -A ′ t0 b ′ τ || 2 (36) ≤||A τ b τ -A ′ t0 b τ || 2 + ||A ′ t0 b τ -A ′ t0 b ′ τ || 2 (37) ≤||A τ -A ′ t0 || 2 ||b τ || 2 + ||A ′ t0 || 2 ||b τ -b ′ τ || 2 (38) ≤ϵ A,t0+∆t ||b τ || 2 + ||A ′ t0 || 2 ||b τ -b ′ τ || 2 (39) using the triangle inequality and sub-multiplicativity for the spectral norm ||•|| 2 . This is a recurrence in ||b τ -b ′ τ || 2 ; by induction we have that for t 0 ≤ τ ≤ t 0 + ∆t + 1, ||b τ -b ′ τ || 2 ≤ϵ A,t0+∆t τ -1 τ1=t0 ||A ′ t0 || τ -1-τ1 2 ||b τ1 || 2 (40) such that we produce the bound ϵ b,t0+∆t+1 ≤ϵ A,t0+∆t max t0≤τ ≤t0+∆t+1 τ -1 τ1=t0 ||A ′ t0 || τ -1-τ1 2 ||b τ1 || 2 (41) ≤ϵ A,t0+∆t max t0≤τ ≤t0+∆t+1 τ -1 τ1=t0 ||A ′ t0 || τ -1-τ1 2 (||b ′ τ1 || 2 + ϵ b,t0+∆t ) (42) ≤ϵ A,t0+∆t t0+∆t τ1=t0 ||A ′ t0 || t0+∆t-τ1 2 (b max + ϵ b,t0+∆t ) (43) ≤ϵ A,t0+∆t ϵ b,t0+∆t+1 + b max 1 -||A ′ t0 || 2 (44) ϵ b,t0+∆t+1 ≤ ϵ A,t0+∆t b max 1 -||A ′ t0 || 2 -ϵ A,t0+∆t . 
Now we show a reverse bound: for all $\tau$ such that $t_0 \le \tau \le t_0 + \Delta t$, we have

$$\|A_{\tau+1} - A'_{t_0}\|_2 = \|A_\tau - \alpha H b_{\tau+1} b_\tau^T H^2 - A'_{t_0}\|_2 \tag{46}$$
$$\le \|A_\tau - A'_{t_0}\|_2 + \alpha \|H\|_2^3 \|b_\tau\|_2 \|b_{\tau+1}\|_2 \tag{47}$$
$$\le \|A_\tau - A'_{t_0}\|_2 + \alpha \|H\|_2^3 \left(\|b'_\tau\|_2 + \epsilon_{b,t_0+\Delta t+1}\right)\left(\|b'_{\tau+1}\|_2 + \epsilon_{b,t_0+\Delta t+1}\right) \tag{48}$$
$$\le \|A_\tau - A'_{t_0}\|_2 + \alpha \|H\|_2^3 \left(b_{\max} + \epsilon_{b,t_0+\Delta t+1}\right)^2. \tag{49}$$

By induction we have, for $t_0 \le \tau \le t_0 + \Delta t + 1$,

$$\|A_\tau - A'_{t_0}\|_2 \le \alpha \|H\|_2^3 (\tau - t_0)\left(b_{\max} + \epsilon_{b,t_0+\Delta t+1}\right)^2, \tag{50}$$

such that we produce the reverse bound

$$\epsilon_{A,t_0+\Delta t+1} \le \alpha \|H\|_2^3 (\Delta t + 1)\left(b_{\max} + \epsilon_{b,t_0+\Delta t+1}\right)^2. \tag{51}$$

Third part. Substituting the bound in Equation 45 into the bound in Equation 51, we produce the recurrence

$$\epsilon_{A,t_0+\Delta t+1} \le \alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1) \left(1 + \frac{\epsilon_{A,t_0+\Delta t}}{1 - \|A'_{t_0}\|_2 - \epsilon_{A,t_0+\Delta t}}\right)^2 \tag{52}$$
$$= \alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1) \left(\frac{1 - \|A'_{t_0}\|_2}{1 - \|A'_{t_0}\|_2 - \epsilon_{A,t_0+\Delta t}}\right)^2 \tag{53}$$
$$= f(\epsilon_{A,t_0+\Delta t}), \tag{54}$$

where

$$f(x) = \alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1) \left(\frac{1 - \|A'_{t_0}\|_2}{1 - \|A'_{t_0}\|_2 - x}\right)^2. \tag{55}$$

To bound the movement of $A$, we use the fact that when

$$0 \le \Delta t \le \frac{4}{27} \cdot \frac{1 - \|A'_{t_0}\|_2}{\alpha \|H\|_2^3 b_{\max}^2} - 1, \tag{56}$$

the function $f$ maps the interval

$$I_{\Delta t} = \left[0,\ \tfrac{9}{4} \alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1)\right] \subseteq \left[0,\ \tfrac{1}{3}\left(1 - \|A'_{t_0}\|_2\right)\right] \tag{57}$$

to a subset of itself. Since at $\Delta t = 0$ we have $\epsilon_{A,t_0+\Delta t} = 0 \in I_{\Delta t}$, and we also have $I_{\Delta t} \subseteq I_{\Delta t+1}$, we may deduce by induction on $\Delta t$ that $\epsilon_{A,t_0+\Delta t} \in I_{\Delta t}$ as long as Equation 56 holds, and thus there is a bound

$$\epsilon_{A,t_0+\Delta t} \le \tfrac{9}{4} \alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1) \le \tfrac{1}{3}\left(1 - \|A'_{t_0}\|_2\right) \tag{58}$$

on the movement speed of $A$ as long as Equation 56 holds.

Fourth part. Note that we have assumed that $0 \le \Delta t \le R/\alpha$ for some constant $R$ which we have not yet picked. By choosing

$$R \le \frac{4}{27} \cdot \frac{1 - \|A'_{t_0}\|_2}{\|H\|_2^3 b_{\max}^2} - \alpha,$$

we may always guarantee Equation 56, which implies Equation 58.
Then, when Equation 58 is substituted into Equation 45, we obtain a small bound on the approximation error in $b$ which begins at zero and increases with time:

$$\epsilon_{b,t_0+\Delta t+1} \le \frac{\frac{9}{4}\alpha \|H\|_2^3 b_{\max}^3 (\Delta t+1)}{1 - \|A'_{t_0}\|_2 - \frac{9}{4}\alpha \|H\|_2^3 b_{\max}^2 (\Delta t+1)} \tag{60}$$
$$\le \frac{\alpha b_{\max} (\Delta t+1)}{3R - \alpha(\Delta t+1)} \tag{61}$$

for $0 \le \Delta t \le R/\alpha$. This also holds trivially for $\Delta t = -1$, where $\epsilon_{b,t_0+\Delta t+1} = 0$, so we may re-index to obtain

$$\epsilon_{b,t_0+\Delta t} \le \frac{\alpha b_{\max} \Delta t}{3R - \alpha \Delta t} \tag{62}$$

for $0 \le \Delta t \le R/\alpha + 1$. Since the right side of Equation 62 is convex in $\Delta t$ over $\Delta t \in [0, R/\alpha]$, we may bound it by a linear function with the same endpoints:

$$\epsilon_{b,t_0+\Delta t} \le \frac{\alpha b_{\max}}{2R} \Delta t \tag{63}$$

for $0 \le \Delta t \le R/\alpha$.

Fifth part. Finally, we use this bound on the approximation error in $b$ to bound the approximation error in $A$:

$$\|A_{t_0+\Delta t+1} - A'_{t_0+\Delta t+1}\|_2 = \|A_{t_0+\Delta t} - \alpha H b_{t_0+\Delta t+1} b_{t_0+\Delta t}^T H^2 - (A'_{t_0+\Delta t} - \alpha H b'_{t_0+\Delta t+1} b_{t_0+\Delta t}'^T H^2)\|_2 \tag{64}$$
$$\le \|A_{t_0+\Delta t} - A'_{t_0+\Delta t}\|_2 + \alpha \|H\|_2^3 \|b_{t_0+\Delta t+1} b_{t_0+\Delta t}^T - b'_{t_0+\Delta t+1} b_{t_0+\Delta t}'^T\|_2 \tag{65}$$
$$\le \|A_{t_0+\Delta t} - A'_{t_0+\Delta t}\|_2 + \alpha \|H\|_2^3 \left( \|b_{t_0+\Delta t+1}\|_2 \|b_{t_0+\Delta t} - b'_{t_0+\Delta t}\|_2 + \|b_{t_0+\Delta t+1} - b'_{t_0+\Delta t+1}\|_2 \|b'_{t_0+\Delta t}\|_2 \right) \tag{66}$$
$$\le \|A_{t_0+\Delta t} - A'_{t_0+\Delta t}\|_2 + \epsilon_{b,t_0+\Delta t+1}\, \alpha \|H\|_2^3 \left(2 b_{\max} + \epsilon_{b,t_0+\Delta t+1}\right). \tag{67}$$

By induction, we find that the approximation error of $A$ is quadratic in time for short times $0 \le \Delta t \le R/\alpha$:

$$\|A_{t_0+\Delta t} - A'_{t_0+\Delta t}\|_2 \le \sum_{\Delta \tilde t=0}^{\Delta t-1} \epsilon_{b,t_0+\Delta \tilde t+1}\, \alpha \|H\|_2^3 \left(2 b_{\max} + \epsilon_{b,t_0+\Delta \tilde t+1}\right) \tag{68}$$
$$\le \|H\|_2^3 b_{\max}^2 \sum_{\Delta \tilde t=0}^{\Delta t-1} \left( \frac{\alpha^3 (\Delta \tilde t+1)^2}{4R^2} + \frac{\alpha^2 (\Delta \tilde t+1)}{R} \right) \tag{69}$$
$$= \|H\|_2^3 b_{\max}^2 \sum_{\Delta \tilde t=1}^{\Delta t} \left( \frac{\alpha^3 \Delta \tilde t^2}{4R^2} + \frac{\alpha^2 \Delta \tilde t}{R} \right) \tag{70}$$
$$\le \|H\|_2^3 b_{\max}^2 \left( \frac{\alpha^3 \Delta t^3}{4R^2} + \frac{\alpha^2 \Delta t^2}{R} \right), \tag{71}$$

where Equation 69 substitutes the bound in Equation 63.

Sixth part. Now take $\Delta t = \lfloor r/\alpha \rfloor$, which for $r \to 0$ is eventually at most $R/\alpha$ as required. As we sought to prove, the learning-rate-rescaled approximation error of the local drift direction and speed goes to zero:

$$\lim_{r\to 0} \lim_{\alpha\to 0} \frac{1}{r}\|A_{t_0+\lfloor r/\alpha\rfloor} - A'_{t_0+\lfloor r/\alpha\rfloor}\|_2 \le \lim_{r\to 0} \lim_{\alpha\to 0} \frac{1}{r}\|H\|_2^3 b_{\max}^2 \left( \frac{\alpha^3 \lfloor r/\alpha\rfloor^3}{4R^2} + \frac{\alpha^2 \lfloor r/\alpha\rfloor^2}{R} \right) \tag{72}$$
$$= \lim_{r\to 0} \|H\|_2^3 b_{\max}^2 \left( \frac{r^2}{4R^2} + \frac{r}{R} \right) = 0. \tag{73}$$
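The effect captured by Theorem 2 can also be seen numerically. The following minimal sketch (our own illustration, with arbitrary choices of $H$, $A'_{t_0}$, and the noise) runs the exact recurrence and the frozen-$A$ approximation on the same noise sequence and compares the rescaled deviation $\frac{1}{r}\|A - A'\|_2$ for two trajectory lengths $r$:

```python
import numpy as np

n = 4
H = np.diag(np.linspace(0.5, 1.0, n))
A0 = 0.5 * np.eye(n)  # ||A'_{t0}||_2 < 1, as in the bounded-noise setting

def rescaled_deviation(alpha, r, seed=0):
    """Run both recurrences for floor(r/alpha) steps; return (1/r)*||A - A'||_2."""
    rng = np.random.default_rng(seed)
    A, Ap = A0.copy(), A0.copy()
    b = bp = np.zeros(n)
    for _ in range(int(r / alpha)):
        s = rng.normal(size=n)
        b_new = A @ b - s            # exact: b_{t+1} = A_t b_t - s_t
        bp_new = A0 @ bp - s         # approx: b'_{t+1} = A'_{t0} b'_t - s_t
        A = A - alpha * H @ np.outer(b_new, b) @ H @ H
        Ap = Ap - alpha * H @ np.outer(bp_new, bp) @ H @ H
        b, bp = b_new, bp_new
    return np.linalg.norm(A - Ap, 2) / r

# The rescaled deviation shrinks with the horizon r, consistent with the
# theorem's claim that it vanishes as alpha -> 0 and then r -> 0.
d_short = rescaled_deviation(1e-3, 0.02)
d_long = rescaled_deviation(1e-3, 0.5)
print(d_short, d_long)
```

Since the deviation grows roughly quadratically in the trajectory length, the short-horizon value should come out much smaller than the long-horizon one.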
B PROOF THAT G(θ_t)H HAS A NEGATIVE EIGENVALUE

This is true because we can substitute $A = G(\theta_t)$ and $B = H$ in the following lemma:

Lemma 1. Let $A$ and $B$ be symmetric, full-rank $n \times n$ matrices. Let $A$ be positive-definite and let $B$ have at least one negative eigenvalue. Then the product $AB$ has at least one negative eigenvalue.

Proof. Let $x$ be an eigenvector of $B$ with negative eigenvalue $\lambda$. Then we have

$$\left(A^{-1/2} x\right)^T \left(A^{1/2} B A^{1/2}\right) \left(A^{-1/2} x\right) = x^T B x = \lambda \|x\|_2^2 < 0,$$

meaning that the symmetric matrix $A^{1/2} B A^{1/2}$ must have at least one negative eigenvalue $\lambda'$ with eigenvector $x'$. Then we have the eigenvalue equation

$$AB\, \left(A^{1/2} x'\right) = A^{1/2} \left(A^{1/2} B A^{1/2}\right) x' \tag{75}$$
$$= \lambda' A^{1/2} x',$$

which shows that $AB$ has the negative eigenvalue $\lambda'$.
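Lemma 1 is easy to check numerically. This sketch (our own; the specific matrices are arbitrary choices) builds a positive-definite $A$ and a symmetric $B$ with exactly one negative eigenvalue, and inspects the spectrum of $AB$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)                      # symmetric positive-definite
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
B = Q @ np.diag([-1.0, 0.5, 1.0, 2.0, 3.0]) @ Q.T  # one negative eigenvalue

eigs = np.linalg.eigvals(A @ B)
# AB is similar to the symmetric matrix A^{1/2} B A^{1/2}, so its eigenvalues
# are real, and by Sylvester's law of inertia they have the same signs as B's.
print(sorted(eigs.real))
```

Similarity to $A^{1/2}BA^{1/2}$ also explains why the nonsymmetric product $AB$ nevertheless has a real spectrum.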

C PROOF OF REPRESENTABILITY THEOREM

This section gives a proof of the representability theorem stated in Section 4.2:

Theorem 1. Uniformly sample permutations $P_i$ and create block-diagonal matrices $B(\theta^{(i)})$ where every block is $2 \times 2$, and whose block contents are listed by the parameters $\theta^{(i)}$. Use these to construct the LODO subnetwork $G(\theta)$ as in Equation 1 with some depth $N$ and hidden dimension $\tilde n$. Construct any linear neural network $F$ with input dimension, output dimension, and number of

Second part. Suppose we would like to represent $p$ target permutations using $p$ independently generated $G(\theta)$ networks, each of depth $M$. Equation 80 lower bounds the probability that each network can represent its respective permutation. The probability that all of the $p$ target permutations are accessible by their respective copies of $G(\theta)$ is then, by a union bound over the events that each copy fails to represent its target permutation, at least

$$1 - p \tilde n! \sqrt{\tfrac{1}{2} \epsilon(M \tilde n/2, \tilde n)}. \tag{81}$$

Given that this succeeds, we now need to chain these permutations together with block diagonal matrices to represent arbitrary neural networks $F$. Suppose we are successful in representing any combination of $p$ target permutations using $p$ distinct independently generated $G(\theta)$ networks of depth $M$. Then, since each $G(\theta)$ network's final (leftmost) operation is a block diagonal matrix, applying an additional block diagonal matrix afterward does not affect the set of operations that $G(\theta)$ can represent, since the set of block diagonal matrices we use is closed under matrix multiplication. Importantly, each block diagonal matrix can be used to copy a value from one index into two, or to add two values together into one, creating fanin or fanout in the network. Interleaving $p+1$ chained copies of $G(\theta)$ with left-multiplied block diagonal matrices, for a total depth of $(p+1)M$, therefore creates an aggregate operation which can still be represented by a single $G(\theta)$ network of depth $(p+1)M$.
This aggregate operation has up to $\tilde n$ arbitrary connections from input nodes to output nodes, with either all fanins at most 1 and all fanouts at most $2^p$, or all fanouts at most 1 and all fanins at most $2^p$. This is done by building a forest of fanin/fanout trees, as illustrated on the right side of Figure 5. If we compose such a $G(\theta)$ network of depth $(p+1)M$ with fanout up to $2^p$ together with a $G(\theta)$ network of depth $(p+1)M$ with fanin up to $2^p$, then we may represent any sparse matrix structure with at most $\tilde n$ nonzero weights and maximum fanin and fanout at most $2^p$, using a $G(\theta)$ network of depth $2(p+1)M$. This construction is illustrated on the left side of Figure 5. We may adjust the final (leftmost) block diagonal matrix on the fanout side to change the values of the $\tilde n$ arbitrarily positioned weights in the desired sparse matrix. Therefore, any sparse matrix with at most $\tilde n$ weights and maximum node indegree and outdegree at most $k$ can be represented by a $G(\theta)$ network of depth $2(\lceil \log_2 k \rceil + 1)M$. Then, any linear neural network of depth at most $d$, at most $\tilde n$ weights per layer, and maximum fanin and fanout at most $k$ can be represented by a $G(\theta)$ network of depth $2Md(\lceil \log_2 k \rceil + 1)$, by composition of sparse matrices. The probability that all of this is successful is merely the probability that all of the $p = 2Md(\lceil \log_2 k \rceil + 1)$ permutations can be represented, which by Equation 81 is at least

$$1 - 2 \tilde n! M d(\lceil \log_2 k \rceil + 1) \sqrt{\tfrac{1}{2} \epsilon(M \tilde n/2, \tilde n)}.$$

Thus, in summary: if we randomly generate a $G(\theta)$ network of depth $N = 2Md(\lceil \log_2 k \rceil + 1)$ for fixed constants $k$, $d$, and $M$, with $\tilde n$ neurons per layer and block size $f = 2$, then there is a probability of at least

$$1 - \tilde n! N \sqrt{\tfrac{1}{2} \epsilon(M \tilde n/2, \tilde n)} = 1 - \tilde n! N \sqrt{\tfrac{1}{2} \epsilon\!\left(\frac{\tilde n N}{4 d(\lceil \log_2 k \rceil + 1)}, \tilde n\right)}$$

that $G(\theta)$ can represent every possible linear neural network of input and output dimension at most $\tilde n$, depth $d$, at most $\tilde n$ nonzero weights per layer, and maximum fanin/fanout at most $k$.
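The first-part mechanism, in which layers of randomly paired indices with optional transpositions eventually reach every permutation, can be demonstrated on a toy scale. This sketch (our own, with $\tilde n = 4$, far smaller than the construction in the proof) grows the reachable set layer by layer until it covers all of $S_4$:

```python
import itertools
import random

# Each layer randomly pairs up the 4 indices into 2 pairs; each pair may be
# swapped or left alone (the two options per block in the proof).
random.seed(0)
n = 4
target = set(itertools.permutations(range(n)))
reachable = {tuple(range(n))}
layers = 0
while reachable != target and layers < 200:
    idx = list(range(n))
    random.shuffle(idx)
    pairs = [(idx[0], idx[1]), (idx[2], idx[3])]  # one random pairing per layer
    new = set(reachable)  # "no swap" is a choice, so reachability only grows
    for perm in reachable:
        for choice in itertools.product([False, True], repeat=len(pairs)):
            p = list(perm)
            for (i, j), swap in zip(pairs, choice):
                if swap:
                    p[i], p[j] = p[j], p[i]
            new.add(tuple(p))
    reachable = new
    layers += 1
print(layers, len(reachable))  # reaches all 4! = 24 permutations
```

Because the identity is always among the per-layer choices, the reachable set is monotone in depth, mirroring the entropy argument in the proof.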
D HESSIAN LEARNING LOCAL MINIMA (LODO CAN GENERALIZE)

Laurent & von Brecht (2018) give a theorem showing that for dense linear neural networks on convex losses, where each layer is at least as wide as the input or output layer, all local minima in the neural network weights are also global minima. If we simplify LODO by approximating the inverse Hessian with a full dense matrix $G(\theta_t)$, then this theorem applies to the Hessian approximation error $\|I - G(\theta_t)H\|_F^2$, for which the rescaled error $\|BD^{-1}\|_F^2$ used in Section 4.1 is a proxy. Thus we may expect that any inverse Hessian approximation which LODO could converge to is of similarly high quality to the best possible inverse Hessian approximation.

E HYPERPARAMETERS

In every experiment, we tuned the hyperparameters of each optimizer using a genetic algorithm of 10 generations and 32 individuals per generation. Each hyperparameter was rescaled using $x \to \ln x$ if the hyperparameter was a learning rate, and $x \to 1 - \ln(1 - x)$ if the hyperparameter was a decay parameter, so that the genetic algorithm would operate in a better-behaved hyperparameter space. Starting from the default hyperparameters, Gaussian noise was added to each generation's mean hyperparameters to create mutated hyperparameters for each individual, where the standard deviation of the noise was generation-dependent and followed a specific schedule. Each individual performed a generation-dependent number of steps of optimization, also according to a schedule. The next generation's mean hyperparameters were chosen to be the mean hyperparameters of the better-performing half of the previous generation, as judged by average training loss during the last 10% of training. We also halved all the learning rates (equivalently, initial learning rates for LODO versions) after tuning for the image generation task, because tests showed the loss diverging when training time horizons were longer than 8k steps. Since LODO is a randomized optimizer, we used a different randomization seed for every individual. Table 3 lists the parameters of the tuning schedule for every task. The tuned hyperparameters can be found in Table 4.
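The search loop can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are ours, the mutation scale is fixed rather than following the generation-dependent schedules of Table 3, and `evaluate` stands in for running the optimizer and averaging the training loss over the last 10% of steps:

```python
import math
import random

def tune(defaults, evaluate, generations=10, population=32, sigma=0.5):
    """Genetic search over rescaled hyperparameters (hypothetical helper).

    defaults: dict mapping name -> (value, kind), kind in {"lr", "decay"}.
    evaluate: maps a hyperparameter dict to a scalar loss (lower is better).
    """
    # Rescale into a better-behaved space: x -> ln x for learning rates,
    # x -> 1 - ln(1 - x) for decay parameters.
    def to_space(v, kind):
        return math.log(v) if kind == "lr" else 1 - math.log(1 - v)

    def from_space(z, kind):
        return math.exp(z) if kind == "lr" else 1 - math.exp(1 - z)

    kinds = {k: kind for k, (_, kind) in defaults.items()}
    mean = {k: to_space(v, kind) for k, (v, kind) in defaults.items()}
    for _ in range(generations):
        # Mutate the current mean with Gaussian noise to form each individual.
        pop = [{k: m + random.gauss(0, sigma) for k, m in mean.items()}
               for _ in range(population)]
        ranked = sorted(pop, key=lambda ind: evaluate(
            {k: from_space(z, kinds[k]) for k, z in ind.items()}))
        best = ranked[:population // 2]  # better-performing half survives
        mean = {k: sum(ind[k] for ind in best) / len(best) for k in mean}
    return {k: from_space(z, kinds[k]) for k, z in mean.items()}
```

As a usage example, tuning a single "learning rate" against a toy loss whose optimum sits at 0.01 pulls the mean toward that value over the 10 generations.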

F TASK SETUP DETAILS

F.1 NOISY QUADRATIC BOWL TASK DETAILS

This section fully explains the setup of the noisy quadratic bowl task of Section 5.1. This 100-parameter task consists of a quadratic bowl for its loss landscape. Using a random uniform orthogonal matrix $U$ and a diagonal matrix $D$ consisting of a geometric sequence of 100 values starting at 0.001 and ending at 1, the Hessian of the quadratic bowl is set to $H = UDU^T$, and the center is set to the origin. However, whenever the loss and gradient are evaluated, the center of the quadratic bowl is perturbed by an i.i.d. random standard normal offset in each dimension. The initialization for this task is set to the origin. Due to the random wandering of the center, the expected loss rises linearly over time, unless the optimizer acts to prevent this, driving the error towards a steady state distribution. The expected loss after infinitely many steps can then be taken as a measure of the quality of an optimizer. The optimal solution for this task is to jump to the currently observed minimum of the bowl at every step. Due to the movement of the minimum between steps and loss evaluations, we should still expect this strategy to achieve a nonzero loss, which can be analytically calculated to be 7.412. Table 5 shows various optimizers' performances after having taken many steps.
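The task and the 7.412 figure are easy to reproduce. Since the oracle strategy leaves only one fresh standard normal perturbation $s$ of error per evaluation, its expected loss is $\mathbb{E}[\frac{1}{2} s^T H s] = \frac{1}{2}\operatorname{tr}(H) = \frac{1}{2}\sum_i D_{ii}$. A minimal sketch (our own; the seed and step count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
d = np.geomspace(0.001, 1.0, n)              # geometric sequence of eigenvalues
U, _ = np.linalg.qr(rng.normal(size=(n, n))) # random orthogonal matrix
H = U @ np.diag(d) @ U.T                     # Hessian H = U D U^T

# Analytic value of the oracle's expected loss: 0.5 * tr(H) = 0.5 * sum(d).
print(0.5 * d.sum())  # about 7.412

center, x = np.zeros(n), np.zeros(n)
losses = []
for t in range(2000):
    center = center + rng.normal(size=n)     # the minimum wanders each step
    b = x - center
    losses.append(0.5 * b @ H @ b)           # loss at evaluation time
    x = center.copy()                        # oracle jump to the observed minimum
print(np.mean(losses))                       # hovers near 0.5 * tr(H)
```

The trace is invariant under the orthogonal conjugation, which is why the answer depends only on the eigenvalue sequence and not on the draw of $U$.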

F.2 CNN IMAGE GENERATION TASK DETAILS

This section explains the CNN image generation task of Section 5.2, which is similar to the task of training a PixelCNN (Oord et al., 2016). Like PixelCNN, our CNN generates pixels row by row and column by column, and classifies the brightness of each pixel into 256 classes with crossentropy loss. Our data preprocessing is as follows. An MNIST image is randomly selected, and one pixel location is chosen uniformly at random and blackened. All pixels below, or to the right of and in the same row as, the selected pixel are blackened. The input into the CNN consists of this partially masked/blackened image (divided by 256 for normalization), an image indicating which pixels are specifically masked/blackened (indicated by -1 and 1), another image indicating which pixels are left of the selected pixel (indicated by -1 and 1), an image of a linear gradient from -1 to 1 in the x direction, and the same for the y direction. The last three images are present purely to break horizontal and vertical translation symmetries in the data, which has been shown to be helpful in vision tasks involving collection of information from specific locations of an image specified by the data (Liu et al., 2018). The preprocessed data is visualized in Figure 6. The CNN architecture we use for autoregression consists of:

• 5 input channels as previously described.
• Residual connection to Point A.
  - One pixel of zero padding, and convolution with 3 by 3 filters to 20 channels, with no activation.
• Point A.
• The following indented items are repeated 5 times.
  - The following indented items are repeated 4 times.
    * Residual connection to Point B.
    * Convolution with 1 by 1 filters to 40 channels, with arctangent activation. We use the arctangent to mitigate gradient norm issues.
    * One pixel of zero padding, and three different depthwise convolutions concatenated together, with 3 by 3 filters to 120 channels, with arctangent activation.
• Convolution with 1 by 1 filters to 256 channels, with softmax activation.

The loss is then the average crossentropy between the true brightness of the query pixel and the distribution given by the softmax output of this CNN, over a batch size of 256 images. The whole CNN has about 94696 parameters, which are initialized with LeCun normal initialization (Klambauer et al., 2017). The task for the optimizer is to train these parameters to minimize this loss. Figure 7 shows some imitation MNIST images generated using this CNN by sampling pixels one by one.

Table 5: Average tracking error of the quadratic bowl minimum on the noisy quadratic bowl task of Section 5.1. Values are averaged over the last 10% of training before the stated training milestone, with means and standard deviations produced across 8 optimization runs with different randomization seeds. The theoretical best possible loss using Newton's method is also listed. The version of L-BFGS is one with stochastic modifications from (Schraudolph et al., 2007), instead of the original from (Nocedal & Wright, 1999). Our timing setup is described in Appendix H.

RMSprop (Hinton et al., 2012): 22.37 ± 0.06, 22.47 ± 0.13
Yogi (Zaheer et al., 2018): 15.35 ± 0.10, 15.16 ± 0.05
L-BFGS (Schraudolph et al., 2007): 49.30 ± 1.04, 40.60 ± 1.40
O-LBFGS (Schraudolph et al., 2007): 13.96 ± 0.08, 10.80 ± 0.17

G ROSENBROCK FUNCTION MINIMIZATION EXPERIMENT

We probe the behavior of LODO with a small test task: finding the minimum of a rescaled Rosenbrock function $f(x, y) = 0.01(x - 1)^2 + (x^2 - y)^2$, which has no local minima and one global minimum at $(x, y) = (1, 1)$. We initialized the optimizers at $(x, y) = (-0.5, 2)$ and gave them 200 steps to run. The trajectory taken by LODO, shown in Figure 9, is similar to the short-timescale dynamics of other optimizers using momentum, in that it tends to overshoot and then correct itself in an oscillatory manner. Learning curves in Figure 8 and losses in Table 6 show the performance of all the optimizers on this task.

H TIMING SETUP

This section describes how we timed each optimizer to report performance at specified times and steps-per-second training speeds. We performed all optimization runs in TensorFlow 2, each with 40 Intel Xeon Gold 6248 CPUs and 2 Nvidia Volta V100 GPUs. The reported time includes all training time (forward and backward propagation, optimizer computation, data loading, preprocessing, etc.), but excludes the time taken to evaluate metrics such as the Hessian approximation error and the validation and test losses and accuracies.
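Appendix G's rescaled Rosenbrock objective can be reproduced in a few lines. The following is our own plain gradient-descent baseline for orientation, not LODO or any optimizer from the paper; the learning rate and step count are arbitrary choices:

```python
# Rescaled Rosenbrock function: f(x, y) = 0.01 (x - 1)^2 + (x^2 - y)^2,
# global minimum at (1, 1), initialization (-0.5, 2) as in the experiment.
def f(x, y):
    return 0.01 * (x - 1) ** 2 + (x ** 2 - y) ** 2

def grad(x, y):
    # Partial derivatives of f with respect to x and y.
    return (0.02 * (x - 1) + 4 * x * (x ** 2 - y), -2 * (x ** 2 - y))

x, y = -0.5, 2.0
lr = 0.05
for _ in range(50000):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
print(x, y, f(x, y))  # slowly approaches the global minimum at (1, 1)
```

The 0.01 factor flattens the valley floor, which is why plain gradient descent needs far more than the 200 steps granted in the experiment; the oscillatory overshoot described for LODO happens across this same narrow valley.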



We think of this phenomenon as similar to the adiabatic approximation of quantum mechanics: $A_t$, the fixed distribution of $b_t$, and the actual distribution of $b_t$ play the roles of the Hamiltonian, the ground state, and the actual state, respectively.



Figure 1: Visualization of LODO's matrix structure of G(θ) from Equation (1) for approximation of the inverse Hessian. Reduced to a depth of 3, block size 2, and matrix size ñ = 8 for illustration.

Figure 2: Left: LODO's average inverse Hessian approximation error $\sigma^2 = \|I - G(\theta_t)H\|_F^2 / n$ on the noisy quadratic bowl task of Section 5.1. $\sigma^2$ is measured by the unbiased estimator

Figure 5: In this diagram, ñ = 8 and p = 2, for illustrative purposes. Left: visual depiction of how a fanout forest followed by a fanin forest can represent arbitrary sparse connections. Right: visual depiction of how p + 1 chained copies of G(θ) networks, left-multiplied by block diagonal matrices, can manifest a forest of fanout trees. The $G_i$ are permutations implemented by G(θ) networks allowing for arbitrary connections, and the $B_i$ are block diagonal matrices which create the necessary fanouts. Data flows from left to right in the illustration, though successive operations are written out right to left in mathematical notation. The largest fanout in this pattern of connections is 3, which is less than $2^p = 4$; all the fanins are 1.

Figure 6: Batch of 5 tensors of preprocessed MNIST images from the image generation task of Section 5.2, each as one of the 5 rows of the image. Shown in each of 5 columns are the masked input, the left-of-selected-pixel indicator, visible mask, and two gradient images. The images in each row are concatenated together into a 28 × 28 × 5 data tensor and then used as input into the CNN.

Figure 7: Grids of 16 imitated MNIST images generated by the image generation CNN of Section 5.2, trained for 300k steps using RMSprop, Adam, Yogi, and LODO respectively.

Figure 8: Log loss as a function of step, when using various optimizers on the Rosenbrock function minimization task.

Figure 9: 100 step trajectories of various optimizers on the Rosenbrock function minimization task. The red star marks the initialization and the green star marks the location of the global minimum.

Negative log likelihoods in nats per pixel after training for 300k steps on the MNIST image generation task of Section 5.2 with every optimizer. Values are averaged over the last 10% of training before the stated training milestone, with means and standard deviations produced across 8 optimization runs with different randomization seeds. The top 3 optimizers are underlined for each metric. Our timing setup is described in Appendix H.

Average training loss learning curves over 8 training runs on the MNIST image generation task of Section 5.2. Learning curves are smoothed by averaging over blocks of 600 steps each. Error margins indicate ±1 standard deviation between the performances of the 8 runs at every step. Left: by step. Middle: by time. Our timing setup is described in Appendix H. Right: validation loss by step, using a subset of 64 images excluded from the training data. Each image provides 784 pixel colors to predict, so the validation dataset effectively consists of 50176 samples.

Negative log likelihoods in nats per pixel after training for 300k steps on the MNIST image generation task of Section 5.2 with ablated versions of LODO from Section 5.2. Values are averaged over the last 10% of training, with means and standard deviations produced across 8 runs with different randomization seeds. Our timing setup is described in Appendix H. *Setup is similar to (Amid et al., 2022). **(Baydin et al., 2017).



We will work in the bounded noise case $\|s_t\|_2 < \infty$, where $\|b'_t\|_2$ is upper bounded by some $\|A'_{t_0}\|_2$-dependent constant $b_{\max}$ due to exponential decay in Equation

Noise and step number schedules for tuning the optimizers' hyperparameters using the genetic algorithm presented in Appendix E

• Convolution with 1 by 1 filters to 20 channels, with no activation.

Hyperparameters used for the experiments in Section 5, after tuning hyperparameters with a genetic algorithm as in Appendix E and halving the learning rates (equiv. initial learning rates for LODO variants) for the image generation task. β and β 1 generally represent momentum decay rates while β 2 represents the variance EMA decay rate. Dashes indicate that the optimizer was not used for that experiment.

• Average pooling of 2 by 2 blocks to halve the image size, with a pixel of zero padding beforehand if the image size is odd.

Mean losses between steps 180-200 while training on the Rosenbrock function minimization task, with various optimizers. The standard deviation is calculated over 8 test runs of LODO because its G neural network architecture is randomly generated.


weights per layer at most $\tilde n$, at most $k$ incoming and at most $k$ outgoing weights for every neuron, depth $d$, and any arrangement of weights. Then there is a probability of at least

$$1 - \tilde n! N \sqrt{\tfrac{1}{2} \epsilon\!\left(\frac{\tilde n N}{4 d(\lceil \log_2 k \rceil + 1)}, \tilde n\right)}$$

that we can make $G(\theta) = F$ for some $\theta$.

Proof. Our result comes in two parts: the first shows that $G(\theta)$ can represent arbitrary permutations, and the second shows that we can pick these $G(\theta)$ permutations and interleave them with block diagonal matrices to create a deeper $G(\theta)$ network which manifests any desired neural network. We use the terms fanin and fanout to mean the number of incoming and outgoing weights into and out of a neuron, respectively.

First part. Assume we would like to apply a given target permutation to $\tilde n$ elements using $G(\theta) = \prod_{i=1}^N B(\theta^{(i)}) P_i$, consisting of $N$ layers with randomly chosen permutations $P_i$. Our goal is to perform the target permutation given the $P_i$ by controlling the block diagonal matrices $B(\theta^{(i)})$. The form of $G(\theta)$ from Section 3 is

$$G(\theta) = \prod_{i=1}^N B(\theta^{(i)}) P_i, \tag{76}$$

which can be rewritten as

$$G(\theta) = Q_1 \prod_{i=1}^N Q_i^{-1} B(\theta^{(i)}) Q_i, \tag{77}$$

with random independent uniform permutations $Q_i = P_i P_{i+1} \cdots P_N$ instead. For each block in the matrix $B(\theta^{(i)})$, we may restrict ourselves to two options: to swap or not to swap the pair of indices. The conjugation by the random permutation $Q_i$ then shuffles these pairings, such that applying $G(\theta)$ is equivalent to repeatedly randomly pairing up indices for optional transposition, and then applying a final permutation $Q_1$. We will work under this new equivalent formulation, since it is more amenable to analysis.

Let us choose to apply each transposition with probability 1/2. Then the expected entropy of $G(\theta)$ given the whole sequence of pairings is at least $\log \tilde n! - \epsilon(N\tilde n/2, \tilde n)$ under Definition 1. In other words, the expected KL divergence of this distribution of $G(\theta)$ from the uniform distribution is at most $\epsilon(N\tilde n/2, \tilde n)$.
Then by Pinsker's inequality, the expected total variation distance from the uniform distribution is at most

$$\sqrt{\tfrac{1}{2} \epsilon(N \tilde n/2, \tilde n)}. \tag{78}$$

This guarantees that, in expectation, at most

$$\tilde n! \sqrt{\tfrac{1}{2} \epsilon(N \tilde n/2, \tilde n)} \tag{79}$$

possible target permutations have a probability density of zero, which is then an upper bound on the number of inaccessible target permutations. This means the probability that all target permutations are accessible is then at least

$$1 - \tilde n! \sqrt{\tfrac{1}{2} \epsilon(N \tilde n/2, \tilde n)}. \tag{80}$$

Note that the leftmost $Q_1$ in Equation 77 merely introduces a bijection between target permutations, and so does not change how many are accessible.

