LEARNING TO OPTIMIZE QUASI-NEWTON METHODS

Abstract

We introduce a novel machine learning optimizer called LODO, which meta-learns online an implicit inverse Hessian of the loss as a subroutine of quasi-Newton optimization. Our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn neural representations of symmetric matrix-vector products, which are more flexible than those in other quasi-Newton methods. Unlike other L2O methods, ours requires no meta-training on a training task distribution; instead, it learns to optimize on the fly while optimizing the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify our algorithm's performance in the presence of noise, and show that simpler alternatives for representing the inverse Hessian worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters, and obtain results competitive with standard neural network optimizers.

1. INTRODUCTION

Many optimization algorithms, such as stochastic gradient descent (SGD) (Rosenblatt, 1958) and Adam (Kingma & Ba, 2014), are widespread and successful in the rapid training of deep neural networks (Sun et al., 2019). Fundamentally, training is a problem of minimizing a loss which is a function of a large vector containing the weights of the network. The time it takes to optimize a neural network is a bottleneck in machine learning: the more quickly a network can be trained, the more computational resources are saved, so researchers have devoted great effort to creating new, faster optimizers (Jain & Kar, 2017; Metz et al., 2020; Bernstein et al., 2020; Martens & Grosse, 2015a). We present a novel algorithm drawing from the field of learning to optimize (L2O), spearheaded by Li & Malik (2016) and Andrychowicz et al. (2016). Namely, we use a meta-optimizer to learn, online, an implicit representation of the local inverse Hessian, which is used in a quasi-Newton method, without any L2O meta-training time on a training task distribution. Unlike other L2O algorithms, which learn to optimize before optimization (Chen et al., 2021), our algorithm Learns to Optimize During Optimization (LODO). We intend for LODO to be trained from scratch for each use case and then discarded. This way, LODO learns local features of the loss landscape at a specific point in training for a specific task, instead of only characteristics shared throughout training trajectories for a set of training tasks. Our work targets the Hessian, which varies with both the task and the point along the trajectory. Our use of linear neural networks imports the efficiency of the Newton method into our algorithm, while our use of a meta-optimizer, as in L2O, allows us to learn more powerful and general parameterizations of optimizers. Our contributions are as follows.
We show theoretically and experimentally that a simplified version of LODO correctly learns the inverse Hessian in a stochastic convex setting. We show theoretically that LODO's inverse Hessian representation is highly expressive, and experimentally that simpler alternatives perform worse. We finally demonstrate the use of LODO in a semi-realistic vision task. This paper serves as a stepping stone in the development of meta-training-free online L2O. The remainder of this paper is structured as follows. Section 2 discusses relevant background and contributions in optimization and L2O. Section 3 shows how LODO works. Section 4 presents our theoretical and experimental results.

2. BACKGROUND

Research into the construction of faster optimizers has mostly fallen under two branches of work. The older branch attempts to endow SGD with adaptive capabilities, often through modifications involving calculation of the first and/or second moments of the gradient (mean and variance) using exponential moving averages (EMAs). RMSprop (Hinton et al., 2012) and Adam use the variance to normalize the step size and the mean to induce momentum. LARS (You et al., 2017) and Yogi (Zaheer et al., 2018) both modify the variance calculation, but for different reasons: to normalize layer-wise, and to control increases in the effective learning rate more gradually, respectively. The other branch consists of second-order methods such as the Newton method and natural gradient descent (Martens & Grosse, 2015b; George et al., 2018), which precondition the step with adaptive estimates of the inverse Hessian and the inverse Fisher information matrix, respectively. The Newton method converges quickly but is vulnerable to gradient noise and impractical to implement due to the resources spent calculating and/or inverting the high-dimensional Hessian.
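The EMA-based moment estimates used by RMSprop and Adam can be made concrete with a minimal sketch. The following toy example runs an Adam-style update on a simple ill-conditioned quadratic; it is an illustration of the moment mechanics only, not a substitute for a library implementation, and all variable names and hyperparameters are chosen for the example.

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update: an EMA of the gradient mean (momentum)
    and of its elementwise variance (per-coordinate step normalization)."""
    m = b1 * m + (1 - b1) * g          # first-moment EMA
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m_hat = m / (1 - b1 ** t)          # bias correction for EMA warm-up
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

H = np.diag([1.0, 10.0])               # ill-conditioned quadratic loss
x = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
losses = [0.5 * x @ H @ x]
for t in range(1, 2001):
    g = H @ x                          # gradient of f(x) = 0.5 x^T H x
    x, m, v = adam_step(x, g, m, v, t)
    losses.append(0.5 * x @ H @ x)
print(f"final loss: {losses[-1]:.6f}")
```

Note how the variance EMA normalizes the step per coordinate, so the stiff direction (curvature 10) and the shallow direction (curvature 1) make similar progress despite their different gradient scales.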
Many researchers have developed approximations, called quasi-Newton methods, which reduce the Newton method's time and memory complexity, such as L-BFGS (Nocedal & Wright, 1999) and variants (Schraudolph et al., 2007; Parker-Holder et al., 2020; Goldfarb et al., 2020; Park & Oliva, 2019) better suited to the stochasticity and structure present in machine learning. The methods most related to our work are hypergradient methods, which learn online a low-rank (Moskovitz et al., 2019), diagonal (Amid et al., 2022; Baydin et al., 2017), or Kronecker-factorized (Bae et al., 2022) preconditioner matrix to transform the gradient when choosing the step. We improve on these methods by using a more expressive class of preconditioners. More recently, a subfield of meta-learning known as learning to optimize (L2O) has shown that deep networks can themselves be trained to perform optimization, at a speed which exceeds that of popular traditional optimizers. The aim of this effort is to leverage deep neural networks to learn faster optimizers, in hopes of further accelerating training procedures for other deep neural networks. Li & Malik (2016; 2017) and Andrychowicz et al. (2016) were among the first to successfully use backpropagation to train neural networks to map gradients to steps. Since then, many other variations of this idea have produced optimizers exceeding the speed of common optimizers for narrow ranges of machine learning models (Metz et al., 2018), though theoretical analysis of these learned optimizers tends to be difficult and scarce. A major goal of L2O research is to learn a single optimizer which generalizes well enough to train a wide variety of machine learning models with speed (Lv et al., 2017). Two further issues prevent L2O optimizers from being rapidly developed experimentally.
Firstly, meta-learning an L2O optimizer requires a carefully chosen "task distribution" for the optimizer to practice on, playing a role analogous to that of a dataset. These tasks are difficult to curate because the issue of generalization error applies: we want the test task to be similar to the task distribution. Secondly, this meta-learning of the L2O optimizer is prohibitively costly, in that it involves nested training loops, where the inner loop takes a large amount of time and memory to evaluate and backpropagate through (Metz et al., 2019). Altogether, the choice of task distribution and lengthy meta-training have been a necessary burden in L2O, and we overcome both with LODO.
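As a concrete illustration of the hypergradient preconditioner methods discussed above, the sketch below learns a diagonal preconditioner online on a toy stochastic quadratic. It is a simplified stand-in under strong assumptions (a fixed diagonal Hessian, fresh samples each step, hand-derived hypergradient), not the algorithm of any cited work; in this setting the learned diagonal approaches the inverse of the Hessian diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 2.0, 4.0])    # diagonal Hessian of f(x) = 0.5 x^T H x
p = np.full(3, 0.05)            # diagonal preconditioner, learned online
beta = 1e-3                     # meta learning rate for the hypergradient step

for _ in range(10000):
    x = rng.standard_normal(3)  # fresh sample: a stochastic quadratic task
    g = H @ x                   # gradient at x
    x_new = x - p * g           # preconditioned gradient step
    g_new = H @ x_new           # gradient after the step
    # Since d x_new[i] / d p[i] = -g[i], the hypergradient of the new loss is
    # d f(x_new) / d p[i] = -g_new[i] * g[i]; gradient descent on p gives:
    p = p + beta * g_new * g

print(np.round(p, 3))           # should approach 1 / diag(H)
```

The fixed point is exactly Newton's preconditioner for this loss: once p[i] = 1/H[i, i], the post-step gradient g_new vanishes and p stops changing.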

3. HOW LODO WORKS

In a quasi-Newton method, the approximate solution $x_t \in \mathbb{R}^n$ is refined by $x_{t+1} = x_t - \alpha G_t g_t$ for some learning rate $\alpha > 0$, where $G_t \approx (\nabla^2_{x_t} f(x_t))^{-1} \in \mathbb{R}^{n \times n}$ is some approximation of the inverse Hessian and $g_t = \nabla_{x_t} f(x_t) \in \mathbb{R}^n$ is the gradient computed by backpropagation through the task $f$. $\alpha = 1$ produces the exact solution if $f$ is quadratic, so we set $\alpha = 1$. Our algorithm approximates the inverse Hessian using a matrix $G(\theta_t) \in \mathbb{R}^{n \times n}$ parameterized by a vector $\theta_t$ of weights learned over time $t$, described later in this section. After every step $t \leftarrow t + 1$ using the formula $x_{t+1} = x_t - G(\theta_t) g_t$, the loss $f(x_{t+1})$ is computed. Then the new gradient $\nabla_{x_{t+1}} f(x_{t+1})$ at $x_{t+1}$ is computed through backpropagation as usual, but we continue backpropagation into the step-choosing process until we find the "hypergradient" $\nabla_{\theta_t} f(x_{t+1})$ in the optimizer weights $\theta_t$,
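The update above can be sketched under strong simplifying assumptions: the loss is a fixed known quadratic, the learned inverse Hessian is a dense symmetric matrix G rather than LODO's neural parameterization of matrix-vector products, iterates are resampled each step to mimic the stochastic convex setting, and the hypergradient is derived analytically in place of backpropagation. All names and hyperparameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # fixed SPD Hessian of f(x) = 0.5 x^T H x
n = H.shape[0]
G = 0.05 * np.eye(n)            # learned inverse-Hessian approximation
beta = 1e-3                     # meta learning rate

for _ in range(20000):
    x = rng.standard_normal(n)
    g = H @ x                   # g_t, the gradient at x_t
    x_new = x - G @ g           # quasi-Newton step with alpha = 1
    g_new = H @ x_new           # gradient at x_{t+1}
    # Since d x_new / d G[i, j] = -e_i * g[j], the hypergradient of the new
    # loss w.r.t. G is -outer(g_new, g); descending it (and symmetrizing,
    # since the inverse Hessian is symmetric) gives:
    update = beta * np.outer(g_new, g)
    G = G + 0.5 * (update + update.T)

print(np.round(G, 3))
print(np.round(np.linalg.inv(H), 3))
```

In this toy setting the meta-update drives G toward the true inverse Hessian: at G = H^{-1} the post-step gradient g_new is identically zero, so the hypergradient vanishes and the fixed point is stable.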

