LEARNING TO OPTIMIZE QUASI-NEWTON METHODS

Abstract

We introduce a novel machine learning optimizer called LODO, which online meta-learns an implicit inverse Hessian of the loss as a subroutine of quasi-Newton optimization. Our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn neural representations of symmetric matrix vector products, which are more flexible than those in other quasi-Newton methods. Unlike other L2O methods, ours does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify our algorithm's performance in the presence of noise, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters, and obtain competitive results against standard neural network optimizers.

1. INTRODUCTION

Many optimization algorithms, such as stochastic gradient descent (SGD) (Rosenblatt, 1958) and Adam (Kingma & Ba, 2014), have been widespread and successful in the rapid training of deep neural networks (Sun et al., 2019). Fundamentally, training is a problem of minimizing a loss that is a function of a large vector containing the weights of the network. The time it takes to optimize a neural network is a bottleneck in machine learning: the more quickly a network can be trained, the more computational resources are saved. Researchers have therefore devoted great effort to creating new, faster optimizers (Jain & Kar, 2017; Metz et al., 2020; Bernstein et al., 2020; Martens & Grosse, 2015a).

We present a novel algorithm drawing from the field of learning to optimize (L2O), spearheaded by Li & Malik (2016) and Andrychowicz et al. (2016). Namely, we use a meta-optimizer to learn online an implicit representation of the local inverse Hessian, which is used in a quasi-Newton method, without any L2O meta-training time on a training task distribution. Unlike other L2O algorithms, which learn to optimize before optimization (Chen et al., 2021), our algorithm Learns to Optimize During Optimization (LODO). We intend for LODO to be trained from scratch for each use case and then discarded. This way, LODO learns local features of the loss landscape at a specific point in training for a specific task, instead of only characteristics shared throughout training trajectories for a set of training tasks. Our work targets the Hessian, which varies with both the task and the point along the trajectory. Our use of linear neural networks is what imports the efficiency of the Newton method to our algorithm, while our use of a meta-optimizer, as in L2O, is what allows us to learn more powerful and general parameterizations of optimizers.

Our contributions are as follows.
We show theoretically and experimentally that a simplified version of LODO correctly learns the inverse Hessian in a stochastic convex setting. We show theoretically that LODO's inverse Hessian representation is highly expressive, and experimentally that simpler alternatives perform worse. Finally, we demonstrate the use of LODO in a semi-realistic vision task. This paper serves as a stepping stone in the development of meta-training-free online L2O.

The remainder of this paper is structured as follows. Section 2 discusses relevant background and contributions in optimization and L2O. Section 3 shows how LODO works. Section 4 provides

