REVERSE ENGINEERING LEARNED OPTIMIZERS REVEALS KNOWN AND NOVEL MECHANISMS

Abstract

Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a mystery. How is a learned optimizer able to outperform a well-tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions by careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on three disparate tasks, and discover that they have learned interpretable mechanisms, including: momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation. Moreover, we show how the dynamics of learned optimizers enable these behaviors. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.

1. INTRODUCTION

Optimization algorithms underlie nearly all of modern machine learning. A recent thread of research focuses on learning optimization algorithms, by directly parameterizing and training an optimizer on a distribution of tasks. These so-called learned optimizers have been shown to outperform baseline optimizers in restricted settings (Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017; Bello et al., 2017; Li & Malik, 2016; Metz et al., 2019; 2020). Despite improvements in the design, training, and performance of learned optimizers, fundamental questions remain about their behavior. We understand remarkably little about how these systems work. Are learned optimizers simply learning a clever combination of known techniques? Or do they learn fundamentally new behaviors that have not yet been proposed in the optimization literature? If they did learn a new optimization technique, how would we know?

Contrast this with existing "hand-designed" optimizers such as momentum (Polyak, 1964), AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), or Adam (Kingma & Ba, 2014). These algorithms are motivated and analyzed via intuitive mechanisms and theoretical principles (such as accumulating update velocity in momentum, or rescaling updates based on gradient magnitudes in RMSProp or Adam). This understanding of underlying mechanisms allows future studies to build on these techniques by highlighting flaws in their operation (Loshchilov & Hutter, 2018), studying convergence (Reddi et al., 2019), and developing deeper knowledge about why key mechanisms work (Zhang et al., 2020). Without an analogous understanding of the inner workings of a learned optimizer, it is incredibly difficult to analyze or synthesize their behavior.

In this work, we develop tools for isolating and elucidating mechanisms in nonlinear, high-dimensional learned optimization algorithms (§3).
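To make the "rescaling" mechanism mentioned above concrete, the RMSProp update can be written in a few lines. This is the standard textbook form of the rule, not anything specific to this paper; the hyperparameter defaults are illustrative, not tuned values.

```python
import numpy as np

def rmsprop_step(x, avg_sq, g, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update: rescale the step by a running average of
    squared gradient magnitudes. Hyperparameter defaults are illustrative."""
    avg_sq_next = gamma * avg_sq + (1 - gamma) * g ** 2  # running second moment
    x_next = x - lr * g / (np.sqrt(avg_sq_next) + eps)   # rescaled gradient step
    return x_next, avg_sq_next
```

The division by the running root-mean-square of the gradient is the key mechanism: components with persistently large gradients take proportionally smaller steps.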
Using these methods we show how learned optimizers utilize both known and novel techniques, across three disparate tasks. In particular, we demonstrate that learned optimizers learn momentum (§4.1), gradient clipping (§4.2), learning rate schedules (§4.3), and a new type of learning rate adaptation (§4.4). Taken together, our work can be seen as part of a new approach to scientifically interpret and understand learned algorithms. We provide code for training and analyzing learned optimizers, as well as the trained weights for the learned optimizers studied here, at redacted URL.

We are interested in optimization problems that minimize a loss function (f) over parameters (x). We focus on first-order optimizers, which at iteration k have access to the gradient g_i^k ≡ ∇f(x_i^k) and produce an update Δx_i^k. These are component-wise optimizers that are applied to each parameter or component (x_i) of the problem in parallel. Standard optimizers used in machine learning (e.g. momentum, Adam) are in this category.¹ Going forward, we use x for the parameter to optimize, g for its gradient, k for the current iteration, and drop the parameter index (i) to reduce excess notation.

An optimizer has two parts: the optimizer state (h), which stores information about the current problem, and readout weights (w), which update parameters given the current state. The optimization algorithm is specified by the initial state, the state transition dynamics, and the readout, defined as follows:

h^{k+1} = F(h^k, g^k),            (1)
x^{k+1} = x^k + w^T h^{k+1},      (2)

where h is the optimizer state, F governs the optimizer state dynamics, and w are the readout weights. Learned optimizers are constructed by parameterizing the function F, and then learning those parameters along with the readout weights through meta-optimization (detailed in Appendix C.2). Hand-designed optimization algorithms, by contrast, specify these functions at the outset.
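One step of any optimizer in this state/readout family can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the gradient-descent instantiation in the usage note is one possible choice of F and w.

```python
import numpy as np

def optimizer_step(x, h, g, F, w):
    """One step of a component-wise first-order optimizer.

    x: current parameter component
    h: optimizer state vector for this component
    g: gradient of the loss w.r.t. x
    F: state-transition function (the state update)
    w: readout weights (the parameter update)
    """
    h_next = F(h, g)         # state transition
    x_next = x + w @ h_next  # readout: linear function of the new state
    return x_next, h_next
```

For example, plain gradient descent corresponds to a one-dimensional state with F(h, g) = −g and readout w equal to the learning rate; momentum and Adam correspond to richer choices of F.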
For example, in momentum, the state is a scalar (known as the velocity) that accumulates a weighted average of recent gradients. For momentum and other hand-designed optimizers, the state variables are low-dimensional, and their dynamics are straightforward. In contrast, learned optimizers have high-dimensional state variables, and the potential for rich, nonlinear dynamics. As these systems learn complex behaviors, it has historically been difficult to extract simple, intuitive descriptions of the behavior of a learned optimizer.

Our work is heavily inspired by recent work using neural networks to parameterize optimizers. Andrychowicz et al. (2016) originally showed promising results on this front, with additional studies improving robustness (Wichrowska et al., 2017; Lv et al., 2017), meta-training (Metz et al., 2019), and generalization (Metz et al., 2020) of learned optimizers. We also build on recent work on reverse engineering dynamical systems. Sussillo & Barak (2013) showed how linear approximations to nonlinear dynamical systems can yield insight into the algorithms used by these networks. More recently, these techniques have been applied to understand trained RNNs in a variety of domains, from natural language processing (Maheswaranathan et al., 2019a; Maheswaranathan & Sussillo, 2020) to neuroscience (Schaeffer et al., 2020). Additional work on treating RNNs as dynamical systems has led to insights into their computational capabilities (Jordan et al., 2019; Krishnamurthy et al., 2020; Can et al., 2020).
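Classical (heavy-ball) momentum fits the state/readout template with a one-dimensional state. The sketch below is illustrative; sign and hyperparameter conventions vary across texts.

```python
def momentum_step(x, v, g, beta=0.9, lr=0.1):
    """Heavy-ball momentum in state/readout form: the state is the scalar
    velocity v with dynamics F(v, g) = beta * v + g, and the readout is
    w = -lr. Conventions here are one common textbook choice."""
    v_next = beta * v + g     # accumulate an exponentially weighted sum of gradients
    x_next = x - lr * v_next  # step along the (negated, scaled) velocity
    return x_next, v_next
```

Setting beta = 0 recovers plain gradient descent, which makes the low-dimensional, transparent nature of this state variable easy to see.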

3.1. TRAINING LEARNED OPTIMIZERS

We parametrize the learned optimizer with a recurrent neural network (RNN), similar to Andrychowicz et al. (2016). Specifically, we use a gated recurrent unit (GRU) (Cho et al., 2014) with 256 units. The only input to the optimizer is the gradient. The RNN is trained by minimizing a meta-objective, which we define as the average training loss when optimizing a target problem. See Appendix C.2 for details about the optimizer architecture and meta-training procedures.

We trained these learned optimizers on each of three tasks. These tasks were selected because they are fast to train (particularly important for meta-optimization) and cover a range of loss surfaces (convex and non-convex, low- and high-dimensional):

Convex, quadratic: The first task consists of random linear regression problems f(x) = (1/2)‖Ax − b‖₂², where A and b are randomly sampled. Much of our theoretical understanding of the behavior of optimization algorithms is derived using quadratic functions, in part because they have a constant

¹ Notable exceptions include quasi-Newton methods such as L-BFGS (Nocedal & Wright, 2006) or K-FAC (Martens & Grosse, 2015).
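The random linear regression task family above can be sketched as follows. The problem dimensions and the Gaussian sampling distribution are illustrative guesses on our part; the exact task specification is given in the paper's appendix.

```python
import numpy as np

def sample_quadratic_task(n=10, seed=None):
    """Sample a random linear regression problem f(x) = 0.5 * ||A x - b||_2^2.

    Dimensions and Gaussian sampling are illustrative assumptions, not the
    paper's exact task distribution."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n))
    b = rng.normal(size=n)
    loss = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad = lambda x: A.T @ (A @ x - b)  # closed-form gradient of f
    return loss, grad
```

Because the loss is quadratic, the gradient is an affine function of x, which is what makes this family analytically tractable for studying optimizer behavior.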

