REVERSE ENGINEERING LEARNED OPTIMIZERS REVEALS KNOWN AND NOVEL MECHANISMS

Abstract

Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a mystery. How is a learned optimizer able to outperform a well-tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions by careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on three disparate tasks, and discover that they have learned interpretable mechanisms, including momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation. Moreover, we show how the dynamics of learned optimizers enable these behaviors. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.

1. INTRODUCTION

Optimization algorithms underlie nearly all of modern machine learning. A recent thread of research focuses on learning optimization algorithms by directly parameterizing and training an optimizer on a distribution of tasks. These so-called learned optimizers have been shown to outperform baseline optimizers in restricted settings (Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017; Bello et al., 2017; Li & Malik, 2016; Metz et al., 2019; 2020). Despite improvements in the design, training, and performance of learned optimizers, fundamental questions remain about their behavior. We understand remarkably little about how these systems work. Are learned optimizers simply learning a clever combination of known techniques? Or do they learn fundamentally new behaviors that have not yet been proposed in the optimization literature? If they did learn a new optimization technique, how would we know? Contrast this with existing "hand-designed" optimizers such as momentum (Polyak, 1964), AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), or Adam (Kingma & Ba, 2014). These algorithms are motivated and analyzed via intuitive mechanisms and theoretical principles (such as accumulating update velocity in momentum, or rescaling updates based on gradient magnitudes in RMSProp or Adam). This understanding of underlying mechanisms allows future studies to build on these techniques by highlighting flaws in their operation (Loshchilov & Hutter, 2018), studying convergence (Reddi et al., 2019), and developing deeper knowledge about why key mechanisms work (Zhang et al., 2020). Without an analogous understanding of the inner workings of learned optimizers, it is incredibly difficult to analyze or synthesize their behavior. In this work, we develop tools for isolating and elucidating mechanisms in nonlinear, high-dimensional learned optimization algorithms (§3).
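To make the contrast concrete, the hand-designed mechanisms mentioned above can be written as one-line update rules. The following sketch shows the standard textbook forms of classical momentum (accumulating an exponentially decayed velocity) and RMSProp-style rescaling (dividing each coordinate by a running root-mean-square gradient magnitude); hyperparameter values here are illustrative defaults, not taken from this paper.

```python
import numpy as np

def momentum_step(grad, velocity, lr=0.1, beta=0.9):
    """Classical momentum: accumulate an exponentially decayed sum of
    past gradients (the velocity) and step in that direction."""
    velocity = beta * velocity + grad
    return -lr * velocity, velocity

def rmsprop_step(grad, avg_sq, lr=0.01, gamma=0.99, eps=1e-8):
    """RMSProp: rescale each coordinate by a running estimate of the
    root-mean-square gradient magnitude."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    return -lr * grad / (np.sqrt(avg_sq) + eps), avg_sq
```

Because each rule is a simple, closed-form function of the gradient and a small amount of state, its mechanism can be read directly from the formula; a learned optimizer replaces these formulas with an opaque trained function.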
Using these methods, we show how learned optimizers utilize both known and novel techniques, across three disparate tasks. In particular, we demonstrate that learned optimizers learn momentum (§4.1), gradient clipping (§4.2), learning rate schedules (§4.3), and a new type of learning rate adaptation (§4.4). Taken together, our work can be seen as part of a new approach to scientifically interpret and understand learned algorithms. We provide code for training and analyzing learned optimizers, as well as the trained weights for the learned optimizers studied here, at redacted URL.
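For readers unfamiliar with the setting, a common learned-optimizer parameterization applies a small trained network independently to every parameter, mapping per-parameter features (e.g. the gradient and running statistics of it) to an update. The sketch below is a minimal, hypothetical version of this idea with randomly initialized weights; it is not the architecture studied in this paper, where the network weights would instead be meta-trained over a distribution of tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a tiny per-parameter MLP. In a real learned
# optimizer these would be meta-trained, not random.
W1 = rng.normal(scale=0.1, size=(4, 3))  # features -> hidden
W2 = rng.normal(scale=0.1, size=(1, 4))  # hidden -> update

def learned_update(grad, momentum, rms):
    """Apply the same small network independently to each parameter,
    taking (gradient, momentum, RMS) features and emitting an update."""
    feats = np.stack([grad, momentum, rms], axis=-1)  # (n_params, 3)
    hidden = np.tanh(feats @ W1.T)                    # (n_params, 4)
    return (hidden @ W2.T).squeeze(-1)                # (n_params,)
```

Unlike the closed-form rules of momentum or RMSProp, nothing about this function's mechanism is legible from its definition, which is precisely why the reverse-engineering tools developed in this work are needed.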

