OVERCOMING BARRIERS TO THE TRAINING OF EFFECTIVE LEARNED OPTIMIZERS

Abstract

In this work we focus on general-purpose learned optimizers capable of training models on a wide variety of problems with no user-specified hyperparameters. We introduce a new, neural-network-parameterized, hierarchical optimizer with access to additional features, such as validation loss, to enable automatic regularization. Most learned optimizers have been trained on only a single task, or a small number of tasks. We train our optimizers on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks. The learned optimizers not only perform well, but also learn behaviors that are distinct from those of existing first-order optimizers. For instance, they generate update steps that have implicit regularization and adapt as the problem hyperparameters (e.g., batch size) or architecture (e.g., neural network width) change. Finally, these learned optimizers show evidence of being useful for out-of-distribution tasks such as training themselves from scratch.

1. INTRODUCTION

Much of the success of modern deep learning has been driven by a shift from hand-designed features carefully curated by human experts to domain-agnostic methods that can learn features from large amounts of data. By leveraging large-scale datasets with flexible models, we are now able to rapidly learn powerful features for new problem settings that often generalize to novel tasks. While learned features outperform hand-designed features on numerous tasks (Krizhevsky et al., 2012; Berner et al., 2019; Vinyals et al., 2019; Piech et al., 2015), we continue to use hand-designed optimization algorithms (such as gradient descent, momentum, and so on) for training models. These hand-designed update rules benefit from decades of optimization research, but still require extensive expert supervision in order to be used effectively in machine learning. For example, they fail to flexibly adapt to new problem settings and require careful tuning of learning rate schedules and momentum timescales for different model architectures and datasets (Choi et al., 2019). In addition, most do not leverage alternative sources of information beyond the gradient, such as the validation loss. By separating the optimization target (training loss) from the broader goal (generalization), classic methods require more careful tuning of regularization and/or data augmentation strategies by the practitioner.

To address these drawbacks, recent work on learned optimizers aims to replace hand-designed optimizers with a parametric optimizer, trained on a set of tasks, that can then be applied more generally. Recent work in this area has focused on either augmenting existing optimizers to adapt their own hyperparameters (Daniel et al., 2016; Xu et al., 2017; 2019), or on developing more expressive learned optimizers to replace existing optimizers entirely (Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017; Metz et al., 2018; 2019a; b; Gu et al., 2019).
These latter models take in problem information (such as the current gradient of the training loss) and iteratively update parameters. However, to date, learned optimizers have proven to be brittle and ineffective at generalizing across diverse sets of problems. Our work identifies fundamental barriers that have limited progress in learned optimizer research, and addresses several of these barriers to train effective optimizers:

1. Computational scale: Training a learned optimizer is costly. When training the optimizer, a single training step requires applying the optimizer to a training task for some number of unrolled steps. This work utilizes massive parallel computing infrastructure to scale the number of unrolled steps an order of magnitude larger than in previous work.

2. Training tasks: Deep learning requires large training datasets. For learned optimizers to be effective, we similarly need a large dataset of optimization tasks on which to train. We build off of the TaskSet dataset (Metz et al., 2020) and construct a dataset of more than six thousand diverse optimization tasks commonly found in machine learning. We show that this large and diverse task distribution is critical for training optimizers that generalize.

3. Inductive bias of optimizer architecture: The parameterization of the learned optimizer and the task information fed to it both strongly affect performance. We propose a new hierarchical learned optimizer architecture that incorporates additional task information (such as validation loss), and show that it outperforms previous learned optimizer architectures.

By addressing these barriers, we develop learned optimizers that exceed prior work in scale, robustness, and out-of-distribution generalization. As a final test, we show that the learned optimizer can be used to train new learned optimizers from scratch (analogous to "self-hosting" compilers (Hart and Levin, 1962)).
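To make the third barrier concrete, the sketch below shows one plausible shape for a learned update rule that consumes task information beyond the gradient: a tiny per-parameter MLP mapping gradient, momentum, and a validation-loss feature to an update direction and step size. All names, shapes, and the small stability multipliers here are illustrative assumptions, not the paper's actual hierarchical architecture.

```python
import numpy as np

def mlp_optimizer_step(theta, w, grad, m, val_feature):
    """Hypothetical per-parameter learned update: a tiny MLP maps
    [gradient, momentum, validation-loss feature] -> [direction, log-step-size].
    theta = (W1, b1, W2, b2) are the optimizer's outer-parameters."""
    W1, b1, W2, b2 = theta
    beta = 0.9
    m = beta * m + (1 - beta) * grad                    # momentum accumulator
    # Per-parameter features; the validation loss is broadcast to every parameter.
    feats = np.stack([grad, m, np.full_like(grad, val_feature)], axis=-1)
    h = np.tanh(feats @ W1 + b1)                        # hidden layer, applied per parameter
    out = h @ W2 + b2                                   # (..., 2): direction and log-scale
    direction, log_scale = out[..., 0], out[..., 1]
    # Small multipliers keep the untrained optimizer from diverging (an assumption).
    update = 0.001 * direction * np.exp(0.001 * log_scale)
    return w - update, m

# Hypothetical shapes: 3 input features, 8 hidden units, 2 outputs.
rng = np.random.default_rng(0)
theta = (rng.normal(scale=0.1, size=(3, 8)), np.zeros(8),
         rng.normal(scale=0.1, size=(8, 2)), np.zeros(2))
w = rng.normal(size=100)
m = np.zeros(100)
w_new, m_new = mlp_optimizer_step(theta, w, grad=2 * w, m=m, val_feature=1.5)
```

Because the MLP is applied independently to each parameter, the same outer-parameters can optimize models of any size, which is one reason per-parameter update rules are a common parameterization choice.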

2. PRELIMINARIES

Training a learned optimizer is a bilevel optimization problem that contains two loops: an inner loop that applies the optimizer to solve a task, and an outer loop that iteratively updates the parameters of the learned optimizer (Franceschi et al., 2018). We use the inner- and outer- prefixes throughout to be explicit about which optimization loop we are referring to. That is, the inner-loss refers to a target task's loss function that we wish to optimize, and the outer-loss refers to a measure of the optimizer's performance training the target task (inner-task). Correspondingly, we refer to the optimizer parameters as outer-parameters, and the parameters that the optimizer is updating as inner-parameters. Outer-optimization refers to the act of finding outer-parameters that perform well under some outer-loss. For a given inner-task, we apply the learned optimizer for some number of steps (unrolling the optimizer). Ideally, we would unroll each target task until some stopping criterion is reached, but this is computationally infeasible for even moderate-scale machine learning tasks. Each outer-iteration requires unrolling the optimizer on a target task. Short (truncated) unrolls are more computationally efficient, but suffer from truncation bias (Wu et al., 2016; Metz et al., 2019b) in that the outer-loss surface computed using truncated unrolls is different (and may have different minima) than the fully unrolled outer-loss.

3. METHODS: ADDRESSING THE THREE BARRIERS TO LEARNED OPTIMIZERS
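The bilevel structure described above can be sketched in plain Python. Everything here is a toy stand-in under stated assumptions: the inner-task is a simple quadratic, and the "learned optimizer" is reduced to a single learned log-learning-rate so that the inner/outer roles are easy to see; the real optimizer replaces `learned_opt_update` with a neural network.

```python
import numpy as np

class QuadraticTask:
    """Toy inner-task: minimize ||w - w*||^2 (a stand-in for a real ML task)."""
    def __init__(self, dim=10, seed=0):
        self.target = np.random.default_rng(seed).normal(size=dim)
        self.dim = dim
    def init_inner_params(self, rng):
        return rng.normal(size=self.dim)       # inner-parameters
    def train_loss_grad(self, w):
        return 2.0 * (w - self.target)         # inner-gradient
    def val_loss(self, w):
        return float(np.sum((w - self.target) ** 2))

def learned_opt_update(theta, w, g):
    """Hypothetical learned update rule: theta holds the outer-parameters.
    Here it is just a learned log-learning-rate applied to the gradient."""
    return w - np.exp(theta[0]) * g

def inner_unroll(theta, task, n_steps, rng):
    """One truncated unroll: apply the learned optimizer for n_steps and
    return the outer-loss contribution, i.e. the mean inner-validation loss."""
    w = task.init_inner_params(rng)
    val_losses = []
    for _ in range(n_steps):
        g = task.train_loss_grad(w)
        w = learned_opt_update(theta, w, g)
        val_losses.append(task.val_loss(w))
    return float(np.mean(val_losses))

# One outer-evaluation: unroll the optimizer on a sampled inner-task.
outer_loss = inner_unroll(theta=np.array([-2.0]), task=QuadraticTask(),
                          n_steps=50, rng=np.random.default_rng(1))
```

Truncation bias is visible even in this toy: the mean validation loss over a 50-step unroll can rank two settings of `theta` differently than a much longer unroll would.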

3.1. OUTER-OPTIMIZATION

To train the optimizer, we minimize an outer-loss that quantifies the performance of the optimizer. This is defined as the mean of the inner-loss computed on the inner-validation set over some number of unrolled steps, averaged over inner-tasks in the outer-training task set. Although this outer-loss is differentiable, it is costly to compute the outer-gradient (which involves backpropagating through the unrolled optimization). In addition, the outer-loss surface is badly conditioned and extremely non-smooth (Metz et al., 2019b), making it difficult to optimize. We deal with these issues by using derivative-free optimization, specifically evolution strategies (ES) (Rechenberg, 1973), to minimize the outer-loss, obviating the need to compute derivatives through the unrolled optimization process. Previous work has used unrolled derivatives (Andrychowicz et al., 2016; Wichrowska et al., 2017; Metz et al., 2019b), and was thus limited to small numbers of unrolled steps (e.g., 20 in Andrychowicz et al. (2016), and starting at 50 in Metz et al. (2019b)). Without ES, the gradient estimates we obtain are of such high variance that no training occurs. Using ES, we are able to use considerably longer unrolls. Initial unroll lengths were chosen to balance communication cost between parallel workers (updating optimizer parameters) with the computational cost of unrolling on individual workers (estimating the local gradient with ES). We start outer-training by sampling unroll steps uniformly from 240-360. When performance saturates with
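A minimal sketch of the ES estimator used in place of backpropagation through the unroll is given below. It treats the unrolled optimization as a black-box function of the outer-parameters; the antithetic (mirrored) sampling and the toy quadratic outer-loss used in the usage line are standard illustrative choices, not the paper's distributed implementation.

```python
import numpy as np

def es_grad(outer_loss_fn, theta, sigma=0.01, n_pairs=128, seed=0):
    """Antithetic evolution-strategies estimate of the outer-gradient
    d/dtheta E[L(theta + sigma * eps)], eps ~ N(0, I).
    outer_loss_fn is treated as a black box, so no derivatives are ever
    taken through the unrolled inner optimization."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        # Mirrored perturbations cancel the baseline and reduce variance.
        l_pos = outer_loss_fn(theta + sigma * eps)
        l_neg = outer_loss_fn(theta - sigma * eps)
        grad += (l_pos - l_neg) / (2.0 * sigma) * eps
    return grad / n_pairs

# Usage on a toy outer-loss (a stand-in for the mean validation loss of an
# unrolled inner-task); the true gradient of sum(t^2) at theta is 2 * theta.
theta = np.array([3.0, -1.0])
g = es_grad(lambda t: float(np.sum(t ** 2)), theta, sigma=0.01, n_pairs=256)
```

In the distributed setting, each worker evaluates one or more perturbation pairs by running its own truncated unroll, and only the scalar losses are communicated, which is what makes long unrolls affordable.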

