OVERCOMING BARRIERS TO THE TRAINING OF EFFECTIVE LEARNED OPTIMIZERS

Abstract

In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters. We introduce a new, neural-network-parameterized, hierarchical optimizer with access to additional features, such as validation loss, to enable automatic regularization. Most learned optimizers have been trained on only a single task, or on a small number of tasks. We train our optimizers on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks. The learned optimizers not only perform well, but learn behaviors that are distinct from those of existing first-order optimizers. For instance, they generate update steps that have implicit regularization and adapt as the problem hyperparameters (e.g. batch size) or architecture (e.g. neural network width) change. Finally, these learned optimizers show evidence of being useful for out-of-distribution tasks such as training themselves from scratch.

1. INTRODUCTION

Much of the success of modern deep learning has been driven by a shift from hand-designed features carefully curated by human experts, to domain-agnostic methods that can learn features from large amounts of data. By leveraging large-scale datasets with flexible models, we are now able to rapidly learn powerful features for new problem settings that often generalize to novel tasks. While learned features outperform hand-designed features on numerous tasks (Krizhevsky et al., 2012; Berner et al., 2019; Vinyals et al., 2019; Piech et al., 2015), we continue to use hand-designed optimization algorithms (such as gradient descent, momentum, and so on) for training models. These hand-designed update rules benefit from decades of optimization research but still require extensive expert supervision in order to be used effectively in machine learning. For example, they fail to flexibly adapt to new problem settings and require careful tuning of learning rate schedules and momentum timescales for different model architectures and datasets (Choi et al., 2019). In addition, most do not leverage alternative sources of information beyond the gradient, such as the validation loss. By separating the optimization target (training loss) from the broader goal (generalization), classic methods require more careful tuning of regularization and/or data augmentation strategies by the practitioner. To address these drawbacks, recent work on learned optimizers aims to replace hand-designed optimizers with a parametric optimizer, trained on a set of tasks, that can then be applied more generally. Recent work in this area has focused on either augmenting existing optimizers to adapt their own hyperparameters (Daniel et al., 2016; Xu et al., 2017; 2019), or developing more expressive learned optimizers to replace existing optimizers entirely (Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017; Metz et al., 2018; 2019a; b; Gu et al., 2019).
These latter models take in problem information (such as the current gradient of the training loss) and iteratively update parameters. However, to date, learned optimizers have proven to be brittle and ineffective at generalizing across diverse sets of problems. Our work identifies fundamental barriers that have limited progress in learned optimizer research and addresses several of these barriers to train effective optimizers: 
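To make this concrete, the following is a minimal sketch of how a parametric optimizer can map per-parameter problem features (here, the gradient and a momentum accumulator) through a small MLP whose weights are the optimizer's meta-parameters. All names, feature choices, and shapes are illustrative assumptions, not the architecture introduced in this work:

```python
import numpy as np

def learned_optimizer_step(theta, grad, momentum, opt_params, beta=0.9):
    """One step of a toy neural-network-parameterized optimizer.

    Per-parameter features (gradient, momentum) are fed through a tiny
    MLP whose weights `opt_params` are the learned optimizer's
    meta-parameters; its output is one scalar update per parameter.
    """
    momentum = beta * momentum + (1.0 - beta) * grad
    # Stack per-parameter features: shape (n_params, n_features).
    feats = np.stack([grad, momentum], axis=-1)
    W1, b1, W2, b2 = opt_params
    h = np.tanh(feats @ W1 + b1)        # hidden layer
    update = (h @ W2 + b2).squeeze(-1)  # one update per parameter
    return theta - update, momentum

# Usage: random meta-parameters on a quadratic task where grad = theta.
rng = np.random.default_rng(0)
opt_params = (rng.normal(scale=0.1, size=(2, 8)), np.zeros(8),
              rng.normal(scale=0.1, size=(8, 1)), np.zeros(1))
theta = np.ones(4)
theta, m = learned_optimizer_step(theta, grad=theta,
                                  momentum=np.zeros(4),
                                  opt_params=opt_params)
```

Meta-training then consists of adjusting `opt_params` so that repeatedly applying this step drives down the losses of a distribution of tasks.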

Computational scale: Training a learned optimizer is costly. When training the optimizer, a single training step requires applying the optimizer to a training task for some number of steps.
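The source of this cost can be sketched as follows: each meta-training step must unroll an entire inner training run before the meta-objective can even be evaluated. In this toy sketch (an illustrative assumption, far simpler than a real learned optimizer), the "learned optimizer" is just a learned scalar step size and the task is a fixed quadratic:

```python
import numpy as np

def unrolled_meta_loss(opt_params, n_inner_steps=100):
    """Cost of one meta-training step: a full unrolled inner run.

    Here `opt_params` is a single learned scalar step size and the
    task is minimizing 0.5 * ||theta||^2; a real learned optimizer
    replaces both with far more expensive components, multiplying
    the cost of every meta-step accordingly.
    """
    theta = np.ones(4)
    for _ in range(n_inner_steps):
        grad = theta                     # gradient of 0.5 * ||theta||^2
        theta = theta - opt_params * grad
    return 0.5 * np.sum(theta ** 2)      # inner loss after the unroll

loss = unrolled_meta_loss(opt_params=0.1, n_inner_steps=100)
```

A single evaluation of this meta-loss already requires `n_inner_steps` applications of the optimizer, and meta-training requires many such evaluations across many tasks.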

