TASKSET: A DATASET OF OPTIMIZATION TASKS

Abstract

We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional neural networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets. As an example application of such a dataset, we explore meta-learning an ordered list of hyperparameters to try sequentially. By learning this hyperparameter list from data generated using TaskSet, we achieve large speedups in sample efficiency over random search. Next, we use the diversity of TaskSet and our method for learning hyperparameter lists to empirically explore the generalization of these lists to new optimization tasks in a variety of settings, including ImageNet classification with Resnet50 and LM1B language modeling with transformers. As part of this work we have open-sourced code for all tasks, as well as 29 million training curves for these problems and the corresponding hyperparameters.

1. INTRODUCTION

As machine learning moves to new domains, collecting diverse, rich, and application-relevant datasets is critical for its continued success. Historically, research on learning optimization algorithms has leveraged only single tasks (Andrychowicz et al., 2016; Metz et al., 2019a), or parametric synthetic tasks (Wichrowska et al., 2017), due to the difficulty of obtaining large sets of tasks.

1.1. TASKSET: A SET OF TASKS

We present a set of tasks significantly larger than any optimizer dataset previously studied. We aim to better enable standardized research on optimizers, be that analysis of existing optimizers or development of new learned learning algorithms. We call this suite of tasks TaskSet. Much in the same way that learned features in computer vision outpaced hand-designed features (Krizhevsky et al., 2012; LeCun et al., 2015), we believe that data-driven approaches to discovering optimization algorithms will replace their hand-designed counterparts, resulting in increased performance and usability. To this end, standardizing a large suite of optimization tasks is an important first step towards more rigorous learned optimizer research. In this setting, a single "example" is an entire training procedure for a task defined by data, loss function, and architecture. Thus, TaskSet consists of over a thousand optimization tasks, largely focused on deep learning (neural networks). They include image classification using fully connected and convolutional models, generative models with variational autoencoders (Kingma & Welling, 2013) or flows (Dinh et al., 2016; Papamakarios et al., 2017), natural language processing tasks including both language modeling and classification, as well as synthetic tasks such as quadratics and optimization test functions. The problems themselves are diverse in size, spanning 7 orders of magnitude in parameter count, but remain reasonably fast to compute, as almost all tasks can be trained for 10k iterations on a CPU in under one hour. To demonstrate the breadth of this dataset we show an embedding of all the tasks in Appendix A.1 in Figure S1.

1.2. AMORTIZING HYPERPARAMETER SEARCH

Machine learning methods are growing ever more complex, and their computational demands are increasing at a frightening pace (Amodei & Hernandez, 2018). Unfortunately, most modern machine learning models also require extensive hyperparameter tuning. Often, hyperparameter search is many times more costly than the final algorithm, which ultimately has large economic and environmental costs (Strubell et al., 2019). The most common approach to hyperparameter tuning involves some form of quasi-random search over a pre-specified grid of hyperparameters. Building on past work (Wistuba et al., 2015b; Pfisterer et al., 2018), and serving as a typical example problem illustrative of the sort of research enabled by TaskSet, we explore a hyperparameter search strategy consisting of a simple ordered list of hyperparameters to try. The idea is that the first few elements in this list will cover most of the variation in good hyperparameters found in typical machine learning workloads. We choose the elements in this list by leveraging the diversity of tasks in TaskSet: we meta-learn a hyperparameter list that performs the best on the set of tasks in TaskSet. We then test this list of hyperparameters on new, larger machine learning tasks. Although learning the list of hyperparameters is costly (in total we train ∼29 million models consisting of over 4,000 distinct hyperparameter configurations), our final published list is now available as a good starting guess for new tasks. Furthermore, we believe the raw training curves generated by this search will be useful for future hyperparameter analysis and meta-learning research, and we release them as part of this work. We additionally release code in TensorFlow (Abadi et al., 2016), Jax (Bradbury et al., 2018), and PyTorch (Paszke et al., 2019) for a reference optimizer which uses our learned hyperparameter list and can be easily applied to any model.
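The search strategy above reduces to trying configurations in a fixed, meta-learned order and keeping the best. A minimal sketch follows; `train_and_eval` and the list contents are hypothetical placeholders, not TaskSet's released API.

```python
def try_hparam_list(train_and_eval, hparam_list, budget):
    """Try hyperparameter configurations in order, keeping the best.

    train_and_eval: callable mapping a hyperparameter dict to a
        validation loss (lower is better).
    hparam_list: configurations sorted by meta-learned priority, so the
        first few entries should cover most tasks well.
    budget: number of configurations to actually train.
    """
    best_loss, best_hparams = float("inf"), None
    for hparams in hparam_list[:budget]:
        loss = train_and_eval(hparams)
        if loss < best_loss:
            best_loss, best_hparams = loss, hparams
    return best_hparams, best_loss


# Toy usage: a stand-in "training" function with a known optimum.
def toy_train_and_eval(hparams):
    return (hparams["lr"] - 0.1) ** 2

learned_list = [{"lr": 1.0}, {"lr": 0.1}, {"lr": 0.01}]
best, loss = try_hparam_list(toy_train_and_eval, learned_list, budget=2)
```

Unlike random search, the expected quality after k trials depends only on how well the first k list entries were chosen, which is exactly what the meta-learning objective optimizes.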

2. TASKSET: A SET OF TASKS

How should one choose what problems to include in a set of optimization tasks? In our case, we strive to include optimization tasks that have been influential in deep learning research over the last several decades, and that are representative of many common machine learning problems. Designing this dataset requires striking a balance between including realistic large-scale workloads and ensuring that tasks are fast to train so that using them for meta-learning is tractable. We construct our dataset largely out of neural network based tasks. Our chosen tasks have between ten thousand and one million parameters (much smaller than the billions commonly used today); as a result, most problems can train in under an hour on a cloud CPU with 5 cores. We additionally focus on increased "task diversity" by including many different kinds of training algorithms, architectures, and datasets, inspired by past work in reinforcement learning which has demonstrated that large numbers of problems and increased diversity around some domain of interest are useful for both training and generalization (Heess et al., 2017; Tobin et al., 2017; Cobbe et al., 2018; OpenAI et al., 2019). Again though, a balance must be struck, as in the limit of too much diversity no learning can occur due to the no free lunch theorem (Wolpert & Macready, 1997).

Our dataset, TaskSet, is made up of 1162 tasks in total. We define a task as the combination of a loss function, a dataset, and initialization. Specifically, we define a task as a set of 4 functions:

• Initialization: () → parameter initial values
• Data generator: data split (e.g. train / valid / test) → batch of data
• Forward pass: (batch of data, params) → loss
• Gradient function: (input data, params) → gradients (dloss/dparams)

A task has no tunable hyperparameters and, coupled with an optimizer, provides all the necessary information to train using first order optimization. This makes experimentation easier, as each task definition specifies hyperparameters such as batch size (Shallue et al., 2018; McCandlish et al., 2018) or initialization (Schoenholz et al., 2016; Yang & Schoenholz, 2017; Xiao et al., 2018; Li & Nguyen, 2019).
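The four-function task interface described above can be sketched as a small class; this is an illustrative instantiation for a toy quadratic problem, with hypothetical names rather than TaskSet's actual code.

```python
import numpy as np

class QuadraticTask:
    """A toy task exposing the four functions that define a TaskSet-style
    task: initialization, data generator, forward pass, and gradient."""

    def initial_params(self):
        # Initialization: () -> parameter initial values.
        return np.zeros(3)

    def data_generator(self, split):
        # Data generator: data split -> batch of data. Here a "batch" is
        # just a fixed target vector, seeded per split.
        rng = np.random.default_rng(0 if split == "train" else 1)
        return rng.normal(size=3)

    def loss(self, batch, params):
        # Forward pass: (batch of data, params) -> scalar loss.
        return float(np.sum((params - batch) ** 2))

    def gradient(self, batch, params):
        # Gradient function: (batch of data, params) -> dloss/dparams.
        return 2.0 * (params - batch)


# Coupled with an optimizer (plain SGD here), the task provides everything
# needed for first-order training -- no tunable hyperparameters inside it.
task = QuadraticTask()
params = task.initial_params()
batch = task.data_generator("train")
initial_loss = task.loss(batch, params)
for _ in range(100):
    params = params - 0.1 * task.gradient(batch, params)
final_loss = task.loss(batch, params)
```

Because batch size, initialization, and data pipeline are fixed inside the task, swapping in a different optimizer is the only degree of freedom, which is what makes large-scale optimizer comparisons across many such tasks tractable.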

