ROMUL: SCALE-ADAPTIVE POPULATION BASED TRAINING

Abstract

In most pragmatic settings, data augmentation and regularization are essential, and require hyperparameter search. Population based training (PBT) is an effective tool for efficiently finding such hyperparameters, as well as schedules over them. In this paper, we compare existing PBT algorithms and contribute a new one: ROMUL, for RObust MULtistep search, which adapts its step size over the course of training. We report competitive results with standard models on CIFAR (image classification) as well as Penn Tree Bank (language modeling), which both depend on heavy regularization. We also open-source hoptim, a PBT library agnostic to the training framework, which is simple to use, reentrant, and provides good defaults with ROMUL.

1. INTRODUCTION

Hyperparameter tuning is essential for good performance in most machine learning tasks, and poses numerous challenges. First, optimal hyperparameter values can change over the course of training (schedules), e.g. for the learning rate, fine-tuning phases, or data augmentation. Hyperparameter values are also rarely independent of each other (e.g. the magnitude of individual data augmentations depends on the number of data augmentations applied), and the search space grows exponentially with the number of hyperparameters. All of this search has to be performed within a computational budget, and sometimes even within a wall-clock time budget (e.g. for models that are frequently retrained on new data), requiring efficient parallelization. In practice, competitive existing methods range from random search (Bergstra & Bengio, 2012) to more advanced methods (that aim at being more compute-efficient) like sequential search (Bergstra et al., 2011; 2013; Li et al., 2018), population based training (PBT, e.g. Jaderberg et al. (2017); Ho et al. (2019)) and search structured by the space of the hyperparameters (Liu et al., 2018; Cubuk et al., 2019b). A major drawback of advanced hyperparameter optimization methods is that they themselves require attention from the user to reliably outperform random search. In this work, we empirically study the different training dynamics of data augmentation and regularization hyperparameters across vision and language modeling tasks, in particular for multistep (sequential) hyperparameter search. A common failure mode (i) is due to hyperparameters that have different effects on the validation loss in the short and long term; for instance, using a smaller dropout often leads to faster but worse convergence. Another common problem (ii) is that successful searches are contingent on adequate "hyper-hyperparameters" (such as value ranges, or the search policy used, which in current methods relies on non-adaptive mutation steps).
Our contributions can be summarized as follows:
• We present a robust algorithm for leveraging population based training for hyperparameter search: ROMUL (RObust MULtistep search), which addresses (i) and (ii). We empirically study its benefits and limitations, and show that it provides good defaults that compare favorably to existing methods.
• We open-source hoptim, a simple library for sequential hyperparameter search, which provides multiple optimizers (including ROMUL), as well as toy benchmarks showcasing hyperparameter optimization problems we identified empirically, and standard datasets.

2. HYPERPARAMETER OPTIMIZATION WITH POPULATION-BASED TRAINING

In this article, we refer to the family of algorithms that continuously tune the hyperparameters of a set of models over the course of their training as "PBT algorithms" or "PBT optimizers". Hyperparameter optimization is thus a zeroth-order optimization performed at a slower frequency than the (often first-order, e.g. SGD) optimization of the model. A PBT step typically happens after a fixed number of epochs or updates of the model, often optimizing the loss on the validation set; it continues from an already produced "parent" checkpoint, and produces and evaluates a new checkpoint. At every PBT step, hyperparameters can be updated (mutated): incremented or decremented by some amount (the step size), or resampled. There are multiple aspects to consider when designing a PBT algorithm:
• Technical constraints: how the optimization is distributed (with a centralized or decentralized algorithm), how workers run the trainings, and how failed workers are handled. In our experiments, these concerns are handled in a unified manner by the hoptim library, which we use to implement and compare multiple algorithms. The library is decoupled from the scheduling of the jobs and designed to accommodate adding more workers to scale up the training, or fewer when some are killed, for example through preemption or time-out on a shared cluster.
• Optimization method: how the hyperparameters are modified throughout the training, for instance through mutations.
• Selection process: which individuals of the population are kept, both in terms of hyperparameters and state of the neural network (checkpoint).
For the last two points, some solutions are described below.
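The exploit-and-explore cycle of a generic PBT step can be sketched as follows. This is a minimal illustrative sketch, not the hoptim API: the function name, the dict-based population representation, and the truncation-selection/multiplicative-mutation scheme are simplifying assumptions of ours.

```python
import random

def pbt_step(population, evaluate, exploit_frac=0.25, step=0.2, rng=random):
    """One generic PBT step (illustrative sketch, not the hoptim API).

    population: list of dicts with keys "hparams" (dict of floats) and
    "checkpoint" (opaque model state). evaluate(ind) returns a validation loss.
    """
    # rank the population by validation loss (lower is better)
    scored = sorted(population, key=lambda ind: evaluate(ind))
    n_exploit = max(1, int(len(scored) * exploit_frac))
    survivors, losers = scored[:-n_exploit], scored[-n_exploit:]
    for loser in losers:
        parent = rng.choice(survivors)
        # exploit: copy the parent's checkpoint and hyperparameters
        loser["checkpoint"] = parent["checkpoint"]
        loser["hparams"] = dict(parent["hparams"])
        # explore: perturb each hyperparameter by a multiplicative step
        for name in loser["hparams"]:
            factor = 1.0 + step if rng.random() < 0.5 else 1.0 - step
            loser["hparams"][name] *= factor
    return scored[0]  # current best individual
```

In this sketch, the step size is a fixed hyper-hyperparameter, which is precisely the limitation that the adaptive mutations discussed later aim to remove.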

2.1. CHALLENGES

In order to give a clearer understanding of our proposed methods, we list below the main concerns we have observed in PBT:
• Anisotropy: by definition, the optimal value of the hyperparameters considered is unknown, and oftentimes the range (or mutation scheme) provided to the algorithm is only a loose estimate. As modifying two hyperparameters with the same step size can produce effects of very different magnitudes, the user is required to normalize the search space. But pre-tuning the hyperparameter tuner itself can be cumbersome, as the dynamics evolve during training. Section 3.1 provides an example based on the Rosenbrock function which illustrates this issue and highlights the interest of adaptive mutations.
• Checkpoint vs. hyperparameters: comparing individuals in the population is extremely hard, as improvements can be due to better hyperparameters or to better checkpoints (including potentially better batches). Better performance through better checkpoints is an optimization phenomenon (e.g. random restarts) that can bias the hyperparameter selection. We detail this aspect in Section 4.2.
• Short-term/long-term discordance: we observed empirically that hyperparameters which induce better performance in the short term are not always optimal in the longer term. This challenge does not exist in classical static optimization, but is crucial for PBT, since local minima are easy to reach and pose a danger for greedy algorithms. An example of such a hyperparameter is the learning rate: dropping it often induces a drop in the validation loss, even early in training, while increasing it has the opposite effect, causing greedy PBT algorithms to reduce it to its minimum value too early, without being able to recover. We detail this aspect in Section 4.1.
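The anisotropy concern can be made concrete on the Rosenbrock function itself: the same mutation step applied to different coordinates changes the objective by very different amounts, so a single fixed step size cannot be right for both. The evaluation point and step size below are illustrative choices of ours, not the exact setup of Section 3.1.

```python
def rosenbrock(x, y):
    # classic anisotropic test function (global minimum at x = y = 1)
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

# apply the same mutation step (0.1) to each coordinate separately
base = rosenbrock(0.0, 1.0)
effect_x = abs(rosenbrock(0.1, 1.0) - base)  # effect of stepping x by 0.1
effect_y = abs(rosenbrock(0.0, 1.1) - base)  # effect of stepping y by 0.1
```

Here `effect_y` is roughly an order of magnitude larger than `effect_x`, so a step size tuned for one coordinate is far too coarse or too fine for the other; adaptive mutations sidestep this normalization burden.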

2.2. DIFFERENTIAL EVOLUTION AND ROMUL

Differential Evolution (DE; Storn & Price, 1997) is a standard black-box optimization method for minimizing f : R^n → R. It operates on a population x_i ∈ R^n, i ∈ {1, ..., M}, M ≥ 4, and indefinitely repeats the following steps for each individual x_base in the population, generating another individual, called the mutated vector, that could replace x_base if better: 1. given the best individual x_best, which minimizes f in the population, as well as two randomly selected ones x_a and x_b, compute the donor d, which will give part of its coefficients to the mutated vector. In the current-to-best/1 scheme we use, these are the base coefficients plus a term attracting them to the best set of coefficients in the current population, and an additional difference term between the two randomly selected individuals: d = x_base + F1 (x_best − x_base) + F2 (x_a − x_b), where F1 and F2 are the differential weights.
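As a concrete sketch, the current-to-best/1 donor computation can be written coordinate-wise as below; the function name and the default values of the differential weights F1 and F2 are illustrative assumptions, not prescribed by DE.

```python
def donor_current_to_best(x_base, x_best, x_a, x_b, f1=0.8, f2=0.8):
    """current-to-best/1 donor: pull the base vector toward the best
    individual, and add a scaled difference of two random individuals."""
    return [
        b + f1 * (s - b) + f2 * (a - c)
        for b, s, a, c in zip(x_base, x_best, x_a, x_b)
    ]
```

The attraction term f1 * (x_best − x_base) exploits the best known coefficients, while the difference term f2 * (x_a − x_b) injects exploration whose scale automatically shrinks as the population converges.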

