MODEL-BASED ASYNCHRONOUS HYPERPARAMETER AND NEURAL ARCHITECTURE SEARCH

Abstract

We introduce an asynchronous multi-fidelity method for model-based hyperparameter and neural architecture search that combines the strengths of asynchronous Successive Halving and Gaussian process-based Bayesian optimization. Our method relies on a probabilistic model that can simultaneously reason across hyperparameters and resource levels, and supports decision-making in the presence of pending evaluations. We demonstrate the effectiveness of our method on a wide range of challenging benchmarks, covering tabular data, image classification and language modelling, and report substantial speed-ups over current state-of-the-art methods. Our new method, along with asynchronous baselines, is implemented in a distributed framework which will be open sourced along with this publication.

1. INTRODUCTION

The goal of hyperparameter and neural architecture search (HNAS) is to automate the process of finding the right architecture or hyperparameters x* ∈ arg min_{x∈X} f(x) of a deep neural network by minimizing the validation loss f(x), observed through noise: y_i = f(x_i) + ε_i, ε_i ∼ N(0, σ²), i = 1, …, n. Bayesian optimization (BO) is an effective model-based approach for solving expensive black-box optimization problems (Jones et al., 1998; Shahriari et al., 2016). It constructs a probabilistic surrogate model of the loss function p(f | D) based on previous evaluations D = {(x_i, y_i)}_{i=1}^{n}. The search for the global minimum of f is then driven by trading off exploration in regions of high uncertainty against exploitation in regions where the global optimum is likely to reside.

However, for HNAS problems, standard BO needs to be augmented in order to remain competitive. For example, training runs for unpromising configurations x can be stopped early, yet still serve as low-fidelity approximations of f(x) (Swersky et al., 2014; Domhan et al., 2015; Klein et al., 2017b). Further, evaluations of f can be executed in parallel to reduce the wall-clock time required to find a good solution. Several methods for multi-fidelity and/or distributed BO have been proposed (Kandasamy et al., 2017; 2016; Takeno et al., 2020), but they rely on rather complicated approximations either to select the fidelity level or to compute an information-theoretic acquisition function that determines the next candidate. In this work we adopt the desideratum of Falkner et al. (2018), namely simplicity, which often leads to more robust methods in practice.

A simple, easily parallelizable multi-fidelity scheduling algorithm is successive halving (SH) (Karnin et al., 2013; Jamieson & Talwalkar, 2016), which iteratively eliminates poorly performing neural networks over time.
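The SH elimination rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `evaluate` callback, the `eta` halving ratio, and the toy loss used below are all hypothetical.

```python
def successive_halving(configs, evaluate, min_resource=1, eta=2, rounds=3):
    """Synchronous successive halving (sketch).

    `evaluate(config, resource)` is a hypothetical callback returning the
    validation loss of `config` after training with `resource` units of
    budget (e.g. epochs); lower is better.
    """
    resource = min_resource
    survivors = list(configs)
    for _ in range(rounds):
        # Evaluate every surviving configuration at the current rung.
        scored = sorted(survivors, key=lambda c: evaluate(c, resource))
        # Keep the best 1/eta fraction, then increase the budget by eta.
        survivors = scored[: max(1, len(scored) // eta)]
        resource *= eta
    return survivors[0]
```

For example, with η = 2 and three rungs, ten candidates shrink to five, then two, then one, while the per-candidate training budget doubles at each rung. The synchronization barrier at each rung is exactly the bottleneck discussed next.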
Hyperband (Li et al., 2017) iterates over multiple rounds of SH with varying ratios between the number of configurations and the minimum amount of resources spent per configuration. Falkner et al. (2018) introduced a hybrid method, called BOHB, that uses a probabilistic model to guide the search while retaining the efficient any-time performance of Hyperband. However, both SH and BOHB can be bottlenecked by their synchronous nature: stopping decisions are made only after synchronizing all training jobs at certain resource levels (called rungs). This approach is wasteful when the evaluations of some configurations take longer than others, as is often the case when training neural networks (Ying et al., 2019), and can substantially delay progress towards high-performing configurations (see the example shown in Figure 1). Recently, Li et al. (2018) proposed ASHA, which adapts successive halving to the asynchronous parallel case. Even though ASHA relies only on random sampling of new configurations, it has been shown to outperform synchronous SH and BOHB. In this work, we augment ASHA with a Gaussian process (GP) surrogate, which jointly models the performance across configurations and

