MODEL-BASED ASYNCHRONOUS HYPERPARAMETER AND NEURAL ARCHITECTURE SEARCH

Abstract

We introduce an asynchronous multi-fidelity method for model-based hyperparameter and neural architecture search that combines the strengths of asynchronous Successive Halving and Gaussian process-based Bayesian optimization. Our method relies on a probabilistic model that can simultaneously reason across hyperparameters and resource levels, and supports decision-making in the presence of pending evaluations. We demonstrate the effectiveness of our method on a wide range of challenging benchmarks, for tabular data, image classification and language modelling, and report substantial speed-ups over current state-of-the-art methods. Our new method, along with asynchronous baselines, is implemented in a distributed framework that will be open sourced along with this publication.

1. INTRODUCTION

The goal of hyperparameter and neural architecture search (HNAS) is to automate the process of finding the right architecture or hyperparameters x⋆ ∈ argmin_{x ∈ X} f(x) of a deep neural network by minimizing the validation loss f(x), observed through noise: y_i = f(x_i) + ε_i, with ε_i ∼ N(0, σ²) for i = 1, ..., n. Bayesian optimization (BO) is an effective model-based approach for solving expensive black-box optimization problems (Jones et al., 1998; Shahriari et al., 2016). It constructs a probabilistic surrogate model of the loss function, p(f | D), based on previous evaluations D = {(x_i, y_i)}_{i=1,...,n}. The search for the global minimum of f is then driven by trading off exploration in regions of high uncertainty against exploitation in regions where the global optimum is likely to reside. However, for HNAS problems, standard BO needs to be augmented in order to remain competitive. For example, training runs for unpromising configurations x can be stopped early, yet still serve as low-fidelity approximations of f(x) (Swersky et al., 2014; Domhan et al., 2015; Klein et al., 2017b). Further, evaluations of f can be executed in parallel to reduce the wall-clock time required to find a good solution. Several methods for multi-fidelity and/or distributed BO have been proposed (Kandasamy et al., 2017; 2016; Takeno et al., 2020), but they rely on rather complicated approximations either to select the fidelity level or to compute an information-theoretic acquisition function that determines the next candidate. In this work, we adopt the desiderata of Falkner et al. (2018), namely simplicity, which often leads to more robust methods in practice. A simple, easily parallelizable multi-fidelity scheduling algorithm is successive halving (SH) (Karnin et al., 2013; Jamieson & Talwalkar, 2016), which iteratively eliminates poorly performing neural networks over time.
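The elimination scheme of successive halving can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation; the callback `evaluate(config, resource)` is a hypothetical stand-in for training a configuration for a given resource (e.g., a number of epochs) and returning its validation loss.

```python
def successive_halving(configs, evaluate, min_resource=1, eta=3, max_resource=27):
    """Sketch of synchronous successive halving (Karnin et al., 2013;
    Jamieson & Talwalkar, 2016). `evaluate(config, resource)` is an
    assumed callback returning the validation loss of `config` after
    training with `resource` units; lower is better."""
    survivors = list(configs)
    resource = min_resource
    while resource <= max_resource and len(survivors) > 1:
        # Evaluate every surviving configuration at the current rung.
        losses = [(evaluate(c, resource), c) for c in survivors]
        losses.sort(key=lambda t: t[0])
        # Keep only the top 1/eta fraction; the rest are stopped early.
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in losses[:keep]]
        resource *= eta
    return survivors[0]
```

With `eta = 3`, each rung triples the training budget while cutting the surviving population to a third, so most of the total budget is spent on the most promising configurations.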
Hyperband (Li et al., 2017) iterates over multiple rounds of SH with varying ratios between the number of configurations and the minimum amount of resource spent per configuration. Falkner et al. (2018) introduced a hybrid method, called BOHB, that uses a probabilistic model to guide the search while retaining the efficient any-time performance of Hyperband. However, both SH and BOHB can be bottlenecked by their synchronous nature: stopping decisions are made only after synchronizing all training jobs at certain resource levels (called rungs). This approach is wasteful when the evaluation of some configurations takes longer than that of others, as is often the case when training neural networks (Ying et al., 2019), and can substantially delay progress towards high-performing configurations (see the example in Figure 1). Recently, Li et al. (2018) proposed ASHA, which adapts successive halving to the asynchronous parallel case. Even though ASHA relies only on random sampling of new configurations, it has been shown to outperform synchronous SH and BOHB. In this work, we augment ASHA with a Gaussian process (GP) surrogate that jointly models performance across configurations and rungs, improving on the already strong performance of ASHA. The asynchronous setting further requires handling pending evaluations in a principled way to obtain an efficient model-based searcher.
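ASHA's key change is that promotion decisions are made the moment a worker becomes free, rather than waiting for a rung to fill. The following sketch illustrates that decision rule under assumed data structures (a `rungs` dict mapping each resource level to mutable `[config, loss, promoted]` records); it is an illustration of the idea in Li et al. (2018), not their implementation.

```python
def asha_next_job(rungs, eta=3, sample_new=None):
    """Sketch of ASHA's asynchronous promotion rule (Li et al., 2018).
    `rungs` maps a resource level to a list of [config, loss, promoted]
    records; `sample_new` is an assumed callback drawing a fresh
    configuration (random sampling in plain ASHA)."""
    # Scan rungs from the highest resource level downwards.
    for resource in sorted(rungs, reverse=True):
        records = rungs[resource]
        k = len(records) // eta  # promotable slots in the top 1/eta
        if k == 0:
            continue
        top = sorted(records, key=lambda r: r[1])[:k]
        for rec in top:
            if not rec[2]:
                rec[2] = True  # mark as promoted
                return ("promote", rec[0], resource * eta)
    # No promotable configuration found: start a new one at the base rung.
    return ("new", sample_new() if sample_new else None, min(rungs))
```

Because no rung ever blocks on stragglers, a free worker always has something to do: either continue training a promising configuration at the next rung or start a fresh one.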

1.1. CONTRIBUTIONS

Dealing with hyperparameter optimization (HPO) in general, and HNAS for neural networks in particular, we would like to make the most efficient use of a given parallel computation budget (e.g., number of compute instances) in order to find high-accuracy solutions in the shortest possible wall-clock time. Besides exploiting low-fidelity approximations, we demonstrate that asynchronous parallel scheduling is a decisive factor in cost-efficient search. We further handle pending evaluations through fantasizing (Snoek et al., 2012), which is critical for asynchronous searchers. Our novel combination of asynchronous SH with multi-fidelity BO, dubbed MOdel Based aSynchronous mulTi fidelity optimizER (MOBSTER), substantially outperforms either of them in isolation. More specifically:

• We clarify the differences between existing asynchronous SH extensions: ASHA, as described by Li et al. (2018), and an arguably simpler stopping rule variant, related to the median rule (Golovin et al., 2017) and first implemented in Ray Tune (Liaw et al., 2018). Although their differences seem subtle, they lead to substantially different behaviour in practice.

• We present an extensive ablation study, comparing and analysing the different components of asynchronous HNAS. While these individual components are not novel, one of our main contributions is to show that, by systematically combining them to form MOBSTER, we obtain a more reliable and more efficient method than other recently proposed approaches. Due to limited space, we present a detailed description of the technical nuances and complexities of model-based asynchronous multi-fidelity HNAS in Appendix A.3.

• On a variety of neural network benchmarks, we show that MOBSTER is more efficient in terms of wall-clock time than other state-of-the-art algorithms. Unlike BOHB, it does not suffer from substantial synchronization overheads when evaluations are expensive.
As a result, we can achieve the same performance in the same amount of wall-clock time, often with just half the computational resources, compared to random-sampling-based ASHA.

Next, we relate our work to approaches recently published in the literature. In Section 2, we review synchronous SH as well as two asynchronous extensions. Our novel method is presented in Section 3. We present empirical evaluations for the HNAS of various neural architecture types in Section 4, and finish with conclusions and future work in Section 5.
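The fantasizing of pending evaluations mentioned above can be sketched as a Monte Carlo average of the acquisition function over imagined outcomes of jobs still running. The snippet below is a generic illustration in the spirit of Snoek et al. (2012), not the paper's implementation; the surrogate interface (`sample_y`, `conditioned_on`) and the `acquisition` callback are assumptions for the sketch.

```python
import numpy as np

def fantasized_acquisition(model, acquisition, pending, num_fantasies=10, rng=None):
    """Sketch of handling pending evaluations by fantasizing
    (Snoek et al., 2012). `model` is an assumed surrogate exposing
    `sample_y(X, rng)` (joint posterior samples at pending inputs) and
    `conditioned_on(X, y)` (a copy conditioned on fantasized data).
    Returns an acquisition averaged over fantasy models."""
    rng = rng or np.random.default_rng(0)

    def averaged(x):
        values = []
        for _ in range(num_fantasies):
            # Draw one fantasy outcome for every pending evaluation ...
            y_fantasy = model.sample_y(pending, rng)
            # ... condition the surrogate on it as if it were observed ...
            fantasy_model = model.conditioned_on(pending, y_fantasy)
            # ... and score the candidate under that fantasy model.
            values.append(acquisition(fantasy_model, x))
        return float(np.mean(values))

    return averaged
```

Averaging over fantasies makes the searcher discount regions where jobs are already running, which is what prevents an asynchronous model-based method from repeatedly proposing near-duplicate configurations.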



Figure 1: Coloured lines trace training jobs for HP configurations, with marker size proportional to validation error (smaller is better). Lines not reaching the maximum number of authorized epochs (27) are stopped early. For synchronous SH, rungs (epochs 1, 3, 9) need to be filled completely before any configuration can be promoted, and each synchronization results in some idle time. In asynchronous SH, configurations proceed to higher epochs faster. Without synchronization overhead, more configurations are promoted to the highest rung in the same wall-clock time (figure best seen in colours).

