PRIOR-GUIDED BAYESIAN OPTIMIZATION

Abstract

While Bayesian Optimization (BO) is a very popular method for optimizing expensive black-box functions, it fails to leverage the experience of domain experts. This causes BO to waste function evaluations on bad design choices (e.g., machine learning hyperparameters) that the expert already knows work poorly. To address this issue, we introduce Prior-guided Bayesian Optimization (PrBO). PrBO allows users to inject their knowledge into the optimization process in the form of priors about which parts of the input space will yield the best performance, rather than BO's standard priors over functions, which are much less intuitive for users. PrBO then combines these priors with BO's standard probabilistic model to form a pseudo-posterior used to select which points to evaluate next. We show that PrBO is around 12× faster than state-of-the-art methods without user priors and 10,000× faster than random search on a common suite of benchmarks, and achieves state-of-the-art performance on a real-world hardware design application. We also show that PrBO converges faster even if the user priors are not entirely accurate, and that it robustly recovers from misleading priors.

1. INTRODUCTION

Bayesian Optimization (BO) is a data-efficient method for the joint optimization of design choices that has gained great popularity in recent years. It is impacting a wide range of areas, including hyperparameter optimization (Snoek et al., 2012; Falkner et al., 2018), AutoML (Feurer et al., 2015a; Hutter et al., 2018), robotics (Calandra et al., 2016), computer vision (Nardi et al., 2017; Bodin et al., 2016), environmental monitoring (Marchant & Ramos, 2012), combinatorial optimization (Hutter et al., 2011), experimental design (Azimi et al., 2012), RL (Brochu et al., 2010), Computer Go (Chen et al., 2018), hardware design (Koeplinger et al., 2018; Nardi et al., 2019), and many others. It promises greater automation, increasing both product quality and human productivity. As a result, BO is also established in many large tech companies, e.g., with Google Vizier (Golovin et al., 2017) and Facebook BoTorch (Balandat et al., 2019).

Nevertheless, domain experts often have substantial prior knowledge that standard BO cannot incorporate. Users can encode prior knowledge by narrowing the search space; however, this type of hard prior can lead to poor performance by missing important regions. BO also supports a prior over functions p(f), e.g., via a kernel function. However, this is not the prior experts have: users often know which ranges of hyperparameters tend to work best, and are able to specify a probability distribution p_best(x) to quantify these priors. For example, many users of the Adam optimizer (Kingma & Ba, 2015) know that its best learning rate is often in the vicinity of 1e-3 (give or take one order of magnitude), yet they may not know what accuracy is achievable in a new application. Similarly, Navruzyan et al. (2019) derived neural network hyperparameter priors for image datasets based on their experience with five datasets. In these cases, users know potentially good values for a new application, but cannot be certain about them.
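Concretely, such a weak belief can be expressed directly as a probability distribution over the hyperparameter itself. The sketch below encodes "learning rate near 1e-3, give or take an order of magnitude" as a log-normal prior; the specific parametrization is our illustrative choice, not one prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lr_prior(n):
    # "Best learning rate near 1e-3, give or take an order of magnitude":
    # log10(lr) ~ Normal(mean=-3, std=1), then map back to lr-space.
    return 10.0 ** rng.normal(loc=-3.0, scale=1.0, size=n)

def lr_prior_density(lr):
    # Density p_best(lr) of this log-normal prior over lr itself
    # (change-of-variables from log10(lr) introduces the 1/(lr ln 10) factor).
    z = np.log10(lr) + 3.0
    return np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * lr * np.log(10))
```

Any distribution the user can sample from and evaluate pointwise would serve the same role; the log-normal form simply matches the "order of magnitude" intuition.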
As a result, many competent users instead revert to manual search, which can fully incorporate their prior knowledge. A recent survey showed that most NeurIPS 2019 and ICLR 2020 papers that reported having tuned hyperparameters used manual search, with only a very small fraction using BO (Bouthillier & Varoquaux, 2020). In order for BO to be adopted widely, and to help facilitate faster progress in the ML community by tuning hyperparameters faster and better, it is therefore crucial to devise a method that fully incorporates expert knowledge into BO. In this paper, we introduce Prior-guided Bayesian Optimization (PrBO), a novel BO variant that combines user prior knowledge with a probabilistic model of the observations made. Our technical contributions with PrBO are:

1. PrBO bridges the TPE methodology and standard BO probabilistic models, such as GPs, RFs, or Bayesian NNs, instead of Tree-structured Parzen Estimators only.
2. PrBO is flexible w.r.t. how the prior is defined, allowing previously hard-to-inject (e.g., exponential) priors.
3. PrBO gives more importance to the model as iterations progress, gradually forgetting the prior and ensuring robustness against misleading priors.

We demonstrate the effectiveness of PrBO on a comprehensive set of real-world applications and synthetic benchmarks, showing that accurate prior knowledge helps PrBO to achieve performance similar to the current state of the art on average 12.12× faster on synthetic benchmarks and 1.49× faster on a real-world application. PrBO also achieves equal or better final performance in all but one of the benchmarks tested.
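To illustrate the third contribution, one way to gradually forget a prior is to exponentiate it with a decaying weight before multiplying it into the model's density. The β/t schedule below is a hypothetical sketch of this idea, not necessarily the exact schedule PrBO uses:

```python
def pseudo_posterior(prior_density, model_density, t, beta=10.0):
    """Combine a user prior with a probabilistic-model density.

    The prior is raised to the power beta/t, so its influence on the
    pseudo-posterior decays as the iteration count t grows and the
    model learned from observations takes over (illustrative schedule).
    """
    return prior_density ** (beta / t) * model_density
```

Early on (small t) the prior's shape dominates the pseudo-posterior; as t → ∞ the exponent tends to zero and the pseudo-posterior reduces to the model alone, which is what makes recovery from misleading priors possible.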

2. BACKGROUND

2.1. BAYESIAN OPTIMIZATION

Bayesian Optimization (BO) is an approach for optimizing an unknown function f : X → R that is expensive to evaluate over an input space X. In this paper, we aim to minimize f, i.e., find x* ∈ arg min_{x∈X} f(x). BO approximates x* with a sequence of evaluations x_1, x_2, . . . ∈ X, where each new x_{n+1} depends on the previous function values y_1, y_2, . . . , y_n at x_1, . . . , x_n. BO achieves this by building a posterior on f based on the set of evaluated points. At each BO iteration, a new point is selected and evaluated based on the posterior, and the posterior is updated to include the new point (x_{n+1}, y_{n+1}). The points explored by BO are dictated by the acquisition function, which attributes a value to each x ∈ X by balancing the predicted value and the uncertainty of the prediction at x. In this work, we choose Expected Improvement (EI) (Mockus et al., 1978) as the acquisition function; it quantifies the expected improvement over the best function value found so far:

EI_{f_inc}(x) := ∫_{−∞}^{+∞} max(f_inc − y, 0) p(y|x) dy,

where f_inc is the incumbent function value, i.e., the best objective function value found so far, and p(y|x) is given by a probabilistic model, e.g., a GP. Alternatives to EI are Probability of Improvement (PI) (Jones, 2001), upper confidence bounds (UCB) (Srinivas et al., 2010), and entropy-based methods (e.g., Hernández-Lobato et al. (2014)).
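When p(y|x) is Gaussian with mean μ(x) and standard deviation σ(x), as under a GP posterior, this integral has a well-known closed form. A minimal sketch for minimization:

```python
from statistics import NormalDist

def expected_improvement(mu, sigma, f_inc):
    # Closed-form EI for minimization when p(y|x) is Gaussian with
    # mean mu and standard deviation sigma at the candidate point x.
    if sigma <= 0.0:
        return max(f_inc - mu, 0.0)  # no uncertainty: plain improvement
    z = (f_inc - mu) / sigma
    nd = NormalDist()
    # (f_inc - mu) * Phi(z) rewards low predicted values;
    # sigma * phi(z) rewards high predictive uncertainty.
    return (f_inc - mu) * nd.cdf(z) + sigma * nd.pdf(z)
```

The two terms make the exploration/exploitation trade-off explicit: even a point predicted slightly worse than the incumbent retains positive EI if its uncertainty is large.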

2.2. TREE-STRUCTURED PARZEN ESTIMATOR

Whereas the standard probabilistic model in BO directly models p(y|x), the Tree-structured Parzen Estimator (TPE) approach of Bergstra et al. (2011) models p(x|y) and p(y) instead.¹ This is done by constructing two parametric densities, g(x) and l(x), which are computed using the observations with function value above and below a given threshold, respectively. The separating threshold y* is defined as a quantile of the observed function values. TPE uses the densities g(x) and l(x) to define p(x|y) as:

p(x|y) = l(x) I(y < y*) + g(x) (1 − I(y < y*)),

where I(y < y*) is 1 when y < y* and 0 otherwise. This parametrization of the generative model p(x, y) = p(x|y) p(y) facilitates the computation of EI, as it leads to EI_{y*}(x) ∝ l(x)/g(x) and, thus, arg max_{x∈X} EI_{y*}(x) = arg max_{x∈X} l(x)/g(x).
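The l/g mechanism can be sketched numerically as follows; for simplicity, plain fixed-bandwidth kernel density estimates stand in for TPE's adaptive Parzen estimators, and a one-dimensional input space is assumed:

```python
import numpy as np

def kde(samples, x, bandwidth=0.5):
    # Fixed-bandwidth Gaussian kernel density estimate in one dimension.
    z = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def tpe_suggest(X, y, candidates, gamma=0.25):
    # Split observations at the gamma-quantile y* of the observed values:
    # l(x) models the good points (y < y*), g(x) the rest.
    X, y = np.asarray(X), np.asarray(y)
    y_star = np.quantile(y, gamma)
    l = kde(X[y < y_star], candidates)
    g = kde(X[y >= y_star], candidates)
    # EI_{y*}(x) is proportional to l(x)/g(x); maximize the ratio.
    return candidates[np.argmax(l / (g + 1e-12))]
```

On a toy problem such as y = x², the suggested point concentrates where good observations are dense and bad ones are sparse, i.e., near the minimizer.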

3. BAYESIAN OPTIMIZATION WITH PRIORS

We propose a BO approach dubbed PrBO that allows domain experts to inject prior knowledge into the optimization in the form of priors. PrBO combines this user-defined prior with a probabilistic

¹ Note that, technically, the model does not parameterize p(y), since p(y) is computed based on the observed data points, which are heavily biased towards low values due to the optimization process. Instead, it parameterizes a dynamically changing empirical distribution of the observations {y_i}_{i=1}^t, which helps to constantly challenge the model to yield better observations.

