PROBABILISTIC META-LEARNING FOR BAYESIAN OPTIMIZATION

Abstract

Transfer and meta-learning algorithms leverage evaluations on related tasks in order to significantly speed up learning or optimization on a new problem. For applications that depend on uncertainty estimates, e.g., in Bayesian optimization, recent probabilistic approaches have shown good performance at test time, but either scale poorly with the number of data points or under-perform with little data on the test task. In this paper, we propose a novel approach to probabilistic transfer learning that uses a generative model for the underlying data distribution and simultaneously learns a latent feature distribution to represent unknown task properties. To enable fast and accurate inference at test time, we introduce a novel meta-loss that structures the latent space to match the prior used for inference. Together, these contributions ensure that our probabilistic model exhibits high sample-efficiency and provides well-calibrated uncertainty estimates. We evaluate the proposed approach and compare its performance to probabilistic models from the literature on a set of Bayesian optimization transfer-learning tasks.

1. INTRODUCTION

Bayesian optimization (BO) is arguably one of the most proven and widely used black-box optimization frameworks for expensive functions (Shahriari et al., 2015), with applications that include materials design (Frazier & Wang, 2016), reinforcement learning (Metzen et al., 2015), and automated machine learning (ML) (Hutter et al., 2019). In practical applications, BO is repeatedly used to solve variations of similar tasks. In these cases, the sample efficiency can be further increased by not starting the optimization from scratch, but rather leveraging previous runs to inform and accelerate the latest one. Several approaches to this have emerged under the names of transfer learning (Weiss et al., 2016) and meta-learning (Vanschoren, 2018). Compared to early work by Swersky et al. (2013); Golovin et al. (2017), recent publications leverage the representational flexibility of neural networks, which allows for more powerful models and impressive results (Gordon et al., 2019; Rusu et al., 2019; Garnelo et al., 2018a;b; Zintgraf et al., 2019). Despite these significant advances, only a small subset of algorithms offers the well-calibrated uncertainty estimates on which BO relies to guide its sampling strategy efficiently. Additionally, BO benefits greatly from a meaningful prior over tasks that quickly converges to the true function to provide the highest sample efficiency. Existing work mostly focuses on deterministic models and, for those providing uncertainty estimates, sample-efficiency at test time is often a challenge.

Contributions

We set out to close this gap and introduce BAyesian optimization with Neural Networks and Embedding Reasoning (BaNNER), a flexible meta-learning method for BO. We go beyond the previous work of Perrone et al. (2018) and introduce a generative regression model explicitly conditioned on a low-dimensional latent representation for the tasks. This allows our model to (i) encode a meaningful prior over tasks and (ii) remain highly sample-efficient, since each new task only requires inference over a low-dimensional latent representation. To ensure robust training of our model, we introduce a novel loss function to regularize the latent distribution and optimize our model's hyper-parameters using the available meta-data. We evaluate BaNNER on a set of synthetic benchmarks and two meta-learning problems and compare against the state of the art in the literature.
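As a minimal illustration of the core idea, the sketch below conditions a small regression network on a low-dimensional latent task variable z, so that adapting to a new task only requires inference over z while the meta-parameters θ stay fixed. The two-layer architecture, weight shapes, and variable names are our own illustrative assumptions, not the model architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: a regression model g_theta(x, z) explicitly
# conditioned on a low-dimensional latent task variable z.  The shapes
# and the 2-layer MLP are illustrative assumptions only.
D_X, D_Z, D_H = 1, 2, 16
theta = {
    "W1": rng.standard_normal((D_X + D_Z, D_H)) * 0.5,
    "b1": np.zeros(D_H),
    "W2": rng.standard_normal((D_H, 1)) * 0.5,
    "b2": np.zeros(1),
}

def g(x, z, theta):
    """Mean prediction for inputs x (shape [N, D_X]) under latent task code z."""
    z_tiled = np.broadcast_to(z, (x.shape[0], D_Z))
    h = np.tanh(np.concatenate([x, z_tiled], axis=1) @ theta["W1"] + theta["b1"])
    return h @ theta["W2"] + theta["b2"]

x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
y_task_a = g(x, np.array([0.3, -0.1]), theta)   # same inputs, different latent
y_task_b = g(x, np.array([-1.2, 0.8]), theta)   # codes -> different functions
```

Because θ is shared across tasks, only the handful of entries in z needs to be inferred at test time, which is what keeps per-task adaptation cheap.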

2. PROBLEM STATEMENT AND BACKGROUND

Our goal is to efficiently optimize an unknown function f(x, τ) over a domain x ∈ X for some unknown but fixed task parameters τ that are sampled from an unknown distribution τ ∼ p(T). To this end, at each iteration n we can select function parameters x_n and observe a noisy function value y_n = f(x_n, τ) + ε_n, with ε_n drawn i.i.d. from some distribution p_ε. While our method can handle arbitrary noise distributions, we assume a Gaussian distribution, i.e., p_ε = N(0, σ²), for the remainder of the paper. We assume that each evaluation of f is expensive, either in terms of monetary cost or time, so that we want to minimize the number of evaluations of f during the optimization process. The most data-efficient class of algorithms for this setting are Bayesian optimization (BO) algorithms, which use the observations collected up to iteration n, D_n = {(x_i, y_i)}_{i=1}^{n-1}, in order to infer a posterior belief over the function values f(x, τ). To select parameters x_n that are informative about the optimum max_x f(x, τ), BO algorithms define an acquisition function α(·) that uses the posterior belief to select parameters as x_n = argmax_x α(p(f(x, τ) | D_n)). While BO algorithms have been studied extensively, their performance crucially depends on the properties of the statistical model used for f. The two key requirements for BO algorithms to be data-efficient are (i) that the prior belief over f concentrates quickly on the true function f as we observe data in D_n, and (ii) that the posterior uncertainty estimates are calibrated, so that the model always considers the true function f to be statistically plausible. The latter requirement means that the true function f(·, τ) must always be contained in the model's confidence intervals with high probability. Since the task parameters τ are unknown, this in general requires a conservative model that works well for all possible tasks τ. We propose to use meta-learning in order to learn an effective prior (Fig. 1a) that can quickly adapt to a new task τ (Fig. 1b). We are given data from T previous tasks τ_t ∼ p(T) with N_t observations D_t^meta = {(x_{n,t}, y_{n,t})}_{n=1}^{N_t} each. We show the resulting generative model on the left in Fig. 2. Meta-learning aims to distill the information in D^meta into a model g_θ by optimizing the meta-parameters θ. At test time, we then keep these parameters fixed and use the learned model g_θ to speed up the optimization of the new function f(·, τ).
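The selection rule x_n = argmax_x α(p(f(x, τ) | D_n)) can be sketched with a generic surrogate model. Below we use a simple Gaussian-process posterior with an RBF kernel and an upper-confidence-bound acquisition α(x) = μ(x) + β·σ(x) purely as stand-ins; the kernel, the value of β, and the toy objective are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """RBF kernel between two 1-D point sets (illustrative lengthscale)."""
    d = A.reshape(-1, 1) - B.reshape(1, -1)
    return np.exp(-0.5 * (d / ls) ** 2)

def posterior(X, y, Xs, sigma2=1e-4):
    """GP posterior mean and variance at query points Xs."""
    K = rbf(X, X) + sigma2 * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 0.0, None)
    return mu, var

def f(x):  # toy stand-in for the expensive unknown function f(x, tau)
    return np.sin(3 * x) * (1 - x)

Xs = np.linspace(0.0, 1.0, 200)          # candidate grid over the domain X
X = np.array([0.1, 0.9])                  # initial observations D_n
y = f(X)
for n in range(8):
    mu, var = posterior(X, y, Xs)
    alpha = mu + 2.0 * np.sqrt(var)       # UCB acquisition, beta = 2
    x_next = Xs[np.argmax(alpha)]         # x_n = argmax_x alpha(...)
    X, y = np.append(X, x_next), np.append(y, f(x_next))

best = X[np.argmax(y)]
```

The loop makes the two requirements above concrete: a prior that concentrates quickly means μ approaches f after few observations, and calibrated uncertainty means the true f stays inside the μ ± 2σ band, so the argmax of α is never misled.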

3. RELATED WORK

There are several approaches to improve the sample efficiency of BO methods based on information from related tasks. We refer to (Vanschoren, 2018) for a broad review of meta-learning in the context of automated machine learning and focus on the most relevant approaches below. One strategy to improve sample-efficiency is to initialize the BO algorithm with high-quality query points. These initial configurations can be either constructed to be complementary (Feurer et al., 2014; 2015; Lindauer & Hutter, 2018) or learned based on data set features (Kim et al., 2017) . An



[Figure 1 panels: (a) example tasks and the meta-learned prior; (b) meta-learned posterior after two observations. Axes: parameters x vs. function values f(x, τ); shaded band shows g_θ(x) (BaNNER).]

Figure 1: Example application of BaNNER. Fig. 1a shows functions f(·, τ_t) based on samples τ_t ∼ p(T) for a parameterized Forrester function, together with the 2σ confidence interval of the meta-learned prior. In Fig. 1b we see the corresponding posterior distribution after two data points (blue circles) for a specific test function f(x, τ). The confidence intervals contain the true function and collapse quickly, which enables highly efficient Bayesian optimization. More plots in Fig. 6 (Appendix A.2).
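For reference, the standard Forrester function is f(x) = (6x − 2)² sin(12x − 4) on [0, 1], and a common way to obtain a family of related tasks is the affine parameterization a·f(x) + b·(x − 0.5) + c. The sketch below samples tasks this way; the parameter ranges, and whether the figure's tasks are generated exactly like this, are our assumptions for illustration.

```python
import numpy as np

def forrester(x, a=1.0, b=0.0, c=0.0):
    """Parameterized Forrester function; tau = (a, b, c) plays the
    role of the unknown task parameters (illustrative assumption)."""
    base = (6 * x - 2) ** 2 * np.sin(12 * x - 4)
    return a * base + b * (x - 0.5) + c

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
# Sample a handful of tasks tau_t ~ p(T) with illustrative ranges.
tasks = [
    forrester(x,
              a=rng.uniform(0.5, 1.5),
              b=rng.uniform(-5.0, 5.0),
              c=rng.uniform(-5.0, 5.0))
    for _ in range(5)
]
```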

