MALIBO: META-LEARNING FOR LIKELIHOOD-FREE BAYESIAN OPTIMIZATION

Abstract

Bayesian Optimization (BO) is a popular method to optimize expensive black-box functions. While BO typically only optimizes a single task, recent methods exploit knowledge from related tasks to warm-start BO and improve data-efficiency. However, these methods are either not scalable or sensitive to heterogeneous value scales across multiple tasks. We propose a novel approach to solve these problems by combining meta-learning with a likelihood-free acquisition function. Specifically, our meta-learning model simultaneously learns the underlying (task-agnostic) data distribution and a latent feature representation for individual tasks, to be used as the acquisition function inside BO. The likelihood-free approach makes less stringent assumptions about the problems than regression-based methods and works with any classification algorithm, making it computationally efficient and robust to different scales across tasks. Finally, we use gradient boosting as a residual model on top to adapt to distribution drifts between new and prior tasks, which might otherwise weaken the usefulness of the meta-learned features. Experiments show that the meta-model learns an effective prior for warm-starting optimization algorithms, is cheap to evaluate, and is invariant under changes of scale across different datasets.

1. INTRODUCTION

Bayesian Optimization (BO) is a widely used method to optimize expensive black-box functions (Shahriari et al., 2016) and has been successfully applied in many fields, including automated machine learning (ML) (Hutter et al., 2019). Given small amounts of data, traditional BO uses a Gaussian Process (GP) surrogate model together with an acquisition function to quickly optimize a black-box function. However, most BO techniques start from scratch for each new optimization problem instead of leveraging information from previous runs on similar tasks to further improve data-efficiency. To warm-start BO, exploiting additional task information has been explored in the context of transfer learning (Weiss et al., 2016) and meta-learning (Vanschoren, 2018). Prior knowledge can be used to build informed surrogate models (Schilling et al., 2016; Wistuba et al., 2018; Feurer et al., 2018b; Perrone et al., 2018), to restrict the search space (Perrone et al., 2019), or to warm-start the optimization with configurations that generally score well (Feurer et al., 2014; Salinas et al., 2020). However, these approaches face three important issues: (i) GPs scale poorly due to their cubic computational complexity in the number of observations (Rasmussen, 2004). (ii) The standard BO framework requires a surrogate model with well-calibrated and tractable predictive uncertainty, which is challenging to obtain in high-dimensional problems (Tiao et al., 2021; Song et al., 2022). (iii) Regression models, including GPs, struggle with different scales and noise levels across tasks, which hurts warm-starting and optimization efficiency (Feurer et al., 2018a). We propose a new meta-learning BO approach that effectively transfers knowledge from related tasks and scales to large datasets.
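For context, the traditional single-task GP-based BO loop described above can be sketched as follows. This is an illustrative from-scratch sketch, not the paper's method: the helper names `bo_minimize` and `expected_improvement`, the random-candidate search, and all hyperparameters are our own assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, y_best):
    """Expected improvement acquisition for minimization."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (y_best - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def bo_minimize(f, bounds, n_init=5, n_iter=15, seed=0):
    """Minimal single-task BO loop: GP surrogate + EI, starting from scratch."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, size=(n_init, 1))
    y = f(X).ravel()
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(*bounds, size=(256, 1))   # random candidate pool
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[None, :]))
    return X[np.argmin(y)], y.min()

# usage: minimize a toy quadratic with optimum at x = 0.3
x_best, y_best = bo_minimize(lambda X: (X - 0.3) ** 2, bounds=(0.0, 1.0))
```

Note that each new task refits the GP from scratch, and each GP fit costs cubic time in the number of observations, which is precisely the baseline the meta-learning approaches below improve upon.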
Our method is inspired by the idea of likelihood-free BO (Bergstra et al., 2011; Tiao et al., 2021; Song et al., 2022), which replaces the surrogate model with a meta-learned classifier that directly balances exploration and exploitation without modeling the objective function. In this way, we elegantly avoid both the scalability and the scale-sensitivity issues simultaneously. We make the following contributions: (i) A novel probabilistic meta-learning model that uses Bayesian logistic regression and a probabilistic approach to learn feature representations from prior tasks. (ii) A scalable BO technique with good anytime performance that combines a meta-learned classifier with a likelihood-free acquisition function. (iii) Robust adaptation to new tasks by combining the meta-learned classifier with gradient boosting to correct prediction errors and with Thompson sampling for more exploration.
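Contribution (iii), gradient boosting as a residual model on top of a frozen meta-learned classifier, can be sketched with scikit-learn. This is a minimal illustration under our own assumptions, not the authors' implementation: `MetaPrior` is a hypothetical stand-in for the meta-learned classifier, and we use the `init` parameter of `GradientBoostingClassifier`, which makes boosting fit the residual errors of an initial estimator; Thompson sampling is omitted.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import GradientBoostingClassifier

class MetaPrior(BaseEstimator):
    """Hypothetical stand-in for a frozen meta-learned classifier:
    a fixed probability that a configuration is 'good', learned on prior tasks."""
    def fit(self, X, y):
        return self                          # frozen: nothing to fit on the new task
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(X[:, 0]))    # toy prior: small x[0] is good
        return np.column_stack([1 - p, p])

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + 0.3 * X[:, 1] < 0).astype(int)  # labels on the new target task

# Boosting starts from the prior's log-odds and corrects its residual errors,
# mirroring the residual-adaptation idea described above.
model = GradientBoostingClassifier(init=MetaPrior(), n_estimators=50,
                                   random_state=0).fit(X, y)
scores = model.predict_proba(X)[:, 1]
```

The design point illustrated here: when the new task drifts from the prior (the toy prior ignores `x[1]`), the residual trees recover the difference while the prior still shapes early predictions.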

2. RELATED WORK

Several methods simultaneously learn the initial design and modify the surrogate model. Springenberg et al. (2016) apply task-specific embeddings for BO and use a Bayesian NN as the surrogate model, which is computationally expensive and hard to train. Perrone et al. (2018) propose Adaptive Bayesian Linear Regression (ABLR), which uses a NN to learn a shared feature representation across tasks with task-specific BLR layers to improve scalability and adaptability. However, both methods are sensitive to changes in the scale and noise level across datasets. To tackle this, Salinas et al. (2020) propose Gaussian Copula Process Plus Prior (GC3P), which transforms the response values via the empirical CDF and fits a NN across all prior tasks. This NN is used to warm-start the optimization and to predict the mean of a GP on the target task. Despite its robustness, the use of a GP surrogate still limits its applicability to high-dimensional problems. Our meta-learning method is closely related to Bayesian optimization with NNs and embedding reasoning (BANNER, Berkenkamp et al. (2021)), which uses a meta-learning model based on a NN to learn a latent representation together with a task-specific BLR layer, similar to ABLR. In contrast to ABLR, however, the model output is divided into a task-independent mean and a task-specific residual prediction learned by the BLR layer. In this paper, we introduce a classifier variant of this meta-learning model and combine it with a likelihood-free acquisition function.

Likelihood-free BO approaches address two drawbacks of traditional GP-based BO methods: the computationally expensive inference and the lack of flexibility caused by the strong assumptions of most kernel methods. Rather than modeling the objective function, likelihood-free BO methods can use deterministic classifiers to separate good from bad configurations, resulting in scale-invariant models.
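The shared-features-plus-BLR construction behind ABLR and BANNER can be illustrated with a minimal Bayesian linear regression sketch. Here random cosine features stand in for the meta-learned NN representation, and `alpha` (prior precision) and `beta` (noise precision) are assumed values; all names are ours, not from the cited papers.

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=25.0):
    """Posterior over BLR weights given fixed features Phi (n x d):
    precision S_inv = alpha*I + beta*Phi^T Phi, mean m = beta*S Phi^T y."""
    d = Phi.shape[1]
    S_inv = alpha * np.eye(d) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y
    return m, S

def predict(Phi_new, m, S, beta=25.0):
    """Predictive mean and variance; variance never drops below noise 1/beta."""
    mu = Phi_new @ m
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S, Phi_new)
    return mu, var

# Random cosine features as a stand-in for a meta-learned NN feature map.
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 32))
b = rng.uniform(0, 2 * np.pi, 32)
phi = lambda X: np.sqrt(2 / 32) * np.cos(X @ W + b)

X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0])                       # toy per-task objective
m, S = blr_posterior(phi(X), y)
mu, var = predict(phi(X), m, S)
```

Because only the cheap linear head is refit per task while the feature map is shared, this scales far better than a full GP; the flip side, as noted above, is sensitivity to per-task output scales, since `y` enters the linear head directly.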

Bayesian optimization does not require an explicit model of the likelihood of the observed values (Garnett, 2022). Tree-structured Parzen Estimators (TPE, Bergstra et al. (2011)) phrase BO as a density-ratio estimation problem (Sugiyama et al., 2012) and use the density ratio between 'good' and 'bad' configurations as an acquisition function, without a probabilistic regression model. Tiao et al. (2021) estimate the density ratio through class-probability estimation (Qin, 1998), which is equivalent to modeling the acquisition function with a binary classifier. Likelihood-Free BO (LFBO, Song et al. (2022)) improves upon this by weighting the observations.
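The classifier-as-acquisition idea described above can be sketched in a few lines. This is an illustrative sketch under our own assumptions, not the cited implementations: observations below the `gamma`-quantile of the objective are labeled 'good', any off-the-shelf classifier can be used (gradient boosting here), and its predicted probability of 'good' serves directly as the acquisition function.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def classifier_acquisition(X, y, gamma=0.25, seed=0):
    """Fit a 'good vs. bad' classifier and return an acquisition function
    (TPE/LFBO-style sketch; hypothetical helper, assuming minimization)."""
    tau = np.quantile(y, gamma)              # threshold splitting good/bad
    z = (y <= tau).astype(int)               # 1 = 'good' configuration
    clf = GradientBoostingClassifier(random_state=seed).fit(X, z)
    # Higher probability of 'good' means a more promising configuration.
    return lambda X_new: clf.predict_proba(X_new)[:, 1]

# usage: rank candidate configurations on a toy quadratic objective
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 2))
y = (X ** 2).sum(axis=1)                     # minimum at the origin
acq = classifier_acquisition(X, y)
candidates = rng.uniform(-2, 2, size=(200, 2))
best = candidates[np.argmax(acq(candidates))]
```

Because the labels depend only on the rank of the observed values relative to the quantile threshold, any monotone rescaling of `y` leaves the acquisition unchanged, which is the source of the scale invariance emphasized throughout this section.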

Meta-learning Various methods have been proposed that improve the data-efficiency of Bayesian optimization (BO) by leveraging observations from similar tasks. Depending on the context, they apply meta-learning (Vanschoren, 2018) or transfer learning (Weiss et al., 2016), and they have proven effective in various applications (Andrychowicz et al., 2016; Finn et al., 2017). We refer to Vanschoren (2018) for an in-depth overview. One line of work adapts the initial design to warm-start BO, either by reducing the search space (Perrone et al., 2019; Li et al., 2022) or by reusing good configurations from similar tasks, where similarity can be based on hand-crafted features (Feurer et al., 2014) or learned with neural networks (… et al., 2017; Marco et al., 2017), or weighted combinations of independent GPs for different tasks (Schilling et al., 2016; Wistuba et al., 2018; Feurer et al., 2018a).

