MALIBO: META-LEARNING FOR LIKELIHOOD-FREE BAYESIAN OPTIMIZATION

Abstract

Bayesian Optimization (BO) is a popular method to optimize expensive black-box functions. While BO typically only optimizes a single task, recent methods exploit knowledge from related tasks to warm-start BO and improve data-efficiency. However, these methods are either not scalable or sensitive to heterogeneous value scales across multiple tasks. We propose a novel approach to solve these problems by combining meta-learning with a likelihood-free acquisition function. Specifically, our meta-learning model simultaneously learns the underlying (task-agnostic) data distribution and a latent feature representation for individual tasks to be used as the acquisition function inside BO. The likelihood-free approach makes less stringent assumptions about the problems than regression-based methods and works with any classification algorithm, making it computationally efficient and robust to different scales across tasks. Finally, we use gradient boosting as a residual model on top to adapt to distribution drifts between new and prior tasks, which might otherwise weaken the usefulness of the meta-learned features. Experiments show that the meta-model learns an effective prior for warm-starting optimization algorithms, is cheap to evaluate, and is invariant under changes of scale across different datasets.

1. INTRODUCTION

Bayesian Optimization (BO) is a widely used method to optimize expensive black-box functions (Shahriari et al., 2016) and has been successfully applied in different fields, including automated machine learning (ML) (Hutter et al., 2019). Given small amounts of data, traditional BO uses a Gaussian Process (GP) surrogate model together with an acquisition function to quickly optimize a black-box function. However, most BO techniques start from scratch for each new optimization problem, instead of leveraging information from previous runs on similar tasks to further improve data-efficiency. To warm-start BO, exploiting additional task information has been explored in the context of transfer learning (Weiss et al., 2016) and meta-learning (Vanschoren, 2018). Prior knowledge can be used to build informed surrogate models (Schilling et al., 2016; Wistuba et al., 2018; Feurer et al., 2018b; Perrone et al., 2018), to restrict the search space (Perrone et al., 2019), or to warm-start the optimization with configurations that generally score well (Feurer et al., 2014; Salinas et al., 2020). However, these approaches have three important issues: (i) GPs scale poorly due to their cubic computational complexity in the number of observations (Rasmussen, 2004). (ii) The standard BO framework requires a surrogate model with well-calibrated and tractable predictive uncertainty, which is challenging in high-dimensional problems (Tiao et al., 2021; Song et al., 2022). (iii) Regression models, including GPs, struggle with different scales and noise levels across tasks, which hurts warm-starting and optimization efficiency (Feurer et al., 2018a). We propose a new meta-learning BO approach that can effectively transfer knowledge from related tasks and scales to large datasets.
Our method is inspired by the idea of likelihood-free BO (Bergstra et al., 2011; Tiao et al., 2021; Song et al., 2022), which replaces the surrogate model with a meta-learned classifier that directly balances exploration and exploitation without modeling the objective function. That way, we avoid both the scalability and the scale-sensitivity issues. We make the following contributions: (i) A novel probabilistic meta-learning model that uses Bayesian logistic regression and a probabilistic approach to learn feature representations from prior tasks. (ii) A scalable BO technique with good anytime performance that combines a meta-learning classifier and a likelihood-free acquisition function. (iii) Robust adaptation to new tasks by combining the meta-learned classifier with gradient boosting to correct prediction errors and with Thompson sampling for more exploration.

2. RELATED WORK

Meta-learning Various methods have been proposed that improve the data-efficiency of Bayesian optimization (BO) by leveraging the information of previous observations from similar tasks. They apply meta-learning (Vanschoren, 2018) or transfer-learning (Weiss et al., 2016) depending on the context and have been proven effective in various applications (Andrychowicz et al., 2016; Finn et al., 2017) . We refer to Vanschoren (2018) for an in-depth overview. One line of work adapts the initial design to warm-start BO, either by reducing the search space (Perrone et al., 2019; Li et al., 2022) or reusing good configurations from similar tasks, where similarity can be based on hand crafted features (Feurer et al., 2014) or learned with Neural Networks (NNs) (Kim et al., 2017) . Alternative approaches estimate the usefulness of a given configuration on both the current and prior tasks based on heuristics (Wistuba et al., 2015) or learning-based (Volpp et al., 2020) methods. Other approaches use transfer-learning to modify the probabilistic surrogate model, for instance using a multi-task GP (Swersky et al., 2013; Tighineanu et al., 2022) , an additive GP model (Golovin et al., 2017; Marco et al., 2017) , or weighted combinations of independent GPs for different tasks (Schilling et al., 2016; Wistuba et al., 2018; Feurer et al., 2018a) . Several methods simultaneously learn the initial design and modify the surrogate model. Springenberg et al. (2016) apply task-specific embeddings for BO and use a Bayesian NN as the surrogate model, which are computationally expensive and hard to train. Perrone et al. (2018) propose Adaptive Bayesian Linear Regression (ABLR), which uses a NN to learn a shared feature representation across tasks with task-specific BLR layers to improve scalability and adaptability. However, both these methods are sensitive to changes in the scale and noise level across datasets. To tackle this, Salinas et al. 
(2020) propose Gaussian Copula Process Plus Prior (GC3P), which transforms the data response values via the empirical CDF, and fits a NN across all prior tasks. This NN is used to warm-start the optimization and predict the mean for a GP on the target task. Despite its robustness, the use of a GP surrogate still limits its applicability on high-dimensional problems. Our meta-learning method is closely related to Bayesian optimization with NNs and embedding reasoning (BANNER, Berkenkamp et al. (2021) ), which uses a meta-learning model based on a NN to learn a latent representation and a task-specific BLR layer, similar to ABLR. However, the model output is divided into a task-independent mean and a task-specific residual prediction learned by a BLR layer. In this paper, we introduce a classifier variant of this meta-learning model and combine it with a likelihood-free acquisition function. Likelihood-free Bayesian Optimization Bayesian optimization does not require an explicit model of the likelihood of the observed values (Garnett, 2022) . Tree-structured Parzen Estimators (TPE, Bergstra et al. (2011) ) phrase BO as a density ratio estimation problem (Sugiyama et al., 2012) and use the density ratio over 'good' and 'bad' configurations as an acquisition function without a probabilistic regression model. Tiao et al. (2021) estimate the density ratio through class probability estimation (Qin, 1998) , which is equivalent to modeling the acquisition function with a binary classifier. Likelihood-Free BO (LFBO, Song et al. (2022) ) improves upon this by weighting the observations. Likelihood-free BO approaches address two drawbacks of traditional GP-based BO methods: the computationally expensive inference and the lack of flexibility due to the strong assumptions of most kernel methods. 
Rather than modeling the objective function, likelihood-free BO methods can use deterministic classifiers to separate good and bad configurations. This results in scale-invariant models and allows the application of any binary classification method (Tiao et al., 2021; Song et al., 2022). We leverage this flexibility to create a new meta-learning classifier, which yields a scalable method that is robust to heterogeneous scales across datasets.

3. PROBLEM STATEMENT AND BACKGROUND

In this section, we introduce our problem setting and review the methods we build on.

3.1. BAYESIAN OPTIMIZATION

Bayesian optimization (BO) aims to optimize a black-box function $f : \mathcal{X} \to \mathbb{R}$ over $x \in \mathcal{X}$. At each step $n$, BO proposes a point $x_n$ and obtains a noisy observation $y_n = f(x_n) + \epsilon_n$. The proposal is based on a probabilistic surrogate model $\mathcal{M}$ and all previous observations $\mathcal{D}_{n-1} = \{(x_i, y_i)\}_{i=1}^{n-1}$. The model's posterior prediction $p(y \mid x, \mathcal{D}_{n-1})$, combined with the acquisition function $\alpha(x; \mathcal{D}_{n-1})$, quantifies the utility of each input, and the next point is chosen as $x_n = \arg\max_{x \in \mathcal{X}} \alpha(x; \mathcal{D}_{n-1})$. Typically, BO algorithms use a Gaussian Process model and assume $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ for some unknown but fixed variance. Most acquisition functions, including Knowledge-Gradient (Frazier et al., 2009), Predictive Entropy Search (Hernández-Lobato et al., 2014) and Max-value Entropy Search (Wang & Jegelka, 2017), are defined as the expectation of a utility function $U(y; \tau)$, usually with a threshold $\tau$ that heuristically balances exploration and exploitation. For example, the prevalent Expected Improvement (EI, Močkus (1975)) acquisition function has $U(y; \tau) := \max(\tau - y, 0)$, while the Probability of Improvement (PI, Kushner (1964)) has $U(y; \tau) := \mathbb{1}(\tau - y > 0)$. The expected utility over the posterior belief from the surrogate model $p(y \mid x, \mathcal{D}_n)$ is given by
$$\alpha_U(x; \mathcal{D}_n, \tau) = \mathbb{E}_{y \sim p(y \mid x, \mathcal{D}_n)}[U(y; \tau)] = \int U(y; \tau)\, p(y \mid x, \mathcal{D}_n)\, dy, \tag{1}$$
where the threshold is usually chosen as the lowest observed function value, i.e., $\tau = \min_{(x_i, y_i) \in \mathcal{D}_n} y_i$. Tiao et al. (2021) propose to improve several aspects of TPE by directly estimating the density ratio (DR) rather than solving the more challenging problem of modeling two independent densities as an intermediate step. Their approach, dubbed BORE, rephrases the DR estimation as a binary classification problem when using the training loss

$$\mathcal{L}_{\text{BORE}}(\theta; \mathcal{D}_N, \tau) = -\frac{1}{N} \sum_{n=1}^{N} \big( k_n \log C_\theta(x_n) + (1 - k_n) \log(1 - C_\theta(x_n)) \big). \tag{2}$$
Here $k_n = \mathbb{1}(y_n \le \tau)$ are the binary class labels predicted by the classifier $C_\theta$ with learnable parameters $\theta$. Specifically, they show that $\alpha_{\text{DR}}(x; \mathcal{D}_N, \tau) \propto C_\theta(x)$. Although Tiao et al. (2021) argue that BORE resembles EI, assigning the same label to all observations with $y < \tau$ regardless of the magnitude of improvement conforms to the definition of PI rather than EI, and might lead to conservative optimization with little global exploration (Garnett, 2022; Song et al., 2022). In practice, EI exhibits stronger and more robust performance than PI. Song et al. (2022) convert the expected utility of EI into an optimization problem using a variational representation and reformulate it as a classification problem with the following objective:
$$\mathcal{L}_{\text{LFBO}}(\theta; \mathcal{D}_N, \tau) = -\mathbb{E}_{(x, y) \sim p(x, y \mid \mathcal{D}_N)} \big[ \max(\tau - y, 0) \log C_\theta(x) + \log(1 - C_\theta(x)) \big]. \tag{3}$$
The resulting method, dubbed Likelihood-Free Bayesian Optimization (LFBO), can be seen as a weighted classification problem with noisy targets for class $k = 1$ only, where the EI utility function $\max(\tau - y, 0)$ weights observations below $\tau$ by their improvement. The minimizer of Eq. (3) is shown to be equivalent to the EI acquisition function (Song et al., 2022).
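To make the two training targets concrete, the following minimal sketch builds the BORE labels $k_n = \mathbb{1}(y_n \le \tau)$ and the LFBO weights $\max(\tau - y_n, 0)$ from a list of observed values, and evaluates a sample estimate of Eq. (3). The helper names and the choice of $\tau$ as the empirical $\gamma$-quantile are illustrative assumptions, not the authors' implementation.

```python
import math

def threshold(ys, gamma):
    """Empirical gamma-quantile of the observed values (assumed choice of tau)."""
    ordered = sorted(ys)
    # index of the ceil(gamma * N)-th smallest value, never below the minimum
    idx = max(math.ceil(gamma * len(ordered)) - 1, 0)
    return ordered[idx]

def bore_labels(ys, tau):
    """Binary labels k_n = 1(y_n <= tau) used by BORE."""
    return [1 if y <= tau else 0 for y in ys]

def lfbo_weights(ys, tau):
    """EI utility max(tau - y_n, 0) used by LFBO to weight the positive class."""
    return [max(tau - y, 0.0) for y in ys]

def lfbo_loss(probs, ys, tau):
    """Sample estimate of the LFBO objective for classifier outputs probs."""
    total = 0.0
    for c, y in zip(probs, ys):
        total += max(tau - y, 0.0) * math.log(c) + math.log(1.0 - c)
    return -total / len(ys)

ys = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.6]
tau = threshold(ys, gamma=1 / 3)
labels = bore_labels(ys, tau)
weights = lfbo_weights(ys, tau)

# Rescaling the objective by a positive constant leaves the labels unchanged.
scaled = [1000.0 * y for y in ys]
labels_scaled = bore_labels(scaled, threshold(scaled, gamma=1 / 3))
```

The last two lines illustrate why quantile-based labels are insensitive to the value scale of a task: only the ranking of the observations matters for the labels, whereas the LFBO weights additionally carry the size of the improvement.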

4. METHODOLOGY

In this section, we introduce our MetA-learning for LIkelihood-free BO method, dubbed MALIBO. It incorporates a probabilistic meta-learning approach derived from BANNER into the LFBO framework. In particular, we show how to convert BANNER's regression model into a meta-learning classifier that warm-starts BO and combine it with gradient boosting (Friedman, 2001) . This yields good anytime performance, more flexibility in selecting the model, and better adaptation to new tasks. We first explain the different components and then summarize our algorithm in Algorithm 1.

4.1. META-LEARNING

Meta-learning for optimization strives to extract information from previous tasks and use it to accelerate the optimization of a new one. Like all methods discussed in Section 2, we only consider the case of identical input spaces $\mathcal{X}$ across all tasks, which simplifies the learning problem. Our goal is to learn a probabilistic model for the likelihood-free acquisition function from the meta-data, as illustrated in Fig. 1 for a one-dimensional function. Following the approach from Berkenkamp et al. (2021) and Perrone et al. (2018), our meta-learning classifier uses a deterministic, task-agnostic model $h$ to map the input space into a feature space, $\phi = h(x)$, of predefined dimensionality. In this feature space, we seek a simple model that can classify the data on all prior tasks using a task-specific adaptation. We propose to use Bayesian logistic regression with a minor modification. In addition to the standard linear model $\phi^\top z_t$, where $z_t$ represents prior task $t$ in the learned feature space, and the sigmoid function $\sigma$ that converts the linear model into class probabilities, we introduce a task-agnostic mean function $m(\phi)$ that acts as a non-linear bias. We provide an overview of our method in Fig. 2. The meta-learning model $g(x) = m(h(x))$ learns a global prior for the classification problem, and the task-specific embedding vectors $z_t$ represent the necessary adaptations. For better inference on new tasks, we regularize the distribution of the $z$ values during training. We follow the technique used by Berkenkamp et al. (2021) and Saseendran et al. (2021) to bring the $z_t$ close to the prior distribution $p(Z) = \mathcal{N}(0, I)$, using a modified Kolmogorov-Smirnov test and the covariance to measure the disparity between the two distributions. The loss used for training on the meta-data reads
$$\mathcal{L}_{\text{meta}} = \min_{g(\cdot),\, z_1, \dots, z_T} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{\text{LFBO}}(k_t; g(x_t), z_t) + \lambda R(\{z_1, \dots, z_T\}; p(Z)), \tag{4}$$
where we average the contributions of all $T$ prior tasks and $R$ is the regularization term outlined by Berkenkamp et al. (2021), weighted by $\lambda$.

4.2. TASK ADAPTATION VIA BAYESIAN LOGISTIC REGRESSION

After training the meta-model, we need to adapt our predictions to each new task. Hence, we must estimate a value for $z$ that yields good classification results for the predictions $C(x) = \sigma(g(x) + z^\top h(x))$. While we could greedily optimize Eq. (3), we pursue a Bayesian approach, Bayesian logistic regression in particular, to capture the uncertainty in the task-specific $z$. Especially with few observations on the new task, using a single $z$ vector can introduce a strong bias. Besides stronger exploration in the early phases of optimization, the Bayesian approach allows us to employ Thompson sampling (Thompson, 1933) for $z$, as shown in Fig. 1. We believe this to be a valuable strategy for parallelization, which is briefly explored in Appendix E. Kandasamy et al. (2018) showed that this bypasses the sequential scheme of traditional BO without introducing the common computational burden of more sophisticated methods. See Garnett (2022) for more details. Bayesian logistic regression is a simple yet powerful classification model with a Bayesian treatment. Although exact Bayesian inference is intractable, the Laplace and probit approximations allow us to approximate the posterior over the weights and the predictive distribution, respectively, while remaining reliable and scalable (Bishop & Nasrabadi, 2006; Murphy, 2012). The Laplace approximation fits a Gaussian distribution around the maximum-a-posteriori (MAP) estimate of the weight distribution and matches the second-order derivative at the optimum. In the first step, we obtain the MAP estimate by maximizing the posterior of our classifier $C$ parameterized by $z$. To be consistent with the regularization used during meta-training, we use a standard, isotropic Gaussian prior for the weights: $p(z) = \mathcal{N}(z \mid m_0, \Sigma_0)$, with mean $m_0 = 0$ and covariance matrix $\Sigma_0 = I$. Given observations $\mathcal{D}_N$, the negative log posterior w.r.t. $z$ is given, up to an additive constant, by
$$\mathcal{L}_{\text{MALIBO}} = \frac{1}{2} (z - m_0)^\top \Sigma_0^{-1} (z - m_0) - \sum_{n=1}^{N} \big( k_n (\tau - y_n) \log C(x_n) + \log(1 - C(x_n)) \big), \tag{5}$$
and defines the MAP estimate of the weights via $z_{\text{MAP}} = \arg\min_{z \in \mathcal{Z}} \mathcal{L}_{\text{MALIBO}}$. Note that $k_n (\tau - y_n) = \max(\tau - y_n, 0)$ is the LFBO weight from Eq. (3). As a second step of the Laplace approximation, we compute the negative Hessian of the log posterior,
$$\Sigma_N^{-1} = -\nabla\nabla \ln p(z \mid \mathcal{D}_N) = \Sigma_0^{-1} + \sum_{n=1}^{N} \hat{k}_n (1 - \hat{k}_n)\, \phi_n \phi_n^\top, \tag{6}$$
where $\hat{k}_n = C(x_n)$ denotes the predicted class probability for $x_n$. This matrix serves as the precision matrix of the approximate posterior $q(z) = \mathcal{N}(z \mid z_{\text{MAP}}, \Sigma_N)$. To finally compute the approximate predictive distribution, we need to marginalize w.r.t. $p(z \mid \mathcal{D}_N)$:
$$C(x; g(x), \mathcal{D}_N) = \int p(k = 1 \mid g(x), z)\, p(z \mid \mathcal{D}_N)\, dz \simeq \int \sigma(z^\top \phi + m(\phi))\, q(z)\, dz. \tag{7}$$
Although no closed-form solution for the integral in Eq. (7) is available due to the logistic activation, we can evaluate the predictions with the probit approximation, which makes use of the similarity between the sigmoid function and the probit function. See Appendix B for further details.

Algorithm 1: MALIBO: Meta-learning for likelihood-free Bayesian optimization
Meta-learning:
Input: Meta-datasets D meta t for tasks t = 1, . . . , T , proportion γ ∈ (0, 1)
1 Generate binary class labels for meta-data using γ: 1(y ≤ τ), where τ = Φ^{-1}(γ)
2 Learn the meta-learning model C(•; D meta) by optimizing L meta (Eq. (4))
Bayesian optimization with meta-learning:
Input: Fixed meta-learned model g(•)
[...]
x* = arg max x∈X C(x; g(x), D) with probit approximation (Eq. (25))
13 D ← D ∪ {(x*, f(x*) + ϵ)}
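The adaptation step can be illustrated with a deliberately small sketch. It assumes a scalar task embedding $z$ and scalar features $\phi_n$, plain binary labels instead of the LFBO weights (so that the curvature has exactly the form of Eq. (6)), and a Newton iteration for the MAP estimate. All function names are hypothetical; this is not the authors' implementation.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_and_hess(z, phis, offsets, labels, prior_var):
    """Gradient and Hessian of the negative log posterior for scalar z."""
    probs = [sigmoid(z * p + m) for p, m in zip(phis, offsets)]
    grad = z / prior_var + sum((c - k) * p for c, k, p in zip(probs, labels, phis))
    hess = 1.0 / prior_var + sum(c * (1.0 - c) * p * p for c, p in zip(probs, phis))
    return grad, hess

def laplace_fit(phis, offsets, labels, prior_var=1.0, steps=50):
    """Return the MAP estimate of z and the Laplace posterior variance."""
    z = 0.0
    for _ in range(steps):
        grad, hess = grad_and_hess(z, phis, offsets, labels, prior_var)
        z -= grad / hess  # Newton step on a strictly convex objective
    _, precision = grad_and_hess(z, phis, offsets, labels, prior_var)
    return z, 1.0 / precision

def predictive(phi, offset, z_map, post_var):
    """Probit-approximated predictive probability, cf. Eq. (25)."""
    mu = z_map * phi + offset
    var_a = phi * phi * post_var
    return sigmoid(mu / math.sqrt(1.0 + math.pi * var_a / 8.0))

def thompson_classifier(z_map, post_var, rng):
    """Draw one z from q(z) and return the corresponding classifier."""
    z_sample = rng.gauss(z_map, math.sqrt(post_var))
    return lambda phi, offset: sigmoid(z_sample * phi + offset)

# Toy data: offsets play the role of the task-agnostic mean m(phi).
phis = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
offsets = [0.0] * 6
labels = [0, 0, 1, 0, 1, 1]

z_map, post_var = laplace_fit(phis, offsets, labels)
p_mean = predictive(1.0, 0.0, z_map, post_var)
sampled = thompson_classifier(z_map, post_var, random.Random(0))
p_sample = sampled(1.0, 0.0)
```

With few observations the posterior variance stays close to the prior variance, so the probit-corrected prediction is pulled towards 0.5 and Thompson samples differ noticeably from each other, which is the source of the additional exploration described above.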

4.3. GRADIENT BOOSTING AS A RESIDUAL PREDICTION MODEL

While the learned feature representation warm-starts the optimization, the model gains more knowledge about the new task as data accumulates and needs to adapt. Eventually, the accuracy of C(x) will saturate, as the parametric model has only limited flexibility and is ultimately limited by the amount and the quality of the meta-data. We expect suboptimal classification performance especially when little meta-data is available or when there is a large discrepancy between the training data and the meta-data distribution. More in-depth studies are presented in Appendix D.2. To counteract this, we propose to combine our method with gradient boosting (GB) (Friedman, 2001). In every iteration, after updating the Bayesian logistic regression model, we train a gradient boosting model (regression trees) on D n without leveraging any meta-learned features and use our classifier as the first model in the ensemble. This way, GB only corrects the predictions where our meta-model is inaccurate, which allows for fine-tuning and better convergence.
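The idea of using the meta-learned classifier as the first member of a boosting ensemble can be sketched as follows. The example assumes a one-dimensional input, decision stumps as regression trees, and a hypothetical `meta_logit` function standing in for the logit of the meta-learned classifier; it is a simplified illustration rather than the scikit-learn based setup used in the experiments.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_loss(logits, labels):
    return -sum(k * math.log(sigmoid(f)) + (1 - k) * math.log(1.0 - sigmoid(f))
                for f, k in zip(logits, labels)) / len(labels)

def fit_stump(xs, residuals):
    """Least-squares decision stump on a 1-D input: (split, left, right)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        split = 0.5 * (lo + hi)
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lv, rv)
    return best[1:]

def boost_from_meta(xs, labels, meta_logit, rounds=20, lr=0.5):
    """Start from the meta-model's logits and add stumps fitted to residuals."""
    logits = [meta_logit(x) for x in xs]
    stumps = []
    for _ in range(rounds):
        # negative gradient of the log-loss w.r.t. the current logits
        residuals = [k - sigmoid(f) for k, f in zip(labels, logits)]
        split, lv, rv = fit_stump(xs, residuals)
        stumps.append((split, lv, rv))
        logits = [f + lr * (lv if x <= split else rv) for x, f in zip(xs, logits)]

    def predict(x):
        f = meta_logit(x)
        for split, lv, rv in stumps:
            f += lr * (lv if x <= split else rv)
        return sigmoid(f)

    return predict, logits

# Hypothetical meta-model that prefers small x, while the target task
# actually has its good configurations at large x.
meta_logit = lambda x: 2.0 - 4.0 * x
xs = [i / 10 for i in range(10)]
labels = [1 if i >= 7 else 0 for i in range(10)]

loss_before = log_loss([meta_logit(x) for x in xs], labels)
predict, final_logits = boost_from_meta(xs, labels, meta_logit)
loss_after = log_loss(final_logits, labels)
```

Because the ensemble is initialized with the meta-model's logits, the stumps only need to model the residual error on the target task; where the meta-model is already accurate, the residuals and hence the corrections are small.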

5. EXPERIMENTS

In this section, we describe the experiments conducted to empirically evaluate our method. For the choice of problems, we focus on automated machine learning (AutoML), i.e., hyperparameter optimization (HPO) and neural architecture search (NAS). In addition, we evaluate our methods on synthetic functions to show their robustness against multiplicative noise. Throughout all benchmarks, we show that MALIBO clearly improves upon LFBO's performance. We compare our method against state-of-the-art baselines across all problems. We picked random search (Bergstra & Bengio, 2012), the gradient boosting variants of BORE (Tiao et al., 2021) and LFBO (Song et al., 2022), and BO with GPs as baselines without meta-learning, and chose ABLR (Perrone et al., 2018), RGPE (Feurer et al., 2018a) and GC3P (Salinas et al., 2020) as representative algorithms with a meta-learning component. Additionally, we consider the performance of our method without any additional components (MALIBO), with gradient boosting (MALIBO (GB)), with Thompson sampling (MALIBO (TS)), and with both (MALIBO (GB-TS)). We use immediate regret to quantitatively measure the performance of the different methods, which by definition is the absolute error between the global minimum and the lowest function value obtained so far. For all benchmarks, we report the mean and standard error across 100 random runs. We implemented ABLR, RGPE and MALIBO in PyTorch (Paszke et al., 2019), and used scikit-learn (Pedregosa et al., 2011) for gradient boosting. As the meta-learning model for MALIBO, we considered a 4-layer, 64-unit residual feedforward network, ResFFN-4-64. For more details we refer to Appendix G. For GC3P, we used the authors' open source implementation. The algorithm samples five candidates from a meta-learned NN model before building a task-specific Copula process, while BORE and LFBO sample 10 random configurations to gather global information from the target task before training a model.
Thanks to the meta-learned acquisition function, MALIBO starts the optimization at the point with the highest acquisition function value. For all likelihood-free BO methods, we set the required hyperparameter γ = 1/3, following Tiao et al. (2021) and Song et al. (2022).

Neural network tuning (HPOBench) This benchmark represents a joint NAS and HPO for a two-layer feed-forward regression network on four popular UCI datasets (Dua & Graff, 2017). The search space is 9-dimensional and available as a tabular benchmark (Klein & Hutter, 2019; Eggensperger et al., 2021) with a total of 62,208 unique configurations. The optimization objective in this benchmark is the validation mean squared error after training with the corresponding network configuration (see Appendix H.1 for more details). As meta-data for each dataset, we randomly sampled 512 configurations from each of the remaining three. Fig. 3 shows the strong warm-starting performance of all MALIBO variants. For the Protein and Slice datasets, MALIBO (GB-TS) exhibits the best final performance. For Parkinsons and Naval, the problem characteristics seem to favour the most basic MALIBO variant. GC3P performs very competitively, often being the best method once the Copula process is fitted, but its final performance is usually matched or surpassed by BORE and LFBO. For all datasets, ABLR performs poorly, presumably due to the small number of prior tasks and the fact that the meta-training seems to overfit to the meta-data in the first BO iterations. The GP-based methods barely outperform random search (RS), indicating that GPs fail to model the loss landscape.

Neural architecture search (NASBench201) NASBench201 (Dong & Yang, 2020) considers designing a neural cell with 6 discrete parameters totaling 15,625 unique architectures, evaluated on CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet-16 (Chrabaszcz et al., 2017). The [...] 2021) evaluated on 20 OpenML tasks (Vanschoren et al., 2014), except for MLP with only 8 tasks.
The search space dimensions range from 2 (SVM) to 5 (MLP), with the validation error as the objective. We refer to Appendix H.2 for more details. For meta-data, we randomly sampled 128 configurations from each task except the current target one (which changed across repetitions). As seen in Fig. 5, all MALIBO variants continue to demonstrate strong warm-starting performance and similar performance on all benchmarks, except for RandomForest, where the gradient boosting variants are better. ABLR finally shows its potential with more meta-tasks and is competitive on some benchmarks, but fails to converge on most of them. The warm-starting of GC3P fluctuates from on par with MALIBO (SVM, LogReg) to essentially random guesses (MLP, XGBoost). RGPE demonstrates quick adaptation, but the GP again seems to limit convergence.

To avoid biasing this experiment towards a single method, we decided to use heteroscedastic noise that is incompatible with the noise assumptions of any method. In particular, this violates the GP methods' and ABLR's assumption of homoscedastic, Gaussian noise. GC3P makes a similar assumption, but after the nonlinear transformation applied to the y-values, which does not translate to a well-known noise model. BORE, LFBO and, by extension, all MALIBO variants make no explicit noise assumptions, but will optimize for the best mean. We choose multiplicative noise, i.e., y = f (x) • (1 + ϵ • n), where n ∼ N (0, 1). To test the robustness, we evaluate ϵ ∈ {0, 0.1, 1.0}. The noise corrupts observations with large values more, while having a smaller effect on lower function values. For meta-training, we randomly sampled 128 noisy observations from 256 functions in the ensemble. We show our results in Fig. 6, where we can see that, across all magnitudes of multiplicative noise, our method still learns a meaningful prior for the optimization. As the noise level increases, the performance of all methods degrades, with ABLR and GC3P performing worst for the largest noise. The GP-based methods, especially RGPE, do well on such smooth functions and handle the noise surprisingly well. In comparison, BORE, LFBO and MALIBO show the smallest performance loss with growing noise on a function ensemble where GP methods excel.
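The two quantities used throughout this section, the multiplicative noise model and the immediate regret, can be written down in a few lines. The sketch below assumes a hypothetical one-dimensional test function `f` with known minimum and a random-search trajectory; it only illustrates the definitions and does not reproduce any benchmark.

```python
import random

def noisy_observation(f_value, eps, rng):
    """Multiplicative noise y = f(x) * (1 + eps * n) with n ~ N(0, 1)."""
    return f_value * (1.0 + eps * rng.gauss(0.0, 1.0))

def immediate_regret(values, global_min):
    """Regret after each step: lowest value found so far minus the global minimum."""
    regrets, best = [], float("inf")
    for v in values:
        best = min(best, v)
        regrets.append(abs(best - global_min))
    return regrets

# Hypothetical objective with global minimum f(0.3) = 1.0.
f = lambda x: (x - 0.3) ** 2 + 1.0

rng = random.Random(0)
xs = [rng.random() for _ in range(20)]            # random-search proposals
true_values = [f(x) for x in xs]                  # noise-free values for the regret
observed = [noisy_observation(v, 0.1, rng) for v in true_values]
regrets = immediate_regret(true_values, global_min=1.0)

# The same noise draw perturbs a large function value more than a small one.
dev_small = noisy_observation(1.0, 0.1, random.Random(1)) - 1.0
dev_large = noisy_observation(10.0, 0.1, random.Random(1)) - 10.0
```

Here the regret is computed on the noise-free function values, which is an assumption of this sketch; the optimizer itself would only see the entries of `observed`.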

6. CONCLUSION

We introduced Meta-learning for Likelihood-free BO (MALIBO), a novel meta-learning optimization algorithm that is computationally efficient and robust to varying scales of the observations and heterogeneous noise. By directly modeling the acquisition function from observations, the method makes fewer assumptions about the data and noise distributions. Coupled with meta-learning, it leverages information from prior tasks for better sample efficiency. To ensure adaptation to new tasks, possibly different from prior ones, we incorporate our model into gradient boosting, transitioning from a meta-learning-driven model towards one specialized on the current task. Empirical results demonstrate superior performance on both HPO and NAS benchmarks, as well as on synthetic benchmarks with heteroscedastic noise. Despite the promising experimental results, some limitations of the method should be noted. As discussed by Tiao et al. (2021), the threshold parameter τ in likelihood-free BO algorithms, which controls the trade-off between exploitation and exploration, should be treated more carefully. One might consider a probabilistic treatment of this hyperparameter. We also observed over-confident predictions in our experiments, for example on the Forrester ensemble in Appendix F, where not all runs find the true optimum. This is especially important for problems with more complicated loss landscapes, but mitigation strategies exist. Directions for future work include extensions to parallel BO with Thompson sampling (Kandasamy et al., 2018), multi-fidelity optimization for HPO problems (Falkner et al., 2018), multi-objective optimization (Hernandez-Lobato et al., 2016), BO with automatic stopping (Makarova et al., 2022) and Bayesian deep active learning (Gal et al., 2017).

A LIKELIHOOD-FREE ACQUISITION FUNCTION

For completeness, we provide the proofs and derivations from Bergstra et al. (2011), Tiao et al. (2021), and Song et al. (2022). Recall from Eq. (1) that the expected utility function is defined as the expectation of the utility function $U(y; \tau)$ over the posterior predictive distribution $p(y \mid x, \mathcal{D}_n)$. For the expected improvement (EI) acquisition function, where the utility function is $U(y; \tau) := \max(\tau - y, 0)$, it reads:
$$\alpha(x; \mathcal{D}_n, \tau) := \mathbb{E}_{p(y \mid x, \mathcal{D}_n)}[U(y; \tau)] = \int_{-\infty}^{\infty} U(y; \tau)\, p(y \mid x, \mathcal{D}_n)\, dy = \int_{-\infty}^{\tau} (\tau - y)\, p(y \mid x, \mathcal{D}_n)\, dy = \frac{1}{p(x \mid \mathcal{D}_n)} \int_{-\infty}^{\tau} (\tau - y)\, p(x \mid y, \mathcal{D}_n)\, p(y \mid \mathcal{D}_n)\, dy. \tag{8}$$
We follow the proof from Bergstra et al. (2011) and Tiao et al. (2021) and consider $\ell(x) = p(x \mid y \le \tau, \mathcal{D}_n)$ and $g(x) = p(x \mid y > \tau, \mathcal{D}_n)$. The denominator of the above equation can then be written as:
$$p(x \mid \mathcal{D}_n) = \int_{-\infty}^{\infty} p(x \mid y, \mathcal{D}_n)\, p(y \mid \mathcal{D}_n)\, dy = \ell(x) \int_{-\infty}^{\tau} p(y \mid \mathcal{D}_n)\, dy + g(x) \int_{\tau}^{\infty} p(y \mid \mathcal{D}_n)\, dy = \gamma \ell(x) + (1 - \gamma) g(x), \tag{9}$$
where $\gamma := \Phi(\tau) = p(y \le \tau \mid \mathcal{D}_n)$. The numerator can be evaluated as:
$$\int_{-\infty}^{\tau} (\tau - y)\, p(x \mid y, \mathcal{D}_n)\, p(y \mid \mathcal{D}_n)\, dy = \ell(x) \int_{-\infty}^{\tau} (\tau - y)\, p(y \mid \mathcal{D}_n)\, dy \tag{10}$$
$$= \ell(x)\, \tau \int_{-\infty}^{\tau} p(y \mid \mathcal{D}_n)\, dy - \ell(x) \int_{-\infty}^{\tau} y\, p(y \mid \mathcal{D}_n)\, dy \tag{11}$$
$$= \gamma \tau \ell(x) - \ell(x) \int_{-\infty}^{\tau} y\, p(y \mid \mathcal{D}_n)\, dy \tag{12}$$
$$= K \cdot \ell(x), \tag{13}$$
where $K = \gamma \tau - \int_{-\infty}^{\tau} y\, p(y \mid \mathcal{D}_n)\, dy$. Therefore, the EI acquisition function is equivalent to the $\gamma$-relative density ratio up to the constant $K$:
$$\underbrace{\alpha(x; \mathcal{D}_n, \tau)}_{\text{expected improvement}} \propto \underbrace{\frac{\ell(x)}{\gamma \ell(x) + (1 - \gamma) g(x)}}_{\text{density ratio}}. \tag{14}$$
Intuitively, one can think of the configurations $x$ whose corresponding $y \le \tau$ as good configurations, and of those with $y > \tau$ as bad configurations. The density ratio can then be interpreted as the ratio between the model's belief that a configuration is good and its belief that it is bad. Tree-structured Parzen Estimators (TPE, Bergstra et al. (2011)) first select $\gamma$ as a hyperparameter and estimate this density ratio by explicitly modelling $\ell(x)$ and $g(x)$ using kernel density estimation. BORE by Tiao et al. (2021) models the density ratio by class probabilities, where $\ell(x) = p(x \mid k = 1)$ and $g(x) = p(x \mid k = 0)$. Song et al. (2022) prove that the density-ratio acquisition function is not always equivalent to EI. Bergstra et al. (2011) and Tiao et al. (2021) claim that Eq. (10) holds, i.e., that the numerator equals $\ell(x)$ times a value independent of $x$. This step assumes that $p(x \mid y, \mathcal{D}_n)$ no longer depends on $y$ once $y \le \tau$, so that $\ell(x)$ can be taken out of the integral. However, $p(x \mid y, \mathcal{D}_n)$ generally still depends on $y$ even if $y \le \tau$: conditioning on a particular value $y$ and conditioning on the event $y \le \tau$ are different statements, and satisfying $y \le \tau$ does not make the density independent of $y$. From the definition of conditional probability,
$$p(x \mid y \le \tau, \mathcal{D}_n) = \frac{\int_{-\infty}^{\tau} p(x, y \mid \mathcal{D}_n)\, dy}{\int_{-\infty}^{\tau} p(y \mid \mathcal{D}_n)\, dy} \neq p(x \mid y, \mathcal{D}_n), \tag{15}$$
so the two are not equivalent. Intuitively, the probability of a configuration $x$ still depends on its corresponding value $y$ even when $y < \tau$. Put differently, the density-ratio acquisition function treats all $(x, y)$ pairs below the threshold with equal importance, because under the independence assumption nothing depends on $y$ once $y \le \tau$. In contrast, expected improvement weights each $(x, y)$ pair by how far $y$ lies below $\tau$. To tackle this issue, Song et al. (2022) propose to estimate the density ratio via variational f-divergence estimation (Nguyen et al., 2010).
Song et al. (2022) provide a variational representation of the expected utility at any point $x$, given samples from some distribution $p(y \mid x)$, which replaces the integration with a variational objective:

$$\mathbb{E}_{p(y \mid x)}[U(y; \tau)] = \arg\max_{s \in [0, \infty)} \mathbb{E}_{p(y \mid x)}\left[U(y; \tau) f'(s) - f^*(f'(s))\right],$$

where the utility function $U$ is non-negative, $f : [0, \infty) \to \mathbb{R}$ is a strictly convex function, and $f^*$ is the convex conjugate of $f$. This acquisition function does not model distributions with probability but relies only on samples from the observations $\mathcal{D}_n$. They consider the acquisition function $\alpha_{\mathrm{LFBO}} = \hat{S}_{\mathcal{D}_n, \tau}(x)$ and state that it can be written as

$$\hat{S}_{\mathcal{D}_n, \tau} = \arg\max_{S : \mathcal{X} \to \mathbb{R}} \mathbb{E}_{\mathcal{D}_n}\left[U(y; \tau) f'(S(x)) - f^*(f'(S(x)))\right].$$

This means that, by optimizing a variational objective over the search space $\mathcal{X}$, we recover an expected-utility acquisition function over $x$, which makes $\mathcal{L}_{\mathrm{LFBO}}$ equivalent to an expected-utility acquisition function. For practical purposes, they choose the specific convex function

$$f(r) = r \log\frac{r}{r+1} + \log\frac{1}{r+1} \quad \text{for all } r > 0.$$

The acquisition function then reads

$$\alpha_{\mathrm{LFBO}}(x; \mathcal{D}_n, \tau) = \frac{\hat{C}_{\mathcal{D}_n, \tau}(x)}{1 - \hat{C}_{\mathcal{D}_n, \tau}(x)},$$

where $\hat{C}_{\mathcal{D}_n, \tau}$ is the maximizer of an objective over $C : \mathcal{X} \to (0, 1)$. Applying this $f$ and Eq. (18) to Eq. (17), the objective to be maximized over $C$ becomes

$$\mathbb{E}_{(x, y) \sim p(x, y \mid \mathcal{D}_N)}\left[U(y; \tau) \log C(x) + \log(1 - C(x))\right].$$

B PROBIT APPROXIMATION

Let $a = z^\top \phi + m(\phi)$. The distribution of $a$ is a Gaussian $\mathcal{N}(a \mid \mu_a, \sigma_a^2)$, whose mean and variance are, respectively,

$$\mu_a = \mathbb{E}[a] = \int p(a)\, a \, da = \int q(z)\, (z^\top \phi + m(\phi)) \, dz = z_{\mathrm{MAP}}^\top \phi + m(\phi) \tag{20}$$

$$\sigma_a^2 = \mathrm{var}[a] = \int p(a)\, \{a^2 - \mathbb{E}[a]^2\} \, da = \int q(z)\, \{(z^\top \phi + m(\phi))^2 - (z_{\mathrm{MAP}}^\top \phi + m(\phi))^2\} \, dz = \phi^\top \Sigma_N \phi \tag{21}$$

Thus our approximation to the predictive distribution becomes

$$p(k = 1 \mid g(x), \mathcal{D}_N) = \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2) \, da \tag{22}$$

To evaluate the integral in Eq. (22), we obtain a good approximation by making use of the close similarity between the logistic sigmoid function $\sigma(a)$ and the probit function, which is given by the cumulative distribution of the standard Gaussian, $\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1) \, d\theta$. The convolution of a probit function with a Gaussian has the closed form

$$\int \Phi(\lambda a)\, \mathcal{N}(a \mid \mu, \sigma^2) \, da = \Phi\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\right) \tag{23}$$

We apply the approximation $\sigma(a) \simeq \Phi(\lambda a)$, with $\lambda^2 = \pi/8$, to the probit functions appearing on both sides of this equation:

$$\int \sigma(a)\, \mathcal{N}(a \mid \mu, \sigma^2) \, da \simeq \sigma\left((1 + \pi \sigma^2 / 8)^{-1/2} \mu\right) \tag{24}$$

Therefore we obtain the predictive distribution in the form

$$p(k = 1 \mid g(x), \mathcal{D}_N) \simeq \sigma\left((1 + \pi \sigma_a^2 / 8)^{-1/2} \mu_a\right) \tag{25}$$

where $\mu_a$ is given by Eq. (20) and $\sigma_a^2$ by Eq. (21).
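The quality of the probit approximation can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the feature vector, the embedding mean and the covariance are all made-up illustrative values, and the reference integral of Eq. (22) is computed by a simple Riemann sum.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def probit_predictive(phi, z_map, sigma_n, mean_value):
    """Approximate class probability following Eqs. (20), (21) and (25)."""
    phi = np.asarray(phi, dtype=float)
    z_map = np.asarray(z_map, dtype=float)
    sigma_n = np.asarray(sigma_n, dtype=float)
    mu_a = float(z_map @ phi + mean_value)   # Eq. (20)
    var_a = float(phi @ sigma_n @ phi)       # Eq. (21)
    prob = sigmoid(mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0))  # Eq. (25)
    return prob, mu_a, var_a

def numerical_predictive(mu_a, var_a, num_points=20001):
    """Reference value of the integral in Eq. (22) via a Riemann sum."""
    std = np.sqrt(var_a)
    grid = np.linspace(mu_a - 10.0 * std, mu_a + 10.0 * std, num_points)
    density = np.exp(-0.5 * ((grid - mu_a) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
    return float(np.sum(sigmoid(grid) * density) * (grid[1] - grid[0]))

# Hypothetical two-dimensional latent features and posterior (illustrative only).
phi = [0.5, -1.0]
z_map = [1.0, 0.2]
sigma_n = np.diag([0.5, 0.3])
approx, mu_a, var_a = probit_predictive(phi, z_map, sigma_n, mean_value=0.1)
exact = numerical_predictive(mu_a, var_a)
print(approx, exact)
```

The closed form avoids any numerical integration at prediction time, which is the reason for using it inside the acquisition function.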

C RUNTIME ANALYSIS

The experiments in Section 5 show the performance over optimization steps. To complement them, we present the same results from a different perspective and report the immediate regret as a function of estimated wall-clock time. To obtain a realistic wall-clock time, we accumulate the time that the corresponding BO method spends on optimization and the recorded runtime of the evaluated configurations in the benchmarks. Note that all methods run for the same number of steps in an experiment. As shown in Figs. 7, 8 and 9, MALIBO and its variants attain the best warm-starting performance across all benchmarks and consistently achieve the lowest regret for the same amount of time. LFBO and BORE are competitive in terms of final performance, but both need considerably more time to match the regret of MALIBO and its variants. GC3P is the method whose time performance is closest to that of MALIBO on most of the benchmarks. However, its performance is not as stable as that of MALIBO, especially on NASBench201 and some of the MLBench problems, where its meta-learning fails to warm-start the optimization. Similarly, the meta-learning of RGPE and ABLR does not deliver any advantage over the non-meta-learning baselines, so both end up performing close to a standard GP. To further investigate the time efficiency of MALIBO, we illustrate the per-step runtime of the optimization algorithms in Fig. 10. MALIBO and its variants are the fastest among all the meta-learning methods and only slightly slower than the non-meta-learning likelihood-free methods, namely BORE and LFBO. Due to the increasing number of observations, the runtime of almost all methods grows over iterations, most significantly for RGPE and GP. Although ABLR and GC3P are around an order of magnitude slower than MALIBO at the beginning, their runtimes remain stable over steps.
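The accumulation of wall-clock time described above can be sketched as follows. This is an illustrative reading of the procedure, not the evaluation code used for the figures: the function names are hypothetical, and the immediate regret is assumed here to be the best value observed so far minus the known optimum of a minimization problem.

```python
import numpy as np

def estimated_wall_clock(optimizer_seconds, evaluation_seconds):
    """Cumulative time per step: BO overhead plus recorded benchmark runtime."""
    optimizer_seconds = np.asarray(optimizer_seconds, dtype=float)
    evaluation_seconds = np.asarray(evaluation_seconds, dtype=float)
    if optimizer_seconds.shape != evaluation_seconds.shape:
        raise ValueError("need one optimizer time and one evaluation time per step")
    return np.cumsum(optimizer_seconds + evaluation_seconds)

def immediate_regret(observed_values, optimum):
    """Best value found so far minus the optimum (minimization assumed)."""
    best_so_far = np.minimum.accumulate(np.asarray(observed_values, dtype=float))
    return best_so_far - optimum

# Illustrative numbers only.
times = estimated_wall_clock([0.5, 0.5, 1.0], [10.0, 20.0, 5.0])
regrets = immediate_regret([3.0, 1.0, 2.0], optimum=0.5)
for t, r in zip(times, regrets):
    print(f"t = {t:6.1f} s, regret = {r:.2f}")
```

Plotting `regrets` against `times` on a logarithmic time axis gives curves of the kind reported in Figs. 7 to 9.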

D ABLATION STUDIES

We conduct ablation studies to show how the different components of MALIBO contribute to the optimization. Specifically, we first visualize the meta-learned feature representation for two different functions in Appendix D.1 to give an intuition for the expressiveness of our meta-learning model. To examine which components help the optimization and how, Appendix D.2 presents experiments with MALIBO and its variants on four synthetic function benchmarks without any meta-learning.

D.1 LATENT FEATURE ANALYSIS

We show how our meta-learning model learns a feature representation from meta-data. The latent features $\phi$ can be regarded as basis functions and are supposed to represent the structure of the meta-data distribution. Given properly learned features $\phi$, the mean layer and the task embedding layer $z$ combine them to obtain a function whose structure is similar to the meta-data while matching the shape of the target function. To learn an effective feature representation, the model should capture both the local and the global structure of the function. Therefore, we select two types of functions to study the effectiveness of feature learning in MALIBO: i) Forrester functions (Sobester et al., 2008) with two very likely positions for the global optimum, which have rich local structure; ii) quadratic functions, which share a certain global shape but whose optima can be located anywhere in the search space. For more details on the synthetic functions and the generation of meta-data, we refer to Appendix H. The results for these two synthetic functions are shown in Fig. 11 and Fig. 12, respectively. On the one hand, Fig. 11 shows that the features learned by MALIBO have either a maximum or a minimum around the two likely optima, which means the model successfully infers the local structure from the meta-data. On the other hand, Fig. 12 shows that, even without a clear location of the optima, the features still capture the shape of the quadratic functions.

D.2 EFFECTS OF GRADIENT BOOSTING AND THOMPSON SAMPLING

The vanilla MALIBO uses Bayesian logistic regression for task adaptation, which leverages the meta-learned features. However, its performance depends heavily on the quality of the meta-data and of the latent features learned from it. In practice, neither the amount nor the quality of the data is guaranteed. We therefore introduce gradient boosting as a residual model to safeguard the optimization when little meta-data is available or when there is a large discrepancy between the training data and the meta-data distribution. Further, we apply Thompson sampling to encourage exploration, which enables MALIBO to collect more information about the target function by exploring the search space efficiently. To show the usefulness of these components, we remove the effect of meta-learning by optimizing the target synthetic functions without any meta-training, so that the experiments isolate the effects of gradient boosting and Thompson sampling. We use four synthetic benchmarks, namely the quadratic, Forrester, Branin and Hartmann3 functions, and refer to Appendix H for more details. As shown in Fig. 13, the MALIBO (GB) variant consistently performs better than the vanilla MALIBO except on Forrester, where MALIBO without meta-learning can easily get stuck in a local optimum. This indicates that, when there is little meta-data, gradient boosting helps the model converge toward a lower regret in cases where the Bayesian logistic regression fails to optimize. For MALIBO (TS), the results show that the model achieves lower immediate regret than vanilla MALIBO across all benchmarks, because Thompson sampling encourages exploration of the search space. However, because Thompson sampling improves only exploration, MALIBO (GB) outperforms the MALIBO (TS) variant on some benchmarks, for example the Branin and Hartmann3 functions.
Last but not least, the results show that, by combining the exploitation of gradient boosting with the exploration of Thompson sampling, MALIBO (GB-TS) achieves the best performance across all benchmarks.
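The role of Thompson sampling can be sketched in a few lines. The sketch assumes that a Thompson sample is obtained by drawing a task embedding from the Gaussian approximation with mean $z_{\mathrm{MAP}}$ and covariance $\Sigma_N$ used in Appendix B, and then evaluating the resulting classifier on candidate features; the function name, shapes and inputs are hypothetical.

```python
import numpy as np

def thompson_acquisitions(features, z_map, sigma_n, mean_values, n_samples=1, seed=None):
    """Draw task embeddings and return one acquisition function per sample.

    features:    (n_candidates, d) latent features phi of the candidates
    mean_values: (n_candidates,) outputs m(phi) of the mean layer
    Returns an array of shape (n_samples, n_candidates) with values in (0, 1).
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    mean_values = np.asarray(mean_values, dtype=float)
    # Assumption: embeddings are sampled from N(z_MAP, Sigma_N).
    z_samples = rng.multivariate_normal(z_map, sigma_n, size=n_samples)
    logits = z_samples @ features.T + mean_values
    return 1.0 / (1.0 + np.exp(-logits))

# Illustrative inputs: three candidates with two-dimensional features.
candidate_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
samples = thompson_acquisitions(candidate_features, np.zeros(2), np.eye(2),
                                np.zeros(3), n_samples=3, seed=0)
print(samples.argmax(axis=1))  # one proposed candidate index per sample
```

Each row is a different plausible acquisition function, so maximizing the rows separately yields diverse query points, which is the source of the additional exploration.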

F ADDITIONAL BENCHMARKS

In this section we show additional results for experiments with multiplicative noise as in Section 5. With the settings remaining the same, we perform experiments on the Forrester (Sobester et al., 2008) and Hartmann3 (Dixon, 1978) function ensembles. We refer to Appendices H.5 and H.7, respectively, for more details.

Forrester function ensemble. For meta-training in the Forrester ensemble experiment, we randomly sampled 32 noisy observations in 64 prior tasks. The results are shown in Fig. 18. MALIBO and its variants continue to show strong warm-starting performance and stay robust to noise compared to the other methods. However, all the likelihood-free BO methods, namely LFBO, MALIBO and its variants, seem to get stuck in a local minimum in some runs, resulting in almost no improvement over the optimization process. The performance of all MALIBO variants is on par with GC3P in all cases. Most of the GP-based methods, namely GP, RGPE and ABLR, outperform the likelihood-free methods, but their performance degrades significantly as the noise level increases.

G EXPERIMENTAL DETAILS

We use a Residual Feed Forward Network (ResFFN) (He et al., 2016) to learn the latent feature representation. The architecture, ResFFN-4-64, contains 4 residual feedforward layers with 64 units each. For the mean layer $m(\cdot)$ and the task embedding layer $z$, we use a fully connected layer with 50 units each. We use ELU (Clevert et al., 2016) as the activation function, following Tiao et al. (2021). During meta-training, we optimize the weights with ADAM (Kingma & Ba, 2015) using a batch size of $B = 256$ and polynomial decay for the learning rate, with initial learning rate $\mathrm{lr}_{\mathrm{initial}} = 10^{-3}$, end learning rate $\mathrm{lr}_{\mathrm{final}} = 2^{-4}$ and the exponent set to 2. The model is trained for 2048 epochs with early stopping. We set the regularization factor $\lambda = 0.1$ in Eq. (4) and follow the approach of Berkenkamp et al. (2021) to estimate the weights for the modified Kolmogorov-Smirnov test and the covariance regularization. In task adaptation, we optimize the task embedding with L-BFGS (Byrd et al., 1995). For the gradient boosting applied to BORE, LFBO and MALIBO, we use the implementation in scikit-learn (Pedregosa et al., 2011) with default settings. The only difference is that we use the meta-learned MALIBO classifier as the initial estimator for the gradient boosting variants of MALIBO.
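The learning-rate schedule can be sketched as below. The text does not spell out the decay formula, so this sketch assumes the common polynomial-decay form that interpolates from the initial to the end rate with a fixed exponent; the function name and the numbers in the usage lines are illustrative and the rates are passed in as arguments rather than fixed.

```python
def polynomial_decay(step, total_steps, lr_initial, lr_final, power=2.0):
    """Assumed form: lr = (lr_initial - lr_final) * (1 - step/total)**power + lr_final."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    remaining = 1.0 - step / total_steps
    return (lr_initial - lr_final) * remaining ** power + lr_final

# Illustrative schedule with made-up rates and 100 steps.
for step in (0, 50, 100):
    print(step, polynomial_decay(step, 100, lr_initial=1e-3, lr_final=1e-4))
```

With an exponent of 2 the rate moves quickly away from its initial value and flattens out as it approaches the end rate.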

H DETAILS OF BENCHMARKS

H.1 HPOBENCH

The hyperparameters for HPOBench and their ranges are listed in Table 1. All hyperparameters are discrete and there are in total 62,208 possible combinations. More details can be found in Klein & Hutter (2019).

The hyperparameters for the machine learning (ML) algorithms in HPOBench (Eggensperger et al., 2021) and their ranges are summarized in Table 2. More details can be found in Eggensperger et al. (2021).

The hyperparameters for NASBench201 and their ranges are summarized in Table 3. All hyperparameters are discrete and there are in total 15,625 possible combinations. More details can be found in Dong & Yang (2020).

| Hyperparameter | Range |
|---|---|
| ARC 0 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |
| ARC 1 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |
| ARC 2 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |
| ARC 3 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |
| ARC 4 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |
| ARC 5 | { none, skip-connect, conv-1 × 1, conv-3 × 3, avg-pool-3 × 3 } |

H.4 THE QUADRATIC ENSEMBLE

The function for the quadratic ensemble is defined as

$$f(x, a, b, c) = (a \cdot (x - b))^2 - c, \qquad x \in [0, 1]. \tag{26}$$

To form the ensemble, we choose the distribution for the parameters as

$$a \sim \mathcal{U}(0.5, 1.5), \quad b \sim \mathcal{U}(-0.9, 0.9), \quad c \sim \mathcal{U}(-1, 1). \tag{27}$$

This distribution of parameters ensures that the search space contains the minimum of the quadratic function at $x^* = b$ with $f(x^*) = -c$. The location of the optimum is broadly distributed, which is intended to highlight algorithms that learn the global structure of the ensemble rather than restricting themselves to small regions of interest.
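Sampling tasks from the quadratic ensemble can be sketched as follows, using Eqs. (26) and (27). The function names are hypothetical, and drawing the meta-data inputs uniformly from the stated domain is an assumption of this sketch rather than a documented detail.

```python
import numpy as np

def quadratic(x, a, b, c):
    """Ensemble member from Eq. (26): f(x, a, b, c) = (a * (x - b))**2 - c."""
    return (a * (np.asarray(x, dtype=float) - b)) ** 2 - c

def sample_quadratic_tasks(n_tasks, seed=None):
    """Draw (a, b, c) per task from the uniform distributions in Eq. (27)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.5, 1.5, size=n_tasks)
    b = rng.uniform(-0.9, 0.9, size=n_tasks)
    c = rng.uniform(-1.0, 1.0, size=n_tasks)
    return np.stack([a, b, c], axis=1)

def sample_meta_data(task_params, n_points, seed=None):
    """Assumption: inputs are drawn uniformly from the domain stated in Eq. (26)."""
    rng = np.random.default_rng(seed)
    data = []
    for a, b, c in task_params:
        x = rng.uniform(0.0, 1.0, size=n_points)
        data.append((x, quadratic(x, a, b, c)))
    return data

tasks = sample_quadratic_tasks(n_tasks=4, seed=0)
meta_data = sample_meta_data(tasks, n_points=8, seed=1)
print(tasks.shape, len(meta_data))
```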

H.5 THE FORRESTER ENSEMBLE

The original Forrester function (Sobester et al., 2008) is defined as in Eq. (28).



In contrast to the meta-training, we only infer $z$ but keep the task-agnostic model $g(\cdot)$ fixed.

A large mismatch between training and test data can arise here when the $x_n$ all cluster around an optimum after many iterations, whereas the meta-data consists of IID points on all tasks.

https://github.com/geoalgo/A-Quantile-based-Approach-for-Hyperparameter-Transfer-Learning



Likelihood-free acquisition functions model the belief of a candidate being promising instead of explicitly computing the likelihood of outcomes via the model's posterior $p(y \mid x, \mathcal{D}_N)$. One of the first likelihood-free BO algorithms, Tree-structured Parzen Estimators (TPE; Bergstra et al., 2011), dismisses the surrogate for the outcomes and instead models the two densities $\ell(x) = p(x \mid y \le \tau, \mathcal{D}_n)$ and $g(x) = p(x \mid y > \tau, \mathcal{D}_n)$. The threshold $\tau$ relates to the $\gamma$-th quantile of the observed $y$ values via $\gamma = \Phi(\tau) := p(y \le \tau \mid \mathcal{D}_n)$. The density ratio (DR) serves as the acquisition function $\alpha$: $\alpha_{\mathrm{DR}}(x; \mathcal{D}_N, \tau) = \ell(x)/g(x)$.
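A density-ratio acquisition function of this kind can be sketched for a one-dimensional problem. The sketch uses plain Gaussian kernel density estimates with a fixed bandwidth as a stand-in for the tree-structured Parzen estimators of TPE; the function names, the bandwidth and the small constant in the denominator are illustrative choices.

```python
import numpy as np

def gaussian_kde_1d(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    samples = np.asarray(samples, dtype=float)
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        z = (x[:, None] - samples[None, :]) / bandwidth
        return np.exp(-0.5 * z ** 2).sum(axis=1) / norm
    return density

def density_ratio_acquisition(x_obs, y_obs, gamma=1.0 / 3.0, bandwidth=0.1):
    """Split observations at the gamma-quantile tau and return l(x) / g(x)."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    tau = np.quantile(y_obs, gamma)
    ell = gaussian_kde_1d(x_obs[y_obs <= tau], bandwidth)  # promising inputs
    g = gaussian_kde_1d(x_obs[y_obs > tau], bandwidth)     # remaining inputs
    return (lambda x: ell(x) / (g(x) + 1e-12)), tau

# Illustrative data: a quadratic objective with its minimum at x = 0.3.
x_obs = np.linspace(0.0, 1.0, 30)
y_obs = (x_obs - 0.3) ** 2
alpha, tau = density_ratio_acquisition(x_obs, y_obs)
print(tau, alpha(0.3), alpha(0.9))
```

The ratio is large where promising observations are dense relative to the rest, so candidates near the observed minimum score higher than candidates far from it.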

Figure 1: Illustration of meta-learning the acquisition function. Left: Observations (crosses) from 10 related tasks and the target task. The top performing observations (γ = 1/3) in each task are shown in red, the rest in blue. Right: The top panel shows the approximated predictive distribution (see Eq. (7)) while the others show Thompson samples. MALIBO successfully identifies the promising areas, and the Thompson samples show variability in the meta-learned acquisition function.

Figure 2: Schematic representation of our meta-learning classifier. A Residual Feedforward Network (ResFFN) maps the input $x_t$ to a latent feature representation $\phi$. From this, a global mean prediction $m(\phi)$ and a task-specific embedding $z_t$ are combined and translated into a class prediction via a sigmoid function.

Obtain the first best guess $x_0$ from the meta-learned model
$\mathcal{D} = \{(x_0, f(x_0) + \epsilon)\}$
For the gradient boosting variant, use $C(\cdot) \leftarrow \mathrm{GB}(C(\cdot))$ as the classifier
while has budget do
    Estimate $z_{\mathrm{MAP}}$ by optimizing $\mathcal{L}_{\mathrm{MALIBO}}$ (Eq. (5)) w.r.t.

Figure 3: Immediate regret for different BO algorithms on the HPOBench neural network tuning problems (D = 9) for 4 datasets. The optimization objective in this benchmark is the validation loss.

Figure 4: Immediate regret for different BO algorithms on the NASBench201 neural architecture search problems (D = 6) on 3 datasets. We optimize the validation accuracy for this benchmark.

Figure 5: Immediate regret for different BO algorithms on the HPOBench hyperparameter tuning problems for 5 different machine learning algorithms. We specifically optimize (1 − accuracy) for this benchmark.

Figure 6: Immediate regret for different BO algorithms on Branin function ensembles (D = 2) with different levels of multiplicative noise.

Figure 7: Immediate regrets of different BO algorithms on the HPOBench neural network tuning problem. Each algorithm runs for 500 iterations and we show the corresponding estimated wall-clock time on the x axis in log scale.

Figure 8: Immediate regrets of different BO algorithms on the NASBench201 neural network architecture search problem. Each algorithm runs for 200 iterations and we show the corresponding estimated wall-clock time on the x axis in log scale.

Figure 9: Immediate regrets of different BO algorithms on the HPOBench hyperparameter tuning for machine learning algorithms. Each algorithm runs for 72 iterations and we show the corresponding estimated wall-clock time on the x axis in log scale.

Figure 11: Left: Forrester functions with two very likely optima as target function and related tasks. The learned acquisition function is shown below. Right: Meta-learned latent features from related tasks. The latent features show that the model successfully infers the location of the two optima, resulting in an acquisition function with two modes around the same locations.

Figure 12: Left: Quadratic functions with varying optima as target function and related-tasks. The learned acquisition function is shown below. Right: Meta-learned latent features from related-tasks. The latent features show the model can learn the global structure shared across all the meta-data, even though there is no clear position for optima.

Figure 13: Results of MALIBO and its variants on four synthetic function benchmarks without meta-training. We report the standard error of immediate regrets over 100 runs.

Figure 17: Synchronous parallel Thompson sampling using MALIBO (TS) to optimize a quadratic function. In every iteration we draw three samples as acquisition functions and use the resulting query points as observations for the next optimization step. The exploration from Thompson sampling helps to discover the boundary of the promising region, i.e. the observations with $y \le \tau$ (blue crosses), in fewer iterations, and the promising region keeps shrinking as more unfavorable points close to the true optimum are acquired.

Figure 18: Immediate regret for different BO algorithms on Forrester function ensembles (D = 1) with different levels of multiplicative noise.

Figure 19: Immediate regret for different BO algorithms on Hartmann3 function ensembles (D = 3) with different levels of multiplicative noise.

$$f(x, a, b, c) = a \cdot (6x - 2)^2 \sin(12x - 4) + b(x - 0.5) - c, \qquad x \in [0, 1]. \tag{28}$$

The function has one local and one global minimum, and a zero-gradient inflection point in the domain $x \in [0, 1]$. To form the ensemble, we choose the distribution for the parameters as

$$a \sim \mathcal{U}(0.2, 3), \quad b \sim \mathcal{U}(-5, 15), \quad c \sim \mathcal{U}(-5, 5). \tag{29}$$

Let $\tau = \{a, b, c\}$, so that $p(\tau)$ is a three-dimensional uniform distribution. The ranges are chosen around the commonly used fixed values of the parameters, namely $a = 0.5$, $b = 10$, $c = -5$.

H.6 THE BRANIN ENSEMBLE

The function for the Branin ensemble is the following:

$$f(x, a, b, c, r, s, t) = a(x_2 - b x_1^2 + c x_1 - r)^2 + s(1 - t)\cos(x_1) + s, \qquad x_1 \in [-5, 10],\ x_2 \in [0, 15]. \tag{30}$$

The distributions for the parameters are chosen as

$$a \sim \mathcal{U}(0.5, 1.5), \quad b \sim \mathcal{U}(0.1, 0.15), \quad c \sim \mathcal{U}(1.0, 2.0), \quad r \sim \mathcal{U}(5.0, 7.0), \quad s \sim \mathcal{U}(8.0, 12.0), \quad t \sim \mathcal{U}(0.03, 0.05). \tag{31}$$

Let $\tau = \{a, b, c, r, s, t\}$, so that $p(\tau)$ is a six-dimensional uniform distribution. The ranges are chosen around the commonly used fixed values of the parameters, namely $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $s = 10$ and $t = 1/(8\pi)$.

H.7 THE HARTMANN3 ENSEMBLE

The function for the Hartmann3 (Dixon, 1978) ensemble reads:

$$f(x, \alpha_1, \alpha_2, \alpha_3, \alpha_4) = -\sum_{i=1}^{4} \alpha_i \exp\left(-\sum_{j=1}^{3} A_{i,j} (x_j - P_{i,j})^2\right) \tag{32}$$

Table 1: Configuration spaces for HPOBench

| Hyperparameter | Range |
|---|---|
| Initial LR | { $5 \times 10^{-4}$, $1 \times 10^{-3}$, $5 \times 10^{-3}$, $1 \times 10^{-2}$, $5 \times 10^{-2}$, $1 \times 10^{-1}$ } |
| LR Schedule | { cosine, fixed } |
| Batch size | { $2^3$, $2^4$, $2^5$, $2^6$ } |
| Layer 1 Width | { $2^4$, $2^5$, $2^6$, $2^7$, $2^8$, $2^9$ } |
| Layer 1 Activation | { relu, tanh } |
| Layer 1 Dropout rate | { 0.0, 0.3, 0.6 } |
| Layer 2 Width | { $2^4$, $2^5$, $2^6$, $2^7$, $2^8$, $2^9$ } |
| Layer 2 Activation | { relu, tanh } |
| Layer 2 Dropout rate | { 0.0, 0.3, 0.6 } |

H.2 ML ALGORITHMS IN HPOBENCH

Table 2: Configuration spaces for ML algorithms in HPOBench

Table 3: Configuration spaces for NASBench201

E STEP-THROUGH VISUALIZATION

For illustration purposes, we provide step-through visualizations on a Forrester and a quadratic function. For details of the synthetic functions, we refer to Appendix H.5 and Appendix H.4, respectively. We use the same meta-trained model for the visualizations as the one used in Appendix D.1 for the corresponding problem.

We demonstrate the advantage of using Thompson sampling in two parts. First, we show the MALIBO (TS) variant optimizing the functions sequentially, which corresponds to the normal BO pipeline, to illustrate how the algorithm explores the space efficiently. The illustrations are shown in Fig. 14 and Fig. 15. Thereafter, we show two toy examples of synchronous parallel BO (Kandasamy et al., 2018) using MALIBO (TS) on the same functions. Specifically, we use three Thompson samples as acquisition functions in each iteration and evaluate the three proposed points for the next optimization step. This demonstrates that MALIBO (TS) can easily be extended to parallel BO with the help of Thompson sampling.

To form the Hartmann3 ensemble (Appendix H.7), we choose the distribution for the parameters as

$$\alpha_1 \sim \mathcal{U}(0.0, 2.0), \quad \alpha_2 \sim \mathcal{U}(0.0, 2.0), \quad \alpha_3 \sim \mathcal{U}(2.0, 4.0), \quad \alpha_4 \sim \mathcal{U}(2.0, 4.0). \tag{33}$$

Let $\tau = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$, so that $p(\tau)$ is a four-dimensional uniform distribution.

