EFFICIENT BAYESIAN OPTIMIZATION WITH DEEP KERNEL LEARNING AND TRANSFORMER PRE-TRAINED ON MULTIPLE HETEROGENEOUS DATASETS

Abstract

Bayesian optimization (BO) is widely adopted for black-box optimization problems, and it relies on a surrogate model to approximate the black-box response function. With the increasing number of black-box optimization tasks solved, and even more still to solve, the ability to learn from multiple prior tasks to jointly pre-train a surrogate model is long-awaited as a way to further boost optimization efficiency. In this paper, we propose a simple approach to pre-train a surrogate, namely a Gaussian process (GP) with a kernel defined on deep features learned by a Transformer-based encoder, using datasets from prior tasks with possibly heterogeneous input spaces. In addition, we provide a simple yet effective mix-up initialization strategy for input tokens corresponding to unseen input variables, which accelerates convergence on new tasks. Experiments on both synthetic and real benchmark problems demonstrate the effectiveness of our proposed pre-training and transfer BO strategy over existing methods.

1. INTRODUCTION

In black-box optimization problems, one can only observe outputs of the function being optimized for given inputs, and can hardly access the explicit form of the function. Such optimization problems are ubiquitous in practice (e.g., Mahapatra et al., 2015; Korovina et al., 2020; Griffiths & Lobato, 2020). Among black-box optimization problems, some are particularly challenging because their function evaluations are expensive, in the sense that each evaluation either takes a substantial amount of time or incurs a considerable monetary cost. To this end, Bayesian optimization (BO; Shahriari et al., 2016) was proposed as a sample-efficient and derivative-free approach for finding an optimal input of a black-box function. BO algorithms are typically equipped with two core components: a surrogate and an acquisition function. The surrogate models the objective function from historical interactions, and the acquisition function measures the utility of gathering new input points by trading off exploration and exploitation. Traditional BO algorithms adopt a Gaussian process (GP; Rasmussen & Williams, 2009) as the surrogate, and different tasks are usually optimized separately, each in a cold-start manner. In recent years, as model pre-training has shown significant improvements in both convergence speed and prediction accuracy (Szegedy et al., 2016; Devlin et al., 2019), pre-training surrogate(s) in BO has become a promising research direction for boosting optimization efficiency. Most existing work on surrogate pre-training (Bardenet et al., 2013; Swersky et al., 2013; Yogatama & Mann, 2014; Springenberg et al., 2016; Wistuba et al., 2017; Perrone et al., 2018; Feurer et al., 2018a; Wistuba & Grabocka, 2021) assumes that the target task shares the same input search space as the prior tasks that generated the historical datasets. If this assumption is violated, the pre-trained surrogate cannot be directly applied and one has to resort to cold-start BO.
Such an assumption largely restricts the scope of application of a pre-trained surrogate, and also prevents it from learning useful information by training on a large number of similar datasets. To overcome these limitations, a text-based method was proposed recently: it formulates the optimization task as a sequence modeling problem and pre-trains a single surrogate on various optimization trajectories (Chen et al., 2022). In this work, we focus on surrogate pre-training that transfers knowledge from prior tasks to new ones with possibly different input search spaces, further improving the optimization efficiency of BO. We adopt a combination of a Transformer (Vaswani et al., 2017) and a deep kernel Gaussian process (Wilson et al., 2016b) as the surrogate, which enables joint training on prior datasets with variable input dimensions. For a target task, only the feature tokenizer of the pre-trained model needs to be modularized and reconstructed according to the task's input space. All other modules of the pre-trained model remain unchanged when applied to new tasks, which allows the new task to make the most of prior knowledge. Our contributions can be summarized as follows:

• To the best of our knowledge, this is the first transfer BO method that can jointly pre-train on tabular data from tasks with heterogeneous input spaces.

• We provide a simple yet effective strategy for transferring the pre-trained model to new tasks with previously unseen input variables to improve optimization efficiency.

• Our transfer BO method shows a clear advantage on both synthetic and real problems from different domains, and also achieves new state-of-the-art results on the public HPO-B (Pineda-Arango et al., 2021) datasets.

2. BACKGROUND

Gaussian process A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen & Williams, 2009). Formally, a GP is represented as f(x) ∼ GP(m(x), k(x, x′)), where m(x) and k(x, x′) denote the mean and covariance function, respectively. Given a dataset D = {(x^(i), y^(i))}_{i=1}^n with n examples, any collection of function values has a joint Gaussian distribution f = [f(x^(1)), . . . , f(x^(n))]^⊤ ∼ N(µ, K_{x,x}), where the mean vector satisfies µ_i = m(x^(i)) and [K_{x,x}]_{ij} = k(x^(i), x^(j)). A nice property of a GP is that the distributions of various derived quantities can be obtained in closed form. Specifically, under the additive Gaussian noise assumption y = f(x) + ε with ε ∼ N(0, σ²), the predictive distribution of the GP evaluated at a new test example x^(∗) can be derived as p(f^(∗) | x^(∗), D) = N(E[f^(∗)], cov(f^(∗))), where

E[f^(∗)] = m(x^(∗)) + K_{x^(∗),x} [K_{x,x} + σ²I]^{−1} y,
cov(f^(∗)) = k(x^(∗), x^(∗)) − K_{x^(∗),x} [K_{x,x} + σ²I]^{−1} K_{x,x^(∗)},    (1)

K_{x^(∗),x} denotes the vector of covariances between the test example x^(∗) and the n training examples, and y is the vector consisting of all response values.

Bayesian Optimization Bayesian optimization (Shahriari et al., 2016) uses a probabilistic surrogate model for data-efficient black-box optimization. It is suited for expensive black-box optimization, where objective evaluation can be time-consuming or costly. Given the previously gathered dataset D, BO uses a surrogate model such as a GP to fit the dataset. For a new input x^(∗), the surrogate model gives the predictive distribution in equation 1; an acquisition function is then constructed from both the prediction and the uncertainty information to balance exploitation and exploration. The acquisition function is optimized by a third-party optimizer, such as an evolutionary algorithm, to generate the next BO recommendation. Throughout this paper, we use the lower confidence bound (LCB; Srinivas et al., 2009) as the acquisition function: LCB(x) = m(x) − κ × σ(x), where σ(x) denotes the predictive standard deviation and κ (set to 3 in our experiments) is a constant tuning the exploitation-exploration trade-off.
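The predictive equations and the LCB acquisition above can be sketched in NumPy. This is a minimal illustration, not the paper's actual surrogate: the RBF kernel, its length-scale, the zero prior mean, and the toy objective are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise=1e-2, lengthscale=1.0):
    """Zero-mean GP predictive mean and variance (cf. equation 1)."""
    K = rbf_kernel(X, X, lengthscale)
    K_s = rbf_kernel(X_star, X, lengthscale)
    K_ss = rbf_kernel(X_star, X_star, lengthscale)
    A = K + noise * np.eye(len(X))           # K_{x,x} + sigma^2 I
    mean = K_s @ np.linalg.solve(A, y)       # K_{x*,x} [K + s^2 I]^{-1} y
    cov = K_ss - K_s @ np.linalg.solve(A, K_s.T)
    return mean, np.diag(cov)

def lcb(mean, var, kappa=3.0):
    """Lower confidence bound: m(x) - kappa * sigma(x), to be minimized."""
    return mean - kappa * np.sqrt(np.maximum(var, 0.0))

# Toy 1-D example: fit a noisy sine and pick the next query point.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, (8, 1))
y = np.sin(X).ravel() + 0.01 * rng.standard_normal(8)
grid = np.linspace(0, 6, 200)[:, None]
mean, var = gp_posterior(X, y, grid)
x_next = grid[np.argmin(lcb(mean, var))]     # next BO recommendation
```

In a full BO loop, `x_next` would be evaluated on the objective, appended to D, and the posterior refit before the next iteration.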

FT-Transformer FT-Transformer (Gorishniy et al., 2021) is a recently proposed attention-based model for tabular data. It consists of a Feature-Tokenizer layer, multiple Transformer layers, and a prediction layer. The Feature-Tokenizer layer enables it to handle tabular data. For d numerical input features x = [x_1, . . . , x_d]^⊤, the Feature-Tokenizer layer initializes a value-dependent embedding table W ∈ R^{d×d_e} and a column-dependent embedding table B ∈ R^{d×d_e}, where d_e is the dimension of the embedding vectors. During a forward pass of the Feature-Tokenizer layer, the i-th feature x_i is transformed to x_i × w_i + b_i, where w_i and b_i are the i-th rows of W and B. In this way, an n × d matrix is transformed into an n × d × d_e tensor. A [CLS] token embedding is then appended to the tensor, and the tensor is passed through the stacked Transformer layers to extract output embedding vectors. The output embedding corresponding to the [CLS] token is used as the output representation, which is passed into the prediction layer to produce the final model prediction. Tokenization of categorical features is implemented by a look-up table, in which each categorical variable corresponds to a b_i and each unique value of that variable corresponds to a w_i.
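The numerical Feature-Tokenizer step can be sketched as follows. This is a shape-level illustration only: the random initialization, the dimensions, and the helper name `feature_tokenize` are assumptions for the example, and the subsequent Transformer layers are omitted.

```python
import numpy as np

def feature_tokenize(X, W, B):
    """Map an n x d feature matrix to an n x d x d_e token tensor:
    token[i, j] = X[i, j] * W[j] + B[j] (value- and column-dependent)."""
    # Broadcasting: (n, d, 1) * (d, d_e) + (d, d_e) -> (n, d, d_e)
    return X[:, :, None] * W[None, :, :] + B[None, :, :]

rng = np.random.default_rng(0)
n, d, d_e = 4, 3, 8
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d_e))   # value-dependent embedding table
B = rng.standard_normal((d, d_e))   # column-dependent embedding table

tokens = feature_tokenize(X, W, B)          # shape (n, d, d_e)
cls = rng.standard_normal((1, d_e))         # [CLS] token embedding
seq = np.concatenate([np.broadcast_to(cls, (n, 1, d_e)), tokens], axis=1)
# seq has shape (n, d + 1, d_e); this is what the stacked Transformer
# layers would consume, with the [CLS] position read out at the end.
```

Because W and B are indexed per column, swapping in a new tokenizer for a task with a different input space only requires re-instantiating these two tables, which is the modularization the method above relies on.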

