EFFICIENT BAYESIAN OPTIMIZATION WITH DEEP KERNEL LEARNING AND TRANSFORMER PRE-TRAINED ON MULTIPLE HETEROGENEOUS DATASETS

Abstract

Bayesian optimization (BO) is widely adopted for black-box optimization problems and relies on a surrogate model to approximate the black-box response function. With a growing number of black-box optimization tasks already solved and even more still to solve, the ability to jointly pre-train a surrogate model on multiple prior tasks is long-awaited as a means to further boost optimization efficiency. In this paper, we propose a simple approach to pre-train a surrogate, namely a Gaussian process (GP) with a kernel defined on deep features learned by a Transformer-based encoder, using datasets from prior tasks with possibly heterogeneous input spaces. In addition, we provide a simple yet effective mix-up initialization strategy for input tokens corresponding to unseen input variables, which accelerates convergence on new tasks. Experiments on both synthetic and real benchmark problems demonstrate the effectiveness of our proposed pre-training and transfer BO strategy over existing methods.

1. INTRODUCTION

In black-box optimization problems, one can only observe outputs of the function being optimized for given inputs, and can hardly access the explicit form of the function. Such optimization problems are ubiquitous in practice (e.g., Mahapatra et al., 2015; Korovina et al., 2020; Griffiths & Lobato, 2020). Among black-box optimization problems, some are particularly challenging because their function evaluations are expensive, in the sense that each evaluation either takes a substantial amount of time or incurs a considerable monetary cost. To this end, Bayesian Optimization (BO; Shahriari et al. (2016)) was proposed as a sample-efficient and derivative-free approach for finding an optimal input of a black-box function. BO algorithms are typically equipped with two core components: a surrogate and an acquisition function. The surrogate models the objective function from historical interactions, and the acquisition function measures the utility of evaluating new input points by trading off exploration and exploitation. Traditional BO algorithms adopt a Gaussian process (GP; Rasmussen & Williams (2009)) as the surrogate, and different tasks are usually optimized separately, each in a cold-start manner. In recent years, as model pre-training has shown significant improvements in both convergence speed and prediction accuracy (Szegedy et al., 2016; Devlin et al., 2019), pre-training surrogates for BO has become a promising research direction for boosting its optimization efficiency. Most existing work on surrogate pre-training (Bardenet et al., 2013; Swersky et al., 2013; Yogatama & Mann, 2014; Springenberg et al., 2016; Wistuba et al., 2017; Perrone et al., 2018; Feurer et al., 2018a; Wistuba & Grabocka, 2021) assumes that the target task shares the same input search space with the prior tasks that generated the historical datasets. If this assumption is violated, the pre-trained surrogate cannot be directly applied and one has to fall back to cold-start BO.
Such an assumption largely restricts the applicability of a pre-trained surrogate, and also prevents it from learning useful information across a large number of similar datasets. To overcome these limitations, a text-based method was proposed recently: it formulates the optimization task as a sequence modeling problem and pre-trains a single surrogate on a variety of optimization trajectories (Chen et al., 2022). In this work, we focus on surrogate pre-training that transfers knowledge from prior tasks to new ones with possibly different input search spaces, further improving optimization efficiency.
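For concreteness, the generic BO loop described above (a GP surrogate fit to past evaluations, and an acquisition function that trades off exploration and exploitation) can be sketched as follows. This is a minimal illustration, not the method proposed in this paper: it assumes a one-dimensional search domain, a fixed-lengthscale RBF kernel, and an upper-confidence-bound (UCB) acquisition maximized over a candidate grid.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    # Squared-exponential kernel between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * lengthscale**2))

def gp_posterior(X, y, Xq, noise=1e-6):
    # Standard zero-mean GP regression: posterior mean and variance at Xq.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Kq = rbf_kernel(Xq, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Kq @ alpha
    v = np.linalg.solve(L, Kq.T)
    var = np.clip(1.0 - np.sum(v**2, 0), 1e-12, None)  # prior k(x, x) = 1
    return mu, var

def bayes_opt(f, bounds, n_init=3, n_iter=10, beta=2.0, seed=0):
    # Cold-start BO: random initial design, then iterate surrogate fit
    # and acquisition maximization over a dense 1-D candidate grid.
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n_init, 1))
    y = np.array([f(x[0]) for x in X])
    Xq = np.linspace(lo, hi, 200)[:, None]
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, Xq)
        ucb = mu + beta * np.sqrt(var)  # exploitation + exploration bonus
        x_next = Xq[np.argmax(ucb)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    best = np.argmax(y)
    return X[best], y[best]

# Hypothetical toy objective: a concave function maximized at x = 0.3.
x_best, y_best = bayes_opt(lambda x: -(x - 0.3) ** 2, bounds=(0.0, 1.0))
```

A pre-trained surrogate, in contrast, would replace the cold-start GP above with one whose kernel (here, the deep features feeding it) has already been fit on prior tasks, so fewer of the expensive evaluations of `f` are spent re-learning structure shared across tasks.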

