FEW-SHOT BAYESIAN OPTIMIZATION WITH DEEP KERNEL SURROGATES

Abstract

Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. the validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates, which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to rethink HPO as a few-shot learning problem in which we train a shared deep surrogate model to quickly adapt, with few response evaluations, to the response function of a new task. We propose the use of a deep kernel network for a Gaussian process surrogate that is meta-learned in an end-to-end fashion in order to jointly approximate the response functions of a collection of training data sets. As a result, the novel few-shot optimization of our deep kernel surrogate leads to new state-of-the-art results in HPO compared to several recent methods on diverse metadata sets.

1. INTRODUCTION

Many machine learning models have very sensitive hyperparameters that must be carefully tuned for efficient use. Unfortunately, finding the right setting is a tedious trial-and-error process that requires expert knowledge. AutoML methods address this problem by providing tools to automate hyperparameter optimization, and Bayesian optimization has become the standard for this task (Snoek et al., 2012). It treats hyperparameter optimization as a black box optimization problem, where the black box function is the hyperparameter response function, which maps a hyperparameter setting to the validation loss. Bayesian optimization consists of two parts. First, a surrogate model, often a Gaussian process, is used to approximate the response function. Second, an acquisition function is used to balance the trade-off between exploration and exploitation. In a sequential process, hyperparameter settings are selected and evaluated, followed by an update of the surrogate model.

Recently, several attempts have been made to extend Bayesian optimization to a transfer learning setup. It is assumed here that historical information on machine learning algorithms evaluated with different hyperparameters is available, either because this information is public (e.g. OpenML) or because the algorithm is repeatedly optimized on different data sets. To this end, several transfer learning surrogates have been proposed that use this additional information to reduce the convergence time of Bayesian optimization.

We propose a new paradigm for accomplishing the knowledge transfer by reconceptualizing the process as a few-shot learning task. Inspiration is drawn from the fact that only a limited number of black box function evaluations are available for a new hyperparameter optimization task (i.e. few shots), while there are ample evaluations of related black box objectives (i.e. evaluated hyperparameters on other data sets). This approach has several advantages.
First, a single model is learned that is trained to quickly adapt to a new task when few examples are available. This is exactly the challenge we face when optimizing hyperparameters. Second, this method scales very well with the number of considered tasks. This not only enables learning from large metadata sets but also allows the problem of label normalization to be dealt with in a new way. Finally, we present an evolutionary algorithm that can use surrogate models to obtain a warm start initialization for Bayesian optimization. All of our contributions are empirically compared with several competitive methods on three different problems. Two ablation studies provide information about the influence of the individual components. In summary, our contributions in this work are:

• This is the first work that, in the context of hyperparameter optimization (HPO), trains the initialization of the parameters of a probabilistic surrogate model on a collection of meta-tasks by few-shot learning and then transfers it by fine-tuning the initialized kernel parameters on a target task.
• We are the first to consider transfer learning in HPO as a few-shot learning task.
• We set a new state of the art in transfer learning for HPO and provide ample evidence that we outperform strong baselines published at ICLR and NeurIPS by a statistically significant margin.
• We present an evolutionary algorithm that can use surrogate models to obtain a warm start initialization for Bayesian optimization.
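To make the sequential procedure from this section concrete, the following is a minimal sketch of a standard Bayesian optimization loop with a plain Gaussian process surrogate and an expected improvement acquisition function. The one-dimensional toy response function, the RBF kernel, its length scale, and the grid of candidates are all illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from math import erf

# Toy 1-d stand-in for a validation-loss response function over a
# single hyperparameter (hypothetical, for illustration only).
def response(x):
    return (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

def rbf(a, b, ls=0.1):
    # RBF kernel between two 1-d point sets.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # Exact GP posterior mean and variance at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0),
                  1e-12, None)
    return mu, var

def expected_improvement(mu, var, best):
    # EI for minimization: trades off low predicted mean
    # (exploitation) against high predictive variance (exploration).
    sd = np.sqrt(var)
    z = (best - mu) / sd
    cdf = 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sd * pdf

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 3)           # small initial design
y = response(X)
grid = np.linspace(0, 1, 200)      # candidate hyperparameter settings

for _ in range(10):                # sequential BO iterations
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.min()))]
    X = np.append(X, x_next)       # evaluate and update the surrogate
    y = np.append(y, response(x_next))
```

Each iteration refits the surrogate on all observations so far and picks the candidate maximizing expected improvement; this is the loop that transfer learning surrogates aim to warm up.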

2. RELATED WORK

The idea of using transfer learning to improve Bayesian optimization has been investigated in several papers. Early work suggests learning a single Gaussian process on the entire data (Bardenet et al., 2013; Swersky et al., 2013; Yogatama & Mann, 2014). Since training a Gaussian process is cubic in the number of training points, this idea does not scale well. This problem has recently been addressed by several proposals to use ensembles of Gaussian processes, where one Gaussian process is learned per task (Wistuba et al., 2018; Feurer et al., 2018). This idea scales linearly in the number of tasks but still cubically in the number of training points per task, so the problem persists in scenarios where a lot of data is available for each task.

Bayesian neural networks are a possibly more scalable way of learning from large amounts of data. For example, Schilling et al. (2015) propose to use a neural network with task embeddings and variable interactions; to obtain mean and variance predictions, the authors propose using an ensemble of models. In contrast, Springenberg et al. (2016) use a Bayesian multi-task neural network. However, since training Bayesian neural networks is computationally intensive, Perrone et al. (2018) propose a more scalable approach: a neural network that is shared by all tasks, combined with Bayesian linear regression for each task, with all parameters trained jointly on the entire data.

While our work shares some similarities with this previous work, our algorithm has unique properties. First, a meta-learning algorithm is used, motivated by recent work on model-agnostic meta-learning for few-shot learning (Finn et al., 2017). This allows us to integrate out all task-specific parameters such that the model does not grow with the number of tasks. As a result, our algorithm scales very well with the number of tasks.
Second, while we also use a neural network, we combine it with a Gaussian process with a nonlinear kernel in order to obtain uncertainty predictions.

A simpler way to use the transfer learning idea in Bayesian optimization is initialization (Feurer et al., 2015; Wistuba et al., 2015a). In this case, the standard Bayesian optimization routine with a simple Gaussian process is used, but it is warm-started with a number of hyperparameter settings that work well on related tasks. We also explore this idea in this paper by proposing a simple evolutionary algorithm that uses a surrogate model to estimate a data-driven warm start initialization sequence. The use of an evolutionary algorithm is motivated by its ease of implementation and its natural capability to deal with both continuous and discrete hyperparameters.
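To illustrate the kind of surrogate discussed above, the following is a minimal sketch of a deep kernel: a neural network feature extractor composed with an RBF kernel, plugged into a Gaussian process for mean and variance prediction. The two-layer network, its random (untrained) weights, and the toy task are illustrative assumptions; in the setting described here, the network weights would be meta-learned across tasks and fine-tuned on the few observations of the target task.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-layer MLP feature extractor phi. Its weights stand in
# for parameters that would be meta-learned; here they are random.
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

def phi(X):
    h = np.tanh(X @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def deep_kernel(Xa, Xb, ls=1.0):
    # RBF kernel applied in the learned feature space phi(x),
    # giving a nonlinear kernel on the raw hyperparameters.
    A, B = phi(Xa), phi(Xb)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls ** 2)

# GP posterior from few observations on a new task (few-shot
# adaptation would additionally fine-tune W1, W2 on these points).
X = rng.uniform(-1, 1, size=(5, 2))     # 5 evaluated settings
y = np.sin(X.sum(1))                    # toy response values
Xq = rng.uniform(-1, 1, size=(3, 2))    # 3 query settings

K = deep_kernel(X, X) + 1e-6 * np.eye(5)
Ks = deep_kernel(X, Xq)
mu = Ks.T @ np.linalg.solve(K, y)                        # posterior mean
var = np.clip(1.0 - (Ks * np.linalg.solve(K, Ks)).sum(0), 0, None)
```

Because the task-specific signal is captured by the GP on top of shared features, only the feature extractor is shared across tasks, which is what keeps the model size independent of the number of tasks.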

3.1. BAYESIAN OPTIMIZATION

Bayesian optimization is an optimization method for computationally intensive black box functions that consists of two main components, the surrogate model and the acquisition function. The surrogate model is a probabilistic model with mean µ and variance σ², which approximates the unknown black box function f, in the following also called the response function. The acquisition function

