LEARNING COLLISION-FREE LATENT SPACE FOR BAYESIAN OPTIMIZATION

Abstract

Learning and optimizing a blackbox function is a common task in Bayesian optimization and experimental design. In real-world scenarios (e.g., tuning hyperparameters for deep learning models, synthesizing a protein sequence, etc.), these functions tend to be expensive to evaluate and often rely on high-dimensional inputs. While classical Bayesian optimization algorithms struggle to handle the scale and complexity of modern experimental design tasks, recent works attempt to get around this issue by applying a neural network ahead of the Gaussian process to learn a (low-dimensional) latent representation. We show that such learned representations often lead to collisions in the latent space: two points with significantly different observations collide in the learned latent space. Collisions can be regarded as additional noise introduced by the neural network, leading to degraded optimization performance. To address this issue, we propose Collision-Free Latent Space Optimization (CoFLO), which employs a novel regularizer to reduce collisions in the learned latent space and encourage the mapping from the latent space to the objective value to be Lipschitz continuous. CoFLO takes in pairs of data points and penalizes those that are too close in the latent space relative to their distance in the target space. We provide a rigorous theoretical justification for the regularizer by inspecting the regret of the proposed algorithm. Our empirical results further demonstrate the effectiveness of CoFLO on several synthetic and real-world Bayesian optimization tasks, including a case study for computational cosmic experimental design.
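The pairwise penalty described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `collision_penalty`, the hinge form, and the `lipschitz` parameter are assumptions chosen to match the stated idea of penalizing pairs that are too close in latent space relative to their objective gap.

```python
import numpy as np

def collision_penalty(z, y, lipschitz=1.0):
    """Hypothetical sketch of a pairwise collision regularizer.

    Penalizes pairs that violate the Lipschitz-style condition
    |y_i - y_j| <= lipschitz * ||z_i - z_j||, i.e. points whose
    latent distance is too small relative to their objective gap.
    """
    n = len(y)
    penalty, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            latent_dist = np.linalg.norm(z[i] - z[j])
            target_gap = abs(y[i] - y[j])
            # Hinge loss: only violating ("colliding") pairs contribute.
            penalty += max(0.0, target_gap - lipschitz * latent_dist)
            count += 1
    return penalty / count

# A "collision": two points with very different observations (y = 0 and 2)
# mapped to nearly the same latent point; this pair dominates the penalty.
z = np.array([[0.0, 0.0], [0.01, 0.0], [3.0, 4.0]])
y = np.array([0.0, 2.0, 2.0])
print(collision_penalty(z, y))
```

In practice such a term would be added to the training loss of the neural feature extractor, so that gradient descent pushes colliding pairs apart in the latent space.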

1. INTRODUCTION

Bayesian optimization is a classical sequential optimization method and is widely used in various fields, including recommender systems, scientific experimental design, hyper-parameter optimization, etc. Many of these applications involve evaluating an expensive blackbox function; therefore the number of queries should be minimized. A common way to model the unknown function is via Gaussian processes (GPs) (Rasmussen and Williams, 2006). GPs have been extensively studied under the bandit setting and have proven to be an effective approach for addressing a broad class of blackbox function optimization problems. One of the key computational challenges of learning with GPs concerns optimizing the kernels used to model the covariance structure of the GP. As this optimization task scales with the dimension of the feature space, training a Gaussian process model on high-dimensional input is often prohibitively expensive. Meanwhile, Gaussian processes are not intrinsically designed to deal with structured input that has strong correlations among different dimensions, e.g., graphs and time sequences. Therefore, dimensionality reduction algorithms are needed to speed up the learning process. Recently, it has become popular to investigate GPs in the context of latent space models. As an example, deep kernel learning (Wilson et al., 2016) simultaneously learns a (low-dimensional) data representation and a scalable kernel via an end-to-end trainable deep neural network. In general, the neural network is trained to learn a simpler latent representation with reduced dimension that already embeds the structural information for the Gaussian process. Such a combination of neural network and Gaussian process can improve the scalability and extensibility of classical Bayesian optimization, but it also poses new challenges for the optimization task (Tripp et al., 2020).
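The deep-kernel idea referenced above can be sketched in a few lines: a feature extractor maps high-dimensional inputs to a low-dimensional latent space, and a standard kernel is evaluated on the latent representations. The two-layer `feature_net` and the weight shapes below are illustrative stand-ins (in deep kernel learning the network and kernel hyperparameters are trained jointly by maximizing the GP marginal likelihood, which is omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_net(x, W1, W2):
    """Toy stand-in for the deep feature extractor: maps
    high-dimensional inputs to a low-dimensional latent space."""
    return np.tanh(x @ W1) @ W2

def deep_rbf_kernel(xa, xb, W1, W2, lengthscale=1.0):
    """RBF kernel evaluated in the learned latent space rather than
    the raw input space, as in deep kernel learning."""
    za, zb = feature_net(xa, W1, W2), feature_net(xb, W1, W2)
    sq_dists = ((za[:, None, :] - zb[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

# 20-dimensional inputs compressed to a 2-dimensional latent space.
X = rng.normal(size=(5, 20))
W1 = rng.normal(size=(20, 8)) * 0.1   # illustrative, untrained weights
W2 = rng.normal(size=(8, 2)) * 0.1
K = deep_rbf_kernel(X, X, W1, W2)     # valid 5x5 covariance matrix
```

The GP then uses `K` exactly as it would any kernel matrix; the dimensionality reduction happens entirely inside the kernel, which is what makes the combination end-to-end trainable.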
As we later demonstrate, one critical challenge brought by introducing the neural network is that the latent representation is prone to collisions: two points with significantly different observations can get

