LEARNING COLLISION-FREE LATENT SPACE FOR BAYESIAN OPTIMIZATION

Abstract

Learning and optimizing a blackbox function is a common task in Bayesian optimization and experimental design. In real-world scenarios (e.g., tuning hyperparameters for deep learning models, synthesizing a protein sequence, etc.), these functions tend to be expensive to evaluate and often rely on high-dimensional inputs. While classical Bayesian optimization algorithms struggle to handle the scale and complexity of modern experimental design tasks, recent works attempt to get around this issue by applying a neural network ahead of the Gaussian process to learn a (low-dimensional) latent representation. We show that such learned representations often lead to collisions in the latent space: two points with significantly different observations collide in the learned latent space. Collisions can be regarded as additional noise introduced by the neural network, leading to degraded optimization performance. To address this issue, we propose Collision-Free Latent Space Optimization (CoFLO), which employs a novel regularizer to reduce collisions in the learned latent space and encourage the mapping from the latent space to the objective value to be Lipschitz continuous. CoFLO takes in pairs of data points and penalizes those that are too close in the latent space compared to their distance in the target space. We provide a rigorous theoretical justification for the regularizer by inspecting the regret of the proposed algorithm. Our empirical results further demonstrate the effectiveness of CoFLO on several synthetic and real-world Bayesian optimization tasks, including a case study in computational cosmic experimental design.

1. INTRODUCTION

Bayesian optimization is a classical sequential optimization method and is widely used in various fields, including recommender systems, scientific experimental design, hyper-parameter optimization, etc. Many of these applications involve evaluating an expensive blackbox function; therefore the number of queries should be minimized. A common way to model the unknown function is via Gaussian processes (GPs) (Rasmussen and Williams, 2006). GPs have been extensively studied under the bandit setting, and have proven to be an effective approach for addressing a broad class of blackbox function optimization problems. One of the key computational challenges in learning with GPs concerns optimizing the kernels used to model the covariance structure of the GP. As this optimization task scales with the dimension of the feature space, training a Gaussian process model on high-dimensional input is often prohibitively expensive. Meanwhile, Gaussian processes are not intrinsically designed to deal with structured inputs that have strong correlations among different dimensions, e.g., graphs and time sequences. Therefore, dimensionality reduction algorithms are needed to speed up the learning process. Recently, it has become popular to investigate GPs in the context of latent space models. As an example, deep kernel learning (Wilson et al., 2016) simultaneously learns a (low-dimensional) data representation and a scalable kernel via an end-to-end trainable deep neural network. In general, the neural network is trained to learn a simpler latent representation that has reduced dimension and already embeds the structural information for the Gaussian process. Such a combination of neural network and Gaussian process can improve the scalability and extensibility of classical Bayesian optimization, but it also poses new challenges for the optimization task (Tripp et al., 2020).
As we later demonstrate, one critical challenge brought by introducing the neural network is that the latent representation is prone to collisions: two points with significantly different observations can get too close in the latent space. The collision effect is especially evident when information is lost through dimension reduction, and/or when the training data is limited in size, as is typical in Bayesian optimization. As illustrated in Figure 1, when passed through the neural network, data points with drastically different observations are mapped to nearby positions in the latent space. Such collisions can be regarded as additional noise introduced by the neural network. Although Bayesian optimization is known to be robust to mildly noisy observations, collisions in the latent space can be harmful to optimization performance, as it is non-trivial to explicitly model the collisions in the acquisition function. In addition, the extra noise induced by the collision effect further loosens the regret bound for classical Bayesian optimization algorithms (Srinivas et al., 2010).

Overview of main results

To mitigate the collision effect, we propose a novel regularization scheme that can be applied as a simple plugin amendment to latent space-based Bayesian optimization models. The proposed algorithm, namely Collision-Free Latent Space Optimization (CoFLO), leverages a regularized regression loss function to periodically optimize the latent space for Bayesian optimization. Concretely, our regularizer is encoded by a novel pairwise collision penalty function defined jointly on the latent space and the output domain. In order to mitigate the risk of collision in the latent space (and consequently boost the optimization performance), one can apply the regularizer uniformly over the latent space to minimize collisions. However, in Bayesian global optimization tasks, we seek to prioritize the regions close to the possible optimum, as collisions in these regions are more likely to mislead the optimization algorithm. Based on this insight, we propose an optimization-aware regularization scheme, where we assign a higher weight to the collision penalty on those pairs of points closer to the optimum region in the latent space. This algorithm, which we refer to as dynamically-weighted CoFLO, is designed to dynamically assess the importance of a collision during optimization. Compared to a uniform collision penalty over the latent space, the dynamic weighting mechanism demonstrates drastic improvement over state-of-the-art latent space-based Bayesian optimization models. We summarize our key contributions below:
• We propose a novel regularization scheme, as a simple plugin amendment for latent space-based Bayesian optimization models. Our regularizer penalizes collisions in the latent space and effectively reduces the collision effect.
• We propose an optimization-aware dynamic weighting mechanism for adjusting the collision penalty, further improving the effectiveness of regularization for Bayesian optimization.
• We provide theoretical analysis for the performance of Bayesian optimization on the regularized latent space.
• We conduct an extensive empirical study on four synthetic and real-world datasets, including a real-world case study for cosmic experimental design, and demonstrate strong empirical performance for our algorithm.

2. RELATED WORK

Bayesian optimization has demonstrated promising performance in various cost-sensitive global optimization tasks (Shahriari et al., 2016). However, due to its intrinsic computational limitations in the high-dimensional regime, its applicability has been restricted to relatively simple tasks. In this section, we provide a short survey of recent work in Bayesian learning designed to overcome the high-dimensionality challenge in both Bayesian optimization and regression tasks.

Deep kernel learning. Deep kernel learning (DKL) (Wilson et al., 2016) combines the power of the Gaussian process and that of the neural network by introducing a deep neural network g to learn a mapping g : X → Z from the input domain X to a latent space Z, and uses the latent representation z ∈ Z as the input of the Gaussian process. The neural network g and a spectral mixture base kernel k form a scalable, expressive, closed-form covariance kernel for Gaussian processes, denoted by k_DK(x_i, x_j) = k(g(x_i), g(x_j)). Despite encouraging results in numerous regression tasks, it remains unclear whether DKL is readily applicable to Bayesian optimization. One key difference between Bayesian regression and optimization tasks is the assumption on the accessibility of training data: Bayesian optimization often assumes limited access to labeled data, while DKL for regression relies on abundant data to train a deep kernel function. Another problem lies in the difference between the objective functions. While DKL focuses on improving general regression performance, it does not specifically address the problem caused by collisions, which, as we later demonstrate in Section 3.3, can be harmful for sequential decision making tasks.

Representation learning and latent space optimization. Aiming at improving the scalability and extensibility of the Gaussian process, various methods have been proposed to reduce the dimensionality of the original input. Djolonga et al. (2013) assume that only a subset of input dimensions varies and that the kernel is smooth (i.e. with bounded RKHS norm); under these assumptions, they recover the underlying subspace via low-rank matrix completion. Huang et al. (2015) use an autoencoder to learn a low-dimensional representation of the inputs to increase the GP's scalability in regression tasks. Snoek et al. (2015) further propose to learn a pre-trained encoder neural network before BO. Lu et al. (2018) learn a variational auto-encoder iteratively during sequential optimization to embed the structure of the input. The challenge in combining latent space learning with Bayesian optimization is that a pre-trained neural network may not extract adequate information around the more promising regions of the input space. Furthermore, the latent space can become outdated without continuous updates from the latest acquired observations. Tripp et al. (2020) propose to periodically retrain the neural network to learn a better latent space, in order to minimize the number of iterations needed for LSO. They claim that by prioritizing the loss of more promising data points in the original input space (i.e. by assigning a higher weight to these data points), the model can focus more on learning high-value regions and allow substantial extrapolation in the latent space to accelerate the optimization. However, such a framework does not explicitly deal with collisions in the latent space, which we found to be a key factor in the poor performance of modern latent space optimization algorithms.

3. PROBLEM STATEMENT

In this section, we introduce the necessary notation and formally state the problem. We focus on the problem of sequentially optimizing a function f : X → R, where X ⊆ R^d is the input domain. In each round t, we pick a point x_t ∈ X and observe the function value perturbed by additive noise: y_t = f(x_t) + ε_t, with ε_t ∼ N(0, σ²) being i.i.d. Gaussian noise. Our goal is to maximize the sum of rewards Σ_{t=1}^T f(x_t) over T iterations, or equivalently, to minimize the cumulative regret R_T := Σ_{t=1}^T r_t, where r_t := max_{x∈X} f(x) − f(x_t) denotes the instantaneous regret of x_t.
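For concreteness, the regret quantities defined above can be computed as follows (a small sketch in numpy; the function name is ours, and f_opt is assumed known only for evaluation purposes):

```python
import numpy as np

def regrets(f_values, f_opt):
    """Instantaneous, cumulative, and simple regret of a query sequence.

    f_values: objective values f(x_1), ..., f(x_T) of the queried points.
    f_opt:    the optimum max_x f(x), known here only for evaluation.
    """
    f_values = np.asarray(f_values, dtype=float)
    r = f_opt - f_values                     # r_t = max_x f(x) - f(x_t)
    cumulative = float(r.sum())              # R_T = sum_{t=1}^T r_t
    simple = float(f_opt - f_values.max())   # simple regret r*_T
    return r, cumulative, simple
```

For example, querying points with values 1, 2, 3 against an optimum of 3 yields instantaneous regrets 2, 1, 0, cumulative regret 3, and simple regret 0.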

3.1. BAYESIAN OPTIMIZATION

Bayesian optimization typically employs Gaussian processes as the statistical tool for modeling the unknown objective function. The major advantage of using a GP is that it presents a computationally tractable way to depict a sophisticated and consistent view across the space of all possible functions (Rasmussen and Williams, 2005), which allows closed-form posterior estimation in the function space. BO methods start with a prior on the blackbox function. Upon observing new labels, BO iteratively updates the posterior distribution in the function space and maximizes an acquisition function measuring each point's contribution to finding the optimum, in order to select the next point for evaluation. Formally, in Bayesian optimization we assume that f follows a GP(m(x), k(x, x')), where m(x) is the mean function and k(x, x') is the kernel or covariance function. Throughout this paper, we use the squared exponential kernel, k_SE(x, x') = σ²_SE exp(−(x − x')²/(2l²)), where the length scale l determines the length of the "wiggles" and the output variance σ²_SE determines the average distance of the function away from its mean. At iteration T, given the previously selected points A_T = {x_1, ..., x_T} and the corresponding noisy evaluations y_T = [y_1, ..., y_T]^T, the posterior over f also takes the form of a GP, with mean μ_T(x), covariance k_T(x, x'), and variance σ²_T(x):

μ_T(x) = k_T(x)^T (K_T + σ²I)^{−1} y_T
k_T(x, x') = k(x, x') − k_T(x)^T (K_T + σ²I)^{−1} k_T(x')
σ²_T(x) = k_T(x, x)

where k_T(x) = [k(x_1, x), ..., k(x_T, x)]^T and K_T is the positive definite kernel matrix [k(x, x')]_{x,x'∈A_T}. After obtaining the posterior, one can compute the acquisition function α : X → R, which is used to select the next point to be evaluated. Various acquisition functions have been proposed in the literature, including popular choices such as the Upper Confidence Bound (UCB) (Srinivas et al., 2010) and Thompson sampling (TS) (Thompson, 1933).
UCB uses the upper confidence bound α_UCB(x) = μ_t(x) + β_t^{1/2} σ_t(x), with β_t being the confidence coefficient, and enjoys a rigorous sublinear regret bound. TS usually outperforms UCB in practice and has been shown to enjoy a similar regret bound (Agrawal and Goyal, 2012). It samples a function f̃_t from the GP posterior, f̃_t ∼ GP(μ_t(x), k_t(x, x')), and then uses the sample as an acquisition function: α_TS(x) = f̃_t(x). Remark. Regret is commonly used as the performance metric for BO methods. In this work we focus on the simple regret r*_T = max_{x∈X} f(x) − max_{t≤T} f(x_t) and the cumulative regret R_T = Σ_{t=1}^T r_t.
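The posterior formulas and the UCB rule above can be sketched in a few lines of numpy (a toy version with fixed kernel hyperparameters; a practical implementation would use a Cholesky factorization rather than an explicit matrix inverse, and the function names are ours):

```python
import numpy as np

def se_kernel(a, b, l=1.0, sigma_se=1.0):
    # squared exponential kernel k(x, x') = sigma_se^2 exp(-(x - x')^2 / (2 l^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return sigma_se ** 2 * np.exp(-d2 / (2 * l ** 2))

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    # mu_T(x) = k_T(x)^T (K_T + sigma^2 I)^{-1} y_T
    # var_T(x) = k(x, x) - k_T(x)^T (K_T + sigma^2 I)^{-1} k_T(x)
    K = se_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    k_star = se_kernel(X_train, X_test)       # columns are k_T(x) per test point
    K_inv = np.linalg.inv(K)
    mu = k_star.T @ K_inv @ y_train
    var = se_kernel(X_test, X_test).diagonal() - np.einsum(
        "ij,ik,kj->j", k_star, K_inv, k_star)
    return mu, np.maximum(var, 0.0)

def ucb(mu, var, beta=2.0):
    # alpha_UCB(x) = mu_t(x) + beta^{1/2} sigma_t(x)
    return mu + np.sqrt(beta) * np.sqrt(var)
```

With a single training point, the posterior mean interpolates the observation (shrunk by the noise term), the variance collapses near it, and UCB assigns a higher score to unexplored regions where the variance is still near the prior.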

3.2. LATENT SPACE OPTIMIZATION

Recently, Latent Space Optimization (LSO) has been proposed to solve Bayesian optimization problems on complex input domains (Tripp et al., 2020). LSO first learns a latent space mapping g : X → Z to convert the input space X to the latent space Z. Then, it constructs an objective mapping h : Z → R such that f(x) ≈ h(g(x)) for all x ∈ X. The latent space mapping g and a base kernel k can be regarded as a deep kernel, denoted by k_nn(x, x') = k(g(x), g(x')). Thus, the actual input space for BO is the latent space Z and the objective function is h. With the acquisition function α_nn(x) := α(g(x)), it is unnecessary to compute an inverse mapping g^{−1} as discussed in Tripp et al. (2020), since BO can directly select x_t = arg max_{x∈X} α_nn(x) for each t ≤ T and evaluate f. In the meantime, BO can leverage the latent space mapping g, usually represented by a neural network, to effectively learn and optimize the target function h on a lower-dimensional input space.
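The deep kernel k_nn(x, x') = k(g(x), g(x')) is simply a composition; the following minimal sketch (function names ours) uses a hypothetical linear map g that drops one input coordinate, which also previews how such a map can make two distinct inputs indistinguishable to the kernel:

```python
import numpy as np

def se_kernel(a, b, l=1.0):
    # base kernel k on the latent space Z
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

def make_deep_kernel(g, base_kernel):
    # deep kernel k_nn(x, x') = k(g(x), g(x'))
    return lambda a, b: base_kernel(g(a), g(b))
```

For example, with g(x) keeping only the first coordinate, two inputs that differ only in the second coordinate receive kernel value 1 (maximal similarity) even though their objective values may differ, which is exactly the collision effect discussed next.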

3.3. THE COLLISION EFFECT OF LSO

When the mapping g : X → Z is represented by a neural network, it may cause undesirable collisions between different input points in the latent space Z. Under the noise-free setting, we say there exists a collision in Z if ∃x_i, x_j ∈ X such that g(x_i) = g(x_j) while |f(x_i) − f(x_j)| > 0. Such a collision can be regarded as additional (unknown) noise on the observations introduced by the neural network g. For a noisy observation y = f(x) + ε, we define a collision as follows: for ρ > 0, ∃x_i, x_j ∈ D such that |g(x_i) − g(x_j)| < ρ|y_i − y_j|. When the distance between a pair of points in the latent space is too small compared to their difference in the output space, the differing output values of the collided points can be interpreted as the effect of additional observation noise. In general, collisions degrade the performance of LSO. Since the collision effect is a priori unknown, it is often challenging to deal with collisions in LSO, even if we regard them as additional observation noise and increase the (default) noise variance in the Gaussian process. Thus, it is necessary to mitigate the collision effect by directly restraining it in the representation learning phase.
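The noisy-setting definition suggests a simple diagnostic: scan all pairs of observed points and flag those whose latent distance is small compared to their observation gap. A sketch under our own naming (quadratic in the number of points, which is fine for the small datasets typical of BO):

```python
import numpy as np

def find_collisions(Z, y, rho):
    """Flag pairs that collide per Section 3.3: |z_i - z_j| < rho * |y_i - y_j|.

    Z: (n, d_z) latent representations, y: (n,) noisy observations.
    Returns index pairs (i, j) with i < j that collide.
    """
    pairs = []
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(Z[i] - Z[j]) < rho * abs(y[i] - y[j]):
                pairs.append((i, j))
    return pairs
```

For instance, two points mapped 0.01 apart in the latent space while their observations differ by 10 are flagged for any ρ above 0.001.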

4. COLLISION-FREE LATENT SPACE OPTIMIZATION

In this section, we introduce Collision-Free Latent Space Optimization (CoFLO), an algorithmic framework designed to mitigate the collision effect.

4.1. OVERVIEW OF THE COFLO ALGORITHM

The major challenge in restraining collisions in the latent space is that, unlike in the traditional regression problem, we cannot quantify them from a single point's observation. We can, however, quantify collisions by grouping pairs of data points and inspecting their corresponding observations. We define the collision penalty based on pairs of inputs, and further introduce a pair loss function to characterize the collision effect. Based on this pair loss, we propose a novel regularized latent space optimization algorithm, as summarized in Algorithm 1. The proposed algorithm takes pair-wise inputs, concurrently feeds them into the same network, and then calculates the pair loss function. We demonstrate this process in Figure 2. Given a set of labeled data points, we can train the neural network to create an initial latent space representation, similar to DKL (Wilson et al., 2016). Once provided with the initial representation, we can then refine the latent space by running CoFLO and periodically updating the latent space (i.e. updating the latent representation after collecting a batch of data points) to mitigate the collision effect as we collect more labels.

Algorithm 1 Collision-Regularized Latent Space Optimization (CoFLO)
1: Input: regularization weight ρ (cf. Equation 3), penalty parameter λ (cf. Equation 1), retrain interval T, importance weight parameter γ (cf. Equation 2), neural network M_0, base kernel K_0, prior mean μ_0, total time steps T
2: for t = 1 to T do
3:   x_t ← arg max_{x∈D} α(M_t(x))   ▷ maximize the acquisition function

4.2. COLLISION PENALTY

In this subsection, we aim to quantify the collision effect based on the definition proposed in Section 3.3. As illustrated in Figure 2, we feed pairs of data points into the neural network and obtain their latent space representations. Apart from maximizing the GP's likelihood, we concurrently calculate the amount of collision for each pair, and penalize it only if the value is positive.
For x_i, x_j ∈ X, let y_i = f(x_i) + ε_i and y_j = f(x_j) + ε_j be the corresponding observations, and let z_i = g(x_i), z_j = g(x_j) be the corresponding latent space representations. We define the collision penalty as

p_ij = max(λ|y_i − y_j| − |z_i − z_j|, 0)    (1)

where λ is a penalty parameter that controls the smoothness of the target function h : Z → R.
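Equation (1) can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def collision_penalty(y_i, y_j, z_i, z_j, lam):
    # Equation (1): p_ij = max(lambda * |y_i - y_j| - |z_i - z_j|, 0)
    dz = np.linalg.norm(np.atleast_1d(z_i) - np.atleast_1d(z_j))
    return float(max(lam * abs(y_i - y_j) - dz, 0.0))
```

The penalty is nonzero only when the latent distance falls below λ times the observation gap, so well-separated pairs contribute nothing to the loss.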

4.3. DYNAMIC WEIGHT

Note that it is challenging to universally eliminate the collision effect by minimizing the collision penalty and the GP's regression loss; this is particularly true with a limited amount of training data. Fortunately, in optimization tasks it is often unnecessary to learn an equally good representation for suboptimal regions. Therefore, we can dedicate more training resources to improving the learned latent space around the (potentially) near-optimal regions. Following this insight, we propose a weighted collision penalty function, which uses the objective values of each pair as importance weights in each iteration. Formally, for any pair ((x_i, z_i, y_i), (x_j, z_j, y_j)) in a batch of observation pairs D_t = {((x_m, z_m, y_m), (x_n, z_n, y_n))}_{m,n}, we define the importance-weighted penalty function as

p̃_ij = p_ij w_ij,  with  w_ij = e^{γ(y_i+y_j)} / Σ_{(m,n)∈D_t} e^{γ(y_m+y_n)}    (2)

Here the importance weight parameter γ controls the aggressiveness of the weighting strategy. Combining the collision penalty and the regression loss of the GP, we define the pair loss function L as

L_{ρ,λ,γ}(M_t, K_t, D_t) = (1/|D_t|²) Σ_{i∈D_t, j∈D_t} [(GP_{K_t}(M_t(x_i)) − y_i)² + (GP_{K_t}(M_t(x_j)) − y_j)² + ρ p̃_ij]    (3)

Here, GP_{K_t}(M_t(x_i)) denotes the Gaussian process's posterior mean at x_i with kernel K_t and neural network M_t at timestep t, and ρ denotes the regularization weight; as we demonstrate in Section 5, in practice we often choose ρ to keep the penalty at an order close to the regression loss.
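The softmax-style weights of Equation (2) can be sketched as follows (a minimal numpy version; the max-shift is a standard numerical-stability trick, not part of the paper's formula, and the function name is ours):

```python
import numpy as np

def dynamic_weights(y_pairs, gamma):
    """Importance weights of Equation (2):
    w_ij = exp(gamma (y_i + y_j)) / sum_{(m,n) in D_t} exp(gamma (y_m + y_n))."""
    s = np.array([gamma * (yi + yj) for yi, yj in y_pairs])
    e = np.exp(s - s.max())          # shift by the max for numerical stability
    return e / e.sum()
```

The weights sum to one over the batch, and pairs with larger summed observations receive exponentially more weight, which is exactly how the regularizer concentrates on (potentially) near-optimal regions.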

4.4. THEORETICAL ANALYSIS

In this subsection, we provide a theoretical justification for the collision-free regularizer, by inspecting the effect of regularization on the regret bound of CoFLO. We first connect the proposed collision penalty in Equation 1 to Lipschitz-continuity, and then integrate it into the regret analysis to provide an improved regret bound.

Lipschitz continuity of the target function h

The collision penalty encourages Lipschitz continuity of h. Formally, the proposed regularization promotes learning a latent space where, for all x_i, x_j ∈ D with z_i = g(x_i), z_j = g(x_j) ∈ Z,

λ|f(x_i) − f(x_j)| ≤ |g(x_i) − g(x_j)|.

The above inequality amounts to Lipschitz continuity of h with Lipschitz constant 1/λ. Unlike typical smoothness assumptions in GPs, a function can be non-smooth and still Lipschitz continuous. Recently, Ahmed et al. (2019) leverage the Lipschitz continuity of the objective function to propose improved versions of the common acquisition functions, and provide an improved regret bound both theoretically and empirically. In the following, we show that running GP-UCB on the collision-free latent space amounts to an improvement in terms of its regret bound:

Theorem 1. Let Z ⊂ [0, r]^d be compact and convex, d ∈ N, r > 0, L ≥ 0. Suppose that the objective function h defined on Z is a sample from a GP and is Lipschitz continuous with Lipschitz constant L. Let δ ∈ (0, 1), and define β_t = 2 log(π²t²/(6δ)) + 2d log(Lrdt²). Running GP-UCB with this β_t for a sample h of a GP with zero mean function and covariance function k(z, z'), we obtain a regret bound of O*(√(dTγ_T)) with high probability. Precisely, with C_1 = 8/log(1 + σ^{−2}), we have

P{R_T ≤ √(C_1 T β_T γ_T) + 2} ≥ 1 − δ.

Here γ_T is the maximum information gain after T iterations, defined as γ_T := max_{A⊆Z: |A|=T} I(y_A; h_A).

[Figure 8 caption: here σ denotes the empirical standard deviation and n denotes the number of repeated runs. The hyper-parameters are set as follows. The retrain interval T is set to 100 iterations for Figures 8a, 8c, and 8d, and 200 for 8b. The regularization parameter ρ is set to 1e5 for 8a, 8c, and 8d, and 1e3 for 8b. The penalty parameter λ is set to 1e-2 and the weighting parameter γ to 1e-2 in all cases. The prior mean μ_0 is set to 0. The squared exponential kernel is used as the GP covariance in all four experiments. We also show the median curves in the Appendix.]
Comparing the above result to Theorem 2 of Srinivas et al. (2010), which offers a regret bound under a sub-Gaussianity assumption on the objective function's derivative, the second part of our regret bound does not rely on δ. The coefficients are also smaller, as the deterministic bound on the derivative of h avoids a union bound. Remark. The collision penalty encourages h to be Lipschitz continuous on the latent space. Ideally, when the collision penalty term p_ij(λ) converges to zero for all data points in the latent space, we can claim that h is Lipschitz continuous with Lipschitz constant at most 1/λ. Applying Theorem 1 with L = 1/λ, we can tighten the regret bound by choosing a larger λ. However, in practice, since the observations can be noisy, λ cannot be chosen too large: a large λ forces apart pairs whose observed difference is mostly noise, which could make it difficult to learn a meaningful representation.

5. EXPERIMENTS

In this section, we empirically evaluate our algorithm on two synthetic blackbox function optimization tasks and two real-world optimization problems.

5.1. EXPERIMENTAL SETUP

We consider four baselines in our experiments. The rudimentary random selection algorithm (RS) indicates the task complexity. Three popular optimization algorithms, namely particle swarm optimization (PSO) (Miranda, 2018), the Tree-structured Parzen Estimator approach (TPE) (Bergstra et al., 2011), and standard Bayesian optimization (BO) (Nogueira, 2014), which uses a Gaussian process as the statistical model and the upper confidence bound (UCB) as its acquisition function, are tuned for each task. Another baseline we consider is the sample-efficient LSO (SE LSO) algorithm, implemented based on the algorithm proposed by Tripp et al. (2020). We also compare the non-regularized latent space optimization (LSO), Collision-Free Latent Space Optimization (CoFLO), and the dynamically-weighted CoFLO (DW CoFLO) proposed in this paper. The performance for each task is measured on 10,000 pre-collected data points. One crucial problem in practice is tuning the hyper-parameters. The hyper-parameters for the GP are tuned during the periodic retraining in the optimization process, by minimizing the loss function on a validation set. For all our tasks, we choose a simplistic neural network architecture M, due to limited and expensive access to labeled data under the BO setting. The coefficient ρ is, in general, selected to keep the collision penalty at an order similar to the GP loss. The parameter λ should be estimated from the first several sampled data points and should tolerate the additive noise in the evaluations. γ controls the aggressiveness of the importance weighting. While γ should not be too close to zero (which would be equivalent to uniform weighting), an extremely high value could make the regularization overly biased. Such a severe bias could allow a heavily collided representation in most of the latent space and degrade the effectiveness of the regularization. The choice of value is similar to that of the inverse temperature parameter of the softmax in deep learning (Hinton et al., 2015).
Here we use the first batch of observed samples to estimate the order of magnitude of the observations and choose an appropriate γ.
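One way to carry out such an estimate, as a heuristic of our own (the paper does not give an explicit formula): pick γ so that, over the range of values seen in the first batch, the weight of the best pair exceeds that of the worst pair by a chosen target factor:

```python
import numpy as np

def estimate_gamma(y_first_batch, target_ratio=10.0):
    """Hypothetical heuristic, not a formula from the paper: choose gamma so
    that exp(gamma * 2 * range(y)) ~= target_ratio, i.e. the best pair in the
    first batch gets ~target_ratio times the weight of the worst pair."""
    y = np.asarray(y_first_batch, dtype=float)
    spread = 2.0 * (y.max() - y.min())       # max difference of y_i + y_j
    return float(np.log(target_ratio) / max(spread, 1e-12))
```

With observations spanning [0, 1], this gives γ = ln(10)/2 ≈ 1.15 for a target ratio of 10, keeping the weighting informative without collapsing onto a single pair.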

5.2. DATASETS AND RESULTS

We now evaluate CoFLO on two synthetic datasets and two real-world datasets. In the experiments, all input data points are mapped to a one-dimensional latent space via the neural network. We demonstrate the improvement brought by CoFLO's explicit collision mitigation in the lower-dimensional latent space in terms of average simple regret. We also include the median results and statistical tests in the appendix.

2D-Rastrigin

The Rastrigin function is a non-convex function used as a performance test problem for optimization algorithms. It was first proposed by Rastrigin (1974) and is used as a popular benchmark for evaluating Gaussian process regression algorithms (Cully et al., 2018). Formally, the d-dimensional Rastrigin function is

f(x) = 10d + Σ_{i=1}^d [x_i² − 10 cos(2πx_i)],  with d = 2.

For convenience of comparison, we take −f(x) as the objective value to make the optimization a maximization task. The neural network is pretrained on 100 data points. As illustrated by Figure 8a, CoFLO and DW CoFLO quickly reach the (near-) optimal region, while the baselines generally suffer a larger simple regret even after an excessive number of iterations.

Feynman III.9.52 Equation Growing datasets have motivated pure data-backed analysis in physics. The dataset of 100 equations from the Feynman Lectures on Physics for symbolic regression tasks in physics (Udrescu and Tegmark, 2020) can play the role of a test set for data-backed analysis algorithms in physics. The equation III.9.52 we choose to test the optimization algorithms is

ρ = (p_d E_f t / (h/2π)) · sin²((ω − ω₀)t/2) / ((ω − ω₀)t/2)²

The equation has 6 variables as inputs and is reported to require at least 10³ data points for the regression task. The neural network is randomly initialized at the beginning. As illustrated by Figure 8b, in the first 100 iterations, CoFLO and DW CoFLO behave similarly to random selection. After the first training at iteration 100, CoFLO and DW CoFLO approach the optimum at a much faster pace than the baselines; between them, DW CoFLO shows a faster reduction in simple regret.

Supernova Our first real-world task is to perform maximum likelihood inference on 3 cosmological parameters: the Hubble constant H_0 ∈ (60, 80), the dark matter fraction Ω_M ∈ (0, 1), and the dark energy fraction Ω_Λ ∈ (0, 1).
The likelihood is given by the Robertson-Walker metric, which requires a one-dimensional numerical integration for each point in the dataset from Davis et al. (2007). The neural network is pretrained on one hundred data points. As illustrated by Figure 8c, the simple regret of SE LSO drops faster at the beginning, but later remains relatively stable and eventually ends at a level similar to LSO. These results demonstrate the efficiency of SE LSO in finding sub-optimal points. However, without collision reduction, SE LSO cannot outperform LSO in the long run, where both reach their limits. In contrast, CoFLO and DW CoFLO demonstrate robustness close to the optimum, as both steadily approach the optimal value; between them, DW CoFLO slightly outperforms CoFLO.

Redshift Distribution

The challenges in designing and optimizing cosmological experiments grow commensurately with their scale and complexity. Careful accounting of all the requirements and features of these experiments becomes increasingly necessary to achieve the goals of a given cosmic survey. SPOKES (SPectrOscopic KEn Simulation) is an end-to-end framework that can simulate all the operations and key decisions of a cosmic survey (Nord et al., 2016). It can be used for the design, optimization, and forecasting of any cosmic experiment. For example, some cosmic survey campaigns endeavor to observe populations of galaxies that exist at a specific range of redshifts (distances) from us. In this work, we use SPOKES to generate galaxies within a specified window of distances from Earth. We then minimize the Hausdorff distance between the desired redshift distribution and the simulation of specific cosmological surveys generated by SPOKES. In our experiments, the neural network is pretrained on 200 data points. As illustrated by Figure 8d, the simple regret of SE LSO drops faster in the initial phase. However, when it gets close to the (near-) optimal region where the simple regret is approximately 0.15, it is caught up by both CoFLO and DW CoFLO, and is eventually slightly outperformed. This result indicates that the collision problem has more impact when the algorithm gets close to the optimal region. Notice that rudimentary BO eventually outperforms the non-regularized LSO, indicating that without collision mitigation, the learned representation can worsen performance in the later stage, when the algorithm is close to the optimum. In conclusion, collision mitigation as in CoFLO is necessary to further improve the late-stage performance of LSO, where collisions matter more in the near-optimal areas.
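The objective above minimizes a Hausdorff distance between two redshift distributions. As a minimal sketch of the metric itself (independent of SPOKES, which we treat as a black box; the function name and 1-d point-set representation are our own), it can be computed as:

```python
import numpy as np

def hausdorff_1d(a, b):
    """Symmetric Hausdorff distance between two 1-d point sets:
    max over the two directed distances, where each directed distance is the
    largest gap from a point in one set to its nearest point in the other."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.abs(a[:, None] - b[None, :])      # pairwise distance matrix
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Identical point sets yield distance 0, and a single outlying point dominates the metric, which is why it is a strict notion of agreement between a simulated and a desired distribution.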

5.3. DISCUSSION

In general, our experimental results consistently demonstrate the robustness of our methods against collisions in the learned latent space. Our method outperforms all baselines; when compared to the sample-efficient LSO, the dynamically-weighted CoFLO performs better in most cases and shows a steady capability to reach the optimum by explicitly mitigating collisions in the latent space. In contrast, the sample-efficient LSO may fail due to the collision problem.

6. CONCLUSION

We have proposed a novel regularization scheme for latent space-based Bayesian optimization. Our algorithm, namely CoFLO, addresses the collision problem induced by dimensionality reduction and improves the performance of latent space-based optimization algorithms. The regularization is shown to be effective in mitigating the collision problem in the learned latent space, and therefore can boost the performance of Bayesian optimization in the latent space. We demonstrate strong empirical results for CoFLO on several synthetic and real-world datasets, and show that CoFLO is capable of dealing with high-dimensional input, which can be highly valuable for real-world experimental design tasks such as cosmological survey scheduling.

A PROOF OF THEOREM 1

Proof. Applying a union bound with δ/2 in both Lemma 5.5 of Srinivas et al. (2010) and Lemma 1, we have that with probability at least 1 − δ:

r_t = h(z*) − h(z_t) ≤ β_t^{1/2} σ_{t−1}(z_t) + 1/t² + μ_{t−1}(z_t) − h(z_t) ≤ 2β_t^{1/2} σ_{t−1}(z_t) + 1/t²,

which completes the proof.

Now we are ready to use Lemma 5.4 in Srinivas et al. (2010) and Lemma 2 to complete the proof of Theorem 1.

Proof. Using Lemma 5.4 in Srinivas et al. (2010), we have that with probability ≥ 1 − δ:

Σ_{t=1}^T 4β_t σ²_{t−1}(z_t) ≤ C_1 β_T γ_T,  ∀T ≥ 1.

By Cauchy-Schwarz:

Σ_{t=1}^T 2β_t^{1/2} σ_{t−1}(z_t) ≤ √(C_1 T β_T γ_T),  ∀T ≥ 1.

Finally, the residual terms satisfy Σ_{t=1}^T 1/t² ≤ π²/6 ≤ 2, and Theorem 1 follows.

B DEMONSTRATION OF THE COLLISION EFFECT

We demonstrate the collision effect in the latent space. We trained the same neural network on the Feynman dataset with 101 data points, and show the latent space after two retrains with the retrain interval set to 50 data points. The regularized version employed DW CoFLO, with regularization parameter ρ = 1e5, penalty parameter λ = 1e-2, weighting parameter γ = 1e-2, and a squared exponential base kernel. The non-regularized version employed LSO.

C SUPPLEMENTAL MATERIALS ON ALGORITHMIC DETAILS C.1 ALGORITHMIC DETAILS ON NEURAL NETWORK ARCHITECTURE

As the main goal of our paper is to showcase the performance of a novel collision-free regularizer, we picked basic multi-layer dense neural networks as our architectures. For SPOKES, we used a 5-layer dense neural network; its hidden layers consist of 16 neurons with Leaky ReLU nonlinearities, 8 neurons with Sigmoid nonlinearities, 4 neurons with Sigmoid nonlinearities, and 2 neurons with Sigmoid nonlinearities, respectively. Each hidden layer also applies a 0.2 dropout rate, and the output layer applies a Leaky ReLU nonlinearity. For SuperNova, Feynman, and Rastrigin 2D, we used a 4-layer dense neural network; its hidden layers consist of 8 neurons with Sigmoid nonlinearities, 4 neurons with Leaky ReLU nonlinearities, and 2 neurons with Leaky ReLU nonlinearities, respectively. Each hidden layer applies a 0.2 dropout rate, and the output layer also applies a Leaky ReLU nonlinearity. In practice we choose ρ = 1e5, as it keeps the collision penalty on the same order as the GP regression loss in Equation 3.
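A forward pass through the SPOKES architecture described above can be sketched in plain NumPy as follows. The input dimension (here 91), the weight initialization, and all names are our illustrative assumptions; dropout (rate 0.2 at training time) is omitted since this sketch only covers inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense(in_dim, out_dim):
    # Random He-style initialization for one dense layer (illustrative only).
    W = rng.standard_normal((in_dim, out_dim)) * np.sqrt(2.0 / in_dim)
    return W, np.zeros(out_dim)

# SPOKES: hidden layers 16 (Leaky ReLU) -> 8 (Sigmoid) -> 4 (Sigmoid)
# -> 2 (Sigmoid), with a Leaky ReLU output layer.
def spokes_forward(x, params):
    acts = [leaky_relu, sigmoid, sigmoid, sigmoid, leaky_relu]
    for (W, b), act in zip(params, acts):
        x = act(x @ W + b)
    return x

dims = [(91, 16), (16, 8), (8, 4), (4, 2), (2, 1)]  # 91-d input is an assumption
params = [dense(i, o) for i, o in dims]
out = spokes_forward(rng.standard_normal((5, 91)), params)  # batch of 5 inputs
```

The 4-layer variant used for SuperNova, Feynman, and Rastrigin 2D follows the same pattern with hidden widths 8, 4, and 2 and the activation order given above.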

D ADDITIONAL EXPERIMENTAL RESULTS

We provide both the detailed median curves and the p-values of the Welch's t-tests for the experiments discussed in Section 5.

D.1 MEDIAN CURVE

The median curves demonstrate similar trends to the mean curves. In the four experiments, DW CoFLO consistently demonstrates superior performance over the baselines.

Rastrigin-2D:  1.07e-4   3.88e-8   1.01e-2   6.38e-3   1.10e-5   4.23e-1
Supernova:     3.24e-3   3.61e-3   3.18e-2   3.43e-1   1.41e-8   2.62e-1
Feynman:       1.73e-1   1.52e-7   8.20e-1   2.88e-1   6.37e-1   2.25e-1
SPOKES:        4.62e-1   9.90e-3   2.64e-1   4.17e-2   2.87e-3   4.11e-1
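The p-values above come from Welch's t-tests, which, unlike Student's t-test, do not assume equal variances between the two compared methods. A minimal sketch with synthetic regret samples (the data below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical simple regrets from two methods over 15 repeated runs each.
regret_method_a = rng.normal(loc=0.1, scale=0.05, size=15)
regret_method_b = rng.normal(loc=0.5, scale=0.05, size=15)

# Welch's t-test: equal_var=False is what distinguishes it from Student's t.
t_stat, p_value = stats.ttest_ind(regret_method_a, regret_method_b,
                                  equal_var=False)
```

A small p-value indicates the difference in mean regret between the two methods is unlikely under the null hypothesis of equal means.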



Note that we have introduced several hyper-parameters in the algorithm design; we defer our discussion on the choice of these parameters to Section 5. To obtain an initial latent space representation, the labels do not have to be exact and could be collected from a related task at cheaper cost.



Figure 1: Illustration of the collision effect in latent space based Bayesian optimization tasks. Since the data points around the optimum collide severely, BO is misguided toward a sub-optimum.

Figure 2: CoFLO schematic


Figure 3: Experiment results on four pre-collected datasets. Each experiment is repeated at least ten times. The colored area around each mean curve denotes σ/√n, where σ denotes the empirical standard deviation and n denotes the number of repeated runs. The hyper-parameters are set as follows. The retrain interval T is set to 100 iterations for 8a, 8c, and 8d, and 200 for 8b. The regularization parameter ρ is set to 1e5 for 8a, 8c, and 8d, and 1e3 for 8b. The penalty parameter λ is set to 1e-2 in all experiments, and the weighting parameter γ is set to 1e-2. The prior mean μ_0 is set to 0. The squared exponential kernel is used as the GP covariance for all four experiments. We also show the median curves in the Appendix.
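The GP posterior with the squared exponential covariance and zero prior mean used in these experiments can be computed as follows. This is a standard textbook sketch rather than the paper's code; the lengthscale, signal variance, and noise values are illustrative.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared exponential (RBF) covariance between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(Z, y, Zq, noise=1e-6):
    """Posterior mean/covariance at query points Zq, with prior mean mu_0 = 0.

    Z  : (n, d) observed latent points    y : (n,) observed values
    Zq : (m, d) query latent points
    """
    K = sq_exp_kernel(Z, Z) + noise * np.eye(len(Z))
    Ks = sq_exp_kernel(Zq, Z)
    Kss = sq_exp_kernel(Zq, Zq)
    alpha = np.linalg.solve(K, y)
    mu = Ks @ alpha                                  # posterior mean
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)        # posterior covariance
    return mu, cov
```

In the BO loop, an acquisition function (e.g., UCB with the β_t schedule from the regret analysis) would be maximized over this posterior to pick the next query.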

Figure 4: Illustration of the collision and two quantitative measurements of it. For the second panel, the y-axis is the ratio of pairs exceeding the Lipschitz bound, i.e., with |y_1 - y_2| > L · |z_1 - z_2|. For the third panel, the y-axis is the mean of λ = |y_1 - y_2| / |z_1 - z_2|.
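The two measurements in the caption can be computed over all distinct pairs of observed points as follows. This is a sketch under our reading of the caption; the function and variable names are ours.

```python
import numpy as np

def collision_metrics(z, y, L=1.0):
    """Two collision measurements over all distinct pairs (i, j):

    exceed_ratio : fraction of pairs with |y_i - y_j| > L * |z_i - z_j|
    mean_ratio   : mean of |y_i - y_j| / |z_i - z_j| (pairs with dz > 0)
    """
    exceed, ratios = 0, []
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            dy = abs(y[i] - y[j])
            dz = np.linalg.norm(np.atleast_1d(z[i]) - np.atleast_1d(z[j]))
            if dy > L * dz:
                exceed += 1
            if dz > 0:
                ratios.append(dy / dz)
    n_pairs = n * (n - 1) // 2
    return exceed / n_pairs, float(np.mean(ratios))
```

A high exceed ratio (or a mean ratio far above the Lipschitz constant L) indicates that the latent space compresses objective-value differences, i.e., that collisions are present.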

Figure 7: Simple regrets under different parameter settings on the SPOKES dataset. 7b shows that a regularization parameter that is too large can disrupt the training process and degrade performance; we choose ρ = 1e5 in practice, as it keeps the collision penalty on the same order as the GP regression loss in Equation 3. 7a shows that a relatively small penalty value performs well; we believe this is because the wide range of objective values in the tested dataset needs to be mitigated. The curves demonstrate that CoFLO performs well as long as the parameters are not set too large.

The table lists the p-values of Welch's t-tests for the experiments, demonstrating the significance of the improvement brought by DW CoFLO over the baselines.

A REGRET BOUND FOR A LIPSCHITZ-CONTINUOUS OBJECTIVE FUNCTION

In this section, we provide the detailed proof of Theorem 1. We first modify Lemmas 5.7 and 5.8 in Srinivas et al. (2010), since we assume deterministic Lipschitz continuity for h. We use the same analysis tool: a set of discretizations Z_t ⊂ Z, where Z_t is used at time t in the analysis. We choose a discretization Z_t of size (τ_t)^d so that for all z ∈ Z, ||z - [z]_t||_1 ≤ rd/τ_t, where [z]_t denotes the closest point in Z_t to z.

Lemma 1. Pick δ ∈ (0, 1) and set β_t = 2 log(π_t/δ) + 2d log(Lrdt^2), where Σ_{t≥1} π_t^{-1} = 1, π_t > 0. Then with probability ≥ 1 - δ, for all t ∈ N, |h(z^*) - μ_{t-1}([z^*]_t)| ≤ β_t^{1/2} σ_{t-1}([z^*]_t) + 1/t^2.

Proof. Using the Lipschitz continuity and Equation 4, we have that |h(z) - h([z]_t)| ≤ L ||z - [z]_t||_1 ≤ Lrd/τ_t; choosing τ_t = Lrdt^2 gives |h(z) - h([z]_t)| ≤ 1/t^2. Then using Lemma 5.6 in Srinivas et al. (2010), we reach the expected result.

Based on Lemma 5.5 in Srinivas et al. (2010) and Lemma 1, we directly obtain the following result.

Lemma 2. Pick δ ∈ (0, 1) and set β_t = 2 log(2π_t/δ) + 2d log(Lrdt^2), where Σ_{t≥1} π_t^{-1} = 1, π_t > 0. Then with probability ≥ 1 - δ, for all t ∈ N, the regret is bounded as r_t ≤ 2β_t^{1/2} σ_{t-1}(z_t) + 1/t^2.

Figure 5: Illustration of the 1-D latent space of the Feynman III.9.52 dataset. 5a shows a regularized latent space with a few observable collisions. 5b shows a non-regularized latent space with bumps of collisions, especially around the maxima among the observed data points. Moreover, having fewer collisions in the latent space contributes to the optimization by improving the learned Gaussian process. We observed in this comparison that the next point selected by the acquisition function of the regularized version approaches the global optimum, while the next point in the non-regularized version tries to resolve the uncertainty caused by the severe collisions near the currently observed maximum. The neural networks are trained using ADAM with a learning rate of 1e-2.

Figure 6: Network graph of an (L + 1)-layer dense network with D input units and 1 output unit. In our experiments, L is set to 4 for Rastrigin 2D, Feynman II.9.52, and Supernova, and 5 for SPOKES.
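For reference, the bounds in this appendix chain together as follows (a restatement in consistent notation, with C_1 and the information gain γ_T as defined in Srinivas et al. (2010)):

```latex
% Per-step regret (Lemma 2), with probability >= 1 - \delta:
r_t = h(z^*) - h(z_t) \le 2\beta_t^{1/2}\,\sigma_{t-1}(z_t) + \frac{1}{t^2}.
% Lemma 5.4 of Srinivas et al. (2010), then Cauchy--Schwarz:
\sum_{t=1}^{T} 4\beta_t\,\sigma_{t-1}^2(z_t) \le C_1 \beta_T \gamma_T
\;\Longrightarrow\;
\sum_{t=1}^{T} 2\beta_t^{1/2}\,\sigma_{t-1}(z_t) \le \sqrt{C_1 T \beta_T \gamma_T}.
% Summing the per-step bound, using \sum_{t \ge 1} 1/t^2 = \pi^2/6:
R_T = \sum_{t=1}^{T} r_t \le \sqrt{C_1 T \beta_T \gamma_T} + \frac{\pi^2}{6}.
```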

C.2 PARAMETER CHOICES

We further investigate the robustness of the parameter choices for both the regularization parameter ρ and the penalty parameter λ on the SPOKES dataset. We show the results in the figures below.

