POISSON PROCESS FOR BAYESIAN OPTIMIZATION

Abstract

Bayesian Optimization (BO) is a sample-efficient, model-based method for optimizing black-box functions that can be expensive to evaluate. Traditionally, BO builds a probabilistic surrogate model, such as the Tree-structured Parzen Estimator (TPE), Sequential Model-based Algorithm Configuration (SMAC), or a Gaussian process (GP), based on the exact observed values. However, compared to the value response, relative ranking is less likely to be disrupted by noise, resulting in better robustness. Moreover, it is more practical when exact value responses are intractable but information about candidate preferences can be acquired. This work introduces an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO), consisting of a novel ranking-based response surface based on the Poisson process and two acquisition functions tailored to the proposed surrogate model. We show empirically that PoPBO improves efficacy and efficiency on both simulated and real-world benchmarks, including HPO and NAS.

1. INTRODUCTION

Bayesian optimization (BO) (Mockus et al., 1978) is a popular black-box optimization paradigm and has achieved great success in a number of challenging fields, such as robotic control (Calandra et al., 2016), biology (González et al., 2015), and hyperparameter tuning for complex learning tasks (Bergstra et al., 2011). A standard BO routine usually consists of two steps: (1) learning a probabilistic response surface that captures the distribution of an unknown function f(x); (2) optimizing an acquisition function that suggests the most valuable points for the next query iteration. Popular response surfaces for the first step include Random Forest (SMAC) (Hutter et al., 2011), the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), Gaussian Process (GP) (Snoek et al., 2012), and Bayesian Neural Network (BNN) (Springenberg et al., 2016; Snoek et al., 2015). Acquisition functions for the second step include Expected Improvement (EI) (Mockus, 1994), Thompson Sampling (TS) (Chapelle & Li, 2011; Agrawal & Goyal, 2013), and the Upper/Lower Confidence Bound (UCB/LCB) (Srinivas et al., 2012), which are designed to trade off exploration and exploitation. Most of the existing BO methods (Bergstra et al., 2011; Hutter et al., 2011; Snoek et al., 2012) adopt absolute response surfaces that attempt to fit the black-box function based on the observed absolute function values. However, such an absolute metric has the following disadvantages. 1) Absolute responses can be difficult to obtain or even unavailable in some practical scenarios, such as sports games and recommender systems, where only relative evaluations can be provided by pairwise comparison (He et al., 2022). 2) Absolute responses can be sensitive to noise, as also pointed out by Rosset et al. (2005). This issue degrades the performance of BO in real-world scenarios, where absolute responses are usually noisy. 3) It can be challenging to directly transfer absolute response surfaces.
In particular, multi-fidelity metrics usually yield different absolute responses for the same candidate, making it hard to use historical observations on a coarse-fidelity metric to warm up the training of surrogate models on a fine-grained-fidelity one. Similarly, in hyperparameter optimization (HPO) and neural architecture search (NAS) tasks, the performance of the same hyperparameter configuration or neural architecture differs across datasets and is hard to transfer between them. Relative metrics can be an effective cure for the above issues. 1) A relative response such as ranking is more practical when information about candidate preferences can be acquired more easily than raw values (González et al., 2017), and it is widely used in many prior works (Kahneman & Tversky, 2013; Brusilovsky et al., 2007; González et al., 2017). 2) A relative response is more robust to noise than an absolute response, since relations such as the ranking between candidates are harder to disrupt by noise, whereas absolute values are sensitive to it. In this work, we analyze the robustness of rankings in Sec. 3.1 under the common additive Gaussian noise assumption, showing that rankings are less sensitive to noise than absolute values. Similar conclusions about the advantage of ranking models have been drawn in other areas, e.g., by Rosset et al. (2005). 3) A relative response, such as the ranking between candidates, has better transferability, since rankings are usually comparable across multi-fidelity metrics or across evaluations of the same candidate on different datasets. This is also demonstrated by (Salinas et al., 2020; Nguyen et al., 2021; Feurer et al., 2018). Some Bayesian optimization methods also adopt relative responses and are related to our work. Preferential BO methods (González et al., 2017; Mikkola et al., 2020) attempt to capture the relative preference by comparing pairs of candidates.
However, they have to rely on a computationally expensive soft-Copeland score (PBO) or optimize EI via projective preferential queries (PPBO) to propose the next query (optimal candidate). Moreover, they ignore tie situations, which commonly exist in real scenarios. Nguyen et al. (2021) extend the above methods by comparing k samples. Specifically, they utilize a Gaussian process to model the absolute function values and leverage a multinomial logit model to build the evidence likelihood of the local ranking of k observations. Although this method overcomes the computational disadvantage and takes ties into account, it essentially models the absolute response and only captures the relationship (local ranking) among k candidates. In contrast to the above methods, we propose to capture the global ranking of each candidate in a feasible domain (search space) and model the relative response directly. On the one hand, we can directly search for the optimum based on our relative response surface and obtain the next query without a computationally expensive procedure (González et al., 2017; Mikkola et al., 2020). On the other hand, unlike (Nguyen et al., 2021), which first builds an absolute response surface and then derives the local ranking among k candidates as the evidence, our method directly fits a ranking-based relative response surface. Moreover, due to the nature of ranking, our method can handle tie situations where candidates have the same ranking. Specifically, we adopt the Poisson Process (PP) to capture the global ranking, which is naturally suitable since the ranking of a candidate can be obtained by counting the number of better candidates. Fig. 1 shows the superiority of our response surface in capturing the global ranking against the GP-based one. Specifically, we conduct experiments on the Forrester function with various degrees of additive Gaussian noise. The setting details can be found in Appendix C.1.
Our response surface is more robust to noise and better captures the global ranking. Furthermore, we derive two acquisition functions to accommodate our response surface for a better exploitation-exploration trade-off. Finally, we propose a novel Bayesian optimization framework, named PoPBO, achieving lower regret (better performance) with faster speed. Our contributions can be summarized as follows: 1) Ranking-based Response Surface based on Poisson Process. Unlike prior absolute response surfaces (Bergstra et al., 2011; Snoek et al., 2012), or those (Nguyen et al., 2021) using a relative evidence likelihood built on absolute responses, this work is, to the best of our knowledge, the first to directly capture the global ranking over a feasible domain via the Poisson process. The robustness against noise is analyzed in Sec. 3.1 and illustrated in Fig. 1. 2) Tailored Acquisition Functions for the Ranking-based Response Surface. Two acquisition functions for our response surface, named R-LCB and ERI, are derived from the vanilla LCB and EI for a better exploitation-exploration trade-off. Gradients of the proposed acquisition functions w.r.t. candidates are also derived, so the next query can be optimized by SGD. 3) Computationally Efficient Bayesian Optimization Framework. The proposed ranking-based response surface and acquisition functions form a novel Bayesian optimization framework: Poisson Process Bayesian Optimization (PoPBO). Our framework is much faster than Gaussian-process-based BO methods. Specifically, the computational complexity of PoPBO is O(N^2) compared to O(N^3) for GP, where N is the number of samples (see Fig. 3).

4) Extensive Empirical Study with Strong Performance. Our method achieves substantial improvements over many prior BO methods on the simulated functions and on multiple benchmarks with real-world datasets, including hyperparameter optimization and neural architecture search.

Figure 1: The solid black line (oracle) indicates the actual rankings, based on the Forrester function value, of 100 points evenly spaced from 0 to 0.8. We draw lines between the 100 predictions by linear interpolation for a clear illustration. The dashed lines indicate the rankings over the 100 points predicted by (a) the Gaussian process and (b) the Poisson process on observations with varying degrees of noise, whose standard deviation σ ranges from 0 to 0.45. Each response surface is trained on the same 15 queries. Note that GP performs worse as the standard deviation of the noise increases. In contrast, PP performs consistently well due to its robustness against noise, in line with our analysis.

2. PRELIMINARIES AND BACKGROUND

Bayesian Optimization. Consider minimizing a black-box target function f(⋅): X → R defined on a d-dimensional feasible domain X ⊂ R^d; our goal is to find a global minimum of f(⋅): x* = arg min_{x∈X} f(x). (1) Such black-box objective functions can be expensive to evaluate, e.g., functions without closed-form expressions/derivatives or neural networks requiring long training. Bayesian Optimization (BO) is an efficient method to solve these problems (Brochu et al., 2010; Perrone et al., 2018). It relies on a response surface built over the black-box function, which is fitted sequentially using new queries selected according to an acquisition function that trades off exploitation and exploration. A Gaussian process with a nonlinear kernel is one of the most popular choices of response surface; it places a GP prior on the unknown function f and gives the posterior in exact form conditioned on the prior observations D = {(x_j, y_j)}_{j=1}^J. SMAC (Hutter et al., 2011) introduces random forest models for regression, which can handle categorical hyperparameters. TPE (Bergstra et al., 2011) is another BO method that models two densities l(x) = p(x | y < α, D) and g(x) = p(x | y ≥ α, D) via kernel density estimators instead of modeling p(y|x) directly like GP; the ratio l(x)/g(x) is then optimized as the acquisition function to find the next query. Recently, Bayesian neural networks were introduced into the BO framework (Snoek et al., 2015) to model the response surface due to their flexibility and scalability, and Springenberg et al. (2016) use a more robust stochastic gradient MCMC method (Chen et al., 2014) to estimate the posterior. Besides a powerful response surface, BO also needs an exquisite criterion for suggesting the next query, considering the trade-off between exploitation and exploration. The most common acquisition function is EI, since it is intuitive and has achieved strong performance in various tasks.
Another common criterion is LCB, which minimizes regret during sequential optimization. BO with Relative Metrics. A relative metric does not require absolute responses of the black-box function. Some methods focus on cases where the function evaluation is not directly accessible (Brusilovsky et al., 2007; González et al., 2017; Mikkola et al., 2020; Siivola et al., 2021). Absolute responses can be difficult to obtain or even unavailable in some practical scenarios, such as sports games and recommender systems (Brusilovsky et al., 2007), where only relative evaluations can be provided by pairwise comparisons. Preferential Bayesian Optimization (PBO) (González et al., 2017) captures correlations between different inputs to find the optimal value of a latent function, requiring a limited number of comparisons. To handle high-dimensional black-box functions, Projective Preferential Bayesian Optimization (PPBO) (Mikkola et al., 2020) proposes a projective preferential query that allows feedback via human interaction. However, these methods ignore tie situations and rely on a computationally expensive procedure to suggest the next query. Nguyen et al. (2021) extend the above methods by comparing k samples but have to model the absolute response surface with a Gaussian process and assume that the noise obeys a Gumbel distribution. In addition, ranking-based methods (Feurer et al., 2018; Salinas et al., 2020) can also facilitate the identification of similar runs for transfer learning, reusing insights from past similar experiments. This work, on the contrary, makes the first attempt to directly capture the global ranking of candidates based on the Poisson process and derives a novel Bayesian optimization framework named PoPBO. We analyze the robustness of the relative metric (ranking) against noise and show the outstanding performance of our method on various simulated benchmarks and real-world datasets.
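The two-step BO routine described above can be written as a generic loop. Below is a minimal Python sketch over a finite candidate pool; `fit` and `acquire` are hypothetical stand-ins for any surrogate/acquisition pair (GP with EI, TPE's density ratio, or the ranking-based surface proposed in this paper), not an implementation from the paper.

```python
import random

def bayes_opt_loop(f, candidates, n_init, n_iter, fit, acquire):
    """Generic two-step BO routine: (1) fit a response surface on the
    observations, (2) minimize an acquisition score to pick the next query."""
    observed = {x: f(x) for x in random.sample(candidates, n_init)}
    for _ in range(n_iter):
        model = fit(observed)                                 # step 1: response surface
        pool = [x for x in candidates if x not in observed]
        x_next = min(pool, key=lambda x: acquire(model, x))   # step 2: acquisition
        observed[x_next] = f(x_next)
    return min(observed, key=observed.get)                    # incumbent (lowest value)
```

With `n_init + n_iter` equal to the pool size the loop degenerates to exhaustive search, which makes the sketch easy to sanity-check; real BO spends far fewer evaluations by exploiting the surrogate.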

3. POISSON PROCESS FOR BAYESIAN OPTIMIZATION

This section first introduces a ranking-based evaluation and analyzes its robustness compared to the value-based evaluation. Then, we propose a novel response surface based on Poisson process to capture the ranking-based evaluation for each candidate and derive the log-likelihood on observations. Finally, we fit the ranking-based response surface by training the weights through SGD and then provide the posterior probability.

3.1. RANKING-BASED EVALUATION

Suppose a black-box score function f(⋅) is defined on a feasible domain X consisting of all optional candidates. Given a sample x ∈ X and a subset S ⊂ X of the feasible domain, we define the set S_x = {y | y ∈ S, f(y) < f(x)}, consisting of the candidates in S that are better than x. Considering the physical meaning of S_x, we can estimate the superiority of x under f(⋅) against the points in S by measuring S_x. Specifically, for two points x_1, x_2, if S_{x_1} has a larger measure than S_{x_2}, then more points in S are better than x_1, so x_1 is worse than x_2. To construct such a metric, we first introduce the collection of sets Q = {S_x | S ⊂ X, x ∈ X}. Given a sampled set Ŝ ⊂ X, we can define a measure μ on the space (X, Q) by the number of elements in S_x ∩ Ŝ: μ(S_x) = |S_x ∩ Ŝ|. (2) Robustness Analysis. Suppose a black-box function f(x) has additive Gaussian noisy observations f̃(x) = f(x) + ϵ, ϵ ∼ N(0, σ^2). Consider two queries x_1, x_2 with observations f̃(x_1), f̃(x_2). Assuming f(x_1) < f(x_2) without loss of generality, the probability of correctly ranking x_1, x_2 is: P(f̃(x_1) < f̃(x_2)) = P(ϵ_1 − ϵ_2 < f(x_2) − f(x_1)). (3) Since ϵ_1, ϵ_2 ∼ N(0, σ^2) are independent, Δϵ = ϵ_1 − ϵ_2 ∼ N(0, 2σ^2). By the standard Gaussian tail bounds, if f(x_2) − f(x_1) > √2σ, the probability of correctly ranking x_1, x_2 is larger than 84.13%; if f(x_2) − f(x_1) > 2√2σ, it is larger than 97.72%. Therefore, even if the observations are noisy, the ranking of candidates is hard to disrupt. Fig. 1(b) verifies the robustness of the ranking-based evaluation to noise. Specifically, we do not assume any prior distribution for the noise when training our PP response surface, but it still captures the correct ranking (black line) under various noise levels.
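The bound above is easy to check numerically. A minimal sketch (the function name `p_correct_ranking` is ours, not from the paper) that evaluates P(Δϵ < f(x_2) − f(x_1)) for Δϵ ∼ N(0, 2σ^2), writing the Gaussian CDF via `erf`:

```python
from math import erf, sqrt

def p_correct_ranking(gap, sigma):
    """Probability that two noisy observations preserve the true ordering.

    gap:   true value gap f(x2) - f(x1) > 0
    sigma: std of the additive Gaussian noise on each observation
    Ranking is correct iff eps1 - eps2 < gap, with eps1 - eps2 ~ N(0, 2 sigma^2),
    so P = Phi(gap / (sqrt(2) * sigma)) = 0.5 * (1 + erf(gap / (2 * sigma))).
    """
    return 0.5 * (1.0 + erf(gap / (2.0 * sigma)))
```

Plugging in gap = √2σ gives Φ(1) ≈ 0.8413 and gap = 2√2σ gives Φ(2) ≈ 0.9772, matching the quoted thresholds.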

3.2. CAPTURING THE RANKING VIA POISSON PROCESS

Given a sample x and a set Ŝ, we utilize a random process R̂_x(S), ∀S ⊂ X, to capture the ranking of x. Though R̂_x(S) also depends on Ŝ, we omit it for conciseness. In particular, we define R̂_x ≜ R̂_x(X) to denote the ranking of x over the whole feasible domain X, which is a random variable. We can model R̂_x(S), ∀S ⊂ X, as an independent-increment counting process, since the objective function f(x) is black-box and we assume the rankings of x over two disjoint areas are independent, i.e., R̂_x(S_1) ⊥ R̂_x(S_2), ∀S_1, S_2 ⊂ X, S_1 ∩ S_2 = ∅. Moreover, R̂_x(S) has the following properties: 1) R̂_x(∅) = 0 and 2) lim_{Δs→0} P(R̂_x(S + Δs) − R̂_x(S) ≥ 2) = 0, ∀S ⊂ X. A detailed discussion is provided in Appendix A. Since the supremum of R̂_x(S) is |Ŝ|, R̂_x(S) obeys a right-truncated non-homogeneous Poisson process (Yigiter & Inal, 2006) with intensity λ(s, x), s ∈ X:

R̂_x(S) ∼ Poisson(∫_S λ(s, x) ds). (4)

Hence, with R̂_x = R̂_x(X) denoting the ranking of x over the whole feasible domain, the probability of R̂_x = k is:

P(R̂_x = k | x, Ŝ) = (∫_X λ(s, x) ds)^k / k! ⋅ exp(−∫_X λ(s, x) ds) / Z(x)
                  = (λ_ξ(x)|X|)^k / k! ⋅ exp(−λ_ξ(x)|X|) / Z(x), (5)

Z(x) = Σ_{k=0}^{|Ŝ∖{x}|} [(λ_ξ(x)|X|)^k / k! ⋅ exp(−λ_ξ(x)|X|)], (6)

where Z(x) is the normalizing coefficient and |Ŝ∖{x}| is the number of sampled points excluding x. There exists ξ ∈ X satisfying ∫_X λ(s, x) ds = λ_ξ(x)|X| by the mean value theorem for integrals. We approximate λ_ξ(x) by a multi-layer perceptron (MLP) λ_ξ(x; θ) with parameters θ. Given N (N ≥ 2) samples Ŝ = {x_j}_{j=1}^N, the rankings of the samples over Ŝ are K̂ = {k̂_{x_j}}_{j=1}^N. With an independence assumption on the observations, similar to (Salinas et al., 2020), the log-likelihood is:

log L(K̂ | Ŝ; θ) = Σ_{j=1}^N { k̂_{x_j} log(λ_ξ(x_j; θ)|X|) − log(k̂_{x_j}!) − log[Σ_{i=0}^{N−1} (λ_ξ(x_j; θ)|X|)^i / i!] }. (7)

We can train the weights θ by minimizing the negative log-likelihood of Eq. 7 by SGD, whose gradient can be computed as follows:

∂/∂θ (−log L(K̂ | Ŝ)) = Σ_{j=1}^N ∂λ_ξ(x_j; θ)/∂θ ⋅ [ |X| ⋅ (Σ_{i=0}^{N−2} (λ_ξ(x_j; θ)|X|)^i / i!) / (Σ_{i=0}^{N−1} (λ_ξ(x_j; θ)|X|)^i / i!) − k̂_{x_j} / λ_ξ(x_j; θ) ]. (8)

Once θ is determined after training on the observations (Ŝ, K̂), the ranking of a new sample x* over the whole feasible domain X can be predicted as:

P(R̂_{x*}(X) = k | θ, x*, Ŝ) = (λ_ξ(x*; θ)|X|)^k / k! ⋅ exp(−λ_ξ(x*; θ)|X|) / Z(x*), (9)

where Z(x*) is the normalizing coefficient given by Eq. 6. The proposed Bayesian optimization framework with the Poisson process (PoPBO) is outlined in Alg. 1 in Appendix B. The acquisition functions are introduced in the next section.
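As a sanity check of Eq. 7, the negative log-likelihood of the right-truncated Poisson model can be computed directly in log space. Below is a minimal Python sketch under the paper's notation; the helper name and the flat-list interface are our assumptions, and the MLP producing the intensities is left abstract.

```python
from math import lgamma, log, exp

def truncated_poisson_nll(lam, ranks, domain_size):
    """Negative log-likelihood of Eq. 7 (right-truncated Poisson over rankings).

    lam:         intensities lambda_xi(x_j; theta) for the N observed samples
    ranks:       observed rankings k_hat_{x_j}, each in {0, ..., N-1}
    domain_size: |X|, the measure of the feasible domain
    """
    n = len(lam)
    nll = 0.0
    for lam_j, k_j in zip(lam, ranks):
        rate = lam_j * domain_size
        # truncated normalizer sum_{i=0}^{N-1} rate^i / i!, computed in log space
        log_terms = [i * log(rate) - lgamma(i + 1) for i in range(n)]
        m = max(log_terms)
        log_norm = m + log(sum(exp(t - m) for t in log_terms))
        nll -= k_j * log(rate) - lgamma(k_j + 1) - log_norm
    return nll
```

Minimizing this quantity in θ (the paper trains the MLP λ_ξ(x; θ) by SGD) corresponds to maximizing the log-likelihood of Eq. 7; the exp(−λ|X|) factors of Eqs. 5-6 cancel against the normalizer and therefore do not appear.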

4. ACQUISITION FUNCTION FOR POPBO

The existing acquisition functions are designed for absolute response surfaces with independent mean and variance, which can be improper for our response surface since the mean of a Poisson distribution equals its variance. Directly applying these acquisition functions to our PoPBO causes an over-exploitation issue. To this end, we introduce two acquisition functions, named the Rectified Lower Confidence Bound (R-LCB) and Expected Ranking Improvement (ERI), derived from the vanilla LCB and EI, respectively.

4.1. RECTIFIED LOWER CONFIDENCE BOUND (R-LCB)

The ranking of each point x obeys the truncated Poisson distribution in Eq. 9, with expectation

µ(x) = λ_ξ(x; θ)|X| ⋅ (Σ_{i=0}^{N−1} (λ_ξ(x; θ)|X|)^i / i!) / (Σ_{i=0}^{N} (λ_ξ(x; θ)|X|)^i / i!)

and standard deviation σ(x) = √µ(x). Thus the vanilla LCB of each point is:

α_LCB(x) = µ(x) − βσ(x) = √µ(x) (√µ(x) − β). (10)

However, a Poisson distribution with a large expectation has a large variance, indicating less confidence in the ranking prediction, so the vanilla LCB is easily trapped in over-exploitation. Therefore, we restrict the lower values with a threshold and define the rectified LCB (R-LCB):

α_R-LCB(x) = { α_LCB(x), if λ_ξ(x; θ)|X| < q|Ŝ|;  ϵ_x with ϵ_x ∼ U[0, 1], otherwise }, (11)

where ϵ_x is used for re-parameterization, and q is a quantile of the number of existing samples, according to which the threshold q|Ŝ| can be adaptively adjusted during the BO process. To minimize R-LCB, we randomly sample a set of starting points and adopt LBFGS (Liu & Nocedal, 1989) for optimization. In particular, LBFGS does not update samples whose predicted ranking exceeds q|Ŝ|; they still have a chance of being selected as the next query if the sampled ϵ_x is very small. We set q = 0.6 by default. Results in Fig. 6 show the advantage of our R-LCB over LCB.
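Eq. 11 amounts to a simple branch on the predicted intensity. Below is a hedged sketch; `mu_fn` and `rate_fn` are hypothetical helpers wrapping the trained MLP (returning µ(x) and λ_ξ(x; θ)|X| respectively), not an API from the paper.

```python
import random
from math import sqrt

def r_lcb(mu_fn, rate_fn, x, beta, q, n_samples, rng=random.random):
    """Sketch of the rectified LCB of Eq. 11.

    Points whose predicted ranking stays below the q-quantile threshold get
    the vanilla LCB score; the rest get a uniform draw in [0, 1] (the
    re-parameterized epsilon_x), so they are rarely but occasionally selected.
    """
    if rate_fn(x) < q * n_samples:      # below threshold q * |S_hat|
        mu = mu_fn(x)
        return mu - beta * sqrt(mu)     # vanilla LCB, using sigma = sqrt(mu)
    return rng()                        # rectified branch
```

Since the rectified branch is constant in x, a gradient-based optimizer such as LBFGS leaves those points untouched, exactly as described above.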

4.2. EXPECTED RANKING IMPROVEMENT (ERI)

Inspired by EI, we introduce ERI to maximize the expected improvement in ranking over the worst tolerable ranking K_m, balancing exploitation and exploration. We set K_m = 5 by default:

α_ERI(x) = Σ_{k=0}^{K_m} (K_m − k) ⋅ P(R̂_x = k | θ, x),

where P(R̂_x = k | θ, x) is defined in Eq. 9 and represents the predicted ranking of x. Writing λ_ξ(x) = λ_ξ(x; θ) and S_n = Σ_{i=0}^{n} (λ_ξ(x)|X|)^i / i!, the gradient of ERI w.r.t. x is:

∂α_ERI(x)/∂x = Σ_{k=0}^{K_m} (K_m − k) ⋅ ∂/∂x [ (λ_ξ(x)|X|)^k / k! ⋅ exp(−λ_ξ(x)|X|) / Z(x) ]
             = Σ_{k=0}^{K_m} { (K_m − k) ⋅ (λ_ξ(x)|X|)^{k−1} |X| / (k! S_N^2) ⋅ [ k S_N − λ_ξ(x)|X| S_{N−1} ] } ⋅ ∂λ_ξ(x)/∂x.

Hence, we can get the next query x* by maximizing α_ERI(x) (i.e., minimizing −α_ERI(x)) with the LBFGS optimizer. Similar to R-LCB, we also apply the rectified technique of Eq. 11 to ERI.
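The ERI score is a finite sum over the truncated-Poisson probabilities of Eq. 9. A minimal Python sketch (function names are ours; the pmf follows Eq. 9, where the exp and Z factors reduce to a renormalized Poisson mass):

```python
from math import lgamma, log, exp

def truncated_poisson_pmf(rate, k, n):
    """P(R = k) for a Poisson(rate) right-truncated at n, as in Eq. 9."""
    log_terms = [i * log(rate) - lgamma(i + 1) for i in range(n + 1)]
    m = max(log_terms)
    log_norm = m + log(sum(exp(t - m) for t in log_terms))
    return exp(k * log(rate) - lgamma(k + 1) - log_norm)

def eri(rate, k_m, n):
    """Expected Ranking Improvement over the worst tolerable ranking K_m:
    sum_{k=0}^{K_m} (K_m - k) * P(R = k)."""
    return sum((k_m - k) * truncated_poisson_pmf(rate, k, n) for k in range(k_m + 1))
```

A point predicted to rank near the top (small rate = λ_ξ(x)|X|) concentrates probability on small k and yields an ERI close to K_m, while a point predicted to rank poorly yields an ERI near zero.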

5. EMPIRICAL ANALYSIS

Benchmarks. We verify the efficacy of PoPBO on both simulated and real-world benchmarks, including HPO and NAS. For the simulated benchmark, we apply PoPBO to optimize three simulation functions: 1) the 2-d Branin function, where the domains of the two dimensions are [-5, 10] and [0, 15], respectively; 2) the 6-d Hartmann function on [0, 1] in all six dimensions; 3) the 6-d Rosenbrock function defined on [-5, 10]^6. For the HPO task, we test PoPBO on the tabular benchmark HPO-Bench (Eggensperger et al., 2021), containing the root mean square error (RMSE) of a 2-layer feed-forward neural network (FCNET) (Klein & Hutter, 2019) trained under 62,208 hyperparameter configurations on four real-world datasets: protein structure (Rana, 2013), slice localization (Graf et al., 2011), naval propulsion (Coraddu et al., 2016), and Parkinsons telemonitoring (Tsanas et al., 2010). The RMSE averaged over four independent runs under the same configuration is used as the performance of that configuration. For the NAS task, we test on NAS-Bench-201 (Dong & Yang, 2020), containing 15,625 architectures in a cell search space that consists of 6 categorical parameters, each with five choices. Each architecture is evaluated on three datasets. Following the setting of (Dong & Yang, 2020), we search for the best architecture according to its performance on the CIFAR-10 validation set after 12 epochs of training. Baselines. We compare against random search (RS) (Bergstra & Bengio, 2012) and various value-based Bayesian optimization methods, including Gaussian Process (GP) (Snoek et al., 2012), the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), Sequential Model-based Algorithm Configuration (SMAC) (Hutter et al., 2011), BOHAMIANN (Springenberg et al., 2016), and HEBO (Cowen-Rivers et al., 2020). For GP methods, we use EI and LCB as acquisition functions, optimized by LBFGS, and adopt the ARD Matérn 5/2 covariance function as the kernel.
We also compare with PPBO (Mikkola et al., 2020), one of the state-of-the-art preferential BO methods, which also utilizes relative responses (preferences between pairs of candidates). Detailed settings of the baselines are provided in Appendix C.2. Settings. We run all the methods for 80 iterations with 12 initial points by default. For the Rosenbrock-6d simulation function, we run all methods for 80 iterations with 30 initial points due to its complex search space. The MLP λ_ξ(x; θ) used to approximate the parameter of the Poisson process has three hidden layers with 128 nodes each and ReLU activations. The MLP is trained for 100 steps by ADAM with a batch size of 64 and an initial learning rate of 0.01, multiplied by 0.2 every 30 steps. All methods are evaluated ten times independently on an Intel(R) Xeon(R) Silver 4210R CPU.
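For concreteness, the intensity network described above can be sketched as follows. This is a minimal NumPy forward pass under the stated architecture (three hidden layers of 128 ReLU units); the softplus output head and the initialization scale are our assumptions to keep λ_ξ(x; θ) positive, as the paper does not specify the output activation.

```python
import numpy as np

def init_mlp(d_in, hidden=128, n_hidden=3, seed=0):
    """Initialize weights for a hypothetical intensity network lambda_xi(x; theta)."""
    rng = np.random.default_rng(seed)
    dims = [d_in] + [hidden] * n_hidden + [1]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b)) for a, b in zip(dims, dims[1:])]

def mlp_intensity(params, x):
    """Forward pass: ReLU hidden layers, softplus output for a positive intensity."""
    h = np.atleast_2d(x)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)   # ReLU hidden layers
    w, b = params[-1]
    z = h @ w + b
    return np.log1p(np.exp(z)).ravel()   # softplus -> lambda > 0
```

In the paper's pipeline these intensities would be plugged into the negative log-likelihood of Eq. 7 and trained with ADAM under the schedule above.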

5.1. PERFORMANCE ON THE SIMULATED BENCHMARKS

2-d Branin. Fig. 2(a) compares PoPBO and the baselines on Branin, a widely used simulation benchmark, verifying the efficacy of our PoPBO with both ERI and R-LCB. In particular, PPBO performs well in the early stage but falls into a local optimum after ten iterations.

6-d Rosenbrock.

Optimizing the 6-d Rosenbrock function is much more complicated than Branin and Hartmann, since its global optimum lies in a narrow valley (Picheny et al., 2013) and its search space is more extensive. Hence, we increase the number of initial points to 30 to give all methods a better preview of the Rosenbrock landscape. Fig. 2(c) shows that our PoPBO quickly finds the valley and significantly outperforms the other BO methods. Computational Cost. Fig. 3 compares the time cost of three peer methods, showing that the costs of GP-BO and PPBO grow much faster than PoPBO's as the number of observations increases. Specifically, GP has to compute the inverse of a covariance matrix, resulting in O(N^3) computational complexity. PPBO is also based on GP and additionally requires computing a covariance matrix of size J × J, where J is the number of random samples during optimization. In contrast, the computational bottleneck of PoPBO lies in the training of an MLP, which is O(N^2) as shown in Eq. 7.

5.2. PERFORMANCE ON THE REAL-WORLD BENCHMARKS

HPO-Bench. We run each method ten times and plot the trend of the minimum regret during the BO procedure. Fig. 4 compares PoPBO with advanced Bayesian optimization methods and random search on HPO-Bench (Eggensperger et al., 2021), showing that our PoPBO achieves the best results on all four datasets. In contrast, the other methods do not perform consistently well and are sometimes even worse than random search. Moreover, the performance of our method has a lower standard deviation than that of the other methods, indicating its outstanding stability. The numerical performance of all methods on the four datasets is provided in Table 2. NAS-Bench-201. Our PoPBO outperforms the state-of-the-art Bayesian optimization methods BOHB (Falkner et al., 2018) and BOHAMIANN (Springenberg et al., 2016). Additionally, we plot the performance trend of the various methods on the test sets of CIFAR-10, CIFAR-100, and ImageNet16-120 in Fig. 5. We observe that though our PoPBO slightly falls behind GP at an early stage, it catches up with GP at around the 30-th epoch and takes the lead until the end of the search procedure. The performance trend on the validation set is displayed in Fig. 7 in Appendix C.3.

5.3. EFFECTIVENESS OF THE RECTIFIED TECHNIQUE

The quantile parameter q in Eq. 11 controls the trade-off between exploration and exploitation; a smaller q yields stronger exploration. On the one hand, it is undesirable to set q to a rather small value, since our method degrades to random search when q → 0, making PoPBO suffer from over-exploration. On the other hand, our method degrades to the vanilla acquisition functions when q → 1, making PoPBO suffer from over-exploitation, as analyzed in Sec. 4. Fig. 6 shows an ablation study on the hyperparameter q for both (a) R-LCB and (b) ERI on the Rosenbrock simulation function, a complex landscape. We independently run each setting ten times and adopt the same initial random seeds for all settings at each run for a fair comparison; hence, the lines in Fig. 6 have the same initial point (at the 0-th iteration) and show the average performance of the incumbent. We find that the best setting of the quantile parameter q is 0.6 for R-LCB and 0.4 for ERI.

6. CONCLUSION

We have proposed a novel Bayesian optimization framework, named PoPBO, for optimizing black-box functions with relative responses, which are more robust to noise than absolute responses. Specifically, we introduce a relative response surface that captures the global ranking of candidates based on the Poisson process, which is naturally suited to modeling discrete counting events. We give the likelihood and posterior forms of the ranking under the general assumption of a non-homogeneous Poisson process. To balance exploration and exploitation, we design two acquisition functions, namely the Rectified Lower Confidence Bound (R-LCB) and Expected Ranking Improvement (ERI), for our ranking-based response surface. Our method enjoys a lower computational complexity of O(N^2) compared to GP's O(N^3) and performs competitively on both simulated and real-world benchmarks. Limitations and Future Work. This work analyzes the robustness of relative responses against noise and thus does not involve prior knowledge of the noise. However, there exist real scenarios where the noise is large enough to disrupt the ranking of observations, and we leave this case as future work. Additionally, the mean of a Poisson distribution equals its variance, which carries a potential over-exploitation issue as mentioned in Sec. 4. This work introduces a rectified technique to alleviate it, and we would like to explore other elegant acquisition functions in future work.

A DISCUSSION ON THE ASSUMPTIONS FOR R̂_x(S)

We assume R̂_x(S) has the following properties: 1) R̂_x(∅) = 0 and 2) lim_{Δs→0} P(R̂_x(S′ + Δs) − R̂_x(S′) ≥ 2) = 0, ∀S′ ⊂ S. The first assumption is naturally satisfied since there is no point x′ ∈ ∅ satisfying f(x′) < f(x), where f(x) indicates the observation at point x. The second assumption is naturally satisfied when the black-box function has a discrete domain of definition, since we can find a small enough Δs that contains only one point.
We further assume that it still holds in the case of continuous spaces, since we have no prior information about the black-box function, whose observation is noisy and thus discontinuous everywhere. Specifically, we introduce the following proposition. Proposition 1. Suppose a black-box function f(x) whose observation f̃(x) = f(x) + ϵ has additive Gaussian noise ϵ ∼ N(0, σ^2). Then the observation f̃(x) is discontinuous everywhere. Proof. We first give the definition of continuity, and then define discontinuity as its negation. Definition of continuity. If ∀µ > 0, ∃δ > 0 such that ∀x′ ∈ δ(x) ≜ {x′ : |x′ − x| < δ}, |f̃(x′) − f̃(x)| < µ, we say f̃(x) is continuous at point x. Definition of discontinuity (negation of continuity). If ∃µ > 0, ∀δ > 0, ∃x′ ∈ δ(x) satisfying |f̃(x′) − f̃(x)| ≥ µ, we say f̃(x) is discontinuous at point x. Consider Proposition 1: ∀δ, ∀x′ with |x′ − x| < δ, |f̃(x′) − f̃(x)| = |f(x′) − f(x) + ϵ′ − ϵ|, where ϵ′, ϵ denote the observation noise at x′, x. Δϵ = ϵ′ − ϵ ∼ N(0, 2σ^2) also obeys a Gaussian distribution, since ϵ′ and ϵ are independent Gaussian variables. Hence,

P(∃x′ ∈ δ(x), |f̃(x′) − f̃(x)| ≥ µ) = 1 − P(∀x′ ∈ δ(x), |f̃(x′) − f̃(x)| < µ)
= 1 − ∏_{x′∈δ(x)} P(Δϵ ∈ (−µ − f(x′) + f(x), µ − f(x′) + f(x))) ≈ 1,

since f(x) is continuous in δ(x). Therefore, f̃(x) is almost surely discontinuous everywhere. In our implementation, R̂_x(S) depends on a discrete set S ∩ Ŝ. We can find a small enough Δs satisfying Δs ∩ Ŝ = ∅. Therefore, property 2 (the sparsity assumption) is naturally satisfied in our implementation.

B ALGORITHM DETAILS

As the number of observations N increases, the right-truncated Poisson distribution gradually approaches the ordinary (untruncated) one (Yigiter & Inal, 2006), which is cheaper to compute. Hence, we use the ordinary Poisson process to model the response surface when N ≥ 12.
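This convergence is easy to verify: the right-truncated pmf is just the Poisson pmf renormalized over {0, …, N}, so the total-variation distance between the two distributions reduces to the Poisson tail mass P(K > N), which vanishes quickly once N exceeds the mean. A minimal sketch (λ = 5 is an arbitrary choice, not a value from the paper):

```python
import math

def poisson_pmf(k, lam):
    # Ordinary Poisson probability mass function.
    return math.exp(-lam) * lam**k / math.factorial(k)

def right_truncated_pmf(k, lam, N):
    # Poisson pmf renormalized to the support {0, ..., N}.
    Z = sum(poisson_pmf(j, lam) for j in range(N + 1))
    return poisson_pmf(k, lam) / Z if k <= N else 0.0

def total_variation(lam, N):
    # TV distance between truncated and untruncated distributions,
    # summed over a generous finite support.
    return 0.5 * sum(abs(right_truncated_pmf(k, lam, N) - poisson_pmf(k, lam))
                     for k in range(4 * N + 40))

lam = 5.0
for N in (6, 12, 30):
    # The distance shrinks rapidly as the truncation point N grows.
    print(N, total_variation(lam, N))
```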

C SUPPLEMENTARY OF EXPERIMENTS C.1 EXPERIMENTAL SETTINGS ON ROBUSTNESS ANALYSIS

We compare the sensitivity to additive Gaussian noise of the GP (value-based) response surface and the PoPBO (ranking-based) response surface on the Forrester function. To simulate the performance of GP, we first fit a Gaussian process to a fixed number of observed values (15 in this paper) and plot the ranking of, e.g., 100 points according to the values predicted by GP, as in Fig. 1(a). Meanwhile, our method uses a Poisson process to directly capture the ranking response surface based on the same 15 observations and predicts the ranking of the 100 points, as shown in Fig. 1(b). We observe that our response surface is more robust to noise and better captures the global ranking.
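The spirit of this experiment can be reproduced in a few lines. The sketch below is a simplification that skips the GP and Poisson-process fitting entirely; it only shows how additive noise of increasing σ corrupts rankings derived from the values of the Forrester function f(x) = (6x − 2)² sin(12x − 4), measured by Spearman rank correlation (the grid of 100 points on [0, 0.8] and the noise levels mirror Fig. 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def forrester(x):
    # Forrester et al. test function, conventionally defined on [0, 1].
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def rank(v):
    # Rank of each entry (0 = smallest value).
    return np.argsort(np.argsort(v))

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the rank vectors.
    ra, rb = rank(a), rank(b)
    return np.corrcoef(ra, rb)[0, 1]

xs = np.linspace(0.0, 0.8, 100)
truth = forrester(xs)
for sigma in (0.0, 0.15, 0.45):
    noisy = truth + rng.normal(0.0, sigma, size=xs.shape)
    rho = spearman(truth, noisy)  # agreement of noisy and true rankings
    print(f"sigma={sigma:.2f}  Spearman rho={rho:.3f}")
```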

C.2 DETAILED SETTINGS OF BASELINE METHODS

In this section, we provide the specific details of each baseline mentioned in the paper.

Table 2: Regret of the configuration discovered by various methods on the four datasets of HPO-Bench. We run each method ten times and report the mean and standard deviation. The best performance (lowest mean and standard deviation) is in bold.

Methods | Naval | Parkinson | Protein | Slice
Random Search (2012) | 9.54×10⁻⁵ ± 5.76×10⁻⁵ | 7.99×10⁻³ ± 4.18×10⁻³ | 3.04×10⁻² ± 1.51×10⁻² | 1.48×10⁻⁴ ± 7.45×10⁻⁵
TPE (2011) | 1.67×10⁻⁵ ± 1.64×10⁻⁵ | 4.49×10⁻³ ± 2.55×10⁻³ | 1.86×10⁻² ± 1.37×10⁻² | 1.80×10⁻⁴ ± 7.10×10⁻⁵
SMAC (2011) | 2.94×10⁻⁴ ± 3.96×10⁻⁴ | 1.48×10⁻² ± 9.38×10⁻³ | 2.05×10⁻² ± 1.14×10⁻² | 3.16×10⁻⁴ ± 2.25×10⁻⁴
GP (EI) (2012) | 1.79×10⁻⁴ ± 1.77×10⁻⁴ | 6.20×10⁻³ ± 2.44×10⁻³ | 8.06×10⁻³ ± 6.53×10⁻² | 1.66×10⁻⁴ ± 7.00×10⁻⁵
GP (LCB) (2012) | 1.38×10⁻⁴ ± 1.77×10⁻⁴ | 5.31×10⁻³ ± 2.79×10⁻³ | 8.08×10⁻³ ± 1.02×10⁻² | 1.95×10⁻⁴ ± 1.30×10⁻⁴
PoPBO (ERI) | 1.38×10⁻⁵ ± 1.59×10⁻⁵ | 1.92×10⁻³ ± 1.51×10⁻³ | 5.77×10⁻³ ± 3.55×10⁻³ | 4.83×10⁻⁵ ± 3.29×10⁻⁵
PoPBO (R-LCB) | 1.07×10⁻⁵ ± 6.26×10⁻⁶ | 2.41×10⁻³ ± 1.93×10⁻³ | 4.62×10⁻³ ± 3.91×10⁻³ | 3.14×10⁻⁵ ± 1.74×10⁻⁵

BO with Gaussian Process (GP). To fit the hyperparameters of GP, we adopt slice sampling, an efficient Markov chain Monte Carlo (MCMC) method, which we find to work more robustly for GP.

Tree-structured Parzen Estimator (TPE). Bergstra et al. (2011) adopt kernel density estimators to model the probability of points with bad and good performance, respectively. TPE then proposes the next query by optimizing the ratio between the two estimated likelihoods, which is proven to be equivalent to optimizing EI. We use the default settings provided in the hyperopt package (https://github.com/hyperopt/hyperopt).

GP-MLP. We utilize the same MLP architecture and training settings as PoPBO's to fit the Gaussian likelihood of each candidate; we denote this setting as GP-MLP. We plot the regret of GP-MLP, GP, and our PoPBO on Hartmann and Rosenbrock in Fig. 8. We observe that GP-MLP performs much worse than GP and ours, showing that the complex representation of the MLP in the surrogate model is not the main reason for the gain in performance.

Figure 9: Ablation study of ERI on the two simulation functions. K_max is the worst tolerable ranking, and a higher K_max leads to a higher rate of exploration. ERI with a larger K_max converges faster, but ERI with various K_max values reaches similar ultimate performance after 80 iterations. For Rosenbrock, which has a larger search space and a more complex landscape, exploitation ability matters more than exploration, and thus ERI with a lower K_max achieves better performance.

D NOTATIONS

Under review as a conference paper at ICLR 2023

1) X is the whole feasible domain (search space). If X is continuous, |X| is the volume of X; if X is discrete, |X| is its cardinality.
2) Sx = {y | y ∈ S, f(y) < f(x)} is the set of points in S ⊂ X that are better than x, where S can be any continuous domain or discrete set.
3) Ŝ is a discrete set containing both the initial samples for BO and the history of queries.
4) Given a specific S, R̂x(S) is a random variable denoting the possible ranking of x over the discrete set Sx ∩ Ŝ. Since R̂x(S) thus depends on Ŝ, we use the hat symbol ˆ to simplify the notation.
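For concreteness, the set Sx and a realization of the ranking R̂x(S) from the notations above can be transcribed directly. A small sketch with a five-point discrete domain and fabricated function values (lower is better, as the benchmarks report regret):

```python
def better_set(S, f, fx):
    # Notation 2: S_x = {y in S : f(y) < f(x)}, the points in S better than x.
    return [y for y in S if f[y] < fx]

def ranking(x, S, f, S_hat):
    # Notation 4: a realization of R_x(S) counts the observed points in
    # S_x ∩ S_hat that beat x.
    return len([y for y in better_set(S, f, f[x]) if y in S_hat])

# Toy example: five discrete points with made-up values.
f = {"a": 0.3, "b": 0.1, "c": 0.7, "d": 0.2, "e": 0.5}
S = list(f)                      # the whole (discrete) domain
S_hat = {"a", "b", "c"}          # initial samples plus history of queries
print(better_set(S, f, f["e"]))  # -> ['a', 'b', 'd']
print(ranking("e", S, f, S_hat)) # -> 2 (only 'a' and 'b' are observed)
```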

E BROADER IMPACT AND LIMITATIONS

This paper addresses the problem of Bayesian Optimization (BO) to enable efficient and effective black-box optimization. BO has broad applications in perception tasks, especially computer vision, as well as robotic control and biology. On the one hand, this can facilitate daily life; on the other hand, we should be careful about abuse that may violate privacy. In this sense, privacy-preserving BO also deserves development, and our techniques, owing to their generality, can be of specific help there.



In this work, the 'absolute evaluation (response)' of a query is defined as its exact black-box function value. The 'relative evaluation (response)' of a query is defined as its ranking, which can be computed by comparison with the other candidates. Specifically, a sorting function can serve as Rank(⋅).
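As the footnote notes, sorting suffices to implement Rank(⋅). A minimal sketch (the convention that a lower value receives rank 0 is an assumption for illustration):

```python
def rank(values):
    # Relative evaluation: the ranking induced by sorting absolute responses.
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

# The exact responses are discarded; only the ordering survives, which is
# why any monotone (noise-free) transformation of f leaves ranks unchanged.
assert rank([0.9, 0.1, 0.5]) == [2, 0, 1]
assert rank([0.9, 0.1, 0.5]) == rank([90.0, 10.0, 50.0])
```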



Figure 1: We compare the sensitivity to additive Gaussian noise of the GP (value-based) response surface and the PoPBO (ranking-based) response surface on the Forrester function. The solid black line (oracle) indicates the actual rankings, based on the Forrester function values, over 100 points evenly spaced from 0 to 0.8. We draw lines between the 100 predictions by linear interpolation for a clearer illustration. The dashed lines indicate the rankings over the 100 points predicted by (a) the Gaussian process and (b) the Poisson process, trained on observations with varying degrees of noise whose standard deviation σ ranges from 0 to 0.45. Each response surface is trained on the same 15 queries. Note that GP performs worse as the standard deviation of the noise increases. In contrast, PP performs consistently well, in line with its robustness against noise in our analysis.

Figure 2: Performance of various black-box optimization methods on three simulation functions.Y-axis indicates the residual between the optimum function value and the incumbent. We run each method ten times and plot the average performance and standard deviation as the line and shadow.

Figure 3: Time cost of Gaussian process Bayesian optimization (GP), PPBO (Mikkola et al., 2020), and PoPBO. All methods are applied to optimize the 6-d Hartmann function.

Figure 4: Minimum-regret comparison of random search and various Bayesian optimization methods on tabular datasets in HPO-Bench. The Y-axis indicates the residual between the optimal function value and the incumbent. We run each method ten times and plot the average and standard deviation of the incumbent as the line and shadow. Our PoPBO quickly discovers good samples and achieves the best performance (lowest regret).

Figure 5: Performance trends of random search and Bayesian optimization methods on NAS-Bench-201. We run each setting 10 times and plot the mean accuracy as the lines. Note that when testing Random Search, GP-BO, and our PoPBO, we adopt the same initial random seeds for all settings at each run for fairness. Hence, the lines in each plot share the same initial point (at the 0-th iteration).

Fig. 9 compares the performance of ERI under various settings of the worst tolerable ranking K_max. We observe that PoPBO-ERI is not very sensitive to K_max. Specifically, for Branin, ERI with a larger K_max converges faster.

Figure 10: Performance trends of PoPBO and GP over 200 iterations on the 6-d Rosenbrock function. For each setting, we repeat the experiment six times with different random seeds.

Detailed numerical results are reported in Table 2 in Appendix C.3.



SMAC. Hutter et al. (2011) adopt a random forest to model the response surface of the black-box function. We use the default settings given by the scikit-optimize package (https://github.com/scikit-optimize/scikit-optimize).

HEBO. Heteroscedastic Evolutionary Bayesian Optimisation won the NeurIPS 2020 black-box optimisation competition. We use the default strategy and its default parameters provided in the HEBO package (https://github.com/huawei-noah/HEBO). For a fair comparison, we use a uniform sampling strategy instead of a Sobol one during initialization and candidate generation.

C.3 DETAILED RESULTS ON HPO-BENCH AND NAS-BENCH-201

Table 2 reports the numerical performance of PoPBO and the other methods on the four datasets of HPO-Bench. Fig. 7 displays the performance trends of PoPBO and the other methods on the validation and test sets under the NAS-Bench-201 search space.

C.4 IS THE GAIN IN PERFORMANCE OF POPBO DUE TO THE COMPLEX REPRESENTATION OF MLP IN THE SURROGATE MODEL?


Algorithm 1: PoPBO: Bayesian Optimization with Poisson Process
Inputs: 1) A function Rank({⋅}) to rank samples based on a black-box function f(⋅); 2) a feasible domain (search space) X; 3) an acquisition function α (R-LCB as Eq. 11 or ERI as Eq. 12); 4) the number of initial points N; 5) the number of total training iterations T.
1: Randomly sample N initial points Ŝ := {x_j}_{j=1}^N from X;
2: Initialize the parameters θ of λ_ξ(x);

Random Search (RS). Following the description in Bergstra & Bengio (2012), we sample candidates uniformly at random.

BO with Gaussian Process (GP). We follow the settings described by Snoek et al. (2012) and use our own implementation. We use Expected Improvement (EI) and Lower Confidence Bound (LCB) as acquisition functions and adopt L-BFGS to optimize them. When the search space is completely discrete, as in Dong & Yang (2020), we instead use random sampling to find the next query, taking the maximizer of the acquisition function among N = 1000 random samples. For the kernel function, we use the ARD Matérn 5/2 kernel for GP. During the training process, we adopt slice sampling to fit the hyperparameters of GP.
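The two listed steps of Algorithm 1 extend to the familiar BO loop. A schematic sketch under loose assumptions: fit_surrogate and acquisition below are placeholders for the paper's Poisson-process surrogate and R-LCB/ERI (not reproduced here), and candidate maximization uses the same random-sampling strategy described above for discrete spaces:

```python
import random

def popbo_loop(f, bounds, fit_surrogate, acquisition,
               n_init=5, T=20, n_cand=1000, seed=0):
    # Skeleton of Algorithm 1: sample N initial points, then alternate
    # between refitting the surrogate and querying the acquisition maximizer.
    rnd = random.Random(seed)
    lo, hi = bounds
    S = [rnd.uniform(lo, hi) for _ in range(n_init)]   # step 1
    obs = [f(x) for x in S]
    model = fit_surrogate(S, obs)                      # step 2 (init params)
    for _ in range(T):
        cands = [rnd.uniform(lo, hi) for _ in range(n_cand)]
        x_next = max(cands, key=lambda x: acquisition(model, x))
        S.append(x_next)
        obs.append(f(x_next))
        model = fit_surrogate(S, obs)
    i_best = min(range(len(S)), key=obs.__getitem__)
    return S[i_best], obs[i_best]

# Toy run: a quadratic objective with a dummy surrogate and a greedy
# acquisition that scores candidates by the (here known) objective.
x_best, y_best = popbo_loop(
    f=lambda x: (x - 0.3) ** 2,
    bounds=(0.0, 1.0),
    fit_surrogate=lambda S, obs: None,
    acquisition=lambda model, x: -(x - 0.3) ** 2,
)
print(x_best, y_best)
```

Swapping in a real surrogate and acquisition only changes the two callables; the loop structure stays the same.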

