GENERATIVE PRETRAINING FOR BLACK-BOX OPTIMIZATION

Abstract

Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose BONET, a generative framework for pretraining a novel black-box optimizer using offline datasets. In BONET, we train an autoregressive model on fixed-length trajectories derived from an offline dataset. We design a sampling strategy to synthesize trajectories from offline data using a simple heuristic of rolling out monotonic transitions from low-fidelity to high-fidelity samples. Empirically, we instantiate BONET using a causally masked Transformer (Radford et al., 2019) and evaluate it on Design-Bench (Trabucco et al., 2022), where we rank the best on average, outperforming state-of-the-art baselines.

1. INTRODUCTION

Many fundamental problems in science and engineering, ranging from the discovery of drugs and materials to the design and manufacturing of hardware technology, require optimizing an expensive black-box function in a large search space (Larson et al., 2019; Shahriari et al., 2016). The key challenge here is that evaluating and optimizing such a black-box function is typically expensive, as it often requires real-world experimentation and exploration of a high-dimensional search space. Fortunately, for many such black-box optimization (BBO) problems, we often have access to an offline dataset of function evaluations. Such an offline dataset can greatly reduce the budget for online function evaluation. This brings us to the setting of offline BBO. A key difference exists between the offline BBO setting and its online counterpart: in offline BBO, we are not allowed to actively query the black-box function during optimization, unlike in online BBO, where most approaches (Snoek et al., 2012; Shahriari et al., 2016) utilize iterative online solving. One natural approach for offline BBO would be to train a surrogate (forward) model that approximates the black-box function using the offline data. Once it is learned, we can perform gradient ascent on the input space to find the optimal point. Unfortunately, this method does not perform well in practice because the forward model can incorrectly give sub-optimal and out-of-domain points a high score (see Figure 1a). To mitigate this issue, COMs (Trabucco et al., 2021) learns a forward mapping that penalizes high scores on points outside the dataset, but this can have the opposite effect of not being able to explore high-fidelity points that are far from the dataset. Further, another class of recent approaches (Kumar & Levine, 2020; Brookes et al., 2019; Fannjiang & Listgarten, 2020) proposes a conditional generative approach that learns an inverse mapping from function values to the points.
For effective generalization, such a mapping needs to be highly multimodal for high-dimensional functions, which in itself presents a challenge for current approaches. We propose Black-box Optimization Networks (BONET), a new generative framework for pretraining black-box optimizers on offline datasets. Instead of approximating the surrogate function (or its inverse), we seek to approximate the dynamics of online black-box optimizers using an autoregressive sequence model. Naively, this would require access to several trajectory runs of different black-box optimizers, which is expensive or even impossible in many cases. Our key observation is that we can synthesize trajectories composed of offline points that mimic empirical characteristics of online BBO algorithms, such as BayesOpt. While one could design many characteristic properties, we build off an empirical observation related to the function values of the proposed points. In particular, averaged over multiple runs, online black-box optimizers (e.g., BayesOpt) tend to show improvements in the function values of the proposed points (Bijl et al., 2016), as shown in Figure 1c. While not exact, we build on this observation to develop a sorting heuristic that constructs synthetic trajectories consisting of offline points ordered monotonically by ascending function value. Even though such a heuristic does not apply uniformly to the trajectory runs of all combinations of black-box optimizers and functions, we show that it is simple, scalable, and quite effective in practice. Further, we augment every offline point in our trajectories with a regret budget, defined as the cumulative regret of the trajectory starting at the current point until the end of the trajectory. We train BONET to generate trajectories conditioned on the regret budget of the first point of the trajectory. Thus, at test time, we can generate good candidate points by rolling out a trajectory with a low regret budget.
Figure 1b shows an illustration. We evaluate our method on several real-world tasks in the Design-Bench (Trabucco et al., 2022) dataset. These tasks are based on real-world problems such as robot morphology optimization, DNA sequence optimization, and optimizing the superconducting temperature of materials, all of which require searching over a high-dimensional search space. We achieve a normalized mean score of 0.772 and an average rank of 2.4 across all tasks, outperforming the next best baseline, which achieves a rank of 3.7.

2. PRETRAINING BLACK-BOX OPTIMIZERS VIA BONET

2.1 PROBLEM STATEMENT

Let $f : \mathcal{X} \to \mathbb{R}$ be a black-box function, where $\mathcal{X} \subseteq \mathbb{R}^d$ is an arbitrary d-dimensional domain. In black-box optimization (BBO), we are interested in finding the point $x^*$ that maximizes f:

$$x^* \in \arg\max_{x \in \mathcal{X}} f(x) \qquad (1)$$

Typically, f is expensive to evaluate and we do not assume direct access to it during training. Instead, we have access to an offline dataset of N previous function evaluations $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where $y_i = f(x_i)$. For evaluating a black-box optimizer post-training, we allow it to query the black-box function f for a small budget of Q queries and output the point with the best function value obtained. This protocol follows prior works in offline BBO (Trabucco et al., 2021; 2022; Kumar & Levine, 2020; Brookes et al., 2019; Fannjiang & Listgarten, 2020).

Overview of BONET

We illustrate our proposed framework for offline BBO in Figure 2 and Algorithm 1. BONET consists of 3 sequential phases: trajectory construction, autoregressive modelling, and roll-out evaluation. In Phase 1 (Section 2.2), we transform the offline dataset $\mathcal{D}$ into a trajectory dataset $\mathcal{D}_{traj}$. This is followed by Phase 2 (Section 2.3), where we train an autoregressive model on $\mathcal{D}_{traj}$. Finally, we evaluate the model by rolling out Q candidate points in Phase 3 (Section 2.4).

Figure 2: Schematic for BONET. In Phase 1, we construct a trajectory dataset $\mathcal{D}_{traj}$ using SORT-SAMPLE. In Phase 2, we learn an autoregressive model for $\mathcal{D}_{traj}$. In Phase 3, we condition the model on an offline prefix sequence and unroll it further to obtain candidate proposals $x_{P+1:T}$.

2.2. PHASE 1: CONSTRUCTING TRAJECTORIES

Our key motivation in BONET is to train a model to mimic the sequential behavior of online black-box optimizers. However, the difficulty is that we do not have the ability to generate trajectories by actively querying the black-box function during training. In BONET, we overcome this difficulty by synthesizing trajectories purely from an offline dataset based on two guiding desiderata. First, the procedure for synthesizing trajectories should efficiently scale to high-dimensional data points and large offline datasets. Second, each trajectory should mimic characteristic behaviors commonly seen in online black-box optimizers. We identify one such characteristic of interest. In particular, we note that the moving average of function values of points proposed by such black-box optimizers tends to improve over the course of their runs, barring local perturbations (e.g., due to exploration). While exceptions can and do exist, this phenomenon is commonly observed in practice for optimizers such as BayesOpt (Bijl et al., 2016). We also illustrate this behavior in Figure 1c for some commonly used test functions optimized via BayesOpt.

Algorithm 1
1: ▷ Phase 1: SORT-SAMPLE
2: Construct bins $\{B_1, \dots, B_{N_B}\}$ from $\mathcal{D}$, each bin covering equal y-range, as described in 2.2
3: Calculate the scores $(n_1, n_2, \dots, n_{N_B})$ for each bin using K and τ
4: $\mathcal{D}_{traj} \leftarrow \emptyset$
5: for i = 1, …, num_trajs do
6:     Uniformly randomly sample $n_i$ points from $B_i$ and concatenate them to construct $\mathcal{T}$
7:     Sort $\mathcal{T}$ in the ascending order of the function value
8:     Represent $\mathcal{T}$ as $(R_1, x_1, R_2, x_2, \dots, R_T, x_T)$, …
…
    … $(R_t, x_t)$ to $g_\theta$ sequentially, ∀t = 1, …, P
15: Roll-out $g_\theta$ autoregressively while feeding $R_t = R$, ∀t = P + 1, …, T
16: X is the set of last min(Q, T - P) rolled-out points.

Sorted Trajectories

We propose to satisfy the above desiderata in BONET by following a sorting heuristic. Specifically, given a set of T offline points, we construct a trajectory of length T by simply sorting the points in ascending order from low to high function values. We note that sorting is just a heuristic we use for constructing synthetic trajectories from the offline dataset, and this behavior may not be followed by any general optimizer over any arbitrary functions. We also perform ablations on different heuristics in Appendix C.1. Further, we note that sorting does not provide any guidance on the rate or the relative spacing between the points, i.e., how fast the function values increase. This rate is important for controlling the sample budget for black-box optimization. Next, we discuss a sampling strategy for explicitly controlling this rate.

Figure 3: … for the Ant Morphology benchmark (Trabucco et al., 2022). Notice how the overall density of points with high function values is up-weighted after our re-weighting.

Sampling Strategies for Offline Points

So far, we have proposed a simple heuristic for transitioning a set of offline points into a sorted trajectory. To obtain these offline trajectory points from the offline dataset, one default strategy is to sample T points uniformly at random from $\mathcal{D}$ and sort them. However, we found that this strategy does not work well in practice. Intuitively, we might expect a large volume of the search space to consist of points with low function values. Thus, if our offline dataset is uniformly distributed across the domain, the probability of getting a high-quality point will be very low with a uniform sampling strategy. To counter this challenge, we propose a two-step sampling strategy based on binning followed by importance reweighting. Our formulation is motivated by a similar strategy proposed by Kumar & Levine (2020) for loss reweighting. First, we use the function values to partition the offline dataset $\mathcal{D}$ into $N_B$ bins of equal width, i.e., each bin covers a range of equal length. Next, for each bin, we assign a different sampling probability, such that (a) bins where the average function value is high are more likely to be sampled, and (b) bins with more points are sampled more often. The former helps minimize the budget, whereas the latter ensures diversity in trajectories. Based on these two criteria, the score $s_i$ for a bin $B_i$ is given as:

$$s_i = \frac{|B_i|}{|B_i| + K} \exp\left(\frac{-|\hat{y} - y_{b_i}|}{\tau}\right)$$

where $\hat{y}$ is the best function value in the dataset $\mathcal{D}$, $|B_i|$ refers to the number of points in the i-th bin, and $y_{b_i}$ is the midpoint of the interval corresponding to the bin $B_i$. Here, the first term $\frac{|B_i|}{|B_i| + K}$ allows us to assign a higher weight to the larger bins with smoothing. The second term gives higher weight to the good bins using an exponential weighting scheme. More details about K and τ can be found in Appendix B.
Finally, we use these scores $s_i$ to proportionally sample $n_i$ points from bin $B_i$, where $n_i = T \frac{s_i}{\sum_j s_j}$ for $i \in \{2, \dots, N_B\}$ and $n_1 = T - \sum_{i>1} n_i$, making the overall length of the trajectories equal to T. In Figure 3, we illustrate the shift in distribution of function values due to our sampling strategy. We refer to the combined strategy of sampling and then sorting as SORT-SAMPLE in Figure 2 and Algorithm 1.

Augmenting Trajectories With Regret Budgets

Our sorted trajectories heuristically reflect roll-outs of implicit black-box optimizers. However, they do not provide us with information on the rate at which a trajectory approaches the optimal value. A natural choice for such a knob would be the cumulative regret. Moreover, as we shall show later, cumulative regret provides BONET a simple and effective knob to generalize outside the offline dataset during the evaluation phase. Hence, we propose to augment each data point $x_i$ in our trajectory with a Regret Budget (RB). The RB $R_i$ at timestep i is defined as the cumulative regret of the trajectory, starting at point $x_i$:

$$R_i = \sum_{j=i}^{T} \left(f(x^*) - f(x_j)\right)$$

Intuitively, a high (low) value for $R_i$ is likely to result in a high (low) budget for the model to explore diverse points. Note that we are only assuming knowledge of an estimate for $f(x^*)$ (and not $x^*$). Thus, each trajectory in our desired set $\mathcal{D}_{traj}$ can be represented as:

$$\mathcal{T} = (R_1, x_1, R_2, x_2, \dots, R_T, x_T)$$

We will refer to $R_1$ as the Initial Regret Budget (IRB) henceforth. This will be of significance for evaluating our model in Phase 3 (Section 2.4), as we can specify a low IRB to induce the model to output points close to the optima.
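The Phase 1 steps described above (binning, scoring, proportional sampling, sorting, and regret-budget augmentation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-bin counts are rounded down with the remainder assigned to the first bin, points are drawn with replacement only when a bin holds fewer points than requested, and `y_star` stands for the assumed estimate of $f(x^*)$.

```python
import numpy as np

def bin_scores(y, num_bins, K, tau):
    """Score equal-width bins: s_i = |B_i| / (|B_i| + K) * exp(-|y_best - y_mid_i| / tau)."""
    edges = np.linspace(y.min(), y.max(), num_bins + 1)
    # Interior edges give bin ids in 0..num_bins-1; the maximum lands in the last bin.
    bin_ids = np.clip(np.digitize(y, edges[1:-1]), 0, num_bins - 1)
    sizes = np.bincount(bin_ids, minlength=num_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    scores = sizes / (sizes + K) * np.exp(-np.abs(y.max() - mids) / tau)
    return bin_ids, scores

def points_per_bin(scores, T):
    """n_i proportional to s_i. Assumption: round down, remainder goes to the first bin."""
    n = np.floor(T * scores / scores.sum()).astype(int)
    n[0] = T - n[1:].sum()
    return n

def sort_sample(X, y, T, num_bins, K, tau, y_star, rng):
    """Build one sorted trajectory with regret budgets (hypothetical helper)."""
    bin_ids, scores = bin_scores(y, num_bins, K, tau)
    counts = points_per_bin(scores, T)
    chosen = []
    for b, n_b in enumerate(counts):
        members = np.flatnonzero(bin_ids == b)
        if n_b > 0:
            # Assumption: fall back to sampling with replacement for small bins.
            replace = bool(n_b > len(members))
            chosen.append(rng.choice(members, size=int(n_b), replace=replace))
    idx = np.concatenate(chosen)
    idx = idx[np.argsort(y[idx], kind="stable")]   # ascending function value
    regrets = y_star - y[idx]
    budgets = np.cumsum(regrets[::-1])[::-1]       # R_i = sum_{j >= i} regret_j
    return budgets, X[idx], y[idx]
```

The reverse cumulative sum makes the regret budget non-increasing along the trajectory whenever `y_star` upper-bounds the sampled values, so the first entry is the Initial Regret Budget.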

2.3. PHASE 2: TRAINING AN AUTOREGRESSIVE GENERATIVE MODEL

Given our trajectory dataset, we design our BBO agent as a conditional autoregressive model and train it to maximize the likelihood of trajectories in $\mathcal{D}_{traj}$. More formally, we denote our model parameterized by θ as $g_\theta(x_t \mid x_{<t}, R_{\le t})$, where by $k_{<t}$ we mean the set $\{k_1, \dots, k_{t-1}\}$. Here, $x_i$ is the i-th point in a trajectory, and $R_i$ refers to the regret budget at timestep i. Building on recent advances in sequence modeling (Vaswani et al., 2017; Brown et al., 2020; Radford et al., 2019), we instantiate our model with a causally masked transformer and train it to maximize the likelihood of our trajectory dataset $\mathcal{D}_{traj}$:

$$\mathcal{L}(\theta; \mathcal{D}_{traj}) = \mathbb{E}_{\mathcal{T} \sim \mathcal{D}_{traj}} \left[ \sum_{i=1}^{T} \log g_\theta(x_i \mid x_{<i}, R_{\le i}) \right]$$

In practice, we translate this loss to the mean squared error loss for a continuous $\mathcal{X}$ (equivalent to a Gaussian $g_\theta$ with fixed variance), and cross entropy loss for a discrete $\mathcal{X}$.
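The stated equivalence between the likelihood objective and the mean squared error for continuous domains can be made explicit with a short derivation. The symbols $\mu_\theta$ (the model's predicted mean) and $\sigma$ (the fixed standard deviation) are notation introduced here for illustration and do not appear in the original text.

```latex
% Assumption: g_\theta is an isotropic Gaussian with fixed variance \sigma^2
g_\theta(x_i \mid x_{<i}, R_{\le i})
  = \mathcal{N}\!\left(x_i;\ \mu_\theta(x_{<i}, R_{\le i}),\ \sigma^2 I_d\right)

\log g_\theta(x_i \mid x_{<i}, R_{\le i})
  = -\frac{1}{2\sigma^2}\,\bigl\lVert x_i - \mu_\theta(x_{<i}, R_{\le i}) \bigr\rVert_2^2
    \;-\; \frac{d}{2}\log\!\left(2\pi\sigma^2\right)

% The second term does not depend on \theta, so
\arg\max_\theta\ \mathcal{L}(\theta; \mathcal{D}_{traj})
  = \arg\min_\theta\ \mathbb{E}_{\mathcal{T} \sim \mathcal{D}_{traj}}
    \left[ \sum_{i=1}^{T} \bigl\lVert x_i - \mu_\theta(x_{<i}, R_{\le i}) \bigr\rVert_2^2 \right]
```

Under this assumption, maximizing the likelihood and minimizing the summed squared error select the same parameters; the choice of $\sigma$ only rescales the loss.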

2.4. PHASE 3: EVALUATION ROLLOUT OF FINAL CANDIDATES

Once trained, we can use our BBO agent to directly output new points as its candidate guesses for maximizing the black-box function. We do so by rolling out evaluation trajectories from our model. Each trajectory is subdivided into a prefix subsequence and a prediction subsequence. The prefix subsequence consists of P < T points sampled from our offline dataset as before. These prefix points provide initial warm-up queries to the model. Thereafter, we roll out the prediction subsequence consisting of T - P points by sampling from our autoregressive generative model.

Setting Regret Budgets

One key question relates to setting the regret budget at the start of the suffix subsequence. It is not preferable to set it to $R_{P+1}$ of the sampled trajectory, as doing so will lead to a slow rate of reaching high-quality regions, similar to the one observed in the training trajectories. This would not allow the suffix to generalize beyond the offline dataset. Instead, we initialize it to a low value in BONET to accelerate the trajectory towards good points following a prefix subsequence. We refer to this low value as the Evaluation RB and denote it as R in Figure 2 and in Section 3. Thereafter, we keep the RB for the suffix subsequence fixed (at R), as the agent is expected to be already in a good region. Moreover, updating the RB here would require sequential querying of the function f, which can be prohibitive. Thus, our evaluation protocol can generate a set of candidate queries purely in an offline manner. In practice, we also find it helpful to split the Q candidates among a few (and not 1) small R values, each with a different prefix.
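The evaluation protocol above (feed an offline prefix, then roll out with the regret budget held at a fixed low value) can be sketched as follows. This is an illustrative outline only: `model` is a hypothetical callable standing in for the trained autoregressive model, assumed to take the regret budgets up to the current step and the points before it and to return the next point.

```python
import numpy as np

def rollout(model, prefix_budgets, prefix_points, eval_rb, num_steps):
    """Roll out num_steps proposals after an offline prefix.

    model(budgets, points) is a hypothetical interface: budgets has one more
    entry than points (R_{<=t} and x_{<t}) and the call returns x_t.
    """
    budgets = list(prefix_budgets)
    points = list(prefix_points)
    proposals = []
    for _ in range(num_steps):
        # The budget for the next step is the fixed Evaluation RB; it is never
        # updated, so no query to the black-box function is needed here.
        budgets.append(eval_rb)
        x_next = model(np.asarray(budgets), np.asarray(points))
        points.append(x_next)
        proposals.append(x_next)
    return np.stack(proposals)
```

Because the budget is never recomputed from observed function values, all proposals are generated offline; only the returned candidates need to be evaluated against the query budget.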

3. EXPERIMENTAL EVALUATION

We first empirically evaluate BONET for optimizing a synthetic 2D function, Branin, in order to analyze its working and probe the various components. Next, we perform large-scale benchmarking and experiments on Design-Bench (Trabucco et al., 2022) , a suite of offline BBO tasks based on real-world problems.

3.1. BRANIN TASK

Branin is a well-known benchmark function for evaluating optimization methods. It is a 2D function evaluated on the ranges $x_1 \in [-5, 10]$ and $x_2 \in [0, 15]$:

$$f_{br}(x_1, x_2) = a\,(x_2 - b x_1^2 + c x_1 - r)^2 + s(1 - t)\cos x_1 + s \qquad (5)$$

where $a = -1$, $b = \frac{5.1}{4\pi^2}$, $c = \frac{5}{\pi}$, $r = 6$, $s = -10$, and $t = \frac{1}{8\pi}$. In this region, $f_{br}$ has three global maxima, (-π, 12.275), (π, 2.275), and (9.42478, 2.475), with the maximum value of -0.397887. Figure 1b shows an illustration of the function contours. For offline optimization, we uniformly sample N = 5000 points in the domain and remove the top 10th percentile (according to the function value) from this set, so that points close to the optima are excluded and the task is more challenging. We then construct 400 trajectories of length 64 each according to the SORT-SAMPLE strategy.

Table 1: Best function value achieved by each method on the Branin task. We report mean and standard deviation over 5 runs. Gradient ascent performs poorly because, for many initialization points, the trajectories escape the domain.

OPTIMA    D (best)    BONET            Grad. Ascent
-0.398    -6.119      -1.79 ± 0.843    -3.953 ± 4.258

During the evaluation, we initialize four trajectories with a prefix length of 32 and unroll them for an additional 32 steps, and output the best result, thus consuming a query budget of 128. As we see in Table 1, BONET successfully generalizes beyond the best point in our offline dataset. We also report numbers for a gradient ascent baseline, which uses the offline dataset to train a forward model (a 2-layer NN) mapping x to y and then performs gradient ascent on x to infer its optima. Next, we perform ablations to understand the effect of the Evaluation RB R and prefix length P on our rolled-out trajectories.

Impact of R

Figures 4a and 4b show rolled-out trajectories for our model for different R values, with prefix lengths 16 and 32. We see that low R rolls out higher quality points compared to high R.
To verify our semantics of the regret budget as a knob for controlling the rate at which the model accelerates to high-quality points, in Figure 4c we also plot trajectories where we update the RB values in the suffix. We stop the roll-out if the RB becomes non-positive. It is evident that for smaller R, the agent quickly accelerates to high-quality regions, whereas for high R, it gradually shifts to high-quality points. This shows how R controls the rate of transition from low to high-quality points. To further check whether our model has learned to generate a sequence having cumulative regret close to the initial RB $R_1$, we plot $R_1$ vs. the cumulative regret of a full rolled-out sequence in Figure 5a. We observe that the curve is close to the desired ideal line y = x. Notice that the range of $R_1$ values of $\mathcal{D}_{traj}$ is quite narrow, but BONET is able to generalize well to a much wider range, allowing it to propose points even better than the dataset. During training, the model has only seen low RB values towards the end of the trajectories. However, the powerful stitching ability (Chen et al., 2021) of the model allows it to roll out a novel trajectory having low cumulative regret when conditioned on low unseen $R_1$ values. Finally, since in BBO our goal is to find the best point, we also plot the best rolled-out point across a trajectory versus R in Figure 5b while keeping the prefix sequence fixed. As expected, we observe a decreasing trend, justifying our choice of small R values.

Impact of Prefix Length

Figure 5c shows the best function values obtained for different prefix lengths, averaged over multiple R values, with the same query budget Q = 32. As expected, we see an increasing trend in the best function value. We also observe a decreasing variance, indicating that the trajectory roll-outs are more stable when augmented with a history of points.
Note that prefix lengths larger than 32 do not perform very well in practice because they leave fewer than 32 steps to propose a good point in a single trajectory. Empirically, we found a prefix length equal to half of the trajectory length to perform well across the experiments. We provide more details about the ablations, experimental setup, and model hyper-parameters in Appendices B and C.
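The Branin definition in Section 3.1 can be checked numerically with a few lines of Python. The sketch below assumes the standard Branin constant t = 1/(8π), which is the value that reproduces the stated maximum of -0.397887 at the three listed maximizers; the function name is hypothetical.

```python
import math

def branin_negated(x1, x2):
    """Branin in the maximization form of Section 3.1 (a = -1, s = -10)."""
    a = -1.0
    b = 5.1 / (4 * math.pi ** 2)
    c = 5 / math.pi
    r = 6.0
    s = -10.0
    t = 1 / (8 * math.pi)  # assumption: standard Branin constant
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s

# Evaluate at the three maximizers listed in the text.
maximizers = [(-math.pi, 12.275), (math.pi, 2.275), (9.42478, 2.475)]
values = [branin_negated(x1, x2) for x1, x2 in maximizers]
```

At each maximizer the squared term vanishes and the cosine equals -1, so the value reduces to -10t.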

3.2. DESIGN-BENCH TASKS

Next, we evaluate BONET on 7 complex real-world tasks of Design-Bench (Trabucco et al., 2022). TF-Bind-8 and TF-Bind-10 are discrete tasks where the goal is to optimize for a DNA sequence which has maximum affinity to bind with a particular transcription factor. The sequences are of length 8 (10) for TF-Bind-8 (TF-Bind-10), where each element in the sequence is one of 4 bases. ChEMBL is a discrete task where the aim is to design a drug with certain qualities. NAS is a discrete task where we want to optimize a NN for performance on CIFAR10 (Krizhevsky et al., 2010). In the D'Kitty and Ant morphology tasks, we optimize the morphology of two robots: Ant from OpenAI gym (Brockman et al., 2016) and D'Kitty from ROBEL (Ahn et al., 2019). In the Superconductor task, the aim is to find a chemical formula for a superconducting material with a high critical temperature. D'Kitty, Ant and Superconductor are continuous tasks with dimensions 56, 60, and 86 respectively. For the first four tasks, we have query access to the exact oracle function. For Superconductor, we only have an approximate oracle, which is a random forest regressor trained on a much larger hidden dataset. These tasks are considered challenging due to high dimensionality, low quality points in the offline dataset, approximate oracles in some cases, and highly sensitive landscapes with narrow optima regions (Trabucco et al., 2022).

Baselines

We compare BONET with multiple canonical baselines like gradient ascent, REINFORCE (Sutton et al., 1999), BayesOpt (Snoek et al., 2012) and CMA-ES (Hansen, 2006). We also compare with more recent methods like MINs (Kumar & Levine, 2020), COMs (Trabucco et al., 2021) and CbAS (Brookes et al., 2019). For inherently active methods like BayesOpt, since we cannot query the oracle function f(x) during optimization (due to being in an offline setting), we follow the procedure used by Trabucco et al. (2022), and perform BayesOpt on a surrogate model $\hat{f}(x)$ (a feedforward NN) trained on the offline dataset. For the BayesOpt baseline, we use a Gaussian Process to quantify uncertainty and use the quasi-Expected Improvement (Wilson et al., 2017) acquisition function for optimization, similar to prior work (Trabucco et al., 2022; 2021). MINs (Kumar & Levine, 2020) also train a forward model to optimize over the conditioning parameters. Our method does not need a separate forward model and thus is not dependent on the quality of the fit of the forward model. We provide other variants of the BayesOpt baseline in Appendix D.2.

Table 2: 100th percentile comparative evaluation of BONET over 7 tasks averaged over 5 runs. The error bars refer to the standard deviation across the 5 seeds. We report normalized scores with Q = 256 (except for NAS, where we use Q = 128 due to compute restrictions) and highlight the top 2 results in each column. Blue denotes the best entry in the column, and Violet denotes the second best. We observe that BONET is in the top 2 in 5 out of 7 tasks, and is also consistently able to outperform the best offline dataset point in all tasks.

Evaluation

We allow a query budget of Q = 256 for all the baselines, except for NAS, where we use a reduced budget of Q = 128 due to compute restrictions. For BONET, across all tasks, we roll out 4 trajectories, each with a prefix and a prediction subsequence of length 64. The prediction subsequence is initialized with one of 4 candidate low R values: 0, 0.01, 0.05, and 0.1. For each R, we then roll out for 64 timesteps and choose the best point. We report the mean and standard deviation over 5 trials for each of the models and tasks in Table 2. Following the procedure used by Trabucco et al. (2021; 2022) and Yu et al. (2021), the results of Table 2 are linearly normalized between the minimum and maximum values of a large hidden offline dataset.
In Table 3, we also present the median function value of the proposed output points for each method, averaged over 5 runs.

Results

Overall, we obtain a mean score of 0.772 and an average rank of 2.4, which is the best among all the baselines. We also achieve the best results on three tasks. Additionally, we are among the top 2 for five out of seven tasks. We show significant improvements over generative methods such as MINs (Kumar & Levine, 2020) or CbAS (Brookes et al., 2019), and forward mapping methods such as COMs (Trabucco et al., 2021) on TF-Bind-8, TF-Bind-10, Ant and D'Kitty. We also note that while BONET is placed second in Ant, it shows a much lower standard deviation (0.012) compared to the best performing method, CMA-ES, which has a much larger standard deviation of 0.928. We also have the lowest mean standard deviation across all tasks, suggesting that BONET is less sensitive to bad initializations compared to other methods. We report the unnormalized results, ablations and other experimental details for Design-Bench tasks in Appendix C. BONET also performs best on the 50th percentile evaluation, showing that it has a better set of proposed points compared to other methods, and that the performance is not due to randomly finding good points.
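The linear normalization and the 100th and 50th percentile summaries used for Tables 2 and 3 can be sketched as below. This is a minimal illustration, assuming `y_min` and `y_max` are the minimum and maximum function values of the hidden dataset; the function name and the dictionary keys are hypothetical.

```python
import numpy as np

def normalized_summary(proposal_values, y_min, y_max):
    """Linearly normalize proposal values, then report best and median scores."""
    values = np.asarray(proposal_values, dtype=float)
    scores = (values - y_min) / (y_max - y_min)
    return {
        "p100": float(scores.max()),       # 100th percentile: best proposal
        "p50": float(np.median(scores)),   # 50th percentile: median proposal
    }

summary = normalized_summary([2.0, 4.0, 6.0], y_min=0.0, y_max=8.0)
```

A normalized score above 1 under this convention would mean a proposal exceeds the best value in the dataset used for normalization.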

4. RELATED WORK

Active BBO

The majority of prior work in BBO has been in the active setting, where surrogate models are allowed to query the function during training. This includes long bodies of work in Bayesian Optimization (e.g., Snoek et al., 2012; Swersky et al., 2013; Srinivas et al., 2010; Nguyen & Osborne, 2020) and bandits (e.g., Garivier et al., 2016; Riquelme et al., 2018; Joachims et al., 2018b; a; Swaminathan & Joachims, 2015; Jacq et al., 2019). Such methods usually employ surrogate models such as Gaussian Processes (Srinivas et al., 2010), Neural Processes (Garnelo et al., 2018a; b; Gordon et al., 2019; Anonymous, 2022; Singh et al., 2019) or Bayesian Neural Networks (Chang, 2021; Goan & Fookes, 2020) to approximate the black-box function, and an uncertainty-aware acquisition strategy for querying new points.

Offline BBO

Recent works have made use of offline datasets and shown promising results (Kumar & Levine, 2020; Trabucco et al., 2021; Brookes et al., 2019; Fannjiang & Listgarten, 2020; Fu & Levine, 2021; Yu et al., 2021). Kumar & Levine (2020) train a stochastic inverse mapping from the outputs y to inputs x using a generative model similar to a conditional GAN (Goodfellow et al., 2014; Mirza & Osindero, 2014; Arjovsky et al., 2017; Nowozin et al., 2016). They then optimize over y to find a good design point. Training GANs, however, can be difficult due to issues like mode collapse (Arjovsky et al., 2017). Other methods make use of gradient ascent to find an optimal solution. Trabucco et al. (2021) and Yu et al. (2021) train a model to be robust to outliers by regularizing the objective such that it assigns a low score to those points. Fu & Levine (2021) train a normalized maximum likelihood estimate of the function. We instead offer a fresh perspective based on generative sequence modelling, and we show strong results in comparison to many of these prior works in Section 3.
Offline reinforcement learning (RL)

While both RL and BBO are sequential decision-making problems, the key difference is that RL is stateful while BBO is not. In the offline setting (Schmidhuber, 2019; Jacq et al., 2019), both problems require models or policies that can generalize beyond the offline dataset to achieve good performance. Related to our work, autoregressive transformers have been successfully applied on trajectory data obtained from offline RL (Chen et al., 2021; Janner et al., 2021). However, there are important differences between their setting and offline BBO. For example, the data in offline BBO is not sequential in nature, unlike in offline RL where the offline data is naturally in the form of demonstrations. One of our contributions is to design a notion of 'high-to-low' sequences for BBO that enables autoregressive modeling and test-time generalization.

Learned optimizers

Our work also bears resemblance to work on meta-learning optimizers such as Andrychowicz et al. (2016) and Chen et al. (2017). However, a key difference between these learned optimizers and BONET is that they require access to gradients either during training time or during both training and evaluation time, whereas BONET has no such restriction, meaning it can work in situations where access to gradient information is not practical (for instance, with non-differentiable black-box functions). Furthermore, we concentrate on the offline setting, in contrast to learned optimizer work, which usually looks at an active optimization setting.

5. DISCUSSION

We presented BONET, a novel generative framework for pretraining black-box optimizers using offline data. BONET consists of a three-phase process. In the first phase, we use a novel SORT-SAMPLE strategy to generate trajectories from offline data that use the sorting heuristic to mimic the behavior of online BBO optimizers. In phases 2 and 3, we train our model using an autoregressive transformer and use it to generate candidate points that maximize the black-box function. Experimentally, we verify that BONET is capable of solving complex high-dimensional tasks like the ones in Design-Bench, achieving an average rank of 2.4 with a mean score of 0.772.

Limitations and Future Work

BONET assumes knowledge of the approximate value of the optimum $f(x^*)$ to compute the regrets of points. Though we show in Appendix C.5 that different reasonable estimates of $f(x^*)$ can give similar results, this is still something we would like to address in future work. We are also interested in extending BONET to an active setting where our model can quantify uncertainty and actively query the black-box function after pretraining on an offline dataset. On the practical side, we would also be interested in analyzing the properties of domains that dictate where BONET can strongly excel (e.g., D'Kitty) or struggle (e.g., Superconductor) relative to other approaches. Finally, we aim to expand our scope to a meta-task setting, where instead of a single function, our offline data consists of past evaluations from multiple black-box functions.

6. ETHICAL AND SOCIAL RESPONSIBILITY

Offline black-box optimization can play a critical role in improving efficiency and safety in many real-world settings like nuclear reactors or the pharmaceutical industry, where active optimization may prove to be computationally inefficient (requiring too many queries) or even dangerous. However, while we do not anticipate anything inherently malicious about our work, it is possible to utilize our proposed method (and other optimizers in general) in malicious settings (e.g., optimizing for drugs that have harmful effects). This is something to remember when deploying algorithms such as ours for high-stakes real-world applications.

7. REPRODUCIBILITY

Throughout the paper, we maintained a high standard of rigor for reproducibility. We report the detailed pseudocode of our method in Algorithm 1, and we also provide our code via an anonymized link. We provide more details on our training setup and choice of hyperparameters in Appendix B. We report results on multiple datasets in Design-Bench (Trabucco et al., 2022), each with different properties, and benchmark our method against multiple baselines from different families of approaches. Our results are averaged over 5 seeds, and we also provide the standard deviations. We also conduct several ablations to evaluate the sensitivity of BONET to different parameters.

A NOTATIONS & THEORETICAL ANALYSIS

A.1 NOTATIONS

For ease of reference, we list the notations in Table 4.

One crucial component of SORT-SAMPLE is sorting the trajectories in increasing order of function values. Although our primary motivation for sorting is derived from empirical observations of online black-box optimizers (Figure 1c), we note that for a certain class of functions that are non-trivial to optimize (maximize, in our paper), lower (higher) function values occupy a larger (smaller) portion of the domain. Thus, intuitively, we can relate lower function values with exploration and higher function values with exploitation. With this perspective, sorting can be seen as moving from a high-diversity region to a low-diversity region, a behavior typically seen in online black-box optimizers (Bijl et al., 2016; Garivier et al., 2016). In this section, we formally prove such properties for this constrained class of functions. We first consider the simplified case of differentiable 1D functions under certain assumptions, and then extend the result to the more general $D$-dimensional case. First, we define a notion of $\epsilon$-high points.

Definition 1. $\epsilon$-high values. Let the range of $f$ be denoted as $[y_{\min}, y_{\max}]$. Then, a function value $y$ in this range is $\epsilon$-high if $y \ge y_{\min} + \epsilon (y_{\max} - y_{\min})$.

Intuitively, the above definition implies that $y$ is $\epsilon$-high if it is in the top $1 - \epsilon$ fraction of the range of $f$. The following result characterizes the relative diversity of regions consisting of $\epsilon$-high points for 1-D functions.

Proof of Proposition 1. Note that if no point in the domain $[a, b]$ achieves an $\epsilon$-high function value, then the statement holds trivially. So, we assume that there is at least one point with an $\epsilon$-high function value. Let $x_1$ and $x_2$ be the smallest and largest such points in the domain. Since the boundary points don't have $\epsilon$-high values, $|H^c| \ge (x_1 - a) + (b - x_2)$ and $|H| \le (x_2 - x_1)$.
Thus, if we prove that $(x_1 - a) + (b - x_2) \ge (x_2 - x_1)$, then we are done. To prove this, we show the stronger statement $(x_1 - a) \ge (x_2 - x_1)$. Assume, on the contrary, that $(x_1 - a) < (x_2 - x_1)$. Rearranging, we get

$$\frac{2}{x_2 - a} < \frac{1}{x_1 - a} \tag{7}$$

Now, by the definition of the Lipschitz constant, we have:

$$\frac{f(x_1) - f(a)}{x_1 - a} \le L \;\Rightarrow\; \frac{2(f(x_1) - f(a))}{x_2 - a} \overset{(7)}{\le} L \;\Rightarrow\; \frac{2(\epsilon (y_{\max} - y_{\min}) + y_{\min} - f(a))}{b - a} \le L \tag{8}$$

The last inequality holds because $x_2 - a \le b - a$ and $f(x_1)$ is $\epsilon$-high. Since inequality (7) is strict and $f(x_1) > f(a)$, the second step in (8) is in fact strict, so $L$ strictly exceeds the bound in (6). This contradicts the bound (6) on $L$, completing our proof of Proposition 1.

Now, we show an extension of this proposition for $D$-dimensional functions with hypercube domains.

Proposition 2. Let $f : X \to \mathbb{R}$ be a $D$-dimensional, real-valued, continuous, and differentiable function with hypercube domain $X = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_D, b_D]$, such that none of the boundary points are $\epsilon$-high, for some fixed $\epsilon$. Here, by boundary points, we mean the points on the surface of the domain hypercube. Let $y_{\max}$ and $y_{\min}$ be the maximum and the minimum values attained by $f$. Let $H \subseteq X$ be a Lebesgue-measurable set of points for which $f(x)$ is $\epsilon$-high. Let $H^c = X \setminus H$. If the Lipschitz constant $L$ of $f$ is upper bounded by

$$\frac{2(\epsilon (y_{\max} - y_{\min}) + y_{\min} - \max_{x_2, \cdots, x_D} f(a_1, x_2, \cdots, x_D))}{b_1 - a_1}, \tag{9}$$

then $|H| < |H^c|$, where $|\cdot|$ denotes the volume of a set w.r.t. the Lebesgue measure.

Proof. We prove this proposition by induction on the number of dimensions $D$. Notice that for $D = 1$, the statement reduces to Proposition 1, which we have already proved. Next, we assume that the statement holds for $(D-1)$-dimensional functions and prove it for $D$ dimensions, with $D > 1$.

Let us define $H_{D,\epsilon} : \mathcal{F}_D \to \mathcal{B}_D$ to be a functional that maps any $D$-dimensional function, say $f$, to the Lebesgue-measurable subset of $\mathbb{R}^D$ that corresponds to the set of points where $f(x)$ is $\epsilon$-high. Here, $\mathcal{F}_D$ and $\mathcal{B}_D$ denote the set of all $D$-dimensional functions and the set of all Lebesgue-measurable subsets of $\mathbb{R}^D$, respectively.
We similarly define the mapping $H^c_{D,\epsilon}$ to be a functional mapping a function $f$ to the complement of its $\epsilon$-high region. Thus, $H = H_{D,\epsilon}(f)$ and $H^c = H^c_{D,\epsilon}(f)$. Now, by definition,

$$|H_{D,\epsilon}(f(\cdot, \cdots, \cdot))| = \int_{x_D} |H_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, x_D))| \, dx_D \tag{10}$$

And similarly,

$$|H^c_{D,\epsilon}(f(\cdot, \cdots, \cdot))| = \int_{x_D} |H^c_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, x_D))| \, dx_D \tag{11}$$

Consequently, to prove $|H_{D,\epsilon}(f(\cdot, \cdots, \cdot))| \le |H^c_{D,\epsilon}(f(\cdot, \cdots, \cdot))|$, it suffices to prove $|H_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, x_D))| \le |H^c_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, x_D))|$ for every $x_D \in [a_D, b_D]$. To do this, we first fix the $D$-th dimension to be $c$. In other words, we consider the $(D-1)$-dimensional slice of $f(x_1, \cdots, x_D)$ with $x_D = c$. Let $g$ be such a slice, with $g(x_1, \cdots, x_{D-1}) = f(x_1, \cdots, x_{D-1}, c)$. First, we need an $\epsilon_g$ for which an $\epsilon_g$-high value for $g$ is $\epsilon$-high for $f$:

$$\epsilon_g (y^g_{\max} - y^g_{\min}) + y^g_{\min} = \epsilon (y_{\max} - y_{\min}) + y_{\min} \tag{12}$$

where $y^g_{\max}$ and $y^g_{\min}$ are the maximum and minimum values, respectively, achieved by $g$. By this choice of $\epsilon_g$, we ensure that a point $(x_1, \cdots, x_{D-1}, c)$ is $\epsilon$-high w.r.t. $f$ if and only if $(x_1, \cdots, x_{D-1})$ is $\epsilon_g$-high w.r.t. $g$. In other words,

$$H_{D-1,\epsilon_g}(g) = H_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, c)), \qquad H^c_{D-1,\epsilon_g}(g) = H^c_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, c)) \tag{13}$$

Let the Lipschitz constant of $g$ be $L_g$. First, we show that $L_g \le L$. By the definition of the Lipschitz constant, for $x = (x_1, \cdots, x_{D-1})$ and $z = (z_1, \cdots, z_{D-1})$ in the domain of $g$,

$$L_g = \sup_{x \ne z} \frac{|g(x_1, \cdots, x_{D-1}) - g(z_1, \cdots, z_{D-1})|}{\sqrt{\sum_{i=1}^{D-1} (x_i - z_i)^2}} = \sup_{x \ne z} \frac{|f(x_1, \cdots, x_{D-1}, c) - f(z_1, \cdots, z_{D-1}, c)|}{\sqrt{\sum_{i=1}^{D-1} (x_i - z_i)^2 + (c - c)^2}} \le L$$

where the last inequality follows from the definition of $L$ w.r.t. $f$.
Combining this with our bound on $L$ in (9), we get

$$L_g \le \frac{2(\epsilon (y_{\max} - y_{\min}) + y_{\min} - \max_{x_2, \cdots, x_D} f(a_1, x_2, \cdots, x_D))}{b_1 - a_1}$$

$$\le \frac{2(\epsilon (y_{\max} - y_{\min}) + y_{\min} - \max_{x_2, \cdots, x_{D-1}} f(a_1, x_2, \cdots, x_{D-1}, c))}{b_1 - a_1} \quad \text{(fixing the $D$-th dimension)}$$

$$\overset{(12)}{\le} \frac{2(\epsilon_g (y^g_{\max} - y^g_{\min}) + y^g_{\min} - \max_{x_2, \cdots, x_{D-1}} g(a_1, x_2, \cdots, x_{D-1}))}{b_1 - a_1}$$

Thus, the Lipschitz bound assumption is satisfied by $g$ with $\epsilon = \epsilon_g$. Also, the boundaries are not $\epsilon_g$-high w.r.t. $g$ because of the choice of $\epsilon_g$. This implies, by the inductive assumption, that $|H_{D-1,\epsilon_g}(g)| \le |H^c_{D-1,\epsilon_g}(g)|$. This, combined with equality (13), proves that $|H_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, c))| \le |H^c_{D-1,\epsilon}(f(\cdot, \cdots, \cdot, c))|$. Since this is true for all $c \in [a_D, b_D]$, by equations (10) and (11), our proof of $|H_{D,\epsilon}(f(\cdot, \cdots, \cdot))| \le |H^c_{D,\epsilon}(f(\cdot, \cdots, \cdot))|$ is complete.

B EXPERIMENTAL DETAILS

B.1 SORT-SAMPLE

In SORT-SAMPLE, the score of each bin is calculated according to the formula

$$s_i = \frac{|B_i|}{|B_i| + K} \exp\left(-\frac{|\hat{y} - y_{b_i}|}{\tau}\right)$$

The two variables $K$ and $\tau$ act as smoothing parameters. $K$ controls the relative priority given to the larger bins (bins with more points). A higher value of $K$ assigns a higher relative weight to these large bins compared to smaller bins, whereas with a low value of $K$ the weights assigned to large and small bins are similar. In the extreme case where $K = 0$, the weight due to $\frac{|B_i|}{|B_i| + K}$ will always be 1, regardless of bin size. For a very large value of $K$, the weight will be approximately linearly proportional to the bin size $|B_i|$. The latter case is not desirable because if there is a bin with a very large number of points (which might be a low-quality bin), then most of the total weight will be given to that bin because of the linear proportionality.
Temperature $\tau$ controls how harshly the bad bins are penalized. The lower the value of $\tau$, the lower the relative score of the low-quality bins (bins with high regret), and vice versa. In our experiments, we don't tune the values of $K$, $\tau$, and the number of bins $N_B$. In all the tasks, we use $K = 0.03 \times N$ and $\tau = R_{10}$, where $R_{10}$ is the 10th percentile regret value in D. For the Branin task, we use $N_B = 32$, and for all the Design-Bench tasks, we use $N_B = 64$. Empirically, as we show in Figure 6a, we didn't observe much effect of $K$ in the range [0.01, 0.1], or of $\tau$ from the 50th to the 10th percentiles of R. Figure 6b shows the variation with $N_B$, keeping all other parameters fixed. As expected, $N_B = 1$ doesn't perform well, as having just one bin is equivalent to having no re-weighting. Beyond 32, we don't see much variation with the value of $N_B$.
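As a rough illustration of the scoring rule in B.1, the following Python sketch computes $s_i$ for equal-width bins over the function values. It is not the authors' implementation: the name `bin_scores`, equal-width binning, the use of the best dataset value as $\hat{y}$, the bin midpoint as $y_{b_i}$, and regret measured against the dataset maximum are all assumptions made for illustration.

```python
import math


def bin_scores(ys, num_bins=64, k_frac=0.03, tau=None):
    """Score equal-width bins of function values (illustrative sketch).

    Assumptions (not confirmed by the text): y_hat is the best value in
    the data, y_b is the bin midpoint, and regret is y_hat - y.
    Requires at least two distinct values in ys and k_frac > 0.
    """
    y_lo, y_hat = min(ys), max(ys)
    width = (y_hat - y_lo) / num_bins

    # Count the points falling in each bin; the maximum goes in the last bin.
    counts = [0] * num_bins
    for y in ys:
        idx = min(int((y - y_lo) / width), num_bins - 1)
        counts[idx] += 1

    mids = [y_lo + (i + 0.5) * width for i in range(num_bins)]
    k = k_frac * len(ys)  # K = 0.03 * N, as in the text

    if tau is None:
        # tau = R_10: the 10th-percentile regret (nearest-rank approximation).
        regrets = sorted(y_hat - y for y in ys)
        tau = regrets[int(0.10 * (len(regrets) - 1))]
    tau = max(tau, 1e-12)  # guard against a zero temperature

    scores = [
        c / (c + k) * math.exp(-abs(y_hat - m) / tau)
        for c, m in zip(counts, mids)
    ]
    return scores, counts


scores, counts = bin_scores([float(v) for v in range(100)], num_bins=10)
```

With equally populated bins, as in this toy call, the size factor is identical for every bin, so the scores are ordered purely by how close each bin is to the best value.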

B.2 MODEL ARCHITECTURE & IMPLEMENTATION

Architecture We use a GPT-like architecture (Radford et al., 2019), where each timestep refers to two tokens $R_t$ and $x_t$. Similar to Chen et al. (2021), we add a new learned timestep embedding (in addition to the positional embedding already present in transformers). Each token $R_t$ and $x_t$ that goes as input to our model is first projected into a 128-dimensional embedding space using a linear embedding layer. To this embedding, we also add the positional and timestep embeddings. This is passed through a causally masked transformer. The prediction head for $R_t$ predicts $\hat{x}_t$, which is then used to compute the loss. The output of the prediction head for $x_t$ is discarded. At each timestep, we feed in the last $C$ timesteps to the model, where $C$ refers to the context length. For continuous tasks, the prediction head for $R_t$ outputs a $d$-dimensional prediction $\hat{x}_t$. For discrete tasks, the prediction head outputs a $V \times d$-dimensional prediction, where $V$ refers to the number of classes in the discrete task. Thus, each dimension in X corresponds to a $V$-dimensional logits vector.

Code Our code (available at the anonymized link) is built upon the code from minGPT and Chen et al. (2021). All code we use is under the MIT licence.

Training The parameter details for all the tasks are summarized in Table 5. Note that almost all of the parameters are the same across all the Design-Bench tasks. The number of layers is higher for continuous tasks, as they have higher dimensionality. For all the tasks, we use a batch size of 128 and a fixed learning rate of $10^{-4}$ for 75 epochs. All training is done using 10 Intel(R) Xeon(R) CPU cores (E5-2698 v4443 @ 2.20GHz) and one NVIDIA Tesla V100 SXM2 GPU.

2 are also normalized using the same procedure, similar to prior works (Trabucco et al., 2022; 2021). We also report unnormalized results in Table 7. We noticed that the oracle of Hopper is highly inaccurate for points with higher function values.
Figure 9 shows the function values in the dataset vs. the oracle output for the top 10 best points in the data, clearly showing the inconsistency between the two. In fact, the oracle minima and maxima for the dataset are just 56.26 and 786.79, respectively, far from the actual dataset values. Due to such discrepancies, we have decided not to include the Hopper task in our analysis. 
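Returning to the input format described under Architecture in B.2, the token layout can be sketched in a few lines of Python. The helper below interleaves regret-budget and point tokens and keeps only the last $C$ timesteps. The name `build_context` and the tuple-based token representation are assumptions for illustration, and the embedding layers and transformer are omitted.

```python
def build_context(regret_budgets, points, context_len):
    """Interleave (R_t, x_t) tokens and keep the last `context_len` timesteps.

    Each token is a (kind, timestep, value) tuple; the timestep index is kept
    because a timestep embedding is added on top of the token embedding.
    """
    if len(regret_budgets) != len(points):
        raise ValueError("need exactly one regret budget per point")
    start = max(0, len(points) - context_len)
    tokens = []
    for t in range(start, len(points)):
        tokens.append(("R", t, regret_budgets[t]))  # regret-budget token
        tokens.append(("x", t, points[t]))          # point token
    return tokens


context = build_context([0.9, 0.7, 0.4, 0.2], ["x0", "x1", "x2", "x3"], context_len=2)
```

A context length of $C$ timesteps therefore corresponds to at most $2C$ input tokens.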

C ADDITIONAL EXPERIMENTAL RESULTS

C.1 ABLATIONS ON SORT-SAMPLE STRATEGY

The SORT-SAMPLE algorithm has two main components: sampling after re-weighting, and sorting. Our sorting heuristic is primarily motivated by typical runs of online optimizers. To show this, we run an online GP to optimize three synthetic functions, namely the negative Branin, negative Goldstein-Price, and negative six-hump camel functions, and plot the function values for the proposed points. Figure 10 shows sample trajectories of the function values of the proposed maxima after each function evaluation. We can see that, on average, the function values tend to increase as the number of queries increases. Such behavior has also been reported for other black-box functions and setups; see, e.g., Bijl et al. (2016). In this section, we perform ablations to see the effects of these components. To this end, we construct trajectories using 4 strategies:

1. Random: Uniformly randomly sample a trajectory from the offline dataset.
2. Random + Sorting: Uniformly randomly sample a trajectory from the offline dataset and sort it in ascending order of the function values.
3. Re-weighting + Partial Sorting: Perform re-weighting, uniformly randomly sample $n_i$ points from each bin, and concatenate them from the lowest-quality bin to the highest-quality bin. This way, the trajectory will be partially sorted, i.e., the order of the bins themselves is sorted, but the points sampled from a bin will be randomly ordered. In this case, the trajectories are not entirely monotonic w.r.t. the function values. Intuitively, this intermingles exploration and exploitation phases within and across bins, respectively.
4. Re-weighting + Sorting (default in BONET): Sort the trajectory obtained in strategy 3. This is the default setting we use in our experiments.

Figure 11 contains the results obtained by each of the four strategies.
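The difference between strategies 3 and 4 can be made concrete with a small Python sketch. The function names, the toy bins, the per-bin counts `n_per_bin`, and sampling with replacement are assumptions for illustration; in particular, how $n_i$ is derived from the bin scores is not shown here.

```python
import random


def sample_partially_sorted(bins, n_per_bin, rng):
    """Strategy 3: bins in ascending quality, random order inside each bin."""
    trajectory = []
    for bin_points, n in zip(bins, n_per_bin):
        if bin_points and n > 0:
            # Sampling with replacement is an assumption of this sketch.
            trajectory.extend(rng.choices(bin_points, k=n))
    return trajectory


def sample_sorted(bins, n_per_bin, rng):
    """Strategy 4: draw as in strategy 3, then sort by function value."""
    trajectory = sample_partially_sorted(bins, n_per_bin, rng)
    return sorted(trajectory, key=lambda point: point[1])


# Toy data: each point is (x, y); bins are ordered from low to high y.
bins = [
    [(0.1, 1.0), (0.2, 1.5), (0.3, 1.2)],
    [(0.4, 2.1), (0.5, 2.8)],
    [(0.6, 3.3), (0.7, 3.9), (0.8, 3.6)],
]
n_per_bin = [2, 3, 4]  # hypothetical counts after re-weighting
rng = random.Random(0)
partial = sample_partially_sorted(bins, n_per_bin, rng)
full = sample_sorted(bins, n_per_bin, rng)
```

In `partial`, function values are monotone only at the level of bins, while `full` is monotone point by point.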
Note that while going from strategy 1 to 2, we keep the points sampled in a trajectory the same, so the only difference between them is sorting. Figure 11 shows that strategy 2 clearly outperforms strategy 1. This means that sorting has a significant impact on the results. Next, note that strategies 2 and 4 differ only in their sampling strategy, and strategy 4 outperforms strategy 2, which shows the effectiveness of re-weighting. This experiment justifies our choice of both re-weighting and sorting.

C.2 ANALYSIS ON QUERY BUDGET Q

So far, we have been discussing the results with query budget Q = 256. Here, we describe the evaluation strategy we use when a lower query budget is available. Our strategy is to give higher preference to lower R values when a lower budget is available. For example, when Q = 192, we only roll out and evaluate for R ∈ {0.0, 0.01, 0.05}. For 192 < Q ≤ 256, we roll out for R ∈ {0.0, 0.01, 0.05, 0.1}, evaluate the entire predicted sub-sequences of length 64 for {0.0, 0.01, 0.05}, and evaluate the first Q − 192 points in the predicted sub-sequence for R = 0.1. In Figure 12, we present the results for different query budgets for our method compared to important baselines, for the D'Kitty task. We outperform the baselines for almost all the query budget values.
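The budget rule in C.2 amounts to filling rollouts greedily in order of increasing R. A minimal Python sketch follows; the name `allocate_budget` is hypothetical, and applying the same greedy rule to budgets below 192 is an assumption, since the text only spells out the cases Q = 192 and 192 < Q ≤ 256.

```python
def allocate_budget(q, r_values=(0.0, 0.01, 0.05, 0.1), rollout_len=64):
    """Return {R: number of rolled-out points to evaluate} for query budget q.

    Lower R values are served first; each rollout contributes at most
    `rollout_len` points (the predicted sub-sequence length of 64).
    """
    allocation = {}
    remaining = q
    for r in r_values:
        if remaining <= 0:
            break
        n = min(rollout_len, remaining)
        allocation[r] = n
        remaining -= n
    return allocation


full_budget = allocate_budget(256)
reduced_budget = allocate_budget(200)
```

For Q = 200, the three lowest R values each receive 64 evaluations and R = 0.1 receives the remaining Q − 192 = 8.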

C.3 ADDITIONAL ABLATIONS

Here we present ablations similar to Section 3 on the D'Kitty task, and observe similar trends to what we see in the Branin ablations.

C.4 EFFECT OF PREFIX SEQUENCES

During evaluation, the unrolled trajectory depends on two factors: the evaluation regret budget R and the prefix sequence. Empirically, we found that R has a larger impact on the unrolled trajectory than the prefix sequence. To show this, we first evaluate the Branin task for 10 different randomly sampled prefix sequences for a fixed R, and then do the same with 10 different R values sampled from the range (0.0, 0.5) for the same prefix sequence. Figure 15 shows the standard deviation of the minimum regret of the 10 different unrolled trajectories for 3

In this experiment, we study the effect of changing the number of parameters in BONET. We do this by altering the number of layers and heads in BONET on D'Kitty. We find that increasing the number of parameters helps up to a point, beyond which the model overfits. It is important to note that we present this study only to understand the impact of model size on our performance. We don't actually tune over these parameters in our experiments. They are kept fixed across all the discrete and continuous tasks (refer to Table 5).

Figure 17: We show the performance of various models with differing numbers of layers and heads on D'Kitty to see their effect on BONET. We find that increasing the number of parameters helps up to a point, beyond which the model overfits.

C.7 ABLATIONS ON DATASET SIZE

To test the limits of BONET, we run experiments where we withhold offline training data from BONET and evaluate the performance. We have two settings: one where we withhold a random x% subset of the data (Table 8), and another where we withhold the top x percentile of the data during training and evaluation (Table 9). When we merely reduce the number of points, we don't see as sharp a drop-off in performance as when we withhold the good points in the dataset. This leads us to believe that the dominating effect is not the size of the dataset as such, but the quality of the points in the dataset. Further, note that even here the function values of the proposed points are significantly higher than the maximum in the dataset, which rules out the possibility of memorization for BONET.
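The two withholding settings can be sketched as follows. This is only an illustration: the function names and the representation of the dataset as a list of (x, y) pairs are assumptions, and rounding the number of withheld points down is a choice made here, not one stated in the text.

```python
import random


def withhold_random(dataset, percent, rng):
    """Setting 1: drop a uniformly random `percent`% of the points."""
    n_keep = len(dataset) - int(len(dataset) * percent / 100)
    return rng.sample(dataset, n_keep)


def withhold_top(dataset, percent):
    """Setting 2: drop the `percent`% of points with the highest y."""
    ranked = sorted(dataset, key=lambda point: point[1])
    n_keep = len(ranked) - int(len(ranked) * percent / 100)
    return ranked[:n_keep]


# Toy dataset of (x, y) pairs with y = 0, 1, ..., 99.
dataset = [(i, float(i)) for i in range(100)]
random_subset = withhold_random(dataset, 10, random.Random(0))
low_quality_subset = withhold_top(dataset, 10)
```

Both subsets have the same size, but only the second one lowers the best function value available for training.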

C.8 NOISE ABLATION

We run an experiment where we add progressively larger amounts of noise to the y values in our offline dataset while training our model, to test how robust BONET is to noisy data. We report the results in Table 10 for D'Kitty. We find that, as expected, increasing noise reduces performance, but BONET is reasonably robust to noise; the largest drop-off occurs when the magnitude of the noise is equal to the magnitude of the values.

C.9 RANDOM BASELINE

One might argue that BONET just memorizes the best points in the offline dataset and outputs random points close to those best points during evaluation. To rule out this possibility, we perform a simple experiment for the D'Kitty task. We choose a small hypercube domain around the optimal point in the offline dataset and uniformly randomly sample 256 points in that domain. In Table 11, we show the results for different widths of this hypercube. A width of 0 means only the best point in the dataset. For smaller hypercubes around the best point in the offline dataset, we see that the best point found by 256 random searches is roughly 225, which is significantly lower than what BONET finds (291.08). For larger hypercubes, the points are highly diverse. These observations suggest that this optimization problem cannot be solved by just randomly outputting points around the best point in the dataset. If we look at the 256 points output by BONET, they are consistently good (mean is 220), with comparatively very low variance. This suggests that BONET is not simply outputting random points around the best points in the dataset.
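The sampling step of this random baseline is simple enough to sketch directly. The function name `sample_hypercube` and the toy center are assumptions for illustration; evaluating the sampled points with the task oracle is omitted.

```python
import random


def sample_hypercube(center, width, n_samples, rng):
    """Draw points uniformly from an axis-aligned hypercube around `center`.

    `width` is the side length of the hypercube; width 0 returns copies of
    the center itself, matching the "0 width" row described in the text.
    """
    half = width / 2.0
    return [
        [c + rng.uniform(-half, half) for c in center]
        for _ in range(n_samples)
    ]


best_point = [1.0, -2.0, 0.5]  # toy stand-in for the best dataset point
candidates = sample_hypercube(best_point, 0.2, 256, random.Random(0))
degenerate = sample_hypercube(best_point, 0.0, 4, random.Random(0))
```

The baseline then reports the maximum, mean, and standard deviation of the function values of the 256 candidates.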

D ADDITIONAL ANALYSIS

D.1 VISUALIZATION OF PREDICTED POINTS

Here we try to visualize the predicted points of BONET compared to the points in the offline data to study the nature of the points proposed by the model. As shown in Figure 18 , BONET generalizes well on the unseen maxima regions of the function and produces low regret points.

D.2 ACTIVE GP EXPERIMENT

We also run an experiment to compare BONET with an active BBO method. Namely, we compare BONET with active BayesOpt, using the same GP prior and acquisition function (quasi-Expected Improvement) as mentioned in Section 3. The difference between the active method and the offline method we compare with in Table 2 is that, while the active method directly optimizes the ground truth function, the offline method first trains a surrogate neural network on the data and then performs Bayesian optimization on the surrogate instead of the ground truth function. This is done to make the BayesOpt baseline fully offline, and is the same procedure followed by Trabucco et al. (2022; 2021). Note that this is an unfair comparison, since the active method is online and queries the oracle function, thereby using more queries than our budget. As shown in Table 12, we find that using the oracle doesn't necessarily improve performance across all tasks, and there are tasks where the performance doesn't change at all. Moreover, our model outperforms even the oracle GP-qEI method on several tasks.

D.3 T-SNE PLOTS ON D'KITTY

We show t-SNE plots on D'Kitty for datasets of differing sizes, with the removal of a randomly sampled x% of the data (setting one described in the previous section). The blue points represent points proposed by our model, and the red points represent points in the dataset. In general, the blue points do not overlap the red points, indicating that the points proposed by BONET are from a different region.



We haven't included Hopper since the domain is buggy: we found that the oracle function used to evaluate the task was highly inaccurate and noisy. An expanded discussion can be found in Appendix B.7.

We take the baseline implementations from https://github.com/brandontrabucco/design-baselines

https://github.com/karpathy/minGPT

https://github.com/kzl/decision-transformer

We were not able to reproduce the results of (Fu & Levine, 2021) and (Yu et al., 2021) on the latest version of Design-Bench.



Figure 1: (a) Example of offline BBO on a toy 1D problem. Here, the domain ends at the red dashed line. Thus, the correct optimal value is x*, whereas gradient ascent on the fitted function will output the out-of-domain point x. (b) Example trajectory on the 2D Branin function. The dotted lines denote the trajectories in our offline dataset, and the solid line refers to our model trajectory, with low-quality blue points and high-quality red points. (c) Function values of trajectories generated by a simple Gaussian process (GP) based BayesOpt model on several synthetic functions.

following equation 3, and append T to D_traj
end for
▷ Phase 2: Training
Train the model g_θ to maximize the log-likelihood of D_traj using the loss in equation 4
▷ Phase 3: Evaluation
Construct a trajectory T′ from D following Phase 1, and delete the last T − P points
Calculate R_t and feed

Figure 3: Plots showing the distribution of function values in the offline dataset D (left) and the trajectories in D traj (right) for the Ant Morphology benchmark(Trabucco et al., 2022). Notice how the overall density of points with high function values is up-weighted post our re-weighting.

Figure 4: Rolled-out trajectories for Branin task for multiple R values (averaged over 5 runs). Figure (b) shows trajectories with prefix length 32, without updating RB (default evaluation setting). In Figure (a), we change the prefix length to 16 while in Figure (c), we update RB in the suffix.

Proposition 1. Let $f : [a, b] \to \mathbb{R}$, $a, b \in \mathbb{R}$, be a real-valued, continuous, and differentiable function such that $f(a)$ and $f(b)$ are not $\epsilon$-high. Let $H \subseteq [a, b]$ be a Lebesgue-measurable set of $x$ values for which $f(x)$ is $\epsilon$-high, with $\epsilon > 0.5$. Let $H^c = [a, b] \setminus H$. Without loss of generality, let us assume that $f(a) \le f(b)$. If the Lipschitz constant $L$ of $f$ is upper bounded by

$$\frac{2(\epsilon (y_{\max} - y_{\min}) + y_{\min} - f(a))}{b - a}, \tag{6}$$

then $|H| < |H^c|$, where $|\cdot|$ denotes the volume of a set w.r.t. the Lebesgue measure.
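The 1-D statement above can be sanity-checked numerically on a toy function. The sketch below is only an illustration under stated assumptions: the test function `ramp_up_then_down` is hypothetical and piecewise linear (so it is not differentiable at its kink), the Lipschitz constant is estimated by finite differences on a grid, and $|H|$ is approximated by the fraction of grid points that are $\epsilon$-high.

```python
def check_proposition_1(f, a, b, eps, n=20001):
    """Grid-based estimate of the Lipschitz constant, bound (6), and |H|/(b-a)."""
    xs = [a + (b - a) * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    y_min, y_max = min(ys), max(ys)
    threshold = y_min + eps * (y_max - y_min)  # epsilon-high cutoff

    # Largest finite-difference slope approximates the Lipschitz constant.
    lipschitz = max(
        abs(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1)
    )
    bound = 2.0 * (threshold - ys[0]) / (b - a)  # right-hand side of (6)
    frac_high = sum(y >= threshold for y in ys) / n
    return lipschitz, bound, frac_high


def ramp_up_then_down(x):
    # Rises with slope 1.2 until x = 0.75, then falls with slope -1.2.
    return 1.2 * x if x <= 0.75 else 0.9 - 1.2 * (x - 0.75)


lipschitz, bound, frac_high = check_proposition_1(ramp_up_then_down, 0.0, 1.0, eps=0.7)
```

Here the estimated Lipschitz constant (about 1.2) is below the bound (about 1.26), neither endpoint is $\epsilon$-high, and the $\epsilon$-high region covers roughly 45% of the domain, consistent with $|H| < |H^c|$.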


Figure 6: We plot the best achieved function value for different combinations of $K$, $\tau$, and $N_B$. As expected, we don't see much sensitivity to the choice of $K$ and $\tau$. For $N_B$, as expected, we see a significant decrease for $N_B = 1$, but $N_B = 32$ is comparable to $N_B = 64$.

Figure 7: Comparison between 3 strategies: (1) do not update the RB at all (blue), (2) update the RB but do not handle the case when R t becomes negative (orange) and (3) don't make the update if the update is going to make the RB negative and update otherwise (green). In all three cases, we see similar performance.

Figure 8: Histogram of normalized function values in the Hopper dataset. The distribution is highly skewed towards low function values.

Figure 9: Dataset values vs Oracle values for top 10 points. Oracle being noisy, we show mean and standard deviation over 20 runs.

Figure 10: Mean and standard deviations of 10 trajectories unrolled by a simple GP-based BayesOpt algorithm on the 3 synthetic functions.

Figure 11: Results with various trajectory construction strategies for D'Kitty task, averaged over 5 runs. Comparing blue and orange bars, it is evident that sorting is improving the results. Similarly, comparison of orange bar with red bar shows that re-weighting further improves the results.

Figure 12: Results for various query budget values Q for D'Kitty task, averaged over 5 runs. We match or outperform other baselines on almost all the values of Q.

Figure 13: Figures 13a and 13b show plots of the trajectories generated on D'Kitty for different values of evaluation RB (0, 8 and 10). In Figure 13a we show results without updating RB, and in Figure 13b we show results with updating. All the trajectories are averaged over 5 runs.

Figure 14: Ablations on D'Kitty, averaged over 5 runs.

(a) Varying the number of layers in BONET. Heads are fixed at 8 (b) Varying the number of heads in BONET. Layers are fixed at 16

Figure 18: Visualization of 32 points unrolled by a sample evaluation trajectory of BONET compared to 2000 points randomly sampled from the offline dataset. Three maxima regions don't contain any dataset points because of the removal of the top 10%-ile of uniformly sampled points, as described in Section 3.1. However, almost all the unrolled points fall into a maxima region, clearly showing the generalization capability of BONET.

Figure 19: We show t-SNE plots on D'Kitty for datasets of differing sizes, with the removal of a randomly sampled x% of the data. The blue points represent points proposed by our model, and the red points represent points in the dataset. We find that, in general, the blue points do not overlap the red points, indicating that the points proposed by BONET are from a different region.

50th percentile evaluations on all the tasks. Similar to Table 2, we find that BONET achieves both the best average rank and the best mean score on all tasks. This shows that BONET consistently outputs good candidate points.

Important notations used in the paper

Task details Hopper task in our results in Table 2 because of the inconsistency between the offline dataset values and the oracle outputs. Hopper data consist of 3200 points, each with 5126 dimensions. The lowest and highest function values are 87.93 and 1361.61. Figure 8 shows the distribution of the normalized function values. This distribution is extremely skewed towards low function values. Only 6 points out of 3200 have a normalized function value greater than 0.5.

Results for when a random x% subsection of the offline dataset was withheld during training from BONET

Results for when the top x% of the offline dataset was withheld during training from BONET

Ablation on adding various magnitudes of noise to training data

Results of using a simple strategy of sampling randomly from a small hypercube centered around the optima. We find that BONET considerably beats this baseline, indicating that the generalization occurring with BONET is not fortuitous.

Width of hypercube | Max. function value | Mean function value | Std. deviation

Comparison with GP-qEI on oracle function

B.3 EVALUATION

For all the Design-Bench tasks (except NAS), we use a query budget Q = 256. For NAS, we use a query budget of Q = 128 due to compute restrictions. Since we use a trajectory length of 128 and a prefix length of 64, this means that we can roll out four different trajectories. There are two variable parameters during the evaluation: Evaluation RB (R) and the prefix sub-sequence. We empirically observed that R has more impact on the variability of rolled-out points compared to the prefix sub-sequence. Hence, we roll out trajectories for 4 different low R values (0.0, 0.01, 0.05, 0.1). These values are kept fixed across all the tasks and are not tuned. They are chosen to probe the interval [0.0, 0.1], while giving slightly more importance to the low values by choosing 0.01. Evaluation strategies for when a lower query budget Q is available are discussed in Section C.2.

However, there are two issues with such an update rule:

1. Updating RB adds a sequential dependency on our model during evaluation, as we must query $f(x_t)$ to compute $R_{t+1}$. Thus, generating the Q candidate points is not purely offline.
2. While updating the regret budget $R_t$, it is possible that at some timestep $t$, $R_t$ becomes negative. Since the model has never seen negative RB values during training, this is undesirable.

Hence, we do not update RB during evaluation, and instead provide a fixed R value at every timestep after the prefix length. This way, point proposal is not dependent on sequential evaluations of $f$, making it much faster, as the evaluations of $f$ can then be parallelized. Furthermore, by not updating R we sidestep the issue of negative RBs. Empirically, as we see in Figure 7, there is not much difference across different strategies, which justifies our choice of not updating RB, allowing our method to be purely offline.
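The three strategies compared in Figure 7 can be contrasted in a short Python sketch. The update rule itself is not reproduced in this section, so the sketch assumes, purely for illustration, that an update subtracts the instantaneous regret `y_best - y_t` of the queried point from the current budget; the function name and strategy labels are likewise hypothetical.

```python
def next_regret_budget(rb, y_best, y_t, strategy="fixed"):
    """Return the regret budget fed to the model at the next timestep.

    Assumed update (not taken from the text): RB_{t+1} = RB_t - (y_best - y_t).
    """
    if strategy == "fixed":
        # Default choice: never update, so no oracle query is needed.
        return rb
    updated = rb - (y_best - y_t)
    if strategy == "update":
        # Always update, even if the budget becomes negative.
        return updated
    if strategy == "update_if_nonnegative":
        # Skip the update whenever it would make the budget negative.
        return updated if updated >= 0 else rb
    raise ValueError(f"unknown strategy: {strategy}")


fixed = next_regret_budget(0.1, y_best=1.0, y_t=0.7)
naive = next_regret_budget(0.1, y_best=1.0, y_t=0.7, strategy="update")
guarded = next_regret_budget(0.1, y_best=1.0, y_t=0.7, strategy="update_if_nonnegative")
```

Only the fixed strategy avoids evaluating $f$ between proposals, which is what allows the evaluations to be parallelized.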

B.5 BASELINES

For the gradient ascent baseline on the Branin task, we train a 2-layer neural network (with a hidden layer of size 128) as a forward model for 75 epochs with a fixed learning rate of $10^{-4}$. For gradient ascent on the learnt model during evaluation, we report results with a step size of 0.1 for 64 steps. We average over 5 seeds, and for each seed we perform two random restarts.

For the baselines in the Design-Bench tasks, we run the baseline code provided in Trabucco et al. (2022) and report results with the default parameters for a query budget of 256.

fixed R and prefix sequences, respectively. The standard deviation for the variation in prefix length is consistently lower than that for the variation in R, explaining our choice of spending query budget on different R rather than different prefix sequences.

C.5 EFFECT OF ESTIMATING $y_{\max}$

A key assumption of our method is the knowledge of $y_{\max}$. Though in many problems this is not an issue, there are many other problems where the value of $y_{\max}$ may not be known. A simple solution could be to just estimate $y_{\max}$. In Figure 16, we evaluate BONET on D'Kitty for multiple values of $y_{\max}$, starting from just beyond the dataset maximum. We find that the value of $y_{\max}$ initially affects performance a lot, but beyond a point, performance plateaus.

