BAYESIAN LEARNING TO OPTIMIZE: QUANTIFYING THE OPTIMIZER UNCERTAINTY Anonymous

Abstract

Optimizing an objective function with uncertainty awareness is well known to improve the accuracy and confidence of optimization solutions. Meanwhile, a relevant but very different question remains open: how can we model and quantify the uncertainty of an optimization algorithm itself? To close this gap, the prerequisite is to consider optimizers as samples from a distribution, rather than a few pre-defined and fixed update rules. We first take the novel angle of considering the algorithmic space of optimizers, each parameterized by a neural network. We then propose a Boltzmann-shaped posterior over this optimizer space and approximate the posterior locally as a Gaussian distribution through variational inference. Our novel model, Bayesian learning to optimize (BL2O), is the first study to recognize and quantify the uncertainty of the optimization algorithm. Our experiments on optimizing test functions, energy functions in protein-protein interactions, and loss functions in image classification and data privacy attack demonstrate that, compared to state-of-the-art methods, BL2O improves optimization and uncertainty quantification (UQ) in the aforementioned problems, as well as calibration and out-of-domain detection in image classification.

1. INTRODUCTION

Computational models of many real-world applications involve optimizing non-convex objective functions. As non-convex optimization is NP-hard, no optimization algorithm (or optimizer) can guarantee the global optimum in general; instead, the usefulness of their solutions (sometimes based on proximity to the optimum), when the optimum is unknown, can be very uncertain. Being able to quantify such uncertainty is important not only for assessing solution uncertainty after optimization but also for enhancing search efficiency during optimization. For instance, reliable and trustworthy machine learning models demand uncertainty awareness and quantification while training (optimizing) such models, whereas in reality deep neural networks without proper modeling of uncertainty suffer from overconfidence and miscalibration (Guo et al., 2017). In another application, protein docking, although there exist epistemic uncertainty in the objective function and aleatoric uncertainty in the protein structure data (Cao & Shen, 2020), state-of-the-art methods only predict several single solutions (Porter et al., 2019) without any associated uncertainty, which makes those predictions hard for end users to interpret. Various optimization methods have been proposed in response to the need for uncertainty awareness. Stochastic optimization methods like random search (Zhigljavsky, 2012), simulated annealing (Kirkpatrick et al., 1983), genetic algorithms (Goldenberg, 1989) and particle swarm optimization (Kennedy & Eberhart, 1995) inject randomness into the algorithms in order to reduce uncertainties. However, these methods do not provide uncertainty quantification (UQ) of the solutions. Recently, there has been growing interest in applying inference-based methods to optimization problems (Brochu et al., 2010; Shapiro, 2000; Pelikan et al., 1999).
Generally, they transfer the uncertainties within the data and the model into the final solution by modelling the posterior distribution over the global optimum. For instance, Bijl et al. (2016) use sequential Monte Carlo to approximate the distribution over the optimum, with Thompson sampling as the search strategy. Hernández-Lobato et al. (2014) use kernel approximation to model the posterior over the optimum under a Gaussian process. Ortega et al. (2012); Cao & Shen (2020) directly model the posterior over the optimum as a Boltzmann distribution. They not only surpass the previous methods in accuracy and efficiency, but also provide easy-to-interpret uncertainty quantification. Despite progress in optimization with uncertainty awareness, significant open questions remain. Existing methods consider uncertainty either within the data or the model (including objective functions) (Kendall & Gal, 2017; Ortega et al., 2012; Cao & Shen, 2020). However, no attention has been paid to the uncertainty arising from the optimizer, which is directly responsible for deriving the end solutions from given data and models. The optimizer is usually pre-defined and fixed in the optimization algorithm space. For instance, there are several popular update rules in Bayesian optimization, such as expected improvement (Vazquez & Bect, 2010) or upper confidence bound (Srinivas et al., 2009), that are chosen and fixed for the entire process. For Bayesian neural network training, the update rule is usually chosen off-the-shelf, such as Adam, SGD, or RMSProp. The uncertainty in the optimizer is intrinsically defined over the optimizer space and matters to both the optimization and UQ solutions. However, such uncertainty is unwittingly ignored when the optimizer is treated as a fixed sample in the space.
To fill the aforementioned gap, the core intellectual value of this work is to recognize and quantify a new form of uncertainty, one that lies in the optimization algorithm (optimizer), besides the classical data- or model-based uncertainties (also known as aleatoric and epistemic uncertainties). The underlying innovation is to treat an optimizer as a random sample from the algorithmic space, rather than one of a few hand-crafted update rules. The key enabling technique is to consider the algorithmic space as parameterized by a neural network. We then leverage a Boltzmann-shaped posterior over the optimizers, and approximate the posterior locally as Gaussian distributions through variational inference. Our approach, Bayesian learning to optimize (BL2O), for the first time addresses the modeling of optimizer-based uncertainty. Extensive experiments on optimizing test functions, energy functions in a bioinformatics application, and loss functions in image classification and data privacy attack demonstrate that, compared to state-of-the-art methods, BL2O substantially improves the performance of optimization and uncertainty quantification, as well as calibration and out-of-domain detection in classification. In the following sections, we first review related methods in detail and reveal the remaining gap. We then formally define the problem of optimization with uncertainty quantification and point out the optimizer as a source of uncertainty. After formally defining the optimizer space, the optimal optimizer as a random vector in the space, and the optimizer uncertainty, we propose our novel model, BL2O. Lastly, we compare our BL2O with both Bayesian and non-Bayesian competing methods on extensive test functions and real-world applications.

2. RELATED WORK

Many works (Wang & Jegelka, 2017; Hennig & Schuler, 2012) have studied optimization with uncertainty quantification under the framework of Bayesian optimization (Shahriari et al., 2016; Brochu et al., 2010). In these studies, multiple objectives are sampled from the posterior over objectives, p(f|D), where D is the observed data. Each sampled objective is optimized to obtain samples of the global optimum w*, so that an empirical distribution over w* can be built. Approximation is much needed, since those approaches require an optimization run for every sample. For instance, Hernández-Lobato et al. (2014) use kernel approximation to approximate the posterior distribution. Another line of work uses various sampling schemes to estimate the density of posterior distributions. For instance, Bijl et al. (2016) use sequential Monte Carlo sampling. De Bonet et al. (1997) design a randomized optimization algorithm that directly samples global optima. These methods are much more efficient, but their performance heavily depends on the objective landscape. Moreover, a few studies (Ahmed et al., 2016; Lizotte, 2008; Osborne et al., 2009; Wu et al., 2017) in Bayesian optimization utilize first-order information to boost optimization performance. For instance, Osborne et al. (2009) use gradient information to improve the covariance matrix of the Gaussian process. Wu et al. (2017) embed derivative knowledge into the acquisition function, which is optimized in every iteration. Finally, there are approaches (Ortega et al., 2012; Cao & Shen, 2020) that directly model the shape of the posterior as a Boltzmann distribution: p(w*|D) ∝ exp(−αf(w*)), where α is a scheduled temperature constant. They automatically adjust α during the search to balance the exploration-exploitation tradeoff, and beat previous work in terms of both efficiency and accuracy.
However, as revealed earlier in the Introduction, none of the methods above consider the uncertainty within the optimizer.

3. METHODS

Notation. We use a bold-faced uppercase letter to denote a matrix (e.g. W ), a bold-faced lowercase letter to denote a vector (e.g. w), and a normal lowercase letter to denote a scalar (e.g. w).

3.1. PROBLEM STATEMENT

The goal of optimization is to find the global optimum of an objective function f(w) w.r.t. w:

w* = arg min_w f(w).   (1)

w* is assumed unknown and treated as a random vector in this study. Once an optimizer obtains ŵ, its estimate of w*, it is important to assess the quality and the uncertainty of the solution. Considering that many real-world objective functions are non-convex and noisy around ŵ, solution quality is often measured by ||ŵ − w*||, the proximity to the global optimum rather than to the optimal function value. Examples include energy functions as the objective and RMSDs as the proximity measure in protein docking (Lensink et al., 2007). Therefore, the goal of uncertainty quantification (UQ) here is the following:

P(||ŵ − w*|| ≤ r_σ | D) = σ,   (2)

where r_σ is the upper bound of ||ŵ − w*|| at confidence level σ, and D denotes the samples collected during optimization. Such UQ results additionally provide confidence in the solution ŵ and improve model reliability for end users. To calculate the probability defined in Eq 2 and perform UQ, a direct albeit challenging way is to model the posterior over w*, p(w*|D), and then sample from the posterior. When the optimizer g is regarded as fixed, as in existing literature, the posterior is actually p(w*|D, g). A central contribution of ours is to further consider the optimizer as a source of uncertainty, model it as a random vector in an optimizer space, and perform posterior estimation of p(w*|D).
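As a minimal illustration of the UQ goal in Eq 2, the bound r_σ can be estimated empirically as the σ-quantile of distances between ŵ and posterior samples of w*. The sketch below is ours (the function name and toy data are hypothetical, not the paper's estimator):

```python
import numpy as np

def confidence_radius(w_hat, w_star_samples, sigma=0.9):
    """Empirical sigma-level upper bound r_sigma on ||w_hat - w*||,
    taken as the sigma-quantile of distances to posterior samples of w*."""
    dists = np.linalg.norm(w_star_samples - w_hat, axis=1)
    return float(np.quantile(dists, sigma))

# Toy check: posterior samples drawn around a known optimum at the origin.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10_000, 3))
r_090 = confidence_radius(np.zeros(3), samples, sigma=0.9)
```

By construction, roughly a σ fraction of the sampled optima fall within distance r_σ of ŵ, which is exactly the statement of Eq 2 in empirical form.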

3.2. OPTIMIZER UNCERTAINTY: A FRAMEWORK

An optimizer is directly responsible for optimization and thus naturally a source of solution uncertainty. To address this often-neglected uncertainty source, we first define the space of optimizers and then model an optimizer as a point in this space. Considering that many widely used optimizers are iterative and use first-order derivatives, we restrict the optimizer space as follows:

Definition 3.1 ((First-Order Iterative) Optimizer Space). We define a first-order, iterative algorithmic space G, where each point g ∈ G is an iterative optimizer given by the mapping g({∇f(w^τ)}_{τ=1}^{t}) = δw^t, where ∇f(w^τ) and δw^t are the gradient at the τ-th iteration and the update vector at the t-th iteration, respectively.

Here we use g(·) to denote a pre-defined update rule and the resulting optimizer. For instance, in gradient descent, g({∇f(w^τ)}_{τ=1}^{t}) = −α∇f(w^t), where α is the step size. Now that the optimizer space is defined, we next define the (unknown) optimal optimizer and its uncertainty.

Definition 3.2 (Optimal Optimizer). We define the optimal optimizer g* ∈ G as the optimizer that obtains the lowest function values within a fixed budget T:

g* = arg min_{g∈G} Σ_{t=1}^{T} f(w^t_g),   (3)

where w^t_g = w^{t−1}_g + g({∇f(w^τ_g)}_{τ=1}^{t−1}) is the parameter value at the t-th iteration updated through the optimizer g.

In practice the optimal optimizer g* is unknown, so we treat g* as a random vector and formally define the optimizer uncertainty as follows:

Definition 3.3 (Optimizer Uncertainty). Let G be the algorithmic space, where each point g ∈ G is an optimizer. We assume a prior distribution p(g*) over the optimal optimizer g*. We also assume a likelihood distribution p(D|g*), where D are the observed data (sample trajectory) given g*. We then define the optimizer uncertainty through p(g*|D) ∝ p(D|g*)p(g*).
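To make Definitions 3.1 and 3.2 concrete, the following sketch (ours, not the paper's implementation) treats an optimizer as a function of the gradient history and instantiates gradient descent as one point of the space G:

```python
import numpy as np

def gradient_descent(grad_history, alpha=0.1):
    """One point g in the optimizer space G: plain gradient descent,
    which maps the gradient history to the update -alpha * grad f(w_t)."""
    return -alpha * grad_history[-1]

def run_optimizer(g, grad_f, w0, T=100):
    """Iterate w_t = w_{t-1} + g({grad f(w_tau)}) for a fixed budget T,
    as in Definition 3.2."""
    w, history = np.asarray(w0, dtype=float), []
    for _ in range(T):
        history.append(grad_f(w))
        w = w + g(history)
    return w

# On f(w) = ||w||^2 (gradient 2w), this point of G approaches the optimum.
w_final = run_optimizer(gradient_descent, lambda w: 2.0 * w, [3.0, -2.0])
```

Any other update rule with the same signature, e.g. one produced by a neural network, is simply another point in the same space G.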
To inject the optimizer uncertainty into p(w*|D), it is straightforward to integrate over the optimizer for posterior estimation:

p(w*|D) = ∫ p(g*|D) p(w*|D, g*) dg*.   (4)

3.3. PARAMETERIZING THE OPTIMIZER SPACE

The optimizer uncertainty p(g*|D) as defined in Def. 3.3 can be intractable without a proper parameterization of the optimizer space G. Therefore, we next introduce possible ways of parameterizing G as defined in Def. 3.1.

Parameterization through Hyperparameters of Specific Optimizers. A simple way to parameterize the optimizer space for classical optimizers (e.g. gradient descent, Adam) is through their hyperparameters: G = H, where H is the hyperparameter space. For instance, for gradient descent we have H = (α), where α is the learning rate. For Adam, we have H = (α, β₁, β₂), where β₁ and β₂ are the coefficients used to compute running averages of the gradient and its square. However, such parameterization has significant drawbacks. The resulting algorithmic space G is very limited and depends heavily on the specific optimizer. The G (a 1D space) parameterized by the hyperparameters of gradient descent differs from that (a 3D space) parameterized by the hyperparameters of Adam; in fact, each is a rather restricted region of the actual G. The intrinsic flexibility (uncertainty) that lies in an iterative optimizer's update rule is not explored at all in this parameterization. These drawbacks are empirically demonstrated in Sec. 4.

Parameterization through Neural Networks. To reasonably and accurately model the intrinsic uncertainty within the update rule, we need a much more flexible way of modelling g. We thus parameterize the optimizer space as a neural network: G = Θ, where each θ ∈ Θ is a parameter vector of the neural network. Overcoming the drawbacks of the hyperparameter space H, the space Θ of neural network parameters generalizes update rules through neural networks, which can represent a wide variety of functions.
We note that this is also the space of meta-optimizers that learn to optimize (L2O), i.e., learn iterative update rules from data on a given task (Andrychowicz et al., 2016; Chen et al., 2017; Lv et al., 2017; Cao et al., 2019a). However, there has been no notion of uncertainty, let alone the task of UQ, for the learned optimizer in these L2O methods; this is what our Bayesian L2O (BL2O) addresses.

3.4. MODELING AN OPTIMIZER AS A RANDOM VECTOR

Now that we have the optimizer space G properly defined and parameterized, we proceed to model an optimizer g as a random vector in the space.

Boltzmann-shaped Posterior. Since G = Θ, we can rewrite each g ∈ G as g_θ with θ ∈ Θ and the optimal optimizer g* as g_{θ*}. Therefore, p(g*|D) becomes p(θ*|D). We consider a Gaussian prior over the parameters of the neural network: p(θ*) ∝ exp(−λ||θ*||²₂), where λ is a constant controlling the variance. We use the chain rule to decompose the likelihood function p(D|θ*) at a fixed budget T:

p(D|θ*) = Π_{t=1}^{T} p(f(w^t_{θ*}), w^t_{θ*} | θ*, {f(w^τ_{θ*}), w^τ_{θ*}}_{τ=0}^{t−1}) = Π_{t=1}^{T} p(f(w^t_{θ*}) | w^t_{θ*}, θ*, {f(w^τ_{θ*}), w^τ_{θ*}}_{τ=0}^{t−1}).

The second equality holds because w^t_{θ*} is fixed given θ* and the past data points. For the single-sample likelihood, we apply the results from Ortega et al. (2012); Cao & Shen (2020) and obtain

p(f(w^t_{θ*}) | w^t_{θ*}, θ*, {f(w^τ_{θ*}), w^τ_{θ*}}_{τ=0}^{t−1}) ∝ exp(−f(w^t_{θ*})).

Multiplying the likelihoods of all samples together yields the Boltzmann-shaped likelihood p(D|θ*) ∝ exp(−Σ_{t=1}^{T} f(w^t_{θ*})). We finally multiply the conjugate prior by the likelihood and obtain the Boltzmann-shaped posterior:

p(θ*|D) ∝ exp(−Σ_{t=1}^{T} f(w^t_{θ*})) · exp(−λ||θ*||²₂) = exp(−F(θ*)),

where F(θ*) = Σ_{t=1}^{T} f(w^t_{θ*}) + λ||θ*||²₂, which is exactly the objective in Eq 3 plus an L2 regularization term.

Local Approximation and Bayesian Loss. However, the above posterior involves an integral in the normalization constant which is computationally intractable. Moreover, the landscape of F(θ*) is so complicated that it is impossible to directly sample from the posterior distribution.
To overcome the aforementioned challenges, we learn a distribution q(θ*|φ) that has an analytic form and is easy to sample from, where φ is the parameter vector of q(θ*|φ), to approximate the true posterior p(θ*|D). Furthermore, due to the high dimension of θ* and the complicated landscape of the posterior, it is impossible to approximate p(θ*|D) at every position in the θ* space. We therefore approximate it locally around θ_c, an optimum of interest of F(θ*). We denote the local region by Θ_c, a neighborhood of θ_c, with re-normalization constant C = ∫_{θ*∈Θ_c} p(θ*|D) dθ*. The local posterior is then a conditioned (re-scaled) version of p(θ*|D): p′(θ*|D) = p(θ*|D)/C, θ* ∈ Θ_c. To make q(θ*|φ) ≈ p′(θ*|D), we calculate the KL divergence between the two:

KL(q(θ*|φ) || p′(θ*|D)) = ∫_{Θ_c} q(θ*|φ) log [q(θ*|φ) / p′(θ*|D)] dθ*
  = ∫_{Θ_c} q(θ*|φ) log [q(θ*|φ) / (p(θ*|D)/C)] dθ*
  = ∫_{Θ_c} q(θ*|φ) log [q(θ*|φ) / exp(−F(θ*))] dθ* + ∫_{Θ_c} q(θ*|φ) log(ZC) dθ*,

where Z = ∫ exp(−F(θ*)) dθ* is the normalization constant. The second term equals log(ZC), a constant w.r.t. φ, and can thus be ignored during optimization. We then propose our Bayesian loss:

F_B(φ) = ∫_{Θ_c} q(θ*|φ) log q(θ*|φ) dθ* + ∫_{Θ_c} q(θ*|φ) F(θ*) dθ* = −H(q(θ*|φ)) + E_{q(θ*|φ)}[F(θ*)],   (9)

where the first term is the negative entropy of the approximate posterior and the second term is the expectation of the loss function under that posterior.

Gaussian Posterior. For the local approximation, we take φ = (µ, Σ) and q(θ*|φ) = N(µ, Σ), where µ is the mean vector and Σ is the covariance matrix of a normal distribution. For simplicity, we consider Σ to be diagonal: Σ = diag(σ²₁, σ²₂, σ²₃, ...). The second term in Eq 9 involves an integral over F(θ*), which is intractable.
Therefore, we use Monte Carlo sampling from q(θ*|φ) to replace the integral. However, directly sampling the posterior parameters makes optimization difficult, as the gradients w.r.t. µ and Σ are inaccessible. Moreover, the standard deviations σ₁, σ₂, ... must be non-negative, making the optimization constrained. To overcome these two challenges, we use the trick introduced in Blundell et al. (2015) to shift sampling from q(θ*|φ) to sampling from a standard normal distribution N(0, I), and we reparameterize each standard deviation σ_i by ρ_i as σ_i = log(1 + exp(ρ_i)). Then for any ε sampled from N(0, I), we can calculate θ* as θ* = µ + log(1 + exp(ρ)) ⊙ ε, where ⊙ denotes the element-wise product and ρ = (ρ₁, ρ₂, ...).
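A minimal sketch of the reparameterization above (the function name is ours): the softplus keeps every standard deviation positive, and the sample is a differentiable function of (µ, ρ), so the Monte Carlo estimate of the Bayesian loss can be backpropagated through.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_theta(mu, rho):
    """Reparameterized draw from q(theta*|phi) = N(mu, diag(sigma^2)):
    sigma = log(1 + exp(rho)) keeps every std dev positive, and
    theta* = mu + sigma * eps with eps ~ N(0, I), so the sample is a
    differentiable function of (mu, rho)."""
    sigma = np.log1p(np.exp(rho))
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps, sigma

mu = np.zeros(5)
rho = np.full(5, -2.0)
theta_star, sigma = sample_theta(mu, rho)
```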

3.5. BAYESIAN AVERAGING

We recall our goal of building the posterior over the global optimum, p(w*|D), through Eq 4. We use Monte Carlo sampling to approximate the integral:

p(w*|D) = ∫ p(g*|D) p(w*|D, g*) dg* ≈ ∫_{Θ_c} q(θ*|φ) p(w*|g_{θ*}(·), D) dθ* ≈ (1/N) Σ_{i=1}^{N} p(w*|g_{θ*_i}(·), D),

where θ*_i is sampled from q(θ*|φ). Since q(θ*|φ) is a multivariate Gaussian whose dimensions are mutually independent, we estimate the summation above in each dimension using independent MC samples. In practice, N = 10,000, 100,000 and 500,000 led to negligible differences in the 1D estimates, so N was fixed at 10,000.
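The Monte Carlo averaging above can be sketched as follows. Here `run_sampled_optimizer` is a hypothetical stand-in for unrolling a sampled optimizer g_{θ*_i} to a solution; a real run would unroll the learned LSTM update rule.

```python
import numpy as np

rng = np.random.default_rng(2)

def bayesian_averaging(mu, rho, run_sampled_optimizer, n_samples=1000):
    """Monte Carlo estimate of Eq 4: draw theta*_i ~ q(theta*|phi),
    run the optimizer g_{theta*_i}, and pool the returned solutions as
    an empirical posterior over the global optimum w*."""
    sigma = np.log1p(np.exp(rho))
    sols = [run_sampled_optimizer(mu + sigma * rng.standard_normal(mu.shape))
            for _ in range(n_samples)]
    return np.array(sols)

# Hypothetical stand-in: each sampled "optimizer" returns a solution
# perturbed by its own parameters.
w_samples = bayesian_averaging(np.zeros(4), np.full(4, -3.0),
                               run_sampled_optimizer=lambda theta: theta[:2])
```

The pooled `w_samples` play the role of draws from p(w*|D), from which quantities such as r_σ in Eq 2 can be read off empirically.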

3.6. META-TRAINING SET

To boost the robustness and generalizability of our optimizer posterior p(g*|D), we use an ensemble of objective functions F = {f_i}_{i=1}^{N}. Specifically, we replace the objective in Eq 3 with (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} f_i(w^t_{θ*,i}) and rewrite F(θ*) as:

F(θ*) = (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} f_i(w^t_{θ*,i}) + λ||θ*||²₂,   (11)

where w^t_{θ*,i} is the solution at the t-th iteration for objective f_i optimized by g_{θ*}. This replacement lets our posterior generalize to novel objective functions. We regard the functional dataset F as the meta-training set. In the experiments, we create different meta-training sets for different problems, described in detail in Sec 4. We note that Eq 11 is also the objective, or part of the objective, of many meta-optimizers (Andrychowicz et al., 2016; Chen et al., 2017; Lv et al., 2017; Cao et al., 2019a). However, those methods focus on training a deterministic optimizer without uncertainty awareness.

3.7. TWO-STAGE TRAINING FOR EMPOWERING THE LOCAL POSTERIOR

As mentioned before, due to the extremely large optimizer space, we focus on modelling the posterior locally around θ_c, an optimum of interest. If we directly trained our model with the Bayesian loss in Eq 9, the posterior would simply be local to the randomly initialized point. To obtain a real optimum of interest θ_c, we first train our model in a non-Bayesian way by minimizing the loss in Eq 11. We then use θ_c as the warm start for µ and begin the second, Bayesian training stage with the loss in Eq 9. Both training stages are critical for empowering our local posterior, as demonstrated by the ablation study in Appendix F.
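A toy sketch of the first (non-Bayesian) stage and the warm start it provides; the Bayesian stage itself, minimizing F_B(µ, ρ), is omitted, and the quadratic F is a stand-in for the actual trajectory loss:

```python
import numpy as np

def stage_one(theta0, grad_F, steps=500, lr=0.01):
    """Stage 1 (non-Bayesian): minimize F(theta) from a random init to
    locate theta_c, an optimum of interest; stage 2 then warm-starts the
    variational mean mu at theta_c and minimizes the Bayesian loss F_B."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_F(theta)
    return theta

# Toy F(theta) = ||theta||^2: stage 1 converges near theta_c = 0,
# which then seeds mu for the Bayesian stage.
theta_c = stage_one(np.array([4.0, -3.0]), grad_F=lambda t: 2.0 * t)
mu_init = theta_c.copy()
```

Without stage 1, the local Gaussian approximation would be centered at an arbitrary point of the loss landscape rather than at an optimum of interest.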

3.8. MODEL ARCHITECTURE, IMPLEMENTATION AND COMPUTATIONAL COMPLEXITY

The model is implemented in TensorFlow 1.13 (Abadi et al., 2016) and optimized by Adam (Kingma & Ba, 2014). For the optimizer architecture, we use the coordinate-wise LSTM from Andrychowicz et al. (2016); we also validate this design choice in Appendix F. Due to its coordinate-wise nature, our BL2O model contains only 10,282 free parameters. For all experiments, the unroll length of the LSTM is set to 20. Both training stages comprise 5,000 training epochs. The time complexity of BL2O is O(KBN_e + KN_eH²), where K is the number of sampled trajectories, B is the minibatch size, N_e is the number of objective parameters, and H is the hidden size of the LSTM (H = 20 in this study). As the batch size increases, the computational cost approaches that of traditional Bayesian neural networks trained through SGD. Due to the coordinate-wise LSTM, the space complexity (memory cost) of BL2O is only O(H²), which remains constant as the number of objective parameters varies. Both the time and space complexity of BL2O are the same as DM LSTM (Andrychowicz et al., 2016), while those of Adam are O(KBN_e) and O(N_e), respectively.

4. EXPERIMENTS

We test our BL2O model extensively on optimizing: non-convex test functions, energy functions in protein-protein interactions, loss functions in image classification, and loss functions in data privacy attack. We compare BL2O to three non-Bayesian methods, Adam, Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995) and DM LSTM (Andrychowicz et al., 2016), and a recently published Bayesian method, BAL (Cao & Shen, 2020). All algorithms are run 10,000 times with random initialization points to obtain the empirical posterior distributions. During each run, the hyperparameters of Adam and PSO are sampled from Table 4 in Appendix A. Out of the 10,000 solutions we choose the one with the lowest function value as the final solution ŵ. For optimization performance, we assess the distance between the final solution and the global optimum, ||ŵ − w*||: the lower the distance, the better the solution quality. For uncertainty quantification, we assess the upper bound r_σ and the realized confidence σ̂ at a fixed confidence level σ. The realized confidence σ̂ is defined as the fraction of the 10,000 solutions that actually fall in the bounded region. The lower r_σ is, the tighter the confidence interval; and the closer σ̂ is to σ, the more accurate the confidence estimate.

Comparison in optimizing test functions. We first test the performance on test functions in the global optimization benchmark set (Jamil & Yang, 2013). We choose three extremely rugged, non-convex functions, Rastrigin, Ackley and Griewank, in five dimensionalities: 6D, 12D, 18D, 24D and 30D. For each function, we create a diverse, broad family of similar functions f_j(w) as the meta-training set used for training DM LSTM and BL2O. The analytical forms and the meta-training sets of those functions are shown in Table 5 in Appendix B. We compare BL2O with all 4 competing methods. The optimization and UQ performances are shown in Fig. 1.
Across all three functions and five dimensionalities, BL2O achieved the best solution quality. In terms of UQ, BL2O showed the most accurate confidence estimates (σ̂ ≈ σ) at σ = 0.9 and σ = 0.8, while BAL was the second best. BL2O also showed much tighter confidence intervals r_σ than BAL. In some cases, although DM LSTM had a lower r_σ than BL2O, it had a much lower confidence level, indicating that DM LSTM's tight upper bound is miscalibrated. As a result, BL2O showed the best performance in both optimization and UQ.

Comparison in optimizing energy functions for protein docking. We then apply BL2O to a bioinformatics application: predicting the 3D structures of protein complexes (Smith & Sternberg, 2002), known as protein docking. Ab initio protein docking can be recast as optimizing a noisy and expensive energy function in a high-dimensional conformational space (Cao & Shen, 2020): x* = arg min_x f(x). While solving such optimization problems remains difficult, quantifying the uncertainty of the resulting optima (docking solutions) is even more challenging. In this section, we apply our BL2O to optimization and uncertainty quantification in protein docking and compare with a state-of-the-art method, BAL (Cao & Shen, 2020). We describe the detailed settings of BL2O for protein docking in Appendix C. From BL2O, we obtain a posterior distribution p(w*|D) over the native structure w*, and the lowest-energy structure ŵ. In protein docking, the quality of a predicted structure is based on its distance to the native structure (the global optimum): ||ŵ − w*||. For UQ, we assess the two-sided confidence interval at σ = 0.9: P(l_0.9 ≤ ||ŵ − w*|| ≤ r_0.9) = 0.9. In Table 1, we assess ||ŵ − w*||, r_0.9 − l_0.9, and whether ||ŵ − w*|| lies within the confidence interval. For optimization, BL2O clearly outperforms BAL in two medium cases while performing slightly worse in the other cases.
Yet for UQ, BL2O shows clearly superior performance over BAL in all cases, with accurate and/or tight confidence intervals. We also

Table 1: Performances in optimization and uncertainty quantification on 5 docking cases, comparing BAL and BL2O per target (with docking difficulty) on ||ŵ − w*||, r_0.9 − l_0.9 (Å), and whether ||ŵ − w*|| ∈ [l_0.9, r_0.9].

Comparison in optimizing loss functions in image classification. We then test the performance of optimizing the loss function in image classification on the MNIST dataset. We apply a 2-layer MLP network as the classifier. The competing methods include Adam, DM LSTM and two Bayesian neural network methods: variational inference (VI) (Blundell et al., 2015) and Learnable Bernoulli Dropout (LBO) (Boluki et al., 2020). Moreover, for DM LSTM and BL2O, we apply a trick during optimizer training called curriculum learning (CL), introduced in detail in Appendix E, for training over long-horizon iterations. We call DM LSTM with CL "DM LSTM C" and BL2O with CL "BL2O C". The assessment of optimization and UQ for this machine-learning task differs from that of the optimization tasks above. In terms of optimization, we assess classification accuracy on the test set. In terms of UQ, we measure two metrics that assess the robustness and trustworthiness of the classifier: the in-domain calibration error and the out-of-domain detection rate. We first compare test-set accuracy among the methods. As shown in Table 2, Adam, DM LSTM C and BL2O C have almost the same best performance. The significant improvement from DM LSTM to DM LSTM C, and from BL2O to BL2O C, shows the big advantage of curriculum learning in learning to optimize. In conclusion, BL2O C had on-par accuracy with Adam and DM LSTM C on the MNIST dataset. However, classification models must not only be accurate but also indicate when they are likely to be incorrect.
Confidence calibration, i.e., how well the predicted probabilities estimate the true likelihood of correctness, is also important for classification models. In the ideal case, the maximum output probability (MaxConfidence) for each test sample should equal the prediction accuracy for that sample. To assess the calibration of each method, we split the test set into 20 equal-sized bins and compute the calibration error as the average discrepancy between accuracy and MaxConfidence in each bin. As seen in Table 2, among all methods compared, BL2O C and BL2O had the lowest calibration error. The plot of accuracy vs. MaxConfidence is shown in Fig. 4 in Appendix E. We also inspect the out-of-domain detection of BL2O, BL2O C and competing methods. We train all models on the data belonging to the first 5 classes of the MNIST training set (the last layer of the optimizee is modified to have 5 rather than 10 neurons) and test them on the samples from the remaining 5 classes. An ideal model would predict a uniform distribution over the 5 wrong classes. Therefore, we define the out-of-domain detection rate at threshold t, q_t, as the percentage of test samples with max class confidence below t; the larger q_t, the better the out-of-domain detection. As shown in Table 2, BL2O and BL2O C show superior performance over all competing methods. Notably, BL2O without curriculum learning had much better out-of-domain detection rates than BL2O with curriculum learning.

Comparison in optimizing loss functions for data privacy attack. We finally apply our model to an application that critically needs UQ. As many machine learning models are deployed publicly, it is important to avoid leaking private, sensitive information such as financial data, health data and so on. Data privacy attack (Nasr et al., 2018) studies this problem by playing the role of hackers and attacking machine-learning models to quantify the risk of privacy leakage.
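The two UQ metrics above, calibration error over equal-sized confidence bins and the out-of-domain detection rate q_t, can be sketched as follows (the binning details are our simplification of the protocol):

```python
import numpy as np

def calibration_error(max_conf, correct, n_bins=20):
    """Average gap between accuracy and mean MaxConfidence over
    equal-sized bins of test samples sorted by confidence."""
    order = np.argsort(max_conf)
    gaps = [abs(correct[b].mean() - max_conf[b].mean())
            for b in np.array_split(order, n_bins)]
    return float(np.mean(gaps))

def ood_detection_rate(max_conf, t=0.5):
    """q_t: fraction of out-of-domain samples whose max class
    confidence falls below threshold t (larger is better)."""
    return float(np.mean(max_conf < t))

# Toy classifier whose confidence matches its accuracy (well calibrated).
rng = np.random.default_rng(3)
conf = rng.uniform(0.2, 1.0, size=2000)
correct = (rng.uniform(size=2000) < conf).astype(float)
err = calibration_error(conf, correct)
q_half = ood_detection_rate(conf, t=0.5)
```

For the well-calibrated toy classifier, the per-bin gaps are small; a miscalibrated, overconfident classifier would show large gaps and a low q_t on out-of-domain data.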
Better attacks help models be better prepared for privacy defense. We use the model and dataset in Cao et al. (2019b), where each input has 9 features involving patient genetic information and the output p is the probability of clinical significance (having cancer or not) for a patient. We study the following model inversion attack (Fredrikson et al., 2015): given 5 features w′ ∈ [0, 1]⁵ out of the 9 and the label p of each patient, we want to recover the remaining 4 features w* ∈ [0, 1]⁴ (potentially sensitive patient information). Therefore, for each patient, the objective is arg min_{w∈[0,1]⁴} (m(w′, w) − p)², where w* is the ground truth of w and m is the trained predictive model. The closeness between the predicted and the real input features quantifies the risk of information leakage and the quality of the attack. We compare BL2O with Adam, PSO, BAL and DM LSTM on optimization and UQ on all test cases in Cao et al. (2019b). The meta-training objectives for BL2O and DM LSTM are the training set in Cao et al. (2019b). As shown in Table 3, BL2O showed the best performance in both optimization and UQ compared to all competing methods. It is noteworthy that learned optimizers (DM LSTM and BL2O) had much better optimization performance than pre-defined optimizers, and the Bayesian methods (BAL and BL2O) had significantly better UQ performance than non-Bayesian methods. BL2O possesses the advantages of both learned and Bayesian optimizers to achieve the best performance.

Table 5 (analytic forms and meta-training sets of the test functions):

Test functions:
Rastrigin: f(w) = ||w||²₂ − Σ_{i=1}^{n} 10 cos(2πw_i) + 10n
Ackley: f(w) = −20 exp(−0.2 √(0.5||w||²₂)) − exp(Σ_{i=1}^{n} cos(2πw_i)/n) + e + 20
Griewank: f(w) = 1 + (1/4000)||w||²₂ − Π_{i=1}^{n} cos(w_i)

Meta-training set:
Rastrigin: f_j(w) = ||A_j w − b_j||²₂ − 10 Σ_{i=1}^{n} c_{ji} cos(2πw_i) + 10n
Ackley: f_j(w) = −20 exp(−0.2 √(0.5||A_j w − b_j||²₂)) − exp(Σ_{i=1}^{n} c_{ji} cos(2πw_i)/n)
Griewank: f_j(w) = 1 + (1/4000)||A_j w − b_j||²₂ − Π_{i=1}^{n} (cos(w_i) + c_{ji} − 1)

C SETTINGS FOR PROTEIN DOCKING EXPERIMENTS

We calculate the energy function (objective function f(x)) in a CHARMM 19 force field as in (Moal & Bates, 2010). 25 protein-protein complexes are chosen from the protein docking benchmark set 4.0 (Hwang et al., 2010) as the training set, shown in Table 6. For each target, we choose 5 starting points (the top-5 models from ZDOCK (Pierce et al., 2014)), so in total our training set includes 125 samples. Moreover, we parameterize the search space as R^12 as in BAL (Cao & Shen, 2020); the resulting f(x) is fully differentiable in the search space. We only consider 100 interface atoms to limit the computational cost. One training epoch consists of 600 iterations and we train for 5000 epochs in total. Both BL2O and BAL use 600 iterations during the testing stage. For a fair comparison, after optimization we rescore the BL2O samples for UQ using the scoring function (random forest) from BAL.



Figure 1: The optimization performance (left) and the UQ performance (r_σ and σ) of different methods on three test functions.

We visualize the posterior distribution over ||ŵ − w*|| for protein 1JMO_4. As shown in Fig. 2, compared to BAL, BL2O's distribution covers the real ||ŵ − w*|| within the 90% C.I. and has a smaller variance. More posterior distributions are shown in Appendix D.
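Since the posterior over ||ŵ − w*|| is locally approximated as a Gaussian, checking whether the true distance falls inside the 90% C.I. reduces to a two-sided interval around the posterior mean; a minimal sketch (the function names here are illustrative, not from the paper's code):

```python
def gaussian_ci90(mu, sigma):
    """Central 90% confidence interval of a Gaussian posterior.
    z = 1.6449 is the 95th percentile of the standard normal, so
    [mu - z*sigma, mu + z*sigma] covers 90% of the probability mass."""
    z = 1.6449
    return mu - z * sigma, mu + z * sigma

def covered(true_dist, mu, sigma):
    """Does the true ||w_hat - w*|| fall inside the 90% C.I.?"""
    lo, hi = gaussian_ci90(mu, sigma)
    return lo <= true_dist <= hi
```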

Figure 2: Visualizations of estimated posterior distributions and confidence intervals.

Figure 3: Visualizations of estimated posterior distributions and confidence intervals for more docking cases.

Performance of classification on the MNIST test set.

The optimization and UQ performance of different methods on data privacy attack.

B ANALYTIC FORMS AND META-TRAINING SETS OF TEST FUNCTIONS

We list the analytic forms of the test functions and the meta-training sets used for training DM LSTM and BL2O, where n is the dimension and A_j ∈ R^{n×n}, b_j ∈ R^{n×1} and c_j ∈ R^{n×1} are parameters whose elements are sampled from i.i.d. normal distributions. Note that each test function is a special case of its corresponding meta-training set.
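The test-set functions can be sketched directly from their analytic forms. This is a sketch under two assumptions drawn from the (extraction-damaged) formulas: the Ackley exponent uses sqrt(0.5·||w||²), and the Griewank correction term is a product, as in the standard definition. All three functions attain their global minimum of 0 at w = 0.

```python
import numpy as np

def rastrigin(w):
    # f(w) = ||w||^2 - sum_i 10 cos(2*pi*w_i) + 10n
    w = np.asarray(w, dtype=float)
    n = w.size
    return float(w @ w - np.sum(10.0 * np.cos(2.0 * np.pi * w)) + 10.0 * n)

def ackley(w):
    # f(w) = -20 exp(-0.2 sqrt(0.5 ||w||^2)) - exp(sum_i cos(2*pi*w_i)/n) + e + 20
    w = np.asarray(w, dtype=float)
    n = w.size
    return float(-20.0 * np.exp(-0.2 * np.sqrt(0.5 * (w @ w)))
                 - np.exp(np.sum(np.cos(2.0 * np.pi * w)) / n)
                 + np.e + 20.0)

def griewank(w):
    # f(w) = 1 + ||w||^2 / 4000 - prod_i cos(w_i)
    w = np.asarray(w, dtype=float)
    return float(1.0 + (w @ w) / 4000.0 - np.prod(np.cos(w)))
```

The meta-training variants follow by substituting A_j w - b_j for w and weighting the cosine terms with c_j.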

4-letter ID of proteins used in the training set.

Methods   Optimizer Distribution Settings
Adam      log_10(lr) ∼ U[-2, -1], β_1 ∼ U[0.9, 1.0], β_2 ∼ U[0.999, 1.0]
PSO       w ∼ U[0.5, 1.5], C_1 ∼ U[1.5, 2.5], C_2 ∼ U[1.5, 2.5]

In order to overcome this issue, we borrow the idea of curriculum learning from (Bengio et al., 2009). Specifically, we set a list for the number of iterations, [100, 200, 500, 1000, 1500, 2000, 2500, 3000], and gradually increase the number of iterations following this list every 100 epochs during optimizer training if the optimizee loss is decreasing. Once the number reaches 3000, it does not change any more until the training ends.

F ABLATION STUDY

In order to validate various design choices, we perform the ablation study as follows:
• B1: We use the coordinate-wise gated recurrent unit (GRU) network as the optimizer architecture and train our model directly on the Bayesian loss (Eq. 9) without non-Bayesian training.
• B2: We replace the GRU network with an LSTM network.
• BL2O: We add the non-Bayesian training stage to find a local optimum of interest first before training on the Bayesian loss.
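The iteration curriculum described above can be sketched as a small scheduler: walk through the fixed schedule, advancing one step every 100 epochs only while the optimizee loss is still decreasing, and stay at 3000 once reached. The class name and `patience` parameter are illustrative, not from the paper's code.

```python
class IterationCurriculum:
    """Curriculum over the number of unrolled iterations per training epoch."""

    SCHEDULE = [100, 200, 500, 1000, 1500, 2000, 2500, 3000]

    def __init__(self, patience=100):
        self.patience = patience      # epochs to wait before advancing
        self.idx = 0                  # current position in SCHEDULE
        self.epochs_at_stage = 0

    def step(self, loss_decreasing):
        """Call once per epoch; returns the iteration count to use next."""
        self.epochs_at_stage += 1
        if (self.epochs_at_stage >= self.patience and loss_decreasing
                and self.idx < len(self.SCHEDULE) - 1):
            self.idx += 1             # advance only while the loss improves
            self.epochs_at_stage = 0
        return self.SCHEDULE[self.idx]
```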

