BAYESIAN LEARNING TO OPTIMIZE: QUANTIFYING THE OPTIMIZER UNCERTAINTY

Anonymous

Abstract

Optimizing an objective function with uncertainty awareness is well known to improve the accuracy and confidence of optimization solutions. Meanwhile, another relevant but very different question remains open: how to model and quantify the uncertainty of an optimization algorithm itself? To close this gap, the prerequisite is to consider optimizers as samples from a distribution, rather than a few pre-defined and fixed update rules. We first take the novel angle of considering the algorithmic space of optimizers, each parameterized by a neural network. We then propose a Boltzmann-shaped posterior over this optimizer space and approximate the posterior locally as Gaussian distributions through variational inference. Our novel model, Bayesian learning to optimize (BL2O), is the first study to recognize and quantify the uncertainty of the optimization algorithm. Our experiments on optimizing test functions, energy functions in protein-protein interactions, and loss functions in image classification and data privacy attacks demonstrate that, compared to state-of-the-art methods, BL2O improves optimization and uncertainty quantification (UQ) in the aforementioned problems, as well as calibration and out-of-domain detection in image classification.

1. INTRODUCTION

Computational models of many real-world applications involve optimizing non-convex objective functions. As non-convex optimization is NP-hard, no optimization algorithm (or optimizer) can guarantee reaching the global optima in general; instead, the usefulness of its solutions (sometimes measured by their proximity to the optima), when the optima are unknown, can be very uncertain. Being able to quantify such uncertainty is important not only for assessing solution uncertainty after optimization but also for enhancing search efficiency during optimization. For instance, reliable and trustworthy machine learning models demand uncertainty awareness and quantification while training (optimizing) such models, whereas in reality deep neural networks without proper modeling of uncertainty suffer from overconfidence and miscalibration (Guo et al., 2017). In another application example of protein docking, although there exist epistemic uncertainty in the objective function and aleatoric uncertainty in the protein structure data (Cao & Shen, 2020), state-of-the-art methods only predict several single solutions (Porter et al., 2019) without any associated uncertainty, which makes those predictions hard to interpret for end users.

Various optimization methods have been proposed in response to the need for uncertainty awareness. Stochastic optimization methods such as random search (Zhigljavsky, 2012), simulated annealing (Kirkpatrick et al., 1983), genetic algorithms (Goldenberg, 1989), and particle swarm optimization (Kennedy & Eberhart, 1995) inject randomness into the algorithm in order to reduce uncertainties. However, these methods do not provide uncertainty quantification (UQ) of their solutions. Recently, there has been growing interest in applying inference-based methods to optimization problems (Brochu et al., 2010; Shapiro, 2000; Pelikan et al., 1999).
Generally, these methods transfer the uncertainties within the data and the model into the final solution by modelling the posterior distribution over the global optima. Existing methods consider uncertainty either within the data or within the model (including objective functions) (Kendall & Gal, 2017; Ortega et al., 2012; Cao & Shen, 2020). However, no attention has been paid to the uncertainty arising from the optimizer itself, which is directly responsible for deriving the end solutions given the data and the model. The optimizer is usually pre-defined and fixed within the space of optimization algorithms. For instance, there are several popular update rules in Bayesian optimization, such as expected improvement (Vazquez & Bect, 2010) or the upper confidence bound (Srinivas et al., 2009), that are chosen and then fixed for the entire process. For training Bayesian neural networks, the update rule is usually chosen off-the-shelf, such as Adam, SGD, or RMSProp. The uncertainty in the optimizer is intrinsically defined over the optimizer space and matters to both the optimization and UQ solutions; yet such uncertainty is unwittingly ignored when the optimizer is treated as a single fixed sample from that space.

To fill the aforementioned gap, the core intellectual value of this work is to recognize and quantify a new form of uncertainty that lies in the optimization algorithm (optimizer), besides the classical data- or model-based uncertainties (also known as aleatoric and epistemic uncertainties, respectively). The underlying innovation is to treat an optimizer as a random sample from the algorithmic space, rather than one of a few hand-crafted update rules. The key enabling technique is to parameterize the algorithmic space by a neural network. We then leverage a Boltzmann-shaped posterior over the optimizers and approximate the posterior locally as Gaussian distributions through variational inference.
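To make the idea of optimizer uncertainty concrete, here is a minimal toy sketch (our own illustration under simplifying assumptions, not the paper's implementation): the "optimizer" is reduced to a two-parameter momentum update rule phi = (learning rate, momentum) instead of a full neural network, and the Gaussian variational posterior q(phi) uses invented mean and log-std values. Sampling optimizers from q(phi) and running each one propagates optimizer uncertainty into an empirical distribution over solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-parameter "optimizer": phi = (lr, beta), a momentum
# update rule. BL2O-style view: phi is a random vector with a Gaussian
# variational posterior q(phi) = N(mu, diag(sigma^2)), not a fixed rule.
mu = np.array([0.1, 0.9])           # variational mean (invented values)
log_sigma = np.array([-4.0, -4.0])  # variational log-std (invented values)

def sample_optimizer():
    """Draw one optimizer phi ~ q(phi) via the reparameterization trick."""
    eps = rng.standard_normal(2)
    return mu + np.exp(log_sigma) * eps

def run_optimizer(phi, grad_fn, w0, steps=200):
    """Apply the sampled update rule: v <- beta*v - lr*grad; w <- w + v."""
    lr, beta = phi
    w, v = w0.astype(float).copy(), np.zeros(w0.shape)
    for _ in range(steps):
        v = beta * v - lr * grad_fn(w)
        w = w + v
    return w

# Toy objective f(w) = ||w||^2 with gradient 2w; its unique minimum is w = 0.
grad = lambda w: 2.0 * w
w0 = np.array([3.0, -2.0])

# Each sampled optimizer yields a (slightly) different solution, so the
# spread of `solutions` quantifies the optimizer-induced uncertainty.
solutions = np.stack([run_optimizer(sample_optimizer(), grad, w0)
                      for _ in range(50)])
print("mean:", solutions.mean(axis=0), "std:", solutions.std(axis=0))
```

In the actual method, phi would be the weights of a learned update-rule network and q(phi) would be fit by variational inference; the sketch only shows how sampling in optimizer space induces a solution distribution.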
Our approach, Bayesian learning to optimize (BL2O), for the first time addresses the modeling of optimizer-based uncertainty. Extensive experiments on optimizing test functions, energy functions in a bioinformatics application, and loss functions in image classification and data privacy attacks demonstrate that, compared to state-of-the-art methods, BL2O substantially improves the performance of optimization and uncertainty quantification, as well as calibration and out-of-domain detection in classification. In the following sections, we first review related methods in detail and reveal the remaining gap. We then formally define the problem of optimization with uncertainty quantification and point out the optimizer as a source of uncertainty. After formally defining the optimizer space, the optimal optimizer as a random vector in that space, and the optimizer uncertainty, we propose our novel model, BL2O. Lastly, we compare BL2O with both Bayesian and non-Bayesian competing methods on extensive test functions and real-world applications.

2. RELATED WORK

Many works (Wang & Jegelka, 2017; Hennig & Schuler, 2012) have studied optimization with uncertainty quantification under the framework of Bayesian optimization (Shahriari et al., 2016; Brochu et al., 2010). In these studies, multiple objectives are sampled from the posterior over objectives, p(f|D), where D is the observed data. Each sampled objective is optimized to obtain samples of the global optimum w*, so that an empirical distribution over w* can be built. Approximation is much needed, since those approaches require an optimization run for every sample.

Another line of work uses various sampling schemes to estimate the density of the posterior distribution. For instance, Bijl et al. (2016) use sequential Monte Carlo to approximate the distribution over the optima, with Thompson sampling as the search strategy. Hernández-Lobato et al. (2014) use kernel approximation to model the posterior over the optimum under a Gaussian process. De Bonet et al. (1997) design a randomized optimization algorithm that directly samples global optima. These methods are much more efficient, but their performance heavily depends on the objective landscapes.

Moreover, a few studies (Ahmed et al., 2016; Lizotte, 2008; Osborne et al., 2009; Wu et al., 2017) in Bayesian optimization utilize first-order information to boost the performance of optimization. For instance, Osborne et al. (2009) use gradient information to improve the covariance matrix in the Gaussian process. Wu et al. (2017) embed derivative knowledge into the acquisition function, which is optimized in every iteration.

Finally, there are approaches (Ortega et al., 2012; Cao & Shen, 2020) that directly model the shape of the posterior as a Boltzmann distribution: p(w*|D) ∝ exp(−αf(w*)), where α is a scheduled temperature constant. They automatically adjust α during the search in order to balance the exploration-exploitation tradeoff. These approaches not only surpass previous methods in efficiency and accuracy, but also provide easy-to-interpret uncertainty quantification.

Under review as a conference paper at ICLR 2021

Despite this progress in uncertainty-aware optimization, significant open questions remain: as revealed in the Introduction, none of the methods above consider the uncertainty within the optimizer.
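The temperature-controlled Boltzmann posterior p(w*|D) ∝ exp(−αf(w*)) discussed above can be illustrated on a one-dimensional grid; the double-well objective and the α values below are our own toy choices, not taken from the cited works.

```python
import numpy as np

# Boltzmann-shaped posterior over the optimum, p(w*|D) ∝ exp(-alpha * f(w*)),
# evaluated on a 1-D grid for illustration. alpha acts as an inverse
# temperature: small alpha -> near-uniform mass (exploration), large alpha
# -> mass concentrated at the global minimum (exploitation).
def boltzmann_posterior(f_vals, alpha):
    logp = -alpha * f_vals
    logp -= logp.max()          # subtract max for numerical stability
    p = np.exp(logp)
    return p / p.sum()          # normalize over the grid

w = np.linspace(-2.0, 2.0, 401)
# Tilted double well: global minimum near w = -1, local minimum near w = +1.
f = (w**2 - 1.0)**2 + 0.3 * w

p_cold = boltzmann_posterior(f, alpha=50.0)  # exploitation-like
p_hot = boltzmann_posterior(f, alpha=0.5)    # exploration-like

# The cold posterior concentrates near the global minimum, while the hot
# posterior spreads mass across both wells.
print("cold argmax:", w[p_cold.argmax()], "hot max prob:", p_hot.max())
```

Scheduling α from small to large values reproduces the exploration-to-exploitation transition these methods exploit; BL2O instead places a posterior of this shape over the optimizer space itself.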

