ADADGS: AN ADAPTIVE BLACK-BOX OPTIMIZATION METHOD WITH A NONLOCAL DIRECTIONAL GAUSSIAN SMOOTHING GRADIENT

Abstract

The local gradient points to the direction of the steepest slope in an infinitesimal neighborhood. An optimizer guided by the local gradient is often trapped in local optima when the loss landscape is multi-modal. A directional Gaussian smoothing (DGS) approach was recently proposed in (Zhang et al., 2020) and used to define a truly nonlocal gradient, referred to as the DGS gradient, for high-dimensional black-box optimization. Promising results show that replacing the traditional local gradient with the DGS gradient can significantly improve the performance of gradient-based methods in optimizing highly multi-modal loss functions. However, the optimal performance of the DGS gradient may rely on fine-tuning of two important hyper-parameters, i.e., the smoothing radius and the learning rate. In this paper, we present a simple yet efficient adaptive approach for optimization with the DGS gradient, which removes the need for hyper-parameter fine-tuning. Since the DGS gradient generally points to a good search direction, we perform a line search along the DGS direction to determine the step size at each iteration. The learned step size in turn informs us of the scale of the function landscape in the surrounding area, based on which we adjust the smoothing radius for the next iteration. We present experimental results on high-dimensional benchmark functions, an airfoil design problem, and a game content generation problem. The AdaDGS method has shown superior performance over several state-of-the-art black-box optimization methods.

1. INTRODUCTION

We consider the problem of black-box optimization, where we search for the optima of a loss function F : R^d → R given access only to its function queries. This type of optimization finds applications in many machine learning areas where the loss function's gradient is inaccessible or not useful, for example, in optimizing neural network architectures (Real et al., 2017), reinforcement learning (Salimans et al., 2017), design of adversarial attacks (Chen et al., 2017), and searching the latent space of a generative model (Sinay et al., 2020). The local gradient, i.e., ∇F(x), is the most commonly used quantity to guide optimization. When ∇F(x) is inaccessible, we usually reformulate it as a functional of F(x). One class of methods for this reformulation is Gaussian smoothing (GS) (Salimans et al., 2017; Liu et al., 2017; Mania et al., 2018). GS first smooths the loss landscape with a d-dimensional Gaussian convolution and represents ∇F(x) by the gradient of the smoothed function; Monte Carlo (MC) sampling is used to estimate the Gaussian convolution. It is known that the local gradient ∇F(x) points to the direction of the steepest slope in an infinitesimal neighborhood around the current state x. An optimizer guided by the local gradient is often trapped in local optima when the loss landscape is non-convex or multi-modal. Despite various improvements (Maggiar et al., 2018; Choromanski et al., 2018; 2019; Sener & Koltun, 2020; Maheswaranathan et al., 2019; Meier et al., 2019), GS does not address the challenge of applying the local gradient to global optimization, especially in high-dimensional spaces. The nonlocal directional Gaussian smoothing (DGS) gradient, originally developed in (Zhang et al., 2020), shows strong potential to alleviate this challenge. The key idea of the DGS gradient is to conduct 1D nonlocal explorations along d orthogonal directions in R^d, each of which defines a nonlocal directional derivative as a 1D integral.
Then, the d directional derivatives are assembled to form the DGS gradient. Compared with the traditional GS approach, the DGS gradient can use a large smoothing radius to achieve long-range exploration along the orthogonal directions. This enables the DGS gradient to provide better search directions than the local gradient, making it particularly suitable for optimizing multi-modal functions. However, the optimal performance of the DGS gradient may rely on fine-tuning of two important hyper-parameters, i.e., the smoothing radius and the learning rate, which limits its applicability in practice. In this work, we propose AdaDGS, an adaptive optimization method based on the DGS gradient. Instead of designing a schedule for updating the learning rate and the smoothing radius as in (Zhang et al., 2020), we learn their update rules automatically from a backtracking line search (Nocedal & Wright, 2006). Our algorithm is based on a simple observation: while the DGS gradient generally points to a good search direction, the best candidate solution along that direction may not lie in a nearby neighborhood. More importantly, relying on a single candidate in the search direction based on a prescribed learning rate is simply too susceptible to highly fluctuating landscapes. Therefore, we allow the optimizer to perform a more thorough search along the DGS gradient and let the line search determine the step size for the best possible improvement. Our experiments show that introducing the line search into the DGS setting requires a small but worthwhile extra number of function queries per iteration. After each line search, we update the smoothing radius according to the learned step size, because this quantity now represents an estimate of the distance to an important mode of the loss function, which we retain in the smoothing process.
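One iteration of the adaptive rule described above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the function `line_search_step`, the uniform step-size grid, and the choice `sigma_next = t_best` are assumptions made for illustration.

```python
import numpy as np

def line_search_step(F, x, direction, max_step, n_points=20):
    """Full line search along `direction`: evaluate F (to be minimized) at
    n_points step sizes in (0, max_step] and keep the best one. The learned
    step size is then reused as the next smoothing radius, mirroring the
    adaptive rule described in the text (a sketch, not the exact schedule)."""
    steps = np.linspace(max_step / n_points, max_step, n_points)
    vals = [F(x - t * direction) for t in steps]   # one query per candidate
    t_best = steps[int(np.argmin(vals))]           # best improvement found
    x_next = x - t_best * direction
    sigma_next = t_best  # radius tracks the scale of the surrounding landscape
    return x_next, sigma_next
```

For a quadratic bowl with the exact descent direction, the grid search lands on the minimizer and the returned radius equals the distance traveled.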
The performance of AdaDGS and comparisons with other methods are demonstrated herein on three medium- and high-dimensional test problems: a high-dimensional benchmark test suite, an airfoil design problem, and a level generation problem for Super Mario Bros.

Related works. The literature on black-box optimization is extensive; we review only methods closely related to this work (see (Rios & Sahinidis, 2009; Larson et al., 2019) for overviews).

Random search. These methods randomly generate the search direction and either estimate the directional derivative using the GS formula or perform direct search for the next candidates. Examples include two-point approaches (Flaxman et al., 2005; Nesterov & Spokoiny, 2017; Duchi et al., 2015; Bubeck & Cesa-Bianchi, 2012), three-point approaches (Bergou et al., 2019), coordinate-descent algorithms (Jamieson et al., 2012), and binary search with adaptive radius (Golovin et al., 2020).

Zeroth-order methods based on a local gradient surrogate. This family mimics first-order methods but approximates the gradient via function queries (Liu et al., 2017; Chen et al., 2019; Balasubramanian & Ghadimi, 2018). An exemplary member of this family is the class of Evolution Strategies (ES) based on the traditional GS, first developed by (Salimans et al., 2017). MC sampling is overwhelmingly used for gradient approximation, and strategies for enhancing MC estimators are an active area of research, see, e.g., (Maggiar et al., 2018; Rowland et al., 2018; Maheswaranathan et al., 2019; Meier et al., 2019; Sener & Koltun, 2020). Nevertheless, these efforts focus only on the local regime, rather than the nonlocal regime considered in this work.

Adaptive methods. Another adaptive method based on the DGS gradient can be found in (Dereventsov et al., 2020). Our work differs substantially in that our update rule for the learning rate and smoothing radius is drawn from a line search instead of from Lipschitz constant estimation.
The long-range line search can better exploit the DGS direction and thus significantly reduce the number of function evaluations and iterations. Line search is a classical method for selecting the learning rate (Nocedal & Wright, 2006) and has also been used in the adaptation of some nonlocal search techniques, see, e.g., (Hansen, 2008). In this work, we apply backtracking line search along the DGS direction. We do not employ popular termination conditions such as the Armijo (Armijo, 1966) and Wolfe (Wolfe, 1969) conditions and always conduct the full line search, as this incurs only a small extra cost compared to the high-dimensional search.
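For contrast with the nonlocal DGS approach, the ES-style GS gradient surrogate discussed above (Salimans et al., 2017) can be sketched as a plain MC estimator; the function name and parameter choices below are illustrative assumptions, not the cited authors' code.

```python
import numpy as np

def gs_gradient(F, x, sigma=0.1, n_samples=100, rng=None):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed F at x.

    Uses the identity grad F_sigma(x) = E[F(x + sigma*eps) * eps] / sigma
    with eps ~ N(0, I_d), as in ES-style zeroth-order methods. Only function
    queries of F are required, never its gradient."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    eps = rng.standard_normal((n_samples, d))           # eps_i ~ N(0, I_d)
    f_vals = np.array([F(x + sigma * e) for e in eps])  # one query per sample
    return (f_vals[:, None] * eps).mean(axis=0) / sigma
```

For a linear F the smoothed gradient equals the true gradient, so the MC estimate converges to it; the variance of this estimator in high dimensions is one motivation for the deterministic quadrature used by DGS.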



Orthogonal exploration. Orthogonal exploration has been investigated in black-box optimization; e.g., finite-difference methods explore orthogonal directions. (Choromanski et al., 2018) introduced orthogonal MC sampling into GS for approximating the local gradient; (Zhang et al., 2020) introduced orthogonal exploration and Gauss-Hermite quadrature to define and approximate a nonlocal gradient.
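A single nonlocal directional derivative of the kind used by DGS can be sketched with Gauss-Hermite quadrature as below. This follows the construction in (Zhang et al., 2020) only in spirit; the function name and defaults are illustrative, and `xi` is assumed to be a unit vector. The full DGS gradient would assemble d such derivatives along orthonormal directions.

```python
import numpy as np

def dgs_directional_derivative(F, x, xi, sigma, n_quad=7):
    """Derivative at 0 of the 1D Gaussian smoothing (radius sigma) of
    y -> F(x + y*xi), estimated by Gauss-Hermite quadrature. Large sigma
    makes the estimate genuinely nonlocal along the direction xi."""
    # Nodes t_m and weights w_m for the weight function exp(-t^2)
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    f_vals = np.array([F(x + np.sqrt(2.0) * sigma * tm * xi) for tm in t])
    # (1/(sigma*sqrt(pi))) * sum_m w_m * sqrt(2)*t_m * F(x + sqrt(2)*sigma*t_m*xi)
    return (w * np.sqrt(2.0) * t / sigma) @ f_vals / np.sqrt(np.pi)
```

Because the quadrature is exact for low-degree polynomials, the estimate recovers the smoothed derivative exactly for linear and quadratic F, in contrast to the MC noise of the standard GS estimator.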

