ADADGS: AN ADAPTIVE BLACK-BOX OPTIMIZATION METHOD WITH A NONLOCAL DIRECTIONAL GAUSSIAN

Abstract

The local gradient points to the direction of the steepest slope in an infinitesimal neighborhood. An optimizer guided by the local gradient is often trapped in local optima when the loss landscape is multi-modal. A directional Gaussian smoothing (DGS) approach was recently proposed in (Zhang et al., 2020) and used to define a truly nonlocal gradient, referred to as the DGS gradient, for high-dimensional black-box optimization. Promising results show that replacing the traditional local gradient with the DGS gradient can significantly improve the performance of gradient-based methods in optimizing highly multi-modal loss functions. However, the optimal performance of the DGS gradient may rely on fine tuning of two important hyper-parameters, i.e., the smoothing radius and the learning rate. In this paper, we present a simple yet efficient adaptive approach for optimization with the DGS gradient, which removes the need for hyper-parameter fine tuning. Since the DGS gradient generally points to a good search direction, we perform a line search along the DGS direction to determine the step size at each iteration. The learned step size in turn informs us of the scale of the function landscape in the surrounding area, based on which we adjust the smoothing radius for the next iteration. We present experimental results on high-dimensional benchmark functions, an airfoil design problem and a game content generation problem. The AdaDGS method has shown superior performance over several state-of-the-art black-box optimization methods.

1. INTRODUCTION

We consider the problem of black-box optimization, where we search for the optima of a loss function $F: \mathbb{R}^d \to \mathbb{R}$ given access only to its function queries. This type of optimization finds applications in many machine learning areas where the loss function's gradient is inaccessible or not useful, for example, in optimizing neural network architecture (Real et al., 2017), reinforcement learning (Salimans et al., 2017), design of adversarial attacks (Chen et al., 2017), and searching the latent space of a generative model (Sinay et al., 2020). The local gradient, i.e., $\nabla F(x)$, is the most commonly used quantity to guide optimization. When $\nabla F(x)$ is inaccessible, we usually reformulate $\nabla F(x)$ as a functional of $F(x)$. One class of methods for reformulation is Gaussian smoothing (GS) (Salimans et al., 2017; Liu et al., 2017; Mania et al., 2018). GS first smooths the loss landscape with a $d$-dimensional Gaussian convolution and represents $\nabla F(x)$ by the gradient of the smoothed function. Monte Carlo (MC) sampling is used to estimate the Gaussian convolution.

It is known that the local gradient $\nabla F(x)$ points to the direction of the steepest slope in an infinitesimal neighborhood around the current state $x$. An optimizer guided by the local gradient is often trapped in local optima when the loss landscape is non-convex or multimodal. Despite the improvements (Maggiar et al., 2018; Choromanski et al., 2018; 2019; Sener & Koltun, 2020; Maheswaranathan et al., 2019; Meier et al., 2019), GS did not address the challenge of applying the local gradient to global optimization, especially in high-dimensional spaces. The nonlocal Directional Gaussian Smoothing (DGS) gradient, originally developed in (Zhang et al., 2020), shows strong potential to alleviate this challenge. The key idea of the DGS gradient is to conduct 1D nonlocal explorations along $d$ orthogonal directions in $\mathbb{R}^d$, each of which defines a nonlocal directional derivative as a 1D integral.
Then, the $d$ directional derivatives are assembled to form the DGS gradient. Compared with the traditional GS approach, the DGS gradient can use a large smoothing radius to achieve long-range exploration along the orthogonal directions. This enables the DGS gradient to provide better search directions than the local gradient, making it particularly suitable for optimizing multi-modal functions. However, the optimal performance of the DGS gradient may rely on fine tuning of two important hyper-parameters, i.e., the smoothing radius and the learning rate, which limits its applicability in practice.

In this work, we propose AdaDGS, an adaptive optimization method based on the DGS gradient. Instead of designing a schedule for updating the learning rate and the smoothing radius as in (Zhang et al., 2020), we learn their update rules automatically from a backtracking line search (Nocedal & Wright, 2006). Our algorithm is based on a simple observation: while the DGS gradient generally points to a good search direction, the best candidate solution along that direction may not lie in the immediate neighborhood. More importantly, relying on a single candidate in the search direction, chosen by a prescribed learning rate, is too susceptible to highly fluctuating landscapes. Therefore, we allow the optimizer to perform a more thorough search along the DGS gradient and let the line search determine the step size that gives the best improvement. Our experiments show that introducing the line search into the DGS setting requires only a small, worthwhile number of extra function queries per iteration. After each line search, we update the smoothing radius according to the learned step size, because this quantity now represents an estimate of the distance to an important mode of the loss function, which we retain in the smoothing process.
The performance of AdaDGS and its comparison to other methods are demonstrated through three medium- and high-dimensional test problems: a high-dimensional benchmark test suite, an airfoil design problem and a level generation problem for Super Mario Bros.

Related works. The literature on black-box optimization is extensive. We only review methods closely related to this work (see (Rios & Sahinidis, 2009; Larson et al., 2019) for overviews).

Random search. These methods randomly generate the search direction and either estimate the directional derivative using the GS formula or perform a direct search for the next candidates. Examples are two-point approaches (Flaxman et al., 2005; Nesterov & Spokoiny, 2017; Duchi et al., 2015; Bubeck & Cesa-Bianchi, 2012), three-point approaches (Bergou et al., 2019), coordinate-descent algorithms (Jamieson et al., 2012), and binary search with adaptive radius (Golovin et al., 2020).

Zeroth-order methods based on local gradient surrogate. This family mimics first-order methods but approximates the gradient via function queries (Liu et al., 2017; Chen et al., 2019; Balasubramanian & Ghadimi, 2018). An exemplary type of these methods is the particular class of Evolution Strategy (ES) based on the traditional GS, first developed by (Salimans et al., 2017). MC is overwhelmingly used for gradient approximation, and strategies for enhancing MC estimators are an active area of research, see, e.g., (Maggiar et al., 2018; Rowland et al., 2018; Maheswaranathan et al., 2019; Meier et al., 2019; Sener & Koltun, 2020). Nevertheless, these efforts focus only on the local regime, rather than the nonlocal regime considered in this work.

Orthogonal exploration. Orthogonal exploration has been investigated in black-box optimization, e.g., finite differences explore orthogonal directions.
(Choromanski et al., 2018) introduced orthogonal MC sampling into GS for approximating the local gradient; (Zhang et al., 2020) introduced orthogonal exploration and the Gauss-Hermite quadrature to define and approximate a nonlocal gradient.

Adaptive methods. Another adaptive method based on the DGS gradient can be found in (Dereventsov et al., 2020). Our work differs in that our update rule for the learning rate and smoothing radius is drawn from a line search instead of from Lipschitz constant estimation. The long-range line search can better exploit the DGS direction and thus significantly reduce the number of function evaluations and iterations. Line search is a classical method for selecting the learning rate (Nocedal & Wright, 2006) and has also been used in the adaptation of some nonlocal search techniques, see, e.g., (Hansen, 2008). In this work, we apply a backtracking line search along the DGS direction. We do not employ popular termination conditions such as the Armijo (Armijo, 1966) and Wolfe (Wolfe, 1969) conditions and always conduct the full line search, as this incurs only a small extra cost compared to searching in the high-dimensional space.

2. THE DIRECTIONAL GAUSSIAN SMOOTHING (DGS) GRADIENT

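We are concerned with solving the following optimization problem

$$\min_{x \in \mathbb{R}^d} F(x),$$

where $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ consists of $d$ parameters, and $F: \mathbb{R}^d \to \mathbb{R}$ is a $d$-dimensional loss function. The traditional GS method defines the smoothed loss function as $F_\sigma(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}[F(x + \sigma u)]$, where $\mathcal{N}(0, I_d)$ is the $d$-dimensional standard Gaussian distribution, and $\sigma > 0$ is the smoothing radius. When the local gradient $\nabla F(x)$ is unavailable, the traditional GS uses

$$\nabla F_\sigma(x) = \frac{1}{\sigma}\, \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\left[F(x + \sigma u)\, u\right]$$

(Flaxman et al., 2005) to approximate $\nabla F$ by exploiting $\lim_{\sigma \to 0} \nabla F_\sigma(x) = \nabla F(x)$ (i.e., setting $\sigma$ small). Hence, the traditional GS is unsuitable for defining a nonlocal gradient where a large smoothing radius $\sigma$ is needed.

In (Zhang et al., 2020), the DGS gradient was proposed to circumvent this hurdle. The key idea was to apply the 1D Gaussian smoothing along $d$ orthogonal directions, so that only 1D numerical integration is needed. In particular, define a 1D cross section of $F(x)$

$$G(y \mid x, \xi) = F(x + y\,\xi), \quad y \in \mathbb{R},$$

where $x$ is the current state of $F$ and $\xi$ is a unit vector in $\mathbb{R}^d$. Then, the Gaussian smoothing of $F(x)$ along $\xi$ is represented as

$$G_\sigma(y \mid x, \xi) := \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} G(y + \sigma v \mid x, \xi)\, \exp(-v^2/2)\, dv.$$

The derivative of the smoothed $F(x)$ along $\xi$ is a 1D expectation

$$\mathcal{D}[G_\sigma(0 \mid x, \xi)] = \frac{1}{\sigma}\, \mathbb{E}_{v \sim \mathcal{N}(0,1)}\left[G(\sigma v \mid x, \xi)\, v\right],$$

where $\mathcal{D}[\cdot]$ denotes the differential operator. Intuitively, the DGS gradient is formed by assembling these directional derivatives on $d$ orthogonal directions. Let $\Xi := (\xi_1, \ldots, \xi_d)$ be an orthonormal system. The DGS gradient is defined as

$$\nabla_{\sigma,\Xi}[F](x) := \left[\mathcal{D}[G_\sigma(0 \mid x, \xi_1)], \cdots, \mathcal{D}[G_\sigma(0 \mid x, \xi_d)]\right] \Xi,$$

where $\Xi$ and $\sigma$ can be adjusted during an optimization process. Since each component of $\nabla_{\sigma,\Xi}[F](x)$ only involves a 1D integral, (Zhang et al., 2020) proposed to use the Gauss-Hermite (GH) quadrature rule (Abramowitz & Stegun, 1972), where each component $\mathcal{D}[G_\sigma(0 \mid x, \xi)]$ is approximated as

$$\mathcal{D}^M[G_\sigma(0 \mid x, \xi)] = \frac{1}{\sqrt{\pi}\,\sigma} \sum_{m=1}^{M} w_m\, F(x + \sqrt{2}\,\sigma v_m \xi)\, \sqrt{2}\, v_m.$$

Here $\{v_m\}_{m=1}^{M}$ are the roots of the $M$-th order Hermite polynomial and $\{w_m\}_{m=1}^{M}$ are quadrature weights, the values of which can be found in (Abramowitz & Stegun, 1972). It was theoretically proved in (Abramowitz & Stegun, 1972) that the error of the GH estimator is $\sim M!/(2^M (2M)!)$, which is much smaller than the MC error $\sim 1/\sqrt{M}$.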
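The step from the 1D expectation to the GH formula can be spelled out as a change of variables. The following is a short sketch of that step, assuming the standard GH rule $\int_{\mathbb{R}} g(t)\, e^{-t^2}\, dt \approx \sum_{m} w_m\, g(v_m)$; the substitution variable $t$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{D}[G_\sigma(0 \mid x, \xi)]
  &= \frac{1}{\sigma\sqrt{2\pi}} \int_{\mathbb{R}} G(\sigma v \mid x, \xi)\, v\, e^{-v^2/2}\, dv
  && \text{(definition of the expectation)} \\
  &= \frac{1}{\sigma\sqrt{2\pi}} \int_{\mathbb{R}} G(\sqrt{2}\,\sigma t \mid x, \xi)\, \sqrt{2}\, t\, e^{-t^2}\, \sqrt{2}\, dt
  && (v = \sqrt{2}\, t) \\
  &= \frac{1}{\sqrt{\pi}\,\sigma} \int_{\mathbb{R}} F(x + \sqrt{2}\,\sigma t\, \xi)\, \sqrt{2}\, t\, e^{-t^2}\, dt \\
  &\approx \frac{1}{\sqrt{\pi}\,\sigma} \sum_{m=1}^{M} w_m\, F(x + \sqrt{2}\,\sigma v_m \xi)\, \sqrt{2}\, v_m .
\end{aligned}
```

Because the quadrature is applied to a 1D integrand, the node count $M$ is independent of the dimension $d$, and a node at $v_m = 0$ contributes nothing to the sum.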
Applying the GH quadrature to each component of $\nabla_{\sigma,\Xi}[F](x)$, the following estimator is defined for the DGS gradient:

$$\nabla^M_{\sigma,\Xi}[F](x) = \left[\mathcal{D}^M[G_\sigma(0 \mid x, \xi_1)], \cdots, \mathcal{D}^M[G_\sigma(0 \mid x, \xi_d)]\right] \Xi.$$

Then, the DGS gradient is readily integrated into first-order schemes to replace the local gradient.
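The GH estimator of the DGS gradient can be illustrated with a minimal sketch. It assumes plain Python lists, a hard-coded 5-point Gauss-Hermite rule (the values that `numpy.polynomial.hermite.hermgauss(5)` returns), and an orthonormal system stored row by row; the names `dgs_gradient`, `GH_NODES` and `GH_WEIGHTS` are illustrative and not taken from the original implementation.

```python
import math

# 5-point Gauss-Hermite rule for the weight exp(-v^2). The centre node v = 0
# is left out because its term is multiplied by v_m = 0 in the estimator,
# so only 4 function evaluations per direction are needed.
GH_NODES = [-2.0201828704560856, -0.9585724646138185,
            0.9585724646138185, 2.0201828704560856]
GH_WEIGHTS = [0.01995324205904591, 0.3936193231522412,
              0.3936193231522412, 0.01995324205904591]


def dgs_gradient(F, x, sigma, directions):
    """Estimate the DGS gradient of F at x with smoothing radius sigma.

    directions is a list of d orthonormal vectors xi_1, ..., xi_d
    (the rows of Xi), each given as a list of d floats.
    """
    d = len(x)
    grad = [0.0] * d
    for xi in directions:
        # GH approximation of the smoothed directional derivative along xi.
        deriv = 0.0
        for v, w in zip(GH_NODES, GH_WEIGHTS):
            shift = math.sqrt(2.0) * sigma * v
            point = [x[k] + shift * xi[k] for k in range(d)]
            deriv += w * F(point) * math.sqrt(2.0) * v
        deriv /= math.sqrt(math.pi) * sigma
        # Assemble [D_1, ..., D_d] Xi as the sum of D_i * xi_i.
        for k in range(d):
            grad[k] += deriv * xi[k]
    return grad
```

For a quadratic such as the Sphere function the integrand is a low-degree polynomial, so the quadrature is exact and the estimate equals $2x$ for any $\sigma$, which makes a convenient sanity check.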

3. THE ADADGS ALGORITHM

In this section, we describe an adaptive procedure that removes the need to manually design and tune the update schedules for the learning rate and the smoothing radius of the DGS-based gradient descent (Zhang et al., 2020). Our intuitions are: (i) for multimodal landscapes, choosing one candidate solution along the search direction according to a single learning rate may make insufficient progress, and (ii) the optimal step size, if known, is a good indicator of the width of the optimum that dominates the surrounding area and can be used to inform the smoothing radius update. Following this rationale, AdaDGS first uses a backtracking line search to estimate the optimal learning rate, and then uses the acquired step size to update the smoothing radius. AdaDGS is straightforward to implement, and we find that this strategy overcomes the sensitivity to hyper-parameter selection that affects the original DGS method.

As we shall see, the most important hyperparameters in AdaDGS control how aggressively we conduct the line search. Our key advantage in high-dimensional optimization is that with a modest budget for the line search (compared to that for computing the DGS gradient), we can still afford a generous number of function evaluations along the DGS direction and approximate the optimal learning rate. We suggest default values for these hyperparameters that worked well throughout our tests. However, users can adjust them for a more aggressive line search. For example, even doubling or tripling the number of points visited along the DGS direction increases the total number of function evaluations by only a small fraction (5% and 10%, respectively).

Recall the gradient descent scheme with DGS

$$x_{t+1} = x_t - \lambda_t \nabla^M_{\sigma,\Xi}[F](x_t),$$

where $x_t$ and $x_{t+1}$ are the candidate solutions at iterations $t$ and $t+1$, and $\lambda_t$ is the learning rate. The details of the AdaDGS algorithm are described below.

Learning rate update via line search.
At iteration $t$, we perform the line search along $\nabla^M_{\sigma,\Xi}[F](x_t)$ within the interval

$$\left[\, x_t - L_{\min} \frac{\nabla^M_{\sigma,\Xi}[F](x_t)}{\|\nabla^M_{\sigma,\Xi}[F](x_t)\|},\;\; x_t - L_{\max} \frac{\nabla^M_{\sigma,\Xi}[F](x_t)}{\|\nabla^M_{\sigma,\Xi}[F](x_t)\|} \,\right],$$

where $L_{\max}$ and $L_{\min}$ are the maximum and minimum exploration distances, respectively. We visit $S$ points in the interval, equally spaced on a log scale, and choose the best candidate. The corresponding contraction factor is $\rho = \min\{0.9, (L_{\min}/L_{\max})^{1/(S-1)}\}$. More rigorously, the selected learning rate is

$$\lambda_t := \frac{L_{\max}\, \rho^{J}}{\|\nabla^M_{\sigma,\Xi}[F](x_t)\|}, \quad \text{where} \quad J = \arg\min_{j \in \{0, \ldots, S-1\}} F\left(x_t - L_{\max}\, \rho^{j}\, \frac{\nabla^M_{\sigma,\Xi}[F](x_t)}{\|\nabla^M_{\sigma,\Xi}[F](x_t)\|}\right).$$

The default value of $L_{\max}$ is the length of the diagonal of the search domain. This value could be refined by running some test iterations, but our algorithm is not sensitive to such refining. The default value of $L_{\min}$ is $L_{\min} = 0.005 L_{\max}$. The default value for $S$ is $S = \max\{12, 0.05 M d\}$, where $M d$ is the number of samples required by the DGS gradient. This means that when $d$ is high, we spend roughly 5% of the function evaluation budget on the line search. Note that when $S$ is large, $\rho = 0.9$ and the actual minimum exploration distance is $L_{\max}\, 0.9^{S-1} < L_{\min}$. As long as the DGS gradient points to a good search direction, the line search along a 1D ray is much more cost-effective than searching in $d$-dimensional spaces.

Algorithm 1: The AdaDGS algorithm

Smoothing radius update. The smoothing radius $\sigma_t$ is adjusted based on the learning rate learned from the line search. The initial radius $\sigma_0$ is set to be on the same scale as the width of the search domain. At iteration $t$, we set $\sigma_t$ to be the mean of the smoothing radius and the learning rate from iteration $t-1$, i.e.,

$$\sigma_t = \frac{1}{2}\left(\sigma_{t-1} + \lambda_{t-1}\right), \qquad (4)$$

because both quantities indicate the landscape of the loss function.

The number of Gauss-Hermite points. The AdaDGS method is not sensitive to the number of GH points.
We do not observe significant benefit from using more than 5 GH quadrature points per direction. In some tests (Section 4.3), 3 GH quadrature points per direction are sufficient.

Random exploration. We incorporate the following strategies to support random exploration and help the AdaDGS algorithm escape undesirable scenarios. We use the condition $|F(x_t) - F(x_{t-1})| / |F(x_{t-1})| < \gamma$ to trigger the random exploration, where the default value for $\gamma$ is 0.001. Users can optionally trigger these strategies when the method fails to make progress, e.g., on insufficient decrease or too small a step size.

• Reset the smoothing radius. Since $\sigma$ is updated following Eq. (4), $\sigma$ becomes small together with the learning rate. Thus, we occasionally reset $\sigma$ to its initial value. We set a minimum interval of 10 iterations between two consecutive resets. In many of our tests, the function values reached by AdaDGS within the first 10 iterations (before the radius reset is triggered) are already lower than those its competitors reach at the end.
• Random generation of $\Xi$. Keeping the directional smoothing along a fixed set of coordinates $\Xi$ may eventually reduce the exploration capability. To alleviate this issue, we occasionally change the nonlocal exploration directions by randomly generating an orthogonal matrix $\Xi$. An important difference between our approach and the random perturbation strategy in (Zhang et al., 2020) is that the approach in (Zhang et al., 2020) only adds a small perturbation to the identity matrix, whereas we generate a totally random rotation matrix.
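The line search, the smoothing radius update and the stall test described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the handling of a zero gradient and of $F(x_{t-1}) = 0$ is an added guard that the text does not specify, and the gradient is passed in as a plain list.

```python
import math


def contraction_factor(L_max, L_min, S):
    """rho = min{0.9, (L_min / L_max)^(1 / (S - 1))}."""
    return min(0.9, (L_min / L_max) ** (1.0 / (S - 1)))


def line_search_step(F, x, grad, L_max, L_min, S):
    """Visit S log-spaced points along -grad and keep the best one.

    Returns (x_new, learning_rate) with learning_rate = L_max * rho^J / ||grad||.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Assumption: with no search direction, stay in place.
        return list(x), 0.0
    rho = contraction_factor(L_max, L_min, S)
    best_val, best_j, best_x = None, 0, list(x)
    for j in range(S):
        dist = L_max * rho ** j  # exploration distance of the j-th candidate
        cand = [xi - dist * gi / norm for xi, gi in zip(x, grad)]
        val = F(cand)
        if best_val is None or val < best_val:
            best_val, best_j, best_x = val, j, cand
    return best_x, L_max * rho ** best_j / norm


def update_radius(sigma_prev, lr_prev):
    """Eq. (4): mean of the previous smoothing radius and learning rate."""
    return 0.5 * (sigma_prev + lr_prev)


def stalled(F_curr, F_prev, gamma=0.001):
    """Relative-change test used to trigger random exploration."""
    if F_prev == 0.0:
        # Assumption: treat an exact zero as stalled only if nothing changed.
        return F_curr == 0.0
    return abs(F_curr - F_prev) / abs(F_prev) < gamma
```

With the default $S = \max\{12, 0.05Md\}$, the loop above costs $S$ function evaluations per iteration on top of the evaluations used for the gradient, which is the roughly 5% overhead mentioned in the text.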

4. EXPERIMENTS

We present the experimental results using three sets of problems. All experiments were implemented in Python 3.6 and conducted on a set of cloud servers with Intel Xeon E5 CPUs.

4.1. TESTS ON HIGH-DIMENSIONAL BENCHMARK FUNCTIONS

We compare the AdaDGS method with the following baselines: (a) DGS: the baseline DGS with the polynomial decay update schedule developed in (Zhang et al., 2020), (b) ES-Bpop: the standard OpenAI evolution strategy in (Salimans et al., 2017) with a big population (i.e., using the same number of samples as AdaDGS), (c) ASEBO: Adaptive ES-Active Subspaces for Blackbox Optimization (Choromanski et al., 2019) with a population of size $4 + 3\log(d)$, (d) IPop-CMA: the restart covariance matrix adaptation evolution strategy with increased population size (Auger & Hansen, 2005), (e) Nesterov: the random search method in (Nesterov & Spokoiny, 2017), (f) FD: the classical central difference scheme, and (g) TuRBO: trust region Bayesian optimization (Eriksson et al., 2019). Information on the code used for the baselines is provided in Appendix B.

We test the performance of the AdaDGS method on 12 high-dimensional benchmark functions (El-Abd, 2010; Jamil & Yang, 2013), including $F_1(x)$: Ackley, $F_2(x)$: Alpine, $F_3(x)$: Ellipsoidal, $F_4(x)$: Quintic, $F_5(x)$: Rastrigin, $F_6(x)$: Rosenbrock, $F_7(x)$: Salomon, $F_8(x)$: Schaffer's F7, $F_9(x)$: Sharp-Ridge, $F_{10}(x)$: Sphere, $F_{11}(x)$: Trigonometric, and $F_{12}(x)$: Wavy. To make the test functions more general, we applied the following linear transformation to $x$, $z = R(x + x_{\mathrm{opt}} - x_{\mathrm{loc}})$, which first moves the optimal state $x_{\mathrm{opt}}$ to a new random location $x_{\mathrm{loc}}$ and then applies a random rotation $R$ to make the function non-separable. We substitute $z$ into the standard definitions of the benchmark functions to formulate our test problems. Details about those functions are provided in Appendix A.

The hyper-parameters of the AdaDGS method are fixed for all 12 test functions. Specifically, $L_{\max}$ is the length of the diagonal of the domain, $S = 200$ ($= 0.05 M d$), $\sigma_0 \sim 5 \times$ width, and $M = 5$. Since $S$ is large, the minimum exploration distance is already small and we do not need to be concerned with $L_{\min}$. We choose the contraction factor to be 0.9.
We turned off the random perturbation by setting $\gamma = 0$. For each test function, we performed 20 trials, each of which has a random initial state, a random rotation matrix $R$ and a random location $x_{\mathrm{loc}}$.

The comparison between AdaDGS and the baselines in the 1000D case is shown in Figure 1. Additional results are shown in Appendix C, where the loss decay is plotted in log scale. AdaDGS has the best performance overall. In particular, the improvement of AdaDGS over the baseline DGS is significant, demonstrating the effectiveness of our adaptive mechanism. AdaDGS shows substantially superior performance in optimizing the highly multimodal functions $F_1$, $F_2$, $F_4$, $F_5$, $F_7$, $F_8$, $F_{11}$, which is significant in global optimization. For the ill-conditioned functions $F_3$, $F_6$ and $F_9$, AdaDGS can at least match the performance of the best baseline method, e.g., IPop-CMA. The test with the Sphere function $F_{10}$ shows that AdaDGS converges within 2 steps, confirming the quality of the DGS search direction. For $F_{12}$, all the methods fail to find the global minimum because it is highly multi-modal and there is no global structure to exploit, which makes it extremely challenging for all global optimization methods.

We also tested AdaDGS in 2000D, 4000D and 6000D to illustrate its scalability with the dimension. The hyper-parameters are set the same as in the 1000D cases. The results are shown in Figure 2. The AdaDGS method still achieves promising performance, even though the total number of function evaluations increases with the dimension.
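The construction of a test problem from a standard benchmark via $z = R(x + x_{\mathrm{opt}} - x_{\mathrm{loc}})$ can be sketched as below. The helper names (`matvec`, `make_test_function`) and the 2D rotation used in the usage lines are illustrative assumptions; the experiments use random rotations in much higher dimension.

```python
import math


def rastrigin(z):
    """Standard Rastrigin function, minimum 0 at z = (0, ..., 0)."""
    return 10.0 * len(z) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in z)


def matvec(R, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r_k * v_k for r_k, v_k in zip(row, v)) for row in R]


def make_test_function(base, R, x_opt, x_loc):
    """Return x -> base(R (x + x_opt - x_loc))."""
    def transformed(x):
        shifted = [xi + oi - li for xi, oi, li in zip(x, x_opt, x_loc)]
        return base(matvec(R, shifted))
    return transformed


# Usage: a 2D rotation by 0.7 rad and a shift of the optimum to x_loc.
theta = 0.7
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
x_loc = [1.5, -2.0]
F = make_test_function(rastrigin, R, x_opt=[0.0, 0.0], x_loc=x_loc)
```

When $x_{\mathrm{opt}}$ is the origin, as for Rastrigin here, the transformed function attains its minimum at $x_{\mathrm{loc}}$, and the rotation couples the coordinates so that the problem is no longer separable.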

4.2. TESTS ON AIRFOIL SHAPE OPTIMIZATION

We applied the AdaDGS method to design a 2D airfoil. We used a computational fluid dynamics (CFD) code, XFoil v.6.91 (Drela, 1989), and its Python interface v.1.1.1foot_0 . XFoil can conduct CFD simulations of an airfoil given a 2D contour design. The first step is to choose an appropriate parameterization for the upper and lower parts of the airfoil. In this work, we used the state-of-the-art Class/Shape function Transformation (CST) (Kulfan, 2008). Specifically, the upper/lower airfoil geometry is represented as

$$z(x) = \sqrt{x}\,(1 - x) \sum_{i=0}^{N} \left[A_i \binom{N}{i} x^i (1 - x)^{N-i}\right] + x\, \Delta z_{te},$$

where $x \in [0, 1]$ and $N$ is the polynomial order.
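A minimal sketch of evaluating the CST representation for one airfoil surface, assuming the formula above with Bernstein-type shape terms; the function name `cst_surface` and the argument layout are illustrative and are not part of XFoil or its Python interface.

```python
import math


def cst_surface(x, A, dz_te):
    """Evaluate z(x) for chordwise position x in [0, 1].

    A holds the N + 1 polynomial coefficients A_0, ..., A_N and
    dz_te is the trailing-edge offset (Delta z_te).
    """
    N = len(A) - 1
    shape = sum(
        A[i] * math.comb(N, i) * x ** i * (1.0 - x) ** (N - i)
        for i in range(N + 1)
    )
    # Class function sqrt(x) * (1 - x) gives a round nose and a sharp tail.
    return math.sqrt(x) * (1.0 - x) * shape + x * dz_te


# Usage: a degree-6 polynomial has 7 coefficients per surface.
upper = [cst_surface(k / 10.0, [0.2] * 7, 0.001) for k in range(11)]
```

The class term vanishes at both ends of the chord, so the surface starts at $z(0) = 0$ and ends at $z(1) = \Delta z_{te}$ regardless of the coefficients.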

4.3. TESTS ON GAME CONTENT GENERATION FOR SUPER MARIO BROS

We apply the AdaDGS method to generate a variety of Mario game levels with desired attributes. These levels are produced by generative adversarial networks (GAN) (Goodfellow et al., 2014), which map from latent parameters to high-quality images. To generate a game level with a desired characteristic, one needs to search in the latent space of the GAN for parameters that optimize a prescribed stylistic or performance metric. In this paper, we evaluate the performance of AdaDGS in generating game levels for two different types of objectives: (i) levels that have the maximum number of certain tiles, where we consider sky tiles (i.e., game objects that lie in the upper half of the image) (MaxSkyTiles) and enemy tiles (MaxEnemies); (ii) playable levels that require the AI agent to perform a certain action the maximum number of times, where we consider jumping (MaxJumps) and killing an enemy (MaxKills). These characteristics are often considered for evaluating latent space search and optimization methods (Volz et al., 2018; 2019; Fontaine et al., 2020). Specifically for type (ii) objectives, we use the AI agent developed by Robin Baumgartenfoot_1 to evaluate the playability of the level and the objective functions. We set the unplayable penalty to 100 and add it to the objective function when the generated level is unplayable. The game levels are generated from a pre-trained DCGAN by (Fontaine et al., 2020), whose inputs are vectors in $[-1, 1]^{32}$. Details of the architecture can also be found in (Volz et al., 2018).

The hyper-parameters of the AdaDGS method are set at default values for the four tests. Specifically, $L_{\max}$ is the length of the diagonal of the domain, $L_{\min} = 0.029$ ($= 0.005 L_{\max}$), $S = 12$, $\sigma_0$ = search domain width, $M = 3$ and $\gamma = 0.001$. We start with $\Xi$ being a random orthonormal matrix generated by scipy.stats.ortho_group.rvs.
As demonstrated in (Volz et al., 2018), IPop-CMA is by far the most widely used and best-performing method for this optimization task, so we only compared the performance of our method with IPop-CMA. We used pycma v.3.0.3 with a population size of 17 and a radius of 0.5, as described in (Fontaine et al., 2020). We apply the tanh function to the latent variables before sending them to the generator model, because this model was trained on $[-1, 1]^{32}$. 50 trials with random initialization are run for each test. The comparison between AdaDGS and IPop-CMA is shown in Figure 3. AdaDGS outperforms IPop-CMA in three out of four test functions and is close in the other. We find that IPop-CMA can also find the best optima in many trials, but it gets stuck more easily at undesirable modes, e.g., local minima. Take the MaxSkyTiles case as an example: four types of patterns, shown in Figure 4, are generated by AdaDGS and IPop-CMA in maximizing MaxSkyTiles. The top-left pattern in Figure 4 is the targeted one, and the other three represent different types of local minima. The probability of generating the targeted pattern is 90% for AdaDGS, and 74% for IPop-CMA.

Figure 3: Comparison of the loss decay w.r.t. # function evaluations for four objectives. From left to right: generate a Mario level with i) maximum number of sky tiles, ii) maximum number of enemies, iii) forcing AI Mario to make the most kills, and iv) forcing AI Mario to make the most jumps.
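The two preprocessing steps used in this experiment, drawing a random orthonormal system for $\Xi$ and squashing latent variables with tanh, can be sketched with the standard library alone. Note the swap: the experiments call `scipy.stats.ortho_group.rvs`, whereas this sketch substitutes Gram-Schmidt orthogonalization of Gaussian vectors so that it has no dependencies; the function names are hypothetical.

```python
import math
import random


def random_orthonormal(d, rng):
    """Build d orthonormal vectors by Gram-Schmidt on Gaussian samples.

    This is a stand-in for scipy.stats.ortho_group.rvs, not the same routine.
    """
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the vectors already accepted.
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-10:  # skip a (very unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def squash(latent):
    """Map unconstrained search variables into [-1, 1] with tanh."""
    return [math.tanh(z) for z in latent]


# Usage: a 32-dimensional system matching the latent space of the generator.
Xi = random_orthonormal(32, random.Random(0))
```

Squashing lets the optimizer work in an unconstrained space while the generator only ever sees inputs inside the box it was trained on.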

5. CONCLUSION

We developed an adaptive optimization algorithm with the DGS gradient, which removes the need for the hyper-parameter fine tuning required by the original DGS method in (Zhang et al., 2020). Experimental results demonstrated the superior performance of the AdaDGS method compared to several state-of-the-art black-box optimization methods. On the other hand, the AdaDGS method has some drawbacks that need to be addressed. The most important one is sampling complexity. The GH quadrature requires $M \times d$ samples per iteration, which is many more than the samples required by MC estimators. AdaDGS outperforms several ES-type methods because of the good quality of the DGS gradient direction and the line search, which significantly reduces the number of iterations. However, when the computing budget is very limited (e.g., only allowing $d$ function evaluations for a $d$-dimensional problem), our method becomes inapplicable. One way to alleviate this challenge is to adopt dimensionality reduction (DR) techniques (Choromanski et al., 2019), such as active subspace and sliced linear regression, and apply AdaDGS in a subspace to reduce the sampling complexity. Incorporating DR into the AdaDGS method will be considered in our future research.

APPENDIX A DEFINITION OF BENCHMARK FUNCTIONS

Here we provide the definition of the 12 benchmark functions tested in Section 4.1. Let $\Omega$ denote the search domain. To make the test functions more general, we randomly translate the functions so that the optimal state $x_{\mathrm{opt}}$ moves to a new location $x_{\mathrm{loc}}$, and then apply a rotation. This transformation can be written as $z = R(x + x_{\mathrm{opt}} - x_{\mathrm{loc}})$, where $R$ is a rotation matrix making the functions non-separable and $x_{\mathrm{loc}}$ is a random location in $0.8\Omega$. The optimizer will need to travel farther to reach the new optimal state. We substitute $z$ into the standard definitions of the benchmark functions to formulate our test problems.

• $F_1(x)$: Ackley function
$$F_1(x) = -a \exp\left(-b \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}\right) - \exp\left(\frac{1}{d} \sum_{i=1}^{d} \cos(c\, x_i)\right) + a + \exp(1),$$
where $d$ is the dimension and $a = 20$, $b = 0.2$, $c = 2\pi$ are used in our experiments. The initial search domain is $x \in [-32.768, 32.768]^d$. The global minimum is $F_1(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. The Ackley function represents non-convex landscapes with a nearly flat outer region. The function poses a risk for optimization algorithms, particularly hill-climbing algorithms, of being trapped in one of its many local minima.

• $F_2(x)$: Alpine function
$$F_2(x) = \sum_{i=1}^{d} |x_i \sin(x_i) + 0.1\, x_i|,$$
where the initial search domain is $x \in [-10, 10]^d$. The global minimum is $F_2(x_{\mathrm{opt}}) = 0$, with $8^d$ solutions. We choose $x_{\mathrm{opt}} = (0, \ldots, 0)$. This function represents multimodal landscapes with non-unique global optima.

• $F_3(x)$: Ellipsoidal function
$$F_3(x) = \sum_{i=1}^{d} 10^{6 \frac{i-1}{d-1}}\, x_i^2,$$
where $d$ is the dimension and $x \in [-2, 2]^d$ is the input domain. The global minimum is $F_3(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This represents convex and highly ill-conditioned landscapes.

• $F_4(x)$: Quintic function
$$F_4(x) = \sum_{i=1}^{d} |x_i^5 - 3x_i^4 + 4x_i^3 + 2x_i^2 - 10x_i - 4|,$$
where $x \in [-10, 10]^d$ is the initial search domain. The global minimum is $F_4(x_{\mathrm{opt}}) = 0$, attained when each $x_i$ is either $-1$ or $2$. We choose $x_{\mathrm{opt}} = (-1, \ldots, -1)$. This function represents multimodal landscapes with global structure.

• $F_5(x)$: Rastrigin function
$$F_5(x) = 10d + \sum_{i=1}^{d} \left[x_i^2 - 10\cos(2\pi x_i)\right],$$
where $d$ is the dimension and $x \in [-5.12, 5.12]^d$ is the initial search domain. The global minimum is $F_5(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This function represents highly multimodal landscapes with global structure.

• $F_6(x)$: Rosenbrock function
$$F_6(x) = \sum_{i=1}^{d-1} \left[100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2\right],$$
where $d$ is the dimension and $x \in [-5, 10]^d$ is the initial search domain. The global minimum is $F_6(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (1, \ldots, 1)$. The function is unimodal, and the global minimum lies in a bending ridge, which needs to be followed to reach the solution.

• $F_7(x)$: Salomon function
$$F_7(x) = 1 - \cos\left(2\pi \sqrt{\sum_{i=1}^{d} x_i^2}\right) + 0.1 \sqrt{\sum_{i=1}^{d} x_i^2},$$
where $x \in [-100, 100]^d$ is the initial search domain. The global minimum is $F_7(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This function represents multimodal landscapes with global structure.

• $F_8(x)$: Schaffer function
$$F_8(x) = \left[\frac{1}{d-1} \sum_{i=1}^{d-1} \left(\sqrt{s_i} + \sqrt{s_i}\, \sin^2\left(50\, s_i^{1/5}\right)\right)\right]^2 \quad \text{with} \quad s_i = \sqrt{x_i^2 + x_{i+1}^2},$$
where $x \in [-100, 100]^d$ is the initial search domain. The global minimum is $F_8(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This function represents highly multimodal landscapes.

• $F_9(x)$: Sharp-Ridge function
$$F_9(x) = x_1^2 + 100 \sqrt{\sum_{i=2}^{d} x_i^2},$$
where $d$ is the dimension and $x \in [-10, 10]^d$ is the initial search domain. The global minimum is $F_9(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This represents convex and anisotropic landscapes. There is a sharp ridge defined along $x_2^2 + \cdots + x_d^2 = 0$ that must be followed to reach the global minimum, which creates difficulties for optimization algorithms.

• $F_{10}(x)$: Sphere function
$$F_{10}(x) = \sum_{i=1}^{d} x_i^2,$$
where $x \in [-5.12, 5.12]^d$ is the initial search domain. The global minimum is $F_{10}(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. The Sphere function represents unimodal, isotropic landscapes, and can be used to test the quality of the search direction.

• $F_{11}(x)$: Trigonometric function
$$F_{11}(x) = 1 + \sum_{i=1}^{d} \left\{8 \sin^2\left[7(x_i - 0.9)^2\right] + 6 \sin^2\left[14(x_i - 0.9)^2\right] + (x_i - 0.9)^2\right\},$$
where $x \in [-500, 500]^d$ is the initial search domain. The global minimum is $F_{11}(x_{\mathrm{opt}}) = 1$ at $x_{\mathrm{opt}} = (0.9, \ldots, 0.9)$. This function represents multimodal landscapes with global structure.

• $F_{12}(x)$: Wavy function
$$F_{12}(x) = 1 - \frac{1}{d} \sum_{i=1}^{d} \cos(k\, x_i)\, \exp\left(-\frac{x_i^2}{2}\right),$$
where $k = 10$ and $x \in [-\pi, \pi]^d$ is the initial domain. The global minimum is $F_{12}(x_{\mathrm{opt}}) = 0$ at $x_{\mathrm{opt}} = (0, \ldots, 0)$. This function represents multimodal landscapes with no global structure.
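Three of the definitions above, written out as a short sketch in their untransformed form (before applying $z = R(x + x_{\mathrm{opt}} - x_{\mathrm{loc}})$). The Python function names are illustrative; the constants follow the values stated in this appendix.

```python
import math


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """F_1: minimum 0 at (0, ..., 0)."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    mean_cos = sum(math.cos(c * v) for v in x) / d
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.exp(1.0)


def rosenbrock(x):
    """F_6: minimum 0 at (1, ..., 1)."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
        for i in range(len(x) - 1)
    )


def trigonometric(x):
    """F_11: minimum 1 at (0.9, ..., 0.9)."""
    total = 1.0
    for v in x:
        s = (v - 0.9) ** 2
        total += 8.0 * math.sin(7.0 * s) ** 2 + 6.0 * math.sin(14.0 * s) ** 2 + s
    return total
```

Evaluating each function at its stated optimum is a quick way to check an implementation before wrapping it in the shift and rotation.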

B.1 THE ES-BPOP METHOD

ES-Bpop refers to the standard OpenAI evolution strategy in (Salimans et al., 2017) with a big population, i.e., the same population size as the AdaDGS method. The purpose of using a big population is to compare the MC-based estimator for the standard GS gradient and the GH-based estimator for the DGS gradient given the same computational cost.

B.2 THE ASEBO METHOD

ASEBO refers to Adaptive ES-Active Subspaces for Blackbox Optimization proposed in (Choromanski et al., 2019) . This is the state-of-the-art method in the family of ES. It has been shown that other recent developments on ES, e.g., (Akimoto & Hansen, 2016; Loshchilov et al., 2019) , underperform ASEBO in optimizing the benchmark functions. We use the code published at https://github.com/jparkerholder/ASEBO by the authors of the ASEBO method with default hyper-parameters provided in the code.

B.3 THE IPOP-CMA METHOD

IPop-CMA refers to the restart covariance matrix adaptation evolution strategy with increased population size proposed in (Auger & Hansen, 2005). We use the code pycma v3.0.3 available at https://github.com/CMA-ES/pycma. The main subroutine we use is cma.fmin, in which the hyper-parameters are
• restarts=9: the maximum number of restarts with increasing population size;
• incpopsize=2: multiplier for increasing the population size before each restart;
• $\sigma_0$: the initial exploration radius is set to 1/4 of the search domain width.

B.4 THE NESTEROV METHOD

Nesterov refers to the random search method proposed in (Nesterov & Spokoiny, 2017). We use the stochastic oracle $x_{t+1} = x_t - \lambda_t F'(x_t, u_t)$, where $u_t$ is a randomly selected direction and $F'(x_t, u_t)$ is the directional derivative along $u_t$. According to the analysis in (Nesterov & Spokoiny, 2017), this oracle is more powerful and can be used for non-convex non-smooth functions. As suggested in (Nesterov & Spokoiny, 2017), we use the forward difference scheme to compute the directional derivative.
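One iteration of a random search step of this kind can be sketched as follows. The sketch rests on assumptions not fixed by the text: the direction is drawn from a standard Gaussian, the forward-difference step `mu` is a free parameter, and the scalar directional derivative is multiplied by the direction $u_t$ to form the update. The function name is hypothetical.

```python
import random


def random_search_step(F, x, lr, mu, rng):
    """Move along a random direction scaled by a forward-difference derivative."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
    # Forward-difference estimate of the directional derivative along u.
    deriv = (F(x_shift) - F(x)) / mu
    return [xi - lr * deriv * ui for xi, ui in zip(x, u)]


# Usage on a 5-dimensional Sphere function with a fixed seed.
sphere = lambda p: sum(v * v for v in p)
rng = random.Random(0)
x = [1.0] * 5
for _ in range(500):
    x = random_search_step(sphere, x, lr=0.02, mu=1e-6, rng=rng)
```

Each step costs two function evaluations regardless of the dimension, which is why this baseline takes many more iterations than gradient-based schemes that spend a number of evaluations proportional to $d$ per step.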

B.5 THE FD METHOD

FD refers to the classical central difference scheme for local gradient estimation. We implemented our own FD code following the standard numerical recipe.
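A minimal sketch of a central difference gradient, assuming a fixed step `h` for every coordinate; the function name and the default step are illustrative and do not describe the authors' own FD code.

```python
def central_difference_gradient(F, x, h=1e-4):
    """Approximate the local gradient with 2d function evaluations."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((F(forward) - F(backward)) / (2.0 * h))
    return grad
```

The scheme explores the same coordinate directions as the DGS gradient with $\Xi$ equal to the identity, but with an infinitesimal step instead of a large smoothing radius, so it only sees the local slope.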



Available at https://github.com/daniel-de-vries/xfoil-python.
https://www.youtube.com/watch?v=DlkMs4ZHHr8



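$\min_{x \in \mathbb{R}^d} F(x)$, where $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ consists of $d$ parameters, and $F: \mathbb{R}^d \to \mathbb{R}$ is a $d$-dimensional loss function. The traditional GS method defines the smoothed loss function as $F_\sigma(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}[F(x + \sigma u)]$, where $\mathcal{N}(0, I_d)$ is the $d$-dimensional standard Gaussian distribution, and $\sigma > 0$ is the smoothing radius. When the local gradient $\nabla F(x)$ is unavailable, the traditional GS uses $\nabla F_\sigma(x) = \frac{1}{\sigma}\, \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}[F(x + \sigma u)\, u]$ to approximate $\nabla F(x)$.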

Figure 1: Comparison of the loss decay w.r.t. # function evaluations for the 12 benchmark functions in 1000D. Each curve is the mean of 20 independent trials and the shaded areas represent [mean-3std, mean+3std]. The global minimum is $F_i(x_{\mathrm{opt}}) = 0$ except for $i = 11$, where $F_{11}(x_{\mathrm{opt}}) = 1$. AdaDGS has the best performance overall, especially for the highly multi-modal functions $F_1$, $F_2$, $F_4$, $F_5$, $F_7$, $F_8$, $F_{11}$. All the methods fail to find the global minimum of $F_{12}$, which has no global structure to exploit.

Figure 2: Tests on AdaDGS's scalability in 2000D, 4000D and 6000D. The hyper-parameters are the same as the 1000D case. The AdaDGS still achieves promising performance, even though the number of total function evaluations increases with the dimension.

The polynomial coefficients $A_i$ and the position of the airfoil tail $\Delta z_{te}$ are the parameters to be optimized. We used two different CST polynomials to parameterize the upper and lower parts of the airfoil, where the degree of each polynomial is set to 6, following the suggestion in (Ceze et al.). Then, the dimension of the optimization problem is $d = 15$. The initial search domain is set to $[-1, 1]^d$. We simulated all models with a Reynolds number of 12e6, a speed of Mach 0.4, and angles of attack from 5 to 8 degrees. The initial condition is the standard NACA 0012 airfoil. The hyper-parameters of the AdaDGS method are as follows: $L_{\max}$ is the length of the diagonal of the domain, $L_{\min} = 0.005 L_{\max}$, $S = 12$, $\sigma_0$ = search domain width, $M = 5$ and $\gamma = 0.001$. The gain function is set to Lift-Drag and the goal is to maximize the gain. The results are shown in

Figure 4: The levels generated by optimizing MaxSkyTiles objective have four patterns. From top left, clockwise: High number (>80) of sky tiles, medium number ( 40) of sky tiles, medium number of sky tiles with no ground, and low number ( 20) of sky tiles. The top-left type of patterns is the targeted pattern and the other three represent local minima. The probabilities of generating the four types of patterns are: AdaDGS: 90%, 4%, 2%, 4% and IPop-CMA: 74%, 8%, 8%, 10% (from top left, clockwise). AdaDGS shows better performance on generating the targeted pattern.



Lift-Drag. The other baselines achieved lower Drag than the AdaDGS but did not achieve very high Lift force.

B.6 THE TURBO METHOD

TuRBO refers to Trust-Region Bayesian Optimization proposed in (Eriksson et al., 2019), which is the state-of-the-art Bayesian optimization method. We used the code released by the authors at https://github.com/uber-research/TuRBO.

C ADDITIONAL RESULTS ON HIGH-DIMENSIONAL BENCHMARK FUNCTIONS

AdaDGS converges to the global minimum in six of our test functions. Other baselines fail to converge for any of the test functions with the same number of function evaluations. In Figure 5, we plot the performance of AdaDGS and other baselines in log scale to show the convergence of our method. Figure 6 compares the convergence rate of AdaDGS when changing the dimension of the test functions. With the exception of the Rastrigin function, the convergence rate does not change when we increase the dimension from 1000 to 6000.

