BLACK-BOX OPTIMIZATION REVISITED: IMPROVING ALGORITHM SELECTION WIZARDS THROUGH MASSIVE BENCHMARKING

Abstract

Existing studies in black-box optimization for machine learning suffer from low generalizability, caused by a typically selective choice of problem instances used for training and testing different optimization algorithms. Among other issues, this practice promotes overfitting and poorly performing user guidelines. To address this shortcoming, we propose a benchmark suite, OptimSuite, which covers a broad range of black-box optimization problems: academic benchmarks and real-world applications; discrete, numerical, and mixed-integer problems; small- to very large-scale problems; noisy, dynamic, and static problems; and more. We demonstrate the advantages of such a broad collection by deriving from it the Automated Black Box Optimizer (ABBO), a general-purpose algorithm selection wizard. Using three different types of algorithm selection techniques, ABBO achieves competitive performance on all benchmark suites and significantly outperforms the previous state of the art on some of them, including YABBOB and LSGO. ABBO relies on many high-quality base components, and its excellent performance is obtained without any task-specific parametrization. The benchmark collection, the ABBO wizard, its base solvers, and all experimental data are reproducible and open source in OptimSuite.

1. INTRODUCTION: STATE OF THE ART

Many real-world optimization challenges are black-box problems; i.e., instead of having an explicit problem formulation, they can only be accessed through the evaluation of solution candidates. These evaluations often require simulations or even physical experiments. Black-box optimization methods are particularly widespread in machine learning (Salimans et al., 2016; Wang et al., 2020), to the point that the field is considered a key research area of artificial intelligence. Black-box optimization algorithms are typically easy to implement and easy to adjust to different problem types. To achieve peak performance, however, proper algorithm selection and configuration are key, since black-box optimization algorithms have complementary strengths and weaknesses (Rice, 1976; Smith-Miles, 2009; Kotthoff, 2014; Bischl et al., 2016; Kerschke & Trautmann, 2018; Kerschke et al., 2018). But whereas automated algorithm selection has become standard in SAT solving (Xu et al., 2008) and AI planning (Vallati et al., 2015), manual selection and configuration of algorithms is still predominant in the broader black-box optimization context. To reduce the bias inherent to such manual choices, and to support the automation of algorithm selection and configuration, sound comparisons of the different black-box optimization approaches are needed. Existing benchmarking suites, however, are rather selective in the problems they cover. This leads to specialized algorithm frameworks whose performance suffers from poor generalizability. Addressing this flaw in black-box optimization, we present a unified benchmark collection which covers a previously unseen breadth of problem instances. We use this collection to develop a high-performing algorithm selection wizard, ABBO. ABBO uses high-level problem characteristics to select one or several algorithms, which are then run for the allocated budget of function evaluations.
Originally derived from a subset of the available benchmark collection, in particular YABBOB, the excellent performance of ABBO generalizes across almost all settings of our broad benchmark suite. Implemented as a fork of Nevergrad (Rapin & Teytaud, 2018), the benchmark collection, the ABBO wizard, the base solvers, and all performance data are open source. The algorithms are automatically rerun at certain time intervals, and all data is exported to the public dashboard (Rapin & Teytaud, 2020); a high-level overview of ABBO's selection rules is given in Algorithm 1. For ICLR reviewers, all code is available, thanks to github-anonymizer, at (Anonymous, 2020). In summary, our contributions are as follows. (1) OptimSuite Benchmark Collection: OptimSuite combines several contributions that recently led to improved reliability and generalizability of black-box optimization benchmarking, among them LSGO (Li et al., 2013), YABBOB (Hansen et al., 2009; Liu et al., 2020; Anonymous, 2020), Pyomo (Hart et al., 2017; Anonymous, 2020), MLDA (Gallagher & Saleem, 2018), MuJoCo (Todorov et al., 2012; Mania et al., 2018), and others (novelty discussed in Section 2). (2) Algorithm Selection Wizard ABBO: Our algorithm selection technique, ABBO, can be seen as an extension of the Shiwa wizard presented in (Liu et al., 2020).
It uses three types of selection techniques: passive algorithm selection (choosing an algorithm as a function of a priori available features (Baskiotis & Sebag, 2004; Liu et al., 2020)), active algorithm selection (a bet-and-run strategy which runs several algorithms for some time and stops all but the strongest (Mersmann et al., 2011; Pitzer & Affenzeller, 2012; Fischetti & Monaci, 2014; Malan & Engelbrecht, 2013; Muñoz Acosta et al., 2015; Cauwet et al., 2016; Kerschke et al., 2018)), and chaining (running several algorithms in turn, in an a priori defined order (Molina et al., 2009)). Our wizard combines, among others, algorithms suggested in (Virtanen et al., 2019; Hansen & Ostermeier, 2003; Storn & Price, 1997; Powell, 1964; 1994; Liu et al., 2020; Hellwig & Beyer, 2016; Artelys, 2015; Doerr et al., 2017; 2019; Dang & Lehre, 2016). Another core contribution of our work is a sound comparison of our wizard to Shiwa and to the long list of algorithms available in Nevergrad.
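To make the two dynamic selection techniques concrete, the sketch below illustrates the control flow of active selection (bet-and-run) and chaining. All solver names and the call convention (a solver maps an objective, a budget, and a starting point to a best point and value) are hypothetical placeholders for illustration, not Nevergrad's actual API or ABBO's implementation.

```python
import random

# Two toy base solvers, placeholders for real optimizers such as CMA-ES or Powell.
def random_search(f, budget, x0):
    best_x, best_v = x0, f(x0)
    for _ in range(budget):
        x = [xi + random.gauss(0, 1.0) for xi in best_x]
        v = f(x)
        if v < best_v:
            best_x, best_v = x, v
    return best_x, best_v

def local_steps(f, budget, x0):
    # (1+1)-style local search with a small fixed step size.
    best_x, best_v = x0, f(x0)
    for _ in range(budget):
        x = [xi + random.gauss(0, 0.1) for xi in best_x]
        v = f(x)
        if v < best_v:
            best_x, best_v = x, v
    return best_x, best_v

def bet_and_run(f, budget, x0, solvers, probe_frac=0.1):
    """Active selection: probe each solver on a small slice of the budget,
    then give the remaining budget to the solver with the best probe result."""
    probe = max(1, int(budget * probe_frac / len(solvers)))
    results = [(s, *s(f, probe, x0)) for s in solvers]  # (solver, best_x, best_v)
    winner, x, _ = min(results, key=lambda r: r[2])
    remaining = budget - probe * len(solvers)
    return winner(f, remaining, x)

def chain(f, budget, x0, solvers):
    """Chaining: run solvers in a fixed a priori order, each continuing
    from the previous solver's best point (equal budget split here)."""
    x = x0
    for s in solvers:
        x, v = s(f, budget // len(solvers), x)
    return x, v

sphere = lambda x: sum(xi * xi for xi in x)
xb, vb = bet_and_run(sphere, 500, [3.0, -2.0], [random_search, local_steps])
xc, vc = chain(sphere, 500, [3.0, -2.0], [random_search, local_steps])
```

Passive selection, by contrast, needs no evaluations at all: it dispatches on a priori features such as dimension, budget, and parallelism, as in Algorithm 1.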

2. SOUND BLACK-BOX OPTIMIZATION BENCHMARKING

We summarize desirable features and common shortcomings of black-box optimization benchmarks and discuss how OptimSuite addresses these.

Generalization. The most obvious issue in terms of generalization is the statistical one: we need sufficiently many experiments to conduct valid statistical tests and to evaluate the robustness of algorithms' performance. This, however, is probably not the main issue. A biased benchmark, excluding large parts of the industrial needs, leads to biased conclusions, no matter how many experiments we perform. Inspired by (Recht et al., 2018) in the case of image classification, and similar in spirit to cross-validation for supervised learning, we use a much broader collection of benchmark problems to evaluate algorithms in an unbiased manner. Another subtle issue in terms of generalization is the case of instance-based choices of (hyper-)parameters: an experimenter



Algorithm 1: High-level overview of ABBO. Selection rules are checked in this order; the first match is applied. d = dimension; budget b = number of evaluations. Details in (Anonymous, 2020).

Numerical decision variables only, high degree of parallelism:
- Parallelism > b/2 or b < d: MetaTuneRecentering (Meunier et al., 2020)
- Parallelism > b/5, d < 5, and b < 100: DiagonalCMA-ES (Ros & Hansen, 2008)
- Parallelism > b/5, d < 5, and b < 500: chaining of DiagonalCMA-ES (100 asks), then CMA-ES + meta-model (Auger et al., 2005)
- Parallelism > b/5, other cases: NaiveTBPSA as in (Cauwet & Teytaud, 2020)

Numerical decision variables only, sequential evaluations:
- b > 6000 and d > 7: chaining of CMA-ES and Powell, half budget each
- b < 30d and d > 30: (1+1)-Evolution Strategy with 1/5-th rule (Rechenberg, 1973)
- d < 5 and b < 30d: CMA-ES + meta-model (Auger et al., 2005)
- b < 30d: Cobyla (Powell, 1994)

For all other cases and all details, please refer to the source code.
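This kind of passive selection is a first-match-wins dispatch on a priori features. The function below paraphrases the rules above as an illustrative sketch, not the actual ABBO implementation: solver names are returned as plain strings, and the split between the parallel and sequential branches on `parallelism > 1` is an assumption made for readability.

```python
def select_solver(d, b, parallelism, numerical_only=True):
    """First-match-wins sketch of passive selection rules.

    d = dimension, b = budget (number of evaluations),
    parallelism = number of candidate solutions evaluated concurrently.
    """
    if not numerical_only:
        return "see source code"  # non-numerical cases not covered in this sketch
    if parallelism > 1:  # high degree of parallelism (assumed threshold)
        if parallelism > b / 2 or b < d:
            return "MetaTuneRecentering"
        if parallelism > b / 5 and d < 5 and b < 100:
            return "DiagonalCMA-ES"
        if parallelism > b / 5 and d < 5 and b < 500:
            return "chain(DiagonalCMA-ES[100 asks], CMA-ES+meta-model)"
        if parallelism > b / 5:
            return "NaiveTBPSA"
    else:  # sequential evaluations
        if b > 6000 and d > 7:
            return "chain(CMA-ES, Powell)"
        if b < 30 * d and d > 30:
            return "(1+1)-ES with 1/5-th rule"
        if d < 5 and b < 30 * d:
            return "CMA-ES+meta-model"
        if b < 30 * d:
            return "Cobyla"
    return "see source code"
```

Because the rules are ordered, rearranging them changes the dispatch; e.g., a sequential run with d = 3 and b = 80 hits the meta-model rule before the final Cobyla rule.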

