DESCENDING THROUGH A CROWDED VALLEY - BENCHMARKING DEEP LEARNING OPTIMIZERS

Abstract

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing almost 35,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific algorithms and parameter choices that generally lead to competitive results in our experiments. This subset includes popular favorites and some lesser-known contenders. We have open-sourced all our experimental results, making them directly available as challenging and well-tuned baselines. This allows for more meaningful comparisons when evaluating novel optimization methods without requiring any further computational effort.

1. INTRODUCTION

Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization algorithm and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, doing so is an important, everyday challenge for practitioners. Hence, stochastic optimization methods have been a focal point of research (cf. Figure 1), engendering an ever-growing list of algorithms, many of them specifically targeted towards deep learning. The hypothetical machine learning practitioner who is able to keep up with the literature now has the choice among hundreds of methods (cf. Table 2 in the appendix), each with its own set of tunable hyperparameters, when deciding how to train a model. There is limited theoretical analysis that would clearly favor one of these choices over the others. Some authors have offered empirical comparisons on comparably small sets of popular methods (e.g. Wilson et al., 2017; Choi et al., 2019; Sivaprasad et al., 2020); but for most algorithms, the only formal empirical evaluation is offered by the original work introducing the method. Many practitioners and researchers, meanwhile, rely on personal and anecdotal experience, and on informal discussion on social media or with colleagues. The result is an often unclear, perennially changing "state of the art," occasionally driven by hype. The key obstacle for an objective benchmark is the combinatorial cost of such an endeavor: comparing a large number of methods on a large number of problems, with the high resource and time cost of tuning each method's parameters and repeating each (stochastic) experiment several times for fidelity. Offering our best attempt to construct such a comparison, we conduct a large-scale benchmark of optimizers to further the debate about deep learning optimizers, and to help understand how the choice of optimization method and hyperparameters influences the training performance.
Specifically, we examine whether recently proposed methods show improved performance compared to more established methods such as SGD or ADAM. Additionally, we are interested in assessing whether optimization methods with well-working default hyperparameters exist that are able to keep up with tuned optimization methods. To this end, we evaluate more than a dozen optimization algorithms, largely selected for their perceived popularity, on a range of representative deep learning problems (see Figure 4), drawing conclusions from tens of thousands of individual training runs.

Figure 1: Number of times arXiv titles and abstracts mention each optimizer per year. All non-selected optimizers from Table 2 in the appendix are grouped into Other.

Right up front, we want to state clearly that it is impossible to include all optimizers (cf. Table 2 in the appendix), and to satisfy any and all expectations readers may have on tuning and initialization procedures, or the choice of benchmark problems, not least because everyone has different expectations in this regard. In our personal opinion, what is needed is an empirical comparison by a third party not involved in the original works. As a model reader of our work, we assume a careful practitioner who does not have access to near-limitless resources, nor to a broad range of personal experiences. As such, the core contributions (in order of appearance, not importance) of our work are:

A concise summary of optimization algorithms and schedules. A partly automated, mostly manual literature review provides a compact but extensive list of recent advances in stochastic optimization. We identify more than a hundred optimization algorithms (cf. Table 2 in the appendix) and more than 20 families of hyperparameter schedules (cf. Table 3 in the appendix) published at least as pre-prints.
An extensive optimizer benchmark on deep learning tasks. We conduct a large-scale optimizer benchmark, specifically focusing on optimization problems arising in deep learning. We evaluate 14 optimizers on eight deep learning problems using four different schedules, tuning over dozens of hyperparameter settings. To our knowledge, this is the most comprehensive empirical evaluation of deep learning optimizers to date (cf. Section 1.1 on related work).

An analysis of thousands of optimization runs. Our empirical experiments indicate that an optimizer's performance depends highly on the test problem (see Figure 4). But some high-level trends emerge, too: (1) Evaluating multiple optimizers with default hyperparameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (2) Using an additional untuned learning rate schedule helps on average, but its effect varies greatly depending on the optimizer and the test problem. (3) While there is no optimizer that clearly dominates across all tested workloads, some of the algorithms we tested exhibited highly variable performance, while others performed decently and consistently. We deliberately refrain from recommending a single one among them, because we could not find a clear winner with statistical confidence.

An open-source baseline for future optimizer benchmarks. Our results are accessible online in an open and easily accessible form (see footnote on Page 1). These results can thus be used as competitive and well-tuned baselines for future benchmarks of new algorithms, drastically reducing the computational budget required for a meaningful optimizer comparison. Our baselines can easily be expanded, and we encourage others to contribute to this collection.

The high-level result of our benchmark is, perhaps expectedly, not a clear winner.
Instead, our comparison shows that, while some optimizers are frequently decent, they also generally perform similarly, switching their relative positions in the ranking, which can partially be explained by the No Free Lunch Theorem (Wolpert & Macready, 1997). A key insight of our comparison is that a practitioner with a new deep learning task can expect to do about equally well by taking almost any method from our benchmark and tuning it, as they would by investing the same computational resources into running a set of optimizers with their default settings and picking the winner. Possibly the most important takeaway from our comparison is that "there are now enough optimizers." Methods research in stochastic optimization should focus on significant (conceptual, functional, performance) improvements, such as methods specifically suited for certain problem types, inner-loop parameter tuning, or structurally novel methods. We make this claim not to discourage research but, quite on the contrary, to offer a motivation for more meaningful, non-incremental research.

1.1. RELATED WORK

Following the rapid increase in publications on optimizers, benchmarking these methods for application in deep learning has only recently attracted significant interest. Schneider et al. (2019) introduced a benchmarking framework called DEEPOBS, which includes a wide range of realistic deep learning test problems together with standardized procedures for evaluating optimizers. Metz et al. (2020) presented TASKSET, another collection of optimization problems focusing on many more, but smaller, test problems. For the empirical analysis presented here, we use DEEPOBS, as it provides optimization problems closer to real-world deep learning tasks. In contrast to our evaluation of existing methods, TASKSET and its analysis focus on meta-learning new algorithms or hyperparameters. Both Choi et al. (2019) and Sivaprasad et al. (2020) analyzed specific aspects of the benchmarking process. Sivaprasad et al. (2020) used DEEPOBS to illustrate that the relative performance of an optimizer depends significantly on the hyperparameter tuning budget used. The analysis by Choi et al. (2019) supports this point, stating that "the hyperparameter search space may be the single most important factor explaining the rankings." They further stress a hierarchy among optimizers, demonstrating that, given sufficient hyperparameter tuning, more general optimizers can never be outperformed by their special cases. In their study, however, they manually chose a hyperparameter search space per optimizer and test problem, basing it on either prior published results, prior experiences, or pre-tuning trials. Here we instead aim to identify well-performing optimizers in the case of a less extensive tuning budget, and especially when there is no prior knowledge about well-working hyperparameter values for each specific test problem. We further elaborate on the influence of our chosen hyperparameter search strategy in Section 4, where we discuss the limitations of our empirical study.
Our work is also related to empirical generalization studies of adaptive methods, such as that of Wilson et al. (2017), which sparked an extensive discussion of whether adaptive methods (e.g. ADAM) tend to generalize worse than standard first-order methods (i.e. SGD).

2. BENCHMARKING PROCESS

Any benchmarking effort requires tricky decisions on the experimental setup that influence the result. Evaluating on a specific task or picking a certain tuning budget, for example, may favor or disadvantage certain algorithms (Sivaprasad et al., 2020). It is impossible to avoid these decisions or to cover all possible choices. Aiming for generality, we evaluate the performance on eight diverse real-world deep learning problems from different disciplines (Section 2.1). From a collection of more than a hundred deep learning optimizers (Table 2 in the appendix) we select 14 of the most popular and most promising choices (cf. Figure 1) for this benchmark (Section 2.2). For each test problem and optimizer we evaluate all possible combinations of three different tuning budgets (Section 2.3) and four selected learning rate schedules (Section 2.4), thus covering the following combinatorial space:

Problem {P1, P2, ..., P8} (8) × Optimizer {AMSBound, AMSGrad, ..., SGD} (14) × Tuning {one-shot, small budget, large budget} (3) × Schedule {constant, cosine decay, cosine warm restarts, trapezoidal} (4).

Combining those options results in 1,344 possible configurations and roughly 35,000 individual runs.

2.1. TEST PROBLEMS

We consider the eight optimization tasks summarized in Table 1, available as the "small" (P1-P4) and "large" (P5-P8) problem sets, respectively, together forming the default collection of DEEPOBS. A detailed description of these problems, including architectures, training parameters, etc., can be found in the work of Schneider et al. (2019). DEEPOBS' test problems provide several performance metrics, including the training and test loss and the validation accuracy. While these are all relevant, any comparative evaluation of optimizers requires picking only a few, if not just one, particular performance metric.
For our analysis (Section 3), we focus on the final test accuracy (or the final test loss, if no accuracy is defined for this problem). This metric captures, for example, the optimizer's ability to generalize and is thus highly relevant for practical use. Our publicly released results include all metrics for completeness. An example of training loss performance is shown in Figure 16 in the appendix. Accordingly, the tuning (Section 2.3) is done with respect to the validation metric. We discuss possible limitations resulting from these choices in Section 4.
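The combinatorial space described above can be enumerated directly. A minimal sketch, where the problem and optimizer names are placeholder labels rather than the actual DEEPOBS identifiers:

```python
from itertools import product

# Placeholder labels for the four benchmark dimensions (cf. Section 2).
problems = [f"P{i}" for i in range(1, 9)]                 # 8 test problems
optimizers = [f"optimizer_{i}" for i in range(1, 15)]     # 14 optimizers
budgets = ["one-shot", "small budget", "large budget"]    # 3 tuning budgets
schedules = ["constant", "cosine decay",
             "cosine warm restarts", "trapezoidal"]       # 4 schedules

configurations = list(product(problems, optimizers, budgets, schedules))
print(len(configurations))  # 1344 = 8 * 14 * 3 * 4
```

The roughly 35,000 individual training runs then arise from the tuning trials and repeated seeds behind these 1,344 configurations.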

2.2. OPTIMIZER SELECTION

In Table 2 in the appendix we collect over a hundred optimizers introduced for, suggested for, or used in deep learning. This list was manually and incrementally collected by multiple researchers trying to keep up with the field over recent years. It is thus necessarily incomplete, although it may well represent one of the most exhaustive of such collections. Even this incomplete list, though, contains too many entries for a meaningful benchmark with the degrees of freedom collected above. This is a serious problem for research: even an author of a new optimizer, let alone a practitioner, could not possibly be expected to compare their work with every possible competing method. We thus selected a subset of 14 optimizers which we consider to be currently the most popular choices in the community (see Table 4 in the appendix). These do not necessarily reflect the "best" algorithms, but are either commonly used by practitioners and researchers, or have recently received enough attention to be of interest. Our selection is focused on first-order optimization methods, both due to their prevalence for non-convex continuous optimization problems in deep learning and to simplify the comparison. Whether there is a significant difference between these optimizers, or whether they are inherently redundant, is one of the questions this work investigates. With our list, we tried to focus on optimization algorithms over techniques, although we acknowledge that the line is very blurry. Techniques such as averaging weights (e.g. Izmailov et al., 2018) or ensemble methods (e.g. Garipov et al., 2018) have been shown to be simple but effective ways of improving optimization performance. Those methods, however, can be applied to all methods in our list, similar to regularization techniques, learning rate schedules, or tuning methods, and we have therefore decided to omit them from Table 2.

2.3. TUNING

Budget. Optimization methods for deep learning regularly expose hyperparameters to the user. The user sets them either by relying on the default suggestion, by using experience from previous experiments, or by running additional tuning runs to find the best-performing setting. All optimizers in our benchmark have tunable hyperparameters, and we consider three different tuning budgets. The first budget consists of just a single run. This one-shot budget uses the default values proposed by the original authors, where available (Table 4 in the appendix lists the default parameters). If an optimizer performs well in this setting, this has great practical value, as it drastically reduces the computational resources required for training. The other budgets consist of 25 and 50 tuning runs, which we call the small and large budget settings, respectively. We only use a single seed for tuning, then repeat the best setting 10 times using different seeds. This allows us to report standard deviations in addition to means, assessing stability. Proceeding in this way has the "feature" that our tuning process can sometimes pick "lucky" seeds which do not perform as well when averaged over multiple runs. This is arguably a good reflection of reality: stable optimizers should be preferred in practice, and this is thus reflected in our benchmark. See Appendix C for further analysis. By contrast, using all 10 random seeds for tuning as well would drastically increase the cost, not only of this benchmark, rendering it practically infeasible, but also as an approach for the practical user. Appendix D explores this aspect further: if anything, re-tuning would further broaden the distribution of results.

Tuning method. We tune parameters by random search for both the small and the large budget.
Random search is a common choice in practice due to its efficiency advantage over grid search (Bergstra & Bengio, 2012) and its ease of implementation and parallelization compared to Bayesian optimization (see also Section 4). A minor complication of random search is that the sampling distribution affects the optimizer's performance. One can think of the sampling distribution as a prior over good parameter settings, and bad priors consequently ruin performance. We followed the mathematical bounds and intuition provided by the optimizers' authors for the relevant hyperparameters. The resulting sampling distributions can be found in Table 4 in the appendix. Where no prior knowledge was provided in the cited work, we chose similar distributions for similar hyperparameters across different optimizers. Even though a hyperparameter might have a similar name across different optimization algorithms (e.g. the learning rate α), its appropriate search space can differ between optimizers. Without grounded heuristics on how the hyperparameters differ between optimizers, however, the most straightforward approach for any user is to use the same search space.

What should be considered a hyperparameter? There is a fuzzy boundary between (tunable) hyperparameters and (fixed) design parameters. A recently contentious example is the ε in adaptive learning rate methods such as ADAM. It was originally introduced as a safeguard against division by zero, but has recently been re-interpreted as a problem-dependent hyperparameter (see Choi et al. (2019) for a discussion). Under this view, one can actually consider several separate optimizers called ADAM: from an easy-to-tune but potentially limited ADAM_α, tuning only the learning rate, to the tricky-to-tune but all-powerful ADAM_{α,β1,β2,ε}, which subsumes SGD as a corner case of its hyperparameter space. In our benchmark, we include ADAM_{α,β1,β2} as a popular choice.
While they share the same update rule, we consider them to be different optimizers.
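As a concrete illustration of the tuning protocol of this section, the sketch below draws learning rates from a log-uniform prior LU(10^-4, 1), tunes with a single fixed seed, and re-runs the winner with 10 fresh seeds. The function `train_and_validate` is a hypothetical stand-in for training a test problem and reporting its validation metric; the actual benchmark tunes several hyperparameters per optimizer, not just the learning rate:

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value from LU(low, high), i.e. uniformly in log space."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def tune_random_search(train_and_validate, budget, rng, n_final_seeds=10):
    """Random search with one tuning seed, then repeated final runs.

    budget: number of tuning trials (25 for the small, 50 for the large budget).
    Returns the best learning rate and the scores of the repeated runs.
    """
    best_lr, best_score = None, -math.inf
    for _ in range(budget):
        lr = sample_log_uniform(1e-4, 1.0, rng)
        score = train_and_validate(lr, seed=0)  # a single, fixed tuning seed
        if score > best_score:
            best_lr, best_score = lr, score
    # Repeat the winning setting with fresh seeds to report mean and std.
    final_scores = [train_and_validate(best_lr, seed=s)
                    for s in range(1, n_final_seeds + 1)]
    return best_lr, final_scores
```

Because only one seed is used during tuning, a "lucky" seed can select a setting whose repeated runs perform worse, which is exactly the effect analyzed in Appendix C.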

2.4. SCHEDULES

The literature on learning rate schedules is now nearly as extensive as that on optimizers (cf. Table 3 in the appendix). In theory, schedules can be applied to all hyperparameters of an optimization algorithm, but to keep our configuration space feasible, we only apply schedules to the learning rate, by far the most popular practical choice (Goodfellow et al., 2016; Zhang et al., 2020). We choose four different learning rate schedules, trying to cover all major types of schedules (see Appendix E):

• A constant learning rate schedule;
• A cosine decay (Loshchilov & Hutter, 2017) as an example of a smooth decay;
• A cosine with warm restarts schedule (Loshchilov & Hutter, 2017) as a cyclical schedule;
• A trapezoidal schedule (Xing et al., 2018) from the family of warm-up schedules (Goyal et al., 2017).
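Expressed as multiplicative factors on the base learning rate at step t out of T training steps, these four schedule families can be sketched as follows. The cosine forms follow Loshchilov & Hutter (2017) in spirit, but the number of restarts and the trapezoidal warm-up and cool-down fractions below are illustrative choices, not the benchmark's actual settings:

```python
import math

def constant(t, T):
    """Constant schedule: the base learning rate is used throughout."""
    return 1.0

def cosine_decay(t, T):
    """Smooth cosine decay from 1 at t = 0 to 0 at t = T."""
    return 0.5 * (1 + math.cos(math.pi * t / T))

def cosine_warm_restarts(t, T, n_cycles=4):
    """Cyclical schedule: the cosine decay restarts n_cycles times."""
    t_cycle = (t * n_cycles / T) % 1.0
    return 0.5 * (1 + math.cos(math.pi * t_cycle))

def trapezoidal(t, T, warmup=0.1, cooldown=0.2):
    """Linear warm-up, constant plateau, then linear cool-down."""
    if t < warmup * T:
        return t / (warmup * T)
    if t > (1 - cooldown) * T:
        return (T - t) / (cooldown * T)
    return 1.0
```

In each case, the learning rate actually applied at step t is the base learning rate multiplied by the schedule factor.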

3. RESULTS

How well do optimizers work out-of-the-box? By comparing each optimizer's one-shot results against the tuned versions of all 14 optimizers, we can construct a 14 × 14 matrix of performance gains. Figure 2 illustrates this on five test problems, showing improvements with a positive sign and a green cell. Detailed plots for all problems are in Figures 9 and 10 in the appendix. For example, the bottom-left cell of the largest matrix in Figure 2 shows that AMSBOUND (1) tuned using a small budget performs 2.5% better than SGD (14) with default parameters on this specific problem (CIFAR-10 3c3d). A green row in Figure 2 indicates that an optimizer's default setting is performing badly, since it can be beaten by any well-tuned competitor. We observe badly-performing default settings for MOMENTUM, NAG, and SGD, supporting the intuition that non-adaptive optimization methods require more tuning, but also for AMSGRAD and ADADELTA. This is just a statement about the default parameters suggested by the authors or the popular frameworks; well-working default parameters might well exist for those methods. Conversely, a white and red row signals a well-performing default setting, since even tuned optimizers cannot significantly outperform this algorithm.
ADAM, NADAM, and RADAM, as well as AMSBOUND and ADABOUND, all have white or red rows on several (but not all!) test problems, supporting the rule of thumb that adaptive methods have well-working default parameters. Conversely, green (or red) columns highlight optimizers that, when tuned, perform better (or worse) than all untuned optimization methods. We do not observe such columns consistently across tasks. This supports the conclusion that an optimizer's performance is heavily problem-dependent and that there is no single best optimizer across workloads. Figures 9 to 12 in the appendix, and our conclusions from them, suggest an interesting alternative approach for machine learning practitioners: instead of picking a single optimizer and tuning its hyperparameters, trying out multiple optimizers with default settings and picking the best one should yield competitive results with less computational and tuning effort. The similarity of these two approaches might be due to the fact that optimizers have implicit learning rate schedules, so that trying out different optimizers is similar to trying out different (well-tested) schedules (Agarwal et al., 2020).

How much do tuning and schedules help? We consider the final performance achieved under varying budgets and schedules to quantify the usefulness of tuning and of applying parameter-free schedules (Figure 3). While there is no clear trend for any individual setting (gray lines), in the median we observe that increasing the budget improves performance, albeit with diminishing returns. For example, using the large budget without any schedule leads to a median relative improvement of roughly 3.4% compared to the default parameters (without schedule). Similarly, applying a parameter-free (i.e. untuned) schedule improves median performance.
For example, the large tuning budget coupled with a trapezoidal learning rate schedule leads to a median relative improvement of roughly 5.3 % compared to the default parameters. However, while these trends hold in the median, their individual effect varies wildly among optimizers and test problems, as is apparent from the noisy structure of the individual lines shown in Figure 3 . 
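The performance-gain matrices of Figure 2 can be reconstructed from final test accuracies. A minimal sketch with made-up numbers for three optimizers; the real matrices use all 14 optimizers and the benchmark's measured accuracies:

```python
def gain_matrix(one_shot, tuned):
    """M[i][j]: tuned accuracy of optimizer j minus one-shot accuracy of
    optimizer i, in percentage points. A row of positive entries (a "green
    row" in Figure 2) means optimizer i's defaults are beaten by every tuned
    competitor; a positive column means optimizer j, once tuned, beats every
    default setting."""
    names = sorted(one_shot)
    matrix = [[tuned[j] - one_shot[i] for j in names] for i in names]
    return names, matrix

# Made-up test accuracies (percent), for illustration only.
one_shot = {"Adam": 84.7, "AMSBound": 84.0, "SGD": 82.0}
tuned = {"Adam": 84.9, "AMSBound": 84.5, "SGD": 85.1}
names, matrix = gain_matrix(one_shot, tuned)
sgd_row = matrix[names.index("SGD")]  # every entry positive: a "green row"
```

In this toy example, SGD's row is entirely positive, mirroring the observation above that untuned SGD can be beaten by any well-tuned competitor.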


Figure 3: Lines in gray (smoothed by cubic splines for visual guidance only) show the relative improvement for a certain tuning budget and schedule (compared to one-shot tuning without schedule) for all 14 optimizers on all eight test problems. The median over all lines is plotted in orange, with the shaded area indicating the range between the 25th and 75th percentile.

Which optimizers work well after tuning? Figure 4 compares the optimizers' performance across the test problems. There is no single optimizer that dominates its competitors across all tasks. Nevertheless, some optimizers generally perform well, while others vary wildly in their behavior. Further supporting the hypothesis of the previous sections, we note that taking the best out of a small set of untuned optimizers, for example ADAM and ADABOUND, frequently results in competitive overall performance, even compared to well-tuned optimizers. Combining these runs with a tuned version of ADAM (or variants thereof) generally yields competitive results in our benchmark. Nevertheless, achieving (or getting close to) the absolute best performance still requires testing multiple optimizers. Which optimizer wins in the end, though, is problem-dependent: optimizers that achieve top scores on one problem can perform rather badly on other tasks. We note in passing that the individual optimizer rankings can change when considering, e.g., a smaller budget or an additional learning rate schedule (see Figures 13 to 15 in the appendix). However, the overall trends described here are consistent.

4. LIMITATIONS

Any empirical benchmark has constraints and limitations. Here we highlight some of them and characterize the context within which our results should be considered.

Generalization of the results. By using the test problems from DEEPOBS, which span models and data sets of varying complexity, size, and domain, we aim for generality. Our results are, despite our best efforts, reflective not just of these setups, but also of the chosen training parameters, the software framework, and further unavoidable choices. The design of our comparisons aims to be close to what an informed practitioner would encounter in practice. It goes without saying that even a carefully curated range of test problems cannot cover all challenges of machine learning or even just deep learning. In particular, our conclusions may not generalize to other types of workloads such as GANs, reinforcement learning, or applications where e.g. memory usage is crucial. Similarly, our benchmark does not cover larger-scale problems such as ImageNet (Deng et al., 2009) or transformer models (Vaswani et al., 2017) for machine translation. Studying whether there are systematic differences between these types of optimization problems presents an interesting avenue for further research. We do not consider this study the definitive work on benchmarking deep learning optimizers, but rather an important step in the right direction. While our comparison covers many "dimensions" of deep learning optimization, e.g. by considering different problems, tuning budgets, and learning rate schedules, there are many more. To keep the benchmark feasible, we chose to use the fixed L2-regularization and batch size that DEEPOBS suggests for each problem. We also did not include optimization techniques such as weight averaging or ensemble methods, as they can be combined with all evaluated optimizers. Future work could study how these techniques interact with different optimization methods.
However, to keep our benchmark feasible, we have selected what we believe to be the most important aspects affecting an optimizer comparison. We hope that our study lays the groundwork so that other works can build on it and analyze these questions.

Influence of the hyperparameter search strategy. As noted by, e.g., Choi et al. (2019) and Sivaprasad et al. (2020), the hyperparameter tuning method, its budget, and its search domain can significantly affect performance. By reporting results from three different hyperparameter optimization budgets (including the tuning-free one-shot setting) we try to quantify the effect of tuning. We argue that our random search process presents a realistic setting for many, but certainly not all, deep learning practitioners. One may criticize our approach as simplistic, but note that more elaborate schemes, in particular Bayesian optimization, would multiply the number of design decisions (kernels, utility functions, priors, and scales) and thus significantly complicate the analysis. The individual hyperparameter sampling distributions significantly affect the relative rankings of the optimizers. A badly chosen search space can make tuning next to impossible. Note, though, that this problem is inherited by practitioners: it is arguably an implicit flaw of an optimizer not to come with well-identified search spaces for its hyperparameters, and this should thus be reflected in a benchmark.

5. CONCLUSION

Faced with an avalanche of research developing new stochastic optimization methods, practitioners are left with the near-impossible task of not just picking a method from this ever-growing list, but also guessing or tuning hyperparameters for it, or even continuously tuning them during optimization. Despite efforts by the community, there is currently no method that clearly dominates the competition. We have provided an extensive empirical benchmark of optimization methods for deep learning. It reveals structure in the crowded field of optimization for deep learning: first, although many methods perform competitively, a subset of methods tends to come up near the top across the spectrum of problems. Second, tuning helps about as much as trying other optimizers. Our open data set allows many more technical observations, e.g., that stability across re-runs is an often-overlooked challenge. Perhaps the most important takeaway from our study is hidden in plain sight: the field is in danger of being drowned by noise. Different optimizers exhibit a surprisingly similar performance distribution compared to a single method that is re-tuned or simply re-run with different random seeds. It is thus questionable how much insight the development of new methods yields, at least if they are conceptually and functionally close to the existing population. We hope that benchmarks like ours can help the community to rise beyond inventing yet another optimizer and to focus on key challenges, such as automatic, inner-loop tuning for truly robust and efficient optimization. We are releasing our data to allow future authors to ensure that their method contributes to such ends.

[Excerpt from Table 4: RMSPROP (Tieleman & Hinton, 2012), default α = 10^-3 tuned over LU(10^-4, 1), ε = 10^-10, 1 − ρ = 0.9 tuned over LU(10^-4, 1).]

A LIST OF OPTIMIZERS AND SCHEDULES CONSIDERED

[Excerpt from Table 4: SGD (Robbins & Monro, 1951), default α = 10^-2 tuned over LU(10^-4, 1).]

C ROBUSTNESS TO RANDOM SEEDS

Data subsampling, random weight initialization, dropout, and other aspects of deep learning introduce stochasticity into the training process. As such, judging the performance of an optimizer on a single run may be misleading due to random fluctuations. In our benchmark we use 10 different seeds of the final setting for each budget in order to judge the stability of the optimizer and the results. However, to keep the magnitude of this benchmark feasible, we only use a single seed while tuning, analogously to how a single user would proceed. This means that our tuning process can sometimes choose hyperparameter settings which might not even converge for seeds other than the one used for tuning. Figure 5 illustrates this behavior on an example problem where we used 10 seeds throughout a tuning process using grid search. The figure shows that, in the beginning, performance increases with the learning rate, followed by an area where training sometimes works but other times diverges. Picking hyperparameters from this "danger zone" can lead to unstable results. In this case, where we only consider the learning rate, it is clear that decreasing the learning rate slightly to move away from this "danger zone" would lead to a more stable but equally well-performing algorithm. In more complicated cases, however, we are unable to use a simple heuristic such as this. This might be the case, for example, when tuning multiple hyperparameters or when the effect of the hyperparameter on the performance is less straightforward. Thus, this is a problem created not by improperly using the tuning method, but by an unstable optimization method. For each learning rate, orange markers in Figure 5 show the initial seed which would be used for tuning, while blue markers illustrate nine additional seeds with otherwise unchanged settings.
The mean over all seeds is plotted as a blue line (-), showing one standard deviation as a shaded area (T). In our benchmark, we observe in total 49 divergent seeds for the small budget and 56 for the large budget, or roughly 1% of the runs in each budget. Most of them occur when using SGD (23 and 18 cases for the small and large budget respectively), MOMENTUM (13 and 17 cases for the small and large budget respectively) or NAG (7 and 12 cases for the small and large budget respectively), which might indicate that adaptive methods are less prone to this kind of behavior. For the small budget tuning, none of these cases occur when using a constant schedule (4 for the large budget), and most of them occur when using the cosine with warm restarts schedule (27 and 25 cases for the small and large budget respectively). However, as our data on diverging seeds is very limited, it is not conclusive enough to draw solid conclusions.
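The protocol described above (tuning on a single seed, then re-running the chosen setting with several fresh seeds) can be sketched as follows. Note that `train_fn` and the candidate grid are hypothetical placeholders, not part of our benchmark code; a setting picked from the "danger zone" would surface here as a large standard deviation or diverging scores on the extra seeds.

```python
import statistics

def tune_then_reevaluate(train_fn, candidates, n_seeds=10, tuning_seed=0):
    """Mimic the benchmark protocol: tune using a single seed, then re-run
    the winning setting with n_seeds fresh seeds to judge its stability.

    train_fn(hparams, seed) is a placeholder returning a final score
    (higher is better); a diverging run can return float("-inf")."""
    best = max(candidates, key=lambda h: train_fn(h, tuning_seed))
    scores = [train_fn(best, seed) for seed in range(1, n_seeds + 1)]
    return best, statistics.mean(scores), statistics.pstdev(scores)
```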

D RE-TUNING EXPERIMENTS

In order to test the stability of our benchmark, and especially of the tuning method, we selected two optimizers from our benchmark and re-tuned them a second time on all test problems. We used completely independent random seeds both for tuning and for the 10 repetitions with the final setting. Figure 6 and Figure 7 show the distribution of all 10 random seeds for both the original tuning and the re-tuning runs for RMSPROP and ADADELTA. It is evident that re-tuning shifts this distribution, since small (stochastic) changes during tuning can result in a different chosen hyperparameter setting. These differences also highlight how crucial it is to look at multiple test problems. Individually, small changes, such as re-doing the tuning with different seeds, can lead to optimization methods changing rankings. However, they tend to average out when looking at an unbiased list of multiple problems. These results also further support the statement made in Section 3 that there is no optimization method clearly dominating the competition, as small performance margins might vanish when re-tuning.

Figure 7: Mean test set performance of all 10 seeds of ADADELTA (-) on all eight optimization problems using the small budget for tuning and no learning rate schedule. The mean is shown with a thicker line. We repeated the full tuning process on all eight test problems using different random seeds, shown as dashed blue lines (--). The mean performance of all other optimizers is shown in transparent gray lines.

E LIST OF SCHEDULES SELECTED

The schedules selected for our benchmark are illustrated in Figure 8. All learning rate schedules are multiplied by the initial learning rate found via tuning or picked as the default choice. We use a cosine decay (Loshchilov & Hutter, 2017) that starts at 1 and decays to 0 over a half period of a cosine. As an example of a cyclical learning rate schedule, we test a cosine with warm restarts schedule with an initial cycle length of ∆t = 10 epochs, which increases by a factor of 2 after each cycle, without any discount factor. Depending on the number of epochs we train our model, training may stop shortly after one of those warm restarts. Since performance typically declines shortly after increasing the learning rate, we do not report the final performance for this schedule, but instead the performance achieved after the last complete period (just before the next restart). This approach is suggested by the original work of Loshchilov & Hutter (2017). However, we still use the final performance while tuning. As a representative of schedules including warm-up, we use the trapezoidal schedule from Xing et al. (2018). For our benchmark we set the warm-up and cool-down periods to 1/10 of the training time each.
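The three schedules can be written as multiplicative factors applied to the (tuned or default) initial learning rate. The following is a minimal per-epoch sketch under those definitions, not the benchmark's exact implementation:

```python
import math

def cosine_decay(epoch, total_epochs):
    """Cosine decay from 1 at the start of training to 0 at the end."""
    return 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

def cosine_warm_restarts(epoch, initial_cycle=10, factor=2):
    """Cosine annealing with warm restarts: the cycle length starts at
    initial_cycle epochs and grows by `factor` after each restart."""
    cycle, start = initial_cycle, 0
    while epoch >= start + cycle:  # find the cycle containing this epoch
        start += cycle
        cycle *= factor
    return 0.5 * (1 + math.cos(math.pi * (epoch - start) / cycle))

def trapezoidal(epoch, total_epochs, frac=0.1):
    """Trapezoidal schedule: linear warm-up over the first `frac` of
    training, constant plateau, then linear cool-down over the last `frac`."""
    warm = frac * total_epochs
    if epoch < warm:
        return epoch / warm
    if epoch > total_epochs - warm:
        return (total_epochs - epoch) / warm
    return 1.0
```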

G OPTIMIZER PERFORMANCE ACROSS TEST PROBLEMS

Similarly to Figure 4, we show the corresponding plots for the small budget with no learning rate schedule in Figure 13, and for the large budget with the cosine and trapezoidal learning rate schedules in Figures 14 and 15. Additionally, in Figure 16 we show the same setting as Figure 4, but with the training loss instead of the test loss/accuracy.

Figure 13: Mean test set performance over 10 random seeds of all tested optimizers on all eight optimization problems using the small budget for tuning and no learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown with a red error bar (I). The performance of the untuned versions of ADAM (M) and ADABOUND (L) are marked for reference. Note, the upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters.

The high-level trends mentioned in Section 3 also hold for the smaller tuning budget in Figure 13. Namely, taking the winner among several untuned algorithms (here marked for ADAM and ADABOUND) results in decent performance on most test problems with much less effort. Adding a tuned version of ADAM (or variants thereof) to this selection would result in a very competitive performance. The absolute top performance, however, is achieved by changing optimizers across different test problems. Note, although the large budget is a strict superset of the small budget, it is not guaranteed to always perform better. Our tuning procedure guarantees that the validation performance on the seed used for tuning is at least as good with the large budget as with the small budget. But due to averaging over multiple seeds and reporting test performance instead of validation performance, this hierarchy is no longer guaranteed. We discuss the possible effects of averaging over multiple seeds further in Appendix C.
The same high-level trends also emerge when considering the cosine or trapezoidal learning rate schedule in Figures 14 and 15. We can also see that the top performance generally increases when adding a schedule (cf. Figure 4 and Figure 15). Comparing Figure 4 and Figure 16, we can assess the generalization performance of the optimization methods not only to an unseen test set, but also to a different performance metric (accuracy instead of loss). Again, the overall picture of varying performance across different test problems remains consistent when considering the training loss. Similarly to the figures showing test set performance, we cannot identify a clear winner, although ADAM and its variants, such as RADAM, consistently perform near the top. Note that while Figure 16 shows the training loss, the optimizers have still been tuned to achieve the best validation performance (i.e. accuracy if available, else the loss).

Figure 16: Mean training loss performance over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and no learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown with a red error bar (I). The performance of the untuned versions of ADAM (M) and ADABOUND (L) are marked for reference. Note, the upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters (and no schedule).



https://github.com/AnonSubmitter3/Submission543 All experiments were performed using version 1.2.0-beta of DEEPOBS and TensorFlow version 1.15 (Abadi et al., 2015).




Figure 2: The absolute test set performance improvement after switching from any untuned optimizer (y-axis, one-shot) to any tuned optimizer (x-axis, small budget) as an average over 10 random seeds for the constant schedule. We discuss the unintuitive occurrence of negative diagonal entries in Appendix F. The colormap is capped at ±3 to improve presentation, although larger values occur.

Figure 4: Mean test set performance over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and no learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown with a red error bar (I; error bars for other methods omitted for legibility). The performance of untuned ADAM (M) and ADABOUND (L) are marked for reference. The upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters.

Figure 5: Performance of SGD on a simple multilayer perceptron. For each learning rate, markers in orange () show the initial seed which would be used for tuning, blue markers () illustrate nine additional seeds with otherwise unchanged settings. The mean over all seeds is plotted as a blue line (-), showing one standard deviation as a shaded area (T).

Figure 8: Illustration of the selected learning rate schedules for a training duration of 150 epochs.

Figure 15: Mean test set performance over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and the trapezoidal learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown with a red error bar (I). The performance of the untuned versions of ADAM (M) and ADABOUND (L) are marked for reference (this time with the trapezoidal learning rate schedule). Note, the upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters (and no schedule).

Summary of test problems used in our experiments. The exact model configurations can be found in the work of Schneider et al. (2019).

List of optimizers we considered for our benchmark. Note that this is far from a complete list of all existing optimization methods applicable to deep learning; it is only a subset comprising some of the most popular choices.

Overview of commonly used parameter schedules. Note that while we list the schedules' parameters, it is not clearly defined which aspects of a schedule are (tunable) parameters and which are fixed a priori. In this column, α 0 denotes the initial learning rate, α lo and α up the lower and upper bounds, ∆t indicates an epoch count at which to switch decay styles, and k denotes a decay factor.

Selected optimizers for our benchmarking process with their respective color, hyperparameters, default values, tuning distributions and scheduled hyperparameters. Here, LU(•, •) denotes the log-uniform distribution while U{•, •} denotes the discrete uniform distribution.
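As a concrete illustration, sampling a hyperparameter from the log-uniform distribution LU(a, b) amounts to drawing uniformly in log space and exponentiating. The helper below is our own sketch, not part of the benchmark code:

```python
import math
import random

def log_uniform(low, high, rng=random):
    """Sample from LU(low, high): uniform in log space, so each decade
    between low and high receives equal probability mass."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# e.g. a learning-rate candidate drawn from LU(1e-4, 1):
lr = log_uniform(1e-4, 1.0)
```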

Mean test set performance of all 10 seeds of RMSPROP (-) on all eight optimization problems using the small budget for tuning and no learning rate schedule. The mean is shown with a thicker line. We repeated the full tuning process on all eight test problems using different random seeds, shown as dashed blue lines (--). The mean performance of all other optimizers is shown in transparent gray lines.


Mean test set performance over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and the cosine learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown with a red error bar (I). The performance of the untuned versions of ADAM (M) and ADABOUND (L) are marked for reference (this time with the cosine learning rate schedule). Note, the upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters (and no schedule).

Tabular version of Figure 4. Mean test set performance and standard deviation over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and no learning rate schedule. For comprehensibility, mean and standard deviation are rounded.

F IMPROVEMENT AFTER TUNING

When looking at Figure 2, one might notice that a few diagonal entries contain negative values. Since diagonal entries reflect the intra-optimizer performance change when tuning on the respective task, this might feel quite counterintuitive at first. In theory, this can occur if the respective tuning distribution is chosen poorly, if the tuning randomness simply got "unlucky", or if we observe significantly worse results for our additional seeds (see Figure 5). If we compare Figures 9 and 10 to Figures 11 and 12, we can see that most negative diagonal entries vanish or at least diminish in magnitude. For the latter two figures we allow for more tuning runs and only consider the seed that has been used for this tuning process. The fact that the effect of negative diagonal entries is reduced indicates that they mostly result from the latter two reasons mentioned.
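The quantity shown in these heatmaps is simply the difference between each tuned optimizer's score and each untuned optimizer's score. A minimal sketch of its computation, with made-up optimizer names and scores for illustration:

```python
def improvement_matrix(one_shot, tuned):
    """Absolute performance improvement when switching from any untuned
    optimizer (rows) to any tuned optimizer (columns).

    Both arguments map optimizer name -> test performance. A negative
    diagonal entry means tuning appeared to hurt that optimizer, e.g.
    due to an unlucky tuning run or worse results on the extra seeds."""
    return {u: {t: tuned[t] - one_shot[u] for t in tuned} for u in one_shot}
```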

Figure 11: The absolute test set performance improvement after switching from any untuned optimizer (y-axis, one-shot) to any tuned optimizer (x-axis, large budget) for the constant schedule. This is structurally the same plot as Figure 9 but comparing to the large budget and only considering the seed that has been used for tuning.

Figure 12: The absolute test set performance improvement after switching from any untuned optimizer (y-axis, one-shot) to any tuned optimizer (x-axis, large budget) for the constant schedule. This is structurally the same plot as Figure 10 but comparing to the large budget and only considering the seed that has been used for tuning.

