DESCENDING THROUGH A CROWDED VALLEY - BENCHMARKING DEEP LEARNING OPTIMIZERS

Abstract

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing almost 35,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific algorithms and parameter choices that generally lead to competitive results in our experiments. This subset includes popular favorites and some lesser-known contenders. We have open-sourced all our experimental results, making them directly available as challenging and well-tuned baselines.1 This allows for more meaningful comparisons when evaluating novel optimization methods without requiring any further computational effort.

1. INTRODUCTION

Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization algorithm and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, doing so is an important, everyday challenge to practitioners. Hence, stochastic optimization methods have been a focal point of research (cf. Figure 1), engendering an ever-growing list of algorithms, many of them specifically targeted towards deep learning. The hypothetical machine learning practitioner who is able to keep up with the literature now has the choice among hundreds of methods (cf. Table 2 in the appendix), each with its own set of tunable hyperparameters, when deciding how to train a model. There is limited theoretical analysis that would clearly favor one of these choices over the others. Some authors have offered empirical comparisons on comparably small sets of popular methods (e.g. Wilson et al., 2017; Choi et al., 2019; Sivaprasad et al., 2020); but for most algorithms, the only formal empirical evaluation is offered by the original work introducing the method. Many practitioners and researchers, meanwhile, rely on personal and anecdotal experience, and on informal discussion on social media or with colleagues. The result is an often unclear, perennially changing "state of the art" occasionally driven by hype. The key obstacle to an objective benchmark is the combinatorial cost of such an endeavor: comparing a large number of methods on a large number of problems, compounded by the high resource and time cost of tuning each method's hyperparameters and of repeating each (stochastic) experiment several times for fidelity. Offering our best attempt to construct such a comparison, we conduct a large-scale benchmark of optimizers to further the debate about deep learning optimizers, and to help understand how the choice of optimization method and hyperparameters influences the training performance.
Specifically, we examine whether recently proposed methods show improved performance compared to more established methods such as SGD or ADAM. Additionally, we are interested in assessing whether optimization methods with well-working default hyperparameters exist that are able to keep up with tuned optimization methods. To this end, we evaluate more than a dozen optimization algorithms, largely selected for their perceived popularity, on a range of representative deep learning problems (see Figure 4), drawing conclusions from tens of thousands of individual training runs.

Right up front, we want to state clearly that it is impossible to include all optimizers (cf. Table 2 in the appendix), and to satisfy any and all expectations readers may have on tuning and initialization procedures, or on the choice of benchmark problems, not least because everyone has different expectations in this regard. In our personal opinion, what is needed is an empirical comparison by a third party not involved in the original works. As a model reader of our work, we assume a careful practitioner who does not have access to near-limitless resources, nor to a broad range of personal experiences. As such, the core contributions (in order of appearance, not importance) of our work are:

A concise summary of optimization algorithms and schedules. A partly automated, mostly manual literature review provides a compact but extensive list of recent advances in stochastic optimization. We identify more than a hundred optimization algorithms (cf. Table 2 in the appendix) and more than 20 families of hyperparameter schedules (cf. Table 3 in the appendix) published at least as pre-prints.

An extensive optimizer benchmark on deep learning tasks. We conduct a large-scale optimizer benchmark, specifically focusing on optimization problems arising in deep learning.
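The two tuning protocols contrasted in this work (trying several optimizers at their literature defaults, versus tuning one fixed optimizer over a budget of trials) can be illustrated on a toy problem. The quadratic objective, the hand-rolled SGD and ADAM update rules, and the learning-rate grid below are illustrative choices only, not part of the actual benchmark:

```python
import math

def grad(w):
    # Toy 1-D objective f(w) = (w - 3)^2; its gradient is 2 * (w - 3).
    return 2.0 * (w - 3.0)

def run_sgd(lr, steps=100, w0=0.0):
    """Plain SGD on the toy objective; returns the final loss."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return (w - 3.0) ** 2

def run_adam(lr, steps=100, w0=0.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM update rule (Kingma & Ba) on the toy objective; returns the final loss."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return (w - 3.0) ** 2

# Protocol A: evaluate several optimizers at common default learning rates.
defaults = {"sgd": run_sgd(0.01), "adam": run_adam(0.001)}
best_default = min(defaults.values())

# Protocol B: tune one fixed optimizer (here SGD) over a small grid.
tuned_sgd = min(run_sgd(lr) for lr in [1e-3, 1e-2, 1e-1])
```

Even this toy setup shows the trade-off: protocol B can only be as good as the one optimizer it commits to, while protocol A hedges across update rules at the price of using untuned hyperparameters.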
We evaluate 14 optimizers on eight deep learning problems using four different schedules, tuning over dozens of hyperparameter settings. To our knowledge, this is the most comprehensive empirical evaluation of deep learning optimizers to date (cf. Section 1.1 on related work).

An analysis of thousands of optimization runs. Our empirical experiments indicate that an optimizer's performance depends strongly on the test problem (see Figure 4). But some high-level trends emerge, too: (1) Evaluating multiple optimizers with default hyperparameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (2) Using an additional untuned learning rate schedule helps on average, but its effect varies greatly depending on the optimizer and the test problem. (3) While there is no optimizer that clearly dominates across all tested workloads, some of the algorithms we tested exhibited highly variable performance, whereas others performed decently and consistently. We deliberately refrain from recommending a single one among them, because we could not identify a clear winner with statistical confidence.

An open-source baseline for future optimizer benchmarks. Our results are accessible online in an open and easily accessible form (see footnote on page 1). These results can thus be used as competitive and well-tuned baselines for future benchmarks of new algorithms, drastically reducing the computational budget required for a meaningful optimizer comparison. Our baselines can easily be expanded, and we encourage others to contribute to this collection.

The high-level result of our benchmark is, perhaps expectedly, not a clear winner. Instead, our comparison shows that, while some optimizers are frequently decent, they also generally perform similarly, switching their relative positions in the ranking, which can partially be explained by the



1 https://github.com/AnonSubmitter3/Submission543



Figure 1: Number of times arXiv titles and abstracts mention each optimizer per year. All non-selected optimizers from Table 2 in the appendix are grouped into Other.
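A partly automated mention count of this kind can be sketched with simple keyword matching over per-year title and abstract text. The record format, the example entries, and the optimizer list below are invented for illustration; a real pipeline would work on an arXiv metadata dump and would need more careful normalization:

```python
import re
from collections import Counter

# Hypothetical records: (year, title + abstract) pairs. In practice these
# would come from an arXiv metadata dump, which this sketch does not fetch.
records = [
    (2017, "We compare Adam and SGD on image classification benchmarks."),
    (2019, "AdaBound combines the benefits of Adam and SGD at the end of training."),
    (2019, "A momentum variant of SGD for training deep networks."),
]

OPTIMIZERS = ["Adam", "SGD", "RMSProp", "AdaGrad"]

def mentions_per_year(records, optimizers):
    """Count, per year, how many records mention each optimizer by name."""
    counts = {name: Counter() for name in optimizers}
    for year, text in records:
        for name in optimizers:
            # Word-boundary match, so "Adam" does not match inside "AdamW";
            # real data would still need disambiguation beyond this.
            if re.search(rf"\b{re.escape(name)}\b", text):
                counts[name][year] += 1
    return counts
```

Plotting the resulting per-year counters, with all non-selected methods merged into an "Other" bucket, yields a chart of the kind shown in Figure 1.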

