DESCENDING THROUGH A CROWDED VALLEY - BENCHMARKING DEEP LEARNING OPTIMIZERS

Abstract

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing almost 35,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot identify an optimization method that clearly dominates across all tested tasks, we identify a significantly reduced subset of specific algorithms and parameter choices that generally lead to competitive results in our experiments. This subset includes popular favorites and some lesser-known contenders. We have open-sourced all our experimental results, making them directly available as challenging and well-tuned baselines.1 This allows for more meaningful comparisons when evaluating novel optimization methods without requiring any further computational effort.

1. INTRODUCTION

Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization algorithm and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, doing so is an important, everyday challenge for practitioners. Hence, stochastic optimization methods have been a focal point of research (cf. Figure 1), engendering an ever-growing list of algorithms, many of them specifically targeted towards deep learning. The hypothetical machine learning practitioner who is able to keep up with the literature now has the choice among hundreds of methods (cf. Table 2 in the appendix), each with its own set of tunable hyperparameters, when deciding how to train a model. There is limited theoretical analysis that would clearly favor one of these choices over the others. Some authors have offered empirical comparisons on comparably small sets of popular methods (e.g. Wilson et al., 2017; Choi et al., 2019; Sivaprasad et al., 2020); but for most algorithms, the only formal empirical evaluation is offered by the original work introducing the method. Many practitioners and researchers, meanwhile, rely on personal and anecdotal experience, and on informal discussions on social media or with colleagues. The result is an often unclear, perennially changing "state of the art", occasionally driven by hype.

The key obstacle for an objective benchmark is the combinatorial cost of such an endeavor: comparing a large number of methods on a large number of problems, with the high resource and time cost of tuning each method's hyperparameters and repeating each (stochastic) experiment multiple times for fidelity. Offering our best attempt to construct such a comparison, we conduct a large-scale benchmark of optimizers to further the debate about deep learning optimizers, and to help understand how the choice of optimization method and hyperparameters influences the training performance.
Specifically,
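The combinatorial cost mentioned above can be made concrete with a back-of-the-envelope count. All numbers below are illustrative assumptions for the sketch, not the paper's actual experimental protocol:

```python
# Hypothetical illustration of why a fair optimizer benchmark grows
# combinatorially: every optimizer must be tuned on every problem, and
# every (stochastic) configuration repeated across random seeds.
# All counts are assumed for illustration only.
n_optimizers = 15      # methods under comparison (assumed)
n_problems = 8         # benchmark test problems (assumed)
n_tuning_trials = 25   # hyperparameter settings tried per optimizer/problem (assumed)
n_seeds = 10           # repetitions of the tuned setting for fidelity (assumed)

# Runs spent searching hyperparameters, plus runs repeating the best setting.
tuning_runs = n_optimizers * n_problems * n_tuning_trials
repeat_runs = n_optimizers * n_problems * n_seeds
total_runs = tuning_runs + repeat_runs
print(total_runs)  # 3000 + 1200 = 4200
```

Even these modest assumed counts yield thousands of training runs; scaling any single factor (more optimizers, more tuning trials) multiplies the total, which is why benchmarks of this kind quickly reach tens of thousands of runs.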



1 https://github.com/AnonSubmitter3/Submission543

