HOW MUCH PROGRESS HAVE WE MADE IN NEURAL NETWORK TRAINING? A NEW EVALUATION PROTOCOL FOR BENCHMARKING OPTIMIZERS

Abstract

Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which make it tricky to benchmark their performance. In this work, we propose a new benchmarking protocol to evaluate both end-to-end efficiency (training a model from scratch without knowing the best hyperparameter) and data-addition training efficiency (the previously selected hyperparameters are used for periodically re-training the model with newly collected data). For end-to-end efficiency, unlike previous work that assumes random hyperparameter tuning, which over-emphasizes the tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study is conducted to show that our evaluation protocol matches human tuning behavior better than the random search. For data-addition training, we propose a new protocol for assessing the hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all the tasks.

1. INTRODUCTION

Due to the enormous data size and the non-convexity of the objective, stochastic optimization algorithms have become widely used in training deep neural networks. In addition to Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), many variants such as Adagrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) have been proposed. Unlike classical, hyperparameter-free optimizers such as gradient descent and Newton's method, stochastic optimizers often have multiple hyperparameters, including the learning rate and momentum coefficients. These hyperparameters are critical not only to training speed but also to final performance, and they are often hard to tune. It is thus non-trivial to benchmark and compare optimizers for deep neural network training, and a benchmarking mechanism that focuses only on peak performance can create a false sense of improvement when developing new optimizers, since it ignores tuning effort. In this paper, we rethink the role of hyperparameter tuning in benchmarking optimizers and develop new benchmarking protocols that better reflect performance on practical tasks. We then benchmark seven recently proposed and widely used optimizers and study their performance on a wide range of tasks. In the following, we briefly review the two existing benchmarking protocols, discuss their pros and cons, and then introduce our contributions.

Benchmarking performance under the best hyperparameters. A majority of previous benchmarks and comparisons of optimizers are based on the best hyperparameters. Wilson et al. (2017) and Shah et al. (2018) compared SGD-based methods against adaptive ones under their best hyperparameter configurations, and found that SGD can outperform adaptive methods on several datasets under careful tuning. Most benchmarking frameworks for ML training also assume knowing the best hyperparameters for each optimizer (Schneider et al., 2019; Coleman et al., 2017; Zhu et al., 2018).
Similarly, the popular MLPerf benchmark evaluates optimizers under their best hyperparameters; it showed that ImageNet and BERT can be trained in 1 minute using a combination of good optimizers, good hyperparameters, and thousands of accelerators. Although this reveals each optimizer's peak performance, benchmarking under the best hyperparameters makes comparisons between optimizers unreliable and fails to reflect their practical performance. First, the assumption of knowing the best hyperparameters is unrealistic: in practice, finding them requires considerable tuning effort, and tuning efficiency varies greatly across optimizers. It is also tricky to define the "best hyperparameters", which depend on the search range and grid. Further, since many of these optimizers are sensitive to hyperparameters, some improvements reported for new optimizers may simply come from insufficient tuning of the baselines.

Benchmarking performance with random hyperparameter search. Several papers have pointed out that hyperparameter tuning needs to be considered when evaluating optimizers (Schneider et al., 2019; Asi & Duchi, 2019), but designing a formal evaluation protocol for this is nontrivial. Only recently have two papers, Choi et al. (2019) and Sivaprasad et al. (2020), taken hyperparameter tuning time into account when comparing SGD with Adam/Adagrad. However, their comparisons among optimizers are based on random hyperparameter search. We argue that such comparisons over-emphasize the role of hyperparameter tuning, which leads to a pessimistic and impractical benchmarking of optimizers, for the following reasons. First, under random search, each bad hyperparameter configuration is run fully (e.g., for 200 epochs). In practice, a user with a limited time budget can always stop a run early for bad configurations.
For instance, if the learning rate for SGD is too large, a user can easily observe that SGD diverges within a few iterations and directly stop the run. The random-search assumption therefore over-emphasizes the role of hyperparameter tuning and does not align with a real user's practical efficiency. Second, the performance under the best hyperparameters is crucial for many applications. For example, in many real-world systems, the model is re-trained every day or every week with newly added data, so the best hyperparameters selected at the beginning can benefit all subsequent re-training runs, rather than requiring a search from scratch each time. In addition, because random search is expensive, random-search-based evaluations often focus on the low-accuracy region, while in practice we care about the time needed to reach reasonably good accuracy.

Our contributions. Given that hyperparameter tuning is either under-emphasized (assuming the best hyperparameters) or over-emphasized (assuming random search) in existing benchmarking protocols and comparisons, we develop new evaluation protocols that better reflect real use cases. Our framework includes two protocols. First, to evaluate the end-to-end efficiency of training the best model from scratch, we develop an evaluation protocol that compares the accuracy obtained under various time budgets, including the hyperparameter tuning time. Instead of random search, we adopt the Hyperband algorithm (Li et al., 2017) for hyperparameter tuning, since it stops early for bad configurations and better reflects the real running time required by a user. Second, we propose to evaluate data-addition training efficiency: re-training the model after some new training data arrive, using the best hyperparameters tuned on the previous training set.
We also conduct a human study of how machine learning researchers tune optimizer hyperparameters and how their behavior aligns with our proposed protocols. Based on the proposed evaluation protocols, we study how much progress recently proposed algorithms have made compared with SGD or Adam. Note that most recently proposed optimizers have been shown to outperform SGD and Adam under the best hyperparameters on particular tasks, but it is unclear whether the improvements remain significant once hyperparameter tuning is taken into account and across a variety of tasks. To this end, we conduct comprehensive experiments comparing state-of-the-art training algorithms, including SGD (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014), RAdam (Liu et al., 2019), Yogi (Zaheer et al., 2018), LARS (You et al., 2017), LAMB (You et al., 2019), and Lookahead (Zhang et al., 2019), on a variety of tasks including image classification, generative adversarial networks (GANs), sentence classification (BERT fine-tuning), reinforcement learning, and graph neural network training. Our main conclusions are: 1) On CIFAR-10 and CIFAR-100, all the optimizers, including SGD, are competitive. 2) Adaptive methods are generally better on more complex tasks (NLP, GCN, RL). 3) There is no clear winner among adaptive methods; although RAdam is more stable than Adam across tasks, Adam remains a very competitive baseline even compared with recently proposed methods.

2. RELATED WORK

Optimizers. The properties of deep learning make it natural to apply stochastic first-order methods, such as Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951). Issues such as zig-zag training trajectories and the use of a single uniform learning rate have been exposed, and researchers have devoted extensive attention to modifying SGD. Along this line of work, tremendous progress has been made, including SGDM (Qian, 1999), Adagrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014). These methods use momentum to stabilize and speed up training. In particular, Adam is often regarded as the default algorithm due to its broad applicability. Variants such as Amsgrad (Reddi et al., 2019), Adabound (Luo et al., 2019), Yogi (Zaheer et al., 2018), and RAdam (Liu et al., 2019) have since been proposed to address different drawbacks of Adam. Meanwhile, the demands of large-batch training have inspired the development of LARS (You et al., 2017) and LAMB (You et al., 2019). Moreover, Zhang et al. (2019) put forward a framework called Lookahead that boosts optimization performance by iteratively updating two sets of weights.

Hyperparameter tuning methods. Random search and grid search (Bergstra & Bengio, 2012) are the basic hyperparameter tuning methods in the literature. The inefficiency of these methods has stimulated more advanced search strategies. Bayesian optimization methods, including Bergstra et al. (2011) and Hutter et al. (2011), accelerate random search by fitting a black-box function from hyperparameters to the expected objective in order to adaptively guide the search. Parallel to this line of work, Hyperband (Li et al., 2017) reduces the evaluation cost of each configuration by terminating relatively poor trials early. Falkner et al. (2018) propose BOHB, which combines the benefits of both Bayesian optimization and Hyperband.
All these methods still require substantial computational resources. A recent work (Metz et al., 2020) instead tries to obtain a list of potential hyperparameters by meta-learning from thousands of representative tasks. We strike a balance between effectiveness and computational cost by leveraging Hyperband in our evaluation protocol to compare a wide range of optimizers.

3. PROPOSED EVALUATION PROTOCOLS

In this section, we introduce the proposed evaluation framework for optimizers. We consider two evaluation protocols, each corresponding to an important training scenario:

• Scenario I (End-to-end training): This is the general training scenario, in which a user is given an unfamiliar optimizer and task and aims to achieve the best validation performance after several rounds of trial and error. In this case, the evaluation needs to include the hyperparameter tuning time. We develop an efficiency evaluation protocol that compares optimizers in terms of CPE (defined in Section 3.1) and peak performance.

• Scenario II (Data-addition training): This is another common scenario, in which the same model needs to be retrained regularly after fresh data are collected. A natural solution is to reuse the previously optimal hyperparameters and retrain the model. However, since the data distribution shifts, the result depends on the sensitivity of those hyperparameters to the shift.

We describe the detailed evaluation protocol for each setting in the following subsections.

3.1. END-TO-END TRAINING EVALUATION PROTOCOL

Before introducing our evaluation protocol for Scenario I, we first formally define an optimizer and its hyperparameters.

Definition 1. An optimizer solves a minimization problem min_θ L(θ) and is defined by a tuple o = (U, Ω) ∈ O, where O contains all optimizers of interest. Here U is a specific update rule, and Ω = (ω_1, . . . , ω_N) ∈ R^N is a vector of N hyperparameters with search space F. Given an initial parameter value θ_0, together with the trajectory of the optimization procedure H_t = {θ_s, L(θ_s), ∇L(θ_s)}_{s≤t}, the optimizer updates θ by θ_{t+1} = U(H_t, Ω).

We aim to evaluate the end-to-end time for a user to obtain the best model, including the hyperparameter tuning time. A recent work (Sivaprasad et al., 2020) assumes that the user runs random search to find the best hyperparameter setting. However, we argue that random search over-emphasizes the importance of hyperparameters when tuning is considered: it assumes a user never stops a run even after observing divergence or bad results in the initial training phase, which is unrealistic. Figure 1 illustrates why random search might not give a fair comparison of optimizers. In Figure 1, we are given two optimizers, A and B, and their corresponding loss as a function of the hyperparameter. According to Sivaprasad et al. (2020), optimizer B would be considered better than optimizer A under a constrained budget, since most regions of B's hyperparameter space outperform those of A. For instance, suppose we randomly sample hyperparameter settings for A and B: the final configuration ω*_r(B) found under this strategy has a lower expected loss than ω*_r(A), as shown in Figure 1a. However, there exists a more practical search strategy that invalidates this conclusion under a limited search budget: a user can terminate a trial early when it is stuck at bad results or diverging.
Hence, as Figure 1b shows, for optimizer A this strategy early-stops many configurations and allows only a limited number of trials to explore deeper stages, so bad hyperparameters do not affect the overall efficiency of optimizer A much. In contrast, for optimizer B, the performances of different hyperparameters are relatively similar and hard to distinguish, resulting in similarly long termination times for each trial. Therefore, it may be easier for a practical search strategy p to find the best configuration ω*_p(A) of optimizer A than ω*_p(B) under the same constrained budget. This example suggests that random search may over-emphasize parameter sensitivity when benchmarking optimizers. To better reflect a practical hyperparameter tuning scenario, our evaluation assumes the user applies Hyperband (Li et al., 2017), a simple but effective hyperparameter tuning scheme, to obtain the best model. Hyperband formulates hyperparameter optimization as a bandit problem and accelerates random search through adaptive resource allocation and early stopping, as illustrated in Figure 1b. Compared with more complicated counterparts such as BOHB (Falkner et al., 2018), Hyperband requires less computation and performs similarly within a constrained budget. The algorithm is presented in Appendix A. Despite these hyperparameter tuning algorithms, tuning by human experts is still regarded as the most effective. To verify that Hyperband is competitive with human tuning, we conduct the following human study: for image classification on CIFAR10, given 10 learning rate configurations of SGD on the grid {1.0 × 10^-8, 1.0 × 10^-7, 1.0 × 10^-6, . . . , 10}, participants are asked to find the best one at their discretion. They can stop or pause a trial at any time and evaluate new configurations until they believe the best performance has been reached.
The 10 participants are randomly sampled Ph.D. students with computer science backgrounds. We collect their tuning trajectories and average them as the human performance, which is considered "optimal" in this study. In Figure 2, we plot hyperparameter tuning curves for humans, Hyperband, random search, random search with the early stopping (ES) strategy of Sivaprasad et al.

[Figure 1: Expected loss (a) and termination time (b) as functions of the hyperparameter for optimizers A and B, marking the configurations ω*_r and ω*_p found by random search and by a practical early-stopping strategy, respectively.]
(2020), and Hyperband with ES. We find that Hyperband matches human behavior best, while random search tends to get trapped in suboptimal configurations, although random search with early stopping mitigates this issue to some extent. This finding shows the advantage of Hyperband over random search regardless of early stopping, and justifies the use of Hyperband in optimizer benchmarking. More details of this human study can be found in Appendix B. With Hyperband incorporated into end-to-end training, we assume that configurations are run sequentially and record the best performance obtained by time step t as P_t. Here P_t is the evaluation metric of the task, e.g., accuracy for image classification or return for reinforcement learning. The sequence {P_t}_{t=1}^T forms a trajectory for plotting learning curves on the test set, as in Figure 3. Although such figures let us inspect the performance of different optimizers visually, summarizing a learning curve into a single scalar is more convenient for evaluation. Thus, as shown in Eq. 1, we use the λ-tunability defined in Sivaprasad et al. (2020) to measure optimizer performance:

λ-tunability = Σ_{t=1}^{T} λ_t · P_t, where Σ_t λ_t = 1 and λ_t > 0 for all t.  (1)

One option is to set λ_t = 1{t = T}, which measures which optimizer reaches the best performance at the end of training. However, considering only the peak performance is not a good guide for choosing an optimizer; in practice we care about the whole trajectory and place more emphasis on the early stage. Thus, we employ the Cumulative Performance-Early weighting scheme, with decreasing weights λ_t ∝ (T − t + 1), to compute λ-tunability instead of the extreme assignment λ_t = 1{t = T}. We term the resulting value CPE for short. We present our evaluation protocol in Algorithm 1: end-to-end training with hyperparameter optimization is conducted for each optimizer on the given task.
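The two weighting schemes can be sketched in a few lines (our own illustrative code; we take λ_t ∝ T − t + 1 for CPE, which satisfies the constraints of Eq. 1):

```python
def lambda_tunability(perf, weights):
    """Weighted summary of a best-performance-so-far trajectory {P_t} (Eq. 1)."""
    total = float(sum(weights))
    return sum(w / total * p for w, p in zip(weights, perf))  # weights normalized to 1

def cpe(perf):
    """Cumulative Performance-Early: decreasing weights emphasize the early
    stage; here lambda_t is proportional to (T - t + 1)."""
    T = len(perf)
    return lambda_tunability(perf, [T - t for t in range(T)])

def peak(perf):
    """lambda_t = 1{t = T}: on a best-so-far trajectory this is the final (max) value."""
    return perf[-1]

# Best-accuracy-so-far trajectory from a hypothetical tuning run:
traj = [0.10, 0.45, 0.45, 0.62, 0.70, 0.70]
# CPE rewards reaching good accuracy early; peak ignores when it was reached.
```

An optimizer that reaches 0.70 sooner would obtain a higher CPE than the trajectory above while having the same peak.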
The trajectory {P_t}_{t=1}^T is recorded to compute the peak performance as well as the CPE value. The procedure is repeated M times to obtain a reliable result; we use M = 3 in all experiments. More details on the time cost of the algorithm and how to accelerate it can be found in Appendix E.

3.2. DATA-ADDITION TRAINING EVALUATION PROTOCOL

In Scenario II, we have a service (e.g., a search or recommendation engine) and want to re-train its model every day or every week with newly added training data. One may argue that an online learning algorithm should be used in this case, but in practice online learning is unstable, and industry still prefers this periodic retraining scheme, which is more stable. In this scenario, once the best hyperparameters have been chosen initially, we can reuse them for every retraining, so no hyperparameter tuning is required, and the performance (both efficiency and test accuracy) under the best hyperparameters becomes important. However, this process makes the implicit assumption that "the best hyperparameters will still work when the training task changes slightly". This can be viewed as the transferability of hyperparameters for a particular optimizer, and our second evaluation protocol evaluates exactly this practical scenario. We simulate data-addition training with all classification tasks, and the evaluation protocol works as follows: 1) extract a subset D_δ containing a small fraction δ of the training data from the original full dataset D; 2) conduct a hyperparameter search on D_δ to find the best setting for this scenario; 3) use these hyperparameters to train the model on the complete dataset; 4) observe potential changes in the ranking of optimizers before and after data addition. For step 4), when comparing optimizers we plot the training curves of the full-data training stage in Section 4 and also summarize each curve with its CPE value. The detailed evaluation protocol is described in Algorithm 2.
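The four steps of the data-addition protocol can be sketched as follows (our own illustrative code; `tune` and `train` stand in for the hyperparameter search and training routines, which are not specified here):

```python
import random

def data_addition_eval(dataset, optimizers, tune, train, delta=0.3, seed=0):
    """Data-addition evaluation (steps 1-4), sketched.

    tune(optimizer, data)       -> best hyperparameters found on `data`
    train(optimizer, hps, data) -> performance trajectory {P_t} on `data`
    """
    rng = random.Random(seed)
    # 1) Extract a subset D_delta with a small ratio delta of the full data.
    subset = rng.sample(dataset, int(delta * len(dataset)))
    curves = {}
    for opt in optimizers:
        # 2) Search for the best hyperparameters on the subset only.
        hps = tune(opt, subset)
        # 3) Reuse those hyperparameters to train on the complete dataset.
        curves[opt] = train(opt, hps, dataset)
    # 4) Compare optimizer rankings before and after data addition,
    #    e.g., by summarizing each trajectory with CPE.
    return curves
```

The key point is that step 2 never sees the full dataset, so the returned curves measure how well hyperparameters transfer across the data shift.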

4. EXPERIMENTAL RESULTS

Optimizers to be evaluated. As shown in Table 1, we consider 7 optimizers, including non-adaptive methods that use only first-order momentum and adaptive methods that use both first-order and second-order momentum. Table 1 also lists the tunable hyperparameters of each optimizer. We consider the following two settings of tunable hyperparameters to better investigate the behavior of different optimizers: a) tuning only the initial learning rate with the others set to default values, and b) tuning the full list of hyperparameters. A detailed description of the optimizers, along with the default values and search ranges of these hyperparameters, can be found in Appendix E. Note that, following Metz et al. (2020), we adopt a unified search space for a fair comparison, to eliminate biases from choosing specific ranges for different optimizers. The tuning budget of Hyperband is determined by three quantities: the maximum resource per configuration R (in this paper, epochs), the reduction factor η, and the number of configurations n_c. According to Li et al. (2017), a single Hyperband execution contains n_s = ⌊log_η(R)⌋ + 1 rounds of SuccessiveHalving, each referred to as a bracket. These brackets range from the least to the most aggressive early stopping, and each is designed to use approximately B = R · n_s resources, leading to a finite total budget. The number of randomly sampled configurations in one Hyperband run is likewise fixed and grows with R. Given R and η, n_c then determines how many times Hyperband is repeated. We set η = 3, since this default value performs consistently well, and set R to the number of epochs each task usually takes for a complete run. We set n_c to the number required for a single Hyperband execution for all tasks except BERT fine-tuning, where a larger number of configurations is necessary due to its relatively small R. Appendix E gives the assigned values of R, η, and n_c for each task.
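For concreteness, the bracket count and per-bracket budget described above can be computed as follows (illustrative helper of our own; integer arithmetic avoids floating-point log issues):

```python
def hyperband_budget(R, eta=3):
    """Brackets and approximate per-bracket budget of one Hyperband run
    with maximum resource R and reduction factor eta (Li et al., 2017)."""
    s_max = 0
    while eta ** (s_max + 1) <= R:   # s_max = floor(log_eta R)
        s_max += 1
    n_brackets = s_max + 1           # n_s = floor(log_eta R) + 1
    per_bracket = R * n_brackets     # each bracket uses about B = R * n_s
    return n_brackets, per_bracket

# e.g., R = 81 epochs and eta = 3 give 5 brackets of roughly 405 epochs each.
```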

Table 1: Optimizers to be evaluated with their tunable hyperparameters. Here α_0 denotes the initial learning rate; µ is the decay factor of the first-order momentum for non-adaptive methods, while β_1 and β_2 are the coefficients for the running averages of the first-order and second-order momentum; ε is a small scalar used to prevent division by zero.

Non-adaptive — SGD: α_0, µ; LARS: α_0, µ, ε.
Adaptive — Adam, RAdam, Yogi: α_0, β_1, β_2, ε; Lookahead, LAMB: α_0, β_1, β_2, ε.

Tasks for benchmarking. For a comprehensive and reliable assessment of optimizers, we consider a wide range of tasks in different domains. When evaluating end-to-end training efficiency, we apply our protocol to tasks covering several popular applications, listed in Table 2. Apart from common tasks in computer vision and natural language processing, we include two additional tasks in graph neural network training and reinforcement learning. For simplicity, we use the dataset name to denote each task in the subsequent tables of experimental results (for the reinforcement learning task, we use the environment name). The detailed settings and parameters for each task can be found in Appendix D.

4.1. END-TO-END EFFICIENCY (SCENARIO I)

To evaluate end-to-end training efficiency, we adopt the protocol in Algorithm 1. Specifically, we record the average training trajectory with Hyperband, {P_t}_{t=1}^T, for each optimizer on each benchmarking task, where P_t is the task's evaluation metric (e.g., accuracy, return). We visualize these trajectories for CIFAR10 and CIFAR100 in Figure 3, and report CPE and peak performance in Tables 3 and 9, respectively. Results for the other tasks and the peak performance can be found in Appendix F. In addition, in Eq. 2 we compute the performance ratio r_{o,a} for each optimizer o and task a, and then use the distribution function of this metric, the performance profile ρ_o(τ), to summarize the performance of each optimizer over all tasks:

r_{o,a} = max{CPE_{o',a} : o' ∈ O} / CPE_{o,a},  ρ_o(τ) = (1/|A|) · |{a ∈ A : r_{o,a} ≤ τ}|.  (2)

For tasks where a lower CPE is better, we instead use r_{o,a} = CPE_{o,a} / min{CPE_{o',a} : o' ∈ O} to guarantee r_{o,a} ≥ 1. The function ρ_o(τ) for all optimizers is presented in Figure 4. By the definition of the performance profile (Dolan & Moré, 2002), optimizers with larger ρ_o(τ) are preferred. In particular, ρ_o(1) is the probability that an optimizer outperforms the rest and can serve as a reference for selecting an optimizer for an unknown task. We also provide a probabilistic performance profile summarizing the optimizers in Figure 7 in Appendix F. Our findings are summarized below:

• It should be emphasized from Tables 3 and 9 that, under our Hyperband-based protocol, SGD performs similarly to Adam in terms of both efficiency and peak performance, and can even surpass it in some cases, such as training on CIFAR100. Under Hyperband, the best configuration of SGD is less tedious to find than with random search, because Hyperband early-stops bad runs, so they have less effect on search efficiency and final performance.
• For image classification tasks, all the methods are competitive, while adaptive methods tend to perform better on more complicated tasks (NLP, GCN, RL).

• There is no significant distinction among the adaptive variants: the performance of adaptive optimizers tends to fall within 1% of the best result.

• According to the performance profile in Figure 4, RAdam reaches probability 1 with the smallest τ, and Adam is the second to do so. This indicates that RAdam and Adam achieve relatively stable and consistent performance across these tasks.

Table 3: CPE for different optimizers on the benchmarking tasks: CIFAR10 (classification, % ↑), CIFAR100 (classification, % ↑), CelebA (VAE, ↓), MRPC (NLP, ↑), PPI (GCN, % ↑), and Walker2d-v3 (RL, ↑). The best performance is highlighted in bold and blue, and results within 1% of the best are emphasized in bold only.
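The ratios and profile of Eq. 2 can be computed directly from a table of CPE values; a small sketch of our own (assuming higher CPE is better):

```python
def performance_profile(cpe, taus):
    """cpe: {optimizer: {task: CPE value}}, higher is better.
    Returns rho_o(tau) over the grid `taus`, following Eq. 2."""
    optimizers = list(cpe)
    tasks = list(next(iter(cpe.values())))
    best = {a: max(cpe[o][a] for o in optimizers) for a in tasks}
    # r_{o,a} = max_{o'} CPE_{o',a} / CPE_{o,a} >= 1
    ratios = {o: [best[a] / cpe[o][a] for a in tasks] for o in optimizers}
    # rho_o(tau) = fraction of tasks with r_{o,a} <= tau
    return {o: [sum(r <= tau for r in ratios[o]) / len(tasks) for tau in taus]
            for o in optimizers}
```

Evaluating the profile at τ = 1 gives, for each optimizer, the fraction of tasks on which it attains the best CPE.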

4.2. DATA-ADDITION TRAINING (SCENARIO II)

We then evaluate data-addition training using the protocol in Algorithm 2. We choose the four classification problems, CIFAR10, CIFAR100, MRPC, and PPI, since this setting does not apply to RL. We search for the best hyperparameter configuration, denoted Ω_partial, on the training subset with ratio δ = 0.3, tuning all hyperparameters. We then directly apply Ω_partial to a complete training run on the full dataset. The training curves are shown in Figure 5, and we also summarize them with CPE (Eq. 1) in Table 4. We have the following findings:

• There is no clear winner in data-addition training. RAdam outperforms the other optimizers on 2 of the 4 tasks and is thus slightly preferred, but all other optimizers except Lookahead are also competitive (within 1% of the best) on at least 2 of the 4 tasks.

• To investigate whether the optimizer ranking changes after adding the remaining 70% of the data, we compare the training curves on the original 30% subset with those on the full dataset in Figure 5. We observe that the ranking of optimizers changes slightly after data addition.

5. CONCLUSIONS AND DISCUSSIONS

In conclusion, we find no strong evidence that newly proposed optimizers consistently outperform Adam, although each of them may be well suited to particular tasks. When choosing an optimizer for a specific task, one can refer to the results in Tables 3 and 9: if the task is included in Table 2, one can directly choose the optimizer with the best CPE or the best peak performance, depending on the goal (easy tuning versus high final performance). Even if the desired task is not covered, one can still gain insight from the results on the most similar task in Table 2.

A HYPERBAND

We present the full Hyperband procedure in Algorithm 3; see Li et al. (2017) for more details.

Algorithm 3 Hyperband
Input: maximum resource R per configuration, reduction factor η
1: s_max = ⌊log_η(R)⌋, B = (s_max + 1)R
2: for s ∈ {s_max, s_max − 1, . . . , 0} do
3:    n = ⌈(B/R) · η^s/(s + 1)⌉, r = Rη^{−s}
4:    T = set of n randomly sampled configurations
5:    for i ∈ {0, . . . , s} do
6:       n_i = ⌊nη^{−i}⌋, r_i = rη^i
7:       L = {run_then_return_val_loss(t, r_i) : t ∈ T}
8:       T = top_k(T, L, ⌊n_i/η⌋)
9:    end for
10: end for
11: return hyperparameter configuration with the smallest loss seen so far
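As a complement to the pseudocode, here is a compact executable sketch of Hyperband (our own illustrative code; the `run_then_return_val_loss(config, resource)` interface follows Li et al. (2017), everything else is an assumption of this sketch):

```python
import math

def hyperband(sample_config, run_then_return_val_loss, R, eta=3):
    """Hyperband: brackets of SuccessiveHalving, from most to least
    aggressive early stopping (Li et al., 2017)."""
    s_max = int(math.log(R) / math.log(eta) + 1e-9)   # floor(log_eta R)
    B = (s_max + 1) * R                               # budget per bracket
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):
        n = math.ceil(B / R * eta**s / (s + 1))       # initial #configs
        r = R * eta**(-s)                             # initial resource
        T = [sample_config() for _ in range(n)]
        for i in range(s + 1):                        # SuccessiveHalving
            n_i = n // (eta ** i)
            r_i = r * eta**i
            losses = [run_then_return_val_loss(t, r_i) for t in T]
            for t, l in zip(T, losses):
                if l < best_loss:
                    best, best_loss = t, l
            keep = max(1, n_i // eta)                 # top floor(n_i/eta)
            order = sorted(range(len(T)), key=lambda j: losses[j])
            T = [T[order[j]] for j in range(min(keep, len(T)))]
    return best, best_loss
```

A configuration with a bad learning rate is evaluated only at small resources r_i before being discarded, which is exactly the early-stopping behavior the protocol relies on.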

B DETAILS OF HUMAN STUDY

In this human study, the 10 participants are Ph.D. students with computer science (specifically, machine learning) backgrounds. They were recruited as follows: we asked administrators to distribute the program of this human study to Ph.D. students in machine learning labs at our institutions. Given this population, we assumed that participants had prior knowledge of basic machine learning experiments, such as image classification on MNIST and CIFAR10. They were asked to tune the learning rate based on their knowledge. They were informed that the target experiment was image classification on CIFAR10 with SGD and that the learning rate search grid was {1.0 × 10^-8, 1.0 × 10^-7, 1.0 × 10^-6, . . . , 10}. Each configuration could be run for at most 200 epochs. Participants could pause any configuration to evaluate others, and could stop the whole tuning process only when they believed the accuracy could not be further improved and the total number of tuning epochs exceeded 600. We collected results from 17 people, determined validity by checking whether the length of each trajectory exceeded 600 epochs, and removed 7 invalid trajectories. The remaining 10 trajectories were averaged to give the human performance shown in Figure 2.

C OPTIMIZERS

Notations. Given a vector of parameters θ ∈ R^d, we denote the sub-vector of its i-th layer's parameters by θ^{(i)}. {α_t}_{t=1}^T is the sequence of learning rates over an optimization horizon T. {φ_t, ψ_t}_{t=1}^T is a sequence of functions that compute the first-order and second-order momentum of the gradient g_t, denoted m_t and v_t respectively, at time step t; different optimization algorithms are usually specified by the choices of φ(·) and ψ(·). {r_t}_{t=1}^T is an additional sequence of adaptive terms that modify the magnitude of the learning rate in some methods. For algorithms using only first-order momentum, µ is the momentum decay factor, while β_1 and β_2 are the coefficients for computing the running averages m and v. ε is a small scalar (e.g., 1 × 10^-8) used to prevent division by zero.

Generic optimization framework. Based on these notations, we develop a generic optimization framework that includes an extra adaptive term, shown in Algorithm 4. The debiasing term used in the original version of Adam is omitted for simplicity. Note that different learning rate scheduling strategies can be adopted for {α_t}_{t=1}^T, and the choice of scheduler is also regarded as a tunable hyperparameter. Without loss of generality, in this paper we consider only a constant schedule and a linear decay (Shallue et al., 2018), introducing γ as a hyperparameter:

α_t = α_0 (constant);  α_t = α_0 − (1 − γ)α_0 · t/T (linear decay).

With this generic framework, we can summarize several popular optimization methods by explicitly specifying m_t, v_t, and r_t, as in Table 5. It should be clarified that Lookahead is an exception to the generic framework: it is more of a high-level mechanism that can be combined with any inner optimizer. However, as stated in Zhang et al. (2019), Lookahead is robust to the choice of inner optimizer, k, and α_s in Algorithm 5, so we include it with Adam as its base optimizer for a more convincing and comprehensive evaluation.
We consider Lookahead as a special adaptive method, and tune the same hyperparameters for it as for the other adaptive optimizers.

Algorithm 4 Generic framework of optimization methods
Input: parameter value θ_1, learning rate with scheduling {α_t}, sequence of functions {φ_t, ψ_t, χ_t}_{t=1}^T to compute m_t, v_t, and r_t respectively.
1: for t = 1 to T do
2:   g_t = ∇f_t(θ_t)
3:   m_t = φ_t(g_1, ..., g_t)
4:   v_t = ψ_t(g_1, ..., g_t)
5:   r_t = χ_t(θ_t, m_t, v_t)
6:   θ_{t+1} = θ_t - α_t r_t m_t / √v_t
7: end for

Optimizer   | m_t                          | v_t                                              | r_t
SGD(M)      | µ m_{t-1} + g_t              | 1                                                | 1
Adam        | β_1 m_{t-1} + (1 - β_1) g_t  | β_2 v_{t-1} + (1 - β_2) g_t^2                    | 1
RAdam       | β_1 m_{t-1} + (1 - β_1) g_t  | β_2 v_{t-1} + (1 - β_2) g_t^2                    | ((ρ_t - 4)(ρ_t - 2) ρ_∞) / ((ρ_∞ - 4)(ρ_∞ - 2) ρ_t)
Yogi        | β_1 m_{t-1} + (1 - β_1) g_t  | v_{t-1} - (1 - β_2) sign(v_{t-1} - g_t^2) g_t^2  | 1
LARS        | µ m_{t-1} + g_t              | 1                                                | ||θ_t^(i)|| / ||m_t^(i)||
LAMB        | β_1 m_{t-1} + (1 - β_1) g_t  | β_2 v_{t-1} + (1 - β_2) g_t^2                    | ||θ_t^(i)|| / ||m_t^(i) / √(v_t^(i))||
Lookahead*  | β_1 m_{t-1} + (1 - β_1) g_t  | β_2 v_{t-1} + (1 - β_2) g_t^2                    | 1

Algorithm 5 Lookahead Optimizer
Input: initial parameters θ_0, objective function f, synchronization period k, slow-weights step size α_s, inner optimizer A
1: for t = 1, 2, ... do
2:   Synchronize parameters θ̃_{t,0} ← θ_{t-1}
3:   for i = 1, 2, ..., k do
4:     Update the fast weights θ̃_{t,i} with one step of the inner optimizer A
5:   end for
6:   Perform outer update θ_t ← θ_{t-1} + α_s (θ̃_{t,k} - θ_{t-1})
7: end for
8: return parameters θ
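A minimal executable sketch of Algorithm 4, instantiated with Adam's choices of φ, ψ, and χ from Table 5 (debiasing omitted, as in the paper's simplification). The helper names and the scalar toy problem are illustrative, not the paper's code; ε is added to the denominator as in the notation above.

```python
import math

def generic_step(theta, grad, state, alpha, phi, psi, chi, eps=1e-8):
    """One step of the generic framework (Algorithm 4):
    m_t = phi_t(g), v_t = psi_t(g), r_t = chi_t(theta, m_t, v_t),
    theta_{t+1} = theta_t - alpha_t * r_t * m_t / sqrt(v_t)."""
    state["m"] = phi(grad, state)
    state["v"] = psi(grad, state)
    r = chi(theta, state)
    theta = [th - alpha * r * m / (math.sqrt(v) + eps)
             for th, m, v in zip(theta, state["m"], state["v"])]
    return theta, state

# Adam as an instance of the framework (see Table 5).
beta1, beta2 = 0.9, 0.999
phi = lambda g, s: [beta1 * m + (1 - beta1) * gi for m, gi in zip(s["m"], g)]
psi = lambda g, s: [beta2 * v + (1 - beta2) * gi ** 2 for v, gi in zip(s["v"], g)]
chi = lambda theta, s: 1.0  # r_t = 1 for Adam

# Toy usage: minimize f(theta) = theta^2 starting from theta = 1.
theta, state = [1.0], {"m": [0.0], "v": [0.0]}
for t in range(200):
    grad = [2.0 * theta[0]]
    theta, state = generic_step(theta, grad, state, alpha=0.05,
                                phi=phi, psi=psi, chi=chi)
```

Swapping in the other rows of Table 5 only changes the three lambdas, which is the point of the framework.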

D TASK DESCRIPTION

We give a concrete description of the tasks selected for our optimizer evaluation protocol:
• Image classification. For this task, we adopt a ResNet-50 (He et al., 2016) model.
• Natural language processing. In this domain, we fine-tune RoBERTa-base on MRPC, one of the tasks in the GLUE benchmark. For each optimizer, we set the maximal exploration budget to 800 epochs. The batch size is 16 sentences.
• Graph learning. Among various graph learning problems, we choose node classification, a semi-supervised classification task. In GCN training, there are multiple ways to deal with the neighborhood explosion faced by stochastic optimizers. We choose Cluster-GCN (Chiang et al., 2019) as the backbone to handle neighborhood expansion, with PPI as the dataset.
• Reinforcement learning. We select Walker2d-v3 from OpenAI Gym (Brockman et al., 2016) as our training environment and PPO (Schulman et al., 2017), implemented in OpenAI SpinningUp (Achiam, 2018), as the algorithm to be tuned. We use the same architecture for both the action value network Q and the policy network π. We define 40,000 environment interactions as one epoch, with a batch size of 4,000. The return we report is the highest average test return of an epoch during training.

E IMPLEMENTATION DETAILS

Implementation details of our experiments are provided in this section. Specifically, we give the unified search space for all hyperparameters and their default values in Table 6. Note that we tune the learning rate decay factor for image classification tasks when tuning every hyperparameter. For the task on MRPC, γ is tuned in all experiments. In the other cases, we tune only the original hyperparameters, without a learning rate scheduler. In addition, the Hyperband parameter values for each task are listed in Table 7.

Algorithm 6 Accelerated evaluation protocol with resampling
1: Sample S configurations from F; initialize the library with an empty list for each setting
2: for i = 1 to M do
3:   Simulate Hyperband with o using configurations re-sampled from the library on a
4:   if the desired accuracy is pre-computed in the library then
5:     Retrieve the value directly
6:   else
7:     Run the configuration as in Hyperband and store the trajectory piece in the library
8:   end if
9:   Average peak and CPE values over M repetitions for the optimizer o
10: end for
11: Evaluate optimizers according to their peak and CPE values

F ADDITIONAL RESULTS

More detailed experimental results are reported in this section.

F.1 IMPACT OF η

Since Hyperband has an extra hyperparameter, the reduction factor η, we conduct an experiment with different values (η = 2, 3, 4, 5) to observe the potential impact of this hyperparameter on our evaluation. Specifically, we use Hyperband to tune the learning rate for three optimizers, SGD, Adam, and Lookahead, on CIFAR10; the results are presented in Table 8. Although changing η may lead to different CPE values, the relative ranking among the three optimizers remains unchanged. Moreover, all three achieve comparable peak performance at the end of training. Considering the efficiency of Hyperband, we choose η = 3 in all our experiments, following the convention in Li et al. (2017).

Table 9 shows the peak performance of the optimizers on each task. For GAN, due to time constraints, we evaluate only optimizers tuning the learning rate, and present the CPE and peak performance in Table 10. The end-to-end training curve for GAN on CIFAR10 is shown in Figure 6. We also compute a probabilistic performance profile, where µ_{o,a} and σ_{o,a} are the mean and standard deviation of the CPE of optimizer o on task a respectively, and b_a is the best expected CPE on a among all optimizers. As seen in Figure 7, the probabilistic performance profiles show a similar trend to Figure 4. We also attach two end-to-end training trajectories on CIFAR10 with error bars in Figure 8. Since it is hard to distinguish the optimizers once the standard deviation is added, we instead report the standard deviations of CPE and peak performance in Tables 3 and 9.
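For reference, the deterministic performance profile behind Figure 4 can be sketched as follows. This assumes a "lower ratio is better" normalization of CPE per task (for higher-is-better metrics the ratio would be inverted) and is an illustration, not the paper's exact code; the CPE values below are made up.

```python
def performance_profile(cpe, tau):
    """Fraction of tasks on which each optimizer's CPE is within a factor
    tau of the best (smallest) CPE on that task.

    cpe: dict optimizer -> dict task -> CPE value (lower is better here).
    """
    tasks = list(next(iter(cpe.values())))
    best = {a: min(c[a] for c in cpe.values()) for a in tasks}
    return {o: sum(c[a] <= tau * best[a] for a in tasks) / len(tasks)
            for o, c in cpe.items()}

# Hypothetical CPE table with two optimizers and two tasks.
cpe = {"SGD":  {"cifar10": 1.00, "mrpc": 2.00},
       "Adam": {"cifar10": 1.20, "mrpc": 1.00}}
profile = performance_profile(cpe, 1.3)  # fraction of tasks within 1.3x of best
```

Sweeping τ over [1.0, 1.3] and plotting the resulting fractions per optimizer reproduces the shape of a performance-profile curve.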



The step sizes of gradient descent and Newton's method can be automatically adjusted by a line search procedure (Nocedal & Wright, 2006). For instance, Sivaprasad et al. (2020) only reach < 50% accuracy in their CIFAR-100 comparisons.



Figure 1: An illustrative example showing that different hyperparameter tuning methods are likely to affect the comparison of optimizers. Optimizer A is more sensitive to hyperparameters than optimizer B, but it may be preferred if bad hyperparameters can be terminated at an early stage.

Figure 2: Hyperband tuning used in our evaluation protocol is closer to human behavior than random search.

Algorithm 1 End-to-End Efficiency Evaluation Protocol
Input: a set of optimizers O = {o : o = (U, Ω)}, task a ∈ A, feasible search space F
1: for o ∈ O do
2:   for i = 1 to M do
3:     Conduct hyperparameter search in F with the optimizer o using Hyperband on a
4:     Record the performance trajectory {P_t}_{t=1}^T explored by Hyperband
5:     Calculate the peak performance and CPE
6:   end for
7:   Average peak and CPE values over M repetitions for the optimizer o
8: end for
9: Evaluate optimizers according to their peak and CPE values

3.2 DATA-ADDITION TRAINING EVALUATION PROTOCOL

Under review as a conference paper at ICLR 2021

Algorithm 2 Data-Addition Training Evaluation Protocol
Input: a set of optimizers O = {o : o = (U, Ω)}, task a ∈ A with a full dataset D, a split ratio δ
1: for o ∈ O do
2:   for i = 1 to M do
3:     Conduct hyperparameter search with the optimizer o using Hyperband on a with a partial dataset D_δ, and record the best hyperparameter setting Ω_partial found under this scenario
4:     Apply the optimizer with Ω_partial on D_δ and D, then save the training curves
5:   end for
6:   Average the training curves of o over M repetitions to compute CPE
7: end for
8: Compare the performance of different optimizers under data-addition training
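The data-addition protocol can be sketched as a loop over optimizers. Here `tune` and `train` are hypothetical stand-ins for the Hyperband search and the training run; only the control flow mirrors Algorithm 2.

```python
def data_addition_eval(optimizers, tune, train, full_data, delta, M=3):
    """Sketch of the data-addition training evaluation protocol.

    tune(o, data)      -> best hyperparameters found on `data` (stand-in
                          for the Hyperband search on the partial set)
    train(o, hp, data) -> training curve (list of per-epoch scores)
    Returns, per optimizer, the averaged curves on D_delta and on D.
    """
    partial = full_data[: int(delta * len(full_data))]
    curves = {}
    for o in optimizers:
        runs_partial, runs_full = [], []
        for _ in range(M):
            hp = tune(o, partial)                      # tuned on partial data only
            runs_partial.append(train(o, hp, partial))
            runs_full.append(train(o, hp, full_data))  # reuse hp after data addition
        avg = lambda runs: [sum(xs) / len(xs) for xs in zip(*runs)]
        curves[o] = (avg(runs_partial), avg(runs_full))
    return curves
```

The key design point is that hyperparameters are selected on D_δ once and then frozen, so any performance gap on D measures sensitivity to data shift rather than re-tuning effort.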

Figure 4: Performance profile of 7 optimizers in the range [1.0, 1.3].

Hyperband
Input: R, η
Initialization: s_max = ⌊log_η R⌋, B = (s_max + 1) R
1: for s ∈ {s_max, s_max - 1, ..., 0} do

End-to-end training curves for the remaining tasks are shown in Figures 9 and 10.

Figure 6: End-to-end training curves for GAN on CIFAR10.

Figure 7: Probabilistic performance profile of 7 optimizers in the range [1.0, 1.3].

Figure 8: End-to-end training curves on CIFAR10

Tasks for benchmarking optimizers. Details are provided in Appendix D.

CPE of different optimizers, computed from training curves obtained with Ω_partial on four full datasets.

Training curves under Ω_partial for both partial and full datasets.

In addition to the two proposed evaluation criteria, other factors may affect the practical performance of an optimizer. First, memory consumption is becoming important when training large DNN models. For instance, although Lookahead performs well on certain tasks, it requires more memory than the other optimizers, restricting its use in memory-constrained applications. Another important criterion is the scalability of optimizers. When training on a massively distributed system, optimizing performance in the large-batch regime (e.g., a 32K batch size for ImageNet) is important. The LARS and LAMB algorithms included in our study were developed for large-batch training. We believe this is another important metric for comparing optimizers that is worth further study.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148-4158, 2017.

Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 9793-9803, 2018.

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9597-9608, 2019.

Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. TBD: Benchmarking and analyzing deep neural network training. arXiv preprint arXiv:1803.06905, 2018.

Table 5: A summary of popular optimization algorithms with different choices of m_t, v_t, and r_t.

The ResNet-50 model is trained on CIFAR10 and CIFAR100 with a batch size of 128 and a maximum of 200 epochs per trial.
• VAE. We use a vanilla variational autoencoder (Kingma & Welling, 2013) with five convolutional and five deconvolutional layers and a latent space of dimension 128, trained on CelebA with a batch size of 144. There are no dropout layers.
• GAN. We train SNGAN on CIFAR10, with the same network architecture and objective function with spectral normalization as in Miyato et al. (2018); the batch sizes of the generator and the discriminator are 128 and 64 respectively.

These parameters are assigned based on the properties of the different tasks. The time cost of our evaluation protocols depends on how much budget is available. Specifically, in our paper the unit of time budget is one epoch, so the total time is B_epoch × T_epoch, where B_epoch is the total available budget and T_epoch is the running time of one epoch. There is no additional computational cost: running our protocol once takes the same time as running one hyperparameter search with Hyperband. In our experiment on CIFAR10, we evaluated roughly 200 hyperparameter configurations in one Hyperband run, whereas the same time allows only about 50 configurations with random search.
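As a concrete instance of this cost model, the snippet below plugs hypothetical numbers (not from the paper) into B_epoch × T_epoch:

```python
# Cost model for one protocol run: total time = B_epoch * T_epoch.
# Both values below are hypothetical, chosen only for illustration.
B_epoch = 2000        # total available budget, in epochs (hypothetical)
T_epoch = 30.0        # wall-clock seconds per epoch (hypothetical)

total_seconds = B_epoch * T_epoch
total_hours = total_seconds / 3600.0  # 60000 s, i.e. about 16.7 hours
```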

Hyperparameter search space and default values.

Hyperband parameters for each task.

Moreover, we can further accelerate our evaluation protocol by resampling, as shown in Algorithm 6. The basic idea is to keep a library of trajectories for different hyperparameter settings. At the beginning, the library is empty. In each repetition, we sample the number of configurations required to run Hyperband once. During the simulation of Hyperband, we retrieve a value directly from the library whenever the desired epoch of the current configuration is already contained in it; otherwise, we run the configuration as in Hyperband and store the resulting piece of the trajectory in the library.
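The library idea amounts to memoizing partial training trajectories per configuration. The sketch below uses a hypothetical `trainer(config, epoch) -> score` interface; only the missing tail of a trajectory is ever computed.

```python
class TrajectoryLibrary:
    """Cache of partial training trajectories keyed by configuration,
    as in the resampling acceleration (Algorithm 6)."""

    def __init__(self, trainer):
        self.trainer = trainer  # trainer(config, epoch) -> per-epoch score
        self.store = {}         # config -> list of scores seen so far

    def run(self, config, epochs):
        """Return the first `epochs` scores for `config`, computing only
        the epochs not already stored in the library."""
        traj = self.store.setdefault(config, [])
        while len(traj) < epochs:          # extend only the missing piece
            traj.append(self.trainer(config, len(traj) + 1))
        return traj[:epochs]
```

When a later Hyperband simulation re-samples a configuration at a smaller or equal budget, the lookup is free, which is what makes repeated protocol runs cheap.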

CPE on CIFAR10 with different η. Values in round brackets are peak performance.



Peak performance during end-to-end training. The best one for each task is highlighted in bold.

CPE on GAN for end-to-end training. Values in brackets are peak performance.

