HOW MUCH PROGRESS HAVE WE MADE IN NEURAL NETWORK TRAINING? A NEW EVALUATION PROTOCOL FOR BENCHMARKING OPTIMIZERS

Abstract

Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which makes it tricky to benchmark their performance. In this work, we propose a new benchmarking protocol that evaluates both end-to-end efficiency (training a model from scratch without knowing the best hyperparameters) and data-addition training efficiency (periodically re-training the model with newly collected data, using previously selected hyperparameters). For end-to-end efficiency, unlike previous work that assumes random hyperparameter search and thus over-emphasizes tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study shows that our evaluation protocol matches human tuning behavior better than random search does. For data-addition training, we propose a new protocol for assessing hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to seven optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all tasks.

1. INTRODUCTION

Due to enormous data sizes and non-convexity, stochastic optimization algorithms have become widely used in training deep neural networks. In addition to Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), many variants such as Adagrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) have been proposed. Unlike classical, hyperparameter-free optimizers such as gradient descent and Newton's method[1], stochastic optimizers often have multiple hyperparameters, including the learning rate and momentum coefficients. These hyperparameters are critical not only to training speed but also to final performance, and are often hard to tune. It is thus non-trivial to benchmark and compare optimizers for deep neural network training, and a benchmarking mechanism that focuses on peak performance can create a false sense of improvement when new optimizers are developed without accounting for tuning effort. In this paper, we rethink the role of hyperparameter tuning in benchmarking optimizers and develop new benchmarking protocols that better reflect performance on practical tasks. We then benchmark seven recently proposed and widely used optimizers and study their performance on a wide range of tasks. In the following, we first briefly review the two existing benchmarking protocols, discuss their pros and cons, and then introduce our contributions.

Benchmarking performance under the best hyperparameters. A majority of previous benchmarks and comparisons of optimizers are based on the best hyperparameters. Wilson et al. (2017) and Shah et al. (2018) compared SGD-based methods against adaptive ones under their best hyperparameter configurations and found that SGD can outperform adaptive methods on several datasets under careful tuning. Most benchmarking frameworks for ML training likewise assume the best hyperparameters for each optimizer are known (Schneider et al., 2019; Coleman et al., 2017; Zhu et al., 2018).
The popular MLPerf benchmark also evaluated optimizers under their best hyperparameters, showing that ImageNet and BERT could be trained in 1 minute by combining good optimizers, good hyperparameters, and thousands of accelerators. Although this captures each optimizer's peak performance, benchmarking under the best hyperparameters makes comparisons between optimizers unreliable and fails to reflect their practical performance. First, the assumption of knowing the best hyperparameters is unrealistic: in practice, finding them requires considerable tuning effort, and tuning efficiency varies greatly across optimizers. It is also tricky to define the "best hyperparameters", which depend on the search range and grid. Further, since many of these optimizers are sensitive to hyperparameters, some improvements reported for new optimizers may come from insufficient tuning of prior work.

Benchmarking performance with random hyperparameter search. Several papers have pointed out that hyperparameter tuning needs to be considered when evaluating optimizers (Schneider et al., 2019; Asi & Duchi, 2019), but establishing a formal evaluation protocol for this is nontrivial. Only recently did two papers, Choi et al. (2019) and Sivaprasad et al. (2020), take hyperparameter tuning time into account when comparing SGD with Adam/Adagrad. However, their comparisons are based on random hyperparameter search. We argue that such comparisons over-emphasize the role of hyperparameter tuning, leading to a pessimistic and impractical benchmarking of optimizers, for the following reasons. First, under random search, each bad hyperparameter configuration must run to completion (e.g., 200 epochs). In practice, a user with a limited time budget can always stop a run early once a configuration is clearly bad.
For instance, if the learning rate for SGD is too large, a user can easily observe that SGD diverges within a few iterations and terminate the job directly. The random-search assumption therefore over-emphasizes the role of hyperparameter tuning and does not align with a real user's practical efficiency. Second, the performance of the best hyperparameters is crucial for many applications. For example, in many real-world settings the model is re-trained every day or every week with newly added data, so the best hyperparameters selected at the beginning can benefit all subsequent re-training runs instead of searching from scratch each time. In addition, because random search is expensive, evaluations based on it often focus on the low-accuracy region[2], whereas in practice we care about the time needed to reach reasonably good accuracy.

Our contributions. Given that hyperparameter tuning is either under-emphasized (assuming the best hyperparameters) or over-emphasized (assuming random search) in existing benchmarking protocols and comparisons, we develop new evaluation protocols that better reflect real use cases. Our evaluation framework includes two protocols. First, to evaluate the end-to-end training efficiency of training the best model from scratch, we compare the accuracy obtained under various time budgets, including hyperparameter tuning time. Instead of random search, we adopt the Hyperband algorithm (Li et al., 2017) for hyperparameter tuning, since it stops early on bad configurations and better reflects the running time a real user would incur. Second, we propose to evaluate data-addition training efficiency: re-training the model with newly added training data, using the best hyperparameters tuned on the previous training set.
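The early-stopping behavior that distinguishes Hyperband from random search can be illustrated with a minimal successive-halving sketch (the core subroutine of Hyperband). The `train_eval` interface and all names here are illustrative assumptions, not the benchmark's actual code:

```python
def successive_halving(configs, train_eval, min_epochs=1, eta=3):
    """One bracket of successive halving, the core of Hyperband (sketch).

    configs: list of hyperparameter settings (plain dicts here).
    train_eval(config, epochs) -> validation score (higher is better).
    Unlike pure random search, where every configuration consumes the
    full budget, bad configurations are dropped after only a few epochs.
    """
    epochs = min_epochs
    while len(configs) > 1:
        scores = [train_eval(c, epochs) for c in configs]
        # Keep the top 1/eta configurations; grow the budget for survivors.
        k = max(1, len(configs) // eta)
        order = sorted(range(len(configs)), key=lambda i: scores[i], reverse=True)
        configs = [configs[i] for i in order[:k]]
        epochs *= eta
    return configs[0]
```

With 27 starting configurations and eta=3, only three survivors ever train for 9x the minimum budget, mirroring how a practitioner kills obviously diverging runs early instead of letting each one run for 200 epochs.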
We also conduct a human study of how machine learning researchers tune optimizer hyperparameters and how their behavior aligns with our proposed protocols. Based on the proposed evaluation protocols, we study how much progress recently proposed algorithms have made relative to SGD and Adam. Note that most recently proposed optimizers have been shown to outperform SGD and Adam under the best hyperparameters on particular tasks, but it is unclear whether those improvements remain significant once hyperparameter tuning is taken into account and across a variety of tasks. To this end, we conduct comprehensive experiments comparing state-of-the-art training algorithms, including SGD (Robbins & Monro, 1951), Adam (Kingma & Ba, 2014), RAdam (Liu et al., 2019), Yogi (Zaheer et al., 2018), LARS (You et al., 2017), LAMB (You et al., 2019), and Lookahead (Zhang et al., 2019), on a variety of training tasks including image classification, generative adversarial networks (GANs), sentence classification (BERT fine-tuning), reinforcement learning, and graph neural network training. Our main conclusions are: 1) on CIFAR-10 and CIFAR-100, all the optimizers, including SGD, are competitive; 2) adaptive methods are generally better on more complex tasks (NLP, GCN, RL); 3) there is no clear winner among adaptive methods: although RAdam is more stable than Adam across tasks, Adam remains a very competitive baseline even against recently proposed methods.
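The data-addition protocol described above can be sketched as follows; `tune` and `train_eval` are assumed interfaces for illustration, not the paper's actual implementation:

```python
def data_addition_training(data_snapshots, tune, train_eval):
    """Sketch of the data-addition evaluation protocol (assumed interface).

    data_snapshots: growing datasets, e.g. one per re-training period.
    tune(data) -> best hyperparameter config found on that data.
    train_eval(config, data) -> test score after training from scratch.

    Hyperparameters are tuned only once, on the initial data, then reused
    for every periodic re-training with newly collected data. An optimizer
    whose scores stay high across snapshots is insensitive to data shift;
    a drop signals that its tuned hyperparameters do not transfer.
    """
    best_config = tune(data_snapshots[0])  # expensive tuning happens once
    scores = [train_eval(best_config, d) for d in data_snapshots]
    return best_config, scores
```

This mirrors the production setting the paper motivates, where full re-tuning on every data refresh would be prohibitively expensive.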



[1] The step sizes of gradient descent and Newton's method can be automatically adjusted by a line search procedure (Nocedal & Wright, 2006).

[2] For instance, Sivaprasad et al. (2020) only reach < 50% accuracy in their CIFAR-100 comparisons.

