HOW MUCH PROGRESS HAVE WE MADE IN NEURAL NETWORK TRAINING? A NEW EVALUATION PROTOCOL FOR BENCHMARKING OPTIMIZERS

Abstract

Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which makes it tricky to benchmark their performance. In this work, we propose a new benchmarking protocol that evaluates both end-to-end efficiency (training a model from scratch without knowing the best hyperparameters) and data-addition training efficiency (re-using previously selected hyperparameters to periodically re-train the model with newly collected data). For end-to-end efficiency, unlike previous work that assumes random hyperparameter tuning, which over-emphasizes tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study shows that our evaluation protocol matches human tuning behavior better than random search does. For data-addition training, we propose a new protocol for assessing hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all the tasks.

1. INTRODUCTION

Due to the enormous data size and non-convexity, stochastic optimization algorithms have become widely used in training deep neural networks. In addition to Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), many variants such as Adagrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) have been proposed. Unlike classical, hyperparameter-free optimizers such as gradient descent and Newton's method[1], stochastic optimizers typically have multiple hyperparameters, including the learning rate and momentum coefficients. These hyperparameters are critical not only to training speed but also to final performance, and they are often hard to tune. It is thus non-trivial to benchmark and compare optimizers for deep neural network training, and a benchmarking mechanism that focuses only on peak performance can create a false sense of improvement when developing new optimizers, since it ignores tuning effort.

In this paper, we aim to rethink the role of hyperparameter tuning in benchmarking optimizers and to develop new benchmarking protocols that better reflect their performance on practical tasks. We then benchmark seven recently proposed and widely used optimizers and study their performance on a wide range of tasks. In the following, we first briefly review the two existing benchmarking protocols, discuss their pros and cons, and then introduce our contributions.

Benchmarking performance under the best hyperparameters. A majority of previous benchmarks and comparisons of optimizers are based on the best hyperparameters. Wilson et al. (2017) and Shah et al. (2018) compared SGD-based methods against adaptive ones under their best hyperparameter configurations, and found that SGD can outperform adaptive methods on several datasets under careful tuning. Most benchmarking frameworks for ML training likewise assume that the best hyperparameters for each optimizer are known (Schneider et al., 2019; Coleman et al., 2017; Zhu et al., 2018).
Also, the popular MLPerf benchmark evaluates the performance of optimizers under their best hyperparameters. It showed that ImageNet and BERT could be trained in 1 minute using a combination of good optimizers, good hyperparameters, and thousands of accelerators.
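To make the bandit-style tuning protocol mentioned in the abstract concrete, the following is a minimal sketch of successive halving, a representative bandit hyperparameter tuning strategy. It is an illustration in the same spirit as the proposed protocol, not the paper's exact algorithm; the callback `train_eval` and the toy objective are hypothetical stand-ins for training a model under a given configuration and budget.

```python
# Minimal successive-halving sketch: train all surviving configurations
# for a growing budget, keep only the top 1/eta fraction each round.
# `train_eval(cfg, budget)` is a hypothetical callback returning a
# validation score (higher is better) after training `cfg` for `budget`
# units (e.g. epochs).

def successive_halving(configs, train_eval, min_budget=1, eta=2):
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Evaluate every surviving config at the current budget.
        scored = [(train_eval(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep the best 1/eta fraction and increase the budget.
        keep = max(1, len(survivors) // eta)
        survivors = [cfg for _, cfg in scored[:keep]]
        budget *= eta
    return survivors[0]

# Toy usage: the "configs" are candidate learning rates, and the mock
# objective simply favors values near 0.1 (a stand-in for validation
# accuracy, independent of budget for simplicity).
lrs = [0.001, 0.01, 0.1, 1.0]
mock = lambda lr, budget: -abs(lr - 0.1)
best = successive_halving(lrs, mock)  # -> 0.1
```

Compared with random search, which spends the full training budget on every sampled configuration, this strategy discards unpromising configurations early, which is closer to how practitioners actually tune.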

[1] The step sizes of gradient descent and Newton's method can be automatically adjusted by a line search procedure (Nocedal & Wright, 2006).

