WIDE-MINIMA DENSITY HYPOTHESIS AND THE EXPLORE-EXPLOIT LEARNING RATE SCHEDULE

Anonymous

Abstract

Several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments, we not only corroborate the generalization properties of wide minima, but also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy. For example, we achieve state-of-the-art (SOTA) accuracy for the IWSLT'14 (DE-EN) and WMT'14 (DE-EN) datasets by just modifying the learning rate schedule of a high-performing model.

1. INTRODUCTION

One of the fascinating properties of deep neural networks (DNNs) is their ability to generalize well, i.e., deliver high accuracy on the unseen test dataset. It is well-known that the learning rate (LR) schedule plays an important role in the generalization performance (Keskar et al., 2016; Wu et al., 2018; Goyal et al., 2017). In this paper, we study the question: what are the key properties of a learning rate schedule that help DNNs generalize well during training? We start with a series of experiments training Resnet18 on Cifar-10 over 200 epochs. We vary the number of epochs trained at a high LR of 0.1, called the explore epochs, from 0 to 100, and divide up the remaining epochs equally for training with LRs of 0.01 and 0.001. Note that the training loss typically stagnates around 50 epochs with the 0.1 LR. Despite that, we find that as the number of explore epochs increases to 100, the average test accuracy also increases. We also find that the minima found in higher test accuracy runs are wider than the minima from lower test accuracy runs, corroborating past work on wide minima and generalization (Keskar et al., 2016; Hochreiter & Schmidhuber, 1997; Jastrzebski et al., 2017; Wang et al., 2018). Moreover, what was particularly surprising was that, even when using fewer explore epochs, a few runs out of many trials still resulted in high test accuracies! Thus, we not only find that an initial exploration phase with a high learning rate is essential to the good generalization of DNNs, but that this exploration phase needs to be run for sufficient time, even if the training loss stagnates much earlier. Further, we find that, even when the exploration phase is not given sufficient time, a few runs still see high test accuracy values. To explain these observations, we hypothesize that, in the DNN loss landscape, the density of narrow minima is significantly higher than that of wide minima.
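The step schedule used in these experiments can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and default values beyond those stated in the text (200 total epochs, LRs of 0.1, 0.01, 0.001) are assumptions for the sketch.

```python
def step_schedule(epoch, explore_epochs, total_epochs=200,
                  high_lr=0.1, mid_lr=0.01, low_lr=0.001):
    """Step-wise LR schedule from the Resnet18/Cifar-10 experiments:
    train at a high LR for `explore_epochs`, then split the remaining
    epochs equally between the two lower LRs.
    (Illustrative sketch; names and structure are assumed.)"""
    if epoch < explore_epochs:
        return high_lr
    remaining = total_epochs - explore_epochs
    if epoch < explore_epochs + remaining // 2:
        return mid_lr
    return low_lr

# With 100 explore epochs: LR is 0.1 for epochs 0-99,
# 0.01 for epochs 100-149, and 0.001 for epochs 150-199.
```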
A large learning rate can escape narrow minima easily (as the optimizer can jump out of them with large steps). However, once it reaches a wide minimum, it is likely to get stuck in it (if the "width" of the wide minimum is large compared to the step size). With fewer explore epochs, a large learning rate might still occasionally get lucky in finding a wide minimum, but it invariably finds only a narrower minimum due to their higher density. As the explore duration increases, the probability of eventually landing in a wide minimum also increases. Thus, a minimum duration of explore is necessary to land in a wide minimum with high probability. Heuristic-based LR decay schemes such as cosine decay (Loshchilov & Hutter, 2016) implicitly maintain a higher LR for longer than schemes like linear decay. Thus, the hypothesis also explains
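The claim that cosine decay maintains a higher LR for longer than linear decay can be checked directly from the two decay formulas. The sketch below uses the standard cosine annealing formula from Loshchilov & Hutter (2016) and a straight linear ramp to zero; the function names and the base LR of 0.1 are illustrative choices, not values from this paper's experiments.

```python
import math

def cosine_decay(t, total_steps, lr0=0.1):
    # Cosine annealing: lr0/2 * (1 + cos(pi * t / T)).
    # Stays near lr0 early, drops fastest around the midpoint.
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / total_steps))

def linear_decay(t, total_steps, lr0=0.1):
    # Straight-line decay from lr0 to 0 over the run.
    return lr0 * (1 - t / total_steps)

# For the entire first half of training, cosine keeps the LR above
# the linear schedule, i.e., it implicitly explores for longer.
```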
