ADAPTIVE GRADIENT METHODS WITH LOCAL GUARANTEES

Abstract

Adaptive gradient methods are the method of choice for optimization in machine learning and are used to train the largest deep models. In this paper we study the problem of learning a local preconditioner that can change as the data changes along the optimization trajectory. We propose an adaptive gradient method with a provable adaptive regret guarantee against the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the practical value of our algorithm for learning rate adaptation in both online and offline settings. For the online experiments, we show that our method is robust to unforeseen distribution shifts during training and consistently outperforms popular off-the-shelf learning rate schedulers. For the offline experiments, in both vision and language domains, we demonstrate our method's robustness and its ability to select the optimal learning rate on the fly, achieving task performance comparable to that of well-tuned learning rate schedulers while using fewer total computational resources.

1. INTRODUCTION

Adaptive gradient methods have revolutionized optimization for machine learning and are routinely used for training deep neural networks. These algorithms are stochastic gradient-based methods that also incorporate a changing, data-dependent preconditioner (a multi-dimensional generalization of the learning rate). Their empirical success is accompanied by provable guarantees: on any optimization trajectory with given gradients, the adaptive preconditioner is comparable to the best preconditioner in hindsight, in terms of the rate of convergence to local optimality.

Since their introduction, these methods have been the subject of intense investigation over the past decade, with a literature spanning thousands of publications; some highlights are surveyed below. The common intuitive explanation of their success is their ability to change the preconditioner, or learning rate matrix, per coordinate and on the fly. A principled way of changing the learning rate allows treating rare but important coordinates differently from commonly appearing features of the data, and thus achieving faster convergence.

In this paper we investigate whether a more refined goal can be attained: can we adapt the learning rate per coordinate, and also over short time intervals? The intuition guiding this question is the rising popularity of "exotic" learning rate schedules for training deep neural networks. The hope is that an adaptive learning rate algorithm can automatically tune its preconditioner, on a per-coordinate and per-time basis, so as to guarantee optimal behavior even locally.

To pursue this goal, we use and improve upon techniques from the literature on adaptive regret in online learning to create a provable method capable of attaining optimal regret on any sub-interval of the optimization trajectory. We then test the resulting method and compare it to learning a learning rate schedule from scratch.
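As a concrete illustration of a per-coordinate, data-dependent preconditioner, the sketch below implements a minimal diagonal AdaGrad-style update (this is standard background, not the algorithm proposed in this paper; the function name and step size are ours). Each coordinate is scaled by the inverse root of its accumulated squared gradients, so coordinates with consistently large gradients receive smaller effective learning rates.

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """One diagonal-preconditioned gradient step.

    Each coordinate of w is scaled by 1 / sqrt(sum of its past squared
    gradients), i.e., the preconditioner is the diagonal matrix
    diag(accum) ** (-1/2).
    """
    accum = accum + g * g                       # per-coordinate second-moment accumulator
    w = w - lr * g / (np.sqrt(accum) + eps)     # preconditioned update
    return w, accum

# Minimize the badly conditioned quadratic f(w) = 0.5 * w @ diag(h) @ w,
# whose gradient is h * w: the two coordinates need very different step sizes.
h = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for _ in range(500):
    g = h * w
    w, accum = adagrad_step(w, g, accum)
```

A fixed scalar learning rate small enough for the stiff first coordinate would make progress on the second coordinate very slow; the diagonal preconditioner sidesteps this trade-off automatically.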
Our experiments validate that the algorithm improves accuracy and robustness over existing algorithms on online tasks, and on offline tasks it reduces the overall computational resources spent on hyperparameter optimization.
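To make the notion of adaptive (interval) regret concrete, the following toy experiment runs plain online gradient descent on 1-D quadratic losses whose minimizer shifts midway, then measures regret against the best fixed comparator both on the full horizon and on the post-shift interval. This is our own illustration of the regret notion, not the paper's method or experimental setup.

```python
import numpy as np

def ogd(zs, lr0=0.5):
    """Online gradient descent on losses l_t(w) = (w - z_t)^2, step lr0/sqrt(t)."""
    w, ws = 0.0, []
    for t, z in enumerate(zs, start=1):
        ws.append(w)                             # iterate played in round t
        w -= (lr0 / np.sqrt(t)) * 2 * (w - z)    # gradient step
    return np.array(ws)

def interval_regret(ws, zs, lo, hi):
    """Regret on rounds [lo, hi): played loss minus best fixed point in hindsight."""
    w_star = zs[lo:hi].mean()                    # minimizer of sum_t (w - z_t)^2
    played = ((ws[lo:hi] - zs[lo:hi]) ** 2).sum()
    best = ((w_star - zs[lo:hi]) ** 2).sum()
    return played - best

T = 1000
zs = np.concatenate([np.zeros(T // 2), np.ones(T // 2)])  # distribution shift at T/2
ws = ogd(zs)
r_full = interval_regret(ws, zs, 0, T)
r_late = interval_regret(ws, zs, T // 2, T)
```

Measured on the full horizon, the learner looks fine (the best fixed comparator must compromise between the two phases), yet its regret on the post-shift interval is strictly positive because the iterates lag behind the shift. An adaptive-regret guarantee bounds the latter quantity simultaneously for every sub-interval.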

