ADAPTIVE GRADIENT METHODS WITH LOCAL GUARANTEES

Abstract

Adaptive gradient methods are the method of choice for optimization in machine learning and are used to train the largest deep models. In this paper we study the problem of learning a local preconditioner that can change as the data changes along the optimization trajectory. We propose an adaptive gradient method with a provable adaptive regret guarantee vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the practical value of our algorithm for learning rate adaptation in both online and offline settings. For the online experiments, we show that our method is robust to unforeseen distribution shifts during training and consistently outperforms popular off-the-shelf learning rate schedulers. For the offline experiments, in both vision and language domains, we demonstrate our method's robustness and its ability to select the optimal learning rate on the fly, achieving task performance comparable to well-tuned learning rate schedulers while using fewer total computational resources.

1. INTRODUCTION

Adaptive gradient methods have revolutionized optimization for machine learning and are routinely used for training deep neural networks. These algorithms are stochastic gradient-based methods that also incorporate a changing, data-dependent preconditioner (a multi-dimensional generalization of the learning rate). Their empirical success is accompanied by provable guarantees: on any optimization trajectory with given gradients, the adapting preconditioner is comparable to the best in hindsight, in terms of rate of convergence to local optimality. Since their introduction, their success has been a subject of intense investigation, with a literature spanning thousands of publications over the past decade; some highlights are surveyed below. The common intuitive explanation of their success is their ability to change the preconditioner, or learning rate matrix, per coordinate and on the fly. A methodical way of changing the learning rate allows treating important coordinates differently from commonly appearing features of the data, and thus achieves faster convergence.

In this paper we investigate whether a more refined goal can be attained: namely, can we adapt the learning rate per coordinate, and also over short time intervals? The intuition guiding this question is the rising popularity of "exotic" learning rate schedules for training deep neural networks. The hope is that an adaptive learning rate algorithm can automatically tune its preconditioner, on a per-coordinate and per-time basis, so as to guarantee optimal behavior even locally.

To pursue this goal, we use and improve upon techniques from the literature on adaptive regret in online learning to create a provable method that attains optimal regret on any sub-interval of the optimization trajectory. We then test the resulting method and compare it to learning a learning rate schedule from scratch.
Our experiments validate that the algorithm improves accuracy and robustness over existing algorithms on online tasks, and that on offline tasks it saves overall computational resources for hyperparameter optimization.

1.1. STATEMENT OF OUR RESULTS

The (stochastic/sub-)gradient descent algorithm is given by the iterative update rule $x_{\tau+1} = x_\tau - \eta_\tau \nabla_\tau$. If $\eta_\tau$ is a matrix, it is usually called a preconditioner. A notable example of a preconditioner is $\eta_\tau$ equal to the inverse Hessian (or second differential), which gives Newton's method.

Let $\nabla_1, \ldots, \nabla_T$ be the gradients observed along an optimization trajectory. The Adagrad algorithm (and subsequent adaptive gradient methods, notably Adam) achieves the following regret guarantee for online convex optimization (OCO):
$$\tilde{O}\left(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=1}^{T} \|\nabla_\tau\|_H^{*2}}\right),$$
where $\mathcal{H}$ is a family of matrix norms, most commonly those with a bounded trace. In this paper we propose a new algorithm, SAMUEL, which improves upon this guarantee in terms of local performance over any sub-interval of the optimization trajectory. For any sub-interval $I = [s, t]$, the regret over $I$ can be bounded by
$$\tilde{O}\left(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \|\nabla_\tau\|_H^{*2}}\right),$$
which also implies a new regret bound over $[1, T]$: for any partition of $[1, T]$ into consecutive intervals $I_1, \ldots, I_k$,
$$\tilde{O}\left(\min_{k} \min_{H_1, \ldots, H_k \in \mathcal{H}} \sum_{j=1}^{k} \sqrt{\sum_{\tau \in I_j} \|\nabla_\tau\|_{H_j}^{*2}}\right).$$
This regret can be significantly lower than that of Adagrad, Adam, and other global adaptive gradient methods that do not optimize the preconditioner locally. We spell out such a scenario in the next subsection.

Our main technical contribution is a variant of the multiplicative weights algorithm that achieves a full-matrix regret bound over any interval by automatically selecting the optimal local preconditioner. The difficulty in this new update method stems from the fact that the optimal multiplicative update parameter, needed to choose the best preconditioner, depends on future gradients and cannot be determined in advance. To overcome this difficulty, we run many instantiations of the update rule in parallel, and show that this increases the number of base adaptive gradient methods by only a logarithmic factor. A comparison of our results in terms of adaptive regret is given in Table 1.
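For concreteness, the following sketch implements the standard diagonal-Adagrad update described above, where the preconditioner is built per coordinate from the running sum of squared gradients. This is background only, not the SAMUEL algorithm; the base learning rate and iteration count are illustrative constants.

```python
import numpy as np

def adagrad_step(x, grad, state, base_lr=0.1, eps=1e-8):
    """One diagonal-Adagrad step: eta_tau is a diagonal preconditioner
    scaling each coordinate by 1 / sqrt(sum of its squared gradients)."""
    state = state + grad ** 2                    # accumulate squared gradients
    precond = base_lr / (np.sqrt(state) + eps)   # per-coordinate step sizes
    return x - precond * grad, state

# Usage: minimize f(x) = ||x||^2 from a fixed starting point.
x = np.array([3.0, -2.0])
state = np.zeros_like(x)
for _ in range(1000):
    g = 2.0 * x                                  # gradient of ||x||^2
    x, state = adagrad_step(x, g, state)
```

Note how a coordinate with persistently large gradients accumulates a large `state` entry and is therefore updated more conservatively, which is the per-coordinate adaptivity discussed above.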
We conduct experiments on optimal learning rate scheduling to support our theoretical findings. We show that on an online vision classification task with distribution shifts unknown to the learning algorithm, our method achieves better accuracy than previous algorithms. For offline tasks, our method robustly achieves near-optimal performance, with fewer overall computational resources spent on hyperparameter optimization.

1.2. WHEN DO LOCAL GUARANTEES HAVE AN ADVANTAGE?

Our algorithm provides near-optimal adaptive regret bounds for every sub-interval $[s, t] \subset [1, T]$ simultaneously, giving a more stable regret guarantee in a changing environment. In terms of the classical regret bound over the whole interval $[1, T]$, our algorithm obtains the optimal bound of Adagrad up to an $O(\sqrt{\log T})$ factor. Moreover, adaptive regret guarantees can drastically improve the loss over the entire interval. Consider the following example in one dimension. For $t \in [1, T/2]$ the loss function is $f_t(x) = (x+1)^2$, and for the remaining rounds it is $f_t(x) = (x-1)^2$. Running standard online gradient descent with the step size known to be optimal for strongly convex losses, $\eta_t = \frac{1}{t}$, gives $O(\log T)$ regret. However, the overall loss is $\Omega(T)$, because the best comparator in hindsight is $x = 0$, which itself has overall loss $T$. With adaptive regret guarantees, in contrast, the regret on each of $[1, T/2]$ and $[T/2+1, T]$ is $O(\log T)$ against local comparators with zero loss, so the overall loss is $O(\log T)$: an $\Omega(T)$ improvement.
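The gap in this example can be checked numerically. The sketch below runs online gradient descent with $\eta_t = 1/t$ once over the whole horizon, and once restarted at the distribution shift; the restart is a stand-in for what an adaptive-regret guarantee delivers automatically (the algorithm does not know the shift time), and the horizon $T$ is an illustrative constant.

```python
# T rounds; the loss switches from (x+1)^2 to (x-1)^2 at the midpoint.
T = 10_000

def loss(t, x):
    c = 1.0 if t <= T // 2 else -1.0
    return (x + c) ** 2

def grad(t, x):
    c = 1.0 if t <= T // 2 else -1.0
    return 2.0 * (x + c)

# OGD with eta_t = 1/t over the whole horizon: after the shift the step
# size is already tiny, so the iterate adapts slowly and the total loss
# grows linearly in T.
x, global_loss = 0.0, 0.0
for t in range(1, T + 1):
    global_loss += loss(t, x)
    x -= grad(t, x) / t

# The same rule restarted at the shift: each segment is tracked with a
# fresh step-size schedule, and the total loss stays O(log T).
restart_loss = 0.0
for start, stop in [(1, T // 2), (T // 2 + 1, T)]:
    x = 0.0
    for i in range(1, stop - start + 2):
        t = start + i - 1
        restart_loss += loss(t, x)
        x -= grad(t, x) / i
```

Running this, the single-schedule run accumulates loss on the order of $T$ in the second half, while the restarted run pays only a small constant on each segment.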

