ADAPTIVE GRADIENT METHODS WITH LOCAL GUARANTEES

Abstract

Adaptive gradient methods are the method of choice for optimization in machine learning and are used to train the largest deep models. In this paper we study the problem of learning a local preconditioner that can change as the data changes along the optimization trajectory. We propose an adaptive gradient method with provable adaptive regret guarantees against the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the practical value of our algorithm for learning rate adaptation in both online and offline settings. In the online experiments, we show that our method is robust to unforeseen distribution shifts during training and consistently outperforms popular off-the-shelf learning rate schedulers. In the offline experiments, spanning vision and language domains, we demonstrate our method's robustness and its ability to select the optimal learning rate on-the-fly, achieving task performance comparable to that of well-tuned learning rate schedulers while using less total computation.

1. INTRODUCTION

Adaptive gradient methods have revolutionized optimization for machine learning and are routinely used for training deep neural networks. These algorithms are stochastic gradient based methods that also incorporate a changing, data-dependent preconditioner (a multi-dimensional generalization of the learning rate). Their empirical success is accompanied by provable guarantees: on any optimization trajectory with given gradients, the adapting preconditioner is comparable to the best in hindsight, in terms of the rate of convergence to local optimality. Since their introduction, their success has been a source of intense investigation, with a literature spanning thousands of publications; some highlights are surveyed below.

The common intuitive understanding of their success is their ability to change the preconditioner, or learning rate matrix, per coordinate and on the fly. A methodical way of changing the learning rate allows treating important coordinates differently from commonly appearing features of the data, and thus achieves faster convergence.

In this paper we investigate whether a more refined goal can be attained: namely, can we adapt the learning rate per coordinate, and also over short time intervals? The intuition guiding this question is the rising popularity of "exotic learning rate schedules" for training deep neural networks. The hope is that an adaptive learning rate algorithm can automatically tune its preconditioner, on a per-coordinate and per-time basis, so as to guarantee optimal behavior even locally. To pursue this goal, we use and improve upon techniques from the literature on adaptive regret in online learning to create a provable method that attains optimal regret on any sub-interval of the optimization trajectory. We then test the resulting method and compare it to learning a learning rate schedule from scratch.
Our experiments validate that the algorithm improves accuracy and robustness over existing algorithms on online tasks, and that on offline tasks it saves overall computational resources for hyperparameter optimization.

1.1. STATEMENT OF OUR RESULTS

The (stochastic/sub)-gradient descent algorithm is given by the iterative update rule $x_{\tau+1} = x_\tau - \eta_\tau \nabla_\tau$. If $\eta_\tau$ is a matrix, it is usually called a preconditioner; a notable example is $\eta_\tau$ equal to the inverse Hessian (or second differential), which gives Newton's method. Let $\nabla_1, \ldots, \nabla_T$ be the gradients observed along an optimization trajectory. The Adagrad algorithm (and subsequent adaptive gradient methods, notably Adam) achieves the following regret guarantee for online convex optimization (OCO):

$$\tilde{O}\Big(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=1}^{T} \|\nabla_\tau\|_H^{*2}}\Big),$$

where $\mathcal{H}$ is a family of matrix norms, most commonly those induced by matrices with bounded trace.

In this paper we propose a new algorithm, SAMUEL, which improves upon this guarantee in terms of local performance over any sub-interval of the optimization trajectory. For any sub-interval $I = [s, t]$, the regret over $I$ can be bounded by $\tilde{O}\big(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \|\nabla_\tau\|_H^{*2}}\big)$, which also implies a new regret bound over $[1, T]$:

$$\tilde{O}\Big(\min_{k}\ \min_{H_1, \ldots, H_k \in \mathcal{H}} \sum_{j=1}^{k} \sqrt{\sum_{\tau \in I_j} \|\nabla_\tau\|_{H_j}^{*2}}\Big),$$

where $I_1, \ldots, I_k$ ranges over partitions of $[1, T]$ into $k$ intervals. This regret can be significantly lower than the regret of Adagrad, Adam, and other global adaptive gradient methods that do not optimize the preconditioner locally; we spell out such a scenario in the next subsection.

Our main technical contribution is a variant of the multiplicative weight algorithm that achieves a full-matrix regret bound over any interval by automatically selecting the optimal local preconditioner. The difficulty in this new update method stems from the fact that the optimal multiplicative update parameter, needed to choose the best preconditioner, depends on future gradients and cannot be determined in advance. To overcome this difficulty, we run many instantiations of the update rule in parallel, and show that this increases the number of base adaptive gradient methods by only a logarithmic factor. A comparison of our results in terms of adaptive regret is given in Table 1.
We conduct experiments in optimal learning rate scheduling to support our theoretical findings. We show that for an online vision classification task with distribution shifts unknown to the learning algorithm, our method achieves better accuracy than previous algorithms. For offline tasks, our method is able to achieve near-optimal performance robustly, with fewer overall computational resources in hyperparameter optimization.

1.2. WHEN DO LOCAL GUARANTEES HAVE AN ADVANTAGE?

Our algorithm provides near-optimal adaptive regret bounds for every sub-interval $[s, t] \subseteq [1, T]$ simultaneously, giving more stable regret guarantees in a changing environment. In terms of the classical regret bound over the whole interval $[1, T]$, our algorithm matches the optimal bound of Adagrad up to an $O(\sqrt{\log T})$ factor. Moreover, adaptive regret guarantees can drastically improve the loss over the entire interval. Consider the following one-dimensional example: for $t \in [1, T/2]$ the loss function is $f_t(x) = (x+1)^2$, and for the remaining rounds it is $f_t(x) = (x-1)^2$. Running standard online gradient descent with the step size known to be optimal for strongly convex losses, i.e. $\eta_t = 1/t$, gives $O(\log T)$ regret. However, the overall loss is $\Omega(T)$, because the best fixed comparator in hindsight is $x = 0$, which itself has overall loss $T$. With adaptive regret guarantees, the regret on each of $[1, T/2]$ and $[T/2 + 1, T]$ is $O(\log T)$, so the overall loss is polylogarithmic: a dramatic $\Omega(T)$ improvement.
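The two-phase example above can be simulated directly. The following sketch (an illustration, not the paper's implementation) compares plain online gradient descent using $\eta_t = 1/t$ with a variant that restarts its step-size schedule at the phase boundary, which is the behavior an adaptive-regret method recovers automatically; the function names are ours.

```python
# f_t(x) = (x+1)^2 for the first half of the horizon, (x-1)^2 afterwards.
# Without a restart, gradient descent with eta_t = 1/t tracks a single fixed
# comparator and pays total loss linear in T; restarting the schedule at the
# phase boundary keeps the total loss small.

def total_loss(T, restart):
    x, total, t_local = 0.0, 0.0, 0
    for t in range(1, T + 1):
        if restart and t == T // 2 + 1:
            t_local = 0  # fresh 1/t schedule for the new phase
        t_local += 1
        target = -1.0 if t <= T // 2 else 1.0
        total += (x - target) ** 2
        x -= 2 * (x - target) / t_local  # gradient step with eta_t = 1/t
    return total

T = 10_000
print(total_loss(T, restart=False))  # grows linearly with T
print(total_loss(T, restart=True))   # stays bounded by a small constant here
```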

Algorithm                                    Regret over $I = [s, t]$
Hazan & Seshadhri (2007)                     $\tilde O(\sqrt{T})$
Daniely et al. (2015), Jun et al. (2017)     $\tilde O(\sqrt{|I|})$
Cutkosky (2020)                              $\tilde O\big(\sqrt{\sum_{\tau=s}^{t} \|\nabla_\tau\|^2}\big)$
SAMUEL (ours)                                $\tilde O\big(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \|\nabla_\tau\|_H^{*2}}\big)$

Table 1: Comparison of results. We evaluate the regret performance of the algorithms on an arbitrary interval I = [s, t]; for ease of presentation we hide secondary parameters. Our algorithm achieves the regret bound of Adagrad, which is known to be tight in general, on any interval.

1.3. RELATED WORK

Our work lies at the intersection of two related areas, surveyed below: adaptive gradient methods for continuous optimization, and adaptive regret algorithms for regret minimization.

Adaptive Gradient Methods. Adaptive gradient methods and the Adagrad algorithm were proposed in (Duchi et al., 2011). Soon afterwards followed other popular algorithms, most notably Adam (Kingma & Ba, 2014) and RMSprop (Tieleman & Hinton, 2012). Despite significant practical impact, their properties are still debated (Wilson et al., 2017). Numerous efforts were made to improve upon these adaptive gradient methods in terms of parallelization, memory consumption, and computational efficiency across batch sizes, e.g. (Shazeer & Stern, 2018; Agarwal et al., 2019; Gupta et al., 2018; Chen et al., 2019). Surveys of adaptive gradient methods appear in Goodfellow et al. (2016); Hazan (2019).

Adaptive Regret Minimization in Online Convex Optimization. The concept of competing with a changing comparator was pioneered in the work of (Herbster & Warmuth, 1998; Bousquet & Warmuth, 2003) on tracking the best expert. Motivated by computational considerations for convex optimization, the notion of adaptive regret was first introduced by Hazan & Seshadhri (2007), which generalizes regret by considering the regret of every interval. They also provided an algorithm, Follow-The-Leading-History, which attains $\tilde O(\sqrt{T})$ adaptive regret. Daniely et al. (2015) considered the worst regret among all intervals of the same length and obtained interval-length dependent $\tilde O(\sqrt{|I|})$ bounds, later improved by Jun et al. (2017) and Cutkosky (2020). Among other related work, the dynamic regret of strongly adaptive methods was considered by Zhang et al. (2018; 2020), with further related results in Zhang et al. (2019) and Li & Arora (2019). Learning the learning rate schedule itself was studied in Wu et al. (2018).
Large-scale experimental evaluations (Choi et al., 2019; Schmidt et al., 2020; Nado et al., 2021) conclude that hyperparameter optimization over the learning rate schedules are essential to state-of-the-art performance.

2. SETTING AND PRELIMINARIES

Online convex optimization. Consider the problem of online convex optimization (see Hazan (2016) for a comprehensive treatment). At each round $\tau$, the learner outputs a point $x_\tau \in K$ for some convex domain $K \subset \mathbb{R}^d$, then suffers a convex loss $\ell_\tau(x_\tau)$ chosen by the adversary. The learner also receives the sub-gradients $\nabla_\tau$ of $\ell_\tau(\cdot)$ at $x_\tau$. The goal of the learner in OCO is to minimize regret, defined as

$$\text{Regret} = \sum_{\tau=1}^{T} \ell_\tau(x_\tau) - \min_{x \in K} \sum_{\tau=1}^{T} \ell_\tau(x).$$

Henceforth we make the following basic assumptions for simplicity (these assumptions are known in the literature to be removable):

Assumption 1. There exist $D, D_\infty > 1$ such that $\|x\|_2 \le D$ and $\|x\|_\infty \le D_\infty$ for any $x \in K$.

Assumption 2. There exists $G > 1$ such that $\|\nabla_\tau\|_2 \le G$ for all $\tau \in [1, T]$.

For any PSD matrix $H$ we denote the norm $\|\nabla\|_H = \sqrt{\nabla^\top H \nabla}$ and its dual norm $\|\nabla\|_H^* = \sqrt{\nabla^\top H^{-1} \nabla}$. In particular, we denote $\mathcal{H} = \{H \mid H \succeq 0,\ \operatorname{tr}(H) \le d\}$. We consider Adagrad from Duchi et al. (2011), which achieves the following regret if run on $I = [s, t]$:

$$\text{Regret}(I) = O\Big(D d^{\frac12} \sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau}\Big)$$

Algorithm 1 Strongly Adaptive regularization via MUltiplicative-wEights (SAMUEL)
Input: OCO algorithm A, geometric interval set S, constant $Q = 4\log(dTD^2G^2)$.
Initialize: for each $I \in S$, $Q$ copies of OCO algorithm $A_{I,q}$. Set $\eta_{I,q} = \frac{1}{2GD \cdot 2^q}$ for $q \in [1, Q]$. Initialize $w_1(I, q) = \min\{1/2, \eta_{I,q}\}$ if $I = [1, s]$ for some $s$, and $w_1(I, q) = 0$ otherwise, for each $I \in S$.
for $\tau = 1, \ldots, T$ do
  Let $x_\tau(I, q) = A_{I,q}(\tau)$.
  Let $W_\tau = \sum_{I \in S(\tau), q} w_\tau(I, q)$.
  Let $x_\tau = \sum_{I \in S(\tau), q} w_\tau(I, q)\, x_\tau(I, q) / W_\tau$.
  Predict $x_\tau$. Receive loss $\ell_\tau(x_\tau)$, define $r_\tau(I) = \ell_\tau(x_\tau) - \ell_\tau(x_\tau(I, q))$.
  For each $I = [s, t] \in S$, update $w_{\tau+1}(I, q)$ as follows:
    $w_{\tau+1}(I, q) = 0$ if $\tau + 1 \notin I$; $\min\{1/2, \eta_{I,q}\}$ if $\tau + 1 = s$; $w_\tau(I, q)(1 + \eta_{I,q} r_\tau(I))$ otherwise.
end for

The multiplicative weight method.
The multiplicative weight algorithm is a generic algorithmic methodology first used to achieve vanishing regret for the problem of prediction from expert advice (Littlestone & Warmuth, 1994). Various variants of this method, which attain expert regret of $O(\sqrt{T \log N})$ for binary prediction with $N$ experts, are surveyed in Arora et al. (2012).
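A minimal sketch of the multiplicative weight (Hedge) primitive described above may be helpful; the function name and the fixed step size below are illustrative choices, and losses are assumed to lie in [0, 1].

```python
import math

def hedge(expert_losses, eta):
    """expert_losses: a T x N sequence of per-round expert losses."""
    n = len(expert_losses[0])
    w = [1.0] * n
    total = 0.0
    for losses in expert_losses:
        W = sum(w)
        # play the weighted average of the experts (valid for convex losses)
        total += sum(wi * li for wi, li in zip(w, losses)) / W
        # exponential down-weighting of experts in proportion to their loss
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return total

# Toy stream with N = 2: expert 0 is always right, expert 1 always wrong,
# so the algorithm's total loss equals its regret against the best expert.
T = 1000
stream = [[0.0, 1.0]] * T
regret = hedge(stream, eta=math.sqrt(math.log(2) / T))
print(regret)  # on the order of sqrt(T * log N), about 26 here
```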

3. AN IMPROVED ADAPTIVE REGRET ALGORITHM

In this section, we describe SAMUEL (Algorithm 1), which combines a novel variant of multiplicative weights with adaptive gradient methods to obtain stronger regret bounds in online learning and optimization. Given any black-box OCO algorithm A for its experts, Algorithm 1 guarantees an

$$\tilde O\Big(\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau}\Big)$$

regret bound (w.r.t. the experts) over every interval $J = [s, t]$ simultaneously. By setting Adagrad as the black-box OCO algorithm A, this bound matches the regret of the best expert and consequently holds w.r.t. any fixed comparator, implying an optimal full-matrix adaptive regret bound.

Roughly speaking, Algorithm 1 first picks a subset S of all sub-intervals and initiates an instance of the black-box OCO algorithm A on each interval $I \in S$ as an expert. The expert for interval I is designed to achieve optimal regret over I instead of [1, T]. To improve upon previous works and achieve the full-matrix regret bound, we make $O(\log T)$ duplicates of each expert with different decay factors $\eta$, which is the main novel mechanism of our algorithm (these duplicates share the same model and therefore do not increase the computational cost). Algorithm 1 then runs a multiplicative weight update on all active experts, where $A_{I,q}$ denotes the expert over I with the q-th decay factor $\eta$ (active if $\tau \in I$), according to the loss of their own predictions, normalized by the loss of the actual output of the algorithm. We follow Daniely et al. (2015) in the construction of S: without loss of generality, we assume $T = 2^k$ and define the geometric covering intervals as in Definition 1 below. The intuition behind using S is to avoid the $\Omega(T)$ per-round computational cost of the naive method, which constructs an expert for every sub-interval of [1, T].
Definition 1. Define $S_i = \{[1, 2^i], [2^i + 1, 2^{i+1}], \ldots, [2^k - 2^i + 1, 2^k]\}$ for $0 \le i \le k$. Define $S = \cup_i S_i$ and $S(\tau) = \{I \in S \mid \tau \in I\}$. For $2^k < T < 2^{k+1}$, one can similarly define $S_i = \{[1, 2^i], [2^i + 1, 2^{i+1}], \ldots, [2^i \lfloor \frac{T-1}{2^i} \rfloor + 1, T]\}$; see Daniely et al. (2015).

Hence at any time $\tau$ the number of 'active' intervals is only $O(\log T)$, which guarantees that the running time and memory cost per round of SAMUEL is $O(\log T)$. Decompose the total regret over an interval J as $R_0(J) + R_1(J)$, where $R_0(J)$ is the regret of an expert $A_J$ and $R_1(J)$ is the regret of the multiplicative weight part of Algorithm 1. Our main theoretical result is the following:

Theorem 2. Under Assumptions 1 and 2, the regret $R_1(J)$ of the multiplicative weight part of Algorithm 1 satisfies, for any interval $J = [s, t]$,

$$R_1(J) = O\Big(D \log(T) \max\Big\{G \log(T),\ d^{\frac12}\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \|\nabla_\tau\|_H^{*2}}\Big\}\Big)$$

Remark 3. We note that $r_\tau(I, q)$ and $x_\tau(I, q)$ do not depend on q for the same I, so we may write $r_\tau(I)$ and $x_\tau(I)$ for simplicity. We use a convex combination in Algorithm 1 because the loss is convex; otherwise, we can still sample according to the weights. In contrast, the vanilla weighted majority algorithm achieves $\tilde O(\sqrt{T})$ regret only over the whole interval [1, T], and we improve upon the previous best result $\tilde O(\sqrt{t - s})$ of Daniely et al. (2015); Jun et al. (2017). The proof of Theorem 2 can be found in the appendix.
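The geometric covering intervals of Definition 1 can be sketched concretely as follows (function names are ours; T is assumed to be a power of two for simplicity):

```python
def geometric_intervals(T):
    """All intervals [s, s + 2^i - 1] tiling [1, T] at every scale i."""
    S, length = [], 1
    while length <= T:
        S.extend((s, s + length - 1) for s in range(1, T + 1, length))
        length *= 2
    return S

def active(S, tau):
    """Intervals containing round tau: exactly one per scale, O(log T) total."""
    return [I for I in S if I[0] <= tau <= I[1]]

S = geometric_intervals(16)
print(len(S))             # 16 + 8 + 4 + 2 + 1 = 31 intervals overall
print(len(active(S, 7)))  # 5 scales for T = 16, so 5 active intervals
```

This illustrates why the per-round overhead is only logarithmic: a round is contained in exactly one interval per scale.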

3.1. OPTIMAL ADAPTIVE REGRET WITH ADAPTIVE GRADIENT METHODS

In this subsection, we show how to achieve full-matrix adaptive regret bounds by using Adagrad as the experts, as an application of Theorem 2, together with other extensions. We note that this reduction is general and can be applied with any adaptive gradient method that has a regret guarantee, such as Adam or Adadelta. Theorem 2 bounds the regret $R_1$ of the multiplicative weight part, while the total regret is $R_0 + R_1$. To obtain the optimal total regret bound, we only need an expert algorithm that also has an optimal full-matrix regret bound matching that of $R_1$. As a result, we choose Adagrad as our expert algorithm A, and prove regret bounds for both the full-matrix and diagonal-matrix versions.

Full-matrix adaptive regularization.

Corollary 4 (Full-matrix Adaptive Regret Bound). Under Assumptions 1 and 2, when Adagrad is used as the black box A, the total regret of Algorithm 1 satisfies, for any interval $I = [s, t]$,

$$\text{Regret}(I) = O\Big(D \log(T) \max\Big\{G \log(T),\ d^{\frac12}\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \|\nabla_\tau\|_H^{*2}}\Big\}\Big)$$

Remark 5. The additional $\log(T)$ overhead arises from the use of S together with Cauchy–Schwarz. We remark that by replacing S with the set of all sub-intervals, the same analysis yields an improved bound with only a $\log(T)$ overhead. On the other hand, this improvement in the regret bound comes at the cost of efficiency: each round would then require $\Theta(T)$ computations.

Diagonal-matrix adaptive regularization. If we restrict our expert optimization algorithm to diagonal Adagrad, we can derive a similar guarantee for the adaptive regret.

Corollary 6. Under Assumptions 1 and 2, when diagonal Adagrad is used as the black box A, the total regret of Algorithm 1 satisfies, for any interval $I = [s, t]$,

$$\text{Regret}(I) = \tilde O\Big(D_\infty \sum_{i=1}^{d} \|\nabla_{s:t,i}\|_2\Big)$$

Here $\nabla_{s:t,i}$ denotes the vector of $i$-th coordinates $(\nabla_{s,i}, \ldots, \nabla_{t,i})$.
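For concreteness, the diagonal Adagrad expert referenced in Corollary 6 can be sketched as follows; the base learning rate and epsilon are illustrative choices, not values from the paper.

```python
import numpy as np

def diagonal_adagrad(grads, lr=0.1, eps=1e-8):
    """Per-coordinate steps scaled by inverse sqrt of accumulated squared grads."""
    x = np.zeros_like(grads[0])
    accum = np.zeros_like(x)
    for g in grads:
        accum += g * g                        # running sum of squared gradients
        x -= lr * g / (np.sqrt(accum) + eps)  # per-coordinate preconditioning
    return x

# With a constant gradient, each coordinate moves lr * sign(g) / sqrt(t) per
# step, so a 100x difference in gradient magnitude yields equal displacement.
grads = [np.array([1.0, 0.01]) for _ in range(100)]
print(diagonal_adagrad(grads))  # both coordinates end up near -1.86
```

This per-coordinate equalization is the intuition behind treating rare but informative features on par with common ones.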

4. EXPERIMENTS

In this section, we demonstrate the empirical effectiveness of the proposed framework in online and offline learning scenarios. For the online learning experiment, we consider a simulated data distribution shift setting using CIFAR-10. For offline supervised learning, we experiment on standard benchmarks in the vision and natural language processing domains.

4.1. ONLINE EXPERIMENTS

experiment setup: Our simulated online experiment is designed to assess robustness to unforeseen data distribution changes during training; algorithms do not know in advance whether or when the data shift will happen. We design this online distribution shift with the CIFAR-10 dataset, which we partition into two non-overlapping groups of five classes each. We denote by D_1 the distribution of the first subset of data {X_1, Y_1} and by D_2 that of the other subset {X_2, Y_2}. Specifically, the two subsets used in our implementation have categories {dog, frog, horse, ship, truck} and {airplane, automobile, bird, cat, deer}. We shift the data from D_1 to D_2 at iteration 17,000 out of a total of 25,600 training iterations. We choose this transition point because empirically all baselines have stable performance there, which permits a fair comparison when the data shift occurs. We use a ResNet-18 model for all experiments under this online setup; since each subset of data contains only 5 classes, we modify the model's last layer accordingly.

baselines: We compare our learning rate adaptation framework with different combinations of off-the-shelf learning rate schedulers and optimizers from the optax library. To ensure a fair comparison, we carefully tuned the hyperparameters associated with each baseline learning rate scheduler × optimizer combination. Specifically, our baseline learning rate schedulers are constant learning rate, cosine annealing, exponential decay, and warmup with cosine annealing; our baseline optimizers are SGD, AdaGrad, and Adam. In total, we have 12 learning rate scheduler × optimizer pairs as baselines. We report detailed hyperparameter choices for each baseline in the appendix.
evaluation metrics: We evaluate our method and the baselines using three types of performance metrics:
• post-shift local accuracy: the average evaluation accuracy during a specified window starting at the data distribution shift. We consider three window sizes: 100, 500, and 1000 iterations. This metric measures the robustness of algorithms immediately after the distribution change.
• pre-shift accuracy: the maximum evaluation accuracy prior to the data distribution shift.
• post-shift accuracy: the maximum evaluation accuracy after the data distribution shift.

implementation: We follow Algorithm 1 for the SAMUEL implementation in the online setup. Our SAMUEL framework admits any choice of black-box OCO algorithm; for the online experiment we use Adagrad, so each expert is an Adagrad optimizer with a specific external learning rate multiplier. The total number of training iterations is 25,600 and we specify the smallest geometric interval to

results: We report the quantitative scores of our algorithm and the baselines under the five resulting evaluation metrics in Table 2. We find that SAMUEL surpasses all baselines on every performance metric we considered. Although a number of baselines, such as Adagrad with cosine annealing, Adam with cosine annealing, and SGD with warmup cosine annealing, have pre-shift test accuracy comparable to SAMUEL's, SAMUEL's ability to adaptively select the learning rate multiplier confers robustness to unforeseen changes in the data distribution. This is unsurprising: typical off-the-shelf learning rate schedulers follow a deterministic learning rate multiplier function throughout training and are therefore prone to suffer from data distribution changes. We also compare the qualitative behavior of our algorithm and the baselines within a 100-iteration window after the data distribution change in Figure 1. It is clear from the plots that SAMUEL recovers faster than the baselines.
Furthermore, SAMUEL consistently maintains a higher test accuracy throughout the window.
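The metric families above can be sketched as a small helper; `acc` (per-iteration evaluation accuracy) and `shift` (iteration of the distribution change) are hypothetical names for illustration.

```python
def shift_metrics(acc, shift, windows=(100, 500, 1000)):
    """Compute the five accuracy metrics from a per-iteration accuracy trace."""
    out = {
        "pre_shift_acc": max(acc[:shift]),   # best accuracy before the shift
        "post_shift_acc": max(acc[shift:]),  # best accuracy after the shift
    }
    for w in windows:  # average accuracy in a window right after the shift
        out[f"post_shift_local_{w}"] = sum(acc[shift:shift + w]) / w
    return out

# Synthetic trace: stable at 0.5, dips to 0.2 after the shift, recovers to 0.6.
acc = [0.5] * 17000 + [0.2] * 100 + [0.6] * 8500
m = shift_metrics(acc, shift=17000)
print(m["pre_shift_acc"], m["post_shift_acc"], m["post_shift_local_100"])
```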

4.2. OFFLINE EXPERIMENTS

experiment setup: We experiment with popular vision and language tasks to demonstrate SAMUEL's ability to select optimal learning rates on-the-fly, without hyperparameter tuning. The tasks are image classification on CIFAR-10 and ImageNet, and sentiment classification on SST-2. We use ResNet-18 for CIFAR-10, ResNet-50 for ImageNet, and an LSTM for SST-2.

baseline: We use the step learning rate scheduler as the baseline, a commonly used off-the-shelf scheduler. We specifically use a three-phase schedule in which we fix the two step transition points based on heuristics and provide five candidate learning rates for each phase; an exhaustive search thus yields a total of 125 different schedules.

implementation: We adjust Algorithm 1 to be computationally efficient. Instead of running experts for each of the log T geometric intervals, we take a fixed number of experts (five total experts for these experiments, with one candidate learning rate per expert) with an exponential decay factor on the history. Unlike Algorithm 1, where experts are initialized at the start of each geometric interval, we initialize experts at the step transition points. We introduce a parameter $\alpha$ that determines the effective memory length:

$$x_{t+1} = x_t - \eta\Big(I + \sum_{\tau=1}^{t} \alpha^{t-\tau} \nabla_\tau \nabla_\tau^\top\Big)^{-\frac12} \nabla_t$$

A fixed interval with different values of $\alpha$ can be seen as a "soft" version of the geometric intervals in Algorithm 1. All experiments were conducted on TPU-V2 hardware. We provide pseudo-code for the implementation in the appendix.

CIFAR-10: We compare a ResNet-18 model trained with SAMUEL to ResNet-18 trained with Adagrad using brute-force-searched step learning rate schedules. We process and augment the data following He et al. (2016). For training, we use a batch size of 256 and 250 total epochs. We fix the learning rate transition points at epochs 125 and 200, and provide five candidate learning rates {0.0001, 0.001, 0.01, 0.1, 1} for each region.
Thus an exhaustive search yields 125 different schedules for the baseline. For a fair comparison, we adopt the same learning rate transition points for our method. We compare the test accuracy curves of the baselines and our method in Fig. 2. The left plot in Fig. 2 displays 125 runs using Adagrad, one for each learning rate schedule, where the highest accuracy is 94.95%. A single run of SAMUEL achieves 94.76% with the same random seed (average 94.50% across 10 random seeds), which ranks in the top 3 of the 125 exhaustively searched schedules.

ImageNet: We continue examining the performance of SAMUEL on the large-scale ImageNet dataset, training ResNet-50 with an exhaustive search over learning rate schedules and comparing with SAMUEL. We also consider a more practical step scheduling scheme in which the learning rate decays after each transition point: the candidate learning rates are {0.2, 0.4, 0.6, 0.8, 1.0} in the first phase, and decay by 10× when stepping into each subsequent phase. We set the transition points at epochs 50 and 75 out of 100 total training epochs, and adopt the training pipeline from Heek et al. (2020). For both the baselines and SAMUEL, we use the SGD optimizer with Nesterov momentum of 0.9 and a training batch size of 1024. The second column of Fig. 2 compares the exhaustive search baseline (top) to SAMUEL (bottom). The best validation accuracy among the exhaustively searched learning rate schedules is 76.32%; SAMUEL achieves 76.22% in a single run (average 76.15% across 5 random seeds). Note that 76.22% is near-SOTA for this model architecture.
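The exponentially decayed accumulator from the implementation description above can be sketched in diagonal form (the displayed update uses the full outer product; all constants here are illustrative, not the experiments' settings):

```python
import numpy as np

def decayed_adagrad_step(x, g, accum, lr, alpha, eps=1e-8):
    """One step of Adagrad with an alpha-decayed gradient accumulator.

    accum tracks sum_tau alpha^(t - tau) * g_tau^2, so gradients from the
    distant past are forgotten -- a "soft" version of restarting on intervals.
    """
    accum = alpha * accum + g * g
    return x - lr * g / (np.sqrt(accum) + eps), accum

# Minimize ||x||^2 from x = (1, 1); gradient is 2x.
x, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(50):
    x, accum = decayed_adagrad_step(x, 2 * x, accum, lr=0.05, alpha=0.9)
print(np.linalg.norm(x))  # shrinks toward the minimizer at the origin
```

Smaller α forgets faster (shorter effective interval); α → 1 recovers the standard Adagrad accumulator.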

SST-2:

We conduct experiments on the Stanford Sentiment Treebank (SST-2) dataset. We adopt the pipeline from (Heek et al., 2020) for pre-processing the SST-2 dataset and train a simple bi-directional LSTM. For both the baseline and our algorithm, we use SGD with momentum of 0.9, additive weight decay of 3e-6, and a training batch size of 64. The learning rate schedule setting is the same as that of CIFAR-10. The right column of Fig. 2 shows that the best accuracy of exhaustive search is 86.12%, while the accuracy of SAMUEL with the same seed is 85.55% (average 85.58% across 10 different random seeds).

stability of SAMUEL: We demonstrate the stability of SAMUEL under hyperparameter tuning. Since our algorithm automatically selects the optimal learning rate, the only tunable hyperparameters are the number of multiplicative weight factors η and the number of history decay factors α. We conduct 18 trials with different hyperparameter combinations and display the test accuracy curves in Fig. 3. Specifically, we consider {2, 3, 6} decay factors α and {5, 10, 15, 20, 25, 30} multiplicative weight factors η. As Fig. 3 shows, all trials of SAMUEL converge to nearly the same final accuracy regardless of the exact hyperparameters.

computation considerations: A table of runtime comparisons is provided in the appendix. As described in the implementation section, SAMUEL here has five experts in total, which incurs five times the compute of a single baseline run. Nevertheless, this is a dramatic improvement over brute-force hyperparameter sweeping of learning rate schedulers: for the step learning rate scheduler we experimented with, SAMUEL is 25 times more computationally efficient than tuning the scheduler with grid search. In addition, the experts can be fully parallelized across different acceleration devices, so with an efficient implementation the run time of SAMUEL is expected to approach that of a single baseline run.

5. CONCLUSION

In this paper we study adaptive gradient methods with local guarantees. The methodology is based on adaptive online learning, to which we contribute a novel twist on the multiplicative weight method with better adaptive regret guarantees than the state of the art. Combined with known results on adaptive gradient methods, this gives an algorithm, SAMUEL, with optimal full-matrix local adaptive regret guarantees. We demonstrate the effectiveness and robustness of SAMUEL in experiments, showing that it can automatically adapt to the optimal learning rate and achieve better task accuracy on online tasks with distribution shifts. For offline tasks, SAMUEL consistently achieves accuracy comparable to an optimizer with a fine-tuned learning rate schedule, using fewer overall computational resources for hyperparameter tuning.

Continuing the proof of Theorem 2, we obtain for any q:

$$\log(\tilde w_{t+1}(I, q)) \ge \sum_{\tau=s}^{t} \eta_{I,q} r_\tau(I) - \sum_{\tau=s}^{t} \eta_{I,q}^2 r_\tau(I)^2$$

Now we upper bound the term $\sum_{\tau=s}^{t} r_\tau(I)^2$. By convexity we have $r_\tau(I) = \ell_\tau(x_\tau) - \ell_\tau(x_\tau(I)) \le \nabla_\tau^\top (x_\tau - x_\tau(I))$, hence

$$\sum_{\tau=s}^{t} r_\tau(I) \le \frac{4\log(T)}{\eta_{I,q}} + 4\eta_{I,q} \sum_{\tau=s}^{t} \big(\nabla_\tau^\top (x_\tau - x_\tau(I))\big)^2$$

The next step is to upper bound the term $\nabla_\tau^\top (x_\tau - x_\tau(I))$. By Hölder's inequality, $\nabla_\tau^\top (x_\tau - x_\tau(I)) \le \|\nabla_\tau\|_{H^{-1}} \|x_\tau - x_\tau(I)\|_H$ for any $H$. As a result, for any PSD $H$ with $\operatorname{tr}(H) \le d$,

$$\big(\nabla_\tau^\top (x_\tau - x_\tau(I))\big)^2 \le \nabla_\tau^\top H^{-1} \nabla_\tau\, \|x_\tau - x_\tau(I)\|_H^2 \le 4D^2 d\, \nabla_\tau^\top H^{-1} \nabla_\tau$$

where $\|x_\tau - x_\tau(I)\|_H^2 \le 4D^2 d$ follows by elementary algebra: let $H = V^{-1} M V$ be its diagonal decomposition, where $V$ is an orthogonal matrix and $M$ is diagonal with entries at most $d$. Then

$$\|x_\tau - x_\tau(I)\|_H^2 = (x_\tau - x_\tau(I))^\top H (x_\tau - x_\tau(I)) = (V(x_\tau - x_\tau(I)))^\top M\, V(x_\tau - x_\tau(I)) \le (V(x_\tau - x_\tau(I)))^\top dI\, V(x_\tau - x_\tau(I)) \le 4D^2 d$$

Hence

$$\sum_{\tau=s}^{t} r_\tau(I) \le \frac{4\log(T)}{\eta_{I,q}} + 4\eta_{I,q} D^2 d \min_H \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau$$

The optimal choice of $\eta$ is of course

$$\eta^* = \sqrt{\frac{4\log(T)}{D^2 d \min_H \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau}}$$

When $D^2 d \min_H \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau \le 64 G^2 D^2 \log(T)$, the choice $\eta_{I,1}$ gives the bound $O(GD\log(T))$.
When $D^2 d \min_H \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau > 64 G^2 D^2 \log(T)$, there always exists $q$ such that $\frac{1}{2}\eta_{I,q} \le \eta^* \le 2\eta_{I,q}$ by the construction of the $\eta_{I,q}$, so the regret $R_1(I)$ is upper bounded by

$$O\Big(D \log(T) \max\Big\{G \log(T),\ d^{\frac12}\sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau}\Big\}\Big)$$

Having proven an optimal regret bound for any interval $I \in S$, it remains to extend the bound to an arbitrary interval J. We show that by using Cauchy–Schwarz we can achieve this at the cost of an additional $\log(T)$ factor. We need the following lemma from Daniely et al. (2015):

Lemma 7 (Lemma 5 in Daniely et al. (2015)). For any interval J, there exists a set of intervals $S_J$ containing only disjoint intervals in S whose union is exactly J, with $|S_J| = O(\log(T))$.

We now use Cauchy–Schwarz to bound the regret:

Lemma 8. For any interval J that can be written as the union of n disjoint intervals $\cup_i I_i$, the regret Regret(J) can be upper bounded by

$$\text{Regret}(J) \le \sqrt{n \sum_{i=1}^{n} \text{Regret}(I_i)^2}$$

Proof. The regret over J is controlled by $\text{Regret}(J) \le \sum_{i=1}^{n} \text{Regret}(I_i)$. By Cauchy–Schwarz, $\big(\sum_{i=1}^{n} \text{Regret}(I_i)\big)^2 \le n \sum_{i=1}^{n} \text{Regret}^2(I_i)$, which concludes the proof.

We can now upper bound $R_1(J)$ using Lemma 8, replacing Regret by $R_1$ and n by $|S_J| = O(\log(T))$: for any interval J,

$$R_1(J) \le \sqrt{|S_J| \sum_{I \in S_J} R_1(I)^2}$$

Combining the above inequality with the upper bound on $R_1(I)$, we reach the desired conclusion.
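Lemma 8's inequality is a direct instance of Cauchy–Schwarz and is easy to sanity-check numerically; the snippet below is only such a check, with arbitrary sampled values.

```python
import math
import random

# For nonnegative per-interval regrets r_1, ..., r_n, Cauchy-Schwarz gives
# (r_1 + ... + r_n) <= sqrt(n * sum r_i^2).

random.seed(0)
for _ in range(1000):
    r = [10 * random.random() for _ in range(random.randint(1, 8))]
    assert sum(r) <= math.sqrt(len(r) * sum(ri * ri for ri in r)) + 1e-9
print("Lemma 8 inequality holds on all sampled instances")
```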

A.4 BASELINE HYPERPARAMETERS FOR ONLINE EXPERIMENTS

Here we report the hyperparameters used in the baseline learning rate schedulers in the online experiments. We use the off-the-shelf learning rate schedulers from the optax library. Please refer to the optax documentation for the specific meaning of the parameters.

AdaGrad

• constant learning rate: learning rate 0.2.
• cosine annealing: init value = 0.2, decay steps = 25600, alpha = 0.
• warmup with cosine annealing: init value = 1e-5, peak value = 0.15, warmup steps = 1000, end value = 0.
• exponential decay: init value = 0.35, transition steps = 3000, decay rate = 0.5.

SGD

• constant learning rate: learning rate 0.15.
• cosine annealing: init value = 0.3, decay steps = 25600, alpha = 0.
• warmup with cosine annealing: init value = 1e-5, peak value = 0.5, warmup steps = 1000, end value = 0.
• exponential decay: init value = 0.6, transition steps = 3000, decay rate = 0.5.

Adam

• constant learning rate: learning rate 0.001.
• cosine annealing: init value = 0.001, decay steps = 25600, alpha = 0.
• warmup with cosine annealing: init value = 1e-5, peak value = 0.005, warmup steps = 1000, end value = 0.
• exponential decay: init value = 0.005, transition steps = 3000, decay rate = 0.5.

A.5 COMPUTE COMPARISON FOR OFFLINE EXPERIMENTS

We report the compute resource consumption of both the baselines and SAMUEL for the offline experiments. We run the experts sequentially, so the running time of our algorithm is longer than that of the baselines. With a more efficient implementation that parallelizes each expert across TPU devices, the running time of SAMUEL is expected to approach that of the baseline algorithm.



Figure 1: Behavior comparison following data distribution shift. Each subplot compares SAMUEL with an optimizer paired with different learning rate schedulers. We focus on a window of size 100 iterations post data distribution shift. SAMUEL systematically recovers fastest from data change and has a leading test accuracy throughout the window. The confidence band for each trace is the standard deviation computed across three different random seeds.

Figure 2: Comparison of exhaustive searched step learning rate schedule (top) and SAMUEL (bottom) on CIFAR-10, ImageNet and SST-2.

Figure 3: Stability study of SAMUEL with different hyperparameters.

A.2 PROOF OF COROLLARY 4

Proof. Using Theorem 2, we have that $R_1(I)$ is upper bounded by the expression in the corollary statement. For any interval $J \in S$, one of the Adagrad experts achieves the corresponding bound on $R_0$. Using Lemma 7 (from Daniely et al. (2015)) and Lemma 8 with Regret replaced by $R_0$, the bound on $R_0$ extends to any interval, and $R_0$ and $R_1$ together give the desired bound on Regret(I).

A.3 PROOF OF COROLLARY 6

Proof. The proof is almost identical to that of the previous corollary, observing that the regret $R_0(I)$ is $\tilde O(D_\infty \sum_{i=1}^{d} \|\nabla_{s:t,i}\|_2)$ due to Duchi et al. (2011), and that the regret $R_1(I)$ remains $\tilde O(D \sqrt{\min_{H \in \mathcal{H}} \sum_{\tau=s}^{t} \nabla_\tau^\top H^{-1} \nabla_\tau})$, which is upper bounded by $\tilde O(D_\infty \sum_{i=1}^{d} \|\nabla_{s:t,i}\|_2)$.

Table 2: Five accuracy metrics (%) for SAMUEL and the baseline methods under the online data distribution shift setup. Standard deviations are computed over three runs with different random seeds.

A APPENDIX

A.1 PROOF OF THEOREM 2

Proof. We define the pseudo weight $\tilde w_\tau(I, q) = w_\tau(I, q)/\eta_{I,q}$ for $\tau \le t$, and for $\tau > t$ we set $\tilde w_\tau(I, q) = \tilde w_t(I, q)$. Let $\tilde W_\tau = \sum_{I \in S(\tau), q} \tilde w_\tau(I, q)$. We are going to show the following inequality:

$$\tilde W_\tau \le \tau\,(\log(\tau) + 1)\,\log(dTD^2G^2)\,\log(T) \tag{1}$$

We prove this by induction. For $\tau = 1$ it holds since on any interval $[1, t]$ the number of experts is exactly the number of possible values of q, and the number of intervals $[1, t] \in S$ is $O(\log(T))$. Now assume it holds for all times up to $\tau$; the induction step reduces to showing that $\sum_{I \in S(\tau), q} w_\tau(I, q)\, r_\tau(I) \le 0$. Indeed, by convexity of $\ell_\tau$,

$$\ell_\tau(x_\tau) = \ell_\tau\Big(\sum_{I \in S(\tau), q} w_\tau(I, q)\, x_\tau(I, q)/W_\tau\Big) \le \sum_{I \in S(\tau), q} w_\tau(I, q)\, \ell_\tau(x_\tau(I, q))/W_\tau,$$

hence $\sum_{I, q} w_\tau(I, q)\, r_\tau(I) = W_\tau \ell_\tau(x_\tau) - \sum_{I, q} w_\tau(I, q)\, \ell_\tau(x_\tau(I, q)) \le 0$, which finishes the induction.

Based on this, we proceed to prove the bound for any (I, q). By inequality (1), taking the logarithm of both sides,

$$\log(\tilde w_{t+1}(I, q)) \le \log(t + 1) + \log(\log(t + 1) + 1) + \log(\log(dTD^2G^2)) + \log(\log(T))$$

Recall the expression $\tilde w_{t+1}(I, q) = \frac{\min\{1/2, \eta_{I,q}\}}{\eta_{I,q}} \prod_{\tau=s}^{t} (1 + \eta_{I,q} r_\tau(I))$. Using the fact that $\log(1 + x) \ge x - x^2$ for all $x \ge -1/2$, the proof continues as in the displayed chain of inequalities (reproduced after the conclusion section above), yielding the stated bound on $R_1$.

Algorithm 2 SAMUEL experiment pseudocode
1: Input: AdaGrad optimizer A, constant Q, a set of learning rates {1, 0.1, 0.001, 0.0001, 0.00001}, reinitialize frequency K.
2: Initialize: for each learning rate i ∈ S, a copy A_i of A.
3: Set η_{i,q} = 1/2^q for q ∈ [1, Q].
4: Initialize w_1(i, q) = min{1/2, η_{i,q}}. Initialize NN params x_0.
5: for τ = 1, ..., T do
6:   Let updated NN params x_τ(i, q) = A_i(τ).
7:   Let W_τ = Σ_{i,q} w_τ(i, q).
8:   Sample x_τ according to w_τ(i, q)/W_τ.
9:   Receive batch loss ℓ_τ(x_τ), define r_τ(i) = ℓ_τ(x_τ) - ℓ_τ(x_τ(i, q)).
10:  For each i, update w_{τ+1}(i, q) = w_τ(i, q)(1 + η_{i,q} r_τ(i)).
11:  if τ % K = 0 then
12:    Re-initialize w_τ(i, q) = min{1/2, η_{i,q}}.
13:    All copies A_i start from NN params x_τ.
14:  end if
15: end for
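The control flow of the experiment pseudocode can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the AdaGrad experts are stubbed out by a reward oracle `expert_reward`, and all names and constants are ours.

```python
import random

def samuel_offline(T, n_experts, Q, K, expert_reward):
    """Multiplicative weights over fixed experts with periodic re-initialization."""
    etas = [1.0 / 2 ** q for q in range(1, Q + 1)]
    def fresh_weights():
        return {(i, q): min(0.5, etas[q])
                for i in range(n_experts) for q in range(Q)}
    w = fresh_weights()
    picks = []
    for tau in range(1, T + 1):
        # sample a model according to the current weights (cf. line 8)
        i, q = random.choices(list(w), weights=list(w.values()))[0]
        picks.append(i)
        for (j, p) in w:
            r = expert_reward(tau, j)     # r_tau(j) = loss(played) - loss(expert j)
            w[(j, p)] *= 1 + etas[p] * r  # multiplicative update (cf. line 10)
        if tau % K == 0:                  # periodic reset (cf. lines 11-14)
            w = fresh_weights()
    return picks, w

random.seed(0)
# toy reward stream in which expert 2 is consistently the best after step 50
picks, w = samuel_offline(T=200, n_experts=5, Q=3, K=120,
                          expert_reward=lambda t, j: 0.3 if t > 50 and j == 2 else -0.1)
print(max(w, key=w.get))  # a weight belonging to expert 2 dominates after the last reset
```

Running several step sizes η per expert mirrors Algorithm 1's duplicates, and the reset every K steps plays the role of re-initializing experts at interval boundaries.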

