ADMETA: A NOVEL DOUBLE EXPONENTIAL MOVING AVERAGE TO ADAPTIVE AND NON-ADAPTIVE MOMENTUM OPTIMIZERS WITH BIDIRECTIONAL LOOKING

Abstract

The optimizer is an essential component for the success of deep learning; it guides the neural network to update its parameters according to the loss on the training set. SGD and Adam are two classical and effective optimizers on which researchers have proposed many variants, such as SGDM and RAdam. In this paper, we innovatively combine the backward-looking and forward-looking aspects of optimizer algorithms and propose a novel ADMETA (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) optimizer framework. For the backward-looking part, we propose a DEMA variant scheme, motivated by a metric in the stock market, to replace the common exponential moving average scheme. For the forward-looking part, we present a dynamic lookahead strategy that asymptotically approaches a set value, maintaining speed in the early stage and high convergence performance in the final stage. Based on this idea, we provide two optimizer implementations, ADMETAR and ADMETAS, the former based on RAdam and the latter based on SGDM. Through extensive experiments on diverse tasks, we find that the proposed ADMETA optimizer outperforms our base optimizers and shows advantages over recently proposed competitive optimizers. We also provide theoretical proofs for these two algorithms, which verify the convergence of our proposed ADMETA.

1. INTRODUCTION

The field of neural network training has long been dominated by gradient descent optimizers, which use first-order methods. Typical ones include SGD (Robbins & Monro, 1951) and SGD with momentum (SGDM) (Sutskever et al., 2013), which are simple yet efficient algorithms and often enjoy even better final convergence than many recently proposed optimizers. However, they suffer from low speed in the initial stage and poor performance on sparse training datasets. This shortcoming cannot be ignored: as deep learning develops, datasets become much larger and models much more complex, and the time to train a network is also an important metric when evaluating an optimizer. To address this issue, optimizers with adaptive learning rates have been proposed, which use nonuniform stepsizes to scale the gradient during training; the usual implementation scales the gradient by the square root of some combination of the squared values of historical gradients. By far the most used are Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2017) due to their simplicity and high training speed in the early stage. Despite their popularity, Adam and many variants of it (such as RAdam (Liu et al., 2019)) are likely to achieve worse generalization than non-adaptive optimizers, as their performance quickly plateaus on validation sets. To achieve a better tradeoff, researchers have made many improvements based on the SGD and Adam families of optimizers. One attempt is switching from adaptive learning rate methods to SGD, based on the idea of complementing each other's advantages. However, a sudden change from one optimizer to another at a set epoch or step is not applicable, because different algorithms make characteristic choices at saddle points and tend to converge to final points near which the loss function has different geometry (Im et al., 2016).
Therefore, many optimizers based on this idea seek a smooth switch; representative ones are AdaBound (Luo et al., 2019) and SWATS (Keskar & Socher, 2017). The second attempt is proposing new methods to further accelerate SGDM, including introducing a power exponent (pbSGD (Zhou et al., 2020)), aggregated momentum (AggMo (Lucas et al., 2018)) and warm restarts (SGDR (Loshchilov & Hutter, 2016)). The third attempt is modifying the process of optimizers with adaptive learning rates to reach better local optima, which is the most popular direction in recent research (Zhuang et al., 2020; Li et al., 2020). Due to space constraints, please see more related work in Appendix A. In this paper, we focus on the use of historical and future information about the optimization process of the model, both of which we argue are important for models to reach their optimal points. To this end, we introduce a bidirectional view: backward-looking and forward-looking. In the backward-looking view, EMA is an exponentially decreasing weighted moving average used as a trend-type indicator of the optimization process. Since training uses a mini-batch strategy, each batch is likely to deviate from the whole, which may mislead the model to a local optimal point. Inspired by stock market indicators, DEMA (Mulloy, 1994) is an exponential average calculated on the traditional EMA and the current input, which can effectively maintain the trend while reducing the impact of short-term bias. We thus replace the traditional exponential moving average (EMA) with a double exponential moving average (DEMA). It is worth noting that our usage is not equivalent to the original DEMA, but rather a variant of it.
In the forward-looking part, we observe that the constant weight adopted by the original Lookahead optimizer (Zhang et al., 2019) to control the scale of fast weights and slow weights in each synchronization period makes early-stage training slow and lossy, so we propose a new dynamic strategy that adopts an asymptotic weight for improvement. By applying these two ideas, we propose the ADMETA optimizer with ADMETAR and ADMETAS implementations, based on RAdam and SGDM respectively. Extensive experiments have been conducted on computer vision (CV), natural language processing (NLP) and audio processing tasks, which demonstrate that our method achieves better convergence results than other recently proposed optimizers. Further analysis shows that ADMETAS achieves higher generalization ability than SGDM, and that ADMETAR achieves better convergence results while maintaining high speed in the initial stage compared to other adaptive learning rate methods. We further find that DEMA and the dynamic lookahead strategy improve performance over EMA and the constant strategy, respectively. In addition, we provide convergence proofs for our proposed ADMETA in convex and non-convex optimization.

2.1. BACKGROUND

The role of the optimizer in model training is to minimize the loss on the training set and thus drive the learning of model parameters. Formally, consider a loss function f : R^d → R that is bounded below by zero, where R represents the field of real numbers, d denotes the dimension of the parameter, and R^d thus denotes d-dimensional Euclidean space. The optimization problem can be formulated as min_{θ∈F} f(θ), where θ indicates a parameter whose domain is F and F ⊂ R^d. If we define the optimal parameter of the above loss function as θ*, then the optimization objective can be written as θ* = arg min_{θ∈F} f(θ). Optimizers iteratively update parameters to bring them close to the optimum as the training step t increases, that is, to make lim_{t→∞} ∥θ_t − θ*∥ = 0. The stochastic gradient algorithm SGD (Robbins & Monro, 1951) optimizes f by iteratively updating the parameter θ_t at step t in the opposite direction of the stochastic gradient g(θ_{t−1}; ξ_t), where ξ_t is the input of the t-th mini-batch in the training dataset. For clarity, we abbreviate g(θ_{t−1}; ξ_t) as g_t for the rest of the paper unless specified. SGD computes the updated model parameters from the previous model parameters, the current gradient and the learning rate. Defining the learning rate as α_t, the update process is: θ_t = θ_{t−1} − α_t g_t. Original SGD tends to oscillate along the way due to the mini-batch strategy and its lack of use of past gradients. Moreover, this disadvantage also results in long plateaus in valleys and at saddle points, slowing training. To smooth the oscillation and speed up the convergence rate, momentum, also known as Polyak's heavy ball (Polyak, 1964), is introduced to modify SGD. The momentum at step t is often denoted m_t and is obtained by iterative calculation with a dampening coefficient β.
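The plain SGD update θ_t = θ_{t−1} − α_t g_t can be sketched in a couple of lines (a minimal sketch; plain Python lists stand in for parameter tensors):

```python
def sgd_step(theta, grad, lr):
    """One plain SGD update: theta_t = theta_{t-1} - lr * g_t (element-wise)."""
    return [p - lr * g for p, g in zip(theta, grad)]

# One step from theta = [1.0, -2.0] with gradient [0.5, -0.5] and lr = 0.1
theta = sgd_step([1.0, -2.0], [0.5, -0.5], lr=0.1)
```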
Thus, the update process of SGD with momentum (SGDM) (Sutskever et al., 2013) becomes:

m_t = β m_{t−1} + (1 − β) g_t,  (3)
θ_t = θ_{t−1} − α_t m_t.  (4)

Although momentum works well, the uniform stepsize for every parameter is another factor limiting speed, especially on large and sparse datasets. To further accelerate the update, adaptive learning rate optimizers were introduced, which adopt an individual stepsize for each parameter based on its own update process. Since a smoothing mechanism is employed in the calculation of the stepsize, two dampening coefficients, β_1 and β_2, are introduced to balance current and historical information. Adam (Kingma & Ba, 2014), a typical adaptive learning rate optimizer, is implemented as follows:

m_t = β_1 m_{t−1} + (1 − β_1) g_t,  (5)
v_t = β_2 v_{t−1} + (1 − β_2) g_t²,  (6)
θ_t = θ_{t−1} − α_t m̂_t / √v̂_t,  where m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t),  (7)

where m_t is the first momentum, corresponding to the momentum in SGDM, and v_t is the second momentum. To emphasize its functionality, we call v_t the adaptive item for the rest of the paper. Adam may sometimes converge to a bad local optimum, partly due to its large variance in the early stage. To fix this issue, RAdam (Liu et al., 2019) introduces a rectification term r_t and splits the update into two sequentially connected sub-processes:

ρ_∞ = 2/(1 − β_2) − 1,  (8)
ρ_t = ρ_∞ − 2 t β_2^t / (1 − β_2^t),  (9)
r_t = √[ ((ρ_t − 4)(ρ_t − 2) ρ_∞) / ((ρ_∞ − 4)(ρ_∞ − 2) ρ_t) ],  (10)
θ_t = θ_{t−1} − α_t m_t  if ρ_t ≤ 4;  θ_t = θ_{t−1} − α_t r_t m_t / √v_t  if ρ_t > 4.  (11)
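The Adam recurrences (EMA of gradients, EMA of squared gradients, bias correction, adaptive step) translate directly into a single-step sketch (an illustration on plain Python lists, not the paper's implementation; hyperparameter defaults follow common practice):

```python
import math

def adam_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: EMA of gradients (m) and squared gradients (v),
    bias correction, then an element-wise adaptive step."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias-corrected first momentum
    v_hat = [vi / (1 - b2 ** t) for vi in v]   # bias-corrected second momentum
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# At t = 1 the bias-corrected step has magnitude ~lr regardless of gradient scale.
theta, m, v = adam_step([0.0], [0.0], [0.0], [100.0], t=1)
```

Note how the √v̂ denominator makes the first step roughly lr in size even for a gradient of 100, which is exactly the per-parameter scaling the text describes.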

2.2. BACKWARD-LOOKING

In fact, the calculation of the momentum m_t in Eq. (3) and Eq. (5) is an exponential moving average (EMA) of the gradient g_t. EMA, also known as an exponentially weighted moving average, can be used to estimate the local mean of a variable, so that the update of the variable is related to its historical values over a period of time. Formally, EMA is expressed as: S_t = β S_{t−1} + (1 − β) p_t, where the variable S is denoted S_t at time t and p_t is the newly assigned value. In particular, S_t = p_t without EMA. In Eq. (3), SGDM employs EMA to take a moving average of the past gradients, while in Eq. (5), Adam and RAdam further apply EMA to the square of past gradients to construct the adaptive item. In the EMA, the moving average of the variable S at time t is roughly equal to the average of the values p over the past 1/(1 − β) steps. This makes the moving average vary more at the beginning, so a bias correction is proposed and used in Adam (Eq. (7)) and in RAdam (Eq. (11)) when ρ_t > 4. EMA can be regarded as obtaining the average value of a variable over time. Compared with the direct assignment of values, the curve of values obtained by a moving average is smoother and less jittery, and it does not fluctuate greatly on outlier inputs, which is very important for optimization over sampled mini-batches. Although efficient, EMA is not necessarily the best strategy for using historical information in the backward-looking part. Although it can effectively suppress the oscillation caused by mini-batch training by taking a moving average of g_t, it also introduces a lag time that affects the convergence speed and grows with the length of the moving average. Moreover, it can cause the overshoot problem (An et al., 2018); one possible reason is that EMA might make wrong use of historical gradients in the final stage and thus carry a "burden" when converging to the optimum.
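The EMA recurrence S_t = βS_{t−1} + (1 − β)p_t, its zero-initialization bias, and the bias correction discussed above can be seen in a toy sketch (a minimal illustration, not the paper's code):

```python
def ema(values, beta):
    """EMA with zero init; early values are biased toward 0 (hence bias correction)."""
    s, out = 0.0, []
    for p in values:
        s = beta * s + (1 - beta) * p
        out.append(s)
    return out

def ema_corrected(values, beta):
    """Same EMA with Adam-style bias correction: S_t / (1 - beta^t)."""
    return [s / (1 - beta ** t) for t, s in enumerate(ema(values, beta), start=1)]

# A constant input of 1.0: the raw EMA lags toward 1, the corrected one is exact.
print([round(x, 3) for x in ema([1.0] * 3, beta=0.9)])            # [0.1, 0.19, 0.271]
print([round(x, 3) for x in ema_corrected([1.0] * 3, beta=0.9)])  # [1.0, 1.0, 1.0]
```

The first three raw values illustrate the lag the text describes: even with a constant input, the zero-initialized EMA needs many steps to approach the true mean.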
Double Exponential Moving Average (DEMA), first proposed by Mulloy (1994), is a faster moving average strategy invented to reduce the lag time of EMA. Motivated by this advantage, we developed a DEMA variant for model optimization. It is worth noting that our DEMA is not simply a moving average of historical gradients taken twice; instead, it takes the moving average of a linear combination of the current gradient and the moving average of past gradients. The form of our DEMA variant can be written as:

DEMA = EMA_out(µ EMA_in + κ g_t),  (13)

where µ and κ are coefficients that control the scale of the current gradient and depend only on β. From the formula EMA = Σ_{i=1}^n β^{n−i} g_i, past gradients follow a fixed proportionality: the ratio of the gradient weight at one time to the gradient weight at the previous time is β. Due to the mini-batch training strategy, the input is randomly sampled and the effect of each mini-batch on optimization varies. Therefore, applying a fixed proportion to past gradients is not a reasonable approach, since it does not account for this changing situation. The overshoot problem that EMA usually exhibits may also be caused by the above reasons (An et al., 2018). Thus, we handle the relationship between the historical gradients and the current gradient more flexibly by further controlling the proportion of past gradients; our design of the coefficients in DEMA serves this purpose. Based on Eq. (13), our actual implementation in the algorithm is:

I_t = λ I_{t−1} + g_t,  (14)
h_t = κ g_t + µ I_t + ν,  (15)
m_t = β m_{t−1} + (1 − β) h_t,  (16)

where I_t is the output of EMA_in with a 0 initial value and m_t is the output of EMA_out, also initialized to 0. λ and β are the dampening coefficients of the inner and outer EMA respectively, and ν is a bias item, set to a small amount that decreases exponentially to 0 and chosen as λ^t g_1.
The bias item does not affect the convergence proof, so for the sake of brevity, it is omitted for the rest of this paper and the details can be seen in the code. Please refer to Appendix B for more comparison and discussion between EMA and DEMA.
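The backward-looking recurrences (inner EMA I_t, blend h_t, outer EMA m_t) translate directly into code. In this sketch, kappa and mu are left as free parameters with placeholder values, since the paper derives its specific coefficients from λ, and the bias item ν is omitted as in the text:

```python
def dema_momentum(grads, lam=0.9, beta=0.9, kappa=1.0, mu=0.1):
    """The paper's DEMA-variant momentum (bias item nu omitted, as in the text).

    I_t = lam * I_{t-1} + g_t             # inner EMA over past gradients
    h_t = kappa * g_t + mu * I_t          # blend of current gradient and inner EMA
    m_t = beta * m_{t-1} + (1 - beta) * h_t   # outer EMA
    kappa/mu here are placeholder values, not the paper's derived coefficients.
    """
    I, m, out = 0.0, 0.0, []
    for g in grads:
        I = lam * I + g
        h = kappa * g + mu * I
        m = beta * m + (1 - beta) * h
        out.append(m)
    return out
```

Unlike plain EMA, the weight placed on the current gradient relative to the history is now shaped by both (κ, µ) and the two dampening coefficients, which is the extra flexibility the text argues for.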

2.3. FORWARD-LOOKING

Focusing on gradient history, i.e., backward-looking, helps the optimizer alleviate oscillation in the optimization process and prevents it from being misled by local noise. However, since the optimization problem of a deep neural network is very complex, the optimizer can make the optimization process more robust through pre-exploration, so as to obtain better optimization results; this is called forward-looking. Based on the Reptile algorithm and advances in understanding the loss surface, Zhang et al. (2019) proposed the Lookahead optimizer, which introduces two update processes and averages fast and slow weights periodically. The algorithm can be expressed as the cycle of the following process:

Pre-exploration: θ_t = OPTIM(θ_{t−1})
Synchronization (every k steps): ϕ_t = ϕ_{t−k} + η(θ_{t−1} − ϕ_{t−k}), θ_t = ϕ_t

where OPTIM(·) denotes a chosen optimizer, k denotes the synchronization period (in other words, the period of forward-looking), ϕ_t denotes the slow weights, θ_t denotes the fast weights updated by the chosen optimizer, and η is a constant coefficient controlling the proportion of slow and fast weights at each synchronization. Generally, the chosen optimizer can be arbitrary. The pseudocode above admits an intuitive explanation: guided by the fast weights θ_t, the slow weights ϕ_t update by linear interpolation between themselves and the fast weights. Every time the fast weights update k steps, the slow weights update one step. The update direction of the slow weights can be regarded as θ_t − ϕ_t. Therefore, η can also be interpreted as the stepsize of the slow weights at each synchronization; to avoid confusion with the stepsize of the fast weights, we rename it stepsize_s. The recommended values of η in (Zhang et al., 2019) are 0.5 and 0.8.
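The constant-η Lookahead cycle can be sketched as a wrapper around any inner step (a sketch; `inner_step` is a stand-in for one update of the chosen base optimizer):

```python
def lookahead(theta0, inner_step, n_steps, k=5, eta=0.5):
    """Lookahead (Zhang et al., 2019) with constant eta: every k fast steps,
    the slow weights phi are pulled toward the fast weights theta,
    and theta restarts from the updated phi."""
    theta, phi = list(theta0), list(theta0)
    for t in range(1, n_steps + 1):
        theta = inner_step(theta)                      # fast weights: base optimizer
        if t % k == 0:                                 # synchronization step
            phi = [p + eta * (f - p) for p, f in zip(phi, theta)]
            theta = list(phi)                          # fast weights restart from slow
    return theta
```

With a toy inner step that always moves +1, five fast steps followed by one sync with η = 0.5 land halfway between the start and the fast weights, which is exactly the interpolation described above.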
In the original Lookahead implementation, the fast and slow optimization processes are synchronized according to a given period, and parameters are fused at a fixed ratio at each synchronization. However, optimization is a continuous process, and in different stages the fast optimization steps guide the parameters differently. We argue that using a fixed stepsize_s at each synchronization is not an optimal strategy and may even have negative effects. For this reason, we turn the constant η into an η_t that changes with the step monotonically and asymptotically. Generally, η_t is a function that starts from 1, converges to a set value, and depends only on the step t. In this setting, the proportion of slow weights increases over time and this part gradually turns into the original Lookahead method. In other words, the slow weights in our method adopt a faster stepsize_s at the beginning, which asymptotically slows down as training proceeds. Specifically, we define two asymptotic functions for η_t:

η_t = 0.5 × (1 + 1/(0.01 √t + 1)),
η_t = 0.8 × (1 + 1/(0.1 √t + 3.8)),

and we call this dynamic asymptotic lookahead. The two functions are designed to move η_t from 1 to 0.5 and 0.8, respectively. Notably, these asymptotic functions may not be the best; we simply find that they work well, and future work may investigate more suitable ones. For clarity, we use the latter function in the rest of the paper, and the results of experiments trained from scratch are based on it unless specified. To illustrate the advantages of our dynamic lookahead strategy over no lookahead and the original constant lookahead, we give an optimization example in Figure 1. In region ①, around the early stage, the direction of the update is relatively stable and a large stepsize_s is needed. θ_1 → θ_4 denotes the update of the fast weights.
A constant lookahead method slows the update process in each synchronization period, as can be seen in θ_1 → θ_2. In our method, the fast weights take a larger proportion in each synchronization period in the early stage, thus updating faster, as can be seen in θ_1 → θ_3. In region ②, around the final stage, the direction of the update oscillates and a small stepsize_s is needed; the fast weights tend to overshoot the optimum, as can be seen in θ_5 → θ_8. The Lookahead optimizer can achieve a better convergence result than a general algorithm, as it averages the weights to bring them closer to the optimum, as can be seen in θ_5 → θ_6. In our method, the proportion of fast weights has already been reduced asymptotically to a set value, thus achieving a similar effect to the Lookahead optimizer, as can be seen in θ_5 → θ_7. From this analysis, we demonstrate that our dynamic lookahead strategy improves the robustness of training. (Figure 1: an illustrative loss curve f(x) with SGD and SGD-with-Lookahead trajectories θ_1, ..., θ_8 and the optimum θ* in regions ① and ②.)
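The dynamic weight can be sanity-checked in a few lines. This sketch uses the second schedule as we read it from the text, η_t = 0.8(1 + 1/(0.1√t + 3.8)); the constants are as reconstructed and not authoritative:

```python
import math

def eta_t(t, target=0.8, a=0.1, b=3.8):
    """Dynamic asymptotic lookahead weight: target * (1 + 1/(a*sqrt(t) + b)).
    At t = 0 this is ~1 (fast weights dominate); as t grows it decays to target."""
    return target * (1 + 1.0 / (a * math.sqrt(t) + b))

print(round(eta_t(0), 3))          # close to 1 at the start
print(round(eta_t(100_000), 3))    # approaching the target 0.8
```

The monotone decay is what implements the behavior described above: a fast slow-weight stepsize early on, relaxing toward the constant Lookahead value.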

2.4. IMPLEMENTATIONS OF ADMETAR AND ADMETAS

Since the Adam-family and SGD-family optimizers have their own advantages and disadvantages, and the bidirectional looking optimizer framework and improvements we propose place few restrictions on the base optimizer, we implement improved versions, ADMETAR and ADMETAS, based on the RAdam and SGDM optimizers. The final algorithms are shown in Algorithms 1 and 2. Detailed proofs of convergence and convergence rate for ADMETAR and ADMETAS are provided in Appendices C and D.

3. EXPERIMENTS

In this section, we demonstrate the effectiveness of our optimizer through an empirical exploration of different datasets and models, comparing several popular optimizers. Specifically, we conduct experiments on typical CV, NLP, and audio processing tasks. Influenced by the Transformer architecture, models are becoming deeper and larger, and training is therefore becoming more difficult; the pre-training/fine-tuning paradigm is now standard for large models. Therefore, we compare optimizers not only in the training-from-scratch setup, but also in the fine-tuning setup. We compare our proposed optimizer with several typical optimizers: the classic SGD (Robbins & Monro, 1951) and Adam (Kingma & Ba, 2014); our bases, SGDM (Sutskever et al., 2013) and RAdam (Liu et al., 2019); the current state-of-the-art AdaBelief (Zhuang et al., 2020); and Ranger (Wright, 2019), an optimizer combining many modules. Please refer to Appendix E for more experimental details.

Algorithm 1: ADMETAR Optimizer. All operations are element-wise.
  Initialize θ_1 ∈ F, ϕ_0 ← 0, m_0 ← 0, v_0 ← 0, I_0 ← 0, t ← 0
  for t = 1, 2, ... do
    g_t ← ∇f_t(θ_t)
    I_t ← λ I_{t−1} + g_t
    h_t ← κ g_t + µ I_t
    m_t ← β_1 m_{t−1} + (1 − β_1) h_t
    ρ_t ← ρ_∞ − 2 t β_2^t / (1 − β_2^t)
    if the variance is tractable, i.e., ρ_t > 4, then
      v_t ← β_2 v_{t−1} + (1 − β_2) h_t²
      r_t ← √[ ((ρ_t − 4)(ρ_t − 2) ρ_∞) / ((ρ_∞ − 4)(ρ_∞ − 2) ρ_t) ]
      m_t ← m_t / (1 − β_1^t), v_t ← v_t / (1 − β_2^t)
      θ_{t+1} ← Π_{F,√v_t}(θ_t − α_t r_t m_t / (√v_t + ϵ))
    else
      θ_{t+1} ← Π_{F,√v_t}(θ_t − α_t m_t)
    if (t + 1) % k == 0:
      ϕ_t ← η_t θ_t + (1 − η_t) ϕ_{t−k}
      θ_t ← ϕ_t
  end for
  return θ

Notations:
• α_t: learning rate at step t
• λ, β, β_1, β_2: the momentum coefficients
• ϵ: a small value used to avoid a zero denominator
• k: synchronization period
• Π_{F,M}(y) = argmin_{x∈F} ∥M^{1/2}(x − y)∥
• µ = 25 − 10/λ + 1/λ, κ = 10/λ − 9

Algorithm 2: ADMETAS Optimizer. All operations are element-wise.
  Initialize θ_1 ∈ F, ϕ_0 ← 0, m_0 ← 0, I_0 ← 0, t ← 0
  for t = 1, 2, ... do
    g_t ← ∇f_t(θ_t)
    I_t ← λ I_{t−1} + g_t
    h_t ← κ g_t + µ I_t
    m_t ← β m_{t−1} + (1 − β) h_t
    θ_{t+1} ← θ_t − α_t m_t
    if (t + 1) % k == 0:
      ϕ_t ← η_t θ_t + (1 − η_t) ϕ_{t−k}
      θ_t ← ϕ_t
  end for
  return θ
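Algorithm 2 is compact enough to sketch end-to-end (a sketch on plain Python lists; kappa and mu are placeholder values rather than the paper's derived coefficients, and the second η_t schedule, as we read it from the text, is used):

```python
import math

def admetas(theta0, grad_fn, n_steps, lr=0.1, lam=0.9, beta=0.9,
            kappa=1.0, mu=0.1, k=5):
    """A sketch of ADMETAS: DEMA momentum (backward-looking) plus
    dynamic asymptotic lookahead (forward-looking).
    kappa/mu are placeholder values, not the paper's derived coefficients."""
    theta, phi = list(theta0), list(theta0)
    I = [0.0] * len(theta0)   # inner EMA of gradients
    m = [0.0] * len(theta0)   # outer EMA (momentum)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        I = [lam * Ii + gi for Ii, gi in zip(I, g)]
        h = [kappa * gi + mu * Ii for gi, Ii in zip(g, I)]
        m = [beta * mi + (1 - beta) * hi for mi, hi in zip(m, h)]
        theta = [p - lr * mi for p, mi in zip(theta, m)]
        if t % k == 0:  # synchronization with dynamic asymptotic eta
            eta = 0.8 * (1 + 1.0 / (0.1 * math.sqrt(t) + 3.8))
            phi = [eta * f + (1 - eta) * p for f, p in zip(theta, phi)]
            theta = list(phi)
    return theta

# Minimizing f(x) = 0.5 * x^2 (gradient is x), starting from x = 1.0
x = admetas([1.0], lambda th: list(th), n_steps=500)
```

On this toy quadratic the iterate contracts toward the optimum at 0, with the slow weights barely intervening early (η_t ≈ 1) and interpolating more strongly later.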

3.1. IMAGE CLASSIFICATION

Consistent with general optimizer research (Zhuang et al., 2020), we conduct experiments on two image classification tasks in the CV field, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009); the results are presented in Table 1. For model baselines, we choose the popular and high-performing ResNet-110 (He et al., 2016) and PyramidNet (Han et al., 2017), respectively. From the experimental results, whether on CIFAR-10 or CIFAR-100, and with ResNet-110 or PyramidNet, SGDM achieves better results than SGD, indicating that backward-looking improves optimization. The EMA with the rectified item in RAdam performs better than the EMA in Adam, suggesting that a better backward-looking process leads to performance gains. Comparing SGDM and RAdam, we find that SGDM has a performance advantage, showing that although Adam uses an adaptive learning rate to improve convergence speed, it is lossy for performance. Among the adaptive learning rate optimizers, AdaBelief achieves better results than Adam and RAdam on CIFAR-10 with PyramidNet and on CIFAR-100 with ResNet-110 and PyramidNet. Ranger, which combines forward- and backward-looking, achieves better performance than the backward-looking-only RAdam on CIFAR-10 and CIFAR-100 with PyramidNet. Our ADMETAR achieves consistent improvements over its base optimizer RAdam, which also confirms the gain of bidirectional looking for optimization. ADMETAR also achieves better results than Ranger, indicating that our bidirectional looking is better than Ranger's simple combination of multiple optimization features. Our ADMETAS likewise performs better than SGDM, further demonstrating the adaptability of our approach, which works not only in the Adam family but also in the SGD family.
Following previous practice (Liu et al., 2019), we visualize the optimization process of the ResNet-110 model with the Adam, RAdam, SGDM, and our ADMETAS and ADMETAR optimizers on the CIFAR-10 and CIFAR-100 datasets in Figure 2. As can be seen from the training loss curves, all the above optimizers successfully train the model to a stable state, but ADMETAS obtains the lowest training loss on CIFAR-10, while AdaBelief obtains the lowest training loss on CIFAR-100. In terms of performance on the test set, ADMETAS obtains the best generalization ability, which shows that a lower training loss does not necessarily lead to better generalization. In addition, from the test-set accuracy, the convergence speed of the SGD family (SGDM and ADMETAS) is generally slower than that of the Adam family (Adam, RAdam, Ranger, AdaBelief and ADMETAR), but the final convergence result of the SGD family is better. However, our ADMETAR achieves performance comparable to the SGD family while maintaining the fast early convergence of the Adam family. ADMETAR has the highest test-set results in the early stage of optimization (< 80 epochs), which demonstrates that bidirectional looking improves both accuracy and speed, making ADMETAR an efficient and effective optimizer implementation. (Figure 2: training loss and test accuracy curves on CIFAR-10 and CIFAR-100 for SGD, SGDM, Adam, RAdam, Ranger, AdaBelief, ADMETAR and ADMETAS.) Compared to ResNet-110, PyramidNet has a more complicated structure and can achieve better results on these tasks. When the model is strong enough, the choice of optimizer is not the main factor in final performance.
As shown in Table 1, compared to Adam, RAdam and AdaBelief achieve only slight improvements on the CIFAR-10 task and even worse results on the CIFAR-100 task, which also verifies the above claims.

3.2. NATURAL LANGUAGE UNDERSTANDING

As a general AI component, general capability across various tasks and models is a basic requirement for optimizers. We evaluate the adaptability of our ADMETA optimizer in the fine-tuning scenario with currently popular pre-trained language models. Since the SGD family converges slowly when fine-tuning the Transformer architecture, we only compare optimizers of the Adam family here. Specifically, we conduct experiments based on the pre-trained language model BERT (Devlin et al., 2018) on three natural language understanding settings: the GLUE benchmark (Wang et al., 2018), machine reading comprehension (SQuAD v1.1 and v2.0 (Rajpurkar et al., 2016)), and named entity recognition (NER-CoNLL03 (Sang & De Meulder, 2003)). We report results for two model sizes, BERT-base and BERT-large, to explore whether model size affects the optimizer. In Table 2, we report results on the development sets of 8 datasets of the GLUE benchmark, where Acc, MCC and SCC abbreviate accuracy, Matthews Correlation Coefficient and Spearman Correlation Coefficient, respectively. First, under the BERT-base model, ADMETAR achieves consistent improvement over the base optimizer RAdam. The most significant improvements are obtained on RTE and CoLA, which indicates that our ADMETA optimizer exhibits greater stability for low-resource optimization. On the other datasets, the gains are slight; this is because, in the pre-training/fine-tuning paradigm, most model parameters have already converged to some extent during pre-training, so the further advantage of the optimizer during fine-tuning is less apparent. When the model is switched to the larger BERT-large, most tasks receive performance gains, except CoLA and RTE with the AdamW optimizer: due to the further increase in model parameters, the low-resource datasets are not sufficient to fine-tune the large model and may even reduce performance.
But RAdam with the rectified item, Ranger with bidirectional looking, and our ADMETAR handle the low-resource challenge well, continuing to improve performance and take advantage of large models. Our ADMETAR achieves the best results on these two low-resource datasets, demonstrating the effectiveness of our bidirectional looking approach. In Table 3, we further report results for machine reading comprehension and named entity recognition. ADMETAR achieves improvements at both model sizes on SQuAD v1.1, and similar improvements on SQuAD v2.0 with more complex models, illustrating that our optimizer is model-independent. Named entity recognition has reached a very accurate level with the help of pre-trained language models, and our ADMETAR optimizer still brings performance improvements over such a strong baseline, showing that optimization, in addition to model structure and data, is also a bottleneck restricting further performance improvement.

3.3. AUDIO CLASSIFICATION

Like images and natural language, speech is one of the mainstream fields of deep learning research. In speech processing there are also many pre-trained large models, such as Wav2vec (Schneider et al., 2019). To highlight the input-independent nature of the optimizer, we also conduct experiments on two typical audio classification tasks: keyword spotting (SUPERB) (Yang et al., 2021) and language identification (Common Language) (Sinisetty et al., 2021). We employ Wav2vec 2.0 base as the baseline model and report the results of each optimizer in Table 4. In addition, we list the training time of each optimizer to evaluate the impact of the bidirectional looking mechanism on time overhead. ADMETAR shows better classification accuracy than AdamW, RAdam, Ranger and AdaBelief, which is consistent with the experimental conclusions on the image and natural language tasks.

4. ABLATION STUDY

We perform an ablation study on the designs of ADMETA's bidirectional looking in this section. "-DEMA" means removing the DEMA mechanism in backward-looking and using the original EMA; "-LB" means complete removal of backward-looking; "-LF" means complete removal of forward-looking; "-LB-LF" means removing bidirectional looking entirely; "w/ constant LF" means using the original Lookahead mechanism in forward-looking. The results are evaluated with the ResNet-110 model on the test set of CIFAR-10. According to the results shown in Table 5, the improvement of SGDM over SGD again shows the advantage of backward-looking, and RAdam's improvement over Adam reveals that the EMA with the rectified item in backward-looking is more suitable for training than the original EMA. Our ADMETA (both ADMETAR and ADMETAS) achieves the best results. After removing DEMA or replacing dynamic lookahead with constant lookahead, performance drops, indicating that both DEMA and dynamic asymptotic lookahead play an important role in stable optimization. After further removing backward-looking, forward-looking, or both, the results drop further, validating our argument that bidirectional looking is beneficial for optimization.

5. CONCLUSION

In this paper, we introduce a bidirectional looking optimizer framework, exploring the use of historical and future information for optimization. For backward-looking, we introduce a DEMA scheme to replace the traditional EMA strategy; for forward-looking, we propose a dynamic asymptotic lookahead strategy to replace the constant lookahead scheme. In this way, we propose the ADMETA optimizer and provide two implementations, ADMETAR and ADMETAS, based on the adaptive and non-adaptive momentum optimizers RAdam and SGDM, respectively. We verify the benefits of ADMETA with intuitive examinations and various experiments, showing the effectiveness of our proposed optimizer. Please refer to Appendix F for a discussion of future work.

APPENDIX A RELATED WORK

As an important part of machine learning and deep learning, optimizers have received much attention in recent years. The optimizer plays a prominent role in both the convergence speed and the convergence quality of the model. To seek good properties like fast convergence, good generalization and robustness, many algorithms have been put forward recently; they can be divided into four families according to their characteristics and motivation.

SGD Family

In this family, the optimizers update parameters as $\theta_t = \theta_{t-1} - \alpha_t m_t$, where $\theta_t$ denotes the parameter to be optimized at iteration step $t$ and $m_t$ is some combination of past gradients (such as an EMA), which can be represented as $f_1(g_1, g_2, \dots, g_t)$. The original SGD (Robbins & Monro, 1951) directly subtracts the product of the global learning rate and the gradient at each step. Despite its simplicity, it is still widely used on many datasets. However, SGD is criticized for its low convergence rate and high fluctuation, so many methods have been proposed to accelerate and smooth the update process. One efficient optimizer for this issue is SGDM (Sutskever et al., 2013), which uses an exponential moving average (EMA, also known as momentum) of past gradients with exponentially decaying weights in place of the raw gradient. SGDM-Nesterov (Nesterov, 1983) is a variant of SGDM that modifies the momentum by computing the gradient at an approximation of the next position, thereby changing the descent direction. Experiments have shown that Nesterov momentum tends to achieve higher speed and better performance.

Adam Family

The Adam family optimizers usually update parameters by $\theta_t = \theta_{t-1} - \alpha_t m_t/\sqrt{v_t}$, where $v_t$ is the adaptive term and can be represented as $f_2(g_1^2, g_2^2, \dots, g_t^2)$. Compared to the SGD family, instead of using a uniform learning rate, this kind of optimizer computes an individual learning rate for each parameter through the denominator $\sqrt{v_t}$. $v_t$ is usually a dimension-reduced approximation to a matrix containing second-order curvature information, such as the Fisher matrix (Pascanu & Bengio, 2013). Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) are early optimizers in this family. A standout successor is Adam (Kingma & Ba, 2014), which combines RMSprop with Adagrad.
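The SGDM update described above can be sketched in a few lines. The following is a minimal illustrative implementation of the rule $m_t = \beta m_{t-1} + (1-\beta)g_t$, $\theta_t = \theta_{t-1} - \alpha m_t$; the function name and hyperparameter values are our own illustrative choices, not the paper's.

```python
def sgdm_step(theta, grad, m, lr=0.1, beta=0.9):
    """One SGDM step: m_t = beta*m_{t-1} + (1-beta)*g_t; theta_t = theta_{t-1} - lr*m_t."""
    m = beta * m + (1.0 - beta) * grad
    return theta - lr * m, m

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m = 1.0, 0.0
for _ in range(200):
    x, m = sgdm_step(x, 2.0 * x, m)
```

On this simple quadratic the EMA smooths the descent direction while the global learning rate stays uniform across parameters, which is the defining trait of the SGD family.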
It has been widely used across a wide range of datasets and works well even with sparse gradients. However, Adam has problems with convergence and generalization, so many methods have been proposed to improve it. Motivated by the large variance in the early stage that may lead to a bad optimum, heuristic warmup (Vaswani et al., 2017; Popel & Bojar, 2018) and RAdam (Liu et al., 2019) were proposed: the former starts with a small initial learning rate and the latter introduces a rectified term. To fix the convergence error, Reddi et al. (2019) proposed AMSGrad, which requires the second momentum to be non-decreasing. In fact, this method can be incorporated into other Adam family algorithms to guarantee convergence in convex settings. Considering the curvature of the loss function, AdaBelief (Zhuang et al., 2020) and AdaMomentum (Wang et al., 2021) were proposed. More recently, numerous studies are still devoted to improving Adam, such as AdaX (Li et al., 2020) and AdaFamily (Fassold, 2022). However, we notice that most research puts a strong emphasis on modifying the second momentum term, i.e., the adaptive term, and ignores the possibility of making a relatively holistic change to the algorithms.

Stochastic Second-Order Family

In stochastic second-order optimizers, parameters are updated using second-order information related to the Hessian matrix. The update is typically written as $\theta_t = \theta_{t-1} - \alpha_t H^{-1} m_t$, where $H$ is the Hessian matrix or an approximation to it. Ideally, they can achieve better results than first-order optimizers (like the Adam and SGD families), but their practicality is limited by the large computational cost of the second-order information, such as the Fisher or Hessian matrix. Some methods reduce this cost using low-rank decomposition or approximations to the Hessian diagonal, like Apollo (Ma, 2020), AdaHessian (Yao et al., 2021) and Shampoo (Anil et al., 2020).
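The Adam-family update $\theta_t = \theta_{t-1} - \alpha_t m_t/\sqrt{v_t}$ can be sketched as follows. This is a hedged sketch of generic Adam (Kingma & Ba, 2014) with bias correction, shown only to illustrate the per-parameter learning rate induced by $v_t$; it is not the paper's ADMETAR, and the hyperparameter values are illustrative.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter, with bias-corrected m_t and v_t."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)  # bias correction of the first momentum
    v_hat = v / (1 - b2 ** t)  # bias correction of the second momentum
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
```

Because the step is divided by $\sqrt{v_t}$, each coordinate effectively receives its own learning rate, which is what gives this family its fast early-stage progress.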
Other Optimizers

Some algorithms are not easily categorized into the above families; we list a few examples here. Motivated by the PID controller, SGD-PID (An et al., 2018) draws an analogy between the gradient and the input error in an automatic control system. Analysis shows that it can reduce the overshoot problem in SGD and its variants. Furthermore, Weng et al. (2022) applied PID to Adam and proposed the AdaPID optimizer. The Lookahead (Zhang et al., 2019) optimizer maintains two sets of weights, wherein the "fast weights" search for the descent direction and the "slow weights" follow that guide to achieve better optimization. The Ranger (Wright, 2019) optimizer further combines RAdam and Lookahead into a compound algorithm and shows better convergence performance.

Discussion

To show the advantage of bidirectional looking, we propose the ADMETA optimizer. Specifically, it is based on the idea of combining backward-looking and forward-looking, wherein DEMA plays an important role in the former and the dynamic asymptotic lookahead strategy serves the latter. In practice, we provide two versions, ADMETAS and ADMETAR, which apply the ADMETA framework to SGDM and RAdam respectively. Specifically, ADMETAS replaces the EMA traditionally used in the backward-looking part of SGDM with DEMA and adds the forward-looking part derived from the Lookahead optimizer; ADMETAR is built on RAdam in the same way. The second-order family is introduced above because the ADMETA framework can also be applied to it, which we leave as future work.
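The constant Lookahead scheme referenced above can be sketched briefly: the fast weights take $k$ inner-optimizer steps, and the slow weights then interpolate toward them with weight $\eta$. This sketch uses plain SGD as the inner optimizer and illustrative hyperparameters; it is the generic scheme of Zhang et al. (2019), not our ADMETA implementation.

```python
def lookahead_sgd(grad_fn, theta0, lr=0.1, eta=0.5, k=5, outer_steps=40):
    """Constant Lookahead around plain SGD on a scalar parameter."""
    slow = theta0
    for _ in range(outer_steps):
        fast = slow
        for _ in range(k):                 # k fast-weight updates guide the search
            fast = fast - lr * grad_fn(fast)
        slow = slow + eta * (fast - slow)  # slow weights follow the fast weights
    return slow

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x = lookahead_sgd(lambda x: 2.0 * x, 1.0)
```

Our dynamic asymptotic lookahead differs from this sketch only in making the interpolation weight vary with the training step instead of keeping it constant.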

B EMA VS. DEMA

To corroborate our analysis of EMA and DEMA, we compare the optimization processes of EMA and DEMA on the SGD optimizer, following the practice of Goh (2017). Using the same learning rate $\alpha$ and starting from the same initial point, the convergence process is shown in Figure 3. The descent surface in the figure is a convex quadratic, a useful model despite its simplicity because it exhibits an important structure, the "valley", which is often studied as an example for momentum-based optimizers. As demonstrated in Figure 3, on the one hand, DEMA achieves faster speed than EMA, which can easily be seen by comparing the distance to the optimal point at the same time step; on the other hand, DEMA achieves a better convergence result than EMA, as seen in the distance between the convergence point and the optimum.
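The comparison above can be reproduced in miniature on a one-dimensional quadratic. The sketch below contrasts plain EMA momentum with a DEMA-style combined gradient of the form used in Appendix C ($I_t = \lambda I_{t-1} + g_t$, $h_t = \kappa g_t + \mu I_t$); the constants $\kappa$, $\mu$, $\lambda$ and the learning rate are illustrative choices of ours, not the paper's tuned values.

```python
def run(steps, lr=0.05, beta=0.9, use_dema=False, kappa=0.5, mu=0.25, lam=0.5):
    """Distance to the optimum of f(x) = x^2 after `steps` momentum updates."""
    x, m, acc = 1.0, 0.0, 0.0
    for _ in range(steps):
        g = 2.0 * x
        if use_dema:
            acc = lam * acc + g        # I_t: discounted running sum of gradients
            g = kappa * g + mu * acc   # h_t: DEMA-style combined gradient
        m = beta * m + (1 - beta) * g  # EMA of the (possibly combined) gradient
        x -= lr * m
    return abs(x)

ema_dist, dema_dist = run(100), run(100, use_dema=True)
```

Plotting the two trajectories on a 2D quadratic valley, as in Figure 3, makes the speed and convergence differences visible; here both runs converge toward the optimum at zero.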

C PROOF OF CONVERGENCE

In this section, following Chen et al. (2018), Alacaoglu et al. (2020) and Reddi et al. (2019), we provide detailed convergence proofs for the ADMETAR and ADMETAS optimizers in both convex and non-convex settings.

C.1 CONVERGENCE ANALYSIS IN CONVEX AND NON-CONVEX OPTIMIZATION

Optimization problem. For deterministic problems, the objective is $\min_{\theta\in\mathcal{F}} f(\theta)$, where $f$ denotes the loss function. For online optimization, the objective is $\min_{\theta\in\mathcal{F}}\sum_{t=1}^T f_t(\theta)$, where $f_t$ is the loss function of the model with the given parameters at the $t$-th step. The criteria for judging convergence in the convex and non-convex cases are different. For convex optimization, following Reddi et al. (2019), the goal is to ensure $R(T) = o(T)$, i.e., $\lim_{T\to\infty} R(T)/T = 0$. For non-convex optimization, following Chen et al. (2018), the goal is to ensure $\min_{t\in[T]}\mathbb{E}\|\nabla f(\theta_t)\|^2 = o(T)$.

Theorem C.1 (Convergence of ADMETAR for convex optimization). Let $\{\theta_t\}$ be the sequence obtained from ADMETAR, $0 \le \lambda, \beta_1, \beta_2 < 1$, $\gamma = \beta_1^2/\beta_2 < 1$, $\alpha_t = \alpha/\sqrt{t}$ and $v_t \le v_{t+1}$, $\forall t\in[T]$. Suppose $\theta \in \mathcal{F}$, where $\mathcal{F}\subset\mathbb{R}^d$ has bounded diameter $D_\infty$, i.e., $\|\theta_t - \theta^*\|_\infty \le D_\infty$, $\forall t\in[T]$. Assume $f(\theta)$ is a convex function and $\|g_t\|_\infty$ is bounded. Denote the optimal point as $\theta^*$. For the $\theta_t$ generated, ADMETAR achieves the regret
$$R(T) = \sum_{t=1}^T \big[f_t(\theta_t) - f_t(\theta^*)\big] = O(\sqrt{T}).$$

Theorem C.2 (Convergence of ADMETAR for non-convex optimization). Under the assumptions:
• $\nabla f$ exists and is Lipschitz-continuous, i.e., $\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|$, $\forall x, y$; $f$ is also lower bounded.
• At step $t$, the algorithm can access a bounded noisy gradient $g_t$, and the true gradient $\nabla f$ is also bounded.
• The noisy gradient is unbiased with independent noise, i.e., $g_t = \nabla f(\theta_t) + \delta_t$, $\mathbb{E}[\delta_t] = 0$ and $\delta_i \perp \delta_j$, $\forall i \ne j$.
Assume $\min_{j\in[d]}(v_1)_j \ge c > 0$ and $\alpha_t = \alpha/\sqrt{t}$; then for any $T$ we have
$$\min_{t\in[T]}\mathbb{E}\|\nabla f(\theta_t)\|^2 \le \frac{1}{\sqrt{T}}\big(Q_1 + Q_2\log T\big),$$
where $Q_1$ and $Q_2$ are constants independent of $T$.

Theorem C.3 (Convergence of ADMETAS for convex optimization). Let $\{\theta_t\}$ be the sequence obtained by ADMETAS, $0 \le \lambda, \beta < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\forall t\in[T]$. Suppose $\theta \in \mathcal{F}$, where $\mathcal{F}\subset\mathbb{R}^d$ has bounded diameter $D_\infty$, i.e., $\|\theta_t - \theta^*\|_\infty \le D_\infty$, $\forall t\in[T]$. Assume $f(\theta)$ is a convex function and $\|g_t\|_\infty$ is bounded.
Denote the optimal point as $\theta^*$. For the $\theta_t$ generated, ADMETAS achieves the regret
$$R(T) = \sum_{t=1}^T \big[f_t(\theta_t) - f_t(\theta^*)\big] = O(\sqrt{T}).$$

Theorem C.4 (Convergence of ADMETAS for non-convex optimization). Under the assumptions:
• $\nabla f$ exists and is Lipschitz-continuous, i.e., $\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|$, $\forall x, y$; $f$ is also lower bounded.
• At step $t$, the algorithm can access a bounded noisy gradient $g_t$, and the true gradient $\nabla f$ is also bounded.
• The noisy gradient is unbiased with independent noise, i.e., $g_t = \nabla f(\theta_t) + \delta_t$, $\mathbb{E}[\delta_t] = 0$ and $\delta_i \perp \delta_j$, $\forall i \ne j$.
Assume $\alpha_t = \alpha/\sqrt{t}$; then for any $T$ we have
$$\min_{t\in[T]}\mathbb{E}\|\nabla f(\theta_t)\|^2 \le \frac{1}{\sqrt{T}}\big(Q'_1 + Q'_2\log T\big),$$
where $Q'_1$ and $Q'_2$ are constants independent of $T$.

Before formally proving the theorems, we list some remarks and preparations.

Remark 1. For brevity, we omit the rectified term of ADMETAR in the proof. This does not affect the proof, since the term can be absorbed into the learning rate.

Remark 2. Following Luo et al. (2019), the bias correction $1/(1-\beta_1^t)$ of the first momentum $m_t$ is omitted in the convergence analysis of ADMETAR. Since $1/(1-\beta_1^t)$ is bounded below by $1$ and above by $1/(1-\beta_1)$, the order of the terms used is not affected, and thus the proof is hardly affected.

Remark 3. The forward-looking part is not considered in the proof. On the one hand, explanations and proofs for constant Lookahead have been given in Zhang et al. (2019) and Wang et al. (2020), which our dynamic method can imitate. On the other hand, the forward-looking part is exactly an interpolation of the fast and slow weights at each synchronization period, and the fast weights are updated by the given optimizer. Therefore, the convergence proof reduces to proving convergence of the fast weights.

Lemma C.5. If $\|g_t\|_\infty$ is bounded, i.e., $\|g_t\|_\infty \le G_\infty$, $\forall t\in[T]$, where $G_\infty$ is a constant independent of $T$, then $I_t$, $h_t$ and $m_t$ are also bounded.

Proof. First, we prove $\|I_t\|_\infty \le \frac{G_\infty}{1-\lambda}$ by induction. When $t = 1$,
$$\|I_1\|_\infty = \|g_1\|_\infty \le G_\infty \le \frac{G_\infty}{1-\lambda}.$$
Suppose the claim holds for $t = k$; then for $t = k+1$,
$$\|I_{k+1}\|_\infty = \|\lambda I_k + g_{k+1}\|_\infty \le \lambda\|I_k\|_\infty + \|g_{k+1}\|_\infty \le \frac{\lambda G_\infty}{1-\lambda} + G_\infty = \frac{G_\infty}{1-\lambda}.$$
Next, for $h_t$,
$$\|h_t\|_\infty = \|\kappa g_t + \mu I_t\|_\infty \le \kappa\|g_t\|_\infty + \mu\|I_t\|_\infty \le \Big[\kappa + \frac{\mu}{1-\lambda}\Big]G_\infty.$$
Since $m_t$ is a moving average of $h_i$, $i = 1,\dots,t$, it is also bounded, following the proof for $I_t$. In this way, we can redefine $G_\infty$ by enlarging it, so that the bounded stochastic gradient assumption in the theorems is equivalent to assuming $\|g_t\|_\infty, \|I_t\|_\infty, \|h_t\|_\infty, \|m_t\|_\infty \le G_\infty$.

Remark 4. For non-convex optimization, in the same way, the bounded noisy gradient assumption is equivalent to $\|g_t\|, \|I_t\|, \|h_t\|, \|m_t\| \le H$, where $H$ is a constant independent of $T$. This remark is used in several places in the following proofs.

Lemma C.6 (Generalized Hölder inequality, Beckenbach & Bellman (2012)). For $x, y, z \in \mathbb{R}^n_+$ and positive $p, q, r$ such that $\frac1p + \frac1q + \frac1r = 1$, we have $\sum_{j=1}^n x_j y_j z_j \le \|x\|_p\|y\|_q\|z\|_r$. This is a standard inequality, so the proof is omitted here.

Lemma C.7. Let $Q$ be a positive definite matrix and $\mathcal{F} \subset \mathbb{R}^d$ a convex set. Suppose $u_1 = \arg\min_{x\in\mathcal{F}}\|Q^{1/2}(x - z_1)\|$ and $u_2 = \arg\min_{x\in\mathcal{F}}\|Q^{1/2}(x - z_2)\|$; then $\|Q^{1/2}(u_1 - u_2)\| \le \|Q^{1/2}(z_1 - z_2)\|$.

Proof. First, we claim that $\langle u_1 - z_1, Q(u_2 - u_1)\rangle \ge 0$ and $\langle u_2 - z_2, Q(u_1 - u_2)\rangle \ge 0$ (we only prove the former, as the proofs are identical). Otherwise, for a small $\delta$ we have $u_1 + \delta(u_2 - u_1) \in \mathcal{F}$ and
$$\tfrac12\langle u_1 + \delta(u_2-u_1) - z_1,\, Q(u_1 + \delta(u_2-u_1) - z_1)\rangle = \tfrac12\langle u_1 - z_1, Q(u_1 - z_1)\rangle + \tfrac12\delta^2\langle u_2 - u_1, Q(u_2 - u_1)\rangle + \delta\langle u_1 - z_1, Q(u_2 - u_1)\rangle.$$
If $\langle u_1 - z_1, Q(u_2 - u_1)\rangle < 0$, then $\delta$ can be chosen so small that $\tfrac12\delta^2\langle u_2 - u_1, Q(u_2 - u_1)\rangle + \delta\langle u_1 - z_1, Q(u_2 - u_1)\rangle < 0$, which contradicts the definition of $u_1$.
Using the above claim, we further have
$$\langle u_1 - z_1, Q(u_2 - u_1)\rangle - \langle u_2 - z_2, Q(u_2 - u_1)\rangle \ge 0 \;\Rightarrow\; \langle z_2 - z_1, Q(u_2 - u_1)\rangle \ge \langle u_2 - u_1, Q(u_2 - u_1)\rangle. \tag{18}$$
Also, observing that
$$\langle (u_2 - u_1) - (z_2 - z_1),\, Q((u_2 - u_1) - (z_2 - z_1))\rangle \ge 0 \;\Rightarrow\; \langle u_2 - u_1, Q(z_2 - z_1)\rangle \le \tfrac12\big[\langle u_2 - u_1, Q(u_2 - u_1)\rangle + \langle z_2 - z_1, Q(z_2 - z_1)\rangle\big], \tag{19}$$
and combining (18) and (19), we obtain the required result.

C.2 CONVERGENCE ANALYSIS OF ADMETAR FOR CONVEX OPTIMIZATION

Lemma C.8. Consider $m_t = \beta_1 m_{t-1} + (1-\beta_1)h_t$, $\forall t \ge 1$. It follows that
$$\langle h_t, \theta_t - \theta^*\rangle = \langle m_{t-1}, \theta_{t-1} - \theta^*\rangle - \frac{\beta_1}{1-\beta_1}\langle m_{t-1}, \theta_t - \theta_{t-1}\rangle + \frac{1}{1-\beta_1}\big(\langle m_t, \theta_t - \theta^*\rangle - \langle m_{t-1}, \theta_{t-1} - \theta^*\rangle\big).$$
Proof. By the definition of $m_t$, $h_t = \frac{1}{1-\beta_1}m_t - \frac{\beta_1}{1-\beta_1}m_{t-1}$. Thus, we have
$$\langle h_t, \theta_t - \theta^*\rangle = \frac{1}{1-\beta_1}\langle m_t, \theta_t - \theta^*\rangle - \frac{\beta_1}{1-\beta_1}\langle m_{t-1}, \theta_t - \theta^*\rangle$$
$$= \frac{1}{1-\beta_1}\langle m_t, \theta_t - \theta^*\rangle - \frac{\beta_1}{1-\beta_1}\langle m_{t-1}, \theta_{t-1} - \theta^*\rangle - \frac{\beta_1}{1-\beta_1}\langle m_{t-1}, \theta_t - \theta_{t-1}\rangle$$
$$= \frac{1}{1-\beta_1}\big(\langle m_t, \theta_t - \theta^*\rangle - \langle m_{t-1}, \theta_{t-1} - \theta^*\rangle\big) + \langle m_{t-1}, \theta_{t-1} - \theta^*\rangle - \frac{\beta_1}{1-\beta_1}\langle m_{t-1}, \theta_t - \theta_{t-1}\rangle.$$

Lemma C.9 (Bound for $\sum_{t=1}^T \alpha_t\|\hat v_t^{-1/4}m_t\|^2$). Under the assumptions of Theorem C.1, we have
$$\sum_{t=1}^T \alpha_t\|\hat v_t^{-1/4}m_t\|^2 \le \frac{(1-\beta_1)\alpha\sqrt{1+\log T}}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\|h_{1:T,i}\|_2.$$
Proof. First, we bound $\|\hat v_t^{-1/4}m_t\|^2$. From the definitions of $m_t$ and $v_t$, it follows that
$$m_t = (1-\beta_1)\sum_{j=1}^t\beta_1^{t-j}h_j,\qquad v_t = (1-\beta_2)\sum_{j=1}^t\beta_2^{t-j}h_j^2.$$
Then we have
$$\|\hat v_t^{-1/4}m_t\|^2 \le \|v_t^{-1/4}m_t\|^2 = \sum_{i=1}^d\frac{m_{t,i}^2}{v_{t,i}^{1/2}} = \frac{(1-\beta_1)^2}{\sqrt{1-\beta_2}}\sum_{i=1}^d\frac{\big(\sum_{j=1}^t\beta_1^{t-j}h_{j,i}\big)^2}{\sqrt{\sum_{j=1}^t\beta_2^{t-j}h_{j,i}^2}}$$
$$\le \frac{(1-\beta_1)^2}{\sqrt{1-\beta_2}}\sum_{i=1}^d\Big(\sum_{j=1}^t\gamma^{t-j}\Big)^{1/2}\sum_{j=1}^t\beta_1^{t-j}|h_{j,i}| \le \frac{(1-\beta_1)^2}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\sum_{j=1}^t\beta_1^{t-j}|h_{j,i}|, \tag{20}$$
where the first inequality follows from the fact that $\hat v_{t,i}^{1/2} \ge v_{t,i}^{1/2}$, the second follows from the generalized Hölder inequality with $x_j = \beta_2^{(t-j)/4}|h_{j,i}|^{1/2}$, $y_j = (\beta_1^{1/2}\beta_2^{-1/4})^{t-j}$, $z_j = (\beta_1^{t-j}|h_{j,i}|)^{1/2}$ and $p = q = 4$, $r = 2$, and the third follows from the sum of the geometric series and the assumption $\gamma = \beta_1^2/\beta_2 < 1$. In this way, we can bound $\sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2$.
$$\sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2 \le \frac{(1-\beta_1)^2}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\sum_{t=1}^T\alpha_t\sum_{j=1}^t\beta_1^{t-j}|h_{j,i}| = \frac{(1-\beta_1)^2}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\sum_{j=1}^T\sum_{t=j}^T\alpha_t\beta_1^{t-j}|h_{j,i}|$$
$$\le \frac{1-\beta_1}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\sum_{j=1}^T\alpha_j|h_{j,i}| \le \frac{1-\beta_1}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\sqrt{\sum_{j=1}^T\alpha_j^2}\sqrt{\sum_{j=1}^T h_{j,i}^2} \le \frac{(1-\beta_1)\alpha\sqrt{1+\log T}}{\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\|h_{1:T,i}\|_2,$$
where the first inequality follows from (20), the first equality is by changing the order of summation, the second inequality follows from the fact that $\sum_{t=j}^T\alpha_t\beta_1^{t-j} \le \frac{\alpha_j}{1-\beta_1}$, the third is by Cauchy–Schwarz, and the last uses $\sum_{j=1}^T\frac1j \le 1+\log T$.

Theorem C.10 (Convergence of ADMETAR for convex optimization; restatement of Theorem C.1).

Proof. • Bound for $\sum_{t=1}^T\langle m_t, \theta_t - \theta^*\rangle$. Since the iterates remain in $\mathcal{F}$, we have
$$\theta_{t+1} = \Pi_{\mathcal{F},\sqrt{\hat v_t}}\big(\theta_t - \alpha_t\hat v_t^{-1/2}m_t\big) = \arg\min_{x\in\mathcal{F}}\big\|\hat v_t^{1/4}\big(x - (\theta_t - \alpha_t\hat v_t^{-1/2}m_t)\big)\big\|.$$
Furthermore, $\Pi_{\mathcal{F},\sqrt{\hat v_t}}(x) = x$ for all $x\in\mathcal{F}$.
Using Lemma C.7 with $u_1 = \theta_{t+1}$ and $u_2 = \theta^*$, we have the following:
$$\|\hat v_t^{1/4}(\theta_{t+1}-\theta^*)\|^2 \le \|\hat v_t^{1/4}(\theta_t - \alpha_t\hat v_t^{-1/2}m_t - \theta^*)\|^2 = \|\hat v_t^{1/4}(\theta_t-\theta^*)\|^2 + \alpha_t^2\|\hat v_t^{-1/4}m_t\|^2 - 2\alpha_t\langle m_t, \theta_t-\theta^*\rangle. \tag{21}$$
We rearrange and divide both sides of (21) by $2\alpha_t$ to get
$$\langle m_t, \theta_t-\theta^*\rangle \le \frac{1}{2\alpha_t}\|\hat v_t^{1/4}(\theta_t-\theta^*)\|^2 - \frac{1}{2\alpha_t}\|\hat v_t^{1/4}(\theta_{t+1}-\theta^*)\|^2 + \frac{\alpha_t}{2}\|\hat v_t^{-1/4}m_t\|^2$$
$$= \frac{1}{2\alpha_{t-1}}\|\hat v_{t-1}^{1/4}(\theta_t-\theta^*)\|^2 - \frac{1}{2\alpha_t}\|\hat v_t^{1/4}(\theta_{t+1}-\theta^*)\|^2 + \frac12\sum_{i=1}^d\Big(\frac{\hat v_{t,i}^{1/2}}{\alpha_t} - \frac{\hat v_{t-1,i}^{1/2}}{\alpha_{t-1}}\Big)(\theta_{t,i}-\theta^*_i)^2 + \frac{\alpha_t}{2}\|\hat v_t^{-1/4}m_t\|^2$$
$$\le \frac{1}{2\alpha_{t-1}}\|\hat v_{t-1}^{1/4}(\theta_t-\theta^*)\|^2 - \frac{1}{2\alpha_t}\|\hat v_t^{1/4}(\theta_{t+1}-\theta^*)\|^2 + \frac{D_\infty^2}{2}\sum_{i=1}^d\Big(\frac{\hat v_{t,i}^{1/2}}{\alpha_t} - \frac{\hat v_{t-1,i}^{1/2}}{\alpha_{t-1}}\Big) + \frac{\alpha_t}{2}\|\hat v_t^{-1/4}m_t\|^2, \tag{22}$$
where the last inequality is due to the facts that $\hat v_{t,i} \ge \hat v_{t-1,i}$ and $\frac{1}{\alpha_t} \ge \frac{1}{\alpha_{t-1}}$, and the definition of $D_\infty$. Summing (22) over $t = 1,\dots,T$ and using $\hat v_0 = 0$ yields
$$\sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle \le \frac{D_\infty^2}{2\alpha_T}\sum_{i=1}^d\hat v_{T,i}^{1/2} + \frac12\sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2.$$
• Bound for $\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle$ (with $m_0 = 0$):
$$\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle = \sum_{t=2}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle = \sum_{t=1}^{T-1}\langle m_t, \theta_t-\theta_{t+1}\rangle \le \sum_{t=1}^{T-1}\|\hat v_t^{-1/4}m_t\|\,\|\hat v_t^{1/4}(\theta_{t+1}-\theta_t)\|$$
$$= \sum_{t=1}^{T-1}\|\hat v_t^{-1/4}m_t\|\,\big\|\hat v_t^{1/4}\big[\Pi_{\mathcal{F},\hat v_t^{1/2}}\big(\theta_t - \alpha_t\hat v_t^{-1/2}m_t\big) - \Pi_{\mathcal{F},\hat v_t^{1/2}}(\theta_t)\big]\big\| \le \sum_{t=1}^{T-1}\alpha_t\|\hat v_t^{-1/4}m_t\|^2,$$
where the first inequality follows from Hölder's inequality and the second inequality is due to Lemma C.7.
• Bound for $\langle m_T, \theta_T-\theta^*\rangle$:
$$\langle m_T, \theta_T-\theta^*\rangle \le \|\hat v_T^{-1/4}m_T\|\,\|\hat v_T^{1/4}(\theta_T-\theta^*)\| \le \alpha_T\|\hat v_T^{-1/4}m_T\|^2 + \frac{1}{4\alpha_T}\|\hat v_T^{1/4}(\theta_T-\theta^*)\|^2 \le \alpha_T\|\hat v_T^{-1/4}m_T\|^2 + \frac{D_\infty^2}{4\alpha_T}\sum_{i=1}^d\hat v_{T,i}^{1/2},$$
where the first inequality follows from Hölder's inequality, the second from Young's inequality, and the last from the definition of $D_\infty$.
After all these preparations, we obtain
$$\sum_{t=1}^T\langle h_t, \theta_t-\theta^*\rangle = \frac{\beta_1}{1-\beta_1}\langle m_T, \theta_T-\theta^*\rangle + \frac{\beta_1}{1-\beta_1}\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle + \sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle$$
$$\le \frac{\beta_1}{1-\beta_1}\Big(\frac{D_\infty^2}{4\alpha_T}\sum_{i=1}^d\hat v_{T,i}^{1/2} + \sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2\Big) + \frac{D_\infty^2}{2\alpha_T}\sum_{i=1}^d\hat v_{T,i}^{1/2} + \frac12\sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2$$
$$\le \frac{(2-\beta_1)D_\infty^2}{4\alpha_T(1-\beta_1)}\sum_{i=1}^d\hat v_{T,i}^{1/2} + \frac{2+\beta_1}{2(1-\beta_1)}\sum_{t=1}^T\alpha_t\|\hat v_t^{-1/4}m_t\|^2 \le \frac{(2-\beta_1)D_\infty^2\sqrt{T}}{4\alpha(1-\beta_1)}\sum_{i=1}^d\hat v_{T,i}^{1/2} + \frac{(2+\beta_1)\alpha\sqrt{1+\log T}}{2\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^d\|h_{1:T,i}\|_2.$$
This proves that $\sum_{t=1}^T\langle h_t, \theta_t-\theta^*\rangle = O(\sqrt{T})$. Suppose the optimizer runs for a long time; the bias of the EMA is then small (Zhuang et al., 2020), so $\mathbb{E}(I_t)$ approaches $\mathbb{E}(g_t)$ as the step increases. Since $h_t = \kappa g_t + \mu I_t$, $h_t$ is of the same order as $g_t$ when $t$ is large enough; thus we have
$$\sum_{t=1}^T\langle g_t, \theta_t-\theta^*\rangle = O(\sqrt{T}). \tag{23}$$
In addition, due to the convexity of $f(\cdot)$, we have
$$R(T) = \sum_{t=1}^T f_t(\theta_t) - f_t(\theta^*) \le \sum_{t=1}^T\langle g_t, \theta_t-\theta^*\rangle.$$
Combined with (23), we complete the proof.
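Two elementary inequalities recur throughout these proofs: $\sum_{t=1}^T \frac1t \le 1+\log T$ (used in Lemma C.9 and in the $C_1$/$C_2$ bounds below) and $\sum_{t=1}^T \frac{1}{\sqrt t} \le 2\sqrt T$ (used in Lemma C.16 for the step sizes $\alpha_t = \alpha/\sqrt t$). A quick numeric sanity check of both:

```python
import math

def harmonic(T):
    """sum_{t=1}^T 1/t, bounded above by 1 + log(T)."""
    return sum(1.0 / t for t in range(1, T + 1))

def sqrt_sum(T):
    """sum_{t=1}^T 1/sqrt(t), bounded above by 2*sqrt(T)."""
    return sum(1.0 / math.sqrt(t) for t in range(1, T + 1))

checks = all(
    harmonic(T) <= 1 + math.log(T) and sqrt_sum(T) <= 2 * math.sqrt(T)
    for T in (1, 10, 1000)
)
```

Both bounds follow from comparing the sums with the integrals $\int_1^T \frac{dt}{t}$ and $\int_0^T \frac{dt}{\sqrt t}$.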

C.3 CONVERGENCE ANALYSIS OF ADMETAR FOR NON-CONVEX OPTIMIZATION

Lemma C.11. Set $\theta_0 \triangleq \theta_1$ in the algorithm, and define $z_t$ as
$$z_t = \theta_t + \frac{\beta_1}{1-\beta_1}(\theta_t - \theta_{t-1}),\quad \forall t\ge1. \tag{24}$$
Then the following holds true:
$$z_{t+1} - z_t = -\frac{\beta_1}{1-\beta_1}\Big(\frac{\alpha_t}{\sqrt{\hat v_t}} - \frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)m_{t-1} - \alpha_t h_t/\sqrt{\hat v_t}.$$
Proof. By the update of ADMETAR, we have
$$\theta_{t+1}-\theta_t = -\alpha_t m_t/\sqrt{\hat v_t} = -\alpha_t\big(\beta_1 m_{t-1} + (1-\beta_1)h_t\big)/\sqrt{\hat v_t} = \beta_1\frac{\alpha_t\sqrt{\hat v_{t-1}}}{\alpha_{t-1}\sqrt{\hat v_t}}(\theta_t-\theta_{t-1}) - \alpha_t(1-\beta_1)h_t/\sqrt{\hat v_t}$$
$$= \beta_1(\theta_t-\theta_{t-1}) + \beta_1\Big(\frac{\alpha_t\sqrt{\hat v_{t-1}}}{\alpha_{t-1}\sqrt{\hat v_t}} - 1\Big)(\theta_t-\theta_{t-1}) - \alpha_t(1-\beta_1)h_t/\sqrt{\hat v_t} = \beta_1(\theta_t-\theta_{t-1}) - \beta_1\Big(\frac{\alpha_t}{\sqrt{\hat v_t}} - \frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)m_{t-1} - \alpha_t(1-\beta_1)h_t/\sqrt{\hat v_t}. \tag{25}$$
Since we also have $\theta_{t+1}-\theta_t = (1-\beta_1)\theta_{t+1} + \beta_1(\theta_{t+1}-\theta_t) - (1-\beta_1)\theta_t$, combining with (25) gives
$$(1-\beta_1)\theta_{t+1} + \beta_1(\theta_{t+1}-\theta_t) = (1-\beta_1)\theta_t + \beta_1(\theta_t-\theta_{t-1}) - \beta_1\Big(\frac{\alpha_t}{\sqrt{\hat v_t}} - \frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)m_{t-1} - \alpha_t(1-\beta_1)h_t/\sqrt{\hat v_t}.$$
Dividing both sides by $1-\beta_1$, we have
$$\theta_{t+1} + \frac{\beta_1}{1-\beta_1}(\theta_{t+1}-\theta_t) = \theta_t + \frac{\beta_1}{1-\beta_1}(\theta_t-\theta_{t-1}) - \frac{\beta_1}{1-\beta_1}\Big(\frac{\alpha_t}{\sqrt{\hat v_t}} - \frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)m_{t-1} - \alpha_t h_t/\sqrt{\hat v_t}.$$

Lemma C.12. Suppose that the conditions in Theorem C.2 hold; then $\mathbb{E}[f(z_{t+1}) - f(z_1)] \le \sum_{i=1}^4 T_i$, where
$$T_1 = -\mathbb{E}\sum_{i=1}^t\Big\langle\nabla f(z_i), \frac{\beta_1}{1-\beta_1}\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)m_{i-1}\Big\rangle,\qquad T_2 = -\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i), h_i/\sqrt{\hat v_i}\rangle,$$
$$T_3 = \mathbb{E}\sum_{i=1}^t L\Big\|\frac{\beta_1}{1-\beta_1}\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)m_{i-1}\Big\|^2,\qquad T_4 = \mathbb{E}\sum_{i=1}^t L\|\alpha_i h_i/\sqrt{\hat v_i}\|^2.$$
Proof. By the Lipschitz smoothness of $\nabla f$,
$$f(z_{t+1}) \le f(z_t) + \langle\nabla f(z_t), z_{t+1}-z_t\rangle + \frac L2\|z_{t+1}-z_t\|^2.$$
Based on Lemma C.11, we have
$$\mathbb{E}[f(z_{t+1})-f(z_1)] = \mathbb{E}\sum_{i=1}^t f(z_{i+1})-f(z_i) \le \mathbb{E}\sum_{i=1}^t\Big[\langle\nabla f(z_i), z_{i+1}-z_i\rangle + \frac L2\|z_{i+1}-z_i\|^2\Big] = T_1 + T_2 + \mathbb{E}\sum_{i=1}^t\frac L2\|z_{i+1}-z_i\|^2.$$
Then, using the inequality $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ combined with Lemma C.11,
$$\mathbb{E}\sum_{i=1}^t\frac L2\|z_{i+1}-z_i\|^2 \le T_3 + T_4.$$

Lemma C.13. In this part, we bound $T_1$, $T_2$ and $T_3$.

Proof.
• Bound for $T_1$:
$$T_1 = -\mathbb{E}\sum_{i=2}^t\Big\langle\nabla f(z_i), \frac{\beta_1}{1-\beta_1}\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)m_{i-1}\Big\rangle \le \mathbb{E}\sum_{i=2}^t\|\nabla f(z_i)\|\,\|m_{i-1}\|\Big(\frac{1}{1-\beta_1}-1\Big)\sum_{j=1}^d\Big|\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j\Big|$$
$$\le H^2\frac{\beta_1}{1-\beta_1}\,\mathbb{E}\Big[\sum_{i=2}^t\sum_{j=1}^d\Big|\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j\Big|\Big].$$
• Bound for $T_3$:
$$T_3 \le L\,\mathbb{E}\Big[\sum_{i=2}^t\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\sum_{j=1}^d\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j^2(m_{i-1})_j^2\Big] \le \Big(\frac{\beta_1}{1-\beta_1}\Big)^2 LH^2\,\mathbb{E}\Big[\sum_{i=2}^t\sum_{j=1}^d\Big(\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j^2\Big].$$
• Bound for $T_2$:
$$T_2 = -\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i), h_i/\sqrt{\hat v_i}\rangle = -\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), h_i/\sqrt{\hat v_i}\rangle - \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i)-\nabla f(\theta_i), h_i/\sqrt{\hat v_i}\rangle. \tag{27}$$
The second term of (27) can be bounded as
$$-\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i)-\nabla f(\theta_i), h_i/\sqrt{\hat v_i}\rangle \le \mathbb{E}\sum_{i=2}^t\Big[\frac12\|\nabla f(z_i)-\nabla f(\theta_i)\|^2 + \frac12\|\alpha_i h_i/\sqrt{\hat v_i}\|^2\Big]$$
$$\le \frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\sum_{i=2}^t\|\alpha_{i-1}m_{i-1}/\sqrt{\hat v_{i-1}}\|^2 + \frac12\mathbb{E}\sum_{i=2}^t\|\alpha_i h_i/\sqrt{\hat v_i}\|^2,$$
where the inequality is due to $\|\nabla f(z_i)-\nabla f(\theta_i)\| \le L\|z_i-\theta_i\|$ and $z_i - \theta_i = \frac{\beta_1}{1-\beta_1}(\theta_i - \theta_{i-1})$. Then consider the first term of (27):
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), h_i/\sqrt{\hat v_i}\rangle = \kappa\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), g_i/\sqrt{\hat v_i}\rangle + \mu\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), I_i/\sqrt{\hat v_i}\rangle.$$
Consider the term with $\kappa$:
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), g_i/\sqrt{\hat v_i}\rangle = \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), (\nabla f(\theta_i)+\delta_i)/\sqrt{\hat v_i}\rangle = \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle + \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \delta_i/\sqrt{\hat v_i}\rangle. \tag{28}$$
For the second term on the RHS of (28), we have
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \delta_i/\sqrt{\hat v_i}\rangle = \mathbb{E}\sum_{i=2}^t\big\langle\nabla f(\theta_i), \delta_i(\alpha_i/\sqrt{\hat v_i} - \alpha_{i-1}/\sqrt{\hat v_{i-1}})\big\rangle + \mathbb{E}\sum_{i=2}^t\alpha_{i-1}\big\langle\nabla f(\theta_i), \delta_i/\sqrt{\hat v_{i-1}}\big\rangle + \mathbb{E}\,\alpha_1\big\langle\nabla f(\theta_1), \delta_1/\sqrt{\hat v_1}\big\rangle$$
$$\ge \mathbb{E}\sum_{i=2}^t\big\langle\nabla f(\theta_i), \delta_i(\alpha_i/\sqrt{\hat v_i} - \alpha_{i-1}/\sqrt{\hat v_{i-1}})\big\rangle - 2H^2\,\mathbb{E}\Big[\sum_{j=1}^d(\alpha_1/\sqrt{\hat v_1})_j\Big], \tag{29}$$
where we use that, given $\theta_i$ and $\hat v_{i-1}$, $\mathbb{E}\big[\delta_i/\sqrt{\hat v_{i-1}}\,\big|\,\theta_i, \hat v_{i-1}\big] = 0$, and that $\|\delta_i\| \le 2H$. Further, we have
$$\mathbb{E}\sum_{i=2}^t\big\langle\nabla f(\theta_i), \delta_i(\alpha_i/\sqrt{\hat v_i} - \alpha_{i-1}/\sqrt{\hat v_{i-1}})\big\rangle = \mathbb{E}\sum_{i=2}^t\sum_{j=1}^d(\nabla f(\theta_i))_j(\delta_i)_j\big(\alpha_i/(\sqrt{\hat v_i})_j - \alpha_{i-1}/(\sqrt{\hat v_{i-1}})_j\big)$$
$$\ge -\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d|(\nabla f(\theta_i))_j|\,|(\delta_i)_j|\,\big|\alpha_i/(\sqrt{\hat v_i})_j - \alpha_{i-1}/(\sqrt{\hat v_{i-1}})_j\big| \ge -2H^2\,\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\big|\alpha_i/(\sqrt{\hat v_i})_j - \alpha_{i-1}/(\sqrt{\hat v_{i-1}})_j\big|. \tag{30}$$
Substituting (29) and (30) into (28), we then get
$$-\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), g_i/\sqrt{\hat v_i}\rangle \le 2H^2\,\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\big|\alpha_i/(\sqrt{\hat v_i})_j - \alpha_{i-1}/(\sqrt{\hat v_{i-1}})_j\big| + 2H^2\,\mathbb{E}\sum_{j=1}^d(\alpha_1/\sqrt{\hat v_1})_j - \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle.$$
Then consider the term with $\mu$. Suppose the optimizer runs for a long time; the bias of the EMA is small (Zhuang et al., 2020), so $\mathbb{E}(I_t)$ approaches $\mathbb{E}(g_t)$ as the step increases. In other words, we can bound it in the same way as the term with $\kappa$. Collecting all these bounds, we finally get
$$T_2 \le \frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\sum_{i=2}^t\|\alpha_{i-1}m_{i-1}/\sqrt{\hat v_{i-1}}\|^2 + \frac12\mathbb{E}\sum_{i=2}^t\|\alpha_i h_i/\sqrt{\hat v_i}\|^2 + 2(\kappa+\mu)H^2\,\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\big|\alpha_i/(\sqrt{\hat v_i})_j - \alpha_{i-1}/(\sqrt{\hat v_{i-1}})_j\big|$$
$$+ 2(\kappa+\mu)H^2\,\mathbb{E}\sum_{j=1}^d(\alpha_1/\sqrt{\hat v_1})_j - (\kappa+\mu)\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle.$$

Lemma C.14. Suppose the conditions in Theorem C.2 hold; then we have
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle \le \mathbb{E}\Big[C_1\sum_{i=1}^t\Big\|\frac{\alpha_i h_i}{\sqrt{\hat v_i}}\Big\|^2 + C_2\sum_{i=2}^t\Big\|\frac{\alpha_{i-1}m_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|^2 + C_3\sum_{i=2}^t\Big\|\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|_1 + C_4\sum_{i=2}^{t-1}\Big\|\frac{\alpha_i}{\sqrt{\hat v_i}} - \frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|^2\Big] + C_5,$$
where $C_1$, $C_2$, $C_3$, $C_4$ and $C_5$ are constants independent of the step.

Proof.
Combining Lemma C.12 and Lemma C.13, we get
$$\mathbb{E}[f(z_{t+1})-f(z_1)] \le H^2\frac{\beta_1}{1-\beta_1}\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\Big|\Big(\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j\Big| + \Big(\frac{\beta_1}{1-\beta_1}\Big)^2 LH^2\,\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\Big(\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j^2$$
$$+ L\,\mathbb{E}\sum_{i=1}^t\|\alpha_i h_i/\sqrt{\hat v_i}\|^2 + \frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\sum_{i=2}^t\|\alpha_{i-1}m_{i-1}/\sqrt{\hat v_{i-1}}\|^2 + \frac12\mathbb{E}\sum_{i=2}^t\|\alpha_i h_i/\sqrt{\hat v_i}\|^2$$
$$+ 2(\kappa+\mu)H^2\,\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\Big|\Big(\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j\Big| + 2(\kappa+\mu)H^2\,\mathbb{E}\sum_{j=1}^d(\alpha_1/\sqrt{\hat v_1})_j - (\kappa+\mu)\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle.$$
By merging similar terms in the above inequality and noticing that $\kappa+\mu > 0$, we get
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)/\sqrt{\hat v_i}\rangle \le \Big(2H^2 + \frac{\beta_1 H^2}{(1-\beta_1)(\kappa+\mu)}\Big)\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\Big|\Big(\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j\Big| + \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{LH^2}{\kappa+\mu}\mathbb{E}\sum_{i=2}^t\sum_{j=1}^d\Big(\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big)_j^2$$
$$+ \frac{2L+1}{2(\kappa+\mu)}\mathbb{E}\sum_{i=1}^t\Big\|\frac{\alpha_i h_i}{\sqrt{\hat v_i}}\Big\|^2 + \frac{L^2}{2(\kappa+\mu)}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\sum_{i=2}^t\Big\|\frac{\alpha_{i-1}m_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|^2 + 2H^2\,\mathbb{E}\sum_{j=1}^d(\alpha_1/\sqrt{\hat v_1})_j + \frac{1}{\kappa+\mu}\mathbb{E}[f(z_1)-f(z_{t+1})]$$
$$= \mathbb{E}\Big[C_1\sum_{i=1}^t\Big\|\frac{\alpha_i h_i}{\sqrt{\hat v_i}}\Big\|^2 + C_2\sum_{i=2}^t\Big\|\frac{\alpha_{i-1}m_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|^2 + C_3\sum_{i=2}^t\Big\|\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|_1 + C_4\sum_{i=2}^{t-1}\Big\|\frac{\alpha_i}{\sqrt{\hat v_i}}-\frac{\alpha_{i-1}}{\sqrt{\hat v_{i-1}}}\Big\|^2\Big] + C_5. \tag{32}$$

Theorem C.15 (Convergence of ADMETAR for non-convex optimization; restatement of Theorem C.2).

Proof. We bound the non-constant terms on the RHS of (32), which is given by
$$\mathbb{E}\Big[C_1\sum_{t=1}^T\Big\|\frac{\alpha_t h_t}{\sqrt{\hat v_t}}\Big\|^2 + C_2\sum_{t=2}^T\Big\|\frac{\alpha_{t-1}m_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2 + C_3\sum_{t=2}^T\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|_1 + C_4\sum_{t=2}^{T-1}\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2\Big] + C_5.$$
• Bound the term with $C_1$.
Note that $\min_{j\in[d]}(\sqrt{\hat v_t})_j \ge \min_{j\in[d]}(\sqrt{\hat v_1})_j \ge c > 0$; thus we have
$$\mathbb{E}\sum_{t=1}^T\Big\|\frac{\alpha_t h_t}{\sqrt{\hat v_t}}\Big\|^2 \le \mathbb{E}\sum_{t=1}^T\Big\|\frac{\alpha_t h_t}{c}\Big\|^2 = \mathbb{E}\sum_{t=1}^T\Big(\frac{\alpha}{c\sqrt t}\Big)^2\|h_t\|^2 \le \frac{H^2\alpha^2}{c^2}\sum_{t=1}^T\frac1t \le \frac{H^2\alpha^2}{c^2}(1+\log T),$$
where the first inequality is due to $(\hat v_t)_j \ge (\hat v_1)_j$ and the last is due to $\sum_{t=1}^T\frac1t \le 1+\log T$.
• Bound the term with $C_2$. Applying the same argument as above, we get
$$\mathbb{E}\sum_{t=2}^T\Big\|\frac{\alpha_{t-1}m_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2 \le \frac{H^2\alpha^2}{c^2}(1+\log T).$$
• Bound the term with $C_3$:
$$\mathbb{E}\sum_{t=2}^T\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|_1 = \mathbb{E}\sum_{j=1}^d\sum_{t=2}^T\Big(\frac{\alpha_{t-1}}{(\sqrt{\hat v_{t-1}})_j}-\frac{\alpha_t}{(\sqrt{\hat v_t})_j}\Big) = \mathbb{E}\sum_{j=1}^d\Big(\frac{\alpha_1}{(\sqrt{\hat v_1})_j}-\frac{\alpha_T}{(\sqrt{\hat v_T})_j}\Big) \le \mathbb{E}\sum_{j=1}^d\frac{\alpha_1}{(\sqrt{\hat v_1})_j} \le \frac{d\alpha}{c},$$
where the first equality is due to $(\hat v_t)_j \ge (\hat v_{t-1})_j$ and $\alpha_t \le \alpha_{t-1}$, and the second equality is a telescoping sum.
• Bound the term with $C_4$:
$$\mathbb{E}\sum_{t=2}^{T-1}\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2 = \mathbb{E}\sum_{t=2}^{T-1}\sum_{j=1}^d\Big(\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)_j^2 \le \mathbb{E}\sum_{t=2}^{T-1}\sum_{j=1}^d\frac{\alpha}{c}\Big|\Big(\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big)_j\Big| \le \frac{d\alpha^2}{c^2},$$
where the first inequality is due to $\big|(\alpha_t/\sqrt{\hat v_t}-\alpha_{t-1}/\sqrt{\hat v_{t-1}})_j\big| \le \alpha/c$. Then we have, for ADMETAR,
$$\mathbb{E}\Big[C_1\sum_{t=1}^T\Big\|\frac{\alpha_t h_t}{\sqrt{\hat v_t}}\Big\|^2 + C_2\sum_{t=2}^T\Big\|\frac{\alpha_{t-1}m_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2 + C_3\sum_{t=2}^T\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|_1 + C_4\sum_{t=2}^{T-1}\Big\|\frac{\alpha_t}{\sqrt{\hat v_t}}-\frac{\alpha_{t-1}}{\sqrt{\hat v_{t-1}}}\Big\|^2\Big] + C_5$$
$$\le (C_1+C_2)\frac{H^2\alpha^2}{c^2}(1+\log T) + C_3\frac{d\alpha}{c} + C_4\frac{d\alpha^2}{c^2} + C_5. \tag{36}$$
Furthermore, due to $\|h_t\| \le H$, we have $(\hat v_t)_j \le H^2$, and then
$$\frac{\alpha_t}{(\sqrt{\hat v_t})_j} \ge \frac{\alpha}{H\sqrt t}.$$
Thus we have
$$\mathbb{E}\sum_{t=1}^T\alpha_t\langle\nabla f(\theta_t), \nabla f(\theta_t)/\sqrt{\hat v_t}\rangle \ge \mathbb{E}\sum_{t=1}^T\frac{\alpha}{H\sqrt t}\|\nabla f(\theta_t)\|^2 \ge \frac{\alpha\sqrt T}{H}\min_{t\in[T]}\mathbb{E}\|\nabla f(\theta_t)\|^2. \tag{37}$$
Combining (36) and (37), we have
$$\min_{t\in[T]}\mathbb{E}\|\nabla f(\theta_t)\|^2 \le \frac{H}{\alpha\sqrt T}\Big[(C_1+C_2)\frac{H^2\alpha^2}{c^2}(1+\log T) + C_3\frac{d\alpha}{c} + C_4\frac{d\alpha^2}{c^2} + C_5\Big] = \frac{1}{\sqrt T}(Q_1 + Q_2\log T).$$
This completes the proof.

C.4 CONVERGENCE ANALYSIS OF ADMETAS FOR CONVEX OPTIMIZATION

Lemma C.16 (Bound for $\sum_{t=1}^T\alpha_t\|m_t\|^2$). Under the assumptions of Theorem C.3, we have
$$\sum_{t=1}^T\alpha_t\|m_t\|^2 \le 2\alpha dG_\infty^2\sqrt T.$$
Proof. First, we bound $\|m_t\|$:
$$\|m_t\|^2 \le d\|m_t\|_\infty^2 \le dG_\infty^2. \tag{38}$$
Now we can bound $\sum_{t=1}^T\alpha_t\|m_t\|^2$:
$$\sum_{t=1}^T\alpha_t\|m_t\|^2 \le dG_\infty^2\sum_{t=1}^T\alpha_t = \alpha dG_\infty^2\sum_{t=1}^T\frac{1}{\sqrt t} \le 2\alpha dG_\infty^2\sqrt T.$$

Theorem C.17 (Convergence of ADMETAS for convex optimization; restatement of Theorem C.3).

Proof. • Bound for $\sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle$. From the update process, we get
$$\|\theta_{t+1}-\theta^*\|^2 = \|\theta_t-\theta^*-\alpha_t m_t\|^2 = \|\theta_t-\theta^*\|^2 - 2\alpha_t\langle m_t, \theta_t-\theta^*\rangle + \alpha_t^2\|m_t\|^2;$$
thus we have
$$\sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle = \sum_{t=1}^T\frac{1}{2\alpha_t}\big(\|\theta_t-\theta^*\|^2 - \|\theta_{t+1}-\theta^*\|^2\big) + \sum_{t=1}^T\frac{\alpha_t}{2}\|m_t\|^2.$$
Consider the first sum on the right-hand side:
$$\sum_{t=1}^T\frac{1}{2\alpha_t}\big(\|\theta_t-\theta^*\|^2-\|\theta_{t+1}-\theta^*\|^2\big) = \frac{1}{2\alpha_1}\|\theta_1-\theta^*\|^2 + \sum_{t=2}^T\Big(\frac{1}{2\alpha_t}-\frac{1}{2\alpha_{t-1}}\Big)\|\theta_t-\theta^*\|^2 - \frac{1}{2\alpha_T}\|\theta_{T+1}-\theta^*\|^2$$
$$\le \frac{dD_\infty^2}{2\alpha_1} + dD_\infty^2\sum_{t=2}^T\Big(\frac{1}{2\alpha_t}-\frac{1}{2\alpha_{t-1}}\Big) + 0 = \frac{dD_\infty^2}{2\alpha_T}.$$
Finally, we get
$$\sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle \le \frac{dD_\infty^2}{2\alpha_T} + \sum_{t=1}^T\frac{\alpha_t}{2}\|m_t\|^2.$$
• Bound for $\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle$ (with $m_0 = 0$):
$$\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle = \sum_{t=1}^{T-1}\langle m_t, \theta_t-\theta_{t+1}\rangle = \sum_{t=1}^{T-1}\langle m_t, \alpha_t m_t\rangle = \sum_{t=1}^{T-1}\alpha_t\|m_t\|^2.$$
• Bound for $\langle m_T, \theta_T-\theta^*\rangle$:
$$\langle m_T, \theta_T-\theta^*\rangle \le \alpha_T\|m_T\|^2 + \frac{1}{4\alpha_T}\|\theta_T-\theta^*\|^2 \le \alpha_T\|m_T\|^2 + \frac{dD_\infty^2}{4\alpha_T},$$
where the first inequality follows from Young's inequality.
Combining all these preparations, we obtain
$$\sum_{t=1}^T\langle h_t, \theta_t-\theta^*\rangle = \frac{1}{1-\beta}\big(\langle m_T, \theta_T-\theta^*\rangle - \langle m_0, \theta_0-\theta^*\rangle\big) + \langle m_0, \theta_0-\theta^*\rangle + \sum_{t=1}^{T-1}\langle m_t, \theta_t-\theta^*\rangle + \frac{\beta}{1-\beta}\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle$$
$$= \frac{\beta}{1-\beta}\langle m_T, \theta_T-\theta^*\rangle + \frac{\beta}{1-\beta}\sum_{t=1}^T\langle m_{t-1}, \theta_{t-1}-\theta_t\rangle + \sum_{t=1}^T\langle m_t, \theta_t-\theta^*\rangle$$
$$\le \frac{\beta}{1-\beta}\Big(\frac{dD_\infty^2}{4\alpha_T} + \sum_{t=1}^T\alpha_t\|m_t\|^2\Big) + \frac{dD_\infty^2}{2\alpha_T} + \sum_{t=1}^T\frac{\alpha_t}{2}\|m_t\|^2 \le \Big(\frac{\beta}{1-\beta}+2\Big)\frac{dD_\infty^2}{4\alpha_T} + \Big(\frac{2\alpha\beta}{1-\beta}+\alpha\Big)dG_\infty^2\sqrt T.$$
This proves that $\sum_{t=1}^T\langle h_t, \theta_t-\theta^*\rangle = O(\sqrt T)$. Suppose the optimizer runs for a long time; the bias of the EMA is small (Zhuang et al., 2020), so $\mathbb{E}(I_t)$ approaches $\mathbb{E}(g_t)$ as the step increases. Since $h_t = \kappa g_t + \mu I_t$, $h_t$ is of the same order as $g_t$ when $t$ is large enough; thus we have
$$\sum_{t=1}^T\langle g_t, \theta_t-\theta^*\rangle = O(\sqrt T). \tag{39}$$
In addition, due to the convexity of $f(\cdot)$, we have
$$R(T) = \sum_{t=1}^T f_t(\theta_t) - f_t(\theta^*) \le \sum_{t=1}^T\langle g_t, \theta_t-\theta^*\rangle.$$
Combined with (39), we complete the proof.

C.5 CONVERGENCE ANALYSIS OF ADMETAS FOR NON-CONVEX OPTIMIZATION

Lemma C.18. Set $\theta_0 \triangleq \theta_1$ in the algorithm, and define $z_t$ as
$$z_t = \theta_t + \frac{\beta}{1-\beta}(\theta_t-\theta_{t-1}),\quad \forall t\ge1. \tag{40}$$
Then the following holds:
$$z_{t+1}-z_t = -\frac{\beta}{1-\beta}(\alpha_t-\alpha_{t-1})m_{t-1} - \alpha_t h_t.$$
Proof. By the update rule of ADMETAS, we have
$$\theta_{t+1}-\theta_t = -\alpha_t m_t = -\alpha_t\big[\beta m_{t-1}+(1-\beta)h_t\big] = \beta\frac{\alpha_t}{\alpha_{t-1}}(\theta_t-\theta_{t-1}) - \alpha_t(1-\beta)h_t$$
$$= \beta(\theta_t-\theta_{t-1}) + \beta\Big(\frac{\alpha_t}{\alpha_{t-1}}-1\Big)(\theta_t-\theta_{t-1}) - \alpha_t(1-\beta)h_t = \beta(\theta_t-\theta_{t-1}) - \beta(\alpha_t-\alpha_{t-1})m_{t-1} - \alpha_t(1-\beta)h_t. \tag{41}$$
Since we also have $\theta_{t+1}-\theta_t = (1-\beta)\theta_{t+1}+\beta(\theta_{t+1}-\theta_t)-(1-\beta)\theta_t$, combining with (41) gives
$$(1-\beta)\theta_{t+1}+\beta(\theta_{t+1}-\theta_t) = (1-\beta)\theta_t+\beta(\theta_t-\theta_{t-1}) - \beta(\alpha_t-\alpha_{t-1})m_{t-1} - \alpha_t(1-\beta)h_t.$$
Dividing both sides by $1-\beta$:
$$\theta_{t+1}+\frac{\beta}{1-\beta}(\theta_{t+1}-\theta_t) = \theta_t+\frac{\beta}{1-\beta}(\theta_t-\theta_{t-1}) - \frac{\beta}{1-\beta}(\alpha_t-\alpha_{t-1})m_{t-1} - \alpha_t h_t.$$

Lemma C.19. Suppose that the conditions in Theorem C.4 hold; then $\mathbb{E}[f(z_{t+1})-f(z_1)] \le \sum_{i=1}^4 T_i$, where
$$T_1 = -\mathbb{E}\sum_{i=1}^t\Big\langle\nabla f(z_i), \frac{\beta}{1-\beta}(\alpha_i-\alpha_{i-1})m_{i-1}\Big\rangle,\qquad T_2 = -\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i), h_i\rangle,$$
$$T_3 = \mathbb{E}\sum_{i=1}^t L\Big\|\frac{\beta}{1-\beta}(\alpha_i-\alpha_{i-1})m_{i-1}\Big\|^2,\qquad T_4 = \mathbb{E}\sum_{i=1}^t L\|\alpha_i h_i\|^2.$$
Proof. By the Lipschitz smoothness of $\nabla f$,
$$f(z_{t+1}) \le f(z_t) + \langle\nabla f(z_t), z_{t+1}-z_t\rangle + \frac L2\|z_{t+1}-z_t\|^2.$$
Based on Lemma C.18, we have
$$\mathbb{E}[f(z_{t+1})-f(z_1)] = \mathbb{E}\sum_{i=1}^t f(z_{i+1})-f(z_i) \le \mathbb{E}\sum_{i=1}^t\Big[\langle\nabla f(z_i), z_{i+1}-z_i\rangle+\frac L2\|z_{i+1}-z_i\|^2\Big] = T_1+T_2+\mathbb{E}\sum_{i=1}^t\frac L2\|z_{i+1}-z_i\|^2.$$
Then, using the inequality $\|a+b\|^2\le2\|a\|^2+2\|b\|^2$ combined with Lemma C.18,
$$\mathbb{E}\sum_{i=1}^t\frac L2\|z_{i+1}-z_i\|^2 \le T_3+T_4.$$

Lemma C.20. In this part, we bound $T_1$, $T_2$, $T_3$ and $T_4$.

Proof.
• Bound for $T_1$:
$$T_1 \le \mathbb{E}\sum_{i=1}^t\|\nabla f(z_i)\|\,\|m_{i-1}\|\frac{\beta}{1-\beta}|\alpha_i-\alpha_{i-1}| \le H^2\frac{\beta}{1-\beta}\mathbb{E}\sum_{i=1}^t|\alpha_i-\alpha_{i-1}| \le H^2\frac{\beta}{1-\beta}\alpha,$$
where the second and last inequalities are due to the monotone decreasing property of $\alpha_i$ (so the sum telescopes to at most $\alpha$).
• Bound for $T_3$:
$$T_3 \le \Big(\frac{\beta}{1-\beta}\Big)^2 LH^2\,\mathbb{E}\sum_{i=1}^t(\alpha_i-\alpha_{i-1})^2 \le 2\alpha\Big(\frac{\beta}{1-\beta}\Big)^2 LH^2\,\mathbb{E}\sum_{i=1}^t|\alpha_i-\alpha_{i-1}| \le 2\alpha^2\Big(\frac{\beta}{1-\beta}\Big)^2 LH^2,$$
where the monotone decreasing property of $\alpha_i$ is again used.
• Bound for $T_4$:
$$T_4 \le H^2L\alpha^2\,\mathbb{E}\sum_{i=1}^t\frac1i \le H^2L\alpha^2(1+\log T),$$
where the second inequality is due to $\sum_{i=1}^t\frac1i \le 1+\log t \le 1+\log T$.
• Bound for $T_2$:
$$T_2 = -\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), h_i\rangle - \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i)-\nabla f(\theta_i), h_i\rangle. \tag{42}$$
The second term of (42) can be bounded as
$$-\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(z_i)-\nabla f(\theta_i), h_i\rangle \le \mathbb{E}\sum_{i=1}^t\Big[\frac12\|\nabla f(z_i)-\nabla f(\theta_i)\|^2+\frac12\|\alpha_i h_i\|^2\Big]$$
$$\le \frac{L^2}{2}\mathbb{E}\sum_{i=1}^t\Big\|\frac{\beta}{1-\beta}\alpha_{i-1}m_{i-1}\Big\|^2 + \frac12\mathbb{E}\sum_{i=1}^t\|\alpha_i h_i\|^2 \le \frac{\alpha^2H^2L^2}{2}\Big(\frac{\beta}{1-\beta}\Big)^2\sum_{i=1}^t\frac1i + \frac{\alpha^2H^2}{2}\sum_{i=1}^t\frac1i \le \frac{\alpha^2H^2}{2}\Big[L^2\Big(\frac{\beta}{1-\beta}\Big)^2+1\Big](1+\log T),$$
where the second inequality is due to $\|\nabla f(z_i)-\nabla f(\theta_i)\| \le L\|z_i-\theta_i\|$ and $z_i-\theta_i = \frac{\beta}{1-\beta}(\theta_i-\theta_{i-1})$. Then, consider the first term of (42):
$$\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), h_i\rangle = \mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \kappa g_i+\mu I_i\rangle \approx \kappa\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)+\delta_i\rangle + \mu\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)+\delta_i\rangle$$
$$= (\kappa+\mu)\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)\rangle.$$
The last two (near-)equalities hold for the following reasons: on the one hand, $g_t = \nabla f(\theta_t)+\delta_t$ with $\mathbb{E}[\delta_t]=0$, so according to Chen et al. (2018), given $\theta_i$, $\mathbb{E}[\delta_i\mid\theta_i]=0$; on the other hand, if the optimizer runs for a long time, the bias of the EMA is small (Zhuang et al., 2020), so $\mathbb{E}(I_t)$ approaches $\mathbb{E}(g_t)$ as the step increases. Finally, we can bound $T_2$:
$$T_2 \le \frac{\alpha^2H^2}{2}\Big[L^2\Big(\frac{\beta}{1-\beta}\Big)^2+1\Big](1+\log T) - (\kappa+\mu)\,\mathbb{E}\sum_{i=1}^t\alpha_i\langle\nabla f(\theta_i), \nabla f(\theta_i)\rangle.$$

Theorem C.21 (Convergence of ADMETAS in non-convex stochastic optimization). Under the assumptions:
• $\nabla f$ exists and is Lipschitz-continuous, i.e., $\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|$, $\forall x, y$; $f$ is also lower bounded.
• At step t, the algorithm can access a bounded noisy gradient g t , and the true gradient ∇f is also bounded. • The noisy gradient is unbiased, and has independent noise, i.e. g t = ∇f (θ t ) + δ t , E[δ t ] = 0 and δ i ⊥δ j , ∀i ̸ = j. And α t = α/ √ t, then for any T we have: min t∈[T ] E ∇f (θ t ) 2 ≤ 1 √ T (Q ′ 1 + Q ′ 2 logT ) where Q ′ 1 and Q ′ 2 are constants independent of T. Proof. We combine lemma C.18, lemma C.19 and lemma C.20 to bound the overall expected descent of the objective. First, we have E [f (z t+1 ) -f (z 1 )] ≤T 1 + T 2 + T 3 + T 4 (43) ≤H 2 β 1 -β α + α 2 H 2 2 L 2 β 1 -β 2 + 1 (1 + logT ) (44) -(κ + µ)E t i=1 α i ⟨∇f (θ i ), ∇f (θ i )⟩ (45) + 2α 2 β 1 -β 2 LH 2 + H 2 Lα 2 (1 + logT ) Under review as a conference paper at 2023 Notice that E T t=1 α i ⟨∇f (θ t ), ∇f (θ t )⟩ ≥ E T t=1 1 √ t ∥∇f (θ t )∥ 2 ≥ √ T min t∈[T ] E ∥∇f (θ t )∥ 2 (47) Rearrange ( 43), combined with (47) and notice that κ + µ > 0, we have min t∈[T ] E ∥∇f (θ t )∥ 2 ≤ 1 √ T E T t=1 α i ⟨∇f (θ t ), ∇f (θ t )⟩ ≤ 1 √ T 1 κ + µ α 2 H 2 L 2 2 β 1 -β 2 + α 2 H 2 2 + H 2 Lα 2 (1 + logT ) + 1 κ + µ H 2 β 1 -β α + 2α 2 β 1 -β 2 LH 2 + E[f (z 1 ) -f (z * )] = 1 √ T (Q ′ 1 + Q ′ 2 logT ) where z * is the optimal of f , i.e. z * = arg min z f (z) This completes the proof.
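As a sanity check on Lemma C.18, the telescoping identity for $z_t$ can be verified numerically on scalar iterates. The sketch below is ours, with a random scalar standing in for $h_t = \kappa g_t + \mu I_t$ and the schedule $\alpha_t = \alpha/\sqrt{t}$; it returns the largest deviation between $z_{t+1} - z_t$ and the closed form given by the lemma.

```python
import math
import random

def run_admetas_scalar(beta=0.9, alpha=0.1, steps=50, seed=0):
    """Scalar ADMETAS-style iterates (illustrative): checks Lemma C.18's identity
    z_{t+1} - z_t = -(beta/(1-beta)) (a_t - a_{t-1}) m_{t-1} - a_t h_t."""
    rng = random.Random(seed)
    c = beta / (1.0 - beta)
    theta_prev, theta = 1.0, 1.0      # theta_0 := theta_1, as in Lemma C.18
    m_prev = 0.0                      # m_0 = 0, so theta_1 - theta_0 = -a_0 m_0 holds
    a_prev = alpha                    # a_0 := a_1 = alpha (a_0 only multiplies m_0 = 0)
    max_err = 0.0
    for t in range(1, steps + 1):
        a_t = alpha / math.sqrt(t)
        h_t = rng.uniform(-1.0, 1.0)  # stand-in for h_t = kappa*g_t + mu*I_t
        m_t = beta * m_prev + (1 - beta) * h_t
        theta_next = theta - a_t * m_t
        z_t = theta + c * (theta - theta_prev)
        z_next = theta_next + c * (theta_next - theta)
        predicted = -c * (a_t - a_prev) * m_prev - a_t * h_t
        max_err = max(max_err, abs((z_next - z_t) - predicted))
        theta_prev, theta, m_prev, a_prev = theta, theta_next, m_t, a_t
    return max_err
```

Up to floating-point rounding, the deviation is zero for any choice of $\beta$ and horizon, matching the algebra above.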

C.6 CONVERGENCE ANALYSIS OF FORWARD-LOOKING

In this section, based on Wang et al. (2020), we further analyze the forward-looking part to complete the convergence proof of the ADMETA optimizer. According to Zhang et al. (2019), Lookahead is an algorithm that can be combined with any standard optimization method, and the same is true for the dynamic lookahead method in the forward-looking part. Moreover, an optimizer with forward-looking essentially runs two loops, as discussed in the main text: the fast weight is updated by the inner optimizer, while the slow weight is updated by interpolating with the fast weight once every given period. In other words, the slow weight is updated passively. Therefore, although the slow weight depends on the iterates, it is almost independent of the choice of inner optimizer. For this reason, we only prove the convergence of the forward-looking part of ADMETAS, which can easily be extended to ADMETAR.

Remarks (some preliminaries). By the design of the asymptotic dynamic weight $\eta_t$ of the forward-looking part, after the optimizer runs for a long time $\eta_t$ is very close to its set point, at which stage we can safely treat $\eta_t$ as a constant and denote it by $\eta$. In this way, the analysis of dynamic lookahead reduces to the case of static lookahead. According to the ADMETA algorithm, the slow weight $\phi_t$ updates every $k$ steps. We can assume the slow weight is trained in sync with the fast weight; for this purpose, we simply stipulate $\phi_{\tau k + l} = \phi_{\tau k}$, where $k$ denotes the synchronization period, $\tau \in \mathbb{N}^*$ and $0 \le l < k$. Define $y_t = \eta\theta_t + (1-\eta)\phi_t$; then, by the updates of $\theta_t$ and $\phi_t$, we have
$$y_{t+1} = y_t - \eta\alpha_t m_t,$$
and at each synchronization point,
$$y_{\tau k} - \theta_{\tau k} = (1-\eta)(\phi_{\tau k} - \theta_{\tau k}) = 0, \qquad y_{\tau k} - \phi_{\tau k} = \eta(\theta_{\tau k} - \phi_{\tau k}) = 0.$$

Theorem C.22 (convergence of the forward-looking part). Suppose $f(\cdot)$ is $L$-smooth, i.e. $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$, and the bias of the noisy gradient is bounded, i.e. $|\delta_t| \le \sigma$, where $\delta_t = \nabla f(\theta_t) - g_t$.
Then we have:
$$\frac{1}{T}\sum_{t=0}^{T}\mathbb{E}\,\|\nabla f(\theta_t)\|^2 \le O\!\left(\frac{1}{\sqrt{T}}\right).$$
Consider the last term of (52). We have
$$\begin{aligned}
\mathbb{E}[\|\phi_{\tau k + l} - \theta_{\tau k + l}\|^2] &= \mathbb{E}[\|\theta_{\tau k} - \theta_{\tau k + l}\|^2] \le \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} m_{\tau k + j}\Big\|^2\Big] \\
&\le 2\kappa^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} g_{\tau k + j}\Big\|^2\Big] + 2\mu^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} I_{\tau k + j}\Big\|^2\Big] \\
&\le 4\kappa^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} \big(g_{\tau k + j} - \nabla f(\theta_{\tau k + j})\big)\Big\|^2\Big] + 4\kappa^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} \nabla f(\theta_{\tau k + j})\Big\|^2\Big] \\
&\quad + 4\mu^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} \big(I_{\tau k + j} - \nabla f(\theta_{\tau k + j})\big)\Big\|^2\Big] + 4\mu^2 \alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} \nabla f(\theta_{\tau k + j})\Big\|^2\Big] \\
&\le 4(\kappa^2 + \mu^2)\sigma^2 l \alpha_{\tau k}^2 + 4(\mu^2 + \kappa^2)\alpha_{\tau k}^2\, \mathbb{E}\Big[\Big\|\sum_{j=0}^{l-1} \nabla f(\theta_{\tau k + j})\Big\|^2\Big] \\
&\le 4(\kappa^2 + \mu^2)\sigma^2 l \alpha_{\tau k}^2 + 4(\mu^2 + \kappa^2) l \alpha_{\tau k}^2 \sum_{j=0}^{l-1} \mathbb{E}[\|\nabla f(\theta_{\tau k + j})\|^2],
\end{aligned}$$
where the first equality uses the property that $\theta_{\tau k} = \phi_{\tau k} = \phi_{\tau k + l}$. Summing from $l = 0$ to $l = k-1$, we get
$$\begin{aligned}
\sum_{l=0}^{k-1}\mathbb{E}[\|\phi_{\tau k + l} - \theta_{\tau k + l}\|^2]
&\le 2(\kappa^2 + \mu^2)\sigma^2 \alpha_{\tau k}^2 k(k-1) + 4(\mu^2 + \kappa^2)\alpha_{\tau k}^2 \sum_{l=0}^{k-1} l \sum_{j=0}^{l-1} \mathbb{E}[\|\nabla f(\theta_{\tau k + j})\|^2] \\
&= 2(\kappa^2 + \mu^2)\sigma^2 \alpha_{\tau k}^2 k(k-1) + 4(\mu^2 + \kappa^2)\alpha_{\tau k}^2 \sum_{j=0}^{k-2} \mathbb{E}[\|\nabla f(\theta_{\tau k + j})\|^2] \sum_{l=j+1}^{k-1} l \\
&= 2(\kappa^2 + \mu^2)\sigma^2 \alpha_{\tau k}^2 k(k-1) + 2(\mu^2 + \kappa^2)\alpha_{\tau k}^2 \sum_{j=0}^{k-2} \mathbb{E}[\|\nabla f(\theta_{\tau k + j})\|^2]\,(j+k)(k-j-1).
\end{aligned}$$
$(j+k)(k-j-1)$ achieves its maximal value $k(k-1)$ at $j = 0$. Therefore, we have
$$\sum_{l=0}^{k-1}\mathbb{E}[\|\phi_{\tau k + l} - \theta_{\tau k + l}\|^2] \le 2(\kappa^2 + \mu^2)\sigma^2 \alpha_{\tau k}^2 k(k-1) + 2(\mu^2 + \kappa^2)\alpha_{\tau k}^2 k(k-1)\sum_{j=0}^{k-2} \mathbb{E}[\|\nabla f(\theta_{\tau k + j})\|^2].$$
Here, we can finally bound the last term of (52):
$$\begin{aligned}
\mathbb{E}[f(y_{(\tau+1)k})] - \mathbb{E}[f(y_{\tau k})]
&\le -\frac{\eta\alpha_{(\tau+1)k}(\mu+\kappa)}{2}\sum_{l=0}^{k-1}\mathbb{E}[\|\nabla f(y_{\tau k + l})\|^2] + G + M\sum_{l=0}^{k-1}\mathbb{E}[\|\nabla f(\theta_{\tau k + l})\|^2] \\
&\le -\frac{\eta\alpha_{(\tau+1)k}(\mu+\kappa)}{2}\sum_{l=0}^{k-1}\mathbb{E}[\|\nabla f(y_{\tau k + l})\|^2] + G, \qquad (53)
\end{aligned}$$
where
$$G = 2(\mu^2 + \kappa^2) k \eta^2 \alpha_{\tau k}^2 L\sigma^2 + (\kappa^2 + \mu^2)(\kappa + \mu)\eta(1-\eta)^2 L^2\sigma^2 k(k-1)\alpha_{\tau k}^3$$
and
$$M = -\frac{\eta\alpha_{(\tau+1)k}\big(\mu + \kappa - 4(\mu^2 + \kappa^2)\eta\alpha_{(\tau+1)k} L\big)}{2} + (\kappa^2 + \mu^2)(\kappa + \mu)\eta(1-\eta)^2 L^2\sigma^2 k(k-1)\alpha_{\tau k}^3.$$
When $\alpha$ is small enough, $M$ is below zero, for which the second inequality of (53) holds.

Summing from $\tau = 0$ to $\tau = \Upsilon - 1$, we get
$$\begin{aligned}
\mathbb{E}[f(y_{\Upsilon k})] - \mathbb{E}[f(y_0)]
&\le -\frac{\eta(\mu+\kappa)}{2}\sum_{\tau=0}^{\Upsilon-1}\alpha_{(\tau+1)k}\sum_{l=0}^{k-1}\mathbb{E}[\|\nabla f(y_{\tau k + l})\|^2] \\
&\quad + 2(\mu^2 + \kappa^2) k \eta^2 L\sigma^2 \sum_{\tau=0}^{\Upsilon-1}\alpha_{\tau k}^2 + (\kappa^2 + \mu^2)(\kappa + \mu)\eta(1-\eta)^2 L^2\sigma^2 k(k-1)\sum_{\tau=0}^{\Upsilon-1}\alpha_{\tau k}^3.
\end{aligned}$$
Following Wang et al. (2020), we first take the learning rate $\alpha$ to be a fixed constant; rearranging the inequality above, we get
$$\frac{1}{\Upsilon k}\sum_{\tau=0}^{\Upsilon-1}\sum_{l=0}^{k-1}\mathbb{E}[\|\nabla f(y_{\tau k + l})\|^2] \le \frac{2[f(y_0) - f_{\inf}]}{\eta\alpha\Upsilon k(\mu+\kappa)} + \frac{4(\mu^2 + \kappa^2)\eta\alpha L\sigma^2}{\mu+\kappa} + 2(\kappa^2 + \mu^2)(1-\eta)^2 \alpha^2 L^2\sigma^2 (k-1).$$
Defining $T$ as $\Upsilon k$ and setting the learning rate $\alpha$ to $1/\sqrt{T}$,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(y_t)\|^2] \le \frac{2[f(y_0) - f_{\inf}]}{\eta\sqrt{T}(\mu+\kappa)} + \frac{4(\mu^2 + \kappa^2)\eta L\sigma^2}{(\mu+\kappa)\sqrt{T}} + \frac{2(\kappa^2 + \mu^2)(1-\eta)^2 L^2\sigma^2 (k-1)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right).$$
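The two-loop forward-looking scheme analyzed above can be sketched as follows. This is our own illustrative code, not the paper's exact algorithm: the inner optimizer is a plain SGDM-style momentum step, and the decay used to let $\eta_t$ approach its set point is an assumed exponential ramp chosen only to illustrate the idea of a large $\eta$ early and $\eta \to \eta_{\text{set}}$ late.

```python
import math

def dynamic_lookahead_sgdm(grad_fn, theta0, alpha=0.1, beta=0.9,
                           k=6, eta_set=0.5, tau=20.0, steps=60):
    """Two-loop forward-looking sketch: the fast weight theta follows an
    SGDM-style momentum update; every k steps the slow weight phi interpolates
    toward theta with a dynamic weight eta_t that decays toward eta_set."""
    theta = phi = theta0
    m = 0.0
    for t in range(1, steps + 1):
        m = beta * m + (1 - beta) * grad_fn(theta)   # backward-looking momentum
        theta -= (alpha / math.sqrt(t)) * m          # fast (inner) update
        if t % k == 0:                               # slow (outer) update
            # large eta early keeps speed; eta -> eta_set late for stability
            eta_t = eta_set + (1.0 - eta_set) * math.exp(-t / tau)
            phi = phi + eta_t * (theta - phi)
            theta = phi                              # fast weight restarts at phi
    return phi
```

For instance, on $f(\theta)=\theta^2$ with `grad_fn = lambda x: 2 * x` and `theta0 = 1.0`, the returned slow weight moves toward the minimizer at 0.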

D ANALYSIS OF CONVERGENCE RATE

For the convex case, we adopt the regret function to estimate the convergence rate, and for the non-convex case we adopt the minimum of the expectation of the squared gradient norm, corresponding to the convergence proofs above, since the process of proving convergence is in effect the process of deriving the convergence rate. From Table 6, we notice that the convergence rates of all optimizers are of magnitude $O(1/\sqrt{T})$ in the convex case and $O(\log T/\sqrt{T})$ in the non-convex case, which means that, in essence, algorithms based on gradient descent obey a similar rate constraint. However, the convergence speed of different optimizers may depend on many other factors, such as implementation details. Therefore additional statistical experiments are needed for analysis, as we did in Table 4.
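For concreteness, the two criteria referred to above are the standard ones, restated here (the constants $Q_1, Q_2$ are generic):

```latex
% Convex case: regret against the best fixed decision in hindsight;
% regret O(\sqrt{T}) corresponds to an average rate R(T)/T = O(1/\sqrt{T}).
R(T) \;=\; \sum_{t=1}^{T} f_t(\theta_t) \;-\; \min_{\theta \in \mathcal{F}} \sum_{t=1}^{T} f_t(\theta) \;=\; O\!\big(\sqrt{T}\big)

% Non-convex case: best expected squared gradient norm over the run.
\min_{t \in [T]} \mathbb{E}\,\big\|\nabla f(\theta_t)\big\|^2
\;\le\; \frac{Q_1 + Q_2 \log T}{\sqrt{T}} \;=\; O\!\left(\frac{\log T}{\sqrt{T}}\right)
```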

E EXPERIMENTAL DETAILS

E.1 HYPERPARAMETER TUNING

For the ADMETA optimizer, we first determined a rough value range for the learning rate and λ with the toy model, according to the visualization in Figure B. For the other baseline optimizers, we refer to the recommended/default hyperparameter settings in the original papers. In this way, we get a rough range for each hyperparameter. Then we search the hyperparameters in the adjacent interval, as listed in the following three subsections.

Table 6: The comparison of convergence rates of several optimizers.

Convex case (regret bound):
• SGD: $\frac{D_\infty^2}{2\alpha_T} + \frac{G_\infty^2}{2}\sum_{t=1}^{T}\alpha_t$
• AMSGrad (Reddi et al., 2019): $\frac{D_\infty^2\sqrt{T}}{\alpha(1-\beta_1)}\sum_{i=1}^{d} v_{T,i}^{1/2} + \frac{D_\infty}{2(1-\beta_1)}\sum_{t=1}^{T}\sum_{i=1}^{d}\frac{\beta_{1t}\, v_{t,i}^{1/2}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}}\sum_{i=1}^{d}\|g_{1:T,i}\|$
• ADMETAS: $\left(\frac{\beta}{1-\beta}+2\right)\frac{\sqrt{d}\,D_\infty}{4\alpha}\sqrt{T} + \left(\frac{2\alpha\beta}{1-\beta}+\alpha\right)\sqrt{d}\,G_\infty^2\sqrt{T}$
• ADMETAR: $\frac{(2-\beta_1)D_\infty^2\sqrt{T}}{4\alpha(1-\beta_1)}\sum_{i=1}^{d} v_{T,i}^{1/2} + \frac{(2+\beta_1)\alpha\sqrt{1+\log T}}{2\sqrt{(1-\beta_2)(1-\gamma)}}\sum_{i=1}^{d}\|h_{1:T,i}\|_2$

Non-convex case (bound on $\min_{t}\mathbb{E}\|\nabla f(\theta_t)\|^2$):
• SGD: —
• AMSGrad (Chen et al., 2018): $\frac{H}{\sqrt{T}}\left[C_1\frac{H^2}{c^2}(1+\log T) + C_2\frac{d}{c} + C_3\frac{d}{c^2} + C_4\right]$
• ADMETAS: $\frac{1}{\sqrt{T}}\left\{\frac{1}{\kappa+\mu}\left[\frac{\alpha^2H^2L^2}{2}\left(\frac{\beta}{1-\beta}\right)^2 + \frac{\alpha^2H^2}{2} + H^2L\alpha^2\right](1+\log T) + \frac{1}{\kappa+\mu}\left[\frac{H^2\beta\alpha}{1-\beta} + 2\alpha^2\left(\frac{\beta}{1-\beta}\right)^2 LH^2 + \mathbb{E}[f(z_1)-f(z^*)]\right]\right\}$
• ADMETAR: $\frac{H}{\sqrt{T}}\left[(K_1+K_2)\frac{H^2\alpha^2}{c^2}(1+\log T) + K_3\frac{d\alpha}{c} + K_4\frac{d\alpha^2}{c^2} + K_5\right]$

E.2 IMAGE CLASSIFICATION

We conduct image classification experiments on the CIFAR-10 and CIFAR-100 datasets, trained on a single NVIDIA RTX-3090 GPU. Typical architectures like ResNet-110 and PyramidNet are employed as the baseline models. The ResNet-110 architecture stacks 54 identical 3 × 3 convolutional layers arranged into 54 two-layer Residual Units (He et al., 2016), while the PyramidNet architecture has 110 layers with a widening factor of 48 (Han et al., 2017). We set the training batch size to 128 and the validation batch size to 256. Both models are trained for 160 epochs. A milestone schedule is adopted as the learning rate decay strategy, with the learning rate decaying by a factor of 0.1 at the end of the 80-th and 120-th epochs. We report the hyperparameter tuning for our proposed ADMETA and the other optimizers for reproduction of our experiments. For all optimizers, the weight decay is fixed as 1e-4. The hyperparameter search scheme for each optimizer is as follows:

• For SGD and SGDM, the momentum is fixed as 0.9, and the best-performing learning rate is searched from {0.01, 0.05, 0.1} together with the values recommended in the original papers. For our ADMETAS, λ is fixed at 0.9 and we search the best-performing β from {0.1, 0.2, 0.3, 0.4} and the learning rate from {0.01, 0.05, 0.1}.

• For all adaptive learning rate optimizers, the hyperparameters β₁, β₂ and ϵ are set to β₁ = 0.9, β₂ = 0.999 and ϵ = 1e-9 respectively. For the Adam, RAdam and AdaBelief optimizers, the learning rate is searched from {0.1, 0.01, 0.001}. For Ranger, η and k are set to η = 0.5 and k = 6 according to Wright (2019), and the learning rate is searched from {0.1, 0.01, 0.001}. For our ADMETAR, the setting of k is the same as Ranger, and we search λ from {0.05, 0.1, 0.2, 0.3, 0.4} and the learning rate from {0.1, 0.05, 0.01}.

The resulting hyperparameters reported in the paper are shown in Table 7, where LR is the abbreviation of learning rate.
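The milestone decay described above (learning rate multiplied by 0.1 at the end of the 80-th and 120-th epochs over 160 epochs) can be sketched as a piecewise-constant schedule; the function name, defaults, and the exact epoch boundary convention are ours and purely illustrative.

```python
def milestone_lr(epoch, base_lr=0.1, milestones=(80, 120), gamma=0.1):
    """Piecewise-constant decay: multiply base_lr by gamma once for every
    milestone epoch already reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

For example, `milestone_lr(50)`, `milestone_lr(100)` and `milestone_lr(150)` give 0.1, 0.01 and 0.001 respectively.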
The three modules are a feature encoder, contextualized representations, and a quantization module. The feature encoder consists of 7 blocks of temporal convolutions with 512 channels each, and the convolutional layer that models relative positional embeddings has a kernel size of 128 and 16 groups. Among the configurations of Wav2vec 2.0, we choose the Wav2vec 2.0 base model, which has 12 Transformer blocks, 95M parameters and 8 attention heads, with a model dimension of 768 and an inner (FFN) dimension of 3072. We finetune Wav2vec 2.0 base for keyword spotting and language identification on the SUPERB (Yang et al., 2021) and Common Language (Sinisetty et al., 2021) datasets respectively. The dataset for keyword spotting is smaller than that for language identification, so we use a single NVIDIA RTX-3090 GPU for training on SUPERB and four GPUs for parallel training on Common Language. The keyword spotting model is trained for 5 epochs with a training batch size of 32, and the language identification model for 10 epochs with a training batch size of 8 per device. For the same reason as in the NLU experiments, i.e. the pre-training-fine-tuning paradigm, we only employ adaptive learning rate optimizers here. For all chosen optimizers, we fix β₁ = 0.9, β₂ = 0.999, ϵ = 1e-8 and set the weight decay to 0.0. The learning rate is searched from {5e-5, 8e-5, 1e-4, 3e-4, 5e-4, 8e-4}, and for ADMETAR, λ is searched from {0.05, 0.1, 0.2}. The resulting hyperparameters reported in the paper are shown in Table 10.

F FUTURE WORK

In future work, for the backward-looking part, though DEMA provides a more flexible way to deal with past gradients, it is still unable to judge the value of a particular piece of historical gradient information intelligently, for example by discarding obviously unreasonable gradients caused by noise. A better optimizer might be able to forget this erroneous information and take advantage of what works, much like a human brain. For the forward-looking part, our method turns the constant coefficient into a dynamic one, which is somewhat like the milestone scheme among learning rate decay strategies. However, several studies (Huang et al., 2017; Ma, 2020) have shown that the cosine strategy (Loshchilov & Hutter, 2016) works better. Therefore, we plan to follow the cosine scheme and propose a new forward-looking strategy that may work even better.
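For reference, the stock-market DEMA that motivated our backward-looking part doubles an EMA and subtracts the EMA of that EMA, which cancels most of the EMA's lag on trending inputs. The minimal sketch below is our own standalone version of the indicator, not the exact variant used inside ADMETA.

```python
def ema(xs, beta):
    """Exponential moving average with smoothing factor beta (higher = smoother)."""
    out, s = [], xs[0]
    for x in xs:
        s = beta * s + (1 - beta) * x
        out.append(s)
    return out

def dema(xs, beta):
    """Classic DEMA: 2*EMA(x) - EMA(EMA(x)); tracks trends with less lag."""
    e1 = ema(xs, beta)
    e2 = ema(e1, beta)
    return [2 * a - b for a, b in zip(e1, e2)]
```

On a linearly increasing series, the DEMA output stays much closer to the latest value than the plain EMA does, which is the lag-reduction property exploited in the backward-looking part.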

G PERFORMANCE OF SGDM AND ADMETAS ON FINETUNE SETTING

In this section, we test the performance of SGDM and ADMETAS in the finetune setting, with results shown in Table 11. For the keyword spotting (SUPERB) (Yang et al., 2021) task, we train the models for 5 epochs and use Wav2vec base (Schneider et al., 2019) as the baseline model. For the CIFAR-10 (Krizhevsky et al., 2009) task, we train the model for 40 epochs from a checkpoint already trained with Adam for 160 epochs with a learning rate of 0.001. The baseline model for CIFAR-10 is ResNet-110 (He et al., 2016), a deep CNN architecture. We report the results of the best hyperparameter settings for SGDM and ADMETAS found via grid search. From Table 11, we notice that on the SUPERB task, SGDM achieves worse results than the adaptive learning rate methods, but not by much, which shows that SGDM can also be used in the finetune setting. Meanwhile, ADMETAS achieves better results than any other optimizer used in our experiment, demonstrating the advantage of our approach. This phenomenon contradicts the mainstream view that the SGD family is not suitable for finetuning. For the CIFAR-10 task, SGDM and ADMETAS both improve performance over the starting point. However, both are clearly worse than training the task from scratch with SGDM or ADMETAS respectively, which shows that pre-training is a very strong approach that puts the model in a good state. The reason why ADMETAS outperforms SGDM in the finetune setting may lie in two aspects: on the one hand, the DEMA scheme in the backward-looking part reduces the overshoot problem, which is especially harmful near convergence; on the other hand, the forward-looking part improves the stability of the training process.

H INFLUENCE OF DIFFERENT LEARNING RATES IN SGD FAMILY OPTIMIZERS

Since the learning rate of 0.5 for SGDM is a value recommended in Han et al. (2017) but not in He et al. (2016), to alleviate the influence of different learning rates, we also evaluate SGDM with a learning rate of 0.5 on the ResNet-110 network; the results are listed in Table 12. They show that choosing a large learning rate for SGDM may increase performance, as seen when setting the learning rate to 0.5 instead of 0.1, the value recommended for ResNet-110. However, this is not always true, since the performance on CIFAR-10 does not improve when using the learning rate of 0.5.



Notably, we employed Nesterov momentum (Nesterov, 1983) in SGDM for a stronger comparison baseline. Note also that the reported training time is only for rough comparison due to the influence of environments.



SGD with dynamic Lookahead (Ours): when |θ₁ − θ*| is large, a large η can maintain the convergence speed; when |θ₅ − θ*| is small, a relatively small η can achieve better convergence.

Figure 1: Comparison between no lookahead, constant lookahead and dynamic lookahead.

Figure 2: Training loss and test accuracy comparison on CIFAR-10 and CIFAR-100 datasets.

Figure 3: EMA vs. DEMA in SGD optimizer. Please refer to our online demo https://sites. google.com/view/optimizer-admeta for more comparison.

Lemma C.7 (nonexpansiveness property of arg min_{x∈F} ∥·∥, (McMahan & Streeter, 2010)). For any Q ∈ S^d_+, i.e. Q is a positive definite matrix, and convex feasible set

Results on CIFAR-10 and CIFAR-100 test sets.

Development results on GLUE benchmark. With BERT-base, Ranger, AdaBelief, RAdam and ADMETAR score 82.68, 82.42, 83.00 and 83.87 on average respectively; with BERT-large, ADMETAR scores 85.11 on average.

Results on SQuAD v1.1 and v2.0 development sets and NER-CoNLL03 test sets.

Results on speech keyword spotting and language identification tasks.

Ablation study on ADMETA optimizer.

The comparison of convergence rate of several optimizers.

Optimizer hyperparameter settings on the CIFAR task.

Hyperparameter settings of SUPERB and Common Language.

Performance of SGDM and ADMETAS on finetune setting.

Performance of SGD family optimizers in CIFAR task.


Proof. Following the L-smoothness of f and taking the expectation of both sides, we first consider the term with κ. Suppose the optimizer runs for a long time; then the bias of the EMA is small enough, so E(I_t) approaches E(g_t), and we can therefore estimate the term with µ in (49) in the same way as (50). Based on the bounded-bias gradient assumption and the inequality (a + b)² ≤ 2a² + 2b², we combine (48), (49), (50) and (51), rearrange the inequality and take the expectation. Since the learning rate decreases to zero, we can safely assume that after several iterations 1 − ηα_t L > 0; then, summing over one outer loop yields the bound used in (52).

Table 9: Hyperparameter settings of SQuAD v1.1 and v2.0 development sets.

E.3 NATURAL LANGUAGE UNDERSTANDING

In the NLU experiments, we employ the pre-trained language model BERT (Devlin et al., 2018) as our backbone. There are two model sizes for BERT: BERT-base and BERT-large. The base model has 12 Transformer layers with a hidden size of 768, 12 self-attention heads and 110M parameters, while the large model has 24 Transformer layers with a hidden size of 1024, 16 self-attention heads and 340M parameters.

In natural language understanding, we perform experiments on three types of tasks: text classification, machine reading comprehension and token classification. Text classification uses the GLUE benchmark as the evaluation dataset, machine reading comprehension uses SQuAD v1.1 and v2.0, and token classification uses the NER-CoNLL03 named entity recognition dataset.

We train the eight tasks in the GLUE benchmark for 3 epochs on a single NVIDIA RTX-3090 GPU, except for MRPC, which is trained for 5 epochs due to its relatively small data size. The maximum sequence length is set to 128 and the training batch size is set to 32.
SQuAD v1.1 and SQuAD v2.0 are trained for 2 epochs with two GPUs. The maximum sequence length is set to 384 and the training batch size per device is set to 12. NER-CoNLL03 is trained for 3 epochs on a single GPU, with a training batch size per device of 8. Because of the pre-training-fine-tuning paradigm, we only employ adaptive learning rate optimizers. We set β₁, β₂, ϵ and the weight decay of these optimizers to 0.9, 0.999, 1e-8 and 0.0 respectively. η and k are set to 0.5 and 6 in the Ranger optimizer, and ADMETA uses the same value of k as Ranger. We perform hyperparameter tuning on the learning rate and λ; the resulting hyperparameters reported in the paper are shown in Tables 8 and 9.

E.4 AUDIO CLASSIFICATION

Based on Wav2vec (Schneider et al., 2019), Wav2vec 2.0 (Baevski et al., 2020) is a framework for self-supervised learning of speech representations that is composed of three modules.

