ADAPTIVE OPTIMIZERS WITH SPARSE GROUP LASSO

Abstract

We develop a novel framework that adds regularizers to a family of adaptive optimizers in deep learning, such as MOMENTUM, ADAGRAD, ADAM, AMSGRAD and ADAHESSIAN, and thereby create a new class of optimizers, named GROUP MOMENTUM, GROUP ADAGRAD, GROUP ADAM, GROUP AMSGRAD and GROUP ADAHESSIAN accordingly. We establish theoretically proven convergence guarantees in the stochastic convex setting, based on primal-dual methods. We evaluate the regularized effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that, compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, the performance of the models can be significantly improved at the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, our methods can achieve extremely high sparsity with significantly better or highly competitive performance.

1. INTRODUCTION

With the development of deep learning, deep neural network (DNN) models have been widely used in various machine learning scenarios such as search, recommendation and advertisement, and have achieved significant improvements. In recent decades, many optimization methods based on variations of stochastic gradient descent (SGD) have been invented for training DNN models. However, most optimizers cannot directly produce sparsity, which has been proven effective and efficient for saving computational resources and improving model performance, especially in scenarios with very high-dimensional data. Meanwhile, a simple rounding approach is very unreliable due to the inherently low accuracy of these optimizers. In this paper, we develop a new class of optimization methods that adds regularizers, especially sparse group lasso, to prevalent adaptive optimizers while retaining the characteristics of the respective optimizers. Compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, the performance of the models can be significantly improved at the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, the new optimizers can achieve extremely high sparsity with significantly better or highly competitive performance. In this section, we describe the two types of optimization methods and explain the motivation of our work.

1.1. ADAPTIVE OPTIMIZATION METHODS

Due to their simplicity and effectiveness, adaptive optimization methods (Robbins & Monro, 1951; Polyak, 1964; Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2015; Reddi et al., 2018; Yao et al., 2020) have become the de-facto algorithms used in deep learning. There are multiple variants, but they can be represented using the general update formula (Reddi et al., 2018):

x_{t+1} = x_t − α_t m_t / √V_t,  (1)

where α_t is the step size, m_t is the first moment term, a weighted average of the gradients g_t, and V_t is the so-called second moment term that adjusts the update velocity of the variable x_t in each direction. Here, √V_t := V_t^{1/2} and m_t/√V_t := (√V_t)^{−1} m_t. By choosing different m_t, V_t and α_t, we can derive different adaptive optimizers, including MOMENTUM (Polyak, 1964), ADAGRAD (Duchi et al., 2011), ADAM (Kingma & Ba, 2015), AMSGRAD (Reddi et al., 2018) and ADAHESSIAN (Yao et al., 2020), etc. See Table 1.

Optimizer    | m_t                          | V_t                                                | α_t
MOMENTUM     | γ m_{t−1} + g_t              | I                                                  | α
ADAGRAD      | g_t                          | diag(∑_{s=1}^t g_s²)/t                             | α/√t
ADAM         | β₁ m_{t−1} + (1−β₁) g_t      | β₂ V_{t−1} + (1−β₂) diag(g_t²)                     | α √(1−β₂ᵗ)/(1−β₁ᵗ)
AMSGRAD      | β₁ m_{t−1} + (1−β₁) g_t      | max(V_{t−1}, β₂ V_{t−1} + (1−β₂) diag(g_t²))       | α √(1−β₂ᵗ)/(1−β₁ᵗ)
ADAHESSIAN   | β₁ m_{t−1} + (1−β₁) g_t      | β₂ V_{t−1} + (1−β₂) D_t² *                         | α √(1−β₂ᵗ)/(1−β₁ᵗ)

* D_t = diag(H_t), where H_t is the Hessian matrix.
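As an illustration only (not the authors' code), the general update (1) can be sketched in Python; `phi` and `psi` play the roles of the m_t and V_t rules from Table 1, and the two instantiations below correspond to the ADAGRAD and MOMENTUM rows:

```python
import numpy as np

def adaptive_step(x, grads, t, alpha_t, phi, psi):
    """One step of the generic update x_{t+1} = x_t - alpha_t * m_t / sqrt(V_t).

    grads holds g_1..g_t; phi and psi map them to m_t and to the diagonal
    of V_t (all optimizers in Table 1 use a diagonal V_t).
    """
    m_t = phi(grads, t)
    v_t = psi(grads, t)  # diagonal of V_t, as a vector
    return x - alpha_t * m_t / np.sqrt(v_t)

# ADAGRAD row: m_t = g_t, V_t = diag(sum_s g_s^2)/t, alpha_t = alpha/sqrt(t),
# so the two factors of t cancel into the classic AdaGrad step.
adagrad_phi = lambda grads, t: grads[t - 1]
adagrad_psi = lambda grads, t: sum(g * g for g in grads[:t]) / t

# MOMENTUM row: m_t = gamma * m_{t-1} + g_t, V_t = I, alpha_t = alpha.
def momentum_phi(gamma):
    def phi(grads, t):
        m = np.zeros_like(grads[0])
        for g in grads[:t]:
            m = gamma * m + g
        return m
    return phi

momentum_psi = lambda grads, t: np.ones_like(grads[0])
```

For example, one ADAGRAD step from x = 0 with g_1 = (1, 2) and α = 0.1 moves each coordinate by −0.1, since g_1/√(g_1²) has unit magnitude per coordinate.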

1.2. REGULARIZED OPTIMIZATION METHODS

Follow-the-regularized-leader (FTRL) (McMahan & Streeter, 2010; McMahan et al., 2013) has been widely used in click-through rate (CTR) prediction problems; it adds ℓ1-regularization (lasso) to logistic regression and can effectively balance the performance of the model and the sparsity of features. The update formula (McMahan et al., 2013) is:

x_{t+1} = argmin_x { g_{1:t} · x + (1/2) ∑_{s=1}^t σ_s ‖x − x_s‖₂² + λ1 ‖x‖₁ },  (2)

where g_{1:t} = ∑_{s=1}^t g_s, the term (1/2) ∑_{s=1}^t σ_s ‖x − x_s‖₂² is a strongly convex term that stabilizes the algorithm, and λ1 ‖x‖₁ is the regularization term that produces sparsity. However, FTRL does not work well in DNN models, since one input feature can correspond to multiple weights and lasso can only zero out individual weights, hence it cannot effectively remove features. To solve this problem, Ni et al. (2019) add ℓ21-regularization (group lasso) to FTRL, yielding G-FTRL. Yang et al. (2010) study a group lasso method for online learning that adds ℓ21-regularization to the Dual Averaging (DA) algorithm (Nesterov, 2009), named DA-GL. Even so, these two methods cannot be applied to other optimizers. Different scenarios call for different optimizers in deep learning. For example, MOMENTUM (Polyak, 1964) is typically used in computer vision, ADAM (Kingma & Ba, 2015) for training transformer models in natural language processing, and ADAGRAD (Duchi et al., 2011) for recommendation systems. If we want to produce sparsity of the model in some scenario, we would have to change the optimizer, which would probably influence the performance of the model.
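For intuition, here is a minimal per-coordinate sketch (our illustration, not the original reference code) of the closed form of update (2): completing the square reduces it to soft-thresholding, which is exactly how FTRL zeroes out single weights.

```python
import numpy as np

def ftrl_closed_form(z, sigma_sum, lam1):
    """Minimize z.x + (sigma_sum / 2) * ||x||_2^2 + lam1 * ||x||_1 per coordinate.

    Here z stands for g_{1:t} - sum_s sigma_s * x_s and sigma_sum for
    sum_s sigma_s, after completing the square in Eq. (2).
    Coordinates with |z_i| <= lam1 are set exactly to zero.
    """
    out = np.zeros_like(z)
    mask = np.abs(z) > lam1
    out[mask] = -(z[mask] - np.sign(z[mask]) * lam1) / sigma_sum
    return out
```

With z = (2, 0.5, −3), sigma_sum = 1 and lam1 = 1, the result is (−1, 0, 2): the middle coordinate is pruned, the other two are shrunk by lam1.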

1.3. MOTIVATION

Eq. (1) can be rewritten in the form:

x_{t+1} = argmin_x { m_t · x + (1/(2α_t)) ‖V_t^{1/4}(x − x_t)‖₂² }.  (3)

Furthermore, we can rewrite Eq. (3) as

x_{t+1} = argmin_x { m_{1:t} · x + ∑_{s=1}^t (1/(2α_s)) ‖Q_s^{1/2}(x − x_s)‖₂² },  (4)

where m_{1:t} = ∑_{s=1}^t m_s and ∑_{s=1}^t Q_s/α_s = √V_t/α_t. It is easy to prove by induction that Eq. (3) and Eq. (4) are equivalent. The matrices Q_s can be interpreted as generalized learning rates. To the best of our knowledge, V_t in Eq. (1) is diagonal in all the adaptive optimization methods, for computational simplicity. Therefore, we consider Q_s to be diagonal matrices throughout this paper. We find that Eq. (4) is similar to Eq. (2) except for the regularization term. Therefore, we add a regularization term Ψ_t(x) to Eq. (4): the sparse group lasso penalty, which also includes ℓ2-regularization that can diffuse the weights of neural networks. The concrete formula is:

Ψ_t(x) = ∑_{g=1}^G ( λ1 ‖x^g‖₁ + λ21 √(d_{x^g}) ‖A_t^{1/2} x^g‖₂ ) + λ2 ‖x‖₂²,  (5)

where λ1, λ21, λ2 are the regularization parameters of ℓ1, ℓ21, ℓ2 respectively, G is the total number of groups of weights, x^g denotes the weights of group g and d_{x^g} is the size of group g. In DNN models, each group is defined as the set of outgoing weights from a unit, which can be an input feature, a hidden neuron, or a bias unit (see, e.g., Scardapane et al. (2016)). A_t can be an arbitrary positive matrix satisfying A_{t+1} ⪰ A_t, e.g., A_t = I. In Section 2.1, we let A_t = ∑_{s=1}^t Q_s^g/(2α_s) + λ2 I just for solving the closed-form solution directly, where Q_s^g is a diagonal matrix whose diagonal elements are the part of Q_s corresponding to x^g. The ultimate update formula is:

x_{t+1} = argmin_x { m_{1:t} · x + ∑_{s=1}^t (1/(2α_s)) ‖Q_s^{1/2}(x − x_s)‖₂² + Ψ_t(x) }.  (6)

1.4. OUTLINE OF CONTENTS

The rest of the paper is organized as follows. In Section 1.5, we introduce the necessary notations and technical background. In Section 2, we present the closed-form solution of Eq. (6) and the algorithm of the general framework of adaptive optimization methods with sparse group lasso. We prove that the algorithm is equivalent to the adaptive optimization methods when the regularization terms vanish. Finally, we give two concrete examples of the algorithm. In Section 3, we derive the regret bounds of the method and its convergence rates. In Section 4, we validate the performance of the new optimizers on public datasets. In Section 5, we summarize the conclusion. Appendices A-B list the details of GROUP ADAM and GROUP ADAGRAD respectively. Appendices C-F contain the technical proofs of our main results, and Appendix G includes the details of the empirical results of Section 4.4.
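To make the penalty concrete, the following sketch (our illustration, with the simple choice A_t = I and a hypothetical `groups` argument holding one index array per embedding vector) computes the sparse group lasso penalty of Eq. (5):

```python
import numpy as np

def sparse_group_lasso(x, groups, lam1, lam21, lam2):
    """Psi(x) = sum_g [lam1*||x_g||_1 + lam21*sqrt(d_g)*||x_g||_2] + lam2*||x||_2^2.

    With A_t = I the group term is plain group lasso; the sqrt(d_g) factor
    rescales the penalty by the group size so larger groups are not
    under-penalized.
    """
    penalty = lam2 * np.dot(x, x)
    for idx in groups:
        xg = x[idx]
        penalty += lam1 * np.abs(xg).sum()
        penalty += lam21 * np.sqrt(len(idx)) * np.linalg.norm(xg)
    return penalty
```

For x = (3, 4, 0) with groups {0, 1} and {2} and λ1 = λ21 = 1, λ2 = 0, the penalty is 7 + 5√2: the ℓ1 part contributes 7 and the first group contributes √2 · ‖(3, 4)‖₂ = 5√2.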

1.5. NOTATIONS AND TECHNICAL BACKGROUND

We use lowercase letters to denote scalars and vectors, and uppercase letters to denote matrices. We denote a sequence of vectors by subscripts, i.e., x_1, ..., x_t, and entries of each vector by an additional subscript, e.g., x_{t,i}. We use the notation g_{1:t} as shorthand for ∑_{s=1}^t g_s. Similarly, we write m_{1:t} for the sum of the first moments m_t, and f_{1:t} for the function f_{1:t}(x) = ∑_{s=1}^t f_s(x). Let M_t = [m_1 ··· m_t] denote the matrix obtained by concatenating the vector sequence {m_t}_{t≥1}, and let M_{t,i} denote the i-th row of this matrix, i.e., the concatenation of the i-th components of these vectors. The notation A ⪰ 0 (resp. A ≻ 0) for a matrix A means that A is symmetric and positive semidefinite (resp. definite). Similarly, the notations A ⪰ B and A ≻ B mean that A − B ⪰ 0 and A − B ≻ 0 respectively; both tacitly assume that A and B are symmetric. Given A ⪰ 0, we write A^{1/2} for the square root of A, i.e., the unique X ⪰ 0 such that XX = A (McMahan & Streeter (2010), Section 1.4). Let E be a finite-dimensional real vector space, endowed with the Mahalanobis norm ‖·‖_A = √⟨·, A·⟩ induced by A ≻ 0. Let E* be the vector space of all linear functions on E. The dual space E* is endowed with the dual norm ‖·‖*_A = √⟨·, A^{−1}·⟩. Let Q be a closed convex set in E. A continuous function h(x) is called strongly convex on Q with norm ‖·‖_H if Q ⊆ dom h and there exists a constant σ > 0 such that for all x, y ∈ Q and α ∈ [0, 1] we have

h(αx + (1 − α)y) ≤ αh(x) + (1 − α)h(y) − (σ/2) α(1 − α) ‖x − y‖²_H.

The constant σ is called the convexity parameter of h(x), or the modulus of strong convexity. We also write ‖·‖_h = ‖·‖_H. Further, if h is differentiable, we have

h(y) ≥ h(x) + ⟨∇h(x), y − x⟩ + (σ/2) ‖x − y‖²_h.

We use online convex optimization as our analysis framework. On each round t = 1, ..., T, a convex loss function f_t: Q → R is chosen, and we pick a point x_t ∈ Q, hence incur the loss f_t(x_t).
Our goal is to minimize the regret, defined as the quantity

R_T = ∑_{t=1}^T f_t(x_t) − min_{x∈Q} ∑_{t=1}^T f_t(x).  (7)

Online convex optimization can be seen as a generalization of stochastic convex optimization. Any regret-minimizing algorithm can be converted to a stochastic optimization algorithm with convergence rate O(R_T/T) using an online-to-batch conversion technique (Littlestone, 1989). In this paper, we assume Q ≡ E = R^n, hence E* = R^n. We write s^T x or s · x for the standard inner product between s, x ∈ R^n. For the standard Euclidean norm, ‖x‖ = ‖x‖₂ = √⟨x, x⟩ and ‖s‖* = ‖s‖₂. We also use ‖x‖₁ = ∑_{i=1}^n |x^(i)| and ‖x‖_∞ = max_i |x^(i)| to denote the ℓ1-norm and ℓ∞-norm respectively, where x^(i) is the i-th element of x.

2.1. CLOSED-FORM SOLUTION

We derive the closed-form solution of Eq. (6) with a specific A_t, together with Algorithm 1, which modifies it slightly. We have the following theorem.

Theorem 1. Let A_t = ∑_{s=1}^t Q_s^g/(2α_s) + λ2 I in Eq. (5), z_t = z_{t−1} + m_t − (Q_t/α_t) x_t at each iteration t = 1, ..., T, and z_0 = 0. Then the optimal solution of Eq. (6) is updated group-wise as follows:

x^g_{t+1} = ( ∑_{s=1}^t Q_s/α_s + 2λ2 I )^{−1} max( 1 − √(d_{x^g}) λ21 / ‖s̃_t‖₂, 0 ) s_t,  (8)

where the i-th element of s_t is defined as

s_{t,i} = 0 if |z_{t,i}| ≤ λ1, and s_{t,i} = sign(z_{t,i}) λ1 − z_{t,i} otherwise,  (9)

s̃_t is defined as

s̃_t = ( ∑_{s=1}^t Q_s/(2α_s) + λ2 I )^{−1/2} s_t,  (10)

and ∑_{s=1}^t Q_s/α_s is a diagonal positive definite matrix.

The proof of Theorem 1 is given in Appendix C. We slightly modify (8) by letting s̃_t = s_t. Our purpose is to let every entry of a group receive the same ℓ21-regularization effect. Hence, we get Algorithm 1. Furthermore, we have the following theorem, which shows the relationship between Algorithm 1 and adaptive optimization methods. The proof is given in Appendix D.

Theorem 2. If the regularization terms of Algorithm 1 vanish, Algorithm 1 is equivalent to Eq. (1).
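A sketch of the resulting per-group step as used in Algorithm 1 (with the modification s̃_t = s_t, diagonal matrices stored as vectors; our illustration, not the authors' code):

```python
import numpy as np

def group_update(z_g, q_over_alpha, lam1, lam21, lam2):
    """Closed-form step for one group g: soft-threshold z entry-wise, then
    shrink the whole group toward zero (Eqs. (8)-(9) with s_tilde = s).

    q_over_alpha is the diagonal of sum_s Q_s / alpha_s for this group.
    """
    # Entry-wise soft-thresholding, Eq. (9).
    s = np.where(np.abs(z_g) <= lam1, 0.0, np.sign(z_g) * lam1 - z_g)
    norm = np.linalg.norm(s)
    if norm == 0.0:  # the whole group is pruned by the l1 term alone
        return np.zeros_like(z_g)
    # Group shrinkage factor from Eq. (8); zero removes the whole group.
    scale = max(1.0 - np.sqrt(len(z_g)) * lam21 / norm, 0.0)
    return scale * s / (q_over_alpha + 2.0 * lam2)
```

Increasing lam21 enlarges the region where the factor max(1 − √(d) λ21/‖s‖₂, 0) hits zero, i.e., where an entire embedding vector is removed.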

2.2. CONCRETE EXAMPLES

Using Algorithm 1, we can easily derive new optimizers based on ADAM (Kingma & Ba, 2015) and ADAGRAD (Duchi et al., 2011), which we call GROUP ADAM and GROUP ADAGRAD respectively.

GROUP ADAM

The details of the algorithm are given in Appendix A. From Theorem 2, we know that when λ1, λ2, λ21 are all zero, Algorithm 2 is equivalent to ADAM (Kingma & Ba, 2015).

Algorithm 1: Generic framework of adaptive optimization methods with sparse group lasso
1: Input: parameters λ1, λ21, λ2, x_1 ∈ R^n, step sizes {α_t > 0}_{t=1}^T, sequence of functions {φ_t, ψ_t}_{t=1}^T; initialize z_0 = 0, V_0 = 0, α_0 = 0 (with the convention √V_0/α_0 := 0)
2: for t = 1 to T do
3:   g_t = ∇f_t(x_t)
4:   m_t = φ_t(g_1, ..., g_t) and V_t = ψ_t(g_1, ..., g_t)
5:   Q_t/α_t = √V_t/α_t − √V_{t−1}/α_{t−1}
6:   z_t ← z_{t−1} + m_t − (Q_t/α_t) x_t
7:   for i ∈ {1, ..., n} do
8:     s_{t,i} = 0 if |z_{t,i}| ≤ λ1; s_{t,i} = sign(z_{t,i}) λ1 − z_{t,i} otherwise
9:   end for
10:  x_{t+1} = (√V_t/α_t + 2λ2 I)^{−1} max(1 − √(d_{x^g}) λ21/‖s_t‖₂, 0) s_t  (applied per group g)
11: end for

GROUP ADAGRAD

The details of the algorithm are given in Appendix B. Similarly, from Theorem 2, when λ1, λ2, λ21 are all zero, Algorithm 3 is equivalent to ADAGRAD (Duchi et al., 2011). Furthermore, we can find that when λ21 = 0, Algorithm 3 is equivalent to FTRL (McMahan et al., 2013). Therefore, GROUP ADAGRAD can also be called GROUP FTRL, following the research of Ni et al. (2019). Similarly, GROUP MOMENTUM, GROUP AMSGRAD, GROUP ADAHESSIAN, etc., can be derived from MOMENTUM (Polyak, 1964), AMSGRAD (Reddi et al., 2018), ADAHESSIAN (Yao et al., 2020), etc., within the same framework; we omit the details.
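The equivalence stated in Theorem 2 can be checked numerically. The sketch below (our illustration) runs Algorithm 3 with λ1 = λ21 = λ2 = 0 (so s_t = −z_t) alongside plain ADAGRAD with the same +I initialization, and the iterates coincide:

```python
import numpy as np

def adagrad_steps(x, grad_fn, T, alpha):
    """Plain Adagrad: x_{t+1} = x_t - alpha * g_t / sqrt(V_t),
    with V_1 = diag(g_1^2) + I as in line 5 of Algorithm 3."""
    v = None
    for t in range(1, T + 1):
        g = grad_fn(x)
        v = g * g + 1.0 if v is None else v + g * g
        x = x - alpha * g / np.sqrt(v)
    return x

def group_adagrad_steps(x, grad_fn, T, alpha):
    """Algorithm 3 with all regularization parameters zero, so s_t = -z_t
    and x_{t+1} = -(alpha / sqrt(V_t)) * z_t."""
    z = np.zeros_like(x)
    v = None
    v_sqrt_prev = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad_fn(x)
        v = g * g + 1.0 if v is None else v + g * g
        q = np.sqrt(v) - v_sqrt_prev
        z += g - q * x / alpha    # m_t = g_t for Adagrad
        x = -alpha * z / np.sqrt(v)
        v_sqrt_prev = np.sqrt(v)
    return x
```

On a toy quadratic (grad_fn = lambda x: x), the two trajectories agree to machine precision, as Theorem 2 predicts.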

3. CONVERGENCE AND REGRET ANALYSIS

Using the framework developed in Nesterov (2009); Xiao (2010); Duchi et al. (2011), we have the following theorem bounding the regret.

Theorem 3. Let the sequence {x_t} be defined by the update (6) with

x_1 = argmin_{x∈Q} (1/2) ‖x − c‖₂²,  (11)

where c is an arbitrary constant vector. Suppose f_t(x) is convex for any t ≥ 1 and there exists an optimal solution x* of ∑_{t=1}^T f_t(x), i.e., x* = argmin_{x∈Q} ∑_{t=1}^T f_t(x), which satisfies the condition

⟨m_{t−1}, x_t − x*⟩ ≥ 0,  t ∈ [T],  (12)

where m_t is the weighted average of the gradients and [T] = {1, ..., T} for simplicity. Without loss of generality, we assume

m_t = γ m_{t−1} + g_t,  (13)

where γ < 1 and m_0 = 0. Then

R_T ≤ Ψ_T(x*) + ∑_{t=1}^T (1/(2α_t)) ‖Q_t^{1/2}(x* − x_t)‖₂² + (1/2) ∑_{t=1}^T ‖m_t‖²_{h*_{t−1}},  (14)

where ‖·‖_{h*_t} is the dual norm of ‖·‖_{h_t}; h_t is 1-strongly convex with respect to ‖·‖_{√V_t/α_t} for t ∈ [T] and h_0 is 1-strongly convex with respect to ‖·‖₂.

The proof of Theorem 3 is given in Appendix E. Since in most adaptive optimizers V_t is a weighted average of diag(g_t²), without loss of generality we assume α_t = α and

V_t = η V_{t−1} + diag(g_t²),  t ≥ 1,  (15)

where V_0 = 0 and η ≤ 1. Hence, we have the following lemma, whose proof is given in Appendix F.1.

Lemma 1. Suppose V_t is the weighted average of the squares of the gradients defined by (15), α_t = α, m_t is defined by (13), and V_t satisfies either of the following conditions:
1. η = 1;
2. η < 1, η ≥ γ and κ V_t ⪰ V_{t−1} for all t ≥ 1, where κ < 1.
Then we have

∑_{t=1}^T ‖m_t‖²_{(√V_t/α_t)^{−1}} < (2α/(1 − ν)) ∑_{i=1}^d ‖M_{T,i}‖₂,  (16)

where ν = max(γ, κ) and d is the dimension of x_t.

We can always add δ²I to V_t at each step to ensure V_t ≻ 0; then h_t(x) is 1-strongly convex with respect to ‖·‖_{√(δ²I + V_t)/α_t}. Let δ ≥ max_{t∈[T]} ‖g_t‖_∞. For t > 1, we have

‖m_t‖²_{h*_{t−1}} = ⟨m_t, α_t (δ²I + V_{t−1})^{−1/2} m_t⟩ ≤ ⟨m_t, α_t (diag(g_t²) + η V_{t−1})^{−1/2} m_t⟩ = ⟨m_t, α_t V_t^{−1/2} m_t⟩ = ‖m_t‖²_{(√V_t/α_t)^{−1}}.  (17)
For t = 1, we have

‖m_1‖²_{h*_0} = ⟨m_1, α_1 ((δ² + 1) I)^{−1/2} m_1⟩ ≤ ⟨m_1, α_1 diag(g_1²)^{−1/2} m_1⟩ = ⟨m_1, α_1 V_1^{−1/2} m_1⟩ = ‖m_1‖²_{(√V_1/α_1)^{−1}}.  (18)

From (17), (18) and Lemma 1, we have

Lemma 2. Suppose V_t, m_t, α_t, ν, d are defined as in Lemma 1, max_{t∈[T]} ‖g_t‖_∞ ≤ δ, ‖·‖²_{h*_t} = ⟨·, α_t (δ²I + V_t)^{−1/2} ·⟩ for t ≥ 1, and ‖·‖²_{h*_0} = ⟨·, α_1 ((δ² + 1) I)^{−1/2} ·⟩. Then

∑_{t=1}^T ‖m_t‖²_{h*_{t−1}} < (2α/(1 − ν)) ∑_{i=1}^d ‖M_{T,i}‖₂.  (19)

Therefore, from Theorem 3 and Lemma 2, we have

Corollary 1. Suppose V_t, m_t, α_t, h*_t, ν, d are defined as in Lemma 2, and there exist constants G, D_1, D_2 such that max_{t∈[T]} ‖g_t‖_∞ ≤ G ≤ δ, ‖x*‖_∞ ≤ D_1 and max_{t∈[T]} ‖x_t − x*‖_∞ ≤ D_2. Then

R_T < d D_1 ( λ1 + λ21 (√T G/(2α) + λ2)^{1/2} + λ2 D_1 ) + d G ( D_2²/(2α) + α/(1 − ν)² ) √T.  (20)

The proof of Corollary 1 is given in Appendix F.2. Furthermore, from Corollary 1, we have

Corollary 2. Suppose m_t is defined by (13), α_t = α, and the condition (19) is satisfied. Suppose there exist constants G, D_1, D_2 such that t G² I ⪰ V_t, max_{t∈[T]} ‖g_t‖_∞ ≤ G, ‖x*‖_∞ ≤ D_1 and max_{t∈[T]} ‖x_t − x*‖_∞ ≤ D_2. Then

R_T < d D_1 ( λ1 + λ21 (√T G/(2α) + λ2)^{1/2} + λ2 D_1 ) + d G ( D_2²/(2α) + α/(1 − ν)² ) √T.  (21)

Therefore, the regret of the update (6) is O(√T), and the method can achieve the optimal convergence rate O(1/√T) under the conditions of Corollary 1 or Corollary 2.

4.1. EXPERIMENT SETUP

We test the algorithms on three different large-scale real-world datasets with different neural network structures. These datasets are display-ads logs used for predicting ads CTR. The details of the datasets and models are given below. For convenience of discussion, we use MLP, OPNN and DCN to refer to the three datasets coupled with their corresponding models. Since the embedding layer holds most of the parameters of these neural networks when the features are very high-dimensional, we add the regularization terms only to the embedding layer. Furthermore, each embedding vector is considered as a group; a visual comparison of the ℓ1, ℓ21 and mixed regularization effects is given in Fig. 2 of Scardapane et al. (2016). We treat the training set as streaming data, hence we train for 1 epoch with a batch size of 512 and then run the validation. The experiments are conducted with 4-9 workers and 2-3 parameter servers, depending on the sizes of the datasets. We use the area under the receiver-operating characteristic curve (AUC) as the evaluation criterion, since it is widely used in evaluating classification problems; moreover, prior work validates AUC as a good measurement in CTR estimation (Graepel et al., 2010). We explore 5 learning rates from 1e-5 to 1e-1 with increments of 10x, and choose the one with the best AUC for each new optimizer in the case of no regularization terms (which is equivalent to the original optimizer by Theorem 2). All the experiments are run 5 times, and statistical significance is tested with the t-test. Without loss of generality, we choose two new optimizers to validate the performance: GROUP ADAM and GROUP ADAGRAD.

4.2. ADAM VS. GROUP ADAM

First, we compare the performance of the two optimizers at the same sparsity level. We keep λ1 and λ2 at zero and choose different values of λ21 in Algorithm 2, i.e., GROUP ADAM, to reach the same sparsity as ADAM with the magnitude pruning method, i.e., sorting the norms of the embedding vectors from largest to smallest and keeping the top N embedding vectors (N depending on the sparsity) when training finishes. Table 2 reports the average results of the two optimizers on the three datasets. Note that GROUP ADAM significantly outperforms ADAM on the AUC metric at the same sparsity level in most experiments. Furthermore, as shown in Figure 1, the same ℓ21-regularization strength λ21 has different effects on sparsity and accuracy across datasets. The best choice of λ21 depends on the dataset as well as the application (for example, if the memory of the serving resource is limited, sparsity might be relatively more important). One can trade off accuracy for more sparsity by increasing the value of λ21. Next, we compare the performance, on the AUC metric, of ADAM without the post-processing procedure, i.e., no magnitude pruning, against GROUP ADAM with the appropriate regularization terms chosen in Table 3. In general, a good default setting for λ2 is 1e-5. The results are shown in Table 4. Note that compared with ADAM, GROUP ADAM with appropriate regularization terms can achieve significantly better or highly competitive performance while producing extremely high sparsity.

4.3. ADAGRAD VS. GROUP ADAGRAD

We compare the performance, on the AUC metric, of ADAGRAD without magnitude pruning against GROUP ADAGRAD with the appropriate regularization terms chosen in Table 5. The results are shown in Table 6. Again, note that in comparison to ADAGRAD, GROUP ADAGRAD can not only achieve significantly better or highly competitive AUC, but also effectively and efficiently reduce the dimensionality of the features.

4.4. DISCUSSION

In this section we discuss the hyperparameters of embedding dimension, ℓ1-regularization and ℓ21-regularization, to show how they affect the regularization effects.

Embedding Dimension. Table 7 of Appendix G reports the average results for different embedding dimensions of MLP, with the GROUP ADAM optimizer and the same regularization terms as for MLP in Table 5. Note that the sparsity increases with the embedding dimension. The reason is that the square root of the embedding dimension multiplies the ℓ21-regularization.

ℓ1 vs. ℓ21. From lines 8 and 10 of Algorithm 1, we know that if z_t had all elements equal, the ℓ1 and ℓ21 parameters, i.e., λ1 and λ21, would have the same regularization effect. However, this situation almost never happens in practice. Without loss of generality, we set the optimizer, λ2 and embedding dimension to GROUP ADAM, 1e-5 and 16 respectively, and choose different values of λ1, λ21. The results on MLP are shown in Table 8 of Appendix G. It is obvious that ℓ21-regularization is much more effective than ℓ1-regularization at producing sparsity. For example, when λ1 = 0 and λ21 = 5e-3, the feature sparsity is 0.136, while for λ1 = 5e-3 and λ21 = 0, the feature sparsity is 0.470. Therefore, if we just want to produce sparsity, we can tune only λ21 and use default settings for λ2 and λ1, i.e., λ2 = 1e-5 and λ1 = 0.

5. CONCLUSION

In this paper, we propose a novel framework that adds regularization terms to a family of adaptive optimizers in order to produce sparsity in DNN models. We apply this framework to create a new class of optimizers. We provide closed-form solutions and algorithms with slight modifications. We establish the relation between the new and original optimizers, i.e., our new optimizers become equivalent to the corresponding original ones once the regularization terms vanish. We theoretically prove the convergence rate of the regret, and we also conduct an empirical evaluation of the proposed optimizers in comparison to the original optimizers with and without magnitude pruning. The results clearly demonstrate the advantages of our proposed optimizers in both achieving significantly better performance and producing sparsity. Finally, it would be interesting to investigate convergence in non-convex settings in future work and to evaluate our optimizers on more applications from fields such as computer vision and natural language processing.

APPENDIX A GROUP ADAM

Algorithm 2: Group Adam
1: Input: parameters λ1, λ21, λ2, β1, β2, x_1 ∈ R^n, step size α; initialize z_0 = 0, m̂_0 = 0, V̂_0 = 0, V_0 = 0
2: for t = 1 to T do
3:   g_t = ∇f_t(x_t)
4:   m̂_t ← β1 m̂_{t−1} + (1 − β1) g_t
5:   m_t = m̂_t / (1 − β1^t)
6:   V̂_t ← β2 V̂_{t−1} + (1 − β2) diag(g_t²)
7:   V_t = V̂_t / (1 − β2^t)
8:   Q_t = √V_t − √V_{t−1} + I if t = 1; Q_t = √V_t − √V_{t−1} if t > 1
9:   z_t ← z_{t−1} + m_t − (1/α) Q_t x_t
10:  for i ∈ {1, ..., n} do
11:    s_{t,i} = 0 if |z_{t,i}| ≤ λ1; s_{t,i} = sign(z_{t,i}) λ1 − z_{t,i} otherwise
12:  end for
13:  x_{t+1} = ((√V_t + I)/α + 2λ2 I)^{−1} max(1 − √(d_{x^g}) λ21/‖s_t‖₂, 0) s_t  (applied per group g)
14: end for
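For concreteness, here is a single-group Python sketch of Algorithm 2 (diagonal matrices stored as vectors, the whole parameter vector treated as one group; our illustration, not the authors' released code):

```python
import numpy as np

def group_adam(grad_fn, x, T, alpha=0.01, beta1=0.9, beta2=0.999,
               lam1=0.0, lam21=0.0, lam2=0.0):
    """Run T steps of Group Adam (Algorithm 2) with one group of size n."""
    n = x.size
    z = np.zeros(n)
    m_hat = np.zeros(n)
    v_hat = np.zeros(n)
    v_sqrt_prev = np.zeros(n)  # sqrt of the bias-corrected V_{t-1}
    for t in range(1, T + 1):
        g = grad_fn(x)
        m_hat = beta1 * m_hat + (1 - beta1) * g          # line 4
        m = m_hat / (1 - beta1 ** t)                     # line 5
        v_hat = beta2 * v_hat + (1 - beta2) * g * g      # line 6
        v = v_hat / (1 - beta2 ** t)                     # line 7
        q = np.sqrt(v) - v_sqrt_prev + (1.0 if t == 1 else 0.0)  # line 8
        z += m - q * x / alpha                           # line 9
        s = np.where(np.abs(z) <= lam1, 0.0, np.sign(z) * lam1 - z)
        norm = np.linalg.norm(s)
        shrink = max(1.0 - np.sqrt(n) * lam21 / norm, 0.0) if norm > 0 else 0.0
        x = shrink * s / ((np.sqrt(v) + 1.0) / alpha + 2.0 * lam2)  # line 13
        v_sqrt_prev = np.sqrt(v)
    return x
```

With all λ set to zero this reduces, by Theorem 2, to an ADAM-style update; with a large enough λ21, the entire group is driven exactly to zero.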

B GROUP ADAGRAD

Algorithm 3: Group Adagrad
1: Input: parameters λ1, λ21, λ2, x_1 ∈ R^n, step size α; initialize z_0 = 0, V_0 = 0
2: for t = 1 to T do
3:   g_t = ∇f_t(x_t)
4:   m_t = g_t
5:   V_t = V_{t−1} + diag(g_t²) + I if t = 1; V_t = V_{t−1} + diag(g_t²) if t > 1
6:   Q_t = √V_t − √V_{t−1}
7:   z_t ← z_{t−1} + m_t − (1/α) Q_t x_t
8:   for i ∈ {1, ..., n} do
9:     s_{t,i} = 0 if |z_{t,i}| ≤ λ1; s_{t,i} = sign(z_{t,i}) λ1 − z_{t,i} otherwise
10:  end for
11:  x_{t+1} = (√V_t/α + 2λ2 I)^{−1} max(1 − √(d_{x^g}) λ21/‖s_t‖₂, 0) s_t  (applied per group g)
12: end for

C PROOF OF THEOREM 1

Proof. We have

x_{t+1} = argmin_x { m_{1:t} · x + ∑_{s=1}^t (1/(2α_s)) (x − x_s)^T Q_s (x − x_s) + Ψ_t(x) }
        = argmin_x { m_{1:t} · x + ∑_{s=1}^t (1/(2α_s)) ( ‖Q_s^{1/2} x‖₂² − 2x · (Q_s x_s) + ‖Q_s^{1/2} x_s‖₂² ) + Ψ_t(x) }
        = argmin_x { ( m_{1:t} − ∑_{s=1}^t (Q_s/α_s) x_s ) · x + ∑_{s=1}^t (1/(2α_s)) ‖Q_s^{1/2} x‖₂² + Ψ_t(x) }.  (22)

Define z_{t−1} = m_{1:t−1} − ∑_{s=1}^{t−1} (Q_s/α_s) x_s (t ≥ 2); then we can calculate z_t as

z_t = z_{t−1} + m_t − (Q_t/α_t) x_t,  t ≥ 1.  (23)

Substituting (23), (22) simplifies to

x_{t+1} = argmin_x { z_t · x + ∑_{s=1}^t (1/(2α_s)) ‖Q_s^{1/2} x‖₂² + Ψ_t(x) }.  (24)

Substituting Ψ_t(x) (Eq. (5)) into (24), we get

x_{t+1} = argmin_x { z_t · x + ∑_{g=1}^G [ λ1 ‖x^g‖₁ + λ21 √(d_{x^g}) ‖( ∑_{s=1}^t Q_s^g/(2α_s) + λ2 I )^{1/2} x^g‖₂ ] + ‖( ∑_{s=1}^t Q_s/(2α_s) + λ2 I )^{1/2} x‖₂² }.  (25)

Since the objective of (25) is group-wise and element-wise separable, we can focus on the solution in one group, say g, and one entry, say i, of the g-th group. Let ∑_{s=1}^t Q_s^g/(2α_s) = diag(σ_t^g), where σ_t^g = (σ^g_{t,1}, ..., σ^g_{t,d_{x^g}}). The objective of (25) on x^g_{t+1,i} is

Ω(x^g_{t+1,i}) = z^g_{t,i} x^g_{t+1,i} + λ1 |x^g_{t+1,i}| + Φ(x^g_{t+1,i}),  (26)

where Φ(x^g_{t+1,i}) = λ21 √(d_{x^g}) ‖(σ^g_{t,i} + λ2)^{1/2} x^g_{t+1,i}‖₂ + ‖(σ^g_{t,i} + λ2)^{1/2} x^g_{t+1,i}‖₂² is a non-negative function with Φ(x^g_{t+1,i}) = 0 iff x^g_{t+1,i} = 0, for all i ∈ {1, ..., d_{x^g}}. We discuss the optimal solution of (26) in three cases:
a) If z^g_{t,i} = 0, then x^g_{t+1,i} = 0.
b) If z^g_{t,i} > 0, then x^g_{t+1,i} ≤ 0; otherwise, if x^g_{t+1,i} > 0, we would have Ω(−x^g_{t+1,i}) < Ω(x^g_{t+1,i}), contradicting the minimality of Ω at x^g_{t+1,i}. Next, if z^g_{t,i} ≤ λ1, then x^g_{t+1,i} = 0; otherwise, if x^g_{t+1,i} < 0, we would have Ω(x^g_{t+1,i}) = (z^g_{t,i} − λ1) x^g_{t+1,i} + Φ(x^g_{t+1,i}) > Ω(0), again contradicting the minimality of Ω. Third, if z^g_{t,i} > λ1 (∀ i = 1, ..., d_{x^g}), the objective of (26) for the g-th group, Ω(x^g_{t+1}), becomes (z^g_t − λ1 1_{d_{x^g}}) · x^g_{t+1} + Φ(x^g_{t+1}).
c) If z^g_{t,i} < 0, the analysis is similar to b): we have x^g_{t+1,i} ≥ 0, and when −z^g_{t,i} ≤ λ1, x^g_{t+1,i} = 0.
When −z^g_{t,i} > λ1 (∀ i = 1, ..., d_{x^g}), we have Ω(x^g_{t+1}) = (z^g_t + λ1 1_{d_{x^g}}) · x^g_{t+1} + Φ(x^g_{t+1}). From a), b), c) above, we have

x^g_{t+1} = argmin_x { −s^g_t · x + Φ(x) },  (27)

where the i-th element of s^g_t is defined as in (9). Define

y = (diag(σ^g_t) + λ2 I)^{1/2} x.  (28)

Substituting (28) into (27), we get

y^g_{t+1} = argmin_y { −s̃^g_t · y + λ21 √(d_{x^g}) ‖y‖₂ + ‖y‖₂² },  (29)

where s̃^g_t = (diag(σ^g_t) + λ2 I)^{−1/2} s^g_t, as in (10). This is an unconstrained non-smooth optimization problem. Its optimality condition (see Rockafellar (1970), Section 27) states that y^g_{t+1} is an optimal solution if and only if there exists ξ ∈ ∂‖y^g_{t+1}‖₂ such that

−s̃^g_t + λ21 √(d_{x^g}) ξ + 2 y^g_{t+1} = 0.  (30)

The subdifferential of ‖y‖₂ is ∂‖y‖₂ = { ζ ∈ R^{d_{x^g}} : ‖ζ‖₂ ≤ 1 } if y = 0, and ∂‖y‖₂ = { y/‖y‖₂ } if y ≠ 0. Similarly to the analysis of ℓ1-regularization, we discuss the solution of (30) in two cases:
a) If ‖s̃^g_t‖₂ ≤ λ21 √(d_{x^g}), then y^g_{t+1} = 0 and ξ = s̃^g_t/(λ21 √(d_{x^g})) ∈ ∂‖0‖₂ satisfy (30). We also show that there is no solution other than y^g_{t+1} = 0. Without loss of generality, assume y^g_{t+1,i} ≠ 0 for all i ∈ {1, ..., d_{x^g}}; then ξ = y^g_{t+1}/‖y^g_{t+1}‖₂, and

−s̃^g_t + ( λ21 √(d_{x^g}) / ‖y^g_{t+1}‖₂ ) y^g_{t+1} + 2 y^g_{t+1} = 0.  (31)

From (31), we can derive ( λ21 √(d_{x^g})/‖y^g_{t+1}‖₂ + 2 ) ‖y^g_{t+1}‖₂ = ‖s̃^g_t‖₂. Furthermore, we have

‖y^g_{t+1}‖₂ = (1/2) ( ‖s̃^g_t‖₂ − λ21 √(d_{x^g}) ),  (32)

where ‖y^g_{t+1}‖₂ > 0 and ‖s̃^g_t‖₂ − λ21 √(d_{x^g}) ≤ 0 contradict each other.
b) If ‖s̃^g_t‖₂ > λ21 √(d_{x^g}), then from (31) and (32), we get

y^g_{t+1} = (1/2) ( 1 − λ21 √(d_{x^g})/‖s̃^g_t‖₂ ) s̃^g_t.  (33)

Replacing y^g_{t+1} of (33) by x^g_{t+1} using (28), we have

x^g_{t+1} = (diag(σ^g_t) + λ2 I)^{−1/2} y^g_{t+1} = (2 diag(σ^g_t) + 2 λ2 I)^{−1} ( 1 − λ21 √(d_{x^g})/‖s̃^g_t‖₂ ) s^g_t = ( ∑_{s=1}^t Q_s/α_s + 2 λ2 I )^{−1} ( 1 − λ21 √(d_{x^g})/‖s̃^g_t‖₂ ) s^g_t.  (34)

Combining a) and b) above, we finish the proof.

D PROOF OF THEOREM 2

Proof. We use the method of induction. a) When t = 1, Algorithm 1 gives

Q_1 = α_1 (√V_1/α_1 − √V_0/α_0) = √V_1,
z_1 = z_0 + m_1 − (Q_1/α_1) x_1 = m_1 − (√V_1/α_1) x_1,
s_1 = −z_1 = (√V_1/α_1) x_1 − m_1,
x_2 = (√V_1/α_1)^{−1} s_1 = x_1 − α_1 m_1/√V_1,  (35)

which equals Eq. (1). b) Assume that Eq. (35) holds at t = T, i.e.,

z_T = m_T − (√V_T/α_T) x_T,  x_{T+1} = x_T − α_T m_T/√V_T.

For t = T + 1, we have

z_{T+1} = z_T + m_{T+1} − (Q_{T+1}/α_{T+1}) x_{T+1}
        = m_T − (√V_T/α_T) x_T + m_{T+1} − (Q_{T+1}/α_{T+1}) x_{T+1}
        = m_T − (√V_T/α_T)(x_{T+1} + α_T m_T/√V_T) + m_{T+1} − (Q_{T+1}/α_{T+1}) x_{T+1}
        = m_{T+1} − (√V_T/α_T + Q_{T+1}/α_{T+1}) x_{T+1}
        = m_{T+1} − (√V_{T+1}/α_{T+1}) x_{T+1},
x_{T+2} = (√V_{T+1}/α_{T+1})^{−1} s_{T+1} = −(√V_{T+1}/α_{T+1})^{−1} z_{T+1} = x_{T+1} − α_{T+1} m_{T+1}/√V_{T+1}.

Hence, we complete the proof.

E PROOF OF THEOREM 3

Proof. Let

h_t(x) = ∑_{s=1}^t (1/(2α_s)) ‖Q_s^{1/2}(x − x_s)‖₂²  for t ∈ [T],  and  h_0(x) = (1/2) ‖x − c‖₂².

It is easy to verify that for all t ∈ [T], h_t(x) is 1-strongly convex with respect to ‖·‖_{√V_t/α_t}, since √V_t/α_t = ∑_{s=1}^t Q_s/α_s, and h_0(x) is 1-strongly convex with respect to ‖·‖₂. From (7), we have

R_T = ∑_{t=1}^T (f_t(x_t) − f_t(x*))
   ≤ ∑_{t=1}^T ⟨g_t, x_t − x*⟩
   = ∑_{t=1}^T ⟨m_t − γ m_{t−1}, x_t − x*⟩
   ≤ ∑_{t=1}^T ⟨m_t, x_t − x*⟩
   = ∑_{t=1}^T ⟨m_t, x_t⟩ + Ψ_T(x*) + h_T(x*) + ( ∑_{t=1}^T ⟨−m_t, x*⟩ − Ψ_T(x*) − h_T(x*) )
   ≤ ∑_{t=1}^T ⟨m_t, x_t⟩ + Ψ_T(x*) + h_T(x*) + sup_{x∈Q} { ⟨−m_{1:T}, x⟩ − Ψ_T(x) − h_T(x) },

where the first and second inequalities above use the convexity of f_t(x) and the condition (12), respectively. We define h*_t(u) to be the conjugate dual of Ψ_t(x) + h_t(x):

h*_t(u) = sup_{x∈Q} { ⟨u, x⟩ − Ψ_t(x) − h_t(x) },  t ≥ 0,  (37)

where Ψ_0(x) = 0. Since h_t(x) is 1-strongly convex with respect to the norm ‖·‖_{h_t}, the conjugate h*_t has 1-Lipschitz-continuous gradients with respect to the dual norm. As a corollary of (37), we have the following inequality:

h*_t(u + δ) ≤ h*_t(u) + ⟨∇h*_t(u), δ⟩ + (1/2) ‖δ‖²_{h*_t}.  (38)

Applying (38) repeatedly with u = −m_{1:t−1} and δ = −m_t, and using ∇h*_{t−1}(−m_{1:t−1}) = x_t together with h_{t−1} ≤ h_t and Ψ_{t−1} ≤ Ψ_t, we obtain

sup_{x∈Q} { ⟨−m_{1:T}, x⟩ − Ψ_T(x) − h_T(x) } = h*_T(−m_{1:T}) ≤ h*_0(0) − ∑_{t=1}^T ⟨m_t, x_t⟩ + (1/2) ∑_{t=1}^T ‖m_t‖²_{h*_{t−1}},

and h*_0(0) = 0. Combining this with the bound on R_T above completes the proof.

F ADDITIONAL PROOFS

F.1 PROOF OF LEMMA 1

Proof. Let V_t = diag(σ_t), where σ_t is the vector of the diagonal elements of V_t. For the i-th entry of σ_t, substituting (13) into (15), we have

σ_{t,i} = g²_{t,i} + η σ_{t−1,i} = (m_{t,i} − γ m_{t−1,i})² + η g²_{t−1,i} + η² σ_{t−2,i} = ∑_{s=1}^t η^{t−s} (m_{s,i} − γ m_{s−1,i})²
       ≥ ∑_{s=1}^t η^{t−s} (1 − γ)(m²_{s,i} − γ m²_{s−1,i}) = (1 − γ) ( m²_{t,i} + (η − γ) ∑_{s=1}^{t−1} η^{t−s−1} m²_{s,i} ).  (42)

Next, we discuss the value of η in two cases.
a) η = 1. From (42), we have

σ_{t,i} ≥ (1 − γ) ( m²_{t,i} + (1 − γ) ∑_{s=1}^{t−1} m²_{s,i} ) > (1 − γ)² ∑_{s=1}^t m²_{s,i} ≥ (1 − ν)² ∑_{s=1}^t m²_{s,i}.  (43)

Recalling the definition of M_{t,i} in Section 1.5, from (43) we have

∑_{t=1}^T m²_{t,i}/√σ_{t,i} < (1/(1 − ν)) ∑_{t=1}^T m²_{t,i}/‖M_{t,i}‖₂ ≤ (2/(1 − ν)) ‖M_{T,i}‖₂,

where the last inequality above follows from Appendix C of Duchi et al. (2011). Therefore, we get

∑_{t=1}^T ‖m_t‖²_{(√V_t/α_t)^{−1}} = α ∑_{t=1}^T ∑_{i=1}^d m²_{t,i}/√σ_{t,i} < (2α/(1 − ν)) ∑_{i=1}^d ‖M_{T,i}‖₂.  (44)

b) η < 1.
We assume η ≥ γ and κ V_t ⪰ V_{t−1}, where κ < 1. From (42) with η ≥ γ we have σ_{s,i} ≥ (1 − γ) m²_{s,i} for each s, and the condition κ V_t ⪰ V_{t−1} gives σ_{s,i} ≤ κ^{t−s} σ_{t,i}. Summing over s = 1, ..., t yields ((1 − κ^t)/(1 − κ)) σ_{t,i} ≥ (1 − γ) ∑_{s=1}^t m²_{s,i}. Hence, we get

σ_{t,i} ≥ ((1 − κ)/(1 − κ^t)) (1 − γ) ∑_{s=1}^t m²_{s,i} > (1 − κ)(1 − γ) ∑_{s=1}^t m²_{s,i} ≥ (1 − ν)² ∑_{s=1}^t m²_{s,i},

which yields the same conclusion (44) as in case a). Combining a) and b), we complete the proof.

F.2 PROOF OF COROLLARY 1

Proof. From the definitions of m_t (13) and V_t (15), we have

|m_{t,i}| = | ∑_{s=1}^t γ^{t−s} g_{s,i} | ≤ ((1 − γ^t)/(1 − γ)) G < G/(1 − γ) ≤ G/(1 − ν),  (45)
σ_{t,i} = | ∑_{s=1}^t η^{t−s} g²_{s,i} | ≤ t G².

Hence, we have

Ψ_T(x*) ≤ λ1 d D_1 + λ21 d D_1 ( √T G/(2α) + λ2 )^{1/2} + λ2 d D_1²,  (46)
h_T(x*) ≤ d D_2² G √T/(2α),  (47)
(1/2) ∑_{t=1}^T ‖m_t‖²_{h*_{t−1}} < (α/(1 − ν)) ∑_{i=1}^d √T G/(1 − ν) = d α G √T/(1 − ν)².  (48)

Combining (46), (47) and (48), we complete the proof.

G ADDITIONAL EMPIRICAL RESULTS



Footnotes:
1. To fulfill the research interest in optimization methods, we will release the code in the future.
2. We only use the data from seasons 2 and 3 because of the same data schema.
3. See https://github.com/Atomu2014/Ads-RecSys-Datasets/ for details.
4. Limited by the training resources available, we do not use the optimal hyperparameter settings of Wang et al. (2017).
* It is significantly better than embedding dimensions of 4 and 8, but shows no difference at the 95% confidence level from the embedding dimension of 32.



Figure 1: The AUC across different sparsity levels for the two optimizers on the three datasets. The x-axis is sparsity (the number of non-zero features, i.e., features whose embedding vectors are not equal to 0, divided by the total number of features present in the training data). The y-axis is AUC.

Table 1: Adaptive optimizers with different choices of m_t, V_t and α_t.

a) The Avazu CTR dataset (Avazu, 2015) contains approximately 40M samples and 22 categorical features over 10 days. To handle categorical data, we use the one-hot-encoding-based embedding technique (see, e.g., Wang et al. (2017), Section 2.1, or Naumov et al. (2019), Section 2.1.1) and get 9.4M features in total. For this dataset, the samples from the first 9 days (containing 8.7M one-hot features) are used for training, while the rest are for testing. Our DNN model follows the basic structure of most deep CTR models. Specifically, the model comprises one embedding layer, which maps each one-hot feature into 16-dimensional embeddings, and four fully connected layers (with output dimensions of 64, 32, 16 and 1, respectively) in sequence.
b) The iPinYou dataset (iPinYou, 2013) is another real-world dataset of ad click logs over 21 days. The dataset contains 16 categorical features. After one-hot encoding, we get a dataset containing 19.5M instances with 1033.1K input dimensions. We keep the original train/test splitting scheme, where the training set contains 15.4M samples with 937.7K one-hot features. We use the Outer Product-based Neural Network (OPNN) (Qu et al., 2016) and follow the standard settings of Qu et al. (2016), i.e., one embedding layer with an embedding dimension of 10, one product layer, and three hidden layers of sizes 512, 256 and 128 respectively, with a dropout rate of 0.5.
c) The third dataset is the Criteo Display Ads dataset (Criteo, 2014), which contains approximately 46M samples over 7 days. There are 13 integer features and 26 categorical features. After one-hot encoding of the categorical features, we have 33.8M features in total. We split the dataset into 7 partitions in chronological order and select the earliest 6 parts for training (containing 29.6M features) and the rest for testing, though the dataset has no timestamp.
We use the Deep & Cross Network (DCN) (Wang et al., 2017) and choose the following settings: one embedding layer with embedding dimension 8, two deep layers of size 64 each, and two cross layers.

Table 2: AUC for the two optimizers, with sparsity (feature rate) in parentheses. The best AUC for each dataset at each sparsity level is bolded. The p-value of the t-test on AUC is also listed.



Table 4: AUC for the three datasets, with sparsity (feature rate) in parentheses. The best value for each dataset is bolded. The p-value of the t-test is also listed.



Table 6: AUC for the three datasets, with sparsity (feature rate) in parentheses. The best value for each dataset is bolded. The p-value of the t-test is also listed.

Table 7: AUC of MLP for different embedding dimensions, with sparsity (feature rate) in parentheses. The best results are bolded.

Table 8: Sparsity (feature rate) of MLP for different values of λ21 and λ1, with AUC in parentheses.

