ADAPTIVE OPTIMIZERS WITH SPARSE GROUP LASSO

Abstract

We develop a novel framework that adds regularizers to a family of adaptive optimizers in deep learning, such as MOMENTUM, ADAGRAD, ADAM, AMSGRAD and ADAHESSIAN, and thereby create a new class of optimizers, named GROUP MOMENTUM, GROUP ADAGRAD, GROUP ADAM, GROUP AMSGRAD and GROUP ADAHESSIAN, respectively. We establish theoretically proven convergence guarantees in the stochastic convex setting, based on primal-dual methods. We evaluate the regularization effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that, compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, our methods significantly improve model performance at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance.

1. INTRODUCTION

With the development of deep learning, deep neural network (DNN) models have been widely used in machine learning scenarios such as search, recommendation and advertisement, and have achieved significant improvements. In recent decades, many optimization methods based on variants of stochastic gradient descent (SGD) have been invented for training DNN models. However, most optimizers cannot directly produce sparsity, which has proven effective and efficient for saving computational resources and improving model performance, especially on very high-dimensional data. Meanwhile, simple rounding approaches are unreliable due to the inherent low accuracy of these optimizers. In this paper, we develop a new class of optimization methods that adds regularizers, especially sparse group lasso, to prevalent adaptive optimizers while retaining the characteristics of the respective optimizers. Compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, the performance of the models can be significantly improved at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, the new optimizers can achieve extremely high sparsity with significantly better or highly competitive performance. In this section, we describe the two types of optimization methods and explain the motivation of our work.

1.1. ADAPTIVE OPTIMIZATION METHODS

Due to their simplicity and effectiveness, adaptive optimization methods (Robbins & Monro, 1951; Polyak, 1964; Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2015; Reddi et al., 2018; Yao et al., 2020) have become the de-facto algorithms used in deep learning. There are multiple variants, but they can be represented using the general update formula (Reddi et al., 2018):

x_{t+1} = x_t - α_t m_t / √V_t,    (1)

where α_t is the step size, m_t is the first moment term, i.e., the weighted average of the gradients g_t, and V_t is the so-called second moment term that adjusts the update velocity of the variable x_t in each direction. Here, √V_t := V_t^{1/2} and m_t / √V_t := (√V_t)^{-1} · m_t.

Table 1: Adaptive optimizers with different choices of m_t, V_t and α_t.

Optimizer  | m_t                          | V_t                                                        | α_t
MOMENTUM   | γ m_{t-1} + g_t              | I                                                          | α
ADAGRAD    | g_t                          | diag(Σ_{i=1}^t g_i^2)/t                                    | α/√t
ADAM       | β_1 m_{t-1} + (1-β_1) g_t    | β_2 V_{t-1} + (1-β_2) diag(g_t^2)                          | α √(1-β_2^t)/(1-β_1^t)
AMSGRAD    | β_1 m_{t-1} + (1-β_1) g_t    | max(V_{t-1}, β_2 V_{t-1} + (1-β_2) diag(g_t^2))            | α √(1-β_2^t)/(1-β_1^t)
ADAHESSIAN | β_1 m_{t-1} + (1-β_1) g_t    | β_2 V_{t-1} + (1-β_2) D_t^2 *                              | α √(1-β_2^t)/(1-β_1^t)

* D_t = diag(H_t), where H_t is the Hessian matrix.
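As a concrete illustration, instantiating the generic update (1) with ADAM's choices of m_t, V_t and α_t from the table gives the familiar per-step rule below (a minimal NumPy sketch; the small constant eps is a standard numerical-stability term not shown in the table, and all names are ours):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of x_{t+1} = x_t - alpha_t * m_t / sqrt(V_t) with ADAM's
    choices: m_t and V_t are exponential moving averages, and alpha_t
    carries the bias-correction factor sqrt(1 - beta2^t) / (1 - beta1^t)."""
    m = beta1 * m + (1 - beta1) * g              # first moment m_t
    v = beta2 * v + (1 - beta2) * g * g          # diagonal second moment V_t
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    x = x - alpha_t * m / (np.sqrt(v) + eps)     # eps guards division by zero
    return x, m, v

# usage: minimize f(x) = ||x||^2, whose gradient is 2x
x = np.array([1.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t, alpha=0.05)
```

Note that none of these updates contains a regularization term, which is why sparsity has to be imposed separately, e.g., by post-hoc magnitude pruning.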

1.2. REGULARIZED OPTIMIZATION METHODS

Follow-the-regularized-leader (FTRL) (McMahan & Streeter, 2010; McMahan et al., 2013) has been widely used in click-through rate (CTR) prediction problems; it adds ℓ1-regularization (lasso) to logistic regression and can effectively balance the performance of the model and the sparsity of features. The update formula (McMahan et al., 2013) is:

x_{t+1} = arg min_x { g_{1:t} · x + (1/2) Σ_{s=1}^t σ_s ||x - x_s||_2^2 + λ_1 ||x||_1 },    (2)

where g_{1:t} = Σ_{s=1}^t g_s, the term (1/2) Σ_{s=1}^t σ_s ||x - x_s||_2^2 is the strong-convexity term that stabilizes the algorithm, and λ_1 ||x||_1 is the regularization term that produces sparsity. However, FTRL does not work well in DNN models, since one input feature can correspond to multiple weights, and lasso can only zero out individual weights, hence cannot effectively remove entire features. To solve this problem, Ni et al. (2019) add ℓ_{2,1}-regularization (group lasso) to FTRL, yielding G-FTRL. Yang et al. (2010) study a group lasso method for online learning that adds ℓ_{2,1}-regularization to the Dual Averaging (DA) algorithm (Nesterov, 2009), named DA-GL. Even so, these two methods cannot be applied to other optimizers. Different scenarios in deep learning call for different optimizers. For example, MOMENTUM (Polyak, 1964) is typically used in computer vision; ADAM (Kingma & Ba, 2015) is used for training transformer models in natural language processing; and ADAGRAD (Duchi et al., 2011) is used in recommendation systems. If we want to produce sparsity of the model in some scenario, we have to change the optimizer, which will probably influence the performance of the model.
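For intuition, the FTRL update above admits a per-coordinate closed form (McMahan et al., 2013) in which any coordinate whose accumulated statistic z_i satisfies |z_i| ≤ λ_1 is set exactly to zero. A minimal NumPy sketch of one step of the FTRL-Proximal variant (variable names are ours; σ_t is chosen as (√n_t − √n_{t−1})/α so the σ_s telescope to √n_t/α):

```python
import numpy as np

def ftrl_step(z, n, x, g, alpha=0.1, beta=1.0, lam1=1.0):
    """One FTRL-Proximal step with l1-regularization.
    z accumulates g_s - sigma_s * x_s and n accumulates squared gradients;
    the update below is the closed-form minimizer of the regularized
    objective, solved independently per coordinate."""
    n_new = n + g * g
    sigma = (np.sqrt(n_new) - np.sqrt(n)) / alpha
    z = z + g - sigma * x
    # exact zeros whenever |z_i| <= lam1: this is where sparsity comes from
    x = np.where(np.abs(z) <= lam1, 0.0,
                 -(z - np.sign(z) * lam1) / ((beta + np.sqrt(n_new)) / alpha))
    return z, n_new, x

# usage: a large-gradient coordinate gets a nonzero weight, while a
# small-gradient coordinate stays exactly zero
z, n, x = np.zeros(2), np.zeros(2), np.zeros(2)
z, n, x = ftrl_step(z, n, x, np.array([3.0, 0.1]))  # x == [-0.05, 0.0]
```

The thresholding step makes clear why lasso zeroes out single weights rather than whole feature groups: each coordinate is tested against λ_1 in isolation, which motivates the group lasso extensions discussed above.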

1.3. MOTIVATION

Eq. (1) can be rewritten in the form:

x_{t+1} = arg min_x { m_t · x + (1/(2α_t)) ||V_t^{1/4} (x - x_t)||_2^2 }.    (3)

Furthermore, we can rewrite Eq. (3) as

x_{t+1} = arg min_x { m_{1:t} · x + Σ_{s=1}^t (1/(2α_s)) ||Q_s^{1/2} (x - x_s)||_2^2 },    (4)

where m_{1:t} = Σ_{s=1}^t m_s and Σ_{s=1}^t Q_s/α_s = √V_t/α_t. It is easy to prove by induction that Eq. (3) and Eq. (4) are equivalent. The matrices Q_s can be interpreted as generalized learning rates. To the best of our knowledge, the V_t in Eq. (1) is diagonal for all the adaptive optimization methods, for computational simplicity. Therefore, we consider Q_s to be diagonal matrices throughout this paper. We find that Eq. (4) is similar to Eq. (2) except for the regularization term. Therefore, we add the regularization term Ψ(x) to Eq. (4), which is the sparse group lasso penalty also including ℓ2-




