ADAPTIVE OPTIMIZERS WITH SPARSE GROUP LASSO

Abstract

We develop a novel framework that adds regularizers to a family of adaptive optimizers in deep learning, such as MOMENTUM, ADAGRAD, ADAM, AMSGRAD and ADAHESSIAN, and creates a new class of optimizers, named GROUP MOMENTUM, GROUP ADAGRAD, GROUP ADAM, GROUP AMSGRAD and GROUP ADAHESSIAN accordingly. We establish theoretically proven convergence guarantees in the stochastic convex settings, based on primal-dual methods. We evaluate the regularization effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that, compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, the models trained with our methods perform significantly better at the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance.

1. INTRODUCTION

With the development of deep learning, deep neural network (DNN) models have been widely used in various machine learning scenarios such as search, recommendation and advertisement, and have achieved significant improvements. Over the last decades, many optimization methods based on variations of stochastic gradient descent (SGD) have been invented for training DNN models. However, most optimizers cannot directly produce sparsity, which has proven effective and efficient for saving computational resources and improving model performance, especially in scenarios with very high-dimensional data. Meanwhile, the simple rounding approach is very unreliable due to the inherently low accuracy of these optimizers. In this paper, we develop a new class of optimization methods that adds regularizers, especially sparse group lasso, to prevalent adaptive optimizers while retaining the characteristics of the respective optimizers. Compared with the original optimizers followed by a post-processing procedure that uses magnitude pruning, the models trained with the new optimizers perform significantly better at the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, the new optimizers achieve extremely high sparsity with significantly better or highly competitive performance. In this section, we describe the two types of optimization methods and explain the motivation of our work.

1.1. ADAPTIVE OPTIMIZATION METHODS

Due to their simplicity and effectiveness, adaptive optimization methods (Robbins & Monro, 1951; Polyak, 1964; Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2015; Reddi et al., 2018; Yao et al., 2020) have become the de facto algorithms used in deep learning. There are multiple variants, but they can be represented by the general update formula (Reddi et al., 2018):

$$x_{t+1} = x_t - \alpha_t \, m_t / \sqrt{V_t}, \qquad (1)$$

where $\alpha_t$ is the step size, $m_t$ is the first moment term, a weighted average of the gradients $g_t$, and $V_t$ is the so-called second moment term that adjusts the update velocity of the variable $x_t$ in each direction. Here, $\sqrt{V_t} := V_t^{1/2}$ and $m_t / \sqrt{V_t} := \sqrt{V_t}^{-1} \cdot m_t$.
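For concreteness, the following minimal NumPy sketch instantiates the generic rule (1) with ADAM's choice of first moment $m_t$ and diagonal second moment $V_t$; the bias-correction terms and the small constant `eps` follow the standard ADAM recipe rather than the generic formula, and all variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def adaptive_update(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of x_{t+1} = x_t - alpha_t * m_t / sqrt(V_t), with ADAM-style moments."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment m_t: weighted average of gradients g_t
    v = beta2 * v + (1.0 - beta2) * g ** 2   # diagonal second moment V_t
    m_hat = m / (1.0 - beta1 ** t)           # bias correction, as in ADAM
    v_hat = v / (1.0 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate scaling by sqrt(V_t)^{-1}
    return x, m, v
```

Other optimizers in this family differ only in how $m_t$ and $V_t$ are formed: MOMENTUM uses the identity for $V_t$, ADAGRAD accumulates squared gradients without exponential decay, and AMSGRAD keeps the running maximum of the second moment.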

