A HALF-SPACE STOCHASTIC PROJECTED GRADIENT METHOD FOR GROUP SPARSITY REGULARIZATION

Abstract

Optimizing with group sparsity is important for enhancing model interpretability in machine learning applications, e.g., feature selection, compressed sensing, and model compression. However, for large-scale stochastic training problems, effective group sparsity is typically hard to achieve: state-of-the-art stochastic optimization algorithms usually generate merely dense solutions. To overcome this shortcoming, we propose a stochastic method, the Half-Space Stochastic Projected Gradient (HSPG) method, which searches for solutions of high group sparsity while maintaining convergence. Initialized by a simple Prox-SG step, HSPG relies on a novel Half-Space step to substantially boost the sparsity level. Numerically, HSPG demonstrates its superiority on deep neural networks, e.g., VGG16, ResNet18, and MobileNetV1, by computing solutions of higher group sparsity with competitive objective values and generalization accuracy.

1. INTRODUCTION

In many recent machine learning tasks, researchers not only focus on finding solutions with small prediction/generalization error but also concentrate on improving model interpretability by filtering out redundant parameters and achieving slimmer model architectures. One technique to achieve this goal is augmenting the raw objective function with a sparsity-inducing regularization term so as to generate sparse solutions (containing numerous zero elements). The popular ℓ₁-regularization promotes sparsity by penalizing the optimization variables element-wise. However, in many practical applications there exist additional constraints on the variables, so that the zero coefficients are often not randomly distributed but tend to be clustered into more sophisticated sparsity structures, e.g., disjoint groups, overlapping groups, and hierarchies (Yuan & Lin, 2006; Huang et al., 2010; 2009). As the most important and natural form of structured sparsity, disjoint group-sparsity regularization assumes that pre-specified disjoint blocks of variables are selected (non-zero variables) or ignored (zero variables) simultaneously (Bach et al., 2012). It plays a central role in general structured-sparsity learning tasks, since other instances such as overlapping-group and hierarchical sparsity are typically solved by converting them into equivalent disjoint-group versions via latent variables (Bach et al., 2012), and it has found numerous applications in computer vision (Elhamifar et al., 2012), signal processing (Chen & Selesnick, 2014), medical imaging (Liu et al., 2018), and deep learning (Scardapane et al., 2017), especially in the model compression of deep neural networks, where group sparsity¹ is leveraged to remove entire redundant hidden structures directly.

Problem Setting.
We study the disjoint group-sparsity regularization problem, which can typically be formulated as a mixed ℓ₁/ℓ_p-regularization problem; we pay special attention to the most popular and widely used instance p = 2 (Bach et al., 2012; Halabi et al., 2018):

    minimize_{x ∈ ℝⁿ}  Ψ(x) := f(x) + λΩ(x) = (1/N) Σ_{i=1}^{N} f_i(x) + λ Σ_{g ∈ G} ‖[x]_g‖,

where λ > 0 is a weighting factor, ‖·‖ denotes the ℓ₂-norm, f(x) is the average of a large number N of continuously differentiable instance functions f_i : ℝⁿ → ℝ (such as the loss functions measuring the deviation from the observations in various data-fitting problems), and Ω(x) is the so-called mixed ℓ₁/ℓ₂-regularization term over the prespecified disjoint groups G.
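To make the objective concrete, the following is a minimal NumPy sketch of the mixed ℓ₁/ℓ₂ regularizer Ω(x) = Σ_{g∈G} ‖[x]_g‖ and the composite objective Ψ. It is an illustration only, not the authors' implementation; the names `group_l1_l2` and `objective` and the toy losses are ours.

```python
import numpy as np

def group_l1_l2(x, groups):
    """Mixed l1/l2 regularizer: sum of the l2-norms of the variable groups."""
    return sum(np.linalg.norm(x[g]) for g in groups)

def objective(x, losses, groups, lam):
    """Psi(x) = (1/N) sum_i f_i(x) + lam * Omega(x)."""
    f = np.mean([f_i(x) for f_i in losses])
    return f + lam * group_l1_l2(x, groups)

# Toy example: n = 4 variables split into two disjoint groups.
x = np.array([3.0, 4.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_l1_l2(x, groups))  # 5.0  (= ||(3, 4)|| + ||(0, 0)||)
```

Note how the second group contributes nothing to Ω(x): the ℓ₂-norm of each group acts like an ℓ₁ penalty on the vector of group norms, which is what drives whole groups to zero simultaneously.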



¹ Group sparsity is defined as the number of zero groups, where a zero group is one whose variables are all exactly zero.
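The group-sparsity metric defined in the footnote can be computed as follows; this is a small illustrative sketch (the name `group_sparsity` and the `tol` threshold are our additions, with `tol=0.0` matching the exact-zero definition above).

```python
import numpy as np

def group_sparsity(x, groups, tol=0.0):
    """Number of zero groups: groups whose variables are all (near) zero."""
    return sum(1 for g in groups if np.all(np.abs(x[g]) <= tol))

x = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_sparsity(x, groups))  # 1 -- only the middle group is all zeros
```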

