A HALF-SPACE STOCHASTIC PROJECTED GRADIENT METHOD FOR GROUP SPARSITY REGULARIZATION

Abstract

Optimizing with group sparsity is significant for enhancing model interpretability in machine learning applications, e.g., feature selection, compressed sensing, and model compression. However, for large-scale stochastic training problems, effective group sparsity exploration is typically hard to achieve: the state-of-the-art stochastic optimization algorithms usually generate merely dense solutions. To overcome this limitation, we propose a stochastic method, the Half-Space Stochastic Projected Gradient (HSPG) method, to search for solutions of high group sparsity while maintaining convergence. Initialized by a simple Prox-SG Step, the HSPG method relies on a novel Half-Space Step to substantially boost the sparsity level. Numerically, HSPG demonstrates its superiority on deep neural networks, e.g., VGG16, ResNet18 and MobileNetV1, by computing solutions of higher group sparsity with competitive objective values and generalization accuracy.

1. INTRODUCTION

In many recent machine learning optimization tasks, researchers not only focus on finding solutions with small prediction/generalization error, but also concentrate on improving the interpretability of models by filtering out redundant parameters and achieving slimmer model architectures. One technique to achieve this goal is to augment sparsity-inducing regularization terms onto the raw objective function so as to generate sparse solutions (containing numerous zero elements). The popular $\ell_1$-regularization promotes sparsity of solutions by penalizing the optimization variables element-wise. However, in many practical applications, there exist additional constraints on the variables such that the zero coefficients are often not randomly distributed but tend to cluster into more sophisticated sparsity structures, e.g., disjoint and overlapping groups and hierarchies (Yuan & Lin, 2006; Huang et al., 2010; 2009). As the most important and natural form of structured sparsity, the disjoint group-sparsity regularization, which assumes that pre-specified disjoint blocks of variables are selected (non-zero variables) or ignored (zero variables) simultaneously (Bach et al., 2012), plays a central role in general structured sparsity learning tasks, since other instances such as overlapping group and hierarchical sparsity are typically solved by conversion into equivalent disjoint group versions via the introduction of latent variables (Bach et al., 2012). It has found numerous applications in computer vision (Elhamifar et al., 2012), signal processing (Chen & Selesnick, 2014), medical imaging (Liu et al., 2018), and deep learning (Scardapane et al., 2017), especially the model compression of deep neural networks, where group sparsity¹ is leveraged to directly remove redundant entire hidden structures. Problem Setting.
We study the disjoint group sparsity regularization problem, which is typically formulated as a mixed $\ell_1/\ell_p$-regularization problem; we pay special attention to the most popular and widely used instance $p = 2$ (Bach et al., 2012; Halabi et al., 2018),

$$\min_{x \in \mathbb{R}^n} \; \Psi(x) \overset{\mathrm{def}}{=} f(x) + \lambda \Omega(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x) + \lambda \sum_{g \in \mathcal{G}} \left\| [x]_g \right\|, \tag{1}$$

where $\lambda > 0$ is a weighting factor, $\|\cdot\|$ denotes the $\ell_2$-norm, $f(x)$ is the average of $N$ continuously differentiable instance functions $f_i : \mathbb{R}^n \to \mathbb{R}$, such as the loss functions measuring the deviation from the observations in various data fitting problems, $\Omega(x)$ is the so-called mixed $\ell_1/\ell_2$-norm, and $\mathcal{G}$ is a prescribed fixed partition of the index set $\mathcal{I} = \{1, 2, \cdots, n\}$, wherein each component $g \in \mathcal{G}$ indexes a group of variables from the perspective of the application. Theoretically, a larger $\lambda$ typically results in higher group sparsity while sacrificing more on the bias of model estimation; hence $\lambda$ needs to be carefully fine-tuned to achieve both low $f$ and highly group-sparse solutions. Literature Review. Problem (1) has been well studied in deterministic optimization, with various algorithms that are capable of returning solutions with both low objective value and high group sparsity under a proper $\lambda$ (Yuan & Lin, 2006; Roth & Fischer, 2008; Huang et al., 2011; Ndiaye et al., 2017). Proximal methods are classical approaches to solving the structured non-smooth problem (1), including the popular proximal gradient method (Prox-FG), which only uses first-order derivative information. When $N$ is huge, stochastic methods, which operate on a small subset of the instances, become ubiquitous, since they avoid the costly evaluation over all instances required by deterministic methods for large-scale problems.
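As a concrete illustration of the objective in (1), the following minimal sketch evaluates $\Psi(x) = f(x) + \lambda \sum_{g \in \mathcal{G}} \|[x]_g\|$ on a toy example; the quadratic stand-in loss and the particular partition into three groups are hypothetical choices made only for this sketch.

```python
import numpy as np

def group_l2_penalty(x, groups):
    """Omega(x): sum over groups g of the l2-norm of [x]_g."""
    return sum(np.linalg.norm(x[g]) for g in groups)

# A disjoint partition of the index set {0, ..., 5} into three groups.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

x = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 4.0])
lam = 0.1

# f(x) here is a hypothetical smooth loss (average of squared entries),
# standing in for the average of the N instance losses f_i.
f = 0.5 * np.mean(x ** 2)
psi = f + lam * group_l2_penalty(x, groups)

# The second group is exactly zero, so it contributes nothing to Omega:
# the group sparsity (number of zero groups) of this x is 1 out of 3.
print(psi)
```

Note that $\Omega$ penalizes each group through the norm of the whole block, which is what encourages entire groups, rather than scattered coordinates, to vanish.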
However, these existing stochastic algorithms typically meet difficulties in achieving both decent convergence and effective group sparsity identification simultaneously (e.g., small function values but merely dense solutions), because of the randomness and their limited sparsity-promotion mechanisms. In depth, Prox-SG, RDA, Prox-SVRG, Prox-Spider and SAGA derive from the proximal gradient method and utilize the proximal operator to produce groups of zero variables. Such an operator is generic to a broad class of non-smooth problems, and is consequently perhaps not sufficiently tailored when the target problem possesses special structure, e.g., the group sparsity structure of problem (1). In fact, in the convex setting, the proximal operator suffers from the variance of the gradient estimate; and in the non-convex setting, especially deep learning, the decaying step size (learning rate) further deteriorates its effectiveness at promoting group sparsity, as will be shown in Section 2: the projection region vanishes rapidly for all of these methods except RDA. RDA is superior to the others at finding the manifold structure (Lee & Wright, 2012), but inferior in objective convergence. Besides, the variance reduction techniques typically require evaluations over a huge mini-batch of data points in both theory and practice, which is probably prohibitive for large-scale problems, and they have been observed to be sometimes ineffective for deep learning applications (Defazio & Bottou, 2019). On the other hand, to introduce sparsity, there exist heuristic weight pruning methods (Li et al., 2016; Luo et al., 2017), but they commonly lack theoretical guarantees, and may thus easily diverge and hurt generalization accuracy. Our Contributions. Our Half-Space Stochastic Projected Gradient (HSPG) method overcomes the limitations of the existing stochastic algorithms on group sparsity identification, while maintaining comparable convergence characteristics.
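The proximal operator discussed above has a closed form for the mixed $\ell_1/\ell_2$ regularizer: block soft-thresholding, which zeroes a group only when its norm falls below $\alpha\lambda$ (the projection region that shrinks with the step size $\alpha$). A minimal sketch:

```python
import numpy as np

def prox_group_l2(x, groups, lam, alpha):
    """Block soft-thresholding: prox of alpha * lam * Omega at x.

    For each group g, [x]_g is set exactly to zero when its l2-norm
    is at most alpha * lam; otherwise it is shrunk toward zero.
    """
    out = x.copy()
    for g in groups:
        norm_g = np.linalg.norm(x[g])
        if norm_g <= alpha * lam:
            out[g] = 0.0                              # whole group zeroed
        else:
            out[g] = (1.0 - alpha * lam / norm_g) * x[g]
    return out

groups = [np.array([0, 1]), np.array([2, 3])]
x = np.array([0.03, 0.04, 3.0, 4.0])

# With alpha = 1 and lam = 0.1, the first group (norm 0.05) is mapped
# to zero, while the second group (norm 5.0) is only shrunk slightly.
print(prox_group_l2(x, groups, lam=0.1, alpha=1.0))
```

This makes the shortcoming concrete: as the learning rate $\alpha$ decays during training, the threshold $\alpha\lambda$ shrinks as well, so fewer and fewer groups can be projected to exact zero.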
While the mainstream works on (group) sparsity have focused on using proximal operators of the regularization, our method is unique in enforcing group sparsity more effectively by leveraging a half-space structure, and it is well supported by theoretical analysis and empirical evaluations. We summarize our contributions as follows.

• Algorithmic Design: We propose HSPG to solve the disjoint group sparsity regularized problem (1). Initialized with a Prox-SG Step that seeks a close-enough but perhaps dense solution estimate, the algorithmic framework relies on a novel Half-Space Step to exploit group sparse patterns. We carefully design the Half-Space Step with the following main features: (i) it utilizes the previous iterate as the normal direction to construct a reduced space consisting of a set of half-spaces and the origin; (ii) a new group projection operator maps groups of variables onto zero if they fall outside the constructed reduced space, identifying group sparsity considerably more effectively than the proximal operator; and (iii) with a proper step size, the Half-Space Step enjoys a sufficient decrease property and achieves progress toward the optimum in both theory and practice.

• Theoretical Guarantee: We provide convergence guarantees for HSPG. Moreover, we prove that HSPG has looser requirements for identifying the sparsity pattern than Prox-SG, revealing its superiority
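The group projection of the Half-Space Step described in feature (ii) can be sketched as follows. This is a schematic illustration only: the acceptance rule (keep a group when the trial point remains in the half-space whose normal direction is the previous iterate, as in feature (i)) and the parameter `eps` are assumptions made for this sketch, not the paper's exact formulation.

```python
import numpy as np

def half_space_project(x_trial, x_prev, groups, eps=0.0):
    """Project groups of x_trial onto zero when they leave the
    half-space defined by the previous iterate x_prev.

    A group g is kept only if the inner product between [x_trial]_g
    and the normal direction [x_prev]_g is large enough; otherwise
    the whole group is mapped to zero.
    """
    out = x_trial.copy()
    for g in groups:
        if np.dot(x_trial[g], x_prev[g]) < eps * np.dot(x_prev[g], x_prev[g]):
            out[g] = 0.0
    return out

groups = [np.array([0, 1]), np.array([2, 3])]
x_prev = np.array([1.0, 1.0, 2.0, 2.0])

# Hypothetical trial point after a stochastic gradient step: the first
# group flipped sign, i.e., it left its half-space, so it is projected
# onto zero; the second group stays and is left untouched.
x_trial = np.array([-0.2, -0.1, 1.9, 2.1])
print(half_space_project(x_trial, x_prev, groups))
```

Unlike the proximal operator, this projection does not depend on a shrinking threshold $\alpha\lambda$: a group is zeroed based on its direction relative to the previous iterate, which is why it can keep identifying group sparsity even as the step size decays.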



¹ Group sparsity is defined as the number of zero groups, where a zero group means all of its variables are exactly zero.



The proximal stochastic gradient method (Prox-SG) (Duchi & Singer, 2009) is the natural stochastic extension of Prox-FG. The regularized dual-averaging method (RDA) (Xiao, 2010; Yang et al., 2010) is proposed by extending the dual averaging scheme in (Nesterov, 2009). To improve the convergence rate, there exists a set of incremental gradient methods, inspired by SAG (Roux et al., 2012), which utilize the average of accumulated past gradients. For example, the proximal stochastic variance-reduced gradient method (Prox-SVRG) (Xiao & Zhang, 2014) and proximal Spider (Prox-Spider) (Zhang & Xiao, 2019) adopt multi-stage schemes based on the well-known variance reduction techniques SVRG proposed in (Johnson & Zhang, 2013) and Spider developed in (Fang et al., 2018), respectively. SAGA (Defazio et al., 2014) stands as the midpoint between SAG and Prox-SVRG.

