A HALF-SPACE STOCHASTIC PROJECTED GRADIENT METHOD FOR GROUP SPARSITY REGULARIZATION

Abstract

Optimizing with group sparsity is significant for enhancing model interpretability in machine learning applications, e.g., feature selection, compressed sensing and model compression. However, for large-scale stochastic training problems, effective group sparsity exploration is typically hard to achieve. In particular, state-of-the-art stochastic optimization algorithms usually generate merely dense solutions. To overcome this shortcoming, we propose a stochastic method, the Half-Space Stochastic Projected Gradient (HSPG) method, to search for solutions of high group sparsity while maintaining convergence. Initialized by a simple Prox-SG Step, the HSPG method relies on a novel Half-Space Step to substantially boost the sparsity level. Numerically, HSPG demonstrates its superiority on deep neural networks, e.g., VGG16, ResNet18 and MobileNetV1, by computing solutions of higher group sparsity with competitive objective values and generalization accuracy.

1. INTRODUCTION

In many recent machine learning tasks, researchers focus not only on finding solutions with small prediction/generalization error but also on improving model interpretability by filtering out redundant parameters and achieving slimmer model architectures. One technique to achieve this goal is to augment the raw objective function with sparsity-inducing regularization terms so as to generate sparse solutions (containing numerous zero elements). The popular $\ell_1$-regularization promotes sparsity by penalizing the optimization variables element-wise. However, in many practical applications there exist additional constraints on the variables, so that the zero coefficients are often not randomly distributed but tend to be clustered into more sophisticated sparsity structures, e.g., disjoint and overlapping groups and hierarchies (Yuan & Lin, 2006; Huang et al., 2010; 2009). As the most important and natural form of structured sparsity, disjoint group-sparsity regularization assumes that pre-specified disjoint blocks of variables are selected (non-zero variables) or ignored (zero variables) simultaneously (Bach et al., 2012). It plays a central role in general structured sparsity learning, since other instances such as overlapping-group and hierarchical sparsity are typically solved by converting them into equivalent disjoint-group versions via latent variables (Bach et al., 2012), and it has found numerous applications in computer vision (Elhamifar et al., 2012), signal processing (Chen & Selesnick, 2014), medical imaging (Liu et al., 2018), and deep learning (Scardapane et al., 2017), especially in the model compression of deep neural networks, where group sparsity is leveraged to directly remove entire redundant hidden structures.

Problem Setting.
We study the disjoint group sparsity regularization problem, which is typically formulated as a mixed $\ell_1/\ell_p$-regularization problem; we pay special attention to the most popular and widely used instance $p = 2$ (Bach et al., 2012; Halabi et al., 2018):

$$\min_{x\in\mathbb{R}^n} \Psi(x) := f(x) + \lambda\Omega(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x) + \lambda\sum_{g\in\mathcal{G}} \|[x]_g\|, \qquad (1)$$

where $\lambda > 0$ is a weighting factor, $\|\cdot\|$ denotes the $\ell_2$-norm, $f(x)$ is the average of $N$ continuously differentiable instance functions $f_i : \mathbb{R}^n \to \mathbb{R}$, such as loss functions measuring the deviation from observations in data-fitting problems, $\Omega(x)$ is the so-called mixed $\ell_1/\ell_2$-norm, and $\mathcal{G}$ is a prescribed fixed partition of the index set $\mathcal{I} = \{1, 2, \dots, n\}$, wherein each component $g \in \mathcal{G}$ indexes a group of variables from the perspective of the application. Theoretically, a larger $\lambda$ typically results in higher group sparsity but sacrifices more on the bias of the model estimation; hence $\lambda$ needs to be carefully tuned to achieve both low $f$ and highly group-sparse solutions.

Literature Review. Problem (1) has been well studied in deterministic optimization, with various algorithms capable of returning solutions with both low objective value and high group sparsity under proper $\lambda$ (Yuan & Lin, 2006; Roth & Fischer, 2008; Huang et al., 2011; Ndiaye et al., 2017). Proximal methods are classical approaches to such structured non-smooth optimization, including the popular proximal gradient method (Prox-FG), which uses only first-order derivative information. When $N$ is huge, stochastic methods that operate on a small subset of instances become ubiquitous, avoiding the costly full-gradient evaluation required by deterministic methods on large-scale problems. The proximal stochastic gradient method (Prox-SG) (Duchi & Singer, 2009) is the natural stochastic extension of Prox-FG. The regularized dual-averaging method (RDA) (Xiao, 2010; Yang et al., 2010) extends the dual averaging scheme of (Nesterov, 2009).
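The mixed $\ell_1/\ell_2$ regularizer $\Omega(x)$ above sums the $\ell_2$-norms of the disjoint groups. A minimal sketch of evaluating it (function name and group representation are our own illustration, not from the paper):

```python
import numpy as np

def mixed_l1_l2_norm(x, groups):
    """Omega(x) = sum over disjoint groups g of ||[x]_g||_2."""
    return sum(np.linalg.norm(x[g]) for g in groups)

# x in R^6 split into three disjoint groups of two variables each
x = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(mixed_l1_l2_norm(x, groups))  # 5.0 + 0.0 + 1.0 = 6.0
```

Note that the middle group contributes zero: the regularizer penalizes whole groups, which is exactly what drives entire blocks of variables to zero simultaneously.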
To improve the convergence rate, a family of incremental gradient methods inspired by SAG (Roux et al., 2012) utilizes averages of accumulated past gradients. For example, the proximal stochastic variance-reduced gradient method (Prox-SVRG) (Xiao & Zhang, 2014) and proximal Spider (Prox-Spider) (Zhang & Xiao, 2019) adopt multi-stage schemes based on the well-known variance reduction techniques SVRG (Johnson & Zhang, 2013) and Spider (Fang et al., 2018), respectively. SAGA (Defazio et al., 2014) stands as a midpoint between SAG and Prox-SVRG. Compared to the deterministic setting, studies of the mixed $\ell_1/\ell_2$-regularization problem (1) in the stochastic setting are somewhat rare and limited. Prox-SG, RDA, Prox-SVRG, Prox-Spider and SAGA are valuable state-of-the-art stochastic algorithms for solving problem (1), but they share an apparent weakness: they typically have difficulty achieving both decent convergence and effective group sparsity identification simultaneously (e.g., small function values but merely dense solutions), because of randomness and their limited sparsity-promotion mechanisms. In depth, Prox-SG, RDA, Prox-SVRG, Prox-Spider and SAGA derive from the proximal gradient method and rely on the proximal operator to produce groups of zero variables. That operator is generic to a wide class of non-smooth problems, and is consequently perhaps not sufficiently tailored when the target problem possesses special structure, e.g., the group sparsity structure of problem (1). In fact, in the convex setting the proximal operator suffers from the variance of the gradient estimate, and in the non-convex setting, especially deep learning, the small step size (learning rate) further deteriorates its effectiveness at promoting group sparsity; as will be shown in Section 2, the projection region vanishes rapidly for all of these methods except RDA.
RDA is superior to the others in finding manifold structure (Lee & Wright, 2012), but inferior in objective convergence. Besides, variance reduction techniques typically require measurements over huge mini-batches in both theory and practice, which can be prohibitive for large-scale problems, and they have sometimes been observed to be ineffective in deep learning applications (Defazio & Bottou, 2019). On the other hand, to introduce sparsity, there exist heuristic weight-pruning methods (Li et al., 2016; Luo et al., 2017); however, they commonly come without theoretical guarantees, so they can easily diverge and hurt generalization accuracy.

Our Contributions. The Half-Space Stochastic Projected Gradient (HSPG) method overcomes the limitations of existing stochastic algorithms on group sparsity identification while maintaining comparable convergence characteristics. While mainstream work on (group) sparsity has focused on proximal operators of the regularizer, our method is unique in enforcing group sparsity more effectively by leveraging a half-space structure, and it is well supported by theoretical analysis and empirical evaluation. We summarize our contributions as follows.

• Algorithmic Design: We propose HSPG to solve the disjoint group sparsity regularized problem (1). Initialized with a Prox-SG Step for seeking a close-enough but perhaps dense solution estimate, the algorithmic framework relies on a novel Half-Space Step to exploit group sparse patterns.
We carefully design the Half-Space Step with the following main features: (i) it utilizes the previous iterate as the normal direction to construct a reduced space consisting of a set of half-spaces and the origin; (ii) a new group projection operator maps groups of variables onto zero if they fall outside the constructed reduced space, identifying group sparsity considerably more effectively than the proximal operator; and (iii) with a proper step size, the Half-Space Step enjoys a sufficient decrease property and makes progress toward the optimum in both theory and practice.

• Theoretical Guarantee: We provide convergence guarantees for HSPG. Moreover, we prove that HSPG has a looser requirement for identifying the sparsity pattern than Prox-SG, revealing its superiority in group sparsity exploration. In particular, for sparsity pattern identification, the distance to the optimal solution $x^*$ required by HSPG is larger than the distance required by Prox-SG.

• Numerical Experiments: Experimentally, HSPG outperforms state-of-the-art methods in group sparsity exploration and achieves competitive objective convergence and runtime on both convex and non-convex problems. On popular deep learning tasks, HSPG usually computes solutions with multiple-times-higher group sparsity and similar generalization performance on unseen test data compared to those generated by the competitors, which may be further used to construct smaller and more efficient network architectures.

Algorithm 1: Outline of HSPG for solving (1).

2. THE HSPG METHOD

1: Input: $x_0 \in \mathbb{R}^n$, $\alpha_0 \in (0, 1)$, $\epsilon \in [0, 1)$, and $N_P \in \mathbb{Z}^+$.
2: for $k = 0, 1, 2, \dots$ do
3:   if $k < N_P$ then
4:     Compute $x_{k+1} \leftarrow$ Prox-SG$(x_k, \alpha_k)$ by Algorithm 2.
5:   else
6:     Compute $x_{k+1} \leftarrow$ Half-Space$(x_k, \alpha_k, \epsilon)$ by Algorithm 3.
7:   Update $\alpha_{k+1}$.

Algorithm 2: Prox-SG Step.
1: Input: Current iterate $x_k$, and step size $\alpha_k$.
2: Compute the stochastic gradient of $f$ on mini-batch $B_k$:
   $\nabla f_{B_k}(x_k) \leftarrow \frac{1}{|B_k|}\sum_{i\in B_k}\nabla f_i(x_k)$.   (2)
3: Return $x_{k+1} \leftarrow \text{Prox}_{\alpha_k\lambda\Omega(\cdot)}\big(x_k - \alpha_k\nabla f_{B_k}(x_k)\big)$.

Initialization Stage. The Initialization Stage performs the vanilla proximal stochastic gradient method (Prox-SG, Algorithm 2) to approach a solution of (1). At the $k$th iteration, a mini-batch $B_k$ is sampled to generate an unbiased estimator of the full gradient of $f$ (line 2, Algorithm 2), yielding the trial iterate
$$\hat{x}_{k+1} := x_k - \alpha_k\nabla f_{B_k}(x_k),$$
where $\alpha_k$ is the step size and $f_{B_k}$ is the average of the instance functions $f_i$ across $B_k$. The next iterate $x_{k+1}$ is then obtained from the proximal mapping
$$x_{k+1} = \text{Prox}_{\alpha_k\lambda\Omega(\cdot)}(\hat{x}_{k+1}) = \arg\min_{x\in\mathbb{R}^n}\ \frac{1}{2\alpha_k}\|x - \hat{x}_{k+1}\|^2 + \lambda\Omega(x), \qquad (3)$$
where the regularization term $\Omega(x)$ is defined in (1). Notice that subproblem (3) has a closed-form solution: for each $g \in \mathcal{G}$,
$$[x_{k+1}]_g = \max\big\{0,\ 1 - \alpha_k\lambda/\|[\hat{x}_{k+1}]_g\|\big\}\cdot[\hat{x}_{k+1}]_g. \qquad (4)$$
In HSPG, the Initialization Stage runs the Prox-SG Step $N_P$ times as a localization mechanism to seek an estimate close enough to a solution of problem (1), where $N_P := \min\{k : k \in \mathbb{Z}^+, \|x_k - x^*\| \le R/2\}$ with a positive constant $R$ related to the optimum; see (23) in Appendix C. In practice, although the close-enough requirement is perhaps hard to verify, we empirically suggest running the Prox-SG Step until observing some stage-switch signal from a test on the stationarity of objective values, the norm of the (sub)gradient, or validation accuracy, similarly to (Zhang et al., 2020).
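The closed-form group-wise shrinkage (4) can be sketched as follows (a minimal illustration with our own function and variable names; `groups` is assumed to be a list of disjoint index arrays):

```python
import numpy as np

def prox_group_l2(x_trial, groups, alpha, lam):
    """Closed-form proximal mapping of alpha*lam*Omega, applied group-wise:
    each group is shrunk toward zero, and set exactly to zero when its
    l2-norm is at most alpha*lam (cf. Eq. (4))."""
    x = x_trial.copy()
    for g in groups:
        norm_g = np.linalg.norm(x[g])
        # shrink factor max{0, 1 - alpha*lam/||[x]_g||}; zero group if norm_g = 0
        scale = max(0.0, 1.0 - alpha * lam / norm_g) if norm_g > 0 else 0.0
        x[g] = scale * x[g]
    return x

x_trial = np.array([3.0, 4.0, 0.05, 0.05])
groups = [np.arange(0, 2), np.arange(2, 4)]
x_next = prox_group_l2(x_trial, groups, alpha=0.1, lam=1.0)
# first group (norm 5) is merely shrunk; second group (norm ~0.07 <= 0.1) is zeroed
```

This makes concrete why the projection region is an $\ell_2$-ball of radius $\alpha_k\lambda$: only groups whose trial norm falls below that radius are mapped exactly to zero.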
However, the Initialization Stage alone is insufficient to exploit the group sparsity structure, i.e., the computed solution estimate is typically dense, due to randomness and the moderate truncation mechanism of the proximal operator, which is constrained to its projection region: by (4), the trial iterate $[\hat{x}_{k+1}]_g$ is projected to zero only if it falls into an $\ell_2$-ball centered at the origin with radius $\alpha_k\lambda$. Our remedy is to follow it with the Half-Space Step below, which exhibits an effective sparsity promotion mechanism while still retaining convergence.

[Figure 1: (a) Half-Space Projection; (b) Projection Region.]

Algorithm 3: Half-Space Step.
1: Input: Current iterate $x_k$, step size $\alpha_k$, and $\epsilon$.
2: Compute the stochastic gradient of $\Psi$ on $\mathcal{I}^{\neq 0}(x_k)$ over mini-batch $B_k$:
   $[\nabla\Psi_{B_k}(x_k)]_{\mathcal{I}^{\neq 0}(x_k)} \leftarrow \frac{1}{|B_k|}\sum_{i\in B_k}[\nabla\Psi_i(x_k)]_{\mathcal{I}^{\neq 0}(x_k)}$.   (5)
3: Compute $[\hat{x}_{k+1}]_{\mathcal{I}^{\neq 0}(x_k)} \leftarrow [x_k - \alpha_k\nabla\Psi_{B_k}(x_k)]_{\mathcal{I}^{\neq 0}(x_k)}$ and $[\hat{x}_{k+1}]_{\mathcal{I}^{0}(x_k)} \leftarrow 0$.
4: for each group $g$ in $\mathcal{I}^{\neq 0}(x_k)$ do
5:   if $[\hat{x}_{k+1}]_g^\top[x_k]_g < \epsilon\|[x_k]_g\|^2$ then
6:     $[\hat{x}_{k+1}]_g \leftarrow 0$.
7: Return $x_{k+1} \leftarrow \hat{x}_{k+1}$.

Group-Sparsity Stage. The Group-Sparsity Stage is designed to effectively determine the groups of zero variables while preserving the convergence characteristics, in sharp contrast to heuristic, aggressive weight-pruning methods that typically lack theoretical guarantees (Li et al., 2016; Luo et al., 2017). The underlying intuition of its atomic Half-Space Step (Algorithm 3) is to project $[x_k]_g$ to zero only if $-[x_k]_g$ serves as a descent step for $\Psi$ at $x_k$, i.e., $-[x_k]_g^\top[\nabla\Psi(x_k)]_g < 0$, so that the update $[x_{k+1}]_g \leftarrow [x_k]_g - [x_k]_g = 0$ still makes some progress toward optimality.
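A minimal sketch of one Half-Space Step (Algorithm 3); the function name and the dense-vector group representation are our own illustration:

```python
import numpy as np

def half_space_step(x, stoch_grad_Psi, groups, alpha, eps):
    """One Half-Space Step: SGD step on the nonzero groups, then project
    group g to zero when the trial point leaves the half-space
    {z : z^T [x]_g >= eps * ||[x]_g||^2} (lines 3-7 of Algorithm 3)."""
    x_next = np.zeros_like(x)
    for g in groups:
        if np.linalg.norm(x[g]) == 0:            # g in I_0(x): frozen at zero
            continue
        trial = x[g] - alpha * stoch_grad_Psi[g]  # SGD trial point on free groups
        if trial @ x[g] >= eps * (x[g] @ x[g]):   # half-space test: keep group
            x_next[g] = trial
    return x_next

x = np.array([1.0, 0.0, 0.2, 0.0])
grad = np.array([0.1, 0.0, 3.0, 0.0])
groups = [np.arange(0, 2), np.arange(2, 4)]
x_next = half_space_step(x, grad, groups, alpha=0.1, eps=0.0)
# first group survives; second group's trial point (-0.1, 0) opposes [x]_g and is zeroed
```

Note that the second group is zeroed even though its trial point has norm 0.1, far outside any realistic $\alpha_k\lambda$ ball: the decision depends on the sign of the inner product with $[x_k]_g$, not on the trial point's magnitude.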
Before introducing that, we first define the following index sets for any $x \in \mathbb{R}^n$:
$$\mathcal{I}^{0}(x) := \{g \in \mathcal{G} : [x]_g = 0\} \quad\text{and}\quad \mathcal{I}^{\neq 0}(x) := \{g \in \mathcal{G} : [x]_g \neq 0\}, \qquad (6)$$
where $\mathcal{I}^{0}(x)$ indexes the groups of zero variables at $x$ and $\mathcal{I}^{\neq 0}(x)$ indexes the groups of nonzero variables at $x$. To proceed, we further define an artificial set containing $x$:
$$\mathcal{S}(x) := \big\{z \in \mathbb{R}^n : [z]_g = 0 \text{ if } g \in \mathcal{I}^{0}(x), \text{ and } [z]_g^\top[x]_g \ge \epsilon\|[x]_g\|^2 \text{ if } g \in \mathcal{I}^{\neq 0}(x)\big\} \cup \{0\}, \qquad (7)$$
which consists of half-spaces and the origin. Here the parameter $\epsilon \in [0, 1)$ controls the grey region shown in Figure 1b, and the exact way to set $\epsilon$ is discussed in Section 4 and the Appendix. Hence $x$ inhabits $\mathcal{S}(x_k)$, i.e., $x \in \mathcal{S}(x_k)$, only if: (i) $[x]_g$ lies in the upper half-space for all $g \in \mathcal{I}^{\neq 0}(x_k)$ for some prescribed $\epsilon \in [0, 1)$, as shown in Figure 1a; and (ii) $[x]_g$ equals zero for all $g \in \mathcal{I}^{0}(x_k)$. The fundamental assumption for the Half-Space Step to succeed is that the Initialization Stage has produced a (possibly non-sparse) solution estimate $x_k$ near a group-sparse solution $x^*$ of problem (1), i.e., the distance $\|x_k - x^*\|$ is sufficiently small. As shown in the Appendix, this further implies that the group-sparse optimal solution $x^*$ inhabits $\mathcal{S}_k := \mathcal{S}(x_k)$, so that $\mathcal{S}_k$ already covers the group-support of $x^*$, i.e., $\mathcal{I}^{\neq 0}(x^*) \subseteq \mathcal{I}^{\neq 0}(x_k)$. Our goal now becomes minimizing $\Psi(x)$ over $\mathcal{S}_k$ to identify the remaining groups of zero variables, i.e., $\mathcal{I}^{0}(x^*) \setminus \mathcal{I}^{0}(x_k)$, formulated as the following smooth optimization problem:
$$x_{k+1} = \arg\min_{x\in\mathcal{S}_k}\ \Psi(x) = f(x) + \lambda\Omega(x). \qquad (8)$$
By the definition of $\mathcal{S}_k$, the entries $[x]_{\mathcal{I}^{0}(x_k)} \equiv 0$ are held fixed while Algorithm 3 proceeds, and only the entries in $\mathcal{I}^{\neq 0}(x_k)$ are allowed to move. Hence $\Psi(x)$ is smooth on $\mathcal{S}_k$, and (8) is a reduced-space optimization problem. A standard way to solve problem (8) would be stochastic gradient descent equipped with Euclidean projection (Nocedal & Wright, 2006).
However, such a projected method rarely produces zero (group) variables, as illustrated by the dense point $\hat{x}_E$ in Figure 1a. To address this, we introduce a novel projection operator that effectively conducts group projection as follows. As stated in Algorithm 3, we first approximate the gradient of $\Psi$ on the free variables in $\mathcal{I}^{\neq 0}(x_k)$ by $[\nabla\Psi_{B_k}(x_k)]_{\mathcal{I}^{\neq 0}(x_k)}$ (line 2, Algorithm 3), then employ SGD to compute a trial point $\hat{x}_{k+1}$ (line 3, Algorithm 3), which is passed into a new projection operator $\text{Proj}_{\mathcal{S}_k}(\cdot)$ defined as
$$[\text{Proj}_{\mathcal{S}_k}(z)]_g := \begin{cases} [z]_g & \text{if } [z]_g^\top[x_k]_g \ge \epsilon\|[x_k]_g\|^2, \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$
The projector (9) is not the standard Euclidean projection operator in most cases, but it offers the following two advantages: (i) the actual search direction $d_k := (\text{Proj}_{\mathcal{S}_k}(\hat{x}_{k+1}) - x_k)/\alpha_k$ is a descent direction for $\Psi_{B_k}(x_k) := f_{B_k}(x_k) + \lambda\Omega(x_k)$, i.e., $d_k^\top\nabla\Psi_{B_k}(x_k) < 0$ since $\theta < 90°$ in Figure 1a, so progress toward the optimum is made via the sufficient decrease property stated in Lemma 1; and (ii) it effectively projects groups of variables to zero simultaneously whenever the corresponding inner product is sufficiently small. In contrast, the Euclidean projection operator is far less effective at promoting group sparsity, as shown by the Euclidean projected point $\hat{x}_E \neq 0$ versus $x_{k+1} = \text{Proj}_{\mathcal{S}_k}(\hat{x}_{k+1}) = 0$ in Figure 1a.

Lemma 1. Algorithm 3 yields the next iterate $x_{k+1} = \text{Proj}_{\mathcal{S}_k}(x_k - \alpha_k\nabla\Psi_{B_k}(x_k))$, and the search direction $d_k := (x_{k+1} - x_k)/\alpha_k$ is a descent direction for $\Psi_{B_k}(x_k)$, i.e., $d_k^\top\nabla\Psi_{B_k}(x_k) < 0$. Moreover, letting $L$ be the Lipschitz constant of $\nabla\Psi_{B_k}$ on the feasible domain, and $\hat{\mathcal{G}}_k := \mathcal{I}^{\neq 0}(x_k) \cap \mathcal{I}^{0}(x_{k+1})$ and $\tilde{\mathcal{G}}_k := \mathcal{I}^{\neq 0}(x_k) \cap \mathcal{I}^{\neq 0}(x_{k+1})$ be the sets of groups that are or are not projected onto zero, we have
$$\Psi_{B_k}(x_{k+1}) \le \Psi_{B_k}(x_k) - \Big(\alpha_k - \frac{\alpha_k^2 L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\big\|[\nabla\Psi_{B_k}(x_k)]_g\big\|^2 - \Big(\frac{1-\epsilon}{\alpha_k} - \frac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\big\|[x_k]_g\big\|^2. \qquad (10)$$
We now intuitively illustrate the strength of HSPG in group sparsity exploration. In fact, the half-space projection (9) is a more effective sparsity promotion mechanism than those of the existing methods. In particular, it benefits from a much larger projection region for mapping a reference point $\hat{x}_{k+1} := x_k - \alpha_k\nabla f_{B_k}(x_k)$ (or its variants) to zero. In the 2D case of Figure 1b, the projection regions of Prox-SG, Prox-SVRG, Prox-Spider and SAGA are $\ell_2$-balls of radius $\alpha_k\lambda$. In stochastic learning, especially deep learning tasks, the step size $\alpha_k$ is usually selected around $10^{-3}$ to $10^{-4}$ or even smaller for convergence. Together with the common setting $\lambda \ll 1$, their projection regions vanish rapidly, making it difficult to produce group sparsity. In sharp contrast, even when $\alpha_k\lambda$ is near zero, the projection region of HSPG, $\{\hat{x} : \hat{x}^\top x_k < (\alpha_k\lambda + \epsilon\|x_k\|)\|x_k\|\}$ (see Appendix), is still an open half-space that contains those $\ell_2$-balls, as well as RDA's if $\epsilon$ is large enough. Moreover, the nonnegative control parameter $\epsilon$ adjusts the aggressiveness of the group sparsity promotion (9), i.e., larger $\epsilon$ is more aggressive, while maintaining progress toward optimality by Lemma 1. In practice, proper fine-tuning of $\epsilon$ is sometimes required to achieve both group sparsity enhancement and sufficient decrease of the objective value, as will be seen in Section 4.
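The contrast between the two projection regions can be checked numerically; this small sketch (variable names are ours) uses representative deep-learning magnitudes for $\alpha_k$ and $\lambda$:

```python
import numpy as np

# For a 2-D group: the prox operator zeroes the trial point only when
# ||x_trial|| <= alpha*lam, while the half-space test zeroes it whenever
# x_trial^T x_k < eps * ||x_k||^2 (here eps = 0, i.e., the least aggressive case).
alpha, lam, eps = 1e-3, 1e-3, 0.0
x_k = np.array([0.5, 0.5])
x_trial = np.array([0.1, -0.11])             # norm ~0.149 >> alpha*lam = 1e-6

prox_zeroes = np.linalg.norm(x_trial) <= alpha * lam
half_space_zeroes = x_trial @ x_k < eps * (x_k @ x_k)
print(prox_zeroes, half_space_zeroes)  # False True
```

The trial point is orders of magnitude outside the vanishing $\ell_2$-ball of radius $\alpha_k\lambda = 10^{-6}$, yet it still lands in HSPG's half-space projection region because its inner product with $x_k$ is negative.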

Intuition of Two-Stage Method:

To end this section, we discuss the advantage of such a two-stage scheme over adaptively switching back and forth between the Prox-SG Step and the Half-Space Step based on some switching criterion, as in many multi-step deterministic optimization algorithms (Chen et al., 2017). In fact, we numerically observed that switching back to the Prox-SG Step consistently deteriorates the progress in group sparsity exploration made by the Half-Space Step, without an obvious gain in convergence. Such regression in group sparsity under the Prox-SG Step is unattractive in realistic applications, e.g., model compression, where one usually starts from a heavy model of high generalization accuracy and wants to filter out the redundancy effectively. Therefore, for ease of application, we organize the Prox-SG Step and the Half-Space Step as a two-stage scheme controlled by a switching hyperparameter $N_P$. In theory, we require $N_P$ to be sufficiently large so that the initial iterate of the Half-Space Step is close enough to the local minimizer, as shown in Section 3. In practice, HSPG is sensitive to the choice of $N_P$ at early iterations, i.e., switching to the Half-Space Step too early may cause accuracy loss; but this sensitivity vanishes rapidly once the switch happens after some acceptable switching criterion is met.
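The two-stage scheme of Algorithm 1 can be sketched as a simple driver loop. This is an illustrative skeleton under our own naming conventions: the single-iteration updates `prox_sg_step` and `half_space_step` are passed in as callables, and the (stochastic) gradient oracles are simplified to plain functions of $x$:

```python
import numpy as np

def hspg(x0, grad_f, grad_Psi, groups, alpha, eps, N_P, num_iters,
         prox_sg_step, half_space_step):
    """Two-stage HSPG outline (Algorithm 1): N_P Prox-SG steps for
    localization, then Half-Space steps for group-sparsity exploitation.
    A fixed step size `alpha` is used here for brevity; Algorithm 1
    updates alpha_k each iteration."""
    x = x0.copy()
    for k in range(num_iters):
        if k < N_P:
            # Initialization Stage: proximal stochastic gradient update
            x = prox_sg_step(x, grad_f(x), groups, alpha)
        else:
            # Group-Sparsity Stage: half-space projected update
            x = half_space_step(x, grad_Psi(x), groups, alpha, eps)
    return x
```

Note the one-way handoff: once $k \ge N_P$, the driver never returns to the Prox-SG Step, matching the paper's observation that switching back erodes the group sparsity already found.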

3. CONVERGENCE ANALYSIS

In this section, we give the convergence guarantee of HSPG. Toward that end, we make the following assumptions, widely used in the optimization literature (Xiao & Zhang, 2014; Yang et al., 2019) and in active-set identification analyses of regularized problems (Nutini et al., 2019; Chen et al., 2018).

Assumption 1. Each $f_i : \mathbb{R}^n \to \mathbb{R}$, for $i = 1, 2, \dots, N$, is differentiable and bounded below. The gradients $\nabla f_i(x)$ are Lipschitz continuous, and $L$ denotes their shared Lipschitz constant.

Assumption 2. The least and largest $\ell_2$-norms of the non-zero groups of $x^*$ are bounded below and above by positive constants, i.e., $0 < 2\delta_1 := \min_{g\in\mathcal{I}^{\neq 0}(x^*)}\|[x^*]_g\|$ and $0 < 2\delta_2 := \max_{g\in\mathcal{I}^{\neq 0}(x^*)}\|[x^*]_g\|$. Moreover, we require a common strict complementarity on any mini-batch $B$, i.e., $0 < 2\delta_3 := \min_{g\in\mathcal{I}^{0}(x^*)}\big(\lambda - \|[\nabla f_B(x^*)]_g\|\big)$, for the regularized problem.

Notations: Let $x^*$ be a local minimizer of problem (1) with the group sparsity property, $\Psi^*$ be the local minimum value corresponding to $x^*$, and $\{x_k\}_{k=0}^{\infty}$ be the iterates generated by Algorithm 1. Denote the gradient mapping of $\Psi(x)$ and its mini-batch estimator on $B$ by
$$\xi_\eta(x) := \frac{1}{\eta}\big(x - \text{Prox}_{\eta\lambda\Omega(\cdot)}(x - \eta\nabla f(x))\big) \quad\text{and}\quad \xi_{\eta,B}(x) := \frac{1}{\eta}\big(x - \text{Prox}_{\eta\lambda\Omega(\cdot)}(x - \eta\nabla f_B(x))\big),$$
respectively. We say $\bar{x}$ is a stationary point of $\Psi(x)$ if $\xi_\eta(\bar{x}) = 0$. For simplicity, let $\mathcal{X}$ be a neighborhood of $x^*$ defined as $\mathcal{X} := \{x : \|x - x^*\| \le R\}$, with $R$ a positive constant related to $\delta_1$, $\delta_2$ and $\epsilon$ (see (23) in Appendix C), and let $M$ be the supremum of $\|\partial\Psi(x)\|$ on the compact set $\mathcal{X}$.

Remark: Assumption 1 implies that $\nabla f_B(x)$ measured on any mini-batch $B$ is Lipschitz continuous on $\mathbb{R}^n$ with the same constant $L$, while $\nabla\Psi_B(x)$ is not, as shown in the Appendix. However, the Lipschitz continuity of $\nabla\Psi_B(x)$ still holds on the set (with a slight abuse of notation) $\mathcal{X} = \{x : \|[x]_g\| \ge \delta_1 \text{ for each } g \in \mathcal{G}\}$, obtained by excluding from $\mathbb{R}^n$ an $\ell_2$-ball of radius $\delta_1$ centered at the origin.
For simplicity, let $\nabla\Psi_B(x)$ share the same Lipschitz constant $L$ as $\nabla f_B(x)$ on $\mathcal{X}$, since we can always take the larger of the two as the shared constant. We now state the first main theorem for HSPG.

Theorem 1. Suppose $f$ is convex on $\mathcal{X}$, $\epsilon \in \big(0, \min\{\delta_1^2/\delta_2,\ (2\delta_1 - R)/(2\delta_2 + R)\}\big)$, and $\|x_K - x^*\| \le R/2$ for some $K \ge N_P$. Set $k := K + t$ ($t \in \mathbb{Z}^+$). Then for any $\tau \in (0, 1)$, there exist step sizes
$$\alpha_k = O\Big(\tfrac{1}{\sqrt{Nt}}\Big) \in \Big(0,\ \min\Big\{\tfrac{2(1-\epsilon)}{L},\ \tfrac{1}{L},\ \tfrac{2\delta_1 - R - \epsilon(2\delta_2 + R)}{M}\Big\}\Big)$$
and mini-batch sizes $|B_k| = O(t) \le N$ such that $\{x_k\}$ converges to some stationary point in expectation with probability at least $1 - \tau$, i.e., $\mathbb{P}\big(\lim_{k\to\infty}\mathbb{E}[\|\xi_{\alpha_k,B_k}(x_k)\|] = 0\big) \ge 1 - \tau$.

Remark: Theorem 1 only requires local convexity of $f$ on a neighborhood $\mathcal{X}$ of $x^*$; $f$ itself can be non-convex in general. This local convexity assumption appears in much non-convex analysis, e.g., tensor decomposition (Ge et al., 2015) and shallow neural networks (Zhong et al., 2017). Theorem 1 implies that if the $K$th iterate lies close enough to $x^*$ and the step size $\alpha_k$ and mini-batch size $|B_k|$ are set as above (which further indicates that $x^*$ inhabits the sets $\{\mathcal{S}_k\}_{k\ge K}$ of all subsequent iterates updated by the Half-Space Step with high probability; see Appendix), then the Half-Space Step in Algorithm 3 guarantees convergence to a stationary point. The $O(t)$ mini-batch size is commonly used in the analysis of stochastic algorithms, e.g., Adam and Yogi (Zaheer et al., 2018). Based on the numerical results in Section 4, we observe that a much more slowly increasing or even constant mini-batch size is sufficient; in fact, experiments show that a reasonably large mini-batch size works well in practice if the variance is not large. Although the assumption $\|x_K - x^*\| \le R/2$ is hard to verify in practice, setting $N_P$ large enough usually performs quite well. We next state the sparsity identification guarantee of HSPG.

Theorem 2. If $k \ge N_P$ and $\|x_k - x^*\| \le \frac{2\alpha_k\delta_3}{1 - \epsilon + \alpha_k L}$, then HSPG yields $\mathcal{I}^{0}(x^*) \subseteq \mathcal{I}^{0}(x_{k+1})$.

Remark: Theorem 2 shows that when $x_k$ falls into the $\ell_2$-ball centered at $x^*$ with radius $\frac{2\alpha_k\delta_3}{1 - \epsilon + \alpha_k L}$, HSPG identifies the optimal sparsity pattern, i.e., $\mathcal{I}^{0}(x^*) \subseteq \mathcal{I}^{0}(x_{k+1})$. In contrast, to identify the sparsity pattern, Prox-SG requires the iterates to fall into the $\ell_2$-ball centered at $x^*$ with radius $\alpha_k\delta_3$ (Nutini et al., 2019).
Since $\alpha_k \le 1/L$ and $\epsilon \in [0, 1)$, we have $\frac{2\alpha_k\delta_3}{1 - \epsilon + \alpha_k L} \ge \alpha_k\delta_3$, i.e., the $\ell_2$-ball of HSPG contains the $\ell_2$-ball of Prox-SG, so HSPG has a stronger capability in sparsity pattern identification. Therefore, Theorem 2 reveals a better sparsity identification property of HSPG than Prox-SG, and to our knowledge no similar results exist for the other methods.

The Initialization Stage Selection:

To satisfy the prerequisite for the convergence of the Half-Space Step in Theorem 1, i.e., an initial iterate close enough to $x^*$, there exist several proper candidates for the Initialization Stage, e.g., Prox-SG, Prox-SVRG and SAGA. Considering the tradeoff between computational efficiency and theoretical convergence, our default is Prox-SG. Although Prox-SVRG/SAGA may have better theoretical convergence properties than Prox-SG, they require higher time and space complexity to compute or estimate the full gradient on a huge mini-batch or to store previous gradients, which may be prohibitive for large-scale training, especially when memory is limited. Besides, it has been widely observed that SVRG does not work as desired in popular non-convex deep learning applications (Defazio & Bottou, 2019; Chen et al., 2020). In contrast, Prox-SG is efficient and can also achieve the good-initialization assumption of Theorem 1, i.e., $\|x_{N_P} - x^*\| \le R/2$, with high probability when run sufficiently many times, as shown in Appendix C.4 by leveraging related literature (Rosasco et al., 2019) under an additional strong convexity assumption. However, one should note that Prox-SG does not guarantee any group sparsity of $x_{N_P}$, due to its limited projection region and randomness.

Remark: We emphasize that this paper focuses on improving group sparsity identification, which is rarely explored and is a key indicator of success for structured sparsity regularization problems. Improving the convergence rate, meanwhile, has been very well explored in a series of works (Reddi et al., 2016; Li & Li, 2018), but is outside our main consideration.

4. NUMERICAL EXPERIMENTS

In this section, we present several benchmark numerical experiments on deep neural networks to illustrate the superiority of HSPG over related algorithms in group sparsity exploration, along with its comparable convergence. In addition, two extensible convex experiments in the Appendix empirically demonstrate the validity and superiority of HSPG's group sparsity identification.

Image Classification: We consider popular deep convolutional neural networks (DCNNs) for image classification. Specifically, we select several popular benchmark DCNN architectures, i.e., VGG16 (Simonyan & Zisserman, 2014), ResNet18 (He et al., 2016) and MobileNetV1 (Howard et al., 2017), on two benchmark datasets, CIFAR10 (Krizhevsky & Hinton, 2009) and Fashion-MNIST (Xiao et al., 2017). We run all experiments for 300 epochs with a mini-batch size of 128 and $\lambda = 10^{-3}$, since this returns testing accuracy competitive with models trained without regularization (see more in Appendix D.3). The step size $\alpha_k$ is initialized as 0.1 and decayed by a factor of 0.1 periodically. We set each kernel in the convolution layers as a group of variables. In these experiments, we perform a test of objective-value stationarity similar to (Zhang et al., 2020, Section 2.1) and switch to the Half-Space Step at roughly 150 epochs, with $N_P = 150N/|B|$. The control parameter $\epsilon$ in the half-space projection (9) adjusts the aggressiveness of group sparsity promotion; it is first set to 0 and then fine-tuned to around 0.02 to favor the sparsity level without hurting the target objective $\Psi$ (the detailed procedure is in Appendix D.3). We exclude RDA because no acceptable results were attained in our tests with its step size parameter $\gamma$ set to all powers of 10 from $10^{-3}$ to $10^{3}$, and we skip Prox-Spider and SAGA since Prox-SVRG is a strong representative of the proximal incremental gradient methods.
Table 1 demonstrates the effectiveness and superiority of HSPG, where we mark the best values in bold and define the group sparsity ratio as the percentage of zero groups. In particular: (i) HSPG computes remarkably higher group sparsity than the other methods on all tests under both $\epsilon = 0$ and fine-tuned $\epsilon$; its solutions are typically multiple times sparser in the group sense than those of Prox-SG, while Prox-SVRG does not perform comparably, since variance reduction techniques may not work as desired in deep learning applications (Defazio & Bottou, 2019); and (ii) HSPG performs competitively with respect to the final objective values $\Psi$ and $f$ (see $f$ in the Appendix). In addition, all methods reach comparable generalization performance on unseen test data. On the other hand, sparse regularization methods may yield solutions with entries that are not exactly zero but very small, and sometimes all entries below a certain threshold $T$ are set to zero (Jenatton et al., 2010; Halabi et al., 2018). However, such a simple truncation mechanism is heuristic, and may hurt convergence and accuracy. To illustrate this, we set the groups of the Prox-SG and Prox-SVRG solutions to zero if the magnitudes of the group variables fall below some $T$, and denote the resulting solutions Prox-SG* and Prox-SVRG*. As shown in Figure 2d(i), under the largest $T$ causing no accuracy regression, Prox-SG* and Prox-SVRG* reach higher group sparsity ratios of 60% and 32% compared to Table 1, but still significantly lower than the 70% of HSPG under $\epsilon = 0.05$ without any truncation. Under the $T$ required to reach the same group sparsity ratio as HSPG, the testing accuracy of Prox-SG* and Prox-SVRG* regresses drastically to 28% and 17%, respectively, in Figure 2d(ii).
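The heuristic $T$-thresholding baseline described above can be sketched as follows (function and variable names are our own illustration); unlike the half-space projection, this post-hoc rule carries no descent guarantee:

```python
import numpy as np

def truncate_groups(x, groups, T):
    """Heuristic post-hoc pruning: zero every group whose l2-norm is below
    threshold T. No convergence or accuracy guarantee, in contrast to the
    half-space projection applied during training."""
    x = x.copy()
    for g in groups:
        if np.linalg.norm(x[g]) < T:
            x[g] = 0.0
    return x

w = np.array([0.8, 0.6, 1e-3, -2e-3])
groups = [np.arange(0, 2), np.arange(2, 4)]
w_pruned = truncate_groups(w, groups, T=0.01)
# second group (norm ~2.2e-3 < 0.01) is zeroed; first group is untouched
```

The difficulty illustrated by the experiments is choosing $T$: too small and little sparsity is gained, too large and groups that matter for accuracy are destroyed, with no principled way to tell the two apart after training.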
Note that although further refitting the Prox-SG* and Prox-SVRG* models on the active (non-zero) groups of weights may recover the accuracy regression, it requires additional engineering effort and training cost, making it less attractive and convenient than HSPG (which needs no refitting). Finally, we investigate the group sparsity evolution under different $\epsilon$'s. As shown in Figure 2b, HSPG produces the most group-sparse solutions among the compared methods. Notably, during the first $N_P$ iterations, HSPG performs essentially the same as Prox-SG. However, after switching to the Half-Space Step at the 150th epoch, HSPG outperforms all other methods dramatically, and larger $\epsilon$ results in a higher sparsity level. This is strong evidence that our half-space-based technique is much more successful than the proximal mechanism and its variants in group sparsity identification. Besides, the evolutions of $\Psi$ and testing accuracy confirm the comparable convergence of the tested algorithms. In particular, the objective $\Psi$ generally decreases monotonically for small $\epsilon$ (0 to 0.02) and experiences a mild pulse after the switch to the Half-Space Step for larger $\epsilon$, e.g., 0.05, which matches Lemma 1. As a result, with similar generalization accuracy, HSPG allows dropping entire hidden units of networks, which may further enable automatic dimension reduction and smaller model architectures for efficient inference.

5. CONCLUSIONS AND FUTURE WORK

We proposed a new Half-Space Stochastic Projected Gradient (HSPG) method for disjoint group-sparsity regularized problems, applicable to various structured-sparsity stochastic learning problems. HSPG uses the proximal stochastic gradient method to seek a near-optimal solution estimate, followed by a novel half-space group projection to effectively exploit the group sparsity structure. In theory, we provided a convergence guarantee and showed its better sparsity identification performance. Experiments on both convex and non-convex problems demonstrated that HSPG usually achieves solutions with competitive objective values and significantly higher group sparsity compared with state-of-the-art stochastic solvers. Further study is needed to properly leverage group sparsity in diverse deep learning applications, e.g., to help design and understand optimal network architectures by removing redundant hidden structures.

A PROJECTION REGION

In this appendix, we derive the projection region of HSPG, and reveal that it is a superset of those of Prox-SG, Prox-SVRG and Prox-Spider under the same $\alpha_k$ and $\lambda$.

Proposition 1. The Half-Space Step of HSPG yields the next iterate $x_{k+1}$ from the trial iterate $\tilde{x}_{k+1} = x_k - \alpha_k \nabla f_{B_k}(x_k)$ as follows: for each $g \in \mathcal{I}^{\neq 0}(x_k)$,

$$[x_{k+1}]_g = \begin{cases} [\tilde{x}_{k+1}]_g - \alpha_k \lambda \dfrac{[x_k]_g}{\|[x_k]_g\|} & \text{if } [\tilde{x}_{k+1}]_g^\top [x_k]_g > (\alpha_k \lambda + \epsilon\|[x_k]_g\|)\,\|[x_k]_g\|, \\ 0 & \text{otherwise.} \end{cases} \quad (11)$$

Consequently, if $\|[\tilde{x}_{k+1}]_g\| \le \alpha_k\lambda$, then $[x_{k+1}]_g = 0$ for any $\epsilon \ge 0$.

Proof. For $g \in \mathcal{I}^{\neq 0}(x_k) \cap \mathcal{I}^{\neq 0}(x_{k+1})$, by Algorithm 3, it is equivalent that

$$\Big[x_k - \alpha_k \nabla f_{B_k}(x_k) - \alpha_k\lambda \tfrac{[x_k]_g}{\|[x_k]_g\|}\Big]_g^\top [x_k]_g > \epsilon\|[x_k]_g\|^2 \;\Longleftrightarrow\; [\tilde{x}_{k+1}]_g^\top [x_k]_g - \alpha_k\lambda\|[x_k]_g\| > \epsilon\|[x_k]_g\|^2 \;\Longleftrightarrow\; [\tilde{x}_{k+1}]_g^\top [x_k]_g > (\alpha_k\lambda + \epsilon\|[x_k]_g\|)\,\|[x_k]_g\|. \quad (12)$$

Similarly, $g \in \mathcal{I}^{\neq 0}(x_k) \cap \mathcal{I}^{0}(x_{k+1})$ is equivalent to

$$\Big[x_k - \alpha_k \nabla f_{B_k}(x_k) - \alpha_k\lambda \tfrac{[x_k]_g}{\|[x_k]_g\|}\Big]_g^\top [x_k]_g \le \epsilon\|[x_k]_g\|^2 \;\Longleftrightarrow\; [\tilde{x}_{k+1}]_g^\top [x_k]_g \le (\alpha_k\lambda + \epsilon\|[x_k]_g\|)\,\|[x_k]_g\|. \quad (13)$$

If $\|[\tilde{x}_{k+1}]_g\| \le \alpha_k\lambda$, then $[\tilde{x}_{k+1}]_g^\top [x_k]_g \le \|[\tilde{x}_{k+1}]_g\|\,\|[x_k]_g\| \le \alpha_k\lambda\|[x_k]_g\|$. Hence $[x_{k+1}]_g = 0$ holds for any $\epsilon \ge 0$ by (13), which implies that the projection regions of Prox-SG and its variance-reduction variants, e.g., Prox-SVRG, Prox-Spider and SAGA, are subsets of HSPG's.
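The projection rule (11) is straightforward to implement directly. Below is a minimal NumPy sketch of one Half-Space Step, assuming `groups` is a list of disjoint index arrays; all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def half_space_step(x, grad_f, groups, alpha, lam, eps):
    """One Half-Space Step as sketched in Eq. (11): each currently non-zero
    group is kept (and shrunk by the regularizer gradient) only if its trial
    point stays sufficiently aligned with the current iterate; otherwise the
    whole group is projected to exactly zero."""
    x_new = x.copy()
    x_trial = x - alpha * grad_f          # trial iterate from the stochastic gradient
    for g in groups:                      # g: index array of one variable group
        xg = x[g]
        norm_xg = np.linalg.norm(xg)
        if norm_xg == 0.0:                # zero groups stay fixed at zero
            continue
        if x_trial[g] @ xg > (alpha * lam + eps * norm_xg) * norm_xg:
            x_new[g] = x_trial[g] - alpha * lam * xg / norm_xg
        else:
            x_new[g] = 0.0                # group projected to exactly zero
    return x_new
```

For example, a group whose trial point loses alignment with the current iterate (e.g., a tiny group pushed past the origin by the gradient) is zeroed out entirely, while well-aligned groups only receive the shrinkage term.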

B NON-LIPSCHITZ CONTINUITY OF ∇Ψ ON R n

The first derivative of $\Psi(x)$ at $x \neq 0$ can be written as

$$\nabla\Psi(x) = \nabla f(x) + \lambda \sum_{g\in\mathcal{G}} \frac{[x]_g}{\|[x]_g\|}.$$

We next show that $[x]_g/\|[x]_g\|$ is not Lipschitz continuous on $\mathbb{R}^n$ if $|g|\ge 2$. Take an example in $\mathbb{R}^2$ and select $x_1=(t,a_1t)^\top$, $x_2=(t,a_2t)^\top$ with $a_1\neq a_2$ and $t\in\mathbb{R}$. Suppose there exists a positive constant $L<\infty$ such that Lipschitz continuity holds:

$$\left\|\frac{x_1}{\|x_1\|}-\frac{x_2}{\|x_2\|}\right\| \le L\|x_1-x_2\| \;\Longleftrightarrow\; \left\|\frac{(1,a_1)}{\sqrt{1+a_1^2}}-\frac{(1,a_2)}{\sqrt{1+a_2^2}}\right\| \le L\,|a_1-a_2|\cdot|t|$$

for any $t\in\mathbb{R}$, and note that the left-hand side is a positive constant independent of $t$. Letting $t\to 0$ then forces $L\to\infty$, contradicting $L<\infty$. Therefore $[x]_g/\|[x]_g\|$ is not Lipschitz continuous on $\mathbb{R}^2$, specifically in the region surrounding the origin. Although $[\nabla\Psi(x)]_{\mathcal{I}^{\neq 0}(x)}$ is not Lipschitz continuous on $\mathbb{R}^n$, Lipschitz continuity still holds after excluding a fixed-size $\ell_2$-ball centered at the origin for each group of non-zero variables in $\mathcal{I}^{\neq 0}(x)$. For our paper, we define the region where Lipschitz continuity of $[\nabla\Psi(x)]_{\mathcal{I}^{\neq 0}(x)}$ still holds as

$$\mathcal{X} = \big\{x : \|[x]_g\|\ge\delta_1 \text{ for each } g\in\mathcal{I}^{\neq 0}(x), \text{ and } [x]_g=0 \text{ for each } g\in\mathcal{I}^{0}(x)\big\}. \quad (17)$$

C CONVERGENCE ANALYSIS PROOF

Denote the sets of groups which are or are not projected onto zero as

$$\hat{\mathcal{G}}_k := \mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{0}(x_{k+1}), \quad (18)$$
$$\tilde{\mathcal{G}}_k := \mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{\neq 0}(x_{k+1}). \quad (19)$$

Denote $\mathcal{X} := \{x : \|[x]_g\|\ge\delta_1 \text{ for each } g\in\mathcal{G}\}$, on which the Lipschitz continuity of $\nabla\Psi_B(x)$ still holds by excluding from $\mathbb{R}^n$ an $\ell_2$-ball centered at the origin with radius $\delta_1$. Let $M$ denote an upper bound of $\|\partial\Psi\|$ and $\|\xi\|$. Additionally, establishing some convergence results requires the constants below, measuring the least and largest magnitudes of the non-zero group variables in $x^*$,

$$0<\delta_1:=\tfrac12\min_{g\in\mathcal{I}^{\neq 0}(x^*)}\|[x^*]_g\|, \quad\text{and}\quad 0<\delta_2:=\tfrac12\max_{g\in\mathcal{I}^{\neq 0}(x^*)}\|[x^*]_g\|, \quad (20, 21)$$

and a consequence of the strict complementarity assumption, holding uniformly over any mini-batch $B$,

$$0<\delta_3:=\tfrac12\min_{g\in\mathcal{I}^{0}(x^*)}\big(\lambda-\|[\nabla f_B(x^*)]_g\|\big). \quad (22)$$

We also denote the following frequently used constant $R$, describing the size of the neighborhood around $x^*$:

$$R := \min\left\{\frac{-(\delta_1+2\epsilon\delta_2)+\sqrt{(\delta_1+2\epsilon\delta_2)^2-4\epsilon^2\delta_2^2+4\epsilon\delta_1^2}}{\epsilon},\;\delta_1\right\} > 0. \quad (23)$$

Remark: (23) is well defined for $0<\epsilon<\delta_1^2/\delta_2$, and degenerates to $\delta_1$ as $\epsilon=0$.
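The non-Lipschitz argument above is easy to verify numerically: the gap between the normalized vectors is a fixed constant, while $\|x_1-x_2\|$ shrinks with $t$, so the would-be Lipschitz ratio blows up near the origin. A small sketch (values of $a_1, a_2$ are arbitrary):

```python
import numpy as np

def normalized(v):
    return v / np.linalg.norm(v)

# x1 = (t, a1*t), x2 = (t, a2*t): the gap between the normalized vectors is a
# fixed positive constant, while ||x1 - x2|| = |a1 - a2| * |t| shrinks with t,
# so the ratio (a candidate Lipschitz constant) diverges as t -> 0.
a1, a2 = 1.0, 2.0
ratios = []
for t in [1.0, 1e-2, 1e-4]:
    x1, x2 = np.array([t, a1 * t]), np.array([t, a2 * t])
    gap = np.linalg.norm(normalized(x1) - normalized(x2))
    ratios.append(gap / np.linalg.norm(x1 - x2))
assert ratios[0] < ratios[1] < ratios[2]  # no finite L works as t -> 0
```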

C.1 SUFFICIENT DECREASE OF PROX-SG STEP AND HALF-SPACE STEP

Our convergence analysis relies on the following sufficient decrease properties of Half-Space Step and Prox-SG Step.

Sufficient Decrease of the Half-Space Step: We prove Lemma 1 below.

Proof of Lemma 1:

It follows from Algorithm 3 and the definitions of $\tilde{\mathcal{G}}_k$ and $\hat{\mathcal{G}}_k$ in (19) and (18) that $x_{k+1}=x_k+\alpha_k d_k$, where

$$[d_k]_g = \begin{cases} -[\partial\Psi_{B_k}(x_k)]_g & \text{if } g\in\tilde{\mathcal{G}}_k = \mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{\neq 0}(x_{k+1}),\\ -[x_k]_g/\alpha_k & \text{if } g\in\hat{\mathcal{G}}_k = \mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{0}(x_{k+1}),\\ 0 & \text{otherwise.}\end{cases} \quad (24)$$

We also notice that for any $g\in\hat{\mathcal{G}}_k$ the following holds:

$$[x_k-\alpha_k\partial\Psi_{B_k}(x_k)]_g^\top[x_k]_g < \epsilon\|[x_k]_g\|^2 \;\Longrightarrow\; (1-\epsilon)\|[x_k]_g\|^2 < \alpha_k[\partial\Psi_{B_k}(x_k)]_g^\top[x_k]_g. \quad (25)$$

For simplicity, let $\mathcal{I}^{\neq 0}_k:=\mathcal{I}^{\neq 0}(x_k)$. Since $[d_k]_g=0$ for any $g\in\mathcal{I}^{0}(x_k)$, then by (24) and (25) we have

$$d_k^\top\partial\Psi_{B_k}(x_k) = [d_k]_{\mathcal{I}^{\neq 0}_k}^\top[\partial\Psi_{B_k}(x_k)]_{\mathcal{I}^{\neq 0}_k} = -\sum_{g\in\tilde{\mathcal{G}}_k}\|[\partial\Psi_{B_k}(x_k)]_g\|^2 - \sum_{g\in\hat{\mathcal{G}}_k}\tfrac{1}{\alpha_k}[x_k]_g^\top[\partial\Psi_{B_k}(x_k)]_g \le -\sum_{g\in\tilde{\mathcal{G}}_k}\|[\partial\Psi_{B_k}(x_k)]_g\|^2 - \sum_{g\in\hat{\mathcal{G}}_k}\tfrac{1-\epsilon}{\alpha_k^2}\|[x_k]_g\|^2 < 0 \quad (26)$$

for any $\epsilon\in[0,1)$, which implies that $d_k$ is a descent direction for $\Psi_{B_k}$ at $x_k$.

Now we prove the sufficient decrease of the Half-Space Step. By the descent lemma, $x_k\in\mathcal{X}$ and the Lipschitz continuity of $[\partial\Psi_{B_k}]_{\mathcal{I}^{\neq 0}_k}$ on $\mathcal{X}$, we have

$$\Psi_{B_k}(x_k+\alpha_k d_k) \le \Psi_{B_k}(x_k) + \alpha_k[\partial\Psi_{B_k}(x_k)]_{\mathcal{I}^{\neq 0}_k}^\top[d_k]_{\mathcal{I}^{\neq 0}_k} + \tfrac{L}{2}\alpha_k^2\|[d_k]_{\mathcal{I}^{\neq 0}_k}\|^2. \quad (27)$$

It then follows from (24) that (27) can be rewritten as

$$\Psi_{B_k}(x_k+\alpha_k d_k) \le \Psi_{B_k}(x_k) - \sum_{g\in\tilde{\mathcal{G}}_k}\|[\partial\Psi_{B_k}(x_k)]_g\|^2\Big(\alpha_k-\tfrac{L}{2}\alpha_k^2\Big) - \sum_{g\in\hat{\mathcal{G}}_k}\Big([\partial\Psi_{B_k}(x_k)]_g^\top[x_k]_g - \tfrac{L}{2}\|[x_k]_g\|^2\Big). \quad (28)$$

Consequently, combining $\epsilon\in[0,1)$ with (25), (28) further yields

$$\Psi_{B_k}(x_{k+1}) \le \Psi_{B_k}(x_k) - \Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\|[\partial\Psi_{B_k}(x_k)]_g\|^2 - \Big(\tfrac{1-\epsilon}{\alpha_k}-\tfrac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\|[x_k]_g\|^2, \quad (29)$$

which completes the proof.

Sufficient Decrease of the Prox-SG Step: The second lemma is well known for the proximal operator; we include its proof under our notation for completeness.

Lemma 2. Line 3 of Algorithm 2 yields $x_{k+1}=x_k-\alpha_k\xi_{\alpha_k,B_k}(x_k)$, where

$$\xi_{\alpha_k,B_k}(x_k)\in\nabla f_{B_k}(x_k)+\lambda\partial\Omega(x_{k+1}), \quad (30)$$

and the objective value $\Psi_{B_k}$ satisfies

$$\Psi_{B_k}(x_{k+1}) \le \Psi_{B_k}(x_k) - \Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\|\xi_{\alpha_k,B_k}(x_k)\|^2. \quad (31)$$

Proof. It follows from line 3 of Algorithm 2 and the definition of the proximal operator that

$$x_{k+1} = \arg\min_{x\in\mathbb{R}^n}\;\tfrac{1}{2\alpha_k}\big\|x-(x_k-\alpha_k\nabla f_{B_k}(x_k))\big\|^2+\lambda\Omega(x) = \arg\min_{x\in\mathbb{R}^n}\;\nabla f_{B_k}(x_k)^\top(x-x_k)+\lambda\Omega(x)+\tfrac{1}{2\alpha_k}\|x-x_k\|^2. \quad (32)$$

By the optimality condition, we have

$$0\in\tfrac{1}{\alpha_k}(x_{k+1}-x_k)+\nabla f_{B_k}(x_k)+\lambda\partial\Omega(x_{k+1}). \quad (33)$$

Since $x_{k+1}=x_k-\alpha_k\xi_{\alpha_k,B_k}(x_k)$, we have $0\in-\xi_{\alpha_k,B_k}(x_k)+\nabla f_{B_k}(x_k)+\lambda\partial\Omega(x_{k+1})$, (34) which implies

$$\xi_{\alpha_k,B_k}(x_k)\in\nabla f_{B_k}(x_k)+\lambda\partial\Omega(x_{k+1}). \quad (35)$$

Thus there exists some $v\in\partial\Omega(x_{k+1})$ such that $\xi_{\alpha_k,B_k}(x_k)=\nabla f_{B_k}(x_k)+\lambda v$. (36)

By the Lipschitz continuity of $\nabla f_{B_k}$ and the convexity of $\Omega(\cdot)$, we have

$$f_{B_k}(x_{k+1}) = f_{B_k}\big(x_k-\alpha_k\xi_{\alpha_k,B_k}(x_k)\big) \le f_{B_k}(x_k)-\alpha_k\nabla f_{B_k}(x_k)^\top\xi_{\alpha_k,B_k}(x_k)+\tfrac{\alpha_k^2L}{2}\|\xi_{\alpha_k,B_k}(x_k)\|^2 \quad (37)$$

and

$$\lambda\Omega(x_{k+1}) = \lambda\Omega\big(x_k-\alpha_k\xi_{\alpha_k,B_k}(x_k)\big) \le \lambda\Omega(x_k)+\lambda v^\top\big(x_k-\alpha_k\xi_{\alpha_k,B_k}(x_k)-x_k\big) = \lambda\Omega(x_k)-\alpha_k\lambda v^\top\xi_{\alpha_k,B_k}(x_k). \quad (38)$$

Hence, by (36), (37) and (38), the objective $\Psi_{B_k}(x_{k+1})$ satisfies

$$\Psi_{B_k}(x_{k+1}) = f_{B_k}(x_{k+1})+\lambda\Omega(x_{k+1}) \le \Psi_{B_k}(x_k)-\alpha_k\big(\nabla f_{B_k}(x_k)+\lambda v\big)^\top\xi_{\alpha_k,B_k}(x_k)+\tfrac{\alpha_k^2L}{2}\|\xi_{\alpha_k,B_k}(x_k)\|^2 = \Psi_{B_k}(x_k)-\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\|\xi_{\alpha_k,B_k}(x_k)\|^2,$$

which completes the proof.

According to Lemmas 1 and 2, the objective value on a mini-batch achieves a sufficient decrease in both the Prox-SG Step and the Half-Space Step, provided $\alpha_k$ is small enough. Taking expectations on both sides, we obtain the following result characterizing the sufficient decrease from $\Psi(x_k)$ to $\mathbb{E}[\Psi(x_{k+1})]$.

Corollary 1. For iteration $k$: (i) if the $k$th iteration conducts a Prox-SG Step, then

$$\mathbb{E}[\Psi(x_{k+1})] \le \Psi(x_k)-\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\mathbb{E}\|\xi_{\alpha_k,B_k}(x_k)\|^2; \quad (39)$$

(ii) if the $k$th iteration conducts a Half-Space Step and $x_k\in\mathcal{X}$, then

$$\mathbb{E}[\Psi(x_{k+1})] \le \Psi(x_k)-\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\partial\Psi_{B_k}(x_k)]_g\|^2-\Big(\tfrac{1-\epsilon}{\alpha_k}-\tfrac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\|[x_k]_g\|^2. \quad (40)$$

Corollary 1 shows that the bound on $\Psi$ depends on the step size $\alpha_k$ and the norm of the search direction. It further indicates that both the Half-Space Step and the Prox-SG Step make progress toward optimality under a proper selection of $\alpha_k$.
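Lemma 2 concerns the proximal mapping of $\lambda\Omega(x)=\lambda\sum_g\|[x]_g\|$, which for disjoint groups has the well-known closed form of block soft-thresholding. A minimal NumPy sketch (illustrative names, not the authors' code):

```python
import numpy as np

def prox_group_l2(v, groups, t):
    """Proximal mapping of x -> t * sum_g ||[x]_g|| for disjoint groups:
    block soft-thresholding, applied group by group."""
    out = v.copy()
    for g in groups:
        ng = np.linalg.norm(v[g])
        out[g] = 0.0 if ng <= t else (1.0 - t / ng) * v[g]
    return out

def prox_sg_step(x, grad_f, groups, alpha, lam):
    """One Prox-SG step (line 3 of Algorithm 2): a gradient step on f
    followed by the proximal step on the group regularizer."""
    return prox_group_l2(x - alpha * grad_f, groups, alpha * lam)
```

Each group is either scaled toward the origin or set exactly to zero, which is the mechanism behind the sparsity promoted by the Prox-SG Step.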

C.2 PROOF OF THEOREM 1

Toward that end, we first show that if the distance from $x_k$ to the local minimizer $x^*$ is sufficiently small, then HSPG already covers the support of $x^*$, i.e., $\mathcal{I}^{\neq 0}(x^*)\subseteq\mathcal{I}^{\neq 0}(x_k)$.

Lemma 3. If $\|x_k-x^*\|\le R$, then $\mathcal{I}^{\neq 0}(x^*)\subseteq\mathcal{I}^{\neq 0}(x_k)$.

Proof. For any $g\in\mathcal{I}^{\neq 0}(x^*)$, by the assumption of this lemma and the definitions of $R$ in (23) and $\delta_1$ in (20), we have

$$\big|\,\|[x^*]_g\|-\|[x_k]_g\|\,\big| \le \|[x_k-x^*]_g\| \le \|x_k-x^*\| \le R \le \delta_1 \;\Longrightarrow\; \|[x_k]_g\| \ge \|[x^*]_g\|-\delta_1 \ge 2\delta_1-\delta_1 = \delta_1 > 0. \quad (41)$$

Hence $[x_k]_g\neq 0$, i.e., $g\in\mathcal{I}^{\neq 0}(x_k)$. Therefore $\mathcal{I}^{\neq 0}(x^*)\subseteq\mathcal{I}^{\neq 0}(x_k)$.

The next lemma shows that if the distance $\|x_k-x^*\|$ between the current iterate and $x^*$ is sufficiently small, then $x^*$ inhabits the reduced space $\mathcal{S}_k:=\mathcal{S}(x_k)$.

Lemma 4. Under Assumption 1, if $0\le\epsilon<\delta_1^2/\delta_2$ and $\|x_k-x^*\|\le R$, then for each $g\in\mathcal{I}^{\neq 0}(x^*)$,

$$[x_k]_g^\top[x^*]_g \ge \epsilon\|[x_k]_g\|^2. \quad (42)$$

Consequently, $x^*\in\mathcal{S}_k$ by the definition in (7).

Proof. It follows from the assumption of this lemma and the definitions of $R$, $\delta_1$ and $\delta_2$ in (23), (20) and (21) that for any $g\in\mathcal{I}^{\neq 0}(x^*)$,

$$\|[x_k]_g\| \le \|[x^*]_g\|+R \le 2\delta_2+R, \quad (43)$$

and the quantity $\big(-(\delta_1+2\epsilon\delta_2)+\sqrt{(\delta_1+2\epsilon\delta_2)^2-4\epsilon^2\delta_2^2+4\epsilon\delta_1^2}\big)/\epsilon$ in (23) is exactly the solution of $\epsilon z^2+(4\epsilon\delta_2+2\delta_1)z+4\epsilon\delta_2^2-4\delta_1^2=0$ over $z\in\mathbb{R}_+$. Then

$$[x_k]_g^\top[x^*]_g = [x_k-x^*+x^*]_g^\top[x^*]_g = [x_k-x^*]_g^\top[x^*]_g+\|[x^*]_g\|^2 \ge \|[x^*]_g\|^2-\|[x_k-x^*]_g\|\,\|[x^*]_g\| = \|[x^*]_g\|\big(\|[x^*]_g\|-\|[x_k-x^*]_g\|\big) \ge 2\delta_1(2\delta_1-R) \ge \epsilon(2\delta_2+R)^2 \ge \epsilon\|[x_k]_g\|^2 \quad (44)$$

holds for any $g\in\mathcal{I}^{\neq 0}(x^*)$, where the second-to-last inequality holds because $2\delta_1(2\delta_1-R)=\epsilon(2\delta_2+R)^2$ when $R=\big(-(\delta_1+2\epsilon\delta_2)+\sqrt{(\delta_1+2\epsilon\delta_2)^2-4\epsilon^2\delta_2^2+4\epsilon\delta_1^2}\big)/\epsilon$. Combining with the definition of $\mathcal{S}_k$ in (7), $x^*$ inhabits $\mathcal{S}_k$, which completes the proof.

Furthermore, if $\|x_k-x^*\|$ is small enough and the step size is selected properly, every recovery of group sparsity by the Half-Space Step is guaranteed to be successful, as stated in the following lemma.
Lemma 5. Suppose $k\ge N_P$, $\|x_k-x^*\|\le R$, $0\le\epsilon<\frac{2\delta_1-R}{2\delta_2+R}$ and $0<\alpha_k\le\frac{2\delta_1-R-\epsilon(2\delta_2+R)}{M}$. Then any $g\in\hat{\mathcal{G}}_k=\mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{0}(x_{k+1})$ must lie in $\mathcal{I}^{0}(x^*)$, i.e., $g\in\mathcal{I}^{0}(x^*)$.

Proof. To derive a contradiction, suppose there exists some $g\in\hat{\mathcal{G}}_k$ with $g\in\mathcal{I}^{\neq 0}(x^*)$. Since $g\in\hat{\mathcal{G}}_k$, the group projection (9) is triggered at $g$, so that

$$[\tilde{x}_{k+1}]_g^\top[x_k]_g = [x_k-\alpha_k\nabla\Psi_{B_k}(x_k)]_g^\top[x_k]_g = \|[x_k]_g\|^2-\alpha_k[\nabla\Psi_{B_k}(x_k)]_g^\top[x_k]_g < \epsilon\|[x_k]_g\|^2. \quad (45)$$

On the other hand, it follows from the assumption of this lemma and $g\in\mathcal{I}^{\neq 0}(x^*)$ that

$$\|[x_k-x^*]_g\| \le \|x_k-x^*\| \le R. \quad (46)$$

Combining the definitions of $\delta_1$ in (20) and $\delta_2$ in (21), we have

$$2\delta_1-R \le \|[x^*]_g\|-R \le \|[x_k]_g\| \le \|[x^*]_g\|+R \le 2\delta_2+R. \quad (47)$$

It then follows from $0<\alpha_k\le\frac{2\delta_1-R-\epsilon(2\delta_2+R)}{M}$, where we note $2\delta_1-R-\epsilon(2\delta_2+R)>0$ since $R\le\delta_1$ and $\epsilon<\frac{2\delta_1-R}{2\delta_2+R}$, that

$$[\tilde{x}_{k+1}]_g^\top[x_k]_g = \|[x_k]_g\|^2-\alpha_k[\nabla\Psi_{B_k}(x_k)]_g^\top[x_k]_g \ge \|[x_k]_g\|^2-\alpha_k\|[\nabla\Psi_{B_k}(x_k)]_g\|\,\|[x_k]_g\| = \|[x_k]_g\|\big(\|[x_k]_g\|-\alpha_k\|[\nabla\Psi_{B_k}(x_k)]_g\|\big) \ge \|[x_k]_g\|\big(\|[x_k]_g\|-\alpha_kM\big) \ge \|[x_k]_g\|\big[(2\delta_1-R)-\alpha_kM\big] \ge \|[x_k]_g\|\big[(2\delta_1-R)-\big(2\delta_1-R-\epsilon(2\delta_2+R)\big)\big] = \epsilon\|[x_k]_g\|(2\delta_2+R) \ge \epsilon\|[x_k]_g\|^2, \quad (48)$$

which contradicts (45). Hence we conclude that any group $g$ projected to zero, i.e., $g\in\hat{\mathcal{G}}_k=\mathcal{I}^{\neq 0}(x_k)\cap\mathcal{I}^{0}(x_{k+1})$, is also zero at the optimal solution $x^*$, i.e., $g\in\mathcal{I}^{0}(x^*)$.

We next show that if an iterate of the Half-Space Step is close enough to the optimal solution $x^*$, then $x^*$ inhabits all reduced spaces constructed by the subsequent Half-Space Step iterates with high probability. To establish this result, we require the two lemmas below. The first bounds the accumulated error due to random sampling.
Here we introduce the error of the gradient estimator on $\mathcal{I}^{\neq 0}(x)$ for $\Psi$ on a mini-batch $B$ as

$$e_B(x) := [\nabla\Psi_B(x)-\nabla\Psi(x)]_{\mathcal{I}^{\neq 0}(x)}, \quad (49)$$

where by the definition of $\Omega$ in problem (1), $e_B(x)$ also equals the error of the estimator of $\nabla f$:

$$e_B(x) = [\nabla\Psi_B(x)-\nabla\Psi(x)]_{\mathcal{I}^{\neq 0}(x)} = [\nabla f_B(x)-\nabla f(x)]_{\mathcal{I}^{\neq 0}(x)}. \quad (50)$$

Lemma 6. Given any $\theta>1$ and $K\ge N_P$, let $k:=K+t$, $t\in\mathbb{Z}_+\cup\{0\}$. Then there exist $\alpha_k=\mathcal{O}(1/t)$ and $|B_k|=\mathcal{O}(t)$ such that for any $y_t\in\mathbb{R}^n$,

$$\max_{\{y_t\}_{t=0}^\infty\in\mathcal{X}^\infty}\;\sum_{t=0}^\infty\alpha_k\|e_{B_k}(y_t)\| \le \frac{3R^2}{8(4R+1)}$$

holds with probability at least $1-\frac{1}{\theta^2}$.

Proof. Define the random variable $Y_t:=\alpha_{K+t}\|e_{B_{K+t}}(y_t)\|$ for all $t\ge 0$. Since $\{y_t\}_{t=0}^\infty$ are arbitrarily chosen, the random variables $\{Y_t\}_{t=0}^\infty$ are independent. Let $Y:=\sum_{t=0}^\infty Y_t$. Using Chebyshev's inequality, we obtain

$$\mathbb{P}\big(Y\ge\mathbb{E}[Y]+\theta\sqrt{\mathrm{Var}[Y]}\big) \le \mathbb{P}\big(|Y-\mathbb{E}[Y]|\ge\theta\sqrt{\mathrm{Var}[Y]}\big) \le \frac{1}{\theta^2}. \quad (51)$$

By Assumption 1, there exists an upper bound $\sigma^2>0$ for the variance of the random noise $e(x)$ generated from a one-point mini-batch, i.e., $B=\{i\}$, $i=1,\dots,N$. Consequently, for each $t\ge 0$ we have $\mathbb{E}[Y_t]\le\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}}$ and $\mathrm{Var}[Y_t]\le\frac{\alpha_{K+t}^2\sigma^2}{|B_{K+t}|}$; combining with (51),

$$Y \le \mathbb{E}[Y]+\theta\sqrt{\mathrm{Var}[Y]} \le \sum_{t=0}^\infty\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}}+\theta\sqrt{\sum_{t=0}^\infty\frac{\alpha_{K+t}^2\sigma^2}{|B_{K+t}|}} \le \sum_{t=0}^\infty\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}}+\theta\sum_{t=0}^\infty\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}} = (1+\theta)\sum_{t=0}^\infty\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}} \quad (52\text{–}54)$$

holds with probability at least $1-\frac{1}{\theta^2}$. For the second inequality we use the property that $\mathbb{E}[\sum_{t=0}^\infty Y_t]=\sum_{t=0}^\infty\mathbb{E}[Y_t]$ whenever $\sum_{t=0}^\infty\mathbb{E}[|Y_t|]$ converges, see Section 2.1 in Mitzenmacher (2005); for the third inequality we use $\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}}\le 1$ without loss of generality, as is the common setting of large mini-batch sizes and small step sizes. Given any $\theta>1$, there exist some $\alpha_k=\mathcal{O}(1/t)$ and $|B_k|=\mathcal{O}(t)$ such that the above series converges and satisfies

$$(1+\theta)\sum_{t=0}^\infty\frac{\alpha_{K+t}\sigma}{\sqrt{|B_{K+t}|}} \le \frac{3R^2}{8(4R+1)}.$$
Notice that the above proof holds for any given sequence $\{y_t\}_{t=0}^\infty\in\mathcal{X}^\infty$; thus

$$\max_{\{y_t\}_{t=0}^\infty\in\mathcal{X}^\infty}\;\sum_{t=0}^\infty\alpha_k\|e_{B_k}(y_t)\| \le \frac{3R^2}{8(4R+1)}$$

holds with probability at least $1-\frac{1}{\theta^2}$.

The second lemma states that if a previous iterate of the Half-Space Step falls into the neighborhood of $x^*$, then under an appropriate step-size and mini-batch setting the current iterate also inhabits that neighborhood with high probability.

Lemma 7. Under the assumptions of Lemma 6, suppose $\|x_K-x^*\|\le R/2$ and, for every $\ell$ with $K\le\ell<K+t$, $0<\alpha_\ell\le\min\{\frac{1}{L},\frac{2\delta_1-R-\epsilon(2\delta_2+R)}{M}\}$, $|B_\ell|\ge N-\frac{N}{2M}$ and $\|x_\ell-x^*\|\le R$ hold. Then

$$\|x_{K+t}-x^*\| \le R \quad (55)$$

holds with probability at least $1-\frac{1}{\theta^2}$.

Proof. It follows from the assumptions of this lemma, Lemma 5, (18) and (19) that for any $\ell$ with $K\le\ell<K+t$,

$$[x^*]_g = 0 \text{ for any } g\in\hat{\mathcal{G}}_\ell. \quad (56)$$

Hence we have, for $K\le\ell<K+t$,

$$\begin{aligned}
\|x_{\ell+1}-x^*\|^2 &= \sum_{g\in\tilde{\mathcal{G}}_\ell}\big\|[x_\ell-x^*-\alpha_\ell\nabla\Psi(x_\ell)-\alpha_\ell e_{B_\ell}(x_\ell)]_g\big\|^2+\sum_{g\in\hat{\mathcal{G}}_\ell}\big\|[x_\ell-x^*-x_\ell]_g\big\|^2\\
&= \sum_{g\in\tilde{\mathcal{G}}_\ell}\Big(\|[x_\ell-x^*]_g\|^2-2\alpha_\ell[x_\ell-x^*]_g^\top[\nabla\Psi(x_\ell)+e_{B_\ell}(x_\ell)]_g+\alpha_\ell^2\|[\nabla\Psi(x_\ell)+e_{B_\ell}(x_\ell)]_g\|^2\Big)+\sum_{g\in\hat{\mathcal{G}}_\ell}\|[x^*]_g\|^2\\
&\le \sum_{g\in\tilde{\mathcal{G}}_\ell}\Big(\|[x_\ell-x^*]_g\|^2-\Big(\tfrac{2\alpha_\ell}{L}-\alpha_\ell^2\Big)\|[\nabla\Psi(x_\ell)]_g\|^2-2\alpha_\ell[x_\ell-x^*]_g^\top[e_{B_\ell}(x_\ell)]_g+\alpha_\ell^2\|[e_{B_\ell}(x_\ell)]_g\|^2+2\alpha_\ell^2[\nabla\Psi(x_\ell)]_g^\top[e_{B_\ell}(x_\ell)]_g\Big)\\
&\le \sum_{g\in\tilde{\mathcal{G}}_\ell}\Big(\|[x_\ell-x^*]_g\|^2-\Big(\tfrac{2\alpha_\ell}{L}-\alpha_\ell^2\Big)\|[\nabla\Psi(x_\ell)]_g\|^2+(2\alpha_\ell+2\alpha_\ell^2L)\|[x_\ell-x^*]_g\|\,\|[e_{B_\ell}(x_\ell)]_g\|+\alpha_\ell^2\|[e_{B_\ell}(x_\ell)]_g\|^2\Big)\\
&\le \sum_{g\in\tilde{\mathcal{G}}_\ell}\Big(\|[x_\ell-x^*]_g\|^2-\Big(\tfrac{2\alpha_\ell}{L}-\alpha_\ell^2\Big)\|[\nabla\Psi(x_\ell)]_g\|^2\Big)+(2\alpha_\ell+2\alpha_\ell^2L)\|x_\ell-x^*\|\,\|e_{B_\ell}(x_\ell)\|+\alpha_\ell^2\|e_{B_\ell}(x_\ell)\|^2. \quad (57\text{–}58)
\end{aligned}$$

On the other hand, by the definition of $e_B(x)$ in (49), we have

$$e_B(x) = [\nabla f_B(x)-\nabla f(x)]_{\mathcal{I}^{\neq 0}(x)} = \frac{1}{|B|}\sum_{j\in B}[\nabla f_j(x)]_{\mathcal{I}^{\neq 0}(x)}-\frac{1}{N}\sum_{i=1}^N[\nabla f_i(x)]_{\mathcal{I}^{\neq 0}(x)} = \frac{1}{N}\sum_{j\in B}\frac{N-|B|}{|B|}[\nabla f_j(x)]_{\mathcal{I}^{\neq 0}(x)}-\frac{1}{N}\sum_{\substack{i=1\\ i\notin B}}^N[\nabla f_i(x)]_{\mathcal{I}^{\neq 0}(x)}. \quad (59)$$

Taking norms on both sides of (59) and using the triangle inequality results in

$$\|e_B(x)\| \le \frac{1}{N}\cdot\frac{N-|B|}{|B|}\cdot|B|M+\frac{1}{N}(N-|B|)M = \frac{2(N-|B|)M}{N}. \quad (60)$$

Since $\alpha_\ell\le 1$ and $|B_\ell|\ge N-\frac{N}{2M}$, we have $\alpha_\ell\|e_{B_\ell}(x_\ell)\|\le 1$. Then, combining with $\alpha_\ell\le 1/L$, (58) can be further simplified to

$$\|x_{\ell+1}-x^*\|^2 \le \|x_\ell-x^*\|^2+4\alpha_\ell\|x_\ell-x^*\|\,\|e_{B_\ell}(x_\ell)\|+\alpha_\ell\|e_{B_\ell}(x_\ell)\|. \quad (61)$$

Following from the assumption that $\|x_\ell-x^*\|\le R$, (61) can be further simplified to

$$\|x_{\ell+1}-x^*\|^2 \le \|x_\ell-x^*\|^2+4\alpha_\ell R\|e_{B_\ell}(x_\ell)\|+\alpha_\ell\|e_{B_\ell}(x_\ell)\| \le \|x_\ell-x^*\|^2+(4R+1)\alpha_\ell\|e_{B_\ell}(x_\ell)\|. \quad (62)$$

Summing both sides of (62) from $\ell=K$ to $\ell=K+t-1$ results in

$$\|x_{K+t}-x^*\|^2 \le \|x_K-x^*\|^2+(4R+1)\sum_{\ell=K}^{K+t-1}\alpha_\ell\|e_{B_\ell}(x_\ell)\|. \quad (63)$$

It follows from Lemma 6 that

$$\sum_{\ell=K}^{\infty}\alpha_\ell\|e_{B_\ell}(x_\ell)\| \le \frac{3R^2}{4(4R+1)} \quad (64)$$

holds with probability at least $1-\frac{1}{\theta^2}$. Thus

$$\|x_{K+t}-x^*\|^2 \le \|x_K-x^*\|^2+(4R+1)\sum_{\ell=K}^\infty\alpha_\ell\|e_{B_\ell}(x_\ell)\| \le \frac{R^2}{4}+(4R+1)\cdot\frac{3R^2}{4(4R+1)} = \frac{R^2}{4}+\frac{3R^2}{4} = R^2 \quad (65)$$

holds with probability at least $1-\frac{1}{\theta^2}$, which completes the proof.

Based on the above lemmas, Lemma 8 below shows that if the initial iterate of the Half-Space Step is close enough to $x^*$, the step size $\alpha_k$ decreases polynomially, and the mini-batch size $|B_k|$ increases polynomially, then $x^*$ inhabits all subsequent reduced spaces $\{\mathcal{S}_k\}_{k=K}^\infty$ constructed in the Half-Space Step with high probability.

Lemma 8. Suppose $\|x_K-x^*\|\le\frac{R}{2}$, $K\ge N_P$, $k=K+t$, $t\in\mathbb{Z}_+$, $0<\alpha_k=\mathcal{O}(1/(\sqrt{N}t))\le\min\{\frac{2(1-\epsilon)}{L},\frac{1}{L},\frac{2\delta_1-R-\epsilon(2\delta_2+R)}{M}\}$ and $|B_k|=\mathcal{O}(t)\ge N-\frac{N}{2M}$. Then for any constant $\tau\in(0,1)$, $\|x_k-x^*\|\le R$ holds with probability at least $1-\tau$ for any $k\ge K$.

Proof. It follows from Lemma 4 and the assumptions of this lemma that $x^*\in\mathcal{S}_K$.
Moreover, it follows from the assumptions of this lemma, Lemmas 6 and 7, the definition of the finite-sum $f(x)$ in (1), and the error bound (60) that

$$\mathbb{P}\big(\{x_k\}_{k=K}^\infty\in\{x:\|x-x^*\|\le R\}^\infty\big) \ge \Big(1-\frac{1}{\theta^2}\Big)^{\mathcal{O}(N-K)} \ge 1-\tau, \quad (66)$$

where the last two inequalities come from the error vanishing to zero once $|B_k|$ reaches its upper bound $N$, and from $\theta$ being sufficiently large depending on $\tau$ and $\mathcal{O}(N-K)$.

Corollary 2. Lemma 8 further implies that $x^*$ inhabits all subsequent reduced spaces, i.e., $x^*\in\mathcal{S}_k$ for any $k\ge K$.

Next, we establish that after a finite number of iterations, HSPG generates iterates that inhabit the feasible domain $\mathcal{X}$ on which the Lipschitz continuity of $\nabla\Psi$ holds.

Lemma 9. Suppose the assumptions of Lemma 8 hold. Then after a finite number of iterations, all subsequent iterates satisfy $x_k\in\mathcal{X}$ with high probability.

Proof. It follows from Lemma 8 that all subsequent $x_k$ satisfy $\|x_k-x^*\|\le R$ with high probability. Combining with Lemma 3, we have $\mathcal{I}^{\neq 0}(x^*)\subseteq\mathcal{I}^{\neq 0}(x_k)$ for all $k\ge K$ with high probability. Then for any $g\in\mathcal{I}^{\neq 0}(x_k)$ there are two possibilities: either $g\in\mathcal{I}^{\neq 0}(x^*)$ or $g\in\mathcal{I}^{0}(x^*)$. For the first case, $g\in\mathcal{I}^{\neq 0}(x^*)\cap\mathcal{I}^{\neq 0}(x_k)$, it follows from the definitions of $R$ in (23) and $\delta_1$ in (20) that

$$\|[x_k-x^*]_g\| \le \|x_k-x^*\| \le R \le \delta_1 \;\Longrightarrow\; \big|\,\|[x^*]_g\|-\|[x_k]_g\|\,\big|\le\delta_1 \;\Longrightarrow\; \|[x_k]_g\| \ge \|[x^*]_g\|-\delta_1 \ge 2\delta_1-\delta_1 = \delta_1. \quad (67)$$

For any $g\in\mathcal{I}^{0}(x^*)\cap\mathcal{I}^{\neq 0}(x_k)$, by Algorithm 3 its norm satisfies $\delta_1 \ge \|[x_k-x^*]_g\| = \|[x_k]_g\| \ge \epsilon^t\|[x_K]_g\|$, where by Theorem 2 (shown in Appendix C.3), if $\|[x_k]_g\|\le\frac{2\alpha_k\delta_3}{1-\epsilon+\alpha_kL}$, then $[x_{k+1}]_g$ equals zero and is thereafter fixed at zero since Algorithm 3 operates on $\mathcal{S}_k$ as in (7). Note $\alpha_k=\mathcal{O}(1/t)$; following (Karimi et al., 2016, Theorem 4) and (Drusvyatskiy & Lewis, 2018, Theorem 3.2), $\mathbb{E}[\|[x_k]_g\|^2]=\mathcal{O}(1/t)$. If $\epsilon>0$, then after a finite number of iterations $\mathcal{O}(1/\epsilon^2)$, each $g\in\mathcal{I}^{0}(x^*)\cap\mathcal{I}^{\neq 0}(x_k)$ becomes zero. If $\epsilon=0$, note $|B_k|=\mathcal{O}(t)$ and $f$ is a finite sum, so a similar result holds by (Gower, 2018, Theorem 2.3, Theorem 3.2) ($f$ additionally needs strong convexity on $\mathcal{X}$). Hence with high probability, after a finite number of iterations, denoted by $T$, all subsequent $x_k$, $k\ge K+T$, inhabit $\mathcal{X}$. Regarding $\|[x_k]_g\|$ for $g\in\mathcal{I}^{0}(x^*)\cap\mathcal{I}^{\neq 0}(x_k)$ and $K\le k\le K+T$, note that $\epsilon^t\|[x_K]_g\|$ is also bounded below by the constant $\epsilon^T\|[x_K]_g\|>0$ given $x_K$; for simplicity, denote the Lipschitz constant of $[\nabla\Psi(x_k)]_g$ by $L$ as well.

We now prove the first main theorem of HSPG, i.e., Theorem 1.

Proof of Theorem 1:

We know that Algorithm 1 performs an infinite sequence of iterations. It follows from Corollary 1 that for any $\ell\in\mathbb{Z}_+$,

$$\mathbb{E}[\Psi(x_K)]-\mathbb{E}[\Psi(x_{\ell+1})] = \sum_{k=K}^{\ell}\big\{\mathbb{E}[\Psi(x_k)]-\mathbb{E}[\Psi(x_{k+1})]\big\} \ge \sum_{K\le k\le\ell}\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\nabla\Psi(x_k)]_g\|^2+\sum_{K\le k\le\ell}\Big(\tfrac{1-\epsilon}{\alpha_k}-\tfrac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\|[x_k]_g\|^2. \quad (69)$$

Combining with the assumption that $\Psi$ is bounded below and letting $\ell\to\infty$, we obtain

$$\sum_{k\ge K}\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\nabla\Psi(x_k)]_g\|^2+\sum_{k\ge K}\Big(\tfrac{1-\epsilon}{\alpha_k}-\tfrac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\|[x_k]_g\|^2 < \infty. \quad (70)$$

By Algorithm 3, variables in $\mathcal{I}^{0}(x_k)$ are fixed during the $k$th Half-Space Step and $n$ is finite, so the group projection occurs only finitely many times; consequently,

$$\sum_{k\ge K}\Big(\tfrac{1-\epsilon}{\alpha_k}-\tfrac{L}{2}\Big)\sum_{g\in\hat{\mathcal{G}}_k}\|[x_k]_g\|^2 < \infty. \quad (71)$$

Thus (70) implies

$$\sum_{k\ge K}\Big(\alpha_k-\tfrac{\alpha_k^2L}{2}\Big)\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\nabla\Psi(x_k)]_g\|^2 = \sum_{k\ge K}\alpha_k\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\nabla\Psi(x_k)]_g\|^2-\sum_{k\ge K}\frac{\alpha_k^2L}{2}\sum_{g\in\tilde{\mathcal{G}}_k}\mathbb{E}\|[\nabla\Psi(x_k)]_g\|^2 < \infty. \quad (72\text{–}73)$$

Since $\alpha_k=\mathcal{O}(1/(\sqrt{N}t))$, we have $\sum_{k\ge K}\alpha_k=\infty$, which together with the remaining argument yields $\mathbb{P}\big(\lim_{k\to\infty}\mathbb{E}[\|\xi_{\alpha_k,B_k}(x_k)\|]=0\big)\ge 1-\tau$.

C.3 PROOF OF THEOREM 2

In this appendix, we compare the group sparsity identification properties of HSPG and Prox-SG. We first show the generic sparsity identification property of Prox-SG for any mixed $\ell_1/\ell_p$ regularization with $p\ge 1$.

Lemma 10. If $\|x_k-x^*\|_{\bar p}\le\min\{\delta_3/L,\alpha_k\delta_3\}$, where $1/p+1/\bar p=1$ ($\bar p=\infty$ if $p=1$), then Prox-SG yields $[x_{k+1}]_g=0$ for each $g\in\mathcal{I}^{0}(x^*)$, i.e., $\mathcal{I}^{0}(x^*)\subseteq\mathcal{I}^{0}(x_{k+1})$.

Proof. It follows from the reverse triangle inequality, basic norm inequalities, the Lipschitz continuity of $\nabla f(x)$ and the assumption of this lemma that for any $g\in\mathcal{G}$,

$$\|[\nabla f_{B_k}(x_k)]_g\|_{\bar p}-\|[\nabla f_{B_k}(x^*)]_g\|_{\bar p} \le \|[\nabla f_{B_k}(x_k)-\nabla f_{B_k}(x^*)]_g\|_{\bar p} \le \|\nabla f_{B_k}(x_k)-\nabla f_{B_k}(x^*)\|_{\bar p} \le L\|x_k-x^*\|_{\bar p} \le L\cdot\frac{\delta_3}{L} = \delta_3. \quad (78)$$

By (78), we have that for any $g\in\mathcal{I}^{0}(x^*)$,

$$\|[\nabla f_{B_k}(x_k)]_g\|_{\bar p} \le \|[\nabla f_{B_k}(x^*)]_g\|_{\bar p}+\delta_3 \le \lambda-2\delta_3+\delta_3 = \lambda-\delta_3. \quad (79)$$

Combining (79) with the assumption of this lemma, the following holds for any $\alpha_k>0$:

$$\|[x_k-\alpha_k\nabla f_{B_k}(x_k)]_g\|_{\bar p} \le \|[x_k]_g\|_{\bar p}+\alpha_k\|[\nabla f_{B_k}(x_k)]_g\|_{\bar p} \le \alpha_k\delta_3+\alpha_k(\lambda-\delta_3) = \alpha_k\lambda, \quad (80)$$

which further implies that the Euclidean projection onto the dual-norm ball satisfies

$$\mathrm{Proj}_{\mathcal{B}(\|\cdot\|_{\bar p},\,\alpha_k\lambda)}\big([x_k-\alpha_k\nabla f_{B_k}(x_k)]_g\big) = [x_k-\alpha_k\nabla f_{B_k}(x_k)]_g. \quad (81)$$

Combining (81), the fact that the proximal operator is the residual of the identity operator minus the Euclidean projection onto the dual-norm ball, and $[x_k]_g=0$ for any $g\in\mathcal{I}^{0}(x^*)$ (Chen, 2018), we have

$$[x_{k+1}]_g = \mathrm{Prox}_{\alpha_k\lambda\|\cdot\|_p}\big([x_k-\alpha_k\nabla f_{B_k}(x_k)]_g\big) = \big(I-\mathrm{Proj}_{\mathcal{B}(\|\cdot\|_{\bar p},\,\alpha_k\lambda)}\big)\big([x_k-\alpha_k\nabla f_{B_k}(x_k)]_g\big) = [x_k-\alpha_k\nabla f_{B_k}(x_k)]_g-[x_k-\alpha_k\nabla f_{B_k}(x_k)]_g = 0;$$

consequently $\mathcal{I}^{0}(x^*)\subseteq\mathcal{I}^{0}(x_{k+1})$, which completes the proof.

Proof (of Proposition 2). By conditions (A1, A2, A3), Assumption 3.1 and Theorem 3.2 in Rosasco et al. (2019), we have for any $k\ge 2$,

$$\mathbb{E}[\|x_k-x^*\|^2] \le \mathbb{E}[\|x_{k_0}-x^*\|^2]\,\frac{k_0}{k}+\frac{\sigma^2}{\mu^2\beta^2}\cdot\frac{\log(k-1)}{k}.$$

Let $\mathbb{E}[\|x_{k_0}-x^*\|^2]=s_{k_0}$. For any $\tau\in(0,1)$, there exists an $N_P\in\mathbb{Z}_+$ satisfying

$$N_P \ge \max\left\{\frac{8k_0s_{k_0}}{R^2\tau},\;\frac{8\sigma^2\log(N_P-1)}{\mu^2\beta^2R^2\tau}\right\},$$

so that $\mathbb{E}[\|x_{N_P}-x^*\|^2]\le\frac{R^2\tau}{4}$. Therefore, by Markov's inequality,

$$\|x_{N_P}-x^*\|^2 \le \frac{R^2}{4} \;\Longleftrightarrow\; \|x_{N_P}-x^*\| \le \frac{R}{2}$$

holds with probability at least $1-\tau$.

D ADDITIONAL NUMERICAL EXPERIMENTS

In this section, we provide additional numerical experiments to (i) demonstrate the validity of HSPG's group sparsity identification; (ii) provide a comprehensive comparison with Prox-SG, RDA and Prox-SVRG on benchmark convex problems; and (iii) give more details of the non-convex deep learning experiments shown in the main body.

D.1 LINEAR REGRESSION ON SYNTHETIC DATA

We first numerically validate the group sparsity identification of the proposed HSPG on linear regression problems with mixed $\ell_1/\ell_2$ regularization using synthetic data. Given a data matrix $A\in\mathbb{R}^{N\times n}$ consisting of $N$ instances and the target variable $y\in\mathbb{R}^N$, we are interested in the following problem:

$$\min_{x\in\mathbb{R}^n}\;\frac{1}{2N}\|Ax-y\|^2+\lambda\sum_{g\in\mathcal{G}}\|[x]_g\|. \quad (92)$$

Our goal is to empirically show that HSPG is able to identify the ground-truth zero groups. We conduct the experiments as follows: (i) generate the data matrix $A$ whose elements are uniformly distributed in $[-1,1]$; (ii) generate a vector $x^*$ serving as the ground-truth solution, whose elements are uniformly distributed in $[-1,1]$ and whose coordinates are equally divided into 10 groups ($|\mathcal{G}|=10$); (iii) randomly set a number of groups of $x^*$ to 0 according to a pre-specified group sparsity ratio; (iv) compute the target variable $y=Ax^*$; (v) solve problem (92) for $x$ given $A$ and $y$ only, then evaluate the Intersection over Union (IoU) between the zero groups of the solution estimate $\hat x$ computed by HSPG and those of the ground truth $x^*$. We test HSPG on (92) under different problem settings. For a slim matrix $A$ with $N\ge n$, we test group sparsity ratios in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$; for a fat matrix $A$ with $N<n$, we only test one group sparsity value, since recovering $x^*$ requires the number of non-zero elements in $x^*$ to be bounded by $N$. Throughout the experiments, we set $\lambda$ to $100/N$, the mini-batch size $|B|$ to 64, the step size $\alpha_k$ to 0.1 (constant), and fine-tune ε per problem. Based on a statistical test of objective function stationarity similar to (Zhang et al., 2020), we switch to the Half-Space Step after roughly 30 epochs.
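The protocol (i)-(v) above can be sketched as follows; the HSPG solve itself is omitted, so the IoU metric is only sanity-checked on the ground truth, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, n_groups, sparsity = 200, 50, 10, 0.5

# (i)-(iii): data matrix and a ground-truth solution with zeroed-out groups
A = rng.uniform(-1.0, 1.0, size=(N, n))
x_star = rng.uniform(-1.0, 1.0, size=n)
groups = np.array_split(np.arange(n), n_groups)
zero_groups = rng.choice(n_groups, size=int(sparsity * n_groups), replace=False)
for gi in zero_groups:
    x_star[groups[gi]] = 0.0
y = A @ x_star  # (iv)

def zero_group_iou(x_hat, x_ref, groups, tol=1e-8):
    """IoU between the sets of zero groups of two solutions (step (v))."""
    z_hat = {i for i, g in enumerate(groups) if np.abs(x_hat[g]).max() <= tol}
    z_ref = {i for i, g in enumerate(groups) if np.abs(x_ref[g]).max() <= tol}
    return len(z_hat & z_ref) / max(len(z_hat | z_ref), 1)

# a solver estimate x_hat would be produced by running HSPG on (92);
# here we only sanity-check the metric against the ground truth itself
assert zero_group_iou(x_star, x_star, groups) == 1.0
```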
Table 2 shows that under each setting the proposed HSPG correctly identifies the groups of zeros, as indicated by IoU$(\hat x, x^*)=1.0$, which is strong evidence of the correctness of HSPG's group sparsity identification.

D.2 LOGISTIC REGRESSION ON BENCHMARK DATASETS

We then focus on the benchmark convex logistic regression problem with mixed $\ell_1/\ell_2$ regularization: given $N$ examples $(d_1,l_1),\dots,(d_N,l_N)$ with $d_i\in\mathbb{R}^n$ and $l_i\in\{-1,1\}$, solve

$$\min_{(x;b)\in\mathbb{R}^{n+1}}\;\frac{1}{N}\sum_{i=1}^N\log\big(1+e^{-l_i(x^\top d_i+b)}\big)+\lambda\sum_{g\in\mathcal{G}}\|[x]_g\| \quad (93)$$

for binary classification with a bias $b\in\mathbb{R}$. We set the regularization parameter $\lambda$ to $100/N$ throughout the experiments, since it yields highly sparse solutions and low objective values $f$; we equally decompose the variables into 10 groups to form $\mathcal{G}$, and test problem (93) on 8 standard publicly available large-scale datasets from the LIBSVM repository (Chang & Lin, 2011), summarized in Table 3. All convex experiments are conducted on a 64-bit operating system with an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz and 32 GB of RAM. We run the solvers for a maximum of 60 epochs. The mini-batch size $|B|$ is set to $\min\{256, 0.01N\}$ similarly to (Yang et al., 2019). The step-size setting for $\alpha_k$ follows (Xiao & Zhang, 2014, Section 4). In particular, we first compute a Lipschitz constant $L$ as $\max_i\|d_i\|^2/4$, then fine-tune and select the constant $\alpha_k\equiv\alpha=1/L$ for Prox-SG and Prox-SVRG since it exhibits the best results. For RDA, the step-size parameter $\gamma$ is fine-tuned as the best-performing value among all powers of 10. For HSPG, we set $\alpha_k$ the same as for Prox-SG and Prox-SVRG. We set $N_P$ as $30N/|B|$ so that the Half-Space Step is triggered after 30 epochs of Prox-SG Steps, similarly to Appendix D.1, and report results for two values of the control parameter ε in (9), namely 0 and 0.05. The final objective values $\Psi$ and $f$ and the group sparsity of the solutions are reported in Tables 4-6, where we mark the best values in bold to facilitate comparison. Furthermore, Figure 3 plots the relative runtime of these solvers for each dataset, scaled by the runtime of the most time-consuming solver.
Table 6 shows that HSPG is clearly the best solver at exploring group sparsity in the solutions. In fact, HSPG with ε = 0.05 performs best on all datasets except ijcnn1. Prox-SVRG is the second-best solver on group sparsity exploration, which demonstrates that variance-reduction techniques work well for promoting sparsity in convex settings, but not in non-convex settings. HSPG with ε = 0 performs much better than Prox-SG, which matches the superior sparsity recovery property of HSPG stated in Theorem 2 even for ε = 0. Moreover, as shown in Tables 4 and 5, all solvers perform quite competitively in terms of final objective values (rounded to 3 decimals) except RDA, which demonstrates that HSPG converges comparably to Prox-SG and Prox-SVRG in practice. Finally, Figure 3 indicates that Prox-SG, RDA and HSPG have similar computational cost, except Prox-SVRG due to its periodic full-gradient computations.

D.3 DEEP LEARNING EXPERIMENTS

We conduct all deep learning experiments on one GeForce GTX 1080 Ti GPU, and describe in detail how to fine-tune λ and the control parameter ε in (9).

Fine-tuning λ: The parameter λ balances the sparsity level of the exact optimal solution against the bias of the model estimation. A larger λ encourages higher sparsity but may hurt the objective. Therefore, to obtain solutions of both high group sparsity and low objective value, we need to fine-tune λ carefully. In our experiments, we iterate λ through all powers of 10 from 10^-5 to 10^-1 by running Prox-SG, and pick the largest λ that achieves the same level of test accuracy as the model trained without any regularization.

Fine-tuning ε: According to Theorem 2, a larger ε results in faster group sparsity identification, while by Lemma 1 a too-large ε may cause a significant regression of the target objective Ψ, i.e., the Ψ value increases substantially. Hence, from the optimization point of view, we search for a proper ε as follows: start from ε = 0.0 and the model trained with N_P Prox-SG Steps, incrementally increase ε by 0.01 and check whether Ψ exhibits an obvious increase on the first Half-Space Step, then accept the largest ε without regression on Ψ as the fine-tuned value reported in the main body of the paper. In particular, the fine-tuned ε's equal 0.03, 0.05, 0.02 and 0.02 for VGG16 on CIFAR10, VGG16 on Fashion-MNIST, ResNet18 on CIFAR10 and ResNet18 on Fashion-MNIST, respectively. Note that, from the perspective of different applications, there are different criteria for fine-tuning ε; e.g., for model compression, we may accept ε based on the validation accuracy regression in order to reach higher group sparsity.

Final f comparison: Additionally, we report the final f comparison in Table 7 and its evolution on ResNet18 with CIFAR10 in Figure 4, where all tested algorithms achieve competitive f values, as they do in convex settings.
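The incremental ε search described above can be sketched as a simple loop. Here `eval_psi` is a hypothetical callable standing in for resuming training from the N_P-th Prox-SG checkpoint and measuring Ψ after the first Half-Space Step; the toy stand-in below is purely illustrative.

```python
def search_epsilon(eval_psi, eps_grid, tol=0.0):
    """Accept the largest eps whose objective after the first Half-Space Step
    shows no obvious increase over the eps = 0 baseline."""
    baseline = eval_psi(0.0)
    best = 0.0
    for eps in eps_grid:          # e.g., increments of 0.01
        if eval_psi(eps) <= baseline + tol:
            best = eps
        else:
            break                 # Psi regressed: stop and keep the last good eps
    return best

# toy stand-in: Psi regresses once eps exceeds 0.03
psi = lambda eps: 1.0 if eps <= 0.03 else 1.5
assert search_epsilon(psi, [0.01, 0.02, 0.03, 0.04, 0.05]) == 0.03
```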
The evolution of f is similar to that of Ψ: the raw objective f generally decreases monotonically for small ε from 0 to 0.02, and experiences a mild pulse after the switch to the Half-Space Step for larger ε, e.g., 0.05, which matches Lemma 1.



Group sparsity is measured by the number of zero groups, where a zero group means all of its variables are exactly zero. When Ω(x) is ‖x‖₁, where each g ∈ G is a singleton, S_k becomes an orthant face (Chen et al., 2020).



Figure 1: Illustration of Half-Space Step with projection in (9), where G = {{1, 2}}.

Figure 2: On ResNet18 with CIFAR10, (a)-(c): Evolution of Ψ, group sparsity ratio and testing accuracy, (d): HSPG versus Prox-SG* and Prox-SVRG* (Prox-SG and Prox-SVRG with simple truncation mechanism).

Figure 3: Relative runtime.

Figure 4: Evolution of f value on ResNet18 with CIFAR10.

We state the Half-Space Stochastic Projected Gradient (HSPG) method in Algorithm 1. In general, it contains two stages: an Initialization Stage and a Group-Sparsity Stage. The Initialization Stage employs the Prox-SG Step (Algorithm 2) to search for a close-enough but usually non-sparse solution estimate. The second and fundamental stage then runs the Half-Space Step (Algorithm 3), starting from that non-sparse estimate, to effectively exploit the group sparsity within a sequence of reduced spaces, and converges to group-sparse solutions with a theoretical convergence guarantee.
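As a rough self-contained illustration of the two-stage scheme (a sketch, not the authors' implementation), the code below runs Prox-SG Steps via block soft-thresholding and then switches to Half-Space Steps of the form (9)/(11) on a toy quadratic problem; all names are illustrative, and `groups` is assumed to partition the coordinates.

```python
import numpy as np

def hspg_sketch(grad_f, x0, groups, lam, alpha, eps, n_prox, n_half, rng):
    """Two-stage sketch of Algorithm 1: Prox-SG Steps (Initialization Stage),
    then Half-Space Steps (Group-Sparsity Stage). `grad_f(x, rng)` returns a
    (possibly stochastic) gradient of f."""
    x = x0.copy()
    for _ in range(n_prox):            # Stage 1: proximal stochastic gradient
        v = x - alpha * grad_f(x, rng)
        for g in groups:               # block soft-thresholding of each group
            ng = np.linalg.norm(v[g])
            x[g] = 0.0 if ng <= alpha * lam else (1.0 - alpha * lam / ng) * v[g]
    for _ in range(n_half):            # Stage 2: half-space projected gradient
        trial = x - alpha * grad_f(x, rng)
        for g in groups:
            xg = x[g]
            ng = np.linalg.norm(xg)
            if ng == 0.0:              # zero groups stay zero
                continue
            if trial[g] @ xg > (alpha * lam + eps * ng) * ng:
                x[g] = trial[g] - alpha * lam * xg / ng
            else:
                x[g] = 0.0             # project the whole group to zero
    return x
```

On a toy objective f(x) = ½‖x − c‖² whose target c has an all-zero group, the sketch zeroes that group during Stage 1 and keeps it zero during Stage 2, while the non-zero group remains active.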

Final Ψ/group sparsity ratio/testing accuracy for tested algorithms on non-convex problems.

and $\sum_{k\ge K}\alpha_k^2<\infty$. Combining with (72) and the boundedness of $\partial\Psi$, this implies that the expected (sub)gradient norms over $\tilde{\mathcal{G}}_k$ vanish along the iterates. It follows from the assumptions of this theorem, Lemmas 3 to 8 and Corollary 2 that, with probability at least $1-\tau$, $x^*$ inhabits $\mathcal{S}_k$ for each $k\ge K$. Note that as $|B_k|=\mathcal{O}(t)$ increases linearly, the error of the gradient estimate vanishes. Hence (76) naturally implies that the sequence $\{x_k\}_{k\in\mathcal{K}}$ converges to some stationary point with high probability, and we can extend $\mathcal{K}$ to $\{k:k\ge K\}$ due to the bounded distance to the optimal solution shown in Lemma 8. By the above, we conclude that $\mathbb{P}\big(\lim_{k\to\infty}\mathbb{E}[\|\xi_{\alpha_k,B_k}(x_k)\|]=0\big)\ge 1-\tau$.

Linear regression problem settings and IoU of the recovered solutions by HSPG.

Summary of datasets.

Final objective values Ψ for tested algorithms on convex problems.

Final objective values f for tested algorithms on convex problems.

Group sparsity for tested algorithms on convex problems.

Final objective values f for tested algorithms on non-convex problems.


Now we establish the group sparsity identification of HSPG as Theorem 2.

Proof of Theorem 2: Suppose $\|[x_k]_g\|\le\frac{2\alpha_k\delta_3}{1-\epsilon+\alpha_kL}$. There is nothing to prove if $g\in\mathcal{I}^{0}(x^*)\cap\mathcal{I}^{0}(x_k)$. For $g\in\mathcal{I}^{0}(x^*)\cap\mathcal{I}^{\neq 0}(x_k)$, we compute $[\tilde{x}_{k+1}]_g^\top[x_k]_g$. By the Lipschitz continuity of $\nabla f$, we have a bound for each such group; combining with the definition of $\delta_3$ in (22), the projection test (9) combined with (83) can be further bounded, so that the group projection operator is triggered at $g$, mapping its variables to zero; then $g\in\mathcal{I}^{0}(x_{k+1})$, i.e., $[x_{k+1}]_g=0$. Therefore, the group sparsity of $x^*$ can be successfully identified by the Half-Space Step, i.e., $\mathcal{I}^{0}(x^*)\subseteq\mathcal{I}^{0}(x_{k+1})$.

C.4 UPPER BOUND OF $N_P$ UNDER STRONG CONVEXITY

Proposition 2. Suppose the following conditions hold:

• (A2) there exists a $\sigma>0$ such that $\mathbb{E}_B[\|\nabla f_B(x)-\nabla f(x)\|^2]\le\sigma^2$ for any mini-batch $B$;
• (A3) there exists a $\beta\in(0,1)$.

Set the step size $\alpha_k=\frac{1}{2\mu\beta k}$ and $k_0=\max\{1,\frac{1}{2\mu\beta}\}$. For any $\tau\in(0,1)$, there exists an $N_P\in\mathbb{Z}_+$ with

$$N_P \ge \max\left\{\frac{8k_0s_{k_0}}{R^2\tau},\;\frac{8\sigma^2\log(N_P-1)}{\mu^2\beta^2R^2\tau}\right\},$$

such that performing $N_P$ Prox-SG Steps yields

$$\|x_{N_P}-x^*\| \le R/2 \quad (87)$$

with probability at least $1-\tau$.

