IMPLICIT REGULARIZATION FOR GROUP SPARSITY

Abstract

We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a "diagonally grouped linear neural network". We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.

1. INTRODUCTION

Motivation. A salient feature of modern deep neural networks is that they are highly overparameterized, with many more parameters than training examples. Surprisingly, however, deep neural networks trained with gradient descent generalize quite well in practice, even without explicit regularization. One hypothesis is that the dynamics of gradient descent-based training itself induce some form of implicit regularization, biasing toward solutions of low complexity (Hardt et al., 2016; Neyshabur et al., 2017). Recent research in deep learning theory has validated this hypothesis. A large body of work, which we survey below, has considered certain (restricted) families of linear neural networks and established two types of implicit regularization, standard sparse regularization and ℓ2-norm regularization, depending on how gradient descent is initialized. On the other hand, the role of the network architecture, or the way the model is parameterized, in implicit regularization is less well understood. Does there exist a parameterization that promotes implicit regularization of gradient descent towards richer structures beyond standard sparsity?

In this paper, we analyze a simple, prototypical hierarchical architecture for which gradient descent induces group-sparse regularization. Our finding, that finer, structured biases can be induced via gradient dynamics, highlights the richness of co-designing neural networks along with optimization methods to produce more sophisticated regularization effects.

Background. Many recent theoretical efforts have revisited traditional, well-understood problems such as linear regression (Vaskevicius et al., 2019; Li et al., 2021; Zhao et al., 2019), matrix factorization (Gunasekar et al., 2018b; Li et al., 2018; Arora et al., 2019) and tensor decomposition (Ge et al., 2017; Wang et al., 2020) from the perspective of neural network training.
For nonlinear models with squared error loss, Williams et al. (2019) and Jin & Montúfar (2020) study the implicit bias of gradient descent in wide depth-2 ReLU networks with input dimension 1. Other works (Gunasekar et al., 2018c; Soudry et al., 2018; Nacson et al., 2019) show that gradient descent biases the solution towards max-margin (or minimum ℓ2-norm) solutions over separable data. Outside of implicit regularization, several other works study the inductive bias of network architectures under explicit ℓ2 regularization on the model weights (Pilanci & Ergen, 2020; Sahiner et al., 2020). For multichannel linear convolutional networks, Jagadeesan et al. (2021) show that ℓ2-norm minimization of weights leads to a norm regularizer on predictors, where the norm is given by a semidefinite program (SDP). The representation cost in predictor space induced by explicit ℓ2 regularization on (various versions of) linear neural networks is studied in Dai et al. (2021), which demonstrates several interesting induced regularizers on the linear predictors, such as ℓp quasi-norms and group quasi-norms. However, these results are silent on the behavior of gradient descent-based training without explicit regularization.

In light of the above results, we ask the following question: beyond the ℓ2-norm, sparsity, and low-rankness, can gradient descent induce other forms of implicit regularization?

Our contributions. In this paper, we rigorously show that a diagonally-grouped linear neural network (see Figure 1b) trained by gradient descent with (proper/partial) weight normalization induces group-sparse regularization: a form of structured regularization that, to the best of our knowledge, has not been provably established in previous work. One major approach to understanding the implicit regularization of gradient descent is based on its equivalence to mirror descent (on a different objective function) (e.g., Gunasekar et al., 2018a; Woodworth et al., 2020).
However, we show that, for the diagonally-grouped linear network architecture, the gradient dynamics are beyond mirror descent. We then analyze the convergence of gradient flow with early stopping under orthogonal design with possibly noisy observations, and show that the obtained solution exhibits an implicit regularization effect towards structured (specifically, group) sparsity. In addition, we show that weight normalization can deal with instability related to the choices of learning rate and initialization. With weight normalization, we obtain a similar implicit regularization result in more general settings: orthogonal/non-orthogonal designs with possibly noisy observations. Moreover, the obtained solution achieves minimax-optimal error rates. Overall, compared to existing analyses of diagonal linear networks, our model design, which induces structured sparsity, exhibits provably improved sample complexity. In the degenerate case of size-one groups, our bounds coincide with previous results, and our approach can be interpreted as a new algorithm for sparse linear regression.

Our techniques. Our approach is built upon the power reparameterization trick, which has been shown to promote model sparsity (Schwarz et al., 2021). Raising the parameters of a linear model element-wise to the N-th power (N > 1) means that parameters of smaller magnitude receive smaller gradient updates, while parameters of larger magnitude receive larger updates. In essence, this leads to a "rich get richer" phenomenon during gradient-based training. Gissin et al. (2019) and Berthier (2022) analyze the gradient dynamics on a toy example and call this "incremental learning". Concretely, for a linear predictor w ∈ ℝ^p, if we reparameterize the model as w = u^{⊙N} − v^{⊙N} (where u^{⊙N} denotes the N-th element-wise power of u), then gradient descent biases the training towards sparse solutions.
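To illustrate the "rich get richer" effect with N = 2, here is a minimal numpy sketch (our own illustration, not the code used in the paper) of gradient descent on the reparameterization w = u^{⊙2} − v^{⊙2} with a small initialization; on a noiseless sparse problem it drives off-support entries toward zero without any explicit penalty. The hyperparameters and problem sizes below are ours:

```python
import numpy as np

def sparse_gd(X, y, n_iter=30000, lr=0.01, alpha=1e-4):
    """Gradient descent on the loss (1/2n) ||y - X(u*u - v*v)||^2.

    A small identical initialization `alpha` biases training toward
    sparse solutions (illustrative sketch; hyperparameters are ours).
    """
    n, p = X.shape
    u = np.full(p, alpha)
    v = np.full(p, alpha)
    for _ in range(n_iter):
        g = X.T @ (X @ (u * u - v * v) - y) / n   # gradient wrt the predictor w
        u -= lr * 2 * u * g                        # chain rule through +u**2
        v += lr * 2 * v * g                        # chain rule through -v**2
    return u * u - v * v

rng = np.random.default_rng(0)
n, p = 80, 200
X = rng.standard_normal((n, p))
w_star = np.zeros(p)
w_star[:3] = [1.0, -2.0, 1.5]                      # 3-sparse ground truth
y = X @ w_star                                     # noiseless observations
w_hat = sparse_gd(X, y)
```

Despite p > n and no ℓ1 penalty, the recovered `w_hat` is close to `w_star`, while entries off the support stay near the initialization scale.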
This reparameterization is equivalent to a diagonal linear network, as shown in Figure 1a. It is further studied in Woodworth et al. (2020) for interpolating predictors, where the authors show that a small enough initialization induces ℓ1-norm regularization. For noisy settings, Vaskevicius et al. (2019) and Li et al. (2021) show that gradient descent converges to sparse models with early stopping. In the special case of sparse recovery from under-sampled observations (compressive sensing), the optimal sample complexity can also be obtained via this reparameterization (Chou et al., 2021). Inspired by this approach, we study a novel model reparameterization of the form w = [w_1, . . . , w_L], where w_l = u_l^2 v_l for each group l ∈ {1, . . . , L}. (One way to interpret this model is to think of u_l as the "magnitude" and v_l as the "direction" of the subvector corresponding to each group; see Section 2 for details.) This corresponds to a special type of linear neural network architecture, as shown in Figure 1b. A related architecture has recently been studied in Dai et al. (2021), but there the authors focus on the bias induced by explicit ℓ2 regularization on the weights and do not investigate the effect of gradient dynamics. The diagonal linear network parameterization of Woodworth et al. (2020) and Li et al. (2021) does not suffer from identifiability issues. In contrast, in our setup the "magnitude" parameter u_l of each group interacts with the norm of the "direction", ‖v_l‖_2, causing a fundamental identifiability problem. By leveraging the layer balancing effect (Du et al., 2018) in DGLNN, we verify the group regularization effect implicit in gradient flow with early stopping. But gradient flow is idealized; for a more practical algorithm, we use a variant of gradient descent based on weight normalization, proposed in Salimans & Kingma (2016) and studied in more detail in Wu et al. (2020).
Weight normalization has been shown to be particularly helpful in stabilizing the effect of learning rates (Morwani & Ramaswamy, 2022; Van Laarhoven, 2017). With weight normalization, the learning is separated into magnitudes and directions. We derive the gradient dynamics on both magnitudes and directions with perturbations: the directions guide the magnitudes to grow, and as the magnitudes grow, the directions become more accurate. Thereby, we are able to establish the regularization effect implied by such gradient dynamics.

A remark on grouped architectures. Finally, we remark that grouping layers are commonly used in grouped CNNs and grouped attention mechanisms (Xie et al., 2017; Wu et al., 2021), which leads to parameter efficiency and better accuracy. Group sparsity is also useful for deep learning models on multi-omics data for survival prediction (Xie et al., 2019). We hope that our analysis of the diagonally grouped linear NN leads to a better understanding of the inductive biases of grouping-style architectures.

2. SETUP

Notation. Denote the set {1, 2, . . . , L} by [L] and the vector ℓ2 norm by ‖·‖. We use 1_p and 0_p to denote the p-dimensional vectors of all 1s and all 0s, respectively. Also, ⊙ represents entry-wise multiplication, whereas β^{⊙N} denotes the element-wise N-th power of a vector β. We use e_i to denote the i-th canonical basis vector. We write inequalities up to multiplicative constants using the notation ≲, whereby the constants do not depend on any problem parameter.

Observation model. Suppose that the index set

[p] = ∪_{l=1}^L G_l is partitioned into L disjoint (i.e., non-overlapping) groups G_1, G_2, . . . , G_L, where G_i ∩ G_j = ∅ for all i ≠ j. The size of G_l is denoted by p_l = |G_l| for l ∈ [L]. Let w* ∈ ℝ^p be a p-dimensional vector whose entries are non-zero only on a subset of groups. We posit a linear model of the data where observations (x_i, y_i) ∈ ℝ^p × ℝ, i ∈ [n], are given such that y_i = ⟨x_i, w*⟩ + ξ_i for i = 1, . . . , n, and ξ = [ξ_1, . . . , ξ_n] is a noise vector. Note that we do not impose any special restriction between n (the number of observations) and p (the dimension). We write the linear model in the following matrix-vector form: y = Xw* + ξ, with the n × p design matrix X = [X_1, X_2, . . . , X_L], where X_l ∈ ℝ^{n×p_l} represents the features from the l-th group G_l, for l ∈ [L]. We make the following assumptions on X:

Assumption 1. The design matrix X satisfies

sup_{‖β_1‖≤1, ‖β_2‖≤1} ⟨β_1, ((1/n) X_l^T X_l − I) β_2⟩ ≤ δ_in, where β_1, β_2 ∈ ℝ^{p_l},   (1)

sup_{‖β_1‖≤1, ‖β_2‖≤1} ⟨(1/√n) X_l β_1, (1/√n) X_{l'} β_2⟩ ≤ δ_out, where β_1 ∈ ℝ^{p_l}, β_2 ∈ ℝ^{p_{l'}}, l ≠ l',   (2)

for some constants δ_in, δ_out ∈ (0, 1). The first part (1) is a within-group eigenvalue condition, while the second part (2) is a between-group block coherence assumption. There are multiple ways to construct a sensing matrix fulfilling these two conditions (Eldar & Bolcskei, 2009; Baraniuk et al., 2010). One of them is based on the fact that random Gaussian matrices satisfy such conditions with high probability (Stojnic et al., 2009).

Reparameterization. Our goal is to learn the parameter w* from the data {(x_i, y_i)}_{i=1}^n, whose coefficients obey the group structure. Instead of imposing an explicit group-sparsity constraint on w (e.g., via weight penalization by group), we show that gradient descent on the unconstrained regression loss can still learn w*, provided we design a special reparameterization. Define a mapping g(·) : [p] → [L] from each index i to its group g(i).
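The constants δ_in and δ_out in Assumption 1 can be estimated empirically for a concrete design, since the suprema over unit balls equal the spectral norms ‖(1/n)X_l^T X_l − I‖_2 and ‖(1/n)X_l^T X_{l'}‖_2. The following sketch (our helper, not code from the paper) checks them for a random Gaussian design:

```python
import numpy as np

def coherence_constants(X, groups):
    """Empirical delta_in / delta_out from Assumption 1 for a fixed grouping."""
    n = X.shape[0]
    blocks = [X[:, list(g)] for g in groups]
    # within-group eigenvalue condition: max_l ||(1/n) X_l^T X_l - I||_2
    d_in = max(np.linalg.norm(B.T @ B / n - np.eye(B.shape[1]), 2) for B in blocks)
    # between-group block coherence: max_{l != l'} ||(1/n) X_l^T X_{l'}||_2
    d_out = max(np.linalg.norm(blocks[i].T @ blocks[j] / n, 2)
                for i in range(len(blocks))
                for j in range(len(blocks)) if i != j)
    return d_in, d_out

rng = np.random.default_rng(1)
n, p, size = 200, 40, 4                       # Gaussian design, 10 groups of size 4
X = rng.standard_normal((n, p))
groups = [range(i, i + size) for i in range(0, p, size)]
d_in, d_out = coherence_constants(X, groups)
```

For n much larger than the group size, both constants concentrate well below 1, consistent with the claim that random Gaussian designs satisfy the assumption with high probability.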
Each parameter is rewritten as w_i = u_{g(i)}^2 v_i for all i ∈ [p]. The parameterization G(·) : ℝ_+^L × ℝ^p → ℝ^p reads

[u_1, . . . , u_L, v_1, v_2, . . . , v_p] ↦ [u_1^2 v_1, u_1^2 v_2, . . . , u_L^2 v_p].

This corresponds to the 2-layer neural network architecture displayed in Figure 1b, in which W_1 = diag(v_1, . . . , v_p) and W_2 is "diagonally" tied within each group: W_2 = diag(u_1, . . . , u_1, u_2, . . . , u_2, . . . , u_L, . . . , u_L), with each u_l repeated p_l times.

Gradient dynamics. We learn u and v by minimizing the standard squared loss

L(u, v) = (1/2n) ‖y − X[(Du)^{⊙2} ⊙ v]‖^2,

where D ∈ ℝ^{p×L} is the group-membership matrix whose l-th column indicates G_l:

D = [1_{p_1} 0_{p_1} . . . 0_{p_1}; 0_{p_2} 1_{p_2} . . . 0_{p_2}; . . . ; 0_{p_L} 0_{p_L} . . . 1_{p_L}] ∈ ℝ^{p×L}.

By simple algebra, the gradients with respect to u and v read as follows:

∇_u L = (2/n) D^T [v ⊙ (X^T X((Du)^{⊙2} ⊙ v − w*) − X^T ξ) ⊙ (Du)],

∇_v L = (1/n) (X^T X((Du)^{⊙2} ⊙ v − w*) − X^T ξ) ⊙ (Du)^{⊙2}.

Denote r(t) = y − Σ_{l=1}^L u_l^2(t) X_l v_l(t). For each group l ∈ [L], the gradient flow reads

∂u_l(t)/∂t = (2/n) u_l(t) v_l(t)^T X_l^T r(t),   ∂v_l(t)/∂t = (1/n) u_l^2(t) X_l^T r(t).

Although we are not able to transform the gradient dynamics back onto w(t), due to the overparameterization, the extra factor u_l(t) on the group magnitude leads to an "incremental learning" effect.
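The gradient flow above can be simulated with a forward-Euler discretization. The following sketch uses our own toy setup (one active group, noiseless data, and v_l(0) taken as the normalized one-gradient-step direction X_l^T y / n, which anticipates the initialization discussed in Section 3.2) to illustrate the incremental-learning effect:

```python
import numpy as np

def dglnn_flow(X, y, groups, alpha=1e-3, dt=1e-3, steps=20000):
    """Forward-Euler simulation of the DGLNN gradient flow
    du_l/dt = (2/n) u_l v_l^T X_l^T r,  dv_l/dt = (1/n) u_l^2 X_l^T r."""
    n = X.shape[0]
    u = np.full(len(groups), alpha)                  # small group magnitudes
    v = []
    for g in groups:                                  # one-gradient-step directions
        d = X[:, g].T @ y / n
        v.append(d / np.linalg.norm(d))
    for _ in range(steps):
        w = np.concatenate([u[l] ** 2 * v[l] for l in range(len(groups))])
        r = y - X @ w
        for l, g in enumerate(groups):
            grad = X[:, g].T @ r / n
            u[l] += dt * 2 * u[l] * (v[l] @ grad)    # magnitude dynamics
            v[l] = v[l] + dt * u[l] ** 2 * grad      # direction dynamics
    return np.concatenate([u[l] ** 2 * v[l] for l in range(len(groups))])

rng = np.random.default_rng(2)
n, p = 100, 20
groups = [list(range(i, i + 4)) for i in range(0, p, 4)]  # 5 groups of size 4
w_star = np.zeros(p)
w_star[:4] = [1.0, 0.5, -0.5, 1.0]                        # group 0 is active
X = rng.standard_normal((n, p))
y = X @ w_star                                            # noiseless observations
w_hat = dglnn_flow(X, y, groups)
```

The active group's magnitude grows exponentially and locks in first, while the inactive groups' magnitudes barely move: the "rich get richer" effect at the group level.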

3.1. FIRST ATTEMPT: MIRROR FLOW

Existing results on implicit bias in overparameterized models are mostly based on recasting the training process from the parameter space {u(t), v(t)}_{t≥0} to the predictor space {w(t)}_{t≥0} (Woodworth et al., 2020; Gunasekar et al., 2018a). If properly performed, the induced dynamics in the predictor space can be analyzed by a classical algorithm: mirror descent (or mirror flow). Implicit regularization is demonstrated by showing that the limit point satisfies a KKT (Karush-Kuhn-Tucker) condition with respect to minimizing some regularizer R(·) among all possible solutions. At first, we were unable to express the gradient dynamics in Eq. (3) in terms of w(t) (i.e., in the predictor space), due to the complicated interactions between u and v. This hints that the training trajectory induced by an overparameterized DGLNN may not be analyzable by mirror flow techniques. In fact, we prove a stronger negative result and rigorously show that the corresponding dynamics cannot be recast as a mirror flow. Therefore, our subsequent analysis techniques are necessary and do not follow as a corollary from existing approaches. We first list two definitions from differential topology.

Definition 1. Let M be a smooth submanifold of ℝ^D. Given two C^1 vector fields X, Y on M, the Lie bracket of X and Y is [X, Y](x) := ∂Y(x)X(x) − ∂X(x)Y(x).

Definition 2. Let M be a smooth submanifold of ℝ^D. A C^2 parameterization G : M → ℝ^d is said to be commuting iff for any i, j ∈ [d], the Lie bracket [∇G_i, ∇G_j](x) = 0 for all x ∈ M.

The parameterization studied in most existing works on diagonal networks is separable, meaning that each parameter affects only one coordinate in the predictor space. In DGLNN, the parameterization is not separable, due to the shared parameter u_l within each group. We formally show that it is indeed not commuting.

Lemma 1. G(·) is not a commuting parameterization.
Non-commutativity of the parameterization implies that moving along −∇G_i and then −∇G_j differs from moving along −∇G_j first and then −∇G_i. This causes extra difficulty in analyzing the gradient dynamics. Li et al. (2022) study the equivalence between gradient flow on reparameterized models and mirror flow, and show that a commuting parameterization is a sufficient condition for a gradient flow under a given parameterization to simulate a mirror flow. A complementary necessary condition is also established on the Lie algebra generated by the gradients of the coordinate functions of G, with order higher than 2. We show that the parameterization G(·) violates this necessary condition.

Theorem 1. There exist an initialization [u_init, v_init] ∈ ℝ_+^L × ℝ^p and a time-dependent loss L_t such that the gradient flow under L_t ∘ G starting from [u_init, v_init] cannot be written as a mirror flow with respect to any Legendre function R under the loss L_t.

The detailed proof is deferred to the Appendix. Theorem 1 shows that the gradient dynamics implied by DGLNN cannot be emulated by mirror descent. Therefore, a different technique is needed to analyze the gradient dynamics and any associated implicit regularization effect.
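Lemma 1 can be checked numerically on the smallest non-trivial example: a single group of size two, G(u, v_1, v_2) = (u^2 v_1, u^2 v_2). The sketch below (ours) evaluates the Lie bracket of Definition 1 via finite differences; analytically the bracket equals (0, −4u^2 v_2, 4u^2 v_1), which is non-zero whenever u ≠ 0 and v ≠ 0:

```python
import numpy as np

def grad_G(x):
    """Gradients of G(u, v1, v2) = (u^2 v1, u^2 v2) wrt (u, v1, v2)."""
    u, v1, v2 = x
    return np.array([[2 * u * v1, u ** 2, 0.0],
                     [2 * u * v2, 0.0, u ** 2]])

def lie_bracket(i, j, x, eps=1e-6):
    """[grad G_i, grad G_j](x) = dGradj(x) gradi(x) - dGradi(x) gradj(x)."""
    field = lambda k, y: grad_G(y)[k]
    jac = lambda k: np.column_stack([                 # Jacobian by central differences
        (field(k, x + eps * e) - field(k, x - eps * e)) / (2 * eps)
        for e in np.eye(3)])
    return jac(j) @ field(i, x) - jac(i) @ field(j, x)

x = np.array([1.0, 1.0, 2.0])
bracket = lie_bracket(0, 1, x)
print(bracket)   # non-zero, so G is not commuting
```

At x = (1, 1, 2) the bracket evaluates to approximately (0, −8, 4), confirming the lemma on this example.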

3.2. LAYER BALANCING AND GRADIENT FLOW

Let us first introduce the relevant quantities. Following our reparameterization, we rewrite the true parameters for each group l as w*_l = (u*_l)^2 v*_l, with ‖v*_l‖_2 = 1 and v*_l ∈ ℝ^{p_l}. The support is defined at the group level, S = {l ∈ [L] : u*_l > 0}, and the support size is s = |S|. We denote u*_max = max{u*_l | l ∈ S} and u*_min = min{u*_l | l ∈ S}. The gradient dynamics of our reparameterization do not preserve ‖v_l(t)‖_2 = 1, which makes it difficult to identify the magnitudes of u_l and ‖v_l(t)‖_2 separately. For the balancing effect of Du et al. (2018) and Arora et al. (2018) applied to DGLNN, the initialization of the directions is crucial: the algorithm may fail, even with a very small initialization, when the direction is not accurate, as shown in Appendix E. Moreover, the balancing effect (Lemma 2) is sensitive to the step size, and errors may accumulate (Du et al., 2018). Weight normalization, a commonly used training technique, has been shown to be helpful in stabilizing the training process. The identifiability of the magnitude is naturally resolved by weight normalization on each v_l. Moreover, weight normalization allows for a larger step size on v, which makes the direction estimate at each step behave like that at the origin. This removes the restrictive assumption of orthogonal design. With these intuitions in mind, we study the gradient descent algorithm with weight normalization on v summarized in Algorithm 1. One advantage of our algorithm is that it converges with any unit-norm initialization v_l(0). The step size on u(t) is chosen small enough to enable incremental learning, whereas the step size on v(t) is chosen as η_{l,t} = 1/u_l^4(t), as prescribed by our theoretical investigation. For convenience, we define ζ = 80(‖(1/n) X^T ξ‖_∞ ∨ ε), for a precision parameter ε > 0. The convergence of Algorithm 1 is formalized as follows:

Theorem 3. Fix ε > 0.
Consider Algorithm 1 with u_l(0) = α < (ε/4) ∧ 1/(u*_max)^8 ∧ (u*_min)^2/(80L) for all l ∈ [L], any unit-norm initialization of v_l for each l ∈ [L], and γ ≤ 1/(20(u*_max)^2). Suppose Assumption 1 is satisfied with δ_in ≤ (u*_min)^2/(120(u*_max)^2) and δ_out ≤ (u*_min)^2/(120 s (u*_max)^2). There exist a lower bound on the number of iterations,

T_lb = log((u*_max)^2/(2α^2)) / (2 log(1 + (γ/2)(ζ ∨ (u*_min)^2))) + log(2(u*_max)^2/ζ) · 5/(2γ(ζ ∨ (u*_min)^2)),

and an upper bound,

T_ub ≥ 5/(16γ(ζ ∨ (u*_min)^2)) · log(1/α^4),

such that T_lb ≤ T_ub, and for any T_lb ≤ t ≤ T_ub,

‖u_l^2(t) v_l(t) − w*_l‖_∞ ≲ ‖(1/n) X^T ξ‖_∞ ∨ ε if l ∈ S, and ≲ α if l ∉ S.

Similarly to Theorem 2, Theorem 3 states error bounds for the estimation of the true weights w*. When α is small, the algorithm keeps all non-supported entries close to zero through the iterations while maintaining the guarantee for supported entries. Compared to works on implicit (unstructured) sparse regularization (Vaskevicius et al., 2019; Chou et al., 2021), our assumption on the incoherence parameter δ_out scales with 1/s, where s is the number of non-zero groups, instead of the total number of non-zero entries. Therefore, the relaxed bound on δ_out implies an improved sample complexity, which is also observed experimentally in Figure 4. We now state a corollary in a common setting with independent random noise, where (asymptotic) recovery of w* is possible.

Definition 3. A random variable Y is σ-sub-Gaussian if there exists σ > 0 such that for all t ∈ ℝ, E[e^{tY}] ≤ e^{σ^2 t^2 / 2}.

Corollary 1. Suppose the noise vector ξ has independent σ^2-sub-Gaussian entries and ε = 2σ√(2 log(2p)/n). Under the assumptions of Theorem 3, Algorithm 1 produces w(t) = (Du(t))^{⊙2} ⊙ v(t) satisfying ‖w(t) − w*‖_2^2 ≲ (s σ^2 log p)/n with probability at least 1 − 1/(8p^3), for any t such that T_lb ≤ t ≤ T_ub. Note that the error bound we obtain is minimax-optimal.
Despite these appealing properties of Algorithm 1, our theoretical results require a large step size on each v_l(t), which may cause instability at later stages of learning. We observe this instability numerically (see Figure 6, Appendix E). Although the estimation error of w* remains small (which aligns with our theoretical result), individual entries of v may fluctuate considerably. Indeed, the large step size is mainly introduced to maintain strong directional information extracted from the gradient of v_l(t), so as to stabilize the updates of u(t) in the early iterations. Therefore, we also propose Algorithm 2, a variant of Algorithm 1, in which we decrease the step size after a certain number of iterations.

The effectiveness of our algorithms. We start by demonstrating the convergence of the two proposed algorithms. In this experiment, we set n = 150 and p = 300. The number of non-zero entries is 9, divided into 3 groups of size 3. We run both Algorithms 1 and 2 with the same initialization α = 10^{-6}. The step size γ on u and the decreased step size η on v are both 10^{-3}. In Figure 2, we present the recovery error of w* on the left and the recovered group magnitudes on the right. As we can see, early stopping is crucial for reaching the structured sparse solution. In Figure 3, we present the recovered entries, recovered group magnitudes, and recovered directions for each group, from left to right. In addition to convergence, we also observe the incremental learning effect.

Structured sparsity versus standard sparsity. From our theory, we see that the block incoherence parameter scales with the number of non-zero groups, as opposed to the number of non-zero entries. As such, we can expect an improved sample complexity over estimators based on unstructured sparse regularization. We choose a larger support size of 16. The entries on the support are all 1 for simplicity. We apply Algorithm 2 with group size 4. The result is shown in Figure 4 (left).
We compare with the method of Vaskevicius et al. (2019), with parameterization w = u^{⊙2} − v^{⊙2}, designed for unstructured sparsity. We display the result in the right panel, where, interestingly, that algorithm fails to converge because of the insufficient number of samples.

Degenerate case. In the degenerate case where each group is of size 1, our reparameterization takes the simpler form w_i = u_i^2 sgn(v_i); that is, due to weight normalization, our method normalizes each v_i to 1 or −1 after each step. We demonstrate the efficacy of our algorithms even in this degenerate case. We set n = 80 and p = 200. The entries on the support are [1, −1, 1, −1, 1], with both positive and negative entries. We present the coordinate plot and the recovery error in Figure 5.

6. DISCUSSION

In this paper, we show that implicit regularization towards group-structured sparsity can be obtained by gradient descent (with weight normalization) for a certain, specially designed network architecture. Overall, we hope that such an analysis further enhances our understanding of neural network training. Future work includes relaxing the assumptions on the δ's in Theorem 2, and a rigorous analysis of modern grouping architectures as well as power parameterizations.



(a) Diagonal linear NN (DLNN). (b) Diagonally grouped linear NN (DGLNN).

Figure 1: An illustration of the two architectures for standard and group sparse regularization.

Figure 4: Comparison with reparameterization using standard sparsity. n = 100, p = 500.

Figure 5: Degenerate case when each group size is 1. The log ℓ2-error plot is repeated 30 times, and the mean is depicted. The shaded area indicates the region between the 25th and 75th percentiles.

Comparisons to related work on implicit and explicit regularization. Here, GD stands for gradient descent, (D)LNN/CNN for (diagonal) linear/convolutional neural network, and DGLNN for diagonally grouped linear neural network.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under grants CCF-1934904, CCF-1815101, and CCF-2005804. 

ANNEX

show that the gradient flow of multi-layer homogeneous functions effectively enforces the differences between squared norms across different layers to remain invariant. Following the same idea, we discover a similar balancing effect in DGLNN between the parameters u and v.

Lemma 2. For any l ∈ [L], the quantity u_l^2(t) − 2‖v_l(t)‖^2 remains invariant along the gradient flow.

The balancing result eliminates the identifiability issue on the magnitudes. As the coordinates within one group affect each other, the direction, which controls the growth rate of both u and v, needs to be determined as well. Note that the initialization can be obtained by a single step of gradient descent from 0. Lemma 3 suggests that the direction is close to the truth at initialization. We can further normalize v_l(0) based on the balancing criterion; the magnitude equality is then preserved along the flow by Lemma 2. However, ensuring the closeness of the direction throughout the gradient flow presents significant technical difficulties. That said, we are able to present a meaningful implicit regularization result for the gradient flow under orthogonal (and noisy) settings.

Theorem 2. Fix ε > 0. Under orthogonal design, there exist a lower bound and an upper bound on the time, T_l < T_u, for the gradient flow in Eq. (3), such that for any T_l ≤ t ≤ T_u, the estimation error on supported groups is small, while entries outside the support are controlled by θ^{3/2}.

Theorem 2 states error bounds for the estimation of the true weights w*. For entries outside the (true) support, the error is controlled by θ^{3/2}. When θ is small, the algorithm keeps all non-supported entries close to zero through the iterations while maintaining the guarantee for supported entries. Theorem 2 shows that, under the assumption of orthogonal design, gradient flow with early stopping obtains a solution with group sparsity.
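The balancing effect can be checked numerically: differentiating along the flow in Section 2 gives d(u_l^2)/dt = (4/n) u_l^2 v_l^T X_l^T r and d(‖v_l‖^2)/dt = (2/n) u_l^2 v_l^T X_l^T r, so u_l^2 − 2‖v_l‖^2 is conserved. A small forward-Euler simulation (our toy single-group setup) confirms that any drift is only discretization error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3                                # a single group of size 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -0.5, 0.25])

u, v = 0.5, rng.standard_normal(p)
inv0 = u ** 2 - 2 * (v @ v)                 # candidate conserved quantity

dt = 1e-5
for _ in range(20000):                      # forward-Euler integration of the flow
    r = y - X @ (u ** 2 * v)
    g = X.T @ r / n
    u, v = u + dt * 2 * u * (v @ g), v + dt * u ** 2 * g
inv1 = u ** 2 - 2 * (v @ v)
print(abs(inv1 - inv0))                     # ~0 up to O(dt) discretization error
```

Note the simultaneous (tuple) update: updating u and v from the same state is what makes the first-order drift cancel exactly, mirroring the continuous-time flow.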

4. GRADIENT DESCENT WITH WEIGHT NORMALIZATION

Algorithm 1: Gradient descent with weight normalization.

We now seek a more practical algorithm with more general assumptions and requirements on the initialization. To speed up the presentation, we directly discuss the corresponding variant of (the more practical) gradient descent instead of gradient flow.

Published as a conference paper at ICLR 2023

Algorithm 2: Run Algorithm 1 with the same setup until each u_l(t), l ∈ [L], is roughly accurate; then set η_{l,t} = η and continue Algorithm 1 until the early stopping criterion is satisfied.

Theorem 4. Suppose δ_out ≤ (u*_min)^2/(120 s (u*_max)^3), and apply Algorithm 2 with η_{l,t} = 1/u_l^4(t) at the beginning and η_{l,t} = η ≤ 4/(9(u*_max)^2) after u_l^2(t) ≥ (1/2)(u*_l)^2 holds for all l ∈ [L]. Then, with the same T_lb and T_ub, the conclusion of Theorem 3 holds for any T_lb ≤ t ≤ T_ub.

In Theorem 4, the criterion to decrease the step size is that u_l^2(t) ≥ (1/2)(u*_l)^2 for all l ∈ [L]. Once this criterion is satisfied, our proof ensures that it continues to hold at least up to the early stopping time T_ub specified in the theorem. In practice, since the u*_l are unknown, we can switch to a more practical criterion: max_l |u_l(t+1) − u_l(t)|/|u_l(t) + ε| < τ, for some pre-specified tolerance τ > 0 and small value ε > 0, as the criterion for changing the step size. The motivation for this criterion is further discussed in Appendix D. The error bound remains the same as in Theorem 3. The change in step size requires a new way to study the gradient dynamics of the directions with perturbations. With our proof technique, Theorem 4 requires a smaller bound on the δ's (see Lemma 16 versus Lemma 8 in Appendix C for details). We believe this is a proof artifact and leave the improvement for future work.

Connection to standard sparsity. Consider the degenerate case where each group is of size 1. Our reparameterization, together with the normalization step, can roughly be interpreted as the parameterization of Vaskevicius et al. (2019) and Li et al. (2021). This also shows why a large step size on v_i is needed at the beginning.
If the initialization of v_i is incorrect, the sign of v_i may not move under a small step size.
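A compact sketch of the weight-normalized scheme (our simplification of Algorithms 1 and 2: plain early stopping at a fixed iteration count instead of the stopping criterion, and our own toy data) shows the interplay between the normalized directions and the slowly growing magnitudes:

```python
import numpy as np

def weight_normalized_gd(X, y, groups, alpha=1e-6, gamma=1e-3, n_iter=12000):
    """Gradient descent with weight normalization on each direction v_l.

    The v-step uses the large step size eta_{l,t} = 1/u_l(t)^4 followed by
    renormalization; the magnitude u_l takes a small step gamma.
    """
    n = X.shape[0]
    u = np.full(len(groups), alpha)                          # small magnitude init
    v = [np.ones(len(g)) / np.sqrt(len(g)) for g in groups]  # any unit-norm init
    for _ in range(n_iter):
        w = np.concatenate([u[l] ** 2 * v[l] for l in range(len(groups))])
        r = y - X @ w
        for l, g in enumerate(groups):
            grad = X[:, g].T @ r / n
            v_new = v[l] + grad / u[l] ** 2        # (1/u^4) step on u^2 * grad
            v[l] = v_new / np.linalg.norm(v_new)   # weight normalization
            u[l] = u[l] + gamma * 2 * u[l] * (v[l] @ grad)
    return np.concatenate([u[l] ** 2 * v[l] for l in range(len(groups))])

rng = np.random.default_rng(4)
n, p = 100, 20
groups = [list(range(i, i + 4)) for i in range(0, p, 4)]  # 5 groups of size 4
w_star = np.zeros(p)
w_star[:4] = [1.0, 0.5, -0.5, 1.0]                        # one active group
X = rng.standard_normal((n, p))
y = X @ w_star                                            # noiseless observations
w_hat = weight_normalized_gd(X, y, groups)
```

When u_l is tiny, the v-step is dominated by the gradient, so the direction snaps to the current gradient direction; this is the strong directional information that stabilizes the early magnitude updates.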

5. SIMULATION STUDIES

We conduct various experiments on simulated data to support our theory. Following the model in Section 2, we sample the entries of X i.i.d. from the Rademacher distribution and the entries of the noise vector ξ i.i.d. from N(0, σ^2). We set σ = 0.5 throughout the experiments.
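The simulated-data protocol can be reproduced as follows (a sketch; the seed and the specific support pattern are ours, matching the first experiment's 3 active groups of size 3):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 150, 300, 0.5
X = rng.choice([-1.0, 1.0], size=(n, p))          # i.i.d. Rademacher design
w_star = np.zeros(p)
w_star[:9] = 1.0                                   # 3 active groups of size 3
y = X @ w_star + sigma * rng.standard_normal(n)    # noisy linear observations
```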

