IMPLICIT REGULARIZATION FOR GROUP SPARSITY

Abstract

We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a "diagonally grouped linear neural network". We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group-sparsity structure. In contrast to many existing works on implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.¹

1. INTRODUCTION

Motivation. A salient feature of modern deep neural networks is that they are highly overparameterized, with many more parameters than available training examples. Surprisingly, however, deep neural networks trained with gradient descent can generalize quite well in practice, even without explicit regularization. One hypothesis is that the dynamics of gradient descent-based training themselves induce some form of implicit regularization, biasing toward solutions of low complexity (Hardt et al., 2016; Neyshabur et al., 2017). Recent research in deep learning theory has validated the hypothesis of such implicit regularization effects. A large body of work, which we survey below, has considered certain (restricted) families of linear neural networks and established two types of implicit regularization, standard sparse regularization and ℓ2-norm regularization, depending on how gradient descent is initialized. On the other hand, the role of the network architecture, i.e., the way the model is parameterized, in implicit regularization is less well understood. Does there exist a parameterization that promotes implicit regularization of gradient descent towards richer structures beyond standard sparsity? In this paper, we analyze a simple, prototypical hierarchical architecture for which gradient descent induces group-sparse regularization. Our finding, that finer structured biases can be induced via gradient dynamics, highlights the richness of co-designing neural networks along with optimization methods to produce more sophisticated regularization effects.

Background. Many recent theoretical efforts have revisited traditional, well-understood problems such as linear regression (Vaskevicius et al., 2019; Li et al., 2021; Zhao et al., 2019), matrix factorization (Gunasekar et al., 2018b; Li et al., 2018; Arora et al., 2019), and tensor decomposition (Ge et al., 2017; Wang et al., 2020) from the perspective of neural network training.
For nonlinear models with squared error loss, Williams et al. (2019) and Jin & Montúfar (2020) study the implicit bias of gradient descent in wide depth-2 ReLU networks with input dimension 1. Other works (Gunasekar et al., 2018c; Soudry et al., 2018; Nacson et al., 2019) show that gradient descent biases the solution towards max-margin (or minimum ℓ2-norm) solutions over separable data.

Our contributions. In this paper, we rigorously show that a diagonally grouped linear neural network (see Figure 1b) trained by gradient descent with (proper/partial) weight normalization induces group-sparse regularization: a form of structured regularization that, to the best of our knowledge, has not been provably established in previous work. One major approach to understanding the implicit regularization of gradient descent is based on its equivalence to a mirror descent (on a different objective function) (e.g., Gunasekar et al., 2018a; Woodworth et al., 2020). However, we show that, for the diagonally grouped linear network architecture, the gradient dynamics go beyond mirror descent. We then analyze the convergence of gradient flow with early stopping under orthogonal design with possibly noisy observations, and show that the obtained solution exhibits an implicit regularization effect towards structured (specifically, group) sparsity. In addition, we show that weight normalization can deal with instability related to the choice of learning rate and initialization. With weight normalization, we obtain a similar implicit regularization result in more general settings: orthogonal/non-orthogonal designs with possibly noisy observations. Moreover, the obtained solution achieves minimax-optimal error rates. Overall, compared to existing analyses of diagonal linear networks, our model design, which induces structured sparsity, exhibits provably improved sample complexity.
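To make the group-sparse bias concrete, the toy sketch below writes each group of coefficients as w_g = u_g² · v_g with a single shared magnitude scalar u_g per group, and runs plain gradient descent on the squared loss from a small initialization under an orthogonal, noiseless design. This construction, and all constants in it (alpha, lr, steps), are our own illustrative choices, not necessarily the paper's exact DGLNN architecture or proof setting:

```python
import numpy as np

# Illustrative grouped reparameterization (our construction): for each
# group g of size d, w_g = u_g^2 * v_g, where u_g is a scalar shared by
# the whole group. Gradient descent on the squared loss from a small
# initialization lets the active group's magnitude grow while inactive
# groups stay near zero: a group-sparse implicit bias.
p, d = 12, 4                            # 3 groups of size 4
w_star = np.zeros(p)
w_star[:d] = [1.0, 2.0, -1.0, 0.5]      # only the first group is active
X = np.eye(p)                           # orthogonal design, noiseless
y = X @ w_star

alpha, lr, steps = 0.1, 0.02, 8000      # small initialization alpha
u = alpha * np.ones(p // d)             # one magnitude scalar per group
v = alpha * np.ones(p)                  # within-group direction weights
for _ in range(steps):
    w = np.repeat(u**2, d) * v
    r = X.T @ (X @ w - y)               # gradient of the loss wrt w
    gu = 2 * u * (v * r).reshape(-1, d).sum(axis=1)  # chain rule, u_g
    gv = np.repeat(u**2, d) * r                      # chain rule, v
    u -= lr * gu
    v -= lr * gv
w = np.repeat(u**2, d) * v
group_norms = np.linalg.norm(w.reshape(-1, d), axis=1)
print(np.round(group_norms, 3))         # mass concentrates on group 0
```

Because the shared scalar u_g ties all coordinates of a group together, a group either "takes off" as a whole or stays near its initialization, which is the qualitative mechanism behind the structured bias described above.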
In the degenerate case of size-one groups, our bounds coincide with previous results, and our approach can be interpreted as a new algorithm for sparse linear regression.

Our techniques. Our approach is built upon the power reparameterization trick, which has been shown to promote model sparsity (Schwarz et al., 2021). Raising the parameters of a linear model element-wise to the N-th power (N > 1) has the effect that parameters of smaller magnitude receive smaller gradient updates, while parameters of larger magnitude receive larger updates. In essence, this leads to a "rich get richer" phenomenon in gradient-based training. Gissin et al. (2019) and Berthier (2022) analyze the gradient dynamics on a toy example and call this "incremental learning". Concretely, for a linear predictor w ∈ R^p, if we re-parameterize the model as w = u^{⊙N} − v^{⊙N} (where u^{⊙N} denotes the N-th element-wise power of u), then gradient descent will bias the training towards sparse solutions. This reparameterization is equivalent to a diagonal linear network, as shown in Figure 1a. It is further studied in Woodworth et al. (2020) for interpolating predictors, where a small enough initialization is shown to induce ℓ1-norm regularization. For noisy settings, Vaskevicius et al. (2019) and Li et al. (2021) show that gradient descent converges to sparse models with early stopping. In the special case of sparse recovery from under-sampled
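The sketch below illustrates this reparameterization for N = 2, i.e., w = u ⊙ u − v ⊙ v, with plain gradient descent on a noiseless sparse regression instance. The initialization scale, step size, and iteration count are illustrative choices of ours, not tuned values from any of the cited works:

```python
import numpy as np

# Minimal sketch of the diagonal-network reparameterization with N = 2:
# w = u*u - v*v, trained by gradient descent on the squared loss from a
# small initialization alpha. With alpha small, the trained w
# concentrates on the true support, illustrating the implicit bias
# toward sparse solutions.
rng = np.random.default_rng(0)
n, p = 50, 100                          # under-determined: n < p
w_star = np.zeros(p)
w_star[:3] = [1.0, -2.0, 1.5]           # 3-sparse ground truth
X = rng.standard_normal((n, p)) / np.sqrt(n)
y = X @ w_star                          # noiseless observations

alpha, lr, steps = 1e-3, 0.05, 10000    # small initialization alpha
u = alpha * np.ones(p)
v = alpha * np.ones(p)
for _ in range(steps):
    w = u * u - v * v
    r = X.T @ (X @ w - y)               # gradient of the loss wrt w
    gu, gv = 2 * u * r, -2 * v * r      # chain rule through u*u - v*v
    u -= lr * gu
    v -= lr * gv
w = u * u - v * v
print(np.round(w[:6], 2))               # mass concentrates on coords 0-2
```

Note the "rich get richer" effect in the updates: each coordinate's gradient is scaled by its own current magnitude (2u and 2v), so coordinates that start moving first keep outpacing the rest.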



¹ Code is available at https://github.com/jiangyuan2li/Implicit-Group-Sparsity



Comparisons to related work on implicit and explicit regularization. Here, GD stands for gradient descent, (D)LNN/CNN for (diagonal) linear/convolutional neural network, and DGLNN for diagonally grouped linear neural network.

Outside of implicit regularization, several other works study the inductive bias of network architectures under explicit ℓ2 regularization on the model weights (Pilanci & Ergen, 2020; Sahiner et al., 2020). For multichannel linear convolutional networks, Jagadeesan et al. (2021) show that ℓ2-norm minimization of the weights leads to a norm regularizer on predictors, where the norm is given by a semidefinite program (SDP). The representation cost in predictor space induced by explicit ℓ2 regularization on (various different versions of) linear neural networks is studied in Dai et al. (2021), which demonstrates several interesting induced regularizers on the linear predictors, such as ℓp quasi-norms and group quasi-norms. However, these results are silent on the behavior of gradient descent-based training without explicit regularization. In light of the above results, we ask the following question:

