IMPLICIT REGULARIZATION FOR GROUP SPARSITY

Abstract

We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a "diagonally grouped linear neural network". We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works on understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noisy setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.¹

1. INTRODUCTION

Motivation. A salient feature of modern deep neural networks is that they are highly overparameterized, with many more parameters than available training examples. Surprisingly, however, deep neural networks trained with gradient descent can generalize quite well in practice, even without explicit regularization. One hypothesis is that the dynamics of gradient descent-based training itself induce some form of implicit regularization, biasing toward solutions with low complexity (Hardt et al., 2016; Neyshabur et al., 2017). Recent research in deep learning theory has validated the hypothesis of such implicit regularization effects. A large body of work, which we survey below, has considered certain (restricted) families of linear neural networks and established two types of implicit regularization, standard sparse regularization and ℓ2-norm regularization, depending on how gradient descent is initialized. On the other hand, the role of network architecture, i.e., the way the model is parameterized, in implicit regularization is less well understood. Does there exist a parameterization that promotes implicit regularization of gradient descent towards richer structures beyond standard sparsity? In this paper, we analyze a simple, prototypical hierarchical architecture for which gradient descent induces group-sparse regularization. Our finding, that finer, structured biases can be induced via gradient dynamics, highlights the richness of co-designing neural networks along with optimization methods to produce more sophisticated regularization effects.

Background. Many recent theoretical efforts have revisited traditional, well-understood problems such as linear regression (Vaskevicius et al., 2019; Li et al., 2021; Zhao et al., 2019), matrix factorization (Gunasekar et al., 2018b; Li et al., 2018; Arora et al., 2019), and tensor decomposition (Ge et al., 2017; Wang et al., 2020) from the perspective of neural network training.
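As a toy numerical illustration of this kind of implicit group-sparsity bias, the sketch below fits an overparameterized grouped model to noiseless data with plain gradient descent and no explicit penalty. The specific reparameterization used here, w_g = u_g · v_g with a single shared scalar u_g per group, the small initialization scale alpha, and all problem sizes are illustrative assumptions for this sketch, not the paper's exact architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy group-sparse regression: 12 coefficients in 4 groups of size 3;
# only the first group is active in the ground truth.
n, d, g = 60, 12, 3
groups = [slice(i, i + g) for i in range(0, d, g)]
w_star = np.zeros(d)
w_star[:g] = [1.0, -2.0, 1.5]

X = rng.standard_normal((n, d))
y = X @ w_star

# Hypothetical grouped reparameterization (an assumption for illustration):
# w_g = u_g * v_g with one shared scalar u_g per group. Gradient descent on
# the squared loss from a small initialization keeps the scales u_g of the
# inactive groups near zero, yielding a group-sparse solution.
alpha, lr, steps = 1e-3, 0.01, 20000
u = alpha * np.ones(len(groups))   # per-group scales
v = alpha * np.ones(d)             # within-group directions

def assemble(u, v):
    """Map the overparameterized (u, v) back to the effective weights w."""
    w = np.empty(d)
    for k, s in enumerate(groups):
        w[s] = u[k] * v[s]
    return w

for _ in range(steps):
    w = assemble(u, v)
    grad_w = X.T @ (X @ w - y) / n  # gradient of the squared loss in w
    # Chain rule through the reparameterization
    grad_u = np.array([grad_w[s] @ v[s] for s in groups])
    grad_v = np.concatenate([u[k] * grad_w[s] for k, s in enumerate(groups)])
    u -= lr * grad_u
    v -= lr * grad_v

group_norms = np.array([np.linalg.norm(assemble(u, v)[s]) for s in groups])
print(np.round(group_norms, 3))  # the inactive groups' norms stay near zero
```

In this sketch no thresholding or penalty is ever applied; the group structure of the recovered weights comes purely from the parameterization and the small initialization.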
For nonlinear models with squared error loss, Williams et al. (2019) and Jin & Montúfar (2020) study the implicit bias of gradient descent in wide depth-2 ReLU networks with input dimension 1. Other works (Gunasekar et al., 2018c; Soudry et al., 2018; Nacson et al., 2019) show that gradient descent biases the solution towards max-margin (or minimum ℓ2-norm) solutions over separable data.



¹ Code is available at https://github.com/jiangyuan2li/Implicit-Group-Sparsity

