SPARSITY BY REDUNDANCY: SOLVING L1 WITH A SIMPLE REPARAMETRIZATION

Abstract

We identify and prove a general principle: L1 sparsity can be achieved using a redundant parametrization plus an L2 penalty. Our results lead to a simple algorithm, spred, that seamlessly integrates L1 regularization into any modern deep learning framework. Practically, we (1) demonstrate the efficiency of spred in optimizing conventional tasks such as the lasso and sparse coding, (2) benchmark our method for nonlinear feature selection on six gene selection tasks, and (3) illustrate the use of the method for achieving structured and unstructured sparsity in deep learning in an end-to-end manner. Conceptually, our result bridges the gap between conventional statistical learning and the understanding of the inductive bias of the redundant parametrizations common in deep learning.

1. INTRODUCTION

In many fields, optimizing an objective function under an L1 constraint is of fundamental importance (Santosa & Symes, 1986; Tibshirani, 1996; Donoho, 2006; Sun et al., 2015; Candes et al., 2008). The advantage of the L1 penalty is that its solutions are sparse and thus highly interpretable. While non-gradient techniques such as interior-point methods can be applied to solve L1-regularized problems, gradient-based methods are favored by practitioners due to their scalability to large problems and their simplicity of implementation (Schmidt et al., 2007; Beck & Teboulle, 2009). However, previous algorithms are mostly problem-specific extensions of gradient descent and are highly limited in their scope of applicability. How to optimize a general nonconvex objective with L1 regularization remains an important and fundamental open problem.

The foremost contribution of this work is a simple and scalable method for solving arbitrary nonconvex objectives with L1 regularization. The proposed method does not require any special optimization algorithm: the problem can be solved by plain gradient descent and can be accelerated by common deep learning training tricks such as minibatch sampling and adaptive learning rates. One can even apply standard second-order methods such as Newton's method or the L-BFGS optimizer. The method can be implemented in any standard deep learning framework with only a few lines of code and therefore seamlessly leverages the power of modern GPUs.

In fact, there is a large gap between L1 learning and deep learning. Many tasks at which L1-based methods work well, such as feature selection, cannot currently be tackled by deep learning, and achieving sparsity in deep learning is almost never based on the L1 penalty.
This gap between conventional statistics and deep learning exists perhaps because no method efficiently solves a general L1 penalty in general nonlinear settings, let alone integrates into standard backpropagation-based training pipelines. Our result, crucially, bridges this gap between classical L1 sparsity and the most basic deep learning practices. The main contributions of this work are:

1. identification and proof of a general principle: an L1 penalty is equivalent to a redundant parametrization plus weight decay, which can be easily optimized with SGD;
2. a simple and efficient end-to-end algorithm for optimizing an L1-regularized loss within any deep learning framework;
3. a principled explanation of known, and discovery of previously unknown, mechanisms in standard deep learning practice that lead to low-rankness and sparsity.
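The first contribution rests on a simple algebraic fact, which can be checked numerically. The following sketch is our own illustration (not code from the paper): for any scalar weight w, the L1 penalty 2|w| equals the smallest L2 penalty u^2 + v^2 over all factorizations w = u v, which is why weight decay on a redundant factorization can act as an L1 penalty on the product.

```python
import numpy as np

# Numerical check of the identity 2|w| = min over (u, v) with u*v = w
# of (u**2 + v**2). The minimum is attained at |u| = |v| = sqrt(|w|).
def min_l2_over_factorizations(w, ts):
    # Parametrize u = t, v = w / t and scan over a grid of t > 0.
    return min(t**2 + (w / t) ** 2 for t in ts)

ts = np.linspace(0.01, 5.0, 100000)
for w in [0.3, 1.0, 2.5]:
    best = min_l2_over_factorizations(w, ts)
    assert abs(best - 2 * abs(w)) < 1e-3, (w, best)
```

The grid search is only for verification; in the algorithm itself, gradient descent with weight decay finds the balanced factorization automatically.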

2. RELATED WORKS

L1 Penalty. It is well known that the L1 penalty leads to a sparse solution (Wasserman, 2013). For linear models, objectives with L1 regularization are usually convex, but they are not easy to solve because the objective becomes non-differentiable exactly at the points where sparsity is achieved (namely, the origin). Previous literature often proposes special algorithms for solving the L1 penalty for a specific task. For example, the lasso finds a sparse weight solution for a linear regression task. The original lasso paper suggests a method based on quadratic programming (Tibshirani, 1996). Later, algorithms such as coordinate descent (Friedman et al., 2010) and least-angle regression (LARS) (Efron et al., 2004) were proposed as more efficient alternatives. More recent work (2022) studies the problem in the case of a convex and lower semicontinuous loss function. A preliminary work pointed out the connection between a fully connected neural network with L2 regularization and the group lasso (Tibshirani, 2021). All these previous works lack two essential components of our proposal: (1) the application of the theory to deep learning practice and (2) the observation that such an equivalence means that simple SGD can solve the L1 penalty.

Sparsity in Deep Learning. One important application of our theory is to understanding and achieving any type of parameter sparsity in deep learning. There are two main reasons for introducing sparsity into a model. The first is that some level of sparsity often leads to better generalization performance; the second is that compressing a model can lead to more memory- and computation-efficient deployment (Gale et al., 2019; Blalock et al., 2020). However, none of the popular methods for sparsity in deep learning is based on the L1 penalty, which is the favored method in conventional statistics. For example, pruning-based methods are the dominant strategies in deep learning (LeCun et al., 1989).
However, such methods are not satisfactory from a principled perspective because the pruning step is done separately from training, and it is hard to understand what these pruning procedures are actually optimizing.

3. MAIN RESULT

Consider a generic objective function L(V_s, V_d) that depends on two sets of learnable parameters V_s and V_d, where the subscript s stands for "sparse" and d stands for "dense". Often, we want to find a sparse set of parameters V_s that minimizes L. The conventional way to achieve this is to minimize the loss function with an L1 penalty of strength 2κ:

    min_{V_s, V_d} L(V_s, V_d) + 2κ ||V_s||_1.    (1)

Under suitable conditions on L, the solutions of Eq. (1) feature both (1) sparsity and (2) shrinkage of the norm of V_s, so one can perform variable selection and avoid overfitting at the same time. A primary obstacle that has prevented scalable optimization of Eq. (1) with gradient descent algorithms is that the objective is non-differentiable exactly at the points where sparsity is achieved, and efficient algorithms exist only when the loss function belongs to a restrictive set of families. See Figure 1.
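As a concrete illustration of the reparametrization (a sketch of our own, not the paper's reference implementation), consider the one-dimensional instance min_w (w - a)^2 + 2κ|w|, whose closed-form solution is the soft-thresholding operator. Writing w = u v and replacing the L1 penalty with weight decay κ(u^2 + v^2) yields a smooth objective that plain gradient descent drives to the same solution:

```python
import numpy as np

# Solve min_w (w - a)^2 + 2*kappa*|w| by gradient descent on the smooth,
# redundant objective min_{u,v} (u*v - a)^2 + kappa*(u^2 + v^2),
# and compare with the known closed-form soft-thresholding solution.
def soft_threshold(a, kappa):
    return np.sign(a) * max(abs(a) - kappa, 0.0)

def spred_gd(a, kappa, lr=0.01, steps=20000):
    # Symmetric positive initialization; suits a >= 0 in this 1-D sketch
    # (for a < 0, a generic asymmetric initialization would be used).
    u, v = 1.0, 1.0
    for _ in range(steps):
        r = u * v - a
        gu = 2 * r * v + 2 * kappa * u
        gv = 2 * r * u + 2 * kappa * v
        u, v = u - lr * gu, v - lr * gv
    return u * v

for a, kappa in [(1.0, 0.3), (0.2, 0.3), (2.0, 0.5)]:
    w_gd = spred_gd(a, kappa)
    w_star = soft_threshold(a, kappa)
    assert abs(w_gd - w_star) < 1e-2, (a, kappa, w_gd, w_star)
```

Note that for a = 0.2 < κ = 0.3, gradient descent drives u v to exactly the sparse solution w = 0, even though the surrogate objective is everywhere differentiable.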



One major problem of the coordinate descent algorithm is that it scales badly as the number of parameters increases and is difficult to parallelize. Yet another line of work advocates iterative thresholding algorithms such as ISTA for solving the lasso (Beck & Teboulle, 2009), but it is unclear how ISTA could be generalized to solve general nonconvex problems. A similar situation arises for sparse multinomial logistic regression (Cawley et al., 2006), which relies on a diagonal second-order coordinate descent algorithm. Another important problem is the nonconvex L1-sparse coding problem. This problem can be decomposed into two convex subproblems, and Lee et al. (2006) utilized this feature to propose the feature-sign search algorithm. Our work tackles the problem of L1 sparsity from a completely different angle: instead of finding an efficient algorithm for a special L1 problem, we transform any L1 problem into a smooth problem for which the simplest gradient descent algorithms are efficient.
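For reference, the iterative thresholding idea mentioned above can be sketched in a few lines (our own illustration, with our own naming; not code from the paper). ISTA alternates a gradient step on the smooth least-squares term with a soft-thresholding step that produces exact zeros:

```python
import numpy as np

# ISTA for the lasso: min_x 0.5 * ||A x - b||^2 + lam * ||x||_1
def ista(A, b, lam, steps=500):
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = A.T @ (A @ x - b)                 # gradient of the smooth term
        z = x - g / L                         # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x = ista(A, b, lam=0.5)
# The thresholding step drives most inactive coordinates exactly to zero.
assert np.sum(np.abs(x) < 1e-6) >= 5
assert np.max(np.abs(x - x_true)) < 0.3
```

Note how sparsity here is produced by the explicit thresholding operator, which is tied to the lasso's problem structure; the approach in this paper instead obtains exact zeros from a smooth objective that any gradient-based optimizer can handle.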

