SPARSITY BY REDUNDANCY: SOLVING L1 WITH A SIMPLE REPARAMETRIZATION

Abstract

We identify and prove a general principle: L1 sparsity can be achieved with a redundant parametrization plus an L2 penalty. Our result leads to a simple algorithm, spred, that seamlessly integrates L1 regularization into any modern deep learning framework. Practically, we (1) demonstrate the efficiency of spred in optimizing conventional tasks such as the lasso and sparse coding, (2) benchmark the method for nonlinear feature selection on six gene-selection tasks, and (3) illustrate its use for achieving structured and unstructured sparsity in deep learning in an end-to-end manner. Conceptually, our result bridges the gap between the redundant parametrizations common in deep learning and conventional statistical learning, clarifying the inductive bias of the former.

1. INTRODUCTION

In many fields, optimization of an objective function under an L1 constraint is of fundamental importance (Santosa & Symes, 1986; Tibshirani, 1996; Donoho, 2006; Sun et al., 2015; Candes et al., 2008). The advantage of the L1 penalty is that its solutions are sparse and thus highly interpretable. While non-gradient techniques such as interior point methods can be applied to L1-regularized problems, gradient-based methods are favored in practice for their scalability to large problems and their simplicity of implementation (Schmidt et al., 2007; Beck & Teboulle, 2009). However, previous algorithms are mostly problem-specific extensions of gradient descent and are highly limited in their scope of applicability. How to optimize a general nonconvex objective with L1 regularization remains an important and fundamental open problem.

The foremost contribution of this work is a simple and scalable method for solving arbitrary nonconvex objectives with L1 regularization. The method requires no special optimization algorithm: it can be solved by plain gradient descent and accelerated by common deep learning training techniques such as minibatch sampling and adaptive learning rates. One can even apply standard second-order methods such as Newton's method or the LBFGS optimizer. The method can be implemented in any standard deep learning framework in only a few lines of code and therefore seamlessly leverages the power of modern GPUs. Indeed, there is a large gap between L1 learning and deep learning: many tasks at which L1-based methods excel, such as feature selection, cannot be tackled by deep learning, and achieving sparsity in deep learning is almost never based on the L1 penalty.
This gap between conventional statistics and deep learning exists perhaps because no method efficiently solves a general L1 penalty in general nonlinear settings, let alone integrates such a method into standard backpropagation-based training pipelines. Our result, crucially, bridges this gap between classical L1 sparsity and the most basic deep learning practices. The main contributions of this work are:

1. identification and proof of a general principle: an L1 penalty is equivalent to a redundant parametrization plus weight decay, which can be easily optimized with SGD;
2. a simple and efficient end-to-end algorithm for optimizing an L1-regularized loss within any deep learning framework;
3. a principled explanation of known mechanisms, and discovery of previously unknown ones, in standard deep learning practice that lead to low-rankness and sparsity.
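To make the principle in contribution 1 concrete, the following is a minimal numpy sketch (not the paper's reference implementation; the names spred_lasso and soft_threshold are illustrative). Each weight w is reparametrized as an elementwise product u * v, the L1 penalty is replaced by weight decay on u and v, and plain gradient descent is run. For an orthonormal design, the recovered u * v can be checked against the closed-form soft-thresholding solution of the lasso, since min over u*v = w of (lam/2)(u^2 + v^2) equals lam*|w| by the AM-GM inequality.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form lasso solution for an orthonormal design."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def spred_lasso(y, lam, lr=0.1, steps=5000, seed=0):
    """Minimize 0.5*||y - u*v||^2 + (lam/2)*(||u||^2 + ||v||^2) by plain GD.

    Since min_{u*v = w} (lam/2)*(u^2 + v^2) = lam*|w|, the product u*v
    should converge to the L1-regularized (lasso) solution.
    """
    rng = np.random.default_rng(seed)           # random init breaks the saddle at u = v = 0
    u, v = rng.normal(size=y.shape), rng.normal(size=y.shape)
    for _ in range(steps):
        r = u * v - y                           # residual
        gu = r * v + lam * u                    # dF/du: data term + weight decay
        gv = r * u + lam * v                    # dF/dv
        u, v = u - lr * gu, v - lr * gv
    return u * v

y = np.array([2.0, -1.0, 0.2])
lam = 0.5
w_spred = spred_lasso(y, lam)
w_lasso = soft_threshold(y, lam)                # exactly [1.5, -0.5, 0.0]
print(np.round(w_spred, 4), w_lasso)
```

Note that only weight decay (an L2 term) appears in the optimized objective; the sparsity of u * v emerges from the redundancy of the parametrization, which is what allows the scheme to drop into any SGD-based pipeline.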

