THE EFFICACY OF L1 REGULARIZATION IN NEURAL NETWORKS

Abstract

A crucial problem in neural networks is to select the most appropriate number of hidden neurons and obtain tight statistical risk bounds. In this work, we present a new perspective on the bias-variance tradeoff in neural networks. As an alternative to selecting the number of neurons, we theoretically show that L1 regularization can control the generalization error and sparsify the input dimension. In particular, with an appropriate L1 regularization on the output layer, the network can produce a statistical risk that is near minimax optimal. Moreover, an appropriate L1 regularization on the input layer leads to a risk bound that does not involve the input data dimension. Our analysis is based on a new amalgamation of dimension-based and norm-based complexity analyses to bound the generalization error. A consequent observation from our results is that an excessively large number of neurons does not necessarily inflate the generalization error under suitable regularization.

1. INTRODUCTION

Neural networks have been successfully applied to modeling nonlinear regression functions in various application domains. A critical way to evaluate a predictive learning model is to measure its statistical risk bound. For example, the L1 or L2 risks of typical parametric models such as linear regression are of order O((d/n)^{1/2}) for small d (Seber & Lee, 2012), where d and n denote the input dimension and the number of observations, respectively. Obtaining the risk bound for a nonparametric regression model such as a neural network is highly nontrivial. It involves an approximation error (or bias) term as well as a generalization error (or variance) term. The standard analysis of generalization error bounds may not be sufficient to describe the overall predictive performance of a model class unless the data are assumed to be generated from it. For the model class of two-layer feedforward networks and a rather general data-generating process, Barron (1993; 1994) proved an approximation error bound of O(r^{-1/2}), where r denotes the number of neurons. The author further developed a statistical risk bound of O((d/n)^{1/4}), which, to the best of our knowledge, is the tightest statistical risk bound for the class of two-layer neural networks (for d < n). This risk bound is based on an optimal bias-variance tradeoff involving a deliberate choice of r. Note that this rate of convergence is much slower than the classical parametric rate. We tackle the same problem from a different perspective and obtain a much tighter risk bound.

A practical challenge closely related to statistical risk is to select the most appropriate neural network architecture for a particular data domain (Ding et al., 2018). For two-layer neural networks, this is equivalent to selecting the number of hidden neurons r. While a small r tends to underfit, researchers have observed that the network does not overfit even for moderately large r.
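To make the setting concrete, the following sketch instantiates the two-layer model class f(x) = sum_j a_j sigma(w_j . x + b_j) and illustrates the L1 regularization on the output-layer coefficients discussed in the abstract. All names, dimensions, and the regularization strength are illustrative assumptions, not values from the paper; note that once the hidden layer is held fixed, the L1-penalized fit of the output coefficients reduces to a lasso problem, solvable by iterative soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer feedforward network: f(x) = sum_{j<=r} a_j * sigma(w_j . x + b_j).
# Illustrative sizes (assumed, not from the paper): n observations, d inputs,
# r hidden neurons, with r deliberately large relative to the signal.
n, d, r = 200, 5, 100
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)  # noisy 1-d signal

W = rng.standard_normal((r, d))        # hidden-layer weights, held fixed here
b = rng.standard_normal(r)
H = np.tanh(X @ W.T + b)               # hidden-layer features, shape (n, r)

# With H fixed, minimizing (1/2n)||H a - y||^2 + lam * ||a||_1 over the
# output coefficients a is a lasso problem; solve it by proximal gradient
# descent (iterative soft-thresholding).
lam = 0.05                             # assumed regularization strength
a = np.zeros(r)
step = 1.0 / np.linalg.norm(H, 2) ** 2  # conservative step, below 1/L
for _ in range(2000):
    grad = H.T @ (H @ a - y) / n       # gradient of the smooth part
    z = a - step * grad
    a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

print("nonzero output-layer coefficients:", int(np.count_nonzero(a)), "of", r)
```

The L1 penalty drives many output coefficients exactly to zero, so the effective number of neurons is controlled by lam rather than by an explicit choice of r, which is the perspective the paper advocates.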
Nevertheless, recent research has also shown that an overly large r (e.g., when r > n) does cause overfitting with high probability (Zhang et al., 2016). It can be shown under some non-degeneracy conditions that a two-layer neural network with more than n hidden neurons can perfectly fit n arbitrary data points, even in the presence of noise, which inevitably leads to overfitting. A theoretical choice of r suggested by the asymptotic analysis in (Barron, 1994) is of order (n/d)^{1/2}, and a practical choice of r often comes from cross-validation with an appropriate splitting ratio (Ding et al., 2018). An alternative perspective that we advocate is to learn from a single neural network with sufficiently many neurons and an appropriate L1 regularization on the neuron coefficients, instead of performing a selection from multiple candidate neural models. A potential benefit of this approach is easier hardware

