THE EFFICACY OF $L_1$ REGULARIZATION IN NEURAL NETWORKS

Abstract

A crucial problem in neural networks is to select the most appropriate number of hidden neurons and to obtain tight statistical risk bounds. In this work, we present a new perspective on the bias-variance tradeoff in neural networks. As an alternative to selecting the number of neurons, we theoretically show that $L_1$ regularization can control the generalization error and sparsify the input dimension. In particular, with an appropriate $L_1$ regularization on the output layer, the network can produce a statistical risk that is near minimax optimal. Moreover, an appropriate $L_1$ regularization on the input layer leads to a risk bound that does not involve the input data dimension. Our analysis is based on a new amalgamation of dimension-based and norm-based complexity analysis to bound the generalization error. A consequence of our results is that an excessively large number of neurons does not necessarily inflate the generalization error under a suitable regularization.

1. INTRODUCTION

Neural networks have been successfully applied to modeling nonlinear regression functions in various application domains. A critical criterion for evaluating a predictive learning model is its statistical risk bound. For example, the $L_1$ or $L_2$ risks of typical parametric models such as linear regression are of the order $(d/n)^{1/2}$ for small $d$ (Seber & Lee, 2012), where $d$ and $n$ denote the input dimension and the number of observations, respectively. Obtaining the risk bound for a nonparametric regression model such as a neural network is highly nontrivial. It involves an approximation error (or bias) term as well as a generalization error (or variance) term. The standard analysis of generalization error bounds may not be sufficient to describe the overall predictive performance of a model class unless the data are assumed to be generated from it. For the model class of two-layer feedforward networks and a rather general data-generating process, Barron (1993; 1994) proved an approximation error bound of $O(r^{-1/2})$, where $r$ denotes the number of neurons. The author further developed a statistical risk bound of $O((d/n)^{1/4})$, which is, to the best of our knowledge, the tightest statistical risk bound for the class of two-layer neural networks (for $d < n$). This risk bound is based on an optimal bias-variance tradeoff involving a deliberate choice of $r$. Note that this rate of convergence is much slower than the classical parametric rate. We will tackle the same problem from a different perspective and obtain a much tighter risk bound.

A practical challenge closely related to statistical risks is to select the most appropriate neural network architecture for a particular data domain (Ding et al., 2018). For two-layer neural networks, this is equivalent to selecting the number of hidden neurons $r$. While a small $r$ tends to underfit, researchers have observed that the network does not overfit even for moderately large $r$.
Nevertheless, recent research has also shown that an overly large $r$ (e.g., $r > n$) does cause overfitting with high probability (Zhang et al., 2016). It can be shown under some non-degeneracy conditions that a two-layer neural network with more than $n$ hidden neurons can perfectly fit $n$ arbitrary data points, even in the presence of noise, which inevitably leads to overfitting. A theoretical choice of $r$ suggested by the asymptotic analysis in (Barron, 1994) is of the order $(n/d)^{1/2}$, and a practical choice of $r$ often comes from cross-validation with an appropriate splitting ratio (Ding et al., 2018). An alternative perspective that we advocate is to learn from a single neural network with sufficiently many neurons and an appropriate $L_1$ regularization on the neuron coefficients, instead of performing a selection from multiple candidate neural models. A potential benefit of this approach is easier hardware implementation and computation, since we do not need to implement multiple models separately. Perhaps more importantly, this perspective on training enables much tighter risk bounds, as we will demonstrate.

In this work, we focus on the model class of two-layer feedforward neural networks. Our main contributions are summarized below. First, we prove that $L_1$ regularization on the coefficients of the output layer can produce a risk bound of $O((d/n)^{1/2})$ (up to a logarithmic factor) under the $L_1$ training loss, which approaches the minimax optimal rate. Such a rate has not been established under the $L_2$ training loss so far. The result indicates a potential benefit of using $L_1$ regularization for training a neural network, instead of selecting among candidate numbers of neurons. Additionally, a key ingredient of our result is a unique amalgamation of dimension-based and norm-based risk analysis, which may be of independent interest.
The technique leads to an interesting observation: an excessively large $r$ can reduce the approximation error without increasing the generalization error under $L_1$ regularization. This implies that an explicit regularization can eliminate overfitting even when the specified number of neurons is enormous. Moreover, we prove that $L_1$ regularization on the input layer can induce sparsity by producing a risk bound that does not involve $d$, where $d$ may be much larger than the true number of significant variables.

Related work on neural network analysis. Despite the practical success of neural networks, a systematic understanding of their theoretical limits remains an ongoing challenge and has motivated research from various perspectives. Cybenko (1989) showed that any continuous function can be approximated arbitrarily well by a two-layer perceptron with sigmoid activation functions. Barron (1993; 1994) established an approximation error bound for using two-layer neural networks to fit arbitrary smooth functions, together with the corresponding statistical risk bounds. A dimension-free Rademacher complexity for deep ReLU neural networks was recently developed (Golowich et al., 2017; Barron & Klusowski, 2019). Based on a contraction lemma, a series of norm-based complexities and their corresponding generalization errors have been developed (Neyshabur et al., 2015, and the references therein). Another perspective is to assume that the data are generated by a neural network and to convert its parameter estimation into a tensor decomposition problem through the score function of the known or estimated input distribution (Anandkumar et al., 2014; Janzamin et al., 2015; Ge et al., 2017; Mondelli & Montanari, 2018). Also, tight error bounds have been established recently by assuming that neural networks of parsimonious structure generate the data.
In this direction, Schmidt-Hieber (2017) proved that specific deep neural networks with few non-zero network parameters can achieve minimax rates of convergence. Bauer & Kohler (2019) developed an error bound that is free from the input dimension by assuming a generalized hierarchical interaction model.

Related work on $L_1$ regularization. The use of $L_1$ regularization has been widely studied in linear regression problems (Hastie et al., 2009, Chapter 3). The use of $L_1$ regularization for training neural networks has recently been advocated in deep learning practice. A prominent use of $L_1$ regularization has been to empirically sparsify weight coefficients and thus compress a network that requires intensive memory usage (Cheng et al., 2017). The extension of $L_1$ regularization to group-$L_1$ regularization (Yuan & Lin, 2006) has also been extensively used in learning various neural networks (Han et al., 2015; Zhao et al., 2015; Wen et al., 2016; Scardapane et al., 2017). Despite the above practice, the efficacy of $L_1$ regularization in neural networks deserves more theoretical study. In the context of two-layer neural networks, we will show that the $L_1$ regularizations in the output and input layers play two different roles: the former reduces the generalization error caused by excessive neurons, while the latter sparsifies input signals in the presence of substantial redundancy. Unlike previous theoretical work, we consider the $L_1$ loss, which ranks among the most popular loss functions in, e.g., learning from ordinal data (Pedregosa et al., 2017) or imaging data (Zhao et al., 2016), and for which the statistical risk has not been studied previously. In practice, the $L_1$ loss for training has been implemented in prevalent computational frameworks such as TensorFlow (Google, 2016), PyTorch (Ketkar, 2017), and Keras (Gulli & Pal, 2017).

2.1. MODEL ASSUMPTION AND EVALUATION

Suppose we have $n$ labeled observations $\{(x_i, y_i)\}_{i=1,\ldots,n}$, where the $y_i$'s are continuous-valued responses or labels. We assume that the underlying data-generating model is $y_i = f^*(x_i) + \varepsilon_i$ for some unknown function $f^*(\cdot)$, where the $x_i$'s $\in \mathcal{X} \subset \mathbb{R}^d$ are independent and identically distributed, and the $\varepsilon_i$'s are independent and identically distributed with a distribution that is symmetric about zero and satisfies

$$E(\varepsilon_i^2 \mid x_i) \le \tau^2. \tag{1}$$

Here, $\mathcal{X}$ is a bounded set that contains zero, for example $\{x : \|x\|_\infty \le M\}$ for some constant $M$. Our goal is to learn a regression model $\hat f_n : x \mapsto \hat f_n(x)$ for prediction. The $\hat f_n$ is obtained from the following form of neural networks:

$$\sum_{j=1}^{r} a_j \sigma(w_j^\top x + b_j) + a_0, \tag{2}$$

where $a_0, a_j, b_j \in \mathbb{R}$ and $w_j \in \mathbb{R}^d$, $j = 1, \ldots, r$, are parameters to estimate. We let $a = [a_0, a_1, \ldots, a_r]^\top$ denote the output layer coefficients. An illustration is given in Figure 1. The estimation is typically accomplished by minimizing the empirical risk $n^{-1} \sum_{i=1}^{n} \ell(y_i, f(x_i))$, for some loss function $\ell(\cdot)$, plus a regularization term. We first consider the $L_1$ regularization at the output layer. In particular, we search for such an $f$ by empirical risk minimization over the function class

$$\mathcal{F}_V = \Big\{ f : \mathbb{R}^d \to \mathbb{R} \;\Big|\; f(x) = \sum_{j=1}^{r} a_j \sigma(w_j^\top x + b_j) + a_0, \ \|a\|_1 \le V \Big\}, \tag{3}$$

where $V$ is a constant. The following statistical risk measures the predictive performance of a learned model $f$:

$$R(f) \triangleq E\,\ell(y, f(x)) - E\,\ell(y, f^*(x)).$$

The loss function $\ell(\cdot)$ is pre-determined by data analysts, usually the $L_1$ loss defined by $\ell_1(y, \tilde y) = |y - \tilde y|$ or the $L_2$ loss defined by $\ell_2(y, \tilde y) = (y - \tilde y)^2$. Under the $L_1$ loss, the risk is $R(f) = E|f^*(x) + \varepsilon - f(x)| - E|\varepsilon|$, which is nonnegative for symmetric random variables $\varepsilon$. It is typical to use the same loss function for both training and evaluation.
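To make the model in (2), the constraint in (3), and the empirical $L_1$ loss concrete, the following is a minimal sketch in plain Python. It assumes the logistic activation $\sigma(t) = 1/(1 + e^{-t})$; the names `sigmoid`, `two_layer_net`, `in_class_FV`, and `empirical_l1_loss` are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(t):
    """Logistic activation, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def two_layer_net(x, a0, a, W, b):
    """Evaluate f(x) = sum_j a[j] * sigmoid(<W[j], x> + b[j]) + a0."""
    out = a0
    for a_j, w_j, b_j in zip(a, W, b):
        pre_activation = sum(w * xi for w, xi in zip(w_j, x)) + b_j
        out += a_j * sigmoid(pre_activation)
    return out

def in_class_FV(a0, a, V):
    """Check the output-layer constraint ||a||_1 <= V with a = [a0, a1, ..., ar]."""
    return abs(a0) + sum(abs(a_j) for a_j in a) <= V

def empirical_l1_loss(xs, ys, a0, a, W, b):
    """Average absolute error n^{-1} * sum_i |y_i - f(x_i)|."""
    errors = [abs(y - two_layer_net(x, a0, a, W, b)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Toy network with r = 2 neurons and d = 2 inputs (illustrative values only).
a0, a = 0.5, [1.0, -2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
value_at_zero = two_layer_net([0.0, 0.0], a0, a, W, b)
loss_on_one_point = empirical_l1_loss([[0.0, 0.0]], [1.0], a0, a, W, b)
```

In this toy example the output-layer coefficients have $\|a\|_1 = 3.5$, so the network belongs to $\mathcal{F}_V$ exactly when $V \ge 3.5$. The sketch only evaluates a given network; it does not perform the constrained minimization itself.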

2.2. NOTATION

Throughout the paper, we use $n$, $d$, $k$, $r$ to denote the number of observations, the number of input variables (or input dimension), the number of significant input variables (or sparsity level), and the number of neurons (or hidden dimension), respectively. We write $a_n \gtrsim b_n$, $b_n \lesssim a_n$, or $b_n = O(a_n)$ if $|b_n / a_n| < c$ for some constant $c$ and all sufficiently large $n$. We write $a_n \asymp b_n$ if $a_n \gtrsim b_n$ as well as $a_n \lesssim b_n$. Let $N(\mu, V)$ denote the Gaussian distribution with mean $\mu$ and covariance $V$. Let $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the common $L_1$ and $L_2$ vector norms, respectively. Let $\mathcal{X}$ denote the essential support of $X$. For any vector $z \in \mathbb{R}^d$, we define $\|z\|_{\mathcal{X}} \triangleq \sup_{x \in \mathcal{X}} |x^\top z|$, which may or may not be infinite. If $\mathcal{X} = \{x : \|x\|_\infty \le M\}$, then $\|z\|_{\mathcal{X}}$ is equivalent to $M \|z\|_1$. Throughout the paper, $\hat f_n$ denotes the estimated regression function, with $n$ being the number of observations.
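The relation between the norm over the box $\{x : \|x\|_\infty \le M\}$ and $M\|z\|_1$ can be checked numerically. The sketch below, in plain Python with hypothetical helper names, uses the fact that $|x^\top z|$ is convex in $x$, so its supremum over a box is attained at one of the $2^d$ vertices; the brute-force search is only practical for small $d$.

```python
from itertools import product

def box_norm_bruteforce(z, M):
    """sup of |<x, z>| over {x : ||x||_inf <= M}, searched over the box vertices."""
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=len(z)):
        inner = sum(s * M * zi for s, zi in zip(signs, z))
        best = max(best, abs(inner))
    return best

def box_norm_closed_form(z, M):
    """Closed form M * ||z||_1 for the same supremum."""
    return M * sum(abs(zi) for zi in z)

# Illustrative values only.
z = [0.5, -2.0, 3.0]
M = 2.0
brute = box_norm_bruteforce(z, M)
closed = box_norm_closed_form(z, M)
```

The maximizing vertex is $x = M \cdot \mathrm{sign}(z)$, which is why the two functions agree.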

2.3. ASSUMPTIONS AND CLASSICAL RESULTS

We introduce some technical assumptions necessary for our analysis, and state-of-the-art statistical risk bounds built through dimension-based complexity analysis.

Assumption 1. The activation function $\sigma(\cdot)$ is a bounded function on the real line satisfying $\sigma(x) \to 1$ as $x \to \infty$ and $\sigma(x) \to 0$ as $x \to -\infty$, and it is $L$-Lipschitz for some constant $L$.

Assumption 2. The regularization constant $V$ is larger than $2C + f^*(0)$, where $C$ is any constant such that the Fourier transform of $f^*$, denoted by $F$, satisfies

$$\int_{\mathbb{R}^d} \|\omega\|_{\mathcal{X}} \, F(d\omega) \le C. \tag{4}$$

Assumption 3. $\sigma(x)$ approaches its limits at least polynomially fast, meaning that $|\sigma(x) - 1\{x > 0\}| < \varepsilon$ for all $|x| > x_\varepsilon$, where $x_\varepsilon$ is a polynomial of $1/\varepsilon$. Also, the value of $\eta \triangleq \sup_j \|w_j\|_{\mathcal{X}}$ scales with $n$ polynomially, meaning that $\log \eta = O(\log n)$ as $n \to \infty$.

Assumption 4. There exist a constant $c > 0$ and a bounded subset $S \subset \mathbb{R}$ such that $P(X \in S) > c$ and $\inf_{x \in S} \sigma'(x) > c$ for $X \sim N(0, 1)$.

We explain each assumption below. The notation $C$, $V$ above follows that of (Barron, 1993; 1994). Assumption 1 specifies the class of activation functions we consider. A specific case is the popular activation function $\sigma(x) = 1/\{1 + \exp(-x)\}$. Assumption 2, first introduced in (Barron, 1993), specifies the smoothness condition for $f^*$ to ensure the approximation property of neural networks (see Theorem 2.1). In Assumption 3, the condition on $w$ is for technical convenience. It could also be replaced with the following alternative condition: there exists a constant $c > 0$ such that the distribution of $x$ satisfies

$$\sup_{w : \|w\|_2 = 1} P\big( \log(|w^\top x|) < c \log \varepsilon \big) < \varepsilon$$

for any $\varepsilon \in (0, 1)$. Simply speaking, the input data $x$ is not too small with high probability. This condition is rather mild. For example, it holds when each component of $x$ has a bounded density function. This alternative condition ensures that for some small constant $\varepsilon > 0$ and any $w \in \mathbb{R}^d$, there exists a surrogate of $w$, $\hat w \in \mathbb{R}^d$ with $\log \|\hat w\|_2 = O(-\log \varepsilon)$, such that $P(|\sigma(w^\top x) - \sigma(\hat w^\top x)| > \varepsilon) < \varepsilon$.
This surrogate can be used in place of the condition on $w$ in Assumption 3 throughout the proofs in the appendix. Assumption 4 means that $\sigma(\cdot)$ is not a nearly constant function. This condition is only used to establish the minimax lower bound in Theorem 3.2.

Theorem 2.1 (Approximation error bound (Barron, 1993)). Suppose that Assumptions 1, 2, 3 hold. We have

$$\inf_{f \in \mathcal{F}_V} \Big\{ \int_{\mathcal{X}} (f(x) - f^*(x))^2 \mu(dx) \Big\}^{1/2} \le 2C \Big( \frac{1}{\sqrt{r}} + \delta_\eta \Big),$$

where $\mu$ denotes a probability measure on $\mathcal{X}$,

$$\delta_\eta = \inf_{0 < \varepsilon < 1/2} \Big\{ 2\varepsilon + \sup_{|x| > \varepsilon} \big| \sigma(\eta x) - 1\{x > 0\} \big| \Big\}, \tag{5}$$

$\eta$ is defined in Assumption 3, and $C$ is defined in (4).

Theorem 2.2 (Statistical risk bound (Barron, 1994)). Suppose that Assumptions 1, 2, 3 hold. Then the $L_2$ estimator $\hat f_n$ in $\mathcal{F}_V$ satisfies $E\{\hat f_n(x) - f^*(x)\}^2 \lesssim V^2/r + (r d \log n)/n$. In particular, if we choose $r \asymp V \sqrt{n/(d \log n)}$, then $E\{\hat f_n(x) - f^*(x)\}^2 \lesssim V \sqrt{(d \log n)/n}$.

It is known that a typical parametric rate under the $L_2$ loss is of the order $O(d/n)$, much faster than the above result. This gap is mainly due to excessive model complexity in bounding generalization errors. We will show in Section 3 that the gap in the rate of convergence can be closed when using the $L_1$ loss. Our technique will be based on the machinery of Rademacher complexity, and we bound this complexity through a joint analysis of the norm of the coefficients ('norm-based') as well as the dimension of the parameters ('dimension-based').
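The choice of $r$ in Theorem 2.2 comes from balancing the two terms of the bound. A short worked derivation, treating constants loosely and reading the bound as $V^2/r + (r d \log n)/n$:

```latex
\begin{align*}
\frac{V^2}{r} \asymp \frac{r\, d \log n}{n}
  &\iff r^2 \asymp \frac{V^2 n}{d \log n}
  \iff r \asymp V \sqrt{\frac{n}{d \log n}}, \\
\text{so that}\quad
\frac{V^2}{r} + \frac{r\, d \log n}{n}
  &\asymp V \sqrt{\frac{d \log n}{n}} .
\end{align*}
```

The first term decreases in $r$ and the second increases in $r$, so any other choice of $r$ makes one of the two terms larger than this common value.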

2.4. MODEL COMPLEXITY AND GENERALIZATION ERROR

The statistical risk consists of two parts. The first part is an approximation error term that is non-increasing in the number of neurons $r$, and the second part describes generalization errors. The key issue for risk analysis is to bound the second term using a suitable model complexity and then trade it off against the first term. We will develop our theory based on the following measure of complexity. Let $\mathcal{F}$ denote a class of functions, each mapping from $\mathcal{X}$ to $\mathbb{R}$, and let $x_1, x_2, \ldots, x_n \in \mathcal{X}$. Following terminology similar to that in (Neyshabur et al., 2015), the Rademacher complexity, or simply 'complexity', of a function class $\mathcal{F}$ is defined by $E \sup_{f \in \mathcal{F}} |n^{-1} \sum_{i=1}^{n} \xi_i f(x_i)|$, where $\xi_i$, $i = 1, 2, \ldots, n$, are independent symmetric Bernoulli random variables (taking the values $\pm 1$ with equal probability).

Lemma 2.3 (Rademacher complexity of $\mathcal{F}_V$). Suppose that Assumptions 1, 3 hold. Then for the Rademacher complexity of $\mathcal{F}_V$, we have

$$E \sup_{f \in \mathcal{F}_V} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i f(x_i) \Big| \lesssim \frac{V \sqrt{d \log n}}{\sqrt{n}}. \tag{6}$$

The proof is included in Appendix A.1. The bound in (6) is derived from an amalgamation of dimension-based and norm-based analysis elaborated in the appendix. It is somewhat surprising that the bound does not explicitly involve $r$ and $\eta$, the quantities on which the approximation error depends. This Rademacher complexity bound enables us to derive tight statistical risk bounds in the following section.
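The definition of Rademacher complexity above can be illustrated with a small Monte Carlo estimate. The sketch below, in plain Python, assumes a finite function class represented by its values on the sample points, which is a simplification made only for this example (the class $\mathcal{F}_V$ is infinite); the function name `empirical_rademacher` and its parameters are hypothetical.

```python
import random

def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_f |n^{-1} sum_i xi_i f(x_i)|.

    function_values[k][i] holds f_k(x_i) for the k-th function of a finite class;
    the xi_i are independent signs, +1 or -1 with equal probability.
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        xi = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            abs(sum(s * v for s, v in zip(xi, values))) / n
            for values in function_values
        )
    return total / num_draws

# Two functions evaluated on n = 4 sample points (illustrative values only).
values = [[1.0, 0.5, -0.25, 0.75], [0.0, 1.0, 1.0, -1.0]]
estimate = empirical_rademacher(values)
# Scaling every function by 3 scales the complexity by 3 (same seed, same signs).
estimate_scaled = empirical_rademacher([[3.0 * v for v in row] for row in values])
```

The scaling behavior mirrors the role of $V$ in (6): multiplying all functions in the class by a constant multiplies the complexity by the same constant.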

3.1. STATISTICAL RISK BOUND FOR THE $L_1$ REGULARIZED NETWORKS IN (3)

Theorem 3.1 (Statistical risk bound). Suppose that Assumptions 1, 2, 3 hold. Then the constrained $L_1$ estimator $\hat f_n$ over $\mathcal{F}_V$ satisfies

$$R(\hat f_n) \lesssim \Big( \frac{1}{\sqrt{r}} + \delta_\eta \Big) C + \frac{V \sqrt{d \log n} + \tau}{\sqrt{n}}, \tag{7}$$

where $\delta_\eta$ is defined in (5), and $\tau$ was introduced in (1). Moreover, choosing the parameters $r$, $\eta$ large enough, we have

$$R(\hat f_n) \lesssim \frac{V \sqrt{d \log n} + \tau}{\sqrt{n}}. \tag{8}$$

The proof is in Appendix A.2. We briefly explain our main idea in deriving the risk bound (7). A standard statistical risk bound contains two parts, which correspond to the approximation error and the generalization error, respectively. The approximation error part in (7) is the first term, which involves the hidden dimension $r$ and the norm of the input coefficients through $\eta$. This observation motivates us to use the norm of the output-layer coefficients through $V$ and the input dimension $d$ to derive a generalization error bound. In this way, the generalization error term does not involve $r$, which is already used for bounding the approximation error, and thus a bias-variance tradeoff through $r$ is avoided. This idea leads to the generalization error part in (7), which is the second term involving $V$ and $d$. Its proof combines the machinery of both dimension-based and norm-based complexity analysis.

From our analysis, the error bound in Theorem 3.1 is a consequence of the $L_1$ loss function and the employed $L_1$ regularization. In comparison with the previous result of Theorem 2.2, the bound obtained in Theorem 3.1 is tight and approaches the parametric rate $\sqrt{d/n}$ in the $d < n$ regime. Though we can only prove this for the $L_1$ loss in this work, we conjecture that the same rate is achieved using the $L_2$ loss. In the following, we further show that the above risk bound is minimax optimal. The minimax optimality indicates that deep neural networks with more than two layers will not perform much better than shallow neural networks when the underlying regression function belongs to $\mathcal{F}_V$.

Theorem 3.2 (Minimax risk bound). Suppose that Assumptions 1 and 4 hold, and that $x_1, x_2, \ldots, x_n \overset{\text{iid}}{\sim} N(0, I_d)$. Then

$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}_V} R(\hat f_n(x)) \gtrsim V \sqrt{d/n}.$$

Here $\mathcal{F}_V$ is the same class as defined in (3). All the smooth functions $f^*(\cdot)$ that satisfy $V > 2C + f^*(0)$ and (4) belong to $\mathcal{F}_V$ according to Theorem 2.1. The proof is included in Appendix A.3.
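To get a feel for the gap between the earlier $(d/n)^{1/4}$ rate quoted in the introduction and the bound of Theorem 3.1 after choosing $r$ and $\eta$ large enough, the two orders can be compared numerically. The sketch below is plain Python with hypothetical function names; it ignores all hidden constants and sets $V = \tau = 1$ by default, so it compares orders of magnitude only and says nothing about actual risks.

```python
import math

def earlier_rate(d, n):
    """Order (d/n)^{1/4} of the earlier risk bound cited in the introduction."""
    return (d / n) ** 0.25

def l1_regularized_rate(d, n, V=1.0, tau=1.0):
    """Order (V * sqrt(d log n) + tau) / sqrt(n), with constants ignored."""
    return (V * math.sqrt(d * math.log(n)) + tau) / math.sqrt(n)

# Illustrative setting with d much smaller than n.
d, n = 10, 10**6
old_order = earlier_rate(d, n)
new_order = l1_regularized_rate(d, n)
```

With $V = 1$ and $\tau = 0$, the second expression is smaller than the first exactly when $d \log^2 n < n$, which follows from squaring both sides twice.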

3.2. ADAPTIVENESS TO THE INPUT SPARSITY

It is common to input a high-dimensional signal to a neural network while only a few components are genuinely significant for prediction. For example, in environmental science, high-dimensional weather signals are input for prediction while few are physically related (Shi et al., 2015). In image processing, the image label is relevant to few background pixels (Han et al., 2015). In natural language processing, a large number of redundant sentences sourced from Wikipedia articles are input for language prediction (Diao et al., 2019). This practice motivates our next results, which provide a tight risk bound for neural networks whose input signals are highly sparse.

Assumption 5. There exist a positive integer $k \le d$ and an index set $S \subset \{1, \ldots, d\}$ with $\mathrm{card}(S) = k$, such that $f^*(x) = g^*(x_S)$ for some function $g^*(\cdot)$ with probability one.

The subset $S$ is generally unknown to data analysts. Nevertheless, if we know $k$, named the sparsity level, the risk bound could be further improved by a suitable regularization on the input coefficients. We have the following result, where $d$ is replaced with $k$ in the risk bound of Theorem 3.1.

Proposition 3.3. Suppose that Assumptions 1, 2, 3, 5 hold. Suppose that $\hat f_n$ is the $L_1$ estimator over the function class

$$\Big\{ f : \mathbb{R}^d \to \mathbb{R} \;\Big|\; f(x) = \sum_{j=1}^{r} a_j \sigma(w_j^\top x + b_j) + a_0, \ \|a\|_1 \le V, \ \sup_j \|w_j\|_0 \le k \Big\}.$$

Then $R(\hat f_n) \lesssim \sqrt{\{k \log(dn)\}/n}$.

The proof is included in Appendix A.4. The above statistical risk bound is also minimax optimal according to an argument similar to that of Theorem 3.2. From a practical point of view, the above $L_0$ constraint is usually difficult to implement, especially for a large input dimension $d$. Alternatively, one may impose an $L_1$ constraint instead of an $L_0$ constraint on the input coefficients. Our next result is concerned with the risk bound when the model is learned from a joint regularization on the output and input layers. For technical convenience, we will assume that $\mathcal{X}$ is a bounded set.

Theorem 3.4.
Consider the following function class of two-layer neural networks:

$$\mathcal{F}_{V,\eta} = \Big\{ f : \mathbb{R}^d \to \mathbb{R} \;\Big|\; f(x) = \sum_{j=1}^{r} a_j \sigma(w_j^\top x + b_j) + a_0, \ \|a\|_1 \le V, \ \sup_{1 \le j \le r} (\|w_j\|_1 + |b_j|) \le \eta \Big\}.$$

Suppose that $V \gtrsim C$, where $C$ is defined in (4). Then the constrained $L_1$ estimator $\hat f_n$ over $\mathcal{F}_{V,\eta}$ satisfies

$$R(\hat f_n) \lesssim C \Big( \frac{1}{\sqrt{r}} + \delta_\eta \Big) + \frac{V \eta + \tau}{\sqrt{n}},$$

where $\delta_\eta$ is defined in (5). In particular, choosing $r$ large enough, we have

$$R(\hat f_n) \lesssim C \delta_\eta + \frac{V \eta + \tau}{\sqrt{n}},$$

which does not involve the input dimension $d$ or the number of hidden neurons $r$. Moreover, suppose that $\sigma(x) = 1/(1 + e^{-x})$ and $\eta \asymp (n \log^2 n)^{1/3}$; then $R(\hat f_n) \lesssim V \{(\log n)/n\}^{1/3}$.

The proof is included in Appendix A.5. In the above result, the risk bound is of the order $O(n^{-1/3})$, which is slower than the $O(n^{-1/2})$ in Theorem 3.1 and Proposition 3.3 if we ignore $d$ and logarithmic factors of $n$. However, for a large input dimension $d$, possibly much larger than $n$, the bound can be much tighter than the previous bounds since it is dimension-free.
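The quantity $\delta_\eta$ from (5) can be evaluated numerically for the logistic activation used in Theorem 3.4. The sketch below is plain Python with hypothetical function names and a hypothetical `grid` parameter. It relies on the observation that, for the logistic function, the inner supremum over $|x| > \varepsilon$ equals $1 - \sigma(\eta \varepsilon)$, and it approximates the infimum over $\varepsilon$ on a finite grid, so the returned value is an upper approximation rather than the exact infimum.

```python
import math

def sigmoid(t):
    """Logistic activation, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def delta_eta(eta, grid=2000):
    """Grid approximation of inf over 0 < eps < 1/2 of 2*eps + (1 - sigmoid(eta*eps))."""
    best = float("inf")
    for k in range(1, grid):
        eps = 0.5 * k / grid
        best = min(best, 2.0 * eps + 1.0 - sigmoid(eta * eps))
    return best

# Illustrative values of eta only.
deltas = {eta: delta_eta(eta) for eta in (10.0, 100.0, 1000.0)}
```

Plugging $\varepsilon = (\log \eta)/\eta$ into (5) gives $\delta_\eta \le (2 \log \eta)/\eta + 1/(1 + \eta)$ for the logistic function when $\eta > 1$, which explains why the computed values shrink roughly like $(\log \eta)/\eta$ as $\eta$ grows.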

4. CONCLUSION AND FURTHER REMARKS

We studied the tradeoff between model complexity and statistical risk in two-layer neural networks from the perspective of explicit regularization. We end our paper with two problems for future work. First, in Theorem 3.4, for a small $d$, the order of $n^{-1/3}$ seems to be an artifact of our technical arguments. We conjecture that in the small-$d$ regime, this risk bound could be improved to $O(n^{-1/2})$ by certain adaptive regularizations. Second, it would be interesting to extend the current approach to yield similarly tight risk bounds for deep feedforward neural networks.

A APPENDIX

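A.1 PROOF OF LEMMA 2.3

We first prove (6), which uses an amalgamation of dimension-based and norm-based analysis. For the output layer, we use the following norm-based analysis:

$$E \sup_{f \in \mathcal{F}_V} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i f(z_i) \Big| = E \sup_{f \in \mathcal{F}_V} \Big| \Big\langle a, \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(W z_i + b) \Big\rangle \Big| \tag{9}$$

$$\le \sup \|a\|_1 \, E \sup_{f \in \mathcal{F}_V} \Big\| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(W z_i + b) \Big\|_\infty \le V E \sup_{f \in \mathcal{F}_V} \max_j \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(w_j^\top z_i + b_j) \Big| \le V E \sup_{w \in \mathbb{R}^d} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(w^\top z_i + b) \Big|.$$

For notational convenience, we define $w_0 = 0$, $b_0 = 0$, and write $a_0 = \{\sigma(0)^{-1} a_0\} \, \sigma(w_0^\top z + b_0)$, so that $a_0$ can be treated in a similar manner as the other $a_i$'s. Without loss of generality, we do not separately consider $a_0$ in the following proofs. Next, we prove that

$$E \sup_{w \in \mathbb{R}^d} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(w^\top z_i + b) \Big| \lesssim \sqrt{\frac{d \log n}{n}},$$

and thus conclude the proof. The proof is based on an $\varepsilon$-net argument together with the union bound. For any $\varepsilon$, let $W_\varepsilon \subset \mathbb{R}^d$ denote the subset

$$W_\varepsilon = \Big\{ w = \frac{\varepsilon}{2d} (i_1, i_2, \ldots, i_d) : i_j \in \mathbb{Z}, \ \|w\|_1 \le \eta_n \Big\}.$$

Then, for any $w$, $b$, there exists some element $\hat w \in W_\varepsilon$ such that

$$\sup_{z \in \mathcal{X}} |\sigma(w^\top z + b) - \sigma(\hat w^\top z + \hat b)| \le \sup_z |(w^\top z + b) - (\hat w^\top z + \hat b)| \le \sup_z |(w - \hat w)^\top z| + |b - \hat b| \le \|w - \hat w\|_1 \sup_z \|z\|_\infty + |b - \hat b| \le \varepsilon,$$

where $\hat b = (\varepsilon/2d) \lfloor 2db/\varepsilon \rfloor$ and $\lfloor \cdot \rfloor$ is the floor function. By Bernstein's inequality, for any $w$, $b$,

$$P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(w^\top z_i + b) \Big| > t \Big) \le 2 \exp\Big( - \frac{n t^2}{2(1 + t/3)} \Big).$$

By taking the union bound over $W_\varepsilon$ and using the fact that $\log \mathrm{card}(W_\varepsilon) \lesssim d \log(nd/\varepsilon)$, we obtain

$$\sup_{w \in \mathbb{R}^d} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i \sigma(w^\top z_i + b) \Big| \lesssim \varepsilon + \sqrt{\frac{d}{n} \log \frac{nd}{\varepsilon} \log \frac{1}{\delta}},$$

with probability at least $1 - \delta$. Then the desired result is obtained by taking $\varepsilon \sim \sqrt{(d \log n)/n}$.

A.2 PROOF OF THEOREM 3.1

The proof is based on the following contraction lemma used in (Neyshabur et al., 2015).

Lemma A.1 (Contraction Lemma). Suppose that $g$ is $L$-Lipschitz and $g(0) = 0$. Then for any function class $\mathcal{F}$ mapping from $\mathcal{X}$ to $\mathbb{R}$ and any set $\{x_1, x_2, \ldots, x_n\}$, we have

$$E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i g(f(x_i)) \Big| \le 2L \, E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i f(x_i) \Big|.$$

With the above lemma, we have the following result.

Lemma A.2.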
The constrained $L_1$ estimator $\hat f_n$ over $\mathcal{F}$ satisfies

$$R(\hat f_n) \le \min_{f \in \mathcal{F}} E|f(x) - f^*(x)| + 2 E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i f(z_i) \Big| + 2 \sqrt{\frac{E y^2}{n}}.$$

Proof. Define the empirical risk as

$$R_n(f) = E \, \frac{1}{n} \sum_{i=1}^{n} |f^*(x_i) + \varepsilon_i - f(x_i)| - E|\varepsilon|.$$

Since $\hat f_n$ minimizes $n^{-1} \sum_{i=1}^{n} |f^*(x_i) + \varepsilon_i - f(x_i)|$ in $\mathcal{F}$, we have

$$R(\hat f_n) \le R(\hat f_n) - \{R_n(\hat f_n) - R_n(f_0)\} = \{R(\hat f_n) - R_n(\hat f_n)\} + R_n(f_0), \tag{14}$$

where $f_0 = \arg\min_{f \in \mathcal{F}} R(f)$. We also have

$$R_n(f_0) = R(f_0) = \min_{f \in \mathcal{F}} E\big( |f^*(x) + \varepsilon - f(x)| - |\varepsilon| \big) \le \min_{f \in \mathcal{F}} E|f(x) - f^*(x)|.$$

In the following, we analyze the term $R(\hat f_n) - R_n(\hat f_n)$ in (14). Let the $z_i$'s denote independent and identically distributed copies of the $x_i$'s. Then

$$R(\hat f_n) - R_n(\hat f_n) = E \, \frac{1}{n} \sum_{i=1}^{n} \big\{ |\hat f_n(z_i) - f^*(z_i) - \varepsilon_i| - |\hat f_n(x_i) - f^*(x_i) - \varepsilon_i| \big\}$$

$$\le E \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big\{ |f(z_i) - f^*(z_i) - \varepsilon_i| - |f(x_i) - f^*(x_i) - \varepsilon_i| \big\} \le 2 E \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \xi_i |f(z_i) - f^*(z_i) - \varepsilon_i|, \tag{15}$$

where $\xi_1, \ldots, \xi_n$ are independent and identically distributed symmetric Bernoulli random variables that are independent of the $z_i$'s. According to Lemma A.1, since $g(x) = |x|$ is 1-Lipschitz and $g(0) = 0$, we have

$$E \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \xi_i |f(z_i) - f^*(z_i) - \varepsilon_i| \le 2 E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i (f(z_i) - f^*(z_i) - \varepsilon_i) \Big| \le 2 E \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i f(z_i) \Big| + 2 \sqrt{\frac{E y^2}{n}}.$$

Combining this and (15), we conclude the proof of Lemma A.2.
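The rounding step behind the $\varepsilon$-net in the proof of Lemma 2.3 (Appendix A.1) can be sketched in a few lines of plain Python. The helper name `round_to_grid` is hypothetical. Each coordinate of $w$, and the bias $b$, is rounded down to the grid with spacing $\varepsilon/(2d)$, so that $\|w - \hat w\|_1 < \varepsilon/2$ and $|b - \hat b| < \varepsilon/(2d)$; when $\sup_z \|z\|_\infty \le 1$ (an assumption made here for the illustration), the pre-activations of $w$ and $\hat w$ then differ by less than $\varepsilon$.

```python
import math

def round_to_grid(w, b, eps):
    """Round each coordinate of w and the bias b down to the grid (eps / (2d)) * Z."""
    d = len(w)
    step = eps / (2 * d)
    w_hat = [step * math.floor(wj / step) for wj in w]
    b_hat = step * math.floor(b / step)
    return w_hat, b_hat

# Illustrative values only.
w, b, eps = [0.3, -1.7, 2.45], 0.9, 0.1
w_hat, b_hat = round_to_grid(w, b, eps)
l1_error = sum(abs(wj - wh) for wj, wh in zip(w, w_hat))
bias_error = abs(b - b_hat)
```

The number of grid points with bounded $L_1$ norm grows like $(nd/\varepsilon)^{d}$ up to constants under Assumption 3, which is where the $d \log(nd/\varepsilon)$ term in the union bound comes from.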



Figure 1: A graph showing the two-layer neural network model considered in (2).

Lei Zhao, Qinghua Hu, and Wenwu Wang. Heterogeneous feature selection with multi-modal deep neural networks and sparse groupLASSO. IEEE Trans. Multimed., 17(11):1936-1948, 2015.

Proof of Theorem 3.1. The proof of (7) is a direct consequence of Lemma 2.3, Lemma A.2, Theorem 2.1, and the fact that the first absolute moment is no more than the square root of the second moment. The proof of (8) follows from the fact that $\delta_\eta \to 0$ as $\eta \to \infty$.

A.3 PROOF OF THEOREM 3.2

Define a subclass of $\mathcal{F}_V$ by In the following, we will prove the minimax bound for $\mathcal{F}_V$ by analyzing $\mathcal{F}_0$. Notice that denote the covering $\varepsilon$-entropy of $\mathcal{F}_0$ with the square root Kullback-Leibler divergence, then according to its relation with the $L_2$ distance shown in (Yang & Barron, 1999), we have where $M_2(\varepsilon)$ denotes the packing $\varepsilon$-entropy of $\mathcal{F}_V$ with the $L_2$ loss function. The second inequality is proved in a similar way to the proof of Lemma 2.3, which is omitted here for brevity. Hence, according to (Yang & Barron, 1999, Theorem 1), This concludes the proof.

A.4 PROOF OF PROPOSITION 3.3

To prove the proposition, it is sufficient to verify the following Rademacher complexity bound which can be derived easily by adjusting the proof in Lemma 2.3. Then the result follows with a similar analysis as in Theorem 3.1.

A.5 PROOF OF THEOREM 3.4

It can be verified from the identity (9) that Then according to Lemma A.1, we have Combining (16) and (17), we obtain the following lemma that may be of independent interest.

Lemma A.3. We have Since w X w 1 and {w : w X η} ⊂ {w : w 1 η}, the • X can be replaced with • 1 in the bounds in Lemmas A.3 and A.2. Then, with a similar argument as in the proof of Theorem 3.1, we conclude the proof of Theorem 3.4.

