DEEP LEARNING MEETS NONPARAMETRIC REGRESSION: ARE WEIGHT DECAYED DNNS LOCALLY ADAPTIVE?

Abstract

We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness, a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function space and the sample size. We consider a "parallel NN" variant of deep ReLU networks and show that the standard ℓ2 regularization is equivalent to promoting ℓ_p sparsity (0 < p < 1) in the coefficient vector of an end-to-end learned function basis, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the regularization factor, such parallel NNs achieve an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, they get exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.

1. INTRODUCTION

Why do deep neural networks (DNNs) work better? They are universal function approximators (Cybenko, 1989), but so are splines and kernels. They learn data-driven representations, but so do their shallower and linear counterparts such as matrix factorization. The theoretical understanding of why DNNs are superior to these classical alternatives is surprisingly limited. In this paper, we study DNNs in nonparametric regression problems, a classical branch of statistical theory and methods with more than half a century of associated literature (Nadaraya, 1964; De Boor et al., 1978; Wahba, 1990; Donoho et al., 1998; Mallat, 1999; Scholkopf & Smola, 2001; Rasmussen & Williams, 2006). Nonparametric regression addresses the fundamental problem:

• Let $y_i = f(x_i) + \text{noise}$ for $i = 1, \dots, n$. How can we estimate a function f using the data points $(x_1, y_1), \dots, (x_n, y_n)$, in conjunction with the knowledge that f belongs to a function class $\mathcal F$?

The function class $\mathcal F$ typically imposes only weak regularity assumptions such as smoothness, which makes nonparametric regression widely applicable in practice.

Local adaptivity. We say a nonparametric regression technique is locally adaptive if it can cater to local differences in smoothness, hence allowing more accurate estimation of functions with varying smoothness and abrupt changes. A subset of nonparametric regression techniques has been shown to have the property of local adaptivity (Mammen & van de Geer, 1997), in both theory and practice. These include wavelet smoothing (Donoho et al., 1998), locally adaptive regression splines (LARS, Mammen & van de Geer, 1997), trend filtering (Tibshirani, 2014; Wang et al., 2014), and adaptive local polynomials (Baby & Wang, 2019; 2020). In light of such a distinction, it is natural to consider the following question: Are NNs locally adaptive, i.e., optimal in learning functions with heterogeneous smoothness?
This is a timely question to ask, partly because the bulk of recent NN theory leverages the asymptotic Reproducing Kernel Hilbert Space (RKHS) of a NN in the overparameterized regime (Jacot et al., 2018; Belkin et al., 2018; Arora et al., 2019). RKHS-based approaches, e.g., kernel ridge regression with the limiting kernel, are linear smoothers, and linear methods are provably suboptimal for estimating functions with heterogeneous smoothness (see Section B).

We build upon the recent work of Suzuki (2018) and Parhi & Nowak (2021a), who provided encouraging first answers to the question above. Specifically, Parhi & Nowak (2021a, Theorem 8) showed that a two-layer neural network with truncated power activations and a non-standard regularization is equivalent to the LARS. This connection implies that such NNs achieve the minimax rate for the (higher-order) bounded variation (BV) classes. A detailed discussion is provided in Section B. Suzuki (2018) showed that multilayer ReLU DNNs can achieve the minimax rate for the Besov class, but requires an artificially imposed sparsity level of the DNN weights calibrated according to the parameters of the Besov class, and is thus quite difficult to implement in practice. Oono & Suzuki (2019) and Liu et al. (2021) replaced the sparse neural network with ResNet-style CNNs and achieved the same rate, but they similarly require carefully choosing the number of parameters for each nonparametric class. We show that ℓ2 regularization suffices for mildly overparameterized DNNs to achieve the optimal "locally adaptive" rates for many nonparametric classes at the same time.

Parallel neural networks. We restrict our attention to a special network architecture called the parallel neural network (Haeffele & Vidal, 2017; Ergen & Pilanci, 2021c), which learns an ensemble of subnetworks, each being a multilayer ReLU DNN. Parallel NNs have been shown to be better behaved both theoretically (Haeffele & Vidal, 2017; Zhang et al., 2019; Ergen & Pilanci, 2021b;c;d) and empirically (Zagoruyko & Komodakis, 2016; Veit et al., 2016).
On the other hand, many successful NN architectures such as SqueezeNet, ResNext, and Inception (see Ergen & Pilanci (2021c) and the references therein) use ideas similar to the parallel NN.

Weight decay, also known as squared-ℓ2 regularization, is one of the most popular regularization techniques for preventing overfitting in DNNs. It is called "weight decay" because each iteration of gradient descent (or SGD) shrinks the parameters towards 0 multiplicatively. Many tricks in deep learning, including early stopping (Yao et al., 2007), quantization (Hubara et al., 2016), and dropout (Wager et al., 2013), behave like ℓ2 regularization. Thus, even though we focus on the exact minimizer of the regularized objective, our analysis may help explain the behavior of SGD in practice.

Summary of results. Our main contributions are:

1. We prove that the (standard) ℓ2 regularization in training an L-layer parallel ReLU-activated neural network is equivalent to a sparse ℓ_p penalty (where p = 2/L) on the linear coefficients of a learned representation (Proposition 4).

2. We show that the estimation error of the ℓ2-regularized parallel NN can be close to the minimax rate for estimating functions in Besov spaces. Notably, the method can adapt to different smoothness parameters, which is not the case for many other methods.

3. We find that deeper models achieve rates closer to the optimal error rate. This result helps explain why deep neural networks can achieve better performance than shallow ones empirically.

Besides, we have the following technical contributions, which could be of independent interest:

• We provide a way to bound the complexity of an overparameterized neural network. Specifically, we bound the metric entropy of a parallel neural network in Theorem 5, and the bound does not depend on the number of subnetworks.

• We propose a method to handle an unconstrained function subspace when bounding the estimation error, as in Equation (4).
The above results separate parallel NNs from any linear method, including kernel ridge regression. To the best of our knowledge, we are the first to demonstrate that standard techniques (ℓ2 regularization and ReLU activation) suffice for DNNs to achieve the optimal rates for estimating BV and Besov functions. A comparison with previous works is shown in Table 1. More discussion of related works appears in Section A.

2. PRELIMINARY

2.1 NOTATION AND PROBLEM SETUP. We use regular font for scalars, bold lowercase letters for vectors, and bold uppercase letters for matrices. a ≲ b means a ≤ Cb for some constant C that does not depend on a or b, and a ≂ b denotes a ≲ b and b ≲ a. See Table 2 for the full list of symbols used.

Let $f_0$ be the target function to be estimated. The training dataset is $D_n := \{(x_i, y_i) : y_i = f_0(x_i) + \epsilon_i,\ i \in [n]\}$, where the $x_i$ are fixed and the $\epsilon_i$ are zero-mean, independent Gaussian noises with variance $\sigma^2$. In the following discussion, we assume $x_i \in [0,1]^d$ and $f_0(x_i) \in [-1, 1]$ for all i. We compare estimators under the mean square error (MSE), defined as
$$\mathrm{MSE}(\hat f) := \mathbb{E}_{D_n}\Big[\frac{1}{n}\sum_{i=1}^n \big(\hat f(x_i) - f_0(x_i)\big)^2\Big].$$
The optimal worst-case MSE is described by $R(\mathcal F) := \min_{\hat f} \max_{f_0 \in \mathcal F} \mathrm{MSE}(\hat f)$. We say that $\hat f$ is optimal if $\mathrm{MSE}(\hat f) \lesssim R(\mathcal F)$. The empirical (square error) loss is defined as $L(\hat f) := \frac{1}{n}\sum_{i=1}^n (\hat f(x_i) - y_i)^2$, and the corresponding population loss is $\bar L(\hat f) := \mathbb{E}\big[\frac{1}{n}\sum_{i=1}^n (\hat f(x_i) - y'_i)^2 \mid \hat f\big]$, where the $y'_i$ are fresh observations at the same design points. It is clear that $\mathbb{E}[\bar L(\hat f)] = \mathrm{MSE}(\hat f) + \sigma^2$.
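The definitions above can be made concrete in a few lines of NumPy (the target $f_0$, the noise level, and the constant-zero estimator below are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fixed-design setup: sinusoidal target, Gaussian noise.
n = 200
x = np.linspace(0.0, 1.0, n)
f0 = np.sin(2 * np.pi * x)            # f0(x_i) at the fixed design points
sigma = 0.1
y = f0 + sigma * rng.normal(size=n)   # y_i = f0(x_i) + eps_i

f_hat = np.zeros_like(x)              # a (deliberately poor) estimator

# Empirical square-error loss: L(f) = (1/n) * sum_i (f(x_i) - y_i)^2
emp_loss = np.mean((f_hat - y) ** 2)

# In-sample error against the noiseless target; averaging it over noise
# draws gives MSE(f_hat) in the sense of Section 2.1.
in_sample_err = np.mean((f_hat - f0) ** 2)

# E[population loss] = MSE + sigma^2, so for a fixed estimator the
# empirical loss concentrates near in_sample_err + sigma^2.
print(emp_loss, in_sample_err + sigma ** 2)
```

For a fixed estimator the empirical loss concentrates around the in-sample error plus $\sigma^2$, matching the identity $\mathbb{E}[\bar L(\hat f)] = \mathrm{MSE}(\hat f) + \sigma^2$.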

2.2. BESOV SPACES AND BOUNDED VARIATION SPACES

Besov space, denoted $B^\alpha_{p,q}$, is a flexible function class parameterized by α, p, q, whose definition is deferred to Section C.1. Here α ≥ 0 determines the smoothness of the functions, 1 ≤ p ≤ ∞ determines the averaging (quasi-)norm over locations, and 1 ≤ q ≤ ∞ determines the averaging (quasi-)norm over scales, which plays a relatively minor role. A smaller p is more forgiving of inhomogeneity and, loosely speaking, when the function domain is bounded, a smaller p induces a larger function space. On the other hand, it is easy to see from the definition that $B^\alpha_{p,q} \subset B^\alpha_{p,q'}$ if q < q'. Without loss of generality, in the following discussion we focus on $B^\alpha_{p,\infty}$. When p = 1, the Besov space allows high inhomogeneity, and it is more general than the Sobolev or Hölder spaces.

Table 2: List of symbols.
- $[n]$: $\{x \in \mathbb{N} : 1 \le x \le n\}$.
- $\mathbb{R}, \mathbb{Z}, \mathbb{N}$: sets of real numbers, integers, and nonnegative integers.
- $\|\cdot\|_F$: Frobenius norm. $\|\cdot\|_p$: $\ell_p$-norm.
- $|\cdot|_{B^\alpha_{p,q}}$, $\|\cdot\|_{B^\alpha_{p,q}}$: Besov quasi-norm and Besov norm.
- $M_m(\cdot)$: $m$-th order cardinal B-spline bases.
- $W^{(\ell)}_j, b^{(\ell)}_j$: weight and bias of the $\ell$-th layer of the $j$-th subnetwork.

Bounded variation (BV) space is a more interpretable class of functions with spatially heterogeneous smoothness (Donoho et al., 1998). It is defined through the total variation (TV) of a function. For an (m+1)-times (weakly) differentiable function $f : [0,1] \to \mathbb{R}$, the $m$-th order total variation is defined as (Mammen & van de Geer, 1997; Lorentz, 1993)
$$TV^{(m)}(f) := TV\big(f^{(m)}\big) = \int_{[0,1]} \big|f^{(m+1)}(x)\big|\, dx.$$
The BV class is sandwiched between Besov classes:
$$B^{m+1}_{1,1} \subset BV(m) \subset B^{m+1}_{1,\infty}. \qquad (1)$$
This allows the results derived for Besov spaces to be easily applied to the BV space.

Minimax MSE. It is well known that the minimax rates for the Besov class and the 1D BV(m) class are $O(n^{-\frac{2\alpha}{2\alpha+d}})$ and $O(n^{-(2m+2)/(2m+3)})$, respectively.
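As a quick numerical sketch of the TV definitions (the grid size and test function below are illustrative choices), discretizing $\int |f^{(m+1)}|$ with finite differences gives:

```python
import numpy as np

def tv_m(f_vals, m, dx):
    """Discrete m-th order total variation: take m finite-difference
    derivatives, then sum |successive differences|. This approximates
    TV^{(m)}(f) = integral of |f^{(m+1)}| on a fine grid."""
    g = np.asarray(f_vals, dtype=float)
    for _ in range(m):
        g = np.diff(g) / dx
    return np.sum(np.abs(np.diff(g)))

x = np.linspace(0.0, 1.0, 10001)
f = np.sin(2 * np.pi * x)
dx = x[1] - x[0]
print(tv_m(f, 0, dx))  # TV(f): f rises 1, falls 2, rises 1, so about 4
print(tv_m(f, 1, dx))  # TV(f'): about TV of 2*pi*cos(2*pi*x), i.e. 8*pi
```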
The minimax rate for linear estimators on the 1D BV(m) class is known to be $O(n^{-(2m+1)/(2m+2)})$ (Mammen & van de Geer, 1997; Donoho et al., 1998).

3. MAIN RESULTS: PARALLEL RELU DNNS

Consider a parallel neural network containing M multilayer perceptrons (MLPs) with ReLU activation functions, called subnetworks. Each subnetwork has width w and depth L. The input is fed to all the subnetworks, and the output of the parallel NN is the sum of the outputs of the subnetworks. The architecture of a parallel neural network is shown in Figure 2a. This parallel neural network is equivalent to a vanilla neural network with block-diagonal weight matrices in all but the first and the last layers (Figure 2b). We train it with the ℓ2-regularized objective
$$\arg\min_{\{W^{(\ell)}_j, b^{(\ell)}_j\}} L(\hat f) + \lambda \sum_{j=1}^{M}\sum_{\ell=1}^{L} \big\|W^{(\ell)}_j\big\|_F^2, \qquad (2)$$
where $\hat f(x) = \sum_{j=1}^M \hat f_j(x)$ denotes the parallel neural network, $\hat f_j(\cdot)$ denotes the j-th subnetwork, and λ > 0 is a fixed regularization factor. We choose not to regularize the bias terms $b^{(\ell)}_j$ in order to obtain a cleaner equivalent model (Proposition 4); if the bias terms are regularized, the result is similar. Besides, we ignore computational issues and focus on the globally optimal solution of this problem. In practice, the solutions obtained by gradient-descent-style methods in deep neural networks are often close to globally optimal (Choromanska et al., 2015).

Theorem 1. For any fixed αd/p > 1, q ≥ 1, L ≥ 3, define m = ⌈α − 1⌉. For any $f_0 \in B^\alpha_{p,q}$, given an L-layer parallel neural network satisfying:

• The width of each subnetwork is fixed, satisfying w ≥ O(md); see Theorem 8 for the details.

• The number of subnetworks is large enough: $M \gtrsim n^{\frac{1-2/L}{2\alpha/d + 1 - 2/(pL)}}$.

Under the assumptions of Lemma 17, with a proper choice of the regularization parameter λ that depends on D, α, d, L, the solution $\hat f$ of (2) satisfies
$$\mathrm{MSE}(\hat f) = C(w, L)\, \tilde O\Big(n^{-\frac{(2\alpha/d)(1-2/L)}{2\alpha/d + 1 - 2/(pL)}} + e^{-c_6 L}\Big), \qquad (3)$$
where Õ hides logarithmic factors, $c_6 > 0$ is a numerical constant from Theorem 8, and $C(w, L) ≂ (w^{4-4/L} L^{2-4/L})^{\frac{2\alpha/d}{2\alpha/d + 1 - 2/(pL)}}$ depends polynomially on L. We explain the proof idea in the next section, but defer the extended form of the theorem and the full proof to Section F.
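A minimal NumPy sketch of the architecture and the weight-decay objective above (the layer sizes, initialization scale, and λ are arbitrary illustrative choices; a real implementation would train these parameters with SGD):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def init_subnet(d, w, L, rng):
    """One subnetwork: an L-layer ReLU MLP of width w mapping R^d -> R.
    (Sizes and initialization scale are illustrative.)"""
    sizes = [d] + [w] * (L - 1) + [1]
    return [(rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i])),
             np.zeros(sizes[i + 1])) for i in range(L)]

def subnet_forward(params, x):
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:       # ReLU on all but the output layer
            h = relu(h)
    return h[0]

def parallel_nn(subnets, x):
    """Parallel NN output: the sum of the M subnetwork outputs."""
    return sum(subnet_forward(p, x) for p in subnets)

def weight_decay(subnets, lam):
    """The penalty in the objective above: lam times the sum of squared
    Frobenius norms of all weight matrices (biases left unregularized)."""
    return lam * sum(np.sum(W ** 2) for p in subnets for W, _ in p)

d, w, L, M = 2, 8, 3, 5
subnets = [init_subnet(d, w, L, rng) for _ in range(M)]
x = rng.normal(size=d)
out = parallel_nn(subnets, x)
print(out, weight_decay(subnets, 0.01))
```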
Before that, we comment on a few interesting aspects of the result.

Near-optimal rates and the effect of depth. The first term in the MSE bound is the estimation error and the second term is (part of) the approximation error of this NN. Recall that the minimax rate of a Besov class is $O(n^{-\frac{2\alpha}{2\alpha+d}})$. The gap between the estimation error and the minimax rate arises because the minimax rate can be achieved by an ℓ0-sparse model, while the parallel NN is equivalent to an ℓ_p-sparse model (as will be shown in Proposition 4), which is an approximation to ℓ0. As the depth L increases, p = 2/L gets closer to 0, so the MSE can get arbitrarily close to the minimax rate and the trailing exponential term in (3) can be arbitrarily small. A near-optimal rate is achieved if we choose L ≳ log n:

Corollary 2. Under the conditions of Theorem 1, for any $f_0 \in B^\alpha_{p,q}$, there is a numerical constant C such that when we choose $C \log n \le L \le 100 C \log n$,
$$\mathrm{MSE}(\hat f) = \tilde O\big(n^{-\frac{2\alpha}{2\alpha+d}(1 - o(1))}\big),$$
where Õ hides only logarithmic factors and the o(1) factor in the exponent is $O(1/\log n)$.

Sparsity and comparison with standard NNs. We also note that the result does not depend on M as long as M is large enough. This means that the neural network can be arbitrarily overparameterized without overfitting. The underlying reason is sparsity. As will become clearer in Section 4.1, ℓ2-regularized training of a parallel L-layer ReLU NN is equivalent to a sparse regression problem with an ℓ_p penalty on the coefficient vector of a learned dictionary. Here p = 2/L, which promotes even sparser solutions than an ℓ1 penalty. Such ℓ_p sparsity does not exist in standard deep neural networks to the best of our knowledge, which indicates that parallel neural networks may be superior to standard neural networks in local adaptivity.

Adaptivity to function spaces.
For any fixed L, m, our result shows that a parallel neural network with width w = O(md) can achieve close to the minimax rate for any Besov class as long as α ≤ m. In other words, neural networks can adapt to the smoothness parameter by tuning only the regularization factor. As will be shown in Theorem 5, overestimating α with m only changes the logarithmic terms in the MSE bound, a mild price to pay for a more adaptive method.

Hyperparameter tuning. We provide an explicit choice of λ in Lemma 17 underlying our theoretical result. In practice, λ can be determined empirically, e.g., by cross-validation.

Fixed design vs. random design. We mainly focus on bounding the error at the sample covariates (the fixed design problem) to be comparable with classical nonparametric regression results. For completeness, we also state results for the random design version of the problem in Theorem 19, to be compatible with the standard statistical learning setting (e.g., Suzuki, 2018). More discussion of this result can be found in Section G.

Bounded variation classes. Thanks to the Besov space embedding of the BV class (1), our theorem also implies a result for the BV class in 1D.

Corollary 3. If the target function is in a bounded variation class, $f_0 \in BV(m)$, then for any fixed L ≥ 3, for a neural network satisfying the requirements of Theorem 1 with d = 1, and with a proper choice of the regularization factor λ, the NN $\hat f$ parameterized by (6) satisfies
$$\mathrm{MSE}(\hat f) = C(w, L)\, \tilde O\big(n^{-\frac{(2m+2)(1-2/L)}{2m+3-2/L}}\big) + O(e^{-c_6 L}),$$
where C(w, L) is the same as in (3) except with α replaced by m.

It is known that linear estimators such as kernel smoothing and smoothing splines cannot achieve an error lower than $O(n^{-(2m+1)/(2m+2)})$ for BV(m) (Donoho et al., 1998). When $L > O(m^2)$, the first term in the MSE of the NN decreases with n faster than that of the linear methods.
When n is large enough, there exists L such that the MSE of NN is strictly smaller than that of any linear method. This partly explains the advantage of DNNs over kernels.

4. PROOF OVERVIEW

We start by proving that a parallel neural network trained with ℓ2 regularization is equivalent to an ℓ_p-sparse regression problem with representation learning (Section 4.1), which helps decompose its MSE into an estimation error and an approximation error. We then bound the two terms under an ℓ_p-sparse constrained problem setting in Section 4.2 and Section 4.3, respectively. Notably, we adapted the generic statistical learning machinery (a self-bounding argument) for constrained ERM problems (Suzuki, 2018, Proposition 4) to bound the estimation error. This adaptation is non-trivial because there is an unconstrained subspace with no bounded metric entropy. Specifically, Proposition 15 shows that the MSE of the regression problem can be bounded by
$$\mathrm{MSE}(\hat f) = O\bigg(\underbrace{\inf_{f \in \mathcal F} \mathrm{MSE}(f)}_{\text{approximation error}} + \underbrace{\frac{\log N(\mathcal F_\parallel, \delta, \|\cdot\|_\infty) + d(\mathcal F_\perp)}{n} + \delta}_{\text{estimation error}}\bigg), \qquad (4)$$
in which $\mathcal F$ decomposes into $\mathcal F_\parallel \times \mathcal F_\perp$, where $\mathcal F_\perp$ is an unconstrained subspace of finite dimension $d(\mathcal F_\perp)$, and $\mathcal F_\parallel$ is a compact set in the orthogonal complement with δ-covering number $N(\mathcal F_\parallel, \delta, \|\cdot\|_\infty)$ in the $\|\cdot\|_\infty$-norm. This decomposes the MSE into an approximation error and an estimation error. The novel analysis of these two terms represents the major technical contribution of this paper.

4.1. EQUIVALENCE TO ℓ p SPARSE REGRESSION

It is widely known that the ReLU function is 1-homogeneous: σ(ax) = aσ(x) for all a ≥ 0 and x ∈ R. In any two consecutive layers of a neural network (or a subnetwork), one can multiply the weights and biases of one layer by a positive constant and divide the weights of the next layer by the same constant; the transformed network computes the same function as the original one:
$$W^{(2)} \sigma\big(W^{(1)} x + b^{(1)}\big) = \tfrac{1}{c}\, W^{(2)} \sigma\big(cW^{(1)} x + cb^{(1)}\big), \quad \forall c > 0,\ x.$$
Applying this property to each subnetwork (instead of the entire model, as in a standard NN), we can reformulate (2) as an ℓ_p sparsity-regularized problem:

Proposition 4. There exists a one-to-one mapping between λ > 0 and λ′ > 0 such that (2) is equivalent to the following problem:
$$\arg\min_{\{\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j, a_j\}} L\Big(\sum_{j=1}^M a_j \tilde f_j\Big) + \lambda' \big\|\{a_j\}\big\|_{2/L}^{2/L} \quad \text{s.t. } \|\tilde W^{(1)}_j\|_F \le c_1\sqrt d\ \ \forall j \in [M];\quad \|\tilde W^{(\ell)}_j\|_F \le c_1\sqrt w\ \ \forall j \in [M],\ 2 \le \ell \le L, \qquad (6)$$
where $\tilde f_j(\cdot)$ is a subnetwork with parameters $\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j$.

This equivalent model is illustrated in Figure 2b. The proof, which we defer to Section D.1, uses the AM-GM inequality and the observation that the optimal solution has norm-equalized weights across layers. The constraint $\|\tilde W^{(1)}_j\|_F \lesssim \sqrt d$, $\|\tilde W^{(\ell)}_j\|_F \lesssim \sqrt w$ for ℓ > 1 is typical in deep learning for better numerical stability. The equivalent model in Proposition 4 is also a parallel neural network, but it appends one layer with parameters $\{a_j\}$ at the end, and the constraint on the Frobenius norms is converted into an ℓ_{2/L} penalty on the factors $\{a_j\}$. Since L ≫ 2 in typical applications, 2/L ≪ 1 and this regularizer enforces a sparser model than the ℓ1 penalty in Section B. The same technique can also be used to prove that an ℓ2-constrained neural network is equivalent to the ℓ_{2/L}-constrained model in (7). There are two useful implications of Proposition 4. First, it gives an intuitive explanation of how a regularized parallel NN works.
Specifically, it can be viewed as a sparse linear regression with representation learning. Secondly, the conversion into the constrained form allows us to decompose the MSE into two terms as in (4) and bound them separately. We emphasize that Proposition 4 by itself is not new. The same result was previously obtained by Savarese et al. (2019, Appendix C) (see Section A for more details) and the key proof techniques date back to at least Burer & Monteiro (2003) . Our novel contribution is to leverage this folklore equivalence for proving new learning bounds.
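The AM-GM balancing step behind Proposition 4 can be checked numerically in a scalar caricature, one weight per layer and no biases (this is an illustration of the rescaling argument, not the full matrix statement; the weight values are arbitrary):

```python
import numpy as np

def l2_penalty(ws):
    return float(np.sum(np.asarray(ws) ** 2))

def balanced(ws):
    """Rescale the per-layer weights, keeping their product (hence the
    end-to-end function) fixed, so that all layers share one magnitude;
    by AM-GM this minimizes the l2 penalty among all rescalings."""
    ws = np.asarray(ws, dtype=float)
    c = np.prod(ws)
    mag = np.abs(c) ** (1.0 / len(ws))
    out = np.full(len(ws), mag)
    out[0] *= np.sign(c)
    return out

L = 4
ws = np.array([2.0, 0.5, 3.0, 1.0])   # one scalar weight per "layer"
c = np.prod(ws)                        # effective coefficient a_j

# AM-GM: sum_l w_l^2 >= L * |prod_l w_l|^{2/L}, with equality iff balanced,
# so the l2 penalty acts as an l_{2/L} penalty on the coefficient c.
assert l2_penalty(ws) >= L * np.abs(c) ** (2.0 / L)
assert np.isclose(l2_penalty(balanced(ws)), L * np.abs(c) ** (2.0 / L))
assert np.isclose(np.prod(balanced(ws)), c)   # same end-to-end function
print(l2_penalty(ws), l2_penalty(balanced(ws)))
```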

4.2. ESTIMATION ERROR ANALYSIS

Previous results that bound the covering number of neural networks (Yarotsky, 2017; Suzuki, 2018) depend explicitly on the width of the network, and hence cannot be applied when analysing a potentially infinitely wide neural network. In this section, we leverage the ℓ_p-norm bound on the coefficients to avoid any dependence on M in the covering number bound, and focus on the constrained optimization problem
$$\arg\min_{\{\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j, a_j\}} L\Big(\sum_{j=1}^M a_j \tilde f_j\Big), \quad \text{s.t. } \big\|\{a_j\}\big\|_{2/L}^{2/L} \le P', \qquad (7)$$
and $\{\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j\}$ satisfy the same constraints as in (6). The connection between the regularized problem and the constrained problem is deferred to Lemma 17.

Theorem 5. The covering number of the model defined in (7), apart from the bias in the last layer, satisfies
$$\log N(\mathcal F, \delta) \lesssim w^{2 + \frac{2}{1-2/L}} L^2 \big(\sqrt d\, P'\big)^{\frac{1}{1-2/L}} \delta^{-\frac{2/L}{1-2/L}} \log(wP'/\delta).$$

This theorem bounds the estimation error of an arbitrarily wide parallel neural network as long as the total Frobenius norm is bounded. The proof can be found in Section D.2. It requires the following lemma, whose proof is deferred to Section D.3:

Lemma 6. Let $\mathcal G \subseteq \{\mathbb R^d \to [-c_3, c_3]\}$ be a set with covering number satisfying $\log N(\mathcal G, \delta) \lesssim k \log(1/\delta)$ for some finite $c_3$, and suppose that $ag \in \mathcal G$ for any $g \in \mathcal G$ and $|a| \le 1$. Then for any $P > 0$ and $0 < p < 1$, the covering number of
$$\mathcal F = \Big\{\sum_{i=1}^M a_i g_i \,:\, g_i \in \mathcal G,\ \|a\|_p^p \le P\Big\}$$
satisfies $\log N(\mathcal F, \delta) \lesssim k P^{\frac{1}{1-p}} (\delta/c_3)^{-\frac{p}{1-p}} \log(c_3 P/\delta)$ up to a double logarithmic factor.

4.3. APPROXIMATION ERROR ANALYSIS

The approximation error analysis involves two steps. We first analyse how a subnetwork can approximate a B-spline basis function, which is deferred to Section E.1. Then we show that a sparse linear combination of B-spline bases approximates Besov functions. The two add up to the total error of approximating Besov functions with a parallel neural network (Theorem 8).

Proposition 7. Let αd/p > 1, r > 0, and 0 < α < min(m, m − 1 + 1/p). For any function in the Besov space $f_0 \in B^\alpha_{p,q}$ and any positive integer M, there is an M-sparse approximation using B-spline bases of order m,
$$\check f_M = \sum_{i=1}^{M} a_{k_i,s_i} M_{m,k_i,s_i},$$
such that the approximation error is bounded as $\|\check f_M - f_0\|_r \lesssim M^{-\alpha/d} \|f_0\|_{B^\alpha_{p,q}}$, and the coefficients satisfy $\|\{2^{k_i} a_{k_i,s_i}\}_{k_i,s_i}\|_p \lesssim \|f_0\|_{B^\alpha_{p,q}}$.

The proof, as well as a remark, can be found in Section E.2.

Theorem 8. Under the same conditions as Proposition 7, for any positive integer M, any function in the Besov space $f_0 \in B^\alpha_{p,q}$ can be approximated by a parallel neural network with no more than O(M) subnetworks satisfying:

1. Each subnetwork has width w = O(md) and depth L.

2. The weights in each layer satisfy $\|\tilde W^{(\ell)}_k\|_F \le O(\sqrt w)$, except the first layer, for which $\|\tilde W^{(1)}_k\|_F \le O(\sqrt d)$.

3. The scaling factors have bounded ℓ_{2/L}-norm: $P' := \|\{a_j\}\|_{2/L}^{2/L} \lesssim M^{1 - 2/(pL)}$.

4. The approximation error is bounded by
$$\|\hat f - f_0\|_r \le \big(c_4 M^{-\alpha/d} + c_5 e^{-c_6 L}\big)\|f_0\|_{B^\alpha_{p,q}},$$
where $c_4, c_5, c_6$ are constants that depend only on m, d, and p.

Here M is the number of "active" subnetworks, which is not to be confused with the number of subnetworks at initialization. The proof can be found in Section E.3. Combining the estimation error bound in Theorem 5 with the approximation error bound in Theorem 8, and choosing M to balance the two errors, we obtain the sample complexity of parallel neural networks with ℓ2 regularization, which is the main result (Theorem 1) of this paper. See Section F for the details.
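The cardinal B-spline bases $M_m(\cdot)$ underlying Proposition 7 and Theorem 8 are easy to evaluate with the Cox-de Boor recursion (a standard construction; the indexing convention and grid below are illustrative choices):

```python
import numpy as np

def cardinal_bspline(m, x):
    """Cardinal B-spline M_m supported on [0, m+1], via the Cox-de Boor
    recursion M_m(x) = (x*M_{m-1}(x) + (m+1-x)*M_{m-1}(x-1)) / m."""
    x = np.asarray(x, dtype=float)
    if m == 0:
        return ((0.0 <= x) & (x < 1.0)).astype(float)
    return (x * cardinal_bspline(m - 1, x)
            + (m + 1 - x) * cardinal_bspline(m - 1, x - 1.0)) / m

xs = np.linspace(-1.0, 6.0, 1401)
B2 = cardinal_bspline(2, xs)          # peaks at 0.75 at the center of [0, 3]

# Integer shifts form a partition of unity: the property that makes
# dilated/shifted B-splines a convenient dictionary for Besov functions.
pou = sum(cardinal_bspline(2, xs - k) for k in range(-3, 9))
print(B2.max(), pou.min(), pou.max())
```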

5. EXPERIMENT

We empirically compare a parallel neural network (PNN) and a vanilla ReLU neural network (NN) with smoothing splines, trend filtering (TF) (Tibshirani, 2014), and wavelet denoising. Trend filtering can be viewed as a more efficient discrete-spline version of the locally adaptive regression spline and enjoys the same optimal rates for the BV classes. Wavelet denoising is also known to be minimax optimal for the BV classes. The results are shown in Figure 3. We use two target functions: a Doppler function whose frequency is decreasing, and a second function ("Vary") with spatially varying smoothness (Figure 3).

As shown in the figure, both TF and wavelet denoising adapt to the different levels of smoothness in the target function, while smoothing splines tend to oversmooth where the target function is less smooth (the left side in (a)(d), enlarged in (g)). The prediction of the PNN is similar to TF and wavelet denoising and shows local adaptivity. Besides, the MSE of the PNN follows almost the same trend as TF and wavelet denoising, which is consistent with our theoretical understanding that the error rate of the neural network is closer to that of locally adaptive methods. Notably, PNN, TF, and wavelet denoising achieve lower error at a much smaller degree of freedom than smoothing splines. The best MSE achievable by the parallel NN is mildly worse than that of TF in both examples. We are surprised that the gap is so small, because the parallel NN needs to learn the basis functions that TF essentially hard-codes; the additional price for using a more adaptive and more flexible representation learning method appears to be low. In Figure 3(c)(f), we show the outputs of all the "active" subnetworks, i.e., the subnetworks whose output is not constant. Notice that the number of active subnetworks is much smaller than at initialization. This is because the ℓ2 regularization of the weights induces ℓ_p sparsity, and the weights of most subnetworks shrink towards 0 after training. More details are given in Section H.
In Figure 3(h)(i), we plot the MSE versus the number of training samples for "Doppler" and "Vary" respectively. The parallel NN works best overall. In (i), we further compare the scaling of the MSE against the minimax rate ($n^{-4/5}$) and the minimax linear rate ($n^{-3/4}$), i.e., the best rate kernel methods could achieve. As predicted by our theory, when n is large, the MSE of the parallel neural network and trend filtering decreases at almost the minimax rate, while smoothing splines, as expected, converge at the (suboptimal) minimax linear rate. Interestingly, the vanilla NN also seems to converge at the optimal rate on this example. It remains an open question whether the vanilla NN is merely "lucky" on this example, or whether it also achieves the minimax rate for all functions in BV(m).
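For readers who want a feel for why the Doppler example stresses non-adaptive methods, the snippet below uses the standard Donoho-Johnstone Doppler function (the exact variant used in the experiments is not specified here) and a global polynomial fit as a stand-in for a linear, non-adaptive smoother; its error concentrates in the high-frequency region:

```python
import numpy as np

def doppler(x, eps=0.05):
    """Standard Donoho-Johnstone Doppler test function: rapidly
    oscillating near x = 0, smooth on the right, i.e., spatially
    heterogeneous smoothness. (May differ from the paper's variant.)"""
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + eps) / (x + eps))

rng = np.random.default_rng(0)
n = 1000
x = np.linspace(0.0, 1.0, n)
y = doppler(x) + 0.1 * rng.normal(size=n)

# A global degree-10 polynomial fit is a linear, non-adaptive smoother:
# it cannot track the left-side oscillations without ruining the right.
coef = np.polyfit(x, y, deg=10)
resid = doppler(x) - np.polyval(coef, x)
err_left = np.mean(resid[x < 0.2] ** 2)    # high-frequency region
err_right = np.mean(resid[x > 0.8] ** 2)   # smooth region
print(err_left, err_right)
```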

6. CONCLUSION AND DISCUSSION

In this paper, we show that a deep parallel neural network can be locally adaptive with standard ℓ2 regularization. This confirms that neural networks can be nearly optimal in learning functions with heterogeneous smoothness, which separates them from kernel methods. Specifically, we prove that training an L-layer parallel neural network with standard ℓ2 regularization is equivalent to an ℓ_{2/L}-penalized regression model with representation learning. Since L ≫ 2 in typical applications, standard regularization promotes a sparse linear combination of the learned bases. Using this equivalence, we prove that a parallel neural network can achieve close to the minimax rate on the Besov space and the bounded variation (BV) space by tuning only the regularization factor. Our result reveals that one does not need to specify the smoothness parameter α (or m) when training a parallel neural network. With only an estimate of an upper bound on α (or m), parallel neural networks can adapt to different degrees of smoothness, or effectively choose different parameters for different regions of the domain of the target function. This property shows the strong adaptivity of deep neural networks. On the other hand, as the depth L increases, 2/L tends to 0 and the error rate moves closer to the minimax rate of the Besov and BV spaces. This indicates that when the sample size is large enough, deeper models have smaller error than shallower models, and helps explain why deep neural networks empirically perform better than shallow ones.

A OTHER RELATED WORKS

NN and kernel methods. Jacot et al. (2018) drew the connection between neural networks and kernel methods. However, it has been found that neural networks often outperform any kernel method, especially when the learning rate is relatively large (Lewkowycz et al., 2020). A series of works tried to distinguish NNs from kernel methods by providing examples of function spaces on which NNs provably outperform kernel methods (Allen-Zhu & Li, 2019; Ghorbani et al., 2020). However, these papers did not consider the local adaptivity of neural networks, which provides a more systematic explanation.

ResNet-type convolutional neural networks. A recent series of works (Oono & Suzuki, 2019; Liu et al., 2021) proved that an arbitrary parallel neural network can be approximated by a ResNet-type convolutional neural network. These works do not require the model to be sparse and are thus easier to train, yet they still require the architecture (the width and depth of each residual block, and the number of residual blocks) to be tuned based on the dataset, and the estimation error analysis is based on the number of parameters. Besides, the number of residual blocks needs to increase with n, making the entire network too deep to train in practice.

Approximation and estimation. The approximation-theoretic and estimation-theoretic study of neural networks has a long history too (Cybenko, 1989; Barron, 1994; Yarotsky, 2017; Schmidt-Hieber, 2020; Suzuki, 2018).

Parhi & Nowak (2021a) consider the two-layer neural network
$$f(x) = \sum_{j=1}^M v_j \sigma_m(w_j x + b_j) + c(x), \qquad (9)$$
where $w_j, v_j$ denote the weights in the first and second layer respectively, $b_j$ the biases in the first layer, $c(x)$ is a polynomial of degree up to m, and $\sigma_m(x) := \max(x, 0)^m$. Parhi & Nowak (2021a, Theorem 8) showed that when M is large enough, the optimization problem
$$\min_{w,v}\ L(f) + \frac{\lambda}{2} \sum_{j=1}^M \big(|v_j|^2 + |w_j|^{2m}\big) \qquad (10)$$
is equivalent to the locally adaptive regression spline
$$\min_f\ L(f) + \lambda\, TV\big(f^{(m)}\big), \qquad (11)$$
which optimizes over arbitrary functions that are m-times weakly differentiable.
The latter was studied in Mammen & van de Geer (1997), which leads to the following MSE:

Theorem 9. Let $M \ge n - m$, and let $\hat f$ be the function (9) parameterized by the minimizer of (10). Then $\mathrm{MSE}(\hat f) = O\big(n^{-(2m+2)/(2m+3)}\big)$.

We show a simpler proof in the univariate case due to Tibshirani (2022):

Proof. As is shown in Parhi & Nowak (2021a, Theorem 8), the minimizer of (10) satisfies $|v_j| = |w_j|^m$ for all $j$, so the $m$-th order TV of the neural network $f_{NN}$ is $$TV^{(m)}(f_{NN}) = TV^{(m)}(c(x)) + \sum_{j=1}^{M} |v_j| |w_j|^m\, TV^{(m)}\big(\sigma_m(x)\big) = \sum_{j=1}^{M} |v_j| |w_j|^m = \frac{1}{2} \sum_{j=1}^{M} \big(|v_j|^2 + |w_j|^{2m}\big),$$ which shows that (10) is equivalent to the locally adaptive regression spline (11) as long as the number of knots in (11) is no more than $M$. Furthermore, it is easy to check that any spline with no more than $M$ knots can be expressed as a two-layer neural network (9). It therefore suffices to prove that the solution of (11) has no more than $n - m$ knots. Mammen & van de Geer (1997, Proposition 1) showed that there is a solution $f(x)$ to (11) such that $f(x)$ is an $m$-th order spline with a finite number of knots, but did not give a bound. Letting the number of knots be $M$, we can represent $f$ using the truncated power basis $$f(x) = \sum_{j=1}^{M} a_j (x - t_j)_+^m + c(x) := \sum_{j=1}^{M} a_j \sigma_j^{(m)}(x) + c(x),$$ where the $t_j$ are the knots, $c(x)$ is a polynomial of degree up to $m$, and $\sigma_j^{(m)}(x) := (x - t_j)_+^m$. Mammen & van de Geer (1997), however, did not give a bound on $M$. Parhi & Nowak (2021a)'s Theorem 1 implies that $M \le n - m$. Its proof is quite technical and applies more generally to a higher dimensional generalization of the BV class. Tibshirani (2022) communicated to us the following elegant argument proving the same claim using elementary convex analysis and linear algebra, which we present below. Define $\Pi_m(f)$ as the $L_2(P_n)$ projection of $f$ onto polynomials of degree up to $m$, and $\Pi_m^\perp(f) := f - \Pi_m(f)$. It is easy to see that $$\Pi_m^\perp f(x) = \sum_{j=1}^{M} a_j \Pi_m^\perp \sigma_j^{(m)}(x).$$ Denote $f(x_{1:n}) := \{f(x_1), \dots, f(x_n)\} \in \mathbb{R}^n$ as the vector of all the predictions at the sample points. Then $$\Pi_m^\perp f(x_{1:n}) = \sum_{j=1}^{M} a_j \Pi_m^\perp \sigma_j^{(m)}(x_{1:n}) \in \Big(\sum_{j=1}^{M} |a_j|\Big) \cdot \mathrm{conv}\big\{\pm \Pi_m^\perp \sigma_j^{(m)}(x_{1:n})\big\},$$ where conv denotes the convex hull of a set. The vectors $\sigma_j^{(m)}(x_{1:n})$ live in an $n$-dimensional space, and polynomials of degree up to $m$ form an $(m+1)$-dimensional space, so the set defined above has dimension $n - m - 1$. By Carathéodory's theorem, there is a subset of points $\{\Pi_m^\perp \sigma_{j_k}^{(m)}(x_{1:n})\}_{k=1}^{n-m} \subseteq \{\Pi_m^\perp \sigma_j^{(m)}(x_{1:n})\}$ such that $$\Pi_m^\perp f(x_{1:n}) = \Big(\sum_{j=1}^{M} |a_j|\Big) \sum_{k=1}^{n-m} \tilde a_k \Pi_m^\perp \sigma_{j_k}^{(m)}(x_{1:n}), \qquad \sum_{k=1}^{n-m} |\tilde a_k| \le 1.$$ In other words, there exists a subset of knots $\{\tilde t_j, j \in [n-m]\}$ that perfectly recovers $\Pi_m^\perp f(x)$ at all the sample points, and the TV of the resulting function is no larger than that of $f$. This shows that there is a spline $\tilde f(x) = \sum_{j=1}^{n-m} \tilde a_j (x - \tilde t_j)_+^m$ (plus a polynomial part) such that $\tilde f(x_i) = f(x_i)$ at all $n$ observation points. The MSE of the locally adaptive regression spline (11) was studied in Mammen & van de Geer (1997, Section 3), and it equals the error rate given in Theorem 9. This indicates that the neural network (9) is minimax optimal for $BV(m)$. Let us explain a few of the key observations behind this equivalence. (a) The truncated power functions (together with an $m$-th order polynomial) span the space of $m$-th order splines. (b) The neural network in (9) is equivalent to a free-knot spline with $M$ knots (up to reparameterization). (c) A solution to (11) is a spline with at most $n - m$ knots (Parhi & Nowak, 2021a, Theorem 8). (d) Finally, by the AM-GM inequality, $|v_j|^2 + |w_j|^{2m} \ge 2 |v_j| |w_j|^m = 2 |c_j|$, where $c_j = v_j |w_j|^m$ is the coefficient of the corresponding $j$-th truncated power basis function. The $m$-th order total variation of a spline is equal to $\sum_j |c_j|$. It is not hard to check that the loss function depends only on the $c_j$, so the optimal solution always attains equality in the AM-GM inequality.
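Observation (d) is easy to check numerically. The following sketch (our own illustration, not code from the paper; function names are ours) rebalances a single $\mathrm{ReLU}^m$ unit so that $|v| = |w|^m$ without changing the function it computes, and verifies that the weight-decay penalty drops to its AM-GM lower bound $2|v||w|^m$:

```python
def sigma_m(z, m):
    # the activation sigma_m(z) = max(z, 0)^m
    return max(z, 0.0) ** m

def unit(x, w, b, v, m):
    # one unit of the two-layer network (9): v * sigma_m(w x + b)
    return v * sigma_m(w * x + b, m)

def rebalance(w, b, v, m):
    # scale (w, b) by alpha and v by alpha^{-m}; the output is unchanged
    # by m-homogeneity of sigma_m, and |v'| = |w'|^m afterwards
    alpha = (abs(v) / abs(w) ** m) ** (1.0 / (2 * m))
    return alpha * w, alpha * b, v / alpha ** m

m, w, b, v = 2, 0.5, -0.2, 3.0
w2, b2, v2 = rebalance(w, b, v, m)

# the function values agree at every input
for x in [-1.0, 0.0, 0.7, 2.0]:
    assert abs(unit(x, w, b, v, m) - unit(x, w2, b2, v2, m)) < 1e-9

# the penalty decreases and attains the AM-GM bound 2|v||w|^m exactly
penalty_before = v ** 2 + w ** (2 * m)
penalty_after = v2 ** 2 + w2 ** (2 * m)
assert penalty_after <= penalty_before + 1e-9
assert abs(penalty_after - 2 * abs(v) * abs(w) ** m) < 1e-9
```

Since the loss depends only on $c_j = v_j |w_j|^m$, this rebalancing is always available to the optimizer, which is why the $\ell_2$ penalty effectively acts as an $\ell_1$ penalty on the $c_j$.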

C INTRODUCTION TO COMMON FUNCTION CLASSES

In the following definitions, let Ω denote the domain of the function classes; it will be omitted from the notation when clear from context.

C.1 BESOV CLASS

Definition 1. Modulus of smoothness: for a function $f \in L^p(\Omega)$ for some $1 \le p \le \infty$, the $r$-th modulus of smoothness is defined by $$w_{r,p}(f, t) = \sup_{h \in \mathbb{R}^d: \|h\|_2 \le t} \|\Delta_h^r(f)\|_p, \qquad \Delta_h^r(f)(x) := \begin{cases} \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} f(x + jh), & \text{if } x \in \Omega,\ x + rh \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$

Definition 2. Besov space: for $1 \le p, q \le \infty$, $\alpha > 0$, $r := \lceil \alpha \rceil + 1$, define $$|f|_{B^\alpha_{p,q}} = \begin{cases} \Big( \int_0^\infty \big(t^{-\alpha} w_{r,p}(f, t)\big)^q \, \frac{dt}{t} \Big)^{1/q}, & q < \infty, \\ \sup_{t > 0}\ t^{-\alpha} w_{r,p}(f, t), & q = \infty, \end{cases}$$ and define the norm of the Besov space as $\|f\|_{B^\alpha_{p,q}} = \|f\|_p + |f|_{B^\alpha_{p,q}}$. A function $f$ is in the Besov space $B^\alpha_{p,q}$ if $\|f\|_{B^\alpha_{p,q}}$ is finite. Note that the Besov space for $0 < p, q < 1$ is also defined, but in that case it is a quasi-Banach space instead of a Banach space and will not be covered in this paper.

Functions in Besov space can be decomposed using B-spline basis functions. Any function $f$ in the Besov space $B^\alpha_{p,q}$, $\alpha > d/p$, can be decomposed using B-splines of order $m$, $m > \alpha$: for $x \in \mathbb{R}^d$, $$f(x) = \sum_{k=0}^{\infty} \sum_{s \in J(k)} c_{k,s}(f)\, M_{m,k,s}(x),$$ where $J(k) := \{2^{-k} s : s \in [-m, 2^k + m]^d \subset \mathbb{Z}^d\}$, $M_{m,k,s}(x) := M_m(2^k (x - s))$, and $M_m(x) = \prod_{i=1}^{d} M_m(x_i)$ is the (tensor-product) cardinal B-spline basis function, which can be expressed using truncated powers: $$M_m(x) = \frac{1}{m!} \sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (x - j)_+^m = \frac{((m+1)/2)^m}{m!} \sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} \Big( \frac{x - j}{(m+1)/2} \Big)_+^m.$$ Furthermore, the norm of the Besov space is equivalent to the sequence norm: $$\|\{c_{k,s}\}\|_{b^\alpha_{p,q}} := \Big( \sum_{k=0}^{\infty} \big( 2^{(\alpha - d/p)k} \|\{c_{k,s}(f)\}_s\|_p \big)^q \Big)^{1/q} \asymp \|f\|_{B^\alpha_{p,q}}.$$ See e.g. Dũng (2011, Theorem 2.2) for the proof.

The Besov space is closely connected to other function spaces including the Hölder space ($C^\alpha$) and the Sobolev space ($W^\alpha_p$). Specifically, if the domain of the functions is $d$-dimensional (Suzuki, 2018; Sadhanala et al., 2021),
• $\forall \alpha \in \mathbb{N}$, $B^\alpha_{p,1} \subset W^\alpha_p \subset B^\alpha_{p,\infty}$, and $B^\alpha_{2,2} = W^\alpha_2$.
• For $0 < \alpha < \infty$ and $\alpha \notin \mathbb{N}$, $C^\alpha = B^\alpha_{\infty,\infty}$.
• If $\alpha > d/p$, $B^\alpha_{p,q} \subset C^0$.

C.2 OTHER FUNCTION SPACES

Definition 3.
Hölder space: let $m \in \mathbb{N}$; the $m$-th order Hölder class is defined as $$C^m = \Big\{ f : \max_{|a| = m} \frac{|D^a f(x) - D^a f(z)|}{\|x - z\|_2} < \infty,\ \forall x, z \in \Omega \Big\},$$ where $D^a$ denotes the weak derivative. Note that fractional orders of the Hölder space can also be defined. For simplicity, we will not cover that case in this paper.

Definition 4. Sobolev space: let $m \in \mathbb{N}$, $1 \le p \le \infty$; the Sobolev norm is defined as $$\|f\|_{W^m_p} := \Big( \sum_{|a| \le m} \|D^a f\|_p^p \Big)^{1/p},$$ and the Sobolev space is the set of functions with finite Sobolev norm: $W^m_p := \{f : \|f\|_{W^m_p} < \infty\}$.

Definition 5. Total variation (TV): the total variation of a function $f$ on an interval $[a, b]$ is defined as $$TV(f) = \sup_{P} \sum_{i=1}^{n_P - 1} |f(x_{i+1}) - f(x_i)|,$$ where the supremum is taken over all partitions $P = \{x_1, \dots, x_{n_P}\}$ of the interval $[a, b]$. In many applications, functions with stronger smoothness conditions are needed, which can be measured by higher order total variation.

Definition 6. Higher order total variation: the $m$-th order total variation is the total variation of the $(m-1)$-th order derivative: $TV^{(m)}(f) = TV(f^{(m-1)})$.

Definition 7. Bounded variation (BV): the $m$-th order bounded variation class is the set of functions whose total variation is bounded: $BV(m) := \{f : TV(f^{(m)}) < \infty\}$.
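Two of the objects defined in this section are easy to sanity-check numerically. The sketch below (our own illustration; function names are ours) evaluates the cardinal B-spline of Section C.1 via its truncated power expansion, checking its support and the partition-of-unity property of its integer shifts, and lower-approximates the total variation of Definition 5 on a fine grid:

```python
import math
from math import comb, factorial

def trunc_pow(x, m):
    # truncated power basis function (x)_+^m
    return max(x, 0.0) ** m

def cardinal_bspline(x, m):
    # M_m(x) = (1/m!) * sum_{j=0}^{m+1} (-1)^j C(m+1, j) (x - j)_+^m,
    # supported on [0, m + 1]
    return sum((-1) ** j * comb(m + 1, j) * trunc_pow(x - j, m)
               for j in range(m + 2)) / factorial(m)

def tv_on_grid(f, a, b, n=20000):
    # lower bound on TV(f) from Definition 5: absolute increments over a
    # fine partition; converges to TV for piecewise-monotone f
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    return sum(abs(f(xs[i + 1]) - f(xs[i])) for i in range(n))

m = 3
# support: the B-spline vanishes outside [0, m + 1]
assert cardinal_bspline(-0.5, m) == 0.0
assert abs(cardinal_bspline(m + 1.5, m)) < 1e-9
# integer shifts form a partition of unity: sum_s M_m(x - s) = 1
x = 1.7
assert abs(sum(cardinal_bspline(x - s, m) for s in range(-m - 1, 2)) - 1.0) < 1e-9
# TV examples: a monotone f has TV = f(b) - f(a); sin has TV = 4 on [0, 2*pi]
assert abs(tv_on_grid(lambda t: t ** 3, 0.0, 2.0) - 8.0) < 1e-9
assert abs(tv_on_grid(math.sin, 0.0, 2 * math.pi) - 4.0) < 1e-3
```

The partition-of-unity check is exactly the property that lets finite B-spline combinations reproduce smooth functions locally.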

D PROOF OF ESTIMATION ERROR D.1 EQUIVALENCE BETWEEN PARALLEL NEURAL NETWORKS AND p-NORM PENALIZED PROBLEMS

Proposition 4. There exists a one-to-one mapping between $\lambda > 0$ and $\lambda' > 0$ such that (2) is equivalent to the following problem: $$\arg\min_{\{\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j, a_j\}} L\Big( \sum_{j=1}^{M} a_j \tilde f_j \Big) + \lambda' \|\{a_j\}\|_{2/L}^{2/L} \quad \text{s.t.} \quad \|\tilde W^{(1)}_j\|_F \le c_1 \sqrt{d},\ \forall j \in [M]; \quad \|\tilde W^{(\ell)}_j\|_F \le c_1 \sqrt{w},\ \forall j \in [M],\ 2 \le \ell \le L,$$ where $\tilde f_j(\cdot)$ is a subnetwork with parameters $\tilde W^{(\ell)}_j, \tilde b^{(\ell)}_j$.

Proof. We make use of the property from (5) to minimize the constraint term in (7) while keeping the neural network equivalent to the original one. Specifically, let $W^{(1)}, b^{(1)}, \dots, W^{(L)}, b^{(L)}$ be the parameters of an $L$-layer neural network $$f(x) = W^{(L)} \sigma(W^{(L-1)} \sigma(\dots \sigma(W^{(1)} x + b^{(1)}) \dots) + b^{(L-1)}) + b^{(L)},$$ which is equivalent to $$f(x) = \alpha_L \tilde W^{(L)} \sigma(\alpha_{L-1} \tilde W^{(L-1)} \sigma(\dots \sigma(\alpha_1 \tilde W^{(1)} x + \tilde b^{(1)}) \dots) + \tilde b^{(L-1)}) + \tilde b^{(L)}$$ as long as $\alpha_\ell > 0$ and $\prod_{\ell=1}^{L} \alpha_\ell = \prod_{\ell=1}^{L} \|W^{(\ell)}\|_F$, where $\tilde W^{(\ell)} := W^{(\ell)} / \|W^{(\ell)}\|_F$. By the AM-GM inequality, the $\ell_2$ regularizer of the latter neural network satisfies $$\sum_{\ell=1}^{L} \|\alpha_\ell \tilde W^{(\ell)}\|_F^2 = \sum_{\ell=1}^{L} \alpha_\ell^2 \ge L \Big( \prod_{\ell=1}^{L} \alpha_\ell \Big)^{2/L} = L \Big( \prod_{\ell=1}^{L} \|W^{(\ell)}\|_F \Big)^{2/L},$$ and equality is reached when $\alpha_1 = \alpha_2 = \dots = \alpha_L$. In other words, in problem (2) it suffices to consider networks that satisfy $\|W^{(1)}_j\|_F = \|W^{(2)}_j\|_F = \dots = \|W^{(L)}_j\|_F$ for all $j \in [M]$. Using (5) again, one can find that the neural network is also equivalent to $$f(x) = \sum_{j=1}^{M} a_j \tilde W^{(L)}_j \sigma(\tilde W^{(L-1)}_j \sigma(\dots \sigma(\tilde W^{(1)}_j x + \tilde b^{(1)}_j) \dots) + \tilde b^{(L-1)}_j) + \tilde b^{(L)}_j,$$ where $\|\tilde W^{(\ell)}_j\|_F \le \beta^{(\ell)}$ and $$a_j = \frac{\prod_{\ell=1}^{L} \|W^{(\ell)}_j\|_F}{\prod_{\ell=1}^{L} \beta^{(\ell)}} = \frac{\|W^{(1)}_j\|_F^L}{\prod_{\ell=1}^{L} \beta^{(\ell)}} = \frac{\big( \sum_{\ell=1}^{L} \|W^{(\ell)}_j\|_F^2 / L \big)^{L/2}}{\prod_{\ell=1}^{L} \beta^{(\ell)}},$$ where the last two equalities come from the assumption (14). Choose $\beta^{(\ell)} = c_1 \sqrt{w}$ except for $\ell = 1$, where $\beta^{(1)} = c_1 \sqrt{d}$. This yields the covering number bound of Theorem 5: $$\log N(F, \delta) \lesssim w^{2 + 2/(1 - 2/L)} L^2 \sqrt{d}\, P'^{\frac{1}{1 - 2/L}}\, \delta^{-\frac{2/L}{1 - 2/L}} \log(w P' / \delta).$$ The proof relies on the covering number of each subnetwork in a parallel neural network (Lemma 10), the observation that $|f(x)| \le 2^{L-1} w^{L-1} \sqrt{d}$ under the conditions of Lemma 10, and then applying Lemma 6.
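The layer-rebalancing step in the proof of Proposition 4 can be illustrated numerically. The sketch below (our own minimal implementation, assuming plain fully connected ReLU layers) rescales each layer to the common norm $(\prod_\ell \|W^{(\ell)}\|_F)^{1/L}$; by positive homogeneity of ReLU the network function is unchanged, while the $\ell_2$ penalty attains its AM-GM lower bound $L(\prod_\ell \|W^{(\ell)}\|_F)^{2/L}$:

```python
import math
import random

random.seed(0)

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def relu(v):
    return [max(z, 0.0) for z in v]

def forward(x, Ws, bs):
    # plain ReLU network: W_L s(... s(W_1 x + b_1) ...) + b_L
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu([hi + bi for hi, bi in zip(matvec(W, h), b)])
    return [hi + bi for hi, bi in zip(matvec(Ws[-1], h), bs[-1])]

def fro(W):
    return math.sqrt(sum(wij ** 2 for row in W for wij in row))

L, w, d = 3, 4, 2
dims = [(w, d), (w, w), (1, w)]
Ws = [[[random.gauss(0, 1) for _ in range(c)] for _ in range(r)] for r, c in dims]
bs = [[random.gauss(0, 1) for _ in range(w)],
      [random.gauss(0, 1) for _ in range(w)], [0.0]]

norms = [fro(W) for W in Ws]
prod = norms[0] * norms[1] * norms[2]
geo = prod ** (1.0 / L)  # balanced common layer norm

# rescale layer l by alpha_l = geo / ||W_l||_F, and the bias of layer l by
# the running product of scales; positive homogeneity of ReLU keeps the
# function unchanged (the product of all alpha_l equals 1 here)
Ws2, bs2, run = [], [], 1.0
for W, b, nrm in zip(Ws, bs, norms):
    a = geo / nrm
    run *= a
    Ws2.append([[a * wij for wij in row] for row in W])
    bs2.append([run * bi for bi in b])

x = [0.3, -1.1]
y1, y2 = forward(x, Ws, bs)[0], forward(x, Ws2, bs2)[0]
assert abs(y1 - y2) < 1e-8 * max(1.0, abs(y1))

pen_before = sum(n ** 2 for n in norms)
pen_after = sum(fro(W) ** 2 for W in Ws2)
assert pen_after <= pen_before + 1e-9
assert abs(pen_after - L * prod ** (2.0 / L)) < 1e-9
```

The rebalanced penalty $L(\prod_\ell \|W^{(\ell)}\|_F)^{2/L}$ is exactly the quantity that turns weight decay into an $\ell_{2/L}$ penalty on the per-subnetwork scales $a_j$.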
We argue that our choice of the condition on $\|b^{(\ell)}\|_2$ in Lemma 10 is sufficient for analyzing the model apart from the bias in the last layer, because it guarantees that $$\|W^{(\ell)} A_{\ell-1}(x)\|_\infty \le \|W^{(\ell)} A_{\ell-1}(x)\|_2 \le \sqrt{w}\, \|b^{(\ell)}\|_2 \le \|b^{(\ell)}\|_\infty.$$ If this condition is not met, $(W^{(\ell)} A_{\ell-1}(x) + b^{(\ell)})_i$ is either always positive or always negative for all feasible $x$ along at least one dimension $i$. If $(W^{(\ell)} A_{\ell-1}(x) + b^{(\ell)})_i$ is always negative, one can replace $(b^{(\ell)})_i$ with $-\max_x \|W^{(\ell)} A_{\ell-1}(x)\|_\infty$ without changing the output of the model for any feasible $x$. If $(W^{(\ell)} A_{\ell-1}(x) + b^{(\ell)})_i$ is always positive, one can replace $(b^{(\ell)})_i$ with $\max_x \|W^{(\ell)} A_{\ell-1}(x)\|_\infty$ and adjust the bias in the next layer such that the output of the model is unchanged for any feasible $x$. In either case, one can replace the bias $b^{(\ell)}$ with another one of smaller norm while keeping the model equivalent except for the bias in the last layer.

Lemma 10. Let $F \subseteq \{f : \mathbb{R}^d \to \mathbb{R}\}$ denote the set of $L$-layer neural networks (or subnetworks of a parallel neural network) with width $w$ in each hidden layer. Each has the form $$f(x) = W^{(L)} \sigma(W^{(L-1)} \sigma(\dots \sigma(W^{(1)} x + b^{(1)}) \dots) + b^{(L-1)}) + b^{(L)},$$ $$W^{(1)} \in \mathbb{R}^{w \times d},\ \|W^{(1)}\|_F \le \sqrt{d},\ b^{(1)} \in \mathbb{R}^w,\ \|b^{(1)}\|_2 \le \sqrt{dw};$$ $$W^{(\ell)} \in \mathbb{R}^{w \times w},\ \|W^{(\ell)}\|_F \le \sqrt{w},\ b^{(\ell)} \in \mathbb{R}^w,\ \|b^{(\ell)}\|_2 \le 2^{\ell-1} w^{\ell-1} \sqrt{dw},\ \forall \ell = 2, \dots, L-1;$$ $$W^{(L)} \in \mathbb{R}^{1 \times w},\ \|W^{(L)}\|_F \le \sqrt{w},\ b^{(L)} = 0, \qquad (16)$$ where $\sigma(\cdot)$ is the ReLU activation function and the input satisfies $\|x\|_2 \le 1$. Then the supremum-norm $\delta$-covering number of $F$ obeys $$\log N(F, \delta) \le c_7 L w^2 \log(1/\delta) + c_8,$$ where $c_7$ is a constant depending only on $d$, and $c_8$ is a constant that depends on $d$, $w$ and $L$.

Proof. First study two neural networks which differ in only one layer. Let $g_\ell, g'_\ell$ be two neural networks satisfying (16) with parameters $W_1, b_1, \dots, W_L, b_L$ and $W'_1, b'_1, \dots, W'_L, b'_L$ respectively.
Furthermore, the parameters in these two models are the same except in the $\ell$-th layer, which satisfies $\|W_\ell - W'_\ell\|_F \le \epsilon$, $\|b_\ell - b'_\ell\|_2 \le \varepsilon$. Write the models as $g_\ell(x) = B_\ell(W_\ell A_\ell(x) + b_\ell)$ and $g'_\ell(x) = B_\ell(W'_\ell A_\ell(x) + b'_\ell)$, where $A_\ell(x) = \sigma(W_{\ell-1} \sigma(\dots \sigma(W_1 x + b_1) \dots) + b_{\ell-1})$ denotes the first $\ell - 1$ layers of the neural network and $B_\ell(x) = W_L \sigma(\dots \sigma(W_{\ell+1} \sigma(x) + b_{\ell+1}) \dots) + b_L$ denotes the last $L - \ell - 1$ layers, with the conventions $A_1(x) = x$, $B_L(x) = x$. Now focus on bounding $\|A_\ell(x)\|_2$. Let $W \in \mathbb{R}^{m \times m'}$, $\|W\|_F \le \sqrt{m'}$, $x \in \mathbb{R}^{m'}$, $b \in \mathbb{R}^m$, $\|b\|_2 \le \sqrt{m}$; then $$\|\sigma(Wx + b)\|_2 \le \|Wx + b\|_2 \le \|W\|_2 \|x\|_2 + \|b\|_2 \le \|W\|_F \|x\|_2 + \|b\|_2 \le \sqrt{m'} \|x\|_2 + \sqrt{m},$$ where we make use of $\|\cdot\|_2 \le \|\cdot\|_F$. Because of that, $$\|A_2(x)\|_2 \le \sqrt{d} + \sqrt{dw} \le 2\sqrt{dw}, \quad \|A_3(x)\|_2 \le \sqrt{w} \|A_2(x)\|_2 + 2w\sqrt{dw} \le 4w\sqrt{dw}, \quad \dots, \quad \|A_\ell(x)\|_2 \le \sqrt{w}\|A_{\ell-1}(x)\|_2 + \|b_{\ell-1}\|_2 \le 2\sqrt{dw}\,(2w)^{\ell-2}. \qquad (17)$$ Then focus on $B_\ell(x)$. Let $W \in \mathbb{R}^{m \times m'}$, $\|W\|_F \le \sqrt{m'}$, $x, x' \in \mathbb{R}^{m'}$, $b \in \mathbb{R}^m$, $\|b\|_2 \le \sqrt{m}$, with $\|x - x'\|_2 \le \epsilon$; then $$\|\sigma(Wx + b) - \sigma(Wx' + b)\|_2 \le \|W(x - x')\|_2 \le \|W\|_F \|x - x'\|_2,$$ which indicates that $$\|B_\ell(x) - B_\ell(x')\|_2 \le (\sqrt{w})^{L - \ell} \|x - x'\|_2.$$ Finally, for any $W, W' \in \mathbb{R}^{m \times m'}$, $x \in \mathbb{R}^{m'}$, $b, b' \in \mathbb{R}^m$, one has $$\|(Wx + b) - (W'x + b')\|_2 = \|(W - W')x + (b - b')\|_2 \le \|W - W'\|_F \|x\|_2 + \|b - b'\|_2 \le \|W - W'\|_F \|x\|_2 + \sqrt{m}\, \|b - b'\|_\infty.$$ In summary, $$|g_\ell(x) - g'_\ell(x)| = |B_\ell(W_\ell A_\ell(x) + b_\ell) - B_\ell(W'_\ell A_\ell(x) + b'_\ell)| \le (\sqrt{w})^{L-\ell} \big( \|W_\ell - W'_\ell\|_F \|A_\ell(x)\|_2 + \|b_\ell - b'_\ell\|_2 \big) \le 2^{\ell-1} w^{(L+\ell-3)/2} d^{1/2} \epsilon + w^{(L-\ell)/2} \varepsilon.$$ Let $f(x), f'(x)$ be two neural networks satisfying (16) with parameters $W_1, b_1, \dots, W_L, b_L$ and $W'_1, b'_1, \dots, W'_L, b'_L$ respectively, with $\|W_\ell - W'_\ell\|_F \le \epsilon_\ell$, $\|b_\ell - b'_\ell\|_2 \le \varepsilon_\ell$. Further let $f_\ell$ be the neural network with parameters $W_1, b_1, \dots, W_\ell, b_\ell, W'_{\ell+1}, b'_{\ell+1}, \dots$
, $W'_L, b'_L$; then $$|f(x) - f'(x)| \le |f(x) - f_1(x)| + |f_1(x) - f_2(x)| + \dots + |f_{L-1}(x) - f'(x)| \le \sum_{\ell=1}^{L} \big( 2^{\ell-2} d^{1/2} w^{(L+\ell-3)/2} \epsilon_\ell + w^{(L-\ell)/2} \varepsilon_\ell \big).$$ For any $\delta > 0$, one can choose $$\epsilon_\ell = \frac{\delta}{2^\ell w^{(L+\ell-3)/2} d^{1/2}}, \qquad \varepsilon_\ell = \frac{\delta}{2 w^{(L-\ell)/2}}$$ such that $|f(x) - f'(x)| \lesssim \delta$. On the other hand, the $\epsilon$-covering number of $\{W \in \mathbb{R}^{m \times m'} : \|W\|_F \le \sqrt{m'}\}$ in Frobenius norm is no larger than $(2\sqrt{m'}/\epsilon + 1)^{m m'}$, and the $\varepsilon$-covering number of $\{b \in \mathbb{R}^m : \|b\|_2 \le 1\}$ in infinity norm is no larger than $(2/\varepsilon + 1)^m$. The entropy of this neural network class can therefore be bounded by $$\log N(F; \delta) \le w^2 L \log(2^{L+1} w^{L-1}/\delta + 1) + w L \log(2^{L-1} w^{(L-1)/2} d^{1/2}/\delta + 1).$$

D.3 COVERING NUMBER OF p-NORM CONSTRAINED LINEAR COMBINATIONS

Lemma 6. Let $G \subseteq \{g : \mathbb{R}^d \to [-c_3, c_3]\}$ be a set with covering number satisfying $\log N(G, \delta) \lesssim k \log(1/\delta)$ for some finite $c_3$, and suppose that for any $g \in G$ and $|a| \le 1$ we have $a g \in G$. Then the covering number of $$F = \Big\{ \sum_{i=1}^{M} a_i g_i \ \Big|\ g_i \in G,\ \|a\|_p^p \le P \Big\}, \qquad 0 < p < 1,$$ for any $P > 0$ satisfies $$\log N(F, \delta) \lesssim k\, P^{\frac{1}{1-p}} (\delta/c_3)^{-\frac{p}{1-p}} \log(c_3 P/\delta)$$ up to a double logarithmic factor.

Proof. Let $\epsilon$ be a positive constant. Without loss of generality, sort the coefficients in descending order of their absolute values. There exists a nonnegative integer $\bar M$ (a function of $\epsilon$) such that $|a_i| \ge \epsilon$ for $i \le \bar M$ and $|a_i| < \epsilon$ for $i > \bar M$. By definition, $\bar M \epsilon^p \le \sum_{i=1}^{\bar M} |a_i|^p \le P$, so $\bar M \le P/\epsilon^p$, and $|a_i|^p \le P$, i.e., $|a_i| \le P^{1/p}$, for all $i$. Furthermore, $$\sum_{i > \bar M} |a_i| = \sum_{i > \bar M} |a_i|^p |a_i|^{1-p} < \sum_{i > \bar M} |a_i|^p \epsilon^{1-p} \le P \epsilon^{1-p}.$$ Let $\tilde g_i = \arg\min_{g \in \tilde G} \|g - (a_i/P^{1/p}) g_i\|_\infty$, where $\tilde G$ is a $\delta'$-covering set of $G$. By the definition of the covering set, $$\Big\| \sum_{i=1}^{M} a_i g_i - \sum_{i=1}^{\bar M} P^{1/p} \tilde g_i \Big\|_\infty \le \Big\| \sum_{i=1}^{\bar M} \big( a_i g_i - P^{1/p} \tilde g_i \big) \Big\|_\infty + \Big\| \sum_{i=\bar M+1}^{M} a_i g_i \Big\|_\infty \le \bar M P^{1/p} \delta' + c_3 P \epsilon^{1-p}. \qquad (18)$$ Choosing $\epsilon = (\delta/2c_3 P)^{\frac{1}{1-p}}$ and $\delta' \asymp P^{-\frac{1}{p(1-p)}} (\delta/2c_3)^{\frac{1}{1-p}}/2$, we have $$\bar M \le P^{\frac{1}{1-p}} (\delta/2c_3)^{-\frac{p}{1-p}}, \qquad (19)$$ $\bar M P^{1/p} \delta' \le \delta/2$, and $c_3 P \epsilon^{1-p} \le \delta/2$, so $(18) \le \delta$. One can compute the covering number of $F$ by $$\log N(F, \delta) \le \bar M \log N(G, \delta') \lesssim k \bar M \log(1/\delta'). \qquad (20)$$ Taking (19) into (20) finishes the proof.
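The thresholding step at the heart of the proof of Lemma 6 can be checked directly: for any coefficient vector with $\|a\|_p^p \le P$, at most $P/\epsilon^p$ entries exceed $\epsilon$, and the discarded $\ell_1$ mass is at most $P\epsilon^{1-p}$. A small numerical check (ours; the random coefficient model is an arbitrary choice):

```python
import random

random.seed(1)

# a heavy-tailed coefficient vector, as a test case
a = [random.gauss(0, 1) * random.expovariate(1.0) for _ in range(200)]
p, eps = 0.5, 0.3
P = sum(abs(x) ** p for x in a)                 # P = ||a||_p^p
kept = sum(1 for x in a if abs(x) >= eps)       # coefficients that survive
tail = sum(abs(x) for x in a if abs(x) < eps)   # discarded l1 mass

# at most P / eps^p large coefficients survive the threshold
assert kept <= P / eps ** p
# the discarded l1 mass is at most P * eps^(1 - p)
assert tail <= P * eps ** (1 - p)
```

Both bounds hold deterministically for every vector, which is why the covering argument only needs to track the $\bar M \le P/\epsilon^p$ surviving atoms.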

E PROOF OF APPROXIMATION ERROR

• $|\tilde M_{m,k,s}(x) - M_{m,k,s}(x)| \le \epsilon$ if $0 \le 2^k (x_i - s_i) \le m + 1$, $\forall i \in [d]$;
• $\tilde M_{m,k,s}(x) = 0$ otherwise;
• the weight in each layer has bounded norm $\|W^{(\ell)}\|_F \lesssim 2^{k/L} \sqrt{w}$, except the first layer, where $\|W^{(1)}\|_F \lesssim 2^{k/L} \sqrt{d}$.

Note that the product of the coefficients across all the layers is proportional to $2^k$, instead of $2^{km}$ as when approximating truncated power basis functions. This is because the transformation from $M_m$ to $M_{m,k,s}$ only scales the domain of the function by $2^k$, while the codomain of the function is not changed. To apply the transformation to the neural network, one only needs to scale the weights in the first layer by $2^k$, which is equivalent to scaling the weights in each layer by $2^{k/L}$ and adjusting the biases accordingly. As for the proof, we follow the method developed in Yarotsky (2017); Suzuki (2018), while putting our attention on bounding the Frobenius norm of the weights.

Lemma 12 (Yarotsky (2017, Proposition 3)). There exists a neural network with two-dimensional input and one output, denoted $f_\times(x, y)$, with constant width and depth $O(\log(1/\delta))$, whose weight in each layer is bounded by a global constant $c_1$, such that • $|f_\times(x, y) - xy| \le \delta$, $\forall\, 0 \le x, y \le 1$; • $f_\times(x, y) = 0$ if $x = 0$ or $y = 0$.

We first prove a special case of Lemma 11 for the unscaled, unshifted B-spline basis function, by fixing $k = 0$, $s = 0$:

Proposition 13. There exists a neural network with $d$-dimensional input and one output, with width $w = w(d, m) \asymp dm$ and depth $L \lesssim \log(c(m, d)/\epsilon)$ for constants $w, c$ that depend only on $m$ and $d$, denoted $\tilde M_m(x)$, $x \in \mathbb{R}^d$, such that • $|\tilde M_m(x) - M_m(x)| \le \epsilon$ if $0 \le x_i \le m + 1$, $\forall i \in [d]$, where $M_m(\cdot)$ denotes the $m$-th order B-spline basis function; • $\tilde M_m(x) = 0$ if $x_i \le 0$ or $x_i \ge m + 1$ for any $i \in [d]$; • the weight in each layer has bounded norm $\|W^{(\ell)}\|_F \lesssim \sqrt{w}$.

Proof. We first show that one can use a neural network with constant width $w_0$, depth $L \asymp \log(m/\epsilon_1)$, and bounded norms $\|W^{(1)}\|_F \le O(\sqrt{d})$, $\|W^{(\ell)}\|_F \le O(\sqrt{w})$, $\forall \ell = 2, \dots, L$, to approximate a truncated power basis function up to accuracy $\epsilon_1$ on the range $[0, 1]$. Let $m = \sum_{i=0}^{\lceil \log_2 m \rceil} m_i 2^i$, $m_i \in \{0, 1\}$, be the binary expansion of $m$, and define $\bar m_j := \sum_{i=0}^{j} m_i 2^i$ and $\gamma := \lceil \log_2 m \rceil$, so that $\bar m_\gamma = m$. The pair $(x_+^{\bar m_j}, x_+^{2^{j+1}})$ can be computed recursively from the pair $(x_+^{\bar m_{j-1}}, x_+^{2^j})$ via $$x_+^{\bar m_j} = x_+^{\bar m_{j-1}} \times \big(x_+^{2^j}\big)^{m_j}, \qquad x_+^{2^{j+1}} = x_+^{2^j} \times x_+^{2^j}, \qquad j = 1, \dots, \gamma,$$ starting from $\big(x_+^{\bar m_0}, x_+^2\big) = \big((x_+)^{m_0},\ x_+ \times x_+\big)$, and $x_+^m = x_+^{\bar m_\gamma}$. (21) Each level of the recursion only depends on the level immediately below it. Replacing the multiplication operator $\times$ with the neural network approximation from Lemma 12 yields the architecture of the desired approximation. For any $x, y \in [0, 1]$, if $|f_\times(x, y) - xy| \le \delta$, $|\tilde x - x| \le \delta_1$, $|\tilde y - y| \le \delta_2$, then $|f_\times(\tilde x, \tilde y) - xy| \le \delta_1 + \delta_2 + \delta$. Applying this along the recursion (21) shows that $\epsilon_1 \asymp 2^\gamma \delta \asymp m\delta$, where $\epsilon_1$ is the upper bound on the approximation error of the truncated power basis function of order $m$, and $\delta$ is the approximation error of a single multiplication operator as in Lemma 12. A univariate B-spline basis function can be expressed using the truncated power basis; observing that it is symmetric around $(m+1)/2$, $$M_m(x) = \frac{1}{m!} \sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (x - j)_+^m = \frac{((m+1)/2)^m}{m!} \sum_{j=0}^{\lceil (m+1)/2 \rceil} (-1)^j \binom{m+1}{j} \Big( \frac{\min(x, m+1-x) - j}{(m+1)/2} \Big)_+^m.$$ A multivariate ($d$-dimensional) B-spline basis function can be expressed as a product of univariate ones and thus decomposed as $$M_m(x) = \prod_{i=1}^{d} M_m(x_i) = \frac{((m+1)/2)^{md}}{(m!)^d} \prod_{i=1}^{d} \sum_{j=0}^{\lceil (m+1)/2 \rceil} (-1)^j \binom{m+1}{j} \Big( \frac{\min(x_i, m+1-x_i) - j}{(m+1)/2} \Big)_+^m.$$ Using Lemma 12, one can construct $m + 1$ neural networks, each of width $w_0$ and depth $L = O(\log(m/\epsilon_1))$, such that the $(j+1)$-th neural network approximates $\big( \frac{x - j}{(m+1)/2} \big)_+^m$ with error no more than $\epsilon_1$ for any $0 \le x \le (m + 1)/2$.
The weighted summation of these subnetworks approximates the univariate B-spline basis function with error no more than $$\frac{((m+1)/2)^m}{m!} \sum_{j=0}^{m+1} \binom{m+1}{j} \epsilon_1 \asymp \frac{e^{2m}}{\sqrt{m}}\, \epsilon_1,$$ where we applied Stirling's approximation. A multivariate B-spline basis function is the product of univariate B-spline basis functions along each dimension, $M_m(x) = \prod_{i=1}^{d} M_m(x_i)$. We can construct a neural network to approximate this function by parallelizing $d$ neural networks that approximate the B-spline basis function along each dimension, and using the last $L_1 \asymp \log(d/\delta)$ layers to approximate their product. The total approximation error of this construction is bounded by $$d\, \frac{((m+1)/2)^m}{m!} \sum_{j=0}^{m+1} \binom{m+1}{j} \epsilon_1 + (d - 1)\delta \asymp \frac{e^{2m}}{\sqrt{m}}\, d \epsilon_1 + d\delta,$$ where $\delta$ and $\epsilon_1$ have the same definitions as above. Choosing $\delta = \frac{\epsilon}{d(e^{2m}/\sqrt{m} + 1)}$, and recalling $\epsilon_1 \asymp m\delta$, proves the approximation error bound. The proof of Lemma 11 for general $k, s$ follows by appending one more layer at the front, as we show below.

Proof of Lemma 11. Using the neural network proposed in Proposition 13, one can construct a neural network approximating $M_{m,k,s}$ by adding one layer before the first layer: $\sigma(2^k I_d x - 2^k s)$. The unused neurons in the first hidden layer are zero-padded. The Frobenius norm of this weight matrix is $2^k \|I_d\|_F = 2^k \sqrt{d}$. Following the proof of Proposition 4, rescaling the weights across layers so that each layer carries a factor of $2^{k/L}$, and scaling the biases accordingly, one can verify that this neural network satisfies the statement.

E.2 SPARSE APPROXIMATION OF BESOV FUNCTIONS USING B-SPLINE WAVELETS

Proposition 7. Let $\alpha - d/p > 1$, $r > 0$.
For any function $f_0 \in B^\alpha_{p,q}$ in the Besov space and any positive integer $M$, there is an $M$-sparse approximation using B-spline bases of order $m$ satisfying $0 < \alpha < \min(m, m - 1 + 1/p)$: $$\check f_M = \sum_{i=1}^{M} a_{k_i, s_i} M_{m, k_i, s_i},$$ such that the approximation error is bounded as $\|\check f_M - f_0\|_r \lesssim M^{-\alpha/d} \|f_0\|_{B^\alpha_{p,q}}$, and the coefficients satisfy $$\|\{2^{k_i} a_{k_i, s_i}\}_{k_i, s_i}\|_p \lesssim \|f_0\|_{B^\alpha_{p,q}}.$$

Remark 1. The requirement $\alpha - d/p > 1$ in Proposition 7 is stronger than the condition typically found in approximation theorems, $\alpha - d/p \ge 0$ (Dũng, 2011), the so-called "boundary of continuity", or the condition in Suzuki (2018), $\alpha > d(1/p - 1/r)_+$. This is because although the functions in $B^\alpha_{p,q}$ with $0 \le \alpha - d/p < 1$ can be approximated by B-spline bases, the sum of the weighted coefficients may not converge. One simple example is the step function $f_{step}(x) = 1(x \ge 0.5)$, $f_{step} \in B^1_{1,\infty}$. Although it can be decomposed using first-order B-spline bases as in (12), the summation of the coefficients is infinite. Actually, one only needs a ReLU neural network with one hidden layer and two neurons to approximate this function to arbitrary precision, but the weights need to go to infinity.

Proof. Dũng (2011, Theorem 3.1) and Suzuki (2018, Lemma 2) proposed an adaptive sampling recovery method that approximates a function in Besov space. The method is divided into two cases: $p \ge r$ and $p < r$. When $p \ge r$, there exists a sequence of scalars $\lambda_j$, $j \in P^d(\mu)$, $P^d(\mu) := \{j \in \mathbb{Z}^d : |j_i| \le \mu, \forall i \in [d]\}$ for some positive $\mu$, such that for an arbitrary positive integer $\bar k$, the linear operator $$Q_{\bar k}(f, x) = \sum_{s \in J(\bar k, m, d)} a_{\bar k, s}(f) M_{\bar k, s}(x), \qquad a_{\bar k, s}(f) = \sum_{j \in P^d(\mu)} \lambda_j \bar f(s + 2^{-\bar k} j),$$ has bounded approximation error $$\|f - Q_{\bar k}(f, \cdot)\|_r \le C 2^{-\alpha \bar k} \|f\|_{B^\alpha_{p,q}},$$ where $\bar f$ is the extrapolation of $f$ and $J(\bar k, m, d) := \{s : 2^{\bar k} s \in \mathbb{Z}^d,\ -m/2 \le 2^{\bar k} s_i \le 2^{\bar k} + m/2,\ \forall i \in [d]\}$. See Dũng (2011, 2.6-2.7) for the details of the extrapolation as well as references for choices of the sequence $\lambda_j$. Furthermore, $Q_{\bar k}(f) \in B^\alpha_{p,q}$, so it can be decomposed in the form (12) with $$M = \sum_{k=0}^{\bar k} (2^k + m - 1)^d \lesssim 2^{\bar k d}$$ components and $\|\{\bar c_{k,s}\}_{k,s}\| \lesssim \|Q_{\bar k}(f)\|_{B^\alpha_{p,q}} \lesssim \|f\|_{B^\alpha_{p,q}}$, where the $\bar c_{k,s}$ are the coefficients of the decomposition of $Q_{\bar k}(f)$. Choosing $\bar k \asymp \log_2(M)/d$ leads to the desired approximation error. On the other hand, when $p < r$, there exists a greedy algorithm that constructs $$G(f) = Q_{\bar k}(f) + \sum_{k=\bar k+1}^{k^*} \sum_{j=1}^{n_k} c_{k, s_j}(f) M_{k, s_j},$$ where $\bar k \asymp \log_2(M)$, $k^* = [\epsilon^{-1} \log(\lambda M)] + \bar k + 1$, $n_k = [\lambda M 2^{-\epsilon(k - \bar k)}]$ for some $0 < \epsilon < \alpha/\delta - 1$, $\delta = d(1/p - 1/r)$, $\lambda > 0$, such that $$\|f - G(f)\|_r \lesssim M^{-\alpha/d} \|f\|_{B^\alpha_{p,q}}, \qquad \sum_{k=0}^{\bar k} (2^k + m - 1)^d + \sum_{k=\bar k+1}^{k^*} n_k \le M.$$ See Dũng (2011, Theorem 3.1) for the details. Finally, since $\alpha - d/p > 1$, $$\|\{2^{k_i} c_{k_i, s_i}\}_{k_i, s_i}\|_p \le \sum_{k=0}^{\bar k} 2^k \|\{c_{k,s}\}_s\|_p = \sum_{k=0}^{\bar k} 2^{(1 - (\alpha - d/p))k} \big( 2^{(\alpha - d/p)k} \|\{c_{k,s}\}_s\|_p \big) \lesssim \sum_{k=0}^{\bar k} 2^{(1 - (\alpha - d/p))k} \|f\|_{B^\alpha_{p,q}} \asymp \|f\|_{B^\alpha_{p,q}}, \qquad (23)$$ where the first inequality holds because for arbitrary vectors $a_i$, $i \in [n]$, $\|\sum_{i=1}^{n} a_i\|_p \le \sum_{i=1}^{n} \|a_i\|_p$, and the last step holds because the sequence norm of the B-spline decomposition is equivalent to the Besov norm (see Section C.1). Note that when $\alpha - d/p = 1$, the sequence norm in (23) is bounded (up to a constant factor) by $k^* \|f\|_{B^\alpha_{p,q}}$, which can be proven by following (23) except for the last step. This adds a logarithmic factor in $M$ compared with the result in Proposition 7, and hence a logarithmic factor in the MSE. We will not focus on this case in this paper for simplicity.
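The step-function example in Remark 1 can be made concrete: a one-hidden-layer, two-neuron ReLU network approximates $1(x \ge 0.5)$ to arbitrary precision, but only with first-layer weights that diverge as the transition band shrinks. A minimal sketch (ours; the grid and error metric are our choices):

```python
def ramp_step(x, a):
    # two-neuron ReLU approximation of the step 1(x >= 0.5):
    #   a * relu(x - 0.5) - a * relu(x - 0.5 - 1/a)
    # exact outside a transition band of width 1/a, but the weight scale
    # grows like a as the band shrinks
    relu = lambda z: max(z, 0.0)
    return a * relu(x - 0.5) - a * relu(x - 0.5 - 1.0 / a)

step = lambda x: 1.0 if x >= 0.5 else 0.0

for a in [10.0, 100.0, 1000.0]:
    # average L1 error over a grid shrinks like 1/a while the weight grows
    xs = [i / 10000 for i in range(10001)]
    err = sum(abs(ramp_step(x, a) - step(x)) for x in xs) / len(xs)
    assert err <= 1.0 / a
```

Driving the error to zero thus forces the weight $a \to \infty$, which is exactly why the weighted coefficient norm in Proposition 7 cannot stay bounded below the stated smoothness threshold.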

E.3 SPARSE APPROXIMATION OF BESOV FUNCTIONS USING PARALLEL NEURAL NETWORKS

Theorem 8. Under the same condition as Proposition 7, for any positive integer M , any function in Besov space f 0 ∈ B α p,q can be approximated by a parallel neural network with no more than O( M ) number of subnetworks satisfying: 1. Each subnetwork has width w = O(md) and depth L.

2. The weights in each layer satisfy

$\|\tilde W^{(\ell)}_k\|_F \le O(\sqrt{w})$, except the first layer, where $\|\tilde W^{(1)}_k\|_F \le O(\sqrt{d})$. 3. The scaling factors have bounded $2/L$-norm: $P' := \|\{a_j\}\|_{2/L}^{2/L} \lesssim M^{1 - 2/(pL)}$. 4. The approximation error is bounded by $$\|\tilde f - f_0\|_r \le (c_4 M^{-\alpha/d} + c_5 e^{-c_6 L}) \|f\|_{B^\alpha_{p,q}},$$ where $c_4, c_5, c_6$ are constants that depend only on $m$, $d$ and $p$. The proof is divided into three steps: 1. Bound the $0$-norm and the $p$-norm of the coefficients of the B-spline bases needed to approximate an arbitrary function in Besov space up to any $\epsilon > 0$. 2. Bound the $p'$-norm of the coefficients of the B-spline basis functions, where $p' = 2/L$, $0 < p' < 1$, using the results above. 3. Add the approximation error of the neural networks with respect to the B-spline bases, computed in Lemma 11, into Step 2. We first prove the following lemma.

Lemma 14. For any $a \in \mathbb{R}^M$ and $0 < p' < p$, it holds that $\|a\|_{p'}^{p'} \le M^{1 - p'/p} \|a\|_p^{p'}$.

Proof. $$\sum_i |a_i|^{p'} = \langle \mathbf{1}, |a|^{p'} \rangle \le \Big( \sum_i 1^{\frac{1}{1 - p'/p}} \Big)^{1 - \frac{p'}{p}} \Big( \sum_i \big(|a_i|^{p'}\big)^{\frac{p}{p'}} \Big)^{\frac{p'}{p}} = M^{1 - \frac{p'}{p}} \|a\|_p^{p'}.$$ The inequality uses Hölder's inequality with the conjugate pair $\frac{p}{p'}$ and $\frac{1}{1 - p'/p}$.

Proof of Theorem 8. Using Proposition 7, one can construct $M$ neural networks according to Lemma 11, such that each represents one B-spline basis function. The weights in the last layer of each network are scaled to match the coefficients in Proposition 7. Taking $p'$ in Lemma 14 as $2/L$ and combining with Lemma 11 finishes the proof.
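Lemma 14 is a direct application of Hölder's inequality and can be verified numerically; the sketch below (ours) checks the bound on random vectors with $p = 1$ and $p' = 0.4$ (e.g. $p' = 2/L$ with $L = 5$):

```python
import random

random.seed(1)

def check(a, p, pp):
    # Lemma 14: for 0 < p' < p, ||a||_{p'}^{p'} <= M^(1 - p'/p) ||a||_p^{p'}
    M = len(a)
    lhs = sum(abs(x) ** pp for x in a)
    norm_p = sum(abs(x) ** p for x in a) ** (1.0 / p)
    rhs = M ** (1.0 - pp / p) * norm_p ** pp
    return lhs <= rhs + 1e-9

for _ in range(100):
    M = random.randint(1, 50)
    a = [random.gauss(0, 1) for _ in range(M)]
    assert check(a, p=1.0, pp=0.4)
```

The factor $M^{1 - p'/p}$ is what produces the $M^{1 - 2/(pL)}$ bound on $P'$ in item 3 of Theorem 8.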

F PROOF OF THE MAIN THEOREM

Theorem 1 (extended form). For any fixed $\alpha - d/p > 1$, $q \ge 1$, $L \ge 3$, define $m = \lceil \alpha - 1 \rceil$. For any $f_0 \in B^\alpha_{p,q}$, consider an $L$-layer parallel neural network satisfying: • The width of each subnetwork is fixed, satisfying $w \ge O(md)$; see Theorem 8 for the details. • The number of subnetworks is large enough: $M \gtrsim n^{\frac{1 - 2/L}{2\alpha/d + 1 - 2/(pL)}}$. Under the assumptions of Lemma 17, with a proper choice of the regularization parameter $\lambda$ depending on $D, \alpha, d, L$, the solution $\hat f$ parameterized by (2) satisfies $$\mathrm{MSE}(\hat f) = \tilde O\Big( \Big( \frac{w^{4 - 4/L} L^{2 - 4/L}}{n^{1 - 2/L}} \Big)^{\frac{2\alpha/d}{2\alpha/d + 1 - 2/(pL)}} + e^{-c_6 L} \Big),$$ where $\tilde O$ hides logarithmic factors and $c_6$ is the constant defined in Theorem 8.

Proof. First recall the relationship between the covering number (entropy) and the estimation error:

Proposition 15. Let $F \subseteq \{f : \mathbb{R}^d \to [-F, F]\}$ be a set of functions. Assume that $F$ can be decomposed into two orthogonal spaces $F = F_\parallel \times F_\perp$, where $F_\perp$ is an affine space of dimension $N$. Let $f_0 : \mathbb{R}^d \to [-F, F]$ be the target function and $\hat f$ be the least squares estimator in $F$: $$\hat f = \arg\min_{f \in F} \sum_{i=1}^{n} (y_i - f(x_i))^2, \qquad y_i = f_0(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\ \text{i.i.d.}$$ Then it holds that $$\mathrm{MSE}(\hat f) \le \tilde O\Big( \min_{f \in F} \mathrm{MSE}(f) + \frac{N + \log N(F_\parallel, \delta) + 2}{n} + (F + \sigma)\delta \Big).$$ The proof of Proposition 15 is deferred to the section below. We choose $F$ as the set of functions that can be represented by a parallel neural network as stated, and the (null) space $F_\perp = \{f : f(x) = \text{constant}\}$ as the set of functions with constant output, which has dimension 1. This space captures the bias in the last layer, while the other parameters contribute to the projection onto $F_\parallel$. See Section D.2 for how we handle the biases in the other layers. One can find that $F_\parallel$ is the set of functions that can be represented by a parallel neural network as stated and that further satisfy $\sum_{i=1}^{n} f(x_i) = 0$. Because $F_\parallel \subseteq F$, $N(F_\parallel, \delta) \le N(F, \delta)$ for all $\delta > 0$, and the latter is studied in Theorem 5.
In Theorem 1, the width of each subnetwork is no less than what is required in Theorem 8, while the depth and norm constraints are the same, so the approximation error is no more than that in Theorem 8. Choosing $r = 2$, taking the exponent $p$ in Lemma 6 as $2/L$, and plugging Theorem 5 and Theorem 8 into Proposition 15, one gets $$\mathrm{MSE}(\hat f) \lesssim \min_{f \in F} \mathrm{MSE}(f) + \frac{w^{2 + 2/(1 - 2/L)} L^2 \sqrt{d}\, P'^{\frac{1}{1 - 2/L}} \delta^{-\frac{2/L}{1 - 2/L}} \log(w P'/\delta)}{n} + \delta \lesssim M^{-2\alpha/d} + \frac{w^{2 + 2/(1 - 2/L)} L^2}{n} M^{\frac{1 - 2/(pL)}{1 - 2/L}} \delta^{-\frac{2/L}{1 - 2/L}} \big(\log(M/\delta) + 3\big) + \delta,$$ where $\|f\|_{B^\alpha_{p,q}}$, $m$ and $d$ are taken as constants. Choose $$\delta \asymp \frac{w^{4 - 4/L} L^{2 - 4/L} M^{1 - 2/(pL)}}{n^{1 - 2/L}}, \qquad M \asymp \Big( \frac{n^{1 - 2/L}}{w^{4 - 4/L} L^{2 - 4/L}} \Big)^{\frac{1}{2\alpha/d + 1 - 2/(pL)}}.$$ Then we bound $\mathrm{MSE}_\perp(\hat f)$. We convert this part into a finite dimensional least squares problem: $$\hat f_\perp = \arg\min_{f \in F_\perp} L_\perp(f) = \arg\min_{f \in F_\perp} \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - f_{0\perp}(x_i) - \epsilon_{i\perp})^2 = \arg\min_{f \in F_\perp} \frac{1}{n} \sum_{i=1}^{n} \big( (f(x_i) - f_{0\perp}(x_i) - \epsilon_{i\perp})^2 + \epsilon_{i\parallel}^2 \big) = \arg\min_{f \in F_\perp} \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - f_{0\perp}(x_i) - \epsilon_{i\perp} - \epsilon_{i\parallel})^2 = \arg\min_{f \in F_\perp} \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - f_{0\perp}(x_i) - \epsilon_i)^2.$$ The fourth equality comes from our assumption that $F_\perp$ is orthogonal to $F_\parallel$, so for all $f \in F_\perp$, $f + f_{0\perp} + \epsilon_\perp$ is orthogonal to $\epsilon_\parallel$. Let the basis functions of $F_\perp$ be $h_1, h_2, \dots, h_N$; the above problem can be reparameterized as $$\arg\min_{\theta \in \mathbb{R}^N} \frac{1}{n} \|X\theta - y\|^2,$$ where $X \in \mathbb{R}^{n \times N}$, $X_{ij} = h_j(x_i)$, $y = y_{0\perp} + \epsilon$, $y_{0\perp} = [f_{0\perp}(x_1), \dots, f_{0\perp}(x_n)]$, $\epsilon = [\epsilon_1, \dots, \epsilon_n]$. This problem has the closed-form solution $\hat\theta = (X^T X)^{-1} X^T y$. Observe that $f_{0\perp} \in F_\perp$; let $y_{0\perp} = X\theta^*$. The loss of this problem can be computed as $$L(\hat f_\perp) = \frac{1}{n} \|X\hat\theta - y_{0\perp}\|^2 = \frac{1}{n} \|X(X^T X)^{-1} X^T (X\theta^* + \epsilon) - X\theta^*\|^2 = \frac{1}{n} \|X(X^T X)^{-1} X^T \epsilon\|^2.$$ Observing that $\Pi := X(X^T X)^{-1} X^T$ is an idempotent projection of rank $N$ (independent of $\epsilon$), and that $E[\epsilon \epsilon^T] = \sigma^2 I$, we get $$\mathrm{MSE}_\perp(\hat f_\perp) = E[L(\hat f_\perp)] = \frac{1}{n} E\|\Pi \epsilon\|^2 = \frac{1}{n} E[\mathrm{tr}(\Pi \epsilon \epsilon^T)] = \frac{\sigma^2}{n} \mathrm{tr}(\Pi),$$ which concludes that $\mathrm{MSE}_\perp(\hat f) = O\big( \frac{N}{n} \sigma^2 \big)$. See also Hsu et al. (2011, Proposition 1).
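The final computation, $\mathrm{MSE}_\perp = \sigma^2 \mathrm{tr}(\Pi)/n = \sigma^2 N/n$, can be reproduced numerically. The sketch below (ours; pure Python, with the projection built from a Gram-Schmidt orthonormal basis so that $\Pi = QQ^T$) checks $\mathrm{tr}(\Pi) = N$ and the Monte Carlo value of $\mathbb{E}\|\Pi\epsilon\|^2/n$:

```python
import random

random.seed(3)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vs):
    # orthonormalize random vectors spanning the N-dimensional space F_perp
    basis = []
    for v in vs:
        w = v[:]
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        nrm = dot(w, w) ** 0.5
        basis.append([wi / nrm for wi in w])
    return basis

n, N, sigma = 200, 5, 0.7
Q = gram_schmidt([[random.gauss(0, 1) for _ in range(n)] for _ in range(N)])

# trace of the projection Pi = Q Q^T equals its rank N
assert abs(sum(dot(q, q) for q in Q) - N) < 1e-9

def project(eps):
    # Pi eps, computed as Q (Q^T eps)
    coef = [dot(q, eps) for q in Q]
    return [sum(c * q[i] for c, q in zip(coef, Q)) for i in range(n)]

# Monte Carlo: E ||Pi eps||^2 / n = sigma^2 N / n
trials, acc = 2000, 0.0
for _ in range(trials):
    eps = [random.gauss(0, sigma) for _ in range(n)]
    pe = project(eps)
    acc += dot(pe, pe) / n
assert abs(acc / trials - sigma ** 2 * N / n) < 1e-3
```

With $N = 1$ (the constant-output null space used in the proof), this contribution is just $\sigma^2/n$, which is negligible next to the $F_\parallel$ term.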

Next we study $\mathrm{MSE}_\parallel(\hat f)$. Denote $\hat\sigma^2_\parallel = \frac{1}{n} \sum_{i=1}^{n} \epsilon_{i\parallel}^2$ and $E = \max_i |\epsilon_i|$. Using Jensen's inequality and a union bound, we have $$\exp(t\, \mathbb{E}[E]) \le \mathbb{E}[\exp(tE)] = \mathbb{E}[\max_i \exp(t|\epsilon_i|)] \le \sum_{i=1}^{n} \mathbb{E}[\exp(t|\epsilon_i|)] \le 2n \exp(t^2 \sigma^2/2).$$ Taking logarithms on both sides, we get $$\mathbb{E}[E] \le \frac{\log 2n}{t} + \frac{t \sigma^2}{2};$$ minimizing the right hand side over $t$ yields $\mathbb{E}[E] \le \sigma \sqrt{2 \log 2n}$.

Lemma 17. Assume that there exist $C_1, C_2 > 1$ (which may depend on the target function) such that for every $P' > 0$ there exists $\lambda > 0$ for which the solution of the regularized optimization problem (6), denoted $\tilde f$, satisfies $C_1 P' \le \|\{\tilde a_j\}\|_{2/L}^{2/L} \le C_2 P'$. Then the MSE of the regularized optimization problem satisfies $\mathrm{MSE}(\tilde f) \le C\, \mathrm{MSE}(\hat f)$, where $C$ is a constant that depends on $C_1, C_2$, $\hat f$ is the solution of the constrained optimization problem (7), and $$\lambda \lesssim \frac{\mathrm{MSE}(\hat f)}{P'} \lesssim n^{-(1 - 2/L)}.$$

Proof. The MSE bound for the regularized problem is obtained by plugging our assumption into (4). We only need to prove the choice of $\lambda$. We apply the decomposition as in Proposition 15 and only need to consider $F_\parallel$, as $F_\perp$ is not influenced by the regularization or the constraint. From the definitions of $\tilde f$ and $\lambda$, we have $$L(\tilde f) + \lambda \|\{\tilde a_j\}\|_{2/L}^{2/L} \le L(\hat f) + \lambda \|\{\hat a_j\}\|_{2/L}^{2/L}, \qquad L_\parallel(\tilde f) + \lambda \|\{\tilde a_j\}\|_{2/L}^{2/L} \le L_\parallel(\hat f) + \lambda \|\{\hat a_j\}\|_{2/L}^{2/L}.$$ From Proposition 15, we get $$(1 - \epsilon)\mathrm{MSE}(\tilde f) - O\Big( \frac{\log N(\|\{\tilde a_j\}\|_{2/L}^{2/L}, \delta)}{n} \Big) + \lambda \|\{\tilde a_j\}\|_{2/L}^{2/L} \le (1 + \epsilon)\mathrm{MSE}(\hat f) + O\Big( \frac{\log N(\|\{\hat a_j\}\|_{2/L}^{2/L}, \delta)}{n} \Big) + \lambda \|\{\hat a_j\}\|_{2/L}^{2/L}. \qquad (30)$$ Observing that $\mathrm{MSE}(\tilde f) \ge 0$ and that $\frac{\log N(\|\{\hat a_j\}\|_{2/L}^{2/L}, \delta)}{n} \asymp \mathrm{MSE}(\hat f)$ for the optimally chosen $P'$, plugging the assumption into the inequality proves the choice of $\lambda$.

Remark 2. Define $R(\lambda) := R(\arg\min_f L(f) + \lambda R(f))$, where $R(f) = \|\{a_j\}\|_{2/L}^{2/L}$ is the regularizer of a parallel NN $f$. Noticing that $R(\lambda)$ is a non-increasing function of $\lambda$ (as proved below), the assumption in Lemma 17 is equivalent to requiring that if $R(\lambda)$ has any discontinuous points, the jumps at those points should be no larger than $C_2/C_1$ in ratio.
On the other hand, if $\lambda$ is chosen as $\lambda = O\big( \frac{\mathrm{MSE}(\hat f)}{P'} \big)$, then from (30) we get $$\lambda \|\{\tilde a_j\}\|_{2/L}^{2/L} \le O(\mathrm{MSE}(\hat f)) + O\Big( \frac{\log N(\|\{\hat a_j\}\|_{2/L}^{2/L}, \delta)}{n} \Big) \le O(\mathrm{MSE}(\hat f)) + \frac{1}{n} \tilde O\Big( \big( \|\{\hat a_j\}\|_{2/L}^{2/L} \big)^{\frac{1}{1 - 2/L}} \Big).$$ If the constant factor in $\lambda$ is large enough, the above inequality yields two sets of solutions: $$\|\{\tilde a_j\}\|_{2/L}^{2/L} \le O\Big( \|\{\hat a_j\}\|_{2/L}^{2/L} + \frac{\mathrm{MSE}(\hat f)}{\lambda} \Big) = O(\|\{\hat a_j\}\|_{2/L}^{2/L}), \qquad \text{or} \qquad \|\{\tilde a_j\}\|_{2/L}^{2/L} \ge \tilde O\big( (n\lambda)^{\frac{1 - 2/L}{2/L}} \big).$$ In the first case, one can easily see from (30) that $\mathrm{MSE}(\tilde f) \le O(\mathrm{MSE}(\hat f))$, i.e., the MSE of the regularized problem is close to the minimax rate; in the latter case, the generalization gap of the regularized problem is bounded by $O(n^{\frac{1 - 2/L}{2/L}} \lambda^{L/2})$, which is much larger than in the former case. So a sufficient condition for the above assumption is that the model does not overfit significantly (by orders of magnitude) more than the constrained version. In our experiments, we find that the latter case rarely occurs, possibly because of implicit regularization during training, and the relationship between $\lambda$ and the effective degrees of freedom is actually smooth. Notably, as $L$ gets larger, $\|\{\tilde a_j\}\|_{2/L}^{2/L}$ in the second case increases exponentially with $L$ (while the constant terms depend at most polynomially on $L$), which suggests that the latter case is even less likely to happen for deep neural networks.

Claim 18. For fixed $D$, the regularized problem satisfies that $R(\lambda)$, as defined above, is non-increasing in $\lambda$.

Proof. We provide a short proof by contradiction. Suppose there exist $\lambda_1 < \lambda_2$ whose solutions satisfy $R(f_1) < R(f_2)$, where $R(f) = \|\{a_j\}\|_{2/L}^{2/L}$ is the regularizer of a parallel NN and $f_1, f_2$ are the solutions of the regularized problem with $\lambda = \lambda_1, \lambda_2$ respectively.
Then by the definitions of $f_1, f_2$, we have $L(f_1) + \lambda_1 R(f_1) \le L(f_2) + \lambda_1 R(f_2)$, so $\lambda_1 \ge \frac{L(f_1) - L(f_2)}{R(f_2) - R(f_1)}$; and $L(f_2) + \lambda_2 R(f_2) \le L(f_1) + \lambda_2 R(f_1)$, so $\lambda_2 \le \frac{L(f_1) - L(f_2)}{R(f_2) - R(f_1)}$. This contradicts our assumption that $\lambda_1 < \lambda_2$.
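The argument of Claim 18 is the standard monotonicity of a regularization path, and it is easy to visualize on a one-dimensional analogue (ours; soft thresholding is a stand-in for the actual parallel-NN problem, not the paper's estimator):

```python
def soft_threshold_solution(y, lam):
    # argmin_theta (1/2)(theta - y)^2 + lam * |theta|
    # (a one-dimensional analogue of the regularized problem)
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0

y = 2.5
lams = [0.1 * k for k in range(30)]
R = [abs(soft_threshold_solution(y, lam)) for lam in lams]
# R(lambda) is non-increasing in lambda, as Claim 18 asserts in general
assert all(R[i] >= R[i + 1] for i in range(len(R) - 1))
```

In this analogue $R(\lambda)$ is even continuous; the assumption in Lemma 17 only rules out large multiplicative jumps in the general case.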

G MORE DISCUSSION ABOUT THE MAIN RESULT

Representation learning and adaptivity. The results also shed light on the role of representation learning in DNNs' ability to adapt. Specifically, different from the two-layer NN in Parhi & Nowak (2021a), which achieves the minimax rate for $BV(m)$ by choosing an appropriate activation function for each $m$, each subnetwork of a parallel NN can learn to approximate a spline basis function of arbitrary order. This means that if we choose $L$ to be sufficiently large, such a parallel NN with optimally tuned $\lambda$ is simultaneously near optimal for $m = 1, 2, 3, \dots$. In fact, even if different regions of the space have different orders of smoothness, the parallel NN will still be able to learn appropriate basis functions in each local region. To the best of our knowledge, this is a property that none of the classical nonparametric regression methods possess. Synthesis vs. analysis methods. Our result could also inspire new ideas in estimator design. There are two families of methods in nonparametric estimation. One, called the synthesis framework, focuses on constructing appropriate basis functions to encode the contemplated structures and regressing the data onto such bases, e.g., wavelets (Donoho et al., 1998). The other, called the analysis framework, applies analysis regularization to the data directly (see, e.g., RKHS methods (Scholkopf & Smola, 2001) or trend filtering (Tibshirani, 2014)). It appears to us that a parallel NN does both simultaneously. It has a parametric family capable of synthesizing an $O(n)$ subset of an exponentially large family of bases, and then implicitly uses sparsity-inducing analysis regularization to select the relevant basis functions. In this way the estimator does not have to explicitly represent that exponentially large set of basis functions, and is thus computationally more efficient. Random design problem. This paper focuses on the fixed design problem so that the results are comparable to those in nonparametric regression.
One can easily apply the techniques in this paper to obtain an estimation error bound for the random design problem:

Theorem 19. Under the same conditions as Theorem 1, the solution f̂ parameterized by (2) satisfies

$$\mathbb{E}_D \mathbb{E}_f\big[\mathrm{MSE}(\hat f)\big] \le \tilde O\Big(\Big(\frac{w^{4-4/L} L^{2-4/L}}{n^{1-2/L}}\Big)^{\frac{2\alpha/d}{2\alpha/d+1-2/(pL)}}\Big) + e^{-c_6 L},$$

where Õ hides logarithmic factors, c_6 is the constant defined in Theorem 8, E_D indicates that the expectation is taken with respect to the training set D, and E_f indicates that the expectation is taken with respect to the domain of f. The proof is similar to that of Theorem 1. The main difference lies in the proof of the estimation error. For the f_⊥ part, the estimation error can be bounded using the VC-dimension, which is 1. For the f_∥ part, the estimation error can be bounded using its covering number, e.g., Lemma 8 in Schmidt-Hieber (2020).

In the degree-of-freedom derivation, D denotes the dataset; in the third line we make use of the facts that E[ϵ_i] = 0 and E[ϵ_i²] = σ², and in the last line we make use of E[ϵ′_i] = 0, E[ϵ′_i²] = σ², and the fact that the ϵ′_i are independent of y_i and y_{0,i}. One can easily check that a "zero predictor" (a predictor that always predicts ȳ_0, and that always predicts 0 if the target function has zero mean) always has an estimated degree of freedom of 0.

In Figure 3(h)(i), we take the minimum MSE over different choices of λ and plot the average over 10 runs. Due to optimization issues, the neural networks sometimes get stuck at bad local minima, where the empirical loss is larger than the global minimum by orders of magnitude. To deal with this problem, in Figure 3(h)(i) we detect such runs by flagging experiments whose MSE is larger than 1.5 times the average MSE under the same setting, and remove them before computing the average.

H.4.1 REGULARIZATION WEIGHT VS DEGREE-OF-FREEDOM

As we explained in the previous section, the degree of freedom is the exact information-theoretic measure of the generalization gap: a larger degree of freedom implies more overfitting. In Figure 4, we show the relationship between the estimated degree of freedom and the scaling factor of the regularizer λ in a parallel neural network and in trend filtering.



2α/d+1-2/(pL) ,



Figure 1: Illustration of a function with heterogeneous smoothness and the problem of locally adaptive nonparametric regression.

Figure 2: Parallel neural network and the equivalent sparse regression model we discovered.

|dx, and the corresponding mth order bounded variation class BV(m) := {f : TV(f^{(m)}) < ∞}. The more general definition is given in Section C.2. The bounded variation class is tightly connected to Besov classes. Specifically (DeVore &
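As an illustrative numeric check of this definition, TV(f^{(m)}) can be approximated on a fine uniform grid by summing the absolute differences of the discretized m-th derivative. The grid size, example function, and differencing scheme below are our own assumptions, not constructions from the paper.

```python
def finite_diff(vals, order, h):
    """Approximate derivatives by repeated first differences on a uniform grid."""
    for _ in range(order):
        vals = [(b - a) / h for a, b in zip(vals, vals[1:])]
    return vals

def discrete_tv(f, m, n=4096, lo=0.0, hi=1.0):
    """Approximate TV(f^(m)) by summing absolute differences of sampled f^(m)."""
    h = (hi - lo) / n
    grid = [lo + i * h for i in range(n + 1)]
    dm = finite_diff([f(x) for x in grid], m, h)   # samples of f^(m)
    return sum(abs(b - a) for a, b in zip(dm, dm[1:]))

# Example: f(x) = |x - 1/2| has TV(f') = 2, since f' jumps from -1 to +1.
tv1 = discrete_tv(lambda x: abs(x - 0.5), m=1)
```

Functions with a small number of such jumps in a derivative are exactly the heterogeneously smooth functions that BV classes admit but Hölder/Sobolev balls exclude.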

denote the weight and bias in the ℓ-th layer of the j-th subnetwork, respectively. Training this model with ℓ_2 regularization returns:

Figure 3: Numerical experiment results of the Doppler function (a-c,h), and "vary" function (d-f,g). All the "active" subnetworks are plotted in (c)(f). The horizontal axis in (b) is not linear.

(a)-(c)(h)), and a combination of a piecewise linear function and a piecewise cubic function, the "vary" function (Figure 3(d)-(f)(i)). We repeat each experiment 10 times and take the average. The shaded area in Figure 3(b)(e) shows the 95% confidence interval obtained by inverting Wald's test. The degree of freedom (DoF) is computed based on Tibshirani (2015).

NN and splines. Besides Parhi & Nowak (2021a), which we discussed earlier, Parhi & Nowak (2021b;c) also leveraged the connections between NNs and splines. Parhi & Nowak (2021b) focused on characterizing the variational form of multi-layer NNs. Parhi & Nowak (2021c) showed that a two-layer ReLU-activated NN achieves the minimax rate for a BV class of order 1, but covered neither multilayer NNs nor BV classes of order > 1, which are our focus.

Weight-decay regularization with sparsity-inducing penalties. The connection between weight-decay regularization and sparsity-inducing penalties in two-layer NNs is folklore and is used by Neyshabur et al. (2014); Savarese et al. (2019); Ongie et al. (2019); Ergen & Pilanci (2021a;d); Parhi & Nowak (2021a;c); Pilanci & Ergen (2020). The key underlying technique, an application of the AM-GM inequality (which we use in this paper as well), can be traced back to Srebro et al. (2004) (see a recent exposition by Tibshirani (2021)). Tibshirani (2021) also generalized the result to multi-layer NNs, but with simple (element-wise) connections. Besides, Ergen & Pilanci (2020) proved that training a two-layer convolutional neural network (CNN) with weight decay induces sparsity, and pointed to potential extensions of these works, including ours. Finally, it was brought to our attention that while Savarese et al. (2019) mainly consider two-layer NNs, a set of results about L-layer parallel NNs was presented in Appendix C of their paper, which essentially contains the same arguments we used for proving the equivalence to an ℓ_{2/L}-regularized optimization problem in Proposition 4. The difference is that they applied the insight to understand the interpolation regime, while we focused on analyzing the MSE in the noisy case. Specifically, Savarese et al. (2019) showed that parallel networks of depth L have an inductive bias toward the ℓ_{2/L}-sparse model, and that explicit weight decay causes the solutions of these networks to have a sparse last layer with at most n nonzero weights.
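The AM-GM mechanism behind this equivalence can be checked numerically. For an end-to-end coefficient written as a product a = w_1 ⋯ w_L, the minimum of Σ_ℓ w_ℓ² subject to Π_ℓ w_ℓ = a is L·|a|^{2/L}, attained when all |w_ℓ| = |a|^{1/L}; summing over subnetworks turns ℓ_2 weight decay into an ℓ_{2/L} penalty. A pure-Python sketch (the random factorization scheme is our own illustration):

```python
import math
import random

def l2_cost(ws):
    """Weight-decay cost of one factorization: sum of squared weights."""
    return sum(w * w for w in ws)

def random_factorization(a, L, rng):
    """Random positive factors w_1..w_L with product exactly a (a > 0)."""
    logs = [rng.uniform(-1.0, 1.0) for _ in range(L - 1)]
    logs.append(math.log(a) - sum(logs))   # force the product to equal a
    return [math.exp(t) for t in logs]

rng = random.Random(0)
a, L = 0.3, 5
ambound = L * a ** (2.0 / L)               # AM-GM lower bound: L * |a|^(2/L)

# Every factorization of a pays at least the AM-GM bound in l2 cost ...
costs = [l2_cost(random_factorization(a, L, rng)) for _ in range(1000)]
# ... and balanced (equal-magnitude) factors attain it exactly.
balanced = [a ** (1.0 / L)] * L
```

Note that as L grows, |a|^{2/L} approaches an ℓ_0-like penalty, which matches the paper's claim that deeper parallel NNs get exponentially closer to minimax optimal.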

. Most existing work considered the Hölder and Sobolev spaces and their extensions, which contain only homogeneously smooth functions and cannot demonstrate the advantage of NNs over kernels. The exceptions include Suzuki (2018); Oono & Suzuki (2019); Liu et al. (2021), which, as we discussed earlier, require modifications to the NN architecture for each class. In contrast, we require tuning only the standard weight-decay parameter. Most importantly, in all previous works, the estimation error of the model (e.g., via the covering number) depends on the number of nonzero parameters in the model, while our work provides a bound that depends on the norm of the weights instead of the number of subnetworks.

Published as a conference paper at ICLR 2023

B TWO-LAYER NEURAL NETWORK WITH TRUNCATED POWER ACTIVATION FUNCTIONS

We start by recapping the result of Parhi & Nowak (2021a) and formalizing its implication for estimating BV functions. Parhi & Nowak (2021a) considered a two-layer neural network with truncated power activation function. Let the neural network be
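As a concrete sketch of such a model: a two-layer network with the truncated power activation σ_m(t) = max(0, t)^m is a free-knot spline of order m. The knot placements and coefficients below are arbitrary illustrations, not the construction of Parhi & Nowak (2021a).

```python
def trunc_power(t, m):
    """Truncated power activation: sigma_m(t) = max(0, t)^m."""
    return max(0.0, t) ** m

def two_layer_spline_nn(x, knots, coefs, m):
    """f(x) = sum_j c_j * sigma_m(x - b_j): a spline of order m with free knots b_j."""
    return sum(c * trunc_power(x - b, m) for b, c in zip(knots, coefs))

# Illustrative example: a piecewise-linear "hat" built from three m = 1 neurons.
hat = lambda x: two_layer_spline_nn(x, knots=[0.0, 0.5, 1.0], coefs=[2.0, -4.0, 2.0], m=1)
```

Because the knots b_j are trainable, the network can concentrate knots where the target function is rough, which is the source of its local adaptivity.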

and scaling b^{(ℓ)} accordingly, then plugging the regularizer in (2) into (15), finishes the proof.

D.2 COVERING NUMBER OF PARALLEL NEURAL NETWORKS

Theorem 5. The covering number of the model defined in (7), apart from the bias in the last layer, satisfies log N

APPROXIMATION OF NEURAL NETWORKS TO B-SPLINE BASIS FUNCTIONS

Lemma 11. Let M_{m,k,s} be the B-spline of order m with scale 2^{−k} in each dimension and position s ∈ R^d: M_{m,k,s}(x) := M_m(2^k(x − s)), where M_m is defined in (13). There exists a neural network with d-dimensional input and one output, with width w_{d,m} = O(dm) and depth L ≲ log(c_{d,m}/ϵ) for some constant c_{d,m} that depends only on m and d, that approximates the B-spline basis function M_{m,k,s} as defined in Section C.1. This neural network, denoted M̂_{m,k,s}(x), x ∈ R^d, satisfies
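For reference, the cardinal B-spline M_m itself can be evaluated with the standard Cox–de Boor style recursion. This is a pure-Python sketch; the support [0, m] and the partition-of-unity check are standard B-spline properties, not results from the paper.

```python
def cardinal_bspline(m, x):
    """Cardinal B-spline of order m, supported on [0, m].
    M_1 = indicator of [0, 1); M_m(x) = (x M_{m-1}(x) + (m - x) M_{m-1}(x - 1)) / (m - 1)."""
    if m == 1:
        return 1.0 if 0.0 <= x < 1.0 else 0.0
    return (x * cardinal_bspline(m - 1, x)
            + (m - x) * cardinal_bspline(m - 1, x - 1.0)) / (m - 1)

def shifted_basis_sum(m, x, lo=-10, hi=10):
    """Partition of unity: the sum of M_m(x - s) over integer shifts s equals 1."""
    return sum(cardinal_bspline(m, x - s) for s in range(lo, hi))
```

The lemma's point is that each such bump, at any scale 2^{-k} and position s, can be realized by a narrow subnetwork to accuracy ϵ with depth only logarithmic in 1/ϵ.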

$$\begin{aligned}
\widehat{\mathrm{df}} &= \mathbb{E}_D\Big[\hat{\mathbb{E}}\|y_0-\hat y\|_2^2 - \hat{\mathbb{E}}\|y-\hat y\|_2^2 + \hat{\mathbb{E}}\|y-\bar y_0\|_2^2 - \hat{\mathbb{E}}\|y_0-\bar y_0\|_2^2\Big] \\
&= \mathbb{E}\|y_0-\hat y\|_2^2 - \mathbb{E}\|y-\hat y\|_2^2 + \mathbb{E}_D\Big[\hat{\mathbb{E}}\big[(y-y_0)^\top(y+y_0-2\bar y_0)\big]\Big] \\
&= \mathbb{E}\|y_0-\hat y\|_2^2 - \mathbb{E}\|y-\hat y\|_2^2 + \mathbb{E}\Big[\sum_{i=1}^n \epsilon_i\,(2y_{0,i}+\epsilon_i-2\bar y_0)\Big] \\
&= \mathbb{E}\|y_0-\hat y\|_2^2 - \mathbb{E}\|y-\hat y\|_2^2 + n\sigma^2 \\
&= \mathbb{E}\|y'-\hat y\|_2^2 - \mathbb{E}\|y-\hat y\|_2^2
\end{aligned}$$

Figure 4: The relationship between the degree of freedom and the scaling factor of the regularizer λ. The solid line shows the result after denoising. (a)(b): in a parallel NN. (c)(d): in trend filtering. (a)(c): the "vary" function. (b)(d): the Doppler function.

Symbols used in this paper

ACKNOWLEDGMENTS

The work is partially supported by NSF Award #2134214. The authors thank Alden Green for references on Besov-space embeddings of BV classes, Dheeraj Baby for helpful discussions on B-splines, and Ryan Tibshirani for discussions on connections to (Tibshirani, 2021) and for sharing with us a simpler proof of Theorem 1 of (Parhi & Nowak, 2021a) based on Carathéodory's theorem (used in the proof of Theorem 9 on two-layer NNs).


we get

where MSE(f̂) denotes the MSE of the solution to the constrained optimization problem (7) with optimally chosen M (or P′). Finally, under the assumption in Lemma 17, for any constrained optimization problem there exists a regularized optimization problem whose MSE is no larger than that of the constrained problem up to a constant factor. This closes the connection between (6) and (7) and finishes the proof.

Note that the empirical risk minimizer (ERM) of the parallel neural network satisfies ‖{a_j}‖_{2/L}^{2/L} = ‖{ã_{j,M̃}}‖_{2/L}^{2/L}, where {ã_{j,M̃}} are the coefficients of the particular M̃-sparse approximation, although {a_j} is not necessarily M̃-sparse. Empirically, one only needs to guarantee that at initialization the number of subnetworks satisfies M ≥ M̃, so that the M̃-sparse approximation is feasible and the approximation error bound from Theorem 8 can be applied. Theorem 8 also says that ‖{a_j}‖_{2/L}^{2/L} = ‖{ã_{j,M̃}}‖_{2/L}^{2/L} ≲ M̃^{1−2/(pL)}, so we can apply the covering number bound from Theorem 5 with P′ = M̃^{1−2/(pL)}. Finally, if λ is optimally chosen, then it achieves a smaller MSE than this particular λ′, which has been proven to be no more than O(M̃^{−α/d}), and this completes the proof.

Proof of Proposition 15. For any function

let f_⊥ be the projection of f onto F_⊥, and define f_∥ = f − f_⊥ to be the projection onto the orthogonal complement. Note that f_∥ is not necessarily in F_∥. However, if f ∈ F, then f_∥ ∈ F_∥. y_{i,⊥} and y_{i,∥} are defined by creating a function f_y such that f_y(x_i) = y_i for all i, e.g., via interpolation. Because F_∥ and F_⊥ are orthogonal, the empirical loss and the population loss can be decomposed in the same way:

This can be verified by decomposing f, f_0 and y into two orthogonal components as shown above, and observing that

where f̂_⊥ is the projection of f̂ onto F_⊥, and f̂_∥ = f̂ − f̂_⊥, respectively.

Proof. Since f̂ ∈ F, by definition f̂_∥ ∈ F_∥. Assume that there exist f′_⊥ and f′_∥ with either

which shows that f̂ is not the minimizer of L(f) and violates the assumption.

Let F̄_∥ be a covering set of F_∥ = {f_∥ : f ∈ F}. For any f̄_∥ ∈ F̄_∥,

The first term can be bounded using Bernstein's inequality: let

Using Bernstein's inequality, for any f̄_∥ ∈ F̄_∥, with probability at least 1 − δ_p,

The last inequality holds for all ϵ > 0. The union bound shows that with probability at least 1 − δ, for all f̄_∥ ∈ F̄_∥,

By rearranging the terms and using the definition of L(f̄_∥), we get

Taking the expectation (over D) on both sides, and noticing that

where the integration can be computed by substituting δ = e^x; though the integrand is not Riemann integrable, it is Lebesgue integrable.

Similarly, let f̄_∥ = arg min_{f ∈ F̄_∥} L_∥(f); with probability at least 1 − δ_q, for any ϵ > 0,

Taking the expectation on both sides,

and similarly

We can conclude that

where the first line comes from (27), the second comes from (29), the third line holds because f̄_∥ = arg min_{f ∈ F̄_∥} L̄_∥(f), and the last line comes from (28). We also use the fact that L̄_∥(f̂) ≤

The "vary" function used in Figure 3 is

where M_1 and M_3 are first- and third-order cardinal B-spline basis functions, respectively.
We uniformly take 256 samples from 0 to 1 in the piecewise cubic function experiment, and uniformly take 1000 samples from 0 to 1 in the Doppler function and "vary" function experiments. We add zero-mean independent (white) Gaussian noise to the observations. The standard deviation of the noise is 0.4 in the Doppler function experiment and 0.1 in the "vary" function experiment.
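A minimal sketch of this data-generating process, assuming the standard Donoho–Johnstone form of the Doppler function (the paper's exact variant may differ):

```python
import math
import random

def doppler(x):
    """Donoho-Johnstone Doppler function: smooth on the right, rapidly oscillating near 0."""
    return math.sqrt(x * (1.0 - x)) * math.sin(2.1 * math.pi / (x + 0.05))

def make_dataset(n=1000, sigma=0.4, seed=0):
    """n equispaced samples on [0, 1] with i.i.d. zero-mean Gaussian noise of std sigma."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    ys = [doppler(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

xs, ys = make_dataset()
```

The Doppler function is the canonical test of local adaptivity: its smoothness degrades continuously as x approaches 0, so a single global bandwidth cannot fit it well.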

H.2 TRAINING/FITTING METHOD

In the piecewise polynomial function ("vary") experiment, the depth of the PNN is L = 10, the width of each subnetwork is w = 10, and the model contains M = 500 subnetworks. The depth of the (non-parallel) NN is also 10, and its width is 240, so that the NN and the PNN have almost the same number of parameters. In the Doppler function experiment, the depth of the PNN is L = 12, the width of each subnetwork is w = 10, and the model contains M = 2000 subnetworks, because this problem requires a more complex model to fit. The depth of the NN is 12, and its width is 470. We use the Adam optimizer with a learning rate of 10^{−3}. We first train the neural network layer by layer without weight decay. Specifically, we start with a two-layer neural network with the same number of subnetworks and the same width in each subnetwork, then train a three-layer neural network initialized from the trained two-layer one, and so on until the desired depth is reached. After that, we tune the weight-decay parameter and train until convergence. In both the trend filtering and smoothing spline experiments, the order is 3, and in the wavelet denoising experiment we use the sym4 wavelet with soft thresholding. We implement trend filtering following Tibshirani (2014) using CVXPY, and use MOSEK to solve the convex optimization problem. We directly call the R function smooth.spline to fit smoothing splines.
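For reference, order-k trend filtering solves min_β ½‖y − β‖₂² + λ‖D^{(k+1)} β‖₁, where D^{(k+1)} is the (k+1)-th order discrete difference operator. A pure-Python sketch of the penalty term, built by repeated first differencing (the CVXPY/MOSEK solve itself is omitted):

```python
def diff(v):
    """First-order discrete difference of a sequence."""
    return [b - a for a, b in zip(v, v[1:])]

def higher_order_diff(v, order):
    """Apply the first difference `order` times: D^(order) v."""
    for _ in range(order):
        v = diff(v)
    return v

def tf_penalty(beta, k):
    """Trend filtering penalty ||D^(k+1) beta||_1 for order-k trend filtering."""
    return sum(abs(t) for t in higher_order_diff(beta, k + 1))

# Cubic trend filtering (k = 3): any cubic polynomial incurs zero penalty,
# so the l1 term only charges for changes in the piecewise-cubic structure.
cubic = [(i / 10.0) ** 3 - (i / 10.0) for i in range(30)]
pen = tf_penalty(cubic, k=3)
```

This ℓ₁ penalty on discrete derivatives is the "analysis regularization" baseline against which the parallel NN is compared in Figure 4.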

H.3 POST PROCESSING

The degree of freedom of the smoothing spline is returned by the solver in R, and is rounded to the nearest integer when plotting. To estimate the degree of freedom of trend filtering, for each choice of λ we repeat the experiment 10 times and compute the average number of nonzero knots as the estimated degree of freedom. For neural networks, we use the definition of Tibshirani (2015):

where df denotes the degree of freedom, σ² is the variance of the noise, y are the labels, ŷ are the predictions, and y′ is an independent copy of y. We find that estimating (31) directly by sampling leads to large error when the degree of freedom is small. Instead, we compute

where df̂ is the estimated degree of freedom, Ê denotes the empirical average (sample mean), y_0 is the target function, and ȳ_0 is the mean of the target function over its domain.

Proposition 20. The expectation of (32) over the dataset D equals (31).

As shown in the figure, generally speaking, as λ decreases towards 0 the degree of freedom should increase. However, for parallel neural networks, if λ is very close to 0 the estimated degree of freedom does not keep increasing, even though the degree of freedom is much smaller than the number of parameters, and actually even smaller than the number of subnetworks. Instead, it actually decreases a little. This effect has not been observed in other nonparametric regression methods, e.g., trend filtering, which overfits every noisy data point perfectly when λ → 0. For the neural networks, even if we do not regularize at all, the amount of overfitting is still relatively mild (30/256 vs. 80/1000). In our experiments using neural networks, when λ is small we denoise the estimated degree of freedom using isotonic regression. We do not know the exact reason for this curious observation.
Our hypothesis is that it might be related to issues with optimization, i.e., the optimizer ends up at a local minimum that generalizes better than a global minimum; or it could be connected to the "double descent" behavior of DNN (Nakkiran et al., 2021) under over-parameterization.
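The estimator above can be sanity-checked on predictors whose degree of freedom is known in closed form. A pure-Python Monte-Carlo sketch; the 1/(2σ²) normalization and both test predictors are our own illustrative assumptions (the sample-mean predictor has exactly one degree of freedom, and a constant ȳ_0 "zero predictor" has exactly zero):

```python
import math
import random

def df_estimate(y0, predict, sigma, reps=3000, seed=0):
    """Monte-Carlo degree-of-freedom estimate in the spirit of (32):
    average ||y0-yhat||^2 - ||y-yhat||^2 + ||y-ybar0||^2 - ||y0-ybar0||^2, over 2*sigma^2."""
    rng = random.Random(seed)
    n = len(y0)
    ybar0 = sum(y0) / n
    total = 0.0
    for _ in range(reps):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = [a + e for a, e in zip(y0, eps)]         # noisy labels
        yhat = predict(y)                            # predictor under test
        total += (sum((a - h) ** 2 for a, h in zip(y0, yhat))
                  - sum((b - h) ** 2 for b, h in zip(y, yhat))
                  + sum((b - ybar0) ** 2 for b in y)
                  - sum((a - ybar0) ** 2 for a in y0))
    return total / (reps * 2.0 * sigma ** 2)

y0 = [math.sin(2 * math.pi * i / 50) for i in range(50)]
mean_predictor = lambda y: [sum(y) / len(y)] * len(y)
df_mean = df_estimate(y0, mean_predictor, sigma=1.0)
```

Unlike a direct estimate of (31), this form needs no independent copy y′, which is what makes it usable when only one noisy realization per target is available.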

H.4.2 DETAILED NUMERICAL RESULTS

In order to allow readers to view our results in detail, we plot the numerical experiment results of each method separately in Figure 5 and Figure 6, together with the coefficients of the truncated power basis at individual data points (the free knots learned by the NN are snapped to the nearest input x to be comparable).

