APPROXIMATION AND NON-PARAMETRIC ESTIMATION OF FUNCTIONS OVER HIGH-DIMENSIONAL SPHERES VIA DEEP RELU NETWORKS

Abstract

We develop a new approximation and statistical estimation analysis of deep feedforward neural networks (FNNs) with the Rectified Linear Unit (ReLU) activation. The functions of interest for the approximation and estimation are assumed to be from Sobolev spaces defined over the $d$-dimensional unit sphere with smoothness index $r > 0$. In the regime where $r$ is of constant order (i.e., $r = O(1)$), it is shown that at most $d^d$ active parameters are required to obtain a $d^{-C}$ approximation rate for some constant $C > 0$. In the regime where the index $r$ grows in the order of $d$ (i.e., $r = O(d)$) asymptotically, we prove that the approximation error decays at the rate $d^{-d^{\beta}}$ with $0 < \beta < 1$, up to some constant factor independent of $d$. The required number of active parameters in the networks for the approximation increases polynomially in $d$ as $d \to \infty$. It is also shown that the bound on the excess risk has a $d^d$ factor when $r = O(1)$, whereas it has a $d^{O(1)}$ factor when $r = O(d)$. We emphasize our findings by making comparisons to the results on the approximation and estimation errors of deep ReLU FNNs when functions are from Sobolev spaces defined over the $d$-dimensional cube. In this case, we show that with the current state-of-the-art result, the $d^d$ factor remains in both the approximation and estimation errors, regardless of the order of $r$.

1. INTRODUCTION

Neural networks have demonstrated tremendous success in the tasks of image classification (Krizhevsky et al., 2012; Long et al., 2015) , pattern recognition (Silver et al., 2016) , natural language processing (Graves et al., 2013; Bahdanau et al., 2015; Young et al., 2018) , etc. The datasets used in these real world applications frequently lie in high-dimensional spaces (Wainwright, 2019) . In this paper, we try to understand the fundamental limits of neural networks in the high-dimensional regime through the lens of its approximation power and its generalization error. Both approximation power and generalization error of neural network can be analyzed through specifying the target function's property such as its smoothness index r > 0 and its input space X . In particular, deep feed-forward neural networks (FNNs) with Rectified Linear Units (ReLU) have been extensively studied when they are used for approximating and estimating functions from general function class such as Sobolev class defined on d-dimensional cube (i.e., X := C d ), denoted as W r p (C d ) for 1 ≤ p ≤ ∞. However, in practice, signals on a spherical surface (i.e., X := S d-1 = {x ∈ R d : ∥x∥ 2 = 1}) rather than on Euclidean spaces often arise in various fields, such as astrophysics (Starck et al., 2006; Wiaux et al., 2005) , computer vision (Brechbühler et al., 1995) , and medical imaging (Yu et al., 2007) . Motivated by this, we focus our attention on the cases where deep ReLU FNNs are used for function approximators and estimators, when functions are assumed to be from the Sobolev spaces defined over S d-1 ; that is f ∈ W r ∞ (S d-1 ). Under this setting, our analysis focuses on how the input dimension d explicitly affects the approximation and estimation rates of f ∈ W r ∞ (S d-1 ). At the same time, we show how the scalability of deep ReLU FNNs grows in the high-dimensional regime. 
Here, the scalability is mainly measured through three metrics of the network: (1) the width, denoted as $W$, (2) the depth, denoted as $L$, and (3) the number of active parameters, denoted as $\mathcal{N}$ (Anthony & Bartlett, 1999). It should be emphasized that we find an interaction between the smoothness index $r > 0$ and the dimension $d$, whereas we cannot find one for the case when $f \in W^r_\infty(C^d)$. We summarize our detailed findings in the following Subsection.

1.1. PAPER ROAD MAP AND CONTRIBUTIONS

In Theorem 3.1, we provide an approximation bound of a deep ReLU FNN (i.e., $\tilde f$) for approximating target functions in Sobolev spaces defined over the sphere (i.e., $f \in W^r_\infty(S^{d-1})$). Notably, in the bound, we track the explicit dependence on the data dimension $d$, allowing it to tend to infinity. This tracking shows how the three components of the network architecture, width ($W$), depth ($L$), and the number of active parameters ($\mathcal{N}$), should change as $d$ increases in order to obtain a good approximation error rate. Our result implies that for approximating $f \in W^r_\infty(S^{d-1})$, the larger the smoothness index $r$ is, the narrower the network can be, while the depth of the network can be fixed. Moreover, when $r$ is of the same order as $d$, the network can avoid the curse of dimensionality, requiring only $O(d^2)$ active parameters. It is interesting to note that the function smoothness index can affect the design of the network, specifically its width, while it has little effect on the design of its depth. Admittedly, the condition $r = O(d)$ is restrictive in the sense that it makes the function space $W^r_\infty(S^{d-1})$ small. Nonetheless, it contains some interesting examples, namely reproducing kernel Hilbert spaces (RKHS) generated by $C^\infty$ kernels such as Gaussian kernels. Additionally, to the best of our knowledge, this finding is not observed in the current approximation theory of neural networks when $f \in W^r_\infty(C^d)$, where $C^d$ denotes a $d$-dimensional cube and $\tilde f$ is a deep ReLU FNN. Out of the long list of literature to be introduced shortly, we choose the result from Schmidt-Hieber (2020) for the comparison, as it also has an explicit dependence on $d$ in its approximation bound. From that result, it can be seen that the curse cannot be avoided, even when $r = O(d)$. The width of the constructed network is lower-bounded by $\Omega(r^d \vee e^d)$ and the number of active parameters is upper-bounded by $O((r + d)^d)$.
Note that the bounds on both components grow exponentially in $d$ as $r$ increases. See Subsection 3.1 for detailed comparisons. We further compare estimating functions $f \in W^r_\infty(S^{d-1})$ (Theorem 4.3) with estimating functions $f \in W^r_\infty(C^d)$ (Theorem 4.4) via deep ReLU FNNs under the non-parametric regression framework. Given $n$ noisy samples, the two Theorems suggest the specific orders of $W$, $L$ and $\mathcal{N}$ in terms of $n$, $d$ and $r$ that give the tightest bound on the excess risk of the respective function estimator from Proposition 4.2. When $r = O(1)$, it is shown that the excess risk upper bounds of both function estimators have $d^d$ in the constant factors. In contrast, when $r = O(d)$, the bound for estimating functions $f \in W^r_\infty(S^{d-1})$ has at most a $d^{O(1)}$ factor, whereas the bound for the function estimator of $f \in W^r_\infty(C^d)$ has a $d^d$ factor. See Table 1 and Subsection 4.2 for detailed comparisons.

1.2. RELATED WORKS

In this Subsection, to help readers understand the contributions of our paper more clearly, we list the relevant works and compare them with ours.

Approximation of $f \in W^r_\infty(S^{d-1})$ via deep CNN. For the approximation theory of $f \in W^r_\infty(S^{d-1})$, we refer readers to Fang et al. (2020) and Feng et al. (2021). In their works, however, a convolutional neural network (CNN) is used as the function approximator under the fixed $d$ setting.

Approximation of $f \in W^r_\infty(C^d)$ via deep ReLU FNN. The approximation theory of deep ReLU FNNs for functions $f \in W^r_\infty(C^d)$ has a lengthy history in the literature. Representatively, Mhaskar (1996) showed that $f$ can be approximated uniformly within $\varepsilon$ accuracy by a 1-layer neural network with $O(\varepsilon^{-d/r})$ neurons and an infinitely differentiable activation function. Later, for deep ReLU networks, Yarotsky (2017) showed that the number of active parameters ($\mathcal{N}$) in networks is bounded by $O(\varepsilon^{-d/r} \log \frac{1}{\varepsilon})$, and the depth has the order $O(\log \frac{1}{\varepsilon})$. He further proved that $\mathcal{N}$ is lower-bounded by the order $O(\varepsilon^{-d/r})$, which is backed up by the result in DeVore et al. (1989). Petersen & Voigtlaender (2018) showed that there exists a deep ReLU network with bounded and quantized weight parameters, with $O(\varepsilon^{-d/r})$ network size, and with $\varepsilon$-independent depth, achieving $\varepsilon$-accuracy in the $L^p$ norm for approximating functions $f \in W^r_p(C^d)$ with $1 \le p \le \infty$. For $f \in W^r_\infty(C^d)$, Schmidt-Hieber (2020) proved that a network of size $O(\varepsilon^{-d/r})$ with bounded weight parameters achieves $\varepsilon$-approximation error in the $L^\infty$ norm.

Function spaces with special structures. The result of Yarotsky (2017) implies that deep ReLU networks cannot escape the curse of dimensionality for approximating $f \in W^r_\infty(C^d)$. Many papers have demonstrated that the effects of dimension can be either avoided or lessened by considering function spaces that differ from Sobolev spaces but are still defined over $C^d$.
Just to name a few, Mhaskar et al. (2016) studied that a function with a compositional structure with regularity r can be approximated by neural network with O(ε -2/r ) neurons within ε accuracy. Suzuki (2018) proved the deep ReLU network with O(ε -1/r ) neurons can avoid the curse for approximating functions in mixed smooth Besov spaces. Chen et al. (2019) showed the network size scales as O(ε -D/r ) for approximating C r functions, when they are defined on a Riemannian manifold isometrically embedded in R d with manifold dimension D with D ≪ d. Montanelli & Du (2019) and Blanchard & Bennouna (2022) showed respectively the deep and shallow ReLU network break the curse for Korobov spaces. Estimation rates of excess risk under non-parametric framework. Many researchers also have tried to tackle how the neural networks avoid the curse by considering specially designed function spaces under the non-parametric regression framework. We only provide an incomplete list of them. Such structures include additive ridge functions (Fang & Cheng, 2023) , composite function spaces with hierarchical structures (Schmidt-Hieber, 2020; Han et al., 2022) , mixed-Besov spaces (Suzuki, 2018) , Hölder spaces defined over a lower-dimensional manifold embedded in R d (Chen et al., 2022) . They all showed the function estimators with neural network architectures can lessen the curse by showing the excess risks of the estimators are bounded by O(n -2r/(2r+D ′ ) ), where n denotes the size of a noisy dataset, and D ′ ≪ d is an intrinsic dimension uniquely determined through the characteristics of function spaces, when they are compared with the minimax risk O(n -2r/(2r+d) ) (Donoho & Johnstone, 1998) for f ∈ W r ∞ (C d ). Comparisons. The aforementioned works mainly focused on the approximation and estimation of functions defined on C d , not S d-1 , for the fixed d. 
Moreover, the introduced papers on approximation theory, except the work of Schmidt-Hieber (2020), hide the dependence on d in the Big-O notation of N in ε-accuracy, even for papers where they consider the function spaces with special structures. Thus, it is not clear how the d affects the approximation bound and the scale of the provided network architecture. Introduced papers on estimation rate for excess risk also follow the same philosophy with papers on approximation theory, as they work on the fixed d setting. In contrast, we work on the S d-1 input space, track the explicit dependence on d in the error bound, and describe how d affects the scale of deep ReLU FNN as d → ∞ with its interactions with function smoothness r > 0. Our paper focuses on tracking the dependence on d in the constant factor hidden in the Big-O notations both in approximation and estimation error rates, rather than paying attentions to reducing the exponential dependence of d with base ε in N or with base n in excess risk bound.

2. PRELIMINARY DEFINITIONS

In this Section, we provide the mathematical definitions of deep ReLU FNN and Sobolev function spaces on unit sphere.

2.1. DEFINITION OF DEEP RELU NETWORK

To define the deep ReLU network mathematically, we adopt the notation used in Schmidt-Hieber (2020). For $v = (v_1, \ldots, v_r) \in \mathbb{R}^r$, let $\sigma_v : \mathbb{R}^r \to \mathbb{R}^r$ be the shifted ReLU (Rectified Linear Unit) activation function, defined as $\sigma_v((y_1, \ldots, y_r)^\top) := (\sigma(y_1 - v_1), \ldots, \sigma(y_r - v_r))^\top$, where $\sigma(x) = \max(x, 0)$. With this notation, the network architecture $(L, p)$ consists of a positive integer $L$, called the number of hidden layers, and a width vector $p := (p_0, \ldots, p_{L+1}) \in \mathbb{N}^{L+2}$. A deep ReLU network with architecture $(L, p)$ considered in this work is then any function of the form

$$\tilde f : S^{d-1} \to \mathbb{R}, \quad x \mapsto \tilde f(x) = W_L \sigma_{v_L} W_{L-1} \sigma_{v_{L-1}} \cdots \sigma_{v_1} W_1 x, \tag{1}$$

where $W_i \in \mathbb{R}^{p_{i+1} \times p_i}$ is a weight matrix with $p_0 = d$, $p_{L+1} = 1$, and $v_i \in \mathbb{R}^{p_i}$ is a shift vector. Network functions are built by alternating matrix-vector multiplications with the action of the nonlinear activation function $\sigma$. Let $\|W_j\|_0$ and $|v_j|_0$ be the numbers of nonzero entries of $W_j$ and $v_j$ in the $j$-th hidden layer. The final form of the neural network class we consider in this paper is given by

$$\mathcal{F}(L, p, \mathcal{N}) := \Big\{ \tilde f \text{ of the form (1)} : \sum_{j=1}^{L} \|W_j\|_0 + |v_j|_0 \le \mathcal{N} \Big\}. \tag{2}$$

The main advantage of this notation is its convenience for tracking the construction process of the network $\tilde f$ for approximating $f \in W^r_\infty(S^{d-1})$. See Section D.2 in the Appendix. We define the Sobolev spaces over the sphere in the next Subsection.
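The alternating structure in (1) and the sparsity count in (2) can be illustrated with a minimal NumPy sketch. The function names `relu_network` and `active_parameters`, and the convention that the list `weights` holds one more matrix than the list `shifts`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def relu_network(x, weights, shifts):
    """Alternate matrix products with shifted ReLUs (illustrative layout).

    weights: list of 2-D arrays applied from first to last.
    shifts:  list of 1-D arrays; shifts[i] is subtracted before the ReLU
             that follows weights[i].
    """
    h = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        h = W @ np.maximum(h - v, 0.0)  # sigma_v(h) = max(h - v, 0), then next layer
    return h

def active_parameters(weights, shifts):
    """Count nonzero entries of all weight matrices and shift vectors."""
    n_weights = sum(int(np.count_nonzero(W)) for W in weights)
    n_shifts = sum(int(np.count_nonzero(v)) for v in shifts)
    return n_weights + n_shifts

# Tiny hypothetical network on S^1 (d = 2) with one hidden layer of width 2.
weights = [np.array([[1.0, -1.0], [0.0, 2.0]]), np.array([[1.0, 1.0]])]
shifts = [np.array([0.5, 0.0])]
x = np.array([1.0, 0.5])
out = relu_network(x, weights, shifts)
count = active_parameters(weights, shifts)
```

Counting nonzero entries rather than all entries is what lets the class in (2) contain wide but sparse networks.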

2.2. DEFINITION OF SOBOLEV SPACES OVER SPHERE

For $1 \le p \le \infty$, we denote by $L^p(S^{d-1}) = L^p(S^{d-1}, \rho_X)$ the $L^p$-function space defined with respect to the normalized Lebesgue measure $\rho_X$ on $S^{d-1}$, with norm $\|g\|_p := \big(\int_{S^{d-1}} |g(x)|^p \rho_X(dx)\big)^{1/p}$. Let $\mathcal{H}^d_k$ be the space of homogeneous harmonic polynomials of total degree $k \in \mathbb{Z}_+$ restricted to $S^{d-1} \subset \mathbb{R}^d$. In Dai & Xu (2013) and Efthimiou & Frye (2014), its dimension for $k \in \mathbb{N}$ is found to be

$$N(k, d) = \frac{2k + d - 2}{k} \binom{k + d - 3}{k - 1}. \tag{3}$$

Note that $L^2(S^{d-1})$ is a Hilbert space with inner product $\langle f, g \rangle_{L^2(S^{d-1})} := \int_{S^{d-1}} f(x) g(x) \rho_X(dx)$ for $f, g \in L^2(S^{d-1})$. The spaces $\mathcal{H}^d_k$, for $k \in \mathbb{Z}_+$, of spherical harmonics are mutually orthogonal with respect to the inner product of $L^2(S^{d-1})$. Since the space of spherical polynomials is dense in $L^2(S^{d-1})$, every $f \in L^2(S^{d-1})$ has a spherical harmonic expansion

$$f = \sum_{k=0}^{\infty} \mathrm{Proj}_k(f) = \sum_{k=0}^{\infty} \sum_{\ell=1}^{N(k,d)} \hat f_{k,\ell} Y_{k,\ell}$$

converging in the $L^2(S^{d-1})$ norm. Hereafter, $\{Y_{k,\ell}\}_{\ell=1}^{N(k,d)}$ denotes an orthonormal basis of $\mathcal{H}^d_k$, $\hat f_{k,\ell}$ are the Fourier coefficients of $f$ given by $\hat f_{k,\ell} := \langle f, Y_{k,\ell} \rangle_{L^2(S^{d-1})} = \int_{S^{d-1}} f(x) Y_{k,\ell}(x) \rho_X(dx)$, and $\mathrm{Proj}_k(f)$ denotes the orthogonal projection of $L^2(S^{d-1})$ onto $\mathcal{H}^d_k$, which has the integral representation

$$\mathrm{Proj}_k(f)(x) := \int_{S^{d-1}} f(y) Z_k(x, y) \rho_X(dy), \quad \forall x \in S^{d-1},$$

where $Z_k(x, y) := \sum_{\ell=1}^{N(k,d)} Y_{k,\ell}(x) Y_{k,\ell}(y)$ for all $x, y \in S^{d-1}$. It is known that $Z_k(x, y)$ is a reproducing kernel of $\mathcal{H}^d_k$, independent of the choice of $\{Y_{k,\ell}\}_{\ell=1}^{N(k,d)}$, and with $\lambda_G = \frac{d-2}{2}$,

$$Z_k(x, y) = \frac{k + \lambda_G}{\lambda_G} G^{\lambda_G}_k(\langle x, y \rangle), \quad \forall x, y \in S^{d-1},$$

where $G^{\lambda_G}_k$ is the Gegenbauer polynomial of degree $k$ with parameter $\lambda_G > -\frac{1}{2}$; see for instance Dai & Xu (2013). Denoting $u := \langle x, y \rangle$, the exact expression of $G^{\lambda_G}_k(u)$ is given in terms of the Gamma function by

$$G^{\lambda_G}_k(u) := \sum_{\ell=0}^{\lfloor k/2 \rfloor} (-1)^{\ell} \frac{\Gamma(k - \ell + \lambda_G)}{\Gamma(\lambda_G)\, \ell!\, (k - 2\ell)!} (2u)^{k - 2\ell}. \tag{4}$$

The spaces $\mathcal{H}^d_k$ of spherical harmonics can also be characterized as eigenfunction spaces of the Laplace-Beltrami operator $\Delta_{S^{d-1}}$ on $S^{d-1}$. Indeed,

$$\mathcal{H}^d_k = \big\{ f \in C^2(S^{d-1}) : \Delta_{S^{d-1}} f = -\lambda_k f \big\},$$

where $\lambda_k = k(k + d - 2)$ and $C^2(S^{d-1})$ denotes the space of all twice continuously differentiable functions on $S^{d-1}$. In fact, with the identity operator $I$, we may define the fractional power $(-\Delta_{S^{d-1}} + I)^{\alpha}$ of the operator $-\Delta_{S^{d-1}} + I$ in a distributional sense for $\alpha \in \mathbb{R}$:

$$\mathrm{Proj}_k\big((-\Delta_{S^{d-1}} + I)^{\alpha} f\big) = (1 + \lambda_k)^{\alpha}\, \mathrm{Proj}_k(f).$$

Now, we define the Sobolev space $W^r_p(S^{d-1})$ to be the subspace of $L^p(S^{d-1})$, for $1 \le p \le \infty$ and $r > 0$, with the finite norm

$$\|f\|_{W^r_p(S^{d-1})} = \big\|(-\Delta_{S^{d-1}} + I)^{r/2} f\big\|_p < \infty. \tag{5}$$

In this paper, we consider the case $p = \infty$ (i.e., $f \in W^r_\infty(S^{d-1})$), which is essentially the Hölder space. The sphere $S^{d-1}$ is a smooth Riemannian manifold without boundary. Its Laplace-Beltrami operator $\Delta_{S^{d-1}}$, acting as a Hessian operator of functions on the sphere, gives the natural definition of the Sobolev spaces $W^r_\infty(S^{d-1})$ in (5); that is, the Sobolev space is a collection of continuous functions defined on the sphere $S^{d-1}$ whose generalized (distributional) derivatives up to order $r$ are essentially bounded. See Equations (16) in Hesse (2006), (3.4) in Fang et al. (2020), (16) in Feng et al. (2021), and (5.1.9) in Freeden et al. (1998) for more detailed treatments of $W^r_\infty(S^{d-1})$. Readers can also refer to the definition of $W^r_\infty(C^d)$ in Appendix A, with $C^d = [0, 1]^d$, for comparison with $W^r_\infty(S^{d-1})$ and for later use in Subsection 3.1.
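The dimension formula for the spherical harmonic spaces and the explicit Gegenbauer sum can be checked against each other numerically. The sketch below uses only the Python standard library; the function names are illustrative, $d \ge 3$ is assumed so that $\lambda_G > 0$, and the kernel normalization $(k + \lambda_G)/\lambda_G$ is the standard one from the spherical harmonics literature. The check relies on the fact that a reproducing kernel of an $N(k,d)$-dimensional space, taken with respect to a probability measure, equals $N(k,d)$ on the diagonal.

```python
from math import comb, factorial, gamma

def harmonic_dim(k, d):
    """Dimension N(k, d) of degree-k spherical harmonics on S^{d-1}."""
    if k == 0:
        return 1  # constants only
    return (2 * k + d - 2) * comb(k + d - 3, k - 1) // k

def gegenbauer(k, lam, u):
    """Gegenbauer polynomial of degree k, parameter lam > 0, via the explicit sum."""
    total = 0.0
    for l in range(k // 2 + 1):
        coeff = (-1) ** l * gamma(k - l + lam) / (
            gamma(lam) * factorial(l) * factorial(k - 2 * l)
        )
        total += coeff * (2 * u) ** (k - 2 * l)
    return total

def zonal_kernel(k, d, u):
    """Reproducing kernel Z_k as a function of u = <x, y>, assuming d >= 3."""
    lam = (d - 2) / 2
    return (k + lam) / lam * gegenbauer(k, lam, u)

# On the diagonal (u = 1) the kernel should equal the dimension N(k, d).
diagonal_gaps = [
    abs(zonal_kernel(k, d, 1.0) - harmonic_dim(k, d))
    for d in (3, 4, 6)
    for k in range(6)
]
```

For $d = 3$ the parameter is $\lambda_G = 1/2$ and the Gegenbauer polynomials reduce to Legendre polynomials, which gives a second easy sanity check.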

3. APPROXIMATION ERROR

Now, we present our Theorem on approximating functions $f \in W^r_\infty(S^{d-1})$ via $\mathcal{F}(L, p, \mathcal{N})$ in (2).

Theorem 3.1 Let $0 < \alpha < 1$ and $m, N, M \in \mathbb{N}$ with $1 \le N \le d^{\alpha} + 1$. For any function $f \in W^r_\infty(S^{d-1})$ with $r > 0$, there exists a network

$$\tilde f \in \mathcal{F}\big(L, (d, 22NM, \ldots, 22NM, 1), \mathcal{N}\big) \tag{6}$$

with depth $L = (m + 4)\lceil \log_2(2N) \rceil$ and number of parameters $\mathcal{N} \le M(2d + 404N \cdot (m + 3) + 2N + 4) + 1$ such that

$$\|f - \tilde f\|_\infty \le C''_\eta \|f\|_{W^r_\infty(S^{d-1})} \times \max\Big\{ N^{-r},\ \Big(\frac{6}{\pi e}\Big)^{\frac{d}{4}} d^{N + \frac{3d - 4r - 2}{8}} \frac{(2N + 1)^{\frac{3d - 4r}{4}}}{\sqrt{M}},\ d^{2N} \big(\log_2(2N)\big)^2 2^{-2m} \Big\}, \tag{7}$$

where $C''_\eta$ is a constant dependent on $\eta$, and independent of $d$, $r$, $N$, $M$ and $f$.

The proof of Theorem 3.1 is lengthy and technical. We provide detailed proof ideas with technical remarks for the Lemmas and the Proposition used in the proof of Theorem 3.1 in Appendix D. The detailed technical proofs of those Lemmas and the Proposition are provided in Appendix E. Here, for conciseness, we provide some important remarks on the Theorem and a simple proof sketch, which starts with a triangle inequality:

$$\|f - \tilde f\|_\infty \le \|f - L_N(f)\|_\infty + \|L_N(f) - L^{y}_{N,M}(f)\|_\infty + \|L^{y}_{N,M}(f) - \tilde f\|_\infty. \tag{8}$$

The three error terms in (7) correspond to the bounds on the three terms on the right-hand side of the inequality (8). We want to emphasize that the constant $C''_\eta > 0$ in (7) is independent of $d$. Furthermore, we track how the bound depends explicitly on $d$, allowing it to tend to infinity. For the first term, note that any $f \in W^r_\infty(S^{d-1})$ is approximated by a weighted sum of $\mathrm{Proj}_k(f)$ for $0 \le k \le 2N$, denoted as $L_N(f)$. The corresponding approximation error is small for large enough $N$ and $r$. Here, importantly, we set $N = \lceil d^{\alpha} \rceil$ for $0 < \alpha < 1$, so that the input dimension $d$ grows faster than $N$. For the second term, notice that the definition of $L_N(f)$ involves an integral over the sphere, and the key to approximating the function is to discretize this integral by $M$ random samples $y = \{y_1, \ldots, y_M\}$ independently drawn from $\rho_X$.
The discretized version of $L_N(f)$ is denoted as $L^{y}_{N,M}(f)$. As observed in the error bound, the higher the degree $N$ of $L_N(f)$, the more sampled points $M$ the approximation requires. However, the requirement is ameliorated as $r$ increases. A similar effect can be observed in the constant factor in $d$. For a fixed smoothness index $r$, the higher the data dimension $d$ is, the more sampled points $M$ are required for a good approximation, but the requirement is alleviated as the smoothness index $r$ increases. If $r$ increases up to order $d$, this factor decays exponentially fast as $d \to \infty$, eventually allowing $M \ge 1$ to be any integer. This phenomenon is further investigated in Corollary 3.3. The last term corresponds to the error of the neural network $\tilde f$ approximating $L^{y}_{N,M}(f)$. For any point $x \in S^{d-1}$, the evaluated function value $L^{y}_{N,M}(f)(x)$ is simply a weighted average of $\xi_{N,r}(\langle x, y_i \rangle)$ over the sampled $y = \{y_1, \ldots, y_M\}$. Here, $\xi_{N,r}(\langle x, y_i \rangle)$ is a linear combination of the $G^{\lambda_G}_k(\langle x, y_i \rangle)$ in (4) for $0 \le k \le 2N$. Thus, it is a sum of univariate polynomials of degree up to $2N$. We construct sub-networks approximating $\xi_{N,r}(\langle x, y_i \rangle)$ for each $i \in [M]$. This explains why the width of $\tilde f$ is proportional to $NM$. The corresponding error bound depends on $d^{2N}$, which comes from applying Stirling's formula to the coefficient factors in $G^{\lambda_G}_k(\langle x, y_i \rangle)$. The error $(\log_2(2N))^2 2^{-2m}$ comes from approximating $\langle x, y_i \rangle^k$ for $0 \le k \le 2N$ via neural networks. The larger $m$ is, the deeper the network becomes, as $L = O(m)$, and the smaller the error gets.
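To get a feel for the sizes stated in Theorem 3.1, the depth, width and parameter bound can be evaluated for concrete $(d, N, M, m)$, together with the logarithm of the sampling term in (7). This is a sketch under the reading of the theorem in which the width is $22NM$ and the sampling term is $(6/\pi e)^{d/4}\, d^{N + (3d-4r-2)/8}\, (2N+1)^{(3d-4r)/4} / \sqrt{M}$; the function names are hypothetical.

```python
import math

def theorem31_architecture(d, N, M, m):
    """Depth, hidden width and parameter bound as stated in Theorem 3.1."""
    depth = (m + 4) * math.ceil(math.log2(2 * N))
    width = 22 * N * M
    params_bound = M * (2 * d + 404 * N * (m + 3) + 2 * N + 4) + 1
    return depth, width, params_bound

def log_sampling_term(d, r, N, M):
    """Natural log of the second (sampling) term in the bound; logs avoid overflow."""
    return (
        (d / 4) * math.log(6 / (math.pi * math.e))
        + (N + (3 * d - 4 * r - 2) / 8) * math.log(d)
        + ((3 * d - 4 * r) / 4) * math.log(2 * N + 1)
        - 0.5 * math.log(M)
    )

arch = theorem31_architecture(d=100, N=3, M=3, m=10)
# Same d, N, M: a smoothness index of order d versus a constant one.
smooth = log_sampling_term(d=100, r=74.5, N=3, M=3)
rough = log_sampling_term(d=100, r=1.0, N=3, M=3)
```

Comparing `smooth` with `rough` shows how strongly a larger $r$ shrinks the sampling term for fixed $M$, which is the interaction between $r$ and $d$ described in the remarks above.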

3.1. COMPARISON WITH SCHMIDT-HIEBER (2020)

In this Subsection, we compare the result from Theorem 3.1 with the result from Schmidt-Hieber (2020), which considers the approximation of $f \in W^r_\infty([0, 1]^d)$ via deep ReLU FNNs. The Theorem is stated as follows:

Theorem 3.2 [Theorem 5 of Schmidt-Hieber (2020)] Let $f \in W^r_\infty([0, 1]^d)$ and let $K > 0$ be the radius of the Hölder ball. Then, for any integers $m \ge 1$ and $N^H \ge (r + 1)^d \vee (K + 1)e^d$, there exists a network

$$\tilde f^H \in \mathcal{F}^H\big(L, (d, 6(d + \lceil r \rceil)N^H, \ldots, 6(d + \lceil r \rceil)N^H, 1), \mathcal{N}^H\big) \tag{9}$$

with depth $L = 8 + (m + 5)(1 + \lceil \log_2(d \vee r) \rceil)$ and number of parameters $\mathcal{N}^H \le 141(1 + d + r)^{3+d} N^H (m + 6)$, such that

$$\|f - \tilde f^H\|_\infty \le (2K + 1)(1 + d^2 + r^2)\, 6^d N^H 2^{-m} + K 3^r (N^H)^{-\frac{r}{d}}. \tag{10}$$

To avoid confusion with the notation used in Theorem 3.1, we attach the superscript $H$ to the parameter that determines the width of the network (i.e., $N^H$), to the total number of parameters in the network (i.e., $\mathcal{N}^H$), and to the network class (i.e., $\mathcal{F}^H$). It is interesting to note that exponential growth of the network size in $d$ is observed in the construction of $\mathcal{F}^H$, whereas there is flexibility in $\mathcal{F}$, depending on the choice of $M$. Specifically, the width of the network in $\mathcal{F}^H$ depends exponentially on $d$, as $N^H = \Omega(r^d \vee e^d)$, whereas the width of the network in $\mathcal{F}$ depends on two parameters, $N = o(d)$ and any integer $M \ge 1$. For the total number of network parameters, we have $\mathcal{N}^H = O((d + r)^d)$, whereas $\mathcal{N} = O(Md + Nmd)$. Analogously, the bound on the approximation error of $\tilde f^H$ in (10) depends on $d$ exponentially, but this exponential dependence on $d$ can be avoided in the error bound of $\tilde f$ in (7) under two scenarios: (I) $r = O(d)$ and any integer $M \ge 1$, or (II) $r = O(1)$ and $M = O(d^d)$. In the Corollary presented in the next Subsection, we further specify the two scenarios and describe how the approximation error bound in each scenario converges to 0 in terms of $d$.

3.2. FAST APPROXIMATION ERROR IN TERMS OF d

Corollary 3.3 Let $0 < \alpha, \beta, \gamma < 1$ with $\gamma > \max\{\alpha, \beta\}$ and $N \in \mathbb{N}$ with $1 \le N \le d^{\alpha} + 1$. For any $f \in W^r_\infty(S^{d-1})$ with $r > 0$, we have:

(I) For $\frac{3d-2}{4} - C_1 \le r \le \frac{3d-2}{4}$ with some constant $C_1 \ge 0$ independent of $d$, there exists a network $\tilde f^{(I)} \in \mathcal{F}(L, (d, 66N, 66N, \ldots, 66N, 1), \mathcal{N})$ with depth $L = O(d^{\gamma} \log_2 d)$ and number of active parameters $\mathcal{N} = O(d^{\max\{\alpha + \gamma, 1\}})$, such that $\|f - \tilde f^{(I)}\|_\infty \le C'_{\eta,\alpha,\beta,\gamma} \|f\|_{W^r_\infty(S^{d-1})}\, d^{-d^{\beta}}$, where $C'_{\eta,\alpha,\beta,\gamma}$ is a constant depending only on $C_1$, $\eta$, $\alpha$, $\beta$ and $\gamma$.

(II) For $r = O(1)$ and $M = O(9^d d^{\frac{9}{4}d})$, there exists a network $\tilde f^{(II)} \in \mathcal{F}(L, (d, 22NM, \ldots, 22NM, 1), \mathcal{N})$ with depth $L = O(d^{\gamma} \log_2 d)$ and number of active parameters $\mathcal{N} = O(9^d d^{\frac{13}{4}d})$, such that $\|f - \tilde f^{(II)}\|_\infty \le C'_{\eta,\alpha,\beta,\gamma} \|f\|_{W^r_\infty(S^{d-1})}\, d^{-\alpha r}$, where $C'_{\eta,\alpha,\beta,\gamma}$ is a constant depending only on $\eta$, $\alpha$, $\beta$ and $\gamma$.

The detailed proof of Corollary 3.3 is deferred to Appendix E.6. The approximation error in scenario (I) decays at the rate $d^{-d^{\beta}}$ for $0 < \beta < 1$, while the required number of active parameters $\mathcal{N}$ is at most $O(d^2)$. Here, the construction of the network $\tilde f^{(I)}$ is independent of the choice of $M$, and we simply choose $M = 3$. In scenario (II), since $r = O(1)$ and $0 < \alpha < 1$, the approximation error decays to 0 at a $d^{-O(1)}$ rate, which can be slower than $d^{-d^{\beta}}$.
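The width comparison between the cube construction of Theorem 3.2 and the sphere construction in scenario (I) can be made concrete by evaluating both on a log scale. The sketch below assumes the hidden widths $6(d + \lceil r \rceil)N^H$ with $N^H$ at its stated minimum $(r+1)^d \vee (K+1)e^d$, and $22NM$ with $N = \lceil d^{\alpha} \rceil$ and $M = 3$; the function names and the default $K = 1$ are illustrative choices.

```python
import math

def log10_width_cube(d, r, K=1.0):
    """log10 of 6(d + ceil(r)) * N^H at the smallest admissible N^H."""
    log10_nh = max(
        d * math.log10(r + 1),
        math.log10(K + 1) + d * math.log10(math.e),
    )
    return math.log10(6 * (d + math.ceil(r))) + log10_nh

def log10_width_sphere(d, alpha, M=3):
    """log10 of 22 * N * M with N = ceil(d^alpha)."""
    N = math.ceil(d ** alpha)
    return math.log10(22 * N * M)

# Hypothetical setting: d = 50, r of order d, alpha = 1/2.
cube = log10_width_cube(d=50, r=37.5)
sphere = log10_width_sphere(d=50, alpha=0.5)
```

Working with base-10 logarithms keeps the cube width, which is far beyond floating-point range for moderate $d$, representable.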

4. STATISTICAL RISK BOUND

Let $\mathcal{X} := S^{d-1}$ and $\mathcal{Y} \subset \mathbb{R}$ be the measurable feature space and output space. We denote by $\rho$ a joint probability measure on the product space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, and let $\rho_X$ be the marginal distribution on the feature space $\mathcal{X}$. We assume that the noisy data set $D := \{(x_i, y_i)\}_{i=1}^{n}$ is generated from the non-parametric regression model

$$y_i = f_\rho(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

where the noise $\varepsilon_i$ is assumed to be a centered sub-Gaussian random variable with $\mathbb{E}(\varepsilon_i \mid x_i) = 0$. Our goal is to estimate the regression function $f_\rho(x)$ from the given noisy data set $D$. Specifically, it is assumed that the regression function belongs to the Sobolev space on the $d$-dimensional sphere; that is, $f_\rho \in W^r_\infty(S^{d-1})$. It is easy to see that the regression function $f_\rho := \mathbb{E}(y \mid x)$ is a minimizer of the population risk $\mathcal{E}(f)$ defined as

$$\mathcal{E}(f) = \mathbb{E}_{(x,y) \sim \rho}\big[(y - f(x))^2\big].$$

However, since the joint distribution $\rho$ is unknown, we cannot find $f_\rho$ directly. Instead, we solve the following empirical risk minimization problem induced by the dataset $D$:

$$\hat f_n = \arg\min_{f \in \mathcal{F}(L, p, \mathcal{N})} \mathcal{E}_D(f) := \arg\min_{f \in \mathcal{F}(L, p, \mathcal{N})} \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2. \tag{12}$$

Note that the function estimator is taken from the feedforward neural network hypothesis space $\mathcal{F}(L, p, \mathcal{N})$ defined in (2), and we denote the empirical minimizer of (12) as $\hat f_n$. It is assumed that $|y| \le B$ almost everywhere, and hence $|f_\rho(x)| \le B$. We project the output function $f : S^{d-1} \to \mathbb{R}$ onto the interval $[-B, B]$ by the projection operator

$$\pi_B f(x) = \begin{cases} f(x), & \text{if } -B \le f(x) \le B, \\ B, & \text{if } f(x) > B, \\ -B, & \text{if } f(x) < -B. \end{cases}$$

We consider the clipped estimator $\pi_B \hat f_n$ for recovering the regression function $f_\rho$. Note that the clipped estimator has been widely used in statistical learning papers (Suzuki, 2018; Fang & Cheng, 2023; Oono & Suzuki, 2019). The quality of $\pi_B \hat f_n$ is measured through the difference between two expected risks (i.e., the excess risk), defined as $\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho)$.
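The clipping operator and the empirical risk in (12) are simple to express in code. The following NumPy sketch uses hypothetical names (`clip_to_range`, `empirical_risk`) and a toy predictor in place of a trained network; it illustrates the definitions only and is not the estimation procedure analyzed in the paper.

```python
import numpy as np

def clip_to_range(values, B):
    """Projection onto [-B, B], applied elementwise to predicted values."""
    return np.clip(values, -B, B)

def empirical_risk(predict, xs, ys):
    """Mean squared error (1/n) * sum_i (y_i - f(x_i))^2 for a predictor f."""
    preds = predict(xs)
    return float(np.mean((ys - preds) ** 2))

# Toy data on S^1 and a toy predictor that sums the coordinates.
xs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ys = np.array([0.5, 1.0, -0.5])

def toy_predict(x):
    return x.sum(axis=1)

B = 0.8
risk_raw = empirical_risk(toy_predict, xs, ys)
risk_clipped = empirical_risk(lambda x: clip_to_range(toy_predict(x), B), xs, ys)
```

When $|y| \le B$ holds for every response, clipping can only move a prediction closer to it, so the clipped predictor never has a larger squared error on such data; in the toy data above the response $1.0$ exceeds $B = 0.8$, so no such ordering is claimed there.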

4.1. UPPER-BOUND ON EXCESS RISK

In this Subsection, we provide an upper bound on the excess risk of the clipped estimator $\pi_B \hat f_n$ in terms of the pseudo-dimension (i.e., $\mathrm{Pdim}(\mathcal{F})$) and the approximation error (i.e., $\|f - f_\rho\|_\infty$). Before presenting the bound, we give the definition of $\mathrm{Pdim}(\mathcal{F})$.

Definition 4.1 Denote by $\mathrm{Pdim}(\mathcal{F})$ the pseudo-dimension of $\mathcal{F}$, which is the largest integer $\ell$ for which there exists $(\xi_1, \ldots, \xi_\ell, \eta_1, \ldots, \eta_\ell) \in \mathcal{X}^{\ell} \times \mathbb{R}^{\ell}$ such that for any $(a_1, \ldots, a_\ell) \in \{0, 1\}^{\ell}$, there is some $f \in \mathcal{F}$ satisfying $\forall i : f(\xi_i) > \eta_i \iff a_i = 1$.

A more comprehensive treatment of $\mathrm{Pdim}(\mathcal{F})$ can be found in Anthony & Bartlett (1999) and Bartlett et al. (2019). We now provide the first result on the excess risk.

Proposition 4.2 Set $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, we have

$$\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho) \le C_{B,\delta,f} \cdot \Big( \frac{\mathrm{Pdim}(\mathcal{F}) \cdot \log(n)}{n} + \frac{\|f - f_\rho\|_\infty}{\sqrt{n}} + \|f - f_\rho\|^2_\infty \Big),$$

where $C_{B,\delta,f}$ is a constant dependent on $B$, $\delta$, $f$ and independent of $n$, $r$, $d$.

A detailed proof of Proposition 4.2 is deferred to the Appendix. The excess risk $\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho)$ is a random quantity through the estimator $\hat f_n$, and the statement in the Proposition holds with probability at least $1 - \delta$. The failure probability $\delta \in (0, 1)$ enters the constant $C_{B,\delta,f}$ logarithmically, i.e., through $\log(\frac{1}{\delta})$. In the bound, it should be noted that there is a trade-off between the "approximation error" term (i.e., $\|f - f_\rho\|_\infty$) and the combinatorial "complexity measure" term of a neural network class $\mathcal{F}$ (i.e., $\mathrm{Pdim}(\mathcal{F}) \cdot \log(n)/n$); that is, the richer the network hypothesis space $\mathcal{F}$ becomes, the finer the approximation result we get. Nonetheless, an arbitrary increase in the hypothesis space $\mathcal{F}$ eventually leads to an increase of the bound on the excess risk. In the following Subsection, we show how the specifications of the network architecture (i.e., the choices of $(L, p, \mathcal{N})$) affect the tension between these two terms.

4.2. CONVERGENCE RATE OF EXCESS RISK

Now we are ready to formally state bounds on the excess risks of $\pi_B \hat f_n$ when $f_\rho \in W^r_\infty(S^{d-1})$ (i.e., Theorem 4.3) and $f_\rho \in W^r_\infty([0, 1]^d)$ (i.e., Theorem 4.4), respectively.

Theorem 4.3 Suppose $f_\rho \in W^r_\infty(S^{d-1})$ with $r > 0$. A network $\hat f_n$ from (6) with the choices $N = \lceil n^{\frac{2}{3d+4r}} \rceil$, $M = \lceil n^{\frac{3d}{3d+4r}} \rceil$, and $m = \lceil \frac{r}{3d+4r} \log_2(n) \rceil$ yields the following bound on the excess risk with probability at least $1 - \delta$:

$$\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho) \le C_{B,\eta,\delta,f} \cdot \max\Big\{ 1,\ \frac{6rd}{(3d + 4r)^2} (\log_2(n))^4,\ \Big(\frac{6}{\pi e}\Big)^{\frac{d}{2}} d^{2N + \frac{3d - 4r - 2}{4}},\ d^{4N} \Big\} \cdot n^{-\frac{2r}{2r + 1.5d}}, \tag{15}$$

where $C_{B,\eta,\delta,f}$ depends on $B$, $\eta$, $\delta$, $f$ and is independent of $d$, $r$ and $n$.

Theorem 4.4 Suppose $f_\rho \in W^r_\infty([0, 1]^d)$ with $r > 0$. A network $\hat f_n$ from (9) with the choices $N^H = \lceil n^{\frac{d}{2d+r}} \rceil$ and $m^H = \lceil \frac{d+r}{d+2r} \log_2(n) \rceil$ yields the following bound on the excess risk with probability at least $1 - \delta$:

$$\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho) \le C_{B,\eta,\delta,K} \cdot \max\Big\{ \big\lceil \log_2\big((d + \lceil r \rceil) n^2\big) \big\rceil^2 (d + r)^d \cdot (\log_2(n))^3,\ (1 + r^2 + d^2)^2\, 6^{2d} + 3^{2r} \Big\} \cdot n^{-\frac{2r}{2r + d}}, \tag{16}$$

where $C_{B,\eta,\delta,K}$ depends on $B$, $\eta$, $\delta$, $K$ and is independent of $d$, $r$ and $n$.

In (15), the rate $O_d(n^{-\frac{2r}{2r + 1.5d}})$ is sub-optimal in a minimax sense for estimating functions $f_\rho \in W^r_\infty(S^{d-1})$, where $O_d$ hides the constant factor in $d$. The extra $0.5d$ factor in the denominator of the exponent comes from the Sobolev embedding Lemma (Lemma D.1.3) and the discretization Lemma (Lemma D.1.4). For the constant factor in $d$, when $r = O(1)$, an exponential dependence on $d$ can be observed. However, when $r = O(d)$, the excess risk bound in (15) reduces to

$$\mathcal{E}(\pi_B \hat f_n) - \mathcal{E}(f_\rho) \le C_{B,\eta,\delta,f} \cdot \max\big\{ (\log_2(n))^4,\ d^{4N} \big\} \cdot n^{-\frac{2r}{2r + 1.5d}}.$$

With the choice $N = \lceil n^{\frac{2}{3d+4r}} \rceil$, as $d, r \to \infty$, the constant $d^{4N}$ becomes $d^{O(1)}$. In contrast, in (16) for estimating functions $f_\rho \in W^r_\infty([0, 1]^d)$, the rate $n^{-\frac{2r}{2r+d}}$ is minimax optimal, but we cannot observe the interactions between $r$ and $d$ that we observe in (15).

Remark 4.5 From the technical point of view, the result in Theorem 4.3 should be compared with the results in the existing literature, i.e., Schmidt-Hieber (2020), Chen et al. (2022) and Suzuki (2018), in the sense that our result does not require boundedness of the weight parameters in the network construction. A detailed reading of their proofs reveals that they require a bound on the uniform covering number of $\mathcal{F}$, which can be bounded through the Lipschitzness of the network output with respect to the weight parameters. Naturally, the boundedness assumption is then required for discretizing the parameter space. In contrast, in our result, owing to Bartlett et al. (2019) (see Lemma H.1), bounding the complexity measure $\mathrm{Pdim}(\mathcal{F})$ does not require the parameter boundedness assumption.
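The parameter choices of Theorem 4.3 and the two rate exponents can be tabulated for given $(n, d, r)$. The sketch below assumes the reading $N = \lceil n^{2/(3d+4r)} \rceil$, $M = \lceil n^{3d/(3d+4r)} \rceil$, $m = \lceil \frac{r}{3d+4r}\log_2 n \rceil$, and uses exact fractions to confirm that $\frac{2r}{2r+1.5d}$ and $\frac{4r}{3d+4r}$ are the same exponent; function names are illustrative and $d$, $r$ are taken to be integers for simplicity.

```python
import math
from fractions import Fraction

def sphere_choices(n, d, r):
    """Degree N, sample count M and depth parameter m as chosen in Theorem 4.3."""
    s = 3 * d + 4 * r
    N = math.ceil(n ** (2 / s))
    M = math.ceil(n ** (3 * d / s))
    m = math.ceil(r / s * math.log2(n))
    return N, M, m

def rate_exponents(d, r):
    """Exponents a such that the excess risk scales as n^{-a} (sphere, cube)."""
    sphere = Fraction(2 * r) / (2 * r + Fraction(3, 2) * d)
    cube = Fraction(2 * r, 2 * r + d)
    return sphere, cube

N, M, m = sphere_choices(n=10**6, d=4, r=3)
sphere_rate, cube_rate = rate_exponents(d=4, r=3)
```

The sphere exponent is always the smaller of the two, which is the sub-optimality in $n$ discussed above; the gain on the sphere is in the $d$-dependent prefactor rather than in the exponent.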

5. AN OPEN QUESTION

In this paper, we prove that when $r = O(d)$, deep ReLU FNNs require at most $\mathcal{N} = O(d^2)$ parameters to get a sharp approximation rate. However, this condition seems restrictive, and it needs further investigation whether it is a necessary and sufficient condition to avoid the curse of dimensionality for approximating $f \in W^r_\infty(S^{d-1})$. To answer this question, it is essential to study a lower bound on $\mathcal{N}$ under an approximation error similar to the one stated in Theorem 3.1, and to see whether its order in $d$ matches the upper bound we obtain. We conjecture that obtaining this result is possible by combining the ideas of using the VC-dimension of deep ReLU FNNs (Bartlett et al., 2019; Yarotsky, 2017) and of constructing a packing set on the sphere through spherical caps (Hesse, 2006), while carefully tracking the $d$-dependency in the constant factor. We leave this for future research.

A. $d^d$-DEPENDENT CONSTANT IN $\mathcal{N}$ FOR APPROXIMATING $f \in W^r_\infty([0, 1]^d)$

First, we define the function space $W^r_\infty([0, 1]^d)$ on the $d$-dimensional unit cube. For $r = n + \sigma$, where $n \in \mathbb{N}_0$ and $\sigma \in (0, 1]$, a function has Hölder smoothness index $r$ if all partial derivatives up to order $n$ exist and are bounded, and the partial derivatives of order $n$ are $\sigma$-Hölder. Formally, the ball of $r$-Hölder functions with radius $Q$ is then defined as

$$W^r_\infty([0, 1]^d) = \Big\{ f : [0, 1]^d \to \mathbb{R} : \sum_{\alpha : |\alpha| \le n} \|\partial^{\alpha} f\|_\infty + \sum_{\alpha : |\alpha| = n} \sup_{\substack{x, y \in [0,1]^d \\ x \ne y}} \frac{|\partial^{\alpha} f(x) - \partial^{\alpha} f(y)|}{|x - y|^{\sigma}_\infty} \le Q \Big\},$$

where $\partial^{\alpha} f := \frac{\partial^{|\alpha|}}{\partial^{\alpha_1} \cdots \partial^{\alpha_d}} f$ in multi-index notation, with $\alpha := (\alpha_1, \ldots, \alpha_d)$. The fundamental ideas for approximating functions $f \in W^r_\infty([0, 1]^d)$ in the existing literature rely on a local Taylor approximation technique. The technique discretizes the $d$-dimensional input cube into a set of sub-cubes of size $(K + 1)^d$, where $(K + 1)$ is the grid size along each coordinate. For any $x$ in the input cube, the function $f$ is approximated by using the closest $2^d$ grid points to $x$ via a Taylor expansion of $f$ up to degree $\lfloor r \rfloor$, where $\lfloor u \rfloor$ denotes the largest integer less than or equal to $u > 0$. Therefore, the total number of active parameters of the network is at least the total number of coefficients of the partial derivatives $\partial^{\alpha} f$ for $|\alpha| = |\alpha_1| + \cdots + |\alpha_d| \le \lfloor r \rfloor$. This yields the following lower bound on the number of active parameters for a network built via local Taylor approximation:

$$(K + 1)^d \cdot \sum_{i=0}^{\lfloor r \rfloor} \binom{d + i - 1}{d - 1}.$$

B. SUBOPTIMAL CONVERGENCE RATE OF EXCESS RISK FOR $f \in W^r_\infty(S^{d-1})$

Since $S^{d-1} \subset C^d$, in light of Whitney's extension theorem, the convergence rate of the excess risk for $f \in W^r_\infty(S^{d-1})$ should achieve the same minimax optimal rate $n^{-\frac{2r}{2r+d}}$ as for estimating $f \in W^r_\infty(C^d)$. However, from this perspective, it is not possible to track the explicit dependence on $d$ in the prefactor of the rate. Starting from Proposition 4.2, we track this dependence in combination with our own approximation result. First of all, in order to achieve the minimax learning rate in the sample size $n$, that is, $n^{-\frac{2r}{2r+d}}$, we need to achieve the optimal approximation rate, known as $O(\mathcal{N}^{-\frac{r}{d}})$, where $\mathcal{N}$ denotes the number of active parameters in the network. This can be easily checked in the result of Schmidt-Hieber (2020). But this cannot be achieved with our approximation result, Theorem 3.1. The main reason is the use of the Sobolev embedding theorem (Proposition D.1.3) and of the concentration inequality from Smale & Zhou (2007) (Lemma E.2.1) for bounding the term $\|L_N(f) - L^{y}_{N,M}(f)\|_\infty$. Proposition D.1.3 specifies $\|L_N(f) - L^{y}_{N,M}(f)\|_\infty \le C_d \|L_N(f) - L^{y}_{N,M}(f)\|_{W^s_2(S^{d-1})}$ for some $s > (3d - 2)/4$, and Lemma D.1.4 specifies the relation $\|L_N(f) - L^{y}_{N,M}(f)\|_{W^s_2(S^{d-1})} \le C'_{d,r} \|f\|_{W^r_\infty(S^{d-1})} \frac{(2N + 1)^{\frac{3d - 4r}{4}}}{\sqrt{M}}$.
In this process, some extra factors are multiplied in, which makes the rate sub-optimal; this is reflected in the term involving $(2N+1)$.

$f \in W^r_\infty(S^{d-1})$

We consider the case where the approximator is a deep ReLU CNN followed by downsampling operations and very few fully connected layers. This is exactly the architecture considered in Fang et al. (2020), and by applying our result (Lemma D.1.4), we get the following theorem.

Theorem C.1. Let $2 \le S \le d$, $0 < \alpha < 1$, and $B, N, M \in \mathbb{N}$ with $1 \le N \le d^\alpha + 1$. Let $J \ge \lceil \frac{Md-1}{S-1} \rceil$, $D_1 = (2B+3)\lfloor (d+JS)/d \rfloor$, and $D_2 = \lfloor (d+JS)/d \rfloor$. Then, for any function $f \in W^r_\infty(S^{d-1})$ with $r > 0$, there exists a network $f_{\mathrm{CNN}} \in \mathcal{H}_{J, D_1, D_2, S}$ with the number of parameters $\mathcal{N} \le J(3S+2) + M + 2B + 4$ such that

$$\|f - f_{\mathrm{CNN}}\|_\infty \le C''_{\eta,\alpha} \, \|f\|_{W^r_\infty(S^{d-1})} \max\Big\{ N^{-r}, \; \Big(\frac{6}{\pi e}\Big)^{\frac{d}{4}} \frac{d^{\,N + \frac{3d-4r-2}{8}} (2N+1)^{\frac{3d-4r}{4}}}{\sqrt{M}}, \; d^{2N} \, \frac{r}{r-1} \, \frac{N^2}{B} \Big\},$$

where $C''_{\eta,\alpha}$ is a constant depending on $\eta$ and $\alpha$, and independent of $d$, $r$, $N$, $M$, and $f$.

The only part that needs attention is bounding the term $\|L^{y}_{N,M}(f) - f_{\mathrm{CNN}}\|_\infty$ while tracking the explicit dependence on $d$ in the bound; its proof is deferred to Appendix G.

We consider the case $r = O(d)$ and any integer $M \ge 1$. In this case, our result for deep ReLU FNNs shows that the approximation rate $O(d^{-d^\beta})$ can be achieved with at most $O(d^2)$ active parameters. In light of the result from Zhou (2020), we should expect the same for CNNs with downsampling operations. To obtain an approximation rate $O(d^{-d^\beta})$ for some $0 < \beta < 1$, the first two terms in (1.1) are controlled exactly as in our proof in Appendix E.6, so only the last term needs attention. Since $1 \le N \le d^\alpha + 1$ for some $0 < \alpha < 1$, we have

$$\frac{r}{r-1} \cdot d^{2N} \, \frac{N^2}{B} \le 8 \cdot d^{\,2d^\alpha + 2} \, \frac{d^{2\alpha}}{B} \le 8 \cdot \frac{d^{\,2d^\alpha + 4}}{B} \le C \cdot d^{-d^\beta},$$

for some constant $C > 0$ independent of $d$ and $r$. The rate $d^{-d^\beta}$ is obtainable only when $B = O(d^d)$. Then, the number of parameters $\mathcal{N} \le J(3S+2) + M + 2B + 4$ is bounded by $O(d^d)$. This is an unsatisfactory result. However, we firmly believe it can be improved, and we leave this as an open question for future work.

Table 2: A summary of the $r$ and $d$ dependences in the prefactors of the constructed deep ReLU networks' depths, widths, and approximation rates for approximating functions $f \in W^r_\infty(C^d)$ in Schmidt-Hieber (2020); Lu et al. (2021); Jiao et al. (2021).

              | Schmidt-Hieber (Thm 5)                     | Lu et al. (Thm 1.1)      | Jiao et al. (Corol. 3.1)
Depth         | $O(\lceil \log_2(r \vee d) \rceil)$        | $O(r^2 + d)$             | $O((\lfloor r \rfloor + 1)^2)$
Width         | $O((d + \lceil r \rceil)(r+1)^d)$          | $O(r^{d+1} d \, 3^d)$    | $O((\lfloor r \rfloor + 1)^2 \, 3^d \, d^{\lfloor r \rfloor + 1})$
Approx. Rate  | $O(6^d (r+1)^d)$                           | $O(8^r (r+1)^d)$         | $O((\lfloor r \rfloor + 1)^2 \, d^{\lfloor r \rfloor + (r \vee 1)/2})$

D ROADMAPS FOR PROOF OF THEOREM 3.1

In this section of the Appendix, we provide the definitions of $L_N(f)$ and $L^{y}_{N,M}(f)$ along with the overall picture of the proof of Theorem 3.1. Recall that we have the following decomposition:

$$\|f - \tilde{f}\|_\infty \le \underbrace{\|f - L_N(f)\|_\infty}_{:=(\mathrm{I})} + \underbrace{\|L_N(f) - L^{y}_{N,M}(f)\|_\infty}_{:=(\mathrm{II})} + \underbrace{\|L^{y}_{N,M}(f) - \tilde{f}\|_\infty}_{:=(\mathrm{III})}.$$

In Subsection D.1, we provide the idea for bounding (I) and (II). In Subsection D.2, the construction of the neural network $\tilde{f}$ for approximating $L^{y}_{N,M}(f)$ is described. This section contains no proofs of the Propositions and Lemmas; only the key ideas of the proofs and technical comparisons with other literature are provided. All detailed proofs of the technical statements in this section are deferred to Appendix C.1.

D.1 ERROR BOUNDS FOR (I) AND (II)

A function $f \in W^r_\infty(S^{d-1})$ is approximated by a linear scheme $L_N$ defined as follows.

Definition D.1.1 Given a $C^\infty([0,\infty))$ function $\eta$ with $\eta(t) = 1$ for $0 \le t \le 1$ and $\eta(t) = 0$ for $t \ge 2$, we define a sequence of linear operators $L_N$, $N \in \mathbb{N}$, on $L_p(S^{d-1})$ with $1 \le p \le \infty$ by

$$L_N(f)(x) := \sum_{k=0}^{2N} \eta\Big(\frac{k}{N}\Big) \mathrm{Proj}_k(f)(x) = \int_{S^{d-1}} f(y) \, \ell_{N,d}(\langle x, y \rangle) \, \rho_X(dy), \quad x \in S^{d-1},$$

where, with $\lambda_G = \frac{d-2}{2}$, $\ell_{N,d}$ is the kernel given by

$$\ell_{N,d}(t) := \sum_{k=0}^{2N} \eta\Big(\frac{k}{N}\Big) \frac{k + \lambda_G}{\lambda_G} \, G^{\lambda_G}_k(t), \quad t \in [-1, 1].$$

It is shown in Dai & Xu (2013) (Chapter 4) that $L_N$ is near best, achieving the order of best approximation for $f \in W^r_p(S^{d-1})$.

Lemma D.1.2 (Lemma 1 in Fang et al. (2020)) For $N \in \mathbb{N}$, $1 \le p \le \infty$, $r > 0$, and $f \in W^r_p(S^{d-1})$, there holds

$$\|f - L_N(f)\|_p \le C_\eta \, N^{-r} \cdot \|f\|_{W^r_\infty(S^{d-1})},$$

where $C_\eta$ is a constant depending only on the function $\eta$ used in defining $L_N$.

Note that $(-\Delta_{S^{d-1}} + I)^{r/2}$ is a self-adjoint operator. For $x \in S^{d-1}$, recalling the definition of $L_N(f)$, we have

$$L_N(f)(x) = \big\langle f, \ell_{N,d}(\langle x, \cdot \rangle) \big\rangle_{L_2(S^{d-1})} = \big\langle (-\Delta_{S^{d-1}} + I)^{r/2} f, \; (-\Delta_{S^{d-1}} + I)^{-r/2} \ell_{N,d}(\langle x, \cdot \rangle) \big\rangle_{L_2(S^{d-1})} = \int_{S^{d-1}} F_r(y) \cdot \xi_{N,r}(\langle x, y \rangle) \, \rho_X(dy).$$

Hereafter, we denote $F_r = (-\Delta_{S^{d-1}} + I)^{r/2} f$ and $\xi_{N,r}(\langle x, \cdot \rangle) = (-\Delta_{S^{d-1}} + I)^{-r/2} \ell_{N,d}(\langle x, \cdot \rangle)$. By the fractional power of the operator $(-\Delta_{S^{d-1}} + I)^{-r/2}$ in the distributional sense, $\xi_{N,r}(\cdot)$ is a polynomial of degree at most $2N$, written as

$$\xi_{N,r}(t) = \sum_{k=0}^{2N} (1 + \lambda_k)^{-r/2} \, \eta\Big(\frac{k}{N}\Big) \frac{k + \lambda_G}{\lambda_G} \, G^{\lambda_G}_k(t), \quad t \in [-1, 1].$$

The fractional power of $(-\Delta_{S^{d-1}} + I)$ induced by the regularity $f \in W^r_\infty(S^{d-1})$ enables an $r$-dependent error bound for discretizing $L_N(f)$: the larger the regularity $r$, the smaller the bound on the approximation error. Following Fang et al. (2020), the key idea for constructing a neural network that approximates $L_N(f)$ is to discretize the integral form (22) by $M$ random samples $y = \{y_1, \ldots, y_M\}$ drawn independently from $\rho_X$.
We write the discretized version of (22) as

$$L^{y}_{N,M}(f)(x) = \frac{1}{M} \sum_{i=1}^{M} F_r(y_i) \cdot \xi_{N,r}(\langle x, y_i \rangle), \quad \forall x \in S^{d-1}.$$

Before estimating the distance between $L_N(f)$ and $L^{y}_{N,M}(f)$, we need a Sobolev embedding property.

Proposition D.1.3 For $d \ge 5$, $1 \le p \le \infty$, and $s \ge \frac{3d-2}{4}$, the Sobolev space $W^s_p(S^{d-1})$ is continuously embedded into $C(S^{d-1})$, the space of continuous functions on $S^{d-1}$, which implies

$$\|f\|_\infty \le c_0 \Big(\frac{6}{\pi e}\Big)^{\frac{d}{4}} \cdot \|f\|_{W^s_p(S^{d-1})}, \quad f \in W^s_p(S^{d-1}),$$

where $c_0$ is an absolute constant independent of $r$, $d$, $s$, and $f$.

Proposition D.1.3 is motivated by Eq. (14) in Hesse (2006), which proves $\|f\|_\infty \le C_{s,d} \|f\|_{W^s_p(S^{d-1})}$, $f \in W^s_p(S^{d-1})$, for $s > \frac{d-1}{2}$. The constant obtained in Hesse (2006) is

$$C_{s,d} := \Big( \frac{1}{\omega_d} \sum_{k=0}^{\infty} \frac{N(k,d)}{(k + \frac{d-2}{2})^{2s}} \Big)^{1/2},$$

where $\omega_d$ is the surface area of the sphere $S^{d-1}$. For large enough $d$, $(1/\omega_d)^{1/2}$ grows in the order of $O\big( (\frac{d}{2\pi e})^{d/4} \big)$. Then, by choosing $s \ge \frac{3d-2}{4}$, the factor $(1/\omega_d)^{1/2}$ can be absorbed into the infinite sum, so that the constant $C_{s,d}$ converges in the asymptotic regime of $d$. It should be noted that the threshold on the smoothness index (i.e., $s \ge \frac{3d-2}{4}$) is larger than that of Hesse (2006) (i.e., $s > \frac{d-1}{2}$), where $d$ is considered fixed. See Appendix E.1 for the proof.

Next, we state the discretization lemma, which provides a probabilistic bound on the difference $L_N(f) - L^{y}_{N,M}(f)$.

Lemma D.1.4 Let $r \le \frac{3d-2}{4}$ and $0 < \alpha < 1$. If $f \in W^r_\infty(S^{d-1})$, then for any $M \in \mathbb{N}$ and $1 \le N \le d^\alpha + 1$, there exists $y = \{y_1, y_2, \ldots, y_M\} \subset S^{d-1}$ such that

$$\|L_N(f) - L^{y}_{N,M}(f)\|_\infty \le 6 \cdot C'' \Big(\frac{6}{\pi e}\Big)^{\frac{d}{4}} \|f\|_{W^r_\infty(S^{d-1})} \, \frac{d^{\,N + \frac{3d-4r-2}{8}} (2N+1)^{\frac{3d-4r}{4}}}{\sqrt{M}},$$

where $C'' > 0$ is a constant depending on $\alpha$ but independent of $r$, $f$, $N$, $M$, and $d$.

Lemma D.1.4 is motivated by Lemma 2 in Fang et al. (2020).
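The discretization step can be illustrated numerically. The following is a minimal sketch, not the construction used in the proof: it assumes that $\rho_X$ is the uniform probability measure on $S^{d-1}$, and it replaces $F_r$ and $\xi_{N,r}$ by the toy choices $F \equiv 1$ and $\xi(t) = t^2$, for which the exact integral is $1/d$. The helper names `sample_sphere` and `mc_average` are hypothetical.

```python
import math
import random

def sample_sphere(d, rng):
    """Draw one point uniformly from S^{d-1} by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def mc_average(x, kernel, weight, M, rng):
    """Empirical average (1/M) * sum_i weight(y_i) * kernel(<x, y_i>),
    with y_1, ..., y_M drawn i.i.d. from the uniform measure on the sphere."""
    d = len(x)
    total = 0.0
    for _ in range(M):
        y = sample_sphere(d, rng)
        inner = sum(a * b for a, b in zip(x, y))
        total += weight(y) * kernel(inner)
    return total / M

# Toy stand-ins (assumptions, not the paper's F_r and xi_{N,r}).
d = 5
x = [1.0] + [0.0] * (d - 1)          # a fixed evaluation point on the sphere
rng = random.Random(0)               # fixed seed for reproducibility
exact = 1.0 / d                      # E[<x, y>^2] = 1/d for uniform y
estimates = {M: mc_average(x, lambda t: t * t, lambda y: 1.0, M, rng)
             for M in (100, 20000)}
```

The deviation of `estimates[M]` from `exact` shrinks at the usual $1/\sqrt{M}$ Monte Carlo rate, which is the source of the $\sqrt{M}$ in the denominator of Lemma D.1.4.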
The main framework of the proof is based on the Sobolev embedding property in Proposition D.1.3 and the concentration inequality for random variables with values in a Hilbert space, which can be found in Smale & Zhou (2007) . For the application of the concentration inequality, the random variable ξ(y i ) := F r (y i )ξ N,r (⟨x, y i ⟩) in ( 24) needs to be bounded in ∥ • ∥ W s 2 (S d-1 ) norm for s ≥ 3d-2 4 . See Appendix E.2 for the proof of the Lemma. When compared with the technical proof of Lemma 2 from Fang et al. (2020) , the most notable difference comes from tracking the explicit dependency on d in the constant factor. Specifically, under the fixed d setting, Fang et al. (2020) did not explicitly express how the constant c s,r,d (see the statement in their Lemma) depends on d. However, in our paper, since the main focus is how the approximation error behaves under d → ∞, we need to keep tracking how d explicitly affects the bound. The result of Proposition 4.1 in our paper serves an important role in this tracking. Note that the constant c 0 is independent of s, d, r, f in the bound of Proposition 4.1, and we obtain the bound decays at the rate 6 πe d/4 . However, in Fang et al. (2020) , they utilized the result from Hesse (2006)  ; that is, ∥f ∥ ∞ ≤ C s,d ∥f ∥ W s p (S d-1 ) , f ∈ W s p (S d-1 ) for s ≥ d-1 2 . Here, note that the constant C s,d is a function of d, and since they work under the fixed d setting, they did not pay much attention to the dependency. Of course, since we are in an asymptotic setting, we use the Stirling's formula to see behaviors of N (k, d) as d → ∞, whereas Fang et al. (2020) just used a simple calculation N (k, d) ≤ c ′ d k d-2 , for some c ′ d dependent on d. layer for j ∈ {1, . . . , p -1} as Poly 1 m (x), . . . , Poly 2 j-1 m (x) ((m+1)+(m+4)•(j-1)) th layer → Poly 1 m (x), . . . , Poly 2 j-1 m (x), Mult m (x, Poly 2 j-1 m (x)), . . . , fm Poly 2 j-1 m (x) ((m+1)+(m+4)•j) th layer . 
( ) The first 2 j-1 input values in ((m + 1) + (m + 4) • j) th hidden layer is exactly copied from input values at the ((m + 1) + (m + 4) • (j -1)) th hidden layer. The remaining 2 j-1 input values in ((m + 1) + (m + 4) • j) th hidden layer approximates monomials {x 2 j-1 +1 , . . . , x  L y N,M (f ). Proposition D.2.4 Let 0 < α < 1, m, N, M ∈ N with 1 ≤ N ≤ d α + 1. For any function f ∈ W r ∞ (S d-1 ) with r > 0, define L y N,M (f ) in (24). Then, there exists a network f ∈ F L, d, 22N M, . . . , 22N M, 1 , N with depth L = (m + 4)⌈log 2 (2N )⌉ and number of parameters N ≤ M (2d + 404N • (m + 3) + 2N + 4) + 1 such that L y N,M (f ) -f ∞ ≤ C ′ η • ∥f ∥ W r ∞ (S d-1 ) d 2N log 2 (2N ) 2 2 -2m , where C ′ η is a positive constant depending on η and α, but not on d, r, m, N, M or f . A detailed proof for Proposition D.2.4 is deferred in the Appendix E.5. Given the input data x ∈ S d-1 , recall the definition of L y N,M (f )(x) in ( 24). The crux of the whole construction procedure of our network is to build the sub-network which approximates ξ N,r (⟨x, y i ⟩) for each i ∈ [M ]. The key observation is that ξ N,r (⟨x, y i ⟩) is the weighted sum of univariate polynomials of degree up to 2N . Let u i = ⟨x, y i ⟩. With the properly defined constant α i,q (see its definition in the Appendix E.5), ξ N,r (⟨x, y i ⟩) can be re-written as ξ N,r (u i ) : = 2N q=0 α i,q |u i | q . Since |u i | ∈ [0, 1], with the help of network constructed in Lemma D.2.3 with P = ⌈log 2 (2N )⌉, the sub-network that approximates ξ N,r (u i ) is easily constructed. Recall this is enabled through the reproducing property of the kernel of H ) is continuously embedded into C(S d-1 ), the space of continuous functions on S d-1 , which implies ∥f ∥ ∞ ≤ c 0 6 πe d 4 • ∥f ∥ W s p (S d-1 ) , f ∈ W s p (S d-1 ), where c 0 is an absolute constant independent of r, d, s, and f . Proof. For f ∈ W s p (S d-1 ), by Sobolev embedding Lemma (see Hesse (2006) Eq. 14, p. 
420), the infinity norm can be bounded by the Sobolev norm as ∥f ∥ ∞ ≤ C s,d • ∥f ∥ W s 2 (S d-1 ) , where the constant C s,d is defined with its square as C 2 s,d := 1 ω d ∞ k=0 N (k, d) (k + d-2 2 ) 2s (28) with ω d = 2π d 2 /Γ d 2 . Recalling (3), it is easy to see that by Stirling's formula, for large d, N (k, d) = (k + d-2 2 ) d-2 1 + O 1 d . Also, we have Γ d 2 = 2 d Γ d 2 + 1 = 2 π d d 2e d 2 1 + O 1 d . When s > d-1 2 , we have ∞ k=0 k + d -2 2 d-2-2s ≤ ∞ d-2 2 -1 t d-2-2s dt = 1 2s + 1 -d d -2 2 -1 d-1-2s . Observe that d ≥ 5, we have d-2 2 -1 ≥ d 12 . Thus, when s ≥ 3d-2 4 , we have 2s + 1 -d ≥ d/2 and thereby ( 28) is bounded as C 2 s,d ≤ π d d 2πe d 2 2 d d 12 -d 2 1 + O 1 d = 2 √ π d √ d 6 πe d 2 1 + O 1 d . Then, there exists an absolute constant c 0 such that C 2 s,d ≤ c 2 0 6 πe d 2 , ∀d ≥ 5. This yields the claim. E.2 PROOF OF LEMMA D.1.4 Lemma D.1.4 Let 0 < r ≤ 3d-2 4 and 0 < α < 1. If f ∈ W r ∞ (S d-1 ), then for any M ∈ N and 1 ≤ N ≤ d α + 1, there exist y = {y 1 , y 2 , . . . , y M } ⊂ S d-1 such that L N (f ) -L y N,M (f ) ∞ ≤ 6 • C ′′ 6 πe d 4 ∥f ∥ W r ∞ (S d-1 ) d N + 3d-4r-2 8 (2N + 1) 3d-4r 4 √ M , where C ′′ > 0 is a constant depending on α but independent of r, f , N , M , and d. Proof. We recall the following probability inequality for random variables with values in a Hilbert space which can be found in Smale & Zhou (2007) . Lemma E.2.1 Let (H, ∥ • ∥) be a Hilbert space and ξ be a random variable on (Y, ρ X ) with values in H. Assume ∥ξ∥ ≤ M < ∞ almost surely. Denote σ 2 (ξ) = E(∥ξ∥ 2 ). Let {y i } M i=1 be independent samples from ρ X . Then for any 0 < δ < 1, we have with probability at least 1 -δ, 1 M M i=1 ξ(y i ) -E(ξ) H ≤ 2M log 2 δ M + 2σ 2 (ξ) log 2 δ M . ( ) Let us define the random variable ξ on (S d-1 , ρ X ) with values in H given by ξ(y) = F r (y) 2N k=0 (1 + λ k ) -r/2 η k N Z k (y, •), y ∈ S d-1 . 
To bound the norm ∥ξ∥ = ∥ξ(y)∥ 2 W s 2 , we set s = 3d-2 4 and recall the norm of W s 2 (S d-1 ) given with p = 2 and for y ∈ S d-1 , ∥ξ(y)∥ W s 2 (S d-1 ) = F r (y) 2N k=0 (1 + λ k ) s-r 2 η k N Z k (y, •) L2(S d-1 ) . ( ) Recall λ k = k(k + d -2). Then, for 0 ≤ k ≤ 2N , d ≥ 3, we have k 2 < 1 + λ k ≤ dk 2 . We find (1 + λ k ) s-r ≤ d s-r k 2(s-r) by s = 3d-2 4 ≥ r (∵ s -r ≥ 0). Also note that 0 ≤ η(t) ≤ 1 for t ∈ [0, 2]. Employing Stirling's formula d! = √ 2πd d e d 1 + O(1/d) in the expression (3) for N (k, d) yields N (k, d) ≤ Cd k for 0 ≤ k ≤ 2N and some constant C depending on α but independent of d. By using the identity Z k (y, y) = N (k, d) (see Corollary 1.2.7. in Dai & Xu (2013)), ∥ξ∥ 2 W s 2 (S d-1 ) can be bounded as F r (y) 2 • 2N k=0 1 + λ k s-r η 2 k N N (k, d) = F r (y) 2 • 1 + 2N k=1 1 + λ k s-r η 2 k N N (k, d) ≤ F r (y) 2 • 1 + C • d s-r • 2N k=1 k 2(s-r) d k ≤ F r (y) 2 • 1 + C • d 2N +s-r • 2N k=1 k 2(s-r) , while the term 2N k=1 k 2(s-r) with s -r ≥ 0 can be bounded as 2N k=1 k 2(s-r) ≤ 2N +1 1 x 2(s-r) dx ≤ 1 2(s -r) + 1 (2N + 1) 2(s-r)+1 . Combining this with the definitions of the norm ∥f ∥ W r ∞ (S d-1 ) , we know that ∥ξ(y)∥ 2 W s 2 can be bounded as ∥ξ(y)∥ 2 W s 2 ≤ C ′2 ∥f ∥ 2 W r ∞ (S d-1 ) • d 2N +s-r (2N + 1) 2(s-r)+1 , where C ′ is a constant depending on α but independent of r, s, f , N , and d. Thus the random variable ξ satisfies the condition ∥ξ∥ ≤ M < ∞ in Lemma E.2.1 with M = C ′ ∥f ∥ W r ∞ (S d-1 ) d N + s-r 2 (2N + 1) (s-r)+ 1 2 . So by Lemma E.2.1, with δ = 1 2 and σ 2 (ξ) ≤ M 2 , we know from the positive measure of the sample set that there exists a set of points y = {y i } M i=1 ∈ S d-1 such that 1 M M i=1 ξ(y i ) -E(ξ) H = L N (f ) -L y N,M (f ) W s 2 (S d-1 ) ≤ 6 • C ′ ∥f ∥ W r ∞ (S d-1 ) d N + s-r 2 (2N + 1) (s-r)+ 1 2 √ M . 
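Since $s = \frac{3d-2}{4}$, combining the result from Proposition D.1.3 with (33) yields

$$\|L_N(f) - L^{y}_{N,M}(f)\|_\infty \le 6 \cdot C'' \Big(\frac{6}{\pi e}\Big)^{\frac{d}{4}} \|f\|_{W^r_\infty(S^{d-1})} \, \frac{d^{\,N + \frac{3d-4r-2}{8}} (2N+1)^{\frac{3d-4r}{4}}}{\sqrt{M}},$$

where $C'' > 0$ is a constant depending on $\alpha$ but independent of $r$, $f$, $N$, $M$, and $d$.

E.3 PROOF OF LEMMA D.2.2

Lemma D.2.2 For any positive integer $m \ge 1$, there exists a deep ReLU network $\mathrm{Mult}_m \in \mathcal{F}(m+3, (2, 10, \ldots, 10, 1))$ such that $\mathrm{Mult}_m(x, y) \in [0, 1]$ and

$$|\mathrm{Mult}_m(x, y) - xy| \le 2^{-2m-1}, \quad \text{for all } x, y \in [0, 1].$$

Moreover, $\mathrm{Mult}_m(x, 0) = \mathrm{Mult}_m(0, y) = 0$.

Proof. Given the input $(x, y)$, the network $\mathrm{Mult}_m$ computes in the first hidden layer

$$(x, y) \mapsto \Big( \sigma\Big(\frac{x+y}{2}\Big), \; \sigma\Big(-\frac{x+y}{2}\Big), \; \sigma\Big(\frac{x-y}{2}\Big), \; \sigma\Big(-\frac{x-y}{2}\Big) \Big).$$

By the identity $|x| = \sigma(x) + \sigma(-x)$ for $x \in \mathbb{R}$, the network computes in the second hidden layer

$$(x, y) \mapsto \Big( \sigma\Big(\frac{x+y}{2}\Big), \; \sigma\Big(\frac{|x-y|}{2}\Big) \Big).$$

Note that $\sigma(\frac{x+y}{2}), \sigma(\frac{|x-y|}{2}) \in [0, 1]$, with $\sigma(\frac{x+y}{2}) = \frac{x+y}{2}$ and $\sigma(\frac{|x-y|}{2}) = \frac{|x-y|}{2}$. We apply the network $\tilde{f}_m$ to each of the two components. This gives a network with $(m+2)$ hidden layers and width vector $(2, 10, \ldots, 10, 2)$ that computes

$$(x, y) \mapsto \Big( \sigma\Big(\tilde{f}_m\Big(\frac{x+y}{2}\Big)\Big), \; \sigma\Big(\tilde{f}_m\Big(\frac{|x-y|}{2}\Big)\Big) \Big). \quad (34)$$

The network $\mathrm{Mult}_m$ computes (34) in the $(m+3)$-th hidden layer. Since $\tilde{f}_m \in [0, 1]$, we have $\sigma(\tilde{f}_m(x)) = \tilde{f}_m(x)$. In the output layer, the network value is computed as

$$\mathrm{Mult}_m(x, y) := \tilde{f}_m\Big(\frac{x+y}{2}\Big) - \tilde{f}_m\Big(\frac{|x-y|}{2}\Big). \quad (35)$$

Since $\tilde{f}_m$ is increasing in its argument and $\frac{x+y}{2} \ge \frac{|x-y|}{2}$ for $x, y \in [0, 1]$, we have $\mathrm{Mult}_m(x, y) \ge 0$, and since $\tilde{f}_m \in [0, 1]$, $\mathrm{Mult}_m(x, y) \le 1$. By the identity $xy = (\frac{x+y}{2})^2 - (\frac{x-y}{2})^2$ and Lemma D.2.1, the error is bounded as follows:

$$|\mathrm{Mult}_m(x, y) - xy| \le \Big| \tilde{f}_m\Big(\frac{x+y}{2}\Big) - \Big(\frac{x+y}{2}\Big)^2 \Big| + \Big| \tilde{f}_m\Big(\frac{|x-y|}{2}\Big) - \Big(\frac{x-y}{2}\Big)^2 \Big| \le 2^{-2m-1}.$$

If either $x = 0$ or $y = 0$, the definition (35) gives $\mathrm{Mult}_m(x, 0) = \mathrm{Mult}_m(0, y) = 0$.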
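The polarization argument in this proof can be checked numerically. The sketch below is an illustration under an assumption: Lemma D.2.1 is not reproduced in this section, so the squaring network is modeled here by the standard sawtooth (tent-map) piecewise-linear interpolant of $x^2$ on the grid $\{k/2^m\}$, whose uniform error on $[0,1]$ is $2^{-2m-2}$. The function names `tent`, `square_approx`, and `mult_approx` are hypothetical and evaluate the functions directly rather than building layer-by-layer ReLU weights.

```python
def tent(x):
    """Tent map on [0, 1]; expressible with a few ReLU units."""
    return 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)

def square_approx(x, m):
    """Piecewise-linear interpolant of x^2 on the grid k / 2^m, for x in [0, 1].

    Uses x - sum_{s=1}^{m} g_s(x) / 4^s, where g_s is the s-fold composition
    of the tent map (assumed stand-in for the network of Lemma D.2.1).
    """
    out, g = x, x
    for s in range(1, m + 1):
        g = tent(g)
        out -= g / 4.0 ** s
    return out

def mult_approx(x, y, m):
    """Approximate x * y via xy = ((x + y)/2)^2 - ((x - y)/2)^2."""
    return square_approx((x + y) / 2.0, m) - square_approx(abs(x - y) / 2.0, m)

# Largest observed error on a grid of inputs, compared with 2^(-2m-1).
m = 3
grid = [i / 50.0 for i in range(51)]
max_err = max(abs(mult_approx(x, y, m) - x * y) for x in grid for y in grid)
bound = 2.0 ** (-2 * m - 1)
```

Because both squaring calls receive the same argument when one input is zero, `mult_approx(x, 0.0, m)` vanishes exactly, matching the last claim of the lemma.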

E.4 PROOF OF LEMMA D.2.3

Lemma D.2.3 For any positive integer m ≥ 1, N ≥ 2 and for P = ⌈log 2 (N )⌉, there exists a deep ReLU network Poly {N} m ∈ F L, 1, 11N, . . . , 11N, 2 P , N , with the depth L = m+(m+4) ⌈log 2 (N )⌉-1 and the number of parameters N ≤ 202N •(m+3) such that Poly {N} m (x) ∈ [0, 1] 2 P and Poly j m (x) -x j ≤ P 2 • 2 -2m-1 for all j ∈ {1, . . . , 2 P } for all x ∈ [0, 1]. Proof. Let us describe the construction of the network Poly {N} m . With the application of Lemma D.2.1, in the (m + 1) th hidden layer, the network computes x → σ(x), σ( fm (x)) with the width p = (1, 5, . . . , 5, 2). For approximating x 3 , the network Mult m is applied on the pair (σ(x), σ( fm (x))), and for approximating x 4 , the network fm is applied on the σ( fm (x)). Therefore, in the {(m + 1) + (m + 4)} th hidden layer, the network Poly {N} m computes x → σ(x), σ( fm (x)), σ Mult m (x, fm (x)) , σ fm fm (x) . Note that each component in the hidden layer is in [0, 1] by Lemmas D.2.1 and D.2.2. This procedure is continued until a following vector is in the final output layer, x → Poly 1 m (x), . . . , Poly 2 P -1 m (x), Mult m (x, Poly 2 P -1 m (x)), . . . , fm fm . . . fm (x) ∈ [0, 1] 2 P . The resulting network is referred as Poly {N} m and has m + (m + 4) ⌈log 2 (N )⌉ -1 hidden layers. Recall P = ⌈log 2 (N )⌉. By the construction procedure of the network, we can compute the upper bound of maximum width as, 2 ⌈log 2 (N )⌉-1 + 10 • 2 ⌈log 2 (N )⌉-1 -1 + 5 ≤ 11 • 2 ⌈log 2 (N )⌉-1 ≤ 11N, where we use ⌈log 2 (N )⌉ ≤ log 2 (N )+1 in the second inequality. Now, we need to count the number of active parameters in the network. For k ∈ {1, . . . , ⌈log 2 (N )⌉}, we compute the upper bound on the total number of active parameters in-between following hidden layers: Poly 1 m (x), . . . , Poly 2 k-1 m (x) → Poly 1 m (x), . . . , Poly 2 k-1 m (x), Poly 2 k-1 +1 m (x), . . . , Poly 2 k m (x) . 
Think of a network that takes the hidden layer in the left hand side of (38) as an input, and gives the hidden layer in the right hand side of ( 38) as an output. It is easy to count the number of active parameters in input, hidden, and output layers, separately as follows:    Input layer : 2 k-1 + 1 + 2 • 2 k-1 -1 = 3 • 2 k-1 -1. Hidden layers : (m + 2) • 2 k-1 + 100 • (m + 2) • (2 k-1 -1) + 25 • (m + 2) = (m + 2)(101 • 2 k-1 -75). Output layer : 2 k-1 + 10 • (2 k-1 -1) + 5 = 11 • 2 k-1 -5. Since the k runs over {1, . . . , ⌈log 2 (N )⌉}, the total number of active parameters can be bounded as: ⌈log 2 (N )⌉ k=1 m + 2 101 • 2 k-1 -75 + 14 • 2 k-1 -6 ≤ (m + 2) • 101 ⌈log 2 (N )⌉ k=1 2 k-1 + 14 • ⌈log 2 (N )⌉ k=1 2 k-1 ≤ 202N • (m + 3). The approximation error is proved via induction on the number of iterated multiplications P = ⌈log 2 (N )⌉. For P = 1, that is N = 2, we have x 2 -fm (x) ≤ 2 -2m-1 by Lemma D.2.1. For the convenience of notation, denote xa := Poly a m (x) for some positive integer a. For P = k -1, assume a following holds x j -xj ≤ 3 k-2 • 2 -2m-1 for j ∈ {1, . . . , 2 k-1 }. Then, for P = k, we want to prove x j -xj ≤ 3 k-1 • 2 -2m-1 for j ∈ {1, . . . , 2 k }. By the construction of neural network and induction assumption, for j ∈ {1, . . . , 2 k-1 }, we have x j -xj ≤ 3 k-2 • 2 -2m-1 ≤ 3 k-1 • 2 -2m-1 . For any j ∈ {2 k-1 + 1, . . . , 2 k }, find any a, b ∈ {1, . . . , 2 k-1 } such that j = a + b. Then, for x ∈ [0, 1], x a+b -Mult m xa , xb ≤ x a+b -xa • xb + xa • xb -Mult m xa , xb ≤ x a x b -xb + xb |x a -xa | + xa • xb -Mult m xa , xb ≤ 3 k-2 • 2 -2m-1 + 3 k-2 • 2 -2m-1 + 2 -2m-1 ≤ 3 k-1 • 2 -2m-1 . By using the fact log 2 (3) < 2, we can deduce 3 k-1 < P 2 and conclude the proof.  L y N,M (f ) -f ∞ ≤ C ′ η • ∥f ∥ W r ∞ (S d-1 ) d 2N log 2 (2N ) 2 2 -2m , where C ′ η is a positive constant depending on η and α, but not on d, r, m, N, M or f . Proof. We adopt the shorthand notation denoting [n] := {1, 2, . . . , n} and [n] 0 := {0, 1, . . . 
, n} for n ∈ N in the proof. Given the input data x ∈ S d-1 , recall the definition of L y N,M (f )(x) in ( 24). The crux of the whole construction procedure is to build the sub-network which approximates ξ N,r (⟨x, y i ⟩) for each i ∈ [M ]. First, observe that, by ( 4) and ( 23), ξ N,r (u i ) can be written as: ξ N,r (u i ) = 2N k=0 (1 + λ k ) -r 2 η k N k + λ G λ G ⌊ k 2 ⌋ ℓ=0 (-1) ℓ Γ k -ℓ + λ G Γ λ G ℓ! k -2ℓ ! 2u i k-2ℓ , ( ) for i ∈ [M ]. The key observation is that Eq. ( 40) is the weighted sum of univariate polynomials of degree up to 2N . We define a constant c k,ℓ,η,λ k ,r,d as c k,ℓ,η,λ k ,r,d := (1 + λ k ) -r 2 η k N k + λ G λ G (-1) ℓ Γ k -ℓ + λ G 2 k-2ℓ Γ λ G ℓ! k -2ℓ ! . For i ∈ {1, . . . , M }, set α i,q as α i,q = (k,ℓ)∈Aq -c k,ℓ,η,λ k ,r,d if u i < 0 and q is odd, (k,ℓ)∈Aq c k,ℓ,η,λ k ,r,d otherwise, ( ) where for each q ∈ {0, . . . , 2N }, the set A q is given by A q := {(k, ℓ) ∈ [2N ] 0 ×[⌊k/2⌋] 0 : k-2ℓ = q}. Then, (40) can be re-written as ξ N,r (u i ) := 2N q=0 α i,q |u i | q . 1. The Network Construction. Now, we are ready for the construction of f . Through Lemma D.1.4, we know that there exists y = {y 1 , . . . , y M } that satisfies the bound (D.1.4). Then, for each i ∈ [M ], we put y i ∈ S d-1 as a weight vector that connects input x to the (2i -1) th and (2i) th nodes in the first hidden layer. Through this, f computes in its first hidden layer x → σ ⟨x, y 1 ⟩ , σ -⟨x, y 1 ⟩ , . . . , σ ⟨x, y M ⟩ , σ -⟨x, y M ⟩ ∈ [0, 1] 2M . Then, by the identity |x| = σ(x) + σ(-x) for x ∈ R, the network computes in its second hidden layer x → σ |u 1 | , σ |u 2 | , . . . , σ |u M | ∈ [0, 1] M , where u i := ⟨x, y i ⟩ ∈ [-1, 1] for i ∈ [M ]. Since σ(|u i |) = |u i | ∈ [0, 1], Poly {2N} m with P = ⌈log 2 (2N )⌉ is applicable for each {|u i |} M i=1 , and it generates Poly q m (|u i |) with q at most 4N . Set B max := max i=1,...,M 2N q=0 α i,q • Poly q m (|u i |) . 
Using the definition of the constant α i,q , the network f computes in the (m + 4)⌈log 2 (2N )⌉th hidden layer {σ( 2N q=0 α 1,q Poly q m (|u 1 |) + 2B max ), . . . , σ( 2N q=0 α M,q Poly q m (|u M |) + 2B max )} ∈ R M . By the definition of B max , it is easy to see each component in the hidden layer is positive. Set the weight of output layer as { 1 M F r (y i )} M i=1 . Define L(|u i |) := 2N q=0 α i,q • Poly q m (|u i |) + 2 • B max . Then, given the data y = {y 1 , . . . , y M }, the network f computes its final output as f From 2 nd to (m + 4)⌈log 2 (2N )⌉ -1 th hidden layer : 404N M • (m + 3). (x) = 1 M M j=1 F r (y j ) • L(|⟨x, y j ⟩|) -2B max := 1 M M i=1 F r (y i ) • L ξ N,r ⟨x, y i ⟩ . From (m + 4)⌈log 2 (2N )⌉ -1 th hidden layer to output layer : (2N + 1)M + M + 1. Summing up the total number yields the desired result.
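The parameter count in this step is plain arithmetic, and it can be verified mechanically. The sketch below sums the three layer-wise counts stated in this proof (input to second hidden layer, second to penultimate hidden layer, and last hidden layer to output) and compares the total with the closed form of Proposition D.2.4. The function names are hypothetical.

```python
from math import ceil, log2

def param_count_by_layer(d, N, M, m):
    """Layer-wise active-parameter counts as listed in the proof."""
    first = 2 * M * d + 2 * M            # input to 2nd hidden layer
    middle = 404 * N * M * (m + 3)       # 2nd to penultimate hidden layer
    last = (2 * N + 1) * M + M + 1       # last hidden layer to output
    return first, middle, last

def param_count_closed_form(d, N, M, m):
    """Closed form from Proposition D.2.4: M(2d + 404N(m+3) + 2N + 4) + 1."""
    return M * (2 * d + 404 * N * (m + 3) + 2 * N + 4) + 1

def depth(N, m):
    """Depth L = (m + 4) * ceil(log2(2N)) from Proposition D.2.4."""
    return (m + 4) * ceil(log2(2 * N))

# Example values chosen only for illustration.
total = sum(param_count_by_layer(d=10, N=4, M=3, m=3))
closed = param_count_closed_form(d=10, N=4, M=3, m=3)
```

For these illustrative values the two totals coincide, since $2Md + 2M + (2N+1)M + M + 1 = M(2d + 2N + 4) + 1$.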

3. Approximation Error Computation.

A remaining thing is to calculate the approximation error: L y N,M (f ) -f ∞ = sup x∈S d-1 1 M M i=1 F r (y i ) • ξ N,r (⟨x, y i ⟩) - 1 M M i=1 F r (y i ) • L ξ N,r ⟨x, y i ⟩ ≤ ∥f ∥ W r ∞ (S d-1 ) • ξ N,r -L ξ N,r ∞ . Recall the definition of α i,q in (42). Using Stirling's Formula, Γ(n+1 ) = √ 2πn n e n (1+O(1/n)), we observe the behavior of Gegenbauer coefficient in (4) where λ G = d-2 2 ≫ d α + 1 ≥ N , and find that it can be bounded as C • λ k-ℓ G • 2 k-2ℓ (1 + O(1/d)), where C > 0 is a constant independent of d. For k ∈ {0, 1, . . . , 2N }, combining the facts (1 + λ k ) -r 2 < 1, η(•) ≤ 1, k+λG λG ≤ 2 for k ≤ 2N ≤ 2(d α + 1) with λ G = d-2 2 yields c k,ℓ,η,λ k ,r,d ≤ C ′ η • 2 -ℓ • d k-ℓ , where C ′ η > 0 is a constant dependent on α and η. Recall L ξ N,r ⟨x, y j ⟩ := 2N q=0 α j,q • Poly q m (|⟨x, y j ⟩|) and note that 2N q=0 |α j,q | = 2N k=0 ⌊ k 2 ⌋ ℓ=0 c k,ℓ,η,λ k ,r,d . Then, we have ξ N,r -L ξ N,r ∞ ≤ 2N k=0 ⌊ k 2 ⌋ ℓ=0 c k,ℓ,η,λ k ,r,d • sup u∈[0,1] max q∈{0,...,2N } |u q -Poly q m (u)| ≤ C ′ η • 2N k=0 d k ⌊ k 2 ⌋ ℓ=0 1 (2d) ℓ • log 2 (2N ) 2 • 2 -2m-1 where we used the result from Lemma D.2.3 and (44) in the second inequality. Using ⌊ k 2 ⌋ ℓ=0 1 (2d) ℓ ≤ 2 in the last inequality yields the claim. E.6 PROOF OF COROLLARY 3.3 Corollary 3.3 Let 0 < α, β, γ < 1 with γ > max{α, β} and N ∈ N with 1 ≤ N ≤ d α + 1. For any f ∈ W r ∞ (S d-1 ) with r > 0, we have :  (I) For 3d-2 4 -C 1 ≤ r ≤ 3d-2 4 with some constant C 1 ≥ 0 independent of d, there exists a network f (I) ∈ F (L, (d, 66N, 66N, . . . , 66N, 1) , N ) with depth L = O (d γ log 2 d) and the number of active parameters N = O d max{α+γ,1} , such that f -f (I) ∞ ≤ C ′ η,α,β,γ ∥f ∥ W r ∞ (S d-1 ) d -d β , where C ′ η,α,β,γ is a constant depend- ing only on C 1 , η, α, β, γ. f -f (II) ∞ ≤ C ′ η,α,β,γ ∥f ∥ W r ∞ (S d-1 ) d -αr , where C ′ η,α,β,γ is a constant depending only on η, α, β, γ. Proof. 
By the results of Theorem 3.1, for 1 ≤ N ≤ d α + 1, we have the following inequality on the approximation error f -f ∞ ≤ C ′′ η ∥f ∥ W r ∞ (S d-1 ) × max N -r , 6 πe d 4 d N + 3d-4r-2 8 (2N + 1) 3d-4r 4 √ M , d 2N log 2 (2N ) 2 2 -2m , where C ′′ η is a constant dependent on η, and independent on d, r, N, M or f . We divide the proof into two cases. For the second term in (45), since N = ⌈d α ⌉ with 0 < α < 1, we know that it is bounded by 6 πe d 4 d N + 3d-4r-2 8 (2N + 1) 3d-4r 4 √ M ≤ 6 πe d 4 d d α + 3d-4r+6 8 (3d α ) 3d-4r 4 √ M . ( ) As 3d-2 4 -C 1 ≤ r ≤ 3d-2 4 , we know the term on the right hand side of ( 46) can be written as 6 πe d 4 d d α +O(1) 3 O(1) / √ M . To show that the bound is of order O(d -d β ), we multiply the bound by d d β , take the logarithm, and find that for any 0 < α, β < 1, log 6 πe d 4 d d α +d β +O(1) ≤ d 4 log 6 πe + d α + d β + O(1) log(d) → -∞, as d → ∞. Hence, there exists a constant C α,β > 0 depending only on C 1 , α, β such that 6 πe d 4 d N + 3d-4r-2 8 (2N + 1) 3d-4r 4 √ M ≤ C α,β d -d β , for any fixed M ∈ N. In our proof, we simply choose M = 3. For the third term in (45), take m = ⌈d γ ⌉ with max{α, β} < γ < 1, then there exists a constant C α,β,γ depending on α, β, γ such that d 2N log 2 (2N ) 2 2 -2m < d 2d α +2 2 + log 2 (d) 2 2 -2m ≤ d 3d α 2 -2m ≤ C α,β,γ d -d β , where log 2 (2d α + 2) ≤ log 2 (4d α ) < 2 + log 2 (d) is used in the first inequality, and the last inequality follows from the same argument as above, of multiplying with d d β and taking the logarithm. Combining all the analyses above, we have For the first term in (45), since N = ⌈d α ⌉, we know that N -r = O(d -αr ). f -f ∞ ≤ C ′ η,α,β,γ ∥f ∥ W r ∞ (S d-1 ) d -d β , where C ′ η,α,β,γ > 0 is a constant dependent on η, α, β, γ, For the second term in (45), since N = ⌈d α ⌉ with 0 < α < 1, we know that it is bounded by  6 πe d 4 d N + 3d-4r-2 8 (2N + 1) 3d-4r 4 √ M ≤ 6 πe d 4 d d α + 3d-4r+6 8 (3d α ) 3d-4r 4 √ M ≤ 6 πe d 4 d d α + 9 8 d 3 d √ M . 
√ M ≤ C α,β d -d β ≤ C α,β d -αr , for M = O(9 d d 9 4 d ). For the third term in (45), take m = ⌈d γ ⌉ with max{α, β} < γ < 1, then there exists a constant C α,β,γ depending on α, β, γ such that d 2N log 2 (2N ) 2 2 -2m < d 2d α +2 2 + log 2 (d) 2 2 -2m ≤ d 3d α 2 -2m ≤ C α,β,γ d -d β ≤ C α,β,γ d -αr , where log 2 (2d α + 2) ≤ log 2 (4d α ) < 2 + log 2 (d) is used in the first inequality, and the last inequality follows from the same argument as above, of multiplying with d d β and taking the logarithm. Combining all the analyses above, we have Proposition 4.2 Set δ ∈ (0, 1). Then, with probability at least 1 -δ, we have f -f ∞ ≤ C ′ η, E π B f n -E f ρ ≤ C B,δ,f • Pdim(F) • log(n) n + ∥f -f ρ ∥ ∞ √ n + ∥f -f ρ ∥ 2 ∞ , where C B,δ,f is an absolute constant dependent on B, δ, f independent on n, r, d. Proof. Since f n is an empirical risk minimizer in (12), we have E D ( f n ) ≤ E D (f ) for any fixed f ∈ F and E D (π B f n ) ≤ E D ( f n ). Then, we have a following decomposition: E π B f n -E f ρ = E π B f n -E f ρ -E D π B f n -E D f ρ + E D π B f n -E D f ρ -E D f -E D f ρ + E D f -E D f ρ -E f -E f ρ + E f -E f ρ ≤ E π B f n -E f ρ -E D π B f n -E D f ρ (49) + E D f -E D f ρ -E f -E f ρ + E f -E f ρ . Let F B := {π B f : ∀f ∈ F} and define two quantities: S 1 (n, F B ) := E f -E f ρ -E D f -E D f ρ ∀f ∈ F B , S 2 (n, F) := E D f -E D f ρ -E f -E f ρ ∀f ∈ F. Step 1 : Control S 1 (n, F B ). The following concentration inequality is needed for controlling the term. the (58), ( 59) in (48) from Proposition 4.2.  E π M f n -E f ρ ≤ C B, where C B,η,δ,f depends on B, η, δ, f and independent on d, r and n. Then, under the regime 1 ≤ N ≤ d α + 1 for some 0 < α < 1, choices of m = ⌈ r 3d+4r log 2 (n)⌉, N = ⌈n 2 3d+4r ⌉ and M = ⌈n 3d 3d+4r ⌉ make the fraction of the first term in (60) simple as follows: ⌈log 2 (N )⌉ • log mM N ⌈log 2 (N )⌉ ≤ 2 3d + 4r log 2 (n) log log 2 (n)n 3d+2 3d+4r 2r (3d + 4r) 2 ⌈log 2 (n)⌉ ≤ 6 3d + 4r log 2 (n) 2 . 
Then, with the same choices of m, N, M as above, we obtain the bound on the excess risk as : where L t (ξ N,r )(t) is an univariate function for t ∈ [-1, 1] defined in Lemma 7 in Fang et al. (2020) . Now, we work on controlling ∥ξ N,r -L t (ξ N,r )∥ C[-1,1] . E π M f n -E f ρ ≤ C B, ∥ξ N,r -L t (ξ N,r )∥ C[-1,1] ≤ 2ω(ξ N,r , 1/B) ≤ 2 B ∥ξ ′ N,r ∥ C[-1,1] ≤ 8N 2 B ∥ξ N,r ∥ C[-1,1] ≤ 8N 2 B 2N k=1 k -r N (k, d) ≤ 8C α N 2 B d 2N 2N k=1 k -r ≤ 8C α N 2 B d 2N r r -1 . In the first inequality, we use Lemma 7 in Fang et al. (2020) , where ω(ξ N,r , 1/B) is a modulus continuity of ξ N,r given ω(ξ N,r , 1/B) = sup In the second inequality, we use the definition of modulus continuity of ξ N,r . In the third inequality, since ξ N,r is an algebraic polynomial of degree at most 2N , by Markov's inequality, we have ∥ξ η(z i ) -µ < ε ≥ 1 -exp -nε 2 2 σ 2 + 1 3 B η ε .



Interested readers can find an intuitive technical explanation for the exponential dependence on $d$ of the width $W$ and of the number of active parameters $\mathcal{N}$ in Appendix A. Henceforth, we use the shorthand notation $\mathcal{F}$ for $\mathcal{F}(L, p, \mathcal{N})$; the dependence on $(L, p, \mathcal{N})$ should be implicitly understood. Since $S^{d-1} \subset C^d$, it may seem obvious that we should achieve the minimax optimal rate. In this regard, we add further detailed technical remarks on the sub-optimality of the excess risk in Appendix B.



r) d • n -2r 2r+d

Table 1: Here, $C > 0$ is a universal constant. The notation $\tilde{O}(\cdot)$ hides the logarithmic factor in $n$. Note that the upper bounds for $\mathcal{N}$ in Theorem 4.3 (i.e., $\mathcal{N} = O(Md)$) are from Theorem 3.1 with the choice $M = \lceil n^{\frac{3d}{3d+4r}} \rceil$.

As a corollary of Theorem 3.1, we show how the order of the function smoothness $r$ affects the scale of the network in terms of $d$. Specifically, when the function smoothness is $r = O(1)$, we show that the constructed network $\tilde{f}$ requires $W = O(d^d)$, $L = O(d^\gamma \log_2 d)$ for $0 < \gamma < 1$, and $\mathcal{N} = O(d^{d+1})$ to obtain a $d^{-O(1)}$ approximation error, up to constant factors independent of $d$. Furthermore, when $r = O(d)$, we show that only $W = O(d^\alpha)$, $L = O(d^\gamma \log_2 d)$, and at most $\mathcal{N} = O(d^2)$ are required to obtain the sharp approximation rate $O(d^{-d^\beta})$ for $0 < \alpha, \beta < 1$. See Corollary 3.3 for the detailed statement of the result.

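If $r = O(d)$, the factor becomes $\big(\frac{6}{\pi e}\big)^{\frac{d}{4}} d^{\lceil d^\alpha \rceil}$ for $0 < \alpha < 1$. Here, the exponentially decaying term $\big(\frac{6}{\pi e}\big)^{\frac{d}{4}}$ is derived from the Sobolev embedding lemma. See Proposition D.1.3 in Appendix D.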
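To see how the exponentially decaying term eventually dominates the polynomial-in-$d^{d^\alpha}$ growth, one can evaluate the logarithm of the factor above multiplied by $d^{d^\beta}$, which is the quantity examined in the proof of Corollary 3.3. The following is a rough numerical sketch; the function name `log_scaled_factor` and the additive slack `c` (standing in for the $O(1)$ term in the exponent) are assumptions made for illustration.

```python
import math

def log_scaled_factor(d, alpha, beta, c=0.0):
    """log of (6 / (pi e))^(d/4) * d^(d^alpha + d^beta + c).

    The first summand is negative and linear in d; the second is positive
    but grows only like d^max(alpha, beta) * log(d).
    """
    decay = (d / 4.0) * math.log(6.0 / (math.pi * math.e))
    growth = (d ** alpha + d ** beta + c) * math.log(d)
    return decay + growth

# Illustrative values with alpha = beta = 0.5.
values = {d: log_scaled_factor(d, 0.5, 0.5) for d in (1e4, 1e6, 1e8)}
```

With $\alpha = \beta = 0.5$, the logarithm is still positive at $d = 10^4$ and becomes negative by $d = 10^6$, which illustrates that the bound is genuinely asymptotic: since $\log\frac{6}{\pi e} \approx -0.35$, the linear decay only takes over once $d$ is large relative to $d^{\max\{\alpha,\beta\}} \log d$.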

d β for β close to 1. The width of f (II) grows exponentially in d requiring M = O(d d ). Interestingly, in both scenarios, the depth L has the same order in d as O (d γ log 2 d) for 0 < γ < 1. Remark 3.4 As suggested by one of the reviewers, we further compare our results in Corollary 3.3 (I) with the CNN architecture with downsampling operation suggested in Fang et al. (2020) for approximating f ∈ W r ∞ (S d-1 ), and (II) with Lu et al. (2021); Jiao et al. (2021) where they consider the problem of approximating f ∈ W r ∞ (C d ) via deep ReLU FNN. Due to the limited space, we defer the detailed remarks on the comparisons in the Appendix C.

d⌋, and D 2 = ⌊(d + JS)/d⌋. Then, for any function f ∈ Schmidt-Hieber (Thm 5) Lu et al. (Thm 1.1.) Jiao et al. (Corol. 3.1) Depth

COMPARISONS OF COROLLARY 3.3 WITH THE RECENT APPROXIMATION RESULTS FOR $f \in W^r_\infty(C^d)$ VIA DEEP RELU FNN

After the publication of Schmidt-Hieber (2020), a series of works further studied the approximation and estimation of $f \in W^r_\infty(C^d)$ via deep ReLU FNNs. Specifically, Ghorbani et al. (2020) pointed out that the additive function studied in Schmidt-Hieber (2020) has an exponential dependence on $d$ in the prefactor of the estimation error, requiring a sample size $n \gtrsim d^d$ for a good convergence rate. Later, Shen et al. (2021) and Lu et al. (2021) tracked the explicit dependence on $d$ in the approximation error as well as in the architectural components (i.e., width $W$ and depth $L$) for approximating $f \in C(C^d)$ (i.e., Lipschitz continuous functions on $C^d$) and $f \in W^r_\infty(C^d)$, respectively. Specifically, Lu et al. (2021) improved the prefactor of the approximation error to $O(8^r (r+1)^d)$, compared with $O(6^d (r+1)^d)$ in Schmidt-Hieber (2020). Most recently, Jiao et al. (2021) further improved the prefactor to $O((\lfloor r \rfloor + 1)^2 d^{\lfloor r \rfloor + (r \vee 1)/2})$ for approximating $f \in W^r_\infty(C^d)$. Table 2 summarizes the depths, widths, and approximation rates of the deep ReLU networks constructed in Schmidt-Hieber (2020); Lu et al. (2021); Jiao et al. (2021).

Since $\mathcal{N}$ is not tracked in the works of Lu et al. (2021) and Jiao et al. (2021), a direct comparison of their results with ours for the approximation of $f \in W^r_\infty(S^{d-1})$ is challenging. However, it can be roughly checked that the result in Jiao et al. (2021) does not escape the curse of dimensionality for any $r > 0$. Specifically, for $r = O(1)$, the width of the constructed network depends exponentially on $d$. For the case $r = O(d)$, achieving the approximation rate $O(d^{-d^\beta})$ requires $O(d^d)$ width in the suggested network.

$d^k$ for $0 \le k \le 2N$.

E PROOFS OF STATEMENTS IN APPENDIX B AND COROLLARY 3.3

E.1 PROOF OF PROPOSITION D.1.3

Proposition D.1.3 For $d \ge 5$, $1 \le p \le \infty$, and $s \ge \frac{3d-2}{4}$, the Sobolev space $W^s_p(S^{d-1})$

For any positive integer $m \ge 1$, there exists a deep ReLU network $\mathrm{Mult}_m \in \mathcal{F}(m + 3, (2, 10, \ldots, 10, 1))$ such that $\mathrm{Mult}_m(x, y) \in [0, 1]$ and
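The mechanism behind the squaring and multiplication networks can be checked numerically. The sketch below is a plain-Python illustration, not the paper's exact architecture: it builds the sawtooth function from three ReLU units, forms Yarotsky's approximation $x - \sum_{s=1}^{m} g_s(x)/2^{2s}$ of $x^2$, and multiplies via the rescaled polarization identity $xy = ((x+y)/2)^2 - ((x-y)/2)^2$, chosen here (as an assumption) so that both squared arguments stay in $[0, 1]$. The names `sawtooth`, `square_approx`, and `mult_approx` are hypothetical.

```python
def relu(t):
    """Rectified linear unit."""
    return max(t, 0.0)


def sawtooth(x):
    """Tent map on [0, 1]: 2x on [0, 1/2], 2(1 - x) on [1/2, 1],
    written as a combination of three ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


def square_approx(x, m):
    """Yarotsky-style approximation of x^2 on [0, 1]:
    x minus the sum over s = 1..m of g_s(x) / 4^s,
    where g_s is the s-fold composition of the sawtooth."""
    out = x
    g = x
    for s in range(1, m + 1):
        g = sawtooth(g)
        out -= g / 4.0 ** s
    return out


def mult_approx(x, y, m):
    """Approximate x * y for x, y in [0, 1] using
    xy = ((x + y)/2)^2 - (|x - y|/2)^2 (rescaled polarization identity,
    an assumption of this sketch), with |t| = relu(t) + relu(-t)."""
    half_sum = (x + y) / 2.0
    half_diff = (relu(x - y) + relu(y - x)) / 2.0
    return square_approx(half_sum, m) - square_approx(half_diff, m)


if __name__ == "__main__":
    m = 4
    grid = [i / 200.0 for i in range(201)]
    sq_err = max(abs(square_approx(x, m) - x * x) for x in grid)
    print("max squaring error:", sq_err, "bound:", 2.0 ** (-2 * m - 2))
```

On a grid, the squaring error stays within $2^{-2m-2}$, and the product error of this sketch stays within $2^{-2m-1}$, since it combines two squaring approximations.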

PROOF OF PROPOSITION D.2.4

Proposition D.2.4 Let $0 < \alpha < 1$ and $m, N, M \in \mathbb{N}$ with $1 \le N \le d^{\alpha} + 1$. For any function $f \in W^r_\infty(S^{d-1})$ with $r > 0$, define $L^{y}_{N,M}(f)$ as in (24). Then, there exists a network $f \in \mathcal{F}(L, (d, 22NM, \ldots, 22NM, 1), \mathcal{N})$ with depth $L = (m + 4)\lceil \log_2(2N) \rceil$ and number of parameters $\mathcal{N} \le M(2d + 404N \cdot (m + 3) + 2N + 4) + 1$ such that

2. The Width and Number of Active Parameters of $f$. By the construction of the network $f$ and the result of Lemma D.2.3, the maximum width of the network is $22NM$. We now count the number of active parameters in the network as follows.

Input to 2nd hidden layer: $2Md + 2M$.
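For a sense of scale, the sketch below evaluates the depth $(m+4)\lceil \log_2(2N) \rceil$, the maximum width $22NM$, and the parameter bound $M(2d + 404N(m+3) + 2N + 4) + 1$ stated in Proposition D.2.4, using the choices $m = \lceil d^{\gamma} \rceil$ and $N = \lceil d^{\alpha} \rceil$ from the proof of Corollary 3.3. The function name `network_size` and the sample inputs are illustrative assumptions.

```python
import math


def network_size(d, alpha, gamma, M):
    """Return (depth, max width, parameter bound) from Proposition D.2.4
    with m = ceil(d^gamma) and N = ceil(d^alpha)."""
    N = math.ceil(d ** alpha)
    m = math.ceil(d ** gamma)
    # ceil(log2(k)) for an integer k >= 1, computed exactly with integers.
    ceil_log2_2N = (2 * N - 1).bit_length()
    depth = (m + 4) * ceil_log2_2N
    width = 22 * N * M
    params_bound = M * (2 * d + 404 * N * (m + 3) + 2 * N + 4) + 1
    return depth, width, params_bound


if __name__ == "__main__":
    # Regime (I) of Corollary 3.3 uses a constant M (M = 3 in the proof).
    for d in (16, 64, 256):
        print(d, network_size(d, alpha=0.5, gamma=0.5, M=3))
```

With $M$ held constant, the parameter bound is dominated by the $404N(m+3)$ and $2d$ terms, which is where the order $d^{\max\{\alpha+\gamma, 1\}}$ comes from.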

II) For $r = O(1)$ and $M = O(9^d d^{\frac{9}{4}d})$, there exists a network $f^{(\mathrm{II})} \in \mathcal{F}(L, (d, 22NM, \ldots, 22NM, 1), \mathcal{N})$ with depth $L = O(d^{\gamma} \log_2 d)$ and number of active parameters $\mathcal{N} = O(9^d d^{\frac{13}{4}d})$ such that

(I) $r = O(d)$ and any integer $M \ge 1$.

For the first term in (45), since $N = \lceil d^{\alpha} \rceil$, we know that $N^{-r} = O(d^{-\alpha r}) = O(d^{-d^{\beta}})$ for any $0 < \beta < 1$. This is due to the assumption that $\frac{3d-2}{4} - C_1 \le r \le \frac{3d-2}{4}$, which implies $d = O(r)$ and $d^{\beta} = o(r)$.
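The step $d^{-\alpha r} = O(d^{-d^{\beta}})$ amounts to the exponent comparison $\alpha r \ge d^{\beta}$ for large $d$. The sketch below checks from which dimension this holds when $r = \frac{3d-2}{4} - C_1$; the function names and the sample values of $\alpha$, $\beta$, and $C_1$ are illustrative assumptions.

```python
def exponent_gap(d, alpha, beta, c1=0.0):
    """alpha * r - d^beta with r = (3d - 2)/4 - c1.
    A nonnegative gap means d^(-alpha r) <= d^(-d^beta) for d > 1."""
    r = (3.0 * d - 2.0) / 4.0 - c1
    return alpha * r - d ** beta


def first_dimension_with_nonneg_gap(alpha, beta, c1=0.0, d_max=10**6):
    """Smallest d >= 5 (the range used in Proposition D.1.3) with a
    nonnegative exponent gap, or None if none is found up to d_max."""
    for d in range(5, d_max + 1):
        if exponent_gap(d, alpha, beta, c1) >= 0.0:
            return d
    return None


if __name__ == "__main__":
    for beta in (0.5, 0.7, 0.9):
        print(beta, first_dimension_with_nonneg_gap(alpha=0.5, beta=beta))
```

Because $\alpha r$ grows linearly in $d$ while $d^{\beta}$ grows sublinearly, the gap eventually becomes positive for every $\beta < 1$, but the dimension at which this happens increases quickly as $\beta$ approaches 1.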

and $C_1$. Recall from Proposition D.2.4 that $f$ is a network with depth $L = (m + 4)\lceil \log_2(2N) \rceil$ and number of parameters $\mathcal{N} \le M(2d + 404N \cdot (m + 3) + 2N + 4) + 1$. By simply plugging in $m = \lceil d^{\gamma} \rceil$, $N = \lceil d^{\alpha} \rceil$, and $M = 3$, we have $L = O(d^{\gamma} \log_2 d)$ and $\mathcal{N} = O(d^{\max\{\alpha+\gamma, 1\}})$.

(II) $r = O(1)$ and $M = O(d^d)$.

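α,β,γ $\|f\|_{W^r_\infty(S^{d-1})}\, d^{-\alpha r}$, where $C'_{\eta,\alpha,\beta,\gamma} > 0$ is a constant dependent on $\eta$, $\alpha$, $\beta$, $\gamma$, and $C_1$. Recall from Proposition D.2.4 that $f$ is a network with depth $L = (m + 4)\lceil \log_2(2N) \rceil$ and number of parameters $\mathcal{N} \le M(2d + 404N \cdot (m + 3) + 2N + 4) + 1$. By simply plugging in $m = \lceil d^{\gamma} \rceil$, $N = \lceil d^{\alpha} \rceil$, and $M = O(9^d d^{\frac{9}{4}d})$, we have $L = O(d^{\gamma} \log_2 d)$ and $\mathcal{N} = O(9^d d^{\frac{13}{4}d})$.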

η,δ,f $\cdot \max\big\{1, \frac{6rd}{(3d+4r)^2} (\log_2 n)^4\big\}$,

Suppose $f_\rho \in W^r_\infty([0,1]^d)$ with $r > 0$. A network $f_n$ from (9) with the choices $N_H = \lceil n^{\frac{d}{2d+r}} \rceil$ and $m_H = \lceil \frac{d+r}{d+2r} \log_2(n) \rceil$ yields the following bound on the excess risk with probability at least $1 - \delta$:

$$\mathcal{E}(\pi_M f_n) - \mathcal{E}(f_\rho) \le C_{B,\eta,\delta,K} \cdot \max\Big\{ \lceil \log_2(d + \lceil r \rceil) \rceil^2 (d + r)^d \cdot (\log_2 n)^3,\; (1 + r^2 + d^2)^2\, 6^{2d} + 3^{2r} \Big\} \cdot n^{-\frac{2r}{2r+d}}, \qquad (61)$$

where $C_{B,\eta,\delta,K}$ depends on $B$, $\eta$, $\delta$, $K$ and is independent of $d$, $r$, and $n$.

Proof. From Theorem 5 of Schmidt-Hieber (2020), for any function $f_\rho \in C^r_d([0,1]^d, K)$ and any integers $m \ge 1$ and $N \ge (r + 1)^d \vee (K + 1)e^d$, there exists a network

$$f \in \mathcal{F}\big(L, (d, 6(d + \lceil r \rceil)N, \ldots, 6(d + \lceil r \rceil)N, 1), \mathcal{N}, \infty\big) \qquad (62)$$

with depth $L = 8 + (m + 5)(1 + \lceil \log_2(d \vee r) \rceil)$ and number of parameters $\mathcal{N} \le 141(1 + d + r)^{3+d} N (m + 6)$, such that

$$\|f - f_\rho\|_\infty \le (2K + 1)(1 + d^2 + r^2)\, 6^d N 2^{-m} + K 3^r N^{-\frac{r}{d}}. \qquad (63)$$

${}_{|t| \le 1/B} \big\{ |\xi_{N,r}(\nu) - \xi_{N,r}(\nu + t)| : \nu, \nu + t \in [-1, 1] \big\}$.

${}_{N,r}\|_{C[-1,1]} \le (2N)^2 \|\xi\|_{C[-1,1]}$.

In the fourth inequality, we have the bound $\|\xi_{N,r}\|_{C[-1,1]} \le \sum_{k=1}^{2N} k^{-r} N(k, d)$ by Corollary 1.2.7 of Dai & Xu (2013). Employing Stirling's formula $d! = \sqrt{2\pi d}\,(d/e)^d (1 + O(1/d))$ in $N(k, d)$ yields $N(k, d) \le C_\alpha d^k$ for $0 \le k \le 2N$ and some constant $C_\alpha$ depending on $\alpha$ but independent of $d$. In the last inequality, we used the following inequality:

combine our results from Lemma D.1.2 and Lemma D.1.4, and conclude the bound in (66).

H USEFUL LEMMAS

Lemma H.1 [Theorem 6 of Bartlett et al. (2019)] Consider the function class $\mathcal{F}$ computed by a feed-forward neural network architecture with $\mathcal{N}$ parameters and $U$ computation units arranged across $L$ layers. Suppose that all non-output units have piecewise-polynomial activation functions with $p + 1$ pieces and degree no more than $d$, and that the output unit has the identity function as its activation function. Then the VC-dimension and pseudo-dimension of the class $\mathcal{F}$ are upper bounded by

$$\mathrm{VCdim}(\mathcal{F}),\ \mathrm{Pdim}(\mathcal{F}) \le C \cdot \big( L \mathcal{N} \log(p \cdot U) + L^2 \mathcal{N} \log(d) \big),$$

with some universal constant $C > 0$.

Lemma H.2 [Theorem 2.8.4 of Vershynin (2018)] Let $\eta$ be a random variable on a probability space $Z$ with mean $E(\eta) = \mu$ and variance $\sigma^2(\eta) = \sigma^2$, satisfying $|\eta(z) - E(\eta)| \le B_\eta$ for almost all $z \in Z$. Then, for any $\varepsilon > 0$,

2 j } through $f_m$ and $\mathrm{Mult}_m$ operations in Lemmas D.2.1 and D.2.2. The approximation error can be obtained by induction. Readers can find the detailed proof in Appendix E.4, together with the exact description of the construction of $\mathrm{Poly}^{\{N\}}_m$. Finally, we are ready to state Proposition D.2.4 on the construction of the network $f$ which approximates

Hence, there exists a constant $C_{\alpha,\beta} > 0$ depending only on $C_1$, $\alpha$, and $\beta$ such that

η,δ,f $\times \frac{mMd}{n} \log(n) \cdot \lceil \log_2(N) \rceil \cdot \log\big( mMN \lceil \log_2(N) \rceil \big)$

D.2 CONSTRUCTION OF DEEP RELU NETWORKS : ERROR BOUND FOR (III)

In this section, several useful tools for the construction of a neural network approximating the function $L^{y}_{N,M}(f)$ are introduced. Then, the full proof of our main theorem is presented. The first key lemma is from Yarotsky (2017), wherein a neural network that approximates the quadratic function $x^2$ for $x \in [0, 1]$ is constructed.

Lemma D.2.1 (Proposition 2 in Yarotsky (2017)) For any positive integer $m \ge 1$, there exists a deep ReLU network $f_m \in \mathcal{F}(m, (1, 5, \ldots, 5, 1))$ such that $f_m(x) \in [0, 1]$ and $|f_m(x) - x^2| \le 2^{-2m-2}$ for all $x \in [0, 1]$.

The main idea of Lemma D.2.1 is to approximate the quadratic function via $f_m(x) := x - \sum_{s=1}^{m} \frac{g_s(x)}{2^{2s}}$. Here, $g_s(x)$ is the $s$-fold composition of the sawtooth function $g$, defined as $g(x) = 2x$ for $0 \le x < 1/2$ and $g(x) = 2(1 - x)$ for $1/2 \le x \le 1$. Note that $g(x)$ can be implemented by a single-layer ReLU network. Then, we can easily construct a ReLU network $f_m$, which belongs to $\mathcal{F}(m, (1, 5, \ldots, 5, 1))$.

The next lemma states that we can construct a neural network that implements the multiplication operator. The key idea for constructing $\mathrm{Mult}_m(x, y)$ is to invoke the identity $xy = \frac{1}{4}\big((x + y)^2 - (x - y)^2\big)$. The first two hidden layers in the network are used to compute

Note that the network $\mathrm{Poly}^{\{N\}}_m(x) := \{\mathrm{Poly}^{1}_m(x), \ldots, \mathrm{Poly}^{2^P}_m(x)\}$ with $P = \lceil \log_2(N) \rceil$ provides approximations to the monomials $x^j$ of degree up to $2N$ for $x \in [0, 1]$ at its final output. The key idea for the construction is to employ a tree structure; that is, the width of the network at the $((m + 1) + (m + 4) \cdot j)$-th hidden layer is doubled from that at the $((m + 1) + (m + 4) \cdot (j - 1))$-th hidden

Lemma F.1.1 [Theorem 11.4 of Györfi et al. (2002)] Assume $|y| \le B$ almost surely and

Haussler (2018)] Let $B > 0$ and let $\mathcal{F}'$ be a set of functions $f : X \to [-B, B]$. Then for any $\varepsilon \in (0, B]$, there holds

Recall a classical relation between the $\varepsilon$-packing number and the $\varepsilon$-covering number that asserts

for any $\varepsilon > 0$.
Combining (50), (51), the fact that $\log x < x$ for all $x > 0$, and $\mathrm{Pdim}(\mathcal{F}_B) \le \mathrm{Pdim}(\mathcal{F})$ (see Maiorov & Ratsaby (1999), page 297), we have the upper bound on the covering number as follows:

Then, taking $\varepsilon = \frac{1}{2}$ and $\beta = \frac{1}{n}$ in Lemma F.1.1 and using the upper bound on the covering number in (52) yields the lower bound for the confidence level in Lemma F.1.1 as follows:

where $C_B > 0$ is an absolute constant dependent on $B$. Choosing $\alpha$ in (53) such that

with a properly chosen absolute constant $C_{B,\delta} > 0$ dependent on $B$ and $\delta$ ensures that the probability of the following event is at least $1 - \frac{\delta}{2}$:

Step 2: Control $S_2(n, \mathcal{F})$. Define a random variable $\eta$ on $Z = X \times Y$ to be

Then, by the one-sided Bernstein's inequality (see Lemma H.2), we have

∞ and solving the quadratic equation with respect to $\varepsilon$ yield the following inequalities with some absolute constant $C'' > 0$:

where in the first inequality, the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b > 0$ is used, and $C_{B,f,\delta}$ is a constant dependent on $C$, $B$, and $f$. Then, with probability at least $1 - \frac{\delta}{2}$, we have

Step 3: Combining Everything. Plugging (54) and (55) into (49) yields the claim.

F.2 PROOF OF THEOREM 4.3

⌉, and $m = \lceil \frac{r}{3d+4r} \log_2(n) \rceil$ yield the following bound on the excess risk with probability at least $1 - \delta$:

where $C_{B,\eta,\delta,f}$ depends on $B$, $\eta$, $\delta$, $f$ and is independent of $d$, $r$, and $n$.

), recall from Proposition D.2.4 that there exists a network

with depth $L = (m + 4)\lceil \log_2(2N) \rceil$ and number of parameters $\mathcal{N} \le M(2d + 404N \cdot (m + 3) + 2N + 4) + 1$ such that the corresponding network's approximation error is bounded as:

where $C''_\eta$ is a constant dependent on $\eta$ and independent of $d$, $r$, $N$, $M$, and $f$.
Since the network width is $22NM$, the total number of units across the $L$ hidden layers (i.e., $U$) of $f$ is bounded as

Recall from Lemma H.1 that the pseudo-dimension of the function class $\mathcal{F}$ in (57) is bounded as follows, for some universal constant $C > 0$:

Published as a conference paper at ICLR 2023

Then, similarly to the proof of Theorem 4.3, by the result of Lemma H.1, the pseudo-dimension of $\mathcal{F}$ in (62) can be bounded as

for some universal constant $C > 0$.

Plugging (63) and (64) into (48) from Proposition 4.2, we obtain the bound on the excess risk as follows:

where $C_{B,\delta,K}$ depends on $B$, $\delta$, $K$ and is independent of $d$, $r$, and $n$. Note that we use

Then, a fraction of the first term in (65) can be bounded as:

Then, we obtain the bound on the excess risk as:

This concludes the proof.

G PROOF OF APPROXIMATION

where $C''_{\eta,\alpha}$ is a constant dependent on $\eta$ and $\alpha$, and independent of $d$, $r$, $N$, $M$, and $f$.

Proof. By inequality (5.9) in Fang et al. (2020), we have

