APPROXIMATION AND NON-PARAMETRIC ESTIMATION OF FUNCTIONS OVER HIGH-DIMENSIONAL SPHERES VIA DEEP RELU NETWORKS

Abstract

We develop a new approximation and statistical estimation analysis of deep feedforward neural networks (FNNs) with the Rectified Linear Unit (ReLU) activation. The functions of interest for approximation and estimation are assumed to be from Sobolev spaces defined over the d-dimensional unit sphere with smoothness index r > 0. In the regime where r is of constant order (i.e., r = O(1)), we show that at most d^d active parameters are required to achieve a d^{-C} approximation rate for some constant C > 0. In the regime where the index r grows in the order of d (i.e., r = O(d)) asymptotically, we prove that the approximation error decays at the rate d^{-d^β} with 0 < β < 1, up to a constant factor independent of d, and the required number of active parameters in the network increases only polynomially in d as d → ∞. It is also shown that the bound on the excess risk carries a d^d factor when r = O(1), whereas it carries a d^{O(1)} factor when r = O(d). We highlight our findings by comparing them to results on the approximation and estimation errors of deep ReLU FNNs when the functions are from Sobolev spaces defined over the d-dimensional cube. In that case, we show that, with the current state-of-the-art results, a d^d factor remains in both the approximation and estimation errors, regardless of the order of r.

1. INTRODUCTION

Neural networks have demonstrated tremendous success in the tasks of image classification (Krizhevsky et al., 2012; Long et al., 2015), pattern recognition (Silver et al., 2016), natural language processing (Graves et al., 2013; Bahdanau et al., 2015; Young et al., 2018), etc. The datasets used in these real-world applications frequently lie in high-dimensional spaces (Wainwright, 2019). In this paper, we try to understand the fundamental limits of neural networks in the high-dimensional regime through the lens of their approximation power and generalization error. Both the approximation power and the generalization error of a neural network can be analyzed by specifying the target function's properties, such as its smoothness index r > 0 and its input space X. In particular, deep feed-forward neural networks (FNNs) with Rectified Linear Units (ReLU) have been extensively studied when used for approximating and estimating functions from a general function class such as the Sobolev class defined on the d-dimensional cube (i.e., X := C^d), denoted W^r_p(C^d) for 1 ≤ p ≤ ∞. However, in practice, signals on a spherical surface (i.e., X := S^{d-1} = {x ∈ R^d : ∥x∥_2 = 1}) rather than on Euclidean spaces often arise in various fields, such as astrophysics (Starck et al., 2006; Wiaux et al., 2005), computer vision (Brechbühler et al., 1995), and medical imaging (Yu et al., 2007). Motivated by this, we focus our attention on the case where deep ReLU FNNs are used as function approximators and estimators for functions from Sobolev spaces defined over S^{d-1}; that is, f ∈ W^r_∞(S^{d-1}). Under this setting, our analysis focuses on how the input dimension d explicitly affects the approximation and estimation rates for f ∈ W^r_∞(S^{d-1}). At the same time, we show how the scalability of deep ReLU FNNs grows in the high-dimensional regime.
Here, scalability is mainly measured through three metrics of the network: (1) the width, denoted W, (2) the depth, denoted L, and (3) the number of active parameters, denoted N (Anthony & Bartlett, 1999). It should be emphasized that we find an interaction between the smoothness index r > 0 and the dimension d, whereas no such interaction appears when f ∈ W^r_∞(C^d). We summarize our detailed findings in the following subsection.

Table 1: Comparison of the estimation results for W^r_∞(S^{d-1}) (Theorem 4.3) and W^r_∞([0,1]^d) (Theorem 4.4).

                          Theorem 4.3                                                     Theorem 4.4
Function class            W^r_∞(S^{d-1})                                                  W^r_∞([0,1]^d)
Smoothness r              O(d)                        O(1)                                ∀ r > 0
Upper bound on N          O(nd)                       O(nd)                               Õ((d + r)^d)
Estimation error rate     Õ(d^C · n^{-4r/(4r+3d)})   Õ((6/(πe))^{d/2} · d^d · n^{-4r/(4r+3d)})   Õ((d + r)^d · n^{-2r/(2r+d)})
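The three scalability metrics above can be made concrete for a plain fully-connected ReLU network. The following sketch counts the weights and biases of such a network with input dimension d, L hidden layers of width W, and a scalar output; the function name and the layer layout are illustrative assumptions, not the paper's construction (which may use sparse connectivity, so N can be far smaller than this dense count).

```python
def relu_fnn_param_count(d, W, L):
    """Count weights and biases of a dense fully-connected ReLU network:
    input dimension d, L hidden layers of width W, scalar output.
    This dense count upper-bounds the number of active parameters N."""
    n = d * W + W               # input layer: d*W weights + W biases
    n += (L - 1) * (W * W + W)  # hidden-to-hidden layers
    n += W + 1                  # output layer: W weights + 1 bias
    return n

print(relu_fnn_param_count(d=8, W=16, L=3))
```

For sparse networks, N counts only the nonzero parameters, which is why bounds such as N = O(d^2) are possible even when the dense count above is much larger.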

1.1. PAPER ROAD MAP AND CONTRIBUTIONS

In Theorem 3.1, we provide an approximation bound for a deep ReLU FNN (denoted f̂) approximating target functions in Sobolev spaces defined over the sphere (i.e., f ∈ W^r_∞(S^{d-1})). Notably, the bound tracks the explicit dependence on the data dimension d, allowing it to tend to infinity. This tracking reveals how the three components of the network architecture, width (W), depth (L), and number of active parameters (N), should change as d increases in order to obtain a good approximation error rate. Our result implies that for approximating f ∈ W^r_∞(S^{d-1}), the larger the smoothness index r, the narrower the network can be, while the depth of the network can remain fixed. Moreover, when r is of the same order as d, the network can avoid the curse of dimensionality, requiring only O(d^2) active parameters. It is interesting to note that the function smoothness index can affect the design of the network, specifically its width, while it has little effect on the design of the depth. Admittedly, the condition r = O(d) is restrictive in the sense that it makes the function space W^r_∞(S^{d-1}) small. Nonetheless, it contains some interesting examples: namely, reproducing kernel Hilbert spaces (RKHSs) generated by C^∞ kernels, such as Gaussian kernels. Additionally, to the best of our knowledge, this phenomenon has not been observed in the existing neural network approximation theory literature for f ∈ W^r_∞(C^d), where C^d denotes a d-dimensional cube and f̂ is a deep ReLU FNN. Out of the long list of works introduced shortly, we choose the result of Schmidt-Hieber (2020) for comparison, as it also makes the dependence on d explicit in its approximation bound. From their result, it can be seen that the curse cannot be avoided, even when r = O(d): the width of their constructed network is lower-bounded by Ω(r^d ∨ e^d) and the number of active parameters is upper-bounded by O((r + d)^d).¹
Note that the bounds on both components grow exponentially in d as r increases; see Subsection 3.1 for detailed comparisons. We further compare estimating f ∈ W^r_∞(S^{d-1}) (Theorem 4.3) versus f ∈ W^r_∞(C^d) (Theorem 4.4) via deep ReLU FNNs under the non-parametric regression framework. Given n noisy samples, the two theorems suggest the specific orders of W, L, and N in terms of n, d, and r for which they give the tightest bound on the excess risk of the respective function estimator from Proposition 4.2. When r = O(1), it is shown that the excess risk upper-bound carries a d^d factor, whereas it carries only a d^{O(1)} factor when r = O(d).



¹ Interested readers can find the intuitive technical reason for the exponential dependence on d of the width W and the number of active parameters N in Appendix A.



Here, C > 0 is a universal constant, and the notation Õ(·) hides logarithmic factors in n. Note that the upper bounds on N in Theorem 4.3 (i.e., N = O(Md)) follow from Theorem 3.1 with the choice M = ⌈n^{3d/(3d+4r)}⌉.
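The choice of M and the resulting excess-risk rate from Table 1 are straightforward to evaluate numerically. The sketch below is an illustration under the notation of the text; the function names are ours, and the constants hidden by Õ(·) are ignored.

```python
import math

def grid_size_M(n, d, r):
    """M = ceil(n^{3d/(3d+4r)}), the choice plugged into Theorem 3.1
    to obtain the bound N = O(Md) in Theorem 4.3."""
    return math.ceil(n ** (3 * d / (3 * d + 4 * r)))

def sphere_rate_exponent(d, r):
    """Exponent of n in the sphere estimation rate n^{-4r/(4r+3d)}."""
    return 4 * r / (4 * r + 3 * d)

def cube_rate_exponent(d, r):
    """Exponent of n in the cube estimation rate n^{-2r/(2r+d)}."""
    return 2 * r / (2 * r + d)

n, d, r = 10_000, 10, 10  # illustrative values with r = O(d)
print(grid_size_M(n, d, r), sphere_rate_exponent(d, r), cube_rate_exponent(d, r))
```

Note that both exponents degrade toward 0 as d grows with r fixed, which is the curse of dimensionality in the rate; keeping r proportional to d keeps the exponents bounded away from 0.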

As a corollary of Theorem 3.1, we show how the order of the function smoothness r affects the scale of the network in terms of d. Specifically, when r = O(1), we show that the constructed network f̂ requires W = O(d^d), L = O(d^γ log_2 d) for 0 < γ < 1, and N = O(d^{d+1}) to obtain a d^{-O(1)} approximation error, up to constant factors independent of d. Furthermore, when r = O(d), we show that only W = O(d^α), L = O(d^γ log_2 d), and at most N = O(d^2) are required to obtain the sharp approximation rate O(d^{-d^β}) for 0 < α, β < 1. See Corollary 3.3 for the detailed statement of the result.
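The gap between the two regimes is easy to see numerically: the rate d^{-O(1)} attainable when r = O(1) decays polynomially in d, while d^{-d^β} decays super-polynomially. The sketch below compares the two (function names and the specific choices C = 1, β = 0.5 are illustrative assumptions, not values fixed by the corollary).

```python
def slow_rate(d, C=1.0):
    """d^{-C}: approximation error order attainable when r = O(1)."""
    return d ** (-C)

def fast_rate(d, beta=0.5):
    """d^{-d^beta}: sharper order attainable when r = O(d), 0 < beta < 1."""
    return d ** (-(d ** beta))

for d in (4, 16, 64):
    print(d, slow_rate(d), fast_rate(d))
```

For instance, at d = 16 with β = 0.5 the fast rate is 16^{-4}, already four orders of magnitude below the slow rate 16^{-1}, and the gap widens rapidly with d.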

