LEARNING RELU NETWORKS TO HIGH UNIFORM ACCURACY IS INTRACTABLE

Abstract

Statistical learning theory provides bounds on the number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications (for example, in a security-critical context or for problems in the computational sciences), accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture.

1. INTRODUCTION

The basic goal of supervised learning is to determine a function $u : [0,1]^d \to \mathbb{R}$ from (possibly noisy) samples $(u(x_1), \dots, u(x_m))$. As the function $u$ can take arbitrary values between these samples, this problem is, of course, not solvable without further information on $u$. In practice, one typically leverages domain knowledge to estimate the structure and regularity of $u$ a priori, for instance, in terms of symmetries, smoothness, or compositionality. Such additional information can be encoded via a suitable target class $U \subset C([0,1]^d)$ that $u$ is known to be a member of. We are interested in the optimal accuracy for reconstructing $u$ that can be achieved by any algorithm which utilizes $m$ point samples. To make this mathematically precise, we assume that this accuracy is measured by a norm $\|\cdot\|_Y$ of a suitable Banach space $Y \supset U$. Formally, an algorithm can thus be described by a map $A : U \to Y$ that can query the function $u$ at $m$ points $x_i$ and that outputs a function $A(u)$ with $A(u) \approx u$ (see Section 2.1 for a precise definition that incorporates adaptivity and stochasticity). We will be interested in upper and lower bounds on the accuracy that can be reached by any such algorithm; equivalently, we are interested in the minimal number $m$ of point samples needed for any algorithm to achieve a given accuracy $\varepsilon$ for every $u \in U$. This $m$ establishes a fundamental benchmark on the sample complexity (and the algorithmic complexity) of learning functions in $U$ to a given accuracy. The choice of the Banach space $Y$, in other words how we measure accuracy, is crucial here.
For example, statistical learning theory provides upper bounds on the optimal accuracy in terms of an expected loss, i.e., with respect to $Y = L^2([0,1]^d, d\mathbb{P})$, where $\mathbb{P}$ is a (generally unknown) data-generating distribution (Devroye et al., 2013; Shalev-Shwartz & Ben-David, 2014; Mohri et al., 2018; Kim et al., 2021). This offers a powerful approach to ensure a small average reconstruction error. However, there are many important scenarios where such bounds on the accuracy are not sufficient and one would like to obtain an approximation of $u$ that is close to $u$ not only on average, but that can be guaranteed to be close for every $x \in [0,1]^d$. This includes several applications in the sciences, for example in the context of the numerical solution of partial differential equations (Raissi et al., 2019; Han et al., 2018; Richter & Berner, 2022), any security-critical application, for example, facial ID authentication schemes (Guo & Zhang, 2019), as well as any application with a distribution shift, i.e., where the data-generating distribution is different from the distribution in which the accuracy is measured (Quiñonero-Candela et al., 2008).

Figure 1: Even though the training of neural networks from data samples may achieve a small error on average, there are typically regions in the input space where the pointwise error is large. The target function in this plot is given by $x \mapsto \log(\sin(50x) + 2) + \sin(5x)$ (based on Adcock & Dexter, 2021) and the model is a feed-forward neural network. It is trained on $m = 1000$ uniformly distributed samples according to the hyperparameters in Tables 1 and 2 and achieves final $L^1$ and $L^\infty$ errors of $2.8 \cdot 10^{-3}$ and $0.19$, respectively. The middle and right plots are zoomed versions of the left plot.
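The gap between average and uniform error illustrated in Figure 1 can be sketched in a few lines. The snippet below is a stand-in experiment, not the paper's setup: it fits the Figure 1 target with a Chebyshev least-squares polynomial (degree 20, an arbitrary choice) instead of a trained network, and compares the empirical $L^1$ and $L^\infty$ errors on a fine grid.

```python
import numpy as np

# Target function from Figure 1 (based on Adcock & Dexter, 2021).
def u(x):
    return np.log(np.sin(50 * x) + 2) + np.sin(5 * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=1000)   # m = 1000 uniform samples

# Least-squares fit with a degree-20 Chebyshev polynomial (a stand-in
# for the trained network in Figure 1; the degree is an arbitrary choice).
p = np.polynomial.Chebyshev.fit(x_train, u(x_train), 20)

x_test = np.linspace(0.0, 1.0, 10_000)
err = np.abs(u(x_test) - p(x_test))

l1_err, linf_err = err.mean(), err.max()
print(f"L1 error:   {l1_err:.3e}")
print(f"Linf error: {linf_err:.3e}")
# The worst-case (uniform) error strictly exceeds the average error.
assert linf_err > l1_err > 0
```

The size of the gap depends on the model class; the paper reports roughly two orders of magnitude for the trained network in Figure 1.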
Such applications can only be efficiently solved if there exists an efficient algorithm $A$ that achieves uniform accuracy, i.e., a small error $\sup_{u \in U} \|u - A(u)\|_{L^\infty([0,1]^d)}$ with respect to the uniform norm given by $Y = L^\infty([0,1]^d)$, i.e., $\|f\|_{L^\infty([0,1]^d)} := \operatorname{ess\,sup}_{x \in [0,1]^d} |f(x)|$. Inspired by recent successes of deep learning across a plethora of tasks in machine learning (LeCun et al., 2015) and, increasingly, the sciences (Jumper et al., 2021; Pfau et al., 2020), we will be particularly interested in the case where the target class $U$ consists of, or contains, realizations of (feed-forward) neural networks of a specific architecture. Neural networks have been proven and observed to be extremely powerful in terms of their expressivity, that is, their ability to accurately approximate large classes of complicated functions with only relatively few parameters (Elbrächter et al., 2021; Berner et al., 2022). However, it has also been repeatedly observed that training neural networks (e.g., fitting a neural network to data samples) to high uniform accuracy presents a big challenge: conventional training algorithms (such as SGD and its variants) often find neural networks that perform well on average (meaning that they achieve a small generalization error), but there are typically some regions in the input space where the error is large (Fiedler et al., 2023); see Figure 1 for an illustrative example. This phenomenon has been systematically studied on an empirical level by Adcock & Dexter (2021). It is also at the heart of several observed instabilities in the training of deep neural networks, including adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015) or so-called hallucinations emerging in generative modeling, e.g., in tomographic reconstructions (Bhadra et al., 2021) or machine translation (Müller et al., 2020).
Note that additional knowledge on the target functions could potentially help circumvent these issues; see Remark 1.3. However, for many applications, it is not possible to precisely describe the regularity of the target functions. We thus analyze the case where no additional information is given besides the fact that one aims to recover an (unknown) neural network of a specified architecture and regularization from given samples; i.e., we assume that $U$ contains a class of neural networks of a given architecture, subject to various regularization methods. This is satisfied in several applications of interest, e.g., model extraction attacks (Tramèr et al., 2016; He et al., 2022) and teacher-student settings (Mirzadeh et al., 2020; Xie et al., 2020). It is also in line with standard settings in the statistical query literature, in neural network identification, and in statistical learning theory (Anthony & Bartlett, 1999; Mohri et al., 2018); see Section 1.1. For such settings we can rigorously show that learning a class of neural networks is prone to instabilities. Specifically, any conceivable learning algorithm (in particular, any version of SGD) which recovers the neural network to high uniform accuracy needs intractably many samples.

Theorem 1.1. Suppose that $U$ contains all neural networks with $d$-dimensional input, ReLU activation function, $L$ layers of width up to $3d$, and coefficients bounded by $c$ in the $\ell^q$ norm. Assume that there exists an algorithm that reconstructs all functions in $U$ to uniform accuracy $\varepsilon$ from $m$ point samples. Then we have
$$m \ge \Big(\frac{\Omega}{32d}\Big)^d \cdot \varepsilon^{-d}, \quad \text{where} \quad \Omega := \begin{cases} \frac{1}{8 \cdot 3^{2/q}} \cdot c^L \cdot d^{1 - 2/q} & \text{if } q \le 2, \\[0.3em] \frac{1}{48} \cdot c^L \cdot (3d)^{(L-1)(1 - 2/q)} & \text{if } q \ge 2. \end{cases}$$

Theorem 1.1 is a special case of Theorem 2.2 (covering $Y = L^p([0,1]^d)$ for all $p \in [1, \infty]$, as well as network architectures with arbitrary width), which will be stated and proven in Section 2.3.
To give a concrete example, we consider the problem of learning neural networks with ReLU activation function, $L$ layers of width at most $3d$, and coefficients bounded by $c$ to uniform accuracy $\varepsilon = 1/1024$. According to our results, we would need at least $m \ge 2^d \cdot c^{dL} \cdot (3d)^{d(L-2)}$ samples; the sample complexity thus depends exponentially on the input dimension $d$, the network width, and the network depth, becoming intractable even for moderate values of $d, c, L$ (for $d = 15$, $c = 2$, and $L = 7$, the sample size $m$ would already have to exceed the estimated number of atoms in the universe). If, on the other hand, reconstruction were required only with respect to the $L^2$ norm, standard results in statistical learning theory (see, for example, Berner et al., 2020) show that $m$ only needs to depend polynomially on $d$. We conclude that uniform reconstruction is vastly harder than reconstruction with respect to the $L^2$ norm and, in particular, intractable. Our results are further corroborated by the numerical experiments presented in Section 3.

Remark 1.2. For other target classes $U$, uniform reconstruction is tractable (i.e., the number of required samples for recovery does not massively exceed the number of parameters defining the class). A simple example is univariate polynomials of degree less than $m$, which can be exactly determined from $m$ samples. One can show similar results for sparse multivariate polynomials using techniques from the field of compressed sensing (Rauhut, 2007). Further, one can show that approximation rates in suitable reproducing kernel Hilbert spaces with bounded kernel can be realized using point samples with respect to the uniform norm (Pozharska & Ullrich, 2022). Our results uncover an opposing behavior of neural network classes: there exist functions that can be arbitrarily well approximated (in fact, exactly represented) by small neural networks, but these representations cannot be inferred from samples.
Our results are thus highly specific to classes of neural networks.

Remark 1.3. Our results do not rule out the possibility that there exist training algorithms for neural networks that achieve high accuracy on some restricted class of target functions, if the knowledge about the target class can be incorporated into the algorithm design. For example, if it were known that the target function can be efficiently approximated by polynomials, one could first compute an approximating polynomial (using polynomial regression, which is tractable) and then represent the approximating polynomial by a neural network. The resulting numerical problem would, however, be very different from the way deep learning is used in practice, since most neural network coefficients (namely those corresponding to the approximating polynomial) would be fixed a priori. Our results apply to the situation where such additional information on the target class $U$ is not available and no problem-specific knowledge is incorporated into the algorithm design besides the network architecture and regularization procedure.

We also complement the lower bounds of Theorem 1.1 with corresponding upper bounds.

Theorem 1.4. Suppose that $U$ consists of all neural networks with $d$-dimensional input, ReLU activation function, $L$ layers of width at most $B$, and coefficients bounded by $c$ in the $\ell^q$ norm. Then there exists an algorithm that reconstructs all functions in $U$ to uniform accuracy $\varepsilon$ from $m$ point samples with
$$m \le C^d \cdot \varepsilon^{-d}, \quad \text{where} \quad C := \begin{cases} \sqrt{d} \cdot c^L & \text{if } q \le 2, \\[0.3em] d^{1 - 1/q} \cdot c^L \cdot B^{(L-1)(1 - 2/q)} & \text{if } q \ge 2. \end{cases}$$

Theorem 1.4 follows from Theorem 2.4, which will be stated in Section 2.4. We refer to Remark B.4 for a discussion of the gap between the upper and lower bounds.

Remark 1.5. Our setting allows an algorithm to choose the sample points $(x_1, \dots, x_m)$ in an adaptive way for each $u \in U$; see Section 2.1 for a precise definition of the class of adaptive (possibly randomized) algorithms.
This implies that even a very clever sampling strategy (as would be employed in active learning) cannot break the bounds established in this paper.

Remark 1.6. Our results also shed light on the impact of different regularization methods. While picking a stronger regularizer (e.g., a small value of $q$) yields quantitative improvements (in the sense of a smaller $\Omega$), the sample size $m$ required for approximation in $L^\infty$ can still increase exponentially with the input dimension $d$. However, this scaling is only visible for very small $\varepsilon$.
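The concrete bound from the example above ($q = \infty$, $\varepsilon = 1/1024$) can be checked with exact integer arithmetic; the comparison constant ($10^{80}$ atoms in the observable universe) is the usual rough estimate:

```python
# Evaluate the lower bound m >= 2^d * c^(d*L) * (3d)^(d*(L-2)) from the
# concrete example in the introduction, using exact integer arithmetic.
def sample_lower_bound(d: int, c: int, L: int) -> int:
    return 2**d * c**(d * L) * (3 * d)**(d * (L - 2))

m = sample_lower_bound(d=15, c=2, L=7)
print(f"required samples: ~10^{len(str(m)) - 1}")

# Roughly 10^80 atoms are estimated to exist in the observable universe;
# the required sample size exceeds this by many orders of magnitude.
assert m > 10**80
```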

1.1. RELATED WORK

Several other works have established "hardness" results for neural network training. For example, the seminal works by Blum & Rivest (1992) and Vu (1998) show that for certain architectures the training process can be NP-complete. By contrast, our results do not directly consider algorithm runtime at all; they are stronger in the sense of showing that even if it were possible to efficiently learn a neural network from samples, the necessary number of data points would be too large to be tractable. We also want to mention a series of hardness results in the setting of statistical query (SQ) algorithms; see, e.g., Chen et al. (2022); Diakonikolas et al. (2020); Goel et al. (2020b); Reyzin (2020); Song et al. (2017). For instance, Chen et al. (2022) show that any SQ algorithm capable of learning ReLU networks with two hidden layers and width $\mathrm{poly}(d)$ up to $L^2$ error $1/\mathrm{poly}(d)$ must use a number of samples that scales superpolynomially in $d$, or must use SQ queries with tolerance smaller than the reciprocal of any polynomial in $d$. In such SQ algorithms, the learner has access to an oracle that produces approximations (potentially corrupted by adversarial noise) of certain expectations $\mathbb{E}[h(X, u(X))]$, where $u$ is the unknown function to be learned, $X$ is a random variable representing the data, and $h$ is a function chosen by the learner (potentially subject to some restrictions, e.g., Lipschitz continuity). The possibility for the oracle to inject adversarial (instead of just stochastic) noise into the learning procedure, which does not entirely reflect the typical mathematical formulation of learning problems, is crucial for several of these results. We also mention that, due to this possibility of adversarial noise, not every gradient-based optimization method (for instance, SGD) is strictly speaking an SQ algorithm; see also the works by Goel et al. (2020a, Page 3) and Abbe et al. (2021) for a more detailed discussion.
There also exist hardness results for learning algorithms based on label queries (i.e., noise-free point samples), which constitutes a setting similar to ours. More precisely, Chen et al. (2022) show that ReLU neural networks with constant depth and polynomial size constraints are not efficiently learnable up to a small squared loss with respect to a Gaussian distribution. However, these existing hardness results concern the runtime of the algorithm and are contingent on several (difficult and unproven) conjectures from the area of cryptography (the decisional Diffie-Hellman assumption or the "Learning with Errors" assumption); the correctness of these conjectures would in particular imply that P ≠ NP. By contrast, our results are completely free of such assumptions and show that the considered problem is information-theoretically hard, not just computationally. As already hinted at in the introduction, our results further extend the broad literature on statistical learning theory (Anthony & Bartlett, 1999; Vapnik, 1999; Cucker & Smale, 2002b; Bousquet et al., 2003; Vapnik, 2013; Mohri et al., 2018). Specifically, we provide fully explicit upper and lower bounds on the sample complexity of (regularized) neural network hypothesis classes. In the context of PAC learning, we analyze the realizable case, where the target function is contained in the hypothesis class (Mohri et al., 2018, Theorem 3.20). Contrary to standard results, we do not pose any assumptions, such as IID, on the data distribution, and even allow for adaptive sampling. Moreover, we analyze the complexity for all $L^p$ norms with $p \in [1, \infty]$, whereas classical results mostly deal with the squared loss. As an example of such classical results, we mention that (bounded) hypothesis classes with finite pseudodimension $D$ can be learned to squared $L^2$ loss $\varepsilon$ with $O(D \cdot \varepsilon^{-2})$ point samples; see, e.g., Mohri et al. (2018, Theorem 11.8).
Bounds for the pseudodimension of neural networks are readily available in the literature; see, e.g., Bartlett et al. (2019). These bounds imply that learning ReLU networks in $L^2$ is tractable, in contrast to the $L^\infty$ setting. Another related area is the identification of (equivalence classes of) neural network parameters from their input-output maps. While most works focus on scenarios where one has access to an infinite number of queries (Fefferman & Markel, 1993; Vlačić & Bölcskei, 2022), there are recent results employing only finitely many samples (Rolnick & Kording, 2020; Fiedler et al., 2023). Robust identification of the neural network parameters is sufficient to guarantee uniform accuracy, but it is not a necessary condition: proximity of input-output maps does not necessarily imply proximity of the corresponding neural network parameters (Berner et al., 2019). More generally, our results show that efficient identification from samples cannot be possible unless (as done in the previously mentioned works) further prior information is incorporated. In the same spirit, this restricts the applicability of model extraction attacks, such as model inversion or evasion attacks (Tramèr et al., 2016; He et al., 2022). Our results are most closely related to recent results by Grohs & Voigtlaender (2021), where target classes consisting of neural network approximation spaces are considered. Those results, however, are purely asymptotic; since the asymptotic behavior incurred by the rate is often only visible for very fine accuracies, they cannot be applied to obtain concrete lower bounds on the required sample size. Our results are completely explicit in all parameters and readily yield practically relevant bounds. They also elucidate the role of adaptive sampling and different regularization methods.

1.2. NOTATION

For $d \in \mathbb{N}$, we denote by $C([0,1]^d)$ the space of continuous functions $f : [0,1]^d \to \mathbb{R}$. For a finite set $I$ and $(a_i)_{i \in I} \in \mathbb{R}^I$, we write $\fint_{i \in I} a_i := \frac{1}{|I|} \sum_{i \in I} a_i$ for the average. For $m \in \mathbb{N}$, we write $[m] := \{1, \dots, m\}$. For $A \subset \mathbb{R}^d$, we denote by $A^\circ$ the set of interior points of $A$. For any subset $A$ of a vector space $V$, any $c \in \mathbb{R}$, and any $y \in V$, we further define $y + c \cdot A := \{y + ca : a \in A\}$. For a matrix $W \in \mathbb{R}^{n \times k}$ and $q \in [1, \infty)$, we write $\|W\|_q := \big(\sum_{i,j} |W_{i,j}|^q\big)^{1/q}$, and for $q = \infty$ we write $\|W\|_\infty := \max_{i,j} |W_{i,j}|$. For vectors $b \in \mathbb{R}^n$, we use the analogously defined notation $\|b\|_q$.
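The entrywise matrix norms just defined can be sketched directly; the helper below is an illustration of the definition, not part of the paper's code:

```python
import math

def entrywise_norm(W, q):
    """Entrywise l^q norm of a matrix W (given as a list of rows)."""
    entries = [abs(w) for row in W for w in row]
    if q == math.inf:
        return max(entries)                      # ||W||_inf = max |W_ij|
    return sum(w**q for w in entries) ** (1 / q)  # (sum |W_ij|^q)^(1/q)

W = [[3.0, 0.0], [0.0, 4.0]]
assert entrywise_norm(W, 2) == 5.0           # (3^2 + 4^2)^(1/2)
assert entrywise_norm(W, 1) == 7.0           # 3 + 4
assert entrywise_norm(W, math.inf) == 4.0    # max |W_ij|
```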

2. MAIN RESULTS

This section contains our main theoretical results. We introduce the considered classes of algorithms in Section 2.1 and target classes in Section 2.2. Our main lower and upper bounds are formulated and proven in Section 2.3 and Section 2.4, respectively.

2.1. ADAPTIVE (RANDOMIZED) ALGORITHMS BASED ON POINT SAMPLES

As described in the introduction, our goal is to analyze how well one can recover an unknown function $u$ from a target class $U$ in a Banach space $Y$ based on point samples. This is one of the main problems in information-based complexity (Traub, 2003), and in this section we briefly recall the most important related notions. Given $U \subset C([0,1]^d) \cap Y$ for a Banach space $Y$, we say that a map $A : U \to Y$ is an adaptive deterministic method using $m \in \mathbb{N}$ point samples if there are $f_1 \in [0,1]^d$, mappings $f_i : ([0,1]^d)^{i-1} \times \mathbb{R}^{i-1} \to [0,1]^d$ for $i = 2, \dots, m$, and $Q : ([0,1]^d)^m \times \mathbb{R}^m \to Y$ such that for every $u \in U$, using the point sequence $x(u) = (x_1, \dots, x_m) \subset [0,1]^d$ defined by $x_1 = f_1$ and $x_i = f_i(x_1, \dots, x_{i-1}, u(x_1), \dots, u(x_{i-1}))$ for $i = 2, \dots, m$, the map $A$ is of the form $A(u) = Q(x_1, \dots, x_m, u(x_1), \dots, u(x_m)) \in Y$. The set of all deterministic methods using $m$ point samples is denoted by $\mathrm{Alg}_m(U, Y)$.

In addition to such deterministic methods, we also study randomized methods, defined as follows: a tuple $(A, \mathbf{m})$ is called an adaptive random method using $m \in \mathbb{N}$ point samples on average if $A = (A_\omega)_{\omega \in \Omega}$, where $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and $\mathbf{m} : \Omega \to \mathbb{N}$ is such that the following conditions hold:

1. $\mathbf{m}$ is measurable and $\mathbb{E}[\mathbf{m}] \le m$;
2. for every $u \in U$, the map $\omega \mapsto A_\omega(u)$ is measurable with respect to the Borel $\sigma$-algebra on $Y$;
3. for every $\omega \in \Omega$, it holds that $A_\omega \in \mathrm{Alg}_{\mathbf{m}(\omega)}(U, Y)$.

The set of all random methods using $m$ point samples on average will be denoted by $\mathrm{Alg}^{MC}_m(U, Y)$, since such methods are sometimes called Monte Carlo (MC) algorithms. For a target class $U$, we define the optimal (randomized) error as
$$\mathrm{err}^{MC}_m(U, Y) := \inf_{(A, \mathbf{m}) \in \mathrm{Alg}^{MC}_m(U, Y)} \ \sup_{u \in U} \ \mathbb{E}\big[\|u - A_\omega(u)\|_Y\big].$$
We note that $\mathrm{Alg}_m(U, Y) \subset \mathrm{Alg}^{MC}_m(U, Y)$, since each deterministic method can be interpreted as a randomized method over a trivial probability space.
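To make the notion of adaptivity concrete, the following sketch implements one adaptive deterministic method for $d = 1$: each query point $x_i$ depends on the previously observed values, and the output map $Q$ is a piecewise linear interpolant. The bisection rule is an illustrative choice of the maps $f_i$, not taken from the paper.

```python
# A minimal sketch of an adaptive deterministic method in Alg_m(U, Y) for
# d = 1: sample points depend on previously observed function values.
def adaptive_method(u, m):
    xs, ys = [0.0, 1.0], [u(0.0), u(1.0)]          # x_1, x_2 fixed a priori
    for _ in range(m - 2):
        # f_i: bisect the interval with the largest jump in observed values
        # (adaptivity: the choice depends on u(x_1), ..., u(x_{i-1})).
        k = max(range(len(xs) - 1), key=lambda j: abs(ys[j + 1] - ys[j]))
        x_new = 0.5 * (xs[k] + xs[k + 1])
        xs.insert(k + 1, x_new)
        ys.insert(k + 1, u(x_new))

    def Q(t):  # output map: piecewise linear interpolant of the m samples
        for j in range(len(xs) - 1):
            if xs[j] <= t <= xs[j + 1]:
                lam = (t - xs[j]) / (xs[j + 1] - xs[j])
                return (1 - lam) * ys[j] + lam * ys[j + 1]
        return ys[-1]
    return Q

A_u = adaptive_method(lambda x: abs(x - 0.3), m=64)
assert abs(A_u(0.55) - 0.25) < 0.05   # accurate away from the kink
```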

2.2. NEURAL NETWORK CLASSES

We will be concerned with target classes related to ReLU neural networks, which we define in the present subsection. Let $\varrho : \mathbb{R} \to \mathbb{R}$, $\varrho(x) = \max\{0, x\}$, be the ReLU activation function. Given a depth $L \in \mathbb{N}$, an architecture $(N_0, N_1, \dots, N_L) \in \mathbb{N}^{L+1}$, and neural network coefficients $\Phi = \big((W_i, b_i)\big)_{i=1}^L \in \prod_{i=1}^L \big(\mathbb{R}^{N_i \times N_{i-1}} \times \mathbb{R}^{N_i}\big)$, we define their realization $R(\Phi) \in C(\mathbb{R}^{N_0}, \mathbb{R}^{N_L})$ as
$$R(\Phi) := \varphi_L \circ \varrho \circ \varphi_{L-1} \circ \cdots \circ \varrho \circ \varphi_1,$$
where $\varrho$ is applied componentwise and $\varphi_i : \mathbb{R}^{N_{i-1}} \to \mathbb{R}^{N_i}$, $x \mapsto W_i x + b_i$, for $i \in [L]$. Given $c > 0$ and $q \in [1, \infty]$, define the class
$$\mathcal{H}^q_{(N_0, \dots, N_L), c} := \Big\{ R(\Phi) : \Phi \in \textstyle\prod_{i=1}^L \big(\mathbb{R}^{N_i \times N_{i-1}} \times \mathbb{R}^{N_i}\big) \text{ and } \|\Phi\|_q \le c \Big\},$$
where $\|\Phi\|_q := \max_{1 \le i \le L} \max\{\|W_i\|_q, \|b_i\|_q\}$. To study target classes related to neural networks, the following definition will be useful.

Definition 2.1. Let $U, H \subset C([0,1]^d)$. We say that $U$ contains a copy of $H$, attached to $u_0 \in U$ with constant $c_0 \in (0, \infty)$, if $u_0 + c_0 \cdot H \subset U$.
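As a concrete check of these definitions, the following sketch implements the realization map $R(\Phi)$ and the coefficient norm $\|\Phi\|_q$. The hat-function network is a hand-picked example for illustration, not taken from the paper.

```python
import numpy as np

# Sketch of the realization map R(Phi) and the coefficient norm ||Phi||_q
# from Section 2.2 (entrywise l^q norms, ReLU applied componentwise).
def realize(phi):
    """phi: list of (W_i, b_i) pairs; returns the function R(Phi)."""
    def network(x):
        for i, (W, b) in enumerate(phi):
            x = W @ x + b
            if i < len(phi) - 1:          # no activation after the last layer
                x = np.maximum(x, 0.0)    # ReLU, componentwise
        return x
    return network

def coeff_norm(phi, q):
    def lq(a):
        return np.abs(a).max() if q == np.inf else (np.abs(a)**q).sum()**(1/q)
    return max(max(lq(W), lq(b)) for W, b in phi)

# A two-layer network realizing the hat function x -> 1 - |2x - 1| on [0, 1].
W1, b1 = np.array([[2.0], [2.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([[1.0, -2.0]]), np.array([0.0])
hat = realize([(W1, b1), (W2, b2)])

assert hat(np.array([0.5]))[0] == 1.0   # peak at x = 1/2
assert hat(np.array([0.0]))[0] == 0.0   # zero at the endpoints
assert coeff_norm([(W1, b1), (W2, b2)], np.inf) == 2.0
```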

2.3. LOWER BOUND

The following result constitutes the main result of the present paper; Theorem 1.1 readily follows from it as a special case.

Theorem 2.2. Let $L \in \mathbb{N}_{\ge 3}$, $d, B \in \mathbb{N}$, $p, q \in [1, \infty]$, and $c \in (0, \infty)$. Suppose that the target class $U \subset C([0,1]^d)$ contains a copy of $\mathcal{H}^q_{(d, B, \dots, B, 1), c}$ with constant $c_0 \in (0, \infty)$, where the $B$ in $(d, B, \dots, B, 1)$ appears $L - 1$ times. Then, for any $s \in \mathbb{N}$ with $s \le \min\{B/3, d\}$, we have
$$\mathrm{err}^{MC}_m\big(U, L^p([0,1]^d)\big) \ge \frac{c_0 \cdot \Omega}{(32s)^{1 + s/p}} \cdot m^{-\frac{1}{p} - \frac{1}{s}}, \quad \text{where} \quad \Omega := \begin{cases} \frac{1}{8 \cdot 3^{2/q}} \cdot c^L \cdot s^{1 - 2/q} & \text{if } q \le 2, \\[0.3em] \frac{1}{48} \cdot c^L \cdot B^{(L-1)(1 - 2/q)} & \text{if } q \ge 2. \end{cases}$$

Proof. This follows by combining Theorem A.5 with Lemmas A.2 and A.3 in the appendix.

Remark 2.3. For $p \ll \infty$, the bound from above does not necessarily imply that an intractable number of training samples is needed. This reflects the fact that efficient learning is possible in this regime, at least if one only considers the number of training samples and not the runtime of the algorithm. Indeed, it is well known in statistical learning theory that one obtains learning bounds based on the entropy numbers (w.r.t. the $L^\infty$ norm) of the class of target functions when the error is measured in $L^2$; see, for instance, Cucker & Smale (2002a, Proposition 7). The $\varepsilon$-entropy numbers of a class of neural networks with $L$ layers and $w$ (bounded) weights scale linearly in $w$ and $L$ and logarithmically in $1/\varepsilon$, so that one obtains tractable $L^2$ learning bounds. By interpolation for $L^p$ norms (noting that in our case the target functions are bounded, so that the $L^\infty$ reconstruction error is bounded, even though its decay with $m$ is very slow), this also implies $L^p$ learning bounds, but these deteriorate as $p \to \infty$.
We remark that these learning bounds are based on empirical risk minimization, which might be computationally infeasible (Vu, 1998); since our lower bounds hold for any algorithm (irrespective of its computational complexity), one cannot expect an intractable lower bound for $p \ll \infty$ in our setting.

The idea of the proof of Theorem 2.2 (here only presented for $u_0 = 0$ and $s = d$, which implies $B \ge 3d$) is as follows:

1. We first show (see Lemmas A.2 and A.3) that the neural network set $\mathcal{H}^q_{(d, B, \dots, B, 1), c}$ contains a large class of "bump functions" of the form $\lambda \cdot \vartheta_{M, y}$. Here, $\vartheta_{M, y}$ is supported on the set $y + [-\frac{1}{M}, \frac{1}{M}]^d$ and satisfies $\|\vartheta_{M, y}\|_{L^p([0,1]^d)} \asymp M^{-d/p}$, where $M \in \mathbb{N}$ and $y \in [0,1]^d$ can be chosen arbitrarily; see Lemma A.1. The size of the scaling factor $\lambda = \lambda(M, c, q, d, L)$ depends crucially on the regularization parameters $c$ and $q$. This is the main technical part of the proof, requiring the construction of suitable neural networks adhering to the imposed $\ell^q$ restrictions on the weights for which $\lambda$ is as large as possible.
2. If one learns using $m$ point samples $x_1, \dots, x_m$ and if $M = O(m^{1/d})$, then a volume packing argument shows that there exists $y \in [0,1]^d$ such that $\vartheta_{M, y}(x_i) = 0$ for all $i \in [m]$. This means that the learner cannot distinguish the function $\lambda \cdot \vartheta_{M, y} \in \mathcal{H}^q_{(d, B, \dots, B, 1), c}$ from the zero function and will thus make an error of roughly $\|\lambda \cdot \vartheta_{M, y}\|_{L^p} \asymp \lambda \cdot M^{-d/p}$. This already implies the lower bound in Theorem 2.2 for the case of deterministic algorithms.
3. To get the lower bound for randomized algorithms using $m$ point samples on average, we employ a technique from information-based complexity (see, e.g., Heinrich, 1994): we again set $M = O(m^{1/d})$ and define $(y_\ell)_{\ell \in [M/2]^d}$ as the nodes of a uniform grid on $[0,1]^d$ with width $2/M$. Using a volume packing argument, we then show that for any choice of $m$ sampling points $x_1, \dots, x_m$, at least half of the functions $\vartheta_{M, y_\ell}$ avoid all the sampling points, i.e., for at least half of the indices $\ell$, it holds that $\vartheta_{M, y_\ell}(x_i) = 0$ for all $i \in [m]$. A learner using the samples $x_1, \dots, x_m$ can thus not distinguish between the zero function and $\lambda \cdot \vartheta_{M, y_\ell} \in \mathcal{H}^q_{(d, B, \dots, B, 1), c}$ for at least half of the indices $\ell$. Therefore, any deterministic algorithm will make an error of $\Omega(\lambda \cdot M^{-d/p})$ on average with respect to $\ell$.
4. Since each randomized algorithm $A = (A_\omega)_{\omega \in \Omega}$ is a collection of deterministic algorithms and since taking an average commutes with taking the expectation, this implies that any randomized algorithm will have an expected error of $\Omega(\lambda \cdot M^{-d/p})$ on average with respect to $\ell$. This easily implies the stated bound.

As mentioned in the introduction, we want to emphasize that well-trained neural networks can indeed exhibit such bump functions; see Figure 1.

2.4. UPPER BOUND

In this section we present our main upper bound, which directly implies Theorem 1.4.

Theorem 2.4. Let $L, d \in \mathbb{N}$, $q \in [1, \infty]$, $c \in (0, \infty)$, and $N_1, \dots, N_{L-1} \in \mathbb{N}$. Then we have
$$\mathrm{err}^{MC}_m\big(\mathcal{H}^q_{(d, N_1, \dots, N_{L-1}, 1), c}, \ L^\infty([0,1]^d)\big) \le \begin{cases} \sqrt{d} \cdot c^L \cdot m^{-1/d} & \text{if } q \le 2, \\[0.3em] \sqrt{d} \cdot c^L \cdot \big(\sqrt{d} \cdot N_1 \cdots N_{L-1}\big)^{1 - 2/q} \cdot m^{-1/d} & \text{if } q \ge 2. \end{cases}$$

Proof. This follows by combining Lemmas B.2 and B.3 in the appendix.

Let us outline the main idea of the proof. We first show that each neural network $R(\Phi) \in \mathcal{H}^q_{(N_0, \dots, N_L), c}$ is Lipschitz continuous, where the Lipschitz constant can be conveniently bounded in terms of the parameters $N_0, \dots, N_L$, $c$, and $q$; see Lemma B.2 in the appendix. In Lemma B.3, we then show that any function with moderate Lipschitz constant can be reconstructed from samples by piecewise constant interpolation.
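The piecewise constant reconstruction underlying the upper bound can be sketched as follows. This is an illustration of the general mechanism of Lemma B.3 (sample on a uniform grid, output the value at the nearest grid center), not the paper's exact construction; the example function and its Lipschitz constant are hypothetical choices.

```python
import numpy as np

# Sample a Lipschitz function on a uniform grid with k^d cells and
# reconstruct by the value at the nearest cell center. Each coordinate of x
# is within h/2 of the chosen center, so for a function that is 1-Lipschitz
# w.r.t. the l^1 distance the sup error is at most d * h / 2.
def reconstruct_nearest(f, k):
    h = 1.0 / k
    centers = (np.arange(k) + 0.5) * h            # cell centers per axis
    def approx(x):
        idx = np.clip(np.floor(np.asarray(x) / h).astype(int), 0, k - 1)
        return f(centers[idx])                    # value at nearest center
    return approx

f = lambda x: abs(x[0] - 0.3) + abs(x[1] - 0.7)   # 1-Lipschitz w.r.t. l^1
k, d = 64, 2                                      # m = k^d = 4096 samples
A_f = reconstruct_nearest(f, k)

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(2000, d))
sup_err = max(abs(f(x) - A_f(x)) for x in pts)
assert sup_err <= d * (1.0 / k) / 2 + 1e-12       # Lip * d * h / 2
```

With $m = k^d$ samples the grid width is $h = m^{-1/d}$, which is the source of the $m^{-1/d}$ rate in Theorem 2.4; the remaining factors there bound the Lipschitz constant of the network.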

3. NUMERICAL EXPERIMENTS

Having established fundamental bounds on the performance of any learning algorithm, we now numerically evaluate the performance of commonly used deep learning methods. To illustrate our main result in Theorem 2.2, we estimate the error in (2) by a tractable approximation in a student-teacher setting. Specifically, we estimate the minimal error over neural network target functions ("teachers") $\hat{U} \subset \mathcal{H}^q_{(d, N_1, \dots, N_{L-1}, 1), c}$ for deep learning algorithms $\hat{\mathcal{A}} \subset \mathrm{Alg}^{MC}_m(U, L^p)$ via Monte Carlo sampling, i.e.,
$$\widehat{\mathrm{err}}_m\big(\hat{U}, L^p; \hat{\mathcal{A}}\big) := \inf_{(A, \mathbf{m}) \in \hat{\mathcal{A}}} \ \sup_{u \in \hat{U}} \ \fint_{\omega \in \hat{\Omega}} \Big( \fint_{j \in [J]} \big|u(X_j) - A_\omega(u)(X_j)\big|^p \Big)^{1/p},$$
where the $(X_j)_{j=1}^J$ are independent evaluation samples uniformly distributed on $[-0.5, 0.5]^d$ and $\hat{\Omega}$ represents the seeds for the algorithms. We obtain teacher networks $u \in \mathcal{H}^\infty_{(d, N_1, \dots, N_{L-1}, 1), c}$ by sampling their coefficients $\Phi$ componentwise according to a uniform distribution on $[-c, c]$. For every algorithm $(A, \mathbf{m}) \in \hat{\mathcal{A}}$ and seed $\omega \in \hat{\Omega}$, we consider point sequences $x(u)$ uniformly distributed in $[-0.5, 0.5]^d$ with $\mathbf{m}(\omega) = m$. The corresponding point samples are used to train the coefficients of a neural network ("student") using the Adam optimizer (Kingma & Ba, 2015) with exponentially decaying learning rate. We consider input dimensions $d = 1$ and $d = 3$, for each of which we compute the error in (3) for 4 different sample sizes $m$ over 40 teacher networks $u$. For each combination, we train student networks with 3 different seeds, 3 different widths, and 3 different batch sizes. In summary, this yields $2 \cdot 4 \cdot 40 \cdot 3 \cdot 3 \cdot 3 = 8640$ experiments, each executed on a single GPU. The precise hyperparameters can be found in Tables 1 and 3 in Appendix C. Figure 2 shows that there is a clear gap between the errors $\widehat{\mathrm{err}}_m(\hat{U}, L^p; \hat{\mathcal{A}})$ for $p \in \{1, 2\}$ and $p = \infty$. Especially in the one-dimensional case, the error $\widehat{\mathrm{err}}_m(\hat{U}, L^\infty; \hat{\mathcal{A}})$ also seems to stagnate with growing $m$ at a precision that might be insufficient for certain applications.
Figure 3 illustrates that the errors are caused by spikes of the teacher network which are not covered by any sample. Note that this is very similar to the construction in the proof of our main result; see Section 2.3. In general, the rates worsen when considering more teacher networks $\hat{U}$ and improve when considering further deep learning algorithms $\hat{\mathcal{A}}$, including other architectures or more elaborate training and sampling schemes. Note, however, that each setting needs to be evaluated for a number of teacher networks, sample sizes, and seeds. We provide an extensible implementation in PyTorch (Paszke et al., 2019) featuring multi-node experiment execution and hyperparameter tuning using Ray Tune (Liaw et al., 2018), experiment tracking using Weights & Biases and TensorBoard, and flexible experiment configuration. Building upon our work, research teams with sufficient computational resources can provide further numerical evidence on an even larger scale.
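The core of the Monte Carlo error estimate can be sketched without any training loop. Below, "teacher" and "student" are simple stand-in functions (the student differs from the teacher only by a narrow spike, mimicking Figure 3), not trained networks; the empirical $L^p$ errors then increase with $p$ by power-mean monotonicity.

```python
import numpy as np

# Empirical L^p errors of a student against a teacher on J uniform
# evaluation samples, as in the Monte Carlo estimate of Section 3.
rng = np.random.default_rng(2)
X = rng.uniform(-0.5, 0.5, size=(10_000, 1))      # J evaluation samples

spike = lambda t: np.maximum(0.0, 1.0 - 200.0 * np.abs(t - 0.25))
teacher = lambda x: np.sin(8 * x[:, 0])
student = lambda x: teacher(x) + 0.01 * spike(x[:, 0])  # narrow spike error

residual = np.abs(teacher(X) - student(X))
err = {p: (residual**p).mean() ** (1 / p) for p in (1, 2)}
err[np.inf] = residual.max()

# Power means are monotone in p, and the localized spike makes the
# uniform error much larger than the average errors.
assert err[1] <= err[2] <= err[np.inf]
print({k: f"{v:.2e}" for k, v in err.items()})
```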

4. DISCUSSION AND LIMITATIONS

Discussion. We derived fundamental upper and lower bounds on the number of samples needed for any algorithm to reconstruct an arbitrary function from a target class containing realizations of neural networks with ReLU activation function of a given architecture, subject to $\ell^q$ regularization constraints on the network coefficients; see Theorems 2.2 and 2.4. These bounds are completely explicit in the network architecture, the type of regularization, and the norm in which the reconstruction error is measured. We observe that our lower bounds are severely more restrictive if the error is measured in the uniform $L^\infty$ norm rather than the (more commonly studied) $L^2$ norm. In particular, learning a class of neural networks with ReLU activation function to moderately high accuracy in the $L^\infty$ norm is intractable even for moderate input dimensions, network widths, and depths. We anticipate that further investigations into the sample complexity of neural network classes can eventually contribute to a better understanding of the circumstances under which it is possible to design reliable deep learning algorithms, and help explain well-known instability phenomena such as adversarial examples. Such an understanding can be beneficial in assessing the potential and limitations of machine learning methods applied to security- and safety-critical scenarios.

Figure 2: Evaluation of the error in (3) for $p \in \{1, 2, \infty\}$, input dimensions $d \in \{1, 3\}$, sample sizes $m \in \{10^2, 10^3, 10^4, 10^5\}$, and hyperparameters given in Tables 1 and 3.

Figure 3: The middle and right plots are zoomed versions of the left plot. The $L^\infty$ error ($2.7 \cdot 10^{-3}$) is about one order of magnitude larger than the $L^2$ and $L^1$ errors ($3.9 \cdot 10^{-4}$ and $2.4 \cdot 10^{-4}$), which is caused by spikes of the teacher network between samples.

Limitations and Outlook. We finally discuss some possible implications and also limitations of our work.
First of all, our results are highly specific to neural networks with the ReLU activation function. We expect that obtaining similar results for other activation functions will require substantially new methods; we plan to investigate this in future work. The explicit nature of our results reveals a discrepancy between the lower and upper bounds, especially in high dimensions. We conjecture that neither the current upper bound nor the current lower bound is optimal; determining to what extent the bounds can be tightened is an interesting open problem. Our analysis is a worst-case analysis in the sense that we show that for any given algorithm $A$, there exists at least one $u$ in our target class $U$ on which $A$ performs poorly. Whether this poor behavior is actually generic will be studied in future work. One way to establish such generic results could be to prove that our considered target classes contain copies of neural network realizations attached to many different $u$'s. Finally, we consider target classes $U$ that contain all realizations of neural networks with a given architecture, subject to different regularizations. This can be justified as follows: whenever a deep learning method is employed to reconstruct a function $u$ by representing it approximately by a neural network (without further knowledge about $u$), a natural minimal requirement is that the method should perform well if the sought function is in fact equal to a neural network. However, if additional problem information about $u$ can be incorporated into the learning problem, it may be possible to overcome the barriers shown in this work. The degree to which this is possible, as well as the extension of our results to other architectures such as convolutional neural networks, transformers, and graph neural networks, will be the subject of future work.

$$\Lambda_{M,\sigma}\colon\ \mathbb{R} \to (-\infty, 1], \qquad t \mapsto \begin{cases} 0 & \text{if } t \le \sigma - \frac{1}{M}, \\ 1 - M \cdot |t - \sigma| & \text{if } t \ge \sigma - \frac{1}{M}, \end{cases} \tag{4}$$

and furthermore

Proof. Let us first give a quick overview of the proof.
The statement on the support of $\vartheta^{(s)}_{M,y}$ follows by observing that $\Delta^{(s)}_{M,y}(x) > 0$ can only happen if $\Lambda_{M,y_i}(x_i) > 0$ for all $i \in [s]$. Since $0 \le \vartheta^{(s)}_{M,y} \le 1$, the upper bound on the $L^p([0,1]^d)$ norm can then be estimated by the Lebesgue measure of the intersection of the support of $\vartheta^{(s)}_{M,y}$ with the hypercube $[0,1]^d$. For the lower bound, we compute the measure of the intersection with a subset of the support on which $\vartheta^{(s)}_{M,y} \ge \tfrac12$. Recall the definitions

$$\Delta^{(s)}_{M,y}\colon\ \mathbb{R}^d \to (-\infty, 1], \quad x \mapsto \Big( \sum_{i=1}^{s} \Lambda_{M,y_i}(x_i) \Big) - (s-1), \qquad \vartheta^{(s)}_{M,y}\colon\ \mathbb{R}^d \to [0,1], \quad x \mapsto \varrho\big(\Delta^{(s)}_{M,y}(x)\big).$$

We start by proving the statement on the support of $\vartheta^{(s)}_{M,y}$. If $\vartheta^{(s)}_{M,y}(x) \neq 0$, then $\Delta^{(s)}_{M,y}(x) > 0$, meaning $\sum_{i=1}^{s} \Lambda_{M,y_i}(x_i) > s - 1$. Because $\Lambda_{M,y_i}(x_i) \in (-\infty, 1]$ for all $i \in [s]$, this is only possible if $\Lambda_{M,y_i}(x_i) > 0$ for all $i \in [s]$. Directly from the definition of $\Lambda_{M,y_i}$ (see also Figure 4), this implies $|x_i - y_i| \le \frac{1}{M}$ for all $i \in [s]$, meaning $x \in y + (M^{-1}[-1,1]^s \times \mathbb{R}^{d-s})$. This proves the first claim.

Regarding the second claim, define $y^* := (y_1, \dots, y_s) \in \mathbb{R}^s$ and, for $k \in \mathbb{N}$, denote by $\lambda_k$ the Lebesgue measure on $\mathbb{R}^k$. Then, since $[0,1]^d \cap \operatorname{supp} \vartheta^{(s)}_{M,y} \subset (y^* + M^{-1}[-1,1]^s) \times [0,1]^{d-s}$ and $0 \le \vartheta^{(s)}_{M,y} \le 1$, we see that

$$\|\vartheta^{(s)}_{M,y}\|_{L^p([0,1]^d)} \le \big( \lambda_s\big( y^* + M^{-1}[-1,1]^s \big) \big)^{1/p} = \Big( \frac{2}{M} \Big)^{s/p} = 2^{s/p} \, M^{-s/p}.$$

For the converse estimate, let us also write $x^* = (x_1, \dots, x_s)$ for $x \in \mathbb{R}^d$. Then, if $x \in \mathbb{R}^d$ satisfies $x^* \in y^* + \frac{1}{2Ms}[-1,1]^s$, we see $y_i - \frac{1}{M} \le y_i - \frac{1}{2Ms} \le x_i \le y_i + \frac{1}{2Ms}$ for $i \in [s]$. By definition of $\Lambda_{M,y_i}$, this implies $\Lambda_{M,y_i}(x_i) = 1 - M \cdot |x_i - y_i| \ge 1 - \frac{1}{2s}$ and hence

$$\Delta^{(s)}_{M,y}(x) = \Big( \sum_{i=1}^{s} \Lambda_{M,y_i}(x_i) \Big) - (s-1) \ge s - \tfrac12 - (s-1) = \tfrac12,$$

so that $\vartheta^{(s)}_{M,y}(x) \ge \tfrac12$. Finally, it is not difficult to show that

$$\lambda_d\big( \{ x \in [0,1]^d : x^* \in y^* + \tfrac{1}{2Ms}[-1,1]^s \} \big) = \lambda_s\big( [0,1]^s \cap (y^* + \tfrac{1}{2Ms}[-1,1]^s) \big) \ge (2Ms)^{-s},$$

see Grohs & Voigtlaender (2021, Equation (A.1)) for the details.
Overall, we thus see $\|\vartheta^{(s)}_{M,y}\|_{L^p([0,1]^d)} \ge \frac12 \cdot (2Ms)^{-s/p}$.

Note that a compactly supported (non-trivial) function such as $\vartheta^{(s)}_{M,y}$ can only be represented by ReLU networks with more than two layers; see Blum & Li (1991, Section 3). For this reason, we focus on the case $L \in \mathbb{N}_{\ge 3}$ in this paper. Next, we show that scaled versions of the hat functions $\vartheta^{(s)}_{M,y}$ can be represented by neural networks of a suitable architecture and with a suitable bound on the magnitude of the coefficients. We begin with the (more interesting) case where the exponent $q$ that determines the regularization of the weights satisfies $q \ge 2$.

Lemma A.2. Let $d \in \mathbb{N}$, $L \in \mathbb{N}_{\ge 3}$, $B \in \mathbb{N}_{\ge 3}$, $c > 0$, $q \in [2, \infty]$, and $s \in \mathbb{N}$ with $s \le \min\{\frac{B}{3}, d\}$. Then there exists a constant $\lambda \ge c^L \cdot B^{(L-1)(1 - 2/q)} / 12$ such that

$$\nu \cdot \frac{\lambda}{Ms} \cdot \vartheta^{(s)}_{M,y} \in \mathcal{H}^{q}_{(d, B, \dots, B, 1), c} \qquad \forall\, M \in \mathbb{N},\ \nu \in \{\pm 1\},\ \text{and } y \in [0,1]^d,$$

where the $B$ in $(d, B, \dots, B, 1)$ appears $L - 1$ times.

Proof. Let $M \in \mathbb{N}$, $y \in [0,1]^d$, and $\nu \in \{\pm 1\}$ be fixed. We will construct the coefficients $((W_1, b_1), \dots, (W_L, b_L))$ of a neural network with the following properties:

1. The first two layers $((W_1, b_1), (W_2, b_2))$ output, in each of their $B$ output dimensions, the function $C_1 \cdot \Delta^{(s)}_{M,y}$ for a suitable scaling factor $C_1 = C_1(c, M, s, B, q) > 0$.

2. The subsequent ReLU activation then yields the output $C_1 \cdot \vartheta^{(s)}_{M,y}$ in each of the $B$ output dimensions.

Define $r := \lfloor B/(3s) \rfloor$, noting that $r \ge 1$ since $s \le B/3$. We first introduce some notation: we write $0_{k \times n}$ for the $k \times n$ matrix with all entries equal to zero; similarly, we write $1_{k \times n}$ for the $k \times n$ matrix with all entries equal to one. Furthermore, we denote by $(e_1, \dots, e_d)$ the standard basis of $\mathbb{R}^d$ and define $I_s := (e_1 | \cdots | e_s) \in \mathbb{R}^{d \times s}$,

$$\alpha := \Big( \tfrac{M^{-1} - y_1}{2} \,\Big|\, \tfrac{M^{-1} - y_2}{2} \,\Big|\, \cdots \,\Big|\, \tfrac{M^{-1} - y_s}{2} \Big) \in \mathbb{R}^{1 \times s}, \quad \beta := \big( -y_1 \,\big|\, -y_2 \,\big|\, \cdots \,\big|\, -y_s \big) \in \mathbb{R}^{1 \times s}, \quad \gamma := \Big( \tfrac{s-1}{s} \tfrac{1}{2M} \,\Big|\, \cdots \,\Big|\, \tfrac{s-1}{s} \tfrac{1}{2M} \Big) = \tfrac{s-1}{s} \tfrac{1}{2M} \cdot 1_{1 \times s} \in \mathbb{R}^{1 \times s}. \tag{5}$$

We note that all entries of these matrices and vectors are elements of $[-1, 1]$. Using these matrices and vectors, we now define

$$W_1 := \frac{c}{(3sr)^{1/q}} \Big( \underbrace{I_s/2 \,\big|\, I_s \,\big|\, 0_{d \times s} \,\big|\, \cdots \,\big|\, I_s/2 \,\big|\, I_s \,\big|\, 0_{d \times s}}_{r \text{ blocks of } (I_s/2 \,|\, I_s \,|\, 0_{d \times s})} \,\Big|\, 0_{d \times (B - 3rs)} \Big)^T \in \mathbb{R}^{B \times d}, \qquad b_1 := \frac{c}{(3sr)^{1/q}} \Big( \underbrace{\alpha \,|\, \beta \,|\, \gamma \,|\, \cdots \,|\, \alpha \,|\, \beta \,|\, \gamma}_{r \text{ blocks of } (\alpha | \beta | \gamma)} \,\big|\, 0 \,|\, \cdots \,|\, 0 \Big)^T \in \mathbb{R}^{B},$$

and furthermore

$$W_2 := \frac{c}{(3srB)^{1/q}} \Big( \underbrace{1_{B \times s} \,\big|\, -1_{B \times s} \,\big|\, -1_{B \times s} \,\big|\, \cdots \,\big|\, 1_{B \times s} \,\big|\, -1_{B \times s} \,\big|\, -1_{B \times s}}_{r \text{ blocks of } (1_{B \times s} \,|\, -1_{B \times s} \,|\, -1_{B \times s})} \,\Big|\, 0_{B \times (B - 3rs)} \Big) \in \mathbb{R}^{B \times B}, \qquad b_2 := (0 | \cdots | 0)^T \in \mathbb{R}^B,$$

where we note that $B - 3rs \ge 0$ since $r = \lfloor B/(3s) \rfloor$. It is straightforward to verify that $\|W_1\|_q, \|W_2\|_q, \|b_1\|_q, \|b_2\|_q \le c$. Furthermore, we define $W_i := \frac{c}{B^{2/q}} \, 1_{B \times B}$ and $b_i := (0 | \cdots | 0)^T \in \mathbb{R}^B$ for $3 \le i \le L-1$, and finally $W_L := \frac{\nu \cdot c}{B^{1/q}} \, (1 | \cdots | 1) \in \mathbb{R}^{1 \times B}$ and $b_L := (0) \in \mathbb{R}^1$. Again, it is straightforward to verify that $\|W_i\|_q, \|b_i\|_q \le c$ for $3 \le i \le L-1$ and also that $\|W_L\|_q, \|b_L\|_q \le c$. Therefore, setting $\Phi := ((W_1, b_1), \dots, (W_L, b_L))$, we have $R(\Phi) \in \mathcal{H}^q_{(d, B, \dots, B, 1), c}$; it thus remains to verify that $R(\Phi) = \nu \cdot \frac{\lambda}{Ms} \cdot \vartheta^{(s)}_{M,y}$ for a constant $\lambda$ as in the statement of the lemma. To see this, we note for any $x \in \mathbb{R}^d$ and $j \in [d]$ that

$$\varrho\Big( \frac{x_j}{2} + \frac{M^{-1} - y_j}{2} \Big) - \varrho(x_j - y_j) = \frac12 \, \varrho\big( x_j - y_j + M^{-1} \big) - \varrho(x_j - y_j) = \begin{cases} 0 & \text{if } x_j \le y_j - M^{-1}, \\ \frac{1}{2M} \cdot \big( 1 - M \cdot |x_j - y_j| \big) & \text{if } y_j - M^{-1} < x_j \le y_j, \\ \frac{1}{2M} \cdot \big( 1 - M \cdot |x_j - y_j| \big) & \text{if } x_j > y_j \end{cases} \;=\; \frac{1}{2M} \, \Lambda_{M,y_j}(x_j). \tag{6}$$

For notational convenience we further define $\varphi_i(x) := W_i x + b_i$ for $i \in [L]$. Then, we observe for $x \in \mathbb{R}^B$ and $i \in [B]$ that

$$[\varphi_2(x)]_i = \frac{c}{(3rsB)^{1/q}} \sum_{b=0}^{r-1} \sum_{j=1}^{s} \big( x_{3sb + j} - x_{3sb + s + j} - x_{3sb + 2s + j} \big).$$
Therefore, we see for arbitrary $x \in \mathbb{R}^d$ and $i \in [B]$ that

$$\big[ (\varphi_2 \circ \varrho \circ \varphi_1)(x) \big]_i = \frac{c^2}{(3rs)^{2/q} B^{1/q}} \sum_{b=0}^{r-1} \sum_{j=1}^{s} \Big( \varrho\big( \tfrac{x_j}{2} + \tfrac{M^{-1} - y_j}{2} \big) - \varrho(x_j - y_j) - \tfrac{s-1}{s} \tfrac{1}{2M} \Big) = \frac{c^2}{2M (3rs)^{2/q} B^{1/q}} \sum_{b=0}^{r-1} \sum_{j=1}^{s} \Big( \Lambda_{M,y_j}(x_j) - \tfrac{s-1}{s} \Big) = \frac{c^2 \, r}{2M (3rs)^{2/q} B^{1/q}} \, \Delta^{(s)}_{M,y}(x).$$

Hence, it holds that

$$\big( \varrho \circ \varphi_2 \circ \varrho \circ \varphi_1 \big)(x) = \frac{c^2 \, r}{2M (3rs)^{2/q} B^{1/q}} \cdot \vartheta^{(s)}_{M,y}(x) \cdot (1 | \cdots | 1)^T \in \mathbb{R}^B.$$

Next, for $3 \le i \le L-1$, we see for arbitrary $\kappa \ge 0$ and $j \in [B]$ that

$$\big[ (\varrho \circ \varphi_i)\big( \kappa \cdot (1 | \cdots | 1)^T \big) \big]_j = \varrho\Big( \sum_{\ell=1}^{B} [W_i]_{j,\ell} \, \kappa \Big) = \varrho\big( c B^{1 - 2/q} \kappa \big) = c B^{1 - 2/q} \kappa,$$

meaning $(\varrho \circ \varphi_i)(\kappa (1 | \cdots | 1)^T) = c B^{1 - 2/q} \kappa \cdot (1 | \cdots | 1)^T$. Therefore, we conclude

$$\big( \varrho \circ \varphi_{L-1} \circ \varrho \circ \varphi_{L-2} \circ \cdots \circ \varrho \circ \varphi_1 \big)(x) = \frac{c^{L-1} \, r \, (B^{1 - 2/q})^{L-3}}{2M (3rs)^{2/q} B^{1/q}} \, \vartheta^{(s)}_{M,y}(x) \cdot (1 | \cdots | 1)^T \in \mathbb{R}^B.$$

All in all, this easily implies

$$R(\Phi)(x) = \frac{\nu}{B^{1/q}} \sum_{i=1}^{B} \frac{c^L \, r \, (B^{1 - 2/q})^{L-3}}{2M (3rs)^{2/q} B^{1/q}} \, \vartheta^{(s)}_{M,y}(x) = \nu \cdot \frac{c^L \, (B^{1 - 2/q})^{L-2} \, (3rs)^{1 - 2/q}}{6Ms} \, \vartheta^{(s)}_{M,y}(x).$$

It therefore remains to recall that $r = \lfloor B/(3s) \rfloor \ge 1$, so that $2r \ge 1 + r > \frac{B}{3s}$ and hence $3rs \ge \frac{B}{2}$. Since also $1 - \frac{2}{q} \ge 0$, this implies $(3rs)^{1 - 2/q} \ge (B/2)^{1 - 2/q} \ge B^{1 - 2/q}/2$, which finally shows

$$\lambda := \frac{c^L \, (B^{1 - 2/q})^{L-2} \, (3rs)^{1 - 2/q}}{6} \ge \frac{c^L \cdot B^{(L-1)(1 - 2/q)}}{12}.$$

Now, we also consider the case $q \le 2$. We remark that for $q = 2$, the next lemma agrees with Lemma A.2 only up to a constant factor. This is a proof artifact and is inconsequential for the questions we are interested in.

Lemma A.3. Let $d \in \mathbb{N}$, $L \in \mathbb{N}_{\ge 3}$, $B \in \mathbb{N}_{\ge 3}$, $c > 0$, $q \in [1, 2]$, and $s \in \mathbb{N}$ with $s \le \min\{d, \frac{B}{3}\}$. Then we have

$$\nu \cdot \frac{c^L \, s^{1 - 2/q} / (2 \cdot 3^{2/q})}{Ms} \, \vartheta^{(s)}_{M,y} \in \mathcal{H}^q_{(d, B, \dots, B, 1), c} \qquad \forall\, M \in \mathbb{N},\ \nu \in \{\pm 1\},\ \text{and } y \in [0,1]^d,$$

where the $B$ in $(d, B, \dots, B, 1)$ appears $L - 1$ times.

Proof. The proof idea is similar to that of Lemma A.2. However, we only realize a scaled version of the function $\vartheta^{(s)}_{M,y}$ in the first coordinate of the outputs after the first two layers. As in the proof of Lemma A.2, we denote by $(e_1, \dots, e_d)$ the standard basis of $\mathbb{R}^d$, and we write $0_{k \times n}$ and $1_{k \times n}$ for the $k \times n$ matrices with all entries equal to zero or one, respectively. Moreover, we use the matrices and vectors $I_s, \alpha, \beta, \gamma$ defined in Equation (5). With this setup, define

$$W_1 := \frac{c}{(3s)^{1/q}} \cdot \big( I_s/2 \,\big|\, I_s \,\big|\, 0_{d \times (B - 2s)} \big)^T \in \mathbb{R}^{B \times d}, \qquad b_1 := \frac{c}{(3s)^{1/q}} \cdot \big( \alpha \,\big|\, \beta \,\big|\, \gamma \,\big|\, 0_{1 \times (B - 3s)} \big)^T \in \mathbb{R}^B.$$

Note that these definitions make sense since $2s \le 3s \le B$. Further, define $b_2 := (0 | \cdots | 0)^T \in \mathbb{R}^B$ and

$$W_2 := \frac{c}{(3s)^{1/q}} \begin{pmatrix} 1_{1 \times s} & -1_{1 \times 2s} & 0_{1 \times (B - 3s)} \\ 0_{(B-1) \times s} & 0_{(B-1) \times 2s} & 0_{(B-1) \times (B - 3s)} \end{pmatrix} \in \mathbb{R}^{B \times B}.$$

Next, for $3 \le i \le L-1$, define $b_i := (0 | \cdots | 0)^T \in \mathbb{R}^B$ and

$$W_i := c \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb{R}^{B \times B},$$

and finally let $W_L := \nu \cdot c \cdot (1 | 0 | \cdots | 0) \in \mathbb{R}^{1 \times B}$ and $b_L := (0) \in \mathbb{R}^1$. It is straightforward to verify that $\|W_j\|_q \le c$ and $\|b_j\|_q \le c$ for all $1 \le j \le L$. Therefore, $R(\Phi) \in \mathcal{H}^q_{(d, B, \dots, B, 1), c}$ for $\Phi := ((W_1, b_1), \dots, (W_L, b_L))$. It therefore remains to show that $R(\Phi) = \nu \cdot \frac{c^L \, s^{1 - 2/q} / (2 \cdot 3^{2/q})}{Ms} \, \vartheta^{(s)}_{M,y}$.

For notational convenience we define $\varphi_i(x) := W_i x + b_i$ for $i \in [L]$. Then we note for $3 \le i \le L-1$ that $(\varrho \circ \varphi_i)(x) = \big( c \cdot \varrho(x_1) \,\big|\, 0 \,\big|\, \cdots \,\big|\, 0 \big)^T$. This easily implies

$$\big( \varrho \circ \varphi_{L-1} \circ \varrho \circ \varphi_{L-2} \circ \cdots \circ \varrho \circ \varphi_3 \big)(x) = \big( c^{L-3} \cdot \varrho(x_1) \,\big|\, 0 \,\big|\, \cdots \,\big|\, 0 \big)^T,$$

and therefore $\big( \varphi_L \circ \varrho \circ \varphi_{L-1} \circ \cdots \circ \varrho \circ \varphi_3 \big)(x) = \nu \cdot c^{L-2} \, \varrho(x_1)$ for $x \in \mathbb{R}^B$. Finally, an application of Equation (6) shows that

$$\big[ (\varrho \circ \varphi_2 \circ \varrho \circ \varphi_1)(x) \big]_1 = \varrho\bigg( \frac{c^2}{(3s)^{2/q}} \sum_{i=1}^{s} \Big( \varrho\big( \tfrac{x_i}{2} + \tfrac{M^{-1} - y_i}{2} \big) - \varrho(x_i - y_i) - \tfrac{s-1}{s} \tfrac{1}{2M} \Big) \bigg) = \varrho\Big( \frac{c^2}{2M (3s)^{2/q}} \, \Delta^{(s)}_{M,y}(x) \Big) = \frac{c^2}{2M (3s)^{2/q}} \, \vartheta^{(s)}_{M,y}(x).$$

Overall, we thus see as claimed that

RpΦqpxq " ν ¨cL´2 ¨c2 2M ¨p3sq 2{q ¨ϑpsq M,y pxq " ν ¨cL s 1´2 q {p2 ¨32{q q M s ¨ϑpsq M,y pxq. Remark A.4. A straightforward adaptation of the proof shows that the same statement holds for H q pd,B,N2,...,N L´1 ,1q,c instead of H q pd,B,...,B,1q,c , for arbitrary N 2 , . . . , N L´1 P N.

A.2 A GENERAL LOWER BOUND

We now show that any target class containing a large number of (shifted) hat functions has a large optimal error.

Theorem A.5. Let $d, m \in \mathbb{N}$, $s \in [d]$, and $M := 8 \lceil m^{1/s} \rceil$. Assume that $U \subset C([0,1]^d)$ satisfies

$$u_0 + \nu \cdot \frac{\lambda}{Ms} \, \vartheta^{(s)}_{M,y} \in U \qquad \forall\, \nu \in \{\pm 1\} \text{ and } y \in [0,1]^d$$

for certain $\lambda > 0$ and $u_0 \in C([0,1]^d)$. Then,

$$\operatorname{err}^{\mathrm{MC}}_m\big( U, L^p([0,1]^d) \big) \ge \frac{\lambda/4}{(32s)^{1 + s/p}} \cdot m^{-1/p - 1/s} \qquad \forall\, p \in [1, \infty].$$

The general idea of the proof is sketched in Section 2.3. In what follows, we provide the technical details.

Proof. The proof is divided into five steps.

Step 1: Define $k := \lceil m^{1/s} \rceil$ and let $y_\ell := \frac{(1, \dots, 1)}{8k} + \frac{\ell - (1, \dots, 1)}{4k} \in [0,1]^d$ for $\ell \in [4k]^d$. Furthermore, let $\Gamma := [4k]^s \times \{(1, \dots, 1)\} \subset [4k]^d$. With

$$f_{\ell, \nu} := u_0 + \nu \cdot \frac{\lambda}{Ms} \, \vartheta^{(s)}_{M, y_\ell} \qquad \text{for } (\ell, \nu) \in \Gamma \times \{\pm 1\}, \tag{7}$$

it holds by assumption that

$$f_{\ell, \nu} \in U \qquad \forall\, (\ell, \nu) \in \Gamma \times \{\pm 1\}. \tag{8}$$

Furthermore, since $M = 8k$, Lemma A.1 and a moment's thought reveal that

$$\forall\, (\ell, \nu), (\ell', \nu') \in \Gamma \times \{\pm 1\}: \quad \ell \neq \ell' \;\Rightarrow\; \operatorname{supp}(f_{\ell, \nu} - u_0)^o \cap \operatorname{supp}(f_{\ell', \nu'} - u_0)^o = \emptyset, \tag{9}$$

where we note that $\operatorname{supp}(f_{\ell, \nu} - u_0) = \operatorname{supp} \vartheta^{(s)}_{M, y_\ell}$.

Step 2: Letfoot_4 $A \in \operatorname{Alg}_{2m}(U, L^p)$ be arbitrary and let $x = x(u_0) = (x_1, \dots, x_{2m}) \in ([0,1]^d)^{2m}$ be as described before Equation (1). Put

$$I_x := \big\{ \ell \in \Gamma : \forall\, i \in [2m]: \ \vartheta^{(s)}_{M, y_\ell}(x_i) = 0 \big\}.$$

We now show that

$$|I_x| \ge (4k)^s - 2m. \tag{10}$$

To see this, we estimate the cardinality of the complement $I_x^c := \Gamma \setminus I_x$ from above. For $\ell \in I_x^c$ there must exist $i_\ell \in [2m]$ with $\vartheta^{(s)}_{M, y_\ell}(x_{i_\ell}) \neq 0$ and hence $x_{i_\ell} \in \operatorname{supp}\big( \vartheta^{(s)}_{M, y_\ell} \big)^o$. The map $I_x^c \to [2m]$, $\ell \mapsto i_\ell$, is thus injective due to (9). Therefore $|I_x^c| \le 2m$ and thus $|I_x| \ge |\Gamma| - 2m$, which is (10). Furthermore, the definition of $I_x$, combined with the definition of $f_{\ell, \nu}$ in (7) and the fact that $A$ can only depend on the samples $x$ and the values of the input function at these samples, directly implies that

$$\forall\, (\ell, \nu) \in \Gamma \times \{\pm 1\}: \quad \ell \in I_x \;\Rightarrow\; A(f_{\ell, \nu}) = A(u_0). \tag{11}$$
Step 3: Recalling our notation for the average (denoted $\operatorname{avg}$ below) from Section 1.2, it holds that

$$\begin{aligned} \operatorname*{avg}_{\ell \in \Gamma, \, \nu \in \{\pm 1\}} \|f_{\ell,\nu} - A(f_{\ell,\nu})\|_{L^p} &= \frac{1}{(4k)^s} \sum_{\ell \in \Gamma} \Big( \tfrac12 \|f_{\ell,-1} - A(f_{\ell,-1})\|_{L^p} + \tfrac12 \|f_{\ell,1} - A(f_{\ell,1})\|_{L^p} \Big) \\ &\ge \frac{1}{(4k)^s} \sum_{\ell \in I_x} \Big( \tfrac12 \|f_{\ell,-1} - A(f_{\ell,-1})\|_{L^p} + \tfrac12 \|f_{\ell,1} - A(f_{\ell,1})\|_{L^p} \Big) \qquad (12) \\ &\ge \frac12 \operatorname*{avg}_{\ell \in I_x} \Big( \tfrac12 \|f_{\ell,-1} - A(f_{\ell,-1})\|_{L^p} + \tfrac12 \|f_{\ell,1} - A(f_{\ell,1})\|_{L^p} \Big) \qquad (13) \\ &= \frac12 \operatorname*{avg}_{\ell \in I_x} \Big( \tfrac12 \|f_{\ell,-1} - A(u_0)\|_{L^p} + \tfrac12 \|f_{\ell,1} - A(u_0)\|_{L^p} \Big) \qquad (14) \\ &\ge \frac12 \operatorname*{avg}_{\ell \in I_x} \tfrac12 \|f_{\ell,1} - f_{\ell,-1}\|_{L^p} = \frac12 \cdot \frac{\lambda}{Ms} \operatorname*{avg}_{\ell \in I_x} \|\vartheta^{(s)}_{M,y_\ell}\|_{L^p} \qquad (15) \\ &\ge \frac{\lambda}{4Ms} \, (2Ms)^{-s/p} \qquad (16) \\ &\ge \frac{\lambda/4}{(32s)^{1 + s/p}} \cdot m^{-1/p - 1/s}. \qquad (17) \end{aligned}$$

Here, (12) follows since $I_x \subset \Gamma$; (13) follows from $k = \lceil m^{1/s} \rceil$ and (10); (14) follows from (11); (15) follows from the triangle inequality and (7); (16) follows from Lemma A.1; and (17) follows from the definition of $M$, which implies $M \le 8m^{1/s} + 8 \le 16 m^{1/s}$.

Step 4: Let $(A, \bar m) \in \operatorname{Alg}^{\mathrm{MC}}_m(U, L^p)$ be arbitrary with $A = (A_\omega)_{\omega \in \Omega}$ for a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Put $\Omega_0 := \{\omega \in \Omega : \bar m(\omega) \le 2m\}$. Since the Markov inequality implies that $m \ge \mathbb{E}[\bar m] \ge 2m \cdot \mathbb{P}(\Omega_0^c)$, it follows that $\mathbb{P}(\Omega_0) \ge \frac12$. (18)

To prove Equation (23), let $u: [0,1]^d \to \mathbb{R}$ be arbitrary with $\operatorname{Lip}_2(u) < \infty$. For arbitrary $x \in [0,1]^d$, there exists a unique $i \in [K^d]$ satisfying $x \in Q_i \subset x_i + [-\frac{1}{2K}, \frac{1}{2K}]^d$, and in particular $\|x - x_i\|_2 \le \sqrt{d}/(2K)$. Therefore,

$$\big| \Theta_m\big( u(x_1), \dots, u(x_m) \big)(x) - u(x) \big| = |u(x_i) - u(x)| \le \operatorname{Lip}_2(u) \cdot \|x_i - x\|_2 \le \frac{\operatorname{Lip}_2(u) \cdot \sqrt{d}}{2K}.$$

Since $x \in [0,1]^d$ was arbitrary, this implies

$$\big\| \Theta_m\big( u(x_1), \dots, u(x_m) \big) - u \big\|_{L^\infty([0,1]^d)} \le \frac{\operatorname{Lip}_2(u) \cdot \sqrt{d}}{2K}.$$

Finally, we note that $K = \lfloor m^{1/d} \rfloor \ge 1$ implies $2K \ge 1 + K > m^{1/d}$, which proves the claim. Note that the proof above requires converting a Lipschitz constant with respect to the $\ell^2$ norm into an $\ell^\infty$ estimate, which costs a factor $\sqrt{d}$ and contributes to the gap between our lower and upper bounds.

Remark B.4. Note that our upper and lower bounds in Theorems 1.1 and 1.4 are asymptotically sharp with respect to the number of samples $m$, the regularization parameter $c$, and the network depth $L$, but not fully sharp with respect to the multiplicative factor depending only on $d$ and $q$.
Given $m$ samples, a combination of Theorems 1.1 and 1.4 shows that the optimal achievable $L^\infty$ reconstruction error $\varepsilon$ for reconstructing neural networks with $L$ layers, width up to $3d$, and coefficients bounded by $c$ in the $\ell^q$ norm satisfies

$$\begin{cases} \dfrac{1}{256 \cdot 3^{2/q}} \cdot c^L \cdot d^{-2/q} \cdot m^{-1/d} & \text{if } q \le 2, \\[1ex] \dfrac{1}{1536 \cdot d} \cdot c^L \cdot (3d)^{(L-1)(1 - 2/q)} \cdot m^{-1/d} & \text{if } q > 2 \end{cases} \;\le\; \varepsilon \;\le\; \begin{cases} \sqrt{d} \cdot c^L \cdot m^{-1/d} & \text{if } q \le 2, \\[1ex] d^{1 - 1/q} \cdot c^L \cdot (3d)^{(L-1)(1 - 2/q)} \cdot m^{-1/d} & \text{if } q > 2. \end{cases}$$


For moderate input dimensions $d$, the upper and lower bounds are quite tight, but for larger $d$ a gap remains. In that regime, however, the lower bound on $m$ is already intractable (at least if $\varepsilon \ll 1/d$, or if $c = 1$ and $L$ is large), so that the upper bound is merely of academic interest.
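To get a feeling for the numbers, one can rearrange a lower bound of the form $\varepsilon \ge K \cdot m^{-1/d}$ into $m \ge (K/\varepsilon)^d$ and tabulate it. The sketch below uses the prefactor of the $q \le 2$ case from the display above; the parameter choices $c = 2$, $L = 8$, $q = 2$ are hypothetical, and the point is only the qualitative blow-up: shrinking $\varepsilon$ by a factor of $10$ multiplies the required $m$ by $10^d$.

```python
# Hypothetical illustration: K stands for the prefactor c^L * d^(-2/q) / (256 * 3^(2/q))
# of the q <= 2 lower bound; eps >= K * m^(-1/d)  implies  m >= (K / eps)^d.
c, L, q = 2.0, 8, 2.0

for d in (2, 5, 10, 15):
    K = c ** L * d ** (-2 / q) / (256 * 3 ** (2 / q))
    row = [f"eps={eps:g}: m >= {(K / eps) ** d:.2e}" for eps in (1e-1, 1e-2, 1e-3)]
    print(f"d={d:2d}  " + "  ".join(row))
```

The exact constants should not be taken too seriously (they are not sharp, as discussed above); what matters is the exponential dependence of the required sample size on $d$.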



Footnotes.
0. In what follows, the input domain $[0,1]^d$ could be replaced by more general domains (for example, Lipschitz domains) without any change in the later results. The unit cube $[0,1]^d$ is merely chosen for concreteness.
1. By architecture we mean the number of layers $L$, as well as the number of neurons in each layer.
2. To have centered input data, we consider the hypercube $[-0.5, 0.5]^d$ in our experiments. Note that this does not change any of the theoretical results.
3. The code can be found at https://github.com/juliusberner/theory2practice.
4. For notational convenience, we abbreviate $L^p([0,1]^d)$ by $L^p$ in this proof.
6. This means that if $f = g$ on a neighborhood of $x \in [0,1]^d$, then $(Tf)(x) = (Tg)(x)$.
7. We consider complex matrices and vectors, since the Riesz-Thorin theorem applies as stated only in the complex setting.




Figure 3: Target function ("teacher"), samples, and model of the deep learning algorithm ("student") attaining the min-max value in (3) for $m = 100$ and $p = \infty$ in the experiment depicted in Figure 2. The middle and right plots are zoomed versions of the left plot. The $L^\infty$ error ($2.7 \cdot 10^{-3}$) is about one order of magnitude larger than the $L^2$ and $L^1$ errors ($3.9 \cdot 10^{-4}$ and $2.4 \cdot 10^{-4}$), which is caused by spikes of the teacher network between samples.

Figure 4: Plots of the function $\Lambda_{M,\sigma}$ in Equation (4) for $(M, \sigma) \in \{(2, 1), (4, 1), (4, \tfrac32)\}$.

$\varrho\big( \Delta^{(s)}_{M,y}(x) \big)$, where, as before, $\varrho\colon \mathbb{R} \to \mathbb{R}$, $x \mapsto \max\{0, x\}$, denotes the ReLU activation function. A plot of $\Lambda_{M,\sigma}$ is shown in Figure 4. With these definitions, the function $\vartheta^{(s)}_{M,y}$ satisfies the following properties:

Lemma A.1. For $d \in \mathbb{N}$, $s \in [d]$, $M \ge 1$, $y \in [0,1]^d$, and $p \in (0, \infty]$, we have

$$\operatorname{supp} \vartheta^{(s)}_{M,y} \subset y + \big( M^{-1} \cdot [-1,1]^s \times \mathbb{R}^{d-s} \big) \qquad \text{and} \qquad \tfrac12 \cdot (2s)^{-s/p} \cdot M^{-s/p} \le \|\vartheta^{(s)}_{M,y}\|_{L^p([0,1]^d)} \le 2^{s/p} \cdot M^{-s/p}.$$
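The norm bounds of Lemma A.1 can be sanity-checked numerically. The following sketch is plain Python; the values $d = s = 2$, $M = 6$, $p = 2$ and the center $y$ are hypothetical choices. It estimates the $L^p$ norm by a midpoint Riemann sum and compares it with both sides of the inequality.

```python
def Lam(t, sigma, M):
    """The one-sided hat Lambda_{M,sigma} from Equation (4)."""
    return 0.0 if t <= sigma - 1 / M else 1 - M * abs(t - sigma)

def theta(x, y, s, M):
    """theta^{(s)}_{M,y}(x) = relu(sum_i Lambda_{M,y_i}(x_i) - (s - 1))."""
    return max(0.0, sum(Lam(x[i], y[i], M) for i in range(s)) - (s - 1))

d = s = 2
M, p = 6, 2
y = (0.37, 0.81)                   # hypothetical center, away from the boundary
n = 400                            # midpoint rule on an n x n grid

acc = 0.0
for i in range(n):
    for j in range(n):
        x = ((i + 0.5) / n, (j + 0.5) / n)
        acc += theta(x, y, s, M) ** p
norm = (acc / n ** 2) ** (1 / p)   # approximate L^p norm on [0, 1]^2

lower = 0.5 * (2 * M * s) ** (-s / p)
upper = 2 ** (s / p) * M ** (-s / p)
print(f"{lower:.4f} <= {norm:.4f} <= {upper:.4f}")
```

For this configuration the exact squared norm can also be computed by hand as $1/108$, which the Riemann sum reproduces; the estimate sits comfortably between the two bounds.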

3. Each of the layers $((W_3, b_3), \dots, (W_{L-1}, b_{L-1}))$ scales the previous output by another factor $C_2 = C_2(c, B, q) > 0$, leading to the output $C_1 C_2^{L-3} \cdot \vartheta^{(s)}_{M,y}$ in each of the $B$ output dimensions. This construction uses the fact that all intermediate outputs are nonnegative by construction, so that the intermediate ReLU activations just act as identities.

4. The last layer $(W_L, b_L)$ now computes the sum of the previous outputs, scaled by another factor $C_3 = C_3(c, B, q) > 0$ and multiplied by $\nu$, so that the final one-dimensional output equals $\nu B C_1 C_2^{L-3} C_3 \cdot \vartheta^{(s)}_{M,y}$.

The result follows by setting $\lambda = B C_1 C_2^{L-3} C_3 \cdot Ms$ and choosing the scaling factors $C_1$, $C_2$, and $C_3$ as large as possible, subject to the width $B$ and the regularization given by $c$ and $q$.



General hyperparameters for the experiments in Figure 1 and Section 3.

Hyperparameters specific to the experiment in Figure 1.

Hyperparameters specific to the experiments in Section 3.

ACKNOWLEDGMENTS

The research of Julius Berner was supported by the Austrian Science Fund (FWF) under grant I3403-N32 and by the Vienna Science and Technology Fund (WWTF) under grant ICT19-041. The computational results presented have been achieved in part using the Vienna Scientific Cluster (VSC). Felix Voigtlaender acknowledges support by the DFG in the context of the Emmy Noether junior research group VO 2594/1-1.


Step 5: We finally estimate, for $(A, \bar m)$ as in Step 4, the Monte Carlo error via the chain of inequalities (19)-(21). Here, (19) follows from (8); (20) follows from Step 3 (note that $A_\omega \in \operatorname{Alg}_{2m}(U, L^p)$ for $\omega \in \Omega_0$); and (21) follows from (18). Since $(A, \bar m) \in \operatorname{Alg}^{\mathrm{MC}}_m(U, L^p)$ was arbitrary, this implies the desired statement.

Remark A.6. Close inspection of the proof of Theorem 2.2 shows that one can replace the point samples $u(x_i)$ by $(Tu)(x_i)$, where $T: U \to C([0,1]^d)$ is any local operator. Since any differential operator is a local operator, our lower bounds also hold if we measure point samples of a differential operator applied to $u$, as is commonly done in the context of so-called physics-informed neural networks (Raissi et al., 2019).
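The counting argument of Steps 1 and 2 is easy to simulate: the $(4k)^s$ hat supports tile $[0,1]^s$ into disjoint cells of side length $\frac{1}{4k}$, so $2m$ samples can touch at most $2m$ of them, leaving at least $(4k)^s - 2m$ hats on which every sample vanishes. A toy sketch in plain Python; $s = 2$ and $m = 50$ are hypothetical choices.

```python
import math
import random

random.seed(2)
s, m = 2, 50
k = math.ceil(m ** (1 / s))                 # k = ceil(m^{1/s}), so M = 8k
samples = [(random.random(), random.random()) for _ in range(2 * m)]

def cell(x):
    """Index in [4k]^s of the hat support (a cell of side 1/(4k)) containing x."""
    return tuple(min(int(t * 4 * k), 4 * k - 1) for t in x)

hit = {cell(x) for x in samples}            # supports touched by at least one sample
unseen = (4 * k) ** s - len(hit)            # hats on which every sample vanishes
print(f"{unseen} of {(4 * k) ** s} hats unseen; guaranteed lower bound: {(4 * k) ** s - 2 * m}")
```

For any unseen cell $\ell$, the two functions $u_0 \pm \frac{\lambda}{Ms} \vartheta^{(s)}_{M,y_\ell}$ are indistinguishable from $u_0$ on the samples, which is precisely what forces the reconstruction error in the proof.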

B PROOF OF THE UPPER BOUND IN SECTION 2.4

We first provide an auxiliary result which bounds the spectral norm $\|W\|_{\ell^2 \to \ell^2}$ of a matrix $W$ by its entry-wise $\ell^q$ norm.

Lemma B.1. Let $W \in \mathbb{R}^{N \times M}$ and $q \in [1, \infty]$. Then it holds that

$$\|W\|_{\ell^2 \to \ell^2} \le \big( \sqrt{NM} \big)^{(1 - 2/q)^+} \cdot \|W\|_q, \qquad \text{where } (t)^+ := \max\{0, t\}.$$

Proof. We first note that $\|W\|_2 = \|W\|_F$, the Frobenius norm of the matrix $W$. It is well known that the Frobenius norm satisfies $\|W\|_{\ell^2 \to \ell^2} \le \|W\|_F$. Since we could not locate a convenient reference, we reproduce the elementary proof: the Cauchy-Schwarz inequality implies for $x \in \mathbb{R}^M$ that

$$\|Wx\|_2^2 = \sum_{i=1}^{N} \Big( \sum_{j=1}^{M} W_{ij} x_j \Big)^2 \le \sum_{i=1}^{N} \Big( \sum_{j=1}^{M} W_{ij}^2 \Big) \|x\|_2^2 = \|W\|_F^2 \, \|x\|_2^2,$$

which implies the claim. Thus, we see for $q \le 2$ that $\|W\|_{\ell^2 \to \ell^2} \le \|W\|_2 \le \|W\|_q$. Clearly, the same estimate holds for complex-valued matrices and vectors as well.

Now, to handle the case $q \ge 2$, we first note for $q = \infty$, $W \in \mathbb{C}^{N \times M}$, and $x \in \mathbb{C}^M$ that $\|Wx\|_2 \le \sqrt{N} \, \|W\|_\infty \sum_{j=1}^{M} |x_j| \le \sqrt{NM} \, \|W\|_\infty \|x\|_2$. This proves the claim in the case $q = \infty$. Finally, for $q \in (2, \infty)$, we choose $\theta = \frac{2}{q}$, so that $\frac{1}{q} = \frac{\theta}{2} + \frac{1 - \theta}{\infty}$. Thus, applying the Riesz-Thorin interpolation theorem (see, e.g., Folland, 1999, Theorem 6.27) to the linear map $(\mathbb{C}^{N \times M}, \|\cdot\|_q) \to (\mathbb{C}^N, \|\cdot\|_2)$, $W \mapsto Wx$, shows for each $x \in \mathbb{C}^M$ with $\|x\|_2 \le 1$ that

$$\|Wx\|_2 \le \big( \sqrt{NM} \big)^{1 - \theta} \cdot \|W\|_q = \big( \sqrt{NM} \big)^{1 - 2/q} \cdot \|W\|_q,$$

which completes the proof.foot_7

Next, let us define the Lipschitz constant $\operatorname{Lip}_q(\phi)$ of a function $\phi: \mathbb{R}^d \to \mathbb{R}^k$ with respect to the $\ell^q$ norm by

$$\operatorname{Lip}_q(\phi) := \sup_{x, y \in \mathbb{R}^d, \, x \neq y} \frac{\|\phi(x) - \phi(y)\|_q}{\|x - y\|_q}.$$

Note that the Lipschitz constant of an affine-linear mapping $x \mapsto Wx + b$ with respect to the $\ell^2$ norm equals the spectral norm $\|W\|_{\ell^2 \to \ell^2}$. Thus, we can use the previous lemma to bound the Lipschitz constant of neural network realizations $R(\Phi) \in \mathcal{H}^q_{(N_0, \dots, N_L), c}$ in terms of their architecture $(N_0, \dots, N_L)$ and the regularization of their weights (given by $\max_{1 \le i \le L} \max\{\|W_i\|_q, \|b_i\|_q\} \le c$).

Lemma B.2. Let $L \in \mathbb{N}$, $q \in [1, \infty]$, $c > 0$, and $N_0, \dots, N_L \in \mathbb{N}$. Then each $R(\Phi) \in \mathcal{H}^q_{(N_0, \dots, N_L), c}$ satisfies

$$\operatorname{Lip}_2\big( R(\Phi) \big) \le c^L \cdot \prod_{i=1}^{L} \big( \sqrt{N_i N_{i-1}} \big)^{(1 - 2/q)^+}.$$

Proof. Let $R(\Phi) \in \mathcal{H}^q_{(N_0, \dots, N_L), c}$ be arbitrary.
By definition, this means that $R(\Phi) = \varphi_L \circ \varrho \circ \varphi_{L-1} \circ \cdots \circ \varrho \circ \varphi_1$, where $\varrho$ acts componentwise, and where the affine-linear maps $\varphi_i: \mathbb{R}^{N_{i-1}} \to \mathbb{R}^{N_i}$ are of the form $\varphi_i(x) = W_i x + b_i$ with $W_i \in \mathbb{R}^{N_i \times N_{i-1}}$ and $\|W_i\|_q \le c$. The ReLU activation function $\varrho: \mathbb{R} \to \mathbb{R}$, $x \mapsto \max\{0, x\}$, is easily seen to satisfy $|\varrho(x) - \varrho(y)| \le |x - y|$ for $x, y \in \mathbb{R}$, which implies $\operatorname{Lip}_2(\varrho) \le 1$. Moreover, Lemma B.1 establishes for $i \in [L]$ that $\operatorname{Lip}_2(\varphi_i) = \|W_i\|_{\ell^2 \to \ell^2} \le (\sqrt{N_i N_{i-1}})^{(1 - 2/q)^+} \cdot c$. Since the Lipschitz constant of a composition is bounded by the product of the individual Lipschitz constants, the claim follows.

Lemma B.3. For each $m \in \mathbb{N}$ there exist points $x_1, \dots, x_m \in [0,1]^d$ and a map $\Theta_m: \mathbb{R}^m \to L^\infty([0,1]^d)$ such that

$$\big\| \Theta_m\big( u(x_1), \dots, u(x_m) \big) - u \big\|_{L^\infty([0,1]^d)} \le \sqrt{d} \cdot \operatorname{Lip}_2(u) \cdot m^{-1/d} \tag{23}$$

for every function $u: [0,1]^d \to \mathbb{R}$ with $\operatorname{Lip}_2(u) < \infty$.

Proof. Let $m \in \mathbb{N}$ be arbitrary and choose $K := \lfloor m^{1/d} \rfloor \ge 1$. Write

$$\{x_1, \dots, x_{K^d}\} = \frac{(1, \dots, 1)}{2K} + \Big\{ 0, \frac{1}{K}, \dots, \frac{K-1}{K} \Big\}^d,$$

noting that $[0,1]^d = \bigcup_{i=1}^{K^d} \big( x_i + [-\tfrac{1}{2K}, \tfrac{1}{2K}]^d \big)$. Hence, choosing $Q_i := \big( x_i + [-\tfrac{1}{2K}, \tfrac{1}{2K}]^d \big) \setminus \bigcup_{j=1}^{i-1} \big( x_j + [-\tfrac{1}{2K}, \tfrac{1}{2K}]^d \big)$, we get $[0,1]^d = \biguplus_{i=1}^{K^d} Q_i$, where the union is disjoint. Note that $K^d \le m$ and choose arbitrary points $x_{K^d + 1}, \dots, x_m \in [0,1]^d$. Furthermore, define

$$\Theta_m: \mathbb{R}^m \to L^\infty([0,1]^d), \qquad (a_1, \dots, a_m) \mapsto \sum_{i=1}^{K^d} a_i \cdot \mathbb{1}_{Q_i}.$$
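The spectral-norm estimate of Lemma B.1, which bounds $\|W\|_{\ell^2 \to \ell^2}$ by $(\sqrt{NM})^{(1-2/q)^+} \|W\|_q$ (as can be read off from its proof), is easy to test numerically. A sketch in plain Python; power iteration only ever underestimates the spectral norm, so the assertion checks the bound conservatively:

```python
import random

def entrywise_qnorm(W, q):
    return sum(abs(w) ** q for row in W for w in row) ** (1 / q)

def spectral_norm(W, iters=500):
    """Largest singular value via power iteration on W^T W (converges from below)."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0] * n_cols
    for _ in range(iters):
        Wv = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        u = [sum(W[i][j] * Wv[i] for i in range(n_rows)) for j in range(n_cols)]
        nrm = sum(t * t for t in u) ** 0.5
        v = [t / nrm for t in u]
    Wv = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return sum(t * t for t in Wv) ** 0.5      # ||W v||_2 for a unit vector v

random.seed(3)
N, M_cols = 6, 4                              # hypothetical matrix dimensions
W = [[random.uniform(-1, 1) for _ in range(M_cols)] for _ in range(N)]
sigma = spectral_norm(W)
for q in (1.0, 1.5, 2.0, 3.0, 8.0):
    bound = (N * M_cols) ** (0.5 * max(0.0, 1 - 2 / q)) * entrywise_qnorm(W, q)
    assert sigma <= bound + 1e-9, (q, sigma, bound)
print(f"spectral norm {sigma:.3f} respects the Lemma B.1 bound for all tested q")
```

For $q \le 2$ the prefactor equals $1$ (the bound reduces to the Frobenius-norm estimate), while for $q > 2$ the dimension-dependent factor $(\sqrt{NM})^{1-2/q}$ kicks in, mirroring the width-dependent factors in Lemma B.2.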

