LEARNING RELU NETWORKS TO HIGH UNIFORM ACCURACY IS INTRACTABLE

Abstract

Statistical learning theory provides bounds on the number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications, for example in a security-critical context or for problems in the computational sciences, accuracy in this sense is not sufficient. In such cases, one would like to have guarantees of high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and in the input dimension of the network architecture.

1. INTRODUCTION

The basic goal of supervised learning is to determine a function[1] $u : [0,1]^d \to \mathbb{R}$ from (possibly noisy) samples $(u(x_1), \ldots, u(x_m))$. As the function $u$ can take arbitrary values between these samples, this problem is, of course, not solvable without any further information on $u$. In practice, one typically leverages domain knowledge to estimate the structure and regularity of $u$ a priori, for instance in terms of symmetries, smoothness, or compositionality. Such additional information can be encoded via a suitable target class $U \subset C([0,1]^d)$ that $u$ is known to be a member of.

We are interested in the optimal accuracy for reconstructing $u$ that can be achieved by any algorithm which utilizes $m$ point samples. To make this mathematically precise, we assume that this accuracy is measured by a norm $\|\cdot\|_Y$ of a suitable Banach space $Y \supset U$. Formally, an algorithm can thus be described by a map $A : U \to Y$ that can query the function $u$ at $m$ points $x_i$ and that outputs a function $A(u) \approx u$ (see Section 2.1 for a precise definition that incorporates adaptivity and stochasticity). We will be interested in upper and lower bounds on the accuracy that can be reached by any such algorithm; equivalently, we are interested in the minimal number $m$ of point samples needed for any algorithm to achieve a given accuracy $\varepsilon$ for every $u \in U$. This $m$ would then establish a fundamental benchmark on the sample complexity (and the algorithmic complexity) of learning functions in $U$ to a given accuracy.

The choice of the Banach space $Y$, in other words how we measure accuracy, is crucial here. For example, statistical learning theory provides upper bounds on the optimal accuracy in terms of an expected loss, i.e., with respect to $Y = L^2([0,1]^d, \mathbb{P})$, where $\mathbb{P}$ is a (generally unknown) data-generating distribution (2018; Kim et al., 2021). This offers a powerful approach to ensure a small average reconstruction error.
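To make the role of $m$ concrete, the following toy sketch (our own illustration, not an algorithm from this paper) instantiates such a map $A$ for the simple target class of 1-Lipschitz functions on $[0,1]$: query $u$ at $m$ equispaced points and return the piecewise-linear interpolant, which guarantees a uniform error of at most $h/2$ with grid spacing $h = 1/(m-1)$.

```python
import numpy as np

# Illustrative sketch (not from the paper): a deterministic algorithm A that
# queries u at m equispaced points and returns the piecewise-linear
# interpolant. For any 1-Lipschitz u on [0, 1], the uniform error of this A
# is at most h / 2, where h = 1 / (m - 1) is the sample spacing.
def A(u, m):
    x_samples = np.linspace(0.0, 1.0, m)
    y_samples = u(x_samples)  # the m point queries of u
    return lambda x: np.interp(x, x_samples, y_samples)

u = lambda x: np.abs(x - 0.375)  # a 1-Lipschitz target with a kink
m = 101                          # sample spacing h = 0.01
reconstruction = A(u, m)

# Estimate the uniform error on a dense grid; the worst case is attained
# at the kink, midway between two sample points, where the error is h / 2.
x_dense = np.linspace(0.0, 1.0, 100_001)
sup_err = np.max(np.abs(u(x_dense) - reconstruction(x_dense)))
print(sup_err)
```

For such classical smoothness classes, the uniform error thus decays polynomially in $m$; the results of this paper show that no comparable guarantee is possible for target classes built from deep ReLU networks without a number of samples that is exponential in depth and input dimension.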
However, there are many important scenarios where such bounds on the accuracy are not sufficient and one would like to obtain an approximation of $u$ that is close to $u$ not only on average, but that can be guaranteed to be close for every $x \in [0,1]^d$. This includes several applications in the sciences, for example in the context of the numerical solution of partial differential equations (Raissi et al., 2019; Han et al., 2018; Richter & Berner, 2022), any security-critical application, for example facial ID authentication schemes (Guo & Zhang, 2019), as well as any application with a distribution shift, i.e., where the data-generating distribution differs from the distribution in which the accuracy is measured (Quiñonero-Candela et al., 2008). Such applications can only be efficiently solved if there exists an efficient algorithm $A$ that achieves uniform accuracy, i.e., a small error $\sup_{u \in U} \|u - A(u)\|_{L^\infty([0,1]^d)}$ with respect to the uniform norm given by $Y = L^\infty([0,1]^d)$, i.e., $\|f\|_{L^\infty([0,1]^d)} := \operatorname{ess\,sup}_{x \in [0,1]^d} |f(x)|$.

Inspired by recent successes of deep learning across a plethora of tasks in machine learning (LeCun et al., 2015) and, increasingly, the sciences (Jumper et al., 2021; Pfau et al., 2020), we will be particularly interested in the case where the target class $U$ consists of, or contains, realizations of (feed-forward) neural networks of a specific architecture[2]. Neural networks have been proven and observed to be extremely powerful in terms of their expressivity, that is, their ability to accurately approximate large classes of complicated functions with only relatively few parameters (Elbrächter et al., 2021; Berner et al., 2022).
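The gap between the two error notions can be made concrete with a small numerical sketch (our illustration, with a synthetic pointwise error): an error confined to a region of width $\delta$ contributes only $\sqrt{\delta}$ to the $L^2$ error under the uniform distribution, but a full $O(1)$ to the uniform error.

```python
import numpy as np

# Synthetic illustration (not from the paper): the pointwise error
# |u(x) - A(u)(x)| equals 1 on an interval of width delta and 0 elsewhere.
delta = 1e-3
x = np.linspace(0.0, 1.0, 1_000_001)
err = np.where(np.abs(x - 0.5) < delta / 2, 1.0, 0.0)

# L^2 error w.r.t. the uniform distribution on [0, 1]: roughly sqrt(delta)
l2_err = np.sqrt(np.mean(err ** 2))
# Uniform (L^inf) error: exactly 1, no matter how small delta is
sup_err = np.max(err)
print(l2_err, sup_err)
```

Shrinking $\delta$ drives the average error to zero while the uniform error stays at 1, which is precisely the failure mode that uniform-accuracy guarantees are meant to rule out.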
However, it has also been repeatedly observed that training neural networks (e.g., fitting a neural network to data samples) to high uniform accuracy presents a big challenge: conventional training algorithms (such as SGD and its variants) often find neural networks that perform well on average, meaning that they achieve a small generalization error, but there are typically some regions of the input space where the error is large (Fiedler et al., 2023); see Figure 1 for an illustrative example. This phenomenon has been systematically studied on an empirical level by Adcock & Dexter (2021). It is also at the heart of several observed instabilities in the training of deep neural networks, including adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015) and so-called hallucinations emerging in generative modeling, e.g., in tomographic reconstruction (Bhadra et al., 2021) or machine translation (Müller et al., 2020). Note that additional knowledge about the target functions could potentially help circumvent these issues; see Remark 1.3. However, for many applications it is not possible to precisely describe the regularity of the target functions. We thus analyze the case where no additional information is given besides the fact that one aims to recover an (unknown) neural network of a specified architecture and regularization from given samples, i.e., we assume that $U$ contains a class of neural networks of a given architecture, subject to various regularization methods. This is satisfied in several applications of interest, e.g., model extraction attacks (Tramèr et al., 2016; He et al., 2022) and teacher-student settings (Mirzadeh et al., 2020; Xie et al., 2020). It is also in line with standard settings in the statistical query literature,
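For concreteness, a realization of a feed-forward ReLU network of a given architecture can be sketched as follows (the widths and random weights below are hypothetical placeholders chosen for illustration; see the paper's formal definition for the precise setting):

```python
import numpy as np

def relu_net(x, weights, biases):
    """Realization of a feed-forward ReLU network.

    weights[k], biases[k] parametrize the affine map of layer k; the ReLU
    activation max(., 0) is applied after every layer except the last one.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(W @ a + b, 0.0)   # hidden layer with ReLU
    return weights[-1] @ a + biases[-1]  # affine output layer

# Hypothetical architecture: input dimension d = 2, two hidden layers of
# width 3, scalar output. The architecture is the tuple of these widths.
rng = np.random.default_rng(0)
widths = [2, 3, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

y = relu_net(np.array([0.2, 0.7]), weights, biases)
print(y.shape)  # scalar output as a length-1 array
```

The target classes studied in the paper consist of all such realizations for a fixed architecture, possibly intersected with regularization constraints on the weights.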



[1] In what follows, the input domain $[0,1]^d$ could be replaced by more general domains (for example, Lipschitz domains) without any change in the later results. The unit cube $[0,1]^d$ is merely chosen for concreteness.

[2] By architecture we mean the number of layers $L$, as well as the number of neurons in each layer.



Figure 1: Even though the training of neural networks from data samples may achieve a small error on average, there are typically regions in the input space where the pointwise error is large. The target function in this plot is given by $x \mapsto \log(\sin(50x) + 2) + \sin(5x)$ (based on Adcock & Dexter, 2021) and the model is a feed-forward neural network. It is trained on $m = 1000$ uniformly distributed samples according to the hyperparameters in Tables 1 and 2 and achieves final $L^1$ and $L^\infty$ errors of $2.8 \cdot 10^{-3}$ and $0.19$, respectively. The middle and right plots are zoomed versions of the left plot.
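The two error metrics reported in the caption can be estimated on a dense grid. The sketch below (our illustration) uses the caption's target function together with a synthetic stand-in for the trained model, namely the target plus a narrow bump of height 0.19; the bump location and width are hypothetical, chosen only to mimic the qualitative behavior of the figure, not an actually trained network.

```python
import numpy as np

# Target function from Figure 1
u = lambda x: np.log(np.sin(50 * x) + 2) + np.sin(5 * x)

x = np.linspace(0.0, 1.0, 100_001)
# Synthetic stand-in for a trained model: agrees with u except on a narrow
# window, where the error peaks at 0.19 (hypothetical placeholder values).
model = u(x) + 0.19 * np.exp(-((x - 0.8) / 0.01) ** 2)

err = np.abs(u(x) - model)
l1_err = np.mean(err)   # empirical L^1 error w.r.t. the uniform measure
linf_err = np.max(err)  # empirical L^inf (uniform) error
print(l1_err, linf_err)
```

An error profile of this shape yields an $L^1$ error of a few times $10^{-3}$ alongside a uniform error of $0.19$, matching the order of magnitude of the gap reported in the caption.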

