LEARNING RELU NETWORKS TO HIGH UNIFORM ACCURACY IS INTRACTABLE

Abstract

Statistical learning theory provides bounds on the number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications, for example in a security-critical context or for problems in the computational sciences, accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper, we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and in the input dimension of the network architecture.

1. INTRODUCTION

The basic goal of supervised learning is to determine a function¹ $u \colon [0,1]^d \to \mathbb{R}$ from (possibly noisy) samples $(u(x_1), \dots, u(x_m))$. As the function $u$ can take arbitrary values between these samples, this problem is, of course, not solvable without further information on $u$. In practice, one typically leverages domain knowledge to estimate the structure and regularity of $u$ a priori, for instance in terms of symmetries, smoothness, or compositionality. Such additional information can be encoded via a suitable target class $U \subset C([0,1]^d)$ that $u$ is known to be a member of. We are interested in the optimal accuracy for reconstructing $u$ that can be achieved by any algorithm which utilizes $m$ point samples. To make this mathematically precise, we assume that this accuracy is measured by a norm $\|\cdot\|_Y$ of a suitable Banach space $Y \supset U$. Formally, an algorithm can thus be described by a map $A \colon U \to Y$ that can query the function $u$ at $m$ points $x_i$ and that outputs a function $A(u)$ with $A(u) \approx u$ (see Section 2.1 for a precise definition that incorporates adaptivity and stochasticity). We will be interested in upper and lower bounds on the accuracy that can be reached by any such algorithm; equivalently, we are interested in the minimal number $m$ of point samples needed for any algorithm to achieve a given accuracy $\varepsilon$ for every $u \in U$. This $m$ would then establish a fundamental benchmark on the sample complexity (and the algorithmic complexity) of learning functions in $U$ to a given accuracy.

The choice of the Banach space $Y$ (in other words, how we measure accuracy) is crucial here. For example, statistical learning theory provides upper bounds on the optimal accuracy in terms of an expected loss, i.e., with respect to $Y = L^2([0,1]^d, dP)$, where $P$ is a (generally unknown) probability distribution from which the sample points are drawn.
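In this notation, the minimal sample numbers described above can be stated compactly for any fixed error norm $\|\cdot\|_Y$; the following display is our paraphrase of this quantity (the shorthand $m(\varepsilon, U, Y)$ is ours, not notation taken from the paper):
$$
m(\varepsilon, U, Y) \;=\; \min\Bigl\{ m \in \mathbb{N} \;:\; \text{there exists an algorithm } A \text{ using } m \text{ point samples with } \sup_{u \in U} \|A(u) - u\|_Y \le \varepsilon \Bigr\}.
$$
The results of the paper concern the uniform case $\|\cdot\|_Y = \|\cdot\|_{L^\infty([0,1]^d)}$.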



¹ In what follows, the input domain $[0,1]^d$ could be replaced by more general domains (for example, Lipschitz domains) without any change in the later results. The unit cube $[0,1]^d$ is merely chosen for concreteness.
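To make the contrast between the two error measures concrete, the following self-contained Python sketch (our illustration, not code from the paper; the spiky target $u$ and the interpolation-based reconstruction are arbitrary choices) reconstructs a univariate function from $m$ random samples and reports both the empirical $L^2(dP)$ error and the uniform error:

```python
import numpy as np

rng = np.random.default_rng(0)

def u(x):
    # Target with a narrow "spike" of height 1 and support width 0.004 around 0.5.
    return np.maximum(0.0, 1.0 - 500.0 * np.abs(x - 0.5))

m = 20
x_train = rng.uniform(0.0, 1.0, size=m)   # sample points x_1, ..., x_m
y_train = u(x_train)                      # noiseless samples u(x_i)

# A simple reconstruction map A: piecewise-linear interpolation of the samples.
order = np.argsort(x_train)
def A(x):
    return np.interp(x, x_train[order], y_train[order])

# Empirical L^2(dP) error for P = uniform on [0, 1] (Monte Carlo estimate).
x_test = rng.uniform(0.0, 1.0, size=100_000)
l2_error = np.sqrt(np.mean((A(x_test) - u(x_test)) ** 2))

# Uniform error, approximated by a maximum over a fine grid.
x_grid = np.linspace(0.0, 1.0, 100_001)
sup_error = np.max(np.abs(A(x_grid) - u(x_grid)))

print(f"empirical L^2 error: {l2_error:.4f}")  # small: the spike carries little mass
print(f"uniform error:       {sup_error:.4f}") # near 1 whenever all samples miss the spike
```

On most random seeds, all $m = 20$ samples miss the spike, so the $L^2$ error is on the order of $10^{-2}$ while the uniform error stays close to $1$. This is precisely the failure mode that guarantees with respect to the uniform norm are meant to exclude.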

