MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION

Abstract

The universal approximation property of width-bounded networks has been studied as a dual of the classical universal approximation results for depth-bounded networks. However, the critical width enabling universal approximation has not been exactly characterized in terms of the input dimension d_x and the output dimension d_y. In this work, we provide the first definitive result in this direction for networks using the ReLU activation function: the minimum width required for universal approximation of L^p functions is exactly max{d_x + 1, d_y}. We also prove that the same conclusion does not hold for uniform approximation with ReLU, but does hold with an additional threshold activation function. Our proof technique can also be used to derive a tighter upper bound on the minimum width required for universal approximation using networks with general activation functions.

1. INTRODUCTION

The study of the expressive power of neural networks investigates what classes of functions neural networks can and cannot represent or approximate. Classical results in this field mostly focus on shallow neural networks. One example is the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999), which shows that a neural network with fixed depth and arbitrary width can approximate any continuous function on a compact set, up to arbitrary accuracy, if the activation function is continuous and nonpolynomial. Another line of research studies the memory capacity of neural networks (Baum, 1988; Huang and Babri, 1998; Huang, 2003), trying to characterize the maximum number of data points that a given neural network can memorize.

After the advent of deep learning, researchers started to investigate the benefit of depth in the expressive power of neural networks, in an attempt to understand the success of deep neural networks. This has led to interesting results showing the existence of functions that shallow networks can approximate only if they are extremely wide, while deep and narrow networks approximate them easily (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Poggio et al., 2017). A similar trade-off between depth and width in expressive power is also observed in the study of the memory capacity of neural networks (Yun et al., 2019; Vershynin, 2020).

In search of a deeper understanding of depth in neural networks, a dual scenario of the classical universal approximation theorem has also been studied (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020). Instead of the bounded depth and arbitrary width of classical results, the dual problem asks whether universal approximation is possible with a network of bounded width and arbitrary depth.
A very interesting characteristic of this setting is that there exists a critical threshold on the width that allows a neural network to be a universal approximator. For example, one of the first results in the literature (Lu et al., 2017) shows that universal approximation of L^1 functions from R^{d_x} to R is possible with a width-(d_x + 4) ReLU network, but impossible with a width-d_x ReLU network. This implies that the minimum width required for universal approximation lies between d_x + 1 and d_x + 4. Subsequent results have shown upper/lower bounds on the minimum width, but none has succeeded in a tight characterization.
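As a concrete illustration of how these bounds compare across dimensions, the following sketch encodes them directly (the helper functions are ours, for illustration only, not from any of the cited works):

```python
# Illustrative sketch: encode the width bounds discussed above and compare
# them for a few dimension pairs. Helper names are ours, not the papers'.

def lu_bounds(d_x):
    # Lu et al. (2017): bounds for L^1 functions from R^{d_x} to R (d_y = 1 only).
    return d_x + 1, d_x + 4   # (lower, upper) bounds on the minimum width

def exact_min_width(d_x, d_y):
    # The exact characterization for L^p universal approximation: max{d_x + 1, d_y}.
    return max(d_x + 1, d_y)

for d_x in [1, 3, 15]:
    lo, hi = lu_bounds(d_x)
    print(f"d_x={d_x}, d_y=1: Lu et al. range [{lo}, {hi}], "
          f"exact = {exact_min_width(d_x, 1)}")

# For a high-dimensional codomain, the minimum width is driven by d_y,
# a dependence the earlier bounds (stated only for d_y = 1) do not capture.
print(exact_min_width(2, 100))
```

Note that for d_y = 1 the exact value max{d_x + 1, 1} = d_x + 1 coincides with the lower end of the earlier range.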

1.1. WHAT IS KNOWN SO FAR?

Before summarizing existing results, we first define the function classes studied in the literature. For a domain X ⊆ R^{d_x} and a codomain Y ⊆ R^{d_y}, we define C(X, Y) to be the class of continuous functions from X to Y, endowed with the uniform norm ‖f‖_∞ := sup_{x∈X} ‖f(x)‖_∞. For p ∈ [1, ∞), we also define L^p(X, Y) to be the class of L^p functions from X to Y, endowed with the L^p-norm ‖f‖_p := (∫_X ‖f(x)‖_p^p dx)^{1/p}. The known upper and lower bounds in the literature, as well as our own results, are summarized in Table 1. We use w_min to denote the minimum width for universal approximation.

Table 1: A summary of known upper/lower bounds on the minimum width for universal approximation. Here K ⊂ R^{d_x} denotes a compact domain, p ∈ [1, ∞), and "conti." is short for continuous.

Reference               | Function class        | Activation        | Bound on w_min
Lu et al. (2017)        | L^1(R^{d_x}, R)       | ReLU              | d_x + 1 ≤ w_min ≤ d_x + 4
                        | L^1(K, R)             | ReLU              | w_min ≥ d_x
Hanin and Sellke (2017) | C(K, R^{d_y})         | ReLU              | d_x + 1 ≤ w_min ≤ d_x + d_y
Johnson (2019)          | C(K, R)               | uniformly conti.† | w_min ≥ d_x + 1
Kidger and Lyons (2020) | C(K, R^{d_y})         | conti. nonpoly‡   | w_min ≤ d_x + d_y + 1
                        | C(K, R^{d_y})         | nonaffine poly    | w_min ≤ d_x + d_y + 2
                        | L^p(R^{d_x}, R^{d_y}) | ReLU              | w_min ≤ d_x + d_y + 1
Ours (Theorem 1)        | L^p(R^{d_x}, R^{d_y}) | ReLU              | w_min = max{d_x + 1, d_y}
Ours (Theorem 2)        | C([0, 1], R^2)        | ReLU              | w_min = 3 > max{d_x + 1, d_y}
Ours (Theorem 3)        | C(K, R^{d_y})         | ReLU+STEP         | w_min = max{d_x + 1, d_y}
Ours (Theorem 4)        | L^p(K, R^{d_y})       | conti. nonpoly‡   | w_min ≤ max{d_x + 2, d_y + 1}

† requires that ρ can be uniformly approximated by a sequence of one-to-one functions.
‡ requires that ρ is continuously differentiable at some z with ρ'(z) ≠ 0.

First progress. As aforementioned, Lu et al. (2017) show that universal approximation of L^1(R^{d_x}, R) is possible with a width-(d_x + 4) ReLU network, but impossible with a width-d_x ReLU network. These results translate into bounds on the minimum width: d_x + 1 ≤ w_min ≤ d_x + 4. Hanin and Sellke (2017) consider approximation of C(K, R^{d_y}), where K ⊂ R^{d_x} is compact. They prove that ReLU networks of width d_x + d_y are dense in C(K, R^{d_y}), while width-d_x ReLU networks are not.
Although this result fully characterizes w_min in the case d_y = 1, it fails to do so for d_y > 1.

General activations. Later, extensions to activation functions other than ReLU appeared in the literature. Johnson (2019) shows that if the activation function ρ is uniformly continuous and can be uniformly approximated by a sequence of one-to-one functions, then a width-d_x network cannot universally approximate C(K, R). Kidger and Lyons (2020) show that if ρ is continuous, nonpolynomial, and continuously differentiable at some z with ρ'(z) ≠ 0, then networks of width d_x + d_y + 1 with activation ρ are dense in C(K, R^{d_y}). Furthermore, Kidger and Lyons (2020) prove that ReLU networks of width d_x + d_y + 1 are dense in L^p(R^{d_x}, R^{d_y}).

Limitations of prior art. Note that none of the existing works succeeds in closing the gap between the upper bounds (at least d_x + d_y) and the lower bounds (at most d_x + 1). This gap is especially significant for applications with high-dimensional codomains (i.e., large d_y), which arise in many practical uses of neural networks, e.g., image generation (Kingma and Welling, 2013; Goodfellow et al., 2014), language modeling (Devlin et al., 2019; Liu et al., 2019), and molecule generation (Gómez-Bombarelli et al., 2018; Jin et al., 2018). In prior work, the main bottleneck to proving an upper bound below d_x + d_y is that the constructions maintain all d_x neurons to store the input and all d_y neurons to construct the output; hence every layer already requires at least d_x + d_y neurons. In addition, the proof techniques for the lower bounds only consider the input dimension d_x, regardless of the output dimension d_y.
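The two approximation-error metrics defined above can be estimated numerically. The following minimal sketch (our own helper code, for scalar-valued functions with d_x = 1) approximates the sup and the integral on a uniform grid over [0, 1]:

```python
import numpy as np

# Estimate the uniform norm and the L^p norm of f - g on [0, 1] by
# discretizing the domain (a crude stand-in for the sup and the integral).

def uniform_norm(f, g, grid):
    # sup_x |f(x) - g(x)|, approximated by the max over grid points.
    return np.max(np.abs(f(grid) - g(grid)))

def lp_norm(f, g, grid, p):
    # (integral of |f(x) - g(x)|^p dx)^(1/p), approximated by a Riemann sum.
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(f(grid) - g(grid)) ** p) * dx) ** (1.0 / p)

grid = np.linspace(0.0, 1.0, 10001)
f = lambda x: np.sin(2 * np.pi * x)
g = lambda x: np.zeros_like(x)   # "approximator": the zero function

print(uniform_norm(f, g, grid))  # close to 1
print(lp_norm(f, g, grid, p=2))  # close to 1/sqrt(2) ≈ 0.707
```

A network approximates f in C or in L^p precisely when the corresponding norm of the difference can be made arbitrarily small; the example shows the two metrics can differ substantially for the same pair of functions.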

1.2. SUMMARY OF RESULTS

We mainly focus on characterizing the minimum width of ReLU networks for universal approximation. Nevertheless, our results are not restricted to ReLU networks; they generalize to networks with general activation functions. Our contributions can be summarized as follows.




