MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION

Abstract

The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling universal approximation has not been exactly characterized in terms of the input dimension d_x and the output dimension d_y. In this work, we provide the first definitive result in this direction for networks using the ReLU activation function: the minimum width required for universal approximation of L^p functions is exactly max{d_x + 1, d_y}. We also prove that the same conclusion does not hold for uniform approximation with ReLU, but that it does hold with an additional threshold activation function. Our proof technique can also be used to derive a tighter upper bound on the minimum width required for universal approximation by networks with general activation functions.

1. INTRODUCTION

The study of the expressive power of neural networks investigates what classes of functions neural networks can or cannot represent or approximate. Classical results in this field mostly focus on shallow neural networks. An example of such results is the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999), which shows that a neural network with fixed depth and arbitrary width can approximate any continuous function on a compact set, up to arbitrary accuracy, if the activation function is continuous and nonpolynomial. Another line of research studies the memory capacity of neural networks (Baum, 1988; Huang and Babri, 1998; Huang, 2003), trying to characterize the maximum number of data points that a given neural network can memorize. After the advent of deep learning, researchers started to investigate the benefit of depth in the expressive power of neural networks, in an attempt to understand the success of deep neural networks. This has led to interesting results showing the existence of functions that are easily approximated by deep and narrow networks, yet require extreme width to be approximated by shallow networks (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Poggio et al., 2017). A similar trade-off between depth and width in expressive power is also observed in the study of the memory capacity of neural networks (Yun et al., 2019; Vershynin, 2020). In search of a deeper understanding of depth in neural networks, a dual scenario of the classical universal approximation theorem has also been studied (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020). Instead of the bounded depth and arbitrary width studied in classical results, the dual problem asks whether universal approximation is possible with a network of bounded width and arbitrary depth.
A very interesting characteristic of this setting is that there exists a critical threshold on the width that allows a neural network to be a universal approximator. For example, one of the first results (Lu et al., 2017) in the literature shows that universal approximation of L^1 functions from R^{d_x} to R is possible for a width-(d_x + 4) ReLU network, but impossible for a width-d_x ReLU network. This implies that the minimum width required for universal approximation lies between d_x + 1 and d_x + 4. Subsequent results have shown upper/lower bounds on the minimum width, but none of them has succeeded in a tight characterization of the minimum width.
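To make the architecture class under discussion concrete, the following is a minimal numpy sketch of a deep, narrow ReLU network whose every hidden layer has the same fixed width d_x + 1 (the critical width max{d_x + 1, d_y} when d_y = 1). This is purely illustrative: the function name `narrow_relu_net` and the random weights are our own assumptions, and the sketch only exhibits the bounded-width / arbitrary-depth shape; it does not construct the approximator from the theorems.

```python
import numpy as np

def narrow_relu_net(x, depth, rng):
    """Forward pass of a ReLU network with all hidden layers of width d_x + 1.

    Illustrative only: weights are random, so this is a generic member of the
    bounded-width, arbitrary-depth architecture class, not a trained model.
    """
    d_x = x.shape[0]
    width = d_x + 1  # critical width max{d_x + 1, d_y} for d_y = 1

    # input layer: R^{d_x} -> R^{width}
    W = rng.standard_normal((width, d_x))
    b = rng.standard_normal(width)
    h = np.maximum(W @ x + b, 0.0)

    # 'depth' hidden layers, each of the same fixed width
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        b = rng.standard_normal(width)
        h = np.maximum(W @ h + b, 0.0)

    # affine output layer: R^{width} -> R
    w_out = rng.standard_normal(width)
    return w_out @ h

rng = np.random.default_rng(0)
y = narrow_relu_net(np.ones(3), depth=16, rng=rng)  # d_x = 3, width 4
```

Depth is the free resource here: the width is pinned at d_x + 1 regardless of how many hidden layers are stacked, which is exactly the regime in which the critical-width question is posed.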

1.1. WHAT IS KNOWN SO FAR?

Before summarizing existing results, we first define the function classes studied in the literature. For a domain X ⊆ R^{d_x} and a codomain Y ⊆ R^{d_y}, we define C(X, Y) to be the class of continuous

