Achieve the Minimum Width of Neural Networks for Universal Approximation

Abstract

The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the L^p norm and the continuous/uniform norm. However, the exact minimum width, w_min, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, Park et al. (2021) found that w_min = max(d_x + 1, d_y) for both the L^p-UAP of ReLU networks and the C-UAP of ReLU+STEP networks, where d_x, d_y are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both C-UAP and L^p-UAP for functions on compact domains share a universal lower bound on the minimal width; that is, w*_min = max(d_x, d_y). In particular, the critical width, w*_min, for L^p-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability of neural networks to approximate flow maps. The cases of nonmonotone or discontinuous activation functions and the one-dimensional case are also discussed.

1. Introduction

The study of the universal approximation property (UAP) of neural networks is fundamental for deep learning and has a long history. Early studies, such as Cybenko (1989); Hornik et al. (1989); Leshno et al. (1993), proved that wide neural networks (even shallow ones) are universal approximators for continuous functions within both the L^p norm (1 ≤ p < ∞) and the continuous/uniform norm. Further research, such as Telgarsky (2016), indicated that increasing the depth can improve the expressive power of neural networks. If the budget of neurons is fixed, deeper neural networks have better expressive power (Yarotsky & Zhevnerchuk (2020); Shen et al. (2022)). However, this pattern does not hold if the width is below a critical threshold w_min. Lu et al. (2017) first showed that ReLU networks have the UAP for L^1 functions from R^{d_x} to R if the width is larger than d_x + 4, and the UAP disappears if the width is less than d_x. Further research (Hanin & Sellke (2017); Kidger & Lyons (2020); Park et al. (2021)) improved the minimum width bound for ReLU networks. Particularly, Park et al. (2021) revealed that the minimum width is w_min = max(d_x + 1, d_y) for the L^p(R^{d_x}, R^{d_y}) UAP of ReLU networks and for the C(K, R^{d_y}) UAP of ReLU+STEP networks, where K is a compact domain in R^{d_x}. For general activation functions, the exact minimum width w_min for the UAP is less studied. Johnson (2019) considered uniformly continuous activation functions that can be approximated by a sequence of one-to-one functions and gave a lower bound w_min ≥ d_x + 1 for C-UAP (i.e., UAP for C(K, R^{d_y})). Kidger & Lyons (2020) considered continuous nonpolynomial activation functions and gave an upper bound w_min ≤ d_x + d_y + 1 for C-UAP. Park et al. (2021) improved the bound for L^p-UAP (i.e., UAP for L^p(K, R^{d_y})) to w_min ≤ max(d_x + 2, d_y + 1). A summary of known upper/lower bounds on the minimum width for the UAP can be found in Park et al. (2021).
In this paper, we consider neural networks having the UAP with arbitrary activation functions. We give a universal lower bound, w_min ≥ w*_min = max(d_x, d_y), for approximating functions from a compact domain K ⊂ R^{d_x} to R^{d_y} in the L^p norm or the continuous norm. Furthermore, we show that the critical width w*_min can be achieved by many neural networks, as listed in Table 1. Surprisingly, leaky-ReLU networks achieve the critical width for the L^p-UAP provided that the input or output dimension is larger than one. This result relies on a novel construction scheme proposed in this paper based on the approximation power of neural ordinary differential equations (ODEs) and the ability of neural networks to approximate flow maps.

Table 1: Summary of the known minimum width of feed-forward neural networks that have the universal approximation property.

| Function class         | Activation      | Minimum width                     | Reference             |
|------------------------|-----------------|-----------------------------------|-----------------------|
| C(K, R)                | ReLU            | w_min = d_x + 1                   | Hanin & Sellke (2017) |
| L^p(R^{d_x}, R^{d_y})  | ReLU            | w_min = max(d_x + 1, d_y)         | Park et al. (2021)    |
| C([0, 1], R^2)         | ReLU            | w_min = 3 = max(d_x, d_y) + 1     | Park et al. (2021)    |
| C(K, R^{d_y})          | ReLU+STEP       | w_min = max(d_x + 1, d_y)         | Park et al. (2021)    |
| L^p(K, R^{d_y})        | Conti. nonpoly‡ | w_min ≤ max(d_x + 2, d_y + 1)     | Park et al. (2021)    |
| L^p(K, R^{d_y})        | Arbitrary       | w_min ≥ max(d_x, d_y) =: w*_min   | Ours (Lemma 1)        |
| L^p(K, R^{d_y})        | Leaky-ReLU      | w_min = max(d_x, d_y, 2)          | Ours (Theorem 2)      |
| L^p(K, R^{d_y})        | Leaky-ReLU+ABS  | w_min = max(d_x, d_y)             | Ours (Theorem 3)      |
| C(K, R^{d_y})          | Arbitrary       | w_min ≥ max(d_x, d_y) =: w*_min   | Ours (Lemma 1)        |
| C(K, R^{d_y})          | ReLU+FLOOR      | w_min = max(d_x, d_y, 2)          | Ours (Lemma 4)        |
| C(K, R^{d_y})          | UOE†+FLOOR      | w_min = max(d_x, d_y)             | Ours (Corollary 6)    |
| C([0, 1], R^{d_y})     | UOE†            | w_min = d_y                       | Ours (Theorem 5)      |

‡ Continuous nonpolynomial ρ that is continuously differentiable at some z with ρ′(z) ≠ 0. † UOE means a function having a universal ordering of extrema; see Definition 7.

1.1. Contributions

1) We obtain the universal lower bound of width, w*_min, for feed-forward neural networks (FNNs) that have universal approximation properties.
2) We achieve the critical width w*_min by leaky-ReLU+ABS networks and UOE+FLOOR networks. (UOE is a continuous function which has a universal ordering of extrema; it is introduced to handle C-UAP for one-dimensional functions. See Definition 7.)
3) We propose a novel construction scheme from a differential geometry perspective that could deepen our understanding of the UAP through topology theory.

1.2. Related work

To obtain the exact minimum width, one must verify the lower and upper bounds. Generally, the upper bounds are obtained by construction, while the lower bounds are obtained by counterexamples. Lower bounds. For ReLU networks, Lu et al. (2017) showed that the UAP disappears if the width is less than d_x, which gives a lower bound w_min ≥ d_x. Upper bounds. Duan et al. (2022) noticed that the FNN can also be viewed as a discretization of neural ODEs, which motivates us to construct networks achieving the critical width by inheriting the approximation power of neural ODEs. For the excluded dimension one, we design an approximation scheme with leaky-ReLU+ABS and UOE activation functions.

1.3. Organization

We formally state the main results and necessary notations in Section 2. The proof ideas are given in Sections 3, 4, and 5. In Section 3, we consider the case where N = d_x = d_y = 1, which is basic for the high-dimensional cases. The construction is based on the properties of monotone functions. In Section 4, we prove the case where N = d_x = d_y ≥ 2. The construction is based on the approximation power of neural ODEs. In Section 5, we consider the case where d_x ≠ d_y and discuss the case of more general activation functions. Finally, we conclude the paper in Section 6. All formal proofs of the results are presented in the Appendix.

2. Main results

In this paper, we consider the standard feed-forward neural network with N neurons at each hidden layer. We say that a σ network with depth L is a function with input x ∈ R^{d_x} and output y ∈ R^{d_y} of the following form:

y ≡ f_L(x) = W_{L+1} σ(W_L(⋯ σ(W_1 x + b_1) ⋯) + b_L) + b_{L+1},

where the b_i are bias vectors, the W_i are weight matrices, and σ(·) is the activation function. For the case of multiple activation functions, for instance σ_1 and σ_2, we call f_L a σ_1+σ_2 network. In this situation, the activation function of each neuron is either σ_1 or σ_2. In this paper, we consider arbitrary activation functions, while the following activation functions are emphasized: ReLU (max(x, 0)), leaky-ReLU (max(x, αx), where α ∈ (0, 1) is a fixed positive parameter), ABS (|x|), SIN (sin(x)), STEP (1_{x>0}), FLOOR (⌊x⌋) and UOE (universal ordering of extrema, which will be defined later).

Lemma 1. For any compact domain K ⊂ R^{d_x} and any finite set of activation functions {σ_i}, the {σ_i} networks with width w < w*_min ≡ max(d_x, d_y) do not have the UAP for either L^p(K, R^{d_y}) or C(K, R^{d_y}).

The lemma indicates that w*_min ≡ max(d_x, d_y) is a universal lower bound for the UAP in both L^p(K, R^{d_y}) and C(K, R^{d_y}). The main result of this paper illustrates that the minimal width w*_min can be achieved. We consider the UAP for these two function classes, i.e., L^p-UAP and C-UAP, respectively. Note that any compact domain can be covered by a large cube, functions on the former can be extended to the latter, and the cube can be mapped to the unit cube by a linear function. This allows us to assume K to be the (unit) cube without loss of generality.

2.1. L^p-UAP

Theorem 2. Let K ⊂ R^{d_x} be a compact set; then, for the function class L^p(K, R^{d_y}), the minimum width of leaky-ReLU networks having L^p-UAP is exactly w_min = max(d_x, d_y, 2).
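The layered form y ≡ f_L(x) = W_{L+1} σ(W_L(⋯) + b_L) + b_{L+1} can be sketched in a few lines of NumPy. This is only an illustrative forward pass, not any construction from the proofs; the dimensions, depth, slope α, and random weights below are arbitrary choices, with the hidden width set to N = max(d_x, d_y) as in the theorems.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """Leaky-ReLU activation max(x, alpha*x) with a fixed alpha in (0, 1)."""
    return np.where(x > 0, x, alpha * x)

def fnn(x, weights, biases, sigma=leaky_relu):
    """Evaluate f_L(x) = W_{L+1} sigma(W_L(... sigma(W_1 x + b_1) ...) + b_L) + b_{L+1}.

    weights/biases hold W_1..W_{L+1} and b_1..b_{L+1}; every hidden layer
    has the same width N.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + b)                      # hidden layers
    return weights[-1] @ h + biases[-1]           # final affine readout

# Illustrative shapes: d_x = 3, d_y = 2, width N = max(d_x, d_y) = 3, depth L = 4.
rng = np.random.default_rng(0)
d_x, d_y, N, L = 3, 2, 3, 4
weights = ([rng.standard_normal((N, d_x))]
           + [rng.standard_normal((N, N)) for _ in range(L - 1)]
           + [rng.standard_normal((d_y, N))])
biases = [rng.standard_normal(N) for _ in range(L)] + [rng.standard_normal(d_y)]
y = fnn(rng.standard_normal(d_x), weights, biases)
```

A σ_1+σ_2 network would simply apply a different activation to some neurons in each hidden layer.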
The theorem indicates that leaky-ReLU networks achieve the critical width w*_min = max(d_x, d_y), except for the case of d_x = d_y = 1. The idea is to consider the case where d_x = d_y = d > 1 and let the network width equal d. According to the results of Duan et al. (2022), leaky-ReLU networks can approximate the flow maps of neural ODEs. Thus, we can use the approximation power of neural ODEs to finish the proof. Li et al. (2022) proved that many neural ODEs can approximate continuous functions in the L^p norm. This is based on the fact that orientation-preserving diffeomorphisms can approximate continuous functions (Brenier & Gangbo (2003)). The exclusion of dimension one is due to the monotonicity of leaky-ReLU. When we add a nonmonotone activation function such as the absolute value function or the sine function, the L^p-UAP in dimension one can also be achieved.

Theorem 3. Let K ⊂ R^{d_x} be a compact set; then, for the function class L^p(K, R^{d_y}), the minimum width of leaky-ReLU+ABS networks having L^p-UAP is exactly w_min = max(d_x, d_y).

2.2. C-UAP

C-UAP is more demanding than L^p-UAP. However, if the set of activation functions may include discontinuous functions, the same critical width w*_min can be achieved. Following the encoder-memorizer-decoder approach in Park et al. (2021), with the step function replaced by the floor function, one can obtain the minimal width w_min = max(d_x, 2, d_y).

Lemma 4. Let K ⊂ R^{d_x} be a compact set; then, for the function class C(K, R^{d_y}), the minimum width of ReLU+FLOOR networks having C-UAP is exactly w_min = max(d_x, 2, d_y).

Since ReLU and FLOOR are monotone functions, the C-UAP at the critical width w*_min does not hold for C([0, 1], R). This remains the case even if we add ABS or SIN as an additional activation function. However, it is still possible with the UOE function (Definition 7).

Theorem 5. The UOE networks with width d_y have C-UAP for functions in C([0, 1], R^{d_y}).

Corollary 6. Let K ⊂ R^{d_x} be a compact set; then, for the continuous function class C(K, R^{d_y}), the minimum width of UOE+FLOOR networks having C-UAP is exactly w_min = max(d_x, d_y).

3. Approximation in dimension one

(N = d_x = d_y = d = 1) In this section, we consider one-dimensional functions and neural networks with a width of one. In this case, the expressive power of ReLU networks is extremely poor. Therefore, we consider the leaky-ReLU activation σ_α(x) with a fixed parameter α ∈ (0, 1). Note that leaky-ReLU is strictly monotone, and it was proven by Duan et al. (2022) that any monotone function in C([0, 1], R) can be uniformly approximated by leaky-ReLU networks with width one. This is useful for our construction to approximate nonmonotone functions. Since the composition of monotone functions is also monotone, to approximate nonmonotone functions we need to add a nonmonotone activation function. Let us consider simple nonmonotone functions, such as |x| or sin(x). We show that leaky-ReLU+ABS or leaky-ReLU+SIN networks can approximate any continuous function f*(x) under the L^p norm. The idea, shown in Figure 1, is that the target function f*(x) can be uniformly approximated by a polynomial p(x), which can be represented as a composition g ∘ u(x) = p(x) ≈ f*(x). Here, the outer function g(x) is any continuous function whose values at its extrema match the values of p(x) at its extrema, and the inner function u(x) is monotonically increasing and adjusts the locations of the extrema (see Figure 1). Since polynomials have a finite number of extrema, the inner function u(x) is piecewise continuous. For L^p-UAP, the approximation is allowed to have a large deviation on a small interval; therefore, the extrema need not be matched exactly (a small mismatch is allowed). For example, we can choose g(x) as the sine function or the sawtooth function (which can be approximated by ABS networks), and u(x) as a leaky-ReLU network approximating g^{-1} ∘ p(x) on each monotone interval of p. Figure 1(a) shows an example of the composition. For C-UAP, matching the extrema while keeping the error small is needed. To achieve this aim, we introduce the UOE functions.
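The obstruction motivating these additions, namely that any composition of monotone scalar maps is again monotone, can be verified numerically for width-one leaky-ReLU networks. The depth, slope, and random weights below are arbitrary illustrative choices.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    return np.where(x > 0, x, alpha * x)

def width_one_network(x, params):
    """A width-one leaky-ReLU network: alternating scalar affine maps and activations."""
    h = x
    for w, b in params:
        h = leaky_relu(w * h + b)
    return h

rng = np.random.default_rng(1)
params = [(rng.standard_normal(), rng.standard_normal()) for _ in range(20)]

xs = np.linspace(-3, 3, 1001)
ys = width_one_network(xs, params)
d = np.diff(ys)
# Each layer is monotone (increasing if w > 0, decreasing if w < 0), so the
# composition is monotone: the increments never change sign.
assert np.all(d >= 0) or np.all(d <= 0)
```

Hence no choice of weights lets such a network approximate a nonmonotone target like |x|, which is exactly why a nonmonotone activation (ABS, SIN, or UOE) must be added at width one.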
Definition 7 (Universal ordering of extrema (UOE) functions). A UOE function is a continuous function in C(R, R) such that any (finite number of) possible ordering(s) of values at a (finite) set of extrema can be found among the extrema of the function.

There are infinitely many UOE functions. Here, we give an example, shown in Figure 2. This UOE function ρ(x) is defined by a sequence {o_i}_{i=1}^∞:

ρ(x) = x/4 for x ≤ 0, and ρ(x) = o_i + (x − i)(o_{i+1} − o_i) for x ∈ [i, i + 1), i = 1, 2, ...,    (2)

where {o_i}_{i=1}^∞ = (1, 2, 2, 1, 1, 2, 3, 1, 3, 2, 2, 1, 3, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1, 2, 3, 4, ...) is the concatenation of all permutations of (1, 2, ..., n) for n = 2, 3, .... The term UOE in this paper means this function ρ. Since the UOE function ρ(x) can represent leaky-ReLU σ_{1/4} on any finite interval, the UOE networks can uniformly approximate any monotone function. To illustrate the C-UAP of UOE networks, we only need to construct a continuous function g(x) matching the extrema of p(x) (see Figure 1(b)). That is, we construct g(x) by the composition ũ ∘ ρ(x), where ũ(x) is a monotone and continuous function. This is possible since the UOE function contains any ordering of the extrema. The following lemma summarizes the approximation of one-dimensional functions. As a consequence, Theorem 5 holds, since functions in C([0, 1], R^{d_y}) can be regarded as d_y one-dimensional functions.

Lemma 8. For any function f*(x) ∈ C[0, 1] and ε > 0, 1) there is a leaky-ReLU+ABS (or leaky-ReLU+SIN) network f_L with width one and depth L such that ∫_0^1 |f*(x) − f_L(x)|^p dx < ε^p; 2) there is a leaky-ReLU+UOE network f_L with width one and depth L such that |f*(x) − f_L(x)| < ε for all x ∈ [0, 1].

4. Connection to neural ODEs

(N = d_x = d_y = d ≥ 2) Now, we turn to the high-dimensional case and connect feed-forward neural networks to neural ODEs. To build this connection, we assume that the input and output have the same dimension, d_x = d_y = d.
Consider the following neural ODE with one-hidden-layer neural fields:

ẋ(t) = v(x(t), t) := A(t) tanh(W(t) x(t) + b(t)), t ∈ (0, τ), x(0) = x_0,    (3)

where the parameters (A, W, b) and the flow map ϕ_τ are specified below (Lemma 9 and Lemma 10). Combining these two lemmas, one can directly prove the following corollary, which is a part of our Theorem 2.

Corollary 11. Let K ⊂ R^d be a compact set and d ≥ 2; then, for the function class L^p(K, R^d), the leaky-ReLU networks with width d have L^p-UAP.

Here, we summarize the main ideas of this result. Let us start with the discretization of the ODE by the splitting approach (see McLachlan & Quispel (2002), for example). Consider the splitting of (3) with v(x, t) = Σ_{i,j} v_i^{(j)}(x, t) e_j, where v_i^{(j)}(x, t) = A_{ji}(t) tanh(W_{i,:}(t) x + b_i(t)) is a scalar function and e_j is the j-th axis unit vector. Then, for a given time step Δt = τ/K (K large enough), the splitting method gives the following iteration for x_k, which approximates ϕ_{kΔt}(x_0):

x_{k+1} = T_k^{(d,d)} ∘ ⋯ ∘ T_k^{(1,2)} ∘ T_k^{(1,1)} x_k,

where the map T_k^{(i,j)}: x → y is defined as y^{(l)} = x^{(l)} for l ≠ j, and y^{(j)} = x^{(j)} + Δt v_i^{(j)}(x, kΔt) = x^{(j)} + aΔt tanh(wx + β). Here, the superscript in x^{(l)} means the l-th coordinate of x, and a = A_{ji}, w = W_{i,:} and β = b_i take their values at t = kΔt. Note that the scalar functions tanh(ξ) and ξ + aΔt tanh(ξ) are monotone with respect to ξ when Δt is small enough. This allows us to construct leaky-ReLU networks with width d to approximate each map T_k^{(i,j)} and then approximate the flow map, ϕ_τ(x_0) ≈ x_K. Note that Lemma 10 holds for all dimensions, while Lemma 9 holds only for dimensions larger than one. This is because flow maps are orientation-preserving diffeomorphisms, and they can approximate continuous functions only in dimensions larger than one; see Brenier & Gangbo (2003).
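The splitting iteration can be sketched numerically. Below, a fine explicit-Euler solve stands in for the exact flow map ϕ_τ; the matrices, time horizon, step counts, and the sweep order of the maps T^{(i,j)} are illustrative choices.

```python
import numpy as np

def splitting_step(x, A, W, b, dt):
    """One splitting step: compose the coordinate-wise maps T^{(i,j)}, each
    updating only coordinate j by dt * A[j,i] * tanh(W[i,:] @ x + b[i]).
    (Sweep order i outer, j inner; any fixed order gives a first-order scheme.)"""
    x = x.copy()
    d = len(x)
    for i in range(d):
        for j in range(d):
            x[j] = x[j] + dt * A[j, i] * np.tanh(W[i, :] @ x + b[i])
    return x

def euler_reference(x, A, W, b, tau, n_steps):
    """Fine explicit-Euler solve of x' = A tanh(W x + b) as a reference flow map."""
    dt = tau / n_steps
    for _ in range(n_steps):
        x = x + dt * A @ np.tanh(W @ x + b)
    return x

rng = np.random.default_rng(2)
d, tau, K = 2, 1.0, 400
A = 0.5 * rng.standard_normal((d, d))   # mild dynamics, illustrative scale
W = 0.5 * rng.standard_normal((d, d))
b, x0 = rng.standard_normal(d), rng.standard_normal(d)

x = x0.copy()
for _ in range(K):
    x = splitting_step(x, A, W, b, tau / K)
ref = euler_reference(x0.copy(), A, W, b, tau, 100_000)
# First-order splitting: the two endpoints agree up to O(tau / K).
```

Each map T^{(i,j)} changes a single coordinate by a monotone scalar function (for small Δt), which is what makes it approximable by a width-d leaky-ReLU block.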
The approximation is based on control theory, where the flow map can be adjusted to match any finite set of input-output pairs. Such matching does not hold in dimension one; the case of dimension one was discussed in the previous section.

5. Achieving the minimal width

Now, we turn to the cases where the input and output dimensions cannot be equal.

5.1. Universal lower bound w*_min = max(d_x, d_y)

Here, we give a sketch of the proof of Lemma 1, which states that w*_min is a universal lower bound over all activation functions. Parts of Lemma 1 have been demonstrated in many papers, such as Park et al. (2021). Here, we give a proof by two counterexamples that are simple and easy to understand from the topological perspective. It contains two cases: 1) there is a function f* that cannot be approximated by networks with width w ≤ d_x − 1; 2) there is a function f* that cannot be approximated by networks with width w ≤ d_y − 1. Figure 3(a)-(b) shows the counterexamples that illustrate the essence of the proof.

For the first case, w ≤ d_x − 1, we show that f*(x) = ∥x∥_2, x ∈ K = [−2, 2]^{d_x}, is what we want; see Figure 3(a). In fact, we can relax the networks to a function f(x) = ϕ(W x + b), where W x + b is an affine map from R^{d_x} to R^{d_x−1} and ϕ(x) could be any function. A consequence is that there exists a direction v (taken as a vector satisfying W v = 0, ∥v∥ = 1) such that f(x) = f(x + λv) for all λ ∈ R. Then, considering the sets A = {x : ∥x∥ ≤ 0.1} and B = {x : ∥x − v∥ ≤ 0.1}, we have

∫_K |f(x) − f*(x)| dx ≥ ∫_A |f(x) − f*(x)| dx + ∫_B |f(x) − f*(x)| dx ≥ ∫_A (|f(x) − f*(x)| + |f(x + v) − f*(x + v)|) dx ≥ ∫_A |f*(x) − f*(x + v)| dx ≥ 0.8|A|.

Since the volume of A is a fixed positive number, the inequality implies that even the L^1 approximation of f* is impossible. Approximation in the L^p norm and the uniform norm is then impossible as well.

For the second case, w ≤ d_y − 1, we take as f* the parametrized curve from 0 to 1 along the edges of the cube; see Figure 3(b). Relaxing the networks to a function f(x) = W ψ(x) + b, where ψ(x) could be any function, the range of f lies in a hyperplane, while f* has a positive distance to any hyperplane; hence the target f* cannot be approximated.
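The key pointwise inequality behind the last step, |f*(x) − f*(x + v)| ≥ 0.8 for all x ∈ A, follows from the triangle inequality (∥x + v∥ ≥ ∥v∥ − ∥x∥ ≥ 0.9 while ∥x∥ ≤ 0.1) and can be checked numerically; the dimension and sample count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, n = 3, 10_000

# A unit direction v and random samples x from A = {||x|| <= 0.1}.
v = rng.standard_normal(d_x)
v /= np.linalg.norm(v)
dirs = rng.standard_normal((n, d_x))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
xs = 0.1 * rng.uniform(0, 1, (n, 1)) * dirs

# f*(x) = ||x||: the gap |f*(x) - f*(x + v)| is at least ||v|| - 2*||x|| >= 0.8.
gap = np.abs(np.linalg.norm(xs, axis=1) - np.linalg.norm(xs + v, axis=1))
```

Since f is constant along v, its error on A and on B = A + v cannot both be small, which is exactly the 0.8|A| lower bound on the L^1 error.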

5.2. Achieving w*_min for L^p-UAP

Now, we show that the lower bound w*_min for L^p-UAP can be achieved by leaky-ReLU+ABS networks. Without loss of generality, we consider K = [0, 1]^{d_x}. For any function f* in L^p([0, 1]^{d_x}, R^{d_y}), we can extend it to a function in L^p([0, 1]^d, R^d) by filling in zeros, where d = max(d_x, d_y) = w*_min. When d_x > 1 or d_y > 1, the L^p-UAP for leaky-ReLU networks with width w*_min is obtained by using Corollary 11. Recall that by Lemma 1, w*_min is optimal, and we obtain our main result, Theorem 2. Combining this with the case of d_x = d_y = d = 1 in Section 3, where the absolute value function ABS is added as an additional activation function, we obtain Theorem 3.

5.3. Achieving w*_min for C-UAP

Here, we use the encoder-memorizer-decoder approach proposed in Park et al. (2021) to achieve the minimum width. Without loss of generality, we consider the function class C([0, 1]^{d_x}, [0, 1]^{d_y}). The encoder-memorizer-decoder approach includes three parts: 1) an encoder maps [0, 1]^{d_x} to [0, 1], quantizing each coordinate of x by a K-bit binary representation and concatenating the quantized coordinates into a single scalar value x̄ having a (d_x K)-bit binary representation; 2) a memorizer maps each codeword x̄ to its target codeword ȳ; 3) a decoder maps ȳ to the quantized target that approximates the true target. This scheme is illustrated in Figure 3(c).
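The quantize-and-concatenate arithmetic of the encoder and decoder can be sketched directly. This is only the bit manipulation, not the FLOOR-network construction itself; the bit width K and the test points are illustrative, and coordinates are assumed to lie in [0, 1).

```python
def encode(x, K):
    """Quantize each coordinate of x in [0, 1)^d to K bits and concatenate
    the bit strings into one scalar with a (d*K)-bit binary expansion."""
    code = 0
    for c in x:
        code = (code << K) | int(c * (1 << K))   # floor-based K-bit quantizer
    return code / (1 << (len(x) * K))            # scalar codeword in [0, 1)

def decode(xbar, d, K):
    """Recover the d quantized coordinates from the scalar codeword."""
    code = int(round(xbar * (1 << (d * K))))
    out = []
    for _ in range(d):
        out.append((code & ((1 << K) - 1)) / (1 << K))
        code >>= K
    return list(reversed(out))

x = [0.3, 0.62, 0.925]
xq = decode(encode(x, 10), 3, 10)   # each coordinate recovered to within 2^-10
```

The memorizer then only has to act on finitely many scalar codewords, which is what makes a one-dimensional monotone-plus-jumps construction (with FLOOR) sufficient.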

5.4. Effect of the activation functions

Here, we emphasize that our universal bound on the minimal width is optimized over arbitrary activation functions. However, it cannot always be achieved when the activation functions are fixed. Here, we discuss the case of monotone activation functions. If the activation functions are strictly monotone and continuous (such as leaky-ReLU), a width of at least d_x + 1 is needed for C-UAP. This can be understood through topology theory. Leaky-ReLU and nonsingular linear maps, together with their inverses, are continuous; hence they are homeomorphisms. Since compositions of homeomorphisms are also homeomorphisms, we have the following proposition: if N = d_x = d_y = d and the weight matrices in a leaky-ReLU network are nonsingular, then the input-output map is a homeomorphism. Note that singular matrices can be approximated by nonsingular matrices; therefore, we can restrict the weight matrices in neural networks to the nonsingular case.

The case where d_x < d_y. We present a simple example in Figure 5. The curve '4', corresponding to a continuous function from [0, 1] ⊂ R to R^2, cannot be uniformly approximated. However, the L^p approximation is still possible.
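The proposition can be illustrated by inverting a single width-d leaky-ReLU layer explicitly: the inverse of leaky-ReLU is again a leaky-ReLU (with slope 1/α on the negative side), so a layer with nonsingular weights is a bijection. The slope α and the random weights below are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_inv(y, alpha=0.3):
    """Inverse of leaky-ReLU: identity on the right, slope 1/alpha on the left."""
    return np.where(y > 0, y, y / alpha)

def layer(x, W, b):
    return leaky_relu(W @ x + b)

def layer_inv(y, W, b):
    # Requires W nonsingular: x = W^{-1} (sigma^{-1}(y) - b).
    return np.linalg.solve(W, leaky_relu_inv(y) - b)

rng = np.random.default_rng(4)
d = 3
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
x = rng.standard_normal(d)
x_rec = layer_inv(layer(x, W, b), W, b)   # round trip recovers x
```

A composition of such layers is therefore a homeomorphism, which is why, e.g., the level-set obstruction of Figure 4 rules out uniform approximation at width d_x.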

6. Conclusion

Let us summarize the main results and implications of this paper. After giving the universal lower bound of the minimum width for the UAP, we proved that the bound is optimal by constructing neural networks with suitable activation functions. For the L^p-UAP, our construction achieving the critical width was based on the approximation power of neural ODEs, which bridges feed-forward networks and the flow maps of ODEs. This allowed us to understand the UAP of FNNs through topology theory. Moreover, we obtained not only the lower bound but also the upper bound. For the C-UAP, our construction was based on the encoder-memorizer-decoder approach of Park et al. (2021), where the activation set contains a discontinuous function, ⌊x⌋. It is still an open question whether the critical width can be achieved by continuous activation functions. Johnson (2019) proved that continuous and monotone activation functions need a width of at least d_x + 1. This implies that nonmonotone activation functions are needed. By using the UOE activation, we obtained the critical width for the case of d_x = 1. It would be of interest to study the case of d_x ≥ 2 in future research. We remark that our UAP results are for functions on a compact domain. Examining the critical width of the UAP for functions on unbounded domains is desirable for future research.

Proof of Lemma 8 (continued). The bound |f*(x) − p_n(x)| ≤ ε/2 holds according to the well-known Weierstrass approximation theorem. Without loss of generality, we can assume that the values of p_n(x) at its extrema are not all the same. Then, we can represent p_n(x) by the following composition, using Lemma 13 and the property of UOE: p_n(x) = v ∘ ρ ∘ u(x), where ρ(x) is the UOE function (2) and v(x) and u(x) are monotonically increasing continuous functions. Then, we can approximate p_n(x) by UOE networks. Since v(x) and u(x) are monotone, there are UOE networks ṽ(x) and ũ(x) such that ∥v − ṽ∥ and ∥u − ũ∥ are arbitrarily small.
Hence, there is a UOE network f_L(x) = ṽ ∘ ρ ∘ ũ(x) that approximates p_n(x) with |p_n(x) − f_L(x)| ≤ ε/2 for all x ∈ [0, 1], which implies that |f*(x) − f_L(x)| ≤ ε. This completes the proof of the second point. For the first point, we only emphasize that it is easy to construct a function f(x) that has the same local maxima and local minima on the interval and has ∥f − f*∥_{L^p} small enough. This f(x) has the same ordering of extrema as the sawtooth function (or the sine function) and hence can be uniformly approximated by leaky-ReLU+ABS (or leaky-ReLU+SIN) networks f_L. As a consequence, ∥f_L − f*∥_{L^p} is small enough.

Proof of Lemma 9. This is a special case of Theorem 2.3 in Li et al. (2022).

Proof of Corollary 11. For any f*(x) ∈ L^p(K, R^d) and ε > 0, there is a flow map ϕ_τ(x) associated with the neural ODE (3) such that (according to Lemma 9) ∥f*(·) − ϕ_τ(·)∥_{L^p} ≤ ε/2. Then, employing Lemma 10, there is a leaky-ReLU network f_L such that ∥f_L(·) − ϕ_τ(·)∥_{L^p} ≤ ε/2. Therefore, we have ∥f_L(·) − f*(·)∥_{L^p} ≤ ∥f*(·) − ϕ_τ(·)∥_{L^p} + ∥f_L(·) − ϕ_τ(·)∥_{L^p} ≤ ε.



Figure 1: Example of approximating/representing a polynomial by the composition of a monotonically increasing function u(x) and a nonmonotone function g(x). (a) Only the ordering of the extremal values is matched; (b) the values are matched as well.

Figure 2: An example of the UOE function ρ(x), which has an infinite number of pieces.
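The example sequence {o_i} and the piecewise-linear ρ from Definition 7 can be reproduced directly. Two details below are our reading of the definition: the permutations start from n = 2 (which matches the listed prefix), and ρ on (0, 1), which the formula leaves unspecified, is extrapolated from the first segment.

```python
import numpy as np
from itertools import count, permutations

def uoe_sequence(m):
    """First m terms of {o_i}: the concatenation of all permutations of
    (1, ..., n) for n = 2, 3, ... (itertools emits them in lexicographic order)."""
    seq = []
    for n in count(2):
        for p in permutations(range(1, n + 1)):
            seq.extend(p)
            if len(seq) >= m:
                return seq[:m]

def rho(x, o):
    """The example UOE function: rho(x) = x/4 for x <= 0, and linear
    interpolation of o_i on [i, i+1) for integer i >= 1."""
    if x <= 0:
        return x / 4
    i = max(int(np.floor(x)), 1)   # the definition starts at i = 1
    return o[i - 1] + (x - i) * (o[i] - o[i - 1])

o = uoe_sequence(30)
# The prefix of o matches the sequence displayed in the text:
# 1, 2, 2, 1, 1, 2, 3, 1, 3, 2, 2, 1, 3, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1, 2, 3, 4, ...
```

Because every ordering of every finite permutation eventually appears in {o_i}, every finite ordering of extremal values occurs somewhere among the kinks of ρ, which is the UOE property used in the C-UAP construction.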

where x, x_0 ∈ R^d and the time-dependent parameters (A, W, b) ∈ R^{d×d} × R^{d×d} × R^d are piecewise constant functions of t. The flow map is denoted by ϕ_τ(·), which is the function mapping x_0 to x(τ). According to the approximation results for neural ODEs (see Li et al. (2022); Tabuada & Gharesifard (2020); Ruiz-Balet & Zuazua (2021) for examples), we have the following lemma.

Lemma 9 (Special case of Li et al. (2022)). Let d ≥ 2. Then, for any continuous function f*: R^d → R^d, any compact set K ⊂ R^d, and any ε > 0, there exist a time τ ∈ R_+ and a piecewise constant input (A, W, b): [0, τ] → R^{d×d} × R^{d×d} × R^d such that the flow map ϕ_τ associated with the neural ODE (3) satisfies ∥f* − ϕ_τ∥_{L^p(K)} ≤ ε.

Next, we consider the approximation of the flow map associated with (3) by neural networks. Recently, Duan et al. (2022) found that leaky-ReLU networks can perform such approximations.

Lemma 10 (Theorem 2.2 in Duan et al. (2022)). If the parameters (A, W, b) in (3) are piecewise constant, then for any compact set K and any ε > 0, there is a leaky-ReLU network f_L(x) with width d and depth L such that ∥ϕ_τ(x) − f_L(x)∥ ≤ ε for all x ∈ K.

Using the floor function instead of a step function, one can construct the encoder by FLOOR networks with width d_x and the decoder by FLOOR networks with width d_y; see Figure 3(c). The memorizer is a one-dimensional scalar function that can be approximated by ReLU networks with a width of two or UOE networks with a width of one. Therefore, the minimal widths max(d_x, 2, d_y) and max(d_x, d_y) are obtained, which demonstrates Lemma 4 and Corollary 6, respectively.

Figure 3: (a)(b) Counterexamples for proving Lemma 1. (a) Points A and B lie on a level set of the network f(x); f(A) = f(B), but f*(A) − f*(B) is not small. (b) The curve from 0 to 1 along the edges of the cube has a positive distance to any hyperplane. (c) Illustration of the encoder-memorizer-decoder scheme for C-UAP, with an example where d_x = d_y = 3, 4 bits for the input, and 5 bits for the output.

When d_x ≥ d_y, we can reformulate the leaky-ReLU network as f_L(x) = W_{L+1} ψ(x) + b_{L+1}, where ψ(x) is a homeomorphism. Note that considering the case where d_y = 1 is sufficient, according to Hanin & Sellke (2017); Johnson (2019). They proved that neural networks with width d_x cannot approximate scalar functions having a level set with a bounded path component. This can be easily understood from the perspective of topology theory. An example is the function f*(x) = ∥x∥_2, x ∈ K = [−2, 2]^{d_x}, shown in Figure 4.

Figure 4: Illustrating the possibility of UAP when N = d x . (a) Plot of f * (x) = ∥x∥ 2 and its contour at ∥x∥ = 1. (b) The original point P is an inner point of the unit ball, while its image is a boundary point, which is impossible for homeomorphisms. (c) Any homeomorphism, approximating ∥x∥ 2 with error less than ε (=0.1 for example) on Γ, should have error larger than 1 -ε (=0.9) at P . (d) Approximating f * in L p is possible by leaving a small region.

Figure 5: Illustrating the possibility of C-UAP when d x ≤ d y . The curve in (a) is homeomorphic to the interval [0, 1], while the curve '4' in (b) is not and cannot be approximated uniformly by homeomorphisms. The L p approximation is possible via (a).

A.2 Proof of Lemma 9

Lemma 9. Let d ≥ 2. Then, for any continuous function f*: R^d → R^d, any compact set K ⊂ R^d, and any ε > 0, there exist a time τ ∈ R_+ and a piecewise constant input (A, W, b): [0, τ] → R^{d×d} × R^{d×d} × R^d such that the flow map ϕ_τ associated with the neural ODE (3) satisfies ∥f* − ϕ_τ∥_{L^p(K)} ≤ ε.

A.3 Proof of Lemma 10

Lemma 10. If the parameters (A, W, b) in (3) are piecewise constant, then for any compact set K and any ε > 0, there is a leaky-ReLU network f_L(x) with width d and depth L such that ∥ϕ_τ(x) − f_L(x)∥ ≤ ε, ∀x ∈ K. (8)

Proof. It is Theorem 2.2 in Duan et al. (2022).

A.4 Proof of Corollary 11

Corollary 11. Let K ⊂ R^d be a compact set and d ≥ 2; then, for the function class L^p(K, R^d), the leaky-ReLU networks with width d have L^p-UAP.

Summary of the known minimum width of feed-forward neural networks that have the universal approximation property.

Lower bounds. For ReLU networks, Lu et al. (2017) utilized the disadvantage brought by the insufficient size of the dimensions and proved a lower bound w_min ≥ d_x. Johnson (2019) considered activation functions that can be approximated by one-to-one functions; these properties allow one to construct counterexamples and give a lower bound w_min ≥ d_x + 1 for C-UAP. For general activation functions, Park et al. (2021) used the volume of a simplex in the output space and gave a lower bound w_min ≥ d_y for either L^p-UAP or C-UAP. Our universal lower bound, w_min ≥ max(d_x, d_y), is based on the insufficient size of the dimensions of both the input and output spaces, which combines the ideas from the references above.

Acknowledgments

We thank anonymous reviewers for their valuable comments and useful suggestions. This research is supported by the National Natural Science Foundation of China (Grant No. 12201053).

A Proof of the lemmas

A.1 Proof of Lemma 8

We give a definition and a lemma below that are useful for proving Lemma 8.

Definition 12. We say two functions f_1, f_2 ∈ C(R, R) have the same ordering of extrema if they have the following properties: 1) each f_i(x) has only a finite number of extrema, located at (increasing) x*_{i,j}, j = 1, 2, ..., m_i; 2) m_1 = m_2 =: m, and the two sequences S_1 := {f_1(−∞), f_1(x*_{1,1}), ..., f_1(x*_{1,m}), f_1(+∞)} and S_2 := {f_2(−∞), f_2(x*_{2,1}), ..., f_2(x*_{2,m}), f_2(+∞)} have the same ordering, i.e., S_{1,i} < S_{1,j} ⟺ S_{2,i} < S_{2,j} and S_{1,i} = S_{1,j} ⟺ S_{2,i} = S_{2,j} for all i, j.

Lemma 13. Let f_1 and f_2 be continuous functions in C(R, R) that have the same ordering of extrema; then, there are two strictly monotone and continuous functions, v and u, such that f_1 = v ∘ f_2 ∘ u.

Proof. Here, we use the same notation as in Definition 12. The functions v and u can be constructed as follows. (1) Construct the outer function v to match the function values at the extrema. The only requirement is that S_{1,i} = v(S_{2,i}) for all i. Since S_1 and S_2 have the same ordering, it is easy to construct such a function v that is continuous and strictly increasing, for example, piecewise linear. (2) Construct the inner function u to match the locations of the extrema. Denote g = v ∘ f_2, which satisfies f_1(x*_{1,i}) = g(x*_{2,i}). Since f_1 and g are strictly monotone and continuous on the corresponding intervals between neighboring extrema, we can construct the function u on each such interval I_i as u = g^{-1} ∘ f_1, where g^{-1} denotes the inverse of g restricted to the corresponding interval. Combining the pieces of u, we obtain a strictly increasing and continuous function u on the whole space R. As a consequence, we have f_1 = v ∘ f_2 ∘ u.

Lemma 8. For any function f*(x) ∈ C[0, 1] and ε > 0, 1) there is a leaky-ReLU+ABS (or leaky-ReLU+SIN) network with width one and depth L such that ∫_0^1 |f*(x) − f_L(x)|^p dx < ε^p; 2) there is a leaky-ReLU+UOE network with width one and depth L such that |f*(x) − f_L(x)| < ε for all x ∈ [0, 1].

Proof. We mainly provide the proof of the second point, while the first point can be proven using the same scheme. For any function f*(x) ∈ C([0, 1], R) and ε > 0, we can approximate it by a polynomial p_n(x) of order n such that |f*(x) − p_n(x)| ≤ ε/2 for all x ∈ [0, 1].

B Proof of the main results

B.1 Proof of Lemma 1

Lemma 1. For any compact domain K ⊂ R^{d_x} and any finite set of activation functions {σ_i}, the {σ_i} networks with width w < w*_min ≡ max(d_x, d_y) do not have the UAP for either L^p(K, R^{d_y}) or C(K, R^{d_y}).

Proof. It is enough to exhibit the following two counterexamples f*(x) that cannot be approximated in the L^p norm.

1) The function f*(x) = ∥x∥_2, x ∈ K = [−2, 2]^{d_x}, cannot be approximated by any network with width w ≤ d_x − 1. In fact, we can relax the networks to a function f(x) = ϕ(W x + b), where W x + b is an affine map from R^{d_x} to R^{d_x−1} and ϕ(x) could be any function. A consequence is that there exists a direction v (taken as a vector satisfying W v = 0, ∥v∥ = 1) such that f(x) = f(x + λv) for all λ ∈ R. Then, considering the sets A = {x : ∥x∥ ≤ 0.1} and B = {x : ∥x − v∥ ≤ 0.1}, we have ∫_K |f(x) − f*(x)| dx ≥ ∫_A |f(x) − f*(x)| dx + ∫_B |f(x) − f*(x)| dx ≥ ∫_A (|f(x) − f*(x)| + |f(x + v) − f*(x + v)|) dx ≥ ∫_A |f*(x) − f*(x + v)| dx ≥ 0.8|A|. Since the volume of A is a fixed positive number, the inequality implies that even the L^1 approximation of f* is impossible. Approximation in the L^p norm and the uniform norm is impossible as well.

2) The function f*, the parametrized curve from 0 to 1 along the edges of the cube, cannot be approximated by any network with width w ≤ d_y − 1. Relaxing the networks to a function f(x) = W ψ(x) + b, where ψ(x) could be any function, the range of f lies in a hyperplane, while f* has a positive distance to any hyperplane; hence the target f* cannot be approximated.

B.2 Proof of Theorem 2

Theorem 2. Let K ⊂ R^{d_x} be a compact set; then, for the function class L^p(K, R^{d_y}), the minimum width of leaky-ReLU networks having L^p-UAP is exactly w_min = max(d_x, d_y, 2).

Proof. Using Lemma 1, we only need to prove two points: 1) the L^p-UAP holds when max(d_x, d_y) ≥ 2; 2) when d_x = d_y = 1, there is a function that cannot be approximated by leaky-ReLU networks with width one (while width two is enough for the L^p-UAP). The first point is a consequence of Corollary 11, since we can extend the target function to dimension d = max(d_x, d_y). The second point is obvious, since leaky-ReLU networks with a width of one are monotone functions, which cannot approximate nonmonotone functions.

B.3 Proof of Theorem 3

Theorem 3. Let K ⊂ R^{d_x} be a compact set; then, for the function class L^p(K, R^{d_y}), the minimum width of leaky-ReLU+ABS networks having L^p-UAP is exactly w_min = max(d_x, d_y).

Proof. This is a consequence of Theorem 2 (for the case of max(d_x, d_y) ≥ 2) combined with Lemma 8 (for the case of d_x = d_y = 1).

B.4 Proof of Lemma 4

Lemma 4. Let K ⊂ R^{d_x} be a compact set; then, for the function class C(K, R^{d_y}), the minimum width of ReLU+FLOOR networks having C-UAP is exactly w_min = max(d_x, 2, d_y).

Proof. Recalling the results of Lemma 1, we only need to prove two points: 1) the C-UAP holds when max(d_x, d_y) ≥ 2; 2) when d_x = d_y = 1, there is a function that cannot be approximated by ReLU+FLOOR networks with width one (while width two is enough for the C-UAP). The first point can be established by the encoder-memorizer-decoder construction. The second point is obvious, since ReLU+FLOOR networks with width one are monotone functions, which cannot approximate nonmonotone functions.

B.5 Proof of Theorem 5

Theorem 5. The UOE networks with width d_y have C-UAP for functions in C([0, 1], R^{d_y}).

Proof. Since functions in C([0, 1], R^{d_y}) can be regarded as d_y one-dimensional functions, it is enough to prove the case of d_y = 1, which is the result in Lemma 8.

B.6 Proof of Corollary 6

Corollary 6. Let K ⊂ R^{d_x} be a compact set; then, for the continuous function class C(K, R^{d_y}), the minimum width of UOE+FLOOR networks having C-UAP is exactly w_min = max(d_x, d_y).

Proof. The case where max(d_x, d_y) ≥ 2 is a consequence of Lemma 4, since the UOE function contains leaky-ReLU as a part. The case where max(d_x, d_y) = 1, i.e., d_x = d_y = 1, is a consequence of Lemma 8.

