MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION

Abstract

The universal approximation property of width-bounded networks has been studied as a dual of the classical universal approximation results for depth-bounded networks. However, the critical width enabling universal approximation has not been exactly characterized in terms of the input dimension d_x and the output dimension d_y. In this work, we provide the first definitive result in this direction for networks using the RELU activation function: the minimum width required for universal approximation of L^p functions is exactly max{d_x + 1, d_y}. We also prove that the same conclusion does not hold for uniform approximation with RELU, but does hold once an additional threshold activation function is allowed. Our proof technique can also be used to derive a tighter upper bound on the minimum width required for universal approximation by networks with general activation functions.

1. INTRODUCTION

The study of the expressive power of neural networks investigates which classes of functions neural networks can and cannot represent or approximate. Classical results in this field mostly focus on shallow neural networks. An example is the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999), which shows that a neural network with fixed depth and arbitrary width can approximate any continuous function on a compact set, up to arbitrary accuracy, if the activation function is continuous and nonpolynomial. Another line of research studies the memory capacity of neural networks (Baum, 1988; Huang and Babri, 1998; Huang, 2003), trying to characterize the maximum number of data points that a given neural network can memorize. After the advent of deep learning, researchers started to investigate the benefit of depth in the expressive power of neural networks, in an attempt to understand the success of deep neural networks. This has led to interesting results showing the existence of functions that shallow networks can approximate only with extremely large width, while deep and narrow networks approximate them easily (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Poggio et al., 2017). A similar trade-off between depth and width is also observed in the study of memory capacity (Yun et al., 2019; Vershynin, 2020). In search of a deeper understanding of depth in neural networks, a dual scenario of the classical universal approximation theorem has also been studied (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020). Instead of the bounded depth and arbitrary width of the classical results, the dual problem asks whether universal approximation is possible with a network of bounded width and arbitrary depth. A very interesting characteristic of this setting is that there exists a critical threshold on the width that allows a neural network to be a universal approximator. For example, one of the first results in this direction (Lu et al., 2017) shows that universal approximation of L^1 functions from R^{d_x} to R is possible with width-(d_x + 4) RELU networks, but impossible with width-d_x RELU networks. This implies that the minimum width required for universal approximation lies between d_x + 1 and d_x + 4. Subsequent results have shown upper/lower bounds on the minimum width, but none of them has succeeded in a tight characterization of the minimum width.

1.1. WHAT IS KNOWN SO FAR?

Before summarizing existing results, we first define the function classes studied in the literature. For a domain X ⊆ R^{d_x} and a codomain Y ⊆ R^{d_y}, we define C(X, Y) to be the class of continuous functions from X to Y, equipped with the uniform norm, and L^p(X, Y) to be the class of L^p functions from X to Y, equipped with the L^p norm. We write w_min for the minimum width enabling universal approximation of the class under consideration.

First progress. As aforementioned, Lu et al. (2017) show that universal approximation of L^1(R^{d_x}, R) is possible with width-(d_x + 4) RELU networks, but impossible with width-d_x RELU networks. These results translate into bounds on the minimum width: d_x + 1 ≤ w_min ≤ d_x + 4. Hanin and Sellke (2017) consider approximation of C(K, R^{d_y}), where K ⊂ R^{d_x} is compact. They prove that RELU networks of width d_x + d_y are dense in C(K, R^{d_y}), while width-d_x RELU networks are not. Although this result fully characterizes w_min in the case d_y = 1, it fails to do so for d_y > 1.

General activations. Later, extensions to activation functions other than RELU appeared in the literature. Johnson (2019) shows that if the activation function ρ is uniformly continuous and can be uniformly approximated by a sequence of one-to-one functions, a width-d_x network cannot universally approximate C(K, R). Kidger and Lyons (2020) show that if ρ is continuous, nonpolynomial, and continuously differentiable at some z with ρ'(z) ≠ 0, then networks of width d_x + d_y + 1 with activation ρ are dense in C(K, R^{d_y}). Furthermore, Kidger and Lyons (2020) prove that RELU networks of width d_x + d_y + 1 are dense in L^p(R^{d_x}, R^{d_y}).

Limitations of prior art. Note that none of the existing works closes the gap between the upper bound (at least d_x + d_y) and the lower bound (at most d_x + 1). This gap is significant especially for applications with high-dimensional codomains (i.e., large d_y), which arise in many practical uses of neural networks, e.g., image generation (Kingma and Welling, 2013; Goodfellow et al., 2014), language modeling (Devlin et al., 2019; Liu et al., 2019), and molecule generation (Gómez-Bombarelli et al., 2018; Jin et al., 2018). In the prior work, the main bottleneck preventing an upper bound below d_x + d_y is that all d_x neurons are maintained to store the input and all d_y neurons to construct the function output; hence every layer already requires at least d_x + d_y neurons. In addition, the proof techniques for the lower bounds only consider the input dimension d_x, regardless of the output dimension d_y.

1.2. SUMMARY OF RESULTS

We mainly focus on characterizing the minimum width of RELU networks for universal approximation. Nevertheless, our results are not restricted to RELU networks; they can be generalized to networks with general activation functions. Our contributions can be summarized as follows.
• Theorem 1 states that the minimum width for RELU networks to be dense in L^p(R^{d_x}, R^{d_y}) is exactly max{d_x + 1, d_y}. This is the first result fully characterizing the minimum width of RELU networks for universal approximation. In particular, the upper bound on the minimum width is significantly smaller than the best known result d_x + d_y + 1 (Kidger and Lyons, 2020).
• Given the full characterization of w_min of RELU networks for approximating L^p(R^{d_x}, R^{d_y}), a natural question arises: is w_min the same for C(K, R^{d_y})? We prove that it is not; Theorem 2 shows that the minimum width for RELU networks to be dense in C([0, 1], R^2) is 3. Namely, RELU networks of width max{d_x + 1, d_y} are not dense in C(K, R^{d_y}) in general.
• In light of Theorem 2, is it impossible to approximate C(K, R^{d_y}) in general while maintaining width max{d_x + 1, d_y}? Theorem 3 shows that an additional activation comes to the rescue: if networks use both RELU and threshold activation functions (which we refer to as STEP, i.e., x ↦ 1[x ≥ 0]), they universally approximate C(K, R^{d_y}) with the minimum width max{d_x + 1, d_y}.
• Our proof techniques for the tight upper bounds are not restricted to RELU networks. In Theorem 4, we extend our results to the general activation functions covered by Kidger and Lyons (2020).

1.3. ORGANIZATION

We first define the necessary notation in Section 2. In Section 3, we formally state our main results and discuss their implications. In Section 4, we present our "coding scheme" for proving the upper bounds on the minimum width in Theorems 1, 3, and 4. In Section 5, we prove the lower bound in Theorem 2 by explicitly constructing a counterexample. Finally, we conclude the paper in Section 6. All formal proofs of Theorems 1-4 are presented in the Appendix.

2. PROBLEM SETUP AND NOTATION

Throughout this paper, we consider fully-connected neural networks that can be described as an alternating composition of affine transformations and activation functions. Formally, we consider the following setup: given a set of activation functions Σ, an L-layer neural network f of input dimension d_x, output dimension d_y, and hidden layer dimensions d_1, ..., d_{L-1} (for notational simplicity, we also write d_0 := d_x and d_L := d_y) is represented as

f := t_L ∘ σ_{L-1} ∘ ⋯ ∘ t_2 ∘ σ_1 ∘ t_1,   (1)

where t_ℓ : R^{d_{ℓ-1}} → R^{d_ℓ} is an affine transformation and σ_ℓ applies an activation function to each coordinate: σ_ℓ(x_1, ..., x_{d_ℓ}) = (ρ_{ℓ,1}(x_1), ..., ρ_{ℓ,d_ℓ}(x_{d_ℓ})), where ρ_{ℓ,i} ∈ Σ. While we mostly consider the cases where Σ is a singleton (e.g., Σ = {RELU}), we also consider the case where Σ contains both RELU and STEP activation functions, as in Theorem 3. We call a neural network with Σ = {ρ} a "ρ network" and a neural network with Σ = {ρ_1, ρ_2} a "ρ_1+ρ_2 network." We define the width w of f as the maximum over d_1, ..., d_{L-1}. We use "ρ networks (or ρ_1+ρ_2 networks) of width w" to denote the collection of all ρ networks (or ρ_1+ρ_2 networks) of width w having a finite number of layers. For describing universal approximation, we say ρ networks (or ρ_1+ρ_2 networks) of width w are dense in C(X, Y) if for any f* ∈ C(X, Y) and ε > 0, there exists a ρ network (or ρ_1+ρ_2 network) f of width w such that ‖f* − f‖_∞ ≤ ε. Likewise, we say they are dense in L^p(X, Y) if for any f* ∈ L^p(X, Y) and ε > 0, there exists such an f with ‖f* − f‖_p ≤ ε.
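To make the setup concrete, the following sketch (a minimal illustration of equation (1) with arbitrary, hypothetical weights, not a construction from this paper) evaluates a network as an alternating composition of affine maps and coordinate-wise activations and reports its width.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def step(x):
    return (x >= 0.0).astype(x.dtype)

def forward(x, layers, activations):
    """Evaluate t_L o sigma_{L-1} o ... o sigma_1 o t_1 at x.

    layers: list of (W, b) pairs defining affine maps t_l(z) = W z + b.
    activations: list of length L-1; activations[l] is applied coordinate-wise
    after the (l+1)-th affine map (e.g. relu, or a mix of relu/step as in
    RELU+STEP networks).
    """
    z = x
    for l, (W, b) in enumerate(layers):
        z = W @ z + b
        if l < len(layers) - 1:          # no activation after the last affine map
            z = activations[l](z)
    return z

# A hypothetical width-3 network from R^2 to R^2 (width = max hidden dimension).
rng = np.random.default_rng(0)
dims = [2, 3, 3, 2]                      # d_0 = d_x, hidden dims, d_L = d_y
layers = [(rng.standard_normal((dims[i + 1], dims[i])), rng.standard_normal(dims[i + 1]))
          for i in range(len(dims) - 1)]
acts = [relu, relu]
print(forward(np.array([0.3, -0.7]), layers, acts))
print("width =", max(dims[1:-1]))
```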

3. MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION

L^p approximation with RELU. We present our main theorems in this section. First, for universal approximation of L^p(R^{d_x}, R^{d_y}) using RELU networks, we give the following theorem.

Theorem 1. For any p ∈ [1, ∞), RELU networks of width w are dense in L^p(R^{d_x}, R^{d_y}) if and only if w ≥ max{d_x + 1, d_y}.

This theorem shows that the minimum width w_min for universal approximation is exactly max{d_x + 1, d_y}. To obtain this tight characterization, we show three new upper and lower bounds: w_min ≤ max{d_x + 1, d_y} through a construction based on a coding approach, w_min ≥ d_y through a volumetric argument, and w_min ≥ d_x + 1 through an extension of the corresponding lower bound for L^1(R^{d_x}, R^{d_y}) (Lu et al., 2017). Combining these bounds gives the tight minimum width w_min = max{d_x + 1, d_y}. Notably, our new proof technique overcomes the limitation of existing upper bounds that require width at least d_x + d_y. Our construction first encodes the d_x-dimensional input vector into a one-dimensional codeword, then maps the codeword to a target codeword via memorization, and finally decodes the target codeword into a d_y-dimensional output vector. Since the map from input to target operates on scalar codewords, we bypass the need for d_x + d_y hidden neurons. More details are given in Section 4; proofs of the lower bounds are deferred to Appendix B.

Uniform approximation with RELU. Theorem 1 gives the tight characterization w_min = max{d_x + 1, d_y} for L^p(R^{d_x}, R^{d_y}) functions. Does the same hold for C(K, R^{d_y}), for a compact K ⊂ R^{d_x}? Surprisingly, we show that the same conclusion does not hold in general. Indeed, we show the following result, proving that width max{d_x + 1, d_y} is provably insufficient for d_x = 1, d_y = 2.

Theorem 2. RELU networks of width w are dense in C([0, 1], R^2) if and only if w ≥ 3.

Theorem 2 translates to w_min = 3; the upper bound w_min ≤ 3 = d_x + d_y is given by Hanin and Sellke (2017). The key is to prove the lower bound w_min ≥ 3, i.e., that width 2 is insufficient. Recall from Section 1.1 that all known lower bounds are limited to showing that width d_x is insufficient for universal approximation. A closer look at their proof techniques reveals that they rely heavily on the fact that the hidden layers have the same dimension as the input space. As soon as the width satisfies w > d_x, their arguments break, because such a network maps the input space into a higher-dimensional space. Although only for d_x = 1 and d_y = 2, we overcome this limitation and show that width w = 2 > d_x is insufficient for universal approximation, by providing a counterexample. We use a novel topological argument which comes from a careful observation on the image created by RELU operations; in particular, we utilize the property of RELU that it projects all negative inputs to zero without modifying positive inputs. We believe that our proof will be of interest to readers and inspire follow-up work. Please see Section 5 for more details.

Theorems 1 and 2 together imply that for RELU networks, approximating C(K, R^{d_y}) requires more width than approximating L^p(R^{d_x}, R^{d_y}). Interestingly, this is in stark contrast with existing results on depth, where the minimum depth of RELU networks for approximating C(K, R^{d_y}) is two (Leshno et al., 1993) but is greater than two for approximating L^p(R^{d_x}, R^{d_y}) (Wang and Qu, 2019).

Uniform approximation with RELU+STEP. While width max{d_x + 1, d_y} is insufficient for RELU networks to be dense in C(K, R^{d_y}), an additional STEP activation function helps achieve the minimum width max{d_x + 1, d_y}, as stated in the theorem below.

Theorem 3. RELU+STEP networks of width w are dense in C(K, R^{d_y}) if and only if w ≥ max{d_x + 1, d_y}.

Theorems 2 and 3 indicate that the minimum width for universal approximation does depend on the choice of activation functions. This is again in contrast to the classical results, where both RELU networks and RELU+STEP networks of depth 2 are universal approximators (Leshno et al., 1993), i.e., the minimum depths for universal approximation are identical for the two classes. Theorem 3 follows from a proof technique similar to that of Theorem 1: thanks to its discontinuous nature, the STEP activation can be used in our encoder to quantize the input without introducing uniform-norm errors. The lower bounds on w_min are proved similarly to Theorem 1.

General activations. Our proof technique for the upper bounds in Theorems 1 and 3 easily extends to networks using general activations. Indeed, we prove the following theorem, which shows that one additional unit of width suffices to cover networks with general activations.

Theorem 4. Let ρ : R → R be any continuous nonpolynomial function which is continuously differentiable at some z with ρ'(z) ≠ 0. Then, ρ networks of width w are dense in L^p(K, R^{d_y}) for all p ∈ [1, ∞) if w ≥ max{d_x + 2, d_y + 1}.

4. TIGHT UPPER BOUND ON MINIMUM WIDTH

In this section, we present the main idea for constructing networks achieving the minimum width for universal approximation, and then sketch the proofs of upper bounds in Theorems 1, 3, and 4.

4.1. CODING SCHEME FOR UNIVERSAL APPROXIMATION

We now illustrate the main idea underlying the construction of neural networks that achieve the minimum width. To this end, we consider approximation of a target continuous function f* ∈ C([0, 1]^{d_x}, [0, 1]^{d_y}); our main idea easily generalizes to other domains and codomains and to L^p functions. Our construction can be viewed as a coding scheme consisting of three parts: an encoder, a memorizer, and a decoder. First, the encoder encodes an input vector x into a one-dimensional codeword. Then, the memorizer maps the codeword to a one-dimensional target codeword encoding the corresponding target value f*(x). Finally, the decoder maps the target codeword to a target vector that is sufficiently close to f*(x). One can view the encoder, memorizer, and decoder as functions mapping from d_x dimensions to 1 dimension, from 1 dimension to 1 dimension, and from 1 dimension to d_y dimensions, respectively. The spirit of the coding scheme is that these three functions can be constructed using the ideas of prior results such as Hanin and Sellke (2017), who approximate any continuous function from n-dimensional inputs to m-dimensional outputs using RELU networks of width n + m. Following this intuition, we construct the encoder, the memorizer, and the decoder by RELU+STEP networks (or RELU networks) of width d_x + 1, 2, and d_y, respectively; these constructions yield the tight upper bound max{d_x + 1, d_y}. Here, the decoder requires width d_y instead of d_y + 1, as we only construct the first d_y − 1 coordinates of the output and recover the last coordinate from a linear combination of the target codeword and the first d_y − 1 coordinates. Note also that we construct the encoder, memorizer, and decoder exactly, whereas prior results only approximate such maps. We now describe the operation of each part; their neural network constructions are explained in the subsequent subsections.

Encoder. Before introducing the encoder, we first define a quantization function q_n : [0, 1] → C_n for n ∈ N and C_n := {0, 2^{-n}, 2 × 2^{-n}, ..., 1 − 2^{-n}}, given by q_n(x) := max{c ∈ C_n : c ≤ x}. In other words, for any x ∈ [0, 1), q_n(x) preserves the first n bits of the binary representation of x and discards the rest; x = 1 is mapped to 1 − 2^{-n}. Note that the quantization error is always at most 2^{-n}. The encoder encodes each input x ∈ [0, 1]^{d_x} into a scalar value via the function encode_K : R^{d_x} → C_{d_x K}, for some K ∈ N, defined as

encode_K(x) := Σ_{i=1}^{d_x} q_K(x_i) × 2^{-(i-1)K}.

In other words, encode_K(x) quantizes each coordinate of x to a K-bit binary representation and concatenates the quantized coordinates into a single scalar with a (d_x K)-bit binary representation. Note that if one "decodes" a codeword encode_K(x) back into a vector x̄ as {x̄} := encode_K^{-1}(encode_K(x)) ∩ C_K^{d_x} (here encode_K^{-1} denotes the preimage under encode_K, and C_K^{d_x} is the Cartesian product of d_x copies of C_K), then ‖x − x̄‖_∞ ≤ 2^{-K}. Namely, the "information loss" incurred by the encoding can be made arbitrarily small by choosing K large.

Memorizer. The memorizer maps each codeword encode_K(x) ∈ C_{d_x K} to its target codeword via the function memorize_{K,M} : C_{d_x K} → C_{d_y M}, for some M ∈ N, defined as

memorize_{K,M}(encode_K(x)) := encode_M(f* ∘ q_K(x)),

where q_K is applied coordinate-wise to a vector. We note that memorize_{K,M} is well-defined, as each encode_K(x) ∈ C_{d_x K} corresponds to a unique q_K(x) ∈ C_K^{d_x}.
Here, one can observe that the output of the memorizer carries the information of the target value, since encode_M(f* ∘ q_K(x)) encodes f* at a quantized version of x, and the information loss due to quantization can be made arbitrarily small by choosing K and M large enough.

Decoder. The decoder decodes each codeword generated by the memorizer via the function decode_M : C_{d_y M} → C_M^{d_y}, defined as decode_M(c) := x̄ where {x̄} := encode_M^{-1}(c) ∩ C_M^{d_y}.

Combining encode, memorize, and decode completes our coding scheme for approximating f*. One can observe that our coding scheme is equivalent to q_M ∘ f* ∘ q_K (with q_M applied coordinate-wise), which can approximate the target function f* within any error ε > 0:

sup_{x ∈ [0,1]^{d_x}} ‖f*(x) − decode_M ∘ memorize_{K,M} ∘ encode_K(x)‖_∞ ≤ ε,

by choosing K, M ∈ N large enough that ω_{f*}(2^{-K}) + 2^{-M} ≤ ε, where ω_{f*} denotes the modulus of continuity of f*, i.e., ‖f*(x) − f*(x′)‖_∞ ≤ ω_{f*}(‖x − x′‖_∞) for all x, x′ ∈ [0, 1]^{d_x}. In the remainder of this section, we discuss how each part of the coding scheme can be implemented by a neural network using RELU+STEP activations (Section 4.2), the RELU activation alone (Section 4.3), and other general activations (Section 4.4).
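The scheme can be checked numerically. The sketch below is our own illustration under simplifying assumptions: the target function target_f and the parameters K, M are hypothetical, and encode, memorize, and decode are evaluated directly as ordinary functions rather than realized as width-limited networks. It composes them exactly as q_M ∘ f* ∘ q_K and measures the resulting uniform error, which should be at most ω_{f*}(2^{-K}) + 2^{-M}.

```python
import numpy as np

def quantize(x, n):
    """q_n(x): keep the first n bits of x in [0, 1]; x = 1 maps to 1 - 2^{-n}."""
    return np.minimum(np.floor(x * 2 ** n), 2 ** n - 1) / 2 ** n

def encode(x, K):
    """Concatenate the K-bit quantizations of the coordinates into one scalar codeword."""
    return sum(quantize(xi, K) * 2 ** (-(i * K)) for i, xi in enumerate(x))

def decode(c, dy, M):
    """Split a (dy * M)-bit codeword back into dy coordinates of M bits each."""
    out = []
    for _ in range(dy):
        top = quantize(c, M)       # first M bits = next output coordinate
        out.append(top)
        c = (c - top) * 2 ** M     # shift the remaining bits up
    return np.array(out)

def memorize(c, target_f, dx, dy, K, M):
    """Map an input codeword to the codeword of f* evaluated at the quantized input."""
    x_bar = decode(c, dx, K)       # recover the quantized input from the codeword
    return encode(target_f(x_bar), M)

# Hypothetical continuous target f*: [0,1]^2 -> [0,1]^2.
target_f = lambda x: np.array([0.5 * (np.sin(3 * x[0]) + 1) / 2 + 0.25 * x[1],
                               x[0] * x[1]])
dx, dy, K, M = 2, 2, 12, 12
rng = np.random.default_rng(1)
xs = rng.random((1000, dx))
err = max(np.max(np.abs(target_f(x) -
                        decode(memorize(encode(x, K), target_f, dx, dy, K, M), dy, M)))
          for x in xs)
print("uniform error on samples:", err)   # at most omega_{f*}(2^-K) + 2^-M
```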

4.2. TIGHT UPPER BOUND ON MINIMUM WIDTH OF RELU+STEP NETWORKS (THEOREM 3)

In this section, we discuss how to explicitly construct our coding scheme for approximating functions in C(K, R^{d_y}) using a width-(max{d_x + 1, d_y}) RELU+STEP network; this yields the tight upper bound in Theorem 3. First, the encoder consists of quantization functions q_K and a linear transformation. However, since q_K is discontinuous and cannot be uniformly approximated by any continuous function, we utilize the discontinuous STEP activation to construct the encoder exactly, via a RELU+STEP network of width d_x + 1. On the other hand, the memorizer and the decoder map a finite number of scalar values (in C_{d_x K} and C_{d_y M}, respectively) to their target values/vectors. Such maps can easily be implemented by piecewise linear functions; hence, they can be exactly constructed by RELU networks of width 2 and d_y, respectively, as discussed in Section 4.1. Note that STEP is used only in the encoder. In summary, the three parts of our coding scheme are exactly constructed by RELU+STEP networks of width d_x + 1, 2, and d_y, so the overall RELU+STEP network has width max{d_x + 1, d_y}. Furthermore, it approximates the target continuous function f* within arbitrary uniform error by choosing K and M sufficiently large.
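As a sanity check on why STEP enables exact quantization, the following sketch (our own illustration; it is not the width-2, O(2^K)-layer construction of Lemma 8) accumulates q_K(x) one STEP unit at a time, mirroring the sequential, layer-by-layer flavor of the construction.

```python
import numpy as np

def step(x):
    return 1.0 if x >= 0 else 0.0

def q_exact(x, K):
    """q_K(x) for x in [0, 1], built from 2^K - 1 threshold (STEP) units.

    Each pass adds 2^{-K} * 1[x >= j * 2^{-K}]; a deep width-limited network can
    realize the same sum by carrying x and the running total through its layers.
    """
    acc = 0.0
    for j in range(1, 2 ** K):
        acc += 2.0 ** (-K) * step(x - j * 2.0 ** (-K))
    return acc

K = 4
for x in [0.0, 0.3, 0.5, 0.99, 1.0]:
    # q_K agrees with "keep the first K bits" (with x = 1 mapped to 1 - 2^{-K})
    assert abs(q_exact(x, K) - min(np.floor(x * 2 ** K), 2 ** K - 1) / 2 ** K) < 1e-12
    print(x, "->", q_exact(x, K))
```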

4.3. TIGHT UPPER BOUND ON MINIMUM WIDTH OF RELU NETWORKS (THEOREM 1)

The construction of a width-(max{d_x + 1, d_y}) RELU network approximating L^p(R^{d_x}, R^{d_y}) (i.e., the tight upper bound in Theorem 1) is almost identical to the RELU+STEP construction in Section 4.2. Since any L^p function can be approximated by a continuous function with compact support, we aim to approximate a continuous f* : [0, 1]^{d_x} → [0, 1]^{d_y}, as in our coding scheme. Since the memorizer and the decoder can be exactly constructed by RELU networks, we only discuss the encoder here. As discussed in the previous subsection, the encoder cannot be uniformly approximated by continuous functions (i.e., by RELU networks). Nevertheless, it can be implemented by a continuous function everywhere except on a subset of the domain around its discontinuities, and this subset can be made arbitrarily small in Lebesgue measure. That is, we construct a RELU network of width d_x + 1 that realizes the encoder on [0, 1]^{d_x} except on a small subset, which allows us to approximate the encoder in the L^p norm. Combining this with the memorizer and the decoder, we obtain a RELU network of width max{d_x + 1, d_y} that approximates the target function f* in the L^p norm.
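The following sketch is our own illustration of this idea (the ramp width delta is a hypothetical parameter): a continuous, piecewise linear "staircase with steep ramps" that agrees with q_K except on a set whose measure vanishes as delta → 0.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def q_relu(x, K, delta):
    """Continuous piecewise-linear surrogate of q_K on [0, 1].

    Equals q_K(x) except on the ramp intervals (j*2^-K - delta, j*2^-K),
    whose total Lebesgue measure is at most (2^K - 1) * delta.
    """
    acc = np.zeros_like(np.asarray(x, dtype=float))
    for j in range(1, 2 ** K):
        t = j * 2.0 ** (-K)
        # ramp from 0 to 2^-K over (t - delta, t), constant elsewhere
        acc += (relu(x - (t - delta)) - relu(x - t)) * (2.0 ** (-K) / delta)
    return acc

K, delta = 3, 1e-3
xs = np.linspace(0, 1, 2001)
exact = np.minimum(np.floor(xs * 2 ** K), 2 ** K - 1) / 2 ** K
bad = np.mean(np.abs(q_relu(xs, K, delta) - exact) > 1e-9)
print("fraction of grid points where the surrogate differs from q_K:", bad)
```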

4.4. TIGHTENING UPPER BOUND ON MINIMUM WIDTH OF GENERAL NETWORKS (THEOREM 4)

Our network construction generalizes to other activation functions using existing results on approximation of C(K, R^{d_y}) functions. For example, Kidger and Lyons (2020) show that if the activation ρ is continuous, nonpolynomial, and continuously differentiable at some z with ρ'(z) ≠ 0, then ρ networks of width d_x + d_y + 1 are dense in C(K, R^{d_y}). Applying this result to our encoder, memorizer, and decoder constructions for RELU networks, it follows that if ρ satisfies the above conditions, then ρ networks of width max{d_x + 2, d_y + 1} are dense in L^p(K, R^{d_y}), i.e., Theorem 4 holds. We note that other universal approximation results for C(K, R^{d_y}) with different activation functions, beyond Kidger and Lyons (2020), can also be combined with our construction.
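For intuition on the role of the condition ρ'(z) ≠ 0, the snippet below shows the standard one-neuron identity approximation that such constructions rely on (our own illustration with tanh; this is not the construction of Kidger and Lyons (2020)): a difference quotient of ρ around z recovers x up to an error that shrinks with the step size h.

```python
import numpy as np

rho = np.tanh                      # a continuous, nonpolynomial activation
z, h = 0.5, 1e-3                   # rho is differentiable at z with rho'(z) != 0
drho = 1 - np.tanh(z) ** 2         # rho'(z)

# One affine map, one rho unit, one affine map: approximately the identity on [-1, 1].
approx_id = lambda x: (rho(z + h * x) - rho(z)) / (h * drho)
xs = np.linspace(-1, 1, 5)
print(np.max(np.abs(approx_id(xs) - xs)))   # small, and shrinks as h -> 0
```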

4.5. NUMBER OF LAYERS FOR MINIMUM WIDTH UNIVERSAL APPROXIMATORS

Lastly, we discuss the number of layers required by our constructions achieving the tight minimum width in Theorems 1 and 3. First, our RELU+STEP network construction of width max{d_x + 1, d_y} for approximating f* ∈ C([0, 1]^{d_x}, R^{d_y}) requires O(2^{d_x K} + 2^M) layers: O(2^K) layers for the encoder, O(2^{d_x K}) layers for the memorizer, and O(2^M) layers for the decoder (d_x and d_y are treated as constants). Since K, M must satisfy ω_{f*}(2^{-K}) + 2^{-M} ≤ ε to approximate f* within error ε (see Section 4.1), we choose K = ⌈−log_2(ω_{f*}^{-1}(ε/2))⌉ and M = ⌈−log_2(ε/2)⌉ so that ω_{f*}(2^{-K}) ≤ ε/2 and 2^{-M} ≤ ε/2. Namely, the overall number of layers (i.e., of parameters) required for approximating f* within error ε is O((ω_{f*}^{-1}(ε/2))^{-d_x} + 1/ε). This number is closely related to the number of parameters needed to approximate an arbitrary f* ∈ C([0, 1]^{d_x}, R^{d_y}) within error ε: for RELU networks with a constantly bounded number of layers, Θ((ω_{f*}^{-1}(Ω(ε)))^{-d_x}) parameters are necessary and sufficient (Yarotsky, 2018); for RELU networks whose number of layers may grow with the number of parameters, O((ω_{f*}^{-1}(Ω(ε)))^{-d_x/2}) parameters are sufficient (Yarotsky, 2018). In contrast, although the number of layers grows with the number of parameters in our construction, it requires O((ω_{f*}^{-1}(Ω(ε)))^{-d_x} + 1/ε) parameters; hence, the parameter count of constructions achieving the tight minimum width may still be improvable. Likewise, for the tight upper bound in Theorem 1, approximating f* ∈ L^p(R^{d_x}, R^{d_y}) within error ε, our RELU network construction requires O((ω_{f̄}^{-1}(Ω(ε)))^{-d_x} + 1/ε) layers, where f̄ is some continuous function satisfying ‖f* − f̄‖_p ≤ ε/2.
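As a concrete instance of this bookkeeping, the following snippet (our own example, assuming f* is 1-Lipschitz in the sup norm so that ω_{f*}(t) = t, and ignoring constant factors) picks K and M from a target error ε and reports the resulting 2^{d_x K} + 2^M layer estimate.

```python
import math

def depth_estimate(eps, dx, lipschitz=1.0):
    """K, M chosen so that omega(2^-K) + 2^-M <= eps for an L-Lipschitz target,
    together with the (unscaled) layer count 2^{dx*K} + 2^M from Section 4.5."""
    K = math.ceil(-math.log2(eps / (2 * lipschitz)))   # omega(2^-K) = L * 2^-K <= eps/2
    M = math.ceil(-math.log2(eps / 2))                 # 2^-M <= eps/2
    return K, M, 2 ** (dx * K) + 2 ** M

for eps in [0.1, 0.01]:
    print(eps, depth_estimate(eps, dx=2))
```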

5. TIGHT LOWER BOUND ON MINIMUM WIDTH

The purpose of this section is to prove the tight lower bound in Theorem 2, i.e., that there exist f* ∈ C([0, 1], R^2) and ε > 0 satisfying the following property: for any width-2 RELU network f, we have ‖f* − f‖_∞ > ε. Our construction of f* is based on topological properties of RELU networks, which we study in Section 5.1. We then introduce a counterexample f* and prove in Section 5.2 that it cannot be approximated by width-2 RELU networks.

5.1. TOPOLOGICAL PROPERTIES OF RELU NETWORKS

We first rewrite a width-2 RELU network f : R → R^2 as follows, following (1):

f := t_L ∘ σ ∘ ⋯ ∘ σ ∘ t_2 ∘ σ ∘ t_1,

where L ∈ N denotes the number of layers, t_1 : R → R^2 and t_ℓ : R^2 → R^2 for ℓ > 1 are affine transformations, and σ is the coordinate-wise RELU. Without loss of generality, we assume that t_ℓ is invertible for all ℓ > 1, as invertible affine transformations are dense (in the uniform norm on bounded supports) in the space of affine transformations. To better illustrate the topological properties of f, we reformulate it as

f = (φ_{L-1}^{-1} ∘ σ ∘ φ_{L-1}) ∘ ⋯ ∘ (φ_2^{-1} ∘ σ ∘ φ_2) ∘ (φ_1^{-1} ∘ σ ∘ φ_1) ∘ t^†,   (2)

where φ_ℓ and t^† are defined as t^† := t_L ∘ ⋯ ∘ t_1 and φ_ℓ := (t_L ∘ ⋯ ∘ t_{ℓ+1})^{-1}, i.e., t_ℓ = φ_ℓ ∘ φ_{ℓ-1}^{-1} for ℓ ≥ 2 and t_1 = φ_1 ∘ t^†. Under the reformulation (2), f first maps inputs through the affine transformation t^† and then sequentially applies φ_ℓ^{-1} ∘ σ ∘ φ_ℓ. Each map φ_ℓ^{-1} ∘ σ ∘ φ_ℓ can be viewed as changing the coordinate system via φ_ℓ, applying RELU in the modified coordinate system, and returning to the original coordinate system via φ_ℓ^{-1}. Under this reformulation, we present the following lemmas, whose proofs are given in Appendices B.4-B.5.

Lemma 5. Let φ : R^2 → R^2 be an invertible affine transformation. Then, there exist a_1, a_2 ∈ R^2 and b_1, b_2 ∈ R determined by φ such that the following statements hold for S := {x : ⟨a_1, x⟩ + b_1 ≥ 0, ⟨a_2, x⟩ + b_2 ≥ 0} and x̄ := φ^{-1} ∘ σ ∘ φ(x).
• If x ∈ S, then x̄ = x.
• If x ∈ R^2 \ S, then x̄ ≠ x and x̄ ∈ ∂S, where ∂S denotes the boundary of S.

Lemma 6. Let φ : R^2 → R^2 be an invertible affine transformation. Suppose that x lies in a bounded path-component of R^2 \ T for some T ⊂ R^2 (a path-component of T is a maximal path-connected subset of T). Then, the following statements hold for x̄ := φ^{-1} ∘ σ ∘ φ(x) and T̄ := φ^{-1} ∘ σ ∘ φ(T).
• If x̄ = x and x ∉ T̄, then x̄ lies in a bounded path-component of R^2 \ T̄.
• If x̄ ≠ x, then x̄ ∈ T̄.

Lemma 5 follows from the fact that RELU acts as the identity on nonnegative coordinates and maps negative coordinates to zero. In particular, (a_1, b_1) and (a_2, b_2) in Lemma 5 correspond to the axes of the "modified" coordinate system before applying σ. Using the same property of RELU, Lemma 6 states that if a point x is surrounded by a set T, then after applying φ^{-1} ∘ σ ∘ φ, either the point stays at the same position and is surrounded by the image of T, or it intersects the image of T. Based on these observations, we are now ready to introduce our counterexample.
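The following numerical check is our own illustration of Lemma 5 (with a randomly chosen, hypothetical invertible affine map): writing φ(x) = Ax + c, points with φ(x) ≥ 0 coordinate-wise are fixed by φ^{-1} ∘ σ ∘ φ, and every other point is moved onto the boundary of that region.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2)) + 2 * np.eye(2)   # invertible with high probability
c = rng.standard_normal(2)
phi = lambda x: A @ x + c
phi_inv = lambda y: np.linalg.solve(A, y - c)
relu = lambda z: np.maximum(z, 0.0)

moved_land_on_boundary = True
for _ in range(10000):
    x = rng.uniform(-5, 5, size=2)
    x_bar = phi_inv(relu(phi(x)))
    if np.all(phi(x) >= 0):
        assert np.allclose(x_bar, x)              # x in S  =>  fixed point
    else:
        # x not in S  =>  x_bar lies on the boundary of S = {x : phi(x) >= 0},
        # i.e. at least one coordinate of phi(x_bar) is (numerically) zero.
        moved_land_on_boundary &= np.isclose(np.min(np.abs(phi(x_bar))), 0.0, atol=1e-9)
print("all moved points landed on the boundary of S:", moved_land_on_boundary)
```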

5.2. COUNTEREXAMPLE

Our counterexample f* : [0, 1] → R^2 is illustrated in Figure 2(a): f*([0, p_1]) is drawn in red from (4, 3) to (0, 0), f*((p_1, p_2)) is drawn in black from (0, 0) to (−1, 0), and f*([p_2, 1]) is drawn in blue from (−1, 0) to (1, 0), for some 0 < p_1 < p_2 < 1, e.g., p_1 = 1/3, p_2 = 2/3. In this section, we suppose for contradiction that there exists a RELU network f of width 2 such that ‖f* − f‖_∞ ≤ 1/100. To this end, consider the mapping given by the first ℓ layers of f:

g_ℓ := (φ_ℓ^{-1} ∘ σ ∘ φ_ℓ) ∘ ⋯ ∘ (φ_1^{-1} ∘ σ ∘ φ_1) ∘ t^†.

Our proof is based on the following fact: if g_ℓ(x) = g_ℓ(x′), then f(x) = f(x′). Thus, the following must hold:

if ‖f* − f‖_∞ ≤ 1/100, then g_ℓ([0, p_1]) ∩ g_ℓ([p_2, 1]) = ∅ for all ℓ ≥ 1.   (3)

Since f([0, 1]) ∩ B must have been constructed within the first ℓ_* layers, as illustrated in Figures 2(b) and 2(c), the boundary ∂B intersects g_{ℓ_*}([p_2, 1]) (the blue line) near the points (−1, 1) and (1, 1). Hence, T := g_{ℓ_*}([p_2, 1]) ∪ B forms a "closed loop." Also, ∂B intersects g_{ℓ_*}([0, p_1]) near the point (0, 1), so there must exist a point in g_{ℓ_*}([0, p_1]) \ B that is "surrounded" by T. Given these observations, we have the following lemma, whose proof is presented in Appendix B.6.

Lemma 7. The image g_{ℓ_*}([0, p_1]) \ B is contained in a bounded path-component of R^2 \ T unless g_{ℓ_*}([0, p_1]) ∩ g_{ℓ_*}([p_2, 1]) ≠ ∅.

6. CONCLUSION

The universal approximation property of width-bounded networks is one of the fundamental problems in the theory of the expressive power of deep learning. Prior work attempts to characterize the minimum width sufficient for universal approximation, but provides only upper and lower bounds with large gaps. In this work, we give the first exact characterization of the minimum width of RELU networks and RELU+STEP networks. In addition, we observe an interesting dependence of the minimum width on the target function class and the activation functions, in contrast to the minimum depth in classical results. We believe that our results and analyses will contribute to a better understanding of the performance of modern deep and narrow network architectures.

A.2 PROOF OF TIGHT UPPER BOUND IN THEOREM 1

In this section, we derive the upper bound in Theorem 1. In particular, we prove that for any p ∈ [1, ∞), any f* ∈ L^p(R^{d_x}, R^{d_y}), and any ε > 0, there exists a RELU network f of width max{d_x + 1, d_y} such that ‖f* − f‖_p ≤ ε. To this end, we first note that since f* ∈ L^p(R^{d_x}, R^{d_y}), there exists a continuous function f̄ with compact support such that ‖f* − f̄‖_p ≤ ε/2. Hence, if we construct a RELU network f such that ‖f̄ − f‖_p ≤ ε/2, the proof is complete. Throughout this proof, we assume that the support of f̄ is a subset of [0, 1]^{d_x} and its codomain is [0, 1]^{d_y}; this easily generalizes to arbitrary compact supports and codomains.

We approximate f̄ by a RELU network using the three-part coding scheme introduced in Section 4.1. We refer to our implementations of the three parts as encode_K^†, memorize_{K,M}^†, and decode_M^†; that is, we approximate f̄ by a RELU network f := decode_M^† ∘ memorize_{K,M}^† ∘ encode_K^†. However, unlike the construction of RELU+STEP networks in Appendix A.1, STEP is not available here, so uniform approximation of q_K is impossible. Nevertheless, one can approximate q_K by a continuous piecewise linear function that replaces small regions around the discontinuities with linear ramps. Under this idea, we introduce the following lemma, whose proof is presented in Appendix A.7.

Lemma 11. For any d_x, K ∈ N and any γ > 0, there exist a RELU network f : R^{d_x} → R of width d_x + 1 with O(2^K) layers and a set D_γ ⊂ [0, 1]^{d_x} such that f(x) = encode_K(x) for all x ∈ [0, 1]^{d_x} \ D_γ, µ(D_γ) < γ, f(D_γ) ⊂ [0, 1], and f(R^{d_x} \ [0, 1]^{d_x}) = {1 − 2^{-d_x K}}, where µ denotes the Lebesgue measure.

By Lemma 11, there exist a RELU network encode_K^† of width d_x + 1 and D_γ ⊂ [0, 1]^{d_x} such that µ(D_γ) < γ, encode_K^†(x) = encode_K(x) for all x ∈ [0, 1]^{d_x} \ D_γ, and encode_K^†(R^{d_x} \ [0, 1]^{d_x}) = {1 − 2^{-d_x K}}. We approximate the encoder by encode_K^†. Here, inputs from D_γ may be mapped to arbitrary values by encode_K^†; nevertheless, this is not critical to the error ‖f̄ − f‖_p, as µ(D_γ) < γ can be made arbitrarily small by choosing γ small enough.

The implementation of the memorizer uses Lemma 9, as in Appendix A.1. However, since f̄(x) = 0 for all x ∈ R^{d_x} \ [0, 1]^{d_x}, we construct a RELU network memorize_{K,M}^† of width 2 so that memorize_{K,M}^†(encode_K^†(R^{d_x} \ [0, 1]^{d_x})) = {0}. To achieve this, we design the memorizer for c ∈ C_{d_x K}, using Lemma 9 and based on (4), as

memorize_{K,M}^†(c) := 0 if c = 1 − 2^{-d_x K}, and memorize_{K,M}^†(c) := memorize_{K,M}(c) otherwise.

We note that this design incurs an additional error: a subset of E_K := [1 − 2^{-K}, 1]^{d_x} may be mapped to zero after applying memorize_{K,M}^†. Nevertheless, mapping E_K to zero is not critical to the error ‖f̄ − f‖_p, as µ(E_K) ≤ 2^{-d_x K} can be made arbitrarily small by choosing K large enough.

We implement the decoder by a RELU network decode_M^† of width d_y using Lemma 10, as in Appendix A.1. Then, by Lemma 10, decode_M^†(R) ⊂ [0, 1]^{d_y}, and hence f(R^{d_x}) ⊂ [0, 1]^{d_y}.
Finally, we bound the error ‖f̄ − f‖_p using the following chain of inequalities:

‖f̄ − f‖_p = ( ∫_{R^{d_x}} ‖f̄(x) − f(x)‖_p^p dx )^{1/p}
= ( ∫_{[0,1]^{d_x} \ (E_K ∪ D_γ)} ‖f̄(x) − f(x)‖_p^p dx + ∫_{E_K ∪ D_γ} ‖f̄(x) − f(x)‖_p^p dx )^{1/p}
≤ ( d_y (ω_{f̄}(2^{-K}) + 2^{-M})^p + (µ(E_K) + µ(D_γ)) × sup_{x ∈ E_K ∪ D_γ} ‖f̄(x) − f(x)‖_p^p )^{1/p}
< ( d_y (ω_{f̄}(2^{-K}) + 2^{-M})^p + (2^{-d_x K} + γ) × ( sup_{x ∈ [0,1]^{d_x}} ‖f̄(x)‖_p + sup_{x ∈ [0,1]^{d_x}} ‖f(x)‖_p )^p )^{1/p}
≤ ( d_y (ω_{f̄}(2^{-K}) + 2^{-M})^p + (2^{-d_x K} + γ) × ( sup_{x ∈ [0,1]^{d_x}} ‖f̄(x)‖_p + d_y^{1/p} )^p )^{1/p}.

By choosing K, M sufficiently large and γ sufficiently small, the right-hand side can be made smaller than ε/2, since sup_{x ∈ [0,1]^{d_x}} ‖f̄(x)‖_p < ∞. This completes the proof of the tight upper bound in Theorem 1.

A.3 PROOF OF THEOREM 4

In this section, we prove Theorem 4 by proving the following statement: for any p ∈ [1, ∞), any f* ∈ L^p(K, R^{d_y}), and any ε > 0, there exists a ρ network f of width max{d_x + 2, d_y + 1} such that ‖f* − f‖_p ≤ ε. Since f* ∈ L^p(K, R^{d_y}), there exists a continuous function f̄ ∈ C(K, R^{d_y}) such that ‖f* − f̄‖_p ≤ ε/2. Hence, if we construct a ρ network f such that ‖f̄ − f‖_p ≤ ε/2, the proof is complete. Throughout the proof, we assume that the support of f̄ is a subset of [0, 1]^{d_x} and its codomain is [0, 1]^{d_y}; this easily generalizes to arbitrary compact supports and codomains.

Before describing our construction, we first introduce the following lemma.

Lemma 12 [Kidger and Lyons (2020, Proposition 4.9)]. Let ρ : R → R be any continuous nonpolynomial function which is continuously differentiable at some z with ρ'(z) ≠ 0. Then, for any f* ∈ C(K, R^{d_y}) and any ε > 0, there exists a ρ network f : K → R^{d_x} × R^{d_y} of width d_x + d_y + 1 such that for all x ∈ K, f(x) := (y_1(x), y_2(x)) with ‖y_1(x) − x‖_∞ ≤ ε and ‖y_2(x) − f*(x)‖_∞ ≤ ε.

We note that Proposition 4.9 of Kidger and Lyons (2020) only ensures ‖y_2(x) − f*(x)‖_∞ ≤ ε; however, its proof provides ‖y_1(x) − x‖_∞ ≤ ε as well.

The proof of Theorem 4 also utilizes our coding scheme: we approximate the RELU network constructions encode_K^†, memorize_{K,M}^†, and decode_M^† of Appendix A.2 by ρ networks. Using Lemma 12, for any ε_1 > 0, we approximate encode_K^† by a ρ network encode_K^‡ of width d_x + 2 so that ‖encode_K^‡(x) − encode_K^†(x)‖_∞ ≤ ε_1 for all x ∈ [0, 1]^{d_x} \ D_γ and encode_K^‡([0, 1]^{d_x}) ⊂ [−ε_1, 1 + ε_1]. The latter is possible because encode_K^†(R^{d_x}) ⊂ [0, 1] by Lemma 11. Approximating the memorizer can be done in a similar manner: using Lemma 12, for any compact interval I_2 ⊂ R containing C_{d_x K} and any ε_2 > 0, we approximate memorize_{K,M}^† by a ρ network memorize_{K,M}^‡ of width 3 so that ‖memorize_{K,M}^‡(c) − memorize_{K,M}^†(c)‖_∞ ≤ ε_2 for all c ∈ C_{d_x K} and memorize_{K,M}^‡(I_2) ⊂ [−ε_2, 1 + ε_2]. The latter is possible because there exists memorize_{K,M}^† (i.e., a RELU network) with memorize_{K,M}^†(I_2) ⊂ [0, 1] by Lemma 9. For approximating the decoder, we introduce the following lemma, whose proof is presented in Appendix A.8.

Lemma 13. For any d_y, M ∈ N, any ε > 0, and any compact interval I ⊂ R containing [0, 1], there exists a ρ network f : R → R^{d_y} of width d_y + 1 such that ‖f(c) − decode_M^†(c)‖_∞ ≤ ε for all c ∈ I; in particular, f(I) ⊂ [−ε, 1 + ε]^{d_y}.

By Lemma 13, for any compact interval I_3 ⊂ R containing [0, 1] and any ε_3 > 0, there exists a ρ network decode_M^‡ of width d_y + 1 such that ‖decode_M^‡(c) − decode_M^†(c)‖_∞ ≤ ε_3 for all c ∈ C_{d_y M} and decode_M^‡(I_3) ⊂ [−ε_3, 1 + ε_3]^{d_y}. We then approximate f̄ by the ρ network f := decode_M^‡ ∘ memorize_{K,M}^‡ ∘ encode_K^‡ of width max{d_x + 2, d_y + 1}. Here, for any η > 0, by choosing K, M sufficiently large, I_2, I_3 sufficiently large, and ε_1, ε_2, ε_3 sufficiently small so that ω_{f̄}(2^{-K}) + 2^{-M} ≤ η/2 and ω_{decode_M^‡}(ω_{memorize_{K,M}^‡}(ε_1) + ε_2) + ε_3 ≤ η/2, we have

sup_{x ∈ [0,1]^{d_x} \ D_γ} ‖f̄(x) − f(x)‖_∞ ≤ η and f([0, 1]^{d_x}) ⊂ [−η/2, 1 + η/2]^{d_y},   (5)

where the moduli of continuity ω_{memorize_{K,M}^‡} and ω_{decode_M^‡} are taken on I_2 and I_3, respectively.
Finally, we bound the error ‖f̄ − f‖_p using the following chain of inequalities:

‖f̄ − f‖_p = ( ∫_{[0,1]^{d_x}} ‖f̄(x) − f(x)‖_p^p dx )^{1/p}
= ( ∫_{[0,1]^{d_x} \ D_γ} ‖f̄(x) − f(x)‖_p^p dx + ∫_{D_γ} ‖f̄(x) − f(x)‖_p^p dx )^{1/p}
≤ ( sup_{x ∈ [0,1]^{d_x} \ D_γ} ‖f̄(x) − f(x)‖_p^p + µ(D_γ) × sup_{x ∈ D_γ} ‖f̄(x) − f(x)‖_p^p )^{1/p}
≤ ( sup_{x ∈ [0,1]^{d_x} \ D_γ} ‖f̄(x) − f(x)‖_p^p + γ × ( sup_{x ∈ [0,1]^{d_x}} ‖f̄(x)‖_p + sup_{x ∈ [0,1]^{d_x}} ‖f(x)‖_p )^p )^{1/p}.

By choosing ε_1, ε_2, ε_3, γ sufficiently small and K, M, I_2, I_3 sufficiently large, the right-hand side can be made smaller than ε/2, due to (5) and the fact that sup_{x ∈ [0,1]^{d_x}} ‖f̄(x)‖_p < ∞. This completes the proof of Theorem 4.

A.4 PROOF OF LEMMA 8

We construct f : R → R as f(x) := f_{2^K} ∘ ⋯ ∘ f_1(x), where each f_ℓ : R → R is defined for x ∈ [0, 1] as

f_ℓ(x) := (ℓ − 1) × 2^{-K} if x ∈ [(ℓ − 1) × 2^{-K}, ℓ × 2^{-K}), and f_ℓ(x) := x if x ∉ [(ℓ − 1) × 2^{-K}, ℓ × 2^{-K}).

Each f_ℓ can be written as a composition g_{ℓ,3} ∘ g_{ℓ,2} ∘ g_{ℓ,1} of maps g_{ℓ,1} : R → R^2, g_{ℓ,2} : R^2 → R^2, and g_{ℓ,3} : R^2 → R, each consisting of an affine transformation followed by coordinate-wise RELU, where g_{ℓ,3} additionally uses one STEP unit 1[· ≥ 0] to realize the jump at ℓ × 2^{-K}; hence each f_ℓ is a RELU+STEP network of width 2 with a constant number of layers. This directly implies that f(x) = q_K(x) for all x ∈ [0, 1] and completes the proof of Lemma 8.

A.5 PROOF OF LEMMA 9

Let |X| = N and let x^{(1)}, ..., x^{(N)} be the distinct elements of X in increasing order, i.e., x^{(i)} < x^{(j)} if i < j. Let x^{(0)} := min I and x^{(N+1)} := max I; then x^{(0)} ≤ x^{(1)} and x^{(N)} ≤ x^{(N+1)} as X ⊂ I. Without loss of generality, we assume x^{(0)} = 0. Consider the continuous piecewise linear function f^† : [x^{(0)}, x^{(N+1)}] → R with N + 1 linear pieces defined as

f^†(x) := f*(x^{(1)}) if x ∈ [x^{(0)}, x^{(1)}),
f^†(x) := f*(x^{(i)}) + ((f*(x^{(i+1)}) − f*(x^{(i)})) / (x^{(i+1)} − x^{(i)})) (x − x^{(i)}) if x ∈ [x^{(i)}, x^{(i+1)}) for 1 ≤ i ≤ N − 1,
f^†(x) := f*(x^{(N)}) if x ∈ [x^{(N)}, x^{(N+1)}].

Now, we introduce the following lemma.

Lemma 14. For any compact interval I ⊂ R and any continuous piecewise linear function f* : I → R with P linear pieces, there exists a RELU network f of width 2 with O(P) layers such that f*(x) = f(x) for all x ∈ I.

From Lemma 14, there exists a RELU network f of width 2 with O(|X|) layers such that f^†(x) = f(x) for all x ∈ X. Since X ⊂ [x^{(0)}, x^{(N+1)}] = I and f^†(I) ⊂ [min f*(X), max f*(X)], this completes the proof of Lemma 9.

Proof of Lemma 14. Suppose that f* is linear on the intervals [min I, x_1), [x_1, x_2), ..., [x_{P−1}, max I] and parametrized as f*(x) = a_i × x + b_i on the i-th of these intervals, for some a_i, b_i ∈ R satisfying a_i × x_i + b_i = a_{i+1} × x_i + b_{i+1}. Without loss of generality, we assume min I = 0. We prove that for any P ≥ 1, there exists a RELU network f : I → R^2 of width 2 with O(P) layers such that (f(x))_1 = σ(x − x_{P−1}) and (f(x))_2 = f*(x); then (f(x))_2 is the desired RELU network, which completes the proof. We use mathematical induction on P. If P = 1, choosing (f(x))_1 = σ(x) and (f(x))_2 = a_1 × σ(x) + b_1 completes the construction. Now consider P > 1. By the induction hypothesis, there exists a RELU network g of width 2 such that (g(x))_1 = σ(x − x_{P−2}) and (g(x))_2 equals the continuous piecewise linear function with pieces a_1 x + b_1, ..., a_{P−1} x + b_{P−1} on [min I, x_1), [x_1, x_2), ..., [x_{P−2}, max I].
Then, the following construction of f completes the induction:

f(x) = h_2 ∘ h_1 ∘ g(x),
h_1(x, z) = (σ(x − x_{P−1} + x_{P−2}), σ(z − K) + K),
h_2(x, z) = (x, z + (a_P − a_{P−1}) × x),

where K := min_i min_{x ∈ I} {a_i × x + b_i}. This completes the proof of Lemma 14.
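As a quick numerical check of the object being constructed, the snippet below (our own sketch; it evaluates the standard shallow ReLU representation of a continuous piecewise linear function with hypothetical slopes and breakpoints, rather than the width-2 deep rearrangement above) verifies that adding a correction slope a_{i+1} − a_i at each breakpoint reproduces the target pieces.

```python
import numpy as np

def pwl(x, slopes, intercept, breaks):
    """Continuous piecewise linear function with the given slopes on successive intervals,
    evaluated as f(x) = a_1*x + c + sum_i (a_{i+1} - a_i) * relu(x - b_i)."""
    relu = lambda z: np.maximum(z, 0.0)
    y = slopes[0] * x + intercept
    for i, b in enumerate(breaks):
        y += (slopes[i + 1] - slopes[i]) * relu(x - b)
    return y

# Hypothetical 3-piece target on [0, 1]: slopes 2, -1, 0.5 with breakpoints 0.3, 0.7.
xs = np.linspace(0, 1, 1001)
ys = pwl(xs, slopes=[2.0, -1.0, 0.5], intercept=0.1, breaks=[0.3, 0.7])
# Check the slope on (the interior of) each piece:
print(np.allclose(np.diff(ys)[:299] / 0.001, 2.0),
      np.allclose(np.diff(ys)[301:699] / 0.001, -1.0),
      np.allclose(np.diff(ys)[701:] / 0.001, 0.5))
```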

B PROOFS OF LOWER BOUNDS

B.1 PROOF OF GENERAL LOWER BOUND

In this section, we prove that neural networks of width d_y − 1 are dense in neither L^p(K, R^{d_y}) nor C(K, R^{d_y}), regardless of the activation functions.

Lemma 18. For any set of activation functions, networks of width d_y − 1 are dense in neither L^p(K, R^{d_y}) nor C(K, R^{d_y}).

Proof. We show that networks of width d_y − 1 are not dense in L^p([0, 1]^{d_x}, R^{d_y}); the argument easily generalizes to L^p(K, R^{d_y}) and C(K, R^{d_y}). In particular, we prove that there exist a continuous function f* ∈ L^p([0, 1]^{d_x}, R^{d_y}) and ε > 0 such that for any network f of width d_y − 1, it holds that ‖f* − f‖_p > ε.

Let ∆ be a d_y-dimensional regular simplex with side length √2, isometrically embedded into R^{d_y}. The volume of this simplex is Vol_{d_y}(∆) = √(d_y + 1) / d_y!, where Vol_d(S) denotes the volume of S in d-dimensional Euclidean space. We denote the vertices of this simplex by {v_0, ..., v_{d_y}}. Then, we construct the counterexample as follows:

f*(x) = v_i if x_1 ∈ [2i/(2d_y + 1), (2i + 1)/(2d_y + 1)] for some i,
f*(x) = (2d_y + 1)(v_{i+1} − v_i) x_1 + (2i + 2) v_i − (2i + 1) v_{i+1} if x_1 ∈ [(2i + 1)/(2d_y + 1), (2i + 2)/(2d_y + 1)] for some i.

In other words, f*(x) travels through the vertices of ∆ sequentially as x_1 moves from 0 to 1, staying at each vertex over an interval of length 1/(2d_y + 1) and traveling between consecutive vertices at constant speed otherwise; in particular, f* is continuous and belongs to L^p([0, 1]^{d_x}, R^{d_y}).

Recalling (1), any neural network f of width less than d_y with L ≥ 2 layers can be decomposed as t_L ∘ g, where t_L : R^k → R^{d_y} is the last affine transformation and g denotes all preceding layers, i.e., g = σ_{L−1} ∘ t_{L−1} ∘ ⋯ ∘ σ_1 ∘ t_1. Here, we consider k = d_y − 1, as this suffices to cover the cases k < d_y − 1. Now, we proceed as

‖f* − f‖_p = ( ∫_{[0,1]^{d_x}} ‖f*(x) − f(x)‖_p^p dx )^{1/p}
= ( ∫_{[0,1]^{d_x}} ‖f*(x) − t_L ∘ g(x)‖_p^p dx )^{1/p}
≥ ( ∫_{[0,1]^{d_x}} inf_{u*(x) ∈ R^{d_y − 1}} ‖f*(x) − t_L(u*(x))‖_p^p dx )^{1/p}
≥ (1/(2d_y + 1))^{1/p} inf_{t : affine map} max_{i ∈ [d_y + 1]} inf_{u*_i ∈ R^{d_y − 1}} ‖v_i − t(u*_i)‖_p
≥ (1/(2d_y + 1))^{1/p} inf_{H ∈ H} max_{i ∈ [d_y + 1]} inf_{a_i ∈ H} ‖v_i − a_i‖_p,

where H denotes the set of all (d_y − 1)-dimensional hyperplanes in R^{d_y} and [d_y + 1] := {0, 1, ..., d_y}. As the vertices of ∆ are d_y + 1 distinct points in general position, inf_{H ∈ H} max_{i ∈ [d_y + 1]} inf_{a_i ∈ H} ‖v_i − a_i‖_p > 0. To make this argument concrete, we take a volumetric approach: for any (d_y − 1)-dimensional hyperplane H ⊂ R^{d_y}, we have

Vol_{d_y}(∆) ≤ 2 · Vol_{d_y − 1}(π_H(∆)) · max_{i ∈ [d_y + 1]} inf_{a_i ∈ H} ‖v_i − a_i‖_2,

where π_H denotes the projection onto H. Since projection is a contraction and the distance between any two vertices is at most √2, ∆ is contained in a d_y-dimensional unit ball, and hence π_H(∆) is contained in a (d_y − 1)-dimensional unit ball, so Vol_{d_y − 1}(π_H(∆)) ≤ Vol_{d_y − 1}({x ∈ R^{d_y − 1} : ‖x‖_2 ≤ 1}). It follows that for any H,

max_{i ∈ [d_y + 1]} inf_{a_i ∈ H} ‖v_i − a_i‖_2 ≥ Vol_{d_y}(∆) / (2 · Vol_{d_y − 1}({x ∈ R^{d_y − 1} : ‖x‖_2 ≤ 1})) = (√(d_y + 1) · Γ((d_y + 1)/2)) / (2 · d_y! · π^{(d_y − 1)/2}),

where Γ denotes the gamma function. Thus we have ‖f* − f‖_p > ε with

ε = d_y^{1/p − 1/2} · (√(d_y + 1) · Γ((d_y + 1)/2)) / (2 · (2d_y + 1)^{1/p} · d_y! · π^{(d_y − 1)/2}) for p ≥ 2, and
ε = (√(d_y + 1) · Γ((d_y + 1)/2)) / (2 · (2d_y + 1)^{1/p} · d_y! · π^{(d_y − 1)/2}) for p < 2.

This completes the proof of Lemma 18.
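For intuition, the following numerical check (our own sketch for d_y = 2) verifies that no line in R^2 can come closer than the volumetric bound to all three vertices of the regular simplex with side √2.

```python
import numpy as np
from math import gamma, pi, sqrt, factorial

dy = 2
# Vertices of a regular 2-simplex (equilateral triangle) with side length sqrt(2).
v = np.array([[0.0, 0.0], [sqrt(2), 0.0], [sqrt(2) / 2, sqrt(6) / 2]])
v -= v.mean(axis=0)

# Volumetric lower bound from the proof: Vol(simplex) / (2 * Vol(unit (dy-1)-ball)).
vol_simplex = sqrt(dy + 1) / factorial(dy)
vol_ball = pi ** ((dy - 1) / 2) / gamma((dy + 1) / 2)
bound = vol_simplex / (2 * vol_ball)

# Brute force over a grid of lines {x : <n, x> = t}: the farthest vertex stays >= bound away.
worst = np.inf
for theta in np.linspace(0, pi, 2000, endpoint=False):
    n = np.array([np.cos(theta), np.sin(theta)])
    proj = v @ n
    t = (proj.max() + proj.min()) / 2      # best offset for this direction
    worst = min(worst, np.max(np.abs(proj - t)))
print(worst, ">=", bound, worst >= bound)
```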

B.2 PROOF OF TIGHT LOWER BOUND IN THEOREM 3

In this section, we prove the tight lower bound in Theorem 3. Since we already have the width-d_y lower bound from Lemma 18, and it is already proven that RELU networks of width d_x are not dense in C(K, R^{d_y}) (Hanin and Sellke, 2017), we prove the tight lower bound in Theorem 3 by showing the following statement: there exist f* ∈ C([0, 1]^{d_x}, R) and ε > 0 such that for any RELU+STEP network f of width d_x containing at least one STEP, it holds that ‖f* − f‖_∞ > ε. Without loss of generality, we assume that f has d_x hidden neurons at each layer except the output layer and that all affine transformations in f are invertible (see Section 5.1).

Our main idea is to utilize properties of level sets of width-d_x RELU+STEP networks (Hanin and Sellke, 2017), defined as follows: given a network f of width d_x, we call a connected component of f^{-1}(y), for some y, a level set. Level sets of RELU+STEP networks have the property described by the following lemma; we note that the statement and proof of Lemma 19 are motivated by Lemma 6 of Hanin and Sellke (2017).

Lemma 19. Let f be a RELU+STEP network of width d_x containing at least one STEP. Then any level set S of f is unbounded unless it is empty.

Proof of Lemma 19. Let ℓ_* be the smallest index such that STEP appears in the ℓ_*-th layer. We show that all level sets of the first ℓ_* layers of f are either unbounded or empty; the claim of Lemma 19 then follows directly. We prove this by induction on ℓ_*. Recalling (1), we denote by f_ℓ the mapping of the first ℓ layers of f: f_ℓ := σ_ℓ ∘ t_ℓ ∘ ⋯ ∘ σ_1 ∘ t_1.

First, consider the base case ℓ_* = 1. Assume without loss of generality that the activation function of the first hidden neuron in σ_1 is STEP. Then, for any x, the STEP activation maps the first coordinate of t_1(x) to 1 if (t_1(x))_1 ≥ 0 and to 0 otherwise. This means that there exists a ray R starting from x such that f_1(R) = {f_1(x)}. Hence, any level set of f_1 is either unbounded or empty.

Now, consider the case ℓ_* > 1. Up to the (ℓ_* − 1)-th layer, the network only uses RELU. The level sets of f_{ℓ_* − 1} can then be characterized using the following lemma.

Lemma 20 [Hanin and Sellke (2017, Lemma 6)]. Given a RELU network g of width d_x, let S ⊂ R^{d_x} be the set such that x ∈ S if and only if the inputs to all RELU activations in g are strictly positive when computing g(x). Then S is open and convex, g is affine on S, and any bounded level set of g is contained in S.

B.6 PROOF OF LEMMA 7

In this proof, we prove that if g_{ℓ_*}([0, p_1]) ∩ g_{ℓ_*}([p_2, 1]) = ∅, then g_{ℓ_*}([0, p_1]) \ B is contained in a bounded path-component of R^2 \ U for some simple polygon U ⊂ g_{ℓ_*}([p_2, 1]) ∪ B; the statement of Lemma 7 then follows directly by Lemma 21. To begin with, consider the loop g_{ℓ_*}([p_2, 1]) ∪ L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)), where L(x, x′) denotes the line segment from x to x′, i.e., L(x, x′) := {λ x + (1 − λ) x′ : λ ∈ [0, 1]}. This loop consists of a finite number of line segments, since the image of an interval under a RELU network is piecewise linear, as is L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)); i.e., the loop is a polygon. Since g_{ℓ_*}([p_2, 1]) ∪ L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)) consists of line segments, under the assumption ‖f* − f‖_∞ ≤ 1/100 one can easily construct a simple loop U within g_{ℓ_*}([p_2, 1]) ∪ L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)) such that U contains simple curves from the midpoint of L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)) to a point near (−1, 1), then to a point near (1, 1), and finally back to the midpoint of L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)).
We note that U also consists of line segments, i.e., U is a simple polygon. Figure 3(a) illustrates U, where the line segments from g_{ℓ_*}([p_2, 1]) are drawn in blue and the line segments from L(g_{ℓ_*}(p_2), g_{ℓ_*}(1)) are indicated by the dotted black line. Now, choose q ∈ (0, p_1) such that f*(q) = (0, 1/2). Since ‖f* − f‖_∞ ≤ 1/100 and by the definition of ℓ_*, f(q) = g_{ℓ_*}(q) ∈ {x ∈ R^2 : ‖x − (0, 1/2)‖_∞ ≤ 1/100}, which is illustrated by the red dot in Figure 3. Then, we claim the following statement:

g_{ℓ_*}(q) is contained in a bounded path-component of R^2 \ U.   (11)

From the definition of q and the path-connectedness of g_{ℓ_*}([0, q]), one can observe that proving claim (11) implies that g_{ℓ_*}([0, q]) is contained in a bounded path-component of R^2 \ U unless U ∩ g_{ℓ_*}([0, q]) ≠ ∅. Since g_{ℓ_*}([0, p_1]) \ B ⊂ g_{ℓ_*}([0, q]) by the definitions of q and ℓ_* and the assumption ‖f* − f‖_∞ ≤ 1/100, this implies that if g_{ℓ_*}([0, p_1]) ∩ g_{ℓ_*}([p_2, 1]) = ∅, then g_{ℓ_*}([0, p_1]) \ B is contained in a bounded path-component of R^2 \ U. Hence, (11) implies the statement of Lemma 7.

Proof of claim (11). To prove claim (11), we first introduce the following lemma.

Lemma 22 [Jordan curve theorem (Tverberg, 1980)]. For any simple loop O ⊂ R^2, R^2 \ O consists of exactly two path-components, one bounded and one unbounded.

Lemma 22 ensures the existence of a bounded path-component of R^2 \ U. Furthermore, to prove claim (11), we introduce the parity function π_U : R^2 \ U → {0, 1}: for x ∈ R^2 \ U and a ray starting from x, π_U(x) counts, modulo 2, the number of times that the ray "properly" crosses U.



We consider ω_h on J. A vertex denotes one of the points (2, −1), (2, 1), (−2, −1), (−2, 1).



Figure 1: Illustration of the coding scheme

Figure 2: (a) Illustration of the image of f* : [0, 1] → R^2. (b, c) Examples of g_{ℓ_*}([0, 1]).

In other words, f([0, 1]) ∩ B must have been constructed within the first ℓ_* layers. Under this observation, the boundary of the set S in Lemma 5 determined by φ_{ℓ_*} must intersect B: if B ⊂ S, then φ_{ℓ_*}^{-1} ∘ σ ∘ φ_{ℓ_*}(B) = B, which contradicts the definition of ℓ_*; and if B ∩ S = ∅, then f([0, 1]) ∩ B cannot be constructed within the first ℓ_* layers. Therefore, there must exist a line (e.g., a line containing one boundary of S; see the arrow in Figure 2(b)) intersecting B such that the image g_{ℓ_*}([0, 1]) lies on one side of the line. Since the image of the entire network f([0, p_1]) lies on both sides of this line, we have g_{ℓ_*}([0, p_1]) ≠ f([0, p_1]), which implies that the remaining layers ℓ_* + 1, ..., L − 1 must move the image g_{ℓ_*}([0, p_1]) \ B to f([0, p_1]) \ B; this also implies g_{ℓ_*}([0, p_1]) \ B ≠ ∅. A similar argument gives g_{ℓ_*}([p_2, 1]) \ B ≠ ∅.

Figures 2(b) and 2(c) illustrate the two possible cases in Lemma 7. If g_{ℓ_*}([0, p_1]) ∩ g_{ℓ_*}([p_2, 1]) ≠ ∅ (Figure 2(c)), this contradicts (3). Otherwise (Figure 2(b)), g_{ℓ_*}([0, p_1]) \ B must be contained in a bounded path-component of R^2 \ T. Recall that g_{ℓ_*}([0, p_1]) \ B has to be moved to f([0, p_1]) \ B by layers ℓ_* + 1, ..., L − 1. However, by Lemma 6, if any point of g_{ℓ_*}([0, p_1]) \ B moves, then it must intersect the image of T. If it intersects the image of g_{ℓ_*}([p_2, 1]), then (3) is violated, a contradiction. If it intersects B at the ℓ_†-th layer for some ℓ_† > ℓ_*, then B ⊄ S for the set S in Lemma 5 determined by ℓ_†, so φ_{ℓ_†}^{-1} ∘ σ ∘ φ_{ℓ_†}(B) ≠ B by Lemma 5, violating the definition of ℓ_*. Hence, the approximation by f is impossible in all cases. This completes the proof of Theorem 2.

Figure 3: (a) Illustration of U and g_{ℓ_*}(q). (b) Illustration of V, g_{ℓ_*}(q), and v. (c) Illustration of U, g_{ℓ_*}(q), and v.

ACKNOWLEDGEMENTS

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), and No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence). Chulhee Yun acknowledges financial support from the Korea Foundation for Advanced Studies, NSF CAREER grant 1846088, and ONR grant N00014-20-1-2394.

Appendix

We first provide proofs of the upper bounds in Theorems 1, 3, and 4 in Appendix A. In Appendix B, we provide proofs of the lower bounds in Theorems 1 and 3, and proofs of Lemmas 5, 6, and 7 used for proving the lower bound in Theorem 2. Throughout the Appendix, we denote the coordinate-wise RELU by σ and the i-th coordinate of the output of a function f at x by (f(x))_i.

A PROOFS OF UPPER BOUNDS

A.1 PROOF OF TIGHT UPPER BOUND IN THEOREM 3

In this section, we prove the tight upper bound on the minimum width in Theorem 3, i.e., that width-(max{d_x + 1, d_y}) RELU+STEP networks are dense in C([0, 1]^{d_x}, R^{d_y}). In particular, we prove that for any f* ∈ C([0, 1]^{d_x}, [0, 1]^{d_y}) and any ε > 0, there exists a RELU+STEP network f of width max{d_x + 1, d_y} such that ‖f* − f‖_∞ ≤ ε. We note that the domain and the codomain can easily be generalized to an arbitrary compact support and an arbitrary codomain, respectively.

Our construction is based on the three-part coding scheme introduced in Section 4.1. First, consider constructing a RELU+STEP network for the encoder. From the definition of q_K, the mapping is discontinuous and piecewise constant; hence, the exact construction (or even the uniform approximation) of the encoder requires discontinuous activation functions such as STEP (recall its definition x ↦ 1[x ≥ 0]). We introduce the following lemma for the exact construction of q_K; its proof is presented in Appendix A.4.

Lemma 8. For any K ∈ N, there exists a RELU+STEP network f : R → R of width 2 and O(2^K) layers such that f(x) = q_K(x) for all x ∈ [0, 1].

To construct the encoder by a RELU+STEP network of width d_x + 1, we sequentially apply q_K to each input coordinate, utilizing the extra width of 1 and Lemma 8. When we apply q_K to some coordinate (requiring width 2), we preserve the other coordinates via identity maps (requiring width d_x − 1), so the encoder stays within width d_x + 1. Once q_K has been applied to all input coordinates, we apply the linear transformation Σ_{i=1}^{d_x} q_K(x_i) × 2^{-(i-1)K} to obtain the output of the encoder.

On the other hand, the memorizer only maps a finite number of scalar inputs to the corresponding scalar targets, which can easily be implemented by a piecewise linear continuous function. We show that the memorizer can be exactly constructed by a RELU network of width 2 using the following lemma, whose proof is presented in Appendix A.5.

Lemma 9. For any function f* : R → R, any finite set X ⊂ R, and any compact interval I ⊂ R containing X, there exists a RELU network f : R → R of width 2 and O(|X|) layers such that f(x) = f*(x) for all x ∈ X and f(I) ⊂ [min f*(X), max f*(X)].

Likewise, the decoder maps a finite number of scalar inputs in C_{d_y M} to the corresponding target vectors in C_M^{d_y}, where each coordinate of a target vector corresponds to a consecutive block of bits of the binary representation of the input. Using an idea similar to our implementation of the memorizer, we show that the decoder can be exactly constructed by a RELU network of width d_y using the following lemma, whose proof is presented in Appendix A.6.

Lemma 10. For any d_y, M ∈ N and any δ > 0, there exists a RELU network f : R → R^{d_y} of width d_y and O(2^M) layers such that f(c) = decode_M(c) for all c ∈ C_{d_y M} and f(R) ⊂ [0, 1]^{d_y}.

Finally, as the encoder, the memorizer, and the decoder can be constructed by RELU+STEP networks of width d_x + 1, width 2, and width d_y, respectively, the width of the overall RELU+STEP network f is max{d_x + 1, d_y}. In addition, as mentioned in Section 4.1, choosing K, M ∈ N large enough that ω_{f*}(2^{-K}) + 2^{-M} ≤ ε ensures ‖f* − f‖_∞ ≤ ε. This completes the proof of the tight upper bound in Theorem 3.

A.6 PROOF OF LEMMA 10

Before describing our proof, we first introduce the following lemma, whose proof is presented in Appendix A.9.

Lemma 15. For any M ∈ N and any δ > 0, there exists a RELU network f : R → R^2 of width 2 and O(2^M) layers such that

f(x) := (y_1(x), y_2(x)) with y_1(x) = q_M(x) and y_2(x) = 2^M (x − q_M(x)) for all x ∈ [0, 1] \ D_{M,δ},   (6)
f(R) ⊂ [0, 1]^2,   (7)

where D_{M,δ} := ∪_{c ∈ C_M \ {0}} (c − δ, c), i.e., the union of length-δ intervals lying immediately below the discontinuity points of q_M.

In Lemma 15, one can observe that C_{d_y M} ⊂ [0, 1] \ D_{M,δ} for δ < 2^{-d_y M}; i.e., there exists a RELU network g of width 2 satisfying (6) on C_{d_y M} as well as (7). The network g enables us to extract the first M bits of the binary representation of c ∈ C_{d_y M}: (g(c))_1, for c ∈ C_{d_y M}, is the first coordinate of decode_M(c), while (g(c))_2 ∈ C_{(d_y − 1)M} contains the information of the remaining coordinates of decode_M(c). Now, consider applying g again to (g(c))_2 while passing the output (g(c))_1 through the identity (RELU is the identity on positive inputs). Then (g((g(c))_2))_1 is the second coordinate of decode_M(c), while (g((g(c))_2))_2 contains the information of the coordinates other than the first two. Iterating, if we apply g to the second output of the previous application and pass through all first outputs of the previous applications, then we recover all coordinates of decode_M(c) within d_y − 1 applications of g; both outputs of the (d_y − 1)-th application correspond to the second-to-last and last coordinates of decode_M(c), respectively. Our construction of f is exactly this iterated sequence of d_y − 1 applications of g, which can be implemented by a RELU network of width d_y. Here, (7) in Lemma 15 enables us to achieve f(R) ⊂ [0, 1]^{d_y}. This completes the proof of Lemma 10.

A.7 PROOF OF LEMMA 11

To begin with, we introduce the following lemma. The proof of Lemma 16 is presented in Appendix A.10.

Lemma 16. For any d_x ∈ N and any α ∈ (0, 0.5), there exists a RELU network f : R^{d_x} → R^{d_x} of width d_x + 1 such that f(x) = x for all x ∈ [α, 1 − α]^{d_x}, f(x) = (1, . . . , 1) for all x ∈ R^{d_x} \ [0, 1]^{d_x}, and f(R^{d_x}) ⊂ [0, 1]^{d_x}.

By Lemma 16, there exists a RELU network h_1 of width d_x + 1 satisfying these properties. Furthermore, by Lemma 15 (applied with M = K), for any δ > 0, there exists a RELU network g : R → R of width 2 and O(2^K) layers such that g(x) = q_K(x) for all x ∈ [0, 1] \ D_{K,δ}. We construct a network h_2 : R^{d_x} → R^{d_x} of width d_x + 1 by sequentially applying g to each coordinate of an input x ∈ R^{d_x}, utilizing the extra width 1 as in Appendix A.1 (see also the sketch below). Then, h_2(x) = q_K(x) whenever every coordinate of x lies in [0, 1] \ D_{K,δ}. Note that we use q_K(x) for denoting the coordinate-wise q_K for a vector x.
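The following register-level sketch (plain Python, not a network) illustrates the width bookkeeping used for h_2 and for the encoder in Appendix A.1: quantizing one coordinate occupies two registers, the remaining d_x − 1 coordinates ride along through identity maps, and hence d_x + 1 registers suffice; the scalar quantization q below is ordinary arithmetic standing in for the deep width-2 subnetwork, whose layer-by-layer structure is not modeled.

def q(x, K):
    # scalar quantization standing in for the width-2 RELU+STEP subnetwork of Lemma 8
    return int(x * 2**K) / 2**K

def coordinatewise_quantize(x, K):
    d = len(x)
    registers = list(x) + [0.0]            # d_x input registers plus one auxiliary register
    for i in range(d):                     # process one coordinate at a time
        registers[d] = q(registers[i], K)  # width-2 work on registers i and d ...
        registers[i] = registers[d]        # ... while every other register is carried by the identity
    return registers[:d]                   # never more than d_x + 1 values are live at once

print(coordinatewise_quantize([0.3141, 0.2718, 0.5772], K=6))  # [0.3125, 0.265625, 0.5625]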

Now, we define D_{K,δ,d_x} := {x ∈ R^{d_x} : x_i ∈ D_{K,δ} for some i ∈ {1, . . . , d_x}}, i.e., the set of points having at least one coordinate in D_{K,δ}.

Then, from the constructions of h_1 and h_2, we have h_2(h_1(x)) = q_K(x) for all x ∈ [α, 1 − α]^{d_x} \ D_{K,δ,d_x} and h_2(h_1(x)) = (1, . . . , 1) for all x ∈ R^{d_x} \ [0, 1]^{d_x}, where we use the facts that (1, . . . , 1) ∉ D_{K,δ,d_x} and q_K((1, . . . , 1)) = (1, . . . , 1). Finally, we construct a RELU network f of width d_x + 1 as f := h_2 ∘ h_1. Then, it holds that f(x) = q_K(x) for all x ∈ [0, 1]^{d_x} \ D_γ, where D_γ := ([0, 1]^{d_x} \ [α, 1 − α]^{d_x}) ∪ D_{K,δ,d_x}. In addition, if we choose sufficiently small α and δ so that µ(D_γ) < γ, then f satisfies all conditions in Lemma 11. This completes the proof of Lemma 11.

A.8 PROOF OF LEMMA 13

The proof of Lemma 13 is almost identical to that of Lemma 10. In particular, we approximate the RELU network construction consisting of d_y − 1 iterative applications of g (see Appendix A.6 for the definition of g) by a ρ network of width d_y + 1. To this end, we consider a ρ network h of width 3 approximating g on some interval J within error α, using Lemma 12. Then, one can observe that d_y − 1 iterative applications of h (as in Appendix A.6) result in a ρ network f of width d_y + 1. Here, passing a value through the identity function can be approximated by a ρ network of width 1, i.e., the same width as for RELU networks (see Lemma 4.1 of Kidger and Lyons (2020) for details). Furthermore, since h is uniformly continuous on J, the output of f can be made arbitrarily close to that of the decoder on C_{d_y M} by choosing a sufficiently large J and a sufficiently small α. This completes the proof of Lemma 13.

A.9 PROOF OF LEMMA 15

We first clip the input to lie in [0, 1] using a RELU network of width 1, via the identity min{max{x, 0}, 1} = 1 − σ(1 − σ(x)). After that, we apply a map g' : [0, 1] → [0, 1]^2. Once we implement a RELU network g of width 2 such that g(x) = g'_{2^M}(x), constructing f from g as described completes the proof.

Now, we describe how to construct g'_{2^M} by a RELU network. One can observe that (g'_1(x))_2 = 0 for all x and that g'_{2^M} can be written as alternating applications of min{·, ·} and max{·, ·} of affine functions. Finally, we introduce the following definition and lemma.

Definition 1 (Hanin and Sellke (2017)). f : R^{d_x} → R^{d_y} is a max-min string of length L if there exist affine transformations h_1, . . . , h_L such that f = τ_{L−1}(h_L, τ_{L−2}(h_{L−1}, · · · , τ_1(h_2, h_1) · · · )), where each τ_ℓ is either a coordinate-wise max{·, ·} or min{·, ·}.

Lemma 17 (Hanin and Sellke (2017, Proposition 2)). For any max-min string f* : R^{d_x} → R^{d_y} of length L and any compact K ⊂ R^{d_x}, there exists a RELU network f : R^{d_x} → R^{d_x} × R^{d_y} of L layers and width d_x + d_y such that f(x) = (y_1(x), y_2(x)) for all x ∈ K, where y_1(x) = x and y_2(x) = f*(x).

We note that Proposition 2 of Hanin and Sellke (2017) itself only ensures y_2(x) = f*(x); however, its proof provides y_1(x) = x as well. From the definition of the max-min string, one can observe that (g'_{2^M}(x))_2 is a max-min string. Hence, by Lemma 17, there exists a RELU network g of width 2 such that g(x) = g'_{2^M}(x) for all x in the compact set [0, 1] \ D_{M,δ}. This completes the proof of Lemma 15.

A.10 PROOF OF LEMMA 16

Consider two piecewise linear functions h_1, h_2 : R → R. Using h_1 and h_2, we first map all x ∈ R^{d_x} \ [0, 1]^{d_x} to some vector whose coordinates are all greater than one, via a map g : R^{d_x} → R^{d_x}. Here, we use the addition between a vector and a scalar to denote the addition of the scalar to each coordinate of the vector. Then, one can observe that if x ∈ R^{d_x} \ [0, 1]^{d_x}, then every coordinate of g(x) is greater than one. Furthermore, each intermediate map r_i (or s_i) can be implemented by a RELU network of width d_x + 1 (width d_x for computing σ(x + 1) − 1 and width one for computing h_1(x_i)) due to (9).
Hence, g can be implemented by a RELU network of width d_x + 1. Finally, we construct a RELU network f : R^{d_x} → R^{d_x} of width d_x + 1 as f(x) := min{max{g(x), 0}, 1}, applied coordinate-wise, where we use the identity min{max{x, 0}, 1} = 1 − σ(1 − σ(x)) so that the clipping costs only width 1 per coordinate. Then, one can observe that if x ∈ [α, 1 − α]^{d_x}, then f(x) = x; if x ∈ R^{d_x} \ [0, 1]^{d_x}, then f(x) = (1, . . . , 1); and f(R^{d_x}) ⊂ [0, 1]^{d_x}. This completes the proof of Lemma 16.

Consider the set S associated with f_{*−1} as in Lemma 20, and consider a level set T of f_* containing some x, i.e., T ≠ ∅. If x ∉ S, then T is unbounded by Lemma 20. If x ∈ S, we argue as in the base case. The preimage of f_*(x) under the *-th layer (i.e., σ ∘ t_*) contains a ray. If this ray is contained in f_{*−1}(S), then T is unbounded, as f_{*−1} is invertible and affine on S. Otherwise, T \ S ≠ ∅, and T must be unbounded, as any level set of f_{*−1} not contained in S is unbounded by Lemma 20. This completes the proof of Lemma 19.

Now, we continue the proof of the tight lower bound in Theorem 3 based on Lemma 19. We note that our argument is also adapted from the proof of the lower bound in Theorem 1 of Hanin and Sellke (2017). Consider a target function f* ∈ C([0, 1]^{d_x}, R) such that, for a = 1/4 and b = 0, (f*)^{-1}(a) is the sphere of radius 1/2 centered at the single point (f*)^{-1}(b) = {(1/2, . . . , 1/2)}; for instance, one may take f* to depend only on the distance to (1/2, . . . , 1/2). Namely, any path from (f*)^{-1}(b) to infinity must intersect (f*)^{-1}(a). Now, suppose that a RELU+STEP network f of width d_x satisfies ‖f* − f‖_∞ ≤ 1/16. Then, the level set of f containing (1/2, . . . , 1/2) must be unbounded by Lemma 19, and hence must intersect (f*)^{-1}(a). However, as ‖f* − f‖_∞ ≤ 1/16, the value of f on this level set is at most 1/16 in absolute value (since f*((1/2, . . . , 1/2)) = 0), whereas f ≥ 1/4 − 1/16 = 3/16 on (f*)^{-1}(a); hence the level set cannot intersect (f*)^{-1}(a), which is a contradiction. This completes the proof of the tight lower bound max{d_x + 1, d_y} in Theorem 3.
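As a quick sanity check of the width-1 clipping identity min{max{x, 0}, 1} = 1 − σ(1 − σ(x)) used in the proofs of Lemmas 15 and 16 above, the following Python snippet compares the two expressions numerically; it is an illustration only, not part of any construction.

import numpy as np

relu = lambda x: np.maximum(x, 0.0)

x = np.linspace(-2.0, 3.0, 1001)
clip_direct = np.minimum(np.maximum(x, 0.0), 1.0)
clip_relu = 1.0 - relu(1.0 - relu(x))

print(np.max(np.abs(clip_direct - clip_relu)))  # 0.0: the identity is exact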

B.3 PROOF OF TIGHT LOWER BOUND IN THEOREM 1

In this section, we prove the tight lower bound in Theorem 1. Since we already have the d_y lower bound by Lemma 18, we prove the tight lower bound in Theorem 1 by showing the following statement: there exist f* ∈ L^p(R^{d_x}, R) and ε > 0 such that for any continuous function f represented by a RELU network of width d_x, it holds that ‖f* − f‖_p > ε. Note that this statement can be easily generalized to an arbitrary codomain. To derive the statement, we prove a stronger statement: for any RELU network f of width d_x, either f = 0 or f ∉ L^p(R^{d_x}, R), where f = 0 denotes that f is the constant function mapping every input to zero. This stronger statement leads us to the desired result directly (e.g., take any nonzero f* ∈ L^p(R^{d_x}, R) and ε := ‖f*‖_p / 2). Without loss of generality, we assume that f has d_x hidden neurons at each layer except for the output layer and that all affine transformations in f are invertible (see Section 5.1).

As in the proof of the tight lower bound in Theorem 3, we utilize the properties of the level sets of f given by Lemma 20. Let S be the set defined in Lemma 20 for f. By the definition of S, one can observe that R^{d_x} \ S ≠ ∅. Then, a level set T containing some x ∈ R^{d_x} \ S must be unbounded by Lemma 20. Here, if y := f(x) > 0, then for δ ∈ (0, y), consider T' := f^{-1}((y − δ, y + δ)). Since T' contains T, which is an unbounded set, one can easily observe that µ(T') = ∞ and hence f ∉ L^p(R^{d_x}, R); the same argument applies when f(x) < 0.

Hence, suppose that f(x) = 0 for all x ∈ R^{d_x} \ S. Then, f(x) = 0 for all x ∈ ∂S, as S is open (see Lemma 20). Furthermore, we claim that f(S) = {0} or f ∉ L^p(R^{d_x}, R). For any s ∈ S, consider any two rays of opposite directions starting from s. If one of the rays is contained in S and f ∈ L^p(R^{d_x}, R), then its image under f must be {0}: if the image were not {0}, using the argument for the case f(x) > 0 above would give f ∉ L^p(R^{d_x}, R). Then, one can conclude that f(s) = 0. If neither ray is contained in S, both must intersect ∂S; in that case, since f is affine on S and f(∂S) = {0}, f(s) must be zero. Hence, the claim holds. Combining the claim with f(x) = 0 for all x ∈ R^{d_x} \ S shows that either f = 0 or f ∉ L^p(R^{d_x}, R). This completes the proof of the tight lower bound in Theorem 1.
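The following Python snippet gives a small numerical illustration (not a proof) of this dichotomy in the simplest case d_x = 1: a width-1 RELU network with nonzero-slope affine layers is a composition of monotone maps and hence monotone, so unless it is identically zero, |f|^p is bounded away from zero on a tail and its integral over [−R, R] diverges as R grows. The architecture and random weights below are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
layers = [(rng.normal() or 1.0, rng.normal()) for _ in range(4)]  # (slope, bias); slopes nonzero a.s.

def f(x):
    h = x
    for a, b in layers[:-1]:
        h = np.maximum(a * h + b, 0.0)  # hidden RELU layer of width 1
    a, b = layers[-1]
    return a * h + b                    # affine output layer

p = 2
for R in [1, 10, 100, 1000]:
    xs = np.linspace(-R, R, 200001)
    integral = np.sum(np.abs(f(xs)) ** p) * (xs[1] - xs[0])  # Riemann-sum estimate of the L^p mass on [-R, R]
    print(R, integral)                  # grows without bound as R increases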

B.4 PROOF OF LEMMA 5

Let φ(x) = Ax + b for some invertible matrix A ∈ R^{2×2} with rows a_1, a_2 ∈ R^2 and some vector b = (b_1, b_2) ∈ R^2. Then, it is easy to see that φ^{-1} ∘ σ ∘ φ(x) = x if and only if ⟨a_1, x⟩ + b_1 ≥ 0 and ⟨a_2, x⟩ + b_2 ≥ 0. Hence, the first statement of Lemma 5 holds.

Now, consider the second statement of Lemma 5. Suppose that ⟨a_1, x⟩ + b_1 ≥ 0 and ⟨a_2, x⟩ + b_2 < 0. In this case, σ zeroes out the second coordinate of φ(x), so x' := φ^{-1} ∘ σ ∘ φ(x) satisfies ⟨a_2, x'⟩ + b_2 = 0 and ⟨a_1, x'⟩ + b_1 = ⟨a_1, x⟩ + b_1 ≥ 0; in particular, x' lies on ∂S. Moreover, the preimage of x' under φ^{-1} ∘ σ ∘ φ is the ray {z : ⟨a_2, z⟩ + b_2 ≤ 0, ⟨a_1, z⟩ + b_1 = ⟨a_1, x⟩ + b_1}, which starts from x' and contains x. In addition, similar arguments hold for the cases ⟨a_1, x⟩ + b_1 < 0, ⟨a_2, x⟩ + b_2 ≥ 0 and ⟨a_1, x⟩ + b_1 < 0, ⟨a_2, x⟩ + b_2 < 0. This completes the proof of Lemma 5.
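The following Python snippet numerically checks the picture behind Lemma 5 for an arbitrary illustrative choice of A and b: the map g := φ^{-1} ∘ σ ∘ φ fixes exactly the points where both coordinates of Ax + b are nonnegative, and it sends a point violating only the first constraint to the boundary line {z : ⟨a_1, z⟩ + b_1 = 0}, so that a whole ray of inputs collapses to a single boundary point.

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 2.0], [-1.0, 0.5]])   # invertible, illustrative
b = np.array([0.3, -0.2])
A_inv = np.linalg.inv(A)

def g(x):
    # g = phi^{-1} o sigma o phi with phi(x) = Ax + b and sigma = RELU
    return A_inv @ (np.maximum(A @ x + b, 0.0) - b)

for _ in range(5):
    x = 2.0 * rng.normal(size=2)
    u = A @ x + b
    print("Ax+b >= 0:", bool((u >= 0).all()), "| g(x) == x:", bool(np.allclose(g(x), x)))

# a point violating only the first constraint is sent onto {z : <a_1, z> + b_1 = 0}
x = np.array([-3.0, 1.0])
assert (A @ x + b)[0] < 0 <= (A @ x + b)[1]
print("image on boundary:", bool(np.isclose((A @ g(x) + b)[0], 0.0)))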

B.5 PROOF OF LEMMA 6

We first prove the first statement of Lemma 6 by contradiction. Suppose that x' = x and x ∉ T', but x is not in a bounded path-component of R^2 \ T'. Here, note that x' = x ∈ S for the set S defined in Lemma 5. Then, there exists a path P from x to infinity such that P ∩ T' = ∅. If P ⊂ int(S) (int(S) denotes the interior of S), then the preimages of P and T ∩ int(S) under φ^{-1} ∘ σ ∘ φ are identical to P and T ∩ int(S) themselves (by Lemma 5), so P is also a path from x to infinity that does not intersect T. This contradicts the assumption that x is in a bounded path-component of R^2 \ T. Hence, it must hold that P ⊄ int(S).

Let x* ∉ T' be the first point in P ∩ ∂S along the trajectory of P starting from x. Then, the preimage of x* contains a ray R starting from x* (see the proof of Lemma 5 for the details), and this ray must not intersect T: had the ray R intersected T, then R ∩ T would have been mapped to x*, which contradicts x* ∉ T' and the definition of P. Furthermore, from the definition of x*, the subpath P† of P from x to x*, excluding x*, satisfies P† ⊂ int(S). Hence, the preimages of P† and T ∩ int(S) under φ^{-1} ∘ σ ∘ φ are identical to P† and T ∩ int(S) by Lemma 5. This implies that there exist a path P† from x to x* and then a ray R from x* to infinity, neither of which intersects T. This contradicts the assumption of Lemma 6 and completes the proof of the first statement of Lemma 6.

Now, consider the second statement of Lemma 6. By Lemma 5, x' ≠ x implies that x ∉ S and x' ∈ ∂S. Here, as the preimage of x' contains a ray from x' containing x, this ray must intersect T by the assumption of Lemma 6. Hence, x' ∈ T', which completes the proof of the second statement of Lemma 6.

Combining the first and the second statements completes the proof of Lemma 6.

B.6 PROOF OF LEMMA 7

Before starting our proof, we first introduce the following definitions and lemma; the proof of Lemma 21 is presented in Appendix B.7.

Definition 2. Definitions related to curves, loops, and polygons are listed as follows. For U ⊂ R^2, let F(U) denote the set of continuous functions from [0, 1] onto U.
• U is a "curve" if there exists f ∈ F(U).
• U is a "simple curve" if there exists an injective f ∈ F(U).
• U is a "loop" if there exists f ∈ F(U) such that f(1) = f(0).
• U is a "simple loop" if there exists f ∈ F(U) such that f(1) = f(0) and f is injective on [0, 1).
• U is a "polygon" if there exists a piecewise linear f ∈ F(U) such that f(1) = f(0).

For a point x ∉ U, π_U(x) denotes the number of proper intersections between U and a ray starting from x, reduced modulo 2, where a proper intersection is an intersection at which U enters and leaves on different sides of the ray. Here, it is well known that π_U(x) does not depend on the choice of the ray, i.e., π_U is well-defined; we refer to the proof of Lemma 2.3 by Thomassen (1992) and the proof of Lemma 1 by Tverberg (1980) for more details. Moreover, π_U characterizes the "position" of x in that π_U(x) = 0 if and only if x is in the unbounded path-component of R^2 \ U, which is known as the even-odd rule (Shimrat, 1962; Hacker, 1962); see also the sketch following this proof. Hence, proving that π_U(g_*(q)) = 1 would complete the proof of the claim (11).

Recall that there exists a line (e.g., the black arrow in Figure 3) that intersects B and such that the image of g_* lies on only "one side" of the line (see Section 5.2 for details). Since B is open, there exists a "vertex" v ∈ ∂B (e.g., the green dot in Figures 3(b) and 3(c)) such that v is on the "other side" of the line. We prove π_U(g_*(q)) = 1 by counting the number of proper intersections between U and the ray R from g_*(q) passing through v (the red arrow in Figures 3(b) and 3(c) illustrates R).

To simplify showing π_U(g_*(q)) = 1, we consider two points z_1, z_2 ∈ U ∩ ∂B near the points (−1, 1) and (1, 1), respectively, such that the simple curve P in U from z_1 to z_2 is contained in B except for z_1, z_2. Then, one can observe that P and L(z_1, z_2) form a simple loop, which we call V. We have π_V(g_*(q)) = 1, as a ray from g_*(q) in the downward direction (the blue arrow in Figure 3(b)) properly intersects V only once, at some point in L(g_*(p_2), g_*(1)), under the assumption that ‖f* − f‖_∞ ≤ 1/100. From the property of π_V, this implies that the ray R starting from g_*(q) and passing through v (e.g., the red arrow in Figures 3(b) and 3(c)) must properly intersect V an odd number of times. Furthermore, from the construction of U and V, the definition of ℓ_*, and the assumption that ‖f* − f‖_∞ ≤ 1/100, one can observe that the simple curve in U from z_1 to z_2 not contained in B (i.e., U \ P) can only intersect B within the ℓ_∞ balls of radius 2/100 centered at the points (−1, 1) and (1, 1). This is because if U \ P intersected B outside these ℓ_∞ balls, then by the definition of ℓ_*, the network could not make further modifications in B, contradicting the approximation assumption ‖f* − f‖_∞ ≤ 1/100. In other words, all proper intersections between U and R are identical to those between V and R. This implies that π_U(g_*(q)) = 1 and hence that g_*(q) is in the bounded path-component of R^2 \ U. This completes the proof of the claim (11) and therefore completes the proof of Lemma 7.
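The parity function π_U used above is the classical even-odd (ray-casting) rule. The following Python sketch implements the standard parity test for a polygon given by its vertices: cast a horizontal ray to the right of the query point and count proper crossings with the edges; an odd count means the point lies in a bounded path-component of the complement. The polygon and query points below are illustrative only.

def even_odd_parity(point, vertices):
    # 1 if `point` lies in a bounded path-component of the complement of the
    # polygon with the given vertices, 0 otherwise (even-odd rule)
    x, y = point
    parity = 0
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):                  # the edge properly crosses the ray's horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                       # ... strictly to the right of the query point
                parity ^= 1
    return parity

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(even_odd_parity((0.5, 0.5), square))  # 1: bounded component
print(even_odd_parity((1.5, 0.5), square))  # 0: unbounded component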

B.7 PROOF OF LEMMA 21

Suppose that g_*([0, p_1]) ∩ g_*([p_2, 1]) = ∅ and that g_*([0, p_1]) \ B is contained in a bounded path-component of R^2 \ U for some U ⊂ g_*([p_2, 1]) ∪ B. If g_*([0, p_1]) \ B is path-connected, then the statement of Lemma 21 directly follows. Hence, suppose that g_*([0, p_1]) \ B has more than one path-component. To help the proof, we introduce the following lemma.

Lemma 23. If g_*(p) ∈ ∂B for some p ∈ [0, 1], then f(p) = g_*(p).

Proof of Lemma 23. Suppose that f(p) ≠ g_*(p). Then, φ_ℓ^{-1} ∘ σ ∘ φ_ℓ(g_*(p)) ≠ g_*(p) for some ℓ > ℓ_*. By Lemma 5, there exist a_1, a_2 ∈ R^2 and b_1, b_2 ∈ R such that φ_ℓ^{-1} ∘ σ ∘ φ_ℓ(x) = x if and only if ⟨a_1, x⟩ + b_1 ≥ 0 and ⟨a_2, x⟩ + b_2 ≥ 0. Without loss of generality, we assume that ⟨a_1, g_*(p)⟩ + b_1 < 0. Since g_*(p) ∈ ∂B, there exists z ∈ B such that ⟨a_1, z⟩ + b_1 < 0, i.e., φ_ℓ^{-1} ∘ σ ∘ φ_ℓ(z) ≠ z, which contradicts the definition of ℓ_* by Lemma 5. This completes the proof of Lemma 23.

By Lemma 23 and the assumption that ‖f* − f‖_∞ ≤ 1/100, g_*([0, p_1]) \ B can only intersect ∂B within the ℓ_∞ ball O of radius 2/100 centered at the point (0, 1). Hence, all path-components of g_*([0, p_1]) \ B intersect the line segment ∂B ∩ O. In other words, g_*([0, p_1]) \ B is contained in a single path-component of R^2 \ (g_*([p_2, 1]) ∪ B) unless g_*([p_2, 1]) intersects ∂B ∩ O. However, by Lemma 23 and the assumption that ‖f* − f‖_∞ ≤ 1/100, g_*([p_2, 1]) must not intersect ∂B ∩ O. This completes the proof of Lemma 21.

