PROVABLE MEMORIZATION VIA DEEP NEURAL NETWORKS USING SUB-LINEAR PARAMETERS

Abstract

It is known that O(N) parameters are sufficient for neural networks to memorize arbitrary N input-label pairs. By exploiting depth, we show that O(N^{2/3}) parameters suffice to memorize N pairs, under a mild condition on the separation of input points. In particular, deeper networks (even of width 3) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.

1. INTRODUCTION

The modern trend of over-parameterizing neural networks has shifted the focus of deep learning theory from analyzing their expressive power toward understanding their generalization capabilities. While the celebrated universal approximation theorems state that over-parameterization enables us to approximate the target function with a smaller error (Cybenko, 1989; Pinkus, 1999), the theoretical gain is too small to satisfactorily explain the observed benefits of over-parameterizing already-big networks. Instead of "how well can models fit," the question of "why models do not overfit" has become the central issue (Zhang et al., 2017). Ironically, a recent breakthrough on the phenomenon known as the double descent (Belkin et al., 2019; Nakkiran et al., 2020) suggests that answering the question of "how well can models fit" is in fact an essential element in fully characterizing their generalization capabilities. In particular, the double descent phenomenon characterizes two different phases according to whether the network size is capable of memorizing the training samples. If the network size is insufficient for memorization, the traditional bias-variance trade-off occurs. However, after the network reaches the capacity that memorizes the dataset, i.e., the "interpolation threshold," larger networks exhibit better generalization.

Under this new paradigm, identifying the minimum size of networks for memorizing finite input-label pairs becomes a key issue, rather than function approximation, which considers infinite inputs. The memory capacity of neural networks is a relatively old line of research, in which researchers have studied the minimum number of parameters for memorizing arbitrary N input-label pairs. Existing results showed that O(N) parameters are sufficient for various activation functions (Baum, 1988; Huang and Babri, 1998; Huang, 2003; Yun et al., 2019; Vershynin, 2020).
On the other hand, Sontag (1997) established the negative result that for any network using analytic definable activation functions with o(N) parameters, there exists a set of N input-label pairs that the network cannot memorize. A sub-linear number of parameters also appears in a related topic, namely the VC-dimension of neural networks: it has been proved that there exists a set of N inputs that a neural network with o(N) parameters can "shatter," i.e., memorize with arbitrary labels (Maass, 1997; Bartlett et al., 2019). Comparing the two results on o(N) parameters, Sontag (1997) showed that not all sets of N inputs can be memorized with arbitrary labels, whereas Bartlett et al. (2019) showed that at least one set of N inputs can be shattered. This suggests that there may be a reasonably large family of N input-label pairs that can be memorized with o(N) parameters, which is our main interest.

1.1. SUMMARY OF RESULTS

In this paper, we identify a mild condition satisfied by many practical datasets, and show that o(N) parameters suffice for memorizing such datasets. In order to bypass the negative result by Sontag (1997), we introduce a condition on the set of inputs, called ∆-separateness.

Definition 1. For a set X ⊂ ℝ^{d_x}, we say X is ∆-separated if

sup_{x, x′ ∈ X : x ≠ x′} ‖x − x′‖₂ < ∆ × inf_{x, x′ ∈ X : x ≠ x′} ‖x − x′‖₂.

This condition requires that the ratio of the maximum distance to the minimum distance between distinct points is bounded by ∆. Note that the condition is milder when ∆ is bigger. By Definition 1, any given finite set of (distinct) inputs is ∆-separated for some ∆, so one might ask how ∆-separateness differs from merely having distinct inputs in a dataset. The key difference is that even as the number of data points N grows, the ratio of the maximum to the minimum distance must remain bounded by ∆. Given the discrete nature of computers, many practical datasets satisfy ∆-separateness, as we will see shortly. Also, this condition is more general than the minimum distance assumptions (∀i, ‖x_i‖₂ = 1; ∀i ≠ j, ‖x_i − x_j‖₂ ≥ ρ > 0) employed in existing theoretical results (Hardt and Ma, 2017; Vershynin, 2020). To see this, note that the minimum distance assumption implies 2/ρ-separateness, since the maximum distance between unit vectors is at most 2. In our theorem statements, we use the phrase "∆-separated set of input-label pairs" to denote that the set of inputs is ∆-separated.

In our main theorem sketched below, we prove the sufficiency of o(N) parameters for memorizing any ∆-separated set of N pairs (i.e., any ∆-separated set of N inputs with arbitrary labels) even for large ∆. More concretely, our result is of the following form:

Theorem 1 (Informal). For any w ∈ (2/3, 1], there exists an O(N^{2−2w}/log N + log ∆)-layer, O(N^w + log ∆)-parameter fully-connected network with sigmoidal or RELU activation that can memorize any ∆-separated set of N input-label pairs.

We note that log has base 2.
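Definition 1 can be checked directly on a finite dataset. The sketch below (using NumPy, with a toy point set as an illustrative stand-in for real data) computes the max-to-min distance ratio that ∆ must exceed:

```python
import numpy as np

def separation_ratio(X):
    """Ratio of the maximum to the minimum pairwise Euclidean distance
    among distinct points in X. Per Definition 1, X is Delta-separated
    iff this ratio is strictly less than Delta."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]          # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(-1))          # pairwise distances
    off_diag = dists[~np.eye(len(X), dtype=bool)]  # drop zero self-distances
    return off_diag.max() / off_diag.min()

# Toy example: four points in the plane
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
print(separation_ratio(X))  # 5.0: this set is Delta-separated for any Delta > 5
```

Note that adding more points can only grow the ratio, which is why ∆-separateness is a genuine restriction as N increases.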
Theorem 1 states that if the number of layers increases with the number of pairs N, then any ∆-separated set of N pairs can be memorized by a network with o(N) parameters. Here, we can check from Definition 1 that the log ∆ term does not usually dominate the depth or the number of parameters, especially for modern deep architectures and practical datasets. For example, it is easy to check that any dataset consisting of 3-channel images (values from {0, 1, . . . , 255}) of size a × b satisfies log ∆ < 9 + (1/2) log(ab) (e.g., log ∆ < 17 for the ImageNet dataset), which is often much smaller than the depth of modern deep architectures.

For practical datasets, we can show that networks with fewer parameters than the number of pairs can successfully memorize the dataset. For example, in order to perfectly classify one million images in the ImageNet dataset with 1000 classes, our result shows that 0.7 million parameters are sufficient. The improvement is more significant for large datasets. To memorize 15.8 million bounding boxes in Open Images V6 with 600 classes, our result shows that only 4.5 million parameters suffice. Theorem 1 improves the sufficient number of parameters for memorizing a large class of N pairs (i.e., 2^{O(N^w)}-separated) from O(N) down to O(N^w) for any w ∈ (2/3, 1), for deep networks.

It is then natural to ask whether depth increasing with N is necessary for memorization with a sub-linear number of parameters. The following existing result on the VC-dimension implies that this is indeed necessary for memorization with o(N/log N) parameters, at least for RELU networks.

Theorem (Bartlett et al. (2019), Informal). For L-layer RELU networks, Ω(N/(L log N)) parameters are necessary for memorizing at least a single set of N inputs with arbitrary labels.

The above theorem implies that for RELU networks of constant depth, Ω(N/log N) parameters are necessary for memorizing at least one set of N inputs with arbitrary labels.
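The image bound above is a back-of-the-envelope calculation: two distinct integer-valued images differ by at least 1 in some coordinate (minimum distance ≥ 1), while the maximum distance is at most 255·√(3ab). A quick sanity check (the 224 × 224 resolution is an illustrative choice for ImageNet-scale images):

```python
import math

def log_delta_bound(a, b, channels=3, levels=256):
    """Upper bound on log2(Delta) for images with integer pixel values
    in {0, ..., levels-1}: the minimum distance between distinct images
    is at least 1, and the maximum distance is at most
    (levels - 1) * sqrt(channels * a * b)."""
    max_dist = (levels - 1) * math.sqrt(channels * a * b)
    return math.log2(max_dist)  # since min distance >= 1

# 224 x 224 RGB images
print(log_delta_bound(224, 224))  # about 16.6, consistent with log Delta < 17
```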
In contrast, by increasing depth with N, Theorem 1 shows that there is a large class of datasets that can be memorized with o(N/log N) parameters. Combining these two results, one can conclude that increasing depth is necessary and sufficient for memorizing a large class of N pairs with o(N/log N) parameters.

Given that depth is critical for the memorization power, is the width also critical? We prove that it is not the case, via the following theorem.

Theorem 2 (Informal). For a fully-connected network of width 3 with a sigmoidal or RELU activation function, O(N^{2/3} + log ∆) parameters (i.e., layers) suffice for memorizing any ∆-separated set of N input-label pairs.

Theorem 2 states that under 2^{O(N^{2/3})}-separateness of inputs, the network width does not have to increase with N for memorization with sub-linear parameters. Furthermore, it shows that even a surprisingly narrow network of width 3 has superior memorization power to a fixed-depth network, requiring only O(N^{2/3}) parameters.

Theorems 1 and 2 show the existence of network architectures that memorize N points with o(N) parameters, under the condition of ∆-separateness. However, these theorems do not answer the question of how many such data points a given network can memorize. We provide generic criteria for identifying the maximum number of points for general networks (Theorem 3). In a nutshell, our criteria indicate that to memorize more pairs under the same budget for the number of parameters, a network must have a deep and narrow architecture at its final layers.
In contrast to the prior results that the number of arbitrary pairs that can be memorized is at most proportional to the number of parameters (Yamasaki, 1993; Yun et al., 2019; Vershynin, 2020), our criteria successfully incorporate the characteristics of datasets, the number of parameters, and the architecture, which enables us to memorize ∆-separated datasets whose number of pairs is super-linear in the number of parameters.

Finally, we provide empirical results corroborating our theoretical findings that deep networks often memorize better than their shallow counterparts with a similar number of parameters. Here, we emphasize that better memorization power does not necessarily imply better generalization. We indeed observe that shallow and wide networks often generalize better than deep and narrow networks, given the same (or similar) training accuracy.

Organization. We first introduce related works in Section 2. In Section 3, we introduce necessary notation and the problem setup. We formally state our main results and discuss them in Section 4, and prove Theorem 1 in Section 5. In Section 6, we provide empirical observations on the effect of depth and width in neural networks. Finally, we conclude the paper in Section 7.

2. RELATED WORK

2.1. NUMBER OF PARAMETERS FOR MEMORIZATION

Sufficient number of parameters for memorization. Identifying the sufficient number of parameters for memorizing arbitrary N pairs has a long history. Earlier works mostly focused on bounding the number of hidden neurons of shallow networks for memorization. Baum (1988) proved that for 2-layer STEP networks, O(N) hidden neurons (i.e., O(N) parameters) are sufficient for memorizing arbitrary N pairs when inputs are in general position. Huang and Babri (1998) showed that the same bound holds for any bounded and nonlinear activation function σ such that either lim_{x→−∞} σ(x) or lim_{x→∞} σ(x) exists, without any condition on inputs. The O(N) bound on the number of hidden neurons was improved to O(√N) by Huang (2003) by exploiting an additional hidden layer; nevertheless, this construction still requires O(N) parameters.

With the advent of deep learning, the study has been extended to modern activation functions and deeper architectures. Zhang et al. (2017) proved that O(N) hidden neurons are sufficient for 2-layer RELU networks to memorize arbitrary N pairs. Yun et al. (2019) showed that for deep RELU (or hard tanh) networks having at least 3 layers, O(N) parameters are sufficient. Vershynin (2020) proved a similar result for STEP (or RELU) networks with an additional logarithmic factor, i.e., O(N) parameters up to logarithmic factors are sufficient, for memorizing any {x_i : ‖x_i‖₂ = 1}_{i=1}^N satisfying ‖x_i − x_j‖₂² = Ω((log log d_max)/(log d_min)) and N, d_max = e^{O(d_min^{1/5})}, where d_max and d_min denote the maximum and the minimum hidden dimensions, respectively. In addition, the memorization power of modern network architectures has also been studied. Hardt and Ma (2017) showed that RELU networks consisting of residual blocks with O(N) hidden neurons can memorize any {x_i : ‖x_i‖₂ = 1}_{i=1}^N satisfying ‖x_i − x_j‖₂ ≥ ρ for some absolute constant ρ > 0.
Nguyen and Hein (2018) studied a broader class of layers and proved that O(N) hidden neurons suffice for convolutional neural networks consisting of fully-connected, convolutional, and max-pooling layers for memorizing arbitrary N pairs having different patches.

Necessary number of parameters for memorization. On the other hand, the necessary number of parameters for memorization has also been studied. Sontag (1997) showed that for any neural network using analytic definable activation functions, Ω(N) parameters are necessary for memorizing arbitrary N pairs. Namely, given any network using an analytic definable activation with o(N) parameters, there exists a set of N pairs that the network cannot memorize.

The Vapnik-Chervonenkis (VC) dimension is also closely related to the memorization power of neural networks. While the study of memorization power asks for the number of parameters for memorizing arbitrary N pairs, the VC-dimension asks for the number of parameters for memorizing at least a single set of N inputs with arbitrary labels. Hence, it naturally provides a lower bound on the necessary number of parameters for memorizing arbitrary N pairs. The VC-dimension of neural networks has been studied for various types of activation functions. For memorizing at least a single set of N inputs with arbitrary labels, it is known that Θ(N/log N) parameters are necessary (Baum and Haussler, 1989) and sufficient (Maass, 1997) for STEP networks; Karpinski and Macintyre (1997) proved analogous bounds for sigmoidal networks. Bartlett et al. (2019) showed that Θ(N/(L̄ log N)) parameters are necessary and sufficient for L-layer networks using any piecewise linear activation function, where L̄ := (1/W_L) ∑_{ℓ=1}^{L} W_ℓ and W_ℓ denotes the number of parameters up to the ℓ-th layer.

2.2. BENEFITS OF DEPTH IN NEURAL NETWORKS

To understand deep learning, researchers have investigated the advantages of deep neural networks over shallow neural networks with a similar number of parameters. Initial results discovered examples of deep neural networks that cannot be approximated by shallow neural networks without using exponentially many parameters (Telgarsky, 2016; Eldan and Shamir, 2016; Arora et al., 2018). More recently, it was discovered that deep neural networks require fewer parameters than shallow neural networks to represent or approximate a class of periodic functions (Chatziafratis et al., 2020a; b). For approximating continuous functions, Yarotsky (2018) proved that the number of parameters required for RELU networks of constantly bounded depth is quadratic in the number required for deep RELU networks.

3. NOTATION AND PROBLEM SETUP

In this section, we introduce notation and the problem setup. We use log to denote the logarithm to base 2. We let RELU be the function x ↦ max{x, 0}. For X ⊂ ℝ, we denote ⌊X⌋ := {⌊x⌋ : x ∈ X}. For n ∈ ℕ and a set X, we denote binom(X, n) := {S ⊂ X : |S| = n} and [n] := {0, . . . , n − 1}. For x ≥ 0 and y > 0, we denote x mod y := x − y·⌊x/y⌋.

Throughout this paper, we consider fully-connected feedforward networks. In particular, we consider the following setup: given an activation function σ, we define a neural network f_θ of L layers (or equivalently L − 1 hidden layers), input dimension d_x, output dimension 1, and hidden layer dimensions d_1, . . . , d_{L−1}, parameterized by θ, as f_θ := t_{L−1} ∘ σ ∘ ⋯ ∘ σ ∘ t_1 ∘ σ ∘ t_0. Here, t_ℓ : ℝ^{d_ℓ} → ℝ^{d_{ℓ+1}} is an affine transformation parameterized by θ. We denote a neural network using an activation function σ by a "σ network." We define the width of f_θ as max{d_1, . . . , d_{L−1}}.

As we introduced in Section 1.1, our main results hold for any sigmoidal activation function and RELU. Here, we formally define sigmoidal functions as follows.

Definition 2. We say a function σ : ℝ → ℝ is sigmoidal if the following conditions are satisfied.

• Both lim_{x→−∞} σ(x) and lim_{x→∞} σ(x) exist, and lim_{x→−∞} σ(x) ≠ lim_{x→∞} σ(x).

• There exists z ∈ ℝ such that σ is continuously differentiable at z and σ′(z) ≠ 0.

The class of sigmoidal functions covers many activation functions including sigmoid, tanh, and hard tanh. Furthermore, since hard tanh can be represented as a combination of two RELU functions, all results for sigmoidal activation functions hold for RELU as well.

Lastly, we formally define memorization as follows.

Definition 3. Given C, d_x ∈ ℕ, a set of inputs X ⊂ ℝ^{d_x}, a label function y : X → [C], and a neural network f_θ : ℝ^{d_x} → ℝ parameterized by θ, we say f_θ can memorize {(x, y(x)) : x ∈ X} in d_x dimension with C classes if for any ε > 0, there exists θ such that |f_θ(x) − y(x)| ≤ ε for all x ∈ X.

Definition 3 defines memorizability as the ability of a network f_θ to fit a set of input-label pairs. While existing results often define memorization only for binary labels, we consider arbitrary C classes and prove our results for general multi-class classification problems. We often write "f_θ can memorize arbitrary N pairs" without "in d_x dimension with C classes" throughout the paper.

4. MAIN RESULTS

4.1. MEMORIZATION VIA SUB-LINEAR PARAMETERS

Efficacy of depth for memorization. Now, we are ready to introduce our main theorem on memorizing N pairs with o(N) parameters. The proof of Theorem 1 is presented in Section 5.

Theorem 1. For any C, N, d_x ∈ ℕ, for any w ∈ [2/3, 1], for any ∆ ≥ 1, and for any sigmoidal activation function σ, there exists a σ network f_θ with

O(log d_x + log ∆ + N^{2−2w}/(1 + (1.5w − 1) log N) · log C)

hidden layers and

O(d_x + log ∆ + N^w + N^{1−w/2} log C)

parameters such that f_θ can memorize any ∆-separated set of N pairs in d_x dimension with C classes.

Note that Theorem 1 covers w = 2/3, which is not included in its informal version presented in Section 1.1; at w = 2/3, however, the log N term in the number of hidden layers disappears, and we can only achieve O(N^{2/3}) parameters. In addition, while we only address sigmoidal activation functions in the statement of Theorem 1, the same conclusion naturally holds for RELU, as described in Section 3.

In Theorem 1, ∆ only incurs O(log ∆) overhead to the number of layers and the number of parameters. As we introduced in Section 1.1, log ∆ for modern datasets is often very small. Furthermore, log ∆ is small for random inputs. For example, a set of N d_x-dimensional i.i.d. standard normal random vectors satisfies log ∆ = O((1/d_x) log(N/√δ)) with probability at least 1 − δ (see Section C). Hence, the ∆-separateness condition is often negligible.

Suppose that d_x and C are treated as constants, as also assumed in existing results. Then, Theorem 1 implies that if log ∆ = O(N^w) for some w < 1, then Θ(N^w) (i.e., sub-linear in N) parameters are sufficient for sigmoidal or RELU networks to memorize any ∆-separated set of N pairs. Furthermore, if log ∆ ≤ O(N^{2−2w}/log N) and w ∈ (2/3, 1), then the network construction in Theorem 1 has O(N^{2−2w}/log N) layers and O(N^w) parameters. Note that the condition log ∆ ≤ O(N^{2−2w}/log N) is very loose for many practical datasets, especially those with huge N.
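To see how the counts in Theorem 1 scale, one can evaluate the leading terms numerically. In the sketch below all O(·) constants are set to 1 and log ∆, log C are fixed to illustrative values, so the numbers show growth rates rather than exact architecture sizes:

```python
import math

def theorem1_scalings(N, w, log_delta=17, log_C=10):
    """Leading-order layer/parameter scalings from Theorem 1, with all
    O(.) constants set to 1 and d_x dropped. Illustrative only."""
    layers = log_delta + N ** (2 - 2 * w) / (1 + (1.5 * w - 1) * math.log2(N)) * log_C
    params = log_delta + N ** w + N ** (1 - w / 2) * log_C
    return layers, params

# One million pairs: parameters grow sub-linearly in N for w < 1
for w in (0.7, 0.8, 0.9):
    L, P = theorem1_scalings(10**6, w)
    print(f"w={w}: ~{L:.0f} layers, ~{P:.0f} parameters")
```

Larger w trades depth for parameters: the layer count shrinks while the parameter count grows toward linear in N.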
Combined with the lower bound Ω(N/log N) on the necessary number of parameters for RELU networks of constant depth (Bartlett et al., 2019), Theorem 1 implies that depth growing in N is necessary and sufficient for memorizing a large class (i.e., ∆-separated) of N pairs with o(N/log N) parameters. In other words, deeper RELU networks have more memorization power.

Unimportance of width for memorization. While depth is critical for the memorization power, we show that the width is not. In particular, we prove that extremely narrow networks of width 3 can memorize with O(N^{2/3}) layers (i.e., O(N^{2/3}) parameters), as stated in the following theorem. The proof of Theorem 2 is presented in Section F.

Theorem 2. For any C, N, d_x ∈ ℕ, for any ∆ ≥ 1, and for any sigmoidal activation function σ, a σ network of Θ(N^{2/3} log C) hidden layers and width 3 can memorize any ∆-separated set of N pairs in d_x dimension with C classes.

The statement of Theorem 2 might be somewhat surprising, since the network width for memorization does not depend on the input dimension d_x. This is in contrast with the recent universal approximation results that width at least d_x + 1 is necessary for approximating functions with a d_x-dimensional domain (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Park et al., 2020). The difference follows from the fundamental difference between the two approximation problems, i.e., approximating a function at finitely many inputs versus infinitely many inputs (e.g., the unit cube). Any set of N input vectors can easily be mapped to N distinct scalar values by a simple projection using an inner product. Hence, memorizing finite input-label pairs in d_x dimensions can easily be translated into memorizing finite input-label pairs in one dimension. In other words, the dimensionality d_x of the inputs is not very important as long as they can be translated to distinct scalar values.
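The scalarization step can be illustrated numerically: a generic direction maps distinct vectors to distinct scalars with probability one. This is a hypothetical sketch with random data; the actual construction of Lemma 5 additionally controls the separation of the resulting values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))   # 100 distinct inputs in 32 dimensions
v = rng.normal(size=32)          # a generic projection direction
scalars = X @ v                  # inner product maps each vector to a scalar
print(np.unique(scalars).size)   # 100: all projected values are distinct
```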
In contrast, there is no natural way to design an injection from the d_x-dimensional unit cube to a lower-dimensional space. Namely, to not lose the information in the inputs, width d_x is necessary for preserving them. Therefore, function approximation in general cannot be done with width independent of d_x.

Extension to regression problems. The results of Theorem 1 and Theorem 2 can easily be applied to the regression problem, i.e., when labels are from [0, 1]. This is because one can simply translate the regression problem with some error tolerance ε > 0 to the classification problem with ⌈1/ε⌉ classes. Here, each class c ∈ {0, 1, . . . , ⌈1/ε⌉} corresponds to the target value c · ε. Hence, the regression problem can also be solved within ε error with o(N) parameters, where the sufficient numbers of layers and parameters are identical to those in Theorem 1 and Theorem 2 with log C replaced by log(1/ε).

Relation to benefits of depth in neural networks. Our observation that deeper RELU networks have more memorization power is closely related to recent studies on the benefits of depth in neural networks (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Arora et al., 2018; Yarotsky, 2018; Chatziafratis et al., 2020a; b). While our observation indicates that depth is critical for the memorization power, these works mostly focused on showing the importance of depth for approximating functions. The existing results on benefits of depth for function approximation cannot directly imply the benefits of depth for memorization, since they often focus on specific classes of functions or require parameters far beyond O(N).

4.2. GENERIC CRITERIA FOR IDENTIFYING MEMORIZATION POWER

While Theorem 1 proves the existence of networks with o(N) parameters for memorization, the following theorem states generic criteria for verifying the memorization power of a given network architecture. The proof of Theorem 3 is presented in Section G.

Theorem 3. For any sigmoidal activation function σ, let f_θ be a σ network of L hidden layers having d_ℓ neurons at the ℓ-th hidden layer. Then, for any C, N, d_x ∈ ℕ and ∆ ≥ 1, f_θ can memorize any ∆-separated set of N pairs in d_x dimension with C classes if the following statement holds: there exist 0 < L_1 < ⋯ < L_K < L for some 2 ≤ K ≤ log N satisfying conditions 1-4 below.

1. d_ℓ ≥ 3 for all ℓ ≤ L_K, and d_ℓ ≥ 7 for all ℓ > L_K.

2. ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋ ≥ ∆·√(2π d_x).

3. ∑_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2) ≥ 2^{i+3} for all 1 < i ≤ K − 1.

4. 2^K · (∑_{ℓ=L_{K−1}+1}^{L_K} (d_ℓ − 2)) · (⌊(L − L_K − 1)/log C⌋ − 4) ≥ N².

Our criteria in Theorem 3 require that the layers of the network can be "partitioned" into K + 1 distinct parts characterized by L_1, . . . , L_K for some K ≥ 2. Under this partition, we describe the four conditions in Theorem 3 in detail.

The first condition suggests that the network width is not very critical. We note that d_ℓ ≥ 7 for ℓ > L_K does not contradict Theorem 2, as we highly optimize the network architecture to fit in width 3 for Theorem 2, while we provide generic criteria here.

The second condition considers the first L_1 hidden layers. In order to satisfy this condition, deep and narrow architectures are better than shallow and wide architectures under a similar parameter budget, due to the product form ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋. Nevertheless, the architecture of the first L_1 hidden layers is not very critical, as only log ∆ + (1/2) log(2π d_x) layers are sufficient even with width 3 (e.g., log ∆ < 17 for the ImageNet dataset).

The third condition is closely related to the last condition: as K increases, the LHS in the last condition increases. However, the third condition states that this requires more hidden neurons.
Simply put, increasing K by one requires doubling the number of hidden neurons from the (L_1 + 1)-th to the L_{K−1}-th hidden layer. Nevertheless, this doubling of hidden neurons doubles the LHS of the last condition as well.

The last condition is simple. As we explained, 2^K is approximately proportional to the number of hidden neurons from the (L_1 + 1)-th to the L_{K−1}-th hidden layer. The second term in the LHS of the last condition, ∑_{ℓ=L_{K−1}+1}^{L_K} (d_ℓ − 2), is even simpler; it only requires counting hidden neurons. On the other hand, the last term counts the number of layers. This indicates that to satisfy the conditions in Theorem 3 using few parameters, the last layers of the network should be deep and narrow. In particular, we note that such a deep and narrow architecture in the last layers is indeed necessary for RELU networks to memorize with o(N) parameters (Bartlett et al., 2019).

Now, we describe how to show memorization with o(N) parameters using our criteria. For simplicity, consider the network with the minimum width stated in the first condition, i.e., d_ℓ = 3 for all ℓ ≤ L_K and d_ℓ = 7 for all ℓ > L_K. As we explained, the second condition can easily be satisfied. For the third condition, choosing K = log(N^{2/3}) means that Θ(N^{2/3}) hidden neurons (i.e., Θ(N^{2/3}) hidden layers) are sufficient, i.e., L_{K−1} − L_1 = Θ(N^{2/3}). Likewise, we choose the remaining L_i to satisfy L_K − L_{K−1} = Θ(N^{2/3}) and L − L_K = Θ(N^{2/3}). Then, the last condition is naturally satisfied while using only Θ(N^{2/3}) parameters.
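The criteria can be packaged as a small checker. The conditions below follow the reconstruction given in the text above, so this is an illustrative sketch rather than code from the paper:

```python
import math

def check_theorem3(widths, Ls, N, C, delta, d_x):
    """Check conditions 1-4 of Theorem 3 for a network whose l-th hidden
    layer has widths[l-1] neurons, with breakpoints Ls = [L_1, ..., L_K]
    (1-indexed layer positions)."""
    L, K = len(widths), len(Ls)
    if not 2 <= K <= math.log2(N):
        return False
    L_K = Ls[-1]
    # Condition 1: width at least 3 up to L_K, at least 7 afterwards.
    if any(d < 3 for d in widths[:L_K]) or any(d < 7 for d in widths[L_K:]):
        return False
    # Condition 2: product over the first L_1 hidden layers.
    prod = 1
    for d in widths[:Ls[0]]:
        prod *= (d + 1) // 2
    if prod < delta * math.sqrt(2 * math.pi * d_x):
        return False
    # Condition 3: neuron counts between consecutive breakpoints (1 < i <= K-1).
    for i in range(2, K):
        if sum(d - 2 for d in widths[Ls[i - 2]:Ls[i - 1]]) < 2 ** (i + 3):
            return False
    # Condition 4: neurons in the last segment times the remaining depth.
    neurons = sum(d - 2 for d in widths[Ls[K - 2]:Ls[K - 1]])
    depth_term = (L - L_K - 1) // max(1, math.ceil(math.log2(C))) - 4
    return 2 ** K * neurons * depth_term >= N ** 2

# Toy architecture: 4 layers of width 3, then width 7 for the remaining 26
print(check_theorem3([3] * 4 + [7] * 26, Ls=[4, 10], N=16, C=2, delta=2, d_x=4))
```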

5. PROOF OF THEOREM 1

In this section, we give a constructive proof of Theorem 1: we design a σ network which can memorize N inputs in d_x dimension with C classes using only o(N) parameters. We note that the proofs of Theorem 2 and Theorem 3 also utilize similar constructions. To construct networks, we first introduce the following lemma, motivated by the RELU networks achieving the nearly tight VC-dimension bound (Bartlett et al., 2019). The proof of Lemma 4 is presented in Section E.1.

Lemma 4. For any C, V ∈ ℕ and p ∈ [1/2, 1], there exists a σ network f_θ of O(V^{1−p}/(1 + (p − 0.5) log V)) hidden layers and O(V^p + V^{1/2} log C) parameters such that f_θ can memorize any X ⊂ ℝ with |X| = V and C classes satisfying ⌊X⌋ = [V].

Roughly speaking, Lemma 4 states the following: suppose that the V inputs are scalar and well-separated, in the sense that exactly one input falls into each interval [i, i + 1). Then, such well-separated inputs can be mapped to their corresponding labels using only O(V^{1/2}) parameters (with p = 1/2). Thus, what remains is to build a network mapping N arbitrary inputs to some well-separated set bounded by some V = o(N²), again using o(N) parameters.

Projecting input vectors to scalar values. Now, we introduce our network mapping a ∆-separated set of N inputs to some Z whose floors are N distinct elements of [V]. First, we map the input vectors to scalar values using the following lemma. The proof of Lemma 5 is presented in Section E.2.

Lemma 5. For any ∆-separated X ⊂ ℝ^{d_x} with |X| = N, there exist v ∈ ℝ^{d_x} and b ∈ ℝ such that the N values in {v⊤x + b : x ∈ X} have distinct floors, all lying in [O(N² ∆ d_x^{1/2})].

Lemma 5 states that any ∆-separated set of N vectors can be mapped to well-separated scalar values bounded by O(N² ∆ d_x^{1/2}) via a simple projection using O(d_x) parameters. Note that the direct combination of Lemma 4 and Lemma 5 gives us a σ network of O(N ∆^{1/2} d_x^{1/4}) parameters which can memorize any ∆-separated set of N pairs.
However, this combination is limited in at least two senses: the required number of parameters (1) has a ∆^{1/2} d_x^{1/4} multiplicative factor and (2) is not sub-linear in N. In what follows, we introduce our techniques for resolving these two issues.

Reducing the upper bound to O(N²). First, we introduce the following lemma for improving the ∆^{1/2} d_x^{1/4} multiplicative factor in the number of parameters to O(log ∆ + log d_x) additional parameters. The proof of Lemma 6 is presented in Section E.3.

Lemma 6. For any set X ⊂ ℝ of N reals whose floors ⌊X⌋ are N distinct elements of [K] for some K ∈ ℕ, there exists a σ network f of 1 hidden layer and width 3 such that the floors of f(X) are N distinct elements of [T], where T := max{⌈K/2⌉, ⌊N²/4⌋ + 1}.

From Lemma 6, a network of O(log ∆ + log d_x) hidden layers and width 3 can decrease the upper bound O(N² ∆ d_x^{1/2}) to ⌊N²/4⌋ + 1, which enables us to drop the dependence on ∆ d_x^{1/2} in the upper bound using O(log ∆ + log d_x) parameters. The intuition behind Lemma 6 is that if the number of target intervals T is large enough compared to the number of inputs N, then the inputs can easily be mapped without collision (i.e., no two inputs are mapped into the same interval) by some simple network with a small number of parameters. For example, we construct f in Lemma 6 as

f(x) := x if x ∈ [0, T), and f(x) := (x + b) mod T if x ∈ [T, K),

for some b ∈ [T]. However, if T is not large enough compared to N (i.e., if T < ⌊N²/4⌋ + 1), then our network cannot avoid collisions between inputs. Hence, Lemma 6 can only reduce the upper bound to Θ(N²), and combining Lemmas 4-6 yields a network of O(N) parameters which can memorize any ∆-separated set of N pairs, i.e., the number of parameters is still not sub-linear in N.

Reducing the upper bound to o(N²). To resolve this issue, we further decrease the upper bound using the following lemma. The proof of Lemma 7 is presented in Section E.4.

Lemma 7. For any set X ⊂ ℝ of N reals whose floors ⌊X⌋ are N distinct elements of [K] for some K ∈ ℕ, there exists a σ network f of 1 hidden layer and width O(N²/K) such that the floors of f(X) are N distinct elements of [T], where T := max{⌈K/2⌉, N}.
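The folding map of Lemma 6 can be illustrated on integer inputs (a numerical sketch of the map f itself, not of the width-3 network that implements it):

```python
def fold(xs, K):
    """One application of the Lemma 6 map on distinct integers in [0, K):
    keep x if x < T, otherwise shift by b and wrap modulo T, choosing a
    shift b in [T] that avoids collisions (possible when T >= N^2/4 + 1)."""
    N = len(xs)
    T = max(-(-K // 2), N * N // 4 + 1)  # T = max(ceil(K/2), floor(N^2/4) + 1)
    for b in range(T):
        out = [x if x < T else (x + b) % T for x in xs]
        if len(set(out)) == N:           # no two inputs share an interval
            return out, T
    raise ValueError("no collision-free shift exists")

xs = [0, 7, 19, 46, 83, 120]   # 6 distinct integers in [0, 128)
folded, T = fold(xs, 128)
print(T, folded)               # the upper bound halves: T = 64
```

Applying the map repeatedly halves the range until it stalls at about N²/4, which is exactly why Lemma 7 is needed to push below Θ(N²).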
The network in Lemma 7 can reduce the upper bound by approximately half, beyond ⌊N²/4⌋ + 1; however, its required width O(N²/K) doubles each time the current upper bound K is halved. Hence, in order to decrease the upper bound from O(N²) to V = o(N²) using Θ(log(N²/V)) applications of Lemma 7, we need Θ(log(N²/V)) layers and Θ(N²/V) parameters. Here, we construct each application of Lemma 7 using two hidden layers: one hidden layer of Θ(N²/K) hidden neurons for implementing the function, and another hidden layer with one hidden neuron for the output.

Deriving Theorem 1. Now, we choose V = Θ(N^{2−w}) for some w ∈ [2/3, 1]. Then, from Lemmas 5-7, O(log d_x + log ∆ + log N) hidden layers and O(d_x + log ∆ + N^w) parameters suffice for mapping any ∆-separated set of size N to some Z whose floors are N distinct elements of [V]. Finally, from Lemma 4 with p = w/(2 − w), an additional O(N^{2−2w}/(1 + (1.5w − 1) log N) · log C) hidden layers and O(N^w + N^{1−w/2} log C) parameters suffice for mapping Z to its labels. This completes the proof of Theorem 1.

6. EXPERIMENTS

In this section, we study the effect of depth and width. In particular, we empirically verify whether our theoretical findings extend to practice: can deep and narrow networks memorize more training pairs than their shallow and wide counterparts under a similar number of parameters? For the experiments, we use residual networks (He et al., 2016) having the same number of channels in each layer. The detailed experimental setups are presented in Section A. In the following experiments, we observe the training and test accuracy of networks while varying the number of channels (c) and the number of residual blocks (b).

6.1. DEPTH-WIDTH TRADE-OFF IN MEMORIZATION

We verify the memorization power of different network architectures having a similar number of parameters. Figure 1 illustrates the training and test accuracy of five different architectures with approximately 50000 parameters for classifying the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) and the SVHN dataset (Netzer et al., 2011). One can observe that as the network architecture becomes deeper and narrower, the training accuracy increases. Namely, deep and narrow networks memorize better than shallow and wide networks under a similar number of parameters. This observation agrees with Theorem 1, which states that increasing depth reduces the required number of parameters for memorizing the same number of pairs.

However, more memorization power does not always imply better generalization. In Figure 1b, as the depth increases, the test accuracy also increases for the SVHN dataset. In contrast, the test accuracy decreases for the CIFAR-10 dataset as the depth increases in Figure 1a. In other words, overfitting occurs for the CIFAR-10 dataset, while classifying the SVHN data benefits from more expressive power. Note that a similar observation has also been made in the recent double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2020): more expressive power can both hurt and improve generalization.

In addition, this observation can provide guidance on the design of network architectures in applications where the training accuracy and a small number of parameters are critical. For example, recent developments in video streaming services reduce traffic by compressing a video with neural networks and sending the compressed video together with the decoder network (Yeo et al., 2018). Here, the corresponding decoder is often trained only for the video to send; hence, only the training accuracy/loss matters, while the decoder's number of parameters should also be small to limit traffic.
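To make the depth-width trade-off concrete, the following sketch counts the weights of a hypothetical plain residual CNN (not the exact architectures of Section A) and shows channel/block combinations that land near the roughly 50000-parameter budget used in Figure 1:

```python
def resnet_params(c, b, k=3, in_ch=3, classes=10):
    """Rough weight count for a hypothetical plain residual CNN:
    one stem conv (in_ch -> c), b residual blocks of two c -> c convs
    each, and a linear classifier on globally pooled features.
    Biases and normalization parameters are ignored."""
    stem = in_ch * c * k * k
    blocks = b * 2 * (c * c * k * k)
    head = c * classes
    return stem + blocks + head

# Architectures with similar budgets: deeper networks must be narrower
for c, b in [(23, 5), (17, 9), (12, 19), (9, 34)]:
    print(f"c={c:2d} channels, b={b:2d} blocks: ~{resnet_params(c, b):,} params")
```

Since block cost grows quadratically in the channel count but only linearly in depth, halving the width frees a budget for roughly four times as many blocks.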

6.2. EFFECT OF WIDTH AND DEPTH

In this section, we observe the effect of depth and width by varying both. Figure 2 reports the training and test accuracy for the CIFAR-10 dataset by varying the number of channels from 5 to 30 and the number of residual blocks from 5 to 50. We present the same experimental results for the SVHN dataset in Section B. First, we observe that a network of 15 channels with feature map size 32 × 32 successfully memorizes (i.e., achieves over 99% training accuracy). This network is much narrower than modern network architectures, e.g., ResNet-18 has 64 channels at the first hidden layer (He et al., 2016). On the other hand, networks that are too narrow (e.g., 5 channels) fail to memorize. This result does not contradict Theorem 2, as the test of memorization in experiments/practice involves stochastic gradient descent. We note that similar phenomena are observed for the SVHN dataset. Furthermore, once the network memorizes, we observe that increasing width is more effective than increasing depth for improving test accuracy. These results indicate that width is not critical for memorization power, while it can be effective for generalization. Note that similar observations have been made in Zhang et al. (2017).

7. CONCLUSION

In this paper, we prove that O(N^{2/3}) parameters are sufficient for memorizing arbitrary N input-label pairs under the mild ∆-separateness condition. This significantly improves upon prior results showing the sufficiency of Θ(N) parameters with/without conditions on the pairs. In addition, Theorem 1 shows that deeper networks have more memorization power. This result coincides with the recent study on the benefits of depth for function approximation. On the other hand, Theorem 2 shows that network width is not important for memorization power. We also provide generic criteria for identifying the memorization power of networks. Finally, we empirically confirm our theoretical results.

B TRAINING AND TEST ACCURACY FOR SVHN DATASET

Figure 3 reports the training and test accuracy for the SVHN dataset by varying the number of channels and the number of residual blocks, as described in Section 6.2.

C ∆-SEPARATENESS OF GAUSSIAN RANDOM VECTORS

While we mentioned in Section 1.1 that the digital nature of data enables the ∆-separateness of inputs with small ∆, random inputs are also ∆-separated with small ∆ with high probability. In particular, we prove the following lemma.

Lemma 8. For any d_x ∈ N, consider a set of N vectors X = {x_1, …, x_N} ⊂ R^{d_x} where each entry of x_i is drawn from the i.i.d. standard normal distribution. Then, for any δ > 0, X is (N/√δ)^{2/d_x}·√(3e + (5e/d_x)·ln(N/√δ))-separated with probability at least 1 − δ.

Lemma 8 implies that Theorem 1 and Theorem 2 can be successfully applied to random Gaussian input vectors, as the O((N/√δ)^{2/d_x}·√(ln(N/√δ)))-separateness condition in Lemma 8 is much weaker than our 2^{O(N^{2/3})}-separateness condition for memorization with o(N) parameters.

Proof of Lemma 8. To begin with, we first prove that for i ≠ j, the following bound holds:

P[ √(2/e)·(N/√δ)^{−2/d_x} ≤ ‖x_i − x_j‖_2/√d_x ≤ √(6 + (10/d_x)·ln(N/√δ)) ]
= P[ (2/e)·(N/√δ)^{−4/d_x} ≤ ‖x_i − x_j‖_2^2/d_x ≤ 6 + (10/d_x)·ln(N/√δ) ]
= P[ (2/e)·(N/√δ)^{−4/d_x} ≤ (2/d_x)·X ≤ 6 + (10/d_x)·ln(N/√δ) ]
= 1 − P[ (2/e)·(N/√δ)^{−4/d_x} > (2/d_x)·X ] − P[ (2/d_x)·X > 6 + (10/d_x)·ln(N/√δ) ]
= 1 − P[ X < (1/e)·(N/√δ)^{−4/d_x}·d_x ] − P[ X > (3 + (5/d_x)·ln(N/√δ))·d_x ]
≥ 1 − ( (1/e)·(N/√δ)^{−4/d_x}·e^{1 − (1/e)·(N/√δ)^{−4/d_x}} )^{d_x/2} − ( (3 + (5/d_x)·ln(N/√δ))·e^{−2}·(N/√δ)^{−5/d_x} )^{d_x/2}
= 1 − (δ/N^2)·( e^{−(1/e)·(N/√δ)^{−4/d_x}} )^{d_x/2} − (δ/N^2)·( (3 + (5/d_x)·ln(N/√δ))·e^{−2}·(N/√δ)^{−1/d_x} )^{d_x/2}
≥ 1 − δ/N^2 − (δ/N^2)·( 3/(e^2·N^{1/d_x}) + (5/e^2)·ln((N/√δ)^{1/d_x})/(N/√δ)^{1/d_x} )^{d_x/2}
≥ 1 − δ/N^2 − (δ/N^2)·( 3/e^2 + 5/e^3 )^{d_x/2}
≥ 1 − δ/N^2 − δ/N^2 = 1 − 2δ/N^2   (2)

where X denotes a chi-square random variable with d_x degrees of freedom; note that ‖x_i − x_j‖_2^2 is distributed as 2X, since each coordinate of x_i − x_j is N(0, 2). For the first inequality in (2), we use the inequalities

P[ X < z·d_x ] ≤ inf_{t>0} E[e^{−tX}]·e^{t·z·d_x} = inf_{t>0} (1 + 2t)^{−d_x/2}·e^{t·z·d_x} = (z·e^{1−z})^{d_x/2} for 0 < z < 1,
P[ X > z·d_x ] ≤ inf_{0<t<1/2} E[e^{tX}]·e^{−t·z·d_x} = inf_{0<t<1/2} (1 − 2t)^{−d_x/2}·e^{−t·z·d_x} = (z·e^{1−z})^{d_x/2} for z > 1,

which directly follow from the Chernoff bound for the chi-square distribution. For the third inequality in (2), we use the fact that max_{x>0} (ln x)/x = 1/e and N ≥ 1. For the last inequality in (2), we use 3/e^2 + 5/e^3 < 1.

Then, X is (N/√δ)^{2/d_x}·√(3e + (5e/d_x)·ln(N/√δ))-separated with probability at least 1 − δ as the following bound holds:

P[ √(2/e)·(N/√δ)^{−2/d_x} ≤ ‖x_i − x_j‖_2/√d_x ≤ √(6 + (10/d_x)·ln(N/√δ)), ∀ i ≠ j ]
= 1 − P[ ∃ i ≠ j s.t. ‖x_i − x_j‖_2/√d_x > √(6 + (10/d_x)·ln(N/√δ)) or ‖x_i − x_j‖_2/√d_x < √(2/e)·(N/√δ)^{−2/d_x} ]
≥ 1 − ∑_{i≠j} P[ ‖x_i − x_j‖_2/√d_x > √(6 + (10/d_x)·ln(N/√δ)) or ‖x_i − x_j‖_2/√d_x < √(2/e)·(N/√δ)^{−2/d_x} ]
= 1 − ∑_{i≠j} ( 1 − P[ √(2/e)·(N/√δ)^{−2/d_x} ≤ ‖x_i − x_j‖_2/√d_x ≤ √(6 + (10/d_x)·ln(N/√δ)) ] )
≥ 1 − (N(N − 1)/2) × (2δ/N^2) ≥ 1 − δ,

where the first inequality follows from the union bound and the second inequality follows from (2). This completes the proof of Lemma 8.
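The two-sided pairwise-distance bound from the proof of Lemma 8 is easy to check empirically. The sketch below (with assumed values of N, d_x, and δ, and a fixed seed) draws Gaussian vectors and verifies that all scaled pairwise distances fall inside the interval:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx, delta = 50, 64, 0.1

# N i.i.d. standard Gaussian vectors in R^dx.
X = rng.standard_normal((N, dx))
diffs = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diffs ** 2).sum(-1))
iu = np.triu_indices(N, k=1)
scaled = dists[iu] / np.sqrt(dx)        # ||x_i - x_j||_2 / sqrt(dx) over all pairs

# The bound of Lemma 8's proof: sqrt(2/e)*(N/sqrt(delta))^(-2/dx) below,
# sqrt(6 + (10/dx)*ln(N/sqrt(delta))) above.
t = np.log(N / np.sqrt(delta))
lower = np.sqrt(2.0 / np.e) * (N / np.sqrt(delta)) ** (-2.0 / dx)
upper = np.sqrt(6.0 + 10.0 * t / dx)
assert (scaled > lower).all() and (scaled < upper).all()
```

With these parameters the scaled distances concentrate near √2, comfortably inside [lower, upper], matching the high-probability statement of the lemma.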

D MAIN PROOF IDEA: MEMORIZATION USING STEP AND ID D.1 MAIN IDEA

In the proofs of Theorem 1, Theorem 2, and Theorem 3, we use each sigmoidal activation for approximating either the identity function (ID : x → x) or the binary step function (STEP : x → 1[x ≥ 0]). Thus, we construct our network for the proofs using only ID and STEP activation functions, which directly provides the network construction using sigmoidal activation functions (see Section D.2 and Section D.3 for details).

D.2 TOOLS

We present the following claims, which are important for the proofs. In particular, Claim 9 and Claim 10 show how to approximate ID and STEP by a single sigmoidal neuron.

Claim 9 [Kidger and Lyons (2020, Lemma 4.1)]. For any sigmoidal activation function σ, for any bounded interval I ⊂ R, for any ε > 0, there exist a, b, c, d ∈ R such that |a·σ(c·x + d) + b − x| < ε for all x ∈ I.

Claim 10. For any sigmoidal activation function σ, for any ε, δ > 0, there exist a, b, c ∈ R such that |a·σ(c·x) + b − 1[x ≥ 0]| < ε for all x ∉ [−δ, δ].

Proof of Claim 10. We assume that α := lim_{x→−∞} σ(x) < lim_{x→∞} σ(x) =: β, where the case β < α can be proved in a similar manner. From the definition of α, β, there exists k > 0 such that |σ(x) − α| < (β − α)·ε if x < −k and |σ(x) − β| < (β − α)·ε if x > k. Then, choosing a = 1/(β − α), b = −α/(β − α), and c = k/δ completes the proof of Claim 10.

Claim 11. For any a, x ∈ R such that a ≠ 0, for any b ∈ N, it holds that ⌊x/(a·b)⌋ = ⌊⌊x/a⌋/b⌋.

Proof of Claim 11. Suppose that ⌊x/(a·b)⌋ < ⌊⌊x/a⌋/b⌋. Then, there exists an integer m such that ⌊x/(a·b)⌋ < m ≤ ⌊x/a⌋/b, and hence b·m ≤ ⌊x/a⌋ ≤ x/a, i.e., m ≤ x/(a·b), which contradicts ⌊x/(a·b)⌋ < m. The case ⌊x/(a·b)⌋ > ⌊⌊x/a⌋/b⌋ leads to a contradiction in the same manner. This completes the proof of Claim 11.
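Claims 9 and 10 can be illustrated numerically. The sketch below uses the logistic sigmoid; the particular coefficients a, b, c are our choices for this activation (for STEP, a steep slope; for ID, a flat slope exploiting near-linearity at the origin), not the general constructions of the claims.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Claim 10 (STEP): for the logistic sigmoid, alpha = 0 and beta = 1, so
# a = 1, b = 0 and a sufficiently steep slope c approximate 1[x >= 0]
# outside [-delta, delta].
eps, delta = 1e-3, 0.1
c = 2.0 * np.log(1.0 / eps - 1.0) / delta
xs = np.array([-1.0, -0.5, -delta, delta, 0.5, 1.0])
step_err = np.abs(sigmoid(c * xs) - (xs >= 0).astype(float)).max()
assert step_err < eps

# Claim 9 (ID): near 0 the sigmoid is almost linear, sigmoid(c*x) ~ 1/2 + c*x/4,
# so a = 4/c, b = -2/c, d = 0 recover x on [-1, 1] up to O(c^2) error.
c = 1e-3
xs = np.linspace(-1.0, 1.0, 101)
id_err = np.abs((4.0 / c) * (sigmoid(c * xs) - 0.5) - xs).max()
assert id_err < 1e-3
```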

D.3 TRANSFORMING STEP+ID NETWORK TO SIGMOIDAL NETWORK

In this section, we describe how to transform a STEP + ID network into a sigmoidal network within arbitrary error. Formally, we prove the following lemma. Lemma 12. For any finite set of inputs X, for any STEP + ID network f, for any ε > 0, for any sigmoidal activation function σ, there exists a σ network g having the same architecture as f such that |f(x) − g(x)| < ε for all x ∈ X. Then, the construction in Lemma 12 enables us to construct STEP + ID networks instead of networks using a sigmoidal activation function for proving Theorem 1, Theorem 2, and Theorem 3. Proof of Lemma 12. Without loss of generality, we first assume that for any STEP + ID network h, in the evaluation of h(x), all inputs to STEP neurons (i.e., 1[x ≥ 0]) are non-zero for all x ∈ X. This assumption can be easily satisfied by adding some small bias to the inputs of the STEP neurons, where such a bias always exists since |X| < ∞. Furthermore, introducing this assumption does not change h(x) for any x ∈ X. Now, we describe our construction of g. Let δ > 0 be a number such that the absolute value of every input to a STEP neuron in the evaluation of f(x) is at least δ for all x ∈ X. Such δ always exists due to the assumption we made. Let L be the number of hidden layers in f. Starting from f, we iteratively substitute σ neurons for the STEP and ID hidden neurons, from the last hidden layer to the first hidden layer. In particular, using Claim 9 and Claim 10, we replace the ID and STEP neurons in the ℓ-th hidden layer by σ neurons approximating ID and STEP. First, let g_L be a network identical to f except for its L-th hidden layer, which consists of σ neurons approximating the ID and STEP neurons of f. Here, we approximate the ID and STEP neurons by σ accurately enough, using Claim 9 and Claim 10, so that |g_L(x) − f(x)| < ε/L for all x ∈ X. Note that such an approximation always exists due to the existence of δ > 0.
Now, let g_{L−1} be a network identical to g_L except for its (L−1)-th hidden layer, which consists of σ neurons approximating the ID and STEP neurons of g_L. Here, we also approximate the ID and STEP neurons by σ accurately enough, using Claim 9 and Claim 10, so that |g_L(x) − g_{L−1}(x)| < ε/L for all x ∈ X. If we repeat this procedure until replacing the first hidden layer, then g := g_1 is the desired network satisfying |f(x) − g(x)| < ε for all x ∈ X. This completes the proof of Lemma 12.

E.4 PROOF OF LEMMA 7

In the compression 2 step, we further improve the bound U := ⌈N^2/4⌉ + 1 on the hidden feature values to V = Θ(N^{2−w}). Namely, we map X_3 to X_4 such that X_4 ∈ [V]^N. To construct such a mapping, we introduce the following lemma. The proof of Lemma 16 is presented in Section H.3.

Lemma 16. For any N, K, L, d_1, …, d_L ∈ N such that N < K and d_ℓ ≥ 3 for all ℓ, for any X such that X ∈ [K]^N, there exists a STEP + ID network f of L hidden layers having d_ℓ neurons at the ℓ-th hidden layer such that f(X) ∈ [T]^N where T := min{K, max{N·⌈N/C⌉, ⌈K/2⌉}} and C := 1/2 + (1/2)·∏_{ℓ=1}^{L} (d_ℓ − 2).

From Lemma 16, a STEP + ID network f of one hidden layer having Θ(N^2/K) hidden neurons can map any X such that X ∈ [K]^N to f(X) such that f(X) ∈ [T]^N where T := max{N, ⌈K/2⌉}. Combining this with Lemma 12 completes the proof of Lemma 7.
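The STEP-to-sigmoid substitution of Lemma 12 can be checked on a toy network. In this sketch (our own example, not a construction from the paper), a one-hidden-layer STEP network is replaced by a sigmoid network; since every STEP input has absolute value at least δ on the finite input set X, a steep sigmoid achieves error below ε:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy STEP + ID network f(x) = 2 * STEP(x - 0.5) - 1 on a finite input set X.
X = np.array([0.1, 0.3, 0.49, 0.51, 0.8])
f = lambda x: 2.0 * (x - 0.5 >= 0) - 1.0

# Lemma 12-style substitution: every input to the STEP neuron has absolute
# value at least delta = 0.01 on X, so a sufficiently steep sigmoid works.
eps, delta = 1e-3, 0.01
c = 2.0 * np.log(1.0 / eps) / delta     # slope; any sufficiently large c works
g = lambda x: 2.0 * sigmoid(c * (x - 0.5)) - 1.0

assert np.max(np.abs(f(X) - g(X))) < eps
```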

F PROOF OF THEOREM 2

In this proof, we construct a STEP + ID network and transform it to a σ network using Lemma 12, as in the proof of Theorem 1 in Section 5 and Section E. In particular, Theorem 2 is a direct consequence of Lemma 14, Lemma 15, Lemma 16, and Lemma 17, presented as follows. The proof of Lemma 17 is presented in Section H.7.

Lemma 17. For any A, B, D, K ∈ N such that A·B ≥ K, there exists a STEP + ID network f_θ of A + (2D + 1)·B hidden layers and width 3 satisfying the following property: For any finite set X ⊂ [0, K), for any y : [K] → [2^D], there exists θ such that f_θ(x) = y(⌊x⌋) for all x ∈ X.

From Lemma 14, Lemma 15, and Claim 11, a STEP + ID network of ⌈log(∆·√(2πd_x))⌉ hidden layers and width 3 can map a ∆-separated set of inputs X_1 to X_2 such that X_2 ∈ [U]^N where U := ⌈N^2/4⌉ + 1. From Lemma 16, a STEP + ID network of ∑_{i=1}^{⌈log(U/V)⌉−1} ⌈2N/(⌈U/(2^i·N)⌉ − 1)⌉ + ⌈2N/(⌈V/N⌉ − 1)⌉ hidden layers and width 3 can map X_2 to X_3 such that X_3 ∈ [V]^N for some N ≤ V ≤ U. Here, we will choose V = Θ(N^{4/3}). Finally, from Lemma 17, for any A, B ∈ N such that A·B ≥ V, a STEP + ID network of A + (2D + 1)·B hidden layers and width 3 can map X_3 to their labels, where D := ⌈log C⌉. We will choose A, B = Θ(N^{2/3}) to satisfy A·B ≥ V = Θ(N^{4/3}).

Hence, for any A, B, V ∈ N such that N ≤ V ≤ U and A·B ≥ V, a STEP + ID network of ⌈log(∆·√(2πd_x))⌉ + ∑_{i=1}^{⌈log(U/V)⌉−1} ⌈2N/(⌈U/(2^i·N)⌉ − 1)⌉ + ⌈2N/(⌈V/N⌉ − 1)⌉ + A + (2D + 1)·B hidden layers and width 3 can memorize an arbitrary ∆-separated set of size N. Note that combining functions does not require additional hidden layers, as the linear maps constructing the outputs of functions can be absorbed into the first linear map of the next function. Finally, substituting V ← ⌈N^{4/3}⌉, A ← ⌈√((2D + 1)·V)⌉, and B ← ⌈√(V/(2D + 1))⌉ and using Lemma 12 result in the statement of Theorem 2. This completes the proof of Theorem 2.

G PROOF OF THEOREM 3

The proof of Theorem 3 has the same structure as the proof of Theorem 1, consisting of four steps: projection, compression 1, compression 2, and learning. In particular, we construct a STEP + ID network and use Lemma 12, as in the proofs of Theorem 1 and Theorem 2. For the network construction, we divide the function of f_θ into four disjoint parts, as in Section 5. The first part does not utilize hidden layers but projects the input vectors to scalar values. The second part, corresponding to the first L_1 hidden layers, decreases the upper bound on the scalar values to O(N^2). The third part, corresponding to the next L_{K−1} − L_1 hidden layers, further decreases the upper bound to o(N^2). The last part, corresponding to the remaining hidden layers, constructs a network mapping the hidden features to their labels. Now, we describe our construction in detail. To begin with, let us denote a ∆-separated set of inputs by X_1. First, from Lemma 14, one can project X_1 to X_2 such that X_2 ∈ [⌈N^2·∆·√(πd_x/8)⌉]^N. Note that the projection step does not require hidden layers, as it can be absorbed into the linear map before the first hidden layer. Then, from Lemma 15, the first L_1 hidden layers can map X_2 to X_3 such that X_3 ∈ [⌈N^2/4⌉ + 1]^N since

∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋ ≥ ∆·√(2πd_x)
⇒ ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋ ≥ N^2·∆·√(πd_x/8) / (N^2/4)
⇒ N^2/4 ≥ N^2·∆·√(πd_x/8) / ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋
⇒ N^2/4 + 1 ≥ ⌈N^2·∆·√(πd_x/8)⌉ / ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋
⇒ ⌈N^2/4⌉ + 1 ≥ ⌈ ⌈N^2·∆·√(πd_x/8)⌉ / ∏_{ℓ=1}^{L_1} ⌊(d_ℓ + 1)/2⌋ ⌉.

Note that we also utilize Claim 11 for the sequential application of Lemma 15.
Consecutively, from Lemma 16, the next L_{K−1} − L_1 hidden layers can map X_3 to X_4 such that X_4 ∈ [⌈U/2^{K−2}⌉]^N since the third condition holds and

∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2) ≥ 2^{i+3}
⇒ (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2) ≥ 2^{i+2}
⇒ 1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2) ≥ 2^{i+2}
⇒ N / (1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2)) ≤ N/2^{i+2}
⇒ ⌈N / (1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2))⌉ ≤ N/2^{i+1}
⇒ N·⌈N / (1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2))⌉ ≤ (N^2/4)/2^{i−1}
⇒ N·⌈N / (1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2))⌉ ≤ (N^2/4 + 1)/2^{i−1}
⇒ N·⌈N / (1/2 + (1/2)·∏_{ℓ=L_{i−1}+1}^{L_i} (d_ℓ − 2))⌉ ≤ ⌈(⌈N^2/4⌉ + 1)/2^{i−1}⌉,

where we use the inequality ⌈a⌉ ≤ 2a for a ≥ 1/2 and the assumption K ≤ log N, i.e., N/2^{i+1} ≥ 1 for i ≤ log N − 1. Here, we also utilize Claim 11 for the sequential application of Lemma 16. Now, we reinterpret the last condition as follows:

2^K · (∏_{ℓ=L_{K−1}+1}^{L_K} (d_ℓ − 2)) · ⌊(L − L_K − 1)/⌈log C⌉⌋ ≥ N^2 + 4
⇒ (∏_{ℓ=L_{K−1}+1}^{L_K} (d_ℓ − 2)) · ⌊(L − L_K − 1)/⌈log C⌉⌋ ≥ (N^2/4 + 1)/2^{K−2}
⇒ (∏_{ℓ=L_{K−1}+1}^{L_K} (d_ℓ − 2)) · ⌊(L − L_K − 1)/⌈log C⌉⌋ ≥ ⌈(⌈N^2/4⌉ + 1)/2^{K−2}⌉.

Finally, using the following lemma and the above inequality, the remaining hidden layers can map X_4 to their corresponding labels (by choosing L′ := L_3). The proof of Lemma 18 is presented in Section H.8.

Lemma 18. For D, K, L, d_1, …, d_L ∈ N, suppose that there exist 0 < L′ < L and r_{L′+1}, …, r_{L−1} ∈ N satisfying, with r_{L′} = r_L = 1,

(∏_{ℓ=1}^{L′} (d_ℓ − 2)) · ⌊(∑_{ℓ=L′+1}^{L−1} r_ℓ)/D⌋ ≥ K and 2^{r_ℓ} + r_ℓ + r_{ℓ−1} + 3 ≤ d_ℓ for all L′ + 1 ≤ ℓ ≤ L.

Then, there exists a STEP + ID network f_θ of L hidden layers having d_ℓ hidden neurons at the ℓ-th hidden layer such that for any finite X ⊂ [0, K), for any y : [K] → [2^D], there exists θ satisfying f_θ(x) = y(⌊x⌋) for all x ∈ X.

Choosing K ← ⌈U/2^{K−2}⌉ and r_ℓ = 1 for all ℓ in Lemma 18 completes our construction of the STEP + ID network. Note that we utilize the condition that d_ℓ ≥ 7 for all ℓ > L_K here. Using Lemma 12 completes the proof of Theorem 3.

H PROOFS OF TECHNICAL LEMMAS

H.1 PROOF OF LEMMA 14

Since the upper bound holds for any unit vector u ∈ R^{d_x}, we focus on the lower bound. In addition, if N = 1, the result of Lemma 14 trivially holds; hence, we assume N ≥ 2, i.e., √(8/(πd_x))·(1/N^2) < 1. In this proof, we show that given any vector v ∈ R^{d_x} such that ‖v‖_2 ≠ 0, a random unit vector u ∈ R^{d_x} from the uniform distribution satisfies

P[ |u^⊤v|/‖v‖_2 < √(8/(πd_x))·(1/N^2) ] < 2/N^2.   (4)

This implies that there exists a unit vector u ∈ R^{d_x} such that √(8/(πd_x))·(1/N^2)·‖x − x′‖_2 ≤ |u^⊤(x − x′)| for all x, x′ ∈ X, due to the following union bound: for V = {x − x′ : {x, x′} ⊂ X, x′ ≤ x} for some total order ≤ on X,

P[ ∃ v ∈ V : |u^⊤v|/‖v‖_2 < √(8/(πd_x))·(1/N^2) ] ≤ ∑_{v∈V} P[ |u^⊤v|/‖v‖_2 < √(8/(πd_x))·(1/N^2) ] < (N(N − 1)/2) × (2/N^2) < 1.

Now we prove (4). To begin with, we show that the following equality holds for any v ≠ 0:

P[ |u^⊤v|/‖v‖_2 < √(8/(πd_x))·(1/N^2) ] = P[ |u_1| < √(8/(πd_x))·(1/N^2) ].

Here, the equality follows from choosing v/‖v‖_2 = (1, 0, …, 0) using symmetry. Furthermore, P[|u_1| < √(8/(πd_x))·(1/N^2)] can be bounded as

P[ |u_1| < √(8/(πd_x))·(1/N^2) ]
= P[ 0 < u_1 < √(8/(πd_x))·(1/N^2) ] + P[ −√(8/(πd_x))·(1/N^2) < u_1 ≤ 0 ]
= 2 × P[ 0 < u_1 < √(8/(πd_x))·(1/N^2) ]
= (2/Area_{d_x}(1)) × ∫_{arccos(√(8/(πd_x))·(1/N^2))}^{π/2} Area_{d_x−1}(sin φ) dφ
= 2 × (Area_{d_x−1}(1)/Area_{d_x}(1)) × ∫_{arccos(√(8/(πd_x))·(1/N^2))}^{π/2} sin^{d_x−2} φ dφ
= 2 × ((2π^{(d_x−1)/2}/Γ((d_x−1)/2)) / (2π^{d_x/2}/Γ(d_x/2))) × ∫_{arccos(√(8/(πd_x))·(1/N^2))}^{π/2} sin^{d_x−2} φ dφ
< 2 × √(d_x/(2π)) × ∫_{arccos(√(8/(πd_x))·(1/N^2))}^{π/2} 1 dφ
= √(2d_x/π) × ( π/2 − arccos(√(8/(πd_x))·(1/N^2)) )
= √(2d_x/π) × arcsin(√(8/(πd_x))·(1/N^2))
≤ √(2d_x/π) × (π/2) × √(8/(πd_x))·(1/N^2)
= 2/N^2,

where Area_d(r) := 2π^{d/2}·r^{d−1}/Γ(d/2) denotes the surface area of a hypersphere of radius r in R^d and Γ(x) denotes the gamma function. Here, the second equality follows from symmetry and P[u_1 = 0] = 0. The first inequality follows from sin φ ≤ 1 and Γ(d_x/2)/Γ((d_x−1)/2) < √(d_x/2), which follows from Gautschi's inequality (see Lemma 19). The second inequality follows from φ ≤ (π/2)·sin φ for 0 ≤ φ ≤ π/2. This completes the proof of Lemma 14.

Lemma 19 [Gautschi's inequality (Gautschi, 1959)]. For any x > 0, for any s ∈ (0, 1),

x^{1−s} < Γ(x + 1)/Γ(x + s) < (x + 1)^{1−s}.
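Gautschi's inequality, and the special case used in the proof of Lemma 14, can be checked numerically (the sampled values of x, s, and d_x below are our choices):

```python
import math

# Gautschi's inequality (Lemma 19):
#   x**(1-s) < Gamma(x+1)/Gamma(x+s) < (x+1)**(1-s)  for x > 0, s in (0, 1).
for x in [0.5, 1.0, 3.0, 10.0, 31.5]:
    for s in [0.1, 0.5, 0.9]:
        ratio = math.gamma(x + 1) / math.gamma(x + s)
        assert x ** (1 - s) < ratio < (x + 1) ** (1 - s)

# Special case used in the proof of Lemma 14 (take x = dx/2 - 1, s = 1/2):
#   Gamma(dx/2) / Gamma((dx-1)/2) < sqrt(dx/2).
for dx in range(2, 50):
    assert math.gamma(dx / 2) / math.gamma((dx - 1) / 2) < math.sqrt(dx / 2)
```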

H.2 PROOF OF LEMMA 15

In this proof, we assume that K > N^2/4 and T = ⌈K/⌊(d + 1)/2⌋⌉, where the other cases trivially follow from this case. To begin with, we first define a network f_b : [0, K) → [0, T) for b = (b_i)_{i=1}^{⌊(d−1)/2⌋} ∈ [T]^{⌊(d−1)/2⌋} as

f_b(x) := x if x < T; (x + b_i) mod T if i·T ≤ x < (i + 1)·T, for i = 1, …, ⌊(d−1)/2⌋
       = x − T·1[x ≥ T] + ∑_{i=1}^{⌊(d−1)/2⌋} ( (b_i − b_{i−1})·1[x ≥ i·T] − T·1[x + b_i ≥ (i + 1)·T] )   (5)

where b_0 := 0. One can easily observe that f_b can be implemented by a STEP + ID network of 1 hidden layer and width d, as the term T·1[x ≥ T] in (5) can be absorbed into (b_i − b_{i−1})·1[x ≥ i·T] in (5) for i = 1. Now, we show that if T > N^2/4, then there exists b ∈ [T]^{⌊(d−1)/2⌋} such that |f_b(X)| = N, which completes the proof. Our proof utilizes mathematical induction on i: If there exist b_1, …, b_{i−1} ∈ [T] such that |f_b({x ∈ X : x < i·T})| = |{x ∈ X : x < i·T}|, then there exists b_i ∈ [T] such that |f_b({x ∈ X : x < (i + 1)·T})| = |{x ∈ X : x < (i + 1)·T}|. One can observe that the statement trivially holds for the base case, i.e., for {x ∈ X : x < T}. Now, using the induction hypothesis, suppose that there exist b_1, …, b_{i−1} ∈ [T] satisfying (6). We prove that there exists b_i ∈ [T] such that S_{b_i} := f_b({x ∈ X : i·T ≤ x < (i + 1)·T}) does not intersect with T̃ := f_b({x ∈ X : x < i·T}), i.e., (7) holds. Consider the following inequality:

∑_{b_i ∈ [T]} |S_{b_i} ∩ T̃| = |S_{b_i}| × |T̃| ≤ N^2/4,

where the equality follows from the fact that for each x ∈ {x ∈ X : i·T ≤ x < (i + 1)·T}, there exist exactly |T̃| values of b_i such that (x + b_i) mod T ∈ T̃. However, since the number of possible choices of b_i is T, if T > N^2/4, then there exists b_i ∈ [T] such that S_{b_i} ∩ T̃ = ∅, i.e., (7) holds. This completes the proof of Lemma 15.
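A toy instance of this compression is easy to run. The sketch below partitions [0, K) into length-T blocks and greedily picks each shift b_i so that a fixed set X stays injective after mapping into [0, T); for simplicity we also allow a shift on the first block (the lemma fixes it to the identity), and the values of N and K are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 10, 1000
T = N * N // 4 + 1                      # T > N^2 / 4 as required by Lemma 15
X = rng.choice(K, size=N, replace=False)

used = set()
for i in range(-(-K // T)):             # ceil(K / T) blocks of length T
    block = [int(x) for x in X if i * T <= x < (i + 1) * T]
    # Counting argument: sum over b of |shifted & used| = |block| * |used|
    # <= N^2 / 4 < T, so some shift b avoids all collisions.
    b = next(b for b in range(T) if not ({(x + b) % T for x in block} & used))
    used |= {(x + b) % T for x in block}

assert len(used) == N                   # X was mapped injectively into [0, T)
```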

H.3 PROOF OF LEMMA 16

The proof of Lemma 16 is similar to that of Lemma 15. To begin with, we first define a network f_b : [0, K) → [0, T) for b = (b_i)_{i=1}^{C} ∈ [T]^C and T =: M_1 < M_2 < ⋯ < M_{C+1} := K chosen so that |{x ∈ X : M_i ≤ x < M_{i+1}}| ≤ ⌈N/C⌉ ≤ ⌊T/N⌋ for all i, as

f_b(x) := x if x < T; (x + b_i) mod T if M_i ≤ x < M_{i+1}, for i = 1, …, C
       = x − 2T·1[x ≥ T] + ∑_{i=1}^{C} ( (T + b_i − b_{i−1})·1[x ≥ M_i] − T·1[x ≥ min{2T − b_i, M_{i+1}}] )   (8)

where b_0 := 0 and (8) holds as T ≥ ⌈K/2⌉. Here, one can easily implement f_b by a STEP + ID network of L hidden layers with d_ℓ neurons at the ℓ-th hidden layer by utilizing one neuron for storing the input x, another neuron for storing the temporary output, and the remaining neurons for implementing the indicator functions at each layer. This is because one does not need to store x in the last hidden layer and there are 2C indicator functions to implement (the term 2T·1[x ≥ T] in (8) can be absorbed into (T + b_i − b_{i−1})·1[x ≥ M_i] in (8) for i = 1). Here, one can observe that the statement trivially holds for the base case, i.e., for {x ∈ X : x < T}. From the induction hypothesis, suppose that there exist b_1, …, b_{i−1} ∈ [T] satisfying (9). Now, we prove that there exists b_i ∈ [T] such that S_{b_i} := f_b({x ∈ X : M_i ≤ x < M_{i+1}}) does not intersect with T̃ := f_b({x ∈ X : x < M_i}), i.e., (10) holds. Consider the following inequality:

∑_{b_i ∈ [T]} |S_{b_i} ∩ T̃| = |S_{b_i}| × |T̃| ≤ ⌊T/N⌋ × (N − ⌊T/N⌋) < T.

where η_{r,i} is a constant such that η_{r,i} = 1 if i × 2^{−R} ≤ x < (i + 1) × 2^{−R} implies that the ((ℓ − 1)·R + r)-th bit of x is 1, and η_{r,i} = 0 otherwise. Here, one can easily observe that g_1 can be implemented by linear combinations of 1[v ≥ 2^{−R}], …, 1[v ≥ (2^R − 1) × 2^{−R}], as 1[v ≥ 0] and 1[v < 2^{−(ℓ−1)R}] hold trivially, i.e., 2^R − 1 indicator functions are enough for g_1. Hence, g_1 can be implemented by a STEP + ID network of 1 hidden layer consisting of 2^R + 1 hidden neurons, where 2 additional neurons are for passing x, v.
In addition, g_2 can be implemented by a STEP + ID network of R hidden neurons. Finally, g_3 can be implemented by a STEP + ID network of 1 hidden layer consisting of R + 2 hidden neurons (R neurons for R indicator functions and 2 neurons for passing x, v). Therefore, f_ℓ can be implemented by a STEP + ID network of 2 hidden layers consisting of 2^R + R + 1 hidden neurons in the first hidden layer and R + 2 hidden neurons in the second hidden layer. Note that the implementation within two hidden layers is possible since the outputs of g_1, g_2 are simply linear combinations of their hidden activation values and hence can be absorbed into the linear map between hidden layers. This completes the proof of Lemma 22.
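The indicator-based bit extraction behind this construction can be sketched as follows. This toy version (the helper name is ours) recovers the leading R bits of v ∈ [0, 1) using only the 2^R − 1 threshold indicators 1[v ≥ j·2^{−R}], mirroring how g_1 reads bits with STEP neurons:

```python
# Recover the leading R bits of v in [0, 1) from 2^R - 1 threshold indicators.
def leading_bits(v, R):
    # Summing the indicators 1[v >= k * 2^-R] for k = 1, ..., 2^R - 1
    # yields j = floor(v * 2^R); its binary digits are the leading R bits of v.
    j = sum(v >= k * 2.0 ** (-R) for k in range(1, 2 ** R))
    return [(j >> (R - 1 - r)) & 1 for r in range(R)]

bits = [1, 0, 1, 1, 0, 1]
v = sum(b * 2.0 ** (-i) for i, b in enumerate(bits, start=1))
assert leading_bits(v, 3) == [1, 0, 1]
assert leading_bits(v, 6) == bits
```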



STEP denotes the binary threshold activation function x ↦ 1[x ≥ 0]. We set d_0 := d_x and d_L := 1. hard tanh denotes the activation function x ↦ −1[x ≤ −1] + x·1[−1 < x ≤ 1] + 1[x > 1]; note that hard tanh(x) = RELU(x + 1) − RELU(x − 1) − 1 = RELU(2 − RELU(1 − x)) − 1.



Figure 1: Depth-width trade-off under a similar number of parameters.

Figure 2: Training and test accuracy by varying width and depth for the CIFAR-10 dataset. The x-axis denotes the number of channels and the y-axis denotes the number of residual blocks.

Figure 3: Training and test accuracy by varying width and depth for the SVHN dataset. The x-axis denotes the number of channels and the y-axis denotes the number of residual blocks.



Now, we show that if T ≥ N, then there exists b ∈ [T]^C such that |f_b(X)| = N, which completes the proof. Our proof utilizes mathematical induction on i: If there exist b_1, …, b_{i−1} ∈ [T] such that

|f_b({x ∈ X : x < M_i})| = |{x ∈ X : x < M_i}|,   (9)

then there exists b_i ∈ [T] such that

|f_b({x ∈ X : x < M_{i+1}})| = |{x ∈ X : x < M_{i+1}}|.   (10)

Here, the inequality follows from the facts that |S_{b_i}| ≤ ⌊T/N⌋, |T̃| ≤ N − ⌊T/N⌋, and for each x ∈ {x ∈ X : M_i ≤ x < M_{i+1}}, there exist exactly |T̃| values of b_i such that (x + b_i) mod T ∈ T̃. However, since the number of possible choices of b_i is T, there exists b_i ∈ [T] such that S_{b_i} ∩ T̃ = ∅, i.e., (10) holds. This completes the proof of Lemma 16.

We design f as f := f_L ∘ ⋯ ∘ f_1(0, x), where each f_ℓ represents the function of the ℓ-th layer consisting of d_ℓ neurons. In particular, we construct f_ℓ as follows:

f_ℓ(w, x) := ( w + w_0·1[ℓ = 1] + ∑_{i=1}^{d_ℓ−2} (w_{c_ℓ+i} − w_{c_ℓ+i−1})·1[x ≥ i·B], x )

where w_{−1} := 0 and c_ℓ := ∑_{i=1}^{ℓ−1} (d_i − 2). Then, f is the desired function and each f_ℓ can be implemented by a STEP + ID network of 1 hidden layer consisting of d_ℓ hidden neurons (two neurons for storing x, w and the other neurons for the d_ℓ − 2 indicator functions). This completes the proof of Lemma 20.

A EXPERIMENTAL SETUP

In this section, we describe the details of the residual network architectures and hyperparameter setups. We use residual networks of the following structure. First, a convolutional layer and RELU map a 3-channel input image to a c-channel feature map. Here, the size of the feature map is identical to the size of the input image. Then, we apply b residual blocks, where each residual block maps x ↦ RELU(CONV ∘ RELU ∘ CONV(x) + x) while preserving the number of channels and the size of the feature map. Finally, we apply an average pooling layer and a fully-connected layer. We train the model for 5 × 10^5 iterations with batch size 64 by stochastic gradient descent. We use an initial learning rate of 0.1, weight decay 10^{−4}, and learning rate decay at the 1.5 × 10^5-th and 3.5 × 10^5-th iterations by a multiplicative factor of 0.1. All presented results are averaged over three independent trials.

E PROOF OF LEMMAS FOR THEOREM 1

For proving each of Lemmas 4-7, we prove a stronger technical lemma, as stated in Sections E.1-E.4, and transform the STEP + ID networks to σ networks using Lemma 12.

E.1 PROOF OF LEMMA 4

To prove Lemma 4, we introduce the following lemma. We note that Lemma 13 follows the construction for proving the VC-dimension lower bound of RELU networks (Bartlett et al., 2019). The proof of Lemma 13 is presented in Section H.4.

Lemma 13. For any

parameters satisfying the following property: For any finite set X ⊂ [0, K), for any y : [K] → [2^D], there exists θ such that f_θ(x) = y(⌊x⌋) for all x ∈ X. Combining this with Lemma 12 completes the proof of Lemma 4.

E.2 PROOF OF LEMMA 5

Since the proof is trivial when d_x = 1, we consider d_x ≥ 2. To this end, we first project all x ∈ X to u^⊤x ∈ R by choosing some unit vector u ∈ R^{d_x} so that √(8/(πd_x))·(1/N^2)·‖x − x′‖_2 ≤ |u^⊤(x − x′)| for all x, x′ ∈ X. Such a unit vector u always exists due to the following lemma. The proof of Lemma 14 is presented in Section H.1.

Lemma 14. For any N, d_x ∈ N, for any X ∈ (R^{d_x})^N, there exists a unit vector u ∈ R^{d_x} such that √(8/(πd_x))·(1/N^2)·‖x − x′‖_2 ≤ |u^⊤(x − x′)| for all x, x′ ∈ X.

Finally, we construct the desired map from the projection u^⊤x. This completes the proof of Lemma 5.

E.3 PROOF OF LEMMA 6

To prove Lemma 6, we introduce the following lemma. The proof of Lemma 15 is presented in Section H.2.

Lemma 15. For any N, K, d ∈ N, for any X such that X ∈ [K]^N, there exists a STEP + ID network f of 1 hidden layer and width d such that f(X) ∈ [T]^N where T := max{⌈K/⌊(d + 1)/2⌋⌉, ⌈N^2/4⌉ + 1}.

From Lemma 15, one can observe that a STEP + ID network of 1 hidden layer consisting of 3 hidden neurons can map any Z such that Z ∈ [K]^N to f(Z) ∈ [max{⌈K/2⌉, ⌈N^2/4⌉ + 1}]^N. Combining Lemma 15 and Lemma 12 completes the proof of Lemma 6.

H.4 PROOF OF LEMMA 13

In this proof, we explicitly construct f_θ satisfying the desired property stated in Lemma 13. To begin with, we describe the high-level idea of the construction. First, we construct a map g mapping each x to the pair (⌊x/B⌋, x mod B). Here, we give labels to the pair (a, b) corresponding to the input x as ỹ(a, b) := y(⌊x⌋). Note that this label is well-defined as x_1 = x_2 implies g(x_1) = g(x_2). Now, we construct parameters w_0, …, w_{A−1} containing the label information of X; then there exists a STEP + ID network f of L layers and d_ℓ neurons at the ℓ-th layer such that f(x) = (w_{⌊x/B⌋}, x mod B) for all x ∈ X. From Lemma 20, a STEP + ID network of 1 hidden layer consisting of A + 2 hidden neurons can map x ∈ X to (w_{⌊x/B⌋}, x mod B). Note that this network requires 4A + 10 parameters overall (3A + 6 edges and A + 4 biases). From Lemma 21, a STEP + ID network of 2⌈BD/R⌉ hidden layers and ((2R + 5)·2^R + 2R^2 + 8R + 7)·⌈BD/R⌉ − R·2^R − R^2 + 3 parameters can map (w_{⌊x/B⌋}, x mod B) to the label bits. Hence, by combining Lemma 20 and Lemma 21, f_θ can be implemented by a STEP + ID network of 2⌈BD/R⌉ + 2 hidden layers and 4A + ((2R + 5)·2^R + 2R^2 + 8R + 7)·⌈BD/R⌉ − R·2^R − R^2 + 3 parameters. This completes the proof of Lemma 13.

H.6 PROOF OF LEMMA 21

We construct f(x, w), where u_i denotes the i-th bit of w in the binary representation, ∧ denotes the binary 'and' operation, and m_{i,ℓ}, r_{i,ℓ} are defined accordingly. Namely, each f_ℓ extracts R bits from the input w and stores the extracted bits to the last bits of v if the extracted bits are from the (⌊x⌋·D + 1)-th bit to the (⌊x⌋·D + D)-th bit of w. Thus, f(x, w) is the desired function for Lemma 21. To implement each f_ℓ by a STEP + ID network, we introduce Lemma 22. Note that we extract u_i from w in Lemma 22, i.e., we do not assume that u_i is given. From Lemma 22, a STEP + ID network of 2⌈BD/R⌉ hidden layers consisting of 2^R + R + 1 and R + 2 hidden neurons alternately can map (x, w) to ∑_{i=1}^{D} u_{⌊x⌋·D+i}·2^{D−i} for all x ∈ X. Considering the input dimension 2 and the output dimension 1, this network requires ((2R + 5)·2^R + 2R^2 + 8R + 7)·⌈BD/R⌉ − R·2^R − R^2 + 3 parameters. This completes the proof of Lemma 21.

Lemma 22. A STEP + ID network of 2 hidden layers having 2^R + R + 1 and R + 2 hidden neurons at the first and the second hidden layer, respectively, can implement f_ℓ.

Proof of Lemma 22. We construct f_ℓ := g_3 ∘ (g_2 ⊕ g_1), where g_2 ⊕ g_1 denotes the function concatenating the outputs of g_1, g_2. In this proof, we mainly focus on constructing f_ℓ for ℓ < ⌈BD/R⌉ since f_{⌈BD/R⌉} can be implemented similarly. We define g_1, g_2, g_3 as

H.7 PROOF OF LEMMA 17

The main idea of the proof of Lemma 17 is identical to that of Lemma 13. Recall A, B and w_0, …, w_{A−1} ∈ R from the proof of Lemma 13. From Lemma 20, for any finite set X ⊂ [0, K), for any w_0, …, w_{A−1} ∈ R, a STEP + ID network of A hidden layers and width 3 can map x to (w_{⌊x/B⌋}, x mod B) for all x ∈ X. Now, we introduce the following lemma replacing Lemma 21.

Lemma 23. For any D, B ∈ N, for any finite set X ⊂ [0, B), for any w = ∑_{i=1}^{DB} u_i·2^{−i} for some u_i ∈ {0, 1}, there exists a STEP + ID network f of (2D + 1)·B hidden layers and width 3 such that f(x, w) = ∑_{i=1}^{D} u_{⌊x⌋·D+i}·2^{D−i} for all x ∈ X.

Using Lemma 20 and Lemma 23, one can easily find a STEP + ID network of A + (2D + 1)·B hidden layers and width 3 satisfying the condition in Lemma 17. This completes the proof of Lemma 17.

Proof of Lemma 23. We construct f(x, w) as a composition of per-layer functions h_ℓ, g_ℓ, f_ℓ. Here, each h_ℓ with ℓ mod D = 1 can be implemented by a STEP + ID network of 1 hidden layer and width 3, each g_ℓ can be implemented by a STEP + ID network of 1 hidden layer and width 3, and each f_ℓ can be implemented by a STEP + ID network of 1 hidden layer and width 3. Hence, f can be implemented by a STEP + ID network of (2D + 1)·B hidden layers and width 3. This completes the proof of Lemma 23.

H.8 PROOF OF LEMMA 18

The proof of Lemma 18 is almost identical to the proof of Lemma 13. Recall A, B and w_0, …, w_{A−1} ∈ R as in the proof of Lemma 13. From Lemma 24, for any finite set X ⊂ [0, K), for any w_0, …, w_{A−1} ∈ R, the first L′ hidden layers of the STEP + ID network f_θ can map x to (w_{⌊x/B⌋}, x mod B) for all x ∈ X. Now, we introduce the following lemma replacing Lemma 21. Using Lemma 24 completes the proof of Lemma 18.

Lemma 24. For any D, B, R, L, d_1, …, d_L ∈ N, for any finite set X ⊂ [0, B), suppose that there exist r_1, …, r_{L−1} ∈ N satisfying the stated conditions for r_0. Then, there exists a STEP + ID network f of L hidden layers having d_ℓ hidden neurons at the ℓ-th hidden layer satisfying the following property for any w.

Proof of Lemma 24. The proof of Lemma 24 utilizes the network constructions in the proofs of Lemma 21 and Lemma 22. In particular, we construct f(x, w), where u_i denotes the i-th bit of w in the binary representation, ∧ denotes the binary 'and' operation, and R_ℓ, m_{i,ℓ}, s_{i,ℓ} are defined accordingly. Note that R_1 = 0 as the summation starts from i = 1 and R_L = ∑_{ℓ=1}^{L−1} r_ℓ. Namely, each f_ℓ extracts r_ℓ bits from the input w and stores the extracted bits to the last bits of v if the extracted bits are from the (⌊x⌋·D + 1)-th bit to the (⌊x⌋·D + D)-th bit of w. Thus, f(x, w) is the desired function for Lemma 24. We construct f_ℓ(x, v) using the ℓ-th hidden layer and the (ℓ + 1)-th hidden layer, i.e., there exists an overlap between the constructions of f_ℓ(x, v) and f_{ℓ+1}(x, v). In particular, Lemma 22 directly allows us to obtain such a construction under the assumption in Lemma 18 that 2^{r_ℓ} + r_ℓ + r_{ℓ−1} + 3 ≤ d_ℓ for all 1 ≤ ℓ ≤ L. This completes the proof of Lemma 24.

