PROVABLE MEMORIZATION VIA DEEP NEURAL NETWORKS USING SUB-LINEAR PARAMETERS

Abstract

It is known that O(N) parameters are sufficient for neural networks to memorize arbitrary N input-label pairs. By exploiting depth, we show that O(N^{2/3}) parameters suffice to memorize N pairs, under a mild condition on the separation of input points. In particular, deeper networks (even of width 3) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of work on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.

1. INTRODUCTION

The modern trend of over-parameterizing neural networks has shifted the focus of deep learning theory from analyzing their expressive power toward understanding their generalization capabilities. While the celebrated universal approximation theorems state that over-parameterization enables us to approximate the target function with a smaller error (Cybenko, 1989; Pinkus, 1999), this theoretical gain is too small to satisfactorily explain the observed benefits of over-parameterizing already-big networks. Instead of "how well can models fit," the question of "why models do not overfit" has become the central issue (Zhang et al., 2017).

Ironically, a recent breakthrough on the phenomenon known as double descent (Belkin et al., 2019; Nakkiran et al., 2020) suggests that answering the question of "how well can models fit" is in fact an essential element in fully characterizing generalization capabilities. In particular, the double descent phenomenon distinguishes two phases according to whether the network size is capable of memorizing the training samples. If the network size is insufficient for memorization, the traditional bias-variance trade-off occurs. However, once the network reaches the capacity that memorizes the dataset, i.e., the "interpolation threshold," larger networks exhibit better generalization. Under this new paradigm, identifying the minimum size of networks for memorizing finite input-label pairs becomes a key issue, rather than function approximation, which considers infinite inputs.

The memory capacity of neural networks has a relatively long history in the literature, where researchers have studied the minimum number of parameters for memorizing arbitrary N input-label pairs. Existing results showed that O(N) parameters are sufficient for various activation functions (Baum, 1988; Huang and Babri, 1998; Huang, 2003; Yun et al., 2019; Vershynin, 2020).
On the other hand, Sontag (1997) established the negative result that for any network using analytic definable activation functions with o(N) parameters, there exists a set of N input-label pairs that the network cannot memorize. A sub-linear number of parameters also appears in a related topic, namely the VC-dimension of neural networks. It has been proved that there exists a set of N inputs that a neural network with o(N) parameters can "shatter," i.e., memorize with arbitrary labels (Maass, 1997; Bartlett et al., 2019). Comparing the two results on o(N) parameters, Sontag (1997) showed that not all sets of N inputs can be memorized with arbitrary labels, whereas Bartlett et al. (2019) showed that at least one set of N inputs can be shattered. This suggests that there may be a reasonably large family of N input-label pairs that can be memorized with o(N) parameters, which is our main interest.

1.1. SUMMARY OF RESULTS

In this paper, we identify a mild condition satisfied by many practical datasets, and show that o(N) parameters suffice for memorizing such datasets. In order to bypass the negative result by Sontag (1997), we introduce a condition on the set of inputs, called ∆-separateness.

Definition 1. For a set X ⊂ R^{d_x}, we say X is ∆-separated if

    sup_{x,x′∈X : x≠x′} ‖x − x′‖₂ < ∆ × inf_{x,x′∈X : x≠x′} ‖x − x′‖₂.

This condition requires that the ratio of the maximum distance to the minimum distance between distinct points is bounded by ∆. Note that the condition is milder when ∆ is bigger. By Definition 1, any given finite set of (distinct) inputs is ∆-separated for some ∆, so one might ask how ∆-separateness differs from simply having distinct inputs in a dataset. The key difference is that even as the number of data points N grows, the ratio of the maximum to the minimum distance must remain bounded by ∆. Given the discrete nature of computers, there are many practical datasets that satisfy ∆-separateness, as we will see shortly. Also, this condition is more general than the minimum distance assumptions (∀i, ‖x_i‖₂ = 1; ∀i ≠ j, ‖x_i − x_j‖₂ ≥ ρ > 0) that are employed in existing theoretical results (Hardt and Ma, 2017; Vershynin, 2020). To see this, note that the minimum distance assumption implies 2/ρ-separateness. In our theorem statements, we will use the phrase "∆-separated set of input-label pairs" to denote that the set of inputs is ∆-separated.

In our main theorem sketched below, we prove the sufficiency of o(N) parameters for memorizing any ∆-separated set of N pairs (i.e., any ∆-separated set of N inputs with arbitrary labels), even for large ∆. More concretely, our result is of the following form:

Theorem 1 (Informal). For any w ∈ (2/3, 1], there exists an O(N^{2−2w}/log N + log ∆)-layer, O(N^w + log ∆)-parameter fully-connected network with sigmoidal or ReLU activation that can memorize any ∆-separated set of N input-label pairs.

We note that log has base 2.
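Definition 1 can be checked numerically on a concrete dataset. The helper below is an illustrative sketch (not part of the paper): it computes the ratio of the largest to the smallest pairwise Euclidean distance, i.e., the smallest ∆ witnessing ∆-separateness.

```python
# Hypothetical helper (not from the paper): numerically check Definition 1
# by computing the ratio of the largest to the smallest pairwise distance.
import itertools
import math

def separateness(points):
    """Return the max/min ratio of pairwise distances of `points`;
    the set is Delta-separated for every Delta above this value."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists) / min(dists)

# A unit-norm dataset with minimum pairwise distance rho satisfies
# 2/rho-separateness, matching the remark above: here rho = sqrt(2),
# max distance = 2, so the ratio is sqrt(2).
X = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
print(separateness(X))  # sqrt(2) ≈ 1.414
```

Note that the ratio stays bounded regardless of how many further unit-norm points with the same minimum distance are added, which is exactly what ∆-separateness requires as N grows.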
Theorem 1 states that if the number of layers increases with the number of pairs N, then any ∆-separated set of N pairs can be memorized by a network with o(N) parameters. Here, we can check from Definition 1 that the log ∆ term does not usually dominate the depth or the number of parameters, especially for modern deep architectures and practical datasets. For example, it is easy to check that any dataset consisting of 3-channel images (with values from {0, 1, . . . , 255}) of size a × b satisfies log ∆ < 9 + (1/2) log(ab) (e.g., log ∆ < 17 for the ImageNet dataset), which is often much smaller than the depth of modern deep architectures.

For practical datasets, we can show that networks with fewer parameters than the number of pairs can successfully memorize the dataset. For example, in order to perfectly classify one million images in the ImageNet dataset [1] with 1000 classes, our result shows that 0.7 million parameters are sufficient. The improvement is more significant for large datasets: to memorize 15.8 million bounding boxes in Open Images V6 [2] with 600 classes, our result shows that only 4.5 million parameters suffice. Theorem 1 improves the sufficient number of parameters for memorizing a large class of N pairs (i.e., 2^{O(N^w)}-separated pairs) from O(N) down to O(N^w) for any w ∈ (2/3, 1), for deep networks.

It is then natural to ask whether depth increasing with N is necessary for memorization with a sub-linear number of parameters. The following existing result on the VC-dimension implies that this is indeed necessary for memorization with o(N/log N) parameters, at least for ReLU networks.

Theorem (Bartlett et al. (2019); Informal). For L-layer ReLU networks, Ω(N/(L log N)) parameters are necessary for memorizing at least a single set of N inputs with arbitrary labels.

The above theorem implies that for ReLU networks of constant depth, Ω(N/log N) parameters are necessary for memorizing at least one set of N inputs with arbitrary labels.
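The image-dataset bound on log ∆ follows from a back-of-the-envelope calculation: two a × b 3-channel images with integer pixel values in {0, . . . , 255} are at distance at most 255·sqrt(3ab), while distinct images differ by at least 1. The sketch below (an illustration, not the paper's derivation) checks this numerically for 224 × 224 ImageNet-style images.

```python
# Sanity check (illustrative, not from the paper) of the log Delta bound
# for 3-channel images: max pairwise distance <= 255 * sqrt(3ab), min
# distance between distinct integer-valued images >= 1.
import math

def log_delta_bound(a, b):
    """Upper bound on log2(Delta) for a x b 3-channel integer images."""
    return math.log2(255 * math.sqrt(3 * a * b))

# For 224 x 224 images the bound is below 17, consistent with the
# cruder bound 9 + (1/2) * log2(a * b) stated in the text.
print(log_delta_bound(224, 224))       # ~16.6
print(9 + 0.5 * math.log2(224 * 224))  # ~16.8
```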
In contrast, by increasing depth with N, Theorem 1 shows that there is a large class of datasets that can be memorized with o(N/log N) parameters. Combining these two results, one can conclude that increasing depth is necessary and sufficient for memorizing a large class of N pairs with o(N/log N) parameters.

Given that depth is critical for memorization power, is width also critical? We prove that it is not, via the following theorem.

Theorem 2 (Informal). For a fully-connected network of width 3 with a sigmoidal or ReLU activation function, O(N^{2/3} + log ∆) parameters (i.e., layers) suffice for memorizing any ∆-separated set of N input-label pairs.

Theorem 2 states that under 2^{O(N^{2/3})}-separateness of inputs, the network width does not necessarily have to increase with N for memorization with sub-linear parameters. Furthermore, it shows that



[1] http://www.image-net.org/
[2] https://storage.googleapis.com/openimages/web/index.html

