PROVABLE MEMORIZATION VIA DEEP NEURAL NETWORKS USING SUB-LINEAR PARAMETERS

Abstract

It is known that O(N) parameters are sufficient for neural networks to memorize arbitrary N input-label pairs. By exploiting depth, we show that O(N^{2/3}) parameters suffice to memorize N pairs, under a mild condition on the separation of input points. In particular, deeper networks (even of width 3) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of work on the benefits of depth for function approximation. We also provide empirical results that support our theoretical findings.

1. INTRODUCTION

The modern trend of over-parameterizing neural networks has shifted the focus of deep learning theory from analyzing their expressive power toward understanding their generalization capabilities. While the celebrated universal approximation theorems state that over-parameterization enables us to approximate the target function with smaller error (Cybenko, 1989; Pinkus, 1999), this theoretical gain is too small to satisfactorily explain the observed benefits of over-parameterizing already-large networks. Instead of "how well can models fit," the question of "why models do not overfit" has become the central issue (Zhang et al., 2017). Ironically, a recent breakthrough on the phenomenon known as double descent (Belkin et al., 2019; Nakkiran et al., 2020) suggests that answering the question of "how well can models fit" is in fact an essential element in fully characterizing generalization. In particular, double descent distinguishes two phases according to whether the network is large enough to memorize the training samples. If the network size is insufficient for memorization, the traditional bias-variance trade-off occurs. However, once the network reaches the capacity to memorize the dataset, i.e., the "interpolation threshold," larger networks exhibit better generalization. Under this new paradigm, identifying the minimum size of networks for memorizing finite input-label pairs becomes a key issue, rather than function approximation, which considers infinitely many inputs.

The memory capacity of neural networks has a relatively long history: researchers have studied the minimum number of parameters required to memorize arbitrary N input-label pairs. Existing results show that O(N) parameters are sufficient for various activation functions (Baum, 1988; Huang and Babri, 1998; Huang, 2003; Yun et al., 2019; Vershynin, 2020).
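To make the O(N)-sufficiency claim concrete, the following is a minimal NumPy sketch of the classic width-N interpolation construction (in the spirit of the constructions in the works cited above): project the inputs to one dimension, place one ReLU "kink" between consecutive projections, and solve the resulting triangular linear system for the output weights. The function name `memorize` and all constants are illustrative choices, not the paper's method.

```python
import numpy as np

def memorize(X, y, seed=0):
    """Build a one-hidden-layer ReLU network with N hidden units
    that exactly fits the N input-label pairs (X, y)."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    # Random projection to 1D; for generic a, projections are distinct.
    a = rng.standard_normal(X.shape[1])
    z = np.sort(X @ a)
    # One bias per sample, placed between consecutive projections.
    b = np.empty(N)
    b[0] = z[0] - 1.0
    b[1:] = (z[:-1] + z[1:]) / 2.0
    # A[i, j] = ReLU(z_i - b_j) is lower triangular with a positive
    # diagonal, so the system A w = y (in sorted order) is solvable.
    A = np.maximum(z[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[np.argsort(X @ a)])
    def f(x):
        return np.maximum((x @ a)[:, None] - b[None, :], 0.0) @ w
    return f

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))   # N = 20 samples in 5 dimensions
y = rng.standard_normal(20)        # arbitrary real labels
f = memorize(X, y)
print(np.max(np.abs(f(X) - y)))    # fitting error, near machine precision
```

The construction uses N hidden units, i.e., O(N) parameters, regardless of the labels; the paper's question is whether a comparable guarantee survives with o(N) parameters.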
On the other hand, Sontag (1997) established a negative result: for any network with o(N) parameters using analytic definable activation functions, there exists a set of N input-label pairs that the network cannot memorize. A sub-linear number of parameters also appears in a related topic, namely the VC dimension of neural networks. It has been proved that there exists a set of N inputs that a neural network with o(N) parameters can "shatter," i.e., memorize with arbitrary labels (Maass, 1997; Bartlett et al., 2019). Comparing the two results on o(N) parameters, Sontag (1997) showed that not every set of N inputs can be memorized with arbitrary labels, whereas Bartlett et al. (2019) showed that at least one set of N inputs can be shattered. This suggests that there may be a reasonably large family of N input-label pairs that can be memorized with o(N) parameters, which is our main interest.

1.1. SUMMARY OF RESULTS

In this paper, we identify a mild condition satisfied by many practical datasets, and show that o(N) parameters suffice for memorizing such datasets. To bypass the negative result of Sontag (1997), we introduce a condition on the set of inputs, called ∆-separateness.

