IS DEEPER BETTER? IT DEPENDS ON LOCALITY OF RELEVANT FEATURES

Abstract

It has been recognized that a heavily overparameterized artificial neural network exhibits surprisingly good generalization performance in various machine-learning tasks. Recent theoretical studies have attempted to unveil the mystery of overparameterization. In most of those previous works, overparameterization is achieved by increasing the width of the network, while the effect of increasing the depth has remained less well understood. In this work, we investigate the effect of increasing the depth within an overparameterized regime. To gain insight into the advantage of depth, we introduce local and global labels as abstract but simple classification rules. It turns out that the locality of the feature relevant to a given classification rule plays a key role: our experimental results suggest that deeper is better for local labels, whereas shallower is better for global labels. We also compare the results of finite networks with those of the neural tangent kernel (NTK), which is equivalent to an infinitely wide network with a proper initialization and an infinitesimal learning rate. It is shown that the NTK does not correctly capture the depth dependence of the generalization performance, which indicates the importance of feature learning rather than lazy learning.

1. INTRODUCTION

Deep learning has achieved unparalleled success in various tasks of artificial intelligence such as image classification (Krizhevsky et al., 2012; LeCun et al., 2015) and speech recognition (Hinton et al., 2012). Remarkably, in modern machine learning applications, impressive generalization performance has been observed in an overparameterized regime, in which the number of parameters in the network is much larger than the number of training data samples. Contrary to what we learn in classical learning theory, an overparameterized network fits random labels and yet generalizes very well without serious overfitting (Zhang et al., 2017). We do not yet have a general theory that explains why deep learning works so well. Recently, the learning dynamics and the generalization power of heavily overparameterized wide neural networks have been studied extensively. It has been reported that training an overparameterized network easily achieves zero training error without getting stuck in local minima of the loss landscape (Zhang et al., 2017; Baity-Jesi et al., 2018). Mathematically rigorous results have also been obtained (Allen-Zhu et al., 2019; Du et al., 2019). From a different perspective, the theory of the neural tangent kernel (NTK) has been developed as a new tool to investigate an overparameterized network of infinite width (Jacot et al., 2018; Arora et al., 2019), which explains why a sufficiently wide neural network can achieve a global minimum of the training loss. As for generalization, a "double-descent" phenomenon has attracted much attention (Spigler et al., 2019; Belkin et al., 2019). The standard bias-variance tradeoff scenario predicts a U-shaped curve of the test error (Geman et al., 1992); however, one instead finds a double-descent curve, which implies that increasing model capacity beyond the interpolation threshold results in improved performance.
This finding triggered detailed studies on the behavior of the bias and variance in an overparameterized regime (Neal et al., 2019; D'Ascoli et al., 2020). The double-descent phenomenon is not explained by traditional complexity measures such as the Vapnik-Chervonenkis dimension and the Rademacher complexity (Mohri et al., 2018), and hence one seeks new complexity measures of deep neural networks that can prove better generalization bounds (Dziugaite & Roy, 2017; Neyshabur et al., 2017; 2019; Arora et al., 2018; Nagarajan & Kolter, 2017; Pérez et al., 2019). These theoretical efforts mainly focus on the effect of increasing the network width, but the benefits of network depth remain unclear. It is known that the expressivity of a deep neural network grows exponentially with the depth rather than the width (Poole et al., 2016). See also Bianchini & Scarselli (2014); Montúfar et al. (2014). However, it is unclear whether exponential expressivity actually leads to better generalization (Ba & Caruana, 2014; Becker et al., 2020). It is also nontrivial whether typical problems encountered in practice require such high expressivity. Although some works (Eldan & Shamir, 2016; Safran & Shamir, 2017) have shown that there exist simple and natural functions that are efficiently approximated by a network with two hidden layers but not by a network with one hidden layer, a recent work (Malach & Shalev-Shwartz, 2019) has demonstrated that, when trained with a gradient-based optimization algorithm, a deep network can only learn functions that are well approximated by a shallow network, which indicates that the benefits of depth are not due to the high expressivity of deep networks. Some other recent works have reported no clear advantage of depth in an overparameterized regime (Geiger et al., 2019a; b).
To gain insight into the advantage of depth, in the present paper, we report an experimental study on the depth and width dependences of generalization in abstract but simple, well-controlled classification tasks with fully connected neural networks. We introduce local labels and global labels, both of which give simple mappings between inputs and output class labels. By "local", we mean that the label is determined by only a few components of the input vector. On the other hand, a global label is determined by a global feature that is expressed as a sum of local quantities, and hence all components of an input contribute to the global label. Our experiments show strong depth dependences of the generalization error for these simple input-output mappings. In particular, we find that deeper is better for local labels, while shallower is better for global labels. This result implies that depth is not always advantageous, but that the locality of the relevant features gives us a clue for understanding when depth is beneficial. We also compare the generalization performance of a trained network of finite width with that of the kernel method with the NTK. The latter corresponds to the infinite-width limit of a fully connected network with an appropriate initialization and an infinitesimal learning rate (Jacot et al., 2018), which is referred to as the NTK limit. In the NTK limit, the network parameters stay close to their initial values during training, which is called lazy learning (Chizat et al., 2019). It is known that a wide but finite network can still be in the lazy-learning regime for sufficiently small learning rates (Ji & Telgarsky, 2019; Chen et al., 2019). We find, however, that even as the width increases, in some cases the generalization error at an optimal learning rate does not converge to the NTK limit. In such a case, a finite-width network shows much better generalization than kernel learning with the NTK.
This finding indicates the importance of feature learning, in which the network parameters change to learn relevant features.

2. SETTING

We consider a classification task with a training dataset D = {(x^(µ), y^(µ)) : µ = 1, 2, . . . , N}, where x^(µ) ∈ R^d is an input and y^(µ) ∈ {1, 2, . . . , K} is its label. In this work, we consider binary classification, K = 2, unless otherwise stated.

2.1. DATASET

Each input x = (x_1, x_2, . . . , x_d)^T is a d-dimensional vector whose components are i.i.d. Gaussian random variables with zero mean and unit variance, where a^T denotes the transpose of a vector a. For each input x, we assign a label y according to one of the following rules.

k-LOCAL LABEL

We randomly fix integers {i_1, i_2, . . . , i_k} with 1 ≤ i_1 < i_2 < · · · < i_k ≤ d. In the "k-local" label, the relevant feature is given by the product of the corresponding k components of an input x; that is, the label y is determined by

y = 1 if x_{i_1} x_{i_2} · · · x_{i_k} ≥ 0, and y = 2 otherwise. (1)
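As a concrete illustration, the dataset construction above can be sketched in a few lines of NumPy. The function name, argument names, and random-seed handling below are our own choices, not part of the paper; the code simply samples standard-Gaussian inputs and assigns labels according to Eq. (1).

```python
import numpy as np

def make_k_local_dataset(N, d, k, seed=0):
    """Sample N standard-Gaussian inputs in R^d and assign k-local labels.

    The k relevant indices i_1 < ... < i_k are drawn once and shared by
    all samples, as in the paper's setting.
    """
    rng = np.random.default_rng(seed)
    # Fix the k relevant indices for the whole dataset.
    idx = np.sort(rng.choice(d, size=k, replace=False))
    # Inputs: i.i.d. Gaussian entries with zero mean and unit variance.
    X = rng.standard_normal((N, d))
    # Eq. (1): y = 1 if the product of the k relevant components is >= 0,
    # and y = 2 otherwise.
    prod = np.prod(X[:, idx], axis=1)
    y = np.where(prod >= 0, 1, 2)
    return X, y, idx

X, y, idx = make_k_local_dataset(N=1000, d=50, k=3)
```

Note that only the k components indexed by `idx` carry any information about the label; the remaining d − k components are pure noise, which is what makes this rule "local".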

