IS DEEPER BETTER? IT DEPENDS ON LOCALITY OF RELEVANT FEATURES

Abstract

It has been recognized that heavily overparameterized artificial neural networks exhibit surprisingly good generalization performance in various machine-learning tasks. Recent theoretical studies have attempted to unveil this mystery of overparameterization. In most of those works, overparameterization is achieved by increasing the width of the network, while the effect of increasing the depth has remained less well understood. In this work, we investigate the effect of increasing the depth within an overparameterized regime. To gain insight into the advantage of depth, we introduce local and global labels as abstract but simple classification rules. It turns out that the locality of the feature relevant to a given classification rule plays a key role: our experimental results suggest that deeper is better for local labels, whereas shallower is better for global labels. We also compare the results of finite networks with those of the neural tangent kernel (NTK), which is equivalent to an infinitely wide network with a proper initialization and an infinitesimal learning rate. It is shown that the NTK does not correctly capture the depth dependence of the generalization performance, which indicates the importance of feature learning rather than lazy learning.

1. INTRODUCTION

Deep learning has achieved unparalleled success in various tasks of artificial intelligence such as image classification (Krizhevsky et al., 2012; LeCun et al., 2015) and speech recognition (Hinton et al., 2012). Remarkably, in modern machine-learning applications, impressive generalization performance has been observed in an overparameterized regime, in which the number of parameters in the network is much larger than the number of training data samples. Contrary to what classical learning theory teaches, an overparameterized network can fit random labels and yet generalizes very well without serious overfitting (Zhang et al., 2017). We do not yet have a general theory that explains why deep learning works so well. Recently, the learning dynamics and the generalization power of heavily overparameterized wide neural networks have been studied extensively. It has been reported that training of an overparameterized network easily achieves zero training error without getting stuck in local minima of the loss landscape (Zhang et al., 2017; Baity-Jesi et al., 2018). Mathematically rigorous results have also been obtained (Allen-Zhu et al., 2019; Du et al., 2019). From a different perspective, the theory of the neural tangent kernel (NTK) has been developed as a tool to investigate an overparameterized network with infinite width (Jacot et al., 2018; Arora et al., 2019), which explains why a sufficiently wide neural network can achieve a global minimum of the training loss. As for generalization, a "double-descent" phenomenon has attracted much attention (Spigler et al., 2019; Belkin et al., 2019). The standard bias-variance tradeoff scenario predicts a U-shaped curve of the test error (Geman et al., 1992); however, one instead finds a double-descent curve, which implies that increasing model capacity beyond the interpolation threshold results in improved performance.
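For reference, the kernel underlying the NTK theory mentioned above can be stated compactly (the notation here is a generic textbook form, not taken from a specific derivation in this paper). For a network output f(x; θ) with parameters θ, the (empirical) NTK is the Gram matrix of parameter gradients:

```latex
\Theta(x, x') \;=\; \nabla_{\theta} f(x;\theta)^{\top}\, \nabla_{\theta} f(x';\theta).
```

In the infinite-width limit with the appropriate parameterization and initialization, Θ remains (approximately) constant throughout gradient-descent training, so the trained network behaves as kernel regression with kernel Θ; this "lazy" regime is what the comparison with finite networks in this work probes.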
This finding triggered detailed studies of the behavior of the bias and variance in the overparameterized regime (Neal et al., 2019; D'Ascoli et al., 2020). The double-descent phenomenon is not explained by traditional complexity measures such as the Vapnik-Chervonenkis dimension and the Rademacher complexity (Mohri et al., 2018), and hence new complexity measures of deep neural networks that can yield better generalization bounds are being sought (Dziugaite & Roy, 2017; Neyshabur et al., 2017; 2019; Arora et al., 2018; Nagarajan & Kolter, 2017; Pérez et al., 2019).
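For completeness, the bias-variance decomposition referenced in the double-descent discussion is the standard one for squared loss (a textbook identity, not a result of this paper): for a predictor f̂_D trained on dataset D and targets y = f*(x) + ε with noise variance σ²,

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(\hat{f}_D(x) - y\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f^{*}(x)\big)^2}_{\text{bias}^2}
\;+\; \underbrace{\operatorname{Var}_D\!\big[\hat{f}_D(x)\big]}_{\text{variance}}
\;+\; \sigma^2 .
```

The classical U-shaped curve arises when bias decreases and variance increases monotonically with capacity; the double-descent studies cited above examine how these two terms actually behave past the interpolation threshold.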

