DO DEEPER CONVOLUTIONAL NETWORKS PERFORM BETTER?

Abstract

Over-parameterization is a topic of much recent interest in the machine learning community. While over-parameterized neural networks are capable of perfectly fitting (interpolating) training data, they often also perform well on test data, thereby contradicting classical learning theory. Recent work provided an explanation for this phenomenon by introducing the double descent curve, showing that increasing model capacity past the interpolation threshold can lead to a decrease in test error. In line with this, it was recently shown empirically and theoretically that increasing neural network capacity through width leads to double descent. In this work, we analyze the effect of increasing depth on test performance. In contrast to what is observed for increasing width, we demonstrate through a variety of classification experiments on CIFAR10 and ImageNet32 using ResNets and fully-convolutional networks that test performance worsens beyond a critical depth. We posit an explanation for this phenomenon by drawing intuition from the principle of minimum-norm solutions in linear networks.

1. INTRODUCTION

Traditional statistical learning theory argues that over-parameterized models will overfit training data and thus generalize poorly to unseen data (Hastie et al., 2001). This is explained through the bias-variance tradeoff: as model complexity increases, so does variance, and thus more complex models should generalize poorly. Modern deep learning models, however, have achieved state-of-the-art test accuracy using an increasing number of parameters (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016). In fact, while over-parameterized neural networks have enough capacity to interpolate randomly labeled training data (Zhang et al., 2017), in practice training often leads to interpolating solutions that generalize well. To reconcile this apparent conflict, Belkin et al. (2019a) proposed the double descent risk curve, in which risk decreases as model complexity increases beyond the interpolation threshold. In neural networks, model complexity has thus far mainly been analyzed by varying network width. Indeed, in line with double descent, Yang et al. (2020), Nakkiran et al. (2020), and Belkin et al. (2019a) demonstrated that increasing width beyond the interpolation threshold while holding depth constant can decrease test loss.

However, model complexity in neural networks can also be increased through depth. In this work, we study the effect of depth on test performance while holding network width constant. In particular, we focus on analyzing the effect of increasing depth in convolutional networks. These networks form the core of state-of-the-art models used for image classification and serve as a prime example of networks with layer constraints. In this paper we answer the following question: what is the role of depth in convolutional networks? In contrast to what has been shown for increasing model complexity through width, we demonstrate that the test performance of convolutional networks worsens when increasing network depth beyond a critical point, suggesting that double descent does not occur through depth.
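The width-driven double descent referenced above can be illustrated in miniature with minimum-norm linear regression, where the pseudoinverse returns the minimum-norm interpolating solution once the number of features exceeds the number of samples. The following is only a hedged sketch, not an experiment from this paper; the dimensions, noise level, and nested-feature construction are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_total = 40, 200, 120      # hypothetical sizes; p = 40 is the interpolation threshold

w_star = rng.normal(size=d_total) / np.sqrt(d_total)
X_train = rng.normal(size=(n_train, d_total))
X_test = rng.normal(size=(n_test, d_total))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n_train)
y_test = X_test @ w_star

def min_norm_fit(p):
    # Least-squares fit on the first p features; for p >= n_train the
    # pseudoinverse returns the minimum-norm interpolating solution.
    A = X_train[:, :p]
    w = np.linalg.pinv(A) @ y_train
    return (np.mean((A @ w - y_train) ** 2),
            np.mean((X_test[:, :p] @ w - y_test) ** 2))

curve = {p: min_norm_fit(p) for p in [10, 20, 40, 60, 120]}
for p, (tr, te) in curve.items():
    print(f"p={p:3d}  train_mse={tr:.4f}  test_mse={te:.4f}")
```

Training error drops to zero at and beyond p = n_train, while test error typically spikes near the threshold and falls again as p grows, which is the qualitative shape of the double descent curve.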
Figure 1 demonstrates the difference between increasing width and depth in ResNets (He et al., 2016) trained on CIFAR10. In particular, Figure 1a shows that increasing width leads to a decrease in test error even when training accuracy is 100%. This effect is captured by the double descent curve. On the other hand, Figure 1b demonstrates that training ResNets of increasing depth but fixed width leads to an increase in test loss.

The main contributions of our work are as follows:

1. We conduct a range of experiments in the classification setting on CIFAR10 and ImageNet32 using ResNets, fully-convolutional networks, and convolutional neural tangent kernels, and consistently demonstrate that test performance worsens beyond a critical depth (Section 3). In particular, in several settings, we observe that the test accuracy of convolutional networks becomes even worse than that of fully connected networks as depth increases.

2. To gain intuition for this phenomenon, we analyze linear neural networks. We demonstrate that increasing depth in linear neural networks with layer constraints (e.g., convolutional networks or Toeplitz networks) leads to a decrease in the Frobenius norm and stable rank of the resulting linear operator. This implies that increasing depth leads to poor generalization when solutions of lower Frobenius norm (e.g., solutions learned by linear fully connected networks) do not generalize (Section 4).

3. Against conventional wisdom, our findings indicate that increasing depth does not always lead to better generalization. Namely, our results provide evidence that the driving force behind the success of deep learning is not the depth of the models, but rather their width.
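The stable-rank behavior described in the second contribution can be previewed with a rough illustration (not this paper's construction): compose random circulant matrices, the operators of circular 1-D convolutions, and track the stable rank ||A||_F^2 / ||A||_2^2 of the product as depth grows. The signal length and filter size below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 64, 3                                 # hypothetical signal length and filter size

def circulant(h, n):
    # Matrix of a circular 1-D convolution with filter h: C[i, j] = h[(i - j) mod n].
    col = np.zeros(n)
    col[: len(h)] = h
    return np.stack([np.roll(col, j) for j in range(n)], axis=1)

def stable_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)  # ||A||_F^2 / ||A||_2^2 (scale-invariant)

results = {}
for depth in [1, 2, 4, 8, 16]:
    A = np.eye(n)
    for _ in range(depth):
        A = circulant(rng.normal(size=k), n) @ A   # compose one random "conv layer"
    results[depth] = stable_rank(A)
    print(f"depth={depth:2d}  stable_rank={results[depth]:6.2f}")
```

Because the singular values of the product spread multiplicatively with depth, the largest one comes to dominate, and the stable rank of the end-to-end operator shrinks toward 1, consistent with the trend analyzed in Section 4.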

2. RELATED WORK

We begin with a discussion of recent works analyzing the role of depth in convolutional networks (CNNs). Yang et al. (2020) study the bias-variance decomposition of deep CNNs and show that as depth increases, bias decreases and variance increases. They observe that the magnitude of bias is generally greater than that of variance, and thus overall risk decreases. However, their analysis of depth does not focus on the interpolating regime; in fact, they posit that it is possible for deeper networks to have increased risk. We extend their experimental methodology for training ResNets and demonstrate that, indeed, deeper networks have increased risk. Neyshabur (2020) studied the role of convolutions, but focused on the benefit of sparsity in weight sharing. Their work analyzed the effect of depth on fully-convolutional networks, but only considered models of two depths. Urban et al. (2017) analyzed the role of depth in student-teacher CNNs, specifically by training shallow CNNs to fit the logits of an ensemble of deep CNNs. This differs from our goal of understanding the effect of depth on CNNs trained from scratch on CIFAR10; furthermore, the ensemble of CNNs they consider has only eight convolutional layers, far fewer than the deep ResNets in our experiments. Xiao et al. (2018) provide initial evidence that the performance of a CNN may degrade with depth; however, it is unclear whether this phenomenon is universal across CNNs used in practice or simply

Figure 1: (a) ResNet-18 and ResNet-34 of increasing width on CIFAR10: as explained by double descent, increasing width results in a decrease in test error. (b) ResNets of width 64 on CIFAR10: in contrast, increasing depth results in an increase in test loss (results are averaged across 3 random seeds).

