WHY ARE CONVOLUTIONAL NETS MORE SAMPLE-EFFICIENT THAN FULLY-CONNECTED NETS?

Abstract

Convolutional neural networks often dominate their fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of "better inductive bias." However, this claim has not been made mathematically rigorous, and the hurdle is that a sufficiently wide fully-connected net can always simulate the convolutional net. Thus the training algorithm plays a role. The current work describes a natural task on which a provable sample complexity gap can be shown, for standard training algorithms. We construct a single natural distribution on R^d × {±1} on which any orthogonal-invariant algorithm (i.e., fully-connected networks trained with most gradient-based methods from Gaussian initialization) requires Ω(d^2) samples to generalize, while O(1) samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function such that learning it on all possible distributions leads to an O(1) vs. Ω(d^2/ε) gap. The proof relies on the fact that SGD on fully-connected networks is orthogonal equivariant. Similar results are obtained for ℓ_2 regression and adaptive training algorithms, e.g. Adam and AdaGrad, which are only permutation equivariant.

1. INTRODUCTION

Deep convolutional nets ("ConvNets") are at the center of the deep learning revolution (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017). For many tasks, especially in vision, convolutional architectures perform significantly better than their fully-connected ("FC") counterparts, at least given the same amount of training data. Practitioners explain this phenomenon at an intuitive level by pointing out that convolutional architectures have better "inductive bias", which means the following: (i) ConvNets are a better match to the underlying structure of image data, and thus are able to achieve low training loss with far fewer parameters; (ii) models with fewer parameters generalize better.

Surprisingly, the above intuition about the better inductive bias of ConvNets over FC nets has never been made mathematically rigorous. The natural way to make it rigorous would be to exhibit explicit learning tasks that require far more training samples on FC nets than on ConvNets. (Here "task" means, as usual in learning theory, a distribution on data points together with binary labels generated by a fixed labeling function.) Surprisingly, the standard repertoire of lower bound techniques in ML theory does not seem capable of demonstrating such a separation. The reason is that any ConvNet can be simulated by an FC net of sufficient width, since a training algorithm can just zero out unneeded connections and do weight sharing as needed. Thus the key issue is not expressiveness per se, but the combination of architecture and training algorithm. But if the training algorithm must be accounted for, the usual hurdle arises that we lack good mathematical understanding of the dynamics of deep net training (whether FC or ConvNet). How then can one establish the limitations of "FC nets + current training algorithms"? (Indeed, many lower bound techniques in PAC learning theory are information-theoretic and ignore the training algorithm.)
The current paper makes significant progress on the above problem by exhibiting simple tasks that require an Ω(d^2) factor more training samples for FC nets than for ConvNets, where d is the data dimension. (In fact this is shown even for 1-dimensional ConvNets; the lower bound easily extends to 2-D ConvNets.) The lower bound holds for FC nets trained with any of the popular algorithms listed in Table 1. (The reader can concretely think of vanilla SGD with Gaussian initialization of network weights, though the proof allows momentum, ℓ_2 regularization, and various learning rate schedules.) Our proof relies on the fact that these popular algorithms lead to an orthogonal-equivariance property of the trained FC nets: at the end of training, the FC net, no matter how deep or how wide, will make the same predictions even if we apply an orthogonal transformation to all datapoints (i.e., both training and test).
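The orthogonal-equivariance property can be made concrete with a minimal sketch. Here ordinary least squares stands in for the trained FC net (it is itself an orthogonal-equivariant learner); the dimensions and seed are illustrative choices, not from the paper. Rotating all training and test inputs by the same orthogonal matrix leaves the predictions unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 50

# Training data and one test point.
X_train = rng.standard_normal((n, d))
y_train = rng.standard_normal(n)
x_test = rng.standard_normal((1, d))

# A random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Fit least squares on the original data and on the rotated data.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
w_rot, *_ = np.linalg.lstsq(X_train @ Q, y_train, rcond=None)

# Predictions agree once the test point is rotated the same way:
# the learned weights absorb the rotation (w_rot = Q^T w).
pred = x_test @ w
pred_rot = (x_test @ Q) @ w_rot
print(np.allclose(pred, pred_rot))  # True
```

The same invariance of predictions under a simultaneous rotation of train and test data is what the paper establishes for FC nets trained with the algorithms of Table 1.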
This notion is inspired by Ng (2004) (where it is named "orthogonal invariant"), which showed the power of logistic regression with ℓ_1 regularization versus other learners. For a variety of learners (including kernels and FC nets), that paper described explicit tasks where the learner has Ω(d) higher sample complexity than logistic regression with ℓ_1 regularization. The lower bound example and technique can also be extended to show a (weak) separation between FC nets and ConvNets (see Section 4.2). Our separation is quantitatively stronger than the result one gets using Ng (2004), because the sample complexity gap is Ω(d^2) vs. O(1), not Ω(d) vs. O(1). But in a more subtle way our result is conceptually far stronger: the technique of Ng (2004) seems incapable of exhibiting a sample gap of more than O(1) between ConvNets and FC nets in our framework. The reason is that the technique of Ng (2004) can exhibit a hard task for FC nets only after fixing the training algorithm. But there are infinitely many training algorithms once we account for the hyperparameters associated in various epochs with learning rate schedules, ℓ_2 regularization, and momentum. Thus Ng (2004)'s technique cannot exclude the possibility that the hard task for "FC net + Algorithm 1" is easy for "FC net + Algorithm 2". Note that we do not claim any issues with the results in Ng (2004); merely that the technique cannot lead to a proper separation between ConvNets and FC nets when the FC nets are allowed to be trained with any of the infinitely many training algorithms. (Section 4.2 spells out in more detail the technical difference between our technique and Ng's idea.)

The reader may now be wondering: what is a single task that is easy for ConvNets but hard for FC nets trained with any standard algorithm? A simple example is the following: the data distribution on R^d is standard Gaussian, and the target labeling function is the sign of ∑_{i=1}^{d/2} x_i^2 − ∑_{i=d/2+1}^{d} x_i^2.
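This task is straightforward to sample. A minimal sketch (the function name and the sizes below are our own choices, for illustration):

```python
import numpy as np

def sample_task(n, d, rng):
    """Draw n points from N(0, I_d), labeled by the sign of the
    sum of the first d/2 squared coordinates minus the last d/2."""
    X = rng.standard_normal((n, d))
    y = np.sign((X[:, : d // 2] ** 2).sum(axis=1)
                - (X[:, d // 2 :] ** 2).sum(axis=1))
    return X, y

rng = np.random.default_rng(0)
X, y = sample_task(1000, 16, rng)
print(np.unique(y))  # [-1.  1.]
```

Note that the label compares two chi-squared random variables with d/2 degrees of freedom each, so by symmetry the two classes are balanced; and since the labeling function depends on the coordinate axes, it is not preserved under a generic rotation of the data, which is exactly what defeats orthogonal-equivariant learners.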
Figure 1 shows that this task is indeed much more difficult for FC nets. Furthermore, the task is also hard in practice for data distributions other than Gaussian; the figure shows that a sizeable performance gap exists even on CIFAR images with such a target label.

Extension to a broader class of algorithms. The orthogonal-equivariance property holds for many types of practical training algorithms, but not all. Notable exceptions are adaptive gradient methods (e.g. Adam and AdaGrad), ℓ_1 regularization, and initialization schemes that are not spherically symmetric. To prove a lower bound against FC nets trained with these algorithms, we identify a weaker property, permutation-equivariance, which is satisfied by nets trained using such algorithms. We then demonstrate a single



Figure 1: Comparison of generalization performance of convolutional versus fully-connected models trained by SGD. The grey dotted lines indicate separation, and we can see convolutional networks consistently outperform fully-connected networks. Here the input data are 3 × 32 × 32 RGB images and the binary label indicates for each image whether the first channel has larger ℓ_2 norm than the second one. The input images are drawn from entry-wise independent Gaussian (left) and CIFAR-10 (right). In both cases, the 3-layer convolutional networks consist of two 3 × 3 convolutions with 10 hidden channels, and a 3 × 3 convolution with a single output channel followed by global average pooling. The 3-layer fully-connected networks consist of two fully-connected layers with 10000 hidden channels and another fully-connected layer with a single output. The 2-layer versions have one less intermediate layer and only 3072 hidden channels per layer. The hybrid networks consist of a single fully-connected layer with 3072 channels followed by two convolutional layers with 10 channels each. "bn" stands for batch-normalization (Ioffe & Szegedy, 2015).
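The parameter-count gap behind intuition (i) is easy to verify from the Figure 1 architectures. A back-of-the-envelope sketch (assuming standard 3 × 3 convolutions and linear layers with bias terms; the helper names are ours):

```python
# Rough parameter counts for the 3-layer architectures in Figure 1.

def conv_params(c_in, c_out, k=3):
    # weights (c_in * c_out * k * k) plus one bias per output channel
    return c_in * c_out * k * k + c_out

def fc_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

# 3-layer ConvNet: 3 -> 10 -> 10 -> 1 channels, then global average pooling.
conv_total = conv_params(3, 10) + conv_params(10, 10) + conv_params(10, 1)

# 3-layer FC net: 3072 -> 10000 -> 10000 -> 1.
fc_total = fc_params(3072, 10000) + fc_params(10000, 10000) + fc_params(10000, 1)

print(conv_total)  # 1281
print(fc_total)    # 130750001
```

Roughly 1.3 × 10^3 parameters versus 1.3 × 10^8: a five-orders-of-magnitude difference for networks of the same depth on the same input.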

