COMPUTATIONAL SEPARATION BETWEEN CONVOLUTIONAL AND FULLY-CONNECTED NETWORKS

Abstract

Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks. Specifically, we show a class of problems that can be efficiently solved using convolutional networks trained with gradient-descent, but is at the same time hard to learn using a polynomial-size fully-connected network.

1. INTRODUCTION

Convolutional neural networks (LeCun et al., 1998; Krizhevsky et al., 2012) achieve state-of-the-art performance on a wide range of computer vision tasks. However, while the empirical success of convolutional networks is indisputable, the advantage of using them is not well understood from a theoretical perspective. Specifically, we consider the following fundamental question: why do convolutional networks (CNNs) perform better than fully-connected networks (FCNs)?

Clearly, when considering expressive power, FCNs have a big advantage. Since convolution is a linear operation, any CNN can be expressed using a FCN, whereas FCNs can express a strictly larger family of functions. Hence, any advantage of CNNs due to expressivity can be leveraged by FCNs as well, and expressive power cannot explain the superiority of CNNs over FCNs.

There are several other possible explanations for the superiority of CNNs over FCNs: parameter efficiency (and hence lower sample complexity), weight sharing, and a locality prior. The main result of this paper is that locality is a key factor: we prove a computational separation between CNNs and FCNs based on locality. Before that, let us discuss the other possible explanations.

First, we observe that CNNs seem to be much more efficient in utilizing their parameters. A FCN needs many more parameters than an equivalent CNN: each neuron of a CNN is limited to a small receptive field, and moreover, many of the parameters of the CNN are shared. From classical results in learning theory, using a large number of parameters may result in inferior generalization. So, can the advantage of CNNs be explained simply by counting parameters? To answer this question, we observe the performance of CNN- and FCN-based architectures of various widths and depths trained on the CIFAR-10 dataset. For each architecture, we record the final test accuracy versus the number of trainable parameters. The results are shown in Figure 1.
As can be seen, CNNs have a clear advantage over FCNs, regardless of the number of parameters used. As is often observed, a large number of parameters does not hurt the performance of neural networks, and so parameter efficiency cannot explain the advantage of CNNs. This is in line with various theoretical works on the optimization of neural networks, which show that over-parameterization is beneficial for the convergence of gradient-descent (e.g., 2020).

It has also been recently observed that in some cases, the performance of locally-connected networks (LCNs), which have the local connectivity of CNNs but no weight sharing, is on par with CNNs (Neyshabur, 2020). So, even if weight sharing explains some of the advantage of CNNs, it clearly does not tell the whole story.

Finally, a key property of CNN architectures is their strong utilization of locality in the data. Each neuron in a CNN is limited to a local receptive field of the input, hence encoding a strong locality bias. In this work we demonstrate how CNNs can leverage the local structure of the input, giving them a clear advantage in terms of computational complexity. Our results hint that locality is the principal property that explains the advantage of using CNNs.

Our main result is a computational separation between CNNs and FCNs. To show this result, we introduce a family of functions with a very strong local structure, which we call k-patterns. A k-pattern is a function that is determined by k consecutive bits of the input. We show that for inputs of n bits, when the target function is a (log n)-pattern, training a CNN of polynomial size with gradient-descent achieves small error in polynomial time. However, gradient-descent will fail to learn (log n)-patterns when training a FCN of polynomial size.
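To make the definition concrete, the following is a minimal sketch of a k-pattern target function; the window index `j` and the lookup table `table` are illustrative names of ours, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 16, 3                          # input length; pattern width (k = O(log n) in our results)
j = int(rng.integers(0, n - k + 1))   # start of the k-bit window that determines the label

# A k-pattern depends only on k consecutive bits: pick an arbitrary
# Boolean function on those bits, represented as a lookup table.
table = rng.choice([-1, 1], size=2 ** k)

def k_pattern(x):
    window = x[j:j + k]  # the only bits that matter
    idx = int("".join("1" if b == 1 else "0" for b in window), 2)
    return int(table[idx])

x = rng.choice([-1, 1], size=n)  # input in {-1, +1}^n
y = k_pattern(x)                 # label in {-1, +1}
```

Flipping any bit outside the k-bit window leaves the label unchanged, which is exactly the locality that a CNN's receptive fields can exploit.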



Figure 1: Comparison between CNNs and FCNs of various depths (2/4/6) and widths, trained for 125 epochs with the RMSprop optimizer.
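As a rough illustration of the parameter-efficiency gap discussed above, one can count the trainable parameters of a single convolutional layer against a dense layer producing an output of the same size. The dimensions below are illustrative, not the exact architectures behind Figure 1.

```python
# Parameter count of a conv layer vs. a fully-connected layer with the
# same output size, for a CIFAR-10-sized input (illustrative dimensions).

def conv_params(in_ch, out_ch, kernel):
    # Each filter sees only a kernel x kernel receptive field, and the
    # same filter is shared across all spatial positions (+1 for bias).
    return out_ch * (in_ch * kernel * kernel + 1)

def fc_params(in_dim, out_dim):
    # A dense layer connects every input unit to every output unit.
    return out_dim * (in_dim + 1)

h = w = 32                              # CIFAR-10 spatial size
conv = conv_params(3, 64, 3)            # 3x3 conv, 3 -> 64 channels
fc = fc_params(3 * h * w, 64 * h * w)   # dense layer with the same output size

print(conv, fc)  # the conv layer uses orders of magnitude fewer parameters
```

Yet, as Figure 1 shows, this parameter gap alone does not account for the accuracy gap: FCNs do not catch up even when given many more parameters.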

RELATED WORK

It has been empirically observed that CNN architectures perform much better than FCNs on computer vision tasks, such as digit recognition and image classification (e.g., Urban et al. (2017); Driss et al. (2017)). While some works have applied various techniques to improve the performance of FCNs (Lin et al. (2015); Fernando et al. (2016); Neyshabur (2020)), there is still a gap between the performance of CNNs and FCNs, where the former give very good performance "out-of-the-box". The focus of this work is to understand, from a theoretical perspective, why CNNs give superior performance when trained on inputs with strong local structure.

Various theoretical works show the advantage of architectures that leverage local and hierarchical structure. The work of Poggio et al. (2015) shows the advantage of using deep hierarchical models over wide and shallow functions. These results are extended in Poggio et al. (2017), showing an exponential gap between deep and shallow networks when approximating locally compositional functions. The works of Mossel (2016); Malach & Shalev-Shwartz (2018) study the learnability of deep hierarchical models. The work of Cohen et al. (2017) analyzes the expressive efficiency of convolutional networks via hierarchical tensor decomposition. While all these works show that CNNs are indeed powerful due to their hierarchical nature and their efficient utilization of local structure, they do not explain why these models are superior to fully-connected models.

There are a few works that provide a theoretical analysis of CNN optimization. The works of Brutzkus & Globerson (2017); Du et al. (2018) show that gradient-descent can learn a shallow CNN with a single filter, under various distributional assumptions. The work of Zhang et al. (2017)

The superiority of CNNs can also be attributed to the extensive weight sharing between the different convolutional filters. Indeed, it has been previously shown that weight sharing is important for the optimization of neural networks (Shalev-Shwartz et al., 2017b). Moreover, the translation-invariant nature of CNNs, which relies on weight sharing, is often observed to be beneficial in various signal processing tasks (Kauderer-Abrams, 2017; Kayhan & Gemert, 2020). So, how much does weight sharing contribute to the superiority of CNNs over FCNs? To understand the effect of weight sharing on the behavior of CNNs, it is useful to study locally-connected network (LCN) architectures, which are similar to CNNs, but have no weight sharing between the kernels of the network. While CNNs are far more popular in practice (also due to the fact that they are much more efficient in terms of model size), LCNs have also been used in different contexts (e.g., Bruna et al. (2013); Chen et al. (2015); Liu et al. (
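The distinction between the two architectures can be seen in a one-dimensional sketch (sizes and variable names below are ours): both layers are local, and only the first ties its weights across positions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3              # input length and receptive-field size
positions = n - k + 1    # number of spatial positions

x = rng.standard_normal(n)

# CNN layer: one shared filter applied at every position (weight sharing).
w_shared = rng.standard_normal(k)
cnn_out = np.array([w_shared @ x[i:i + k] for i in range(positions)])

# LCN layer: the same local connectivity, but a separate filter per position.
w_local = rng.standard_normal((positions, k))
lcn_out = np.array([w_local[i] @ x[i:i + k] for i in range(positions)])

# Both are "local": output i depends only on x[i:i+k].
# Only the CNN ties the weights across positions.
```

Comparing CNNs against LCNs thus isolates the effect of weight sharing from the effect of locality, which is why the LCN results of Neyshabur (2020) are informative here.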

