COMPUTATIONAL SEPARATION BETWEEN CONVOLUTIONAL AND FULLY-CONNECTED NETWORKS

Abstract

Convolutional neural networks (CNNs) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not well understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks. Specifically, we exhibit a class of problems that can be efficiently solved by convolutional networks trained with gradient descent, but that at the same time is hard to learn using any polynomial-size fully-connected network.

1. INTRODUCTION

Convolutional neural networks (LeCun et al., 1998; Krizhevsky et al., 2012) achieve state-of-the-art performance on a wide range of computer vision tasks. However, while the empirical success of convolutional networks is indisputable, the advantage of using them is not well understood from a theoretical perspective. Specifically, we consider the following fundamental question: why do convolutional networks (CNNs) perform better than fully-connected networks (FCNs)?

Clearly, when considering expressive power, FCNs have the advantage: since convolution is a linear operation, any CNN can be expressed as an FCN, whereas FCNs can express a strictly larger family of functions. Thus, any advantage of CNNs due to expressivity can be leveraged by FCNs as well, and expressive power alone does not explain the superiority of CNNs over FCNs.

Several other explanations for the superiority of CNNs over FCNs come to mind: parameter efficiency (and hence lower sample complexity), weight sharing, and the locality prior. The main result of this paper is that locality is a key factor: we prove a computational separation between CNNs and FCNs based on locality. Before that, let us discuss the other possible explanations.

First, we observe that CNNs seem to be much more efficient in utilizing their parameters. An FCN needs a greater number of parameters than an equivalent CNN: each neuron of a CNN is limited to a small receptive field, and moreover, many of the parameters of the CNN are shared. Classical results in learning theory suggest that using a large number of parameters may result in inferior generalization. So, can the advantage of CNNs be explained simply by counting parameters?
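
To make the expressivity and parameter-counting arguments concrete, the following is a minimal sketch (an illustration we add here, not an experiment from the paper) comparing a single convolutional layer with the fully-connected layer that can express the same linear map; the specific dimensions are arbitrary assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
in_channels, out_channels, kernel = 3, 16, 3
height = width = 32

# A single convolutional layer: each output neuron sees only a 3x3
# receptive field, and the same kernel weights are shared across all
# spatial positions.
conv = nn.Conv2d(in_channels, out_channels, kernel, padding=1, bias=False)

# A fully-connected layer with the same input/output dimensions:
# every output coordinate may depend on every input coordinate.
fc = nn.Linear(in_channels * height * width,
               out_channels * height * width, bias=False)

conv_params = sum(p.numel() for p in conv.parameters())  # 16*3*3*3 = 432
fc_params = sum(p.numel() for p in fc.parameters())      # 3072*16384 ~ 5e7
print(conv_params, fc_params)

# Since convolution is linear, the conv layer's action can be written as a
# (sparse, weight-shared) matrix of exactly this shape, so the FC layer can
# express it -- but only by spending vastly more free parameters.
x = torch.randn(1, in_channels, height, width)
assert conv(x).flatten().shape[0] == out_channels * height * width
```

The sketch illustrates both points at once: the fully-connected layer strictly contains the convolutional one in terms of expressible functions, yet realizing the same map costs it orders of magnitude more parameters.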



Figure 1: Comparison between CNNs and FCNs of various depths (2/4/6) and widths, trained for 125 epochs with the RMSprop optimizer.
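
For concreteness, the following is a minimal sketch of how such CNN/FCN pairs of a given depth and width could be instantiated and paired with an RMSprop optimizer. The input size, number of classes, widths, and learning rate are assumptions made for illustration and are not taken from the figure.

```python
import torch.nn as nn
import torch.optim as optim

def make_fcn(depth, width, in_dim, n_classes):
    """Fully-connected baseline: `depth` hidden layers of size `width`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(nn.Flatten(), *layers)

def make_cnn(depth, width, in_channels, n_classes):
    """Convolutional counterpart: `depth` conv layers with `width` channels."""
    layers, c = [], in_channels
    for _ in range(depth):
        layers += [nn.Conv2d(c, width, kernel_size=3, padding=1), nn.ReLU()]
        c = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, n_classes)]
    return nn.Sequential(*layers)

# Hypothetical instantiation: 32x32 RGB inputs, 10 classes, depth 4.
fcn = make_fcn(depth=4, width=512, in_dim=3 * 32 * 32, n_classes=10)
cnn = make_cnn(depth=4, width=64, in_channels=3, n_classes=10)

# Each model would then be trained for 125 epochs with RMSprop, e.g.:
opt_fcn = optim.RMSprop(fcn.parameters(), lr=1e-4)
opt_cnn = optim.RMSprop(cnn.parameters(), lr=1e-4)
```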

