WHAT CAN BE LEARNT WITH WIDE CONVOLUTIONAL NEURAL NETWORKS?

Abstract

Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) are particularly successful in certain tasks such as image classification. Such tasks generally entail the approximation of functions of a large number of variables, for instance the number of pixels that determine the content of an image. Learning a generic high-dimensional function is plagued by the curse of dimensionality: the rate at which the generalisation error ϵ decays with the number of training samples n vanishes as the dimensionality d of the input space grows, i.e. ϵ(n) ∼ n^{−β} with β = O(1/d) (Wainwright, 2019). Therefore, the success of CNNs in classifying data whose dimension can be in the hundreds or more (Hestness et al., 2017; Spigler et al., 2020) points to the existence of some underlying structure in the task that CNNs can leverage. Understanding the structure of learnable tasks is arguably one of the most fundamental problems in deep learning, and also one of central practical importance, as it determines how many examples are required to learn up to a certain error. A popular hypothesis is that learnable tasks are local and hierarchical: features at any scale are made of sub-features of smaller scales. Although many works have investigated this hypothesis (Biederman, 1987; Poggio et al., 2017; Kondor & Trivedi, 2018; Zhou et al., 2018; Deza et al., 2020; Kohler et al., 2020; Poggio et al., 2020; Schmidt-Hieber, 2020; Finocchio & Schmidt-Hieber, 2021; Giordano et al., 2022), there are no available predictions for the exponent β for deep CNNs trained on tasks with a varying degree of locality or a truly hierarchical structure. In this paper we perform such a computation in the overparameterised regime, where the width of the hidden layers of the neural networks diverges and the network output is rescaled so as to converge to that of a kernel method (Jacot et al., 2018; Lee et al., 2019).
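To make the curse of dimensionality concrete, the following sketch (purely illustrative, not taken from the paper; the choice β = 1/d is an assumption standing in for the generic β = O(1/d) rate) inverts the learning curve ϵ(n) = n^{−β} to count the samples needed to reach a fixed target error:

```python
def samples_needed(eps_star: float, d: int) -> float:
    """Samples required so that eps(n) = n**(-beta) falls below eps_star,
    under the illustrative assumption beta = 1/d (a generic O(1/d) rate)."""
    beta = 1.0 / d
    # Solving n**(-beta) = eps_star for n gives n = eps_star**(-1/beta) = eps_star**(-d).
    return eps_star ** (-1.0 / beta)

# The required sample size grows exponentially with the input dimension d:
for d in (2, 8, 32):
    print(f"d = {d:2d}: n ≈ {samples_needed(0.1, d):.3g}")
```

For a target error of 0.1, the count jumps from 10² samples at d = 2 to 10⁸ at d = 8, which is why an unstructured high-dimensional task is effectively unlearnable and some structure (such as locality or hierarchy) must be present for CNNs to succeed.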
Although the deep networks deployed in real scenarios do not generally operate in such a regime, the connection with the theory of kernel regression provides a recipe for computing the decay of the generalisation error with the number of training examples. Namely, given an infinitely wide neural network, its generalisation abilities depend on the spectrum of the corresponding kernel (Caponnetto & De Vito, 2007; Bordelon et al.,

