WHAT CAN BE LEARNT WITH WIDE CONVOLUTIONAL NEURAL NETWORKS?

Abstract

Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the exponent governing the rate of decay of the error is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) are particularly successful in certain tasks such as image classification. Such tasks generally entail the approximation of functions of a large number of variables, for instance the number of pixels which determine the content of an image. Learning a generic high-dimensional function is plagued by the curse of dimensionality: the rate at which the generalisation error ϵ decays with the number of training samples n vanishes as the dimensionality d of the input space grows, i.e. ϵ(n) ∼ n^(-β) with β = O(1/d) (Wainwright, 2019). Therefore, the success of CNNs in classifying data whose dimension can be in the hundreds or more (Hestness et al., 2017; Spigler et al., 2020) points to the existence of some underlying structure in the task that CNNs can leverage. Understanding the structure of learnable tasks is arguably one of the most fundamental problems in deep learning, and also one of central practical importance, as it determines how many examples are required to learn up to a given error. A popular hypothesis is that learnable tasks are local and hierarchical: features at any scale are made of sub-features of smaller scales. Although many works have investigated this hypothesis (Biederman, 1987; Poggio et al., 2017; Kondor & Trivedi, 2018; Zhou et al., 2018; Deza et al., 2020; Kohler et al., 2020; Poggio et al., 2020; Schmidt-Hieber, 2020; Finocchio & Schmidt-Hieber, 2021; Giordano et al., 2022), there are no available predictions for the exponent β for deep CNNs trained on tasks with a varying degree of locality or a truly hierarchical structure. In this paper we perform such a computation in the overparameterised regime, where the width of the hidden layers of the neural network diverges and the network output is rescaled so as to converge to that of a kernel method (Jacot et al., 2018; Lee et al., 2019).
Although the deep networks deployed in real scenarios do not generally operate in such a regime, the connection with the theory of kernel regression provides a recipe for computing the decay of the generalisation error with the number of training examples. Namely, given an infinitely wide neural network, its generalisation abilities depend on the spectrum of the corresponding kernel (Caponnetto & De Vito, 2007; Bordelon et al., 2020): the main challenge is then to characterise this spectrum, especially for deep CNNs, whose kernels are rather cumbersome and defined recursively (Arora et al., 2019). This characterisation is the main result of our paper, together with the ensuing study of generalisation in deep CNNs.

1.1. OUR CONTRIBUTIONS

More specifically, this paper studies the generalisation properties of deep CNNs with nonoverlapping patches and no pooling (defined in Sec. 2, see Fig. 1 for an illustration), trained on a target function f* by empirical minimisation of the mean squared loss. We consider the infinite-width limit (Sec. 3), where the model parameters change infinitesimally over training; thus the trained network coincides with the predictor of kernel regression with the Neural Tangent Kernel (NTK) of the network. Due to the equivalence with kernel methods, generalisation is fully characterised by the spectrum of the integral operator of the kernel: in simple terms, the projections on the eigenfunctions with larger eigenvalues can be learnt (up to a fixed generalisation error) with fewer training points (see, e.g., Bach (2021)).

Spectrum of deep hierarchical kernels (Thm. 3.1). Due to the network architecture, the hidden neurons of each layer depend only on a subset of the input variables, known as the receptive field of that neuron (highlighted by coloured boxes in Fig. 1, left panel). We find that the eigenfunctions of the NTK of a hierarchical CNN of depth L + 1 can be organised into sectors l = 1, ..., L associated with the hidden layers of the network (Thm. 3.1). The eigenfunctions of each sector depend only on the receptive fields of the neurons of the corresponding hidden layer: if we denote by d_eff(l) the size of the receptive fields of neurons in the l-th hidden layer, then the eigenfunctions of the l-th sector are effectively functions of d_eff(l) variables. We characterise the asymptotic behaviour of the NTK eigenvalues with the degree of the corresponding eigenfunctions (Thm. 3.1) and find that it is controlled by d_eff(l). As a consequence, the eigenfunctions with the largest eigenvalues, i.e. the easiest to learn, are those which depend on small subsets of the input variables and have low polynomial degree. This is our main technical contribution, and all of our conclusions follow from it.

Adaptivity to the spatial structure of the target (Cor. 4.1). We use the above result to prove that deep CNNs can adapt to the spatial scale of the target function (Sec. 4). More specifically, by using rigorous bounds from the theory of kernel ridge regression (Caponnetto & De Vito, 2007) (reviewed in the first paragraph of Sec. 4), we show that, when learning with the kernel of a CNN and optimal regularisation, the decay of the error depends on the effective dimensionality of the target f*: if f* only depends on d_eff adjacent coordinates of the d-dimensional input, then ϵ ∼ n^(-β) with β ≥ O(1/d_eff) (Cor. 4.1, see Fig. 1 for a pictorial representation). We find a similar picture in ridgeless regression by using non-rigorous results derived with the replica method (Bordelon et al., 2020; Loureiro et al., 2021) (Sec. 5). Notice that, for targets which are spatially localised (or sums of spatially localised functions), the rates achieved with deep CNNs are much closer to the Bayes-optimal rates (realised when the architecture is fine-tuned to the structure of the target) than the β = O(1/d) obtained with the kernel of a fully-connected network.
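The base case of such hierarchical kernels can be made concrete. The sketch below builds the kernel of the simplest member of the family, a single-hidden-layer ReLU CNN with nonoverlapping, patch-normalised inputs and a linear readout, out of the arc-cosine function κ1, and checks it against a Monte Carlo estimate with a wide random network; the deeper kernels studied in the paper are obtained by recursively composing such functions. Width, patch size and normalisation here are illustrative assumptions.

```python
import numpy as np

def kappa1(t):
    # arc-cosine function of degree 1: E[relu(w.x) relu(w.z)] = kappa1(x.z)/2
    # for w ~ N(0, I) and unit-norm x, z
    t = np.clip(t, -1.0, 1.0)
    return (np.sqrt(1 - t**2) + t * (np.pi - np.arccos(t))) / np.pi

def cnn_kernel(x, z, patch=2):
    # kernel of a one-hidden-layer CNN with nonoverlapping patches:
    # average of per-patch arc-cosine kernels (patch-normalised inputs)
    xs, zs = x.reshape(-1, patch), z.reshape(-1, patch)
    return np.mean([kappa1(xi @ zi) / 2 for xi, zi in zip(xs, zs)])

rng = np.random.default_rng(0)
d, patch, width = 4, 2, 200_000

def norm_patches(v):
    vp = v.reshape(-1, patch)
    return (vp / np.linalg.norm(vp, axis=1, keepdims=True)).ravel()

x = norm_patches(rng.standard_normal(d))
z = norm_patches(rng.standard_normal(d))

# Monte Carlo with a wide random CNN: each channel w acts on every patch
W = rng.standard_normal((width, patch))
hx = np.maximum(W @ x.reshape(-1, patch).T, 0)   # width x n_patches
hz = np.maximum(W @ z.reshape(-1, patch).T, 0)
mc = np.mean(hx * hz, axis=0).mean()             # average over channels, patches
print(mc, cnn_kernel(x, z))  # the two estimates should agree to ~1e-2
```

At this width the empirical covariance of the random network matches the analytic kernel closely, which is the sense in which the infinite-width kernels above are exact.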
Moreover, we find that the hierarchical functions generated by the outputs of deep CNNs with randomly-initialised parameters are too rich to be efficiently learnable in high dimension (Lemma 5.1). We confirm these results through extensive numerical studies and find that they hold even when the nonoverlapping-patches assumption is relaxed (Subsec. G.4).
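The adaptivity phenomenon can be illustrated with a toy experiment (a sketch under illustrative assumptions, not the paper's setting): kernel regression with a shallow convolutional kernel, whose reproducing space consists of sums of single-patch functions, learns a target depending on a single patch with far fewer samples than a fully-connected kernel on the same data. Dimensions, target and ridge below are arbitrary choices.

```python
import numpy as np

def kappa1(t):
    # arc-cosine function of degree 1 (one-layer ReLU kernel on the sphere)
    t = np.clip(t, -1.0, 1.0)
    return (np.sqrt(1 - t**2) + t * (np.pi - np.arccos(t))) / np.pi

def norm_patches(X, patch=2):
    Xp = X.reshape(len(X), -1, patch)
    return Xp / np.linalg.norm(Xp, axis=2, keepdims=True)

def conv_kernel(X, Z, patch=2):
    # convolutional kernel: average of per-patch arc-cosine kernels
    T = np.einsum('npi,mpi->nmp', norm_patches(X, patch), norm_patches(Z, patch))
    return kappa1(T).mean(axis=2)

def fc_kernel(X, Z):
    # fully-connected counterpart: arc-cosine kernel on the whole normalised input
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return kappa1(Xn @ Zn.T)

def target(X, patch=2):
    # depends only on the first patch: with (u, v) on the circle this is
    # u v (u^2 - v^2) = (1/4) sin(4*theta), a degree-4 harmonic, d_eff = 2
    u, v = norm_patches(X, patch)[:, 0, 0], norm_patches(X, patch)[:, 0, 1]
    return u * v * (u**2 - v**2)

def krr_error(kernel, n=500, d=16, ridge=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((500, d))
    alpha = np.linalg.solve(kernel(Xtr, Xtr) + ridge * np.eye(n), target(Xtr))
    return np.mean((kernel(Xte, Xtr) @ alpha - target(Xte))**2)

print("conv:", krr_error(conv_kernel), "fc:", krr_error(fc_kernel))
```

With d = 16 and n = 500, the convolutional kernel is expected to reach a far smaller test error than the fully-connected one, whose rate is governed by the full dimension d rather than by d_eff = 2.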

1.2. RELATED WORK

The benefits of shallow CNNs in the kernel regime have been investigated by Bietti (2022). Favero et al. (2021), and later Misiakiewicz & Mei (2021) and Xiao & Pennington (2022), studied generalisation properties of shallow CNNs, finding that they are able to beat the curse of dimensionality on local target functions. However, these architectures can only approximate functions of single input patches or linear combinations thereof. Bietti (2022), in addition, includes generic pooling layers and begins considering the role of depth by studying the approximation properties of kernels which are integer powers of other kernels. We generalise this line of work by studying CNNs of any depth with nonanalytic (ReLU) activations: we find that the depth and nonanalyticity of the resulting kernel are crucial for understanding the inductive bias of deep CNNs. This result should also be contrasted with the spectra of the kernels of deep fully-connected networks, whose asymptotics do not depend on depth.


