WHAT CAN BE LEARNT WITH WIDE CONVOLUTIONAL NEURAL NETWORKS?

Abstract

Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) are particularly successful in certain tasks such as image classification. Such tasks generally entail the approximation of functions of a large number of variables, for instance the number of pixels which determine the content of an image. Learning a generic high-dimensional function is plagued by the curse of dimensionality: the rate at which the generalisation error ϵ decays with the number of training samples n vanishes as the dimensionality d of the input space grows, i.e. ϵ(n) ∼ n^{-β} with β = O(1/d) (Wainwright, 2019). Therefore, the success of CNNs in classifying data whose dimension can be in the hundreds or more (Hestness et al., 2017; Spigler et al., 2020) points to the existence of some underlying structure in the task that CNNs can leverage. Understanding the structure of learnable tasks is arguably one of the most fundamental problems in deep learning, and also one of central practical importance, as it determines how many examples are required to learn up to a certain error. A popular hypothesis is that learnable tasks are local and hierarchical: features at any scale are made of sub-features of smaller scales. Although many works have investigated this hypothesis (Biederman, 1987; Poggio et al., 2017; Kondor & Trivedi, 2018; Zhou et al., 2018; Deza et al., 2020; Kohler et al., 2020; Poggio et al., 2020; Schmidt-Hieber, 2020; Finocchio & Schmidt-Hieber, 2021; Giordano et al., 2022), there are no available predictions for the exponent β for deep CNNs trained on tasks with a varying degree of locality or a truly hierarchical structure.

In this paper we perform such a computation in the overparameterised regime, where the width of the hidden layers of the neural network diverges and the network output is rescaled so as to converge to that of a kernel method (Jacot et al., 2018; Lee et al., 2019). Although the deep networks deployed in real scenarios do not generally operate in such a regime, the connection with the theory of kernel regression provides a recipe for computing the decay of the generalisation error with the number of training examples. Namely, given an infinitely wide neural network, its generalisation abilities depend on the spectrum of the corresponding kernel (Caponnetto & De Vito, 2007; Bordelon et al., 2020): the main challenge is then to characterise this spectrum, especially for deep CNNs, whose kernels are rather cumbersome and defined recursively (Arora et al., 2019). This characterisation is the main result of our paper, together with the ensuing study of generalisation in deep CNNs.

1.1. OUR CONTRIBUTIONS

More specifically, this paper studies the generalisation properties of deep CNNs with nonoverlapping patches and no pooling (defined in Sec. 2, see Fig. 1 for an illustration), trained on a target function f* by empirical minimisation of the mean squared loss. We consider the infinite-width limit (Sec. 3), where the model parameters change infinitesimally over training, thus the trained network coincides with the predictor of kernel regression with the Neural Tangent Kernel (NTK) of the network. Due to the equivalence with kernel methods, generalisation is fully characterised by the spectrum of the integral operator of the kernel: in simple terms, the projections on the eigenfunctions with larger eigenvalues can be learnt (up to a fixed generalisation error) with fewer training points (see, e.g., Bach (2021)).

Spectrum of deep hierarchical kernels (Thm. 3.1).
Due to the network architecture, the hidden neurons of each layer depend only on a subset of the input variables, known as the receptive field of that neuron (highlighted by coloured boxes in Fig. 1, left panel). We find that the eigenfunctions of the NTK of a hierarchical CNN of depth L+1 can be organised into sectors l = 1, ..., L associated with the hidden layers of the network (Thm. 3.1). The eigenfunctions of each sector depend only on the receptive fields of the neurons of the corresponding hidden layer: if we denote with d_eff(l) the size of the receptive fields of neurons in the l-th hidden layer, then the eigenfunctions of the l-th sector are effectively functions of d_eff(l) variables. We characterise the asymptotic behaviour of the NTK eigenvalues with the degree of the corresponding eigenfunctions (Thm. 3.1) and find that it is controlled by d_eff(l). As a consequence, the eigenfunctions with the largest eigenvalues, i.e. the easiest to learn, are those which depend on small subsets of the input variables and have low polynomial degree. This is our main technical contribution and all of our conclusions follow from it.

Adaptivity to the spatial structure of the target (Cor. 4.1). We use the above result to prove that deep CNNs can adapt to the spatial scale of the target function (Sec. 4). More specifically, by using rigorous bounds from the theory of kernel ridge regression (Caponnetto & De Vito, 2007) (reviewed in the first paragraph of Sec. 4), we show that when learning with the kernel of a CNN and optimal regularisation, the decay of the error depends on the effective dimensionality of the target f*: if f* only depends on d_eff adjacent coordinates of the d-dimensional input, then ϵ ∼ n^{-β} with β ≥ O(1/d_eff) (Cor. 4.1, see Fig. 1 for a pictorial representation). We find a similar picture in ridgeless regression by using non-rigorous results derived with the replica method (Bordelon et al., 2020; Loureiro et al., 2021) (Sec. 5). Notice that for targets which are spatially localised (or sums of spatially localised functions), the rates achieved with deep CNNs are much closer to the Bayes-optimal rates, realised when the architecture is fine-tuned to the structure of the target, than the rate β = O(1/d) obtained with the kernel of a fully-connected network. Moreover, we find that hierarchical functions generated by the output of deep CNNs are too rich to be efficiently learnable in high dimensions (Lemma 5.1). We confirm these results through extensive numerical studies and find them to hold even if the nonoverlapping patches assumption is relaxed (Subsec. G.4).

1.2. RELATED WORK

Figure 1: Left: Computational skeleton of a convolutional neural network of depth L+1 = 4 (L = 3 hidden layers). The leaves of the graph (squares) correspond to input coordinates, and the root (empty circle) to the output. All other nodes represent (infinitely wide layers of) hidden neurons. We define as 'meta-patches' (i.e. patches of patches) the sets of input variables that share a common ancestor node along the tree (such as the squares within each coloured rectangle). Each meta-patch coincides with the receptive field of the neuron represented by this common ancestor node, as indicated below the input coordinates. For each hidden layer l = 1, ..., L, there is a family of meta-patches having dimensionality d_eff(l). Right: Sketches of learning curves ϵ(n) (log ϵ versus log n) obtained by learning target functions of varying spatial scale with the network on the left. More specifically, the target is a function of a 3-dimensional patch for the blue curve, a 6-dimensional patch for the orange curve, and the full input for the green curve. We predict (and confirm empirically) that both the decay of ϵ with n (full lines, ∼ n^{-O(1/d_eff)}) and the rigorous upper bound (dashed lines) are controlled by the effective dimensionality d_eff of the target.

The benefits of shallow CNNs in the kernel regime have been investigated by Bietti (2022); Favero et al. (2021); Misiakiewicz & Mei (2021); Xiao & Pennington (2022); Xiao (2022); Geifman et al. (2022). Favero et al. (2021), and later Misiakiewicz & Mei (2021); Xiao & Pennington (2022), studied the generalisation properties of shallow CNNs, finding that they are able to beat the curse of dimensionality on local target functions. However, these architectures can only approximate functions of single input patches or linear combinations thereof. Bietti (2022), in addition, includes generic pooling layers and begins considering the role of depth by studying the approximation properties of kernels which are integer powers of other kernels. We generalise this line of work by studying CNNs of any depth with nonanalytic (ReLU) activations: we find that the depth and nonanalyticity of the resulting kernel are crucial for understanding the inductive bias of deep CNNs. This result should also be contrasted with the spectrum of the kernels of deep fully-connected networks, whose asymptotics do not depend on depth (Bietti & Bach, 2021). Furthermore, we extend the analysis of generalisation to target functions that have a hierarchical structure similar to that of the networks themselves. Geifman et al. (2022) derive bounds on the spectrum of the kernels of deep CNNs. However, they consider only filters of size one in the first layer and do not include a theoretical analysis of generalisation. Instead, we allow filters of general dimension and give tight estimates of the asymptotic behaviour of eigenvalues, which allow us to predict generalisation properties. Xiao (2022) is the closest to our work, as it also investigates the spectral bias of deep CNNs in the kernel regime. However, it considers a different limit where both the input dimension and the number of training points diverge, and it does not characterise the asymptotic decay of the generalisation error with the number of training samples. Paccolat et al. (2021); Malach & Shalev-Shwartz (2021); Abbe et al.
(2022) use sparse target functions which depend only on a few of the input variables to prove sample complexity separation results between networks operating in the kernel regime and in the feature regime, where the change in parameters during training can be arbitrarily large. In this respect, our work shows that when the few relevant input variables are adjacent, i.e. the target function is spatially localised, deep CNNs achieve near-optimal performance even in the kernel regime.

2. NOTATION AND SETUP

Our work considers CNNs with nonoverlapping patches and no pooling layers. These networks are fully characterised by the depth L+1 (or number of hidden layers L) and a set of filter sizes {s_l}_l (one per hidden layer). We call such networks hierarchical CNNs.

Definition 2.1 (L-hidden-layers hierarchical CNN) Denote by σ the normalised ReLU function, σ(x) = √2 max(0, x). For each input x ∈ R^d and s a divisor of d, denote by x_i the i-th s-dimensional patch of x, x_i = (x_{(i-1)×s+1}, ..., x_{i×s}), for all i = 1, ..., d/s. The output of an L-hidden-layers hierarchical neural network can be defined recursively as follows:

f^{(1)}_{h,i}(x) = σ( w^{(1)}_h · x_i ), ∀h ∈ [1..H_1], ∀i ∈ [1..p_1];

f^{(l)}_{h,i}(x) = σ( (1/√H_{l-1}) ∑_{h'} w^{(l)}_{h,h'} · [f^{(l-1)}_{h'}]_i / √s_l ), ∀h ∈ [1..H_l], i ∈ [1..p_l], l ∈ [2..L];

f(x) = f^{(L+1)}(x) = (1/√H_L) ∑_{h=1}^{H_L} ∑_{i=1}^{p_L} w^{(L+1)}_{h,i} f^{(L)}_{h,i}(x) / √p_L.

H_l denotes the width of the l-th layer, s_l the filter size (s_1 = s), p_l the number of patches (p_1 ≡ p = d/s), and the weights are w^{(1)}_h ∈ R^{s_1}, w^{(l)}_{h,h'} ∈ R^{s_l}, w^{(L+1)}_{h,i} ∈ R.

Hierarchical CNNs are best visualised by considering their computational skeleton, i.e. the directed acyclic graph obtained by setting H_l = 1 ∀l (example in Fig. 1, left, with L = 3 hidden layers and filter sizes (s_1, s_2, s_3) = (3, 2, 2)). Having nonoverlapping patches, the computational skeleton is an ordered tree, whose root is the output (empty circle at the top of the figure) and whose leaves are the input coordinates (squares at the bottom). All the other nodes represent neurons, and all the neurons belonging to the same hidden layer have the same distance from the input nodes. The tree structure highlights that the post-activations f^{(l)}_i of the l-th layer depend only on a subset of the input variables, also known as the receptive field.

Since the first layer of a hierarchical CNN acts on s_1-dimensional patches of the input, it is convenient to consider each d-dimensional input signal as the concatenation of p s-dimensional patches, with s = s_1 and p × s = d. We assume that each patch is normalised to 1, so that the input space is a product of p s-dimensional unit spheres (called multisphere in Geifman et al. (2022)):

M^p S^{s-1} := ∏_{i=1}^{p} S^{s-1} ⊂ S^{d-1}. (2)

Notice that the s-dimensional patches are also the receptive fields of the first-hidden-layer neurons (as in the blue rectangle in Fig. 1 for s = 3). In general, the receptive field of a neuron in the l-th hidden layer with l > 1 is a group of ∏_{l'=2}^{l} s_{l'} adjacent patches (as in the orange rectangle of Fig. 1 for l = 2, s_2 = 2, or the green rectangle for l = 3, s_3 = s_2 = 2), which we refer to as a meta-patch. Due to the correspondence with the receptive fields, each meta-patch is identified with one path on the computational skeleton: the path which connects the output node to the hidden neuron whose receptive field coincides with the meta-patch. If such a hidden neuron belongs to the l-th hidden layer, the path is specified by a tuple of L - l + 1 indices, i_{l+1→L+1} := i_{L+1} ... i_{l+1}, where each index indicates which branch to select when descending from the root to the neuron node. With this notation, x_{i_{l+1→L+1}} denotes one of the p_l meta-patches of size ∏_{l'≤l} s_{l'}. Because of the normalisation of the s_1-dimensional patches, each meta-patch has an effective dimensionality which is lower than its size:

x_{i_{2→L+1}} ∈ S^{s_1-1} ⇒ d_eff(1) := dim(x_{i_{2→L+1}}) = s_1 - 1, d_eff(l) := dim(x_{i_{l+1→L+1}}) = (s_1 - 1) ∏_{l'=2}^{l} s_{l'}, ∀l ∈ [2..L]. (3)
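To make Def. 2.1 concrete, the following is a minimal NumPy sketch of the forward pass of a hierarchical CNN with nonoverlapping patches (an illustration under the stated assumptions, with arbitrary widths and filter sizes; it is not the implementation used for the experiments).

```python
import numpy as np

def norm_relu(z):
    # normalised ReLU of Def. 2.1
    return np.sqrt(2.0) * np.maximum(z, 0.0)

def hierarchical_cnn(x, weights, filter_sizes):
    """Forward pass of the hierarchical CNN of Def. 2.1 (nonoverlapping patches, no pooling).

    x: input of shape (d,); filter_sizes: (s_1, ..., s_L);
    weights: [W1 of shape (H1, s1), W2 of shape (H2, H1, s2), ..., W_{L+1} of shape (H_L, p_L)].
    """
    s1 = filter_sizes[0]
    p1 = x.size // s1
    patches = x.reshape(p1, s1)                                   # (p1, s1)
    # first hidden layer: one s1-dimensional filter per channel, shared across the p1 patches
    f = norm_relu(np.einsum('hs,is->hi', weights[0], patches))    # (H1, p1)
    # hidden layers l = 2..L: filters act on s_l adjacent "patches" of the previous layer
    for l, s_l in enumerate(filter_sizes[1:], start=2):
        W = weights[l - 1]                                        # (H_l, H_{l-1}, s_l)
        H_prev, p_prev = f.shape
        p_l = p_prev // s_l
        f = f.reshape(H_prev, p_l, s_l)                           # group adjacent patches
        pre = np.einsum('ghs,hps->gp', W, f) / np.sqrt(H_prev * s_l)
        f = norm_relu(pre)                                        # (H_l, p_l)
    H_L, p_L = f.shape
    # linear output layer
    return np.einsum('hp,hp->', weights[-1], f) / np.sqrt(H_L * p_L)

# toy usage: d = 12, filter sizes (3, 2, 2), widths (H1, H2, H3) = (64, 64, 64)
rng = np.random.default_rng(0)
d, sizes, H = 12, (3, 2, 2), (64, 64, 64)
w = [rng.standard_normal((H[0], sizes[0])),
     rng.standard_normal((H[1], H[0], sizes[1])),
     rng.standard_normal((H[2], H[1], sizes[2])),
     rng.standard_normal((H[2], d // np.prod(sizes)))]
x = rng.standard_normal(d)
x = (x.reshape(-1, sizes[0]) / np.linalg.norm(x.reshape(-1, sizes[0]), axis=1, keepdims=True)).ravel()
print(hierarchical_cnn(x, w, sizes))
```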

3. HIERARCHICAL KERNELS AND THEIR SPECTRA

We turn now to the infinite-width limit H_l → ∞: because of the aforementioned equivalence with kernel methods, this limit allows us to deduce the generalisation properties of the network from the spectrum of a kernel. In this section, we present the kernels corresponding to the hierarchical models of Def. 2.1 and characterise the spectra of the associated integral operators. We consider specifically two kernels: the Neural Tangent Kernel (NTK), corresponding to training all the network parameters (Jacot et al., 2018); and the Random Feature Kernel (RFK), corresponding to training only the weights of the linear output layer (Rahimi & Recht, 2007; Daniely et al., 2016). In both cases, the kernel reads

K(x, y) = ∑_{θ ∈ trained params} ∂_θ f(x) ∂_θ f(y). (4)

The NTK and RFK of deep CNNs have been derived previously by Arora et al. (2019). In App. B we report the functional forms of these kernels in the case of hierarchical CNNs. These kernels inherit the hierarchical structure of the original architecture and their operations can be visualised again via the tree graph of Fig. 1. In this case, the leaves represent products between the corresponding elements of two inputs x and y, i.e. x_1 y_1 to x_d y_d, and the root represents the kernel output K(x, y). The output can be built layer by layer by following the same recipe for each node: first sum the outputs of the previous layer which are connected to the present node, then apply some nonlinear function which depends on the activation function of the network. In particular, for each couple of inputs x and y on the multisphere M^p S^{s-1}, hierarchical kernels depend on x and y via the p dot products between corresponding s-dimensional patches of x and y. As a comparison, Bietti & Bach (2021) showed that the NTK and RFK of a fully-connected network of any depth depend on the full dot product x·y, whereas those of a shallow CNN can be written as the sum of p kernels, each depending on only one of the patch dot products (Favero et al., 2021).

Given the kernel, the associated integral operator reads

(T_K f)(x) := ∫_{M^p S^{s-1}} K(x, y) f(y) dp(y),

with dp(x) denoting the uniform distribution of input points on the multisphere. The spectrum of this operator provides, via Mercer's theorem (Mercer, 1909), an alternative representation of the kernel K(x, y) and a basis for the space of functions that the kernel can approximate. The asymptotic decay of the eigenvalues, in particular, is crucial for the generalisation properties of the kernel, as will be clarified in Sec. 4. Since the input space is a product of s-dimensional unit spheres and the kernel depends on the p scalar products between corresponding s-dimensional patches of x and y, the eigenfunctions of T_K are products of spherical harmonics acting on the patches (see App. A for definitions and the relevant background). For the sake of clarity, we limit the discussion in the main paper to the case s = 2, where, since each patch x_i is entirely determined by an angle θ_i, the multisphere M^p S^{s-1} reduces to the p-dimensional torus and the eigenfunctions to p-dimensional plane waves e^{i k·θ}, with θ := (θ_1, ..., θ_p) and label k := (k_1, ..., k_p). In this case the eigenvalues coincide with the p-dimensional Fourier transform of the kernel K(cos θ_1, ..., cos θ_p) and the large-k asymptotics are controlled by the nonanalyticities of the kernel (Widom, 1963). The general case with patches of arbitrary dimension is presented in the appendix.
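Since for s = 2 the eigenvalues coincide with the p-dimensional Fourier coefficients of K(cos θ_1, ..., cos θ_p), they can be estimated numerically with an FFT on a grid of angles. A minimal sketch is given below (illustrative only; the shallow RFK used as an example is the one-hidden-layer kernel of App. B).

```python
import numpy as np

def kernel_eigenvalues_s2(K, p, n_grid=64):
    """Approximate the eigenvalues Λ_k of a multi-dot-product kernel for s = 2 patches.

    K maps an array of p cosines, shape (..., p), to kernel values; the eigenvalues
    are the p-dimensional Fourier coefficients of K(cos θ_1, ..., cos θ_p).
    """
    theta = 2 * np.pi * np.arange(n_grid) / n_grid
    grids = np.meshgrid(*([theta] * p), indexing='ij')
    cosines = np.stack([np.cos(g) for g in grids], axis=-1)      # shape (n_grid,)*p + (p,)
    values = K(cosines)
    # discrete version of Λ_k = (2π)^{-p} ∫ K e^{-i k·θ} d^p θ
    lam = np.fft.fftn(values) / n_grid ** p
    return np.real(lam)    # the kernel is even in each angle, so the coefficients are real

# example: shallow (one-hidden-layer) RFK, K = (1/p) Σ_i κ_1(t_i), with t_i = cos θ_i
def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t ** 2)) / np.pi

p = 2
lam = kernel_eigenvalues_s2(lambda t: kappa1(t).mean(axis=-1), p)
# local eigenvalues (labels k = (k_1, 0)): expected decay k^{-2ν-1} with ν = 3/2 for the RFK
print([lam[k, 0] for k in (2, 4, 8, 16)])
```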
Theorem 3.1 (Spectrum of hierarchical kernels) Let T_K be the integral operator associated with a d-dimensional hierarchical kernel of depth L+1, L > 1, and filter sizes (s_1, ..., s_L) with s_1 = 2. Eigenvalues and eigenfunctions of T_K can be organised into L sectors associated with the hidden layers of the kernel/network. For each 1 ≤ l ≤ L, the l-th sector consists of (∏_{l'=1}^{l} s_{l'})-local eigenfunctions: functions of a single meta-patch x_{i_{l+1→L+1}} which cannot be written as linear combinations of functions of smaller meta-patches. The labels k of these eigenfunctions are such that there is a meta-patch k_{i_{l+1→L+1}} of k with no vanishing sub-meta-patches, and all the k_i's outside of k_{i_{l+1→L+1}} are 0 (because the eigenfunction is constant outside of x_{i_{l+1→L+1}}). The corresponding eigenvalue is degenerate with respect to the location of the meta-patch: we call it Λ^{(l)}_{k_{i_{l+1→L+1}}}. When ∥k_{i_{l+1→L+1}}∥ → ∞, with k = ∥k_{i_{l+1→L+1}}∥,

Λ^{(l)}_{k_{i_{l+1→L+1}}} = C_{2,l} k^{-2ν - d_eff(l)} + o(k^{-2ν - d_eff(l)}), (6)

with ν_NTK = 1/2, ν_RFK = 3/2, and d_eff the effective dimensionality of the meta-patches defined in Eq. 3. C_{2,l} is a strictly positive constant for l ≥ 2, whereas for l = 1 it can take two distinct strictly positive values depending on the parity of k_{i_{2→L+1}}.

The proof is in App. C, together with the extension to the s ≥ 3 case (Thm. C.1). It is useful to compare the spectrum in the theorem with the limiting cases of a deep fully-connected network and a shallow CNN. In the former case, the spectrum consists only of the L-th sector with p_L = 1 (the global sector). The eigenvalues decay as ∥k∥^{-2ν-p}, with ν depending ultimately on the nonanalyticity of the network activation function (see Bietti & Bach (2021) or App. C) and p = d_eff(L) the effective dimensionality of the input. As a result, all eigenfunctions with the same ∥k∥ have the same eigenvalue, even those depending on a subset of the input coordinates. For example, assume that all the components of k are zero but k_1, i.e. the eigenfunction depends only on the first 2-dimensional patch: the eigenvalue is O(k_1^{-2ν-p}). By contrast, for a hierarchical kernel, the eigenvalue is O(k_1^{-2ν-1}), much larger than the former as p > 1. In the case of a shallow CNN the spectrum consists only of the first sector, so that each eigenfunction depends only on one of the input patches. In this case only one of the k_i can be non-zero, say k_1, and the eigenvalue is O(k_1^{-2ν-1}). However, from Favero et al. (2021), a kernel of this kind is only able to approximate functions which depend on one of the input patches or linear combinations of such functions. Instead, for a hierarchical kernel with p_L = 1, the eigenfunctions of the L-th sector are supported on the full input space. Then, if Λ_k > 0 for all k, hierarchical kernels are able to approximate any function on the multisphere, dispensing with the need for fine-tuning the kernel to the structure of the target function.

Overall, given an eigenfunction of a hierarchical kernel, the asymptotic scaling of the corresponding eigenvalue is determined by the spatial structure of the eigenfunction support: more specifically, by the effective dimensionality of the smallest meta-patch which contains all the variables that the eigenfunction depends on. In simple terms, the decay of an eigenvalue with k is slower if the associated eigenfunction depends on a few adjacent patches, but not if the patches are far apart!
This is a property of hierarchical architectures which use nonlinear activation functions at all layers. Such a feature disappears if all hidden layers apart from the first have polynomial (Bietti, 2022) or infinitely smooth (Azevedo & Menegatto, 2015; Scetbon & Harchaoui, 2021) activation functions or if the kernels are assumed to factorise over patches as in Geifman et al. (2022) .
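To illustrate the sector decomposition of Thm. 3.1, the toy function below (a sketch with hypothetical helper names, for s_1 = 2 so that each patch carries a single integer label k_i) returns, for a given label vector k and set of filter sizes, the sector l of the corresponding eigenfunction and the effective dimensionality d_eff(l) that controls the decay of its eigenvalue.

```python
import numpy as np

def sector_and_deff(k, filter_sizes):
    """Sector l and effective dimensionality d_eff(l) of the eigenfunction labelled by k.

    k = (k_1, ..., k_p), one integer per 2-dimensional patch (s_1 = 2);
    filter_sizes = (s_1, s_2, ..., s_L). Layer-l meta-patches are blocks of
    s_2 * ... * s_l consecutive patches, and d_eff(l) = (s_1 - 1) * s_2 * ... * s_l (Eq. 3).
    """
    k = np.asarray(k)
    support = np.flatnonzero(k)
    assert support.size > 0, "k = 0 labels the constant eigenfunction"
    s1 = filter_sizes[0]
    block = 1                                    # meta-patch size in units of s_1-patches
    for l, s_l in enumerate((1,) + tuple(filter_sizes[1:]), start=1):
        block *= s_l
        if np.all(support // block == support[0] // block):
            return l, (s1 - 1) * block
    raise ValueError("support of k does not fit inside a single meta-patch")

# example: depth-4 kernel with filter sizes (2, 2, 2) acting on p = 4 patches (p_L = 1)
sizes = (2, 2, 2)
print(sector_and_deff([3, 0, 0, 0], sizes))   # -> (1, 1): local, slowest eigenvalue decay
print(sector_and_deff([3, 1, 0, 0], sizes))   # -> (2, 2): supported on a 2-patch meta-patch
print(sector_and_deff([3, 0, 0, 1], sizes))   # -> (3, 4): global sector
```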

4. GENERALISATION PROPERTIES AND ADAPTIVITY TO SPATIAL STRUCTURE

In this section, we study the implications of the peculiar spectra of hierarchical NTKs and RFKs on the generalisation properties of the corresponding kernel methods, and prove a form of adaptivity to the spatial structure of the target function. We follow the classical analysis of Caponnetto & De Vito (2007) for kernel ridge regression (see Bach (2021); Bietti (2022) for a modern treatment) and employ a spectral bias ansatz for the ridgeless limit (Bordelon et al., 2020; Spigler et al., 2020).

Theory of kernel ridge regression and source-capacity conditions. Given a set of n training points {(x_μ, y_μ)}_{μ=1}^{n} drawn i.i.d. from some probability density function p(x, y) and a regularisation parameter λ > 0, the kernel ridge regression estimate of the functional relation between x's and y's, or predictor, is

f^n_λ(x) = argmin_{f ∈ H} [ (1/n) ∑_{μ=1}^{n} (f(x_μ) - y_μ)^2 + λ ∥f∥_H ],

where H is the Reproducing Kernel Hilbert Space (RKHS) of a (hierarchical) kernel K. If f(x) denotes the model from which the kernel was obtained via Eq. 4, the space H is contained in the span of the network features {∂_θ f(x)}_θ in the infinite-width limit. Alternatively, H can be defined via the kernel's eigenvalues Λ_k and eigenfunctions Y_k: denoting with f_k the projections of a function f onto the kernel eigenfunctions, f belongs to H if it lies in the span of the eigenfunctions and

∥f∥^2_H = ∑_{k≥0} (Λ_k)^{-1} |f_k|^2 < +∞.

The performance of the kernel is measured by the generalisation error and its expectation over training sets of fixed size n (denoted with E_n),

ϵ(f^n_λ) = ∫ dx dy p(x, y) (f^n_λ(x) - y)^2,  ϵ(λ, n) = E_n[ϵ(f^n_λ)],

or by the excess generalisation error, obtained by subtracting from ϵ(λ, n) the error of the optimal predictor f*(x) = ∫ dy p(y|x) y. The decay of the error with n can be controlled via two exponents, depending on the details of the kernel and the target function. Specifically, if α ≥ 1 and r ≥ 1 - 1/α satisfy the following conditions,

capacity: Tr(T_K^{1/α}) = ∑_{k≥0} (Λ_k)^{1/α} < +∞,
source: ∥T_K^{(1-r)/2} f*∥^2_H = ∑_{k≥0} (Λ_k)^{-r} |f*_k|^2 < +∞,

then, by choosing an n-dependent regularisation parameter λ_n ∼ n^{-α/(αr+1)}, one gets the following bound on generalisation (Caponnetto & De Vito, 2007):

ϵ(λ_n, n) - ϵ(f*) ≤ C' n^{-αr/(αr+1)}. (11)

Spectral bias ansatz for ridgeless regression. The bound above is actually tight in the noisy setting, for instance when the labels are y_μ = f*(x_μ) + ξ_μ with ξ_μ Gaussian. In a noiseless problem where y_μ = f*(x_μ), one expects to find the best performance in the ridgeless limit λ → 0, so that the rate of Eq. 11 is only an upper bound. In the ridgeless case, where the correspondence between kernel methods and infinitely wide neural networks actually holds, there are unfortunately no rigorous results for the decay of the generalisation error. Therefore, we provide a heuristic derivation of the error decay based on a spectral bias ansatz. Consider the projections f*_k of the target function f* on the eigenfunctions Y_k of the student kernel, and assume that kernel methods learn only the n projections corresponding to the highest eigenvalues. Then, if the decay of f*_k with k is sufficiently slow, one has (recall that both λ and ϵ(f*) vanish in this setting)

ϵ(n) ∼ ∑_{k s.t. Λ_k < Λ(n)} |f*_k|^2, (12)

with Λ(n) the value of the n-th largest eigenvalue of the kernel. This result can be derived using the replica method of statistical physics (see Canatar et al. (2021); Loureiro et al. (2021); Tomasini et al. (2022) and App. E) or by assuming that input points lie on a lattice (Spigler et al., 2020).
E) or by assuming that input points lie on a lattice (Spigler et al., 2020) . These two approaches rely on the very same features of the problem, namely the asymptotic decay of Λ k and |f * k | 2 -see also Cui et al. (2021) . For instance, the capacity condition depends only on the kernel spectrum: α ≥ 1 since Tr (T K ) is finite (Schölkopf et al., 2002) ; the specific value is determined by the decay of the ordered eigenvalues with their rank, which in turn depends on the scaling of Λ k with k. Similarly, the power-law decay of the ordered eigenvalues with the rank determines the scaling of the n-th largest eigenvalue, Λ(n) ∼ n -α . The source condition characterises the regularity of the target function relative to the kernel and depends explicitly of the decay of |f * k | 2 with k, as does the right-hand side of Eq. 12. This condition was used by Bach (2021) to prove that kernel methods are adaptive to the smoothness of the target function: the projections of smoother targets on the eigenfunctions display a faster decay with k, thus allowing to choose a larger r and leading to better generalisation performances. The following corollary of Thm. 3.1 (proof and extension to s 1 ≥ 3 presented in App. D, Cor. D.1) shows that, since the spectrum can be partitioned as in Thm. 3.1, hierarchical kernels display adaptivity to targets which depend only on a subset of the input variables. Specific examples of bounds are considered explicitly in Sec. 5. Corollary 4.1 (Adaptivity to spatial structure) Let T K be the integral operator of the kernel of a hierarchical deep CNN as in Thm. 3.1 with s = 2. Then: i) the capacity exponent α is controlled by the largest sector of the spectrum, i.e. Tr T 1/α K < +∞ ⇔ α < 1 + 2ν/d eff (L); ii) the source exponent r is controlled by the structure of the target function f * , i.e., if there is l ≤ L such that f * depends only on some meta-patch x i l+1→L+1 , then only the first l sectors of the spectrum contribute to the source condition, T 1-r 2 K f * 2 H = l l ′ =1 i l ′ +1→L+1 ki l ′ +1→L+1 Λ (l ′ ) ki l ′ +1→L+1 -r f * ki l ′ +1→L+1 2 . ( ) The same holds if f * is a linear combination of such functions. As a result, when d eff (L) is large and α → 1, the decay of the error is controlled by the effective dimensionality of the target d eff (l).
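For concreteness, the estimator analysed above is plain kernel ridge regression; a minimal sketch of its Gram-matrix form, equivalent to the variational definition of f^n_λ, is given below (illustrative; the shallow RFK and the local target are just toy choices).

```python
import numpy as np

def krr_fit_predict(K_train, y_train, K_test_train, lam):
    """Kernel ridge regression: f^n_λ(x) = k(x)^T (K + n λ I)^{-1} y.

    K_train: (n, n) Gram matrix of the training points;
    K_test_train: (m, n) kernel between test and training points;
    lam: ridge λ (lam -> 0 gives the ridgeless/interpolating limit used in Sec. 5).
    """
    n = y_train.size
    alpha = np.linalg.solve(K_train + n * lam * np.eye(n), y_train)
    return K_test_train @ alpha

# toy usage with a translation-invariant kernel on 2-dimensional patches (s = 2, p = 3)
def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t ** 2)) / np.pi

def shallow_rfk(X, Y):
    # X: (n, p) angles, Y: (m, p) angles; K(x, y) = (1/p) Σ_i κ_1(cos(θ_i - φ_i))
    return kappa1(np.cos(X[:, None, :] - Y[None, :, :])).mean(axis=-1)

rng = np.random.default_rng(0)
X, Xte = rng.uniform(0, 2 * np.pi, (200, 3)), rng.uniform(0, 2 * np.pi, (50, 3))
f_star = lambda Z: np.cos(Z[:, 0])            # a local target: depends on one patch only
pred = krr_fit_predict(shallow_rfk(X, X), f_star(X), shallow_rfk(Xte, X), lam=1e-6)
print(np.mean((pred - f_star(Xte)) ** 2))
```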

5. EXAMPLES AND EXPERIMENTS

Source-capacity bound for functions of controlled smoothness and d_eff. Consider a target function f* which only depends on the meta-patch x_{i_{l+1→L+1}}, as in Cor. 4.1. Combining the source condition (Eq. 14) with the asymptotic scaling of the eigenvalues (Eq. 6), we get

∥T_K^{(1-r)/2} f*∥^2_H < +∞ ⇔ ∑_{k} ∥k∥^{r(2ν + d_eff(l))} |f*_k|^2 < +∞, (15)

where ν = 1/2 (3/2) for the NTK (RFK) and k denotes the meta-patch k_{i_{l+1→L+1}} without the subscript, to ease notation. Since the eigenvalues depend on the norm of k, Eq. 15 is equivalent to a finite-norm condition for all the derivatives of f* up to order m < r(2ν + d_eff(l))/2,

∥∆^{m/2} f*∥^2 = ∑_{k} ∥k∥^{2m} |f*_k|^2 < +∞,

with ∆ denoting the Laplace operator. As a result, if f* has derivatives of finite norm up to the m-th, then the source exponent can be tuned to r = 2m/(2ν + d_eff(l)), inversely proportional to the effective dimensionality of f*. Since the exponent on the right-hand side of Eq. 11 is an increasing function of r, the smaller the effective dimensionality of f* the faster the decay of the error; hence hierarchical kernels are adaptive to the spatial structure of f*. In particular, from Eq. 11 with α = 1 + 2ν/d_eff(L),

ϵ(n) ≤ C' n^{-β} with β = 2m(2ν + d_eff(L)) / [2m(2ν + d_eff(L)) + (2ν + d_eff(l)) d_eff(L)]. (16)

For instance, if p_L = 1 then d_eff(L) = p = d/2 (the number of 2-dimensional patches). Even when p ≫ 1, if f* depends only on a finite-dimensional meta-patch (or is a sum of such functions), the exponent β converges to the finite value 2m/(2(m + ν) + d_eff(l)). In stark contrast, using a fully-connected kernel to learn the same target results in r = 2m/(2ν + p), thus β = 2m/(2m + p), which vanishes as 1/p when p ≫ 1 and is therefore cursed by dimensionality.

Rates from spectral bias ansatz. The same picture emerges when estimating the actual decay of the error from Eq. 12. The power-law decay of the ordered eigenvalues gives Λ(n) ∼ n^{-α}, whereas ∑_k ∥k∥^{2m} |f*_k|^2 < +∞ implies |f*_k|^2 ≲ ∥k∥^{-2m - d_eff(l)} for a target supported on a d_eff(l)-dimensional meta-patch. Plugging such decays into Eq. 12 we obtain (details in Subsec. F.1)

ϵ(n) ∼ n^{-β} with β = (2m/(2ν + d_eff(l))) × ((2ν + d_eff(L))/d_eff(L)). (17)

Again, with p_L = 1 and d_eff(L) = p, the exponent remains finite for p ≫ 1. Notice that we recover the results of Favero et al. (2021) by using a shallow local kernel if the target is supported on s-dimensional patches. These results show that hierarchical kernels strike a significantly better approximation-estimation trade-off than shallow local kernels, as they are able to approximate global functions of the input while not being cursed when the target function has a local structure.

Numerical experiments. We test our predictions by training a hierarchical kernel (student) on a random Gaussian function with zero mean and covariance given by another hierarchical kernel (teacher). A learning problem is fully specified by the depths, sets of filter sizes, and smoothness exponents ν of the teacher and student kernels. In particular, the depth and the set of filter sizes of the teacher kernel control the effective dimension of the target function. Fig. 2 shows the learning curves (solid lines) together with the predictions from Eq. 17 (dashed lines), confirming the picture emerging from our calculations. Panel (a) of Fig. 2 shows a depth-four student learning depth-two, depth-three, and depth-four teachers. This student is not cursed in the first two cases and is cursed in the third one, which corresponds to a global target function.
Panel (b) illustrates the curse of dimensionality with the effective input dimension d_eff(L), by comparing the learning curves of depth-three students learning global target functions with an increasing number of variables. All our simulations are in excellent agreement with the predictions of Eq. 17. The bounds coming from Eq. 16 would display a slightly slower decay, as sketched in Fig. 1, right panel. All the details of the numerical experiments are reported in App. G, together with a comparison between the ridgeless and optimally-regularised cases (Fig. S3) and additional results for: s_1 ≥ 3 (Fig. S2); kernels with overlapping patches (Fig. S5); different input spaces (Fig. S4); and the CIFAR-10 dataset (Fig. S6).

Notice that when the teacher kernel is a hierarchical RFK, the target is equivalent to the output of a randomly-initialised, infinitely-wide CNN (Novak et al., 2019). Although this target is highly structured, it leads to the same rate that we would have obtained for a global non-hierarchical target, ϵ(n) ∼ n^{-3/d_eff(L)}, which is cursed by the effective input dimensionality (Lemma 5.1). This lemma is a simple consequence of the equivalence of the predictors of kernel ridgeless regression and Bayesian inference, which implies that the rate achieved when learning a Gaussian random function with a kernel matching the covariance kernel of the function is Bayes-optimal (Kanagawa et al., 2018). The optimal rate n^{-3/d_eff(L)} comes from Eq. 17 with l = L and m = ν = 3/2. We conclude that, despite their intrinsically hierarchical structure, these targets cannot be good models of learnable tasks.

[Figure 2: learning curves ϵ(n) of hierarchical students S trained on hierarchical teachers T with the indicated filter sizes (e.g. T: (2), S: (2, 2, 2); T: (2, 2), S: (2, 2, 2); T, S: (2, 2, 2)), together with the predicted decay exponents β.]
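A minimal sketch of the teacher-student experiment described in this section (illustrative, not the paper's code): sample a Gaussian random target with covariance given by a teacher kernel, regress it with a student kernel in the ridgeless limit, and track the decay of the test error with n. The depth-3 RFK below (s_1 = 2 patches on the torus, second filter size equal to p) is a small instance of the hierarchical kernels of App. B.

```python
import numpy as np

def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t ** 2)) / np.pi

def rfk_depth2(X, Y):
    # shallow RFK on p 2-dimensional patches: K(x, y) = (1/p) Σ_i κ_1(cos(θ_i - φ_i))
    return kappa1(np.cos(X[:, None, :] - Y[None, :, :])).mean(axis=-1)

def rfk_depth3(X, Y):
    # one more layer with filter size p (p_2 = 1): K = κ_1( (1/p) Σ_i κ_1(cos(θ_i - φ_i)) )
    return kappa1(rfk_depth2(X, Y))

def test_error(teacher, student, n, n_test=512, p=4, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 2 * np.pi, (n + n_test, p))
    K_teacher = teacher(X, X) + 1e-10 * np.eye(n + n_test)
    y = np.linalg.cholesky(K_teacher) @ rng.standard_normal(n + n_test)  # Gaussian random target
    K = student(X[:n], X[:n])
    alpha = np.linalg.solve(K + 1e-10 * np.eye(n), y[:n])                # ridgeless regression
    pred = student(X[n:], X[:n]) @ alpha
    return np.mean((pred - y[n:]) ** 2)

# matched teacher and student: Lemma 5.1 / Eq. 17 predict β = 3/d_eff(L) (= 3/4 here)
for n in (128, 256, 512, 1024):
    print(n, test_error(rfk_depth3, rfk_depth3, n))
```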

6. CONCLUSIONS AND OUTLOOK

We have proved that deep CNNs can adapt to the spatial scale of the target function, thus beating the curse of dimensionality if the target depends only on local groups of variables. Yet, if considered as 'teachers', they generate functions that cannot be learnt efficiently in high dimensions, even in the Bayes-optimal setting where the student is matched to the teacher. Thus, the architectures we considered are not good models of the hierarchical structure of real data which are efficiently learnable. Enforcing a stronger notion of compositionality is an interesting endeavour for the future. Following Poggio et al. (2017), one may consider a much smaller family of functions of the form

f_{i_{l+1→L+1}}(x_{i_{l+1→L+1}}) = f_{i_{l+1→L+1}}( f_{i_l=1, i_{l+1→L+1}}(x_{i_l=1, i_{l+1→L+1}}), ..., f_{i_l=s_l, i_{l+1→L+1}}(x_{i_l=s_l, i_{l+1→L+1}}) ). (18)

From an information theory viewpoint, Schmidt-Hieber (2020); Finocchio & Schmidt-Hieber (2021) showed that it is possible to learn such functions efficiently. However, these arguments do not provide guarantees for any practical algorithm such as stochastic gradient descent. Moreover, preliminary results (not shown) assuming that the functions f's are random Gaussian functions suggest that these tasks are not learnable efficiently by a hierarchical CNN in the kernel regime; see also Giordano et al. (2022). It is unclear whether this remains true when the networks closely resemble the structure of Eq. 18 as in Poggio et al. (2017), or when the networks are trained in a regime where features can be learnt from data. Recently, for instance, Ingrosso & Goldt (2022) have observed that under certain conditions locality can be learnt from scratch. It is not clear whether compositionality can also be learnt, beyond some very stylised settings (Abbe et al., 2022). Finally, another direction to explore is the stability of the task toward smooth transformations or diffeomorphisms. This form of stability has been proposed as a key element to understanding how the curse of dimensionality is beaten for image datasets (Bruna & Mallat, 2013; Petrini et al., 2021). Such a property can be enforced with pooling operations (Bietti & Mairal, 2019; Bietti et al., 2021); therefore diagonalising the NTK in this case as well would be of high interest.

A SPHERICAL HARMONICS

The unit sphere in s dimensions is S^{s-1} = {x ∈ R^s | ∥x∥ = 1}, with ∥·∥ denoting the L2 norm. Given the polynomial degree k ∈ N, there are N_{k,s} linearly independent spherical harmonics of degree k on S^{s-1}, with

N_{k,s} = ((2k + s - 2)/k) \binom{s + k - 3}{k - 1}, N_{0,d} = 1 ∀d, N_{k,d} ∼ k^{d-2} for k ≫ 1. (S1)

Thus, we can introduce a set of N_{k,s} spherical harmonics Y_{k,ℓ} for each k, with ℓ ranging in 1, ..., N_{k,s}, which are orthonormal with respect to the uniform measure on the sphere dτ(x),

⟨Y_{k,ℓ}, Y_{k,ℓ'}⟩_{S^{s-1}} := ∫_{S^{s-1}} dτ(x) Y_{k,ℓ}(x) Y_{k,ℓ'}(x) = δ_{ℓ,ℓ'}. (S2)

Because of the orthogonality of homogeneous polynomials with different degrees, the set {Y_{k,ℓ}}_{k,ℓ} is a complete orthonormal basis for the space of square-integrable functions on the s-dimensional unit sphere. Furthermore, spherical harmonics are eigenfunctions of the Laplace-Beltrami operator ∆, which is nothing but the restriction of the standard Laplace operator to S^{s-1}:

∆ Y_{k,ℓ} = -k(k + s - 2) Y_{k,ℓ}. (S3)

The Laplace-Beltrami operator ∆ can also be used to characterise the differentiability of functions f on the sphere via the L2 norm of some power of ∆ applied to f. By fixing a direction y in S^{s-1}, one can select, for each k, the only spherical harmonic of degree k which is invariant for rotations that leave y unchanged.
This particular spherical harmonic is, in fact, a function of x·y and is called the Legendre polynomial of degree k, P_{k,s}(x·y) (also referred to as Gegenbauer polynomial). Legendre polynomials can be written as a combination of the orthonormal spherical harmonics Y_{k,ℓ} via the addition formula (Atkinson & Han, 2012, Thm. 2.9),

P_{k,s}(x·y) = (1/N_{k,s}) ∑_{ℓ=1}^{N_{k,s}} Y_{k,ℓ}(x) Y_{k,ℓ}(y). (S4)

Alternatively, P_{k,s} is given explicitly as a function of t = x·y ∈ [-1, +1] via the Rodrigues formula (Atkinson & Han, 2012, Thm. 2.23),

P_{k,s}(t) = (-1/2)^k (Γ((s-1)/2)/Γ(k + (s-1)/2)) (1 - t^2)^{(3-s)/2} (d^k/dt^k) (1 - t^2)^{k + (s-3)/2}. (S5)

Legendre polynomials are orthogonal on [-1, +1] with respect to the measure with density (1 - t^2)^{(s-3)/2}, which is the probability density function of the scalar product between two points on S^{s-1}:

∫_{-1}^{+1} dt (1 - t^2)^{(s-3)/2} P_{k,s}(t) P_{k',s}(t) = (|S^{s-1}|/|S^{s-2}|) (δ_{k,k'}/N_{k,s}), (S6)

with |S^{s-1}| denoting the surface area of the s-dimensional unit sphere. To sum up, given x, y ∈ S^{s-1}, functions of x or y can be expressed as a sum of projections on the orthonormal spherical harmonics {Y_{k,ℓ}}_{k,ℓ}, whereas functions of x·y can be expressed as a sum of projections on the Legendre polynomials {P_{k,s}(x·y)}_k. The relationship between the two expansions is elucidated in the Funk-Hecke formula (Atkinson & Han, 2012, Thm. 2.22),

∫_{S^{s-1}} dτ(y) f(x·y) Y_{k,ℓ}(y) = Y_{k,ℓ}(x) (|S^{s-2}|/|S^{s-1}|) ∫_{-1}^{+1} dt (1 - t^2)^{(s-3)/2} f(t) P_{k,s}(t). (S7)

If the function f has continuous derivatives up to the k-th order in [-1, +1], then one can plug the Rodrigues formula into the right-hand side of the Funk-Hecke formula and get, after k integrations by parts,

∫_{S^{s-1}} dτ(y) f(x·y) Y_{k,ℓ}(y) = Y_{k,ℓ}(x) (|S^{s-2}|/|S^{s-1}|) (Γ((s-1)/2)/(2^k Γ(k + (s-1)/2))) ∫_{-1}^{+1} dt f^{(k)}(t) (1 - t^2)^{k + (s-3)/2}, (S8)

with f^{(k)}(t) denoting the k-th order derivative of f in t. This trick also applies to functions which are not k times differentiable at ±1, provided the boundary terms due to integration by parts vanish.

A.1 DOT-PRODUCT KERNELS ON THE SPHERE

Dot-product kernels are kernels which depend on the two inputs x and y via their scalar product x·y. When the inputs lie on the unit sphere S^{s-1}, one can use the machinery introduced in the previous section to arrive immediately at the Mercer decomposition of the kernel (Smola et al., 2000):

K(x·y) = ∑_{k≥0} N_{k,s} ( (|S^{s-2}|/|S^{s-1}|) ∫_{-1}^{+1} dt (1 - t^2)^{(s-3)/2} K(t) P_{k,s}(t) ) P_{k,s}(x·y)
 = ∑_{k≥0} ( (|S^{s-2}|/|S^{s-1}|) ∫_{-1}^{+1} dt (1 - t^2)^{(s-3)/2} K(t) P_{k,s}(t) ) ∑_{ℓ=1}^{N_{k,s}} Y_{k,ℓ}(x) Y_{k,ℓ}(y)
 := ∑_{k≥0} Λ_k ∑_{ℓ=1}^{N_{k,s}} Y_{k,ℓ}(x) Y_{k,ℓ}(y). (S9)

In the first line we have just decomposed K into projections onto the Legendre polynomials, the second line follows immediately from the addition formula, and the third is just a definition of the eigenvalues Λ_k. Notice that the eigenfunctions of the kernel are orthonormal spherical harmonics and the eigenvalues are degenerate with respect to the index ℓ. The Reproducing Kernel Hilbert Space (RKHS) of K can be characterised as follows,

H = { f : S^{s-1} → R s.t. ∥f∥^2_H := ∑_{k≥0, Λ_k ≠ 0} ∑_{ℓ=1}^{N_{k,s}} ⟨f, Y_{k,ℓ}⟩^2_{S^{s-1}} / Λ_k < +∞ }. (S10)

A.2 MULTI-DOT-PRODUCT KERNELS ON THE MULTI-SPHERE

The Mercer decomposition of dot-product kernels extends naturally to the case considered in this paper, where the input space is the Cartesian product of p s-dimensional unit spheres,

M^p S^{s-1} = {x = (x_1, ..., x_p) | x_i ∈ S^{s-1} ∀ i = 1, ..., p} = ∏_{i=1}^{p} S^{s-1}, (S11)

which we refer to as the multi-sphere following the notation of Geifman et al. (2022).
After defining a scalar product between functions on M^p S^{s-1} by direct extension of Eq. S2, one can immediately find a set of orthonormal polynomials by taking products of spherical harmonics. With the multi-index notation k = (k_1, ..., k_p), ℓ = (ℓ_1, ..., ℓ_p), for all x ∈ M^p S^{s-1},

Ỹ_{k,ℓ}(x) = ∏_{i=1}^{p} Y_{k_i,ℓ_i}(x_i), with k_i ≥ 0, ℓ_i = 1, ..., N_{k_i,s} = ((2k_i + s - 2)/k_i) \binom{s + k_i - 3}{k_i - 1}. (S12)

These product spherical harmonics Ỹ_{k,ℓ}(x) span the space of square-integrable functions on M^p S^{s-1}. Furthermore, as each spherical harmonic is an eigenfunction of the Laplace-Beltrami operator, Ỹ_{k,ℓ} is an eigenfunction of the sum of the Laplace-Beltrami operators on the p unit spheres,

∆_{p,s} Ỹ_{k,ℓ} := ( ∑_{i=1}^{p} ∆_i ) ∏_{i=1}^{p} Y_{k_i,ℓ_i} = ( ∑_{i=1}^{p} (-k_i)(k_i + s - 2) ) Ỹ_{k,ℓ}. (S13)

We can thus characterise the differentiability of functions on the multi-sphere M^p S^{s-1} via the finiteness, in L2 norm, of some power of ∆_{p,s} applied to them. Similarly, we can consider products of Legendre polynomials to obtain a set of orthogonal polynomials on [-1, +1]^p (see Geifman et al. (2022), appendix A). Then, any function f on M^p S^{s-1} × M^p S^{s-1} which depends only on the p scalar products between patches, f(x, y) = g(x_1 · y_1, ..., x_p · y_p), can be written as a sum of projections on the products of Legendre polynomials

P̃_{k,s}(t) := ∏_{i=1}^{p} P_{k_i,s}(t_i).

Following Geifman et al. (2022), we call such functions multi-dot-product kernels. When fixing one of the two arguments of f (say x), f becomes a function on M^p S^{s-1} and can be written as a sum of projections on the Ỹ_{k,ℓ}'s. The two expansions are related by the following generalised Funk-Hecke formula,

∏_{i=1}^{p} ∫_{S^{s-1}} dτ(y_i) g(x_1 · y_1, ..., x_p · y_p) Ỹ_{k,ℓ}(y) = Ỹ_{k,ℓ}(x) (|S^{s-2}|/|S^{s-1}|)^p ∏_{i=1}^{p} ∫_{-1}^{+1} dt_i (1 - t_i^2)^{(s-3)/2} P_{k_i,s}(t_i) g(t_1, ..., t_p). (S16)

Having introduced the product spherical harmonics Ỹ_{k,ℓ} as a basis of M^p S^{s-1} and the product Legendre polynomials P̃_{k,s}(t) as a basis of [-1, +1]^p, the Mercer decomposition of multi-dot-product kernels follows immediately:

K({x_i · y_i}_i) = ∑_{k≥0} ( ∏_{i=1}^{p} N_{k_i,s} ) (|S^{s-2}|/|S^{s-1}|)^p ∫_{[-1,+1]^p} ( ∏_{i=1}^{p} dt_i (1 - t_i^2)^{(s-3)/2} P_{k_i,s}(t_i) ) K({t_i}_i) P̃_{k,s}({x_i · y_i}_i) = ∑_{k≥0} Λ_k ∑_{ℓ} Ỹ_{k,ℓ}(x) Ỹ_{k,ℓ}(y). (S17)
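As a concrete instance of the decomposition in Eq. S17, the sketch below (illustrative; it assumes s ≥ 3 so that the Legendre polynomials P_{k,s} can be written as normalised Gegenbauer polynomials) computes the eigenvalue Λ_k of a multi-dot-product kernel for a given multi-index k by Gauss-Jacobi quadrature.

```python
import numpy as np
from scipy.special import roots_jacobi, eval_gegenbauer, gamma

def legendre_s(k, s, t):
    # P_{k,s}(t): Gegenbauer polynomial C_k^{alpha}, alpha = (s-2)/2, normalised so P_{k,s}(1) = 1
    a = (s - 2) / 2.0
    return eval_gegenbauer(k, a, t) / eval_gegenbauer(k, a, 1.0)

def sphere_area(s):
    # surface area |S^{s-1}| of the unit sphere in R^s
    return 2.0 * np.pi ** (s / 2.0) / gamma(s / 2.0)

def eigenvalue(K, k, s, n_quad=60):
    """Λ_k of a multi-dot-product kernel K({t_i}) via Eq. S17 (Gauss-Jacobi quadrature)."""
    p = len(k)
    t, w = roots_jacobi(n_quad, (s - 3) / 2.0, (s - 3) / 2.0)   # weight (1-t^2)^{(s-3)/2}
    ratio = sphere_area(s - 1) / sphere_area(s)                 # |S^{s-2}| / |S^{s-1}|
    grids = np.meshgrid(*([t] * p), indexing='ij')
    integrand = K(np.stack(grids, axis=-1))
    for i, k_i in enumerate(k):
        shape = [1] * p
        shape[i] = n_quad
        integrand = integrand * (ratio * w * legendre_s(k_i, s, t)).reshape(shape)
    return integrand.sum()

# example: depth-3 RFK on p = 2 patches of dimension s = 3
def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t ** 2)) / np.pi

K = lambda T: kappa1(kappa1(T).mean(axis=-1))        # κ_1((1/p) Σ_i κ_1(t_i))
for k in ([2, 0], [4, 0], [2, 2], [4, 4]):           # local versus global labels
    print(k, eigenvalue(K, k, s=3))
```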

B RFK AND NTK OF DEEP CONVOLUTIONAL NETWORKS

This appendix gives the functional forms of the RFK and NTK of hierarchical CNNs. We refer the reader to Arora et al. (2019) for the derivation.

Definition B.1 (RFK and NTK of hierarchical CNNs) Let x, y ∈ M^p S^{s-1} = ∏_{i=1}^{p} S^{s-1}. Denote tuples of the kind i_l i_{l+1} ... i_m with i_{l→m}, for m ≥ l; for m < l, i_{l→m} denotes the empty tuple. For each tuple i_{2→L+1}, denote with t_{i_{2→L+1}} the scalar product between the s-dimensional patches of x and y identified by the same tuple, i.e.

t_{i_{2→L+1}} = x_{i_{2→L+1}} · y_{i_{2→L+1}}. (S18)

For 1 ≤ l ≤ L+1, denote with {t_{i_{2→L+1}}}_{i_{2→l}} the sequence of t's obtained by letting the indices of the tuple i_{2→l} vary in their respective ranges. Consider a hierarchical CNN with L hidden layers, filter sizes (s_1, ..., s_L), p_L ≥ 1, and all the weights w^{(1)}_h, w^{(l)}_{h,h'}, w^{(L+1)}_{h,i} initialised as Gaussian random numbers with zero mean and unit variance.

RFK. The corresponding RFK (or covariance kernel) is a function K^{(L+1)}_{RFK}
of the p_1 = d/s_1 scalar products t_{i_{2→L+1}}, which can be obtained recursively as follows. With κ_1(t) = ((π - arccos t) t + √(1 - t^2))/π,

K^{(1)}_{RFK}(t_{i_{2→L+1}}) = κ_1(t_{i_{2→L+1}});
K^{(l)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→l}}) = κ_1( (1/s_l) ∑_{i_l} K^{(l-1)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→l-1}}) ), ∀ l ∈ [2..L] if L > 1;
K^{(L+1)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→L+1}}) = (1/p_L) ∑_{i_{L+1}=1}^{p_L} K^{(L)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→L}}). (S19)

NTK. The NTK of the same hierarchical CNN is also a function of the p_1 = d/s_1 scalar products t_{i_{2→L+1}}, which can be obtained recursively as follows. With κ_0(t) = (π - arccos t)/π,

K^{(1)}_{NTK}(t_{i_{2→L+1}}) = κ_1(t_{i_{2→L+1}}) + t_{i_{2→L+1}} κ_0(t_{i_{2→L+1}});
K^{(l)}_{NTK}({t_{i_{2→L+1}}}_{i_{2→l}}) = K^{(l)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→l}}) + ( (1/s_l) ∑_{i_l} K^{(l-1)}_{NTK}({t_{i_{2→L+1}}}_{i_{2→l-1}}) ) × κ_0( (1/s_l) ∑_{i_l} K^{(l-1)}_{RFK}({t_{i_{2→L+1}}}_{i_{2→l-1}}) ), ∀ l ∈ [2..L] if L > 1;
K^{(L+1)}_{NTK}({t_{i_{2→L+1}}}_{i_{2→L+1}}) = (1/p_L) ∑_{i_{L+1}=1}^{p_L} K^{(L)}_{NTK}({t_{i_{2→L+1}}}_{i_{2→L}}). (S20)

C SPECTRA OF DEEP CONVOLUTIONAL KERNELS

In this section we state and prove a generalised version of Thm. 3.1 which includes non-binary patches. Our strategy is to relate the asymptotic decay of eigenvalues to the singular behaviour of the kernel, as is customary in Fourier analysis and was done in Bietti & Bach (2021) for standard dot-product kernels. In Subsec. C.1 we perform the singular expansion of hierarchical kernels; in Subsec. C.2 we use this expansion to prove Thm. 3.1 with L = 2 (2 hidden layers) and s_1 = 2 (patches on the ring), which we then generalise to general s_1 in Subsec. C.3 and to general depth in Subsec. C.4.

Theorem C.1 (Spectrum of hierarchical kernels) Let T_K be the integral operator associated with a d-dimensional hierarchical kernel of depth L+1, L > 1, and filter sizes (s_1, ..., s_L). Eigenvalues and eigenfunctions of T_K can be organised into L sectors associated with the hidden layers of the kernel/network. For each 1 ≤ l ≤ L, the l-th sector consists of (∏_{l'=1}^{l} s_{l'})-local eigenfunctions: functions of a single meta-patch x_{i_{l+1→L+1}} which cannot be written as linear combinations of functions of smaller meta-patches. The labels k of these eigenfunctions are such that there is a meta-patch k_{i_{l+1→L+1}} of k with no vanishing sub-meta-patches, and all the k_i's outside of k_{i_{l+1→L+1}} are 0 (because the eigenfunction is constant outside of x_{i_{l+1→L+1}}). The corresponding eigenvalue is degenerate with respect to the location of the meta-patch: we call it Λ^{(l)}_{k_{i_{l+1→L+1}}}. When ∥k_{i_{l+1→L+1}}∥ → ∞, with k = ∥k_{i_{l+1→L+1}}∥:

i. if s_1 = 2, then

Λ^{(l)}_{k_{i_{l+1→L+1}}} = C_{2,l} k^{-2ν - d_eff(l)} + o(k^{-2ν - d_eff(l)}),

with ν_NTK = 1/2, ν_RFK = 3/2 and d_eff the effective dimensionality of the meta-patches defined in Eq. 3. C_{2,l} is a strictly positive constant for l ≥ 2, whereas for l = 1 it can take two distinct strictly positive values depending on the parity of k_{i_{2→L+1}};

ii. if s_1 ≥ 3, then, for fixed non-zero angles k_{i_{l+1→L+1}}/k,

Λ^{(l)}_{k_{i_{l+1→L+1}}} = C_{s_1,l}(k_{i_{l+1→L+1}}/k) k^{-2ν - d_eff(l)} + o(k^{-2ν - d_eff(l)}),

where C_{s_1,l} is a positive function for l ≥ 2, whereas for l = 1 it is a strictly positive constant which depends on the parity of k_{i_{2→L+1}}.
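A minimal sketch of the recursions S19-S20 (illustrative; the kernels are evaluated from the p_1 patch overlaps t_{i_{2→L+1}}, here stored as a flat array that is regrouped layer by layer):

```python
import numpy as np

def kappa0(t):
    t = np.clip(t, -1.0, 1.0)
    return (np.pi - np.arccos(t)) / np.pi

def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t ** 2)) / np.pi

def hierarchical_kernels(t, filter_sizes):
    """RFK and NTK of Def. B.1 from the p_1 patch overlaps t (shape (..., p_1)).

    filter_sizes = (s_1, ..., s_L); overlaps are grouped by s_2, then s_3, etc.
    Returns (K_RFK, K_NTK) of the depth-(L+1) hierarchical network.
    """
    rfk, ntk = kappa1(t), kappa1(t) + t * kappa0(t)            # first hidden layer
    for s_l in filter_sizes[1:]:                               # layers l = 2..L
        rfk_avg = rfk.reshape(rfk.shape[:-1] + (-1, s_l)).mean(axis=-1)
        ntk_avg = ntk.reshape(ntk.shape[:-1] + (-1, s_l)).mean(axis=-1)
        ntk = kappa1(rfk_avg) + ntk_avg * kappa0(rfk_avg)      # Eq. S20
        rfk = kappa1(rfk_avg)                                  # Eq. S19
    return rfk.mean(axis=-1), ntk.mean(axis=-1)                # output layer: average over p_L

# toy usage: depth-4 kernel, filter sizes (2, 2, 2), so p_1 = 4 patches and p_L = 1
rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 8))
x = x.reshape(4, 2) / np.linalg.norm(x.reshape(4, 2), axis=1, keepdims=True)
y = y.reshape(4, 2) / np.linalg.norm(y.reshape(4, 2), axis=1, keepdims=True)
t = np.sum(x * y, axis=1)                                      # the p_1 patch overlaps
print(hierarchical_kernels(t, (2, 2, 2)))
```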

C.1 SINGULAR EXPANSION OF HIERARCHICAL KERNELS

Both the RFK and NTK of ReLU networks, whether deep or shallow, are built by applying the two functions κ_0 and κ_1 (Cho & Saul, 2009) (see also Def. B.1),

κ_0(t) = (π - arccos t)/π,  κ_1(t) = ((π - arccos t) t + √(1 - t^2))/π. (S23)

The functions κ_0 and κ_1 are non-analytic at t = ±1, with the following singular expansions (Bietti & Bach, 2021). Near t = 1, with u = 1 - t,

κ_0(1 - u) = 1 - (√2/π) u^{1/2} + O(u^{3/2}),
κ_1(1 - u) = 1 - u + (2√2/(3π)) u^{3/2} + O(u^{5/2}). (S24)

Near t = -1, with u = 1 + t,

κ_0(-1 + u) = (√2/π) u^{1/2} + O(u^{3/2}),
κ_1(-1 + u) = (2√2/(3π)) u^{3/2} + O(u^{5/2}). (S25)

As a result, hierarchical kernels have a singular expansion when the t_{i_{2→L+1}}'s are close to ±1. In particular, the following expansions are relevant for computing the asymptotic scaling of eigenvalues.

Proposition C.1 (RFK when x = y) The RFK of a hierarchical network of depth L+1, filter sizes (s_1, ..., s_L) and p_L ≥ 1 has the following singular expansion when all t_{i_{2→L+1}} → 1. With u_{i_{2→L+1}} = 1 - t_{i_{2→L+1}}, c = 2√2/(3π), and ∏_{l∈I} s_l := 1 if I is the empty set,

K^{(L+1)}_{RFK}({1 - u_{i_{2→L+1}}}_{i_{2→L+1}}) = 1 - (1/((∏_{2≤l'≤L} s_{l'}) p_L)) ∑_{i_{2→L+1}} u_{i_{2→L+1}} + (c/p_L) ∑_{l'=1}^{L} (1/(∏_{l'<l''≤L} s_{l''})) ∑_{i_{l'+1→L+1}} ( (∑_{i_{2→l'}} u_{i_{2→L+1}}) / (∏_{2≤l''≤l'} s_{l''}) )^{3/2} + O(u^{5/2}_{i_{2→L+1}}). (S26)

Proof. With L = 1 one has (recall that i_{2→1+1} = i_{2→2} reduces to a single index)

K^{(1)}_{RFK}(1 - u_{i_2}) = 1 - u_{i_2} + c u_{i_2}^{3/2} + O(u_{i_2}^{5/2}) ⇒ K^{(1+1)}_{RFK}({1 - u_{i_2}}_{i_2}) = 1 - (1/p_1) ∑_{i_2} u_{i_2} + (c/p_1) ∑_{i_2} u_{i_2}^{3/2} + O(u_{i_2}^{5/2}). (S27)

With L = 2,

K^{(2)}_{RFK}({1 - u_{i_2,i_3}}_{i_2}) = κ_1( 1 - (1/s_2) ∑_{i_2} u_{i_2,i_3} + (c/s_2) ∑_{i_2} u_{i_2,i_3}^{3/2} + O(u_{i_2,i_3}^{5/2}) )
 = 1 - (1/s_2) ∑_{i_2} u_{i_2,i_3} + (c/s_2) ∑_{i_2} u_{i_2,i_3}^{3/2} + c ( (1/s_2) ∑_{i_2} u_{i_2,i_3} )^{3/2} + O(u_{i_2,i_3}^{5/2}), (S28)

therefore

K^{(2+1)}_{RFK}({1 - u_{i_2,i_3}}_{i_2,i_3}) = 1 - (1/(s_2 p_2)) ∑_{i_2,i_3} u_{i_2,i_3} + (c/(p_2 s_2)) ∑_{i_2,i_3} u_{i_2,i_3}^{3/2} + (c/p_2) ∑_{i_3} ( (1/s_2) ∑_{i_2} u_{i_2,i_3} )^{3/2} + O(u_{i_2,i_3}^{5/2}). (S29)

The proof of the general case follows by induction, by applying the function κ_1 to the singular expansion of the kernel with L-1 hidden layers and then using Eq. S24.

Proposition C.2 (RFK when x = -y) The RFK of a hierarchical network of depth L+1, filter sizes (s_1, ..., s_L) and p_L ≥ 1 has the following singular expansion when all t_{i_{2→L+1}} → -1. With u_{i_{2→L+1}} = 1 + t_{i_{2→L+1}}, c = 2√2/(3π) and ∏_{l∈I} s_l := 1 if I is the empty set,

K^{(L+1)}_{RFK}({-1 + u_{i_{2→L+1}}}_{i_{2→L+1}}) = b_L + (c_L/((∏_{2≤l'≤L} s_{l'}) p_L)) ∑_{i_{2→L+1}} u_{i_{2→L+1}}^{3/2} + O(u_{i_{2→L+1}}^{5/2}), (S30)

with b_L = κ_1(b_{L-1}), b_1 = 0; and c_L = c_{L-1} κ'_1(b_{L-1}), c_1 = c.

Proof. This can be proved again by induction. For L = 1,

K^{(1)}_{RFK}(-1 + u_{i_2}) = c u_{i_2}^{3/2} + O(u_{i_2}^{5/2}) ⇒ K^{(1+1)}_{RFK}({-1 + u_{i_2}}_{i_2}) = (c/p_1) ∑_{i_2} u_{i_2}^{3/2} + O(u_{i_2}^{5/2}). (S31)

Thus, for L = 2,

K^{(2)}_{RFK}({-1 + u_{i_2,i_3}}_{i_2}) = κ_1( (c/s_2) ∑_{i_2} u_{i_2,i_3}^{3/2} + O(u_{i_2,i_3}^{5/2}) ) = κ_1(0) + κ'_1(0) (c/s_2) ∑_{i_2} u_{i_2,i_3}^{3/2} + O(u_{i_2,i_3}^{5/2}), (S32)

so that

K^{(2+1)}_{RFK}({-1 + u_{i_2,i_3}}_{i_2,i_3}) = κ_1(0) + (κ'_1(0) c/(s_2 p_2)) ∑_{i_2,i_3} u_{i_2,i_3}^{3/2} + O(u_{i_2,i_3}^{5/2}). (S33)

The proof is completed by applying the function κ_1 to the singular expansion of the kernel with L-1 hidden layers.

Proposition C.3 (NTK when x = y) The NTK of a hierarchical network of depth L+1, filter sizes (s_1, ..., s_L) and p_L ≥ 1 has the following singular expansion when all t_{i_{2→L+1}} → 1.
With u_{i_{2→L+1}} = 1 - t_{i_{2→L+1}}, c = √2/π, and ∏_{l∈I} s_l := 1 if I is the empty set,

K^{(L+1)}_{NTK}({1 - u_{i_{2→L+1}}}_{i_{2→L+1}}) = L + 1 - (c/p_L) ∑_{l'=1}^{L} (l'/(∏_{l'<l''≤L} s_{l''})) ∑_{i_{l'+1→L+1}} ( (1/(∏_{2≤l''≤l'} s_{l''})) ∑_{i_{2→l'}} u_{i_{2→L+1}} )^{1/2} + O(u^{3/2}_{i_{2→L+1}}). (S34)

Proposition C.4 (NTK when x = -y) The NTK of a hierarchical network of depth L+1, filter sizes (s_1, ..., s_L) and p_L ≥ 1 has the following singular expansion when all t_{i_{2→L+1}} → -1. With u_{i_{2→L+1}} = 1 + t_{i_{2→L+1}}, c = √2/π and ∏_{l∈I} s_l := 1 if I is the empty set,

K^{(L+1)}_{NTK}({-1 + u_{i_{2→L+1}}}_{i_{2→L+1}}) = a_L + (c_L/((∏_{2≤l'≤L} s_{l'}) p_L)) ∑_{i_{2→L+1}} u_{i_{2→L+1}}^{3/2} + O(u_{i_{2→L+1}}^{5/2}), (S35)

with a_L = b_L + b_{L-1} κ_0(b_{L-1}), b_L = κ_1(b_{L-1}), b_1 = 0; and c_L = c_{L-1} κ_0(b_{L-1}), c_1 = c. Notice that both κ_1 and κ_0 are positive and strictly increasing in [0, 1] and κ_1(1) = κ_0(1) = 1, thus b_L ∈ (0, 1) and c_L < c_{L-1}. The proofs of the two propositions above are omitted, as they follow the exact same steps as the previous two proofs.
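A quick numerical check of the singular expansions S24-S25 (a sketch; the printed ratios should approach constants as u → 0, confirming the order of the remainders):

```python
import numpy as np

def kappa0(t):
    return (np.pi - np.arccos(np.clip(t, -1, 1))) / np.pi

def kappa1(t):
    t = np.clip(t, -1, 1)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1 - t ** 2)) / np.pi

c = 2 * np.sqrt(2) / (3 * np.pi)
for u in (1e-2, 1e-3, 1e-4):
    # near t = 1: κ_0(1-u) ≈ 1 - (√2/π) u^{1/2},  κ_1(1-u) ≈ 1 - u + c u^{3/2}
    err0 = kappa0(1 - u) - (1 - np.sqrt(2) / np.pi * np.sqrt(u))
    err1 = kappa1(1 - u) - (1 - u + c * u ** 1.5)
    # near t = -1: κ_0(-1+u) ≈ (√2/π) u^{1/2},  κ_1(-1+u) ≈ c u^{3/2}
    err2 = kappa0(-1 + u) - np.sqrt(2) / np.pi * np.sqrt(u)
    err3 = kappa1(-1 + u) - c * u ** 1.5
    # the ratios below stay bounded, consistent with remainders O(u^{3/2}) and O(u^{5/2})
    print(u, err0 / u ** 1.5, err1 / u ** 2.5, err2 / u ** 1.5, err3 / u ** 2.5)
```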

C.2 PATCHES ON THE RING

In this section, we prove a restricted version of Thm. 3.1 for the case of 2-dimensional input patches, since the reduction of spherical harmonics to the Fourier basis simplifies the proof significantly. We also consider, for convenience, hierarchical kernels of depth 3 with the filter size of the second hidden layer set to p = d/2, the total number of 2-patches of the input. Once this case is understood, the extension to arbitrary filter size and arbitrary depth is trivial.

Theorem C.2 (Spectrum of depth-3 kernels on 2-patches) Let T_K be the integral operator associated with a d-dimensional hierarchical kernel of depth 3 (2 hidden layers), with filter sizes (s_1 = 2, s_2) and p_2 = 1, such that 2 s_2 = d and s_2 = p (the number of 2-patches). Eigenvalues and eigenfunctions of T_K can be organised into 2 sectors associated with the hidden layers of the kernel/network.

i. The first sector consists of s_1-local eigenfunctions, which are functions of a single patch x_i, for i = 1, ..., p. The labels k, ℓ of local eigenfunctions are such that all the k_j's with j ≠ i are zero (because the eigenfunction is constant outside x_i). The corresponding eigenvalue is degenerate with respect to the location of the patch: we call it Λ^{(1)}_{k_i}. When k_i → ∞,

Λ^{(1)}_{k_i} = C_{2,1} k_i^{-2ν-1} + o(k_i^{-2ν-1}),

with ν_NTK = 1/2, ν_RFK = 3/2. C_{2,1} can take two distinct strictly positive values depending on the parity of k_i;

ii. The second sector consists of global eigenfunctions, which are functions of the whole input x. The labels k, ℓ of global eigenfunctions are such that at least two of the k_i's are nonzero. We call the corresponding eigenvalue Λ^{(2)}_k. When ∥k∥ → ∞, with k = ∥k∥,

Λ^{(2)}_k = C_{2,2} k^{-2ν-p} + o(k^{-2ν-p}).

Proof. If we consider binary patches in the first layer, the input space becomes the Cartesian product of p two-dimensional unit spheres, i.e. circles, X = ∏_{i=1}^{p} S^1. Then, each patch x_i corresponds to an angle θ_i and the spherical harmonics are equivalent to Fourier atoms,

Y_0(θ) = 1, Y_{k,1}(θ) = e^{ikθ}, Y_{k,2}(θ) = e^{-ikθ}, ∀k ≥ 1. (S38)

Therefore, solving the eigenvalue problem for a dot-product kernel K(x·y) = K(cos(θ_x - θ_y)) with x, y ∈ S^1 reduces to computing its Fourier transform. With |S^0| = 2 and |S^1| = 2π,

(1/2π) ∫_{-π}^{π} dθ_x K(cos(θ_x - θ_y)) e^{±ikθ_x} = Λ_k e^{±ikθ_y} ⇒ Λ_k = (1/2π) ∫_{-π}^{π} dθ K(cos θ) e^{±ikθ}, (S39)

where we denoted with θ the difference between the two angles. Similarly, for a multi-dot-product kernel, the eigenvalues coincide with the p-dimensional Fourier transform of the kernel, where p is the number of patches,

Λ_k = (1/(2π)^p) ∫_{[-π,π]^p} ∏_{i=1}^{p} dθ_i e^{±i k_i θ_i} K({cos θ_i}_{i=1}^{p}) = (1/(2π)^p) ∫_{[-π,π]^p} d^pθ e^{±i k·θ} K({cos θ_i}_{i=1}^{p}),

with k = (k_1, ..., k_p)^⊤ the vector of the patch wavevectors and θ = (θ_1, ..., θ_p)^⊤ the vector of the patch angle differences θ_i = θ_{x,i} - θ_{y,i}. The nonanalyticity of the kernel at t_i = 1 for all i moves to θ_i = 0 for all i, whereas those at t_i = -1 move to θ_i = π and -π. The corresponding singular expansion is obtained from Eq. S26 after replacing t_i with cos θ_i and expanding cos θ_i as 1 - θ_i²/2, resulting in

K^{(2)}_{RFK}({cos θ_i}_{i=1}^{p}) = 1 - (1/(2p)) ∑_{i=1}^{p} θ_i^2 + (1/(3πp)) ∑_{i=1}^{p} |θ_i|^3 + (2√2/(3π)) ( (1/p) ∑_{i=1}^{p} θ_i^2/2 )^{3/2} + ∑_{i=1}^{p} O(θ_i^4). (S41)

The first nonanalytic terms are (1/(3πp)) ∑_{i=1}^{p} |θ_i|^3 and (2√2/(3π)) ( (1/p) ∑_{i=1}^{p} θ_i^2/2 )^{3/2}.
After recalling that the Fourier transform of ∥θ∥^{2ν} with θ ∈ R^p decays asymptotically as ∥k∥^{-2ν-p} (Widom, 1963), one has (ν = 3/2)

(1/(2π)^p) ∫_{[-π,π]^p} d^pθ e^{±i k·θ} (1/(3πp)) ∑_{i=1}^{p} |θ_i|^3 ∼ ∑_{i=1}^{p} k_i^{-4} ∏_{j≠i} δ_{k_j,0}, for ∥k∥ → ∞, (S42)

and

(1/(2π)^p) ∫_{[-π,π]^p} d^pθ e^{±i k·θ} ∥θ∥^3 ∼ ∥k∥^{-p-3}, for ∥k∥ → ∞. (S43)

All the other terms in the kernel expansion result in subleading contributions to the Fourier transform. Therefore, the former of the two equations above yields the asymptotic scaling of the eigenvalues of the local sector, whereas the latter yields the asymptotic scaling of the global sector. The proof for the NTK case is analogous to the RFK case, except that the singular expansion near θ_i = 0 is given by

K^{(2)}_{NTK}({cos θ_i}_{i=1}^{p}) = 3 - (1/p) ∑_{i=1}^{p} |θ_i|/2 - (√2/π) ( (1/p) ∑_{i=1}^{p} θ_i^2/2 )^{1/2} + ∑_{i=1}^{p} O(θ_i^{3/2}).
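A small numerical illustration of the two decays in Eqs. S42-S43 (a sketch; the functions below behave as |θ|³ and ∥θ∥³ near the origin, up to constant factors, and are smooth elsewhere on the torus so that boundary effects do not interfere):

```python
import numpy as np

# 1d: a periodic function behaving as |θ/2|^3 near θ = 0 and smooth elsewhere
N = 4096
theta = np.linspace(-np.pi, np.pi, N, endpoint=False)
c1 = np.abs(np.fft.fft(np.abs(np.sin(theta / 2)) ** 3)) / N
for k in (8, 16, 32, 64):
    print(k, c1[k] * k ** 4)                        # ≈ constant: decay k^{-4} (Eq. S42)

# 2d: a periodic function behaving as (||θ||/2)^3 near θ = 0 and smooth elsewhere
M = 512
t = np.linspace(-np.pi, np.pi, M, endpoint=False)
T1, T2 = np.meshgrid(t, t, indexing='ij')
f = (np.sin(T1 / 2) ** 2 + np.sin(T2 / 2) ** 2) ** 1.5
c2 = np.abs(np.fft.fft2(f)) / M ** 2
for k in (8, 16, 32):
    print(k, c2[k, k] * (np.sqrt(2.0) * k) ** 5)    # ≈ constant: decay ||k||^{-5} (Eq. S43)
```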

C.3 PATCHES ON THE s-DIMENSIONAL HYPERSPHERE

In this section, we make an additional step towards Thm. 3.1 by extending Thm. C.2 to the case of s-dimensional input patches. We still consider hierarchical kernels of depth 3 with the filter size of the second hidden layer set to p = d/s (the total number of s-patches of the input) so as to ease the presentation. The extension to general depth and filter sizes is presented in Subsec. C.4.

Theorem C.3 (Spectrum of depth-3 kernels on s-patches) Let T_K be the integral operator associated with a d-dimensional hierarchical kernel of depth 3 (2 hidden layers), with filter sizes (s_1 = s, s_2) and p_2 = 1, such that s_1 s_2 = d and s_2 = p (the number of s-patches). Eigenvalues and eigenfunctions of T_K can be organised into 2 sectors associated with the hidden layers of the kernel/network.

i. The first sector consists of s_1-local eigenfunctions, which are functions of a single patch x_i, for i = 1, ..., p. The labels k, ℓ of local eigenfunctions are such that all the k_j's with j ≠ i are zero (because the eigenfunction is constant outside of x_i). The corresponding eigenvalue is degenerate with respect to the location of the patch: we call it Λ^{(1)}_{k_i}. When k_i → ∞,

Λ^{(1)}_{k_i} = C_{s,1} k_i^{-2ν-(s-1)} + o(k_i^{-2ν-(s-1)}), (S45)

with ν_NTK = 1/2, ν_RFK = 3/2. C_{s,1} can take two distinct strictly positive values depending on the parity of k_i;

ii. The second sector consists of global eigenfunctions, which are functions of the whole input x. The labels k, ℓ of global eigenfunctions are such that at least two of the k_i's are nonzero. We call the corresponding eigenvalue Λ^{(2)}_k. When k ≡ ∥k∥ → ∞, for fixed non-zero angles k/k,

Λ^{(2)}_k = C_{s,2}(k/k) k^{-2ν-p(s-1)} + o(k^{-2ν-p(s-1)}), (S46)

where C_{s,2} is a positive function.

Proof. A hierarchical RFK/NTK is a multi-dot-product kernel, therefore its eigenfunctions are products of spherical harmonics, Ỹ_{k,ℓ}(x) = ∏_{i=1}^{p} Y_{k_i,ℓ_i}(x_i), and the eigenvalues of K are given by Eq. S17,

Λ_k = (|S^{s-2}|/|S^{s-1}|)^p ∫_{[-1,+1]^p} ( ∏_{i=1}^{p} dt_i (1 - t_i^2)^{(s-3)/2} P_{k_i,s}(t_i) ) K({t_i}_i). (S47)

The proof follows this strategy: first, we show that the infinitely differentiable part of K results in eigenvalues which decay faster than any polynomial of the degrees k_i. We then show that the decay is controlled by the most singular term of the singular expansion of the kernel, and finally compute such decay by relating it to the number of derivatives of the kernel having a finite L2 norm.

When K is infinitely differentiable in [-1, +1]^p, we can plug the Rodrigues formula (Eq. S5) for each P_{k_i,s}(t_i) and get

Λ_k = ( ∏_{i=1}^{p} (|S^{s-2}|/|S^{s-1}|) (-1/2)^{k_i} Γ((s-1)/2)/Γ(k_i + (s-1)/2) ) ∫_{[-1,+1]^p} dt K(t) ∏_{i=1}^{p} (d^{k_i}/dt_i^{k_i}) (1 - t_i^2)^{k_i + (s-3)/2}, (S48)

with ∫_{[-1,+1]^p} dt denoting integration over the p-dimensional hypercube [-1, +1]^p. We can simplify the integral further via integration by parts, so as to obtain

Λ_k = ( ∏_{i=1}^{p} (|S^{s-2}|/|S^{s-1}|) (1/2^{k_i}) Γ((s-1)/2)/Γ(k_i + (s-1)/2) ) ∫_{[-1,+1]^p} dt K^{(k)}(t) ∏_{i=1}^{p} (1 - t_i^2)^{k_i + (s-3)/2}, (S49)

where K^{(k)} denotes the partial derivative of order k_1 with respect to t_1, k_2 with respect to t_2, and so on until k_p with respect to t_p. Notice that the function (1 - t^2)^{(d-3)/2} is proportional to the probability measure of the scalar product t between two points sampled uniformly at random on the unit sphere (Atkinson & Han, 2012, Sec. 1.3),

|S^{d-1}| = ∫_{-1}^{+1} dt (1 - t^2)^{(d-3)/2} ∫_{S^{d-2}} dS^{d-2} ⇒ (|S^{d-2}|/|S^{d-1}|) ∫_{-1}^{+1} dt (1 - t^2)^{(d-3)/2} = 1. (S50)

This probability measure converges weakly to a Dirac mass δ(t) when d → ∞.
Recall, in addition, that |S^{d-1}| = 2\pi^{d/2}/\Gamma(d/2), where Γ denotes the Gamma function, \Gamma(z) = \int_0^\infty dx\, x^{z-1}e^{-x}. Thus, (1-t^2)^{k+\frac{s-3}{2}} converges weakly to a Dirac measure δ(t) as k → ∞, once properly rescaled. In particular, choosing k_i such that k_i + (s-3)/2 = (d-3)/2, one has

\lim_{k_i\to\infty} \frac{\Gamma\!\left(k_i+\frac{s}{2}\right)}{\sqrt{\pi}\,\Gamma\!\left(k_i+\frac{s-1}{2}\right)}\left(1-t_i^2\right)^{k_i+\frac{s-3}{2}} = \delta(t_i).   (S51)

As a result, when K is infinitely differentiable, one has the following equivalence in the limit where all the k_i's are large,

\Lambda_{\mathbf{k}} \sim \prod_{i=1}^p \frac{|S^{s-2}|}{|S^{s-1}|}\frac{1}{2^{k_i}}\frac{\Gamma\!\left(\frac{s-1}{2}\right)}{\Gamma\!\left(k_i+\frac{s}{2}\right)}\, K^{(\mathbf{k})}(\mathbf{0}),

which implies that, when K is infinitely differentiable, the eigenvalues decay exponentially or faster with the k_i's.

Let us now consider the non-analytic part of K. There are three kinds of terms appearing in the singular expansion of depth-3 kernels (cf. Subsec. C.1): ia) c_i^{+}(1-t_i)^{\nu} near t_i = +1; ib) c_i^{-}(1+t_i)^{\nu} near t_i = -1; ii) c^{+,\mathrm{all}}\left(\sum_i (1-t_i)/p\right)^{\nu} near t_i = +1 for all i; where the exponent ν is 1/2 for the NTK and 3/2 for the RFK. We will not consider terms of the kind ib) explicitly, as their analysis is equivalent to that of terms of the kind ia). After replacing t_i with cos θ_i, as in Subsec. C.2, we get again \sum_i |\theta_i|^{2\nu} and \|\theta\|^{2\nu} as leading non-analytic terms. Therefore, we can rewrite the non-analytic part of the kernel as follows,

K_{\mathrm{n.a.}}(\theta) = \sum_i f_1(|\theta_i|) + f_2(\|\theta\|) + \tilde{K}(\theta),

where f_1, f_2 are single-variable functions which behave as θ^{2ν} near zero and have compact support, whereas \tilde{K} has a singular expansion near θ_i = 0 analogous to that of K, but with leading non-analyticities controlled by an exponent ν' ≥ ν + 1. Let us look at the contribution to the eigenvalue Λ_k due to the term f_1(|θ_i|):

\left(\prod_{j=1}^p \frac{|S^{s-2}|}{|S^{s-1}|}\int_0^\pi d\theta_j\, (\sin\theta_j)^{s-2} P_{k_j,s}(\cos\theta_j)\right) f_1(|\theta_i|) = \left(\prod_{j\neq i}\delta_{k_j,0}\right)\frac{|S^{s-2}|}{|S^{s-1}|}\int_0^\pi d\theta\, (\sin\theta)^{s-2} P_{k_i,s}(\cos\theta)\, f_1(|\theta|) = \left(\prod_{j\neq i}\delta_{k_j,0}\right)(f_1)_{k_i},

where we have introduced (f_1)_k as the projection of f_1(θ) on the k-th Legendre polynomial. The asymptotic decay of (f_1)_k is strictly related to the differentiability of f_1, which is in turn controlled by the action of the Laplace-Beltrami operator ∆ on f_1. As a function on the sphere S^{s-1}, f_1 depends only on one angle, therefore the Laplace-Beltrami operator acts as follows,

\Delta f_1(\theta) = \frac{1}{(\sin\theta)^{s-2}}\frac{d}{d\theta}\left((\sin\theta)^{s-2}\frac{df_1}{d\theta}(\theta)\right) = f_1''(\theta) + (s-2)\frac{\cos\theta}{\sin\theta}\,f_1'(\theta).

In terms of the singular behaviour near θ = 0, f_1(θ) ∼ |θ|^{2ν} implies ∆f_1(θ) ∼ |θ|^{2ν-2}, thus ∆^m f_1(θ) ∼ |θ|^{2(ν-m)}. Given ν, repeated applications of ∆ eventually result in a function whose L2 norm on the sphere diverges. On the one hand,

\|\Delta^{m/2}f_1\|^2 = \int_0^\pi d\theta\, (\sin\theta)^{s-2}\, f_1(\theta)\,\Delta^m f_1(\theta).

The integrand behaves as |θ|^{s-2+4ν-2m} near 0, thus the integral diverges for m ≥ 2ν + (s-1)/2. On the other hand, from Eq. S3,

\|\Delta^{m/2}f_1\|^2 = \sum_k N_{k,s}\,\left(k(k+s-2)\right)^m\,|(f_1)_k|^2.

As N_{k,s} ∼ k^{s-2} and the sum must converge for m < 2ν + (s-1)/2 and diverge otherwise, (f_1)_k ∼ k^{-2ν-(s-1)}. The projections of all the other terms in K on Legendre polynomials of one of the p angles θ_i display a faster decay with k, therefore the above results imply the asymptotic scaling of local eigenvalues. Notice that such scaling matches the result of Bietti & Bach (2021), which was obtained with a different argument.

Finally, let us look at the contribution to the eigenvalue Λ_k due to the term f_2(∥θ∥):

\left(\prod_{j=1}^p \frac{|S^{s-2}|}{|S^{s-1}|}\int_0^\pi d\theta_j\, (\sin\theta_j)^{s-2} P_{k_j,s}(\cos\theta_j)\right) f_2(\|\theta\|) = (f_2)_{\mathbf{k}},

where we have introduced (f_2)_k as the projection of f_2(∥θ∥) on the multi-Legendre polynomial with multi-degree k. The asymptotic decay of (f_2)_k is again related to the differentiability of f_2, controlled by the action of the multi-sphere Laplace-Beltrami operator ∆_{p,s} in Eq. S13. As f_2 depends only on one angle per sphere,

\Delta_{p,s} f_2(\|\theta\|) = \sum_{i=1}^p \left(\partial^2_{\theta_i} f_2(\|\theta\|) + (s-2)\frac{\cos\theta_i}{\sin\theta_i}\,\partial_{\theta_i} f_2(\|\theta\|)\right).   (S59)

Further simplifications occur because f_2 depends only on the norm of θ. In terms of the singular behaviour near ∥θ∥ = 0, f_2 ∼ ∥θ∥^{2ν} implies ∆^m_{p,s} f_2 ∼ ∥θ∥^{2(ν-m)}, thus

\|\Delta^{m/2}_{p,s} f_2\|^2 = \int_{[0,\pi]^p} d^p\theta \left(\prod_{i=1}^p (\sin\theta_i)^{s-2}\right) f_2(\|\theta\|)\,\Delta^m_{p,s} f_2(\|\theta\|) < +\infty   (S60)

requires m < 2ν + p(s-1)/2 (compare with m < 2ν + (s-1)/2 for the local contributions). Therefore, one has

\|\Delta^{m/2}_{p,s} f_2\|^2 = \sum_{\mathbf{k}} \left(\prod_{i=1}^p N_{k_i,s}\right)\left(\sum_{i=1}^p k_i(k_i+s-2)\right)^m |(f_2)_{\mathbf{k}}|^2 < +\infty \quad \forall\, m < 2\nu + p(s-1)/2,   (S61)

while the sum diverges for m ≥ 2ν + p(s-1)/2. In addition, since f_2 is a radial function of θ which is homogeneous (or scale-invariant) near ∥θ∥ = 0, (f_2)_k can be factorised in the large-∥k∥ limit into a power of the norm, ∥k∥^α, and a finite angular part C(k/∥k∥). By plugging this factorisation into Eq. S61, we get

(f_2)_{\mathbf{k}} \sim C\!\left(\frac{\mathbf{k}}{\|\mathbf{k}\|}\right)\|\mathbf{k}\|^{-2\nu-p(s-1)}, \qquad \sum_{\mathbf{k},\,\|\mathbf{k}\|=k} \left(\prod_{i=1}^p (k_i/k)^{s-2}\right) C\!\left(\frac{\mathbf{k}}{\|\mathbf{k}\|}\right)^2 < +\infty.   (S62)

The projections of all the other terms in K on multi-Legendre polynomials display a faster decay with ∥k∥, therefore the above results imply the asymptotic scaling of global eigenvalues.
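The decay (f_1)_k ∼ k^{-2ν-(s-1)} obtained above can be checked numerically. The sketch below (not part of the paper) projects f_1(θ) = θ^{2ν}, smoothly cut off away from θ = 0 by the illustrative window (1 + cos θ)^4/16, onto Legendre polynomials for s = 3 and fits the decay exponent.

```python
# Numerical illustration (not from the paper): projections of f1(theta) = theta^{2 nu},
# cut off away from theta = 0, onto Legendre polynomials P_k(cos theta) for patches on
# S^2 (s = 3); the decay should match k^{-2 nu - (s-1)}.
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

s = 3  # patch dimension: patches live on S^{s-1} = S^2

def projection(k, nu):
    # (f1)_k = |S^{s-2}|/|S^{s-1}| * int_0^pi sin(theta)^{s-2} P_k(cos theta) f1(theta) dtheta
    f1 = lambda t: t ** (2 * nu) * (1 + np.cos(t)) ** 4 / 16.0
    integrand = lambda t: np.sin(t) * eval_legendre(k, np.cos(t)) * f1(t)
    val, _ = quad(integrand, 0.0, np.pi, limit=1000, epsabs=1e-13)
    return 0.5 * val  # |S^1| / |S^2| = 1/2

for nu in (0.5, 1.5):
    ks = np.arange(8, 41, 4)   # even degrees only, to avoid mixing parities
    cs = np.array([abs(projection(k, nu)) for k in ks])
    slope = np.polyfit(np.log(ks), np.log(cs), 1)[0]
    print(f"nu = {nu}: measured exponent {slope:.2f}, predicted {-(2 * nu + (s - 1)):.2f}")
```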

C.4 GENERAL DEPTH

The generalisation to arbitrary depth is trivial once the depth-3 case is understood. For global and s_1-local eigenvalues, the analysis of the previous section carries over unaltered. All the other intermediate sectors correspond to the other terms in the singular expansion of the kernel: from Subsec. C.1, these terms can be written as

\frac{c}{p_L\prod_{l'<l''\le L} s_{l''}}\sum_{i_{l'+1\to L+1}}\left(\frac{1}{\prod_{2\le l''\le l'} s_{l''}}\sum_{i_{2\to l'}}\left(1-t_{i_{2\to L+1}}\right)\right)^{\nu},

for some l' = 2, ..., L-1 and fractional ν. In practice, this term is a sum over the p_{l'} = p_L\prod_{l'<l''\le L}s_{l''} meta-patches of t having size s_{2\to l'} := \prod_{2\le l''\le l'}s_{l''}. Each summand is the fractional power ν of the average of the t_i's within a meta-patch. When plugging such a term into Eq. S47, the integrals over the t_i's which do not belong to that meta-patch yield Kronecker deltas for the corresponding k_i's. The integrals over the t_i's within the meta-patch, instead, can be written as in Eq. S58 with the product and the norm restricted to the elements of that meta-patch, i.e., \|\theta\| \to \left(\sum_{i_{2\to l'}}\theta^2_{i_{2\to L+1}}\right)^{1/2}. Therefore, the scaling of the eigenvalue with k is given again by Eq. S63, but with p replaced by the size of the meta-patch \prod_{2\le l''\le l'}s_{l''}, so that the effective dimension of Eq. 3 appears in the exponent.
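As an illustration, the following helper (a sketch, not the paper's code) lists the per-sector eigenvalue decay exponents 2ν + d_eff(l) for a given architecture. It assumes the effective dimension of Eq. 3 to be d_eff(l) = (s_1 - 1) s_2 ⋯ s_l, i.e. the number of coordinates in a layer-l meta-patch minus one spherical constraint per s_1-patch.

```python
# Helper sketch (not from the paper's code): per-sector eigenvalue decay exponents for a
# hierarchical kernel with filter sizes (s_1, ..., s_L). Assumption: d_eff(l) =
# (s_1 - 1) * s_2 * ... * s_l (meta-patch size minus one constraint per s_1-patch).
from math import prod

def sector_exponents(filter_sizes, nu):
    """Return [(l, meta_patch_size, d_eff(l), eigenvalue_exponent)] for l = 1..L."""
    out = []
    for l in range(1, len(filter_sizes) + 1):
        meta = prod(filter_sizes[:l])                            # size of a layer-l meta-patch
        d_eff = (filter_sizes[0] - 1) * prod(filter_sizes[1:l])  # spherical constraints removed
        out.append((l, meta, d_eff, 2 * nu + d_eff))             # Lambda^(l)_k ~ k^{-(2 nu + d_eff(l))}
    return out

# Example: depth-four NTK (nu = 1/2) with binary filters, as in the experiments of App. G.
for l, meta, d_eff, expo in sector_exponents((2, 2, 2), nu=0.5):
    print(f"sector l={l}: meta-patch size {meta}, d_eff={d_eff}, eigenvalue decay k^-{expo}")
```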

D ADAPTIVITY

This appendix provides an introduction to classical generalisation bounds for kernel regression and extends Cor. 4.1 to patches on the hypersphere.

D.1 CLASSICAL GENERALISATION BOUNDS

Consider the regression setting detailed in Sec. 4 of the main text. First, assume that the target function f* belongs to the RKHS H of the kernel K. Then, without further assumptions on K, we have the following dimension-free bound on the excess risk, based on Rademacher complexity (Bach, 2021, Chs. 4, 7; Bietti, 2022),

\epsilon(\lambda, n) - \epsilon(f^*) \le C\, \|f^*\|_{\mathcal{H}}\, \sqrt{\frac{\mathrm{Tr}(T_K)}{n}},

where T_K is the integral operator associated with K. For a hierarchical kernel, having a target with more power in the local sectors can result in a smaller ∥f*∥_H, hence a smaller excess risk. However, this gain amounts only to a constant factor in terms of sample complexity and, more importantly, belonging to the RKHS requires an order of smoothness which is typically of the order of the dimension: a very restrictive assumption in high-dimensional settings.

This result can be extended by including more details about the kernel and the target function. In particular, (Bach, 2021, Prop. 7.2) states that, for f* in the closure of H, regularisation λ ≤ 1 and n \ge \frac{5}{\lambda}\left(1+\log(1/\lambda)\right), one has

\epsilon(\lambda, n) - \epsilon(f^*) \le 16\,\frac{\sigma^2}{n}\,\mathrm{Tr}\!\left((T_K+\lambda I)^{-1}T_K\right) + 16\,\inf_{f\in\mathcal{H}}\left(\|f-f^*\|^2_{L_2} + \lambda\|f\|^2_{\mathcal{H}}\right) + \frac{24}{n^2}\,\|f^*\|_{L_\infty},   (S65)

where σ² bounds the conditional variance of the labels, i.e. \mathbb{E}_{(x,y)\sim p}\left[(y - f^*(x))^2\,|\,x\right] < \sigma^2. Then, let us consider the following standard assumptions in the kernel literature (Caponnetto & De Vito, 2007),

capacity: \mathrm{Tr}\!\left(T_K^{1/\alpha}\right) = \sum_{\mathbf{k},\boldsymbol{\ell}} (\Lambda_{\mathbf{k}})^{1/\alpha} < +\infty, \qquad source: \left\|T_K^{\frac{1-r}{2}} f^*\right\|^2_{\mathcal{H}} = \sum_{\mathbf{k},\boldsymbol{\ell}} (\Lambda_{\mathbf{k}})^{-r}\,(f^*_{\mathbf{k},\boldsymbol{\ell}})^2 < +\infty.   (S66)

In short, the first assumption characterises the 'size' of the RKHS (the larger α, the smaller the number of functions in the RKHS), while the second assumption defines the regularity of the target function relative to that of the kernel (when r = 1, f* ∈ H; when r < 1, f* is less smooth; when r > 1, f* is smoother). Combining these assumptions with Eq. S65, one gets

\epsilon(\lambda, n) - \epsilon(f^*) \le 16\,\frac{\sigma^2}{n}\,C_1\lambda^{-1/\alpha} + 16\,C_2\lambda^r + \frac{24}{n^2}\,\|f^*\|_{L_\infty}.

Optimising over λ results in

\lambda_n = \left(\frac{C_1\sigma^2}{\alpha r C_2\, n}\right)^{\frac{\alpha}{\alpha r + 1}},

and the bound becomes

\epsilon(\lambda_n, n) - \epsilon(f^*) \lesssim C_2^{\frac{1}{\alpha r+1}}\left(\frac{C_1\sigma^2}{n}\right)^{\frac{\alpha r}{\alpha r+1}} + \frac{1}{n^2}\,\|f^*\|_{L_\infty}.   (S69)

Finally, when r > (α-1)/α, the condition n \ge \frac{5}{\lambda_n}\left(1+\log(1/\lambda_n)\right) is always satisfied for n large enough.

D.2 EXTENSION OF COR. 4.1 TO PATCHES ON THE HYPERSPHERE

Corollary D.1 (Adaptivity to spatial structure). Let T_K be the integral operator of the kernel of a hierarchical deep CNN as in Thm. 3.1. Then: i) the capacity exponent α is controlled by the largest sector of the spectrum, i.e.

\mathrm{Tr}\!\left(T_K^{1/\alpha}\right) < +\infty \;\Leftrightarrow\; \alpha < 1 + 2\nu/d_{\mathrm{eff}}(L);   (S70)

ii) the source exponent r is controlled by the structure of the target function f*, i.e., if there is l ≤ L such that f* depends only on some meta-patch x_{i_{l+1\to L+1}}, then only the first l sectors of the spectrum contribute to the source condition,

\left\|T_K^{\frac{1-r}{2}} f^*\right\|^2_{\mathcal{H}} = \sum_{l'=1}^{l}\;\sum_{i_{l'+1\to L+1}}\;\sum_{\mathbf{k}_{i_{l'+1\to L+1}}}\;\sum_{\boldsymbol{\ell}_{i_{l'+1\to L+1}}} \left(\Lambda^{(l')}_{\mathbf{k}_{i_{l'+1\to L+1}}}\right)^{-r}\left(f^*_{\mathbf{k}_{i_{l'+1\to L+1}},\,\boldsymbol{\ell}_{i_{l'+1\to L+1}}}\right)^2.   (S71)

The same holds if f* is a linear combination of such functions. As a result, when d_eff(L) is large and α → 1, the decay of the error is controlled by the effective dimensionality of the target, d_eff(l).
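As a small worked example (not from the paper's code), the snippet below evaluates the optimal regularisation λ_n and the resulting rate exponent αr/(αr+1) of Eq. S69 for given capacity and source exponents; the constants C_1, C_2 and σ are placeholders.

```python
# Sketch (not from the paper's code): optimal ridge and rate exponent from the bound above.
def optimal_ridge_and_rate(alpha, r, n, C1=1.0, C2=1.0, sigma=0.1):
    lam_n = (C1 * sigma**2 / (alpha * r * C2 * n)) ** (alpha / (alpha * r + 1))
    rate_exponent = alpha * r / (alpha * r + 1)   # epsilon(lam_n, n) - epsilon(f*) ~ n^{-rate}
    return lam_n, rate_exponent

for n in (10**2, 10**3, 10**4):
    lam, beta = optimal_ridge_and_rate(alpha=1.2, r=0.8, n=n)
    print(f"n={n:>6d}: lambda_n={lam:.3e}, predicted error decay n^-{beta:.3f}")
```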

D.3 DECAY OF EIGENVALUES WITH THE RANK

Shallow kernels. Consider a depth-two kernel with filters of size s. Our goal is to compute the scaling of the eigenvalues of the kernel Λ_k with their rank ρ. The eigenvalues decay with k as

\Lambda_{\mathbf{k}} \sim \sum_{i=1}^p k_i^{-2\nu-(s-1)}\prod_{j\neq i}\delta_{k_j,0}.   (S72)

In order to take into account their algebraic multiplicity, we introduce the eigenvalue density D(Λ), whose asymptotic form for small eigenvalues is

D(\Lambda) = \sum_{\mathbf{k},\boldsymbol{\ell}} \delta(\Lambda - \Lambda_{\mathbf{k}}) \sim \sum_{\mathbf{k}} \prod_{i=1}^p k_i^{s-2}\,\delta\!\left(\Lambda - \sum_{i=1}^p k_i^{-2\nu-(s-1)}\prod_{j\neq i}\delta_{k_j,0}\right) \sim \sum_{i=1}^p\sum_{k_i} k_i^{s-2}\,\delta\!\left(\Lambda - k_i^{-2\nu-(s-1)}\right) \sim \int_1^\infty dk\, k^{s-2}\,\delta\!\left(\Lambda - k^{-2\nu-(s-1)}\right) \sim \Lambda^{-1-\frac{s-1}{2\nu+(s-1)}}.   (S73)

Thus, the scaling of Λ(ρ) can be determined self-consistently,

\rho = \int_{\Lambda(\rho)}^{\Lambda(1)} d\Lambda\, D(\Lambda) \sim \Lambda(\rho)^{-\frac{s-1}{2\nu+(s-1)}} \;\Rightarrow\; \Lambda(\rho) \sim \rho^{-1-\frac{2\nu}{s-1}}.   (S74)

Deep kernels. Consider a kernel of depth L+1 with filter sizes (s_1, ..., s_L) and p_L = 1. For each sector l, one can compute the density of eigenvalues D^{(l)}(Λ). Depending on s_1, there are two different cases. If s_1 = 2,

D^{(l)}(\Lambda) = \sum_{\mathbf{k}} \delta(\Lambda - \Lambda^{(l)}_{\mathbf{k}}) \sim \sum_{i_{l+1\to L+1}}\sum_{\mathbf{k}_{i_{l+1\to L+1}}} \delta\!\left(\Lambda - \mathcal{C}_{2,l}\,\|\mathbf{k}_{i_{l+1\to L+1}}\|^{-2\nu-d_{\mathrm{eff}}(l)}\right) \sim \int_1^\infty dk\, k^{d_{\mathrm{eff}}(l)-1}\,\delta\!\left(\Lambda - \mathcal{C}_{2,l}\, k^{-2\nu-d_{\mathrm{eff}}(l)}\right) \sim \Lambda^{-1-\frac{d_{\mathrm{eff}}(l)}{2\nu+d_{\mathrm{eff}}(l)}}.   (S75)

If s_1 ≥ 3,

D^{(l)}(\Lambda) = \sum_{\mathbf{k},\boldsymbol{\ell}} \delta(\Lambda - \Lambda^{(l)}_{\mathbf{k}}) \sim \sum_{i_{l+1\to L+1}}\;\sum_{\mathbf{k}_{i_{l+1\to L+1}},\,\boldsymbol{\ell}_{i_{l+1\to L+1}}} \delta\!\left(\Lambda - \mathcal{C}_{s_1,l}\!\left(\frac{\mathbf{k}_{i_{l+1\to L+1}}}{\|\mathbf{k}_{i_{l+1\to L+1}}\|}\right)\|\mathbf{k}_{i_{l+1\to L+1}}\|^{-2\nu-d_{\mathrm{eff}}(l)}\right) \sim \Lambda^{-1-\frac{d_{\mathrm{eff}}(l)}{2\nu+d_{\mathrm{eff}}(l)}}.   (S76)

When summing over all layers l, the asymptotic behaviour of the total density of eigenvalues D(Λ) = Σ_l D^{(l)}(Λ) is dictated by the density of the sector with the slowest decay, i.e. the last one. Hence,

D(\Lambda) \sim \Lambda^{-1-\frac{d_{\mathrm{eff}}(L)}{2\nu+d_{\mathrm{eff}}(L)}}.   (S77)

Therefore, similarly to the shallow case, one finds self-consistently that the ρ-th eigenvalue of the kernel decays as

\Lambda(\rho) \sim \rho^{-1-\frac{2\nu}{d_{\mathrm{eff}}(L)}}.   (S78)
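The self-consistent scaling of Eq. S74 can be checked directly by enumerating the eigenvalues with their degeneracies. The sketch below (illustrative parameters, not the paper's code) does so for s = 3, where the number of degree-k spherical harmonics is N_{k,3} = 2k + 1.

```python
# Numerical check (not from the paper) of Eq. S74: for a depth-two kernel with s-dimensional
# patches, the rank-ordered eigenvalues decay as rho^{-(1 + 2 nu/(s-1))}. Local eigenvalues
# k^{-2 nu - (s-1)} are enumerated with their degeneracy (p patch locations, N_{k,s}
# harmonics per degree) and the decay with the rank is fitted.
import numpy as np

s, p, nu = 3, 8, 0.5          # patch dimension, number of patches, NTK smoothness exponent
kmax = 200

eigs = []
for k in range(1, kmax + 1):
    mult = p * (2 * k + 1)     # p patch locations times N_{k,3} = 2k + 1 spherical harmonics
    eigs.extend([k ** (-(2 * nu + (s - 1)))] * mult)
eigs = np.sort(np.array(eigs))[::-1]

ranks = np.arange(1, len(eigs) + 1)
sel = (ranks > 10**2) & (ranks < 10**4)   # fit in the bulk, away from the edges
slope = np.polyfit(np.log(ranks[sel]), np.log(eigs[sel]), 1)[0]
print(f"measured decay {slope:.3f}, predicted {-(1 + 2 * nu / (s - 1)):.3f}")
```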

E STATISTICAL MECHANICS OF GENERALISATION IN KERNEL REGRESSION

In Bordelon et al. (2020); Canatar et al. (2021), the authors derived a heuristic expression for the average-case mean squared error of kernel (ridge) regression with the replica method of statistical physics (Mézard et al., 1987). Denoting by {ϕ_ρ(x), Λ_ρ}_{ρ≥1} the eigenfunctions and eigenvalues of the kernel and by c_ρ the coefficients of the target function in this basis, i.e. f^*(x) = \sum_{\rho\ge 1} c_\rho\,\phi_\rho(x), one has

\epsilon(\lambda, n) = \partial_\lambda\!\left(\frac{\kappa_\lambda(n)}{n}\right)\sum_\rho \frac{\kappa_\lambda(n)^2}{\left(n\Lambda_\rho + \kappa_\lambda(n)\right)^2}\,\mathbb{E}[c_\rho^2],   (S79)

where λ is the ridge and κ_λ(n) satisfies the implicit equation

\frac{\kappa_\lambda(n)}{n} = \lambda + \frac{1}{n}\sum_\rho \frac{\Lambda_\rho\,\kappa_\lambda(n)/n}{\Lambda_\rho + \kappa_\lambda(n)/n}.   (S80)

In short, the replica calculation used to obtain these equations consists in defining an energy functional E(f) related to the empirical MSE and assigning to the predictor f a Boltzmann measure, i.e. P(f) ∝ e^{-βE(f)}. When β → ∞, the measure concentrates around the minimum of E(f), which coincides with the minimiser of the empirical MSE. Then, since E(f) depends only quadratically on the projections c_ρ, computing the average over data that appears in the definition of the generalisation error reduces to computing Gaussian integrals. While non-rigorous, this method has been successfully used in physics, e.g. to study disordered systems, and in machine learning theory. In particular, the predictions obtained with Eq. S79 and Eq. S80 have been validated numerically for both synthetic and real datasets.

In Eq. S79, κ_λ(n)/n plays the role of a threshold: the modal contributions to the error tend to 0 for ρ such that Λ_ρ ≫ κ_λ(n)/n, and to E[c_ρ²] for ρ such that Λ_ρ ≪ κ_λ(n)/n. This is equivalent to saying that kernel regression can capture only the modes corresponding to eigenvalues larger than κ_λ(n)/n (see also Jacot et al. (2020a;b)). In the ridgeless limit λ → 0⁺, this threshold asymptotically tends to the n-th eigenvalue of the student, resulting in the intuitive picture presented in the main text: given n training points, ridgeless regression learns the n projections corresponding to the largest eigenvalues.

In particular, assume that the kernel spectrum and the target-function projections decay as power laws, Λ_ρ ∼ ρ^{-a} and E[c_ρ²] ∼ ρ^{-b}, with 2a > b - 1. Furthermore, we can approximate the summations over modes with an integral by using the Euler-MacLaurin formula, and we substitute the eigenvalues with their asymptotic limit Λ_ρ = Aρ^{-a}. Since κ_0(n)/n → 0 as n → ∞, these two operations result in an error which is asymptotically independent of n. In particular,

\frac{\kappa_0(n)}{n} = \frac{\kappa_0(n)}{n}\left[\frac{1}{n}\int_0^\infty \frac{A\rho^{-a}}{A\rho^{-a} + \kappa_0(n)/n}\,d\rho + O(1)\right] = \frac{\kappa_0(n)}{n}\left[\frac{1}{n}\left(\frac{\kappa_0(n)}{n}\right)^{-\frac{1}{a}}\int_0^\infty \frac{\sigma^{\frac{1}{a}-1}\, A^{\frac{1}{a}}\, a^{-1}}{1+\sigma}\,d\sigma + O(1)\right].   (S81)

Since the integral over σ is finite and independent of n, we obtain that κ_0(n)/n = O(n^{-a}). Similarly, we find that the mode-independent prefactor \partial_\lambda\left(\kappa_\lambda(n)/n\right)\big|_{\lambda=0} = O(1). As a result, we have

\epsilon(n) \sim \sum_\rho \frac{n^{-2a}}{\left(A\rho^{-a} + n^{-a}\right)^2}\,\mathbb{E}[c_\rho^2].   (S82)

Following the intuitive argument about the thresholding action of κ_0(n)/n ∼ n^{-a}, we can split the summation in Eq. S82 into modes where Λ_ρ ≫ κ_0(n)/n, Λ_ρ ∼ κ_0(n)/n and Λ_ρ ≪ κ_0(n)/n,

\epsilon(n) \sim \sum_{\rho\ll n} \frac{n^{-2a}}{\left(A\rho^{-a}\right)^2}\,\mathbb{E}[c_\rho^2] + \sum_{\rho\sim n}\frac{1}{2}\,\mathbb{E}[c_\rho^2] + \sum_{\rho\gg n}\mathbb{E}[c_\rho^2].   (S83)

Finally, Eq. 12 is obtained by noticing that, under the assumption on the decay of E[c_ρ²], the contribution of the summation over ρ ≪ n is subleading in n, whereas the other two can be merged together.
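To make the above concrete, the following sketch (not the paper's code) solves the implicit equation S80 numerically for a synthetic power-law spectrum and evaluates Eq. S79. The exponents a, b and the spectrum size M are illustrative choices, and the prefactor ∂_λ(κ_λ(n)/n) is obtained by differentiating Eq. S80 at the solution.

```python
# Sketch (not from the paper's code): replica prediction S79-S80 for a power-law spectrum
# Lambda_rho = rho^{-a}, E[c_rho^2] = rho^{-b}. With 2a > b - 1 the error should decay as
# n^{-(b-1)}.
import numpy as np
from scipy.optimize import brentq

a, b, M = 1.5, 1.8, 500_000
rho = np.arange(1, M + 1, dtype=float)
lam_rho = rho ** -a                 # kernel eigenvalues
c2_rho = rho ** -b                  # second moments of the target coefficients

def replica_error(n, ridge=1e-10):
    # u = kappa_lambda(n)/n solves u = ridge + (1/n) sum_rho Lambda_rho u / (Lambda_rho + u)
    f = lambda u: u - ridge - (lam_rho * u / (lam_rho + u)).sum() / n
    u = brentq(f, ridge * 1e-8, ridge + lam_rho.sum() / n + 1.0)
    # d(kappa/n)/d(lambda) = 1 / (1 - gamma), obtained by differentiating Eq. S80
    gamma = (lam_rho ** 2 / (lam_rho + u) ** 2).sum() / n
    return (1.0 / (1.0 - gamma)) * ((u ** 2 / (lam_rho + u) ** 2) * c2_rho).sum()

ns = np.array([50, 100, 200, 400, 800])
errs = np.array([replica_error(n) for n in ns])
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(f"measured decay n^{slope:.2f}, predicted n^{-(b - 1):.2f}")
```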

F EXAMPLES

F.1 RATES FROM SPECTRAL BIAS ANSATZ

Consider a target function f* which depends only on the meta-patch x_{i_{l+1→L+1}} and has square-integrable derivatives up to order m, i.e. ∥∆^{m/2}f*∥² < +∞, with ∆ denoting the Laplace operator. Moreover, consider a hierarchical kernel of depth L+1 with filter sizes (s_1, ..., s_L) and p_L = 1. We want to compute the asymptotic scaling of the error by using Eq. 12, i.e.

\epsilon(n) \sim \sum_{\mathbf{k},\boldsymbol{\ell}\;\text{s.t.}\;\Lambda_{\mathbf{k}} < \Lambda(n)} |f^*_{\mathbf{k},\boldsymbol{\ell}}|^2.   (S84)

In the previous section, we showed that the n-th eigenvalue of the kernel Λ(n) decays as

\Lambda(n) \sim n^{-1-\frac{2\nu}{d_{\mathrm{eff}}(L)}}.   (S85)

Since by construction the target function depends only on a meta-patch of the l-th sector, the non-zero projections are the ones on eigenfunctions of the first l sectors. Thus, all the k's corresponding to the sectors of layers with l' > l do not contribute to the sum. In particular, the sum is dominated by the k's of the largest contributing sector, and the set {k s.t. Λ_k < Λ(n)} is the set of k_{i_{l+1→L+1}}'s with norm larger than n^{\frac{2\nu+d_{\mathrm{eff}}(L)}{(2\nu+d_{\mathrm{eff}}(l))\,d_{\mathrm{eff}}(L)}}. Finally, we notice that the finite-norm condition on the derivatives,

\|\Delta^{m/2}f^*\|^2 = \sum_{\mathbf{k}} \left(\prod_{i=1}^p N_{k_i,s}\right)\left(\sum_{i=1}^p k_i(k_i+s-2)\right)^m |f^*_{\mathbf{k},\boldsymbol{\ell}}|^2 < +\infty,   (S86)

implies |f^*_{\mathbf{k},\boldsymbol{\ell}}|^2 \lesssim \|\mathbf{k}\|^{-2m-d_{\mathrm{eff}}(l)} (see Subsec. C.3). Hence, plugging everything into Eq. S84, we find

\epsilon(n) \sim n^{-\frac{2m}{2\nu+d_{\mathrm{eff}}(l)}\,\frac{2\nu+d_{\mathrm{eff}}(L)}{d_{\mathrm{eff}}(L)}}.   (S87)
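For convenience, the following helper (a sketch, not the paper's code) evaluates the predicted exponent of Eq. S87 for a student with given filter sizes and a target supported on a layer-l meta-patch; as above, it assumes d_eff(l) = (s_1 - 1) s_2 ⋯ s_l.

```python
# Sketch (not from the paper's code): predicted learning-curve exponent of Eq. S87.
from math import prod

def predicted_exponent(filter_sizes, l, m, nu):
    """beta such that epsilon(n) ~ n^{-beta}, assuming d_eff(j) = (s_1-1)*s_2*...*s_j."""
    L = len(filter_sizes)
    d_eff = lambda j: (filter_sizes[0] - 1) * prod(filter_sizes[1:j])
    return 2 * m / (2 * nu + d_eff(l)) * (2 * nu + d_eff(L)) / d_eff(L)

# Depth-four NTK student (nu = 1/2), target generated by a depth-two NTK teacher (m = 1/2):
# expected value (1/s_1) * (1 + d_eff(3)) / d_eff(3) = 0.625 for binary filters.
print(predicted_exponent((2, 2, 2), l=1, m=0.5, nu=0.5))
```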

G.2 TEACHER-STUDENT LEARNING CURVES

In order to obtain the learning curves, we generate n + n_test random points uniformly distributed on the product of hyperspheres over the patches. We use n ∈ {128, 256, 512, 1024, 2048, 4096, 8192} and n_test = 8192. For each value of n, we sample a Gaussian random field with zero mean and covariance given by the teacher kernel. Then, we compute the kernel regression predictor of the student kernel and estimate the generalisation error as the mean squared error of the obtained predictor on the n_test unseen examples. The expectation over the teacher randomness is obtained by averaging over 16 independent sets of random input points and realisations of the Gaussian random fields. As teacher and student kernels, we use the analytical forms of the neural tangent kernels of hierarchical convolutional networks, with different combinations of depths and filter sizes. The predicted asymptotic scalings ϵ ∼ n^{-β} are reported as dashed lines. A minimal sketch of this protocol is given at the end of this subsection.

Depth-two and depth-three architectures. Fig. S1 reports the learning curves of depth-two and depth-three kernels with binary filters at all layers. Depth-three students defeat the curse of dimensionality when learning depth-two teachers, achieving a performance similar to that of depth-two students matched to the teacher's structure. However, as we predict, these students encounter the curse of dimensionality when learning depth-three teachers.

Ternary filters. Fig. S2 reports the learning curves for kernels with 3-dimensional filters and confirms our predictions in the s_1 ≥ 3 case.

Comparison with the noisy and optimally-regularised case. Panel (a) of Fig. S3 compares the learning curves obtained in the optimally-regularised and ridgeless cases for noisy and noiseless data, respectively. The first case corresponds to the setting studied in Caponnetto & De Vito (2007), in which the source-capacity formalism applies. In contrast with the second setting, which is the one used in the teacher-student scenarios and in which the correspondence between kernel methods and neural networks holds, i) we add to the labels Gaussian random noise with standard deviation σ = 0.1, and ii) for each n, we select the ridge resulting in the best generalisation performance. We observe that the decay obtained in the bound derived from the source-capacity conditions is exactly the one found numerically, i.e. the rate of the bound is tight. As a further check, panel (b) shows that the optimal ridge decays as prescribed.
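The following sketch (not the paper's code) illustrates the protocol just described: a Gaussian random field with teacher-kernel covariance is sampled via a Cholesky factorisation, a (near-)ridgeless kernel regression predictor is fitted with the student kernel, and the test mean squared error is measured. The Laplace-type kernel, the input dimension and the sample sizes are illustrative placeholders; in the experiments the kernels are the hierarchical NTKs/RFKs.

```python
# Minimal sketch of the teacher-student protocol (illustrative kernel and sizes,
# not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 8, 1024, 2048

def normalise_patches(x, s=2):
    # project each s-dimensional patch onto the unit sphere (inputs on the multisphere)
    x = x.reshape(len(x), -1, s)
    return (x / np.linalg.norm(x, axis=2, keepdims=True)).reshape(len(x), -1)

def kernel(X, Y):
    # placeholder isotropic kernel; in the experiments this is a hierarchical NTK/RFK
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)))

X = normalise_patches(rng.standard_normal((n + n_test, d)))
K_teacher = kernel(X, X)
# Gaussian random field with zero mean and covariance given by the teacher kernel
f_star = np.linalg.cholesky(K_teacher + 1e-10 * np.eye(n + n_test)) @ rng.standard_normal(n + n_test)

X_tr, X_te, y_tr, y_te = X[:n], X[n:], f_star[:n], f_star[n:]
coeffs = np.linalg.solve(kernel(X_tr, X_tr) + 1e-10 * np.eye(n), y_tr)  # ridgeless limit
pred = kernel(X_te, X_tr) @ coeffs
print("test MSE:", np.mean((pred - y_te) ** 2))
```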

G.3 ILLUSTRATION OF DIFFERENT TEACHER-STUDENT SCENARIOS

In this subsection, we comment on the results obtained in the different teacher-student scenarios of Fig. 2, panel (a), and Fig. S1, panel (a). To ease notation, in the following we always consider the NTK for both teacher and student kernels, i.e. smoothness exponents ν_T = ν_S = 1/2. However, we point out that when the teacher kernel is a hierarchical RFK (ν_T = 3/2), the target function corresponds to the output of an infinitely-wide, deep hierarchical network at initialisation. The error rates are obtained from Eq. 17, after setting the smoothness exponent m = ν_T (the smoothness exponent of the teacher covariance kernel).

The first case we consider consists of one-hidden-layer convolutional teacher and student kernels, for which

\epsilon(n) \sim n^{-\frac{1}{s_1-1}}.

As highlighted in blue, the output of the teacher is a linear combination (dashed lines indicate the linear output weights) of s_1-dimensional functions of the input patches. If the structure of the student is matched to that of the teacher, the learning problem becomes effectively (s_1-1)-dimensional and the error decays as n^{-1/(s_1-1)}, instead of n^{-1/d_{eff}}, with d_eff the total input dimension minus the number of spherical constraints (one per patch). Notice that the role of the student's structure, i.e. the algorithm, is as crucial as the role of the teacher, i.e. the task. Indeed, using a fully-connected student with no prior on the task's locality would result in an error decay cursed by dimensionality. However, in contrast to fully-connected students, shallow convolutional students are only able to learn tasks with the same structure. In particular, any task entailing non-linear interactions between patches, which are arguably crucial in order to learn image data, belongs to their null space.

As we illustrated in the main text, to lift this strong constraint on the hypothesis space, one has to consider deep convolutional architectures. In particular, consider the same shallow teacher of the previous paragraph learnt by a depth-four convolutional student, for which

\epsilon(n) \sim n^{-\frac{1}{s_1}\frac{1+d_{\mathrm{eff}}(3)}{d_{\mathrm{eff}}(3)}}.

Remarkably, this student is able to learn the teacher without being cursed by the input dimensionality. Indeed, as the number of patches diverges, the error decay asymptotes to n^{-1/s_1}. This rate is slightly worse than the one obtained by the student matched with the teacher, which is proven to be the Bayes-optimal case, but far from being cursed. Intuitively, this fast rate is obtained because the student eigenfunctions of the first sector, i.e. those constant outside a single patch, correspond to large eigenvalues and bias the learning dynamics towards s_1-local functions. Yet, this student is also able to represent functions which are considerably more complex.

Now consider a depth-three teacher learned by a depth-four student. As highlighted in orange, the output of the teacher is a linear combination of a composition of non-linear functions acting on patches and coupling them. In this setting, the error decay is controlled by the effective dimension of the second layer. In fact, when the number of patches diverges, the error decay asymptotes to n^{-1/d_{eff}(2)}. In general, this behaviour is a result of what we called 'adaptivity to the spatial structure' of the target. Finally, consider teacher and student with the complete hierarchy, i.e. where the receptive fields of the neurons in the penultimate layer coincide with the full input.
In this case, we show that the error decays as n^{-1/d_{eff}(3)}, i.e. the rate is cursed by the input dimension. The physical meaning of this result is that the hierarchical structure we are considering is still too complex and cannot be learnt efficiently. In other words, these hierarchical convolutional networks are excellent students, since they can adapt to the spatial structure of the task, but bad teachers, since they generate global functions which are too complex to be learnt efficiently.

G.4 EXTENSIONS TO DIFFERENT NORMALISATIONS AND OVERLAPPING PATCHES

This section investigates the robustness of our results to changes in the input distribution, i.e., for data outside the multisphere M^p S^{s-1}, and relaxes the non-overlapping-patches assumption.

Inputs in R^d. While our analysis requires each patch of the input data to be normalised so as to lie on a unit sphere, this is not the standard normalisation used for neural networks. Therefore, in this section we investigate the robustness of our predictions to the data distribution. In particular, we consider data uniformly distributed in the unit hypercube, i.e., x ∈ [0,1]^d, and data with standard Gaussian distribution, i.e., x ∼ N(0, I_d). First, we extend the definition of the RFK and NTK to inputs in R^d (a minimal numerical sketch of the resulting recursion is given at the end of this appendix).

Definition G.1 (RFK and NTK of hierarchical CNNs for inputs in R^d). Let x, y ∈ R^d. Denote tuples of the kind i_l i_{l+1} ... i_m with i_{l→m} for m ≥ l; for m < l, i_{l→m} denotes the empty tuple. For each tuple i_{2→L+1} and s a divisor of d, denote with t_{i_{2→L+1}} the angle between the s-dimensional patches of x and y identified by the same tuple, i.e.

t_{i_{2\to L+1}} = \frac{x_{i_{2\to L+1}}\cdot y_{i_{2\to L+1}}}{\|x_{i_{2\to L+1}}\|\,\|y_{i_{2\to L+1}}\|}.   (S88)

For 1 ≤ l ≤ L+1, denote with \{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l}} the sequence of patches obtained by letting the indices of the tuple i_{2→l} vary in their respective ranges. Consider a hierarchical CNN with filter sizes (s_1, ..., s_L), p_L ≥ 1, and all the weights w^{(1)}_{h,i}, w^{(l)}_{h,h',i}, w^{(L+1)}_{h,i} initialised as Gaussian random numbers with zero mean and unit variance.

RFK. The corresponding RFK (or covariance kernel) can be obtained recursively as follows. With \kappa_1(t) = \left((\pi - \arccos t)\, t + \sqrt{1-t^2}\right)/\pi,

K^{(1)}_{\mathrm{RFK}}(x_{i_{2\to L+1}}, y_{i_{2\to L+1}}) = \|x_{i_{2\to L+1}}\|\,\|y_{i_{2\to L+1}}\|\,\kappa_1(t_{i_{2\to L+1}});

K^{(l)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l}}\right) = \sqrt{\frac{1}{s_l}\sum_{i_l}\|x_{i_{l\to L+1}}\|^2}\,\sqrt{\frac{1}{s_l}\sum_{i_l}\|y_{i_{l\to L+1}}\|^2}\;\kappa_1\!\left(\frac{\frac{1}{s_l}\sum_{i_l} K^{(l-1)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l-1}}\right)}{\sqrt{\frac{1}{s_l}\sum_{i_l}\|x_{i_{l\to L+1}}\|^2}\,\sqrt{\frac{1}{s_l}\sum_{i_l}\|y_{i_{l\to L+1}}\|^2}}\right);

K^{(L+1)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to L+1}}\right) = \frac{1}{p_L}\sum_{i_{L+1}=1}^{p_L} K^{(L)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to L}}\right).

NTK. The NTK of the same hierarchical CNN can be obtained recursively as follows. With \kappa_0(t) = (\pi - \arccos t)/\pi,

K^{(1)}_{\mathrm{NTK}}(x_{i_{2\to L+1}}, y_{i_{2\to L+1}}) = \|x_{i_{2\to L+1}}\|\,\|y_{i_{2\to L+1}}\|\,\kappa_1(t_{i_{2\to L+1}}) + \left(x_{i_{2\to L+1}}\cdot y_{i_{2\to L+1}}\right)\kappa_0(t_{i_{2\to L+1}});

K^{(l)}_{\mathrm{NTK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l}}\right) = K^{(l)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l}}\right) + \frac{1}{s_l}\sum_{i_l} K^{(l-1)}_{\mathrm{NTK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l-1}}\right)\,\kappa_0\!\left(\frac{\frac{1}{s_l}\sum_{i_l} K^{(l-1)}_{\mathrm{RFK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to l-1}}\right)}{\sqrt{\frac{1}{s_l}\sum_{i_l}\|x_{i_{l\to L+1}}\|^2}\,\sqrt{\frac{1}{s_l}\sum_{i_l}\|y_{i_{l\to L+1}}\|^2}}\right);

K^{(L+1)}_{\mathrm{NTK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to L+1}}\right) = \frac{1}{p_L}\sum_{i_{L+1}=1}^{p_L} K^{(L)}_{\mathrm{NTK}}\!\left(\{x_{i_{2\to L+1}}, y_{i_{2\to L+1}}\}_{i_{2\to L}}\right).

Overlapping patches. Fig. S5 shows the comparison between convolutional kernels with non-overlapping patches, i.e., stride equal to the filter size, and overlapping patches, i.e., stride 1, for inputs uniform in the d-dimensional hypercube. Although our theoretical analysis requires the patches to be non-overlapping, our predictions are still confirmed for architectures with overlapping patches.

G.5 CIFAR-2 LEARNING CURVES

Fig. S6 shows the learning curves of the neural tangent kernels of different architectures applied to pairs of classes of the CIFAR-10 dataset. In particular, the task is built by selecting two CIFAR-10 classes, e.g. plane and car, and assigning label +1 to the elements belonging to one class and label -1 to the remaining ones. Learning is again achieved by minimising the empirical mean squared error using a 'student' kernel.
We find that the kernels with the worst performance are the ones corresponding to shallow fully-connected and convolutional architectures. Instead, for all the pairs of classes considered here, deep hierarchical convolutional kernels achieve the best performance. 
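Returning to Def. G.1 above, the sketch below (not the paper's code) computes the RFK and NTK of a hierarchical CNN with non-overlapping patches for a pair of inputs in R^d. As an assumption of this sketch, the self-kernels K(x,x) and K(y,y) are propagated through the same recursion to provide the normalisation appearing in the argument of κ_1; per-layer constant factors may therefore differ from the paper's convention.

```python
# Minimal sketch of the hierarchical RFK/NTK recursion of Def. G.1 (non-overlapping
# patches, inputs in R^d); normalisation conventions are an assumption of this sketch.
import numpy as np

def kappa0(t):
    t = np.clip(t, -1.0, 1.0)
    return (np.pi - np.arccos(t)) / np.pi

def kappa1(t):
    t = np.clip(t, -1.0, 1.0)
    return ((np.pi - np.arccos(t)) * t + np.sqrt(1.0 - t**2)) / np.pi

def hierarchical_kernels(x, y, filter_sizes):
    """RFK and NTK of a hierarchical CNN with filter sizes (s_1, ..., s_L)."""
    s1 = filter_sizes[0]
    xp, yp = x.reshape(-1, s1), y.reshape(-1, s1)   # s_1-dimensional patches
    nx, ny = (xp**2).sum(1), (yp**2).sum(1)         # patch squared norms
    dot = (xp * yp).sum(1)
    t = dot / np.sqrt(nx * ny)
    rfk = np.sqrt(nx * ny) * kappa1(t)              # first-layer RFK, one value per patch
    ntk = rfk + dot * kappa0(t)                     # first-layer NTK
    qx, qy = nx.copy(), ny.copy()                   # self-kernels K(x,x), K(y,y)
    for s in filter_sizes[1:]:
        qx, qy = qx.reshape(-1, s).mean(1), qy.reshape(-1, s).mean(1)
        m_rfk = rfk.reshape(-1, s).mean(1)
        m_ntk = ntk.reshape(-1, s).mean(1)
        norm = np.sqrt(qx * qy)
        cos = m_rfk / norm
        rfk = norm * kappa1(cos)
        ntk = rfk + m_ntk * kappa0(cos)
    return rfk.mean(), ntk.mean()                   # average over the p_L remaining patches

# Example: depth-four architecture with binary filters acting on d = 16 inputs.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(16), rng.standard_normal(16)
print(hierarchical_kernels(x, y, filter_sizes=(2, 2, 2)))
```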



Footnotes.
- Notice that all our results can be readily extended to image-like input signals {x_{ij}}_{i,j} or tensorial objects with an arbitrary number of indices.
- We show in Subsec. G.4 that our predictions remain true if the inputs are sampled uniformly in the d-dimensional hypercube [0,1]^d or from a Gaussian distribution on R^d.
- We are again limiting the presentation to the case s = 2, but the extension to the general case is immediate.
- See, e.g., Lee et al. (2017) for the equivalence between infinitely-wide networks and Gaussian random fields with covariance given by the RFK.



Figure 2: Learning curves for deep convolutional NTKs in a teacher-student setting. a. Depth-four student learning depth-two, depth-three, and depth-four teachers. b. Depth-three models cursed by the effective input dimensionality d_eff(L). The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realisations, with the shaded areas representing the empirical standard deviations. The predicted asymptotic scalings ϵ ∼ n^{-β} are reported as dashed lines. Details on the numerical experiments are reported in App. G.

Lemma 5.1 (Curse of dimensionality for hierarchical targets). The problem of regression of the output of a randomly-initialised and infinitely-wide hierarchical network suffers from the curse of dimensionality, in the sense that no method using n examples can achieve a generalisation error decaying faster than n^{-β} with β = 3/d_eff(L).


Experiments were run on a high-performance computing cluster with nodes having Intel Xeon Gold processors with 20 cores and 192 GB of DDR4 RAM. All codes are written in PyTorch (Paszke et al., 2019).

Figure S1: Learning curves for deep convolutional NTKs (ν = 1/2) in a teacher-student setting. a. Depth-two teachers learned by depth-two (matched) and depth-three (mismatched) students. Neither of these students is cursed by the input dimension. b. Depth-three students learning depth-two and depth-three teachers. These students are cursed only in the second case. The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realisations, with the shaded areas representing the empirical standard deviations. The predicted asymptotic scalings ϵ ∼ n^{-β} are reported as dashed lines.

Figure S2: Learning curves for deep convolutional NTKs (ν = 1/2) with filters of size 3 in a teacher-student setting. a. Depth-three students learning depth-two and depth-three teachers. These students are cursed only in the second case. b. Depth-three models are cursed by the effective input dimensionality. The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realisations, with the shaded areas representing the empirical standard deviations. The predicted asymptotic scalings ϵ ∼ n^{-β} are reported as dashed lines.

Figure S4: Learning curves for deep convolutional NTKs (ν = 1/2) in a teacher-student setting with different input normalisations. In particular, we consider inputs on the multisphere M^p S^{s-1} (MS.), uniformly distributed in the unit d-hypercube [0,1]^d (Cb.), and with standard Gaussian distribution N(0, I_d) (Ga.). The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realisations, with the shaded areas representing the empirical standard deviations. The asymptotic scalings ϵ ∼ n^{-β} predicted for inputs on the multisphere are reported as dashed lines.

Fig. S4 reports the learning curves of different teacher-student scenarios with the kernels defined in Def. G.1 and inputs i) on the multisphere M^p S^{s-1}, ii) uniformly distributed in the unit d-hypercube [0,1]^d, and iii) with standard Gaussian distribution N(0, I_d). Remarkably, our predictions are in excellent agreement across the different input normalisations.

Figure S5: Learning curves for deep convolutional NTKs (ν = 1/2) with non-overlapping (NO.) and overlapping (Ov.) patches in a teacher-student setting with inputs normalised in the d-hypercube. The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realisations, with the shaded areas representing the empirical standard deviations. The asymptotic scalings ϵ ∼ n^{-β} predicted for kernels with non-overlapping patches are reported as dashed lines.

Figure S6: Learning curves of the neural tangent kernels of fully-connected (F-NTK) and convolutional (C-NTK) networks with various depths learning to classify two CIFAR-10 classes in a regression setting. Deep hierarchical convolutional kernels achieve the best performance. Shaded areas represent the empirical standard deviations obtained by averaging over different training sets. a. Plane vs car. b. Cat vs bird.


