Deep Equals Shallow for ReLU Networks in Kernel Regimes

Abstract

Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones; however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization. This paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their "shallow" two-layer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere.

1. Introduction

The question of which functions can be well approximated by neural networks is crucial for understanding when these models are successful, and has always been at the heart of the theoretical study of neural networks (e.g., Hornik et al., 1989; Pinkus, 1999). While early works have mostly focused on shallow networks with only two layers, more recent works have shown benefits of deep networks for approximating certain classes of functions (Eldan & Shamir, 2016; Mhaskar & Poggio, 2016; Telgarsky, 2016; Daniely, 2017; Yarotsky, 2017; Schmidt-Hieber et al., 2020). Unfortunately, many of these approaches rely on constructions that are not currently known to be learnable using efficient algorithms. A separate line of work has considered over-parameterized networks with random neurons (Neal, 1996), which also display universal approximation properties while additionally providing efficient algorithms based on kernel methods or their approximations such as random features (Rahimi & Recht, 2007; Bach, 2017b). Many recent results show that gradient-based optimization of certain over-parameterized networks is equivalent to a kernel method with an architecture-specific kernel called the neural tangent kernel (NTK), and thus also fall in this category (e.g., Jacot et al., 2018; Li & Liang, 2018; Allen-Zhu et al., 2019b; Du et al., 2019a;b; Zou et al., 2019). This regime has been coined lazy (Chizat et al., 2019), as it does not capture the common phenomenon where weights move significantly away from random initialization, and thus may not provide a satisfying model for learning adaptive representations. This is in contrast to other settings such as the mean field or active regime, which captures complex training dynamics where weights may move in a non-trivial manner and adapt to the data (e.g., Chizat & Bach, 2018; Mei et al., 2018).
Nevertheless, one benefit compared to the mean field regime is that the kernel approach easily extends to deep architectures, leading to compositional kernels similar to the ones of Cho & Saul (2009); Daniely et al. (2016). Our goal in this paper is to study the role of depth in determining approximation properties for such kernels, with a focus on fully-connected deep ReLU networks. Our approximation results rely on the study of eigenvalue decays of integral operators associated to the obtained dot-product kernels on the sphere, which are diagonalized in the basis of spherical harmonics. This provides a characterization of the functions in the corresponding reproducing kernel Hilbert space (RKHS) in terms of their smoothness, and leads to convergence rates for non-parametric regression when the data are uniformly distributed on the sphere. We show that for ReLU networks, the eigenvalue decays for the corresponding deep kernels remain the same regardless of the depth of the network. Our key result is that the decay for a certain class of kernels is characterized by a property related to differentiability of the kernel function around the point where the two inputs are aligned. In particular, the property is preserved when adding layers with ReLU activations, showing that depth plays essentially no role for such networks in kernel regimes. This highlights the limitations of the kernel regime for understanding the power of depth in fully-connected networks, and calls for new models of deep networks beyond kernels (see, e.g., Allen-Zhu & Li, 2020; Chen et al., 2020, for recent works in this direction). We also provide applications of our result to other kernels and architectures, and illustrate our results with numerical experiments on synthetic and real datasets.

Related work. Kernels for deep learning were originally derived by Neal (1996) for shallow networks, and later for deep networks (Cho & Saul, 2009; Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018). Smola et al. (2001); Minh et al. (2006) study regularization properties of dot-product kernels on the sphere using spherical harmonics, and Bach (2017a) derives eigenvalue decays for such dot-product kernels arising from shallow networks with positively homogeneous activations including the ReLU. Extensions to shallow NTK or Laplace kernels are studied by Basri et al. (2019); Bietti & Mairal (2019b); Geifman et al. (2020). The observation that depth does not change the decay of the NTK was previously made by Basri et al. (2020) empirically, and Geifman et al. (2020) provide a lower bound on the eigenvalues for deep networks; our work makes this observation rigorous by providing tight asymptotic decays. Spectral properties of wide neural networks were also considered in (Cao et al., 2019; Fan & Wang, 2020; Ghorbani et al., 2019; Xie et al., 2017; Yang & Salman, 2019). Azevedo & Menegatto (2014); Scetbon & Harchaoui (2020) also study eigenvalue decays for dot-product kernels but focus on kernels with geometric decays, while our main focus is on polynomial decays. Additional works on over-parameterized or infinite-width networks in lazy regimes include (Allen-Zhu et al., 2019a;b; Arora et al., 2019a;b; Brand et al., 2020; Lee et al., 2020; Song & Yang, 2019).

Concurrently to our work, Chen & Xu (2021) also studied the RKHS of the NTK for deep ReLU networks, showing that it is the same as for the Laplace kernel on the sphere. They achieve this by studying asymptotic decays of Taylor coefficients of the kernel function at zero using complex-analytic extensions of the kernel functions, and leveraging this to obtain both inclusions between the two RKHSs. In contrast, we obtain precise descriptions of the RKHS and regularization properties in the basis of spherical harmonics for various dot-product kernels through spectral decompositions of integral operators, using (real) asymptotic expansions of the kernel function around endpoints. The equality between the RKHS of the deep NTK and Laplace kernel then easily follows from our results by the fact that the two kernels have the same spectral decay.

2. Review of Approximation with Dot-Product Kernels

In this section, we provide a brief review of the kernels that arise from neural networks and their approximation properties.
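
As a rough illustration of the objects discussed here, the following Python sketch builds a dot-product kernel function for a fully-connected ReLU network and computes the eigenvalues of its integral operator on the sphere. It is a sketch under stated assumptions, not the implementation used in the paper: it assumes the standard arc-cosine kernel formulas of Cho & Saul (2009), the usual NTK recursion with unit-norm inputs, and the special case d = 3 (inputs on the sphere S^2), where the spherical harmonic eigenvalues reduce to ordinary Legendre coefficients. The function names (kappa0, kappa1, relu_ntk, legendre_coeffs_d3) and the quadrature size are hypothetical choices made for this example.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval


def kappa0(u):
    """Arc-cosine kernel of degree 0 (assumed form, Cho & Saul 2009)."""
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi


def kappa1(u):
    """Arc-cosine kernel of degree 1 (assumed form, Cho & Saul 2009)."""
    u = np.clip(u, -1.0, 1.0)
    root = np.sqrt(np.maximum(1.0 - u * u, 0.0))
    return (u * (np.pi - np.arccos(u)) + root) / np.pi


def relu_ntk(u, depth):
    """NTK kernel function of u = <x, y> for unit-norm x, y.

    Assumed recursion: start from the linear kernel u, and for each
    additional ReLU layer update
        ntk <- ntk * kappa0(k) + kappa1(k),   k <- kappa1(k).
    depth=1 gives the linear kernel; depth=2 is the two-layer NTK.
    """
    u = np.asarray(u, dtype=float)
    k = u
    ntk = u
    for _ in range(depth - 1):
        ntk = ntk * kappa0(k) + kappa1(k)
        k = kappa1(k)
    return ntk


def legendre_coeffs_d3(kernel_fn, max_degree, n_nodes=500):
    """Eigenvalues mu_k, k = 0..max_degree, of the integral operator of a
    dot-product kernel on S^2 under the uniform probability measure:
        mu_k = (1/2) * integral_{-1}^{1} kernel_fn(t) P_k(t) dt,
    approximated with Gauss-Legendre quadrature (n_nodes is a free choice).
    """
    t, w = leggauss(n_nodes)
    vals = kernel_fn(t)
    mus = []
    for k in range(max_degree + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0  # selects the Legendre polynomial P_k
        mus.append(0.5 * np.sum(w * vals * legval(t, coef)))
    return np.array(mus)
```

One possible use is to call legendre_coeffs_d3(lambda t: relu_ntk(t, depth), max_degree) for several values of depth and compare how the nonzero eigenvalues decay with k. The kernel functions are not smooth at the endpoints t = ±1, so the quadrature loses accuracy for large k and the values should be treated as indicative only.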

