DEEP LEARNING MEETS NONPARAMETRIC REGRESSION: ARE WEIGHT DECAYED DNNS LOCALLY ADAPTIVE?

Abstract

We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness, a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function space and the sample size. We consider a "Parallel NN" variant of deep ReLU networks and show that the standard ℓ2 regularization is equivalent to promoting ℓp-sparsity (0 < p < 1) in the coefficient vector of an end-to-end learned function basis, i.e., a dictionary. Using this equivalence, we further establish that, by tuning only the regularization factor, such a parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.

1. INTRODUCTION

Why do deep neural networks (DNNs) work better? They are universal function approximators (Cybenko, 1989), but so are splines and kernels. They learn data-driven representations, but so do shallower and linear counterparts such as matrix factorization. The theoretical understanding of why DNNs are superior to these classical alternatives is surprisingly limited. In this paper, we study DNNs in nonparametric regression problems, a classical branch of statistical theory and methods with more than half a century of associated literature (Nadaraya, 1964; De Boor et al., 1978; Wahba, 1990; Donoho et al., 1998; Mallat, 1999; Scholkopf & Smola, 2001; Rasmussen & Williams, 2006). Nonparametric regression addresses the fundamental problem:

• Let y_i = f(x_i) + Noise for i = 1, ..., n. How can we estimate a function f using data points (x_1, y_1), ..., (x_n, y_n) in conjunction with the knowledge that f belongs to a function class F?

The function class F typically imposes only weak regularity assumptions such as smoothness, which makes nonparametric regression widely applicable to real-life problems.

Local adaptivity. We say a nonparametric regression technique is locally adaptive if it can cater to local differences in smoothness, hence allowing more accurate estimation of functions with varying smoothness and abrupt changes. A subset of nonparametric regression techniques has been shown to possess local adaptivity (Mammen & van de Geer, 1997) in both theory and practice. These include wavelet smoothing (Donoho et al., 1998), locally adaptive regression splines (LARS; Mammen & van de Geer, 1997), trend filtering (Tibshirani, 2014; Wang et al., 2014), and adaptive local polynomials (Baby & Wang, 2019; 2020). In light of such a distinction, it is natural to consider the following question: Are NNs locally adaptive, i.e., optimal in learning functions with heterogeneous smoothness?
This is a timely question to ask, partly because the bulk of recent NN theory leverages the asymptotic Reproducing Kernel Hilbert Space (RKHS) in the overparameterized regime (Jacot et al., 2018; Belkin et al., 2018; Arora et al., 2019). RKHS-based approaches, e.g., kernel ridge regression with

