DEEP LEARNING MEETS NONPARAMETRIC REGRESSION: ARE WEIGHT-DECAYED DNNS LOCALLY ADAPTIVE?

Abstract

We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness, a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function space and the sample size. We consider a "Parallel NN" variant of deep ReLU networks and show that standard ℓ2 regularization is equivalent to promoting ℓp-sparsity (0 < p < 1) in the coefficient vector of an end-to-end learned set of basis functions, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the regularization factor, such a parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to the minimax-optimal rate as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.
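For readers who want a concrete picture of the "Parallel NN" variant mentioned above, a minimal PyTorch sketch follows. The architecture here (several narrow L-layer ReLU subnetworks whose scalar outputs are summed) and all names and hyperparameters, including the class name ParallelReLUNet, the widths, depth, number of subnetworks, and optimizer settings, are our illustrative assumptions rather than the paper's exact specification.

import torch
import torch.nn as nn

class ParallelReLUNet(nn.Module):
    """A sum of several narrow L-layer ReLU multilayer perceptrons."""
    def __init__(self, d_in=1, width=4, depth=4, n_subnets=128):
        super().__init__()
        def subnet():
            layers = [nn.Linear(d_in, width), nn.ReLU()]
            for _ in range(depth - 2):
                layers += [nn.Linear(width, width), nn.ReLU()]
            layers += [nn.Linear(width, 1)]
            return nn.Sequential(*layers)
        self.subnets = nn.ModuleList([subnet() for _ in range(n_subnets)])

    def forward(self, x):
        # Output is the sum of the parallel subnetwork outputs.
        return sum(net(x) for net in self.subnets)

model = ParallelReLUNet()
# Standard weight decay (squared l2 regularization); the analysis in the
# paper tunes only this regularization factor.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

Training such a model with the usual mean-squared-error loss would approximate the weight-decayed objective analyzed in the paper.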

1. INTRODUCTION

Why do deep neural networks (DNNs) work better? They are universal function approximators (Cybenko, 1989), but so are splines and kernels. They learn data-driven representations, but so do shallower and linear counterparts such as matrix factorization. The theoretical understanding of why DNNs are superior to these classical alternatives is surprisingly limited. In this paper, we study DNNs in nonparametric regression problems, a classical branch of statistical theory and methods with more than half a century of associated literature (Nadaraya, 1964; De Boor et al., 1978; Wahba, 1990; Donoho et al., 1998; Mallat, 1999; Scholkopf & Smola, 2001; Rasmussen & Williams, 2006). Nonparametric regression addresses the fundamental problem:

• Let $y_i = f(x_i) + \text{noise}$ for $i = 1, \ldots, n$. How well can we estimate the unknown function $f$ from these $n$ noisy observations?

In particular, can weight-decayed ReLU DNNs optimally estimate functions with heterogeneous smoothness, such as the Doppler-like example in the figure below? This is a timely question to ask, partly because the bulk of recent theory of NNs leverages their asymptotic Reproducing Kernel Hilbert Space (RKHS) in the overparameterized regime (Jacot et al., 2018; Belkin et al., 2018; Arora et al., 2019). RKHS-based approaches, e.g., kernel ridge regression with any fixed kernel, are suboptimal in estimating functions with heterogeneous smoothness (Donoho et al., 1990). Therefore, existing deep learning theory based on RKHS does not satisfactorily explain the advantages of neural networks over kernel methods.
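As a concrete instance of the heterogeneous smoothness discussed above, the following short sketch generates noisy observations of a Doppler-type function, which oscillates rapidly near 0 and becomes smooth near 1. The function form is the classical Doppler test function; the sample size, noise level, and random seed are arbitrary choices of ours.

import numpy as np

def doppler(x, eps=0.05):
    # Classical Doppler test function: highly oscillatory near x = 0,
    # increasingly smooth as x approaches 1.
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * (1.0 + eps) / (x + eps))

rng = np.random.default_rng(0)
n = 256
x = np.sort(rng.uniform(0.0, 1.0, size=n))
y = doppler(x) + 0.1 * rng.normal(size=n)  # y_i = f(x_i) + noise
# The nonparametric regression task is to recover f from the pairs (x_i, y_i);
# a fixed-kernel method must commit to one bandwidth, which cannot be right
# for both the oscillatory region and the smooth region at once.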



[Table: Comparison with the results in the literature.]


[Figure: Free-knot splines, splines with adaptive orders, and Doppler-like functions. Can weight-decayed ReLU DNNs estimate such functions with heterogeneous smoothness optimally (using noisy observations)?]

We build upon the recent work of Suzuki (2018) and Parhi & Nowak (2021a), who provided encouraging first answers to the question above. Specifically, Parhi & Nowak (2021a, Theorem 8) showed that a two-layer neural network with truncated-power-function activations and a non-standard regularization is equivalent to locally adaptive regression splines (LARS). This connection implies that such NNs achieve the minimax rate for the (high-order) bounded variation (BV) classes. A detailed discussion is provided in Section B. Suzuki (2018) showed that multilayer ReLU DNNs can achieve the minimax rate for the Besov class, but requires an artificially imposed sparsity level of the DNN weights that must be calibrated according to the parameters of the Besov class, which makes the approach quite difficult to implement in practice. Oono & Suzuki (2019) and Liu et al. (2021) replaced the sparse neural network with ResNet-style CNNs and achieved the same rate, but they similarly require carefully choosing the number of parameters for each nonparametric class. We show that ℓ2 regularization suffices for mildly overparameterized DNNs to achieve the optimal "locally adaptive" rates for many nonparametric classes at the same time.

Weight decay, also known as squared ℓ2 regularization, is one of the most popular regularization techniques for preventing overfitting in DNNs. It is called "weight decay" because each iteration of gradient descent (or SGD) shrinks the parameters towards 0 multiplicatively (roughly by a factor of 1 − ηλ, where η is the learning rate and λ the regularization strength, before the gradient of the data-fitting loss is applied). Many tricks in deep learning, including early stopping (Yao et al., 2007), quantization (Hubara et al., 2016), and dropout (Wager et al., 2013), behave like ℓ2 regularization. Thus, even though we focus on the exact minimizer of the regularized objective, our analysis may also explain the behavior of SGD in practice.

Summary of results. Our main contributions are:

1. We prove that the (standard) ℓ2 regularization in training an L-layer parallel ReLU-activated neural network is equivalent to a sparse ℓp penalty (where p = 2/L) on the linear coefficients of an end-to-end learned representation (Proposition 4); a one-dimensional sketch of the underlying identity is given after this list.
2. We show that the estimation error of the ℓ2-regularized parallel NN can be made arbitrarily close to the minimax rate for estimating functions in Besov spaces. Notably, the method can adapt to different smoothness parameters, which is not the case for many other methods.
3. We find that deeper models achieve error rates closer to the minimax-optimal rate. This result helps explain why deep neural networks can empirically achieve better performance than shallow ones.
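To convey the idea behind the first contribution, here is a one-dimensional sketch of the identity that drives the ℓ2-to-ℓp equivalence. This scalar derivation via the AM-GM inequality is our own illustration; the network-level statement is Proposition 4.

$$\min_{a_1, \ldots, a_L \,:\, \prod_{j=1}^{L} |a_j| = |c|} \;\; \sum_{j=1}^{L} a_j^2 \;=\; L\,|c|^{2/L},$$

with the minimum attained when $|a_1| = \cdots = |a_L| = |c|^{1/L}$. Consequently, once the squared ℓ2 penalty is minimized over all ways of factoring a fixed output coefficient $c$ across the $L$ layers, it behaves like the penalty $|c|^{p}$ with $p = 2/L$, which is sparsity-promoting for $L > 2$ and reduces to an ℓ1-type penalty for $L = 2$.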

Besides, we have the following technical contributions, which could be of separate interest: