ON THE IMPLICIT BIAS OF WEIGHT DECAY IN SHALLOW UNIVARIATE RELU NETWORKS

Abstract

We give a complete characterization of the implicit bias of infinitesimal weight decay (i.e. an ℓ2 penalty on network weights) in the modest setting of univariate one layer ReLU networks. Our main result is a surprisingly simple geometric description of all one layer ReLU networks that exactly fit a dataset D = {(x_i, y_i)} with the minimum value of the ℓ2-norm of the neuron weights. Specifically, we prove that such functions must be either concave or convex between any two consecutive data sites x_i and x_{i+1}. Our description implies that interpolating ReLU networks with weak ℓ2-regularization achieve the best possible ℓ∞ generalization error for learning 1d Lipschitz functions, up to universal constants.

1. INTRODUCTION

The ability of overparameterized neural networks to simultaneously fit training data (i.e. interpolate) and generalize to unseen test data (i.e. extrapolate) is a robust empirical finding that underpins the success of deep learning in computer vision He et al. (2016); Krizhevsky et al. (2012), natural language processing Brown et al. (2020), and reinforcement learning Jumper et al. (2021); Silver et al. (2016); Vinyals et al. (2019). This observation is surprising when viewed through the lens of traditional learning theory Bartlett & Mendelson (2002); Vapnik & Chervonenkis (1971), chiefly because such complexity-based methods are agnostic to the choice of optimizer and seek to predict generalization based solely on the complexity of the overall hypothesis class and how well a learned model fits the training data. In an overparameterized neural network, however, the quality of predictions at test time often varies dramatically across settings of trainable parameters (e.g. weights and biases) that exactly fit all training data Zhang et al. (2017). Which setting of parameters is learned depends crucially on the optimization procedure, and an insightful analysis of generalization in the presence of overparameterization must therefore combine properties of the model class with the often subtle criteria according to which different minimizers of an empirical risk are selected by different optimizers. This has led to a vibrant sub-field of deep learning theory that analyzes the implicit bias or implicit regularization of optimizers used in practice Arora et al. (2019); Blanc et al. (2020); Gunasekar et al. (2018); Hanin & Sun (2021); Jacot et al. (2020); Ma et al. (2018); Razin & Cohen (2020); Smith et al. (2021). The high level goal of this line of work is to explain how optimization hyperparameters such as initialization scheme, learning rate, batch size, data augmentation scheme, and choice of explicit regularizer influence which of the many global minima of the empirical risk are selected in the course of optimization.
A key difficulty in studying implicit bias is that it is unclear how to understand, concretely in terms of the network function, the effect of particular optimization hyperparameters. For example, a well-chosen initialization for gradient-based optimizers is key to ensuring good generalization properties of the resulting learned network He et al. (2015); Mishkin & Matas (2015); Xiao et al. (2018). However, the corresponding geometric or analytic properties of the learned network are often hard to pin down, obscuring our understanding of what it is about the learned functions that encourages generalization. In a similar vein, it is standard practice to experiment with explicit regularizers such as an ℓ2 penalty on network weights. While the effect of this choice is easy to describe in terms of model parameters (e.g. it tends to make them smaller), it is typically challenging to translate such a description into properties of the learned function. Prior works explore and develop the fact that ℓ2 regularization on parameters in this setting is provably equivalent to penalizing the total variation of the derivative of the network function (cf. e.g. Theorem 1.3 from prior work below). These articles apply to networks with any input dimension. In this article, however, we consider the simplest case of input dimension 1 and significantly refine these prior results to give a complete geometric answer to how interpolating ReLU networks with a weak ℓ2 penalty use training data to make predictions on unseen data. Our main results are:

1. We consider a dataset D = {(x_i, y_i)} with x_i, y_i ∈ ℝ and give a complete description of the space of one layer ReLU networks with a single linear unit which fit the data and, among all such interpolating networks, do so with the minimal ℓ2 norm of the neuron weights. There are infinitely many such networks, and they are described by the constraint that they fit the data with as few inflection points as possible (see Thms. 1.1, 1.2).

2. The above description of the space of interpolants of D gives uniform control of the Lipschitz constant of any such interpolant and immediately yields sharp generalization bounds for learning 1d Lipschitz functions; this is stated in Corollary 1.1. Specifically, if the dataset D is generated by setting y_i = f*(x_i) for f* : [0, 1] → ℝ a Lipschitz function, then any one layer ReLU network with a single linear unit which interpolates D but does so with minimal ℓ2-norm of the network parameters will generalize as well as possible to unseen data, up to a small universal multiplicative constant. To the author's knowledge this is the first time such generalization guarantees have been obtained.
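The generalization mechanism behind Corollary 1.1 is uniform Lipschitz control of the interpolant. The following sketch illustrates the principle numerically; it is an illustration of the shape of the bound, not the paper's proof. The "connect the dots" interpolant (computed with np.interp) stands in for a minimal-norm ReLU network, and f* below is a hypothetical 3-Lipschitz target: any interpolant whose Lipschitz constant matches that of f* has sup-norm error at most a constant times the Lipschitz constant times the largest gap between data sites.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-Lipschitz target on [0, 1].
LIP = 3.0
f_star = lambda t: np.abs(np.sin(LIP * t))

# Data sites x_1 < ... < x_m and labels y_i = f*(x_i).
x = np.sort(np.concatenate(([0.0, 1.0], rng.uniform(0, 1, 14))))
y = f_star(x)

# "Connect the dots" interpolant f_D: piecewise linear, with slopes equal to
# difference quotients of f*, hence f_D is itself LIP-Lipschitz.
grid = np.linspace(0.0, 1.0, 2001)
f_D = np.interp(grid, x, y)

# Sup-norm error vs. the a-priori bound 2 * LIP * (largest gap between sites):
# on [x_i, x_{i+1}], |f_D - f*| <= LIP*gap + LIP*gap by the triangle inequality.
sup_err = np.max(np.abs(f_D - f_star(grid)))
bound = 2.0 * LIP * np.max(np.diff(x))
print(sup_err <= bound)   # True: interpolation + Lipschitz control => sup-norm control
```

The same triangle-inequality argument applies verbatim to any LIP-Lipschitz interpolant, which is why Lipschitz control of minimal-norm networks suffices for the stated guarantee.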




Figure 1: A dataset D with m = 8 points. Shown are the "connect the dots" interpolant f_D (dashed line), its slopes s_i, and the "discrete curvature" ε_i at each x_i.
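In code, the quantities in Figure 1 are elementary to compute from D. A small sketch follows; the dataset is made up to mirror the figure's m = 8 points, and ε_i is taken here to be the slope change at x_i, an assumed reading of "discrete curvature" ahead of its formal definition.

```python
import numpy as np

# A made-up dataset D with m = 8 points, sorted by data site x_i.
x = np.array([0.0, 0.15, 0.3, 0.45, 0.6, 0.7, 0.85, 1.0])
y = np.array([0.0, 0.4, 0.3, 0.9, 1.0, 0.7, 0.8, 0.5])

# Slopes s_i of the "connect the dots" interpolant f_D on [x_i, x_{i+1}].
s = np.diff(y) / np.diff(x)

# Discrete curvature at each interior site: the jump in slope there
# (assumed definition; the paper's epsilon_i is defined formally later).
eps = np.diff(s)

print(len(s), len(eps))   # m-1 = 7 slopes, m-2 = 6 interior slope jumps
```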

SETUP AND INFORMAL STATEMENT OF RESULTS

Let us denote [t]_+ := ReLU(t) = max{0, t} and consider a one layer ReLU network

z(x) = z(x; θ) = z(x; θ, n) := ax + b + Σ_{j=1}^n a_j [w_j x + b_j]_+
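In NumPy, this architecture and the norm/total-variation dictionary quoted in the introduction can be sketched as follows. The summand notation a_j [w_j x + b_j]_+ for the n ReLU neurons is my own (a standard parameterization, assumed here); the check at the end illustrates the quoted fact that, after rescaling each neuron so that its inner and outer weights are balanced, the ℓ2 penalty on neuron weights equals Σ_j |a_j w_j|, which is the total variation of z' when the breakpoints are distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                  # number of ReLU neurons
a0, b0 = 0.3, -0.1                     # the single linear unit a*x + b

w = rng.standard_normal(n)             # inner weights w_j (assumed notation)
c = rng.standard_normal(n)             # inner biases b_j
a = rng.standard_normal(n)             # outer weights a_j

def z(x, a, w, c):
    """z(x; theta, n) = a0*x + b0 + sum_j a_j * [w_j * x + c_j]_+ ."""
    return a0 * x + b0 + np.maximum(0.0, np.outer(x, w) + c) @ a

# z' jumps by a_j * w_j at the breakpoint -c_j / w_j, so for distinct
# breakpoints the total variation of z' is sum_j |a_j * w_j|.
tv = np.sum(np.abs(a * w))

# Per-neuron rescaling (a_j, w_j, c_j) -> (a_j/s_j, s_j*w_j, s_j*c_j), s_j > 0,
# leaves z unchanged; s_j = sqrt(|a_j| / |w_j|) balances the weights.
s = np.sqrt(np.abs(a) / np.abs(w))
a_bal, w_bal, c_bal = a / s, w * s, c * s

xs = np.linspace(-3.0, 3.0, 101)
assert np.allclose(z(xs, a, w, c), z(xs, a_bal, w_bal, c_bal))

# The minimized l2 penalty on the neuron weights equals the TV of z'.
l2_min = 0.5 * np.sum(a_bal**2 + w_bal**2)
assert np.isclose(l2_min, tv)
```

The rescaling invariance is exactly why weight decay biases toward small total variation of z': the optimizer is free to balance each neuron, and at the balanced point the parameter penalty and the function-space penalty coincide.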

