ON THE IMPLICIT BIAS OF WEIGHT DECAY IN SHALLOW UNIVARIATE RELU NETWORKS

Abstract

We give a complete characterization of the implicit bias of infinitesimal weight decay (i.e. an ℓ2 penalty on network weights) in the modest setting of univariate one-layer ReLU networks. Our main result is a surprisingly simple geometric description of all one-layer ReLU networks that exactly fit a dataset D = {(x_i, y_i)} with the minimum value of the ℓ2-norm of the neuron weights. Specifically, we prove that such functions must be either concave or convex between any two consecutive data sites x_i and x_{i+1}. Our description implies that interpolating ReLU networks with weak ℓ2-regularization achieve the best possible ℓ1 generalization error for learning 1d Lipschitz functions, up to universal constants.
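The geometric description above can be made concrete with a small numerical sketch (our own toy construction, not taken from the paper). It compares two one-layer ReLU networks that both interpolate a three-point "tent" dataset: one that is linear (hence trivially convex and concave) between consecutive data sites, and one with an extra zero-sum bump that is first convex and then concave inside an interval, violating the stated condition. Using the standard fact that minimizing ½Σ_j (a_j² + w_j²) over the rescalings (a_j, w_j) → (a_j/t, t·w_j) yields Σ_j |a_j·w_j|, the bumpy interpolant pays a strictly larger neuron-weight cost:

```python
import numpy as np

def relu_net(neurons):
    """Return f(x) = sum_j a_j * relu(w_j * x + b_j) for neurons = [(a, w, b), ...]."""
    def f(x):
        return sum(a * np.maximum(0.0, w * x + b) for a, w, b in neurons)
    return f

def neuron_cost(neurons):
    """Minimum of (1/2) * sum_j (a_j^2 + w_j^2) over per-neuron rescalings,
    which equals sum_j |a_j * w_j| (the balanced l2 cost of the neuron weights)."""
    return sum(abs(a * w) for a, w, _ in neurons)

# Toy "tent" dataset: any interpolant must pass through these three sites.
X = np.array([0.0, 1.0, 2.0])
Y = np.array([0.0, 1.0, 0.0])

# Interpolant A: linear between consecutive sites (f(x) = relu(x) - 2*relu(x-1)).
net_A = [(1.0, 1.0, 0.0), (-2.0, 1.0, -1.0)]

# Interpolant B: same network plus a zero-sum bump supported inside (0, 1),
# making f first convex, then concave on that interval.
net_B = net_A + [(1.0, 1.0, -0.4), (-2.0, 1.0, -0.5), (1.0, 1.0, -0.6)]

for name, net in [("A", net_A), ("B", net_B)]:
    f = relu_net(net)
    assert np.allclose(f(X), Y)        # both networks interpolate the data exactly
    print(name, neuron_cost(net))      # A: 3.0, B: 7.0
```

Both networks fit the data exactly, but the interpolant that is convex or concave between consecutive sites achieves the smaller neuron-weight cost, as the main result predicts.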

1. INTRODUCTION

The ability of overparameterized neural networks to simultaneously fit training data (i.e. interpolate) and generalize to unseen test data (i.e. extrapolate) is a robust empirical finding that underpins the success of deep learning in computer vision He et al. (2016; 2019); Krizhevsky et al. (2012), natural language processing Brown et al. (2020), and reinforcement learning Jumper et al. (2021); Silver et al. (2016); Vinyals et al. (2019). This observation is surprising when viewed through the lens of traditional learning theory Bartlett & Mendelson (2002); Vapnik & Chervonenkis (1971), chiefly because such complexity-based methods are agnostic to the choice of optimizer and seek to predict generalization based solely on the complexity of the overall hypothesis class and how well a learned model fits the training data. In an overparameterized neural network, however, the quality of predictions at test time often varies dramatically across settings of trainable parameters (e.g. weights and biases) that exactly fit all training data Zhang et al. (2017). Which setting of parameters is learned depends crucially on the optimization procedure, and an insightful analysis of generalization in the presence of overparameterization must therefore combine properties of the model class with the often subtle criteria according to which different minimizers of an empirical risk are selected by different optimizers. This has led to a vibrant sub-field of deep learning theory that analyzes the implicit bias or implicit regularization of optimizers used in practice Arora et al. (2019; 2021); Blanc et al. (2020); Gunasekar et al. (2018); Hanin & Sun (2021); Jacot et al. (2020); Ma et al. (2018); Razin & Cohen (2020); Smith et al. (2021). The high-level goal of this line of work is to explain how optimization hyperparameters such as initialization scheme, learning rate, batch size, data augmentation scheme, and choice of explicit regularizer influence which of the many global minima of the empirical risk are selected in the course of optimization. A key difficulty in studying implicit bias is that it is unclear how to understand, concretely in terms of the network function, the effect of particular optimization hyperparameters.

For example, a well-chosen initialization for gradient-based optimizers is key to ensuring good generalization properties of the resulting learned network He et al. (2015); Mishkin & Matas (2015); Xiao et al. (2018). However, the corresponding geometric or analytic properties of the learned network are often hard to pin down, obscuring our understanding of what it is about the learned functions that encourages generalization. In a similar vein, it is standard practice to experiment with explicit regularizers such as an ℓ2 penalty on network weights. While the effect of this choice is easy to describe in terms of model parameters (e.g. it tends to make them smaller), it is typically challenging to translate such a description into properties of the learned network function.
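The parameter-space effect of an ℓ2 penalty mentioned above has a simple mechanical form: one gradient step on L(θ) + λ‖θ‖² is identical to first shrinking the weights by a factor (1 − 2ηλ) and then stepping on L alone, which is why this penalty is called weight decay. A minimal numpy check of this identity on an assumed toy quadratic loss (our own illustration, not from the paper):

```python
import numpy as np

def grad_step_penalized(theta, grad_L, lr, lam):
    """One gradient step on the regularized objective L(theta) + lam * ||theta||^2."""
    return theta - lr * (grad_L(theta) + 2.0 * lam * theta)

def grad_step_decayed(theta, grad_L, lr, lam):
    """Equivalent 'weight decay' form: shrink the weights, then step on L alone."""
    return (1.0 - 2.0 * lr * lam) * theta - lr * grad_L(theta)

# Toy loss L(theta) = ||theta - 1||^2, whose gradient is 2 * (theta - 1).
grad_L = lambda th: 2.0 * (th - 1.0)
theta0 = np.array([3.0, -2.0, 0.5])

a = grad_step_penalized(theta0, grad_L, lr=0.1, lam=0.01)
b = grad_step_decayed(theta0, grad_L, lr=0.1, lam=0.01)
assert np.allclose(a, b)  # the two updates coincide exactly
```

The present paper's question is precisely the converse direction: what this shrinkage in parameter space implies about the learned function itself.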

