GENERALIZATION ERROR BOUNDS FOR NEURAL NETWORKS WITH RELU ACTIVATION

Anonymous

Abstract

We prove rigorous bounds on the generalization error for Neural Networks with ReLU activation under the condition that the network size does not grow with the training set size. In order to prove these bounds, we weaken the notion of uniform stability of a learning algorithm in a probabilistic way by introducing the notion of almost sure (a.s.) support stability and proving that if an algorithm has sufficiently low a.s. support stability, its generalization error tends to 0 as the training set size increases. Further, we show that for Stochastic Gradient Descent to be almost surely support stable it suffices that the loss function be locally Lipschitz and locally smooth with probability 1, thereby establishing low generalization error under weaker conditions than have been used in the literature. We then show that Neural Networks with ReLU activation and a doubly differentiable loss function possess these properties, thereby proving low generalization error. The caveat is that the size of the NN must not grow with the size of the training set. Finally, we present experimental evidence to validate our theoretical results.

1. INTRODUCTION

Given the importance of the generalization error in machine learning, there have been numerous theoretical approaches to bounding it, one of the most significant being the concept of algorithmic stability, whose origins can be traced back to Vapnik & Chervonenkis (1974). In this paradigm, the definition of uniform stability by Bousquet & Elisseeff (2002) was a landmark because it allowed the authors to prove sharp concentration bounds showing that the generalization error of algorithms satisfying this property tends to 0 as the training set size increases. Given the success of Bousquet & Elisseeff (2002) on a variety of non-neural methods, when Hardt et al. (2016) showed that a learning function computed by optimizing certain loss functions through Stochastic Gradient Descent is uniformly stable under mild conditions on the loss function, the natural expectation was that the next step would be to show that Neural Network-based models are also uniformly stable and, hence, have a low generalization error. This analysis was expected to corroborate the empirically observed fact that the generalization error is low for NN models (cf., e.g., Krizhevsky et al. (2012), Hinton et al. (2012)). Case closed. However, this triumphal march of algorithmic stability was rudely interrupted by Zhang et al. (2017), who showed explicit examples of distributions that cannot be learned with low generalization error and yet can be learned with an NN of an appropriate size. Zhang et al. (2017) speculated that the problem arose from the fact that the concept of uniform stability was based only on properties of the algorithm and not on the properties of the unknown data distribution. While this was a reasonable speculation, it left open some questions: Is there a reasonable definition of stability that incorporates distribution properties and leads to small generalization error? And can we show that NNs with non-linear activation functions such as ReLU at the neurons have this property? In this paper we give a strongly affirmative answer to the first question and a partially affirmative answer to the second question. We define a notion called almost sure (a.s.) support stability, which is a probabilistic weakening of uniform stability.

Unlike the data-dependent notions of stability defined in Kuzborskij & Lampert (2018) and Lei & Ying (2020), which bound the generalization error in expectation, a.s. support stability can be used to show high-probability bounds on the generalization error. This proof entails a mild generalization of McDiarmid's Inequality that could be of independent interest. The major contribution of our paper is that we show how to handle the non-linearity of ReLU in a mathematically rigorous way by modifying a result of Milne (2019) to show that ReLU
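For reference, the notion of uniform stability that this line of work builds on can be stated as follows (a standard formulation, up to notation; $M$ denotes a uniform bound on the loss and $S^{\setminus i}$ denotes $S$ with the $i$-th example removed):

```latex
An algorithm $A$ is $\beta$-uniformly stable if for every training set $S$
of size $n$ and every index $i$,
\[
\sup_{z}\,\bigl|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)\bigr| \;\le\; \beta .
\]
For a loss bounded by $M$, Bousquet \& Elisseeff (2002) show that with
probability at least $1-\delta$ over the draw of $S$, the risk $R$ and the
empirical risk $\widehat{R}$ satisfy
\[
R(A_S) \;\le\; \widehat{R}(A_S) + 2\beta + \bigl(4n\beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2n}} ,
\]
so the generalization gap vanishes as $n \to \infty$ whenever
$\beta = o\!\bigl(1/\sqrt{n}\bigr)$.
```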

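The kind of stability discussed above can also be probed numerically. The following is a hedged sketch (not the paper's actual experiments; the architecture, sizes, and step size are illustrative choices) that trains a one-hidden-layer ReLU network with SGD on two datasets differing in exactly one example, using the same initialization and sample order, and reports the largest loss gap on fresh points as a crude empirical stability proxy:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d, h):
    # Small random initialization for a one-hidden-layer ReLU network.
    return {"W": rng.standard_normal((h, d)) * 0.1,
            "v": rng.standard_normal(h) * 0.1}

def predict(p, x):
    return p["v"] @ np.maximum(p["W"] @ x, 0.0)  # ReLU hidden layer

def sq_loss(p, x, y):
    return 0.5 * (predict(p, x) - y) ** 2

def sgd(X, Y, p, order, lr=0.01):
    # Run SGD over the given sample order; ReLU uses the subgradient
    # 1{pre-activation > 0} at the kink.
    p = {k: v.copy() for k, v in p.items()}
    for t in order:
        x, y = X[t], Y[t]
        z = p["W"] @ x
        a = np.maximum(z, 0.0)
        err = p["v"] @ a - y
        g = (z > 0).astype(float)
        grad_v = err * a
        grad_W = err * np.outer(p["v"] * g, x)
        p["v"] -= lr * grad_v
        p["W"] -= lr * grad_W
    return p

# Two training sets differing only in example 0.
n, d, h = 50, 5, 16
X = rng.standard_normal((n, d)); Y = rng.standard_normal(n)
X2, Y2 = X.copy(), Y.copy()
X2[0] = rng.standard_normal(d); Y2[0] = rng.standard_normal()

p0 = init_params(d, h)
order = rng.integers(0, n, 500)  # identical sample order for both runs
pA = sgd(X, Y, p0, order)
pB = sgd(X2, Y2, p0, order)

# Empirical stability proxy: sup over fresh points of the loss gap.
Xt = rng.standard_normal((200, d)); Yt = rng.standard_normal(200)
beta_hat = max(abs(sq_loss(pA, x, y) - sq_loss(pB, x, y))
               for x, y in zip(Xt, Yt))
print(f"empirical stability estimate: {beta_hat:.4f}")
```

A small estimate that shrinks as `n` grows is consistent with stable behavior; the theoretical notions above require this uniformly (or almost surely) rather than on one sampled pair of datasets.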
