GENERALIZATION ERROR BOUNDS FOR NEURAL NETWORKS WITH RELU ACTIVATION

Anonymous

Abstract

We show rigorous bounds on the generalization error for Neural Networks with ReLU activation under the condition that the network size does not grow with the training set size. To prove these bounds we weaken the notion of uniform stability of a learning algorithm in a probabilistic way, positing the notion of almost sure (a.s.) support stability and proving that if an algorithm has low enough a.s. support stability, its generalization error tends to 0 as the training set size increases. Further, we show that for Stochastic Gradient Descent to be almost surely support stable we only need the loss function to be locally Lipschitz and locally smooth with probability 1, thereby showing low generalization error under weaker conditions than have been used in the literature. We then show that Neural Networks with ReLU activation and a doubly differentiable loss function possess these properties, thereby proving low generalization error. The caveat is that the size of the NN must not grow with the size of the training set. Finally, we present experimental evidence to validate our theoretical results.

1. INTRODUCTION

Given the importance of the generalization error in machine learning there have been numerous theoretical approaches to bounding it, one of the most significant being the concept of algorithmic stability, whose origins can be traced back to Vapnik & Chervonenkis (1974). In this paradigm the definition of uniform stability by Bousquet & Elisseeff (2002) was a landmark because it allowed the authors to prove sharp concentration bounds showing that the generalization error for algorithms that satisfy this property tends to 0 as the training set size increases. Given the success of Bousquet & Elisseeff (2002) on a variety of non-neural methods, when Hardt et al. (2016) showed that a learning function computed by optimizing certain loss functions through Stochastic Gradient Descent is uniformly stable under mild conditions on the loss function, this clearly raised the expectation that the next step would be to show that Neural Network-based models are also uniformly stable and, hence, have a low generalization error. This analysis was then expected to corroborate the empirically observed fact that the generalization error is low for NN models (c.f., e.g., Krizhevsky et al. (2012), Hinton et al. (2012)). Case closed. However, this triumphal march of algorithmic stability was rudely interrupted by Zhang et al. (2017), who showed explicit examples of distributions that cannot be learned with low generalization error and yet can be learned with an NN of an appropriate size. Zhang et al. (2017) speculated that the problem arose from the fact that the concept of uniform stability was based only on properties of the algorithm and not on the properties of the unknown data distribution. While this was a reasonable speculation, it left open some questions: Is there a reasonable definition of stability that incorporates distribution properties and leads to small generalization error?
And can we show that NNs with non-linear activation functions such as ReLU at the neurons have this property? In this paper we give a strongly affirmative answer to the first question and a partially affirmative answer to the second. We define a notion called almost sure (a.s.) support stability, which is a probabilistic weakening of uniform stability. Unlike the data-dependent notions of stability defined in Kuzborskij & Lampert (2018); Lei & Ying (2020), which bound generalization error in expectation, a.s. support stability can be used to show high-probability bounds on generalization error. This proof entails a mild generalization of McDiarmid's Inequality that could be of independent interest. The major contribution of our paper is that we show how to handle the non-linearity of ReLU in a mathematically rigorous way by modifying a result of Milne (2019) to show that, for well-behaved distributions, ReLU affects the smoothness of the gradients computed by Stochastic Gradient Descent with probability 0. To the best of our knowledge such a result has not been reported in the literature. Clearly the question arises: what about the evidence presented by Zhang et al. (2017) that there exist unknown distributions that can be learned by an NN but will always have unbounded generalization error? To answer this we need to understand that our analysis of SGD requires the loss function to be smooth and Lipschitz in the parameter space (though with an appropriate probabilistic weakening of these properties). Our analysis is able to demonstrate that these properties hold for NNs whose size doesn't grow with the size of the training set.
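The claim that ReLU disrupts smoothness only with probability 0 can be illustrated numerically. The following sketch (not the paper's proof; all names and sizes are illustrative) checks that when inputs come from a continuous distribution, the preactivations of a random ReLU layer never land exactly on the kink at 0, so the gradient is well defined almost surely.

```python
import numpy as np

# Illustrative check: for a distribution that places probability 0 on
# Lebesgue-null sets, a ReLU preactivation w.x + b equals exactly 0 (the
# one point where ReLU is non-differentiable) with probability 0.

rng = np.random.default_rng(0)
d, n_hidden, n_samples = 10, 32, 100_000

W = rng.normal(size=(n_hidden, d))   # first-layer weights
b = rng.normal(size=n_hidden)        # biases

X = rng.normal(size=(n_samples, d))  # inputs from a continuous distribution
pre = X @ W.T + b                    # preactivations, shape (n_samples, n_hidden)

# Fraction of preactivations that hit the ReLU kink exactly:
hits = int(np.sum(pre == 0.0))
print(f"exact-zero preactivations: {hits} out of {pre.size}")
```

On any run, `hits` comes out 0: the kink set {x : w.x + b = 0} has Lebesgue measure 0, which is the geometric fact the measure-zero argument in Section 5 builds on.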
So, our specific answer to the second question listed above is this: almost sure support stability can be used to demonstrate that the generalization error for NNs with ReLU activation tends to 0, as long as the size of the NN doesn't grow too fast with the training set size and the unknown distribution places probability 0 on sets of Lebesgue measure 0. This leaves open the matter of a theoretical explanation for why certain heavily overparametrized NNs generalize well in practice. However, we do provide some directions on how our analytical framework may be used to prove such results. In particular, our contributions are:

• In Section 3 we define a new notion of stability called a.s. support stability and show in Theorem 2 that algorithms with a.s. support stability o(1/log² m) have generalization error tending to 0 as m → ∞, where m is the size of the training set.

• In Section 4 we show that if we run stochastic gradient descent on a parametrized optimization function that is only locally Lipschitz and locally smooth in the parameter space, and that too only on input points selected with probability 1, then the replace-one error is bounded even if the training is conducted for c log m epochs. This implies (Corollary 7) that any learning algorithm trained this way has generalization error that goes to 0 as m → ∞ for learning rate α0/t at step t, for an appropriately selected value of α0 that does not depend on m.

• In Section 5 we show that the output of an NN with ReLU activations, when used with a doubly differentiable loss function, is locally Lipschitz and locally smooth in the parameter space for all inputs except those from a set of Lebesgue measure 0 (Theorem 9). This allows us to show a.s. support stability and a generalization error bound for NNs whose size doesn't grow with m.

• We experimentally verify our theoretical results in Section 6, showing that our bounded Lipschitz and bounded smoothness conditions hold in practice.
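The training regime analyzed in Section 4 can be sketched concretely. The snippet below runs SGD with step size α0/t at global step t for c log m epochs; the quadratic loss, the data, and the values α0 = 0.5 and c = 2 are illustrative choices, not values prescribed by our analysis.

```python
import numpy as np

# Minimal sketch of the Section 4 training regime: SGD with decaying
# learning rate alpha0 / t (t = global step counter), run for c * log(m)
# epochs, where m is the training-set size.

rng = np.random.default_rng(1)
m, d = 1000, 5                       # training-set size, parameter dimension
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d)           # noiseless linear targets, for illustration

w = np.zeros(d)
alpha0, c = 0.5, 2                   # alpha0 does not depend on m
epochs = int(c * np.log(m))          # c log m epochs
t = 1                                # global step counter
for _ in range(epochs):
    for i in rng.permutation(m):
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
        w -= (alpha0 / t) * grad          # step size alpha0 / t
        t += 1

print(f"epochs: {epochs}, final MSE: {np.mean((X @ w - y) ** 2):.4f}")
```

The point of the schedule is that the number of passes grows only logarithmically in m while the step size shrinks, which is what keeps the replace-one error bounded in our analysis.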

2. RELATED WORK

Several authors have noted that although NNs are known to generalize well in practice, many different theoretical approaches have been tried without satisfactorily explaining this phenomenon, c.f., Jin et al. (2020); Chatterjee & Zielinski (2022). We refer the reader to the work of Jin et al. (2020), which presents a concise taxonomy of these different theoretical approaches. There are also several works that seek to understand what a good theory of generalization should look like, c.f. Kawaguchi et al. (2017); Chatterjee & Zielinski (2022). Our own work falls within the paradigm that seeks to use notions of algorithmic stability to bound generalization error, a line of research that began with Vapnik & Chervonenkis (1974) but gathered steam with the publication of the work by Bousquet & Elisseeff (2002).

The applicability of the algorithmic stability paradigm to the study of generalization error in NNs was brought to light by Hardt et al. (2016), who showed that functions optimized via Stochastic Gradient Descent have the property of uniform stability defined by Bousquet & Elisseeff (2002), implying that NNs should also have this property. Subsequently, there was renewed interest in uniform stability, and a sequence of papers emerged using improved probabilistic tools to give better generalization bounds for uniformly stable algorithms, e.g., Feldman & Vondrak (2018; 2019a) and Bousquet et al. (2020). Some other works, e.g., Klochkov & Zhivotovskiy (2021), took this line forward by focussing on the relationship of uniform stability with the excess risk. However, the work of Zhang et al. (2017) complicated the picture by pointing out examples where the theory suggests the opposite of what happens in practice. This led to two different strands of research. In one thread an attempt was made either to characterize those cases where uniform stability holds (e.g., Charles & Papailiopoulos (2018)), or to show lower bounds on stability that ensure that uniform stability does not

