NONVACUOUS LOSS BOUNDS WITH FAST RATES FOR NEURAL NETWORKS VIA CONDITIONAL INFORMATION MEASURES

Abstract

We present a framework to derive bounds on the test loss of randomized learning algorithms for the case of bounded loss functions. This framework leads to bounds that depend on the conditional information density between the output hypothesis and the choice of the training set, given a larger set of data samples from which the training set is formed. Furthermore, the bounds pertain to the average test loss as well as to its tail probability, both in the PAC-Bayesian and the single-draw settings. If the conditional information density is bounded uniformly in the size n of the training set, our bounds decay as 1/n, which is referred to as a fast rate. This is in contrast with the tail bounds involving conditional information measures available in the literature, which have a less benign 1/√n dependence. We demonstrate the usefulness of our tail bounds by showing that they lead to estimates of the test loss achievable with several neural network architectures trained on MNIST and Fashion-MNIST that match the state-of-the-art bounds available in the literature.

1. INTRODUCTION

In recent years, there has been a surge of interest in the use of information-theoretic techniques for bounding the loss of learning algorithms. While the first results of this flavor can be traced to the probably approximately correct (PAC)-Bayesian approach (McAllester, 1998; Catoni, 2007) (see also (Guedj, 2019) for a recent review), the connection between loss bounds and classical information-theoretic measures was made explicit in the works of Russo & Zou (2016) and Xu & Raginsky (2017), where bounds on the average population loss were derived in terms of the mutual information between the training data and the output hypothesis. Since then, these average loss bounds have been tightened (Bu et al., 2019; Asadi et al., 2018; Negrea et al., 2019). Furthermore, the information-theoretic framework has also been successfully applied to derive tail probability bounds on the population loss (Bassily et al., 2018; Esposito et al., 2019; Hellström & Durisi, 2020a). Of particular relevance to the present paper is the random-subset setting, introduced by Steinke & Zakynthinou (2020) and further studied in (Hellström & Durisi, 2020b; Haghifam et al., 2020). In this setting, a random vector S is used to select n training samples Z(S) from a larger set Z of 2n samples. Then, bounds on the average population loss are derived in terms of the conditional mutual information (CMI) I(W; S | Z) between the chosen hypothesis W and the random vector S given the set Z. The bounds obtained by Xu & Raginsky (2017) depend on the mutual information I(W; Z), a quantity that can be unbounded if W reveals too much about the training set Z. In contrast, bounds for the random-subset setting are always finite, since I(W; S | Z) is never larger than n bits.
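The random-subset construction just described can be sketched in a few lines. One common arrangement, following Steinke & Zakynthinou (2020), is to view the 2n samples as n pairs, with each entry of the binary vector S selecting one sample per pair; the function and variable names below are our own, for illustration only.

```python
import numpy as np

def random_subset_split(Z, rng):
    """Given a supersample Z of 2n examples, arranged as n pairs, draw a
    uniform binary vector S and select one example from each pair as the
    training set Z(S). Since S consists of n bits, the conditional mutual
    information I(W; S | Z) can never exceed n bits.
    """
    n = len(Z) // 2
    Z_pairs = np.asarray(Z).reshape(n, 2)      # n pairs of candidate samples
    S = rng.integers(0, 2, size=n)             # S uniform over {0, 1}^n
    Z_train = Z_pairs[np.arange(n), S]         # Z(S): the n selected samples
    Z_heldout = Z_pairs[np.arange(n), 1 - S]   # the n unselected "ghost" samples
    return S, Z_train, Z_heldout
```

The held-out half of the supersample is what makes the CMI bounds finite: the hypothesis W can reveal at most which of the two candidates in each pair was used for training.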
Most information-theoretic population loss bounds mentioned thus far are given by the training loss plus a term with a √(IM(P_WZ)/n) dependence, where IM(P_WZ) denotes an information measure, such as the mutual information or the maximal leakage (Issa et al., 2020). Assuming that the information measure grows at most polylogarithmically with n, the convergence rate of the population loss to the training loss is Õ(1/√n), where the Õ-notation hides logarithmic factors. This is sometimes referred to as a slow rate. In the context of bounds on the excess risk, defined as the difference between the population loss achieved by a chosen hypothesis w and its infimum over the hypothesis class, it is known that slow rates are optimal for worst-case distributions and hypothesis classes (Talagrand, 1994). However, it is also known that under the assumption of realizability (i.e., the existence of a w in the hypothesis class such that the population loss L_{P_Z}(w) = 0) and when the hypothesis class is finite, the dependence on the sample size can be improved to Õ(1/n) (Vapnik, 1998, Chapter 4). This is referred to as a fast rate. Excess risk bounds with fast rates for randomized classifiers have also been derived, under certain additional conditions, for both bounded losses (Van Erven et al., 2015) and unbounded losses (Grünwald & Mehta, 2020). Notably, Steinke & Zakynthinou (2020, Thm. 2(3)) derive a population loss bound whose dependence on n is I(W; S | Z)/n. The price for this improved dependence is that the training loss added to the n-dependent term is multiplied by a constant larger than 1. Furthermore, (Steinke & Zakynthinou, 2020, Thm. 8) shows that if the Vapnik-Chervonenkis (VC) dimension of the hypothesis class is finite, there exists an empirical risk minimizer (ERM) whose CMI grows at most logarithmically with n. This implies that the CMI approach leads to fast-rate bounds in certain scenarios. However, the result in (Steinke & Zakynthinou, 2020, Thm.
2(3)) pertains only to the average population loss: no tail bounds on the population loss are provided. Throughout the paper, we will, with an abuse of terminology, refer to bounds with an n-dependence of the form IM(P_WZ)/n as fast-rate bounds. Such bounds are also known as linear bounds (Dziugaite et al., 2020). Note that the n-dependence of the information measure IM(P_WZ) has to be at most polylogarithmic for such bounds to actually achieve a fast rate in the usual sense.

An intriguing open problem in statistical learning is to find a theoretical justification for the capability of overparameterized neural networks (NNs) to achieve good generalization performance despite being able to memorize randomly labeled training data sets (Zhang et al., 2017). As a consequence of this behavior, classical population loss bounds that hold uniformly over a given hypothesis class, such as VC bounds, are vacuous when applied to overparameterized NNs. This has stimulated recent efforts aimed at obtaining tighter population loss bounds that are algorithm-dependent or data-dependent. In the past few years, several studies have shown that promising bounds are attainable by using techniques from the PAC-Bayesian literature (Dziugaite & Roy, 2017; Zhou et al., 2019; Dziugaite et al., 2020). The PAC-Bayesian approach entails using the Kullback-Leibler (KL) divergence to compare the distribution on the weights of the NN induced by training to some reference distribution. These distributions are referred to as the posterior and the prior, respectively. Recently, Dziugaite et al. (2020) used data-dependent priors to obtain state-of-the-art bounds for LeNet-5 trained on MNIST and Fashion-MNIST. In their approach, the available data is used both for training the network and for choosing the prior. This leads to a bound that is tighter than previously available bounds.
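When, as is common in this line of work, both the posterior and the prior are taken to be Gaussian distributions with diagonal covariance over the network weights, the KL term entering the PAC-Bayesian bounds has a simple closed form. The following helper is our own illustrative sketch, not code from any of the cited papers:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) in nats for diagonal Gaussians Q = N(mu_q, diag(sigma_q^2))
    and P = N(mu_p, diag(sigma_p^2)). In the PAC-Bayesian setting, Q plays
    the role of the posterior over the NN weights and P that of the prior.
    """
    mu_q, sigma_q = np.asarray(mu_q, float), np.asarray(sigma_q, float)
    mu_p, sigma_p = np.asarray(mu_p, float), np.asarray(sigma_p, float)
    # Per-coordinate KL for univariate Gaussians, summed over the weights.
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    ))
```

A data-dependent prior, in this picture, amounts to choosing mu_p and sigma_p using part of the data so that the posterior reached by training lands close to the prior, shrinking the KL term.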
Furthermore, the bound can be tightened by minimizing the KL divergence between the posterior and the chosen prior during training. One drawback of the PAC-Bayesian approach is that it applies only to stochastic NNs, whose weights are drawn at random each time the network is used, and not to deterministic NNs with fixed weights. Information-theoretic bounds have also been derived for iterative, noisy training algorithms such as stochastic gradient Langevin dynamics (SGLD) (Bu et al., 2019). These bounds lead to nonvacuous estimates of the population loss of overparameterized NNs that are trained using SGLD through the use of data-dependent priors (Negrea et al., 2019). However, these bounds apply neither to deterministic NNs nor to standard stochastic gradient descent (SGD) training. Furthermore, the bounds pertain to the average population loss, and not to its tails. Although the techniques yielding these estimates can be adapted to the PAC-Bayesian setting, as discussed by Negrea et al. (2019, App. I), the resulting bounds are generally loose.
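To make the slow-rate versus fast-rate distinction discussed above concrete, the following sketch compares the two schematic bound shapes numerically. The forms and the constant c = 2 are purely illustrative; the actual constants and logarithmic factors in the cited bounds differ.

```python
import math

def slow_rate_bound(train_loss, im, n):
    """Schematic slow-rate bound: training loss + sqrt(IM / n)."""
    return train_loss + math.sqrt(im / n)

def fast_rate_bound(train_loss, im, n, c=2.0):
    """Schematic fast-rate (linear) bound: c * training loss + IM / n.
    The price for the better n-dependence is the constant c > 1
    multiplying the training loss; c = 2 is an arbitrary illustration.
    """
    return c * train_loss + im / n
```

For IM = 10 nats and n = 10000, a training loss of 0.01 gives a fast-rate value of 0.021 against roughly 0.042 for the slow-rate form, whereas for a training loss of 0.3 the ordering reverses: the fast-rate shape pays off precisely when the training loss is small.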

1.1. CONTRIBUTIONS

In this paper, we extend the fast-rate average loss bound by Steinke & Zakynthinou (2020) to the PAC-Bayesian and the single-draw settings. We then use the resulting PAC-Bayesian and single-draw bounds to characterize the test loss of NNs used to classify images from the MNIST and Fashion-MNIST data sets. The single-draw bounds can be applied to deterministic NNs trained through SGD but with Gaussian noise added to the final weights, whereas the PAC-Bayesian bounds apply only to randomized NNs, whose weights are drawn from a Gaussian distribution each time the network is used. For the same setup, we also evaluate the slow-rate PAC-Bayesian and single-draw bounds from (Hellström & Durisi, 2020b). Our numerical experiments reveal that both the slow-rate

