NONVACUOUS LOSS BOUNDS WITH FAST RATES FOR NEURAL NETWORKS VIA CONDITIONAL INFORMATION MEASURES

Abstract

We present a framework to derive bounds on the test loss of randomized learning algorithms for the case of bounded loss functions. This framework leads to bounds that depend on the conditional information density between the output hypothesis and the choice of the training set, given a larger set of data samples from which the training set is formed. Furthermore, the bounds pertain to the average test loss as well as to its tail probability, both for the PAC-Bayesian and the single-draw settings. If the conditional information density is bounded uniformly in the size n of the training set, our bounds decay as 1/n, which is referred to as a fast rate. This is in contrast with the tail bounds involving conditional information measures available in the literature, which have a less benign 1/√n dependence. We demonstrate the usefulness of our tail bounds by showing that they lead to estimates of the test loss achievable with several neural network architectures trained on MNIST and Fashion-MNIST that match the state-of-the-art bounds available in the literature.

1. INTRODUCTION

In recent years, there has been a surge of interest in the use of information-theoretic techniques for bounding the loss of learning algorithms. While the first results of this flavor can be traced to the probably approximately correct (PAC)-Bayesian approach (McAllester, 1998; Catoni, 2007) (see also (Guedj, 2019) for a recent review), the connection between loss bounds and classical information-theoretic measures was made explicit in the works of Russo & Zou (2016) and Xu & Raginsky (2017), where bounds on the average population loss were derived in terms of the mutual information between the training data and the output hypothesis. Since then, these average loss bounds have been tightened (Bu et al., 2019; Asadi et al., 2018; Negrea et al., 2019). Furthermore, the information-theoretic framework has also been successfully applied to derive tail probability bounds on the population loss (Bassily et al., 2018; Esposito et al., 2019; Hellström & Durisi, 2020a).

Of particular relevance to the present paper is the random-subset setting, introduced by Steinke & Zakynthinou (2020) and further studied in (Hellström & Durisi, 2020b; Haghifam et al., 2020). In this setting, a random vector S is used to select n training samples Z(S) from a larger set Z of 2n samples (see the sketch at the end of this section). Bounds on the average population loss are then derived in terms of the conditional mutual information (CMI) I(W; S | Z) between the chosen hypothesis W and the random vector S, given the set Z. The bounds obtained by Xu & Raginsky (2017) depend on the mutual information I(W; Z), a quantity that can be unbounded if W reveals too much about the training set Z. In contrast, bounds for the random-subset setting are always finite, since I(W; S | Z) is never larger than n bits.

Most of the information-theoretic population loss bounds mentioned thus far are given by the training loss plus a term that scales as √(IM(P_WZ)/n), where IM(P_WZ) denotes an information measure, such as the mutual information or the maximal leakage (Issa et al., 2020). Assuming that the information measure grows at most polylogarithmically with n, the population loss converges to the training loss at rate Õ(1/√n), where the Õ-notation hides logarithmic factors. This is sometimes referred to as a slow rate. In the context of bounds on the excess risk, defined as the difference between the population loss achieved by a chosen hypothesis w and its infimum over the hypothesis class, it is known that slow rates are optimal for worst-case distributions and hypothesis classes (Talagrand,
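To make the random-subset construction concrete, the following minimal NumPy sketch illustrates how a supersample Z of 2n examples, arranged in n pairs, and a selection vector S of n independent Bernoulli(1/2) bits determine the training set Z(S). The array shapes, the Gaussian dummy data, and the variable names are illustrative assumptions only; they are not taken from the paper or from any released code.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of the
# random-subset setting of Steinke & Zakynthinou (2020): a supersample Z of
# 2n examples arranged in n pairs, and a selection vector S of n independent
# Bernoulli(1/2) bits that picks one example from each pair to form Z(S).

import numpy as np

rng = np.random.default_rng(0)

n = 5                             # number of training examples
d = 3                             # feature dimension (arbitrary here)

# Supersample: n pairs of i.i.d. dummy examples, shape (n, 2, d).
Z = rng.normal(size=(n, 2, d))

# Selection vector S: n independent fair coin flips, S_i in {0, 1}.
S = rng.integers(0, 2, size=n)

# Training set Z(S): from pair i, keep the example indexed by S_i.
Z_S = Z[np.arange(n), S]          # shape (n, d)

# A learning algorithm sees only Z_S; the CMI bounds control I(W; S | Z),
# which cannot exceed n bits because S takes at most 2^n values.
print(Z_S.shape)                  # (5, 3)
```

The pairing is the key design choice: because the conditioning set Z already contains every candidate example and S only encodes which member of each pair was used for training, the resulting CMI is bounded by n bits, which is what keeps the bounds in this setting finite.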

