GENERALIZATION BOUNDS VIA DISTILLATION

Abstract

This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar10 and mnist demonstrating similar generalization performance between the original network and its distillation.

1. OVERVIEW AND MAIN RESULTS

Generalization bounds are statistical tools which take as input various measurements of a predictor on training data, and output a performance estimate for unseen data; that is, they estimate how well the predictor generalizes to unseen data. Despite extensive development spanning many decades (Anthony & Bartlett, 1999), there is growing concern that these bounds are not only disastrously loose (Dziugaite & Roy, 2017), but worse that they do not correlate with the underlying phenomena (Jiang et al., 2019b), and even that the basic method of proof is doomed (Zhang et al., 2016; Nagarajan & Kolter, 2019). As an explicit demonstration of the looseness of these bounds, Figure 1 calculates bounds for a standard ResNet architecture achieving test errors of respectively 0.008 and 0.067 on mnist and cifar10; the observed generalization gap is 10^{-1}, while standard generalization bounds upper-bound it by 10^{15}.

Contrary to this dilemma, there is evidence that these networks can often be compressed or distilled into simpler networks, while still preserving their output values and low test error. Meanwhile, these simpler networks exhibit vastly better generalization bounds: again referring to Figure 1, those same networks from before can be distilled with hardly any change to their outputs, while their bounds reduce by a factor of roughly 10^{10}.

Distillation is widely studied (Buciluȗ et al., 2006; Hinton et al., 2015), but usually the original network is discarded and only the final distilled network is preserved. The purpose of this work is to carry the good generalization bounds of the distilled network back to the original network; in a sense, the explicit simplicity of the distilled network is used as a witness to the implicit simplicity of the original network. The main contributions are as follows.
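The distillation phenomenon described above can be illustrated with a minimal numerical sketch: a low-complexity student model is fit to match the outputs of a high-complexity teacher, and the two are then compared on fresh data. The particular models below (a random-feature "teacher" and a linear "student") are illustrative assumptions, not the paper's architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a "high-complexity" trained predictor: a wide random
# feature model on 2-d inputs producing 3-class logits (hypothetical;
# any black-box predictor would serve the same role).
W = rng.standard_normal((2, 256))
v = rng.standard_normal((256, 3))
def teacher_logits(x):
    return np.tanh(x @ W) @ v  # shape (n, 3)

# Distillation: fit a low-complexity student (here an affine map) to
# match the teacher's logits on the training sample, via least squares.
X = rng.standard_normal((500, 2))
T = teacher_logits(X)
A, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], T, rcond=None)
def student_logits(x):
    return np.c_[x, np.ones(len(x))] @ A

# Agreement of predicted labels on held-out data: a successful
# distillation preserves the teacher's predictions while having far
# lower complexity (and hence far smaller generalization bounds).
Xtest = rng.standard_normal((500, 2))
agree = np.mean(teacher_logits(Xtest).argmax(1)
                == student_logits(Xtest).argmax(1))
print(f"label agreement: {agree:.2f}")
```

The point of the sketch is only the workflow: the distilled predictor, not the original, is the object with a tractable complexity measure.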
• The main theoretical contribution is a generalization bound for the original, undistilled network which scales primarily with the generalization properties of its distillation, assuming that well-behaved data augmentation is used to measure the distillation distance. An abstract version of this bound is stated in Lemma 1.1, along with a sufficient data augmentation technique in Lemma 1.2. A concrete version of the bound, suitable to handle the ResNet architecture in Figure 1, is described in Theorem 1.3. Handling sophisticated architectures with only minor proof alterations is another contribution of this work, and is described alongside Theorem 1.3. This abstract and concrete analysis is sketched in Section 3, with full proofs deferred to appendices.

• Rather than using an assumption on the distillation process (e.g., the aforementioned "well-behaved data augmentation"), this work also gives a direct uniform convergence analysis, culminating in Theorem 1.4. This is presented partially as an open problem or cautionary tale, as
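The "distillation distance measured under data augmentation" that the first contribution relies on can be estimated empirically: average the output discrepancy between the original network and its distillation over randomly perturbed copies of the training points. The sketch below uses Gaussian perturbations and tiny random models purely as stand-in assumptions; the paper's Lemma 1.2 concerns the augmentation scheme, not these particular choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical original network f and distilled network g (tiny random
# models for illustration only).
W = rng.standard_normal((2, 64))
v = rng.standard_normal((64, 3))
f = lambda x: np.tanh(x @ W) @ v
A = 0.1 * rng.standard_normal((2, 3))
g = lambda x: x @ A

X = rng.standard_normal((200, 2))  # training sample

def distill_distance(f, g, X, k=10, sigma=0.1):
    """Average output gap between f and g over k random Gaussian
    perturbations of each training point; the noise stands in for a
    generic well-behaved data augmentation scheme."""
    gaps = []
    for _ in range(k):
        Xa = X + sigma * rng.standard_normal(X.shape)
        gaps.append(np.mean(np.abs(f(Xa) - g(Xa))))
    return float(np.mean(gaps))

print(f"estimated distillation distance: {distill_distance(f, g, X):.3f}")
```

A small value of this estimate is the empirical certificate that lets the distilled network's generalization bound be transferred back to the original network.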

