THE DEEP BOOTSTRAP FRAMEWORK: GOOD ONLINE LEARNERS ARE GOOD OFFLINE GENERALIZERS

Abstract

We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error, plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is "because" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and it lays a foundation for future research in the area.

1. INTRODUCTION

Figure 1: Three architectures trained from scratch on CIFAR-5m, a CIFAR-10-like task. The Real World is trained on 50K samples for 100 epochs, while the Ideal World is trained on 5M samples in 1 pass. The Real World test error remains close to the Ideal World test error, despite a large generalization gap.

The goal of a generalization theory in supervised learning is to understand when and why trained models have small test error. The classical framework of generalization decomposes the test error of a model f_t as:

    TestError(f_t) = TrainError(f_t) + [TestError(f_t) - TrainError(f_t)]    (1)
                                        (generalization gap)

and studies each part separately (e.g. Vapnik and Chervonenkis (1971); Blumer et al. (1989); Shalev-Shwartz and Ben-David (2014)). Many works have applied this framework to study generalization of deep networks (e.g. Bartlett (1997); Bartlett et al. (1999); Bartlett and Mendelson (2002); Anthony and Bartlett (2009); Neyshabur et al. (2015b); Dziugaite and Roy (2017); Bartlett et al. (2017); Neyshabur et al. (2017); Harvey et al. (2017); Golowich et al. (2018); Arora et al. (2018; 2019); Allen-Zhu et al. (2019); Long and Sedghi (2019); Wei and Ma (2019)).

However, there are at least two obstacles to understanding generalization of modern neural networks via the classical approach.

1. Modern methods can interpolate, reaching TrainError ≈ 0, while still performing well. In these settings, the decomposition of Equation (1) does not actually reduce test error into two different subproblems: it amounts to writing TestError = 0 + TestError. That is, understanding the generalization gap here is exactly equivalent to understanding the test error itself.

2. Most if not all techniques for understanding the generalization gap (e.g. uniform convergence, VC-dimension, regularization, stability, margins) remain vacuous (Zhang et al., 2017; Belkin et al., 2018a;b; Nagarajan and Kolter, 2019) and not predictive (Nagarajan and Kolter, 2019; Jiang et al., 2019; Dziugaite et al., 2020) for modern networks.

In this work, we propose an alternate approach to understanding generalization that helps overcome these obstacles.
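The degeneracy in obstacle 1 is simple arithmetic; a minimal sketch makes it concrete (the error values below are hypothetical, chosen only for illustration):

```python
# The classical decomposition (Eq. 1): TestError = TrainError + gap.

def classical_decomposition(train_error, test_error):
    """Return (train_error, generalization_gap); by construction they sum to test_error."""
    gap = test_error - train_error
    return train_error, gap

# A non-interpolating model: test error genuinely splits into two parts.
train, gap = classical_decomposition(train_error=0.05, test_error=0.12)
assert abs((train + gap) - 0.12) < 1e-12

# An interpolating model (TrainError = 0): the decomposition degenerates to
# TestError = 0 + TestError, so "explaining the gap" just means explaining
# the test error itself.
train, gap = classical_decomposition(train_error=0.0, test_error=0.12)
assert gap == 0.12
```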
The key idea is to consider an alternate decomposition:

    TestError(f_t) = TestError(f_t^iid) + [TestError(f_t) - TestError(f_t^iid)]    (2)
                     (A: Online Learning)  (B: Bootstrap error)

where f_t is the neural network after t optimization steps (the "Real World"), and f_t^iid is a network trained identically to f_t, but using fresh samples from the distribution in each mini-batch step (the "Ideal World"). That is, f_t^iid is the result of optimizing on the population loss for t steps, while f_t is the result of optimizing on the empirical loss as usual (we define this more formally later). This leads to a different decoupling of concerns, and proposes an alternate research agenda to understand generalization. To understand generalization in the bootstrap framework, it is sufficient to understand:

(A) Online Learning: How quickly models optimize on the population loss, in the infinite-data regime (the Ideal World).

(B) Finite-Sample Deviations: How closely models behave in the finite-data vs. infinite-data regime (the bootstrap error).

Although neither of these points is theoretically understood for deep networks, they are closely related to rich areas in optimization and statistics, whose tools have not been brought fully to bear on the problem of generalization. The first part (A) is purely a question in online stochastic optimization: We have access to a stochastic gradient oracle for a population loss function, and we are interested in how quickly an online optimization algorithm (e.g. SGD, Adam) reaches small population loss. This problem is well-studied in the online learning literature for convex functions (Bubeck, 2011; Hazan, 2019; Shalev-Shwartz et al., 2011), and is an active area of research in nonconvex settings (Jin et al., 2017; Lee et al., 2016; Jain and Kar, 2017; Gao et al., 2018; Yang et al., 2018; Maillard and Munos, 2010).
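The Real World / Ideal World coupling can be sketched on a toy problem. The linear-regression setup, sample sizes, and learning rate below are illustrative assumptions, not the paper's experimental configuration; the point is only that the two worlds run the same optimizer and differ solely in whether minibatches reuse a fixed training set or draw fresh samples:

```python
# Toy sketch of the two worlds: minibatch SGD on linear regression.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, batch, steps, lr = 10, 100, 10, 200, 0.1
w_true = rng.normal(size=d)

def sample(n):
    """Draw n fresh (x, y) pairs from the population distribution."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def pop_loss(w, n_eval=5000):
    """Monte Carlo estimate of the population (test) loss."""
    X, y = sample(n_eval)
    return np.mean((X @ w - y) ** 2)

# Real World: t steps of minibatch SGD on a fixed training set (samples reused).
X_tr, y_tr = sample(n_train)
w_real = np.zeros(d)
for _ in range(steps):
    idx = rng.integers(0, n_train, size=batch)   # minibatch reuses the n_train samples
    Xb, yb = X_tr[idx], y_tr[idx]
    w_real -= lr * 2 * Xb.T @ (Xb @ w_real - yb) / batch

# Ideal World: the same optimizer, same t, but every minibatch is freshly
# sampled -- i.e. SGD directly on the population loss.
w_ideal = np.zeros(d)
for _ in range(steps):
    Xb, yb = sample(batch)                       # fresh samples each step
    w_ideal -= lr * 2 * Xb.T @ (Xb @ w_ideal - yb) / batch

# Term B of Eq. (2): the bootstrap error is a difference of two *test* errors.
bootstrap_error = pop_loss(w_real) - pop_loss(w_ideal)
```

Note that, as in the framework, no train quantity is ever compared to a test quantity: both terms of `bootstrap_error` are population losses.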
In the context of neural networks, optimization is usually studied on the empirical loss landscape (Arora et al., 2019; Allen-Zhu et al., 2019), but we propose studying optimization on the population loss landscape directly. This highlights a key difference in our approach: we never compare test and train quantities; we consider only test quantities.

The second part (B) involves approximating fresh samples with "reused" samples, and reasoning about the behavior of certain functions under this approximation. This is closely related to the nonparametric bootstrap in statistics (Efron, 1979; Efron and Tibshirani, 1986), where sampling from the population distribution is approximated by sampling with replacement from an empirical distribution. Bootstrapped estimators are widely used in applied statistics, and their theoretical properties are known in certain cases (e.g. Hastie et al. (2009); James et al. (2013); Efron and Hastie (2016); Van der Vaart (2000)). Although current bootstrap theory does not apply to neural networks, it is conceivable that these tools could eventually be extended to our setting.

Experimental Validation. Beyond the theoretical motivation, our main experimental claim is that the bootstrap decomposition is actually useful: in realistic settings, the bootstrap error is often small, and the performance of real classifiers is largely captured by their performance in the Ideal World. Figure 1 shows one example of this, as a preview of our more extensive experiments in Section 4. We plot the test error of a ResNet (He et al., 2016a), an MLP, and a Vision Transformer (Dosovitskiy et al., 2020) on a CIFAR-10-like task, over increasing minibatch SGD iterations. The Real World is trained on 50K samples for 100 epochs. The Ideal World is trained on 5 million samples with a single pass. Notice that the bootstrap error is small for all architectures, although the generalization gap is large.
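As a reminder of the statistical tool that part (B) connects to, here is a minimal sketch of the nonparametric bootstrap: sampling from the population is approximated by resampling with replacement from one empirical sample. The dataset and statistic (standard error of a sample mean) are illustrative choices, not tied to the paper's setting:

```python
# Nonparametric bootstrap (Efron, 1979) for the standard error of the mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # one empirical sample of n=200

# Resample with replacement B times, recomputing the statistic each time.
B = 2000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean() for _ in range(B)
])
boot_se = boot_means.std(ddof=1)

# For the mean, the bootstrap estimate should be close to the classical
# closed-form answer s / sqrt(n), which is what makes it a useful sanity check.
classical_se = data.std(ddof=1) / np.sqrt(data.size)
```

The appeal of the method is that the resampling loop never needed the closed form: the same loop estimates the variability of any statistic, which is why extending such tools to trained networks is an appealing (if currently open) direction.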
