THE DEEP BOOTSTRAP FRAMEWORK: GOOD ONLINE LEARNERS ARE GOOD OFFLINE GENERALIZERS

Abstract

We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is "because" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.
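To make the coupling concrete, here is a minimal self-contained sketch of the two worlds; it is not the paper's experimental code, and the synthetic population, model, and hyperparameters are illustrative assumptions. Both worlds run the same optimizer for the same number of steps; they differ only in where the gradient comes from: the Real World reuses a finite train set across many epochs, while the Ideal World draws a fresh minibatch from the population at every step (so its stochastic gradient is an unbiased estimate of the population gradient).

```python
import torch
import torch.nn.functional as F

def sample_population(n, dim=20):
    """Draw n fresh samples from a synthetic 'population' distribution."""
    x = torch.randn(n, dim)
    y = (x.sum(dim=1) > 0).long()  # simple ground-truth labeling rule
    return x, y

def make_model(dim=20):
    return torch.nn.Sequential(
        torch.nn.Linear(dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
    )

torch.manual_seed(0)
n_train, steps, batch = 512, 2000, 64

# Real World: a fixed train set of n_train samples, reused across epochs.
x_tr, y_tr = sample_population(n_train)
real = make_model()
opt_real = torch.optim.SGD(real.parameters(), lr=0.1)

# Ideal World: the same optimizer, but every step sees fresh samples.
ideal = make_model()
opt_ideal = torch.optim.SGD(ideal.parameters(), lr=0.1)

for _ in range(steps):
    # Real World step: minibatch resampled from the finite train set.
    idx = torch.randint(0, n_train, (batch,))
    loss = F.cross_entropy(real(x_tr[idx]), y_tr[idx])
    opt_real.zero_grad()
    loss.backward()
    opt_real.step()

    # Ideal World step: minibatch drawn fresh from the population, so
    # its empirical loss is an unbiased estimate of the population loss.
    x_new, y_new = sample_population(batch)
    loss = F.cross_entropy(ideal(x_new), y_new)
    opt_ideal.zero_grad()
    loss.backward()
    opt_ideal.step()

# Compare the two coupled runs on a common held-out test set.
x_te, y_te = sample_population(10_000)
with torch.no_grad():
    for name, model in [("Real World", real), ("Ideal World", ideal)]:
        err = (model(x_te).argmax(dim=1) != y_te).float().mean().item()
        print(f"{name} test error: {err:.3f}")
```

The framework's claim is then a statement about the two printed numbers: if they stay close for the architectures and optimizers of interest, bounding Real World test error reduces to bounding Ideal World (online) optimization error.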

1. INTRODUCTION

Figure 1: Three architectures trained from scratch on CIFAR-5m, a CIFAR-10-like task. The Real World is trained on 50K samples for 100 epochs, while the Ideal World is trained on 5M samples in a single pass. The Real World test error remains close to the Ideal World test error, despite a large generalization gap.

The goal of a generalization theory in supervised learning is to understand when and why trained models have small test error. The classical framework of generalization decomposes the test error of a model $f_t$ as:

$$\mathrm{TestError}(f_t) = \mathrm{TrainError}(f_t) + \underbrace{[\mathrm{TestError}(f_t) - \mathrm{TrainError}(f_t)]}_{\text{Generalization gap}} \tag{1}$$

and studies each part separately (e.g. Vapnik and Chervonenkis (1971); Blumer et al. (1989); Shalev-Shwartz and Ben-David (2014)). Many works have applied this framework to study generalization of deep networks (e.g. Bartlett (1997); Bartlett et al. (1999); Bartlett and Mendelson (2002); Anthony and Bartlett (2009); Neyshabur et al. (2015b); Dziugaite and Roy (2017); Bartlett et al. (2017); Neyshabur et al. (2017); Harvey et al. (2017); Golowich et al. (2018); Arora et al. (2018; 2019));
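Note that Eq. (1) is an identity, obtained by adding and subtracting the train error; its content lies in which terms one then tries to bound. A toy numerical illustration (the error values below are made up for illustration only):

```python
# Eq. (1): TestError = TrainError + (TestError - TrainError).
train_error = 0.01  # e.g. a deep net that nearly interpolates its train set
test_error = 0.12   # its error on held-out data

generalization_gap = test_error - train_error  # the bracketed term in Eq. (1)
assert abs(test_error - (train_error + generalization_gap)) < 1e-12

print(f"train error {train_error:.2f} + gap {generalization_gap:.2f} "
      f"= test error {test_error:.2f}")
```

The classical approach bounds the gap term; the framework proposed here instead decomposes the same test error against the Ideal World, where no train/test split exists.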

