SEQUENTIAL LEARNING OF NEURAL NETWORKS FOR PREQUENTIAL MDL

Abstract

Minimum Description Length (MDL) provides a framework and an objective for principled model evaluation. It formalizes Occam's Razor and can be applied to data from non-stationary sources. In the prequential formulation of MDL, the objective is to minimize the cumulative next-step log-loss when sequentially going through the data and using previous observations for parameter estimation. It thus closely resembles a continual- or online-learning problem. In this study, we evaluate approaches for computing prequential description lengths for image classification datasets with neural networks. Considering the computational cost, we find that online learning with rehearsal has favorable performance compared to the previously widely used block-wise estimation. We propose forward-calibration to better align the model's predictions with the empirical observations, and introduce replay streams, a mini-batch incremental training technique that efficiently implements approximate random replay while avoiding large in-memory replay buffers. As a result, we present description lengths for a suite of image classification datasets that improve upon previously reported results by large margins.

1. INTRODUCTION

Within the field of deep learning, the paradigm of empirical risk minimization (ERM, Vapnik (1991)) together with model and hyper-parameter selection based on held-out data is the prevailing training and evaluation protocol. This approach has served the field well, supporting seamless scaling to large model and dataset sizes. The core assumptions of ERM are: a) the existence of fixed but unknown distributions q(x) and q(y|x) that represent the data-generating process for the problem under consideration; b) the goal is to obtain a function ŷ = f(x, θ*) that minimizes some loss L(y, ŷ) in expectation over data drawn from q; and c) that we are given a set of i.i.d. samples from q to use as training and validation data. Its simplicity and well-understood theoretical properties make ERM an attractive framework when developing learning machines. However, we sometimes wish to employ deep-learning techniques in situations where not all of these basic assumptions hold. For example, if we do not assume a fixed data-generating distribution q, we enter the realm of continual learning, lifelong learning, or online learning. Multiple terms are used for these scenarios because they operate under different constraints, and because there is some ambiguity about exactly what problem a learning machine is supposed to solve. A recent survey on continual and lifelong learning, for example, describes multiple, sometimes conflicting desiderata considered in the literature: forward transfer, backward transfer, avoiding catastrophic forgetting, and maintaining plasticity (Hadsell et al., 2020). Another situation in which the ERM framework is not necessarily the best approach is when minimizing the expected loss L is not the only, or perhaps not even the primary, objective of the learning machine. For example, deep-learning techniques have recently been used to aid structural and causal inference (Vowels et al., 2021).
In these cases, we are more interested in model selection or in some aspect of the learned parameters θ* than in the generalization loss L. Here we have little to gain from the generalization bounds provided by the ERM framework, and in fact some of its properties can be harmful.

Independent of ERM, compression-based approaches to inference and learning such as Minimum Description Length (Rissanen, 1984; Grünwald, 2004), Minimum Message Length (Wallace, 2005) and Solomonoff's theory of inductive inference (Solomonoff, 1964; Rathmanner & Hutter, 2011) have been studied extensively. In practice, however, they have primarily been applied to small-scale problems and with very simple models compared to deep neural networks. These approaches strive to find and train models that compress the observed data well, relying on the fact that such models have a good chance of generalizing to future data. It is considered a major benefit that these approaches come with a clear objective and have been studied in detail even for concrete sequences of observations, without assuming stationarity or a probabilistic generative process at all (the so-called individual-sequence scenario). The individual-sequence setting is problematic for ERM because the fundamental step of creating training and validation splits is not clearly defined. In fact, creating such splits implies making an assumption about what is equally distributed across training and validation data. But creating validation splits can be problematic even with (assumed) stationary distributions, and this becomes more pressing as ML research moves from curated academic benchmark datasets to large user-provided or web-scraped data. In contrast to ERM, compression approaches furthermore include a form of Occam's Razor, i.e. they prefer simpler models (according to some implicit measure of complexity) as long as the simplicity does not harm the model's predictive performance. The literature considers multiple, subtly different approaches to defining and computing description lengths. Most of them are intractable, and known approximations break down for overparameterized model families such as neural networks.
One particular approach, however, prequential MDL (Dawid & Vovk, 1999; Poland & Hutter, 2005), turns computing the description length L(D|M) = -log p(D|M) into a specific kind of continual- or online-learning problem:

L(D|M) = -log p(D|M) := -∑_{t=1}^{T} log p_M(y_t | x_t, θ(D_{<t})),

where D = {(x_t, y_t)}_{t=1}^{T} is the sequence of inputs x_t and associated prediction targets y_t; p_M(y|x, θ) denotes a model family M and θ(D_{<t}) an associated parameter estimator given training data D_{<t}. In contrast to ERM, which considers the performance of a learning algorithm on held-out data after training on a fixed dataset, prequential MDL evaluates a learner by its generalization performance on y_t after training on initially short but progressively longer sequences of observations D_{<t}. This resembles an online-learning problem where a learner is sequentially exposed to new data (x_t, y_t); however, a sequential learner is also allowed to revisit old data (x_{t'}, y_{t'}) for training when making a prediction at time t > t'. We refer to Bornschein & Hutter (2023) for an in-depth discussion of the benefits and challenges of using prequential MDL with deep learning.

Contributions. Previous work on computing prequential description lengths with neural networks relied on a block-wise (chunk-incremental) approximation: at selected positions t, a model is trained from random initialization to convergence on data D_{<t}, and the models' prediction losses on the subsequent intervals are summed (Blier & Ollivier, 2018; Bornschein et al., 2020; Jin et al., 2021; Bornschein et al., 2021; Perez et al., 2021; Whitney et al., 2020). We investigate alternatives inspired by continual-learning (CL) methods, in particular chunk-incremental and mini-batch incremental fine-tuning with rehearsal. Throughout this empirical study, we consider the computational costs associated with these methods. We propose two new techniques: forward-calibration and replay streams.
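The block-wise estimation of the prequential description length can be sketched as follows. Here `make_model` is a hypothetical factory returning an object with `fit` and `log_prob` methods, and the chunk boundaries are free parameters; this is an illustrative approximation of the scheme, not the exact protocol of the cited works.

```python
import numpy as np

def prequential_codelength_blockwise(xs, ys, make_model, boundaries):
    """Block-wise (chunk-incremental) approximation of the prequential
    description length L(D|M) = -sum_t log p_M(y_t | x_t, theta(D_<t)):
    at each boundary, retrain a model from scratch on the prefix seen
    so far, then score the next chunk with it. Returns nats.

    `make_model` is a hypothetical factory; the returned object is
    assumed to expose `fit(xs, ys)` and `log_prob(xs, ys)` (the latter
    giving per-example log-likelihoods).
    """
    total_nats = 0.0
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        model = make_model()
        if start > 0:
            model.fit(xs[:start], ys[:start])  # train on the prefix D_<t
        # Next-step losses for the upcoming chunk under the frozen model.
        total_nats += -np.sum(model.log_prob(xs[start:end], ys[start:end]))
    return total_nats
```

Fewer, wider chunks reduce the number of expensive retrainings but overestimate the description length, since early examples in each chunk are scored by a model that has seen less data than a true next-step predictor would have.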
Forward-calibration improves the results by making the model's predictive uncertainty better match the observed distribution. Replay streams make replay for mini-batch incremental learning more scalable by providing approximate random rehearsal while avoiding large in-memory replay buffers. We identify exponential moving parameter averages, label smoothing and weight standardization as generally useful techniques. As a result, we present description lengths for a suite of popular image classification datasets that improve upon previously reported results (Blier & Ollivier, 2018; Bornschein et al., 2020) by large margins. The motivation for studying prequential MDL stems from the desire to apply deep-learning techniques in situations violating the assumptions of ERM. In this work, however, we concentrate on established i.i.d. datasets: stationary data is the most challenging case for a sequential learner, which must learn uniformly from all previously seen examples, and it allows us to evaluate well-known architectures and compare their held-out performance against ERM-based training, which has been the focus of the community for decades. We include additional experimental results on non-stationary data in Appendix C.7.3.
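The replay-streams idea can be illustrated with a minimal sketch. The scheduling details below (number of streams, random restart positions) are our own illustrative assumptions rather than the exact scheme used in the experiments; the key property is that only a handful of integer cursors into a re-readable data source are kept, instead of a large in-memory buffer of past examples.

```python
import random

class ReplayStreams:
    """Minimal sketch of replay streams: instead of a large in-memory
    replay buffer, keep `num_streams` lightweight cursors that
    sequentially re-read the already-seen prefix of a re-readable data
    source. Each step mixes the newest example with one example from
    each cursor, yielding approximate random rehearsal.
    """
    def __init__(self, dataset, num_streams=4, seed=0):
        self.dataset = dataset          # assumed re-readable / random-access
        self.rng = random.Random(seed)
        self.cursors = [0] * num_streams
        self.seen = 0                   # length of the observed prefix

    def step(self, t):
        """Observe example t; return the examples to train on this step."""
        self.seen = t + 1
        batch = [t]                     # always include the newest example
        for i in range(len(self.cursors)):
            batch.append(self.cursors[i])
            self.cursors[i] += 1
            if self.cursors[i] >= self.seen:
                # Cursor reached the frontier: restart at a random position
                # in the seen prefix to decorrelate the replay order.
                self.cursors[i] = self.rng.randrange(self.seen)
        return [self.dataset[j] for j in batch]
```

Because each cursor reads the data sequentially, this scheme is friendly to on-disk or streamed storage; the randomness enters only through the restart positions, which is what makes the rehearsal approximate rather than exactly uniform.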

