IMPROVING INFORMATION RETENTION IN LARGE SCALE ONLINE CONTINUAL LEARNING

Abstract

Given a stream of data sampled from non-stationary distributions, online continual learning (OCL) aims to adapt efficiently to new data while retaining existing knowledge. The typical approach to information retention (the ability to retain previous knowledge) is to keep a replay buffer of fixed size and compute gradients using a mixture of new data and data from the replay buffer. Surprisingly, recent work (Cai et al., 2021) suggests that information retention remains a problem in large-scale OCL even when the replay buffer is unlimited, i.e., when gradients are computed using all past data. This paper focuses on this peculiarity in order to understand and address information retention. To pinpoint the source of the problem, we show theoretically that, given a limited computation budget at each time step, and even without a strict storage limit, naively applying SGD with a constant or constantly decreasing learning rate fails to optimize information retention in the long term. We propose using a moving-average family of methods to improve optimization for non-stationary objectives. Specifically, we design an adaptive moving average (AMA) optimizer and a moving-average-based learning rate schedule (MALR). We demonstrate the effectiveness of AMA+MALR on large-scale benchmarks, including Continual Localization (CLOC), Google Landmarks, and ImageNet. Code will be released upon publication.

1. INTRODUCTION

Supervised learning commonly assumes that the data is independent and identically distributed (iid.). This assumption is violated in practice when the data comes from a non-stationary distribution that evolves over time. Continual learning aims to solve this problem by designing algorithms that efficiently learn and retain knowledge over time from a data stream. Continual learning settings can be classified into online and offline. The offline setting (Li & Hoiem, 2017) mainly limits storage: only a fixed amount of training data can be stored at each time step. Computation is not limited in offline continual learning: the model can be trained from scratch until convergence at each time step. In contrast, the online setting limits both storage and computation at each time step.

A number of metrics can be used to evaluate a continual learner. If the model is directly evaluated on incoming data, the objective is learning efficacy, which measures the ability to adapt efficiently to new data. If the model is evaluated on historical data, the objective is information retention, which measures the ability to retain existing knowledge. These two objectives are in conflict, and their trade-off is known as the plasticity-stability dilemma (McCloskey & Cohen, 1989). Following recent counterintuitive results, we single out information retention in this work.

A common assumption in the continual learning literature is that information retention is only a problem because of the storage constraint. Take replay-buffer-based methods as an exemplar: it is tacitly understood that, since they cannot store the entire history, they forget past knowledge that is not stored. However, this intuition is challenged by recent empirical results. For example, Cai et al. (2021) show that the information retention problem persists even when past data is stored in its entirety. We argue that, at least in part, the culprit for this information loss is optimization.
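As a concrete reference point, the replay-based setup described above (a buffer of past data, with minibatches mixing incoming data and replayed history) can be sketched as follows. This is a minimal illustration under simplifying assumptions; the function names and the half-and-half split are illustrative, not the protocol of any specific method:

```python
import random

random.seed(0)

def replay_step(sgd_step, history, new_batch, batch_size=8):
    """One OCL update: mix incoming data with replayed history.

    sgd_step:  callable applying one gradient step on a minibatch
               (a stand-in for the actual model update).
    history:   list holding all past examples (no storage limit here).
    new_batch: examples arriving at the current time step.
    """
    half = batch_size // 2
    minibatch = list(new_batch[:half])           # half from the online stream
    if history:                                  # half replayed uniformly
        minibatch += random.sample(history, min(half, len(history)))
    sgd_step(minibatch)
    history.extend(new_batch)                    # store everything seen so far

# Toy usage: record the minibatch size at each of three time steps.
sizes, history = [], []
for t in range(3):
    new_batch = [(t, i) for i in range(4)]
    replay_step(lambda mb: sizes.append(len(mb)), history, new_batch)
# sizes == [4, 8, 8]; history holds all 12 examples
```

At the first step the history is empty, so only stream data is used; afterwards every minibatch mixes new and replayed examples.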
Direct application of SGD to a continual data stream is problematic. Informally, consider the learning rate (i.e., the step size). It needs to decrease to zero over time to guarantee convergence (Ghadimi & Lan, 2013). However, this cannot be applied to a continual, non-iid. stream, since infinitesimal learning rates would simply ignore new information and fail to adapt, resulting in underfitting. This underfitting worsens over time as the distribution continues to shift. We formalize this problem and further show that there is no straightforward way to control this trade-off; the issue persists even when common adaptive learning rate heuristics are applied.

Orthogonal to continual learning, one recently proposed remedy to guarantee SGD convergence with high learning rates is to use a moving average of the SGD iterates (Mandt et al., 2016; Tarvainen & Valpola, 2017). Informally, SGD with a large learning rate bounces around the optimum; averaging its trajectory dampens the bouncing and tracks the optimum better (Mandt et al., 2016). We apply these ideas to OCL for the first time to improve information retention.

To summarize, we theoretically analyze the behavior of SGD for OCL. Following this analysis, we propose a moving average strategy to optimize information retention. Our method uses SGD with large learning rates to adapt to non-stationarity, and utilizes the average of the SGD iterates for better convergence. We propose an adaptive moving average (AMA) algorithm to control the moving average weight over time. Based on the statistics of the SGD and AMA models, we further propose a moving-average-based learning rate schedule (MALR) to better control the learning rate. Experiments on Continual Localization (CLOC) (Cai et al., 2021), Google Landmarks (Weyand et al., 2020), and ImageNet (Deng et al., 2009) demonstrate superior information retention and long-term transfer for large-scale OCL.

2. RELATED WORK

Optimization in OCL. OCL methods typically focus on improving learning efficacy. Cai et al. (2021) proposed several strategies, including adaptive learning rates, adaptive replay buffer sizes, and small batch sizes. Hu et al. (2020) proposed a new optimizer, ConGrad, which at each time step adaptively controls the number of online gradient descent steps (Hazan, 2019) to balance generalization and training loss reduction. Our work instead focuses on information retention and proposes a new optimizer and learning rate schedule that trade off learning efficacy to improve long-term transfer.

In terms of replay buffer strategies, mixed replay (Chaudhry et al., 2019), originating from offline continual learning, forms a minibatch by sampling half of the data from the online stream and the other half from the history. It has been applied in OCL to optimize learning efficacy (Cai et al., 2021). Our work instead uses pure replay to optimize information retention, where a minibatch is formed by sampling uniformly from the entire history.

Continual learning algorithms. We focus on the optimization aspect of OCL. Other aspects, such as data integration (Aljundi et al., 2019b) and the sampling procedure of the replay buffer (Aljundi et al., 2019a; Chrysakis & Moens, 2020), are complementary and orthogonal to our study. These aspects are critical for a successful OCL strategy and can potentially be used in conjunction with our optimizers. Offline continual learning (Li & Hoiem, 2017; Kirkpatrick et al., 2017) aims to improve information retention with limited storage. Unlike the online setting, SGD works in this case since the model can be retrained until convergence at each time step. We refer the reader to Delange et al. (2021) for a detailed survey of offline continual learning algorithms.

Moving average in optimization. We propose a new moving-average-based optimizer for OCL. Although we are the first to apply this idea to OCL, moving average optimizers have been widely utilized for convex (Ruppert, 1988; Polyak & Juditsky, 1992) and non-convex optimization (Izmailov et al., 2018; Maddox et al., 2019; He et al., 2020). Beyond supervised learning (Izmailov et al., 2018; Maddox et al., 2019), the moving average model has also been used as a teacher for the SGD model in semi-supervised (Tarvainen & Valpola, 2017) and self-supervised learning (He et al., 2020). The moving average of stochastic gradients (rather than model weights) has also been widely used in ADAM-based optimizers (Kingma & Ba, 2014).

Continual learning benchmarks. We need a large-scale and realistic benchmark to evaluate OCL. For language modeling, Hu et al. (2020) created the Firehose benchmark from a large stream of Twitter posts. The task of Firehose is continual per-user tweet prediction, which is self-supervised and multi-task. For visual recognition, Lin et al. (2021) created the CLEAR benchmark by manually labeling images on a subset of YFCC100M (Thomee et al., 2016). Though the images are ordered in time, the number of labeled images is small (33K). Cai et al. (2021) proposed the continual localization (CLOC) benchmark using a subset of YFCC100M with time stamps and geographic locations. The task of CLOC is geolocalization, which is formulated as image classification. CLOC
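The moving-average intuition invoked above, that SGD with a large constant learning rate bounces around the optimum while an average of its iterates tracks the optimum more closely, can be illustrated on a toy problem. This is a minimal sketch, not the AMA algorithm itself: the constant averaging weight `beta` is an illustrative assumption, whereas AMA adapts this weight over time.

```python
import random

random.seed(0)

# Toy stationary problem: minimize E[(w - x)^2] with x ~ N(3, 1).
# The optimum is w* = 3.
optimum = 3.0
w = 0.0        # raw SGD iterate
w_avg = 0.0    # exponential moving average of the iterates
lr = 0.3       # deliberately large, constant learning rate
beta = 0.98    # averaging weight (constant here; AMA adapts it over time)

for _ in range(2000):
    x = random.gauss(optimum, 1.0)
    grad = 2.0 * (w - x)                   # stochastic gradient
    w -= lr * grad                         # w keeps bouncing around w*
    w_avg = beta * w_avg + (1 - beta) * w  # averaging damps the noise

# w fluctuates with standard deviation ~0.65 around w*, while w_avg
# settles much closer to the optimum.
```

The raw iterate never converges under the constant learning rate, but the averaged model does; in the non-stationary OCL setting, the large learning rate additionally lets the raw iterate keep adapting to the shifting distribution.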

