POPULATING MEMORY IN CONTINUAL LEARNING WITH CONSISTENCY AWARE SAMPLING

Anonymous authors
Paper under double-blind review

Abstract

Continual Learning (CL) methods aim to mitigate Catastrophic Forgetting (CF), where knowledge from previously learned tasks is lost in favor of new ones. Among these algorithms, some have shown the relevance of keeping a rehearsal buffer with previously seen examples, referred to as memory. Yet, despite their popularity, limited research has been done to understand which elements are most beneficial to store in memory. It is common for this memory to be populated through random sampling, with few guiding principles that may aid in retaining prior knowledge. In this paper, and consistent with previous work, we found that some storage policies behave similarly given a certain memory size or compute budget, but that results differ considerably when these constraints are binding. Based on these insights, we propose CAWS (Consistency AWare Sampling), an original storage policy that leverages a learning consistency score (C-Score) to populate the memory with elements that are easy to learn and representative of previous tasks. Because directly computing the C-Score is impractical in CL, we propose more feasible and efficient proxies for the score that yield state-of-the-art results on CIFAR-100 and Tiny ImageNet.



1. INTRODUCTION

Deep neural networks have repeatedly shown state-of-the-art performance in numerous tasks, including image recognition He et al. (2016); Dosovitskiy et al. (2020), Natural Language Processing (NLP) Devlin et al. (2018); Brown et al. (2020), and games previously thought to be intractable, such as Go Silver et al. (2016) and Starcraft II Vinyals et al. (2019). However, as a common limitation, all these models lack versatility: when trained to perform novel tasks, they rapidly forget how to solve previous ones. This condition is known as catastrophic forgetting and is the main problem tackled by Continual Learning methods Parisi et al. (2019); Delange et al. (2021).

A variety of methods have been proposed to approach this problem. Some have focused on allocating parameter sub-spaces for each new task Rusu et al. (2016); Mallya et al. (2018), others define restrictions on the gradients learned Kirkpatrick et al. (2017); Lopez-Paz & Ranzato (2017), while others use meta-learning to learn reusable weights for all tasks Rajasegaran et al. (2020); Hurtado et al. (2021). Among these, memory-based methods like Experience Replay Chaudhry et al. (2019); Kim et al. (2020) have consistently exhibited greater performance while being easy to understand. In these methods, a memory of samples from previous tasks is kept during training of the current task to avoid forgetting how to solve previous tasks. Notwithstanding the popularity and effectiveness of memory-based methods, few studies have been conducted on how populating the memory affects the performance of CL methods. In particular, Chaudhry et al. (2018a); Wu et al. (2019); Hayes et al. (2020); Araujo et al. (2022) show that when populating the memory by focusing solely on sample diversity or class balance, random selection of elements ends up performing nearly or just as well without adding extra computation.

It is clear that having a representative set of examples of the underlying distribution is critical for preserving previous knowledge. Ideally, one would like to save a large number of samples. Unfortunately, since saving large amounts of data results in computational overhead, we have to limit the memory size and choose which elements to keep. In this paper, we argue that this memory must satisfy two fundamental requirements in order to perform reliably. The first is to contain elements that are easy to remember, i.e., that the model can learn quickly. The second is to contain elements that suitably represent the distributions of past experiences, providing diversity while avoiding outliers. We refer to these two ideas as Fast Learning and Diversity. For the first requirement, we leverage the concept of learning consistency (Jiang et al., 2021) to measure how consistently a sample is learned by a given class of models. Specifically, we populate the memory with elements with higher consistency values, which have also been shown to be the fastest to learn, improving the sample efficiency of the memory.

However, by selecting only the samples with the highest consistency values, the model learns only a limited set of patterns, reducing the diversity of samples stored in the memory and limiting how much of the decision boundary the model can represent. To overcome this, we propose Consistency AWare Sampling (CAWS), a new populating strategy that samples from a broader group of high-consistency elements. This proposal adds diversity to the memory while allowing the model to learn a far more detailed decision boundary. One limitation of the consistency score (C-Score) is that it requires training multiple models on the entire training distribution in order to estimate how easy or difficult an example is to learn, which, in addition to being expensive, is impractical for CL. To mitigate this problem, we propose proxies for the consistency of an example that achieve similar performance without the need to train multiple models on the entire training set beforehand. These proxies are not limited to CL scenarios; they can also be used in settings where such scores are commonly needed, such as Curriculum Learning (Bengio et al., 2009).

Our contributions can be summarized as follows:

• We take a step towards understanding how the memory should be populated based on the effectiveness-efficiency trade-off of the scenario.

• In Section 3, we propose a novel method, Consistency AWare Sampling (CAWS), for populating the memory in Continual Learning based on the idea of learning consistency. This method equals or outperforms state-of-the-art memory selection methods.

• Since estimating learning consistency requires multiple models trained on the same training data, in Section 4 we propose practical proxies that require no extra training and achieve similar results. Moreover, these proxies can be used in other scenarios where C-Scores are required, such as Curriculum Learning.

2. RELATED WORK

Multiple ways to populate memory in CL have been proposed. However, few studies have explored when different approaches work better than others. Some studies have shown that, under certain conditions, there is no significant difference between the proposed methods, showing how limited our understanding of replay strategies is. In this work, we consider the following methods as baselines:

(a) Reservoir. A reservoir (Vitter, 1985) strategy allows sampling elements from a stream without knowing how many instances to expect. The method keeps each sample with probability M/N, where N is the number of elements observed so far and M is the memory size. This way, it maintains a uniform random sample of the stream seen so far.

(b) Class Balance. As the name states, each class receives an equal proportion of the buffer (Chrysakis & Moens, 2020). We use a dynamic assignment, meaning that the memory is always full: samples of new classes replace instances of old classes to maintain an equal class distribution in the memory.

(c) Task Balance. Similar to Class Balance, but instead of an equal proportion per class, the memory is divided by the number of tasks in the sequence (Lopez-Paz & Ranzato, 2017).

(d) Mean of Features (MF). Proposed by Rebuffi et al. (2017), it computes an average feature vector per class from the representations of the elements in memory for that class. If the new example's feature vector is closer to the corresponding class mean than that of the farthest element currently in memory, the new example replaces that farthest element.
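The reservoir baseline described above admits a compact implementation. The following is a minimal sketch (function and variable names are our own, not from the paper) that keeps each streamed element with probability M/N:

```python
import random

def reservoir_update(memory, item, n_seen, capacity, rng=random):
    """Reservoir sampling (Vitter, 1985): after n_seen stream items,
    `memory` holds a uniform random sample of at most `capacity` of them.

    n_seen is the 1-based position of `item` in the stream.
    """
    if len(memory) < capacity:
        memory.append(item)  # buffer not full yet: always keep
    else:
        # keep the new item with probability capacity / n_seen,
        # evicting a uniformly chosen stored element
        j = rng.randrange(n_seen)
        if j < capacity:
            memory[j] = item

# usage: stream 1000 items into a buffer of size 10
buf = []
for i, x in enumerate(range(1000), start=1):
    reservoir_update(buf, x, i, capacity=10)
```

Note that the policy never needs to know the stream length in advance, which is what makes it a natural fit for the online CL setting.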

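The Mean of Features rule can likewise be sketched. In this illustrative version (names and the per-class buffer layout are our own assumptions), appending the candidate and then evicting the element farthest from the class mean is equivalent to the replace-if-closer rule: if the candidate itself is the farthest, it is the one dropped.

```python
import numpy as np

def mf_update(class_buffer, new_feat, capacity):
    """Mean-of-Features style update (after Rebuffi et al., 2017).

    class_buffer: list of 1-D feature vectors for one class.
    Keeps the `capacity` vectors closest to the running class mean.
    """
    class_buffer.append(new_feat)
    if len(class_buffer) <= capacity:
        return
    mean = np.mean(class_buffer, axis=0)
    dists = [np.linalg.norm(f - mean) for f in class_buffer]
    # drop the element farthest from the class mean
    class_buffer.pop(int(np.argmax(dists)))
```

One consequence of this rule, visible in the sketch, is that outliers are systematically evicted, which trades diversity for representativeness of the class centroid.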

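To make the idea of a consistency proxy concrete, the sketch below is entirely our illustration, not the paper's definition: it scores each sample by the fraction of training epochs in which it was classified correctly, so elements learned early and never forgotten score near 1.0.

```python
from collections import defaultdict

class ConsistencyProxy:
    """Illustrative stand-in for a C-Score: the fraction of epochs in
    which a sample was classified correctly during normal training.
    Requires no extra models, only bookkeeping of per-epoch correctness.
    """
    def __init__(self):
        self.correct = defaultdict(int)
        self.epochs = 0

    def update(self, sample_ids, was_correct):
        # call once per epoch with per-sample correctness flags
        self.epochs += 1
        for sid, ok in zip(sample_ids, was_correct):
            self.correct[sid] += int(ok)

    def score(self, sid):
        return self.correct[sid] / max(self.epochs, 1)

    def top_k(self, k):
        # candidate pool of high-consistency samples
        return sorted(self.correct, key=self.score, reverse=True)[:k]
```

A CAWS-style policy would then sample from a broad pool of high-scoring elements rather than taking only the very top ones, preserving the diversity requirement discussed above.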