MAKE MEMORY BUFFER STRONGER IN CONTINUAL LEARNING: A CONTINUOUS NEURAL TRANSFORMATION APPROACH

Abstract

Continual learning (CL) focuses on learning from a non-stationary data distribution without forgetting previous knowledge. However, the widely used memory-replay approach often suffers from memory overfitting. To mitigate memory overfitting, we propose a continuous and reversible memory transformation method that makes the memory data hard to overfit, thus improving generalization. The transformation is learned by optimizing a bi-level objective that jointly trains the CL model and the memory transformer. Specifically, we propose a deterministic continuous memory transformer (DCMT) modeled by an ordinary differential equation, which allows transformed memory data to be drawn at infinitely many time steps and thus generates diverse and hard memory data. Furthermore, we inject uncertainty into the transformation function and propose a stochastic continuous memory transformer (SCMT) modeled by a stochastic differential equation, which substantially enhances the diversity of the transformed memory buffer. The proposed neural transformation approaches have significant advantages over existing ones: (1) we can obtain infinitely many transformed examples, significantly increasing the diversity of the memory buffer; (2) the proposed continuous transformations are reversible, i.e., the original raw memory data can be restored from the transformed memory data without keeping a replica of the memory data. Extensive experiments on both task-aware and task-free CL show significant improvements with our approach over strong baselines.

1. INTRODUCTION

Continual learning (CL) aims to learn a non-stationary data distribution without forgetting previous knowledge. Depending on whether there are explicit task definitions (partitions) during training, CL can be categorized into task-aware and task-free CL. In task-aware CL, there are explicit task and class splits during training; depending on whether task identities are known at test time, it can be further categorized into task-, domain-, and class-incremental CL (van de Ven & Tolias, 2019). In task-free CL (Aljundi et al., 2019b), there is no explicit task definition, and a data distribution shift can happen at any time. Memory replay is an effective way to mitigate forgetting and has been widely used in CL. One major problem of memory-based methods is that the effectiveness of the memory buffer can gradually decay during training (Delange et al., 2021; Jin et al., 2021): the CL model may overfit the limited memory data and fail to generalize to previous tasks. Recently, gradient-based memory editing (GMED) (Jin et al., 2021) was proposed to mitigate memory overfitting by editing memory data into hard examples, in a way similar to adversarial data augmentation (ADA) (Madry et al., 2018). Specifically, it creates hard examples that increase the model loss at each gradient step, but it is restricted to a few (fewer than three) discrete gradient-based editing steps. With more editing steps, GMED, like ADA, would make the memory even harder but reduce data diversity, since the adversarial force drives the features of different classes to overlap and cluster together (Madry et al., 2018; Wang et al., 2021). However, as shown in previous work (Gontijo-Lopes et al., 2021), improving the diversity of training data is crucial for improving model generalization. An illustration of this phenomenon is shown in Figure 1 (b) and (c). This naturally leads to a new problem: how can we increase the diversity of the edited memory data while maintaining its hardness?

To address this problem, we present a continuous, expressive, and flexible memory transformation method that obtains a diverse set of memory data while making the memory buffer harder to memorize. We first model the gradual and continuous memory transformation as a deterministic neural ordinary differential equation over the time interval [0, T], named the Deterministic Continuous Memory Transformer (DCMT). This has several advantages over existing methods. First, we can obtain transformed memory data at any time step t ∈ [0, T], and thus significantly improve the diversity of the transformed memory data. Second, because the transformation process is reversible, we do not need to keep a replica of the raw memory data: the original raw memory data can be restored from the transformed memory data. As shown in Figure 1 (d), DCMT diversifies memory data while maintaining hardness. DCMT considers a single transformation function. However, there are infinitely many possible functions for transforming the memory data, and modeling the uncertainty in the transformation function further helps avoid overfitting (Lu et al.; Liu et al., 2019). To model the underlying variety of transformation functions and further improve data diversity, we generalize the method in a probabilistic manner, modeling the memory transformation as a stochastic process with a neural stochastic differential equation, named the Stochastic Continuous Memory Transformer (SCMT). This enables us to model infinitely many transformation functions and significantly improves the diversity of the transformed data, at some additional computation cost compared to DCMT. An overview of the proposed methods is presented in Figure 2.

Figure 2: Overview of the proposed approach for memory transformation. DCMT and SCMT continuously and gradually transform the memory data to be diverse and hard to memorize.

Note that the transformed data can be obtained at any continuous time step, thus providing significantly larger diversity. We propose a bi-level optimization to jointly learn the memory transformer and the CL model, so that the memory transformer generates diversified memory data that is hard to memorize. Concretely, after transformation over the continuous interval [0, T], we optimize the loss increase between the memory data before (t = 0) and after (t = T) the transformation.

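To make the continuous transformation concrete, the following is a minimal numerical sketch of the idea, not the paper's implementation: a toy velocity field `f` stands in for the learned transformer network, forward Euler integration plays the role of the ODE solver, and Euler-Maruyama adds SCMT-style noise. All function names and constants here are illustrative assumptions.

```python
import numpy as np

def f(x, t):
    # Toy velocity field standing in for the learned transformer network.
    return 0.1 * np.tanh(x) + 0.05 * t

def transform(x0, T=1.0, steps=100):
    """DCMT sketch: integrate dx/dt = f(x, t) from t=0 to t=T (forward Euler).
    Stopping at any intermediate t in [0, T] yields another transformed view,
    which is where the diversity of the continuous transformation comes from."""
    x, dt = x0.astype(float), T / steps
    for i in range(steps):
        x = x + dt * f(x, i * dt)
    return x

def restore(xT, T=1.0, steps=100):
    """Integrate backward from t=T to t=0: the continuous flow is exactly
    reversible, and this Euler sketch recovers the raw memory data up to
    O(dt) discretization error, so no replica of the buffer is needed."""
    x, dt = xT.astype(float), T / steps
    for i in reversed(range(steps)):
        x = x - dt * f(x, (i + 1) * dt)
    return x

def transform_stochastic(x0, T=1.0, steps=100, sigma=0.05, rng=None):
    """SCMT sketch: Euler-Maruyama for dx = f(x, t) dt + sigma dW, injecting
    uncertainty into the transformation to further diversify the buffer."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, dt = x0.astype(float), T / steps
    for i in range(steps):
        x = x + dt * f(x, i * dt) + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    return x
```

Calling `transform` with different values of `T` produces a continuum of transformed buffers from a single stored copy, `restore(transform(x))` returns approximately `x`, and the stochastic variant draws a different sample path on every call. In the actual method, `f` is a neural network trained with the bi-level objective rather than a fixed function.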
Figure 1: T-SNE visualization of existing memory-replay methods and the proposed method on CIFAR10. We use features extracted from the last-layer output of ResNet18 as the input to T-SNE, and four classes of memory data to illustrate the difference; each color denotes one class of memory-buffer data. (a): ER memory data is very easy to overfit and easy to classify; (b): GMED with fewer editing steps has limited effectiveness due to limited hardness; (c): GMED with more editing steps creates memory examples that are harder to classify but lack diversity; (d): DCMT (our method) yields better diversity, and the transformed memory data is hard to classify and to overfit.
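The bi-level loss-increase objective described above can be sketched with a deliberately tiny stand-in problem. Everything below is an illustrative assumption rather than the paper's setup: the CL model is a linear regressor, the transformer is a one-parameter map `theta` evaluated at t = T, and the outer gradient is taken by finite differences instead of differentiating through an ODE/SDE solver.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                  # toy memory-buffer inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.3])       # toy memory-buffer targets
w = np.zeros(4)                               # inner variable: CL model weights
theta = 0.0                                   # outer variable: transformer parameter

def transform(X, theta):
    # One-parameter stand-in for the continuous transformer at t = T.
    return X + theta * np.tanh(X)

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

for step in range(200):
    Xt = transform(X, theta)
    # Inner update: the CL model *descends* the loss on the transformed memory.
    grad_w = 2.0 * Xt.T @ (Xt @ w - y) / len(y)
    w -= 0.05 * grad_w
    # Outer update: the transformer *ascends* the loss increase
    # loss(t=T) - loss(t=0); the t=0 term is constant in theta, so a
    # finite-difference gradient of the t=T loss suffices in this sketch.
    eps = 1e-4
    gain = (loss(w, transform(X, theta + eps), y)
            - loss(w, transform(X, theta - eps), y)) / (2.0 * eps)
    theta += 0.01 * gain
```

The alternation mirrors the bi-level scheme: the transformer keeps pushing the replayed memory toward harder (higher-loss) versions while the model keeps fitting them, so the buffer is never memorized verbatim. The actual method replaces the finite-difference step with gradients through the continuous transformation.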

