LEARNING WITHOUT PREJUDICES: CONTINUAL UNBIASED LEARNING VIA BENIGN AND MALIGNANT FORGETTING

Abstract

Although machine learning algorithms have achieved state-of-the-art performance in image classification, recent studies have shown that a model's ability to learn several tasks in sequence, termed continual learning (CL), often suffers from abrupt performance degradation on previous tasks. A large body of CL frameworks has been devoted to alleviating this forgetting issue. However, we observe that forgetting phenomena in CL are not always unfavorable, especially when there is bias (spurious correlation) in the training data. We term this type of forgetting benign forgetting, and categorize detrimental forgetting as malignant forgetting. Based on this finding, our objective in this study is twofold: (a) to discourage malignant forgetting by generating previous representations, and (b) to encourage benign forgetting by employing contrastive learning in conjunction with feature-level augmentation. Extensive evaluations on biased experimental setups demonstrate that our proposed method, Learning without Prejudices, is effective for continual unbiased learning.

1. INTRODUCTION

In continual learning (CL), a model learns a sequence of tasks, accumulating existing knowledge for use in new tasks. This is preferable in practice, where a model cannot retrieve previously used data owing to privacy, limited data capacity, or an online streaming setup. The main challenge in CL is to alleviate "catastrophic forgetting," whereby a model forgets prior information while training on new information (McCloskey & Cohen, 1989). A line of recent works has been dedicated to mitigating this issue. Regularization-based methods force the current model to stay close to the previous one by penalizing changes in the parameters learned in previous tasks (Kirkpatrick et al., 2017; Chaudhry et al., 2018; Aljundi et al., 2018; 2019a; Ahn et al., 2019; Dhar et al., 2019; Douillard et al., 2020). Replay-based methods store samples of prior tasks in a buffer and employ them along with present samples (Robins, 1995; Lopez-Paz & Ranzato, 2017; Buzzega et al., 2020; Aljundi et al., 2019b; Mai et al., 2021; Lin et al., 2021; Madaan et al., 2021; Chaudhry et al., 2021; Bonicelli et al., 2022). Generator-based methods generate prior samples and feed them into the current task's training (Shin et al., 2017; Kemker & Kanan, 2017; Xiang et al., 2019; Ostapenko et al., 2019; Liu et al., 2020; Yin et al., 2020).

A common assumption of the above-mentioned methods is that the training dataset is well-distributed. However, a source dataset is often biased, and a machine learning algorithm can perceive the bias as meaningful information, leading to misleading generalizability of the model (Kim et al., 2019; Jeon et al., 2022). In the experiment in Section 3.1, we show that biased distributions are detrimental to the robustness of models in existing CL scenarios. Thus, we propose a new type of CL, termed "continual unbiased learning (CUL)", in which the dataset of each task has a different bias.
With CUL, we aim to make the model trained on any task unbiased, considering all intermediate models as candidates for deployment. This is particularly desirable in practice, where a model designed for a specific purpose is deployed for long periods and training datasets with divergent distributions are fed sequentially to update it. Even with CUL, forgetting past information ("malignant forgetting") degrades the generalizability of a model. For instance, with Biased MNIST in Figure 1, the classifier perceives color as meaningful information for prediction, although color is not naturally associated with the digit. If the model clearly memorizes the prior information that both (red, 0) and (gray, 0) samples exist, it can infer that color is not the key factor for predicting digits. Furthermore, through the experiment in Section 3.2, we observe that forgetting is not always malignant. Although information derived from prior data can itself contribute to a model's generalizability, it is beneficial to forget the misguidance learned from biased datasets; hence we term such forgetting "benign forgetting". As an example, suppose a classifier trained on the MNIST dataset is extremely biased toward the background color, as in Section 3.2. It is undesirable for the classifier to adopt the rule color = digit and thus, for instance, assign every 'blue' image to '3'. Therefore, we aim to discourage malignant forgetting and encourage benign forgetting. Toward this, we design a novel method, named Learning without Prejudices (LwP), which employs a feature generator and contrastive learning. (i) Motivated by the finding in Section 3.1 that a model trained jointly on data from all tasks does not suffer from malignant forgetting, we exploit the capabilities of a feature generator. The feature generator produces feature maps containing previous information via a generative adversarial network (GAN).
Feature maps provide a broader feature space to reference than raw images, making the classifier more robust. (ii) The generated features are fed into the model via contrastive learning (Grill et al., 2020), and then current data are used for training in supervised mode. Because bias is a spurious correlation between particular attribute variables and the label space, the model can learn representations free of bias through self-supervised learning, which does not require labels. (iii) To optimize the classifier with generated features effectively, we propose feature-level augmentation, which transforms features spatially and channel-wise. An extensive evaluation on biased datasets shows that our proposed framework is effective for CUL. The main contributions of this study are summarized as follows:

• We present a novel framework, termed "continual unbiased learning", to address bias in CL. Additionally, we propose continual unbiased learning benchmarks and an evaluation protocol for future research.

• We find that forgetting phenomena in CL are not always catastrophic when the training dataset exhibits a non-uniform distribution of features, e.g., a biased dataset, and hence categorize them into malignant forgetting and benign forgetting.

• We propose a novel method, Learning without Prejudices (LwP), that employs a feature generator and contrastive learning, presenting feature-level augmentation to bridge them. LwP contributes significantly to models' generalizability.
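To make (iii) concrete, the following is a minimal sketch of what feature-level augmentation on a (C, H, W) feature map could look like. The specific transforms used by LwP are not detailed at this point in the paper, so the horizontal flip (spatial) and per-channel rescaling (channel-wise) below, as well as the function name `feature_level_augment`, are illustrative assumptions rather than the paper's exact operations.

```python
import numpy as np

def feature_level_augment(feat, rng):
    """Apply one spatial and one channel-wise transform to a feature map.

    feat: array of shape (C, H, W) -- a single feature map.
    rng:  a numpy random Generator.
    """
    # Spatial transform: randomly flip the spatial grid horizontally.
    if rng.random() < 0.5:
        feat = feat[:, :, ::-1]
    # Channel-wise transform: rescale each channel by a random factor,
    # perturbing channel statistics while preserving spatial layout.
    scales = rng.uniform(0.8, 1.2, size=(feat.shape[0], 1, 1))
    return feat * scales

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 8, 8))   # e.g., a generated feature map
aug = feature_level_augment(feat, rng)
```

In this sketch the augmented view keeps the same shape as the input, so it can be paired with the original feature map as two views for the contrastive objective in (ii).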

2.1. PROBLEM STATEMENT

Bias. Let X be an input space and Y be a label space. We define an attribute variable attr as an informative data feature of x ∈ X, possibly ranging from fine details (e.g., the pixel at (0, 0) is black) to high-level semantics of the image (e.g., there is a cat). Thus, a set of attributes can represent data x. Formally, let A be an attribute space and α : X → 2^A, where 2^A denotes the power set of A. The function α extracts attribute variables attr ∈ A from the input space X, i.e., α(x) = {attr_1, attr_2, . . . , attr_n}. Among these attributes, some might be strongly correlated with Y while being irrelevant to the natural meaning of the target object. We define such an attr as a "bias". As machine learning algorithms (e.g., convolutional neural networks (CNNs)) are overly dependent on the training data distribution, the model can become biased, potentially leading to misleading generalizability (Torralba & Efros, 2011; Tommasi et al., 2017; Jeon et al., 2022). For instance, according to Bahng et al. (2020), the majority of frog images are captured in swamp scenes and many bird images are captured in the sky, making the model consider the background a dominating cue, which often causes (frog, sky) and (bird, swamp) images to be inferred incorrectly.

Continual learning. Consider a dataset D = {(x, y) | x ∈ X, y ∈ Y} for a classification problem. "Continual learning" is learning over a sequence D_S = {D_t = (X_t, Y_t)}_{t=1}^{T}, where each X_t and Y_t implicitly changes, with the expectation that f : X_t → Y_t accumulates previous information without forgetting while learning new tasks. Here, T denotes the number of tasks. A task t is predicting the target label y for an unseen feature variable x, and learning a task means optimizing a classifier f : X_t → Y_t with D_t to form a discriminative logic.
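As an illustration of this definition of bias, here is a small sketch of how a spuriously label-correlated attribute (e.g., a background-color index, in the spirit of Biased MNIST) can be injected into a task's dataset. The correlation rate `rho` and the helper name `make_biased_attr` are assumptions introduced for exposition, not part of the paper's benchmark protocol.

```python
import numpy as np

def make_biased_attr(labels, n_attrs, rho, rng):
    """Draw one attribute index per sample.

    With probability rho the attribute equals the label, so attr becomes
    spuriously predictive of y (the "bias"); otherwise it is drawn
    uniformly, yielding bias-conflicting samples such as (gray, 0).
    """
    labels = np.asarray(labels)
    aligned = rng.random(labels.shape[0]) < rho
    random_attr = rng.integers(0, n_attrs, labels.shape[0])
    return np.where(aligned, labels, random_attr)

rng = np.random.default_rng(0)
y = rng.integers(0, 10, 1000)                       # 10-class labels
attr = make_biased_attr(y, n_attrs=10, rho=0.95, rng=rng)
bias_rate = (attr == y).mean()                      # fraction of bias-aligned samples
```

A classifier trained on such a task can reach high accuracy by reading `attr` alone, which is exactly the shortcut that CUL asks the model to forget; varying which attribute is correlated from task to task gives a sequence in which each D_t has a different bias.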

