LEARNING WITHOUT PREJUDICES: CONTINUAL UNBIASED LEARNING VIA BENIGN AND MALIGNANT FORGETTING

Abstract

Although machine learning algorithms have achieved state-of-the-art status in image classification, recent studies have substantiated that the ability of the models to learn several tasks in sequence, termed continual learning (CL), often suffers from abrupt degradation of performance on previous tasks. A large body of CL frameworks has been devoted to alleviating this forgetting issue. However, we observe that forgetting phenomena in CL are not always unfavorable, especially when there is bias (spurious correlation) in training data. We term this type of forgetting benign forgetting, and categorize detrimental forgetting as malignant forgetting. Based on this finding, our objective in this study is twofold: (a) to discourage malignant forgetting by generating previous representations, and (b) to encourage benign forgetting by employing contrastive learning in conjunction with feature-level augmentation. Extensive evaluations on biased experimental setups demonstrate that our proposed method, Learning without Prejudices, is effective for continual unbiased learning.

In continual learning (CL), a model learns a sequence of tasks to accumulate existing knowledge for a new task. This is preferable in practice, where a model cannot retrieve previously used data, owing to privacy, limited data capacity, or an online streaming setup. The main challenge in CL is to alleviate "catastrophic forgetting," whereby a model forgets prior information while training on new information (McCloskey & Cohen, 1989). A line of recent works has been dedicated to mitigating this issue. Regularization-based methods force a current model not to be far from the previous one by penalizing changes in the parameters learned in previous tasks (Kirkpatrick et al.



Even in continual unbiased learning (CUL), forgetting past information ("malignant forgetting") degrades the generalizability of a model. For instance, with Biased MNIST in Figure 1, the classifier perceives color as meaningful information for prediction, although color has no natural association with the digit. If the model retains the prior information that there are both (red, 0) and (gray, 0) samples, it can infer that color is not the key factor for predicting digits. Furthermore, the experiment in Section 3.2 shows that forgetting is not always malignant. Although information derived from prior data can contribute to a model's generalizability, it is beneficial to forget the misguidance learned from biased datasets, and hence we term such forgetting "benign forgetting". As an example, suppose a classifier trained on the MNIST dataset is extremely biased toward the background color, as in Section 3.2. It is undesirable for the classifier to adopt the logic that color = number and thus, for instance, to predict '3' for every 'blue' image. Therefore, we aim to discourage malignant forgetting and encourage benign forgetting.

Toward this, we design a novel method, named Learning without Prejudices (LwP), which employs a feature generator and contrastive learning. (i) Inspired by the observation in Section 3.1 that a model trained jointly on data from all the tasks does not suffer from malignant forgetting, we exploit the capabilities of a feature generator. The feature generator generates feature maps containing previous information via a generative adversarial network (GAN). Feature maps provide a larger range of feature space to reference than images, making the classifier more robust. (ii) The generated features are fed into the model by contrastive learning (Grill et al., 2020), and then current data are used for training in supervised mode. Because bias is a spurious correlation between particular attribute variables and the label space, self-supervised learning, which does not require labels, lets the model learn representations free of bias. (iii) To optimize the classifier with generated features effectively, we propose feature-level augmentation that transforms features spatially and channel-wise. An extensive evaluation on biased datasets shows that our proposed framework is effective for CUL.

The main contributions of this study are summarized as follows:
• We present a novel framework, termed "continual unbiased learning", to address bias in CL. Additionally, we propose continual unbiased learning benchmarks and an evaluation protocol for future research.
• We find that forgetting phenomena in CL are not always catastrophic when the training dataset exhibits a non-uniform distribution of features, e.g., a biased dataset, and hence categorize them into malignant forgetting and benign forgetting.
• We propose a novel method, Learning without Prejudices (LwP), that employs a feature generator and contrastive learning, with feature-level augmentation to bridge them. LwP contributes significantly to models' generalizability.

2. PRELIMINARIES

2.1. PROBLEM STATEMENT

Bias. Let $\mathcal{X}$ be an input space and $\mathcal{Y}$ be a label space. We define an attribute variable $attr$ as an informative data feature of $x \in \mathcal{X}$, possibly ranging from fine details (e.g., the pixel at (0, 0) is black) to high-level semantics of the image (e.g., there is a cat). Thus, a set of attributes can represent data $x$. Formally, let $\mathcal{A}$ be an attribute space and $\alpha : \mathcal{X} \to 2^{\mathcal{A}}$, where $2^{\mathcal{A}}$ denotes the power set of $\mathcal{A}$. The function $\alpha$ extracts the attribute variables $attr \in \mathcal{A}$ of an input, i.e., $\alpha(x) = \{attr_1, attr_2, \ldots, attr_n\}$. Among these attributes, some may be highly correlated with $\mathcal{Y}$ while being irrelevant to the natural meaning of the target object. We define such an attribute as "bias". As machine learning algorithms (e.g., convolutional neural networks (CNNs)) are overly dependent on the training data distribution, the model could be biased, potentially leading to unreliable generalization (Torralba & Efros, 2011; Tommasi et al., 2017; Jeon et al., 2022). For instance, according to Bahng et al. (2020), the majority of frog images are captured in swamp scenes and many bird images are captured in the sky, so the model treats the background as a dominating cue and often fails to classify (frog, sky) and (bird, swamp) images correctly.

Continual learning. Consider a dataset $D = \{(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}\}$ for a classification problem. "Continual learning" is a learning type with a sequence of datasets $D_S = \{D_t = (X_t, Y_t)\}_{t=1}^{T}$, where each $X_t$ and $Y_t$ implicitly changes, and where $f : X_t \to Y_t$ is expected to accumulate previous information without forgetting while learning new tasks. Here, $T$ denotes the number of tasks. Task $t$ is to predict the target label $y$ of an unseen input $x$, and learning a task means optimizing a classifier $f : X_t \to Y_t$ with $D_t$ so that it forms a discriminative decision rule.

Continual unbiased learning. In addition to the CL setting, we suppose $Y_1 = \cdots = Y_T$ and that each $D_t$ has a different bias. We aim to make the classifier $f$ unbiased toward any $D_t$ in the sequence, and term this type of learning "continual unbiased learning". In practice, a model that is applied for long periods for the same purpose may be fed inconsistent distributions with different biases over time. In this scenario, it is desirable for the model to be unbiased toward any of the datasets encompassing all of these domains.

2.2. EVALUATING PROTOCOL AND METRICS

We set a bias attribute and define a dataset that has biases as a "biased dataset" and a dataset in which the bias attribute is uniformly distributed across the labels as an "unbiased dataset". We randomly split each biased dataset $D_t = (D_t^{train}, D_t^{val}) \in D_S$ into a training and a validation set that have the same ratio of biased samples. We use one unbiased test set $D^{test}$ for evaluation. We define $f_t$ as the model obtained after sequentially training on $D_1^{train}, \ldots, D_t^{train}$, and denote $F = \{f_1, \ldots, f_T\}$. After training the model with a sequence of differently biased datasets, each in favor of one attribute or another, we evaluate its average accuracy following the conventional CL protocol:

$$\mathrm{Acc}(f_T, D_S^{val}) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{acc}(f_T, D_t^{val}),$$

where $T$ denotes the number of tasks and $\mathrm{acc}(f_T, D_t^{val})$ denotes the accuracy of the $T$-th model on the biased validation set of the $t$-th task. Additionally, we propose an average unbiased accuracy, measured on the unbiased test set after each task is trained, to estimate the generalizability of all the models:

$$\mathrm{Acc}_{ub}(F, D^{test}) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{acc}(f_t, D^{test}),$$

where $\mathrm{acc}(f_t, D^{test})$ denotes the accuracy of $f_t$ on the unbiased test set. Using this metric, we can evaluate $F = \{f_1, \ldots, f_T\}$, considering every model as a candidate for deployment.
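As a minimal sketch of the two metrics, assuming the per-task accuracies have already been measured, the averages can be computed as follows. The function names, the accuracy-matrix layout, and the numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def average_accuracy(val_acc: Sequence[Sequence[float]]) -> float:
    """Acc(f_T, D_S^val): mean accuracy of the final model over all T validation sets.

    val_acc[i][j] is the accuracy of f_{i+1} on the biased validation set D_{j+1}^val.
    """
    final_row = val_acc[-1]  # accuracies of f_T on D_1^val, ..., D_T^val
    return sum(final_row) / len(final_row)


def average_unbiased_accuracy(test_acc: Sequence[float]) -> float:
    """Acc_ub(F, D^test): mean accuracy of f_1, ..., f_T on the unbiased test set.

    test_acc[i] is the accuracy of f_{i+1}, i.e. the model right after task i+1.
    """
    return sum(test_acc) / len(test_acc)


# Hypothetical accuracies for T = 3 tasks (illustrative values only).
val_acc = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.5, 0.75, 1.0],
]
test_acc = [0.25, 0.5, 0.75]

acc = average_accuracy(val_acc)               # uses only the last row (f_T)
acc_ub = average_unbiased_accuracy(test_acc)  # uses every intermediate model
```

The difference between the two is which models enter the average: the conventional metric looks only at the final model, whereas the unbiased metric treats every intermediate model as a deployment candidate.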

3.1. CONTINUAL LEARNING ON BIASED DATASETS

As a first step toward CUL, we investigate the learning tendency of a CNN-based classifier on biased datasets. We use Biased MNIST (Bahng et al., 2020), which is biased toward background color, as shown in Figure 1. The unbiased test set has a uniform distribution of background color over all targets. Using Biased MNIST, we compare the average unbiased accuracy of models trained with the biased datasets jointly, $\bigcup_{1 \le t \le T} D_t$, and sequentially, $\{D_1, \ldots, D_T\}$. Figure 2 (a) shows a significant performance gap between these two learning scenarios. We suggest that this is because the prior information is forgotten, e.g., although the model has seen (red, 0) in the first task, it still makes the biased decision that gray = 0, considering only the (gray, 0) samples in the second task. We term this type of forgetting "malignant forgetting".

3.2. A CLOSER LOOK AT FORGETTING

Setup. Our motivation is that a classifier can learn unintended logic under biased conditions, so forgetting such logic is desirable. To examine this, we conduct another motivating experiment. We additionally construct shape-absent color images and their labels $(x, y) \in \tilde{D}$ to investigate the model's adaptability to a sequence of biased sets. The distorted dataset $\tilde{D}$ is shown in Figure 1 (b). We estimate the benign forgetting rate (BFR), which quantifies the model's generalizability by calculating the forgetting rate of the previous biased logic during the training of new biased tasks. With all the models $F$ and all the datasets $\tilde{D}_S = \{\tilde{D}_1, \ldots, \tilde{D}_{T-1}\}$, the BFR $B(F, \tilde{D}_S)$ can be defined as

$$B(F, \tilde{D}_S) = \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{\mathrm{acc}(f_t, \tilde{D}_t) - \mathrm{acc}(f_{t+1}, \tilde{D}_t)}{\mathrm{acc}(f_t, \tilde{D}_t) - \frac{1}{n(Y_t)}},$$

where $\mathrm{acc}(f_t, \tilde{D}_t)$ denotes the accuracy of $f_t \in F$ on $\tilde{D}_t$, and $\tilde{D}_t$ consists of augmented images that have the same correlation between the background and target labels as the original colored MNIST samples $D_t$. The function $n(Y_t)$ denotes the number of target labels, which is 10 for MNIST (0-9). Intuitively, the performance distance $\mathrm{acc}(f_t, \tilde{D}_t) - 1/n(Y_t)$ is the degree of the model's dependence on the bias attribute (color) for predicting $Y_t$, because $\tilde{D}_t$ contains only 'color' information. (Note that $1/n(Y_t)$ is the lower limit of the model's discrimination, meaning that $f_t$ guesses at chance level.) For the numerator, if the classifier $f_{t+1}$ still sustains the biased logic of $f_t$ (e.g., red = 0, . . . , grey = 9 for $f_1$ in Figure 1), then $\mathrm{acc}(f_t, \tilde{D}_t) - \mathrm{acc}(f_{t+1}, \tilde{D}_t) \approx 0$; otherwise the difference grows, and the ratio approaches 1 as $f_{t+1}$ falls to chance level on $\tilde{D}_t$. The denominator $\mathrm{acc}(f_t, \tilde{D}_t) - 1/n(Y_t)$ is a scaling factor. Without it, a relatively unbiased model, for which $\mathrm{acc}(f_t, \tilde{D}_t)$ is small to begin with, would be penalized: the difference $\mathrm{acc}(f_t, \tilde{D}_t) - \mathrm{acc}(f_{t+1}, \tilde{D}_t)$ is necessarily small, so $B(F, \tilde{D}_S)$ would be small even though the model forgets the biased logic well.
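The benign forgetting rate can be illustrated with a short sketch, assuming the accuracies on the distorted (color-only) sets have already been measured. The function name, argument layout, and the numbers are hypothetical; the check for a zero denominator is an added safeguard and is not part of the definition in the text.

```python
from typing import Sequence


def benign_forgetting_rate(
    acc_before: Sequence[float],
    acc_after: Sequence[float],
    num_classes: int,
) -> float:
    """Average normalized drop in accuracy on the distorted (color-only) sets.

    acc_before[t] is acc(f_t, D~_t) and acc_after[t] is acc(f_{t+1}, D~_t),
    for t = 1, ..., T-1 (stored 0-based).
    """
    if len(acc_before) != len(acc_after) or not acc_before:
        raise ValueError("need one (before, after) pair per task transition")
    chance = 1.0 / num_classes  # lower limit: guessing among n(Y_t) labels
    ratios = []
    for a_t, a_next in zip(acc_before, acc_after):
        dependence = a_t - chance  # how much f_t relies on the bias attribute
        if dependence == 0.0:
            # Assumption: the ratio is undefined when f_t is exactly at chance.
            raise ValueError("acc(f_t, D~_t) equals chance; ratio is undefined")
        ratios.append((a_t - a_next) / dependence)
    return sum(ratios) / len(ratios)


# Hypothetical example with T = 3 tasks and 10 labels:
# after task 2 the model keeps the old color rule (no forgetting),
# after task 3 it drops to chance on the old color rule (full forgetting).
bfr = benign_forgetting_rate([0.9, 0.9], [0.9, 0.1], num_classes=10)
```

In this example the first transition contributes 0 and the second contributes 1, so the average is 0.5.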
We set the baseline using a simple CNN model and compare it with conventional CL models, namely regularization-based (Kirkpatrick et al., 2017; Li & Hoiem, 2017) and generator-based (Shin et al., 2017; Liu et al., 2020; Smith et al., 2021) methods. We calculate $\mathrm{Acc}_{ub}$ and $B(F, \tilde{D}_S)$ with β = 0.95.

Forgetting is not always malignant. First, the experiment displayed in Figure 2 (b) shows that the BFR and generalization on biased datasets are correlated, meaning that forgetting is not always catastrophic. This confirms our assumption that forgetting previous biased logic is preferable in CUL, and we define such forgetting as "benign forgetting". Second, it is notable that conventional CL methods are limited in CUL, showing insufficient BFR on Biased MNIST and unsatisfactory accuracy on the unbiased test set, even when compared to the baseline. This is because they are designed only to mitigate malignant forgetting and hence cannot adequately utilize benign forgetting.

4. METHODOLOGY

We discourage malignant forgetting by generating previous representations and encourage benign forgetting via contrastive learning. Although a feature generator (Liu et al., 2020) and contrastive learning (Cha et al., 2021; Fini et al., 2022) have each been applied to CL, we are the first to employ both, proposing feature-level augmentations to bridge the two methods. Additionally, we show qualitatively and quantitatively that these methods are effective for CUL.

4.1. DISCOURAGING MALIGNANT FORGETTING

Let $f : \mathcal{X} \to \mathcal{Y}$ be a classifier with $L$ layers. For $1 \le l < L$, the classifier $f$ can be split into three sub-modules as $f = \{f_{[1,\ldots,l]}, f_{[l+1,\ldots,L-1]}, f_{[L]}\}$, where $[\cdot]$ denotes the indices of the layers included in the sub-module. For convenience, we denote $f_{[1,\ldots,l]}$ as $f^a$ and $f_{[l+1,\ldots,L-1]}$ as $f^b$, i.e., $f = \{f^a, f^b, f_{[L]}\}$. To address malignant forgetting, we intend to make the latent feature $v \in \mathbb{R}^{H \times W \times C}$ of the $l$-th layer include prior information. Following the adversarial training of the GAN (Gulrajani et al., 2017), we train a (feature) generator $G : \mathcal{Z} \to \mathbb{R}^{H \times W \times C}$ that maps a noise vector $z \sim P_z$ into $\mathbb{R}^{H \times W \times C}$, and a discriminator $D : \mathbb{R}^{H \times W \times C} \to [0, 1]$ that distinguishes real samples from $P_r$ and fake samples $G(z)$. Thus, with fake features $v_f := G(z)$ and real features $v_r := f^a(x) \in \mathbb{R}^{H \times W \times C}$, the feature generator $G$ and the discriminator $D$ are optimized by

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}\left[D(f^a(x))\right] - \mathbb{E}_{z \sim P_z}\left[D(G(z))\right].$$

With $G$ trained on prior tasks, $f^b$ receives both $v_r$ (current) and $v_f$ (previous) as input and hence can memorize previous knowledge, i.e., malignant forgetting is discouraged. When fake images are generated and fed to the classifier, as in Shin et al. (2017), Kemker & Kanan (2017), Xiang et al. (2019), and Ostapenko et al. (2019), the classifier may be confused if the generator does not work well, because images are only 3-channel aggregated features. In contrast, the generated feature maps cover a larger range of feature space, providing the classifier with more information to be utilized selectively (please consult the comparison experiment in the Appendix). This is favorable for CUL because a biased classifier results when the model is overly dependent on a few features, especially the bias attribute variable.
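The split of the classifier and the adversarial objective can be sketched as follows. This is only a toy illustration under stated assumptions: layers are represented by placeholder names rather than real network modules, the discriminator outputs are given as plain lists of numbers, and the helper names `split_classifier` and `adversarial_value` are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def split_classifier(layers: Sequence[str], l: int) -> Tuple[List[str], List[str], List[str]]:
    """Split an L-layer classifier into (f_a, f_b, f_[L]) at layer l, with 1 <= l < L."""
    if not 1 <= l < len(layers):
        raise ValueError("l must satisfy 1 <= l < L")
    f_a = list(layers[:l])      # layers 1..l: produce the real features v_r
    f_b = list(layers[l:-1])    # layers l+1..L-1: trained with contrastive learning
    head = list(layers[-1:])    # layer L: final classification layer
    return f_a, f_b, head


def adversarial_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of E[D(f_a(x))] - E[D(G(z))].

    The discriminator is trained to increase this value and the generator to
    decrease it. d_real and d_fake are discriminator outputs on a batch of
    real features f_a(x) and generated features G(z), respectively.
    """
    return fmean(d_real) - fmean(d_fake)


# Toy example: an 8-layer classifier split at l = 2.
layers = [f"layer{i}" for i in range(1, 9)]
f_a, f_b, head = split_classifier(layers, l=2)

# Hypothetical discriminator outputs on one batch.
value = adversarial_value(d_real=[1.0, 0.5], d_fake=[0.25, 0.25])
```

In a real implementation the two expectations would be estimated on mini-batches and optimized with alternating gradient steps on the discriminator and the generator.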

4.2. ENCOURAGING BENIGN FORGETTING

To train $f^b$, we apply bootstrap your own latent (BYOL) contrastive learning (Grill et al., 2020) because the generator $G$ does not provide target labels for fake features. Although pseudo-labels can be obtained through a discriminator (Shin et al., 2017), auto-encoder-based generators (Kemker & Kanan, 2017), or conditional GANs (Xiang et al., 2019; Ostapenko et al., 2019), under biased conditions even a few mislabeled samples could cause a large degradation in performance. Furthermore, bias is fundamentally based on the correlation between the label and attribute spaces. Thus, label-free contrastive learning encourages the classifier to form representations independent of bias and hence to be robust to bias by forgetting the biased logic, i.e., it encourages benign forgetting.

Following the two-network design of BYOL, let $g^{online}_\theta := \{f^b_\theta, p_\theta, q_\theta\}$ and $g^{target}_\xi := \{f^b_\xi, p_\xi\}$ be the online and target networks, where $p$ and $q$ denote additional embedding layers and $\theta$, $\xi$ are the parameters of the online and target networks, respectively. The embedding layers $p$ encode $v$, and $q$ receives the output of $p$ as input. We aim to train our objective network, $g^{online}_\theta$, by learning to predict the representations made by $g^{target}_\xi$, i.e., knowledge distillation. The $f^b_\xi$ and $p_\xi$ networks have the same architecture as $f^b_\theta = f_{[l+1,\ldots,L-1]}$ and $p_\theta$, respectively, but with different weight parameters $\xi$.

Since augmentations of the input image space are not directly applicable in the feature space due to distribution shift, we propose feature-level augmentation. For a given feature vector $v$, the transformed features are $v' = \mathrm{Dropout}(v + \epsilon', \gamma)$ and $v'' = \mathrm{Dropout}(v + \epsilon'', \gamma)$, where $\epsilon', \epsilon'' \in \mathbb{R}^{H \times W \times C}$, sampled from $N(\mu, \Sigma)$, are Gaussian noise vectors with the same shape as the feature $v$, and $\gamma$ is the dropout ratio. Adding noise vectors augments the feature $v$ spatially, because adjacent pixels of feature maps contain spatial information and are closely correlated.
We also apply the channel-wise dropout technique (Tompson et al., 2015), which amounts to channel-wise sampling of feature maps. With augmented features $v'$ and $v''$, the contrastive loss can be formulated as:

$$L^{contra}_{\theta,\xi}(v', v'') := \left\| \frac{g_\theta(v')}{\|g_\theta(v')\|} - \frac{g_\xi(v'')}{\|g_\xi(v'')\|} \right\|^2, \qquad (3)$$

where $\|\cdot\|$ denotes the $L_2$-norm. For the real and fake features, we optimize the online network $g_\theta$ using the contrastive loss $L^{contra}_{\theta,\xi}(v'_f, v''_f) + L^{contra}_{\theta,\xi}(v'_r, v''_r)$. The target network $g_\xi$ is updated via the moving average $\xi \leftarrow \tau \xi + (1 - \tau)\theta$, where $0 < \tau < 1$ is the target decay rate, which is increased during training to adjust the weight updating ratio between $g^{online}_\theta$ and $g^{target}_\xi$. After contrastive learning, only $f^b_\theta$ is used in conjunction with $f^a$ and $f_{[L]}$. Then, we train the classifier $f$ by cross-entropy loss with samples of the current task in supervised mode. The overall procedure for the proposed approach is presented in Algorithm 1.

Algorithm 1: LwP: Learning without Prejudices
Inputs: Datasets for T tasks $\{D_t = (X_t, Y_t)\}_{t=1}^{T}$, classifier $f = \{f^a, f^b, f_{[L]}\}$,
    for $v \in \{v_r, v_f\}$ do
        $\epsilon', \epsilon'' \sim N(\mu, \Sigma)$ // sample noises for augmentations
        $v' \leftarrow \mathrm{Dropout}(v + \epsilon', \gamma)$ // augment the feature vector
        $v'' \leftarrow \mathrm{Dropout}(v + \epsilon'', \gamma)$ // augment the feature vector
        Calculate $L^{contra}_{\theta,\xi}(v', v'')$ in (3) // calculate contrastive loss
        $\theta \leftarrow \mathrm{Optimizer}(\nabla_\theta L^{contra}_{\theta,\xi}(v', v''), \theta)$ //
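The three ingredients of this step (feature-level augmentation, the normalized contrastive loss, and the moving-average update of the target network) can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: features are stored as a list of channels, the noise is assumed to be zero-mean and isotropic with a hypothetical scale `sigma`, dropped channels are simply zeroed without rescaling, and the network outputs are passed in as plain vectors.

```python
import math
import random
from typing import List, Sequence

Feature = List[List[float]]  # C channels, each a flat list of H*W activations


def augment(feature: Feature, sigma: float, gamma: float, rng: random.Random) -> Feature:
    """Feature-level augmentation: additive Gaussian noise, then channel-wise dropout."""
    out = []
    for channel in feature:
        if rng.random() < gamma:
            out.append([0.0] * len(channel))  # drop the whole channel
        else:
            out.append([a + rng.gauss(0.0, sigma) for a in channel])  # noisy copy
    return out


def contrastive_loss(online_out: Sequence[float], target_out: Sequence[float]) -> float:
    """Squared L2 distance between the L2-normalized online and target outputs."""
    def normalize(vec: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    p, z = normalize(online_out), normalize(target_out)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def ema_update(target: Sequence[float], online: Sequence[float], tau: float) -> List[float]:
    """Moving-average update of the target parameters: xi <- tau * xi + (1 - tau) * theta."""
    return [tau * xi + (1.0 - tau) * th for xi, th in zip(target, online)]


rng = random.Random(0)
v = [[1.0, 2.0], [3.0, 4.0]]                      # toy feature with C = 2 channels
v1 = augment(v, sigma=0.1, gamma=0.5, rng=rng)    # first view v'
v2 = augment(v, sigma=0.1, gamma=0.5, rng=rng)    # second view v''
loss_same_direction = contrastive_loss([1.0, 0.0], [2.0, 0.0])
loss_orthogonal = contrastive_loss([1.0, 0.0], [0.0, 1.0])
new_target = ema_update([1.0], [0.0], tau=0.75)
```

Because both outputs are normalized, the loss depends only on the angle between them: it is 0 for parallel vectors, 2 for orthogonal vectors, and 4 for opposite vectors.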

5. EXPERIMENTS

In this section, we experimentally evaluate the proposed method and compare it with several state-of-the-art models. We use three biased datasets: Biased MNIST (Bahng et al., 2020), Biased CIFAR-10 (Hendrycks & Dietterich, 2019), and Biased CelebA-HQ, modified from Karras et al. (2017).

5.1. DATASETS

For Biased MNIST, we use the experimental setup in Section 3.1 to evaluate the model's generalizability. Following the bias planting protocol proposed by Nam et al. (2020), we create Biased CIFAR-10. For the corruption types {Snow, Frost, Fog, Brightness, Contrast, Spatter, Elastic, JPEG, Pixelate, Saturate}, we set each as the bias corresponding to one target object. The biases are changed for each task in exactly the same way as in Biased MNIST, giving (airplane, snow), (automobile, frost), . . . , (truck, saturate) in the first task, and (airplane, frost), (automobile, fog), . . . , (truck, snow) in the second task. The unbiased test set exhibits a uniform distribution of corruption types across the target objects. The number of training image samples for each task is 4,000, and the test set has 10,000 images. We set β = 0.85 for Biased MNIST and Biased CIFAR-10.

Among the attributes of images in CelebA-HQ (Karras et al., 2017), we set 'gender' as the target label and select 'makeup' and 'hair color' as the bias of the first and second task, respectively, because they have a significant correlation with 'gender' in the dataset. We name this sampled dataset Biased CelebA-HQ. Thus, the randomly sampled training images are {(HeavyMakeup, Female), (NoMakeup, Male)} and {(BlondHair, Female), (BlackHair, Male)} for the first and second task, respectively. We additionally set the bias of the third task to 'hair length', utilizing the public annotation provided by Jeon et al. (2022); hence, the training set consists of {(LongHair, Female), (ShortHair, Male)} images. All (attribute, gender) pairs for training have 2,000 samples, and the unbiased test set is composed of 100 images sampled to be evenly distributed. For CelebA-HQ, we do not split the biased data for validation because there are not enough biased pair samples. The detailed distribution of the training and test sets is presented in the Appendix.
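The task-wise rotation of biases described for Biased CIFAR-10 can be sketched as follows. The rotation rule is read off the (class, corruption) pairs listed above; the sampling rule for bias-conflicting samples (a uniformly random other corruption with probability 1 - β) is an assumption made for illustration, and the function names are hypothetical.

```python
import random

CORRUPTIONS = ["Snow", "Frost", "Fog", "Brightness", "Contrast",
               "Spatter", "Elastic", "JPEG", "Pixelate", "Saturate"]
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]


def aligned_bias(label: int, task: int) -> int:
    """Index of the corruption paired with `label` in task `task` (0-based).

    Each new task shifts the pairing by one position, wrapping around at the end.
    """
    return (label + task) % len(CORRUPTIONS)


def sample_bias(label: int, task: int, beta: float, rng: random.Random) -> int:
    """Pick a corruption for one training image.

    Assumption: with probability beta the image gets the task's aligned
    corruption; otherwise it gets one of the remaining corruptions uniformly.
    """
    aligned = aligned_bias(label, task)
    if rng.random() < beta:
        return aligned
    others = [b for b in range(len(CORRUPTIONS)) if b != aligned]
    return rng.choice(others)


# Pairings for the first two tasks, as (class, corruption) names.
first_task = [(CLASSES[y], CORRUPTIONS[aligned_bias(y, 0)]) for y in range(10)]
second_task = [(CLASSES[y], CORRUPTIONS[aligned_bias(y, 1)]) for y in range(10)]
```

Under this sampling assumption, β = 0.1 with ten corruption types makes every corruption equally likely for every class, and β = 1 makes the dataset completely biased.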

5.2. EXPERIMENTAL SETUP

Competing models. Previous approaches to CL can be categorized into regularization-based, replay-based, and generator-based approaches. We include methods from all of these categories as competitors. We also add an unbiasing method, LfF (Nam et al., 2020); although it is not designed for CL, it can be applied to CUL. Thus, we compare the proposed method with EWC (Kirkpatrick et al., 2017), LwF (Li & Hoiem, 2017), DGR (Shin et al., 2017), GFR (Liu et al., 2020), ABD (Smith et al., 2021), HAL (Chaudhry et al., 2021), DER (Buzzega et al., 2020), LiDER (Bonicelli et al., 2022), and LfF on all the datasets. Our model design does not use previous samples when training on a new task, because a replay buffer cannot be deployed when access to prior data is strictly limited due to privacy concerns, e.g., personal medical or credit information. Nonetheless, for a fair comparison with the replay-based methods, we also evaluate our method with a buffer. We set the buffer size to 200.

Implementation details. For Biased MNIST, we employ a simple CNN composed of four convolutional layers and three fully connected layers as the baseline model and as the backbone network of all competing models. For Biased CIFAR-10 and Biased CelebA-HQ, ResNet-18 (He et al., 2016) is used as the backbone. For each experiment, we use the Adam optimizer (Kingma & Ba, 2014) and select the learning rate (initial value and decay schedule), stopping criterion, and batch size by grid search. For the hyperparameters of competing models, we follow the settings presented in the respective papers. However, as we experimentally found that the learning rate is important in CUL, we fine-tune it and report the result if it is better than that of the original setting. More implementation details are provided in the Appendix. We set $l$ based on the experiments reported in the Appendix.

5.3. EXPERIMENTAL RESULTS

Table 1 exhibits the evaluation results of LwP and state-of-the-art models on CUL. From the experiments on three intentionally biased datasets, we find that regularization-based models are limited in CUL, showing performance degradation relative to the Base model on some datasets. It is notable that the generator-based methods show remarkable generalizability in some experiments. This implies that generating prior samples and feeding them to the model contributes to CUL. Nonetheless, utilizing only the generated prior samples is of limited use for encouraging benign forgetting (see the experiments on Biased CelebA-HQ). Building on the generator-based approach, LwP significantly increases performance with contrastive learning. This quantitatively demonstrates that contrastive learning contributes to benign forgetting (this is also demonstrated in Table 2). Further, it is noteworthy that LwP (w/o buffer) outperforms replay-based methods on some datasets (Biased MNIST, Biased CelebA-HQ). The unbiased learning method, LfF, deteriorates the performance on all the datasets, exhibiting performance degradation compared to Base.

Published as a conference paper at ICLR 2023

Analysis of LwP components. We conduct an ablation study to evaluate the contribution of each component presented in Section 4. Table 2 shows that the self-supervised technique (BYOL) and the feature generator both contribute to the generalizability of the model.

5.4. ANALYSIS

Choice of self-supervised learning. We compare the performance of the models using different self-supervised learning (SSL) approaches (He et al., 2020; Grill et al., 2020; Caron et al., 2021; Zbontar et al., 2021). MoCo (He et al., 2020) requires large memory banks with a contrastive loss called InfoNCE (Oord et al., 2018). BYOL (Grill et al., 2020) proposed a metric learning approach trained using a momentum encoder. DINO (Caron et al., 2021) complemented BYOL with a similarity matching loss and mean-teacher self-distillation (Tarvainen & Valpola, 2017). Barlow Twins (Zbontar et al., 2021) proposed an objective function that measures the cross-correlation matrix between the two embeddings that an identical network produces from two differently distorted samples. MoCo is limited in its use because of the unacceptable computational cost of maintaining (positive, negative) pairs. We select BYOL as our SSL technique through an experimental comparison between BYOL, DINO, and Barlow Twins. Table 3 shows that BYOL outperforms the other SSL methods.

Generalization of LwP on various β. In reality, training datasets are biased to varying degrees. Thus, we evaluate LwP on several values of β and compare it to state-of-the-art methods. In Figure 3, LwP exhibits the best generalizability for all the β in both cases.

6. RELATED WORK

6.1. CONTINUAL LEARNING

Recent literature on continual learning can be categorized into regularization-based, replay-based, and generator-based methods.

Regularization-based methods. EWC (Kirkpatrick et al., 2017) approximated the importance of parameters from a probabilistic perspective and regularized updates of the decisive ones when training a new task. LwF (Li & Hoiem, 2017) has multiple task-specific heads. It records the probabilities obtained from the previous task heads and uses them as targets of a surrogate loss when learning a new task. Chaudhry et al. (2018) regularized parameter updating in a current task such that the new conditional likelihood is close to the previously learned one in terms of KL-divergence. Aljundi et al. (2018) presented memory-aware synapses (MAS) that estimate the importance of the parameters in a model and then penalize changing them in new tasks. In further work, Aljundi et al. (2019a) investigated how to transform MAS into an online setup where the data distribution changes and the tasks are not specified. Dhar et al. (2019) presented an attention-based approach that incrementally learns new classes by restricting divergence. Ahn et al. (2019) presented an uncertainty-regularized continual learning framework based on Bayesian online learning. Douillard et al. (2020) proposed a spatial-based distillation loss applied throughout the model.

Replay-based methods. Inspired by the first replay-based approach for catastrophic forgetting and experience replay (Robins, 1995), the majority of related studies have been devoted to replay-based methods. Lopez-Paz & Ranzato (2017) united rehearsal methods with knowledge distillation and regularization, making the model sufficiently close to the previous model. LiDER (Bonicelli et al., 2022) constrained the backbone network by its layer-wise Lipschitz constants with respect to replay samples.

Generator-based methods.
Although the replay-based method is an intuitive approach for tackling catastrophic forgetting, it cannot be deployed if there is a privacy issue regarding data. As an alternative, generator-based methods that do not directly store prior samples were presented. DGR (Shin et al., 2017) obtained past information via a GAN, making pseudo-labels from the discriminator. Kemker & Kanan (2017) generated prior informative images via auto-encoder, where the encoder approximates pseudo labels, and the decoder makes images guided by reconstruction loss. Xiang et al. (2019) exploited a GAN conditioned on the labeled features embedded by the discriminator, which allows explicit supervised learning for the classifier. Ostapenko et al. (2019) presented dynamic generative memory that employs an auxiliary classifier GAN with an increased number of parameters. In each task, binary masks were applied to concentrate on influential parameters. GFR (Liu et al., 2020) proposed a feature generator, reducing the complexity of generative replay and preventing the imbalance problem. Yin et al. (2020) and ABD (Smith et al., 2021) applied 'inversion', which makes class-conditional input images from random noise via a trained network.

6.2. UNBIASED LEARNING

Unbiased learning is a branch of robustness in machine learning. As machine-learning algorithms are overly dependent on the distribution of training datasets, models are often biased, causing unreliable generalization at inference (Torralba & Efros, 2011) . A line of recent works has been dedicated to mitigating this issue. Most of the studies assumed that the biases in datasets are known (e.g., color, texture) and exploited this information by designing various models (Kim et al., 2019; Geirhos et al., 2019; Bahng et al., 2020; Gong et al., 2020; Adeli et al., 2021; Dhar et al., 2021) . However, with the motivation that a mechanism based on bias predefined by human knowledge is unsuited for image datasets including countless sensory attributes, LfF (Nam et al., 2020) and Jeon et al. (2022) addressed the challenge of unknown biases.

7. CONCLUSION AND FUTURE WORK

A large body of literature suggests that forgetting while learning a sequence of tasks is catastrophic. However, in this study, we found that forgetting can improve the generalizability of a model if the dataset has bias. Based on this motivation, our proposed method, LwP, encourages benign forgetting and discourages malignant forgetting for continual unbiased learning via a feature generator and contrastive learning in conjunction with feature-level augmentation. Experimentally, LwP contributes to generalization, while conventional CL methods are limited in unbiasing. In terms of further work, extensive exploration of the relationship between forgetting and various incomplete data distributions, e.g., imbalanced or mislabeled data distributions, is a potentially valuable future research direction.

Figure 7: Distribution of the CelebA-HQ test set. We randomly sampled the test set to be uniform over all the pairs. However, because some pairs, e.g., (Male, HeavyMakeup), have a limited number of image samples and those samples are themselves biased, e.g., all the (Male, HeavyMakeup) samples have short hair, the distribution could not be strictly uniform.

4 ). In this paper, we set $l = 2$, splitting $f$ into $\{f_{[1,2]}, f_{[3,4,5,6,7]}, f_{[8]}\}$. Hence, $f^a$ and $f^b$ are $f_{[1,2]}$ and $f_{[3,4,5,6,7]}$, respectively. $g^{online}_\theta = \{f^b_\theta, p_\theta, q_\theta\}$ and $g^{target}_\xi = \{f^b_\xi, p_\xi\}$ are used for contrastive learning, where $p$ and $q$ are FC layers with 128 units. Since the generator $G$ makes fake features with the same shape as the output of the encoder $f^a$, the generated features $v_f$ belong to $\mathbb{R}^{28 \times 28 \times 32}$ as feature maps activated from the input image by 4 CONV layers. Note that $l$ denotes the $l$-th layer, e.g., $l = 5$ is the GAP layer and $l = 7$ is the second FC layer.

Biased CIFAR-10 and Biased CelebA-HQ. For Biased CIFAR-10 and Biased CelebA-HQ, we use ResNet-18 as our classifier $f$, as depicted in Table 5. We modify the first CONV of ResNet-18 to use a kernel size of 3 × 3 instead of 7 × 7.
For convenience, we consider a block as a layer l. We split f ,3,4,5] for Biased CIFAR-10 with l = 1 and set l = 2 for Biased CelebA-HQ. We choose l by ablation study. We set g online θ = {f b θ , p θ , q θ } and g target ξ = {f b ξ , p ξ }, where p and q are FC layers with 512 units. Thus, the size of v f generated by G is H × W × 64 for Biased CIFAR-10 and H/2 × W/2 × 128 for Biased CelebA-HQ, where H and W denotes the height and width of input images. into {f [1] , f [2,3,4,5] , f [6] } resulting in f a = f [1] and f b = f [2 Generator G and discriminator D. We set the architecture of the feature generator G by one FC layer and three CONV layers. During generating features, we increase the size of feature maps in G with two interpolation layers. The discriminator D consists of four CONV layers and an FC layer to distinguish whether input features are real or fake. The leaky ReLU activation layer follows every CONV layer. We adjusted the number of units in the FC layer for G to fit the different shapes of the target feature maps for the three datasets while maintaining the overall architecture. To train the classifier f , for both supervised learning with samples of the current task and contrastive learning with previous samples, we use Adam optimizer with learning rate 10 -4 , weight decay 5 × 10 -4 , and (β 1 , β 2 ) = (0.9, 0.999). To train the generator G and discriminator D, we use Adam optimizers with learning rate 5 × 10 -5 for G and 2 × 10 -4 for D. We set (β 1 , β 2 ) = (0.5, 0. Splitting layer l is an important parameter in our model design. From the feature generator to online and target networks, all the architectures of the networks are decided by the parameter l. Therefore, we conduct a comparing experiment for l. By experiment in Table 6 and Table 7 , we found all the l show competitive generalizability compared to state-of-the-art CL methods. Among them, we chose the best one, l = 2, for our model. 
We conducted the same ablation experiments for Biased CIFAR-10 to choose l.

Choice of self-supervised learning. We compare the performance of models using different self-supervised learning (SSL) approaches, namely MoCo, DINO, Barlow Twins, and BYOL, on Biased CIFAR-10 in exactly the same way as for Biased MNIST.
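As a reference point for the BYOL variant compared above, the following minimal sketch evaluates the momentum coefficient τ := 1 − (1 − 0.996)(cos(πk/K) + 1)/2 that the training setup adopts from BYOL. The function names `target_momentum` and `ema_update` are hypothetical, the exponential-moving-average update is the standard BYOL rule rather than code from this work, and K = 20 assumes that K equals the 20 epochs trained per task.

```python
import math


def target_momentum(k, K, tau_base=0.996):
    """Cosine schedule: tau = 1 - (1 - tau_base) * (cos(pi * k / K) + 1) / 2."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * k / K) + 1.0) / 2.0


def ema_update(target, online, tau):
    """Standard BYOL-style update: target <- tau * target + (1 - tau) * online.

    Parameters are plain lists of floats here purely for illustration.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


K = 20  # assumption: K is the number of epochs per task
taus = [target_momentum(k, K) for k in range(K + 1)]
print(taus[0], taus[K // 2], taus[K])
```

Under this schedule τ starts at the base value 0.996 and rises to 1 by the end of training, so the target network changes more and more slowly.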

C.4 REPLAY BUFFER

Learning with a replay buffer helps the model memorize previous information, discouraging malignant forgetting. Although there are special cases in which access to prior samples is strictly limited (e.g., private data such as medical or financial records with a short storage period), a replay buffer is often available. Therefore, we compare the generalizability of LwP and replay-based methods with small (200 samples) and large (2,000 samples) buffers. Table 12 shows that our method achieves the best performance in both scenarios.



Figure 1: Our experimental setup. (a) Biased MNIST. At each task, we change the bias (background color) by sliding each color by +1 along the labels (0-9), so that the color corresponding to 9 moves to 0. (b) Distorted test set DS. We randomly choose a ratio α of the pixels of a single-colored image and paint them white, setting α to the average ratio of digit pixels, which are rendered in white, in MNIST. A more detailed dataset configuration is provided in the Appendix.
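The distortion in panel (b) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `distort`, the nested-list image representation, and the value α = 0.2 are placeholders, not the authors' code or the actual MNIST statistic.

```python
import random

WHITE = (255, 255, 255)


def distort(color, height, width, alpha, rng):
    """Paint a fraction `alpha` of randomly chosen pixels of a
    single-colored image white; the image is returned as nested lists."""
    num_pixels = height * width
    num_white = round(alpha * num_pixels)
    white_idx = set(rng.sample(range(num_pixels), num_white))
    return [
        [WHITE if r * width + c in white_idx else color for c in range(width)]
        for r in range(height)
    ]


rng = random.Random(0)
# alpha = 0.2 is a placeholder, not the average digit-pixel ratio of MNIST.
image = distort(color=(255, 0, 0), height=28, width=28, alpha=0.2, rng=rng)
white_count = sum(pixel == WHITE for row in image for pixel in row)
print(white_count / (28 * 28))
```

Because the white pixels are scattered at random, such an image carries the background color and roughly the same amount of white as a digit, but no digit shape.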

Figure 2: Motivating experiments. (a) Investigation for CUL. Each line in the graph shows the performance of a vanilla CNN on a sequence of Biased MNIST tasks for a different biased degree β. The biased degree β denotes the fraction of biased images in the dataset: 0.1 corresponds to an unbiased set with a uniform distribution, and 1 to a completely biased one. We also train on all the tasks from 1 to 10 concurrently and plot the resulting Acc_ub as a red line indicating the upper bound. (b) Benign forgetting rate (BFR) of models. We estimate the BFR for the conventional CL model, the baseline, and our model (LwP).

new biased tasks. With all the models F and all the datasets DS = {D_1, ..., D_{T-1}}, BFR B(F, DS) can be defined as

Figure 3: Evaluations on various β. (a) Continual unbiased learning. (b) Continual unbiased learning with replay buffer. We evaluate all the experiments with Biased MNIST.

EXPLORATION ON THE DATA DISTRIBUTION

We construct the biased training sets and unbiased test sets from Background Colored MNIST, Corrupted CIFAR-10, and CelebA-HQ to set up continual unbiased learning. Figures 4, 5, 6, and 7 show the overall distribution of each task on each dataset. Examples of Biased CIFAR-10 and Biased CelebA-HQ are displayed in Figure 8.

Figure 4: Distribution of Biased MNIST. (a) Training set (β = 0.85). We randomly sampled images from MNIST to make ten subsets and allocated them as the training sets for the ten tasks. Following the same sampling scheme for every task, we only slide the bias (color) each time the task changes, from task 1 to task 10. (b) Test set. The colors are uniformly distributed for each label; we refer to this as the unbiased set.
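The sampling scheme behind this distribution can be sketched as follows. This is a hedged illustration, not the authors' data pipeline: the names `bias_color` and `sample_color` are hypothetical, colors are represented by indices 0-9, color i is assumed to be paired with label i in the first task, and non-aligned colors are assumed to be drawn uniformly from the remaining nine (which makes β = 0.1 uniform, in line with the definition of β).

```python
import random

NUM_CLASSES = 10  # digits 0-9, one background color per digit


def bias_color(label, task):
    """Bias-aligned color index for `label` at 0-indexed `task`.

    Assumes color i is paired with label i in the first task and that each
    color moves to the next label (9 wraps around to 0) at every task change.
    """
    return (label - task) % NUM_CLASSES


def sample_color(label, task, beta, rng):
    """Return the bias-aligned color with probability beta, otherwise a
    color drawn uniformly from the nine remaining ones (an assumption)."""
    aligned = bias_color(label, task)
    if rng.random() < beta:
        return aligned
    others = [c for c in range(NUM_CLASSES) if c != aligned]
    return rng.choice(others)


rng = random.Random(0)
colors = [sample_color(label=3, task=1, beta=0.85, rng=rng) for _ in range(10_000)]
aligned_fraction = colors.count(bias_color(3, 1)) / len(colors)
print(f"fraction of bias-aligned samples: {aligned_fraction:.3f}")
```

With β = 0.85, roughly 85% of the sampled colors coincide with the bias-aligned color of the current task, and the aligned color changes from one task to the next.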

Figure 5: Distribution of Biased CIFAR-10. (a) Training set (β = 0.85). We construct datasets in exactly the same way as Biased MNIST. (b) Unbiased test set.

Figure 6: Distribution of Biased CelebA-HQ training set. (a) Task1. 2,000 (Female, HeavyMakeup) and 2,000 (Male, NoMakeup) images are randomly sampled. (b) Task2. 2,000 (Female, BlondHair) and 2,000 (Male, BlackHair) images. (c) Task3. 2,000 (Female, LongHair) and 2,000 (Male, ShortHair) images. However, as a face image includes several attributes, we display all these pairs.

Figure 8: Experimental dataset. (a) Biased CIFAR-10. (b) Biased CelebA-HQ. We display image samples of the first task. For CIFAR-10, corruption is the bias attribute, e.g., all the airplanes are corrupted by snow noise. Each time the task changes, the bias slides by one. For CelebA-HQ, all the women wear makeup, while the men do not. Hair color and hair length are the bias attributes for the second and third tasks, respectively.

999) for Adam in G and D. To optimize the generator G and discriminator D, we use WGAN-GP. We set the input image size to 28 × 28 for Biased MNIST, 32 × 32 for Biased CIFAR-10, and 128 × 128 for Biased CelebA-HQ. The batch size and the number of epochs per task are 32 and 20, respectively, for all the experiments. For feature-level augmentation, we set the channel-wise dropout rate γ = 0.2 and use µ = 0 and Σ = 0.005 · I for N(µ, Σ) when spatially augmenting. Following exactly the same setup as BYOL, we set τ := 1 − (1 − 0.996)(cos(πk/K) + 1)/2, where k denotes the current training epoch and K denotes the total number of training epochs.

C EXPLORATION ON LWP

C.1 INVESTIGATION FOR ARCHITECTURAL PARAMETER, l
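The role of the splitting layer l can be illustrated with a small sketch. The function `split_classifier` is hypothetical, and layers are represented only by their indices (8 layers for the simple CNN, 6 blocks for ResNet-18, matching the splits described for the two classifiers); it is not the authors' implementation.

```python
def split_classifier(layers, l):
    """Split an ordered list of layers at the 1-indexed position l.

    Returns (f_a, f_b, head): f_a holds layers 1..l, f_b the layers from
    l + 1 up to the second-to-last one, and head the final layer.
    """
    if not 1 <= l <= len(layers) - 2:
        raise ValueError("l must leave at least one layer for f_b and the head")
    return layers[:l], layers[l:-1], layers[-1:]


# Layer indices only; the actual layers are those of the two classifiers.
simple_cnn = list(range(1, 9))     # 8 layers, simple CNN
resnet_blocks = list(range(1, 7))  # 6 blocks, ResNet-18

f_a, f_b, head = split_classifier(simple_cnn, l=2)
print(f_a, f_b, head)  # [1, 2] [3, 4, 5, 6, 7] [8]

r_a, r_b, r_head = split_classifier(resnet_blocks, l=1)
print(r_a, r_b, r_head)  # [1] [2, 3, 4, 5] [6]
```

In this picture, the generator G targets the output shape of f_a, while f_b is the part shared by the online and target networks, so changing l changes every component at once.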

Average accuracy Acc and average unbiased accuracy Acc_ub on all the datasets. Base denotes a simple CNN for Biased MNIST and ResNet-18 for Biased CIFAR-10 and Biased CelebA-HQ, without any additional regularization for forgetting or unbiasing. Replay (200) denotes replay-based approaches with a buffer of 200 previous samples. The best performance is shown in bold and the second best is underlined. All experiments are run with three different random seeds, and we report the means and standard deviations.

Ablation study on components of LwP. We use Biased MNIST for evaluation. The model with none of them applied is the base model, simple CNN.

Ablation study on self-supervised learning. We use Biased MNIST for evaluation. All the architecture and experimental setup except for SSL are exactly the same.

Aljundi et al. (2019b) considered a sampling of prior tasks as a constraint reduction problem, resulting in maximizing the diversity of samples. Mai et al. (2021) suggested that softmax classifiers could cause recency bias in continual learning and hence exploited the nearest-class-mean classifier instead. For preferable clustering, data samples from previous tasks are saved and utilized in the current training step. Lin et al. (2021) suggested continual contrastive self-supervised learning via a rehearsal method, which preserves the feature vectors using k-means clustering from the previous dataset. Madaan et al. (2021) presented a technique interpolating previous and current instances. DER (Buzzega et al., 2020) exploited logits sampled during the optimization trajectory to encourage consistency with its past. HAL (Chaudhry et al., 2021) complemented a new objective termed anchoring, where the model is bi-level optimized. LiDER (

Classifier f , Simple CNN. CONV_n denotes the CONV layer and FC_n denotes the FC layer for Simple CNN. H and W mean the height and width of input images.

Classifier f , ResNet-18. Block_n denotes the basic building block for ResNet. H and W mean the height and width of input images.

Comparison for splitting layer l. We evaluate the models on Biased MNIST three times and display the average of them with standard deviation.
Acc_ub 90.52(±0.15) 91.08(±1.24) 90.24(±0.90) 91.57(±0.82)

Comparison for splitting layer l. We evaluate the models on Biased CIFAR-10 three times and display the average of them with standard deviation.
Acc_ub 29.79(±0.53) 26.39(±0.54) 25.59(±0.87) 31.18(±0.29)

C.2 CONTRIBUTION OF FEATURE GENERATOR

Although several conventional CL approaches employ image generators to memorize previous samples, generated images are only 3-channel aggregated features. Thus, if the generator does not work well, the generated input images could disturb the desirable optimization of the classifier. Instead, generated features provide more information that can be used selectively and hence make the classifier more robust. Table 8 experimentally supports this intuition.

Comparison for image and feature generation. We evaluate all the models on Biased MNIST three times and display the average of them with standard deviation. Acc ub

Ablation study on components of LwP. We use Biased CIFAR-10 for evaluation. The model with none of them applied is the base model, ResNet-18.
Acc_ub 26.58(±0.45) 30.36(±0.29) 31.18(±0.29)

Ablation study on self-supervised learning. We use Biased CIFAR-10 for evaluation. All the architecture and experimental setup except for SSL are exactly the same.
Acc_ub 31.00(±0.57) 27.81(±0.63) 30.61(±0.36) 31.18(±0.29)

C.3 MODEL SIZE

Generally, the number of parameters affects the performance of a model. A simple comparison of the number of parameters is not entirely fair, because many CL methods address malignant forgetting through regularization-based (additional operations for regularization) or replay-based (more training iterations with a buffer) approaches, which are unrelated to model size. Nonetheless, generator-based methods can be compared in terms of model size. Therefore, we compare the model size and present a tiny version of LwP in Table 11. Notably, LwP(tiny) also exhibits competitive performance.


Acknowledgement. This work was supported by the NRF grant [2012R1A2C3010887] and the MSIT/IITP([1711117093], [2021-0-00077], [No. 2021-0-01343, Artificial Intelligence Graduate School Program (SNU)]).


Published as a conference paper at ICLR 2023 

C.5 INVESTIGATION ON MORE VARIOUS β

In data collected from the real world, the degree of bias β can vary. To investigate the generalizability of LwP and competing models, we vary β from 0.1 to 0.8 at intervals of 0.1 and from 0.8 to 0.95 at intervals of 0.05. Figure C.5 shows that our method generalizes best for all β.

