SELF-SUPERVISED CONTINUAL LEARNING BASED ON BATCH-MODE NOVELTY DETECTION

Anonymous

Abstract

Continual learning (CL) plays a key role in dynamic systems, enabling them to adapt to new tasks while preserving previous knowledge. Most existing CL approaches focus on learning new knowledge in a supervised manner, leaving the data-gathering phase to a novelty detection (ND) algorithm. Such a presumption limits practical usage, where new data must be learned quickly without being labeled. In this paper, we propose a unified approach to CL and ND, in which each new class of the out-of-distribution (OOD) data is first detected and then added to previous knowledge. Our method has three unique features: (1) a unified framework that seamlessly tackles both the ND and CL problems; (2) a self-supervised method for model adaptation, without the requirement of annotating the new data; (3) batch-mode data feeding that maximizes the separation of new knowledge from previous learning, which in turn enables high accuracy in continual learning. By learning one class at each step, the new method achieves robust continual learning and consistently outperforms state-of-the-art CL methods in the single-head evaluation on the MNIST, CIFAR-10, CIFAR-100, and TinyImageNet datasets.

1. INTRODUCTION

Machine learning methods have been widely deployed in dynamic applications, such as drones, self-driving vehicles, and surveillance. Their success is built upon carefully handcrafted deep neural networks (DNNs), big data collection, and expensive model training. However, due to unforeseeable circumstances in the environment, these systems will inevitably encounter input samples that fall out of the distribution (OOD) of their original training data, leading to instability and performance degradation. This scenario has inspired two research branches: (1) novelty detection (ND), or one-class classification, and (2) continual learning (CL), or life-long learning. The former aims to enable the system to detect the arrival of OOD data. The latter studies how to continually learn the new data distribution while preventing catastrophic forgetting Goodfellow et al. (2013) of prior knowledge. While there exists a strong connection between these two branches, current practices to solve them head in quite different directions. ND methods usually output all detected OOD samples as a single class, while CL methods package multiple classes into a single task for learning. This dramatic difference in problem setup prevents researchers from forming a unified algorithm spanning ND and CL, which is necessary for dynamic systems in reality. One particular challenge of multi-class learning in CL is the difficulty of data-driven (i.e., self-supervised or unsupervised) separation of new and old classes, due to their overlap in the feature space. Without labeled data, the model may struggle to find a global optimum that successfully separates the distributions of all classes. Consequently, the model either ends up with the notorious issue of catastrophic forgetting, or fails to learn the new data.
To overcome this challenge, previous methods either introduce constraints in model adaptation in order to protect prior knowledge when learning a new task Aljundi et al. (2018); Li & Hoiem (2018); Kirkpatrick et al. (2017); Rebuffi et al. (2017), or expand the network structure to increase the model capacity for new knowledge Rusu et al. (2016); Yoon et al. (2017). However, the constraint-based methods may not succeed when the knowledge distribution of a new task is far from the prior knowledge distribution. On the other hand, the methods with a dynamic network may introduce too much overhead when the amount of new knowledge keeps increasing.

Figure 1: The framework of our unified method for both novelty detection and one-class continual learning. The entire process is self-supervised, without the need of labels for new data.

In this context, our aims in this work are: (1) connecting ND and CL into one method for dynamic applications; (2) completing ND and CL without the annotation of OOD data; and (3) improving the robustness and accuracy of CL in the single-head evaluation. We propose a self-supervised approach to one-class novelty detection and continual learning, with the following contributions:

• A unified framework that continually detects the arrival of OODs, extracts and learns the OOD features, and merges the OOD into the knowledge base of previous IDDs. More specifically, we train a tiny binary classifier for each new OOD class as the feature extractor. The binary classifier and the pre-trained IDD model are sequentially connected to form an "N + 1" classifier, where "N" represents the prior knowledge containing N classes and "1" refers to the newly arrived OOD. This CL process continues as "N + 1 + 1 + 1 ..." (i.e., one-class CL), as demonstrated in this work.

• A batch-mode training and inference method that fully utilizes the context of the input and maximizes the feature separation between the OOD and previous IDDs, without using data labels.
This method achieves high accuracy in OOD detection and prediction in scenarios where IDDs and OODs stream into the system, such as video and audio.

• Comprehensive evaluation on multiple benchmarks, including MNIST, CIFAR-10, CIFAR-100, and TinyImageNet. Our proposed method consistently achieves robust and high single-head accuracy after learning a sequence of new OOD classes one by one.

2. BACKGROUND

Most continual learning methods belong to the supervised type, where the input tasks are well-labeled. To mitigate catastrophic forgetting, three directions have been studied in the community: (1) Regularization methods Zeng et al. (2019); Aljundi et al. (2018); Li & Hoiem (2018); Kirkpatrick et al. (2017); Rebuffi et al. (2017); Zenke et al. (2017); Ahn et al. (2019), which aim to penalize weight shifting towards the new task. This is realized by introducing a new loss constraint to protect the most important weights for previous tasks. Many metrics to measure the importance of weights have been proposed, such as the Fisher information matrix, distillation loss, and training trajectory. These methods are especially useful when the knowledge base of previous tasks and that of the new task overlap. (2) Rehearsal-based methods Lopez-Paz & Ranzato (2017); Chaudhry et al. (2018); Rolnick et al. (2019); Aljundi et al. (2019); Cha et al. (2021), which maintain a small buffer to store samples from previous tasks. To prevent the drift of prior knowledge, these samples are replayed in the middle of the training routine on the new task. Some rehearsal-based methods are combined with regularization methods to improve their performance. (3) Expansion-based methods Rusu et al. (2016); Yoon et al. (2017); Schwarz et al. (2018); Li et al. (2019); Hung et al. (2019), which aim to protect previous knowledge by progressively adding new network branches for the new task. Although these three supervised directions improve the performance of continual learning, we argue that the capability to automatically detect the task shift and learn the new knowledge (i.e., self-supervised or unsupervised) is the solution most preferred by a realistic system, since an expensive annotation process for the new task is impractical in the field. There also exist self-supervised CL methods Rao et al. (2019); Smith et al. (2021), which focus on generating class-discriminative representations from an unlabelled sequence of tasks. However, their demonstrations so far are still limited to relatively simple and low-resolution datasets, such as MNIST and CIFAR-10.

3. METHODOLOGY

3.1 TERMINOLOGY

Algorithm 1 Unification of One-class Novelty Detection and Continual Learning
Input: IDD memory budget X_IDD^mem = {X_IDD^mem,1, ..., X_IDD^mem,N}, IDD sequence X_IDD = X_IDD^1, ..., X_IDD^N, OOD sequence X_OOD = X_OOD^0, ..., X_OOD^t, trained model M_0
1: M_current ← M_0
2: for i = 0 to len(X_OOD) do
3:   X_OOD^current ← X_OOD[i]
4:   X_mixture ← X_IDD + X_OOD^current
5:   D_IDD ← IDD_estimator(X_IDD^mem, M_current)
6:   X_OOD^pred, B_i ← noveltyDetector(M_current, X_IDD^mem, D_IDD, X_mixture)
7:   X_IDD^mem.append(X_OOD^pred)
8:   M_current ← M_current + B_i   // Merge Model
9: end for
10: return M_current

Previous CL methods use a task-based setup, where each task consists of multiple classes of training samples. The model is trained to learn each task (i.e., multiple classes) sequentially. Different from previous methods, our proposed solution is a unified system that leverages the output of the novelty detector for CL. Therefore, we embrace a one-class-per-task setup as follows. As shown in Algorithm 1, the system is continuously exposed to a stream of mixed input that contains both in-distribution data (IDD) X_IDD = X_IDD^1, ..., X_IDD^N and one of the out-of-distribution (OOD) classes X_OOD^current ∈ X_OOD = X_OOD^0, ..., X_OOD^t, where each X^i corresponds to a single class in the dataset. X_IDD denotes the IDD set containing all the classes that the system already recognizes. X_OOD contains the OOD classes that the system is currently facing and will encounter in the future. The primary task is to filter X_OOD^current out of X_IDD through the novelty detection engine, while learning the features of X_OOD^current for continual learning. Once X_OOD^current is successfully detected and learned, we move it from X_OOD into X_IDD, and randomly draw a new OOD class from X_OOD as the new X_OOD^current. This process continues until X_OOD is empty and all classes have been learned by the model, i.e., have become IDDs.
For simplicity, we denote the initial IDD classes and the first OOD class as X_IDD^0 and X_OOD^0, respectively. When the system has successfully learned X_OOD^0, we denote the updated IDD set as X_IDD^1 and the next OOD as X_OOD^1. Then S_i = {X_IDD^i, X_OOD^i} represents the status when the system has just finished learning X_OOD^{i-1}. From the perspective of task-based learning, each X_OOD^i can be considered a single task. Our proposed method detects and learns only one class at a time.
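The state transition above, in which each learned OOD class joins the IDD set, can be sketched as a plain-Python loop mirroring Algorithm 1. This is only an illustrative sketch: `detect_and_learn` is a hypothetical stand-in for the ND + CL engine described in Section 3.2, not part of the paper's code.

```python
def continual_loop(idd_classes, ood_queue, detect_and_learn):
    """One-class continual learning loop (sketch of Algorithm 1).

    idd_classes:      set of class ids the model already recognizes (X_IDD)
    ood_queue:        list of class ids not yet learned (X_OOD)
    detect_and_learn: hypothetical callback that detects one OOD class,
                      trains its binary classifier B_i, and returns the
                      class id once it has been learned
    """
    history = []
    while ood_queue:
        current_ood = ood_queue.pop(0)           # X_OOD^current
        learned = detect_and_learn(current_ood)  # ND + train B_i
        idd_classes.add(learned)                 # merge: the OOD becomes IDD
        history.append(learned)
    return history
```

The loop terminates exactly when X_OOD is empty, matching the terminology above.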

3.2. ONE-CLASS NOVELTY DETECTION AND CONTINUAL LEARNING

Our method is built upon the recently proposed "self-supervised gradient-based novelty detector" Sun et al. (2022). The contributions of that method are two-fold. First, given a pre-trained model M_0 that is able to classify X_IDD^0, a statistical analysis evaluating the Mahalanobis distance in the gradient space is developed to threshold the OOD. Second, to further boost the performance, they introduce a self-supervised binary classifier, denoted as B_0, which guides the label-selection process used to generate the gradients, so that the Mahalanobis distance between the IDD and OOD data is further maximized. The primary OOD detector, based on the Mahalanobis distance, interleaves with the binary classifier. As more data stream into the system, the OOD detection accuracy gradually improves through this closed-loop interaction. This method can be naturally converted into a one-class CL solution. Upon the successful detection of X_OOD^0, the binary classifier B_0 is well trained to distinguish X_OOD^0 from the previous X_IDD^0. Intuitively, we can deploy B_0 upstream of the pre-trained IDD model M_0 to filter out the newly learned X_OOD^0, and leave the predicted X_IDD^0 samples to the downstream M_0 for further classification. This merged model M_1 = {M_0 + B_0} becomes the new baseline to detect the next X_OOD^1. We can keep adding a new binary classifier B_i to the previously merged model M_i = {M_{i-1} + B_{i-1}} for every new X_OOD^i upon its detection. Eventually, we obtain the chain structure in Fig. 1, where the latest binary classifier is the most knowledgeable one, able to recognize all of X_IDD and X_OOD. During the inference phase, its job is to filter out X_OOD^t from a mixed input that contains {X_IDD^t, X_OOD^t}. The remaining inputs X_IDD^t = {X_IDD^{t-1}, X_OOD^{t-1}} are sent downstream to the next binary classifier, which sequentially recognizes X_OOD^{t-1}.
This sequence continues until only X_IDD^0 is left and arrives at the original model M_0 for final classification.

Algorithm 2 Gradient-based Novelty Detection
1: function noveltyDetector(M, X_IDD^mem, D_IDD, X_mix)
2:   Initialize B_new
3:   X_mix^pure ← batchPurityEstimator(X_mix, X_IDD^mem, M)
4:   repeat (10 iterations)
5:     X_IDD^pred ← [ ], X_OOD^pred ← [ ]
6:     for each x ∈ X_mix^pure do
7:       if B_new not trained then
8:         c_M0 ← c_M0^pred, c_Bi ← c_Bi^pred
9:       else
10:        c_M0 ← c_M0^cust, c_Bi ← 1 − c_Bi^pred
11:      end if
12:      ∇f_M(x) ← ∇_{c_M0} f_M0(x) ∥ ... ∥ ∇_{c_Bi} f_Bi(x)
13:      noveltyScore_l ← [ ]
14:      for each (μ_c, Σ_c) in D_IDD do
15:        score_c ← (∇f_M(x) − μ_c)^T Σ_c^{-1} (∇f_M(x) − μ_c)
16:        noveltyScore_l.append(score_c)
     (... intermediate thresholding and B_new training steps not recoverable from the source ...)
23:  return X_IDD^pred, B_new
24: end function

There are two advantages of this one-class learning method. First, after we merge a new binary classifier into the existing structure, all of its weights are frozen and isolated from future training towards new OODs. This prevents knowledge shifting and thus minimizes catastrophic forgetting. Second, each binary classifier only induces a small memory overhead, since its task is simple enough to be accomplished with very few layers of neurons. Compared with previous dynamic methods, which either add an indeterminate number of neurons to each layer or create new branches using sub-modules, our method requires much less memory. On the other hand, such a sequential learning model leaves two questions to be answered: (1) How do we port the original Mahalanobis distance method into this merged model so that it can detect the next OOD? (2) How do we achieve a high inference accuracy for each binary classifier, so that the testing samples can pass through multiple steps of this sequence and reach the final classifier without losing accuracy? We address these problems in subsections 3.3 and 3.4, respectively.
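The chain-structured inference described above (newest binary classifier first, original model M_0 last) can be sketched in a few lines. This is a hypothetical illustration: each entry of `binary_classifiers` is a predicate returning True when it claims the sample as the OOD class it learned, and `base_model` stands in for M_0.

```python
def chain_predict(x, binary_classifiers, base_model):
    """Route one sample through the classifier chain (sketch).

    binary_classifiers: [B_t, ..., B_0], newest first; B_i(x) -> bool,
        True if x belongs to the OOD class that B_i learned.
    base_model: M_0, maps x to one of the original N class ids.
    """
    t = len(binary_classifiers) - 1
    for i, b in enumerate(binary_classifiers):
        if b(x):
            # claimed by the classifier for the (t - i)-th learned OOD class
            return ('ood', t - i)
        # otherwise, fall through to the next (older) classifier
    return ('idd', base_model(x))  # only X_IDD^0 samples reach M_0
```

Each sample falls through the chain until some classifier claims it, so a small per-classifier error compounds along the sequence, which is exactly why Section 3.4 focuses on per-classifier accuracy.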

3.3. GRADIENT-BASED NOVELTY DETECTOR

The stepping stone of the previous novelty detector (Algorithms 2 and 3) is to characterize the IDD and OOD in the gradient space. If a model M has been trained to classify the N classes in X_IDD^train = {X_IDD^train,1, ..., X_IDD^train,N}, the gradients collected from each X_IDD^i ∈ X_IDD form a class-wise multivariate distribution D_IDD = {D_IDD^1, ..., D_IDD^N}. Each D_IDD^i ~ N(μ_i, Σ_i) corresponds to a Gaussian distribution for a particular class i, and future IDDs X_IDD^val = {X_IDD^val,1, ..., X_IDD^val,N} will fall within the range of the corresponding D_IDD^i. On the contrary, due to its low confidence towards the OOD, the model M will adjust itself to fit the OOD by back-propagating gradients with abnormal magnitude and direction. Any deviation from D_IDD is considered abnormal, and therefore a distance metric can be utilized to measure the novelty confidence. Sun et al. (2022) propose to use the Mahalanobis distance as the novelty confidence score:

M_x = (∇_c f_M(x) − μ_c)^T Σ_c^{-1} (∇_c f_M(x) − μ_c)

To collect the gradient ∇_c f_M(x), they further utilize two types of labels for back-propagation, c_M^predicted and c_M^custom:

c_M^predicted = argmax_{c ∈ X_IDD} Softmax(f_M(X; Θ))
c_M^custom = argmin_{c ∈ X_IDD} Softmax(f_M(X; Θ))

Measured by the softmax output of the trained model M, c_M^predicted and c_M^custom refer to the most and least probable class the input belongs to, respectively, leading to mild and aggressive back-propagated gradients. Since the goal is to detect the abnormal gradients of the OOD, using c_M^custom for the OOD maximizes the weight update to the model, so that the gradients are even more likely to fall outside D_IDD. On the other hand, using c_M^predicted for the IDDs helps their gradients stay within D_IDD, which reduces the possibility of false alarms. To guide the label selection, they introduce a binary classifier into the system to predict the samples.
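The novelty score and label-selection rules above can be sketched in NumPy. This is only a minimal sketch under the assumption that the class statistics (μ_c, Σ_c) and the flattened gradient vectors are already available; none of these function names come from the paper's code.

```python
import numpy as np

def novelty_score(grad, mu_c, sigma_c):
    """Mahalanobis distance of a gradient vector from class c's
    gradient distribution: (g - mu_c)^T Sigma_c^{-1} (g - mu_c)."""
    d = grad - mu_c
    return float(d @ np.linalg.inv(sigma_c) @ d)

def select_label(softmax_out, treat_as_idd):
    """Label selection for back-propagation: argmax (c_M^predicted, mild
    gradients) for predicted IDDs, argmin (c_M^custom, aggressive
    gradients) for predicted OODs."""
    if treat_as_idd:
        return int(np.argmax(softmax_out))
    return int(np.argmin(softmax_out))
```

A sample whose gradient lands far from every class distribution receives a large score for all classes, which is what the thresholding in Algorithm 2 exploits.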
Based on the prediction, the primary ND engine M collects the gradients with either c_M^predicted or c_M^custom to measure the Mahalanobis distance towards D_IDD.

To make the above algorithm compatible with the unified ND model, we set up the following scenario for explanation. Assume the system has just finished the detection and learning of X_OOD^i and is in the status S_{i+1} = {X_IDD^{i+1}, X_OOD^{i+1}}. The latest binary classifier B_i, which can distinguish X_OOD^i from X_IDD^i, is placed upstream of the previously merged model M_i to form a new model M_{i+1} = {M_i + B_i}. At this point, M_{i+1} becomes the new baseline to detect X_OOD^{i+1}. It is a sequential structure composed of i + 1 binary classifiers, i.e., one for each learned OOD, and one original neural network M_0, which predicts N classes and is placed at the end of the chain. With these notations, we introduce our solution as follows.

Gradient Collection: To make the merged model M_{i+1} = {M_i + B_i} capable of detecting the upcoming X_OOD^{i+1}, we first need to conduct the D_IDD^{i+1} estimation on X_IDD^{i+1} = {X_IDD^i, X_OOD^i} from both M_i and B_i. Only then can we start evaluating how far X_OOD^{i+1} deviates from the distributions in D_IDD^{i+1}. Given an input x, we propose to collect the gradients ∇_c f_{B_i}(x) and ∇_c f_{M_i}(x) separately from B_i and M_i, and then concatenate them: ∇_c f_{B_i+M_i}(x) = ∇_c f_{B_i}(x) ∥ ∇_c f_{M_i}(x). The overall gradient dimension becomes ∇f_{M_{i+1}} = ∇f_{M_0} ∥ (∇f_{B_0} ∥ ... ∥ ∇f_{B_i}), where ∇f_{M_0} is the gradient collected from the original neural network and ∇f_{B_0...i} are the gradients from each binary classifier in the chain.

Class Selection: Similar to c_M^predicted and c_M^custom for ∇_c f_M(x), here we introduce c_{B_i}^predicted and c_{B_i}^custom to control the gradient ∇_c f_{B_i}(x). Since there are only two possible predictions, IDD (label 0) and OOD (label 1), we simply take the original prediction as c_{B_i}^predicted and the flipped result as c_{B_i}^custom:

c_{B_i}^predicted = f_{B_i}(X; Θ) ;  c_{B_i}^custom = 1 − f_{B_i}(X; Θ)

Ideally, we prefer to use the label c_{B_i}^predicted for all samples X ∈ {X_IDD^i, X_OOD^i}, so that the gradient ∇_c f_{B_i}(x) is minimized. For X_OOD^{i+1}, using c_{B_i}^custom makes their gradients stand out even more and thus easier to detect.
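The binary label flip and the gradient concatenation across the chain reduce to a few lines. The sketch below assumes the per-module gradients have already been flattened into NumPy vectors; the function names are illustrative, not from the paper.

```python
import numpy as np

def binary_labels(pred):
    """Label pair for a binary classifier B_i: the raw prediction is
    c_Bi^predicted, and the flipped bit (0 <-> 1) is c_Bi^custom."""
    return pred, 1 - pred

def concat_gradients(grad_m0, grads_b):
    """Overall gradient vector for the merged model M_{i+1}:
    grad(M_0) || grad(B_0) || ... || grad(B_i)."""
    return np.concatenate([grad_m0] + list(grads_b))
```

Note that the gradient dimension grows by the (small) gradient size of each new binary classifier, which is why D_IDD must be re-estimated after every merge (Section 3.3).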

Algorithm 3 Gradient-based Evaluation of IDD Distribution

1: function IDD_estimator(X_IDD^mem, M_i)
2:   D_IDD ← [ ]
3:   for each class memory X_IDD^mem,c in X_IDD^mem do
4:     ∇_c f_{M_i}(x_k) = ∇_c f_{M_0}(x_k) ∥ ... ∥ ∇_c f_{B_i}(x_k)
5:     μ_c ← (1/N_c) Σ_{y_k = c} ∇_c f_{M_i}(x_k)
6:     Σ_c ← (1/N_c) Σ_{y_k = c} (∇_c f_{M_i}(x_k) − μ_c)(∇_c f_{M_i}(x_k) − μ_c)^T, where (x_k, y_k) ∈ X_IDD^mem
7:     D_IDD.append((μ_c, Σ_c))
8:   end for
9:   return D_IDD
10: end function

Memory for D_IDD Evaluation: The extra dimension from ∇f_{B_i} requires the pre-estimated distributions in D_IDD to be re-evaluated. For each class in X_IDD^{i+1}, we assign a memory of constant size to re-evaluate D_IDD after the expansion of the gradient dimension. In addition, the newly learned X_OOD^i becomes a new IDD class from the perspective of M_{i+1}, so D_IDD needs to include D_IDD^i. Before we use M_{i+1} to detect the next OOD, D_IDD should include the distribution estimation of N + i + 1 classes, in the dimension of ∇f_{M_{i+1}}. Furthermore, a small memory for each class helps improve the ND accuracy. Previous methods rely only on the predicted IDD and OOD from the novelty detector engine to train the binary classifier, with limited training accuracy and convergence speed when the prediction is not accurate enough. Instead, we propose to directly use the pre-stored IDD samples as the IDD training dataset. For the OOD, we still use the predicted OOD from the ND routine. Such a solution makes the training of the binary classifier more stable by reducing mislabeled OODs. After the system converges on the current OOD, the predicted OOD samples, selected by the novelty detector with the assistance of the binary classifier, are stored in the memory as the representation of this newly learned class; they are later used for the D_IDD evaluation and for future training of the next binary classifier. Except for the first N classes, which have a pre-labeled dataset, all the samples of newly arriving OODs are self-selected by the engine itself.
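The per-class statistics in Algorithm 3 are an empirical mean and a (biased, 1/N_c) covariance over each class's gradient vectors. A minimal NumPy sketch, assuming the concatenated gradients of the memory samples are stacked in a matrix:

```python
import numpy as np

def estimate_idd_distribution(grads, labels):
    """Class-wise Gaussian fit of gradient vectors (Algorithm 3 sketch).

    grads:  (n, d) array, one concatenated gradient per memory sample
    labels: (n,) array of class ids
    Returns {class_id: (mu_c, sigma_c)}.
    """
    d_idd = {}
    for c in np.unique(labels):
        g_c = grads[labels == c]
        mu = g_c.mean(axis=0)                 # step 5 of Algorithm 3
        diff = g_c - mu
        sigma = diff.T @ diff / len(g_c)      # step 6: biased 1/N_c covariance
        d_idd[int(c)] = (mu, sigma)
    return d_idd
```

In practice the covariance may need regularization (e.g., adding a small multiple of the identity) before inversion in the Mahalanobis score, since few memory samples in a high-dimensional gradient space can make Σ_c singular.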

3.4. BATCH-MODE TRAINING AND INFERENCE

The success of the sequential classification, through a sequence of binary classifiers, relies on the high inference accuracy of each binary classifier. Even with very minor accuracy degradation in each of them, the accumulated accuracy drops exponentially. To achieve this goal, we propose to consider not only the sample itself but also its context. If a testing sample lies within a cluster of other samples that all come from the same class, its prediction will be biased towards that class as well. This inspires the idea of batch-mode training and inference: based on the features, we first cluster the samples of one class into a batch before feeding them to the engine. During the training phase, we introduce the new loss function:

L(X) = (1/N) Σ_{i=0}^{N−1} L(f(X_i), i)

where each X_i refers to a single batch containing samples of class i only. Different from the traditional training method, where the feed-forward operation is conducted on a single batch containing randomly selected samples from all N classes, we send N individual one-class batches to the classifier at once and calculate their average loss. Due to the nature of the BatchNorm layer in the neural network, we find that the batch averages of the classes become better separated, and thus the class boundary is easier to learn. To prove the effectiveness of batch-mode training, we evaluate the feature distributions of IDD and OOD using both the traditional and the batch-mode method on a binary classifier with the CIFAR-10 dataset, where the IDD contains five classes and the OOD contains one class. The batch size is 32 samples. As shown in Fig. 2(a)(b), the result of batch-mode training is less intertwined, promising a higher chance to detect the OOD. We exploit this property to improve the accuracy of each binary classifier.
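The batch-mode loss above is simply the average of per-batch losses over N pure-class batches. A framework-agnostic sketch (the `model` and `loss_fn` callables are hypothetical stand-ins for the binary classifier's forward pass and per-batch loss):

```python
def batch_mode_loss(per_class_batches, model, loss_fn):
    """Batch-mode training loss L(X) = (1/N) * sum_i L(f(X_i), i) (sketch).

    per_class_batches: list where entry i is a batch drawn only from class i
    model:   maps a batch to its per-sample outputs (forward pass)
    loss_fn: (outputs, class_id) -> scalar loss for that single-class batch
    """
    losses = [loss_fn(model(batch), i)
              for i, batch in enumerate(per_class_batches)]
    return sum(losses) / len(losses)
```

The key design choice is that each forward pass sees a single-class batch, so BatchNorm statistics are computed per class, which is what pushes the class-wise batch averages apart.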
One challenge in this approach is that only the X_IDD classes are available for creating pure batches; the OOD samples are mixed with IDDs in the incoming stream and are not pre-labeled. Therefore, a pre-filtering operation is necessary to detect and prepare the OOD from the data stream. We propose to first divide the input stream into small consecutive batches and use a purity metric to localize the batches with the highest OOD percentage. In reality, the assumption of IDD/OOD batches is quite feasible. For instance, in a video or audio stream, once OOD data appear, there will be multiple continuous samples of the OOD, rather than a single glitching sample. Regarding the purity metric, we propose to compare the mean of the features of the testing batch with the pre-estimated features of each class in X_IDD. More specifically, assuming the system is in status S_i = {X_IDD^i, X_OOD^i}, there are in total N + i classes in X_IDD^i. Therefore, we expect the next input stream to contain a mixture of X_OOD^i batches and N + i kinds of batches from previous classes. We use M_0 to filter out the batches containing the first N IDD classes, by comparing the L2 distance between the batch features ψ_M0(X_test) and the X_IDD^0 features ψ_M0(X_IDD^0), both extracted from M_0. Since M_0 is trained on X_IDD^0, if the X_test batch is dominated by samples from X_IDD^0, the L2 distance between them should not be large. On the other hand, if the batch is mixed with samples from another class, the L2 distance increases, which helps us set a threshold to separate them. The selected batches are then sent to the module upstream of M_0, which is B_0, for another round of filtering to remove the batches dominated by X_OOD^0. This procedure continues along the sequential structure from bottom to top until all batches containing the N + i known classes are filtered out. The remaining batches then contain only the new OOD X_OOD^i.
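The purity test itself, comparing an incoming batch's mean feature against the stored reference features for one class, can be sketched as follows. This is an illustrative sketch: `batch_feats` and `ref_feats` stand in for the features ψ_m(X_B) and ψ_m(X_IDD^ref) extracted by one module of the chain, and the threshold value is a hypothetical tuning parameter.

```python
import numpy as np

def batch_is_pure(batch_feats, ref_feats, threshold):
    """Decide whether a test batch is dominated by the reference class,
    by the L2 distance between the batch-mean feature and the
    reference-mean feature (sketch of the purity metric).

    batch_feats: (b, d) features of the incoming batch
    ref_feats:   (m, d) features of the stored memory for one class
    """
    dist = np.linalg.norm(batch_feats.mean(axis=0) - ref_feats.mean(axis=0))
    return dist < threshold
```

Averaging over the batch is what makes the metric stable; the ablation in Section 5 shows it degrades as the batch size shrinks.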
Algorithm 4 Unsupervised Estimation of Batch Purity
1: function batchPurityEstimator(X_mix, X_IDD^mem, M)
2:   Divide the X_mix sequence into batches of size 32: [X_B0, X_B1, ..., X_Bt]
3:   X_mix^pure ← [ ]
4:   for each X_B ∈ [X_B0, X_B1, ..., X_Bt] do
5:     for each m along the chain [M_0, B_0, ..., B_i] do
6:       if m is M_0 then
7:         X_IDD^ref ← select X_IDD^0 from X_IDD^mem
8:       else if m is B_k, where k ∈ [0, i], then
9:         X_IDD^ref ← select X_OOD^k from X_IDD^mem
10:      end if
11:      L2 ← d(ψ_m(X_B), ψ_m(X_IDD^ref))
12:      if L2 < threshold_IDD then
13:        X_mix^pure.append(X_B)
14:      else if L2 > threshold_OOD then
     (... remaining steps not recoverable from the source ...)

Figure 3: L2 distance-based batch purity estimation using M_0 and B_0-3 on IDD: {X_IDD^0, X_OOD^0, ..., X_OOD^3} and OOD: X_OOD^4. All data are collected using the CIFAR-10 dataset.

Fig. 3 illustrates the process of separating the pure batches from a CIFAR-10 input stream that consists of five types of IDDs (X_IDD^0, X_OOD^0, ..., X_OOD^3) and one OOD (X_OOD^4), using M_0 and B_0-3. The gray and white areas correspond to batches with 100% purity and mixed data, respectively. Starting from M_0 (the red curve), the X_IDD^0 pure batches are collected by comparing the L2 distance against the low threshold. The batches above the threshold are sent to B_0 to find the X_OOD^0 pure batches. This process continues until the stream data reach B_3 and all IDD and OOD pure batches are successfully separated.

4. EXPERIMENTS

To prove the efficacy of our proposed method, we conduct several one-class ND and CL experiments using MNIST Deng (2012), CIFAR-10, CIFAR-100 Krizhevsky & Hinton (2009), and Tiny-ImageNet Le & Yang (2015). All experiments are implemented in PyTorch Paszke et al. (2019) on an NVIDIA GeForce RTX 2080 platform.

4.1. EXPERIMENTAL SETUP

Input Sequence and Memory Budget: Different from the multi-class-based setup, our method is evaluated after every exposure to a new OOD class. For every new X_OOD^i, we mix it with the same amount of randomly selected samples from X_IDD^i. This mixed input stream is then sent to the system for ND and CL. The input stream consists of batches from the N + i classes in X_IDD^i and from the current class X_OOD^i. Each batch has a size of 32 frames in all experiments. We also create transition phases to mimic the input change from one class to another. Each transition phase lasts three batches, with mixture ratios between the previous class and the next class of 1/4, 1/2, and 3/4. This transition setup is used in the batch purity evaluation. For each dataset, the sizes of the X_OOD^i and X_IDD^i sample sets are shown in the first row of Table 1. The X_OOD^i samples are selected from the training dataset, while the X_IDD^i samples are selected from the testing dataset. The reason for using the testing rather than the training set for X_IDD^i is that the previously trained binary classifiers have already seen the training data of X_IDD^i, and we need to avoid any unfair evaluation in the current iteration. The second row of Table 1 presents the memory budget for X_IDD^0. Network Structure and Training: For fair comparison with previous methods, we select the structure of M_0 as shown in Table 1, Row 3. The binary classifier has three convolution layers, one BatchNorm layer, and a Sigmoid classifier. This structure is used in all experiments. For M_0 training, standard stochastic gradient descent is used, with momentum 0.9 and weight decay 0.0005. The number of epochs is listed in Table 1, Row 4. The initial learning rate is set to 0.1 and is divided by 10 after reaching the 50% and 75% milestones.
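The M_0 learning-rate schedule above (base 0.1, divided by 10 at the 50% and 75% epoch milestones) can be expressed as a small helper; this is a sketch of the schedule as described, not the paper's training code.

```python
def m0_learning_rate(epoch, total_epochs, base_lr=0.1):
    """Step learning-rate schedule used for M_0 training (sketch):
    base LR 0.1, divided by 10 at 50% of the epochs and again at 75%."""
    lr = base_lr
    if epoch >= total_epochs * 0.5:
        lr /= 10
    if epoch >= total_epochs * 0.75:
        lr /= 10
    return lr
```

In PyTorch the same effect is typically achieved with `torch.optim.lr_scheduler.MultiStepLR` and milestones at half and three-quarters of the epoch budget.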
For the binary classifier, we train for 100 epochs with the Adam optimizer Kingma & Ba (2014), where the initial learning rate is set to 0.0002 and the decay rate is controlled by β1 = 0.5 and β2 = 0.999. To estimate the novelty in the gradient space, we collect the gradients from the last convolution layer of M_0 and from the second-to-last convolution layer of each binary classifier.
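The input-stream construction described in this section (32-frame batches, with three transition batches mixing consecutive classes at ratios 1/4, 1/2, and 3/4) can be sketched as below. One plausible reading of the ratios is assumed: the three transition batches contain 1/4, 1/2, and 3/4 samples of the next class, with the remainder drawn from the previous class.

```python
import random

def transition_batches(prev_samples, next_samples, batch_size=32):
    """Build the three transition batches between two classes (sketch).

    Each batch mixes the previous and next class; the next-class share
    grows through 1/4, 1/2, 3/4 across the three batches (assumed
    interpretation of the paper's mixture ratios).
    """
    batches = []
    for next_ratio in (0.25, 0.5, 0.75):
        n_next = int(batch_size * next_ratio)
        batch = (random.sample(prev_samples, batch_size - n_next)
                 + random.sample(next_samples, n_next))
        random.shuffle(batch)  # avoid ordering artifacts within a batch
        batches.append(batch)
    return batches
```

These mixed batches are exactly the "white area" cases in the batch purity evaluation, which the purity estimator is expected to reject.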

4.2. IN-DEPTH ANALYSIS

All experiments are conducted in an unsupervised manner, which means none of the new OOD classes are manually labeled; the system relies purely on the predictions of the novelty detector engine. Compared with traditional CL algorithms, which learn multiple classes in one shot, our method learns one class at a time and takes more steps to achieve the same learning goal. To test how many classes can be learned before the accuracy starts to drop, we evaluate our algorithm on CIFAR-100 by training a baseline model with 10 classes and then continually feeding 20 new classes to the system, one class after another. The inset of Figure 4(a) shows the accuracy curve from the actual testing results (red points) plus an extrapolation (dashed line). This curve shows that our method is able to learn stably over many steps; the accuracy eventually drops as the errors of the binary classifiers accumulate through the sequential process. Therefore, we design the further experiments as follows. For CIFAR-100 and Tiny-ImageNet, we divide the datasets into 10 tasks, where each task contains 10 and 20 classes, respectively. After the baseline training with the first task, we test the algorithm's performance by feeding the next task as the incoming OOD. Once all the classes of the new task are learned one by one, we terminate the current iteration and retrain a new baseline model using all previous tasks. This new baseline is then used for learning the next available task. This process continues until all tasks have been tested. For each experiment on CIFAR-10 and MNIST, we train two baseline models using the first two classes and the first five classes, to mimic the 5-task (5T) and 2-task (2T) setups used by other methods. We then feed the remaining classes to the baseline model to test the performance. As shown in Fig. 4, the single-head accuracy of all the experiments stays at a high value after consecutively detecting and learning new classes from each checkpoint, which proves that our batch-mode method successfully boosts the performance of the binary classifier. For CIFAR-10 and MNIST, our method significantly improves upon the state of the art even though it is unsupervised. For Tiny-ImageNet, the accuracy is less stable due to the increased complexity of the dataset, but the performance is still robust overall after learning 20 new classes.

5. ABLATION STUDY

We conduct four ablation studies to test how the performance is influenced by various input-stream patterns, using the CIFAR-10 and CIFAR-100 datasets. First, we conduct two experiments by feeding the input with 75%/25% and 90%/10% IDD/OOD mixtures. As shown in Figs. 5(a)(b), the single-head accuracy of both experiments is worse than in the previous experiments using a 50%/50% IDD/OOD mixture. This is because fewer OOD samples increase the difficulty of OOD separation, especially for the unsupervised training of the binary classifier. Second, we test our model with two smaller batch sizes for batch purity estimation. Figs. 5(c)(d) present the performance after dividing the input stream into smaller batches of size 16 and 8. The smaller the batch size, the worse the performance our model achieves. With fewer samples in each batch, the average feature estimated from the trained model becomes more diverse, which exacerbates the error in the purity estimation engine.

6. CONCLUSION

In this paper, we propose a unified framework for one-class novelty detection and continual learning, using a sequence of binary classifiers with the batch-mode technique. We demonstrate that our method successfully detects and learns consecutive OOD classes in an unsupervised setup, achieving stable single-head accuracy without triggering catastrophic forgetting. For instance, our method reaches 97.08% accuracy on CIFAR-10 in continual learning, better than the state of the art. The performance on all other datasets is also among the best, without the need for manually labeled training data. The success of this approach promises high stability, high learning accuracy, and practical usability in various dynamic systems.







Figure 2: The feature distribution of IDD and OOD using (a) traditional training and (b) batch-mode training.

Figure 4: Single-head accuracy of one-class novelty detection and continual learning using (a) CIFAR-100, (b) Tiny-ImageNet, (c) CIFAR-10, and (d) MNIST.


Figure 5: Ablation studies on CIFAR-10 and CIFAR-100 with (a)(b) multiple ratios of IDD/OOD mixture, and (c)(d) various batch sizes.


