SELF-SUPERVISED CONTINUAL LEARNING BASED ON BATCH-MODE NOVELTY DETECTION Anonymous

Abstract

Continual learning (CL) plays a key role in dynamic systems that must adapt to new tasks while preserving previous knowledge. Most existing CL approaches learn new knowledge in a supervised manner, leaving the data gathering phase to a novelty detection (ND) algorithm. Such a presumption limits practical usage, where new data needs to be learned quickly without being labeled. In this paper, we propose a unified approach to CL and ND, in which each new class of out-of-distribution (OOD) data is first detected and then added to previous knowledge. Our method has three unique features: (1) a unified framework that seamlessly tackles both the ND and CL problems; (2) a self-supervised method for model adaptation, without the requirement of annotating the new data; (3) batch-mode data feeding that maximizes the separation between new knowledge and previous learning, which in turn enables high accuracy in continual learning. By learning one class at each step, the new method achieves robust continual learning and consistently outperforms state-of-the-art CL methods in the single-head evaluation on the MNIST, CIFAR-10, CIFAR-100 and TinyImageNet datasets.

1. INTRODUCTION

Machine learning methods have been widely deployed in dynamic applications, such as drones, self-driving vehicles, and surveillance. Their success is built upon carefully handcrafted deep neural networks (DNNs), big data collection, and expensive model training. However, due to unforeseeable circumstances in the environment, these systems will inevitably encounter input samples that fall out of the distribution (OOD) of their original training data, leading to instability and performance degradation. Such a scenario has inspired two research branches: (1) novelty detection (ND), or one-class classification, and (2) continual learning (CL), or life-long learning. The former aims to enable the system to detect the arrival of OOD data. The latter studies how to continually learn the new data distribution while preventing catastrophic forgetting Goodfellow et al. (2013) of prior knowledge. While there exists a strong connection between these two branches, current practices to solve them head in quite different directions: ND methods usually output all detected OOD samples as a single class, while CL methods package multiple classes into a single task for learning. Such a dramatic difference in problem setup prevents researchers from forming a unified algorithm spanning ND to CL, which is necessary for dynamic systems in reality. One particular challenge of multi-class learning in CL is the difficulty of data-driven (i.e., self-supervised or unsupervised) separation of new and old classes, due to their overlap in the feature space. Without labeled data, the model may struggle to find a global optimum that successfully separates the distributions of all classes. Consequently, the model either ends up with the notorious issue of catastrophic forgetting, or fails to learn the new data.
To overcome this challenge, previous methods either introduce constraints on model adaptation in order to protect prior knowledge when learning a new task Aljundi et al. (2018; 2017); Li & Hoiem (2018); Kirkpatrick et al. (2017); Rebuffi et al. (2017), or expand the network structure to increase the model capacity for new knowledge Rusu et al. (2016); Yoon et al. (2017). However, the methods with constraints may not succeed when the knowledge distribution of a new task is far from the prior knowledge distribution. On the other hand, the methods with a dynamic network may introduce too much overhead as the amount of new knowledge keeps increasing. In this context, the aims of this work are: (1) connecting ND and CL into one method for dynamic applications, (2) completing ND and CL without the annotation of OOD data, and (3) improving the robustness and accuracy of CL in the single-head evaluation.

Figure 1: The framework of our unified method for both novelty detection and one-class continual learning. The entire process is self-supervised, without the need for labels of new data.

We propose a self-supervised approach for one-class novelty detection and continual learning, with the following contributions:

• A unified framework that continually detects the arrival of OODs, extracts and learns the OOD features, and merges the OOD into the knowledge base of previous IDDs. More specifically, we train a tiny binary classifier for each new OOD class as the feature extractor. The binary classifier and the pre-trained IDD model are sequentially connected to form an "N + 1" classifier, where "N" represents the prior knowledge of N classes and "1" refers to the newly arrived OOD class. This CL process continues as "N + 1 + 1 + 1 ..." (i.e., one-class CL), as demonstrated in this work.

• A batch-mode training and inference method that fully utilizes the context of the input and maximizes the feature separation between the OOD and previous IDDs, without using data labels. This method achieves high accuracy in OOD detection and prediction in scenarios where IDDs and OODs stream into the system, such as videos and audios.
• Comprehensive evaluation on multiple benchmarks, including MNIST, CIFAR-10, CIFAR-100, and TinyImageNet. Our proposed method consistently achieves robust and high single-head accuracy after learning a sequence of new OOD classes one by one.
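The "N + 1" composition and the batch-mode inference described above can be illustrated with a minimal sketch. The function names, the score-averaging rule, and the fixed threshold are our own assumptions for illustration, not the paper's exact routing mechanism:

```python
import numpy as np

def n_plus_1_predict(base_logits, ood_score, threshold=0.5):
    # Hypothetical routing rule: if the binary OOD head fires, assign the
    # sample to the new class (index N); otherwise fall back to the
    # pre-trained N-class prediction.
    if ood_score > threshold:
        return len(base_logits)          # index N = the newly learned class
    return int(np.argmax(base_logits))   # one of the N known classes

def batch_mode_predict(batch_logits, batch_ood_scores, threshold=0.5):
    # Batch-mode variant: average the detector scores over the batch, so a
    # homogeneous stream (e.g., consecutive video frames) votes as a group
    # instead of sample-by-sample.
    mean_score = float(np.mean(batch_ood_scores))
    return [n_plus_1_predict(logits, mean_score, threshold)
            for logits in batch_logits]
```

Averaging over the batch is one simple way to exploit the streaming context: a few ambiguous frames cannot flip the decision when the batch as a whole is clearly in- or out-of-distribution.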

2. BACKGROUND

Most continual learning methods belong to the supervised type, where the input tasks are well-labeled. To mitigate catastrophic forgetting, three directions have been studied in the community. These methods are especially useful when the knowledge base of previous tasks overlaps with that of the new task. Although these three supervised approaches improve the performance of continual learning, we argue that the capability to automatically detect the task shift and learn the new knowledge (i.e., self-supervised or unsupervised) is the solution most preferred by a realistic system, since an expensive annotation process for each new task is impractical in the field.




(1) Regularization methods Zeng et al. (2019); Aljundi et al. (2018); Li & Hoiem (2018); Kirkpatrick et al. (2017); Rebuffi et al. (2017); Zenke et al. (2017); Ahn et al. (2019), which aim to penalize weight shifting towards the new task. This is realized by introducing a new loss constraint that protects the weights most important to previous tasks. Many metrics for measuring the importance of weights have been proposed, such as the Fisher information matrix, distillation loss, and the training trajectory. (2) Rehearsal-based methods Lopez-Paz & Ranzato (2017); Chaudhry et al. (2018); Rolnick et al. (2019); Aljundi et al. (2019); Cha et al. (2021), which maintain a small buffer to store samples from previous tasks. To prevent the drift of prior knowledge, these samples are replayed during the training routine on the new task. Some rehearsal-based methods are combined with regularization methods to improve their performance. (3) Expansion-based methods Rusu et al. (2016); Yoon et al. (2017); Schwarz et al. (2018); Li et al. (2019); Hung et al. (2019), which aim to protect previous knowledge by progressively adding new network branches for the new task.
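As a concrete instance of the regularization family, the Fisher-information penalty of elastic weight consolidation (Kirkpatrick et al., 2017) can be sketched as follows. This is a simplified illustration with hypothetical function names, not code from any of the cited works:

```python
import numpy as np

def ewc_penalty(weights, old_weights, fisher, lam=1.0):
    # Weights with large Fisher values (important to previous tasks) are
    # anchored to their old values; unimportant weights can move freely.
    return 0.5 * lam * float(np.sum(fisher * (weights - old_weights) ** 2))

def total_loss(new_task_loss, weights, old_weights, fisher, lam=1.0):
    # Loss on the new task plus the penalty protecting prior knowledge.
    return new_task_loss + ewc_penalty(weights, old_weights, fisher, lam)
```

The hyperparameter `lam` trades off plasticity on the new task against stability of the old ones; rehearsal and expansion methods address the same trade-off with replayed samples and added capacity, respectively.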

