A DISTINCT UNSUPERVISED REFERENCE MODEL FROM THE ENVIRONMENT HELPS CONTINUAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Existing continual learning methods focus mainly on fully supervised scenarios and are still unable to take advantage of the unlabeled data available in the environment. Some recent works have investigated semi-supervised continual learning (SSCL) settings in which unlabeled data are available, but only from the same distribution as the labeled data. This assumption is still not general enough for real-world applications and restricts the utilization of unsupervised data. In this work, we introduce Open-Set Semi-Supervised Continual Learning (OSSCL), a more realistic semi-supervised continual learning setting in which out-of-distribution (OoD) unlabeled samples in the environment are assumed to coexist with the in-distribution ones. Under this configuration, we present a model with two distinct parts: (i) the reference network captures general-purpose and task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, and (ii) the learner network learns task-specific representations by exploiting supervised samples. The reference model both provides a pivotal representation space and segregates the unlabeled data so that they can be exploited more efficiently. Through a diverse range of experiments, we show the superior performance of our model compared with other competitors and demonstrate the effectiveness of each component of the proposed model.

1. INTRODUCTION

In a real-world continual learning (CL) problem, the agent has to learn from a non-i.i.d. stream of samples under severe restrictions on storing data. In this case, the agent is prone to catastrophic forgetting during training (French, 1999). Existing CL methods focus mainly on supervised scenarios and can be categorized into three main approaches (Parisi et al., 2019): (i) Replay-based methods reuse samples from previous tasks, either by keeping raw samples in a limited memory buffer (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019) or by generating pseudo-samples from previous classes (Shin et al., 2017; Wu et al., 2018; van de Ven et al., 2020). (ii) Regularization-based methods aim to maintain the stability of the network across tasks by penalizing deviation from previously learned representations or parameters (Nguyen et al., 2018; Cha et al., 2021; Rebuffi et al., 2017; Li & Hoiem, 2016). (iii) Parameter-isolation methods dedicate distinct parameters to each task by introducing new task-specific weights or masks (Rusu et al., 2016; Yoon et al., 2018; Wortsman et al., 2020).

Humans, as intelligent agents, are constantly in contact with vast amounts of unsupervised data endlessly streamed in the environment, which can be used to facilitate concept learning in the brain (Zhuang et al., 2021; Bi & Poo, 1998; Hinton & Sejnowski, 1999). With this in mind, an important but less explored issue in many practical CL applications is how to effectively utilize a vast stream of unlabeled data along with limited labeled samples. Recently, efforts have been made in this direction, leading to the investigation of three different configurations: Wang et al. (2021) introduced a very restricted scenario for semi-supervised continual learning in which the unsupervised data come only from the classes being learned at the current time step. On the other hand, Lee et al.
(2019) introduced a configuration that is "more similar to self-taught learning rather than semi-supervised learning". In fact, they introduced a setting in which the model is exposed to plenty of labeled samples, which is a necessary assumption for their model to achieve good performance; in addition, their model has access to a large corpus of unsupervised data from an environment that typically does not include samples related to the current CL problem. Adopting this idea, Smith et al. (2021) proposed a more realistic setting by assuming a limit on the number of supervised samples available for training. In addition, they assumed the existence of a shared hidden hierarchy between the supervised and unsupervised samples, which is not necessarily true in practical applications.

In this work, we first propose a general scenario that unifies the mentioned configurations into a more realistic setting called Open-Set Semi-Supervised Continual Learning (OSSCL). In this scenario, the agent can observe unsupervised data from two sources: (i) related unsupervised data, which are sampled from the same distribution as the supervised dataset, and (ii) unrelated unsupervised data, which have a different distribution from the classes of the current CL problem. The in-distribution unsupervised samples can come from classes that are currently being solved, have been solved at previous time steps, or will be solved in the future.

Previous CL works in which unlabeled data were available alongside labeled data mainly utilized the unlabeled data by creating pseudo-labels for them with a model trained on the labeled samples (Lee et al., 2019; Smith et al., 2021; Wang et al., 2021). These unlabeled data, with their pseudo-labels, were used directly in the training procedure. However, because labeled data are scarce in realistic scenarios, the pseudo-labeling process is inaccurate and creates highly noisy labels.
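The pseudo-labeling pipeline described above can be sketched as follows. This is a minimal illustration, not the exact procedure of any of the cited works: the model, the confidence threshold, and the filtering rule are illustrative assumptions. When the classifier is trained on only a few labels, it is poorly calibrated, so even the retained pseudo-labels can be highly noisy.

```python
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_x, threshold=0.95):
    """Assign pseudo-labels to unlabeled samples with a classifier trained
    on the (scarce) labeled data, keeping only confident predictions.
    Returns the retained samples and their pseudo-labels."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=1)
        conf, pseudo_y = probs.max(dim=1)   # top-1 confidence and class
    mask = conf >= threshold                # discard low-confidence samples
    return unlabeled_x[mask], pseudo_y[mask]
```

The retained pairs would then be mixed into the supervised training set; the noise problem arises because the threshold filters by the model's own (miscalibrated) confidence.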
Therefore, we present a novel method for learning in the OSSCL setting that alleviates the mentioned problem and utilizes unlabeled data effectively. Our proposed model, which consists of an Unsupervised Reference network and a Supervised Learner network (URSL), can effectively absorb information by combining contrastive learning techniques with knowledge distillation in the representation space. While the reference network is mainly responsible for learning general knowledge from unlabeled data, the learner network is expected to capture task-specific information from a few supervised samples using a contrastive loss function. In addition, the learner retains a close connection to the reference network to utilize the essential related information provided by unsupervised samples. At the same time, the representation space learned by the reference network can be used to build an out-of-distribution detector that segregates the unlabeled data so that the filtered samples can be employed more effectively in the training procedure of the learner model. In short, our main contributions are as follows:

• We propose OSSCL as a realistic semi-supervised continual learning scenario that an intelligent agent encounters in practical applications (Section 2).
• We propose a novel dual-structured model that is suitable for learning in the mentioned scenario and can effectively exploit unlabeled samples (Section 3).
• We show the superiority of our method on several benchmarks and different combinations of unlabeled samples. Our model achieves state-of-the-art accuracy with a notable gap compared to the baselines and previous methods (Section 4).
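One learner update under this dual-network design can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names, the supervised-contrastive form of the task loss, and the cosine-distance form of the distillation term are all illustrative. The learner is trained on a supervised contrastive loss over the few labeled samples, plus a term that keeps its representations close to those of the frozen reference network on (filtered) unlabeled samples.

```python
import torch
import torch.nn.functional as F

def learner_step(learner, reference, opt, x_sup, y_sup, x_unsup,
                 temperature=0.1, distill_weight=1.0):
    """One illustrative update: supervised contrastive loss on labeled
    data + feature distillation toward the frozen reference network."""
    opt.zero_grad()

    # Supervised contrastive-style loss: samples sharing a label are positives.
    z = F.normalize(learner(x_sup), dim=1)              # (N, d) embeddings
    sim = z @ z.t() / temperature                       # pairwise similarities
    mask = (y_sup[:, None] == y_sup[None, :]).float()
    mask.fill_diagonal_(0)                              # self-pairs are not positives
    logits = sim - 1e9 * torch.eye(len(z))              # exclude self from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask.sum(1).clamp(min=1)
    con_loss = -(mask * log_prob).sum(1).div(pos_count).mean()

    # Distillation: pull learner features toward the frozen reference
    # on unlabeled samples (cosine distance between normalized features).
    with torch.no_grad():
        ref_feat = F.normalize(reference(x_unsup), dim=1)
    stu_feat = F.normalize(learner(x_unsup), dim=1)
    distill_loss = (1 - (stu_feat * ref_feat).sum(1)).mean()

    loss = con_loss + distill_weight * distill_loss
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch the reference network is never updated by the learner's gradients (it is queried under `torch.no_grad()`), reflecting its role as a stable, task-agnostic representation space.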

2. PRELIMINARIES

In this work, we consider the training dataset to consist of two parts. The supervised dataset $D_{\text{sup}}$ is a sequence of $T$ tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T\}$. At time step $t$, the model only has access to $\mathcal{T}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$, where $x_i \overset{\text{i.i.d.}}{\sim} P(X \mid y_i)$ denotes a training sample and $y_i$ its corresponding label. We consider $K$ separate classes at each task and follow the common class-incremental setting, as it is shown to be the most challenging scenario for evaluation. Given a training loss $\ell$ and the network parameters $\theta$, the training objective at time step $t$ is defined as $\theta^* = \arg\min_\theta \frac{1}{N_t} \sum_{i=1}^{N_t} \ell(x_i, y_i, \theta)$. On the other hand, the unsupervised dataset $D_{\text{unsup}}$ is a sequence of $T$ sets $\{U_1, U_2, \ldots, U_T\}$ containing only unlabeled data points. We assume that $U_t$ represents the unsupervised data available in the environment at time step $t$, which is accessible by the model along with $\mathcal{T}_t$. Based on the OSSCL setting, which is a general framework, we assume that the unsupervised dataset is composed of two parts: (i) The related part, also called the in-distribution set, consists of unsupervised samples generated from the same distribution as $D_{\text{sup}}$. To maintain generality, we assume that this set contains not only unsupervised samples related to the current supervised task, but also samples from the other tasks of the CL problem that have either been observed in previous time steps or will be observed in the future. (ii) The unrelated data points, also called the out-of-distribution samples, are unsupervised data sampled from a distribution $Q$, which is not necessarily the distribution from

