A DISTINCT UNSUPERVISED REFERENCE MODEL FROM THE ENVIRONMENT HELPS CONTINUAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Existing continual learning methods focus mainly on fully supervised scenarios and cannot yet take advantage of the unlabeled data available in the environment. Some recent works have investigated semi-supervised continual learning (SSCL) settings in which unlabeled data are available, but only from the same distribution as the labeled data. This assumption is not general enough for real-world applications and restricts the use of unsupervised data. In this work, we introduce Open-Set Semi-Supervised Continual Learning (OSSCL), a more realistic semi-supervised continual learning setting in which out-of-distribution (OoD) unlabeled samples in the environment are assumed to coexist with the in-distribution ones. Under this configuration, we present a model with two distinct parts: (i) the reference network captures general-purpose, task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, and (ii) the learner network is designed to learn task-specific representations by exploiting supervised samples. The reference model both provides a pivotal representation space and segregates unlabeled data so that they can be exploited more efficiently. Through a diverse range of experiments, we show the superior performance of our model compared with other competitors and demonstrate the effectiveness of each component of the proposed model.

1. INTRODUCTION

In a real-world continual learning (CL) problem, the agent has to learn from a non-i.i.d. stream of samples under serious restrictions on storing data. In this case, the agent is prone to catastrophic forgetting during training (French, 1999). Existing CL methods focus mainly on supervised scenarios and can be categorized into three main approaches (Parisi et al., 2019): (i) Replay-based methods reuse samples from previous tasks, either by keeping raw samples in a limited memory buffer (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019) or by generating pseudo-samples from previous classes (Shin et al., 2017; Wu et al., 2018; van de Ven et al., 2020). (ii) Regularization-based methods aim to maintain the stability of the network across tasks by penalizing deviation from the previously learned representations or parameters (Nguyen et al., 2018; Cha et al., 2021; Rebuffi et al., 2017; Li & Hoiem, 2016). (iii) Methods based on parameter isolation dedicate distinct parameters to each task by introducing new task-specific weights or masks (Rusu et al., 2016; Yoon et al., 2018; Wortsman et al., 2020).

Humans, as intelligent agents, are constantly exposed to large amounts of unsupervised data streamed endlessly in the environment, which can be used to facilitate concept learning in the brain (Zhuang et al., 2021; Bi & Poo, 1998; Hinton & Sejnowski, 1999). With this in mind, an important but less explored issue in many practical CL applications is how to effectively utilize a vast stream of unlabeled data along with limited labeled samples. Recent efforts in this direction have led to the investigation of three different configurations. Wang et al. (2021) introduced a very restricted scenario for semi-supervised continual learning in which the unsupervised data come only from the classes being learned at the current time step. On the other hand, Lee et al.
(2019) introduced a configuration that is "more similar to self-taught learning rather than semi-supervised learning". In fact, they introduced a setting in which the model is exposed to plenty of labeled samples, which is a necessary assumption for their model to achieve good performance; in addition, their model has access to a large corpus of unsupervised data in an environment that typically does not include samples related to the current CL problem. Adopting this idea, Smith et al. (2021) proposed a more realistic setting by limiting the number of supervised samples available for training. In addition, they assumed the existence of a shared hidden hierarchy between the supervised and unsupervised samples, which does not necessarily hold in practical applications.

In this work, we first propose a general scenario that unifies the mentioned configurations into a more realistic setting called Open-Set Semi-Supervised Continual Learning (OSSCL). In this scenario, the agent can observe unsupervised data from two sources: (i) related unsupervised data, which are sampled from the same distribution as the supervised dataset, and (ii) unrelated unsupervised data, which have a different distribution from the classes of the current CL problem. The in-distribution unsupervised samples can come from the classes that are being solved, have been solved at previous time steps, or are going to be solved in the future.

Previous CL works in which unlabeled data were available alongside labeled data mainly utilized the unlabeled data by creating pseudo-labels for them with a model trained on the labeled samples (Lee et al., 2019; Smith et al., 2021; Wang et al., 2021). The unlabeled data and their pseudo-labels were then used directly in the training procedure. However, because labeled data are scarce in realistic scenarios, the pseudo-labeling process is inaccurate and produces highly noisy labels.
Therefore, we present a novel method for learning in the OSSCL setting that alleviates this problem and utilizes unlabeled data effectively. Our proposed model, which consists of an Unsupervised Reference network and a Supervised Learner network (URSL), can effectively absorb information by leveraging contrastive learning techniques combined with knowledge distillation methods in the representation space. While the reference network is mainly responsible for learning general knowledge from unlabeled data, the learner network is expected to capture task-specific information from a few supervised samples using a contrastive loss function. In addition, the learner retains a close connection to the reference network in order to utilize the essential related information provided by unsupervised samples. At the same time, the representation space learned by the reference network can be used to build an out-of-distribution detector that segregates unlabeled data, so that the filtered samples can be employed more appropriately in the training procedure of the learner model.

In short, our main contributions are as follows:
• We propose OSSCL as a realistic semi-supervised continual learning scenario that an intelligent agent encounters in practical applications (Section 2).
• We propose a novel dual-structured model that is suitable for learning in the mentioned scenario and can effectively exploit unlabeled samples (Section 3).
• We show the superiority of our method on several benchmarks and with different combinations of unlabeled samples. Our model achieves state-of-the-art accuracy with a notable gap compared to the baselines and previous methods (Section 4).

2. PRELIMINARIES

In this work, we consider the training dataset to consist of two parts. The supervised dataset $\mathcal{D}_{sup}$ is a sequence of $T$ tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T\}$. At time step $t$, the model only has access to $\mathcal{T}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$, where $x_i \overset{\text{i.i.d.}}{\sim} P(X \mid y_i)$ denotes a training sample and $y_i$ represents its corresponding label. We consider $K$ separate classes in each task and follow the common class-incremental setting, as it has been shown to be the most challenging scenario for evaluation. Given a training loss $\ell$ and the network parameters $\theta$, the training objective at time step $t$ is defined as

$$\theta^* = \arg\min_{\theta} \frac{1}{N_t} \sum_{i=1}^{N_t} \ell(x_i, y_i, \theta).$$

On the other hand, the unsupervised dataset $\mathcal{D}_{unsup}$ is a sequence of $T$ sets $\{\mathcal{U}_1, \mathcal{U}_2, \ldots, \mathcal{U}_T\}$ containing only unlabeled data points. We assume that $\mathcal{U}_t$ represents the unsupervised data available in the environment at time step $t$, which is accessible to the model along with $\mathcal{T}_t$. Under the OSSCL setting, which is a general framework, we assume that the unsupervised dataset is composed of two parts: (i) The related part, also called the in-distribution set, consists of unsupervised samples generated from the same distribution as $\mathcal{D}_{sup}$. To maintain generality, we assume that this set contains unsupervised samples related not only to the current supervised task but also to the other tasks of the CL problem, which have either been observed at previous time steps or will be observed in the future. (ii) The unrelated data points, also called the out-of-distribution samples, are a set of unsupervised data sampled from a distribution $Q$, which is not necessarily the distribution from which the supervised samples have been generated.

Figure 1: A schematic of the method and configuration. The unsupervised reference ($UR_t$), supervised learner ($SL_t$), labeled data ($\mathcal{T}_t$), and related and unrelated unlabeled data ($\mathcal{U}_t$) at time step $t$ are shown on the left, while the OoD segregation module is shown on the right.
In the next section, we will propose a novel method to perform in this configuration, and in Section 4, a variety of experiments are provided to show the effectiveness of our model.
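As an illustration of the data layout described in this section, the following is a minimal sketch of how a class-incremental supervised stream and a mixed unlabeled pool could be assembled. The function names, the per-class labeled fraction `fraction`, and the `("main", i)` / `("peripheral", i)` source tags are hypothetical and exist only for inspection; in the OSSCL setting the model would not see the source of an unlabeled sample.

```python
import random


def make_class_incremental_tasks(labels, classes_per_task, seed=0):
    """Group class ids into consecutive tasks of `classes_per_task` classes (K)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    rng.shuffle(classes)
    return [classes[i:i + classes_per_task]
            for i in range(0, len(classes), classes_per_task)]


def split_supervised(labels, fraction, seed=0):
    """Keep `fraction` of each class as labeled; the rest become related unlabeled data."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    sup, related = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_sup = int(round(fraction * len(idxs)))
        sup += idxs[:n_sup]
        related += idxs[n_sup:]
    return sup, related


def sample_unlabeled_pool(related_idx, unrelated_idx, n_each, seed=0):
    """Build U_t by mixing related and unrelated samples and shuffling them together."""
    rng = random.Random(seed)
    pool = [("main", i) for i in rng.sample(related_idx, n_each)]
    pool += [("peripheral", i) for i in rng.sample(unrelated_idx, n_each)]
    rng.shuffle(pool)
    return pool


# Toy usage: 10 classes with 100 samples each, 2 classes per task.
labels = [c for c in range(10) for _ in range(100)]
tasks = make_class_incremental_tasks(labels, classes_per_task=2)
sup, related = split_supervised(labels, fraction=0.1)
# T_t for the first task: labeled indices whose class belongs to that task.
task_0 = [i for i in sup if labels[i] in tasks[0]]
pool = sample_unlabeled_pool(related, list(range(500)), n_each=50)
```

Note that the related pool is drawn from all classes of the main dataset, not only those of the current task, which matches the assumption that in-distribution unlabeled samples may belong to past, current, or future classes.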

3. METHOD

Learning continually from $\mathcal{D}_{sup}$ has been widely explored by the community. Meanwhile, unlike deep models, humans are less hungry for supervised data. Although they observe a large volume of data during their lifetime, only a small portion of this data is labeled. It is believed that the considerable human ability to learn from a few instances is due to the rich representations learned from large volumes of unsupervised observations (Zhuang et al., 2021; Bi & Poo, 1998; Hinton & Sejnowski, 1999). Here, we aim to explore the benefits of using $\mathcal{D}_{unsup}$ and its impact on empowering the continual learner. Specifically, we will show how $\mathcal{D}_{unsup}$ promotes representation learning in addition to providing positive forward/backward transfer in the continual learning process.

We propose our URSL model, which consists of two parts: 1) the general task-agnostic reference network, which is responsible for absorbing information from unsupervised data in the environment, and 2) the learner network, which is designed to capture knowledge from a few supervised samples while also being guided by the reference network. The notations $UR_t$ and $SL_t$ denote the reference and learner network instances at time step $t$, respectively (refer to Figure 1 and Algorithm 1 for an overview).

We employ a contrastive representation learning approach for training both the reference and the learner networks. This approach has been shown to be effective for supervised CL problems. Indeed, some previous works in CL claimed that classifier heads placed on top of the representation network are a serious source of catastrophic forgetting (Ramasesh et al., 2021; Banayeeanzade et al., 2021; Cha et al., 2021); therefore, Co2L (Cha et al., 2021) presented a supervised contrastive loss to avoid this problem.
We utilize contrastive representation learning as a unified approach for training both the reference and the learner networks, which allows information to flow easily between these networks. Combined with knowledge distillation techniques applied in the representation space, this approach provides a convenient tool for making the most of unsupervised samples. Our model is also equipped with an exemplar memory $\mathcal{M}$ that randomly stores a portion of the supervised samples from previous tasks (Lopez-Paz & Ranzato, 2017; Rebuffi et al., 2017). The stored samples contribute to the training of the learner network. After the final time step, these samples are also used to train a classifier head on top of the representation space of the learner network. It is noteworthy that our model does not store unlabeled data in its own memory, since such data are always abundant in the environment; this removes the need for a large memory.

3.1. REFERENCE NETWORK

The unsupervised reference network $UR_t : \mathcal{X} \to \mathbb{R}^d$ is a general-purpose feature extractor responsible for encoding all kinds of unsupervised information available in the environment. The network is composed of an encoder $f$ and a projector $g$, and embeds an input $x$ in the representation space by $z = (f \circ g)_{\theta_t}(x)$, where $z$ lies on the unit $d$-dimensional Euclidean sphere and $\theta_t$ represents the model parameters at time step $t$. Considering a batch $B \subseteq \mathcal{U}_t$ of size $N$, the SimCLR (Chen et al., 2020a) loss function used for training the network can be written as:

$$h_{i,j} = -\log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}, \qquad \mathcal{L}_{unsup}(\theta_t; \tau) = \frac{1}{2N} \sum_{k=1}^{N} \left( h_{2k-1,2k} + h_{2k,2k-1} \right), \tag{1}$$

where $z_{2i}$ and $z_{2i-1}$ are the representations of two different augmentations of the same image $x_i \in B$ and $\tau$ is the temperature hyperparameter.
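The SimCLR objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the embeddings are passed as a `(2N, d)` array in which rows `2k` and `2k+1` (0-based) hold the two augmented views of image `k`, and the default temperature `tau=0.5` is an arbitrary placeholder.

```python
import numpy as np


def simclr_loss(z, tau=0.5):
    """NT-Xent loss over 2N embeddings; rows (2k, 2k+1) are two views of image k."""
    # Project onto the unit sphere so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n2 = z.shape[0]
    logits = z @ z.T / tau
    # The indicator 1[k != i] removes each sample's similarity with itself.
    np.fill_diagonal(logits, -np.inf)
    # Numerically stable row-wise log-softmax.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # Index of the positive partner of every row: 0<->1, 2<->3, ...
    partner = np.arange(n2) ^ 1
    h = -log_prob[np.arange(n2), partner]
    # Averaging all 2N terms equals (1 / 2N) * sum_k (h_{2k-1,2k} + h_{2k,2k-1}).
    return h.mean()


# Two images, two views each; views of the same image coincide.
z_aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# Same vectors, but the pairing now puts different images next to each other.
z_mismatched = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_aligned = simclr_loss(z_aligned)
loss_mismatched = simclr_loss(z_mismatched)
```

In this toy case the aligned pairing yields a lower loss than the mismatched one, which is the behavior the objective is meant to encourage.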

3.2. SEGREGATING UNSUPERVISED SAMPLES

In this section, we show how to segregate unlabeled samples by employing the reference network and supervised samples. Although unsupervised samples can play an important role both in learning the representation space and in controlling changes in this space over time, naive approaches to incorporating these samples into the training of the learner network can lead to inferior performance due to the existence of unrelated samples among the unlabeled ones. Therefore, we first explain the OoD detection method, which is designed to segregate unlabeled data and incorporate them more appropriately in the continual learning process of the learner network.

To efficiently segregate unsupervised data, we employ a prototype-based OoD detection method (Park et al., 2021) in the representation space of the reference network using the samples in $\mathcal{T}_t \cup \mathcal{M}$. It is noteworthy that the representation space of the reference network is chosen for OoD detection since it provides better sample discrimination than any representation space obtained by training on a small number of labeled samples. Additionally, in contrast to previous works (Chen et al., 2020b; Huang et al., 2021; Saito et al., 2021), this approach eliminates the need to train another network specialized in OoD detection. At time step $t$, our OoD method creates $P^t = \{P^t_1, P^t_2, \ldots, P^t_{K \times t}\}$, a set of $K \times t$ prototypes representing the centroids of the classes observed so far, which are extracted using the labeled data available in $\mathcal{T}_t \cup \mathcal{M}$:

$$P^t_i = \psi\left( \frac{1}{|A|} \sum_{(x_j, y_j) \in \mathcal{T}_t \cup \mathcal{M}} \frac{\mathbb{1}_{[y_j = i]}}{\sum_{(x_k, y_k) \in \mathcal{T}_t \cup \mathcal{M}} \mathbb{1}_{[y_k = i]}} \sum_{a \in A} (f \circ g)_{\theta_t}(a(x_j)) \right), \tag{2}$$

where $A$ is a set of augmentations meant to form different views of a real image, and $\psi$ is the operator that projects vectors onto the unit $d$-dimensional sphere. We also define the score operator $S(P^t, z) = \max_i c(P^t_i, z)$, where $c$ denotes the cosine similarity measure.
This operator takes the prototypes together with a sample in the representation space and calculates the score of its most probable assignment. With this in mind, we consider $S^t_l$ as the scores of the labeled data obtained by passing $\mathcal{T}_t \cup \mathcal{M}$ through the $S(P^t, \cdot)$ operator, i.e., $S^t_l = \{S(P^t, (f \circ g)_{\theta_t}(x)) \mid x \in \mathcal{T}_t \cup \mathcal{M}\}$. By considering $\eta_{id}$ as a hyperparameter, we define a threshold $\tau_{id} = \mathrm{mean}(S^t_l) + \eta_{id}\, \mathrm{var}(S^t_l)$ on the scores of unlabeled data to specify the in-distribution samples as $\hat{\mathcal{U}}_t = \{x \mid x \in \mathcal{U}_t,\ S(P^t, x) > \tau_{id}\}$. Furthermore, we assign pseudo-labels to the unsupervised samples on which we have higher confidence by defining a higher threshold $\tau_{pl} = \mathrm{mean}(S^t_l) + \eta_{pl}\, \mathrm{var}(S^t_l)$, with the hyperparameter $\eta_{pl}$, and prepare the pseudo-labeled samples as $\hat{\mathcal{T}}_t = \{(x, \hat{y}) \mid x \in \mathcal{U}_t,\ S(P^t, x) > \tau_{pl},\ \hat{y} = \arg\max_i c(P^t_i, x)\}$. In other words, an unsupervised sample whose similarity to a class prototype is higher than $\tau_{pl}$ is pseudo-labeled with that class. However, to reduce pseudo-labeling noise, we do not utilize pseudo-labels directly during the training procedure. The pseudo-labels are used only to identify whether an unlabeled sample comes from past classes or not. Samples of $\hat{\mathcal{T}}_t$ are mainly used to compensate for the small number of supervised samples in the memory, as further explained in the next section. We provide a detailed investigation of the performance of the OoD module in Appendix C.
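The segregation procedure of this section can be sketched as follows. This is a simplified illustration under stated assumptions: embeddings are taken to be precomputed by the reference network (with any augmented views already included as extra rows), class ids are assumed to be `0..num_classes-1` with at least one labeled sample each, and the values of `eta_id` and `eta_pl` in the usage example are arbitrary and were chosen only to make the two thresholds visibly different. A pseudo-label of `-1` marks samples that do not pass the higher threshold.

```python
import numpy as np


def l2_normalize(v):
    """The projection psi onto the unit sphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def class_prototypes(z_labeled, y_labeled, num_classes):
    """One prototype per class: the normalized mean of that class's embeddings."""
    z_labeled = l2_normalize(z_labeled)
    means = [z_labeled[y_labeled == c].mean(axis=0) for c in range(num_classes)]
    return l2_normalize(np.stack(means))


def segregate(z_labeled, y_labeled, z_unlabeled, num_classes, eta_id, eta_pl):
    """Return an in-distribution mask and pseudo-labels (-1 = not confident)."""
    protos = class_prototypes(z_labeled, y_labeled, num_classes)
    # Scores of the labeled data: cosine similarity to the closest prototype.
    s_l = (l2_normalize(z_labeled) @ protos.T).max(axis=1)
    sims_u = l2_normalize(z_unlabeled) @ protos.T
    s_u = sims_u.max(axis=1)
    # Thresholds: mean + eta * variance of the labeled scores.
    tau_id = s_l.mean() + eta_id * s_l.var()
    tau_pl = s_l.mean() + eta_pl * s_l.var()
    in_dist = s_u > tau_id
    pseudo = np.where(s_u > tau_pl, sims_u.argmax(axis=1), -1)
    return in_dist, pseudo


# Toy 2-D example with two classes clustered around the two axes.
z_l = np.array([[1, 0.0], [1, 0.5], [1, -0.5], [0.0, 1], [0.5, 1], [-0.5, 1]])
y_l = np.array([0, 0, 0, 1, 1, 1])
z_u = np.array([[1, 0.05], [1, 0.4], [1, 1.0], [0.05, 1]])
in_dist, pseudo = segregate(z_l, y_l, z_u, num_classes=2, eta_id=-20.0, eta_pl=20.0)
```

In the toy example, the second unlabeled point is close enough to be kept as in-distribution but not close enough to receive a pseudo-label, while the third point lies between the two classes and is rejected.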

3.3. LEARNER NETWORK

Similar to the reference network, the learner network $SL_t : \mathcal{X} \to \mathbb{R}^d$ is a feature extractor of the form $z = (f \circ g)_{\phi_t}(x)$, where $\phi_t$ denotes the model parameters at time step $t$. The learner network is trained using three mechanisms:

Supervised Training: Following Co2L (Cha et al., 2021), we use an asymmetric supervised version of the contrastive loss function to train the learner network. By considering a supervised batch $B = \{(x_i, y_i)\}_{i=1}^{N}$, which is sampled from $\mathcal{T}_t \cup \mathcal{M} \cup \hat{\mathcal{T}}_t$, and applying an augmentation policy to form two different views of the real samples, we can write the supervised contrastive loss as follows:

$$\mathcal{L}_{sup}(\phi_t; \tau) = \frac{1}{N} \sum_{i=1}^{N} \frac{-\mathbb{1}_{[y_i \in O_t]}}{|\zeta_i|} \sum_{j \in \zeta_i} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}, \tag{3}$$

where $O_t$ is the set of new classes of the current time step $t$, and $\zeta_i$ denotes the other samples of the current batch with the same label $y_i$. The existence of $\hat{\mathcal{T}}_t$ is crucial for learning a proper representation, since only a small amount of labeled data is available during continual learning. In fact, Co2L intends to prevent overfitting to the small number of past-task samples stored in the memory by proposing the asymmetric supervised contrastive loss, which utilizes samples from the memory only as negative samples (Cha et al., 2021). However, when the labeled data are limited, even employing the past samples in $\mathcal{M}$ as negative samples may still cause overfitting. Therefore, we enrich $\mathcal{M}$ with $\hat{\mathcal{T}}_t$ to diversify the samples from previous classes.

Knowledge Transfer Through Time: The loss function in Eq. 3 allows the model to discriminate between new and previous classes. However, it is not sufficient to maintain the discrimination power of the learner network among previous tasks. Therefore, to avoid catastrophic forgetting, at each time step $t$ we use an instance-wise relation distillation (IRD) loss to transfer knowledge from the previous time step to the current model (Cha et al., 2021).
This self-distillation technique, which is also compatible with the contrastive representation learning approach, retains the old knowledge by maintaining the samples' similarities in the representation space of the learner network. To this end, we first sample a batch $B$ from $\mathcal{T}_t \cup \mathcal{M} \cup \hat{\mathcal{T}}_t$, augment each sample $x_i$ twice to create $x_{2i-1}$ and $x_{2i}$, and then calculate the instance-wise similarity vector as:

$$p(x_i; \phi, \tau) = [p_{i,0}, \ldots, p_{i,i-1}, p_{i,i+1}, \ldots, p_{i,2N}], \quad \text{where} \quad p_{i,j} = \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}. \tag{4}$$

By computing these probabilities for both $SL_t$ and $SL_{t-1}$, we can write the time distillation loss as:

$$\mathcal{L}_{TD}(\phi_t; \phi_{t-1}, \tau', \tau'') = \sum_{i=1}^{2N} -p(x_i; \phi_{t-1}, \tau') \cdot \log p(x_i; \phi_t, \tau''), \tag{5}$$

where $\tau'$ and $\tau''$ represent the distillation-specific temperatures for the previous model and the current model, respectively.

Knowledge Transfer from Reference: The reference network encounters numerous unsupervised samples throughout its training and is expected to learn a rich representation space using the objective introduced in Section 3.1. This representation is used as guidance for the learner network, and the knowledge can be transferred to the learner network using an IRD loss similar to Eq. 5:

$$\mathcal{L}_{KD}(\phi_t; \theta_t, \tau', \tau'') = \sum_{i=1}^{2N} -p(x_i; \theta_t, \tau') \cdot \log p(x_i; \phi_t, \tau''). \tag{6}$$

This distillation is applied to the learner network based on the samples in $\mathcal{T}_t \cup \mathcal{M} \cup \hat{\mathcal{U}}_t$. It is noteworthy that, rather than using all of the unsupervised samples in $\mathcal{U}_t$, this distillation only uses the unsupervised samples in $\hat{\mathcal{U}}_t$, which appear to be related to the training of the learner network.
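The asymmetric supervised loss and the two distillation losses of this section share the same softmax over pairwise similarities, which the following NumPy sketch makes explicit. It is an illustration under assumptions rather than the authors' code: embeddings are assumed to be unit-norm with one row per augmented view, the normalization in `asym_supcon_loss` uses the number of rows passed in, and all default temperatures are placeholders. The same `ird_loss` function covers both distillation terms; only the teacher embeddings change (the previous learner for time distillation, the reference network for reference distillation).

```python
import numpy as np


def _log_softmax_offdiag(z, tau):
    """Row-wise log-softmax of pairwise similarities, excluding self-pairs."""
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)  # the 1[k != i] indicator
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))


def asym_supcon_loss(z, y, new_classes, tau=0.5):
    """Supervised contrastive loss in which only new-class samples act as anchors."""
    log_p = _log_softmax_offdiag(z, tau)
    # zeta_i: other rows of the batch that share the label of row i.
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    n_pos = same.sum(axis=1)
    # Old-class samples (e.g. from memory) contribute only as negatives.
    is_anchor = np.isin(y, list(new_classes)) & (n_pos > 0)
    per_anchor = -np.where(same, log_p, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return (per_anchor * is_anchor).sum() / len(y)


def ird_loss(z_teacher, z_student, tau_teacher=0.1, tau_student=0.2):
    """Cross-entropy between teacher and student instance-wise similarity vectors."""
    p_teacher = np.exp(_log_softmax_offdiag(z_teacher, tau_teacher))
    log_p_student = _log_softmax_offdiag(z_student, tau_student)
    # Self-pairs are not part of the similarity vector, so they are skipped.
    off = ~np.eye(len(z_teacher), dtype=bool)
    return -(p_teacher[off] * log_p_student[off]).sum()


# Toy batch: class 0 is an old class, class 1 is new at this time step.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y = np.array([0, 0, 1, 1])
loss_sup = asym_supcon_loss(z, y, new_classes={1})
```

When the teacher and student use the same temperature, `ird_loss` is smallest when the student reproduces the teacher's pairwise similarity structure, which is the sense in which the loss preserves relations between samples.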

3.4. THE URSL ALGORITHM

In summary, the model receives two sets of samples at each time step: $\mathcal{T}_t$ and $\mathcal{U}_t$. The reference network is trained on $\mathcal{U}_t$ using the self-supervised loss function introduced in Eq. 1. Then, the OoD detection and pseudo-labeling techniques introduced in Section 3.2 are used to segregate the unsupervised samples in $\mathcal{U}_t$. Finally, the learner network is trained on a weighted aggregation of the three loss functions introduced in Section 3.3, with $\gamma$ and $\lambda$ as hyperparameters:

$$\mathcal{L}_s(\phi_t) = \mathcal{L}_{sup}(\phi_t; \tau) + \gamma \mathcal{L}_{TD}(\phi_t; \phi_{t-1}, \tau', \tau'') + \lambda \mathcal{L}_{KD}(\phi_t; \theta_t, \tau', \tau''). \tag{7}$$

Algorithm 1 (excerpt: training the learner network at time step $t$)

    while not done do
        Sample a batch $B$ from $\mathcal{T}_t \cup \mathcal{M} \cup \hat{\mathcal{T}}_t$
        Compute $\mathcal{L}_s \leftarrow \mathcal{L}_{sup}(\phi_t; \tau)$ based on $B$ (Eq. 3)
        if $t > 1$ then
            Update $\mathcal{L}_s \leftarrow \mathcal{L}_s + \gamma \mathcal{L}_{TD}(\phi_t; \phi_{t-1}, \tau', \tau'')$ based on $B$ (Eq. 5)
        Update $\mathcal{L}_s \leftarrow \mathcal{L}_s + \lambda \mathcal{L}_{KD}(\phi_t; \theta_t, \tau', \tau'')$ based on a batch from $\mathcal{T}_t \cup \mathcal{M} \cup \hat{\mathcal{U}}_t$ (Eq. 6)
        Update $\phi_t \leftarrow \phi_t - \alpha \nabla_{\phi} \mathcal{L}_s$
    Update $\mathcal{M}$ such that the number of samples for each class is the same.
    Train the classifier head using $\mathcal{T}_T \cup \mathcal{M}$
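The memory update step of the algorithm only states that every class should end up with the same number of stored samples. One simple way to satisfy this is sketched below; the function name `update_memory`, the fixed `capacity` argument, and the random per-class subsampling are assumptions made for illustration and are not specified in the text.

```python
import random
from collections import defaultdict


def update_memory(memory, new_samples, capacity, seed=0):
    """Return a class-balanced list of at most `capacity` (x, y) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in list(memory) + list(new_samples):
        by_class[y].append((x, y))
    if not by_class:
        return []
    # Every class observed so far gets the same share of the budget.
    per_class = capacity // len(by_class)
    balanced = []
    for y in sorted(by_class):
        items = by_class[y]
        rng.shuffle(items)
        balanced.extend(items[:per_class])
    return balanced


# Memory holds 10 samples each of classes 0 and 1; a new task adds classes 2 and 3.
old_memory = [(i, 0) for i in range(10)] + [(i, 1) for i in range(10)]
new_task = [(i, 2) for i in range(50)] + [(i, 3) for i in range(50)]
memory = update_memory(old_memory, new_task, capacity=20)
counts = {c: sum(1 for _, y in memory if y == c) for c in range(4)}
```

With a budget of 20 samples and four observed classes, each class keeps five samples, so the share of the old classes shrinks as new classes arrive.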

4. EXPERIMENTS

Benchmark Scenario: To demonstrate the effectiveness of our method, we perform several experiments in this section. We use two datasets for each experiment: the main dataset and the peripheral dataset. A small portion of the main dataset, determined by the ratio $P$, is selected as supervised data, the rest is considered as related unsupervised data, and all samples of the peripheral dataset are considered as (probably) unrelated unlabeled data. At each time step, 9000 examples from each unsupervised dataset are randomly sampled, shuffled together, and fed into the model as unsupervised data. In Appendix F, we provide the results of experiments in which $\mathcal{U}_t$ contains more than two datasets and the environment is even more realistic. The hyperparameters of our model do not depend on the experiment configuration, and a general and consistent solution for all conditions is provided. We conducted a wide range of experiments to demonstrate the model's robustness in various scenarios. In our experiments, we used the CIFAR10, CIFAR100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) datasets as the main or peripheral datasets, which are commonly used in the open-set semi-supervised learning literature (Chen et al., 2020b; Huang et al., 2021; Yu et al., 2020); moreover, the setting of our experiments is known as the "cross dataset" setting in that literature (Chen et al., 2020b). We use the ResNet-18 architecture as the backbone of both networks, with a two-layer MLP on top as the projector. The input images for the model are 32 × 32 pixels in size. Additionally, we use the notation $|\mathcal{M}|$ for the size of the supervised memory introduced in Section 3. Further experimental setups and details are provided in Appendix B.

Baselines: Co2L (Cha et al., 2021) can be seen as a simplified version of URSL in which there is no reference network and no means of using unsupervised samples.
Therefore, we propose a modified version of Co2L, Co2L-j, in which the model is trained jointly with a supervised and an unsupervised contrastive loss on the supervised and unsupervised data, respectively. In another baseline, Co2L-p, we only pre-train the model with the unsupervised data available at the first time step and ignore the unsupervised data at subsequent steps to avoid possible conflict with the supervised loss during continual learning. There are also two baselines from prior work that are consistent with the OSSCL setting due to the presence of an OoD detection module. GD (Lee et al., 2019) trained an OoD module to recognize unlabeled data from previous classes among the entire unlabeled dataset. This in-distribution data was only used to combat catastrophic forgetting. DM (Smith et al., 2021) mainly modified the GD setting by defining policies over the unlabeled data based on the superclasses of CIFAR100, and used the FixMatch method (Sohn et al., 2020). In addition, we report the results of fully supervised continual learning for two popular continual learning models, GEM and iCaRL, and for the state-of-the-art Co2L. These methods have access to all samples of the related dataset as labeled samples during continual learning but cannot use unlabeled samples from any source.

4.1. RESULTS

Tables 1, 2, and 3 show the classification accuracy at the final time step when the main dataset is CIFAR10, CIFAR100, and Tiny-ImageNet, respectively. In almost all the experiments, URSL outperforms all other baselines. There are two reasons for the superiority of URSL over GD and DM: (i) Unlike GD and DM, which train OoD detection with a small number of labeled samples, the OoD detection of URSL is based on the representation of the reference network, which is trained with a large amount of unlabeled data and has high discrimination power. (ii) GD only uses the unlabeled data to combat forgetting, while URSL also uses them to transfer a rich representation from the reference network to the learner network. Although Co2L-p and Co2L-j improve on Co2L, URSL outperforms them in all scenarios, showing the effectiveness of the proposed ideas compared with naive approaches to incorporating unlabeled data. Furthermore, URSL achieves comparable or even better results than state-of-the-art fully supervised CL methods. This suggests that URSL can benefit from unsupervised samples to mitigate the forgetting of previous classes or to learn a general representation that is suitable for learning the classes that will be observed in the future.

We provide multiple benchmarks in Appendix D to show the robustness and power of our method. For instance, the After and Before scenarios, in which the unlabeled related samples are restricted to the future and past classes of the main dataset, respectively, show that our method has positive forward and backward transfer. In a Non-I.I.D. scenario, we examined our method in an environment in which only a fraction of the classes of the main dataset are present in $\mathcal{U}_t$ at each time step. Additionally, Appendix E indicates that our method can achieve remarkable performance even when the ratio of related to unrelated unsupervised samples is very low.

4.2. ABLATION STUDIES

In this section, we conduct experiments to demonstrate the contribution of the different components of the model to the final performance. To that end, we select CIFAR100 as the main dataset, CIFAR10 as the peripheral dataset, $P = 0.05$, and $|\mathcal{M}| = 500$.

Effect of $\hat{\mathcal{U}}_t$: Eq. 6 is designed to transfer the rich representation of the reference network to the learner network. The results show that naively adding all of $\mathcal{U}_t$ to this loss, without segregation, transfers irrelevant knowledge to the learner network. This unrelated knowledge prevents the model from learning a discriminative representation for the target tasks. Segregating $\mathcal{U}_t$ boosts the performance of Variant 1 and Variant 3 by 3.8% and 1.2%, respectively.

5. CONCLUSION

In this paper, we present OSSCL, a novel setting for continual learning that is more realistic than previously studied settings. The setting assumes that the agent has access to a large amount of unsupervised data in the environment, some of which is relevant to the tasks due to the similarity between the surroundings and the tasks. As a possible solution for this setting, we presented a novel model, consisting of a supervised learner and an unsupervised reference network, that effectively utilizes both supervised and unsupervised samples. The learner network benefits from three loss functions: the supervised loss function, which is formed based on limited supervised samples and segregated unsupervised samples; knowledge distillation through time; and representational guidance from the reference network. URSL outperforms other state-of-the-art continual learning models by a considerable margin. The experiments and ablation studies demonstrate the superiority of the model and the effectiveness of each of its components.

(Shin et al., 2017; Wu et al., 2018; van de Ven et al., 2020). Regularization-based methods aim to keep the network's parameters stable across tasks by penalizing deviation from the parameters that are important for the previously learned tasks (Nguyen et al., 2018; Lee et al., 2017; Zenke et al., 2017; Cha et al., 2021; Rebuffi et al., 2017). Methods based on parameter isolation dedicate different parameters to each task by introducing new task-specific weights or masks (Rusu et al., 2016; Yoon et al., 2018; Wortsman et al., 2020). Methods based on parameter isolation suffer either from extensive resource usage or from capacity shortage when the number of tasks is large. Regularization-based methods are promising when the number of tasks is small; however, as the number of tasks increases, they become more prone to catastrophic forgetting and failure. In contrast, replay-based methods have shown promising results in general continual learning settings.
This work can be categorized as a replay-based method.

Self-supervised Learning: Self-supervised learning methods learn a representation from unlabeled data such that the learned representation conveys meaningful semantic or structural information. Based on this, various ideas, such as distortion (Alexey et al., 2015; Gidaris et al., 2018), jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), and generative modeling (Vincent et al., 2008), have been investigated. Meanwhile, contrastive learning has played a significant role in recent developments of self-supervised representation learning. Contrastive learning involves learning an embedding space in which samples (e.g., crops) from the same instance (e.g., an image) are pulled together, and samples from different instances are pushed apart. Early work in this field incorporated some form of instance-level classification with contrastive learning and was successful in some cases. The results of recent methods such as SimCLR (Chen et al., 2020a), SwAV (Caron et al., 2020), and BYOL (Grill et al., 2020) are comparable to those produced by state-of-the-art supervised methods.

Knowledge Distillation: Knowledge distillation aims to transfer knowledge from a teacher model to a student model without losing too much generalization power (Hinton et al., 2015). The idea was adapted to continual learning to alleviate catastrophic forgetting by keeping the network's responses to samples from the old tasks unchanged while updating it with new training samples (Shmelkov et al., 2017; Rebuffi et al., 2017; Li & Hoiem, 2016). iCaRL (Rebuffi et al., 2017) applies a distillation loss to maintain the probability vector of the last model's outputs while learning new tasks, whereas UCIR (Hou et al., 2019) maximizes the cosine similarity between the embedded features of the last model and the current model.
Co2L (Cha et al., 2021) proposed a novel instance-wise relation distillation loss for continual learning that maintains the relations between the features of batch samples in the representation space.

Semi-supervised Learning: In practical scenarios, the amount of labeled data is limited; therefore, training models using such limited labeled data leads to low performance. For this reason, semi-supervised learning methods try to utilize unlabeled data alongside the labeled data to achieve better performance. There are three main categories of semi-supervised training methods: generative, consistency regularization, and pseudo-labeling methods. A generative method can learn implicit and (Springenberg, 2015; Dumoulin et al., 2016; Li et al., 2017). Consistency regularization describes a category of methods in which the model's prediction should not change significantly if a realistic perturbation is applied to the unlabeled data samples (Rasmus et al., 2015; Laine & Aila, 2016; Tarvainen & Valpola, 2017). In pseudo-labeling, a model trained on the labeled set is utilized to provide pseudo-labels for a portion of the unlabeled data in order to produce additional training examples that can be used as labeled samples in the training set (Lee et al., 2013; Xie et al., 2020b; Pham et al., 2021). UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) are two examples of recent influential works in semi-supervised learning. UDA (Xie et al., 2020a) employs data augmentation methods as perturbations for consistency training and encourages consistency between predictions on the original and augmented unsupervised samples. In the FixMatch method (Sohn et al., 2020), consistency regularization and pseudo-labeling are combined, and the cross-entropy loss is used to calculate both the supervised and unsupervised losses.

Open-set Semi-supervised Learning: Most semi-supervised learning methods assume that labeled and unlabeled data share the same label space.
Nevertheless, in the Open-set Semi-supervised Learning setting, unlabeled data can contain categories that are not present in the labeled data, i.e., outliers, which can adversely affect the performance of SSL algorithms. In UASD (Chen et al., 2020b), soft targets are produced by averaging predictions from some temporally ensembled networks, and out-of-distribution samples are detected using a simple threshold applied to the largest prediction score. Using a cross-modal matching strategy, Huang et al. (2021) trained a network to predict whether a data sample matches a one-hot class label or not. By using this module, they filter out samples that have low matching scores with all possible class labels. In Saito et al. (2021), inlier confidence scores were calculated using one-vs-all (OVA) classifiers. Furthermore, a soft-consistency regularization loss is also applied to enhance the OVA-classifier's smoothness, thereby improving outlier detection.

Out-of-Distribution Detection. Previous works in semi-supervised settings utilized an out-of-distribution detector in order to filter relevant unlabeled data. Some methods train a K-way classifier, assign pseudo-labels to unlabeled data, and incorporate them in the training procedure. Due to the neural networks' overconfidence over even noisy data (Hsu et al., 2020), these methods use specific techniques to alleviate this phenomenon. Lee et al. (2019) trained a classifier using the confidence calibration technique in order to lower confidence over unseen data. In this work, they sampled random data from a massive dataset like ImageNet and applied a loss to reduce the model confidence on them. Smith et al. (2021) used another technique called DeConf, by which they calibrated probabilities using only in-distribution data, without needing out-of-distribution data.
In another type of method called "Learning from Positive and Unlabeled Data" (Comité et al., 1999; Elkan & Noto, 2008; Garg et al., 2021), authors train a binary classifier that determines whether each input is in-distribution or not. Garg et al. (2021) proposed an iterative two-stage method in which they first estimate α, the mixture proportion of positive data among unlabeled data, and then train a classifier using the estimated α. They iterate these two stages until a convergence criterion is satisfied.

The CIFAR10 dataset consists of 60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. When this dataset is used as the main continual dataset, we randomly split it into 5 tasks with 2 classes per task. For this dataset, we used the ratios of P = 0.01 and P = 0.1 for the supervised samples, equal to 50 and 500 samples per class, respectively.

The CIFAR100 dataset contains 100 classes with 500 training and 100 test samples for each class. Each supervised task includes the training samples of 10 classes when this dataset is used as the main continual dataset. In Table 2 of the main paper, we used the P = 0.05 and P = 0.1 configurations for this dataset, corresponding to 25 and 50 training samples per class.

Tiny-Imagenet is a subset of the Imagenet dataset which contains 200 classes, 100000 training samples, and 10000 test samples. Before using the dataset, we downsize the input images from 64 × 64 to 32 × 32 in order to make all image sizes equal. We split the dataset into 10 equally sized supervised tasks. Similar to CIFAR100, this dataset is used with the P = 0.05 and P = 0.1 ratios, which are equivalent to 25 and 50 training samples per class.

Caltech256 is an object recognition dataset that contains 30607 real-world images from 257 categories. Image sizes differ from each other, and the minimum number of images per category is 80.
We only use this dataset in Appendix F to increase the number of datasets in U_t in order to diversify the objects in unlabeled samples and provide a more realistic environment.
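The per-class labeled counts and task splits quoted above follow directly from the dataset sizes. The sketch below reproduces that arithmetic; the function names, the seed, and the use of Python's `random` module for the class shuffle are illustrative assumptions, not details taken from the paper's code.

```python
import random

# (number of classes, training images per class), as quoted in the text.
DATASETS = {
    "CIFAR10": (10, 5000),
    "CIFAR100": (100, 500),
    "Tiny-Imagenet": (200, 500),
}


def labeled_per_class(dataset: str, ratio: float) -> int:
    """Number of labeled training samples per class for a labeled ratio P."""
    _, train_per_class = DATASETS[dataset]
    return round(train_per_class * ratio)


def split_into_tasks(num_classes: int, num_tasks: int, seed: int = 0):
    """Randomly partition class ids into equally sized, disjoint tasks.

    The seed and shuffling procedure are hypothetical; the text only states
    that classes are split randomly.
    """
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    classes = list(range(num_classes))
    random.Random(seed).shuffle(classes)
    size = num_classes // num_tasks
    return [classes[i * size:(i + 1) * size] for i in range(num_tasks)]


if __name__ == "__main__":
    for name, ratio in [("CIFAR10", 0.01), ("CIFAR100", 0.05), ("Tiny-Imagenet", 0.1)]:
        print(name, ratio, labeled_per_class(name, ratio))
    print(split_into_tasks(10, 5))
```

For example, `labeled_per_class("CIFAR10", 0.01)` gives the 50 samples per class mentioned for P = 0.01, and `split_into_tasks(10, 5)` yields five tasks of two classes each.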

B.2 TRAINING DETAILS

As explained in the main paper, we used two datasets for each experiment: (i) the main dataset to construct the supervised and related unsupervised samples and (ii) the peripheral dataset to provide the unrelated unsupervised samples. At each time step, 9000 unlabeled samples are provided from the main dataset and 9000 unlabeled samples from the peripheral dataset. In our experiments, the ResNet-18 architecture is used as the encoder for our method as well as for all baselines. In our method, starting from random initialization, the reference network is trained for E_{T1} = 400 epochs at time step t = 1 to converge to a good representation. For subsequent time steps, it is trained for only E_{T>1} = 100 epochs. The learner network, on the other hand, is trained for E_S = 200 epochs at every time step, matching the 200 epochs used by the baseline methods. The mean and standard deviation of results are obtained over 3 runs. In Table 6, the running times required to train a single epoch of the different models are reported. These results were recorded on a GeForce RTX 3080 Ti GPU.
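As a rough illustration of the protocol above, the following sketch builds the mixed unlabeled set for one time step and tallies the epoch budget over a run. It assumes sampling without replacement and a fixed seed, neither of which is specified in the text; all function names are hypothetical.

```python
import random


def reference_epochs(t: int, first: int = 400, later: int = 100) -> int:
    """Epochs for the reference network at time step t (1-indexed)."""
    return first if t == 1 else later


def build_unlabeled_pool(main_pool, peripheral_pool, n_each=9000, seed=0):
    """Draw n_each related and n_each unrelated samples and mix them.

    The origin of each sample is discarded after mixing, so a model only
    sees a single unlabeled set. Sampling without replacement is an assumption.
    """
    rng = random.Random(seed)
    pool = rng.sample(main_pool, n_each) + rng.sample(peripheral_pool, n_each)
    rng.shuffle(pool)
    return pool


def total_epochs(num_tasks: int, learner_epochs: int = 200):
    """Total (reference, learner) epochs over a run with num_tasks time steps."""
    ref = sum(reference_epochs(t) for t in range(1, num_tasks + 1))
    return ref, learner_epochs * num_tasks


if __name__ == "__main__":
    main_ids = [("main", i) for i in range(50000)]
    peripheral_ids = [("peripheral", i) for i in range(50000)]
    pool = build_unlabeled_pool(main_ids, peripheral_ids)
    print(len(pool), total_epochs(10))
```

With 10 time steps, this schedule amounts to 400 + 9 × 100 = 1300 reference-network epochs and 2000 learner epochs.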

D OTHER BENCHMARKS

In the "Other Benchmarks" section of the paper, we investigated different configurations to demonstrate the effectiveness and robustness of the URSL model under various conditions. All benchmarks are described, and the results analyzed further, below.

After and Before. In these two benchmarks, we assume that there are no unrelated samples among the unlabeled data. The difference between the two settings is which classes of the main dataset are present in the unlabeled data at a given time step. More specifically, in the After scenario, the unlabeled samples come only from the classes of the current time step and from future classes of subsequent time steps. In the Before scenario, in contrast, only unlabeled data from previous classes are presented. This experiment was designed to show that the URSL model can benefit from positive forward/backward knowledge transfer from the unsupervised samples to the supervised tasks. As shown in Table 9, in the Before scenario, visiting unlabeled data of previous classes appears to help mitigate catastrophic forgetting (positive backward transfer). In contrast, in the After scenario, the model learns a decent representation space which is beneficial for learning the upcoming classes (positive forward transfer).

Only Related and Only Unrelated. To investigate the effect of each type of unlabeled data on the model's functionality, we defined the Only Related and Only Unrelated settings. As their names suggest, in the former, all unlabeled data at each task contain only main dataset samples. In contrast, in the latter, all unlabeled data come only from the peripheral dataset. In Table 10, comparing Only Unrelated with Only Supervised shows that even unrelated samples improve performance by enriching the representation of the reference network and providing a pivot model that prevents the learner network from accumulating large changes during the continual learning process.
Also, the Only Related scenario demonstrates how much the related samples among the unlabeled data contribute to performance, and comparing it with the OSSCL setting reveals the contribution of the unrelated samples. It is worth mentioning that although most of the improvement is due to incorporating related samples, access to purely related unlabeled data is not usually a realistic assumption. Instead, related samples are embedded in a huge stream of unlabeled data containing unrelated samples. Therefore, we considered both related and unrelated datasets as unlabeled samples (in the OSSCL setting) and showed that properly employing these datasets (in URSL) further improves the results compared with the Only Related case.

Non-I.I.D. The OSSCL scenario considers an I.I.D. assumption on the related unsupervised samples available in the environment. To challenge this assumption, we introduced a new benchmark in which the related data is generated only from a portion of the supervised classes at each time step. For example, in the Non-I.I.D. (50%) experiment, the related unsupervised dataset only includes samples from half of the supervised classes, which are randomly selected at each time step. As shown in Table 11, the URSL model still demonstrates good performance even with this limited access to the related unlabeled samples.
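The class-selection rules behind the After, Before, and Non-I.I.D. benchmarks can be sketched as follows. This is only one reading of the descriptions above: in particular, it assumes that "supervised classes" in the Non-I.I.D. setting means the classes of the current task, and the function name, scenario labels, and seeding are hypothetical.

```python
import random


def related_unlabeled_classes(tasks, t, scenario, fraction=0.5, seed=0):
    """Main-dataset classes that supply related unlabeled data at time step t.

    tasks: list of class-id lists, one per time step; t is 1-indexed.
    scenario: "before", "after", or "non_iid" (hypothetical labels).
    """
    current = tasks[t - 1]
    if scenario == "before":
        # Only classes from earlier time steps.
        return [c for task in tasks[:t - 1] for c in task]
    if scenario == "after":
        # Classes of the current time step plus all future classes.
        return [c for task in tasks[t - 1:] for c in task]
    if scenario == "non_iid":
        # A random subset of the current task's classes (assumed reading).
        rng = random.Random(seed + t)
        k = max(1, int(len(current) * fraction))
        return sorted(rng.sample(current, k))
    raise ValueError(f"unknown scenario: {scenario}")


if __name__ == "__main__":
    tasks = [[0, 1], [2, 3], [4, 5]]
    for name in ("before", "after", "non_iid"):
        print(name, related_unlabeled_classes(tasks, 2, name))
```

Under this reading, the Before scenario provides no related unlabeled data at the first time step, since no previous classes exist yet.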

E NUMBER OF RELATED AND UNRELATED SAMPLES

In this section, we investigated the effect of the number and ratio of the related and unrelated samples among unlabeled data. In contrast to other baselines, URSL is able to utilize unrelated unlabeled samples to boost final performance even with an imbalanced number of related and unrelated unlabeled sets. Table 12 shows that increasing unrelated samples improves results slightly while increasing related samples provides the model with more in-distribution samples to improve its performance and combat catastrophic forgetting.

F MORE COMPLICATED ENVIRONMENTS

In this section, we examine the performance of our model in even more realistic environments by conducting further experiments in which the unlabeled data is comprised of multiple datasets. Table 13 shows the results of these experiments. Besides the datasets used in the main experiments, we also used Caltech256 (Griffin et al., 2007). In each experiment, we add 9000 randomly sampled images from each dataset to U_t. The results suggest that our model is robust to a variety of unlabeled data and performs well in more realistic scenarios in which the model is exposed to plenty of unlabeled samples, most of which are not related to its target tasks.

G THE REFERENCE AND LEARNER ARCHITECTURES

The authors of Co2L (Cha et al., 2021) used ResNet-18 as the feature extractor architecture of their model. Following this design choice, we used the same architecture for both the learner and reference networks as well as all other models and baselines in all experiments to ensure a fair comparison. In this section, we investigate the effect of changing the architecture of the learner and the reference networks, as reported in Table 14. As expected, the model's performance slightly increases as the number of model parameters grows. Moreover, deep ResNet architectures achieved better performance than wide ResNet architectures. It is noteworthy that although we used a batch size of 512 in all experiments in other sections, the experiments in this section were performed with a batch size of 128 to meet the memory limit while still providing a fair comparison.

H MEMORY BUFFER SELECTION ALGORITHM

Selecting suitable samples to be stored in the memory is an active area of research in continual learning (Bang et al., 2021; Tiwari et al., 2022; Isele & Cosgun, 2018). However, the purpose of our research was not to focus on memory selection policies. Therefore, we have used a random policy, as it is widely adopted in many CL works (Prabhu et al., 2020; Guo et al., 2020; Balaji et al., 2020). It is noteworthy that the segregation of unlabeled data provides more diverse data than what exists in the memory buffer from past classes. Nevertheless, because the samples stored in the memory buffer play an important role in segregating the unsupervised samples, we conducted experiments using different selection algorithms for memory buffer samples. In addition to the "random" selection method, we defined three other selection strategies:

• Low-confidence: select the data on which the model has low confidence.
• High-confidence: select the data on which the model has high confidence.
• Rainbow (Bang et al., 2021): select from all the ranges of confidence. This algorithm calculates a confidence score for each sample and sorts all scores; then, it selects some data by considering the presence of samples from all ranges of model confidence.

Table 15 shows the performance of all algorithms. As can be seen, the "Random" selection algorithm outperforms both the "High-confidence" and "Low-confidence" selection strategies by a good margin. Moreover, the "Rainbow" strategy achieves similar results to the "Random" strategy.

Table 15: effect of different algorithms for data selection for memory buffer on CIFAR100 classification with CIFAR10 dataset as the peripheral dataset.

Algorithm          Accuracy (%)
Low-confidence     67.9 ± 0.9
High-confidence    69.7 ± 1.0
Rainbow            72.5 ± 0.6
Random             72.8 ± 0.9

The reference network is expected to gradually absorb unsupervised knowledge from the environment. In this section, we designed an experiment to show the success of the reference network in continually learning the unsupervised samples. In this experiment, the reference network is first pre-trained with all unsupervised samples before starting the learning of the first supervised task. Next, the reference network is frozen, and its parameters are maintained throughout the entire learning procedure. Other training details and learning mechanisms of the learner network are the same as the original URSL model. Table 16 demonstrates that pretraining only slightly improves the URSL results. This suggests that the unsupervised samples available in the environment are sufficient for the reference network to learn a proper representation even if the data is observed continually.

I.2 SELF-SUPERVISED METHOD

In all experiments, we trained the reference network with the NT-Xent loss (Sohn, 2016), a popular and straightforward contrastive loss that is widely used in the self-supervised learning literature (Chen et al., 2020a) and achieves remarkable performance. However, our model and algorithm perform well regardless of the self-supervised loss used to train the reference network. To demonstrate this, we compared the results of our model with experiments in which the reference network is trained by BYOL (Grill et al., 2020), a self-supervised algorithm different from NT-Xent. Tables 17 and 18 report the results of runs for two different scenarios. The advantage of BYOL over SimCLR is that it does not need negative samples in training. Indeed, in our experiments, the datasets have less diverse samples than huge datasets such as ImageNet; therefore, we expect BYOL to perform better than SimCLR. The empirical results confirm this point. However, BYOL needs more time to obtain comparable results.
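For readers unfamiliar with NT-Xent, the sketch below computes the loss in its standard SimCLR form on a small batch of embeddings using only the Python standard library. It is an illustration of the general loss, not the paper's training code; the temperature of 0.5 and the convention that consecutive embeddings form positive pairs are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings.

    Assumes z[2k] and z[2k + 1] are the two augmented views of sample k.
    For each anchor i, the positive is its partner view and every other
    embedding in the batch acts as a negative.
    """
    n = len(z)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number (>= 2) of embeddings")
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the partner view of i
        logits = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        m = max(logits)  # stabilise the log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print(nt_xent(aligned), nt_xent(misaligned))
```

The loss is lower when the two views of each sample are close to each other and far from the other samples in the batch, which is why the `aligned` batch scores lower than the `misaligned` one.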

J LIMITATIONS

Our method has several limitations. First, we have to keep a minimum number of main dataset samples from each class in the memory buffer in order to create more precise prototypes. Second, because the teacher network and the student network are not trained in parallel, the time complexity of our method is higher than that of other methods and baselines. Furthermore, our model performs worse if the number of samples per class becomes severely imbalanced.

There exist other limitations related to the Open-Set Semi-Supervised Continual Learning scenario. Although this configuration seems more realistic than those of previous works in the literature, there may still exist situations in which the assumptions of OSSCL do not hold. For example, an agent may have limited access to both related and unrelated unlabeled samples in the environment. This will

K CODE AND DATA AVAILABILITY

The source code to reproduce the results of this paper is attached to this document. In this repository, there exists a README file containing instructions and configuration details. Moreover, the licenses of the freely available datasets and used source codes are also available in the README file.



Figure 2: (left) AUROC of OoD detection based on the number of the main dataset seen classes for CIFAR100 classification with CIFAR10 dataset as peripheral when the number of related and unrelated data are 9000 (right) The precision of OoD detection at each task of CIFAR100 classification with the CIFAR10 dataset as peripheral when the number of related and unrelated data is 9000.

Figure 3: (left) AUROC of OoD detection based on the number of the main dataset seen classes for CIFAR100 classification with CIFAR10 dataset as peripheral when the number of related and unrelated data are 4500 (right) The precision of OoD detection at each task of CIFAR100 classification with the CIFAR10 dataset as peripheral when the number of related and unrelated data is 4500

Algorithm 1 URSL: Unsupervised Reference and Supervised Learner
Require: A supervised dataset $D_{sup} = \{T_t\}_{t=1}^{T}$ and an unsupervised dataset $D_{unsup} = \{U_t\}_{t=1}^{T}$
1: initialize $UR_0$ and $SL_0$ with random parameters $\theta_0$ and $\phi_0$, respectively
2: for $t = 1, ..., T$ do

Accuracy of different models on the CIFAR10 dataset.

Accuracy of different models on the CIFAR100 dataset.

Accuracy of different models on the Tiny-Imagenet dataset.

Ablation of Eq. 7 on CIFAR100 classification with CIFAR10 dataset as peripheral.

Table 4 indicates the model's performance in the experiments created by ablations over the losses of the model presented in Eq. 7:

Effect of L_sup: L_sup induces the representation of the learner network to discriminate between classes directly by using the supervised contrastive loss and labels. As the results suggest, this loss is important and contributes to the performance of the model. Adding L_sup to the URSL w/o L_sup version increases the performance by 4.7%. Moreover, although L_KD provides great discrimination for the learner network and achieves 28.2% accuracy, adding L_sup to this version still enhances the performance.

Effect of L_TD: The role of L_TD is to transfer previously learned knowledge and reduce forgetting.

Ablation of OoD on CIFAR100 classification with CIFAR10 dataset as peripheral.

it is shown, using this loss alone to train the learner network achieves great performance. In addition, adding L_KD to Only L_sup increases performance from 19.1% to 28.4%. Although L_KD and L_TD both reduce forgetting in different ways and overlap in their function, adding L_KD to URSL without L_KD still boosts the performance. It is worth mentioning that the performance reduction from the Only L_KD version to the URSL without L_sup version is due to the L_KD and L_TD coefficients, λ and γ respectively, being equal. The high ratio of

The Running time of a single task for the URSL and baselines for CIFAR100 classification with CIFAR10 as peripheral dataset



Chosen hyperparameters for URSL

Non-I.I.D. Benchmarks of CIFAR100 classification with CIFAR10 dataset as peripheral

The effect of the number of related and unrelated samples on CIFAR100 classification with CIFAR10 dataset as the peripheral dataset.

The results of using multiple datasets in U_t to simulate a more realistic environment.

Table 14: effect of different architectures for the reference and the learner network on CIFAR100 classification with CIFAR10 dataset as the peripheral dataset.

Comparison of URSL with URSL with full-pretraining

Comparison between NT-Xent and BYOL performance on CIFAR10 classification with Tiny-Imagenet dataset as peripheral Main dataset Peripheral dataset SSL method Accuracy (%) Time cost (mins)

Comparison between NT-Xent and BYOL performance on CIFAR100 classification with CIFAR10 dataset as peripheral Main dataset Peripheral dataset SSL method Accuracy (%) Time cost (mins) performance of our model since it is designed to perform in a situation where plenty of unsupervised data exists.

B.3 TUNING THE HYPERPARAMETERS

We created a validation set for all three main datasets by selecting 10% of the training samples at random and then performed a hyperparameter search according to Table 7. Table 8 shows the chosen hyperparameters, obtained either from the validation results or by adopting the values from the Co2L paper. A strength of our proposed method is that the selected hyperparameters are invariant across different scenarios, and we used a single configuration for all experiments. For Co2L, DM, and GD, we used the optimal hyperparameters when the authors reported them in the original papers. In addition, Co2L-j and Co2L-p used a similar set of hyperparameters to URSL, except for the new hyperparameter introduced in Co2L-j, the unsupervised loss coefficient, which was set to 1.

B.4 AUGMENTATIONS

To increase the diversity of training samples, following previous works (Cha et al., 2021; Chen et al., 2020a), we used the following augmentation techniques for all data:

1. RandomResizedCrop: The image is randomly cropped with the scale in [0.2, 1], and the cropped image is then resized to 32 × 32.
2. RandomHorizontalFlip: Each image is flipped horizontally with a probability p = 0.5, independently from other samples.
3. ColorJitter: The brightness, contrast, saturation, and hue of each image are changed with a probability of p = 0.8, with maximum strength [0.4, 0.4, 0.4, 0.1], respectively.
4. RandomGrayscale: Images are converted to grayscale with probability p = 0.2.

B.5 TRAINING CLASSIFIER

At the end of training, we trained a linear classifier on top of the learner network's encoder for 100 epochs using all memory data and the labeled data of the last time step, T_T ∪ M. Because the labeled data are class-imbalanced, we used a weighted random sampler to draw mini-batches.
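The idea behind weighted sampling for class-imbalanced labeled data can be sketched in plain Python as below. This is a minimal illustration assuming inverse-class-frequency weights and sampling with replacement; the paper does not state the exact weighting, and the function names are hypothetical.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Per-sample weights inversely proportional to the sample's class frequency.

    With these weights, every class carries the same total probability mass.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def draw_minibatch(labels, batch_size, seed=0):
    """Draw sample indices with replacement according to the balanced weights."""
    rng = random.Random(seed)
    weights = balanced_sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


if __name__ == "__main__":
    # 90 samples of class 0 (e.g. the current task) and 10 of class 1 (e.g. memory).
    labels = [0] * 90 + [1] * 10
    batch = draw_minibatch(labels, batch_size=2000)
    minority_share = sum(labels[i] for i in batch) / len(batch)
    print(f"share of the minority class in the batch: {minority_share:.2f}")
```

In expectation, each class then appears equally often in a mini-batch regardless of how many stored samples it has.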

C THE PERFORMANCE OF THE OOD DETECTION

In this section, we evaluate the performance of the OoD detection module in two scenarios. In the first, the number of related and unrelated data is 9,000 each, and in the second, this number is 4,500. Precision and AUROC of the OoD detection module over the time steps are shown for these two settings in Figures 2 and 3. It can be seen that the performance of the OoD detection module improves over time because it sees more classes and can detect class boundaries more precisely.
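The two metrics reported in Figures 2 and 3 can be computed as in the following sketch. It assumes a scoring convention in which a higher score means "more likely related (in-distribution)" and treats related samples as the positive class; the threshold and function names are hypothetical and not taken from the paper.

```python
def auroc(scores, is_related):
    """Probability that a random related sample outscores a random unrelated one.

    Ties count as half a win. This pairwise form is quadratic in the number
    of samples, which is fine for illustration but slow for large sets.
    """
    pos = [s for s, r in zip(scores, is_related) if r]
    neg = [s for s, r in zip(scores, is_related) if not r]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision(scores, is_related, threshold):
    """Fraction of truly related samples among those accepted as related."""
    accepted = [r for s, r in zip(scores, is_related) if s >= threshold]
    return sum(accepted) / len(accepted) if accepted else 0.0


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.8, 0.1]
    is_related = [True, True, False, False]
    print(auroc(scores, is_related), precision(scores, is_related, 0.5))
```

AUROC is threshold-free, while precision depends on the chosen acceptance threshold, so the two curves can move differently as more classes are seen.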

