BRIDGING THE GAP BETWEEN SEMI-SUPERVISED AND SUPERVISED CONTINUAL LEARNING VIA DATA PROGRAMMING

Abstract

Semi-supervised continual learning (SSCL) has shown its utility in learning cumulative knowledge with partially labeled data per task. However, the state of the art has yet to explicitly address how to reduce the performance gap between using partially labeled data and using fully labeled data. In response, we propose a general-purpose SSCL framework, namely DP-SSCL, that uses data programming (DP) to pseudo-label the unlabeled data per task, and then cascades both ground-truth-labeled and pseudo-labeled data to update a downstream supervised continual learning model. The framework includes a feedback loop that brings mutual benefits: On one hand, DP-SSCL inherits guaranteed pseudo-labeling quality from DP techniques to improve continual learning, approaching the performance of using fully supervised data. On the other hand, knowledge transfer from previous tasks facilitates training of the DP pseudo-labeler, taking advantage of cumulative information via self-teaching. Experiments show that (1) DP-SSCL bridges the performance gap, approaching the final accuracy and catastrophic forgetting obtained with fully labeled data, (2) DP-SSCL outperforms existing SSCL approaches at low cost, with up to 25% higher final accuracy and lower catastrophic forgetting on standard benchmarks, while reducing memory overhead from the 100 MB level to the 1 MB level at the same time complexity, and (3) DP-SSCL is flexible, maintaining steady performance while supporting plug-and-play extensions for a variety of supervised continual learning models.

1. INTRODUCTION

Lifelong machine learning, also known as continual learning (CL), is a machine learning paradigm that accumulates knowledge over sequential tasks (Ruvolo & Eaton, 2013a; Silver et al., 2013; Chen & Liu, 2016; Liu, 2017). It empowers machine learning at the application level such that an agent does not need to be trained from scratch with large amounts of data for every new task, and it enables the agent's self-improvement on previously learned tasks by continuing to learn post-deployment. Nevertheless, researchers have identified that obtaining labeled training data is expensive (Olivier et al., 2006; Settles, 2009), which semi-supervised continual learning (SSCL) addresses (Baucum et al., 2017; Wang et al., 2021; Smith et al., 2021). As the name suggests, SSCL utilizes not only labeled data but also unlabeled task data to construct a cumulative knowledge base for learning agents, reducing labeling cost in applied machine learning. Despite these research efforts, state-of-the-art SSCL (Baucum et al., 2017; Wang et al., 2021; Smith et al., 2021) has yet to address an elephant in the room: closing the performance gap between supervised and semi-supervised CL. Ideally, learning from $n_L$ labeled and $n_U$ unlabeled data points per task should provide the same lifelong performance as if all $n_L + n_U$ data points were labeled, but state-of-the-art SSCL frameworks have not approached this goal, and they rarely consider the computational cost required to do so. Moreover, multiple supervised CL tools have matured (Lee et al., 2019; Yoon et al., 2018; Bulat et al., 2020) and would likely benefit from being extended to the semi-supervised setting, but current SSCL approaches are architecture-specific and such extension is non-trivial. Motivated by the challenges above, we propose data programming (DP) (Ratner et al., 2016b) as a solution.
DP is an automatic pseudo-labeling approach that collectively generates labels from noisy labeling functions, with some methods providing probabilistic guarantees on pseudo-label accuracy (Ratner et al., 2016a; Varma & Ré, 2016). Ideally, the more diverse the sampled noisy labelers, the higher the quality of the pseudo-labels that can be produced, approaching perfect labeling accuracy. Therefore, upon every task in SSCL, by training a pseudo-labeler via DP and then cascading both ground-truth-labeled and pseudo-labeled data into a supervised CL model, we are able to approach high-quality pseudo-labeling and decrease the performance gap between semi-supervised and supervised CL. This procedure also benefits from the small overhead of DP in terms of both time and memory, lowering resource costs on large amounts of unlabeled data and long task sequences. Furthermore, the cumulative knowledge gathered along CL can assist the pseudo-labeler's performance, leveraging transferability analysis metrics (Nguyen et al., 2020; Tan et al., 2021; Pandy et al., 2022; Tran et al., 2019). Intuitively, the more similar two tasks are, the more similarly they should handle the unlabeled data to shrink the gap. In practice, noisy labeling functions from previous tasks can be retained and transferred to new tasks based on task transferability, utilizing cumulative knowledge to self-teach the pseudo-labeler throughout the lifelong sequence. This framework design also allows supervised CL approaches to be extended in a plug-and-play fashion, by decoupling the pseudo-labeling and continual learning modules. Experiments on standard image classification benchmarks show DP-SSCL achieves final accuracy and catastrophic forgetting comparable to supervised CL on fully labeled data.
Moreover, DP-SSCL outperforms existing SSCL tools with up to 25% higher final accuracy and lower catastrophic forgetting, while reducing the memory overhead for unlabeled data processing from the 100 MB level to the 1 MB level with the same time complexity. Additionally, ablation studies show DP-SSCL maintains steady continual performance with increasing amounts of unlabeled data per task, over longer task sequences, and with different knowledge transfer mechanisms.

2.1. LIFELONG LEARNING/CONTINUAL LEARNING (CL)

The primary goal of continual or lifelong learning is to learn tasks consecutively, exploiting forward transfer to facilitate the learning of new tasks while retaining performance on previous tasks without catastrophic forgetting. The vast majority of research focuses on supervised methods, using techniques such as weight importance vectors (Fernando et al., 2017; Aljundi et al., 2019) to cache critical pathways and prevent catastrophic forgetting, factorized transfer to decompose the model parameter space (Ruvolo & Eaton, 2013b; Bulat et al., 2020; Lee et al., 2019), deconflicting projections to ensure that new tasks are trained using unused capacity within the deep network (Farajtabar et al., 2019; Zeng et al., 2019; Saha et al., 2021), and dynamically expanding networks that grow to accommodate tasks (Veniat et al., 2021).

2.2. SEMI-SUPERVISED CONTINUAL LEARNING (SSCL)

Recently, techniques have been developed for CL in semi-supervised settings to take advantage of unlabeled data. A common SSCL procedure is to pseudo-label the unlabeled data to augment the training set. For instance, CNNL (Baucum et al., 2017) fine-tunes a lifelong learning model by repeatedly pseudo-labeling unlabeled data using the model itself, and then augments its training set with the newly labeled data. Alternatively, DistillMatch (Smith et al., 2021) uses an out-of-distribution detector to identify unlabeled data points that were possibly seen in previous tasks, and pseudo-labels them using distilled accumulated knowledge. A third example is ORDisCo (Wang et al., 2021), which trains a GAN-based pseudo-labeler in parallel with a lifelong learning model by using a three-branch network, which enables it to learn the joint distribution of data and labels simultaneously. Similarly, Semi-ACGAN (Brahma et al., 2021) utilizes a GAN to train task-dependent classifiers while using the unlabeled data only to train the GAN's discriminator to identify the source of the data (real vs. fake). The last example (Ho et al., 2022) combines prototypical learning for pseudo-labeling with meta-learning to achieve both label generation on the unlabeled data and fast adaptation to any task in the continual learning scenario. Under pseudo-labeling, bridging the gap towards supervised CL becomes simple: the higher the pseudo-labeling accuracy, the closer the performance is to CL on fully labeled data. Put simply, shrinking the gap amounts to improving the pseudo-labeler. The methods above all generate labels for the unlabeled data by using the classifiers that are trained for task objectives, and their novelty comes from supplementary design decisions that tackle issues of SSCL such as catastrophic forgetting and pseudo-label consistency.
A major shortcoming of these approaches arises from the frequently observed tendency of neural network models to be overconfident in their predictions (Guo et al., 2017a; Nguyen et al., 2015; Hein et al., 2019); as a result, neural-net-based labelers may confidently generate incorrect pseudo-labels, especially when encountering novel tasks or out-of-distribution data. In contrast, our DP-SSCL method utilizes DP, which provides theoretical guarantees on the quality of labels, making the pseudo-labeling process more robust to new data from the novel tasks that arise in CL settings. Moreover, these existing SSCL tools require large additional memory and computation overhead for pseudo-labeling, such as storage of a ResNet-34 backbone (Smith et al., 2021), GAN networks (Wang et al., 2021; Brahma et al., 2021) or MAML (Ho et al., 2022), while DP-SSCL minimizes this overhead by storing lightweight labeling functions facilitated by data programming, which we describe below. Additionally, techniques that do not depend directly on pseudo-labels have been proposed for stable semi-supervised continual learning. For instance, pseudo-gradient learners (Luo et al., 2022) are trained instead of pseudo-labelers to provide auxiliary gradients for the model update from the unlabeled data. This is useful when the unlabeled data may include instances of unknown classes. Another work, CCIC (Smith et al., 2021), adapts MixMatch to use pseudo-labels of the unlabeled data as a consistency regularization target rather than a training target. These methods avoid using pseudo-labels as a direct training target, but they lack theoretical guarantees of the kind that DP-SSCL provides.

2.3. DATA PROGRAMMING (DP)

Data programming (DP) is an approach to automatically produce pseudo-labels on an unlabeled data set $X_U$, given a labeled data set $(X_L, y_L)$ (Ratner et al., 2016b). The idea is to ensemble a set of noisy weak labeling functions (WLFs), each performing only slightly better than random guessing on its own, such that the ensembled pseudo-labels, or strong labels, achieve high accuracy. Therefore, DP papers generally discuss two problems in cascading order: (1) how to generate these WLFs and (2) how to do the ensembling.

To address the upstream problem of WLF generation, one simple way is to collect functions manually designed by experts (Ratner et al., 2016b). One succeeding tool, namely Snuba (Varma & Ré, 2016), iteratively trains multiple sets of WLFs, and then from every set selects the top $k$ WLFs ranked by the score

$s = w \cdot \mathrm{F1}(y_L, \hat{y}_L) + (1 - w) \cdot \mathrm{Jaccard}(\hat{y}_U)$    (1)

where $\hat{y}_L$ and $\hat{y}_U$ are the label vectors predicted by a WLF on the labeled and unlabeled data, respectively. The score consists of a performance metric (F1) on the labeled set, as well as a diversity metric (Jaccard distance (Jaccard, 1902)), with a weighting factor $w$ that is usually set to 0.5. The selected WLFs that pass this pruning step form the committed labeler set $F$.

Then, for the downstream problem of ensembling, multiple techniques can be applied. For example, majority voting is one of the brute-force methods. Other existing techniques such as repeated labeling could also apply (Ipeirotis et al., 2014). Among these methods, Snorkel (Ratner et al., 2016a) learns a generative model on top of the committed WLFs in the form of

$\pi_\phi(\hat{Y}_U, Y_U) = \frac{1}{Z_\phi} \exp(\phi^T \hat{Y}_U Y_U)$    (2)

where $\hat{Y}_U$ and $Y_U$ are the aggregated label matrices of all committed WLF labels and ground-truth labels, $\phi$ denotes the parameters, and $Z_\phi$ is a normalization factor. Snorkel trains this generative model such that it labels $y_U$ with high accuracy.
This ensembling requires each WLF to have higher accuracy than random guessing, which is a low requirement for learners. When the downstream ensembling uses a generative model as in equation 2, Snuba provides a probabilistic guarantee that the accuracy of the generative model on labeled data and unlabeled data has a maximum difference of $\epsilon$ with probability $1 - \delta$. This guarantee exists because Snuba checks an exit condition on WLFs, such that each committed WLF before termination must have a certain level of confidence on $d$ data points, with

$d \ge \frac{1}{2(\gamma - \epsilon)^2} \log\left(\frac{2|F|^2}{\delta}\right)$    (3)

Here, $\gamma$ is the measured error, i.e., the difference of accuracy between the generative model and WLFs on the labeled data. Please refer to the original paper for a more detailed proof. We also extend the proof of this guarantee to the continual learning setting in Appendix A.
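As a rough illustration of equations 1 and 3, the following Python sketch computes a selection score and the smallest integer $d$ that satisfies the exit condition. The function names are hypothetical, and the diversity term is assumed here to be a Jaccard distance between two sets of covered data points; this is a simplification for illustration, not Snuba's actual implementation.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def selection_score(f1_labeled, jaccard_unlabeled, w=0.5):
    """Equation 1: weighted sum of a performance and a diversity term."""
    return w * f1_labeled + (1.0 - w) * jaccard_unlabeled


def min_confident_points(gamma, eps, num_wlfs, delta):
    """Equation 3: smallest integer d with
    d >= log(2 |F|^2 / delta) / (2 (gamma - eps)^2)."""
    if gamma == eps:
        raise ValueError("gamma and eps must differ")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    bound = math.log(2 * num_wlfs ** 2 / delta) / (2 * (gamma - eps) ** 2)
    return math.ceil(bound)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(selection_score(0.8, jaccard_distance({1, 2, 3}, {2, 3, 4})))
    print(min_confident_points(gamma=0.2, eps=0.1, num_wlfs=25, delta=0.05))
```

Because the bound grows with $1/(\gamma - \epsilon)^2$, halving the tolerated gap roughly quadruples the number of confidently labeled points each WLF needs.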

2.4. TRANSFERABILITY

In transfer learning, transferring a trained model to a target task that shares no common features with the source task typically hurts the performance of the model, making it even worse than learning only the target task. As such, understanding the similarity of tasks and measuring the transferability of a learning model from one task to another is a key aspect of not only transfer learning but also continual learning. An intuitive metric of transferability is the accuracy of the source model on the target data (Tran et al., 2019; Dhillon et al., 2020), which measures how well the trained model performs on the target task. LEEP (Nguyen et al., 2020) extends this metric by weighting the likelihood of the model with the empirical conditional distribution of the target label given the source label. OTCE (Tan et al., 2021) quantifies the distance between two classification tasks as a sum of domain difference and task difference. The domain difference (the difference of the data distributions) is computed by optimal transport theory with entropic regularization, and the task difference (the difference of the classification objectives, such as the set of classes) is derived from conditional entropy using the optimal coupling matrix of the optimal transport problem. GBC (Pandy et al., 2022) adopts the Bhattacharyya coefficient, which measures the amount of overlap between two distributions in the feature space of the source model. Given the positive results obtained with transferability scores in transfer learning, such a score can help identify the reusable knowledge of earlier tasks more effectively.
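To make the LEEP description above concrete, the sketch below computes a LEEP-style score from the source model's predicted class probabilities on target data. It is a minimal pure-Python sketch following the published LEEP definition as we understand it; the function name, the list-based input format, and the 0-indexed labels are assumptions made for illustration.

```python
import math


def leep_score(source_probs, target_labels):
    """LEEP-style transferability score.

    source_probs: list of rows; row i holds the source model's predicted
        probabilities over source classes z for target example i.
    target_labels: list of 0-indexed target labels y_i (an assumption;
        the paper indexes labels from 1).
    """
    n = len(target_labels)
    num_source = len(source_probs[0])
    num_target = max(target_labels) + 1

    # Empirical joint distribution P(y, z).
    joint = [[0.0] * num_source for _ in range(num_target)]
    for probs, y in zip(source_probs, target_labels):
        for z, p in enumerate(probs):
            joint[y][z] += p / n

    # Marginal P(z) and conditional P(y | z).
    marginal = [sum(joint[y][z] for y in range(num_target))
                for z in range(num_source)]
    cond = [[joint[y][z] / marginal[z] if marginal[z] > 0 else 0.0
             for z in range(num_source)]
            for y in range(num_target)]

    # Average log-likelihood of the expected empirical predictor.
    total = 0.0
    for probs, y in zip(source_probs, target_labels):
        total += math.log(sum(cond[y][z] * p for z, p in enumerate(probs)))
    return total / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    uninformative = [[0.5, 0.5]] * 4
    labels = [0, 0, 1, 1]
    print(leep_score(aligned, labels), leep_score(uninformative, labels))
```

The score is an average log-likelihood, so it is at most zero; values closer to zero suggest that the source model's outputs are more informative about the target labels.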

3. PROBLEM FORMULATION

We aim to solve the following problem: How to design an SSCL framework, such that it is able to (1) minimize the performance gap between using partially labeled data and fully labeled data per task at low cost, (2) allow cumulative knowledge transfer to assist handling of unlabeled data, and (3) extend from arbitrary existing supervised CL frameworks?

Formally, in an SSCL problem, a lifelong learner will face a sequence of classification tasks $\{Z^{(1)}, Z^{(2)}, \ldots\}$, with each task $Z^{(i)}$ having data space $\mathcal{X}^{(i)} \subseteq \mathbb{R}^{d_i}$, label space $\mathcal{Y}^{(i)} = \{1, \ldots, c_i\}$, and a joint distribution $\mathcal{D}^{(i)}: \mathcal{X}^{(i)} \times \mathcal{Y}^{(i)} \to [0, 1]$ governing the data-label pairs. For task $Z^{(i)}$, we are given $n_L^{(i)}$ labeled training data $(X_L^{(i)}, y_L^{(i)}) \sim \mathcal{D}^{(i)}$, $n_U^{(i)}$ unlabeled training data $X_U^{(i)} \sim \mathcal{X}^{(i)}$ and $n_T^{(i)}$ testing data $(X_T^{(i)}, y_T^{(i)}) \sim \mathcal{D}^{(i)}$.

From this setting, we consider a two-part SSCL framework consisting of an upstream pseudo-labeler and a downstream supervised CL module that are isolated from each other's processes. Specifically, at task $Z^{(i)}$, a pseudo-labeling function $\pi^{(i)}: \mathcal{X}^{(i)} \to \mathcal{Y}^{(i)}$ is first trained. Then, the unlabeled data $X_U^{(i)}$ will be labeled with $\hat{y}_U^{(i)}$ by this function. Next, labeled data $(X_L^{(i)}, y_L^{(i)})$ and pseudo-labeled data $(X_U^{(i)}, \hat{y}_U^{(i)})$ will be given to a supervised CL module for continual learning. With this design, users are able to plug-and-play existing well-developed supervised CL tools and augment them into SSCL, addressing sub-problem (3). The design of the pseudo-labeling procedure addresses sub-problems (1) and (2). Intuitively, the higher the pseudo-label quality, the smaller the gap is between using partially labeled and fully labeled data (as if each task $Z^{(i)}$ has all $n_L^{(i)} + n_U^{(i)}$ labeled). Hence, (1) asks for a labeler that can ideally approach perfect labeling by learning every underlying distribution $\mathcal{D}^{(i)}$, preferably with guaranteed label quality, and low time and memory overhead.
Furthermore, (2) asks for a cumulative knowledge base and its input/output algorithm specifically for the pseudo-labeler. In the next section, we provide a concrete framework that meets these requirements. 
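The two-part design can be sketched as a single per-task routine that first trains a pseudo-labeler and then hands the merged data to any supervised CL update. The sketch below is only an illustration under stated assumptions: `run_task`, `train_pseudo_labeler`, `cl_update`, and the toy nearest-mean labeler are hypothetical names and stand-ins, not the components used in DP-SSCL.

```python
def run_task(labeled_x, labeled_y, unlabeled_x, train_pseudo_labeler, cl_update):
    """Process one task: pseudo-label the unlabeled data, then update the
    supervised CL module on labeled plus pseudo-labeled data."""
    # Upstream: train pi^(i) and label the unlabeled data.
    labeler = train_pseudo_labeler(labeled_x, labeled_y, unlabeled_x)
    pseudo_y = [labeler(x) for x in unlabeled_x]
    # Downstream: the CL module only ever sees (data, label) pairs, so any
    # supervised CL method can be plugged in as cl_update.
    cl_update(list(labeled_x) + list(unlabeled_x), list(labeled_y) + pseudo_y)
    return pseudo_y


def nearest_mean_labeler(xs, ys, _unlabeled):
    """Toy stand-in for a pseudo-labeler on 1-D data (illustration only)."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


if __name__ == "__main__":
    seen = []  # records what the (stub) CL module receives
    labels = run_task([0.0, 1.0], [1, 2], [0.1, 0.9, 0.2],
                      nearest_mean_labeler,
                      lambda X, y: seen.append((X, y)))
    print(labels, seen)
```

Keeping the CL module behind a plain callable is one way to express the isolation between the two parts: swapping the supervised CL method does not require touching the pseudo-labeler.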

4.1. META INITIALIZATION

In a typical CL setting, the tasks are not all available at once, but we are able to assume some general knowledge about the type of tasks (Liu, 2017; Ruvolo & Eaton, 2013a; Silver et al., 2013). Examples include the input dimensions and a latent space that encodes representative features. Therefore, we perform a meta initialization step to allow users to input this prior information before the first task. Such prior knowledge includes a training hyperparameter search space, in which hyperparameters such as learning rate, batch size, and number of epochs for WLF training will be grid-searched later as new tasks arise. Other prior knowledge includes model templates, which define the architectures for WLFs; these are typically decision trees, regressors, or small neural networks not exceeding 5 layers or 1000 parameters. The WLF storage is initialized to be empty.

4.2. WLF INITIALIZATION VIA KNOWLEDGE TRANSFER

Upon the arrival of a new task, DP-SSCL first acquires a set of WLFs, each in the form of $f_{weak}: \mathcal{X}^{(i)} \to \mathcal{Y}^{(i)} \cup \{0\}$, where an additional label 0 means that the confidence of $f_{weak}$ on a data point is lower than a given threshold $t$, so that labeling is abstained. To produce WLFs that can ensemble to an accurate pseudo-labeler, we consider two sources of WLF initialization: (1) random sampling on the pre-defined model templates in meta initialization, leveraging the user's prior knowledge of the task and (2) transfer from previous tasks, leveraging knowledge accumulated throughout the continual learning process. Knowledge is represented in the form of previously committed WLFs, that is, WLFs that entered the pseudo-labeler ensembling procedure in previous tasks $Z^{(1)}, \ldots, Z^{(i-1)}$. Upon task $Z^{(i)}$, DP-SSCL evaluates the suitability of the previously committed WLFs for transfer to the current task using transferability measures, such as the LEEP or OTCE transferability score introduced in Section 2.4, and selects from the cached WLFs using the procedure shown in Algorithm 1. Algorithm 1 returns one of the previously committed WLFs to initialize a new weak labeling function. In this algorithm, the suitability score and the committed WLF selection criteria can be designed based on prior data or task knowledge. For example, one can use only the task transferability score to sample tasks relevant to the current task, and then select a weak labeling function at random by treating WLFs of the sampled task uniformly. On the other hand, it is also possible to consider the labeling accuracy of the earlier WLFs in addition to task transferability for the selection of WLFs. Since there is no guarantee that all previously encountered tasks are closely related to the current task, a suitability score threshold $\phi$ excludes negatively correlated tasks from the pool of WLFs for transfer. Similarly, we introduce another parameter, the probability $\rho$ of initializing a WLF by transfer, to control the ratio of transferred WLFs to newly generated WLFs for the current task, which allows some WLFs to be initialized from scratch to maintain diversity in the WLF pool.

Algorithm 1 WLF Transfer

Input: WLF storage $W$, CL model $M$, and the current task data $(X_L^{(i)}, y_L^{(i)})$
Parameter: Suitability score threshold $\phi$
Output: A selected WLF $f_{weak}$ in the storage $W$
1: $s_{WLF}$ ← computeSuitabilityScore($W$, $M$, $X_L^{(i)}$, $y_L^{(i)}$)
2: $p_{WLF}$ ← convertSelectionProbability($s_{WLF}$, $\phi$)
3: $f_{weak}$ ← randomSelection($W$, $p_{WLF}$)
4: return $f_{weak}$
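One possible reading of Algorithm 1, together with the transfer probability ρ, is sketched below in Python. The details are assumptions rather than the authors' implementation: suitability scores are taken as precomputed, non-negative numbers (for instance, a transformed transferability score), scores below the threshold are zeroed, and the remaining scores are normalized into selection probabilities.

```python
import random


def convert_selection_probability(scores, threshold):
    """Zero out scores below the threshold and normalize the rest.
    Returns None when no stored WLF passes the threshold."""
    if any(s < 0 for s in scores):
        raise ValueError("this sketch assumes non-negative suitability scores")
    kept = [s if s >= threshold else 0.0 for s in scores]
    total = sum(kept)
    if total == 0:
        return None
    return [k / total for k in kept]


def transfer_wlf(storage, scores, threshold, rng):
    """Algorithm 1 (sketch): sample one stored WLF by suitability."""
    probs = convert_selection_probability(scores, threshold)
    if probs is None:
        return None
    return rng.choices(storage, weights=probs, k=1)[0]


def initialize_wlf(storage, scores, threshold, rho, new_wlf, rng):
    """With probability rho try transfer; otherwise (or if nothing is
    suitable) create a fresh WLF from a template via new_wlf()."""
    if storage and rng.random() < rho:
        transferred = transfer_wlf(storage, scores, threshold, rng)
        if transferred is not None:
            return transferred
    return new_wlf()


if __name__ == "__main__":
    rng = random.Random(0)
    stored = ["wlf_a", "wlf_b", "wlf_c"]  # placeholders for stored WLFs
    print(initialize_wlf(stored, [0.9, 0.1, 0.5], threshold=0.4, rho=0.5,
                         new_wlf=lambda: "fresh_wlf", rng=rng))
```

Setting the threshold above every score, or ρ to zero, disables transfer entirely, which corresponds to initializing all WLFs from the model templates.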

4.3. WLF TRAINING AND PRUNING

After initialization, the framework obtains a set of both transferred and randomly generated WLFs, which are fine-tuned on the labeled data of the current task. The training hyperparameters such as learning rate, batch size, and number of epochs are selected from within the bounds specified during the meta initialization phase. The trained WLFs then pass through a Snuba pruner (Varma & Ré, 2016), which commits only the top functions ranked by the score computed by equation 1. To improve the diversity of the WLFs, we adopt bootstrapping on the training data, such that each WLF is trained on a randomly selected subset of the labeled data, with the bootstrapped size specified during meta initialization. The initialization and training procedure is repeated until either a maximum number of committed WLFs is reached, or the condition in equation 3 would be violated in the next iteration. Consequently, although we include transferred functions in this procedure, our framework still maintains the Snuba guarantee on pseudo-labeler quality. A detailed proof of how DP-SSCL maintains this guarantee in the continual learning setting is presented in Appendix A.
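The bootstrapping step can be illustrated with a short Python sketch that draws the training indices for one WLF. Sampling with replacement is an assumption here (the text only states that a random subset of a specified size is used), and `bootstrap_indices` is a hypothetical helper name.

```python
import random


def bootstrap_indices(num_labeled, sample_size, rng):
    """Draw sample_size indices into the labeled set, with replacement
    (an assumption), so that each WLF sees a different training subset."""
    if num_labeled <= 0 or sample_size <= 0:
        raise ValueError("sizes must be positive")
    return [rng.randrange(num_labeled) for _ in range(sample_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes: 108 labeled points, 50 drawn per WLF.
    for wlf_id in range(3):
        subset = bootstrap_indices(108, 50, rng)
        print(wlf_id, sorted(set(subset))[:5])
```

Giving each WLF its own draw is what encourages diversity in the committed set, which the Jaccard term in equation 1 then rewards.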

4.4. PSEUDO-LABELER ENSEMBLING AND CONTINUAL MODEL UPDATE

The committed WLFs of task $Z^{(i)}$ enter ensembling, where different aggregators can be used. For instance, majority voting is a simple method to combine WLFs, as is repeated labeling (Ipeirotis et al., 2014). More advanced ensembling methods are available, such as training a Snorkel (Ratner et al., 2016a) generative model $\pi^{(i)}$ in the form of equation 2. We empirically evaluate different ensembling techniques in an ablation study in Section 5.2. Via the ensembled labeler, the framework obtains pseudo-labels $\hat{y}_U^{(i)}$. After ensembling, if the confidence of the pseudo-labels is available, the pseudo-labels can be further adjusted by confidence calibration (Platt, 1999; Guo et al., 2017b). Last, together with labeled data $(X_L^{(i)}, y_L^{(i)})$, the pseudo-labeled data $(X_U^{(i)}, \hat{y}_U^{(i)})$ enter a supervised continual learning module, which updates its model to learn a function of the form $f: \bigcup_{j=1}^{i} \mathcal{X}^{(j)} \to \bigcup_{j=1}^{i} \mathcal{Y}^{(j)}$. For the underlying model, our framework supports many state-of-the-art continual learning approaches, such as DF-CNN (Lee et al., 2019), TF (Bulat et al., 2020), and DEN (Yoon et al., 2018), which we demonstrate below.
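Of the aggregators mentioned above, majority voting is the simplest to write down. The sketch below assumes that label 0 denotes abstention (as in Section 4.2) and breaks ties by choosing the smallest label, which is an arbitrary choice made only for this illustration.

```python
from collections import Counter

ABSTAIN = 0  # label 0 marks an abstaining WLF


def majority_vote(votes):
    """Combine the votes of all committed WLFs for one data point."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every WLF abstained
    top = max(counts.values())
    # Tie-break by smallest label (an assumption of this sketch).
    return min(label for label, c in counts.items() if c == top)


def ensemble(vote_matrix):
    """vote_matrix[n][k] is the label from WLF k on unlabeled point n."""
    return [majority_vote(row) for row in vote_matrix]


if __name__ == "__main__":
    votes = [[1, 1, 2], [0, 0, 0], [2, 0, 1], [2, 2, 0]]
    print(ensemble(votes))
```

Unlike a learned generative model, this aggregator weighs every WLF equally, so a few inaccurate WLFs can outvote an accurate one.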

5. EXPERIMENTAL EVALUATION

In this section, we first explain our instantiation of DP-SSCL from ablation study results on ensembling methods and WLF transfer. Then, we explore how DP-SSCL's lifelong performance with partially labeled data compares to that of supervised CL using fully labeled data sets, as well as how DP-SSCL performs compared to existing SSCL methods. More experiments and discussion are presented in Appendix D.

2010), CIFAR-10, and CIFAR-100 (Krizhevsky, 2009). For each task, we hold out 10% of labeled and unlabeled data for validation. We evaluate performance using (1) peak per-task performance metrics, (2) final task performance metrics, and (3) forgetting metrics, with forgetting metrics measured as backward transfer (Ruvolo & Eaton, 2013b; Lopez-Paz & Ranzato, 2017). The detailed experimental setup, including hyperparameter selection and evaluation metrics, is explained in the following subsections as well as in Appendix B.
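For reference, backward transfer can be computed from a matrix of per-task accuracies. The sketch below assumes the definition given by Lopez-Paz & Ranzato (2017), the average change in accuracy on each earlier task between the time it was learned and the end of the sequence; the matrix layout and function name are assumptions of this illustration, and the exact variant used in the experiments may differ.

```python
def backward_transfer(acc):
    """acc[t][i] is the test accuracy on task i after training through
    task t (both 0-indexed). Returns the mean of acc[T-1][i] - acc[i][i]
    over all tasks i except the last."""
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    diffs = [acc[num_tasks - 1][i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(diffs) / (num_tasks - 1)


if __name__ == "__main__":
    # Hypothetical accuracies for a three-task sequence.
    acc = [[0.90, 0.00, 0.00],
           [0.80, 0.85, 0.00],
           [0.70, 0.80, 0.90]]
    print(backward_transfer(acc))
```

Negative values indicate forgetting, and values near zero indicate that earlier tasks were retained.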

5.2. ABLATION STUDIES

The following two ablation studies are evaluated on validation data to select components for the main experiments in Section 5.3, which are then evaluated on separate testing data. More ablation studies are detailed in Appendix D.

Ensembling Methods We evaluate three methods of pseudo-labeler ensembling discussed in Section 4.4: majority voting, repeated labeling, and Snorkel. As shown in Table 1 (refer Due to the fact that the increment is small, we pick a high LEEP score threshold that results in no transfer for all the remaining experiments, without loss of generality. Although the accuracies are within the 95% confidence interval of the baseline, the improved CL performance suggests the positive potential of knowledge transfer in weak labeling functions.

5.3.1. SSCL PERFORMANCE

Based on the instantiation with the Snorkel ensembler and no LEEP transfer, we compare DP-SSCL to three state-of-the-art SSCL algorithms: CNNL (Baucum et al., 2017), ORDisCo (Wang et al., 2021), and DistillMatch (Smith et al., 2021). To enable fair comparisons, we replicate the experimental conditions and compare our approach to the best results reported in each algorithm's original publication. Specifically, we replicate the instance-incremental learning experiments of CNNL on MNIST and CIFAR-10, and the class-incremental learning experiments of ORDisCo and DistillMatch on CIFAR-10 and CIFAR-100, respectively. The replicated experimental protocols are summarized below; please refer to the original papers for detailed setups.

For instance-incremental experiments, all tasks are present at every epoch, but subsequent epochs contain different batches of unlabeled data. Each experiment was run over 10 random seeds. As shown in Table 2, DP-SSCL achieves 90% of the final testing accuracy obtained with fully labeled data, which is comparable to CNNL on MNIST and higher on CIFAR-10. Moreover, on both benchmarks DP-SSCL has comparable or higher sample efficiency measured in batches to saturation, defined as the first epoch at which a 3-batch sliding-window average of accuracy meets or exceeds the final accuracy.

For class-incremental experiments, the model is sequentially presented with tasks containing new sets of classes. Each experiment is run with 10 random seeds. As shown in Table 3, when equipped with a suitable supervised module (DF-CNN), DP-SSCL is able to achieve final testing accuracy comparable to that of supervised CL using fully labeled data, strictly exceeding that of ORDisCo and DistillMatch. In terms of catastrophic forgetting, we report backward transfer as a percentage, so it lies in the range [-100%, 100%]; the higher the value, the less forgetting occurs. We can see that DP-SSCL also produces forgetting similar to that obtained with fully labeled data.
This result shows that DP-SSCL is able to capture the properties of continually shifting data and tasks, and generate appropriate pseudo-labels for the lifelong learners. 
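The batches-to-saturation measure used in the instance-incremental comparison can be computed as in the sketch below. The reading adopted here is an assumption: the window is taken as the three most recent batches, the final accuracy is the last recorded value, and the returned number counts batches seen so far.

```python
def batches_to_saturation(acc_per_batch, window=3):
    """Return the number of batches seen when the trailing-window mean
    accuracy first meets or exceeds the final accuracy, or None if the
    sequence is shorter than the window."""
    if not acc_per_batch:
        return None
    final = acc_per_batch[-1]
    for end in range(window, len(acc_per_batch) + 1):
        mean = sum(acc_per_batch[end - window:end]) / window
        if mean >= final:
            return end
    return None


if __name__ == "__main__":
    # Hypothetical accuracy trace over six batches.
    print(batches_to_saturation([0.5, 0.75, 1.0, 1.0, 1.0, 1.0]))
```

A smaller value means the learner reaches its final accuracy level after fewer batches, which is what higher sample efficiency refers to above.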

5.3.2. SSCL COMPUTATIONAL COST

We analyze the computational cost that DP-SSCL requires to leverage unlabeled data compared to state-of-the-art SSCL methods, shown in Table 4, where $m$ is the size of one data point, $b$ is the total number of data batches in a task, and $r$ is the number of iterations in CNNL's pseudo-labeling loop. Each scalar parameter in a machine learning model is assumed to be stored as a 4-byte floating-point number. We conclude that DP-SSCL has small overhead in terms of both memory and time for unlabeled data processing, which we elaborate on below.

For memory, CNNL maintains a constant-length queue of data, such that the queue size is proportional to the size of a data point (Gao et al., 2021). On the other hand, DP-SSCL utilizes small weak labelers at the 0.1 MB level, with 0.083 MB and 0.108 MB for the MNIST and CIFAR-10/100 benchmarks, respectively. We set a buffer of at most 25 weak labelers and reallocate the buffer upon each task, so the overhead is 0.083 × 25 = 2.075 MB for MNIST and 0.108 × 25 = 2.7 MB for CIFAR-10/100.

Since the existing SSCL papers do not present timing measures, we compare them using time complexity. Upon every batch of data, the CNNL algorithm repeatedly labels the unlabeled data for an unbounded number of iterations, denoted as $r$, until it is confident of the labels. For ORDisCo, DistillMatch and DP-SSCL, the G + D, OoD detector and weak labelers are trained on all data batches sequentially, so the time overhead is proportional to the number of batches. The conclusion is that DP-SSCL has either lower or the same time complexity to utilize unlabeled data for learning compared to existing SSCL tools.

Table 4, time complexity of unlabeled data processing: CNNL O(b × r); ORDisCo O(b); DistillMatch O(b); DP-SSCL O(b).

More experiments are run to show that DP-SSCL improves its performance with a larger amount of unlabeled data per task, maintains stable learning on an increasing number of tasks, and is sensitive to corrupted pseudo-labels. These studies are detailed in Appendix D.
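The memory figures for the weak labeler buffer follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and 1 MB is taken to be 10^6 bytes, which is an assumption of this illustration.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Size of a model whose scalars are stored as 4-byte floats,
    with 1 MB taken as 10**6 bytes (an assumption)."""
    return num_params * bytes_per_param / 1e6


def labeler_buffer_mb(per_labeler_mb, buffer_size=25):
    """Memory held by a buffer of weak labelers of equal size."""
    return per_labeler_mb * buffer_size


if __name__ == "__main__":
    for name, size_mb in [("MNIST", 0.083), ("CIFAR-10/100", 0.108)]:
        print(name, round(labeler_buffer_mb(size_mb), 3), "MB")
```

Because the buffer is reallocated upon each task, this overhead stays constant as the task sequence grows.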

6. CONCLUSION

We designed an SSCL framework, namely DP-SSCL, that leverages a DP-based pseudo-labeler and a supervised CL module in a cascaded manner with a feedback loop. This design allows us to obtain high-quality pseudo-labels, shrinking the performance gap between SSCL and supervised CL. Our framework also shows how CL can improve DP by allowing knowledge transfer from previous tasks to improve pseudo-labeling quality based on transferability metrics. Furthermore, the framework is compatible with many existing mature supervised CL approaches, enabling trivial extension from supervised to semi-supervised CL. Our ablation studies show that the framework's performance depends on component selection, and future research should explore different component settings more thoroughly. Nevertheless, experiments show that DP-SSCL achieves high lifelong learning accuracy and low forgetting, approaching that of supervised CL on fully labeled data, and outperforming existing SSCL approaches with up to 25% higher final accuracy and lower forgetting, with only 1% of the memory overhead at the same time complexity.

REPRODUCIBILITY STATEMENT

Our experiment code is available in an anonymous GitHub repository (https://github.com/dpsscl-anon/DPSSCL). All readers are welcome to replicate our experiments.

A PROOF OF SNUBA THEORETICAL GUARANTEE IN A LIFELONG SETTING

This section extends Snuba's guaranteed performance, equation 3 in the main paper, to lifelong setting. For completeness, we start by stating proposition: Proposition 1 (Snuba's guarantee (Varma & Ré, 2016) , adapted to a lifelong setting). Consider consecutive tasks Z (1) , Z (2) , . . . for which Snuba was used to obtain sets of committed weak labelers F (1) , F (2) , . . . for all tasks with corresponding empirical accuracies on X (i) L as a vector a (i) L . For each task, before ensembling, a factor graph-based generative model trains a set of fine-tuned weak labelers from F (i) , denoted as i) . If each labeler labels a minimum of F (i) and | F (i) | = |F (i) |, that have accuracies on X (i) L as ã(i) L and on X (i) U as ã(i) U , with a (i) L , ã(i) L , ã(i) U ∈ R |F (i) | and ã(i) U is unknown. We have a measured error ||a (i) L - ã(i) L || ∞ ≤ ϵ ( d (i) ≥ 1 2(γ -ϵ (i) ) 2 log 2|F (i) | 2 δ data points in X (i) L with above some given confidence threshold ν for all iterations, we can guarantee that ||ã (i) U - ã(i) L || ∞ < γ for all iterations and all tasks with probability 1 -δ. Proof. We know that Snuba ensures the following bound on the labeling performance of an individual task Z (i) learned in isolation; the following guarantee is a restatement from the original Snuba publication (Varma & Ré, 2016) , modified only to include superscripts for the task index i: Proposition 2 (Snuba's guarantee (Varma & Ré, 2016) for an individual task Z (i) ). Given a set F (i) of committed weak labelers by Snuba for task Z (i) , denote their empirical accuracies on X (i) L as a vector a (i) L . Before ensembling, a factor graph-based generative model trains a set of fine-tuned weak labelers from F (i) , denoted as F (i) and | F (i) | = |F (i) |, that have accuracies on X (i) L as ã(i) L and on X (i) U as ã(i) U , with a (i) L , ã(i) L , ã(i) U ∈ R |F (i) | and ã(i) U is unknown. We have a measured error ||a (i) L - ã(i) L || ∞ ≤ ϵ. 
If each labeler labels a minimum of $d^{(i)} \ge \frac{1}{2(\gamma - \epsilon^{(i)})^2} \log\left(\frac{2|F^{(i)}|^2}{\delta}\right)$ data points in $X_L^{(i)}$ with confidence above some given threshold $\nu$ for all iterations, we can guarantee that $\|\tilde{a}_U^{(i)} - \tilde{a}_L^{(i)}\|_\infty < \gamma$ for all iterations with probability $1 - \delta$.

See (Varma & Ré, 2016) for the proof of Proposition 2. We now need to show that this same guarantee holds in a lifelong setting. Assume we have a sequence of $T$ consecutive tasks $Z^{(1)}, Z^{(2)}, \ldots, Z^{(T)}$. Since the first task $Z^{(1)}$ is learned in isolation to produce the set of weak labelers $F^{(1)}$ using Snuba, by Proposition 2 we know that we have a measured error for the first task of $\|a_L^{(1)} - \tilde{a}_L^{(1)}\|_\infty \le \epsilon^{(1)}$ and that $\|\tilde{a}_U^{(1)} - \tilde{a}_L^{(1)}\|_\infty < \gamma$ for all iterations with probability $1 - \delta$.

Let us assume that the bound holds for each $F^{(i)}$ after learning $Z^{(1)}, Z^{(2)}, \ldots, Z^{(T)}$; we will show that this bound also holds for task $Z^{(T+1)}$. From tasks $Z^{(1)}, \ldots, Z^{(T)}$, Snuba has learned a set of weak labelers $F = \bigcup_{i=1}^{T} F^{(i)}$. When creating the set of weak labelers $F^{(T+1)}$ incrementally, Snuba has two choices at each iteration $j$. Either it can add to $F^{(T+1)}$ an existing weak labeler $f \in F - F^{(T+1)}$, or it can add a previously unused weak labeler $f' \in \mathcal{U} - F - F^{(T+1)}$, where $\mathcal{U}$ is the pool of all candidate weak labelers. If Snuba chose $f'$ over $f$, then weak labeler $f'$ was assigned a higher score than all other $f \in F$. Similarly, if Snuba chose a particular $f \in F$ to add, then that $f$ was assigned a higher score than all others. Let $f_j$ be the weak labeler chosen to add to $F^{(T+1)}$ at iteration $j$, either $f$ or $f'$, which we know has the maximum score of all weak labelers in $\mathcal{U} - F^{(T+1)}$. Snuba obtains an accuracy on the labeled data set $X_L^{(T+1)}$ of

$\hat{a}_{L,j}^{(T+1)} = \frac{1}{|X_{L,j}^{(T+1)}|} \sum_k \mathbb{1}\left(y_k^{(T+1)} = \hat{y}_k^{(T+1)}\right)$,

where $X_{L,j}^{(T+1)} \subseteq X_L^{(T+1)}$ is the subset on which $f_j$ achieves a confidence greater than or equal to the confidence threshold $\nu$ on each data point, $y_k^{(T+1)}$ is the true label, and $\hat{y}_k^{(T+1)}$ is the label predicted by the weak labeler $f_j$. Varma & Ré (2016) (see their Equation 4) show that the probability of Snuba's failure to maintain $\|\tilde{a}_U^{(T+1)} - \tilde{a}_L^{(T+1)}\|_\infty < \gamma$ in one iteration is

$\Pr\left[\|\tilde{a}_U^{(T+1)} - a_L^{(T+1)}\|_\infty + \epsilon^{(T+1)} \ge \gamma\right] \le 2|F^{(T+1)}| \exp\left(-2(\gamma - \epsilon^{(T+1)})^2 \min\left(|X_{L,1}^{(T+1)}|, \ldots, |X_{L,j}^{(T+1)}|\right)\right)$.

Following (Varma & Ré, 2016) and applying the union bound over the sequence of iterations to bound the probability of failure over all iterations used to acquire $F^{(T+1)}$, we obtain

$\delta \le 2|F^{(T+1)}|^2 \exp\left(-2(\gamma - \epsilon^{(T+1)})^2 d^{(T+1)}\right)$,

where $d^{(T+1)} = \min\left(|X_{L,1}^{(T+1)}|, \ldots, |X_{L,j}^{(T+1)}|\right)$, and so consequently $\|\tilde{a}_U^{(T+1)} - \tilde{a}_L^{(T+1)}\|_\infty < \gamma$ similarly holds for $Z^{(T+1)}$ for all iterations used to obtain $F^{(T+1)}$ with probability $1 - \delta$. By induction, this holds for the entire lifelong sequence.

To transform these benchmarks into the lifelong setting, we created tasks as described in Table 5. For example, to re-create the binary MNIST task sequence, one first arranges the 45 tasks as {0 vs 1, 0 vs 2, ..., 8 vs 9}. Then, the labeled training data, unlabeled training data, and testing data are split into 120/11880/2000 per task, ensuring that all data splits have balanced classes. We then hold out 10% of the training data for validation. That is, only 108 labeled and 10692 unlabeled data points participate in training, with 12 labeled and 1188 unlabeled data points used for validation. Finally, we perform the procedure described in Section 4 to generate pseudo-labels and train a lifelong learner.

Figure 3: Pseudo-labeling accuracies produced by the Snorkel-ensembled labeler.

To train WLFs for data programming, the hyperparameters representing the WLF architectures, learning rate, and Snuba configurations are continually fine-tuned within the given search spaces. Since the hyperparameters keep adapting to the current task even throughout a single lifelong sequence, we report these hyperparameters and their constrained search spaces in Table 5 and Figure 4. The search spaces were inspired by previous work on CNN designs (Lecun et al., 1998; He et al., 2016). Nonetheless, the final architectures are much smaller. Figure 3 presents the pseudo-labeling accuracy of the Snorkel-ensembled pseudo-labeler (Ratner et al., 2016a), which is the best labeler according to our ablation studies. The mean accuracies on binary MNIST, binary CIFAR-10, and 5-way MNIST are all around 90%. Although the accuracies on 10-way CIFAR-10 and 5-way CIFAR-100 are only 60-65% because these tasks are harder, we can still outperform existing SSCL frameworks, as demonstrated in Section 5.3.
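The binary MNIST task construction and hold-out arithmetic described above reduce to simple counting, which the following Python sketch reproduces. It is only an illustration under stated assumptions: the function names `binary_task_pairs` and `holdout_split` are hypothetical, and the sketch covers the task ordering and the 10% hold-out sizes only, not class balancing or data loading.

```python
from itertools import combinations


def binary_task_pairs(num_classes=10):
    """Enumerate all class pairs in the order 0 vs 1, 0 vs 2, ..., 8 vs 9."""
    return list(combinations(range(num_classes), 2))


def holdout_split(n, val_percent=10):
    """Return (n_train, n_val) when val_percent of n points are held out.

    Integer arithmetic is used so the counts are exact.
    """
    n_val = n * val_percent // 100
    return n - n_val, n_val


pairs = binary_task_pairs()          # 45 binary tasks for a 10-class data set
labeled = holdout_split(120)         # labeled training data per task
unlabeled = holdout_split(11880)     # unlabeled training data per task
print(len(pairs), labeled, unlabeled)
```

With the per-task sizes quoted in the text, the sketch recovers the 108/12 labeled and 10692/1188 unlabeled train/validation counts.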
We then detail our metric selection for the experiments. The performance of CL can be quantitatively measured by (1) peak per-task performance metrics, (2) final task performance metrics, and (3) forgetting metrics. After training on task $Z^{(i)}$, we measure the accuracy $a_j^{(i)}$ of the updated model when evaluated on the testing data of all known tasks $\{Z^{(j)} : j \le i\}$. Peak per-task accuracy (Lee et al., 2019) at task $Z^{(i)}$ measures the average accuracy upon the first encounter of each task: $\tilde{a}_i = \frac{1}{i} \sum_{j=1}^{i} a_j^{(j)}$, and final accuracy (Lopez-Paz & Ranzato, 2017) at task $i$ measures the average performance of the current model on all tasks seen so far: $\bar{a}_i = \frac{1}{i} \sum_{j=1}^{i} a_j^{(i)}$. The retention of knowledge is measured by backward transfer (Ruvolo & Eaton, 2013b; Lopez-Paz & Ranzato, 2017) at task $Z^{(i)}$ as $bt_i = \frac{1}{i-1} \sum_{j=1}^{i-1} \left(a_j^{(i)} - a_j^{(j)}\right)$. Note that $bt_i \in [-1, 1]$ is the negative of the forgetting metric from (Chaudhry et al., 2019), where positive values indicate improvement and negative values indicate forgetting. In addition to the lifelong learning metrics, we measure the pseudo-labeling accuracy and the system overhead of DP-SSCL in terms of memory cost. We also report the proportion of weak labelers using knowledge of earlier tasks and the transferability score to analyze the effect of knowledge sharing on data programming.
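To make the three lifelong metrics concrete, the sketch below computes them from a lower-triangular table of accuracies. The data layout is an assumption made for illustration: `acc[t][j]` holds the accuracy on task $j+1$ measured after training on task $t+1$ (zero-indexed), which corresponds to $a_{j+1}^{(t+1)}$ in the notation above. The function names and the example numbers are hypothetical and are not results from the paper.

```python
def peak_per_task_accuracy(acc, i):
    """Average accuracy of each of the first i tasks right after it was learned."""
    return sum(acc[j][j] for j in range(i)) / i


def final_accuracy(acc, i):
    """Average accuracy of the model trained through task i on tasks 1..i."""
    return sum(acc[i - 1][j] for j in range(i)) / i


def backward_transfer(acc, i):
    """Average change on tasks 1..i-1 between first encounter and after task i.

    Negative values indicate forgetting. Requires i >= 2.
    """
    if i < 2:
        raise ValueError("backward transfer needs at least two tasks")
    return sum(acc[i - 1][j] - acc[j][j] for j in range(i - 1)) / (i - 1)


# Made-up accuracies for three tasks (row t: model after training on task t+1).
acc = [
    [0.90],
    [0.80, 0.95],
    [0.70, 0.90, 0.85],
]
peak = peak_per_task_accuracy(acc, 3)
final = final_accuracy(acc, 3)
bt = backward_transfer(acc, 3)
print(peak, final, bt)
```

In this toy table the model loses accuracy on the first two tasks after learning the third, so the backward transfer is negative, matching the sign convention stated above.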

C ADDITIONAL EXPERIMENT ANALYSIS OF SEMI-SUPERVISED CONTINUAL LEARNING

This appendix details and provides additional results that complement Section 5.2 and Section 5.3.1. We start with the ablation studies that are not covered in the main paper.

[Figure panels: accuracy for unlabeled data sizes $n_U^{(i)}$ = 0, 30, 60, 120, 240, 360 and $n_U^{(i)}$ = 0, 200, 400, 800, 1200 per task.]

Sizes of Unlabeled Data

We measure the performance of continual learners with the ground-truth and generated labels on different supervised methods while varying the quantity of unlabeled training data. Table 6 summarizes the empirical results, providing details in addition to Figure 5. Figure 6 visualizes the learning curves of final accuracy in these experiments, showing stable performance of lifelong learners regardless of the size of unlabeled data, number of tasks, and supervised modules. Peak per-task accuracy of both continual models increases with more DP-SSCL-labeled data. This supports the capability of DP-SSCL to capture the data distribution and assign representative labels. It is well documented that continual models typically have increased negative transfer (i.e., interference or catastrophic forgetting) when trained on more data. However, this increase in negative transfer is not amplified by using DP-SSCL-generated labels in place of true labels: the continual performance with DP-SSCL-generated pseudo-labels achieves at least 96% of the continual performance obtained when models are trained on the same training data with true labels. Consequently, DP-SSCL maintains stable performance as the amount of unlabeled data grows. In real-world deployment, labeled data has limited availability but unlabeled data is easy to collect. Therefore, DP-SSCL's scalability supports practical continual learning.

Number of Tasks

Figure 6 also demonstrates that DP-SSCL preserves relatively stable continual learning performance as the number of tasks increases, under different supervised CL modules and task sequences. This result implies that the quality of pseudo-labels generated by DP-SSCL is stable with respect to the number of tasks.

Pseudo-label Noise Levels

We also examined the effect of introducing noise into DP-SSCL's label generation process by randomly corrupting a portion of the generated labels among the unlabeled data.
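One way to picture this corruption step is the Python sketch below. It is an assumption-laden illustration rather than the procedure used in the experiments: the text does not say how a corrupted label is chosen, so the sketch assumes each selected label is replaced by a different class drawn uniformly at random, and the name `corrupt_labels` and its parameters are hypothetical.

```python
import random


def corrupt_labels(labels, noise_rate, num_classes, seed=0):
    """Return a copy of labels in which a noise_rate fraction has been changed.

    Assumption: each selected label is replaced by a different class chosen
    uniformly at random, so every selected label really becomes wrong.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = int(round(noise_rate * len(noisy)))
    for idx in rng.sample(range(len(noisy)), n_corrupt):
        wrong_classes = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(wrong_classes)
    return noisy


pseudo_labels = [0, 1] * 50                      # 100 made-up binary pseudo-labels
noisy_labels = corrupt_labels(pseudo_labels, noise_rate=0.2, num_classes=2)
n_changed = sum(a != b for a, b in zip(pseudo_labels, noisy_labels))
print(n_changed)
```

Because the replacement class always differs from the original, the number of changed labels equals the requested fraction of the data set.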
Figure 7 shows an inverse relationship between manually added pseudo-label noise and lifelong performance. This shows that CL is sensitive to inaccurate pseudo-labels and that more accurate labeling, for example through more diverse WLFs in DP, is necessary. We next detail the ablation study on ensembling methods in Section 5.2.

Ensembling Methods (More)

We measure the performance of two continual learners, TF (Bulat et al., 2020) and DF-CNN (Lee et al., 2019), on the binary MNIST and binary CIFAR-10 experiments (45 binary classification tasks of pairs of classes such as 0 vs 1, 0 vs 2, ..., 8 vs 9). We ran the MNIST and CIFAR-10 experiments over 10 and 5 random seeds and task sequences, respectively. Table 7 shows the peak per-task accuracy, final accuracy, and backward transfer with respect to the pseudo-labeling methods and the amount of training data. We see that Snorkel obtains consistently strong performance across continual learning algorithms, data sets, and amounts of unlabeled training data. Finally, we discuss the learning curves in these experiments. More detailed experiment result statistics are listed in Tables 8, 9, 10, and 11.

Learning Curve

We include learning curves for the instance-incremental semi-supervised lifelong experiments (Figure 8) and the class-incremental semi-supervised lifelong experiments (Figure 9). All training curves show final accuracy at task $Z^{(i)}$ as defined in Appendix B, where task $Z^{(i)}$ is the current task being trained. For easier tasks with high labeling accuracy, such as CIFAR-10, the DP-SSCL-enabled semi-supervised continual learning methods perform similarly to the equivalent



Figure 1: The DP-SSCL framework, with our contributions marked in red and adopted modules in blue.

Figure 2: SSCL performance when transferring WLFs of earlier tasks.

Figure 4: WLF model templates at meta initialization of. Notations: out c: output channels/number of filters, k: size (width and height) of a filter, p: padding and s: stride.

Figure 5: SSCL performance on different sizes of unlabeled data per task.

Figure 6: Learning Curve Comparisons. Dotted lines display accuracy of continual learning models trained on unlabeled data X U with true labels (TrueL).

Figure 7: Semi-supervised DF-CNN with three levels of DP-SSCL label noise, showing mean final accuracy.

Figure 8: Instance-incremental semi-supervised vs fully supervised learning curve comparisons

Comparison of different pseudo-labeler ensembling methods.

Instance-incremental SSCL results.
SSCL Labeled: 90.0 ± 0.4, 3.3 ± 1.6, 54.2 ± 0.5, 26.5 ± 1.1
Fully Labeled: 99.0 ± 0.1, 17.0 ± 4.0, 57.6 ± 0.5, 27.1 ± 1.7

Class-incremental SSCL results.

A typical queue length is 1000, as used in their experiments, with each MNIST image costing 784 B and each CIFAR image 3072 B. ORDisCo, DistillMatch, and DP-SSCL all have a constant O(1) overhead once the architectures are fixed. Nevertheless, in implementation, ORDisCo requires storage of its generator (G) and discriminator (D), which costs approximately 100 MB for parameters, as estimated from the last table in (Li et al., 2017) reporting ORDisCo's architecture. DistillMatch requires storage for its OoD detector, which is 254 MB for a ResNet34-based DeConf network

Memory and time overhead to process unlabeled data in different SSCL methods.

Task configurations used in the experiments in Section 5 of the main paper, with one additional committed WLF per iteration and at most 25 committed WLFs in total. An interval [a, b] means the hyperparameter is searched within this interval. Note that 10% of the labeled and unlabeled data are held out for validation. Please refer to Figure 4 for the WLF architectures.

This section details the experimental setup used in Section 5 of the main paper. Our lifelong learning experiments use the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets (LeCun & Cortes, 2010; Krizhevsky, 2009).

Block structure of Architecture B, where block out c is the out c input into the block

Supervised continual learning on binary MNIST (top) and CIFAR-10 (bottom), showing mean ± standard deviation. Performance is assessed using three metrics, along with accuracy metrics relative to continual models trained on the same data with ground-truth labels instead. * Note: here, backward transfer is depicted as a percentage, and so is scaled by a factor of 100 to lie in the range [-100, 100], as compared to the metric definition in Appendix B.

Comparison between pseudo-labeling methods, showing mean final accuracy ± standard deviation. The training set for each task is a combination of the labeled data (120 and 400 images per task for MNIST and CIFAR-10, respectively) and specified quantity of unlabeled data. Models are trained in a continual learning setting with labels for the unlabeled data generated by one of three weak supervision methods.

Class-incremental Binary CIFAR-10 experiment results broken down by task, showing mean ± standard deviation

Class-incremental 5-way CIFAR-100 experimental results with the DF-CNN, broken down by task, showing mean ± standard deviation

Class-incremental 5-way CIFAR-100 experimental results with TF, broken down by task, showing mean ± standard deviation

Class-incremental 5-way CIFAR-100 experimental results with the DEN, broken down by task, showing mean ± standard deviation

D ADDITIONAL EXPERIMENT ANALYSIS OF DP-SSCL WITH KNOWLEDGE TRANSFER IN WEAK LABELERS

This appendix details and provides additional results that complement Section 5.2 with respect to the transfer of weak labeling functions across tasks. As described in Section 4.2, we can utilize the previously trained weak labelers for the current task if the current task is sufficiently similar to some earlier tasks. The key component is the suitability score in Algorithm 1, which determines which earlier tasks or which earlier WLFs are sufficiently related to the current task.

The main paper showed experimental results using the LEEP transferability score as the suitability score. In this appendix, we experiment with other suitability scores. Specifically, we compare the following similarity measures:

• The LEEP transferability score, as used in the original paper. LEEP provides task-level information on the similarity of earlier tasks to the current one.

• The LEEP + Snuba score. Instead of relying only on task transferability, this measure incorporates Snuba's s-score (Equation 1) to measure how earlier WLFs perform on the current task prior to fine-tuning. Specifically, we computed it as a weighted sum in which $w \in W$ is an individual WLF, $M$ is a CL model, and $\{\alpha_L, \alpha_F, \alpha_J\}$ are the weights of the sum. In this experiment, all terms were weighted identically ($\alpha_L = \alpha_F = \alpha_J = 1/3$), and the LEEP score was projected to the range [0, 1] for compatibility with the other terms via the exponential function ($(-\infty, 0) \to (0, 1)$).

• The OTCE transferability score. OTCE (Tan et al., 2021) measures the similarity of earlier tasks to the current one with respect to the distance between probability distributions.

• The OTCE + Snuba score. Similar to the LEEP + Snuba score, this measure incorporates the individual performance of earlier WLFs as well as task-wise relationships. We computed this score as a weighted sum in which $X_L^{(1:i-1)}$ and $y_L^{(1:i-1)}$ are labeled data of earlier tasks, and $\{\alpha_L, \alpha_F, \alpha_J\}$ are the weights of the sum. In this experiment, the OTCE score was projected to the range [0, 1] by linear transformation, and all terms were weighted identically ($\alpha_L = \alpha_F = \alpha_J = 1/3$).

We evaluated the different suitability scores in our approach on the CIFAR-10 dataset, using 45 binary classification tasks formed by all pairs of two image classes (see Table 5). As in the experiments from Section 5.2, DF-CNN is used as the continual learner. For both experiments, with $(n_L^{(i)}, n_U^{(i)}) = (400, 200)$ and $(n_L^{(i)}, n_U^{(i)}) = (50, 400)$, transferring WLFs makes the continual learner achieve equal or better peak per-task accuracy and final accuracy compared to DP-SSCL without WLF transfer. This experiment shows that this result is robust to the choice of suitability score.
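The LEEP + Snuba combination described above can be sketched as an equally weighted sum with an exponential projection of the LEEP score. This is a rough illustration under explicit assumptions: the exact form of the two Snuba-derived terms is not recoverable from this section, so the sketch treats them as opaque inputs `f_term` and `j_term` that already lie in [0, 1], and the function name `combined_suitability` is hypothetical.

```python
import math


def combined_suitability(leep, f_term, j_term,
                         alpha_l=1 / 3, alpha_f=1 / 3, alpha_j=1 / 3):
    """Weighted sum of a projected LEEP score and two WLF-level terms.

    Assumptions: leep is a log-likelihood-style score (at most 0), and
    f_term and j_term are placeholder WLF-level scores already in [0, 1].
    """
    if leep > 0:
        raise ValueError("LEEP is an average log-likelihood and must be <= 0")
    for term in (f_term, j_term):
        if not 0.0 <= term <= 1.0:
            raise ValueError("WLF-level terms are assumed to lie in [0, 1]")
    leep_projected = math.exp(leep)  # maps (-inf, 0] into (0, 1]
    return alpha_l * leep_projected + alpha_f * f_term + alpha_j * j_term


# Made-up inputs: a LEEP score of log(0.5) projects to 0.5.
score = combined_suitability(math.log(0.5), f_term=0.6, j_term=0.9)
print(score)
```

Projecting LEEP with the exponential keeps all three terms on the same [0, 1] scale, so equal weights give each term equal influence on the final score.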

