TOWARDS SEMI-SUPERVISED LEARNING WITH NON-RANDOM MISSING LABELS

Abstract

Semi-supervised learning (SSL) tackles the problem of missing labels by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions, resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by a Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the history information of each class transition caused by the pseudo-rectifying procedure to activate the model's enthusiasm for neglected classes, so that the quality of pseudo-labels on both popular and rare classes in MNAR can be improved. We show the superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL solutions by a large margin. Checkpoints and evaluation code are available at the anonymous link https://anonymous.4open.science/r/PRG4SSL-MNAR-8DE2 while the source code will be available upon paper acceptance.



Figure 1: The class distribution of the total data is balanced whereas the labeled data is unevenly distributed across classes. For better illustration, the y-axis has different scaling for labeled (blue) and unlabeled (green) data.

1. INTRODUCTION

Semi-supervised learning (SSL), which is in the ascendant, yields promising results in addressing the shortage of large-scale labeled data (Chapelle et al., 2009; Zhou, 2021; Van Engelen & Hoos, 2020). Current prevailing SSL methods (Lee et al., 2013; Berthelot et al., 2020; Sohn et al., 2020; Tai et al., 2021; Zhang et al., 2021) utilize the model trained on the labeled data to impute pseudo-labels for the unlabeled data, thereby boosting model performance. Although these methods have made exciting advances in SSL, they only work well in the conventional setting, i.e., when the labeled and unlabeled data fall into the same (balanced) class distribution. Once this setting is not guaranteed, the gap between the class distributions of the labeled and unlabeled data will lead to a significant accuracy drop in the pseudo-labels, resulting in strong confirmation bias (Arazo et al., 2019) which ultimately corrupts the performance of SSL models. The work of Hu et al. (2022) originally terms the scenario of the labeled and unlabeled data belonging to mismatched class distributions as label Missing Not At Random (MNAR) and proposes a unified doubly robust framework to train an unbiased SSL model in MNAR. It can easily be seen that in MNAR, either the labeled or the unlabeled data has an imbalanced class distribution; otherwise, it degrades to the conventional SSL setting. A typical MNAR scenario is shown in Fig. 1, in which the popular classes of the labeled data cause the model to ignore the rare classes, increasingly magnifying the bias in label imputation on the unlabeled data.
It is worth noting that although some recent SSL methods (Kim et al., 2020; Wei et al., 2021) are proposed to deal with class imbalance, they are still built upon the assumption of matched class distributions between the labeled and unlabeled data, and their performance inevitably declines in MNAR. MNAR is a more realistic scenario than the conventional SSL setting. In the practical labeling process, labeling all classes uniformly is usually not affordable because some classes are more difficult to recognize (Rosset et al., 2005; Misra et al., 2016; Colléony et al., 2017). Meanwhile, most automatic data collection methods also have difficulty ensuring that the collected labeled data is balanced (Mahajan et al., 2018; Hu et al., 2022). In a nutshell, MNAR is almost inevitable in SSL. In MNAR, the tricky troublemaker is the mismatched class distributions between the labeled and unlabeled data. Trained under MNAR, the model increasingly favors some classes, seriously affecting the pseudo-rectifying procedure. Pseudo-rectifying is defined as the change of the label assignment decision made by the SSL model for the same sample according to the knowledge learned at each new epoch. This process may cause a class transition, i.e., given a sample, its class prediction at the current epoch differs from that at the last epoch. In the self-training process of the SSL model driven by the labeled data, the model is expected to gradually rectify the pseudo-labels mispredicted for the unlabeled data in previous epochs. With pseudo-rectifying, a model trapped in the learning of extremely noisy pseudo-labels can be rescued by its ability to correct these labels. Unfortunately, the pseudo-rectifying ability of the SSL model can be severely perturbed in MNAR. Take the setting in Fig. 1 for example.
The model's "confidence" in predicting pseudo-labels for the labeled rare classes is attenuated by over-learning the samples of the labeled popular classes. Thus, the model fails to rectify pseudo-labels mispredicted as popular classes to the correct rare classes (even if the class distribution of the unlabeled data is balanced). As shown in Fig. 2b, compared with FixMatch (Sohn et al., 2020) trained in the conventional setting (Fig. 2a), FixMatch trained in MNAR (Fig. 1) shows significantly deteriorated pseudo-rectifying ability. Even after many iterations, the error rates of the pseudo-labels predicted for the labeled rare classes remain high. This phenomenon hints at the necessity of providing additional guidance to the rectifying procedure to address MNAR. Meanwhile, as observed in Fig. 2c, we notice that the mispredicted pseudo-labels for each class are often concentrated in a few classes, rather than scattered across all other classes. Intuitively, a class can easily be confused with the classes similar to it. For example, as shown in Fig. 2c, the "automobile" samples are massively mispredicted as the most similar class: "truck". Inspired by this, we argue that it is feasible to guide pseudo-rectifying at the class level, i.e., pointing out the latent direction of class transition based on the current class prediction only. For instance, given a sample classified as "truck", the model could be given a chance to classify it as "automobile" sometimes, and vice versa. Notably, our approach does not require predefined semantically similar classes. We believe that two classes are conceptually similar only if they are frequently misclassified as each other by the classifier. In this sense, we develop a novel definition of the similarity of two classes, which is directly determined by the model's output.
Even if there are no semantically similar classes, as long as the model makes incorrect predictions during training, class transitions still occur, which has seldom been investigated before. Our intuition can be regarded as perturbing some confident class predictions to preserve the pseudo-rectifying ability of the model. Such a strategy does not rely on the matched class distribution assumption and is therefore amenable to MNAR. Given the motivations above, we propose class transition tracking based Pseudo-Rectifying Guidance (PRG) to address SSL in MNAR, which is shown in Fig. 3. Our main idea is to dynamically track the class transitions caused by the pseudo-rectifying procedure at the previous epoch to provide class-level guidance for pseudo-rectifying at the next epoch. We argue that every class transition of each pseudo-label can become the cure for the deterioration of the pseudo-rectifying ability of traditional SSL methods in MNAR. A graph is first built on the class tracking matrix, which records each pseudo-label's class transitions occurring in the pseudo-rectifying procedure. Then we propose to model the class transitions by a Markov random walk, which brings information about the difference in the propensity to rectify pseudo-labels of one class into various other classes. Specifically, we guide the class transitions of each pseudo-label during the rectifying process according to the transition probability corresponding to the current class prediction. This probability is obtained from the transition matrix of the Markov random walk, which is rescaled at both the class level and the batch level. Moreover, the class prediction at the last epoch can also be introduced to guide the pseudo-rectifying process at the current epoch. PRG recalls classes that are easily overlooked but appear in the class transition history.
Such classes are deemed similar to the ground-truth and are given more chance to be assigned, rather than simply letting the model assign the classes it favors without hesitation. In this way, PRG helps improve the quality of pseudo-labels suffering from the biased imputation potentially caused by the mismatched distributions in MNAR. Because pseudo-rectifying is a spontaneous behavior of the model, moderately activating class transitions does not hinder the learning of the model in the traditional setting. PRG is evaluated on several widely-used SSL classification benchmarks, demonstrating its effectiveness in coping with SSL in MNAR. To help understand our paper, we summarize it with the following questions and answers.

• What is the novelty and contribution? Towards addressing SSL in MNAR, we propose class transition tracking based Pseudo-Rectifying Guidance (PRG) to mitigate the adverse effects of mismatched distributions by combining information from the class transition history. We propose that pseudo-rectifying guidance can be carried out at the class level, by modeling the class transition of the pseudo-label as a Markov random walk on a graph.

• Why does our method work for MNAR? In MNAR, awareness of the rare classes plays a key role: PRG enables the model to preserve a certain probability of generating class transitions to rare classes when assigning pseudo-labels. This form of probability based on class transition history produces effective results because class transition tracking preserves every attempt of the model to identify the rare classes (such attempts would otherwise be slowly buried by over-learning of the popular classes). Thereby, PRG helps the model keep trying to identify rare classes with a certain probability. This corresponds to our soft pseudo-label strategy, where we adjust the probability distribution of soft labels so that the model can assign pseudo-labels to rare classes with a clear purpose.

• How about the performance improvement?
Our solution is computation and memory friendly, introducing no additional network components. PRG achieves superior performance in the MNAR scenarios under various protocols, e.g., it outperforms CADR (Hu et al., 2022), a newly-proposed method for addressing MNAR, by up to 15.11% in accuracy on CIFAR-10. Besides, we show that the performance of PRG is also competitive in the traditional SSL setting.

2. RELATED WORK

Semi-supervised learning (SSL) is a promising paradigm to address the shortage of labeled data by effectively utilizing both labeled and unlabeled data. Given an input x (labeled or unlabeled), the objective in SSL can be described as learning a predictor that generates a label y for it. In conventional SSL settings (Berthelot et al., 2020; Sohn et al., 2020; Li et al., 2021; Zhang et al., 2021) and imbalanced SSL (Wei et al., 2021), underlying most methods is the assumption that labeled and unlabeled data are matched and balanced. Some more practical scenarios for SSL are now extensively discussed. Recently, some work has focused on addressing the class-imbalance issue in SSL. Kim et al. (2020) refines the pseudo-labels softly by formulating a convex optimization. Wei et al. (2021) proposes class-rebalancing self-training combined with distribution alignment. However, these existing methods still underestimate the complexity of practical scenarios of SSL, e.g., Wei et al. (2021) works under a strong assumption: the labeled data and unlabeled data fall in the same distribution (i.e., their distributions match). Further, a novel and realistic setting called label Missing Not At Random is proposed in Hu et al. (2022), which pops up in various fields such as social analysis and the medical sciences (Enders, 2010; Heckman, 1977). To address the mismatched distributions of labeled and unlabeled data in MNAR, Hu et al. (2022) proposes a class-aware doubly robust (CADR) estimator combining class-aware propensity and class-aware imputation to remove the bias on label imputation. Differently, our method alleviates the bias from another perspective, namely by guiding the pseudo-rectifying direction based on the historical information of class transitions.

3. METHOD

Formally, we denote the input space as X and the label space as Y = {1, ..., k} over k classes. Let (x^(i)_L, y^(i)_L, m^(i)_L) ∈ D_L, i ∈ {1, ..., n_L}, be the labeled data pairs consisting of samples with corresponding ground-truth labels (i.e., m^(i) = 0), and (x^(i)_U, y^(i)_U, m^(i)_U) ∈ D_U, i ∈ {n_L + 1, ..., n_T}, be the unlabeled data with missing labels (i.e., m^(i) = 1), where n_L and n_T refer to the number of labeled data and total training data respectively. Hereafter, the SSL dataset can be defined as D = D_L ∪ D_U. In brief, we can view conventional SSL as an optimization task for a loss L:

min_θ Σ_{(x,y,m)∈D} L(x, y; θ),    (1)

where D is a dataset with independent Y and M. In this sense, the model trained on D_L can easily impute unbiased pseudo-labels for the unlabeled data x_U (Hu et al., 2022). Conversely, the scenario where M is dependent on Y, namely label Missing Not At Random (MNAR), will make the model produce a strong bias in label imputation, causing the pseudo-rectifying ability to suffer greatly. Take the currently most popular SSL method FixMatch (Sohn et al., 2020) as an example. In FixMatch, the term L(x, y; θ) in Eq. (1) is decomposed into two loss terms L_L and L_U with a pre-defined confidence threshold τ (implying max(p) is used as a measure of the model's confidence):

L(x, y; θ) = L_L(x_L, y_L; θ) + λ_U · 1(max(p) ≥ τ) · L_U(x_U, arg max(p); θ),    (2)

where λ_U is the unlabeled loss weight and 1(·) is the indicator function. Trained with the MNAR setting in Fig. 1, FixMatch is gradually seduced by samples predicted to be the labeled popular classes with confidence above τ (even though most of them are wrong), while samples predicted to be the rare classes with confidence below τ do not participate in training, resulting in a biased propensity in label imputation. In this work, we propose class transition tracking based Pseudo-Rectifying Guidance (PRG) to help the model better self-correct pseudo-labels with additional guidance information.
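The confidence filtering in Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: `fixmatch_unlabeled_mask` and the toy batch are our own names, and only the mask/pseudo-label step of L_U is shown.

```python
import numpy as np

def fixmatch_unlabeled_mask(probs, tau=0.95):
    """Confidence mask and hard pseudo-labels for a batch of predicted
    class distributions `probs` (shape [B, k]). Only samples whose max
    predicted probability reaches the threshold tau contribute to the
    unlabeled loss L_U in Eq. (2)."""
    conf = probs.max(axis=1)       # max(p) per sample
    pseudo = probs.argmax(axis=1)  # arg max(p) per sample
    mask = conf >= tau             # 1(max(p) >= tau)
    return pseudo, mask

probs = np.array([[0.97, 0.02, 0.01],   # confident sample -> kept
                  [0.50, 0.30, 0.20]])  # unconfident sample -> filtered out
pseudo, mask = fixmatch_unlabeled_mask(probs, tau=0.95)
```

In MNAR, the failure mode described above is exactly this mask: samples mispredicted as popular classes pass the threshold, while would-be rare-class predictions fall below it and never train.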

3.1. PSEUDO-RECTIFYING GUIDANCE

Firstly, we formally describe the pseudo-rectifying process in SSL. In this paper, label assignment is considered as a procedure for generating soft labels. We denote the i-th component of a vector x as x_i. Let p ∈ R^k_+ be the soft label vector assigned to unlabeled data x_U, where R_+ is the set of non-negative real numbers and Σ^k_{i=1} p_i = 1. Denoting x at epoch e as x^e, the pseudo-rectifying process can be described as the change in p by the next epoch:

p^{e+1} = g_θ(p^e),

where g_θ(p^e) is a mapping from p^e to p^{e+1} determined by the knowledge learned by the model parametrized by θ at epoch e + 1. In MNAR, taking imbalanced D_L and balanced D_U as an example, as training progresses, the model's confidence is gradually slashed on the rare classes and unexpectedly grows on the popular classes in D_L. To address this issue, it is necessary to provide more guidance to assist the model in pseudo-rectifying. In general, Pseudo-Rectifying Guidance (PRG) can be described as

p̃^{e+1} = Normalize(η ∘ g_θ(p^e)),    (3)

where ∘ is the Hadamard product, η ∈ R^k_+ is a scaling weight vector and Normalize(x)_i = x_i / Σ^k_{j=1} x_j. We can view the technical contributions of some popular self-training works as obtaining a more effective η for pseudo-rectifying. For example, pseudo-labeling based methods (Lee et al., 2013; Sohn et al., 2020; Li et al., 2021; Xu et al., 2021; Zhang et al., 2021) set η_i = 1/p^{e+1}_i for i = arg max(p^{e+1}) with p^{e+1}_i ≥ τ, and η_j = 0 for all j ≠ i, i.e., using a confidence threshold to filter low-confidence samples. However, it is difficult to set an apposite η at the sample level (e.g., for simplicity, Sohn et al. (2020) fixes τ to determine η for all samples, and the value of τ is usually set based on experience) to guide pseudo-rectifying, especially in the MNAR settings.
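The general form of Eq. (3) and the pseudo-labeling special case can be sketched as follows. This is a minimal sketch with illustrative names (`prg_rescale` is our own helper, not from the paper's code).

```python
import numpy as np

def prg_rescale(p_next, eta):
    """General form of Eq. (3): rescale the model's soft label
    g_theta(p^e) (here `p_next`) by a weight vector eta via the
    Hadamard product, then renormalize so the components sum to one."""
    scaled = eta * p_next
    return scaled / scaled.sum()

# Pseudo-labeling as a special case: eta_i = 1/p_i on the arg-max class
# when its confidence exceeds tau, and 0 elsewhere -> one-hot label.
p_next = np.array([0.7, 0.2, 0.1])
tau = 0.5
eta = np.zeros_like(p_next)
i = p_next.argmax()
if p_next[i] >= tau:
    eta[i] = 1.0 / p_next[i]
hard = prg_rescale(p_next, eta)  # concentrates all mass on class 0
```

This recovers hard thresholding as one choice of η; PRG will instead supply a class-level η derived from the class transition history.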
In addition, some variants of class-balancing algorithms (Berthelot et al., 2020; Li et al., 2021; Gong et al., 2021) can be integrated into the pseudo-rectifying framework. These methods utilize distribution alignment to make the class distribution of predictions close to a prior distribution (e.g., the distribution of the labeled data). This process can be summarized as dataset-level pseudo-rectifying guidance that sets η as the ratio of the prior distribution to the current class distribution of predictions, i.e., a fixed η is used for all samples. Performing pseudo-rectifying guidance in this way strongly relies on an ideal assumption: the labeled data and unlabeled data share the same class distribution, i.e., in D, Y is independent of M. Thus, these approaches fail miserably in the MNAR scenarios, as demonstrated in Appendix D.1. As we discussed in Sec. 1, it is also feasible to guide pseudo-rectifying at the class level. Hence, we define the rectifying weight matrix A ∈ R^{k×k}_+, where each row A_i represents the rectifying weight vector corresponding to class i. Denoting the class prediction as p̂ = arg max(p), class-level pseudo-rectifying guidance can be conducted by plugging A_{p̂^{e+1}} into η in Eq. (3):

p̃^{e+1} = Normalize(A_{p̂^{e+1}} ∘ g_θ(p^e)).    (4)

Next, we introduce a simple and feasible way to obtain an effective A for PRG to improve the pseudo-labels predicted by SSL models in the MNAR scenarios.
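The dataset-level guidance above can be sketched as ReMixMatch-style distribution alignment. This is a hedged illustration of why it fails in MNAR: the function name and toy numbers are ours, and the balanced prior is exactly the assumption that MNAR breaks.

```python
import numpy as np

def distribution_alignment_eta(pred_mean, prior):
    """Dataset-level eta: the ratio of a prior class distribution to the
    running mean of model predictions, so that rescaling pushes every
    prediction toward the prior. The same eta is applied to all samples."""
    return prior / np.maximum(pred_mean, 1e-8)

pred_mean = np.array([0.6, 0.3, 0.1])  # model currently over-predicts class 0
prior = np.array([1/3, 1/3, 1/3])      # assumed (balanced) prior distribution
eta = distribution_alignment_eta(pred_mean, prior)

p = np.array([0.5, 0.3, 0.2])          # one sample's soft label
aligned = eta * p
aligned /= aligned.sum()               # class 0 is damped, class 2 boosted
```

When the unlabeled data does not actually follow `prior` (the MNAR case), this fixed dataset-level η systematically pushes labels toward the wrong marginal, which motivates the class-level A of Eq. (4).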

3.2. CLASS TRANSITION TRACKING

Firstly, we consider building a fully connected graph G in the class space Y. This graph is constructed from an adjacency matrix C ∈ R^{k×k}_+ (dubbed the class tracking matrix), where each element C_ij represents the frequency of class transitions occurring from class i to class j (i.e., an edge directed from vertex i to vertex j on G). C_ij is obtained by the following class transition tracking, averaged over the last N_B batches with unlabeled data batch size B_U, i.e., C_ij = Σ^{N_B}_{n=1} C^(n)_ij / N_B, where

C^(n)_ij = |{p̂^(b) | p̂^(b),e = i, p̂^(b),e+1 = j, i ≠ j, b ∈ {1, ..., B_U}}|, n ∈ {1, ..., N_B}, C^(n)_ii = 0.    (5)

Hereafter, we define a Markov random walk along the nodes of G, characterized by its transition matrix H ∈ R^{k×k}_+. Each element H_ij represents the transition probability of the class prediction p̂ transiting from class i at epoch e to class j at epoch e + 1. Specifically, H is computed by conducting row-wise normalization on C. The above designs are desirable for the following reasons. (1) In the self-training process of the model, the historical information of pseudo-rectifying contains the relationship between classes, which is often ignored in previous methods and can be utilized to help the model assign labels at a new epoch. We can record the class transition trend in pseudo-rectifying by Eq. (5), which corresponds to the transition probability represented by H_ij, i.e., for a sample x whose class prediction p̂ is in the state of class i, if a rectifying procedure results in a class transition, with what probability will it transit to class j. Intuitively, given p with p̂ = i, the model prefers to rectify it to another class similar to class i in one class transition, i.e., the preference of class transitions can be regarded as the similarity between classes: the more similar two classes are, the more likely they are to be misclassified as each other.
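The tracking of Eq. (5) and the row-wise normalization to H can be sketched as follows; a minimal NumPy illustration with our own helper names, showing a single batch rather than the N_B-batch average.

```python
import numpy as np

def track_transitions(prev_pred, next_pred, k):
    """One batch's class tracking matrix C^(n) per Eq. (5): C_ij counts
    predictions that moved from class i at epoch e to class j at epoch
    e+1; the diagonal (no transition) stays zero."""
    C = np.zeros((k, k))
    for i, j in zip(prev_pred, next_pred):
        if i != j:
            C[i, j] += 1
    return C

def transition_matrix(C):
    """Row-wise normalization of C gives the Markov transition matrix H;
    rows with no recorded transitions stay all-zero."""
    row = C.sum(axis=1, keepdims=True)
    return np.divide(C, row, out=np.zeros_like(C), where=row > 0)

prev_pred = [0, 0, 0, 1, 2]
next_pred = [1, 1, 2, 1, 0]  # the class-1 prediction is unchanged: no edge
C = track_transitions(prev_pred, next_pred, k=3)
H = transition_matrix(C)     # H[0] = [0, 2/3, 1/3]
```

In the full method, C is additionally averaged over the last N_B batches before normalization, so H reflects a running estimate rather than one batch.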
The label is more likely to oscillate between the two classes, resulting in more swinging class transitions. As shown in Fig. 4, among the "dog" class predictions, the predictions transitioning to the "cat" class are significantly more numerous than those to other classes, and vice versa for the "cat" labels. We can observe that C behaves like a symmetric matrix, reflecting the symmetric nature of class similarity. Consequently, this similarity between classes can be utilized to provide information for our class-level pseudo-rectifying guidance. (2) In the MNAR settings, the tricky problem is that the mismatched distributions lead to biased label imputation for the unlabeled data. The feedback loop of self-reinforcing errors is not achieved overnight. Empirically, as the training progresses, the model becomes more and more confident in the popular classes (in labeled or unlabeled data), which leads it to misclassify samples that it initially thought might be of the rare classes as the popular classes later on. As shown in Fig.
4, the lower left and upper right corners of the heatmap (i.e., the class transitions between the popular classes and rare classes) get lighter and are always lighter than the upper left corner (i.e., the class transitions among the popular classes), which means the model is increasingly reluctant to transfer the class prediction to the rare classes during the pseudo-rectifying process. If we only focus on what the model has learned at present, the model's past efforts to recognize the rare classes will be buried. The latent relational information between classes is hidden in the pseudo-rectifying process producing class transitions. The history of class transitions can point the way for bias removal on label imputation with an abnormal propensity on different classes caused by the mismatched distributions in MNAR. With the obtained H, some preparations are done before plugging it into Eq. (4) to replace A. We only model the pseudo-rectifying processes resulting in class transitions (i.e., C_ii = 0), which means H_ii = 0, i.e., η_{p̂^{e+1}} is set to 0 in Eq. (3). This would encourage the class prediction to transition to another class during every pseudo-rectifying step, which is unreasonable for training a robust classifier. Hence, we control the probability of not transitioning by setting H_ii = α/(k−1), where 1/(k−1) is the average of the transition probabilities in each row of H and α is a pre-defined hyper-parameter. In addition, to avoid training instability, we rescale each element in H by

H'_ij = (Σ^k_{d=1} L_d / Σ^k_{d=1} Σ^k_{d'=1} C_dd') × (Σ^k_{d=1} C_id / L_j) × H_ij,    (6)

where L ∈ R^k_+ and L_i records the number of class predictions belonging to class i, averaged over the last N_B batches. The first term on the right-hand side of Eq. (6) rescales H_ij at the batch level while the second term rescales H_ij at the class level, together controlling the intensity of class transition.
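The diagonal fix H_ii = α/(k−1) and the two-level rescaling of Eq. (6) can be sketched as below. A minimal sketch under stated assumptions: `rescale_H` is our own name, and L is taken to hold per-class prediction counts as defined in the text.

```python
import numpy as np

def rescale_H(H, C, L, alpha):
    """Sketch of the preparations on H: set the diagonal to
    alpha / (k - 1) so the prediction keeps some probability of not
    transitioning, then reweight each H_ij by the batch-level factor
    (total predictions over total transitions) and the class-level
    factor (transitions out of row i over the popularity L_j of the
    target class), following Eq. (6)."""
    k = H.shape[0]
    Hp = H.copy()
    np.fill_diagonal(Hp, alpha / (k - 1))
    batch_scale = L.sum() / C.sum()          # batch-level term of Eq. (6)
    out = np.zeros_like(Hp)
    for i in range(k):
        for j in range(k):
            out[i, j] = batch_scale * (C[i].sum() / L[j]) * Hp[i, j]
    return out

C = np.array([[0., 2., 1.],
              [0., 0., 0.],
              [1., 0., 0.]])                 # toy class tracking matrix
H = C / np.maximum(C.sum(axis=1, keepdims=True), 1)  # row-normalized
L = np.array([2., 2., 1.])                   # per-class prediction counts
H2 = rescale_H(H, C, L, alpha=1.0)
```

Transitions into the rare class (small L_j) are amplified by the 1/L_j factor, which is how the rescaling counteracts the model's reluctance to move predictions toward rare classes.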
Specifically, excessive class transitions in the self-training loop could yield confused supervision information that is not conducive to learning. Hereafter, to integrate our method into the above framework of pseudo-rectifying guidance, we plug H' into A in Eq. (4):

p̃^{e+1} = Normalize(H'_{p̂^{e+1}} ∘ g_θ(p^e)),    (7)

where H'_{p̂^{e+1}} can be regarded as the class prediction for one sample randomly walking along the nodes of G at the current epoch, i.e., driving a possible class transition in pseudo-rectifying for bias removal on the label imputation propensity due to MNAR (more discussions can be found in Appendix B). We note that it is also feasible to use the class transition driven by p̂^e to revise p^{e+1} (i.e., what the class prediction is after a class transition from the last epoch to the present), i.e., to replace H'_{p̂^{e+1}} in Eq. (7) with H'_{p̂^e}, which is dubbed PRG^Last. The whole algorithm is presented in Appendix A.

Table 1: Mean accuracy (%) in MNAR under CADR's protocol. The results of baseline methods are derived from CADR (Hu et al., 2022). The larger γ, the more imbalanced the labeled data. In the format Mean↑↓Diff.±Std., our accuracies are averaged over 3 runs while the standard deviations (±Std.) and the performance differences (↑↓Diff.) compared to FixMatch (Sohn et al., 2020) are reported.

4. EXPERIMENT

Dataset and Baselines. We evaluate PRG on three widely used benchmarks in SSL: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and mini-ImageNet (Vinyals et al., 2016) (a subset of ImageNet (Deng et al., 2009) composed of 100 classes). Following Hu et al. (2022), we mainly report the mean accuracy of PRG in both conventional SSL settings and various MNAR scenarios. Multiple baseline methods are compared, including representative conventional SSL algorithms: Π-Model (Rasmus et al., 2015), MixMatch (Berthelot et al., 2019), ReMixMatch (Berthelot et al., 2020), and FixMatch (Sohn et al., 2020). More importantly, we provide fair comparisons with recent label bias removal methods for imbalanced SSL: DARP (Kim et al., 2020) and CReST (Wei et al., 2021), and the latest approach designed for addressing SSL in MNAR: CADR (Hu et al., 2022). MNAR Settings. Following Hu et al. (2022), the MNAR scenarios are mimicked by constructing a class-imbalanced subset of the original dataset for either the labeled or the unlabeled data. Let γ denote the imbalance ratio, and let N_i and M_i respectively refer to the number of labeled and unlabeled data in class i out of k classes. Three MNAR protocols are used for the evaluation of PRG: (1) CADR's protocol (Hu et al., 2022): N_i = γ^{(k−i)/(k−1)}, in which N_1 = γ is the maximum number of labeled data over all classes, and the larger the value of γ, the more imbalanced the class distribution of the labeled data. For example, Fig. 1 shows CIFAR-10 with γ = 20. (2) Our protocol. Because the total number of labeled data n_L in CADR's protocol varies with γ, which violates the principle of controlling variables, n_L is fixed by the user in our protocol. N_1 is altered for different scales of imbalance, i.e., N_i = N_1 × γ^{−(i−1)/(k−1)}, while γ is calculated from the constraint Σ^k_{i=1} N_i = n_L.
We further consider the MNAR settings where the unlabeled data is also imbalanced, i.e., M_i = M_1 × γ_u^{−(k−i)/(k−1)} (implying an inversely imbalanced distribution compared with the labeled data), where M_1 = 5000 for CIFAR-10. (3) DARP's protocol (Kim et al., 2020): N_i = N_1 × γ_l^{−(i−1)/(k−1)}, M_i = M_1 × γ_u^{−(i−1)/(k−1)}, where N_1 = 1500 and M_1 = 3000 for CIFAR-10, and γ_l and γ_u are varied for the labeled and unlabeled data respectively, i.e., the distributions of the labeled and unlabeled data are mismatched and imbalanced. Implementation Details. In this section, PRG is implemented as a plugin to FixMatch (Sohn et al., 2020) (see Appendix D.2 for other SSL learners). Thus, we keep the same training hyper-parameters as FixMatch (e.g., unlabeled data batch size B_U = 448), whereas the class invariance coefficient α = 1 and the tracked batch number N_B = 128 are set for PRG in all experiments. The complete list of hyper-parameters can be found in Appendix C. Following Sohn et al. (2020), our models are trained for 2^20 iterations, using a WideResNet-28-2 (WRN) (Zagoruyko & Komodakis, 2016) backbone for CIFAR-10, WRN-28-8 for CIFAR-100 and ResNet-18 (He et al., 2016) for mini-ImageNet. Main Results. The experimental results under CADR's and our protocols with various levels of imbalance are summarized in Tabs. 1 and 2. PRG consistently achieves higher accuracy than the baseline methods across most of the settings, benefiting from the information offered by class transition tracking. As shown in Figs. 5a and 5b, the pseudo-rectifying ability of PRG is significantly improved compared with the original FixMatch, i.e., as the training progresses, the error rates on both the popular classes and the rare classes of the labeled data are greatly reduced, eventually yielding the improvements in test accuracy shown in Fig. 5c. Meanwhile, in Tab.
3 we further provide geometric mean scores (GM, a metric often used for imbalanced datasets (Kubat et al., 1997; Kim et al., 2020)), defined as the geometric mean over class-wise sensitivity, to evaluate the classification performance of models trained in MNAR. More metrics for evaluation (e.g., precision and recall) and the results of PRG built on other SSL learners can be found in Appendix D.2.
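The label-count formulas of the MNAR protocols described above can be sketched as short helpers. An illustrative sketch: function names are ours, rounding to integer counts is our assumption, and only CADR's protocol and our protocol are shown.

```python
def cadr_counts(gamma, k):
    """CADR's protocol: N_i = gamma ** ((k - i) / (k - 1)) for 1-indexed
    classes, so N_1 = gamma labels for the most popular class and
    N_k = 1 for the rarest."""
    return [round(gamma ** ((k - i) / (k - 1))) for i in range(1, k + 1)]

def our_protocol_counts(n1, gamma, k):
    """Our protocol: N_i = N_1 * gamma ** (-(i - 1) / (k - 1)); in the
    paper gamma is solved from sum(N_i) = n_L, here it is passed in."""
    return [round(n1 * gamma ** (-(i - 1) / (k - 1))) for i in range(1, k + 1)]

counts = cadr_counts(gamma=20, k=10)        # the Fig. 1 setting on CIFAR-10
ours = our_protocol_counts(n1=1000, gamma=20, k=10)
```

For γ = 20 on CIFAR-10, CADR's protocol yields a geometric decay from 20 labels down to a single label for the rarest class, which is the imbalance visualized in Fig. 1.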

4.1. RESULTS IN MNAR

Our main competitors can be divided into three categories. (1) State-of-the-art (SOTA) SSL methods such as ReMixMatch (Berthelot et al., 2020) and FixMatch (Sohn et al., 2020). As shown in Tabs. 1 and 2, these methods show poor performance under MNAR. In particular, our backbone FixMatch cannot cope with MNAR at all, whereas with our method the performance is significantly improved, by more than 10% in most cases. (2) Imbalanced SSL methods: DARP (Kim et al., 2020) and CReST (Wei et al., 2021). These two SOTA methods addressing long-tailed distributions in SSL emphasize bias removal under matched distributions (i.e., the unlabeled data is equally imbalanced as the labeled data), showing very limited capacity in handling MNAR. (3) SSL solutions devised for the MNAR scenarios: CADR (Hu et al., 2022). Our method outperforms CADR under its proposed protocol across the board, demonstrating that PRG is more effective for bias removal on label imputation. With extremely few labels, the class-aware propensity estimation in CADR is not reliable whereas our method still works well, yielding a performance gap of up to 14.41%. More MNAR Settings. More MNAR scenarios are considered for evaluation. In our protocol, we alter N_1 and γ_u to mimic the case where the distributions of the labeled and unlabeled data are imbalanced and mismatched, i.e., the two distributions are different. Likewise, DARP's protocol produces similar mismatched distributions. As shown in Fig. 6, PRG achieves promising results in all comparisons with the baseline methods. Our method boosts the accuracy of the original FixMatch by up to 35.51% and 24.33% under our and DARP's protocols respectively. The activated class transitions make the model less prone to over-learning unexpected classes, so the negative effect of MNAR can be mitigated. Moreover, the results for balanced labeled data with imbalanced unlabeled data and more application scenarios (e.g., tabular data) can be found in Appendix D.2.
As shown in Tab. 4, our method still works well on balanced datasets. The class-level guidance offered by our method is also valid in the conventional setting while maintaining the vitality of class transitions, even though there is not much need to remove bias on label imputation. Hereafter, we investigate the effect of the class invariance coefficient α and the tracked batch number N_B on PRG, which is shown in Fig. 7. Choosing an appropriate α to control the degree of class invariance in pseudo-rectifying is important for PRG, as it ensures the stability of supervision information and training. Meanwhile, we note that too small an N_B is not sufficient to estimate the underlying distribution of class transitions; N_B = 128 is a sensible choice for both memory overhead and performance.

5. CONCLUSION

This paper proposed an effective SSL framework called class transition tracking based Pseudo-Rectifying Guidance (PRG) to address SSL in the MNAR scenarios. Firstly, we argue that the history of class transitions caused by pseudo-rectifying can be utilized to offer informative guidance for future label assignment. Thus, we model class transition as a Markov random walk along the nodes of the graph constructed on the class tracking matrix. Finally, we propose to utilize the class prediction information at the current epoch (an alternative strategy is to combine the class prediction at the last epoch) to guide the class transition for pseudo-rectifying, so that the bias of label imputation can be alleviated. Given that our method achieves considerable performance gains in various MNAR settings, we believe PRG can be used for robust semi-supervised learning in broader scenarios.

In this section, we give insights into the re-weighting scheme of H in Eq. (6) based on the following theoretical justification. Overall, we give an explanation from the perspective of the gradient: our re-weighting scheme potentially scales the gradient magnitude on the learning of the unlabeled data to mitigate the adverse effects of biased labeled data, and suppresses the gradient magnitude when class transition is overheated. Letting p be the naive soft label vector, by Eq. (6), we re-weight H by

$$H'_{ij} = \frac{\sum_{d=1}^{k} L_d}{\sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}} \times \frac{\sum_{d=1}^{k} C_{id}}{L_j} \times H_{ij}$$

and obtain the rescaled pseudo-label vector $\hat{p} = \mathrm{Normalize}(H' \circ p)$.
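A minimal NumPy sketch of this re-weighting, assuming `C` holds raw transition counts, `L` holds the per-class quantities denoted $L_d$ above, and the helper names are ours:

```python
import numpy as np

def reweight_H(H, C, L):
    """Class/batch-level re-weighting of the transition matrix H (Eq. (6) sketch).

    H: (k, k) row-stochastic transition matrix
    C: (k, k) class tracking matrix (transition counts)
    L: (k,)  per-class counts, standing in for L_d in the equation
    """
    batch_scale = L.sum() / C.sum()      # shrinks when transitions overheat
    row_scale = C.sum(axis=1)            # sum_d C_id, one value per row i
    # H'_ij = batch_scale * (sum_d C_id / L_j) * H_ij
    return batch_scale * np.outer(row_scale, 1.0 / L) * H

def rectify(p, Hp):
    """Rescale a soft pseudo-label p by row H'_p of the re-weighted matrix."""
    cls = int(np.argmax(p))              # class prediction p = argmax
    q = Hp[cls] * p                      # elementwise class-level guidance
    return q / q.sum()                   # Normalize(H' o p)
```

The rectified vector stays a valid distribution because of the final normalization, matching the definition of $\hat{p}$ above.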
Hence, the cross-entropy between the prediction p and the rescaled pseudo-label $\hat{p}$ can be formalized as

$$\mathcal{L}_U = -\sum_{c=1}^{k} \hat{p}_c \log p_c = -\sum_{c=1}^{k} \left( \frac{\sum_{d=1}^{k} L_d}{\sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}} \times \frac{\sum_{d=1}^{k} C_{pd}}{L_c} \times \frac{H_{pc}\, p_c}{Z} \right) \log p_c = -\frac{\sum_{d=1}^{k} C_{pd}}{Z \sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}} \sum_{c=1}^{k} \left( \frac{H_{pc}\, p_c \sum_{d=1}^{k} L_d}{L_c} \right) \log p_c,$$

where Z is the normalization factor. Taking the derivative of $\mathcal{L}_U$ with respect to the logit $o_c$, we have

$$\frac{\partial \mathcal{L}_U}{\partial o_c} = -\Big(\hat{p}_c - \hat{p}_c p_c - \sum_{i \neq c}^{k} \hat{p}_i p_c\Big) = \sum_{i}^{k} \hat{p}_i p_c - \hat{p}_c = \frac{\sum_{d=1}^{k} C_{pd}}{Z \sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}} \left( 1 - \frac{H_{pc} \sum_{d=1}^{k} L_d}{Z L_c} \right) p_c.$$

The closer $H_{pc}$ is to $\frac{Z L_c}{\sum_{d=1}^{k} L_d}$, the smaller the gradient ($\frac{\partial \mathcal{L}_U}{\partial o_c} = 0$ when $\frac{H_{pc} \sum_{d=1}^{k} L_d}{Z L_c} = 1$). This means that we intend to provide unbiased guidance (derived from the unlabeled data) for the learning of unlabeled samples at the class level, so as to resist the influence of biased labeled samples. Meanwhile, if too many class transitions occur overall, $\frac{\sum_{d=1}^{k} C_{pd}}{\sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}}$ will decrease, which decreases $\mathcal{L}_U$, i.e., suppresses the trend of class transition overheating. To demonstrate the effectiveness of our re-weighting scheme on H, we conduct ablation experiments on it. As shown in Tab. 5, the re-weighting scheme effectively boosts the performance of PRG in MNAR because it also controls the intensity of class transitions. Additionally, for the utilization of H' in Eq. (7), we consider taking k steps, i.e., multiplying by $H'^k$ instead of $H'$ to uncover more complex patterns of misclassification than simple pairwise class relations. However, as shown in Tab. 6, the performance decreases as k grows. The advantage of PRG is that H' is updated in each iteration, which means that the value of H' is dynamic. As the model learns new knowledge, the past H' may no longer be suitable for the pseudo-rectifying process. Using $H'^k$ means applying the same H' multiple times to a given sample, which squanders the advantage of a dynamic H'.
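The gradient identity above (before substituting the re-weighted pseudo-label) can be checked numerically. The snippet below, with hypothetical values, verifies that for a fixed target q, the softmax cross-entropy gradient is $p_c \sum_i q_i - q_c$:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def ce(o, q):
    # L = -sum_c q_c log p_c with p = softmax(o); q is fixed (no gradient)
    return -(q * np.log(softmax(o))).sum()

o = np.array([0.5, -1.2, 2.0, 0.3])          # hypothetical logits
q = np.array([0.1, 0.2, 0.6, 0.1])           # hypothetical fixed target
p = softmax(o)
analytic = p * q.sum() - q                   # the identity from the derivation
h = 1e-6
numeric = np.array([(ce(o + h * e, q) - ce(o - h * e, q)) / (2 * h)
                    for e in np.eye(4)])     # central finite differences
```

When q sums to 1, this reduces to the familiar `p - q`; the re-weighting scheme acts by reshaping q, and hence the gradient, per class.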
Using $H'^k$ with a suitable k or a dynamic selection of k might yield better performance, but it is difficult to determine the value of k. Therefore, PRG is designed for simplicity and exhibits superior performance.

More Metrics. To comprehensively explore the improvement of PRG in MNAR, we report the difference in class-wise precision and recall with/without PRG. The experimental results are shown in Tab. 9. Compared to the original FixMatch, FixMatch with PRG achieves higher precision/recall by and large, especially on rare classes (i.e., classes with larger indices), which demonstrates that the bias removal capability of PRG effectively mitigates the effect of MNAR on the model. We also observe that both PRG and FixMatch achieve high precision as well as recall on popular classes, and high precision but low recall on rare classes (especially FixMatch) in the early training period. The improvement in recall by PRG is due to the activated class transitions, which give the model a certain probability of assigning pseudo-labels to rare classes. In addition, as shown in Fig. 8, PRG exhibits superior bias removal for the confidence of pseudo-labels in MNAR, whereas FixMatch filters out too many unlabeled rare-class samples with confidence lower than τ, e.g., classes 8, 9 and 10, resulting in a waste of unlabeled data.

More MNAR Scenarios. We also provide more experiments on the setting of balanced labeled data with imbalanced unlabeled data, summarized in Tab. 10. Specifically, we set n_L = 40 with a balanced distribution and set γ_u = 50, 100, 200 for the imbalanced unlabeled data, i.e., the class-wise number of unlabeled samples is $M_i = M_1 \times \gamma_u^{-\frac{i-1}{k-1}}$, where M_1 = 5000 in CIFAR-10. As shown in Tab.
10, PRG outperforms all baseline methods by a large margin (it is worth noting that the performance of CADR is even weaker than that of the original FixMatch), proving the robustness of PRG in this MNAR scenario thanks to the unbiased guidance derived from the class transition history.

More SSL Learners. Moreover, to further evaluate PRG's performance, we consider building PRG on top of more SSL frameworks. Thus, we conduct experiments on CIFAR-10 under CADR's protocol with UPS (Rizve et al., 2021) combined with PRG. UPS is a recently proposed uncertainty-aware pseudo-label selection framework for SSL, which is the SOTA method among pseudo-labeling based methods. We keep all training settings the same as in the original UPS. With γ = 20, UPS achieves an accuracy of 30.46% whereas UPS with PRG achieves an accuracy of 32.22%. We note that UPS performs poorly in the MNAR scenarios because it is a purer pseudo-labeling approach that does not introduce consistency regularization to improve model performance. We also observe that PRG improves UPS only marginally, much less than FixMatch. This is understandable because the negative learning that UPS prides itself on can be negatively affected when PRG adjusts the probability distribution of the pseudo-label, e.g., by altering its uncertainty.

More Data Types. To explore broader application scenarios of PRG, we apply it to tabular data. We conduct further experiments on tabular MNIST (interpreting MNIST as tabular data with 784 features) by plugging PRG into VIME (Yoon et al., 2020). VIME is a prevailing self- and semi-supervised learning framework for tabular data with the pretext task of estimating mask vectors from corrupted tabular data. We implement PRG on top of the semi-supervised learning component of VIME. PRG provides pseudo-rectifying guidance to rescale the pseudo-labels for the original unlabeled samples in VIME. Specifically, we replace the consistency loss used in VIME (i.e., the mean squared error in Eq.
(9) of Yoon et al. (2020)) with the standard cross-entropy loss to make PRG applicable to VIME. We use two protocols to show the performance advantage of PRG, including CADR's protocol and our protocol. As shown in Tab. 11, except for CADR's protocol with γ = 20 and our protocol with n_L = 40, N_1 = 10 (it is worth noting that our upper limits of performance greatly exceed those of VIME), PRG outperforms the original VIME by a large margin in most settings. The reason is that the scheme of class-transition-based pseudo-rectifying guidance is high-level and general (not limited to image data), which ultimately yields a robust effect on the MNAR problem with tabular data.
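Returning to the imbalanced-unlabeled setting above (Tab. 10), the class sizes $M_i = M_1 \times \gamma_u^{-\frac{i-1}{k-1}}$ can be computed concretely; this is a sketch, with the exponent form being the standard long-tailed profile read off from the setting above:

```python
def class_sizes(M1, gamma_u, k):
    """Per-class unlabeled counts: M_i = M_1 * gamma_u ** (-(i-1)/(k-1))."""
    return [int(round(M1 * gamma_u ** (-(i - 1) / (k - 1))))
            for i in range(1, k + 1)]

sizes = class_sizes(M1=5000, gamma_u=100, k=10)
# head class keeps M_1 = 5000 samples; tail class keeps M_1 / gamma_u = 50
```

The imbalance ratio γ_u is exactly the head-to-tail ratio, since the exponent runs from 0 down to -1 as i goes from 1 to k.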



Figure 1: An example of the MNAR scenarios on CIFAR-10 (see Sec. 4 for details). The class distribution of total data is balanced whereas labeled data is unevenly distributed across classes. For better illustration, the y-axis has different scaling for labeled (blue) and unlabeled data (green).

Figure 2: Results of FixMatch (Sohn et al., 2020) in MNAR and the conventional setting. The models are trained on CIFAR-10 with WRN-28-2 backbone (Zagoruyko & Komodakis, 2016). (a) and (b): Class-wise pseudo-label error rate. (c): Confusion matrix of pseudo-labels. In (b) and (c), experiments are conducted with the setting of Fig. 1, whereas in (a) with the conventional setting (i.e., balanced labeled and unlabeled data). The label amount used in (a) is the same as that in (b) and (c).

Figure 3: Overview of PRG. Class tracking matrix C is obtained by tracking the class transitions of pseudo-labels (e.g., p x1 for sample x 1 ) between epoch e and epoch e + 1 caused by the pseudo-rectifying procedure (Eq. (5)). The Markov random walk defined by transition matrix H (each row H i represents the transition probability vector corresponding to class i) is modeled on the graph constructed over C. Generally, given a pseudo-label, e.g., p x2 for sample x 2 , the class- and batch-rescaled H (i.e., H ′ ) is utilized to provide the class-level pseudo-rectifying guidance for p x2 according to its class prediction p = arg max(p x2 ) (Eqs. (6)∼(7)). Finally, the rescaled pseudo-label px2 is used for the training.
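The tracking step in the caption can be sketched as follows, assuming integer class predictions for the same unlabeled samples at consecutive epochs (the helper names are ours):

```python
import numpy as np

def update_tracking_matrix(C, prev_cls, curr_cls):
    """Accumulate class transitions of pseudo-labels between epoch e and e+1.

    prev_cls / curr_cls: integer class predictions for the same unlabeled
    samples at the two epochs (illustrative inputs).
    """
    for a, b in zip(prev_cls, curr_cls):
        C[a, b] += 1                      # one transition from class a to b
    return C

def transition_matrix(C):
    """Row-normalize C so row H_i is the transition probability vector of
    class i (rows with no observed transitions stay all-zero here)."""
    rows = C.sum(axis=1, keepdims=True)
    return C / np.maximum(rows, 1e-12)
```

Diagonal entries of C record pseudo-labels that kept their class, while off-diagonal mass reflects the rectifying transitions the figure visualizes.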

Following Hu et al. (2022), SSL can be viewed as a label missing problem. The label missing indicator set is defined as M with m ∈ {0, 1}, where m = 1 indicates the label is missing and m = 0 otherwise. Given the training dataset in SSL, we obtain a set of labeled data: D L ⊆ X × Y × M and a set of unlabeled data: D U ⊆ X × Y × M. Since the ground-truth y U ∈ Y of unlabeled data x U is inaccessible in SSL, prevailing self-training based SSL methods impute y U with a pseudo-label p, where p = f (x U ; θ) is predicted by the model parametrized by θ and trained on the labeled data.
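As a concrete instance of this imputation step, FixMatch-style confidence thresholding (a sketch with made-up probabilities) keeps a pseudo-label only when the model is sufficiently confident:

```python
import numpy as np

def impute_pseudo_labels(probs, tau=0.95):
    """Return argmax pseudo-labels and a mask keeping only those whose
    max-confidence exceeds the threshold tau."""
    conf = probs.max(axis=1)              # per-sample confidence
    labels = probs.argmax(axis=1)         # imputed class
    return labels, conf >= tau

probs = np.array([[0.97, 0.02, 0.01],    # confident -> kept
                  [0.50, 0.30, 0.20]])   # uncertain -> masked out
labels, mask = impute_pseudo_labels(probs)
```

Under MNAR, rare-class predictions rarely clear τ, which is exactly the filtering behavior PRG's class-level guidance is designed to counteract.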

Figure 4: Visualization of the class tracking matrix C obtained in the training process of FixMatch (Sohn et al., 2020) on CIFAR-10 with the same setting as in Figs. 2b and 2c. The darker the color, the more frequent the class transitions. Overall, the number of class transitions decreases as the training progresses. Class transitions occur intensively between the popular classes, and class transitions between the rare classes gradually disappear (e.g., between "ship" and "truck").

Figure 5: Results on CIFAR-10 under CADR's protocol. (a) and (b): Class-wise pseudo-label error rate with γ = 50. (c): Learning curve of PRG. We mark the final results of FixMatch as dash lines.

Figure 6: Results on CIFAR-10 under two protocols. The imbalanced distributions of labeled and unlabeled data are in reverse order of each other in (a) and the case of γ u marked with "R" in (b).

Figure 7: Ablation studies with α and N B on CIFAR-10 under CADR's protocol with γ = 20.

Algorithm 1 (fragment):
5       ...                                          // Rescale H at the class/batch level
6       for b = 1 to B_U do
7           p^(b) = f_θ(x^(b)_U) ...
            p = arg max(p^(b))                       // Compute class prediction
10          if l^(idx) ≠ p^(b) ...
        ...
        L_L, L_U = FixMatch(B_L, B_U, {p̂^(b); b ∈ (1, ..., B_U)})  // Run FixMatch
17      θ = SGD(L_L + L_U, θ)                        // Update model parameters θ
18  end

Algorithm 2: PRG Last: Pseudo-Rectifying Guidance Using Class Predictions of the Last Epoch
Input: class tracking matrices C = {C^(i); i ∈ (1, ..., N_B)}, labeled training dataset D_L, unlabeled training dataset D_U, model θ, label bank {l^(i); i ∈ (1, ..., n_T − n_L)}
1   for n = 1 to MaxIteration do
2       From D_L, draw a mini-batch B_L = {(x^(b)_L, y^(b)_L); b ∈ (1, ..., B)}
3       From D_U, draw a mini-batch B_U = {(x^(b)_U); b ∈ (1, ..., B_U)}
4       H = RowWiseNormalize(Average(C))             // Construct transition matrix
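Read as plain Python, the per-iteration core of the algorithms above might look like the following self-contained sketch; we fake the model with a precomputed probability table, omit the H-to-H' rescaling, and the helper names and label-bank update are our own simplification:

```python
import numpy as np

def row_normalize(C):
    rows = C.sum(axis=1, keepdims=True)
    return C / np.maximum(rows, 1e-12)

def prg_iteration(C_buffer, probs, label_bank, idxs):
    """One simplified PRG iteration: build H from buffered tracking matrices,
    rectify each pseudo-label, and record class transitions in a fresh C."""
    k = probs.shape[1]
    H = row_normalize(np.mean(C_buffer, axis=0))  # 4: RowWiseNormalize(Average(C))
    C_new = np.zeros((k, k))
    rectified = []
    for b, idx in enumerate(idxs):                # 6: for b = 1 to B_U
        p = probs[b]
        cls = int(np.argmax(p))                   # compute class prediction
        q = H[cls] * p                            # class-level guidance
        q = q / q.sum()
        rectified.append(q)
        new_cls = int(np.argmax(q))
        if label_bank[idx] != new_cls:            # 10: a class transition occurred
            C_new[label_bank[idx], new_cls] += 1
            label_bank[idx] = new_cls
    return np.array(rectified), C_new, label_bank
```

The returned `C_new` would join the buffer of the last N_B tracking matrices, closing the loop between pseudo-rectifying and the transition statistics it is guided by.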


The term $\frac{\sum_{d=1}^{k} C_{pd}}{\sum_{d=1}^{k}\sum_{d'=1}^{k} C_{dd'}}$ can be regarded as the ratio of class transitions derived from class p to the population of transitions. Denoting the logit outputted by the model as o (implying p = Softmax(o)), with no gradient on the pseudo-label $\hat{p}$, we obtain $\frac{\partial \mathcal{L}_U}{\partial o_c} = -\big(\hat{p}_c - \hat{p}_c p_c - \sum_{i \neq c}^{k} \hat{p}_i p_c\big)$.

Figure 8: Violin plot of confidence scores on the unlabeled data under CADR's protocol with γ = 100. The confidence threshold τ = 0.95 in FixMatch is marked out in red.


Mean accuracy (%) in MNAR under our protocol with the varying labeled data sizes n L and imbalanced ratios N 1 . Baseline methods are based on our reimplementation.

Geometric mean scores (GM) on CIFAR-10 under CADR's protocol.

Mean accuracy (%) in the conventional setting with various n L . Results of baselines are reported in CADR (Hu et al., 2022) while results of * are based on our reimplementation.

The ablation studies on re-weighting by Eq. (6). We report the results (accuracy (%) / GM) on CIFAR-10 under CADR's protocol.

The ablation studies on step k. We report accuracy (%) and GM on CIFAR-10 under CADR's protocol with γ = 20.

Comparisons of class-wise precision and recall on CIFAR-10 in the training process under CADR's protocol with γ = 50.

Accuracy (%) on CIFAR-10 with n L = 40 and various γ u . The labeled data is balanced and the unlabeled data is imbalanced.

Accuracy (%) on tabular MNIST. γ is varied for CADR's protocol whereas n L and N 1 are varied for our protocol. Mean ± Std. are computed over 50 runs.


Reproducibility Statement. For reproducibility, please refer to the method described in Sec. 3 and the algorithmic presentation shown in Sec. A. The implementation details (including the backbone, hyper-parameters, training details, etc.) can be found in Sec. 4 and Sec. C. Moreover, the checkpoints and evaluation code are available at the anonymous link https://anonymous.4open.science/r/PRG4SSL-MNAR-8DE2. We promise to release all the source code if the paper is accepted.

APPENDIX A ALGORITHM

Pseudo-code of PRG is presented in Algorithm 1 while that of PRG Last is presented in Algorithm 2. 

C IMPLEMENTATION DETAILS

In this section, we show the complete hyper-parameters in Tab. 7. As mentioned in Sec. 4, our method is implemented as a plugin to FixMatch (Sohn et al., 2020). Thus, we keep the original hyper-parameters of FixMatch and only adjust the additional hyper-parameters introduced by our method. Note that FixMatch sets different values of the weight decay w for CIFAR-10 and CIFAR-100, namely 0.0005 and 0.001 respectively. For simplicity, we set w = 0.0005 for all experiments in our work. Additionally, the models in this paper are trained on GeForce RTX 3090/2080 Ti and Tesla V100 GPUs. We observe that since no additional network components are introduced, the average running time of a single iteration hardly increases, which means our method does not introduce excessive computational overhead.

D.1 USING DISTRIBUTION ALIGNMENT IN MNAR

As discussed in Sec. 3.2, distribution alignment (DA) performs strong regularization on pseudo-labels by aligning the class distribution of predictions on the unlabeled data to that of the labeled data. DA boosts the performance of SSL models tangibly (Berthelot et al., 2020; Gong et al., 2021; Li et al., 2021; Sohn et al., 2020). However, DA relies on the strong assumption that the distribution of the unlabeled data matches that of the labeled data. In MNAR, this assumption obviously does not hold. Thus, methods combining DA fail to address SSL in MNAR, eventually yielding abysmal performance. As shown in Tab. 8, rather than improving performance, integrating DA into SSL models is counterproductive, e.g., the original FixMatch outperforms FixMatch with DA by up to 28.68% on CIFAR-10. DA leads to a substantial deterioration of model performance in the MNAR scenarios due to the large gap between the class distributions of the labeled and unlabeled data. Conversely, our method is not restricted by the mismatched distributions and achieves superior performance across the board, because PRG helps the model better handle MNAR scenarios without any prior information (whereas DA uses a distribution prior estimated from the labeled data).
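For reference, DA's re-scaling step (as in ReMixMatch) can be sketched as below; `labeled_prior` is the class distribution estimated from the labeled data and `running_pred` a moving average of model predictions on unlabeled data, both with hypothetical values here:

```python
import numpy as np

def distribution_alignment(p, labeled_prior, running_pred):
    """Scale a prediction p by the ratio of the labeled-data class prior to
    the running average of predictions, then renormalize. Under MNAR the
    labeled prior is biased, which is exactly why DA hurts."""
    q = p * (labeled_prior / np.maximum(running_pred, 1e-8))
    return q / q.sum()

p = np.array([0.7, 0.2, 0.1])            # model prediction for one sample
prior = np.array([0.6, 0.3, 0.1])        # biased labeled-class distribution
running = np.array([1/3, 1/3, 1/3])      # running average of predictions
q = distribution_alignment(p, prior, running)
```

With a biased prior, probability mass is pushed toward popular labeled classes, which reproduces the failure mode reported in Tab. 8.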

