SOFTMATCH: ADDRESSING THE QUANTITY-QUALITY TRADE-OFF IN SEMI-SUPERVISED LEARNING

Abstract

The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.

1. INTRODUCTION

Semi-Supervised Learning (SSL), concerned with learning from a few labeled data and a large amount of unlabeled data, has shown great potential in practical applications because it significantly reduces the need for laborious annotation (Fan et al., 2021; Xie et al., 2020; Sohn et al., 2020; Pham et al., 2021; Zhang et al., 2021; Xu et al., 2021b; a; Chen et al., 2021; Oliver et al., 2018). The main challenge of SSL lies in how to effectively exploit the information of unlabeled data to improve the model's generalization performance (Chapelle et al., 2006). Among the efforts, pseudo-labeling (Lee et al., 2013; Arazo et al., 2020) with confidence thresholding (Xie et al., 2020; Sohn et al., 2020; Xu et al., 2021b; Zhang et al., 2021) is highly successful and widely adopted. The core idea of threshold-based pseudo-labeling (Xie et al., 2020; Sohn et al., 2020; Xu et al., 2021b; Zhang et al., 2021) is to train the model with pseudo-labels whose prediction confidence is above a hard threshold, with the others being simply ignored. However, such a mechanism inherently exhibits the quantity-quality trade-off, which undermines the learning process. On the one hand, a high confidence threshold, as exploited in FixMatch (Sohn et al., 2020), ensures the quality of the pseudo-labels. However, it discards a considerable number of unconfident yet correct pseudo-labels. In the example shown in Fig. 1(a), around 71% of the correct pseudo-labels are excluded from training. On the other hand, a dynamically growing threshold (Xu et al., 2021b; Berthelot et al., 2021) or a class-wise threshold (Zhang et al., 2021) encourages the utilization of more pseudo-labels but inevitably fully enrolls erroneous pseudo-labels that may mislead training. In the example of FlexMatch (Zhang et al., 2021) shown in Fig. 1(a), about 16% of the utilized pseudo-labels are incorrect.
In summary, the quantity-quality trade-off with a confidence threshold limits the unlabeled data utilization, which may hinder the model's generalization performance. In this work, we formally define the quantity and quality of pseudo-labels in SSL and summarize the inherent trade-off present in previous methods from the perspective of a unified sample weighting formulation. We first identify that the fundamental reason behind the quantity-quality trade-off is the lack of a sophisticated assumption imposed by the weighting function on the distribution of pseudo-labels. In particular, confidence thresholding can be regarded as a step function assigning binary weights according to samples' confidence, which assumes pseudo-labels with confidence above the threshold are equally correct while the others are wrong. Based on this analysis, we propose SoftMatch to overcome the trade-off by maintaining high quantity and high quality of pseudo-labels during training. A truncated Gaussian function is derived from our assumption on the marginal distribution to fit the confidence distribution, which assigns lower weights to possibly correct pseudo-labels according to the deviation of their confidence from the mean of the Gaussian. The parameters of the Gaussian function are estimated using the historical predictions from the model during training. Furthermore, we propose Uniform Alignment to resolve the imbalance issue in pseudo-labels, which results from the different learning difficulties of different classes. It further consolidates the quantity of pseudo-labels while maintaining their quality. On the two-moon example, as shown in Fig. 1(c) and Fig. 1(b), SoftMatch achieves a distinctly better accuracy of pseudo-labels while retaining a consistently higher utilization ratio of them during training, therefore leading to a better-learned decision boundary, as shown in Fig. 1(d). We demonstrate that SoftMatch achieves a new state-of-the-art on a wide range of image and text classification tasks. We further validate the robustness of SoftMatch against long-tailed distributions by evaluating imbalanced classification tasks.

Our contributions can be summarized as:
• We demonstrate the importance of the unified weighting function by formally defining the quantity and quality of pseudo-labels, and the trade-off between them. We identify that the inherent trade-off in previous methods mainly stems from the lack of careful design on the distribution of pseudo-labels, which is imposed directly by the weighting function.
• We propose SoftMatch to effectively leverage the unconfident yet correct pseudo-labels, fitting a truncated Gaussian function to the distribution of confidence, which overcomes the trade-off. We further propose Uniform Alignment to resolve the imbalance issue of pseudo-labels while maintaining their high quantity and quality.
• We demonstrate that SoftMatch outperforms previous methods on various image and text evaluation settings. We also empirically verify the importance of maintaining the high accuracy of pseudo-labels while pursuing better unlabeled data utilization in SSL.

2.1. PROBLEM STATEMENT

We first formulate the framework of SSL in a C-class classification problem. Denote the labeled and unlabeled datasets as $D_L = \{(x_i^l, y_i^l)\}_{i=1}^{N_L}$ and $D_U = \{x_i^u\}_{i=1}^{N_U}$, respectively, where $x_i^l, x_i^u \in \mathbb{R}^d$ are the d-dimensional labeled and unlabeled training samples, and $y_i^l$ is the one-hot ground-truth label for labeled data. We use $N_L$ and $N_U$ to represent the number of training samples in $D_L$ and $D_U$, respectively. Let $p(y|x) \in \mathbb{R}^C$ denote the model's prediction. During training, given a batch of labeled data and unlabeled data, the model is optimized using a joint objective $L = L_s + L_u$, where $L_s$ is the supervised objective of the cross-entropy loss ($H$) on the $B_L$-sized labeled batch:

$$L_s = \frac{1}{B_L} \sum_{i=1}^{B_L} H(y_i^l, p(y|x_i^l)).$$

For the unsupervised loss, most existing methods with pseudo-labeling (Lee et al., 2013; Arazo et al., 2020; Xie et al., 2020; Sohn et al., 2020; Xu et al., 2021b; Zhang et al., 2021) exploit a confidence thresholding mechanism to mask out the unconfident and possibly incorrect pseudo-labels from training. In this paper, we take a step further and present a unified formulation of the confidence thresholding scheme (and other schemes) from the sample weighting perspective. Specifically, we formulate the unsupervised loss $L_u$ as the weighted cross-entropy between the model's prediction of the strongly-augmented data $\Omega(x^u)$ and pseudo-labels from the weakly-augmented data $\omega(x^u)$:

$$L_u = \frac{1}{B_U} \sum_{i=1}^{B_U} \lambda(p_i) H(\hat{p}_i, p(y|\Omega(x_i^u))),$$

where $p$ is the abbreviation of $p(y|\omega(x^u))$, and $\hat{p}$ is the one-hot pseudo-label $\mathrm{argmax}(p)$; $\lambda(p)$ is the sample weighting function with range $[0, \lambda_{\max}]$; and $B_U$ is the batch size for unlabeled data.
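As a minimal sketch of this unified formulation, the following Python snippet computes $L_u$ for a batch given an arbitrary weighting function. The helper names (`cross_entropy`, `one_hot_argmax`, `unsupervised_loss`, `constant_weight`, `threshold_weight`) and the toy probabilities are illustrative assumptions rather than the authors' implementation; predictions are represented as plain lists of class probabilities.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """H(target, probs) = -sum_c target_c * log(probs_c)."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, probs))

def one_hot_argmax(probs):
    """One-hot pseudo-label built from the arg-max class of a prediction."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if c == k else 0.0 for c in range(len(probs))]

def unsupervised_loss(weak_probs, strong_probs, weight_fn):
    """Weighted cross-entropy between pseudo-labels (weak view) and strong-view predictions."""
    total = 0.0
    for p, q in zip(weak_probs, strong_probs):
        total += weight_fn(p) * cross_entropy(one_hot_argmax(p), q)
    return total / len(weak_probs)

def constant_weight(p, lam_max=1.0):
    """Every pseudo-label is fully enrolled (naive pseudo-labeling)."""
    return lam_max

def threshold_weight(p, tau=0.95, lam_max=1.0):
    """Step function: full weight above the threshold, zero otherwise."""
    return lam_max if max(p) >= tau else 0.0

# Hypothetical two-class batch: predictions on weak and strong views.
weak = [[0.97, 0.03], [0.6, 0.4]]
strong = [[0.9, 0.1], [0.5, 0.5]]
loss_all = unsupervised_loss(weak, strong, constant_weight)
loss_thresholded = unsupervised_loss(weak, strong, threshold_weight)
```

Swapping `weight_fn` is the only change needed to move between schemes, which is the point of treating them under one weighting function.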

2.2. QUANTITY-QUALITY TRADE-OFF FROM SAMPLE WEIGHTING PERSPECTIVE

In this section, we demonstrate the importance of the unified weighting function $\lambda(p)$ by showing its different instantiations in previous methods and its essential connection with model predictions. We start by formulating the quantity and quality of pseudo-labels.

Definition 2.1 (Quantity of pseudo-labels). The quantity $f(p)$ of pseudo-labels enrolled in training is defined as the expectation of the sample weight $\lambda(p)$ over the unlabeled data:

$$f(p) = \mathbb{E}_{D_U}[\lambda(p)] \in [0, \lambda_{\max}].$$

Definition 2.2 (Quality of pseudo-labels). The quality $g(p)$ is the expectation of the weighted 0/1 correctness of pseudo-labels, assuming the label $y^u$ is given for $x^u$ for theoretical analysis purposes only:

$$g(p) = \sum_i^{N_U} \mathbb{1}(\hat{p}_i = y_i^u) \frac{\lambda(p_i)}{\sum_j^{N_U} \lambda(p_j)} = \mathbb{E}_{\bar{\lambda}(p)}[\mathbb{1}(\hat{p} = y^u)] \in [0, 1],$$

where $\bar{\lambda}(p) = \lambda(p) / \sum \lambda(p)$ is the probability mass function (PMF) of $p$ being close to $y^u$.

Based on the definitions of quality and quantity, we present the quantity-quality trade-off of SSL.

Definition 2.3 (The quantity-quality trade-off). Due to the implicit assumptions of the PMF $\bar{\lambda}(p)$ on the marginal distribution of model predictions, the lack of sophisticated design on it usually results in a trade-off between quantity and quality: when one of them increases, the other must decrease. Ideally, a well-defined $\lambda(p)$ should reflect the true distribution and lead to both high quantity and high quality.

Despite its importance, $\lambda(p)$ has hardly been defined explicitly or properly in previous methods. In this paper, we first summarize $\lambda(p)$, $\bar{\lambda}(p)$, $f(p)$, and $g(p)$ of relevant methods, as shown in Table 1, with the detailed derivation presented in Appendix A.1. For example, naive pseudo-labeling (Lee et al., 2013) and the loss weight ramp-up scheme (Samuli & Timo, 2017; Tarvainen & Valpola, 2017; Berthelot et al., 2019b; a) exploit a fixed sample weight to fully enroll all pseudo-labels into training. This is equivalent to setting $\lambda(p) = \lambda_{\max}$ and $\bar{\lambda}(p) = 1/N_U$, regardless of $p$, which means each pseudo-label is assumed equally correct.
We can verify that the quantity of pseudo-labels is maximized to $\lambda_{\max}$. However, maximizing quantity also fully involves the erroneous pseudo-labels, resulting in deficient quality, especially in early training. This failure is due to the implicit uniform assumption on the PMF $\bar{\lambda}(p)$, which is far from the realistic situation.

Table 1: Summary of $\lambda(p)$, $\bar{\lambda}(p)$, $f(p)$, and $g(p)$ of relevant methods.

Scheme | Pseudo-labeling | Confidence thresholding | SoftMatch
$\lambda(p)$ | $\lambda_{\max}$ | $\lambda_{\max}$ if $\max(p) \geq \tau$; $0.0$ otherwise | $\lambda_{\max} \exp\left(-\frac{(\max(p)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)$ if $\max(p) < \hat{\mu}_t$; $\lambda_{\max}$ otherwise
$\bar{\lambda}(p)$ | $1/N_U$ | $1/\hat{N}_U^{\tau}$ if $\max(p) \geq \tau$; $0.0$ otherwise | $\frac{\exp\left(-\frac{(\max(p)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)}{\frac{N_U}{2} + \sum_i^{N_U/2} \exp\left(-\frac{(\max(p_i)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)}$ if $\max(p) < \hat{\mu}_t$; $\frac{1}{\frac{N_U}{2} + \sum_i^{N_U/2} \exp\left(-\frac{(\max(p_i)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)}$ if $\max(p) \geq \hat{\mu}_t$
$f(p)$ | $\lambda_{\max}$ | $\lambda_{\max} \hat{N}_U^{\tau} / N_U$ | $\frac{\lambda_{\max}}{2} + \frac{\lambda_{\max}}{N_U} \sum_i^{N_U/2} \exp\left(-\frac{(\max(p_i)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)$
$g(p)$ | $\sum_i^{N_U} \mathbb{1}(\hat{p}_i = y_i^u) / N_U$ | $\sum_i^{\hat{N}_U^{\tau}} \mathbb{1}(\hat{p}_i = y_i^u) / \hat{N}_U^{\tau}$ | $\sum_j^{\hat{N}_U^{\mu_t}} \frac{\mathbb{1}(\hat{p}_j = y_j^u)}{2\hat{N}_U^{\mu_t}} + \sum_i^{N_U - \hat{N}_U^{\mu_t}} \frac{\mathbb{1}(\hat{p}_i = y_i^u) \exp\left(-\frac{(\max(p_i)-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)}{2(N_U - \hat{N}_U^{\mu_t})}$
Note | High Quantity, Low Quality | Low Quantity, High Quality | High Quantity, High Quality

In confidence thresholding (Arazo et al., 2020; Sohn et al., 2020; Xie et al., 2020), we can view the sample weights as being computed from a step function with the confidence $\max(p)$ as the input and a pre-defined threshold $\tau$ as the breakpoint. It sets $\lambda(p)$ to $\lambda_{\max}$ when the confidence is above $\tau$ and to 0 otherwise. Denoting $\hat{N}_U^{\tau} = \sum_i^{N_U} \mathbb{1}(\max(p_i) \geq \tau)$ as the total number of samples whose predicted confidence is above the threshold, $\bar{\lambda}(p)$ is set to a uniform PMF over these $\hat{N}_U^{\tau}$ samples within the fixed range $[\tau, 1]$. This is equivalent to constraining the unlabeled data to $\hat{D}_U^{\tau} = \{x^u; \max(p(y|x^u)) \geq \tau\}$, with the others simply being discarded. We can derive the quantity and the quality as shown in Table 1. A trade-off exists between the quality and quantity of pseudo-labels in confidence thresholding, controlled by $\tau$. On the one hand, while a high threshold ensures quality, it limits the quantity of enrolled samples. On the other hand, a low threshold sacrifices quality by fully involving more but possibly erroneous pseudo-labels in training. The trade-off still results from over-simplifying the PMF relative to actual cases.

Adaptive confidence thresholding (Zhang et al., 2021; Xu et al., 2021b) adopts a dynamic and class-wise threshold, which alleviates the trade-off by evolving the (class-wise) threshold during learning. These methods impose a further relaxation on the assumed distribution, but the uniform nature of the assumed PMF remains unchanged. While some methods indeed consider the definition of $\lambda(p)$ (Ren et al., 2020; Hu et al., 2021; Kim et al., 2022), interestingly, they all neglect the assumption induced on the PMF.
The lack of sophisticated modeling of λ(p) usually leads to a quantity-quality trade-off in the unsupervised loss of SSL, which motivates us to propose SoftMatch to overcome this challenge.
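To make Definitions 2.1 and 2.2 concrete, the sketch below evaluates the quantity $f(p)$ and the quality $g(p)$ for a handful of samples. The function names and all numbers are hypothetical toy values chosen for illustration, and returning 0 when no sample carries weight is a convention of this sketch only.

```python
def quantity(weights):
    """f(p): average sample weight over the unlabeled data."""
    return sum(weights) / len(weights)

def quality(weights, correct):
    """g(p): share of the total weight that falls on correct pseudo-labels."""
    total = sum(weights)
    if total == 0.0:
        return 0.0  # nothing enrolled; convention of this sketch
    return sum(w for w, ok in zip(weights, correct) if ok) / total

# Toy data: whether each of six pseudo-labels matches the (normally unknown) label.
correct = [True, True, True, True, False, False]

# Fixed weight for every sample (naive pseudo-labeling with lambda_max = 1).
uniform_weights = [1.0] * 6
# A hypothetical soft weighting that down-weights the less reliable samples.
soft_weights = [1.0, 1.0, 0.9, 0.7, 0.3, 0.1]

f_uniform, g_uniform = quantity(uniform_weights), quality(uniform_weights, correct)
f_soft, g_soft = quantity(soft_weights), quality(soft_weights, correct)
```

With uniform weights the quantity is maximal but the quality equals the raw pseudo-label accuracy; non-uniform weights trade some quantity for a higher weighted accuracy.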

3.1. GAUSSIAN FUNCTION FOR SAMPLE WEIGHTING

Inherently different from previous methods, we generally assume the underlying PMF $\bar{\lambda}(p)$ of the marginal distribution follows a dynamic and truncated Gaussian distribution of mean $\mu_t$ and variance $\sigma_t^2$ at the t-th training iteration. We choose the Gaussian for its maximum entropy property and its empirically verified better generalization. Note that this is equivalent to treating the deviation of the confidence $\max(p)$ from the mean $\mu_t$ of the Gaussian as a proxy measure of the correctness of the model's prediction, where samples with higher confidence are less prone to be erroneous than those with lower confidence, consistent with the observation shown in Fig. 1(a). To this end, we can derive $\lambda(p)$ as:

$$\lambda(p) = \begin{cases} \lambda_{\max} \exp\left(-\frac{(\max(p)-\mu_t)^2}{2\sigma_t^2}\right), & \text{if } \max(p) < \mu_t, \\ \lambda_{\max}, & \text{otherwise,} \end{cases} \quad (5)$$

which is also a truncated Gaussian function within the range $[0, \lambda_{\max}]$ on the confidence $\max(p)$. However, the underlying true Gaussian parameters $\mu_t$ and $\sigma_t$ are still unknown. Although we can set the parameters to fixed values as in FixMatch (Sohn et al., 2020) or linearly interpolate them within some pre-defined range as in Ramp-up (Tarvainen & Valpola, 2017), this might again over-simplify the PMF assumption as discussed before. Since the PMF $\bar{\lambda}(p)$ is defined over $\max(p)$, we can instead fit the truncated Gaussian function directly to the confidence distribution for better generalization. Specifically, we can estimate $\mu$ and $\sigma^2$ from the historical predictions of the model. At the t-th iteration, we compute the empirical mean and variance as:

$$\hat{\mu}_b = \hat{\mathbb{E}}_{B_U}[\max(p)] = \frac{1}{B_U} \sum_{i=1}^{B_U} \max(p_i), \qquad \hat{\sigma}_b^2 = \hat{\mathrm{Var}}_{B_U}[\max(p)] = \frac{1}{B_U} \sum_{i=1}^{B_U} (\max(p_i) - \hat{\mu}_b)^2.$$

We then aggregate the batch statistics for a more stable estimation, using an Exponential Moving Average (EMA) with a momentum $m$ over previous batches:

$$\hat{\mu}_t = m \hat{\mu}_{t-1} + (1-m) \hat{\mu}_b, \qquad \hat{\sigma}_t^2 = m \hat{\sigma}_{t-1}^2 + (1-m) \frac{B_U}{B_U - 1} \hat{\sigma}_b^2,$$

where we use the unbiased variance for the EMA and initialize $\hat{\mu}_0$ as $\frac{1}{C}$ and $\hat{\sigma}_0^2$ as 1.0.
The estimated mean $\hat{\mu}_t$ and variance $\hat{\sigma}_t^2$ are plugged back into Eq. (5) to compute sample weights. Estimating the Gaussian parameters adaptively from the confidence distribution during training not only improves the generalization but also better resolves the quantity-quality trade-off. We can verify this by computing the quantity and quality of pseudo-labels as shown in Table 1. The derived quantity $f(p)$ is bounded by $\left[\frac{\lambda_{\max}}{2}\left(1 + \exp\left(-\frac{(\frac{1}{C} - \hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right)\right), \lambda_{\max}\right]$, indicating that SoftMatch guarantees a quantity of at least $\lambda_{\max}/2$ during training. As the model learns better and becomes more confident, i.e., $\hat{\mu}_t$ increases and $\hat{\sigma}_t$ decreases, the lower tail of the quantity becomes much tighter. While the quantity remains high, the quality of pseudo-labels also improves. As the tail of the Gaussian exponentially grows tighter during training, the erroneous pseudo-labels on which the model is highly unconfident are assigned lower weights, and those whose confidence is around $\hat{\mu}_t$ are more efficiently utilized. The truncated Gaussian weighting function generally behaves as a soft and adaptive version of confidence thresholding; thus we term the proposed method SoftMatch.
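The estimation and weighting steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GaussianWeighter` and its methods are hypothetical, confidences are passed as a plain list of $\max(p_i)$ values, the batch must contain at least two samples, and the division of the variance by 4 mentioned in the experimental setup is omitted.

```python
import math

class GaussianWeighter:
    """Truncated Gaussian sample weighting with EMA-estimated mean and variance."""

    def __init__(self, num_classes, momentum=0.999, lam_max=1.0):
        self.m = momentum
        self.lam_max = lam_max
        self.mu = 1.0 / num_classes  # initial mean: 1 / C
        self.var = 1.0               # initial variance: 1.0

    def update(self, confidences):
        """Fold one unlabeled batch of confidences max(p_i) into the EMA statistics."""
        b = len(confidences)  # assumed to be >= 2
        mu_b = sum(confidences) / b
        var_b = sum((c - mu_b) ** 2 for c in confidences) / b
        unbiased_var_b = var_b * b / (b - 1)  # B_U / (B_U - 1) correction
        self.mu = self.m * self.mu + (1.0 - self.m) * mu_b
        self.var = self.m * self.var + (1.0 - self.m) * unbiased_var_b

    def weight(self, confidence):
        """Eq. (5): full weight at or above the mean, Gaussian decay below it."""
        if confidence >= self.mu:
            return self.lam_max
        return self.lam_max * math.exp(-((confidence - self.mu) ** 2) / (2.0 * self.var))

# Toy usage with a deliberately small momentum so one update is visible.
weighter = GaussianWeighter(num_classes=10, momentum=0.5)
weighter.update([0.8, 0.9, 1.0])
weights = [weighter.weight(c) for c in (0.2, 0.4, 0.95)]
```

Unlike a hard threshold, a sample below the running mean still contributes, with a weight that shrinks smoothly as its confidence moves away from the mean.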

3.2. UNIFORM ALIGNMENT FOR FAIR QUANTITY

As different classes exhibit different learning difficulties, the generated pseudo-labels can have a potentially imbalanced distribution, which may limit the generalization of the PMF assumption (Oliver et al., 2018; Zhang et al., 2021). To overcome this problem, we propose Uniform Alignment (UA), encouraging more uniform pseudo-labels across classes. Specifically, we define the distribution of pseudo-labels as the expectation of the model predictions on unlabeled data: $\mathbb{E}_{D_U}[p(y|x^u)]$. During training, it is estimated as $\hat{\mathbb{E}}_{B_U}[p(y|x^u)]$ using the EMA of batch predictions on unlabeled data. We use the ratio between a uniform distribution $u(C) \in \mathbb{R}^C$ and $\hat{\mathbb{E}}_{B_U}[p(y|x^u)]$ to normalize each prediction $p$ on unlabeled data and use the normalized probability to calculate the per-sample loss weight. We formulate the UA operation as:

$$\mathrm{UA}(p) = \mathrm{Normalize}\left(p \cdot \frac{u(C)}{\hat{\mathbb{E}}_{B_U}[p]}\right),$$

where $\mathrm{Normalize}(\cdot) = (\cdot) / \sum(\cdot)$, ensuring the normalized probability sums to 1.0. With UA plugged in, the final sample weighting function in SoftMatch becomes:

$$\lambda(p) = \begin{cases} \lambda_{\max} \exp\left(-\frac{(\max(\mathrm{UA}(p))-\hat{\mu}_t)^2}{2\hat{\sigma}_t^2}\right), & \text{if } \max(\mathrm{UA}(p)) < \hat{\mu}_t, \\ \lambda_{\max}, & \text{otherwise.} \end{cases}$$

When computing the sample weights, UA encourages larger weights to be assigned to less-predicted pseudo-labels and smaller weights to more-predicted pseudo-labels, alleviating the imbalance issue. An essential difference between UA and Distribution Alignment (DA) (Berthelot et al., 2019a), proposed earlier, lies in the computation of the unsupervised loss. The normalization operation makes the predicted probability biased towards the less-predicted classes. In DA, this might not be an issue, as the normalized prediction is used as a soft target in the cross-entropy loss. However, with pseudo-labeling, more erroneous pseudo-labels are probably created after normalization, which damages the quality.
UA avoids this issue by exploiting original predictions to compute pseudo-labels and normalized predictions to compute sample weights, maintaining both the quantity and quality of pseudo-labels in SoftMatch. The complete training algorithm is shown in Appendix A.2.
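A minimal sketch of how UA separates the two roles of a prediction is given below: the pseudo-label is taken from the original prediction, while the sample weight is computed from the aligned one. The function names, the two-class toy predictions, and the fixed values used for the running class distribution, mean, and variance are all hypothetical; the running distribution is assumed to be strictly positive.

```python
import math

def uniform_alignment(probs, ema_probs):
    """UA(p) = Normalize(p * u(C) / E[p]) for a single prediction."""
    num_classes = len(probs)
    scaled = [p * (1.0 / num_classes) / e for p, e in zip(probs, ema_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

def pseudo_label(probs):
    """Pseudo-label index, taken from the ORIGINAL (unaligned) prediction."""
    return max(range(len(probs)), key=probs.__getitem__)

def softmatch_weight(probs, ema_probs, mu, var, lam_max=1.0):
    """Truncated Gaussian weight evaluated on the ALIGNED confidence max(UA(p))."""
    conf = max(uniform_alignment(probs, ema_probs))
    if conf >= mu:
        return lam_max
    return lam_max * math.exp(-((conf - mu) ** 2) / (2.0 * var))

# Toy setting: class 0 is predicted far more often than class 1.
ema_probs = [0.7, 0.3]
head = [0.8, 0.2]  # confident prediction for the frequent class
tail = [0.2, 0.8]  # equally confident prediction for the rare class

aligned_tail = uniform_alignment(tail, ema_probs)
w_head = softmatch_weight(head, ema_probs, mu=0.9, var=0.01)
w_tail = softmatch_weight(tail, ema_probs, mu=0.9, var=0.01)
```

Both toy predictions have the same raw confidence, yet the one for the rarely predicted class receives the larger weight, and neither pseudo-label is altered by the alignment.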

4. EXPERIMENTS

While most SSL literature performs evaluation on image tasks, we extensively evaluate SoftMatch on various datasets, including image and text datasets with classic and long-tailed settings. Moreover, we provide an ablation study and a qualitative comparison to analyze the effectiveness of SoftMatch.

4.1. CLASSIC IMAGE CLASSIFICATION

Setup. For the classic image classification setting, we evaluate on CIFAR-10/100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009), with various numbers of labeled data, where the class distribution of the labeled data is balanced. We use WRN-28-2 (Zagoruyko & Komodakis, 2016) for CIFAR-10 and SVHN, WRN-28-8 for CIFAR-100, WRN-37-2 (Zhou et al., 2020) for STL-10, and ResNet-50 (He et al., 2016) for ImageNet. For all experiments, we use the SGD optimizer with a momentum of 0.9, where the initial learning rate $\eta_0$ is set to 0.03. We adopt the cosine learning rate annealing scheme to adjust the learning rate over a total of $2^{20}$ training steps. The labeled batch size $B_L$ is set to 64 and the unlabeled batch size $B_U$ is set to 7 times $B_L$ for all datasets. We set $m$ to 0.999 and divide the estimated variance $\hat{\sigma}_t^2$ by 4 for $2\sigma$ of the Gaussian function. We record the EMA of model parameters for evaluation with a momentum of 0.999. Each experiment is run with three random seeds on labeled data, and we report the top-1 error rate. More details on the hyper-parameters are shown in Appendix A.3.1.

Results. SoftMatch obtains state-of-the-art results on almost all settings in Table 2 and Table 3, except CIFAR-100 with 2,500 and 10,000 labels and SVHN with 1,000 labels, where the results of SoftMatch are comparable to previous methods. Notably, FlexMatch exhibits a performance drop compared to FixMatch on SVHN, since it enrolls too many erroneous pseudo-labels at the beginning of training, which prohibits learning afterward. In contrast, SoftMatch surpasses FixMatch by 1.48% on SVHN with 40 labels, demonstrating its superiority in utilizing the pseudo-labels. On more realistic datasets, CIFAR-100 with 400 labels, STL-10 with 40 labels, and ImageNet with 10% labels, SoftMatch exceeds FlexMatch by a margin of 7.73%, 2.84%, and 1.33%, respectively.
SoftMatch shows results comparable to FlexMatch on CIFAR-100 with 2,500 and 10,000 labels, whereas ReMixMatch (Berthelot et al., 2019a) demonstrates the best results due to the Mixup (Zhang et al., 2017) and Rotation loss.

4.2. LONG-TAILED IMAGE CLASSIFICATION

Setup. We evaluate SoftMatch on a more realistic and challenging setting of imbalanced SSL (Kim et al., 2020; Wei et al., 2021; Lee et al., 2021; Fan et al., 2022), where both the labeled and the unlabeled data exhibit long-tailed distributions. Following Fan et al. (2022), the imbalance ratio γ ranges from 50 to 150 and from 20 to 100 for CIFAR-10-LT and CIFAR-100-LT, respectively. Here, γ is used to exponentially decrease the number of samples from class 0 to class C (Fan et al., 2022). We compare SoftMatch with two strong baselines: FixMatch (Sohn et al., 2020) and FlexMatch (Zhang et al., 2021). All experiments use the same WRN-28-2 (Zagoruyko & Komodakis, 2016) as the backbone and the same set of common hyper-parameters. Each experiment is repeated five times with different data splits, and we report the average test accuracy and the standard deviation. More details are in Appendix A.3.2.

Results. As shown in Table 4, SoftMatch achieves the best test error rate across all long-tailed settings. The performance improvement over the previous state-of-the-art is still significant even at large imbalance ratios. For example, SoftMatch outperforms the second-best by 2.4% at γ = 150 on CIFAR-10-LT, which suggests the superior robustness of our method against data imbalance.

Discussion. Here we study the design choice of uniform alignment, as it plays a key role in SoftMatch's performance on imbalanced SSL. We conduct experiments with different target distributions for alignment. Specifically, the default uniform target distribution u(C) can be replaced by the ground-truth class distribution or the empirical class distribution estimated from the labeled data seen during training. The results in Fig. 3(a) show a clear advantage of using the uniform distribution. A uniform target distribution enforces the class marginal to become uniform, which has a strong regularization effect of balancing the head and tail classes in imbalanced classification settings.
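The exponential decrease of class sizes controlled by γ in the setup above can be illustrated as follows. The exact formula is not given in this paper; the sketch assumes the convention commonly used for long-tailed CIFAR benchmarks, in which class $c$ keeps $N_0 \cdot \gamma^{-c/(C-1)}$ samples, so that the largest class is γ times the smallest. The function name and the head-class size of 1,500 are hypothetical.

```python
def long_tailed_counts(n_max, num_classes, gamma):
    """Per-class sample counts decaying exponentially from n_max to n_max / gamma.

    Assumed convention (not stated in the paper): n_c = n_max * gamma ** (-c / (C - 1)).
    Requires num_classes >= 2.
    """
    return [
        round(n_max * gamma ** (-c / (num_classes - 1)))
        for c in range(num_classes)
    ]

# Hypothetical example: 10 classes, imbalance ratio 100, 1,500 samples in the head class.
counts = long_tailed_counts(n_max=1500, num_classes=10, gamma=100.0)
```

Under this assumed convention, a larger γ leaves the head class unchanged and shrinks every other class, with the last class reduced by the full factor γ.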

4.3. TEXT CLASSIFICATION

Setup. In addition to image classification tasks, we further evaluate SoftMatch on the text topic classification tasks of AG News and DBpedia, and the sentiment tasks of IMDb, Amazon-5, and Yelp-5 (Maas et al., 2011; Zhang et al., 2015). We split a validation set from the training data to evaluate the algorithms. For Amazon-5 and Yelp-5, we randomly sample 50,000 samples per class from the training data to reduce the training time. We fine-tune the pre-trained BERT-Base (Devlin et al., 2018) model for all datasets using UDA (Xie et al., 2020), FixMatch (Sohn et al., 2020), FlexMatch (Zhang et al., 2021), and SoftMatch. We use the AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) optimizer with an initial learning rate of 1e-5 and the same cosine scheduler as in the image classification tasks. All algorithms are trained for a total of $2^{18}$ iterations. The fine-tuned model is directly used for evaluation rather than the EMA version.

Results. SoftMatch achieves an error rate of 12.68% on AG News with only 40 labels and 1.68% on DBpedia with 70 labels, surpassing the second best by a margin of 2.81% and 0.5%, respectively. On sentiment tasks, SoftMatch also shows the best results on Amazon-5 and IMDb, and results comparable to its counterpart on Yelp-5.

4.4. QUALITATIVE ANALYSIS

In this section, we provide a qualitative comparison on CIFAR-10 with 250 labels of FixMatch (Sohn et al., 2020), FlexMatch (Zhang et al., 2021), and SoftMatch from different aspects, as shown in Fig. 2. We compute the error rate, the quantity, and the quality of pseudo-labels to analyze the proposed method, using the ground truth of unlabeled data that is unseen during training.

SoftMatch utilizes the unlabeled data better.

Gaussian Parameter Estimation. SoftMatch estimates the Gaussian parameters $\mu$ and $\sigma^2$ directly from the confidence generated from all unlabeled data during training. Here we compare it (All-Class) with two alternatives: (1) Fixed, which uses pre-defined $\mu$ and $\sigma^2$ of 0.95 and 0.01; and (2) Per-Class, which fits a Gaussian for each class instead of using a global Gaussian weighting function. As shown in Fig. 3(c), the inferior performance of Fixed justifies the importance of adaptive weight adjustment in SoftMatch. Moreover, Per-Class achieves performance comparable to SoftMatch at 250 labels, but a significantly higher error rate at 40 labels. This is because an accurate parameter estimation requires many predictions for each class, which are not available for Per-Class.

Uniform Alignment on Gaussian. To verify the impact of UA, we compare the performance of SoftMatch with and without UA, denoted as all-class with UA and all-class without UA in Fig. 3(d). Since the per-class estimation alone can also be viewed as a way to achieve fair class utilization (Zhang et al., 2021), we also include it in the comparison. Removing UA from SoftMatch causes a slight performance drop. Besides, per-class estimation produces significantly inferior results on SVHN.

We further include the detailed ablation of sample weighting functions and several additional ablation studies in Appendix A.5 due to the space limit. These studies demonstrate that SoftMatch stays robust to different EMA momentums, variance ranges, and UA target distributions in balanced distribution settings.

5. RELATED WORK

Pseudo-labeling (Lee et al., 2013) generates artificial labels for unlabeled data and trains the model in a self-training manner. Consistency regularization (Samuli & Timo, 2017) is proposed to achieve the goal of producing consistent predictions for similar data points. A variety of works focus on improving pseudo-labeling and consistency regularization from different aspects, such as loss weighting (Samuli & Timo, 2017; Tarvainen & Valpola, 2017; Iscen et al., 2019; Ren et al., 2020), data augmentation (Grandvalet et al., 2005; Sajjadi et al., 2016; Miyato et al., 2018; Berthelot et al., 2019b; a; Xie et al., 2020; Cubuk et al., 2020), label allocation (Tai et al., 2021), feature consistency (Li et al., 2021; Zheng et al., 2022; Fan et al., 2021), and confidence thresholding (Sohn et al., 2020; Zhang et al., 2021; Xu et al., 2021b). The loss weight ramp-up strategy is proposed to balance the learning on labeled and unlabeled data (Samuli & Timo, 2017; Tarvainen & Valpola, 2017; Berthelot et al., 2019b; a). Progressively increasing the loss weight for the unlabeled data prevents the model from involving too much ambiguous unlabeled data at the early stage of training, so the model learns in a curriculum fashion. Per-sample loss weights are utilized to better exploit the unlabeled data (Iscen et al., 2019; Ren et al., 2020). The previous work "Influence" shares a similar goal with ours in calculating the loss weight for each sample, but is motivated by the observation that not all unlabeled data are equal (Ren et al., 2020). SAW (Lai et al., 2022) utilizes effective weights (Cui et al., 2019) to overcome class-imbalance issues in SSL. Modeling of the loss weight has also been explored in semi-supervised segmentation (Hu et al., 2021).
De-biased self-training (Chen et al., 2022; Wang et al., 2022a) studies the data bias and training bias brought by involving pseudo-labels in training, which is a similar exploration to that of quantity and quality in SoftMatch. Kim et al. (2022) proposed to use a small network to predict the loss weight, which is orthogonal to our work. Confidence thresholding methods (Sohn et al., 2020; Xie et al., 2020; Zhang et al., 2021; Xu et al., 2021b) adopt a threshold to enroll the unlabeled samples with high confidence into training. FixMatch (Sohn et al., 2020) uses a fixed threshold to select pseudo-labels with high quality, which limits the data utilization ratio and leads to an imbalanced pseudo-label distribution. Dash (Xu et al., 2021b) gradually increases the threshold during training to improve the utilization of unlabeled data. FlexMatch (Zhang et al., 2021) designs class-wise thresholds and lowers the thresholds for classes that are more difficult to learn, which alleviates class imbalance.

6. CONCLUSION

In this paper, we revisit the quantity-quality trade-off of pseudo-labeling and identify the core reason behind this trade-off from a unified sample weighting perspective. We propose SoftMatch, with a truncated Gaussian weighting function and Uniform Alignment, which overcomes the trade-off and yields both high quantity and high quality of pseudo-labels during training. Extensive experiments demonstrate the effectiveness of our method on various tasks. We hope more works can be inspired in this direction, such as designing better weighting functions that can discriminate correct pseudo-labels better.

A APPENDIX

A.1 QUANTITY-QUALITY TRADE-OFF

In this section, we present the detailed definition and derivation of the quantity and quality formulation. Importantly, we identify that the sample weighting function $\lambda(p) \in [0, \lambda_{\max}]$ is directly related to the (implicit) assumption of the probability mass function (PMF) over $p$ for $p \in \{p(y|x^u); x^u \in D_U\}$, i.e., the distribution of $p$. From the unified sample weighting function perspective, we show the analysis of the quantity and quality of the related methods and SoftMatch.

A.1.1 QUANTITY AND QUALITY

Derivation Definition 2.1

The definition and derivation of the quantity $f(p)$ of pseudo-labels is rather straightforward. We define the quantity as the percentage/ratio of unlabeled data enrolled in the weighted unsupervised loss. In other words, the quantity is the average sample weight on unlabeled data:

$$f(p) = \sum_i^{N_U} \frac{\lambda(p_i)}{N_U} = \mathbb{E}_{D_U}[\lambda(p_i)],$$

where each unlabeled sample is uniformly sampled from $D_U$ and $f(p) \in [0, \lambda_{\max}]$.

Derivation Definition 2.2

We define the quality $g(p)$ of pseudo-labels as the percentage/ratio of correct pseudo-labels enrolled in the weighted unsupervised loss, assuming the ground-truth label $y^u$ of unlabeled data is known. The 0/1 correctness indicator function $\gamma(p)$ is defined as:

$$\gamma(p) = \mathbb{1}(\hat{p} = y^u) \in \{0, 1\},$$

where $\hat{p}$ is the one-hot vector of the pseudo-label $\mathrm{argmax}(p)$. We can formulate the quality as:

$$g(p) = \sum_i^{N_U} \gamma(p_i) \frac{\lambda(p_i)}{\sum_j^{N_U} \lambda(p_j)} = \sum_i^{N_U} \gamma(p_i) \bar{\lambda}(p_i) = \mathbb{E}_{\bar{\lambda}(p)}[\gamma(p)] = \mathbb{E}_{\bar{\lambda}(p)}[\mathbb{1}(\hat{p} = y^u)] \in [0, 1].$$

We denote $\bar{\lambda}(p)$ as the probability mass function (PMF) of $p$, with $\bar{\lambda}(p) \geq 0$ and $\sum \bar{\lambda}(p) = 1.0$. This indicates that, once $\lambda(p)$ is set to a function, an assumption on the PMF of $p$ is made. In most of the previous methods (Tarvainen & Valpola, 2017; Berthelot et al., 2019b; a; Sohn et al., 2020; Zhang et al., 2021; Xu et al., 2021b), although they do not explicitly set $\bar{\lambda}(p)$, the introduction of loss weight schemes implicitly relates to the PMF of $p$. While the ground-truth label $y^u$ is actually unknown in practice, we can still use it for theoretical analysis. In the following sections, we explicitly derive the sample weighting function $\lambda(p)$, the probability mass function $\bar{\lambda}(p)$, the quantity $f(p)$, and the quality $g(p)$ for each relevant method.

Published as a conference paper at ICLR 2023

A.1.2 NAIVE PSEUDO-LABELING

In naive pseudo-labeling (Lee et al., 2013), the pseudo-labels are directly used to train the model itself. This is equivalent to setting $\lambda(p)$ to a fixed value $\lambda_{\max}$, which is a hyper-parameter. We can write:

$$\lambda(p) = \lambda_{\max}, \quad \bar{\lambda}(p) = \frac{\lambda_{\max}}{N_U \lambda_{\max}} = \frac{1}{N_U}, \quad f(p) = \sum_i^{N_U} \frac{\lambda_{\max}}{N_U} = \lambda_{\max}, \quad g(p) = \sum_i^{N_U} \frac{\mathbb{1}(\hat{p}_i = y_i^u)}{N_U}.$$

We can observe that naive self-training maximizes the quantity of the pseudo-labels by fully enrolling them into training. However, full enrollment results in pseudo-labels of low quality. At the beginning of training, a large portion of the pseudo-labels would be wrong, i.e., $\gamma(p) = 0$, since the model is not well-learned. The wrong pseudo-labels usually lead to confirmation bias (Guo et al., 2017; Arazo et al., 2020) as training progresses, where the model memorizes the wrong pseudo-labels and becomes very confident on them. We can also notice that, by setting $\lambda(p)$ to a fixed value $\lambda_{\max}$, we implicitly assume the PMF of the model's prediction $p$ is uniform, which is far from the realistic distribution.

A.1.3 LOSS WEIGHT RAMP UP

In earlier attempts at semi-supervised learning, several works (Tarvainen & Valpola, 2017; Berthelot et al., 2019b; a) exploit the loss weight ramp-up technique to avoid involving too many erroneous pseudo-labels in early training and to let the model focus on learning from labeled data first. In this case, the sample weighting function is formulated as a function of the training iteration $t$, which increases linearly during training and reaches its maximum $\lambda_{\max}$ after $T$ warm-up iterations. Thus we have:

$$\lambda(p) = \lambda_{\max} \min\left(\frac{t}{T}, 1\right), \quad \bar{\lambda}(p) = \frac{\lambda_{\max} \min(\frac{t}{T}, 1)}{N_U \lambda_{\max} \min(\frac{t}{T}, 1)} = \frac{1}{N_U},$$

$$f(p) = \lambda_{\max} \min\left(\frac{t}{T}, 1\right), \quad g(p) = \sum_{i}^{N_U} \frac{\mathbb{1}(\hat{p}_i = y_i^u)}{N_U},$$

which demonstrates the same uniform assumption on the PMF and the same quality function as naive self-training. It also indicates that, as long as the same sample weight is used for all unlabeled data, a uniform assumption on the PMF over $p$ is made.
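A short sketch can illustrate why a sample-independent weight leaves the quality unchanged: the schedule cancels in the normalization. The function name, the warm-up length, and the correctness indicators below are hypothetical values chosen for illustration.

```python
def ramp_up_weight(t: int, warmup_iters: int, lambda_max: float = 1.0) -> float:
    """lambda(p) = lambda_max * min(t / T, 1); it ignores the prediction p."""
    return lambda_max * min(t / warmup_iters, 1.0)


# Hypothetical correctness indicators gamma(p_hat_i) for five unlabeled samples.
correct = [True, False, False, True, True]
n_u = len(correct)

results = {}
for t in (250, 1000, 2000):
    w = ramp_up_weight(t, warmup_iters=1000)
    weights = [w] * n_u                               # same weight for every sample
    pmf = [x / sum(weights) for x in weights]         # normalized weights: 1 / N_U each
    f = sum(weights) / n_u                            # quantity follows the schedule
    g = sum(m for m, ok in zip(pmf, correct) if ok)   # quality does not
    results[t] = (f, g, pmf)
```

Across the three iterations the quantity changes with the schedule, while the normalized weights stay uniform and the quality stays at the plain fraction of correct pseudo-labels. Naive pseudo-labeling corresponds to the case where the schedule has already saturated.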

A.1.4 FIXED CONFIDENCE THRESHOLDING

Confidence thresholding introduces a filtering mechanism, where unlabeled data whose prediction confidence $\max(p)$ is above a pre-defined threshold $\tau$ are fully enrolled during training, and the others are ignored (Xie et al., 2020; Sohn et al., 2020). The mechanism can be formulated by setting $\lambda(p)$ as a step function: when the confidence is above the threshold, the sample weight is set to $\lambda_{\max}$, and otherwise to 0. We can derive:

$$\lambda(p) = \begin{cases} \lambda_{\max}, & \text{if } \max(p) \geq \tau, \\ 0.0, & \text{otherwise.} \end{cases} \tag{21}$$

$$\bar{\lambda}(p) = \frac{\mathbb{1}(\max(p) \geq \tau)}{\sum_{i}^{N_U} \mathbb{1}(\max(p_i) \geq \tau)} = \begin{cases} \frac{1}{\hat{N}_U}, & \text{if } \max(p) \geq \tau, \\ 0.0, & \text{otherwise.} \end{cases} \tag{22}$$

$$f(p) = \sum_{i}^{N_U} \frac{\mathbb{1}(\max(p_i) \geq \tau)\, \lambda_{\max}}{N_U} = \lambda_{\max} \frac{\hat{N}_U}{N_U}, \quad g(p) = \sum_{i}^{\hat{N}_U} \frac{\mathbb{1}(\hat{p}_i = y_i^u)}{\hat{N}_U}, \tag{25}$$

where we set $\hat{N}_U = \sum_{i}^{N_U} \mathbb{1}(\max(p_i) \geq \tau)$, i.e., the number of unlabeled samples whose prediction confidence $\max(p)$ is above the threshold $\tau$. Interestingly, one can see that confidence thresholding directly models the PMF over the prediction confidence $\max(p)$. Although it still makes the uniform assumption, as shown in Eq. (22), it constrains the probability mass to concentrate in the range $[\tau, 1]$. As the model is more confident about these pseudo-labels, and the unconfident ones are excluded from training, it is more likely that $\hat{p}$ is close to $y^u$, which keeps the quality of the pseudo-labels high if a high threshold is used. However, a higher threshold corresponds to a smaller $\hat{N}_U$, directly reducing the quantity of pseudo-labels. We can clearly observe a trade-off between quantity and quality when using fixed confidence thresholding. In addition, assuming the PMF of $\max(p)$ to be uniform within the range $[\tau, 1]$ still does not reflect the actual distribution of confidence during training.
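The trade-off can be seen on a toy example. The sketch below applies the step-function weight of Eq. (21) at two thresholds; the `(confidence, is_correct)` pairs and the helper names are hypothetical, not data from the paper.

```python
from typing import List, Tuple


def threshold_weight(confidence: float, tau: float, lambda_max: float = 1.0) -> float:
    """Step-function weight: lambda_max at or above the threshold, else 0."""
    return lambda_max if confidence >= tau else 0.0


# Hypothetical (confidence, is_correct) pairs for six unlabeled samples.
samples: List[Tuple[float, bool]] = [
    (0.99, True), (0.97, True), (0.90, True),
    (0.80, False), (0.60, True), (0.55, False),
]


def quantity_quality(tau: float) -> Tuple[float, float]:
    weights = [threshold_weight(c, tau) for c, _ in samples]
    enrolled = sum(weights)
    f = enrolled / len(samples)
    if enrolled == 0.0:
        return f, 0.0  # assumed convention when nothing passes the threshold
    g = sum(w for w, (_, ok) in zip(weights, samples) if ok) / enrolled
    return f, g


low_tau = quantity_quality(0.5)     # every sample enrolled, two of them wrong
high_tau = quantity_quality(0.95)   # only the two most confident samples enrolled
```

With the low threshold the quantity is maximal but the quality drops to the overall accuracy; with the high threshold the quality is perfect on this toy data but two thirds of the samples, including two correct ones, are discarded.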

A.1.5 SOFTMATCH

In this paper, we propose SoftMatch to overcome the trade-off between quantity and quality of pseudo-labels. Different from previous methods, which implicitly make over-simplified assumptions on the distribution of $p$, we directly model the PMF of $\max(p)$, from which we derive the sample weighting function $\lambda(p)$ used in SoftMatch. We assume the confidence of model predictions $\max(p)$ generally follows the Gaussian distribution $\mathcal{N}(\max(p); \hat{\mu}_t, \hat{\sigma}_t)$ when $\max(p) < \mu_t$ and the uniform distribution when $\max(p) \geq \mu_t$. Note that $\mu_t$ and $\sigma_t$ change during training as the model learns better. One can see that the uniform part of the PMF is similar to that of confidence thresholding, and it is the Gaussian part that makes SoftMatch distinct from previous methods. In SoftMatch, we directly estimate the Gaussian parameters on $\max(p)$ using Maximum Likelihood Estimation (MLE), rather than setting them to fixed values, which is more consistent with the actual distribution of prediction confidence. Using the definition of the PMF $\bar{\lambda}(p)$, we can directly write the sample weighting function $\lambda(p)$ of SoftMatch as:

$$\lambda(p) = \begin{cases} \lambda_{\max} \sqrt{2\pi}\, \sigma_t\, \phi(\max(p); \mu_t, \sigma_t), & \max(p) < \mu_t, \\ \lambda_{\max}, & \max(p) \geq \mu_t, \end{cases}$$

where $\phi(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. Without loss of generality, we can assume $\max(p_i) < \mu_t$ for $i \in [0, \frac{N_U}{2}]$, as $\mu_t = \frac{1}{N_U} \sum_{i}^{N_U} \max(p_i)$ (shown in Eq. (6)) and thus $P(\max(p) < \mu_t) = 0.5$.
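Therefore, $\sum \lambda(p)$ is computed as follows:

$$\begin{aligned} \sum_{i}^{N_U} \lambda(p_i) &= \sum_{i=1}^{\frac{N_U}{2}} \lambda(p_i) + \sum_{j=\frac{N_U}{2}+1}^{N_U} \lambda(p_j) \\ &= \sum_{i}^{\frac{N_U}{2}} \lambda_{\max} \sqrt{2\pi}\, \sigma_t\, \phi(\max(p_i); \mu_t, \sigma_t) + \sum_{j=\frac{N_U}{2}+1}^{N_U} \lambda_{\max} \\ &= \lambda_{\max} \left( \frac{N_U}{2} + \sum_{i}^{\frac{N_U}{2}} \exp\left(-\frac{(\max(p_i) - \mu_t)^2}{2\sigma_t^2}\right) \right) \end{aligned}$$

Further,

$$\begin{aligned} f(p) &= \frac{1}{N_U} \sum_{i}^{N_U} \lambda(p_i) = \frac{1}{N_U} \left( \sum_{i=1}^{\frac{N_U}{2}} \lambda(p_i) + \sum_{j=\frac{N_U}{2}+1}^{N_U} \lambda(p_j) \right) \\ &= \frac{\lambda_{\max}}{N_U} \left( \frac{N_U}{2} + \sum_{j}^{\frac{N_U}{2}} \exp\left(-\frac{(\max(p_j) - \mu_t)^2}{2\sigma_t^2}\right) \right) \\ &= \frac{\lambda_{\max}}{2} + \frac{\lambda_{\max}}{N_U} \sum_{j}^{\frac{N_U}{2}} \exp\left(-\frac{(\max(p_j) - \mu_t)^2}{2\sigma_t^2}\right) \end{aligned} \tag{28}$$

Since $\max(p_i) < \mu_t$ for $i \in [0, \frac{N_U}{2}]$,

$$\exp\left(-\frac{(\frac{1}{C} - \mu_t)^2}{2\sigma_t^2}\right) \leq \exp\left(-\frac{(\max(p_i) - \mu_t)^2}{2\sigma_t^2}\right) < 1$$

$$\frac{N_U}{2} \exp\left(-\frac{(\frac{1}{C} - \mu_t)^2}{2\sigma_t^2}\right) \leq \sum_{i}^{\frac{N_U}{2}} \exp\left(-\frac{(\max(p_i) - \mu_t)^2}{2\sigma_t^2}\right) < \frac{N_U}{2}$$

$$\frac{\lambda_{\max}}{2} < \frac{\lambda_{\max}}{2} \left( 1 + \exp\left(-\frac{(\frac{1}{C} - \mu_t)^2}{2\sigma_t^2}\right) \right) \leq f(p) < \lambda_{\max}$$

Therefore, SoftMatch guarantees at least half of the possible contribution to the final loss, improving the utilization of unlabeled data. Besides, as $\sigma_t$ is also estimated from $\max(p)$, the lower bound of $f(p)$ becomes tighter during training with a better and more confident model. With the derived $\sum \lambda(p)$, we can write the PMF $\bar{\lambda}(p)$ in SoftMatch as:

$$\bar{\lambda}(p) = \begin{cases} \dfrac{\sqrt{2\pi}\, \sigma_t\, \phi(\max(p); \mu_t, \sigma_t)}{\frac{N_U}{2} + \sum_{i}^{\frac{N_U}{2}} \sqrt{2\pi}\, \sigma_t\, \phi(\max(p_i); \mu_t, \sigma_t)}, & \max(p) < \mu_t, \\[2ex] \dfrac{1}{\frac{N_U}{2} + \sum_{i}^{\frac{N_U}{2}} \sqrt{2\pi}\, \sigma_t\, \phi(\max(p_i); \mu_t, \sigma_t)}, & \max(p) \geq \mu_t, \end{cases}$$

and further derive the quality of pseudo-labels in SoftMatch as:

$$\begin{aligned} g(p) &= \sum_{i}^{N_U} \mathbb{1}(\hat{p}_i = y_i^u)\, \bar{\lambda}(p_i) = \frac{1}{\sum_{k}^{N_U} \lambda(p_k)} \sum_{i}^{N_U} \gamma(\hat{p}_i)\, \lambda(p_i) \\ &= \frac{1}{\sum_{k}^{N_U} \lambda(p_k)} \left( \sum_{i}^{\frac{N_U}{2}} \gamma(\hat{p}_i)\, \lambda(p_i) + \sum_{j=\frac{N_U}{2}+1}^{N_U} \gamma(\hat{p}_j)\, \lambda(p_j) \right) \\ &= \sum_{i}^{\frac{N_U}{2}} \gamma(\hat{p}_i) \frac{\lambda_{\max} \sqrt{2\pi}\, \sigma_t\, \phi(\max(p_i); \mu_t, \sigma_t)}{\sum_{k}^{N_U} \lambda(p_k)} + \sum_{j=\frac{N_U}{2}+1}^{N_U} \gamma(\hat{p}_j) \frac{\lambda_{\max}}{\sum_{k}^{N_U} \lambda(p_k)} \\ &= \sum_{i}^{N_U - \hat{N}_U} \mathbb{1}(\hat{p}_i = y_i^u) \frac{\exp\left(-\frac{(\max(p_i) - \mu_t)^2}{2\sigma_t^2}\right)}{2(N_U - \hat{N}_U)} + \sum_{j}^{\hat{N}_U} \frac{\mathbb{1}(\hat{p}_j = y_j^u)}{2 \hat{N}_U} \end{aligned} \tag{30}$$

where $\hat{N}_U = \sum_{i}^{N_U} \mathbb{1}(\max(p_i) \geq \mu_t)$.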
From the above equation, we can see that for pseudo-labels whose confidence is above $\mu_t$, the quality is as high as in confidence thresholding; for pseudo-labels whose confidence is lower, and which are thus more likely to be erroneous, the contribution to quality is weighted by the deviation from $\mu_t$. At the beginning of training, when the model is unconfident about most of the pseudo-labels, SoftMatch guarantees a quantity of at least $\frac{\lambda_{\max}}{2}$. As the model learns better and becomes more confident, i.e., $\mu_t$ increases and $\sigma_t$ decreases, the lower bound of quantity becomes tighter. The increase in $\hat{N}_U$ leads to better quality, as pseudo-labels whose confidence is below $\mu_t$ are further down-weighted. Therefore, SoftMatch overcomes the quantity-quality trade-off.
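The quantity bound derived above can be checked numerically on a toy batch. This is only a sketch: the confidence values are hypothetical and were picked so that exactly half of them fall below their mean, which matches the assumption used in the derivation; `softmatch_weight` is an illustrative helper name.

```python
import math


def softmatch_weight(confidence: float, mu: float, sigma: float,
                     lambda_max: float = 1.0) -> float:
    """Truncated Gaussian weight: Gaussian decay below mu, lambda_max at or above it."""
    if confidence >= mu:
        return lambda_max
    return lambda_max * math.exp(-((confidence - mu) ** 2) / (2.0 * sigma ** 2))


# Hypothetical max(p_i) values; four of the eight lie below the mean.
confidences = [0.35, 0.50, 0.62, 0.71, 0.88, 0.93, 0.97, 0.99]
num_classes = 10
lambda_max = 1.0

mu = sum(confidences) / len(confidences)
sigma = math.sqrt(sum((c - mu) ** 2 for c in confidences) / len(confidences))

weights = [softmatch_weight(c, mu, sigma, lambda_max) for c in confidences]
f_value = sum(weights) / len(weights)

# Lower bound on f(p): the smallest possible confidence is 1 / C.
lower_bound = 0.5 * lambda_max * (
    1.0 + math.exp(-((1.0 / num_classes - mu) ** 2) / (2.0 * sigma ** 2))
)
```

On this batch `f_value` lies between `lower_bound` and `lambda_max`, and `lower_bound` itself exceeds `lambda_max / 2`, as the chain of inequalities states.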

A.2 ALGORITHM

We present the pseudo-code of SoftMatch in this section. SoftMatch adopts the truncated Gaussian function with parameters estimated from the EMA of the confidence distribution at each training step, which introduces only trivial computation.

Algorithm 1 SoftMatch algorithm.

Input: Number of classes $C$, labeled batch $\{x_i, y_i\}_{i \in [B_L]}$, unlabeled batch $\{u_i\}_{i \in [B_U]}$, and EMA momentum $m$.
Define: $p_i = p(y \mid \omega(u_i))$
$\mathcal{L}_s = \frac{1}{B_L} \sum_{i=1}^{B_L} \mathcal{H}(y_i, p(y \mid \omega(x_i)))$ ▷ Compute $\mathcal{L}_s$ on labeled batch
$\hat{\mu}_b = \frac{1}{B_U} \sum_{i=1}^{B_U} \max(p_i)$ ▷ Compute the mean of confidence
$\hat{\sigma}_b^2 = \frac{1}{B_U} \sum_{i=1}^{B_U} (\max(p_i) - \hat{\mu}_b)^2$ ▷ Compute the variance of confidence
$\hat{\mu}_t = m \hat{\mu}_{t-1} + (1 - m) \hat{\mu}_b$ ▷ Update EMA of mean
$\hat{\sigma}_t^2 = m \hat{\sigma}_{t-1}^2 + (1 - m) \frac{B_U}{B_U - 1} \hat{\sigma}_b^2$ ▷ Update EMA of variance
for $i = 1$ to $B_U$ do
    $\lambda(p_i) = \exp\left(-\frac{(\max(\mathrm{UA}(p_i)) - \hat{\mu}_t)^2}{2 \hat{\sigma}_t^2}\right)$ if $\max(\mathrm{UA}(p_i)) < \hat{\mu}_t$, and $1.0$ otherwise ▷ Compute loss weight
end for
$\mathcal{L}_u = \frac{1}{B_U} \sum_{i=1}^{B_U} \lambda(p_i)\, \mathcal{H}(\hat{p}_i, p(y \mid \Omega(u_i)))$ ▷ Compute $\mathcal{L}_u$ on unlabeled batch
Return: $\mathcal{L}_s + \mathcal{L}_u$

The training parameters used are shown in Table 8. Note that for strong augmentation, we use back-translation similar to Xie et al. (2020). We conduct back-translation offline before training, using EN-DE and EN-RU with models provided in fairseq (Ott et al., 2019). We use NVIDIA V100 GPUs to train all text classification models; the total training time is around 20 hours.
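The statistics-tracking and weighting steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch, not the reference implementation: the class name is hypothetical, the initial values of the EMA mean and variance are assumptions (the listing does not state them), the model and loss computations are omitted, and Uniform Alignment is left out, so the weights are computed from the raw confidences.

```python
import math
from typing import List, Sequence


class TruncatedGaussianWeighter:
    """Tracks EMA mean/variance of confidence and turns them into loss weights."""

    def __init__(self, num_classes: int, momentum: float = 0.999) -> None:
        self.momentum = momentum
        # Assumed initial values; not given in the algorithm listing.
        self.mean = 1.0 / num_classes
        self.var = 1.0

    def update(self, confidences: Sequence[float]) -> None:
        """One EMA step from the max(p_i) values of an unlabeled batch (size >= 2)."""
        b = len(confidences)
        batch_mean = sum(confidences) / b
        batch_var = sum((c - batch_mean) ** 2 for c in confidences) / b
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * batch_mean
        # The factor B / (B - 1) converts the biased batch variance to the unbiased one.
        self.var = m * self.var + (1.0 - m) * (b / (b - 1)) * batch_var

    def weights(self, confidences: Sequence[float]) -> List[float]:
        """Gaussian weight below the EMA mean, 1.0 at or above it."""
        out = []
        for c in confidences:
            if c < self.mean:
                out.append(math.exp(-((c - self.mean) ** 2) / (2.0 * self.var)))
            else:
                out.append(1.0)
        return out


# Hypothetical usage: feed the same toy batch repeatedly so the EMA settles.
weighter = TruncatedGaussianWeighter(num_classes=10, momentum=0.9)
batch = [0.30, 0.55, 0.80, 0.95]
for _ in range(50):
    weighter.update(batch)
batch_weights = weighter.weights(batch)
```

In this toy run the two samples above the tracked mean receive weight 1.0, and the two below it are down-weighted in order of their distance from the mean. The momentum of 0.9 is used only to make the EMA converge quickly in the example.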

A.4 EXTEND EXPERIMENT RESULTS

In this section, we provide detailed experiments on the implementation of the sample weighting function in the unlabeled loss, as shown in Table 9. One can observe that most fixed functions work surprisingly well on CIFAR-10 with 250 labels, yet the Gaussian function demonstrates the best results on CIFAR-10 with 40 labels. On SVHN with 40 labels, the Linear and Quadratic functions fail to learn, while the Laplacian and Gaussian functions show better performance. Estimating the function parameters from the confidence and truncating the function allow the model to learn more flexibly and yield better performance for both the Laplacian and Gaussian functions. We visualize the functions studied in Fig. 4, where one can observe that the truncated Gaussian function is the most reasonable, assigning diverse weights to samples whose confidence is within its standard deviation.

EMA momentum. We compare SoftMatch with momentum 0.99, 0.999, and 0.9999 and present the results in Table 10. A momentum of 0.999 shows the best results. While different momentum values do not affect the final performance much, they have a larger impact on convergence speed: a smaller momentum results in faster convergence yet lower accuracy, and a larger momentum slows down convergence.

Variance range. We study the variance range of the Gaussian function. In all experiments of the main paper, we use the 2σ range, i.e., we divide the estimated variance $\hat{\sigma}_t$ by 4 in practice. The variance range directly affects the degree of softness of the truncated Gaussian function. We show in Table 11 that using σ directly results in a slight performance drop, while 2σ and 3σ produce similar results.

UA target distribution. In the main paper, we validate the target distribution of UA on the long-tailed setting. We also include the effect of the target distribution of UA on the balanced setting. As shown in Table 12, using the uniform distribution $u(c)$ or the ground-truth marginal distribution $p_L(y)$ produces the same results, whereas using the estimated $\hat{p}_L(y)$ (Berthelot et al., 2021) leads to a performance drop.
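To see how the variance range changes the softness of the weighting function, the sketch below evaluates the truncated Gaussian weight for one below-mean confidence under three ranges. The text only states that the 2σ range divides the estimated variance by 4; treating a kσ range as a division by k² for k = 1 and k = 3 is an assumption made here by analogy, and the mean, variance, and confidence values are hypothetical.

```python
import math


def weight_with_range(confidence: float, mean: float, var: float, k: int) -> float:
    """Truncated Gaussian weight with the variance divided by k**2 (assumed rule)."""
    if confidence >= mean:
        return 1.0
    scaled_var = var / (k * k)
    return math.exp(-((confidence - mean) ** 2) / (2.0 * scaled_var))


# Hypothetical statistics and a sample one unscaled standard deviation below the mean.
mean, var = 0.9, 0.01
confidence = 0.8

range_weights = {k: weight_with_range(confidence, mean, var, k) for k in (1, 2, 3)}
```

Under this reading, a wider range shrinks the effective variance, so the same below-mean sample is down-weighted more strongly as k grows.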

A.6 EXTEND ANALYSIS ON TRUNCATED GAUSSIAN

In this section, we provide further visualization of the confidence distribution of pseudo-labels and of the weighting function, similar to Fig. 1 (a) but on CIFAR-10. More specifically, we plot the histogram of confidence of pseudo-labels and of wrong pseudo-labels, from epoch 1 to 6. We select the first 5 epochs because the difference is more significant. Along with the histogram, we also plot the current weighting function over confidence, to visualize how pseudo-labels in different confidence intervals are used by different methods. Fig. 5 summarizes the visualization. Interestingly, although FixMatch adopts quite a high threshold, the quality of its pseudo-labels is very low, i.e., there are more wrong pseudo-labels in each confidence interval. This reflects the importance of involving more pseudo-labels in training at the beginning, as in SoftMatch, so that the model learns in a more balanced way across classes and the quality of pseudo-labels improves.

A.7 EXTEND ANALYSIS ON UNIFORM ALIGNMENT

In this section, we provide more explanation regarding the mechanism of Uniform Alignment (UA). UA is proposed to make the model learn more equally on each class to reduce the pseudo-label imbalance/bias. To do so, we align the expected prediction probability to a uniform distribution



All experiments in Section 4.1, Section 4.2, and Section 4.5 are conducted with TorchSSL (Zhang et al., 2021), and those in Section 4.3 are conducted with USB (Wang et al., 2022b), since it only supported NLP tasks at the time. More recent results of SoftMatch are included in USB along with its updates; refer to https://github.com/Hhhhhhao/SoftMatch for details.



Figure 1: Illustration on Two-Moon Dataset with only 4 labeled samples (triangle purple/pink points) with others as unlabeled samples in training a 3-layer MLP classifier. Training detail is in Appendix. (a) Confidence distribution, including all predictions and wrong predictions. The red line denotes the correct percentage of samples used by SoftMatch. The part of the line above scatter points denotes the correct percentage for FixMatch (blue) and FlexMatch (green). (b) Quantity of pseudo-labels; (c) Quality of pseudo-labels; (d) Decision boundary. SoftMatch exploits almost all samples during training with lowest error rate and best decision boundary.

Figure 2: Qualitative analysis of FixMatch, FlexMatch, and SoftMatch on CIFAR-10 with 250 labels. (a) Evaluation error; (b) Quantity of Pseudo-Labels; (c) Quality of Pseudo-Labels; (d) Quality of Pseudo-Labels from the best and worst learned class. Quality is computed according to the underlying ground truth labels. SoftMatch achieves significantly better performance.

From Fig. 2(b) and Fig. 2(c), one can observe that SoftMatch obtains the highest quantity and quality of pseudo-labels throughout training. The quality of FixMatch and FlexMatch shows larger error and more fluctuation due to the nature of confidence thresholding, where significantly more wrong pseudo-labels are enrolled into training, leading to larger variance in quality and thus unstable training. While attaining high quality, SoftMatch also substantially improves the unlabeled data utilization ratio, i.e., the quantity, as shown in Fig. 2(b), demonstrating that the truncated Gaussian function can address the quantity-quality trade-off of pseudo-labels. We also present the quality of the best and worst learned classes, as shown in Fig. 2(d), where both remain the highest throughout training in SoftMatch. The well-solved quantity-quality trade-off allows SoftMatch to achieve better convergence and error rate, especially for the first 50k iterations, as in Fig. 2(a).

4.5 ABLATION STUDY

Sample Weighting Functions. We validate different instantiations of $\lambda(p)$ to verify the effectiveness of the truncated Gaussian assumption on the PMF $\bar{\lambda}(p)$, as shown in Fig. 3(b). Both the Linear and Quadratic functions fail to generalize and show a large performance gap relative to the Gaussian function, due to the naive assumption on the PMF discussed before. The truncated Laplacian assumption also works well on different settings, but the truncated Gaussian demonstrates the most robust performance.

Figure 3: Ablation study of SoftMatch. (a) Target distributions for Uniform Alignment (UA) on long-tailed setting; (b) Error rate of different sample functions; (c) Error rate of different Gaussian parameter estimation, with UA enabled; (d) Ablation on UA with Gaussian parameter estimation.

We further include the detailed ablation of sample functions and several additional ablation studies in Appendix A.5 due to the space limit. These studies demonstrate that SoftMatch stays robust to different EMA momentum, variance range, and UA target distributions on balanced distribution settings.

additional ablation study of other components of SoftMatch, including the EMA momentum parameter m, the variance range of truncated Gaussian function, and the target distribution of Uniform Alignment (UA), on CIFAR-10 with 250 labels.

Figure 4: Sample weighting function visualization

Summary of different sample weighting function $\lambda(p)$, probability mass function $\bar{\lambda}(p)$ of $p$, quantity $f(p)$ and quality $g(p)$ of pseudo-labels used in previous methods and SoftMatch.

Top-1 error rate (%) on CIFAR-10, CIFAR-100, STL-10, and SVHN of 3 different random seeds. Numbers with * are taken from the original papers. The best number is in bold.

Top-1 error rate (%) on ImageNet. The best number is in bold.

Top-1 error rate (%) on CIFAR-10-LT and CIFAR-100-LT of 5 different random seeds. The best number is in bold.

Top-1 error rate (%) on text datasets of 3 different random seeds. Best numbers are in bold.

Hyper-parameters of text classification tasks.

Detailed results of different instantiations of $\lambda(p)$ on CIFAR-10 with 40 and 250 labels, and SVHN with 40 labels.

Ablation of EMA momentum m on CIFAR-10 with 250 labels.

Ablation of variance range in Gaussian function on CIFAR-10 with 250 labels.

Ablation of target distribution of UA on CIFAR-10 with 250 labels.

A.3.1 CLASSIC IMAGE CLASSIFICATION

We present the detailed hyper-parameters used for the classic image classification setting in Table 6 for reproduction. We use NVIDIA V100 for training of classic image classification. The training time for CIFAR-10 and SVHN on a single GPU is around 3 days, whereas the training time for CIFAR-100 and STL-10 is around 7 days. 

[Figure panel: "Epoch 1, Acc: 28.1%"; axis label: "Percentage of Samples"]

The alignment is applied when computing the sample weights. A difference between UA and DA is that UA is used only in weight computation and not in the consistency loss. To visualize this, we plot the average class weight according to the pseudo-labels of SoftMatch before UA and after UA at the beginning of training, as shown in Fig. 6. UA yields more balanced class-wise sample weights, which helps the model learn more equally on each class.
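One way to picture the alignment step is the sketch below. The exact form used here, scaling each class probability by the ratio of the uniform target to a running average prediction and then renormalizing, is an assumption for illustration, since this section does not spell out the formula; the function name and the numbers are hypothetical as well.

```python
from typing import List, Sequence


def uniform_alignment(probs: Sequence[float], mean_probs: Sequence[float]) -> List[float]:
    """Rescale a prediction toward a uniform class prior, then renormalize.

    Assumed form: p * (1 / C) / E[p], followed by normalization to sum to 1.
    """
    num_classes = len(probs)
    scaled = [p * (1.0 / num_classes) / m for p, m in zip(probs, mean_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical average prediction that is biased toward class 0.
mean_probs = [0.6, 0.3, 0.1]
probs = [0.5, 0.2, 0.3]

aligned = uniform_alignment(probs, mean_probs)
# Only the aligned confidence would enter the weight computation;
# the pseudo-label and the consistency loss would still use the raw probs.
aligned_confidence = max(aligned)
```

In this toy case the under-predicted class 2 gains probability mass after alignment, so samples from weakly-learned classes would receive larger weights than they would from the raw confidence. If the average prediction is already uniform, the function returns the input unchanged up to rounding.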

