FREEMATCH: SELF-ADAPTIVE THRESHOLDING FOR SEMI-SUPERVISED LEARNING

Abstract

Semi-supervised learning (SSL) has witnessed great success owing to the impressive performance brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data effectively since they use either a pre-defined / fixed threshold or an ad-hoc threshold-adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and the model's learning status. Based on this analysis, we propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model to make diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch, especially when labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The code can be found at https://github.com/

1. INTRODUCTION

The superior performance of deep learning heavily relies on supervised training with sufficient labeled data (He et al., 2016; Vaswani et al., 2017; Dong et al., 2018). However, it remains laborious and expensive to obtain massive labeled data. To alleviate such reliance, semi-supervised learning (SSL) (Zhu, 2005; Zhu & Goldberg, 2009; Sohn et al., 2020; Rosenberg et al., 2005; Gong et al., 2016; Kervadec et al., 2019; Dai et al., 2017) is developed to improve the model's generalization performance by exploiting a large volume of unlabeled data. Pseudo labeling (Lee et al., 2013; Xie et al., 2020b; McLachlan, 1975; Rizve et al., 2020) and consistency regularization (Bachman et al., 2014; Samuli & Timo, 2017; Sajjadi et al., 2016) are two popular paradigms designed for modern SSL. Recently, their combinations have shown promising results (Xie et al., 2020a; Sohn et al., 2020; Pham et al., 2021; Xu et al., 2021; Zhang et al., 2021). The key idea is that the model should produce similar predictions or the same pseudo labels for the same unlabeled data under different perturbations, following the smoothness and low-density assumptions in SSL (Chapelle et al., 2006). A potential limitation of these threshold-based methods is that they need either a fixed threshold (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Guo & Li, 2022) or an ad-hoc threshold-adjusting scheme (Xu et al., 2021) to compute the loss with only confident unlabeled samples. Specifically, UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) retain a fixed high threshold to ensure the quality of pseudo labels. However, a fixed high threshold (0.95) can lead to low data utilization in the early training stages and ignores the different learning difficulties of different classes. Dash (Xu et al., 2021) and AdaMatch (Berthelot et al., 2022) propose to gradually grow the fixed global (dataset-specific) threshold as training progresses.
Although the utilization of unlabeled data is improved, their ad-hoc threshold-adjusting schemes are arbitrarily controlled by hyper-parameters and thus disconnected from the model's learning process. FlexMatch (Zhang et al., 2021) demonstrates that different classes should have different local (class-specific) thresholds. While the local thresholds take into account the learning difficulties of different classes, they are still mapped from a pre-defined fixed global threshold. Adsh (Guo & Li, 2022) obtains adaptive thresholds from a pre-defined threshold for imbalanced SSL by optimizing the number of pseudo labels for each class. In a nutshell, these methods might be incapable of, or insufficient at, adjusting thresholds according to the model's learning progress, thus impeding the training process, especially when labeled data are too scarce to provide adequate supervision. For example, as shown in Figure 1(a), on the "two-moon" dataset with only 1 labeled sample for each class, the decision boundaries obtained by previous methods violate the low-density assumption. Two questions then naturally arise: 1) Is it necessary to determine the threshold based on the model's learning status? 2) How can the threshold be adaptively adjusted for the best training efficiency? In this paper, we first leverage a motivating example to demonstrate that different datasets and classes should determine their global (dataset-specific) and local (class-specific) thresholds based on the model's learning status. Intuitively, we need a low global threshold to utilize more unlabeled data and speed up convergence at early training stages. As the prediction confidence increases, a higher global threshold is necessary to filter out wrong pseudo labels and alleviate the confirmation bias (Arazo et al., 2020). Besides, a local threshold should be defined for each class based on the model's confidence in its predictions.
The "two-moon" example in Figure 1(a) shows that the decision boundary is more reasonable when adjusting the thresholds based on the model's learning status. We then propose FreeMatch to adjust the thresholds in a self-adaptive manner according to the learning status of each class (Guo et al., 2017). Specifically, FreeMatch uses the self-adaptive thresholding (SAT) technique to estimate both the global (dataset-specific) and local (class-specific) thresholds via the exponential moving average (EMA) of the unlabeled data confidence. To handle barely supervised settings (Sohn et al., 2020) more effectively, we further propose a class fairness objective to encourage the model to produce fair (i.e., diverse) predictions among all classes (as shown in Figure 1(b)). The overall training objective of FreeMatch maximizes the mutual information between the model's input and output (John Bridle, 1991), producing confident and diverse predictions on unlabeled data. Benchmark results validate its effectiveness. To conclude, our contributions are:

- Using a motivating example, we discuss why thresholds should reflect the model's learning status and provide some intuitions for designing a threshold-adjusting scheme.
- We propose a novel approach, FreeMatch, which consists of Self-Adaptive Thresholding (SAT) and Self-Adaptive class Fairness regularization (SAF). SAT is a threshold-adjusting scheme that is free of manually set thresholds, and SAF encourages diverse predictions.
- Extensive results demonstrate the superior performance of FreeMatch on various SSL benchmarks, especially when the number of labels is very limited (e.g., an error reduction of 5.78% on CIFAR-10 with 1 labeled sample per class).

2. A MOTIVATING EXAMPLE

In this section, we introduce a binary classification example to motivate our threshold-adjusting scheme. Despite the simplification of the actual model and training process, the analysis leads to some interesting implications and provides insight into how the thresholds should be set. We aim to demonstrate the necessity of self-adaptability and increased granularity in confidence thresholding for SSL. Inspired by (Yang & Xu, 2020), we consider a binary classification problem where the true distribution is an even mixture of two Gaussians (i.e., the label Y is equally likely to be positive (+1) or negative (-1)). The input X has the following conditional distribution: X | Y = -1 ∼ N(µ_1, σ_1²) and X | Y = +1 ∼ N(µ_2, σ_2²). We assume µ_2 > µ_1 without loss of generality. Suppose that our classifier outputs the confidence score s(x) = 1 / (1 + exp(-β(x - (µ_1 + µ_2)/2))), where β is a positive parameter that reflects the model's learning status; it is expected to grow gradually during training as the model becomes more confident. Note that (µ_1 + µ_2)/2 is in fact the Bayes-optimal linear decision boundary. We consider the scenario where a fixed threshold τ ∈ (1/2, 1) is used to generate pseudo labels. A sample x is assigned pseudo label +1 if s(x) > τ and -1 if s(x) < 1 - τ. The pseudo label is 0 (masked) if 1 - τ ≤ s(x) ≤ τ. We then derive the following theorem to show the necessity of a self-adaptive threshold:

Theorem 2.1. For the binary classification problem above, the pseudo label Y^p has the following probability distribution:

P(Y^p = 1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_2) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_1),
P(Y^p = -1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_1) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_2),
P(Y^p = 0) = 1 - P(Y^p = 1) - P(Y^p = -1),

where Φ is the cumulative distribution function of the standard normal distribution. Moreover, P(Y^p = 0) increases as µ_2 - µ_1 gets smaller. The proof is offered in Appendix B.
Theorem 2.1 has the following implications: (i) Trivially, the unlabeled data utilization (sampling rate) 1 - P(Y^p = 0) is directly controlled by the threshold τ: as the confidence threshold τ gets larger, the utilization gets lower. At early training stages, adopting a high threshold may lead to a low sampling rate and slow convergence since β is still small. (ii) More interestingly, P(Y^p = 1) ≠ P(Y^p = -1) if σ_1 ≠ σ_2. In fact, the larger τ is, the more imbalanced the pseudo labels are. This is undesirable in the sense that we aim to tackle a balanced classification problem: imbalanced pseudo labels may distort the decision boundary and lead to the so-called pseudo label bias. An easy remedy is to use class-specific thresholds τ_2 and 1 - τ_1 to assign pseudo labels. (iii) The sampling rate 1 - P(Y^p = 0) decreases as µ_2 - µ_1 gets smaller. In other words, the more similar the two classes are, the more likely an unlabeled sample will be masked. As the two classes get more similar, more samples are mixed in the feature space where the model is less confident about its predictions, so a moderate threshold is needed to balance the sampling rate; otherwise we may not have enough samples to train the model on the already difficult-to-classify classes. The intuition provided by Theorem 2.1 is that at the early training stages, τ should be low to encourage diverse pseudo labels, improve unlabeled data utilization, and speed up convergence. However, as training continues and β grows larger, a consistently low threshold will lead to unacceptable confirmation bias. Ideally, the threshold τ should increase along with β to maintain a stable sampling rate throughout training.
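The closed-form distribution in Theorem 2.1 can be checked numerically. The sketch below (plain Python; all parameter values are illustrative choices of ours, not from the paper) evaluates the pseudo-label probabilities and confirms implications (i) and (ii): a larger τ masks more data, and unequal variances yield imbalanced pseudo labels.

```python
import math

def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pseudo_label_dist(mu1, mu2, sigma1, sigma2, beta, tau):
    """Pseudo-label probabilities (P(Y^p=1), P(Y^p=-1), P(Y^p=0)) from Theorem 2.1."""
    d = (mu2 - mu1) / 2.0
    c = math.log(tau / (1.0 - tau)) / beta
    p_pos = 0.5 * phi((d - c) / sigma2) + 0.5 * phi((-d - c) / sigma1)
    p_neg = 0.5 * phi((d - c) / sigma1) + 0.5 * phi((-d - c) / sigma2)
    return p_pos, p_neg, 1.0 - p_pos - p_neg

# (i) a higher threshold masks more unlabeled data (lower sampling rate)
_, _, masked_low  = pseudo_label_dist(0.0, 2.0, 1.0, 1.5, beta=1.0, tau=0.70)
_, _, masked_high = pseudo_label_dist(0.0, 2.0, 1.0, 1.5, beta=1.0, tau=0.95)
assert masked_high > masked_low

# (ii) with sigma1 != sigma2, the pseudo labels become imbalanced
p_pos, p_neg, _ = pseudo_label_dist(0.0, 2.0, 1.0, 1.5, beta=1.0, tau=0.95)
assert abs(p_pos - p_neg) > 1e-3
```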
Since different classes have different levels of intra-class diversity (different σ) and some classes are harder to classify than others (µ_2 - µ_1 being small), a fine-grained class-specific threshold is desirable to encourage a fair assignment of pseudo labels to different classes. The challenge is to design a threshold-adjusting scheme that takes all implications into account, which is the main contribution of this paper. We demonstrate our algorithm by plotting the average threshold trend and the marginal pseudo label probability (i.e., sampling rate) during training in Figures 1(c) and 1(d). To sum up, we should determine the global (dataset-specific) and local (class-specific) thresholds by estimating the learning status via the model's predictions. We detail FreeMatch next.

3. PRELIMINARIES

In SSL, the training data consist of labeled and unlabeled data. Let D_L = {(x_b, y_b) : b ∈ [N_L]} and D_U = {u_b : b ∈ [N_U]} be the labeled and unlabeled data, where N_L and N_U are their numbers of samples, respectively. The supervised loss for labeled data is:

L_s = (1/B) Σ_{b=1}^{B} H(y_b, p_m(y | ω(x_b))),

where B is the batch size, H(·, ·) refers to the cross-entropy loss, ω(·) is the stochastic data augmentation function, and p_m(·) is the output probability from the model. For unlabeled data, we focus on pseudo labeling using the cross-entropy loss with a confidence threshold for entropy minimization. We also adopt the "Weak and Strong Augmentation" strategy introduced by UDA (Xie et al., 2020a). Formally, the unsupervised training objective for unlabeled data is:

L_u = (1/µB) Σ_{b=1}^{µB} 1(max(q_b) > τ) · H(q̂_b, Q_b).

We use q_b and Q_b as abbreviations of p_m(y | ω(u_b)) and p_m(y | Ω(u_b)), respectively. q̂_b is the hard "one-hot" label converted from q_b, µ is the ratio of the unlabeled batch size to the labeled batch size, and 1(· > τ) is the indicator function for confidence-based thresholding with τ being the threshold. The weak augmentation (i.e., random crop and flip) and the strong augmentation (i.e., RandAugment (Cubuk et al., 2020)) are represented by ω(·) and Ω(·), respectively. Besides, a fairness objective L_f is usually introduced to encourage the model to predict each class at the same frequency, which usually has the form L_f = U log E_{µB}[q_b] (Andreas Krause, 2010), where U is a uniform prior distribution. One may notice that using a uniform prior not only prevents generalization to non-uniform data distributions but also ignores the fact that the underlying pseudo label distribution within a mini-batch may be imbalanced due to the sampling mechanism. Yet fairness across a batch is essential for fair utilization of samples with per-class thresholds, especially at early training stages.
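As a concrete illustration, the unsupervised objective L_u with a fixed threshold can be sketched in a few lines of NumPy (a hypothetical minimal version; the function name and toy inputs are ours, and real implementations operate on logits inside a training loop):

```python
import numpy as np

def unsupervised_loss(q_weak, q_strong, tau):
    """Pseudo-labeling loss with a fixed confidence threshold (sketch).

    q_weak, q_strong: (mu*B, C) softmax outputs for the weak / strong views.
    Only samples whose weak-view confidence exceeds tau contribute.
    """
    conf = q_weak.max(axis=1)            # max(q_b)
    pseudo = q_weak.argmax(axis=1)       # hard one-hot label q_hat_b
    mask = conf > tau                    # 1(max(q_b) > tau)
    # H(q_hat_b, Q_b): negative log-likelihood of the pseudo class on the strong view
    nll = -np.log(q_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (mask * nll).mean()           # average over the mu*B samples

# toy usage with random softmax outputs
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = unsupervised_loss(probs, probs, tau=0.2)
```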

4. FREEMATCH

4.1. SELF-ADAPTIVE THRESHOLDING

We advocate that the key to determining thresholds for SSL is that the thresholds should reflect the learning status. The learning status can be estimated by the prediction confidence of a well-calibrated model (Guo et al., 2017). Hence, we propose self-adaptive thresholding (SAT), which automatically defines and adaptively adjusts the confidence threshold for each class by leveraging the model's predictions on unlabeled data during training.

Self-adaptive Global Threshold The global threshold should relate to the model's confidence on unlabeled data, reflecting the overall learning status. Moreover, the global threshold should increase stably during training to ensure that incorrect pseudo labels are discarded. We set the global threshold τ_t as the average confidence of the model on unlabeled data, where t represents the t-th time step (iteration). However, it would be time-consuming to compute the confidence on all unlabeled data at every time step, or even every training epoch, due to its large volume. Instead, we estimate the global confidence as the exponential moving average (EMA) of the confidence at each training time step. We initialize τ_t as 1/C, where C indicates the number of classes. The global threshold τ_t is defined and adjusted as:

τ_t = 1/C,  if t = 0,
τ_t = λ τ_{t-1} + (1 - λ) · (1/µB) Σ_{b=1}^{µB} max(q_b),  otherwise,

where λ ∈ (0, 1) is the momentum decay of the EMA.
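The EMA update above can be sketched as a small stateful helper (the class wrapper is our own construction; λ = 0.999 is a commonly used EMA momentum, assumed here rather than quoted from the text):

```python
import numpy as np

class GlobalThreshold:
    """EMA estimate of the global threshold tau_t (sketch)."""

    def __init__(self, num_classes, lam=0.999):
        self.tau = 1.0 / num_classes          # tau_0 = 1/C
        self.lam = lam                        # EMA momentum lambda

    def update(self, q_batch):
        # q_batch: (mu*B, C) softmax outputs on weakly augmented unlabeled data
        batch_conf = q_batch.max(axis=1).mean()  # (1/muB) * sum_b max(q_b)
        self.tau = self.lam * self.tau + (1.0 - self.lam) * batch_conf
        return self.tau

# toy usage: the threshold rises toward the batch confidence as training proceeds
gt = GlobalThreshold(num_classes=10, lam=0.9)
confident = np.full((4, 10), 0.01)
confident[:, 0] = 0.91                        # each row sums to 1
for _ in range(50):
    tau = gt.update(confident)
```

The threshold starts at 1/C = 0.1 and approaches (but never exceeds) the average confidence 0.91, mirroring the stable increase described in the text.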

Self-adaptive Local Threshold

The local threshold aims to modulate the global threshold in a class-specific fashion to account for the intra-class diversity and the possible class adjacency. We compute the expectation of the model's predictions on each class c to estimate the class-specific learning status:

p̃_t(c) = 1/C,  if t = 0,
p̃_t(c) = λ p̃_{t-1}(c) + (1 - λ) · (1/µB) Σ_{b=1}^{µB} q_b(c),  otherwise,

where p̃_t = [p̃_t(1), p̃_t(2), . . . , p̃_t(C)] is the list containing all p̃_t(c). Integrating the global and local thresholds, we obtain the final self-adaptive threshold τ_t(c) as:

τ_t(c) = MaxNorm(p̃_t(c)) · τ_t = (p̃_t(c) / max{p̃_t(c) : c ∈ [C]}) · τ_t,

where MaxNorm is maximum normalization (i.e., x′ = x / max(x)). Finally, the unsupervised training objective L_u at the t-th iteration is:

L_u = (1/µB) Σ_{b=1}^{µB} 1(max(q_b) > τ_t(arg max(q_b))) · H(q̂_b, Q_b).
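The MaxNorm combination and the resulting per-sample mask can be sketched as follows (a hypothetical NumPy rendering; array shapes and variable names are ours):

```python
import numpy as np

def self_adaptive_thresholds(p_t, tau_t):
    """tau_t(c) = MaxNorm(p_t(c)) * tau_t: scale per-class estimates by the global threshold."""
    p_t = np.asarray(p_t, dtype=float)
    return p_t / p_t.max() * tau_t

def sat_mask(q_batch, p_t, tau_t):
    """Per-sample indicator 1(max(q_b) > tau_t(argmax q_b))."""
    tau_c = self_adaptive_thresholds(p_t, tau_t)
    conf = q_batch.max(axis=1)
    pred = q_batch.argmax(axis=1)
    return conf > tau_c[pred]

# toy check: the best-learned class keeps the full global threshold,
# while a harder class gets a proportionally lower one
tau_c = self_adaptive_thresholds([0.2, 0.4], tau_t=0.8)            # -> [0.4, 0.8]
mask = sat_mask(np.array([[0.6, 0.4], [0.3, 0.7]]), [0.2, 0.4], 0.8)
```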

4.2. SELF-ADAPTIVE FAIRNESS

We include the class fairness objective mentioned in Section 3 in FreeMatch to encourage the model to make diverse predictions for each class and thus produce a meaningful self-adaptive threshold, especially under settings where labeled data are rare. Instead of using a uniform prior as in (Arazo et al., 2020), we use the EMA of model predictions p̃_t from Eq. 6 as an estimate of the expectation of the prediction distribution over unlabeled data. We optimize the cross-entropy of p̃_t and p̄ = E_{µB}[p_m(y | Ω(u_b))] over the mini-batch as an estimate of H(E_u[p_m(y | u)]). Considering that the underlying pseudo label distribution may not be uniform, we propose to modulate the fairness objective in a self-adaptive way, i.e., normalizing the expectation of probability by the histogram distribution of pseudo labels to counter the negative effect of imbalance:

p̄ = (1/µB) Σ_{b=1}^{µB} 1(max(q_b) ≥ τ_t(arg max(q_b))) Q_b,
h̄ = Hist_{µB}(1(max(q_b) ≥ τ_t(arg max(q_b))) Q̂_b).

Similar to p̃_t, we compute h̃_t as:

h̃_t = λ h̃_{t-1} + (1 - λ) Hist_{µB}(q̂_b).

The self-adaptive fairness (SAF) objective L_f at the t-th iteration is formulated as:

L_f = -H(SumNorm(p̃_t / h̃_t), SumNorm(p̄ / h̄)),

where SumNorm(·) = (·) / Σ(·). SAF encourages the expectation of the output probability in each mini-batch to be close to the marginal class distribution of the model after normalization by the histogram distribution. It helps the model produce diverse predictions, especially in barely supervised settings (Sohn et al., 2020), and thus converge faster and generalize better, as also shown in Figure 1(b). The overall objective for FreeMatch at the t-th iteration is:

L = L_s + w_u L_u + w_f L_f,

where w_u and w_f represent the loss weights for L_u and L_f, respectively. With L_u and L_f, FreeMatch maximizes the mutual information between its inputs and outputs. We present the procedure of FreeMatch in Algorithm 1 of the Appendix.

5.1. SETUP

We evaluate FreeMatch on common benchmarks: CIFAR-10/100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011), and ImageNet (Deng et al., 2009). Following previous work (Sohn et al., 2020; Xu et al., 2021; Zhang et al., 2021; Oliver et al., 2018), we conduct experiments with varying amounts of labeled data. In addition to the commonly chosen labeled amounts, following (Sohn et al., 2020), we further include the most challenging case of CIFAR-10, where each class has only one labeled sample. For a fair comparison, we train and evaluate all methods using the unified codebase TorchSSL (Zhang et al., 2021) with the same backbones and hyperparameters. Concretely, we use Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) for CIFAR-10, Wide ResNet-28-8 for CIFAR-100, Wide ResNet-37-2 (Zhou et al., 2020) for STL-10, and ResNet-50 (He et al., 2016) for ImageNet. We use SGD with a momentum of 0.9 as the optimizer. The initial learning rate is 0.03 with a cosine learning rate decay schedule η = η₀ cos(7πk / 16K), where η₀ is the initial learning rate, k (K) is the current (total) training step, and we set K = 2²⁰ for all datasets. At test time, we use an exponential moving average of the training model with a momentum of 0.999 to conduct inference for all algorithms. The batch size of labeled data is 64 except for ImageNet, where we set it to 128. We use the same weight decay value, pre-defined threshold τ, unlabeled batch ratio µ, and loss weights as introduced for Pseudo-Label (Lee et al., 2013), Π model (Rasmus et al., 2015), Mean Teacher (Tarvainen & Valpola, 2017), VAT (Miyato et al., 2018), MixMatch (Berthelot et al., 2019b), ReMixMatch (Berthelot et al., 2019a), UDA (Xie et al., 2020a), FixMatch (Sohn et al., 2020), and FlexMatch (Zhang et al., 2021). We implement MPL based on UDA as in (Pham et al., 2021), where we set the temperature to 0.8 and w_u to 10.
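The learning-rate schedule can be written directly from the formula (η₀ = 0.03 and K = 2²⁰ as stated above):

```python
import math

def cosine_lr(step, total_steps=2**20, eta0=0.03):
    """eta = eta0 * cos(7*pi*k / (16*K)); decays from eta0 but never reaches 0,
    since the argument stays below pi/2 for k <= K."""
    return eta0 * math.cos(7.0 * math.pi * step / (16.0 * total_steps))

assert cosine_lr(0) == 0.03              # starts at the initial learning rate
assert 0.0 < cosine_lr(2**20) < 0.03     # still positive at the final step
```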
We do not fine-tune MPL on labeled data as in (Pham et al., 2021) since we find that fine-tuning makes the model overfit the labeled data, especially when there are very few of them. For Dash, we use the same parameters as in (Xu et al., 2021) except that we (1) warm up on labeled data and (2) restrict the threshold to the range [0.9, 0.95]. The detailed hyperparameters are introduced in Appendix D. We train each algorithm 3 times using different random seeds and report the best error rates over all checkpoints (Zhang et al., 2021).

5.2. QUANTITATIVE RESULTS

The Top-1 classification error rates on CIFAR-10/100, SVHN, and STL-10 are reported in Table 1. The results on ImageNet with 100 labels per class are in Table 2. We also provide detailed results on precision, recall, F1 score, and confusion matrices in Appendix E.3. These quantitative results demonstrate that FreeMatch achieves the best performance on the CIFAR-10, STL-10, and ImageNet datasets, and it produces results very close to the best competitor on SVHN. On CIFAR-100, FreeMatch is better than ReMixMatch when there are 400 labels. The good performance of ReMixMatch on CIFAR-100 (2500) and CIFAR-100 (10000) is probably brought by the mixup (Zhang et al., 2017) technique and the self-supervised learning part. On ImageNet with 100k labels, FreeMatch significantly outperforms the latest counterpart FlexMatch by 1.28%. We also notice from Table 2 that FreeMatch exhibits fast computation on ImageNet. Note that FlexMatch is much slower than FixMatch and FreeMatch because it needs to maintain a list that records whether each sample is clean, which incurs a heavy indexing computation budget on large datasets. Notably, FreeMatch consistently outperforms other methods by a large margin in settings with extremely limited labeled data: 5.78% on CIFAR-10 with 10 labels, 1.96% on CIFAR-100 with 400 labels, and, surprisingly, 13.59% on STL-10 with 40 labels. STL-10 is a more realistic and challenging dataset compared to the others, consisting of a large unlabeled set of 100k images. The significant improvements demonstrate the capability and potential of FreeMatch to be deployed in real-world applications.

5.3. QUALITATIVE ANALYSIS

We present some qualitative analysis: why and how does FreeMatch work, and what other benefits does it bring? We evaluate the class-average threshold and the average sampling rate on STL-10 (40) (i.e., 40 labeled samples on STL-10) to demonstrate that FreeMatch works in line with our theoretical analysis. We record the threshold and compute the sampling rate for each batch during training. The sampling rate is calculated on unlabeled data as (1/µB) Σ_{b=1}^{µB} 1(max(q_b) > τ_t(arg max(q_b))). To further demonstrate the effectiveness of the class-specific threshold in FreeMatch, we present the t-SNE (Van der Maaten & Hinton, 2008) visualization of the features of FlexMatch and FreeMatch on STL-10 (40) in Figure 5 of Appendix E.8, together with the corresponding local threshold for each class. Interestingly, FlexMatch has a high threshold, i.e., the pre-defined 0.95, for classes 0 and 6, yet their feature variances are very large and they are confused with other classes. This means the class-wise thresholds in FlexMatch cannot accurately reflect the learning status. In contrast, FreeMatch clusters most classes better. Besides, for the similar classes 1, 3, 5, and 7 that are confused with each other, FreeMatch retains a higher average threshold (0.87) than FlexMatch (0.84), enabling it to mask more wrong pseudo labels. We also study the pseudo label accuracy in Appendix E.9 and show that FreeMatch reduces noise during training.

5.4. ABLATION STUDY

Self-adaptive Threshold We conduct experiments on the components of SAT in FreeMatch and compare them to the components in FlexMatch (Zhang et al., 2021), FixMatch (Sohn et al., 2020), Class-Balanced Self-Training (CBST) (Zou et al., 2018), and the Relative Threshold (RT) in AdaMatch (Berthelot et al., 2022). The ablation is conducted on CIFAR-10 (40 labels). As shown in Table 3, SAT achieves the best performance among all the threshold schemes. The self-adaptive global threshold τ_t and local threshold MaxNorm(p̃_t(c)) by themselves also achieve results comparable to the fixed threshold τ, demonstrating that both the proposed local and global thresholds are good estimators of the learning status. When using CPL M(β(c)) to adjust τ_t, the result is worse than the fixed threshold and exhibits larger variance, indicating the potential instability of CPL. AdaMatch (Berthelot et al., 2022) uses RT, which can be viewed as a global threshold at the t-th iteration computed on the predictions of labeled data without EMA, whereas FreeMatch computes τ_t with an EMA on unlabeled data, which can better reflect the overall data distribution. For the class-wise threshold, CBST (Zou et al., 2018) maintains a pre-defined sampling rate, which could be the reason for its poor performance, since the sampling rate should change during training as we analyzed in Sec. 2. Note that we did not include L_f in this ablation for a fair comparison. Ablation studies in Appendices E.4 and E.5 on FixMatch and FlexMatch with different thresholds show that SAT reduces the hyperparameter-tuning computation and overall training time while matching the performance of an optimally selected threshold. Self-adaptive Fairness As illustrated in Table 4, we also empirically study the effect of SAF on CIFAR-10 (10 labels). We study the original version of the fairness objective as in (Arazo et al., 2020).
Based on that, we study the normalization of probabilities by histograms and show that countering the effect of the imbalanced underlying distribution indeed helps the model learn better and make more diverse predictions. One may notice that adding the original fairness regularization alone already helps improve the performance, whereas adding the normalization operation inside the log hurts the performance, suggesting the underlying batch data are indeed not uniformly distributed. We also evaluate Distribution Alignment (DA) for class fairness as in ReMixMatch (Berthelot et al., 2019a) and AdaMatch (Berthelot et al., 2022), which shows inferior results to SAF. A possible reason for the worse performance of DA (AdaMatch) is that it only uses the labeled batch prediction as the target distribution, which cannot reflect the true data distribution, especially when labeled data are scarce; changing the target distribution to the ground-truth uniform distribution, i.e., DA (ReMixMatch), is better in the case with extremely limited labels. We also show that SAF can be easily plugged into FlexMatch and bring improvements in Appendix E.6. The EMA decay ablation and the performance in imbalanced settings are in Appendix E.5 and Appendix E.7.

6. RELATED WORK

To reduce confirmation bias (Arazo et al., 2020) in pseudo labeling, confidence-based thresholding techniques have been proposed to ensure the quality of pseudo labels (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Xu et al., 2021), where only the unlabeled data whose confidence is higher than the threshold are retained. UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) keep a fixed pre-defined threshold during training. FlexMatch (Zhang et al., 2021) adjusts the pre-defined threshold in a class-specific fashion according to the per-class learning status estimated by the number of confident unlabeled samples. A concurrent work, Adsh (Guo & Li, 2022), explicitly optimizes the number of pseudo labels for each class in the SSL objective to obtain adaptive thresholds for imbalanced SSL; however, it still needs a user-predefined threshold. Dash (Xu et al., 2021) defines a threshold according to the loss on labeled data and adjusts it according to a fixed mechanism. A more recent work, AdaMatch (Berthelot et al., 2022), aims to unify SSL and domain adaptation using a pre-defined threshold multiplied by the average confidence of the labeled data batch to mask noisy pseudo labels. It needs a pre-defined threshold and ignores the unlabeled data distribution, especially when labeled data are too rare to reflect the unlabeled data distribution. Besides, distribution alignment (Berthelot et al., 2019a; 2022) is also utilized in AdaMatch to encourage fair predictions on unlabeled data. Previous methods might fail to choose meaningful thresholds because they ignore the relationship between the model's learning status and the thresholds. Chen et al. (2020) and Kumar et al. (2020) try to understand self-training / thresholding from a theoretical perspective. We use a motivating example to derive some implications and then adjust the thresholds according to the learning status so that the derived implications are satisfied.
Besides consistency regularization, entropy-based regularization is also used in SSL. Entropy minimization (Grandvalet et al., 2005) encourages the model to make confident predictions for all samples regardless of the actual class predicted. Maximizing the entropy of the expectation of predictions over all samples (Andreas Krause, 2010; Arazo et al., 2020) has also been proposed to induce fairness in the model, enforcing the model to predict each class at the same frequency. However, previous methods assume a uniform prior for the underlying data distribution and also ignore the batch data distribution. Distribution alignment (Berthelot et al., 2019a) adjusts the pseudo labels according to the labeled data distribution and the EMA of model predictions.

7. CONCLUSION

We proposed FreeMatch, which utilizes self-adaptive thresholding and class-fairness regularization for SSL. FreeMatch outperforms strong competitors across a variety of SSL benchmarks, especially in the barely-supervised setting. We believe that confidence thresholding has more potential in SSL. A potential limitation is that the adaptiveness still originates from heuristics on the model predictions, and we hope the efficacy of FreeMatch inspires more research on optimal thresholding.

A EXPERIMENTAL DETAILS OF THE "TWO-MOON" DATASET

We generate only two labeled data points (one label per class, denoted by the black dot and round circle) and 1,000 unlabeled data points (in gray) in 2-D space. We train a 3-layer MLP with 64 neurons in each layer and ReLU activations for 2,000 iterations. The red samples indicate the samples whose confidence values are above the threshold of FreeMatch but below that of FixMatch. The sampling rate is computed on unlabeled data as in Section 5.3.

B PROOF OF THEOREM 2.1

Theorem 2.1. For the binary classification problem above, the pseudo label Y^p has the following probability distribution:

P(Y^p = 1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_2) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_1),
P(Y^p = -1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_1) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_2),
P(Y^p = 0) = 1 - P(Y^p = 1) - P(Y^p = -1),

where Φ is the cumulative distribution function of the standard normal distribution. Moreover, P(Y^p = 0) increases as µ_2 - µ_1 gets smaller.

Proof. A sample x will be assigned pseudo label 1 if

1 / (1 + exp(-β(x - (µ_1 + µ_2)/2))) > τ,

which is equivalent to

x > (µ_1 + µ_2)/2 + (1/β) log(τ/(1-τ)).

Likewise, x will be assigned pseudo label -1 if

1 / (1 + exp(-β(x - (µ_1 + µ_2)/2))) < 1 - τ,

which is equivalent to

x < (µ_1 + µ_2)/2 - (1/β) log(τ/(1-τ)).
By integrating over x, we arrive at the following conditional probabilities:

P(Y^p = 1 | Y = 1) = Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_2),
P(Y^p = 1 | Y = -1) = Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_1),
P(Y^p = -1 | Y = -1) = Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_1),
P(Y^p = -1 | Y = 1) = Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_2).

Recall that P(Y = 1) = P(Y = -1) = 0.5; therefore

P(Y^p = 1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_2) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_1),
P(Y^p = -1) = ½ Φ(((µ_2 - µ_1)/2 - (1/β) log(τ/(1-τ))) / σ_1) + ½ Φ(((µ_1 - µ_2)/2 - (1/β) log(τ/(1-τ))) / σ_2).

Now let z denote µ_2 - µ_1. To show that P(Y^p = 0) increases as µ_2 - µ_1 gets smaller, we only need to show that P(Y^p = -1) + P(Y^p = 1) gets larger as z grows. We write

P(Y^p = -1) + P(Y^p = 1) = ½ Φ(a_1 z - b_1) + ½ Φ(-a_1 z - b_1) + ½ Φ(a_2 z - b_2) + ½ Φ(-a_2 z - b_2),

where a_1 = 1/(2σ_1), a_2 = 1/(2σ_2), b_1 = log(τ/(1-τ))/(βσ_1), and b_2 = log(τ/(1-τ))/(βσ_2) are positive constants. It then suffices to show that f(z) = ½ Φ(a_1 z - b_1) + ½ Φ(-a_1 z - b_1) is monotone increasing on (0, ∞). Taking the derivative with respect to z, we have

f′(z) = ½ a_1 (ϕ(a_1 z - b_1) - ϕ(-a_1 z - b_1)),

where ϕ is the probability density function of the standard normal distribution. Since |a_1 z - b_1| < |-a_1 z - b_1|, we have f′(z) > 0, and the proof is complete.
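The monotonicity argument at the heart of the proof is easy to check numerically; the sketch below (with a = 1/(2σ) and b = log(τ/(1-τ))/(βσ) fixed to arbitrary positive values of ours) verifies that f(z) = ½Φ(a z − b) + ½Φ(−a z − b) increases on (0, ∞):

```python
import math

def phi_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(z, a=0.5, b=1.0):
    """f(z) = 1/2 Phi(a*z - b) + 1/2 Phi(-a*z - b) from the proof."""
    return 0.5 * phi_cdf(a * z - b) + 0.5 * phi_cdf(-a * z - b)

# evaluate f on a grid over (0, 8) and check strict monotone increase
zs = [0.1 * i for i in range(1, 80)]
vals = [f(z) for z in zs]
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))
```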

C ALGORITHM

We present the pseudo-code of FreeMatch in Algorithm 1. Compared to FixMatch, each training step additionally updates the global and local thresholds from the unlabeled data batch and computes the corresponding histograms. FreeMatch introduces a negligible computation overhead compared to FixMatch, as also demonstrated in our main paper.

Algorithm 1 FreeMatch algorithm at the $t$-th iteration (the input and steps 1-6 are listed further below).
7:  $\tau_t(c) = \text{MaxNorm}(\tilde{p}_t(c)) \cdot \tau_t$ {Calculate SAT}
8:  end for
9:  Compute $\mathcal{L}_u$ on unlabeled data: $\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) \geq \tau_t(\arg\max(q_b))\big) \cdot \mathcal{H}(\hat{q}_b, Q_b)$
10: Compute the expectation of probability on unlabeled data: $\bar{p} = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) \geq \tau_t(\arg\max(q_b))\big)\, Q_b$ {$Q_b$ abbreviates $p_m(y|\Omega(u_b))$; shape of $\bar{p}$: [C]}
11: Compute the histogram for $\bar{p}$: $\bar{h} = \text{Hist}_{\mu B}\big(\mathbb{1}(\max(q_b) \geq \tau_t(\arg\max(q_b)))\, \hat{Q}_b\big)$ {shape of $\bar{h}$: [C]}
12: Compute $\mathcal{L}_f$ on unlabeled data: $\mathcal{L}_f = -\mathcal{H}\big(\text{SumNorm}(\tilde{p}_t / \tilde{h}_t), \text{SumNorm}(\bar{p} / \bar{h})\big)$
13: Return: $\mathcal{L}_s + w_u \cdot \mathcal{L}_u + w_f \cdot \mathcal{L}_f$

D HYPERPARAMETER SETTING

For reproducibility, we show the detailed hyperparameter settings for FreeMatch in Tables 5 and 6, for algorithm-dependent and algorithm-independent hyperparameters, respectively. As shown in Table 12, the performance of FixMatch and FlexMatch is quite sensitive to changes of the pre-defined threshold $\tau$.

E.5 ABLATION ON EMA DECAY ON CIFAR-10 (40)

We provide an ablation study on the EMA decay parameter $\lambda$ in Equations (5) and (6). One can observe that different decay values $\lambda$ produce close results on CIFAR-10 with 40 labels, indicating that FreeMatch is not sensitive to this hyperparameter. A very large $\lambda$ is nevertheless discouraged, since it could impede the update of the global / local thresholds.
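Steps 7-12 of Algorithm 1 (local thresholding, masking, and the fairness term) can be sketched in NumPy. This is a simplified illustration, not the authors' implementation: the function name `sat_mask_and_fairness` and the use of plain probability arrays in place of model outputs are assumptions.

```python
import numpy as np

def max_norm(p):
    # MaxNorm(p) = p / max(p)
    return p / p.max()

def sum_norm(p):
    # SumNorm(p) = p / sum(p)
    return p / p.sum()

def sat_mask_and_fairness(q, Q, tau_g, p_tilde, h_tilde):
    """q: [N, C] weak-augmentation probs, Q: [N, C] strong-augmentation probs,
    tau_g: scalar global threshold, p_tilde / h_tilde: [C] EMA statistics."""
    num_classes = q.shape[1]
    tau_c = max_norm(p_tilde) * tau_g                  # step 7: local SAT per class
    conf = q.max(axis=1)
    pred = q.argmax(axis=1)
    mask = conf >= tau_c[pred]                         # the indicator in step 9
    # step 10: expectation of strong-aug probabilities over confident samples
    p_bar = (Q * mask[:, None]).sum(axis=0) / len(q)
    # step 11: histogram of confident strong-aug pseudo labels
    h_bar = np.bincount(Q.argmax(axis=1)[mask], minlength=num_classes) / len(q)
    # step 12: L_f = -H(SumNorm(p~/h~), SumNorm(p_bar/h_bar)); eps guards log(0)
    target = sum_norm(p_tilde / np.maximum(h_tilde, 1e-8))
    actual = sum_norm(np.maximum(p_bar, 1e-8) / np.maximum(h_bar, 1e-8))
    loss_f = np.sum(target * np.log(actual))           # negated cross-entropy
    return mask, tau_c, loss_f
```

Note that `tau_c` never exceeds the global threshold `tau_g`, since MaxNorm scales the largest class statistic to exactly 1.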

E.6 ABLATION OF SAF ON FLEXMATCH AND FREEMATCH

In Table 13, we present a comparison of different class-fairness objectives on CIFAR-10 with 10 labels. FreeMatch is better than FlexMatch in both settings. In addition, SAF also proves effective when combined with FlexMatch.

E.7 ABLATION OF IMBALANCED SSL

To further demonstrate the effectiveness of FreeMatch, we evaluate it in the imbalanced SSL setting (Kim et al., 2020; Wei et al., 2021; Lee et al., 2021; Fan et al., 2021), where both the labeled and the unlabeled data are imbalanced. We conduct experiments on CIFAR-10-LT and CIFAR-100-LT with different imbalance ratios. The imbalance ratio used on the CIFAR datasets is defined as $\gamma = N_{\max} / N_{\min}$, where $N_{\max}$ is the number of samples in the head (most frequent) class and $N_{\min}$ the number of samples in the tail (rarest) class.
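For concreteness, a common long-tailed profile in this line of work assigns class $c$ (0-indexed) $N_{\max} \cdot \gamma^{-c/(C-1)}$ samples, so the head class receives $N_{\max}$ and the tail class $N_{\max}/\gamma = N_{\min}$. The exact per-class formula is not reproduced in this excerpt, so the profile and the helper name below are assumptions for illustration.

```python
# Hypothetical sketch of long-tailed class counts under imbalance ratio gamma.
def lt_counts(n_max, gamma, num_classes):
    # Exponential decay from the head class (n_max) to the tail (n_max / gamma).
    return [round(n_max * gamma ** (-c / (num_classes - 1)))
            for c in range(num_classes)]

counts = lt_counts(n_max=1500, gamma=100, num_classes=10)  # CIFAR-10-LT labeled set
# head class: counts[0] == 1500; tail class: counts[-1] == 1500 / 100 == 15
```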



Note that the results of this paper are obtained using TorchSSL (Zhang et al., 2021). We also provide code and logs in USB (Wang et al., 2022). We write $[N] := \{1, 2, \ldots, N\}$. Following (Zhang et al., 2021), we train on ImageNet for $2^{20}$ iterations, like the other datasets, for a fair comparison. We use Tesla V100 GPUs for ImageNet.



Figure 1: Demonstration of how FreeMatch works on the "two-moon" dataset. (a) Decision boundary of FreeMatch and other SSL methods. (b) Decision boundary improvement of self-adaptive fairness (SAF) with two labeled samples per class. (c) Class-average confidence threshold. (d) Class-average sampling rate of FreeMatch during training. The experimental details are in Appendix A.

Although the utilization of unlabeled data is improved, their ad-hoc threshold adjusting scheme is arbitrarily controlled by hyperparameters and thus disconnected from the model's learning process. FlexMatch (Zhang et al., 2021) demonstrates that different classes should have different local (class-specific) thresholds. While the local thresholds take into account the learning difficulties of different classes, they are still mapped from a pre-defined fixed global threshold. Adsh (Guo & Li, 2022) obtains adaptive thresholds from a pre-defined threshold for imbalanced semi-supervised learning by optimizing the number of pseudo labels for each class. In a nutshell, these methods may be incapable of, or insufficient for, adjusting thresholds according to the model's learning progress, thus impeding the training process, especially when labeled data is too scarce to provide adequate supervision.

Figure 2: Illustration of Self-Adaptive Thresholding (SAT). FreeMatch adopts both global and local self-adaptive thresholds computed from the EMA of prediction statistics from unlabeled samples. Filtered (masked) samples are marked with red X.

Figure 3: How FreeMatch works on STL-10 with 40 labels, compared to other methods. (a) Class-average confidence threshold; (b) class-average sampling rate; (c) convergence speed in terms of accuracy; (d) confusion matrix, where fading colors of the diagonal elements indicate the disparity of accuracy.

We also plot the convergence speed in terms of accuracy and the confusion matrix to show that the proposed components in FreeMatch help improve performance. From Figure 3(a) and Figure 3(b), one can observe that the threshold and sampling-rate changes of FreeMatch are mostly consistent with our theoretical analysis. That is, at the early stage of training, the threshold of FreeMatch is relatively low compared to FlexMatch and FixMatch, resulting in higher unlabeled data utilization (sampling rate), which speeds up convergence. As the model learns better and becomes more confident, the threshold of FreeMatch increases to a high value to alleviate confirmation bias, while the sampling rate remains stably high. Correspondingly, the accuracy of FreeMatch increases rapidly (as shown in Figure 3(c)), resulting in better class-wise accuracy (as shown in Figure 3(d)). Note that Dash fails to learn properly due to its employment of a high sampling rate until 100k iterations.


Algorithm 1, steps 1-6:
1: Input: number of classes $C$, labeled batch $\mathcal{X} = \{(x_b, y_b) : b \in (1, 2, \ldots, B)\}$, unlabeled batch $\mathcal{U} = \{u_b : b \in (1, 2, \ldots, \mu B)\}$, unsupervised loss weight $w_u$, fairness loss weight $w_f$, and EMA decay $\lambda$.
2: Compute $\mathcal{L}_s$ for labeled data: $\mathcal{L}_s = \frac{1}{B}\sum_{b=1}^{B} \mathcal{H}(y_b, p_m(y|\omega(x_b)))$
3: Update the global threshold: $\tau_t = \lambda \tau_{t-1} + (1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B} \max(q_b)$ {$q_b$ is an abbreviation of $p_m(y|\omega(u_b))$; shape of $\tau_t$: [1]}
4: Update the local threshold statistics: $\tilde{p}_t = \lambda \tilde{p}_{t-1} + (1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B} q_b$ {shape of $\tilde{p}_t$: [C]}
5: Update the histogram for $\tilde{p}_t$: $\tilde{h}_t = \lambda \tilde{h}_{t-1} + (1-\lambda)\,\text{Hist}_{\mu B}(\hat{q}_b)$ {shape of $\tilde{h}_t$: [C]}
6: for $c = 1$ to $C$ do
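The EMA updates in steps 3-5 can be sketched as follows. This is a simplified NumPy illustration; the class name `ThresholdEMA` and the initialization of the statistics to uniform values are assumptions, not the paper's exact code.

```python
import numpy as np

class ThresholdEMA:
    """EMA of the global threshold (step 3), per-class mean probability
    (step 4), and pseudo-label histogram (step 5) of Algorithm 1."""

    def __init__(self, num_classes, lam=0.999):
        self.lam = lam
        self.tau = 1.0 / num_classes                        # initial global threshold
        self.p_tilde = np.full(num_classes, 1.0 / num_classes)
        self.h_tilde = np.full(num_classes, 1.0 / num_classes)

    def update(self, q):
        """q: [mu*B, C] predicted probabilities on weakly augmented unlabeled data."""
        lam = self.lam
        # step 3: global threshold tracks the mean max confidence
        self.tau = lam * self.tau + (1 - lam) * q.max(axis=1).mean()
        # step 4: per-class mean probability
        self.p_tilde = lam * self.p_tilde + (1 - lam) * q.mean(axis=0)
        # step 5: normalized histogram of hard pseudo labels
        hist = np.bincount(q.argmax(axis=1), minlength=q.shape[1]) / len(q)
        self.h_tilde = lam * self.h_tilde + (1 - lam) * hist
        return self.tau
```

Because each EMA is a convex combination of valid distributions (or confidences), `p_tilde` and `h_tilde` remain normalized and `tau` stays in (0, 1) throughout training.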

Figure 4: CIFAR-10 (10) labeled samples visualization, sorted from the most prototypical dataset (first row) to least prototypical dataset (last row).

Figure 5: T-SNE visualization of FlexMatch and FreeMatch features on STL-10 (40). Unlabeled data is indicated by gray color. Local threshold τ t (c) for each class is shown on the legend.

Figure 6: CIFAR-10 (10) Pseudo Label accuracy visualization.

(a) The most prototypical labeled samples (b) The second-most prototypical labeled samples (c) The least prototypical labeled samples

Figure 7: Confusion matrix on the test set of CIFAR-10 (10). Rows correspond to the rows in Figure 4. Columns correspond to different SSL methods.

Error rates on CIFAR-10/100, SVHN, and STL-10 datasets. The fully-supervised results of STL-10 are unavailable since we do not have label information for its unlabeled data. Bold indicates the best result and underline the second-best result. The significance tests and average error rates for each dataset can be found in Appendix E.1.

We set $w_u = 1$ for all experiments. Besides, we set $w_f = 0.01$ for CIFAR-10 with 10 labels, CIFAR-100 with 400 labels, STL-10 with 40 labels, ImageNet with 100k labels, and all experiments on SVHN. For the other settings, we use $w_f = 0.05$. For SVHN, we find that using a low threshold in the early training stage impedes the model from clustering the unlabeled data, so we adopt two training techniques for SVHN: (1) warm up the model on only the labeled data for 2 epochs, as in Dash; and

Error rates and runtime on ImageNet with 100 labels per class.

Comparison of different thresholding schemes.

Comparison of different class fairness items.

The average error rates for each dataset.

Error rates of different thresholding EMA decay.

FixMatch and FlexMatch with different thresholds on CIFAR-10 (40).



Error rates (%) of imbalanced SSL using 3 different random seeds.
0±0.22 | 22.3±1.08 | 46.6±0.69 | 58.3±0.41
FlexMatch w/ ABC | 14.2±0.34 | 23.1±0.70 | 46.2±0.47 | 58.9±0.51
FreeMatch w/ ABC | 13.9±0.03 | 22.3±0.26 | 45.6±0.76 | 58.9±0.55


Published as a conference paper at ICLR 2023

Table 5: Algorithm-dependent hyperparameters.

Algorithm: FreeMatch
Unlabeled-to-labeled data ratio (CIFAR-10/100, STL-10, SVHN): 7
Unlabeled-to-labeled data ratio (ImageNet): 1
Loss weight w_u (all experiments): 1
Loss weight w_f (CIFAR-10 (10), CIFAR-100 (400), STL-10 (40), ImageNet (100k), SVHN): 0.01
Loss weight w_f (others): 0.05
Thresholding EMA decay (all experiments): 0.999

E EXTENSIVE EXPERIMENT DETAILS AND RESULTS

We present extensive experiment details and results as complementary to the experiments in the main paper.

E.1 SIGNIFICANT TESTS

We performed a significance test using the Friedman test. We choose the top 7 algorithms on 4 datasets (i.e., N = 4, k = 7) and compute the statistic $\tau_F = 3.56$, which is clearly larger than the critical values 2.661 ($\alpha = 0.05$) and 2.130 ($\alpha = 0.1$). This test indicates that there are significant differences among the algorithms. To further show the significance, we report the average error rates for each dataset in Table 7. We can see that FreeMatch outperforms most SSL algorithms significantly.

E.2 CIFAR-10 (10) LABELED DATA

Following (Sohn et al., 2020), we investigate the limitations of SSL algorithms by providing only one labeled training sample per class. The 3 selected labeled training sets are visualized in Figure 4; they were obtained by (Sohn et al., 2020) using an ordering mechanism (Carlini et al., 2019).
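The Friedman statistic and its Iman-Davenport F correction used in E.1 can be computed as in the sketch below. The error-rate matrix is made up for illustration; only the formulas, not the numbers, match the paper. The sketch also assumes no tied error rates (ties would require averaged ranks).

```python
# Friedman test + Iman-Davenport F correction for N datasets x k algorithms.
def friedman_f(errors):
    n = len(errors)          # number of datasets
    k = len(errors[0])       # number of algorithms
    # Rank algorithms per dataset (rank 1 = lowest error). Assumes no ties.
    rank_sums = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda j: row[j])
        for r, j in enumerate(order, start=1):
            rank_sums[j] += r
    avg_ranks = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)   # Iman-Davenport correction
    return chi2, f_stat

errors = [  # 4 datasets x 7 algorithms, purely illustrative error rates
    [7.0, 6.5, 6.1, 5.9, 5.5, 5.2, 4.9],
    [40.1, 38.2, 39.0, 37.5, 36.1, 37.0, 35.8],
    [2.0, 2.3, 2.1, 2.2, 1.95, 1.8, 1.9],
    [30.1, 33.2, 35.0, 28.5, 29.0, 31.0, 27.9],
]
chi2, f_stat = friedman_f(errors)
print(chi2, f_stat)
```

The resulting `f_stat` would then be compared against the F-distribution critical value with $(k-1, (k-1)(N-1))$ degrees of freedom, as in the test above.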

E.3 DETAILED RESULTS

To comprehensively evaluate the performance of all methods in a classification setting, we further report the precision, recall, F1 score, and AUC (area under the curve) on CIFAR-10 with the same 10 labels, CIFAR-100 with 400 labels, SVHN with 40 labels, and STL-10 with 40 labels. As shown in Tables 8 and 9, FreeMatch also achieves the best precision, recall, F1 score, and AUC, in addition to the top-1 error rates reported in the main paper.

, where C is the number of classes. Following (Lee et al., 2021; Fan et al., 2021), we set $N_{\max} = 1500$ for CIFAR-10 and $N_{\max} = 150$ for CIFAR-100, and the number of unlabeled data is twice as many for each class. We use WRN-28-2 (Zagoruyko & Komodakis, 2016) as the backbone and Adam (Kingma & Ba, 2014) as the optimizer. The initial learning rate is 0.002 with a cosine learning-rate decay schedule $\eta = \eta_0 \cos(\frac{7\pi k}{16K})$, where $\eta_0$ is the initial learning rate, $k$ ($K$) is the current (total) training step, and we set $K = 2.5 \times 10^5$ for all datasets. The batch sizes of labeled and unlabeled data are 64 and 128, respectively. Weight decay is set to 4e-5. Each experiment is run on three different data splits, and we report the average of the best error rates.

The results are summarized in Table 14. Compared with other standard SSL methods, FreeMatch achieves the best performance across all settings. Especially on CIFAR-10 at imbalance ratio 150, FreeMatch outperforms the second best by 2.4%. Moreover, when plugged into the imbalanced SSL method of (Lee et al., 2021), FreeMatch still attains the best performance in most of the settings.

We plot the T-SNE visualization of the features on STL-10 with 40 labels from FlexMatch (Zhang et al., 2021) and FreeMatch. FreeMatch shows a better feature space than FlexMatch, with less confusing clusters.

E.9 PSEUDO LABEL ACCURACY ON CIFAR-10 (10)

We average the pseudo-label accuracy over three random seeds and report it in Figure 6.
This indicates that mapping thresholds from a high fixed threshold, as FlexMatch does, can prevent unlabeled samples from being involved in training. In this case, the model can overfit on the labeled data and a small amount of unlabeled data, and thus the predictions on unlabeled data will incorporate
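The cosine decay schedule $\eta = \eta_0 \cos(\frac{7\pi k}{16K})$ used in the imbalanced-SSL experiments (Appendix E.7) is a one-liner; the helper name below is an illustrative assumption.

```python
import math

def cosine_lr(eta0, step, total_steps):
    # eta = eta0 * cos(7*pi*k / (16*K)); decays from eta0 at k = 0
    # to eta0 * cos(7*pi/16) ~= 0.195 * eta0 at k = K (never reaching zero).
    return eta0 * math.cos(7 * math.pi * step / (16 * total_steps))

# e.g. with eta0 = 0.002 and K = 250_000 as in Appendix E.7
lr_start = cosine_lr(0.002, 0, 250_000)        # 0.002
lr_end = cosine_lr(0.002, 250_000, 250_000)    # ~0.00039
```

Because $7\pi k / (16K)$ stays below $\pi/2$ for $k \le K$, the learning rate is positive and monotonically decreasing over the whole run.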

