IMBALANCED SEMI-SUPERVISED LEARNING WITH BIAS ADAPTIVE CLASSIFIER

Abstract

Pseudo-labeling has proven to be a promising semi-supervised learning (SSL) paradigm. Existing pseudo-labeling methods commonly assume that the class distributions of training data are balanced. However, such an assumption is far from realistic scenarios and thus severely limits the performance of current pseudo-labeling methods in the context of class imbalance. To alleviate this problem, we design a bias adaptive classifier that targets imbalanced SSL setups. The core idea is to automatically assimilate the training bias caused by class imbalance via the bias adaptive classifier, which is composed of a novel bias attractor and the original linear classifier. The bias attractor is designed as a light-weight residual network and optimized through a bi-level learning framework. Such a learning strategy enables the bias adaptive classifier to fit imbalanced training data, while the linear classifier can provide unbiased label prediction for each class. We conduct extensive experiments under various imbalanced semi-supervised setups, and the results demonstrate that our method can be applied to different pseudo-labeling models and is superior to current state-of-the-art methods.

1. INTRODUCTION

Semi-supervised learning (SSL) (Chapelle et al., 2009) has proven to be promising for exploiting unlabeled data to reduce the demand for labeled data. Among existing SSL methods, pseudo-labeling (Lee et al., 2013), which uses the model's class predictions as labels to train against, has attracted increasing attention in recent years. Despite their great success, pseudo-labeling methods commonly rest on the basic assumption that the labeled and/or unlabeled data are class-balanced. Such an assumption is too rigid to be satisfied in many practical applications, as realistic phenomena often follow skewed distributions. Recent works (Hyun et al., 2020; Kim et al., 2020a) have found that class imbalance significantly degrades the performance of pseudo-labeling methods. The main reason is that pseudo-labeling usually involves pseudo-label prediction for unlabeled data, and an initial model trained on imbalanced data easily mislabels minority class samples as majority ones. This implies that subsequent training with such biased pseudo-labels will aggravate the imbalance of the training data and further bias the model training.

To address the aforementioned issues, recent literature attempts to introduce pseudo-label re-balancing strategies into existing pseudo-labeling methods. Such a re-balancing strategy requires the class distribution of unlabeled data as prior knowledge (Wei et al., 2021; Lee et al., 2021) or needs to estimate the class distribution of the unlabeled data during training (Kim et al., 2020a; Lai et al., 2022). However, most of the data in imbalanced SSL are unlabeled and the pseudo-labels estimated by SSL algorithms are unreliable, which makes these methods sub-optimal in practice, especially when there is a great class distribution mismatch between labeled and unlabeled data.
In this paper, we investigate pseudo-labeling SSL methods in the context of class imbalance, in which the class distributions of labeled and unlabeled data may differ greatly. In such a general scenario, the current state-of-the-art FixMatch (Sohn et al., 2020) may suffer from performance degradation. To illustrate this, we design an experiment where the entire training data (labeled data + unlabeled data) [...]

[Figure 1: image not recoverable from the source text.]

To address this problem, we propose a learning to adapt classifier (L2AC) framework to protect the linear classifier of a deep classification network from the training bias. Specifically, we propose a bias adaptive classifier which equips the linear classifier with a bias attractor (parameterized by a residual transformation). The linear classifier aims to provide an unbiased label prediction and the bias attractor attempts to assimilate the training bias arising from class imbalance. To this end, we learn the L2AC with a bi-level learning framework: the lower-level optimization problem updates the modified network with the bias adaptive classifier over both labeled and unlabeled data for better representation learning; the upper-level problem tunes the bias attractor over an online class-balanced set (re-sampled from the labeled training data) so that the linear classifier predicts unbiased labels. As a result, the bias adaptive classifier can not only fit the biased training data but also make the linear classifier generalize well towards each class (i.e., tend to an equal preference for each class). In Fig. 1(c), we show that the linear classifier learned by our L2AC can well approximate the predicted class distribution of the upper bound model, indicating that L2AC obtains an unbiased classifier.

In summary, our contributions are mainly three-fold: (1) We propose to learn a bias adaptive classifier to assimilate online training bias arising from class imbalance and pseudo-labels. 
The proposed L2AC framework is model-agnostic and can be applied to various pseudo-labeling SSL methods; (2) We develop a bi-level learning paradigm to optimize the parameters involved in our method. This allows the online training bias to be decoupled from the linear classifier such that the resulting network can generalize well towards each class. (3) We conduct extensive experiments on various imbalanced SSL setups, and the results demonstrate the superiority of the proposed method. The source code is made publicly available at https://github.com/renzhenwang/bias-adaptive-classifier.

2. RELATED WORK

Class-imbalanced learning attempts to learn models that generalize well to each class from imbalanced data. Recent studies can be divided into three categories: Re-sampling (He & Garcia, 2009; Chawla et al., 2002; Buda et al., 2018; Byrd & Lipton, 2019), which samples the data to rearrange the class distribution of the training data; Re-weighting (Khan et al., 2017; Cui et al., 2019; Cao et al., 2019; Lin et al., 2017; Ren et al., 2018; Shu et al., 2019; Tan et al., 2020; Jamal et al., 2020), which assigns weights to each class or even each sample to balance the training data; and Transfer learning (Wang et al., 2017; Liu et al., 2019; Yin et al., 2019; Kim et al., 2020b; Chu et al., 2020; Liu et al., 2020; Wang et al., 2020), which transfers knowledge from head classes to tail classes. Besides, most recent works tend to decouple the learning of the representation and the classifier (Kang et al., 2020; Zhou et al., 2020; Tang et al., 2020; Zhang et al., 2021b). However, it is difficult to directly extend these techniques to imbalanced SSL, as the distribution of unlabeled data is unknown and may be greatly different from that of labeled data.

Semi-supervised learning aims to learn from both labeled and unlabeled data, and includes two main lines of research, namely pseudo-labeling and consistency regularization. Pseudo-labeling (Lee et al., 2013; Xie et al., 2020a; b; Sohn et al., 2020; Zhang et al., 2021a) evolved from entropy minimization (Grandvalet & Bengio, 2004) and commonly trains the model using labeled data together with unlabeled data whose labels are generated by the model itself. Consistency regularization (Sajjadi et al., 2016; Tarvainen & Valpola, 2017; Berthelot et al., 2019b; Miyato et al., 2018; Berthelot et al., 2019a) aims to impose a classification invariance loss on unlabeled data under perturbations. Despite their success, most of these methods are based on the assumption that the labeled and unlabeled data follow a uniform label distribution. 
When used under class imbalance, these methods suffer from significant performance degradation due to the imbalance bias and pseudo-label bias.

Imbalanced semi-supervised learning has been drawing extensive attention recently. Yang & Xu (2020) pointed out that SSL can benefit class-imbalanced learning. Hyun et al. (2020) proposed a suppressed consistency loss to suppress the loss on minority classes. Kim et al. (2020a) introduced a convex optimization method to refine raw pseudo-labels. Similarly, Lai et al. (2022) estimated a mitigating vector to refine the pseudo-labels, and Oh et al. (2022) proposed to blend the pseudo-labels from the linear classifier with those from a similarity-based classifier. Guo & Li (2022) found that a fixed threshold for pseudo-labeled sample selection is biased towards head classes and in turn proposed to optimize an adaptive threshold for each class. Assuming that labeled and unlabeled data share the same distribution, Wei et al. (2021) proposed a re-sampling method to iteratively refine the model, and Lee et al. (2021) proposed an auxiliary classifier combined with a re-sampling technique to mitigate class imbalance. Most recently, Wang et al. (2022) proposed to combine counterfactual reasoning and adaptive margins to remove the bias from the pseudo-labels. Different from these methods, this paper aims to learn an explicit bias attractor that can protect the linear classifier from the training bias and make it generalize well towards each class.

3.1. PROBLEM SETUP AND BASELINES

Imbalanced SSL involves a labeled dataset $D_l = \{(x_n, y_n)\}_{n=1}^{N}$ and an unlabeled dataset $D_u = \{x_m\}_{m=1}^{M}$, where $x_n$ is a training example and $y_n \in \{0, 1\}^K$ is its corresponding label. We denote the number of training examples of class $k$ within $D_l$ and $D_u$ as $N_k$ and $M_k$, respectively. In a class-imbalanced scenario, the class distribution of the training data is skewed, namely, the imbalance ratio $\gamma_l := \frac{\max_k N_k}{\min_k N_k} \gg 1$ or $\gamma_u := \frac{\max_k M_k}{\min_k M_k} \gg 1$ always holds. Note that the class distribution of $D_u$, i.e., $\{M_k\}_{k=1}^{K}$, is usually unknown in practice. Given $D_l$ and $D_u$, our goal is to learn a classification model that is able to correctly predict the labels of test data. We denote a deep classification model $\Psi = f^{cls}_{\phi} \circ f^{ext}_{\theta}$ with the feature extractor $f^{ext}_{\theta}$ and the linear classifier $f^{cls}_{\phi}$, where $\theta$ and $\phi$ are the parameters of $f^{ext}_{\theta}$ and $f^{cls}_{\phi}$, respectively, and $\circ$ is the function composition operator.

With pseudo-labeling techniques, current state-of-the-art SSL methods (Xie et al., 2020b; Sohn et al., 2020; Zhang et al., 2021a) generate pseudo-labels for unlabeled data to augment the training dataset. For an unlabeled sample $x_m$, its pseudo-label $\hat{y}_m$ can be a 'hard' one-hot label (Lee et al., 2013; Sohn et al., 2020; Zhang et al., 2021a) or a sharpened 'soft' label (Xie et al., 2020a; Wang et al., 2021). The model is then trained on both labeled and pseudo-labeled samples. Such a learning scheme is typically formulated as an optimization problem with objective function $L = L_l + \lambda_u L_u$, where $\lambda_u$ is a hyper-parameter for balancing the labeled data loss $L_l$ and the pseudo-labeled data loss $L_u$. To be more specific,

$$L_l = \frac{1}{|\hat{D}_l|} \sum_{x_n \in \hat{D}_l} H(\Psi(x_n), y_n),$$

where $\hat{D}_l$ denotes a batch of labeled data sampled from $D_l$ and $H$ is the cross-entropy loss;

$$L_u = \frac{1}{|\hat{D}_u|} \sum_{x_m \in \hat{D}_u} \mathbb{1}(\max(p_m) \geq \tau)\, H(\Psi(x_m), \hat{y}_m),$$

where $p_m = \mathrm{softmax}(\Psi(x_m))$ represents the output probability, and $\tau$ is a predefined threshold for masking out inaccurate pseudo-labeled data. 
For simplicity, we reformulate $L$ as

$$L = \frac{1}{|\hat{D}_l|} \sum_{x_i \in \hat{D}_l} H(\Psi(x_i), y_i) + \frac{1}{|\hat{D}_u|} \sum_{x_i \in \hat{D}_u} \lambda_i H(\Psi(x_i), \hat{y}_i), \tag{1}$$

where $\lambda_i = \lambda_u \mathbb{1}(\max(p_i) \geq \tau)$.

[...] during training, such that the generated pseudo-labels can be even more biased and severely degrade the performance on minority classes. Moreover, due to the confirmation bias issue (Tarvainen & Valpola, 2017; Arazo et al., 2020), it is hard for the model itself to rectify such a training bias.
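The objective of this subsection, a supervised cross-entropy term plus a confidence-masked pseudo-label term, can be illustrated with a minimal dependency-free sketch. It assumes hard (argmax) pseudo-labels taken from the same logits that are trained on; the function names and the default values of `tau` and `lambda_u` are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, class_index):
    """H(logits, one_hot(class_index)) = -log softmax(logits)[class_index]."""
    return -math.log(softmax(logits)[class_index])


def pseudo_label_loss(labeled_logits, labels, unlabeled_logits,
                      tau=0.95, lambda_u=1.0):
    """Return (L, mask) with L = L_l + (1/|D_u|) * sum_i lambda_i * H(., y_hat_i).

    lambda_i = lambda_u * 1(max(p_i) >= tau), as in the reformulated loss.
    tau and lambda_u defaults are hypothetical values for illustration.
    """
    # Labeled term: mean cross-entropy over the labeled batch.
    loss_l = sum(cross_entropy(z, y) for z, y in zip(labeled_logits, labels))
    loss_l /= len(labeled_logits)

    # Unlabeled term: only confident predictions contribute.
    loss_u, mask = 0.0, []
    for z in unlabeled_logits:
        probs = softmax(z)
        confidence = max(probs)
        keep = confidence >= tau                  # indicator 1(max(p_i) >= tau)
        mask.append(int(keep))
        if keep:
            hard_label = probs.index(confidence)  # hard one-hot pseudo-label
            loss_u += lambda_u * cross_entropy(z, hard_label)
    loss_u /= len(unlabeled_logits)               # normalise by the full batch size
    return loss_l + loss_u, mask
```

Note that the masked sum is still divided by the full unlabeled batch size, matching the $1/|\hat{D}_u|$ factor in the loss. In a real training loop the pseudo-label would be treated as a constant target rather than differentiated through.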

3.2. LEARNING TO ADAPT CLASSIFIER

Our goal is to enhance existing pseudo-labeling SSL methods by making full use of both labeled and unlabeled data, while protecting the linear classifier from the training bias (imbalance bias and pseudo-label bias). To this end, we learn a bias adaptive classifier that equips the linear classifier with a bias attractor in order to assimilate complicated training bias.

The proposed bias adaptive classifier: As shown in Fig. 2, the bias adaptive classifier (denoted as $F$) consists of two modules: the linear classifier $f^{cls}_{\phi}$ and a nonlinear network $\Delta f_{\omega}$ (dubbed the bias attractor). The bias attractor is implemented by imposing a residual transformation on the output of the linear classifier, i.e., plugging $\Delta f_{\omega}$ after $f^{cls}_{\phi}$ and then bridging their outputs with a shortcut connection. Mathematically, the bias adaptive classifier $F$ can be formulated as

$$F_{\omega,\phi}(z) = (I + \Delta f_{\omega}) \circ f^{cls}_{\phi}(z), \tag{2}$$

where $z = f^{ext}_{\theta}(x) \in \mathbb{R}^d$, $I$ denotes the identity mapping, and $\Delta f_{\omega}$ with parameters $\omega$ is a multi-layer perceptron (MLP) with one hidden layer in this paper. We design such a bias adaptive classifier for the following two considerations. On the one hand, the bias attractor $\Delta f_{\omega}$ adopts a nonlinear network which can, in theory, assimilate complicated training bias due to its universal approximation properties. Since the whole bias adaptive classifier (i.e., the classifier with the bias attractor) is required to fit the imbalanced training data, we hope the bias attractor can help the linear classifier learn the unbiased class-conditional distribution (i.e., make the classifier less affected by the biases). By contrast, in the original classification network, a single linear classifier is required to fit the biased training data, such that it is easily misled by class imbalance and biased pseudo-labels. 
On the other hand, the residual connection conveniently makes the bias attractor a plug-in module, i.e., it assimilates the training bias during training and is removed in the test stage, and it has also been proven successful in easing the training of deep networks (He et al., 2016; Long et al., 2016).

Learning bias adaptive classifier: With the proposed bias adaptive classifier $F_{\omega,\phi}$, the modified classification network can be formulated as $\tilde{\Psi} = F_{\omega,\phi} \circ f^{ext}_{\theta}$. To make full use of the whole training data ($D_l \cup D_u$) for better representation learning, we can minimize the following loss function:

$$L = \frac{1}{|\hat{D}_l|} \sum_{x_i \in \hat{D}_l} H(\tilde{\Psi}(x_i), y_i) + \frac{1}{|\hat{D}_u|} \sum_{x_i \in \hat{D}_u} \lambda_i H(\tilde{\Psi}(x_i), \hat{y}_i). \tag{3}$$

This involves an optimization problem with respect to three sets of parameters $\{\theta, \phi, \omega\}$, which can be jointly optimized via stochastic gradient descent (SGD) in an end-to-end manner. However, such a training strategy poses a critical challenge: there is no prior knowledge that drives $f^{cls}_{\phi}$ to produce unbiased label predictions and $\Delta f_{\omega}$ to assimilate the training bias. In other words, we cannot guarantee that the training bias is exactly decoupled from the linear classifier $f^{cls}_{\phi}$.

To address this problem, we take $\omega$ as hyper-parameters associated with $\phi$ and design a bi-level learning algorithm to jointly optimize the network parameters $\{\theta, \phi\}$ and the hyper-parameters $\omega$. We illustrate the process in Fig. 2 and Algorithm 1. In each training iteration $t$, we update the network parameters $\{\theta, \phi\}$ by gradient descent as

$$(\theta^{t+1}, \phi^{t+1}(\omega)) = (\theta^{t}, \phi^{t}) - \alpha \nabla_{\theta,\phi} L, \tag{4}$$

where $\alpha$ is the learning rate. Note that we herein assume that $\omega$ is only directly related to the linear classifier $f^{cls}_{\phi}$ via $\phi^{t+1}(\omega)$, which implies that the subsequent optimization of $\omega$ will not affect the feature extractor $f^{ext}_{\theta}$. 
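The residual structure of the bias adaptive classifier, and the fact that the attractor is a plug-in that can be dropped at test time, can be sketched as follows. This is a minimal illustration assuming a ReLU hidden layer and plain Python lists; the class name, argument names and the `use_attractor` flag are hypothetical and not part of the paper.

```python
def linear(weights, bias, x):
    """Affine map W x + b for list-of-lists weights."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class BiasAdaptiveClassifier:
    """F(z) = f_cls(z) + delta_f(f_cls(z)); delta_f is a one-hidden-layer MLP."""

    def __init__(self, cls_w, cls_b, hid_w, hid_b, out_w, out_b):
        self.cls_w, self.cls_b = cls_w, cls_b    # linear classifier (phi)
        self.hid_w, self.hid_b = hid_w, hid_b    # attractor hidden layer (omega)
        self.out_w, self.out_b = out_w, out_b    # attractor output layer (omega)

    def forward(self, z, use_attractor=True):
        logits = linear(self.cls_w, self.cls_b, z)
        if not use_attractor:
            # Test stage: only the linear classifier is kept.
            return logits
        hidden = relu(linear(self.hid_w, self.hid_b, logits))
        delta = linear(self.out_w, self.out_b, hidden)
        # Shortcut connection: (I + delta_f) applied to the classifier output.
        return [l + d for l, d in zip(logits, delta)]
```

Because the attractor acts on the classifier output through a shortcut connection, removing it leaves the linear classifier's logits unchanged, which is what makes the plug-in use at test time possible.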
We then tune the hyper-parameters $\omega$ to make the linear classifier $\phi^{t+1}(\omega)$ generalize well towards each class. We thus minimize the following loss function of the network $\Psi = f^{cls}_{\phi^{t+1}(\omega)} \circ f^{ext}_{\theta^{t+1}}$ over a class-balanced set (dynamically sampled from the labeled training set):

$$L_{bal} = \frac{1}{|B|} \sum_{x_i \in B} H\big(f^{cls}_{\phi^{t+1}(\omega)} \circ f^{ext}_{\theta^{t+1}}(x_i), y_i\big), \tag{5}$$

where $B \subset D_l$ is a batch of class-balanced labeled samples, which can be implemented by class-aware sampling (Shen et al., 2016). This loss function reflects the effect of the hyper-parameter $\omega$ on making the linear classifier generalize well towards each class, so we optimize $\omega$ by

$$\omega^{t+1} = \omega^{t} - \eta \nabla_{\omega} L_{bal}, \tag{6}$$

where $\eta$ is the learning rate on $\omega$. Note that in Eq. (6) we need to compute a second-order gradient $\nabla_{\omega} L_{bal}$ with respect to $\omega$, which can be easily implemented in practice through popular deep learning frameworks such as PyTorch (Paszke et al., 2019).

In summary, the proposed bi-level learning framework ensures that 1) the linear classifier $f^{cls}_{\phi}$ can fit the unbiased class-conditional distribution by minimizing the empirical risk over balanced data via Eq. (5), and 2) the bias attractor $\Delta f_{\omega}$ can handle the implicit training bias by training the bias adaptive classifier $F_{\omega,\phi}$ over the imbalanced training data via Eq. (3). As such, our proposed $\Delta f_{\omega}$ can protect $f^{cls}_{\phi}$ from the training bias.

Additionally, our L2AC is more efficient than most existing bi-level learning methods (Finn et al., 2017; Ren et al., 2018; Shu et al., 2019). To illustrate this, we herein give a brief complexity analysis of our algorithm. Since our L2AC introduces a bi-level optimization problem, it requires one extra forward pass in Eq. (5) and one extra backward pass in Eq. (6) compared to a regular single-level optimization problem. However, in the backward pass, the second-order gradient of $\omega$ in Eq. (6) only requires unrolling the gradient graph of the linear classifier $f^{cls}_{\phi}$. 
As a result, the backward-on-backward automatic differentiation in Eq. (6) incurs only a lightweight overhead, i.e., approximately $\frac{\#\mathrm{Params}(f^{cls}_{\phi})}{\#\mathrm{Params}(\Psi)} \times$ the training time of one full backward pass.
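The bi-level update of this subsection can be illustrated on a deliberately tiny toy problem. The sketch below makes strong simplifying assumptions that are not in the paper: features are ignored, so the linear classifier reduces to a logit vector `phi`; the bias attractor is a constant offset `omega` instead of an MLP; and the second-order gradient is estimated by central finite differences rather than by backward-on-backward automatic differentiation. All function names and default step sizes are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lower_step(phi, omega, train_freq, alpha):
    """Lower level: one gradient step on phi for cross-entropy over logits phi + omega.

    With feature-free logits, the batch gradient w.r.t. phi is
    softmax(phi + omega) - train_freq (the empirical class frequencies).
    """
    probs = softmax([p + w for p, w in zip(phi, omega)])
    return [p - alpha * (q - f) for p, q, f in zip(phi, probs, train_freq)]


def upper_loss(phi_next, balanced_freq):
    """Upper level: cross-entropy of the plain classifier on a balanced batch."""
    probs = softmax(phi_next)
    return -sum(f * math.log(q) for f, q in zip(balanced_freq, probs))


def hypergradient(phi, omega, train_freq, balanced_freq, alpha, step=1e-5):
    """Finite-difference estimate of d L_bal(phi_next(omega)) / d omega."""
    grad = []
    for k in range(len(omega)):
        up, down = list(omega), list(omega)
        up[k] += step
        down[k] -= step
        f_up = upper_loss(lower_step(phi, up, train_freq, alpha), balanced_freq)
        f_down = upper_loss(lower_step(phi, down, train_freq, alpha), balanced_freq)
        grad.append((f_up - f_down) / (2 * step))
    return grad


def train(train_freq, steps, alpha=0.5, eta=1.0, use_attractor=True):
    """Alternate the lower-level update of phi and the upper-level update of omega."""
    num_classes = len(train_freq)
    balanced_freq = [1.0 / num_classes] * num_classes
    phi = [0.0] * num_classes
    omega = [0.0] * num_classes
    for _ in range(steps):
        if use_attractor:
            grad = hypergradient(phi, omega, train_freq, balanced_freq, alpha)
        else:
            grad = [0.0] * num_classes   # baseline: omega stays at zero
        phi = lower_step(phi, omega, train_freq, alpha)       # update phi
        omega = [w - eta * g for w, g in zip(omega, grad)]    # update omega
    return phi, omega
```

On a two-class problem with training frequencies `[0.9, 0.1]`, the offset `omega` grows toward the majority class, so the logit gap of `phi` ends up smaller than in the baseline where `omega` is held at zero. This mirrors, in a very reduced form, the intended decoupling of the training bias from the linear classifier.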

3.3. THEORETICAL ANALYSIS

Since the update of the parameter $\omega$ aims to minimize Eq. (5), we herein give a brief debiasing analysis of our proposed L2AC by showing how the value of the bias attractor $\Delta f(\cdot)$ changes with the update of $\omega$. We use the notation $\frac{\partial f}{\partial \omega}\big|_{\omega^t}$ to denote the gradient of $f$ evaluated at $\omega^t$ and the superscript $T$ to denote the vector/matrix transpose. We have the following proposition.

Proposition 3.1 Let $p_i$ denote the predicted probability of $x_i$. Then Eq. (6) can be rewritten as

$$\omega^{t+1} = \omega^{t} + \eta\alpha \frac{1}{n} \sum_{i=1}^{n} G_i \frac{\partial \Delta f_i}{\partial \omega}\Big|_{\omega^t}, \tag{7}$$

where

$$G_i = \frac{\partial (p_i - y_i)}{\partial \phi}\Big|^{T}_{\phi^t} \left( \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L^{bal}_j(\theta, \phi)}{\partial \phi}\Big|_{\phi^{t+1}} \right),$$

which represents the similarity between the gradient of the sample $x_i$ and the average gradient of the whole balanced set $B$.

This shows that the update of the parameter $\omega$ affects the value of the bias attractor, i.e., the value of $\Delta f_i$ is adjusted according to the interaction $G_i$ between $x_i$ and $B$. If $G_i > 0$, then $\Delta f_{i,k}$ is increased; otherwise $\Delta f_{i,k}$ is decreased, indicating that L2AC adaptively assimilates the training bias.

Table 1: Comparison results on CIFAR-10 under two typical imbalanced SSL settings, i.e., $\gamma = \gamma_l = \gamma_u$ and $\gamma_l \neq \gamma_u$ ($\gamma_l = 100$). The performance (bACC / GM) is reported in the form of mean ± std across three random runs. 

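[Table 1 column headers: CIFAR-10 ($\gamma_l = \gamma_u$): $\gamma = 100$, $\gamma = 150$; CIFAR-10 ($\gamma_l \neq \gamma_u$): $\gamma_u = 1$ (uniform), $\gamma_u =$ ... Table body not recovered.]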

4. EXPERIMENTS

We evaluate our approach on four benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011) and SUN397 (Xiao et al., 2010), which are broadly used in imbalanced learning and SSL tasks. We adopt balanced accuracy (bACC) (Huang et al., 2016; Wang et al., 2017) and geometric mean scores (GM) (Kubat et al., 1997; Branco et al., 2016) as the evaluation metrics. We evaluate our L2AC under two different settings: 1) both labeled and unlabeled data follow the same class distribution, i.e., $\gamma := \gamma_l = \gamma_u$; 2) labeled and unlabeled data have different class distributions, i.e., $\gamma_l \neq \gamma_u$, where $\gamma_u$ is commonly unknown.
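Both metrics can be computed from a confusion matrix. The sketch below assumes the common definitions, which the text above does not spell out: bACC is the arithmetic mean of per-class recall, and GM is the geometric mean of per-class recall. The function names are illustrative.

```python
import math


def per_class_recall(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def balanced_accuracy(confusion):
    """Arithmetic mean of per-class recall (assumed definition of bACC)."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)


def geometric_mean_score(confusion):
    """Geometric mean of per-class recall (assumed definition of GM)."""
    recalls = per_class_recall(confusion)
    if min(recalls) == 0.0:
        return 0.0   # a single fully missed class drives GM to zero
    return math.exp(sum(math.log(r) for r in recalls) / len(recalls))
```

Under these definitions GM penalizes a poorly recognized minority class more strongly than bACC does, which is why the two metrics are useful together when the test set is balanced but the training data are not.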

4.1. RESULTS ON CIFAR-10

Dataset. We follow the same experimental protocols as Kim et al. (2020a). In detail, a labeled set and an unlabeled set are randomly sampled from the original training data, keeping the number of images for each class the same. Then both sets are made imbalanced by randomly discarding training images according to the predefined imbalance ratios $\gamma_l$ and $\gamma_u$. We denote the number of samples of the largest class within the labeled and unlabeled data as $N_1$ and $M_1$, respectively, and we then have $N_k = N_1 \cdot \gamma_l^{-\epsilon_k}$ and $M_k = M_1 \cdot \gamma_u^{-\epsilon_k}$, where $\epsilon_k = \frac{k-1}{K-1}$. We initially set $N_1 = 1500$ and $M_1 = 3000$ following Kim et al. (2020a), and further ablate the proposed L2AC under various labeled ratios in Section 4.4. The test set remains unchanged and class-balanced.

Setups. The experimental setups are consistent with Kim et al. (2020a). Concretely, we employ Wide ResNet-28-2 (Oliver et al., 2018) as our backbone network and adopt the Adam optimizer (Kingma & Ba, 2015) for 500 training epochs, each of which has 500 iterations. To evaluate the model, we use its exponential moving average (EMA) version, and report the average test accuracy of the last 20 epochs following Berthelot et al. (2019b). See Appendix D.1 for more details.

Results under $\gamma_l = \gamma_u$. We evaluate the proposed L2AC based on two widely used SSL methods, MixMatch (Berthelot et al., 2019b) and FixMatch (Sohn et al., 2020), and compare it with the following methods: 1) the Vanilla model trained merely with labeled data; 2) recent re-balancing methods that are trained with labeled data while considering class imbalance, including Re-sampling (Japkowicz, 2000), LDAM-DRW (Cao et al., 2019) and cRT (Kang et al., 2020); 3) recent imbalanced SSL methods, including DARP (Kim et al., 2020a), CReST+ (Wei et al., 2021), ABC (Lee et al., 2021), SaR (Lai et al., 2022) and DASO (Oh et al., 2022). Please refer to Appendix C for more details about these methods. 
The main results are shown in Table 1. It can be observed that L2AC significantly improves MixMatch and FixMatch, by at least 9% absolute gain on bACC and at least 14% on GM for all settings. This implies that our L2AC benefits the two baselines by learning an unbiased linear classifier. Moreover, our L2AC consistently surpasses all the comparison methods on both evaluation metrics. Taking the extremely imbalanced case of $\gamma = 150$ as an example, compared with the second best comparison method, our L2AC achieves up to 3.6% bACC gain and 4.9% GM gain upon MixMatch, and around 3.0% bACC gain and 3.6% GM gain upon FixMatch.

Results under $\gamma_l \neq \gamma_u$. The class distributions of labeled and unlabeled data can be different in practice. We herein simulate two typical scenarios following Oh et al. (2022), i.e., the unlabeled set follows a uniform class distribution ($\gamma_u = 1$) or a long-tailed class distribution reversed against the labeled data ($\gamma_u = 100$ (reversed)). Note that CReST+ (Wei et al., 2021) fails in this case as it requires the class distribution of unlabeled data as prior knowledge for training. As shown in Table 1, L2AC can consistently improve both MixMatch and FixMatch by a large margin. An interesting observation is that the baselines MixMatch and FixMatch under $\gamma_u = 1$ perform much worse than those under $\gamma_u = 100$, even though more unlabeled data are added for $\gamma_u = 1$. This is mainly because the models under the imbalanced SSL setting have a strong bias to generate incorrect labels for the tail classes, which impairs the entire learning process. As a result, more unlabeled tail class samples under $\gamma_u = 1$ lead to more severe performance degradation. On the contrary, our L2AC can eliminate the influence of the training bias and predict high-quality pseudo-labels for unlabeled data, such that it achieves significant performance gains in both settings.
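The class-count profiles behind the settings discussed in this subsection can be sketched in a few lines. The sketch assumes that counts decay geometrically from the largest class so that the ratio between the largest and smallest class equals the imbalance ratio, and that fractional counts are rounded to the nearest integer; the rounding rule and the `reverse` flag are assumptions made for illustration.

```python
def class_counts(n_max, gamma, num_classes, reverse=False):
    """Per-class sample counts n_max * gamma ** (-(k - 1) / (K - 1)), k = 1..K.

    gamma = 1 gives a uniform profile; reverse=True flips the order so the
    tail of the labeled set becomes the head of the unlabeled set.
    """
    counts = [
        round(n_max * gamma ** (-(k / (num_classes - 1))))
        for k in range(num_classes)          # k here is zero-based, i.e. k - 1 above
    ]
    return counts[::-1] if reverse else counts
```

For instance, `class_counts(1500, 100, 10)` describes a labeled set whose largest class has 100 times as many samples as its smallest, while `class_counts(3000, 1, 10)` describes a uniform unlabeled set, which therefore contains far more tail-class samples than the long-tailed one.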

4.2. RESULTS ON CIFAR-100 AND STL-10

Dataset: To make a more comprehensive comparison, we further evaluate L2AC on CIFAR-100 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011). For CIFAR-100, we create labeled and unlabeled sets in the same way as described in Sec. 4.1 and set $N_1 = 150$ and $M_1 = 300$. STL-10 is a more realistic SSL task that has no distribution information for unlabeled data. In our experiments, we set $N_1 = 450$ to construct the imbalanced labeled set and adopt the whole unlabeled set, whose class distribution is unknown (i.e., $M = 100$k). It is worth noting that the unlabeled set of STL-10 is noisy as it contains samples that do not belong to any of the classes in the labeled set.

Results: As shown in Table 2, L2AC achieves the best performance on GM and competitive performance on bACC compared to the current state-of-the-art ABC (Lee et al., 2021) and DASO (Oh et al., 2022). This implies that our approach obtains a relatively balanced classification performance across all the classes. For STL-10, a more realistic noisy dataset without distribution information for unlabeled data, L2AC significantly outperforms ABC and DASO on both bACC and GM, which demonstrates that it has greater potential to be applied in practical SSL scenarios.

4.3. RESULTS ON LARGE-SCALE SUN-397

Dataset: SUN397 (Xiao et al., 2010) is an imbalanced real-world scene classification dataset, which originally consists of 108,754 RGB images with 397 classes. Following the experimental setups in [...] (Kang et al., 2020), FixMatch (Sohn et al., 2020), DARP (Kim et al., 2020a) and ABC (Lee et al., 2021). More training details are presented in Appendix D.2.

Results: The experimental results are summarized in Table 3. Compared to the baseline FixMatch (Sohn et al., 2020), our proposed L2AC yields a performance gain of about 4% on all evaluation metrics and outperforms all the compared SOTA methods. This further verifies the efficacy of our proposed method for real-world imbalanced SSL applications.

4.4. DISCUSSION

What about the performance under various label ratios? To answer this question, we vary the ratio of labeled data (denoted as $\beta$) on CIFAR-10 and STL-10 to evaluate the proposed method. For CIFAR-10, we define $\beta = N_1/(N_1 + M_1)$ and set the imbalance ratio $\gamma = 100$. For STL-10, since it does not provide annotations for unlabeled data, we re-sample the labeled set from the labeled data by $\beta = N_1/500$ and set the imbalance ratio $\gamma_l = 10$. As shown in Table 4, our L2AC consistently improves the baseline across different amounts of labeled data on both CIFAR-10 and STL-10. For example, STL-10 with $\beta = 5\%$ contains very scarce labeled data, where only 25 and 3 labeled samples belong to the largest and smallest classes, respectively. In such an extremely biased scenario, our L2AC significantly improves FixMatch by around 16% on bACC and 37% on GM.

How does L2AC perform on the majority/minority classes? To explain the source of the performance improvements, we further visualize the confusion matrices on the test set of CIFAR-10 with $\gamma = 100$. Note that the diagonal of a confusion matrix represents per-class recall. As shown in Fig. 3, our L2AC provides a relatively balanced per-class recall compared with the baseline FixMatch (Sohn et al., 2020) and other imbalanced SSL methods, e.g., DARP (Kim et al., 2020a) and ABC (Lee et al., 2021). It can also be observed that FixMatch tends to misclassify samples of the minority classes into the majority classes, while our L2AC largely alleviates this bias. These results reveal that our L2AC provides an unbiased linear classifier for the test stage.

Could L2AC improve the quality of pseudo-labels? Qualitative and quantitative experimental results have shown that the proposed L2AC can improve the performance of pseudo-labeling SSL methods under different settings. We attribute this to the fact that L2AC can generate unbiased pseudo-labels during training. 
To validate this, we show per-class recall of pseudo-labels for CIFAR-10 with $\gamma_l = \gamma_u = 100$ and $\gamma_l = 100$, $\gamma_u = 100$ (reversed) in [...]

[Figure: four confusion-matrix panels (row axis: True label); cell values not reproduced.]

[...] As shown in Fig. 5, the features of the tail classes from FixMatch are scattered into the majority classes. However, L2AC can help the model to effectively discriminate the tail classes (e.g., Class 8, 9) from the majority classes (e.g., Class 0, 1). Such high-quality representation learning also benefits from an unbiased classifier during training. In Appendix E.1, we further present the unbiasedness of our L2AC by evaluating it on various imbalanced test sets.

Ablation analysis. We conduct an ablation study to explore the contribution of each critical component in L2AC. We experiment with FixMatch on CIFAR-10 under $\gamma_l = \gamma_u = 100$ and $\gamma_l = 100$, $\gamma_u = 100$ (reversed). 1) We first verify whether the bias attractor is helpful. To this end, we apply the proposed bias attractor to FixMatch. It can be seen from Table 5 that the bias attractor improves the performance to a certain extent, indicating that the bias attractor alone is effective, although the gain is not significant. As analyzed in Section 3.2, there is no prior knowledge for the bias attractor to assimilate the training bias in this plain training manner. 2) Next, we study how important a role bi-level optimization plays in our method. Instead of using a bi-level learning framework, we remove the hierarchical structure of our upper-level loss and lower-level loss, and formulate a single-level optimization problem as $L + \lambda L_{bal}$. The results in Table 5 show that such a degraded version of L2AC provides a substantial performance gain over FixMatch w/ bias attractor while still being inferior to our L2AC. 
This demonstrates the effect of the proposed bias adaptive classifier and bi-level learning framework on protecting the linear classifier from the training bias.

5. CONCLUSION

In this work, we propose a bias adaptive classifier to deal with the training bias problem in imbalanced SSL tasks. The bias adaptive classifier consists of a linear classifier to predict unbiased labels and a bias attractor to assimilate the complicated training bias. It is learned with a bi-level optimization framework. With such a tailored classifier, the unlabeled data can be fully used by online pseudo-labeling to improve the performance of pseudo-labeling SSL methods. Extensive experiments show that our proposed method achieves consistent improvements over the baselines and the current state of the art. We believe that our bias adaptive classifier can also be used for more complex data bias other than class imbalance.

A ALGORITHM

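We give the training algorithm of the proposed L2AC method in Algorithm 1.

Algorithm 1 Learning to adapt classifier during training
Input: labeled / unlabeled training data $D_l$ / $D_u$, labeled / unlabeled batch size $n$ / $m$, max iterations $T$
Output: classification network parameters $\{\theta, \phi\}$
1: Initialize $\{\theta^0, \phi^0\} \leftarrow \{\theta, \phi\}$ and $\omega^0 \leftarrow \omega$.
2: for $t = 0$ to $T$ do
3: $\hat{D}_l = \{x_i, y_i\}_{i=1}^{n} \leftarrow$ SampleMiniBatch($D_l$, $n$).
4: $\hat{D}_u = \{x_i\}_{i=1}^{m} \leftarrow$ SampleMiniBatch($D_u$, $m$).
5: $B = \{x_i, y_i\}_{i=1}^{n} \leftarrow$ SampleMiniBatch($D_l$, $n$).
6: Estimate pseudo-label $\hat{y}_i$ for $x_i \in \hat{D}_u$.
7: Compute lower-level loss $L$ by Eq. (3).
8: Update network parameters $\{\theta^{t+1}, \phi^{t+1}\}$ by Eq. (4).
9: Compute upper-level loss $L_{bal}$ by Eq. (5).
10: Update bias attractor parameters $\omega^{t+1}$ by Eq. (6).
11: end for

B PROOF OF PROPOSITION 3.1

Proof B.1 The update of $\phi$ is

$$\phi^{t+1} = \phi^{t} - \alpha \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_i(\theta, \phi, \omega)}{\partial \phi}\Big|_{\phi^{t}}.$$

In consequence, we have

$$\omega^{t+1} = \omega^{t} - \eta \nabla L_{bal}(\theta, \phi^{t+1})\Big|_{\omega^{t}} = \omega^{t} + \eta\alpha \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 L_i(\theta, \phi)}{\partial \phi\, \partial \omega^{T}}\Big|_{\phi^{t}, \omega^{t}} \frac{\partial L_{bal}(\theta, \phi)}{\partial \phi}\Big|_{\phi^{t+1}} = \omega^{t} + \eta\alpha \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \Delta f_i}{\partial \omega}\Big|_{\omega^{t}} \frac{\partial^2 L_i(\theta, \phi)}{\partial \phi\, \partial \Delta f_i^{T}}\Big|_{\phi^{t}} \frac{\partial L_{bal}(\theta, \phi)}{\partial \phi}\Big|_{\phi^{t+1}}. \quad (9)$$

Denote $\Xi_i = \frac{\partial L_i(\theta, \phi)}{\partial \Delta f_i}$; then Eq. (9) becomes

$$\omega^{t+1} = \omega^{t} + \eta\alpha \frac{1}{n} \sum_{i=1}^{n} G_i \frac{\partial \Delta f_i}{\partial \omega}\Big|_{\omega^{t}}, \quad (10)$$

where

$$G_i = \left(\frac{\partial \Xi_i}{\partial \phi}\Big|_{\phi^{t}}\right)^{T} \frac{\partial L_{bal}(\theta, \phi)}{\partial \phi}\Big|_{\phi^{t+1}}.$$

As the training loss is

$$L_i(\theta, \phi) = \log \sum_{k=1}^{d} e^{z_{i,k} + \Delta f_{i,k}} - z_{i,c_i} - \Delta f_{i,c_i}, \quad (11)$$

where $c_i$ is the class label of the $i$-th sample, we have

$$\Xi_{i,k} = \begin{cases} \dfrac{e^{z_{i,k} + \Delta f_{i,k}}}{\sum_{s=1}^{d} e^{z_{i,s} + \Delta f_{i,s}}}, & k \neq c_i \\[2ex] \dfrac{e^{z_{i,k} + \Delta f_{i,k}}}{\sum_{s=1}^{d} e^{z_{i,s} + \Delta f_{i,s}}} - 1, & k = c_i \end{cases} \quad\text{i.e.}\quad \Xi_i = p_i - y_i. \quad (12)$$

Meanwhile, the upper-level loss is defined as

$$L_{bal}(\theta, \phi) = -\frac{1}{m} \sum_{j=1}^{m} L^{bal}_{j}(\theta, \phi), \quad (13)$$

which finishes the proof.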
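The identity $\Xi_i = p_i - y_i$ in Eq. (12) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names (`loss`, `xi`, `numeric_grad`) and the toy values of `z` and `delta_f` are assumptions introduced only for illustration. It compares the closed-form gradient of the loss in Eq. (11) with respect to $\Delta f_i$ against a central finite difference.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, delta_f, c):
    """Eq. (11): log-sum-exp of (z + delta_f) minus the entry of the true class c."""
    s = [zk + dk for zk, dk in zip(z, delta_f)]
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s)) - s[c]

def xi(z, delta_f, c):
    """Eq. (12): gradient of the loss w.r.t. delta_f, i.e. p - y (y is one-hot at c)."""
    p = softmax([zk + dk for zk, dk in zip(z, delta_f)])
    return [pk - (1.0 if k == c else 0.0) for k, pk in enumerate(p)]

def numeric_grad(z, delta_f, c, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for k in range(len(delta_f)):
        up = list(delta_f)
        down = list(delta_f)
        up[k] += eps
        down[k] -= eps
        grad.append((loss(z, up, c) - loss(z, down, c)) / (2 * eps))
    return grad

# Toy logits z, toy attractor output delta_f, true class c = 1 (all hypothetical).
z = [2.0, -1.0, 0.5]
delta_f = [0.3, 0.0, -0.2]
analytic = xi(z, delta_f, c=1)
numeric = numeric_grad(z, delta_f, c=1)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

Because the entries of $p_i$ sum to one and $y_i$ is one-hot, the entries of $\Xi_i$ sum to zero, and only the true-class entry is negative.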

C COMPARISON METHODS

To comprehensively evaluate the proposed method, we compare it with the Vanilla model that is trained merely with labeled data and three other lines of methods: 1) re-balancing methods where only the class-imbalanced labeled data are used for training, including Re-sampling (Japkowicz, 2000), LDAM-DRW (Cao et al., 2019) and cRT (Kang et al., 2020); 2) pseudo-labeling based SSL methods where both labeled and unlabeled data are used (without considering class imbalance), including Pseudo-labels (Lee et al., 2013), MixMatch (Berthelot et al., 2019b) and FixMatch (Sohn et al., 2020); 3) imbalanced semi-supervised learning methods that consider class imbalance and unlabeled data simultaneously, including DARP (Kim et al., 2020a), CReST+ (Wei et al., 2021), ABC (Lee et al., 2021), SaR (Lai et al., 2022) and DASO (Oh et al., 2022). We herein give a brief introduction to the comparison methods.

• Vanilla, a plain classification network, e.g., Wide ResNet-28-2 (Oliver et al., 2018), trained with imbalanced labeled data by cross-entropy loss.
• Re-sampling, a re-balancing method that uses a re-sampling strategy to balance the distribution of training data.
• LDAM-DRW, i.e., label-distribution-aware margin, a re-weighting method where the classifier is encouraged to maintain a large margin for tail classes.
• cRT, i.e., classifier re-training, a two-stage training method that first pretrains the entire network with all the imbalanced training data and then re-trains the classifier with a balanced objective.
• MixMatch, an SSL method that combines pseudo-labeling and consistency regularization techniques via Mixup augmentation (Zhang et al., 2018).
• FixMatch, a pseudo-labeling based SSL method in which the strongly augmented unlabeled samples (whose pseudo-labels are generated from their weakly augmented versions) are used to train the network.
• DARP, a recent state-of-the-art imbalanced SSL method that refines raw pseudo-labels via convex optimization to alleviate the distribution bias arising from imbalanced and unlabeled training data.
• CReST, a pseudo-labeling based imbalanced SSL method that combines re-balancing and distribution alignment techniques to alleviate the training bias. The method assumes that labeled and unlabeled data have roughly the same distribution.
• ABC, which is equipped with two parallel linear classifiers, one fitting the imbalanced data and the other fitting the re-balanced data, and adds consistency regularization to further improve the performance.
• SaR, i.e., self-adaptive refinement, which proposes a mitigating vector that refines the soft labels of unlabeled data before generating the one-hot pseudo-labels, in order to alleviate the confirmation bias brought about by unlabeled samples.
• DASO, which, for an unlabeled sample, combines its pseudo-label from the linear classifier with that from a similarity-based classifier to leverage their complementary properties in terms of bias. Moreover, a semantic alignment loss is proposed to balance the biased feature representation.

For fair comparison, we use the same code base as DARP (Kim et al., 2020a) (https://github.com/bbuing9/DARP). As the training or evaluation protocols of CReST (Wei et al., 2021), ABC (Lee et al., 2021) and DASO (Oh et al., 2022) are different from that of DARP, we reproduce their results using the official codes released by the authors (CReST: https://github.com/google-research/crest, ABC: https://github.com/LeeHyuck/ABC, DASO: https://github.com/ytaek-oh/daso). Note that the results on CIFAR-100 in DARP (Kim et al., 2020a) are achieved under $N_1 = 300$, $M_1 = 150$, while this paper keeps $N_1 = 150$, $M_1 = 300$ to satisfy the common assumption that the amount of unlabeled data is larger than that of labeled data.

D EXPERIMENTAL SETUPS

D.1 IMPLEMENTATION DETAILS ON CIFAR AND STL-10

All our experiments are implemented with the PyTorch platform (Paszke et al., 2019) and follow the experimental settings in Kim et al. (2020a). We use Wide ResNet-28-2 (Oliver et al., 2018) as our backbone network. During training, the model is trained with the Adam optimizer (Kingma & Ba, 2015) under the default parameter setting, i.e., $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate is set as $2 \times 10^{-3}$ and the batch size is set as 64. The total number of training iterations is $2.5 \times 10^{5}$ as in Kim et al. (2020a). To evaluate the model, we follow the setting in Berthelot et al. (2019b) and use an exponential moving average (EMA) of its parameters with a decay rate of 0.999 at each iteration. We also follow the standard evaluation protocol in Berthelot et al. (2019b), which evaluates the performance every 500 iterations and reports the average test accuracy of the last 20 evaluations.

Bias attractor: As mentioned in Section 3.2, the bias attractor is a light-weight network, i.e., a multi-layer perceptron with one hidden layer in this paper. The input of the bias attractor is the prediction scores output by the linear classifier, so the input dimension is the same as the number of classes. We normalize the input through its $L_2$ norm or a softmax activation. Concretely, we use the softmax operator on CIFAR-10 and STL-10, and the $L_2$ norm on CIFAR-100 and SUN-397 because it gives better and more stable performance there than the softmax operator. Note that the gradients of the input of the bias attractor are stopped in the training stage. The hidden layer dimension of the bias attractor is fixed as 256, which gives stable and sound results across all our experiments. The parameters of the bias attractor are updated by Eq. (6), where the learning rate $\eta$ is set as $1 \times 10^{-4}$.

Baselines: We evaluate our L2AC based on two recent popular SSL methods, i.e., MixMatch (Berthelot et al., 2019b) and FixMatch (Sohn et al., 2020). Both baselines are cornerstones of current state-of-the-art imbalanced SSL methods, such as DARP (Kim et al., 2020a), CReST (Wei et al., 2021) and DASO (Oh et al., 2022). Note that the two baselines can be uniformly formulated as Eq. (1). For MixMatch, the pseudo-label of an unlabeled example is produced by applying temperature sharpening to the average prediction of its different augmented versions, and the objective function on unlabeled data is the mean-squared error (MSE) loss. The threshold $\tau$ is kept as 0, and $\lambda_u$ is dynamically updated by a linear ramp-up strategy, i.e., $\lambda_u$ linearly increases from 0 to 75 during training. For FixMatch, unlabeled data are augmented by weak and strong augmentations via RandAugment (Cubuk et al., 2020). In particular, the weakly augmented data are used to generate pseudo-labels for the strongly augmented data, and these strongly augmented data are then used to compute the unlabeled loss in Eq. (1). We set $\tau$ as 0.95 and $\lambda_u$ as 1, without applying the linear ramp-up strategy.
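The bias attractor described above can be sketched as a small residual branch on top of the linear classifier's scores. This is only an illustration under stated assumptions: the class name `BiasAttractor`, the helper `bias_adaptive_logits`, the ReLU activation and the Gaussian initialization are not taken from the paper, and plain Python lists stand in for tensors, so the stop-gradient on the input is noted in a comment rather than implemented.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(scores):
    """Divide the score vector by its L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in scores))
    return [v / norm for v in scores] if norm > 0 else list(scores)

class BiasAttractor:
    """One-hidden-layer MLP: num_classes -> hidden (256 in the paper) -> num_classes."""

    def __init__(self, num_classes, hidden=256, seed=0):
        rng = random.Random(seed)  # assumed initialization, for illustration only
        self.w1 = [[rng.gauss(0.0, 0.01) for _ in range(num_classes)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0.0, 0.01) for _ in range(hidden)] for _ in range(num_classes)]
        self.b2 = [0.0] * num_classes

    def __call__(self, logits, norm="softmax"):
        # In an autograd framework the input would be detached here (stop-gradient).
        x = softmax(logits) if norm == "softmax" else l2_normalize(logits)
        # ReLU is an assumed choice of hidden activation.
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return [sum(w * v for w, v in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

def bias_adaptive_logits(logits, attractor, training, norm="softmax"):
    """Training: linear scores plus the residual delta_f. Test: linear scores only."""
    if not training:
        return list(logits)
    delta_f = attractor(logits, norm)
    return [z + d for z, d in zip(logits, delta_f)]

# Toy usage with 10 classes (CIFAR-10-sized output); the logits are made up.
attractor = BiasAttractor(num_classes=10)
logits = [0.5 * k - 2.0 for k in range(10)]
train_out = bias_adaptive_logits(logits, attractor, training=True)
eval_out = bias_adaptive_logits(logits, attractor, training=False)
```

At test time only the linear classifier is used, so the sketch returns the original scores unchanged when `training=False`.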

D.2 IMPLEMENTATION DETAILS ON SUN397

SUN397 (Xiao et al., 2010) is an imbalanced real-world scene classification dataset, which originally consists of 108,754 RGB images labeled with 397 classes. Following the experimental setups in Kim et al. (2020a), we hold out 50 samples per class for testing because no official data split is provided. We then artificially construct the labeled and unlabeled datasets from the remaining data according to $M_k / N_k = 2$. The comparison methods include Vanilla, classifier re-training (cRT) (Kang et al., 2020), FixMatch (Sohn et al., 2020), DARP (Kim et al., 2020a) and ABC (Lee et al., 2021).

Training details. For pre-processing, we randomly crop all labeled and unlabeled training images and rescale them to 224 × 224 before applying augmentation. We use the standard ResNet-34 (He et al., 2016) as our backbone network. During training, the model is trained with the Adam optimizer (Kingma & Ba, 2015) with a batch size of 128 labeled samples and 256 unlabeled samples, and an initial learning rate of 0.002 for 300 training epochs. For fair comparison, all the experiments are based on FixMatch (Sohn et al., 2020). We set the unlabeled loss weight $\lambda_u$ as 1.0 and the confidence threshold $\tau$ as 0.6, and utilize the exponential moving average technique with decay rate 0.99. Following Kim et al. (2020a), we adopt RandAugment with random magnitude (Cubuk et al., 2020) for strong augmentation and random horizontal flip for weak augmentation.
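The per-class split described above (50 held-out test images, then $M_k / N_k = 2$) can be illustrated with a small helper. This is a sketch under assumptions: the function name `split_class` is hypothetical, and rounding the labeled count down (dropping any leftover images) is an assumed convention that the text does not specify.

```python
def split_class(num_images, test_holdout=50, ratio=2):
    """Return (n_test, n_labeled, n_unlabeled) for one class.

    ratio is M_k / N_k, the number of unlabeled images per labeled image.
    Rounding down is an assumption; the paper does not state how remainders
    are handled.
    """
    remaining = num_images - test_holdout
    if remaining < 0:
        raise ValueError("class has fewer images than the test hold-out")
    n_labeled = remaining // (ratio + 1)
    n_unlabeled = ratio * n_labeled
    return test_holdout, n_labeled, n_unlabeled

# A hypothetical class with 350 images: 50 for testing, 100 labeled, 200 unlabeled.
example = split_class(350)
```

Applying the same rule to every class preserves the natural imbalance of the dataset in both the labeled and unlabeled sets.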

E ADDITIONAL EXPERIMENTS

E.1 EVALUATION ON IMBALANCED TEST SETS

Real-world test data do not always follow a uniform distribution, so we simulate several imbalanced test sets to further study the generalization of the proposed method. As shown in Fig. 6, we construct three imbalanced test sets: (a) Test-1, which follows a long-tailed class distribution with an imbalance ratio of 10; (b) Test-2, which follows a reversed long-tailed class distribution; (c) Test-3, which follows a random class distribution. We evaluate the proposed L2AC upon FixMatch (Sohn et al., 2020) and compare it with the following methods: DARP (Kim et al., 2020a), CReST+ (Wei et al., 2021) and ABC (Lee et al., 2021). All these models are trained on CIFAR-10 with imbalance ratio $\gamma = \gamma_l = \gamma_u = 100$. As the evaluation metrics bACC and GM are insensitive to the class imbalance of test sets, we add ACC to measure the recognition accuracy over all samples. The results are summarized in Tab. 6. It can be observed that the network learned with our L2AC achieves the best or the second best performance across all the test settings, which further indicates that the proposed bi-level learning framework provides a relatively unbiased classifier compared to the other methods.

In the main text, we have evaluated the proposed L2AC under long-tailed imbalanced settings with various imbalance ratios. Here, we further validate its generalization in the case of step imbalance on CIFAR-10 under two typical setups, namely $\gamma_l = \gamma_u = 100$ and $\gamma_l = 100$, $\gamma_u = 1$. As shown in Fig. 7, step imbalance assumes a more severely imbalanced class distribution than the long-tailed imbalance setups, as there are very scarce data for half of the classes. The experimental results are summarized in Tab. 7. We can see that the proposed L2AC achieves the best performance among all the compared methods in all settings. Especially in the case of distribution mismatch between labeled and unlabeled data, L2AC brings a significant performance gain.

Case of $\gamma_l = \gamma_u$. Fig. 8 visualizes the confusion matrices of the pseudo-labels of FixMatch (Sohn et al., 2020) and our L2AC under the imbalance ratio $\gamma_l = \gamma_u = 100$. Note that this requires the true labels, which are hidden in the training stage. We can observe that the original pseudo-labels are highly biased toward the majority classes of the labeled dataset. In contrast, our L2AC tends toward a relatively equal per-class recall; on minority classes in particular, the performance is significantly improved compared to FixMatch (Sohn et al., 2020). This suggests that the proposed method readily improves the quality of pseudo-labels.
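The three metrics used in these evaluations can be computed from a confusion matrix whose rows are true classes. The sketch below follows the paper's definition of bACC and GM as the arithmetic and geometric mean of class-wise sensitivity (recall); the function names and the toy 3-class confusion matrix are assumptions made up for illustration.

```python
import math

def per_class_recall(confusion):
    """Recall of each class; confusion[i][j] counts true class i predicted as j."""
    recalls = []
    for k, row in enumerate(confusion):
        total = sum(row)
        recalls.append(row[k] / total if total else 0.0)
    return recalls

def bacc(confusion):
    """Balanced accuracy: arithmetic mean of the class-wise recalls."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)

def gm(confusion):
    """Geometric mean of the class-wise recalls."""
    recalls = per_class_recall(confusion)
    return math.prod(recalls) ** (1.0 / len(recalls))

def acc(confusion):
    """Plain accuracy over all samples, which depends on the test class distribution."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical imbalanced test set: two head classes (100 samples each) and one
# tail class (10 samples) that is mostly misclassified.
confusion = [
    [90, 10, 0],
    [5, 90, 5],
    [0, 6, 4],
]
recalls = per_class_recall(confusion)
scores = {"bACC": bacc(confusion), "GM": gm(confusion), "ACC": acc(confusion)}
```

In this toy case the weak tail class barely affects ACC because it contributes few samples, while bACC and GM weight every class equally, which is why ACC is reported in addition on imbalanced test sets.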

E.4 TRAINING CONVERGENCE VERIFICATION

To verify the convergence of our proposed L2AC approach, Fig. 10 visualizes the training curves of the lower-level loss in Eq. (3) and the upper-level loss in Eq. (5) as the training iteration increases from 0 to $2.5 \times 10^{5}$. We can see that both the lower-level loss and the upper-level loss converge quickly within the first 100 epochs ($5 \times 10^{4}$ iterations), and the test accuracy increases quickly at the same time. After the test accuracy reaches its peak value, L2AC roughly maintains the same test accuracy until termination, which verifies the robustness of the proposed method. Since the balanced set $B$ is dynamically sampled from the labeled set $D_l$, a natural question is whether the upper-level loss (over $B$) has the same convergence rate as the lower-level loss (over $D_l$) during training. To investigate this, we further visualize the curves of these two losses in Fig. 11, and we can observe that the two losses decrease differently and converge to values of different magnitudes at different iterations.

E.5 RUNNING COST ANALYSIS

In Section 3.2, we conduct a complexity analysis of the training algorithm of our L2AC, which shows that our method is very efficient in theory. To verify this, we herein measure floating point operations per second (FLOPS) using an NVIDIA GeForce RTX 3090 to quantify the training cost. We compare our proposed algorithm with the baseline model FixMatch (Sohn et al., 2020) and two state-of-the-art methods, DARP (Kim et al., 2020a) and ABC (Lee et al., 2021), as L2AC uses the same code base as these two methods. Besides, we also provide the training cost of L2AC (traditional), the algorithm that unrolls the gradient of the whole classification network to compute the second-order gradient of the bias attractor, just like most gradient-based bi-level optimization algorithms (Finn et al., 2017; Ren et al., 2018). It can be seen that: (1) our L2AC is much faster than L2AC (traditional); (2) the training cost of our L2AC is comparable to that of the current state-of-the-art method ABC (Lee et al., 2021).

w/ DARP (Kim et al., 2020a) | 1.47 M | 18.2 iter/sec | 7.5 iter/sec
w/ ABC (Lee et al., 2021) | 1.47 M | 15.1 iter/sec | 14.9 iter/sec
w/ L2AC (traditional) | 1.48 M | 9.9 iter/sec | 9.7 iter/sec
w/ L2AC (ours) | 1.48 M | 14.2 iter/sec | 13.9 iter/sec

It can be observed that the increase in computation cost of our proposed L2AC is nearly negligible compared with the baseline models, especially considering the significant performance improvement of L2AC. It is also worth noting that our proposed L2AC is more efficient than traditional second-order optimization, which we attribute to two reasons: (1) the bias attractor only adds a very small number of parameters (about 0.68% of the total number of parameters); (2) to calculate the second-order gradient of these parameters, we only need to unroll the gradient of the linear classifier, as shown in Eq. (4). Note that in the test stage our L2AC requires no extra overhead compared with the baseline model FixMatch.

E.6 FEATURE VISUALIZATION UNDER OTHER SETUPS

In Fig. 5, we visualize the representations of training data through t-SNE (Van der Maaten & Hinton, 2008) on CIFAR-10 with $\gamma_l = 100$, $\gamma_u = 1$. Under such a setting, each class has roughly the same number of samples, which ensures a clear visualization for each class. To verify that the proposed L2AC generally obtains a high-quality representation, we further visualize the t-SNE of training data under other experimental setups, including $\gamma_l = \gamma_u = 100$ and $\gamma_l = 100$, $\gamma_u = 100$ (reversed). As shown in Fig. 12 and Fig. 13, compared with FixMatch (Sohn et al., 2020), our L2AC clearly improves the separability of the tail classes from the head classes. This verifies that the result in Fig. 5 is not due to the specific choice of experimental setup.

F CONVERGENCE ANALYSIS OF ALGORITHM 1

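We prove that our algorithm for minimizing $L_{bal}$ converges at a rate of $\tilde{O}(\frac{1}{\sqrt{T}})$, which matches the convergence results of similar work such as Ren et al. (2018).

Theorem F.1 Assume that $L_{bal}$ in Eq. (5) is $L$-smooth with $\rho$-bounded gradients on the samples of the balanced set $B$, and that the bias attractor function $\Delta f_{\omega}(\cdot)$ is differentiable with a $\delta$-bounded gradient and twice differentiable with a bounded Hessian. Let $\alpha_t$ in Eq. (4) satisfy $\alpha_t = \frac{c_1}{t} \le \frac{2}{L}$, and let $\eta_t$ in Eq. (6) be set as $\eta_t = \frac{c_2}{\sigma\sqrt{t}}$ for $c_2 > 0$ such that $\frac{c_2}{\sigma\sqrt{t}} < \frac{1}{L}$. Then the loss function $L_{bal}$ converges to a critical point at the rate $\tilde{O}(\frac{1}{\sqrt{T}})$, i.e.,

$$\min_{0 \le t \le T} \mathbb{E}\left[\|\nabla L_{bal}(\phi^{t}(\omega^{t}))\|_2^2\right] \le \tilde{O}\left(\frac{1}{\sqrt{T}}\right). \quad (14)$$

Proof F.1 The parameter $\omega$ is updated by stochastic gradient descent as

$$\omega^{t+1} = \omega^{t} - \eta_t \nabla L_{bal}(\phi^{t}(\omega^{t}))\big|_{B_t}, \quad (15)$$

where $B_t$ is a mini-batch of class-balanced data. We can write Eq. (15) as

$$\omega^{t+1} = \omega^{t} - \eta_t\left[\nabla L_{bal}(\phi^{t}(\omega^{t})) + \xi^{t}\right], \quad (16)$$

where $\xi^{t}$ is the random gradient noise with zero mean and finite variance $\sigma^2$. Observe that

$$L_{bal}(\phi^{t+1}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t})) = \left\{L_{bal}(\phi^{t+1}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t+1}))\right\} + \left\{L_{bal}(\phi^{t}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t}))\right\}. \quad (17)$$

According to the $L$-smoothness of $L_{bal}$ with respect to $\phi$, we have

$$L_{bal}(\phi^{t+1}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t+1})) \le \left\langle \nabla_{\phi} L_{bal}(\phi^{t}(\omega^{t+1})),\ \phi^{t+1}(\omega^{t+1}) - \phi^{t}(\omega^{t+1}) \right\rangle + \frac{L}{2}\left\|\phi^{t+1}(\omega^{t+1}) - \phi^{t}(\omega^{t+1})\right\|_2^2. \quad (18)$$

Meanwhile, since $\phi^{t+1}(\omega^{t+1}) - \phi^{t}(\omega^{t+1}) = -\alpha_t \nabla_{\phi} L(\phi(\omega^{t+1}), \theta)\big|_{\phi^{t}(\omega^{t+1})}$, we have

$$L_{bal}(\phi^{t+1}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t+1})) \le \left\|\nabla_{\phi} L_{bal}(\phi^{t}(\omega^{t+1}))\right\| \left\|-\alpha_t \nabla_{\phi} L(\phi(\omega^{t+1}), \theta)\big|_{\phi^{t}(\omega^{t+1})}\right\| + \frac{L}{2}\left\|-\alpha_t \nabla_{\phi} L(\phi(\omega^{t+1}), \theta)\big|_{\phi^{t}(\omega^{t+1})}\right\|_2^2 \le \alpha_t \rho^2 + \frac{L}{2}\alpha_t^2 \rho^2. \quad (19)$$

The second inequality holds due to the assumptions $\|\nabla_{\phi} L_{bal}(\phi^{t}(\omega^{t+1}))\| \le \rho$ and $\|\nabla_{\phi} L(\phi(\omega^{t+1}), \theta)|_{\phi^{t}(\omega^{t+1})}\| \le \rho$. According to the $L$-smoothness of $L_{bal}$ with respect to $\omega$, we obtain a corresponding bound on $L_{bal}(\phi^{t}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t}))$. Taking expectations with respect to $\xi^{t}$ on Eq. (24), we can obtain Eq. (25). The first inequality holds due to $\mathbb{E}[\|\xi^{t}\|_2^2] = \sigma^2$ and the second equality holds due to $\mathbb{E}_{\xi^{t}}[\xi^{t}] = 0$. Furthermore, we have Eq. (26). The second inequality holds due to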



In theory, an MLP can approximate almost any continuous function (Hornik et al., 1989).
bACC and GM are defined as the arithmetic and geometric mean over class-wise sensitivity, respectively.
https://github.com/bbuing9/DARP
https://github.com/google-research/crest
https://github.com/LeeHyuck/ABC
https://github.com/ytaek-oh/daso



Figure 1: Experiments on CIFAR-10-LT. (a) Labeled set is class-imbalanced with imbalance ratio γ = 100, while the whole training data remains balanced. Analysis on (b) per-class recall and (c) predicted class distribution for Upper bound (trained with the whole training data with ground truth labels), Lower bound (trained with the labeled data only), FixMatch and Ours on the balanced test set. Note that predicted class distribution is averaged by the predicted scores for all samples.

It is clear that L2AC significantly raises the final recall of the minority classes. Especially in the situation where the distributions of labeled and unlabeled data are severely mismatched, as shown in Fig. 4(b), our L2AC improves the recall of the most minority class by around 60% over FixMatch. Such high-quality pseudo-label estimation probably benefits from a more unbiased classifier with a basically equal preference for each class.

How about the learnt linear classifier and feature extractor? We revisit the toy experiment in Section 1, where the whole class-balanced training set is used to train an unbiased upper bound model. In contrast, the baseline FixMatch and our L2AC are trained under the standard imbalanced SSL setups, and we compare their predicted class distributions on the test set with that of the upper bound model. As shown in Fig. 1(c), the classifier learned by L2AC approximates the predicted class distribution of the upper bound model much better than FixMatch does, indicating that L2AC results in a relatively unbiased classifier. For the feature extractor, we further visualize the representations of training data through t-SNE (Van der Maaten & Hinton, 2008) on CIFAR-10 with $\gamma_l = 100$, $\gamma_u = 1$.



Figure 3: Confusion matrices of FixMatch (Sohn et al., 2020), DARP (Kim et al., 2020a), ABC (Lee et al., 2021), and ours on CIFAR-10 under the imbalance ratio γ = 100.

Figure 5: t-SNE visualization of training data for (a) FixMatch and (b) L2AC. L2AC helps to discriminate tail classes from majority ones.

Figure 6: Class distributions of three typical imbalanced test sets.

Figure 7: Class distributions of CIFAR-10 under step imbalance setups with (a) γ l = γ u = 100; (b) γ l = 100, γ u = 1.

Figure 8: Pseudo-label confusion matrices of (a) FixMatch and (b) Ours on CIFAR-10 under γ l = γ u = 100.



Figure 9: Pseudo-label confusion matrices of (a) FixMatch and (b) Ours on CIFAR-10 with γ l = 100, γ u = 100 (reversed).

Figure 10: Curves of Left: lower-level and upper-level losses and Right: test accuracy during training of our approach on long-tailed CIFAR-10 under γ l = γ u = 100.

Figure 12: t-SNE visualization of unlabeled data for (a) FixMatch and (b) L2AC on CIFAR-10 with γ l = γ u = 100.

$= \log(T)$, thus the last equality holds, which finishes our proof.
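The $\log(T)$ factor and the $\frac{1}{\sqrt{T}}$ rate can be made concrete by summing the step sizes of Theorem F.1. The sketch below is a numerical proxy under stated assumptions, not the exact expression of the proof: the constants `c1`, `c2`, `sigma` are set to 1 for illustration, and the ratio $\sum_t \alpha_t / \sum_t \eta_t$ is used only as a stand-in for how bounds of this type typically scale.

```python
import math

def step_size_sums(T, c1=1.0, c2=1.0, sigma=1.0):
    """Sum alpha_t = c1 / t and eta_t = c2 / (sigma * sqrt(t)) for t = 1..T."""
    sum_alpha = sum(c1 / t for t in range(1, T + 1))
    sum_eta = sum(c2 / (sigma * math.sqrt(t)) for t in range(1, T + 1))
    return sum_alpha, sum_eta

def rate_proxy(T, **constants):
    """Ratio of the two sums; it is at most (1 + ln T) / sqrt(T) when c1 = c2 = sigma = 1."""
    sum_alpha, sum_eta = step_size_sums(T, **constants)
    return sum_alpha / sum_eta

# The harmonic sum grows like log(T), the other sum like sqrt(T).
horizons = [10, 100, 10000]
proxies = [rate_proxy(T) for T in horizons]
```

With unit constants, the sum of $\alpha_t$ is bounded by $1 + \ln T$ and the sum of $\eta_t$ is at least $\sqrt{T}$, so the proxy decays like $\log(T)/\sqrt{T}$, which is $\tilde{O}(\frac{1}{\sqrt{T}})$.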



… | … ±0.14 / 61.2 ±0.15 | 65.8 ±0.52 / 56.5 ±2.08 | 86.7 ±0.80 / 86.2 ±0.82 | 72.9 ±0.24 / 71.0 ±0.32
w/ SaR | 66.8 ±0.92 / 59.9 ±1.32 | 64.4 ±2.21 / 57.3 ±1.95 | 68.4 ±3.20 / 62.0 ±2.17 | 65.5 ±1.01 / 64.2 ±0.95
w/ DASO | 69.8 ±1.10 / 69.3 ±1.07 | 66.5 ±1.99 / 65.4 ±2.25 | 75.5 ±0.48 / 74.6 ±0.67 | 65.7 ±1.01 / 62.0 ±1.23
w/ ABC | 75.7 ±0.76 / 74.7 ±0.47 | 68.5 ±0.40 / 56.4 ±1.50 | 72.1 ±0.53 / 41.2 ±4.40 | 62.9 ±0.36 / 59.9 ±0.60
w/ L2AC (ours) | 76.6 ±0.73 / 75.7 ±1.08 | 72.1 ±0.62 / 70.3 ±0.93 | 87.2 ±0.09 / 86.7 ±0.08 | 74.0 ±0.82 / 72.9 ±1.01
FixMatch | 71.5 ±0.72 / 66.8 ±1.51 | 68.4 ±0.15 / 59.9 ±0.43 | 68.9 ±1.95 / 42.8 ±8.11 | 65.5 ±0.05 / 26.0 ±0.44
w/ DARP | 75.5 ±0.05 / 73.0 ±0.09 | 70.4 ±0.25 / 64.9 ±0.17 | 85.4 ±0.55 / 85.0 ±0.65 | 74.9 ±0.51 / 72.3 ±1.13
w/ CReST+ | 77.5 ±0.15 / 76.1 ±0.15 | 72.1 ±0.74 / 68.9 ±1.29
… | … ±0.42 / 75.9 ±0.76 | 71.5 ±0.23 / 66.9 ±0.25 | 85.9 ±0.68 / 85.3 ±0.53 | 78.3 ±0.34 / 76.1 ±0.21
w/ DASO | 78.3 ±0.55 / 76.5 ±0.57 | 74.6 ±0.74 / 71.7 ±0.52 | 87.9 ±0.41 / 87.7 ±0.43 | 79.5 ±0.91 / 78.9 ±0.96
w/ ABC | 80.2 ±0.42 / 78.9 ±1.29 | 74.7 ±1.04 / 72.2 ±1.45 | 81.3 ±0.34 / 80.2 ±0.36 | 70.3 ±0.50 / 67.9 ±0.70
w/ L2AC (ours) | 82.1 ±0.57 / 81.5 ±0.64 | 77.6 ±0.53 / 75.8 ±0.71 | 89.5 ±0.18 / 89.2 ±0.19 | 82.2 ±1.23 / 81.7 ±1.36

Comparison results on CIFAR-100 and STL-10 under two different imbalance ratios. The performance (bACC / GM) is reported in the form of mean ±std across three random runs.

… | … ±0.19 / 52.1 ±0.31 | 52.6 ±0.13 / 43.0 ±0.45 | 79.9 ±0.52 / 79.1 ±0.49 | 77.0 ±0.65 / 75.8 ±0.68

Comparison results on large-scale SUN397. The performance (bACC / GM) is reported in the form of mean±std across three random runs.

Performance (bACC / GM) on CIFAR10-LT and STL-10 under various label ratio β.

Ablation study.

Imbalanced test set results. ACC: accuracy for all samples.

Performance (bACC / GM) on CIFAR-10 under step imbalance setups.

Training cost analysis on CIFAR-10 and CIFAR-100.

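Displayed inequalities of Proof F.1, written with the shorthand $g^{t} = \nabla_{\omega} L_{bal}(\phi^{t}(\omega^{t}))$:

$$L_{bal}(\phi^{t}(\omega^{t+1})) - L_{bal}(\phi^{t}(\omega^{t})) \le \left\langle g^{t},\ \omega^{t+1} - \omega^{t} \right\rangle + \frac{L}{2}\left\|\omega^{t+1} - \omega^{t}\right\|_2^2 = \left\langle g^{t},\ -\eta_t\left[g^{t} + \xi^{t}\right] \right\rangle + \frac{L\eta_t^2}{2}\left\|g^{t} + \xi^{t}\right\|_2^2 = -\left(\eta_t - \frac{L\eta_t^2}{2}\right)\left\|g^{t}\right\|_2^2 + \frac{L\eta_t^2}{2}\left\|\xi^{t}\right\|_2^2 + \left(L\eta_t^2 - \eta_t\right)\left\langle g^{t}, \xi^{t} \right\rangle.$$

$$\left(\eta_t - \frac{L\eta_t^2}{2}\right)\left\|g^{t}\right\|_2^2 \le L_{bal}(\phi^{t}(\omega^{t})) - L_{bal}(\phi^{t+1}(\omega^{t+1})) + \alpha_t\rho^2 + \frac{L}{2}\alpha_t^2\rho^2 + \frac{L\eta_t^2}{2}\left\|\xi^{t}\right\|_2^2 + \left(L\eta_t^2 - \eta_t\right)\left\langle g^{t}, \xi^{t} \right\rangle.$$

$$\sum_{t=1}^{T}\left(\eta_t - \frac{L\eta_t^2}{2}\right)\left\|g^{t}\right\|_2^2 \le L_{bal}(\phi^{1}(\omega^{1})) - L_{bal}(\phi^{T+1}(\omega^{T+1})) + \sum_{t=1}^{T}\left[\alpha_t\rho^2 + \frac{L}{2}\alpha_t^2\rho^2 + \frac{L\eta_t^2}{2}\left\|\xi^{t}\right\|_2^2 + \left(L\eta_t^2 - \eta_t\right)\left\langle g^{t}, \xi^{t} \right\rangle\right] \le L_{bal}(\phi^{1}(\omega^{1})) + \sum_{t=1}^{T}\left[\alpha_t\rho^2 + \frac{L}{2}\alpha_t^2\rho^2 + \frac{L\eta_t^2}{2}\left\|\xi^{t}\right\|_2^2 + \left(L\eta_t^2 - \eta_t\right)\left\langle g^{t}, \xi^{t} \right\rangle\right].$$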

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their constructive suggestions on improving this paper. This research was supported by the National Key R&D Program of China (2020YFA0713900), the Macao Science and Technology Development Fund under Grant 0612020A2, the Major Key Project of PCL (PCL2021A12), and the China NSFC projects under contracts 61721002 and 61906144.

