LONG-TAILED PARTIAL LABEL LEARNING VIA DYNAMIC REBALANCING

Abstract

Real-world data usually couples label ambiguity with heavy class imbalance, challenging the algorithmic robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of the two, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, while the performance of PLL is severely degraded in the long-tailed context. We show that, even with the aid of an oracle class prior, state-of-the-art methods underperform due to an adverse fact: the constant rebalancing in LT is harsh to the label disambiguation in PLL. To overcome this challenge, we propose a dynamic rebalancing method, termed RECORDS, that assumes no prior knowledge about the class distribution. Based on a parametric decomposition of the biased model output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gain of RECORDS compared with a range of baselines. The code is publicly available.

1. INTRODUCTION

Partial label learning (PLL) originates from real-world scenarios where the annotation for each sample is an ambiguous set containing the ground truth and other confusing labels. This is common when we gather annotations from news websites with several tags (Luo & Orabona, 2010), videos with several characters of interest (Chen et al., 2018), or labels from multiple annotators (Gong et al., 2018). The ideal assumption behind PLL is that the collected data is approximately uniformly distributed over classes. However, a more natural assumption in the above real-world applications is imbalance, often following the long-tailed law, which must be considered when deploying PLL methods in online systems. This poses a new challenge for PLL research: the robustness of algorithms to both class imbalance and label ambiguity. Existing efforts, partial label learning and long-tailed learning, have each independently studied one aspect of this problem over the past decades. Standard PLL requires label disambiguation from candidate sets along with the training of an ordinary classifier (Feng et al., 2020). The mainstream solution is to estimate label-wise confidence to implicitly or explicitly re-weight the classification loss, e.g., PRODEN (Lv et al., 2020), LW (Wen et al., 2021), CAVL (Fei et al., 2022), and CORR (Wu et al., 2022), which have achieved state-of-the-art performance in PLL. In long-tailed learning, the core difficulty lies in diminishing the inherent bias induced by heavy class imbalance (Chawla et al., 2002; Menon et al., 2013). A simple but fairly effective method is logit adjustment (Menon et al., 2021; Ren et al., 2020), which has proven very powerful in a range of recent studies (Cui et al., 2021; Narasimhan & Menon, 2021).
Figure 1: Average classifier prediction (on the CIFAR-100 test set) of different methods during training (on the LT-PLL training set CIFAR-100-LT with imbalance ratio ρ = 100 and ambiguity q = 0.05). "PRODEN" (Lv et al., 2020) is a popular PLL method. "PRODEN + Oracle-LA" denotes PRODEN with the state-of-the-art logit adjustment (Menon et al., 2021; Hong et al., 2021) in LT under the oracle prior. "PRODEN + RECORDS" is PRODEN with our proposed calibration. "Uniform" characterizes the expected average confidence on different classes.

Nevertheless, for the more practical long-tailed partial label learning (LT-PLL) problem, several dilemmas remain with the above two paradigms. One straightforward concern is that the skewed long-tailed distribution exacerbates the bias towards head classes in label disambiguation, easily resulting in trivial solutions that are excessively confident about the head classes. More importantly, most state-of-the-art long-tailed learning methods cannot be directly used in LT-PLL, since they require the class distribution, which is agnostic in PLL due to label ambiguity. In addition, we discover that even after applying an oracle class distribution prior during training, existing techniques underperform in LT-PLL and even fail in some cases. In Figure 1, we trace the average prediction of a PLL model, PRODEN (Lv et al., 2020), on a uniform test set. Normally, the backbone PLL method PRODEN exhibits biased predictions towards head classes, shown by the blue curve; ideally, we would expect that with the intervention of the state-of-the-art logit adjustment in LT, the predictions for all classes become equally confident, namely the purple curve. However, as can be seen, PRODEN calibrated by the oracle prior actually performs worse and is prone to over-adjusting towards the tail classes, as shown by the orange curve.
This is because logit adjustment in LT leverages a constant class-distribution prior to rebalance the training and does not consider the dynamics of label disambiguation. Specifically, at the early stage, when the true label is still very ambiguous within the candidate set, over-adjusting the logits only strongly confuses the classifier, which is negative to the overall training. Thus, as seen in Figure 1, the average prediction on the tail classes becomes too high as training proceeds. Based on the above analysis, a dynamic rebalancing mechanism friendly to the training dynamics is preferable to the previous constant rebalancing methods. Following this intuition, we propose a novel method, termed REbalanCing fOR Dynamic biaS (RECORDS), for LT-PLL. Specifically, we perform a parametric decomposition of the biased model output and implement a dynamic adjustment by maintaining a prototype feature with momentum updates during training. Empirical and theoretical analysis demonstrates that our dynamic parametric class distribution asymptotically approaches the statistical prior while remaining benign to the overall training. A quick glance at the performance of RECORDS is the red curve in Figure 1, which approximately fits the expected purple curve throughout training. Our contributions can be summarized as follows:

1. We delve into the more practical but under-explored LT-PLL scenario and identify several challenges in this task that cannot be addressed, and may even lead to failure, by the straightforward combination of current long-tailed learning and partial label learning.

2. We propose RECORDS for LT-PLL, which conducts a dynamic adjustment to rebalance the training without requiring any prior about the class distribution. Theoretical and empirical analysis shows that the dynamic parametric class distribution asymptotically approaches the oracle class distribution while being more friendly to label disambiguation.
3. Our method is orthogonal to existing PLL methods and can be easily plugged into them in an end-to-end manner. Extensive experiments on three benchmark datasets under the long-tailed setting and a range of PLL methods demonstrate the effectiveness of the proposed RECORDS. In particular, we show a 32.03% improvement in classification performance over the best baseline CORR on the PASCAL VOC dataset.

2. RELATED WORK

Partial Label Learning (PLL). In PLL, each training sample is associated with a candidate label set containing the ground truth. Early explorations were mainly average-based (Hüllermeier & Beringer, 2006; Zhang & Yu, 2015; Cour et al., 2011b), treating all candidate labels equally during model training. Their drawback is that model training is easily misled by the false positive labels that co-occur with the ground truth. Other identification-based methods consider the true label as a latent variable and optimize the objective under certain criteria (Jin & Ghahramani, 2002; Liu & Dietterich, 2014; Nguyen & Caruana, 2008; Yu & Zhang, 2015). Recently, self-training methods (Feng et al., 2020; Wen et al., 2021; Lv et al., 2020; Fei et al., 2022) that gradually disambiguate the candidate label sets during training have achieved better results. PiCO (Wang et al., 2022b) introduces a contrastive learning branch that improves PLL performance by enhancing representation.

Long-Tailed Learning (LT). In LT, several methods have been proposed to handle the extremely skewed distribution during training (Cui et al., 2019; Zhou et al., 2022; Chawla et al., 2002). Re-sampling (Kubat & Matwin, 1997; Wallace et al., 2011; Han et al., 2005) is one of the most widely used paradigms, down-sampling samples of head classes or up-sampling samples of tail classes. Re-weighting (Morik et al., 1999; Menon et al., 2013) adjusts the sample weights in the loss function during training.
Transfer learning (Chu et al., 2020; Wang et al., 2021; Kim et al., 2020) seeks to transfer knowledge from head classes to tail classes to obtain more balanced performance. Recently, logit adjustment techniques (Menon et al., 2021; Ren et al., 2020; Tian et al., 2020), which modify the output logits of the model by an offset term log P_train(y), have been the state of the art.

Long-Tailed Partial Label Learning (LT-PLL). A few works have approximately explored the LT-PLL problem. Liu et al. (2021) implicitly alleviate data imbalance in non-deep PLL by constraining the parameter space, which is hard to apply in the deep learning context due to the complex optimization. Concurrently, SoLar (Wang et al., 2022a) improves the label disambiguation process in LT-PLL through optimal transport. However, it requires an extra outer loop to refine the label prediction via Sinkhorn-Knopp iteration, which increases the algorithmic complexity. Different from these works, this paper tackles the LT-PLL problem from the perspective of rebalancing.

3. PRELIMINARIES

3.1. PROBLEM FORMULATION

Let X be the input space and Y = {1, 2, ..., C} be the class space. The candidate label set space S is the power set of Y without the empty set: S = 2^Y \ {∅}. A long-tailed partial label training set can be denoted as D_train = {(x_i, y_i, S_i)}_{i=1}^N ∈ (X, Y, S)^N, where each x_i is associated with a candidate label set S_i ⊂ Y and its ground truth y_i ∈ S_i is invisible. The sample number N_c of each class c ∈ Y, sorted in descending order, exhibits a long-tailed distribution. Let f(x; θ) denote a deep model parameterized by θ, transforming x into an embedding vector. The final output logits of x are given by z(x) = g(f(x; θ); W) = W^⊤ f(x; θ), where g is a linear classifier with parameter matrix W. We use Θ = [θ, W] to denote all parameters of the deep network. For evaluation, a class-balanced test set D_uni = {(x_i, y_i)} is used (Brodersen et al., 2010). In a nutshell, the goal of LT-PLL is to learn Θ on D_train that minimizes the following balanced error rate (BER) on D_uni:

min_Θ BER(Θ) = P_{(x,y)∈D_uni}(y ≠ argmax_{y′∈Y} z_{y′}(x)),  where P_uni(y) = 1/C.  (1)
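The BER objective in Equation 1 is simply the mean of per-class error rates; a minimal NumPy sketch (the function name is mine, not from the paper's code):

```python
import numpy as np

def balanced_error_rate(logits, labels, num_classes):
    """BER: the mean of per-class error rates, i.e., the error a classifier
    makes on a class-balanced test set such as D_uni."""
    preds = np.argmax(logits, axis=1)
    per_class_err = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            per_class_err.append(np.mean(preds[mask] != c))
    return float(np.mean(per_class_err))
```

Because each class contributes equally, a classifier that is accurate only on head classes still incurs a large BER.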

3.2. SELF-TRAINING PLL METHODS

Self-training PLL methods maintain class-wise confidence weights w for each sample during training and formulate the loss function as a summation of the classification loss weighted by w. w is updated in each training round based on the model output and gradually converges to one-hot labels, converting PLL into ordinary classification. Specifically, PRODEN (Lv et al., 2020) assigns higher confidence weights to labels with larger output logits; LW (Wen et al., 2021) additionally considers the loss on non-partial labels and assigns corresponding confidence weights to them as well; CAVL (Fei et al., 2022) designs an identification strategy based on the idea of CAM and assigns hard confidence weights to labels; and CORR (Wu et al., 2022) applies consistency regularization in the disambiguation strategy. Please refer to Appendix B.1 for more details.
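As a concrete illustration, a PRODEN-style update can be sketched as below: the confidence of each candidate label is the model's softmax probability restricted to the candidate set and renormalized (a simplified NumPy sketch; function names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def update_confidence(logits, candidate_mask):
    """PRODEN-style disambiguation: each candidate label's confidence is
    the softmax probability, renormalized over the candidate set S_i.
    candidate_mask: (N, C) 0/1 matrix with 1 where the label is in S_i."""
    probs = softmax(logits) * candidate_mask         # zero out non-candidates
    return probs / probs.sum(axis=1, keepdims=True)  # renormalize per sample

def weighted_loss(logits, w):
    """Classification loss re-weighted by the confidence weights w."""
    log_probs = np.log(softmax(logits) + 1e-12)
    return float(-(w * log_probs).sum(axis=1).mean())
```

Note how the update is driven entirely by the model's own output: if the output is biased towards head classes, the confidence weights inherit that bias, which is exactly the failure mode analyzed in Section 4.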

3.3. LOGIT ADJUSTMENT FOR LONG-TAILED LEARNING

We briefly review logit adjustment (LA) (Menon et al., 2021; Hong et al., 2021), a powerful technique for supervised long-tailed learning. First, by the Bayes rule, the underlying class probability satisfies P(y|x) ∝ P(x|y) · P(y). Directly minimizing the cross-entropy loss to the optimum yields softmax(z_y(x)) = P_train(y|x) (Yu et al., 2018). However, a BER-optimal output logit z_uni concerns the balanced data, for which softmax(z_y^uni(x)) = P_uni(y|x) ∝ P(x|y) · P_uni(y) with P_uni(y) = 1/C (Menon et al., 2013). Then, we have the following relations:

P_uni(y|x) ∝ P(x|y) · P_train(y) / P_train(y) ∝ P_train(y|x) / P_train(y) ∝ softmax(z_y(x) − log P_train(y)).  (2)

That is to say, if we obtain the logit z_y(x) by training with the standard cross-entropy loss, a BER-optimal logit z_y^uni(x) = z_y(x) − log P_train(y) can be obtained by subtracting the offset term log P_train(y). Using z_y^uni(x) as the test logit preserves the balanced part P(x|y) of the output and removes the imbalanced part P_train(y) in a statistical sense. LA has demonstrated its effectiveness in a range of recent long-tailed learning methods (Zhu et al., 2022; Cui et al., 2021).
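In the post-hoc form, Equation 2 amounts to a one-line correction of the logits; a minimal sketch (the `tau` scaling knob is my addition in the spirit of Menon et al. (2021); the function name is mine):

```python
import numpy as np

def logit_adjust(logits, class_counts, tau=1.0):
    """Subtract tau * log P_train(y) from the logits so that the adjusted
    argmax approximates the BER-optimal (balanced) prediction."""
    prior = np.asarray(class_counts, dtype=float)
    prior = prior / prior.sum()
    return logits - tau * np.log(prior)
```

For a head-biased model, the adjustment can flip borderline predictions towards tail classes, which is exactly the behavior that becomes harmful when a constant prior is applied during PLL label disambiguation.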

4. METHOD

In LT-PLL, the model output is biased towards head classes, and using this biased output to update confidence weights results in biased re-weighting, namely a tendency to identify a more frequent label in the candidate set as the ground truth. In turn, biased re-weighting of the loss function further aggravates the imbalance of the model. In this section, we seek a universal rebalancing algorithm that obtains unbiased output z_uni from the biased output z in self-training PLL methods.

4.1. MOTIVATION

Previous long-tailed learning techniques assume no ambiguity in supervision and focus only on the model bias caused by the training set distribution, which we term constant rebalancing. As shown in Figure 1, constant rebalancing (Oracle-LA) fails to rebalance the model in LT-PLL. This is because in LT-PLL, not only the skewed training set distribution but also the ambiguous supervision affects training, since the inferred labels change continuously along with the label disambiguation process. Specifically, at the early stage, the prediction imbalance of PRODEN is not yet significant; adjusting with the much more skewed training set distribution instead leads to a model heavily biased towards the tail classes, hindering label disambiguation. Therefore, a dynamic rebalancing method that accounts for the label disambiguation process should intuitively be more effective, e.g., RECORDS in Figure 1. In the following, we present our method concretely.

4.2. RECORDS: REBALANCING FOR DYNAMIC BIAS

The above analysis empirically suggests that the constant calibration by LA cannot match the dynamics of label disambiguation in LT-PLL. Formally, Equation 2 fails because label disambiguation dynamically affects model optimization during training, which induces a mismatch between the ground-truth prior P_train(y) and the training dynamics of label disambiguation. To mediate this conflict, we propose a parametric decomposition of the original rebalancing paradigm:

P_uni(y|x; Θ) ∝ P(x|y; Θ) · P_train(y|Θ) / P_train(y|Θ) ∝ P_train(y|x; Θ) / P_train(y|Θ) ∝ softmax(z_y(x) − log P_train(y|Θ)),  (3)

where P_uni(y|x; Θ) is the parametric class probability under a uniform class prior. Here, we take the perspective of dynamic rebalancing: a dynamic P_train(y|Θ) adapted to the training process is required, instead of the constant prior P_train(y), to rebalance the model in LT-PLL. Existing PLL methods work to achieve better disambiguation results or improve the quality of the representation, i.e., learning a P(x|y; Θ) that is closer to P(x|y). Thus, our study is orthogonal to existing PLL methods and can be combined with them to improve performance (see Section 5.2). Also, the rebalanced output can boost the accuracy of label disambiguation, so a better P(x|y; Θ) can be learned.

Figure 2: Illustration of RECORDS. The class-wise confidence weights w are updated by the "disambiguation" module for each sample and used as soft/hard pseudo labels in the PLL loss. The main differences between the PLL baselines are the "PLL Loss" and the "disambiguation" module (see Table 6). The "debias" module dynamically rebalances P_train(y|x; Θ) to obtain P_uni(y|x; Θ). A balanced P_uni(y|x; Θ) helps tail samples disambiguate labels more accurately and avoid being overwhelmed by head classes. A momentum-updated prototype feature is used to estimate P_train(y|Θ), which is benign to label disambiguation and asymptotically approaches the oracle prior P_train(y). In comparison, constant rebalancing does not consider the dynamics of label disambiguation.

Our effective and lightweight design is the estimation of the dynamic class distribution P_train(y|Θ) by the model on the training set, i.e., P_train(y|Θ) = E_{x_i∈D_train} P_train(y|x_i; Θ). First, we use the Normalized Weighted Geometric Mean (NWGM) approximation (Baldi & Sadowski, 2013) to move the expectation inside the softmax:

P_train(y|Θ) = E_{x_i∈D_train} softmax(z_y(x_i)) ≈_{NWGM} softmax(E_{x_i∈D_train} z_y(x_i)) = softmax(g_y(E_{x_i∈D_train} f(x_i; θ); W)).  (4)

Note that the NWGM approximation is widely used in dropout understanding (Baldi & Sadowski, 2014), caption generation (Xu et al., 2015), and causal inference (Wang et al., 2020). The intuition behind Equation 4 is to capture more stable feature statistics by means of NWGM and combine them with the latest updated linear classifier to estimate P_train(y|Θ). Nevertheless, Equation 4 is not yet the final form, since directly estimating P_train(y|Θ) requires traversing the whole dataset in an EM alternation. To improve efficiency, we design a momentum mechanism that accumulates the expectation of features along with training. Concretely, we maintain a prototype feature F for the entire training set, updated with each batch's feature expectation by momentum: F ← mF + (1 − m) E_{x_i∈Batch} f(x_i; θ), where m ∈ [0, 1) is a momentum coefficient. Then, replacing E_{x_i∈D_train} f(x_i; θ) in Equation 4 by F yields the final implementation of our method:

z_y^uni(x) = z_y(x) − log P_train(y|Θ) = z_y(x) − log softmax(g_y(F; W)).  (5)

Our RECORDS is lightweight and can be easily plugged into existing PLL methods in an end-to-end manner.
As illustrated in Figure 2, we insert a "debias" module before the label disambiguation of each training iteration, converting P_train(y|x; Θ) to P_uni(y|x; Θ) via Equation 5.
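A minimal sketch of the debias step (Equation 5) could look as follows; the class and method names are mine and simplified relative to the released code:

```python
import numpy as np

def log_softmax(z):
    return z - (np.log(np.exp(z - z.max()).sum()) + z.max())

class RecordsDebias:
    """Maintain a momentum prototype feature F of the training set and
    subtract log softmax(W^T F), the estimated dynamic prior, from logits."""
    def __init__(self, feat_dim, momentum=0.9):
        self.m = momentum
        self.F = np.zeros(feat_dim)

    def update(self, batch_features):
        # F <- m * F + (1 - m) * E_{x in batch} f(x; theta)
        self.F = self.m * self.F + (1 - self.m) * batch_features.mean(axis=0)

    def debias(self, logits, W):
        # z_uni(x) = z(x) - log softmax(g(F; W))   (Equation 5)
        return logits - log_softmax(W.T @ self.F)
```

Early in training, the prototype-based prior is close to uniform, so the adjustment is mild; it strengthens only as the model (and hence the estimated prior) becomes more skewed, which is the intended dynamic behavior.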

4.3. RELATION BETWEEN DYNAMIC REBALANCING AND CONSTANT REBALANCING

In the previous sections, we designed a parametric class distribution to calibrate training in LT-PLL. However, the relation between P_train(y|Θ) and P_train(y) is not yet clear. In this section, we point out their connection theoretically. First, let P_train(y_j) = E_{(x,y)∼(X,Y)} 1(y = y_j), y_j ∈ Y, denote the oracle prior, where 1(·) denotes the indicator function. Considering a hypothesis space H where each h_Θ ∈ H : X → Y is a multiclass classifier parameterized by Θ (h = g ∘ f in Figure 2), we define the parametric class distribution of h_Θ as P_train(y_j|Θ) = E_{(x,y)∼(X,Y)} 1(h_Θ(x) = y_j), y_j ∈ Y. Assume the disambiguated label for (x, y, S) during training is ỹ(x, S) ∈ S. Then, the empirical risk based on the disambiguated labels over D_train is R_{D_train}(Θ) = (1/N) Σ_{i=1}^N 1(h_Θ(x_i) ≠ ỹ(x_i, S_i)).

Proposition 1. Denote by L2_h the L2 distance between P_train(y) and P_train(y|Θ) given h. If the small ambiguity degree condition (Cour et al., 2011a; Liu & Dietterich, 2014) holds, namely η ∈ [0, 1), then for any δ > 0, with probability at least 1 − δ,

L2_h < sqrt( 4 (d_H (ln 2N + 2 ln C) − ln δ + ln 2) / ((ln 2 − ln(1 + η)) N) ),

where N is the sample number and C is the category number.

Proposition 1 yields an important implication: alongside label disambiguation, the dynamic estimation progressively approaches the oracle class distribution under a small ambiguity degree. We refer the readers to Appendix D for the complete proof. To further verify this, we trace the L2 distance between the estimate P_train(y|Θ) and the oracle prior P_train(y) during training. Figure 3 indicates that our dynamic rebalancing is not only benign to label disambiguation at the early stages, but also finally approaches the constant rebalancing.
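The parametric class distribution defined above and its distance to the oracle prior are straightforward to monitor during training; a small sketch (function names are mine):

```python
import numpy as np

def parametric_prior(all_logits, num_classes):
    """P_train(y_j | Theta) = E 1(h_Theta(x) = y_j): the frequency of each
    class among the model's argmax predictions on the training set."""
    preds = np.argmax(all_logits, axis=1)
    return np.bincount(preds, minlength=num_classes) / len(preds)

def l2_to_oracle(all_logits, oracle_prior):
    """L2 distance between the parametric and the oracle class prior."""
    est = parametric_prior(all_logits, len(oracle_prior))
    return float(np.linalg.norm(est - oracle_prior))
```

Evaluating this distance once per epoch is enough to reproduce the kind of convergence curve discussed above.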

5. EXPERIMENTS

5.1. TOY STUDY

We simulate a four-class LT-PLL task where each class occupies a different region and its samples are distributed uniformly within the corresponding region. The candidate label set of each sample consists of the true label and negative labels, each flipped in from the other classes with probability 0.6. In the first column of Figure 4, we visualize the training set and the test set: the training set is highly imbalanced across classes, with per-class sample counts of 30 (red), 100 (yellow), 500 (blue), and 1000 (green), while the test set is balanced. In the right three columns of Figure 4, we show, for each method, the results of identifying true labels from the candidate sets on the training set and the prediction results on the test set. As can be seen, with PRODEN the minority class (red) is completely overwhelmed by the majority classes (blue and green), and the yellow class is mostly dominated as well. After calibration with the oracle class prior, PRODEN + Oracle-LA instead predicts all data in the test set as the minority class (red). This suggests that constant rebalancing is prone to over-adjusting towards tail classes when coupled with the label disambiguation dynamics. In comparison, using RECORDS as a dynamic rebalancing mechanism achieves the desired results on both the training set and the test set.
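The toy construction can be reproduced in a few lines; the class regions below are my own placement, and only the sample counts and flip probability follow the setup above:

```python
import numpy as np

def make_toy_lt_pll(counts=(30, 100, 500, 1000), flip_p=0.6, seed=0):
    """Four-class toy LT-PLL set: class c is uniform on its own unit square
    (region centers are assumed); each negative label joins the candidate
    set independently with probability flip_p."""
    rng = np.random.default_rng(seed)
    centers = [(0, 0), (2, 0), (0, 2), (2, 2)]  # assumed class regions
    xs, ys, sets = [], [], []
    for c, n in enumerate(counts):
        for _ in range(n):
            xs.append(rng.uniform(0, 1, size=2) + centers[c])
            ys.append(c)
            sets.append({c} | {k for k in range(len(counts))
                               if k != c and rng.random() < flip_p})
    return np.array(xs), np.array(ys), sets
```

A balanced test set can be drawn from the same regions with equal counts per class.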

5.2. BENCHMARK RESULTS

Long-tailed partial label datasets. We evaluate RECORDS on three datasets: CIFAR-10-LT (Liu et al., 2019), CIFAR-100-LT (Liu et al., 2019), and PASCAL VOC. For CIFAR-10-LT and CIFAR-100-LT, we build long-tailed versions of CIFAR-10/100 with imbalance ratio ρ ∈ {50, 100}. Following Lv et al. (2020); Wen et al. (2021), we adopt the uniform setting for CIFAR-10-LT and CIFAR-100-LT, i.e., each false label ȳ enters the candidate set with probability P(ȳ ∈ S | ȳ ≠ y) = q. PASCAL VOC is a real-world LT-PLL dataset constructed from PASCAL VOC 2007 (Everingham et al.). Specifically, we crop objects in images as instances; all objects appearing in the same original image are regarded as a candidate label set, and we empirically observe significant class imbalance. Note that we are the first to conduct experiments with deep models on real-world PLL datasets, while previous real-world PLL datasets are tabular and only suitable for linear models or shallow MLPs (Fei et al., 2022). Please refer to Appendix E.1 for more details.

Baselines. We consider the state-of-the-art PLL algorithms PRODEN (Lv et al., 2020), LW (Wen et al., 2021), CAVL (Fei et al., 2022), and CORR (Wu et al., 2022), and their combinations with the state-of-the-art rebalancing method logit adjustment (LA) (Menon et al., 2021; Hong et al., 2021). We use two rebalancing variants: (1) Oracle-LA: rebalance a PLL model during training with the oracle class prior by LA; (2) Oracle-LA post-hoc: rebalance a pre-trained PLL model with the oracle prior using LA in a post-hoc manner. Note that our method directly plugs RECORDS into the PLL baselines without requiring the oracle class prior. For comparisons with the concurrent LT-PLL work SoLar (Wang et al., 2022a), please refer to Appendix E.7.

Implementation details. We use an 18-layer ResNet as the backbone. Standard data augmentations are applied as in Cubuk et al. (2020).
The mini-batch size is set to 256, and all methods are trained using SGD with momentum 0.9 and weight decay 0.001. The hyper-parameter m in Equation 5 is set to 0.9 throughout. The initial learning rate is set to 0.01, and we train for 800 epochs with cosine learning rate scheduling.

Table 3: Fine-grained analysis on CIFAR-100-LT-NU with ρ = 100 and q ∈ {0.03, 0.05, 0.07}.

Overall performance. In Table 1, we summarize the Top-1 accuracy on the three benchmark LT-PLL datasets. Our method clearly exhibits superior performance on all datasets and experimental settings under CORR, PRODEN, LW, and CAVL. Specifically, compared to the best PLL baseline CORR, RECORDS significantly improves accuracy by 6.45%-25.68% on CIFAR-10-LT, 3.86%-7.60% on CIFAR-100-LT, and 32.03% on PASCAL VOC. When Oracle-LA is applied to the PLL baselines, it induces severe performance degradation on CIFAR but improves significantly on PASCAL VOC. Oracle-LA post-hoc better alleviates the class imbalance on CIFAR. However, neither addresses the adverse effect of constant rebalancing on label disambiguation. In comparison, our RECORDS, which solves this problem through dynamic rebalancing during training, achieves consistent and the best improvements among all methods.

Fine-grained analysis. In Figure 5, we visualize the per-class accuracy on CIFAR-10-LT with ρ = 100 under the best PLL baseline CORR. As expected, dominant classes generally exhibit higher accuracy with CORR. Oracle-LA post-hoc can improve CORR on medium classes, but accuracy remains poor on tail classes, especially when label ambiguity is high. Oracle-LA performs very poorly on head and medium classes due to over-adjustment towards tail classes. Our RECORDS systematically improves performance over CORR, particularly on rare classes.
In Table 2, we show the Many-Medium-Few accuracies of different methods on CIFAR-100-LT with ρ = 100. Our RECORDS shows significant and consistent gains on Medium and Few classes when combined with CORR. As demonstrated in much of the long-tailed learning literature (Kang et al., 2020; Menon et al., 2021), there can be a head-to-tail accuracy tradeoff in LT-PLL, and our method achieves the best overall accuracy in this tradeoff. More results are summarized in Appendix E.4.

5.3. FURTHER ANALYSIS

Non-uniform candidate generation. Real-world annotation ambiguity often occurs among semantically close labels. To evaluate RECORDS in such practical scenarios, we conduct experiments on a more challenging dataset, CIFAR-100-LT-NU. To build CIFAR-100-LT-NU, we generate candidates from the ground truth of CIFAR-100-LT in a non-uniform setting. Specifically, labels in the same superclass as the ground truth have a higher probability of being selected into the candidate set, i.e., P(ȳ ∈ S | ȳ ≠ y, D(ȳ) ≠ D(y)) = q and P(ȳ ∈ S | ȳ ≠ y, D(ȳ) = D(y)) = 8q, where D(y) denotes the superclass to which y belongs. In Table 3, we show the fine-grained analysis on CIFAR-100-LT-NU with ρ = 100 and q ∈ {0.03, 0.05, 0.07}. Combined with our RECORDS, the best baseline CORR achieves significant gains, demonstrating the robustness of RECORDS in different scenarios. More results are summarized in Appendix E.3.

Linear probing performance. Following the self-supervised learning literature (Chen et al., 2020; He et al., 2020), we conduct linear probing on CIFAR-100-LT with ρ = 100 and q = 0.05 to quantitatively evaluate the representation quality of different methods. To eliminate the effect of class imbalance, the linear classifier is trained on a balanced dataset. From Figure 6(a), our RECORDS consistently outperforms the baselines under different shots, indicating that considering the label disambiguation dynamics helps extract better representations.

Other dynamic strategies. To verify the effectiveness of RECORDS, we set up two other straightforward dynamic strategies: (1) Temp Oracle-LA: temperature-scale P_train(y) from 0 to 1 during training; (2) Epoch RECORDS: use the predictions of the latest epoch to estimate the class distribution and rebalance. In Table 5, we find that although such simpler solutions can be effective, RECORDS outperforms them significantly, confirming the effectiveness of our design.
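The non-uniform candidate generation for CIFAR-100-LT-NU described earlier in this section can be sketched as follows (the uniform setting of Section 5.2 is the special case where the same-superclass factor is 1; the function name is mine):

```python
import numpy as np

def gen_candidates(y, superclass, q, same_factor=8, seed=0):
    """Generate candidate sets: a false label in the same superclass as the
    ground truth enters with probability same_factor * q, others with q."""
    rng = np.random.default_rng(seed)
    num_classes = len(superclass)
    sets = []
    for yi in y:
        s = {yi}  # the ground truth is always in its candidate set
        for c in range(num_classes):
            if c == yi:
                continue
            p = same_factor * q if superclass[c] == superclass[yi] else q
            if rng.random() < p:
                s.add(c)
        sets.append(s)
    return sets
```

The factor 8 concentrates ambiguity among semantically close labels, making disambiguation harder than in the uniform setting at the same q.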
More baselines for imbalanced learning. We additionally compare RECORDS with two contrastive-based state-of-the-art long-tailed learning baselines, BCL (Zhu et al., 2022) and PaCo (Cui et al., 2021), and two regularization methods for mitigating imbalance in PLL, LDD (Liu et al., 2021) and SFN (Liu et al., 2021). Similar to Oracle-LA, we provide the oracle class distributions to BCL and PaCo, denoted as Oracle-BCL and Oracle-PaCo. In Table 4, RECORDS still significantly outperforms these methods, showing the effectiveness of dynamic rebalancing.

Effect of the momentum coefficient m. In Figure 6(b), we explore the effect of the momentum update factor m on the performance of RECORDS. The best result is achieved at m = 0.9, and performance decreases for both smaller and larger values. Notably, even at m = 0, RECORDS maintains a competitive result, showing its robustness. At the other extreme, m = 1.0, CORR + RECORDS degenerates to CORR.

6. CONCLUSION

In this paper, we focus on the practical LT-PLL scenario and identify several critical challenges in this task based on the previously independent research paradigms LT and PLL. To avoid the drawbacks of their straightforward combination, we propose a novel method for LT-PLL, RECORDS, which does not require a prior on the oracle class distribution. Both empirical and theoretical analysis show that our proposed parametric class distribution asymptotically approaches the static oracle class prior during training and is more friendly to label disambiguation. Our method is orthogonal to existing PLL methods and can be easily plugged into them in an end-to-end manner. Extensive experiments demonstrate the effectiveness of the proposed RECORDS. In the future, the extension of RECORDS to other weakly supervised learning scenarios can be explored in a more general scope.

ETHICS STATEMENT

This paper does not raise any ethics concerns. This study does not involve any human subjects, practices to data set releases, potentially harmful insights, methodologies and applications, potential conflicts of interest and sponsorship, discrimination/bias/fairness concerns, privacy and security issues, legal compliance, and research integrity issues.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our experimental results, our code is available at https://github.com/MediaBrain-SJTU/RECORDS-LTPLL. We provide experimental setups and implementation details in Section 5 and Appendix E. The proof of Proposition 1 is given in Appendix D.

A REAL-WORLD SCENARIOS OF LT-PLL

For example, a movie clip may contain several characters talking to each other, with some of them appearing in a screenshot. Although we can obtain scripts and dialogues that indicate the names of the characters, we cannot directly match the real name to each face in the screenshot (see Figure 7(a)). A similar scenario arises in recognizing faces from news images, where we can obtain the names of people from the news descriptions but cannot establish a one-to-one correspondence with the face images (see Figure 7(b)). The partial label learning problem also appears in crowdsourcing, where each instance may be given multiple labels by different annotators. Some labels may be incorrect or biased due to differences in the annotators' expertise or cultural background, so it is necessary to find the most appropriate label for each instance from the candidates (see Figure 7(c)). Meanwhile, real-world data naturally exhibit a long-tailed distribution. Corresponding to the three examples in Figure 7: the main characters in movies tend to take up most of the screen time, while supporting roles appear much less frequently; in sports news, superstars get most of the exposure, while many role players get only a few appearances; and the number of individuals of different species in nature also shows a clear long-tailed distribution. However, this common long-tailed distribution of real data has been ignored by most existing PLL methods. Further, the invisibility of the ground truth in PLL makes it difficult to even manually balance the dataset. Therefore, we focus on the more difficult but common LT-PLL scenario where label ambiguity and class imbalance co-occur.
When facing LT-PLL, directly combining the paradigms of partial label learning and long-tailed learning poses several dilemmas. One immediate problem is that the skewed long-tailed distribution exacerbates the bias toward head classes in label disambiguation and tends to lead to trivial solutions that are overconfident in head classes. More importantly, most state-of-the-art long-tailed learning methods cannot be directly used in LT-PLL because they require the class distribution to be available, which is agnostic in PLL due to label ambiguity. Moreover, we find that existing techniques perform poorly in LT-PLL, and in some cases fail entirely, even after applying an oracle class distribution prior during training.

B PRELIMINARIES

B.1 SELF-TRAINING PLL METHODS

The self-training PLL methods (Feng et al., 2020; Wen et al., 2021; Lv et al., 2020; Fei et al., 2022) remove annotation ambiguities from model outputs and gradually identify true labels during training, achieving state-of-the-art results. These methods maintain class-wise confidence weights w for each sample during training and formulate the loss function as a summation of the classification loss weighted by w. They update the confidence weights w from the output, then use w as soft/hard pseudo-labels to learn classifiers from the data. Model training and weight updates are performed iteratively in each round. In the late training period, w converges to stable one-hot labels, and PLL is converted to an ordinary classification problem that learns from the disambiguated labels. The differences among these self-training methods lie in the form of the loss function and how the weights are updated. In Table 6, we summarize four state-of-the-art self-training methods: PRODEN (Lv et al., 2020), LW (Wen et al., 2021), CAVL (Fei et al., 2022), and CORR (Wu et al., 2022). Specifically, PRODEN assigns higher confidence weights to labels with larger output logits; LW additionally considers the loss on non-partial labels and assigns corresponding confidence weights to non-partial labels as well; CAVL designs an identification strategy based on the idea of CAM and assigns hard confidence weights to labels; and CORR applies consistency regularization in the disambiguation strategy.
Method | Loss function | Confidence weight update
PRODEN | Σ_{j∈Y} w_{i,j} ℓ(z_j(x_i)) | w_{i,j} = exp(z_j(x_i)) / Σ_{k∈S_i} exp(z_k(x_i)) if j ∈ S_i; w_{i,j} = 0 if j ∉ S_i
LW | Σ_{j∈S_i} w_{i,j} ℓ(z_j(x_i)) + β Σ_{j∉S_i} w_{i,j} ℓ(−z_j(x_i)) | w_{i,j} = exp(z_j(x_i)) / Σ_{k∈S_i} exp(z_k(x_i)) if j ∈ S_i; w_{i,j} = exp(z_j(x_i)) / Σ_{k∉S_i} exp(z_k(x_i)) if j ∉ S_i
CAVL | Σ_{j∈Y} w_{i,j} ℓ(z_j(x_i)) | w_{i,j} = 1 if j = argmax_{j∈S_i} (z_j(x_i) − 1) z_j(x_i); w_{i,j} = 0 otherwise
CORR | λ D_KL(z(x_i) ∥ w_i) + Σ_{j∉S_i} ℓ(−z_j(x_i)) | w_{i,j} = (Π_{x′∈A(x_i)} exp(z_j(x′)))^{1/|A(x_i)|} / Σ_{k∈Y} (Π_{x′∈A(x_i)} exp(z_k(x′)))^{1/|A(x_i)|} if j ∈ S_i; w_{i,j} = 0 if j ∉ S_i
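As a concrete illustration of the confidence-weight scheme above, the following NumPy sketch implements a PRODEN-style update (the function names and the array-based interface are ours, not the original implementations): the weights are a softmax over the candidate set only, and the loss is the weighted cross-entropy.

```python
import numpy as np

def proden_update(logits, candidate_mask):
    """PRODEN-style confidence update (illustrative sketch).

    logits: (batch, C) model outputs z; candidate_mask: (batch, C) with 1
    for labels in the candidate set S_i. Weights are a softmax restricted
    to S_i; non-candidate labels get weight 0."""
    exp_z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable exp
    exp_z = exp_z * candidate_mask                              # zero out j not in S_i
    return exp_z / exp_z.sum(axis=1, keepdims=True)             # normalize over S_i

def proden_loss(logits, w):
    """Weighted classification loss: sum_j w_ij * cross-entropy on label j."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(w * log_p).sum(axis=1).mean()
```

In training, `proden_update` and `proden_loss` would alternate each round, so that w gradually concentrates on one candidate label per sample.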

B.2 LOGIT ADJUSTMENT FOR LONG-TAILED LEARNING

Here we give a review of the successful logit adjustment (LA) (Menon et al., 2021; Hong et al., 2021) in long-tailed learning. LA uses the Bayes rule to remove the adverse effect of class imbalance by applying an offset term to the output logit. In the long-tailed setting with a highly skewed training category distribution P_train(y), a trivial classifier that classifies all instances as majority labels can achieve high training accuracy. To cope with this, a class-balanced test set D_uni = {(x_i, y_i)} is used (Brodersen et al., 2010). We denote by P_uni(y|x) ∝ P(x|y) the underlying class probability under the uniform distribution P_uni(y) = 1/C. The goal of long-tailed learning is to learn a model on D_train that minimizes the following balanced error rate (BER) on D_uni: min_Θ BER(Θ) = P_{(x,y)∈D_uni}(y ≠ argmax_{y′∈Y} z_{y′}(x)), with P_uni(y) = 1/C. For an optimal model trained by minimizing the traditional softmax cross-entropy loss, the output logit z satisfies softmax(z_y(x)) = P_train(y|x) ∝ P(x|y) · P_train(y) (Yu et al., 2018). Recall that the goal is to minimize BER. For a BER-optimal model, also known as a Bayes-optimal model, the output logit z^uni satisfies softmax(z^uni_y(x)) = P_uni(y|x) (Menon et al., 2013). LA converts the output z obtained from traditional cross-entropy training into a Bayes-optimal output as follows: since softmax(z_y(x)) = P_train(y|x) ∝ P(x|y) · P_train(y) and P_uni(y|x) ∝ P(x|y), we have P_uni(y|x) ∝ softmax(z_y(x) − log P_train(y)). That is to say, given the logit z_y(x) from training with the standard cross-entropy loss, a Bayes-optimal logit z^uni_y(x) = z_y(x) − log P_train(y) can be obtained by subtracting the offset term log P_train(y). Using z^uni_y(x) as the test logit preserves the balanced part P(x|y) of the output and removes the imbalanced part P_train(y) in a statistical sense. A number of recent studies (Zhu et al., 2022; Cui et al., 2021) have demonstrated the powerful effect of logit adjustment in long-tailed learning.
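The offset correction can be written in a few lines. The sketch below is the post-hoc variant of LA (not our dynamic method); it assumes logits from a trained model and a known or estimated training prior, and the function name is our own:

```python
import numpy as np

def logit_adjust(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log P_train(y).

    tau = 1 recovers the Bayes-optimal correction z_uni = z - log P_train(y)."""
    return logits - tau * np.log(class_prior)

# A head-biased prior: with identical logits for every class, removing the
# skewed prior re-ranks the prediction toward the rarest class.
prior = np.array([0.7, 0.2, 0.1])
z = np.array([[1.0, 1.0, 1.0]])
z_uni = logit_adjust(z, prior)  # argmax moves to the tail class (index 2)
```

Under a uniform prior the adjustment is a constant shift, leaving predictions unchanged, which matches the intuition that only the imbalanced part P_train(y) is removed.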

C PSEUDO-CODE OF RECORDS

We summarize the complete procedure of our RECORDS in Algorithm 1. 

D PROOF OF PROPOSITION 1

Published as a conference paper at ICLR 2023

Lemma D.1. Let L_2(h) for h ∈ H be the L_2 distance between P_train(y) and P_train(y|Θ), where h is parameterized by Θ. We have L_2(h) ≤ 2 E_{(x,y)∼(X,Y)} 1(h(x) ≠ y).

Proof. We have

L_2(h) = Σ_{y_j=1}^{C} E_{(x,y)∼(X,Y)} (1(y = y_j) − 1(h(x) = y_j))^2 ≤ Σ_{y_j=1}^{C} E_{(x,y)∼(X,Y)} |1(y = y_j) − 1(h(x) = y_j)|. (7)

For any instance (x, y), if h(x) = y, i.e., h correctly classifies x, then for all y_j ∈ {1, 2, ..., C} we have |1(y = y_j) − 1(h(x) = y_j)| = 0. Otherwise, if h(x) ≠ y, we have |1(y = y_j) − 1(h(x) = y_j)| = 1 if y_j = y or y_j = h(x), and 0 otherwise. Thus, we can bound L_2(h) by

L_2(h) ≤ 2 E_{(x,y)∼(X,Y)} 1(h(x) ≠ y). (9)

Lemma D.2. Define R as the set of all D_train for which there exists an h with L_2(h) ≥ ϵ and zero empirical risk: R = {D_train ∈ (X, Y, S)^N : ∃h, L_2(h) ≥ ϵ, R_{D_train}(h) = 0}. Introduce another training set D′_train ∈ (X, Y, S)^N. Define M as the set of all pairs (D_train, D′_train) for which there exists an h with zero empirical risk on D_train, L_2(h) ≥ ϵ, and at least ϵN/4 classification errors on D′_train, i.e., M = {(D_train, D′_train) ∈ (X, Y, S)^{2N} : ∃h, L_2(h) ≥ ϵ, R_{D_train}(h) = 0, Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4}. If N > 16 ln 2 / ϵ, then we have P(D_train ∈ R) < 2 P((D_train, D′_train) ∈ M).

Proof. This lemma is used in many learnability proofs. Applying the chain rule of probability:

P((D_train, D′_train) ∈ M)
= P((D_train, D′_train) ∈ M | D_train ∈ R) · P(D_train ∈ R)
= P(∃h, L_2(h) ≥ ϵ, R_{D_train}(h) = 0, Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4 | D_train ∈ R) · P(D_train ∈ R)
≥ P(Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4 | L_2(h) ≥ ϵ, R_{D_train}(h) = 0) · P(D_train ∈ R). (10)

Given L_2(h) ≥ ϵ, Lemma D.1 yields E_{(x,y)∼(X,Y)} 1(h(x) ≠ y) ≥ ϵ/2.
Using the Chernoff bound, when N > 16 ln 2 / ϵ, we get

P(Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4)
= 1 − P(Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) < ϵN/4)
≥ 1 − P(Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) < (1 − 1/2) E_{(x′,y′)∼(X,Y)} Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i))
≥ 1 − e^{−N E_{(x,y)∼(X,Y)} 1(h(x) ≠ y) / 8}
≥ 1 − e^{−Nϵ/16}
> 1/2.

Thus we can bound P(D_train ∈ R) by P(D_train ∈ R) < 2 P((D_train, D′_train) ∈ M).

Lemma D.3 (Liu & Dietterich, 2014). If the hypothesis space H has Natarajan dimension d_H and η < 1, then P((D_train, D′_train) ∈ M) < (2N)^{d_H} C^{2d_H} ((1 + η)/2)^{ϵN/4}.

Proof. Expand D_train and D′_train as D_train = (x^N, y^N, S^N) ∈ (X, Y, S)^N and D′_train = (x′^N, y′^N, S′^N) ∈ (X, Y, S)^N. We get

P((D_train, D′_train) ∈ M) = E P((D_train, D′_train) ∈ M | x^N, y^N, x′^N, y′^N), (13)

where the expectation E is taken with respect to (x^N, y^N, x′^N, y′^N) and the probability P comes from (S^N, S′^N) given (x^N, y^N, x′^N, y′^N). Let H|(x^N, x′^N) be the set of hypotheses making different classifications on D_train and D′_train. We get

P((D_train, D′_train) ∈ M | x^N, y^N, x′^N, y′^N) ≤ Σ_{h ∈ H|(x^N, x′^N)} P(R_{D_train}(h) = 0, Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4 | x^N, y^N, x′^N, y′^N). (14)

Randomly match the instances in D_train and D′_train into N pairs. Following Liu & Dietterich (2014), we define a group G of swaps: a swap σ ∈ G exchanges the instances in the pairs indexed by J_σ ⊆ {1, 2, ..., N}, so that |G| = 2^N and σ(D_train, D′_train) = (D^σ_train, D′^σ_train). Let a_1 and a_2 be the numbers of instances in D_train and in D′_train, respectively, that h classifies incorrectly, and let b_1 and b_2 be the numbers of matched pairs in which h classifies both instances incorrectly and exactly one instance incorrectly, respectively.
We have

P(R_{D_train}(h) = 0, Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4 | x^N, y^N, x′^N, y′^N)
= 1(Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4) · Π_{i=1}^{N} P(h(x_i) = ỹ(x_i, S_i) | x^N, y^N, x′^N, y′^N)
= 1(a_2 ≥ ϵN/4) · Π_{i=1}^{N} P(h(x_i) = ỹ(x_i, S_i) | x^N, y^N, x′^N, y′^N).

We then bound Π_{i=1}^{N} P(h(x_i) = ỹ(x_i, S_i) | x^N, y^N, x′^N, y′^N):

Π_{i=1}^{N} P(h(x_i) = ỹ(x_i, S_i) | x^N, y^N, x′^N, y′^N) ≤ Π_{i=1}^{N} P(h(x_i) ∈ S_i | x^N, y^N, x′^N, y′^N) ≤ η^{a_1}. (17)

Combining Equations 13-17, we get

P((D_train, D′_train) ∈ M)
= E P((D_train, D′_train) ∈ M | x^N, y^N, x′^N, y′^N)
= E [(1/2^N) Σ_{σ∈G} P(σ(D_train, D′_train) ∈ M | x^N, y^N, x′^N, y′^N)]
≤ E [(1/2^N) Σ_{σ∈G} Σ_{h ∈ H|σ(x^N, x′^N)} P(R_{D^σ_train}(h) = 0, Σ_{i=1}^{N} 1(h(x′_i) ≠ y′_i) ≥ ϵN/4 | x^N, y^N, x′^N, y′^N)]
≤ E [(2N)^{d_H} C^{2d_H} (1/2^N) Σ_{σ∈G} 1(a^σ_2 ≥ ϵN/4) η^{a^σ_1}]
≤ E [(2N)^{d_H} C^{2d_H} (1/2^N) Σ_{a^σ_1 = b_1}^{b_1 + b_2} 1(b_1 + b_2 ≥ ϵN/4) C_{b_2}^{a^σ_1 − b_1} 2^{N − b_2} η^{a^σ_1}]
= E [(2N)^{d_H} C^{2d_H} 1(b_1 + b_2 ≥ ϵN/4) η^{b_1} ((1 + η)/2)^{b_2}],

where the third line uses the inequality |H|(x^N, x′^N)| ≤ (2N)^{d_H} C^{2d_H} (Natarajan, 1989). When b_1 = 0 and b_2 = ϵN/4, the right side reaches its maximum of (2N)^{d_H} C^{2d_H} ((1 + η)/2)^{ϵN/4}.

For ease of reading, here we restate Proposition 1 in the main text.

Proposition D.1. Let η = sup_{(x,y)∈X×Y, y_j∈Y, y_j≠y} P_{S|(x,y)}(y_j ∈ S) denote the ambiguity degree, let d_H be the Natarajan dimension of the hypothesis space H, and let ĥ = h_Θ̂ be the optimal classifier on the basis of the label disambiguation, where Θ̂ = arg min_Θ R_{D_train}(Θ). If the small ambiguity degree condition (Cour et al., 2011a; Liu & Dietterich, 2014) is satisfied, namely η ∈ [0, 1), then for all δ > 0, the L_2 distance between P_train(y) and P_train(y|Θ̂) for ĥ is bounded as

L_2(ĥ) < (4 / ((ln 2 − ln(1 + η)) N)) · (d_H (ln 2N + 2 ln C) − ln δ + ln 2)

with probability at least 1 − δ, where N is the sample number and C is the category number.

Proof.
Recall the definition of R: R = {D_train ∈ (X, Y, S)^N : ∃h, L_2(h) ≥ ϵ, R_{D_train}(h) = 0}. We need to establish a sufficient condition for P(D_train ∈ R) ≤ δ. Lemma D.2 bounds P(D_train ∈ R) by 2 P((D_train, D′_train) ∈ M), and Lemma D.3 bounds P((D_train, D′_train) ∈ M) by (2N)^{d_H} C^{2d_H} ((1 + η)/2)^{ϵN/4}. Combining the two bounds, we get

P(D_train ∈ R) < 2 P((D_train, D′_train) ∈ M) ≤ 2 (2N)^{d_H} C^{2d_H} ((1 + η)/2)^{ϵN/4}.

Then we can derive a sufficient condition for P(D_train ∈ R) ≤ δ:

P(D_train ∈ R) ≤ δ
⇐ ln 2 + d_H (ln 2 + ln N) + 2 d_H ln C + (ϵN/4) ln((1 + η)/2) ≤ ln δ
⇔ ϵ ≥ (4 / ((ln 2 − ln(1 + η)) N)) · (d_H (ln 2N + 2 ln C) − ln δ + ln 2).

That is, when ϵ = (4 / ((ln 2 − ln(1 + η)) N)) · (d_H (ln 2N + 2 ln C) − ln δ + ln 2), we have L_2(ĥ) < ϵ with probability at least 1 − δ.

E.1 DATASETS

Candidate label sets generation. Following Lv et al. (2020) and Wen et al. (2021), we adopt the uniform setting on CIFAR-10-LT and CIFAR-100-LT, i.e., P(ȳ ∈ S | ȳ ≠ y) = q, to generate the candidate label set of each sample. That is, each of the C − 1 negative labels has the same probability q of being selected into the candidate set along with the ground truth. Besides, for CIFAR-100-LT we also use a more challenging non-uniform setting in which labels in the same superclass as the ground truth have a higher probability of being selected into the candidate set, i.e., P(ȳ ∈ S | ȳ ≠ y, D(ȳ) ≠ D(y)) = q and P(ȳ ∈ S | ȳ ≠ y, D(ȳ) = D(y)) = 8q, where D(y) denotes the superclass to which y belongs. The non-uniform setting is more practical and challenging for LT-PLL algorithms because the semantics of the categories in the same superclass are closer. We refer to CIFAR-100-LT with candidates generated by the non-uniform setting as CIFAR-100-LT-NU.

Dataset partitioning. To demonstrate the performance of the algorithms on categories with different frequencies, we partition each dataset according to the sample size. Following Kang et al. (2020), we split the dataset into three partitions: Many-shot (classes with more than 100 images), Medium-shot (classes with 20-100 images), and Few-shot (classes with fewer than 20 images).

PASCAL VOC.
We construct the real-world LT-PLL dataset PASCAL VOC from PASCAL VOC 2007 (Everingham et al.). Specifically, we crop the objects in the images as instances, and all objects appearing in the same original image are regarded as the labels of a candidate set. Note that we are the first to conduct experiments based on deep models on real-world datasets, while previous real-world PLL datasets are tabular and only suitable for linear models or shallow MLPs (Fei et al., 2022). Empirically, we observe significant class imbalance, as shown in Table 8. We manually balance the test set for fairness of the evaluation. In Table 7, we summarize the basic characteristics of PASCAL VOC.
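The uniform and non-uniform candidate generation described above can be sketched as follows (the helper name and the boolean-mask representation of candidate sets are our own choices):

```python
import numpy as np

def gen_candidates(y, num_classes, q, superclass=None, rng=None):
    """Generate a candidate-set mask per sample.

    Uniform setting: every negative label joins S with probability q.
    Non-uniform setting: if `superclass` (a class -> superclass map) is
    given, negatives in the ground truth's superclass join with prob 8q."""
    rng = rng or np.random.default_rng(0)
    S = np.zeros((len(y), num_classes), dtype=bool)
    S[np.arange(len(y)), y] = True  # the ground truth is always a candidate
    for i, yi in enumerate(y):
        for j in range(num_classes):
            if j == yi:
                continue
            p = q
            if superclass is not None and superclass[j] == superclass[yi]:
                p = min(8 * q, 1.0)
            if rng.random() < p:
                S[i, j] = True
    return S
```

With q = 0 each candidate set is exactly the ground truth; with q = 1 every label is a candidate, which bounds the two extremes of ambiguity.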

E.2 ADDITIONAL IMPLEMENTATION DETAILS

Toy study. (Figure 4) We simulate a four-class LT-PLL task with each class distributed in a different region and its samples uniformly distributed in the corresponding region. The four uniform distributions are U([-1, -1], [0, 0]), U([0, -1], [1, 0]), U([-1, 0], [0, 1]), and U([0, 0], [1, 1]), respectively. The candidate label set of each sample consists of the ground truth and negative labels flipped from the other classes with probability 0.6, i.e., P(ȳ ∈ S | ȳ ≠ y) = 0.6. The sample sizes of the four classes in the training set are 30 (red), 100 (yellow), 500 (blue), and 1000 (green), respectively. The test set is balanced, with 100 samples per class. We use a two-layer MLP with 10 hidden neurons as the model for the toy study. The batch size is set to 512. The toy models are trained using SGD with momentum 0.9. We train the models for 50 epochs with initial learning rate 2.0. Linear probing. (Figure 6(a)) We train a linear classifier on a frozen pre-trained backbone and measure the quality of the representation through the test accuracy. To eliminate the effect of the long-tailed distribution in the fine-tuning phase, the classifier is trained on a balanced dataset. Specifically, the performance of the classifier is reported on the basis of the pre-trained representations for different amounts of data, including full-shot, 100-shot, and 50-shot. In the fine-tuning phase, we train the linear classifier for 500 epochs with SGD of momentum 0.7 and weight decay 0.0005. The batch size is set to 1000. The learning rate decays exponentially from 10^-2 to 10^-6. The loss function is the ordinary cross-entropy loss.

E.3 ADDITIONAL RESULTS FOR NON-UNIFORM CANDIDATES GENERATION.

In Table 9, we show the complete experimental results on CIFAR-100-LT-NU. Under this more practical and challenging candidate generation setting, our RECORDS achieves a consistent and significant improvement over the PLL baselines for varying imbalance ratio ρ ∈ {50, 100} and ambiguity q ∈ {0.03, 0.05, 0.07}. This demonstrates the stable enhancement brought by RECORDS under different potential candidate generation settings.

E.4 FINE-GRAINED ANALYSIS ON MORE PLL BASELINES

In Tables 10 and 11, we show additional results of the fine-grained analysis on CIFAR-100-LT (ρ = 100, q ∈ {0.03, 0.05, 0.07}) and CIFAR-100-LT-NU (ρ = 100, q ∈ {0.03, 0.05, 0.07}). Our RECORDS shows consistent gains on Medium and Few classes when combined with all four PLL baselines. As a result of the head-to-tail tradeoff, the accuracies on Many classes improve less or decrease slightly, which can be observed throughout the long-tailed learning literature (Kang et al., 2020; Menon et al., 2021). Our RECORDS consistently improves the overall accuracy under this tradeoff.

E.5 FEATURE VISUALIZATION

In Figure 8, we visualize the image representations produced by the feature encoder f using t-SNE (van der Maaten & Hinton, 2008) on CIFAR-10-LT with ρ = 100 and q = 0.5. We can observe some overlap among the representations of different classes for PRODEN due to the heavy class imbalance and the label ambiguity. PRODEN + Oracle-LA basically fails to learn discriminative representations due to the tough constant rebalancing. In contrast, our dynamic rebalancing method RECORDS produces more distinguishable representations.

E.6 COMPARISON WITH SOLAR

SoLar (Wang et al., 2022a) is a concurrent LT-PLL work published in NeurIPS 2022. At the time of our submission, we did not have access to this work and could not make comparisons with it. It improves the label disambiguation process in LT-PLL through the optimal transport technique. However, it requires an extra outer loop to refine the label prediction via Sinkhorn-Knopp iteration, which increases the computational complexity. Different from SoLar, this paper tries to solve the LT-PLL problem from the perspective of rebalancing in a lightweight and effective manner. Note that, compared to the other PLL baselines and the previous experiments in this paper, SoLar additionally applies Mixup (Zhang et al., 2018) in its experiments. To align the experimental setup, we show the results with the addition of Mixup in Table 13. All other settings remain the same as in previous sections. Compared to SoLar, which is designed for the LT-PLL problem, our method still offers consistent improvements. The code implementation of the comparisons with SoLar can be found here.



Many/Medium/Few corresponds to classes with more than 100, 20-100, and fewer than 20 images (Kang et al., 2020).




Figure 4: Visualization of the toy study. The first column illustrates the ground truth distribution of the four classes on the training set and on the test set, marked by color. The right three columns exhibit the label identification from the candidate set for training samples in the first row and the predictions on the test set in the second row w.r.t. three methods.

Figure 3: (a) The L_2 distance between the estimated class distribution and the oracle class prior during training. (b) The final estimated class distribution. The experiment is conducted with CORR + RECORDS on CIFAR-100-LT with imbalance ratio ρ = 100 and ambiguity q = 0.03. As the L_2 distance is gradually minimized, our parametric class distribution gradually converges to the statistical oracle class prior. Besides, as shown in Figure 3(b), the final estimated class distribution is very close to the oracle class prior. In total, Figure 3 indicates that our dynamic rebalancing is not only benign to the label disambiguation at the early stages, but also finally approaches the constant rebalancing.

Figure 5: The per-class accuracy on CIFAR-10-LT with ρ = 100 and q ∈ {0.3, 0.5, 0.7}.

Figure 6: (a) Top-1 accuracy (left) and top-5 accuracy (right) under different shots of linear probing for different methods pretrained on CIFAR-100-LT (ρ = 100, q = 0.05). CORR + Oracle-LA post-hoc is the same as PRODEN in terms of features and thus is not plotted. (b) Performance of CORR + RECORDS with varying m on CIFAR-100-LT (ρ = 100, q = 0.05).

Figure 7: Some practical applications of PLL. (a) A screenshot of the movie "The Amazing Spider-Man" and the corresponding dialogue script. (b) An image from NBA news with the caption "The History of LeBron James and Stephen Curry's Rivalrous Friendship". (c) An image of "Russian Blue", which is very similar to "Korat".

Algorithm 1: Our proposed RECORDS.
Input: Training dataset D_train, deep model f, classifier g, a self-training PLL algorithm A, number of epochs T.
Output: Parameters θ for f and W for g.
1: Initialize uniform weights w;
2: Initialize F with 0;
3: for t = 1, 2, ..., T do
4:   Shuffle the training set D_train into B mini-batches;
5:   for k = 1, 2, ..., B do
6:     Compute the output z for mini-batch D_k;
7:     Update F according to Equation 5;
8:     Compute the loss L_A according to algorithm A;
9:     Compute the debiased output z^uni according to Equation 6;
10:    Update w using z^uni according to algorithm A;
11:    Update θ and W by minimizing L_A;
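For intuition, a minimal NumPy sketch of the per-batch debiasing step in Algorithm 1 is given below. We assume, as a simplification of Equations 5 and 6 (not reproduced here), that F is a momentum average of batch features and that softmax(g(F)) serves as the running estimate of the class distribution; the function names are ours, and the real implementation operates on deep features inside the training loop.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def update_F(F, batch_features, m=0.9):
    """Momentum average of batch features (sketch of the F update).
    F: (d,), batch_features: (batch, d)."""
    return m * F + (1 - m) * batch_features.mean(axis=0)

def debias(logits, g_weight, F, eps=1e-12):
    """Dynamic logit adjustment (sketch of the debiasing step).
    g_weight: (C, d) classifier weights; softmax(g(F)) estimates P_train(y)."""
    prior_hat = softmax(g_weight @ F)        # estimated class distribution
    return logits - np.log(prior_hat + eps)  # z^uni = z - log prior_hat
```

Because the prior estimate is re-derived from F every batch, the adjustment starts near-uniform (harmless to early disambiguation) and sharpens as training converges.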

Figure 8: t-SNE of image representations on CIFAR-10-LT with ρ = 100 and q = 0.5. Note that PRODEN + Oracle-LA post-hoc does not alter the features and shares the same figure as PRODEN.

Top-1 accuracy on three benchmark datasets. Bold indicates the superior results.

Fine-grained analysis on CIFAR-100-LT with ρ = 100 and q ∈ {0.03, 0.05, 0.07}. Many/Medium/Few corresponds to three partitions on the long-tailed data.

Comparison with more baselines for imbalanced learning.

Comparison with other dynamic strategies on CIFAR-10-LT and CIFAR-100-LT.

The loss functions and confidence weights updating strategies for self-training PLL methods.

Characteristics of PASCAL VOC: #Classes, #Train, #Test, Imbalance ratio, Avg #Candidate Labels.

Sample size for each class of PASCAL VOC.

CIFAR-10-LT and CIFAR-100-LT. The original versions of CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 validation images of size 32 × 32, with 10 and 100 categories, respectively. The 100 categories in CIFAR-100 form 20 superclasses, each with 5 classes. Following Liu et al. (2019), we construct the long-tailed versions of CIFAR-10 and CIFAR-100, namely CIFAR-10-LT and CIFAR-100-LT, by applying an exponential decay across the sample sizes of different classes. The imbalance ratio ρ denotes the ratio of the sample sizes of the most frequent and least frequent classes, i.e., ρ = N_1 / N_C, where the sample number N_c of each class c ∈ Y is in descending order.
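For reference, the exponential decay construction can be sketched as follows (the helper name is ours): class c, 0-indexed and sorted by frequency, keeps N_1 · ρ^(-c/(C-1)) samples, so the head-to-tail ratio is exactly ρ.

```python
import numpy as np

def long_tailed_sizes(n_max, num_classes, rho):
    """Per-class sample sizes under exponential decay.

    Class c keeps n_max * rho^(-c/(C-1)) samples, so
    sizes[0] / sizes[-1] equals the imbalance ratio rho."""
    c = np.arange(num_classes)
    return np.floor(n_max * rho ** (-c / (num_classes - 1))).astype(int)

# CIFAR-10-LT with rho = 100: 5000 images for the head class, 50 for the tail.
sizes = long_tailed_sizes(5000, 10, rho=100)
```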

Top-1 accuracy on CIFAR-100-LT-NU with imbalance ratio ρ ∈ {50, 100} and ambiguity q ∈ {0.03, 0.05, 0.07}. Bold indicates superior results.

Fine-grained analysis on CIFAR-100-LT with ρ = 100 and q ∈ {0.03, 0.05, 0.07}. Many/Medium/Few corresponds to three partitions on the long-tailed data. Bold indicates superior results.

Fine-grained analysis on CIFAR-100-LT-NU with ρ = 100 and q ∈ {0.03, 0.05, 0.07}. Many/Medium/Few corresponds to three partitions on the long-tailed data. Bold indicates superior results.

Top-1 accuracy on three benchmark datasets. Bold indicates the superior results.

Results with the addition of Mixup on benchmark datasets. The best and second best results are respectively in bold and underline.

ACKNOWLEDGMENTS

This work is supported by STCSM (No. 22511106101, No. 18DZ2270700, No. 21DZ1100100), 111 plan (No. BP0719010), and State Key Laboratory of UHD Video and Audio Production and Presentation.

A LONG-TAILED PARTIAL LABEL LEARNING: BACKGROUND

The remarkable success of deep learning is built on large amounts of labeled data, but data annotation in real-world scenarios often suffers from annotation ambiguity. To address this, partial label learning allows multiple candidate labels to be annotated for each training instance, and has been widely applied in web mining (Luo & Orabona, 2010), automatic image annotation (Zeng et al., 2013; Chen et al., 2018), ecoinformatics (Liu & Dietterich, 2012), and crowdsourcing (Gong et al., 2018).

