PROGRESSIVE PURIFICATION FOR INSTANCE-DEPENDENT PARTIAL LABEL LEARNING

Abstract

Partial label learning (PLL) aims to train multi-class classifiers from instances with partial labels (PLs): a PL for an instance is a set of candidate labels among which a fixed but unknown candidate is the true label. In the last few years, the instance-independent generation process of PLs has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, whereas relatively little attention has been paid to the practical setting of instance-dependent PLs, where the PL depends not only on the true label but also on the instance itself. In this paper, we propose a theoretically grounded and practically effective approach called PrOgressive Purification (POP) for instance-dependent PLL: in each epoch, POP updates the learning model while purifying each PL for the next epoch of model training by progressively moving out false candidate labels. Theoretically, we prove that POP enlarges, at an appropriate rate, the region where the model is reliable, and eventually approximates the Bayes optimal classifier under mild assumptions; technically, POP is flexible with arbitrary losses and compatible with deep networks, so previously proposed advanced PLL losses can be embedded in it, and their performance is often significantly improved.

1. INTRODUCTION

Over-parameterized deep neural networks owe much of their popularity to their ability to (nearly) perfectly memorize large numbers of training examples, and such memorization is known to decrease the generalization error Feldman (2020). On the other hand, scaling up the acquisition of examples for training neural networks inevitably introduces non-fully supervised data annotation, a typical example of which is the partial label Nguyen & Caruana (2008); Cour et al. (2011); Zhang et al. (2016; 2017b); Feng & An (2018); Xu et al. (2019); Yao et al. (2020b); Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021): a partial label for an instance is a set of candidate labels among which a fixed but unknown candidate is the true label. Partial label learning (PLL) trains multi-class classifiers from instances that are associated with partial labels. It is therefore apparent that some techniques should be applied to prevent memorization of the false candidate labels when PLL resorts to deep learning; unfortunately, empirical evidence has shown that general-purpose regularization cannot achieve this goal Lv et al. (2021). A large number of deep PLL algorithms have recently emerged that design regularizers Yao et al. (2020a;b); Lyu et al. (2022) or network architectures Wang et al. (2022a) for PLL data. Further, some PLL works have provided theoretical guarantees while making their methods compatible with deep networks Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021); Wu & Sugiyama (2021). We observe that these existing theoretical works have focused on the instance-independent setting, where the generation process of partial labels is homogeneous across training examples. With an explicit formulation of the generation process, the asymptotic consistency Mohri et al. (2018) of the methods, namely, whether the classifier learned from partial labels approximates the Bayes optimal classifier, can be analyzed.
However, the instance-independent process cannot model the real world well, since data labeling is prone to different levels of error in tasks of varying difficulty. Intuitively, instance-dependent (ID) partial labels are quite realistic, as some poor-quality or ambiguous instances are more difficult to label with an exact true label. Although the instance-independent setting has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, relatively little attention has been paid to the practically relevant setting of ID partial labels. Very recently, one solution was proposed Xu et al. (2021) that learns directly from ID partial labels; nevertheless, it is still unclear in theory whether the learned classifier is good. Motivated by the above observations, we set out to investigate ID PLL with the aim of proposing a learning approach that is model-independent, and of theoretically explaining when and why the proposed method works. In this paper, we propose PrOgressive Purification (POP), a theoretically grounded PLL framework for ID partial labels. Specifically, we use the observed partial labels to pretrain a randomly initialized classifier (a deep network) for several epochs, and then we update both the partial labels and the classifier for the remaining epochs. In each epoch, we purify each partial label by moving out the candidate labels for which the current classifier has high confidence of being incorrect, and subsequently we train the classifier with the purified partial labels in the next epoch. As a consequence, the false candidate labels are gradually sifted out and the classification performance of the classifier improves. We justify POP and outline the main contributions below:
• We propose a novel approach named POP for the ID PLL problem, which purifies the partial labels and refines the classifier iteratively. Extensive experiments validate the effectiveness of POP.
• We prove that POP is guaranteed to enlarge the region where the model is reliable at a promising rate, and eventually approximates the Bayes optimal classifier under mild assumptions. The proof does not rely on the instance-independent assumption. To the best of our knowledge, this is the first theoretically guaranteed approach for the general ID PLL problem.
• POP is flexible with respect to losses, so the losses designed for instance-independent PLL can be embedded directly. We empirically show that such embedding allows advanced PLL losses to be applied to the ID problem and achieve state-of-the-art learning performance.

2. RELATED WORK

In this section, we briefly go through the seminal works in PLL, focusing on the theoretical works and discussing the underlying assumptions behind them. Non-deep PLL. There have been substantial non-deep PLL algorithms since the pioneering work of Jin & Ghahramani (2003). From a practical standpoint, they have been studied along two different research routes: the identification-based strategy and the average-based strategy. The identification-based strategy purifies each partial label and extracts the true label heuristically in the training phase, so as to identify the true labels Chen et al. (2014); Zhang et al. (2016); Tang & Zhang (2017); Feng & An (2019); Xu et al. (2019). On the contrary, the average-based strategy treats all candidates equally Hüllermeier & Beringer (2006); Cour et al. (2011); Zhang & Yu (2015). On the theoretical side, Liu & Dietterich (2012) analyzed the learnability of PLL under a small ambiguity degree condition, which ensures that classification errors on any instance have a nonzero probability of being detected. Cour et al. (2011) proposed a consistent approach under the small ambiguity degree condition and a dominance assumption on the data distribution (Proposition 5 in Cour et al. (2011)). Liu & Dietterich (2012) also proposed a Logistic Stick-Breaking Conditional Multinomial Model to portray the mapping between instances and true labels, while assuming the generation of the partial label to be independent of the instance itself. It should be noted that the vast majority of non-deep PLL works have only empirically verified the performance of algorithms on small datasets, without formalizing a statistical model for the PLL problem, let alone providing theoretical analysis of when and why the algorithms work. Deep PLL. In recent years, deep learning has been applied to PLL and has greatly advanced its practical application. Feng et al.
(2020a) proposed learning from instances equipped with a complementary label. A complementary label specifies a class to which the instance does not belong, so it can be considered an inverted PLL problem. However, all of these works made the instance-independent assumption when analyzing statistical consistency. Wu & Sugiyama (2021) proposed a framework that unifies the formalization of multiple generation processes under the instance-independent assumption. Wang et al. (2022a) proposed a data-augmentation-based framework to disambiguate partial labels with contrastive learning. Zhang et al. (2021a) exploited the class activation value to identify the true label in candidate label sets. Very recently, some researchers have begun to notice a more general setting: ID PLL. Learning with ID partial labels is challenging, and instance-independent approaches cannot handle the ID PLL problem directly. Specifically, the theoretical approaches mentioned above mainly utilize the loss correction technique, which corrects the prediction or the loss of the classifier using prior or estimated knowledge of the data generation process, i.e., a set of parameters controlling the probability of generating incorrect candidate labels, often called the transition matrix Patrini et al. (2017). The transition matrix can be characterized by a fixed matrix in the instance-independent setting, since it does not need to include instance-level information, a condition that does not hold in ID PLL. Furthermore, it is ill-posed to estimate the transition matrix by exploiting only partially labeled data, i.e., the transition matrix is unidentifiable Xia et al. (2020). Therefore, new methods are needed to tackle this issue. Xu et al. (2021) introduced a solution that infers the latent label posterior via variational inference methods Blei et al.
(2017); nevertheless, its effectiveness is hardly guaranteed in theory. In this paper, we propose POP for the ID PLL problem and theoretically prove that the learned classifier approximates the Bayes optimal classifier well.

3. PROPOSED METHOD

3.1. PRELIMINARIES

First of all, we briefly introduce some necessary notation. Consider a multi-class classification problem with c classes. Let X = R^q be the q-dimensional instance space and Y = {1, 2, . . . , c} be the label space with c class labels. In supervised learning, let p(x, y) be the underlying "clean" distribution generating (x, y_x) ∈ X × Y, from which n i.i.d. samples {(x_i, y_{x_i})}_{i=1}^n are drawn. In PLL, there is a partial label space S := {S | S ⊆ Y, S ≠ ∅}, and the PLL training set D = {(x_i, S_i) | 1 ≤ i ≤ n} is sampled independently and identically from a "corrupted" density p(x, S) over X × S. It is generally assumed that p(x, y) and p(x, S) have the same marginal distribution of instances p(x). The generation process of partial labels can thus be formalized as p(S|x) = Σ_y p(S|x, y)p(y|x). We define the flipping probability as the probability that, given the instance x and its class label y_x, the label j is included in its partial label: ξ_j(x) = p(j ∈ S | x, y_x), ∀j ∈ Y. The key assumption in PLL is that the latent true label of an instance is always one of its candidate labels, i.e., ξ_{y_x}(x) = 1. We consider using deep models, with the aid of an inverse link function Reid & Williamson (2010) ϕ : R^c → Δ^{c−1}, where Δ^{c−1} denotes the (c−1)-dimensional probability simplex (for example, the softmax), as the learning model in this paper. Then the goal of supervised multi-class classification and PLL is the same: to learn a scoring function f : X → Δ^{c−1} that makes correct predictions on unseen inputs. Typically, the classifier takes the form h(x) = arg max_{j∈Y} f_j(x). The Bayes optimal classifier h⋆ (learned using supervised data) is the one that minimizes the risk w.r.t. the 0-1 loss (or some classification-calibrated loss Bartlett et al. (2006)), i.e., h⋆ = arg min_h R_01 = arg min_h E_{(X,Y)∼p(x,y)} 1_{h(X)≠Y}. For strictly proper losses Gneiting & Raftery (2007), the scoring function f⋆ recovers the class-posterior probabilities, i.e., f⋆(x) = p(y|x), ∀x ∈ X.
When the available supervision is partial labels, the PLL risk under p(x, S) w.r.t. a suitable PLL loss L : R^c × S → R_+ is defined as R̄ = E_{(X,S)∼p(x,S)} [L(f(X), S)]. Minimizing R̄ induces the classifier, and it is desirable that the minimizer approach h⋆. In addition, let o = arg max_{j≠y_x} p(y = j|x) denote the class label with the second highest posterior probability among all labels.
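To make the notation concrete, the following toy sketch (our own illustration, not the authors' code; all function names are hypothetical) computes the softmax link, the induced classifier h(x) = arg max_j f_j(x), and the margin p(y_x|x) − p(o|x) used throughout the analysis:

```python
import numpy as np

# Toy illustration: the softmax link maps scores to the simplex, the classifier
# takes the argmax, and the margin u(x) = p(y_x|x) - p(o|x) compares the
# true-label posterior with the runner-up posterior.

def softmax(logits):
    z = logits - logits.max()        # numerically stabilized softmax: R^c -> simplex
    e = np.exp(z)
    return e / e.sum()

def classify(probs):
    return int(np.argmax(probs))     # h(x) = arg max_j f_j(x)

def margin(posterior, true_label):
    # u(x) = p(y_x|x) - max_{j != y_x} p(y = j|x)
    rest = np.delete(posterior, true_label)
    return float(posterior[true_label] - rest.max())
```

A large margin means the instance sits deep inside the reliable region defined later; a margin near zero marks the ambiguous instances on which purification must be conservative.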

3.2. OVERVIEW

In the latter part of this section, we introduce a concept called the pure level set as the region where the model is reliable. We prove that, given a tiny reliable region, one can progressively enlarge this region and improve the model at a sufficient rate by disambiguating the partial labels. Motivated by these theoretical results, we propose an approach, POP, that works by progressively purifying the partial labels to move out the false candidate labels, so that the learned classifier eventually approximates the Bayes optimal classifier. POP employs the observed partial labels to pre-train a randomly initialized classifier for several epochs, and then updates both the partial labels and the classifier for the remaining epochs. We start with a warm-up period, in which we train the predictive model with a well-defined PLL loss Lv et al. (2020). This allows us to obtain a reasonable predictive model before it starts fitting incorrect labels Zhang et al. (2017a). After the warm-up period, we iteratively purify each partial label by moving out the candidate labels for which the current classifier has high confidence of being incorrect, and subsequently train the classifier with the purified partial labels in the next epoch. After the model has been fully trained, it can make predictions for unseen instances.
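The loop described above can be sketched as follows. This is a minimal, hypothetical rendering of POP rather than the authors' implementation: `pll_loss` is a simple candidate-sum loss standing in for any PLL loss, and the threshold schedule mirrors the verbal description:

```python
import torch

# Hypothetical sketch of the POP loop: warm up on the observed partial labels,
# then alternate between training and purifying the candidate sets.
# candidate_masks is an (n, c) bool tensor; loader yields (x, idx) pairs.

def pll_loss(probs, mask):
    # -log sum_{j in S} f_j(x): a simple stand-in for any PLL loss
    cand = (probs * mask.float()).sum(dim=1).clamp_min(1e-12)
    return -cand.log().mean()

@torch.no_grad()
def purify(model, loader, candidate_masks, threshold):
    """Move out candidate j when f_m(x) - f_j(x) >= threshold; keep sets non-empty."""
    changed = False
    for x, idx in loader:
        probs = torch.softmax(model(x), dim=1)
        old = candidate_masks[idx]
        top = probs.max(dim=1, keepdim=True).values
        drop = (top - probs >= threshold) & old
        drop &= old.sum(dim=1, keepdim=True) > 1     # never empty a candidate set
        new = old & ~drop
        if not torch.equal(new, old):
            candidate_masks[idx] = new
            changed = True
    return changed

def pop_train(model, loader, optimizer, candidate_masks,
              warmup_epochs=10, total_epochs=100, threshold=0.9, step=0.05, end=0.1):
    for epoch in range(total_epochs):
        for x, idx in loader:
            probs = torch.softmax(model(x), dim=1)
            optimizer.zero_grad()
            pll_loss(probs, candidate_masks[idx]).backward()
            optimizer.step()
        if epoch >= warmup_epochs:
            # if no label was moved out this round, tighten the threshold e
            if not purify(model, loader, candidate_masks, threshold) and threshold > end:
                threshold -= step
    return model
```

The true label is never removed in practice because the model's top prediction always survives (f_m − f_m = 0 is below any positive threshold), and the guard keeps every candidate set non-empty.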

3.3. THE POP METHOD

We assume that the hypothesis class H is sufficiently complex (deep networks meet this condition), such that the approximation error equals zero, i.e., arg min_h R̄ = arg min_{h∈H} R̄, and that we have enough training data, i.e., n → ∞. The classifier is then able to at least approximate the Bayes optimal classifier h⋆, and the gap between the learned f(x) and the scoring function f⋆(x) corresponding to h⋆ is determined by the inconsistency between incorrect candidate labels and the output of the Bayes optimal classifier. For two instances x and z that satisfy p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x), i.e., the margin between the posterior of the ground-truth label p(y_z|z) and the second highest posterior probability p(o|z) is larger than that at point x, the indicator 1_{j≠h⋆(z)} equals 1 if the candidate label j ∈ S_z of z is inconsistent with the output of the Bayes optimal classifier h⋆(z). The gap between f_j(x) and f⋆_j(x), i.e., the approximation error of the classifier, can thus be controlled by the inconsistency between the incorrect candidate labels and the output of h⋆ over all such instances z. Therefore, we assume that there exist constants α, ε < 1 such that for f(x),

|f_j(x) − f⋆_j(x)| ≤ α E_{(z,S)∼p(z,S)} [ 1_{j≠h⋆(z)} | p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x), j ∈ S_z ] + ε/6,   (1)

where the scoring function f⋆ corresponding to h⋆ under strictly proper losses Gneiting & Raftery (2007) recovers the class-posterior probabilities, i.e., f⋆_j(x) = p(y = j|x). In addition, for the probability density function d(u) of the cumulative distribution function D(u) = P_{x∼p(x,y)}(u(x) ≤ u), where 0 ≤ u ≤ 1 and the margin u(x) = p(y_x|x) − p(o|x), we assume that there exist constants c_*, c^* > 0 such that c_* < d(u) < c^*. The worst-case density-imbalance ratio is then denoted by l = c^*/c_*.
As the flipping probability of an incorrect label in the instance-dependent generation process is related to its posterior probability, we assume that there exists a constant t > 0 such that

ξ_j(x) ≤ p(y = j|x) · t.   (2)

Motivated by the pure level set in binary classification Zhang et al. (2021b), we define the pure level set in instance-dependent PLL, i.e., the region where the model is reliable:

Definition 1 (Pure (e, f)-level set). A set L(e) := {x | p(y_x|x) − p(o|x) ≥ e} is pure for f if y_x = arg max_j f_j(x) for all x ∈ L(e).

Assume that there exists a set L(e) such that y_x = arg max_j f_j(x) for all x ∈ L(e); then we have

E_{(z,S)∼p(z,S)} [ 1_{j≠h⋆(z)} | p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x), j ∈ S_z ] = 0,   (3)

which means that there is a tiny region L(e) := {x | p(y_x|x) − p(o|x) ≥ e} where the model f is reliable. Let e_new be the new boundary with ε/(6lα) · (p(y_x|x) − e) ≤ e − e_new ≤ ε/(3lα) · (p(y_x|x) − e). As the probability density function d(u) of the margin u(x) = p(y_x|x) − p(o|x) is bounded by c_* < d(u) < c^*, we have the following result for x that satisfies e > p(y_x|x) − p(o|x) ≥ e_new:

E_{(z,S)∼p(z,S)} [ 1_{j≠h⋆(z)} | p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x), j ∈ S_z ] ≤ ε/(3α).   (4)

Combining Eq. (1) and Eq. (4), we obtain |f_j(x) − f⋆_j(x)| ≤ ε/2.   (5)

Denote by m = arg max_j f_j(x) the label with the highest probability under the current prediction. If f_m(x) − f_j(x) ≥ e + ε for some j ≠ m, then p(y_x|x) ≥ p(y = j|x) + e, which means that label j is an incorrect label. Therefore, we can move label j out of the candidate label set to disambiguate the partial label, and then refine the learning model with the less ambiguous partial label. In this way, we move one step forward by trusting the model on the tiny reliable region, as formalized in the following theorem. We start with a warm-up period, as the classifier is able to attain reasonable outputs before fitting label noise Zhang et al. (2017a).
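As a toy illustration of Definition 1 (synthetic posteriors and hypothetical names, not the paper's code), the following check decides whether a level set L(e) is pure for a given model:

```python
import numpy as np

# Definition 1 in code: L(e) collects instances whose margin
# p(y_x|x) - p(o|x) is at least e; the set is pure for f when f's argmax
# equals the true label on every instance in L(e).

def is_pure_level_set(posteriors, true_labels, model_scores, e):
    margins = []
    for p, y in zip(posteriors, true_labels):
        rest = np.delete(p, y)
        margins.append(p[y] - rest.max())      # u(x) = p(y_x|x) - p(o|x)
    margins = np.array(margins)
    in_set = margins >= e                      # membership in L(e)
    correct = model_scores.argmax(axis=1) == np.asarray(true_labels)
    return bool(np.all(correct[in_set]))       # pure iff correct on all of L(e)
```

Shrinking e grows L(e), so a set that is pure at a large threshold can stop being pure once low-margin instances are admitted; this is exactly the tension the e_new schedule above manages.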
Note that the warm-up training is employed to find a tiny reliable region, and the ablation experiments show that the performance of POP does not rely on the warm-up strategy. The predictive model θ can be trained on partially labeled examples by minimizing any PLL loss function. Here we adopt the PRODEN loss Lv et al. (2020) to find a tiny reliable region:

L_PLL = Σ_{i=1}^n Σ_{j=1}^c w_ij ℓ(f_j(x_i), S_i),   (7)

where ℓ is the cross-entropy loss and the weight w_ij is initialized uniformly and then updated simply from the current predictions, slightly putting more weight on more probable labels Lv et al. (2020):

w_ij = f_j(x_i) / Σ_{j'∈S_i} f_{j'}(x_i) if j ∈ S_i, and w_ij = 0 otherwise.   (8)

Theorem 1. Assume that we have enough training data (n → ∞) and there is a pure (e, f)-level set where every x ∈ L(e) is correctly classified by f. For each x and ∀j ∈ S with j ≠ m, if f_m(x) − f_j(x) ≥ e + ε, we move label j out of the candidate label set and update it to S_new. The new classifier f_new is then trained on the updated data with the new distribution p(x, S_new). Let e_new be the minimum boundary such that L(e_new) is pure for f_new. Then we have p(y_x|x) − e_new ≥ (1 + ε/(6αl)) (p(y_x|x) − e).

The detailed proof can be found in Appendix A.1. Theorem 1 shows that the purified region γ = p(y_x|x) − e is enlarged by at least a constant factor under the given purification strategy. After the warm-up period, the classifier can be employed for purification. According to Theorem 1, we can progressively move out incorrect candidate labels with a continuously tightened bound, and subsequently train an effective classifier on the purified labels with the PLL loss Lv et al. (2020), since this loss is model-independent and operates in a mini-batched training manner, updating the model with the labeling-confidence weights.
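Eq. (8) can be rendered directly in code; the snippet below is our own minimal sketch of the PRODEN-style weighted loss, not the reference implementation, and the variable names are ours:

```python
import torch

# Sketch of Eq. (8): weights are the current predictions re-normalized over the
# candidate set, so more probable candidates receive more weight; non-candidates
# get weight zero.

def proden_weights(probs, mask):
    # w_ij = f_j(x_i) / sum_{j' in S_i} f_j'(x_i) for candidates, 0 otherwise
    w = probs * mask.float()
    return w / w.sum(dim=1, keepdim=True).clamp_min(1e-12)

def proden_loss(logits, mask, weights):
    # L_PLL = sum_i sum_j w_ij * cross-entropy(f_j(x_i)), candidates only
    log_probs = torch.log_softmax(logits, dim=1)
    return -(weights * log_probs * mask.float()).sum(dim=1).mean()
```

In a training loop the weights would be recomputed from the detached predictions each epoch, so the labeling confidence and the model co-evolve.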
Specifically, we set a high threshold e_0 and calculate the difference f_m(x_i) − f_j(x_i) for each candidate label. If a label j of x_i satisfies f_m(x_i) − f_j(x_i) ≥ e_0, we move it out of the candidate label set and update the set. We depart from the theory by reusing the same fixed dataset over and over, but the empirical results remain reasonable. If no partial label is purified in a round, we decrease the threshold e and continue the purification to further improve the training of the model. In this way, the incorrect candidate labels are progressively removed from the partial labels round by round, and the performance of the classifier is continuously improved. The algorithmic description of POP is shown in Algorithm 1. We then prove that if there exists a pure level set for an initialized model, our proposed approach purifies incorrect labels and the classifier f finally matches the Bayes optimal classifier h⋆ after sufficiently many rounds R under the instance-dependent PLL setting.

Theorem 2. For any flipping probability ξ_j(x) of each incorrect label, define e_0 = ((1 + t)α + ε/6) / (1 + α), and suppose that for a given function f_0 there exists a level set L(e_0) which is pure for f_0. If one runs the purification in Theorem 1 with enough training data (n → ∞) starting from f_0 with the initialization (1) e_0 ≥ ((1 + t)α + ε/6) / (1 + α), (2) R ≥ (6αl/ε) log((1 − ε) / (1/c − e_0)), and (3) e_end ≥ ε, then P_{x∼D}[h_{f_final}(x) = h⋆(x)] ≥ 1 − c^* ε.

The proof of Theorem 2 is provided in Appendix A.3. According to Theorem 2, the classifier learned under the instance-dependent PLL setting is guaranteed to eventually approximate the Bayes optimal classifier. We adopt five widely used benchmark datasets: MNIST LeCun et al. (1998), Kuzushiji-MNIST Clanuwat et al. (2018), Fashion-MNIST Xiao et al. (2017), CIFAR-10 and CIFAR-100 Krizhevsky & Hinton (2009). These datasets are manually corrupted into ID partially labeled versions.
Specifically, we set the flipping probability of each incorrect label of an instance x by using the confidence predictions of a neural network parameterized by θ and trained with supervised data Xu et al. (2021). The flipping probability is ξ_j(x) = f_j(x; θ) / max_{j'∈Ȳ} f_{j'}(x; θ), where Ȳ is the set of all incorrect labels, i.e., all labels except the true label of x. The average number of candidate labels (avg. #CLs) for each benchmark dataset corrupted by the ID generation process is recorded in Appendix A.4.
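Under the stated protocol, a sketch of the ID candidate-set generation might look as follows; the confidence vector is synthetic, standing in for the trained network f(x; θ), and the function name is our own:

```python
import numpy as np

# Sketch of the ID generation protocol: each incorrect label j enters the
# candidate set independently with probability
# xi_j(x) = f_j(x; theta) / max_{j' incorrect} f_j'(x; theta),
# and the true label is always included.

rng = np.random.default_rng(0)

def id_partial_label(confidences, true_label):
    c = len(confidences)
    incorrect = [j for j in range(c) if j != true_label]
    denom = max(confidences[j] for j in incorrect)
    candidates = {true_label}                  # true label always in S
    for j in incorrect:
        xi_j = confidences[j] / denom          # flipping probability xi_j(x)
        if rng.random() < xi_j:
            candidates.add(j)
    return candidates
```

Note that the most confident incorrect label has flipping probability exactly 1, so the hardest distractor is always in the candidate set, which is what makes this generation process instance-dependent and difficult.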

4. EXPERIMENTS

In addition, five real-world PLL datasets collected from different application domains are used: Lost Cour et al. (2011), Soccer Player Zeng et al. (2013), and Yahoo!News Guillaumin et al. (2010) for automatic face naming from images or videos, MSRCv2 Liu & Dietterich (2012) for object classification, and BirdSong Briggs et al. (2012) for bird song classification. The average number of candidate labels (avg. #CLs) for each real-world PLL dataset is also recorded in Appendix A.4.

4.2. BASELINES

The performance of POP is compared against nine deep PLL approaches:
• PRODEN Lv et al. (2020): a progressive identification approach which approximately minimizes a risk estimator and identifies the true labels in a seamless manner;
• RC Feng et al. (2020b): a risk-consistent approach which employs the loss correction strategy to establish the true risk by only using the partially labeled data;
• CC Feng et al. (2020b): a classifier-consistent approach which also uses the loss correction strategy to learn a classifier that approaches the optimal one;
• VALEN Xu et al. (2021): an ID PLL approach which recovers the latent label distribution via variational inference;
• LW Wen et al. (2021): a risk-consistent approach which proposes a leveraged weighted loss to trade off the losses on candidate labels and non-candidate ones;
• CAVL Zhang et al. (2021a): a progressive identification approach which exploits the class activation value to identify the true label in candidate label sets;
• CLPL Cour et al. (2011): an averaging-based disambiguation approach based on a convex learning formulation;
• PICO Wang et al. (2022b): a data-augmentation-based method which identifies the true label via contrastive learning with learned prototypes for image datasets;
• RCR Wu et al. (2022): a data-augmentation-based method which identifies the true label via consistency regularization with randomly augmented instances for image datasets.
For the benchmark datasets, we use the same data augmentation strategy for the data-augmentation-free methods (VALEN, PRODEN, RC, CC, LW and CAVL) to make fair comparisons with the data-augmentation-based methods (PICO and RCR). However, since data augmentation cannot be employed on the real-world datasets, which contain features extracted from audio and video data, we compare our method only with the data-augmentation-free methods on the real-world datasets.
For all the deep approaches, we use the same training/validation setting, models, and optimizer for fair comparisons. Specifically, a 5-layer LeNet is trained on MNIST, Kuzushiji-MNIST and Fashion-MNIST, Wide-ResNet-28-2 Zagoruyko & Komodakis (2016) is trained on CIFAR-10 and CIFAR-100, and a linear model is trained on the real-world PLL datasets. The hyperparameters are selected to maximize the accuracy on a validation set (10% of the training set). We run 5 trials on the benchmark datasets and the real-world PLL datasets, and record the mean accuracy as well as the standard deviation for all comparing approaches. All the comparing methods are implemented with PyTorch.

4.3. EXPERIMENTAL RESULTS

Table 1 and Table 2 report the classification accuracy of each approach on the benchmark datasets corrupted by the ID generation process and on the real-world PLL datasets, respectively. Since data augmentation cannot be employed on extracted features, we did not compare our method with PICO and RCR on the real-world datasets. The best results are highlighted in bold. We can observe that POP achieves the best performance against the other approaches in most cases, and the performance advantage of POP over the comparing approaches is stable under a varying number of candidate labels. In addition, to analyze the purified region in Theorem 1, we employ the confidence predictions of f(x; θ) (the network in Section 4.1) as the posterior and plot the curve of the estimated purified region in every epoch on Lost in Figure 1. Although the estimated purified region may not be perfectly accurate, the curve shows a trend of continuous increase in the purified region.

4.4. FURTHER ANALYSIS

As the POP framework is flexible with respect to the loss function, we integrate the proposed method with previous methods for instance-independent PLL, including PRODEN, RC, CC, LW, CAVL and CLPL. In this subsection, we empirically show that these methods can be promoted to achieve better performance after being integrated with POP. Table 3 and Table 4 report the classification accuracy of each method for instance-independent PLL and its POP variant on the benchmark datasets corrupted by the ID generating procedure and on the real-world datasets, respectively. We did not use any data augmentation on the benchmark datasets in this part of the experiments. As shown in Table 3 and Table 4, the approaches integrated with POP, including PRODEN+POP, RC+POP, CC+POP, LW+POP, CAVL+POP and CLPL+POP, achieve superior performance against the original methods, which clearly validates the usefulness of the POP framework for improving performance in ID PLL. Figure 3 illustrates how the variants integrated with POP perform under different hyper-parameter configurations on CIFAR-10; similar observations are made on other datasets, and the hyper-parameter sensitivity on other datasets can be found in Appendix A.4. As shown in Figure 3, the performance of the POP variants is relatively stable across a broad range of each hyper-parameter. This property is quite desirable, as the POP framework can achieve robust classification performance.

5. CONCLUSION

In this paper, the problem of instance-dependent partial label learning is studied, and a novel approach named POP is proposed, which trains the classifier with progressive purification of the candidate labels and is theoretically guaranteed to eventually approximate the Bayes optimal classifier for ID PLL. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed method. If PLL methods become very effective, the need for exactly annotated data would be significantly reduced. As a result, the employment of data annotators might decrease, which could have a negative societal impact.

Since c_* < d(u) < c^*, we have the following result for x that satisfies p(y_x|x) − p(o|x) ≥ e_new:

E_{(z,S)∼p(z,S_new)} [ 1_{j≠h⋆(z)} | j ∈ S_z, p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ]
≤ E_{(z,S)∼p(z,S_new)} [ 1_{j≠h⋆(z)} | p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ]
= P_z[ j ≠ h⋆(z), p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ] / P_z[ p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ]
≤ P_z[ j ≠ h⋆(z), p(y_z|z) − p(o|z) ≥ e ] / P_z[ p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ]
  + P_z[ j ≠ h⋆(z), e_new ≤ p(y_z|z) − p(o|z) < e ] / P_z[ p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ].

The first term vanishes because E_{(z,S)∼p(z,S)} [ 1_{h(z)≠y_z} | p(y_z|z) − p(o|z) ≥ e ] = 0 according to Eq. (9), so the bound reduces to

P_z[ e_new ≤ p(y_z|z) − p(o|z) < e ] / P_z[ p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ] ≤ c^*(e − e_new) / (c_*(p(y_x|x) − e)).
Then, the assumption that the gap between f_j(x) and f⋆_j(x) is controlled by the risk at point z implies

|f_j(x) − f⋆_j(x)| ≤ α E_{(z,S)∼p(z,S_new)} [ 1_{h(z)≠y_z} | p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ] + ε/6 ≤ α · ε/(3α) + ε/6 ≤ ε/2.   (12)

Hence, for x s.t. p(y_x|x) − p(o|x) ≥ e_new, according to Eq. (12), h(x) outputs the same label as h⋆, and thus the level set L(e_new) is pure for f. Meanwhile, the choice of e_new ensures that

f_{y_x}(x) − f_{j≠y_x}(x) ≥ (p(y = y_x|x) − ε/2) − (p(y = j|x) + ε/2) = p(y = y_x|x) − p(y = j|x) − ε ≥ p(y = y_x|x) − p(o|x) − ε ≥ e_new − ε ≥ 0,

and

p(y_x|x) − e_new ≥ p(y_x|x) − (e − ε/(6lα) · (p(y_x|x) − e)) = p(y_x|x) − e + ε/(6lα) · (p(y_x|x) − e) = (1 + ε/(6lα)) (p(y_x|x) − e).

This completes the proof of Theorem 1.

A.2 DETAILS OF EQ. (5)

If f_m(x) − f_{j≠m}(x) ≥ e + ε, then according to Eq. (12) we have

p(y_x|x) ≥ p(y = m|x) = p(y = j|x) + (p(y = m|x) − p(y = j|x)) ≥ p(y = j|x) + (f_m(x) − ε/2) − (f_j(x) + ε/2) = p(y = j|x) + (f_m(x) − f_j(x)) − ε ≥ p(y = j|x) + (e + ε) − ε = p(y = j|x) + e.

A.3 PROOFS OF THEOREM 2

To begin with, we prove that there exists at least one level set L(e_0) pure for f_0. Considering x satisfying p(y_x|x) − p(o|x) ≥ e_0, we have

P_z[ j ≠ h⋆(z) | j ∈ S_z, p(y_z|z) − p(o|z) ≥ e_0 ] ≤ p(y_z|z) − e_0 + ξ_j(z).

Due to the assumption |f_j(x) − f⋆_j(x)| ≤ α E_{(z,S)∼p(z,S)} [ 1_{j≠h⋆(z)} | j ∈ S_z, p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ] + ε/6, it suffices to satisfy α(p(y_x|x) − e_0 + ξ) + ε/6 ≤ e_0 to ensure that f_j(x) yields the same prediction as h⋆ when p(y_x|x) − p(o|x) ≥ e_0. Since ξ_j(x) ≤ p(y = j|x) t ≤ p(y_x|x) t, choosing e_0 ≥ (α(1 + t) p(y_x|x) + ε/6) / (1 + α) ensures that the initial f_0 has a pure L(e_0) level set. In the rest of the iterations we ensure that the level set {z : p(y_z|z) − p(o|z) ≥ e} remains pure. We decrease e by a reasonable factor to avoid incurring too many corrupted labels while ensuring enough progress in label purification, i.e., ε/(6lα) · (p(y_x|x) − e) ≤ e − e_new ≤ ε/(3lα) · (p(y_x|x) − e), such that in the level set p(y_x|x) − p(o|x) ≥ e_new we have |f_j(x) − f⋆_j(x)| ≤ ε/2. This condition ensures the correctness of flipping when e ≥ ε. The purified region cannot be improved once e < ε, since there is no guarantee that f_j(x) has a label consistent with h⋆ when p(y_x|x) − p(o|x) < ε. The rest of the proof takes the total round R ≥ (6αl/ε) log((1 − ε)/(1/c − e_0)), which follows from the fact that each round of label flipping improves the purified region by a factor of (1 + ε/(6lα)):

(1 + ε/(6lα))^R (p(y_x|x) − e_0) ≥ p(y_x|x) − ε.







Figure 1: Estimated purified region on Lost.

Since ε/(6lα) · (p(y_x|x) − e) ≤ e − e_new ≤ ε/(3lα) · (p(y_x|x) − e) holds, we can further relax Eq. (10) as follows:

E_{(z,S)∼p(z,S_new)} [ 1_{j≠h⋆(z)} | j ∈ S_z, p(y_z|z) − p(o|z) ≥ p(y_x|x) − p(o|x) ] ≤ c^*(e − e_new) / (c_*(p(y_x|x) − e)) ≤ (c^*/c_*) · ε/(3lα) = ε/(3α).

Figure 3: Hyper-parameter sensitivity on Lost.

By choosing e_0 ≥ (α(1 + t) p(y_x|x) + ε/6) / (1 + α), one can ensure that the initial f_0 has a pure L(e_0) level set.

|f_j(x) − f⋆_j(x)| ≤ ε/2. To get the largest purified region, we can set e_end = ε. Since the probability density function d(u) of the margin u(x) = p(y_x|x) − p(o|x) is bounded by c_* ≤ d(u) ≤ c^*, we have:

P_{x∼D}[h_{f_final}(x) ≠ h⋆(x)] ≤ P[p(y_x|x) − p(o|x) < e_end] = P_{x∼D}[p(y_x|x) − p(o|x) < ε] ≤ c^* ε.   (16)

Then P_{x∼D}[h_{f_final}(x) = h⋆(x)] = 1 − P_{x∼D}[h_{f_final}(x) ≠ h⋆(x)] ≥ 1 − c^* ε.

We adopt five widely used benchmark datasets: MNIST LeCun et al. (1998), Kuzushiji-MNIST Clanuwat et al. (2018), Fashion-MNIST Xiao et al. (2017), CIFAR-10 and CIFAR-100 Krizhevsky & Hinton (2009). In addition, five real-world PLL datasets collected from several application domains are adopted: Lost Cour et al. (2011), Soccer Player Zeng et al. (2013) and Yahoo!News Guillaumin et al. (2010) for automatic face naming from images or videos, MSRCv2 Liu & Dietterich (2012) for object classification, and BirdSong Briggs et al. (2012) for bird song classification.

Yao et al. (2020a;b) and Lv et al. (2020) proposed learning objectives that are compatible with stochastic optimization and thus can be implemented by deep networks. Soon after, Feng et al. (2020b) formalized the first generation process for PLL. They assumed that, given the latent true label, the probability of each incorrect label being added into the candidate label set is uniform and independent of the instance. Thanks to the uniform generation process, they proposed two provably consistent algorithms. Wen et al. (2021) extended the uniform case to the class-dependent case, but still kept the instance-independent assumption.

Algorithm 1 POP Algorithm
Input: The PLL training set D = {(x_1, S_1), ..., (x_n, S_n)}, initial threshold e_0, end threshold e_end, total round R, step-size e_s;
1: Initialize the predictive model θ by warm-up training with the PLL loss in Eq. (7), and set threshold e = e_0;

Classification accuracy (mean±std) of each comparing approach on benchmark datasets corrupted by the ID generation process.

Classification accuracy (mean±std) of each comparing approach on the real-world datasets.

Classification accuracy (mean±std) of each instance-independent PLL method and its POP variant on benchmark datasets corrupted by the ID generation process.

Classification accuracy (mean±std) of each instance-independent PLL method and its POP variant on the real-world datasets.

Characteristic of the benchmark datasets corrupted by the ID generation process.

Characteristic of the real-world PLL datasets.

A APPENDIX

A.1 PROOFS OF THEOREM 1

Assume that there exists a set L(e) such that y_x = arg max_j f_j(x) and p(y_x|x) − p(o|x) ≥ e for all x ∈ L(e); then we have

E_{(z,S)∼p(z,S)} [ 1_{h(z)≠y_z} | p(y_z|z) − p(o|z) ≥ e ] = 0.   (9)

The average number of candidate labels (avg. #CLs) for each benchmark dataset corrupted by the ID generation process is recorded in Table 5, and the average number of candidate labels (avg. #CLs) for each real-world PLL dataset is also recorded.

