PROGRESSIVE PURIFICATION FOR INSTANCE-DEPENDENT PARTIAL LABEL LEARNING

Abstract

Partial label learning (PLL) aims to train multi-class classifiers from instances with partial labels (PLs), where a PL for an instance is a set of candidate labels among which a fixed but unknown candidate is the true label. In the last few years, the instance-independent generation process of PLs has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, whereas relatively little attention has been paid to the practical setting of instance-dependent PLs, in which the PL depends not only on the true label but also on the instance itself. In this paper, we propose a theoretically grounded and practically effective approach called PrOgressive Purification (POP) for instance-dependent PLL: in each epoch, POP updates the learning model while purifying each PL for the next epoch of model training by progressively moving out false candidate labels. Theoretically, we prove that POP enlarges, at an appropriate rate, the region where the model is reliable, and eventually approximates the Bayes optimal classifier under mild assumptions; technically, POP is flexible with respect to losses and compatible with deep networks, so that previous advanced PLL losses can be embedded in it, often with significantly improved performance.



We observe that these existing theoretical works have focused on the instance-independent setting, where the generation process of partial labels is homogeneous across training examples. With an explicit formulation of the generation process, the asymptotic consistency Mohri et al. (2018) of a method, namely, whether the classifier learned from partial labels approximates the Bayes optimal classifier, can be analyzed. However, the instance-independent process cannot model the real world well, since data labeling is prone to different levels of error in tasks of varying difficulty. Intuitively, instance-dependent (ID) partial labels are quite realistic, as some poor-quality or ambiguous instances are more difficult to label with an exact true label. Although the instance-independent setting has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, relatively little attention has been paid to the practically relevant setting of ID partial labels. Very recently, one solution was proposed Xu et al. (2021) that learns directly from ID partial labels; nevertheless, it is still unclear in theory whether the learned classifier is good.

Motivated by the above observations, we set out to investigate ID PLL with the aim of proposing a learning approach that is model-independent, and to explain theoretically when and why the proposed method works. In this paper, we propose PrOgressive Purification (POP), a theoretically grounded PLL framework for ID partial labels. Specifically, we use the observed partial labels to pretrain a randomly initialized classifier (a deep network) for several epochs, and then we update both the partial labels and the classifier for the remaining epochs.
In each epoch, we purify each partial label by moving out the candidate labels that the current classifier is highly confident are incorrect, and we then train the classifier with the purified partial labels in the next epoch. As a consequence, the false candidate labels are gradually sifted out and the classification performance of the classifier improves. We justify POP and outline the main contributions below:

• We propose a novel approach named POP for the ID PLL problem, which purifies the partial labels and refines the classifier iteratively. Extensive experiments validate the effectiveness of POP.

• We prove that POP is guaranteed to enlarge the region where the model is reliable at a promising rate, and eventually approximates the Bayes optimal classifier under mild assumptions. The proof does not rely on the instance-independent assumption. To the best of our knowledge, this is the first theoretically guaranteed approach for the general ID PLL problem.

• POP is flexible with respect to losses, so that losses designed for instance-independent PLL can be embedded directly. We empirically show that such embedding allows advanced PLL losses to be applied to the ID problem and achieves state-of-the-art learning performance.
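The purification step described above can be sketched as follows. This is a minimal illustration, not the paper's exact design: the relative-threshold criterion for "high confidence of being incorrect", the `eps` parameter, and the function name are all assumptions made for the sketch.

```python
import numpy as np

def purify_candidates(S, P, eps=0.05):
    """One purification step: given the current candidate-label mask S
    (an (n, k) boolean array) and the classifier's class posteriors P
    (an (n, k) array), drop any candidate whose posterior falls below
    eps times the largest posterior among that instance's candidates.
    The top candidate is always kept, so no candidate set becomes empty.
    The relative threshold is an illustrative stand-in for "high
    confidence of being incorrect"."""
    masked = np.where(S, P, -np.inf)           # posteriors restricted to candidates
    best = masked.max(axis=1, keepdims=True)   # most plausible candidate per instance
    keep = S & (P >= eps * best)               # keep sufficiently plausible candidates
    keep[np.arange(len(S)), masked.argmax(axis=1)] = True  # never empty the set
    return keep
```

In the full loop, the classifier would first be pretrained on the observed partial labels for several epochs; afterwards, each epoch alternates one training pass on the current mask with one call to a purification step like this one, so the mask shrinks monotonically toward the true labels.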

2. RELATED WORK

In this section, we briefly review seminal works in PLL, focusing on theoretical results and discussing the underlying assumptions behind them.



Deep neural networks owe much of their popularity to their ability to (nearly) perfectly memorize large numbers of training examples, and memorization is known to decrease the generalization error Feldman (2020). On the other hand, scaling up the acquisition of examples for training neural networks inevitably introduces non-fully supervised data annotation; a typical example is the partial label Nguyen & Caruana (2008); Cour et al. (2011); Zhang et al. (2016; 2017b); Feng & An (2018); Xu et al. (2019); Yao et al. (2020b); Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021), a set of candidate labels among which a fixed but unknown candidate is the true label. Partial label learning (PLL) trains multi-class classifiers from instances that are associated with partial labels. It is therefore apparent that when PLL resorts to deep learning, some technique must be applied to prevent memorizing the false candidate labels; unfortunately, empirical evidence has shown that general-purpose regularization cannot achieve this goal Lv et al. (2021). A large number of deep PLL algorithms have recently emerged that aim to design regularizers Yao et al. (2020a;b); Lyu et al. (2022) or network architectures Wang et al. (2022a) for PLL data. Further, some PLL works have provided theoretical guarantees while making their methods compatible with deep networks Lv et al. (2020); Feng et al. (2020b); Wen et al. (2021); Wu & Sugiyama

There have been substantial non-deep PLL algorithms since the pioneering work of Jin & Ghahramani (2003). From a practical standpoint, they have been studied along two different research routes: the identification-based strategy and the average-based strategy. The identification-based strategy purifies each partial label and extracts the true label heuristically in the training phase Chen et al. (2014); Zhang et al. (2016); Tang & Zhang (2017); Feng & An (2019); Xu et al. (2019). On the contrary, the average-based strategy treats all candidates equally Hüllermeier & Beringer (2006); Cour et al. (2011); Zhang & Yu (2015). On the theoretical side, Liu & Dietterich (2012) analyzed the learnability of PLL under a small ambiguity degree condition, which ensures that classification errors on any instance have a nonzero probability of being detected. Cour et al. (2011) proposed a consistent approach under the small ambiguity degree condition and a dominance assumption on the data distribution (Proposition 5 in Cour et al. (2011)). Liu & Dietterich (2012) also proposed a Logistic Stick-Breaking Conditional Multinomial Model to portray the mapping between instances and true labels, while assuming that the generation of the partial label is independent of the instance itself. It should be noted that the vast majority of non-deep PLL works have only empirically verified the performance of their algorithms on small data sets, without formalizing a statistical model of the PLL problem, let alone theoretically analyzing when and why the algorithms work.

Deep PLL In recent years, deep learning has been applied to PLL and has greatly advanced its practical application. Yao et al. (2020a;b) and Lv et al. (2020) proposed learning objectives that are compatible with stochastic optimization and thus can be implemented by deep networks. Soon afterwards, Feng et al. (2020b) formalized the first generation process for PLL. They assumed that, given the latent true label, the probability of every incorrect label being added into the candidate label set is uniform and independent of the instance. Thanks to the uniform generation process, they proposed two provably consistent algorithms. Wen et al. (2021) extended the uniform case to the class-dependent case, but still kept the instance-independent as-

