MULTI-LEVEL GENERATIVE MODELS FOR PARTIAL LABEL LEARNING WITH NON-RANDOM LABEL NOISE

Anonymous

Abstract

Partial label (PL) learning tackles the problem where each training instance is associated with a set of candidate labels that includes both the true label and irrelevant noise labels. In this paper, we propose a novel multi-level generative model for partial label learning (MGPLL), which tackles the PL problem by learning both a label-level adversarial generator and a feature-level adversarial generator under a bi-directional mapping framework between label vectors and data samples. MGPLL uses a conditional noise label generation network to model non-random noise labels and perform label denoising, and uses a multi-class predictor to map training instances to the denoised label vectors, while a conditional data feature generator forms an inverse mapping from the denoised label vectors to data samples. Both the noise label generator and the data feature generator are learned in an adversarial manner to match the observed candidate labels and data features, respectively. We conduct extensive experiments on both synthesized and real-world partial label datasets, where the proposed approach demonstrates state-of-the-art partial label learning performance.

1. INTRODUCTION

Partial label (PL) learning is a weakly supervised learning problem with ambiguous labels (Hüllermeier & Beringer, 2006; Zeng et al., 2013), where each training instance is assigned a set of candidate labels, among which only one is the true label. Since it is typically difficult and costly to annotate instances precisely, the task of partial label learning naturally arises in many real-world learning scenarios, including automatic face naming (Hüllermeier & Beringer, 2006; Zeng et al., 2013) and web mining (Luo & Orabona, 2010). As the true label information is hidden in the candidate label set, the main challenge of PL learning lies in identifying the ground-truth labels from the candidate noise labels so as to learn a good prediction model. Some previous works have made efforts to adjust existing learning techniques to directly handle candidate label sets and perform label disambiguation implicitly (Gong et al., 2018; Nguyen & Caruana, 2008; Wu & Zhang, 2018). These methods are good at exploiting the strengths of standard classification techniques and have produced promising results on PL learning. Another line of work pursues explicit label disambiguation by trying to identify the true labels from the noise labels in the candidate label sets. For example, the work in (Feng & An, 2018) estimates the latent label distribution with iterative label propagations and then induces a prediction model by fitting the learned latent label distribution. Another work in (Lei & An, 2019) exploits a self-training strategy to induce label confidence values and learns classifiers in an alternating manner by minimizing the squared loss between the model predictions and the learned label confidence matrix. However, these methods suffer from cumulative errors induced in either the separate label distribution estimation steps or the error-prone label confidence estimation process.
Moreover, all these methods share a common drawback: they implicitly assume random label noise, that is, that the noise labels are randomly distributed in the label space for each instance. However, in real-world problems the appearance of noise labels usually depends on the target true label. For example, when the object contained in an image is a "computer", a noise label "TV" could be added due to a recognition mistake or image ambiguity, but the object is less likely to be annotated as "lamp" or "curtain", while the probability of getting noise labels such as "tree" or "bike" is even smaller. In this paper, we propose a novel multi-level adversarial generative model, MGPLL, for partial label learning. The MGPLL model comprises conditional generators at both the label level and the feature level. The noise label generator directly models non-random appearances of noise labels conditioned on the true label by adversarially matching the candidate label observations, while the data feature generator models the data samples conditioned on the corresponding true labels by adversarially matching the observed data sample distribution. Moreover, a prediction network is incorporated to predict the denoised true label of each instance from its input features, forming, together with the data feature generator, inverse mappings between labels and features. Learning the overall model corresponds to a minimax adversarial game, which simultaneously identifies the true labels of the training instances from both the observed data features and the observed candidate labels, while inducing an accurate prediction network that maps input feature vectors to (denoised) true label vectors. To the best of our knowledge, this is the first work that exploits multi-level generative models to capture non-random noise labels for partial label learning. We conduct extensive experiments on real-world and synthesized PL datasets, and the empirical results show that the proposed MGPLL achieves state-of-the-art PL performance.
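The interplay of the three components above can be illustrated with a toy sketch. The following code is only a schematic, with hypothetical dimensions and untrained random linear maps standing in for the actual networks; the adversarial training itself is omitted. It shows how a noise label generator conditioned on the true label produces a candidate label set, how the feature generator maps a label vector back to feature space, and how the predictor maps features to a denoised label distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 5 classes, 8 input features.
NUM_CLASSES, FEAT_DIM = 5, 8

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each component is reduced to a single random linear layer; in the actual
# model these are neural networks learned through the minimax game.
W_pred = rng.normal(size=(FEAT_DIM, NUM_CLASSES))      # predictor: features -> label distribution
W_noise = rng.normal(size=(NUM_CLASSES, NUM_CLASSES))  # noise generator: true label -> noise-label probs
W_feat = rng.normal(size=(NUM_CLASSES, FEAT_DIM))      # feature generator: label vector -> features

def predict(x):
    """Map input features to a denoised true-label distribution."""
    return softmax(x @ W_pred)

def generate_noise_labels(y_onehot):
    """Probability of each label appearing as noise, conditioned on the true label."""
    p = sigmoid(y_onehot @ W_noise)
    return p * (1.0 - y_onehot)  # the true label itself cannot be a noise label

def generate_features(y_onehot):
    """Inverse mapping: synthesize a feature vector from a (denoised) label vector."""
    return y_onehot @ W_feat

# A candidate label set is the union of the true label and sampled noise labels.
y = np.eye(NUM_CLASSES)[2][None, :]  # true label = class 2
noise = (rng.random((1, NUM_CLASSES)) < generate_noise_labels(y)).astype(float)
candidate_set = np.clip(y + noise, 0.0, 1.0)

x_fake = generate_features(y)
p_pred = predict(x_fake)
print(candidate_set.shape, x_fake.shape, p_pred.shape)  # (1, 5) (1, 8) (1, 5)
```

In the full model, the noise generator and feature generator would each be paired with a discriminator so that the generated candidate label sets and generated features match the observed training distributions; this sketch only traces the data flow between the three mappings.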

2. RELATED WORK

Partial label (PL) learning is a popular weakly supervised learning framework (Zhou, 2018) in many real-world domains, where the true label of each training instance is hidden within a given candidate label set. The challenge of PL learning lies in disambiguating the true labels from the candidate label sets to induce good prediction models. One strategy towards PL learning is to adjust standard learning techniques and implicitly disambiguate the noise candidate labels through the statistical prediction pattern of the data. For example, with maximum likelihood techniques, the likelihood of each PL training sample can be defined over its candidate label set instead of its hidden ground-truth label (Jin & Ghahramani, 2003; Liu & Dietterich, 2012). For the k-nearest neighbor technique, the candidate labels from neighboring instances can be aggregated to induce the final prediction on a test instance (Hüllermeier & Beringer, 2006; Gong et al., 2018; Zhang & Yu, 2015). For the maximum margin technique, the classification margin can be defined over the predictive difference between the candidate labels and the non-candidate labels for each PL training sample (Nguyen & Caruana, 2008; Yu & Zhang, 2016). For the boosting technique, the weight of each PL training instance and the confidence value of each candidate label being the ground-truth label can be refined in each boosting round (Tang & Zhang, 2017). For the error-correcting output codes (ECOC) technique, multiple binary classifiers corresponding to the ECOC coding matrix are built based on the transformed binary training sets (Zhang et al., 2017). For binary decomposition techniques, a one-vs-one decomposition strategy has been adopted to address PL learning by considering the relevance of each label pair (Wu & Zhang, 2018).
Recently, there has been increasing attention to designing explicit feature-aware disambiguation strategies (Feng & An, 2018; Xu et al., 2019a; Feng & An, 2019; Wang et al., 2019a). The authors of (Feng & An, 2018) attempt to refine the latent label distribution using iterative label propagations and then induce a predictive model based on the learned latent label distribution. However, the latent label distribution estimation in this approach can be impaired by the cumulative error induced in the propagation process, which can consequently degrade the PL learning performance, especially when the noise labels dominate. Another work in (Lei & An, 2019) tries to refine the label confidence values with a self-training strategy and induce the prediction model over the refined label confidence scores via alternating optimization. However, its estimation errors on the confidence values can negatively impact the coupled partial label classifier due to the nature of alternating optimization. A recent work in (Yao et al., 2020) proposes to address the PL learning problem by enhancing the representation ability via deep features and improving the discrimination ability through margin maximization between the candidate labels and the non-candidate labels. Another recent work in (Yan & Guo, 2020) proposes to dynamically correct label confidence values with a batch-wise label correction strategy and induce a robust predictive model based on MixUp-enhanced data. Although these works demonstrate good empirical performance, they share the common drawback of assuming randomly distributed noise labels by default, which does not hold in many real-world learning scenarios. This paper presents the first work that explicitly models non-random noise labels for partial label learning.

