DECOMPOSITIONAL GENERATION PROCESS FOR INSTANCE-DEPENDENT PARTIAL LABEL LEARNING

Abstract

Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected because the generation process of the candidate labels is always instance-dependent, and therefore deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts: the correct label emerges first in the mind of the annotator, and then, due to the uncertainty of labeling, the incorrect labels related to the feature are selected along with the correct label as candidate labels. Motivated by this consideration, we propose a novel PLL method that performs maximum a posteriori (MAP) estimation based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Extensive experiments on manually corrupted benchmark datasets and real-world datasets validate the effectiveness of the proposed method.

1. INTRODUCTION

Partial label learning (PLL) aims to deal with the problem where each instance is provided with a set of candidate labels, only one of which is the correct label. The problem of learning from partial label examples naturally arises in a number of real-world scenarios such as web data mining Luo & Orabona (2010), multimedia content analysis Zeng et al. (2013); Chen et al. (2017), and ecoinformatics Liu & Dietterich (2012); Tang & Zhang (2017). A number of methods have been proposed to improve the practical performance of PLL. Identification-based PLL approaches Jin & Ghahramani (2002); Nguyen & Caruana (2008); Liu & Dietterich (2012); Chen et al. (2014); Yu & Zhang (2016) regard the correct label as a latent variable and try to identify it. Average-based approaches Hüllermeier & Beringer (2006); Cour et al. (2011); Zhang & Yu (2015) treat all the candidate labels equally and average the modeling outputs as the prediction. In addition, risk-consistent methods Feng et al. (2020); Wen et al. (2021) and classifier-consistent methods Lv et al. (2020); Feng et al. (2020) are proposed for deep models. Furthermore, aimed at deep models, Wang et al. (2022) investigate contrastive representation learning, Zhang et al. (2021) adapt the class activation map, and Wu et al. (2022) revisit consistency regularization in PLL. It is challenging to avoid overfitting on candidate labels, especially when the candidate labels depend on instances. The previous methods, however, assume that the candidate labels are instance-independent. Unfortunately, it often tends to be the case that the incorrect labels related to the feature are more likely to be picked as the candidate label set for each instance. Recent work Xu et al. (2021) has also shown that instance-dependent PLL imposes additional challenges but is more realistic in practice than the instance-independent case.
In this paper, we focus on instance-dependent PLL by considering the essential generation process of candidate labels. To begin with, let us rethink carefully how candidate labels arise in most manual annotation scenarios. When one annotates an instance, although the correct label has already emerged first in the mind of the annotator, the incorrect labels related to the feature of the instance confuse the annotator, so that the correct label and some incorrect labels are packed together as the candidate labels. Therefore, the generation process of the candidate labels in instance-dependent PLL can be decomposed into two stages, i.e., the generation of the correct label of the instance and the generation of the incorrect labels related to the instance, which can be described by a Categorical distribution and a Bernoulli distribution, respectively. Motivated by the above consideration, we propose a novel PLL method named IDGP, i.e., Instance-dependent partial label learning via Decompositional Generation Process. Before performing IDGP, the distributions of the correct label and the incorrect labels given a training example are modeled explicitly by decoupled probability distributions, namely a Categorical distribution and a Bernoulli distribution. Then we perform maximum a posteriori (MAP) estimation on the PLL training dataset to deduce a risk minimizer. To optimize the risk minimizer, the Dirichlet distribution and the Beta distribution are leveraged, due to conjugacy, to model the conditional priors and to estimate the parameters of the Categorical and Bernoulli distributions. Finally, we refine the prior information by iteratively updating the parameters of the corresponding conjugate distributions to improve the performance of the predictive model in each epoch. Our contributions can be summarized as follows:

• We for the first time explicitly model the generation process of candidate labels in instance-dependent PLL. The entire generation process is decomposed into the generation of the correct label of the instance and the generation of the incorrect labels, which can be described by a Categorical distribution and a Bernoulli distribution, respectively.

• We optimize the Categorical and Bernoulli distribution models via the MAP technique, where the corresponding conjugate distributions, i.e., the Dirichlet distribution and the Beta distribution, are induced.

• We derive an estimation error bound for our approach, which demonstrates that the empirical risk minimizer approximately converges to the optimal risk minimizer as the number of training data grows to infinity.

2. RELATED WORK

In this section, we briefly review the literature on PLL from two aspects, i.e., traditional PLL and deep PLL. The former absorbs many classical machine learning techniques and usually utilizes linear models, while the latter embraces deep learning and builds upon deep neural networks. We focus on the underlying assumptions about the generation of candidate labels behind some of these methods. For example, a mixture model is adopted by Liu & Dietterich (2012) to depict the generation process, but the candidate labels are assumed to be instance-independent. Deep PLL has recently been studied and has advanced the practical application of PLL, as its approaches are no longer restricted to linear models and low-efficiency optimization. Yao et al. (2020a) pioneer the use of deep convolutional neural networks and employ an uncertainty regularization term and a temporal-ensembling term to train the deep model. Lv et al. (2020) propose a progressive identification method that makes PLL compatible with arbitrary models and optimizers while, for the first time, performing impressively on image classification benchmarks. Yao et al. (2020b) introduce a network-cooperation mechanism in deep PLL which trains two networks simultaneously to reduce their respective disambiguation errors. Feng et al. (2020) employ an importance-reweighting strategy to derive a risk-consistent estimator and a transition matrix for PLL to derive a classifier-consistent estimator. Wen et al. (2021) introduce a leverage parameter that weights losses on partial labels and non-partial labels. Xu et al. (2021) apply variational label enhancement Xu et al. (2020; 2019a) to recover latent label distributions for instance-dependent PLL.

3. PROPOSED METHOD

First of all, we introduce some necessary notations for our approach. In PLL, the labeling information of every instance in the training dataset is corrupted from a correct label into a candidate label set, which contains the correct label and incorrect candidate labels but is not the complete label set. Let $\mathcal{X} \subseteq \mathbb{R}^q$ be the $q$-dimensional feature space of the instances and $\mathcal{Y} = \{1, 2, \ldots, c\}$ be the label space with $c$ class labels. A PLL training dataset can then be formulated as $D = \{(x_i, S_i) \mid 1 \le i \le n\}$, where $x_i \in \mathcal{X}$ denotes the $i$-th $q$-dimensional feature vector and $S_i \in \mathcal{C} = 2^{\mathcal{Y}} \setminus \{\emptyset, \mathcal{Y}\}$ is the candidate label set annotated for $x_i$. $S_i$ consists of the correct label $y_i \in \mathcal{Y}$ and the incorrect candidate label set $S_i^{y_i} = S_i \setminus \{y_i\}$. We aim to learn a multi-class classifier $f: \mathcal{X} \to \mathcal{Y}$ from $D$.
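As a minimal sketch of this data setup (with made-up toy values, not one of the paper's datasets), a PLL training set can be represented by feature vectors together with binary candidate-set vectors:

```python
import numpy as np

# Toy PLL dataset with c = 4 classes: each row of S is a binary candidate-set
# vector containing the (hidden) correct label plus some incorrect labels.
# All values here are illustrative.
c = 4
X = np.array([[0.2, 1.3], [0.5, -0.7], [1.1, 0.4]])   # n = 3 instances, q = 2
y = np.array([0, 2, 1])                                # hidden correct labels
S = np.array([[1, 1, 0, 0],                            # candidates {0, 1}
              [0, 1, 1, 0],                            # candidates {1, 2}
              [1, 1, 0, 1]])                           # candidates {0, 1, 3}

# PLL assumptions: the correct label is always inside the candidate set,
# and the candidate set is neither empty nor the full label space Y.
assert all(S[i, y[i]] == 1 for i in range(len(y)))
assert all(0 < S[i].sum() < c for i in range(len(y)))
```

Note that the learner only observes `X` and `S`; `y` is shown here solely to make the candidate-set constraints explicit.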

3.1. OVERVIEW

To deal with instance-dependent PLL, we introduce an explicit generation model of instance-dependent candidate labels, which decouples the distribution of the candidate labels into those of the correct and incorrect candidate labels. The Categorical distribution and the Bernoulli distribution are then leveraged to depict them respectively, with the Dirichlet distribution and the Beta distribution as their prior distributions. Based on this probabilistic generation model, we first deduce the log-likelihood loss function for optimization and then move to the MAP optimization problem, taking the Dirichlet and Beta priors into consideration. Finally, we propose the algorithm IDGP, which keeps and refines the prior information.

3.2. GENERATION MODEL OF CANDIDATE LABELS

Focusing on instance-dependent partial label learning, we propose a novel generation model of candidate labels, which explicitly models the generation process and decouples the distribution of the candidate labels into the distributions of the correct label and of the incorrect candidate labels. To form a candidate label set, we suppose the correct label is first selected according to a posterior distribution dependent on the instance. Then the incorrect candidate labels related to the feature of the instance emerge to disturb the annotator and are sampled from another posterior distribution dependent on the instance. The concrete generation model is as follows. Given an instance $x_i$, its candidate label set $S_i$, consisting of the correct label $y_i$ and the incorrect candidate labels $S_i^{y_i}$, is drawn from a probability distribution with density

$$p(S_i \mid x_i) = \sum_{j \in S_i} p(y_i = j \mid x_i)\, p(S_i^j \mid x_i), \quad (1)$$

which shows that our generation model is entirely dependent on the instance. The model builds upon the PLL assumption that the correct label is always in the candidate set, which allows directly splitting $S_i$ into $y_i$ and $S_i^{y_i}$. For a given candidate label set $S_i$, each label $j \in S_i$ has probability $p(y_i = j \mid x_i)$ of being sampled by the annotator as the correct label in the first stage of generation, and then the incorrect candidate labels $S_i^j$ are sampled with probability $p(S_i^j \mid x_i)$ in the second stage to form the entire candidate label set $S_i$. The generation of the correct label and of the incorrect labels are both conditioned on the instance $x_i$. In this way, we decompose the generation process of the instance-dependent candidate labels. After this decomposition, probability distributions are required to depict $p(y_i \mid x_i)$ and $p(S_i^{y_i} \mid x_i)$.
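As an illustrative sketch (not the paper's code, and with made-up instance-dependent probabilities), the two-stage generation can be simulated directly: first draw the correct label from a Categorical distribution, then let each incorrect label join the candidate set via an independent Bernoulli draw:

```python
import numpy as np

def sample_candidate_set(theta, z, rng):
    """Two-stage candidate-set generation (illustrative sketch).

    Stage 1: the correct label y is drawn from Categorical(theta).
    Stage 2: every other label j joins the candidate set independently
    with probability z[j] (Bernoulli), modelling annotator confusion.
    """
    c = len(theta)
    y = rng.choice(c, p=theta)        # correct label
    s = rng.random(c) < z             # incorrect-label indicators
    s[y] = True                       # the correct label is always included
    return y, np.flatnonzero(s)

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.1, 0.1, 0.1])   # assumed instance-dependent p(y|x)
z = np.array([0.0, 0.8, 0.05, 0.05])     # label 1 is feature-related, so
                                         # it is often picked as a candidate
y, S = sample_candidate_set(theta, z, rng)
assert y in S
```

Here the high `z[1]` plays the role of a feature-related incorrect label that frequently confuses the annotator.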
We assume that the correct label $y_i$ of each instance $x_i$ is drawn from a Categorical distribution with parameter $\theta_i = [\theta_i^1, \theta_i^2, \ldots, \theta_i^c] \in [0,1]^c$, a $c$-dimensional vector with the constraint $\sum_{j=1}^c \theta_i^j = 1$, i.e.,

$$p(l_i \mid x_i, \theta_i) = \mathrm{Cat}(l_i \mid \theta_i) = \prod_{j=1}^c (\theta_i^j)^{l_i^j}, \quad (2)$$

where the vector $l_i = [l_i^1, l_i^2, \ldots, l_i^c] \in \{0,1\}^c$ indicates whether the $j$-th label is the correct label $y_i$, i.e., $l_i^j = 1$ if $j = y_i$ and $l_i^j = 0$ otherwise. Besides, the incorrect candidate label set $S_i^{y_i}$ can also be denoted by a logical label vector $s_i^{y_i} = [s_i^1, s_i^2, \ldots, s_i^c] \in \{0,1\}^c$ indicating whether the $j$-th label is an incorrect candidate label, i.e., $s_i^j = 1$ if $j \in S_i^{y_i}$ and $s_i^j = 0$ if $j \notin S_i^{y_i}$. To describe $p(S_i^{y_i} \mid x_i)$ with a probabilistic model, we further decouple the incorrect candidate labels by assuming that the variables $s_i^j$ are mutually independent, i.e., $p(s_i^{y_i} \mid x_i) = \prod_{j \in S_i^{y_i}} p(s_i^j \mid x_i) \prod_{j \notin S_i^{y_i}} \big(1 - p(s_i^j \mid x_i)\big)$, and that the incorrect candidate label set $S_i^{y_i}$ is drawn from a multivariate Bernoulli distribution with parameter $z_i = [z_i^1, z_i^2, \ldots, z_i^c] \in [0,1]^c$, i.e.,

$$p(s_i^{y_i} \mid x_i, z_i) = \prod_{j=1}^c \mathrm{Ber}(s_i^j \mid z_i^j) = \prod_{j=1}^c (z_i^j)^{s_i^j} (1 - z_i^j)^{1 - s_i^j}. \quad (3)$$

For the later estimation of $\theta_i$ in MAP, we introduce the Dirichlet distribution, the conjugate distribution of the Categorical distribution, as its conditional prior, i.e., $p(\theta_i \mid x_i) = \mathrm{Dir}(\theta_i \mid \lambda_i)$, where $\lambda_i = [\lambda_i^1, \lambda_i^2, \ldots, \lambda_i^c]^\top$ ($\lambda_i^j > 0$) is a $c$-dimensional vector given by the output of our main branch inference model $f$ parameterized by $\Theta$ for an instance $x_i$, i.e., $\lambda_i = a \cdot \exp\big(f(x_i; \Theta)/\gamma\big) + b$, where $a$, $b$ and $\gamma$ ($a \ge 1$, $b \ge 0$, $\gamma > 0$) are used to resolve the scale ambiguity.
Note that the main branch inference model $f$ is leveraged as the final predictive model to accomplish the PLL target $\mathcal{X} \to \mathcal{Y}$ by predicting $y_i = \arg\max_j \hat{\theta}_i^j$, where $\hat{\theta}_i^j$ will be estimated from $\lambda_i$ through conjugacy later. Likewise, to estimate $z_i$ in MAP, we introduce the Beta distribution, the conjugate distribution of the Bernoulli distribution, as the conditional prior parameterized by $\alpha_i = [\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^c]^\top$ and $\beta_i = [\beta_i^1, \beta_i^2, \ldots, \beta_i^c]^\top$ ($\alpha_i^j > 0$, $\beta_i^j > 0$), i.e., $p(z_i \mid x_i) = \prod_{j=1}^c \mathrm{Beta}(z_i^j \mid \alpha_i^j, \beta_i^j)$. We employ an auxiliary branch model $g$ parameterized by $\Omega$ to output $\Lambda_i^\top = [\alpha_i^\top, \beta_i^\top]$, i.e., $\Lambda_i = a \cdot \exp\big(g(x_i; \Omega)/\gamma\big) + b$; the same scaling constants are used for $\Lambda_i$ to simplify the approach. It should be noted that the predictive model $f(x_i; \Theta)$ and the auxiliary model $g(x_i; \Omega)$ can be any deep neural networks as long as their outputs satisfy the corresponding constraints.
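A small sketch of the scale mapping above, with placeholder logits standing in for the network outputs $f(x_i;\Theta)$ and $g(x_i;\Omega)$ (the values and shapes are assumptions for illustration):

```python
import numpy as np

# Hedged sketch of lambda_i = a * exp(f(x_i)/gamma) + b for the Dirichlet
# prior, and Lambda_i = a * exp(g(x_i)/gamma) + b for the Beta prior.
a, b, gamma = 1.0, 0.0, 1.0        # paper requires a >= 1, b >= 0, gamma > 0

def to_prior_params(logits, a=a, b=b, gamma=gamma):
    # exp(.) keeps every parameter strictly positive, as Dirichlet/Beta require
    return a * np.exp(logits / gamma) + b

f_out = np.array([2.0, -1.0, 0.5, 0.0])       # stand-in for f(x; Theta)
g_out = np.array([[1.0, 0.0, -2.0, 0.3],      # stand-in for g(x; Omega):
                  [0.0, 1.0, 0.5, -1.0]])     # rows give alpha and beta

lam = to_prior_params(f_out)                  # Dirichlet parameters lambda_i
alpha, beta = to_prior_params(g_out)          # Beta parameters alpha_i, beta_i
assert (lam > 0).all() and (alpha > 0).all() and (beta > 0).all()
```

The exponential guarantees the positivity constraints $\lambda_i^j, \alpha_i^j, \beta_i^j > 0$ regardless of the raw network outputs, while $a$, $b$ and $\gamma$ control the scale.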

3.3. OPTIMIZATION USING MAXIMUM A POSTERIOR

Based on the generation model of candidate labels given by Eqs. (1), (2) and (3), maximum likelihood (ML) estimation immediately induces the log-likelihood loss function for the PLL training dataset:

$$\mathcal{L}_{\mathrm{ML}} = -\sum_{i=1}^n \log p(S_i \mid x_i, \theta_i, z_i) = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k). \quad (4)$$

Eq. (4) demonstrates how the distribution of the correct label interacts with that of the incorrect candidate labels after decoupling the generation: in the backpropagation of the training process, they provide weight coefficients for each other. To bring prior information into PLL, we further introduce IDGP, which performs MAP on the training dataset. In IDGP, we are concerned with maximizing the joint distribution $p(\theta, z \mid x, S)$, which leads to the optimization problem $\mathcal{L}_{\mathrm{MAP}} = \mathcal{L}_{\mathrm{ML}} + \mathcal{L}_{\mathrm{reg}}$, where

$$\mathcal{L}_{\mathrm{reg}} = -\sum_{i=1}^n \log p(\theta_i \mid x_i) + \log p(z_i \mid x_i). \quad (5)$$

Compared to ML, the IDGP framework provides a natural way of leveraging prior information by additionally optimizing the conditional prior in Eq. (5) during training, which is significant for PLL due to the implicit supervision information. Combined with the Dirichlet and Beta priors, $\mathcal{L}_{\mathrm{reg}}$ can be analytically calculated as

$$\mathcal{L}_{\mathrm{reg}} = -\sum_{i=1}^n \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j). \quad (6)$$

The mathematical derivations of Eq. (4) and Eq. (6) are provided in Appendix A.1. Combining Eq. (4) and Eq. (6), the MAP optimization problem becomes

$$\mathcal{L} = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) - \sum_{i=1}^n \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j). \quad (7)$$

$\mathcal{L}$ can also accommodate the uniform case, as shown in Appendix A.2.
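The objective in Eq. (7) can be written out naively for a single example; the following is an unoptimised sketch under made-up parameter values, not the paper's implementation:

```python
import numpy as np

def idgp_map_loss(theta, z, lam, alpha, beta, S):
    """Single-example MAP loss L = L_ML + L_reg (Eq. (7)), written for clarity.

    theta, z, lam, alpha, beta: length-c arrays; S: candidate-label indices.
    """
    c = len(z)
    # L_ML: marginalise over which candidate j is the correct label.
    like = 0.0
    for j in S:
        Sj = [k for k in S if k != j]                  # incorrect candidates S_i^j
        rest = [k for k in range(c) if k not in Sj]    # complement of S_i^j
        like += theta[j] * np.prod(z[Sj]) * np.prod(1.0 - z[rest])
    l_ml = -np.log(like)
    # L_reg: Dirichlet/Beta log-prior terms (additive constants dropped).
    l_reg = -np.sum((lam - 1) * np.log(theta)
                    + (alpha - 1) * np.log(z)
                    + (beta - 1) * np.log(1.0 - z))
    return l_ml + l_reg

theta = np.array([0.6, 0.2, 0.1, 0.1])   # illustrative Categorical parameters
z = np.array([0.1, 0.7, 0.1, 0.1])       # illustrative Bernoulli parameters
lam = np.ones(4); alpha = np.ones(4); beta = np.ones(4)   # flat priors
loss = idgp_map_loss(theta, z, lam, alpha, beta, S=[0, 1])
assert np.isfinite(loss)
```

With all prior parameters equal to 1, the regularizer vanishes and the loss reduces to the ML term of Eq. (4), which illustrates how the priors act purely as per-label weights.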
Then, due to the conjugacy of the Dirichlet and Categorical distributions Minka (2000), we can estimate $\theta_i^j$ by

$$\hat{\theta}_i^j = \mathbb{E}\big[\theta_i^j \mid x_i, \lambda_i\big] = \frac{o_i^j + \lambda_i^j}{\sum_{k=1}^c (\lambda_i^k + o_i^k)}, \quad (8)$$

where $o_i^j$ denotes the number of occurrences of label $j$ for $x_i$, i.e., $o_i^j = 1$ if $j \in S_i$ and $o_i^j = 0$ otherwise. Similarly, we can leverage the conjugacy of the Beta and Bernoulli distributions to estimate $z_i^j$, i.e.,

$$\hat{z}_i^j = \mathbb{E}\big[z_i^j \mid x_i, \alpha_i^j, \beta_i^j\big] = \frac{o_i^j + \alpha_i^j}{\alpha_i^j + \beta_i^j + o_i^j}. \quad (9)$$

The mathematical derivations of Eq. (8) and Eq. (9) are provided in Appendix A.3. As shown in Eq. (6), IDGP provides prior information that can be encoded in the parameters $\lambda$, $\alpha$ and $\beta$ for the predictive model $f(x; \Theta)$ and the auxiliary model $g(x; \Omega)$ by transforming them into weights exerted on the corresponding labels. The prior information comes from the memorization effects Han et al. (2020) of neural networks, which make the network likely to recognize and remember the correct label first, leading to a kind of initial disambiguation in the beginning epochs. Hence, to keep this fine prior information, we replace $\lambda_i$ with $\bar{\lambda}_i$ as follows:

$$\bar{\lambda}_i^{j(t)} = \begin{cases} m \lambda_i^{j(r)} + (1 - m)\, \lambda_i^{j(t)}, & \text{if } j \in S_i, \\ 1 + \epsilon, & \text{otherwise}, \end{cases} \quad (10)$$

where $(\cdot)^{(t)}$ denotes the vector or scalar $(\cdot)$ at epoch $t$, $r$ ($\le t$) denotes the beginning epoch from which we reserve the fine prior information for $f(x; \Theta)$, $m \in (0, 1)$ is a positive constant used to replenish the present information $\lambda_i^{j(t)}$ with the prior information $\lambda_i^{j(r)}$, and $\epsilon$ is a minor value, meaning that the weight exerted on each incorrect label is negligible.
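The conjugate posterior-mean estimates in Eqs. (8)-(9) and the prior refinement of Eq. (10) amount to a few lines of arithmetic; the following sketch uses illustrative flat priors (`lam = alpha = beta = 1`) rather than real network outputs:

```python
import numpy as np

def estimate_theta_z(lam, alpha, beta, S, c):
    """Posterior-mean estimates via conjugacy (Eqs. (8)-(9)).

    o[j] counts the occurrence of label j in the candidate set S.
    """
    o = np.zeros(c)
    o[list(S)] = 1.0
    theta = (o + lam) / np.sum(lam + o)        # Dirichlet-Categorical, Eq. (8)
    z = (o + alpha) / (alpha + beta + o)       # Beta-Bernoulli, Eq. (9)
    return theta, z

def refine_lambda(lam_t, lam_r, S, m=0.5, eps=1e-3):
    """Prior refinement (Eq. (10)): mix the current lambda with the epoch-r
    copy on candidate labels; non-candidates get a negligible weight 1 + eps."""
    out = np.full_like(lam_t, 1.0 + eps)
    idx = list(S)
    out[idx] = m * lam_r[idx] + (1 - m) * lam_t[idx]
    return out

lam = np.ones(4); alpha = np.ones(4); beta = np.ones(4)
theta, z = estimate_theta_z(lam, alpha, beta, S={0, 1}, c=4)
assert np.isclose(theta.sum(), 1.0)
assert theta[0] > theta[2] and z[0] > z[2]   # candidates outweigh non-candidates
```

Even under flat priors, the occurrence counts $o_i^j$ push probability mass toward the candidate labels, which is the disambiguation starting point that training then sharpens.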
In a similar way, $\alpha_i$ and $\beta_i$ are replaced by

$$\bar{\alpha}_i^{j(t)} = d \alpha_i^{j(q)} + (1 - d)\, \alpha_i^{j(t)}, \qquad \bar{\beta}_i^{j(t)} = d \beta_i^{j(q)} + (1 - d)\, \beta_i^{j(t)}, \quad (11)$$

where $q$ denotes the beginning epoch from which we reserve the fine prior information for the auxiliary model $g(x; \Omega)$, and $d \in (0, 1)$ is also a positive constant used to replenish the present model information $\alpha_i^{j(t)}, \beta_i^{j(t)}$ with the prior information $\alpha_i^{j(q)}, \beta_i^{j(q)}$. Before epoch $r$, $\bar{\lambda}_i^{j(t)} = \lambda_i^{j(t)}$ if $j \in S_i$ and $1 + \epsilon$ otherwise. Before epoch $q$, $\bar{\alpha}_i^{j(t)} = \alpha_i^{j(t)}$ and $\bar{\beta}_i^{j(t)} = \beta_i^{j(t)}$. In this way, we refine the prior information epoch by epoch. The optimization of the two models $f(x; \Theta)$ and $g(x; \Omega)$ at epoch $t$ for an instance $x$ proceeds as follows. First, $f(x; \Theta)$ outputs $\lambda$ while $g(x; \Omega)$ outputs $\alpha$ and $\beta$, from which, according to Eq. (8) and Eq. (9), we replace $\theta$ and $z$ with the estimates $\hat{\theta}$ and $\hat{z}$ in Eq. (7). Next, to introduce prior information, we use $\bar{\lambda}$, $\bar{\alpha}$ and $\bar{\beta}$ from Eq. (10) and Eq. (11) in place of $\lambda$, $\alpha$ and $\beta$ in Eq. (7). Note that $\hat{\theta}$ and $\hat{z}$ are variables, through which we calculate gradients for backpropagation, while $\bar{\lambda}$, $\bar{\alpha}$ and $\bar{\beta}$ are constants.

Algorithm 1: IDGP
Input: PLL training dataset $D = \{(x_i, S_i) \mid 1 \le i \le n\}$, epochs $T$, iterations $K$;
Output: the predictive model $f(x; \Theta)$.
1: Initialize the parameters of the predictive model $f(x; \Theta)$ and the auxiliary model $g(x; \Omega)$;
2: Initialize $\lambda_i^{(1)}$, $\alpha_i^{(1)}$, $\beta_i^{(1)}$ for each instance $x_i$;
3: for $t = 1, 2, \ldots, T$ do
4:   Randomly shuffle the training dataset $D$ and divide it into $K$ mini-batches;
5:   for $k = 1, 2, \ldots, K$ do
6:     Calculate the Categorical and Bernoulli parameters $\hat{\theta}_i$ and $\hat{z}_i$ for each instance $x_i$ by Eq. (8) and Eq. (9);
7:     Calculate the parameters $\bar{\lambda}_i^{(t)}$, $\bar{\alpha}_i^{(t)}$, $\bar{\beta}_i^{(t)}$ for each instance $x_i$ by Eq. (10) and Eq. (11);
8:     Fix $\Theta$, update $\Omega$ by Eq. (7);
9:     Fix $\Omega$, update $\Theta$ by Eq. (7);
10:   end for
11: end for
Finally, we update the predictive model $f(x; \Theta)$ and the auxiliary model $g(x; \Omega)$ by fixing one and updating the other. The whole algorithmic description of IDGP is shown in Algorithm 1. After running IDGP on the PLL training dataset, we can use the output $\lambda$ of the main branch inference model $f(x; \Theta)$ to calculate the predicted results $\hat{\theta}$ as the label confidences on the test dataset.
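The control flow of Algorithm 1 can be sketched as follows. This is a structural skeleton only: the "models" are plain logit tables standing in for the neural networks, and the alternating SGD steps on $\Theta$ and $\Omega$ are elided, since the point is the per-example sequence of estimate (Eqs. (8)-(9)), refine (Eq. (10)), and alternate updates:

```python
import numpy as np

rng = np.random.default_rng(0)

n, c, T = 6, 4, 3                        # toy sizes, illustrative only
a, b, gamma, m, eps = 1.0, 0.0, 1.0, 0.5, 1e-3
S = [{0, 1}, {1, 2}, {0, 3}, {2, 3}, {0, 2}, {1, 3}]
f_logits = rng.normal(size=(n, c))       # stands in for f(x; Theta)
g_logits = rng.normal(size=(n, 2, c))    # stands in for g(x; Omega)

lam_hist = {}                            # epoch-r priors, kept per instance
for t in range(1, T + 1):
    for i in range(n):
        lam = a * np.exp(f_logits[i] / gamma) + b
        alpha, beta = a * np.exp(g_logits[i] / gamma) + b
        o = np.zeros(c); o[list(S[i])] = 1.0
        theta = (o + lam) / np.sum(lam + o)          # Eq. (8)
        z = (o + alpha) / (alpha + beta + o)         # Eq. (9)
        # Eq. (10): blend with the prior kept from the first epoch (r = 1)
        lam_r = lam_hist.setdefault(i, lam.copy())
        lam_bar = np.full(c, 1.0 + eps)
        idx = list(S[i])
        lam_bar[idx] = m * lam_r[idx] + (1 - m) * lam[idx]
        # Fix Theta, update Omega; fix Omega, update Theta (SGD steps elided).
        assert np.isclose(theta.sum(), 1.0)
        assert ((z > 0) & (z < 1)).all() and (lam_bar > 0).all()
```

In the real algorithm the two gradient steps at the end of each mini-batch would update `g` and `f` in turn against the MAP loss of Eq. (7); here the assertions simply check the invariants the estimates must satisfy.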

4. THEORETICAL ANALYSIS

In this section, we pay attention to the estimation error bound of the predictive model $f(x; \Theta)$. According to Eq. (7), the empirical risk estimator for $f(x; \Theta)$ is $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\mathrm{MAP}}(f(x_i), S_i)$, and for further analysis we give an upper bound on $\mathcal{L}_{\mathrm{MAP}}(f(x_i), S_i)$ as follows:

$$\mathcal{L}_{\mathrm{MAP}}(f(x_i), S_i) \le -\Big( K_i + \sum_{j=1}^c w_i^j\, \ell(f(x_i), e_j) \Big), \quad (12)$$

where $e_j$ denotes the standard canonical vector in $\mathbb{R}^c$, $\ell$ denotes the cross-entropy loss, $w_i^j = \lambda_i^j - 1 + \frac{1}{|S_i|}$ if $j \in S_i$ and $w_i^j = \lambda_i^j - 1$ otherwise, and $K_i = \log |S_i| + \frac{1}{|S_i|} \sum_{j \in S_i} \log \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j)$. We denote $K = \max\{K_1, K_2, \ldots, K_n\}$ and scale $w_i^j$ to $[0, \rho]$ during the training process. The detailed derivation of Eq. (12) can be seen in Appendix A.4. To formulate the estimation error bound of $f$, we give the following definition and lemmas.

Definition 1. Let $x_1, x_2, \ldots, x_n \in \mathcal{X}$ be $n$ i.i.d. random variables drawn from a probability distribution, $\sigma_1, \sigma_2, \ldots, \sigma_n \in \{-1, +1\}$ be Rademacher variables with even probabilities, and $\mathcal{H} = \{h : \mathcal{X} \to \mathbb{R}\}$ be a class of measurable functions. The expected Rademacher complexity of $\mathcal{H}$ is defined as $\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_{x,\sigma}\big[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\big]$; similarly, for the loss class $\mathcal{G}$, $\mathfrak{R}_n(\mathcal{G}) = \mathbb{E}_{x,S,\sigma}\big[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \sigma_i g(x_i, S_i)\big]$.

We pre-limit both the main model $f$ and the auxiliary model $g$ by clamping their outputs to $[-A, A]$, so that the loss $\mathcal{L}_{\mathrm{MAP}}$ is bounded, although it would not extend to infinity in practice anyway.

Lemma 1. Suppose the loss function $\mathcal{L}_{\mathrm{MAP}}(f(x), S)$ is bounded by $M$, i.e., $M = \sup_{x \in \mathcal{X}, S \in \mathcal{C}, f \in \mathcal{F}} \mathcal{L}_{\mathrm{MAP}}(f(x), S)$. Then for any $\xi > 0$, with probability at least $1 - \xi$,

$$\sup_{f \in \mathcal{F}} \big| R(f) - \hat{R}(f) \big| \le 2 \mathfrak{R}_n(\mathcal{G}) + \frac{M}{2} \sqrt{\frac{\log \frac{2}{\xi}}{2n}}.$$

Lemma 2. Assume the loss function $\ell(f(x), e_\iota)$ is $L$-Lipschitz with respect to $f(x)$ ($0 < L < \infty$) for all $\iota \in \mathcal{Y}$. Let $\mathcal{H}_\iota = \{h : x \mapsto f_\iota(x) \mid f \in \mathcal{F}\}$ and $\mathfrak{R}_n(\mathcal{H}_\iota) = \mathbb{E}_{x,\sigma}\big[\sup_{h \in \mathcal{H}_\iota} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\big]$. Then the following inequality holds:

$$\mathfrak{R}_n(\mathcal{G}) \le \sqrt{2} \rho c L \sum_{\iota \in \mathcal{Y}} \mathfrak{R}_n(\mathcal{H}_\iota) + K.$$

The proofs of Lemma 1 and Lemma 2 are provided in Appendices A.5 and A.6. Based on Lemmas 1 and 2, we induce an estimation error bound for our IDGP method. Let $\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f)$ be the empirical risk minimizer and $f^\star = \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer. Then we have the following theorem.

Theorem 1. Assume the loss function $\ell(f(x), e_\iota)$ is $L$-Lipschitz with respect to $f(x)$ ($0 < L < \infty$) for all $\iota \in \mathcal{Y}$ and $\mathcal{L}_{\mathrm{MAP}}(f(x), S)$ is bounded by $M$, i.e., $M = \sup_{x \in \mathcal{X}, S \in \mathcal{C}, f \in \mathcal{F}} \mathcal{L}_{\mathrm{MAP}}(f(x), S)$. Then, for any $\xi > 0$, with probability at least $1 - \xi$,

$$R(\hat{f}) - R(f^\star) \le 4\sqrt{2} \rho c L \sum_{\iota \in \mathcal{Y}} \mathfrak{R}_n(\mathcal{H}_\iota) + M \sqrt{\frac{\log \frac{2}{\xi}}{2n}} + 4K.$$

The proof can be found in Appendix A.7. Theorem 1 shows that as $n \to \infty$, the empirical risk minimizer $\hat{f}$ converges to the optimal risk minimizer $f^\star$ for models with a bounded norm.

5. EXPERIMENTS

In this section, we validate the effectiveness of the proposed IDGP on corrupted benchmark and real-world datasets and compare its results against DNN-based PLL algorithms. Furthermore, an ablation study and a parameter sensitivity analysis are conducted to explore IDGP. For the benchmark datasets, we split 10% of the samples from the training datasets for validation. For each real-world dataset, we run the methods with an 80%/10%/10% train/validation/test split. We then run five trials on each dataset with different random seeds and report the mean accuracy and standard deviation of all comparing algorithms. Among the eight compared DNN-based methods: 4) CAVL Zhang et al. (2021), a discriminative approach which identifies correct labels from candidate labels by class activation values. 5) LWS Wen et al. (2021), an identification-based method which introduces a leverage parameter to consider the trade-off between losses on candidate and non-candidate labels. 6) RC Feng et al. (2020), a risk-consistent PLL method induced by an importance-reweighting strategy. 7) CC Feng et al. (2020), a classifier-consistent PLL method which leverages the transition matrix describing the probability of the candidate label set given a correct label. 8) PRODEN Lv et al. (2020), a self-training style algorithm which provides a framework to equip arbitrary stochastic optimizers and models in PLL. Note that PLCR and PICO are not compared on the real-world datasets due to their requirement of data augmentation. To ensure fairness, we employ the same network backbone, optimizer and data augmentation strategy for all comparing methods. For MNIST, Kuzushiji-MNIST and Fashion-MNIST, we take LeNet-5 as the backbone. For CIFAR-10 and CIFAR-100, the backbone is changed to ResNet-32 He et al. (2016). For all real-world datasets, we simply adopt a linear model. The optimizer is stochastic gradient descent (SGD) Robbins & Monro (1951) with momentum 0.9 and batch size 256.
The details of the data augmentation strategy are shown in Appendix A.8. Besides, the learning rate is selected from $\{10^{-4}, 10^{-3}, 10^{-2}\}$ and the weight decay from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ according to performance on the validation set.

5.3. EXPERIMENTAL RESULTS

The performance of each DNN-based method on each corrupted benchmark dataset is summarized in Table 1, where the best results are highlighted in bold and •/• additionally indicates whether IDGP statistically wins/loses against the comparing method on each dataset (pairwise t-test at the 0.05 significance level). Overall, IDGP significantly outperforms all comparing approaches on all benchmark datasets (except on Kuzushiji-MNIST, where VALEN performs comparably to IDGP), and the improvements are particularly noticeable on Fashion-MNIST and CIFAR-10. Table 2 demonstrates the ability of IDGP to solve the PLL problem on real-world datasets. PLCR and PICO are not compared on the real-world PLL datasets because data augmentation cannot be applied to the extracted features from various domains. Our method is stronger than the others on all datasets except Soccer Player, where IDGP loses to RC but still ranks second. On BirdSong, MSRCv2 and Yahoo!News, the performance of IDGP is significantly better than that of all other comparing algorithms, and on Lost, our method is comparable to VALEN, PRODEN and RC while clearly better than the rest.

5.4. FURTHER ANALYSIS

To demonstrate the effectiveness of the iteratively refined prior information introduced by IDGP, we remove the loss function $\mathcal{L}_{\mathrm{reg}}$ to reduce IDGP to IDGP-ML, which only uses the log-likelihood function for optimization. The performance of IDGP-ML against IDGP is also measured by classification accuracy (with pairwise t-test at the 0.05 significance level). As illustrated in Table 3, with the assistance of the prior information that IDGP provides and improves epoch by epoch, IDGP achieves superior performance on all real-world datasets compared to IDGP-ML. Furthermore, we conduct a parameter sensitivity analysis to study the influence of the two hyperparameters $a$ and $\gamma$, which decide the scale of the Dirichlet and Beta distribution parameters. Figure 1 illustrates the sensitivity of IDGP on the real-world datasets Lost, BirdSong and Yahoo!News when $a$ varies from 0.001 to 1000 and $\gamma$ increases from 0.1 to 3. For small-scale real-world datasets like Lost, $a$ and $\gamma$ are suggested to be around 0.1 and 0.5, respectively. For larger-scale real-world datasets like BirdSong and Yahoo!News, IDGP seems insensitive to $a$, and $\gamma$ is suggested to be around 1.

6. CONCLUSION

In this paper, we consider a more realistic scenario, instance-dependent PLL, and explicitly decompose and model the generation process of instance-dependent candidate labels.

A APPENDIX

A.1 DERIVATIONS OF EQ. (4) AND EQ. (6)

The derivation of Eq. (4) is as follows:

$$\begin{aligned}
\mathcal{L}_{\mathrm{ML}} &= -\sum_{i=1}^n \log p(S_i \mid x_i, \theta_i, z_i) = -\sum_{i=1}^n \log \sum_{j \in S_i} p(y_i = j \mid x_i)\, p(S_i^j \mid x_i) \\
&= -\sum_{i=1}^n \log \sum_{j \in S_i} \Big( \mathrm{Cat}(l_i \mid \theta_i) \prod_{k=1}^c \mathrm{Ber}(s_i^k \mid z_i^k) \,\Big|\, y_i = j \Big) \\
&= -\sum_{i=1}^n \log \sum_{j \in S_i} \Big( \prod_{k=1}^c (\theta_i^k)^{l_i^k} \prod_{k=1}^c (z_i^k)^{s_i^k} (1 - z_i^k)^{1 - s_i^k} \,\Big|\, y_i = j \Big) \\
&= -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k).
\end{aligned}$$

The derivation of Eq. (6) is as follows:

$$\begin{aligned}
\mathcal{L}_{\mathrm{reg}} &= -\sum_{i=1}^n \log p(\theta_i \mid x_i) + \log p(z_i \mid x_i) \\
&= -\sum_{i=1}^n \log \frac{\Gamma\big(\sum_{j=1}^c \lambda_i^j\big)}{\prod_{j=1}^c \Gamma(\lambda_i^j)} \prod_{j=1}^c (\theta_i^j)^{\lambda_i^j - 1} + \log \prod_{j=1}^c \frac{\Gamma(\alpha_i^j + \beta_i^j)}{\Gamma(\alpha_i^j)\Gamma(\beta_i^j)} (z_i^j)^{\alpha_i^j - 1} (1 - z_i^j)^{\beta_i^j - 1} \\
&= -\sum_{i=1}^n \log \prod_{j=1}^c (\theta_i^j)^{\lambda_i^j - 1} + \log \prod_{j=1}^c (z_i^j)^{\alpha_i^j - 1} (1 - z_i^j)^{\beta_i^j - 1} + C_\Gamma \\
&= -\sum_{i=1}^n \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j) + C_\Gamma, \quad (15)
\end{aligned}$$

where $C_\Gamma = -\sum_{i=1}^n \log \frac{\Gamma(\sum_{j=1}^c \lambda_i^j)}{\prod_{j=1}^c \Gamma(\lambda_i^j)} + \log \prod_{j=1}^c \frac{\Gamma(\alpha_i^j + \beta_i^j)}{\Gamma(\alpha_i^j)\Gamma(\beta_i^j)}$. Since $\lambda$, $\alpha$ and $\beta$ are later replaced in Eq. (7) by $\bar{\lambda}$, $\bar{\alpha}$ and $\bar{\beta}$ from Eq. (10) and Eq. (11), which are fixed constants, $C_\Gamma$ is a constant and can be ignored. This completes the derivation of Eq. (6).

A.2 DEGENERATION OF EQ.(7)

The proposed process can accommodate the uniform generation process of candidate labels by setting the parameters of the Bernoulli distribution to a constant $p$, which is exactly the flipping probability. In this case, $\mathcal{L}_{\mathrm{ML}}$, $\mathcal{L}_{\mathrm{reg}}$ and $\mathcal{L}$ degenerate to the following forms:

$$\mathcal{L}_{\mathrm{ML\text{-}D}} = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j (1-p)^{c+1-|S_i|} p^{|S_i|-1} = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j - \sum_{i=1}^n \log (1-p)^{c+1-|S_i|} p^{|S_i|-1}.$$

Since the second term is a constant, the final form is $\mathcal{L}_{\mathrm{ML\text{-}D}} = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j$. Because the parameters of the Bernoulli distribution are constant, the Beta distribution is no longer needed; hence $\mathcal{L}_{\mathrm{reg\text{-}D}} = -\sum_{i=1}^n \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j$. Finally,

$$\mathcal{L}_{\mathrm{D}} = \mathcal{L}_{\mathrm{ML\text{-}D}} + \mathcal{L}_{\mathrm{reg\text{-}D}} = -\sum_{i=1}^n \log \sum_{j \in S_i} \theta_i^j - \sum_{i=1}^n \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j.$$

A.3 DERIVATIONS OF EQ. (8) AND EQ. (9)

$$\theta_i \sim p(\theta_i \mid x_i, \lambda_i) \propto \mathrm{Dir}(\theta_i \mid \lambda_i) \cdot \mathrm{Cat}(l_i \mid \theta_i) = \frac{\Gamma\big(\sum_{j=1}^c \lambda_i^j\big)}{\prod_{j=1}^c \Gamma(\lambda_i^j)} \prod_{j=1}^c (\theta_i^j)^{\lambda_i^j - 1} \prod_{j=1}^c (\theta_i^j)^{l_i^j} \propto \mathrm{Dir}(\theta_i \mid \lambda_i + l_i),$$

which proves the conjugacy of the Dirichlet and Categorical distributions. For the Dirichlet distribution, the expectation can be calculated as $\mathbb{E}[\theta_i^j \mid x_i, \lambda_i] = \frac{\lambda_i^j + l_i^j}{\sum_{k=1}^c (\lambda_i^k + l_i^k)} = \frac{\lambda_i^j + o_i^j}{\sum_{k=1}^c (\lambda_i^k + o_i^k)}$, which completes the derivation of Eq. (8). Eq. (9) can be proved in a similar way, since the Dirichlet and Categorical distributions are the generalized forms of the Beta and Bernoulli distributions, respectively.

Published as a conference paper at ICLR 2023

A.4 CALCULATION DETAILS OF EQ. (12)

According to the multivariate basic inequality (AM-GM on the summands), we can obtain

$$-\log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) \le -\log |S_i| - \frac{1}{|S_i|} \sum_{j \in S_i} \Big( \log \theta_i^j + \log \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) \Big). \quad (22)$$

Then we can calculate Eq. (12) as
$$\begin{aligned}
\mathcal{L}_{\mathrm{MAP}}(f(x_i), S_i) &= -\log \sum_{j \in S_i} \theta_i^j \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) - \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j) \\
&\le -\log |S_i| - \frac{1}{|S_i|} \sum_{j \in S_i} \Big( \log \theta_i^j + \log \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) \Big) - \sum_{j=1}^c (\lambda_i^j - 1) \log \theta_i^j + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j) \\
&= -\Big( \log |S_i| + \frac{1}{|S_i|} \sum_{j \in S_i} \log \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) + \sum_{j=1}^c (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j) \Big) - \sum_{j \in S_i} \Big( \lambda_i^j - 1 + \frac{1}{|S_i|} \Big) \log \theta_i^j - \sum_{j \notin S_i} (\lambda_i^j - 1) \log \theta_i^j \\
&= -\Big( K_i + \sum_{j=1}^c w_i^j\, \ell(f(x_i), e_j) \Big),
\end{aligned}$$

where $e_j$ denotes the standard canonical vector in $\mathbb{R}^c$, $\ell$ denotes the cross-entropy loss, $w_i^j = \lambda_i^j - 1 + \frac{1}{|S_i|}$ if $j \in S_i$ and $w_i^j = \lambda_i^j - 1$ otherwise, and $K_i = \log |S_i| + \frac{1}{|S_i|} \sum_{j \in S_i} \log \prod_{k \in S_i^j} z_i^k \prod_{k \notin S_i^j} (1 - z_i^k) + (\alpha_i^j - 1) \log z_i^j + (\beta_i^j - 1) \log (1 - z_i^j)$.

A.5 PROOF OF LEMMA 1

To prove this lemma, we first show that $\mathcal{L}_{\mathrm{MAP}}$ can be bounded. Since the outputs of $f$ and $g$ are limited to $[-A, A]$, every $\lambda_i^j$, $\alpha_i^j$ and $\beta_i^j$ lies in $[a \exp(-A/\gamma) + b,\; a \exp(A/\gamma) + b]$, which yields the lower bound $\theta_i^j \ge \frac{a \exp(-A/\gamma) + b}{ac \exp(A/\gamma) + bc + c} = B$ and constants $E$ and $F$ such that $z_i^j \ge E$ and $z_i^j \le F$ with $0 < E \le F < 1$. When $M$ takes a value larger than $-\log\big(|S_i| B (1-F)^{c+1-|S_i|} E^{|S_i|-1}\big) - c \log [B E (1-F)] \cdot \big(a \exp(A/\gamma) + b\big)$, the loss $\mathcal{L}_{\mathrm{MAP}}$ is bounded. Note that the limitation on the outputs of the model excludes extreme (conditional) probabilities, whose effect can be ignored when $A$ is large enough.

A.8 DATA AUGMENTATION DETAILS

For all benchmark datasets, we apply Random Horizontal Flipping, Random Cropping, and Cutout. For CIFAR-10 and CIFAR-100, AutoAugment is additionally applied.

A.9 EXTENDING EXPERIMENTS

Our Generation Model. We also synthesize candidate labels by sampling the correct label from the Categorical distribution $p(y = j \mid x)$ and the incorrect candidate labels from the Bernoulli distributions $p(s^j = 1 \mid x)$. For the former, the correct labels are already contained in these datasets. For the latter, we use the confidence predictions of a clean neural network $g(x; \Omega)$ (trained only with correct labels) to model the Bernoulli distribution, i.e., $p(s^j = 1 \mid x) = \mathrm{sigmoid}(g^j(x; \Omega))$.
Table 4 illustrates the performance of IDGP and the comparing approaches on benchmark datasets whose instance-dependent partial labels are generated by our generation model. From the table, we can see that IDGP consistently outperforms all comparing approaches on all benchmark datasets, and the improvements are particularly noticeable on Kuzushiji-MNIST and CIFAR-10. Ablation Study. As illustrated in Table 5, IDGP also achieves superior performance on all benchmark datasets whose instance-dependent partial labels are generated by our generation model. Sensitivity Analysis. For the benchmark dataset Fashion-MNIST, the performance of IDGP is stable and effective when $a$ is around 10 and $\gamma$ is around 1.5.
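As a side note, the degenerate objective $\mathcal{L}_{\mathrm{ML\text{-}D}} = -\sum_{i=1}^{n}\log\sum_{j\in S_i}\theta_i^j$ derived earlier in the appendix is straightforward to compute directly; the following is a minimal sketch (the function name and toy values are ours, not from the paper):

```python
import math

def l_ml_degenerate(theta_rows, candidate_sets):
    """Degenerate ML loss: -sum_i log sum_{j in S_i} theta_i^j.

    Under a constant flipping probability p, the Bernoulli factor
    (1-p)^(c+1-|S_i|) * p^(|S_i|-1) is constant per example and drops
    out, leaving only the log of the probability mass on S_i."""
    loss = 0.0
    for theta, S in zip(theta_rows, candidate_sets):
        loss -= math.log(sum(theta[j] for j in S))
    return loss

# Toy check: with theta uniform over c = 4 classes and |S_i| = 2,
# each example contributes -log(0.5).
theta_rows = [[0.25, 0.25, 0.25, 0.25]] * 3
candidate_sets = [{0, 1}, {1, 2}, {2, 3}]
print(round(l_ml_degenerate(theta_rows, candidate_sets), 6))  # → 2.079442
```

Note that minimizing this quantity pushes the total predicted probability mass onto the candidate set, which is why the dropped constant term does not affect the optimization.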




BASELINES

We compare IDGP with eight DNN-based methods: 1) PLCR Wu et al. (2022), a regularized training framework based on data augmentation that utilizes a manifold consistency regularization term to preserve the manifold structure in both the feature space and the label space. 2) PICO Wang et al. (2022), a contrastive learning framework based on data augmentation that performs label disambiguation based on contrastive prototypes. 3) VALEN Xu et al. (2021), an instance-dependent PLL framework that guides the training process via the recovered latent label distributions. 4) CAVL Zhang et al. (2021),

Figure 1: Parameter sensitivity analysis for IDGP on Lost, BirdSong and Yahoo!News.

$$\theta_i^j \ge \frac{a\exp(-A/\gamma)+b}{ac\exp(A/\gamma)+bc+c}, \qquad \mathcal{L}_{\mathrm{MAP}} \le -\log\!\left(|S_i|\,B\,(1-F)^{c+1-|S_i|}E^{|S_i|-1}\right) - \frac{c\log\left[BE(1-F)\right]}{a\exp(A/\gamma)+b}.$$

Figure 2: Parameter sensitivity analysis for IDGP on Fashion-MNIST.

Classification accuracy (mean±std) of each comparing approach on benchmark datasets for instance-dependent PLL.

Classification accuracy (mean±std) of comparing algorithms on the real-world datasets.

We implement IDGP and the compared DNN-based algorithms on five widely used benchmark datasets in deep learning, including MNIST LeCun et al. (1998), Kuzushiji-MNIST Clanuwat et al. (2018), Fashion-MNIST Xiao et al. (2017), and CIFAR-10 and CIFAR-100 Krizhevsky et al. (2009). Instance-dependent partial labels for these datasets are generated through the same strategy as Xu et al. (2021), which considered instance-dependent PLL for the first time. Besides, some of the comparing algorithms are also evaluated on five frequently used real-world datasets from different practical application domains, including Lost Cour et al. (2011), BirdSong Briggs et al. (2012), MSRCv2 Liu & Dietterich (2012), Soccer Player Zeng et al. (2013), and Yahoo!News Guillaumin et al. (2010).

Classification accuracy (mean±std) for comparison against IDGP-ML.

Then, based on the decompositional generation process, we propose a novel instance-dependent PLL approach, IDGP, which further introduces and refines the prior information in every training epoch via MAP. The experimental comparisons with other DNN-based algorithms on both instance-dependent corrupted benchmark datasets and real-world datasets demonstrate the effectiveness of our proposed method.

Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma. Learning by associating ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 708-715, 2013.

Fei Zhang, Lei Feng, Bo Han, Tongliang Liu, Gang Niu, Tao Qin, and Masashi Sugiyama. Exploiting class activation value for partial-label learning. In International Conference on Learning Representations, 2021.

Classification accuracy (mean±std) of each comparing approach on benchmark datasets whose instance-dependent partial labels are generated by our generation model.

Classification accuracy (mean±std) for comparison against IDGP-ML on benchmark datasets whose instance-dependent partial labels are generated by our generation model.

ACKNOWLEDGMENTS

This research was supported by the National Key Research & Development Plan of China (No. 2021ZD0114202), the National Science Foundation of China (62206050, 62125602, and 62076063), the China Postdoctoral Science Foundation (2021M700023), the Jiangsu Province Science Foundation for Youths (BK20210220), and the Young Elite Scientists Sponsorship Program of Jiangsu Association for Science and Technology (TJ-2022-078).

CODE AVAILABILITY

Source code is available at https://github.com/

APPENDIX

Then, we show that one direction, $\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)$, is bounded with probability at least $1-\xi/2$; the other direction can be shown similarly. Suppose an example $(x_i,S_i)$ is replaced by another arbitrary example $(x_i',S_i')$. Since the loss function $\mathcal L$ is bounded by $M$, the change of $\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)$ is no greater than $M/(2n)$. By applying McDiarmid's inequality, for any $\xi>0$, the deviation is bounded with probability at least $1-\xi/2$. By symmetrization, the expectation of the supremum can further be bounded by the expected Rademacher complexity. By taking into account the other side, $\sup_{f\in\mathcal F}\bigl(\widehat R(f)-R(f)\bigr)$, we obtain the stated bound for any $\xi>0$ with probability at least $1-\xi$.

A.6 PROOF OF LEMMA 2

Consider the upper-bound loss function of $\mathcal{L}_{\mathrm{MAP}}$. Correspondingly, the function space for $\mathcal L$ can be defined, and the expected Rademacher complexity of $\bar{\mathcal G}$ can be defined accordingly. For each example $(x_i,S_i)$, since the loss function $\ell(f(x,e_\iota))$ is $L$-Lipschitz for all $\iota\in\mathcal Y$, by the Rademacher vector contraction inequality we obtain the desired bound, which completes the proof.
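For reference, the concentration step above uses McDiarmid's inequality; specialized to the bounded-difference constant $M/(2n)$ stated in the text (a standard form, not taken verbatim from the paper), it reads:

```latex
% McDiarmid's inequality with bounded differences c_i = M/(2n).
\begin{equation*}
\Pr\!\left[\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)
  - \mathbb{E}\!\left[\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)\right]
  \ge \epsilon\right]
\le \exp\!\left(-\frac{2\epsilon^{2}}{\sum_{i=1}^{n} c_i^{2}}\right),
\qquad c_i=\frac{M}{2n},
\end{equation*}
% Solving exp(-2 eps^2 / (M^2/4n)) = xi/2 for eps gives, with
% probability at least 1 - xi/2,
\begin{equation*}
\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)
\le \mathbb{E}\!\left[\sup_{f\in\mathcal F}\bigl(R(f)-\widehat R(f)\bigr)\right]
  + \frac{M}{2}\sqrt{\frac{\log(2/\xi)}{2n}}.
\end{equation*}
```

Here $\sum_i c_i^2 = M^2/(4n)$, which yields the $O(M\sqrt{\log(1/\xi)/n})$ deviation term typical of such generalization bounds.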

A.7 PROOF OF THEOREM 1

Theorem 1 is proven through Eq. (32).

