ADVERSARY-AWARE PARTIAL LABEL LEARNING WITH LABEL DISTILLATION

Abstract

To keep data collected from human subjects confidential, rival labels are deliberately introduced to conceal the information provided by the participants. The corresponding learning task can be formulated as a noisy partial-label learning problem. However, conventional partial-label learning (PLL) methods remain vulnerable to a high ratio of noisy partial labels, especially in a large labelling space. To learn a more robust model, we present Adversary-Aware Partial Label Learning, which introduces the rival, a set of noisy labels, into the collection of candidate labels for each instance. By introducing the rival label, the predictive distribution of PLL is factorised such that a handy predictive label is achieved with less uncertainty coming from the transition matrix, assuming the rival generation process is known. Nonetheless, the predictive accuracy is still insufficient to produce a sufficiently accurate positive sample set that can leverage the clustering effect of the contrastive loss function. Moreover, the inclusion of rivals also introduces an inconsistency issue for the classifier and risk function due to the intractability of the transition matrix. Consequently, an adversarial teacher within momentum (ATM) disambiguation algorithm is proposed to cope with this situation, allowing us to obtain a provably consistent classifier and risk function. In addition, our method shows high resilience to the choice of the label noise transition matrix. Extensive experiments demonstrate that our method achieves promising results on the CIFAR10, CIFAR100 and CUB200 datasets.

1. INTRODUCTION

Deep learning algorithms depend heavily on large-scale, accurately annotated training datasets. Nonetheless, the cost of annotating a large volume of instances with true labels is exorbitant, not to mention the time invested in the labelling procedure. As a result, weakly supervised labels, such as partial labels that substitute for true labels in learning, have proliferated and gained massive popularity in recent years. Partial-label learning (PLL) is a weakly supervised learning problem in which each instance is associated with a set of candidate labels $\vec{Y}$, of which exactly one latent label $y$ is true. Without an appropriately designed learning algorithm, the limitations of partial labels are evident, since deep neural networks remain vulnerable to the ambiguity rooted in the noisy candidate labels Zhou (2018); Patrini et al. (2017); Han et al. (2018). Many PLL works Cour et al. (2011); Hüllermeier & Beringer (2006); Feng & An (2019); Feng et al. (2020) have successfully addressed this ambiguity problem, where each instance carries a set of candidate labels of which only one is true. Beyond the general partial label, a variety of partial label generation settings have evolved to simulate different real-life scenarios. Independent and uniform drawing is the most common Lv et al. (2020); Feng & An (2019). Other settings include instance-dependent partial label learning, where each partial label set is generated depending on the instance as well as the true label Xu et al. (2021). Furthermore, Lv et al. (2020) introduced label-specific partial label learning, where the uniform flipping probability of similar instances differs from that of dissimilar instances.

The learning objective of these previous works is disambiguation: to train a classifier with partial labels that correctly labels the testing dataset, with classification performance as close as possible to fully supervised learning. In contrast, previous work on general partial label learning sheds little light on data privacy-enhancing techniques. The privacy risk is inescapable, so privacy-preserving techniques need to be urgently addressed. Recently, we have seen surging data breach cases worldwide. The potential risks posed by attackers are often overlooked and pose a detrimental threat to society; for instance, an adversary could learn from stolen or leaked partially labelled data for illegal conduct using previously proposed partial-label learning methods. This has become an inherent privacy concern in conventional partial label learning.

In this paper, Adversary-Aware partial label learning is proposed to address and mitigate the ramifications of a data breach. In a nutshell, we propose an affordable and practical approach that deliberately corrupts the collected dataset to prevent the adversary from obtaining high-quality, confidential information, while ensuring the trustee retains full access to the useful information. However, adversary-aware partial label learning possesses some intrinsic learnability issues. Firstly, intractability arises from the transition matrix. Secondly, a classifier and risk inconsistency problem is introduced.
Hence, we propose the Adversarial teacher within momentum (ATM) (section 2.1), an adversary-aware loss function (equation 19), and a new ambiguity condition (equation 1) to counter these issues. Under the adversary-aware partial label setting, the rival is added to the candidate set of labels. To achieve this, we extend the original partial label generation (equation 2) by factorisation to add the rival $Y'$, yielding the adversary-aware partial label generation in equation 3. We then decompose the second factor of equation 3 into the rival-embedded intractable transition term $Q^*$ and the transition matrix $T_{y,y'} = P(Y' = y' \mid Y = y, X = x)$. In our problem setting, the class instance-independent transition matrix $T_{y,y'} = P(Y' = y' \mid Y = y)$ is utilised, under the assumption that the rival is generated depending only on $Y$ and not on the instance $X$. Under this assumption, the class instance-independent transition matrix is simplified and mathematically identifiable, and since all instances share it in practice, such encryption is affordable to implement. The rival variable serves as controllable randomness to enhance privacy against potential adversaries and information leakage; in contrast, previous methods cannot guarantee this privacy protection property. However, a fundamental problem arises: the inclusion of the rival implies an inconsistent classifier according to the adversary-aware label generation in equation 3. Learning a consistent partial label classifier is vital, but in our setting a consistent classifier may not be obtainable due to the intractability of $Q^*$ (details in section 1.2). Consequently, the Adversarial teacher within momentum (ATM) is proposed to identify the term $P(\vec{Y} \mid Y, Y', X)$, denoted $Q^*$. The MoCo-style dictionary technique of He et al. (2020) and Wang et al. (2022) inspired us to exploit the soft label from the instance embedding, leveraging $T_{y,y'}$ to identify, or reduce the uncertainty of, $Q^*$, owing to its informational preservation and tractability. A consistent partial label learner is therefore obtained if the uncertainty raised by the transition matrix is greatly reduced. Specifically, we transform the inference of label generation in Adversary-Aware PLL into an approximation of the transition matrix $Q^*$; ultimately, a tractable, unbiased estimate of $P(\vec{Y} \mid Y, Y', X)$ can be derived. Lastly, we rigorously prove that a consistent Adversary-Aware PLL classifier can be obtained if $P(\vec{Y} \mid Y, Y', X)$ and $P(Y' \mid Y)$ are approximated accurately according to equation 3. In this work we mainly focus on identifying the transition term $P(\vec{Y} \mid Y, Y', X)$; the rival is generated manually for privacy enhancement, so $P(Y' \mid Y)$ is given by design. Overall, our proposed method not only solves the ambiguity problem in Adversary-Aware PLL but also addresses the potential risks of a data breach by using the rival as encryption. Our proposed label generation bears some resemblance to local differential privacy Kairouz et al. (2014); Warner (1965), which aims to randomise responses. A potential application is randomised survey responses, a survey technique for improving the reliability of answers to confidential interviews or private questions.
Depending on the sophistication of the adversary, our method offers a dynamic privacy-encryption mechanism that is resilient and flexible in the face of potential adversaries or privacy risks. By learning from previous attacks, we can design different levels of protection by adjusting $T$. The main contributions of this work are summarised as follows:

• We propose a novel problem setting named adversary-aware partial label learning.

• We propose a novel Adversary-Aware loss function and the Adversarial teacher within momentum (ATM) disambiguation algorithm. The proposed paradigm and loss function can be applied universally to other related partial label learning methods to enhance privacy protection.

• A new ambiguity condition (equation 1) for Adversary-Aware Partial Label Learning is derived, and we theoretically prove that the method is a classifier-consistent risk estimator.

1.1. RELATED WORK

Partial Label Learning (PLL) trains on instances each associated with a candidate set of labels in which the true label is included. Many frameworks have been designed to solve the label ambiguity issue in partial label learning. Probabilistic graphical model-based methods Zhang et al. (2016); Wang & Isola (2020); Xu et al. (2019); Lyu et al. (2019) as well as clustering-based or unsupervised approaches Liu & Dietterich (2012) leverage the graph structure and prior information of the feature space for label disambiguation. Average-based methods Hüllermeier & Beringer (2006); Cour et al. (2011); Zhang et al. (2016) assume uniform treatment of all candidates; however, they are vulnerable to false positive labels, leading to misled predictions. Identification-based methods Jin & Ghahramani (2002) tackle disambiguation by treating the true label as a latent variable. Representative approaches use maximum margin methods Nguyen & Caruana (2008); Wang et al. (2020; 2022) for label disambiguation. Most recently, self-training methods Feng & An (2019); Wen et al. (2021); Feng et al. (2020) have emerged and shown promising performance. In Contrastive Learning (CL) He et al. (2020); Oord et al. (2018), augmented inputs are used to learn features from unlabelled samples; the learning objective is to differentiate the similar and dissimilar parts of the input and, in turn, maximise the quality of the learned representations. CL has been studied in the unsupervised representation setting Chen et al. (2020); He et al. (2020), which treats instances of the same class as the positive set to boost performance. Weakly supervised learning has also borrowed CL concepts to tackle the partial label problem Wang et al. (2022), and CL has been applied to semi-supervised learning Li et al. (2020).

1.2. ADVERSARY-AWARE PARTIAL LABEL PROBLEM SETTING

Let the input space be $\mathcal{X} \subseteq \mathbb{R}^d$ and the label space $\mathcal{Y} = [c] = \{1, \dots, c\}$ with $c > 2$ classes. Under adversary-aware partial labels, each instance $X \in \mathcal{X}$ has a candidate set of adversary-aware partial labels $\vec{Y} \in \vec{\mathcal{Y}}$, where the adversary-aware partial label space is $\vec{\mathcal{Y}} := \{\vec{y} \mid \vec{y} \subset \mathcal{Y}\} = 2^{[c]}$, i.e., there are $2^c$ possible subsets of $[c]$. The objective is to learn a classifier from the $n$ adversary-aware partially labelled samples $\{(X_1, \vec{Y}_1), \dots, (X_n, \vec{Y}_n)\}$ drawn i.i.d. from the distribution $\vec{\mathcal{D}}$ over $\mathcal{X} \times \vec{\mathcal{Y}}$, such that it can assign the true labels to the testing dataset. The class instance-independent transition matrix $P(Y' \mid Y)$ is denoted by $T \in \mathbb{R}^{c \times c}$, with $T_{y,y'} = P(Y' = y' \mid Y = y)$ and $T_{y,y} = 0$ for all $y', y \in [c]$. Adversary-aware means that the designed paradigm prevents the adversary from efficiently and reliably inferring certain information from the database without access to $T$, even if the data is leaked. The rival is the controllable randomness added to the partial label set to enhance privacy.
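To make the setting concrete, the following minimal sketch (our illustration; the helper name and random sampling scheme are assumptions, not the paper's released code) constructs a class instance-independent rival transition matrix $T$ with zero diagonal and row-stochastic entries, as defined above.

```python
import numpy as np

def make_rival_transition(c: int, seed: int = 0) -> np.ndarray:
    """Class instance-independent rival transition T with T[y, y] = 0.

    T[y, y'] = P(Y' = y' | Y = y); each row is a proper distribution over
    the c - 1 possible rivals, so the rival never equals the true label.
    """
    rng = np.random.default_rng(seed)
    T = rng.random((c, c))
    np.fill_diagonal(T, 0.0)                 # rival must differ from Y
    T /= T.sum(axis=1, keepdims=True)        # row-stochastic
    return T

T = make_rival_transition(c=10)
assert np.allclose(T.sum(axis=1), 1.0) and np.allclose(np.diag(T), 0.0)
```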

1.2.1. ASSERTION CONDITIONS IN LABEL GENERATION SET

The following conditions describe the learnability requirements for adversary-aware partial labels. According to Cour et al. (2011), a certain degree of ambiguity is required for partial label learning. Lemma 1 states the new ERM learnability condition:

$$P_{y',\bar{y}} := P(y', \bar{y} \in \vec{Y} \mid Y' = y', \bar{Y} = \bar{y}, X = x). \qquad (1)$$

Here $y'$ is the rival and $\bar{y}$ is a false positive label contained in the partial label set. With $y' \neq y$ and $\bar{y} \neq y$, the small ambiguity degree condition $P_{y',\bar{y}} < 1$ must hold to ensure ERM learnability Liu & Dietterich (2014) of the adversary-aware PLL problem. The label $y$ is the true label of each instance $x$, and $P_y := P(y \in \vec{Y} \mid Y = y, X = x) = 1$, ensuring the ground-truth label is always contained in the partial label set of each instance.
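As a rough illustration of the small ambiguity degree condition, the following Monte Carlo sketch (our simplification with a uniform flipping generator; `flip_prob` and the generation protocol are assumptions, not the paper's procedure) estimates the probability that a fixed distractor co-occurs in $\vec{Y}$ and checks that it stays strictly below 1.

```python
import numpy as np

def estimate_ambiguity(c=10, flip_prob=0.3, trials=20_000, seed=0):
    """Estimate max over ȳ != y of P(ȳ ∈ ⃗Y | Y = y) by Monte Carlo."""
    rng = np.random.default_rng(seed)
    y = 0                                   # symmetric under uniform flipping
    counts = np.zeros(c)
    for _ in range(trials):
        in_set = rng.random(c) < flip_prob  # independent distractor flips
        in_set[y] = True                    # P_y = 1: true label always kept
        counts += in_set
    co_occurrence = counts / trials
    co_occurrence[y] = 0.0                  # exclude the true label itself
    return co_occurrence.max()

assert estimate_ambiguity() < 1.0           # small ambiguity degree holds
```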

1.2.2. LABEL GENERATION

In previous partial label generation procedures, only a candidate set of partial labels was generated. The standard partial label generation is

$$\sum_{y \in \mathcal{Y}} P(\vec{Y} = \vec{y}, Y = y \mid X = x) = \sum_{y \in \mathcal{Y}} P(\vec{Y} = \vec{y} \mid Y = y, X = x)\, P(Y = y \mid X = x) = \sum_{y \in \mathcal{Y}} P(\vec{Y} = \vec{y} \mid Y = y)\, P(Y = y \mid X = x), \qquad (2)$$

where $P(\vec{Y} = \vec{y} \mid Y = y, X = x)$ is the label generation for the class instance-dependent partial label and $P(\vec{Y} = \vec{y} \mid Y = y)$ corresponds to the standard partial label learning framework. We now contrast general partial labels with adversary-aware partial labels. The adversary-aware partial label generation is

$$P(\vec{Y} = \vec{y} \mid X = x) = \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}'} P(\vec{Y} = \vec{y}, Y = y, Y' = y' \mid X = x) = \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}'} \underbrace{P(\vec{Y} = \vec{y} \mid Y = y, Y' = y', X = x)}_{\text{adversary-aware transition}}\, T_{y,y'}\, P(Y = y \mid X = x). \qquad (3)$$

In the adversary-aware partial label problem setting, the transition matrix of the adversary-aware partial label, $P(\vec{Y} \mid Y, Y', X)$, is denoted by $Q^* \in \mathbb{R}^{c \times (2^c - 2)}$, and the partial label transition matrix $P(\vec{Y} \mid Y)$ is denoted by $Q \in \mathbb{R}^{c \times (2^c - 2)}$. If the true label $Y$ of the vector $\vec{Y}$ is unknown given an instance $X$, there are $2^c - 2$ possible candidate label sets $\vec{y} \in \vec{\mathcal{Y}}$. Let $\epsilon_x \in \mathbb{R}^{1 \times c}$ be the instance-dependent rival label noise for each instance. The entries of the adversary-aware transition matrix for each instance are then

$$\sum_{j=1}^{2^c-2} Q^*[:, j] = \sum_{j=1}^{2^c-2} \big([Q[:, j]^\top + \epsilon_x]^\top\big)^\top = \sum_{j=1}^{2^c-2} \big(A[:, j]^\top T\big)^\top, \qquad (4)$$

where $A[:, j]^\top = Q[:, j]^\top + \epsilon_x$. The conditional distribution of the adversary-aware partial label set $\vec{Y}$, following Wen et al. (2021), is

$$P(\vec{Y} = \vec{y} \mid Y = y, Y' = y', X = x) = \prod_{b' \in \vec{y},\, b' \neq y} p_{b'} \cdot \prod_{t' \notin \vec{y}} (1 - p_{t'}), \qquad (5)$$

where $p_{t'}$ and $p_{b'}$ are defined as

$$p_{t'} := P(t' \in \vec{Y} \mid Y = y, Y' = y', X = x) < 1, \qquad p_{b'} := P(b' \in \vec{Y} \mid Y = y, Y' = y', X = x) < 1. \qquad (6)$$

We summarise equation 3 in matrix form in equation 7; the inverse problem is to identify a sparse approximation matrix $A$ so that equation 8 can be used to estimate the true posterior probability:

$$\underbrace{P(\vec{Y} \mid X = x)}_{\text{adversary-aware PLL}} = Q^*\, \underbrace{P(Y \mid X = x)}_{\text{true posterior}}, \qquad Q^{*-1}\, P(\vec{Y} \mid X = x) = P(Y \mid X = x), \qquad (7)$$

$$T^{-1} A^{-1}\, \underbrace{P(\vec{Y} \mid X = x)}_{\text{adversary-aware PLL}} \approx \underbrace{P(Y \mid X = x)}_{\text{true posterior}}. \qquad (8)$$

In reality, estimating $Q^*$ accurately for each instance is a huge computational burden: $2^c - 2$ is an extremely large figure and grows exponentially with the label space. Therefore, we no longer estimate the true transition matrix $P(\vec{Y} \mid Y, Y', X)$ directly. Instead, we resort to instance embeddings in the form of soft labels to identify the adversary-aware partial label transition matrix $Q^*$. Specifically, we propose to use a soft pseudo label from the instance embedding (prototype), obtained via self-attention prototype learning, to approximate the adversary-aware transition matrix for each instance, since the true $Q^*$ cannot be obtained directly in the practical partial label problem (details in section 2.1). Since the adversary-aware partial label is influenced by the rival label noise, it is challenging to accurately estimate the class instance-independent transition matrix $T$ and the sparse matrix $A$ simultaneously when estimating the true posterior. Considering that $T$ is private and given, it is easier for us than for the adversary to approximate only $A$ to estimate the posterior probability. Equation 8 is implemented as the loss function in equation 17.
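A minimal sketch of the adversary-aware generation process in equation 3, under the stated assumptions: the rival is drawn from the given row of $T$, distractors are flipped independently (the per-label probabilities $p_{b'}$ of equation 5 collapse here to a single uniform rate `q`, an illustrative simplification), and the true label is always retained ($P_y = 1$). All names are illustrative.

```python
import numpy as np

def generate_adversary_aware_set(y, T, q=0.3, rng=None):
    """Sample ⃗Y per equation 3: true label y, rival y' ~ T[y], distractors."""
    rng = rng or np.random.default_rng()
    c = T.shape[0]
    y_rival = rng.choice(c, p=T[y])          # rival drawn from the given T row
    in_set = rng.random(c) < q               # uniform stand-in for p_{b'}
    in_set[y] = True                         # P_y = 1: keep the true label
    in_set[y_rival] = True                   # inject the rival as encryption
    return np.flatnonzero(in_set)

# usage with the make_rival_transition sketch from section 1.2:
# partial_set = generate_adversary_aware_set(y=3, T=make_rival_transition(10))
```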

1.3. POSITIVE SAMPLE SET

The construction of a positive sample set is used in contrastive learning to identify the transition matrix $P(\vec{Y} \mid Y', Y, X)$ via label disambiguation. Nonetheless, the performance of contrastive learning erodes drastically under the introduced rival, which manifests in a poorly constructed positive sample set and degenerated classification performance (see Figure 2). The adversary-aware loss function is therefore proposed in conjunction with contrastive learning to prevent this degeneration. To start with, we define the $L_2$-normalised embeddings $u$ and $z$ as the query and key latent features from the feature extraction network $f_\Theta$ and the key network $f'_\Theta$, respectively, so that $u_i = f_\Theta(\mathrm{Aug}_q(x_i)) \in \mathbb{R}^{1 \times d}$ and $z_i = f'_\Theta(\mathrm{Aug}_k(x_i)) \in \mathbb{R}^{1 \times d}$. In each mini-batch $\vec{D}_b \subset \vec{D}$, with $f(x_i)$ a neural network with a projection head of 128 feature dimensions, the query and key outputs are defined as

$$D_q = \{u_i = f(\mathrm{Aug}_q(x_i)) \mid x_i \in \vec{D}_b\}, \qquad D_k = \{z_i = f'(\mathrm{Aug}_k(x_i)) \mid x_i \in \vec{D}_b\},$$

where $S(x) = C \setminus \{q\}$ is the sample pool excluding the query set $q$, with $C = D_q \cup D_k \cup \text{queue}$. $D_q$ and $D_k$ are the vectorial embeddings of the query and key views of the current mini-batch, and the queue size is determined according to the input. The instances from $S(x)$ whose predicted label $\bar{y}'$ matches the query's predicted class $(\hat{y}_i = c)$ are chosen as the positive sample set, giving

$$N^+(x_i) = \{z' \mid z' \in S(x_i),\ \bar{y}' = (\hat{y}_i = c)\}.$$

Constructing a sufficiently accurate positive sample set $N^+(x)$ is vital, as it underpins the clustering effect of the latent embeddings in the contrastive learning procedure. The quality of this clustering relies on the precision of the prototypes $v_j$, $j \in \{1, \dots, C\}$. Our method maintains prototype precision using $T$, rendering better label disambiguation for contrastive learning when the rival is introduced. In the contrastive objective, the query embedding $u$ is multiplied with the key embedding $z$ and normalised against the remaining pool $C$. Overall, $N^+(x)$ facilitates the representation learning of the contrastive loss and the self-attention prototype learning for label disambiguation, i.e., a more accurate pseudo-labelling procedure. Our proposed loss ensures the prototype and contrastive learning work systematically and benefit mutually when the rival is introduced. The pseudo labels are generated according to equation 16, and we follow Wang et al. (2022) for the positive sample selection.
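The sketch below illustrates one way to implement this MoCo-style positive-set construction; tensor shapes, names, and the use of predicted classes as matching keys follow our reading of this section and Wang et al. (2022), not released code.

```python
import torch
import torch.nn.functional as F

def build_positive_sets(u, z, queue, queue_labels, pred_q, pred_k):
    """Select N+(x_i) for each query from the pool S(x) = C \\ {q}.

    u, z: (B, d) query/key embeddings from the Aug_q / Aug_k views.
    queue: (Q, d) momentum queue; pred_*: (B,)/(Q,) pseudo-label predictions.
    Returns the L2-normalised queries, the pooled keys, and a boolean
    mask (B, B+Q) marking pool entries whose predicted class matches the query's.
    """
    u = F.normalize(u, dim=1)
    pool = F.normalize(torch.cat([z, queue], dim=0), dim=1)   # S(x)
    pool_labels = torch.cat([pred_k, queue_labels], dim=0)
    pos_mask = pred_q.unsqueeze(1) == pool_labels.unsqueeze(0)
    return u, pool, pos_mask
```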

2. METHODOLOGY

The main task of partial label learning is label disambiguation, which aims to identify the true label among the candidate label sets. To this end, we present an adversarial teacher within momentum (ATM). Equation 17 debiases the prediction of $f(x)$ given the adversary-aware partial label via the class instance-independent transition matrix $T + I$. The unbiased prediction induces a more accurate positive sample set, which allows equation 18 to leverage the high-quality representation power of the positive samples to improve the classification performance.

2.1. PSEUDO LABEL LEARNERS VIA ADVERSARIAL TEACHER WITHIN MOMENTUM (ATM)

Unlike Wang et al. (2022), we present an adversarial teacher strategy with momentum update (ATM) to guide the learning of pseudo labels using equation 17, much like a tough teacher who tests students with deliberately hard material: the rival is the content we generate on purpose, while equation 17 checks the student's (classifier's) understanding within the scope of the testing content, which is $T$. Specifically, the spherical margin between prototype vectors $v_i \in S^{d-1}$ and $v_j \in S^{d-1}$ is defined as

$$m_{ij} = \exp(-v_i^\top v_j).$$

For prototype $v_i$, we define the normalised margin between $v_i$ and $v_j$ as

$$\tilde{m}_{ij} = \frac{\exp(-v_i^\top v_j)}{\sum_{j \neq i} \exp(-v_i^\top v_j)}.$$

For each $v_i$, $i \in \{1, \dots, K\}$, we perform a momentum update with the normalised margins to all $v_j$, $j \neq i$, acting as a regulariser. The resulting update rule is

$$v_i^{t+1} = \sqrt{1 - \alpha^2}\, v_i^t + \alpha \frac{g}{\|g\|_2},$$

where the gradient $g$ is given by

$$g = u - \beta \sum_{j \neq i} \tilde{m}_{ij}^t v_j^t,$$

with $u$ the query embedding whose prediction is class $i$ and $\tilde{m}_{ij}^t$ the normalised margin between prototype vectors at step $t$ (i.e., $v_j^t$, $j \neq i$). The prototype $v_c$ corresponds to each class, and the target prediction $q$ is updated as

$$q = \phi q + (1 - \phi) v, \qquad v_c = \begin{cases} 1 & \text{if } c = \arg\max_{j \in \vec{Y}} u^\top v_j \\ 0 & \text{otherwise,} \end{cases} \qquad (16)$$

where $q$ is subsequently used in equation 17. It is initialised as the uniform probability $q = \frac{1}{c}\mathbf{1}$ and updated according to equation 16; $\phi$ is the hyperparameter controlling the update of $q$.
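The following sketch shows one possible implementation of the ATM prototype update and the pseudo-label update of equation 16; the $\sqrt{1-\alpha^2}$ re-scaling and the final re-normalisation onto $S^{d-1}$ follow our reading of the update rule, and all names and default hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def atm_prototype_update(V, u, i, alpha=0.1, beta=0.5):
    """One ATM momentum step for prototype v_i on the unit sphere S^{d-1}.

    V: (K, d) unit-norm prototypes; u: (d,) query embedding predicted as
    class i. The normalised inter-prototype margins act as a repulsive
    regulariser on the gradient direction.
    """
    others = torch.arange(V.shape[0]) != i
    margins = torch.exp(-(V[i] @ V[others].T))        # m_ij = exp(-v_i^T v_j)
    m_norm = margins / margins.sum()                  # normalised margins
    g = u - beta * (m_norm.unsqueeze(1) * V[others]).sum(dim=0)
    V = V.clone()
    V[i] = (1.0 - alpha ** 2) ** 0.5 * V[i] + alpha * F.normalize(g, dim=0)
    V[i] = F.normalize(V[i], dim=0)                   # stay on the sphere
    return V

def update_pseudo_label(q, u, V, candidate_mask, phi=0.99):
    """Momentum pseudo-label update of equation 16, restricted to ⃗Y."""
    scores = (V @ u).masked_fill(~candidate_mask, float("-inf"))
    onehot = F.one_hot(scores.argmax(), num_classes=V.shape[0]).float()
    return phi * q + (1.0 - phi) * onehot
```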

2.2. ADVERSARY-AWARE LOSS FUNCTION

The goal is to build a risk-consistent loss function that achieves the same generalisation error as the supervised classification risk $R(f)$ with the same classifier $f$. To train the classifier, we minimise the following modified loss estimator, leveraging the updated pseudo label from the Adversarial teacher within momentum (ATM) distillation method and the transition-plus-identity matrix, $I \in [0,1]^{c \times c}$ with $I_{i,i} = 1$ for all $i \in [c]$ and $I_{i,j} = 0$ for $i \neq j$. Given $f(X) \in \mathbb{R}^{c}$,

$$\vec{L}(f(X), \vec{Y}) = -\sum_{i=1}^{c} q_i \log\big(((T + I) f(X))_i\big). \qquad (17)$$

The proof for the modified loss function is given in appendix lemma 4. In our case, a sufficiently accurate positive sample set from contrastive learning is used together with equation 17 to identify the transition matrix of the adversary-aware partial label. The contrastive loss is defined as

$$L(f(x), \tau, C) = \frac{1}{|D_q|} \sum_{u \in D_q} \left\{ -\frac{1}{|N^+(x)|} \sum_{z^+ \in N^+(x)} \log \frac{\exp(u^\top z^+ / \tau)}{\sum_{z' \in C(x)} \exp(u^\top z' / \tau)} \right\}. \qquad (18)$$

Finally, the Adversary-Aware Loss is expressed as

$$\text{Adversary-Aware Loss} = \lambda L(f(x_i), \tau, C) + \vec{L}(f(X), \vec{Y}). \qquad (19)$$

The proposed loss (equation 19) has two terms, equation 17 and equation 18. Equation 17 lessens prediction errors of $f(x)$ given the adversary-aware partial label; the debiasing is achieved via the class instance-independent transition matrix $T + I$ by down-weighting false predictions, and the unbiased prediction induces a more accurate positive sample set. Equation 18 is the contrastive loss, which leverages the high-quality representation power of the positive sample set to further improve classification performance.
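A compact sketch of the adversary-aware loss of equation 19, combining the forward-corrected cross entropy of equation 17 with a simplified version of the contrastive term of equation 18; batching details, the numerical clamp, and the pooled-denominator simplification are our assumptions.

```python
import torch
import torch.nn.functional as F

def adversary_aware_loss(logits, q_target, T, u, pool, pos_mask,
                         tau=0.07, lam=0.5):
    """Equation 19: forward-corrected CE (eq. 17) + contrastive term (eq. 18).

    logits: (B, c) classifier outputs f(X); q_target: (B, c) ATM pseudo
    labels; T: (c, c) rival transition matrix; u: (B, d) queries;
    pool: (P, d) keys/queue; pos_mask: (B, P) positives N+(x).
    """
    c = T.shape[0]
    probs = F.softmax(logits, dim=1)
    corrected = probs @ (T + torch.eye(c)).T          # ((T + I) f(X))_i
    ce = -(q_target * corrected.clamp_min(1e-12).log()).sum(dim=1).mean()

    sims = u @ pool.T / tau                           # (B, P) similarities
    log_prob = sims - torch.logsumexp(sims, dim=1, keepdim=True)
    n_pos = pos_mask.sum(dim=1).clamp_min(1)
    contrast = -((log_prob * pos_mask.float()).sum(dim=1) / n_pos).mean()
    return lam * contrast + ce
```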

3. THEORETICAL ANALYSIS

This section introduces the concepts of classifier consistency and risk consistency Xia et al. (2019); Zhang (2004), which are crucial in weakly supervised learning. Risk consistency is achieved if the risk function of the weakly supervised method matches the risk of fully supervised learning under the same hypothesis; risk consistency implies classifier consistency, meaning the classifier trained with partial labels converges to the optimal classifier of fully supervised learning.

Classifier-consistent risk estimator: learning with true labels. Denote the classifier $f(X) = (g_1(x), \dots, g_K(x))$, where $g_c(x)$ is the classifier for label $c \in [K]$ and the prediction $f_c(x)$ models $P(Y = c \mid x)$. We want to obtain the classifier $\hat{f}(X) = \arg\max_{i \in [K]} g_i(x)$. With a loss function measuring the loss of $f(X)$, the true risk is

$$R(f) = \mathbb{E}_{(X, Y)}\big[L(f(X), Y)\big]. \qquad (20)$$

The ultimate goal is to learn the optimal classifier $f^* = \arg\min_{f \in \mathcal{F}} R(f)$, for instance by enabling the empirical risk $\hat{R}_{pn}(f)$ to converge to the true risk $R(f)$. To obtain the optimal classifier, we must prove that the modified loss function is risk consistent, i.e., that it converges to the true loss function.

Learning with adversary-aware partial labels. An input $X \in \mathcal{X}$ has a candidate set $\vec{Y} \in \vec{\mathcal{Y}}$ containing only one true label $Y \in \vec{Y}$. Given the adversary-aware partial label $\vec{Y} \in \vec{\mathcal{Y}}$ and instance $X \in \mathcal{X}$, the risk objective is

$$\vec{R}(f) = \mathbb{E}_{(X, \vec{Y})}\big[\vec{L}(f(X), \vec{Y})\big]. \qquad (21)$$

Since the true adversary-aware partial label distribution $\vec{\mathcal{D}}$ is unknown, our goal is to approximate the optimal classifier on the sample distribution $\hat{\mathcal{D}}_{pn}$ by minimising the empirical risk

$$\hat{R}_{pn}(f) = \frac{1}{n} \sum_{i=1}^{n} \vec{L}(f(x_i), \vec{y}_i). \qquad (22)$$

Assumption 1. Following Yu et al. (2018), minimisation of the expected risk $R(f)$ under the clean population yields an optimal classifier that maps $f_i^*(X) = P(Y = i \mid X)$, $\forall i \in [c]$.

Under Assumption 1, we can conclude that $\vec{f}^* = f^*$ by applying Theorem 1 below.

Theorem 1. Assume the Adversary-Aware matrix $T_{y,y'}$ is full rank and Assumption 1 is met. Then the minimiser $\vec{f}^*$ of $\vec{R}(f)$ converges to the minimiser $f^*$ of $R(f)$, i.e., $\vec{f}^* = f^*$.

Remark. If $Q^*$ and $T_{y,y'}$ are estimated correctly, the empirical risk of the designed algorithm trained with adversary-aware partial labels converges to the expected risk of the optimal classifier trained with true labels. As the number of samples grows infinitely large, $\hat{f}_n$ converges to $\vec{f}^*$ given adversary-aware partial labels, and hence to the optimal classifier $f^*$ as claimed in Theorem 1. With the new generation procedure, the risk consistency theorems for the loss function are introduced.

Theorem 2. The proposed adversary-aware loss function is a risk-consistent estimator if it asymptotically converges to the expected risk given sufficiently good approximations of $Q$ and the adversary-aware matrix. The proof is in appendix lemma 4:

$$L(y, f(x)) = \sum_{\vec{y} \in \vec{\mathcal{Y}}_y} \sum_{y=1}^{C} \sum_{y' \in \mathcal{Y}'} \Big( P(Y = y \mid X = x) \prod_{b' \in \vec{y}} p_{b'} \prod_{t' \notin \vec{y}} (1 - p_{t'})\, T_{y,y'}\, \vec{L}(\vec{y}, f(x)) \Big) = \vec{L}(\vec{y}, f(x)).$$
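As a hedged numerical illustration of the consistency argument (a toy simulation we constructed, not an experiment from the paper), the sketch below builds the effective set-level transition matrix for $c = 3$ by marginalising the equation 5 set probabilities over the rival with a known $T$, and verifies that the clean posterior is recovered by inverting it, in the spirit of equation 8.

```python
import numpy as np
from itertools import combinations

c, q = 3, 0.4
rng = np.random.default_rng(1)
T = rng.random((c, c))
np.fill_diagonal(T, 0.0)
T /= T.sum(axis=1, keepdims=True)           # given rival transition matrix

# the 2^c - 2 nonempty proper candidate sets (full set excluded)
sets = [s for r in range(1, c) for s in combinations(range(c), r)]

def p_set(s, y, yp):
    """Eq. 5 style set probability with y and the rival y' forced into ⃗Y."""
    if y not in s or yp not in s:
        return 0.0
    extras = len(s) - 2                      # distractors beyond {y, y'}
    return q ** extras * (1.0 - q) ** (c - len(s))

# effective set-level transition, marginalised over the rival via T
M = np.array([[sum(T[y, yp] * p_set(s, y, yp) for yp in range(c) if yp != y)
               for y in range(c)] for s in sets])

p_clean = np.array([0.7, 0.2, 0.1])          # true posterior P(Y | x)
p_partial = M @ p_clean                      # observed adversary-aware view
p_rec, *_ = np.linalg.lstsq(M, p_partial, rcond=None)
assert np.allclose(p_rec, p_clean)           # posterior recovered (eq. 8)
```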

3.0.1. GENERALISATION ERROR

Define $R$ and $\hat{R}_{pn}$ as the true risk and the empirical risk, respectively, given the adversary-aware partial label dataset, and let $\hat{f}_{pn} = \arg\min_{f \in \mathcal{F}} \hat{R}_{pn}(f)$ be the empirical risk minimiser. Suppose a set of real-valued hypotheses $\mathcal{F}_{\vec{y}_k}$ with $f_i(X) \in \mathcal{F}$, $\forall i \in [c]$, and assume the loss function $\vec{L}(f(X), \vec{Y})$ is $L$-Lipschitz continuous with respect to $f(X)$ for all $\vec{y}_k \in \vec{\mathcal{Y}}$ and upper-bounded by $M$, i.e., $M = \sup_{x \in \mathcal{X}, f \in \mathcal{F}, \vec{y}_k \in \vec{\mathcal{Y}}} \vec{L}(f(x), \vec{y}_k)$. The expected Rademacher complexity of $\mathcal{F}_{\vec{y}_k}$ is denoted $\Re_n(\mathcal{F}_{\vec{y}_k})$ Bartlett & Mendelson (2002).

Theorem 3. For any $\delta > 0$, with probability at least $1 - \delta$,

$$R(\hat{f}_{pn}) - R(f^\star) \leq 4\sqrt{2}\, L \sum_{k=1}^{c} \Re_n(\mathcal{F}_{\vec{y}_k}) + M \sqrt{\frac{\log \frac{2}{\delta}}{2n}}.$$

As the number of samples approaches infinity, $n \to \infty$, $\Re_n(\mathcal{F}_{\vec{y}_k}) \to 0$ for hypotheses with bounded norm. Consequently, $R(\hat{f}_{pn}) \to R(f^\star)$ as the amount of training data grows infinitely large. The proof is given in Appendix Theorem 3.

4. EXPERIMENTS

Datasets. We evaluate the proposed method on three benchmarks, CIFAR-10, CIFAR-100 Krizhevsky et al. (2009), and the fine-grained CUB200 Wah et al. (2011), with both general partial label and adversary-aware partial label datasets.

Main Empirical Results for CIFAR10. All classification accuracies are shown in Table 1. We compare classification results on CIFAR-10 against previous works Wang et al. (2022); Lv et al. (2020); Wen et al. (2021) using the Adversarial teacher within momentum (ATM). The method shows consistently superior results in all learning scenarios with $q \in \{0.3, 0.5\}$ for adversary-aware partial label learning. More specifically, the proposed method achieves 8.17% higher classification performance at a 0.5 partial rate than the previous state-of-the-art Wang et al. (2022), and comparable results at the 0.1 and 0.3 partial rates. The CIFAR-10 experiments were repeated four times with four random seeds.

Main Empirical Results for CUB200 and CIFAR100. The proposed method shows superior results for adversary-aware partial labels, especially in the more challenging learning tasks, such as the 0.1 partial rate on CUB200 and CIFAR-100. On CUB200, we obtain a 5.95% improvement at partial rate 0.1, and 1.281% and 0.37% improvements at partial rates 0.05 and 0.03, respectively. On CIFAR-100, the method achieves 6.06%, 0.4181%, and 0.5414% higher classification margins at partial rates 0.1, 0.05, and 0.03. These experiments were repeated five times with five random seeds.

5. CONCLUSION AND FUTURE WORKS

This paper introduces a novel Adversary-Aware partial label learning problem. The new problem setting takes local data privacy protection into account: we add the rival to the partial label candidate set as an encryption of the dataset. Nonetheless, the generation process makes the intractable transition matrix even more complicated, leading to an inconsistency issue. Therefore, the novel adversary-aware loss function and the self-attention prototype are proposed. The method is proven to be classifier-consistent and shows superior performance. Future work will use variational inference methods to approximate the intractable transition matrix.




Figure 1: An overview of the proposed method. A general partial label can be disclosed to an adversary. The initial training stage performs positive sample selection. We assume T is given.


Figure 2 shows experimental comparisons on CUB200 between the adversary-aware loss function and the previous loss function, before and after the momentum update. Given equation 17, the uncertainty of the transition matrix $Q$ is reduced, providing a good initialisation for the positive set selection; this warm start plays a vital role in improving the performance of contrastive learning. Once a good set of positive samples is available, the prototype accuracy is enhanced. Subsequently, the clustering effect and the high-quality representation power of the positive sample set in the contrastive loss function further improve classification performance.

Figure 2: Top-1 and prototype accuracy of the proposed method and the method of Wang et al. (2022) on CUB200, comparing adversary-aware losses.


Table 1: Benchmark dataset accuracy comparisons. Superior results are indicated in bold. Our proposed method shows comparable results to fully supervised learning and outperforms previous methods in the more challenging learning scenarios, such as partial rate 0.5 (CIFAR-10) and 0.1 (CIFAR-100, CUB200). The hyperparameter $\alpha$ is set to 0.1 for our method. The symbol * indicates the adversary-aware partial label dataset; columns correspond to partial rates $q^* = 0.03 \pm 0.02$, $0.05 \pm 0.02$, $0.1 \pm 0.02$ and $q = 0.1 \pm 0.02$, $0.3 \pm 0.02$, $0.5 \pm 0.02$.

