SOLVING PARTIAL LABEL LEARNING PROBLEM WITH MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Partial label learning (PLL) deals with classification problems in which each training instance is given a set of candidate labels instead of the single true label. As a weakly supervised learning problem, the main target of PLL is to discover latent relationships within training samples and to use this information to disambiguate noisy labels. Many existing methods choose the nearest neighbors of each partially labeled instance in an unsupervised way, so the resulting instance similarities can be empirically non-optimal and unrelated to the downstream classification task. To address this issue, we propose a novel multi-agent reinforcement learning (MARL) framework which models the connection between each pair of training samples as a reinforcement learning (RL) agent. We use an attention-based graph neural network (GNN) to learn the instance similarity, and adaptively refine it using a deterministic policy gradient approach until a pre-defined scoring function is optimized. Unlike two-stage and alternating optimization algorithms, whose training procedures are not end-to-end, our RL-based approach directly optimizes the objective function and estimates the instance similarities more precisely. The experimental results show that our method outperforms state-of-the-art competitors, achieving higher classification accuracy on both synthetic and real datasets.

1. INTRODUCTION

PLL, also known as superset label learning (Liu & Dietterich, 2014; 2012; Gong et al., 2017), has been extensively studied in the past few decades. As a typical weakly supervised learning problem, PLL assumes that most training instances are partially labeled and that their ground-truth labels are unknown. More specifically, each instance is associated with a small set of candidate labels that includes the ground truth. PLL has been applied in diverse fields, including web mining (Luo & Orabona, 2010), facial age estimation (Zhang et al., 2016), photograph captioning (Duygulu et al., 2002; Barnard et al., 2003; Berg et al., 2004; Gallagher & Chen, 2007) and image annotation (Cour et al., 2011; Zeng et al., 2013). Ambiguous labels are usually much easier to obtain than exact ground truths, because accurately labeling each instance is costly and labor-intensive. For example, natural photographs collected in the real world may contain multiple human faces and are often tagged ambiguously with several potential names in the captions. The goal is to precisely match the persons in each image with the names, and to learn a robust classification model that generalizes to unseen instances.

How to distinguish ambiguous labels in the training set and recover the true labels plays an important role in developing efficient and robust PLL methods. One main class of PLL methods, such as LSB-CMM (Liu & Dietterich, 2012), M3PL (Yu & Zhang, 2016) and PL-SVM (Nguyen & Caruana, 2008), directly fits the classifier with traditional machine learning models. These methods ignore the relationships between training instances, which leads to unreliable label disambiguation results.
Some more advanced methods, such as PL-KNN (Hüllermeier & Beringer, 2006), IPAL (Zhang & Yu, 2015) and PL-LEAF (Zhang et al., 2016), utilize the correlations between training instances provided by an underlying similarity graph learned in an unsupervised manner. These methods have achieved desirable empirical performance, but still suffer from some common issues. First, a similarity graph generated by an unsupervised approach may be non-optimal because it is learned independently of the main classification task. Second, the label prediction of each test example is voted on by its neighbors in the training set, which makes the prediction highly dependent on the selection of neighboring instances. Unfortunately, there is usually no guidance for choosing a proper size of the neighboring set in practice, and the prediction results can be significantly biased when unrelated or wrongly labelled training instances are included as voters. Although some more recent studies such as AGGD (Wang et al., 2019) employ supervised methods to model the instance similarities, the neighbor selection issue has not been fully addressed: they initialize the neighboring set in an unsupervised way and only update the similarity measurements at specific locations in the subsequent optimization step. In addition, these methods need to limit the total number of neighbors, so some training instances that would help improve the prediction accuracy are ignored. Furthermore, the complicated optimization problems posed by these graph-based methods are difficult to solve in practice.

In this work, we propose a novel end-to-end Partial Label learning method using Multi-Agent Reinforcement Learning, called PLRL. Under the MARL setting, each pair of training instances is treated as an individual agent and its action is defined as the similarity measurement between these two instances.
All the n training instances form the node set of an underlying similarity graph, and the learned optimal policy quantifies the closeness between any two nodes, which is the edge weight. Unlike traditional two-stage or alternating optimization approaches, our PLRL method directly optimizes a pre-defined score function and adaptively updates the instance similarities in a supervised way, which makes the learned similarity graph more related to the main classification task. Specifically, we use GNNs to learn the similarity graph which maximizes a total reward shared by all the $n^2$ agents. With the similarity graph, an estimated probability distribution over the candidate labels for each test example is obtained through label propagation. Then we combine the estimated probabilities with a kernel ridge regression model to carry out the label disambiguation. The MARL framework together with the classification model is trained end-to-end using policy gradients.

The main contributions of this paper are summarized as follows.
• We introduce a novel MARL framework to quantify the similarities between training instances and impose no limits on the number of nearest neighbors.
• Different from traditional two-stage or alternating optimization methods, we employ an end-to-end approach to jointly perform prediction and label disambiguation, which makes the learned instance similarities more related to the main classification task.
• We use policy gradient to efficiently solve the complicated optimization problem, which cannot be easily handled by previous studies.
• Experimental results on both synthetic and real datasets show that our method outperforms existing PLL methods with higher classification accuracies in most scenarios, especially in cases with imbalanced samples.

2. PROBLEM STATEMENT

In PLL, a partially labelled training set $D = \{(x_i, C_i) \mid 1 \le i \le n\}$ is given, where $x_i = (x_{i1}, \ldots, x_{id})^\top \in \mathcal{X}$ is a $d$-dimensional instance and $C_i \subseteq \mathcal{Y}$ is the candidate label set, among which only one label is assumed to be valid. $\mathcal{Y} = \{y_1, y_2, \ldots, y_q\}$ here denotes a label space with $q$ classes. Let $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ be the normalized input data matrix. The target of PLL is to induce a multi-class classifier $g(\cdot) : \mathcal{X} \to \mathcal{Y}$ from the ambiguous label information in $D$, so as to precisely classify partially labelled and unseen instances.

In this work, we consider an undirected weighted graph $G = (V, W)$ among instances. $V = \{x_i \mid 1 \le i \le n\}$ is the set of vertices. $W = [w_{ij}]_{n \times n}$ refers to the non-negative weighted adjacency matrix, where $w_{ij} \in [0, 1]$ measures how close the two instances $x_i$ and $x_j$ are to each other, with a larger value meaning higher correlation. A novel RL-based approach is used to estimate $W$.

For each training example $(x_i, C_i)$, we aim to generate a normalized real-valued vector $f_i \in \mathbb{R}^q$, where each $f_{ij}$ represents the label confidence of the $j$-th label being the true label of $x_i$. The label confidence vector $f_i$ satisfies the following constraints: (i) $\sum_{j=1}^{q} f_{ij} = 1$ for any $1 \le i \le n$, (ii) $f_{ij} \ge 0$ for any $y_{ij} = 1$, and (iii) $f_{ij} = 0$ for any $y_{ij} = 0$, where $y_{ij} = 1$ if $y_j \in C_i$ and $y_{ij} = 0$ otherwise. The second and third constraints indicate that the potential ground-truth label resides in the candidate label set. Once the label confidence of the training set $X$ is generated as $F = [f_1, \ldots, f_n]^\top \in \mathbb{R}^{n \times q}$, a classifier $g(\cdot) : \mathcal{X} \to \mathcal{Y}$ can be induced from these disambiguation results.

3. A MULTI-AGENT RL FRAMEWORK

In this paper, we model the PLL problem via a MARL framework. Each agent learns the similarity $w_{ij}$ between a pair of training instances by fully exploiting the input features and the labelling uncertainty. Specifically, we consider a multi-agent extension of partially observable Markov games, which is defined by a set of states $S$ describing the possible configurations of all agents, a set of actions $A_1, \ldots, A_N$ and a set of observations $O_1, \ldots, O_N$, one for each agent. Each agent $i$ chooses its action following a policy $\pi_i : O_i \to A_i$. We use neural networks to parameterize the policy of each agent, which outputs the similarity between the two corresponding instances. The private observation of each agent $i$ is correlated with the overall state, such that $o_i : S \to O_i$. Since there are $n$ instances, the total number of agents is $N = n^2$. All the $n^2$ agents are fully cooperative and maximize the total reward $r : S \times A \to \mathbb{R}$. In addition, we define the action-value function $Q^{\pi}(s, a_1, \ldots, a_N) = \mathbb{E}[R(s, a) \mid s_0 = s, a_0 = a]$, where $a = [a_1, \ldots, a_N]$ and $\pi = (\pi_1, \ldots, \pi_N)^\top$. In the context of PLL, we highlight the following specifics.

State. The entire state space is constructed from the contexts of the $n$ training instances, i.e. $S = \{x_1, \ldots, x_n\}$. Each agent only observes the contexts of its two related instances, i.e. $O_{ij} = \{x_i, x_j\}$.

Action. For each agent, its action $a_{ij} \in [0, 1]$ measures the similarity between instances $x_i$ and $x_j$, which represents the edge weight in the similarity graph determined by the policy $\pi_{ij}(x_i, x_j)$. A larger value of $a_{ij}$ implies a higher similarity between $x_i$ and $x_j$. In particular, if $a_{ij} = 0$, the two instances very likely belong to different classes of $\mathcal{Y}$. For each agent, we employ a neural network that takes the observation $O_{ij} = \{x_i, x_j\}$ as input and outputs the action $a_{ij} \in [0, 1]$; its parameters are shared by all the $N$ agents.
To ensure that the similarity measurement is undirected, i.e. $a_{ij} = a_{ji}$, we borrow the idea of the attention mechanism (Vaswani et al., 2017) and obtain the action as

$$a_{ij}(\theta) = \pi_{ij}(x_i, x_j) = \mathrm{Sigmoid}\Big( (W_Q \phi(x_i))^\top W_K \phi(x_j) + (W_K \phi(x_i))^\top W_Q \phi(x_j) \Big), \tag{1}$$

where $\phi : \mathbb{R}^{p} \to \mathbb{R}^{p'}$ encodes each $x_i$ into a $p'$-dimensional embedding $\phi(x_i)$ (here we use a fully connected layer with ReLU activation), and $W_Q, W_K \in \mathbb{R}^{q \times p'}$ denote the query and key parameter matrices, respectively. $\theta$ includes all the parameters to be learned. The network architecture in Eq. (1) is permutation invariant: the output is independent of the order of the two inputs $x_i$ and $x_j$. With all the $N$ actions taken by the $N$ agents at each training epoch, we build a normalized weighted adjacency matrix $W = [w_{ij}]_{n \times n}$ with $w_{ij} = a_{ij} / \sum_{k} a_{ik}$. All weight values in $W$ are updated during the optimization, except that the diagonal entries are set to zero to avoid self-loops. Different from previous graph-based methods, we allow all the training instances to take part in the prediction of each test example, and their importance is optimized via the MARL design.

Reward. Since all the $N$ agents are fully cooperative, we define a single reward function to evaluate the learned similarities $W$ between the $n$ instances:

$$-R = \underbrace{\sum_{j=1}^{n} \Big\| f_j - \sum_{i=1}^{n} w_{ij} f_i \Big\|^2}_{\text{(i): LSE of label confidence}} + \mu \underbrace{\sum_{j=1}^{n} \big\| g(x_j) - f_j \big\|^2}_{\text{(ii): LSE of classifier}} + \eta \underbrace{\sum_{j=1}^{n} \Big\| x_j - \sum_{i=1}^{n} w_{ij} x_i \Big\|^2}_{\text{(iii): LSE of feature}} - \beta \underbrace{\sum_{j=1}^{n} \log p(\hat{y}_j \mid x_j)}_{\text{(iv): log-likelihood}} \tag{2}$$

The first three terms in (2) are similar to those given in Wang et al. (2019), while the last term is new. Term (iii) measures how well each $x_j$ is represented by all the $n$ training instances based on $W$. Term (i) exploits the smoothness assumption that the graph structure in the feature space should be preserved in the label space.
$F$ here is the label confidence of the $n$ training instances, which is a function of $W$ and can be obtained through disambiguation. For term (ii), we expect to minimize the error of the classifier $g(\cdot)$. As shown in Section 4.2, $g(\cdot)$ is determined by $F$ and thus can also be treated as a function of $W$. In particular, we add a log-likelihood term (iv) to facilitate convergence. $\hat{y}_j$ denotes the label prediction of training instance $x_j$, i.e. $p(\hat{y}_j \mid x_j) = \max_{l} f_{jl}$, where $f_{jl}$ is the $(j, l)$-th element of $F$. Since the four terms in (2) all depend on the weighted adjacency matrix $W$, the negative reward can serve as an objective function to optimize $W$.
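The symmetric attention score of Eq. (1) and the row normalization that turns actions into edge weights can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the embeddings $\phi(x_i)$ are taken as given, the helper names (`action`, `similarity_matrix`) and the toy matrices are hypothetical, and the diagonal is assumed to be zeroed before normalization, which the text does not specify.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def action(phi_i, phi_j, W_Q, W_K):
    """Attention score a_ij in the form of Eq. (1); swapping i and j gives the same value."""
    q_i, k_i = matvec(W_Q, phi_i), matvec(W_K, phi_i)
    q_j, k_j = matvec(W_Q, phi_j), matvec(W_K, phi_j)
    return sigmoid(dot(q_i, k_j) + dot(k_i, q_j))


def similarity_matrix(embeddings, W_Q, W_K):
    """Row-normalised weights w_ij = a_ij / sum_k a_ik with a zero diagonal.

    Assumption: self-loops are removed before normalising each row.
    """
    n = len(embeddings)
    A = [[0.0 if i == j else action(embeddings[i], embeddings[j], W_Q, W_K)
          for j in range(n)] for i in range(n)]
    return [[a / sum(row) for a in row] for row in A]


# Toy embeddings phi(x_i) and hypothetical query/key matrices.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.2, 0.4], [-0.3, 0.1]]
W = similarity_matrix(E, W_Q, W_K)
```

Note that the actions are symmetric while the row-normalized weights need not be, since each row is divided by its own sum.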

4. METHODOLOGY

In this section, we describe the implementation details of the PLRL algorithm. At each training epoch, we apply a disambiguation method to obtain the label confidence $F$, which is then used to learn the classifier $g(\cdot)$. With the obtained $F$ and $g(\cdot)$, we compute the reward function in (2), use policy gradient to train all the $N$ agents, and dynamically update the weighted adjacency matrix $W$ until the optimal reward $R^*$ is achieved. More details can be found in Sections 4.1, 4.2 and 4.3. With the optimal adjacency matrix $W^*$, we finally predict the class label of each unseen instance using kernel ridge regression (KRR) with the optimal model parameters, as described in Section 4.4. The flow chart of PLRL is summarized in Figure 1 and the detailed algorithm is provided in Supplement A.

For the disambiguation step (Section 4.1), the initial label confidence matrix $P = [p_{i,j}]_{n \times q}$ is given by

$$p_{i,j} = \begin{cases} 1/|C_i|, & \text{if } y_j \in C_i \\ 0, & \text{otherwise} \end{cases} \qquad \forall\, 1 \le i \le n.$$

This initialization step distributes the label confidence of $x_i$ equally over all its candidate classes. Then, at the $t$-th iteration, the label confidence is updated by propagating label information along $W$:

$$\tilde{F}^{(t)} = \alpha \cdot W^\top F^{(t-1)} + (1 - \alpha) \cdot P,$$

where the hyperparameter $\alpha \in (0, 1)$ balances the contribution of the inherited label information $W^\top F^{(t-1)}$ and the initial label confidence $P$. Then $\tilde{F}^{(t)}$ is normalized to obtain $F^{(t)}$, i.e.

$$f^{(t)}_{i,j} = \begin{cases} \dfrac{\tilde{f}^{(t)}_{i,j}}{\sum_{y_l \in C_i} \tilde{f}^{(t)}_{i,l}}, & \text{if } y_j \in C_i \\ 0, & \text{otherwise} \end{cases} \qquad \forall\, 1 \le i \le n.$$

After $T$ iterations, we take $F^{(T)}$ as the final label confidence and adopt class mass normalization (CMN) (Zhu & Goldberg, 2009) to adjust the disambiguation output towards the class prior distribution, which handles unbalanced samples well. We let $F = M \circ F^{(T)}$, where $\circ$ denotes the Hadamard product and $M = [m, \ldots, m]^\top_{n \times q}$ is composed of the vector

$$m = \Big( \frac{m_1}{\hat{m}_1}, \ldots, \frac{m_q}{\hat{m}_q} \Big)^\top_{q \times 1}, \qquad m_j = \sum_{i=1}^{n} p_{i,j}, \quad \hat{m}_j = \sum_{i=1}^{n} f^{(T)}_{i,j}, \quad j = 1, \ldots, q.$$
With this disambiguation approach, we get the label confidence matrix F which is then used to compute terms (i) and (iv) in the objective function (2) to update W.
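The disambiguation steps above (uniform initialization, iterative propagation, normalization over the candidate set, and CMN) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: labels are encoded as integers 0..q-1, the function names are hypothetical, and starting the iteration from $F^{(0)} = P$ is an assumption.

```python
def initial_confidence(candidates, q):
    """P: spread each instance's confidence uniformly over its candidate labels."""
    return [[1.0 / len(c) if j in c else 0.0 for j in range(q)] for c in candidates]


def propagate(W, P, candidates, alpha, T):
    """Iterate F_tilde = alpha * W^T F + (1 - alpha) * P, renormalising each row
    over its candidate set after every step."""
    n, q = len(P), len(P[0])
    F = [row[:] for row in P]  # assumption: the iteration starts from F^(0) = P
    for _ in range(T):
        F_tilde = [[alpha * sum(W[k][i] * F[k][j] for k in range(n))
                    + (1.0 - alpha) * P[i][j] for j in range(q)]
                   for i in range(n)]
        F = []
        for i in range(n):
            # mass > 0 because (1 - alpha) * P is positive on the candidate set
            mass = sum(F_tilde[i][l] for l in candidates[i])
            F.append([F_tilde[i][j] / mass if j in candidates[i] else 0.0
                      for j in range(q)])
    return F


def class_mass_normalise(F, P):
    """CMN: rescale column j by m_j / m_hat_j (prior mass over propagated mass)."""
    n, q = len(F), len(F[0])
    m = [sum(P[i][j] for i in range(n)) for j in range(q)]
    m_hat = [sum(F[i][j] for i in range(n)) for j in range(q)]
    return [[F[i][j] * m[j] / m_hat[j] if m_hat[j] > 0 else 0.0
             for j in range(q)] for i in range(n)]


# Toy example: three instances, three classes, a fully connected graph.
candidates = [{0, 1}, {0}, {1, 2}]
W = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
P = initial_confidence(candidates, q=3)
F_T = propagate(W, P, candidates, alpha=0.5, T=10)
F = class_mass_normalise(F_T, P)
```

The values of `alpha` and `T` in the toy call are arbitrary; in practice they are hyperparameters of the method.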

4.2. OBTAINING g(•): KERNEL RIDGE REGRESSION MODEL

Now, we introduce the classifier, which makes predictions for unlabelled instances using the label confidence $F$ obtained in Section 4.1. We consider a reproducing kernel Hilbert space (RKHS) and let $\psi(\cdot) : \mathbb{R}^{d} \to \mathbb{R}^{h}$ denote the (implicit) nonlinear feature mapping that projects the original feature space to a higher-dimensional RKHS $\mathcal{H}_K$ via a Gaussian kernel function $\kappa(x_i, x_j) = \exp\big(-\|x_i - x_j\|_2^2 / (2\sigma^2)\big)$, where $\sigma$ is the average pairwise distance between training instances. We model the classifier via a kernel regression model with a generalized ridge penalty. As pointed out by Wahba (1990), a general solution to this problem is finite-dimensional and can be written as

$$g(x) = \sum_{i=1}^{n} u_i\, \kappa(x, x_i) + b,$$

where $b$ and $u_i$, $i = 1, \ldots, n$, are parameter vectors. Let $K = \psi(X)\psi(X)^\top$ be the kernel matrix with elements $k_{ij} = \kappa(x_i, x_j) = \psi(x_i)^\top \psi(x_j)$, and let $U = [u_1, \ldots, u_n]^\top \in \mathbb{R}^{n \times q}$ contain the combined weights of all the $n$ instances. The classifier is obtained by fitting the following kernel ridge regression (KRR),

$$\min_{U, b}\ \big\| KU + \mathbf{1} b^\top - F \big\|_F^2 + \lambda\, \mathrm{tr}\big( U^\top K U \big), \tag{8}$$

where $\mathbf{1}$ denotes a vector of ones, $\|\cdot\|_F$ represents the Frobenius norm, and $\mathrm{tr}(\cdot)$ is the trace. Setting the gradients of (8) with respect to $U$ and $b$ to zero, we obtain the solution

$$U = \Big( K + \lambda I - \frac{\mathbf{1}\mathbf{1}^\top K}{n} \Big)^{-1} \Big( F - \frac{\mathbf{1}\mathbf{1}^\top F}{n} \Big), \qquad b = \frac{1}{n} \big( F^\top \mathbf{1} - U^\top K^\top \mathbf{1} \big), \tag{9}$$

which are then used to compute term (ii) in the objective function (2).
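For readers who want to check the closed form, the step from the objective (8) to the solution (9) can be spelled out as below. This is a sketch of the standard argument rather than text from the paper; it uses the symmetry of $K$ and $\mathbf{1}^\top \mathbf{1} = n$, and additionally assumes that $K$ is invertible so that it can be cancelled in the stationarity condition for $U$.

```latex
\begin{align*}
\frac{\partial}{\partial b}:\quad
  & 2\,\big(KU + \mathbf{1}b^\top - F\big)^\top \mathbf{1} = 0
  \;\Longrightarrow\;
  b = \tfrac{1}{n}\big(F^\top \mathbf{1} - U^\top K^\top \mathbf{1}\big), \\[4pt]
\frac{\partial}{\partial U}:\quad
  & 2\,K\big(KU + \mathbf{1}b^\top - F\big) + 2\lambda K U = 0
  \;\Longrightarrow\;
  (K + \lambda I)\,U + \mathbf{1}b^\top = F, \\[4pt]
\text{with } b^\top = \tfrac{1}{n}\big(\mathbf{1}^\top F - \mathbf{1}^\top K U\big):\quad
  & \Big(K + \lambda I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top K\Big)\,U
    = F - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top F .
\end{align*}
```

Inverting the matrix on the left-hand side of the last line gives the expression for $U$, and substituting $U$ back into the first line gives $b$.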

4.3. UPDATING W: POLICY GRADIENT

At each training epoch, agent $(i, j)$ receives its observation $O_{ij}$ and outputs the similarity $w_{ij}$ between $x_i$ and $x_j$. With the $N$ obtained $w_{ij}$'s, we carry out the disambiguation step in Section 4.1 and fit the prediction model given in Section 4.2. Then the reward $R$ can be obtained. Let the expected reward $J(\pi_\theta) = \mathbb{E}_{s \in S,\, a \sim \pi_\theta}[R]$ be the objective function of PLRL, where the detailed formulation of $R$ is given by (2) and the computations of $F$ and $g(\cdot)$ are described in Sections 4.1 and 4.2. All four components in (2) work together and jointly affect the estimation of $W$, which makes the learned similarity graph related to the classification task, so that the neighboring information can be well utilized to improve the prediction accuracy. The gradient $\nabla_\theta J(\pi_\theta)$ is obtained by the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) as

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \pi_\theta(s)\, \nabla_a Q^{\pi}(s, a_1, \ldots, a_N)\big|_{a_j = \pi_j(o_j),\ j = 1, \ldots, N},$$

where $s = X$ and $o_{il} = \{x_i, x_l\}$, $1 \le i, l \le n$. In practice, we can pre-train the model and properly initialize the adjacency matrix $W$, which helps shorten the training process of the PLRL algorithm. A good starting point for $W$ can be obtained with an existing unsupervised method such as k-NN (Zhang & Yu, 2015). The proposed PLRL algorithm can then further improve the prediction accuracy by refining $W$ in a supervised way.
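The negative reward in (2), which drives this update, can be evaluated with a few lines of plain Python once $W$, $F$ and the classifier outputs are available. The sketch below is illustrative only: `G` is assumed to hold the classifier outputs $g(x_j)$ as rows, the function names are hypothetical, and the gradient step itself (handled by automatic differentiation in practice) is not shown.

```python
import math


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(a * a for a in v)


def weighted_sum(W, rows, j):
    """sum_i W[i][j] * rows[i]: the graph-based reconstruction of item j."""
    n, dim = len(rows), len(rows[0])
    return [sum(W[i][j] * rows[i][d] for i in range(n)) for d in range(dim)]


def negative_reward(W, X, F, G, mu, eta, beta):
    """Evaluate -R from Eq. (2) term by term."""
    total = 0.0
    for j in range(len(X)):
        f_rec = weighted_sum(W, F, j)
        x_rec = weighted_sum(W, X, j)
        total += sq_norm([a - b for a, b in zip(F[j], f_rec)])        # term (i)
        total += mu * sq_norm([a - b for a, b in zip(G[j], F[j])])    # term (ii)
        total += eta * sq_norm([a - b for a, b in zip(X[j], x_rec)])  # term (iii)
        total -= beta * math.log(max(F[j]))                           # term (iv)
    return total


# Two identical instances with identical, fully confident labels: -R is zero.
W = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [1.0, 0.0]]
F = [[1.0, 0.0], [1.0, 0.0]]
value = negative_reward(W, X, F, G=F, mu=1.0, eta=0.5, beta=0.05)
```

Term (iv) uses $p(\hat{y}_j \mid x_j) = \max_l f_{jl}$, so each row of `F` must contain at least one positive entry.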

4.4. PREDICTION

As the DPG algorithm converges, we obtain the optimal similarity graph $W^*$ and the optimal label confidence matrix $F^*$. The final prediction for each unseen instance $x^*$ is given by

$$y^* = \arg\max_{j}\ \sum_{i=1}^{n} u^*_{ij}\, \kappa(x^*, x_i) + b^*_j,$$

where $u^*_{ij}$ is the entry in the $i$-th row and $j$-th column of $U^*$ and $b^*_j$ is the $j$-th element of $b^*$. Both $U^*$ and $b^*$ are obtained from (9) using the optimal $W^*$ and $F^*$.
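The prediction rule above amounts to a kernel-weighted score per class followed by an argmax. A minimal Python sketch, assuming the fitted parameters `U` and `b` and the kernel width `sigma` are already available (the function names and toy values are hypothetical):

```python
import math


def gaussian_kernel(x, z, sigma):
    """kappa(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict(x_new, X_train, U, b, sigma):
    """Return the class index j maximising sum_i U[i][j] * kappa(x_new, x_i) + b[j]."""
    k = [gaussian_kernel(x_new, x_i, sigma) for x_i in X_train]
    scores = [sum(k[i] * U[i][j] for i in range(len(X_train))) + b[j]
              for j in range(len(b))]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy model: two training points, each voting for its own class.
X_train = [[0.0, 0.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
label = predict([0.1, 0.0], X_train, U, b, sigma=1.0)
```

In the toy model the query point lies next to the first training point, so the first class receives the larger score.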

5. EXPERIMENTS

In this section, we conduct both synthetic and real-world experiments to demonstrate the advantages of the proposed PLRL algorithm in solving PLL problems. PLRL is compared with seven SOTA PLL algorithms, whose parameters are tuned as suggested in the literature. The configurations of these methods are summarized in Table 1 and the details are provided in Supplement B. For PLRL, the optimal choices of the regularization coefficients are µ = 1, η = 0.5 and β = 0.05 according to the cross-validation results. More analysis of the parameter sensitivity is given in Supplement E.1. For each method, we perform five-fold cross-validation and report the mean accuracy together with its standard deviation. In addition, we apply the t-test at the 0.05 significance level to assess the performance of the proposed method.

Method                                   Yes/No   Configuration
PLRL                                     …        …95, T = 100, λ = 0.05, µ = 1, η = 0.5, β = 0.05
AGGD (Wang et al., 2019)                 Yes      k = 10, T = 10, λ = 1, µ = 1, γ = 0.05
IPAL (Zhang & Yu, 2015)                  Yes      k = 10, α = 0.95, T = 100
PL-KNN (Hüllermeier & Beringer, 2006)    Yes      k = 10
SURE (Feng & An, 2019a)                  No       λ = 0.5, β = 0.05
LSB-CMM (Liu & Dietterich, 2012)         No       σ² = 1, K = 80, α = 0.05
CLPL (Cour et al., 2011)                 No       SVM with the squared hinge loss
PL-SVM (Nguyen & Caruana, 2008)          No       λ = 0.01

Our implementation is based on PyTorch (Paszke et al., 2019) and all the experiments were carried out on NVIDIA Tesla P100 GPUs. When handling large-scale datasets, we employ a batch training strategy to cope with the limited memory. Specifically, we randomly split the whole dataset into several parts and iteratively feed them into the GPUs when training the GNN model. Then we use the trained model to make predictions for all training instances. In practice, our PLRL method takes nearly the same computation time as other recently proposed parametric methods, including SURE and AGGD. For small datasets with fewer than 2000 samples, the whole training procedure takes about 1 to 10 minutes, and for large datasets with more than 10000 samples, it takes about 1 to 3 hours.
Considering the performance gain of PLRL in practice, the training cost is acceptable.

5.1. SIMULATION STUDY

To show that the MARL design can better capture the instance similarities and thus improve the prediction accuracy, we design a simulation experiment and compare our PLRL algorithm with three SOTA graph-based methods: AGGD, IPAL and PL-KNN. We set the sample size n to 110, the feature dimension d to 6, the number of classes q to 5, and the number of wrong labels in each candidate set to r ∈ {1, 2, 3}. Each of the six instance features is generated from a Gaussian distribution, a uniform distribution or a binomial distribution. We randomly divide the n samples into q classes, and the features of instances belonging to different classes follow different distributions. The detailed distribution settings are summarized in Table 7 of the Supplements. The data generating procedure is as follows:

1. Randomly assign one of the q class labels to each of the n instances as the true label y.
2. For each instance, generate data features $X^{(y)}$ according to Table 7 of the Supplements.
3. Consider an underlying connective graph and simulate the adjacency matrix $\bar{A} = [\bar{a}_{ij}]_{n \times n}$ by $\bar{a}_{ij} = \mathbb{1}\{y_i = y_j\}\, \mathbb{1}\{\kappa_{ij} > 0.75\}$ for $1 \le i, j \le n$, which measures the similarities between the n training instances. Here $\kappa_{ij}$ is the Gaussian kernel $\kappa(x_i, x_j) = \exp\big(-\|x_i - x_j\|_2^2 / (2\sigma^2)\big)$, where $\sigma$ is the average pairwise distance between all instances.
4. For each instance, simulate the r partial labels by randomly selecting r labels from the q - 1 candidate labels that are not equal to y.

We repeat this data generating procedure 40 times and conduct five-fold cross-validation each time to report the mean prediction accuracy. Due to the small sample size of the generated dataset, we set the total number of training epochs to 200 for each method. For the three competitors, we plot their prediction results under different choices of k, which is the total number of neighbours of each training instance when building the underlying similarity graph. The prediction performance of all four methods is visualized in Figure 2(a), where the 40 replicates are sorted in ascending order of classification accuracy. As Figure 2(a) shows, PLRL performs consistently well, while the classification accuracy of the other three varies considerably with the choice of k across the 40 replicates.
In some cases a small neighbor set is preferred, while in other cases a denser similarity graph may be helpful. Since the optimal value of k is usually unknown in the real world, the performance of the three graph-based methods can be strongly affected by an improperly selected k. Our method addresses this issue by allowing all the $n^2$ edges to have non-zero weights, which are optimized by the end-to-end MARL design.

We also show how well each method recovers the true underlying graph $\bar{A}$. Specifically, we transform each obtained weighted adjacency matrix $W$ into an unweighted one $\hat{A} = [\hat{a}_{ij}]_{n \times n}$, where $\hat{a}_{ij} = \mathbb{1}\{w_{ij} > \tau\}$ and $\tau \in (0, 1)$ is a truncation threshold. In Figure 2(b), we draw five ROC curves based on the learned weighted graphs obtained by the five methods. It should be noted that both IPAL and PL-KNN estimate the similarity graph using K Nearest Neighbors, so we use "KNN" to represent them. As Figure 2(b) shows, PLRL estimates the true graph better. For both AGGD and KNN, the estimation performance stops improving once a certain TPR is reached. Table 2 summarizes the graph estimations under τ = 0.05. F-norm denotes the Frobenius distance between the true graph $\bar{A}$ and the estimated graph $W$. Higher values of NNZ and TPR, and smaller values of FDR, SHD and F-norm distance, demonstrate that PLRL discovers more true edges than the other three and achieves the best performance in recovering the underlying graph. The simulation study indicates that PLRL can better capture the similarities between training instances, which helps improve the final prediction accuracy.
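The thresholding step and some of the graph-recovery measures used above can be illustrated with a short Python sketch. The exact definitions behind Table 2 are not given in the text, so the code assumes the usual ones: NNZ as the number of predicted edges, TPR as the fraction of true edges recovered, FDR as the fraction of predicted edges that are false, all counted over ordered off-diagonal pairs. The function names and toy matrices are hypothetical.

```python
import math


def threshold(W, tau):
    """Binarise a weighted adjacency matrix: a_hat_ij = 1 if w_ij > tau, else 0."""
    return [[1 if w > tau else 0 for w in row] for row in W]


def recovery_metrics(A_true, A_hat):
    """NNZ, TPR and FDR over ordered off-diagonal pairs (assumed definitions)."""
    tp = fp = fn = 0
    n = len(A_true)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if A_hat[i][j] and A_true[i][j]:
                tp += 1
            elif A_hat[i][j]:
                fp += 1
            elif A_true[i][j]:
                fn += 1
    nnz = tp + fp
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / nnz if nnz else 0.0
    return {"NNZ": nnz, "TPR": tpr, "FDR": fdr}


def frobenius_distance(A_true, W):
    """Frobenius norm of the difference between the true and estimated graphs."""
    return math.sqrt(sum((a - w) ** 2
                         for row_a, row_w in zip(A_true, W)
                         for a, w in zip(row_a, row_w)))


# Toy graph: one true (undirected) edge between nodes 0 and 1.
A_true = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
W = [[0.0, 0.6, 0.3], [0.7, 0.0, 0.01], [0.02, 0.04, 0.0]]
metrics = recovery_metrics(A_true, threshold(W, tau=0.05))
```

Sweeping `tau` over (0, 1) and recording the true and false positive rates at each value yields an ROC curve of the kind described for Figure 2(b).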

5.2. CONTROLLED UCI DATASETS

Following the common design of previous PLL studies, we generate artificial partially labelled datasets from the UCI datasets (Dua & Graff, 2019). The characteristics of the eight UCI datasets are summarized in Table 8 of the Supplements. We set the proportion of PL examples to p = 1 across all our experiments, and vary the number of wrong labels r and the probability ϵ that one specific false positive label co-occurs with the true label. The two settings are configured as follows: (I) p = 1, r = 1, ϵ ∈ {0.2, 0.3, ..., 0.8}; (II) p = 1, r ∈ {1, 2, 3, 4, 5, 6, 7}. The detailed experimental results are shown in Supplement D. To clearly illustrate the advantage of the proposed PLRL algorithm, we report the win/tie/loss results between PLRL and each competing method using a two-sample t-test. As Table 3 shows, PLRL significantly outperforms its competitors, with win/tie rates greater than 93.0% among 86 set-ups. In particular, PLRL is superior or comparable to recently proposed methods such as SURE and AGGD in most cases.
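Setting (II), like step 4 of the simulation in Section 5.1, builds each candidate set from the true label plus r randomly chosen wrong labels. A minimal Python sketch of that step, assuming labels are encoded as integers 0..q-1 (the function name is hypothetical; setting (I), which couples one specific false positive label with each true label, is not covered here):

```python
import random


def make_candidate_set(true_label, q, r, rng):
    """Return the true label together with r distinct false positive labels
    drawn uniformly from the remaining q - 1 classes."""
    others = [label for label in range(q) if label != true_label]
    return {true_label, *rng.sample(others, r)}


# Example: 5 classes, 3 wrong labels added to the true label 2.
rng = random.Random(0)
candidates = make_candidate_set(true_label=2, q=5, r=3, rng=rng)
```

Passing an explicit `random.Random` instance keeps the generated candidate sets reproducible across runs.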

5.3. REAL-WORLD DATASETS

We use five real-world datasets to validate the proposed method: Lost (Cour et al., 2011), Soccer Player (Zeng et al., 2013), Yahoo! News (Guillaumin et al., 2010), MSRCv2 (Liu & Dietterich, 2012) and BirdSong (Briggs et al., 2012). The first three are for automatic face naming, the fourth is for object classification, and the last is for bird song classification. The characteristics of these five datasets, including the average number of candidate labels, are summarized in Table 9 of the Supplements. The mean inductive classification accuracy and its standard deviation for each algorithm on unlabelled test data are summarized in Table 4. Pairwise t-tests at the 0.05 significance level are conducted based on 5-fold cross-validation. As shown in Table 4, PLRL significantly outperforms the others on four datasets and achieves the second best prediction accuracy on Yahoo! News, which indicates that PLRL is less sensitive to the data structure and performs consistently well in practice.

Table 5 presents the transductive classification accuracy on partially labelled training samples, which reflects the disambiguation capacity of each method in recovering ground truths from the candidate label sets. For PLRL, SURE and AGGD, the generated label confidence vector $f_i$ can be used to predict the ground-truth label of each partially labelled training instance $x_i$ via $\hat{y}_i = \arg\max_{y_k \in C_i} f_{ik}$. Other approaches directly make predictions by choosing the most likely $\hat{y}_i \in C_i$. As shown in Table 5, PLRL is superior to the other algorithms in terms of transductive prediction accuracy. In particular, PLRL significantly outperforms CLPL, IPAL, PL-KNN and LSB-CMM on all five real-world datasets. When compared with SURE, AGGD and PL-SVM, PLRL still achieves the best performance in most cases.
These results imply that PLRL performs better in disambiguating blurry labels and can extract more useful information which helps to build a more precise prediction model. 
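The transductive prediction rule, which picks the most confident label within each candidate set, together with the resulting accuracy, can be sketched in Python as follows. The function names and toy confidences are hypothetical and labels are assumed to be integer-encoded.

```python
def disambiguate(f_i, candidates_i):
    """Return the candidate label with the highest confidence in f_i."""
    return max(sorted(candidates_i), key=lambda k: f_i[k])


def transductive_accuracy(F, candidate_sets, true_labels):
    """Fraction of training instances whose disambiguated label is correct."""
    hits = sum(disambiguate(f, c) == y
               for f, c, y in zip(F, candidate_sets, true_labels))
    return hits / len(true_labels)


# Toy confidences for two instances over three classes.
F = [[0.1, 0.6, 0.3], [0.7, 0.0, 0.3]]
candidate_sets = [{0, 2}, {0, 2}]
accuracy = transductive_accuracy(F, candidate_sets, true_labels=[2, 2])
```

Restricting the argmax to the candidate set matters: in the first toy row the globally largest confidence belongs to a label outside the candidate set and is therefore ignored.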

5.4. FURTHER STUDY

To fairly evaluate the contributions of the MARL design, the GNN model used to learn the weighted graph, and the KRR model, we carry out an ablation study on four different datasets to quantitatively illustrate their necessity. Specifically, we compare the full PLRL model with variants that remove either GNN&RL or KRR, and with the baseline model IPAL. Their architectures are as follows:
• "w/o GNN&RL" represents the prediction method that uses k-NN to estimate the underlying graph and makes predictions via KRR. In this case, both GNN and RL are removed.
• "w/o KRR" refers to the case where the reward keeps only terms (i) and (iii) of (2), i.e. $-R = \sum_{j=1}^{n} \big\| f_j - \sum_{i=1}^{n} w_{ij} f_i \big\|^2 + \eta \sum_{j=1}^{n} \big\| x_j - \sum_{i=1}^{n} w_{ij} x_i \big\|^2$. Since $g(\cdot)$ is removed, the final prediction of each test instance is voted on by its neighbors.
• IPAL serves as the baseline of PLRL without KRR and GNN&RL.

As shown in Table 6, the full PLRL model outperforms the other three in most scenarios with statistically significant t-test results. The prediction accuracy decreases when either the KRR model or the GNN&RL mechanism is removed. Thus, both components are important for improving on the baseline, although their contributions vary across datasets. We also show that the reward and the classification accuracy on both training and test datasets converge after 1000 epochs. More details can be found in Supplement E.2.

6. CONCLUSION

In this work, we propose a novel PLL method, called PLRL, which models the instance similarities using an MARL based GNN model and enhances the disambiguation capacity. Our method takes all training samples into consideration when building the similarity graph and can better utilize the neighboring information. Thanks to the end-to-end design, the graph structure learned by PLRL is more related to the main classification task. Despite the empirical success PLRL achieves, there are still some open questions to be answered. First, we need to figure out why the improvement of PLRL over baselines is varied across different datasets. Second, the significance of the improvement by PLRL largely lies in the usage of GNN and RL. It may bring some new ideas to other weakly supervised learning problems such as graph estimations and link predictions.



Figure 1: Algorithm architecture of PLRL



Figure 2: Classification and graph recovery performance on simulation datasets. (a) Prediction accuracy of the four methods over the 40 replicates, sorted in ascending order of classification accuracy; for the three competitors, results are plotted under different choices of k, the total number of neighbours of each training instance used when building the underlying similarity graph. (b) ROC curves based on the learned weighted graphs.

Table 1: Comparing methods

Table 2: Graph recovery performance on one simulation dataset

Table 3: Win/tie/loss counts (pairwise t-test at 0.05 significance level) on the controlled UCI datasets between PLRL and the comparing algorithms

Table 4: Inductive classification accuracy (mean ± std) of each method on the real-world partial label datasets, where •/◦ indicates whether the performance of PLRL is statistically superior/inferior to the comparing algorithm on each dataset (pairwise t-test at 0.05 significance level)

Table 5: Transductive classification accuracy (mean ± std) of each method on the real-world partial label datasets

Table 6: Prediction accuracy of PLRL and its ablated variants on four partial label datasets

