POINTWISE BINARY CLASSIFICATION WITH PAIRWISE CONFIDENCE COMPARISONS

Abstract

Ordinary (pointwise) binary classification aims to learn a binary classifier from pointwise labeled data. However, such pointwise labels may not be directly accessible due to privacy, confidentiality, or security considerations. In this case, can we still learn an accurate binary classifier? This paper proposes a novel setting, namely pairwise comparison (Pcomp) classification, where we are given only pairs of unlabeled data, for each of which we know that one instance is more likely to be positive than the other, instead of pointwise labeled data. Compared with pointwise labels, pairwise comparisons are easier to collect, and Pcomp classification is useful for subjective classification tasks. To solve this problem, we present a mathematical formulation for the generation process of pairwise comparison data, based on which we explore two unbiased risk estimators (UREs) to train a binary classifier by empirical risk minimization and establish the corresponding estimation error bounds. We first prove that a URE can be derived and improve it using correction functions. Then, we start from the noisy-label learning perspective to introduce a progressive URE and improve it by imposing consistency regularization. Finally, experiments validate the effectiveness of our proposed solutions for Pcomp classification.

1. INTRODUCTION

Traditional supervised learning techniques have achieved great advances, but they demand precisely labeled data. In many real-world scenarios, it may be too difficult to collect such data. To alleviate this issue, a large number of weakly supervised learning problems (Zhou, 2018) have been extensively studied, including semi-supervised learning (Zhu & Goldberg, 2009; Niu et al., 2013; Sakai et al., 2018), multi-instance learning (Zhou et al., 2009; Sun et al., 2016; Zhang & Zhou, 2017), noisy-label learning (Han et al., 2018; Xia et al., 2019; Wei et al., 2020), partial-label learning (Zhang et al., 2017; Feng et al., 2020b; Lv et al., 2020), complementary-label learning (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Feng et al., 2020a), positive-unlabeled classification (Gong et al., 2019), positive-confidence classification (Ishida et al., 2018), similar-unlabeled classification (Bao et al., 2018), unlabeled-unlabeled classification (Lu et al., 2019; 2020), and triplet classification (Cui et al., 2020). This paper considers another novel weakly supervised learning setting called pairwise comparison (Pcomp) classification, where we aim to perform pointwise binary classification with only pairwise comparison data, instead of pointwise labeled data. A pairwise comparison (x, x') represents that the instance x has a larger confidence of belonging to the positive class than the instance x'. Such weak supervision (pairwise confidence comparisons) could be much easier for people to collect than full supervision (pointwise labels) in practice, especially for applications on sensitive or private matters. For example, it may be difficult to collect sensitive or private data with pointwise labels, as asking for the true labels could be prohibited or illegal. In this case, it could be easier for people to collect other weak supervision, such as comparison information between two examples.
It is also advantageous to consider pairwise confidence comparisons in pointwise binary classification with class overlapping, where the labeling task becomes difficult and even experienced labelers may provide wrong pointwise labels. Let us denote the labeling standard of a labeler as p(y|x) and assume that an instance x_1 is more positive than another instance x_2. Facing a difficult labeling task, different labelers may hold different labeling standards, p(y=+1|x_1) > p(y=+1|x_2) > 1/2, p(y=+1|x_1) > 1/2 > p(y=+1|x_2), and 1/2 > p(y=+1|x_1) > p(y=+1|x_2), thereby providing different pointwise labels: (+1, +1), (+1, -1), and (-1, -1). We can see that different labelers may provide inconsistent pointwise labels, while pairwise confidence comparisons are unanimous and accurate. One may argue that we could aggregate multiple labels of the same instance using crowdsourcing learning methods (Whitehill et al., 2009; Raykar et al., 2010). However, since not every instance will be labeled by multiple labelers, crowdsourcing methods are not always applicable. Therefore, our proposed Pcomp classification is useful in this case. Our contributions in this paper can be summarized as follows:

• We propose Pcomp classification, a novel weakly supervised learning setting, and present a mathematical formulation for the generation process of pairwise comparison data.

• We prove that an unbiased risk estimator (URE) can be derived, propose an empirical risk minimization (ERM) based method, and present an improvement using correction functions (Lu et al., 2020) for alleviating overfitting when complex models are used.

• We start from the noisy-label learning perspective to introduce the RankPruning method (Northcutt et al., 2017), which holds a progressive URE for solving our proposed Pcomp classification problem, and improve it by imposing consistency regularization.
• We experimentally demonstrate the effectiveness of our proposed solutions for Pcomp classification.

2. PRELIMINARIES

Binary classification with pairwise comparisons and extra pointwise labels has been studied (Xu et al., 2017; Kane et al., 2017). Our paper focuses on a more challenging problem where only pairwise comparison examples are provided. Unlike previous studies (Xu et al., 2017; Kane et al., 2017) that leverage some pointwise labels to differentiate the labels of pairwise comparisons, our methods are purely based on ERM with only pairwise comparisons. In what follows, we briefly introduce some notations and review the related problem formulations of binary classification, positive-unlabeled classification, and unlabeled-unlabeled classification.

Binary Classification. Since our paper focuses on how to train a binary classifier from pairwise comparison data, we first review the problem formulation of binary classification. Let the feature space be \mathcal{X} and the label space be \mathcal{Y} = \{+1, -1\}. Suppose the collected dataset is denoted by D = \{(x_i, y_i)\}_{i=1}^n, where each example (x_i, y_i) is independently sampled from the joint distribution with density p(x, y), and includes an instance x_i \in \mathcal{X} and a label y_i \in \mathcal{Y}. The goal of binary classification is to train an optimal classifier f: \mathcal{X} \to \mathbb{R} by minimizing the following expected classification risk:

R(f) = \mathbb{E}_{p(x,y)}[\ell(f(x), y)] = \pi_+ \mathbb{E}_{p_+(x)}[\ell(f(x), +1)] + \pi_- \mathbb{E}_{p_-(x)}[\ell(f(x), -1)],    (1)

where \ell(\cdot, \cdot) is a binary loss function, \pi_+ = p(y = +1) and \pi_- = p(y = -1) are the class priors, and p_+(x) = p(x \mid y = +1) and p_-(x) = p(x \mid y = -1) are the class-conditional densities.

Positive-Unlabeled (PU) Classification. In some real-world scenarios, it may be difficult to collect negative data, and only positive (P) and unlabeled (U) data are available. PU classification aims to train an effective binary classifier in this weakly supervised setting. Previous studies (du Plessis et al., 2014; 2015; Kiryo et al., 2017) showed that the classification risk R(f) in Eq. (1) can be rewritten only in terms of positive and unlabeled data as

R(f) = R_{PU}(f) = \pi_+ \mathbb{E}_{p_+(x)}\big[\ell(f(x), +1) - \ell(f(x), -1)\big] + \mathbb{E}_{p(x)}[\ell(f(x), -1)],    (2)

where p(x) = \pi_+ p_+(x) + \pi_- p_-(x) denotes the probability density of unlabeled data.
This risk expression immediately allows us to employ ERM in terms of positive and unlabeled data.

Unlabeled-Unlabeled (UU) Classification. Recent studies (Lu et al., 2019; 2020) showed that it is possible to train a binary classifier only from two unlabeled datasets with different class priors. Lu et al. (2019) showed that the classification risk can be rewritten as

R(f) = R_{UU}(f) = \mathbb{E}_{p_{tr}(x)}\Big[\frac{(1-\theta')\pi_+}{\theta-\theta'}\ell(f(x), +1) - \frac{\theta'(1-\pi_+)}{\theta-\theta'}\ell(f(x), -1)\Big] + \mathbb{E}_{p'_{tr}(x')}\Big[\frac{\theta(1-\pi_+)}{\theta-\theta'}\ell(f(x'), -1) - \frac{(1-\theta)\pi_+}{\theta-\theta'}\ell(f(x'), +1)\Big],    (3)

where \theta and \theta' are the different class priors of the two unlabeled datasets, and p_{tr}(x) and p'_{tr}(x') are the densities of the two datasets of unlabeled data, respectively. This risk expression immediately allows us to employ ERM only from two sets of unlabeled data. For R_{UU}(f) in Eq. (3), if we set \theta = 1 and \theta' = \pi_+, and replace p_{tr}(x) and p'_{tr}(x') by p_+(x) and p(x) respectively, then we recover R_{PU}(f) in Eq. (2). Therefore, UU classification can be taken as a generalized framework of PU classification in terms of UREs. Besides, Eq. (3) also recovers a complicated URE of similar-unlabeled classification (Bao et al., 2018) by setting \theta = \pi_+ and \theta' = \pi_+^2/(2\pi_+^2 - 2\pi_+ + 1). To solve our proposed Pcomp classification problem, we will present a mathematical formulation for the generation process of pairwise comparison data, based on which we will explore two UREs to train a binary classifier by ERM and establish the corresponding estimation error bounds.
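To make the reduction from UU to PU classification concrete, here is a small numerical check (our own sketch, not code from the paper; the function name `uu_coefficients` is ours) that the four coefficients of R_UU(f) collapse to those of R_PU(f) when \theta = 1 and \theta' = \pi_+:

```python
def uu_coefficients(pi_plus, theta, theta_p):
    """Coefficients of the four loss terms in R_UU(f):
    (c1, c2) weight ell(f(x), +1) and ell(f(x), -1) under p_tr,
    (c3, c4) weight ell(f(x'), -1) and ell(f(x'), +1) under p'_tr."""
    d = theta - theta_p
    c1 = (1 - theta_p) * pi_plus / d
    c2 = -theta_p * (1 - pi_plus) / d
    c3 = theta * (1 - pi_plus) / d
    c4 = -(1 - theta) * pi_plus / d
    return c1, c2, c3, c4

pi_plus = 0.3
c1, c2, c3, c4 = uu_coefficients(pi_plus, theta=1.0, theta_p=pi_plus)
# PU risk: pi_+ [ell(f(x), +1) - ell(f(x), -1)] over p_+ plus ell(f(x), -1) over p.
assert abs(c1 - pi_plus) < 1e-12   # pi_+ weight on ell(f(x), +1)
assert abs(c2 + pi_plus) < 1e-12   # -pi_+ weight on ell(f(x), -1)
assert abs(c3 - 1.0) < 1e-12       # unit weight on ell(f(x'), -1) over unlabeled data
assert abs(c4) < 1e-12             # the +1 term over unlabeled data vanishes
```

The check mirrors the algebra in the text: with \theta = 1, the \theta - \theta' denominator becomes 1 - \pi_+, which cancels against each numerator.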

3. DATA GENERATION PROCESS

In order to derive UREs for performing ERM, we first formulate the underlying generation process of pairwise comparison data, which consists of pairs of unlabeled data where we know which one is more likely to be positive. Suppose the provided dataset is denoted by D = \{(x_i, x'_i)\}_{i=1}^n, where each pair (x_i, x'_i) (with unknown true labels (y_i, y'_i)) is expected to satisfy p(y_i = +1 \mid x_i) > p(y'_i = +1 \mid x'_i). It is clear that we could easily collect pairwise comparison data if the positive confidence (i.e., p(y = +1 \mid x)) of each instance could be obtained. However, such information is much harder to obtain than class labels in real-world scenarios. Therefore, unlike some studies (Ishida et al., 2018; Shinoda et al., 2020) that assume the positive confidence of each instance is provided by the labeler, we only assume that the labeler has access to the labels of training data. Specifically, we adopt the assumption (Cui et al., 2020) that weakly supervised examples are first sampled from the true data distribution, but the labels are only accessible to the labeler. Then, the labeler provides us weakly supervised information (i.e., pairwise comparison information) according to the labels of the sampled data pairs. That is, for any pair of unlabeled data (x, x'), the labeler tells us whether (x, x') could be collected as a pairwise comparison for Pcomp classification, based on the labels (y, y') rather than the positive confidences (p(y = +1 \mid x), p(y' = +1 \mid x')). Now, the question becomes: how does the labeler consider (x, x') as a pairwise comparison for Pcomp classification, in terms of the labels (y, y')? As shown in our previous example of binary classification with class overlapping, we can infer that the labels (y, y') of our required pairwise comparison data (x, x') for Pcomp classification can only be one of the three cases \{(+1, -1), (+1, +1), (-1, -1)\}, because the condition p(y = +1 \mid x) \geq p(y' = +1 \mid x') is definitely violated if (y, y') = (-1, +1).
Therefore, we assume that the labeler takes (x, x') as a pairwise comparison example in the dataset D if the labels (y, y') of (x, x') belong to the above three cases. It is also worth noting that for a pair of data (x, x') with labels (y, y') = (-1, +1), the labeler would take (x', x) as a pairwise comparison example, because by exchanging the positions of (x, x'), (x', x) is associated with labels (+1, -1), which belong to the three cases. In summary, we assume that pairwise comparison data are sampled from those pairs of data whose labels belong to the three cases \{(+1, -1), (+1, +1), (-1, -1)\}. Based on the above described generation process of pairwise comparison data, we have the following theorem.

Theorem 1. According to the generation process of pairwise comparison data described above, let

p(x, x') = \frac{q(x, x')}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-},  where  q(x, x') = \pi_+^2 p_+(x)p_+(x') + \pi_-^2 p_-(x)p_-(x') + \pi_+\pi_- p_+(x)p_-(x').

Then we have D = \{(x_i, x'_i)\}_{i=1}^n \overset{\mathrm{i.i.d.}}{\sim} p(x, x').

The proof is provided in Appendix A. Theorem 1 provides an explicit expression of the probability density of pairwise comparison data. Next, we would like to extract pointwise information from pairwise information, since our goal is to perform pointwise binary classification.

Theorem 2. Let \tilde{\pi} = \pi_+^2 + \pi_-^2 + \pi_+\pi_- = \pi_+ + \pi_-^2 = \pi_+^2 + \pi_-, and define the pointwise densities \tilde{p}_+(x) = \int p(x, x')\,dx' and \tilde{p}_-(x') = \int p(x, x')\,dx. Then

\tilde{p}_+(x) = \frac{\pi_+}{\pi_-^2 + \pi_+} p_+(x) + \frac{\pi_-^2}{\pi_-^2 + \pi_+} p_-(x),   \tilde{p}_-(x') = \frac{\pi_+^2}{\pi_+^2 + \pi_-} p_+(x') + \frac{\pi_-}{\pi_+^2 + \pi_-} p_-(x').

The proof is provided in Appendix B. Theorem 2 shows the relationships between the pointwise densities and the class-conditional densities. Besides, it indicates that from pairwise comparison data, we can essentially obtain examples that are independently drawn from \tilde{p}_+(x) and \tilde{p}_-(x').
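The generation process above can be simulated in a few lines. The following is a minimal sketch (ours, not the paper's code) that checks two consequences empirically: the fraction of collected pairs approaches \tilde{\pi} (Theorem 1), and the first element of a collected pair is truly positive with probability \pi_+/\tilde{\pi} (the weight of p_+ in \tilde{p}_+ from Theorem 2):

```python
import numpy as np

# Simulate the assumed generation process: label pairs are drawn i.i.d. from
# the class prior, and a pair is collected when its labels fall in
# {(+1, -1), (+1, +1), (-1, -1)}.
rng = np.random.default_rng(0)
pi_plus, n = 0.3, 200_000
pi_minus = 1.0 - pi_plus

y = rng.choice([1, -1], size=(n, 2), p=[pi_plus, pi_minus])
collected = ~((y[:, 0] == -1) & (y[:, 1] == 1))  # drop only the (-1, +1) case

# Theorem 1: collected fraction approaches pi_tilde = pi_+^2 + pi_-^2 + pi_+ pi_-.
pi_tilde = pi_plus**2 + pi_minus**2 + pi_plus * pi_minus
assert abs(collected.mean() - pi_tilde) < 0.01

# Theorem 2: among collected pairs, the first element is truly positive with
# probability pi_+ / pi_tilde (the weight of p_+ in the density p~_+).
frac_pos_first = (y[collected][:, 0] == 1).mean()
assert abs(frac_pos_first - pi_plus / pi_tilde) < 0.01

# Class-prior recovery (used later for the experiments): since
# pi_tilde = pi_+^2 + 1 - pi_+, we have pi_+ = 1/2 - sqrt(pi_tilde - 3/4)
# when pi_+ < pi_-.
pi_plus_hat = 0.5 - np.sqrt(collected.mean() - 0.75)
assert abs(pi_plus_hat - pi_plus) < 0.02
```

The tolerances are loose Monte Carlo tolerances for this sample size; the simulation is only meant to illustrate the data-generation assumption, not to reproduce any experiment from the paper.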

4. THE PROPOSED METHODS

In this section, we explore two UREs to train a binary classifier by ERM from only pairwise comparison data with the above generation process.

4.1. CORRECTED PCOMP CLASSIFICATION

As shown in Eq. (1), the classification risk R(f) can be separately expressed as expectations over p_+(x) and p_-(x). Although we do not have access to the two class-conditional densities p_+(x) and p_-(x), we can represent them by our introduced pointwise densities \tilde{p}_+(x) and \tilde{p}_-(x).

Lemma 1. We can express p_+(x) and p_-(x) in terms of \tilde{p}_+(x) and \tilde{p}_-(x) as

p_+(x) = \frac{1}{\pi_+}\big(\tilde{p}_+(x) - \pi_-\tilde{p}_-(x)\big),   p_-(x) = \frac{1}{\pi_-}\big(\tilde{p}_-(x) - \pi_+\tilde{p}_+(x)\big).

The proof is provided in Appendix C. As a result of Lemma 1, we can express the classification risk R(f) using only pairwise comparison data sampled from \tilde{p}_+(x) and \tilde{p}_-(x').

Theorem 3. The classification risk R(f) can be equivalently expressed as

R_{PC}(f) = \mathbb{E}_{\tilde{p}_+(x)}\big[\ell(f(x), +1) - \pi_+\ell(f(x), -1)\big] + \mathbb{E}_{\tilde{p}_-(x')}\big[\ell(f(x'), -1) - \pi_-\ell(f(x'), +1)\big].    (5)

The proof is provided in Appendix D. In this way, we can train a binary classifier by minimizing the following empirical approximation of R_{PC}(f):

\hat{R}_{PC}(f) = \frac{1}{n}\sum_{i=1}^n \big[\ell(f(x_i), +1) - \pi_+\ell(f(x_i), -1) + \ell(f(x'_i), -1) - \pi_-\ell(f(x'_i), +1)\big].    (6)

Estimation Error Bound. Here, we establish an estimation error bound for the proposed URE. Let \mathcal{F} = \{f: \mathcal{X} \to \mathbb{R}\} be the model class, \hat{f}_{PC} = \arg\min_{f \in \mathcal{F}} \hat{R}_{PC}(f) be the empirical risk minimizer, and f^* = \arg\min_{f \in \mathcal{F}} R(f) be the true risk minimizer. Let \mathfrak{R}^+_n(\mathcal{F}) and \mathfrak{R}^-_n(\mathcal{F}) be the Rademacher complexities (Bartlett & Mendelson, 2002) of \mathcal{F} with sample size n over \tilde{p}_+(x) and \tilde{p}_-(x'), respectively.

Theorem 4. Suppose the loss function \ell is \rho-Lipschitz with respect to the first argument (0 < \rho < \infty), and all functions in the model class \mathcal{F} are bounded, i.e., there exists a positive constant C_b such that \|f\|_\infty \leq C_b for any f \in \mathcal{F}. Let C_\ell := \sup_{|z| \leq C_b, t = \pm 1} \ell(z, t). Then for any \delta > 0, with probability at least 1 - \delta, we have

R(\hat{f}_{PC}) - R(f^*) \leq 4(1 + \pi_+)\rho\,\mathfrak{R}^+_n(\mathcal{F}) + 4(1 + \pi_-)\rho\,\mathfrak{R}^-_n(\mathcal{F}) + 6C_\ell\sqrt{\frac{\log(8/\delta)}{2n}}.

The proof is provided in Appendix E.
Theorem 4 shows that our proposed method is consistent, i.e., as n \to \infty, R(\hat{f}_{PC}) \to R(f^*), since \mathfrak{R}^+_n(\mathcal{F}), \mathfrak{R}^-_n(\mathcal{F}) \to 0 for all parametric models with a bounded norm, such as deep neural networks trained with weight decay (Golowich et al., 2017; Lu et al., 2019). Besides, \mathfrak{R}^+_n(\mathcal{F}) and \mathfrak{R}^-_n(\mathcal{F}) can normally be bounded by C_{\mathcal{F}}/\sqrt{n} for a positive constant C_{\mathcal{F}}. Hence, we can further see that the convergence rate is O_p(1/\sqrt{n}), where O_p denotes the order in probability. This order is the optimal parametric rate for ERM without additional assumptions (Mendelson, 2008).

Relation to UU Classification. It is worth noting that the URE of UU classification, R_{UU}(f), is quite general for binary classification with weak supervision. Hence, we also would like to show the relationship between our proposed estimator R_{PC}(f) and R_{UU}(f). We demonstrate by the following corollary that, under some conditions, R_{UU}(f) is equivalent to R_{PC}(f).

Corollary 1. By setting p_{tr}(x) = \tilde{p}_+(x), p'_{tr}(x') = \tilde{p}_-(x'), \theta = \pi_+/(1 - \pi_+ + \pi_+^2), and \theta' = \pi_+^2/(1 - \pi_+ + \pi_+^2), Eq. (3) is equivalent to Eq. (5), which means that R_{UU}(f) is equivalent to R_{PC}(f).

We omit the proof of Corollary 1 since it is straightforward to derive Eq. (5) from Eq. (3) by inserting the required notations.

Empirical Risk Correction. As shown in Lu et al. (2020), directly minimizing \hat{R}_{PC}(f) would suffer from overfitting when complex models are used, due to the negative-risk issue. More specifically, since negative terms are included in Eq. (6), the empirical risk can be negative even though the original true risk can never be negative. To ease this problem, they wrapped the terms in R_{UU}(f) that cause a negative empirical risk with certain consistent correction functions, such as the rectified linear unit (ReLU) function g(z) = \max(0, z) and the absolute value function g(z) = |z|. This solution can also be applied to \hat{R}_{PC}(f).
In this way, we obtain the following corrected empirical risk estimator:

\tilde{R}_{cPC}(f) = g\left(\frac{1}{n}\sum_{i=1}^n \big[\ell(f(x_i), +1) - \pi_-\ell(f(x'_i), +1)\big]\right) + g\left(\frac{1}{n}\sum_{i=1}^n \big[\ell(f(x'_i), -1) - \pi_+\ell(f(x_i), -1)\big]\right).    (7)
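To make Eqs. (6) and (7) concrete, here is a minimal NumPy sketch (ours, not the paper's implementation; names such as `pcomp_risk` are our own) of the unbiased and corrected empirical risks under the logistic loss:

```python
import numpy as np

def logistic_loss(z, t):
    """ell(z, t) = ln(1 + exp(-t z)); the logistic loss used in the experiments."""
    return np.log1p(np.exp(-t * z))

def pcomp_risk(f_pos, f_neg, pi_plus, g=None):
    """Empirical Pcomp risk, a sketch of Eqs. (6)/(7).
    f_pos: model outputs f(x_i) on data drawn from p~_+;
    f_neg: model outputs f(x'_i) on data drawn from p~_-.
    g=None gives the unbiased estimator of Eq. (6); passing a consistent
    correction function (e.g., abs or ReLU) gives Eq. (7)."""
    pi_minus = 1.0 - pi_plus
    if g is None:
        g = lambda z: z
    term_pos = np.mean(logistic_loss(f_pos, +1)) - pi_minus * np.mean(logistic_loss(f_neg, +1))
    term_neg = np.mean(logistic_loss(f_neg, -1)) - pi_plus * np.mean(logistic_loss(f_pos, -1))
    return float(g(term_pos) + g(term_neg))

f_pos = np.array([2.0, 1.0])    # toy outputs on noisy-positive data
f_neg = np.array([-1.0, -2.0])  # toy outputs on noisy-negative data
r_unbiased = pcomp_risk(f_pos, f_neg, pi_plus=0.3)
r_relu = pcomp_risk(f_pos, f_neg, pi_plus=0.3, g=lambda z: max(z, 0.0))
```

On these toy numbers the unbiased estimate is negative, illustrating the negative-risk symptom discussed above; wrapping each grouped term with g keeps the corrected estimate non-negative (r_relu >= 0 and r_relu >= r_unbiased always hold, since g(z) >= z for ReLU).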

4.2. PROGRESSIVE PCOMP CLASSIFICATION

Here, we start from the noisy-label learning perspective to solve the Pcomp classification problem. Intuitively, we could simply perform binary classification by regarding the data from \tilde{p}_+(x) as (noisy) positive data and the data from \tilde{p}_-(x') as (noisy) negative data. However, this naive solution would inevitably be affected by noisy labels. In this scenario, we denote the noise rates as \rho_- = p(\tilde{y} = +1 \mid y = -1) and \rho_+ = p(\tilde{y} = -1 \mid y = +1), where \tilde{y} is the observed (noisy) label and y is the true label, and the inverse noise rates as \varphi_+ = p(y = -1 \mid \tilde{y} = +1) and \varphi_- = p(y = +1 \mid \tilde{y} = -1). According to the defined generation process of pairwise comparison data, we have the following theorem.

Theorem 5. The following equalities hold:

\varphi_+ = \frac{\pi_-^2}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-},   \varphi_- = \frac{\pi_+^2}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-},   \rho_+ = \frac{\pi_+}{2(\pi_+^2 + \pi_-^2 + \pi_+\pi_-)},   \rho_- = \frac{\pi_-}{2(\pi_+^2 + \pi_-^2 + \pi_+\pi_-)}.

The proof is provided in Appendix F. Theorem 5 shows that the noise rates can be obtained if we regard the Pcomp classification problem as a noisy-label learning problem. With known noise rates, it was shown (Natarajan et al., 2013; Northcutt et al., 2017) that a URE can be derived. Here, we adopt the RankPruning method (Northcutt et al., 2017) because it holds a progressive URE, obtained by selecting confident examples using the learning model, and achieves state-of-the-art performance. Specifically, we denote by \tilde{P} = \{x_i\}_{i=1}^n the dataset composed of all the observed positive data, where x_i is independently sampled from \tilde{p}_+(x). Similarly, the dataset composed of all the observed negative data is denoted by \tilde{N} = \{x'_i\}_{i=1}^n, where x'_i is independently sampled from \tilde{p}_-(x'). Then, confident examples are selected from \tilde{P} and \tilde{N} by ranking the outputs of the model f.
We denote the selected positive data from \tilde{P} as \tilde{P}_{sel}, and the selected negative data from \tilde{N} as \tilde{N}_{sel}:

\tilde{P}_{sel} = \arg\max_{P \subseteq \tilde{P}: |P| = (1 - \varphi_+)|\tilde{P}|} \sum_{x \in P} f(x),   \tilde{N}_{sel} = \arg\min_{N \subseteq \tilde{N}: |N| = (1 - \varphi_-)|\tilde{N}|} \sum_{x \in N} f(x).

We say that the model f satisfies the separability condition if, for any true positive instance x_p and any true negative instance x_n, we have f(x_p) > f(x_n); in other words, the model output of every true positive instance is always larger than that of every true negative instance. Under this condition, we can obtain a URE, which we name the progressive URE, as the model f is progressively optimized.

Theorem 6 (Theorem 5 in Northcutt et al. (2017)). Assume that the model f satisfies the above separability condition. Then the classification risk R(f) can be equivalently expressed as

R_{pPC}(f) = \mathbb{E}_{\tilde{p}_+(x)}\left[\frac{\ell(f(x), +1)}{1 - \rho_+}\,\mathbb{I}[x \in \tilde{P}_{sel}]\right] + \mathbb{E}_{\tilde{p}_-(x')}\left[\frac{\ell(f(x'), -1)}{1 - \rho_-}\,\mathbb{I}[x' \in \tilde{N}_{sel}]\right],

where \mathbb{I}[\cdot] is the indicator function. In this way, we have the following empirical approximation of R_{pPC}(f):

\hat{R}_{pPC}(f) = \frac{1}{n}\sum_{i=1}^n \left[\frac{\ell(f(x_i), +1)}{1 - \rho_+}\,\mathbb{I}[x_i \in \tilde{P}_{sel}] + \frac{\ell(f(x'_i), -1)}{1 - \rho_-}\,\mathbb{I}[x'_i \in \tilde{N}_{sel}]\right].    (8)

Estimation Error Bound. It is worth noting that Northcutt et al. (2017) did not prove the learning consistency of the RankPruning method. Here, we establish an estimation error bound for this method, which guarantees the learning consistency. Let \hat{f}_{pPC} = \arg\min_{f \in \mathcal{F}} \hat{R}_{pPC}(f) be the empirical risk minimizer of the RankPruning method; then we have the following theorem.

Theorem 7. Suppose the loss function \ell is \rho-Lipschitz with respect to the first argument (0 < \rho < \infty), and all functions in the model class \mathcal{F} are bounded, i.e., there exists a positive constant C_b such that \|f\|_\infty \leq C_b for any f \in \mathcal{F}. Let C_\ell := \sup_{|z| \leq C_b, t = \pm 1} \ell(z, t). Then for any \delta > 0, with probability at least 1 - \delta, we have

R(\hat{f}_{pPC}) - R(f^*) \leq \frac{2}{1 - \rho_+}\left(2\rho\,\mathfrak{R}^+_n(\mathcal{F}) + C_\ell\sqrt{\frac{\log(4/\delta)}{2n}}\right) + \frac{2}{1 - \rho_-}\left(2\rho\,\mathfrak{R}^-_n(\mathcal{F}) + C_\ell\sqrt{\frac{\log(4/\delta)}{2n}}\right).

The proof is provided in Appendix G.
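The confident-example selection used in Eq. (8) amounts to a simple ranking step. The following is a minimal sketch (ours, not the paper's code; the function names are our own) of the inverse noise rates from Theorem 5 and the selection of \tilde{P}_{sel} and \tilde{N}_{sel}:

```python
import numpy as np

def inverse_noise_rates(pi_plus):
    """(phi_+, phi_-) from Theorem 5 under the assumed generation process;
    the normalizer is pi_tilde = pi_+^2 + pi_-^2 + pi_+ pi_-."""
    pi_minus = 1.0 - pi_plus
    pi_tilde = pi_plus**2 + pi_minus**2 + pi_plus * pi_minus
    return pi_minus**2 / pi_tilde, pi_plus**2 / pi_tilde

def select_confident(f_pos, f_neg, phi_plus, phi_minus):
    """RankPruning-style selection: keep the (1 - phi_+) fraction of noisy
    positives with the largest model outputs, and the (1 - phi_-) fraction of
    noisy negatives with the smallest model outputs."""
    n_p = round((1 - phi_plus) * len(f_pos))
    n_n = round((1 - phi_minus) * len(f_neg))
    pos_sel = np.argsort(f_pos)[::-1][:n_p]  # indices kept as confident positives
    neg_sel = np.argsort(f_neg)[:n_n]        # indices kept as confident negatives
    return pos_sel, neg_sel

phi_p, phi_m = inverse_noise_rates(0.5)  # both equal 1/3 when pi_+ = 0.5
pos_sel, neg_sel = select_confident(np.arange(9.0), np.arange(9.0), phi_p, phi_m)
```

With nine toy examples per set and \varphi_+ = \varphi_- = 1/3, six examples are kept from each set: the six largest outputs from the noisy positives and the six smallest from the noisy negatives.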
Theorem 7 shows that the above method is consistent, and this estimation error bound also attains the optimal convergence rate without any additional assumption (Mendelson, 2008), as analyzed for Theorem 4.

Regularization. For the above RankPruning method, the URE is based on the assumption that the learning model satisfies the separability condition. Thus, its performance heavily depends on the accuracy of the learning model. However, as the learning model is progressively updated, some of the selected confident examples may still contain label noise during the training process. As a result, the RankPruning method would be affected by incorrectly selected data. A straightforward improvement is to improve the output quality of the learning model. Motivated by Mean Teacher in semi-supervised learning (Tarvainen & Valpola, 2017), we resort to a teacher model whose parameters are an exponential moving average of model snapshots, i.e., \bar{\Theta}_t = \alpha\bar{\Theta}_{t-1} + (1 - \alpha)\Theta_t, where \bar{\Theta} denotes the parameters of the teacher model, \Theta denotes the parameters of the learning model, the subscript t denotes the training step, and \alpha is a smoothing coefficient hyper-parameter. Such a teacher model can guide the learning model to produce high-quality outputs. To learn from the teacher model, we leverage consistency regularization (Laine & Aila, 2016; Tarvainen & Valpola, 2017), \Omega(f) = \mathbb{E}_x\big[\|f_{\bar{\Theta}}(x) - f_\Theta(x)\|^2\big], to make the learning model consistent with the teacher model, thereby improving the RankPruning method.

5. EXPERIMENTS

In this section, we conduct experiments to evaluate the practical performance of our proposed methods on various datasets.

Datasets. We use four popular benchmark datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), and CIFAR-10 (Krizhevsky et al., 2009). On the first three datasets, we train a multilayer perceptron (MLP) model with three hidden layers of width 300, ReLU activation functions (Nair & Hinton, 2010), and batch normalization (Ioffe & Szegedy, 2015). On CIFAR-10, we train ResNet-34 (He et al., 2016).
We also use USPS and three datasets from the UCI Machine Learning Repository (Blake & Merz, 1998): Pendigits, Optdigits, and CNAE-9. We train a linear model on these datasets, since they are not large-scale.

Methods. For our proposed Pcomp classification problem, we propose the following methods: Pcomp-Unbiased, which minimizes \hat{R}_{PC}(f) in Eq. (6); Pcomp-ReLU, which minimizes \tilde{R}_{cPC}(f) in Eq. (7) with the ReLU function; Pcomp-ABS, which minimizes \tilde{R}_{cPC}(f) in Eq. (7) with the absolute value function; and Pcomp-Teacher, which improves the RankPruning method by imposing consistency regularization to make the learning model consistent with a teacher model. Besides, we compare with the following baselines: Binary-Biased, which conducts binary classification by regarding the data from \tilde{p}_+(x) as positive data and the data from \tilde{p}_-(x') as negative data (a straightforward method to handle the Pcomp classification problem; in our setting, Binary-Biased reduces to the BER minimization method (Menon et al., 2015)); Noisy-Unbiased, a noisy-label learning method that minimizes the empirical approximation of the URE proposed by Natarajan et al. (2013); and RankPruning, a noisy-label learning method (Northcutt et al., 2017) that minimizes \hat{R}_{pPC}(f) in Eq. (8). For all learning methods, we take the logistic loss \ell(z) = \ln(1 + \exp(-z)) as the binary loss function, for fair comparisons. We implement our methods using PyTorch (Paszke et al., 2019) and use the Adam (Kingma & Ba, 2015) optimizer with mini-batch size 256 and 100 training epochs. All experiments are conducted on GeForce GTX 1080 Ti GPUs.

Experimental Setup. We test the performance of all learning methods under different class prior settings, i.e., \pi_+ is selected from \{0.2, 0.5, 0.8\}.
It is worth noting that we could estimate \pi_+ according to our described data generation process. Specifically, we can exactly estimate \tilde{\pi} by counting the fraction of collected pairwise comparison data among all the sampled pairs of data. Since \tilde{\pi} = \pi_+^2 + \pi_- = \pi_+^2 + 1 - \pi_+, we have \pi_+ = 1/2 - \sqrt{\tilde{\pi} - 3/4} (if \pi_+ < \pi_-) or \pi_+ = 1/2 + \sqrt{\tilde{\pi} - 3/4} (if \pi_+ \geq \pi_-). Therefore, if we know whether \pi_+ is larger than \pi_-, we can exactly estimate the true class prior \pi_+. For simplicity, we assume that the class prior \pi_+ is known for all the methods. We repeat the sampling-and-training process 5 times for all learning methods on all datasets and record the mean accuracy with standard deviation (mean±std).

Experimental Results with Complex Models. Table 1 records the classification performance of each method on the four benchmark datasets with different class priors. From Table 1, we have the following observations: 1) Binary-Biased always achieves the worst performance, which indicates that simply conducting binary classification cannot solve our Pcomp classification problem well; 2) Pcomp-Unbiased is inferior to Pcomp-ABS and Pcomp-ReLU. This observation accords with what we have discussed, i.e., directly minimizing \hat{R}_{PC}(f) would suffer from overfitting when complex models are used, because there are negative terms included in \hat{R}_{PC}(f) and the empirical risk can become negative during the training process. In contrast, Pcomp-ReLU and Pcomp-ABS apply consistent correction functions to \hat{R}_{PC}(f) so that the empirical risk is never negative. Therefore, when complex models such as deep neural networks are used, Pcomp-ReLU and Pcomp-ABS are expected to outperform Pcomp-Unbiased; 3) Pcomp-Teacher achieves the best performance in most cases.
This observation verifies the effectiveness of the imposed consistency regularization, which makes the learning model consistent with a teacher model, for improving the quality of the confident examples selected by the RankPruning method; 4) It is worth noting that the standard deviations of Binary-Biased, Pcomp-Unbiased, and Noisy-Unbiased are sometimes higher than those of the other methods. This is because these three methods suffer from overfitting when complex models are used, and their performance can be quite unstable across trials. In addition, Noisy-Unbiased obtains an accuracy of 80.00±0.00% on CIFAR-10 with class prior 0.2. This extreme case happens because Noisy-Unbiased simply classifies all examples into the negative class, due to the serious overfitting issue on a class-imbalanced dataset with a complex model (ResNet-34).

Experimental Results with Simple Models. Table 2 reports the classification performance of each method on the four UCI datasets with different class priors. From Table 2, we have the following observations: 1) Binary-Biased achieves the worst performance in nearly all cases; 2) Pcomp-Unbiased is slightly better than Pcomp-ReLU and Pcomp-ABS, because Pcomp-Unbiased does not suffer from overfitting when the linear model is used, so consistent correction functions are no longer necessary. Besides, Pcomp-Unbiased becomes comparable to Pcomp-Teacher and achieves the best performance in half of the cases; 3) Pcomp-Teacher is still better than RankPruning, while it is sometimes inferior to Pcomp-Unbiased. This is because the linear model is not as powerful as neural networks, so the selected confident examples may not be as reliable.
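As a side note, the teacher-model machinery behind Pcomp-Teacher (Section 4.2) is compact enough to sketch in a few lines. The following is our own minimal illustration (not the paper's implementation) of the exponential-moving-average update and the consistency term:

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """Teacher update: theta_bar_t = alpha * theta_bar_{t-1} + (1 - alpha) * theta_t."""
    return alpha * teacher_params + (1 - alpha) * student_params

def consistency_loss(student_out, teacher_out):
    """Consistency regularization Omega(f): mean squared disagreement between
    the learning (student) model and the teacher model on a batch."""
    return float(np.mean((teacher_out - student_out) ** 2))

teacher = np.zeros(3)   # toy teacher parameters
student = np.ones(3)    # toy student parameters
teacher = ema_update(teacher, student, alpha=0.9)  # teacher drifts toward student
```

In training, the consistency term would be added to the progressive risk of Eq. (8) with some weight, so that the student is penalized for disagreeing with the smoother teacher; the hyper-parameter names here are ours.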

6. CONCLUSION

In this paper, we proposed a novel weakly supervised learning setting called pairwise comparison (Pcomp) classification, where we aim to train a binary classifier from only pairwise comparison data, i.e., pairs of examples for which we know that one is more likely to be positive than the other, instead of pointwise labeled data. Pcomp classification is useful for private classification tasks where we are not allowed to directly access labels, and for subjective classification tasks where labelers have different labeling standards. To solve the Pcomp classification problem, we presented a mathematical formulation for the generation process of pairwise comparison data, based on which we explored two unbiased risk estimators (UREs) to train a binary classifier by empirical risk minimization and established the corresponding estimation error bounds. We first proved that a URE can be derived and improved it using correction functions. Then, we started from the noisy-label learning perspective to introduce a progressive URE and improved it by imposing consistency regularization. Finally, experiments demonstrated the effectiveness of our proposed methods. In future work, we will apply Pcomp classification to solve some challenging real-world problems, such as binary classification with class overlapping. In addition, we could extend Pcomp classification to the multi-class classification setting by using the one-versus-all strategy. Supposing there are multiple classes, we are given pairs of unlabeled data for which we know which one is more likely to belong to a specific class. Then, we can use the proposed methods in this paper to train a binary classifier for each class. Finally, by comparing the outputs of these binary classifiers, the predicted class can be determined.

A PROOF OF THEOREM 1

It is clear that each pair of examples (x, x') is independently drawn from the following data distribution:

p(x, x') = p\big((x, x') \mid (y, y') \in \tilde{\mathcal{Y}}\big) = \frac{p\big(x, x', (y, y') \in \tilde{\mathcal{Y}}\big)}{p\big((y, y') \in \tilde{\mathcal{Y}}\big)},

where \tilde{\mathcal{Y}} = \{(+1, -1), (+1, +1), (-1, -1)\},

p\big((y, y') \in \tilde{\mathcal{Y}}\big) = \pi_+^2 + \pi_-^2 + \pi_+\pi_-,

and

p\big(x, x', (y, y') \in \tilde{\mathcal{Y}}\big) = \sum_{(y, y') \in \tilde{\mathcal{Y}}} p\big(x, x' \mid (y, y')\big)\,p(y, y') = \pi_+^2 p_+(x)p_+(x') + \pi_-^2 p_-(x)p_-(x') + \pi_+\pi_- p_+(x)p_-(x').

Finally, letting p(x, x') = p((x, x') \mid (y, y') \in \tilde{\mathcal{Y}}), the proof is completed.

B PROOF OF THEOREM 2

In order to decompose the pairwise comparison data distribution into pointwise distributions, we marginalize p(x, x') with respect to x' or x. Then we obtain

\int p(x, x')\,dx' = \frac{1}{\tilde{\pi}}\big[\pi_+^2 p_+(x) + \pi_-^2 p_-(x) + \pi_+\pi_- p_+(x)\big] = \frac{\pi_+}{\pi_-^2 + \pi_+} p_+(x) + \frac{\pi_-^2}{\pi_-^2 + \pi_+} p_-(x) = \tilde{p}_+(x),

\int p(x, x')\,dx = \frac{1}{\tilde{\pi}}\big[\pi_+^2 p_+(x') + \pi_-^2 p_-(x') + \pi_+\pi_- p_-(x')\big] = \frac{\pi_+^2}{\pi_+^2 + \pi_-} p_+(x') + \frac{\pi_-}{\pi_+^2 + \pi_-} p_-(x') = \tilde{p}_-(x'),

which concludes the proof of Theorem 2.

C PROOF OF LEMMA 1

Based on Theorem 2, we can obtain the following linear system:

\begin{pmatrix} \tilde{p}_+(x) \\ \tilde{p}_-(x) \end{pmatrix} = \frac{1}{\tilde{\pi}}\begin{pmatrix} \pi_+ & \pi_-^2 \\ \pi_+^2 & \pi_- \end{pmatrix}\begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix}.

By solving the above system, we obtain

p_+(x) = \frac{1}{\pi_+ - \pi_-\pi_+^2}\big(\tilde{\pi}\tilde{p}_+(x) - \pi_-\tilde{\pi}\tilde{p}_-(x)\big) = \frac{1}{\pi_+}\big(\tilde{p}_+(x) - \pi_-\tilde{p}_-(x)\big),

p_-(x) = \frac{1}{\pi_- - \pi_+\pi_-^2}\big(\tilde{\pi}\tilde{p}_-(x) - \pi_+\tilde{\pi}\tilde{p}_+(x)\big) = \frac{1}{\pi_-}\big(\tilde{p}_-(x) - \pi_+\tilde{p}_+(x)\big),

which concludes the proof of Lemma 1.
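As a quick numeric sanity check of this inversion (our own illustration, not part of the paper), one can instantiate the densities at a single point:

```python
# Build p~_+ and p~_- as the mixtures stated in Theorem 2, then invert them
# via Lemma 1 and recover the original class-conditional density values.
pi_plus = 0.3
pi_minus = 1.0 - pi_plus
pi_tilde = pi_plus**2 + pi_minus**2 + pi_plus * pi_minus  # equals 1 - pi_+ pi_-

p_plus, p_minus = 0.8, 0.1  # arbitrary density values at some point x
pt_plus = (pi_plus * p_plus + pi_minus**2 * p_minus) / pi_tilde
pt_minus = (pi_plus**2 * p_plus + pi_minus * p_minus) / pi_tilde

# Lemma 1: p_+ = (p~_+ - pi_- p~_-)/pi_+ and p_- = (p~_- - pi_+ p~_+)/pi_-.
assert abs((pt_plus - pi_minus * pt_minus) / pi_plus - p_plus) < 1e-12
assert abs((pt_minus - pi_plus * pt_plus) / pi_minus - p_minus) < 1e-12
```

The cancellation works because \pi_+ - \pi_-\pi_+^2 = \pi_+(1 - \pi_+\pi_-) = \pi_+\tilde{\pi}, exactly as in the derivation above.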

D PROOF OF THEOREM 3

It is quite intuitive to derive

R(f) = \mathbb{E}_{p(x,y)}[\ell(f(x), y)] = \pi_+\mathbb{E}_{p_+(x)}[\ell(f(x), +1)] + \pi_-\mathbb{E}_{p_-(x)}[\ell(f(x), -1)]

= \frac{\pi_+\tilde{\pi}}{\pi_+ - \pi_-\pi_+^2}\mathbb{E}_{\tilde{p}_+(x)}[\ell(f(x), +1)] - \frac{\pi_+\pi_-\tilde{\pi}}{\pi_+ - \pi_-\pi_+^2}\mathbb{E}_{\tilde{p}_-(x')}[\ell(f(x'), +1)]    (by Lemma 1)

+ \frac{\pi_-\tilde{\pi}}{\pi_- - \pi_+\pi_-^2}\mathbb{E}_{\tilde{p}_-(x')}[\ell(f(x'), -1)] - \frac{\pi_+\pi_-\tilde{\pi}}{\pi_- - \pi_+\pi_-^2}\mathbb{E}_{\tilde{p}_+(x)}[\ell(f(x), -1)]

= \mathbb{E}_{\tilde{p}_+(x)}\big[\ell(f(x), +1) - \pi_+\ell(f(x), -1)\big] + \mathbb{E}_{\tilde{p}_-(x')}\big[\ell(f(x'), -1) - \pi_-\ell(f(x'), +1)\big] = R_{PC}(f),

which concludes the proof of Theorem 3.

E PROOF OF THEOREM 4

First of all, we introduce the following notations:

R^+_{PC}(f) = \mathbb{E}_{\tilde{p}_+(x)}\big[\ell(f(x), +1) - \pi_+\ell(f(x), -1)\big],   \hat{R}^+_{PC}(f) = \frac{1}{n}\sum_{i=1}^n \big[\ell(f(x_i), +1) - \pi_+\ell(f(x_i), -1)\big],

R^-_{PC}(f) = \mathbb{E}_{\tilde{p}_-(x')}\big[\ell(f(x'), -1) - \pi_-\ell(f(x'), +1)\big],   \hat{R}^-_{PC}(f) = \frac{1}{n}\sum_{i=1}^n \big[\ell(f(x'_i), -1) - \pi_-\ell(f(x'_i), +1)\big].

In this way, we can simply represent R_{PC}(f) and \hat{R}_{PC}(f) as R_{PC}(f) = R^+_{PC}(f) + R^-_{PC}(f) and \hat{R}_{PC}(f) = \hat{R}^+_{PC}(f) + \hat{R}^-_{PC}(f). Then we have the following lemma.

Lemma 2. The following inequality holds:

R(\hat{f}_{PC}) - R(f^*) \leq 2\sup_{f \in \mathcal{F}}\big|R^+_{PC}(f) - \hat{R}^+_{PC}(f)\big| + 2\sup_{f \in \mathcal{F}}\big|R^-_{PC}(f) - \hat{R}^-_{PC}(f)\big|.    (9)

Proof. We can express R(\hat{f}_{PC}) - R(f^*) as

R(\hat{f}_{PC}) - R(f^*) = R(\hat{f}_{PC}) - \hat{R}_{PC}(\hat{f}_{PC}) + \hat{R}_{PC}(\hat{f}_{PC}) - \hat{R}_{PC}(f^*) + \hat{R}_{PC}(f^*) - R(f^*)
= R_{PC}(\hat{f}_{PC}) - \hat{R}_{PC}(\hat{f}_{PC}) + \big(\hat{R}_{PC}(\hat{f}_{PC}) - \hat{R}_{PC}(f^*)\big) + \hat{R}_{PC}(f^*) - R_{PC}(f^*)
\leq \sup_{f \in \mathcal{F}}\big|R_{PC}(f) - \hat{R}_{PC}(f)\big| + 0 + \sup_{f \in \mathcal{F}}\big|\hat{R}_{PC}(f) - R_{PC}(f)\big|
= 2\sup_{f \in \mathcal{F}}\big|\hat{R}_{PC}(f) - R_{PC}(f)\big|
\leq 2\sup_{f \in \mathcal{F}}\big|R^+_{PC}(f) - \hat{R}^+_{PC}(f)\big| + 2\sup_{f \in \mathcal{F}}\big|R^-_{PC}(f) - \hat{R}^-_{PC}(f)\big|,

where the second equality holds due to Theorem 3 (R(f) = R_{PC}(f)), and the first inequality holds since \hat{f}_{PC} is the minimizer of \hat{R}_{PC}(f).

As suggested by Lemma 2, we need to further upper bound the right-hand side of Eq. (9). Before doing that, we introduce the uniform deviation bound, which is useful for deriving estimation error bounds. Its proof can be found in textbooks such as Mohri et al. (2012) (Theorem 3.1).

Lemma 3. Let Z be a random variable drawn from a probability distribution with density \mu, \mathcal{H} = \{h: \mathcal{Z} \to [0, M]\} (M > 0) be a class of measurable functions, and \{z_i\}_{i=1}^n be i.i.d. examples drawn from the distribution with density \mu. Then, for any \delta > 0, with probability at least 1 - \delta,

\sup_{h \in \mathcal{H}}\Big|\mathbb{E}_{Z \sim \mu}[h(Z)] - \frac{1}{n}\sum_{i=1}^n h(z_i)\Big| \leq 2\mathfrak{R}_n(\mathcal{H}) + M\sqrt{\frac{\log(2/\delta)}{2n}},

where \mathfrak{R}_n(\mathcal{H}) denotes the (expected) Rademacher complexity (Bartlett & Mendelson, 2002) of \mathcal{H} with sample size n over \mu.

Lemma 4.
Suppose the loss function is ρ-Lipschitz with respect to the first argument (0 < ρ < ∞), and all the functions in the model class F are bounded, i.e., there exists a constant C b such that f ∞ ≤ C b for any f ∈ F. Let C := sup t=±1 (C b , t). For any δ > 0, with probability 1 -δ, sup f ∈F R + PC (f ) -R + PC (f ) ≤ (1 + π + )2ρ R + n (F) + (1 + π + )C log 4 δ 2n . Proof. By the definition of R + PC (f ) and R + PC (f ), we can obtain sup f ∈F R + PC (f ) -R + PC (f ) ≤ sup f ∈F E p+(x) (f (x), +1) - 1 n n i=1 (f (x), +1) + π + sup f ∈F E p+(x) (f (x), -1) - 1 n n i=1 (f (x), -1) . By applying Lemma 3, we have for any δ > 0, with probability 1 -δ, sup f ∈F E p+(x) (f (x), +1) - 1 n n i=1 (f (x), +1) ≤ 2 R + n ( • F) + C log 2 δ 2n , and for any for any δ > 0, with probability 1 -δ, sup f ∈F E p+(x) (f (x), -1) - 1 n n i=1 (f (x), -1) ≤ 2 R + n ( • F) + C log 2 δ 2n , where • F means { • f | f ∈ F}. By Talagrand's lemma (Lemma 4.2 in Mohri et al. (2012) ), R + n ( • F) ≤ ρ R + n (F). Finally, by combing Eqs. ( 10), ( 11), ( 12), and (13), we have for any δ > 0, with probability at least 1 -δ, sup f ∈F R + PC (f ) -R + PC (f ) ≤ (1 + π + )2ρ R + n (F) + (1 + π + )C log 4 δ 2n , which concludes the proof of Lemma 4.  π 2 + + π + π - π 2 + + π 2 -+ π + π - = π 2 - π 2 + + π 2 -+ π + π - , φ -= p(y = +1 | y = -1) = 1 - π 2 -+ π + π - π 2 + + π 2 -+ π + π - = π 2 + π 2 + + π 2 -+ π + π - . In this way, we can further obtain the following noise transition ratios: In this way, we could simply represent R ppc (f ) and R pPC (f ) as R pPC (f ) = 1 1 -ρ + R + pPC (f ) + 1 1 -ρ - R - pPC (f ), R pPC (f ) = 1 1 -ρ + R + pPC (f ) + 1 1 -ρ - R - pPC (f ). Then we have the following lemma. Lemma 6. The following inequality holds: R( f pPC ) -R(f ) ≤ 2 1 -ρ + sup f ∈F R + pPC (f ) -R + pPC (f ) + 2 1 -ρ - sup f ∈F R - pPC (f ) -R - pPC (f ) .
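The empirical estimators $\widehat{R}_{\mathrm{PC}}^+$ and $\widehat{R}_{\mathrm{PC}}^-$ defined above are directly computable from the broken pairs. A minimal numerical sketch, assuming hypothetical model scores and the logistic loss (which is 1-Lipschitz in its first argument, so the lemmas apply with $\rho = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
pi_p = 0.6           # class prior, assumed known as in the paper
pi_n = 1.0 - pi_p

def loss(z, t):
    """Logistic loss log(1 + exp(-t*z)); 1-Lipschitz in z."""
    return np.logaddexp(0.0, -t * z)

# Hypothetical scores of some model f on the n "observed positive"
# points x_i and the n "observed negative" points x'_i.
n = 1000
fx_pos = rng.normal(+0.5, 1.0, n)
fx_neg = rng.normal(-0.5, 1.0, n)

# Empirical counterparts of R_PC^+ and R_PC^-; their sum is the
# empirical Pcomp risk that ERM minimizes.
R_hat_pos = np.mean(loss(fx_pos, +1) - pi_p * loss(fx_pos, -1))
R_hat_neg = np.mean(loss(fx_neg, -1) - pi_n * loss(fx_neg, +1))
R_hat_pc = R_hat_pos + R_hat_neg
```

Note that either partial estimator can go negative on finite samples, which is exactly the issue the correction functions in the main text address.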



In contrast to Xu et al. (2019) and Xu et al. (2020), which utilized pairwise comparison data to solve the regression problem, we focus on binary classification.

EXPERIMENTS

In this section, we conduct experiments to evaluate the practical performance of our proposed methods on various datasets.

http://yann.lecun.com/exdb/mnist/
https://github.com/zalandoresearch/fashion-mnist
https://github.com/rois-codh/kmnist
https://www.cs.toronto.edu/~kriz/cifar.html



where $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ denotes a binary loss function, $\pi_+ := p(y=+1)$ (resp. $\pi_- := p(y=-1)$) denotes the positive (resp. negative) class prior probability, and $p_+(x) := p(x \mid y=+1)$ (resp. $p_-(x) := p(x \mid y=-1)$) denotes the class-conditional probability density of the positive (resp. negative) data. ERM approximates the expectations over $p_+(x)$ and $p_-(x)$ by the empirical averages of positive and negative data, and the empirical risk is minimized with respect to the classifier $f$.
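As a concrete illustration (not part of the paper), the following sketch performs ERM for a linear model with the logistic loss on synthetic Gaussian pointwise labeled data; all data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pointwise labeled data: two 2-d Gaussian classes (illustrative).
n = 500
X = np.vstack([rng.normal(+1.0, 1.0, (n, 2)), rng.normal(-1.0, 1.0, (n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

# ERM: minimize the empirical logistic risk by plain gradient descent.
w = np.zeros(2)
for _ in range(200):
    m = y * (X @ w)                    # margins y * f(x)
    sig = 1.0 / (1.0 + np.exp(m))      # negative derivative of the loss w.r.t. margin
    w -= 0.1 * (-(y * sig)[:, None] * X).mean(axis=0)

train_acc = np.mean(np.sign(X @ w) == y)  # high on this well-separated data
```

The same ERM template carries over to Pcomp classification once the empirical risk is replaced by the (corrected) unbiased risk estimator.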

and we denote the pointwise data collected from $\widetilde{D} = \{(x_i, x'_i)\}_{i=1}^n$ by breaking the pairwise comparison relation as $\widetilde{D}_+ = \{x_i\}_{i=1}^n$ and $\widetilde{D}_- = \{x'_i\}_{i=1}^n$. Then we can obtain the following theorem.

Theorem 2. Pointwise examples in $\widetilde{D}_+$ and $\widetilde{D}_-$ are independently drawn from

$\tilde{p}_+(x) = \frac{\pi_+ p_+(x) + \pi_-^2 p_-(x)}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}$ and $\tilde{p}_-(x') = \frac{\pi_+^2 p_+(x') + \pi_- p_-(x')}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}$, respectively.
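The generation process can be simulated to check the marginal of the first pair elements against the density $(\pi_+ p_+(x) + \pi_-^2 p_-(x))/(\pi_+^2 + \pi_-^2 + \pi_+\pi_-)$ implied by the generation process. The sketch below uses a toy 3-point instance space with illustrative priors and class-conditionals:

```python
import numpy as np

rng = np.random.default_rng(1)
pi_p, pi_n = 0.6, 0.4

# Class-conditional densities on a 3-point toy instance space (illustrative).
p_pos = np.array([0.5, 0.3, 0.2])
p_neg = np.array([0.1, 0.3, 0.6])

# A comparison pair carries labels (y, y') in {(+1,+1), (+1,-1), (-1,-1)}
# with probabilities proportional to (pi_+^2, pi_+ pi_-, pi_-^2).
Z = pi_p**2 + pi_n**2 + pi_p * pi_n
pair_probs = np.array([pi_p**2, pi_p * pi_n, pi_n**2]) / Z

n = 200_000
pair_type = rng.choice(3, size=n, p=pair_probs)
y_first = np.where(pair_type <= 1, 1, -1)  # label of the first element x_i

# Draw each x_i from its class-conditional, then break the pairs:
# the first elements alone form the "observed positive" set.
x = np.empty(n, dtype=int)
pos = y_first == 1
x[pos] = rng.choice(3, size=pos.sum(), p=p_pos)
x[~pos] = rng.choice(3, size=(~pos).sum(), p=p_neg)

emp = np.bincount(x, minlength=3) / n
theo = (pi_p * p_pos + pi_n**2 * p_neg) / Z  # predicted marginal of x_i
```

For large `n` the empirical frequencies `emp` match `theo` to within sampling error, which is the content of the theorem for this toy space.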

Lemma 5. Suppose the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all the functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t=\pm 1}\ell(C_b, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,

$\sup_{f\in\mathcal{F}} \big|R_{\mathrm{PC}}^-(f) - \widehat{R}_{\mathrm{PC}}^-(f)\big| \le (1 + \pi_-)2\rho\,\mathfrak{R}_n^-(\mathcal{F}) + (1 + \pi_-)C_\ell\sqrt{\frac{\log(4/\delta)}{2n}}.$

Proof. Lemma 5 can be proved similarly to Lemma 4.

By combining Lemma 2, Lemma 4, and Lemma 5, Theorem 4 is proved.

F PROOF OF THEOREM 5

Suppose there are $n$ pairs of paired data points, which means there are in total $2n$ data points. For our Pcomp classification problem, we can simply regard $x$ sampled from $\tilde{p}_+(x)$ as (noisy) positive data and $x'$ sampled from $\tilde{p}_-(x')$ as (noisy) negative data. Given $n$ pairs of examples $\{(x_i, x'_i)\}_{i=1}^n$, among the $n$ observed positive examples there are actually $n \cdot p(y=+1 \mid \tilde{y}=+1)$ true positive examples, and among the $n$ observed negative examples there are actually $n \cdot p(y=-1 \mid \tilde{y}=-1)$ true negative examples. From our defined data generation process in Theorem 1, it is intuitive to obtain

$p(y=+1 \mid \tilde{y}=+1) = \frac{\pi_+^2 + \pi_+\pi_-}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}, \quad p(y=-1 \mid \tilde{y}=-1) = \frac{\pi_-^2 + \pi_+\pi_-}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}.$

Since $\phi_+ = p(y=-1 \mid \tilde{y}=+1) = 1 - p(y=+1 \mid \tilde{y}=+1)$ and $\phi_- = p(y=+1 \mid \tilde{y}=-1) = 1 - p(y=-1 \mid \tilde{y}=-1)$, we can obtain

$\phi_+ = 1 - \frac{\pi_+^2 + \pi_+\pi_-}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-} = \frac{\pi_-^2}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}, \quad \phi_- = 1 - \frac{\pi_-^2 + \pi_+\pi_-}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-} = \frac{\pi_+^2}{\pi_+^2 + \pi_-^2 + \pi_+\pi_-}.$

In this way, we can further obtain the following noise transition ratios:

$\rho_+ = p(\tilde{y}=-1 \mid y=+1) = \frac{p(y=+1 \mid \tilde{y}=-1)\,p(\tilde{y}=-1)}{p(y=+1)} = \frac{\pi_+}{2(\pi_+^2 + \pi_-^2 + \pi_+\pi_-)},$
$\rho_- = p(\tilde{y}=+1 \mid y=-1) = \frac{p(y=-1 \mid \tilde{y}=+1)\,p(\tilde{y}=+1)}{p(y=-1)} = \frac{\pi_-}{2(\pi_+^2 + \pi_-^2 + \pi_+\pi_-)},$

where $p(\tilde{y}=+1) = p(\tilde{y}=-1) = \frac{1}{2}$, because we have the same number of observed positive examples and observed negative examples.

G PROOF OF THEOREM 7

First of all, we introduce the following notations:

$R_{\mathrm{pPC}}^+(f) = \mathbb{E}_{\tilde{p}_+(x)}\big[\ell(f(x), +1)\,\mathbb{I}[x \in PP]\big]$, $\quad \widehat{R}_{\mathrm{pPC}}^+(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), +1)\,\mathbb{I}[x_i \in PP]$,
$R_{\mathrm{pPC}}^-(f) = \mathbb{E}_{\tilde{p}_-(x')}\big[\ell(f(x'), -1)\,\mathbb{I}[x' \in NN]\big]$, $\quad \widehat{R}_{\mathrm{pPC}}^-(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x'_i), -1)\,\mathbb{I}[x'_i \in NN]$.
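These noise transition ratios depend only on the class prior. A small helper (the function name is ours, purely illustrative) evaluates them:

```python
def pcomp_noise_rates(pi_p):
    """Noise rates (rho_+, rho_-) implied by the Pcomp generation process:
    rho_+ = pi_+ / (2Z), rho_- = pi_- / (2Z), Z = pi_+^2 + pi_-^2 + pi_+*pi_-."""
    pi_n = 1.0 - pi_p
    Z = pi_p**2 + pi_n**2 + pi_p * pi_n  # equals 1 - pi_p * pi_n
    return pi_p / (2.0 * Z), pi_n / (2.0 * Z)

# With balanced classes (pi_+ = 0.5), both rates equal 1/3.
rho_p, rho_n = pcomp_noise_rates(0.5)
```

So even with balanced classes, breaking the pairs yields labels corrupted at rate 1/3, which motivates treating Pcomp classification from the noisy-label learning perspective.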

Classification accuracy (mean±std) in percentage of each method on the four benchmark datasets with different class priors. The best performance is highlighted in bold.

The detailed descriptions of all used datasets with the corresponding models are provided in Appendix H. Since these datasets were originally designed for multi-class classification, we manually transformed them into binary classification datasets (please see Appendix H for details). As shown in Theorem 2, pairwise comparison examples can be equivalently transformed into pointwise examples, which are more convenient to generate; therefore, we generate pointwise examples in our experiments. Specifically, since Theorem 5 discloses the noise rates in our defined data generation process, we simply generate pointwise corrupted examples according to these noise rates.
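In other words, generating the training data reduces to class-conditional label flipping at the rates $\rho_+$ and $\rho_-$ of Theorem 5. A sketch with an illustrative helper name and synthetic labels:

```python
import numpy as np

def corrupt_labels(y_true, rho_p, rho_n, rng):
    """Flip positives with probability rho_p and negatives with rho_n,
    mimicking the noise rates of the Pcomp data generation process."""
    u = rng.random(y_true.shape[0])
    flip = np.where(y_true == 1, u < rho_p, u < rho_n)
    return np.where(flip, -y_true, y_true)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100_000)
# Balanced classes give rho_+ = rho_- = 1/3 (Theorem 5).
y_noisy = corrupt_labels(y, 1.0 / 3.0, 1.0 / 3.0, rng)
flip_rate = np.mean(y_noisy != y)  # close to 1/3 here
```

The observed flip rate concentrates around the prescribed noise rates, so the corrupted pointwise data match the distribution induced by breaking real comparison pairs.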

Classification accuracy (mean±std) in percentage of each method on the four UCI datasets with different class priors. The best performance is highlighted in bold.


Proof. We omit the proof of Lemma 6 since it is quite similar to that of Lemma 2.

As suggested by Lemma 6, we need to further upper bound the right-hand side of Eq. (15). According to Lemma 3, we have the following two lemmas.

Lemma 7. Suppose the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all the functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{|z| \le C_b,\, t=\pm 1}\ell(z, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,

$\sup_{f\in\mathcal{F}} \big|R_{\mathrm{pPC}}^+(f) - \widehat{R}_{\mathrm{pPC}}^+(f)\big| \le 2\rho\,\mathfrak{R}_n^+(\mathcal{F}) + C_\ell\sqrt{\frac{\log(2/\delta)}{2n}}.$

Lemma 8. Suppose the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all the functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{|z| \le C_b,\, t=\pm 1}\ell(z, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,

$\sup_{f\in\mathcal{F}} \big|R_{\mathrm{pPC}}^-(f) - \widehat{R}_{\mathrm{pPC}}^-(f)\big| \le 2\rho\,\mathfrak{R}_n^-(\mathcal{F}) + C_\ell\sqrt{\frac{\log(2/\delta)}{2n}}.$

We omit the proofs of Lemma 7 and Lemma 8 since they are similar to that of Lemma 4.

By combining Lemma 6, Lemma 7, and Lemma 8, Theorem 7 is proved.

H SUPPLEMENTARY INFORMATION OF EXPERIMENTS

Table 3 reports the specification of the used benchmark datasets and models.

MNIST (LeCun et al., 1998). This is a grayscale image dataset composed of handwritten digits from 0 to 9, where the size of each image is 28 × 28. It contains 60,000 training images and 10,000 test images. Because the original dataset has 10 classes, we regard the even digits as the positive class and the odd digits as the negative class.

Fashion-MNIST (Xiao et al., 2017). This dataset is converted into a binary classification dataset as follows:
• The positive class is formed by 'T-shirt', 'pullover', 'coat', 'shirt', and 'bag'.
• The negative class is formed by 'trouser', 'dress', 'sandal', 'sneaker', and 'ankle boot'.

Kuzushiji-MNIST (Clanuwat et al., 2018). This is another grayscale image dataset that is similar to MNIST. It is a 10-class dataset of cursive Japanese ("Kuzushiji") characters. It consists of 60,000 training images and 10,000 test images. It is converted into a binary classification dataset as follows:
• The positive class is formed by 'o', 'su', 'na', 'ma', and 're'.
• The negative class is formed by 'ki', 'tsu', 'ha', 'ya', and 'wo'.

CIFAR-10 (Krizhevsky et al., 2009). This is a color image dataset of 10 different objects ('airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', and 'truck'), where the size of each image is 32 × 32 × 3. There are 5,000 training images and 1,000 test images per class. This dataset is converted into a binary classification dataset as follows:
• The positive class is formed by 'bird', 'deer', 'dog', 'frog', 'cat', and 'horse'.
• The negative class is formed by 'airplane', 'automobile', 'ship', and 'truck'.

USPS, Pendigits, Optdigits. These datasets are composed of handwritten digits from 0 to 9. Because each of the original datasets has 10 classes, we regard the even digits as the positive class and the odd digits as the negative class.

CNAE-9. This dataset contains 1,080 documents of free-text business descriptions of Brazilian companies categorized into a subset of 9 categories cataloged in a table called the National Classification of Economic Activities. It is converted into a binary classification dataset as follows:
• The positive class is formed by categories '2', '4', '6', and '8'.
• The negative class is formed by categories '1', '3', '5', '7', and '9'.

For MNIST, Kuzushiji-MNIST, and Fashion-MNIST, we set the learning rate to 1e-3 and the weight decay to 1e-5. For CIFAR-10, we set the learning rate to 1e-3 and the weight decay to 1e-3. We also list the number of pointwise corrupted examples used for model training on each dataset: 30,000 for MNIST, Kuzushiji-MNIST, Fashion-MNIST, and CIFAR-10; 4,000 for USPS; 5,000 for Pendigits; 2,000 for Optdigits; and 400 for CNAE-9.
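The even/odd binarization used for the digit datasets above can be sketched as follows (the helper name is ours):

```python
import numpy as np

def binarize_even_odd(labels):
    """Map 10-class digit labels to binary ones: even digits -> +1, odd -> -1."""
    labels = np.asarray(labels)
    return np.where(labels % 2 == 0, 1, -1)

y_bin = binarize_even_odd([0, 1, 2, 3, 9])  # digits 0 and 2 map to +1
```

The same one-liner applies to USPS, Pendigits, and Optdigits, since all share the 0-9 label space.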

