PAIRWISE CONFIDENCE DIFFERENCE ON UNLABELED DATA IS SUFFICIENT FOR BINARY CLASSIFICATION

Anonymous

Abstract

Learning with confidence labels is an emerging weakly supervised learning paradigm, where training data are equipped with confidence labels instead of exact labels. Positive-confidence (Pconf) classification is a typical learning problem in this context, where we are given only positive data equipped with confidence. However, pointwise confidence may not be accessible in real-world scenarios. In this paper, we dive into a novel weakly supervised learning problem called confidence-difference (ConfDiff) classification. Instead of pointwise confidence, we are given only unlabeled data pairs equipped with confidence difference specifying the difference in the probabilities of being positive. An unbiased risk estimator is derived to tackle the problem, and we show that the estimation error bound achieves the optimal convergence rate. Extensive experiments on benchmark data sets validate the effectiveness of our proposed approaches in leveraging the supervision information of the confidence difference.

1. INTRODUCTION

Recent years have witnessed the prevalence of deep learning and its successful applications. However, this success rests on collecting large amounts of data with unique and accurate labels, a requirement that is often hard to satisfy in real-world scenarios. To circumvent this difficulty, various weakly supervised learning problems have been investigated, including but not limited to semi-supervised learning (Chapelle et al., 2006; Zhu & Goldberg, 2009; Li & Zhou, 2015; Berthelot et al., 2019), label-noise learning (Patrini et al., 2017; Han et al., 2018; Li et al., 2021; Wang et al., 2021; Wei et al., 2022), positive-unlabeled learning (du Plessis et al., 2014; Su et al., 2021; Yao et al., 2022), partial-label learning (Cour et al., 2011; Wang & Zhang, 2020; Wen et al., 2021; Wang et al., 2022; Wu et al., 2022), unlabeled-unlabeled learning (Lu et al., 2019; 2020), and similarity-based classification (Bao et al., 2018; Cao et al., 2021b; Bao et al., 2022). Learning with confidence labels (Ishida et al., 2018; Cao et al., 2021a;b) is another weakly supervised learning paradigm, where we are given training examples with confidence labels instead of exact labels. Positive-confidence (Pconf) classification (Ishida et al., 2018) is a problem setting within this scope, which aims to learn a binary classifier from only positive data equipped with confidence (the probability of being positive), without any negative data. Pconf classification alleviates the difficulty when negative data cannot be acquired due to privacy or security issues during data annotation. The need to learn from such inexact supervision widely exists in real-world scenarios, such as purchase prediction (Ishida et al., 2018), user preservation prediction (Ishida et al., 2018), and drivers' drowsiness prediction (Shinoda et al., 2020).
However, collecting large amounts of training examples with pointwise confidence can be demanding in practice, since it is difficult to specify the exact probability of being positive for each training example (Shinoda et al., 2020). Feng et al. (2021) showed that learning from pairwise comparisons can serve as an alternative strategy when pointwise labeling information is limited. Inspired by this, we investigate a more practical problem setting in this paper, where we are given only unlabeled data pairs with confidence difference indicating the difference in the probabilities of being positive. Compared with pointwise confidence, confidence difference can be collected more easily in many real-world scenarios. Take click-through rate prediction in recommender systems (Zhang et al., 2019) as an example. The combinations of users and their favorite/disliked items can be regarded as positive/negative data. When collecting training data, it is not easy to distinguish between positive and negative data. Furthermore, the positive confidence of training data may be difficult to determine due to extreme sparsity and class imbalance (Yao et al., 2021). However, it is much easier to obtain the difference in preference between a pair of candidate items for a given user. As another example, consider disease risk estimation, where the goal is to predict the risk of having some disease given a person's attributes. When asking doctors to annotate the probabilities of having the disease, it is not easy to determine the exact probability values. Furthermore, the values given by different doctors may differ due to subjective personal assumptions and will deviate from the ground truth. However, it is much easier, and less biased, to estimate the relative difference in the probabilities of having the disease between two people.
Our contributions are summarized as follows:
• We investigate confidence-difference (ConfDiff) classification, a novel and practical weakly supervised learning problem, which can be solved via empirical risk minimization by constructing an unbiased risk estimator. The proposed approach can be equipped with any model, loss function, and optimizer flexibly.
• An estimation error bound is derived, showing that the proposed approach achieves the optimal parametric convergence rate. Robustness is further demonstrated by probing the influence of an inaccurate class prior probability and noisy confidence differences.
• To mitigate overfitting issues, a risk correction approach (Lu et al., 2020) with a consistency guarantee is further introduced. Extensive experimental results on benchmark data sets validate the effectiveness of the proposed approaches.

Related works.

Learning with pairwise comparisons has been investigated extensively in the community (Burges et al., 2005; Cao et al., 2007; Jamieson & Nowak, 2011; Park et al., 2015; Kane et al., 2017; Xu et al., 2017; Shah et al., 2019), with applications in information retrieval (Liu, 2011), computer vision (Fu et al., 2015), regression (Xu et al., 2019; 2020), crowdsourcing (Chen et al., 2013; Zeng & Shen, 2022), graph learning (He et al., 2022), etc. It is noteworthy that there are distinct differences between our work and previous works on learning with pairwise comparisons. Previous works have mainly tried to learn a ranking function that ranks candidate examples according to relevance or preference. In this paper, we instead learn a pointwise binary classifier by conducting empirical risk minimization under the binary classification setting.

Relationship to Pcomp classification. Feng et al. (2021) showed that a binary classifier can be learned from pairwise comparisons, a setting termed Pcomp classification. There are distinct differences between our work and Pcomp classification. First, Pcomp classification is not capable of leveraging the fine-grained confidence difference, which can be obtained incidentally when collecting pairwise comparison data. We experimentally elucidate the benefit of exploiting the confidence difference in a later section. Second, the assumed data generation processes are different. Pcomp classification assumes that each unlabeled data pair is ordered, with the first instance more likely to be positive than the other. In ConfDiff classification, the two instances of each unlabeled data pair are independent, which makes such pairs easier to collect.

2. PRELIMINARIES

In this section, we introduce the notations used in this paper and discuss the background of binary classification, Pconf classification and Pcomp classification. Then, we elucidate the data generation process of confidence-difference classification.

2.1. BINARY CLASSIFICATION

For binary classification, let X = R^d denote the d-dimensional feature space and Y = {+1, -1} the label space. Let p(x, y) denote the unknown joint probability distribution over the random variables (x, y) ∈ X × Y. The task of binary classification is to learn a binary classifier g : X → R which minimizes the classification risk

R(g) = E_{p(x,y)}[ℓ(g(x), y)],    (1)

where ℓ(·, ·) is a non-negative binary-class loss function, such as the 0-1 loss or the logistic loss. Let π_+ = p(y = +1) and π_- = p(y = -1) denote the class prior probabilities of the positive and negative classes respectively, and let p_+(x) = p(x | y = +1) and p_-(x) = p(x | y = -1) denote the corresponding class-conditional probability densities. The classification risk in Eq. (1) can then be equivalently expressed as

R(g) = π_+ E_{p_+(x)}[ℓ(g(x), +1)] + π_- E_{p_-(x)}[ℓ(g(x), -1)].    (2)
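Eq. (2) can be checked numerically on a toy problem: sampling labeled data from a known distribution, the empirical joint risk should match the class-prior-weighted sum of class-conditional risks. A minimal NumPy sketch follows; the Gaussian class conditionals and the fixed scorer g(x) = x are our own illustrative choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(z, y):
    # logistic loss ℓ(z, y) = log(1 + exp(-y z)), computed stably
    return np.logaddexp(0.0, -y * z)

# Toy model: p_+(x) = N(+1, 1), p_-(x) = N(-1, 1), class prior π_+ = 0.3.
pi_pos, n = 0.3, 100_000
y = np.where(rng.random(n) < pi_pos, 1, -1)
x = rng.normal(loc=y.astype(float), scale=1.0)

# Left-hand side of Eq. (1): expectation over the joint distribution.
risk_joint = np.mean(logistic_loss(x, y))

# Right-hand side of Eq. (2): prior-weighted class-conditional risks.
risk_split = (pi_pos * np.mean(logistic_loss(x[y == 1], +1))
              + (1 - pi_pos) * np.mean(logistic_loss(x[y == -1], -1)))
```

Up to Monte Carlo error, the two estimates agree, illustrating the equivalence of Eqs. (1) and (2).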

2.2. POSITIVE-CONFIDENCE (PCONF) CLASSIFICATION

In many real-world applications, it may be difficult to collect negative data. Pconf classification (Ishida et al., 2018) aims to induce a binary classifier from only positive data. The additional requirement is that the confidence of being positive is accessible to the learning algorithm. Given only positive data equipped with confidence {(x_i, r_i)}_{i=1}^n, where r_i = p(y_i = +1 | x_i) is the positive confidence associated with x_i, Ishida et al. (2018) provided an unbiased risk estimator to conduct empirical risk minimization:

R̂_Pconf(g) = (π_+ / n) Σ_{i=1}^n [ℓ(g(x_i), +1) + ((1 - r_i) / r_i) ℓ(g(x_i), -1)].    (3)

However, pointwise positive confidence may not be easy to obtain in real-world scenarios (Shinoda et al., 2020).
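As a concrete illustration, the Pconf estimator in Eq. (3) can be transcribed in a few lines of NumPy. The logistic loss and the function names are our own choices; a real implementation would feed this quantity to a gradient-based optimizer:

```python
import numpy as np

def logistic_loss(z, y):
    # ℓ(z, y) = log(1 + exp(-y z)), computed stably via logaddexp
    return np.logaddexp(0.0, -y * z)

def pconf_risk(scores, r, pi_pos):
    """Unbiased Pconf risk estimate from positive data only (Eq. (3)).

    scores : model outputs g(x_i) on the n positive training points
    r      : positive confidences r_i = p(y=+1 | x_i), assumed in (0, 1]
    pi_pos : class prior π_+ (assumed known or estimated separately)
    """
    loss_pos = logistic_loss(scores, +1)
    loss_neg = logistic_loss(scores, -1)
    return pi_pos * np.mean(loss_pos + (1.0 - r) / r * loss_neg)
```

Note the division by r_i, which is exactly why the estimator becomes unstable for small confidence values, a point revisited in Section 3.1.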

2.3. PAIRWISE-COMPARISON (PCOMP) CLASSIFICATION

Pcomp classification is a weakly supervised binary classification problem (Feng et al., 2021). In Pcomp classification, we are given pairs of unlabeled data where we know which instance is more likely to be positive than the other. It is assumed that Pcomp data are sampled from labeled data pairs whose label pairs belong to {(+1, -1), (+1, +1), (-1, -1)}. Based on this assumption, the probability density of Pcomp data (x, x′) is given as

p̄(x, x′) = q(x, x′) / (π_+² + π_-² + π_+ π_-),

where q(x, x′) = π_+² p_+(x) p_+(x′) + π_-² p_-(x) p_-(x′) + π_+ π_- p_+(x) p_-(x′). Then, an unbiased risk estimator for Pcomp classification is derived as follows:

R̂_Pcomp(g) = (1/n) Σ_{i=1}^n [ℓ(g(x_i), +1) + ℓ(g(x′_i), -1) - π_+ ℓ(g(x_i), -1) - π_- ℓ(g(x′_i), +1)].

In real-world applications, we may not only know that one example is more likely to be positive than the other, but also how large the difference in confidence is. Next, a novel weakly supervised learning setting named ConfDiff classification is introduced.
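For later comparison with the ConfDiff estimator, the Pcomp estimator displayed above can be sketched analogously; the logistic loss and all names here are illustrative, transcribing the displayed formula:

```python
import numpy as np

def logistic_loss(z, y):
    # ℓ(z, y) = log(1 + exp(-y z))
    return np.logaddexp(0.0, -y * z)

def pcomp_risk(g_x, g_xp, pi_pos):
    """Unbiased Pcomp risk estimate from ordered pairs (x_i, x'_i),
    where x_i is assumed more likely to be positive than x'_i."""
    pi_neg = 1.0 - pi_pos
    return np.mean(logistic_loss(g_x, +1) + logistic_loss(g_xp, -1)
                   - pi_pos * logistic_loss(g_x, -1)
                   - pi_neg * logistic_loss(g_xp, +1))
```

The negative terms are what can drive the empirical risk below zero, motivating the risk correction discussed in Section 3.4 for both Pcomp and ConfDiff estimators.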

2.4. CONFIDENCE-DIFFERENCE (CONFDIFF) CLASSIFICATION

In this subsection, we first give the formal definition of confidence difference and then elaborate on the generation process of ConfDiff data.

Definition 1 (Confidence Difference). The confidence difference c(x, x′) between an unlabeled data pair (x, x′) is defined as

c(x, x′) = p(y′ = +1 | x′) - p(y = +1 | x).

As shown in the definition above, the confidence difference is the difference in the class posterior probabilities within an unlabeled data pair, which measures how confident the pairwise comparison is. In ConfDiff classification, we are given only n unlabeled data pairs with confidence differences D = {((x_i, x′_i), c_i)}_{i=1}^n, where c_i = c(x_i, x′_i) is the confidence difference for the unlabeled data pair (x_i, x′_i). Furthermore, each unlabeled data pair (x_i, x′_i) is assumed to be drawn from the probability density p(x, x′) = p(x) p(x′), which indicates that x_i and x′_i are two i.i.d. instances sampled from p(x). It is worth noting that the confidence difference c_i is positive if the second instance x′_i has a higher probability of being positive than the first instance x_i, and negative otherwise. During data collection, the labeler can first sample two unlabeled instances from the marginal distribution p(x) and then provide the confidence difference for them. This data generation assumption makes the unlabeled data pairs easier to collect.
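The data generation process above can be sketched as follows, assuming (hypothetically) that a ground-truth posterior p(y = +1 | x) is available to simulate the labeler; in practice the labeler supplies c_i directly and no posterior is observed:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior(x):
    # Hypothetical class posterior p(y=+1 | x) for a 1-D toy model;
    # in practice this function is unknown to the learner.
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def sample_confdiff(n):
    """Draw n unlabeled pairs (x_i, x'_i) i.i.d. from p(x) and attach
    the confidence difference c_i = p(y'=+1 | x'_i) - p(y=+1 | x_i)."""
    x = rng.normal(size=n)
    x_prime = rng.normal(size=n)   # independent second instance
    c = posterior(x_prime) - posterior(x)
    return x, x_prime, c
```

Since both posteriors lie in [0, 1], every confidence difference lies in [-1, 1], and by construction c(x, x′) = -c(x′, x).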

3. THE PROPOSED APPROACH

In this section, an unbiased risk estimator is presented for ConfDiff classification. Then, we give an estimation error bound to show the convergence property. Besides, we show the influence of an inaccurate class prior probability and noisy confidence difference on the risk estimator. Furthermore, a risk correction approach (Lu et al., 2020) is elaborated to improve the generalization performance of our proposed approach.

3.1. UNBIASED RISK ESTIMATOR

In this subsection, we show that the classification risk in Eq. (1) can be expressed equivalently with ConfDiff data.

Theorem 1. The classification risk R(g) in Eq. (1) can be equivalently expressed as

R_CD(g) = E_{p(x,x′)}[(L(x, x′) + L(x′, x)) / 2],

where L(x, x′) = (π_+ - c(x, x′)) ℓ(g(x), +1) + (π_- - c(x, x′)) ℓ(g(x′), -1).

Accordingly, we can derive an unbiased risk estimator for ConfDiff classification:

R̂_CD(g) = (1/2n) Σ_{i=1}^n [(π_+ - c_i) ℓ(g(x_i), +1) + (π_- - c_i) ℓ(g(x′_i), -1) + (π_+ + c_i) ℓ(g(x′_i), +1) + (π_- + c_i) ℓ(g(x_i), -1)].    (8)

To estimate the class prior probability π_+, we can transform ConfDiff data into Pcomp data by ordering the two instances in each unlabeled data pair according to the sign of the confidence difference, and then adopt the estimation approach proposed in Feng et al. (2021). It is worth noting that the risk estimator in Eq. (3) for Pconf classification is very sensitive to small confidence values, while our risk estimator is not influenced by them.

Minimum-variance risk estimator. Eq. (8) is in fact only one of a family of unbiased risk estimators. We introduce the following lemma:

Lemma 1. For any weight α ∈ [0, 1], the following expression is also an unbiased risk estimator:

(1/n) Σ_{i=1}^n [α L(x_i, x′_i) + (1 - α) L(x′_i, x_i)].    (9)

Then, we introduce the following theorem:

Theorem 2. The unbiased risk estimator in Eq. (8) has the minimum variance among all candidate unbiased risk estimators of the form of Eq. (9) w.r.t. α ∈ [0, 1].

Theorem 2 establishes the variance minimality of the proposed unbiased risk estimator in Eq. (8), which corresponds to α = 1/2, and we adopt this risk estimator in the following sections.
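A minimal sketch of the unbiased risk estimator in Eq. (8), instantiated with the logistic loss; the function and variable names are our own:

```python
import numpy as np

def logistic_loss(z, y):
    # ℓ(z, y) = log(1 + exp(-y z))
    return np.logaddexp(0.0, -y * z)

def confdiff_risk(g_x, g_xp, c, pi_pos):
    """Unbiased ConfDiff risk estimate (Eq. (8)).

    g_x, g_xp : classifier outputs g(x_i), g(x'_i) on the n pairs
    c         : confidence differences c_i = p(y'=+1|x'_i) - p(y=+1|x_i)
    pi_pos    : class prior π_+
    """
    pi_neg = 1.0 - pi_pos
    terms = ((pi_pos - c) * logistic_loss(g_x, +1)
             + (pi_neg - c) * logistic_loss(g_xp, -1)
             + (pi_pos + c) * logistic_loss(g_xp, +1)
             + (pi_neg + c) * logistic_loss(g_x, -1))
    return np.mean(terms) / 2.0
```

By construction the estimate is invariant under swapping (x_i, x′_i) and negating c_i, consistent with the α = 1/2 symmetrization singled out by Theorem 2.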

3.2. ESTIMATION ERROR BOUND

In this subsection, we elaborate on the convergence of the proposed risk estimator R̂_CD(g) by giving an estimation error bound. Let G = {g : X → R} denote the model class. It is assumed that there exists a constant C_g such that sup_{g∈G} ∥g∥_∞ ≤ C_g and a constant C_ℓ such that sup_{|z|≤C_g} ℓ(z, y) ≤ C_ℓ. We also assume that the binary loss function ℓ(z, y) is Lipschitz continuous in z with Lipschitz constant L_ℓ. Let g* = argmin_{g∈G} R(g) denote the minimizer of the classification risk in Eq. (1) and ĝ_CD = argmin_{g∈G} R̂_CD(g) denote the minimizer of the unbiased risk estimator in Eq. (8). The following theorem can be derived:

Theorem 3. For any δ > 0, the following inequality holds with probability at least 1 - δ:

R(ĝ_CD) - R(g*) ≤ 8 L_ℓ R_n(G) + 4 C_ℓ √(ln(2/δ) / (2n)),

where R_n(G) denotes the Rademacher complexity of G for unlabeled data of size n. From Theorem 3, we can observe that R(ĝ_CD) → R(g*) as n → ∞, because R_n(G) → 0 for all parametric models with a bounded norm, such as deep neural networks trained with weight decay (Golowich et al., 2018). Furthermore, the estimation error bound converges in O_p(1/√n), where O_p denotes the order in probability, which is the optimal parametric rate for empirical risk minimization without additional assumptions (Mendelson, 2008).

3.3. ROBUSTNESS OF RISK ESTIMATOR

In the previous subsections, it was assumed that the class prior probability is known in advance or estimated accurately, and that the ground-truth confidence difference of each unlabeled data pair is accessible. However, these assumptions can rarely be satisfied in real-world scenarios, since the collection of confidence differences is inevitably injected with noise. In this subsection, we theoretically analyze the influence of an inaccurate class prior probability and noisy confidence differences on the learning procedure; in Subsection 4.4, we experimentally verify these theoretical findings. Let D̄ = {((x_i, x′_i), c̃_i)}_{i=1}^n denote n unlabeled data pairs with noisy confidence differences, where c̃_i is generated by corrupting the ground-truth confidence difference c_i with noise. Besides, let π̃_+ denote the inaccurate class prior probability accessible to the learning algorithm, and let R̄_CD(g) denote the empirical risk calculated with the inaccurate class prior probability and noisy confidence differences. Let ḡ_CD = argmin_{g∈G} R̄_CD(g) denote the minimizer of R̄_CD(g). The following estimation error bound can then be derived:

Theorem 4. Under the assumptions above, for any δ > 0, the following inequality holds with probability at least 1 - δ:

R(ḡ_CD) - R(g*) ≤ 16 L_ℓ R_n(G) + 8 C_ℓ √(ln(2/δ) / (2n)) + (4 C_ℓ / n) Σ_{i=1}^n |c_i - c̃_i| + 4 C_ℓ |π_+ - π̃_+|.

Theorem 4 indicates that the estimation error is bounded by twice the original bound in Theorem 3 plus terms involving the mean absolute error of the noisy confidence differences and the error of the class prior probability. Furthermore, if Σ_{i=1}^n |c_i - c̃_i| grows sublinearly with high probability and the class prior probability is estimated consistently, the learning procedure remains consistent. This demonstrates the robustness of the proposed approach.

3.4. RISK CORRECTION APPROACH

It is worth noting that the empirical risk in Eq. (8) may be negative due to its negative terms, which is unreasonable given the non-negativity of loss functions, and this phenomenon results in severe overfitting when complex models are adopted (Lu et al., 2020; Cao et al., 2021b; Feng et al., 2021). To circumvent this difficulty, we wrap the partial sums in Eq. (8) with risk correction functions proposed in Lu et al. (2020), such as the rectified linear unit (ReLU) function f(z) = max(0, z) and the absolute value function f(z) = |z|. In this way, the corrected risk estimator for ConfDiff classification can be expressed as

R̃_CD(g) = (1/2n) [ f(Σ_{i=1}^n (π_+ - c_i) ℓ(g(x_i), +1)) + f(Σ_{i=1}^n (π_- - c_i) ℓ(g(x′_i), -1)) + f(Σ_{i=1}^n (π_+ + c_i) ℓ(g(x′_i), +1)) + f(Σ_{i=1}^n (π_- + c_i) ℓ(g(x_i), -1)) ].    (12)

Theoretical analysis. We assume that the risk correction function f(z) is Lipschitz continuous with Lipschitz constant L_f. For ease of notation, let

A_g = (1/2n) Σ_{i=1}^n (π_+ - c_i) ℓ(g(x_i), +1),  B_g = (1/2n) Σ_{i=1}^n (π_- - c_i) ℓ(g(x′_i), -1),
C_g = (1/2n) Σ_{i=1}^n (π_+ + c_i) ℓ(g(x′_i), +1),  D_g = (1/2n) Σ_{i=1}^n (π_- + c_i) ℓ(g(x_i), -1).

From Lemma 3 in Appendix A, the expectations E[A_g], E[B_g], E[C_g], and E[D_g] are non-negative. Therefore, we assume that there exist non-negative constants a, b, c, and d such that E[A_g] ≥ a, E[B_g] ≥ b, E[C_g] ≥ c, and E[D_g] ≥ d. Besides, let g̃_CD = argmin_{g∈G} R̃_CD(g) denote the minimizer of R̃_CD(g). Theorem 5 then characterizes the bias and consistency of R̃_CD(g).

Theorem 5. Under the assumptions above, the bias of the risk estimator R̃_CD(g) decays exponentially as n → ∞:

0 ≤ E[R̃_CD(g)] - R(g) ≤ 2 (L_f + 1) C_ℓ Δ,

where Δ = exp(-2a²n/C_ℓ²) + exp(-2b²n/C_ℓ²) + exp(-2c²n/C_ℓ²) + exp(-2d²n/C_ℓ²). Furthermore, with probability at least 1 - δ, we have

|R̃_CD(g) - R(g)| ≤ 2 C_ℓ L_f √(ln(2/δ) / (2n)) + 2 (L_f + 1) C_ℓ Δ.
Theorem 5 demonstrates that R̃_CD(g) → R(g) in O_p(1/√n), which means that R̃_CD(g) is biased yet consistent. The estimation error bound of g̃_CD is analyzed in Theorem 6.

Theorem 6. Under the assumptions above, for any δ > 0, the following inequality holds with probability at least 1 - δ:

R(g̃_CD) - R(g*) ≤ 8 L_ℓ R_n(G) + 4 C_ℓ (L_f + 1) √(ln(2/δ) / (2n)) + 4 (L_f + 1) C_ℓ Δ.

Theorem 6 shows that R(g̃_CD) → R(g*) as n → ∞, since R_n(G) → 0 for all parametric models with a bounded norm (Mohri et al., 2012) and Δ → 0. Furthermore, the estimation error bound converges in O_p(1/√n), which is the optimal parametric rate for empirical risk minimization without additional assumptions (Mendelson, 2008).
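A sketch of the corrected estimator in Eq. (12): each of the four partial sums is passed through the correction function f before averaging. With f the identity it reduces to the unbiased estimator in Eq. (8); all names are illustrative:

```python
import numpy as np

def logistic_loss(z, y):
    # ℓ(z, y) = log(1 + exp(-y z))
    return np.logaddexp(0.0, -y * z)

def confdiff_corrected_risk(g_x, g_xp, c, pi_pos, f=np.abs):
    """Corrected ConfDiff risk (Eq. (12)): the four partial sums are
    wrapped with a risk correction function f (ReLU or absolute value),
    which keeps the empirical risk non-negative."""
    pi_neg = 1.0 - pi_pos
    n = len(c)
    s1 = np.sum((pi_pos - c) * logistic_loss(g_x, +1))
    s2 = np.sum((pi_neg - c) * logistic_loss(g_xp, -1))
    s3 = np.sum((pi_pos + c) * logistic_loss(g_xp, +1))
    s4 = np.sum((pi_neg + c) * logistic_loss(g_x, -1))
    return (f(s1) + f(s2) + f(s3) + f(s4)) / (2.0 * n)
```

When every per-pair weight is already non-negative (i.e., |c_i| ≤ min(π_+, π_-)), the absolute-value correction leaves the estimate unchanged, so the correction only acts where negative partial sums would otherwise appear.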

4. EXPERIMENTS

In this section, we verify the effectiveness of our proposed approaches experimentally.

4.1. EXPERIMENTAL SETUP

We conducted experiments on benchmark data sets, including MNIST (LeCun et al., 1998), Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky & Hinton, 2009). In addition, four UCI data sets (Dua & Graff, 2017) were used: Optdigits, USPS, Pendigits, and Letter. Since these data sets were originally designed for multi-class classification, we manually partitioned them into binary classes; detailed descriptions of the data sets are provided in the Appendix. For CIFAR-10, we used ResNet-34 (He et al., 2016) as the model architecture. For the other data sets, we used a multilayer perceptron (MLP) with three hidden layers of width 300, equipped with the ReLU (Nair & Hinton, 2010) activation function and batch normalization (Ioffe & Szegedy, 2015). The logistic loss was used to instantiate the loss function ℓ(·, ·). It is worth noting that while confidence differences would be given by labelers in real-world applications, they were generated synthetically in this paper to facilitate comprehensive experimental analysis. We first trained a probabilistic classifier via logistic regression with ordinarily labeled data and the same neural network architecture. Then, we sampled unlabeled data in pairs at random and generated their class posterior probabilities by feeding them into the probabilistic classifier. After that, we generated the confidence difference for each pair of sampled data according to Definition 1. In the experiments, we adopted the following variants of our proposed approaches: 1) ConfDiff-Unbiased, which minimizes the unbiased risk estimator proposed in Eq. (8); 2) ConfDiff-ReLU, which minimizes the corrected risk estimator proposed in Eq. (12) with the ReLU function as the risk correction function; 3) ConfDiff-ABS, which minimizes the corrected risk estimator proposed in Eq.
(12) with the absolute value function as the risk correction function. We compared our proposed approaches with the following methods: 1) Pcomp-Unbiased, which minimizes the unbiased risk estimator for Pcomp classification proposed in Feng et al. (2021); 2) Pcomp-ReLU, the risk correction approach for Pcomp classification with the ReLU function as the risk correction function; 3) Pcomp-ABS, the risk correction approach for Pcomp classification with the absolute value function as the risk correction function; 4) Pcomp-Teacher, the state-of-the-art approach that improves the label-noise learning approach RankPruning (Northcutt et al., 2017) with consistency regularization. The number of training epochs was set to 200, and we obtained the test accuracy by averaging the results over the last 10 epochs. The detailed hyperparameters can be found in the Appendix. To verify the effectiveness of our approaches under different class prior settings, we set π_+ ∈ {0.2, 0.5, 0.8} for all the data sets. For ease of implementation, we assumed that the class prior π_+ was known for all the compared methods. We repeated the sampling-and-training procedure five times and recorded the mean accuracy and standard deviation.

4.2. EXPERIMENTAL RESULTS

Benchmark data sets. Table 1 reports detailed experimental results for all the compared methods on the four benchmark data sets. Based on Table 1, we can draw the following conclusions: a) In all cases on the benchmark data sets, our proposed ConfDiff-ABS method significantly outperforms all of the other compared approaches, which validates the effectiveness of our approach in utilizing the supervision information in confidence differences; b) Pcomp-Teacher outperforms all of the other Pcomp approaches by a large margin, which benefits from the effectiveness of consistency regularization for weakly supervised learning problems (Berthelot et al., 2019; Li et al., 2020; Wu et al., 2022).

UCI data sets. Table 2 reports detailed experimental results on the four UCI data sets. From Table 2, we can observe that: a) On all the UCI data sets under different class prior probability settings, our proposed ConfDiff-ABS method achieves the best performance among all the compared approaches with significant superiority, which again verifies the effectiveness of our proposed approaches; b) The performance of our proposed approaches is more stable than that of the compared Pcomp approaches under different class prior probability settings, demonstrating the superiority of our methods in dealing with various kinds of data distributions; c) ConfDiff-Unbiased performs comparably to its risk correction variants on some data sets while being inferior on others, mainly because some data sets have simpler patterns and are thus less affected by overfitting.

4.3. PERFORMANCE WITH FEWER TRAINING DATA

To validate the effectiveness of exploiting the confidence difference, we conducted experiments varying the fraction of training data for ConfDiff-ReLU and ConfDiff-ABS (100% indicates that all the ConfDiff data were used for training). For comparison, we used 100% of the training data for Pcomp-Teacher during the training process. Figure 1 shows the results on four data sets with π_+ = 0.2, and more experimental results can be found in the Appendix. We can observe that the classification performance of ConfDiff-ReLU and ConfDiff-ABS remains competitive even with a reduced fraction of training data, which further confirms the benefit of exploiting the confidence difference.

4.4. ANALYSIS ON ROBUSTNESS

In this subsection, we investigate the influence of an inaccurate class prior probability and noisy confidence differences on the generalization performance of the proposed approaches. Specifically, let π̃_+ = ϵ π_+ denote the corrupted class prior probability, where ϵ is a real number around 1. Let c̃_i = ϵ′_i c_i denote the noisy confidence difference, where ϵ′_i is sampled from a normal distribution N(1, σ²). Figure 2 shows the classification performance of our proposed approaches on MNIST and Pendigits (π_+ = 0.5) with different values of ϵ and σ. We can observe that ConfDiff-ABS is more robust against these corruptions than ConfDiff-Unbiased and ConfDiff-ReLU. With π̃_+ and c̃_i varying in a reasonable range, the performance is generally stable and often still superior to that of the compared approaches. However, the performance degenerates with ϵ = 0.8 or ϵ = 1.2 on some data sets, which indicates that obtaining an accurate estimate of the class prior probability is the more important factor for model training.
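The corruption model of this subsection can be sketched as follows; clipping the noisy differences to [-1, 1] is our own assumption to keep them in the valid range:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt(c, pi_pos, eps=1.1, sigma=0.2):
    """Inject the corruptions studied in Section 4.4: scale the class
    prior by eps and multiply each confidence difference c_i by a
    Gaussian factor drawn from N(1, sigma^2)."""
    pi_noisy = eps * pi_pos
    c_noisy = c * rng.normal(1.0, sigma, size=len(c))
    # keep the noisy differences in the valid range [-1, 1]
    return np.clip(c_noisy, -1.0, 1.0), pi_noisy
```

Feeding (c_noisy, pi_noisy) instead of (c, pi_pos) into the estimators of Section 3 reproduces the robustness experiments; Theorem 4 predicts that the induced degradation scales with the mean absolute error of the differences and the prior error.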

5. CONCLUSION

In this paper, we investigated a novel weakly supervised learning setting where only unlabeled data pairs equipped with confidence differences are given. To solve the problem, an unbiased risk estimator was derived to perform empirical risk minimization. An estimation error bound was established, showing that the optimal parametric convergence rate can be achieved. Furthermore, a risk correction approach was introduced to alleviate overfitting. Extensive experimental results validated the superiority of our proposed approaches. In the future, it would be promising to apply our approaches to real-world scenarios.

A PROOF OF THEOREM 1

Before giving the proof of Theorem 1, we begin with the following lemmas.

Lemma 2. The confidence difference c(x, x′) can be equivalently expressed as

c(x, x′) = (π_+ p(x) p_+(x′) - π_+ p_+(x) p(x′)) / (p(x) p(x′))    (16)
         = (π_- p_-(x) p(x′) - π_- p(x) p_-(x′)) / (p(x) p(x′)).    (17)

Proof. On one hand,

c(x, x′) = p(y′ = +1 | x′) - p(y = +1 | x)
         = p(x′, y′ = +1) / p(x′) - p(x, y = +1) / p(x)
         = π_+ p_+(x′) / p(x′) - π_+ p_+(x) / p(x)
         = (π_+ p(x) p_+(x′) - π_+ p_+(x) p(x′)) / (p(x) p(x′)).

On the other hand,

c(x, x′) = p(y′ = +1 | x′) - p(y = +1 | x)
         = (1 - p(y′ = -1 | x′)) - (1 - p(y = -1 | x))
         = p(y = -1 | x) - p(y′ = -1 | x′)
         = p(x, y = -1) / p(x) - p(x′, y′ = -1) / p(x′)
         = π_- p_-(x) / p(x) - π_- p_-(x′) / p(x′)
         = (π_- p_-(x) p(x′) - π_- p(x) p_-(x′)) / (p(x) p(x′)),

which concludes the proof.

Lemma 3. The following equations hold:

E_{p(x,x′)}[(π_+ - c(x, x′)) ℓ(g(x), +1)] = π_+ E_{p_+(x)}[ℓ(g(x), +1)],    (18)
E_{p(x,x′)}[(π_- + c(x, x′)) ℓ(g(x), -1)] = π_- E_{p_-(x)}[ℓ(g(x), -1)],    (19)
E_{p(x,x′)}[(π_+ + c(x, x′)) ℓ(g(x′), +1)] = π_+ E_{p_+(x′)}[ℓ(g(x′), +1)],    (20)
E_{p(x,x′)}[(π_- - c(x, x′)) ℓ(g(x′), -1)] = π_- E_{p_-(x′)}[ℓ(g(x′), -1)].    (21)

Proof. First, the proof of Eq. (18) is given. Substituting Eq. (16) and p(x, x′) = p(x) p(x′), we have

E_{p(x,x′)}[(π_+ - c(x, x′)) ℓ(g(x), +1)]
= ∫∫ (π_+ p(x) p(x′) - π_+ p(x) p_+(x′) + π_+ p_+(x) p(x′)) ℓ(g(x), +1) dx dx′
= π_+ ∫ p(x) ℓ(g(x), +1) dx ∫ p(x′) dx′ - π_+ ∫ p(x) ℓ(g(x), +1) dx ∫ p_+(x′) dx′ + π_+ ∫ p_+(x) ℓ(g(x), +1) dx ∫ p(x′) dx′
= π_+ ∫ p_+(x) ℓ(g(x), +1) dx
= π_+ E_{p_+(x)}[ℓ(g(x), +1)].

After that, the proof of Eq.
(19) is given. Substituting Eq. (17) and p(x, x′) = p(x) p(x′), we have

E_{p(x,x′)}[(π_- + c(x, x′)) ℓ(g(x), -1)]
= ∫∫ (π_- p(x) p(x′) + π_- p_-(x) p(x′) - π_- p(x) p_-(x′)) ℓ(g(x), -1) dx dx′
= π_- ∫ p(x) ℓ(g(x), -1) dx ∫ p(x′) dx′ + π_- ∫ p_-(x) ℓ(g(x), -1) dx ∫ p(x′) dx′ - π_- ∫ p(x) ℓ(g(x), -1) dx ∫ p_-(x′) dx′
= π_- ∫ p_-(x) ℓ(g(x), -1) dx
= π_- E_{p_-(x)}[ℓ(g(x), -1)].

It can be noticed that c(x, x′) = -c(x′, x) and p(x, x′) = p(x′, x). Therefore, it can be deduced that E_{p(x,x′)}[(π_+ - c(x, x′)) ℓ(g(x), +1)] = E_{p(x′,x)}[(π_+ + c(x′, x)) ℓ(g(x), +1)]. Because x and x′ are symmetric, we can swap them and deduce Eq. (20). Eq. (21) can be deduced in the same manner, which concludes the proof.

Based on Lemma 3, the proof of Theorem 1 is given.

Proof of Theorem 1. To begin with, it can be noticed that E_{p_+(x)}[ℓ(g(x), +1)] = E_{p_+(x′)}[ℓ(g(x′), +1)] and E_{p_-(x)}[ℓ(g(x), -1)] = E_{p_-(x′)}[ℓ(g(x′), -1)]. Then, by summing up Eqs. (18)-(21), we obtain

E_{p(x,x′)}[L(x, x′) + L(x′, x)] = 2 π_+ E_{p_+(x)}[ℓ(g(x), +1)] + 2 π_- E_{p_-(x)}[ℓ(g(x), -1)] = 2 R(g).

Dividing both sides of the equation above by 2 yields Theorem 1.

B PROOF OF THEOREM 2

It can be observed that

2μ_1 - 2μ_2 = E_{p(x,x′)}[((1/n) Σ_{i=1}^n (L(x_i, x′_i) - L(x′_i, x_i)))²] ≥ 0.

Therefore, Var(S(g; α)) achieves its minimum value at α = 1/2, which concludes the proof.
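Lemma 3, and hence Theorem 1, can be sanity-checked numerically: on a toy distribution whose posterior is known in closed form, the ConfDiff risk computed from unlabeled pairs should agree with the fully supervised risk. A small Monte Carlo sketch follows; the Gaussian-mixture model and the fixed classifier g(x) = x are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
PI = 0.4                       # class prior π_+ (toy value)

def loss(z, y):
    # logistic loss ℓ(z, y) = log(1 + exp(-y z))
    return np.logaddexp(0.0, -y * z)

def phi(x, mu):
    # standard-deviation-1 Gaussian density centered at mu
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def posterior(x):
    # exact p(y=+1 | x) for the mixture PI*N(+1,1) + (1-PI)*N(-1,1)
    num = PI * phi(x, +1.0)
    return num / (num + (1 - PI) * phi(x, -1.0))

def sample_marginal(n):
    y = rng.random(n) < PI
    return np.where(y, rng.normal(+1.0, 1.0, n), rng.normal(-1.0, 1.0, n)), y

n = 200_000
# fully supervised estimate of R(g) for g(x) = x
x, y = sample_marginal(n)
r_sup = np.mean(loss(x, np.where(y, 1, -1)))

# ConfDiff estimate (Eq. (8)) from unlabeled pairs with exact differences
x1, _ = sample_marginal(n)
x2, _ = sample_marginal(n)
c = posterior(x2) - posterior(x1)
terms = ((PI - c) * loss(x1, +1) + (1 - PI - c) * loss(x2, -1)
         + (PI + c) * loss(x2, +1) + (1 - PI + c) * loss(x1, -1))
r_cd = np.mean(terms) / 2.0
```

Up to Monte Carlo error, r_cd matches r_sup, illustrating the unbiasedness established by Lemma 3 and Theorem 1.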

C PROOF OF THEOREM 3

To begin with, we give the definition of Rademacher complexity.

Definition 2 (Rademacher complexity). Let X_n = {x_1, ..., x_n} denote n i.i.d. random variables drawn from a probability distribution with density p(x), and let σ = (σ_1, ..., σ_n) denote Rademacher variables taking the values +1 and -1 with equal probability. The Rademacher complexity of G is defined as

R_n(G) = E_{X_n} E_σ [sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x_i)].

Let D_n ~i.i.d.~ p(x, x′) denote n pairs of ConfDiff data and L_CD(g; x, x′) = (L(x, x′) + L(x′, x)) / 2. Then we introduce the following lemma.

Lemma 4. R̃_n(L_CD ∘ G) ≤ 2 L_ℓ R_n(G), where L_CD ∘ G = {L_CD ∘ g | g ∈ G} and R̃_n(·) is the Rademacher complexity over ConfDiff data pairs D_n of size n.

Proof.

R̃_n(L_CD ∘ G) = E_{D_n} E_σ [sup_{g∈G} (1/n) Σ_{i=1}^n σ_i L_CD(g; x_i, x′_i)]
= E_{D_n} E_σ [sup_{g∈G} (1/2n) Σ_{i=1}^n σ_i ((π_+ - c_i) ℓ(g(x_i), +1) + (π_- - c_i) ℓ(g(x′_i), -1) + (π_+ + c_i) ℓ(g(x′_i), +1) + (π_- + c_i) ℓ(g(x_i), -1))].

The Lipschitz constant of L_CD(g; x_i, x′_i) with respect to (g(x_i), g(x′_i)) can be bounded as

∥∇L_CD(g; x_i, x′_i)∥ ≤ (|π_+ - c_i| L_ℓ) / 2 + (|π_- - c_i| L_ℓ) / 2 + (|π_+ + c_i| L_ℓ) / 2 + (|π_- + c_i| L_ℓ) / 2.    (24)

Suppose π_+ ≥ π_-. The value of the RHS of Eq. (24) can be determined as follows: when c_i ∈ [-1, -π_+), the value is -2 c_i L_ℓ; when c_i ∈ [-π_+, -π_-), the value is (π_+ - c_i) L_ℓ; when c_i ∈ [-π_-, π_-), the value is L_ℓ; when c_i ∈ [π_-, π_+), the value is (π_+ + c_i) L_ℓ; when c_i ∈ [π_+, 1], the value is 2 c_i L_ℓ. In summary, when π_+ ≥ π_-, the RHS of Eq. (24) is at most 2 L_ℓ; when π_+ ≤ π_-, the same bound can be deduced in the same way. Therefore,

R̃_n(L_CD ∘ G) ≤ 2 L_ℓ E_{D_n} E_σ [sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x_i)] = 2 L_ℓ E_{X_n} E_σ [sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x_i)] = 2 L_ℓ R_n(G),

which concludes the proof. After that, we introduce the following lemma.

Lemma 5.
After that, we introduce the following lemma.

Lemma 5. The inequality below holds with probability at least $1 - \delta$:
\[
\sup_{g\in\mathcal{G}}\big|R(g) - \widehat{R}_{\mathrm{CD}}(g)\big| \le 4L_\ell\mathfrak{R}_n(\mathcal{G}) + 2C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
Proof. To begin with, we introduce $\Phi = \sup_{g\in\mathcal{G}}(R(g) - \widehat{R}_{\mathrm{CD}}(g))$ and $\Phi' = \sup_{g\in\mathcal{G}}(R(g) - \widehat{R}'_{\mathrm{CD}}(g))$, where $\widehat{R}_{\mathrm{CD}}(g)$ and $\widehat{R}'_{\mathrm{CD}}(g)$ denote the empirical risks over two sets of training examples that differ in exactly one point, $\{(x_i, x'_i), c_i\}$ and $\{(\tilde{x}_i, \tilde{x}'_i), c(\tilde{x}_i, \tilde{x}'_i)\}$ respectively. Then we have
\[
\Phi - \Phi' \le \sup_{g\in\mathcal{G}}\big(\widehat{R}'_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big) \le \sup_{g\in\mathcal{G}}\frac{L_{\mathrm{CD}}(g; x_i, x'_i) - L_{\mathrm{CD}}(g; \tilde{x}_i, \tilde{x}'_i)}{n} \le \frac{2C_\ell}{n},
\]
and $\Phi' - \Phi$ can be bounded in the same way. By applying McDiarmid's inequality, the following inequality holds with probability at least $1 - \delta/2$:
\[
\sup_{g\in\mathcal{G}}\big(R(g) - \widehat{R}_{\mathrm{CD}}(g)\big) \le \mathbb{E}_{D_n}\Big[\sup_{g\in\mathcal{G}}\big(R(g) - \widehat{R}_{\mathrm{CD}}(g)\big)\Big] + 2C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
Furthermore, we can bound $\mathbb{E}_{D_n}[\sup_{g\in\mathcal{G}}(R(g) - \widehat{R}_{\mathrm{CD}}(g))]$ with the Rademacher complexity. It is routine work to show by symmetrization (Mohri et al., 2012) that
\[
\mathbb{E}_{D_n}\Big[\sup_{g\in\mathcal{G}}\big(R(g) - \widehat{R}_{\mathrm{CD}}(g)\big)\Big] \le 2\widetilde{\mathfrak{R}}_n(L_{\mathrm{CD}} \circ \mathcal{G}) \le 4L_\ell\mathfrak{R}_n(\mathcal{G}),
\]
where the second inequality is from Lemma 4. Accordingly, $\sup_{g\in\mathcal{G}}(\widehat{R}_{\mathrm{CD}}(g) - R(g))$ has the same bound. By using the union bound, the following inequality holds with probability at least $1 - \delta$:
\[
\sup_{g\in\mathcal{G}}\big|R(g) - \widehat{R}_{\mathrm{CD}}(g)\big| \le 4L_\ell\mathfrak{R}_n(\mathcal{G}) + 2C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}},
\]
which concludes the proof.

Finally, the proof of Theorem 3 is provided.

Proof of Theorem 3.
\[
\begin{aligned}
R(\hat{g}_{\mathrm{CD}}) - R(g^*) &= \big(R(\hat{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big) + \big(\widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(g^*)\big) + \big(\widehat{R}_{\mathrm{CD}}(g^*) - R(g^*)\big)\\
&\le \big(R(\hat{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big) + \big(\widehat{R}_{\mathrm{CD}}(g^*) - R(g^*)\big)\\
&\le \big|R(\hat{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big| + \big|\widehat{R}_{\mathrm{CD}}(g^*) - R(g^*)\big|\\
&\le 2\sup_{g\in\mathcal{G}}\big|R(g) - \widehat{R}_{\mathrm{CD}}(g)\big|\\
&\le 8L_\ell\mathfrak{R}_n(\mathcal{G}) + 4C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}}.
\end{aligned}
\]
The first inequality holds because $\hat{g}_{\mathrm{CD}}$ is the minimizer of $\widehat{R}_{\mathrm{CD}}(g)$, and the last inequality follows from Lemma 5, which concludes the proof.
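The final chain of inequalities in the proof of Theorem 3 rests on a generic, deterministic fact about empirical risk minimizers: the excess risk is at most twice the uniform deviation. The following toy check on a finite class (our own naming, purely illustrative) makes that step tangible.

```python
def excess_risk_and_bound(true_risk, emp_risk, G):
    """Return (excess risk of the empirical minimizer, 2 * sup deviation),
    illustrating R(g_hat) - R(g*) <= 2 * sup_g |R(g) - R_hat(g)| on a finite class G."""
    g_hat = min(G, key=emp_risk)    # empirical risk minimizer
    g_star = min(G, key=true_risk)  # true risk minimizer
    sup_dev = max(abs(true_risk(g) - emp_risk(g)) for g in G)
    return true_risk(g_hat) - true_risk(g_star), 2.0 * sup_dev
```

Plugging the high-probability bound of Lemma 5 into the second return value recovers the statement of Theorem 3.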

D PROOF OF THEOREM 4

To begin with, we derive the following inequality:
\[
\begin{aligned}
\sup_{g\in\mathcal{G}}\big|\bar{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big| &= \sup_{g\in\mathcal{G}}\frac{1}{2n}\Big|\sum_{i=1}^n\big((\pi_+ - \bar{\pi}_+ + c_i - \bar{c}_i)\ell(g(x_i), +1) + (\pi_- - \bar{\pi}_- + c_i - \bar{c}_i)\ell(g(x'_i), -1)\\
&\qquad\qquad + (\pi_+ - \bar{\pi}_+ + \bar{c}_i - c_i)\ell(g(x'_i), +1) + (\pi_- - \bar{\pi}_- + \bar{c}_i - c_i)\ell(g(x_i), -1)\big)\Big|\\
&\le \sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\big(|\pi_+ - \bar{\pi}_+| + |c_i - \bar{c}_i|\big)\big(\ell(g(x_i), +1) + \ell(g(x'_i), -1) + \ell(g(x'_i), +1) + \ell(g(x_i), -1)\big)\\
&\le \frac{2C_\ell\sum_{i=1}^n|c_i - \bar{c}_i|}{n} + 2C_\ell|\pi_+ - \bar{\pi}_+|,
\end{aligned}
\]
where the first inequality uses the triangle inequality together with $|\pi_- - \bar{\pi}_-| = |\pi_+ - \bar{\pi}_+|$ (since $\pi_- = 1 - \pi_+$), and the second uses $0 \le \ell \le C_\ell$. Then, with probability at least $1 - \delta$, we deduce the following inequality:
\[
\begin{aligned}
R(\bar{g}_{\mathrm{CD}}) - R(g^*) &= \big(R(\bar{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(\bar{g}_{\mathrm{CD}})\big) + \big(\widehat{R}_{\mathrm{CD}}(\bar{g}_{\mathrm{CD}}) - \bar{R}_{\mathrm{CD}}(\bar{g}_{\mathrm{CD}})\big) + \big(\bar{R}_{\mathrm{CD}}(\bar{g}_{\mathrm{CD}}) - \bar{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big)\\
&\quad + \big(\bar{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) - \widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big) + \big(\widehat{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) - R(\hat{g}_{\mathrm{CD}})\big) + \big(R(\hat{g}_{\mathrm{CD}}) - R(g^*)\big)\\
&\le 2\sup_{g\in\mathcal{G}}\big|R(g) - \widehat{R}_{\mathrm{CD}}(g)\big| + 2\sup_{g\in\mathcal{G}}\big|\bar{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big| + \big(R(\hat{g}_{\mathrm{CD}}) - R(g^*)\big)\\
&\le 4\sup_{g\in\mathcal{G}}\big|R(g) - \widehat{R}_{\mathrm{CD}}(g)\big| + 2\sup_{g\in\mathcal{G}}\big|\bar{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big|\\
&\le 16L_\ell\mathfrak{R}_n(\mathcal{G}) + 8C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{4C_\ell\sum_{i=1}^n|c_i - \bar{c}_i|}{n} + 4C_\ell|\pi_+ - \bar{\pi}_+|.
\end{aligned}
\]
The first inequality holds because $\bar{g}_{\mathrm{CD}}$ is the minimizer of $\bar{R}_{\mathrm{CD}}(g)$, so $\bar{R}_{\mathrm{CD}}(\bar{g}_{\mathrm{CD}}) - \bar{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) \le 0$. The second and third inequalities are derived according to the proof of Theorem 3 and Lemma 5 respectively, which concludes the proof.
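The perturbation bound at the start of this proof can be checked numerically. Below is an illustrative sketch with a toy loss bounded in $[0, 1]$ (so $C_\ell = 1$); all names are ours and the loss is a stand-in, not the paper's choice.

```python
def clipped_hinge(z, y):
    # A toy loss bounded in [0, 1], so C_ell = 1 in the bound below.
    return min(1.0, max(0.0, 1.0 - y * z))

def confdiff_risk_bounded(g, pairs, cs, pi_pos):
    # Empirical ConfDiff risk with the bounded toy loss.
    pi_neg = 1.0 - pi_pos
    total = 0.0
    for (x, xp), c in zip(pairs, cs):
        total += (pi_pos - c) * clipped_hinge(g(x), +1) + (pi_neg - c) * clipped_hinge(g(xp), -1) \
               + (pi_pos + c) * clipped_hinge(g(xp), +1) + (pi_neg + c) * clipped_hinge(g(x), -1)
    return total / (2 * len(pairs))

def perturbation_bound(cs, cs_bar, pi_pos, pi_bar, c_ell=1.0):
    # 2 * C_ell * ( mean |c_i - c_bar_i| + |pi_+ - pi_bar_+| ), as derived above.
    n = len(cs)
    return 2.0 * c_ell * (sum(abs(a - b) for a, b in zip(cs, cs_bar)) / n + abs(pi_pos - pi_bar))
```

The bound holds deterministically for any classifier and any bounded loss, which is what the derivation above shows.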

E PROOF OF THEOREM 5

To begin with, let $D_n^+(g) = \{D_n \mid \widehat{A}(g) \ge 0 \cap \widehat{B}(g) \ge 0 \cap \widehat{C}(g) \ge 0 \cap \widehat{D}(g) \ge 0\}$ and $D_n^-(g) = \{D_n \mid \widehat{A}(g) \le 0 \cup \widehat{B}(g) \le 0 \cup \widehat{C}(g) \le 0 \cup \widehat{D}(g) \le 0\}$. Before giving the proof of Theorem 5, we give the following lemma based on the assumptions in Section 3.

Lemma 6. The probability measure of $D_n^-(g)$ can be bounded as follows:
\[
P(D_n^-(g)) \le \exp\Big(-\frac{2a^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2b^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2c^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2d^2n}{C_\ell^2}\Big).
\]
Proof. It can be observed that
\[
p(D_n) = p(x_1, x'_1)\cdots p(x_n, x'_n) = p(x_1)\cdots p(x_n)\,p(x'_1)\cdots p(x'_n).
\]
Therefore, the probability measure $P(D_n^-(g))$ can be defined as follows:
\[
P(D_n^-(g)) = \int_{D_n \in D_n^-(g)} p(D_n)\,\mathrm{d}D_n = \int_{D_n \in D_n^-(g)} p(D_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n\,\mathrm{d}x'_1\cdots\mathrm{d}x'_n.
\]
When exactly one ConfDiff data pair in $D_n$ is replaced, the change of $\widehat{A}(g)$, $\widehat{B}(g)$, $\widehat{C}(g)$, and $\widehat{D}(g)$ is no more than $C_\ell/n$. By applying McDiarmid's inequality, we can obtain the following inequalities:
\[
P(\mathbb{E}[\widehat{A}(g)] - \widehat{A}(g) \ge a) \le \exp\Big(-\frac{2a^2n}{C_\ell^2}\Big), \qquad
P(\mathbb{E}[\widehat{B}(g)] - \widehat{B}(g) \ge b) \le \exp\Big(-\frac{2b^2n}{C_\ell^2}\Big),
\]
\[
P(\mathbb{E}[\widehat{C}(g)] - \widehat{C}(g) \ge c) \le \exp\Big(-\frac{2c^2n}{C_\ell^2}\Big), \qquad
P(\mathbb{E}[\widehat{D}(g)] - \widehat{D}(g) \ge d) \le \exp\Big(-\frac{2d^2n}{C_\ell^2}\Big).
\]
Then, by the union bound and the assumptions $\mathbb{E}[\widehat{A}(g)] \ge a$, $\mathbb{E}[\widehat{B}(g)] \ge b$, $\mathbb{E}[\widehat{C}(g)] \ge c$, and $\mathbb{E}[\widehat{D}(g)] \ge d$,
\[
\begin{aligned}
P(D_n^-(g)) &\le P(\widehat{A}(g) \le 0) + P(\widehat{B}(g) \le 0) + P(\widehat{C}(g) \le 0) + P(\widehat{D}(g) \le 0)\\
&\le P(\widehat{A}(g) \le \mathbb{E}[\widehat{A}(g)] - a) + P(\widehat{B}(g) \le \mathbb{E}[\widehat{B}(g)] - b) + P(\widehat{C}(g) \le \mathbb{E}[\widehat{C}(g)] - c) + P(\widehat{D}(g) \le \mathbb{E}[\widehat{D}(g)] - d)\\
&\le P(\mathbb{E}[\widehat{A}(g)] - \widehat{A}(g) \ge a) + P(\mathbb{E}[\widehat{B}(g)] - \widehat{B}(g) \ge b) + P(\mathbb{E}[\widehat{C}(g)] - \widehat{C}(g) \ge c) + P(\mathbb{E}[\widehat{D}(g)] - \widehat{D}(g) \ge d)\\
&\le \exp\Big(-\frac{2a^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2b^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2c^2n}{C_\ell^2}\Big) + \exp\Big(-\frac{2d^2n}{C_\ell^2}\Big),
\end{aligned}
\]
which concludes the proof.

Then, the proof of Theorem 5 is given.

Proof of Theorem 5. To begin with, we prove the first inequality in Theorem 5. Since $\widetilde{R}_{\mathrm{CD}}(g) = \widehat{R}_{\mathrm{CD}}(g)$ on $D_n^+(g)$ and $\widetilde{R}_{\mathrm{CD}}(g) \ge \widehat{R}_{\mathrm{CD}}(g)$ everywhere, we have
\[
\begin{aligned}
\mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] - R(g) &= \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)]\\
&= \int_{D_n\in D_n^+(g)}\big(\widetilde{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big)p(D_n)\,\mathrm{d}D_n + \int_{D_n\in D_n^-(g)}\big(\widetilde{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big)p(D_n)\,\mathrm{d}D_n\\
&= \int_{D_n\in D_n^-(g)}\big(\widetilde{R}_{\mathrm{CD}}(g) - \widehat{R}_{\mathrm{CD}}(g)\big)p(D_n)\,\mathrm{d}D_n \ge 0.
\end{aligned}
\]
Furthermore,
\[
\begin{aligned}
\big|\widetilde{R}_{\mathrm{CD}}(g) - R(g)\big| &= \big|\widetilde{R}_{\mathrm{CD}}(g) - \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] + \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] - R(g)\big|\\
&\le \big|\widetilde{R}_{\mathrm{CD}}(g) - \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)]\big| + \big|\mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] - R(g)\big|\\
&= \big|\widetilde{R}_{\mathrm{CD}}(g) - \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)]\big| + \mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] - R(g)\\
&\le 2C_\ell L_f\sqrt{\frac{\ln(2/\delta)}{2n}} + 2(L_f + 1)C_\ell\Delta
\end{aligned}
\]
with probability at least $1 - \delta$, where the first term follows from McDiarmid's inequality (replacing one ConfDiff pair changes $\widetilde{R}_{\mathrm{CD}}(g)$ by at most $2C_\ell L_f/n$, since the correction function is Lipschitz with constant $L_f$) and the bound on $\mathbb{E}[\widetilde{R}_{\mathrm{CD}}(g)] - R(g)$ follows from Lemma 6. This concludes the proof.
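As an illustration of the risk correction analyzed here, the sketch below applies a correction function $f$ (ReLU by default; the absolute value gives the ABS variant) to four precomputed partial empirical risks standing in for $\widehat{A}(g), \widehat{B}(g), \widehat{C}(g), \widehat{D}(g)$. The names are ours and the partial risks are assumed given, since their exact form is defined in the main text.

```python
def relu(u):
    return max(u, 0.0)

def corrected_risk(parts, f=relu):
    """Corrected empirical risk: apply f to each partial risk (A, B, C, D).
    With f = relu or f = abs, the corrected risk never falls below the plain sum,
    which is what makes its expectation an upper bound of the true risk."""
    return sum(f(p) for p in parts)
```

On samples where all four partial risks are non-negative (the event $D_n^+(g)$), the corrected risk coincides with the uncorrected one; the two differ only on $D_n^-(g)$, whose probability Lemma 6 shows decays exponentially in $n$.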

F PROOF OF THEOREM 6

With probability at least $1 - \delta$, we have
\[
\begin{aligned}
R(\tilde{g}_{\mathrm{CD}}) - R(g^*) &= \big(R(\tilde{g}_{\mathrm{CD}}) - \widetilde{R}_{\mathrm{CD}}(\tilde{g}_{\mathrm{CD}})\big) + \big(\widetilde{R}_{\mathrm{CD}}(\tilde{g}_{\mathrm{CD}}) - \widetilde{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}})\big) + \big(\widetilde{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) - R(\hat{g}_{\mathrm{CD}})\big) + \big(R(\hat{g}_{\mathrm{CD}}) - R(g^*)\big)\\
&\le \big|R(\tilde{g}_{\mathrm{CD}}) - \widetilde{R}_{\mathrm{CD}}(\tilde{g}_{\mathrm{CD}})\big| + \big|\widetilde{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) - R(\hat{g}_{\mathrm{CD}})\big| + \big(R(\hat{g}_{\mathrm{CD}}) - R(g^*)\big)\\
&\le 4C_\ell(L_f + 1)\sqrt{\frac{\ln(2/\delta)}{2n}} + 4(L_f + 1)C_\ell\Delta + 8L_\ell\mathfrak{R}_n(\mathcal{G}).
\end{aligned}
\]
The first inequality holds because $\tilde{g}_{\mathrm{CD}}$ is the minimizer of $\widetilde{R}_{\mathrm{CD}}(g)$, so $\widetilde{R}_{\mathrm{CD}}(\tilde{g}_{\mathrm{CD}}) - \widetilde{R}_{\mathrm{CD}}(\hat{g}_{\mathrm{CD}}) \le 0$. The second inequality is derived from Theorem 5 (applied to $\tilde{g}_{\mathrm{CD}}$ and $\hat{g}_{\mathrm{CD}}$) and Theorem 3. The proof is completed.

G ADDITIONAL INFORMATION ON EXPERIMENTS

In this section, the details of experimental data sets and hyperparameters are provided.

G.1 DETAILS OF EXPERIMENTAL DATA SETS

The detailed statistics and corresponding model architectures are summarized in Table 3, while the basic information, sources, and data split details are elaborated in this subsection. The four benchmark data sets can be downloaded from the links given in their respective descriptions, while the four UCI data sets can be downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017).

• Optdigits, USPS, Pendigits (Dua & Graff, 2017): They are handwritten digit recognition data sets. The train-test splits can be found in Table 3. The feature dimensions are 62, 256, and 16 respectively, and the label space is 0-9. The even digits are regarded as the positive class while the odd digits are regarded as the negative class. We sampled 1,200, 2,000, and 2,500 unlabeled data pairs for training respectively.

G.2 DETAILS OF HYPERPARAMETERS

All the methods were implemented in PyTorch (Paszke et al., 2019), and we used the Adam optimizer (Kingma & Ba, 2015). To ensure fair comparisons, we set the same hyperparameter values for all the comparing approaches. For MNIST, Kuzushiji-MNIST, and Fashion-MNIST, the learning rate was set to 1e-3, the weight decay to 1e-5, and the batch size to 256 data pairs. For CIFAR-10, the learning rate was set to 5e-4, the weight decay to 1e-5, and the batch size to 128 data pairs. For all the UCI data sets, the learning rate was set to 1e-3, the weight decay to 1e-5, and the batch size to 128 data pairs. For training the probabilistic classifier used to generate confidence, the epoch number was set to 10, the batch size was set to 256 for the three MNIST-style data sets and 128 for the others, and the learning rate and weight decay were the same as the settings for each data set correspondingly.
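For reference, the hyperparameter settings above can be collected into a single lookup table. This is a convenience sketch with our own names, not code from the paper.

```python
# Hyperparameters from Appendix G.2 (learning rate, weight decay, batch size in pairs).
CONFIGS = {
    "mnist": {"lr": 1e-3, "weight_decay": 1e-5, "batch_pairs": 256},
    "kuzushiji-mnist": {"lr": 1e-3, "weight_decay": 1e-5, "batch_pairs": 256},
    "fashion-mnist": {"lr": 1e-3, "weight_decay": 1e-5, "batch_pairs": 256},
    "cifar10": {"lr": 5e-4, "weight_decay": 1e-5, "batch_pairs": 128},
}

def get_config(dataset):
    # The UCI data sets share one setting (lr 1e-3, weight decay 1e-5, 128 pairs).
    return CONFIGS.get(dataset, {"lr": 1e-3, "weight_decay": 1e-5, "batch_pairs": 128})
```

These values would be passed to the optimizer, e.g. `torch.optim.Adam(model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"])`.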



The theoretical analysis in the next subsections is also based on these assumptions. For simplicity, we do not restate them in the next subsections.



Figure 1: Classification performance of ConfDiff-ReLU and ConfDiff-ABS given a fraction of training data as well as Pcomp-Teacher given 100% of training data (π + = 0.2).

Figure 2: Classification accuracy on MNIST (the first row) and Pendigits (the second row) with π + = 0.5 given an inaccurate class prior probability and noisy confidence difference.



Figure 3: Classification performance of ConfDiff-ReLU and ConfDiff-ABS given a fraction of training data as well as Pcomp-Teacher given 100% of training data with different prior settings (π + = 0.2 for the first row, π + = 0.5 for the second and the third row, and π + = 0.8 for the fourth and the fifth row).

Classification accuracy (mean±std) of each method on benchmark data sets with different class priors, where the best performance is shown in bold.

Classification accuracy (mean±std) of each method on UCI data sets with different class priors, where the best performance is shown in bold.

Table 3: Characteristics of experimental data sets.

• MNIST (LeCun et al., 1998): It is a grayscale handwritten digit recognition data set. It is composed of 60,000 training examples and 10,000 test examples. The original feature dimension is 28*28, and the label space is 0-9. The even digits are regarded as the positive class while the odd digits are regarded as the negative class. We sampled 15,000 unlabeled data pairs as training data.
• Kuzushiji-MNIST: It is a grayscale cursive Japanese character recognition data set with the same format as MNIST (60,000 training examples, 10,000 test examples, feature dimension 28*28). The positive class is composed of 'o', 'su', 'na', 'ma', and 're' while the negative class is composed of 'ki', 'tsu', 'ha', 'ya', and 'wo'. We sampled 15,000 unlabeled data pairs as training data. The data set can be downloaded from https://github.com/rois-codh/kmnist.
• Fashion-MNIST (Xiao et al., 2017): It is a grayscale fashion item recognition data set. It is composed of 60,000 training examples and 10,000 test examples. The original feature dimension is 28*28, and the label space is {'T-shirt', 'trouser', 'pullover', 'dress', 'sandal', 'coat', 'shirt', 'sneaker', 'bag', 'ankle boot'}. The positive class is composed of 'T-shirt', 'pullover', 'coat', 'shirt', and 'bag' while the negative class is composed of 'trouser', 'dress', 'sandal', 'sneaker', and 'ankle boot'. We sampled 15,000 unlabeled data pairs as training data. The data set can be downloaded from https://github.com/zalandoresearch/fashion-mnist.
• CIFAR-10 (Krizhevsky & Hinton, 2009): It is a colorful object recognition data set. It is composed of 50,000 training examples and 10,000 test examples. The original feature dimension is 32*32*3, and the label space is {'airplane', 'bird', 'automobile', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'}. The positive class is composed of 'bird', 'deer', 'dog', 'frog', 'cat', and 'horse' while the negative class is composed of 'airplane', 'automobile', 'ship', and 'truck'. We sampled 10,000 unlabeled data pairs as training data. The data set can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.

• Letter (Dua & Graff, 2017): It is a letter recognition data set. It is composed of 16,000 training examples and 4,000 test examples. The feature dimension is 16, and the label space is the 26 capital letters of the English alphabet. The first 13 letters are regarded as the positive class while the last 13 letters are regarded as the negative class. We sampled 4,000 unlabeled data pairs for training.
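The construction used across these data sets, i.e. mapping the original labels to a binary task and sampling unlabeled pairs equipped with confidence differences, can be sketched as follows. The `posterior` argument is a hypothetical stand-in for the trained probabilistic classifier mentioned in Appendix G.2; all names are ours.

```python
import random

def digit_to_binary_label(digit):
    # Even digits form the positive class, odd digits the negative class.
    return +1 if digit % 2 == 0 else -1

def make_confdiff_pairs(xs, posterior, n_pairs, seed=0):
    """Sample unlabeled pairs and attach confidence differences
    c(x, x') = p(+1 | x') - p(+1 | x). `posterior` maps an instance to p(+1 | x)."""
    rng = random.Random(seed)
    pairs, conf_diffs = [], []
    for _ in range(n_pairs):
        x, xp = rng.choice(xs), rng.choice(xs)
        pairs.append((x, xp))
        conf_diffs.append(posterior(xp) - posterior(x))
    return pairs, conf_diffs
```

Note that only the pairs and their confidence differences are kept; the binary labels themselves are never revealed to the learner.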

B ANALYSIS ON VARIANCE OF RISK ESTIMATOR

B.1 PROOF OF LEMMA 1

Based on Lemma 3, it can be observed that
\[
\begin{aligned}
\mathbb{E}_{p(x,x')}[L(x, x')] &= \mathbb{E}_{p(x,x')}\big[(\pi_+ - c(x, x'))\ell(g(x), +1) + (\pi_- - c(x, x'))\ell(g(x'), -1)\big]\\
&= \pi_+\mathbb{E}_{p_+(x)}[\ell(g(x), +1)] + \pi_-\mathbb{E}_{p_-(x')}[\ell(g(x'), -1)]\\
&= \pi_+\mathbb{E}_{p_+(x)}[\ell(g(x), +1)] + \pi_-\mathbb{E}_{p_-(x)}[\ell(g(x), -1)]\\
&= R(g),
\end{aligned}
\]
and $\mathbb{E}_{p(x,x')}[L(x', x)] = R(g)$ can be derived in the same manner. Therefore, for any $\alpha$, the convex combination $\alpha L(x, x') + (1 - \alpha)L(x', x)$ also has expectation $R(g)$, i.e., it is also an unbiased risk estimator, which concludes the proof.

B.2 PROOF OF THEOREM 2

In this subsection, we show that Eq. (8) achieves the minimum variance among the family of unbiased risk estimators introduced in Lemma 1. To begin with, we introduce the following notations:
\[
U = \frac{1}{n}\sum_{i=1}^n L(x_i, x'_i), \qquad V = \frac{1}{n}\sum_{i=1}^n L(x'_i, x_i), \qquad S(g; \alpha) = \alpha U + (1 - \alpha)V,
\]
and let $\mu_1 = \mathrm{Var}(U) = \mathrm{Var}(V)$ (the two variances coincide by the symmetry of $p(x, x')$) and $\mu_2 = \mathrm{Cov}(U, V)$. Furthermore, according to Lemma 1, we have $\mathbb{E}[S(g; \alpha)] = R(g)$ for any $\alpha$, and in particular $\mathbb{E}[U - V] = 0$. Then, we provide the proof of Theorem 2 as follows.

Proof of Theorem 2.
\[
\begin{aligned}
\mathrm{Var}(S(g; \alpha)) &= \alpha^2\mathrm{Var}(U) + (1 - \alpha)^2\mathrm{Var}(V) + 2\alpha(1 - \alpha)\mathrm{Cov}(U, V)\\
&= \mu_1 - 2\alpha(1 - \alpha)(\mu_1 - \mu_2).
\end{aligned}
\]
Besides, it can be observed that $2\mu_1 - 2\mu_2 = \mathrm{Var}(U - V) = \mathbb{E}[(U - V)^2] \ge 0$. Since $\mu_1 - \mu_2 \ge 0$ and $\alpha(1 - \alpha)$ attains its maximum at $\alpha = 1/2$, $\mathrm{Var}(S(g; \alpha))$ achieves the minimum value when $\alpha = 1/2$, which concludes the proof.
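The optimality of $\alpha = 1/2$ can be verified numerically: for two estimator halves with equal variance, the variance of the convex combination $\alpha U + (1 - \alpha)V$ is minimized at $\alpha = 1/2$. Below is a small self-contained check on a toy construction of ours that mirrors the symmetry $\mathrm{Var}(U) = \mathrm{Var}(V)$ used in the proof.

```python
def combo_variance(us, vs, alpha):
    # Population variance of alpha * u + (1 - alpha) * v over paired samples.
    n = len(us)
    s = [alpha * u + (1.0 - alpha) * v for u, v in zip(us, vs)]
    m = sum(s) / n
    return sum((x - m) ** 2 for x in s) / n
```

In this toy example `us` and `vs` are mirror images (equal variance, perfectly anti-correlated), so the $\alpha = 1/2$ combination is constant and its variance drops to zero; in general, the variance at $\alpha = 1/2$ is $(\mu_1 + \mu_2)/2 \le \mu_1$.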

