AUL IS A BETTER OPTIMIZATION METRIC IN PU LEARNING

Abstract

Traditional binary classification models are trained and evaluated on fully labeled data, which is uncommon in real life. In non-ideal datasets, only a small fraction of the positive data is labeled. Training a model from such partially labeled data is called positive-unlabeled (PU) learning. A naive solution to PU learning is to treat unlabeled samples as negative. However, with such biased data the trained model may converge to a non-optimal point, and its real performance cannot be well estimated. Recent works try to recover the unbiased result by estimating the proportion of positive samples with mixture proportion estimation (MPE) algorithms, but model performance is still limited and heavy computational cost is introduced (particularly for big datasets). In this work, we theoretically prove that the Area Under the Lift curve (AUL) is an unbiased metric in the PU learning scenario, and experimental evaluation on 9 datasets shows that the average absolute error of AUL estimation is only 1/6 of that of AUC estimation. By experiments we also find that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm not only significantly saves computational cost, but also improves model performance by up to 10%.

1. INTRODUCTION

Classic binary classification tasks in machine learning usually assume that all data are fully labeled as positive or negative (PN learning). However, in real-world applications, datasets are usually non-ideal and only a small fraction of the positive data is labeled. Training a model from such partially labeled positive data is called positive-unlabeled (PU) learning. Take financial fraud detection as an example: some fraudulent behaviors are found and can be labeled as positive, but we cannot simply regard the remaining data as negative, because in most cases only a subset of frauds is detected and the remaining data may still contain undetected positives. As a result, the remaining data can only be regarded as unlabeled. Other typical PU learning applications include text classification, drug discovery, outlier detection, malicious URL detection, online advertising, etc. (Yu et al. (2002), Li & Liu (2003), Li et al. (2009), Blanchard et al. (2010), Zhang et al. (2017), Wu et al. (2018)). A naive way to do PU learning is to treat unlabeled data as negative and apply traditional PN learning algorithms, but a model trained in this way is biased and its predictions are not reliable (Elkan & Noto (2008)). Some early works try to recover labels for unlabeled data with heuristic algorithms, such as S-EM (Liu et al. (2002)), 1-DNF (Yu et al. (2002)), Rocchio (Li & Liu (2003)) and k-means (Chaudhari & Shevade (2012)), but the performance of the heuristic algorithms, which is critical to these works, is not guaranteed. Another kind of method introduces an unbiased risk estimator to eliminate the bias (Du Plessis et al. (2014), Du Plessis et al. (2015), Kiryo et al. (2017)). However, these methods rely on knowledge of the proportion of positive samples among the unlabeled samples, which is also unknown in practice. Another annoying problem in PU learning is how to accurately evaluate a model's performance.
Model performance is usually evaluated with metrics such as accuracy, precision, recall, F-score, AUC (Area Under the ROC Curve), etc. During the life cycle of a model, its performance is usually monitored to ensure that the model maintains a desired level of performance as the data vary and grow. In PU learning, the metrics above are also biased due to the unknown proportion of positive samples. Although Menon et al. (2015) proves that the ground-truth AUC (AUC) and the AUC estimated from PU data (AUC_PU) are linearly correlated, which indicates that AUC_PU can be used to compare the performances of two models, it is still not possible to evaluate the true performance of a single model. Consider a situation where a model is evaluated on two different PU datasets generated from the same PN dataset but with different labeled positive proportions. The ground-truth AUC, which indicates the true performance of the model, is the same on both datasets, but the AUC_PU values on the two datasets differ. Hence, AUC_PU cannot be used to directly evaluate a model's performance. Jain et al. (2017) and Ramola et al. (2019) show that they can correct AUC_PU, accuracy_PU, balanced accuracy_PU, F-score_PU and the Matthews correlation coefficient, given the proportion of positive samples. However, this proportion is difficult to obtain in practice. Recently, many works have focused on estimating the proportion of positive samples (Du Plessis & Sugiyama (2014), Christoffel et al. (2016), Ramaswamy et al. (2016), Jain et al. (2016), Bekker & Davis (2018), Zeiberg et al. (2020)); these are called mixture proportion estimation (MPE) algorithms. Yet according to our experiments on 9 datasets, the estimation methods still introduce errors and thus make the corrected metrics inaccurate.
Besides, MPE algorithms may introduce non-trivial computational overhead (up to 2,000 seconds per proportion estimation in our experiments), which slows down the evaluation process. In this work, we find that the Area Under the Lift chart (AUL) (Vuk & Curk (2006), Tufféry (2011)) is a discriminating, unbiased and computation-friendly metric for PU learning. We make the following contributions. a) We theoretically prove that the AUL estimate is unbiased with respect to the ground-truth AUL and derive a theoretical bound on the estimation error. b) We carry out experimental evaluation on 9 datasets, and the results show that the average absolute error of AUL estimation is only 1/6 of that of AUC estimation, which means AUL estimation is more accurate and more stable than AUC estimation. c) By experiments we also find that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm not only significantly saves computational cost, but also improves model performance by up to 10%. The remainder of this paper is organized as follows. Section 2 describes the background. Section 3 theoretically proves the unbiasedness of AUL estimation in PU learning. Section 4 evaluates the accuracy of AUL estimation by experiments on 9 datasets. Section 5 experimentally shows the performance of the AUL-optimization algorithm. Section 6 concludes the paper.

2. BACKGROUND

Binary Classification Problem: Let D = {<x_i, y_i>, i = 1, ..., n} be a positive and negative (PN) dataset with n instances. Each tuple <x_i, y_i> is a record, in which x_i ∈ R^d is the feature vector and y_i ∈ {1, 0} is the corresponding ground-truth label. Let X_P, X_N be the sets of feature vectors of positive and negative samples respectively, and n_P, n_N the numbers of samples in these sets:

X_P = {x_i | y_i = 1, i = 1, ..., n_P},  X_N = {x_i | y_i = 0, i = 1, ..., n_N}.

In PU learning, we use α = n_P / (n_P + n_N) = n_P / n to denote the proportion of positive samples among all samples.

Confusion Matrix: A confusion matrix is used to discriminate the model performance of different binary classification algorithms. In a confusion matrix, true positives (TP) (actual and predicted labels both positive), true negatives (TN) (actual and predicted labels both negative), false positives (FP) (actually negative but predicted as positive) and false negatives (FN) (actually positive but predicted as negative) are counted according to the model's outputs. Obviously, n_TP + n_FN = n_P and n_TN + n_FP = n_N.

ROC: Since the numbers of TP, TN, FP and FN depend on the classification threshold θ, the Receiver Operating Characteristic (ROC) curve (Fawcett & Tom, 2003) plots (x, y) = (fpr(θ), tpr(θ)) over all possible classification thresholds θ. In some literature, tpr is also known as sensitivity and 1 - fpr as specificity.

true positive rate (tpr) = n_TP / n_P,  false positive rate (fpr) = n_FP / n_N.

AUC: As a curve, ROC is not convenient enough for describing model performance. Consequently, the Area Under the ROC Curve (AUC), a single value, was proposed and is widely used as a metric to evaluate a binary classification algorithm. AUC summarizes model performance over all possible classification thresholds.
It also admits an elegant probabilistic interpretation: AUC is the probability of correct ranking between a random positive sample and a random negative sample (Hanley & McNeil, 1982), i.e., a kind of ranking capability. According to Vuk & Curk (2006), for a model g: R^d → R, AUC can be computed as

AUC = (1 / (n_P n_N)) Σ_{x_i ∈ X_P} Σ_{x_j ∈ X_N} S(g(x_i), g(x_j)),   (1)

where S(a, b) = 1 if a > b, 1/2 if a = b, and 0 if a < b.

It is worth noting that there are other ways to calculate AUC, but they are essentially the same.
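As a sanity check of Eq. 1, the pairwise definition can be computed directly. The scores below are toy values invented for this illustration, not data from the paper.

```python
def S(a, b):
    """Ranking indicator from Eq. 1: 1 if a > b, 1/2 on ties, 0 otherwise."""
    return 1.0 if a > b else (0.5 if a == b else 0.0)

def auc(pos_scores, neg_scores):
    """AUC = average of S(g(x_i), g(x_j)) over all positive/negative pairs."""
    total = sum(S(p, q) for p in pos_scores for q in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]   # scores of positive samples (toy values)
neg = [0.7, 0.3, 0.2]   # scores of negative samples (toy values)
print(auc(pos, neg))    # 8 of the 9 pairs are ranked correctly
```

This direct double loop is O(n_P · n_N); sort-based implementations are faster but, as noted above, equivalent.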

AUL:

The lift curve, which is popular in econometrics for choosing a suitable marketing strategy (Tufféry (2011), Vuk & Curk (2006)), has not been well studied in the machine learning field. The lift curve can be seen as a variant of ROC: it plots (x, y) = (Y_rate(θ), tpr(θ)) over all possible classification thresholds θ, where Y_rate is the proportion of samples predicted as positive,

Y_rate = (n_TP + n_FP) / n.

The lift curve thus shares its y-axis with the ROC curve but has a different x-axis. The Area Under the Lift chart (AUL) (Vuk & Curk (2006), Tufféry (2011)) can also be used as a metric of model performance. One way to compute AUL is

AUL = (1 / (n_P n)) Σ_{x_i ∈ X_P} Σ_{x_j ∈ X_P ∪ X_N} S(g(x_i), g(x_j)).   (2)

Essentially, AUL can be regarded as the probability of correct ranking between a random positive sample and a random sample. AUL and AUC are linearly related (Tufféry, 2011):

AUL = 0.5α + (1 - α) AUC,

which shows that AUL has the same discriminating power as AUC.
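The linear relation AUL = 0.5α + (1 - α)AUC can be verified on toy data. The sketch below (scores invented for illustration) checks the identity against the pairwise definitions of Eqs. 1 and 2; intuitively, the 0.5α term comes from each positive sample being compared with itself, which contributes S(a, a) = 1/2.

```python
def S(a, b):
    """Ranking indicator: 1 if a > b, 1/2 on ties, 0 otherwise."""
    return 1.0 if a > b else (0.5 if a == b else 0.0)

def auc(pos, neg):
    """Pairwise AUC (Eq. 1): positives ranked against negatives."""
    return sum(S(p, q) for p in pos for q in neg) / (len(pos) * len(neg))

def aul(pos, neg):
    """Pairwise AUL (Eq. 2): positives ranked against ALL samples."""
    allv = pos + neg
    return sum(S(p, q) for p in pos for q in allv) / (len(pos) * len(allv))

pos = [0.9, 0.8, 0.4]   # toy positive scores
neg = [0.7, 0.3, 0.2]   # toy negative scores
alpha = len(pos) / (len(pos) + len(neg))

# Check the identity AUL = 0.5*alpha + (1 - alpha)*AUC
assert abs(aul(pos, neg) - (0.5 * alpha + (1 - alpha) * auc(pos, neg))) < 1e-9
```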

3. UNBIASEDNESS OF AUL ESTIMATION IN PU LEARNING: THEORETICAL PROOF

A PU dataset D' = {<x_i, y_i, s_i>, s_i ∈ {1, 0}, i = 1, ..., n} is generated from D by sampling a subset of the positive data as labeled and leaving the rest unlabeled. In D', s_i is the observed label and y_i is the ground-truth label, which may be unknown. If s_i = 1, we can confirm y_i = 1 (positive). If s_i = 0, y_i may be 1 or 0. In this paper, we assume that the labeled data are Selected Completely At Random (SCAR) (Bekker & Davis, 2018) from the positive data; therefore the distribution of labeled samples in D' is the same as the distribution of positive samples in D. Let X_L, X_U be the sets of feature vectors of labeled and unlabeled samples respectively, and n_L, n_U the numbers of samples in these sets:

X_L = {x_i | s_i = 1, i = 1, ..., n_L},  X_U = {x_i | s_i = 0, i = 1, ..., n_U}.

We use β = n_L / n_P to denote the proportion of labeled samples among positive samples.
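A SCAR PU dataset can be generated from PN labels in a few lines. In the sketch below (function name and toy labels are ours, not from the paper), a fraction β of the positives is drawn uniformly without replacement and labeled; everything else becomes unlabeled.

```python
import random

def make_pu_scar(y, beta, seed=0):
    """Generate observed labels s from ground-truth labels y under SCAR:
    a fraction beta of the positives is sampled uniformly without
    replacement and labeled (s=1); all other samples are unlabeled (s=0)."""
    rng = random.Random(seed)
    pos_idx = [i for i, yi in enumerate(y) if yi == 1]
    n_labeled = round(beta * len(pos_idx))          # n_L = beta * n_P
    labeled = set(rng.sample(pos_idx, n_labeled))   # sampling w/o replacement
    return [1 if i in labeled else 0 for i in range(len(y))]

y = [1, 1, 1, 1, 0, 0, 0, 0]       # toy PN labels: n_P = n_N = 4
s = make_pu_scar(y, beta=0.5)      # exactly 2 positives become labeled
assert sum(s) == 2 and all(y[i] == 1 for i in range(len(y)) if s[i] == 1)
```

Note that under SCAR every subset of positives of size n_L is equally likely, which is exactly the sampling model the proofs in this section rely on.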

AUC Estimation is Biased

To calculate AUC on a PU dataset (AUC_PU), unlabeled data is regarded as negative; thus we have

AUC_PU = (1 / (n_L n_U)) Σ_{x_i ∈ X_L} Σ_{x_j ∈ X_U} S(g(x_i), g(x_j)),   (3)

where the function S is the same as in Eq. 1. The expectation of AUC_PU over the distribution of D' is

E[AUC_PU] = ((1 - α) / (1 - αβ)) (AUC - 0.5) + 0.5.

This formula is slightly different from the one in Menon et al. (2015): here we define AUC on a specific dataset rather than on a distribution. It indicates that AUC_PU is a biased estimate of AUC. We demonstrate the bias on an example dataset (Figure 1a), which contains 20 samples sorted by prediction score. Figure 1b illustrates two ROC curves on this dataset: curve-ROC is plotted with the ground-truth label y and curve-ROC_PU with the observed label s. We can see that curve-ROC lies almost entirely above curve-ROC_PU. The corresponding values are AUC = 0.740 and AUC_PU = 0.653, a large difference (0.087) between the two measurements. As discussed in Section 1, Jain et al. (2017) tries to recover AUC from AUC_PU via an estimate of (1-α)/(1-αβ). To estimate this quantity, several works (Elkan & Noto (2008), Du Plessis & Sugiyama (2014), Sanderson & Scott (2014), Jain et al. (2016), Ramaswamy et al. (2016), Christoffel et al. (2016), Bekker & Davis (2018), Zeiberg et al. (2020)) develop mixture proportion estimation (MPE) algorithms. But according to our experiments on 9 datasets, these algorithms are neither accurate enough nor time-saving.
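Because the expectation is taken over the random choice of X_L, the bias formula above can be checked exactly on a toy dataset by enumerating every possible labeled subset (all scores are invented for this sketch):

```python
from itertools import combinations

def S(a, b):
    return 1.0 if a > b else (0.5 if a == b else 0.0)

def auc(pos, neg):
    return sum(S(p, q) for p in pos for q in neg) / (len(pos) * len(neg))

# Toy scores: n_P = n_N = 4, so alpha = 0.5.
pos = [0.9, 0.7, 0.6, 0.2]
neg = [0.8, 0.5, 0.4, 0.3]
n = len(pos) + len(neg)
alpha = len(pos) / n
n_L = 2                          # label 2 of the 4 positives -> beta = 0.5
beta = n_L / len(pos)

# Exact E[AUC_PU]: average AUC_PU over every SCAR draw of size n_L.
vals = []
for L in combinations(range(len(pos)), n_L):
    labeled = [pos[i] for i in L]
    unlabeled = [pos[i] for i in range(len(pos)) if i not in L] + neg
    vals.append(auc(labeled, unlabeled))   # unlabeled treated as negative
expected = sum(vals) / len(vals)

# Closed form: E[AUC_PU] = (1-alpha)/(1-alpha*beta) * (AUC - 0.5) + 0.5
closed = (1 - alpha) / (1 - alpha * beta) * (auc(pos, neg) - 0.5) + 0.5
assert abs(expected - closed) < 1e-9
```

The enumeration makes the source of the bias visible: labeled positives are compared against the positives that happened to stay unlabeled, which pulls AUC_PU toward 0.5.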

AUL Estimation is Unbiased

Similar to AUC_PU, the AUL on a PU dataset (AUL_PU) can be calculated as

AUL_PU = (1 / (n_L n)) Σ_{x_i ∈ X_L} Σ_{x_j ∈ X_L ∪ X_U} S(g(x_i), g(x_j)).   (4)

Unlike AUC_PU, AUL_PU is an unbiased estimate of AUL. In contrast to Figure 1b, Figure 1c illustrates two lift curves which are very close to each other: curve-lift is plotted with the ground-truth label y and curve-lift_PU with the observed label s. The corresponding values, AUL = 0.620 and AUL_PU = 0.615, are very close. We now prove the unbiasedness.

Theorem 1. For a given classifier g: R^d → R and a PN dataset D, let D' be a PU dataset generated following SCAR with proportion of labeled samples β = n_L / n_P. Then the expectation and variance of AUL_PU over the distribution of D' are

E[AUL_PU] = AUL,   Var[AUL_PU] = ((n_P - n_L) / (n_P - 1)) · σ² / n_L,

where σ² is the variance of t_{x_i} = (1/n) Σ_{x_j ∈ X_P ∪ X_N} S(g(x_i), g(x_j)), i = 1, ..., n_P.

Proof. Let

t_{x_i} = (1/n) Σ_{x_j ∈ X_P ∪ X_N} S(g(x_i), g(x_j)) = (1/n) Σ_{x_j ∈ X_L ∪ X_U} S(g(x_i), g(x_j)).

Then

AUL = (1/n_P) Σ_{x_i ∈ X_P} t_{x_i},   AUL_PU = (1/n_L) Σ_{x_i ∈ X_L} t_{x_i}.

X_L is generated by random sampling without replacement from X_P, hence AUL_PU is the sample estimate of the mean of {t_{x_i}, x_i ∈ X_P}, which is AUL. According to the theory of simple random sampling without replacement (Lohr (2009)), the sample mean AUL_PU is an unbiased estimator of the population mean AUL, i.e., E[AUL_PU] = AUL, and its variance is

Var[AUL_PU] = (1 - n_L/n_P) · (1/n_L) · Σ_{x_i ∈ X_P} (t_{x_i} - t̄)² / (n_P - 1)
           = ((n_P - n_L) / (n_P - 1)) · (1/n_L) · Σ_{x_i ∈ X_P} (t_{x_i} - t̄)² / n_P
           = ((n_P - n_L) / (n_P - 1)) · σ² / n_L.

According to Theorem 1, applying Chebyshev's inequality, for any ε > 0 we have

P(|AUL - AUL_PU| ≥ ε) ≤ Var[AUL_PU] / ε² = (σ² / (n_L ε²)) (1 - β) · n_P / (n_P - 1).

Note that 0 < t_{x_i} < 1; hence
σ² = E[(t - t̄)²] ≤ E[(t - t̄)² - (t - 0)(t - 1)] = E[-2t̄t + t̄² + t] = t̄ - t̄² ≤ 1/4,

since (t - 0)(t - 1) ≤ 0 for t ∈ (0, 1). Then

P(|AUL - AUL_PU| ≥ ε) ≤ ((1 - β) / (4 n_L ε²)) · n_P / (n_P - 1) ≈ (1 - β) / (4 n_L ε²).
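Theorem 1 can likewise be checked by exhaustive enumeration on a toy dataset (scores invented for illustration); both the unbiasedness and the variance formula hold exactly, since under SCAR every labeled subset of size n_L is equally likely:

```python
from itertools import combinations

def S(a, b):
    return 1.0 if a > b else (0.5 if a == b else 0.0)

# Toy scores invented for illustration.
pos = [0.9, 0.7, 0.6, 0.2]          # X_P, n_P = 4
neg = [0.8, 0.5, 0.4, 0.3]          # X_N
allv = pos + neg                    # X_P ∪ X_N, n = 8
n, n_P, n_L = len(allv), len(pos), 2

# t_{x_i} as in the proof of Theorem 1.
t = [sum(S(p, q) for q in allv) / n for p in pos]
aul = sum(t) / n_P                                 # ground-truth AUL
sigma2 = sum((ti - aul) ** 2 for ti in t) / n_P    # population variance of t

# Enumerate every SCAR draw of n_L labeled positives.
vals = [sum(t[i] for i in L) / n_L for L in combinations(range(n_P), n_L)]
mean = sum(vals) / len(vals)
var = sum((v - mean) ** 2 for v in vals) / len(vals)

assert abs(mean - aul) < 1e-12                                    # E[AUL_PU] = AUL
assert abs(var - (n_P - n_L) / (n_P - 1) * sigma2 / n_L) < 1e-12  # variance formula
```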

4. EXPERIMENTAL EVALUATION OF AUL ESTIMATION

4.1. DATASETS

The experiments involve 9 real-life datasets from the UCI Machine Learning Repository (Dua & Graff, 2017), listed in Table 1. To create binary classification datasets, we slightly modify the original datasets' targets. For a regression dataset, we pick a proper threshold and convert it into a binary classification dataset. For a multi-class dataset, we choose one class as positive and the remaining classes as negative. We also convert categorical features to numerical features using one-hot encoding. Considering the computational overhead of MPE algorithms, we limit the size of each PN dataset to around 4,000. To generate a PU dataset, we select a subset of positive samples as labeled data by random sampling without replacement and regard the remaining samples as unlabeled. There are three settings of β ∈ {0.1, 0.2, 0.4}. For each dataset setting, 50 PU datasets are generated.
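The two target conversions described above amount to simple relabeling; a minimal sketch (helper names and threshold are ours, not from the paper):

```python
def regression_to_binary(targets, threshold):
    """Binarize a regression target at a chosen threshold."""
    return [1 if t > threshold else 0 for t in targets]

def multiclass_to_binary(labels, positive_class):
    """One chosen class becomes positive; all remaining classes are negative."""
    return [1 if c == positive_class else 0 for c in labels]

assert regression_to_binary([3.2, 7.5, 1.0], threshold=3.0) == [1, 1, 0]
assert multiclass_to_binary(["a", "b", "a"], positive_class="a") == [1, 0, 1]
```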

4.2. COMPARISON AMONG AUC&AUL ESTIMATION METHODS

High estimation accuracy and low cost are two important criteria for AUC/AUL estimation. Accuracy indicates how close the estimated AUC/AUL is to the ground-truth AUC/AUL; cost indicates how much time (computation power) an estimation takes. AUC and AUL estimation rely on a model's output, so for each dataset we train a classifier with LightGBM (Ke et al., 2017) to simulate the classifier to be evaluated. The performance of this classifier is reasonable, neither too bad nor perfect. To compare accuracy, we first compute the ground-truth values of AUC (AUC) and AUL (AUL) on the fully labeled PN dataset. We then calculate the estimated AUC (AUC_est) and AUL (AUL_PU) on the PU dataset and compare them with AUC and AUL. To calculate AUC_est, we use the direct conversion method following Jain et al. (2017): it first obtains AUC_PU by treating unlabeled data as negative, and then estimates AUC_est from it. The estimation step needs a mixture proportion estimation (MPE) algorithm to estimate the proportion of positive samples among the unlabeled samples. Three MPE algorithms are used in this experiment: Ramaswamy et al. (2016) provides two algorithms, named KM1 and KM2, and the distance-curve-based method of Zeiberg et al. (2020) is named Distance. It is worth noting that Zeiberg et al. (2020) compared their solution with all existing MPE algorithms and claim it to be the best so far. The corresponding estimated AUCs of the three algorithms are AUC_est_KM1, AUC_est_KM2 and AUC_est_Distance respectively. We use the code provided by the two papers (see footnotes). Table 1 shows the Mean Absolute Error (MAE) results for each dataset. AUC and AUL are ground-truth values computed on fully labeled datasets. The estimation process runs 50 times for each dataset setting to obtain AUC_est_KM1, AUC_est_KM2, AUC_est_Distance and AUL_PU.
MAE_AUC_KM1, MAE_AUC_KM2, MAE_AUC_Distance and MAE_AUL are the mean values of |AUC_est_KM1 - AUC|, |AUC_est_KM2 - AUC|, |AUC_est_Distance - AUC| and |AUL_PU - AUL| respectively. In all settings, the MAE of AUL estimation outperforms that of AUC estimation. The average MAE of AUL estimation over all 3×9 dataset settings is only 1/6 of that of the best AUC estimation method (Distance). Figure 2 illustrates the error (AUC_est - AUC or AUL_PU - AUL) distributions of the 150 estimation results per dataset (3 β settings per dataset, each setting run 50 times) using boxplots. Each boxplot shows the minimum, first quartile, median, third quartile and maximum as five horizontal lines, in ascending order; the mean is plotted as a black diamond. The figure shows that the mean AUL estimation error is the closest to 0 and the interquartile range (between the first and third quartiles) of the AUL estimation error is the smallest, indicating that AUL estimation is more accurate and more stable than AUC estimation. We also notice that as β grows, the interquartile range shrinks, which is consistent with our conclusion in Section 3. The cost of AUC estimation includes the AUC_PU calculation cost plus the MPE algorithm's cost; the cost of AUL estimation includes only the AUL_PU calculation, which is nearly the same as the AUC_PU calculation and is negligible (less than 1 second). The MPE cost, however, is much larger. We conducted an experiment to measure the time cost of MPE algorithms on a dataset with 8,000 instances and 117 features: the time cost of Ramaswamy et al. (2016)'s method grows fast with dataset size (4 seconds for 1,000 samples, 2,000 seconds for 8,000 samples). Zeiberg et al. (2020) introduces a univariate transform to reduce the dimensionality of the data; therefore their MPE method achieves a significant acceleration (16 seconds for 8,000 samples).
However, the selection of the univariate transform is itself time-consuming: a careless choice of transform makes the estimation inaccurate. In the work of Zeiberg et al. (2020), considerable effort was made to select the optimal transform for each PU dataset.
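For reference, the MAE reported above is computed as follows; the numbers are hypothetical and serve only to show the calculation.

```python
def mae(estimates, truth):
    """Mean Absolute Error of repeated estimates against a ground-truth value."""
    return sum(abs(e - truth) for e in estimates) / len(estimates)

# Hypothetical estimates from 3 runs against a hypothetical ground truth 0.62.
assert abs(mae([0.61, 0.64, 0.59], 0.62) - 0.02) < 1e-9
```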

5. EXPERIMENTAL EVALUATION OF AUL-OPTIMIZATION ALGORITHM

Optimizing AUC is a direct and popular way to train binary classifiers. Sakai et al. (2018) develops an AUC-optimization algorithm for the PU learning scenario, named PU_AUC. Following this idea, we implement an AUL-optimization algorithm, named PU_AUL, in a similar way. PU_AUC and PU_AUL share the same Gaussian-kernel basis function, adopted from Sakai et al. (2018). The datasets used in this section are the same as those in Section 4, except that β is fixed at 0.1. Because the PU_AUC algorithm requires the proportion of positive samples estimated by an MPE algorithm, we choose the best-performing MPE algorithm for PU_AUC on each dataset. The metric used for model performance comparison is the AUC calculated on the ground-truth PN data. Figure 3 shows that PU_AUL achieves better performance on 8 of 9 datasets (2.5% improvement on average). Because PU_AUC suffers from the error of the MPE algorithm, PU_AUL outperforms PU_AUC by 10% on the 'Concrete' dataset. Moreover, since MPE calculation is not required in PU_AUL, it runs much faster.
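A minimal sketch of the AUL-optimization idea, assuming a linear scorer trained by SGD on a logistic surrogate of the pairwise AUL objective (the actual PU_AUL uses Gaussian-kernel basis functions and a different solver; all data below are toy values):

```python
import math
import random

def train_aul_sgd(X, s, lr=0.1, steps=500, seed=0):
    """Toy AUL optimization with a linear scorer g(x) = w.x.
    Each step samples a labeled positive i and a random sample j (any
    label), then minimizes the logistic surrogate
        log(1 + exp(-(g(x_i) - g(x_j))))
    so that labeled positives are ranked above random samples -- the
    AUL criterion. No MPE estimate is needed."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    labeled = [i for i, si in enumerate(s) if si == 1]
    for _ in range(steps):
        i = rng.choice(labeled)            # random labeled positive
        j = rng.randrange(len(X))          # random sample, labeled or not
        diff = sum(w[k] * (X[i][k] - X[j][k]) for k in range(d))
        g = -1.0 / (1.0 + math.exp(diff))  # d/d(diff) of the surrogate loss
        for k in range(d):
            w[k] -= lr * g * (X[i][k] - X[j][k])
    return w

# Toy 1-D data: positives have large feature values; only two are labeled.
X = [[1.0], [0.9], [0.8], [0.2], [0.1], [0.0]]
s = [1, 1, 0, 0, 0, 0]     # observed labels: two labeled positives
w = train_aul_sgd(X, s)
assert w[0] > 0            # the scorer learned to rank positives higher
```

The key point this sketch illustrates is structural: the pairwise objective compares labeled positives against all samples, so no class-prior (MPE) estimate enters the training loop.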

6. CONCLUSION

In this paper, we suggest replacing AUC with AUL for both model evaluation and model training in the PU learning scenario. Compared with AUC, AUL is an unbiased metric and can be computed efficiently. Existing MPE algorithms, which are necessary for AUC-optimization algorithms, have been shown to be inaccurate and costly. Besides, choosing a good set of parameters for MPE algorithms in order to obtain a good estimate may take even more time.



Footnotes:
1. http://web.eecs.umich.edu/~cscott/code.html#kmpe
2. https://github.com/Dzeiberg/ClassPriorEstimation




Figure 1: AUC and AUL estimation on an example dataset (1a) (20 instances, α = 0.5, β = 0.5). Each instance has a prediction score (score) given by a certain model, a ground-truth label y and an observed label s.

Figure 2: Estimation error distribution on 9 datasets.

Figure 3: PU_AUL outperforms PU_AUC by up to 10% on 9 datasets.

Table 1: Mean Absolute Error (MAE) of AUC/AUL estimation methods on 9 datasets.

