AUL IS A BETTER OPTIMIZATION METRIC IN PU LEARNING

Abstract

Traditional binary classification models are trained and evaluated on fully labeled data, which is not common in real life. In non-ideal datasets, only a small fraction of the positive data is labeled. Training a model from such partially labeled data is called positive-unlabeled (PU) learning. A naive solution to PU learning is to treat unlabeled samples as negative. However, with such biased data, the trained model may converge to a non-optimal point and its real performance cannot be well estimated. Recent works try to recover the unbiased result by estimating the proportion of positive samples with mixture proportion estimation (MPE) algorithms, but the model performance is still limited and heavy computational cost is introduced (particularly for big datasets). In this work, we theoretically prove that the Area Under the Lift curve (AUL) is an unbiased metric in the PU learning scenario, and our experimental evaluation on 9 datasets shows that the average absolute error of AUL estimation is only 1/6 that of AUC estimation. Our experiments also show that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm not only significantly saves computational cost but also improves model performance by up to 10%.

1. INTRODUCTION

Classic binary classification tasks in machine learning usually assume that all data are fully labeled as positive or negative (PN learning). However, in real-world applications, datasets are usually non-ideal and only a small fraction of the positive data is labeled. Training a model from such partially labeled positive data is called positive-unlabeled (PU) learning. Take financial fraud detection as an example. Some fraudulent behaviors are found and can be labeled as positive, but we cannot simply regard the remaining data as negative, because in most cases only a subset of the frauds is detected and the remaining data may also contain undetected positive samples. As a result, the remaining data can only be regarded as unlabeled. Other typical PU learning applications include text classification, drug discovery, outlier detection, malicious URL detection, online advertising, etc. (Yu et al. (2002), Li & Liu (2003), Li et al. (2009), Blanchard et al. (2010), Zhang et al. (2017), Wu et al. (2018)).

A naive way to do PU learning is to treat unlabeled data as negative and use traditional PN learning algorithms. But a model trained in this way is biased and its prediction results are not reliable (Elkan & Noto (2008)). Some early works try to recover labels for the unlabeled data with heuristic algorithms, such as S-EM (Liu et al. (2002)), 1-DNF (Yu et al. (2002)), Rocchio (Li & Liu (2003)) and k-means (Chaudhari & Shevade (2012)). But the performance of the heuristic algorithms, which is critical to these works, is not guaranteed. Another kind of method introduces an unbiased risk estimator to eliminate the bias (Du Plessis et al. (2014), Du Plessis et al. (2015), Kiryo et al. (2017)). However, these methods rely on knowledge of the proportion of positive samples among the unlabeled samples, which is also unknown in practice.

Another challenging problem in PU learning is how to accurately evaluate a model's performance. Model performance is usually evaluated with metrics such as accuracy, precision, recall, F-score, and AUC (Area Under the ROC Curve).
During the life cycle of a model, its performance is usually monitored to ensure that the model keeps a desired level of performance as the data vary and grow. In PU learning, the metrics above are also biased due to the unknown proportion of positive samples. Although Menon et al. (2015) prove that the ground-truth AUC and the AUC estimated from PU data (AUC_PU) are linearly correlated, which indicates that AUC_PU can be used to compare the performance of two models, it is still not possible to evaluate the true performance of a single model. Consider a situation where a model is evaluated on two different PU datasets generated from the same PN dataset but with different positive sample proportions. The ground-truth AUC, which indicates the true performance of the model, is the same on both datasets, but the AUC_PU values on the two datasets are different. Hence, AUC_PU cannot be used to directly evaluate the model's performance. Jain et al. (2017) and Ramola et al. (2019) show that AUC_PU, accuracy_PU, balanced accuracy_PU, F-score_PU and the Matthews correlation coefficient can be corrected given knowledge of the proportion of positive samples. However, this proportion is difficult to obtain in practice, and according to our experiments on 9 datasets, the estimation methods still introduce errors that make the corrected metrics inaccurate. Besides, the MPE algorithms may also introduce non-trivial computational overhead (up to 2,000 seconds per proportion estimation in our experiments), which slows down the evaluation process.
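To make this bias concrete, the following minimal sketch (with synthetic Gaussian scores of our own choosing, not the datasets evaluated in this paper) computes AUC via the Mann-Whitney statistic and shows that treating unlabeled samples as negative yields an AUC_PU that changes with the labeling proportion, even though the underlying scores, and hence the ground-truth AUC, stay fixed:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the probability that a random
    positive is scored above a random negative (ties count as 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
n_pos = n_neg = 2000
# Hypothetical model scores: positives drawn from a higher-mean Gaussian.
scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                         rng.normal(0.0, 1.0, n_neg)])
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

auc_true = auc(scores, y_true)
auc_pu = {}
for c in (0.5, 0.2):  # fraction of positives that receive a label
    labeled = (rng.random(len(scores)) < c) & (y_true == 1)
    auc_pu[c] = auc(scores, labeled.astype(int))  # unlabeled treated as negative

print(round(auc_true, 3), round(auc_pu[0.5], 3), round(auc_pu[0.2], 3))
```

Since unlabeled positives behave like random negatives against the labeled ones, AUC_PU here is a mixture of the true AUC and 0.5, with weights depending on the labeling proportion; this is one way to see the linear relationship noted by Menon et al. (2015).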

Recently, many works focus on estimating the proportion of positive samples (Du Plessis & Sugiyama (2014), Christoffel et al. (2016), Ramaswamy et al. (2016), Jain et al. (2016), Bekker & Davis (2018), Zeiberg et al. (2020)); these are called mixture proportion estimation (MPE) algorithms.
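As one classical illustration of this family (the estimator of Elkan & Noto (2008), not one of the MPE algorithms evaluated later), the label frequency c = P(labeled | positive) can be estimated as the average of a probabilistic classifier g(x) ≈ P(labeled | x) over the labeled samples, after which the class prior follows from α = P(labeled) / c. The sketch below uses well-separated synthetic 1-D data, with a simple histogram estimate standing in for g; all names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
y = rng.random(n) < 0.4                  # hidden ground-truth labels (alpha = 0.4)
x = rng.normal(0.0, 1.0, n) + 4.0 * y    # 1-D feature, well-separated classes
s = y & (rng.random(n) < 0.5)            # label frequency c = 0.5: half the positives labeled

# Nonparametric estimate of g(x) = P(s = 1 | x) via quantile binning,
# standing in for the calibrated classifier of the original method.
edges = np.quantile(x, np.linspace(0.0, 1.0, 21))
idx = np.clip(np.digitize(x, edges[1:-1]), 0, 19)
g = np.array([s[idx == k].mean() for k in range(20)])[idx]

c_hat = g[s].mean()           # E[g(x) | s = 1] estimates c under separability
alpha_hat = s.mean() / c_hat  # P(s = 1) = c * alpha  =>  alpha = P(s = 1) / c
print(round(alpha_hat, 2))    # close to the true alpha = 0.4
```

Note that this estimator needs the positive class to be (nearly) separable from the negative class; when the classes overlap heavily, c_hat is biased low and the estimated prior too high, which is one reason more sophisticated MPE algorithms exist.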

In this work, we find that the Area Under the Lift chart (AUL) (Vuk & Curk (2006), Tufféry (2011)) is a discriminating, unbiased and computation-friendly metric for PU learning. We make the following contributions. a) We theoretically prove that AUL estimation is unbiased with respect to the ground-truth AUL and derive a theoretical bound on the estimation error. b) We carry out an experimental evaluation on 9 datasets, and the results show that the average absolute error of AUL estimation is only 1/6 that of AUC estimation, which means AUL estimation is more accurate and more stable than AUC estimation. c) By experiments we also find that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm not only significantly saves computational cost but also improves model performance by up to 10%.

The remainder of this paper is organized as follows. Section 2 describes the background knowledge. Section 3 theoretically proves that AUL estimation is unbiased in PU learning. Section 4 evaluates the performance of AUL estimation by experiments on 9 datasets. Section 5 experimentally shows the performance of the AUL-optimization algorithm by applying AUL in PU learning. Section 6 concludes the paper.

2. BACKGROUND

Binary Classification Problem: Let D = {<x_i, y_i>, i = 1, ..., n} be a positive and negative (PN) dataset with n instances. Each tuple <x_i, y_i> is a record, in which x_i ∈ R^d is the feature vector and y_i ∈ {1, 0} is the corresponding ground-truth label. Let X_P, X_N be the sets of feature vectors of the positive and negative samples respectively, and n_P, n_N the numbers of samples in these sets:

X_P = {x_i | y_i = 1, i = 1, ..., n_P},  X_N = {x_i | y_i = 0, i = 1, ..., n_N}.

In PU learning, we use α = n_P / (n_P + n_N) = n_P / n to denote the proportion of positive samples among all samples.

Confusion Matrix: A confusion matrix is used to discriminate the model performance of different binary classification algorithms. In a confusion matrix, true positives (TP) (actual and predicted labels both positive), true negatives (TN) (actual and predicted labels both negative), false positives (FP) (actually negative but predicted as positive), and false negatives (FN) (actually positive but predicted as negative) are counted according to the model's outputs. Obviously, n_TP + n_FN = n_P and n_TN + n_FP = n_N.

ROC: Since the numbers of TP, TN, FP and FN in a confusion matrix depend strongly on the classification threshold θ, the Receiver Operating Characteristic (ROC) curve (Fawcett & Tom, 2003) plots (x, y) = (fpr(θ), tpr(θ)) over all possible classification thresholds θ. In some literature, tpr is also known as sensitivity and the value of 1 - fpr is called specificity.
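As a minimal illustration of these definitions (with toy scores and labels of our own, not the datasets used later), the ROC points can be traced by sorting samples by score and accumulating TP and FP counts as the threshold sweeps downward; the trapezoidal area under the resulting curve is the AUC:

```python
import numpy as np

def roc_points(scores, labels):
    """(fpr(theta), tpr(theta)) as the threshold theta sweeps down the scores."""
    order = np.argsort(-scores)            # sort samples by descending score
    labels = np.asarray(labels)[order]
    n_p = labels.sum()
    n_n = len(labels) - n_p
    tpr = np.cumsum(labels) / n_p          # sensitivity at each threshold
    fpr = np.cumsum(1 - labels) / n_n      # 1 - specificity at each threshold
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
# Trapezoidal area under the ROC curve gives the AUC.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 3))  # -> 0.889: 8 of the 9 positive-negative pairs are ranked correctly
```

The AUC computed this way equals the fraction of positive-negative pairs that the scores rank correctly, which is the rank-statistic view of AUC used throughout the paper.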



