AUL IS A BETTER OPTIMIZATION METRIC IN PU LEARNING

Abstract

Traditional binary classification models are trained and evaluated on fully labeled data, which is rare in real life. In non-ideal datasets, only a small fraction of the positive data are labeled. Training a model from such partially labeled data is known as positive-unlabeled (PU) learning. A naive solution to PU learning is to treat unlabeled samples as negative. However, with such biased data the trained model may converge to a non-optimal point, and its real performance cannot be reliably estimated. Recent works try to recover the unbiased result by estimating the proportion of positive samples with mixture proportion estimation (MPE) algorithms, but the model performance remains limited and heavy computational cost is introduced (particularly for big datasets). In this work, we theoretically prove that the Area Under the Lift curve (AUL) is an unbiased metric in the PU learning scenario, and experimental evaluation on 9 datasets shows that the average absolute error of AUL estimation is only 1/6 of that of AUC estimation. Our experiments also show that, compared with the state-of-the-art AUC-optimization algorithm, the AUL-optimization algorithm not only significantly reduces the computational cost, but also improves model performance by up to 10%.

1. INTRODUCTION

Classic binary classification tasks in machine learning usually assume that all data are fully labeled as positive or negative (PN learning). However, in real-world applications, datasets are usually non-ideal and only a small fraction of the positive data are labeled. Training a model from such partially labeled positive data is called positive-unlabeled (PU) learning. Take financial fraud detection as an example: some fraudulent behaviors are found and can be labeled as positive, but we cannot simply regard the remaining data as negative, because in most cases only a subset of the frauds are detected, and the remaining data may still contain undetected positive samples. As a result, the remaining data can only be regarded as unlabeled. Other typical PU learning applications include text classification, drug discovery, outlier detection, malicious URL detection, online advertising, etc. (Yu et al. (2002), Li & Liu (2003), Li et al. (2009), Blanchard et al. (2010), Zhang et al. (2017), Wu et al. (2018)).

A naive way to perform PU learning is to treat unlabeled data as negative and use traditional PN learning algorithms. But a model trained in this way is biased and its prediction results are not reliable (Elkan & Noto (2008)). Some early works try to recover labels for unlabeled data with heuristic algorithms, such as S-EM (Liu et al. (2002)), 1-DNF (Yu et al. (2002)), Rocchio (Li & Liu (2003)), and k-means (Chaudhari & Shevade (2012)). But the performance of the heuristic algorithms, which is critical to these works, is not guaranteed. Another kind of method introduces an unbiased risk estimator to eliminate the bias (Du Plessis et al. (2014), Du Plessis et al. (2015), Kiryo et al. (2017)). However, these methods rely on knowledge of the proportion of positive samples among the unlabeled samples, which is also unknown in practice.

Another annoying problem in PU learning is how to accurately evaluate a model's performance. Model performance is usually evaluated with metrics such as accuracy, precision, recall, F-score, and AUC (Area Under the ROC Curve). During the life cycle of a model, its performance is usually monitored to ensure that the model keeps a desired level of performance as the data vary and grow. In PU learning, the metrics above are also biased due to the unknown proportion of positive samples. Although Menon et al. (2015) proves that the ground-truth AUC and
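The bias of naive PU evaluation, and the contrast with AUL, can be illustrated on synthetic data. The sketch below is not the paper's implementation: it assumes the SCAR setting (labeled positives are a uniform random subset of all positives) and uses the cumulative-gains form of the lift chart (recall among the top-ranked samples vs. the fraction of samples targeted). Treating unlabeled samples as negative shrinks the estimated AUC toward 0.5, while the AUL estimate stays close to its ground-truth value, because the recall of a random subset of positives is an unbiased estimate of the true recall.

```python
import random

def auc(scores, labels):
    """Probability that a randomly chosen positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aul(scores, labels):
    """Area under the lift (cumulative-gains) chart: recall within the
    top-k ranked samples, averaged over k = 1..n."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    hits, area = 0, 0.0
    for i in order:
        hits += labels[i]
        area += hits / n_pos  # recall after targeting one more sample
    return area / len(scores)

random.seed(0)
# Synthetic scores: positives are shifted up by one standard deviation.
true_y = [1] * 200 + [0] * 800
scores = [random.gauss(float(y), 1.0) for y in true_y]

# SCAR-style PU labels: each positive is labeled with probability 0.3;
# everything else (unlabeled positives included) is treated as negative.
pu_y = [1 if y == 1 and random.random() < 0.3 else 0 for y in true_y]

print("AUC  true vs. naive PU:", auc(scores, true_y), auc(scores, pu_y))
print("AUL  true vs. naive PU:", aul(scores, true_y), aul(scores, pu_y))
```

On this toy data the naive PU AUC is noticeably below the ground-truth AUC, while the two AUL values differ only by sampling noise, consistent with the claim that AUL can be estimated from PU data without knowing the positive proportion.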

