PAIRWISE CONFIDENCE DIFFERENCE ON UNLABELED DATA IS SUFFICIENT FOR BINARY CLASSIFICATION Anonymous

Abstract

Learning with confidence labels is an emerging weakly supervised learning paradigm, where training data are equipped with confidence labels instead of exact labels. Positive-confidence (Pconf) classification is a typical learning problem in this context, where we are given only positive data equipped with confidence. However, pointwise confidence may not be accessible in real-world scenarios. In this paper, we investigate a novel weakly supervised learning problem called confidence-difference (ConfDiff) classification. Instead of pointwise confidence, we are given only unlabeled data pairs equipped with a confidence difference specifying the difference in their probabilities of being positive. An unbiased risk estimator is derived to tackle the problem, and we show that the estimation error bound achieves the optimal convergence rate. Extensive experiments on benchmark data sets validate the effectiveness of our proposed approaches in leveraging the supervision information carried by confidence differences.

1. INTRODUCTION

Recent years have witnessed the prevalence of deep learning and its successful applications. However, this success is built on the collection of large amounts of data with unique and accurate labels. In many real-world scenarios, it is often difficult to satisfy such requirements. To circumvent this difficulty, various weakly supervised learning problems have been investigated, including but not limited to semi-supervised learning (Chapelle et al., 2006; Zhu & Goldberg, 2009; Li & Zhou, 2015; Berthelot et al., 2019), label-noise learning (Patrini et al., 2017; Han et al., 2018; Li et al., 2021; Wang et al., 2021; Wei et al., 2022), positive-unlabeled learning (du Plessis et al., 2014; Su et al., 2021; Yao et al., 2022), partial-label learning (Cour et al., 2011; Wang & Zhang, 2020; Wen et al., 2021; Wang et al., 2022; Wu et al., 2022), unlabeled-unlabeled learning (Lu et al., 2019; 2020), and similarity-based classification (Bao et al., 2018; Cao et al., 2021b; Bao et al., 2022). Learning with confidence labels (Ishida et al., 2018; Cao et al., 2021a;b) is another weakly supervised learning paradigm, where we are given training examples with confidence labels instead of exact labels. Positive-confidence (Pconf) classification (Ishida et al., 2018) is a problem setting within this scope, which aims to learn a binary classifier from only positive data equipped with confidence (the probability of being positive), without any negative data. Pconf classification alleviates the difficulty that arises when negative data cannot be acquired due to privacy or security issues during the data annotation process. The need to learn from such inexact supervision widely exists in real-world scenarios, such as purchase prediction (Ishida et al., 2018), user preservation prediction (Ishida et al., 2018), drivers' drowsiness prediction (Shinoda et al., 2020), etc.
However, collecting large amounts of training examples with pointwise confidence can be demanding in many circumstances, since it is difficult to specify the exact probability of being positive for each training example (Shinoda et al., 2020). Feng et al. (2021) showed that learning from pairwise comparisons can serve as an alternative strategy when pointwise labeling information is limited. Inspired by this, we investigate a more practical problem setting in this paper, where we are given only unlabeled data pairs with a confidence difference indicating the difference in their probabilities of being positive. Compared with pointwise confidence, confidence difference can be collected more easily in many real-world scenarios. Take click-through rate prediction in recommender systems (Zhang et al., 2019) as an example. The combinations of users and their favorite/disliked items can be regarded as positive/negative data. When collecting training data, it is not easy to distinguish between positive and negative data. Furthermore, the positive confidence of training data may be difficult to determine due to extreme sparsity and class imbalance (Yao et al., 2021). However, it is much easier to obtain the difference in preference between a pair of candidate items for a given user. Take the disease risk estimation problem as another example. The goal is to predict the risk of having some disease given a person's attributes. When doctors are asked to annotate the probabilities of having the disease, it is not easy to determine the exact values of these probabilities. Furthermore, the values given by different doctors may differ due to subjective personal assumptions and will deviate from the ground truth. However, it is much easier, and less biased, to estimate the relative difference in the probabilities of having the disease between two people.
Our contributions are summarized as follows:
• We investigate confidence-difference (ConfDiff) classification, a novel and practical weakly supervised learning problem, which can be solved via empirical risk minimization by constructing an unbiased risk estimator. The proposed approach can be equipped with any model, loss function, and optimizer flexibly.
• The estimation error bound is derived, showing that the proposed approach achieves the optimal parametric convergence rate. The robustness is further demonstrated by probing into the influence of an inaccurate class prior probability and noisy confidence differences.
• To mitigate overfitting issues, a risk correction approach (Lu et al., 2020) with a consistency guarantee is further introduced. Extensive experimental results on benchmark data sets validate the effectiveness of the proposed approaches.

Related works.

Learning with pairwise comparisons has been investigated extensively in the community (Burges et al., 2005; Cao et al., 2007; Jamieson & Nowak, 2011; Park et al., 2015; Kane et al., 2017; Xu et al., 2017; Shah et al., 2019), with applications in information retrieval (Liu, 2011), computer vision (Fu et al., 2015), regression (Xu et al., 2019; 2020), crowdsourcing (Chen et al., 2013; Zeng & Shen, 2022), graph learning (He et al., 2022), etc. It is noteworthy that there exist distinct differences between our work and previous works on learning with pairwise comparisons. Previous works have mainly tried to learn a ranking function that ranks candidate examples according to relevance or preference. In this paper, we try to learn a pointwise binary classifier by conducting empirical risk minimization under the binary classification setting.

Relationship to Pcomp classification. Feng et al. (2021) showed that a binary classifier can be learned from pairwise comparisons, a setting termed Pcomp classification. There are distinct differences between our work and Pcomp classification. First, Pcomp classification is not capable of leveraging the fine-grained confidence difference, which can be obtained incidentally when collecting pairwise comparison data. We will experimentally elucidate the benefit of exploiting the confidence difference in a later section. Second, the assumptions on the data generation process differ. Pcomp classification assumes that each unlabeled data pair is ordered, with the first instance more likely to be positive than the other. In ConfDiff classification, the instances of each unlabeled data pair are independent, which makes such pairs easier to collect.
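To make this data generation process concrete, the following is a minimal simulation of ConfDiff pairs: two instances are drawn independently from the unlabeled marginal, and the only supervision is the difference of their positive-class posteriors. The one-dimensional Gaussian class-conditionals, the known posterior, and the sign convention c(x, x') = p(+1 | x') - p(+1 | x) are illustrative assumptions of this sketch, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
pi_plus = 0.5  # class prior p(y = +1); assumed known for the simulation

def sample_x(m):
    # Draw m instances from the marginal p(x): a two-component Gaussian mixture.
    is_pos = rng.random(m) < pi_plus
    return np.where(is_pos, rng.normal(+1.0, 1.0, m), rng.normal(-1.0, 1.0, m))

def posterior_pos(x):
    # p(y = +1 | x) via Bayes' rule under the mixture above.
    num = pi_plus * np.exp(-0.5 * (x - 1.0) ** 2)
    den = num + (1.0 - pi_plus) * np.exp(-0.5 * (x + 1.0) ** 2)
    return num / den

# Each ConfDiff example: two *independently* drawn unlabeled instances plus
# only their confidence difference c(x, x') = p(+1 | x') - p(+1 | x);
# the true labels and the pointwise confidences are never revealed.
m = 1000
x, x_prime = sample_x(m), sample_x(m)
c = posterior_pos(x_prime) - posterior_pos(x)
```

Note that, unlike Pcomp pairs, nothing here orders the two instances: x and x_prime are exchangeable draws, and the signed difference c carries the supervision.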

2. PRELIMINARIES

In this section, we introduce the notation used in this paper and review the background of binary classification, Pconf classification, and Pcomp classification. Then, we elucidate the data generation process of ConfDiff classification.

2.1. BINARY CLASSIFICATION

For binary classification, let X = R^d denote the d-dimensional feature space and Y = {+1, -1} denote the label space. Let p(x, y) denote the unknown joint probability distribution over the random variables (x, y) ∈ X × Y. The task of binary classification is to learn a binary classifier g : X → R which minimizes the following classification risk:

R(g) = E_{p(x, y)}[ℓ(g(x), y)],  (1)
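As a concrete illustration of Eq. (1), the risk can be approximated by its empirical average over a labeled sample and minimized over a parametric model. The sketch below uses a linear scoring function, the logistic loss, and plain gradient descent; all of these choices, and the synthetic Gaussian data, are our assumptions for illustration rather than the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fully labeled binary data: x in R^2, y in {+1, -1}.
n, d = 200, 2
X = np.vstack([rng.normal(+1.0, 1.0, size=(n // 2, d)),
               rng.normal(-1.0, 1.0, size=(n // 2, d))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

def logistic_loss(margin):
    # ell(g(x), y) written in terms of the margin y * g(x).
    return np.log1p(np.exp(-margin))

# Linear scoring function g(x) = w^T x + b, trained by gradient descent
# on the empirical risk (1/n) * sum_i ell(g(x_i), y_i).
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    margin = y * (X @ w + b)
    coef = -y / (1.0 + np.exp(margin))   # d ell / d g(x)
    w -= lr * (X.T @ coef) / n
    b -= lr * coef.mean()

risk = float(logistic_loss(y * (X @ w + b)).mean())   # empirical estimate of R(g)
acc = float((np.sign(X @ w + b) == y).mean())
```

With exact labels this is ordinary empirical risk minimization; the weakly supervised settings discussed above replace the labeled empirical average with estimators built from the available supervision (positive confidences, comparisons, or confidence differences).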

