CONFIDENT SINKHORN ALLOCATION FOR PSEUDO-LABELING

Anonymous

Abstract

Semi-supervised learning is a critical tool for reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as image and language data, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are not applicable, however, to domains where such structure is unavailable, because the pretrained models and data augmentations cannot be used. Owing to their simplicity, existing pseudo-labeling (PL) methods can be widely applied without any domain assumption, but they are vulnerable to noisy samples and to greedy assignments based on a predefined threshold, which is typically unknown. This paper addresses these problems by proposing Confident Sinkhorn Allocation (CSA), which assigns labels only to samples with high confidence scores and learns the best label allocation via optimal transport. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning.

1. INTRODUCTION

The impact of machine learning continues to grow in fields as disparate as biology (Libbrecht & Noble, 2015; Tunyasuvunakool et al., 2021), quantum technology (Biamonte et al., 2017; van Esbroeck et al., 2020; Nguyen et al., 2021), brain stimulation (Boutet et al., 2021; van Bueren et al., 2021), and computer vision (Esteva et al., 2021; Yoon et al., 2022). Much of this impact depends on the availability of large numbers of annotated examples on which machine learning models can be trained. The data annotation task by which such labeled data is created is often expensive, however, and sometimes impossible. Rare genetic diseases, stock market events, and cyber-security threats, for example, are hard to annotate due to the volumes of data involved, the rate at which the significant characteristics change, or both.

Related work. Fortunately, for some classification tasks, we can overcome a scarcity of labeled data using semi-supervised learning (SSL) (Zhu, 2005; Huang et al., 2021; Killamsetty et al., 2021; Olsson et al., 2021). SSL exploits an additional set of unlabeled data with the goal of improving on the performance that might be achieved using labeled data alone (Lee et al., 2019; Carmon et al., 2019; Ren et al., 2020; Islam et al., 2021).

Domain specific: Semi-supervised learning for image and language data has made rapid progress (Oymak & Gulcu, 2021; Zhou, 2021; Sohn et al., 2020), largely by exploiting the inherent spatial and semantic structure of images (Komodakis & Gidaris, 2018) and language (Kenton & Toutanova, 2019). This is typically achieved either with pretext tasks (e.g., Komodakis & Gidaris, 2018; Alexey et al., 2016) or with contrastive learning (e.g., Van den Oord et al., 2018; Chen et al., 2020). Both approaches assume that specific transformations applied to each data element will not affect the associated label.
Greedy pseudo-labeling: Without domain assumptions, a simple but effective approach to SSL is pseudo-labeling (PL) (Lee et al., 2013), which generates 'pseudo-labels' for unlabeled samples using a model trained on labeled data. A label k is assigned to an unlabeled sample x_i when the predicted class probability exceeds a predefined threshold γ:

$$y_i^k = \mathbb{1}\left[\, p(y_i = k \mid x_i) \ge \gamma \,\right] \quad (1)$$

where γ ∈ [0, 1] is the threshold used to produce hard labels and p(y_i = k | x_i) is the predictive probability that the i-th data point belongs to the k-th class. A classifier can then be trained using both the original labeled data and the newly pseudo-labeled data. Pseudo-labeling is naturally an iterative process, with the next round of pseudo-labels being generated by the most recently trained classifier. The key advantage of pseudo-labeling is that it does not inherently require any domain assumption and can be applied to most domains, including tabular data.

Greedy PL with uncertainty: Rizve et al. (2021) propose an uncertainty-aware pseudo-label selection (UPS) that aims to reduce noise in the training process by using an uncertainty score together with the probability score when making assignments:

$$y_i^k = \mathbb{1}\left[\, p(y_i = k \mid x_i) \ge \gamma \,\right] \, \mathbb{1}\left[\, \mathcal{U}\big(p(y_i = k \mid x_i)\big) \le \gamma_u \,\right] \quad (2)$$

where γ_u is an additional threshold on the uncertainty level and U(p) is the uncertainty of a prediction p. As shown in Rizve et al. (2021), selecting predictions with low uncertainty greatly reduces the effect of poor calibration, thus improving robustness and generalization. However, the aforementioned PL methods are greedy in assigning labels: they simply compare each prediction against a predefined threshold γ, irrespective of the relative prediction values across samples and classes. Such greedy strategies are sensitive to the choice of threshold.
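As a concrete illustration of Eqs. (1) and (2), the following sketch selects pseudo-labels by thresholding the predicted probability and, optionally, an uncertainty score. It assumes scikit-learn as the base classifier and approximates U(p) by the variance of a small bootstrap ensemble; both are illustrative choices of ours, not the estimators prescribed by PL or UPS, and the function name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def greedy_pseudo_label(X_l, y_l, X_u, gamma=0.9, gamma_u=None,
                        n_models=5, seed=0):
    """Greedy PL (Eq. 1), with an optional uncertainty filter (Eq. 2).

    U(p) is approximated by the variance of the predicted probability
    over a bootstrap ensemble -- an illustrative stand-in for the
    uncertainty estimator used by UPS.
    """
    rng = np.random.default_rng(seed)
    # Fit a small bootstrap ensemble of base classifiers.
    all_probs = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_l), size=len(X_l))
        clf = LogisticRegression(max_iter=1000).fit(X_l[idx], y_l[idx])
        all_probs.append(clf.predict_proba(X_u))
    all_probs = np.stack(all_probs)              # (n_models, N_u, K)
    probs = all_probs.mean(axis=0)               # p(y_i = k | x_i)
    labels = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    mask = conf >= gamma                         # Eq. (1)
    if gamma_u is not None:                      # Eq. (2)
        unc = all_probs.var(axis=0)[np.arange(len(X_u)), labels]
        mask &= unc <= gamma_u
    return X_u[mask], labels[mask], mask
```

The selected pairs (X_u[mask], labels[mask]) would then be appended to the labeled set before retraining, and the process repeated.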
Non-greedy pseudo-labeling: FlexMatch (Zhang et al., 2021) adaptively selects a threshold γ_k for each class based on its level of difficulty, adjusting the threshold using the predictions across classes. However, the selection process is still heuristic, comparing the prediction score against an adjusted threshold. Recently, Tai et al. (2021) provided a novel view connecting the pseudo-label assignment task to the optimal transport problem, an approach called SLA, which inspires our work. SLA and FlexMatch improve on earlier PL methods in that their non-greedy label assignments use not only a single prediction value but also the relative importance of that value across rows and columns in a holistic way. However, both SLA and FlexMatch can overconfidently assign labels to noisy samples, and neither considers utilizing uncertainty values when making assignments.

Contributions. We propose a semi-supervised learning method that does not require any domain-specific assumption about the data. We hypothesize that this is by far the most common case for the vast volumes of data that exist. The method we propose is based on pseudo-labeling a set of unlabeled data using Confident Sinkhorn Allocation (CSA). Our method is theoretically driven by the role of uncertainty in robust label assignment in SSL. CSA utilizes Sinkhorn's algorithm (Cuturi, 2013) to assign labels only to the data samples with high confidence scores. By learning the label assignment with optimal transport, CSA eliminates the need to predefine the heuristic thresholds used in existing pseudo-labeling methods, which can be greedy. The proposed CSA is applicable to any data domain, and could be used in concert with consistency-based approaches (Sohn et al., 2020), but it is particularly useful for data domains where pretext tasks and data augmentation are not applicable, such as tabular data.
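To make the optimal-transport view of label assignment concrete, the following minimal sketch runs Sinkhorn's algorithm (Cuturi, 2013) on a cost matrix of negative log-probabilities. It is an illustration of ours, not the exact SLA or CSA procedure: the choice of uniform sample mass, the class-proportion column marginals, the regularization strength, and the function name are all assumptions.

```python
import numpy as np

def sinkhorn_allocate(probs, class_props, reg=0.1, n_iters=500):
    """Allocate pseudo-labels via entropy-regularized optimal transport.

    probs:       (N, K) predicted class probabilities
    class_props: (K,) expected class proportions (column marginals)
    Returns hard labels (argmax of the transport plan) and the plan.
    """
    N, K = probs.shape
    cost = -np.log(probs + 1e-12)        # transport cost per (sample, class)
    G = np.exp(-cost / reg)              # Gibbs kernel
    r = np.full(N, 1.0 / N)              # each sample carries equal mass
    c = np.asarray(class_props, dtype=float)
    c = c / c.sum()
    u, v = np.ones(N), np.ones(K)
    for _ in range(n_iters):             # alternating marginal scaling
        u = r / (G @ v)
        v = c / (G.T @ u)
    plan = u[:, None] * G * v[None, :]   # transport plan diag(u) G diag(v)
    return plan.argmax(axis=1), plan
```

Because the row and column marginals are matched jointly, an assignment depends on the prediction's value relative to all other samples and classes, rather than on a per-sample threshold.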

2. CONFIDENT SINKHORN ALLOCATION (CSA)

We consider the semi-supervised learning setting in which we have access to a dataset of labeled examples $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ and one of unlabeled examples $\mathcal{D}_u = \{x_i\}_{i=1}^{N_u}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{1, \ldots, K\}$. We also define $X = \{x_i\}$, $i \in \{1, \ldots, N_l + N_u\}$. Our goal is to utilize $\mathcal{D}_l \cup \mathcal{D}_u$ to learn a classifier that performs better than one trained on the labeled data alone.



[Table: Comparison with the related approaches in terms of properties and their relative trade-offs.]

