CONFIDENT SINKHORN ALLOCATION FOR PSEUDO-LABELING

Anonymous

Abstract

Semi-supervised learning is a critical tool for reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as image and language data, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. These methods are no longer applicable, however, when such domain structure is not available, because neither pretrained models nor data augmentation can be used. Owing to their simplicity, existing pseudo-labeling (PL) methods can be widely used without any domain assumption, but they are vulnerable to noisy samples and to the greedy assignments induced by a predefined threshold, which is typically unknown. This paper addresses these problems by proposing Confident Sinkhorn Allocation (CSA), which assigns labels only to samples with high confidence scores and learns the best label allocation via optimal transport. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning.

1. INTRODUCTION

The impact of machine learning continues to grow in fields as disparate as biology (Libbrecht & Noble, 2015; Tunyasuvunakool et al., 2021), quantum technology (Biamonte et al., 2017; van Esbroeck et al., 2020; Nguyen et al., 2021), brain stimulation (Boutet et al., 2021; van Bueren et al., 2021), and computer vision (Esteva et al., 2021; Yoon et al., 2022). Much of this impact depends on the availability of large numbers of annotated examples on which machine learning models can be trained. The data annotation by which such labeled data is created is often expensive, however, and sometimes impossible. Rare genetic diseases, stock market events, and cyber-security threats, for example, are hard to annotate due to the volumes of data involved, the rate at which the significant characteristics change, or both.

Related work. Fortunately, for some classification tasks we can overcome a scarcity of labeled data using semi-supervised learning (SSL) (Zhu, 2005; Huang et al., 2021; Killamsetty et al., 2021; Olsson et al., 2021). SSL exploits an additional set of unlabeled data with the goal of improving on the performance that might be achieved using labeled data alone (Lee et al., 2019; Carmon et al., 2019; Ren et al., 2020; Islam et al., 2021).

Domain specific: Semi-supervised learning for image and language data has made rapid progress (Oymak & Gulcu, 2021; Zhou, 2021; Sohn et al., 2020), largely by exploiting the inherent spatial and semantic structure of images (Komodakis & Gidaris, 2018) and language (Kenton & Toutanova, 2019). This is typically achieved either using pretext tasks (e.g., Komodakis & Gidaris, 2018; Alexey et al., 2016) or contrastive learning (e.g., Van den Oord et al., 2018; Chen et al., 2020). Both approaches assume that specific transformations applied to each data element will not affect the associated label.
Greedy pseudo-labeling: Without domain assumptions, a simple but effective approach to SSL is pseudo-labeling (PL) (Lee et al., 2013), which generates 'pseudo-labels' for unlabeled samples using a model trained on the labeled data. A label k is assigned to an unlabeled sample x_i when the predicted class probability exceeds a predefined threshold γ:

y_i^k = 1[ p(y_i = k | x_i) ≥ γ ].   (1)
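The greedy thresholding rule in Eq. (1) can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the function name, the choice of -1 for unassigned samples, and the default threshold are assumptions made for the example.

```python
import numpy as np

def greedy_pseudo_labels(probs, gamma=0.9):
    """Greedy pseudo-labeling as in Eq. (1): assign class k to an
    unlabeled sample when p(y_i = k | x_i) >= gamma.

    probs : (n_samples, n_classes) array of predicted class probabilities.
    Returns an array of class indices, with -1 marking samples whose
    top probability falls below the threshold (left unlabeled).
    The -1 convention is an illustrative choice, not from the paper.
    """
    probs = np.asarray(probs)
    best = probs.argmax(axis=1)              # most confident class per sample
    confident = probs.max(axis=1) >= gamma   # does it clear the threshold?
    return np.where(confident, best, -1)

# Example: only the first sample clears the 0.9 threshold.
probs = [[0.95, 0.03, 0.02],
         [0.60, 0.30, 0.10]]
print(greedy_pseudo_labels(probs, gamma=0.9))  # -> [ 0 -1]
```

Note how the outcome hinges entirely on the choice of γ: lowering it to 0.5 would also pseudo-label the second sample, which is exactly the sensitivity to the (typically unknown) threshold that motivates CSA.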

