ULF: UNSUPERVISED LABELING FUNCTION CORRECTION USING CROSS-VALIDATION FOR WEAK SUPERVISION

Anonymous authors
Paper under double-blind review

Abstract

A way to overcome expensive and time-consuming manual data labeling is weak supervision: automatic annotation of data samples via a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the classes associated with the LFs. In this work, we investigate noise reduction techniques for weak supervision based on the principle of k-fold cross-validation. We introduce ULF, a new algorithm for denoising weakly annotated data that uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. We realize two variants of this algorithm: feature-based ULF, which relies on count-based feature vectors, and DeepULF, which fine-tunes pre-trained language models. We compare ULF to methods originally developed for detecting erroneous samples in manually annotated data, as well as to our extensions of such methods to the weakly supervised setting. Our new weak-supervision-specific methods (ULF and these extensions) leverage information about matching LFs, making the detection of noisy samples more accurate. Evaluation on several datasets shows that ULF can successfully improve weakly supervised learning without using any manually labeled data.

1. INTRODUCTION

A large part of today's machine learning success rests upon vast amounts of annotated training data. However, manual expert annotation is tedious and expensive. There are different approaches to reducing this data bottleneck: fine-tuning large pre-trained models (Devlin et al., 2019), applying active learning (Sun & Grishman, 2012), and semi-supervised learning (Kozareva et al., 2008). However, even in reduced amounts, these approaches still demand manually annotated data. Moreover, constant data re-annotation would be necessary in settings with dynamically changing task specifications or data distributions. Another strategy that does not require any manual data labeling is weak supervision (WS), which allows one to obtain massive amounts of labeled training data at low cost. In a weakly supervised setting, the data is labeled in an automated process using one or multiple weak supervision sources, such as external knowledge bases (Lin et al., 2016; Mintz et al., 2009) and manually defined or automatically generated heuristics (Varma & Ré, 2018). By applying such rules, or labeling functions (LFs; Ratner et al., 2020), to a large unlabeled dataset, one can quickly obtain weak training labels, which are, however, potentially error-prone and need additional denoising (see examples in Fig. 1). In this work, we explore methods for improving the quality of weak labels based on the principle of k-fold cross-validation. Intuitively, if some part of the data is left out during training, the model does not overfit to errors specific to that part. Therefore, a mismatch between the predictions of a model (trained on a large portion of the dataset) and the labels (of the held-out portion) can indicate noise candidates specific to the held-out portion. This idea has motivated different approaches to data cleaning (Northcutt et al., 2021; Wang et al., 2019c).
As these approaches deal with general, non-weakly-supervised data, they usually split the data samples into folds randomly and independently at the sample level. However, a direct application of these methods to weakly labeled data ignores valuable knowledge stemming from the weak supervision.
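The standard sample-level cross-validation approach to noise detection described above can be sketched as follows. This is an illustrative sketch, not the paper's ULF algorithm: it uses scikit-learn's `cross_val_predict` on a synthetic toy dataset with a few deliberately flipped labels standing in for erroneous weak labels, and flags samples whose out-of-fold prediction disagrees with their assigned label. All data and model choices here (Gaussian blobs, logistic regression, the `noisy_idx` indices) are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data: two well-separated Gaussian clusters with binary labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y_weak = np.array([0] * 50 + [1] * 50)

# Simulate labeling noise: flip a few labels, as an error-prone
# labeling function might.
noisy_idx = [3, 57, 80]
y_weak[noisy_idx] = 1 - y_weak[noisy_idx]

# Out-of-fold predictions: each sample is predicted by a model that
# never saw it during training, so the model cannot overfit to that
# sample's (possibly wrong) label.
oof_pred = cross_val_predict(LogisticRegression(), X, y_weak, cv=5)

# A mismatch between the out-of-fold prediction and the assigned label
# marks the sample as a noise candidate.
candidates = np.flatnonzero(oof_pred != y_weak)
print(sorted(candidates.tolist()))
```

Note that the folds here are drawn randomly at the sample level; the point of ULF is to hold out labeling functions instead, so that fold-specific biases correspond to LF-specific biases.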

