ULF: UNSUPERVISED LABELING FUNCTION CORRECTION USING CROSS-VALIDATION FOR WEAK SUPERVISION
Anonymous authors
Paper under double-blind review

Abstract

A way to overcome expensive and time-consuming manual data labeling is weak supervision: automatic annotation of data samples via a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the classes associated with the LFs. In this work, we investigate noise reduction techniques for weak supervision based on the principle of k-fold cross-validation. We introduce ULF, a new algorithm for denoising weakly annotated data that uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. We realize two variants of this algorithm: feature-based ULF (relying on count-based feature vectors) and DeepULF (fine-tuning pre-trained language models). We compare ULF to methods originally developed for detecting erroneous samples in manually annotated data, as well as to our extensions of such methods to the weakly supervised setting. Our new weak-supervision-specific methods (ULF and the extensions) leverage information about matching LFs, making the detection of noisy samples more accurate. Evaluation on several datasets shows that ULF can successfully improve weakly supervised learning without using any manually labeled data.

1. INTRODUCTION

A large part of today's machine learning success rests upon vast amounts of annotated training data. However, manual expert annotation is tedious and expensive work. There are different approaches to reducing this data bottleneck: fine-tuning large pre-trained models (Devlin et al., 2019), applying active learning (Sun & Grishman, 2012), and semi-supervised learning (Kozareva et al., 2008). However, even if in reduced amounts, these approaches still demand manually annotated data. Moreover, constant data re-annotation would be necessary in settings with dynamically changing task specifications or changing data distributions. Another strategy that does not require any manual data labeling is weak supervision (WS), which makes it possible to obtain massive amounts of labeled training data at low cost. In a weakly supervised setting, the data is labeled in an automated process using one or multiple weak supervision sources, such as external knowledge bases (Lin et al., 2016; Mintz et al., 2009) and manually defined or automatically generated heuristics (Varma & Ré, 2018). By applying such rules, or labeling functions (LFs; Ratner et al., 2020), to a large unlabeled dataset, one can quickly obtain weak training labels, which are, however, potentially error-prone and need additional denoising (see examples in Fig. 1).

In this work, we explore methods for improving the quality of weak labels based on the principle of k-fold cross-validation. Intuitively, if some part of the data is left out during training, the model does not overfit to errors specific to that part. Therefore, a mismatch between the predictions of a model (trained on a large portion of the dataset) and the labels (of the held-out portion) can indicate candidates for noise specific to the held-out portion. This idea has motivated different approaches to data cleaning (Northcutt et al., 2021; Wang et al., 2019c).
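To make the weak labeling process concrete, the following is a minimal sketch of how LFs assign weak labels, covering the three cases illustrated in Fig. 1 (agreement, conflict, no match). The keyword rules, class names, and example comments are purely illustrative and are not the actual LFs used in the paper.

```python
# Illustrative labeling functions (LFs) for a YouTube-comment-style spam task.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_check_out(text):
    # Keyword LF assigned to the SPAM class (hypothetical rule)
    return SPAM if "check out" in text.lower() else ABSTAIN

def lf_short_comment(text):
    # Heuristic LF assigned to the HAM class: short comments tend to be genuine
    return HAM if len(text.split()) < 5 else ABSTAIN

LFS = [lf_check_out, lf_short_comment]

def weak_label(text):
    """Majority vote over matching LFs; ABSTAIN if no LF matches.

    Ties between conflicting LFs are broken arbitrarily here;
    real label models resolve conflicts in a more principled way.
    """
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN                       # case (3): no weak signal
    return max(set(votes), key=votes.count)  # cases (1)/(2)

print(weak_label("check out my new channel!"))  # -> 1 (SPAM)
print(weak_label("nice song"))                  # -> 0 (HAM)
print(weak_label("this video reminds me of my childhood summers"))  # -> -1
```

Samples on which no LF matches (the `ABSTAIN` case) are exactly those that many pipelines discard but that ULF still assigns labels to.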
As these methods deal with general, non-weakly-supervised data, they usually split the data samples into folds randomly and independently at the sample level. However, a direct application of these methods to weakly labeled data ignores valuable knowledge stemming from the weak supervision process (e.g., which LFs matched in each sample or what class each LF corresponds to). In this work, we leverage this additional source of knowledge by splitting the data according to the LFs matched in the samples. We build on the intuition that a mismatch between the predictions of a model (trained with a large portion of the LFs) and the labels (generated by held-out LFs) can indicate candidates for noise specific to the held-out LFs. If this LF-specific cross-validation is done for each LF, noise associated with all LFs can be found and corrected. This idea is realized in our extensions to the methods proposed by Northcutt et al. (2021) and Wang et al. (2019c). Beyond that, we use the principle of weakly supervised cross-validation to break out of the logic of merely repairing the labels and instead repair the LF-to-class assignment. This approach is formalized in ULF, our new method for Unsupervised Labeling Function correction with k-fold cross-validation. Its primary goal is to improve the allocation of LFs to classes in order to correct systematically biased label assignments. ULF re-estimates the joint distribution between LFs and class labels during cross-validation based on highly confident class predictions and their co-occurrence with matching LFs. The improved allocation makes it possible to re-assign the weak labels for further training. Importantly, ULF also improves the labels of samples with no matching LFs, in contrast to other methods that filter them out (Ratner et al., 2020). Overall, our main contributions are: (1) A new method, ULF, for improving the LF-to-class allocation in an unsupervised fashion.
ULF not only detects inconsistent predictions but also corrects the process that led to them by re-estimating the assignment of LFs to classes. Training with ULF results in more accurate labels and a better-quality trained classifier. (2) Two implementations of ULF: feature-based ULF for feature-based learning (without a hidden layer), and DeepULF for fine-tuning pre-trained language models. (3) Extensions of two methods that denoise data using the principle of k-fold cross-validation (Wang et al., 2019c; Northcutt et al., 2021). Our extensions, Weakly Supervised CrossWeigh and Weakly Supervised Cleanlab, profit from WS-specific information and make the denoising of WS data more accurate. (4) Extensive experiments on several weakly supervised datasets demonstrating the effectiveness of our methods. To the best of our knowledge, we are the first (1) to adapt k-fold cross-validation-based noise detection methods to WS problems, and (2) to refine the LF-to-class allocation in the WS setting.
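The core step described above, holding out LFs rather than samples, training on weak labels derived from the remaining LFs, and re-estimating the LF-to-class allocation from confident out-of-fold predictions, can be sketched as follows. This is a heavily simplified illustration under our own assumptions (majority-vote weak labels, a logistic-regression base model, a fixed confidence threshold); function and variable names are hypothetical, and the paper's actual algorithm involves further steps.

```python
# Simplified sketch of one ULF-style re-estimation round.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ulf_step(X, Z, T, num_classes, num_folds=3, conf_threshold=0.5):
    """
    X: (n, d) feature matrix; Z: (n, num_lfs) binary LF match matrix;
    T: (num_lfs, num_classes) current LF-to-class allocation.
    Returns a re-estimated allocation based on cross-validated predictions.
    """
    n, num_lfs = Z.shape
    counts = np.zeros((num_lfs, num_classes))
    lf_folds = np.array_split(np.random.permutation(num_lfs), num_folds)
    for held_out in lf_folds:
        kept = np.setdiff1d(np.arange(num_lfs), held_out)
        # Weak labels from the kept LFs only (vote weighted by T)
        scores = Z[:, kept] @ T[kept]            # (n, num_classes)
        train_mask = scores.sum(axis=1) > 0      # samples with >=1 kept LF
        y_weak = scores.argmax(axis=1)
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_mask], y_weak[train_mask])
        probs = model.predict_proba(X)           # predictions for all samples
        preds = model.classes_[probs.argmax(axis=1)]
        confident = probs.max(axis=1) >= conf_threshold
        # Co-occurrence of held-out LF matches with confident predictions
        for j in held_out:
            for c in range(num_classes):
                counts[j, c] += np.sum(Z[:, j].astype(bool)
                                       & confident & (preds == c))
    # Normalize rows to a new allocation; keep the old row if no evidence
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), T)
```

The re-estimated matrix can then be used to re-assign weak labels (including for samples whose predictions are confident but which matched no LF) and the procedure repeated.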

2. RELATED WORK

Weak supervision has been widely applied to different tasks in various domains, such as text classification (Ren et al., 2020; Shu et al., 2020), relation extraction (Yuan et al., 2019; Hoffmann et al., 2011), named entity recognition (Lan et al., 2020; Wang et al., 2019c), video analysis (Fang et al., 2020; Kundu et al., 2019), the medical domain (Fries et al., 2021), image classification (Li et al., 2021), and others. Weak labels are usually cheap and easy to obtain, but also potentially error-prone, and thus often need additional denoising.

Denoising methods. Among the most popular approaches to improving weakly supervised data is building a specific model architecture or reformulating the loss function (Karamanolakis et al., 2021; Hedderich & Klakow, 2018; Goldberger & Ben-Reuven, 2017; Sukhbaatar et al., 2014). Sometimes, weak labels are combined with additional expert annotations: for example, by adding



Figure 1: Examples of weakly supervised annotation from the YouTube dataset. In (1), both matched LFs correspond to the SPAM class; the sample is therefore assigned to the SPAM class. In (2), there is a conflict: one of the matched LFs belongs to the SPAM class, while the other belongs to the HAM class. In (3), no LFs matched, meaning the sample does not receive any weak signal.

