LABEL PROPAGATION WITH WEAK SUPERVISION

Abstract

Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. In this paper, we introduce a novel analysis of the classical label propagation algorithm (LPA) (Zhu & Ghahramani, 2002) that moreover takes advantage of useful prior information, specifically probabilistic hypothesized labels on the unlabeled data. We provide an error bound that exploits both the local geometric properties of the underlying graph and the quality of the prior information. We also propose a framework to incorporate multiple sources of noisy information. In particular, we consider the setting of weak supervision, where our sources of information are weak labelers. We demonstrate the ability of our approach on multiple benchmark weakly supervised classification tasks, showing improvements upon existing semi-supervised and weakly supervised methods.

1. INTRODUCTION

High-dimensional machine learning models require large labeled datasets for good performance and generalization. In the paradigm of semi-supervised learning, we look to overcome the bottleneck of labeled data by leveraging large amounts of unlabeled data together with assumptions on how the target predictor behaves over the unlabeled samples. In this work, we focus on the classical semi-supervised approach of label propagation (LPA) (Zhu & Ghahramani, 2002; Zhou et al., 2003). This method propagates labels from labeled to unlabeled samples, under the assumption that the target predictor is smooth with respect to a graph over the samples (frequently defined by a Euclidean distance threshold or nearest neighbors). However, in practice, satisfying this strong assumption can require a highly disconnected graph. In these cases, LPA performs well locally on regions connected to labeled points, but has low overall coverage, as it cannot propagate to points beyond these connected regions.

In practice, we also have additional side information beyond such smoothness of the target predictor. One concrete example of side information comes from the field of weakly supervised learning (WSL) (Ratner et al., 2016; 2017), which considers learning predictors from domain knowledge that takes the form of hand-engineered weak labelers. These weak labelers are heuristics that provide multiple weak labels per unlabeled sample, and the focus in WSL is to aggregate these weak labels into a noisy pseudolabel for each unlabeled sample. In practice, weak labelers are typically not designed to be smooth with respect to a graph, even though the underlying target predictor might be. For example, weak labelers are commonly defined as hard, binary predictions, with the ability to abstain from predicting. We thus see that LPA and WSL have complementary sources of information, as smoothing via LPA can improve the quality of weak labelers.
By encouraging smoothness, predictions near multiple abstentions can be made more uncertain, and abstentions can be converted into predictions by confident nearby predictions.

In this paper, we first bolster the theoretical foundations of LPA in the presence of side information. While LPA has a strong theoretical motivation of leveraging smoothness of the target predictor, there is limited theory on how accurate the propagated labels actually are. As a key contribution of this paper, we provide a "fine-grained" theory of LPA when used with any general prior on the target classes of the unlabeled samples. We provide a novel error bound for LPA that depends on key local geometric properties of the graph, such as the underlying smoothness of the target predictor over the graph and the flow of edges from labeled points, as well as on the accuracy of our prior. Our bound provides intuition as to when LPA should prioritize propagating label information and when it should prioritize using prior information. We compare our error bound to an existing spectral bound (Belkin & Niyogi, 2004) and demonstrate that our bound is preferable in some examples.

Next, we propose a framework for incorporating multiple sources of noisy information into LPA by extending a framework from Zhu et al. (2003). We construct additional "dongle" nodes in the graph that correspond to individual noisy labels, connect these nodes to the unlabeled points that receive noisy predictions, and perform label propagation on this new graph as usual. We study several techniques for determining the weights on these additional edges.

Finally, we focus on the specific case where our side information comes from WSL. We provide experimental results on standard weakly supervised benchmark tasks (Zhang et al., 2021) to support our theoretical claims and to compare our methods to standard LPA, other semi-supervised methods, and existing weakly supervised baselines.
Our experiments demonstrate that incorporating smoothness via LPA into the standard weakly supervised pipeline leads to better performance, outperforming many existing WSL algorithms. This suggests that there are significant benefits to combining LPA and WSL, and we believe this intersection is fertile ground for future research.
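To make the propagation step concrete, the following is a minimal sketch of the classical LPA iteration (Zhu & Ghahramani, 2002) discussed above. The function name and hyperparameters are our own illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def label_propagation(W, y_labeled, labeled_idx, n_iter=100):
    """Sketch of classical LPA (Zhu & Ghahramani, 2002).

    W: (n, n) symmetric non-negative affinity matrix over all points.
    y_labeled: (k, C) one-hot labels for the labeled points.
    labeled_idx: indices of the labeled points within W.
    Returns an (n, C) matrix of soft label scores.
    """
    n = W.shape[0]
    C = y_labeled.shape[1]
    # Row-normalize W to obtain the transition matrix P = D^{-1} W.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    F = np.zeros((n, C))
    F[labeled_idx] = y_labeled
    for _ in range(n_iter):
        F = P @ F                    # propagate labels one step
        F[labeled_idx] = y_labeled   # clamp the labeled points
    return F
```

Note that the labeled points are re-clamped after every step, so label information flows outward only along graph edges; points in components with no labeled vertex never receive a prediction, which is exactly the coverage limitation discussed above.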

1.1. RELATED WORK

Label propagation. Many papers have studied LPA from a theoretical standpoint. LPA has various connections to random walks and spectral clustering (Zhu et al., 2003), manifold learning (Belkin & Niyogi, 2004; Belkin et al., 2006), network generative models (Yamaguchi & Hayashi, 2017), and graph conductance (Talukdar & Cohen, 2014). Another line of research proposes using prior information at the initialization of LPA (Yamaguchi et al., 2016; Zhou et al., 2018), with applications in image segmentation (Vernaza & Chandraker, 2017), distant supervision (Bing et al., 2015), and domain adaptation (Cai et al., 2021; Wei et al., 2020). Finally, as the graph has a large impact on the performance of LPA, another line of work studies how to optimize the construction of the graph with linear (Wang & Zhang, 2007), manifold-based (Karasuyama & Mamitsuka, 2013), or deep-learning-based (Liu et al., 2018; 2019) methods.

Weakly supervised learning. The field of (programmatic) weakly supervised learning provides a framework for creating and combining hand-engineered weak labelers (Ratner et al., 2016; 2017; 2019; Fu et al., 2020) to pseudolabel unlabeled data and train a downstream model. Recent advances in weakly supervised learning extend the setting to include a small set of labeled data. One recent line of work constrains the space of possible pseudolabels via weak labeler accuracies (Arachie & Huang, 2019; Mazzetto et al., 2021a; b; Arachie & Huang, 2021; 2022). Other works improve the aggregation scheme (Xu et al., 2021) or the weak labelers themselves (Awasthi et al., 2020). We note that only one method incorporates any notion of smoothness into the weakly supervised pipeline (Chen et al., 2022); that work leverages the smoothness of pretrained embeddings for clustering. While clustering and LPA have similar intuitions, they result in fundamentally different notions of smoothness.
We also remark that Chen et al. (2022) does not consider the semi-supervised setting.

Semi-supervised learning. Many other methods in semi-supervised learning look to induce smoothness in a learned model. These include consistency regularization (Bachman et al., 2014; Sajjadi et al., 2016; Samuli & Timo, 2017; Sohn et al., 2020) and co-training (Blum & Mitchell, 1998; Balcan et al., 2004; Han et al., 2018). In addition, Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Scarselli et al., 2008; Gori et al., 2005; Henaff et al., 2015) are a class of deep-learning-based methods that also operate over graphs. Some recent works (Huang et al., 2020; Wang & Leskovec, 2020; Dong et al., 2021) have made connections between GNNs and LPA. While all these methods pursue the similar goal of learning a smooth function, they do not address the weakly supervised setting.
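Before turning to the formal setup, the "dongle"-node construction from the introduction (extending Zhu et al., 2003) can be sketched as follows. This is an illustrative sketch under our own naming and signature conventions, not the exact procedure evaluated in the experiments: each unlabeled point that receives a weak prediction gains an auxiliary clamped node carrying that prediction, attached by an edge whose weight encodes trust in the weak labeler.

```python
import numpy as np

def add_dongles(W, weak_labels, weight=0.1):
    """Attach dongle nodes carrying weak labels to an affinity matrix.

    W: (n, n) affinity matrix over the original points.
    weak_labels: dict mapping point index -> class id; points with no
        weak prediction (abstentions) are simply absent from the dict.
    weight: edge weight between a point and its dongle; larger values
        place more trust in the weak labeler.
    Returns the augmented affinity matrix, the dongles' one-hot labels
    (to be clamped during propagation), and the dongle node indices.
    """
    n = W.shape[0]
    idx = sorted(weak_labels)           # points that received a weak label
    d = len(idx)
    C = max(weak_labels.values()) + 1
    W_aug = np.zeros((n + d, n + d))
    W_aug[:n, :n] = W                   # keep the original graph intact
    Y_dongle = np.zeros((d, C))
    for k, i in enumerate(idx):
        W_aug[i, n + k] = W_aug[n + k, i] = weight  # point <-> dongle edge
        Y_dongle[k, weak_labels[i]] = 1.0
    return W_aug, Y_dongle, np.arange(n, n + d)
```

Running standard LPA on the augmented graph, with the dongles clamped to their weak labels, then lets the propagation trade off graph smoothness against the noisy side information through the single knob `weight`.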

2. PRELIMINARIES

We consider a binary classification setting where we want to learn a classifier $f^* : \mathcal{X} \to \{0, 1\}$. We observe a small set of labeled data $L = \{(x_i, y_i)\}_{i=1}^{n}$ and a much larger set of unlabeled data $U = \{x_j\}_{j=n+1}^{n+m}$. LPA relies on the assumption that nearby data points have similar labels. This is

