LABEL PROPAGATION WITH WEAK SUPERVISION

Abstract

Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. In this paper, we introduce a novel analysis of the classical label propagation algorithm (LPA) (Zhu & Ghahramani, 2002) that additionally takes advantage of useful prior information, specifically probabilistic hypothesized labels on the unlabeled data. We provide an error bound that exploits both the local geometric properties of the underlying graph and the quality of the prior information. We also propose a framework to incorporate multiple sources of noisy information. In particular, we consider the setting of weak supervision, where our sources of information are weak labelers. We demonstrate the effectiveness of our approach on multiple benchmark weakly supervised classification tasks, showing improvements upon existing semi-supervised and weakly supervised methods.

1. INTRODUCTION

High-dimensional machine learning models require large labeled datasets for good performance and generalization. In the paradigm of semi-supervised learning, we look to overcome the bottleneck of labeled data by leveraging large amounts of unlabeled data and assumptions on how the target predictor behaves over the unlabeled samples. In this work, we focus on the classical semi-supervised approach of label propagation (LPA) (Zhu & Ghahramani, 2002; Zhou et al., 2003). This method propagates labels from labeled to unlabeled samples, under the assumption that the target predictor is smooth with respect to a graph over the samples (frequently defined by a Euclidean distance threshold or nearest neighbors). However, in practice, to satisfy this strong assumption, the graph can be highly disconnected. In these cases, LPA performs well locally on regions connected to labeled points, but has low overall coverage as it cannot propagate to points beyond these connected regions.

In practice, we also have additional side information beyond such smoothness of the target predictor. One concrete example of side information comes from the field of weakly supervised learning (WSL) (Ratner et al., 2016; 2017), which considers learning predictors from domain knowledge that takes the form of hand-engineered weak labelers. These weak labelers are heuristics that provide multiple weak labels per unlabeled sample, and the focus in WSL is to aggregate these weak labels to produce noisy pseudolabels for each unlabeled sample. In practice, weak labelers are typically not designed to be smooth with respect to a graph, even though the underlying target predictor might be. For example, weak labelers are commonly defined as hard, binary predictions, with an ability to abstain from predicting. We thus see that LPA and WSL have complementary sources of information, as smoothing via LPA can improve the quality of weak labelers. By encouraging smoothness, predictions near multiple abstentions can be made more uncertain, and abstentions can be converted into predictions by confident nearby predictions.

In this paper, we first bolster the theoretical foundations of LPA in the presence of side information. While LPA has a strong theoretical motivation of leveraging smoothness of the target predictor, there is limited theory on how accurate the propagated labels actually are. As a key contribution of this paper, we provide a "fine-grained" theory of LPA when used with any general prior on the target classes of the unlabeled samples. We provide a novel error bound for LPA, which depends on key local geometric properties of the graph, such as the underlying smoothness of the target predictor over the graph and the flow of edges from labeled points, as well as on the accuracy of our prior. Our bound provides intuition as to when LPA should prioritize propagating label information and when it should prioritize using prior information. We compare our error bound to an existing spectral bound (Belkin & Niyogi, 2004) and demonstrate that our bound is preferable in some examples.

Next, we propose a framework for incorporating multiple sources of noisy information into LPA by extending a framework from Zhu et al. (2003). We construct additional "dongle" nodes in the graph that correspond to individual noisy labels. We connect these additional nodes to unlabeled points that receive noisy predictions and perform label propagation on this new graph as usual. We study several techniques for determining the weights on these additional edges.

Finally, we focus on the specific case when our side information comes from WSL. We provide experimental results on standard weakly supervised benchmark tasks (Zhang et al., 2021) to support our theoretical claims and to compare our methods to standard LPA, other semi-supervised methods, and existing weakly supervised baselines. Our experiments demonstrate that incorporating smoothness via LPA in the standard weakly supervised pipeline leads to better performance, outperforming many existing WSL algorithms. This supports the view that there are significant benefits to combining LPA and WSL, and we believe that this intersection is fertile ground for future research.

1.1. RELATED WORK

Label propagation. Many papers have studied LPA from a theoretical standpoint. LPA has various connections to random walks, spectral clustering (Zhu et al., 2003), manifold learning (Belkin & Niyogi, 2004; Belkin et al., 2006), network generative models (Yamaguchi & Hayashi, 2017), and graph conductance (Talukdar & Cohen, 2014). Another line of research in LPA proposes using prior information at the initialization of LPA (Yamaguchi et al., 2016; Zhou et al., 2018), with applications in image segmentation (Vernaza & Chandraker, 2017), distant supervision (Bing et al., 2015), and domain adaptation (Cai et al., 2021; Wei et al., 2020). Finally, as the graph has a large impact on the performance of LPA, another line of work studies how to optimize the construction of the graph with linear-based methods (Wang & Zhang, 2007), manifold-based methods (Karasuyama & Mamitsuka, 2013), or deep learning based methods (Liu et al., 2018; 2019).

Weakly supervised learning. The field of (programmatic) weakly supervised learning provides a framework for creating and combining hand-engineered weak labelers (Ratner et al., 2016; 2017; 2019; Fu et al., 2020) to pseudolabel unlabeled data and train a downstream model. Recent advances in weakly supervised learning extend the setting to include a small set of labeled data. One recent line of work has considered constraining the space of possible pseudolabels via weak labeler accuracies (Arachie & Huang, 2019; Mazzetto et al., 2021a; b; Arachie & Huang, 2021; 2022). Other works improve the aggregation scheme (Xu et al., 2021) or the weak labelers (Awasthi et al., 2020). We note that only one method incorporates any notion of smoothness into the weakly supervised pipeline (Chen et al., 2022). That work leverages the smoothness of pretrained embeddings in clustering. While clustering and LPA have similar intuitions, they result in fundamentally different notions of smoothness. We also remark that Chen et al. (2022) do not consider the semi-supervised setting.

Semi-supervised learning. Many other methods in semi-supervised learning look to induce smoothness in a learnt model. These include consistency regularization (Bachman et al., 2014; Sajjadi et al., 2016; Samuli & Timo, 2017; Sohn et al., 2020) and co-training (Blum & Mitchell, 1998; Balcan et al., 2004; Han et al., 2018). In addition, Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Scarselli et al., 2008; Gori et al., 2005; Henaff et al., 2015) are a class of deep learning based methods that also operate over graphs. Some recent works (Huang et al., 2020; Wang & Leskovec, 2020; Dong et al., 2021) have made connections between graph neural networks and LPA. While all these methods focus on a similar goal of learning a smooth function, they do not address the weakly supervised setting.

2. PRELIMINARIES

We consider a binary classification setting where we want to learn a classifier $f^* : \mathcal{X} \to \{0, 1\}$. We observe a small set of labeled data $L = \{(x_i, y_i)\}_{i=1}^{n}$ and a much larger set of unlabeled data $U = \{x_j\}_{j=n+1}^{n+m}$. LPA relies on the assumption that nearby data points have similar labels. This is expressed in terms of smoothness with respect to an undirected graph $G = (V, E)$, with one node for each point $x \in L \cup U$ and with an adjacency matrix $W = (w_{ij})$. LPA then leverages the assumption that adjacent points in this graph have similar labels, by propagating label information from $L$ to $U$. Specifically, it learns $f : \mathcal{X} \to \mathbb{R}$ by solving the following optimization problem:

$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2 \quad \text{s.t. } f_i = y_i \text{ for } i \le n,$$

where $f \in \mathbb{R}^{n+m}$ is the prediction vector and, abusing notation, $f_i = f(x_i)$. The method generalizes to the multi-class setting by replacing $y_i$ with a one-hot encoding vector and predicting a score vector at each node. Zhu et al. (2003) provide a quick iterative method to solve this optimization problem.
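To make the propagation step concrete, the following is a minimal sketch of an iterative solver, not the reference implementation. It assumes the first n rows of W are the labeled points, and the function name `label_propagation` and the 0.5 starting value are illustrative choices. Each unlabeled score is repeatedly replaced by the weighted mean of its neighbors while labeled scores stay clamped.

```python
import numpy as np

def label_propagation(W, y_labeled, n_iter=1000, tol=1e-9):
    """Iterative LPA sketch: clamp the first n nodes to y_labeled and
    repeatedly average every other node over its neighbors."""
    W = np.asarray(W, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    n = len(y_labeled)
    f = np.full(W.shape[0], 0.5)   # uninformative start for unlabeled nodes
    f[:n] = y_labeled
    deg = W.sum(axis=1)
    has_nbrs = deg > 0             # isolated nodes can never receive a label
    for _ in range(n_iter):
        f_new = f.copy()
        f_new[has_nbrs] = (W @ f)[has_nbrs] / deg[has_nbrs]
        f_new[:n] = y_labeled      # hard constraint f_i = y_i for i <= n
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

# Chain 0 - 2 - 3 - 4 - 1 with labeled endpoints (nodes 0 and 1); node 5 is isolated.
W = np.zeros((6, 6))
for i, j in [(0, 2), (2, 3), (3, 4), (4, 1)]:
    W[i, j] = W[j, i] = 1.0
f = label_propagation(W, y_labeled=[0.0, 1.0])
```

On this toy chain the scores interpolate between the two labeled endpoints, while the isolated node keeps its starting value of 0.5. This is the coverage problem on disconnected graphs described in the introduction.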

3. LABEL PROPAGATION WITH PRIOR INFORMATION

We analyze LPA with initial noisy predictions $h : \mathcal{X} \to [0, 1]$, by solving the following objective:

$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \Big( \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2 + \mu \sum_{i=1}^{n+m} (f_i - h(x_i))^2 \Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n, \tag{1}$$

where $\mu \in \mathbb{R}$ determines how much the solution is regularized to be close to $h$. In the standard LPA, we have no prior information on the unlabeled points, which can be seen as the case when $h = 0.5$ and $\mu \to 0$. In our theory, $h$ can be any general prior.
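The role of µ can be illustrated with a small sketch that iterates the per-node update stated in Lemma 1 (Appendix B), where each unlabeled score becomes a weighted mean of its neighbors and its prior. The function name `lpa_with_prior` and the toy graph are assumptions made for illustration only.

```python
import numpy as np

def lpa_with_prior(W, y_labeled, h, mu, n_iter=2000, tol=1e-10):
    """Fixed-point iteration of the Lemma 1 update
    f_i <- (sum_j w_ij f_j + mu * h_i) / (sum_j w_ij + mu) on unlabeled nodes."""
    W = np.asarray(W, dtype=float)
    h = np.asarray(h, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    n = len(y_labeled)
    f = h.copy()
    f[:n] = y_labeled
    denom = W.sum(axis=1) + mu
    ok = denom > 0                 # with mu = 0, isolated nodes keep their prior
    for _ in range(n_iter):
        f_new = f.copy()
        f_new[ok] = (W @ f + mu * h)[ok] / denom[ok]
        f_new[:n] = y_labeled      # labeled points stay clamped
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

# One labeled node (y = 0) joined to one unlabeled node whose prior says 1.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([0.0, 1.0])
for mu in (0.0, 1.0, 1e6):
    print(mu, lpa_with_prior(W, [0.0], h, mu)[1])
```

With µ = 0 the unlabeled node copies its labeled neighbor, with µ = 1 it lands halfway between the neighbor and the prior, and for very large µ it essentially returns the prior.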

3.1. ERROR BOUND OF LPA WITH PRIOR INFORMATION

Similar to the standard LPA, there exists a closed form optimal solution of Equation 1, which is discussed in Appendix A. For the optimal solution of Equation 1, we can bound the error of a point $i$, $|f^*_i - y_i|$, by the error of its neighbors and terms corresponding to the smoothness of the true labels and the prior information accuracy; we formally state this in Appendix B (Lemma 2). Because the error on labeled points is zero, we can bound the error in terms of the distance of a point to the nearest labeled point. For a set of labeled data $L$, let $N(L)$ be the set of reachable points, i.e., points for which there is at least one path from a point in $L$. Define the set of neighbors $k$ hops away from $L$ as $N_k(L)$ (i.e., the set of points whose shortest path to a point in $L$ has length $k$). Let $l$ be the number of hops required to cover $N(L)$. Then, we have $N(L) = L \cup \bigcup_{k=1}^{l} N_k(L)$. For simplicity, we write $N_k$ for $N_k(L)$ and $N_0$ for $L$.

We now define terms that are fundamental to our error bound. First, we introduce notions of In-flow, Between-flow, and Out-flow, which represent the total weight of edges that flow into, within, and out of $N_k(L)$.

Definition 1. For a graph $G$ with an adjacency matrix $W = (w_{ij})$ and a set of $k$-hop neighbors $N_k$, we define the In-flow, Between-flow and Out-flow of $N_k$ as

$$C_{in}(k) = \sum_{i \in N_k,\, j \in N_{k-1}} w_{ij}, \qquad C_{bet}(k) = \sum_{i \in N_k,\, j \in N_k} w_{ij}, \qquad C_{out}(k) = \sum_{i \in N_k,\, j \in N_{k+1}} w_{ij}.$$

These terms are related to the notion of conductance, which measures the fraction of out-going edges from any subset of nodes. We can write the Dirichlet conductance (HaoChen et al., 2021) of a neighborhood $N_k$ as

$$\phi(N_k) = \frac{C_{in}(k) + C_{out}(k)}{C_{in}(k) + C_{bet}(k) + C_{out}(k)}.$$

Definition 2. (Ratio between Out-flow and In-flow)

$$\gamma_k = \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|}.$$

$\gamma_k$ is the ratio of the Out-flow to the (regularized) In-flow of a neighborhood (see Figure 1 for graphs with different flow). Next, we define the smoothness of $N_k$, the prior information error, and the average error.

Definition 3. (Smoothness of neighborhood) For $1 \le k \le l$, we define the smoothness of the true labels of points in $N_k$ with respect to the graph as $s_k = \sum_{i \in N_k} \sum_{j} w_{ij} |y_j - y_i|$.

Definition 4. (Prior information error) For $1 \le k \le l$, let the average error of the prior in $N_k$ be $\alpha_k = \frac{1}{|N_k|} \sum_{i \in N_k} |h_i - y_i|$.

Definition 5. (Average error) We define the average error on $N_k$ as $E_k = \frac{1}{|N_k|} \sum_{i \in N_k} |f^*_i - y_i|$.

Theorem 1. (Informal version of Theorem 3) Let $f^*$ be the optimal solution of the optimization problem of Equation 1. Under Assumption 1, which assumes that the average error of the points that have "Out" connections from neighborhood $N_k$ is within a constant factor of the average error in $N_k$ (refer to Appendix B), the error of $f^*$ in each neighborhood $N_k$ satisfies

$$E_k \le O\Big( \sum_{i=1}^{k} d_i \Big), \qquad \text{where } d_k = \sum_{i=k}^{l} c_i \prod_{j=k}^{i-1} \gamma_j, \qquad c_k = \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|}.$$

Proof. (Sketch) The key idea of our proof is to upper bound each $E_i$ for $i \in \{1, \ldots, l\}$ by exploiting the insight that we can bound the average error of a set $N_i$ (points that are $i$ hops away from labeled points) by the average errors of its neighbors $N_{i-1}$ and $N_{i+1}$, using Lemma 2. We first bound $E_1$ with $E_0 = 0$ and $E_2$, then we bound $E_2$ with $E_1$ and $E_3$, and so on. See Appendix B for the full version of our proof.

$c_k$ is a combination of the smoothness $s_k$ and the prior accuracy $\alpha_k$, and $\mu$ controls the trade-off between using information from the graph and using the initialization. When $\mu = 0$, we recover the standard LPA without any prior. On the other hand, $\mu \to \infty$ is equivalent to only using the initial predictions. $d_k$ is a linear combination of $c_i$ for $k \le i \le l$, where the coefficient of each $c_i$ is given by $\prod_{j=k}^{i-1} \gamma_j$, representing the influence from $N_i$. When $\gamma_j < 1$ ("In" > "Out"), the influence is exponentially small, while when $\gamma_j > 1$ ("Out" > "In") the influence can be exponentially large. This aligns with our intuition that when we have more "In" than "Out", we will have a better guarantee. We remark that if $c_k = 0$, then regardless of the product $\prod_{j} \gamma_j$ multiplying it, $c_k$ makes no contribution to the bound.

To tighten this upper bound, we have to reduce both $c_k$ and $\gamma_k$. In doing so, the value of $\mu$ is important; a larger value of $\mu$ reduces both $c_k$ and $\gamma_k$ by increasing their denominators. Thus, given similar levels of smoothness and prior accuracy ($s_k / C_{in}(k) \approx \alpha_k$), it is better to use a larger value of $\mu$; that is, we should rely more on the prior information. The number of hops $k$ from labeled points $L$ also plays a key role in the bound. The upper bound on $E_k$ is given by a linear combination of $k$ terms, so points that are closer to $L$ have a smaller $k$ and a better guarantee. This encourages us to have a more connected graph, requiring fewer hops to reach all points. However, adding noisy edges may potentially decrease the smoothness of the graph.
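The quantities entering the bound can be computed directly on a small graph. The sketch below is illustrative only: the helper names `hop_layers` and `bound_terms` are hypothetical, the true labels y are assumed known (which is only possible on toy data), and the constant hidden in the O(·) of Theorem 1 is ignored.

```python
import numpy as np

def hop_layers(W, labeled):
    """Breadth-first layers N_0 = L, N_1, ..., N_l of nodes reachable from L."""
    dist = np.full(W.shape[0], -1)
    dist[list(labeled)] = 0
    layers, frontier = [sorted(labeled)], sorted(labeled)
    while frontier:
        nxt = []
        for i in frontier:
            for j in np.nonzero(W[i])[0]:
                if dist[j] < 0:
                    dist[j] = dist[i] + 1
                    nxt.append(int(j))
        if nxt:
            layers.append(sorted(nxt))
        frontier = nxt
    return layers

def bound_terms(W, labeled, y, h, mu):
    """Return (gamma, c, d, cumulative sums of d) for k = 1..l, stored 0-based."""
    W, y, h = (np.asarray(a, dtype=float) for a in (W, y, h))
    layers = hop_layers(W, labeled)
    gamma, c = [], []
    for k in range(1, len(layers)):
        Nk, prev = layers[k], layers[k - 1]
        nxt = layers[k + 1] if k + 1 < len(layers) else []
        c_in = W[np.ix_(Nk, prev)].sum()                    # In-flow C_in(k)
        c_out = W[np.ix_(Nk, nxt)].sum() if nxt else 0.0    # Out-flow C_out(k)
        s_k = sum(W[i] @ np.abs(y - y[i]) for i in Nk)      # smoothness s_k
        alpha_k = np.mean(np.abs(h[Nk] - y[Nk]))            # prior error alpha_k
        denom = c_in + mu * len(Nk)
        gamma.append(c_out / denom)
        c.append((s_k + mu * len(Nk) * alpha_k) / denom)
    l = len(c)
    # d_k = sum_{i >= k} c_i * prod_{j = k}^{i-1} gamma_j (empty product is 1)
    d = [sum(c[i] * np.prod(gamma[k:i]) for i in range(k, l)) for k in range(l)]
    return np.array(gamma), np.array(c), np.array(d), np.cumsum(d)

# Path 0 - 1 - 2 - 3 with node 0 labeled and a label change between nodes 1 and 2.
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
y, h = [0, 0, 1, 1], [0, 0.5, 0.5, 0.5]
gamma0, c0, d0, cum0 = bound_terms(W, [0], y, h, mu=0.0)
gamma1, c1, d1, cum1 = bound_terms(W, [0], y, h, mu=1.0)
```

On this path, moving from µ = 0 to µ = 1 halves the γ terms and shrinks d_1, matching the observation above that a larger µ reduces both c_k and γ_k when smoothness and prior accuracy are comparable.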

3.2. COMPARISON WITH PRIOR (SPECTRAL) BOUNDS

We compare our bound with an existing bound that relies on spectral analysis (Belkin & Niyogi, 2004). This bound is for LPA with a soft constraint, given by the problem

$$\min_{f \in \mathbb{R}^{n+m}} \; \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2 + \eta \sum_{i \le n} (f_i - y_i)^2. \tag{2}$$

We define the empirical error and the generalization error as

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2, \qquad R(f) = \frac{1}{n+m} \sum_{i=1}^{n+m} (f_i - y_i)^2.$$

As we do not have a hard constraint ($f_i = y_i$ for $i \le n$), the empirical error is not necessarily zero.

Theorem 2. (Generalization performance of graph regularization (simplified version)) Let $f$ be the optimal solution of Equation 2, let $n \ge 4$ be the number of randomly sampled labeled points from some graph $G$, and let $\lambda_1$ be the second smallest eigenvalue of the Laplacian matrix of $G$. With probability $1 - \delta$, we have

$$|R_n(f) - R(f)| \le \beta + \sqrt{\frac{2 \log(2/\delta)}{n}}\,(n\beta + 4), \qquad \text{where } \beta = \frac{3 \eta^2 \sqrt{n}}{(\lambda_1 - \eta)^2} + \frac{4 \eta}{\lambda_1 - \eta}.$$

The original version of this bound, as in Belkin & Niyogi (2004), is in Appendix C. We consider graphs $G_1, G_2, G_3$ in Figure 2 to compare the bounds. Here $v(G)$ refers to the value of parameter $v$ for a graph $G$.

First, we note that the spectral bound controls the difference between the empirical error and the generalization error, while our bound is for the generalization error itself. The spectral bound assumes that we have randomly sampled initial labeled points, and thus the bound only depends on the number of labeled points. For example, $G_2$ and $G_3$ have the same underlying graph and the same number of labeled points, so they have the same spectral bound on the generalization error, which relies on the empirical error. In contrast, our bound takes the position of labeled points into account to provide an explicit explanation of why LPA performs better on $G_2$ than on $G_3$. We can see this since $G_2$ is smoother than $G_3$: $c_2(G_2) = 0$, while $c_2(G_3) = s_2(G_3) / C_{in}(2)(G_3) = 8/8 = 1$.

The spectral bound depends on the second smallest eigenvalue $\lambda_1$. If the graph is not well clustered, $\lambda_1$ will be small. For example, $\lambda_1(G_1) = 2$ and $\lambda_1(G_2) = 0.53$. Belkin & Niyogi (2004) suggest that when $\lambda_1$ is small, we should cut the graph in two, using the eigenvector corresponding to $\lambda_1$, and optimize the objective separately. Our bound works for any graph; in fact, our bound is also tight on $G_2$, where LPA achieves zero error (as $c_1(G_2) = c_2(G_2) = 0$). Also, as $\eta \to \infty$, the objective of Equation 2 is equivalent to Equation 1. Then, the spectral bound takes on the value $\beta \to 3\sqrt{n} - 4$, which implies that it no longer depends on the geometry of the graph ($\lambda_1$). Finally, the spectral bound does not use any prior information, while our bound captures the interplay between the quality of the graph and the quality of the prior information.
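For readers who want to evaluate the spectral quantities numerically, here is a short sketch. It assumes the unnormalized Laplacian D − W and the square-root form of the concentration term in the simplified Theorem 2, and the function names `fiedler_value`, `beta` and `spectral_gap_bound` are illustrative.

```python
import numpy as np

def fiedler_value(W):
    """Second smallest eigenvalue of the unnormalized Laplacian D - W."""
    W = np.asarray(W, dtype=float)
    lap = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigvalsh(lap)[1]   # eigvalsh returns ascending eigenvalues

def beta(eta, n, lam1):
    """Stability term of the simplified Theorem 2."""
    return 3 * eta**2 * np.sqrt(n) / (lam1 - eta) ** 2 + 4 * eta / (lam1 - eta)

def spectral_gap_bound(eta, n, lam1, delta):
    """Bound on |R_n(f) - R(f)| that holds with probability 1 - delta."""
    b = beta(eta, n, lam1)
    return b + np.sqrt(2 * np.log(2 / delta) / n) * (n * b + 4)

# Complete graph on 4 nodes: all nonzero Laplacian eigenvalues equal 4.
lam1 = fiedler_value(np.ones((4, 4)) - np.eye(4))
# For very large eta, beta approaches 3 * sqrt(n) - 4 regardless of lam1.
limit_gap = beta(1e9, 4, lam1) - (3 * np.sqrt(4) - 4)
```

The second computation illustrates the remark above: in the hard-constraint limit the stability term stops depending on the graph.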

4. LABEL PROPAGATION WITH MULTIPLE SOURCES OF INFORMATION

We now consider the setting where we observe multiple sources of prior information and provide a framework to incorporate them into LPA. Assume that we have multiple initial noisy predictions $h_i : \mathcal{X} \to [0, 1]$ for $i = 1, 2, \ldots, k$. A natural extension of the LPA objective is given by

$$\sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2 + \sum_{i=1}^{n+m} \sum_{j=1}^{k} (f_i - h_j(x_i))^2 \, \alpha_j(x_i) \tag{3}$$

such that $f_i = y_i$ for $i \le n$. The first term encourages our prediction to be smooth with respect to the graph, while the second term encourages our prediction to also be close to the initial predictions. The function $\alpha_j : \mathcal{X} \to [0, \infty)$, which we need to learn, controls how close we want our final prediction to be to each initial prediction $h_j$. By augmenting the graph $G$ with dongle nodes (Appendix D), we can turn this into a standard label propagation problem, for which we have an efficient iterative solver. Given a fixed $\alpha_j$ for each $j = 1, \ldots, k$, we can also show that there exists an initial prediction $h$ such that the solution of Equation 3 is equivalent to a solution of LPA with the single initial prediction $h$, for which our analysis applies (Appendix E).

A key question for this framework is how to choose $\alpha_j$. In the ideal setting, we set $\alpha_j(x_i) = 0$ when $h_j$ makes an incorrect prediction for point $x_i$ and $\alpha_j(x_i) = 1$ when $h_j$ makes a correct prediction, i.e., $\alpha_j(x_i) = \mathbb{1}[\mathbb{1}[h_j(x_i) > 0.5] = y_i]$. However, this is not applicable in a practical setting, as knowing whether $h_j$ is correct at a point $x_i$ is equivalent to knowing the true label $y_i$ of that point. We now investigate different approaches to select the function $\alpha_j$.
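One plausible reading of the dongle construction is sketched below; the exact construction is given in Appendix D, so the details here are assumptions. Every pair (source j, unlabeled point i) with positive weight gets its own extra node that is clamped to h_j(x_i) and attached to i by an edge of weight α_j(x_i), and standard harmonic label propagation is then solved on the augmented graph. The function name `lpa_with_dongles` and the array layout (H and A of shape k × (n+m)) are hypothetical.

```python
import numpy as np

def lpa_with_dongles(W, y_labeled, H, A):
    """Label propagation on a graph augmented with clamped dongle nodes.
    H[j, i] is source j's prediction at point i; A[j, i] >= 0 is the edge
    weight of its dongle (0 means no dongle, e.g. an abstention)."""
    W = np.asarray(W, dtype=float)
    H, A = np.asarray(H, dtype=float), np.asarray(A, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    n, N = len(y_labeled), W.shape[0]
    pairs = [(j, i) for j in range(H.shape[0]) for i in range(n, N) if A[j, i] > 0]
    M = N + len(pairs)
    W_aug = np.zeros((M, M))
    W_aug[:N, :N] = W
    clamped, values = list(range(n)), list(y_labeled)
    for t, (j, i) in enumerate(pairs):
        d = N + t                              # index of the new dongle node
        W_aug[i, d] = W_aug[d, i] = A[j, i]
        clamped.append(d)
        values.append(H[j, i])
    free = list(range(n, N))
    lap = np.diag(W_aug.sum(axis=1)) - W_aug
    # Harmonic solution: L_ff f_f = W_fc f_c. This assumes every unlabeled point
    # is connected to at least one clamped node; otherwise the system is singular.
    rhs = W_aug[np.ix_(free, clamped)] @ np.array(values)
    f_free = np.linalg.solve(lap[np.ix_(free, free)], rhs)
    return np.concatenate([y_labeled, f_free])

# One labeled node (y = 0), one unlabeled node, two sources voting 1 and 0.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.array([[0.0, 1.0], [0.0, 0.0]])
A = np.array([[0.0, 1.0], [0.0, 1.0]])
f = lpa_with_dongles(W, [0.0], H, A)
```

In this toy case the unlabeled score is the weighted mean of its labeled neighbor and its two dongles, so raising a source's weight pulls the score toward that source's prediction.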

4.1. ESTIMATED ACCURACY

Although we do not know whether $h_j$ will make a correct prediction at each point $x_i$, we can still approximate its accuracy over the entire dataset. We can use techniques from the crowd-sourcing or weak supervision literature to approximate the accuracy of each noisy labeler. Then, we can set $\alpha_j$ to the estimated accuracy of $h_j$, $\alpha_j(x_i) = P(\mathbb{1}[h_j(x) > 0.5] = y) = p_j$. We also consider setting $\alpha_j = \ln\big(\frac{p_j}{1 - p_j}\big)$, as in the boosting literature (in Appendix F.1).
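As a simple stand-in for the accuracy estimators mentioned above, the sketch below estimates p_j empirically on the small labeled set and converts it to either weighting. This is an assumption made for illustration (the experiments use Snorkel MeTaL estimates instead), and the function name `accuracy_weights` and the clipping constant are hypothetical.

```python
import numpy as np

def accuracy_weights(H_labeled, y_labeled, log_odds=False, eps=1e-6):
    """H_labeled has shape (k, n): soft predictions of k sources on n labeled points.
    Returns p_j, or ln(p_j / (1 - p_j)) when log_odds is True."""
    H_labeled = np.asarray(H_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    hard = (H_labeled > 0.5).astype(int)         # threshold soft predictions
    p = (hard == y_labeled).mean(axis=1)         # empirical accuracy per source
    if log_odds:
        p = np.clip(p, eps, 1 - eps)             # keep the logarithm finite
        return np.log(p / (1 - p))
    return p

H = [[0.9, 0.2, 0.8, 0.1],    # right on 3 of 4 labeled points
     [0.1, 0.9, 0.9, 0.9]]    # right on 2 of 4 labeled points
y = [1, 0, 1, 1]
p = accuracy_weights(H, y)
w = accuracy_weights(H, y, log_odds=True)
```

Under the log-odds weighting a source at chance level receives weight 0, so it no longer influences the propagated labels, whereas the plain accuracy weighting still gives it weight 0.5.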

4.2. PROBABILISTIC APPROACH

We can also consider a probabilistic approach to select the function $\alpha_j$. Let each $h_j$ be sampled from a Gaussian distribution $h_j \sim \mathcal{N}(y, \sigma_j(x)^2)$ and let $f$ follow the Gaussian field as in Zhu & Ghahramani (2002), where

$$\rho_\beta(f) \propto \exp(-\beta E(f)), \qquad E(f) = \frac{1}{2} \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2.$$

Then, the log-likelihood is given by

$$\ell(f) = \text{constant} - \sum_{i=1}^{n+m} \sum_{j=1}^{k} \frac{1}{2 \sigma_j(x_i)^2} (h_j(x_i) - f_i)^2 - \frac{\beta}{2} \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2.$$

This resembles the objective of Equation 3 and suggests that we should set our function $\alpha_j$ as $\alpha_j(x_i) = \frac{1}{\sigma_j(x_i)^2}$, where $\sigma_j(x_i)^2$ is the variance of $h_j$ at point $x_i$. With access to a small set of labeled data points, we can estimate $\sigma_j(x_i)$ through heteroscedastic regression (Wasserman, 2006), which is further discussed in Appendix F.2. We note that this function $\alpha_j$ changes over values of $x$ as it is computed through regression, while the accuracy-based weighting has a constant value for $\alpha_j$.
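The inverse-variance weighting can be illustrated with a deliberately simplified sketch. It assumes a single constant variance per source, estimated on the labeled set, rather than the heteroscedastic regression of Appendix F.2, so α_j does not vary with x here. The names `precision_weights` and `combine`, and the variance floor, are hypothetical.

```python
import numpy as np

def precision_weights(H_labeled, y_labeled, floor=1e-6):
    """alpha_j = 1 / sigma_j^2 with sigma_j^2 the mean squared error of source j
    on the labeled points (constant-variance simplification)."""
    H = np.asarray(H_labeled, dtype=float)
    y = np.asarray(y_labeled, dtype=float)
    var = ((H - y) ** 2).mean(axis=1)
    return 1.0 / np.maximum(var, floor)     # floor avoids division by zero

def combine(h_values, alpha):
    """Minimizer of sum_j alpha_j (f - h_j)^2: the precision-weighted mean."""
    h_values, alpha = np.asarray(h_values, float), np.asarray(alpha, float)
    return float(alpha @ h_values / alpha.sum())

H = [[0.5, 0.5],     # uninformative source: variance 0.25
     [0.9, 0.1]]     # sharp, accurate source: variance 0.01
alpha = precision_weights(H, [1, 0])
f_new = combine([0.5, 0.9], alpha)   # prediction at a new point, ignoring the graph term
```

Ignoring the graph term, the combined prediction sits much closer to the low-variance source, which is the behavior the Gaussian model above suggests.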

5. EXPERIMENTS

We connect LPA and the field of weak supervision by using weak labelers as our source of prior information. Formally, a set of weak labelers is given by $\lambda = \{\lambda_1, \ldots, \lambda_k\}$, where each $\lambda_i : \mathcal{X} \to \{0, 1, \emptyset\}$ and $\emptyset$ denotes an abstention. Here, we consider LPA with a single source of prior information $h(x) = h_\lambda(x)$, an aggregation of the weak labelers, which we refer to as LPA + WL. For this method, we use Snorkel MeTaL (Ratner et al., 2019) as our aggregation scheme. We also consider our extensions of LPA with multiple sources of prior information, where we set $h_i = \lambda_i$ for each weak labeler. We refer to our extensions of LPA that incorporate weak labelers through dongle nodes as LPAD (A) and LPAD (P), where the last letter denotes the technique used to estimate the edge weights of these dongle nodes (the accuracy approach and the probabilistic approach, respectively). For methods that require accuracies, we use accuracies estimated via Snorkel MeTaL. We note that LPAD (A) and LPA + WL both use the Snorkel estimated accuracy; we discuss their difference in Appendix E.1. For LPA + WL, we set $\mu = 1$. Further experimental details for our methods and the baselines are in Appendix G.

We compare our approaches to existing weak supervision methods, standard LPA, and other semi-supervised baselines on 4 binary classification datasets from the WRENCH benchmark (Zhang et al., 2021). The features for the text and image datasets are extracted from BERT (Kenton & Toutanova, 2019) and ResNet (He et al., 2016b), respectively. On each dataset, we balance the training data to have equal class proportions. To generate a small set of labeled data, we randomly sample $n = 100$ points from the training data. The remaining data serves as our unlabeled training data. For all graph-based methods, we construct a graph $G$ with average degree $t$, which is a hyperparameter of our method, and with unit edge weights. More information about $t$ and the other hyperparameters of all approaches is in Appendix G.2.
Code to replicate our experiments can be found here (see footnote 0).

Table caption: [...] pseudolabeled training data, when averaged over 5 seeds. We highlight the best performing method in red and the second best performing method in blue.

5.1. BASELINES

We compare our methods against various semi-supervised and existing weakly supervised learning approaches. In methods denoted with (+L), we use ground truth labels instead of pseudolabels on the 100 labeled points. In all these approaches, we train an inductive endmodel on the pseudolabeled training data, as is standard in the WSL literature.

Label Propagation (LPA): The standard label propagation baseline (Zhu & Ghahramani, 2002) on graph G. This does not take into account weak labeler information.

Graph Convolutional Network (GCN): A standard graph convolutional network (Kipf & Welling, 2017). This method also does not take into account weak labeler information.

Snorkel + L: A weakly supervised learning aggregation scheme, Snorkel MeTaL (Ratner et al., 2017; 2019), which produces pseudolabels through a graphical model.

FlyingSquid + L (FS + L): Another weakly supervised method that estimates parameters of a graphical model via a triplet method (Fu et al., 2020).

Constrained Label Learning (CLL): A method that produces a labeling contained within a feasible space constrained by the error rates of weak labelers (Arachie & Huang, 2021). We use the small set of labeled data to estimate the error rates of the weak labelers.

Liger + L: A method that extends weak labelers using the smoothness of pretrained models and develops cluster-level aggregations (Chen et al., 2022). This method uses FS (Fu et al., 2020) as a base aggregation scheme.

5.2. RESULTS

We report the test accuracy of an endmodel trained on pseudolabels generated by the baselines and our methods in Table 1. Our results demonstrate that incorporating weak labels into label propagation improves upon the performance of LPA across all datasets (LPA + WL > LPA). In addition, on almost every dataset, using LPA to incorporate smoothness improves upon the prior aggregation of weak labels (LPA + WL > Snorkel + L). Our methods also outperform other weakly supervised aggregation methods in most cases. The best performing baselines are Liger and CLL, each of which only marginally outperforms some of our methods on one dataset. We note that there is no clear best weighting scheme between (A) and (P), although both outperform most baselines on almost all tasks. In addition, one of our methods is the best performing approach on all datasets.

We also report the accuracy of standard LPA and our methods on the labeled and unlabeled training data in [...] approach on abstained datapoints as 50% (i.e., random guessing on binary data) on all abstained points, as the method has no information on these points. Results for additional datasets are deferred to Appendix 3; on these datasets, the weak labeler coverage is almost 100% of the data, so coverage is roughly the same across our methods and the baselines. We observe that coverage increases drastically when using our weakly supervised prior (Table 2) compared to standard LPA, and slightly compared to Snorkel. We also observe that our method improves upon the base aggregation method of Snorkel on almost all datasets, improving both overall accuracy and coverage due to the propagation of information to nearby points. We note that Liger has much higher coverage on YouTube and SMS, although the accuracy on this larger set is much worse (see Table 3 in the Appendix).

6. DISCUSSION

We provide a novel theoretical perspective on LPA that takes advantage of useful prior information. Our bound differs significantly from existing spectral bounds, and provides insight into how to best incorporate priors into LPA. We note that our analysis is general and works with any initialization h. We also provide a framework to handle multiple sources of side information and empirical results for the setting of weak supervision. Further work can incorporate other types of prior information into LPA, such as in the recent line of work on learning with past predictions (Mitzenmacher & Vassilvitskii, 2021; Khodak et al., 2022). In addition, our connections of LPA with weakly supervised learning illustrate (both theoretically and empirically) that these methods benefit each other. As a whole, our results support adding smoothness to the standard WSL pipeline and can encourage further connections between semi-supervised learning algorithms and WSL.

We note a few limitations of our method. Our bound depends on several parameters, namely the smoothness s_k and the prior information accuracy α_k; in general, we may need to approximate these values through labeled data. How to do this effectively remains an open question. In addition, we make a uniformity assumption that the average error of points with "Out" connections from N_k is within a constant factor of the average error in N_k. Relaxing this assumption is also an open question.

We remark that our approach bridges the gap between classical label propagation and modern deep learning by incorporating information from pretrained models to construct our graph G. Since we construct G through Euclidean distance, our work uses notions of smoothness in the learnt embeddings, which is also noted in Chen et al. (2022). As we gain access to more powerful pretrained models, our approach will also benefit through a better graph G.
This method also provides a natural framework to combine information from large pretrained models (via our graph G) and rules provided by domain experts (through our prior predictions h 1 , . . . , h k ).

A CLOSED FORM SOLUTION OF LPA

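We provide a closed form solution of the following optimization problem:

$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \Big( \sum_{i,j} w_{ij} (f_i - f_j)^2 + \mu \|f - h\|_2^2 \Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n.$$

Here, abusing notation, $h$ denotes the vector $(h(x_1), \ldots, h(x_{n+m}))$ and $h_i = h(x_i)$. For simplicity we write $i \in L$ for $1 \le i \le n$ and $i \in U$ for $n + 1 \le i \le n + m$. First, note that

$$\frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2 = \frac{1}{2} \Big( \sum_{i \in L} \sum_{j \in L} w_{ij} (f_i - f_j)^2 + 2 \sum_{i \in L} \sum_{j \in U} w_{ij} (f_i - f_j)^2 + \sum_{i \in U} \sum_{j \in U} w_{ij} (f_i - f_j)^2 \Big).$$

The first term is a constant, as $f_i = y_i$ for $i \in L$. Let $f_L \in \mathbb{R}^n$ be the column vector with entries $f_i$ for $i \in L$ and $f_U \in \mathbb{R}^m$ the column vector with entries $f_j$ for $j \in U$. We sometimes write $w_{i,j}$ for $w_{ij}$. For the second term we have

$$\sum_{i \in L} \sum_{j \in U} w_{ij} (f_i - f_j)^2 = \sum_{i \in L} \sum_{j \in U} w_{ij} (f_i^2 - 2 f_i f_j + f_j^2) = \sum_{i \in L} \Big( \sum_{j \in U} w_{ij} \Big) f_i^2 + \sum_{j \in U} \Big( \sum_{i \in L} w_{ij} \Big) f_j^2 - 2 \sum_{i \in L} \sum_{j \in U} f_i w_{ij} f_j = \text{constant} + f_U^T D_{UL} f_U - 2 f_L^T W_{LU} f_U,$$

where $D_{UL} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $(D_{UL})_{jj} = \sum_{i \in L} w_{i,j+n}$ and $W_{LU} \in \mathbb{R}^{n \times m}$ is a matrix with entries $(W_{LU})_{ij} = w_{i,j+n}$. For the third term, we have

$$\frac{1}{2} \sum_{i \in U} \sum_{j \in U} w_{ij} (f_i - f_j)^2 = \frac{1}{2} \sum_{i \in U} \sum_{j \in U} w_{ij} (f_i^2 - 2 f_i f_j + f_j^2) = \sum_{i \in U} \Big( \sum_{j \in U} w_{ij} \Big) f_i^2 - \sum_{i \in U} \sum_{j \in U} f_i w_{ij} f_j = f_U^T D_{UU} f_U - f_U^T W_{UU} f_U,$$

where $D_{UU} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $(D_{UU})_{jj} = \sum_{i \in U} w_{i,j+n}$ and $W_{UU} \in \mathbb{R}^{m \times m}$ is a matrix with entries $(W_{UU})_{ij} = w_{i+n,j+n}$. Therefore, the overall objective is given by

$$\min_{f_U \in \mathbb{R}^m} \; \text{constant} + f_U^T (D_{UL} + D_{UU} - W_{UU}) f_U - 2 f_L^T W_{LU} f_U + \frac{\mu}{2} \|f - h\|_2^2.$$

Differentiating with respect to $f_U$ and setting the result equal to 0 gives $2 (D_{UL} + D_{UU} - W_{UU}) f_U - 2 W_{LU}^T f_L + \mu (f_U - h_U) = 0$. Rescaling $\mu$ by a constant factor of 2 for notational convenience, this reads

$$2 (D_{UL} + D_{UU} - W_{UU}) f_U - 2 W_{LU}^T f_L + 2 \mu (f_U - h_U) = 0,$$

$$f_U = (D_{UL} + D_{UU} - W_{UU} + \mu I)^{-1} (\mu h_U + W_{LU}^T f_L),$$

where $h_U \in \mathbb{R}^m$ with $(h_U)_j = h_{j+n}$ and $I$ is the $m \times m$ identity matrix.

We can also extend this to the case when $\mu \in \mathbb{R}^{m+n}$, where we have a different value of $\mu_i$ for each $i$. The optimization objective is given by

$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \Big( \sum_{i,j} w_{ij} (f_i - f_j)^2 + \sum_{i} \mu_i (f_i - h_i)^2 \Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n.$$

We can write the regularization term for $U$ as $\sum_{i \in U} \mu_i (f_i - h_i)^2 = (f_U - h_U)^T D_\mu (f_U - h_U)$, where $D_\mu \in \mathbb{R}^{m \times m}$ is a diagonal matrix with entries $(D_\mu)_{jj} = \mu_{j+n}$. With the same rescaling of the $\mu_i$ as above, we can write the optimization objective as

$$\min_{f_U \in \mathbb{R}^m} \; \text{constant} + f_U^T (D_{UL} + D_{UU} - W_{UU}) f_U - 2 f_L^T W_{LU} f_U + (f_U - h_U)^T D_\mu (f_U - h_U).$$

Differentiating with respect to $f_U$ and setting the result equal to 0, we have

$$2 (D_{UL} + D_{UU} - W_{UU}) f_U - 2 W_{LU}^T f_L + 2 D_\mu (f_U - h_U) = 0,$$

$$f_U = (D_{UL} + D_{UU} + D_\mu - W_{UU})^{-1} (D_\mu h_U + W_{LU}^T f_L).$$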
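The closed form above translates into a few lines of linear algebra. The sketch below follows the final formula as stated, assumes the labeled points occupy the first n indices, and accepts either a scalar µ or one µ_i per node; the function name `lpa_closed_form` is illustrative.

```python
import numpy as np

def lpa_closed_form(W, y_labeled, h, mu):
    """f_U = (D_UL + D_UU + D_mu - W_UU)^{-1} (D_mu h_U + W_LU^T f_L)."""
    W = np.asarray(W, dtype=float)
    h = np.asarray(h, dtype=float)
    f_L = np.asarray(y_labeled, dtype=float)
    n = len(f_L)
    # A scalar mu is broadcast to every node; a vector mu is used entrywise.
    mu_U = np.broadcast_to(np.asarray(mu, dtype=float), h.shape)[n:]
    W_LU, W_UU = W[:n, n:], W[n:, n:]
    D_UL = np.diag(W_LU.sum(axis=0))    # weight from labeled nodes into each unlabeled node
    D_UU = np.diag(W_UU.sum(axis=0))    # weight among unlabeled nodes
    D_mu = np.diag(mu_U)
    A = D_UL + D_UU + D_mu - W_UU
    b = D_mu @ h[n:] + W_LU.T @ f_L
    return np.concatenate([f_L, np.linalg.solve(A, b)])

# Small weighted graph with one labeled node (index 0, y = 1).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 2],
              [0, 1, 2, 0]], dtype=float)
h = np.array([1.0, 0.2, 0.6, 0.9])
f = lpa_closed_form(W, [1.0], h, mu=0.7)
```

A quick sanity check is that the returned vector satisfies the per-node relation of Lemma 1 at every unlabeled node. Using `np.linalg.solve` rather than forming the inverse explicitly is a standard numerical choice, not something prescribed by the derivation.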

B THEORETICAL RESULTS

First, we analyze the closed form solution of the LPA in Lemma 1. Lemma 1. Let f * be the optimal solution of the optimization problem of Equation 1 then for n + 1 ≤ i ≤ n + m, f * i = j w ij f * j + µh i j w ij + µ Proof. For each i ≤ n, we must have f * i = y i to satisfy the hard constraints. For each n + 1 ≤ i ≤ n + m, we differentiate the objective with respect to f i and set equal to 0, resulting in j w ij (f i -f j ) + µ(f i -h i ) = 0. Rearranging this and setting f j = f * j , we have the lemma. Next, we will analyze the error |f * i -y i |, which is the difference between the optimal solution and the true label. Note that |f * i -y i | < 0.5 implies that we have a correct soft label f * i . Lemma 2. Let f * be the optimal solution of the optimization problem of Equation 1. Then, for n + 1 ≤ i ≤ n + m, we have |f * i -y i | ≤ j w ij |f * j -y j | + j w ij |y j -y i | + µ|h i -y i | j w ij + µ . Proof. From lemma 1, |f * i -y i | = | j w ij f * j + µh i j w ij + µ -y i | = | j w ij (f * j -y i ) + µ(h i -y i ) j w ij + µ | = | j w ij (f * j -y j ) + j w ij (y j -y i ) + µ(h i -y i ) j w ij + µ | ≤ j w ij |f * j -y j | + j w ij |y j -y i | + µ|h i -y i | j w ij + µ Lemma 2 says that we can bound the error of point i, |f * i -y i | by the error of its neighbors |f * j -y j | and terms corresponding to the smoothness of the true labels and the prior information accuracy. Because we know that the error on labeled points are zero, this lemma motivates us to bound the error in term of the distance of our points from the labeled points. Next, in addition to the average error defined in Definition 5, we define the In-error, Between-error and Out-error for each N k . Definition 6. 
For a graph $G$ with an adjacency matrix $W = (w_{ij})$, a set of $k$-hop neighbors $N_k$ and a prediction $f \in \mathbb{R}^{n+m}$, we define the In-error, Between-error and Out-error of $N_k$ as
\[ \mathrm{Err}_{in}(f, y, k) = \frac{\sum_{i \in N_k, j \in N_{k-1}} w_{ij} |f_i - y_i|}{C_{in}(k)}, \quad \mathrm{Err}_{bet}(f, y, k) = \frac{\sum_{i \in N_k, j \in N_k} w_{ij} |f_i - y_i|}{C_{bet}(k)}, \quad \mathrm{Err}_{out}(f, y, k) = \frac{\sum_{i \in N_k, j \in N_{k+1}} w_{ij} |f_i - y_i|}{C_{out}(k)}. \]
For simplicity, we write $E_{in}(k)$, $E_{bet}(k)$, $E_{out}(k)$ for $\mathrm{Err}_{in}(f^*, y, k)$, $\mathrm{Err}_{bet}(f^*, y, k)$, $\mathrm{Err}_{out}(f^*, y, k)$ respectively. We will use Lemma 2 to derive a relationship between the errors in $N_k$ and its neighbors. We make use of the fact that the Out-flow of $N_k$ is the same as the In-flow of $N_{k+1}$.

Lemma 3. For $0 \le k \le l - 1$,
\[ C_{out}(k) = C_{in}(k + 1). \]

Lemma 4. (Error difference inequality) For $1 \le k \le l - 1$, we have
\[ C_{in}(k) (E_{in}(k) - E_{out}(k - 1)) + \sum_{i \in N_k} \mu |f_i^* - y_i| \le C_{out}(k) (E_{in}(k + 1) - E_{out}(k)) + s_k + \mu |N_k| \alpha_k, \]
where $s_k$ is the smoothness of the true labels and $\alpha_k$ is the prior information error over $N_k$.

Proof. From Lemma 2, we have
\[ \sum_j w_{ij} |f_i^* - y_i| + \mu |f_i^* - y_i| \le \sum_j w_{ij} |f_j^* - y_j| + \sum_j w_{ij} |y_j - y_i| + \mu |h_i - y_i|. \]
We take a summation over $i \in N_k$,
\[ \sum_{i \in N_k} \Big( \sum_j w_{ij} |f_i^* - y_i| + \mu |f_i^* - y_i| \Big) \le \sum_{i \in N_k} \Big( \sum_j w_{ij} |f_j^* - y_j| + \sum_j w_{ij} |y_j - y_i| + \mu |h_i - y_i| \Big). \]
From the definition of the In-error, Between-error, Out-error, smoothness $s_k$, and weak label error $\alpha_k$, we have
\begin{align*}
\mathrm{LHS} &= C_{in}(k) E_{in}(k) + C_{bet}(k) E_{bet}(k) + C_{out}(k) E_{out}(k) + \sum_{i \in N_k} \mu |f_i^* - y_i|, \\
\mathrm{RHS} &= C_{out}(k - 1) E_{out}(k - 1) + C_{bet}(k) E_{bet}(k) + C_{in}(k + 1) E_{in}(k + 1) + s_k + \mu |N_k| \alpha_k.
\end{align*}
From Lemma 3, we know that $C_{out}(k - 1) = C_{in}(k)$ and $C_{out}(k) = C_{in}(k + 1)$, so we can rearrange the inequality as
\[ C_{in}(k) (E_{in}(k) - E_{out}(k - 1)) + \sum_{i \in N_k} \mu |f_i^* - y_i| \le C_{out}(k) (E_{in}(k + 1) - E_{out}(k)) + s_k + \mu |N_k| \alpha_k. \]

Lemma 5. (Error difference inequality, $k = l$)
\[ C_{in}(l) (E_{in}(l) - E_{out}(l - 1)) + \sum_{i \in N_l} \mu |f_i^* - y_i| \le s_l + \mu |N_l| \alpha_l. \]

Proof.
Similar to Lemma 4, we have
\[ \sum_{i \in N_l} \Big( \sum_j w_{ij} |f_i^* - y_i| + \mu |f_i^* - y_i| \Big) \le \sum_{i \in N_l} \Big( \sum_j w_{ij} |f_j^* - y_j| + \sum_j w_{ij} |y_j - y_i| + \mu |h_i - y_i| \Big). \]
Because $N_l$ is the last neighborhood, there is no edge out from $N_l$, and
\begin{align*}
\mathrm{LHS} &= C_{in}(l) E_{in}(l) + C_{bet}(l) E_{bet}(l) + \sum_{i \in N_l} \mu |f_i^* - y_i|, \\
\mathrm{RHS} &= C_{out}(l - 1) E_{out}(l - 1) + C_{bet}(l) E_{bet}(l) + s_l + \mu |N_l| \alpha_l,
\end{align*}
which, using $C_{out}(l - 1) = C_{in}(l)$ from Lemma 3, rearranges to
\[ C_{in}(l) (E_{in}(l) - E_{out}(l - 1)) + \sum_{i \in N_l} \mu |f_i^* - y_i| \le s_l + \mu |N_l| \alpha_l. \]

We can see that this inequality contains different notions of error. We now define the proportion between the In-error and the Out-error.

Definition 7. Let $a_k$, $b_k$ be the proportions of the In-error and the Out-error relative to the average error,
\[ a_k = \frac{E_{in}(k)}{E_k}, \quad b_k = \frac{E_{out}(k)}{E_k}, \quad \text{where } E_k = \frac{\sum_{i \in N_k} |f_i^* - y_i|}{|N_k|}. \]

Assumption 1. (Uniformity of error) We assume that the In-error and the Out-error are roughly the same as the average error in each neighborhood,
\[ a_k = O(1), \quad b_k = O(1), \quad \frac{b_k}{a_k} = O(1). \]
For example, any graph $G$ that has all points in a neighborhood $N_k$ with the same number of edges that go into and out from that point has the property that $a_k = b_k = 1$. In particular, assume that we have 2 points in $N_k$: the first point has 4 edges from $N_{k-1}$ and 2 edges to $N_{k+1}$, while the second point has 2 edges from $N_{k-1}$ and 1 edge to $N_{k+1}$; this graph still has $a_k = b_k = 1$. In general, we expect the proportion $b_k / a_k$ to be close to 1. Next, we substitute $a_k$, $b_k$ in Lemma 4.

Corollary 1. For $1 \le k \le l - 1$, we have
\[ (a_k E_k - b_{k-1} E_{k-1}) \le \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|} (a_{k+1} E_{k+1} - b_k E_k) + \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|}. \]

Proof. From Lemma 4,
\[ C_{in}(k) (E_{in}(k) - E_{out}(k - 1)) + \sum_{i \in N_k} \mu |f_i^* - y_i| \le C_{out}(k) (E_{in}(k + 1) - E_{out}(k)) + s_k + \mu |N_k| \alpha_k. \]
We let $E_{in}(k) = a_k E_k$, $E_{out}(k) = b_k E_k$ and $\sum_{i \in N_k} |f_i^* - y_i| = |N_k| E_k$.
\begin{align*}
C_{in}(k) (a_k E_k - b_{k-1} E_{k-1}) + \mu |N_k| E_k &\le C_{out}(k) (a_{k+1} E_{k+1} - b_k E_k) + s_k + \mu |N_k| \alpha_k, \\
C_{in}(k) (a_k E_k - b_{k-1} E_{k-1}) + \mu |N_k| (E_k - E_{k-1}) &\le C_{out}(k) (a_{k+1} E_{k+1} - b_k E_k) + s_k + \mu |N_k| \alpha_k,
\end{align*}
as we know that $E_{k-1} \ge 0$. Then, simplifying yields that
\begin{align*}
(C_{in}(k) + \mu |N_k|) (a_k E_k - b_{k-1} E_{k-1}) &\le C_{out}(k) (a_{k+1} E_{k+1} - b_k E_k) + s_k + \mu |N_k| \alpha_k, \\
(a_k E_k - b_{k-1} E_{k-1}) &\le \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|} (a_{k+1} E_{k+1} - b_k E_k) + \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|}.
\end{align*}

Corollary 2. We have
\[ (a_l E_l - b_{l-1} E_{l-1}) \le \frac{s_l + \mu |N_l| \alpha_l}{C_{in}(l) + \mu |N_l|}. \]

The corollaries imply that the difference in error between adjacent neighborhoods cannot be too large. We introduce the next two lemmas to help derive the bound.

Lemma 6. Let $d_1, d_2, \ldots, d_l$ satisfy the inequalities $d_k \le \gamma_k d_{k+1} + c_k$ for $1 \le k \le l - 1$ and $d_l \le c_l$, with $\gamma_k \ge 0$. Then
\[ d_k \le \sum_{i=k}^{l} c_i \Big( \prod_{j=k}^{i-1} \gamma_j \Big), \]
where an empty product equals 1.

Proof. The main idea is that we can use the upper bounds on $d_l, d_{l-1}, \ldots, d_{k+1}$ to find the upper bound on $d_k$. First, we start with $d_{l-1}$,
\[ d_{l-1} \le \gamma_{l-1} d_l + c_{l-1} \le \gamma_{l-1} c_l + c_{l-1}. \]
Next, we continue with $d_{l-2}$,
\[ d_{l-2} \le \gamma_{l-2} d_{l-1} + c_{l-2} \le \gamma_{l-2} (\gamma_{l-1} c_l + c_{l-1}) + c_{l-2} = c_l \gamma_{l-1} \gamma_{l-2} + c_{l-1} \gamma_{l-2} + c_{l-2}, \]
and so on. With this idea, we can show by induction that $d_k \le \sum_{i=k}^{l} c_i \big( \prod_{j=k}^{i-1} \gamma_j \big)$, which proves the lemma.

Lemma 7. Let $x_1, x_2, \ldots, x_l$ satisfy the inequalities $a_k x_k - b_{k-1} x_{k-1} \le d_k$ for $1 \le k \le l$, where $a_k$, $b_k$, $d_k$ are positive constants. Then
\[ x_k \le \frac{1}{a_k} \Big( \sum_{i=1}^{k} d_i \prod_{j=i}^{k-1} \delta_j \Big) + \frac{b_0}{a_k} \Big( \prod_{j=1}^{k-1} \delta_j \Big) x_0, \quad \text{where } \delta_j = \frac{b_j}{a_j}. \]

Proof. Dividing both sides of the inequality by $a_k$, for each $1 \le k \le l$ we have
\[ x_k \le \frac{b_{k-1}}{a_k} x_{k-1} + \frac{d_k}{a_k}. \]
We can recursively apply this inequality,
\begin{align*}
x_k &\le \frac{b_{k-1}}{a_k} \Big( \frac{b_{k-2}}{a_{k-1}} x_{k-2} + \frac{d_{k-1}}{a_{k-1}} \Big) + \frac{d_k}{a_k} \\
&= \frac{1}{a_k} \Big( \frac{b_{k-1}}{a_{k-1}} b_{k-2} x_{k-2} + \frac{b_{k-1}}{a_{k-1}} d_{k-1} + d_k \Big) \\
&= \frac{1}{a_k} \big( \delta_{k-1} b_{k-2} x_{k-2} + \delta_{k-1} d_{k-1} + d_k \big) \\
&\le \frac{1}{a_k} \Big( \delta_{k-1} b_{k-2} \Big( \frac{b_{k-3}}{a_{k-2}} x_{k-3} + \frac{d_{k-2}}{a_{k-2}} \Big) + \delta_{k-1} d_{k-1} + d_k \Big) \\
&= \frac{1}{a_k} \big( \delta_{k-1} \delta_{k-2} b_{k-3} x_{k-3} + \delta_{k-1} \delta_{k-2} d_{k-2} + \delta_{k-1} d_{k-1} + d_k \big) \\
&\le \ldots \le \frac{1}{a_k} \Big( \sum_{i=1}^{k} d_i \prod_{j=i}^{k-1} \delta_j \Big) + \frac{b_0}{a_k} \Big( \prod_{j=1}^{k-1} \delta_j \Big) x_0.
\end{align*}

Now, we are ready to derive the error bound of LPA.

Theorem 3. Let $f^*$ be the optimal solution of the optimization problem of Equation 1. The error of $f^*$ in each neighborhood satisfies
\[ E_k \le \frac{1}{a_k} \Big( \sum_{i=1}^{k} d_i \prod_{j=i}^{k-1} \delta_j \Big), \]
where
\[ \delta_j = \frac{b_j}{a_j}, \quad d_k = \sum_{i=k}^{l} c_i \Big( \prod_{j=k}^{i-1} \gamma_j \Big), \quad c_k = \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|}, \quad \gamma_k = \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|}. \]
Under Assumption 1, we have
\[ E_k \le O\Big( \sum_{i=1}^{k} d_i \Big). \]

Proof. From Corollaries 1 and 2, we have
\[ (a_k E_k - b_{k-1} E_{k-1}) \le \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|} (a_{k+1} E_{k+1} - b_k E_k) + \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|} \]
and
\[ (a_l E_l - b_{l-1} E_{l-1}) \le \frac{s_l + \mu |N_l| \alpha_l}{C_{in}(l) + \mu |N_l|}. \]
Let
\[ d_k = a_k E_k - b_{k-1} E_{k-1}, \quad c_k = \frac{s_k + \mu |N_k| \alpha_k}{C_{in}(k) + \mu |N_k|}, \quad \gamma_k = \frac{C_{out}(k)}{C_{in}(k) + \mu |N_k|}. \]
By Lemma 6, we have
\[ a_k E_k - b_{k-1} E_{k-1} = d_k \le \sum_{i=k}^{l} c_i \Big( \prod_{j=k}^{i-1} \gamma_j \Big). \]
By Lemma 7, we have
\[ E_k \le \frac{1}{a_k} \Big( \sum_{i=1}^{k} d_i \prod_{j=i}^{k-1} \delta_j \Big) + \frac{b_0}{a_k} \Big( \prod_{j=1}^{k-1} \delta_j \Big) E_0 = \frac{1}{a_k} \Big( \sum_{i=1}^{k} d_i \prod_{j=i}^{k-1} \delta_j \Big), \quad \text{where } \delta_j = \frac{b_j}{a_j}. \]
The last equality holds because the error on labeled points is $E_0 = 0$. With Assumption 1, $b_k / a_k = O(1)$, we have
\[ E_k \le O\Big( \sum_{i=1}^{k} d_i \Big). \]

C SPECTRAL BOUND

The following is the original version of the spectral generalization bound found in Belkin & Niyogi (2004), where they assume that we can have repeated labeled points (at most $u$ times).

Theorem 4. (Generalization performance of graph regularization) Let $f$ be the optimal solution of Equation 2, and let $n \ge 4$ be the number of randomly sampled labeled points from some distribution where each vertex occurs no more than $u$ times, together with values $y_1, \ldots, y_n$, $|y_i| \le M$. Let $\lambda_1$ be the second smallest eigenvalue of the Laplacian matrix of $G$.
Assuming that $|f(x)| \le K$ for all $x$, we have with probability $1 - \delta$ (conditional on the multiplicity being no greater than $u$),
\[ |R_n(f) - R(f)| \le \beta + \sqrt{\frac{2 \log(2/\delta)}{n}} \big( n \beta + (K + M)^2 \big), \]
where
\[ \beta = \frac{3 \eta^2 \sqrt{un}}{(\lambda_1 - \eta u)^2} + \frac{4 \eta M}{\lambda_1 - \eta u}. \]
We can set $u = 1$, $M = 1$, $K = 1$ to obtain the simplified version (Theorem 2).
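To get a feel for how the right-hand side of Theorem 4 behaves, the short sketch below simply evaluates the expression for $\beta$ and the bound as they are written in this appendix. It is an illustration only: the function name `spectral_bound`, the example values, and the explicit check that $\lambda_1 - \eta u > 0$ are assumptions introduced here, and `eta` stands for the parameter $\eta$ of Equation 2 as it appears in $\beta$.

```python
import math

def spectral_bound(n, lam1, eta, delta, u=1, M=1.0, K=1.0):
    """Evaluate the right-hand side of the bound in Theorem 4.

    The defaults u = M = K = 1 correspond to the simplified setting.
    Requiring a positive gap lam1 - eta * u is an assumption made here
    so that the expression for beta is well defined.
    """
    gap = lam1 - eta * u
    if gap <= 0:
        raise ValueError("expected lam1 - eta * u > 0")
    beta = 3 * eta**2 * math.sqrt(u * n) / gap**2 + 4 * eta * M / gap
    return beta + math.sqrt(2 * math.log(2 / delta) / n) * (n * beta + (K + M) ** 2)

# A better-connected graph (larger lambda_1) gives a smaller value of the bound.
loose = spectral_bound(n=100, lam1=1.0, eta=0.01, delta=0.05)
tight = spectral_bound(n=100, lam1=2.0, eta=0.01, delta=0.05)
```

The comparison makes the dependence on global connectivity concrete: the bound is driven by $\lambda_1$ alone, regardless of how the labeled points are placed in the graph.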

D DONGLE NODES

We can change a label propagation problem with multiple initial predictions into a standard label propagation problem by augmenting the graph with dongle nodes (Zhu et al., 2003). Without loss of generality, we assume that each initial prediction has 3 possible outputs, $h_j : X \to \{\emptyset, 0, 1\}$ for $j = 1, \ldots, k$. However, this method also works for the general case $h_j : X \to [0, 1]$. We augment the original graph $G$ with the following nodes and edges:

1. For each weak labeler $h_j$, we add 2 nodes to the graph $G$, with vertices $v_{n+m+j}$ and $v_{n+m+k+j}$. These represent a prediction of class 0 or class 1, respectively, from weak labeler $j$.

2. For each $x_i$ with $h_j(x_i) = 0$, we draw a weighted edge between $v_i$ (the corresponding vertex of $x_i$) and $v_{n+m+j}$ with weight $\alpha_j(x_i)$.

3. For each $x_i$ with $h_j(x_i) = 1$, we draw a weighted edge between $v_i$ (the corresponding vertex of $x_i$) and $v_{n+m+k+j}$ with weight $\alpha_j(x_i)$.

4. For each $x_i$ with $h_j(x_i) = \emptyset$, we do not draw any edge.

Let $G'$ be the new graph with weighted adjacency matrix $(w'_{ij})$. Then solving the objective of Equation 3 is equivalent to solving
\[ \min_{f \in \mathbb{R}^{n+m+2k}} \sum_{i=1}^{n+m+2k} \sum_{j=1}^{n+m+2k} w'_{ij} (f_i - f_j)^2 \qquad (5) \]
such that

1. $f_i = y_i$ for $i \le n$,

2. $f_i = 0$ for $n + m + 1 \le i \le n + m + k$,

3. $f_i = 1$ for $n + m + k + 1 \le i \le n + m + 2k$.

We see initial predictions as dongle nodes and encode the parameter $\alpha_j(x_i)$ as the weight of an edge connecting the dongle node of predictor $j$ to the node of $x_i$. With a direct calculation, we can see that the objective of Equation 5 is the same as the original objective,
\[ \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} w_{ij} (f_i - f_j)^2 + \sum_{i=1}^{n+m} \sum_{j=1}^{k} (f_i - h_j(x_i))^2 \alpha_j(x_i) \quad \text{such that } f_i = y_i \text{ for } i \le n. \]
With this procedure, we add $2k$ nodes and at most $(n + m)k$ edges to $G$.
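The augmentation steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: nodes are 0-indexed, abstention is encoded as `-1`, and the helper names `augment_with_dongles` and `harmonic_solution` as well as the toy path graph are hypothetical. The constrained problem is solved through the usual harmonic linear system on the free nodes.

```python
import numpy as np

ABSTAIN = -1  # assumed encoding of the abstention symbol

def augment_with_dongles(W, votes, alpha):
    """Append 2k dongle nodes: N..N+k-1 encode class 0, N+k..N+2k-1 encode class 1."""
    N, k = votes.shape
    W_aug = np.zeros((N + 2 * k, N + 2 * k))
    W_aug[:N, :N] = W
    for j in range(k):
        zero_node, one_node = N + j, N + k + j
        for i in range(N):
            if votes[i, j] == 0:
                W_aug[i, zero_node] = W_aug[zero_node, i] = alpha[i, j]
            elif votes[i, j] == 1:
                W_aug[i, one_node] = W_aug[one_node, i] = alpha[i, j]
            # abstentions add no edge
    return W_aug

def harmonic_solution(W, fixed_idx, fixed_vals):
    """Minimize sum_ij w_ij (f_i - f_j)^2 with f fixed on fixed_idx."""
    total = W.shape[0]
    free = np.setdiff1d(np.arange(total), fixed_idx)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    f = np.zeros(total)
    f[fixed_idx] = fixed_vals
    rhs = W[np.ix_(free, fixed_idx)] @ fixed_vals
    f[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return f

# Toy path graph 0-1-2-3 where node 0 is labeled with y = 1, and k = 2 weak labelers.
N, k = 4, 2
W = np.zeros((N, N))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
votes = np.array([[ABSTAIN, ABSTAIN], [1, ABSTAIN], [ABSTAIN, 0], [0, 1]])
alpha = np.tile([0.8, 0.6], (N, 1))

W_aug = augment_with_dongles(W, votes, alpha)
fixed_idx = np.concatenate([[0], np.arange(N, N + 2 * k)])
fixed_vals = np.concatenate([[1.0], np.zeros(k), np.ones(k)])
f = harmonic_solution(W_aug, fixed_idx, fixed_vals)
```

On this toy example, each unlabeled entry of `f` is a weighted average of its graph neighbors and of the non-abstaining weak labels, with the weights $\alpha_j(x_i)$ carried by the dongle edges.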

PREDICTIONS

We analyze the closed-form solution of the optimization objective of Equation 3. By differentiating with respect to $f_i$, we know that the optimal solution $f_i^*$ satisfies
\[ \sum_{j=1}^{n+m} 2 w_{ij} (f_i^* - f_j^*) + \sum_{j=1}^{k} 2 (f_i^* - h_j(x_i)) \alpha_j(x_i) = 0, \]
\[ f_i^* = \frac{\sum_{j=1}^{n+m} w_{ij} f_j^* + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}. \]
From Lemma 1, recall that the optimal solution of LPA with an initial prediction (objective of Equation 1) is given by
\[ f_i^* = \frac{\sum_j w_{ij} f_j^* + \mu h(x_i)}{\sum_j w_{ij} + \mu}. \]
We can see that if we set
\[ h(x_i) = \frac{\sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{k} \alpha_j(x_i)}, \quad \mu(x_i) = \sum_{j=1}^{k} \alpha_j(x_i), \]
the solution of LPA with multiple initial predictions is equivalent to that of LPA with the single initial prediction $h$, which can be seen as a weighted average prediction. The objective is given by
\[ \min_{f \in \mathbb{R}^{n+m}} \frac{1}{2} \Big( \sum_{i,j} w_{ij} (f_i - f_j)^2 + \sum_{i=1}^{n+m} \mu(x_i) (f_i - h(x_i))^2 \Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n. \]
We note that the parameter $\mu$ now depends on each instance $x_i$. When $\sum_{j=1}^{k} \alpha_j(x_i) = \mu$ is constant for all $x_i$, we recover the setting of the objective of Equation 1. However, our analysis still holds when $\mu(x_i)$ is not constant.
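The reduction to a single prior $h$ with an instance-dependent weight $\mu(x_i)$ is easy to compute directly. The sketch below is illustrative: the function name `aggregate_weak_labels`, the `-1` encoding of abstention, and the fallback value 0.5 for points on which every labeler abstains are assumptions made here. Abstaining labelers contribute no dongle edge, so they are excluded from both sums.

```python
import numpy as np

ABSTAIN = -1  # assumed encoding of the abstention symbol

def aggregate_weak_labels(votes, alpha):
    """Return (h, mu): weighted-average prior and per-instance regularization weight.

    votes: (N, k) array with entries in {0, 1, ABSTAIN}.
    alpha: (k,) or (N, k) array of nonnegative labeler weights.
    """
    votes = np.asarray(votes, dtype=float)
    active = votes != ABSTAIN
    weights = np.where(active, alpha, 0.0)           # abstentions get weight 0
    mu = weights.sum(axis=1)
    weighted_votes = (weights * np.where(active, votes, 0.0)).sum(axis=1)
    h = np.full(mu.shape, 0.5)                       # placeholder where mu == 0
    np.divide(weighted_votes, mu, out=h, where=mu > 0)
    return h, mu

# Three labelers with weight 0.8 each: one point where all three vote 1,
# and one point where only the first labeler votes.
votes = np.array([[1, 1, 1],
                  [1, ABSTAIN, ABSTAIN]])
h, mu = aggregate_weak_labels(votes, alpha=np.array([0.8, 0.8, 0.8]))
```

In this example both points receive the same prior value `h`, while `mu` is three times larger for the point on which all three labelers vote.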

E.1 DIFFERENCE BETWEEN LPA+WL AND LPAD(A)

We note that LPAD (A) uses Snorkel to estimate the accuracies $\alpha_j$. From above, LPAD (A) is equivalent to LPA with an initial prediction
\[ h(x_i) = \frac{\sum_{j=1}^{k} \alpha_j h_j(x_i)}{\sum_{j=1}^{k} \alpha_j}, \quad \mu(x_i) = \sum_{j=1}^{k} \alpha_j, \]
where both sums range over the weak labelers that do not abstain on $x_i$. We observe that $h$ is exactly the prior information for LPA + WL. However, the key difference is that for LPA + WL we have a fixed $\mu$ for all data points, while in LPAD (A) the value $\mu(x_i)$ depends on $x_i$. To illustrate this, we consider 2 scenarios. We assume that we have 3 weak labelers $h_1$, $h_2$, and $h_3$, all with estimated accuracy 0.8. We consider a point $x_1$ with $h_1(x_1) = 1$, $h_2(x_1) = 1$, $h_3(x_1) = 1$ and a point $x_2$ with $h_1(x_2) = 1$, $h_2(x_2) = \emptyset$, $h_3(x_2) = \emptyset$, where $\emptyset$ is abstention. We can observe that

1. $h(x_1) = h(x_2) = 1$,

2. $\mu(x_1) = 2.4$, $\mu(x_2) = 0.8$.

Here, in LPA + WL, $x_1$ and $x_2$ have the same prior information and the same regularization parameter $\mu$. In LPAD (A), the regularization parameter $\mu(x_1)$ is much larger than $\mu(x_2)$. This is intuitive, as we should be more confident about our prior information when a higher number of weak labelers agree.

F METHODS FOR SELECTING ALPHA

F.1 BOOSTING APPROACH

From the boosting literature (Freund & Schapire, 1997), given many weak learners $h_j : X \to \{0, 1\}$ for $j = 1, 2, \ldots, k$, an optimal way to combine these weak learners (corresponding to an exponential loss upper bound) is a weighted average
\[ h = \frac{\sum_{j=1}^{k} \alpha_j h_j}{\sum_{j=1}^{k} \alpha_j}, \quad \alpha_j = \ln\Big( \frac{P(h_j(x) = y)}{1 - P(h_j(x) = y)} \Big). \]
Instead of the accuracy, we could set $\alpha_j$ in the fashion suggested by the boosting literature. We show that this value of $\alpha_j$ minimizes the upper bound on the error $|f_i^* - y_i|$. Recall that the optimal solution of Equation 3 satisfies
\[ f_i^* = \frac{\sum_{j=1}^{n+m} w_{ij} f_j^* + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}. \]
We can bound the error of a point $i$, $|f_i^* - y_i|$, by the error of its neighbors $|f_j^* - y_j|$.
Observe that
\begin{align*}
f_i^* - y_i &= \frac{\sum_{j=1}^{n+m} w_{ij} f_j^* + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)} - y_i \\
&= \frac{\sum_{j=1}^{n+m} w_{ij} (f_j^* - y_j + y_j - y_i) + \sum_{j=1}^{k} \alpha_j(x_i) (h_j(x_i) - y_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)},
\end{align*}
so that
\[ |f_i^* - y_i| \le \frac{\sum_{j=1}^{n+m} w_{ij} |f_j^* - y_j| + \sum_{j=1}^{n+m} w_{ij} |y_j - y_i| + \big| \sum_{j=1}^{k} \alpha_j(x_i) (h_j(x_i) - y_i) \big|}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}. \]
The first term represents the errors of neighboring points $|f_j^* - y_j|$, the second term represents the smoothness of the true labels on the graph $G$, and the third term represents the accuracy of the weighted prediction. We can improve the upper bound by selecting appropriate values of $\alpha_j(x_i)$ and $w_{ij}$ to minimize
\[ \left| \frac{\sum_{j=1}^{k} \alpha_j(x_i) (h_j(x_i) - y_i)}{\sum_{j=1}^{k} \alpha_j(x_i)} \right| \ge \left| \frac{\sum_{j=1}^{k} \alpha_j(x_i) (h_j(x_i) - y_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)} \right|. \]
Consider the following lemma.

Lemma 8. Given $k$ classifiers $h_i : X \to \{-1, 1\}$ for $i = 1, \ldots, k$, let $h(x) = \sum_{i=1}^{k} \alpha_i h_i(x)$ be the weighted combination of the classifiers. Assume that the predictions $h_i(x)$ are independent across different $i$. The optimal $\alpha_i$ that minimizes the risk under the exponential loss of $h(x)$, $L(h, x, y) = \exp(-y h(x))$, is given by
\[ \alpha_i = \frac{1}{2} \ln\Big( \frac{P(h_i(x) = y)}{1 - P(h_i(x) = y)} \Big). \]

Proof. The risk is given by
\begin{align*}
E(L(h, x, y)) &= E(\exp(-y h(x))) = E\Big( \exp\Big( -y \sum_{i=1}^{k} \alpha_i h_i(x) \Big) \Big) \\
&= \prod_{i=1}^{k} E(\exp(-y \alpha_i h_i(x))) = \prod_{i=1}^{k} \big( p_i \exp(-\alpha_i) + (1 - p_i) \exp(\alpha_i) \big),
\end{align*}
where $p_i = P(h_i(x) = y)$. It is sufficient to choose each $\alpha_i$ to minimize $p_i \exp(-\alpha_i) + (1 - p_i) \exp(\alpha_i)$.
Differentiating with respect to $\alpha_i$ and setting the derivative to zero, we have
\begin{align*}
-p_i \exp(-\alpha_i) + (1 - p_i) \exp(\alpha_i) &= 0, \\
(1 - p_i) \exp(\alpha_i) &= p_i \exp(-\alpha_i), \\
\exp(2 \alpha_i) &= \frac{p_i}{1 - p_i}, \\
\alpha_i &= \frac{1}{2} \ln\Big( \frac{p_i}{1 - p_i} \Big).
\end{align*}
Note that the exponential loss is an upper bound of our hinge loss, and Lemma 8 suggests that to minimize the exponential loss upper bound, we should set $\alpha_i = \frac{1}{2} \ln\big( \frac{p_i}{1 - p_i} \big)$.

F.2 HETEROSCEDASTIC REGRESSION

Recall that we model $h_j \sim N(y, \sigma(x)^2)$, so that we can write $h_j(x_i) = y(x_i) + \sigma_j(x_i) \varepsilon$, where $\varepsilon \sim N(0, 1)$. We want to regress $\sigma(x)$. Rearranging, we have
\begin{align*}
h_j(x_i) - y(x_i) &= \sigma_j(x_i) \varepsilon, \\
(h_j(x_i) - y(x_i))^2 &= \sigma_j(x_i)^2 \varepsilon^2, \\
\log((h_j(x_i) - y(x_i))^2) &= \log(\sigma_j(x_i)^2) + \log(\varepsilon^2).
\end{align*}
On labeled data, we can regress a function $g_j(x_i)$ to match $\log((h_j(x_i) - y(x_i))^2)$, and then we set $\alpha_j(x_i) = \frac{1}{\exp(g_j(x_i))}$.
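Both choices of $\alpha$ in this appendix can be sketched briefly. The code below is a hedged illustration, not the authors' implementation: the function names are hypothetical, the regressor $g_j$ is taken to be a linear least-squares fit (the text does not specify the function class), and a small constant `eps` is added inside the logarithm, which is an assumption made here to avoid $\log 0$ when a weak label matches the true label exactly.

```python
import numpy as np

def boosting_weight(p):
    """alpha = 0.5 * ln(p / (1 - p)) for a labeler with accuracy p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return 0.5 * np.log(p / (1.0 - p))

def exp_risk_factor(alpha, p):
    """Per-labeler factor p * exp(-alpha) + (1 - p) * exp(alpha) of the exponential risk."""
    return p * np.exp(-alpha) + (1.0 - p) * np.exp(alpha)

def heteroscedastic_alpha(X_lab, h_lab, y_lab, X_all, eps=1e-6):
    """Fit g(x) to log((h(x) - y(x))^2) on labeled data and return alpha(x) = exp(-g(x)).

    Assumption: g is linear in the features with an intercept term.
    """
    target = np.log((h_lab - y_lab) ** 2 + eps)
    A_lab = np.hstack([X_lab, np.ones((len(X_lab), 1))])
    coef, *_ = np.linalg.lstsq(A_lab, target, rcond=None)
    A_all = np.hstack([X_all, np.ones((len(X_all), 1))])
    return np.exp(-(A_all @ coef))

# Boosting weight for a labeler with accuracy 0.8, and a grid check of optimality.
alpha_star = boosting_weight(0.8)
grid = np.linspace(-3.0, 3.0, 601)
grid_min = exp_risk_factor(grid, 0.8).min()

# Heteroscedastic weights on synthetic data (for illustration only).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(50, 3))
X_lab, y_lab = X_all[:20], rng.integers(0, 2, size=20).astype(float)
h_lab = np.clip(y_lab + 0.3 * rng.normal(size=20), 0.0, 1.0)
alpha_x = heteroscedastic_alpha(X_lab, h_lab, y_lab, X_all)
```

At the optimum, the per-labeler risk factor equals $2\sqrt{p(1-p)}$, which is 0.8 for $p = 0.8$; no value on the grid does better.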

G ADDITIONAL EXPERIMENTAL DETAILS

We use the default splits from the WRENCH benchmark (Zhang et al., 2021) for each of our binary classification datasets. This benchmark has an Apache-2.0 license. For each text classification dataset (Youtube, SMS, CDR), we use pretrained BERT embeddings (Kenton & Toutanova, 2019). For our image classification tasks (Basketball), we use pretrained ResNet embeddings (He et al., 2016a). For each task, we balance the datasets and randomly sample 100 labeled datapoints. We balance the datasets to make sure that the overall sample of labeled data contains roughly the same number of points from each class. We use cluster compute resources to produce our empirical results. We use a single GPU (NVIDIA GeForce RTX 2080Ti) to run our methods and each of the baselines.

G.1 WEAK LABEL SOURCES

We use the standard weak labels contained within the WRENCH benchmark (Zhang et al., 2021). These are standard in the programmatic weak supervision literature and primarily consist of simple hand-engineered rules. For example, on the YouTube dataset (a spam classification task), examples of weak labels are functions that check for the presence of words in a sentence (Figure 4). We refer interested readers to the benchmark (Zhang et al., 2021) and other papers in weak supervision (Ratner et al., 2017) for more details.

We perform hyperparameter optimization of all methods, selecting the best set of parameters on the validation set. We optimize over the following parameters for all methods' endmodels:
• learning rate: [0.01, 0.001, 0.0001]
• number of epochs: [20, 30, 40, 50]
• weight decay: [0, 0.01, 0.001]

In each experiment, we have a fixed batch size of 100 and a fixed architecture of a 2-layer neural network with a hidden dimension of 64 and a ReLU activation function.

For all graph-based methods, we have an additional parameter $t \in [1, 2, 5, 10, 100]$, which controls the average degree of nodes in $G$. Let $N$ be the number of nodes in $G$; we use the value $t/N$ as the threshold percentile for our Euclidean distance threshold graph. In essence, we add an edge between two points when the Euclidean distance between them is less than the $t/N$-th percentile of all $N^2$ pairwise distances. The motivation for this is that $N$ nodes correspond to $N^2$ possible edges, so adding $tN$ edges leads to a resulting graph with average degree $t$.

For our GCN baseline, we construct a graph $G$ in the same manner as for all other LPA-based methods. The GCN architecture is a 2-layer neural network with a hidden dimension of 16 and ReLU activations. We then train an endmodel on the pseudolabeled data, using the same architecture as for all other methods.

For our Liger + L baseline, we optimize over a fixed cosine-similarity threshold $k$ for each weak labeler. We note that this baseline is highly sensitive to the value of $k$; for BERT embeddings, we select values of $k \in [0.995, 0.9975, 1]$, as points are much less distinguished in this embedding space in comparison to the larger foundation models (GPT-3, CLIP) used in the original paper (Chen et al., 2022). We use 2 clusters for all tasks.
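The distance-threshold graph construction described in this section can be sketched as follows. This is a minimal NumPy illustration, not the released code: the function name `threshold_graph` is hypothetical, $t/N$ is interpreted as a fraction (quantile level) rather than a value out of 100, and the threshold is computed over distinct pairs only, excluding self-distances. Both interpretations are assumptions made here so that the average degree comes out close to $t$.

```python
import numpy as np

def threshold_graph(X, t):
    """Unweighted Euclidean threshold graph with average degree roughly t.

    X: (N, d) array of embeddings; t: target average degree.
    """
    N = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))     # (N, N) pairwise distances
    upper = np.triu_indices(N, k=1)               # each unordered pair once
    threshold = np.quantile(dist[upper], min(t / N, 1.0))
    W = (dist < threshold).astype(float)
    np.fill_diagonal(W, 0.0)                      # no self-loops
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # stand-in for pretrained embeddings
W = threshold_graph(X, t=5)
average_degree = W.sum(axis=1).mean()
```

Materializing all pairwise differences needs memory on the order of $N^2 d$, so for larger datasets a chunked or approximate nearest-neighbor computation would be preferable; the dense version is used here only for brevity.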

H COMPLETE VERSION OF TABLES

We present our results for both training/pseudolabel performance (Table 3) and test/endmodel performance (Table 4) in more detail and with additional comparisons. In our pseudolabel performance table, we provide the Non-Abstain accuracy (i.e., only considering accuracy on points on which the model makes a vote). We define abstaining as having a maximum logit (across either class) that is within $\epsilon = 0.001$ of 0.5. We observe a fundamental tradeoff between high accuracy with little coverage and lower accuracy with higher coverage. We also observe that LPA performs well locally on regions connected to labeled points, with non-abstain accuracy close to 100 percent, but has low overall coverage.

For our endmodel results, we additionally compare against a fully supervised approach that uses all of the training data and their labels. We remark that some datasets in the WRENCH benchmark have noisy labels, leading to imperfect fully supervised performance. For example, we also add evaluations on the Tennis dataset, which only achieves 88% fully supervised performance, and all methods seem to match this performance.

We also add some additional variations of our dongle-based approach. We add a comparison to a boosting approach (Appendix F.1) to determine $\alpha$, which we refer to as LPAD (B). We also compare against a method that uses the unobserved ground truth accuracies for $\alpha$ (LPAD (O)) and another method that sets $\alpha = 1$ for all $x$ (LPAD (1)). We note that LPAD (O) is an unfair comparison to all other methods as it accesses ground truth accuracies that other methods do not use; we add this comparison to describe the best potential performance of LPAD.

Table 4: We report accuracy on test data for training an endmodel on pseudolabeled training data, when averaged over 5 seeds. We bold baselines when they outperform both of our methods. We bold our methods when they outperform all baselines.
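The abstention rule and the pseudolabel statistics discussed in this section can be sketched as below. The function name `pseudolabel_stats` is hypothetical, and one convention is an assumption rather than something stated in the text: abstained points are credited as a random guess (0.5) in the overall accuracy, which is consistent with the reported LPA row (coverage 11.97 and non-abstain accuracy 100.00 give $0.1197 \cdot 100 + 0.8803 \cdot 50 \approx 55.98$).

```python
import numpy as np

def pseudolabel_stats(f, y, eps=1e-3):
    """Coverage, non-abstain accuracy, and overall accuracy of soft binary labels.

    f: soft labels in [0, 1]; y: true labels in {0, 1}.
    A point abstains when max(f, 1 - f) is within eps of 0.5.
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y)
    confidence = np.maximum(f, 1.0 - f)
    voted = confidence - 0.5 > eps
    pred = (f > 0.5).astype(int)
    coverage = voted.mean()
    na_acc = (pred[voted] == y[voted]).mean() if voted.any() else float("nan")
    # Assumed convention: abstained points count as a coin flip.
    correct = np.where(voted, pred == y, 0.5)
    return {"coverage": coverage, "na_acc": na_acc, "accuracy": correct.mean()}

stats = pseudolabel_stats(f=[0.9, 0.5, 0.2, 0.5004, 0.7], y=[1, 0, 0, 1, 0])
```

In this example the second and fourth points abstain, since their confidence lies within $\epsilon$ of 0.5, so three of the five points are covered.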

I HYPERPARAMETER ABLATION

We provide an ablation study on the hyperparameter t. We report accuracy on the test data of an endmodel that is trained on pseudolabels from baselines and our methods, with particular values of t used to determine the construction of G. We observe that all graph-based methods are sensitive to the choice of t, which controls the sparsity of edges in the (Euclidean) graph G. We remark that this finding is intuitive, as most graph-based semi-supervised algorithms leverage properties of this graph to achieve better performance. We can see a common trend among all methods: when t is large, the endmodel accuracy tends to decrease. We note that Snorkel + L does not leverage any graph information, but we still add it here for comparison.



https://github.com/dsam99/label propagation weak supervision



Figure 1: A diagram of edge flows between neighborhoods of L. The color of each edge indicates which flow (In, Between, Out) the edge contributes to (left). Examples of graphs with different structures, where colored points represent labeled points (middle, right).

Figure 2: Examples of graphs G 1 , G 2 , G 3 (left, middle, right) used to compare our bound to existing bounds. The background color represents the true label class, and colored points represent labeled points.

Figure 3: One can turn the label propagation with multiple sources of information into a standard label propagation problem by augmenting the graph G (left) with dongle nodes (right). The colored point represents a labeled point. The points without the shade are dongle nodes.

Figure 4: Examples of weak labels on the YouTube dataset

We report accuracy (± s.e.) on a held-out test dataset for training an endmodel on

.96 ± 0.13   89.75 ± 0.08   70.40 ± 0.34   48.31 ± 0.52   70.64 ± 0.12   92.41 ± 0.11
LPA         55.98 ± 0.08   11.97 ± 0.16   54.71 ± 0.01    9.42 ± 0.03   50.79 ± 0.00    1.58 ± 0.00
Liger + L   81.06 ± 0.47   99.98 ± 0.01   78.62 ± 0.23   96.01 ± 0.20   50.56 ± 0.13   81.79 ± 10.72
LPA + WL    76.02 ± 0.12   89.81 ± 0.08   70.75 ± 0.35   49.03 ± 0.52   70.64 ± 0.12   92.41 ± 0.11
LPAD (A)    84.03 ± 0.15   89.81 ± 0.08   70.80 ± 0.26   49.00 ± 0.51   72.87 ± 0.08   91.66 ± 0.49
LPAD (P)    89.52 ± 0.13   89.75 ± 0.08   70.98 ± 0.27   49.03 ± 0.52   71.91 ± 0.25   92.22 ± 0.11

We report accuracy and coverage (± s.e.) of the various label propagation methods on the full partially labeled training data (i.e, pseudolabel accuracies), when averaged over 5 seeds. We also add Liger + L as it looks to improve coverage by extending weak labelers. We bold the best performing method (in terms of Accuracy) on each dataset.

.96 ± 0.13   89.75 ± 0.08   78.93 ± 0.14   70.40 ± 0.34   48.31 ± 0.52   92.20 ± 0.29   70.64 ± 0.12   92.41 ± 0.11   72.33 ± 0.15
LPA         55.98 ± 0.08   11.97 ± 0.16   100.00 ± 0.00   54.71 ± 0.01    9.42 ± 0.03   100.00 ± 0.00   50.79 ± 0.00    1.58 ± 0.00   100.00 ± 0.00
Liger + L   81.06 ± 0.47   99.98 ± 0.01    81.07 ± 0.47   78.62 ± 0.23   96.01 ± 0.20    79.81 ± 0.18   50.56 ± 0.13   81.79 ± 10.72   50.98 ± 0.46
± 0.20   69.97 ± 0.23   86.97 ± 0.24   87.46 ± 0.17   99.38 ± 0.04   87.70 ± 0.17

We report accuracy, coverage, and non-abstaining accuracy (NA Acc) of the baselines and our variants of LPA on the training data (i.e., pseudolabel statistics), when averaged over 5 seeds.

± 0.47   96.24 ± 0.32   82.08 ± 0.83   68.03 ± 0.28   88.51 ± 0.04
FS + L             87.76 ± 0.51   94.84 ± 0.43   70.23 ± 1.20   67.70 ± 0.29   88.56 ± 0.02
CLL                88.56 ± 0.80   94.56 ± 0.73   77.02 ± 3.96   68.52 ± 0.58   88.78 ± 0.13
LPA                82.00 ± 1.37   94.32 ± 0.45   78.71 ± 2.41   67.41 ± 0.82   83.35 ± 3.30
GCN                84.16 ± 0.95   94.32 ± 1.02   61.34 ± 1.16   65.42 ± 1.00   88.63 ± 0.14
Liger + L          88.72 ± 0.58   96.08 ± 0.38   80.98 ± 1.71   67.33 ± 0.18   86.43 ± 0.87
LPA + WL           88.32 ± 0.50   96.80 ± 0.36   83.13 ± 1.43   67.61 ± 0.19   88.51 ± 0.04
LPAD (A)           90.32 ± 0.43   96.32 ± 0.52   83.06 ± 0.74   68.13 ± 0.74   88.56 ± 0.02
LPAD (P)           87.84 ± 0.53   96.64 ± 0.39   82.01 ± 2.96   68.97 ± 0.51   88.60 ± 0.04
LPAD (B)           88.64 ± 0.37   96.56 ± 0.33   76.58 ± 2.20   67.01 ± 0.43   88.56 ± 0.02
LPAD (O)           90.16 ± 0.50   96.40 ± 0.50   81.10 ± 1.43   69.06 ± 0.59   88.58 ± 0.02
LPAD (1)           83.20 ± 1.15   94.32 ± 0.35   78.61 ± 1.95   67.76 ± 0.21   88.43 ± 0.16
Fully Supervised   89.92 ± 1.45   98.04 ± 0.38   86.04 ± 2.02   73.71 ± 0.85   88.43 ± 1.06

Snorkel + L   87.04 ± 0.47   85.44 ± 0.9    86.4 ± 0.78    86.4 ± 0.78    87.04 ± 0.47
LPA           79.76 ± 1.99   81.6 ± 1.15    82.0 ± 1.37    82.64 ± 1.62   75.68 ± 1.51
LPA + WL      86.24 ± 0.75   86.56 ± 0.81   86.16 ± 0.71   88.32 ± 0.5    83.12 ± 1.2
LPAD (A)      89.12 ± 0.5    88.96 ± 1.04   89.04 ± 0.68   90.32 ± 0.43   84.0 ± 0.54
LPAD (B)      87.2 ± 0.55    86.24 ± 1.22   87.12 ± 0.69   88.64 ± 0.37   83.84 ± 0.27
LPAD (P)      87.76 ± 0.79   87.84 ± 0.53   89.2 ± 0.31    89.92 ± 0.53   82.8 ± 1.03

We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods when using different hyperparameter t and averaged over 5 seeds.

Snorkel + L   95.04 ± 0.46   96.08 ± 0.48   96.24 ± 0.32   96.24 ± 0.32   95.04 ± 0.46
LPA           94.32 ± 0.45   94.52 ± 0.26   95.24 ± 0.38   92.2 ± 1.08    80.76 ± 3.51
LPA + WL      94.88 ± 0.67   96.44 ± 0.26   96.8 ± 0.36    95.36 ± 0.6    85.12 ± 3.82
LPAD (A)      95.6 ± 0.28    96.8 ± 0.33    96.32 ± 0.52   95.96 ± 0.41   86.12 ± 4.27
LPAD (B)      96.04 ± 0.19   96.68 ± 0.32   96.56 ± 0.33   96.12 ± 0.34   82.64 ± 4.36
LPAD (P)      95.8 ± 0.46    95.64 ± 0.36   96.64 ± 0.39   96.16 ± 0.48   84.32 ± 2.82

We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods when using different hyperparameter t and averaged over 5 seeds.

Snorkel + L   67.87 ± 0.25   68.03 ± 0.28   68.24 ± 0.72   68.24 ± 0.72   67.87 ± 0.25
LPA           67.01 ± 0.82   65.17 ± 1.09   63.19 ± 1.18   59.33 ± 1.8    44.39 ± 2.69
LPA + WL      67.87 ± 0.25   67.61 ± 0.19   67.16 ± 0.84   65.8 ± 0.82    49.66 ± 1.8
LPAD (A)      68.13 ± 0.74   67.68 ± 1.1    68.65 ± 0.53   67.56 ± 0.38   61.28 ± 3.21
LPAD (B)      66.44 ± 1.23   67.01 ± 0.43   65.98 ± 0.92   66.26 ± 1.15   54.06 ± 2.78
LPAD (P)      68.97 ± 0.51   67.27 ± 1.05   65.48 ± 0.36   64.93 ± 1.8    58.34 ± 3.71

We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods when using different hyperparameter t and averaged over 5 seeds.

Snorkel + L   81.62 ± 1.07   83.62 ± 0.8    82.08 ± 0.83   82.08 ± 0.83   81.62 ± 1.07
LPA           75.56 ± 0.55   79.36 ± 1.49   75.12 ± 1.26   78.71 ± 2.41   66.02 ± 1.19
LPA + WL      83.13 ± 1.43   80.87 ± 0.64   79.95 ± 1.56   80.11 ± 1.75   73.45 ± 1.09
LPAD (A)      80.44 ± 0.83   78.9 ± 1.53    81.64 ± 1.35   83.06 ± 0.75   74.83 ± 1.49
LPAD (B)      75.81 ± 3.47   72.75 ± 2.11   76.58 ± 2.2    78.0 ± 4.34    69.36 ± 1.03
LPAD (P)      82.01 ± 2.96   74.6 ± 2.05    73.31 ± 5.25   68.71 ± 4.41   67.18 ± 3.64

We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods when using different hyperparameter t and averaged over 5 seeds.

ACKNOWLEDGEMENTS

This work was supported in part by NSF grants IIS-1909816, IIS-1955532, IIS-2211907, CCF-1910321 and DARPA under cooperative agreement HR00112020003 and funding from Bosch Center for Artificial Intelligence and the ARCS Foundation.

