LEARNING HYPER LABEL MODEL FOR PROGRAMMATIC WEAK SUPERVISION

Abstract

To reduce human annotation effort, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and uses a label model to aggregate the outputs of multiple LFs into training labels. Most existing label models require a parameter learning step for each dataset. In this work, we present a hyper label model that, once learned, infers the ground-truth labels for each dataset in a single forward pass without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution for the ground-truth labels. We train the model on synthetic data generated in a way that ensures the model approximates the analytical optimal solution, and we build the model on a Graph Neural Network (GNN) so that its predictions are invariant to the permutation of LFs and equivariant to the permutation of data points. On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by a factor of six on average). Our code is available at https://github.com/wurenzhi/hyper_label_model

1. INTRODUCTION

The lack of labeled training data is a major challenge impeding the practical application of machine learning (especially deep learning) techniques. Practitioners have therefore increasingly turned to weak supervision, in which large amounts of cheaply generated noisy labels are used. There are many forms of weak supervision sources, e.g. external knowledge bases (Mintz et al., 2009), existing pre-trained models (Das et al., 2020; Wu et al., 2022b), and heuristics/rules (Shin et al., 2015). To unify different sources, the programmatic weak supervision (PWS) paradigm (Ratner et al., 2016; 2017; Zhang et al., 2022) was proposed. In PWS, the user expresses each available weak supervision signal with a labeling function (LF), a small program that takes in a data point and outputs a noisy label. Each LF is then applied to unlabeled data of arbitrary size to obtain a noisy label vector, and a label aggregation model (also referred to as a label model in the literature) aggregates all noisy label vectors to infer the unknown ground-truth labels. The inferred labels can then be used to train any downstream end model. The PWS paradigm has been successful in various tasks (Wu et al., 2018; Fries et al., 2019; Lison et al., 2020; Wu et al., 2021; 2020; Li et al., 2021) and industry scenarios (Mathew et al., 2021; Bach et al., 2019; Dunnmon et al., 2020).

The core challenge in PWS is how to aggregate all noisy label vectors to infer the ground-truth labels. Let the label matrix X denote the noisy labels, where each column X[:, j] is the noisy label vector from the j-th LF and each row X[i, :] holds the weak labels of the i-th data point; let y denote the ground-truth label vector. Most existing label models assume an underlying distribution p(y[i]|X[i, :]; θ) (Zhang et al., 2022), where y[i] is the label of the i-th data point and θ is the parameter of the distribution.
The parameter θ is first learned on the weak labels X = (X[1, :], X[2, :], . . . ) in an unsupervised and typically iterative way, and then inference is made using p(y[i]|X[i, :]; θ). In this approach, the parameter θ is dataset-specific and has to be learned for every different X (dataset).

In contrast to existing solutions, we propose a hyper label model with the goal of reducing both the assumptions and the parameter learning process. Specifically, we aim to develop a hyper model with two desiderata: (1) it works with a "minimal" assumption, i.e., we only assume that the majority of LFs are better than random, and we neither require knowledge of nor assume any particular form of the underlying distribution p(y[i]|X[i, :]; θ); (2) once the hyper model is learned, it can be used to infer y for any new X without an additional dataset-specific parameter learning process. To shed light on this direction, we first show, in theory, that without assuming an underlying distribution, there is an optimal and analytical (therefore requiring no parameter learning) way to estimate y based on X, i.e., y* = h*(X). However, h* is intractable to compute since it involves averaging over a set whose size grows exponentially with the size of X. Therefore, we propose to leverage the power of deep learning to approximate this solution, i.e., we seek an alternative function h parametrized by a neural network which, once learned, can estimate the label vector for a new dataset without an ad hoc dataset-specific learning process. We thus call the learned model a hyper label model.

Materializing this idea involves two key questions: (1) How to generate training data? (2) How to design the model architecture? To generate training data, the straightforward solution is to use the analytical method to generate many pairs (X, y*) where y* = h*(X). However, computing y* with h*(X) is of exponential complexity.
We notice that for each X, h*(X) is an average of the label vectors from a certain set. Taking advantage of this, we avoid directly generating y*, which is of exponential complexity, and instead design a way of generating an equivalent set of training data such that the trained model approximates h*(X).

The model architecture has two requirements. First, it should accept an input matrix X of arbitrary size, as the size of X can differ across datasets. Second, the output of the model (i.e. the predicted label vector) should be invariant to permutations of the columns of X, as the order of the LFs should not affect the predicted labels, and it should be equivariant to permutations of the rows of X, as switching the order of the data points in X should switch the predicted labels accordingly. A Graph Neural Network (GNN) accepts an input graph of arbitrary size and is permutation equivariant with respect to the nodes of the graph (and can also be made permutation invariant by averaging over the nodes). Therefore, we propose to represent the input matrix X as a graph and then design a GNN that satisfies the two requirements.

Contributions. We make the following contributions: (1) We present, for the first time, an analytical method for label aggregation that is optimal in the sense that it minimizes a certain form of the averaged prediction error, though directly using the analytical method is of exponential complexity. (2) We train a model to learn the analytical method. The trained model is a hyper label model that can be used to infer the ground-truth labels for unseen datasets in a single forward pass without any dataset-specific parameter learning. (3) We design a synthetic training data generation method and show that the hyper label model trained on the synthetically generated data learns to approximate the analytical method.
(4) We design an effective model architecture based on GNN so that the hyper label model is applicable to an arbitrary number of LF label vectors of arbitrary size and is invariant to the permutation of LF label vectors and equivariant to the permutation of data points. (5) We show that our hyper label model outperforms the best existing methods over 14 real-world weak supervision datasets in both accuracy (by 1.4 points on average) and efficiency (by a speedup of six times on average) for both unsupervised and semi-supervised label aggregation.
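To make the PWS setup concrete, the following minimal sketch shows how a few labeling functions could be applied to unlabeled data to build the label matrix X. The three LFs, the toy messages, and the helper `apply_lfs` are hypothetical illustrations and are not taken from any dataset used in this paper; the encoding (+1, -1, 0 for abstention) follows the convention used in the problem setup.

```python
import numpy as np

# Label encoding: +1 / -1 for the two classes, 0 for abstention.
ABSTAIN, NEG, POS = 0, -1, +1

# Hypothetical labeling functions for a toy spam task (illustration only).
def lf_contains_free(text):
    return POS if "free" in text.lower() else ABSTAIN

def lf_has_url(text):
    return POS if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return NEG if len(text.split()) <= 4 else ABSTAIN

def apply_lfs(lfs, data):
    """Build the n x m label matrix: row i = data point i, column j = LF j."""
    return np.array([[lf(x) for lf in lfs] for x in data], dtype=int)

data = [
    "Win a FREE phone at http://spam.example",
    "see you at lunch",
    "free tickets for everyone in the whole office today",
]
X = apply_lfs([lf_contains_free, lf_has_url, lf_short_message], data)
```

Each column of `X` is the noisy label vector of one LF, and a label model takes `X` as its only input.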

2. RELATED WORK

All existing methods (except majority vote) first learn some parameter θ ad hoc for each new dataset, and inference is then performed based on the learned parameter θ. The existing methods differ from each other in how they formulate the parameter θ and how they learn it (Zhang et al., 2022). For example, most methods assume an underlying distribution p(y[i]|X[i, :]; θ) (Ratner et al., 2016; 2019; Fu et al., 2020; Wu et al., 2022a; Yu et al., 2022) and focus on how to represent the distribution and how to learn its parameter θ. As another example, some approaches treat the accuracies of the LFs as parameters and then use iterative methods to learn these accuracy parameters (Arachie & Huang, 2021a; b; Dawid & Skene, 1979) for each dataset. Different from all existing methods, our hyper label model directly performs inference on new datasets without an ad hoc dataset-specific parameter learning process.

In principle, X could be any matrix in $\{+1, -1, 0\}^{n \times m}$ and y could be any vector in $\{+1, -1\}^{n}$. For arbitrary X and y, there is no way to infer y from X with better performance than random guessing. Therefore, all label models implicitly or explicitly make some assumptions about the quality of the LFs, for example that the accuracy of each LF lies in a certain range (Ratner et al., 2016) or that the accuracies of the LFs are known or can be estimated in a certain way (Arachie & Huang, 2021a; b; Dawid & Skene, 1979). On top of this, most existing methods also make additional modeling assumptions. Specifically, most existing methods assume a distribution p(y[i]|X[i, :]; θ) and further assume that the distribution takes a certain form (e.g. probabilistic graphical models (PGM) (Ratner et al., 2016; Fu et al., 2020; Yu et al., 2022)). Our method only assumes that the majority of the LFs are better than random guessing, which is "minimal" compared to existing methods.
While our work focuses on PWS, there are other methods to reduce annotation cost. One important line of work is self-supervised learning where feature representations are learned from self-defined pseudolabels and can then be used for downstream tasks (Jaiswal et al., 2020; Misra & Maaten, 2020; Liu et al., 2021) . Another popular approach is active learning that interactively selects the most informative data points to annotate (Settles, 2012) .

3. PROBLEM SETUP

Given a binary classification task, let $n$ and $m$ denote the number of data points and the number of LFs, respectively. Let $X \in \{+1, -1, 0\}^{n \times m}$ denote a label matrix where $X[i, j] \in \{+1, -1, 0\}$ denotes the weak label of the $i$-th data point provided by the $j$-th LF. The values +1 and -1 denote the positive and negative classes respectively and 0 denotes abstention, i.e. an LF does not have enough information to label a data point as either positive or negative (Ratner et al., 2016). The goal of a label model is to infer the unknown ground-truth label vector $y \in \{+1, -1\}^{n}$ using $X$, which typically requires a learning process for each individual dataset, i.e., each $X$.

The Better-than-random Assumption. As discussed in Section 2, in the weak supervision literature there are often assumptions on the quality of LFs so that one can make a meaningful estimate of $y$ using $X$. Different methods make different assumptions (Ratner et al., 2016; Fu et al., 2020; Ratner et al., 2019; Arachie & Huang, 2021a). In this work, we assume that for each class, the majority of LFs are better than random. This assumption is realistic since LFs are typically written by humans, who might occasionally make mistakes when developing individual LFs, resulting in a small portion of worse-than-random LFs. Formally, this assumption can be expressed as:

$$\sum_{j=0}^{m-1} g(X, y, j, +1) > \frac{m}{2} \quad \text{and} \quad \sum_{j=0}^{m-1} g(X, y, j, -1) > \frac{m}{2}, \tag{1}$$

where $g(X, y, j, c)$ denotes whether the $j$-th LF is better than random for class $c$:

$$g(X, y, j, c) = \begin{cases} 1, & \text{if } \sum_{i=0}^{n-1} \mathbb{1}_{X[i,j]=c \,\&\, y[i]=c} > \sum_{i=0}^{n-1} \mathbb{1}_{X[i,j]=-c \,\&\, y[i]=c} \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

We define $\sigma(X, y) = 1$ when Equation 1 is satisfied and $\sigma(X, y) = 0$ otherwise. We say a pair $(X, y)$ is valid (or a vector $y$ is valid for a given $X$) when $\sigma(X, y) = 1$. Intuitively, $\sigma$ constrains the space of the predicted label vector $\hat{y}$: for a label matrix $X$, we would only predict one of the label vectors with $\sigma(X, \hat{y}) = 1$.
Note the method we will propose is not tied to our better-than-random assumption, and it also works with any other assumptions to define σ. Our Goal. Most existing methods aim to learn an individual label model for each dataset. In contrast, our goal is to learn a hyper label model under our better-than-random assumption. The learned hyper model can be applied to any unseen dataset and produces a prediction of the label vector y in a single forward pass without any form of dataset-specific parameter learning. 
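The validity check σ(X, y) from Equations 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation; the function names `better_than_random` and `sigma` and the small demo matrix are our own choices.

```python
import numpy as np

def better_than_random(X, y, c):
    """Vector of g(X, y, j, c) over all LFs j (Equation 2), as booleans."""
    in_class = (y == c)[:, None]                  # rows whose label is c
    correct = ((X == c) & in_class).sum(axis=0)   # LF votes c on a class-c row
    wrong = ((X == -c) & in_class).sum(axis=0)    # LF votes -c on a class-c row
    return correct > wrong                        # abstentions (0) count for neither

def sigma(X, y):
    """1 if, for each class, more than m/2 LFs are better than random (Equation 1)."""
    m = X.shape[1]
    return int(all(better_than_random(X, y, c).sum() > m / 2 for c in (+1, -1)))

# Small demo: 4 data points, 3 LFs; the third LF is no better than random.
X_demo = np.array([[ 1,  1, -1],
                   [ 1,  1,  1],
                   [-1, -1,  1],
                   [-1,  0, -1]])
y_demo = np.array([1, 1, -1, -1])
valid = sigma(X_demo, y_demo)        # two of three LFs are better than random per class
flipped = sigma(X_demo, -y_demo)     # flipping every label makes the pair invalid
```

Any other assumption on LF quality could be plugged in by replacing the body of `sigma`.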

4. AN ANALYTICAL OPTIMAL SOLUTION

We first show that there is an optimal and analytical (therefore requiring no dataset-specific parameter learning) method to estimate $y$ based on $X$, i.e., $y^* = h^*(X)$. This makes a hyper label model possible.

For each label matrix $X$, let $U_y(X) = \{y \mid \sigma(X, y) = 1\}$ denote the set of valid candidate $y$s for $X$. The expected error of an estimator $h(X)$ of $y$ on each $X$ is:

$$\epsilon(X, h) = \sum_{y \in U_y(X)} p(y|X) \, \|y - h(X)\| \tag{3}$$

where $p(y|X)$ is a distribution of $y$ defined on the set $U_y(X)$ and $\|\cdot\|$ denotes the L2 loss (i.e. squared error). $p(y|X)$ is unknown and can differ across real-world applications (datasets). Without additional information apart from $X$, there is no way to prefer some valid choices of $y$ over other valid choices of $y$, so the uniform distribution (i.e. $p'(y|X) = \frac{1}{|U_y(X)|}$) is intuitively a good approximation of the unknown $p(y|X)$. In fact, using the uniform distribution is optimal in both the worst case and the average case. To maintain the flow of the paper, we defer the formal definitions and proofs of these optimalities to Appendix A. Replacing $p(y|X)$ by the uniform distribution, Equation 3 becomes:

$$\epsilon'(X, h) = \frac{1}{|U_y(X)|} \sum_{y \in U_y(X)} \|y - h(X)\| \tag{4}$$

$\epsilon'(X, h)$ can be interpreted as the average error over all possible outcomes. An estimator $h$ is said to be optimal if it minimizes the error $\epsilon'(X, h)$, $\forall X$.

Theorem 1. $\forall X$, $h^*(X) = \frac{1}{|U_y(X)|} \sum_{y \in U_y(X)} y$ is an optimal estimator of the ground truth in the sense that it minimizes $\epsilon'(X, h)$.

We omit the proof as it is straightforward (the mean minimizes the mean squared error). Theorem 1 makes sense intuitively: since $X$ is the only information we have, $y$ can be any element of $U_y(X)$ and there is no information to support preferring some elements of $U_y(X)$ over others, so the best prediction one can make is the average of all elements of $U_y(X)$.
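For very small n, the estimator of Theorem 1 can be computed by brute force, which also makes the exponential cost visible: all 2^n candidate vectors are enumerated. The sketch below is only an illustration under that restriction; `sigma` re-implements the validity check of Equations 1 and 2, and `h_star` is our own name for the enumeration.

```python
import itertools
import numpy as np

def sigma(X, y):
    """Validity check of Equations 1 and 2 (majority of LFs better than random per class)."""
    m = X.shape[1]
    for c in (+1, -1):
        in_class = (y == c)[:, None]
        correct = ((X == c) & in_class).sum(axis=0)
        wrong = ((X == -c) & in_class).sum(axis=0)
        if (correct > wrong).sum() <= m / 2:
            return 0
    return 1

def h_star(X):
    """Average of all valid label vectors; enumerates 2^n candidates, so tiny n only."""
    n = X.shape[0]
    valid = [np.array(y) for y in itertools.product((+1, -1), repeat=n)
             if sigma(X, np.array(y)) == 1]
    if not valid:
        raise ValueError("no valid label vector exists for this X")
    return np.mean(valid, axis=0)

# One LF voting +1 on the first point and -1 on the second:
# the only valid label vector agrees with the LF on both points.
h_small = h_star(np.array([[1], [-1]]))

# A 4 x 3 example used to check the symmetry properties of h*.
X_demo = np.array([[1, 1, -1], [1, 1, 1], [-1, -1, 1], [-1, 0, -1]])
h_demo = h_star(X_demo)
```

Because σ does not depend on the order of the LFs, and permuting rows of X permutes the valid set accordingly, the brute-force estimator is invariant to column permutations and equivariant to row permutations.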

5. LEARNING THE HYPER LABEL MODEL

Although we have the analytical form of the optimal estimator $h^*$, computing it is of exponential complexity as $U_y(X)$ is exponentially large for any $X$. Therefore, we propose to train a neural network model $h$ to approximate the optimal estimator $h^*$. The trained model is a hyper label model that can infer the labels for a new dataset in a single forward pass. To materialize this idea, we need to answer the following questions: (1) What training data to use? (2) What model architecture to use? We discuss these questions in the following sections.

5.1. TRAINING DATA GENERATION

Given a training set $D = \{(X_1, y_1), \ldots\}$ and the cross-entropy loss $\ell_{CE}(\cdot, \cdot)$, our learning objective is:

$$\arg\min_h L(h, D) = \arg\min_h \sum_{i=1}^{|D|} \sum_{j=1}^{n} \ell_{CE}\big(h(X_i)[j], \, y_i[j]\big), \tag{5}$$

where we use the notation $[j]$ to index the $j$-th item of the preceding vector. The key challenge is how to obtain the training dataset $D$. Naively, we could generate a random $X$ and then use the analytical method to find $y^*$, which is however computationally intractable. Therefore, we design an efficient data generation method that ensures the model trained on our generated data still approximates $h^*$. The following theorem shows that, given an $X$, uniformly sampling a valid $y$, i.e., $y \in U_y(X)$, to compose the training dataset $D$ ensures that the learned hyper label model is asymptotically close to the analytical solution.

Theorem 2. $\forall X \in D$, if the corresponding $y$ is uniformly sampled and valid, then when $|D| \to +\infty$, $\arg\min_h L(h, D) \to h^*(X) = \frac{1}{|U_y(X)|} \sum_{y \in U_y(X)} y$.

See the proof in Appendix B. Based on the theorem, we derive the following training data generation method such that for every $X$, the corresponding $y$ is uniformly sampled and valid.

Step 1: We first randomly generate the shape of $X$ by drawing $m$ (and $n$) uniformly from $[L_m, H_m]$ (and $[L_n, H_n]$). We provide details of how to choose $L_m$, $H_m$, $L_n$ and $H_n$ in Appendix E and empirically show that the trained model generalizes very well outside of the given range of shapes (in fact, 13 of the 14 datasets we evaluate on are outside of this range).

Step 2: Given the shape of $X$, we then generate $X$ and the corresponding $y$ with the values being sampled uniformly at random.

Step 3: If $\sigma(X, y) = 1$, we keep the pair as a training data point.

This process (Steps 1, 2, and 3) is repeated until we obtain enough training data. Since $y$ is generated uniformly, for any two different and valid vectors $y_1$ and $y_2$ with $\sigma(X, y_1) = \sigma(X, y_2) = 1$, the probability of generating $y_1$ equals the probability of generating $y_2$, i.e. $p(y|X)$ is uniform. The probability of generating a valid pair in one trial is about 0.2 (see Appendix C).
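The three generation steps amount to rejection sampling and can be sketched as follows. The shape ranges `n_range` and `m_range` are placeholder values chosen for illustration (the actual values of L_n, H_n, L_m, H_m are given in Appendix E), and sampling the entries of X uniformly from {+1, -1, 0} is our assumption about what "uniformly at random" means for X.

```python
import numpy as np

def sigma(X, y):
    """Validity check of Equations 1 and 2."""
    m = X.shape[1]
    for c in (+1, -1):
        in_class = (y == c)[:, None]
        correct = ((X == c) & in_class).sum(axis=0)
        wrong = ((X == -c) & in_class).sum(axis=0)
        if (correct > wrong).sum() <= m / 2:
            return 0
    return 1

def sample_valid_pair(rng, n_range=(5, 20), m_range=(3, 8), max_trials=10_000):
    """Repeat Steps 1-3 until a valid (X, y) pair is found."""
    for _ in range(max_trials):
        # Step 1: draw the shape of X (ranges are illustrative placeholders).
        n = int(rng.integers(n_range[0], n_range[1] + 1))
        m = int(rng.integers(m_range[0], m_range[1] + 1))
        # Step 2: draw X and y uniformly at random.
        X = rng.choice([-1, 0, 1], size=(n, m))
        y = rng.choice([-1, 1], size=n)
        # Step 3: keep the pair only if it satisfies the assumption.
        if sigma(X, y) == 1:
            return X, y
    raise RuntimeError("no valid pair found within max_trials")

rng = np.random.default_rng(0)
X_s, y_s = sample_valid_pair(rng)
batch = [sample_valid_pair(rng) for _ in range(50)]   # one on-the-fly batch of pairs
```

Because invalid pairs are simply discarded, every valid y for a given X keeps the same probability of being drawn, which is the property Theorem 2 relies on.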

5.2. MODEL ARCHITECTURE DESIGN

The input of the model $h$ is a matrix $X$ of size $n \times m$ and the output is a vector $\hat{y}$ of size $n$. Thus, a reasonable instantiation of $h$ should satisfy the following three properties: (1) Ability to accept arbitrary input size: the size of $X$ differs across datasets, so $h$ must accept a matrix of arbitrary size. (2) Invariance to permutation of LFs: reordering the columns of $X$ should not change the predicted labels. (3) Equivariance to permutation of data points: reordering the rows of $X$ should reorder the predicted labels accordingly.

We argue that a graph neural network (GNN) is a good fit here since it can accept an input graph of arbitrary size and is permutation equivariant with respect to the nodes (Sanchez-Lengeling et al., 2021). Therefore, we represent the input matrix $X$ as a graph and then use a GNN for $h$ in order to satisfy the aforementioned properties. Specifically, the left-most matrix and graph in Figure 1 illustrate how we represent an input matrix of size $3 \times 2$ as a graph. Entry $X[i, j]$, the weak label of the $i$-th data point provided by the $j$-th LF, is represented as a node $V_{i,j}$ with value $X[i, j]$. There are two types of edges: solid yellow edges and dashed blue edges. Nodes from the same LF (i.e. the same column of $X$) are connected with solid yellow edges and nodes from the same data point (i.e. the same row of $X$) are connected with dashed blue edges. The graph representation $G$ loses no information, as one can recover $X$ (or its permutation $X[P_n, P_m]$) from $G$.

In graph $G$, if we only look at dashed blue edges, there are $n$ strongly connected components, each corresponding to one data point. Specifically, the strongly connected component $SCC_i = \{V_{i,0}, V_{i,1}, \ldots\}$ corresponds to the $i$-th data point. The overall model architecture is shown in Figure 1: first we encode the input graph with a GNN of $K$ layers, where each node $V_{i,j}$ is encoded with embedding $V^k_{i,j}$ at the $k$-th layer; then, after the final layer, we obtain an embedding for each $SCC_i$ (i.e. each data point) by pooling all of its nodes, $V^K_{i,:} = \frac{1}{m} \sum_j V^K_{i,j}$; finally, the embedding of each $SCC_i$ is passed to a multilayer perceptron (MLP) to obtain the final prediction. This architecture satisfies all three properties (see Appendix D.1).

We adopt the standard design of GNNs. Since we have two types of edges, we perform message passing separately for the neighboring nodes connected by each type of edge. Specifically, at the $k$-th layer of the GNN, the embedding $V^k_{i,j}$ of the node $V_{i,j}$ is obtained as:

$$V^k_{i,j} = f_k\Big(A_k\Big(W^k_1 \frac{1}{n} \sum_q V^{k-1}_{q,j}, \; W^k_2 \frac{1}{m} \sum_q V^{k-1}_{i,q}, \; W^k_3 \frac{1}{nm} \sum_{q,l} V^{k-1}_{q,l}, \; W^k_4 V^{k-1}_{i,j}\Big)\Big) \tag{6}$$

where $W^k_1, \ldots, W^k_4$ are weight matrices; $\frac{1}{n} \sum_q V^{k-1}_{q,j}$ denotes average pooling over the neighboring nodes of $V_{i,j}$ connected by solid yellow edges and $\frac{1}{m} \sum_q V^{k-1}_{i,q}$ denotes average pooling over the neighboring nodes of $V_{i,j}$ connected by dashed blue edges. We use average pooling because the graph can be of variable size, as recommended by Sanchez-Lengeling et al. (2021), and we also include the node's previous embedding $V^{k-1}_{i,j}$ in the average in case the node has no neighbors (this is equivalent to adding a self-edge to each node). We also add the global context of the graph, $\frac{1}{nm} \sum_{q,l} V^{k-1}_{q,l}$, to enable message passing beyond neighboring nodes, following standard practice (Gilmer et al., 2017; Battaglia et al., 2018). $A_k(\cdot, \cdot, \cdot, \cdot)$ denotes an aggregation operation, for which we use simple concatenation; $f_k$ denotes a linear layer with ReLU activation.

Handling Abstention. Handling abstention is straightforward in our approach: we simply remove the corresponding nodes from the graph. For example, when the $j$-th LF abstains on the $i$-th data point, we remove the node $V_{i,j}$ from the graph.
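The layer update in Equation 6 and the pooling step can be sketched with plain NumPy to check the two permutation properties numerically. This is a simplified illustration with randomly initialized (untrained) weights, not the released model: it assumes X contains no abstentions (so no nodes are removed), and the layer sizes, the two-layer MLP head, and all function names are our own choices.

```python
import numpy as np

def gnn_layer(V, W1, W2, W3, W4, Wf, bf):
    """One layer in the spirit of Equation 6; V has shape (n, m, d)."""
    n, m, _ = V.shape
    h = W1.shape[1]
    col = V.mean(axis=0, keepdims=True) @ W1        # same-LF average, (1, m, h)
    row = V.mean(axis=1, keepdims=True) @ W2        # same-data-point average, (n, 1, h)
    glob = V.mean(axis=(0, 1), keepdims=True) @ W3  # global context, (1, 1, h)
    own = V @ W4                                    # node's own embedding, (n, m, h)
    parts = [np.broadcast_to(p, (n, m, h)) for p in (col, row, glob, own)]
    cat = np.concatenate(parts, axis=-1)            # aggregation A_k = concatenation
    return np.maximum(cat @ Wf + bf, 0.0)           # f_k = linear layer + ReLU

def init_params(rng, num_layers=2, hidden=8):
    """Random weights for illustration only (a real model would be trained)."""
    layers, d = [], 1
    for _ in range(num_layers):
        Ws = [rng.normal(scale=0.5, size=(d, hidden)) for _ in range(4)]
        layers.append((*Ws, rng.normal(scale=0.5, size=(4 * hidden, hidden)), np.zeros(hidden)))
        d = hidden
    mlp = (rng.normal(scale=0.5, size=(hidden, hidden)), np.zeros(hidden),
           rng.normal(scale=0.5, size=(hidden, 1)), np.zeros(1))
    return {"layers": layers, "mlp": mlp}

def forward(X, params):
    """Map an n x m label matrix to n probabilities of the positive class."""
    V = X[..., None].astype(float)                  # node value X[i, j] as a 1-d feature
    for layer in params["layers"]:
        V = gnn_layer(V, *layer)
    pooled = V.mean(axis=1)                         # pool the nodes of each data point
    Wa, ba, Wb, bb = params["mlp"]
    logits = np.maximum(pooled @ Wa + ba, 0.0) @ Wb + bb
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))

rng = np.random.default_rng(0)
params = init_params(rng)
X = rng.choice([-1, 1], size=(5, 4))
out = forward(X, params)
col_perm, row_perm = rng.permutation(4), rng.permutation(5)
out_cols = forward(X[:, col_perm], params)          # permute LFs
out_rows = forward(X[row_perm], params)             # permute data points
```

Since every operation is either per-node or a mean over rows, columns, or the whole matrix, the same parameters apply to any n and m, and only O(nm) work is needed per layer.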

5.3. MODEL INFERENCE ON UNSEEN DATASET

The trained hyper label model can be applied to any new dataset, with inference being simply a single forward pass. During the forward pass, different data points (rows of $X$) and different LFs (columns of $X$) exchange information through message passing in the GNN. This information exchange step can be regarded as the counterpart of the dataset-specific training step of other methods.

Inference Complexity. The complexity of a forward pass is dominated by the GNN. Although there are $O(mn^2)$ edges in the graph, there is no need to actually materialize them, and the complexity of each GNN layer is only $O(nm)$. In each GNN layer, of the three average pooling operations in Equation 6, the first has complexity $O(n)$ and needs to be computed once for each LF, i.e. $m$ times in total, so its complexity is $O(nm)$; similarly, the second and the third also have a complexity of $O(mn)$. Therefore, the time complexity of each GNN layer is $O(mn)$.

5.4. LEVERAGING GROUND TRUTH LABEL IF GIVEN

We pretrain a model $h_0$ on our synthetically generated data. When a small set of ground-truth labels is provided, our method can incorporate the labels by fine-tuning the model $h_0$ on the provided labels. Let $I$ denote the set of indices of the elements in $y$ that are provided. For example, $I = [2, 3]$ means that $y[2]$ and $y[3]$ are provided. Fine-tuning is done by minimizing the cross-entropy loss $\sum_{i \in I} \ell_{CE}(h(X)[i], y[i])$, with $h$ initialized as the pretrained model $h_0$. After fine-tuning we obtain a model $h'$, and all labels are then obtained by $h'(X)$. We note that the fine-tuning process is dataset-specific, i.e. fine-tuning is done independently and specifically for each dataset $X$ using the ground-truth labels of that dataset.

5.5. SUPPORTING MULTI-CLASS CLASSIFICATION TASK

So far we have only considered binary labels. Our trained model for binary labels can easily be used to support multi-class classification datasets by decomposing a multi-class task with $C$ classes into $C$ one-vs-rest binary classification tasks. For multi-class tasks, we have $X[i, j] \in \{0, 1, 2, \ldots, C\}$, where 0 still denotes abstention and the other numbers denote the classes. We construct the label matrix for the $c$-th class as $X_c[i, j] = 1$ if $X[i, j] = c$, $X_c[i, j] = 0$ if $X[i, j] = 0$, and $X_c[i, j] = -1$ otherwise. In this way, we obtain $C$ label matrices $\{X_1, \ldots, X_C\}$. We apply our pre-trained model $h_0$ to the label matrix of each class and obtain $C$ predicted probability vectors $(p_1, \ldots, p_C)$. Then, for the $i$-th data point, its soft label over the $C$ classes is

$$\left( \frac{p_1[i]}{\sum_c p_c[i]}, \; \ldots, \; \frac{p_C[i]}{\sum_c p_c[i]} \right).$$

We show in experiments that this simple method works well on multi-class datasets (4 of the 14 datasets we use are multi-class).
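The one-vs-rest decomposition and the normalization of the per-class probabilities can be sketched as follows. The function `placeholder_binary_model` is a stand-in (a smoothed vote fraction) used only so that the example runs end to end; it is not the hyper label model, and in practice the pre-trained model h_0 would take its place. The other names are our own as well.

```python
import numpy as np

def one_vs_rest(X, C):
    """Build X_1, ..., X_C: +1 where the LF votes c, 0 on abstention, -1 otherwise."""
    return [np.where(X == 0, 0, np.where(X == c, 1, -1)) for c in range(1, C + 1)]

def placeholder_binary_model(Xc):
    """Stand-in for h_0 (NOT the hyper label model): smoothed fraction of +1 votes."""
    pos = (Xc == 1).sum(axis=1)
    neg = (Xc == -1).sum(axis=1)
    return (pos + 1.0) / (pos + neg + 2.0)

def multiclass_soft_labels(X, C, binary_model):
    """Apply the binary model per class and normalize across classes."""
    P = np.stack([binary_model(Xc) for Xc in one_vs_rest(X, C)], axis=1)  # shape (n, C)
    return P / P.sum(axis=1, keepdims=True)

# 3 data points, 3 LFs, C = 3 classes; 0 denotes abstention.
X_mc = np.array([[1, 1, 2],
                 [3, 0, 3],
                 [0, 2, 2]])
soft = multiclass_soft_labels(X_mc, C=3, binary_model=placeholder_binary_model)
pred = soft.argmax(axis=1) + 1      # classes are numbered from 1
```

Each row of `soft` is a distribution over the C classes, so it can be used directly as a soft training label for an end model.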

6. EXPERIMENTS

We evaluate the performance of all label models under both unsupervised and semi-supervised settings. We provide additional experimental results on the performance of end models trained on the labels generated by different label models in Appendix F.2. The code and instructions to reproduce the experiments are in the supplementary materials.

Datasets. We use all 14 classification datasets in a recent weak supervision benchmark (Zhang et al., 2021), which come from diverse domains (e.g. income/sentiment/spam/relation/question/topic classification tasks). We emphasize that these datasets are only used for evaluation after our model is trained on synthetically generated data, and we never use them during training. Table 1 shows the statistics of all datasets. We also use the metrics provided by the benchmark (Zhang et al., 2021) for each dataset (as different datasets need different metrics depending on their application background). All LFs are from the original authors of each dataset and are hosted in the benchmark project (Zhang, 2022a).

We consider baselines for both unsupervised and semi-supervised label aggregation.

Unsupervised Baselines: (1) Majority Vote (MV). The predicted label of each data point is the most common label given by the LFs. (2) Data Programming (DP) (Ratner et al., 2016). DP uses a probabilistic graphical model (PGM) where each LF is a node and the hidden ground truth is a latent variable. (3) Flyingsquid (FS) (Fu et al., 2020). FS also uses a PGM but gives a closed-form solution under some assumptions. (4) MeTaL (Ratner et al., 2019). MeTaL infers the ground truth using a matrix completion model. The latest version of the popular Snorkel system (snorkel team, 2022b) adopts MeTaL as its default label model. (5) NPLM (Yu et al., 2022). This method is also based on a PGM and assumes LFs are conditionally independent. It supports partial LFs that predict a subset of class labels and is designed to be very efficient. (6) Dawid and Skene's method (DS) (Dawid & Skene, 1979). DS models the confusion matrix of each LF with respect to the ground-truth labels. The confusion matrix is learned by an Expectation-Maximization algorithm. (7) Enhanced Bayesian Classifier Combination (EBCC) (Li et al., 2019). This method models the joint distribution of LFs as a mixture of multiple low-dimensional tensors. (8) Constrained Label Learning (CLL) (Arachie & Huang, 2021a). This method also minimizes the average prediction error, where the error is defined using the unknown expected errors of each LF. (9) HLM. This is our learned Hyper Label Model.

Semi-supervised Baselines: (1) Semi-supervised DS (Dawid & Skene, 1979). This is the semi-supervised extension of Dawid and Skene's method. (2) AMCL-CC (Mazzetto et al., 2021). This method uses labeled data to construct feasibility constraints and provides performance guarantees. (3) Random Forest. This method trains a random forest classifier with X as the features using the provided labels. (4) Semi-supervised HLM. This is the semi-supervised version of our method HLM, obtained by fine-tuning on the provided labels.

Since the baseline methods require a dataset-specific learning step, we use the transductive setting (the data points used in unsupervised learning are also used to evaluate the learned model), following prior work (Mazzetto et al., 2021; Zhang, 2022b).

Implementation. We provide the implementation details of our method (e.g. setups and all parameters in data generation/model architecture/model training/validation) in Appendix E and the implementation details of the experiments (e.g. hardware/datasets/baselines/setups) in Appendix G. Since model training is important, we give a brief overview of training HLM here. We generate each batch of data on the fly with a batch size of 50, i.e. each batch consists of 50 generated pairs (X, y). We train our model until the training loss converges (the loss does not decrease for $10^4$ iterations), which takes about $5 \times 10^4$ iterations. We noticed that the performance of the trained model can vary across runs, so we use a synthetic validation set D to select the best run out of ten. The validation set is generated with a different method, taken from prior work (Zhang et al., 2021).

6.1. UNSUPERVISED LABEL AGGREGATION

The performance of all methods on all 14 datasets, averaged over five runs, is shown in Table 2. To keep the table readable, we only show error bars for the averaged scores. Again, for our method HLM, only synthetically generated data is used for training and the 14 datasets are only used to evaluate the trained model. MV and HLM are deterministic, while the other methods can give different results with different seeds. For HLM, the error bar is obtained by repeating the training process multiple times and then performing inference with the different trained models.

Main results. First, our results align with the benchmark (Zhang et al., 2021), where MeTaL is slightly better than MV. The difference in numbers from the benchmark (Zhang et al., 2021) arises because we use a transductive setting following (Mazzetto et al., 2021; Zhang, 2022b). Second, HLM outperforms the best baseline, CLL, by 1.4 points on average. Third, HLM is the best on 8 out of 14 datasets; on the remaining 6 datasets, HLM is the second best or is close to the second best method.

Efficiency. We report the running time in Table 3. When measuring the running time, we use a GPU for methods that support it (MeTaL, NPLM, and HLM). CPU-only running times are in Appendix F.1. HLM requires less than 1 second on every dataset. HLM is on average 6 times (and up to 18 times) faster than the fastest baseline (except Majority Vote) and is on average 50 times faster than the baseline with the best accuracy (CLL). This is because all prior methods (except Majority Vote) require an unsupervised learning process, while HLM performs prediction in a single forward pass, just like Majority Vote. We note that these 14 benchmark datasets are relatively small (as creating a large benchmark dataset with ground-truth labels is expensive). In industry scenarios, LFs can be applied to millions of data points to create labels (Bach et al., 2019). In such settings, the runtime gain of HLM would be more significant, and HLM would make the LF development process more interactive.

6.2. SEMI-SUPERVISED LABEL AGGREGATION

For each dataset, we randomly sample $N_{gt}$ data points as the data points with known ground-truth labels and we evaluate on the remaining data points. When $N_{gt} > 0.7n$, we only select $0.7n$ data points, keeping 30% of the data for evaluation in order to have a reliable evaluation score. We vary $N_{gt}$ from 10 to 10000. When fine-tuning HLM (with the method in Section 5.4), we use a smaller learning rate lr = 0.0001 to prevent overfitting (the original learning rate is lr = 0.001). Intuitively, when $N_{gt}$ is small, we trust the pre-trained HLM more than the provided labels; when $N_{gt}$ is large, we trust the provided labels more than the pre-trained HLM. Therefore, we tie the number of fine-tuning epochs to $N_{gt}$ by setting the number of epochs to $N_{gt}$.

The results are shown in Figure 2. When $N_{gt} > 40$, semi-supervised HLM outperforms unsupervised HLM. Semi-supervised HLM outperforms the other three baselines when $N_{gt} < 1000$ and ties with AMCL-CC and Random Forest when $N_{gt} > 1000$. We highlight that semi-supervised HLM is also the most efficient: e.g., when $N_{gt} = 10000$, the running time averaged over all datasets is 3.1 seconds for semi-supervised HLM, 61.3 seconds for semi-supervised DS, 261.8 seconds for AMCL-CC, and 4.8 seconds for Random Forest.

We perform an ablation study (under the unsupervised label aggregation setting) in three aspects: (1) We replace our data generation method with the one proposed in (Zhang et al., 2021), which was originally used to generate LFs to evaluate label models. (2) We replace our model architecture with a naive architecture based on MLP and with another architecture based on DeepSet (Zaheer et al., 2017) (see details in Appendix G), and report the better result of the two architectures. (3) We replace our better-than-random assumption in Equation 1 with the straightforward assumption that each LF is better than random in each class. The results are shown in Table 4. Replacing each component reduces performance. In particular, replacing our assumption with the straightforward assumption decreases performance because the assumption that each LF is better than random on each class is not satisfied in the real-world datasets.

7. CONCLUSION

We present a hyper label model for programmatic weak supervision, which infers the ground-truth labels for each dataset in a single forward pass and does not require any ad hoc dataset-specific parameter learning step. The hyper label model approximates an analytical optimal method (which is computationally intractable due to its exponential complexity). We generate synthetic training data that ensures the trained hyper label model approximates the analytical solution, and we design a model architecture based on GNN that makes the model invariant to the permutation of LFs and equivariant to the permutation of data points. We experimentally verify the superiority of the hyper label model in both accuracy and efficiency, in both unsupervised and semi-supervised label aggregation settings, over 14 benchmark datasets.

A OPTIMALITIES OF THE UNIFORM DISTRIBUTION

We aim to approximate an unknown distribution $p(y|X)$ (which can differ across applications) with a fixed distribution $q(y|X)$. Since both distributions are defined on a finite set $U_y(X)$, we can use the probabilities of the elements in $U_y(X)$ to represent each of the two distributions. Specifically, we represent $p(y|X)$ as $p = \{p_1, \dots, p_{|U_y(X)|}\}$ and $q(y|X)$ as $q = \{q_1, \dots, q_{|U_y(X)|}\}$. Similarly, we denote the uniform distribution (i.e. $p'(y|X) = \frac{1}{|U_y(X)|}$) as $u = \{u_1, \dots, u_{|U_y(X)|}\}$. Clearly, $\forall i$, we have $0 \le p_i \le 1$, $0 \le q_i \le 1$, $u_i = \frac{1}{|U_y(X)|}$, $\sum_i p_i = 1$, $\sum_i q_i = 1$, and $\sum_i u_i = 1$.

Using the uniform distribution $u$ to approximate $p$ is optimal in both the worst case and the average case. The two optimalities are formally defined as follows:

(1) Worst-case Optimal: The uniform distribution $u$ has the minimum maximum distance to the unknown distribution $p$:

$$u = \arg\min_q \max_p \mathrm{dist}(p, q) \tag{7}$$

where "dist" can be the KL divergence or the $L_\alpha$ distance, $\forall \alpha > 1$. Note that the conventional name is "$L_p$" distance, but to avoid reusing the same notation $p$ for different meanings, we use the name "$L_\alpha$" distance instead.

(2) Average-case Optimal: The uniform distribution $u$ has the minimum expected KL divergence to the unknown distribution $p$ under a mild assumption. Let $P(p)$ denote the probability of the unknown distribution being a specific distribution $p$ (e.g. $P(u)$ denotes the probability that the unknown distribution is the uniform distribution, i.e. that $p = u$). Formally:

$$u = \arg\min_q E_p[KL(p, q)] = \arg\min_q \int_p KL(p, q) P(p) \, dp$$

under the assumption that $P(p)$ is centrally symmetric, formally:

$$\int_p P(p) \, p \, dp = u \tag{9}$$

We provide a formal proof of the two optimalities below.

Proof for Worst-case Optimal.

Proof. We first prove Equation 7 for the KL divergence.

$$\max_p KL(p, q) = \max_p \sum_i p_i \log \frac{p_i}{q_i} = \max_p \sum_i \left( p_i \log p_i + p_i \log \frac{1}{q_i} \right) \tag{10}$$

The maximum of the first term is zero, as $p_i \log p_i \le 0$ due to $p_i \ge 0$ and $\log p_i \le 0$. This maximum is obtained when there is a $j$ such that $p_j = 1$ and $p_i = 0, \forall i \ne j$; $j$ also comes into play in the maximum of the second term. We have $\sum_i p_i \log \frac{1}{q_i} \le \sum_i p_i \max_k \log \frac{1}{q_k} = \max_k \log \frac{1}{q_k}$. Therefore, the maximum of the second term is $\max_i \log \frac{1}{q_i}$, which is obtained when $j = \arg\max_i \log(\frac{1}{q_i})$, $p_j = 1$ and $p_i = 0, \forall i \ne j$. The maximum of both terms is thus achieved at the same time, with $j = \arg\max_i \log(\frac{1}{q_i})$, $p_j = 1$ and $p_i = 0, \forall i \ne j$, and the maximum value is $\max_i \log \frac{1}{q_i}$. Therefore,

$$\max_p KL(p, q) = \max_i \log \frac{1}{q_i} = \log \frac{1}{\min_i q_i} \ge \log |U_y(X)| \tag{11}$$

The inequality holds because $\min_i q_i \le \frac{1}{|U_y(X)|}$ (otherwise, $\sum_i q_i > \sum_i \frac{1}{|U_y(X)|} = 1$). Equality is obtained when $q$ is the uniform distribution, i.e. $q = u$. Therefore, $u = \arg\min_q \max_p KL(p, q)$.

Published as a conference paper at ICLR 2023

Next, we prove Equation 7 for the $L_\alpha$ distance, $\forall \alpha > 1$. The $L_\alpha$ distance is defined as:

$$L_\alpha(p, q) = \left( \sum_i |p_i - q_i|^\alpha \right)^{1/\alpha} \tag{12}$$

Take the derivative of $L_\alpha(p, q)$ with respect to a $p_i$:

$$\frac{\partial L_\alpha(p, q)}{\partial p_i} = \begin{cases} \frac{1}{\alpha} \left( \sum_l |p_l - q_l|^\alpha \right)^{1/\alpha - 1} \alpha (p_i - q_i)^{\alpha - 1} & \text{if } p_i \ge q_i \\ -\frac{1}{\alpha} \left( \sum_l |p_l - q_l|^\alpha \right)^{1/\alpha - 1} \alpha (q_i - p_i)^{\alpha - 1} & \text{otherwise} \end{cases} \tag{13}$$

This means that if $p_i - q_i \ge p_j - q_j$, then $\frac{\partial L_\alpha(p, q)}{\partial p_i} \ge \frac{\partial L_\alpha(p, q)}{\partial p_j}$. Therefore, replacing $p_i, p_j$ with $p_i + \delta, p_j - \delta$ where $\delta > 0$ increases $L_\alpha(p, q)$, and eventually replacing $p_i, p_j$ with $p_i + p_j, 0$ increases $L_\alpha(p, q)$. Let $k = \arg\max_i (p_i - q_i)$. For each pair $(p_k, p_i)$, $i \ne k$, we replace $p_k$ with $p_k + p_i$ and $p_i$ with $0$, and eventually we have $p_k = 1$ and $p_i = 0, i \ne k$:

$$\left( \sum_i |p_i - q_i|^\alpha \right)^{1/\alpha} \le \left( \sum_{i \ne k} |q_i|^\alpha + |1 - q_k|^\alpha \right)^{1/\alpha} \tag{14}$$

When $k = \arg\min_i q_i$, the right-hand side is further maximized. Without loss of generality, we can assume $q_1 \le q_2 \le \dots \le q_{|U_y(X)|}$. Therefore:

$$\max_p L_\alpha(p, q) = \left( (1 - q_1)^\alpha + \sum_{i > 1} q_i^\alpha \right)^{1/\alpha}$$

By Hölder's inequality (Hardy et al., 1988):

$$\sum_{i > 1} q_i^\alpha \ge (|U_y(X)| - 1)^{1 - \alpha} \left( \sum_{i > 1} q_i \right)^\alpha = (|U_y(X)| - 1)^{1 - \alpha} (1 - q_1)^\alpha$$

where equality is obtained when $q_2 = \dots = q_{|U_y(X)|}$. Therefore:

$$\begin{aligned} \max_p L_\alpha(p, q) &\ge \left( (1 - q_1)^\alpha + (|U_y(X)| - 1)^{1 - \alpha} (1 - q_1)^\alpha \right)^{1/\alpha} \\ &\ge \left( \left(1 - \frac{1}{|U_y(X)|}\right)^\alpha + (|U_y(X)| - 1)^{1 - \alpha} \left(1 - \frac{1}{|U_y(X)|}\right)^\alpha \right)^{1/\alpha} \\ &= \left( \left( \frac{|U_y(X)| - 1}{|U_y(X)|} \right)^\alpha + \frac{|U_y(X)| - 1}{|U_y(X)|^\alpha} \right)^{1/\alpha} \end{aligned} \tag{17}$$

where the second inequality holds because $\left( (1 - q_1)^\alpha + (|U_y(X)| - 1)^{1 - \alpha} (1 - q_1)^\alpha \right)^{1/\alpha}$ monotonically decreases as $q_1$ increases, and $q_1 \le \frac{1}{|U_y(X)|}$ because $q_1$ is the minimum in $q$, i.e. $q_1 \le q_2 \le \dots \le q_{|U_y(X)|}$. In summary, the minimum of $\max_p L_\alpha(p, q)$ is obtained when $q_2 = q_3 = \dots = q_{|U_y(X)|}$ and $q_1 = \frac{1}{|U_y(X)|}$, which means $q_1 = q_2 = q_3 = \dots = q_{|U_y(X)|} = \frac{1}{|U_y(X)|}$. In other words, $q = u$. Therefore, $u = \arg\min_q \max_p L_\alpha(p, q)$.

Proof for Average-case Optimal.

Proof.

$$\begin{aligned} E_p[KL(p, q)] &= \int_p KL(p, q) P(p) \, dp = \int_p P(p) \sum_i p_i \log \frac{p_i}{q_i} \, dp \\ &= \int_p P(p) \sum_i p_i \log p_i \, dp - \int_p P(p) \sum_i p_i \log q_i \, dp \end{aligned}$$

Since the first term does not depend on $q$, we have:

$$\begin{aligned} E_p[KL(p, q)] &= \text{constant} - \int_p P(p) \sum_i p_i \log q_i \, dp \\ &= \text{constant} - \sum_i \log q_i \int_p P(p) \, p_i \, dp \\ &= \text{constant} - \sum_i u_i \log q_i \end{aligned} \tag{19}$$

where the last equality follows from the assumption that $P(p)$ is centrally symmetric, i.e. $\int_p P(p) \, p \, dp = u$. Therefore:

$$\begin{aligned} E_p[KL(p, q)] &= \text{constant} - \sum_i u_i \log q_i \\ &= \text{constant} - \frac{1}{|U_y(X)|} \sum_i \log q_i \\ &= \text{constant} - \frac{1}{|U_y(X)|} \log \prod_i q_i \\ &\ge \text{constant} - \frac{1}{|U_y(X)|} \log \left( \left( \frac{\sum_i q_i}{|U_y(X)|} \right)^{|U_y(X)|} \right) \\ &= \text{constant} + \log(|U_y(X)|) \end{aligned}$$

where the inequality is the inequality of arithmetic and geometric means. Equality is obtained when $q_1 = q_2 = \dots = q_{|U_y(X)|} = \frac{1}{|U_y(X)|}$, i.e. $q = u$. Therefore, $u = \arg\min_q E_p[KL(p, q)]$.
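The worst-case result for the KL divergence can be checked numerically on a small example. The sketch below is illustrative only: the support size `K = 5`, the random seed, and the helper names `kl` and `worst_case_kl` are assumptions made for the example. It compares the closed form $\log \frac{1}{\min_i q_i}$ with a brute-force search over random $p$.

```python
import math
import random


def kl(p, q):
    """KL(p, q) with the convention 0 * log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def worst_case_kl(q):
    """Closed form of max_p KL(p, q): attained by a one-hot p on argmin_i q_i."""
    return math.log(1.0 / min(q))


K = 5                                   # |U_y(X)| in the toy example
uniform = [1.0 / K] * K

rng = random.Random(0)
raw = [rng.random() + 1e-3 for _ in range(K)]
q = [r / sum(raw) for r in raw]         # an arbitrary non-uniform q

# One-hot p on the smallest entry of q attains the closed-form maximum.
j = q.index(min(q))
one_hot = [1.0 if i == j else 0.0 for i in range(K)]

# Brute force: no randomly drawn p should exceed the closed-form worst case.
max_seen = 0.0
for _ in range(2000):
    w = [rng.random() for _ in range(K)]
    p = [x / sum(w) for x in w]
    max_seen = max(max_seen, kl(p, q))
```

For the uniform `q` the worst case equals $\log K$, and any other `q` has a worst case at least that large, which matches Equation 11.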

B PROOF FOR THEOREM 2

Theorem 2. $\forall X \in D$, if the corresponding $y$ is uniformly sampled and valid, then when $|D| \to +\infty$,

$$\arg\min_h L(h, D) \to h^*(X) = \frac{1}{|U_y(X)|} \sum_{y \in U_y(X)} y.$$

Proof. For each $X$, let $D(X)$ denote the subset $\{(X, y_1), (X, y_2), \dots\}$ of $D$. The cross entropy loss on $D(X)$ is:

$$-\sum_{i=1}^{|D(X)|} \sum_{j=1}^{n} \left[ \frac{1 + y_i[j]}{2} \log\left( \frac{1 + h(X)[j]}{2} \right) + \left( 1 - \frac{1 + y_i[j]}{2} \right) \log\left( 1 - \frac{1 + h(X)[j]}{2} \right) \right]$$

where $n$ denotes the number of rows in $X$; we use $[j]$ to index the $j$th item of its preceding vector; and we convert the range of $y_i[j]$ and $h(X)[j]$ from $[-1, 1]$ to $[0, 1]$ by adding 1 and then dividing by 2. By taking the derivative and setting it to zero, the above expression is minimized when:

$$h(X)[j] = \frac{\sum_{i=1}^{|D(X)|} y_i[j]}{|D(X)|}, \quad \forall j$$

When $|D| \to +\infty$ (so that $|D(X)| \to +\infty$), by the law of large numbers (Dekking et al., 2005),

$$h(X)[j] = \frac{\sum_{i=1}^{|D(X)|} y_i[j]}{|D(X)|} \to E(y[j] \mid X).$$

Since $p(y|X)$ is uniform, $E(y[j] \mid X) = \sum_{y \in U_y(X)} \frac{1}{|U_y(X)|} y[j]$ for all $j$. This means $h(X) = \sum_{y \in U_y(X)} \frac{1}{|U_y(X)|} y = h^*(X)$.
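The key step of the proof, that the cross entropy loss for one coordinate is minimized by the sample mean of the labels, can be verified numerically. The following sketch uses a made-up list of labels `ys` and a simple grid search; the function name `cross_entropy` and the grid resolution are assumptions for illustration.

```python
import math


def cross_entropy(h, ys):
    """Cross entropy of a prediction h in (-1, 1) against labels ys in {-1, +1}.

    Both are mapped from [-1, 1] to [0, 1] via (1 + v) / 2, as in the proof.
    """
    b = (1 + h) / 2
    return -sum(
        ((1 + y) / 2) * math.log(b) + (1 - (1 + y) / 2) * math.log(1 - b)
        for y in ys
    )


ys = [1, 1, 1, -1, 1, -1, 1, 1]          # toy labels for one coordinate j
mean_y = sum(ys) / len(ys)               # sample mean of y_i[j]

# Grid search over h in (-1, 1) with step 0.001.
grid = [i / 1000 for i in range(-999, 1000)]
best_h = min(grid, key=lambda h: cross_entropy(h, ys))
```

The grid minimizer `best_h` coincides with `mean_y`, which is the per-coordinate statement $h(X)[j] = \frac{1}{|D(X)|}\sum_i y_i[j]$ used in the proof.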

C PROBABILITY OF GENERATING A VALID PAIR

To simplify our analysis, in the following we only consider $y$ that contains both $-1$ and $+1$, which has probability $p_0 = 1 - \frac{2}{2^n}$; $p_0 \approx 1$ when $n \ge 100$ (when generating data, we sample $n$ from $[L_n, H_n] = [100, 2000]$, which we explain in Appendix E). Let $S$ denote the set of all possible pairs $(X, y)$ and let $U = \{(X, y) \mid \sigma(X, y) = 1\}$ denote the set of all valid pairs. $S$ is made up of three subsets: $U = \{(X, y) \mid \sigma(X, y) = 1\}$,

$$S_e = \left\{ (X, y) \;\middle|\; \sum_{j=0}^{m-1} g(X, y, j, -1) = \frac{m}{2} \text{ or } \sum_{j=0}^{m-1} g(X, y, j, 1) = \frac{m}{2} \right\},$$

and $S_c = S - U - S_e$. $S_c$ is in turn made up of three subsets, i.e. $S_c = S_{c1} \cup S_{c2} \cup S_{c3}$, where

$$\begin{aligned} S_{c1} &= \left\{ (X, y) \;\middle|\; \sum_{j=0}^{m-1} g(X, y, j, -1) < \frac{m}{2} \text{ and } \sum_{j=0}^{m-1} g(X, y, j, 1) < \frac{m}{2} \right\}, \\ S_{c2} &= \left\{ (X, y) \;\middle|\; \sum_{j=0}^{m-1} g(X, y, j, -1) < \frac{m}{2} \text{ and } \sum_{j=0}^{m-1} g(X, y, j, 1) > \frac{m}{2} \right\}, \\ S_{c3} &= \left\{ (X, y) \;\middle|\; \sum_{j=0}^{m-1} g(X, y, j, -1) > \frac{m}{2} \text{ and } \sum_{j=0}^{m-1} g(X, y, j, 1) < \frac{m}{2} \right\}. \end{aligned}$$

Lemma 3. $|S_{c1}| = |S_{c2}| = |S_{c3}| = |U|$.

Proof. For each element $(X, y)$ in $S_{c1}$, $\sum_{j=0}^{m-1} g(X, y, j, -1) < \frac{m}{2}$ and $\sum_{j=0}^{m-1} g(X, y, j, 1) < \frac{m}{2}$. We can flip $X[y = 1, :]$ to be $-X[y = 1, :]$ and flip $X[y = -1, :]$ to be $-X[y = -1, :]$. After flipping, we obtain a pair $(X', y)$ with $(X', y) \in U$. This means that for each element in $S_{c1}$ there is a corresponding element in $U$, so we have $|S_{c1}| \le |U|$. Similarly, for each element in $U$, we can do the flipping to get an element in $S_{c1}$, so we also have $|U| \le |S_{c1}|$. Therefore, $|U| = |S_{c1}|$. Similarly, one can show that $|U| = |S_{c2}|$ and $|U| = |S_{c3}|$.

By Lemma 3, $|S_c| = |S_{c1}| + |S_{c2}| + |S_{c3}| = 3|U|$. When $m$ is odd, $|S_e| = 0$. Therefore, the probability of a randomly generated pair being valid is:

$$p((X, y) \in U \mid m) = \frac{|U|}{|S|} = \frac{|U|}{|U| + |S_e| + |S_c|} = \frac{1}{4}$$

Next, we consider the case when $m$ is even. To simplify our analysis, approximately, $p(g(X, y, j, -1)) = \frac{1}{2}$ and $p(g(X, y, j, 1)) = \frac{1}{2}$. This is because the probability that the number of correct elements exactly equals the number of incorrect elements for each class is extremely small, as $n$ is relatively large. Therefore, averaging over the values of $m$ in $[L_m, H_m]$, we have:

$$\begin{aligned} p((X, y) \in U) &= \sum_{\substack{L_m \le m \le H_m \\ m \bmod 2 = 1}} \frac{1}{H_m - L_m} \cdot \frac{1}{4} + \sum_{\substack{L_m \le m \le H_m \\ m \bmod 2 = 0}} \frac{1}{H_m - L_m} \left( 1 - \frac{\binom{m}{m/2}}{2^m} \right) \frac{1}{4} \\ &= \frac{1}{2} \times \frac{1}{4} + \sum_{\substack{L_m \le m \le H_m \\ m \bmod 2 = 0}} \frac{1}{H_m - L_m} \left( 1 - \frac{\binom{m}{m/2}}{2^m} \right) \frac{1}{4} \\ &\approx 0.232 \end{aligned}$$

This means the probability of generating a valid pair in one trial is about 0.232.
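The value 0.232 can be reproduced by evaluating the sum directly. The sketch below follows the formula as written above, including the weight $\frac{1}{H_m - L_m}$, with $[L_m, H_m] = [2, 60]$; the function and variable names are chosen for the example only.

```python
from math import comb

L_M, H_M = 2, 60                      # range of the number of LFs m


def p_valid_given_m(m):
    """Probability that a random pair is valid for a fixed m."""
    if m % 2 == 1:
        return 0.25                   # odd m: |S_e| = 0, so exactly 1/4
    # even m: exclude the tie case, whose probability is C(m, m/2) / 2^m
    return (1 - comb(m, m // 2) / 2 ** m) * 0.25


weight = 1 / (H_M - L_M)              # weight used in the formula above
p_valid = sum(weight * p_valid_given_m(m) for m in range(L_M, H_M + 1))
```

Rounded to three decimals, `p_valid` agrees with the value of about 0.232 stated above, so roughly one in four to five sampled pairs passes the validity check.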

D DISCUSSIONS

D.1 THE PROPOSED ARCHITECTURE SATISFIES THE THREE PROPERTIES

The proposed architecture in Figure 1 satisfies the three properties mentioned at the beginning of Section 5.2 for the following reasons. First, a GNN accepts input of arbitrary size, so $X$ can be of any size. Second, a GNN is permutation equivariant with respect to the nodes, so the output embeddings of the GNN are equivariant to the permutation of data points and LFs. Finally, after average pooling for each data point over all LFs (each SCC with dashed blue edges), the network is invariant to the permutation of LFs and remains equivariant to the permutation of data points.

D.2 CROWDSOURCING METHODS FOR WEAK SUPERVISION

The two crowdsourcing methods have the worst performance in Table 2. Why crowdsourcing methods do not work well on weak supervision datasets has not been investigated or discussed in prior work, so we offer a conjecture. First, the label matrix in crowdsourcing tends to be extremely sparse, as there can be many crowd workers and each worker might annotate only a few data points before quitting (Zheng et al., 2017); in contrast, in weak supervision each LF is applied to each data point. Second, since crowd workers are humans, the labels they provide tend to have higher accuracy; in contrast, an LF applied to data unseen by the LF developer can produce very noisy labels. In other words, existing crowdsourcing methods are designed for the sparse scenario with weak labels of higher accuracy, so they do not work well in the weak supervision setting with a denser and noisier label matrix.

Since our data is synthetically generated, there is no need to generate a fixed training set. Our training data is generated on the fly, i.e. during training, when the data loader fetches the next pair $(X, y)$, a new pair is immediately generated and returned.

Model Architecture. We implement our model architecture in PyTorch (Paszke et al., 2019). We use $K = 4$ GNN layers. The embedding dimension of the GNN is 32, i.e. each node in the graph is encoded with a 32-dimensional embedding. The final MLP consists of three linear layers; the first two linear layers use ReLU activation and the last linear layer uses Sigmoid activation.

Model Training. We use the Adam optimizer (Kingma & Ba, 2014). We set amsgrad to true for better convergence (Reddi et al., 2019) and keep all other parameters at the default values provided by PyTorch (e.g. learning rate $lr = 0.001$). We use a batch size of 50, i.e. each batch consists of 50 generated pairs $(X, y)$. We also tested batch sizes of 16 and 250 and observed no meaningful difference.
We train our model until the training loss converges (the loss does not decrease within $10^4$ iterations), which takes about one day with $5 \times 10^4$ iterations on a K80 GPU. Note that one iteration means one gradient update, i.e. one batch; there is no notion of an epoch because training data is generated on the fly for each batch.

Validation. We also need to prevent our model from overfitting the training set. Unlike typical ML settings, where one has access to a validation set that is similar to the test set, in our setting no such validation set exists: when training our model, the real test datasets are unseen and we only have access to synthetic data. Our intuition is that when the model overfits the synthetically generated training set $D$, its performance will be poor on data that differs from the training set, for example on another synthetic dataset $D'$ that is generated in a different way. We synthetically generate the validation set $D'$ with size $|D'| = 100$ according to the generation method proposed in (Zhang et al., 2021); in this method, LFs are independent of each other conditioned on the ground-truth label. The way we use the validation set also differs from a typical setting. We train the model until the training loss converges (this typically requires about $5 \times 10^4$ iterations) and repeat this for 10 runs (i.e. we train our model 10 times from scratch). We then select the run with the highest validation accuracy averaged over all iterations (as validation accuracy might fluctuate over iterations), and we use the learned model at the final iteration of the selected run in our experiments. Our reasoning is as follows: (1) We do not use the validation set for early stopping (i.e. to select the best iteration in a run). In a typical ML setting, the validation set is used to select the best epoch/iteration.
This is possible because in a typical ML setting the validation set is similar to the test set and thus provides a very strong signal about which iteration is good for the test set. In our case, the validation set $D'$ can be very different from the test set, so the iteration selected based on $D'$ might not be a good iteration for the test set. (2) We use the validation set to select the best run. We observed that in different runs the curve of validation accuracy vs. number of iterations can differ (e.g. the two runs in Figure 3), so the test accuracy of the model can differ across runs. We would like to select the best run using the validation set $D'$. Intuitively, a run with better validation accuracy on average over all iterations is stably better (e.g. the yellow run in Figure 3), so we select the run with the best validation accuracy averaged over iterations. For example, for the two runs in Figure 3, although the highest validation accuracy of the purple run can be higher than that of the yellow run, the yellow run has a higher averaged validation accuracy over iterations and is much more stable, so we select the yellow run. We also observed that this run fluctuates less in validation accuracy, as shown in Figure 3. This suggests the model converges to a flat minimum, which is known to generalize better (Li et al., 2018; Keskar et al., 2016; Izmailov et al., 2018).

One natural question is why it is possible to select the best run but not the best iteration. The reason is that selecting the best run out of 10 runs requires much less information than selecting the best iteration out of $5 \times 10^4$ iterations. Since the validation set $D'$ can be very different from the test set, the information provided by $D'$ is very limited.
An interesting phenomenon in the validation accuracy curve of the yellow run in Figure 3 is that validation accuracy first increases, then decreases, and finally increases again. A similar trend was observed in prior work (on a different task) that also trains a model on synthetic data and validates on a different data distribution (Wu et al., 2022c). We believe this is a double descent phenomenon (Nakkiran et al., 2021) induced by the distributional difference between the training and validation sets.
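The run-selection rule described in the Validation paragraph (pick the run whose validation accuracy, averaged over all iterations, is highest, then use that run's final checkpoint) can be sketched as follows. The accuracy curves are made-up numbers that only mimic the qualitative shape of the two runs in Figure 3, and `select_run` is a hypothetical helper name.

```python
def select_run(val_curves):
    """Pick the run with the highest validation accuracy averaged over iterations.

    val_curves[r][t] is the validation accuracy of run r at checkpoint t.
    Returns the index of the selected run and the per-run averages.
    """
    means = [sum(curve) / len(curve) for curve in val_curves]
    best = max(range(len(val_curves)), key=lambda r: means[r])
    return best, means


# Toy curves (illustrative values, not measurements from the paper).
purple = [0.60, 0.90, 0.55, 0.88, 0.58]   # higher peak, but unstable
yellow = [0.78, 0.80, 0.79, 0.81, 0.82]   # lower peak, but stable

best, means = select_run([purple, yellow])
final_checkpoint_index = len(yellow) - 1   # the final iteration is the one used
```

Here the purple run has the higher peak accuracy, yet the yellow run is selected because its average is higher, mirroring the choice described above.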

F.1 RUNNING TIME

We report the running time for all methods in Table 5. For MV, DP, FS, DS, EBCC, and CLL, the running time is measured on CPU, as these methods do not support GPU. For MeTaL, NPLM and HLM, we report the running time on both CPU and GPU.

F.2 END MODEL PERFORMANCE

We use the generated labels of each method to train an end model for each dataset. We consider the three best performing baselines: MV, MeTaL and CLL. We use the test split provided by the benchmark (Zhang et al., 2021) for each dataset because some datasets only have ground-truth labels for data points in the provided test split. We then randomly split the remaining data points into a training set and a validation set with a 3:1 ratio. The labels in the training set and validation set are the labels generated by each label model, while the labels in the test set are ground-truth labels used for evaluation. Following prior work (Ratner et al., 2016; Zhang et al., 2021), the probabilistic labels instead of the hard labels are used to train the end model when possible. We adopt the end models used in (and their implementations provided by) the benchmark (Zhang et al., 2021; Zhang, 2022a), i.e. a pretrained BERT model (Devlin et al., 2018) for textual datasets and a multi-layer perceptron (MLP) for datasets with numeric features. We report the results on the test set in Table 6. Again, to keep the table readable, we only show error bars for the averaged scores. Our results align with those in the benchmark (Zhang et al., 2021), where the end model trained on labels generated by MeTaL is slightly worse than that trained on labels generated by MV. Overall, HLM outperforms the other three methods. On Yelp, Spouse, and SemEval, HLM ties with MV in label quality (see Table 2) but has better end model performance, as HLM's probabilistic labels can be more informative. Note that the scores of the end model can be higher than those of the generated labels (as also observed in the benchmark (Zhang et al., 2021) and prior work (Ratner et al., 2017)) because the end model incorporates additional information from the raw data.

G IMPLEMENTATION DETAILS OF EXPERIMENTS

Hardware. All of our experiments were performed on a machine with a 2.20GHz Intel Xeon(R) Gold 5120 CPU, a K80 GPU, and 96GB of 2666MHz RAM.

Datasets. We use the datasets prepared by the wrench benchmark on GitHub (Zhang, 2022a; Zhang et al., 2021). All the datasets and LFs were publicly released by previous work (Zhang et al., 2021). None of the datasets contains any personally identifiable information (Zhang et al., 2021). Originally, each dataset includes three files: "train.json", "valid.json" and "test.json". Following the suggestion in a reported issue of the wrench benchmark (Zhang, 2022b), we combine all three files to get a single matrix $X$ and a single ground-truth label vector $y$ for the experiments on label aggregation. We then split the datasets using the original split for the experiment on the end model (Appendix F.2). The information on the LFs as well as the raw data for each dataset can be found in the wrench benchmark project on GitHub (Zhang, 2022a).

Baselines. For each baseline, we use existing open-source implementations. The implementations of DS, DP, FS, MeTaL, EBCC, NPLM, and AMCL-CC are from (sukrutrao, 2022), (snorkel team, 2022a), (HazyResearch, 2022), (snorkel team, 2022b), (yuan li, 2022), (BatsResearch, 2022a),



(1) Ability to accept arbitrary input size: The number of data points $n$ and the number of LFs $m$ can vary across datasets. The model $h$ should be able to accept an input matrix $X$ of arbitrary size.
(2) Invariance to permutation of LFs: Intuitively, randomly shuffling the LFs should not change the prediction of any data point. Formally, let $P_m$ denote an arbitrary permutation of the $m$ integers in $[0, m-1]$. Invariance to permutation of LFs means that $h(X[:, P_m]) = h(X), \forall P_m$.
(3) Equivariance to permutation of data points: Similarly, randomly shuffling the data points should not change the prediction of each data point. Formally, equivariance to permutation of data points means that $h(X[P_n, :]) = h(X)[P_n], \forall P_n$, where $P_n$ is defined similarly to $P_m$.
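Properties (2) and (3) can be checked mechanically for any candidate model. The sketch below uses a deliberately simple stand-in for $h$, namely averaging each row of $X$ over the LFs; this toy function is an assumption for illustration and is not the GNN-based architecture of the paper. The helper names are likewise hypothetical.

```python
import random


def mean_pool_model(X):
    """Toy stand-in for h: one score per data point, averaged over all LFs."""
    return [sum(row) / len(row) for row in X]


def permute_cols(X, perm):
    """X[:, perm]: reorder the LFs (columns)."""
    return [[row[j] for j in perm] for row in X]


def permute_rows(X, perm):
    """X[perm, :]: reorder the data points (rows)."""
    return [X[i] for i in perm]


rng = random.Random(0)
n, m = 6, 4
X = [[rng.choice([-1, 0, 1]) for _ in range(m)] for _ in range(n)]

p_m = list(range(m))
rng.shuffle(p_m)                      # a permutation of the LFs
p_n = list(range(n))
rng.shuffle(p_n)                      # a permutation of the data points

h = mean_pool_model(X)
# (2) invariance: h(X[:, P_m]) == h(X)
invariant = mean_pool_model(permute_cols(X, p_m)) == h
# (3) equivariance: h(X[P_n, :]) == h(X)[P_n]
equivariant = mean_pool_model(permute_rows(X, p_n)) == [h[i] for i in p_n]
```

Property (1) holds for this stand-in as well, since the function makes no assumption about `n` or `m`. The same two checks could be run against any model that maps a label matrix to one prediction per row.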

Figure 1: Overall network architecture.

Figure 2: Semi-supervised performance.

m is uniformly sampled from [L m , H m ] = [2, 60] (which we explain in Appendix E), we have:

Figure 3: Accuracy on the synthetic validation set $D'$ vs. number of training iterations. The purple line and the yellow line are two different runs. The yellow run is selected as it is more stable and has a higher averaged validation accuracy.

The parameter $\theta$ must be learned for every new dataset before performing inference on each data point using the distribution $p(y[i] \mid X[i, :]; \theta)$. Instead, we aim to learn a hyper distribution $p(y \mid X, \Theta)$ over all possible datasets with a hyper label model. Once the hyper label model has learned $\Theta$, for any new dataset $X_{new}$ it can directly produce predictions using the distribution $p(y \mid X_{new}, \Theta)$ without needing to learn a dataset-specific parameter $\theta$.

14 classification datasets from the weak supervision benchmark (Zhang et al., 2021).
Dataset Census IMDB Yelp Youtube SMS Spouse CDR Commercial Tennis Basketball AGNews TREC SemEval ChemProt

Performance (F1 or acc score depending on the dataset) on all datasets.
Dataset Census IMDB Yelp Youtube SMS Spouse CDR Commercial Tennis Basketball AGNews TREC SemEval ChemProt AVG.

Running time (seconds) of label aggregation on all datasets.
Dataset Census IMDB Yelp Youtube SMS Spouse CDR Commercial Tennis Basketball AGNews TREC SemEval ChemProt AVG.



E IMPLEMENTATION DETAILS OF HLM

Data Generation. When generating each pair $(X, y)$, we first randomly generate $n$ and $m$, the numbers of rows and columns of the matrix $X$. Note that $n$ is the number of data points and $m$ is the number of LFs. We sample $n$ and $m$ uniformly from $[L_n, H_n]$ and $[L_m, H_m]$ respectively. We set $[L_n, H_n] = [100, 2000]$: $L_n = 100$ because there are typically at least hundreds of data points (otherwise it is not necessary to write LFs, as one can just manually label all data points), and $H_n = 2000$ due to the memory limit during model training. We set $[L_m, H_m] = [2, 60]$: $L_m = 2$ because with only one LF there is nothing to aggregate, and $H_m = 60$ due to the memory limit during model training. We highlight that, as shown in our experiments, our trained model generalizes well to numbers of LFs and data points (see Table 1) outside the ranges $[L_m, H_m]$ and $[L_n, H_n]$. Once we have $n$ and $m$, we invoke the method mentioned in Section 5.1 to generate $(X, y)$.
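The on-the-fly generation loop can be sketched as a Python generator. Several parts of this sketch are assumptions rather than the authors' implementation: the generation method of Section 5.1 is replaced by plain rejection sampling, the validity check reads $g(X, y, j, c)$ from Appendix C as "LF $j$ casts more correct than incorrect votes on points of class $c$", abstentions are encoded as 0, and all function names are hypothetical.

```python
import random


def lf_better_than_random(X, y, j, c):
    """Assumed reading of g(X, y, j, c): on points with label c, LF j casts
    more correct votes (c) than incorrect votes (-c); abstentions (0) ignored."""
    correct = sum(1 for i in range(len(y)) if y[i] == c and X[i][j] == c)
    wrong = sum(1 for i in range(len(y)) if y[i] == c and X[i][j] == -c)
    return correct > wrong


def is_valid(X, y):
    """Assumed validity check: for each class, more than half of the LFs
    are better than random on that class."""
    m = len(X[0])
    return all(
        sum(lf_better_than_random(X, y, j, c) for j in range(m)) > m / 2
        for c in (-1, 1)
    )


def sample_pair(rng, n_range=(100, 2000), m_range=(2, 60)):
    """Sample n and m uniformly, then rejection-sample a valid pair (X, y)."""
    n = rng.randint(*n_range)          # number of data points
    m = rng.randint(*m_range)          # number of LFs
    while True:
        y = [rng.choice((-1, 1)) for _ in range(n)]
        X = [[rng.choice((-1, 0, 1)) for _ in range(m)] for _ in range(n)]
        if is_valid(X, y):
            return X, y


def pair_stream(seed=0, **ranges):
    """Endless stream of freshly generated pairs; no fixed training set."""
    rng = random.Random(seed)
    while True:
        yield sample_pair(rng, **ranges)


# Small ranges keep the example fast; the paper uses [100, 2000] and [2, 60].
X, y = next(pair_stream(seed=0, n_range=(20, 40), m_range=(2, 6)))
```

A training loop would pull 50 such pairs per batch from the stream, so no pair is ever reused and the notion of an epoch does not apply.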

Running time (seconds) of label aggregation on all datasets with CPU and GPU.
Dataset Census IMDB Yelp Youtube SMS Spouse CDR Commercial Tennis Basketball AGNews TREC SemEval ChemProt AVG.

Performance of end model trained with labels generated by each method.
Dataset Census IMDB Yelp Youtube SMS Spouse CDR Commercial Tennis Basketball AGNews TREC SemEval ChemProt AVG.


and (BatsResearch, 2022b) respectively. For baselines that require class weights as priors, we report the best results from using uniform weights and using the weights estimated by majority vote.

Setup in Semi-supervised Label Aggregation. When sampling $N_{gt}$ data points as the data points with known labels, we make sure that each class has at least two data points. For random forest, we use the scikit-learn implementation (rfs, 2022). When training the random forest classifier, we use five-fold cross validation to perform grid search for the "max_depth" parameter over [2, 4, 8, 16, 32, None] and the "min_samples_split" parameter over [2, 5]. The AMCL-CC method does not support abstention; to make it work, we fill in the abstentions with labels provided by majority vote. AMCL-CC requires a lot of memory on some datasets and involves solving a constrained linear programming problem, which may not have a solution. When AMCL-CC fails due to an out-of-memory error or a no-solution-found error, we use the results from random forest. We repeat five runs and report results with error bars in Figure 2.

Setup in Ablation Study. For model architecture, we test two baselines. The first one is based on MLP. The input is a flattened vector of a fixed-size $2000 \times 50$ matrix (padded with zeros if the input matrix is smaller) and the network has 10 linear layers. The second one is based on DeepSet (Zaheer et al., 2017), where each row of $X$ is treated as a set. We use an open-source implementation (manzilzaheer, 2022). We replace each of our GNN layers with a DeepSet layer.
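The abstention handling described for AMCL-CC in the semi-supervised setup can be sketched as a small preprocessing step. The details below are assumptions, since the text does not specify them: abstentions are encoded as `-1` and classes as non-negative integers, ties in the vote are broken toward the smallest class id, and rows where every LF abstains fall back to a default class. The function name is hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # assumed encoding of an abstention


def fill_abstentions_with_mv(L, default=0):
    """Replace abstentions in each row by that row's majority-vote label.

    Ties are broken toward the smallest class id (an assumption); rows where
    every LF abstains are filled with `default` (also an assumption).
    """
    filled = []
    for row in L:
        votes = Counter(v for v in row if v != ABSTAIN)
        if votes:
            top = max(votes.values())
            mv = min(c for c, k in votes.items() if k == top)
        else:
            mv = default
        filled.append([mv if v == ABSTAIN else v for v in row])
    return filled


L = [
    [1, ABSTAIN, 1],              # majority vote is 1
    [ABSTAIN, ABSTAIN, ABSTAIN],  # no votes: falls back to the default class
    [0, 2, ABSTAIN],              # tie between 0 and 2: smallest id wins
]
filled = fill_abstentions_with_mv(L)
```

After this step the label matrix contains no abstentions, so a method without abstention support can consume it directly.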

Setup in End Model Experiment. When training the end model, the training set and validation set both use the labels generated by each method, and the test set uses ground-truth labels. For the two end models, MLP and BERT, we use the implementation provided by the benchmark (Zhang, 2022a; Zhang et al., 2021). We use grid search to tune hyper-parameters for each end model based on validation set performance. We use the same search space as the benchmark (Zhang et al., 2021), as summarized in Table 7. We repeat five runs and report the scores averaged over runs in Table 6.

