LEARNING HYPER LABEL MODEL FOR PROGRAMMATIC WEAK SUPERVISION

Abstract

To reduce human annotation effort, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and uses a label model to aggregate the outputs of multiple LFs to produce training labels. Most existing label models require a parameter learning step for each dataset. In this work, we present a hyper label model that (once learned) infers the ground-truth labels for each dataset in a single forward pass without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution of the ground-truth labels. We train the model on synthetic data generated in a way that ensures the model approximates the analytical optimal solution, and build the model upon a Graph Neural Network (GNN) so that the model's predictions are invariant (or equivariant) to the permutation of LFs (or data points). On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by six times on average). Our code is available at https://github.com/wurenzhi/hyper_label_model

1. INTRODUCTION

The lack of labeled training data is a major challenge impeding the practical application of machine learning (especially deep learning) techniques. Therefore, practitioners have increasingly turned to weak supervision, in which large amounts of cheaply generated noisy labels are used. There are many forms of weak supervision sources, e.g., external knowledge bases (Mintz et al., 2009), existing pre-trained models (Das et al., 2020; Wu et al., 2022b), and heuristics/rules (Shin et al., 2015). To unify different sources, the programmatic weak supervision (PWS) paradigm (Ratner et al., 2016; 2017; Zhang et al., 2022) was proposed. In PWS, the user expresses each available weak supervision signal from different sources with a labeling function (LF), a small program that takes in a data point and outputs a noisy label. After that, each LF is applied to unlabeled data of arbitrary size to obtain a noisy label vector; then, a label aggregation model (also referred to as a label model in the literature) is used to aggregate all noisy label vectors to infer the unknown ground-truth labels. The inferred labels can then be used to train any downstream end model. The PWS paradigm has been successful in various tasks (Wu et al., 2018; Fries et al., 2019; Lison et al., 2020; Wu et al., 2021; 2020; Li et al., 2021) and industry scenarios (Mathew et al., 2021; Bach et al., 2019; Dunnmon et al., 2020).

The core challenge in PWS is how to aggregate all noisy label vectors to infer the ground-truth labels. Let the label matrix X denote the noisy labels, where each column X[:, j] denotes the noisy label vector from the j-th LF and each row X[i, :] denotes the weak labels of the i-th data point; let y denote the ground-truth label vector. Most existing label models assume an underlying distribution p(y[i]|X[i, :]; θ) (Zhang et al., 2022), where y[i] is the label of the i-th data point and θ is the parameter of the distribution.
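To make the setup concrete, the simplest label model is a majority vote over each row of X. The sketch below is purely illustrative and not from the paper; the abstain marker (-1) and the two-class setting are our assumptions:

```python
import numpy as np

# Toy label matrix X: rows are data points, columns are LFs.
# Entries are noisy class labels in {0, 1}; -1 marks an abstaining LF
# (the abstain convention here is an illustrative assumption).
X = np.array([
    [1,  0,  1],
    [0,  0, -1],
    [1,  1,  1],
    [0,  1,  0],
])

def majority_vote(X, n_classes=2, abstain=-1):
    """Aggregate each row X[i, :] by taking its most common non-abstain label."""
    y_hat = np.empty(len(X), dtype=int)
    for i, row in enumerate(X):
        votes = row[row != abstain]                     # drop abstentions
        counts = np.bincount(votes, minlength=n_classes)
        y_hat[i] = counts.argmax()                      # most frequent label
    return y_hat

print(majority_vote(X))  # -> [1 0 1 0]
```

Majority vote needs no parameter learning but treats all LFs as equally reliable; the label models discussed next instead learn per-dataset parameters θ that capture LF quality.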
The parameter θ is first learned on the weak labels X = (X[1, :], X[2, :], . . . ) in an unsupervised and typically iterative way, and then inference is made using p(y[i]|X[i, :]; θ). In this approach, the parameter θ is dataset-specific and has to be learned for every different X (dataset). In contrast to existing solutions, we propose a hyper label model with the goal of reducing assumptions and eliminating the per-dataset parameter learning process. Specifically, we aim to develop a hyper model that enjoys two desiderata: (1) it works with "minimal" assumptions, i.e., we only assume that the majority of LFs are better than random, while not requiring knowledge of, or assuming any particular form for, the underlying distribution p(y[i]|X[i, :]; θ); (2) once the hyper model is learned, it can be used to infer y for any new X without an additional dataset-specific parameter learning process. To shed light on this direction, we first show, in theory, that without assuming an underlying distribution, there is an optimal and analytical (therefore requiring no parameter learning) way to estimate y based on X, i.e., y * = h * (X). However, such h * is intractable to compute since it involves averaging over a set whose size grows exponentially w.r.t. the size of X. Therefore, we propose to leverage the power of deep learning to approximate this solution, i.e., we seek an alternative function h parametrized by a neural network which, once learned, can estimate the label vector for a new dataset without an ad hoc dataset-specific learning process. Thus, we call the learned model a hyper label model. Materializing this idea involves two key questions: (1) How to generate training data? (2) How to design the model architecture? To generate training data, the straightforward solution is to use the analytical method to generate many pairs of (X, y * ) where y * = h * (X). However, computing y * with h * (X) is of exponential complexity.
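The difference between the two paradigms can be sketched as two interfaces. Both model bodies below are trivial stand-ins (not the paper's actual models, whose internals are described later); the point is only the shape of the API: fit-then-predict per dataset versus a single forward pass:

```python
import numpy as np

class ClassicLabelModel:
    """Existing approach: learn dataset-specific theta on X, then infer."""
    def fit(self, X):
        # Placeholder for the unsupervised, typically iterative, learning
        # of theta (e.g., EM over p(y[i] | X[i, :]; theta)).
        self.theta_ = X.mean(axis=0)  # stand-in parameter
        return self

    def predict(self, X):
        # Inference with the learned, dataset-specific theta.
        return (X.mean(axis=1) > 0.5).astype(int)

class HyperLabelModel:
    """Proposed approach: trained once offline; per dataset, inference
    is a single forward pass with no dataset-specific fitting."""
    def predict(self, X):
        # Stand-in for one forward pass of a pretrained network h(X) ~ h*(X).
        return (X.mean(axis=1) > 0.5).astype(int)

X_new = np.array([[1, 1, 0], [0, 0, 1]])
y1 = ClassicLabelModel().fit(X_new).predict(X_new)  # fit + predict per dataset
y2 = HyperLabelModel().predict(X_new)               # predict only
```

For a classic label model, the `fit` step must be repeated for every new X; the hyper label model amortizes that cost into a one-time offline training phase.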
We notice that for each X, h * (X) is an average of the label vectors from a certain set. Taking advantage of this, we are able to avoid directly generating y * (which is of exponential complexity) and design a way of generating an equivalent set of training data such that the trained model approximates h * (X). The model architecture has two requirements. First, it should accept an input matrix X of arbitrary size, as the size of X can differ across datasets. Second, the output of the model (i.e., the predicted label vector) should be invariant to permutations of the columns of X, as the order of the LFs should not impact the final predicted labels; the output should be equivariant to permutations of the rows of X, i.e., when the order of the data points in X is switched, the predicted labels should be switched accordingly. We note that a Graph Neural Network (GNN) is able to accept an input graph of arbitrary size and is permutation equivariant to the nodes of the graph (and can also be made permutation invariant by averaging over the nodes). Therefore, we propose to represent the input matrix X as a graph and then design a GNN to satisfy the two requirements.

Contributions. We make the following contributions: (1) We present, for the first time, an analytical method for label aggregation that is optimal in the sense that it minimizes a certain form of the averaged prediction error, though directly using the analytical method is of exponential complexity. (2) We train a model to learn the analytical method. The trained model is a hyper label model that can be used to infer the ground-truth labels for unseen datasets in a single forward pass without any dataset-specific parameter learning. (3) We design a synthetic training data generation method and show that the hyper label model trained on the synthetically generated data learns to approximate the analytical method.
(4) We design an effective model architecture based on a GNN so that the hyper label model is applicable to an arbitrary number of LF label vectors of arbitrary length and is invariant/equivariant to the permutation of LF label vectors/data points. (5) We show that our hyper label model outperforms the best existing methods on 14 real-world weak supervision datasets in both accuracy (by 1.4 points on average) and efficiency (by a speedup of six times on average) for both unsupervised and semi-supervised label aggregation.
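The two symmetry requirements above can be checked numerically. The toy model below is a simplified stand-in for the paper's GNN (not its actual architecture): pooling over columns with a symmetric function makes the output invariant to LF order, and applying the same map to every row makes it equivariant to data-point order:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_label_model(X):
    """A minimal architecture satisfying both symmetry requirements.
    Column-wise mean pooling is symmetric in the LFs (invariance);
    the row-wise nonlinearity commutes with row reordering (equivariance)."""
    pooled = X.mean(axis=1)   # symmetric over columns -> LF-order invariant
    return np.tanh(pooled)    # applied per row -> row-permutation equivariant

X = rng.integers(0, 2, size=(5, 4)).astype(float)

# Invariance to permuting LFs (columns): output is unchanged.
col_perm = rng.permutation(X.shape[1])
assert np.allclose(toy_label_model(X), toy_label_model(X[:, col_perm]))

# Equivariance to permuting data points (rows): output permutes along.
row_perm = rng.permutation(X.shape[0])
assert np.allclose(toy_label_model(X)[row_perm], toy_label_model(X[row_perm]))
```

A GNN over a graph representation of X generalizes this pattern: message passing is built from symmetric neighborhood aggregations, so the same invariance/equivariance properties hold by construction while allowing far richer interactions than a single mean.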

2. RELATED WORK

All existing methods (except majority vote) first learn some parameter θ ad hoc for each new dataset; inference is then performed based on the learned θ. Existing methods differ from one another in how they formulate the parameter θ and how they learn it (Zhang et al., 2022). For example, most methods assume an underlying distribution p(y[i]|X[i, :]; θ) (Ratner et al., 2016; 2019; Fu et al., 2020; Wu et al., 2022a; Yu et al., 2022) and focus on how to represent the distribution and how to learn its parameter θ. As another example, some approaches treat the accuracies of the LFs as parameters and then use iterative methods to learn these accuracy parameters (Arachie & Huang, 2021a; b; Dawid & Skene, 1979) for each

