GENERATIVE MODELING HELPS WEAK SUPERVISION (AND VICE VERSA)

Abstract

Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.

1. INTRODUCTION

How can we get the most out of data when we do not have ground truth labels? Two prominent paradigms operate in this setting. First, programmatic weak supervision frameworks use weak sources of training signal to train downstream supervised models, without needing access to groundtruth labels (Riedel et al., 2010; Ratner et al., 2016; Dehghani et al., 2017; Lang & Poon, 2021) . Second, generative models enable learning data distributions which can benefit downstream tasks, e.g. via data augmentation or representation learning, in particular when learning latent factors of variation (Higgins et al., 2018; Locatello et al., 2019; Hu et al., 2019) . Intuitively, these two paradigms should complement each other, as each can be thought of as a different approach to extracting structure from unlabeled data. However, to date there is no simple way to combine them. Fusing generative models with weak supervision holds substantial promise. For example, it could yield large reductions in data acquisition costs for training complex models. Programmatic weak supervision replaces the need for manual annotations by applying so-called labeling functions to unlabeled data, producing weak labels that are combined into a pseudolabel for each sample. This leaves the majority of the acquisition budget to be spent on unlabeled data, and here generative modeling can reduce the number of real-world samples that need to be collected. Similarly, information about the data distribution contained in weak label sources may improve generative models, reducing the need to acquire large volumes of samples to increase generative performance and model discrete structure. Additionally, learning with weak labels may enable targeted data augmentation, allowing for class-conditional sample generation despite not having access to ground truth. The main technical challenge is to build an interface between the core models used in the two approaches. 
Generative adversarial networks (GANs) (Goodfellow et al., 2014), which we focus on in this work, have at least a generator and a discriminator, and frequently additional auxiliary models, such as those that learn to disentangle latent factors of variation (Chen et al., 2016). In programmatic weak supervision, the label model is the main focus. It is necessary to develop an interface that aligns the structures learned from the unlabeled data by the various components. We introduce the weakly-supervised GAN (WSGAN), a simple yet powerful fusion of weak supervision and GANs visualized in Fig. 2, and we provide a theoretical justification that motivates the expected gains from this fusion. Our WSGAN approach is related to the unsupervised InfoGAN (Chen et al., 2016) generative model, and is also inspired by encoder-based label models as in Cachay et al. (2021). These techniques expose structure in the data, and our approach ensures alignment between the resulting variables by learning projections between them. The proposed WSGAN offers a number of benefits, including:
• Improved weak supervision: We obtain better-quality pseudolabels via WSGAN's label model, yielding consistent improvements in pseudolabel accuracy of up to 6% over established programmatic weak supervision techniques such as Snorkel (Ratner et al., 2020).
• Improved generative modeling: Weak supervision provides information about unobserved labels which can be used to obtain better disentangled latent variables, thus improving the model's generative performance. Over 6 datasets, our WSGAN approach improves image generation by an average of 5.8 FID points versus InfoGAN. We conduct architecture ablations and show that the proposed approach can be integrated into state-of-the-art GAN architectures such as StyleGAN (Karras et al., 2019) (see Fig. 1), achieving state-of-the-art image generation quality.
• Data augmentation via synthetic samples: WSGAN can generate samples and corresponding label estimates for data augmentation (e.g. Fig. 10 ), providing improvements of downstream classifier accuracy of up to 3.9% in our experiments. The trained WSGAN can produce label estimates even for samples, real or fake, that have no weak supervision signal available.

2. BACKGROUND

We propose to fuse weak supervision with generative modeling to the benefit of both techniques, and first provide a brief overview. A broader review of related work is presented in Section 5. Weak Supervision Weak supervision methods that use multiple sources of imperfect and partial labels (Ratner et al., 2016; 2020; Cachay et al., 2021) , sometimes referred to as programmatic weak supervision, seek to replace manual labeling for the construction of large labeled datasets. Instead, users define multiple weak label sources that can be applied automatically to the unlabeled dataset. Such sources can be heuristics, knowledge base look-ups, off-the-shelf models, and more. The technical challenge is to combine the source votes into a high-quality pseudolabel via a label model. This requires estimating the errors and dependencies between sources and using them to compute a posterior label distribution. Prior work has considered various choices for the label model, most of which only take the weak source outputs into account. A review can be found in Zhang et al. (2021; 2022) . Instead, our label model produces sample dependent accuracy estimates for the weak sources based on the features of the data, similar to Cachay et al. (2021) . Generative Models and GANs Generative models are used to model and sample from complex distributions. Among the most popular such models are generative adversarial networks (GANs) (Goodfellow et al., 2014) . GANs consist of a generator and discriminator model that play a minimax game against each other. Our approach builds off InfoGAN (Chen et al., 2016) , which adds an auxiliary inference component to learn disentangled representations via a set of latent factors of variation. We hypothesize that connecting such discrete latent variables to the label model should yield benefits for both weak supervision and generative modeling. 

3. THE WSGAN MODEL

We first describe our proposed weakly-supervised GAN (WSGAN) model, visualized in Fig. 2, and then provide theoretical justification for the model fusion. We work with $n$ unlabeled samples $X \in \mathcal{X} \subseteq \mathbb{R}^d$ drawn from a distribution $D_X$. We want to achieve two goals with the samples $X$. First, in generative modeling, we approximate $D_X$ with a model that can be used to produce high-fidelity synthetic samples. Second, in supervised learning, we wish to use $X$ to predict labels $Y \in \{1, 2, \ldots, C\}$, where $(X, Y)$ is drawn from a distribution whose marginal distribution is $D_X$. However, in the weak supervision setting, we do not observe $Y$. Instead, we observe $m$ labeling functions (LFs) $\Lambda \in \{0, \ldots, C\}^{n \times m}$ that provide imperfect estimates of $Y$ for a subset of the samples. These LFs vote on a sample $x_i$ to produce an estimate of the label $\lambda_j(x_i) \in \{1, \ldots, C\}$ or abstain (i.e. no vote) with 0. The goal is to combine the $m$ LF estimates into a pseudolabel $\hat{Y}$ that can be used to train a supervised model (Ratner et al., 2016). While weak supervision and generative modeling function over a number of modalities, this work focuses on images. Note that LF construction for image tasks is more challenging than for text tasks (cf. Section 5).
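To make the notation concrete, the following minimal sketch builds a small LF vote matrix with abstains encoded as 0 and aggregates it with an unweighted majority vote. Majority vote is only a simple baseline used here for illustration, not the label model proposed in this work; the matrix values, the tie-breaking rule, and the helper names `majority_vote` and `coverage` are assumptions.

```python
from collections import Counter

ABSTAIN = 0  # "no vote" is encoded as 0; classes are 1..C


def majority_vote(votes):
    """Return the most frequent non-abstain vote, or ABSTAIN if all LFs abstain.

    Ties are broken in favor of the smallest class index (an arbitrary choice).
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    best = max(counts.values())
    return min(c for c, k in counts.items() if k == best)


def coverage(lf_matrix):
    """Fraction of samples on which at least one LF votes."""
    covered = sum(any(v != ABSTAIN for v in row) for row in lf_matrix)
    return covered / len(lf_matrix)


# Hypothetical vote matrix: n = 4 samples (rows), m = 3 LFs (columns), C = 3.
LAMBDA = [
    [1, 1, 0],
    [2, 0, 3],
    [0, 0, 0],
    [3, 3, 1],
]
pseudolabels = [majority_vote(row) for row in LAMBDA]
```

The third row illustrates the case in which every LF abstains, so no pseudolabel can be derived from the votes alone.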

3.1. PROPOSED METHOD

To improve generative performance and the weak supervision-based pseudolabels, we propose a model that consists of a number of components. Because we wish to ensure that our component models benefit each other, our architecture aims for the following characteristics: (I) A generative model component that learns discrete latent factors of variation from data and exposes these externally, (II) a weak supervision label model component that makes predictions of the unobserved label by aggregating the weak supervision votes, using sample-dependent weights, (III) a set of interface models that connect the components. Our design choices are made to satisfy those goals.

GAN Architecture

We write $G$ for the generator; its goal is to learn a mapping to the image space based on input consisting of samples $z$ from a noise distribution $p_Z(z)$ along with a set of latent factors of variation $b \sim p(b)$, following the ideas introduced in InfoGAN (Chen et al., 2016). Because we are targeting a classification setting, we restrict ourselves to discrete $b$. The outputs of $G$ are samples $x$; these are consumed by a discriminative model $D$, which estimates the probability that a sample came from the training distribution rather than from $G$. Furthermore, we define an auxiliary model $Q$ which learns to map from a sample $x$ to the discrete latent code $b$. We denote the standard GAN objective by $V(D, G)$, and the InfoGAN objective by $IV(D, G, Q)$ (Chen et al., 2016):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim D_X}[\log D(x)] + \mathbb{E}_{z \sim p(z),\, b \sim p(b)}[\log(1 - D(G(z, b)))], \tag{1}$$

$$\min_{G, Q} \max_D IV(D, G, Q) = V(D, G) + \alpha\, \mathbb{E}_{z \sim p(z),\, b \sim p(b)}[l(b, Q(G(z, b)))], \tag{2}$$

where $l$ is an appropriate loss function, such as cross entropy, and $\alpha$ is a trade-off parameter. Equation 2 aims to maximize the mutual information between generated images and $b$, while $G$ continues to fool the discriminator $D$, leading to the discovery of latent factors of variation.
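The two objectives can be estimated on a mini-batch from the network outputs alone. The sketch below assumes the discriminator outputs, the sampled one-hot codes, and the predictions of Q are already available as plain Python lists; the function names and the small `eps` constant are illustrative assumptions, and cross entropy is used for the loss l.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross entropy l(b, q) between a one-hot code b and predicted probabilities q."""
    return -sum(b * math.log(q + eps) for b, q in zip(one_hot, probs))


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def infogan_value(d_real, d_fake, codes, q_preds, alpha=1.0):
    """Estimate of IV(D, G, Q) = V(D, G) + alpha * E[l(b, Q(G(z, b)))]."""
    info = sum(cross_entropy(b, q) for b, q in zip(codes, q_preds)) / len(codes)
    return gan_value(d_real, d_fake) + alpha * info
```

When Q recovers the sampled code perfectly, the added term vanishes and the estimate reduces to the plain GAN value.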

Weak Supervision Label Model

The purpose of the label model is to encode relationships between the LFs $\lambda$ and the unobserved label $y$, enabling us to produce an informed estimate of $y$. In prior work, the model is often a factor graph (Ratner et al., 2016; 2019; Fu et al., 2020; Zhang et al., 2022) with potentials $\phi_j(\lambda_j(x), y)$ and $\phi_{j,k}(\lambda_j(x), \lambda_k(x))$ capturing the degree of agreement between an LF $\lambda_j$ and $y$ or correlations between two LFs $\lambda_j$ and $\lambda_k$. We define the accuracy potentials $\phi_j(\lambda_j, y) \triangleq \mathbb{1}\{\lambda_j = y\}$ as in related work. Each potential $\phi_j$ is associated with an accuracy parameter $\theta_j$. Once we obtain estimates of $\theta_j$, we can predict $y$ from the LFs $\lambda$ via

$$L_\theta(\lambda)_k = \frac{\exp\left(\sum_{j=1}^{m} \theta_j \phi_j(\lambda_j(x), k)\right)}{\sum_{\tilde{y} \in \mathcal{Y}} \exp\left(\sum_{j=1}^{m} \theta_j \phi_j(\lambda_j(x), \tilde{y})\right)}, \quad \forall k \in \{1, \ldots, C\}.$$

This is a softmax over the weighted votes of all LFs, which derives from the factor graph introduced in Ratner et al. (2016). Note that related work only models the LF outputs to learn $\theta$, ignoring any additional information in the features $x$. However, the structure in the input data $x$ is crucial to our fusion. For this reason, we define a modified label model predictor in the spirit of Cachay et al. (2021). It has local accuracy parameters (sample-dependent values encoding the accuracy of each $\lambda_j$) via an accuracy parameter encoder $A(x): \mathbb{R}^d \to \mathbb{R}^m_+$. This variant is given by:

$$L^A_\theta(\lambda)_k = \frac{\exp\left(\sum_{j=1}^{m} A(x)_j \phi_j(\lambda_j(x), k)\right)}{\sum_{\tilde{y} \in \mathcal{Y}} \exp\left(\sum_{j=1}^{m} A(x)_j \phi_j(\lambda_j(x), \tilde{y})\right)}, \quad \forall k \in \{1, \ldots, C\},$$

a softmax over the LF votes by class, weighted by the accuracy encoder output. Note that, while $A(x)$ allows for finer-grained adjustments of the label estimate $\hat{Y}$, the estimate is still anchored in the votes of LFs which represent strong domain knowledge and are assumed to be better than random.

Learning the Label Model The technical challenge of weak supervision is to learn the parameters of the label model (such as $\theta_j$ above) without observing $y$.
Existing approaches find parameters under a label model that (i) best explain the LF votes while (ii) satisfying conditional independence relationships (Ratner et al., 2016; 2019; Fu et al., 2020). The features $x$ are ignored; it is assumed that all information about $y$ is present in the LF outputs. Instead, we promote cooperation between our models by ensuring that the best label model is the one which agrees with the discrete structure that the GAN can learn, and vice versa. The intuition is that, as each of the generative and label models learns useful information from data, this information can, if aligned correctly, be shared to help teach the other model. To this end, note that the sampled variable $b$ can only be observed for generated images, not for real images. Nonetheless, $Q$ can be applied to real-world samples to obtain a prediction of the latent $b$. Crucially, in the weak supervision setting we observe the LF outputs, enabling us to derive a label estimate $L^A_\theta(\lambda) = \hat{Y}$ for each real image, which can be aligned with the predicted code to guide $Q$ on real data, and vice versa.

Interface Models and Overall WSGAN Objective We introduce the following interface models to map between the estimates of $b$ and $y$. Let $F_1: [0, 1]^C \to [0, 1]^C$ and $F_2: [0, 1]^C \to [0, 1]^C$. Effective choices for $F_1$ and $F_2$ are linear models with a softmax activation function. To achieve agreement between the latent structure discovered by the GAN's auxiliary model $Q$ and that discovered by the label model $L^A_\theta$ via the LFs, we introduce the following overall objective, ensuring that a mapping exists between the latent structures on the real images in the training data:

$$\min_{G, Q, A, F_1, F_2} \max_D \; IV(D, G, Q) + \beta\, \mathbb{E}_{x, \lambda \sim D_{X, \Lambda}}\left[l(F_1(Q(x)), L^A(\lambda)) + l(Q(x), F_2(L^A(\lambda)))\right], \tag{4}$$

with hyperparameter $\beta$ and loss function $l$, such as the cross entropy. Pseudocode for the added loss term can be found in Algorithm 1.
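As a rough illustration of the accuracy-weighted label model and of the added alignment term for a single real sample, consider the sketch below. It is not the authors' Algorithm 1. It assumes cross entropy for l, applies the arguments of l in the order written in the objective with the first argument acting as the target distribution (an assumption), and uses hypothetical helper names throughout.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_model(votes, accuracies, num_classes):
    """Softmax over accuracy-weighted LF votes; a vote of 0 means abstain.

    votes: list of m LF outputs in {0, 1, ..., C}
    accuracies: list of m non-negative weights, e.g. theta or A(x)
    """
    scores = [
        sum(a for v, a in zip(votes, accuracies) if v == k)
        for k in range(1, num_classes + 1)
    ]
    return softmax(scores)


def cross_entropy(target, pred, eps=1e-12):
    """Cross entropy l(target, pred); the first argument is the target (assumed)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def linear_softmax(weights, bias, probs):
    """Interface model F: a linear map followed by a softmax."""
    return softmax([
        sum(w * p for w, p in zip(row, probs)) + b
        for row, b in zip(weights, bias)
    ])


def alignment_loss(q_probs, y_hat, f1, f2):
    """l(F1(Q(x)), L^A(lambda)) + l(Q(x), F2(L^A(lambda))) for one real sample."""
    return cross_entropy(f1(q_probs), y_hat) + cross_entropy(q_probs, f2(y_hat))


# Hypothetical example: m = 3 LFs with equal weights, C = 3 classes.
y_hat = label_model([1, 1, 2], [0.5, 0.5, 0.5], num_classes=3)
```

With equal weights the estimate follows the vote counts, so class 1 receives the highest probability in this example, and a sample on which all LFs abstain receives a uniform estimate.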
In our implementation, as is common in related GAN work, we let $D$, $Q$ and $A$ share convolutional layers and define distinct prediction heads for each. For $L^A$ we detach the features from the computation graph before passing them to a multilayer perceptron (MLP), followed by a sigmoid activation function. Thus, the WSGAN method only adds a small number of additional parameters compared to a basic GAN or InfoGAN.

Improving Alignment Initializing the label model $L^A$ such that it produces equal weights for all LFs results in a strong baseline estimate of $\hat{Y}$, as users build LFs to be better than random. Initialized in this way, $L^A_\theta$ can act as a teacher in the beginning and guide $Q$ towards the discrete structure encoded in the LFs. We find that adding a decaying penalty term that encourages equal label model weights in early epochs, while not necessary to achieve good performance, almost always improves latent label estimates. Let $i \ge 0$ denote the current epoch. We propose to add the following linearly decaying penalty term for an encoder $A$ that uses a sigmoid activation function:

$$\frac{C}{i \gamma + 1} \left\| A(x) - 0.5 \cdot \vec{1} \right\|_2^2,$$

where $\gamma$ is a decay parameter. In our experiments we set $\gamma = 1.5$.

Augmenting the Weak Supervision Pipeline with Synthetic Data Given a WSGAN model trained according to Eq. 4, we can generate images via $G$ to obtain unlabeled synthetic samples $x$. To obtain pseudolabels for these images we have at least one and sometimes two options. When LFs can be applied to synthetic images, we can obtain their votes $\lambda(x)$ and apply our WSGAN label model $L^A(\lambda(x))$. However, in many practical applications of weak supervision, some LFs are not applied to images directly, but rather to metadata or an auxiliary modality such as text (cf. Section 5). With WSGAN, we can obtain pseudolabels via $\hat{y} = F_1(Q(x))$ for samples that have no LF votes, using the trained WSGAN components $Q$ and $F_1$, in essence transferring knowledge from $Q$ to the end model.
Note that the quality of these synthetic pseudolabels hinges on the performance of Q, which can conceivably improve with the supply of weakly supervised as well as entirely unlabeled data.
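The decaying penalty described under Improving Alignment can be written in a few lines. This is a minimal sketch for a single sample, assuming the encoder output A(x) is given as a list of sigmoid activations; the function name is hypothetical, and the default decay of 1.5 is the value reported above.

```python
def alignment_penalty(acc_weights, epoch, num_classes, gamma=1.5):
    """C / (i * gamma + 1) * ||A(x) - 0.5 * 1||_2^2 for epoch i >= 0."""
    scale = num_classes / (epoch * gamma + 1.0)
    sq_dist = sum((a - 0.5) ** 2 for a in acc_weights)
    return scale * sq_dist
```

The penalty is zero when every LF receives the weight 0.5, and for a fixed encoder output it shrinks as the epoch index grows, so the label model is free to move away from equal weights later in training.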

3.2. THEORETICAL JUSTIFICATION

In this section, we provide theoretical results that suggest that there is a provable benefit to combining weak supervision and generative modeling. In particular, we provide two theoretical claims justifying why weak supervision should help generative modeling (and vice versa): (1) generative models help weak supervision via a generalization bound on downstream classification and (2) weak supervision improves a multiplicative approximation bound on the loss for a conditional GAN using the unobserved true labels. For the latter, we extend the theoretical setup and noisy channel model of the Robust Conditional GAN (RCGAN) (Thekumparampil et al., 2018). Formal statements and proofs of these claims can be found in Appendix F.

Claim (1) Assume that we have $n_1$ unlabeled real examples where our label model fails to produce labels, i.e. all LFs abstain on these $n_1$ points. This is a typical issue in weak supervision, as sources often only vote on a small proportion of points. We then sample enough synthetic examples from our generative model such that we obtain $n_2$ synthetic examples for which our label model does produce labels; this enables training of a downstream classifier on synthetic examples alone with the following generalization bound:

$$\sup_{f \in \mathcal{F}} \left| \hat{R}_{\hat{D}}(f) - R_D(f) \right| \le 2\mathfrak{R} + \sqrt{\frac{\log(1/\delta)}{2 n_2}} + B_\ell G^{\frac{1}{2}} + B_\ell \sqrt{2} \exp(-m\alpha^2),$$

where $\mathfrak{R}$ is the Rademacher complexity of the function class. The first two terms are standard. The third term is the penalty due to generative model usage; any generative model estimation result for total variation distance can be plugged in. For example, for estimating a mixture of Gaussians, $G = (4 c_G k d^2 / n_1)^{1/2}$, which depends on the number of mixture components $k$ and dimension $d$. The last term is the penalty from weak supervision with $m$ LFs whose accuracy is $\alpha$ better than chance; this implies that generated samples can help weak supervision generalize when true samples cannot.
Claim (2) Noisy labels from majority vote improve the multiplicative bound on the RCGAN loss given in Theorem 2 of Thekumparampil et al. (2018). Let $P$ and $Q$ be two distributions over $\mathcal{X} \times \{0, 1\}$ and let $\tilde{P}_{MV}$ and $\tilde{Q}_{MV}$ be the corresponding distributions with noisy labels generated by majority vote over $m$ LFs. Let $d_{\mathcal{F}}(\tilde{P}_{MV}, \tilde{Q}_{MV})$ be the RCGAN loss with noisy labels generated by majority vote and let $\epsilon_\lambda$ be the mean error of each of the $m$ LFs. Using majority vote with $m \ge 0.5 \log(1/\epsilon_\lambda) / \left(\tfrac{1}{2} - \epsilon_\lambda\right)^2$ LFs, we obtain an exponentially tighter multiplicative bound on the noiseless RCGAN loss:

$$d_{\mathcal{F}}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le d_{\mathcal{F}}(P, Q) \le \left(1 - 2\exp\left(-2m\left(\tfrac{1}{2} - \epsilon_\lambda\right)^2\right)\right)^{-1} d_{\mathcal{F}}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le (1 - 2\epsilon_\lambda)^{-1} d_{\mathcal{F}}(\tilde{P}_{MV}, \tilde{Q}_{MV}).$$

This means that weak supervision can help an RCGAN more accurately learn the true joint distribution, even when the true labels are unobserved. The full analysis is provided in Appendix F.
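A small numeric check can make the condition on m in Claim (2) tangible. The sketch below evaluates the smallest number of LFs satisfying the stated condition and the two multiplicative factors from the bound; the function names and the example error rate of 0.3 are assumptions chosen purely for illustration.

```python
import math


def min_num_lfs(eps):
    """Smallest integer m with m >= 0.5 * log(1/eps) / (1/2 - eps)^2."""
    return math.ceil(0.5 * math.log(1.0 / eps) / (0.5 - eps) ** 2)


def mv_factor(m, eps):
    """Multiplicative factor (1 - 2 exp(-2 m (1/2 - eps)^2))^(-1)."""
    return 1.0 / (1.0 - 2.0 * math.exp(-2.0 * m * (0.5 - eps) ** 2))


def loose_factor(eps):
    """The looser multiplicative factor (1 - 2 eps)^(-1)."""
    return 1.0 / (1.0 - 2.0 * eps)


eps_lambda = 0.3  # hypothetical mean LF error
m_min = min_num_lfs(eps_lambda)
```

Once m reaches `m_min`, the majority-vote factor is no larger than the looser factor, and it approaches 1 as further LFs are added.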

4. EXPERIMENTS

Our experiments on multiple image datasets show that the proposed WSGAN approach is able to take advantage of the discrete latent structure it discovers in the images, leading to better label model performance compared to prior work. The results also indicate that weak supervision as used by WSGAN can improve image generation performance. In the spirit of democratizing AI, we aim to keep the complexity of our experiments manageable, to ensure accessible reproducibility. Therefore, we conduct our main experiments with a simple DCGAN base architecture. As an ablation, we also adapt StyleGAN2-ADA (Karras et al., 2020) to WSGAN, showing that the proposed method can be integrated with other GAN architectures to achieve state-of-the-art image generation and label model performance. Please see the Appendix for additional details and experiments as well as a link to code.

4.1. SETUP

Datasets Table 1 shows key characteristics of the datasets used in our experiments, including information about the different LF sets. We conduct our main experiments with the Animals with Attributes 2 (AwA2) (Mazzetto et al., 2021a), DomainNet (Peng et al., 2019), the German Traffic Sign Recognition Benchmark (GTSRB) (Stallkamp et al., 2012), and CIFAR10 (Krizhevsky, 2009) color image datasets, as well as with the gray-scale MNIST (LeCun et al., 1998) and FashionMNIST (Xiao et al., 2017) datasets. We use a variety of types of weak supervision sources for these datasets (see Appendix B for more dataset details). The LF types we cover are:
• Domain transfer: classifiers are trained on images in source domains (e.g. paintings), and the trained classifiers are then applied to images in a target domain (e.g. real images) to obtain weak labels. This LF type is used in our DomainNet experiments, following Mazzetto et al. (2021a).
• Attribute heuristics: we use these LFs in our AwA2 experiments. Attribute classifiers are trained on a number of seen classes of animals. Given these weak attribute predictions, we use the known attribute relations and a small amount of validation data to train shallow decision trees to produce weak labels for a set of unseen classes of animals.
• SSL-based: using image features learned on ImageNet with SimCLR (Chen et al., 2020), we fine-tune shallow multilayer perceptron classifiers on small sets of held-out data to produce weak labels for our datasets.
• Synthetic: these simulated LFs, used in some of our CIFAR10 experiments, are unipolar LFs based on the corrupted true class label. To this end, random errors are introduced to the class label to achieve a sampled target accuracy and propensity.
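The synthetic LF type can be illustrated with a short simulation. The exact corruption procedure is not spelled out in this section, so the sketch below is only one plausible reading: the true label is kept with probability `accuracy` and otherwise replaced by a different class, and a unipolar LF votes for its own class with probability `propensity` whenever the corrupted label matches it. The function name and all parameters are hypothetical.

```python
import random

ABSTAIN = 0


def simulate_unipolar_lf(labels, target_class, accuracy, propensity,
                         num_classes, rng):
    """Simulate one unipolar LF that votes `target_class` or abstains."""
    votes = []
    for y in labels:
        # Corrupt the true label: keep it with probability `accuracy`,
        # otherwise replace it with a different class chosen uniformly.
        if rng.random() < accuracy:
            noisy = y
        else:
            noisy = rng.choice([c for c in range(1, num_classes + 1) if c != y])
        # A unipolar LF only ever votes for its own class or abstains.
        fires = noisy == target_class and rng.random() < propensity
        votes.append(target_class if fires else ABSTAIN)
    return votes


true_labels = [1, 2, 3, 1, 2, 3]
clean_votes = simulate_unipolar_lf(true_labels, 2, 1.0, 1.0, 3, random.Random(0))
```

With accuracy and propensity both set to 1.0 the LF fires exactly on the samples of its target class; lowering either value introduces wrong votes or additional abstains.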

Models

We study two versions of the proposed WSGAN model: (1) WSGAN-Encoder, which uses an accuracy parameter encoder A(x) that takes in an image x and outputs an accuracy weight vector for the label model. (2) WSGAN-Vector, a baseline which learns a parameter vector that is used to weight LF votes and is not sample-dependent. For our main experiments, G and D follow a simple DCGAN (Radford et al., 2015) design. All networks are trained from scratch and we use the same hyperparameter settings in all experiments. For our architecture ablation, we adapt StyleGAN2-ADA (Karras et al., 2020) to create StyleWSGAN. See Appendix A for implementation details and parameter settings. We compare WSGAN to the following label model approaches: (I) Snorkel (Ratner et al., 2016; 2020): a probabilistic graphical model that estimates LF parameters by maximizing the marginal likelihood using observed LFs. To compare the quality of generated images, we use the Fréchet Inception Distance (FID) on color images, which has been shown to be consistent with human judgments (Heusel et al., 2017) and is used to measure performance of current state-of-the-art GAN approaches (Karras et al., 2021). To show the improvement in alignment of the auxiliary model Q's predictions of the discrete latent code b with the latent labels y, we track the Adjusted Rand Index (ARI) between the two.
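For reference, the Adjusted Rand Index between predicted codes and labels can be computed from pair counts as sketched below. This is a plain standard-library implementation of the usual ARI formula, not the evaluation code used in the experiments; it assumes at least two samples, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same n >= 2 samples."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Because the index only compares groupings, it is invariant to a permutation of the code values, which is why it suits the comparison of an unordered latent code with class labels.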

4.2. RESULTS

We first discuss results comparing label model and image generation performance, before presenting the use of WSGAN for augmentation of the downstream classifier with synthetic samples. We repeat each experiment at least three times and average the results in our tables.

4.2.1. LABEL MODEL AND IMAGE GENERATION

Label Model Performance Table 2 shows a comparison of label model performance based on the accuracy of the posterior on the training data, without the use of any labeled data or validation sets. WSGAN-Encoder largely outperforms alternative label models, while the simpler WSGAN-Vector model performs competitively as well. These results hold according to additional metrics provided in Appendix C. Results with standard deviations over 5 random runs are provided in the Appendix in Table 11, indicating that many differences are significant.

Discrete Latent Code Comparison

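Table 4. Test accuracy increases with synthetic data augmentation. Synthetic PLs via F_1(Q(x)), LF PLs via L^{A(x)}(λ(x)).

Dataset | Synthetic PLs | LF PLs
AwA2-A | 0.88% | 0.79%
AwA2-B | 2.40% | 3.90%
DomainNet | 2.31% | 1.50%
MNIST | 1.60% | 1.71%
FashionMNIST | 0.29% | 0.34%
GTSRB | 0.40% | 0.02%
CIFAR10-A | 0.04% | -
CIFAR10-B | 0.30% | -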

Synthetic Images with Labeling Function Votes

The last column in Table 4 shows the test accuracy increases obtained by applying LFs λ to synthetic images x. We obtain pseudolabels via L^{A(x)}(λ(x)). We observe a modest average increase of 1.38%.

Synthetic Images with Synthetic Pseudolabels

We can create pseudolabels with F_1(Q(x)), e.g. when LFs cannot be applied to synthetic images. With this, the second column of Table 4 shows an average increase in test accuracy of 1%, and up to 2.4%. We do not observe larger increases in accuracy by adding more generated images. Figure 10 shows a small number of generated images along with synthetic pseudolabel estimates. While F_1(Q(x)) could conceivably be used as a downstream classifier, the choices of network architecture are then constrained as it shares convolutional layers with D.

Synthetic Data Quality Checks In addition to visually inspecting some generated samples and checking if conditionally generated samples reflect the target labels, we recommend checking the class balance in the pseudolabels of synthetic images before adding them to a downstream training set, as mode collapse in a trained GAN can potentially be diagnosed this way.
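The recommended class balance check can be as simple as the following sketch. It assumes hard pseudolabels in 1..C and a roughly uniform expected class distribution; the `tolerance` threshold and the function names are hypothetical choices, not values from our experiments.

```python
from collections import Counter


def class_balance(pseudolabels, num_classes):
    """Share of synthetic samples assigned to each class 1..C."""
    counts = Counter(pseudolabels)
    n = len(pseudolabels)
    return {c: counts.get(c, 0) / n for c in range(1, num_classes + 1)}


def underrepresented_classes(pseudolabels, num_classes, tolerance=0.5):
    """Classes whose share falls below tolerance * (1 / num_classes)."""
    threshold = tolerance / num_classes
    shares = class_balance(pseudolabels, num_classes)
    return [c for c, share in shares.items() if share < threshold]


# Hypothetical pseudolabels for ten synthetic images with C = 3.
synthetic_pls = [1] * 8 + [2] * 2
flagged = underrepresented_classes(synthetic_pls, num_classes=3)
```

A class that never or rarely appears among the synthetic pseudolabels, like class 3 in this example, is a cue to inspect the generator for mode collapse before using the samples for augmentation.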

4.2.3. NETWORK ABLATION -STYLEWSGAN

We apply StyleWSGAN to weakly supervised LSUN scene categories (Yu et al., 2015), and to our CIFAR10-B dataset; please see Appendix C.2 for details. The results demonstrate WSGAN's complementarity with other GAN architectures and that it scales to images of higher resolution. On weakly supervised LSUN scene category images with a resolution of 256 by 256 pixels, StyleWSGAN achieves a mean FID of 7.54 (samples visualized in Fig. 8), while an unconditional, tuned StyleGAN2-ADA (Karras et al., 2020) achieves an FID of 8.41. On CIFAR10-B, StyleWSGAN achieves a mean FID of 3.79 (see generated images in Fig. 9), while also attaining a high label model accuracy of 0.736 (compare with Table 2). The unsupervised StyleGAN2-ADA, with the optimal, tuned settings identified in Karras et al. (2020), achieves an average FID of 3.85 on this subset. An unsupervised StyleInfoGAN that we created achieved a mean FID of 4.13.

5. RELATED WORK

Programmatic Weak Supervision Data programming (DP) (Ratner et al., 2016 ) is a popular weak supervision framework in which subject matter experts programmatically label data through multiple noisy sources of labels known as labeling functions (LFs). These LFs capture partial knowledge about an unobserved ground truth variable at better than random accuracy. In DP, a label model combines LF votes to provide an estimate of the unobserved ground truth, which is then used to train an end model using a noise-aware loss function. DP has been successfully applied to various domains including medicine (Fries et al., 2019; Dunnmon et al., 2020; Eyuboglu et al., 2021) and industry applications (Ré et al., 2020; Bach et al., 2019) . Many works offer DP label models with improved properties, e.g., extensions to multitask models (Ratner et al., 2019) , better computational efficiency (Fu et al., 2020) , exploiting small amounts of labels as in semi-supervised learning settings (Chen et al., 2021; Mazzetto et al., 2021a; b), end-to-end training (Cachay et al., 2021) , interactive learning (Boecking et al., 2021) , or extensions to structured prediction settings (Shin et al., 2022) . See Zhang et al. (2022) for a more detailed survey. Programmatic Weak Supervision and Images Our main focus is on applications of weak supervision to image data. On images, imperfect labels are often obtained from domain specific primitives and rules (Varma & Ré, 2018; Fries et al., 2019) , rules defined on top of annotations by surrogate models (Varma & Ré, 2018; Chen et al., 2019; Hooper et al., 2021) , rules defined on meta-data (Li & Fei-Fei, 2010; Chen & Gupta, 2015; Izadinia et al., 2015; Denton et al., 2015) or rules applied to a second paired modality such as text (Joulin et al., 2016; Wang et al., 2017; Irvin et al., 2019; Boecking et al., 2021; Dunnmon et al., 2020; Saab et al., 2019; Eyuboglu et al., 2021) . 
Generative Models and Disentangled Representations Among the numerous existing approaches to generative modeling, in this work we focus on generative adversarial networks (GANs) (Goodfellow et al., 2014). We are particularly interested in work that aims to learn disentangled representations (Chen et al., 2016; Lin et al., 2020) that can align with class variables of interest. Chen et al. (2016) introduce InfoGAN, which learns interpretable latent codes. This is achieved by maximizing the mutual information between a fixed small subset of the GAN's input variables and the generated observations. Gabbay & Hoshen (2020) present a unified formulation for class and content disentanglement as well as a new approach for class-supervised content disentanglement. Nie et al. (2020) study semi-supervised high-resolution disentanglement learning for the state-of-the-art StyleGAN architecture. A potential downside to modeling latent factors in generative models is a decrease in image quality of generated samples that has been noted when disentanglement terms are added (Burgess et al., 2018; Khrulkov et al., 2021). Prior work has studied how to integrate additional information into GAN training, in particular ground truth class labels (Mirza & Osindero, 2014; Salimans et al., 2016; Odena, 2016; Odena et al., 2017; Brock et al., 2019; Thekumparampil et al., 2018; Miyato & Koyama, 2018; Lučić et al., 2019), also considering noisy scenarios (Kaneko et al., 2019). However, in the programmatic weak supervision setting, having multiple noisy sources of imperfect labels that include abstains presents large hurdles to similar conditional modeling. Some prior work uses other weak formats of supervision to aid specific aspects of generative modeling. For example, Chen & Batmanghelich (2020) propose learning disentangled representations using user-provided ground-truth pairs. Yet, prior work does not fuse programmatic weak supervision frameworks and generative models, and so is limited to one-off techniques that solely improve generative models.

Using GANs for Data Augmentation An exciting application of GANs is to generate additional samples for supervised model training. The challenge is to produce sufficiently high-quality samples. For example, Abbas et al. (2021) use a conditional GAN to generate synthetic images of tomato plant leaves for a disease detection task. GANs for data augmentation are also popular in medical imaging (Yi et al., 2019; Motamed et al., 2021; Hu et al., 2019).

6. CONCLUSION

We studied the question of how to build an interface between two powerful techniques that operate in the absence of labeled data: generative modeling and programmatic weak supervision. Our fusion of the two, a weakly supervised GAN (WSGAN), defines an interface that aligns structures discovered in its constituent models. This leads to three improvements: first, better quality pseudolabels compared to weak supervision alone, boosting downstream performance. Second, improvement in the quality of the generative model samples. Third, it enables data augmentation via generated samples and pseudolabels, further improving downstream model performance without additional burden on users. Standard failure cases of GANs such as mode collapse still apply to the proposed approach. However, we do not observe that WSGAN is more susceptible to such failures than the approaches we compare to. For future work, we are interested in other modalities, exploiting for instance generative models for graphs and time series. Further, motivated by the performance of WSGAN, we seek to extend the underlying notion of interfaces between models to a variety of other pairs of learning paradigms. Limitations of the proposed approach include common GAN restrictions such as the types of data that can be modeled and the number of unlabeled samples required to fit distributions, and also known difficulties of acquiring weak supervision sources of sufficient quality for image data.

Generator G, Discriminator D, and auxiliary model Q: Figures 5 and 4 show the simple DCGAN (Radford et al., 2015) based generator and discriminator architectures we use in WSGAN for experiments with 32 × 32 images. As mentioned in the main paper, Q and D are neural networks that generally share all convolutional layers, with a final fully connected layer to output predictions. We follow the same structure in our experiments. We set the dimension of the noise variable z to 100, and of b equal to the number of classes.
We sample z from a normal distribution and b from a uniform discrete distribution. Accuracy Encoder A: For WSGAN-Vector, A is simply a parameter vector of the same length as the number of labeling functions. For WSGAN-Encoder, we use image features obtained from the shared convolutional layers of Q and D, which we detach from the computational graph before passing them on to an MLP prediction head. For images with 32 × 32 pixels, the feature vector obtained from the shared convolutional layers is of size 512 * 16. The MLP head of A is set to have three hidden layers of size (256, 128, 64), with ReLU activations, and an output layer the size of the number of labeling functions followed by a sigmoid function. We did not observe significant changes in performance when we change the MLP to be shallower or wider. However, for large numbers of LFs, one should consider increasing the width the MLP. Mappings F 1, F 2: We set F 1 and F 2 to each be simple linear models with a softmax at the output, and set the input and output size of each to the number of classes. (WS)GAN Training We use the same hyperparameter settings for all datasets. We train all GANs for a maximum of 200 epochs. We use a batch size of 16 and find that a lower batch size leads to more frequent convergence of the generator and discriminator. We also conducted ablation experiments with a batch size of 8 and 32 and found no significant difference in FID image generation quality or label model accuracy. For WSGAN, we use four optimizers, one for each of the different loss terms: discriminator training, generator training, the Info loss term, and the WSGAN loss term. We use Adam for all optimizers and set the learning rates as follows: 4 × 10 -4 for D, 1 × 10 -4 for G, 1 × 10 -4 for the info loss term, and 8 × 10 -5 for the WSGAN loss term. We follow the same settings for InfoGAN training for the components shared with WSGAN. 
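The latent sampling described earlier in this section (a 100-dimensional noise variable z and a discrete code b with one entry per class) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_latents`, the use of a standard normal for z, and the one-hot encoding of b are assumptions made here for concreteness.

```python
import random


def sample_latents(batch_size, num_classes, z_dim=100, seed=None):
    """Sample (z, b) pairs for a generator.

    Assumptions (not stated in the text): z is standard normal and
    b is encoded as a one-hot vector over the classes.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        # Continuous noise: z_dim independent N(0, 1) draws.
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
        # Discrete code: uniform over {0, ..., num_classes - 1}.
        code = rng.randrange(num_classes)
        b = [1.0 if k == code else 0.0 for k in range(num_classes)]
        batch.append((z, b))
    return batch
```

A batch of 16 such pairs for a 10-class dataset would be `sample_latents(16, 10)`.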
(WS)GAN Training and Failure Cases While WSGAN is still susceptible to the common GAN failure cases of its base networks, such as mode collapse, we empirically find WSGAN training to be more stable than training a GAN that also learns a discrete latent code but uses no weak supervision signals (InfoGAN), despite the high level of noise in our weak supervision sources. InfoGAN failed to converge more frequently. To help train the DCGAN networks successfully, we find that employing discriminator label flipping (randomly calling a tiny percentage of real samples fake and vice versa) and label smoothing (adding small amounts of noise to the real target of 1.0 and fake target of 0.0) stabilizes and improves GAN training. Despite employing these tricks, we were unable to avoid occasional convergence failures. Fortunately, monitoring the generator and discriminator losses, inspecting the quality of generated images, or tracking image quality metrics such as FID allows one to easily discard failed runs or to pick model checkpoints from earlier iterations before a failure, without requiring labeled data.
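The two stabilization tricks mentioned above, discriminator label flipping and label smoothing, can be sketched as a small target-generation routine. The function name, the flip probability of 0.02, and the noise width of 0.1 are hypothetical values chosen for illustration; the text only states that a tiny percentage of labels is flipped and that small amounts of noise are added. Drawing the noise one-sided, so that targets stay in [0, 1], is also an assumption.

```python
import random


def discriminator_targets(is_real, flip_prob=0.02, noise=0.1, seed=None):
    """Build noisy discriminator targets for a batch.

    is_real: list of booleans (True for real samples, False for generated).
    flip_prob and noise are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    targets = []
    for real in is_real:
        # Label flipping: occasionally call a real sample fake and vice versa.
        if rng.random() < flip_prob:
            real = not real
        # Label smoothing: perturb the hard targets 1.0 and 0.0 slightly.
        if real:
            targets.append(1.0 - rng.uniform(0.0, noise))
        else:
            targets.append(rng.uniform(0.0, noise))
    return targets
```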

StyleWSGAN Model Setup and Training

We adapt StyleGAN2-ADA (Karras et al., 2020) to build a StyleWSGAN model as well as a StyleInfoGAN. The generator architecture follows the same approach as a class-conditional StyleGAN generator: the sampled code is embedded to a d-dimensional vector via a linear layer and then concatenated with the original latent code, after each is normalized. This concatenated vector is then passed to the StyleGAN mapping network. We find the relationship between the number of layers of the StyleGAN mapping network and the size d of the embedded sampled code to be crucial for StyleWSGAN. When the mapping network is too shallow, as in the tuned CIFAR10 settings in (Karras et al., 2020), a large d can lead to training instability for StyleWSGAN and StyleInfoGAN. We use separate optimizer settings for each loss term, and set the learning rate for the Info term (added term of Equation 2) and the WSGAN term (added term of Equation 4 plus decay penalty) to a factor of 2/10 of the base learning rate in StyleGAN. This results in a learning rate of 0.0005 for the added WSGAN terms in our experiments, while we maintain a learning rate of 0.0025 for the original StyleGAN terms. Due to the use of different learning rates in the separate optimizers, the added loss terms are not scaled, and the hyper-parameters α, β are set to 1.

For the CIFAR10 experiments we largely follow the settings used in (Karras et al., 2020): no style mixing, no path length regularization, no ResNet D. However, we increase the depth of the mapping network from 2 to 6, we decrease the size of the code embedding to 200, and continue training until the discriminator has seen a total of 50M real images. A mapping network of depth 4 and a code embedding size of 50 also lead to good performance, performing only slightly worse as measured by both FID and label model accuracy.
For the LSUN experiments, we train StyleWSGAN until the discriminator has seen a total of 35M real images, and the baseline StyleGAN2-ADA that we compare to is trained until the discriminator has seen a total of 50M real images. We largely follow the settings used for 256 x 256 images in (Karras et al., 2020) , but disable style mixing and path length regularization. We set the size of the discrete code embedding to 50.

End Model Training

For all datasets, we train a ResNet-18 (He et al., 2016) for 100 epochs, using Adam and a learning rate scheduler. The learning rate scheduler uses a small validation set to make adjustments to the learning rate.

Image Augmentation: We use the following random image augmentation functions during DC-GAN and end-model training for color images: random crop and resize (cropping out a maximum height/width of 13%), random sharpness adjustment (p = 0.2), random Gaussian blur (p = 0.1), and random color jitter.

Label Models: To compare to related work, we use implementations of label models made available via WRENCH (Zhang et al., 2021).

Complexity: WSGAN shares the same operations as InfoGAN and adds some additional steps on real samples that have at least one LF vote, which slightly increases the required computation. Recall that C denotes the number of classes, m the number of LFs, and x an image of a real sample. Further, let $n_w$ denote the number of samples that have at least one weak label vote from any LF, let q denote the number of steps required for a forward pass through Q to obtain image features and the discrete code prediction, and let a denote the number of steps for a forward pass through the MLP A. For a forward pass, WSGAN increases the complexity compared to InfoGAN in each epoch by $\Theta(n_w(a + q + m + 2C^2 + C(m + 8)))$. Note that q may be eliminated for the forward pass through careful implementation, as the image features are already obtained for the basic InfoGAN update. In practice, in our experiments the computational overhead, including for additional data loading of the LFs, leads to a modest increase in runtime (measured in bps, denoting batches per second) of the weakly supervised WSGAN over the unsupervised InfoGAN, as follows.

• MNIST and FashionMNIST both contain 28x28 grayscale images, which we resize to 32x32. For both, we use a random sample of 30,000 images from the training data for our experiments. SSL-based labeling functions are fine-tuned on small, random subsets of the remaining training data of each dataset.
• GTSRB contains 64x64 color images of German traffic signs. We use 22,640 random images from the full training dataset during our experiments, while random subsets of the remaining images in the original training data are used to fine-tune the SSL-based labeling functions.
• The original DomainNet (Peng et al., 2019) dataset contains 345 classes of images in 6 different domains (see footnote). As our dataset, following Mazzetto et al. (2021a), we use the images in the real domain and select the 10 classes with the largest number of instances in this domain. Because of the small size of the resulting dataset, we resize the images to 32 x 32 in our experiments.
• Animals with Attributes 2 (AwA2) (Mazzetto et al., 2021a) is an image dataset with known general attributes for each class, divided into 40 seen and 10 unseen classes. Because of the small size of the resulting dataset once LFs are created, we resize the images to 32 x 32 in our experiments.
• LSUN scene categories: see Section C.2 for details.

B.1 LABELING FUNCTION DETAILS

• Synthetic: based on the true class label, we create synthetic, unipolar LFs via the following procedure: for each LF, we sample a class label, an error rate, and a propensity (i.e., the percentage of samples where the LF casts a vote, also referred to as coverage). Given the target label, we then sample true positives and false positives at random to achieve the desired LF accuracy and propensity.
• Domain transfer: these LFs are used in our DomainNet dataset experiments. We follow Mazzetto et al. (2021a), and derive weak supervision sources for a multiclass classification task of the real images contained in the DomainNet (Peng et al., 2019) dataset. First, we set our target domain to real images and select the 10 classes with the largest number of instances in this domain. As LFs, we then train classifiers using the selected classes within the remaining five domains, and apply these trained classifiers to the unseen images in the target domain of real images to obtain weak labels.
• Attribute heuristics: we create two sets of LFs for the Animals with Attributes 2 (AwA2) (Mazzetto et al., 2021a) dataset. Following Mazzetto et al. (2021b;a), we train one-vs-rest attribute classifiers using the 40 seen classes of the AwA2 dataset. These classifiers are applied to the 10 unseen classes to produce weak attribute labels. At this stage, we discard attribute classifiers which perform worse than random. We create an 85%/5%/10% train/validation/test split of the 10 unseen classes, which we use to define decision trees that produce weak labels on the basis of weak attribute predictions. We create the 29 unipolar LFs for AwA2-A by training 3 one-vs-rest decision trees for each of the 10 classes on 100 random samples from the training set. To create a slightly easier set, we create the 32 unipolar LFs used in AwA2-B by training 80 decision trees, retaining one random tree specializing in each class, and then selecting all remaining ones where validation accuracy is higher than 0.65.
• SSL-based: The base representations are learned on unsupervised ImageNet with SimCLR (Chen et al., 2020). The trained network is used to obtain features for our image datasets. We then train shallow MLP networks on a few hundred held-out samples to predict a randomly sampled target label at a randomly sampled target accuracy. The accuracy is validated to be within range of the target accuracy on another small amount of held-out data. Thus, during their creation these unipolar LFs are never trained or evaluated on the WSGAN training data or the downstream test data.

For further evaluation of WSGAN with a DCGAN base architecture, we created additional weakly supervised image datasets based on CIFAR10 by varying the number of samples and the type 7 . Finally, a comparison between the latent discrete variable of WSGAN and InfoGAN is given in Figure 7, which shows how the Adjusted Rand Index evolves between the unobserved class labels and the latent discrete variable modeled by auxiliary model Q.
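For readers unfamiliar with the Adjusted Rand Index used in the comparison above, the following sketch computes it from the pair-counting contingency table between two labelings (here, the unobserved class labels and the discrete codes predicted by Q). It uses the standard Hubert-Arabie formula and is offered as an illustration; the function name is chosen here and this is not claimed to be the evaluation code used in the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    if n < 2:
        raise ValueError("ARI needs at least two samples")
    # Pairs of samples that share both the true label and the predicted code.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs sharing a true label, and pairs sharing a predicted code.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Expected agreement under random labelings with the same marginals.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_cells - expected) / (max_index - expected)
```

Because the index is invariant to permutations of the cluster names, it suits a latent code whose category indices need not match the class indices.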

C.2 STYLEWSGAN

Please see Section A for implementation details and hyperparameter settings in our StyleGAN experiments. Dataset statistics for this section are shown in Table 8 .

LSUN scene categories

To test the proposed WSGAN on higher resolution images with a StyleGAN base architecture, we create a balanced subset of the LSUN scene categories dataset (Yu et al., 2015). The dataset contains 10 classes (i.e., 10 different scene categories) and we center-crop and resize images to 256 by 256 pixels. We sample an equal number of images from each of the 10 classes for a final dataset size of 1,212,270 images. As weak supervision sources, we create 30 SSL-based LFs by training classifiers on small amounts of held-out data using image features learned via self-supervised learning, as described in Section B.1. StyleWSGAN achieves an average FID of 7.54 on this dataset. An unconditional StyleGAN2-ADA achieves an FID of 10.3 with the settings for 256 by 256 images set in (Karras et al., 2020), and an FID of 8.41 when we turn off path length regularization and style mixing. Note that unconditional StyleGAN results on LSUN images with lower FID scores reported in related work are generally obtained by training on a single LSUN scene or object category, rather than on multiple categories simultaneously as in our experiments, which is a more challenging setup. The average WSGAN label model accuracy on this weakly supervised LSUN dataset is 0.766. Other label models obtain the following average accuracies: DawidSkene 0.765, Majority Vote 0.76, FlyingSquid 0.739, Snorkel MeTaL 0.740, and Snorkel 0.728.

CIFAR10: First, Figure 9 shows synthetic images by StyleWSGAN on the weakly supervised CIFAR10-B dataset, which uses SSL-based LFs. These LFs are quite noisy, with a mean LF accuracy of 0.736, which is reflected in the noisy class-conditional samples that can be inspected in Figure 9. On this dataset, StyleWSGAN achieves a mean FID of 3.79, while also attaining a high label model accuracy of 0.736. The unsupervised StyleGAN2-ADA, with the optimal, tuned settings identified in (Karras et al., 2020), achieves an average FID of 3.85 on this subset.
We create an additional weakly supervised CIFAR10 dataset with lower-noise LFs, to see if such a setting can lead to results that are better than the state-of-the-art (SOTA) unsupervised image generation quality on the full CIFAR10 dataset reported in (Karras et al., 2020). For this experiment, we create LFs by randomly introducing errors and abstains to the ground-truth vector. For the LFs, we set a minimum accuracy of 0.8 and a maximum accuracy of 0.95 and create 20 LFs. This dataset contains 48,000 samples, has a mean LF accuracy of 0.888, and a mean coverage of 0.102 (meaning that an LF on average abstains on ∼89.8% of the dataset). For this dataset, StyleWSGAN achieves an FID of 2.84, which is better than the SOTA unsupervised result reported in (Karras et al., 2019) of 2.92 FID on the full 50k CIFAR10 samples, but shy of the performance of the conditional StyleGAN (Karras et al., 2019), which uses projection discrimination, has access to all ground-truth labels, and achieves an FID of 2.42.
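The construction of LFs by "randomly introducing errors and abstains to the ground-truth vector" can be sketched as below. This is one plausible reading, not the authors' procedure: the function name, the abstain marker `-1`, the per-sample Bernoulli draws, and the uniform choice of a wrong class on errors are all assumptions. The defaults mirror the numbers quoted above (20 LFs, accuracies between 0.8 and 0.95, coverage near 0.1).

```python
import random

ABSTAIN = -1  # assumed marker for "no vote"


def make_noisy_lfs(y, num_classes, num_lfs=20, acc_range=(0.8, 0.95),
                   coverage=0.1, seed=0):
    """Return a list of num_lfs vote columns, each aligned with labels y."""
    rng = random.Random(seed)
    votes = []
    for _ in range(num_lfs):
        acc = rng.uniform(*acc_range)  # target accuracy of this LF
        column = []
        for label in y:
            if rng.random() >= coverage:
                column.append(ABSTAIN)  # LF abstains on this sample
            elif rng.random() < acc:
                column.append(label)    # correct vote
            else:
                # Erroneous vote: any class other than the true one.
                wrong = [c for c in range(num_classes) if c != label]
                column.append(rng.choice(wrong))
        votes.append(column)
    return votes
```

With independent draws per sample, the realized accuracy and coverage of each LF only approximate their targets; an exact-count variant would instead sample a fixed number of voting indices.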

D ADDITIONAL BASELINES

We compare against two additional baselines. First, we train a generative model that is conditioned on pseudolabel information with the aim of improving image generation performance; we use pseudolabels provided by established weak-supervision label models in this role. Second, we use a basic generative model to produce synthetic samples that augment a downstream classifier (with weak labels provided by outputs of weak supervision sources applied to the synthetic images). These two baselines represent the straightforward way to use weak supervision to improve generative modeling (and vice-versa) . We observe that such naive combinations struggle compared to our proposed approach, further motivating the importance of the interface in WSGAN. As an additional GAN baseline to our proposed WSGAN, we slightly adapt an Auxiliary Classifier Generative Adversarial Network (ACGAN) (Odena et al., 2017) to condition it on pseudolabels. The ACGAN is run on all data, but the auxiliary loss on real data with pseudolabels is only used for samples where at least one labeling function does not abstain. We create two versions: (1) using probabilistic pseudolabels with a soft cross-entropy loss, and (2) using hard/crisp labels with a cross-entropy loss. To provide the strongest possible baseline in this experiment, we obtain the pseudolabels via the Dawid-Skene label model as it attains the best performance on average over all datasets compared to other related label models. Results are provided in Table 9 , showing that this baseline approach is frequently unable to overcome the noise in the pseudolabels to improve over the InfoGAN results, and that it does not perform better than WSGAN with an encoder. Furthermore, it was much more difficult to train these models and they frequently failed to converge. 
We also attempted to train different types of conditional GANs (ACGAN and a GAN with projection discrimination) conditioned on the raw weak supervision votes, but were unable to obtain reasonable performance as the models failed to converge.

D.2 DATA AUGMENTATION FOR DOWNSTREAM CLASSIFICATION WITH SYNTHETIC IMAGES

In this experiment, we augment the training set for a downstream classifier with synthetic images. As baselines, we generate synthetic images x with an InfoGAN, and then apply the image labeling functions λ to the generated images to obtain LF votes λ(x). Pseudolabels for the synthetic images are then obtained by fitting label models to the real training data and then applying the label models to the labeling function outputs on the synthetic data. Table 10 compares InfoGAN + Snorkel and InfoGAN + DawidSkene baselines to the improvements in test accuracy obtained by using WSGAN and shows that we were unable to obtain improvements in downstream accuracy with the baselines, possibly due to performance of the InfoGAN on these small datasets. 

E ADDITIONAL METRICS

We provide additional metrics for the label model comparisons shown in Table 2 in the main paper. Again, results are averaged over 4 random runs. Table 13 shows the weighted F1 score, an average over all classes weighted by the support of each class. Table 12 shows weighted mean average precision, a metric that summarizes the precision-recall curve across all classes. We compute the average precision individually for each class (one vs. rest) and then aggregate the scores by summing them weighted by the support of each class to produce the weighted mean average precision score. 
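As a concrete reading of the weighted F1 score described above, the sketch below computes a one-vs-rest F1 per class and averages the scores weighted by class support. The function name is chosen here for illustration; the definition matches the usual support-weighted F1, but it is not claimed to be the exact evaluation code behind the tables.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class (one-vs-rest) F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n_cls / total) * f1
    return score
```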

F THEORETICAL JUSTIFICATION

We provide additional setup and proofs for our two theoretical claims.

F.1 CLAIM (1)

Our goal is to derive a generalization bound; that is, an upper bound on $|\hat{R}_{\hat{D}} - R_D|$. In words, this is the gap between the loss on a sample drawn from the true distribution and the empirical loss we obtained by training on the weakly-supervised dataset with unlabeled data sampled from the generative model.

Mixture of Gaussians

Recall that $D$ is the joint distribution of the unlabeled and labeled points. Let's call the unlabeled data marginal distribution $D_X$. Then, we make the assumption that $D_X$ is a mixture of $k$ Gaussians. Here, there is some relationship between the mixtures and the two classes, but we need not further specify it. Using the result of (Ashtiani et al., 2018), we get that the number of samples needed to learn $D_X$ up to $\varepsilon$ in total variation distance is $\Theta(kd^2/\varepsilon^2)$. Note that in fact this expression hides some polylogarithmic terms. However, for simplicity, we're going to ignore these terms and just pretend that the necessary bound is $c_G k d^2/\varepsilon^2$, where $c_G$ is some constant for learning a density. Based on this, we'll make the following assumption. We perform density estimation on $n_1$ samples from $D_X$ and obtain some model $g$ such that the distribution of $g$ (we'll abuse notation and just refer to this as the model itself, $g$) and $D_X$ satisfy
$$d_{TV}(D_X, g) \le d\sqrt{\frac{c_G k}{n_1}}.$$
So now we have control over one marginal (the unlabeled data). Let's work on the conditional term next.

Majority Vote. For simplicity, let's assume that we use majority vote as the aggregation scheme for the $m$ labeling functions. We make the following assumptions. The labeling functions have accuracy $1/2 + \alpha$, for some $\alpha \in (0, 1/2]$, in the following sense. For any datapoint $(X, Y)$, the probability of a labeling function guessing the value of $Y$ correctly is $1/2 + \alpha$, and the probability of guessing wrong is $1/2 - \alpha$. This holds for all values of $X$. Note: these are very strong assumptions. The probability that we make a mistake, e.g., that majority vote aggregates votes to 0 when $Y = 1$ or vice versa, is given by the binomial CDF $F(m/2, m, \alpha + 1/2)$, which has the following simple bound that follows from Hoeffding's inequality:
$$F(m/2, m, \alpha + 1/2) \le \exp\left(-2m\left(\alpha + \frac{1}{2} - \frac{m/2}{m}\right)^2\right) = \exp(-2m\alpha^2).$$
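The Hoeffding bound above can be checked numerically against the exact binomial CDF. The sketch below is purely illustrative: the function names are chosen here, and the model is exactly the strong one stated in the text (independent labeling functions, each correct with probability $1/2 + \alpha$).

```python
from math import comb, exp


def majority_vote_error(m, alpha):
    """Exact F(m/2, m, alpha + 1/2): P(at most m/2 of m LFs are correct)."""
    p = 0.5 + alpha
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k)
               for k in range(m // 2 + 1))


def hoeffding_bound(m, alpha):
    """Upper bound exp(-2 m alpha^2) from Hoeffding's inequality."""
    return exp(-2 * m * alpha**2)


if __name__ == "__main__":
    for m in (5, 25, 100):
        print(m, majority_vote_error(m, 0.1), hoeffding_bound(m, 0.1))
```

The exact error sits below the bound for every m and α, and both shrink as m grows, which is the behavior the weak supervision penalty term relies on.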
With the above, as $D_{Y|X}$ is a Bernoulli random variable, we can directly upper bound the total variation distance between $D_{Y|X}$ and $D_{\hat{Y}|X}$:
$$d_{TV}(D_{Y|X}, D_{\hat{Y}|X}) \le \exp(-2m\alpha^2).$$

Joint Distribution. Now we have some control over the generative model's error (from the density estimation bound) and some control over the label recovery (from the above bound resulting from majority vote). Now we put it together. First, we write down some useful inequalities between the total variation distance and the Hellinger distance (Duchi, 2016) (Prop 2.10). These are, for densities $p, q$,
$$D_{hel}(p, q) \le \sqrt{2\, d_{TV}(p, q)} \quad \text{and} \quad d_{TV}(p, q) \le D_{hel}(p, q)\sqrt{1 - D_{hel}(p, q)^2/4}.$$
For the right-hand term, we have the following:
$$|R_{\hat{D}} - R_D| = \left|\int \ell(f(x), y)\,\big(p(x, y) - q(x, y)\big)\, d\mu\right| \le B_\ell\, d_{TV}(\hat{D}, D).$$
Then, putting this together with the expression in Eq. (11) into Eq. (12), we get that, with probability at least $1 - \delta$,
$$\sup_{f \in F} |\hat{R}_{\hat{D}}(f) - R_D(f)| \le \sup_{f \in F} |\hat{R}_{\hat{D}} - R_{\hat{D}}| + |R_{\hat{D}} - R_D| \le 2\mathfrak{R} + \sqrt{\frac{\log(1/\delta)}{2n_2}} + B_\ell\, d_{TV}(\hat{D}, D) \le 2\mathfrak{R} + \sqrt{\frac{\log(1/\delta)}{2n_2}} + B_\ell\left(\frac{4 c_G k d^2}{n_1}\right)^{1/4} + B_\ell \sqrt{2}\exp(-m\alpha^2). \tag{13}$$

Interpreting the Bound. In Eq. (13), we saw that
$$\sup_{f \in F} |\hat{R}_{\hat{D}}(f) - R_D(f)| \le 2\mathfrak{R} + \sqrt{\frac{\log(1/\delta)}{2n_2}} + B_\ell\left(\frac{4 c_G k d^2}{n_1}\right)^{1/4} + B_\ell \sqrt{2}\exp(-m\alpha^2).$$
Here, $n_1$ is the number of samples of unlabeled data used to train the generative model. Note also the dependence on the number of mixture components and dimension.
• A penalty term due to weak supervision. It tells us what we lose by using estimated (pseudo)labels rather than true labels; we note that the penalty decays exponentially in the number of labeling functions $m$, but the decay is slowed by small $\alpha$, as our accuracies are only $\alpha$ better than random.
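To get a feel for how the data-dependent terms of the bound trade off, the sketch below evaluates the estimation term, the generative-model penalty, and the weak supervision penalty for given sample sizes. It is illustrative only: the constants $c_G$ and $B_\ell$ are unknown in general and are set to 1 here as placeholders, the default $\delta = 0.05$ is arbitrary, and the Rademacher term is omitted because it depends on the function class.

```python
from math import exp, log, sqrt


def bound_terms(n1, n2, m, alpha, k, d, delta=0.05, c_G=1.0, B_loss=1.0):
    """Evaluate the three data-dependent terms of the bound.

    c_G, B_loss and delta are placeholder values, not taken from the paper.
    """
    return {
        # Standard 1/sqrt(n2) estimation error.
        "estimation": sqrt(log(1 / delta) / (2 * n2)),
        # Penalty for training on generated data; scales as n1^(-1/4).
        "generative": B_loss * (4 * c_G * k * d**2 / n1) ** 0.25,
        # Penalty for pseudolabels; decays exponentially in m.
        "weak_supervision": B_loss * sqrt(2) * exp(-m * alpha**2),
    }
```

For instance, multiplying $n_1$ by 16 halves the generative penalty, while doubling $m$ squares the factor $\exp(-m\alpha^2)$ in the weak supervision penalty.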

F.2 CLAIM (2)

Our proof of claim (2) uses the setting of Thekumparampil et al. (2018), which introduces RCGAN. RCGAN is a conditional GAN architecture that corrupts the labels before passing them to the discriminator by passing the true labels through a noisy channel. The authors provide a multiplicative approximation bound between the GAN loss under the unobserved true labels and the loss under the noisy labels. This noisy channel model acts as a nice model of the label generating process of weak supervision. Using this noisy channel model, we can control the amount of label corruption to match that of weak supervision.

Theorem 1 says that the total variation distance between the true noisy distribution and the noisy generated distribution from RCGAN approximates its noiseless counterpart up to a factor of $\|C^{-1}\|_\infty$. Our goal is to construct $C_{\epsilon_{MV}}$ to model the noise from weak supervision (in particular, majority vote) and show that it leads to a tighter bound than when we directly plug the labels from a single LF into Theorem 1. To begin, consider the following parameterization of $C$, with $\epsilon \in (0, 1/2)$:
$$C_\epsilon = I_2 + \begin{pmatrix} -\epsilon & \epsilon \\ \epsilon & -\epsilon \end{pmatrix} = \begin{pmatrix} 1 - \epsilon & \epsilon \\ \epsilon & 1 - \epsilon \end{pmatrix}. \tag{15}$$
Here, $\epsilon$ denotes the labeling error for each class. Given this parameterization, we obtain the following expression for $\|C_\epsilon^{-1}\|_\infty$:
$$\|C_\epsilon^{-1}\|_\infty = \left\|\begin{pmatrix} 1 - \epsilon & \epsilon \\ \epsilon & 1 - \epsilon \end{pmatrix}^{-1}\right\|_\infty \tag{16}$$
$$= \left|\big((1 - \epsilon)^2 - \epsilon^2\big)^{-1}\right| \left\|\begin{pmatrix} 1 - \epsilon & -\epsilon \\ -\epsilon & 1 - \epsilon \end{pmatrix}\right\|_\infty \tag{17}$$
$$= (1 - 2\epsilon)^{-1}. \tag{18}$$
Note that $C_\epsilon$ is full-rank as it has a finite inverse. It is also clear that $\|C_\epsilon^{-1}\|_\infty$ is a monotonically increasing function of $\epsilon$. That is to say that if we do something to decrease the labeling error $\epsilon$, then $\|C_\epsilon^{-1}\|_\infty$ also decreases and we obtain a tighter bound. We will go on to derive an expression for the labeling error under majority vote with $m$ LFs, $\epsilon_{MV}$, and show that it is smaller than the labeling error from a single LF, $\epsilon_\lambda$. Namely, we want to find a condition under which $\epsilon_{MV} \le \epsilon_\lambda$ holds, so that majority vote leads to an improved Theorem 1 bound.

Proof.
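We begin by deriving an upper bound on $\epsilon_{MV}$. We have LFs $\{\lambda_i\}_{i=1}^m$ that each produce incorrect predictions with probability $\epsilon_\lambda = \frac{1}{2} - \alpha$, using $\alpha$ as defined in Claim (1). Now, we need to show that the probability of producing incorrect predictions using majority vote with more label functions, $\{\lambda_i\}_{i=1}^m$, has error $\epsilon_{MV} \le \epsilon_\lambda$. Define the event that $\lambda_i$ is incorrect as follows: $z_i = \mathbb{I}[\lambda_i \ne y]$, then $\mathbb{E}[z_i] = \epsilon_\lambda$. Using this, we apply Hoeffding's bound to $\epsilon_{MV}$:
$$\epsilon_{MV} = P\left(\sum_{i=1}^m z_i - m\epsilon_\lambda \ge \frac{m}{2} - m\epsilon_\lambda\right) \tag{19}$$
$$\le \exp\left(-\frac{2\left(\frac{m}{2} - m\epsilon_\lambda\right)^2}{m}\right) \tag{20}$$
$$= \exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right). \tag{21}$$
Next, we plug the bound from (21) into (18) to obtain the following expression for $\|C_{\epsilon_{MV}}^{-1}\|_\infty$:
$$\|C_{\epsilon_{MV}}^{-1}\|_\infty = (1 - 2\epsilon_{MV})^{-1} \tag{22}$$
$$\le \left(1 - 2\exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right)\right)^{-1}. \tag{23}$$
To complete the proof, we need the following to hold: $\|C_{\epsilon_{MV}}^{-1}\|_\infty \le \|C_{\epsilon_\lambda}^{-1}\|_\infty$, but due to the monotonicity of $\|C_\epsilon^{-1}\|_\infty$, it is sufficient to show that $\epsilon_{MV} \le \epsilon_\lambda$. Recall that $\epsilon_{MV} \le \exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right)$, so if we set $\exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right) \le \epsilon_\lambda$, we obtain the minimum number of label functions, $m$, required to ensure $\epsilon_{MV} \le \epsilon_\lambda$:
$$\exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right) \le \epsilon_\lambda \;\Rightarrow\; m \ge \frac{\log(1/\epsilon_\lambda)}{2\left(\frac{1}{2} - \epsilon_\lambda\right)^2}.$$
Plugging (23) into Theorem 1, we obtain the following:
$$d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le d_{TV}(P, Q) \le \|C_{\epsilon_{MV}}^{-1}\|_\infty\, d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le \left(1 - 2\exp\left(-2m\left(\frac{1}{2} - \epsilon_\lambda\right)^2\right)\right)^{-1} d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le (1 - 2\epsilon_\lambda)^{-1}\, d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) = \|C_{\epsilon_\lambda}^{-1}\|_\infty\, d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}),$$
which completes the proof.

Notice that the proof of Proposition 1 does not depend on total variation distance beyond the dependence on Theorem 1. As such, Proposition 1 can be stated more generally in terms of the Integral Probability Metric induced by the GAN discriminator $F$ using Theorem 2 of Thekumparampil et al. (2018):
$$d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le d_F(P, Q) \le \|C_{\epsilon_{MV}}^{-1}\|_\infty\, d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le \|C_{\epsilon_\lambda}^{-1}\|_\infty\, d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}).$$
Finally, notice that Proposition 1 is made in terms of $d_F(\tilde{P}_{MV}, \tilde{Q}_{MV})$ and not in terms of $d_F(\tilde{P}_\lambda, \tilde{Q}_\lambda)$.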
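The closed form $\|C_\epsilon^{-1}\|_\infty = (1 - 2\epsilon)^{-1}$ from Eq. (16) to (18) is easy to confirm numerically. The sketch below inverts the 2 × 2 channel matrix by hand and takes the maximum absolute row sum; the function name is chosen here for illustration.

```python
def inverse_inf_norm(eps):
    """Infinity norm of the inverse of C_eps = [[1-eps, eps], [eps, 1-eps]]."""
    a, b = 1.0 - eps, eps
    det = a * a - b * b  # equals 1 - 2*eps
    if det == 0.0:
        raise ValueError("C_eps is singular at eps = 1/2")
    inv = [[a / det, -b / det],
           [-b / det, a / det]]
    # Matrix infinity norm: the largest absolute row sum.
    return max(sum(abs(v) for v in row) for row in inv)


if __name__ == "__main__":
    for eps in (0.05, 0.1, 0.3, 0.45):
        print(eps, inverse_inf_norm(eps), 1.0 / (1.0 - 2.0 * eps))
```

The printed pairs agree up to floating-point error and grow with ϵ, which is the monotonicity the proof uses.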
We can show that as the number of LFs approaches infinity, we recover the distance under the clean labels: $d_F(P, Q)$. Applying Theorem 1 to majority vote and a single LF results in the following two expressions:
$$d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le d_F(P, Q) \le \|C_{\epsilon_{MV}}^{-1}\|_\infty\, d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}) \tag{24}$$
$$d_F(\tilde{P}_\lambda, \tilde{Q}_\lambda) \le d_F(P, Q) \le \|C_{\epsilon_\lambda}^{-1}\|_\infty\, d_F(\tilde{P}_\lambda, \tilde{Q}_\lambda)$$
Rearranging terms, we obtain the following:
$$1 \le \frac{d_F(P, Q)}{d_F(\tilde{P}_{MV}, \tilde{Q}_{MV})} \le \|C_{\epsilon_{MV}}^{-1}\|_\infty \le \|C_{\epsilon_\lambda}^{-1}\|_\infty \quad \text{and} \quad 1 \le \frac{d_F(P, Q)}{d_F(\tilde{P}_\lambda, \tilde{Q}_\lambda)} \le \|C_{\epsilon_\lambda}^{-1}\|_\infty.$$
Notice that $\|C_{\epsilon_\lambda}^{-1}\|_\infty$ has no dependence on $m$ since it's a single LF, but $\|C_{\epsilon_{MV}}^{-1}\|_\infty$ approaches 1 as $m \to \infty$:
$$\lim_{m \to \infty}:\; 1 \le \frac{d_F(P, Q)}{d_F(\tilde{P}_{MV}, \tilde{Q}_{MV})} \le \|C_{\epsilon_{MV}}^{-1}\|_\infty \le \|C_{\epsilon_\lambda}^{-1}\|_\infty \;\Rightarrow\; 1 \le \frac{d_F(P, Q)}{d_F(\tilde{P}_{MV}, \tilde{Q}_{MV})} \le 1 \le \|C_{\epsilon_\lambda}^{-1}\|_\infty \;\Rightarrow\; d_F(P, Q) = d_F(\tilde{P}_{MV}, \tilde{Q}_{MV}).$$
Hence we obtain a stronger bound as the number of LFs increases.

Figure 12: A random set of FashionMNIST images generated by WSGAN-E (using a DCGAN), with the discrete latent random variable kept fixed for each row of images.

Figure 13: A random set of DomainNet images generated by WSGAN-E (using a DCGAN), with the discrete latent random variable kept fixed for each row of images. Note that this dataset is particularly challenging for a GAN as our dataset has fewer than 7,000 images, resulting in considerably lower quality of synthetic images compared to GTSRB for example.

Figure 14: A random set of AwA2 images generated by WSGAN-E (using a DCGAN), with the discrete latent random variable kept fixed for each row of images. Note that this dataset is particularly challenging for a GAN, as this weakly supervised dataset has fewer than 7,000 images, resulting in considerably lower quality of synthetic images compared to GTSRB for example.



Footnote: the six DomainNet domains are real, painting, sketch, clipart, infograph, and quickdraw.



Figure 1: Class-conditional image generation by the proposed WSGAN based on a weakly supervised CIFAR10 subset with 30k samples. Here, WSGAN uses a StyleGAN2 base architecture and we keep the discrete code in each row fixed.

Figure 2: The proposed WSGAN models discrete latent variables in X via a network Q, while learning a generator G and discriminator D. A label model L uses weak supervision votes λ and weights estimated by A to produce pseudolabels. WSGAN aligns the pseudolabels with the discrete structure learned by Q.

(II) Dawid-Skene (Dawid & Skene, 1979): a model motivated by the crowdsourcing setting. The model, fit using expectation maximization, assumes that error statistics of sources are the same across classes and that errors are equiprobable independent of the true class.
(III) Snorkel MeTaL (Ratner et al., 2019): a Markov random field (MRF) model similar to Snorkel which uses a technique to complete the inverse covariance matrix of the MRF during model fitting, and also allows for modeling multi-task weak supervision.
(IV) FlyingSquid (FS) (Fu et al., 2020): based on a label model similar to Snorkel, FS provides a closed form solution by augmenting it to set up a binary Ising model, enabling scalable model fitting.
(V) Majority Vote (MV): a standard scheme that uses the most popular LF output as the estimate of the true label.
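The simplest of the baselines listed above, Majority Vote, can be sketched in a few lines. The abstain marker `-1` and the tie-breaking rule (smallest class index wins) are assumptions made for this illustration; the text does not specify how ties or all-abstain samples are handled.

```python
from collections import Counter

ABSTAIN = -1  # assumed marker for an LF that casts no vote


def majority_vote(votes):
    """Return the most frequent non-abstain vote for one sample.

    Ties go to the smallest class index (an arbitrary choice made here);
    if every LF abstains, the result is ABSTAIN.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)
```

For example, `majority_vote([1, 1, ABSTAIN, 2])` yields class 1.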

Figure 3 shows the evolving ARI between the ground truth and the auxiliary model Q's prediction of the latent code on real data during model training. The figures show a large improvement in Q's ability to uncover the unobserved class label structure when comparing WSGAN to InfoGAN, which is expected as WSGAN can take advantage of the weak signals encoded in LFs, while InfoGAN is completely unsupervised.

Figure 3: Adjusted Rand Index of the unobserved y and the code predictions Q(x) on real images x. Weak supervision allows WSGAN to better uncover latent y compared to an unsupervised InfoGAN.

Figure 5: The DCGAN generator architecture used in our experiments with 32x32 images.

Figure 6: Some synthetic images and pseudolabels generated by the proposed WSGAN with a DCGAN base architecture, learned from weakly supervised CIFAR10. We note that WSGAN is able to generate images and estimate their labels, even for images where no weak supervision sources provide information (see end of Section 3.1 for details).

Figure 7: We here show the Adjusted Rand Index (ARI) for the additional CIFAR experiments. The plots show the ARI between the unobserved class label y and the discrete code prediction by the auxiliary model Q(x) on real image x, during training. Weak supervision allows WSGAN to better uncover the latent class structure compared to an unsupervised InfoGAN.

Figure 8: Synthetic images learned by StyleWSGAN on a weakly supervised subset of the LSUN scene category dataset.

Figure 9: Synthetic images learned on the CIFAR10-B subset by StyleWSGAN, which is a version of our WSGAN that is built on StyleGAN2-ADA rather than a simple DCGAN as in our main experiments.

Now let's interpret this result piece-by-piece. The terms are the following:
• The Rademacher complexity of the function class, which is present in the standard generalization bound.
• An estimation error term as a function of how much data we have to train our classifier, $n_2$. It has the standard rate $1/\sqrt{n_2}$. Again, this is standard in any bound.
• A penalty term due to the generative model usage. It tells us how much we lose by training on generated data rather than (unlabeled) data from the true distribution. It scales as $n_1^{-1/4}$.

the setup of Thekumparampil et al. (2018), we define a function that multiplies a one-hot encoded true label vector by a right-stochastic matrix $C \in \mathbb{R}^{2 \times 2}$ where $C_{i,j} = P(\tilde{y}_j \mid y_i)$; this is our noisy channel. This induces a joint distribution $\tilde{P}_{X,\tilde{Y}}$ for the examples $x$ and noisy labels $\tilde{y}$ from the conditional distribution defined by $C$. We restate the theorems of interest from Thekumparampil et al. (2018) here, and proceed to adapt them to our problem setting.

Theorem 1. (Multiplicative bound on the total variation distance from Thekumparampil et al. (2018).) Let $P_{X,Y}$ and $Q_{X,Y}$ be two distributions over $X \times \{0, 1\}$ and let $\tilde{P}_{X,\tilde{Y}}$ and $\tilde{Q}_{X,\tilde{Y}}$ be the corresponding distributions with noisy labels from $C$. If $C$ is full-rank, then
$$d_{TV}(\tilde{P}, \tilde{Q}) \le d_{TV}(P, Q) \le \|C^{-1}\|_\infty\, d_{TV}(\tilde{P}, \tilde{Q}).$$

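Proposition 1. (Total variation version.) Let $\epsilon_{MV}$ be the labeling error from majority vote from $m$ LFs, where $m \ge \frac{\log(1/\epsilon_\lambda)}{2(\frac{1}{2} - \epsilon_\lambda)^2}$, whose individual labeling errors are each $\epsilon_\lambda$. Then the following holds:

$$d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le d_{TV}(P, Q) \le \|C^{-1}_{\epsilon_{MV}}\|_\infty \, d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}) \le \|C^{-1}_{\epsilon_\lambda}\|_\infty \, d_{TV}(\tilde{P}_{MV}, \tilde{Q}_{MV}).$$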
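To give a sense of scale for Proposition 1, the sketch below computes the smallest number of LFs that satisfies the stated condition on m. It rests on two assumptions that are ours and not stated in the proposition: that the LFs err independently, so that a Hoeffding-style bound on the majority vote error applies, and that the noisy channel is symmetric with flip probability ϵ, in which case the norm of its inverse is 1/(1 - 2ϵ). All function names are hypothetical.

```python
import math


def min_labeling_functions(eps_lf):
    """Smallest integer m with m >= log(1/eps) / (2 * (1/2 - eps)^2)."""
    if not 0.0 < eps_lf < 0.5:
        raise ValueError("eps_lf must lie strictly between 0 and 1/2")
    return math.ceil(math.log(1.0 / eps_lf) / (2.0 * (0.5 - eps_lf) ** 2))


def hoeffding_majority_vote_error(m, eps_lf):
    """Hoeffding upper bound on majority vote error (independent LFs assumed)."""
    return math.exp(-2.0 * m * (0.5 - eps_lf) ** 2)


def symmetric_channel_inverse_norm(eps):
    """Max absolute row sum of the inverse of [[1-eps, eps], [eps, 1-eps]]."""
    return 1.0 / (1.0 - 2.0 * eps)


eps_lf = 0.3
m = min_labeling_functions(eps_lf)
eps_mv_bound = hoeffding_majority_vote_error(m, eps_lf)
print(m, eps_mv_bound)
print(symmetric_channel_inverse_norm(eps_mv_bound),
      symmetric_channel_inverse_norm(eps_lf))
```

Under these assumptions, the condition on m is exactly what makes the Hoeffding bound on the majority vote error no larger than ϵ_λ, which in turn makes the multiplicative constant for majority vote no larger than the one for a single LF.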

Datasets and labeling function (LF) characteristics used to evaluate the proposed WSGAN. Acc denotes accuracy, and Coverage denotes the proportion of samples where the LF does not abstain.

Average posterior accuracy of various label models on training samples with at least one LF vote. We highlight the best result in blue and the second best result in bold.

Evaluation Metrics. As is common in related work, label model performance is compared based on the pseudolabel accuracy the models achieve on the training data, since programmatic weak supervision operates in a transductive setting. Weighted F1 and mean Average Precision are provided in Appendix E.

Color image generation quality (mean FID). The best scores are highlighted in blue.



Additional datasets and labeling function (LF) characteristics used to evaluate the proposed WSGAN model. Acc denotes accuracy, while Coverage denotes the number of samples where the LF does not abstain.

Additional datasets to evaluate WSGAN with a DCGAN base architecture. This table shows average posterior accuracy of various label models on training samples with at least one LF vote. We highlight the best result in blue and the second best result in bold.

Additional datasets: color image generation quality measured by average Fréchet Inception Distance (FID). The best scores for each dataset are highlighted in blue.

See the CIFAR10 dataset details in Table 5. The proposed WSGAN approach outperforms related approaches in these experiments as well. The label model accuracy results are shown in Table 6, while additional metrics including F1 are shown in Section E. Image generation quality results are provided in Table

Additional datasets used to evaluate StyleWSGAN. Acc denotes accuracy, while Coverage denotes the number of samples where an LF does not abstain.

A comparison to using an ACGAN with pseudolabels. Image generation quality is measured by average Fréchet Inception Distance (FID). The best scores for each dataset are highlighted in blue.

Baseline comparisons using an InfoGAN to create synthetic images, applying LFs to the synthetic images, and then using established label models to combine the weak labels into a pseudolabel, resulting in weakly labeled fake images. The table shows the change in test accuracy obtained by augmenting the downstream classifier training data with 1,000 such synthetic images and their corresponding pseudolabels. Experiments are conducted on the subset of datasets where labeling functions can be applied to synthetic images.

In this table, we include standard deviations for the posterior accuracy of various label models on training samples with at least one LF vote, computed over five random runs. Due to a limited computational budget, we were unable to accumulate five runs for all datasets and model combinations.

Weighted mean average precision of various label models on training samples with at least one LF vote. We highlight the best result in blue and the second best result in bold.

Weighted F1 score of various label models on training samples with at least one LF vote. The F1 is computed separately for each class and then averaged weighted by the support of each class. We highlight the best result in blue and the second best result in bold.
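The support-weighted averaging described in this caption can be written out in a few lines. The following is a minimal pure-Python sketch; the function name `weighted_f1` is ours, and the sketch is not taken from the evaluation code used for the experiments.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = n_cls - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least n_cls > 0.
        f1 = 2.0 * tp / (2.0 * tp + fp + fn)
        score += (n_cls / total) * f1
    return score
```

For example, with true labels `[0, 0, 0, 1]` and predictions `[0, 0, 1, 1]`, class 0 has F1 of 0.8 with weight 0.75 and class 1 has F1 of 2/3 with weight 0.25.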

ACKNOWLEDGMENTS

This work was partially supported by a Space Technology Research Institutes grant from NASA's Space Technology Research Grants Program and the Defense Advanced Research Projects Agency's award FA8750-17-2-0130. This work was also supported in part by NSF (#1651565, #CCF2106707) and ARO (W911NF2110125), and the Wisconsin Alumni Research Foundation (WARF).

AVAILABILITY

Code for WSGAN can be found at https://github.com/benbo/WSGAN-paper.

(WS)GAN Models. The following design choices were used for the experiments conducted with simple DCGAN base networks (as opposed to the settings used in the StyleGAN ablations).

APPENDIX

Algorithm 1: Pseudocode for the proposed WSGAN loss term, which is added to the basic InfoGAN loss. Input images and LFs are assumed to be filtered to contain only samples with at least one non-abstaining LF vote.

input: batch of real images X and one-hot encoded LFs Λ, label model L, networks Q, A, F_1, F_2, WSGAN mode m, number of classes C, current epoch i, decay parameter γ.

    b, Z = Q(X)                  # get predicted code b and image features Z
    if m == "vector":
        θ = A_θ()                # WSGAN-vector: get weight vector
    else:
        θ = A(Z.detach())        # WSGAN-encoder: predict weights using image features Z
    ŷ = L(Λ, θ)                  # get label estimate from label model using weights θ
    # compute cross-entropy losses
    loss = celoss(F_1(b), ŷ.detach())
    loss += celoss(F_2(ŷ), b.detach())
    # add decaying loss keeping weights uniform
    loss += C / (i × γ + 1) × mse(θ, 1⃗ × 0.5)
    return loss
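The last loss term in Algorithm 1 pulls the accuracy weights θ toward the uniform value 0.5, with a coefficient C/(i × γ + 1) that shrinks as training progresses. The sketch below isolates this term in pure Python to show how quickly it decays. The function name, the example values of C and γ, and the use of a mean (rather than a sum) in the squared error are assumptions for illustration.

```python
def uniformity_penalty(theta, num_classes, epoch, gamma):
    """Decaying penalty C / (i * gamma + 1) * mse(theta, 0.5)."""
    # Mean squared distance of the weights from the uniform value 0.5.
    mse = sum((t - 0.5) ** 2 for t in theta) / len(theta)
    return num_classes / (epoch * gamma + 1.0) * mse


# Hypothetical setting: 10 classes, gamma = 1.5, weights far from uniform.
theta = [1.0, 0.0]
for epoch in (0, 2, 20):
    print(epoch, uniformity_penalty(theta, num_classes=10, epoch=epoch, gamma=1.5))
```

At epoch 0 the coefficient equals C, so the weights are held close to uniform early on; as i grows the term vanishes and the two cross-entropy terms dominate.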

A IMPLEMENTATION DETAILS AND COMPLEXITY

Published as a conference paper at ICLR 2023

We use p as the density for D and q as the density for D, and write $p = p_1(x) p_2(y|x)$, $q = q_1(x) q_2(y|x)$. First, using Eq. (5) and Eq. (7), we have that

Then, using Eq. (6) and Eq. (7), we get

Next,

Note that here, we use the fact that our bound holds for all conditional distributions regardless of x. Continuing,

Now we apply Eq. (8) to get the bound back into the total variation distance setting. We have

Bounding the Risk. The final task is to bound the risk. First, suppose we are training a classifier chosen from a function class F, trained on $n_2$ independently drawn data points. Then, a standard result is that with probability at least $1 - \delta$,

Here, R is the Rademacher complexity of the function class. However, the above is for training on samples from the true distribution. Instead, we can write

F.3 EXTENSIONS

For the sake of clarity, we make several simplifying assumptions in Claims (1) and (2). Both claims use the simplest possible aggregation strategy for weak supervision, majority vote, and our analysis in Claim (1) involves the use of a Gaussian mixture model, a less complex object of study than a GAN. We can directly extend both analyses to use more sophisticated weak supervision label models instead of majority vote, and different generative models, which should lead to improved bounds at the expense of a more complex claim statement. We note, additionally, that neither of the claims attempts to provide deep insight into the benefits of jointly learning the generative model and the label model, but this can be done with a slightly more careful analysis.

G ADDITIONAL IMAGES

Here we provide additional generated images in Figures 10, 11, 12, 13, and 14. These random images are generated by WSGAN with a DCGAN base architecture, where the discrete latent variable d passed to the generator is kept the same in each row.

