GENERATIVE MODELING HELPS WEAK SUPERVISION (AND VICE VERSA)

Abstract

Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.

1. INTRODUCTION

How can we get the most out of data when we do not have ground truth labels? Two prominent paradigms operate in this setting. First, programmatic weak supervision frameworks use weak sources of training signal to train downstream supervised models without needing access to ground truth labels (Riedel et al., 2010; Ratner et al., 2016; Dehghani et al., 2017; Lang & Poon, 2021). Second, generative models learn data distributions in ways that can benefit downstream tasks, e.g. via data augmentation or representation learning, in particular when learning latent factors of variation (Higgins et al., 2018; Locatello et al., 2019; Hu et al., 2019). Intuitively, these two paradigms should complement each other, as each can be thought of as a different approach to extracting structure from unlabeled data. However, to date there is no simple way to combine them.

Fusing generative models with weak supervision holds substantial promise. For example, it could yield large reductions in data acquisition costs for training complex models. Programmatic weak supervision replaces the need for manual annotations by applying so-called labeling functions to unlabeled data, producing weak labels that are combined into a pseudolabel for each sample. This leaves the majority of the acquisition budget to be spent on unlabeled data, where generative modeling can reduce the number of real-world samples that need to be collected. Conversely, information about the data distribution contained in weak label sources may improve generative models, reducing the volume of samples that must be acquired to increase generative performance and to model discrete structure. Additionally, learning with weak labels may enable targeted data augmentation, allowing for class-conditional sample generation despite the lack of ground truth labels. The main technical challenge is to build an interface between the core models used in the two approaches.
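The labeling-function mechanism described above can be sketched in a few lines. The functions and the majority-vote combination below are a minimal illustration with made-up rules, not the label model proposed in this work; real label models weight sources by estimated accuracy rather than counting votes.

```python
import numpy as np

# Hypothetical labeling functions for a toy binary sentiment task.
# Convention (as in Snorkel-style frameworks): return a class id, or -1 to abstain.
def lf_keyword(x):
    return 1 if "good" in x else -1   # vote positive on a keyword match

def lf_short(x):
    return 0 if len(x) < 10 else -1   # heuristic: very short strings vote negative

def lf_exclaim(x):
    return 1 if "!" in x else -1      # exclamation marks vote positive

def majority_vote(x, lfs):
    """Combine weak votes into a pseudolabel by simple majority,
    ignoring abstentions (a baseline combination rule)."""
    votes = [v for v in (lf(x) for lf in lfs) if v != -1]
    if not votes:
        return -1  # no source fired: no pseudolabel for this sample
    values, counts = np.unique(votes, return_counts=True)
    return int(values[np.argmax(counts)])
```

Samples on which every source abstains receive no pseudolabel at all, which is one limitation the generative component later helps address.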
Generative adversarial networks (GANs) (Goodfellow et al., 2014), which we focus on in this work, have at least a generator and a discriminator, and frequently additional auxiliary models, such as those that learn to disentangle latent factors of variation (Chen et al., 2016). In programmatic weak supervision, the label model is the main focus. It is necessary to develop an interface that aligns the structures learned from the unlabeled data by these various components.

We introduce the weakly supervised GAN (WSGAN), a simple yet powerful fusion of weak supervision and GANs visualized in Fig. 2, and we provide a theoretical justification that motivates the expected gains from this fusion. Our WSGAN approach is related to the unsupervised InfoGAN (Chen et al., 2016) generative model and is also inspired by encoder-based label models as in Cachay et al. (2021). These techniques expose structure in the data, and our approach ensures alignment between the resulting variables by learning projections between them. The proposed WSGAN offers a number of benefits, including:

• Improved weak supervision: We obtain better-quality pseudolabels via WSGAN's label model, yielding consistent improvements in pseudolabel accuracy of up to 6% over established programmatic weak supervision techniques such as Snorkel (Ratner et al., 2020).

• Improved generative modeling: Weak supervision provides information about unobserved labels which can be used to obtain better disentangled latent variables, thus improving the model's generative performance. Over 6 datasets, our WSGAN approach improves image generation by an average of 5.8 FID points versus InfoGAN. We conduct architecture ablations and show that the proposed approach can be integrated into state-of-the-art GAN architectures such as StyleGAN (Karras et al., 2019) (see Fig. 1), achieving state-of-the-art image generation quality.
• Data augmentation via synthetic samples: WSGAN can generate samples and corresponding label estimates for data augmentation (e.g. Fig. 10), providing improvements in downstream classifier accuracy of up to 3.9% in our experiments. The trained WSGAN can produce label estimates even for samples, real or fake, for which no weak supervision signal is available.
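Concretely, the generator input in an InfoGAN-style model such as WSGAN concatenates continuous noise with a discrete latent code. The sketch below shows how such a latent batch can be built; the dimensions and uniform code sampling are illustrative assumptions of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(batch_size, z_dim=64, n_classes=10):
    """Build generator input as in InfoGAN-style models: continuous
    noise z concatenated with a one-hot discrete code c. In WSGAN the
    discrete code is aligned with the label-model pseudolabel; here we
    simply sample it uniformly. All dimensions are illustrative."""
    z = rng.standard_normal((batch_size, z_dim))
    code_ids = rng.integers(0, n_classes, size=batch_size)
    c = np.eye(n_classes)[code_ids]          # one-hot discrete code
    return np.concatenate([z, c], axis=1), code_ids

latent, code_ids = sample_latent(8)
```

Holding `code_ids` fixed while varying the noise yields class-conditional generation of the kind shown in Fig. 1, where each row keeps the discrete code fixed.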

2. BACKGROUND

We propose to fuse weak supervision with generative modeling to the benefit of both techniques, and first provide a brief overview. A broader review of related work is presented in Section 5.

Weak Supervision. Weak supervision methods that combine multiple sources of imperfect and partial labels (Ratner et al., 2016; 2020; Cachay et al., 2021), sometimes referred to as programmatic weak supervision, seek to replace manual labeling in the construction of large labeled datasets. Instead, users define multiple weak label sources that can be applied automatically to the unlabeled dataset. Such sources can be heuristics, knowledge base look-ups, off-the-shelf models, and more. The technical challenge is to combine the source votes into a high-quality pseudolabel via a label model. This requires estimating the errors of and dependencies between sources and using them to compute a posterior label distribution. Prior work has considered various choices for the label model, most of which take only the weak source outputs into account; a review can be found in Zhang et al. (2021; 2022). Instead, our label model produces sample-dependent accuracy estimates for the weak sources based on the features of the data, similar to Cachay et al. (2021).

Generative Models and GANs. Generative models are used to model and sample from complex distributions. Among the most popular such models are generative adversarial networks (GANs) (Goodfellow et al., 2014), which consist of a generator and a discriminator that play a minimax game against each other. Our approach builds on InfoGAN (Chen et al., 2016), which adds an auxiliary inference component to learn disentangled representations via a set of latent factors of variation. We hypothesize that connecting such discrete latent variables to the label model should yield benefits for both weak supervision and generative modeling.
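As a concrete, deliberately simplified instance of a label model, the posterior over the latent label has a closed form if the sources are assumed conditionally independent given the label and each source's accuracy is known. This naive-Bayes-style sketch is a textbook baseline of our own construction, not the sample-dependent model proposed in this work, which additionally estimates the accuracies from the data features.

```python
import numpy as np

def label_posterior(votes, accuracies, n_classes=2):
    """Posterior over the latent label under a naive-Bayes label model:
    sources vote independently given the true label, each being correct
    with its own fixed accuracy (errors spread uniformly over the other
    classes). Abstentions (vote == -1) carry no information."""
    log_post = np.zeros(n_classes)  # uniform prior over classes
    for v, a in zip(votes, accuracies):
        if v == -1:
            continue
        for y in range(n_classes):
            p = a if v == y else (1.0 - a) / (n_classes - 1)
            log_post[y] += np.log(p)
    post = np.exp(log_post - log_post.max())  # stable normalization
    return post / post.sum()
```

Two accurate sources voting for class 0 should outweigh one mediocre source voting for class 1; estimating the accuracies per sample rather than globally is precisely where WSGAN's label model departs from this baseline.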



Figure 1: Class-conditional image generation by the proposed WSGAN based on a weakly supervised CIFAR10 subset with 30k samples. Here, WSGAN uses a StyleGAN2 base architecture and we keep the discrete code in each row fixed.

