CLASSIFY AND GENERATE RECIPROCALLY: SIMULTANEOUS POSITIVE-UNLABELLED LEARNING AND CONDITIONAL GENERATION WITH EXTRA DATA

Anonymous authors
Paper under double-blind review

Abstract

The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data simultaneously, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performance. In particular, we present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network (CGAN) that is robust to noisy labels, and 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that can learn the clean data distribution from noisy labels predicted by a PU classifier. Theoretically, we prove the optimal condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets, verifying simultaneous improvements in both classification and generation.

1. INTRODUCTION

Existing machine learning methods, particularly deep learning models, typically require large amounts of data to achieve remarkable performance. For instance, conditional deep generative models are able to generate high-fidelity and diverse images, but they rely on vast quantities of labeled data (Lucic et al., 2019). Nevertheless, it is often laborious or impractical to collect large-scale, accurately class-labeled data in real-world scenarios, and thus label scarcity is ubiquitous. Under such circumstances, the performance of classification and conditional generation (Mirza & Osindero, 2014) drops significantly (Lucic et al., 2019). At the same time, diverse unlabeled data are available in enormous quantities, and therefore a key issue is how to take advantage of the extra data to enhance conditional generation or classification.

Within the unlabeled data, both in-distribution and out-of-distribution data exist, where in-distribution data conform to the distribution of the labeled data while out-of-distribution data do not. Our key insight is to harness the out-of-distribution data. In generation with extra data, most related works focus on the in-distribution data (Lucic et al., 2019; Gui et al., 2020; Donahue & Simonyan, 2019). When it comes to the out-of-distribution data, the majority of existing methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempt to forcibly train generative models on a large amount of unlabeled data and then transfer the learned knowledge of the pre-trained generator to the in-distribution data. In classification, a common setting for utilizing unlabeled data is semi-supervised learning (Miyato et al., 2018; Sun et al., 2019; Berthelot et al., 2019), which usually assumes that the unlabeled and labeled data come from the same distribution, ignoring their distributional mismatch.
In contrast, Positive and Unlabeled (PU) Learning (Bekker & Davis, 2018; Kiryo et al., 2017) is an elegant way of handling this under-studied problem, where a model only has access to positive examples and unlabeled data. Therefore, it is possible to utilize pseudo labels predicted by a PU classifier on unlabeled data to guide the conditional generation.


However, the predicted signals from the classifier tend to be noisy. Although there is a flurry of papers on learning from noisy labels for classification (Tsung Wei Tsai, 2019; Ge et al., 2020; Guo et al., 2019), to the best of our knowledge, no work has considered leveraging noisy labels seamlessly in joint classification and generation. Additionally, Hou et al. (2017) leveraged GANs to recover both the positive and the negative data distribution to step away from overfitting, but they did not consider noise-invariant generation or the mutual improvement of the two tasks. Xu et al. (2019) focused on generative-discriminative complementary learning in weakly supervised learning, but ours is the first attempt to tackle the (Multi-)Positive and Unlabeled learning setting while developing a method for noise-invariant generation from noisy labels. A discussion of further related work can be found in Appendix B.

In this paper, we focus on the mutual benefits of conditional generation and PU classification in settings where extra unlabeled data, including out-of-distribution data, are provided although little class-labeled data is available. Firstly, a parallel non-negative multi-class PU estimator is derived to classify both the positive data of all classes and the negative data. Then we design a Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) that is able to learn the clean data distribution on all unlabeled data with noisy labels provided by the PU classifier. Conversely, we also leverage our CNI-CGAN to enhance the performance of PU classification through data augmentation, demonstrating a reciprocal benefit for both generation and classification. We provide a theoretical analysis of the optimal condition of our CNI-CGAN and conduct extensive experiments to verify the superiority of our approach.

2.1. POSITIVE-UNLABELED LEARNING

Traditional Binary Positive-Unlabeled Problem Setting Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ be the input and output variables, and let $p(x, y)$ be the joint distribution with marginal distributions $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$. In particular, we denote $p(x)$ as the distribution of unlabeled data. $n_p$, $n_n$ and $n_u$ are the numbers of positive, negative and unlabeled data, respectively.

Parallel Non-Negative PU Estimator Vanilla PU learning (Bekker & Davis, 2018; Kiryo et al., 2017; Du Plessis et al., 2014; 2015) employs an unbiased and consistent risk estimator. Denote $g_\theta : \mathbb{R}^d \to \mathbb{R}$ as the score function parameterized by $\theta$, and $\ell : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ as the loss function. The risk of $g_\theta$ can be approximated by its empirical version, denoted as $\hat{R}_{pn}(g_\theta)$:

$$\hat{R}_{pn}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) + \pi_n \hat{R}_n^-(g_\theta), \quad (1)$$

where $\pi_p$ represents the class prior probability, i.e. $\pi_p = P(Y = +1)$, with $\pi_p + \pi_n = 1$, and $\hat{R}_p^+(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), +1)$, $\hat{R}_n^-(g_\theta) = \frac{1}{n_n}\sum_{i=1}^{n_n} \ell(g_\theta(x_i^n), -1)$. As negative data $x^n$ are unavailable, a common strategy is to offset $\hat{R}_n^-(g_\theta)$. We also know that $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, and hence $\pi_n \hat{R}_n^-(g_\theta) = \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)$. The resulting unbiased risk estimator $\hat{R}_{pu}(g_\theta)$ can then be formulated as:

$$\hat{R}_{pu}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) + \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta), \quad (2)$$

where $\hat{R}_p^-(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), -1)$ and $\hat{R}_u^-(g_\theta) = \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(g_\theta(x_i^u), -1)$. The advantage of this unbiased risk minimizer is that the optimal solution can be easily obtained if $g$ is linear in $\theta$. However, in real scenarios we tend to leverage more flexible models $g_\theta$, e.g., deep neural networks. This strategy will push the estimator to a point where it starts to suffer from overfitting. Hence, we decide to utilize the non-negative risk (Kiryo et al., 2017) for our PU learning, which has been verified in (Kiryo et al., 2017) to allow deep neural networks to mitigate overfitting.
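The unbiased estimator of Eq. (2) is straightforward to compute from the scores on positive and unlabeled samples. The snippet below is a minimal NumPy sketch using the sigmoid loss $\ell(s, y) = 1/(1 + e^{ys})$ as a concrete choice of $\ell$; the function names are illustrative and not part of the paper.

```python
import numpy as np

def sigmoid_loss(scores, label):
    """Sigmoid loss l(g(x), y) = 1 / (1 + exp(y * g(x)))."""
    return 1.0 / (1.0 + np.exp(label * scores))

def unbiased_pu_risk(scores_p, scores_u, pi_p):
    """Unbiased PU risk of Eq. (2):
        R_pu = pi_p * R_p^+ + R_u^- - pi_p * R_p^-,
    where R_p^+ / R_p^- average the loss on positive samples treated with
    labels +1 / -1, and R_u^- averages the loss on unlabeled samples
    treated with label -1.
    """
    r_p_pos = sigmoid_loss(scores_p, +1).mean()  # R_p^+(g_theta)
    r_p_neg = sigmoid_loss(scores_p, -1).mean()  # R_p^-(g_theta)
    r_u_neg = sigmoid_loss(scores_u, -1).mean()  # R_u^-(g_theta)
    return pi_p * r_p_pos + r_u_neg - pi_p * r_p_neg
```

Note that the last two terms may go negative for flexible models, which is exactly the overfitting symptom the non-negative correction below addresses.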
The non-negative PU estimator is formulated as:

$$\tilde{R}_{pu}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) + \max\left\{0,\ \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)\right\}. \quad (3)$$

In pursuit of a parallel implementation of $\tilde{R}_{pu}(g_\theta)$, we replace $\max\{0, \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)\}$ with its lower bound $\frac{1}{N}\sum_{i=1}^{N}\max\{0,\ \hat{R}_u^-(g_\theta; X_u^i) - \pi_p \hat{R}_p^-(g_\theta; X_p^i)\}$, where $X_u^i$ and $X_p^i$ denote the unlabeled and positive data in the $i$-th mini-batch, and $N$ is the number of batches.
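The per-batch clipping above can be sketched as follows. This is a minimal NumPy illustration (the names `nn_pu_risk_batches`, `batches_p`, `batches_u` are our own, not from the paper), again instantiated with the sigmoid loss; in practice each batch of scores would come from a forward pass of $g_\theta$.

```python
import numpy as np

def sigmoid_loss(scores, label):
    """Sigmoid loss l(g(x), y) = 1 / (1 + exp(y * g(x)))."""
    return 1.0 / (1.0 + np.exp(label * scores))

def nn_pu_risk_batches(batches_p, batches_u, pi_p):
    """Parallel non-negative PU risk of Eq. (3), with the max applied
    per mini-batch:
        pi_p * R_p^+  +  (1/N) * sum_i max(0, R_u^-(X_u^i) - pi_p * R_p^-(X_p^i)),
    where batches_p / batches_u are lists of score arrays for the positive
    and unlabeled samples of each of the N mini-batches.
    """
    # Positive-class term, averaged over all batches.
    r_p_pos = np.mean([sigmoid_loss(sp, +1).mean() for sp in batches_p])
    # Non-negative correction, clipped independently in every mini-batch.
    corrections = [
        max(0.0, sigmoid_loss(su, -1).mean() - pi_p * sigmoid_loss(sp, -1).mean())
        for sp, su in zip(batches_p, batches_u)
    ]
    return pi_p * r_p_pos + float(np.mean(corrections))
```

Clipping inside each mini-batch is what makes the estimator embarrassingly parallel: every batch contributes an independent non-negative correction term, so no cross-batch synchronization of the max is needed.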

