CLASSIFY AND GENERATE RECIPROCALLY: SIMULTANEOUS POSITIVE-UNLABELLED LEARNING AND CONDITIONAL GENERATION WITH EXTRA DATA

Anonymous authors
Paper under double-blind review

Abstract

The scarcity of class-labeled data is a ubiquitous bottleneck in a wide range of machine learning problems. While abundant unlabeled data normally exist and provide a potential solution, it is extremely challenging to exploit them. In this paper, we address this problem by simultaneously leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data, both of which aim to make full use of agnostic unlabeled data to improve classification and generation performance. In particular, we present a novel training framework that jointly targets PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Conditional Generative Adversarial Network (CGAN) that is robust to noisy labels, and 2) leveraging extra data with labels predicted by a PU classifier to help the generation. Our key contribution is a Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that can learn the clean data distribution from noisy labels predicted by a PU classifier. Theoretically, we prove the optimal condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets, verifying simultaneous improvements in both classification and generation.

1. INTRODUCTION

Existing machine learning methods, particularly deep learning models, typically require big data to achieve remarkable performance. For instance, conditional deep generative models are able to generate high-fidelity and diverse images, but they rely on vast amounts of labeled data (Lucic et al., 2019). Nevertheless, it is often laborious or impractical to collect large-scale, accurately class-labeled data in real-world scenarios, and thus label scarcity is ubiquitous. Under such circumstances, the performance of classification and conditional generation (Mirza & Osindero, 2014) drops significantly (Lucic et al., 2019). At the same time, diverse unlabeled data are available in enormous quantities, so a key issue is how to take advantage of the extra data to enhance conditional generation or classification.

Within the unlabeled data, both in-distribution and out-of-distribution data exist, where in-distribution data conform to the distribution of the labeled data while out-of-distribution data do not. Our key insight is to harness the out-of-distribution data. In generation with extra data, most related works focus on in-distribution data (Lucic et al., 2019; Gui et al., 2020; Donahue & Simonyan, 2019). When it comes to out-of-distribution data, the majority of existing methods (Noguchi & Harada, 2019; Yamaguchi et al., 2019; Zhao et al., 2020) attempt to forcibly train generative models on a large number of unlabeled data, and then transfer the learned knowledge of the pre-trained generator to the in-distribution data. In classification, a common setting for utilizing unlabeled data is semi-supervised learning (Miyato et al., 2018; Sun et al., 2019; Berthelot et al., 2019), which usually assumes that the unlabeled and labeled data come from the same distribution, ignoring their distributional mismatch.
In contrast, Positive and Unlabeled (PU) Learning (Bekker & Davis, 2018; Kiryo et al., 2017) is an elegant way of handling this under-studied problem, where a model only has access to positive examples and unlabeled data. It is therefore possible to use pseudo labels predicted by a PU classifier on unlabeled data to guide the conditional generation. However, the predicted signals from the classifier tend to be noisy. Although there is a flurry of papers on learning from noisy labels for classification (Tsung Wei Tsai, 2019; Ge et al., 2020; Guo et al., 2019), to the best of our knowledge no work has considered leveraging noisy labels seamlessly in joint classification and generation. Additionally, Hou et al. (2017) leveraged GANs to recover both the positive and negative data distributions to avoid overfitting, but they considered neither noise-invariant generation nor mutual improvement. Xu et al. (2019) focused on generative-discriminative complementary learning in weakly supervised learning, but ours is the first attempt to tackle the (Multi-)Positive and Unlabeled learning setting while developing a method for noise-invariant generation from noisy labels. A discussion of further related works can be found in Appendix B.

In this paper, we focus on the mutual benefits of conditional generation and PU classification in settings where extra unlabeled data, including out-of-distribution data, are provided even though little class-labeled data is available. Firstly, a parallel non-negative multi-class PU estimator is derived to classify both the positive data of all classes and the negative data. Then we design a Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN) that is able to learn the clean data distribution on all unlabeled data with noisy labels provided by the PU classifier.
Conversely, we also leverage our CNI-CGAN to enhance the performance of the PU classification through data augmentation, demonstrating a reciprocal benefit for both generation and classification. We provide the theoretical analysis on the optimal condition of our CNI-CGAN and conduct extensive experiments to verify the superiority of our approach.

2.1. POSITIVE-UNLABELED LEARNING

Traditional Binary Positive-Unlabeled Problem Setting Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ be the input and output variables, and let $p(x, y)$ be the joint distribution with marginals $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$. In particular, we denote $p(x)$ as the distribution of unlabeled data. $n_p$, $n_n$ and $n_u$ are the numbers of positive, negative and unlabeled data, respectively.

Parallel Non-Negative PU Estimator Vanilla PU learning (Bekker & Davis, 2018; Kiryo et al., 2017; Du Plessis et al., 2014; 2015) employs an unbiased and consistent estimator. Denote $g_\theta : \mathbb{R}^d \to \mathbb{R}$ as the score function parameterized by $\theta$, and $\ell : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ as the loss function. The risk of $g_\theta$ can be approximated by its empirical version $\hat{R}_{pn}(g_\theta)$:

$$\hat{R}_{pn}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) + \pi_n \hat{R}_n^-(g_\theta),$$

where $\pi_p$ represents the class-prior probability $P(Y = +1)$, with $\pi_p + \pi_n = 1$, and $\hat{R}_p^+(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), +1)$ and $\hat{R}_n^-(g_\theta) = \frac{1}{n_n}\sum_{i=1}^{n_n} \ell(g_\theta(x_i^n), -1)$. As negative data $x^n$ are unavailable, a common strategy is to offset $\hat{R}_n^-(g_\theta)$. Since $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, we have $\pi_n \hat{R}_n^-(g_\theta) = \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)$. The resulting unbiased risk estimator $\hat{R}_{pu}(g_\theta)$ can be formulated as:

$$\hat{R}_{pu}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) - \pi_p \hat{R}_p^-(g_\theta) + \hat{R}_u^-(g_\theta),$$

where $\hat{R}_p^-(g_\theta) = \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(g_\theta(x_i^p), -1)$ and $\hat{R}_u^-(g_\theta) = \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(g_\theta(x_i^u), -1)$. The advantage of this unbiased risk minimizer is that the optimal solution can be easily obtained if $g$ is linear in $\theta$. In real scenarios, however, we tend to leverage more flexible models $g_\theta$, e.g., deep neural networks, which pushes the estimator toward overfitting. Hence, we utilize the non-negative risk (Kiryo et al., 2017) for our PU learning, which has been verified in (Kiryo et al., 2017) to allow deep neural networks to mitigate overfitting.
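As a concrete illustration, the unbiased risk above can be computed from positive and unlabeled scores alone. The following is a minimal NumPy sketch assuming a sigmoid surrogate loss; the function and variable names are ours, not from the paper:

```python
import numpy as np

def sigmoid_loss(z, t):
    # Surrogate loss l(z, t) = sigmoid(-t * z); any margin loss would work here.
    return 1.0 / (1.0 + np.exp(t * z))

def unbiased_pu_risk(g_pos, g_unl, pi_p):
    """R_pu(g) = pi_p * R_p^+ - pi_p * R_p^- + R_u^-,
    computed from positive scores g_pos and unlabeled scores g_unl only."""
    r_p_plus  = sigmoid_loss(g_pos, +1).mean()   # positives labeled +1
    r_p_minus = sigmoid_loss(g_pos, -1).mean()   # positives labeled -1 (offset term)
    r_u_minus = sigmoid_loss(g_unl, -1).mean()   # unlabeled treated as -1
    return pi_p * r_p_plus - pi_p * r_p_minus + r_u_minus

# With all scores at zero, every loss term is 0.5 and the risk reduces to 0.5.
risk = unbiased_pu_risk(np.zeros(3), np.zeros(4), pi_p=0.3)
```

Note how the negative-data risk never appears explicitly: it is replaced by the offset `r_u_minus - pi_p * r_p_minus`, exactly as in the derivation above.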
The non-negative PU estimator is formulated as:

$$\hat{R}_{pu}(g_\theta) = \pi_p \hat{R}_p^+(g_\theta) + \max\left\{0,\; \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)\right\}.$$

In pursuit of a parallel implementation of $\hat{R}_{pu}(g_\theta)$, we replace $\max\{0, \hat{R}_u^-(g_\theta) - \pi_p \hat{R}_p^-(g_\theta)\}$ with its lower bound $\frac{1}{N}\sum_{i=1}^{N}\max\{0, \hat{R}_u^-(g_\theta; X_u^i) - \pi_p \hat{R}_p^-(g_\theta; X_p^i)\}$, where $X_u^i$ and $X_p^i$ denote the unlabeled and positive data in the $i$-th mini-batch, and $N$ is the number of batches.

From Binary PU to Multi-PU Learning Previous PU learning focuses on learning a binary classifier from positive and unlabeled data, and cannot easily be adapted to $K+1$ multi-classification tasks, where $K$ represents the number of classes in the positive data. Xu et al. (2017) developed Multi-Positive and Unlabeled learning, but the proposed algorithm may not accommodate deep neural networks. Instead, we extend binary PU learning to the multi-class version in a straightforward way by additionally incorporating a cross-entropy loss on all the labeled positive data. More precisely, we consider the $K+1$-class classifier $f_\theta$ as a score function $f_\theta = (f_\theta^1(x), \ldots, f_\theta^{K+1}(x))$. After the softmax function, we use the first $K$ positive classes to construct the cross-entropy loss $\ell_{CE}$, i.e., $\ell_{CE}(f_\theta(x), y) = \log \sum_{j=1}^{K+1} \exp(f_\theta^j(x)) - f_\theta^y(x)$, where $y \in [K]$. For the PU loss, we consider the composite function $h(f_\theta(x)) : \mathbb{R}^d \to \mathbb{R}$, where $h(\cdot)$ applies a logit transformation to the accumulated probability of the first $K$ classes, i.e., $h(f_\theta(x)) = \ln(\frac{p}{1-p})$ with $p = \sum_{j=1}^{K}\exp(f_\theta^j(x)) / \sum_{j=1}^{K+1}\exp(f_\theta^j(x))$. The final mini-batch risk of our PU learning can be presented as:

$$\hat{R}_{pu}(f_\theta; X^i) = \pi_p \hat{R}_p^+(h(f_\theta); X_p^i) + \max\left\{0,\; \hat{R}_u^-(h(f_\theta); X_u^i) - \pi_p \hat{R}_p^-(h(f_\theta); X_p^i)\right\} + \hat{R}_p^{CE}(f_\theta; X_p^i), \quad (4)$$

where $\hat{R}_p^{CE}(f_\theta; X_p^i) = \frac{1}{n_p}\sum_{i=1}^{n_p}\ell_{CE}(f_\theta(x_i^p), y_i)$.
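To make the mini-batch risk of Eq. 4 concrete, here is a NumPy sketch of the multi-class non-negative estimator, again assuming a sigmoid surrogate for the PU part; all names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid_loss(z, t):
    return 1.0 / (1.0 + np.exp(t * z))

def h(logits):
    # Logit of the accumulated probability of the first K (positive) classes.
    p = softmax(logits)[:, :-1].sum(axis=1)
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return np.log(p / (1.0 - p))

def multi_pu_risk(logits_p, labels_p, logits_u, pi_p):
    """Mini-batch risk of Eq. 4: non-negative PU term on h(f) plus
    cross-entropy on the labeled positive samples."""
    r_p_plus  = sigmoid_loss(h(logits_p), +1).mean()
    r_p_minus = sigmoid_loss(h(logits_p), -1).mean()
    r_u_minus = sigmoid_loss(h(logits_u), -1).mean()
    probs = softmax(logits_p)
    ce = -np.log(probs[np.arange(len(labels_p)), labels_p] + 1e-12).mean()
    # Non-negative correction: clamp the offset estimate of the negative risk.
    return pi_p * r_p_plus + max(0.0, r_u_minus - pi_p * r_p_minus) + ce
```

With uniform zero logits over K+1 = 3 classes, h evaluates to ln 2 and the cross-entropy to ln 3, which gives a handy sanity check on the implementation.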
2.2. CLASSIFIER-NOISE-INVARIANT CONDITIONAL GENERATIVE ADVERSARIAL NETWORK

Figure 1: Model architecture of our Classifier-Noise-Invariant Conditional GAN (CNI-CGAN). The output $x_g$ of the conditional generator $G$ is paired with a noisy label $\tilde{y}$ corrupted by the PU-dependent confusion matrix $\hat{C}$. The discriminator $D$ distinguishes whether a given labeled sample comes from the real data $(x_r, PU_\theta(x_r))$ or the generated data $(x_g, \tilde{y})$.

To leverage extra data, i.e., all unlabeled data, to benefit the generation, we deploy our conditional generative model on all data with pseudo labels predicted by our PU classifier. However, these predicted labels tend to be noisy, reducing the reliability of the supervision signals and thus worsening the performance of the conditional generative model. Moreover, the noise depends on the accuracy of the given PU classifier. To address this issue, we develop a novel noise-invariant conditional GAN that is robust to noisy labels provided by a specified classifier, e.g., a PU classifier. We call our method Classifier-Noise-Invariant Conditional Generative Adversarial Network (CNI-CGAN); its architecture is depicted in Figure 1. In the following, we elaborate on each part of it.

Principle of the Design of CNI-CGAN

Albeit being noisy, the pseudo labels given by the PU classifier still provide rich information that we can exploit. The key is to take the noise generation mechanism into account during generation. We denote the real data as $x_r$ and the predicted hard label through the PU classifier as $PU_\theta(x_r)$, i.e., $PU_\theta(x_r) = \arg\max_i f_\theta^i(x_r)$, as displayed in Figure 1. We let the generator "imitate" the noise generation mechanism to generate pseudo labels for the labeled data. With both pseudo and real labels, we can leverage the PU classifier $f_\theta$ to estimate a confusion matrix $\hat{C}$ that models the label noise of the classifier. During generation, a real label $y$, while being fed into the generator $G$, is also polluted by $\hat{C}$ to produce a noisy label $\tilde{y}$, which is then combined with the generated fake sample $x_g$ for the subsequent discrimination. Finally, the discriminator $D$ distinguishes the real pairs $(x_r, PU_\theta(x_r))$ from the fake pairs $(x_g, \tilde{y})$. Overall, the noise "generation" mechanisms on both sides are balanced.
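The corruption step can be sketched as sampling $\tilde{y}$ from the row of $\hat{C}$ indexed by the clean label $y$. Below is a minimal NumPy illustration with a hypothetical 3-class confusion matrix; the numbers are made up for the example:

```python
import numpy as np

def corrupt_labels(y, C, rng):
    """Draw a noisy label y~ for each clean label y by sampling from row C[y],
    imitating the classifier's noise mechanism during generation."""
    return np.array([rng.choice(C.shape[1], p=C[label]) for label in y])

rng = np.random.default_rng(0)
# Hypothetical (K+1) x (K+1) confusion matrix for a toy 3-class problem;
# row i is the distribution of predicted classes given conditioning label i.
C = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9]])
y = np.array([0, 1, 2, 1])
y_tilde = corrupt_labels(y, C, rng)
```

Each generated sample then carries the pair $(x_g, \tilde{y})$ into the discriminator, mirroring how the real pair carries the PU classifier's noisy prediction.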

Estimation of C

The key in the design of $\hat{C}$ is to estimate the label noise of the pre-trained PU classifier by considering all the samples of each class. More specifically, the confusion matrix $\hat{C}$ is $(K+1) \times (K+1)$, and each entry $\hat{C}_{ij}$ represents the probability that a generated sample $x_g$ with given label $i$ is classified as class $j$ by the PU classifier:

$$\hat{C}_{ij} = P(PU_\theta(x_g) = j \mid y = i) = \mathbb{E}_z\left[\mathbb{I}_{\{PU_\theta(x_g)=j\}}\right], \quad (5)$$

where $x_g = G(z, y = i)$ and $\mathbb{I}$ is the indicator function. Owing to the stochastic nature of training deep neural networks, we incorporate the estimation of $\hat{C}$ into the training process via the Exponential Moving Average (EMA) method. We formulate the update of $\hat{C}^{(l+1)}$ in the $l$-th mini-batch as:

$$\hat{C}^{(l+1)} = \lambda \hat{C}^{(l)} + (1 - \lambda)\,\Delta\hat{C}_{X^l},$$

where $\Delta\hat{C}_{X^l}$ denotes the incremental change of $\hat{C}$ on the current $l$-th mini-batch $X^l$ via Eq. 5, and $\lambda$ is the averaging coefficient in EMA.

Theoretical Guarantee of Clean Data Distribution Firstly, we denote $O(x)$ as the oracle class of sample $x$ from an oracle classifier $O(\cdot)$. Let $\pi_i$, $i = 1, \ldots, K+1$, be the class-prior probability of class $i$ in the multi-positive unlabeled setting. Theorem 1 establishes the optimal condition of CNI-CGAN that guarantees convergence to the clean data distribution. The proof is provided in Appendix A.

Theorem 1. (Optimal Condition of CNI-CGAN) Let $P^g$ be a probabilistic transition matrix where $P^g_{ij} = P(O(x_g) = j \mid y = i)$ indicates the probability that a sample $x_g$ generated by $G$ with initial label $i$ has oracle label $j$. We assume that the conditional sample spaces of the classes are pairwise disjoint. Then (1) $P^g$ is a permutation matrix if the generator $G$ in CNI-CGAN is optimal, where the permutation, compared with the identity matrix, occurs only on rows $r$ whose corresponding class priors $\pi_r$, $r \in \mathbf{r}$, are equal.
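A minimal NumPy sketch of the mini-batch estimate and the EMA update follows; the function names are ours, and in the real pipeline `pred` would come from the PU classifier applied to generated samples:

```python
import numpy as np

def confusion_increment(pred, given, num_classes):
    """Empirical Delta C on one mini-batch (Eq. 5): row i holds the distribution
    of classifier predictions over samples generated with conditioning label i."""
    delta = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        mask = given == i
        if mask.any():
            delta[i] = np.bincount(pred[mask], minlength=num_classes) / mask.sum()
    return delta

def ema_update(C_prev, delta, lam):
    # C^(l+1) = lambda * C^(l) + (1 - lambda) * Delta C_{X^l}
    return lam * C_prev + (1.0 - lam) * delta

given = np.array([0, 0, 1, 1, 2, 2])      # conditioning labels fed to G
pred  = np.array([0, 1, 1, 1, 2, 0])      # PU classifier predictions on G's output
delta = confusion_increment(pred, given, 3)
C_new = ema_update(np.eye(3), delta, lam=0.9)
```

Since each row of both the previous estimate and the increment is a probability distribution, the EMA update keeps every row of $\hat{C}$ a valid distribution.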
(2) If $P^g$ is an identity matrix and the generator $G$ in CNI-CGAN is optimal, then $p_r(x, y) = p_g(x, y)$, where $p_r(x, y)$ and $p_g(x, y)$ are the real and generated joint distributions, respectively.

The Auxiliary Loss The optimal $G$ in CNI-CGAN can only guarantee that $p_g(x, y)$ is close to $p_r(x, y)$ to the extent that the optimal $P^g$, a permutation matrix, is close to the identity matrix. Hence in practice, to ensure that we learn an identity matrix for $P^g$ and thus achieve the clean data distribution, we introduce an auxiliary loss that encourages a larger trace of $P^g$, i.e., $\sum_{i=1}^{K+1} P(O(x_g) = i \mid y = i)$. As $O(\cdot)$ is intractable, we approximate it by the current PU classifier $PU_\theta$. We then obtain the auxiliary loss:

$$\ell_{aux}(z, y) = \max\left\{\kappa - \frac{1}{K+1}\sum_{i=1}^{K+1} \mathbb{E}_z\left[\mathbb{I}_{\{PU_\theta(x_g)=i\}}\right],\; 0\right\},$$

where $x_g = G(z, y = i)$ and $\kappa \in (0, 1)$ is a hyper-parameter. With the support of the auxiliary loss, $P^g$ tends to converge to the identity matrix, so that CNI-CGAN can learn the clean data distribution even in the presence of noisy labels.

Comparison with RCGAN (Thekumparampil et al., 2018; Kaneko et al., 2019) The theoretical property of CNI-CGAN has a major advantage over the existing Robust CGAN (RCGAN) (Thekumparampil et al., 2018; Kaneko et al., 2019), whose optimal condition can only be achieved when the label confusion matrix is known a priori. Although heuristics such as RCGAN-U (Thekumparampil et al., 2018) can be employed to handle the unknown-label-noise setting, these approaches still lack a theoretical guarantee of convergence to the clean data distribution. Additionally, one implicit and mild assumption needed for the efficacy of our approach is that the PU classifier does not overfit the training data; our non-negative estimator helps ensure this, as explained in Section 2.1. It is worth noting that our CNI-CGAN conducts generation over all $K+1$ classes.
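Empirically, the expectation in the auxiliary loss reduces to the per-class agreement rate between the conditioning labels and the classifier's predictions on the generated batch. A small NumPy sketch under that reading (names are illustrative):

```python
import numpy as np

def aux_loss(pred, given, num_classes, kappa):
    """Hinge on the average diagonal agreement: an empirical estimate of
    max{ kappa - (1/(K+1)) * sum_i P(PU(G(z, y=i)) = i), 0 }."""
    agree = [
        (pred[given == i] == i).mean() if (given == i).any() else 0.0
        for i in range(num_classes)
    ]
    return max(kappa - float(np.mean(agree)), 0.0)
```

When the classifier agrees with every conditioning label the loss is zero; when agreement collapses the loss saturates at kappa, bounding the pressure the term exerts on the generator.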
To further clarify the optimization process of CNI-CGAN, we elaborate the training steps of D and G, respectively.

D-Step:

We train $D$ with an adversarial loss on both the real pairs $(x, PU_\theta(x))$ and the generated pairs $(x_g, \tilde{y})$, where $\tilde{y}$ is corrupted by $\hat{C}$ and $\hat{C}_y$ denotes the $y$-th row of $\hat{C}$. We formulate the loss of $D$ as:

$$\max_{D \in \mathcal{F}} \; \mathbb{E}_{x \sim p(x)}\left[\phi(D(x, PU_\theta(x)))\right] + \mathbb{E}_{z \sim P_Z,\, y \sim P_Y,\, \tilde{y}|y \sim \hat{C}_y}\left[\phi(1 - D(G(z, y), \tilde{y}))\right],$$

where $\mathcal{F}$ is a family of discriminators and $P_Z$ is the distribution of the latent vector $z$, e.g., a normal distribution. $P_Y$ is a discrete uniform distribution on $[K+1]$ and $\phi$ is the measuring function.

G-Step:

We train $G$ with the adversarial loss plus the auxiliary loss $\ell_{aux}(z, y)$:

$$\min_{G \in \mathcal{G}} \; \mathbb{E}_{z \sim P_Z,\, y \sim P_Y,\, \tilde{y}|y \sim \hat{C}_y}\left[\phi(1 - D(G(z, y), \tilde{y})) + \beta\, \ell_{aux}(z, y)\right],$$

where $\beta$ controls the strength of the auxiliary loss and $\mathcal{G}$ is a family of generators.
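The two objectives can be written down directly once $\phi$ is fixed. The sketch below assumes the standard logarithmic measuring function and takes precomputed discriminator outputs as inputs; the names are illustrative, not the paper's implementation:

```python
import numpy as np

def phi(t):
    # Measuring function; log gives the standard GAN form.
    return np.log(np.clip(t, 1e-7, None))

def d_objective(d_real, d_fake):
    """Discriminator value to MAXIMIZE:
    E[phi(D(x, PU(x)))] + E[phi(1 - D(G(z, y), y~))]."""
    return phi(d_real).mean() + phi(1.0 - d_fake).mean()

def g_objective(d_fake, aux, beta):
    """Generator value to MINIMIZE: adversarial term plus beta * l_aux."""
    return phi(1.0 - d_fake).mean() + beta * aux
```

At the non-informative point where the discriminator outputs 0.5 everywhere, the D objective evaluates to 2 log(0.5), matching the classical GAN equilibrium value.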

2.3. TRAINING ALGORITHM

Firstly, we obtain a PU classifier $f_\theta$ trained on the multi-positive and unlabeled dataset with the parallel non-negative estimator derived in Section 2.1. Then we train our CNI-CGAN, described in Section 2.2, on all data with pseudo labels predicted by the pre-trained PU classifier. As our CNI-CGAN is robust to noisy labels, we leverage the data generated by CNI-CGAN for data augmentation to improve the PU classifier. Finally, we implement the joint optimization for the training of CNI-CGAN and the data augmentation of the PU classifier. We summarize the details in Algorithm 1 and Appendix C. The core alternating steps of Algorithm 1 are:

8: Sample $\{z_1, \ldots, z_M\}$, $\{y_1, \ldots, y_M\}$ and $\{x_1, \ldots, x_M\}$ from $P_Z$, $P_Y$ and all training data, respectively, and then sample $\{\tilde{y}_1, \ldots, \tilde{y}_M\}$ through the current $\hat{C}^{(l)}$. Update the discriminator $D$ by ascending its stochastic gradient of

$$\frac{1}{M}\sum_{i=1}^{M}\left[\phi(D(x_i, PU_\theta(x_i))) + \phi(1 - D(G(z_i, y_i), \tilde{y}_i))\right].$$

9: Sample $\{z_1, \ldots, z_M\}$ and $\{y_1, \ldots, y_M\}$ from $P_Z$ and $P_Y$, and then sample $\{\tilde{y}_1, \ldots, \tilde{y}_M\}$ through the current $\hat{C}^{(l)}$. Update the generator $G$ by descending its stochastic gradient of

$$\frac{1}{M}\sum_{i=1}^{M}\left[\phi(1 - D(G(z_i, y_i), \tilde{y}_i)) + \beta\, \ell_{aux}(z_i, y_i)\right].$$

10: if $l \geq L_0$ then
11: Compute $\Delta\hat{C}_{X^l} = \frac{1}{M}\sum_{i=1}^{M}\mathbb{I}_{\{PU_\theta(G(z_i, y_i)) \mid y_i\}}$ via Eq. 5, and then estimate $\hat{C}$ by $\hat{C}^{(l+1)} = \lambda\hat{C}^{(l)} + (1-\lambda)\Delta\hat{C}_{X^l}$.
12: end if
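The alternating schedule above can be summarized as a small driver loop. The sketch below uses stub callables for the D-step, G-step, and the mini-batch confusion estimate, so it only illustrates the control flow, not the actual gradient updates:

```python
import numpy as np

def train_cni_cgan(d_step, g_step, delta_c, num_batches, L0, lam, num_classes):
    """Alternating schedule of Algorithm 1 (steps 8-12). d_step, g_step and
    delta_c are hypothetical callables standing in for the real gradient
    updates and the mini-batch confusion estimate of Eq. 5."""
    C = np.eye(num_classes)          # C^(1) initialized as the identity matrix
    for l in range(num_batches):
        d_step(C)                    # step 8: ascend the discriminator loss
        g_step(C)                    # step 9: descend the generator loss
        if l >= L0:                  # steps 10-12: EMA refresh of C
            C = lam * C + (1.0 - lam) * delta_c()
    return C

calls = {"d": 0, "g": 0}
C_final = train_cni_cgan(
    d_step=lambda C: calls.__setitem__("d", calls["d"] + 1),
    g_step=lambda C: calls.__setitem__("g", calls["g"] + 1),
    delta_c=lambda: np.full((3, 3), 1.0 / 3.0),  # dummy batch estimate
    num_batches=5, L0=2, lam=0.9, num_classes=3,
)
```

Because the identity and every increment are row-stochastic, the returned estimate stays row-stochastic throughout training, which is what the label-corruption sampling in the D-step and G-step requires.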

3. EXPERIMENT

Experimental Setup We evaluate our approach and several baselines on MNIST, Fashion-MNIST and CIFAR-10. We select the first 5 classes of MNIST and 5 non-clothing classes of Fashion-MNIST, respectively, for $K+1$ classification ($K = 5$). To verify the consistent effectiveness of our method in the standard binary PU setting, we pick the 4 transportation categories in CIFAR-10 as the one-class positive dataset. As for the baselines, the first is CGAN-P, where a vanilla CGAN (Mirza & Osindero, 2014) is trained only on the limited positive data. Another natural baseline is CGAN-A, where a vanilla CGAN is trained on all data with labels given by the PU classifier. The last baseline is RCGAN-U (Thekumparampil et al., 2018), where the confusion matrix is fully learnable during training. For fair comparison, we use the same GAN architecture throughout; more details about hyper-parameters can be found in Appendix D.

Evaluation For MNIST and Fashion-MNIST, we mainly use Generator Label Accuracy (Thekumparampil et al., 2018) and PU Accuracy to evaluate the quality of generated images. Generator Label Accuracy compares the label $y$ specified to the CGAN against the true class of the generated example, as judged by a pre-trained (almost) oracle classifier $f$. In experiments, we pre-trained two $K+1$-class classifiers with 99.28% and 98.23% accuracy on the two datasets, respectively. Additionally, the increase in PU Accuracy measures the closeness between the generated data distribution and the test (almost real) data distribution for PU classification, serving as a key indicator of the quality of generated images. For CIFAR-10, we use the Inception Score.

3.1. GENERATION AND CLASSIFICATION PERFORMANCE

We set the whole training dataset as the unlabeled data and select a certain amount of positive data according to the Positive Rate. Our approach remains effective across different positive rates, while CGAN-A sometimes even worsens the original PU classifier in this scenario due to the large amount of label noise given by a less accurate PU classifier. Meanwhile, when more labeled positive data are given, the PU classifier generalizes better and provides more accurate labels, leading to more consistent and better performance for all methods. Besides, note that even though CGAN-P achieves comparable generator label accuracy on MNIST, it results in a lower Inception Score; we demonstrate this in Appendix D. To verify the advantage of the theoretical property of our CNI-CGAN, we further compare it with RCGAN-U (Thekumparampil et al., 2018; Kaneko et al., 2019), the heuristic version of robust generation against unknown label noise without the theoretical guarantee of the optimal condition. As observed in Table 1, our method outperforms RCGAN-U especially when the positive rate is low; when the number of labeled positive data is relatively large, e.g., 10.0%, Ours and RCGAN-U obtain comparable performance.

Visualization To further demonstrate the superiority of CNI-CGAN compared with the other baselines, we present some generated images within $K+1$ classes from CGAN-A, RCGAN-U and CNI-CGAN on MNIST, and high-quality images from CNI-CGAN on Fashion-MNIST and CIFAR-10, in Figure 3. In particular, we choose a positive rate of 0.2% on MNIST, yielding an initial PU classifier with 69.14% accuracy. Given the noisy labels on all data, our CNI-CGAN generates visually more accurate images of each class compared with CGAN-A and RCGAN-U. Results on Fashion-MNIST and the comparison with CGAN-P on CIFAR-10 can be found in Appendix E.

3.2. ROBUSTNESS OF OUR APPROACH

Robustness against the Initial PU Accuracy The auxiliary loss helps CNI-CGAN learn the clean data distribution regardless of the initial accuracy of the PU classifier. To verify this, we select distinct positive rates, yielding pre-trained PU classifiers with different initial accuracies, and then run our method on top of each. Figure 4 suggests that although better initial PU accuracy benefits the generation performance early in training, our approach attains similar generation quality under different PU accuracies after sufficient training.

Robustness against Different Distributions of Unlabeled Data We consider two distribution types of unlabeled data: type 1 is $[\frac{1}{K+1}, \ldots, \frac{1}{K+1}]$, where the number of data in each class, including the negative class, is even, while type 2 is $[\frac{1}{2K}, \ldots, \frac{1}{2K}, \frac{1}{2}]$, where the negative data make up half of all unlabeled data. In experiments, we focus on PU Accuracy to evaluate both the generation quality and the improvement of PU learning. For MNIST, we choose positive rates of 1% and 0.5% for the two settings, while we opt for 0.5% and 0.2% on both Fashion-MNIST and CIFAR-10.

4. DISCUSSION AND CONCLUSION

In this paper, we proposed a new method, CNI-CGAN, to jointly exploit PU classification and conditional generation. It is, to the best of our knowledge, the first method of its kind to break the ceiling of class-label scarcity, by combining two promising yet separate methodologies to gain mutual improvements. CNI-CGAN can learn the clean data distribution from noisy labels given by a PU classifier, and then enhance the performance of PU classification through data augmentation in various settings. We have demonstrated, both theoretically and experimentally, the superiority of our proposal on diverse benchmark datasets in an exhaustive and comprehensive manner. In the future, it will be promising to investigate learning strategies on imbalanced data, e.g., cost-sensitive learning (Elkan, 2001), to extend our approach to broader settings, further catering to real-world scenarios where only highly imbalanced data are available.



Improvement on PU Learning and Generation with Extra Data From the perspective of PU classification, due to the theoretical guarantee from Theorem 1, CNI-CGAN is capable of learning a clean data distribution out of noisy pseudo labels predicted by the pre-trained PU classifier.

Algorithm 1 Alternating Minimization for PU Learning and Classifier-Noise-Invariant Generation.
Input: Training data $(X_p, X_u)$; batch size $M$; hyper-parameters $\beta > 0$ and $\lambda, \kappa \in (0, 1)$; $L_0, L \in \mathbb{N}^+$; number of batches $N$ during training. Initialize $\hat{C}^{(1)}$ as the identity matrix.
Output: Model parameters for the generator $G$, and $\theta$ for the PU classifier $f_\theta$.
1: /* Pre-train PU classifier $f_\theta$ */
2: for $i = 1$ to $N$ do
3: Update $f_\theta$ by descending its stochastic gradient of $\hat{R}_{pu}(f_\theta; X^i)$ via Eq. 4

Figure 2: Generation and classification performance of CGAN-P, CGAN-A and Ours on three datasets. Results of CGAN-P on PU accuracy do not exist since CGAN-P generates only K classes data rather than K + 1 categories that the PU classifier needs.

Figure 4: Tendency of generation performance as the training iterations increase on three datasets.

Figure 5: PU Classification accuracy of CGAN-A, RCGAN-U and Ours after joint optimization across different amounts and distribution types of unlabeled data.

Figure 5 shows that the accuracy of the PU classifier exhibits a slight ascending tendency as the number of unlabeled data increases. More importantly, our CNI-CGAN almost consistently outperforms the other baselines across different amounts and distribution types of unlabeled data. This verifies the robustness of our proposal to the situation of extra data.

Hence, the following data augmentation has the potential to improve the generalization of PU classification regardless of the specific form of the PU estimator. From the perspective of generation with extra data, the predicted labels on unlabeled data from the PU classifier can provide the CNI-CGAN with more supervision signals, thus further improving the quality of generation. Due to the joint optimization, both the PU classification and the conditional generative models are able to improve each other reciprocally, as demonstrated in the following experiments.

PU classification accuracy of RCGAN-U and Ours across three datasets. Final PU accuracy represents the accuracy of PU classifier after the data augmentation.

Unlabeled Data In real scenarios, we typically have little knowledge about the extra data at hand. To further verify the robustness of CNI-CGAN against the

