PGASL: PREDICTIVE AND GENERATIVE ADVERSARIAL SEMI-SUPERVISED LEARNING FOR IMBALANCED DATA

Abstract

Modern machine learning techniques often suffer from class imbalance, where only a small amount of data is available for minority classes. Classifiers trained on an imbalanced dataset may achieve high accuracy on majority classes yet perform poorly on minority classes. This is problematic when the minority classes are also important. Generative Adversarial Networks (GANs) have been proposed for generating artificial minority examples to balance the training data. We propose PGASL, a class-imbalanced semi-supervised learning algorithm that can be efficiently trained on unlabeled and class-imbalanced data. In this work, we use a predictive network, trained adversarially with respect to the discriminator, to correct predictions on the unlabeled dataset. Experiments on text datasets show that, by including both the predictive network and the generator, PGASL outperforms state-of-the-art class-imbalanced learning algorithms.

1. INTRODUCTION

In many real-world applications, such as medical data analysis (Cameron et al., 2010), data is often imbalanced because patients suffering from a given disease typically make up a very small portion of the population. Training a machine learning model on imbalanced data is challenging: the classifier may become biased towards the majority classes and perform poorly on the minority classes. Even if overall classification performance looks good, this is undesirable for tasks in which the minority classes are also important, and addressing class imbalance has become increasingly important in many fields. Another feature of this kind of data is that usually only a small amount of it is labeled. For example, hospital patients who are under-diagnosed for a certain disease cannot be labeled as negative and therefore remain unlabeled. Recently, many semi-supervised learning (SSL) algorithms based on deep neural networks have shown that they can use unlabeled data to improve performance. However, most works focus on only one of the above challenges, and class-imbalanced semi-supervised learning is still under-explored.

In this paper we consider the rare disease detection task in healthcare (Yu et al., 2019), which is a typical imbalanced semi-supervised machine learning task with binary outputs. Because GANs have had great success in learning close-to-real data distributions through adversarial training, we propose a three-player GAN model that both generates artificial minority data and extracts minority data from the unlabeled dataset, which helps to handle the class imbalance issue. We let the discriminator output prediction probabilities for each class. We then use a predictive network, which has the same structure as the discriminator but is trained adversarially, to correct the predictions for samples in the unlabeled dataset.
Finally, we use a generator to produce minority samples so that the discriminator is trained on more balanced data and is less likely to be biased towards the majority class.

Our main contributions include:

• We introduce a novel semi-supervised three-player GAN model for imbalanced data.
• We conduct experiments on binary text classification datasets to benchmark PGASL against previous class-imbalance learning methods. Results demonstrate that our proposed method can efficiently utilize unlabeled data and handle imbalanced datasets, outperforming prior works.
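The motivation above, that a classifier can look accurate overall while failing on the minority class, can be made concrete with a small toy example. The sketch below is only an illustration: the population size, the 1% disease rate, and the always-negative classifier are hypothetical assumptions, not data or models from this paper.

```python
# Toy illustration of why overall accuracy is misleading on imbalanced data.
# All numbers are hypothetical assumptions, not results from the paper.


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that are predicted as such."""
    preds_for_class = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_class:
        return 0.0
    return sum(p == positive for p in preds_for_class) / len(preds_for_class)


# Assumed population: 1,000 patients, 10 of whom (1%) have the rare disease.
y_true = [1] * 10 + [0] * 990

# A degenerate classifier that always predicts the majority (negative) class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
majority_recall = recall(y_true, y_pred, positive=0)

print(f"accuracy        = {acc:.2f}")
print(f"minority recall = {minority_recall:.2f}")
print(f"majority recall = {majority_recall:.2f}")
```

In this assumed setting the classifier is right on 99% of the samples while detecting none of the minority cases, which is why per-class metrics such as recall are more informative than accuracy when the minority class matters.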

