ENSEMBLES OF GENERATIVE ADVERSARIAL NETWORKS FOR DISCONNECTED DATA

Anonymous

Abstract

Most computer vision datasets are composed of disconnected sets, such as images of different objects. We prove that distributions of this type of data cannot be represented with a continuous generative network without error, independent of the learning algorithm used. Disconnected datasets can be represented in two ways: with an ensemble of networks or with a single network using a truncated latent space. We show that ensembles are more desirable than truncated distributions for several theoretical and computational reasons. We construct a regularized optimization problem that rigorously establishes the relationships between a single continuous GAN, an ensemble of GANs, conditional GANs, and Gaussian Mixture GANs. The regularization can be computed efficiently, and we show empirically that our framework has a performance sweet spot that can be found via hyperparameter tuning. The ensemble framework provides better performance than a single continuous GAN or cGAN while using fewer total parameters.

1. INTRODUCTION

Generative networks, such as generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (Kingma & Welling, 2013), have shown impressive performance in generating highly realistic images that were not observed in the training set (Karras et al., 2017; 2019a; b). However, even state-of-the-art generative networks such as BigGAN (Brock et al., 2018) generate poor quality imagery when conditioned on certain classes of ILSVRC2012 (Russakovsky et al., 2015). We argue that this is due to the inherently disconnected structure of the data.

In this paper, we theoretically analyze the effects of disconnected data on GAN performance. By disconnected, we mean that the data points are drawn from an underlying topological space that is disconnected (the rigorous definition is provided below in Section 3.1). As an intuitive example, consider the collection of all images of badgers and all images of zebras. These two sets are disconnected: images of badgers do not resemble images of zebras, and the space connecting these sets does not contain real images of animals. We rigorously prove that one cannot use a single continuous generative network to learn a data distribution perfectly under the disconnected data model. Because generative networks are continuous, they cannot map a connected latent space ($\mathbb{R}^\ell$) onto a disconnected image space, and so they necessarily generate data outside of the true data space.

In related work, Khayatkhoei et al. (2018) studied disconnected data empirically but did not formally prove the results in this paper. In addition, those authors use a completely unsupervised approach to attempt to find the disconnected components as part of learning. In contrast, we use class labels and hence work in the supervised learning regime. Our suggested approach to best deal with disconnected data is to use ensembles of GANs.
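To make the topological obstruction concrete, the following toy sketch (our own illustration, not code or notation from the paper) uses a smooth one-dimensional "generator" to show that a continuous map sends any path between two latent codes to a path between their outputs, so its range cannot skip over the gap between disconnected components:

```python
import numpy as np

# Hypothetical smooth map R -> R^2; a stand-in for a continuous neural
# generator. Its range is a connected curve in "image" space.
def generator(z):
    return np.stack([np.cos(z), np.sin(z)], axis=-1)

z_badger, z_zebra = 0.0, np.pi            # latent codes for two "modes"
zs = np.linspace(z_badger, z_zebra, 101)  # a path in the latent space
path = generator(zs)                      # ... maps to a path in image space

# Every intermediate point lies in the generator's range, i.e. the model
# must place outputs "between" the two modes, off the true data manifold.
midpoint = path[50]
```

Any continuous generator behaves this way: the image of the connected latent space is connected, so two disconnected data components cannot both be covered without also covering points in between.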
We study GANs in particular for concreteness and because of their widespread application; however, our methods can be extended to other generative networks with some modification. Ensembles of GANs are not new, e.g., see (Nguyen et al., 2017; Ghosh et al., 2018; Tolstikhin et al., 2017; Arora et al., 2017), but there has been limited theoretical study of their properties. We prove that ensembles can learn the data distribution under the disconnected data assumption and study their relationship to single GANs. Specifically, we develop a first-of-its-kind theoretical framework that relates single GANs, ensembles of GANs, conditional GANs, and Gaussian mixture GANs. The framework makes it easy to, e.g., develop regularized GAN ensembles that encourage parameter sharing, which we show outperform cGANs and single GANs.

While our primary focus here is on theoretical insight, we also conduct a range of experiments to demonstrate empirically that performance (measured in terms of FID (Heusel et al., 2017), MSE to the training set (Metz et al., 2016), Precision, and Recall (Sajjadi et al., 2018)) increases when we use an ensemble of WGANs instead of a single WGAN on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). The performance increase can be explained in terms of three contributing factors: 1) the ensemble has more parameters and hence higher capacity to learn complex distributions, 2) the ensemble better captures the disconnected structure of the data, and 3) parameter sharing among ensemble networks enables successful joint learning, which we observe can increase performance.

We summarize our contributions as follows:

• We prove that generative networks, which are continuous functions, cannot learn the data distribution if the data is disconnected (Section 3.2). The disconnected data model is defined in Section 3.1, where we argue that it is satisfied in many common datasets, such as MNIST, CIFAR-10, and ILSVRC2012. Restricting the generator to a disconnected subset of the domain is one solution (Section 3.3), but we study a better solution: using ensembles.

• We demonstrate how single GANs and ensembles are related (Section 4.1). We then prove that ensembles are able to learn the true data distribution under our disconnected data model (Section 4.2). Finally, we demonstrate that there is an equivalence between ensembles of GANs and common architectures such as cGANs and GM-GANs due to parameter sharing between ensemble components (Section 4.3).

• We empirically show that, in general, an ensemble of GANs outperforms a single GAN (Section 5.1). This is true even if we reduce the number of parameters used in the ensemble so that it has fewer total parameters than a single GAN (Section 5.2). Finally, we empirically show that parameter sharing among ensemble networks leads to better performance than a single GAN (Section 5.3) or even a cGAN (Section 5.4).
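Sampling from an ensemble of generators can be sketched as follows; the per-component generators, mixture weights, and all names here are hypothetical stand-ins for trained models, used only to illustrate the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained per-component generators G_k: each
# maps a latent draw to outputs near one disconnected component's center.
def make_generator(center):
    return lambda z: center + 0.1 * z

centers = np.array([[-5.0, 0.0], [5.0, 0.0]])  # two disconnected modes
generators = [make_generator(c) for c in centers]
weights = np.array([0.5, 0.5])                 # mixture weights w_k

def sample_ensemble(n, latent_dim=2):
    ks = rng.choice(len(generators), size=n, p=weights)  # pick a component
    zs = rng.standard_normal((n, latent_dim))            # latent draws
    return np.array([generators[k](z) for k, z in zip(ks, zs)])

samples = sample_ensemble(1000)
# Samples concentrate near the two modes; none land in the gap between,
# which a single continuous generator could not avoid.
gap_fraction = np.mean(np.abs(samples[:, 0]) < 2.0)
```

Each component generator is itself continuous, but the discrete component choice breaks the connectedness of the overall range, which is exactly what disconnected data requires.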

2.1. GENERATIVE ADVERSARIAL NETWORKS (GANS)

GANs are generative neural networks that use an adversarial loss, typically from another neural network (Goodfellow et al., 2014). In other words, a GAN consists of two neural networks that compete against each other. The generator $G : \mathbb{R}^\ell \to \mathbb{R}^p$ is a neural network that generates $p$-dimensional images from an $\ell$-dimensional latent space. The discriminator $D : \mathbb{R}^p \to (0, 1)$ is a neural network that is trained to classify between the training set and generated images. As compositions of continuous functions (Goodfellow et al., 2016), both $G$ and $D$ are continuous.

$G$ has parameters $\theta_G \in \mathbb{R}^{|\theta_G|}$, where $|\theta_G|$ is the possibly infinite cardinality of $\theta_G$. Similarly, $D$ has parameters $\theta_D \in \mathbb{R}^{|\theta_D|}$. The latent, generated, and data distributions are $P_z$, $P_G$, and $P_X$, respectively. We train this network by solving the following optimization problem:
$$\min_{\theta_G} \max_{\theta_D} V(\theta_G, \theta_D) = \min_{\theta_G} \max_{\theta_D} \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))].$$
Here we write $\min$ and $\max$ instead of minimize and maximize for notational compactness, but we are referring to an optimization problem. The objective of this optimization is to learn the true data distribution, i.e., $P_G = P_X$. Alternatively, we can use the Wasserstein distance instead of the typical cross-entropy loss:
$$V(\theta_G, \theta_D) = \mathbb{E}_{x \sim P_X}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(z))],$$
restricted to those $\theta_G, \theta_D$ which force $D$ to be 1-Lipschitz, as done in the WGAN paper (Arjovsky et al., 2017). Thus, we will use $V$ to denote either of these two objective functions.
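As a minimal numeric sketch of these two objectives, the snippet below evaluates both values of $V$ for a toy, untrained $D$ and $G$ of our own choosing (not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(1)

# x ~ P_X are "real" samples, z ~ P_z are latent draws.
x = rng.normal(loc=2.0, scale=1.0, size=1000)
z = rng.standard_normal(1000)

G = lambda z: 0.5 * z                       # toy generator
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
D = lambda x: sigmoid(x)                    # toy discriminator, maps into (0, 1)

# Cross-entropy (original GAN) value of V(theta_G, theta_D):
V_gan = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))

# Wasserstein (WGAN) value, where D acts as a critic that must be
# 1-Lipschitz (the sigmoid has Lipschitz constant 1/4, so it qualifies):
V_wgan = np.mean(D(x)) - np.mean(D(G(z)))

print(f"cross-entropy V: {V_gan:.3f}, Wasserstein V: {V_wgan:.3f}")
```

In actual training, $\theta_D$ is updated by gradient ascent on $V$ and $\theta_G$ by gradient descent, alternating between the two; the sketch only illustrates what each objective measures.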

2.2. GANS THAT TREAT SUBSETS OF DATA DIFFERENTLY

Ensembles of GANs. Datasets with many different classes, such as ILSVRC2012 (Russakovsky et al., 2015), are harder to learn in part because the relationship between classes is difficult to quantify. Some models, such as AC-GANs (Odena et al., 2017), tackle this complexity by training different

