ENSEMBLES OF GENERATIVE ADVERSARIAL NETWORKS FOR DISCONNECTED DATA

Anonymous

Abstract

Most computer vision datasets are composed of disconnected sets, such as images of different objects. We prove that distributions of this type of data cannot be represented with a continuous generative network without error, independent of the learning algorithm used. Disconnected datasets can be represented in two ways: with an ensemble of networks or with a single network using a truncated latent space. We show that ensembles are more desirable than truncated distributions for several theoretical and computational reasons. We construct a regularized optimization problem that rigorously establishes the relationships between a single continuous GAN, an ensemble of GANs, conditional GANs, and Gaussian Mixture GANs. The regularization can be computed efficiently, and we show empirically that our framework has a performance sweet spot that can be found via hyperparameter tuning. The ensemble framework provides better performance than a single continuous GAN or cGAN while maintaining fewer total parameters.

1. INTRODUCTION

Generative networks, such as generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (Kingma & Welling, 2013), have shown impressive performance in generating highly realistic images that were not observed in the training set (Karras et al., 2017; 2019a;b). However, even state-of-the-art generative networks such as BigGAN (Brock et al., 2018) generate poor quality imagery when conditioned on certain classes of ILSVRC2012 (Russakovsky et al., 2015). We argue that this is due to the inherently disconnected structure of the data. In this paper, we theoretically analyze the effects of disconnected data on GAN performance. By disconnected, we mean that the data points are drawn from an underlying topological space that is disconnected (the rigorous definition is provided in Section 3.1). As an intuitive example, consider the collection of all images of badgers and all images of zebras. These two sets are disconnected: images of badgers do not resemble images of zebras, and modeling the space connecting these sets does not represent real images of animals. We rigorously prove that one cannot use a single continuous generative network to learn a data distribution perfectly under the disconnected data model. Because generative networks are continuous, they cannot map a connected latent space (R^d) onto the disconnected image space, resulting in the generation of data outside of the true data space. In related work, Khayatkhoei et al. (2018) empirically studied disconnected data but did not formally prove the results in this paper. In addition, those authors use a completely unsupervised approach to attempt to find the disconnected components as part of learning. In contrast, we use class labels and hence work in the supervised learning regime. Our suggested approach to best deal with disconnected data is to use ensembles of GANs.
We study GANs in particular for concreteness and because of their widespread application; however, our methods can be extended to other generative networks with some modification. Ensembles of GANs are not new, e.g., see (Nguyen et al., 2017; Ghosh et al., 2018; Tolstikhin et al., 2017; Arora et al., 2017), but there has been limited theoretical study of their properties. We prove that ensembles can learn the data distribution under the disconnected data assumption and study their relationship to single GANs. Specifically, we develop a first-of-its-kind theoretical framework that relates single GANs, ensembles of GANs, conditional GANs, and Gaussian mixture GANs. The framework makes it easy to, e.g., develop regularized GAN ensembles that encourage parameter sharing, which we show outperform cGANs and single GANs. While our primary focus here is on theoretical insight, we also conduct a range of experiments to demonstrate empirically that performance (measured in terms of FID (Heusel et al., 2017), MSE to the training set (Metz et al., 2016), Precision, and Recall (Sajjadi et al., 2018)) increases when we use an ensemble of WGANs over a single WGAN on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). The performance increase can be explained in terms of three contributing factors: 1) the ensemble has more parameters and hence higher capacity to learn complex distributions, 2) the ensemble better captures the disconnected structure of the data, and 3) parameter sharing among ensemble networks enables successful joint learning, which we observe can increase performance.

We summarize our contributions as follows:
• We prove that generative networks, which are continuous functions, cannot learn the data distribution if the data is disconnected (Section 3.2). The disconnected data model is defined in Section 3.1, where we argue that it is satisfied in many common datasets, such as MNIST, CIFAR-10, and ILSVRC2012. Restricting the generator to a disconnected subset of the domain is one solution (Section 3.3), but we study a better solution: using ensembles.
• We demonstrate how single GANs and ensembles are related (Section 4.1). We then prove that ensembles are able to learn the true data distribution under our disconnected data model (Section 4.2). Finally, we demonstrate that there is an equivalence between ensembles of GANs and common architectures such as cGANs and GM-GANs due to parameter sharing between ensemble components (Section 4.3).
• We empirically show that, in general, an ensemble of GANs outperforms a single GAN (Section 5.1). This is true even if we reduce the number of parameters used in an ensemble so that it has fewer total parameters than a single GAN (Section 5.2). Finally, we empirically show that parameter sharing among ensemble networks leads to better performance than a single GAN (Section 5.3) or even a cGAN (Section 5.4).

2.1. GENERATIVE ADVERSARIAL NETWORKS (GANS)

GANs are generative neural networks that use an adversarial loss, typically from another neural network (Goodfellow et al., 2014). In other words, a GAN consists of two neural networks that compete against each other. The generator G : R^d → R^p is a neural network that generates p-dimensional images from a d-dimensional latent space. The discriminator D : R^p → (0, 1) is a neural network trained to classify between the training set and generated images. As compositions of continuous functions (Goodfellow et al., 2016), both G and D are continuous.

G has parameters θ_G ∈ R^{|θ_G|}, where |θ_G| is the possibly infinite cardinality of θ_G. Similarly, D has parameters θ_D ∈ R^{|θ_D|}. The latent, generated, and data distributions are P_z, P_G, and P_X, respectively. We train this network by solving the following optimization problem:

min_{θ_G} max_{θ_D} V(θ_G, θ_D) = min_{θ_G} max_{θ_D} E_{x∼P_X}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))].   (1)

Here we write min and max instead of minimize and maximize for notational compactness, but we are referring to an optimization problem. The objective of this optimization is to learn the true data distribution, i.e., P_G = P_X. Alternatively, we can use the Wasserstein distance instead of the typical cross-entropy loss: V(θ_G, θ_D) = E_{x∼P_X} D(x) − E_{z∼P_z} D(G(z)), restricted to those θ_G, θ_D that force D to be 1-Lipschitz, as done in the WGAN paper (Arjovsky et al., 2017). Thus, we will use V to denote either of these two objective functions.
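As a concrete illustration of the value function above, the following minimal numpy sketch estimates V(θ_G, θ_D) by Monte Carlo for a toy linear generator and logistic discriminator. The shapes, parameters, and "data" here are invented for illustration only; they are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta_G):
    # Toy linear generator: maps d-dim latent codes to p-dim "images".
    return z @ theta_G  # (n, d) @ (d, p) -> (n, p)

def discriminator(x, theta_D):
    # Toy logistic discriminator: outputs values in (0, 1).
    return 1.0 / (1.0 + np.exp(-(x @ theta_D)))  # (n, p) @ (p,) -> (n,)

def gan_value(x_real, z, theta_G, theta_D):
    # Monte Carlo estimate of V = E[log D(x)] + E[log(1 - D(G(z)))].
    d_real = discriminator(x_real, theta_D)
    d_fake = discriminator(generator(z, theta_G), theta_D)
    return np.mean(np.log(d_real)) + np.mean(np.log1p(-d_fake))

d, p, n = 2, 3, 512
theta_G = rng.normal(size=(d, p))
theta_D = rng.normal(size=p)
x_real = rng.normal(loc=1.0, size=(n, p))   # stand-in "data" samples
z = rng.normal(size=(n, d))                 # latent samples from P_z

print(gan_value(x_real, z, theta_G, theta_D))  # always <= 0 for this objective
```

In actual GAN training, θ_D is updated to ascend this quantity and θ_G to descend it, alternating between the two.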

2.2. GANS THAT TREAT SUBSETS OF DATA DIFFERENTLY

Ensembles of GANs. Datasets with many different classes, such as ILSVRC2012 (Russakovsky et al., 2015), are harder to learn in part because the relationship between classes is difficult to quantify. Some models, such as AC-GANs (Odena et al., 2017), tackle this complexity by training different models on different classes of data in a supervised fashion. In the AC-GAN paper, the authors train 100 GANs on the 1000 classes of ILSVRC2012. The need for these ensembles is not theoretically studied or justified beyond their intuitive usefulness. Several ensembles of GANs have been studied in the unsupervised setting, where the modes or disconnected subsets of the latent space are typically learned (Pandeva & Schubert, 2019; Hoang et al., 2018; Khayatkhoei et al., 2018) with some information-theoretic regularization as done in (Chen et al., 2016). These are unsupervised approaches, which we do not study in this paper. Models such as SGAN (Chavdarova & Fleuret, 2018) and standard GAN ensembles (Wang et al., 2016) use several GANs in part to increase the capacity or expressibility of GANs. Other ensembles, such as Dropout-GAN (Mordido et al., 2018), help increase the robustness of the generative network.

Conditional GANs (cGANs). Conditional GANs (Mirza & Osindero, 2014) attempt to solve the optimization problem in (1) by conditioning on the class y, a one-hot vector. The generator and discriminator both take y as an additional input. This conditioning can be implemented by making the class vector part of the input, e.g., the input to the generator is [z^T y^T]^T instead of just z. Conventional cGANs typically make the following architecture modification: the first layer has an additive bias that depends on the class vector y, and the rest is the same. For example, consider a multilayer perceptron with matrix W in the first layer. Converting this network to be conditional results in the following modification of the first layer:

W_conditional [x^T y^T]^T = [W B] [x^T y^T]^T = W x + B y = W x + B_{·,k}.

Hence, we can think of B as a matrix whose columns B_{·,k}, k ∈ {1, . . . , K}, are bias vectors and W is the same as before. We pick the bias vector B_{·,k} based on the class we condition on, while the other parameters of the network are held fixed, independent of k. This is done to both the generator and the discriminator. Some cGANs condition on multiple layers, such as BigGAN (Brock et al., 2018), or on different types of layers, such as convolutional layers, but our formulation extends naturally to those architectures.

Gaussian Mixture GANs (GM-GANs). The latent distribution P_z is typically chosen to be uniform, isotropic Gaussian, or truncated isotropic Gaussian (Goodfellow et al., 2014; Radford et al., 2015; Brock et al., 2018). We are not restricted to these choices; research has studied the effect of using different latent distributions, such as a mixture of Gaussians (Ben-Yosef & Weinshall, 2018; Gurumurthy et al., 2017).
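The identity above, that appending a one-hot class vector to the input is the same as adding a class-dependent bias column, can be checked numerically. A small sketch, with arbitrary illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(1)

d, K, h = 4, 3, 5            # input dim, number of classes, first-layer width
W = rng.normal(size=(h, d))  # ordinary first-layer weights
B = rng.normal(size=(h, K))  # one bias column per class

x = rng.normal(size=d)
k = 2
y = np.zeros(K)
y[k] = 1.0                   # one-hot class vector for class k

# Conditional first layer acting on the concatenated input [x; y] ...
W_conditional = np.hstack([W, B])
out_concat = W_conditional @ np.concatenate([x, y])

# ... equals the ordinary layer output plus the k-th bias column B[:, k].
out_bias = W @ x + B[:, k]

print(np.allclose(out_concat, out_bias))  # True
```

The same check applies to the discriminator's first layer, since the construction is identical.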

3. CONTINUOUS GENERATIVE NETWORKS CANNOT MODEL DISTRIBUTIONS DRAWN FROM DISCONNECTED DATA

3.1. DISCONNECTED DATA MODEL

We begin by introducing a new data model that accounts for disconnected data. Typical datasets with class labels satisfy this model; we provide examples below.

Definition 1 (Disconnected data model). We assume that the data lies on K disjoint, compact sets X_k ⊂ R^p, k ∈ {1, . . . , K}, so that the whole dataset lies on the disjoint union of the components: ⊔_{k=1}^K X_k = X. Moreover, we assume that each component X_k is connected (Rudin, 1964). We then draw data points from these sets in order to construct our finite datasets.

In Definition 1, we let each X_k be compact in order to remove the degenerate case of two components X_k and X_j that are arbitrarily close to one another, which is possible if we only assume that X is closed and the components are disjoint. In that case, there are trivial counter-examples (see the appendix) to the theorems proved below.

Lemma 1. X is a disconnected set, and X_j is disconnected from X_k for j ≠ k.

Disconnected datasets are ubiquitous in machine learning (Khayatkhoei et al., 2018; Hoang et al., 2018; Pandeva & Schubert, 2019). For example, datasets with discrete labels (typical in classification problems) will often be disconnected. We study this disconnectedness property because generative networks are unable to learn a distribution supported on such a dataset, as we show below.

3.2. CONTINUOUS GENERATIVE NETWORKS CANNOT REPRESENT A DISCONNECTED DATA DISTRIBUTION EXACTLY

In this section, we prove that, under Definition 1, continuous generative networks cannot learn the true data distribution exactly, due to model misspecification. Suppose that (Ω, F, P_z) is a probability space with P_z being the distribution of the random vector z : Ω → R^d. We assume that P_z is equivalent to the Lebesgue measure λ; this just means that λ(z(A)) = 0 if and only if P_z(A) = 0 for any set A ∈ F. This holds for a Gaussian distribution, for example, which is commonly used as a latent distribution in GANs (Arjovsky et al., 2017). The transformed (via the generative network G) random vector x = G ∘ z : Ω → R^p is determined by the original probability measure P_z but is defined on the induced probability space (Ω′, F′, P_G).

Theorem 1. If G can generate from multiple components of X (say X_1 and X_2), then the probability of generating samples outside of X is positive: P_G(x ∈ R^p \ X) > 0. Otherwise, if G can only generate from one component (say X_1), then P_G(X_i) = 0 for i ∈ {2, . . . , K}.

The continuity of G is the fundamental reason why Theorem 1 is true: a continuous function cannot map a connected space onto a disconnected space. This means that all generative networks must generate samples outside of the dataset if the data satisfies Definition 1. Suppose that our data is generated from the true random vector x_data : Ω′ → R^p with probability distribution P_X, and suppose that we learn P_G by training a generative network.

Corollary 1. Under Definition 1, we have that d(P_G, P_X) > 0 for any distance metric d and any learned distribution P_G.

From Corollary 1, we see that learning the data distribution incurs irreducible error under Definition 1, because our data model and the model that we train do not match. Hence, we need to change which models we consider during training in order to best reflect the structure of our data. At first thought, a discontinuous G might be considered, but that would require training G without backpropagation. Instead, we focus on restricting G to a disconnected domain (Section 3.3) and training an ensemble of GANs (Section 4.1) as two possible solutions.
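The intuition behind Theorem 1 can be seen in one dimension: a continuous generator whose range touches two separated components must, by the intermediate value theorem, also produce points in the gap between them. A toy sketch, where the components, the generator, and the latent distribution are all invented for illustration:

```python
import numpy as np

# Two disjoint compact components: X1 = [-2, -1] and X2 = [1, 2].
def in_gap(x):
    # Points strictly between the two components, i.e. outside X.
    return (-1.0 < x) & (x < 1.0)

def G(z):
    # A continuous "generator" that pushes most latent mass onto the two
    # components but, being continuous, must pass through the gap.
    return 1.5 * np.tanh(4.0 * z)

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)  # z ~ N(0, 1), a connected latent space
x = G(z)

# A positive fraction of samples falls strictly between X1 and X2,
# i.e. P_G(x in R \ X) > 0, as Theorem 1 predicts.
print(np.mean(in_gap(x)))  # strictly positive
```

No choice of continuous G and full-support latent distribution can drive this fraction to zero, which is exactly the irreducible error of Corollary 1.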

3.3. TRUNCATING THE LATENT DISTRIBUTION

In this section, we study how to remove the irreducible error of Theorem 1 from our models after training. Suppose that we train a generator G on some data so that G(R^d) ⊃ X; then G can actually generate points from the true data distribution. The distributions still cannot be equal because of Theorem 1, but if we restrict the domain of G to the set Z = G^{−1}(X), then G(Z) = X. The next theorem shows how the latent distribution is related to restricting the domain of G.

Theorem 2 (Truncating the latent space reduces error). Suppose that P_z(Z) > 0 and let the generator G learn a proportionally correct distribution over X; in other words, there exists a real number c ∈ R so that P_G(A) = c P_X(A) for all A ∈ F′ with A ⊂ X. Then, using the truncated latent distribution defined by P_{z_T}(B) = 0 for all B ∈ F that satisfy B ∩ Z = ∅, we learn the data distribution exactly, i.e., P_{G|Z}(A) = P_X(A) for all A ∈ F′.

We write P_{G|Z} because, by truncating the latent distribution, we effectively restrict G to the domain Z. Theorem 2 shows that if we learn the data distribution approximately, in the sense of learning a proportional distribution, then we can recover the true data distribution by truncating our latent distribution. By Theorem 4.22 in (Rudin, 1964), Z must be disconnected, which implies that a disconnected latent distribution is one way to remove the irreducible error of Theorem 1. Although Theorem 2 suggests truncating the latent distribution, this approach has several limitations. First, the latent distribution cannot be truncated without a closed-form expression for P_G. Second, we may try to learn the disconnected set Z by training a mixture distribution for P_z, as is done in (Ben-Yosef & Weinshall, 2018; Gurumurthy et al., 2017); the problem is that the geometric shape of Z is then restricted to be spherical or hyperellipsoidal. Third, before truncating the latent space, we need to train a generative network that proportionally learns the data distribution, which is impossible to confirm. Given these limitations, we introduce the use of ensembles of generative networks in Section 4.1. This class of models addresses the issues above as follows. First, we do not need access to P_G in any way before or after training. Second, the geometric shape of Z is no longer an issue, because each network in the ensemble is trained on the connected set X_k instead of the disconnected whole X. Finally, since the k-th network only needs to learn the distribution on X_k, we reduce the complexity of the learned distribution and do not have to confirm that the learned distribution is proportionally correct.
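When the data support is known, the truncation of Theorem 2 can be emulated by rejection sampling: draw z ∼ P_z and keep it only if G(z) lands in X, which samples exactly from the truncated distribution supported on Z = G^{−1}(X). A one-dimensional toy sketch, where the generator and the support of X are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):
    # Continuous toy generator whose range covers the data components
    # but also the gap between them (cf. Theorem 1).
    return 1.5 * np.tanh(4.0 * z)

def in_X(x):
    # Toy data support: X = [-1.5, -1] U [1, 1.5], two compact components.
    return (np.abs(x) >= 1.0) & (np.abs(x) <= 1.5)

# Plain sampling: a positive fraction of generated points falls outside X.
z = rng.normal(size=100_000)
x = G(z)
print(np.mean(~in_X(x)))  # positive fraction outside X

# Truncated latent distribution: keep only z in Z = G^{-1}(X).
z_trunc = z[in_X(G(z))]
x_trunc = G(z_trunc)
print(np.mean(in_X(x_trunc)))  # every kept sample lies in X
```

Note that this requires a membership test for X, which is exactly the kind of knowledge about P_G or the data support that the limitations above say is unavailable in practice; this motivates the ensemble approach instead.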

4. ENSEMBLES OF GANS AND PARAMETER SHARING

We demonstrate how to train ensembles of GANs practically and relate ensembles to single GANs, cGANs, and GM-GANs. We focus on feedforward (Goodfellow et al., 2016) GANs in this section for concreteness; therefore, we study an ensemble of discriminators as well as generators.

4.1. ENSEMBLE OF GANS VS. A SINGLE GAN

Given an ensemble of GANs, we write G_k : R^d → R^p for the k-th generator with parameters θ_{G_k} ∈ R^{|θ_G|}, k ∈ {1, . . . , K}, where K is the number of ensemble networks. We assume that each generator has the same architecture, hence |θ_{G_i}| = |θ_{G_j}| for all i, j; thus we drop the subscript and write |θ_G|. Likewise, we write D_k : R^p → [0, 1] for the k-th discriminator with parameters θ_{D_k} ∈ R^{|θ_D|}, since the discriminators all share the same architecture. The latent distribution is the same for each ensemble network: P_z. The generated distributions are denoted P_{G_k}. For concreteness, we assume that K is the number of classes in the data; for MNIST, CIFAR-10, and ILSVRC2012, K would be 10, 10, and 1000, respectively. If K is unknown, then an unsupervised approach (Hoang et al., 2018; Khayatkhoei et al., 2018) can be used.

Define the parameter π ∈ R^K such that Σ_{k=1}^K π_k = 1. We draw a one-hot vector y ∼ Cat(π) and generate a sample using the k-th generator if the k-th component of y is 1; hence, a generated sample is given by x = G_k(z). This ensemble of GANs is trained by solving

min_{θ_{G_k}} max_{θ_{D_k}} V(θ_{G_k}, θ_{D_k})   (2)

for k ∈ {1, . . . , K}. Note that with an ensemble like this, the overall generated distribution P_G(x) = Σ_{k=1}^K π_k P_{G_k}(x) is a mixture of the ensemble distributions. This makes comparing a single GAN to an ensemble challenging; for example, consider comparing a Gaussian to a mixture of Gaussians. In order to compare a single GAN to an ensemble of GANs, we define a new hybrid optimization problem

min_{θ_{G_1},...,θ_{G_K}} max_{θ_{D_1},...,θ_{D_K}} Σ_{k=1}^K V(θ_{G_k}, θ_{D_k})
s.t. Σ_{k=1}^K Σ_{j>k} ‖θ_{D_j} − θ_{D_k}‖_0 ≤ t and Σ_{k=1}^K Σ_{j>k} ‖θ_{G_j} − θ_{G_k}‖_0 ≤ t,   (3)

where ‖·‖_0 denotes the ℓ0 "norm," which counts the number of non-zero entries of a vector. Thus, t ≥ 0 indicates how many parameters may differ across the networks, which is more general than having tied weights between networks (Ghosh et al., 2018). We penalize the parameters because it is convenient, although this is not equivalent to penalizing the functions themselves: θ_{G_k} − θ_{G_j} = 0 implies G_k − G_j = 0, but the converse is not true. We analyze the behavior of (3) as we vary t in the next theorem.

Theorem 3. Let G and D be the generator and discriminator networks of a GAN. Suppose that for k ∈ {1, . . . , K}, G_k and D_k have the same architectures as G and D, respectively. Moreover, assume that P_X(X_j) = P_X(X_k) for all j, k. Then:
i) Suppose that t ≥ max{ (K(K−1)/2) |θ_D|, (K(K−1)/2) |θ_G| }. Then for all k ∈ {1, . . . , K} we have that (θ_{G_k}, θ_{D_k}) is a solution to (3) if and only if (θ_{G_k}, θ_{D_k}) is a solution to (2).
ii) Suppose that t = 0. Then (θ_G, θ_D) is a solution to (3) for each k ∈ {1, . . . , K} if and only if (θ_G, θ_D) is a solution to (1).

Informally, Theorem 3 shows that when t = 0 we essentially have a single GAN, because all of the networks in the ensemble share the same parameters. If t is large, then we have an unconstrained problem and the ensemble resembles the one in Equation (2). Therefore, this hybrid optimization problem trades off the degree of parameter sharing between ensemble components in a way that allows us to compare the performance of single GANs with ensembles. Unfortunately, Equation (3) is a combinatorial optimization problem and is computationally intractable. Experimentally, we relax Equation (3) to

min_{θ_{G_1},...,θ_{G_K}} ( max_{θ_{D_1},...,θ_{D_K}} Σ_{k=1}^K V(θ_{G_k}, θ_{D_k}) − λ Σ_{k=1}^K Σ_{j>k} ‖θ_{D_j} − θ_{D_k}‖_1 ) + λ Σ_{k=1}^K Σ_{j>k} ‖θ_{G_j} − θ_{G_k}‖_1   (4)

in order to promote parameter sharing with an almost-everywhere differentiable regularization term that we can backpropagate through during training. Although Equation (4) is a relaxation of Equation (3), we still have the same asymptotic behavior when we vary λ as when we vary t, as shown in Appendix A.
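The relaxed penalty in Equation (4) is cheap to compute: it is a pairwise sum of ℓ1 distances between corresponding parameter vectors. A minimal numpy sketch of the regularization term, with arbitrary illustrative parameter sizes:

```python
import numpy as np

def sharing_penalty(thetas):
    """Sum of ||theta_j - theta_k||_1 over all pairs j > k.

    `thetas` is a list of flattened parameter vectors, one per ensemble
    network, all of the same length (the networks share an architecture).
    """
    K = len(thetas)
    total = 0.0
    for k in range(K):
        for j in range(k + 1, K):
            total += np.abs(thetas[j] - thetas[k]).sum()
    return total

rng = np.random.default_rng(0)
thetas = [rng.normal(size=1000) for _ in range(4)]  # 4 hypothetical networks

print(sharing_penalty(thetas))           # > 0 for distinct networks
print(sharing_penalty([thetas[0]] * 4))  # 0.0 -- fully shared parameters
```

During training, λ times this term is added for the generators (and subtracted inside the max for the discriminators) as in Equation (4); since the ℓ1 norm is differentiable almost everywhere, the term can be backpropagated through, and larger λ drives the ensemble toward a single shared network (the t = 0 case of Equation (3)).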

4.2. OPTIMALITY OF ENSEMBLES OF GANS

The next theorem shows that if we are able to learn each component's distribution P_{X_k}, then an ensemble can learn the whole data distribution P_X.

Theorem 4. Suppose that G_k is the network that generates X_k for each k ∈ {1, . . . , K}, i.e., P_{G_k} = P_{X_k}. Under Definition 1, we can learn each G_k by solving (2) with V being the objective function in Equation (1).

We know from (Goodfellow et al., 2014) that a globally optimal solution is achieved when the distribution of the generated images equals P_X. Hence, this theorem has an important consequence: training an ensemble of networks is optimal under our data model. It is important to note that the condition "G_k is the network that generates X_k" is necessary but not too strong, because otherwise we may have a distribution that cannot be learned by a generative network, or our network may not have enough capacity to learn it. We do not consider such cases, however, because we are studying the behavior of generative networks under Definition 1.

4.3. RELATION TO CGANS AND GM-GANS

Relation to cGANs.

We compare a cGAN to an ensemble of GANs. Recall from Section 2.2 that a cGAN has parameters θ_G and θ_D that do not change with different labels, along with matrices B_G and B_D that do depend on the labels. Specifically, cGANs solve the optimization problem (3) with the additional constraint that the only parameters allowed to differ are the biases in the first layer. A similar result applies to other variants of cGANs.

Theorem 5. A cGAN is equivalent to an ensemble of GANs with parameter sharing among all parameters except for the biases in the first layer. Moreover, the optimization in (3) can be modified so that it is equivalent to the cGAN optimization problem.

Relation to GM-GANs. Another generative network related to ensembles is the GM-GAN. The first layer of a GM-GAN transforms the latent distribution from an isotropic Gaussian into a mixture of Gaussians. This layer plays a similar role to the B_{·,k} in the cGAN comparison above, meaning that GM-GANs solve the optimization problem (3) with the additional constraint that the only parameters allowed to differ are those in the first layer.

Theorem 6. A GM-GAN is equivalent to an ensemble of GANs with parameter sharing among all parameters except for the first layer. Moreover, the optimization in (3) can be modified so that it is equivalent to the GM-GAN optimization problem.

We do not have to sacrifice computation to achieve better performance; we just need models that capture the underlying structure of the data.
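The GM-GAN first layer can be viewed as a class-dependent affine reparameterization of the latent code: with isotropic z, the transformed code z′ = μ_k + σ_k ⊙ z is Gaussian with class-specific mean and scale, so the marginal over classes is a mixture of Gaussians. A sketch under these assumptions, with means, scales, and mixture weights invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, n = 3, 2, 60_000
mu = rng.normal(scale=5.0, size=(K, d))     # per-class means
sigma = rng.uniform(0.5, 1.5, size=(K, d))  # per-class scales
pi = np.array([0.2, 0.3, 0.5])              # mixture weights

# Draw class labels y ~ Cat(pi), then reparameterize isotropic latents.
y = rng.choice(K, size=n, p=pi)
z = rng.normal(size=(n, d))                 # z ~ N(0, I)
z_mix = mu[y] + sigma[y] * z                # "first layer" of a GM-GAN

print(np.bincount(y, minlength=K) / n)      # close to pi for large n
print(z_mix.mean(axis=0), pi @ mu)          # empirical mean vs. sum_k pi_k mu_k
```

Only μ and σ differ per class here; feeding z_mix into a shared downstream network is exactly the parameter-sharing pattern described in Theorem 6.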

5. EXPERIMENTAL RESULTS

In this section, we study how ensembles of WGANs (Arjovsky et al., 2017) compare with a single WGAN and a conditional WGAN. We use code from the authors' official repository (Arjovsky et al., 2018) to train the baseline model, and we modified this code to implement our ensembles of GANs and the cGAN. For evaluating performance, we use the FID score (Heusel et al., 2017; 2020), average MSE to the training data (Metz et al., 2016; Lipton & Tripathi, 2017), and precision/recall (Sajjadi et al., 2018; 2019). More details about the experimental setup are given in Appendix C.

5.1. ENSEMBLES PERFORM BETTER THAN SINGLE NETWORKS

We consider a basic ensemble of WGANs where we simply copy over the WGAN architecture 10 times and train each network on the corresponding class of CIFAR-10; we call this the "full ensemble". We compare this ensemble to the baseline WGAN trained on CIFAR-10. Figure 1 shows that the full ensemble of WGANs performs better than the single WGAN. It is not immediately clear, however, whether this boost in performance is due to the functional difference of having an ensemble or if it is happening because the ensemble has more parameters. The ensemble has 10 times more parameters than the single WGAN, so the comparison is hard to make. Thus, we consider constraining the ensemble so that it has fewer parameters than the single WGAN.

5.2. ENSEMBLES WITH FEWER TOTAL PARAMETERS STILL OUTPERFORM A SINGLE NETWORK

The "equivalent ensemble" (3,120,040 total generator parameters) in Figure 1 still outperforms the single WGAN (3,476,704 generator parameters), showing that the performance increase comes from using an ensemble rather than simply having larger capacity. In other words, ensembles of GANs improve performance even when the ensemble is simpler than the original network in terms of the number of parameters. Figure 1 also shows a performance boost from increasing the number of parameters, as the full ensemble outperforms the equivalent ensemble. Therefore, we perform better both by having a better model (an ensemble) and by having more parameters. Next, we investigate a way to further improve performance.

5.3. PARAMETER SHARING IMPROVES PERFORMANCE

We study how the regularization penalty λ affects performance. As discussed in Section 4.1, we can learn a model that lies somewhere between an ensemble and a single network by using ℓ1 regularization. In Figure 2, the performance increases when we increase λ from 0 to 0.001 in the equivalent ensemble, implying that there is some benefit to regularization. Recall that by having λ > 0, we force parameter sharing between the generator and discriminator networks. This performance increase is likely data dependent and has to do with the structure of the underlying data X. For example, we can have pictures of badgers (X_1) and zebras (X_2) in our dataset, and these sets are disconnected; however, the backgrounds of these images are likely similar, so there is some benefit in G_1 and G_2 treating these images similarly, if only to model the background. As we increase λ from 0.001 to 0.01, performance decreases. This means there is a sweet spot, and we may be able to find an optimal 0 < λ* < 0.01 via hyperparameter tuning. The performance is not monotonic with respect to λ: it first improves and then degrades, so the optimum lies neither at λ = 0 nor as λ → ∞. The optimization problem in (4) can therefore be used to find a better ensemble than the equivalent ensemble of Section 5.2 while still using fewer parameters than the baseline WGAN.

Figure 2: Training an ensemble with the regularization in (4) for different values of λ. Each curve is calculated using the equivalent ensemble of WGANs discussed in Section 5.2. As λ increases to 0.001 the performance increases, but it decreases as λ is increased further to 0.01, implying an optimal value of λ that can be found via hyperparameter tuning. The solid blue line is the equivalent ensemble with λ = 0.01, the dotted red line is the equivalent ensemble WGAN, and the dashed black line is the equivalent ensemble with λ = 0.001.

5.4. ENSEMBLES OUTPERFORM CGANS

We modify a WGAN to be conditional and call it cWGAN. This cWGAN is trained on CIFAR-10, and we compare it to ensembles of WGANs; we do this because we showed in Section 4.3 that cGANs are ensembles of GANs with a specific type of parameter sharing. As can be seen from Figure 3, ensembles perform better than the cWGAN. The baseline WGAN actually performs similarly to the cWGAN, which implies that the conditioning is not helping in this specific case. We hypothesize that our model (λ = 0.001) performs better because more parameters are free in the optimization, instead of just the biases in the first layer. Thus, although cGANs are widely used, an ensemble with the regularization described in Section 4.1 can outperform them, because the ensemble better captures the disconnected structure of the data.

A PROOFS

We first show that under Definition 1, our data is disconnected.

Proof of Lemma 1. Since the X_k are compact, and hence closed, we have X̄_j ∩ X_k = X_j ∩ X̄_k = ∅ for j ≠ k, so X is disconnected.

Remark 1. Note that if our data is disconnected, it does not necessarily follow Definition 1; that is, Definition 1 is a stronger condition than merely having disconnected data. This can be seen from the following counter-example. We denote a truncated Gaussian by N(μ, σ²)|_S, where the distribution is non-zero on S. Let P_X = (1/2) P_{X_1} + (1/2) P_{X_2} = (1/2) N(0, 1)|_{(−∞,0)} + (1/2) N(0, 1)|_{(0,∞)} be the true distribution. Note that (−∞, 0) and (0, ∞) are disconnected but do not follow Definition 1 because they are not compact. Moreover, we can learn this distribution easily by letting G be the identity function and P_z = N(0, 1); it is trivial to show that this results in P_G = P_X. Hence, disconnected data is too weak an assumption: we need a non-zero distance between the disconnected sets, and that is what Definition 1 captures.

Proof of Theorem 1. Without loss of generality, assume that G can generate at least from the two components X_1 and X_2. Define B = G^{−1}(X) and note that G(B) must be disconnected, because G can generate from at least X_1 and X_2 and these sets are disconnected. Since X is disconnected and closed in R^p, B is disconnected and closed in R^d because G is continuous (Theorems 4.8 and 4.22 in (Rudin, 1964)). Since B is closed in R^d, the set R^d \ B is open. Moreover, R^d \ B is not empty, because R^d is connected and B is not. The Lebesgue measure λ of a nonempty open set is positive, hence λ(R^d \ B) > 0. Since λ is equivalent to P_z, we have P_z(z ∈ R^d \ B) > 0. Thus,

P_G(x ∈ R^p \ X) = P_z(z ∈ G^{−1}(R^p \ X)) = P_z(z ∈ R^d \ B) > 0,

as desired.

Proof of Corollary 1. Since the data lies only on X, we know that P_X(x_data ∈ R^p \ X) = 0 for any valid probability measure.
However, we have that P G px P R p zX q ą 0. Hence, dpP G , P X q ą 0 for any metric d. Proof of Theorem 2. The truncated latent distribution is denoted P z T and is defined as P z T pBq " P z pB X Zq P z pZq for any set B P F. Hence, we have that P G| Z pAq " P z T pG ´1pAqq " P z pG ´1pAq X Zq P z pZq " 1 P z pZq P z pG ´1pAq X G ´1pX qq " 1 P z pZq P z pG ´1pA X X qq " 1 P z pZq P G pA X X q " c P z pZq P X pA X X q " c P z pZq P X pAq Proof of Theorem 5. First we show that ensembles are equivalent to cGANs under the right conditions. Fix the architecture of the networks considered and only focus on the generator. We want to show that cGANs and ensembles are equivalent, so we first focus on the generators in cGANs. We define the set of functions which represent conditional versions of the fixed architecture as GpKq " " G θ,B : R ˆt1, . . . , Ku Ñ R p : B, θ are network parameters * , where B is a matrix with K columns and whose rows depend on the width of the first hidden layer as discussed in Section 2.1. The rest of the parameters are represented as the vector θ above. It is clear that for a function in GpKq, there could be many corresponding networks; some of these are due to activation and weight symmetries (Bishop, 2006) . This implicitly create an equivalence class of networks; in particular, two networks are equivalent if they are the same mapping, regardless of weights. These symmetries do not affect our argument but it a subtlety to keep in mind. Obviously not every ensemble is the same as the generator in a cGAN; however, we focus on a very specific type of ensemble. We define the set of all ensembles that have a variable bias in the first layer as G E pKq " " `Gθ,b k : R Ñ R p ˘K k"1 : b k , θ are network parameters for each k * , which is a collection of K-tuples of functions that represent networks with our fixed architecture. 
In the definition above, for $j \neq k$, the networks $G_{\theta,b_j}$ and $G_{\theta,b_k}$ share the same parameter vector $\theta$, but the biases $b_j$ and $b_k$ may differ. Ensembles used in the construction of $\mathcal{G}_E(K)$ are thus ensembles with parameter sharing everywhere except in the bias term of the first layer; these biases are not constrained to be similar at all. Since a network that induces a function in $\mathcal{G}(K)$ is the conditional version of the networks in the ensembles used to construct $\mathcal{G}_E(K)$, as described in Section 2.1, parameter equality implies functional equality. In other words, for $G_{\theta,[b_1,\dots,b_K]} \in \mathcal{G}(K)$ and $(G_{\theta,b_k})_{k=1}^K \in \mathcal{G}_E(K)$ we have
$$G_{\theta,[b_1,\dots,b_K]}(z,k) = G_{\theta,b_k}(z)$$
for all $z \in \mathbb{R}^d$, $k \in \{1,\dots,K\}$, and all parameters $\theta, b_k$. We now show that there exists a one-to-one correspondence between these two sets. Define $T : \mathcal{G}_E(K) \to \mathcal{G}(K)$ by
$$\Big(T\big((G_{\theta,b_j})_{j=1}^K\big)\Big)(z,k) = G_{\theta,b_k}(z) = G_{\theta,[b_1,\dots,b_K]}(z,k)$$
for each $z \in \mathbb{R}^d$ and $k \in \{1,\dots,K\}$. Informally, we map an ensemble to a single network by picking the $k$-th network in the ensemble whenever the condition is $k$. For fixed $\theta$ and $b_1,\dots,b_K$, we see that $T\big((G_{\theta,b_j})_{j=1}^K\big)$ is indeed a function from $\mathbb{R}^d \times \{1,\dots,K\}$ to $\mathbb{R}^p$. Moreover, $T\big((G_{\theta,b_j})_{j=1}^K\big)$ is equal, as a function, to $G_{\theta,[b_1,\dots,b_K]} \in \mathcal{G}(K)$, so $T$ is well defined. Now let $G_{\theta,B} \in \mathcal{G}(K)$ be given. Letting $b_1,\dots,b_K$ be the columns of $B$, we see that $T\big((G_{\theta,b_j})_{j=1}^K\big) = G_{\theta,B}$. Thus, $T$ is a one-to-one correspondence between $\mathcal{G}_E(K)$ and $\mathcal{G}(K)$. Hence, for every ensemble of networks defined above, we can find a cGAN which is equivalent to the ensemble, where equivalence means equality of the functions induced by the networks. The above result holds for any architecture and all $K \in \mathbb{Z}^+$; in particular, it also holds for the discriminator networks. Next, we want to show that a modified version of the optimization problem from (3) yields the cGAN optimization problem.



Figure 1: Ensembles of WGANs with fewer total parameters than a single WGAN perform better on CIFAR-10.

Figure 2: Ensembles of WGANs have a performance sweet spot when we regularize the optimization problem in expression (4) with different values of $\lambda$. Each curve is calculated using the equivalent ensemble of WGANs discussed in Section 5.2. As we increase $\lambda$ to 0.001 the performance improves, but it degrades as we continue to increase $\lambda$ to 0.01. This implies that there is an optimal value of $\lambda$ that can be found via hyperparameter tuning. The solid blue line is the equivalent ensemble with $\lambda = 0.01$, the dotted red line is the equivalent ensemble WGAN, and the dashed black line is the equivalent ensemble with $\lambda = 0.001$.

Figure 3: Regularized ensembles of WGANs using the optimization in (4) outperform cWGANs, even though cGANs are a type of ensemble. Here, the cWGAN actually performs similarly to the baseline WGAN even though it uses class information. The solid blue line is the baseline, the dotted red line is the cWGAN, and the dashed black line is the equivalent ensemble with $\lambda = 0.001$.

The argument above, showing that every $G_{\theta,B} \in \mathcal{G}(K)$ arises as $T\big((G_{\theta,b_j})_{j=1}^K\big)$ with $b_1,\dots,b_K$ the columns of $B$, implies that $T$ is surjective. Next, suppose that $G_{\theta^\alpha,B^\alpha} = G_{\theta^\beta,B^\beta}$ are functions in $\mathcal{G}(K)$ with $B^\alpha = [b^\alpha_1, \dots, b^\alpha_K]$ and $B^\beta = [b^\beta_1, \dots, b^\beta_K]$. Then clearly, for $\big(G_{\theta^\alpha,b^\alpha_k}\big)_{k=1}^K$ and $\big(G_{\theta^\beta,b^\beta_k}\big)_{k=1}^K$ in $\mathcal{G}_E(K)$, we have
$$G_{\theta^\alpha,b^\alpha_k}(z) = G_{\theta^\alpha,B^\alpha}(z,k) = G_{\theta^\beta,B^\beta}(z,k) = G_{\theta^\beta,b^\beta_k}(z),$$
implying that $T$ is injective.
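The correspondence $T$ is easy to verify numerically. The following NumPy sketch uses a two-layer toy generator of our own construction (all sizes and names are illustrative, not the paper's code): feeding a one-hot class vector through a weight matrix $B$ in the first layer is exactly the same map as selecting the $k$-th member of an ensemble that shares every parameter except the first-layer bias.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, p, K = 8, 16, 4, 3          # latent dim, hidden width, output dim, classes

# Shared parameters theta: first-layer weights, base bias, and second layer.
W1 = rng.normal(size=(h, d))
b1 = rng.normal(size=h)
W2 = rng.normal(size=(p, h))
B = rng.normal(size=(h, K))       # one column of class-dependent bias per class

def cgan_generator(z, k):
    """Conditional generator: the one-hot class input enters through B."""
    e = np.zeros(K)
    e[k] = 1.0
    hidden = np.tanh(W1 @ z + B @ e + b1)
    return W2 @ hidden

def ensemble_member(z, k):
    """k-th ensemble member: same theta, bias shifted by the k-th column of B."""
    hidden = np.tanh(W1 @ z + B[:, k] + b1)
    return W2 @ hidden

z = rng.normal(size=d)
for k in range(K):
    assert np.allclose(cgan_generator(z, k), ensemble_member(z, k))
```

Since $B e_k$ is exactly the $k$-th column of $B$, the two forward passes are identical term by term, which is the content of the map $T$.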

A ANNEX

for any $A \in \mathcal{F}'$. The last equality holds because $P_X(\mathcal{X}) = 1$, so points outside of $\mathcal{X}$ have zero probability. In the result above, set $A = \mathbb{R}^p$ to see that $c = P_z(Z)$, implying that $P_{G|Z}(A) = P_X(A)$ for all $A \in \mathcal{F}'$, as desired.

Proof of Theorem 3. First we prove i). If $t$ is at least the maximum value attained by the constrained quantities, then the constraints in (3) are inactive, so the problem decouples and is equivalent to solving the independent optimization problems
$$\min_{\theta_{G_k}} \max_{\theta_{D_k}} V(\theta_{G_k}, \theta_{D_k}), \qquad k \in \{1, \dots, K\}.$$
Thus, i) is shown.

Next, we prove ii). Suppose that $t = 0$. Given a distribution $P_X$, we can restrict it to each component $X_k$ and normalize to obtain the restricted distributions
$$P_{X_k}(A) = \frac{P_X(A \cap X_k)}{P_X(X_k)}$$
for each $A \in \mathcal{F}'$ and each $k$. Since we assume that $P_X(X_j) = P_X(X_k)$ for all $j, k$ and that $P_X(\mathcal{X}) = \sum_{k=1}^K P_X(X_k) = 1$, we see that $P_X(X_k) = \frac{1}{K}$ for each $k$. This implies that for any measurable function $f : \mathbb{R}^p \to \mathbb{R}$,
$$\mathbb{E}_{x \sim P_X}[f(x)] = \frac{1}{K} \sum_{k=1}^K \mathbb{E}_{x \sim P_{X_k}}[f(x)].$$
Suppose that $V$ is the standard cross-entropy objective function, and write $V(\theta_G, \theta_D; P)$ to indicate that $V(\theta_G, \theta_D)$ is evaluated with data distribution $P$. Since the objective is an expectation with respect to the data distribution, we see that
$$V(\theta_G, \theta_D; P_X) = \frac{1}{K} \sum_{k=1}^K V(\theta_G, \theta_D; P_{X_k}).$$
Similarly, if $V$ is the Wasserstein objective function, the same decomposition holds. This means that the problem in (3) with $t = 0$ collapses to a single shared problem, which is equivalent to the optimization problem in (1), as desired.

Theorem 7. Let $G$ and $D$ be the generator and discriminator networks in a GAN. Suppose that for $k \in \{1, \dots, K\}$, $G_k$ and $D_k$ have the same architectures as $G$ and $D$, respectively. Moreover, assume that $P_X(X_j) = P_X(X_k)$ for all $j, k$. Then:

i) Suppose that $\lambda = 0$. Then for all $k \in \{1, \dots, K\}$, $(\theta_{G_k}, \theta_{D_k})$ is a solution to (4) if and only if $(\theta_{G_k}, \theta_{D_k})$ is a solution to (2).

ii) Suppose that $\lambda \to \infty$. Then $(\theta_G, \theta_D)$ is a solution to (4) for each $k \in \{1, \dots, K\}$ if and only if $(\theta_G, \theta_D)$ is a solution to (1).

Proof of Theorem 7. First we prove i). If $\lambda = 0$, then the regularization term vanishes and the problem reduces to the unconstrained problem of (2). Next, we prove ii).
As $\lambda \to \infty$, any solution with $\theta_{D_k} \neq \theta_{D_j}$ or $\theta_{G_k} \neq \theta_{G_j}$ for some $j, k \in \{1, \dots, K\}$ becomes suboptimal. Consequently, the optimization problem in (4) reduces to a single problem with fully shared parameters, which is equivalent to the optimization problem in (1). We only outline the proof here because it closely parallels the proof of Theorem 3.

Proof of Theorem 4. Note that $P_X$ is the total data distribution and each $P_{X_k}$ is the distribution on one disconnected component. This means that
$$P_X = \sum_{k=1}^K \pi_k P_{X_k}$$
for some mixture coefficients $\pi_k > 0$ with $\sum_{k=1}^K \pi_k = 1$. Fix an arbitrary $k \in \{1, \dots, K\}$. By Theorem 1 of (Goodfellow et al., 2014), the $k$-th GAN optimization problem has the solution $P_{G_k} = P_{X_k}$. Since this holds for every $k$ and since $P_X = \sum_{k=1}^K \pi_k P_{X_k}$, the ensemble learns the complete data distribution.

Returning to the proof of Theorem 5, we begin with the generic optimization problem from (3), abbreviating its constraint as $C_D$ purely for notational convenience. Likewise, we write $\theta_{G_k} = [(\theta'_{G_k})^T \; (B_G)_{\cdot,k}^T]^T$, and similarly for $\theta_{D_k}$, for each $k$. Keep in mind that $B_G$ and $B_D$ are matrices whose $k$-th columns are the first-layer biases of the $k$-th generator and discriminator in the ensemble, respectively. So far, we have only introduced notational changes. Now consider changing the constraints so that $B_G$ and $B_D$ are unconstrained while $\theta'_{G_k} = \theta'_{G_j}$ for all $k$ and $j$, and similarly $\theta'_{D_k} = \theta'_{D_j}$ for all $k$ and $j$. Under this new constraint, every network shares all parameters except the first-layer biases, so by the correspondence established above, the optimization problem becomes the cGAN optimization problem, where we define $\theta_G$ as shorthand for any one of the $\theta'_{G_k}$ vectors, since they are all equal (and likewise for $\theta_D$). Hence, a cGAN is equivalent to solving the ensemble optimization problem in (3) with a modified constraint.

Proof of Theorem 6. The proof is very similar to the proof of Theorem 5.
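The expectation decomposition used in the proof of Theorem 3 ii) can be checked numerically. Below is a minimal sketch, with a toy uniform mixture of well-separated Gaussians standing in for the disconnected components (all values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 4, 50_000                  # number of components, samples per component

# Toy "disconnected" components: K well-separated Gaussians with mass 1/K each.
centers = np.array([0.0, 10.0, 20.0, 30.0])
samples_per_k = [rng.normal(c, 0.5, size=n) for c in centers]
all_samples = np.concatenate(samples_per_k)   # draws from P_X (uniform mixture)

f = lambda x: np.sin(x) + 0.1 * x             # an arbitrary measurable test function

lhs = f(all_samples).mean()                            # E_{P_X}[f]
rhs = np.mean([f(s).mean() for s in samples_per_k])    # (1/K) sum_k E_{P_{X_k}}[f]
assert abs(lhs - rhs) < 1e-6
```

With equal component masses the mixture expectation is exactly the average of the per-component expectations, which is what lets the GAN objective split into $K$ per-component objectives.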

B ESTIMATION OF ENSEMBLE PARAMETERS

In Section 4.1, we assume that $k \sim p_k$, a multinomial distribution of degree $K$ with parameters $\pi_i$ for $i = 1, \dots, K$. Using the maximum likelihood estimator (Bishop, 2006), we obtain
$$\hat{\pi}_i = \frac{1}{N} \sum_{j=1}^N \mathbb{1}(y_j = i) \quad \text{for } i \in \{1, \dots, K\},$$
where $y_1, \dots, y_N$ are the class labels of the training set. For datasets like MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009), $k$ is uniformly distributed. For other datasets, one may have to estimate $p_k$ from the class imbalances.
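This estimator is just the empirical class frequency. A minimal sketch (toy labels and a hypothetical helper name of our own choosing):

```python
from collections import Counter

def estimate_pi(labels, K):
    """MLE of the multinomial parameters: pi_i = (1/N) * #{j : y_j = i}."""
    counts = Counter(labels)
    N = len(labels)
    return [counts.get(i, 0) / N for i in range(1, K + 1)]

# Toy label set with a class imbalance (illustrative, not CIFAR-10).
labels = [1, 1, 1, 2, 3, 3, 2, 1, 3, 3]
pi_hat = estimate_pi(labels, K=3)
assert pi_hat == [0.4, 0.2, 0.4]
assert abs(sum(pi_hat) - 1.0) < 1e-12
```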

C EXPERIMENTAL DETAILS

In this section we describe the details of our experiments.

C.1 PERFORMANCE MEASURES

We use FID (Heusel et al., 2017), average MSE (Metz et al., 2016), precision, and recall (Sajjadi et al., 2018) to evaluate our models. For FID, precision, and recall we use the official repositories (Heusel et al., 2020; Sajjadi et al., 2019). For each of these measures, we compare a set of generated images to a set of images from the training set. For the FID calculation, we use the precalculated statistics for CIFAR-10 and compare against 10,000 images generated by our trained networks. For precision and recall, we compare 10,000 generated images to 10,000 images from the training set. All other parameters are left at their defaults.

For the average MSE calculation, we use the algorithm introduced in (Lipton & Tripathi, 2017), which was empirically shown to work 100% of the time on DCGAN-style architectures such as WGAN. We modified the code in (Lao, 2017) so that it can be run with multiple restarts if desired. We ran our experiments with 1000 iterations and 5 restarts, on 100 training images.
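The latent-recovery procedure behind the average MSE measure amounts to gradient descent on $\|G(z) - x\|^2$ with random restarts, keeping the best result. Below is a toy sketch in that spirit: a linear generator stands in for the trained WGAN generator, and the function name, sizes, and learning rate are our own illustrative choices, not the code of (Lipton & Tripathi, 2017) or (Lao, 2017).

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 5, 20
A = rng.normal(size=(p, d))       # toy linear "generator" G(z) = A z

def recover_latent(x, iters=1000, restarts=5, lr=0.01):
    """Gradient descent on ||G(z) - x||^2 with random restarts; keep best MSE."""
    best_mse = np.inf
    for _ in range(restarts):
        z = rng.normal(size=d)                # random restart
        for _ in range(iters):
            grad = 2 * A.T @ (A @ z - x)      # gradient of the squared error
            z -= lr * grad
        mse = np.mean((A @ z - x) ** 2)
        best_mse = min(best_mse, mse)
    return best_mse

x = A @ rng.normal(size=d)        # an "image" that G can generate exactly
assert recover_latent(x) < 1e-4   # recoverable image => near-zero MSE
```

For a real generator the objective is non-convex, which is why the restarts matter; for images outside the generator's range, the residual MSE measures how far the generated manifold is from the data.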

C.2 BASELINE MODEL

For the baseline model, we ran the default WGAN code for 1000 epochs on CIFAR-10. All other parameters are left at their default values.

C.3 FULL ENSEMBLE

To create the full ensemble, we copied the baseline model 10 times and trained each network pair $(G_k, D_k)$ in the ensemble on a single class of CIFAR-10. Training again lasted 1000 epochs. This is equivalent to solving the optimization problem in (2).
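The per-class data split this requires can be sketched as follows (toy stand-ins for images and labels; not the actual training code):

```python
from collections import defaultdict

def split_by_class(images, labels):
    """Group training examples by class so each (G_k, D_k) sees one class only."""
    per_class = defaultdict(list)
    for img, y in zip(images, labels):
        per_class[y].append(img)
    return dict(per_class)

# Toy stand-ins for (image, label) pairs.
images = ["img0", "img1", "img2", "img3", "img4"]
labels = [0, 1, 0, 2, 1]
shards = split_by_class(images, labels)
assert shards[0] == ["img0", "img2"]
assert shards[1] == ["img1", "img4"]
assert shards[2] == ["img3"]
```

Each shard is then used as the full training set for the corresponding ensemble member.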

C.4 EQUIVALENT ENSEMBLE

Normally, WGAN is trained with the architecture parameters ngf = 64 and ndf = 64. To use roughly 10% of the parameters, we trained each ensemble component with ngf = 15 and ndf = 20. The depths of the generator and discriminator in the equivalent ensemble are the same as in the single WGAN; however, we reduce the width of each corresponding layer so that the ensemble has fewer total parameters than the single WGAN. Specifically, the generator of the WGAN has 3,576,704 parameters and each generator of the equivalent ensemble has 312,004 parameters; the discriminator of the WGAN has 2,765,568 parameters and each discriminator of the equivalent ensemble has 272,880 parameters. Reducing the width of each layer is not necessarily the optimal way to reduce the parameter count; we do it because it is easy and effective, and finding an optimal reduction is outside the scope of this paper. This ensemble, too, is trained by solving the optimization problem in (2).
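A rough sketch of why narrowing every layer shrinks the parameter count so sharply. The layer list below is an assumed DCGAN-style generator without batch-norm, so the totals will not match the exact figures above; it only illustrates the roughly quadratic effect of reducing ngf.

```python
def conv_params(c_in, c_out, k=4):
    """Parameters of a (transposed) conv layer: weights plus one bias per filter."""
    return c_in * c_out * k * k + c_out

def dcgan_generator_params(ngf, nz=100, nc=3):
    """Parameter count for a small DCGAN-style generator (layer list assumed)."""
    widths = [nz, ngf * 4, ngf * 2, ngf, nc]
    return sum(conv_params(a, b) for a, b in zip(widths, widths[1:]))

full = dcgan_generator_params(ngf=64)
slim = dcgan_generator_params(ngf=15)
# Every interior layer has both fan-in and fan-out proportional to ngf,
# so the count shrinks roughly quadratically in the width factor.
assert slim < full / 5
```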

C.5 REGULARIZED ENSEMBLES

For all the ensembles with $\lambda > 0$, we use the equivalent ensemble architecture while solving (4).
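For illustration only: one plausible form of a parameter-tying penalty consistent with the limiting behavior in Theorem 7 ($\lambda = 0$ decouples the ensemble members, $\lambda \to \infty$ forces their parameters to be equal) is a sum of squared pairwise differences. The exact regularizer in (4) is defined in the main text; the sketch below is our own assumption, not that definition.

```python
import numpy as np

def tying_penalty(thetas, lam):
    """lam * sum over pairs ||theta_k - theta_j||^2 (illustrative form only)."""
    K = len(thetas)
    total = 0.0
    for k in range(K):
        for j in range(k + 1, K):
            total += np.sum((thetas[k] - thetas[j]) ** 2)
    return lam * total

thetas = [np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.array([3.0, 0.0])]
assert tying_penalty(thetas, lam=0.0) == 0.0           # lambda = 0: no coupling
assert tying_penalty(thetas, lam=1.0) > 0.0            # unequal params penalized
assert tying_penalty([thetas[0]] * 3, lam=10.0) == 0.0 # equal params: zero penalty
```

Any penalty with these two limits interpolates between the independent ensemble of (2) and the fully shared single GAN of (1), which is the sweet-spot behavior observed in Figure 2.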

C.6 THE CGAN MODEL

For this architecture, we modify the baseline architecture and concatenate the class label, represented as a one-hot vector, to the input of the generator and discriminator networks.
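The concatenation step can be sketched in a few lines (sizes and helper names are illustrative, not the actual model code):

```python
import numpy as np

def one_hot(k, K):
    """One-hot encoding of class k among K classes."""
    e = np.zeros(K)
    e[k] = 1.0
    return e

def conditional_input(z, k, K):
    """Concatenate the latent vector with the one-hot class label."""
    return np.concatenate([z, one_hot(k, K)])

z = np.array([0.5, -1.2, 0.3])
x = conditional_input(z, k=2, K=10)
assert x.shape == (13,)                       # latent dim 3 + 10 classes
assert x[3 + 2] == 1.0 and x[3:].sum() == 1.0 # exactly one active class entry
```

The discriminator input is formed the same way, with the (flattened) image taking the place of $z$.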

