ON THE OPTIMAL PRECISION OF GANS

Abstract

Generative adversarial networks (GANs) are known to face model misspecification when learning disconnected distributions. Indeed, no continuous mapping can send a unimodal latent distribution onto a disconnected one, so GANs necessarily generate samples outside the support of the target distribution. In this paper, we connect the performance of GANs to the configuration of their latent space. In particular, we raise the following question: what is the latent space partition that minimizes the measure of out-of-manifold samples? Building on a recent result from geometric measure theory, we prove a sufficient condition for GANs to be optimal when the dimension of the latent space is larger than the number of modes. In particular, we show the optimality of generators that structure their latent space as a 'simplicial cluster', a Voronoi partition whose centers are equidistant. We derive both an upper and a lower bound on the optimal precision of GANs learning disconnected manifolds. Interestingly, these two bounds decrease at the same rate, $\sqrt{\log m}$, where $m$ is the number of modes. Finally, we perform several experiments to exhibit the geometry of the latent space and show experimentally that the latent space of GANs has properties similar to those of the theoretical optimum.

1. INTRODUCTION

GANs (Goodfellow et al., 2014), a family of deep generative models, have shown a strong capacity to generate photorealistic images (Karras et al., 2019). State-of-the-art models, like StyleGAN (Karras et al., 2019) or TransformerGAN (Jiang et al., 2021), benefit empirically from overparametrized networks with high-dimensional latent spaces. Besides, manipulating the latent representation of a GAN is helpful for diverse tasks such as image editing (Shen et al., 2020; Wu et al., 2021) or unsupervised learning of image segmentation (Abdal et al., 2021). However, the way GANs organize their latent space is still poorly understood theoretically. We argue that understanding it is a crucial step toward better apprehending the behavior of GANs.

To better understand GANs, the setting of disconnected distribution learning is enlightening. Experimental and theoretical works (Khayatkhoei et al., 2018; Tanielian et al., 2020) have shown a fundamental limitation of GANs when dealing with such distributions. Since the distribution modeled by GANs has connected support, some areas of GANs' support are necessarily mapped outside the true data distribution. When covering the modes of a disconnected distribution, GANs try to minimize the measure of the generated distribution lying outside the true modes (e.g. the purple area on the right of Figure 1). In other words, GANs need to minimize the measure of the borders between the modes in the latent space. With a Gaussian latent space, minimizing this measure is closely linked to the field of Gaussian isoperimetric inequalities (Ledoux, 1996). This field aims at deriving the partitions that decompose a Gaussian space with a minimal Gaussian-weighted perimeter. We argue that the optimal partitions derived in Gaussian isoperimetric inequalities shed light on the structure of the latent space of GANs.
Most notably, a recent result (Milman and Neeman, 2022) shows that, as long as the number of components $m$ in the partition and the dimension $d$ of the Gaussian space satisfy $m \le d + 1$, the optimal partition is a 'simplicial cluster': a Voronoi diagram whose cells are generated by equidistant points (see left of Figure 1 for $m = 3$ and $d = 3$). In this paper, we apply this result to GANs and show, both experimentally and theoretically, that GANs with a 'simplicial cluster' latent space minimize out-of-distribution generated samples. We draw the connection between GANs and Gaussian isoperimetric inequalities by using the precision metric (Sajjadi et al., 2018; Kynkäänniemi et al., 2019), which quantifies the portion of generated points that lie in the support of the target distribution. We show that GANs with a latent space organized as a simplicial cluster reach optimal precision levels and derive both an upper and a lower bound on the precision of such GANs. Experimentally, we show that the GANs with higher performance tend to organize their latent space as simplicial clusters.

Figure 1: Illustration of the ability of GANs to find an optimal configuration in the latent space. On the left, the propeller shape is a partition of 3D Gaussian space with the smallest Gaussian-weighted perimeter (figure from Heilman et al. (2013)). On the right, we show the 3D Gaussian latent space of a GAN trained on three classes of MNIST. Each area colored in blue, green, or red maps samples to one of the three classes. In purple, we observe the samples that are classified with low confidence. The partition reached by the GAN (right) is close to optimality (left), since the latent space partition is similar to the intersection of the propeller with a sphere.

To summarize, our contributions are the following:
• We are the first to import the latest results on Gaussian isoperimetric inequalities (Milman and Neeman, 2022) into the study and understanding of GANs. We use them to show that the latent space structure has major implications for the precision of GANs.
• We derive a new theoretical analysis, stating both an upper bound and a lower bound on the precision of GANs. We show that GANs with a latent space organized as a simplicial cluster have an optimal precision whose lower bound decreases at the same rate as the upper bound: $\sqrt{\log m}$, where $m$ is the number of modes.
• Experimentally, we show that GANs tend to structure their latent space as 'simplicial clusters' on image datasets. First, we explore two properties of the latent space: linear separability and convexity of classes. Then, we vary the latent space dimension and highlight how it impacts the performance of GANs. Finally, we show that overparametrization helps GANs approach the optimal structure and improves their performance.

2.1. NOTATION

Data. We assume that the target distribution $\mu$ is defined on the Euclidean space $\mathbb{R}^D$ (potentially a high-dimensional space), equipped with the Euclidean norm $\|\cdot\|$. We denote by $S_\mu$ the support of this unknown distribution $\mu$. In practice, however, we only have access to a finite collection of i.i.d. observations $X_1, \dots, X_m$ distributed according to $\mu$. Thus, for the remainder of the article, we let $\mu_m$ be the empirical measure based on $X_1, \dots, X_m$.

Generative model. We consider $\mathcal{G}_L$, the set of $L$-Lipschitz continuous functions from the latent space $\mathbb{R}^d$ to the high-dimensional space $\mathbb{R}^D$. Each generator aims at producing realistic samples. The latent distribution defined on $\mathbb{R}^d$ is assumed to be Gaussian and is denoted $\gamma$. Thus, each candidate distribution is the push-forward of $\gamma$ by a generator $G$ and is denoted $G_\sharp\gamma$. The Lipschitz assumption on $\mathcal{G}_L$ is reasonable, since Virmaux and Scaman (2018) presented an algorithm that upper-bounds the Lipschitz constant of any deep neural network. In practice, one can enforce the Lipschitzness of generator functions by clipping the neural networks' parameters (Arjovsky et al., 2017), penalizing the discriminative functions' gradient (Gulrajani et al., 2017; Kodali et al., 2017; Wei et al., 2018; Zhou et al., 2019), or penalizing the spectral norms (Miyato et al., 2018). Note that some large-scale generators such as SAGAN (Zhang et al., 2019) and BigGAN (Brock et al., 2019) also make use of spectral normalization for the generator.

2.2. GANS AND DISCONNECTED DISTRIBUTIONS

A significant flaw of GANs (Goodfellow et al., 2014) is their difficulty in learning multi-modal distributions. This phenomenon has been analyzed by Khayatkhoei et al. (2018) and Tanielian et al. (2020). The problem comes from a fundamental trade-off: GANs can either cover all modes and generate out-of-manifold samples, or generate only good-quality samples and neglect some modes (mode collapse). Some methods train disconnected generators (Khayatkhoei et al., 2018; Gurumurthy et al., 2017), but with little benefit compared to single overparametrized generators with rejection mechanisms (Tanielian et al., 2020; Azadi et al., 2018).

Empirically, several works give intuition on the latent space structure of GANs. Karras et al. (2019) show that binary attributes are linearly separable in the Gaussian latent space and even better separated in an intermediate latent space. Shen et al. (2020) stress that face attributes are separated by hyperplanes, and edit images only by moving in the latent space orthogonally to these hyperplanes. Arvanitidis et al. (2018) and Chen et al. (2018a) view the latent space of generative models from a Riemannian perspective and define a metric tensor using the generator's Jacobian to find the shortest paths on the data manifold. However, these findings might not be sufficient for a clear understanding of the required geometry of the latent space. For instance, Karras et al. (2019) use a very large latent space dimension ($\mathbb{R}^{512}$), while Sauer et al. (2022) argue that the optimal latent space dimension is close to the intrinsic dimension of images ($\mathbb{R}^{64}$ for ImageNet). Tanielian et al. (2020) stress the relevance of this problem by showing that the precision of GANs can converge to 0 when the number of modes or the distance between them increases.

In this paper, we take a step towards a better understanding of the behavior of GANs and exhibit an optimal latent space configuration when the number of modes $m$ and the dimension of the latent space $d$ satisfy $m \le d + 1$.

2.3. EVALUATING GANS

When learning disconnected manifolds, Sajjadi et al. (2018) illustrated the need for new measures that simultaneously evaluate the quality (Precision) and the diversity (Recall) of the generated samples. Kynkäänniemi et al. (2019) highlighted an important drawback of this PR metric: it cannot correctly interpret situations where large numbers of samples are packed together. To correct this, they propose an Improved PR metric based on the non-parametric estimation of manifolds.

Improved PR metric: Informally, for a generator $G$, precision ($\alpha_G$) quantifies the proportion of generated samples that can be approximated with true samples, while recall ($\beta_G$) measures the proportion of true samples that can be approximated with generated ones. Applying this to GANs, with target distribution $\mu$ and modeled distribution $G_\sharp\gamma$, the Improved PR metric was shown by Tanielian et al. (2020, Theorem 1) to be asymptotically equivalent to:

$\alpha_G^n \xrightarrow[n\to\infty]{} \alpha_G = G_\sharp\gamma(S_\mu)$ and $\beta_G^n \xrightarrow[n\to\infty]{} \beta_G = \mu(S_{G_\sharp\gamma})$, (1)

where $S_\mu$ denotes the support of $\mu$ and $S_{G_\sharp\gamma}$ the support of $G_\sharp\gamma$.

More recently, Naeem et al. (2020) have shown that the Improved PR metric (Kynkäänniemi et al., 2019) is not robust to outlier samples of either the target or the generated distribution. To correct this and fix the overestimation of the manifold around real outliers, Naeem et al. (2020) propose the Density/Coverage metric.

Density/Coverage: Instead of counting how many fake samples belong to a real sample neighborhood, density counts how many real sample neighborhoods contain a generated sample. Coverage, on the other hand, counts the number of real sample neighborhoods that contain at least one fake sample.

In the next section, we use the notions of precision and recall defined in (1). Using this definition allows us to circumvent the non-parametric estimators involved in the existing metrics (Kynkäänniemi et al., 2019; Naeem et al., 2020).
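To make the definition of precision in (1) concrete, the following minimal sketch estimates $\alpha_G = G_\sharp\gamma(S_\mu)$ by Monte Carlo on a toy one-dimensional problem. Everything in it is an illustrative assumption rather than a setting from our experiments: the target has two interval-shaped modes of radius r centred at -D/2 and +D/2, and the hypothetical generator `toy_generator` is the L-Lipschitz map that scales z by L and clips the result to [-D/2, D/2]. For this toy case the precision has the closed form erfc((D/2 - r) / (L sqrt(2))), which the sampler should approximately reproduce.

```python
import math
import random


def toy_generator(z, lipschitz, dist):
    """Hypothetical L-Lipschitz generator: scale z, then clip to [-D/2, D/2]."""
    half = dist / 2.0
    return max(-half, min(half, lipschitz * z))


def in_support(x, dist, radius):
    """True if x lies in one of the two toy modes centred at -D/2 and +D/2."""
    half = dist / 2.0
    return abs(x - half) <= radius or abs(x + half) <= radius


def estimate_precision(lipschitz, dist, radius, n_samples=100_000, seed=0):
    """Monte Carlo estimate of alpha_G = P(G(z) in S_mu) for z ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        in_support(toy_generator(rng.gauss(0.0, 1.0), lipschitz, dist), dist, radius)
        for _ in range(n_samples)
    )
    return hits / n_samples


def exact_precision(lipschitz, dist, radius):
    """Closed form for this toy case only: P(|z| >= (D/2 - r) / L)."""
    threshold = (dist / 2.0 - radius) / lipschitz
    return math.erfc(threshold / math.sqrt(2.0))
```

In this toy setting, increasing the Lipschitz constant shrinks the latent region mapped between the two modes, so the precision increases with L.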

3. DETERMINING OPTIMAL PRECISION IN GANS

We want to better understand the latent space of GANs and identify which GANs have the highest precision under specific constraints. GANs are push-forward distributions of a unimodal (connected) Gaussian distribution $\gamma$ by a continuous function $G$. Consequently, the modeled generative distribution $G_\sharp\gamma$ has connected support. When learning a target distribution $\mu$ supported on disconnected manifolds, GANs necessarily map fake data points out of the true manifold. This leads us to the following question: given that a generator samples data points in each of the distinct modes, what is its maximum precision?

To begin with, let us assume a target distribution $\mu$ composed of $m$ disconnected modes.

Assumption 1 (Disconnected manifolds). The target distribution $\mu$ lies on $m$ equally measured spheres $M_i$, $i \in [1, m]$, of radius $r$, each located at equal distance $D$ from the others (with $D/2 \gg r$).

Assumption 1 is reasonable. First, on many real-world datasets, data is well balanced between the different modes. Second, the equal distance assumption can be justified by the concentration of distances in high-dimensional spaces: centers of modes will be approximately equidistant (Beyer et al., 1999; Aggarwal et al., 2001). It has also been shown that embeddings of deep neural networks trained for classification tend to collapse around means that are equidistant from one another (Kothapalli et al., 2022). This could pave the way for a new analysis where the chosen distance is no longer the Euclidean distance in $\mathbb{R}^D$ but a distance in the feature space of the generator or of any pre-trained classifier. We also further discuss how this assumption could be relaxed.

For the rest of the paper, let us define the set of well-balanced generators, which map data points equally to the different modes of the data distribution:

Definition 1. A generator $G$ is well-balanced if $G_\sharp\gamma(M_1) = \dots = G_\sharp\gamma(M_m)$.
Considering well-balanced generators is also fair, since many empirical improvements such as WGAN-GP (Gulrajani et al., 2017) or BigBiGAN (Donahue and Simonyan, 2019) have significantly reduced mode collapse. GANs generate diverse output distributions on datasets such as CIFAR10, CIFAR100, and ImageNet. To validate the use of well-balanced generators, we ran a small experiment and evaluated the proportion of each class generated by GANs on MNIST and CIFAR10. On MNIST, the minimal proportion of a class is 9.2% and the maximal one is 10.9%, while they are respectively 8.3% and 11.9% on CIFAR10. The variance/mean ratio is equal to 0.03 for MNIST and 0.22 for CIFAR10.
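The balance statistics above can be computed along the following lines. This is a minimal sketch under stated assumptions: `class_balance` is a hypothetical helper, the labels are taken to be the classes that an oracle classifier assigns to generated samples, proportions are expressed in percent, and the variance is the population variance of the per-class proportions (the text does not specify which variance estimator was used).

```python
from collections import Counter
from statistics import mean, pvariance


def class_balance(labels, n_classes):
    """Return (min %, max %, variance/mean) of the per-class proportions.

    Assumption: proportions are in percent and the variance is the
    population variance taken over the n_classes proportions.
    """
    counts = Counter(labels)
    total = len(labels)
    props = [100.0 * counts.get(c, 0) / total for c in range(n_classes)]
    return min(props), max(props), pvariance(props) / mean(props)
```

A perfectly well-balanced generator gives identical proportions and therefore a variance/mean ratio of zero; the further the ratio is from zero, the less balanced the generated classes are.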

3.1. PRECISION AND THE ASSOCIATED PARTITION

Now that the prerequisites for both the data and the model have been given, we define our approach. We connect the set of generators from $\mathbb{R}^d$ to $\mathbb{R}^D$ with the set of partitions of the latent space. In particular, for each partition of $\mathbb{R}^d$, there exists a set of associated generators defined as follows:

Definition 2. For a given partition $\mathcal{A} = \{A_1, \dots, A_m\}$ of $\mathbb{R}^d$, we say that $G$ is associated with $\mathcal{A}$ if, for all $i \in [1, m]$ and all $z \in A_i$, $i = \arg\min_{j \in [1,m]} d(G(z), M_j)$, where $d(x, M_j) = \min_{y \in M_j} \|x - y\|$.

Each generator is thus associated with a unique partition of the latent space. Moreover, the geometry of the partition partly explains the generator's behavior and performance. We are interested in maximizing the precision of generative models. Any point in the intersection of two cells $A_i \cap A_j$, $(i, j) \in [1, m]^2$, is mapped at equal distance from $M_i$ and $M_j$, and thus its image does not belong to either of these modes (since $D/2 \gg r$). Besides, due to the generator's Lipschitzness, there is a small neighborhood of the boundary in which every point is mapped out of the target manifold. This region of the latent space thus reduces the precision. For a given $\varepsilon > 0$, we now define the epsilon-boundary of the partition $\mathcal{A}$ as follows:

Definition 3. For a given partition $\mathcal{A} = \{A_1, \dots, A_m\}$ of $\mathbb{R}^d$ and a given $\varepsilon \in \mathbb{R}_+$, we denote by $\partial^\varepsilon \mathcal{A}$ the epsilon-boundary of $\mathcal{A}$, defined as:

$\partial^\varepsilon \mathcal{A} = \bigcup_{i=1}^{m} \Big[ \big(\bigcup_{j \neq i} A_j\big)^{\varepsilon} \setminus \bigcup_{j \neq i} A_j \Big]$,

where $A^\varepsilon$ denotes the $\varepsilon$-extension of a set $A$.

To better understand the link between the precision $\alpha_G$ of a generator and its associated partition $\mathcal{A}$, we state the following lemma:

Lemma 1. Assume that Assumption 1 is satisfied and let $\mathcal{A}$ be a partition of $\mathbb{R}^d$. Then, any generator $G \in \mathcal{G}_L$ associated with $\mathcal{A}$ satisfies:

$\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A})$, where $\varepsilon_{\min} = D/L$.

Interestingly, this result holds independently of the partition $\mathcal{A}$. It highlights that the geometry of the partition gives an upper bound on the precision of the generator. Consequently, to determine this bound on the precision of generative models, one needs the measure of the epsilon-boundary $\partial^\varepsilon \mathcal{A}$. Furthermore, to exhibit generative models with optimal precision, one must look for partitions with the smallest epsilon-boundary measure $\gamma(\partial^\varepsilon \mathcal{A})$. This is tightly connected to the theoretical field of Gaussian isoperimetric inequalities.
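As a numerical illustration of Lemma 1, the sketch below estimates the Gaussian measure of the epsilon-boundary by Monte Carlo for one specific partition, and turns it into the corresponding upper bound on precision. The partition is an assumption made for convenience: the cells are $A_i = \{z : z_i \ge z_j \text{ for all } j\}$, i.e. the Voronoi cells of the equidistant points $e_1, \dots, e_m$ (first m canonical basis vectors, which requires $m \le d$). For these cells, the distance from a point of $A_i$ to the nearest other cell is the gap between its two largest coordinates divided by $\sqrt{2}$. The function names are hypothetical.

```python
import math
import random


def boundary_measure(n_modes, eps, n_samples=200_000, seed=0):
    """Monte Carlo estimate of gamma(eps-boundary) for the argmax partition.

    Assumption: cells are A_i = {z : z_i >= z_j for all j}. Only the first
    m coordinates of z matter, so we sample an m-dimensional Gaussian.
    Requires n_modes >= 2.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        z = sorted(rng.gauss(0.0, 1.0) for _ in range(n_modes))
        # Distance to the nearest other cell: gap between the two largest
        # coordinates, divided by sqrt(2).
        if (z[-1] - z[-2]) / math.sqrt(2.0) <= eps:
            inside += 1
    return inside / n_samples


def precision_upper_bound(n_modes, dist, lipschitz, **kwargs):
    """Lemma 1 bound 1 - gamma(eps_min-boundary), with eps_min = D / L."""
    return 1.0 - boundary_measure(n_modes, dist / lipschitz, **kwargs)
```

For m = 2 the partition is a pair of half-spaces and the boundary measure equals erf(eps / sqrt(2)), which gives a simple sanity check of the estimator. With eps fixed, the estimated boundary measure grows with the number of modes, so the bound on precision tightens.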

3.2. OPTIMALITY IN GANS

Isoperimetric inequalities link the measure of sets with their perimeters. More specifically, isoperimetric inequalities identify the minimizers of the perimeter for a fixed measure, e.g. the ball in a Euclidean space for a given Lebesgue measure. Gaussian isoperimetric inequalities study a similar problem in Gaussian space. Borell (1975) and Sudakov and Tsirel'son (1978) show that in a finite-dimensional Gaussian space, among all sets of a given measure, half-spaces have minimum Gaussian perimeter. More formally, for any Borel set $A$ in $\mathbb{R}^d$ and any half-space $H$, if $\gamma(A) \ge \gamma(H)$, then $\gamma(A^\varepsilon) \ge \gamma(H^\varepsilon)$ for any $\varepsilon > 0$, where $A^\varepsilon$ denotes the $\varepsilon$-extension of the set $A$.

The Gaussian multi-bubble conjecture was formulated when looking for the way to partition the Gaussian space into $m$ parts with the least Gaussian-weighted boundary. It was recently proved by Milman and Neeman (2022), who showed that the best way to split a Gaussian space $\mathbb{R}^d$ into $m$ clusters of equal measure, with $2 \le m \le d + 1$, is to use 'simplicial clusters' obtained as the Voronoi cells of $m$ equidistant points in $\mathbb{R}^d$. Convex geometry tells us that each cell is a convex cone whose borders are hyperplanes going through the origin of $\mathbb{R}^d$. We denote by $\mathcal{A}^\star$ any partition corresponding to this optimal configuration, see Figure 1.

The aim of the present paper is to leverage this result to better understand the behavior of GANs. We argue that when $m \le d + 1$, the models with optimal precision are closely linked to the optimal partitions $\mathcal{A}^\star$ derived in the Gaussian multi-bubble conjecture (Milman and Neeman, 2022). Besides, using results on the Gaussian boundary measure of those sets (Schechtman, 2012), we can also derive an upper bound on the maximal precision of generative models, as follows:

Theorem 1 (Upper-bounding the precision). Assume that Assumption 1 is satisfied and $m \le d + 1$. For any $\delta > 0$, if $L$ is large enough, then for any well-balanced generator $G \in \mathcal{G}_L$, we have:

$\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + \delta$. (2)
In particular, there exists $L$ with $L \ge D\sqrt{\log m}$ such that for any well-balanced generator $G \in \mathcal{G}_L$:

$\alpha_G \le 1 - \varepsilon_{\min} \sqrt{\log m}\; e^{-3/2}$, where $\varepsilon_{\min} = D/L$. (3)

Theorem 1 links the precision of well-balanced generators with the optimal partition $\mathcal{A}^\star$ from Milman and Neeman (2022). In particular, the result in (3) gives an interesting insight when training GANs on a finite number of modes. Tanielian et al. (2020, Theorem 3) showed a similar result, but for the asymptotic case where the number of modes increases:

$\alpha_G \underset{m \to \infty}{\le} e^{-\frac{1}{8}\varepsilon_{\min}^2}\, e^{-\varepsilon_{\min}\sqrt{\log(m)/2}}$. (4)

For $\varepsilon_{\min}\sqrt{\log m} = o(1)$, both (3) and (4) have the same behavior w.r.t. the number of modes. Now, to further show the usefulness of $\mathcal{A}^\star$, we prove the following theorem:

Theorem 2 (Lower-bounding the precision). Assume that Assumption 1 is satisfied and $m \le d + 1$. For any $\delta > 0$, there exist $C$ large enough, $L \ge D\sqrt{m}\sqrt{\pi \log(Cm)}$, and a well-balanced generator $G^\star \in \mathcal{G}_L$ associated with $\mathcal{A}^\star$ such that for any other well-balanced generator $G \in \mathcal{G}_L$, we have:

$\alpha_{G^\star} \ge \alpha_G - \delta$. (5)

Moreover, if $m \le d$:

$\alpha_{G^\star} \ge 1 - \varepsilon_{\max}\sqrt{\pi \log(Cm)}$, where $\varepsilon_{\max} = D\sqrt{m}/L$. (6)

Figure 2: An optimal generator maps a 2D latent space to a 2D output space with three modes ($M_1$, $M_2$, $M_3$). The latent space has an optimal 'simplicial cluster' geometry. In the latent space, all the $\varepsilon$-boundaries intersect each other in the gray circle, which is mapped in the output space into the convex hull of the three modes. (Panels: latent space with cells $A_1$, $A_2$, $A_3$ and the boundary $\partial^\varepsilon A_3$; output space with modes $M_1$, $M_2$, $M_3$.)

This theorem, whose proof is deferred to the Appendix, shows that the set of generators associated with $\mathcal{A}^\star$ contains optimal generators w.r.t. precision. More importantly, it shows that when $L$ is large enough, the bound in (3) may be tight, as it is almost reached by the optimal generators defined in Theorem 2. An example of such an optimal generator for the 2D case is given in Figure 2.
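To give a feel for the magnitude of the upper bound (3), the following minimal sketch evaluates it numerically. It assumes the reading $\alpha_G \le 1 - \varepsilon_{\min}\sqrt{\log m}\,e^{-3/2}$ with $\varepsilon_{\min} = D/L$ and the condition $L \ge D\sqrt{\log m}$, which is consistent with the $\sqrt{\log m}$ rate announced in the abstract; the function name and its arguments are hypothetical.

```python
import math


def theorem1_upper_bound(n_modes, dist, lipschitz):
    """Evaluate 1 - eps_min * sqrt(log m) * exp(-3/2), with eps_min = D / L.

    Assumption: the bound is only stated for L >= D * sqrt(log m), so the
    function refuses to evaluate it outside that regime.
    """
    eps_min = dist / lipschitz
    root_log_m = math.sqrt(math.log(n_modes))
    if lipschitz < dist * root_log_m:
        raise ValueError("bound (3) is stated for L >= D * sqrt(log m)")
    return 1.0 - eps_min * root_log_m * math.exp(-1.5)
```

For instance, with D = 1 and L = 10, going from m = 10 to m = 100 modes lowers the bound from roughly 0.966 to roughly 0.952, which illustrates how slowly the $\sqrt{\log m}$ rate degrades the attainable precision.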
This specific generator memorizes the dataset, since all samples are mapped to one of the mode centers $M_i$, $i \in [1, m]$, except for those in the $\varepsilon$-boundaries. It is not clear, however, whether those are the only generators with optimal precision. We see that when the number of modes is less than the number of dimensions of the latent space, the only factor that impacts the precision is the number of modes.

What if modes are not equally distant? This assumption is needed for the definition of a well-balanced generator such as the one proposed in Figure 2. In $\mathbb{R}^2$ for example, if there is no assumption at all on the location of the modes, there might not be any well-balanced generator associated with the optimal partition $\mathcal{A}^\star$. As shown in Figure 3

properties than the minimizers of the Gaussian isoperimetric inequality (Milman and Neeman, 2022), and compute a series of experiments on the latent space of GANs to better understand its properties. In all the following experiments, we train WGANs with gradient penalty (Arjovsky et al., 2017; Gulrajani et al., 2017). For mixtures of Gaussians, the generator and discriminator are MLP networks. For MNIST (LeCun et al., 1998), both the generator and discriminator are standard convolutional architectures. On the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009), we use a ResNet-based (He et al., 2016) convolutional architecture with self-modulation in the generator (Chen et al., 2018b), and the transformer-based architecture from Jiang et al. (2021). To evaluate the performance of GANs, we use the precision (Kynkäänniemi et al., 2019), the FID (Heusel et al., 2017), and the density/coverage (Naeem et al., 2020). As recommended by recent works (Naeem et al., 2020; Kynkäänniemi et al., 2022), we use a dataset-specific classifier to extract image features instead of an ImageNet pre-trained classifier, and thus refer to the FID as FD for Fréchet Distance.
Implementation details are given in the Appendix and code is provided in the Supplementary Material.

4.1. LINEAR SEPARABILITY AND CONVEXITY

Milman and Neeman (2022) show that the optimal configuration in the latent space is obtained as the Voronoi cells of $m$ equidistant points in $\mathbb{R}^d$, if $m \le d + 1$. This means that if GANs reach this optimal configuration, each of the cells must be a convex polytope and thus verify the following properties: 1) each cell has 'flat' sides and is bounded exclusively by faces; 2) each cell is convex. In the following experiments, we study whether GANs' latent spaces feature these two properties.

Are classes linearly separable in the latent space of GANs? To verify this, we leverage a labeled dataset and investigate whether a simple linear model (e.g., multinomial logistic regression) can map from latent space to label space. If cells in the latent space are bounded by hyperplanes, the linear model is expected to be a good predictor of a generated sample's label. We consider a labeled dataset of samples with a fixed number of classes. $G_\theta$ is a pre-trained generator, and $C_\phi$ a pre-trained classifier considered as an oracle. Using $G_\theta$ and $C_\phi$, we construct a dataset of latent vectors $z \in \mathbb{R}^d$ and their associated labels $y = C_\phi(G_\theta(z))$. On CIFAR-10/100, similarly to Razavi et al. (2019), only data points classified with a confidence above a threshold are accepted. This dataset is then split into 100k training points and 10k test points. The mapping from latent vectors $z$ to their labels $y$ is learned by a multinomial logistic regression. We report the test-set results in Table 1 under the column LogReg Accuracy. This accuracy reaches 90% on MNIST and 70% on CIFAR-10. Interestingly, there is also a correlation between the linear separability of the latent space and the precision metric, which is consistent with the optimality of the simplicial cluster partition.

Are classes convex in the latent space of GANs?
In this experiment, we draw two random latent vectors $z_0$ and $z_1$ that belong to the same class. Then, we generate linear interpolations $z_\varepsilon = \varepsilon z_0 + (1 - \varepsilon) z_1$ with $\varepsilon \in [0, 1]$ and verify whether these new samples belong to the same class as $z_0$ and $z_1$, i.e. whether $C_\phi(G_\theta(z_\varepsilon))$ equals $C_\phi(G_\theta(z_0))$. We report the mean accuracy of this experiment in Table 1 under the column Convex Accuracy. Again, the higher the precision, the 'more convex' each cell in the latent space seems to be. For a qualitative evaluation, we show this phenomenon in Figure 5 and stress that linear interpolations preserve the image class.

Table 1: [...] 2) if each cell of the latent space is convex (Convex Accuracy). In line with Theorem 1, the higher the precision is, the more each cell in the latent space is linearly separable and convex.

Impact of the latent space dimension. Now, we evaluate the impact of the latent space dimension on the MNIST, CIFAR10 and CIFAR100 datasets. To do so, we vary the number of latent dimensions $d$ from 2 to 128 for each dataset. In Figure 6, we exhibit two phases in the performance of GANs when changing the number of latent dimensions. First, when $d$ is below a dataset-specific threshold, the precision of the model falls when the number of latent dimensions is reduced. Second, when $d$ is above this threshold, the precision becomes constant, and increasing the number of latent dimensions does not bring any apparent improvement. This threshold is a function of the complexity of the dataset and its number of modes. We observe from Figure 6 that the more complex the dataset, the larger the latent space required to reach high precision levels. This is coherent with the theoretical results, where the precision decreases w.r.t. the number of modes in $\sqrt{\log m}$ when $m \le d + 1$.

Disentangling the manifold dimension. As discussed in Roth et al. (2017), two different problems can arise when training GANs: i) dimensional misspecification, where the true and modeled distributions do not have density functions w.r.t. the same base measure, and ii) density misspecification, where GANs try to fit a disconnected manifold with a unimodal distribution. To isolate the density misspecification studied in the current paper, we train a conditional GAN with a low-dimensional latent space $\mathbb{R}^d$ (e.g. $\mathbb{R}^5$ in our setting), so that the dimension of the generated manifold is at most 5. We then collect a dataset of synthetic generated samples, Synthetic CIFAR-10, and train unconditional GANs on it while varying the dimension of the latent space. Figure 6 shows that both Synthetic CIFAR-10 and standard CIFAR-10 converge to the same limits for FD, Precision and Density, showing that with large latent space dimensions, density misspecification seems to be the main issue to cope with. A synthetic experiment showing the importance of density misspecification over dimensional misspecification is available in the Appendix.

Impact of overparametrization. Balaji et al. (2020) already showed the importance of GANs' overparametrization for both their convergence and performance. Knowing that, we study whether overparametrization can help GANs obtain the optimal geometry of the latent space. In Table 2, we vary the width of ResNet generators and highlight that overparametrized GANs better fit the target distribution. More importantly, we observe that overparametrization helps achieve better linear separability of the latent space, as shown by LogReg Accuracy.

Table 2: Overparametrization study: for a latent dimension equal to 64, we vary the width of the generator (confidence intervals computed on 10 checkpoints). Increasing the capacity of GANs tends to structure their latent space in simplicial clusters (better LogReg accuracy) and to improve their performance on precision, density and coverage.
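The convexity check of Section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for our experiments: `label_fn` stands in for the composition of the oracle classifier with the generator, and the two toy oracles `argmax_oracle` (convex conic cells, as in a simplicial cluster) and `shell_oracle` (one deliberately non-convex class) are hypothetical. Pairs of latent vectors are drawn from the Gaussian prior and kept only when both fall in the same class.

```python
import random


def convex_accuracy(label_fn, dim, n_pairs=500, n_steps=9, seed=0):
    """Fraction of interpolated latent points that keep the endpoints' class."""
    rng = random.Random(seed)
    hits = total = pairs = 0
    while pairs < n_pairs:
        z0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        label = label_fn(z0)
        if label_fn(z1) != label:
            continue  # keep only pairs drawn from the same class
        pairs += 1
        for k in range(1, n_steps + 1):
            t = k / (n_steps + 1)
            zt = [t * a + (1.0 - t) * b for a, b in zip(z0, z1)]
            hits += label_fn(zt) == label
            total += 1
    return hits / total


def argmax_oracle(z, n_classes=3):
    """Hypothetical oracle with convex conic cells {z : z_i >= z_j}."""
    return max(range(n_classes), key=lambda i: z[i])


def shell_oracle(z):
    """Hypothetical oracle whose class 0, {|z_0| > 1}, is not convex."""
    return 0 if abs(z[0]) > 1.0 else 1
```

With convex cells, every interpolation stays in the class of its endpoints, so the score is 1; the non-convex oracle scores strictly lower because segments joining the two halves of class 0 cross class 1.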

5. CONCLUSION

This paper takes a step toward a better understanding of GANs learning disconnected distributions. When the latent space dimension is large enough, we present an optimal latent space geometry for GANs: 'simplicial clusters', a Voronoi partition where each cell is a convex cone. We further show experimentally that GANs with sufficient latent capacity tend to respect this optimal geometry. We believe that our analysis can foster exciting research on GANs, with both theoretical and practical impacts. For example, understanding the geometry of the optimal latent space could help design semi-supervised or transfer algorithms from GANs. It could also inspire new neural architectures with a bias toward this 'simplicial cluster' partitioning of the latent space. Finally, let us note that our results could potentially be extended to other types of generative models with a Gaussian latent space and would thus allow a better understanding of their properties. To cover variational auto-encoders or diffusion models, one would need to adapt our results to a stochastic generator. This could be an exciting follow-up of our work.

Limitations. We showed the existence of optimal generators and showed experimentally that overparametrization plays a key role. However, a limitation of our work is that we could not prove their uniqueness. This is linked to characterizing the partitions with the lowest $\varepsilon$-boundary measure in Gaussian space, which is a complex open problem. A second limitation is that the derived optimal generators are not valid in the case $m > d + 1$, because the minimizers of Gaussian isoperimetric inequalities are not known in this case.

Potential negative societal impacts. This work is mainly about understanding the behavior of deep generative models. Thus, it may lead to practical improvements in this technology and increase its potential negative impacts, such as deepfakes (Fallis, 2020).
Generator G(z)
z ∼ N(0, Id) | 128
Fully Connected | 4 × 4 × 128 | -
ResBlock | [3 × 3] × 2 | 1 × 1 | 4 × 4 × 128 | Y | ReLU
Nearest Up Sample | 8 × 8 × 128 | -
ResBlock | [3 × 3] × 2 | 1 × 1 | 8 × 8 × 128 | Y | ReLU
Nearest Up Sample | 16 × 16 × 128 | -
ResBlock | [3 × 3] × 2 | 1 × 1 | 16 × 16 × 128 | Y | ReLU
Nearest Up Sample | 32 × 32 × 128 | -
Convolution | 3 × 3 | 1 × 1 | 32 × 32 × 3 | - | Tanh

Discriminator D(x)
32 × 32 × 3
ResBlock | [3 × 3] × 2 | 1 × 1 | 32 × 32 × 256 | - | ReLU
AvgPool | 2 × 2 | 1 × 1 | 16 × 16 × 256 | -
ResBlock | [3 × 3] × 2 | 1 × 1 | 16 × 16 × 256 | - | ReLU
AvgPool | 2 × 2 | 1 × 1 | 8 × 8 × 256 | -
ResBlock | [3 × 3] × 2 | 1 × 1 | 8 × 8 × 256 | - | ReLU
ResBlock | [3 × 3] × 2 | 1 × 1 | 8 × 8 ×

B TECHNICAL RESULTS

B.1 PROOF OF LEMMA 1

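We want to show that any generator $G \in \mathcal{G}_L^{\mathcal{A}}$ (the set of $L$-Lipschitz generators associated with $\mathcal{A}$) is such that $\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A})$, where

$\partial^{\varepsilon_{\min}} \mathcal{A} = \bigcup_{i=1}^{m} \Big[ \big(\bigcup_{j \neq i} A_j\big)^{\varepsilon_{\min}} \setminus \bigcup_{j \neq i} A_j \Big]$.

Proof by contradiction. Assume a generator $G$ such that there exist $z \in \partial^{\varepsilon_{\min}} \mathcal{A}$ and $i \in [1, m]$ such that $G(z) \in M_i$. Since $G$ is associated with $\mathcal{A}$, Definition 2 gives a point $z'$ and an index $j \in [1, m]$, $j \neq i$, such that $\|z - z'\| < \varepsilon_{\min}/2$ and $j = \arg\min_{k \in [1,m]} d(G(z'), M_k)$. Thus, we have:

$\|G(z) - G(z')\| \ge d(G(z'), M_i) \ge d(M_i, M_j)/2 \ge D_{\min}/2$.

And

$\frac{\|G(z) - G(z')\|}{\|z - z'\|} > \frac{D_{\min}}{\varepsilon_{\min}}$, and hence $> L$.

This contradicts $G$ being in $\mathcal{G}_L^{\mathcal{A}}$.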

B.2 PROOF OF THEOREM 1

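We prove that, for $m \le d + 1$ and any $\delta > 0$, if $L$ is large enough, then any well-balanced $G \in \mathcal{G}_L$ satisfies $\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + \delta$.

Let $G$ be a well-balanced generator and $\mathcal{A}$ the partition associated with $G$. Let us first define the Gaussian boundary measure $P_\gamma$ of a partition $\mathcal{A}$ of $\mathbb{R}^d$. For partitions with smooth boundaries, it coincides with the $(d - 1)$-dimensional Gaussian measure of the boundary, defined as follows:

$P_\gamma(\mathcal{A}) = \liminf_{\varepsilon \to 0} \frac{\gamma(\partial^\varepsilon \mathcal{A}) - \gamma(\mathcal{A})}{\sqrt{2/\pi}\,\varepsilon}$.

Moreover, for sets with smooth boundaries, we have from Federer (1969, Theorem 3.2.29):

$\liminf_{\varepsilon \to 0} \frac{\gamma(\partial^\varepsilon \mathcal{A}) - \gamma(\mathcal{A})}{\sqrt{2/\pi}\,\varepsilon} = \lim_{\varepsilon \to 0} \frac{\gamma(\partial^\varepsilon \mathcal{A}) - \gamma(\mathcal{A})}{\sqrt{2/\pi}\,\varepsilon}$.

Let us denote by $\mathcal{A}^\star$ the optimal partition defined in Milman and Neeman (2022), based on simplicial clusters. $\mathcal{A}^\star$ is a standard partition where $\gamma(A^\star_1) = \dots = \gamma(A^\star_m)$ and $\sum_i \gamma(A^\star_i) = 1$. By the multi-bubble theorem (Milman and Neeman, 2022), simplicial clusters (such as $\mathcal{A}^\star$) are the unique minimizers of the Gaussian isoperimetric problem, thus:

$P_\gamma(\mathcal{A}) \ge P_\gamma(\mathcal{A}^\star)$,
$\lim_{\varepsilon \to 0} \frac{\gamma(\partial^\varepsilon \mathcal{A})}{\varepsilon} \ge \lim_{\varepsilon \to 0} \frac{\gamma(\partial^\varepsilon \mathcal{A}^\star)}{\varepsilon}$,
$L_{\mathcal{A}} \ge L_{\mathcal{A}^\star}$,

where $L_{\mathcal{A}} = \lim_{\varepsilon \to 0} \gamma(\partial^\varepsilon \mathcal{A})/\varepsilon$ and $L_{\mathcal{A}^\star} = \lim_{\varepsilon \to 0} \gamma(\partial^\varepsilon \mathcal{A}^\star)/\varepsilon$.

Then, for any $\delta > 0$, there exists $\varepsilon' > 0$ such that, for any $\varepsilon < \varepsilon'$,

$\left| \frac{\gamma(\partial^\varepsilon \mathcal{A})}{\varepsilon} - L_{\mathcal{A}} \right| < \delta$, $\left| \frac{\gamma(\partial^\varepsilon \mathcal{A}^\star)}{\varepsilon} - L_{\mathcal{A}^\star} \right| < \delta$ and $L_{\mathcal{A}} \ge L_{\mathcal{A}^\star}$.

Thus, for any $\delta > 0$, there exists $\varepsilon' > 0$ such that, for any $\varepsilon < \varepsilon'$,

$\gamma(\partial^\varepsilon \mathcal{A}^\star) \le \gamma(\partial^\varepsilon \mathcal{A}) + 2\delta\varepsilon$. (7)

Besides, we know from Lemma 1 that $\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A})$. Consequently, using (7), we have:

$\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}) \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + 2\delta\varepsilon_{\min}$.

We conclude by choosing $L$ large enough that $\varepsilon_{\min} = D/L$ is strictly smaller than $\varepsilon'$, so that (7) applies, and that the additional term $2\delta\varepsilon_{\min}$ is below the desired tolerance.

Second part of Theorem 1. Let $L$, $D$ be such that $L \ge D\sqrt{\log m}$. Let us prove that for any well-balanced generator $G \in \mathcal{G}_L$, we have:

$\alpha_G \le 1 - \varepsilon_{\min}\sqrt{\log m}\; e^{-3/2}$.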
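Using the method from Schechtman (2012), the measure of the border of cell $i$ satisfies:

$$
\begin{aligned}
\gamma\Big( \big(\cup_{j \neq i} A_j\big)^{\varepsilon} \setminus \cup_{j \neq i} A_j \Big)
&\ge \frac{1}{\sqrt{2\pi}} \int_{t}^{t+\varepsilon} e^{-s^2/2}\, ds, \quad \text{where } t \text{ is such that } \frac{1}{\sqrt{2\pi}} \int_{t}^{\infty} e^{-s^2/2}\, ds = 1/m, \\
&\ge \frac{\varepsilon}{\sqrt{2\pi}}\, e^{-(t+\varepsilon)^2/2}, \\
&\ge \frac{\varepsilon \sqrt{\log m}}{m}\, e^{-\varepsilon t - \varepsilon^2/2} \quad (\text{using } \sqrt{\log m} \le t \le \sqrt{2 \log m}), \\
&\ge \frac{\varepsilon \sqrt{\log m}}{m}\, e^{-\varepsilon \sqrt{\log m} - \varepsilon^2/2}.
\end{aligned}
$$

Thus:

$$\gamma(\partial^{\varepsilon_{\min}} \mathcal{A}) = \sum_{i=1}^{m} \gamma\Big( \big(\cup_{j \neq i} A_j\big)^{\varepsilon_{\min}} \setminus \cup_{j \neq i} A_j \Big) \ge \varepsilon_{\min} \sqrt{\log m}\; e^{-\varepsilon_{\min} \sqrt{\log m} - \varepsilon_{\min}^2/2}.$$

Thus, we have

$$\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}) \le 1 - \varepsilon_{\min} \sqrt{\log m}\; e^{-\varepsilon_{\min} \sqrt{\log m} - \varepsilon_{\min}^2/2}.$$

Moreover, using $\varepsilon_{\min} = D/L$ and $L \ge D\sqrt{\log m}$, we get $\varepsilon_{\min} \sqrt{\log m} \le 1$, and therefore:

$$\alpha_G \le 1 - \varepsilon_{\min} \sqrt{\log m}\; e^{-3/2}.$$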
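As a numerical sanity check of the second part of Theorem 1, the following minimal sketch compares the half-space slab value used in the chain of inequalities above with the closed-form bound $1 - \varepsilon \sqrt{\log m}\, e^{-3/2}$. It only relies on the Python standard library; the function names are hypothetical and not taken from the paper's code, and the first quantity simply plugs in the slab value as if the isoperimetric lower bound were attained.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian on the real line


def halfspace_threshold(m: int) -> float:
    """Return t such that the Gaussian tail beyond t has mass 1/m."""
    return _STD_NORMAL.inv_cdf(1.0 - 1.0 / m)


def slab_measure(m: int, eps: float) -> float:
    """Gaussian mass of [t, t + eps], the half-space value bounding one cell border."""
    t = halfspace_threshold(m)
    return _STD_NORMAL.cdf(t + eps) - _STD_NORMAL.cdf(t)


def precision_bound_from_slabs(m: int, eps: float) -> float:
    """Bound 1 - m * slab_measure obtained by summing the m cell borders."""
    return 1.0 - m * slab_measure(m, eps)


def precision_bound_closed_form(m: int, eps: float) -> float:
    """Closed-form bound 1 - eps * sqrt(log m) * exp(-3/2)."""
    return 1.0 - eps * math.sqrt(math.log(m)) * math.exp(-1.5)


if __name__ == "__main__":
    for m in (3, 10, 100, 1000):
        # Illustrative choice of eps satisfying eps * sqrt(log m) <= 1.
        eps = 0.5 / math.sqrt(math.log(m))
        print(m, precision_bound_from_slabs(m, eps), precision_bound_closed_form(m, eps))
```

In this sketch the slab-based value is always below the closed-form one, which is the direction expected from the derivation: the closed form is a looser but simpler upper bound on the precision.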

B.3 PROOF OF THEOREM 2

For a given partition $\mathcal{A}$ and a target distribution $\mu$ with $m$ disconnected components $M_i$, $i \in [1, m]$, we define $X_i$, $i \in [1, m]$, a set of sampled data points such that $X_i \in M_i$ for all $i \in [1, m]$. Now, for $\varepsilon > 0$, we define the generative model $G_\varepsilon$ by:

$$G_\varepsilon(z) = \sum_{i \in S_z} w_i(z)\, X_i, \quad \text{with} \quad w_i(z) = \frac{d(z, (A_i^{\varepsilon})^{c})}{\sum_{j \in S_z} d(z, (A_j^{\varepsilon})^{c})}, \tag{8}$$

where $d(z, A) = \min_{a \in A} \|z - a\|$, and $S_z = \{ j \in [1, m] \text{ such that } z \in A_j^{\varepsilon} \}$ denotes the set of cell extensions the point $z$ belongs to.

We can see that $G_\varepsilon \sharp \gamma$ memorizes the dataset, since every $z$ close to the center of a cell $A_i$, i.e. such that $|S_z| = 1$, satisfies $G_\varepsilon(z) = X_i$. To be more precise, all samples are mapped to one of the centers of the modes $X_i$, $i \in [1, m]$, except for those in $\varepsilon$-boundaries. When $z$ belongs to the intersection of two $\varepsilon$-boundaries, $G_\varepsilon(z)$ is a simple linear combination of 2 points. It is only when $|S_z| \ge 3$ that more complex samples are generated. A simple illustration of $G_\varepsilon$ for $d = 2$ and $m = 3$ is given in Figure 2. Interestingly, one can also show that the image of $G_\varepsilon$ is equal to the convex hull of the Diracs $X_i$, $i \in [1, m]$. In particular, there exists a neighborhood $\nu$ of 0 such that $G_\varepsilon(\nu)$ is equal to the whole convex hull of the points $X_i$, $i \in [1, m]$.

Proof that $G_\varepsilon$ is well-balanced. We recall that a generator is well-balanced if $G \sharp \gamma(M_1) = \dots = G \sharp \gamma(M_m)$. By construction (8), we have that for any $i \in [1, m]$:

$$\|G_\varepsilon(z) - X_i\| = \Big\| \sum_{k \neq i} w_k (X_k - X_i) \Big\| = D \times (1 - w_i).$$

So, for any $z \in A_i$, the distance to $X_i$ is smallest when the weight $w_i$ is largest, and we have that

$$i = \arg\max_{j \in [1,m]} w_j = \arg\min_{j \in [1,m]} \|G_\varepsilon(z) - X_j\|.$$

Thus $G_\varepsilon$ is associated with the optimal partition $\mathcal{A}^\star$. Besides, for a given radius $r$ of the different modes, since everything is symmetrical, we have that

$$\gamma(\{z \in \mathbb{R}^d, \|G_\varepsilon(z) - X_1\| \le r\}) = \dots = \gamma(\{z \in \mathbb{R}^d, \|G_\varepsilon(z) - X_m\| \le r\}).$$

Thus, the generator is well-balanced.

Proof that $G_{\varepsilon_{\max}}$ is in $\mathcal{G}_L$. To begin with, we have $\mathbb{R}^d = \partial^{\varepsilon} \mathcal{A}^\star \cup \big(\mathbb{R}^d \setminus \partial^{\varepsilon} \mathcal{A}^\star\big)$.
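First,

$$\mathbb{R}^d \setminus \partial^{\varepsilon} \mathcal{A}^\star = \bigcup_{i=1}^{m} A_i \setminus \Big( \big(\cup_{j \neq i} A_j\big)^{\varepsilon} \cap A_i \Big),$$

and on each of these sets, $i \in [1, m]$, $G_\varepsilon$ is constant and thus $L$-Lipschitz. Now,

$$\partial^{\varepsilon} \mathcal{A}^\star = \bigcup_{i=1}^{m} \Big( \big(\cup_{j \neq i} A_j\big)^{\varepsilon} \setminus \cup_{j \neq i} A_j \Big) = \bigcup_{\substack{S \in \mathcal{P}([1,m]) \\ \mathrm{card}(S) \ge 2}} \ \bigcap_{i \in S} A_i^{\varepsilon}.$$

Now, let $S \in \mathcal{P}([1, m])$ with $\mathrm{card}(S) = k \ge 2$, and let $z, z'$ be two points of $\bigcap_{i \in S} A_i^{\varepsilon}$. Let $\alpha = (\alpha_1, \dots, \alpha_m)$ and $\beta = (\beta_1, \dots, \beta_m)$ be two vectors, both in $\mathbb{R}^m$, such that for all $i \in [1, m]$:

$$\alpha_i = \frac{d(z, (A_i^{\varepsilon})^{c})}{\sum_{j \in S_z} d(z, (A_j^{\varepsilon})^{c})} \quad \text{and} \quad \beta_i = \frac{d(z', (A_i^{\varepsilon})^{c})}{\sum_{j \in S_{z'}} d(z', (A_j^{\varepsilon})^{c})}.$$

We have that

$$
\begin{aligned}
\|G_\varepsilon(z) - G_\varepsilon(z')\| &= \Big\| \big(1 - \sum_{i \neq 1} \alpha_i\big) X_1 - \big(1 - \sum_{i \neq 1} \beta_i\big) X_1 + \sum_{i \neq 1} \alpha_i X_i - \sum_{i \neq 1} \beta_i X_i \Big\| \\
&= \Big\| \sum_{i \neq 1} (\alpha_i - \beta_i)(X_1 - X_i) \Big\| \\
&\le \max_{(i,j) \in [1,m]^2} \|X_i - X_j\| \; \|\alpha - \beta\| \\
&= \max_{(i,j) \in [1,m]^2} \|X_i - X_j\| \; \|h(z) - h(z')\|,
\end{aligned}
$$

where $h$ is the function from $\mathbb{R}^d \to \mathbb{R}^m$ defined as:

$$h(z) = \Big( \frac{d(z, (A_1^{\varepsilon})^{c})}{\sum_{i \in S_z} d(z, (A_i^{\varepsilon})^{c})}, \dots, \frac{d(z, (A_m^{\varepsilon})^{c})}{\sum_{i \in S_z} d(z, (A_i^{\varepsilon})^{c})} \Big).$$

We can write $h = g \circ f$, with $f$ the function defined from $\mathbb{R}^d \to \mathbb{R}^m$ by $f(z) = \big( d(z, (A_1^{\varepsilon})^{c}), \dots, d(z, (A_m^{\varepsilon})^{c}) \big)$, and $g$ the function defined on $\mathbb{R}^m \setminus \{0\}$ by $g(z) = z / \|z\|_1$. The function $f$ is $\sqrt{m}$-Lipschitz (given that each coordinate $z \mapsto d(z, (A_i^{\varepsilon})^{c})$ is 1-Lipschitz). Besides, we know that outside the ball $B_\varepsilon(0)$, the function $g$ is $(1/\varepsilon)$-Lipschitz. Using the convexity of the function $z \mapsto \sum_{j \in S_z} d(z, (A_j^{\varepsilon})^{c})$ (as a sum of convex functions), we can show that for all $z$ we have $\|f(z)\| \ge (m - 1)\varepsilon$, so that $f(z)$ is not in $B_\varepsilon(0)$. Finally, the function $h$ is $\frac{\sqrt{m}}{\varepsilon}$-Lipschitz. Thus, we have that:

$$\|G_\varepsilon(z) - G_\varepsilon(z')\| \le \frac{D \sqrt{m}}{\varepsilon} \|z - z'\|, \quad \text{with } D = \|X_i - X_j\|, \ (i, j) \in [1, m]^2, \ i \neq j.$$

Consequently, by noting $\varepsilon_{\max} = \frac{D \sqrt{m}}{L}$, we have:

$$\|G_{\varepsilon_{\max}}(z) - G_{\varepsilon_{\max}}(z')\| \le L \|z - z'\|.$$

We can now conclude on the Lipschitzness of $G_{\varepsilon_{\max}}$ on $\mathbb{R}^d$.

We now prove that, for $m \le d + 1$ and any $\delta > 0$, if $L$ is large enough, then for any well-balanced $G \in \mathcal{G}_L$ we have $\alpha_{G_{\varepsilon_{\max}}} \ge \alpha_G - \delta$. From the proof of Theorem 1, we have that for any $\delta' > 0$, there exists $\varepsilon_{\min}$ such that, using (7):

$$\alpha_G \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}) \le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + 2\delta' \varepsilon_{\min}.$$

Now, by construction of $G_{\varepsilon_{\max}}$, we have that $\alpha_{G_{\varepsilon_{\max}}} \ge 1 - \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star)$.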
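Consequently,

$$
\begin{aligned}
\alpha_G &\le 1 - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + 2\delta' \varepsilon_{\max} + \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star) - \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star) \\
&\le \alpha_{G_{\varepsilon_{\max}}} + 2\delta' \varepsilon_{\max} + \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star) - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) \\
&= \alpha_{G_{\varepsilon_{\max}}} + 2\delta' \varepsilon_{\max} + \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star) - 2 L_{\mathcal{A}^\star} \varepsilon_{\max} - \gamma(\partial^{\varepsilon_{\min}} \mathcal{A}^\star) + 2 L_{\mathcal{A}^\star} \varepsilon_{\min} + 2 L_{\mathcal{A}^\star} (\varepsilon_{\max} - \varepsilon_{\min}) \\
&\le \alpha_{G_{\varepsilon_{\max}}} + 4\delta' \varepsilon_{\max} + 2 L_{\mathcal{A}^\star} \varepsilon_{\max} \\
&= \alpha_{G_{\varepsilon_{\max}}} + \varepsilon_{\max} (4\delta' + 2 L_{\mathcal{A}^\star}).
\end{aligned}
$$

We conclude by choosing $L$ big enough such that $\varepsilon_{\max}$ is strictly smaller than $\frac{\delta}{4\delta' + 2 L_{\mathcal{A}^\star}}$.

Proving the second part of Theorem 2. The precision of $G_{\varepsilon_{\max}}$ is thus such that:

$$\alpha_{G_{\varepsilon_{\max}}} \ge 1 - \gamma(\partial^{\varepsilon_{\max}} \mathcal{A}^\star).$$

However, since $\partial^{\varepsilon} \mathcal{A}^\star \subset \bigcup_{i=1}^{m} A_i^{\varepsilon}$, we have that for any $\varepsilon$:

$$\gamma(\partial^{\varepsilon} \mathcal{A}^\star) \le \sum_{i=1}^{m} \gamma(A_i^{\varepsilon}).$$

Using results from Schechtman (2012, Proposition 1), when $m \le d$, there exists $C$ large enough such that

$$\gamma(A_i^{\varepsilon_{\max}}) \le \frac{\varepsilon_{\max}}{m} \sqrt{\pi \log(Cm)}.$$

Thus, we have

$$\alpha_{G_{\varepsilon_{\max}}} \ge 1 - \varepsilon_{\max} \sqrt{\pi \log(Cm)}.$$

To have $\alpha_{G_{\varepsilon_{\max}}} \ge 0$, we must have $\varepsilon_{\max} \le 1/\sqrt{\pi \log(Cm)}$. This is the case since we have $\varepsilon_{\max} = D\sqrt{m}/L$ and $L \ge D \sqrt{m} \sqrt{\pi \log(Cm)}$.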
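To make the construction (8) concrete, here is a minimal sketch of $G_\varepsilon$ in the simplest setting $d = 1$, $m = 2$, where the two cells are the half-lines $A_1 = (-\infty, 0]$ and $A_2 = [0, +\infty)$. This toy setting, the function names and the sample points are assumptions made for illustration and do not come from the paper's code. In this case the distance to the complement of an $\varepsilon$-extension has a closed form, $G_\varepsilon$ interpolates linearly between $X_1$ and $X_2$ on $[-\varepsilon, \varepsilon]$, and, if the modes are reduced to the points $X_1$ and $X_2$, the precision equals $1 - \gamma([-\varepsilon, \varepsilon])$.

```python
import math
from statistics import NormalDist


def dist_to_complement_of_extension(z: float, sign: int, eps: float) -> float:
    """Distance from z to the complement of the eps-extension of a half-line cell.

    sign = -1 stands for A_1 = (-inf, 0], whose eps-extension is (-inf, eps].
    sign = +1 stands for A_2 = [0, +inf), whose eps-extension is [-eps, +inf).
    """
    return max(0.0, eps + sign * z)


def generator(z: float, x1, x2, eps: float):
    """Toy G_eps: weighted combination of x1 and x2 with the weights of (8)."""
    d1 = dist_to_complement_of_extension(z, -1, eps)
    d2 = dist_to_complement_of_extension(z, +1, eps)
    total = d1 + d2  # strictly positive whenever eps > 0
    w1, w2 = d1 / total, d2 / total
    return tuple(w1 * a + w2 * b for a, b in zip(x1, x2))


def toy_precision(eps: float) -> float:
    """Gaussian mass of latent points mapped exactly onto x1 or x2."""
    std_normal = NormalDist()
    return 1.0 - (std_normal.cdf(eps) - std_normal.cdf(-eps))


if __name__ == "__main__":
    x1, x2 = (0.0, 0.0), (2.0, 0.0)  # hypothetical data points, D = 2
    for z in (-1.0, -0.05, 0.0, 0.05, 1.0):
        print(z, generator(z, x1, x2, eps=0.1))
    print("precision:", toy_precision(0.1))
```

In this one-dimensional case the generator is Lipschitz with constant $D/(2\varepsilon)$, which is compatible with the more general bound $D\sqrt{m}/\varepsilon$ derived above. Shrinking $\varepsilon$ raises the precision but also raises the Lipschitz constant, which is the trade-off behind the choice $\varepsilon_{\max} = D\sqrt{m}/L$.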

C ADDITIONAL RESULT

Impact of the number of modes. To illustrate the results from Theorem 1 and Theorem 2, we vary the number of modes of the data distribution. On real-world data, the number of modes is fixed but usually unknown, and removing or adding classes as a proxy for modes usually does not give insightful results, since some classes can be much more complex than others. We thus use a synthetic setting, where we can easily control both the number of modes and their complexity. Figure 7 shows that as the number of modes increases, the precision decreases. Interestingly, using a large latent space dimension can alleviate the problem, even if the latent space dimension is clearly below that of the target. Recall the two problems that arise when training GANs: i) dimensional misspecification, where the true and modeled distributions do not have density functions w.r.t. the same base measure, and ii) density misspecification, where GANs try to fit a disconnected manifold with a unimodal distribution. From the results we conclude that:

• With very low latent space dimensions, both problems i) and ii) have to be addressed, and this leads to poor precision as the number of modes increases.

• With larger latent space dimensions, problem i) is less of a burden even when there is a clear dimensional misspecification, and thus the GANs' performance is more tied to problem ii).

Figure 7: Training on a mixture of Gaussians in $\mathbb{R}^{100}$ with varying number of modes and varying latent space dimension. The bigger the number of modes, the lower the precision. Increasing the latent space dimension helps up to a limit depending on the number of modes.
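The synthetic setting described above can be sketched as follows. The text only states that the data is a mixture of Gaussians in $\mathbb{R}^{100}$ with a controllable number of modes, so the placement of the centers (drawn at random with a large spread), the unit variance of each mode, the uniform mode weights and all function names are assumptions made for this illustration.

```python
import math
import random


def sample_gaussian_mixture(n_samples, n_modes, dim=100, spread=10.0, std=1.0, seed=0):
    """Sample from an isotropic Gaussian mixture with n_modes well-separated modes.

    Assumption: centers are drawn once from N(0, spread^2 I); the paper does not
    specify how modes are placed.
    """
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, spread) for _ in range(dim)] for _ in range(n_modes)]
    points, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(n_modes)  # uniform mode weights (assumption)
        points.append([c + rng.gauss(0.0, std) for c in centers[k]])
        labels.append(k)
    return points, labels, centers


def nearest_mode(point, centers):
    """Index of the closest mode center, usable as a proxy for mode membership."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


if __name__ == "__main__":
    for n_modes in (2, 8, 32):
        pts, labels, centers = sample_gaussian_mixture(500, n_modes)
        agree = sum(nearest_mode(p, centers) == k for p, k in zip(pts, labels))
        print(n_modes, "modes:", agree, "of", len(pts), "samples closest to their own center")
```

With such a dataset, the number of modes and their complexity can be varied independently of the latent space dimension of the GAN.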



EXPERIMENTS: UNDERSTANDING THE LATENT SPACE OF GANS

In the following experiments, we validate our theoretical analysis and derive insights for GANs trained on toy and image datasets. We verify if the latent space geometry of GANs has similar



, the latent space configuration obtained by the GANs on 3 aligned data points (right) is made of two parallel hyperplanes, much different from $\mathcal{A}^\star$ defined by Milman and Neeman (2022) (left).

Figure 3: GAN training with 3 equidistant modes and 3 (almost) aligned modes. The first and third panels show the data points in the output space. The second and fourth panels highlight the boundaries in the latent space using heatmaps of the norm of the gradient of the generator.

Figure 4: Extending the multi-bubble conjecture when $m > d + 1$. We plot the partition of the $\mathbb{R}^2$ latent space of a GAN that maps to $m$ equidistant points in $\mathbb{R}^m$, from $m = 3$ (left) to $m = 12$ (right). Each colored cell maps to a distinct data point in the output space.

The architecture Transformer refers to the TransGAN model from Jiang et al. (2021).

Figure 5: Illustration of convexity of classes in the latent space of GANs trained on CIFAR-10. Visual inspection confirms that latent linear interpolations between two samples of the same class most often conserve the class.

Figure 6: Performance of GANs w.r.t. the number of modes and latent space dimensions on image datasets. Left: the FD gets lower as the latent dimension increases. Center and right: precision and density improve when the latent dimension increases, and saturate beyond a threshold.

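$f : \mathbb{R}^d \to \mathbb{R}^m$ defined by $f(z) = \big( d(z, (A_1^{\varepsilon})^{c}), \dots, d(z, (A_m^{\varepsilon})^{c}) \big)$, and $g$ the function defined on $\mathbb{R}^m \setminus \{0\}$ by $g(z) = z / \|z\|_1$.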

$\varepsilon_{\max} = D\sqrt{m}/L$ and $L \ge D \sqrt{m} \sqrt{\pi \log(Cm)}$.

In this experiment, we verify 1) if the latent spaces of GANs are linearly separable (LogReg Accuracy)

GANs training details on CIFAR datasets. BN stands for batch-normalization.

Performance of GANs when varying latent space dimension. Confidence intervals are computed on 10 checkpoints of the same training. See main paper for curves of precision and FID with regard to the latent space dimension.

A IMPLEMENTATION DETAILS

Training. We use the Wasserstein loss with gradient penalty on interpolations of fake and real data. At each iteration, the discriminator is trained for 2 steps and the generator for 1 step, with the Adam optimizer. The batch size is 256. The learning rate of the discriminator is two times larger than that of the generator (Heusel et al., 2017), i.e. $5 \times 10^{-5}$ for the generator and $1 \times 10^{-4}$ for the discriminator. GANs are trained for 80k steps on MNIST and for 100k steps on CIFAR datasets. Architectures of the generator and discriminator are described in Table 3 and Table 4.

Evaluation. Features are extracted with a classifier with a simple architecture (convolutions, ReLU activation, no batch normalization). The classifier is trained on each dataset with cross-entropy loss. Weights of the classifiers are given in the code. For evaluation metrics, we follow the settings proposed by their respective authors. For FID (Heusel et al., 2017), we use 50k real images and 50k fake images. For precision, recall, density and coverage (Kynkäänniemi et al., 2019; Naeem et al., 2020), we use 10k real images and 10k fake images with nearest-k = 5. Full results of the study on latent space dimension are presented in Table 5. We also share the code for better reproducibility.

GPUs. For all datasets, the training of GANs was run on NVIDIA TESLA V100 GPUs (16 GB). The training of a GAN for 100k steps on CIFAR takes around 30 hours.
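For readers unfamiliar with the k-nearest-neighbor precision metric used in the evaluation above, the following minimal sketch shows the idea behind it: a fake sample counts as precise if it falls inside the ball around at least one real sample, the radius of each ball being the distance from that real sample to its k-th nearest real neighbor. This is a brute-force, pure-Python illustration operating on raw feature vectors; it is not the implementation used in the paper, and the function names are hypothetical.

```python
import math


def kth_neighbor_radii(points, k):
    """For each point, distance to its k-th nearest neighbor among the other points."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def knn_precision(real, fake, k=5):
    """Fraction of fake samples lying in at least one real k-NN ball."""
    radii = kth_neighbor_radii(real, k)
    hits = sum(
        any(math.dist(f, r) <= radius for r, radius in zip(real, radii))
        for f in fake
    )
    return hits / len(fake)


if __name__ == "__main__":
    # Toy features: real points on a line, one fake point inside the support, one far away.
    real = [(float(i), 0.0) for i in range(10)]
    fake = [(4.5, 0.0), (100.0, 0.0)]
    print(knn_precision(real, fake, k=2))
```

The quadratic cost of this sketch makes it suitable only for small sample sizes; with 10k real and 10k fake feature vectors, a vectorized or approximate nearest-neighbor search would be preferable.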

