THE GEOMETRY OF DEEP GENERATIVE IMAGE MODELS AND ITS APPLICATIONS

Abstract

Generative adversarial networks (GANs) have emerged as a powerful unsupervised method to model the statistical patterns of real-world data sets, such as natural images. These networks are trained to map random inputs in their latent space to new samples representative of the learned data. However, the structure of the latent space is hard to intuit due to its high dimensionality and the non-linearity of the generator, limiting the usefulness of the models. Understanding the latent space requires a way to identify the input codes of existing real-world images (inversion) and a way to identify directions with known image transformations (interpretability). Here, we use a geometric framework to address both issues simultaneously. We develop an architecture-agnostic method to compute the Riemannian metric of the image manifold created by GANs. The eigen-decomposition of the metric isolates axes that account for different levels of image variability. An empirical analysis of several pretrained GANs shows that image variation around each position is concentrated along surprisingly few major axes (the space is highly anisotropic) and that the directions creating this large variation are similar at different positions in the space (the space is homogeneous). We show that many of the top eigenvectors correspond to interpretable transforms in the image space, with a substantial part of the eigenspace corresponding to minor transforms that could be compressed out. This geometric understanding unifies key previous results related to GAN interpretability. We show that using this metric allows for more efficient optimization in the latent space (e.g. GAN inversion) and facilitates unsupervised discovery of interpretable axes. Our results illustrate that defining the geometry of the GAN image manifold can serve as a general framework for understanding GANs.

1. BACKGROUND

Generative adversarial networks (GANs) learn patterns that characterize complex datasets and can subsequently generate new samples representative of that set. In recent years, there has been tremendous success in training GANs to generate high-resolution and photorealistic images (Karras et al., 2017; Brock et al., 2018; Donahue & Simonyan, 2019; Karras et al., 2020). Well-trained GANs show smooth transitions between image outputs when interpolating in their latent input space, which makes them useful in applications such as high-level image editing (e.g. changing attributes of faces), object segmentation, and image generation for art and neuroscience (Zhu et al., 2016; Shen et al., 2020; Pividori et al., 2019; Ponce et al., 2019). However, there is no systematic approach for understanding the latent space of a given GAN or its relationship to the manifold of natural images.

Because a generator provides a smooth map onto image space, one relevant conceptual model for GAN latent space is a Riemannian manifold. To define the structure of this manifold, we have to ask questions such as: are images homogeneously distributed on a sphere (White, 2016)? What is the structure of its tangent space: do all directions induce the same amount of variance in image transformation? Here we develop a method to compute the metric of this manifold and investigate its geometry directly, and then use this knowledge to navigate the space and improve several applications. To define a Riemannian geometry, we need a smooth map and a notion of distance on it, defined by the metric tensor. For image applications, the relevant notion of distance is in image space rather than code space. Thus, we pull back the distance function from the image space onto the latent space. Differentiating this distance function on the latent space yields a differential geometric structure (a Riemannian metric) on the image manifold. Further, by computing the Riemannian metric at different points (i.e. around different latent codes), we can estimate the anisotropy and homogeneity of this manifold.

The paper is organized as follows: first, we review previous work using tools from Riemannian geometry to analyze generative models in section 2. Using this geometric framework, we introduce an efficient way to compute the metric tensor H on the image manifold in section 3, and empirically investigate the properties of H in various GANs in section 4. We explain the properties of this metric in terms of network architecture and training in section 5. We show that this understanding provides a unifying principle behind previous methods for discovering interpretable axes in the latent space. Finally, we demonstrate other applications that this geometric information can facilitate, e.g. gradient-free search in the GAN image manifold, in section 6.

2. RELATED WORK

Geometry of Deep Generative Models. Concepts from Riemannian geometry have recently been applied to illuminate the structure of the latent space of generative models (i.e. GANs and variational autoencoders, VAEs). Shao et al. (2018) designed algorithms to compute geodesic paths, parallel transport of vectors, and geodesic shooting in the latent space; they used finite differences together with a pretrained encoder to circumvent computing the Jacobian of the generator. While promising, this method did not provide direct access to the metric and could not be applied to GANs without encoders. Arvanitidis et al. (2017) focused on the geometry of VAEs, deriving a formula for the metric tensor in order to solve for geodesics in the latent space; this worked well for shallow convolutional VAEs and low-resolution images (28 x 28 pixels). Chen et al. (2018) computed geodesics through minimization, applying their method to shallow VAEs trained on MNIST images and a low-dimensional robotics dataset. The methods above could only be applied to neural networks without ReLU activations. Here, our geometric analysis is architecture-agnostic and is applied to modern large-scale GANs (e.g. BigGAN, StyleGAN2). Further, we extend the pixel L2 distance assumed in previous works to any differentiable distance metric.

3. METHODS

Formulation. A generative network, denoted by $G$, is a mapping from latent code $z$ to image $I$: $G:\mathbb{R}^n \to \mathcal{I} = \mathbb{R}^{H\times W\times 3},\; z \mapsto I$. Borrowing the language of Riemannian geometry, $G(z)$, with $z \in \mathbb{R}^n$, parameterizes a submanifold of the image space. Note that for applications in the image domain, we care about distance in image space. Thus, given a distance function on image space, $D:\mathcal{I}\times\mathcal{I}\to\mathbb{R}_+,\; (I_1, I_2)\mapsto L$, we can define the distance between two codes as the distance between the images they generate, i.e. pull back the distance function to the latent space through $G$: $d:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}_+,\; d(z_1,z_2) := D(G(z_1), G(z_2))$.

The Hessian matrix (second-order partial derivative) of the squared distance function $d^2$ can be seen as the metric tensor of the image manifold (Palais, 1957). The intuition behind this is as follows: consider the squared distance to a fixed reference vector $z_0$ as a function of $z$, $f_{z_0}(z) = d^2(z_0, z)$. Obviously, $z = z_0$ is a local minimum of $f_{z_0}(z)$, so $f_{z_0}(z)$ can be locally approximated by a positive semi-definite quadratic form $H(z_0)$ as in Eq. 1. This matrix induces an inner product and defines a vector norm, $\|v\|_H^2 = v^T H(z_0)\, v$. This squared vector norm approximates the squared image distance, $d^2(z_0, z_0 + \delta z) \approx \|\delta z\|_H^2 = \delta z^T H(z_0)\,\delta z$. Thus, this matrix encodes the local distance information on the image manifold up to a second-order approximation; this is the intuition behind the Riemannian metric. In this article, the terms "metric tensor" and "Hessian matrix" are used interchangeably. We will call $\alpha_H(v) = v^T H v / v^T v$ the approximate speed of image change along $v$ as measured by the metric $H$.

$$d^2(z_0, z) \approx \delta z^T \left.\frac{\partial^2 d^2(z_0, z)}{\partial z^2}\right|_{z_0} \delta z, \qquad H(z_0) := \left.\frac{\partial^2 d^2(z_0, z)}{\partial z^2}\right|_{z_0}, \qquad \delta z = z - z_0 \quad (1)$$
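As a concrete illustration of this formulation, the metric tensor $H(z_0)$ can be obtained by automatic differentiation of the pulled-back squared distance. The sketch below is not the paper's implementation: it stands in a small random MLP for a pretrained generator and uses the pixel L2 distance; the names `G`, `dist2`, and `speed` are illustrative assumptions.

```python
import torch

# Toy stand-in for a pretrained generator G: R^n -> (flattened) image space.
torch.manual_seed(0)
n, npix = 8, 64
G = torch.nn.Sequential(
    torch.nn.Linear(n, 32), torch.nn.Tanh(), torch.nn.Linear(32, npix)
)

def dist2(z, z0):
    """Squared pixel-L2 distance D(G(z0), G(z))^2, pulled back to latent space."""
    return ((G(z) - G(z0)) ** 2).sum()

z0 = torch.randn(n)

# Metric tensor H(z0): Hessian of the squared distance at its minimum z = z0 (Eq. 1).
H = torch.autograd.functional.hessian(lambda z: dist2(z, z0), z0)

# Eigen-decomposition separates latent directions by how fast the image changes.
eigvals, eigvecs = torch.linalg.eigh(H)

def speed(v, H):
    """Approximate speed of image change alpha_H(v) = v^T H v / v^T v."""
    return (v @ H @ v) / (v @ v)

v_top = eigvecs[:, -1]  # axis of largest image variation around z0
# For a unit eigenvector, speed(v_top, H) matches the top eigenvalue.
```

For the pixel L2 distance, $H(z_0) = 2 J^T J$ with $J$ the Jacobian of $G$ at $z_0$, so the matrix is positive semi-definite by construction; the autograd Hessian above recovers this without forming $J$ explicitly.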

