ANALYZING THE LATENT SPACE OF GAN THROUGH LOCAL DIMENSION ESTIMATION

Anonymous authors
Paper under double-blind review

Abstract

The impressive success of style-based GANs (StyleGANs) in high-fidelity image synthesis has motivated research to understand the semantic properties of their latent spaces. Recently, a close relationship was observed between the semantically disentangled local perturbations and the local PCA components in W-space. However, understanding the number of disentangled perturbations remains challenging. Building upon this observation, we propose a local dimension estimation algorithm for an arbitrary intermediate layer in a pre-trained GAN model. The estimated intrinsic dimension corresponds to the number of disentangled local perturbations. In this perspective, we analyze the intermediate layers of the mapping network in StyleGANs. Our analysis clarifies the success of W-space in StyleGAN and suggests a method for finding an alternative. Moreover, the intrinsic dimension estimation opens the possibility of unsupervised evaluation of global-basis-compatibility and disentanglement for a latent space. Our proposed metric, called Distortion, measures an inconsistency of intrinsic tangent space on the learned latent space. The metric is purely geometric and does not require any additional attribute information. Nevertheless, the metric shows a high correlation with the global-basis-compatibility and supervised disentanglement score. Our work is the first step towards selecting the most disentangled latent space among various latent spaces in a GAN without attribute labels.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved remarkable success in generating realistic high-resolution images (Karras et al., 2018; 2019; 2020b; 2021; 2020a; Brock et al., 2018). Nevertheless, understanding how GAN models represent the semantics of images in their latent spaces is still a challenging problem. To this end, several recent works investigated the disentanglement (Bengio et al., 2013) properties of the latent space in GANs (Goetschalckx et al., 2019; Jahanian et al., 2019; Plumerault et al., 2020; Shen et al., 2020). In this work, we concentrate on finding a disentangled latent space in a pre-trained model. A latent space is called (globally) disentangled if there is a bijective correspondence between the semantic attributes and the axes of the latent space when it is represented with the optimal basis. (See the appendix for details.)

The style-based GAN models (Karras et al., 2019; 2020b) have been popular in previous studies for identifying a disentangled latent space in a pre-trained model. First, the space of style vectors, called W-space, was shown to provide a better disentanglement property than the latent noise space Z (Karras et al., 2019). After that, several attempts have been made to discover other disentangled latent spaces, such as W+-space (Abdal et al., 2019) and S-space (Wu et al., 2020). However, their better disentanglement was assessed by manual inspection (Karras et al., 2019; Abdal et al., 2019; Wu et al., 2020) or by quantitative scores employing a pre-trained feature extractor (PPL (Karras et al., 2019)) or an attribute annotator (Separability (Karras et al., 2019) and DCI metric (Eastwood & Williams, 2018; Wu et al., 2020)). Manual inspection is vulnerable to sample dependency, and the quantitative scores depend on the pre-trained models and the set of selected target attributes.
Therefore, we need an unsupervised quantitative evaluation scheme for the disentanglement of latent space that does not rely on pre-trained models. In this paper, we investigate the semantic property of a latent space by analyzing its geometrical property. In this regard, we propose a local intrinsic dimension estimation scheme for a learned intermediate latent space in pre-trained GAN models. The local intrinsic dimension is the number of dimensions required to properly approximate the latent space locally (Fig 1a). We discover this intrinsic dimension by estimating the robust rank of Jacobian of the subnetwork. The estimated dimension is interpreted as the number of disentangled local perturbations. Furthermore, the intrinsic dimension of latent manifold leads to an unsupervised quantitative score for the global disentanglement property. The experiments demonstrate that our proposed metric shows a high correlation with the global-basis-compatibility and supervised disentanglement score. (The global-basis-compatibility will be rigorously defined in Sec 4.)

Our contributions are as follows:

1. We propose a local intrinsic dimension estimation scheme for an intermediate latent space in pre-trained GAN models. The scheme is derived from the rank estimation algorithm applied to the Jacobian matrix of a subnetwork.
2. We propose a layer-wise global disentanglement score, called Distortion, that measures the inconsistency of intrinsic tangent space. The proposed metric shows a high correlation with the global-basis-compatibility and supervised disentanglement score.
3. We analyze the intermediate layers of the mapping network through the proposed Distortion metric. Our analysis elucidates the superior disentanglement of W-space compared to the other intermediate layers and suggests a criterion for finding a similar-or-better alternative.

2. RELATED WORKS

Style-based Generator. Recently, GANs with style-based generator architecture (Karras et al., 2019; 2020b; 2021; Sauer et al., 2022) have achieved state-of-the-art performance in realistic image generation. In conventional GAN architecture, such as DCGAN (Radford et al., 2016) and ProGAN (Karras et al., 2018), the generator synthesizes an image by transforming a latent noise with a sequence of convolutional layers. On the other hand, the style-based generator consists of two subnetworks: mapping network $f : Z \to W$ and synthesis network $g : \mathbb{R}^{n_0} \times W^L \to X$. The synthesis network is similar to conventional generators in that it is composed of a series of convolutional layers $\{g_i\}_{i=1,\dots,L}$. The key difference is that the synthesis network takes the learned constant feature $y_0 \in \mathbb{R}^{n_0}$ at the first layer $g_0$, and then adjusts the output image by injecting the layer-wise styles $w$ and noise (layer-wise noise is omitted for brevity):

$$y_i = g_i(y_{i-1}, w) \quad \text{with } w = f(z) \quad \text{for } i = 1, \dots, L,$$

where the style vector $w$ is attained by transforming a latent noise $z$ via the mapping network $f$.

Understanding Latent Semantics. The previous attempts to understand the semantic property of latent spaces in StyleGANs are categorized into two topics: (i) finding more disentangled latent space in a model; (ii) discovering meaningful perturbation directions in a latent space corresponding to disentangled semantics. Several studies on (i) suggested various disentangled latent spaces in StyleGAN models, for example, W (Karras et al., 2019), W+ (Abdal et al., 2019), $P_N$ (Zhu et al., 2020), and S-space (Wu et al., 2020). However, the superiority of the newly proposed latent space was demonstrated only through comparison with the previous latent spaces, not by selecting the best one among all candidates.
Moreover, the comparison was conducted by manual inspections (Karras et al., 2019; Abdal et al., 2019; Wu et al., 2020) or by quantitative metrics relying on pre-trained models (Karras et al., 2019; Wu et al., 2020). Also, the previous works on (ii) are classified into local and global methods. The local methods find sample-wise perturbation directions (Ramesh et al., 2018; Patashnik et al., 2021; Abdal et al., 2021; Zhu et al., 2021; Choi et al., 2022b). On the other hand, the global methods search for layer-wise perturbation directions that perform the same semantic manipulation on the entire latent space (Härkönen et al., 2020; Shen & Zhou, 2021; Voynov & Babenko, 2020). Throughout this paper, we refer to the directions found by these local methods as local basis and those found by these global methods as global basis. GANSpace (Härkönen et al., 2020) showed that the principal components obtained by PCA can serve as the global basis. SeFa (Shen & Zhou, 2021) suggested the singular vectors of the first weight parameter applied to latent noise as the global basis. These global bases showed promising results, but they were successful only in a limited area. Depending on the sampled latent variables, these methods exhibited limited semantic factorization and sharp degradation of image fidelity (Choi et al., 2022b; a). In this regard, Choi et al. (2022b) suggested the need for diagnosing a global-basis-compatibility of latent space. Here, the global-basis-compatibility means how well the optimal global basis can work on the target latent space.

Local Basis. Choi et al. (2022b) proposed an unsupervised method for finding local semantic perturbations based on the local geometry, called Local Basis (LB). Throughout this paper, we denote Local Basis as LB to avoid confusion with the general term "local basis" in the previous paragraph. Assume the support $Z$ of the input prior distribution $p(z)$ is the entire Euclidean space, i.e., $Z = \mathbb{R}^{d_Z}$, for example, Gaussian prior $p(z) = N(0, I)$.
We denote the target latent space by $M = f(Z) \subseteq \mathbb{R}^{d_M}$ and refer to the subnetwork between them by $f$. Note that the target latent space $M$ is defined as an image of the trained subnetwork $f$. Hence, we call $M$ the learned latent space or the learned latent manifold following the manifold interpretation of Choi et al. (2022b). LB is defined as the ordered basis of the tangent space $T_w M^k_w$ at $w = f(z) \in M$ for the $k$-dimensional local approximating manifold $M^k_w$. Here, $M^k_w \subseteq M$ indicates a $k$-dimensional submanifold of $M$ that approximates $M$ around $w$ (Fig 1a) with $k \le d_M$:

$$M^k_w \approx M_w \quad \text{where} \quad M_w = \{ f(z_\epsilon) \mid \|z_\epsilon - z\| < \epsilon \} \subseteq M.$$

Using the fact that $M = f(Z)$, the local approximating manifold $M^k_w$ can be discovered by solving the low-rank approximation problem of $df_z$, i.e., the Jacobian matrix $\nabla_z f$ of $f$:

$$\underset{L}{\text{minimize}} \; \|df_z - L\|_2, \quad \text{where } \operatorname{rank}(L) \le k. \tag{3}$$

The analytic solution of this low-rank approximation problem is obtained in terms of Singular Value Decomposition (SVD) by the Eckart-Young-Mirsky Theorem (Eckart & Young, 1936). From that, $M^k_w$ and the corresponding LB are given as follows: for the $i$-th singular vectors $u^z_i \in \mathbb{R}^{d_Z}$, $v^w_i \in \mathbb{R}^{d_M}$, and the $i$-th singular value $\sigma^z_i \in \mathbb{R}$ of $df_z$ with $\sigma^z_1 \ge \dots \ge \sigma^z_n$,

$$df_z(u^z_i) = \sigma^z_i \cdot v^w_i \ \text{ for } \forall i, \qquad \mathrm{LB}(w = f(z)) = \{v^w_i\}_{1 \le i \le n},$$

$$M^k_w = \left\{ f\Big(z + \sum_i t_i \cdot u^z_i\Big) \;\Big|\; t_i \in (-\epsilon_i, \epsilon_i), \text{ for } 1 \le i \le k \right\}. \tag{5}$$

Note that the tangent space of $M^k_w$ is spanned by the top-$k$ LB, i.e., $T_w M^k_w = \operatorname{span}\{v^w_i : 1 \le i \le k\}$. Therefore, traversing along LB is guaranteed to stay close to the latent manifold, thereby providing a strong robustness of image quality. However, Choi et al. (2022b) did not provide an estimate of the number of meaningful perturbations. Since LB is defined as singular vectors, Choi et al. (2022b) presents as many candidates as the ambient dimension. In this regard, we propose the local dimension estimation that can reduce the number of these candidates by up to 90%.
Moreover, this local dimension estimation leads to an unsupervised global disentanglement metric (Sec 4).
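As an illustration of the LB construction above, the following is a minimal sketch that computes the singular vectors of a Jacobian for a toy map. The subnetwork `toy_f`, the finite-difference Jacobian `jacobian_fd`, and the choice `k = 2` are assumptions made for this example only; they are not the StyleGAN mapping network or the implementation of Choi et al. (2022b).

```python
import numpy as np


def jacobian_fd(f, z, eps=1e-5):
    """Central finite-difference Jacobian of f at z, with shape (d_M, d_Z)."""
    z = np.asarray(z, dtype=float)
    columns = []
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = eps
        columns.append((f(z + step) - f(z - step)) / (2.0 * eps))
    return np.stack(columns, axis=1)


def local_basis(f, z):
    """Return (U_z, sigma, V_w) such that J @ U_z[:, i] = sigma[i] * V_w[:, i]."""
    J = jacobian_fd(f, z)
    # np.linalg.svd gives J = V_w @ diag(sigma) @ U_z^T, sigma sorted descending.
    V_w, sigma, U_z_T = np.linalg.svd(J, full_matrices=False)
    return U_z_T.T, sigma, V_w


# Hypothetical toy subnetwork f: R^4 -> R^6 (an assumption, not a real GAN layer).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))


def toy_f(z):
    return np.tanh(A @ z)


z = rng.normal(size=4)
U_z, sigma, V_w = local_basis(toy_f, z)
k = 2
top_k_lb = V_w[:, :k]  # columns span the tangent space of the k-dim approximation
```

For a real network, the Jacobian would more naturally come from automatic differentiation; finite differences are used here only to keep the sketch dependency-light.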

3. LATENT DIMENSION ESTIMATION

In this section, we propose a local dimension estimation scheme for a learned latent manifold in a pre-trained GAN model. Following the work of Choi et al. (2022b) , the estimated local dimension at w ∈ M corresponds to the number of local semantic perturbations from w. The proposed scheme is based on the rank estimation algorithm (Kritchman & Nadler, 2008) applied to the differential of subnetwork df z . Then, we evaluate the validity of the estimated local dimension. In this section, our analysis of learned latent manifold is focused on the intermediate layers in the mapping network of StyleGAN2 (Karras et al., 2020b) trained on FFHQ (Karras et al., 2019) . However, the proposed scheme can be applied to any z-differentiable intermediate layers for an input latent noise z.

3.1. METHOD

Throughout this work, we follow the notation presented in Sec 2. Consider a target latent space $M$ given by a subnetwork $f$, i.e., $M = f(Z)$. Our goal is to estimate the intrinsic local dimension of the learned latent manifold $M$ around $w = f(z)$. Geometrically, this intrinsic dimension represents the dimension required to locally describe the major variations of the manifold. The intrinsic local dimension is discovered by interpreting the differential $df_z$ as a noisy linear map and finding its intrinsic rank. The correspondence between the local dimension and the rank of $df_z$ is described in Eq 5 because the rank of a linear map is the same as the number of singular vectors with non-zero singular values. Note that the matrix representation of $df_z$ is the Jacobian matrix $(\nabla_z f)(z)$.

Motivation. Before presenting our dimension estimation algorithm, we provide motivation for introducing the lower-dimensional approximation to $M$. Figure 1b shows the singular value distribution of Jacobian matrices evaluated for the subnetworks of the mapping network in StyleGAN2. Last layer $i$ in Fig 1b denotes the subnetwork from the input noise space $Z$ to the $i$-th fully connected layer. The distribution of singular values $\{\sigma^z_i\}_i$ gets monotonically sparser as the subnetwork gets deeper. In particular, W-space, i.e., Last layer 8, is extremely sparse, with $\sigma^z_{150} / \sigma^z_1 \approx 0.005$. Therefore, it is reasonable to prune the singular values with negligible magnitude and consider the lower-dimensional approximation of the learned latent manifold.

Pseudorank Algorithm

The intrinsic rank estimation algorithm distinguishes the large meaningful components from the small noise-like components given the singular values $\{\sigma^z_i\}_i$ of $(\nabla_z f)(z)$. The Pseudorank algorithm (Kritchman & Nadler, 2008) determines the number of meaningful components based on theoretical results from the random matrix theory literature. Assume isotropic Gaussian noise on the Jacobian $(\nabla_z f)(z) \in \mathbb{R}^{d_M \times d_Z}$:

$$(\nabla_z f)(z) = L(z) + \sigma \cdot (\epsilon_1, \dots, \epsilon_{d_M})^\top \quad \text{with } \epsilon_i \sim N(0, I_{d_Z}), \tag{6}$$

where $L(z)$ denotes the denoised low-rank representation of $(\nabla_z f)(z)$. Then, taking the expectation over the noise distribution gives:

$$E_\epsilon\left[(\nabla_z f)^\top(z) \cdot (\nabla_z f)(z)\right] = L^\top(z) \cdot L(z) + \sigma^2 \cdot I_{d_Z}.$$

The eigenvalues of $(\nabla_z f)^\top(z) \cdot (\nabla_z f)(z)$ are the squares of the singular values $\{(\sigma^z_i)^2\}_i$, and the noise covariance term $\sigma^2 \cdot I_{d_Z}$ increases all eigenvalues by $\sigma^2$. This observation explains our intuition that large singular values correspond to signals and small ones correspond to noise. Therefore, determining the intrinsic rank of $(\nabla_z f)(z)$ is closely related to the largest eigenvalue $\lambda_1$ of the empirical covariance matrix $S = \frac{1}{d_Z} \sum_i \epsilon_i \cdot \epsilon_i^\top$, which is the threshold for distinguishing between signal and noise.

The Pseudorank algorithm is based on theoretical results on the asymptotic behavior of $\lambda_1$. The distribution of the largest eigenvalue $\lambda_1$ of the empirical covariance matrix for $n$ samples of $N(0, I_p)$ converges to a Tracy-Widom distribution $F_\beta$ of order $\beta = 1$ for real-valued observations (Johnstone, 2001) (see the appendix for details):

$$P\left(\lambda_1 < \sigma^2 (\mu_{n,p} + s \cdot \sigma_{n,p})\right) \to F_\beta(s) \quad \text{as } n, p \to \infty \text{ with } c = p/n \text{ fixed}.$$

Here, note that we do not know the true noise level $\sigma$ in Eq 6 a priori. Using the above theoretical results, the Pseudorank algorithm applies a sequence of nested hypothesis tests. Given the Jacobian $(\nabla_z f)(z) \in \mathbb{R}^{d_M \times d_Z}$, let $p = \min(d_M, d_Z)$. Then, for $k = 1, 2, \dots, p - 1$,

$$H_0: \text{rank at least } k \quad \text{vs.} \quad H_1: \text{rank at most } (k - 1).$$

For each $k$, the hypothesis test consists of two parts. First, the noise level $\sigma_{est}(k)$ of $(\nabla_z f)(z)$ should be estimated to perform a hypothesis test. The Pseudorank algorithm (Kritchman & Nadler, 2008) provides a consistent noise estimation algorithm under the assumption that $\lambda_{k+1}, \lambda_{k+2}, \dots, \lambda_p$ are the noise components, where $\lambda_i = (\sigma^z_i)^2$. Second, we test whether $\lambda_k$ belongs to the noise components based on the corresponding Tracy-Widom distribution as follows:

$$\lambda_k \le \sigma^2_{est}(k) \left(\mu_{n,p-k} + s(\alpha) \cdot \sigma_{n,p-k}\right), \tag{10}$$

where $\alpha$ denotes a chosen confidence level. We chose $\alpha = 0.1$ in our experiments. The above test is repeated until Eq 10 is satisfied. Then, the estimated rank $K$ becomes $K = k - 1$.

Preprocessing. The Pseudorank algorithm supposes isotropic Gaussian noise on the Jacobian matrix. However, even considering the randomness of the empirical covariance, the observed singular values of the Jacobian matrix are too sparse ($\sigma_{min} / \sigma_{max} \approx 10^{-9}$). Hence, the isotropic Gaussian assumption leads to an underestimation of the noise level, which causes an overestimation of the intrinsic rank (in our experiments, estimated rank $> 200$ and $\sigma_{rank} \approx 0.003$). To address this problem, we introduce a simple preprocessing on the singular values of the Jacobian. Before applying the Pseudorank algorithm, we filter out the singular values $\{\sigma^z_i\}_i$ with $(\sigma^z_i)^2 \le \theta_{pre} \cdot \max_i (\sigma^z_i)^2$. We set $\theta_{pre} \in \{0.0005, 0.001, 0.005, 0.01\}$.

[Figure 2: Off-manifold Results in W-space of StyleGAN2. Axis labels: Index of Singular Value, Final Loss; legend: $\theta_{pre}$ = 0.0005, 0.001, 0.005, 0.01.]
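The preprocessing filter and the nested hypothesis tests described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the noise level is estimated by the plain mean of the trailing eigenvalues instead of the consistent estimator of Kritchman & Nadler (2008), the Tracy-Widom quantile `s_alpha` is passed in by the caller (the default value is an assumed placeholder that should be taken from a Tracy-Widom table), `p` is taken to be the number of eigenvalues that survive preprocessing, and the function names are hypothetical.

```python
import math


def tw_center_scale(n, p):
    """Centering mu_{n,p} and scaling sigma_{n,p} of the largest-eigenvalue law."""
    a = math.sqrt(n - 0.5) + math.sqrt(p - 0.5)
    mu = a * a / n
    sigma = (a / n) * (1.0 / math.sqrt(n - 0.5) + 1.0 / math.sqrt(p - 0.5)) ** (1.0 / 3.0)
    return mu, sigma


def preprocess(singular_values, theta_pre):
    """Drop eigenvalues (squared singular values) at or below theta_pre * max."""
    lam = sorted((s * s for s in singular_values), reverse=True)
    cutoff = theta_pre * lam[0]
    return [value for value in lam if value > cutoff]


def estimate_rank(singular_values, n, theta_pre=0.005, s_alpha=0.45):
    """Sequential test: stop at the first k whose eigenvalue looks like noise."""
    lam = preprocess(singular_values, theta_pre)
    p = len(lam)  # assumption: p counts the eigenvalues kept by preprocessing
    for k in range(1, p):
        # Simplified noise estimate (assumption): mean of lambda_{k+1}, ..., lambda_p.
        noise = sum(lam[k:]) / (p - k)
        mu, sigma = tw_center_scale(n, p - k)
        if lam[k - 1] <= noise * (mu + s_alpha * sigma):
            return k - 1
    return p - 1  # fallback when no test is satisfied (assumption)


# Toy spectrum: three strong components followed by a flat noise floor.
toy_singular_values = [10.0, 8.0, 6.0] + [1.0] * 47
estimated_rank = estimate_rank(toy_singular_values, n=50)
```

Swapping the mean-based noise estimate for the iterative estimator in the appendix changes only the line that computes `noise`.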

Validity of the estimated local dimension

We suggest the Off-manifold experiment to assess the validity of the estimated local dimension. Intuitively, the intrinsic local dimension at $w \in M$ is the number of coordinate axes required to locally describe the learned manifold $M$. Here, the tangent vector at $w$ along the $k$-th axis is the $k$-th LB. In this respect, the Off-manifold experiment tests whether the latent perturbation along the $k$-th LB $v^w_k$ stays in the latent manifold $M = f(Z)$. If the margin of $M$ at $w$ in the $k$-th LB direction is large enough, then the $k$-th axis is needed to locally approximate $M$. To be more specific, we solve the following optimization problem with the Adam optimizer (Kingma & Ba, 2015) on MSE loss with a learning rate of 0.005 for 1000 iterations for each $k$:

$$w_{init} = f(z_{init}), \quad w_{ptb} = w_{init} + c \cdot v^w_k, \quad z_{opt} = \arg\min_z \|w_{ptb} - f(z)\|^2 \ \text{ with } z_0 = z_{init}.$$

We ran the Off-manifold experiments on W-space of StyleGAN2. Figure 2 shows the final objective $\|w_{ptb} - f(z_{opt})\|^2$ after the optimization for each LB $v^w_i$ with $c = 2$. The red vertical lines denote the estimated local dimension for each $\theta_{pre}$. (See the appendix for the Off-manifold results with various $c = \|w_{ptb} - w_{init}\|$.) The monotonic increase in the final loss with the index shows that $f(z_{opt})$ cannot get close to $w_{ptb}$ for the higher-index LB directions. In other words, the diameter in the $k$-th LB direction decreases as the index $k$ increases. Although there is a dependency on the preprocessing threshold, the rank estimation algorithm chooses the principal part of the local manifold around $w_{init}$ without overestimation, as desired. In particular, the estimated rank with $\theta_{pre} = 0.005$ appears to find a transition point of the final loss.

Sparsity Constraint. LowRankGAN (Zhu et al., 2021) introduced a convex optimization problem called Principal Component Pursuit (PCP) (Candès et al., 2011) to find a low-rank factorization of the Jacobian $(\nabla_z f)(z)$ (Eq 13):

$$\underset{L, S}{\text{minimize}} \; \|L\|_* + \gamma \cdot \|S\|_1 \quad \text{s.t. } L + S = (\nabla_z f)^\top(z) \cdot (\nabla_z f)(z), \tag{13}$$

where $\|L\|_* = \sum_i \sigma_i(L)$ is the nuclear norm, i.e., the sum of all singular values, $\|S\|_1 = \sum_{i,j} |S_{i,j}|$, and $\gamma > 0$ is a positive regularization parameter. PCP encourages sparsity on the corruption $S$ through the $\ell_1$ regularizer. However, we believe that the sparsity assumption is not adequate for finding the intrinsic rank of the Jacobian. To test the validity of the sparsity assumption, we monitored how the low-rank representation $L$ changes as we vary the regularization parameter $n = 1/\gamma$ as in Zhu et al. (2021) (Fig 3). The estimated rank decreases unceasingly without saturation as we increase $n$, i.e., as we denoise the Jacobian more strongly. We consider that rank saturation should occur if this assumption is adequate for finding an intrinsic rank, because saturation implies robustness to the regularization strength. But the low-rank factorization through PCP does not show any saturation until the Frobenius norm of the corruption $\|S\|_F$ reaches over 50% of that of the initial matrix $\|(\nabla_z f)^\top(z) \cdot (\nabla_z f)(z)\|_F$.

Interpretation as Frobenius Norm. The Pseudorank algorithm can be interpreted as a Nuclear-Norm Penalization (NNP) problem (Eq 14) for matrix denoising (Donoho & Gavish, 2014). This NNP framework is similar to PCP in LowRankGAN except for the Frobenius-norm regularization on $E$. While PCP requires an iterative optimization by the Alternating Directions Method of Multipliers (ADMM) (Boyd et al., 2011; Lin et al., 2010), NNP provides an explicit closed-form solution $L^*$ through SVD:

$$\underset{L, E}{\text{minimize}} \; \|L\|_* + \gamma \cdot \|E\|_F^2 \quad \text{s.t. } L + E = (\nabla_z f)(z), \tag{14}$$

$$\Rightarrow \quad L^* = U \left(\Sigma - \frac{1}{2\gamma} \cdot I\right)_+ V^\top \quad \text{where } (\nabla_z f)(z) = U \Sigma V^\top \ \text{(SVD)}, \ \text{for } (M_+)_{i,j} = \max(M_{i,j}, 0).$$

Therefore, the intrinsic rank estimation by NNP is determined by choosing a threshold $1/(2\gamma)$ for the singular values $\{\sigma^z_i\}_i$ of the Jacobian. The Pseudorank algorithm selects this threshold by running a series of hypothesis tests.
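The closed-form NNP solution amounts to soft-thresholding the singular values at $1/(2\gamma)$. The sketch below illustrates this on a small synthetic matrix, assuming the squared Frobenius penalty under which the soft-thresholding solution holds; the function name `nnp_denoise`, the synthetic spectrum, and the value of `gamma` are assumptions for illustration only.

```python
import numpy as np


def nnp_denoise(J, gamma):
    """Soft-threshold the singular values of J at 1 / (2 * gamma)."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    s_shrunk = np.maximum(s - 1.0 / (2.0 * gamma), 0.0)
    L_star = (U * s_shrunk) @ Vt  # U @ diag(s_shrunk) @ Vt
    rank = int(np.count_nonzero(s_shrunk))
    return L_star, rank


# Synthetic "Jacobian" with a known spectrum (hypothetical data).
rng = np.random.default_rng(0)
Q_left, _ = np.linalg.qr(rng.normal(size=(6, 4)))   # orthonormal columns
Q_right, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # orthogonal matrix
true_spectrum = np.array([5.0, 3.0, 0.2, 0.1])
J = (Q_left * true_spectrum) @ Q_right.T

gamma = 1.0  # threshold 1 / (2 * gamma) = 0.5
L_star, rank = nnp_denoise(J, gamma)
```

With this spectrum and threshold, only the two singular values above 0.5 survive, so the estimated rank equals the number of singular values exceeding $1/(2\gamma)$.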

3.3. LATENT SPACE ANALYSIS OF STYLEGAN

We analyzed the intermediate layers of the mapping network in StyleGAN2 trained on FFHQ using our local dimension estimation (See Fig 11 for StyleGAN architectures). First, Figure 4 shows the distribution of estimated local dimensions for 1k samples of each intermediate layer for each $\theta_{pre}$ (See the appendix for the rank statistics under all $\theta_{pre}$). Note that the algorithm provides an unstable rank estimate on the least sparse 1st layer (Fig 1b) under the small $\theta_{pre} \in \{0.0005, 0.001\}$. However, this phenomenon was not observed in the other layers. Hence, we focus on the layers with reasonable depth, i.e., from 3 to 8. Even though changing $\theta_{pre}$ results in an overall shift of the estimation, the ∥ F along the i-th LB $v^w_i$ at $w$, estimated by the finite difference scheme. The result shows that the estimated rank covers the major variations in the image space. One advantage of the unsupervised method over the supervised method for finding disentangled perturbations is that the discovered semantics are not restricted to the pre-defined attributes. However, we cannot know the number of discovered perturbations without additional inspections. Figure 5 shows that the estimated dimension provides an upper bound on the number of these perturbations.

4. UNSUPERVISED GLOBAL DISENTANGLEMENT EVALUATION

In this section, we investigate two closely related important questions on the disentanglement property of a GAN. As a reminder, global basis refers to the sample-independent semantically meaningful perturbations on a latent space, such as GANSpace and SeFa. In this regard, the global-basis-compatibility represents how well the optimal global basis can work on the target latent space. Specifically, the global-basis-compatibility is defined as the quality of image traversal along the optimal global basis. This is a property of the latent space itself. If the global basis does not exist in the first place, any proposed global basis can only show limited success in that latent space, no matter how we find it. Then, the two questions are as follows:

Q1. Can we evaluate the global-basis-compatibility of the latent space without posterior assessment? (Choi et al., 2022b)
Q2. Can we evaluate the disentanglement without attribute annotations? (Locatello et al., 2019)

These two questions are closely related because the ideal disentanglement includes a global basis representation where each element corresponds to the attribute-coordinate. In this paper, the global disentanglement property of a latent space denotes this global representability along the attribute-coordinate. To answer these questions, we propose an unsupervised global disentanglement metric, called Distortion. We evaluated the global-basis-compatibility by the image fidelity under global basis perturbation (Q1) and the disentanglement by semantic factorization (Q2). Our experimental results show that our proposed metric has a high correlation with the global-basis-compatibility (Q1) and the supervised disentanglement score (Q2) on various StyleGANs. (See the appendix for robustness of Distortion to $\theta_{pre}$.)

Global Disentanglement Score. Intuitively, our global disentanglement score assesses the inconsistency of intrinsic tangent space for each latent manifold.
The framework of analyzing the semantic property of a latent space via its tangent space was first introduced in Choi et al. (2022b). Choi et al. (2022b) suggested this framework, inspired by the observation that each basis vector (LB) of a tangent space corresponds to a local disentangled latent perturbation. In this work, we develop this idea and propose a layer-wise score for the global disentanglement property. Following Choi et al. (2022b), we employ the Grassmannian (Boothby, 1986) metric to measure a distance between two tangent spaces. In particular, we use a dimension-normalized version of the Geodesic Metric (Ye & Lim, 2016). We chose the Geodesic Metric instead of the Projection Metric (Karrasch, 2017) because of its better discriminability (see the appendix for details). The dimension-normalized version is adopted because the local dimension changes according to its estimated region. For two $k$-dimensional subspaces $W, W'$ of $\mathbb{R}^n$, let $M_W, M_{W'} \in \mathbb{R}^{n \times k}$ be the column-wise concatenation of an orthonormal basis for $W, W'$, respectively. Then, the dimension-normalized Geodesic Metric is defined as

$$d^k_{geo}(W, W') = \left( \frac{1}{k} \sum_{i=1}^{k} \theta_i^2 \right)^{1/2},$$

where $\theta_i = \cos^{-1}\left(\sigma_i(M_W^\top M_{W'})\right)$ denotes the $i$-th principal angle between $W$ and $W'$ for the $i$-th singular value $\sigma_i$. Then, the Distortion score $D_M$ for the latent manifold $M$ is evaluated as follows:

1. To assess the overall inconsistency of $M$, measure the expectation of the Grassmannian distance between two intrinsic tangent spaces $T_{w_i} M^{k_i}_{w_i}$ (Eq 5) at two random $w \in M$:

$$I_{rand} = E_{z_i \sim p(z),\, w_i = f(z_i)} \left[ d^k_{geo}\left(T_{w_1} M^k_{w_1}, T_{w_2} M^k_{w_2}\right) \right] \quad \text{for } k = \min(k_1, k_2).$$

2. To normalize the overall inconsistency, measure the same Grassmannian distance between two close $w \in M$ for $\epsilon = 0.1$:

$$I_{local} = E_{z_1 \sim p(z),\, |z_2 - z_1| = \epsilon} \left[ d^k_{geo}\left(T_{w_1} M^k_{w_1}, T_{w_2} M^k_{w_2}\right) \right] \quad \text{for } k = \min(k_1, k_2). \tag{17}$$

3. Distortion of $M$ is defined as the relative inconsistency $D_M = I_{rand} / I_{local}$.
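The three steps above can be sketched as follows, assuming the tangent bases and estimated local dimensions have already been computed for each sampled latent. The function names and the hand-built subspaces in the toy example are hypothetical; in practice each pair would come from the LB and the estimated rank at random or nearby latent variables.

```python
import numpy as np


def geodesic_metric(M_W, M_Wp):
    """Dimension-normalized geodesic metric between two k-dim subspaces.

    Both inputs are (n, k) matrices with orthonormal columns.
    """
    k = M_W.shape[1]
    cosines = np.linalg.svd(M_W.T @ M_Wp, compute_uv=False)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))  # principal angles
    return float(np.sqrt(np.sum(theta ** 2) / k))


def tangent_distance(V1, k1, V2, k2):
    """Compare the top-k LB of two points with k = min(k1, k2)."""
    k = min(k1, k2)
    return geodesic_metric(V1[:, :k], V2[:, :k])


def distortion(random_pairs, local_pairs):
    """D_M = I_rand / I_local; each pair is ((V1, k1), (V2, k2))."""
    i_rand = np.mean([tangent_distance(V1, k1, V2, k2)
                      for (V1, k1), (V2, k2) in random_pairs])
    i_local = np.mean([tangent_distance(V1, k1, V2, k2)
                       for (V1, k1), (V2, k2) in local_pairs])
    return float(i_rand / i_local)


# Toy subspaces of R^4 (hypothetical): span{e1, e2}, span{e3, e4}, and a
# copy of span{e1, e2} whose second axis is tilted by 0.1 rad towards e3.
E = np.eye(4)
plane_12 = E[:, [0, 1]]
plane_34 = E[:, [2, 3]]
tilted = np.stack([E[:, 0], np.cos(0.1) * E[:, 1] + np.sin(0.1) * E[:, 2]], axis=1)

toy_distortion = distortion(
    random_pairs=[((plane_12, 2), (plane_34, 2))],
    local_pairs=[((plane_12, 2), (tilted, 2))],
)
```

The clipping before `arccos` guards against singular values that exceed 1 by floating-point rounding.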

Distortion and Global Disentanglement

In this paragraph, we clarify why a globally disentangled latent space shows a low Distortion score. Assume a latent space $M$ is globally disentangled. Then, there exists an optimal global basis of $M$, where each basis vector corresponds to an image attribute on the entire $M$. By definition, this optimal global basis is a local basis at all latent variables. Assuming that LB finds the local basis (Choi et al., 2022b), each global basis vector would correspond to one LB vector at each latent variable. In this regard, our local dimension estimation finds a principal subset of LB, which includes these corresponding basis vectors. In conclusion, if the latent space is globally disentangled, this principal set of LB at each latent variable would contain the common global basis vectors. Hence, the intrinsic tangent spaces would contain the common subspace generated by this common basis, which leads to a small Grassmannian metric between them. Therefore, the global disentanglement of the latent space leads to a low Distortion score.

Global-Basis-Compatibility

We tested whether Distortion $D_M$ is meaningful in estimating the global-basis-compatibility. We chose GANSpace as a reference global basis because of its broad applicability. The global basis proposed in GANSpace consists of the PCA components of latent variable samples (Härkönen et al., 2020). Hence, we can find a global basis in any intermediate layer. We chose the FID (Heusel et al., 2017) gap between LB and GANSpace as a measure of global-basis-compatibility. FID is measured for 50k samples of perturbed images along the 1st component of LB and GANSpace, respectively. Distortion metric is tested on StyleGAN2 on LSUN Cat (Yu et al., 2015), StyleGAN2 with configs E and F (Karras et al., 2020b)

Traversal Comparison. For a visual comparison of the global-basis-compatibility, we observed the image traversal along the global basis on the max-distorted layer 3, min-distorted layer 7, and layer 8 (W-space) of StyleGAN2 on FFHQ. Our global-basis-compatibility result implies that the global basis would perform better in terms of image fidelity on the min-distorted layer. To impose a more challenging condition, we introduced the subspace traversal (Choi et al., 2022b) along the first and second components of the global basis with a perturbation intensity of 4. In Fig 8, the global basis shows visual artifacts at the corners in the subspace traversal on the max-distorted layer 3 and W-space. In contrast, the min-distorted layer 7 shows a stable traversal without any failure. This result suggests that comparing Distortion scores can serve as a criterion for selecting a better latent space with higher global-basis-compatibility. (See the appendix for additional results.)
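The PCA-based global basis and the two-component subspace traversal discussed in this section can be sketched as below. This is an illustrative toy, not the GANSpace code: the latent samples are synthetic, the helper names are hypothetical, and reading "perturbation intensity 4" as a grid over $[-4, 4]$ in each of the two directions is an assumption.

```python
import numpy as np


def pca_global_basis(latent_samples, num_components):
    """Top principal directions of latent samples, returned as columns."""
    centered = latent_samples - latent_samples.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:num_components].T  # shape (d, num_components)


def subspace_traversal(w, basis, intensity=4.0, steps=5):
    """Grid of latents w + a * b1 + b * b2 with a, b in [-intensity, intensity]."""
    ts = np.linspace(-intensity, intensity, steps)
    b1, b2 = basis[:, 0], basis[:, 1]
    return w + ts[:, None, None] * b1 + ts[None, :, None] * b2


# Synthetic stand-in for intermediate-layer latents (hypothetical data).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 8))
latent_samples = np.tanh(rng.normal(size=(1000, 8)) @ mixing)

basis = pca_global_basis(latent_samples, num_components=2)
w = latent_samples[0]
grid = subspace_traversal(w, basis)  # shape (steps, steps, d)
```

Each entry of `grid` would then be fed through the remaining part of the generator to render the traversal; the corners of the grid are the most strongly perturbed latents.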

5. CONCLUSION

In this paper, we proposed a local intrinsic dimension estimation algorithm for the intermediate latent space in a pre-trained GAN. Using this algorithm, we analyzed the intermediate layers in the mapping network of StyleGANs on various datasets. Moreover, we suggested an unsupervised global disentanglement metric called Distortion. The analysis of the mapping network demonstrates that the Distortion metric shows a high correlation with the global-basis-compatibility and the disentanglement score. Although finding an optimal preprocessing hyperparameter $\theta_{pre}$ was beyond the scope of this work, the proposed metric showed robustness to the hyperparameter. Moreover, our local dimension estimation scheme has the potential to be applied to various models. For example, the adversarial robustness of a classifier can be analyzed by projecting the adversarial noise onto the estimated feature space. We consider this an interesting direction for future research.

A DEFINITION OF DISENTANGLED LATENT SPACE

Disentangled perturbation. In the GAN disentanglement literature, several studies investigated the disentanglement property of the latent space by finding disentangled perturbations that make a disentangled transformation of an image in one generative factor, such as GANSpace (Härkönen et al., 2020), SeFa (Shen & Zhou, 2021), and Local Basis (Choi et al., 2022b). To be more specific, for a latent variable $z \in Z \subset \mathbb{R}^d$, let $f = (f_1, f_2, \dots, f_d)$ be the generative factors of $G(z)$ where $G$ denotes the generator. $T_j(x)$ denotes a transformation of an image $x$ in the $j$-th generative factor. The disentangled perturbation $v_j(z)$ for the base latent variable $z$ on the $j$-th generative factor is defined as follows (the perturbation intensity $\|v_j(z)\|$ and the corresponding change in the $j$-th generative factor $\Delta f_j$ are omitted for brevity):

$$G(z + v_j(z)) = T_j(G(z)).$$

In this paper, the global basis refers to the sample-independent disentangled perturbations on a latent space: $v_j(z) = v_j$ for all $z \in Z$. For example, consider a pre-trained GAN model that generates face images. Then, the disentangled perturbation in this model is the latent perturbation direction that makes the generated face change only in the wrinkles or hair color as presented in Härkönen et al. (2020). This disentangled perturbation is the global basis if all generated images show the same semantic variation when their latent variables are perturbed along it.

Disentangled space. The (globally) disentangled latent space is defined in terms of disentangled perturbations. The latent space is globally disentangled if there exists a global basis for the generative factors of data. In other words, for each generative factor $f_j$ for $1 \le j \le d$, there exists a corresponding latent perturbation direction $v_j$ such that all latent variables show the semantic variation in $f_j$ when perturbed along $v_j$.
Then, we can interpret the vector component of this global basis $v_j$ as having a correspondence with the $j$-th generative factor $f_j$:

$$f_j \longleftrightarrow c_j \quad \text{when } z = \sum_{1 \le j \le d} c_j \cdot v_j \text{ and } f_j \text{ denotes the } j\text{-th generative factor of } G(z).$$

In this paper, we described the above correspondence as the representation of globally disentangled latent space in the attribute-coordinate in Sec 4. This is consistent with the definition of disentanglement introduced in Bengio et al. (2013). For example, consider the dSprites dataset. The dSprites is a synthetic dataset consisting of two-dimensional shape images, which is widely used for disentanglement evaluation. The generative factors of dSprites are shape, scale, orientation, position on the x-axis, and position on the y-axis. Then, the (globally) disentangled latent space for the dSprites dataset is a five-dimensional vector space $Z = \mathbb{R}^5$ where $c_1$ represents the shape, $c_2$ represents the scale, and so on.

B NOISE ESTIMATION OF PSEUDORANK ALGORITHM

For completeness, we include the convergence theorem for the largest eigenvalue of the empirical covariance matrix for Gaussian noise in Johnstone (2001) and the noise estimation algorithm provided in Kritchman & Nadler (2008).

Theorem 1 (Johnstone, 2001). The distribution of the largest eigenvalue $\lambda_1$ of the empirical covariance matrix for $n$ samples of $\mathcal{N}(0, \sigma^2 I_p)$ converges to a Tracy-Widom distribution:

$$P\left(\lambda_1 < \sigma^2 (\mu_{n,p} + s \cdot \sigma_{n,p})\right) \to F_\beta(s) \quad \text{as } n, p \to \infty \text{ with } c = p/n \text{ fixed},$$

where

$$\mu_{n,p} = \frac{1}{n}\left(\sqrt{n - \tfrac{1}{2}} + \sqrt{p - \tfrac{1}{2}}\right)^2, \qquad \sigma_{n,p} = \frac{1}{n}\left(\sqrt{n - \tfrac{1}{2}} + \sqrt{p - \tfrac{1}{2}}\right)\left(\frac{1}{\sqrt{n - 1/2}} + \frac{1}{\sqrt{p - 1/2}}\right)^{1/3},$$

and $F_\beta$ denotes the Tracy-Widom distribution of order $\beta = 1$ for real-valued observations.

Algorithm Solve the following non-linear system of $K+1$ equations involving the $K+1$ unknowns $\hat{\rho}_1, \ldots, \hat{\rho}_K$ and $\sigma^2_{est}$:

$$\sigma^2_{est} - \frac{1}{p-K}\left[\sum_{j=K+1}^{p} \lambda_j + \sum_{j=1}^{K} (\lambda_j - \hat{\rho}_j)\right] = 0,$$

$$\hat{\rho}_j^2 - \hat{\rho}_j\left(\lambda_j + \sigma^2_{est} - \sigma^2_{est}\,\frac{p-K}{n}\right) + \lambda_j \sigma^2_{est} = 0, \qquad 1 \le j \le K.$$

This system of equations can be solved iteratively. See Kritchman & Nadler (2008) for details.
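The iterative solution can be sketched as a simple fixed-point loop. The code below is an illustrative sketch under stated assumptions, not the reference implementation of Kritchman & Nadler (2008): the function names, the initial guess (mean of the trailing eigenvalues), the choice of the larger quadratic root, and the clipping of a negative discriminant are all assumptions made here for the sake of a runnable example.

```python
import numpy as np

def tw_center_scale(n, p):
    """Centering mu_{n,p} and scaling sigma_{n,p} from Theorem 1."""
    a, b = np.sqrt(n - 0.5), np.sqrt(p - 0.5)
    mu = (a + b) ** 2 / n
    sigma = (a + b) / n * (1.0 / a + 1.0 / b) ** (1.0 / 3.0)
    return mu, sigma

def estimate_noise(eigvals, K, n, max_iter=100, tol=1e-12):
    """Fixed-point iteration for (sigma2_est, rho_1..rho_K); requires K < p."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    p = lam.size
    sigma2 = lam[K:].mean()          # assumed initial guess: mean of the tail
    rho = lam[:K].copy()
    for _ in range(max_iter):
        # Quadratic in rho_j: rho^2 - rho * b + lam_j * sigma2 = 0.
        b = lam[:K] + sigma2 - sigma2 * (p - K) / n
        disc = b ** 2 - 4.0 * lam[:K] * sigma2
        # Assumption: take the larger root; clip negative discriminants to 0.
        rho = 0.5 * (b + np.sqrt(np.maximum(disc, 0.0)))
        new_sigma2 = (lam[K:].sum() + (lam[:K] - rho).sum()) / (p - K)
        converged = abs(new_sigma2 - sigma2) < tol
        sigma2 = new_sigma2
        if converged:
            break
    return sigma2, rho

# Toy spectrum: one strong component on top of unit-variance noise.
lam = np.array([10.0, 1.0, 1.0, 1.0, 1.0])
n_samples = 100
sigma2_est, rho_est = estimate_noise(lam, K=1, n=n_samples)
mu_np, sigma_np = tw_center_scale(100, 100)
```

With $K = 0$ the loop reduces to the plain average of all eigenvalues. For $K > 0$ the estimate is slightly larger than the tail average, because each leading eigenvalue $\lambda_j$ contributes its excess $\lambda_j - \hat{\rho}_j$ to the noise estimate.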

C RELATION BETWEEN RANK ESTIMATION ALGORITHM AND OPTIMIZATION

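Theorem 2. The following optimization problem, called Nuclear-Norm Penalization (NNP),

$$\underset{L, E}{\text{minimize}} \;\; \|L\|_* + \gamma \cdot \|E\|_F^2 \quad \text{s.t.} \quad L + E = (\nabla_z f)(z),$$

has a solution $L^* = U \left(\Sigma - \frac{1}{2\gamma} I\right)_+ V^\top$, where $(\nabla_z f)(z) = U \Sigma V^\top$ is the SVD and $(M_+)_{i,j} = \max(M_{i,j}, 0)$.

Proof. Denote $Y := \nabla_z f$ and $h(L) := \|L\|_* + \gamma \|Y - L\|_F^2$. We want to show that $L^*$ minimizes $h(L)$. Since $h$ is convex, the necessary and sufficient condition for this is

$$0 \in \partial h(L^*) = \{2\gamma (L^* - Y) + Z : Z \in \partial \|L^*\|_*\} \iff 2\gamma (Y - L^*) \in \partial \|L^*\|_*.$$

Recall that $Y = U \Sigma V^\top$ and $L^* = U \left(\Sigma - \frac{1}{2\gamma} I\right)_+ V^\top$. We can write $Y = U_1 \Sigma_1 V_1^\top + U_2 \Sigma_2 V_2^\top$ with $\mathrm{diag}(\Sigma_1) > \frac{1}{2\gamma}$ and $\mathrm{diag}(\Sigma_2) \le \frac{1}{2\gamma}$. Then,

$$L^* = U_1 \left(\Sigma_1 - \frac{1}{2\gamma} I\right) V_1^\top, \qquad Y - L^* = U_2 \Sigma_2 V_2^\top + \frac{1}{2\gamma} U_1 V_1^\top = \frac{1}{2\gamma}\left(U_1 V_1^\top + 2\gamma U_2 \Sigma_2 V_2^\top\right).$$

A direct calculation verifies that $Z^* := U_1 V_1^\top + 2\gamma U_2 \Sigma_2 V_2^\top$ meets the condition of Lemma 1: its singular values are $1$ and the diagonal entries of $2\gamma \Sigma_2$, which are at most $1$, so $\|Z^*\|_2 \le 1$; moreover, $\langle Z^*, L^* \rangle = \mathrm{tr}\left(\Sigma_1 - \frac{1}{2\gamma} I\right) = \|L^*\|_*$. Hence $2\gamma (Y - L^*) = Z^* \in \partial \|L^*\|_*$. Therefore, $0 \in \partial h(L^*)$, which completes the proof.

Lemma 1. Let $X \in \mathbb{R}^{m \times n}$ and $f(X) = \|X\|_*$. Then,

$$\partial f(X) = \partial \|X\|_* = \{Z \in \mathbb{R}^{m \times n} : \|Z\|_2 \le 1 \text{ and } \langle Z, X \rangle = \|X\|_*\}.$$

Proof. By definition, $Z \in \partial f(X)$ if and only if

$$f(Y) \ge f(X) + \langle Z, Y - X \rangle, \quad \forall Y \in \mathbb{R}^{m \times n},$$
$$\iff \langle Z, X \rangle - \|X\|_* \ge \langle Z, Y \rangle - \|Y\|_*, \quad \forall Y \in \mathbb{R}^{m \times n},$$
$$\iff \langle Z, X \rangle - \|X\|_* \ge \sup_{Y \in \mathbb{R}^{m \times n}} \left(\langle Z, Y \rangle - \|Y\|_*\right) = \begin{cases} 0, & \text{if } \|Z\|_2 \le 1, \\ \infty, & \text{otherwise.} \end{cases}$$

Hence $\|Z\|_2 \le 1$ and $0 \le \langle Z, X \rangle - \|X\|_* = \langle Z, X \rangle - \sup_{\|M\|_2 \le 1} \langle M, X \rangle \le 0$, thus $\langle Z, X \rangle = \|X\|_*$.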
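Theorem 2 can be checked numerically with a few lines of NumPy. The sketch below is only an illustration: a random matrix stands in for the Jacobian $(\nabla_z f)(z)$, the function names are hypothetical, and the objective uses the squared Frobenius norm, which is the form whose gradient matches the optimality condition in the proof. It builds $L^*$ by soft-thresholding the singular values at $1/(2\gamma)$ and compares the objective at $L^*$ with the objective at randomly perturbed candidates.

```python
import numpy as np

def nnp_solution(Y, gamma):
    """Closed-form minimizer L* = U (Sigma - I/(2*gamma))_+ V^T and its rank."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - 1.0 / (2.0 * gamma), 0.0)   # soft-thresholding
    L_star = (U * s_shrunk) @ Vt                          # scales columns of U
    return L_star, int(np.count_nonzero(s_shrunk))

def nnp_objective(L, Y, gamma):
    """h(L) = ||L||_* + gamma * ||Y - L||_F^2."""
    return np.linalg.norm(L, "nuc") + gamma * np.linalg.norm(Y - L, "fro") ** 2

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 6))        # stand-in for a Jacobian (assumption)
gamma = 0.5                        # threshold 1 / (2 * gamma) = 1.0

L_star, est_rank = nnp_solution(Y, gamma)
h_star = nnp_objective(L_star, Y, gamma)

# Since h is convex, no perturbed candidate should achieve a lower objective.
h_perturbed = [
    nnp_objective(L_star + 0.1 * rng.normal(size=Y.shape), Y, gamma)
    for _ in range(20)
]
```

The rank of $L^*$ equals the number of singular values of $Y$ that exceed $1/(2\gamma)$, which is how the penalization problem relates to a thresholding-based rank estimate.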

D GRASSMANNIAN METRIC FOR DISTORTION -GEODESIC VS. PROJECTION

Our proposed Distortion metric $D = I_{rand} / I_{local}$ is defined as the relative inconsistency of intrinsic tangent spaces on a latent manifold (Sec 4). The inconsistency ($I_{rand}$, $I_{local}$) is measured by the Grassmannian (Boothby, 1986) distance between tangent spaces, specifically by the Geodesic Metric (Ye & Lim, 2016). In this section, we explain why we choose the Geodesic Metric instead of the Projection Metric (Karrasch, 2017) among Grassmannian distances. Informally, the Geodesic Metric provides better discriminability than the Projection Metric. For completeness, we begin with the definitions of the Grassmannian manifold and the two distances defined on it.

Definitions Let $V$ be an $n$-dimensional vector space. The Grassmannian manifold $Gr(k, V)$ (Boothby, 1986) is defined as the set of all $k$-dimensional linear subspaces of $V$. Then, for two $k$-dimensional subspaces $W, W' \in Gr(k, V)$, two Grassmannian metrics are defined as follows:

$$d_{proj}(W, W') = \|P_W - P_{W'}\|, \qquad d_{geo}(W, W') = \left(\sum_{i=1}^{k} \theta_i^2\right)^{1/2}.$$

For the Projection Metric $d_{proj}(W, W')$, $P_W$ and $P_{W'}$ denote the projections onto the respective subspaces and $\|\cdot\|$ represents the operator norm. For the Geodesic Metric $d_{geo}(W, W')$, $\theta_i$ denotes the $i$-th principal angle between $W$ and $W'$. More specifically, $\theta_i = \cos^{-1}(\sigma_i(M_W^\top M_{W'}))$, where $M_W, M_{W'} \in \mathbb{R}^{n \times k}$ are the column-wise concatenations of orthonormal bases for $W$ and $W'$, and $\sigma_i$ represents the $i$-th singular value.

Experiments To test the discriminability of these two metrics, we designed a simple experiment. Let $W, W'$ be two 50-dimensional subspaces of $\mathbb{R}^{512}$; we choose $n = 512$ because the dimension of the intermediate layers in the mapping network is 512. We measure the Grassmannian distance between the two subspaces as we vary $\dim(W \cap W') = k_0$: $W = \langle e_1, e_2, \ldots, e_k \rangle$, $W' =$

E ROBUSTNESS TO PREPROCESSING

In this section, we assess the robustness of Distortion $D_M$ to the preprocessing hyperparameter $\theta_{pre}$. Figure 10 presents the distribution of 1k samples of Distortion before taking an expectation, i.e., $d^k_{geo}\left(T_{w_1} M^k_{w_1}, T_{w_2} M^k_{w_2}\right) / I_{local}$, for each intermediate layer. In Fig 10, increasing $\theta_{pre}$ shifts the Distortion of all layers. However, the relative ordering between the layers remains the same. The low Distortion score of layer 8 provides an explanation for the superior disentanglement of W-space observed in many studies (Karras et al., 2019; Härkönen et al., 2020). Moreover, the results suggest that the min-distorted layer 7 can serve as a similar-or-better alternative.



Figure 1: (a) Overview of Local Dimension Estimation. Our goal is to find dimension k such that the k-dimensional submanifold M k w can properly describe the local latent manifold M w ⊆ R d M . (b) Singular Value Distribution of Jacobian matrix for each subnetwork of the mapping network in StyleGAN2. As the layer gets deeper, many of the singular values are close to zero. This supports our claim of pruning latent dimension by interpreting the near-zero singular values as noise.

Figure 3: Rank Estimation with Sparsity under various n = 1/γ.

Figure 4: Local Dimension Distribution of the intermediate layers in the mapping network.

Figure 5: Local Dimension Evaluation in Image Space where $d$ denotes the estimated local dimension with $\theta_{pre} = 0.01$. Fig 5c shows the image variation intensity $\|\nabla_{v^w_i} g(w)\|_F$ along each LB $v^w_i$.

Figure 6: Correlation between Distortion metric (↓) and FID gap (↓) when $\theta_{pre} = 0.005$. FID gap represents the difference between the FID scores of LB and the global basis (Härkönen et al., 2020). Each point represents the $i$-th intermediate layer in the mapping network.

trend and relative ordering between layers are the same. In accordance with Fig 1b, the intrinsic dimension monotonically decreases as the layer goes deeper. Second, we evaluated the estimated rank on image space (Fig 5). Figures 5a and 5b show the image traversal along the first two axes and along the two axes $(d-1, d)$ around the estimated rank $d$ with $\theta_{pre} = 0.01$. Fig 5c presents the size of the directional derivative $\|\nabla_{v^w_i} g(w)\|_F$ along the $i$-th LB $v^w_i$ at $w$, estimated by a finite difference scheme. The result shows that the estimated rank covers the major variations in the image space. One advantage of unsupervised methods over supervised methods for finding disentangled perturbations is that the discovered semantics are not restricted to pre-defined attributes. However, we cannot know the number of discovered perturbations without additional inspection. Figure 5 shows that the estimated dimension provides an upper bound on the number of these perturbations.
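The finite-difference estimate of the directional derivative can be sketched as follows. This is a minimal illustration under assumptions: the text does not specify which finite difference scheme is used, so a central difference is assumed here, and the generator `g` is replaced by a hypothetical toy linear map (for which the exact answer is known) rather than a real synthesis network.

```python
import numpy as np

def directional_derivative_norm(g, w, v, eps=1e-3):
    """Central-difference estimate of the Frobenius norm of the derivative of g along v at w."""
    diff = (g(w + eps * v) - g(w - eps * v)) / (2.0 * eps)
    return float(np.linalg.norm(diff))   # 2-norm of the flattened array

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 6))             # toy "generator" weights (assumption)

def toy_generator(w):
    """Hypothetical stand-in for g: maps a 6-d latent to a 3x4 'image'."""
    return (A @ w).reshape(3, 4)

w = rng.normal(size=6)
v = rng.normal(size=6)
v /= np.linalg.norm(v)                   # unit-norm perturbation direction

estimate = directional_derivative_norm(toy_generator, w, v)
exact = float(np.linalg.norm(A @ v))     # closed form for a linear map
```

For a real generator, `v` would be the $i$-th Local Basis vector at `w`, and plotting `estimate` against the index $i$ would give a curve of the kind described for Fig 5c.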

Figure 7: Correlation between Distortion metric (↓) and DCI (↑) when θ pre = 0.005. DCI (Eastwood & Williams, 2018) is a supervised disentanglement metric that requires attribute annotations.

on FFHQ to test the generalizability of correlation to the global-basis-compatibility. StyleGAN2 in Fig 6 denotes StyleGAN2 with config F, because config F is the usual StyleGAN2 model. The perturbation intensity is set to 5 in LSUN Cat and 3 in FFHQ. The Distortion metric shows strong positive correlations of 0.98, 0.81, and 0.70 with the FID gap in Fig 6. This result demonstrates that the Distortion metric can serve as an unsupervised criterion for selecting a latent space with high global-basis-compatibility. Before finding a global basis, we can use the Distortion metric as a preliminary check for selecting an appropriate target latent space.
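A layer-wise correlation of this kind can be computed as sketched below. Two points are assumptions: the text does not state which correlation coefficient is reported, so the Pearson coefficient is used here, and the arrays are synthetic placeholders chosen only to make the example runnable. They are not values from the paper.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two 1-d sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Synthetic placeholder values, one entry per intermediate layer.
# These are NOT measurements from the paper.
distortion_per_layer = np.array([0.2, 0.4, 0.6, 0.8])
fid_gap_per_layer = np.array([1.0, 2.5, 2.0, 4.0])

r = pearson_correlation(distortion_per_layer, fid_gap_per_layer)
```

A value of `r` close to 1 would indicate that layers with lower Distortion also tend to have a smaller FID gap, which is the pattern reported in Fig 6.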

Figure 8: Subspace Traversal on the intermediate layers along the global basis. The upper-right corners of the max-distorted layer 3 and layer 8 show visual artifacts. However, the min-distorted layer 7 does not show such a failure. The initial image (center) is traversed along the 1st (horizontal) and 2nd (vertical) components of GANSpace.

Disentanglement Score We assessed the correlation between the unsupervised Distortion metric and a supervised disentanglement score. Following Wu et al. (2020), we adopted the DCI score (Eastwood & Williams, 2018) as the supervised disentanglement score for evaluation, and employed 40 binary attribute classifiers pre-trained on CelebA (Liu et al., 2015) to label generated images. Each DCI score is assessed on 10k samples of latent variables with the corresponding attribute labels. In Fig 7, StyleGAN1, StyleGAN2-e, and StyleGAN2 refer to StyleGAN1 and to StyleGAN2 with config E and config F, respectively, all trained on FFHQ. Note that the DCI experiments are all performed on FFHQ because the DCI score requires attribute annotations. The DCI and Distortion metrics show a strong negative correlation on StyleGAN1 and StyleGAN2-e. The correlation is relatively moderate on StyleGAN2. This moderate correlation arises because the Distortion metric is based on the Grassmannian metric. The Grassmannian metric measures the distance between tangent spaces, while DCI depends on a specific basis of those spaces. Even if the tangent spaces are identical, so that Distortion becomes zero, DCI can have a relatively low value depending on the choice of basis. Hence, in StyleGAN2, the high-distorted layers showed low DCI scores, but the low-distorted layers showed relatively high variance in DCI score. Nevertheless, the strong correlation observed in the other two experiments suggests that, in practice, the basis vector corresponding to a specific attribute has a limited variance in a given latent space. Therefore, the Distortion metric can be an unsupervised indicator for the supervised disentanglement score.

$\langle \{e_1, e_2, \ldots, e_{k_0}\} \cup \{e_{k+1}, \ldots, e_{2k-k_0}\} \rangle$, where $\{e_i\}_{1 \le i \le n}$ denotes the standard basis of $\mathbb{R}^n$. Fig 9 reports the results. The Geodesic Metric reflects the degree of intersection between two subspaces. As we increase the dimension of the intersection, the Geodesic Metric decreases. However, the Projection Metric cannot discriminate the intersected dimension until the intersection covers the entire subspace, i.e., until $W = W'$.
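The experiment can be reproduced in a few lines. The following is a sketch under the setup described above ($n = 512$, $k = 50$, subspaces spanned by standard basis vectors); the function names are illustrative and indices are 0-based in the code.

```python
import numpy as np

def principal_angles(M_w, M_wp):
    """Principal angles between the column spans of two orthonormal bases."""
    s = np.linalg.svd(M_w.T @ M_wp, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))   # clip guards against rounding

def geodesic_metric(M_w, M_wp):
    """d_geo: Euclidean norm of the vector of principal angles."""
    return float(np.linalg.norm(principal_angles(M_w, M_wp)))

def projection_metric(M_w, M_wp):
    """d_proj: operator (spectral) norm of the difference of projections."""
    diff = M_w @ M_w.T - M_wp @ M_wp.T
    return float(np.linalg.norm(diff, 2))

def standard_subspace(indices, n):
    """Orthonormal basis matrix whose columns are standard basis vectors."""
    return np.eye(n)[:, list(indices)]

n, k = 512, 50
W = standard_subspace(range(k), n)

def intersecting_subspace(k0):
    """Subspace sharing exactly k0 basis vectors with W."""
    idx = list(range(k0)) + list(range(k, 2 * k - k0))
    return standard_subspace(idx, n)

results = {}
for k0 in (0, 10, 25, 49, 50):
    Wp = intersecting_subspace(k0)
    results[k0] = (geodesic_metric(W, Wp), projection_metric(W, Wp))
```

In this construction, $k - k_0$ principal angles equal $\pi/2$ and the rest are zero, so $d_{geo} = \frac{\pi}{2}\sqrt{k - k_0}$ decreases with $k_0$. The operator norm of $P_W - P_{W'}$ equals the sine of the largest principal angle, which stays at 1 for every $k_0 < k$.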

Figure 9: Grassmannian metric between two 50-dimensional subspaces $W, W' \in Gr(50, \mathbb{R}^{512})$ for each intersected dimension $k_0 = \dim(W \cap W')$. While the Geodesic Metric monotonically decreases as more dimensions intersect, the Projection Metric cannot discriminate $0 \le k_0 \le 49$.

Figure 10: Robustness of Distortion metric D (↓) to θ pre of StyleGAN2 on FFHQ.

Figure 11: Architecture of StyleGANs. Our analysis in Sec 4 is performed in the intermediate layers of the mapping network.

Figure 13: Local Dimension Distribution of the intermediate layers in the mapping network of StyleGAN2 on FFHQ. Each panel presents the distribution of the estimated local dimension at one intermediate layer as we vary $\theta_{pre}$. Each distribution is computed from 1k samples. The algorithm gives a rather unstable dimension estimate on the least sparse first layer (Fig 1b) due to its isotropic Gaussian assumption. However, this phenomenon is not observed in the layers of moderate depth, i.e., from 3 to 8. As we introduce a higher preprocessing ratio $\theta_{pre}$, the algorithm gives stricter, i.e., smaller, dimension estimates. Nevertheless, the relative trend between layers is the same: the deeper the latent manifold, the smaller its dimension.

Figure 15: Robustness of Distortion metric $D$ to $\theta_{pre}$ of StyleGAN2 on FFHQ. Each boxplot shows the distribution of 1k samples of Distortion before taking an average, i.e., $d^k_{geo}\left(T_{w_1} M^k_{w_1}, T_{w_2} M^k_{w_2}\right) / I_{local}$ (Sec 4), for each intermediate layer. Increasing $\theta_{pre}$ slightly increases Distortion for all layers. Nevertheless, the relative ordering between layers is robust under the change of $\theta_{pre}$.

Figure 18: Correlation between Distortion metric and FID gap of StyleGAN2 on LSUN Cat.

Figure 20: Correlation between Distortion metric and FID gap of StyleGAN2 on LSUN Church.

Figure 21: Correlation between Distortion metric and DCI of StyleGAN2 on FFHQ. Each DCI (Eastwood & Williams, 2018) score is evaluated for 10k samples of generated images, while the attribute labels are generated by 40 attribute classifiers pre-trained on CelebA (Liu et al., 2015). As in Fig 16, each point and the red line represent an intermediate layer and the linear regression, respectively. The Distortion and DCI scores show a negative correlation.

Figure 25: Subspace Traversal (Choi et al., 2022b) on the min-distorted (7th) and max-distorted (3rd) intermediate layers along the global basis (Härkönen et al., 2020) and Local Basis (Choi et al., 2022b) of StyleGAN2 on FFHQ. The initial image (center) is traversed along the 1st (horizontal) and 2nd (vertical) components of the chosen traversal directions with a perturbation intensity of 9. The global basis shows decent image quality on the min-distorted layer, similar to Local Basis. However, on the max-distorted layer, the subspace traversal along the global basis exhibits significant failures at the corners, such as image collapse (lower-left), visual artifacts (lower-right), and unnatural transformations (top-left).

Figure 26: Subspace Traversal on StyleGAN2-FFHQ. The upper-left corner of layer 3 is severely deteriorated.

Figure 29: Subspace Traversal on StyleGAN2-LSUN Church. The upper-left corner of layer 3 is severely deteriorated.

G ADDITIONAL EXPERIMENTAL RESULTS

[Figure: Final Loss versus Index of Singular Value, for θ pre = 0.0005, 0.001, 0.005, and 0.01.]

The small final loss implies that the linear perturbation along that Local Basis component stays inside the learned latent manifold. For every perturbation intensity, there is a transition point from a slow increase to a sharp increase, which is interpreted as an escape from the manifold. The results demonstrate that the proposed dimension estimation algorithm finds a reasonable point without crossing the transition point. In our case, these results are interpreted as choosing the principal part of the manifold without overestimating its dimension.

