MULTI-DOMAIN IMAGE GENERATION AND TRANSLATION WITH IDENTIFIABILITY GUARANTEES

Abstract

Multi-domain image generation and unpaired image-to-image translation are two important and related computer vision problems. The common technique for both tasks is to learn a joint distribution from multiple marginal distributions. However, it is well known that infinitely many joint distributions can derive the same marginals. Hence, suitable constraints are necessary to address this highly ill-posed problem. Inspired by recent advances in nonlinear Independent Component Analysis (ICA) theory, we propose a new method to learn the joint distribution from the marginals by enforcing a specific type of minimal change across domains. We report one of the first results connecting multi-domain generative models to identifiability and show why identifiability is essential and how to achieve it theoretically and practically. We apply our method to five multi-domain image generation and six image-to-image translation tasks. The superior performance of our model supports our theory and demonstrates the effectiveness of our method. The training code is available at https://github.com/Mid-Push/i-stylegan.

1. INTRODUCTION

Multi-domain image generation and unpaired image-to-image translation are two important and closely related problems in computer vision and machine learning. They have many promising applications such as domain adaptation (Liu & Tuzel, 2016; Hoffman et al., 2018; Murez et al., 2018; Wang & Jiang, 2019) and medical analysis (Armanious et al., 2019; 2020; Kong et al., 2021). As shown in Fig. 1, multi-domain image generation takes as input random noise ϵ and a domain label u, and aims to generate image tuples in which the images share the same content, e.g., different facial expressions of the same person. The second task takes as input an image in one domain and a target domain label u, and aims to generate another image that is in the target domain but shares the content of the input, e.g., the output image has the same identity but a different facial expression from the input image. Both tasks can be viewed as instantiations of the joint distribution learning problem. A joint distribution of multi-domain images is a probability density function that assigns a density value to each joint occurrence of images in different domains, such as images of the same person with different facial expressions. Once the joint distribution is learned, it can be used to generate meaningful tuples (the first task) and to translate an input image into another domain without content distortion (the second task). If the correspondence across domains is given (e.g., the identity), we can apply supervised approaches to learn the joint distribution easily. However, collecting corresponding data across domains can be prohibitively expensive. For example, collecting different facial expressions of the same person may require controlled experiments. By contrast, collecting image domains without correspondence can be relatively cheap, e.g., facial expressions of different people can be easily accessed online (once permission is granted).
Therefore, we consider the problem of unsupervised joint distribution learning of P(x^(1), x^(2), ..., x^(d)), where we are only given multiple marginal distributions in different yet related domains {P(x^(i))}_{i=1}^{d}, with d the number of domains and x^(i) the images in domain i. However, there can be infinitely many joint distributions that produce the same marginals (Lindvall, 2002). For example, we can apply conditional GANs (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018; Brock et al., 2018) to match the marginal distribution in each domain; they learn a joint distribution implicitly, and we can sample a tuple ⟨x^(1), ..., x^(d)⟩ from the learned joint distribution using the same random noise with different domain labels. However, there is no guarantee that the joint distribution learned by conditional GANs is optimal, and it may drop the correspondence in the generated tuples (e.g., Fig. 5(b)). To tackle this problem, CoGAN (Liu & Tuzel, 2016) uses a different generator for each domain and shares the weights of the high-level layers across generators. JointGAN (Pu et al., 2018) factorizes the joint distribution (e.g., P(x^(1), x^(2)) = P(x^(1))P(x^(2)|x^(1))) and learns one marginal and one conditional distribution with cycle consistency (Zhu et al., 2017). While existing approaches (e.g., Liu et al. (2017); Pu et al. (2018)) add constraints to remove unwanted joint distributions, they are not guaranteed to find the true joint distribution. If we are unsure about the learned joint distribution, we cannot guarantee that the generated tuples are meaningful. Learning the true joint distribution seems impossible, since we only have access to the marginal distributions and domain labels, and not to the latent content and style variables.
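To make the ill-posedness concrete, here is a minimal NumPy sketch (illustrative, not from the paper) of two different joint distributions with identical marginals: an independent coupling and a comonotone coupling of two standard Gaussians. Both match the marginals perfectly, yet one preserves correspondence and the other destroys it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)
w = rng.standard_normal(n)

# Coupling 1: x1 and x2 drawn independently.
x1_ind, x2_ind = z, w
# Coupling 2: x2 is a deterministic (comonotone) copy of x1.
x1_dep, x2_dep = z, z

# Both couplings have (empirically) the same N(0, 1) marginals...
for s in (x1_ind, x2_ind, x1_dep, x2_dep):
    assert abs(s.mean()) < 0.05 and abs(s.std() - 1.0) < 0.05

# ...but completely different joint behavior: correlation 0 vs 1.
corr_ind = np.corrcoef(x1_ind, x2_ind)[0, 1]
corr_dep = np.corrcoef(x1_dep, x2_dep)[0, 1]
```

Matching marginals alone cannot distinguish these two couplings, which is why additional constraints on the generation process are needed.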
Fortunately, recent advances in nonlinear Independent Component Analysis (ICA) theory (Hyvarinen & Morioka, 2016; 2017; Hyvarinen et al., 2019; Khemakhem et al., 2020a;b; Von Kügelgen et al., 2021; Kong et al., 2022) show that deep nonlinear latent variable models are identifiable with auxiliary variables (e.g., the domain label in our case), meaning that we can recover the latent variables (e.g., content and style in our case) up to some component-wise transformation. Inspired by these advances, we propose a new method to learn the true joint distribution from the marginals by enforcing a specific type of minimal change across domains. Specifically, we assume that the influence of domain information (i.e., the underlying changes across domains) is minimal in the data generation process, e.g., only the facial expression changes and no change of hair style is allowed. To achieve this minimal change, we inject the domain information through a component-wise strictly increasing transformation instead of an arbitrary complex transformation. In addition, we assume that the number of underlying components affected by the domain information (and thus with changing distributions) is minimal. We then show that if the influence of domain information is minimal, the true joint distribution can be recovered from the marginal distributions. Afterwards, we can use the learned joint distribution to sample meaningful tuples and translate input images into another domain without content distortion. Our method can be applied to datasets where the content across domains is aligned, and most existing datasets in multi-domain image generation and translation satisfy this requirement, e.g., facial image, animal face, and digit datasets. Our method may not be as effective when some domains contain unaligned contents.
We may need to pay more attention to artwork-related datasets, since different painters focus on different objects in their paintings (e.g., there are many fruit paintings by Cezanne and landscape paintings by Monet). Our proposed method makes several theoretical and practical contributions: 1. We provide the properties of an image generation process under which the true joint distribution can be recovered from marginal distributions. 2. In light of our theoretical results, we provide a practical method for multi-domain image generation. The proposed method can automatically determine the underlying dimension of the latent variables with changing distributions. Our method achieves promising results across five image generation tasks. 3. We propose to encourage the mapping function in the image translation task to preserve the correspondence learned by our image generation model. The significant gain over the baseline methods on six image translation tasks demonstrates the effectiveness of our technique.

2. RELATED WORK

Multi-domain image generation and translation. Generative Adversarial Network (GAN) (Goodfellow et al., 2014) performs adversarial training between a generator and a discriminator; at convergence, the distribution of samples produced by the generator matches the real data distribution. Conditional GAN (Mirza & Osindero, 2014) is a variant of GAN that incorporates additional information and has been widely applied to class-conditional image generation (Brock et al., 2018). CoGAN (Liu & Tuzel, 2016) learns the joint distribution by sharing the higher layers of two generators. JointGAN (Pu et al., 2018) factorizes the joint distribution into a marginal and a conditional, and regularizes the conditional with cycle consistency. RegCGAN (Mao & Li, 2018) penalizes the distance between the features of the synthesized pairs. The methods of Liu & Tuzel (2016); Pu et al. (2018); Mao & Li (2018) are shown to be effective on some tasks, but they lack identifiability guarantees that they recover the true joint distribution (we provide a formal definition of identifiability later). Unpaired image-to-image translation relies on additional assumptions to address the task, among which cycle consistency (Zhu et al., 2017) is arguably the most widely used. However, it has been argued that cycle consistency may not be enough to learn a good mapping (Alami Mejjati et al., 2018; Xie et al., 2022). Therefore, we propose a new regularization with generated tuples to help find a good mapping. We provide more related work in appendix C.

Nonlinear ICA. Nonlinear ICA aims to recover independent latents from data generated by nonlinear invertible transformations of underlying independent variables (Hyvarinen & Morioka, 2016; 2017; Hyvarinen et al., 2019). Recent works have shown that the true latent variables may be identifiable given additional information (Khemakhem et al., 2020a; Gresele et al., 2020; Locatello et al., 2020; Shu et al., 2019; Zimmermann et al., 2021; Hälvä & Hyvarinen, 2020; Klindt et al., 2020a). Khemakhem et al. (2020a) prove that the latent variable is identifiable if the prior distribution is conditionally factorized. Von Kügelgen et al. (2021) show that the latent content variable is block-identifiable given two views of the same image, which is not applicable in our case since we do not have paired data. In the next section, we show how our formulated problem relates to a variant of nonlinear ICA and how we can establish the identifiability of the changes across domains.

3. UNSUPERVISED JOINT DISTRIBUTION LEARNING

In this section, we first provide the formulation of our conditional generative model and some additional conditions under which the true joint distribution is identifiable. Then we provide a practical implementation based on conditional GAN to achieve unsupervised multi-domain image generation. Finally, we propose a novel regularization technique to improve unpaired image-to-image translation.

3.1. IDENTIFIABILITY OF THE JOINT DISTRIBUTION

Given d marginal distributions {P_θ(x^(i))}_{i=1}^{d} derived from the true but unknown joint distribution P_θ(x^(1), ..., x^(d)), where θ ∈ Θ is a vector of parameters, we need to recover the true joint distribution in order to generate meaningful image tuples or translate images without content distortion. To recover the true joint distribution, it needs to be identifiable. Formally, the true joint distribution is identifiable if the following holds:

∀θ′: {P_θ(x^(i)) = P_θ′(x^(i))}_{i=1}^{d} ⇔ P_θ(x^(1), x^(2), ..., x^(d)) = P_θ′(x^(1), x^(2), ..., x^(d)).    (1)

That is, if another model with parameter θ′ ∈ Θ matches the marginal distributions in each domain, then the true joint distribution is also matched perfectly by the model with parameter θ′. This is a well-known ill-posed problem: the joint distribution is not identifiable from marginal distributions without further assumptions. Motivated by the multi-domain image generation and translation problems, we define the following data generation process using a latent variable model (we provide the graphical model in appendix D):

P_θ(x^(1), ..., x^(d)) = ∫⋯∫ ∏_{i=1}^{d} P_θ(x^(i) | z_c, z_s^(i)) P_θ(z_c) ∏_{i=1}^{d} P_θ(z_s^(i)) dz_c dz_s^(1) ... dz_s^(d),    (2)

where z_c is the common content and z_s^(i) is the style in the i-th domain. For instance, in expression generation, z_c represents the identity of a person and z_s represents the expression; a data point sampled from the joint distribution of our model would then be one person with different expressions. In practice, we may not have samples from the joint distribution; instead, we only have samples from the marginal distributions of P_θ(x^(1), ..., x^(d)):

P_θ(x^(u)) = ∫∫ P_θ(x | z_c, z_s) P_θ(z_c) P_θ(z_s | u) dz_c dz_s,    (3)

where u is the domain label and P_θ(z_s | u = i) = P_θ(z_s^(i)).
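The generation process above can be sketched numerically. In the following toy NumPy example (illustrative; g is a stand-in for the true deep generator, and the dimensions are arbitrary), a sample from the joint distribution shares one content vector z_c across all d domains while each domain draws its own style z_s^(i):

```python
import numpy as np

rng = np.random.default_rng(1)
n_c, n_s, d = 4, 2, 3   # content dim, style dim, number of domains

def g(z_c, z_s):
    """Toy stand-in for the true generation function."""
    return np.concatenate([np.tanh(z_c), np.tanh(z_s)])

# One draw from the JOINT distribution P(x^(1), ..., x^(d)):
z_c = rng.standard_normal(n_c)                  # shared content
xs = [g(z_c, rng.standard_normal(n_s))          # per-domain style z_s^(i)
      for _ in range(d)]

# All d "images" in the tuple agree on the content part of the output
# and differ only in the style part.
```

Marginal sampling (observing one domain at a time) sees only a single x^(u) from each such tuple, which is why the coupling across domains must be recovered rather than observed.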
To recover the true joint distribution, we first recover the true content z_c and true style z_s up to some transformation, and then establish the identifiability of the true joint distribution. Recovering z_c and z_s seems impossible since we only have the observations {x^(u)} and the domain label u. Fortunately, recent advances in nonlinear ICA theory have shown that deep nonlinear latent variable models are identifiable given an auxiliary variable (e.g., the domain label in our case) (Khemakhem et al., 2020a;b; Von Kügelgen et al., 2021; Kong et al., 2022). Specifically, we define

P_θ(z_s | u) = P(z̃_s) / |J_{f_u}|;  z_s = f_u(z̃_s),  z̃_s ∼ P(z̃_s),    (4)

where f_u is a domain-specific component-wise strictly increasing transformation (i.e., in each domain, each dimension of z_s is a strictly increasing function of the corresponding dimension of z̃_s) and |J_{f_u}| is the absolute value of the determinant of the Jacobian matrix of f_u. P(z̃_s) is a prior distribution, which we set to N(0, I). We further define

P_θ(z_c) = P(z_c);  P_θ(x | z_c, z_s) = δ(x − g(z_c, z_s)),    (5)

where g is the true generation function, δ(·) is the Dirac delta function, and P(z_c) is a prior distribution, which we also set to N(0, I) in our model. We now have θ = (f_u, g), and we can train a model with parameter θ′ = (f̂_u, ĝ) to match the marginal distributions P_θ(x^(i)). Corresponding to the true variables (z_c, z_s, x^(u)), we also have the learned ones (ẑ_c, ẑ_s, x̂^(u)). We now show that the true content z_c and style z_s are identifiable up to some transformations. We first define z = [z_c; z_s], n = dim(z), n_s = dim(z_s), n_c = dim(z_c); Z ⊆ R^n is the domain of the latent variable z, Z_c ⊆ R^{n_c} the domain of z_c, Z_s ⊆ R^{n_s} the domain of z_s, and U the support of the distribution of u. We say z_s is component-wise identifiable if there exists a component-wise invertible transformation h_s such that ẑ_s = h_s(z_s) for the recovered ẑ_s.
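The prior P_θ(z_s | u) above is the pushforward of a standard Gaussian z̃_s through the component-wise increasing map f_u. As a numeric sanity check (assuming, purely for illustration, an affine f_u(z̃) = a_u ⊙ z̃ + b_u with a_u > 0), the change-of-variables density P(z̃_s)/|J_{f_u}| agrees with the closed-form pushforward density:

```python
import numpy as np

def std_normal_logpdf(z):
    """Log density of N(0, 1), applied element-wise."""
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

# Component-wise strictly increasing f_u(z) = a * z + b with a > 0
# (an illustrative choice; the paper learns f_u).
a, b = np.array([0.5, 2.0]), np.array([1.0, -1.0])
z_tilde = np.array([0.3, -0.7])        # z̃_s ~ N(0, I)
z_s = a * z_tilde + b                  # z_s = f_u(z̃_s)

# Change of variables: log P(z_s|u) = log P(z̃_s) - log |J_{f_u}|,
# with |J_{f_u}| = prod(a) for an affine, component-wise map.
log_p_change = std_normal_logpdf(z_tilde).sum() - np.log(a).sum()

# Closed form: the pushforward of N(0, I) under z -> a*z + b is the
# product of Normal(mean=b_i, std=a_i) densities.
log_p_direct = (std_normal_logpdf((z_s - b) / a) - np.log(a)).sum()
```

The two log densities coincide, confirming the Jacobian correction in the definition of P_θ(z_s | u).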
We say the content z_c is block-wise identifiable if there exists an invertible function h_c such that ẑ_c = h_c(z_c) for the recovered ẑ_c.

Lemma 3.1. Suppose the underlying data generation process is consistent with (3,4,5) and the following assumptions hold:
• A1 (Smooth and Positive Density): The probability density function of the latent variables is smooth (the second-order derivative of the log density exists) and positive, i.e., P(z|u) is smooth and P(z|u) > 0 for all z ∈ Z and u ∈ U.
• A2 (Conditional Independence): Conditioned on u, the components of z are mutually independent, which implies P(z|u) = ∏_{i=1}^{n} P(z_i|u).
• A3 (Linear Independence): For any z_s ∈ Z_s ⊆ R^{n_s}, there exist 2n_s + 1 values of u, i.e., u_j with j = 0, 1, ..., 2n_s, such that the 2n_s vectors w(z_s, u_j) − w(z_s, u_0), j = 1, ..., 2n_s, are linearly independent, where the vector w(z_s, u) is defined as

w(z_s, u) = ( ∂q_{n_c+1}(z_{n_c+1}, u)/∂z_{n_c+1}, ..., ∂q_n(z_n, u)/∂z_n, ∂²q_{n_c+1}(z_{n_c+1}, u)/∂z²_{n_c+1}, ..., ∂²q_n(z_n, u)/∂z²_n ),

with q_i(z_i, u) = log P(z_i|u).
• A4 (Domain Variability): For any set A_z ⊆ Z with the following two properties: (1) A_z has nonzero probability measure, i.e., P({z ∈ A_z} | {u = u′}) > 0 for any u′ ∈ U; and (2) A_z cannot be expressed as B_{z_c} × Z_s for any set B_{z_c} ⊂ Z_c; there exist u_1, u_2 ∈ U such that ∫_{z∈A_z} P(z|u_1) dz ≠ ∫_{z∈A_z} P(z|u_2) dz.

Then, by matching the marginal distributions {P_θ(x^(i))}_{i=1}^{d} of each domain, the style z_s is component-wise identifiable and the content z_c is block-wise identifiable.

We provide the proof and conditions in the supplementary material. Intuitively, assumptions A3 and A4 require that the distribution P(z|u) varies sufficiently across domains, so that we have enough contrastive information to achieve identifiability. A3 requires that the changes in the probability density functions are complex enough that they do not always lie within a (2n_s + 1)-dimensional subspace. Kong et al.
(2022) provide a similar result in the context of domain adaptation, but it requires the input and output dimensions of the generating function g to be equal. If we use a GAN, the dimension of the input noise and that of the output image obviously differ, so our result is more general. After recovering the true content z_c via ẑ_c and the true style z_s via ẑ_s, we now proceed to the joint distribution identifiability problem. A main challenge is the indeterminacy of the recovered content and style caused by the unknown transformations h_c and h_s. Fortunately, we show that this indeterminacy can be removed and that the recovered joint distribution is the true one.

Theorem 3.2. If 1) the underlying data generation process is consistent with (3,4,5), 2) assumptions A1-A4 hold, and 3) the marginal distributions are matched, i.e., P_θ′(x^(i)) = P_θ(x^(i)) for every domain i ∈ [d], then the true joint distribution is identical to that produced by the model with parameter θ′, i.e., P_θ′(x^(1), x^(2), ..., x^(d)) = P_θ(x^(1), x^(2), ..., x^(d)).

The proof is provided in appendix F. Unlike existing approaches, this theorem guarantees that we are able to recover the true joint distribution under some conditions. In other words, the tuples generated by the learned model with parameter θ′ can also be viewed as being sampled from the true generative model. As a consequence, the elements of a generated tuple ⟨x^(1), ..., x^(d)⟩ have different styles but share the same content. We can apply this model directly to the challenging multi-domain image generation task. As for unpaired image-to-image translation, we will show that the generated tuples can also help improve translation performance.

3.2. A PRACTICAL IMPLEMENTATION OF MULTI-DOMAIN IMAGE GENERATION

In this section, we provide a practical implementation of our conditional generative model for multi-domain image generation. Given images from d domains, we would like to train a conditional GAN such that the generated tuples ⟨G(ϵ, 1), ..., G(ϵ, d)⟩ share the content, where G is the generator, ϵ is random Gaussian noise, and u ∼ {1, ..., d}. In order to generate meaningful tuples, we build our conditional GAN following (3, 4, 5). Given random noise ϵ ∼ N(0, I) and a domain label u, a naive approach would split the noise into two parts, i.e., ϵ = [ẑ_c; z̃_s], and apply the component-wise strictly increasing transformation f̂_u to z̃_s. The problem is that the true latent variables z_c, z_s are only identifiable when the dimensions match, i.e., dim(ẑ_c) = n_c = dim(z_c) and dim(ẑ_s) = n_s = dim(z_s), and we usually do not have access to the true dimensions n_c and n_s. To address this, we assume that the total dimension is matched, i.e., n_s + n_c = dim(ẑ_c) + dim(ẑ_s), so we only need to determine n_s. We could treat n_s as a hyper-parameter and sweep over possible values, but this is expensive since the value would have to be determined for each dataset. Instead, we propose a simple way to let the network learn the optimal dimension automatically with an additional regularization term. The whole computation flow is shown in Fig. 2. Specifically, we apply the component-wise strictly increasing transformation f̂_u to all components of the input noise ϵ, multiply the result with a trainable mask m, and add it back to the input noise:

ẑ = ϵ + m ⊙ f̂_u(ϵ),  G(ϵ, u) = ĝ(ẑ_c, ẑ_s),    (6)

where the output ẑ is the input to the generator G. The mask m has the same dimension as the noise ϵ and its elements lie in [0, +∞). For any element i, if m_i = 0, then ẑ_i only contains information from the shared ϵ and belongs to ẑ_c; otherwise, ẑ_i contains the domain information u and belongs to ẑ_s.
To encourage the network to determine the optimal dimension automatically, we apply an L1 loss on the mask:

L_sparsity = ∥m∥_1.    (7)

Figure 2: The computation flow of our model for multi-domain image generation. ϵ denotes the input noise, u is the domain index, f̂_u is the domain-specific component-wise strictly increasing transformation, m is the trainable mask, ẑ_c is the content, which is shared across domains, ẑ_s is the style, which changes across domains, and x̂_u is the output image in domain u.

Finally, we can proceed as in a normal conditional GAN. We introduce a conditional discriminator D, and the training objective is L_gan = E[log D(x, u)] + E[log(1 − D(G(ϵ, u), u))]. Our full objective for multi-domain image generation is L_generation = L_gan + λ L_sparsity, where λ controls the influence of the domain u. We found that λ = 0.1 works well across all datasets and is more stable than tuning the dimension n_s (see section 4.1.2). Unpaired image-to-image translation aims to map input images to a target domain while preserving important content information. This task can also be formulated as a joint distribution learning problem (Liu et al., 2017). For example, given two domains x^(u0) and x^(u1), image translation aims to learn a reasonable conditional distribution P_θ′(x^(u1) | x^(u0)) through a mapping function F, which should be close to the true conditional P_θ(x^(u1) | x^(u0)). Therefore, we can use our multi-domain image generation model to help find a proper conditional distribution. The differences between the two tasks are visualized in Fig. 1.
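The masked injection of (6) together with the sparsity loss (7) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an affine form f̂_u(ϵ) = exp(a_u) ⊙ ϵ + b_u for the component-wise strictly increasing map (the actual model learns f̂_u inside the generator); all names are illustrative.

```python
import numpy as np

def monotone_injection(eps, u, scales, shifts, mask):
    """z_hat = eps + m * f_u(eps), as in Eq. (6).

    f_u(eps) = exp(scales[u]) * eps + shifts[u] is strictly increasing
    in each component (exp(.) > 0). Dimensions with mask == 0 carry only
    the shared content (z_c); dimensions with mask > 0 also carry the
    domain-dependent style (z_s).
    """
    f_u = np.exp(scales[u]) * eps + shifts[u]
    return eps + mask * f_u

def sparsity_loss(mask):
    """L_sparsity = ||m||_1, pushing unneeded style dimensions to zero."""
    return np.abs(mask).sum()

# Toy setup: 8-dim noise, 3 domains, only the last 2 dims are style dims.
rng = np.random.default_rng(0)
eps = rng.standard_normal(8)
scales = rng.standard_normal((3, 8)) * 0.1
shifts = rng.standard_normal((3, 8)) * 0.1
mask = np.zeros(8); mask[6:] = 0.5       # m_i >= 0; zeros -> content dims

z0 = monotone_injection(eps, 0, scales, shifts, mask)
z1 = monotone_injection(eps, 1, scales, shifts, mask)
# Content dims are identical across domains; style dims differ with u.
```

Note how the same noise ϵ fed with different domain labels yields vectors that agree exactly on the masked-out (content) coordinates, which is what makes the generated tuples share content.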

3.3. APPLICATION: UNPAIRED IMAGE-TO-IMAGE TRANSLATION

We employ StarGAN-v2 (Choi et al., 2020) as our backbone, though other methods are also applicable. StarGAN-v2 mainly consists of a mapping function F and a style encoder E. The training loss is L_stargan = L_adv + L_cyc + L_sty − λ_div L_div, where L_adv, L_cyc, L_sty, and L_div are used for distribution matching, cycle consistency, style reconstruction, and generation diversity, respectively. For more details, we refer readers to the original paper or our appendix J.1. After matching the marginal distributions, G(ϵ, u) allows us to sample meaningful tuples by changing the value of the domain label u, e.g., ⟨x^(u0) = G(ϵ, u_0), x^(u1) = G(ϵ, u_1)⟩. We can then use them to further regularize the mapping function F as L_tuple = E∥F(x^(u0), ŝ_1) − x^(u1)∥_1, where ŝ_1 = E(x^(u1), u_1) is the style code of domain u_1. Through L_tuple, we encourage the mapping function F to reconstruct the corresponding image G(ϵ, u_1) from the input image G(ϵ, u_0); that is, F is trained to preserve the correspondence between images of the generated tuples. Since we are able to recover the true joint distribution, L_tuple encourages the mapping function to produce the true conditional distribution P_θ(x^(u1) | x^(u0)). Our full objective for unpaired image translation is L_translation = L_stargan + λ_tuple L_tuple, where λ_tuple is a hyper-parameter controlling the influence of our proposed tuple regularization.
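The tuple regularizer is a plain L1 reconstruction loss on generated pairs. Below is a minimal sketch with arrays standing in for images and deliberately simplistic, hypothetical stand-ins for F and E (in the paper these are the StarGAN-v2 mapping network and style encoder):

```python
import numpy as np

def tuple_loss(F, E, x_u0, x_u1, u1):
    """L_tuple = E || F(x^(u0), s1) - x^(u1) ||_1  with  s1 = E(x^(u1), u1).

    Pushes the translator F to map x^(u0) onto its corresponding
    tuple member x^(u1), preserving the learned correspondence.
    """
    s1 = E(x_u1, u1)
    return np.abs(F(x_u0, s1) - x_u1).mean()

# Hypothetical stand-ins: the "style code" is the image mean, and the
# "translator" shifts its input to match that mean.
E = lambda x, u: x.mean()
F = lambda x, s: x - x.mean() + s

rng = np.random.default_rng(2)
x_u0 = rng.standard_normal((8, 8))
x_u1 = x_u0 + 0.5        # same "content", globally shifted "style"

loss = tuple_loss(F, E, x_u0, x_u1, u1=1)
```

With these stand-ins, F perfectly reconstructs x^(u1) from x^(u0) and the loss is zero; a translator that distorts content would incur a positive L_tuple.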

4. EXPERIMENTS

In this section, we first present results and analysis on the multi-domain image generation task, and then provide results on unpaired image translation.

Datasets. We use five datasets to evaluate our method: CELEBA-HQ (Choi et al., 2020) contains female and male face domains; AFHQ (Choi et al., 2020) contains three domains: cat, dog, and wild.

Evaluation Metrics. We evaluate our method using the Fréchet inception distance (FID), a widely used metric for the distribution divergence between generated and real images; lower FID is better. For the first four datasets there is no paired data, so we use the domain-invariant perceptual distance (DIPD) to measure semantic correspondence (Liu et al., 2019). DIPD computes the distance between two instance-normalized Conv5 features of a VGG network. For MNIST7, we have ground-truth tuples, so we first compute inception features of the images in a tuple, concatenate the features, and reduce the dimension by variance thresholding. We then compute the Fréchet distance between the features of ground-truth and generated tuples. We name this metric Joint-FID, as it computes the divergence between joint distributions.

Baselines. Since our method is built on StyleGAN2-ADA (Karras et al., 2020a), we also run StyleGAN2-ADA to verify the effectiveness of our introduced modules. We also run the most recent conditional GAN model, TGAN (Shahbazi et al., 2022). TGAN observes that the conditional generation results of StyleGAN2-ADA are unsatisfactory when the number of classes is large; it therefore starts as the unconditional StyleGAN2-ADA and gradually transitions to conditional training. We reimplement CoGAN (Liu & Tuzel, 2016) based on the StyleGAN2-ADA architecture, letting the generators share all synthesis blocks except the last one.
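The Fréchet distance underlying both FID and our Joint-FID compares two Gaussians fitted to feature sets: d² = ∥μ₁ − μ₂∥² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A pure-NumPy sketch is given below (illustrative only; practical FID implementations compute the statistics over Inception-v3 features):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).

    Uses Tr((S1 S2)^(1/2)) = Tr((S2^(1/2) S1 S2^(1/2))^(1/2)), which
    keeps every matrix symmetric PSD so np.linalg.eigh applies.
    """
    def sqrtm_psd(s):
        # Symmetric PSD matrix square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(s)
        return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

    s2_half = sqrtm_psd(sigma2)
    covmean = sqrtm_psd(s2_half @ sigma1 @ s2_half)
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)

mu, sigma = np.zeros(3), np.eye(3)
d_same = frechet_distance(mu, sigma, mu, sigma)          # identical Gaussians
d_shift = frechet_distance(mu, sigma, mu + 1.0, sigma)   # unit mean shift
```

Identical Gaussians give distance 0, and a unit mean shift in 3 dimensions gives exactly 3, matching the closed form. Joint-FID applies the same formula to concatenated tuple features.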

4.1.2. RESULTS AND ANALYSIS

Comparison with Baselines. We present the quantitative results in Table 3 and show samples generated by different methods in Fig. 4. Images in each row are generated with the same Gaussian noise and different values of the domain label u. Our method achieves the best FID on all datasets. The significant improvement over StyleGAN2-ADA highlights the importance of considering the correspondence information across domains. Our method also achieves the lowest DIPD values on three datasets; the low DIPD demonstrates that images generated by our method share more content than those of other methods. In other words, our method is able to learn a proper joint distribution given only marginals.

The proposed sparsity loss reduces the influence of the domain stably and avoids expensive hyper-parameter sweeping. As mentioned in section 3.2, a naive way to further reduce the influence of the domain label u is to reduce the dimension of z_s directly, but the results can be sensitive to the dimension n_s. To verify this point, we sweep n_s over [256, 64, 32, 16, 8, 4] (setting λ = 0) and show the results in Fig. 5(a) (top two rows). We observe that our sparsity constraint works well for λ in the range [0, 1], a common range for this hyper-parameter, whereas tuning n_s only works well for n_s = 16, 32. We plot the number of valid dimensions of ẑ_s, obtained by inspecting the trainable mask m, in Fig. 5(a) (bottom left): the model learns to decrease n_s to around 50, close to the optimal values found by tuning n_s. The samples in Fig. 5(b) also suggest that the generated tuples with the sparsity loss share more content (with dim = 32, the model confuses digits 3 and 5, second row).

Identifiability when the number of domains is small. The identifiability of the latent variables requires 2n_s + 1 domains, as shown in section 3.1. Therefore, it is interesting to see what happens when the number of domains is small.
We have shown that our method achieves the best results across different real-world datasets in Table 3. Now, we choose the MNIST7 dataset and decrease the number of domains used in training. The results are shown in Fig. 5(a) (bottom right). We only compute the Joint-FID of the first two domains to keep the results comparable. We observe a clear trend: the Joint-FID improves as we increase the number of domains, which supports our theory. With only 2 domains, StyleGAN2-ADA achieves a Joint-FID of 54 while our method achieves 22. We also show generated examples in Fig. 5(b), last panel: StyleGAN2-ADA completely drops the correspondence between images in the generated tuples, while our method preserves it.

4.2. APPLICATION: IMAGE-TO-IMAGE TRANSLATION

Table 2: Average results of the 6 pairwise image-to-image translation tasks, reporting FID (lower is better) and LPIPS (higher is better) for both latent-guided and reference-guided synthesis; compared methods include MUNIT (Huang et al., 2018b).

4.2.1. RESULTS

We present the experimental setup in appendix J. The average results over the 6 pairwise image-to-image translation tasks are shown in Table 2. We observe significant improvements over StarGAN-v2 in both latent-guided and reference-guided tasks, which indicates that the generated tuples help match the distribution as well as improve the diversity of the outputs. More samples are provided in appendix J. Our model is trained to learn the true joint distribution; however, an essential assumption is that the content z_c across domains is aligned. For example, the female and male domains share the same human content. For more complex data where the content is not aligned, our assumption may be violated. For example, we find that the Cezanne domain contains many fruit paintings while other domains contain landscapes in the ArtPhoto dataset. Although our method achieves the best quality as well as semantic correspondence compared to other methods, we still observe some failure cases. As shown in Fig. 7, to match the distribution of the Cezanne domain, the generator learns to generate fruit paintings (first column). While our method still works on the other three domains (remaining columns), we observe content mismatch due to the unaligned training data. Our proposed architecture may over-constrain the influence of the domain label in this misaligned case. We may also resort to other quantitative measures of the influence to help address this challenging problem, which we leave as future work.

A ETHICAL STATEMENT

Generative models such as the GANs used in our paper enable various applications, and our model advances the technique and may reduce manual labeling costs. Unfortunately, as revealed by previous reports, such models make it easier to manipulate image data, as in deepfakes. In addition, generative models may reveal information about the training data. How to address these negative impacts remains an important problem. All datasets we used are publicly available.

B DISCUSSION ABOUT OUR MODEL AND CONDITIONAL GANS

Conditional GANs usually feed the concatenation of the input noise ϵ and the domain label u into the generator G. By contrast, we inject the domain influence into the noise ϵ through a component-wise strictly increasing transformation f̂_u. From a theoretical perspective, our architecture allows us to recover the true joint distribution, as shown in section 3.1. From an empirical perspective, we would like to reduce the influence of the domain information in our generative model. To generate images in each domain, the generator needs to utilize the input domain label u, but the influence of u can be very large if no constraint is applied. If the influence of the domain variable is unnecessarily large, the generator focuses on u and pays little attention to the content variable z_c. As a consequence, the generated images in a tuple may lose correspondence (e.g., the generated animals have different poses on the AFHQ dataset, Fig. 4). In the extreme case, the content is totally ignored by the generator: it still outputs images in different domains (because the input domain labels differ), but all images within the same domain look the same (because the content variable z_c is ignored), e.g., the generated digits look identical on the MNIST7 dataset, Fig. 4. By contrast, we first use the simple transformation f̂_u to inject the domain information into ϵ rather than concatenating it with ϵ.
Secondly, the sparsity loss L_sparsity reduces the number of components affected by the domain label u. Therefore, there will only be the necessary changes between the images in the generated tuples.
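As a minimal NumPy sketch of the scheme described above (all names and values are illustrative assumptions, not the paper's implementation): the style part is produced as z_s = ϵ + m ⊙ f_u(ϵ), where f_u is here a component-wise affine map with positive scale (hence strictly increasing), and the sparsity penalty is an L1 norm on the mask m, driving unnecessary components to zero.

```python
import numpy as np

# Hypothetical sketch: domain injection via a component-wise transform and a
# sparsified mask, instead of concatenating the domain label with the noise.
rng = np.random.default_rng(0)
n = 8                                 # toy latent dimension (assumption)
eps = rng.standard_normal(n)          # input noise epsilon

# per-domain component-wise affine transform f_u(eps) = a_u * eps + b_u;
# positive scale a_u makes it strictly increasing in each component
a_u = np.abs(rng.standard_normal(n)) + 0.1
b_u = rng.standard_normal(n)

# mask after sparsification: only the first two components carry domain info
m = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

z_s = eps + m * (a_u * eps + b_u)     # z_s = eps + m ⊙ f_u(eps)

sparsity_loss = np.abs(m).sum()       # L1 sparsity penalty on the mask

# components with m_i = 0 are untouched by the domain label
unchanged = np.flatnonzero(m == 0)
assert np.allclose(z_s[unchanged], eps[unchanged])
```

With a hard-zero mask, the untouched components of z_s equal ϵ exactly, so the domain label only influences the selected components.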

C RELATED WORK

Multi-domain Image Generation. Conditional GANs aim to maximize the influence of the conditioning variable to improve the diversity of the generated samples (Gong et al., 2019; Brock et al., 2018; Miyato & Koyama, 2018; Odena et al., 2017; Kang et al., 2021; Kang & Park, 2020; Tseng et al., 2021; Karras et al., 2020a; Miyato et al., 2018; Zhang et al., 2019a; Wu et al., 2019; Zhang et al., 2019b; Zhao et al., 2020). Our task is also related to the conditional GAN problem if we regard the domain label as the conditioning information. A major difference is that conditional GANs mostly focus on enlarging the influence of the conditioning class label to generate diverse images across classes (Kang et al., 2021), while our method tries to reduce the influence of the conditioning domain label to generate meaningful tuples across domains.

Unpaired Image-to-Image Translation. Image-to-image translation can also be viewed as a joint distribution learning problem between the source and target image domains. With paired data, pix2pix (Isola et al., 2017) applies a conditional GAN to match the distribution of the target domain and penalizes the distance of the generated image to the ground-truth image. Unfortunately, paired data are usually difficult to collect. Therefore, additional assumptions are made to address the unsupervised task, such as cycle consistency (Zhu et al., 2017; Kim et al., 2017; Yi et al., 2017; Choi et al., 2018; 2020; Liu et al., 2021; Kim et al., 2022), a shared latent space (Liu et al., 2017; Huang et al., 2018b; Lee et al., 2018; Yu et al., 2019; Liu et al., 2018), relationship preservation (Park et al., 2020; Han et al., 2021; Wang et al., 2021; Cao et al., 2019; Xu et al., 2022), density changing (Xie et al., 2022), and importance reweighting (Xie et al., 2021). We build our method on StarGAN-V2 (Choi et al., 2020), which relies on cycle consistency to preserve content.

D THE GRAPHICAL MODEL

To address the ill-posed problem, we assume the generation process follows the graphical model below.

Figure 8: The graphical model of our method. z_c is the shared content, z^(i) is the style of domain i, and x^(i) is the observation, i.e., the data (images) in domain i.

E PROOF OF THE IDENTIFIABILITY OF LATENT VARIABLES

We show that the true content z_c and style z_s are identifiable in Lemma 3.1.

E.1 THE IDENTIFIABILITY OF THE STYLE

We first provide the proof of the identifiability of the style variable z_s.

Proof. Since the marginal distributions are matched, P_θ′(x|u) = P_θ(x|u), and since x = ĝ(ẑ) and x = g(z), we have P(ĝ(ẑ)|u) = P(g(z)|u). Applying the same transformation ĝ^{-1} to the variables ĝ(ẑ) and g(z) yields P(ĝ^{-1} ∘ ĝ(ẑ)|u) = P(ĝ^{-1} ∘ g(z)|u), i.e., P(ẑ|u) = P(ĝ^{-1} ∘ g(z)|u). Define h = g^{-1} ∘ ĝ, so that h^{-1} = ĝ^{-1} ∘ g; then

P(ẑ|u) = P(h^{-1}(z)|u),   (11)

which shows that h is the indeterminacy between the recovered latent variable ẑ and the true latent z. It is worth noting that we do not need to assume that z and x have the same dimension: the Jacobian determinant below is taken of h, which maps Z to Z, rather than of g^{-1}. We further define q_i^u := log P(z_i|u) and q̂_i^u := log P(ẑ_i|u). With the conditional independence assumption in A2, we have

P(z|u) = ∏_{i=1}^n P(z_i|u) ⟺ log P(z|u) = ∑_{i=1}^n q_i^u,
P(ẑ|u) = ∏_{i=1}^n P(ẑ_i|u) ⟺ log P(ẑ|u) = ∑_{i=1}^n q̂_i^u.

According to the change-of-variables rule, Equation 11 becomes

P(ẑ|u) = (1/|J_{h^{-1}}|) · P(z|u) ⟺ ∑_{i=1}^n q̂_i^u = ∑_{i=1}^n q_i^u + log |J_h|,   (12)

where |J_h| is the absolute value of the determinant of the Jacobian matrix of h. Since h is invertible and its input and output have the same dimension, we have 1/|J_{h^{-1}}| = |J_h| ≠ 0. To simplify notation, we define

h′_{i,(k)} := ∂z_i/∂ẑ_k,  h″_{i,(k,q)} := ∂²z_i/(∂ẑ_k ∂ẑ_q);  η′_i(z_i, u) := ∂q_i^u/∂z_i,  η″_i(z_i, u) := ∂²q_i^u/(∂z_i)².

Differentiating Equation 12 twice w.r.t. ẑ_k and ẑ_q, where k, q ∈ [n] and k ≠ q, yields

∑_{i=1}^n [ η″_i(z_i, u) · h′_{i,(k)} h′_{i,(q)} + η′_i(z_i, u) · h″_{i,(k,q)} ] + ∂² log |J_h| / (∂ẑ_k ∂ẑ_q) = 0.

There are 2n_s + 1 such equations, corresponding to u = u_0, …, u_{2n_s}. Subtracting the equation associated with u_0 from those associated with u_1, …, u_{2n_s} (the log |J_h| term cancels, as it does not depend on u), we obtain the following 2n_s equations:

∑_{i=n_c+1}^n [ (η″_i(z_i, u_j) − η″_i(z_i, u_0)) · h′_{i,(k)} h′_{i,(q)} + (η′_i(z_i, u_j) − η′_i(z_i, u_0)) · h″_{i,(k,q)} ] = 0,  j = 1, …, 2n_s.

Due to the invariance of z_c, we have P(z_c) = P(z_c|u); thus η″_i(z_i, u_j) = η″_i(z_i, u_{j′}) and η′_i(z_i, u_j) = η′_i(z_i, u_{j′}) for all j, j′ and all content indices i ≤ n_c. Hence only the style components i = n_c + 1, …, n remain in the summation of each equation. Under the linear independence condition in Assumption A3, this is an invertible 2n_s × 2n_s linear system. Therefore, the only solution is h′_{i,(k)} h′_{i,(q)} = 0 and h″_{i,(k,q)} = 0 for i = n_c + 1, …, n and k, q ∈ [n], k ≠ q.

As h(·) is smooth over Z, its Jacobian can be written as

J_h = [ A := ∂z_c/∂ẑ_c,  B := ∂z_c/∂ẑ_s ;  C := ∂z_s/∂ẑ_c,  D := ∂z_s/∂ẑ_s ].

Note that h′_{i,(k)} h′_{i,(q)} = 0 implies that for each i = n_c + 1, …, n, we have h′_{i,(k)} ≠ 0 for at most one k ∈ [n]. Therefore, there is at most one non-zero entry in each row indexed by i = n_c + 1, …, n of the Jacobian matrix J_h. Further, the invertibility of h(·) requires J_h to be full-rank, which implies that there is exactly one non-zero entry in each row of the matrices C and D. Since every ẑ_i with i ∈ {n_c + 1, …, n} has a changing distribution over u while every ẑ_k with k ∈ {1, …, n_c} (i.e., ẑ_c) has an invariant distribution over u, we can deduce that C = 0 and that the only non-zero entry ∂z_i/∂ẑ_k must reside in D, with k ∈ {n_c + 1, …, n}. Therefore, for each estimated variable ẑ_i in the changing part, i ∈ {n_c + 1, …, n}, there exists one true variable z_k in the changing part, k ∈ {n_c + 1, …, n}, such that ẑ_i = h′_i(z_k). Further, because J_h is full-rank (h(·) being invertible) and C is a zero matrix, D must be full-rank, which implies that h′_i(·) is invertible for each i ∈ {n_c + 1, …, n}.
Thus, the changing components z_s are identified up to component-wise invertible transformations. ∎

E.2 THE IDENTIFIABILITY OF THE CONTENT VARIABLE

Now we provide the proof of the block-identifiability of the content variable z_c.

Proof. The proof proceeds in four steps.

Step 1. Due to the assumed data generating process of the learned model, the estimate ẑ_c is independent of u. Thus, for any A_{z_c} ⊆ Z_c,

P{ĝ^{-1}_{1:n_c}(x) ∈ A_{z_c} | u = u_1} = P{ĝ^{-1}_{1:n_c}(x) ∈ A_{z_c} | u = u_2}, ∀u_1, u_2 ∈ U
⟺ P{x ∈ (ĝ^{-1}_{1:n_c})^{-1}(A_{z_c}) | u = u_1} = P{x ∈ (ĝ^{-1}_{1:n_c})^{-1}(A_{z_c}) | u = u_2}, ∀u_1, u_2 ∈ U,   (16)

where ĝ^{-1}_{1:n_c} : X → Z_c denotes the estimated transformation from the observation to the content variable and (ĝ^{-1}_{1:n_c})^{-1}(A_{z_c}) ⊆ X is the pre-image set of A_{z_c}. On account of the distribution-matching assumption, we can extend Equation 16 as follows:

P{x ∈ (ĝ^{-1}_{1:n_c})^{-1}(A_{z_c}) | u = u_1} = P{x ∈ (ĝ^{-1}_{1:n_c})^{-1}(A_{z_c}) | u = u_2}
⟺ P{ĝ^{-1}_{1:n_c}(x) ∈ A_{z_c} | u = u_1} = P{ĝ^{-1}_{1:n_c}(x) ∈ A_{z_c} | u = u_2}.   (17)

Because g and ĝ are smooth and injective, there exists a smooth and injective h = ĝ^{-1} ∘ g : Z → Z. Writing ĝ^{-1} = h ∘ g^{-1} and h_c(·) := h_{1:n_c}(·) : Z → Z_c in Equation 17 yields

P{h_c(z) ∈ A_{z_c} | u = u_1} = P{h_c(z) ∈ A_{z_c} | u = u_2}
⟺ P{z ∈ h_c^{-1}(A_{z_c}) | u = u_1} = P{z ∈ h_c^{-1}(A_{z_c}) | u = u_2}
⟺ ∫_{z ∈ h_c^{-1}(A_{z_c})} p_{z|u}(z|u_1) dz = ∫_{z ∈ h_c^{-1}(A_{z_c})} p_{z|u}(z|u_2) dz,   (18)

where h_c^{-1}(A_{z_c}) = {z ∈ Z : h_c(z) ∈ A_{z_c}} is the pre-image of A_{z_c}, i.e., the set of latent variables whose content part lies in A_{z_c} after the indeterminacy transformation h. Based on the generating process, we can rewrite Equation 18 as follows: ∀A_{z_c} ⊆ Z_c,

∫_{[z_c^⊤, z_s^⊤]^⊤ ∈ h_c^{-1}(A_{z_c})} p_{z_c}(z_c) [ p_{z_s|u}(z_s|u_1) − p_{z_s|u}(z_s|u_2) ] dz_s dz_c = 0.   (19)

Step 2. In this step, we prove that ẑ_c := h_c([z_c^⊤, z_s^⊤]^⊤) does not depend on z_s. To this end, we first develop an equivalent statement (i.e., Statement 3 below) and prove it subsequently. This enables us to leverage the full-support density assumption to avert technical issues.

• Statement 1: h_c([z_c^⊤, z_s^⊤]^⊤) does not depend on z_s.
• Statement 2: ∀z_c ∈ Z_c, it holds that h_c^{-1}(z_c) = B_{z_c} × Z_s, where B_{z_c} ≠ ∅ and B_{z_c} ⊆ Z_c.
• Statement 3: ∀z_c ∈ Z_c, r ∈ R^+, it holds that h_c^{-1}(B_r(z_c)) = B^+_{z_c} × Z_s, where B_r(z_c) := {z′_c ∈ Z_c : ‖z′_c − z_c‖_2 < r}, B^+_{z_c} ≠ ∅, and B^+_{z_c} ⊆ Z_c.

Statement 2 is a mathematical formulation of Statement 1. Statement 3 generalizes the singletons z_c in Statement 2 to open, non-empty balls B_r(z_c). Under this generalization, B^+_{z_c} necessarily has probability measure greater than 0, which allows us to derive a contradiction with Equation 19.

Leveraging the continuity of h_c(·), we show the equivalence between Statement 2 and Statement 3 as follows. First, Statement 2 implies Statement 3: ∀z_c ∈ Z_c, r ∈ R^+, we have h_c^{-1}(B_r(z_c)) = ∪_{z′_c ∈ B_r(z_c)} h_c^{-1}(z′_c); Statement 2 indicates that every set participating in this union satisfies h_c^{-1}(z′_c) = B′_{z_c} × Z_s, so the union h_c^{-1}(B_r(z_c)) also satisfies this property, which is Statement 3. Next, we show that Statement 3 implies Statement 2, by contradiction. Suppose that Statement 2 is false; then there exist ẑ_c ∈ Z_c, ẑ^B_c ∈ {z_{1:n_c} : z ∈ h_c^{-1}(ẑ_c)}, and ẑ^B_s ∈ Z_s such that h_c(ẑ^B) ≠ ẑ_c, where ẑ^B = [(ẑ^B_c)^⊤, (ẑ^B_s)^⊤]^⊤. As h_c(·) is continuous, there exists r ∈ R^+ such that h_c(ẑ^B) ∉ B_r(ẑ_c); that is, ẑ^B ∉ h_c^{-1}(B_r(ẑ_c)).

Step 3. In this step, we prove Statement 3 by contradiction. Intuitively, we show that if h_c(·) depended on ẑ_s, the pre-image h_c^{-1}(B_r(z_c)) could be partitioned into two parts (i.e., B^*_z and h_c^{-1}(A^*_{z_c}) \ B^*_z). The dependency between h_c(·) and ẑ_s is characterized by B^*_z.

Step 4. With the knowledge that h_c(·) does not depend on the style variable z_s, we now show that there exists an invertible mapping between the true content variable z_c and its estimate ẑ_c. As h(·) is smooth over Z, its Jacobian can be written as

J_h = [ A := ∂ẑ_c/∂z_c,  B := ∂ẑ_c/∂z_s ;  C := ∂ẑ_s/∂z_c,  D := ∂ẑ_s/∂z_s ],

where we use the notation ẑ_c = h(z)_{1:n_c} and ẑ_s = h(z)_{n_c+1:n}. As we have shown that ẑ_c does not depend on the style variable z_s, it follows that B = 0. On the other hand, as h(·) is invertible over Z, J_h is non-singular; therefore, A must be non-singular due to B = 0. Note that A is the Jacobian of the function h′_c(z_c) := h_c(z) : Z_c → Z_c, which takes only the content part z_c of the input z into h_c. Also, note that the result of Theorem 3.1 implies that C = 0. Together with the invertibility of h, we can conclude that h′_c is invertible. Therefore, there exists an invertible function h′_c between the estimated and the true content variables such that ẑ_c = h′_c(z_c), which concludes the proof that z_c is block-identifiable via ĝ^{-1}(·). ∎

F PROOF OF THE IDENTIFIABILITY OF TRUE JOINT DISTRIBUTION

In this section, we provide the proof of Theorem 3.2.

Proof. Two domains. We first consider the case of two variables, i.e., we prove that P_θ′(x^(1), x^(2)) = P_θ(x^(1), x^(2)). Factorize the joint distribution as P(x^(1), x^(2)) = P(x^(1)) P(x^(2)|x^(1)). Since we have matched the marginal distribution of each domain, we already have P_θ′(x^(1)) = P_θ(x^(1)), so it remains to prove P_θ′(x^(2)|x^(1)) = P_θ(x^(2)|x^(1)).

Given x = g(z_c, z_s), we can identify the style and content through the inverse of the model, i.e., (ẑ_c, ẑ_s) = ĝ^{-1}(x). According to Lemma 3.1, we have ẑ_c = h_c(z_c) and ẑ_s = h_s(z_s), where h_c is an invertible transformation and h_s is a component-wise invertible transformation. Therefore,

g(z_c, z_s) = ĝ(ẑ_c, ẑ_s) = ĝ(h_c(z_c), h_s(z_s)).   (21)

For ease of notation, denote c = z_c and s = z_s. Given a pair of images ⟨x^(1), x^(2)⟩ sampled from the true joint distribution, we have

x^(1) = g(c^(1), s^(1)),  x^(2) = g(c^(1), s^(2)).   (22)

Since we assume the domain-specific transformation f_u is component-wise monotonic, f_u is component-wise invertible. Denoting f_1 = f_{u=1} and f_2 = f_{u=2}, we have

s^(2) = f_2 ∘ f_1^{-1}(s^(1)),  s^(1) = f_1 ∘ f_2^{-1}(s^(2)).   (23)

Similarly, the learned model f̂_u is also component-wise invertible, so we also have

ŝ^(2) = f̂_2 ∘ f̂_1^{-1}(ŝ^(1)).   (24)

Given x^(1) in the first domain, we have

ĉ^(1) = h_c(c^(1)),  ŝ^(1) = h_s(s^(1));  x^(1) = g(c^(1), s^(1)) = ĝ(ĉ^(1), ŝ^(1)).   (25)

Then

ŝ^(2) = f̂_2 ∘ f̂_1^{-1}(ŝ^(1))   (by Equation 24)
      = f̂_2 ∘ f̂_1^{-1}(h_s(s^(1)))   (by Equation 25)
      = f̂_2 ∘ f̂_1^{-1}(h_s(f_1 ∘ f_2^{-1}(s^(2))))   (by Equation 23).   (26)

We now argue that ŝ^(2) is a function of the true s^(2). According to our lemma, the function between ŝ^(2) and s^(2) can only be h_s; therefore, ŝ^(2) = h_s(s^(2)). The output generated by our learned model with parameter θ′ is then

x̂^(2) = ĝ(ĉ^(1), ŝ^(2))   (27)
      = ĝ(h_c(c^(1)), h_s(s^(2)))
      = g(c^(1), s^(2))   (by Equation 21)
      = x^(2).

These results show that, given an input x^(1), our learned generative model with parameter θ′ outputs the same result as the true generative model with parameter θ. In other words, P_θ′(x^(2)|x^(1)) = P_θ(x^(2)|x^(1)).

Generalization by mathematical induction. So far, we have proved that P_θ′(x^(1), x^(2)) = P_θ(x^(1), x^(2)). We now generalize the result to more domains by induction on the number of domains.

• Base case. When the number of domains is 2, we have proved that P_θ′(x^(1), x^(2)) = P_θ(x^(1), x^(2)).

• Inductive step. Suppose the true joint distribution is identifiable when the number of domains is d − 1, i.e., P_θ′(x^(1), …, x^(d−1)) = P_θ(x^(1), …, x^(d−1)); we now prove that it remains identifiable with d domains, i.e., P_θ′(x^(1), …, x^(d)) = P_θ(x^(1), …, x^(d)). By the induction hypothesis, we only need to prove P_θ′(x^(d)|x^(1), …, x^(d−1)) = P_θ(x^(d)|x^(1), …, x^(d−1)). We have

P(x^(d)|x^(1), …, x^(d−1)) = P(x^(d)|c^(1), s^(1), c^(1), s^(2), …)   (applying the invertible function g^{-1})   (28)
 = P(x^(d)|c^(1), s^(1), …, s^(d−1))   (29)
 = P(x^(d)|c^(1), f_1^{-1}(s^(1)), …, f_{d−1}^{-1}(s^(d−1)))   (applying the invertible transformations f^{-1})   (30)
 = P(x^(d)|c^(1), s, s, …, s)   (31)
 = P(x^(d)|c^(1), s)   (32)
 = P(x^(d)|c^(1), f_1(s))   (since f_1 is invertible)   (33)
 = P(x^(d)|c^(1), s^(1)).

This reduces the problem to the two-domain case: we just need to prove P_θ′(x^(d)|x^(1)) = P_θ(x^(d)|x^(1)), which already holds by our two-domain result. Therefore, the true joint distribution is identifiable. ∎

G MULTI-DOMAIN IMAGE GENERATION G.1 DATASET

We use five datasets to verify our model.

• CELEBA-HQ (Choi et al., 2020) contains 2 domains: female and male. We use the training set to train our model; the female domain contains 17943 images and the male domain contains 10057 images. We train the model at resolution 256×256.
• AFHQ (Choi et al., 2020) contains 3 domains: cat, dog, and wildlife (e.g., foxes, tigers, and lions). We use the training set to train our conditional GAN model. The three domains contain 5153, 4739, and 4738 images, respectively. We train the model at resolution 256×256.
• ArtPhoto contains 4 domains: Monet, Cezanne, and Ukiyoe paintings, and real photos (Zhu et al., 2017).
• CelebA5 contains 5 domains: Black Hair, Blonde Hair, Eyeglasses, Mustache, and Pale Skin, which are subsets of CelebA (Liu et al., 2015). We train at resolution 64×64.
• MNIST7 contains 7 domains: blue, cyan, green, purple, red, white, and yellow MNIST digits. We generate these digits from the MNIST training set (LeCun et al., 1998). We train the model at resolution 32×32.

G.2 MORE RESULTS

We now provide more results of multi-domain image generation.

G.3 T-SNE OF REAL AND GENERATED SAMPLES

H TWO VERSIONS OF MASK MECHANISM

In our main paper, we propose to use z_s = ϵ + m ⊙ f_u(ϵ) to encourage the network to select the optimal dimension n_s automatically. Another possible version would be z_s = (1 − m_2) ⊙ ϵ + m_2 ⊙ f_u(ϵ).

J UNPAIRED IMAGE-TO-IMAGE TRANSLATION

J.1 DETAILS ABOUT STARGAN-V2

As mentioned in Section 3.3, we build our method on StarGAN-V2 (Choi et al., 2020). We now provide more details about the method. StarGAN-V2 consists of four modules: the mapping network H(·, ·), the style encoder E(·, ·), the shared generator F, and the discriminator D. The training loss is L_stargan = L_adv + L_cyc + L_sty − λ_div L_div, where L_adv, L_cyc, L_sty, and L_div are used for distribution matching, cycle consistency, style reconstruction, and generation diversity, respectively. StarGAN-V2 supports two kinds of tasks: latent-based and reference-based image translation. Given an input image, the latent-based task generates the style code s from the random noise ϵ with the mapping network H, i.e., s = H(ϵ, y), where y is the target domain label. For the reference-based task, the style is extracted from an image x_2 in the target domain y, i.e., s = E(x_2, y). The shared generator takes an input image x and the style code s and outputs an image from domain y. The discriminator of StarGAN-V2 concatenates multiple domain-specific heads, and we denote the head for domain y as D_y. The style reconstruction loss L_sty = ∥s − E(F(x, s))∥_1 encourages the output of F to contain the style information of s. The diversity loss L_div = E[∥F(x, s_1) − F(x, s_2)∥_1] encourages two output images to be different by maximizing this loss. The final term is the cycle consistency loss (Zhu et al., 2017), L_cyc = E[∥x − F(F(x, s), ŝ)∥_1], where ŝ is the estimated style of the input image; it encourages the network F to reconstruct the input from the translated image. In other words, it encourages the mapping to be one-to-one.
J.2 OUR TUPLE LOSS

However, as shown in previous literature (Alami Mejjati et al., 2018; Kim et al., 2019), cycle consistency alone is not enough and can still lead to large content distortion. Hence, we introduce our tuple loss L_tuple = E[∥F(G(ϵ, u_0), s_2) − G(ϵ, u_1)∥_1], where G(ϵ, u_0) and G(ϵ, u_1) are paired data generated by our multi-domain image generation model. As proved in Section F, the tuples can be viewed as samples from the true joint distribution. Therefore, L_tuple encourages the network F to recover the second image from the first image; in other words, by minimizing this tuple loss, the network F learns the cross-domain correspondence captured by our generated tuples. The tuple loss thus further regularizes the mapping network F and avoids large content distortion.

Figure 15: The trend of the dimension of z_s as training proceeds on the MNIST7 dataset. We observe that the sigmoid version decreases much more slowly than our proposed version when we set λ = 0.1. Even if we set λ = 1.0, it reduces the dimension of z_s more slowly than our version. However, decreasing the dimension of z_s early on is very important, as we need to reduce the influence of the domain variable to avoid conditional collapse. The sigmoid version decreases faster than our proposed method in the middle of training, but by then the GAN has already collapsed and the GAN loss is close to 0, so it helps little in selecting the optimal dimension n_s.
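To make the tuple-loss computation concrete, here is a minimal numerical sketch. The toy functions G and F below are placeholder stand-ins (assumptions for illustration only, not the paper's networks): G shifts a "style" half of the noise by the domain label, and F overwrites that half with a given style code.

```python
import numpy as np

# Sketch of L_tuple = E || F(G(eps, u0), s2) - G(eps, u1) ||_1 with toy
# placeholder networks. G(eps, u) plays the multi-domain generator; F(x, s)
# plays the translation network.

def G(eps, u):
    # placeholder generator: domain u shifts the "style" half of the latent
    x = eps.copy()
    x[len(x) // 2:] += u
    return x

def F(x, s):
    # placeholder translator: overwrites the style half with the style code s
    y = x.copy()
    y[len(y) // 2:] = s
    return y

rng = np.random.default_rng(1)
eps = rng.standard_normal(6)

x0 = G(eps, u=0.0)          # generated image in domain u0
x1 = G(eps, u=1.0)          # its tuple counterpart in domain u1 (same content)
s2 = x1[len(x1) // 2:]      # style code taken from the u1 member of the tuple

# L1 distance between the translated image and the generated counterpart;
# equals 0.0 here because these toy networks let F restore x1 exactly
tuple_loss = np.abs(F(x0, s2) - x1).sum()
```

Minimizing this quantity over real networks pushes F(·, s_2) to reproduce the tuple member G(ϵ, u_1), i.e., to preserve content while changing style.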



Figure 1: Two tasks.

Figure 3: Our regularization.

Figure 4: Samples of multi-domain image generation on CELEBA-HQ, AFHQ, ArtPhoto, CelebA5, and MNIST7. We provide more samples and methods in Appendix G.2. Each row of a method shares the same input noise ϵ. We observe unnecessary changes between the images without regularization (e.g., the added sunglasses in the first row; the different poses of the animals of StyleGAN2-ADA in the second row).

Figure 5: Experiments on the MNIST7 dataset. We observe that our proposed sparsity constraint enables the network to select the optimal dimension of changing components automatically (bottom left in (a)) and is more stable than tuning the dimension manually (top left and right in (a); examples in (b)). We visualize the real and generated samples with t-SNE in (c); different colors denote different domains. We find that the generated samples cover all classes in each domain. We provide t-SNE comparisons in Fig. 9.

Figure 6: Samples of image translation.

Figure 7: Failure cases on the ArtPhoto dataset; the columns are Cezanne, Monet, Photo, and Ukiyoe.

On the other hand, Statement 3 states that h_c^{-1}(B_r(ẑ_c)) = B^+_{z_c} × Z_s. By the definition of ẑ^B, it is clear that ẑ^B_{1:n_c} ∈ B^+_{z_c}. However, the fact that ẑ^B ∉ h_c^{-1}(B_r(ẑ_c)) contradicts Statement 3. Therefore, Statement 2 is true under the premise of Statement 3; that is, Statement 3 implies Statement 2. Consequently, Statement 2 and Statement 3 are equivalent, and therefore proving Statement 3 suffices to show Statement 1.

L_adv = E[log D_y(x)] + E[log(1 − D_ỹ(F(x, s)))],   (36)

which performs adversarial training between the generator and the discriminator.

Figure 9: The t-SNE of real and generated samples for the MNIST7 dataset. We can find that the baseline methods fail to recover the joint distribution; by contrast, our method matches the joint distributions.

Figure 16: The trend of the dimension of z_s as training proceeds on the AFHQ dataset.

Figure 17: Latent-Based: cat→dog and cat→wild

Results of multi-domain image generation on five datasets.

Table 2: Average results on the 6 image-to-image translation tasks.

Dataset | n_s = 32 | n_s = 16 | n_s = 8

FID values when tuning n_s.

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their devoted time and constructive feedback, which helped improve the quality of this paper. This project was partially supported by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF-Convergence Accelerator Track-D award 2134901, by a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research. MG was supported by ARC DE210101624.

(Continuation of Step 3 in the proof of Appendix E.2.) The dependency between h_c(·) and ẑ_s is characterized by B^*_z, which would not emerge otherwise. In contrast, h_c^{-1}(A^*_{z_c}) \ B^*_z also exists when h_c(·) does not depend on ẑ_s. We evaluate the invariance relation in Equation 19 and show that the integral over h_c^{-1}(A^*_{z_c}) \ B^*_z (i.e., T_1) is always 0, whereas the integral over B^*_z (i.e., T_2) is necessarily non-zero, which leads to a contradiction with Equation 19 and thus shows that h_c(·) cannot depend on ẑ_s.

First, note that because B_r(z_c) is open and h_c(·) is continuous, the pre-image h_c^{-1}(B_r(z_c)) is open. In addition, the continuity of h(·) and the matched observation distributions, ∀u′ ∈ U, P{x ∈ A_x | u = u′} = P{x̂ ∈ A_x | u = u′}, lead to h(·) being a bijection, as shown in (Klindt et al., 2020b). This implies that the style part z_{n_c+1:n} cannot take on every value in Z_s: only certain values of the style part are able to produce specific outputs of the indeterminacy h_c(·). Clearly, this would suggest that h_c(·) depends on z_s. To show the contradiction with Equation 19, we evaluate the LHS of Equation 19 with such an A^*_{z_c}. We first look at the value of T_1: in both cases, T_1 evaluates to 0 for A^*_{z_c}. Now, we address T_2: as discussed above, T_2 is non-zero for such A^*_{z_c}. Therefore, for such A^*_{z_c} we would have T_1 + T_2 ≠ 0, which contradicts Equation 19. We have thus proved by contradiction that Statement 3 is true and hence Statement 1 holds; that is, h_c(·) does not depend on the style variable z_s.

Step 4. With the knowledge that h_c(·) does not depend on the style variable z_s, we now show that there exists an invertible mapping between the true content variable z_c and the estimated version ẑ_c. As h(·) is smooth over Z, its Jacobian can be written as J_h = [A, B; C, D] with A := ∂ẑ_c/∂z_c, B := ∂ẑ_c/∂z_s, C := ∂ẑ_s/∂z_c, and D := ∂ẑ_s/∂z_s.

Table 3: Results of multi-domain image generation on five datasets.
We observe that the precisions of our method are slightly lower while the recalls are high across datasets, which is generally desirable since recall can be traded for precision via truncation, whereas the opposite is not true (Karras et al., 2020b; Kynkäänniemi et al., 2019).

There is a main implementation-related reason why we choose the first version: if we choose the second version, we have to constrain m to lie in [0, 1]. A common way is to use the sigmoid function. However, the sigmoid is known for gradient vanishing (the gradient is very small when the input is far from 0), so our sparsity penalty and the GAN loss may have little effect on the sigmoid mask, which leaves the influence of the domain very large. To test this claim, we run experiments on the AFHQ and CELEBA-HQ datasets with the two versions in Table 4. We set λ = 0.1.
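The gradient-vanishing concern can be illustrated numerically: the derivative of sigmoid(x) is sigmoid(x)(1 − sigmoid(x)), which peaks at 0.25 for x = 0 and decays rapidly as |x| grows, so a sigmoid-wrapped mask receives almost no learning signal once its pre-activation drifts away from 0.

```python
import math

# Numerical illustration of sigmoid gradient vanishing.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_0 = sigmoid_grad(0.0)    # maximal gradient: exactly 0.25
grad_at_5 = sigmoid_grad(5.0)    # already below 0.01
grad_at_10 = sigmoid_grad(10.0)  # below 1e-4: effectively no learning signal

assert grad_at_0 == 0.25
assert grad_at_5 < 0.01
assert grad_at_10 < 1e-4
```

This is why a penalty applied through a saturated sigmoid mask barely moves the mask, consistent with the slower dimension reduction reported in Figure 15.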

Mask Version

Table 4 : The results of two mask mechanisms.

I THE NECESSITY OF THE MASK MECHANISM

Our mask mechanism allows selecting the dimensions for injecting domain influence automatically. Another possible way would be to manually define the style dimension n_s and tune it. It is worth noting that when one tries to find the optimal value of n_s from 4, 8, 16, 32, …, in principle one has to consider not only different values but also which dimensions of the input of the GAN should have changing distributions for each value of n_s, in order to achieve the best performance. Because of the complexity of the transformation implied by the GAN, which dimensions of the input should have changing distributions may heavily depend on the initialization of the GAN. That is, if we only allow the first n_s inputs of the GAN to have changing distributions, it is entirely possible that other inputs should instead learn to be among those with changing distributions. In this case, it will be hard for the GAN to learn the optimal function, and hence the strategy of forcing the first n_s dimensions of the input to have changing distributions, even with the optimal value of n_s, may lead to high variability in the final performance. (On the other hand, it is not computationally feasible to consider each subset of the inputs of the GAN of size n_s, run the procedure, and find the best one.) This phenomenon is analogous to the relationship between traditional information criterion (like BIC)-based model selection or subset selection and parameter shrinkage (say, with the ℓ1 penalty) for variable selection.
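The infeasibility of exhaustive subset search can be quantified. The latent width n = 512 below is an illustrative assumption (a typical StyleGAN-like size), not a value taken from the paper:

```python
from math import comb

# Manually choosing which n_s of the n GAN input dimensions should have
# changing distributions means searching over C(n, n_s) subsets for every
# candidate n_s.
n = 512
subset_counts = {n_s: comb(n, n_s) for n_s in (4, 8, 16, 32)}

# already for n_s = 8 the number of subsets exceeds 10^16, so exhaustive
# enumeration is out of the question
assert subset_counts[8] > 10**16
```

A shrinkage-style mask sidesteps this combinatorial search, much as the ℓ1 penalty sidesteps exhaustive subset selection in variable selection.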

