MULTI-DOMAIN IMAGE GENERATION AND TRANSLATION WITH IDENTIFIABILITY GUARANTEES

Abstract

Multi-domain image generation and unpaired image-to-image translation are two important and related computer vision problems. A common technique for both tasks is learning a joint distribution from multiple marginal distributions. However, it is well known that there can be infinitely many joint distributions consistent with the same marginals. Hence, it is necessary to formulate suitable constraints to address this highly ill-posed problem. Inspired by recent advances in nonlinear Independent Component Analysis (ICA) theory, we propose a new method to learn the joint distribution from the marginals by enforcing a specific type of minimal change across domains. We report one of the first results connecting multi-domain generative models to identifiability, and show why identifiability is essential and how to achieve it both theoretically and practically. We apply our method to five multi-domain image generation and six image-to-image translation tasks. The superior performance of our model supports our theory and demonstrates the effectiveness of our method. The training code is available at https://github.com/Mid-Push/i-stylegan.

1. INTRODUCTION

Multi-domain image generation and unpaired image-to-image translation are two important and closely related problems in computer vision and machine learning. They have many promising applications such as domain adaptation (Liu & Tuzel, 2016; Hoffman et al., 2018; Murez et al., 2018; Wang & Jiang, 2019) and medical analysis (Armanious et al., 2019; 2020; Kong et al., 2021). As shown in Fig. 1, multi-domain image generation takes as input random noise ϵ and a domain label u, and aims to generate image tuples in which the images share the same content, e.g., different facial expressions of the same person. The second task takes as input an image in one domain and a target domain label u, and aims to generate another image that lies in the target domain but shares the content of the input, e.g., the output image has the same identity but a different facial expression from the input image.

Figure 1: Two tasks.

Both tasks can be viewed as instantiations of the joint distribution learning problem. A joint distribution of multi-domain images is a probability density function that assigns a density value to each joint occurrence of images in different domains, such as images of the same person with different facial expressions. Once the joint distribution is learned, it can be used to generate meaningful tuples (the first task) and to translate an input image into another domain without content distortion (the second task). If the correspondence across domains is given (e.g., the identity), we can apply supervised approaches to learn the joint distribution easily. However, collecting corresponding data across domains can be prohibitively expensive. For example, collecting different facial expressions of the same person may require controlled experiments. By contrast, collecting image domains without correspondence can be relatively cheap, e.g., facial expressions of different people can be easily accessed online (once permission is granted).
Therefore, we consider the problem of unsupervised joint distribution learning P(x^(1), x^(2), ..., x^(d)), where we are only given marginal distributions {P(x^(i))}_{i=1}^{d} in different yet related domains, d is the number of domains, and x^(i) denotes the images in domain i. However, there can be infinitely many joint distributions that produce the same marginals (Lindvall, 2002). For example, we can apply conditional GANs (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018; Brock et al., 2018) to match the marginal distribution in each domain; such models learn a joint distribution implicitly. We can sample a tuple ⟨x^(1), ..., x^(d)⟩ from the learned joint distribution by using the same random noise with different domain labels. However, there is no guarantee that the joint distribution learned by a conditional GAN is the true one, and it may drop the correspondence in the generated tuples (e.g., Fig. 5(b)). To tackle this problem, CoGAN (Liu & Tuzel, 2016) uses a separate generator for each domain and shares the weights of the high-level layers across generators. JointGAN (Pu et al., 2018) factorizes the joint distribution (e.g., P(x^(1), x^(2)) = P(x^(1)) P(x^(2) | x^(1))) and learns one marginal and one conditional distribution with cycle consistency (Zhu et al., 2017). While existing approaches (e.g., Liu et al. (2017); Pu et al. (2018)) add constraints to rule out unwanted joint distributions, they are not guaranteed to find the true joint distribution. If we are unsure about the learned joint distribution, we cannot guarantee that the generated tuples are meaningful. Learning the true joint distribution seems impossible, since we only have access to the marginal distributions and domain labels, and not to the latent content and style variables.
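The non-uniqueness of the joint distribution is easy to see even in the simplest discrete case. The following minimal sketch shows two different couplings of a pair of binary variables that induce exactly the same marginals (the variable names are illustrative, not from the paper):

```python
import numpy as np

# Two different joint distributions over a pair of binary variables
# (x1, x2). Rows index x1, columns index x2.
joint_a = np.array([[0.25, 0.25],
                    [0.25, 0.25]])   # independent coupling
joint_b = np.array([[0.50, 0.00],
                    [0.00, 0.50]])   # perfectly correlated coupling

# Both couplings induce the same uniform marginals P(x1) and P(x2),
# yet they encode completely different cross-domain correspondences.
print(joint_a.sum(axis=1), joint_b.sum(axis=1))  # P(x1): [0.5 0.5] for both
print(joint_a.sum(axis=0), joint_b.sum(axis=0))  # P(x2): [0.5 0.5] for both
```

Matching only the marginals therefore cannot distinguish the two couplings; extra constraints are needed to pin down the correspondence.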
Fortunately, recent advances in nonlinear Independent Component Analysis (ICA) theory (Hyvarinen & Morioka, 2016; 2017; Hyvarinen et al., 2019; Khemakhem et al., 2020b; Von Kügelgen et al., 2021; Kong et al., 2022) show that deep nonlinear latent variable models are identifiable given auxiliary variables (the domain label in our case), meaning that the latent variables (the content and style in our case) can be recovered up to some component-wise transformation. Inspired by these advances, we propose a new method to learn the true joint distribution from the marginals by enforcing a specific type of minimal change across domains. Specifically, we assume that the influence of domain information (i.e., the underlying change across domains) is minimal in the data generation process, e.g., only the facial expression changes, while no change of hair style is allowed. To achieve this minimal change, we inject the domain information through a component-wise strictly increasing transformation instead of an arbitrary complex transformation. In addition, we assume that the number of underlying components affected by the domain information (and thus with changing distributions) is minimal. We then show that if the influence of domain information is minimal, the true joint distribution can be recovered from the marginal distributions. Afterwards, we can use the learned joint distribution to sample meaningful tuples and translate input images into another domain without content distortion. Our method can be applied to datasets where the content across domains is aligned, and most existing datasets in multi-domain image generation and translation satisfy this requirement, e.g., facial image, animal face, and digit datasets. Our method may be less effective when some domains contain unaligned content.
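To make the injection mechanism concrete, the sketch below instantiates one simple family of component-wise strictly increasing transformations: a per-domain affine map with positive scales. This is a hypothetical choice for illustration only, not the paper's actual architecture; any component-wise strictly increasing family would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
n_comp, n_domains = 8, 3

# Illustrative per-domain parameters of the component-wise map
# s_u(z) = exp(a_u) * z + b_u, which is strictly increasing in every
# component because exp(a_u) > 0. (Hypothetical parameterization.)
log_scale = rng.normal(size=(n_domains, n_comp)) * 0.1
shift = rng.normal(size=(n_domains, n_comp)) * 0.1

def inject_domain(z, u):
    """Apply the domain-u component-wise monotone transform to latents z."""
    return np.exp(log_scale[u]) * z + shift[u]

z = rng.normal(size=n_comp)          # latents shared across domains
x_latents = [inject_domain(z, u) for u in range(n_domains)]
# Each domain observes a monotonically transformed version of the SAME z,
# so the component-wise cross-domain correspondence is preserved; an
# arbitrary (non-monotone) transform could scramble it.
```

In this sketch every component changes across domains; the additional sparsity assumption in the text would further restrict `log_scale` and `shift` so that only a minimal subset of components differs from the identity map.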
Particular care is needed with artwork-related datasets, since different painters focus on different objects in their paintings (e.g., there are many fruit paintings by Cezanne and landscape paintings by Monet). Our proposed method makes several theoretical and practical contributions:

1. We characterize the properties of an image generation process under which the true joint distribution can be recovered from marginal distributions.
2. In light of our theoretical results, we provide a practical method for multi-domain image generation. The proposed method can automatically determine the underlying dimension of the latent variables with changing distributions. Our method achieves promising results across five image generation tasks.
3. We propose to encourage the mapping function in the image translation task to preserve the correspondence learned by our image generation model. The significant gains over the baseline methods on six image translation tasks demonstrate the effectiveness of our technique.

