MULTI-VIEW DISENTANGLED REPRESENTATION

Abstract

Learning effective representations for data with multiple views is crucial in machine learning and pattern recognition. Recent efforts have focused on learning unified or latent representations to integrate information from different views for specific tasks. These approaches generally assume simple or implicit relationships between views and are therefore unable to accurately and explicitly depict the correlations among them. To address this, we first propose the definition of, and conditions for, unsupervised multi-view disentanglement, providing general guidance for disentangling representations between different views. Furthermore, a novel objective function is derived to explicitly disentangle multi-view data into a part shared across views and a (private) part exclusive to each view. The explicitly guaranteed disentanglement has great potential for downstream tasks. Experiments on a variety of multi-modal datasets demonstrate that our objective can effectively disentangle information from different views while satisfying the disentangling conditions.

1. INTRODUCTION

Multi-view representation learning (MRL) involves learning representations by effectively leveraging information from different perspectives. The representations produced by MRL are effective when correlations across views are accurately modeled and thus properly exploited for downstream tasks. One representative algorithm, Canonical Correlation Analysis (CCA) (Hotelling, 1992), aims to maximize linear correlations between two views under the assumption that factors from different views are highly correlated. Under a similar assumption, extended versions of CCA, including kernelized CCA (Akaho, 2006) and Deep CCA (Andrew et al., 2013), explore more general correlations. There are also several methods (Cao et al., 2015; Sublime et al., 2017) that maximize the independence between views to enhance complementarity. Going beyond these simple assumptions, latent-representation methods encode different views with a degradation process, implicitly exploiting both consistency and complementarity (Zhang et al., 2019). These existing MRL algorithms are effective; however, the assumed correlations between views are usually simple and thus cannot accurately model or explicitly disentangle complex real-world correlations, which hinders further improvement and interpretability. Although a few heuristic algorithms (Tsai et al., 2019; Hu et al., 2017) explicitly decompose the multi-view representation into shared and view-specific parts, they are designed specifically for supervised learning tasks, offer no disentanglement guarantee, and fall short of formally defining the relationships between the different parts.
To address this issue, we propose to disentangle, without supervision, the original data of different views into a representation shared across views and an exclusive (private) part within each view, which explicitly depicts the correlations and thus not only enhances the performance of existing tasks but may also inspire new applications. Specifically, we first provide a definition of the multi-view disentangled representation by introducing sufficient and necessary conditions for guaranteeing the disentanglement of different views. According to these conditions, an information-theory-based algorithm is proposed to accurately disentangle the views. To summarize, the main contributions of our work are as follows:

• To the best of our knowledge, this is the first work to formally study multi-view disentangled representations with strict conditions, which may serve as a foundation for future research on this problem.

• Based on our definition, we propose a multi-view disentangling model, in which information-theory-based multi-view disentangling accurately decomposes the information into a shared representation across views and an exclusive representation within each view. The explicit decomposition enhances the performance of multi-view analysis tasks and may also inspire new applications.

• Different from single-view unsupervised disentangled representation learning (Locatello et al., 2019), we provide a new paradigm for unsupervised disentangled representation learning from a fresh perspective: disentangling factors between different views instead of within each single view.

• Extensive experiments on a range of applications verify that the proposed information-theory-based multi-view disentangling algorithm can accurately disentangle data from multiple views into the expected shared and exclusive representations.

2. MULTI-VIEW DISENTANGLED REPRESENTATION

Existing multi-view representation learning methods (Wu & Goodman, 2018; Zhang et al., 2019) can obtain a common representation for multi-view data; however, the correlations between different views are not explicitly expressed. Supervised algorithms (Hu et al., 2017; Tan et al., 2019) can decompose multiple views into a common part and private parts, but provide no disentangling guarantee. Therefore, we propose a multi-view disentanglement algorithm that can explicitly separate the shared and exclusive information in unsupervised settings. Formally, we first propose a definition of a multi-view disentangled representation by introducing four criteria, which are considered as sufficient and necessary conditions for disentangling multiple views:

Definition 2.1 (Multi-View Disentangled Representation) Given a sample with two views, i.e., $X = \{x_i\}_{i=1}^{2}$, the representation $S_{dis} = \{s_i, e_i\}_{i=1}^{2}$ is a multi-view disentangled representation if the following conditions are satisfied:

• Completeness: (condition 1) the shared representation $s_i$ and exclusive representation $e_i$ should jointly contain all information of the original representation $x_i$;

• Exclusivity: (condition 2) there is no shared information between the common representation $s_i$ and the exclusive representation $e_i$, which ensures exclusivity within each view (intra-view); (condition 3) there is no shared information between $e_i$ and $e_j$, which ensures exclusivity between the private information of different views (inter-view);

• Commonality: (condition 4) the common representations $s_i$ and $s_j$ should contain the same information. Equipped with the exclusivity constraints, the common representations are guaranteed not only to be the same but also to contain maximized common information.

The necessity of each criterion is illustrated in Fig. 1 (satisfaction of all four conditions produces exact disentanglement, while violation of any condition may result in an unexpected disentangled representation). Note that existing (single-view) unsupervised disentanglement focuses on learning a representation that identifies explanatory factors of variation, which has been proved fundamentally impossible (Locatello et al., 2019). The goal of the proposed multi-view disentanglement is instead to disentangle multiple views into shared and exclusive parts, which can be well guaranteed as illustrated in Definition 2.1 and Fig. 1. Mutual information has been widely used in representation learning (Hjelm et al., 2019; Belghazi et al., 2018). In probability theory and information theory, the mutual information of two random variables quantifies the "amount of information" obtained about one random variable when observing the other, which makes it well suited for measuring the amount of shared information between two different views. To approach the disentangling goal, according to conditions 1-4, the general form of the objective function is naturally induced as:

$$\max \; \sum_{i=1}^{2} \Big[ \underbrace{I(x_i; e_i, s_i)}_{\text{condition 1}} - \underbrace{I(e_i; s_i)}_{\text{condition 2}} \Big] - \underbrace{\sum_{i \neq j} I(e_i; e_j)}_{\text{condition 3}} + \underbrace{\sum_{i \neq j} I(s_i; s_j)}_{\text{condition 4}}, \qquad (1)$$

where $I(\cdot\,;\cdot)$ denotes the mutual information. We provide an implementation in Fig. 2 and describe it in detail in the following subsections.
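The objective in Eq. 1 can be sketched as a plain composition of four mutual-information terms. The function below is a minimal illustration only: `mi` is a placeholder for any mutual-information estimator (the actual model uses variational/MINE-style estimators), and `toy_mi` is a purely hypothetical stand-in based on linear correlation.

```python
import numpy as np

def multi_view_objective(x, s, e, mi):
    """Sketch of Eq. 1. x, s, e: dicts {view: np.ndarray}; mi(a, b) -> scalar."""
    views = list(x.keys())
    total = 0.0
    for i in views:
        total += mi(x[i], np.concatenate([s[i], e[i]]))  # condition 1: completeness
        total -= mi(s[i], e[i])                          # condition 2: intra-view exclusivity
    for i in views:
        for j in views:
            if i != j:
                total -= mi(e[i], e[j])                  # condition 3: inter-view exclusivity
                total += mi(s[i], s[j])                  # condition 4: commonality
    return total

def toy_mi(a, b):
    """Toy stand-in for an MI estimator: absolute linear correlation."""
    n = min(a.size, b.size)
    if n < 2:
        return 0.0
    c = np.corrcoef(a[:n], b[:n])[0, 1]
    return 0.0 if np.isnan(c) else abs(float(c))
```

With a real estimator plugged in for `mi`, the first two terms are maximized per view and the cross-view terms summed over i ≠ j, exactly as in Eq. 1.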

2.1. CONDITION 1: INFORMATION PRESERVATION FOR THE SHARED AND EXCLUSIVE REPRESENTATIONS

• How to maximize I(x; e, s)? For simplicity, x, s, e and x_i, s_i, e_i are used interchangeably with the same meanings, where the former notation serves the intra-view case and the latter the inter-view case. To preserve the information of the original data in the shared and exclusive representations, the mutual information I(x; e, s) should be maximized. There are different ways to implement this maximization, based on the following assumptions.

Assumption 2.1 The shared representation s and exclusive representation e are simultaneously independent and conditionally independent:

$$p(s, e) = p(s)p(e), \quad p(s, e \mid x) = p(s \mid x)p(e \mid x). \qquad (2)$$

First, we expand I(x; e, s) to obtain the following equation (more details are shown in supplement C):

$$I(x; e, s) = \int p(x)\, p(e, s \mid x) \log \frac{p(e, s \mid x)}{p(e, s)} \, de \, ds \, dx.$$

Then, under Assumption 2.1, the following decomposition is derived (more details are shown in supplement C):

$$I(x; e, s) = I(x; e) + I(x; s). \qquad (3)$$

According to this equation, it seems that we can maximize I(x; e) + I(x; s) in order to maximize I(x; e, s), which amounts to making s and e contain as much information from x as possible (ideally producing e and s such that I(x; e) = I(x; s) = H(x), where H(x) is the entropy of x). This, however, leads to a strong correlation between s and e, in conflict with the independence Assumption 2.1. In other words, it is difficult to balance completeness (condition 1) and intra-view exclusivity (condition 2) (see experimental results in supplement B.4). Fortunately, there is an alternative strategy that avoids this difficulty. Specifically, we introduce a latent representation r generated by two independent distributions with respect to s and e, under a mild assumption:

Assumption 2.2 (Relationship between s, e and r):

$$p(s, e, x) = p(r, x). \qquad (4)$$

In our formulation, we define r = f(s, e), where r is derived from s and e through an underlying function f(·) and satisfies p(r, x) = p(s, e, x). Eq. 4 is a mild assumption; for example, invertibility of the mapping r = f(s, e) is a sufficient condition that can be easily verified. Note that r = [s, e] is one special case, which will be discussed later. Based on Assumption 2.2, we can get (more details are shown in supplement C):

$$p(r) = p(s, e), \quad p(r \mid x) = p(s, e \mid x). \qquad (5)$$

Then, we can induce the following result (more details are shown in supplement C):

$$I(x; e, s) = I(x; r). \qquad (6)$$

This result indicates that the maximization of I(x; e, s) can be achieved by maximizing the mutual information between the agency r and x. In this way, the independence of e and s is well preserved and the previous conflict is dispelled. Next, we explain how to encode the information of x into independent representations s and e by introducing the agency r.

• How to obtain independent representations e and s by maximizing I(x; r)? First, we consider encoding the observed data x into a latent representation r by maximizing the mutual information between x and r. Considering robustness and effectiveness (Alemi et al., 2018), we can maximize this mutual information through Variational Autoencoders (VAEs) (Kingma & Welling, 2014). Accordingly, we have the following objective function:

$$\min_{q_r, d} \; \mathbb{E}_{x \sim p(x)} \Big[ -\mathbb{E}_{r \sim q_r(r \mid x)} \log d(x \mid r) + \mathbb{E}_{r \sim q_r(r \mid x)} \log \frac{q_r(r \mid x)}{p(r)} \Big], \qquad (7)$$

where d(x|r) (the "decoder") is a variational approximation to p(x|r), and q_r(r|x) (the "encoder") is a variational approximation to p(r|x), which converts the observed data x into the latent representation r.

Second, we consider how to obtain independent representations e and s by modeling q_r(r|x). For this goal, the relationships between s, e and r should be jointly modeled. As shown in Eq. 5, we have p(r|x) = p(s, e|x). Under Assumption 2.1, Eq. 5 can be rewritten as p(r|x) = p(s|x)p(e|x), which implies that q_r(r|x) can be modeled as the product of p(s|x) and p(e|x). Furthermore, we introduce the product-of-experts (PoE) (Hinton, 2002; Wu & Goodman, 2018) to model the product of q_s(s|x) and q_e(e|x), where the variational networks q_s(s|x) and q_e(e|x) are designed to approximate p(s|x) and p(e|x). It is worth noting that the key difference from the Multimodal Variational Autoencoder (MVAE) (Wu & Goodman, 2018) is that our model obtains the latent representation r from two independent components within each single view, while MVAE obtains a unified representation of all views by assuming independence between the representations of different views. Under the assumption that the true posteriors p(s|x) and p(e|x) are contained in the families of their variational counterparts q_s(s|x) and q_e(e|x), we have q_r(r|x) = q_s(s|x)q_e(e|x). With Gaussian distributions, we can obtain the closed-form solution for the product of the two distributions:

$$\mu_r = \frac{\mu_s \sigma_e^2 + \mu_e \sigma_s^2}{\sigma_s^2 + \sigma_e^2}, \quad \sigma_r^2 = \frac{\sigma_s^2 \sigma_e^2}{\sigma_s^2 + \sigma_e^2}.$$

Therefore, the independence between e and s is well preserved by modeling q_r(r|x). Accordingly, with q_r(r|x) = q_s(s|x)q_e(e|x) and p(r) = p(s)p(e), the objective in Eq. 7 is rewritten as:

$$\min_{q_s, q_e, d} \; \mathbb{E}_{x \sim p(x)} \Big[ -\mathbb{E}_{r \sim q_s(s \mid x) q_e(e \mid x)} \log d(x \mid r) + \mathbb{E}_{s \sim q_s(s \mid x)} \log \frac{q_s(s \mid x)}{p(s)} + \mathbb{E}_{e \sim q_e(e \mid x)} \log \frac{q_e(e \mid x)}{p(e)} \Big],$$

where p(s) and p(e) are set to Gaussian priors, which in turn pushes q_s(s|x) and q_e(e|x) toward Gaussian distributions and allows us to compute the product of the two in closed form. The above objective is the ELBO (evidence lower bound) (Kingma & Welling, 2014), with the first term being the reconstruction loss and the second and third terms being KL divergences. The proposed variant of the VAE inherits two advantages, from the VAE and from the PoE, respectively.
The first is that we can obtain approximate posteriors of s and e given x while preserving their independence. The second is that the proposed model still works even when s or e is missing at test time: we can use only s or only e as input to the decoder to reconstruct x (shown in the experimental section), which is quite different from concatenating e and s, or other forms that require both simultaneously to obtain r. In addition, concatenating s and e does not exploit the independence prior on s and e.
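The closed-form product of the two Gaussian "experts" above can be sketched in a few lines. This is a minimal illustration (the function name is ours, not the paper's): the product of Gaussian densities is, up to normalization, a Gaussian whose precision is the sum of the experts' precisions.

```python
import numpy as np

def poe_gaussian(mu_s, var_s, mu_e, var_e):
    """Closed-form product of two Gaussians q_s(s|x) and q_e(e|x).

    Precision-weighted combination: precisions add, and the mean is the
    precision-weighted average of the experts' means.
    """
    prec_s, prec_e = 1.0 / var_s, 1.0 / var_e
    var_r = 1.0 / (prec_s + prec_e)            # sigma_r^2 = var_s*var_e/(var_s+var_e)
    mu_r = var_r * (mu_s * prec_s + mu_e * prec_e)
    return mu_r, var_r
```

For example, combining N(0, 1) and N(2, 1) yields N(1, 0.5): the experts agree on the midpoint, and the product is more confident (smaller variance) than either expert alone.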

2.2. CONDITIONS 2-3: EXCLUSIVITY

To fulfill conditions 2 and 3, we minimize the mutual information between two variables by enhancing their independence. There are different strategies to promote independence between variables, each endowed with different properties. The most straightforward is to minimize the linear correlation. Accordingly, for condition 3 we have the following loss function:

$$\min_{q_e^i, q_e^j} \; \big\| e_i^{\top} e_j \big\|^2,$$

and a similar objective can be induced for condition 2. Although simple and effective, linear decorrelation may not be powerful enough to handle complex real-world correlations. Therefore, we also propose an alternative strategy for general correlations in supplementary material A.1.
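Over a mini-batch, the linear-decorrelation loss amounts to the squared cross-correlation between the two batches of codes. A minimal sketch (array and function names are ours; in the model, the inputs would come from the exclusive encoders q_e^i and q_e^j):

```python
import numpy as np

def linear_decorrelation_loss(E_i, E_j):
    """Squared cross-correlation between two batches of exclusive codes.

    E_i, E_j: (batch, dim) arrays of (mean-centered) representations.
    Minimizing this drives every pairwise linear correlation between
    the two codes toward zero.
    """
    batch = E_i.shape[0]
    cross = E_i.T @ E_j / batch          # (dim_i, dim_j) cross-correlation matrix
    return float(np.sum(cross ** 2))     # squared Frobenius norm
```

The loss is zero exactly when every dimension of one code is linearly uncorrelated with every dimension of the other, which is the intended (linear) notion of exclusivity.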

2.3. CONDITION 4: ALIGNMENT OF THE SHARED REPRESENTATION FROM DIFFERENT VIEWS

For condition 4, we ensure the commonality between s_i and s_j by maximizing their mutual information:

$$I(s_i; s_j) = \int p(s_i, s_j) \log \frac{p(s_i, s_j)}{p(s_i)\,p(s_j)} \, ds_i \, ds_j.$$

Computing this mutual information directly is difficult, since the true distributions are usually unknown. Based on the scalable and flexible MINE (Belghazi et al., 2018), we introduce two different strategies for maximizing the mutual information between the shared representations $s_i \sim q_s^i(s_i \mid x_i)$ and $s_j \sim q_s^j(s_j \mid x_j)$. MINE estimates the mutual information of two variables by training a classifier to distinguish whether samples come from the joint distribution $\mathbb{J}$ or from the product of marginals $\mathbb{M}$. It optimizes a tractable lower bound based on the Donsker-Varadhan representation (Donsker & Varadhan, 1983) of the KL divergence:

$$I(s_i; s_j) \ge \mathbb{E}_{\mathbb{J}}\big[T_\theta(s_i, s_j)\big] - \log \mathbb{E}_{\mathbb{M}}\big[e^{T_\theta(s_i, s_j)}\big],$$

where $T_\theta$ is a discriminator function modeled by a neural network with parameters $\theta$, and $\mathbb{J}$ and $\mathbb{M}$ are the joint distribution and the product of marginals, respectively. We can maximize the mutual information of s_i and s_j by maximizing this lower bound. Although the KL-based estimator is effective for some tasks, it tends to overemphasize the similarity between individual samples and thus cannot thoroughly explore the underlying similarity between distributions. To address this, we can replace the KL divergence with the JS divergence (Hjelm et al., 2019), which focuses on similarity between distributions rather than between samples. Accordingly, we maximize the mutual information of s_i and s_j as:

$$\max_{q_s^i, q_s^j, T_\theta} \; \mathbb{E}_{\mathbb{J}}\big[-\mathrm{sp}\big(-T_\theta(s_i, s_j)\big)\big] - \mathbb{E}_{\mathbb{M}}\big[\mathrm{sp}\big(T_\theta(s_i, s_j')\big)\big],$$

where $(s_i, s_j)$ corresponds to one sample observed in the ith and jth views, $s_j'$ corresponds to another sample from the jth view, and $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function. Specifically, the inner product is employed as the discriminator, i.e., $T_\theta(a, b) = a^{\top} b$. The two strategies are discussed in supplementary material A.2.
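The JS-based objective with the inner-product discriminator T_θ(a, b) = aᵀb can be sketched as below. Shuffling the second view within the batch is one common (assumed) way to draw approximate samples from the product of marginals; the function names are ours.

```python
import numpy as np

def softplus(z):
    """Numerically stable softplus sp(z) = log(1 + e^z)."""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def jsd_mi_objective(S_i, S_j):
    """JS-based MI lower-bound objective with an inner-product discriminator.

    S_i, S_j: (batch, dim) shared codes for the same batch of samples,
    aligned row by row. Positive (joint) pairs are aligned rows; negative
    (marginal) pairs are formed by shuffling S_j within the batch.
    Maximizing the return value pulls same-sample shared codes together.
    """
    pos = np.sum(S_i * S_j, axis=1)            # T(s_i, s_j) on the joint
    S_j_shuf = np.roll(S_j, 1, axis=0)         # crude product-of-marginals draw
    neg = np.sum(S_i * S_j_shuf, axis=1)       # T(s_i, s_j') on the marginals
    return float(np.mean(-softplus(-pos)) - np.mean(softplus(neg)))
```

Because the softplus terms are bounded per pair, making a positive pair ever more similar yields diminishing returns, which matches the discussion of the JS estimator in A.2.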

3. RELATED WORK

The disentanglement of representations aims to describe an object through independent factors in order to provide a more reliable and interpretable representation (Bengio et al., 2013). Total-correlation-based methods (Chen et al., 2018; Esmaeili et al., 2019) increase the independence between different factors by minimizing a total correlation (TC) loss. DIP (Disentangled Inferred Prior) (Kumar et al., 2018) encourages disentanglement by introducing a disentangled prior to constrain the learned representation. However, unsupervised disentanglement learning has known theoretical limitations (Locatello et al., 2019); there are therefore several semi-supervised methods (Kingma et al., 2014; Narayanaswamy et al., 2017; Bouchacourt et al., 2018) for disentangled representation learning that have access to some of the real factors of the data. Disentangled representations have also been used in various real-world applications (Gonzalez-Garcia et al., 2018; Liu et al., 2018).

Multi-view representation learning aims to jointly utilize information from multiple views for better performance. To learn a unified representation across views, CCA-based algorithms (Hotelling, 1992; Akaho, 2006; Andrew et al., 2013; Wang et al., 2015) maximize the correlation between different views to extract shared information. KCCA (Akaho, 2006) and DCCA (Andrew et al., 2013) extend traditional CCA with kernels and deep neural networks, respectively. DCCAE (Wang et al., 2015) jointly considers the reconstruction of each single view and the correlation across views. To jointly encode shared and view-specific information, latent-representation-based models (Zhang et al., 2019) have been proposed. There are also models (Wu et al., 2019; Suzuki et al., 2016; Vedantam et al., 2018) that employ a VAE to learn a unified multi-modal representation.

4. EXPERIMENTS

Experimental Settings. We conduct comprehensive experiments to evaluate the disentangled representation. Specifically, we investigate the disentanglement quantitatively through clustering and classification (section 4.1), and provide visualization results to evaluate it intuitively (section 4.2 and section B.5). Furthermore, we conduct ablation experiments (section B.3) and present application experiments based on the disentangled representation (section B.6). Due to space limitations, some experiments are detailed in the supplementary material. Datasets: Similar to the work (Gonzalez-Garcia et al., 2018), we construct the dataset MNIST-CBCD, comprising MNIST-CB (MNIST with Colored Background) and MNIST-CD (MNIST with Colored Digit) as two views, by randomly modifying the color of the digits and of the background of images from MNIST. Intuitively, the shape of a digit corresponds to the shared information, while the colors of the background and of the digit correspond to the private information of each view. The same strategy is applied to FashionMNIST. We also conduct experiments on the face-image dataset CelebA (Liu et al., 2015), a large-scale face-attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The image and attribute domains are treated as two different views, and we select the 18 most significant attributes as the attribute vector (Perarnau et al., 2016). For these two views, the shared information is attribute-related (e.g., viewpoint, hair color, presence of glasses), and the exclusive representation carries non-attribute information.
To verify the disentanglement, we compare our algorithm with: (1) Raw data (Raw), which reshapes images directly into vectors as representations; (2) variational autoencoders (VAE (Kingma & Welling, 2014)), which extract features from the data of each view; (3) CCA-based methods (CCA (Hotelling, 1992), KCCA (Akaho, 2006), DCCA (Andrew et al., 2013) and DCCAE (Wang et al., 2015)), which obtain a common representation by maximizing the correlation between different views; and (4) the multimodal variational autoencoder (MVAE (Wu et al., 2019)), which learns a common representation of the two views. For Raw and VAE, we report the clustering results using the representations obtained from view-1, from view-2, and from the concatenation of the two. For our method, we use s_1, s_2 and the concatenated representation for clustering/classification.

4.1. QUANTITATIVE ANALYSIS

To evaluate the disentanglement of our algorithm, we conduct clustering and classification based on the learned representations. For simplicity, we employ k-means as the clustering algorithm: since k-means is based on the Euclidean distance, it gives a relatively objective measure of representation quality. We also conduct classification experiments based on the shared representations. All experiments are run 20 times and the mean accuracies are reported (refer to the supplement for standard deviations). From the quantitative results in Tables 1 and 2, the following observations are drawn: (1) directly using the raw features for clustering/classification is not promising, as the digit and color information are mixed; moreover, since the background region is much larger than the digit area, the accuracy of view-1 is relatively low on MNIST; (2) compared with the raw features, the shared information extracted by our model is competitive due to its clear semantic content; (3) by extracting the shared (digit) information explicitly, our model obtains much better results. Furthermore, we evaluate the exclusive representations on clustering. For MNIST, the colors of the background (MNIST-CB: MNIST with Colored Background) or of the digits (MNIST-CD: MNIST with Colored Digit) are used as class labels. According to Table 3, our algorithm obtains more promising clustering performance with the exclusive representation than with the raw data, while existing algorithms cannot obtain an exclusive representation explicitly. The improvement on view-2 (of MNIST-CBCD) is less substantial; a possible reason is that the exclusive information (the digit color) is less prominent due to the small area ratio of the digits, which increases the difficulty of disentanglement. We also verify our disentangled representation with cross-modal retrieval on CelebA (Liu et al., 2015).
Specifically, after training the disentangling networks, we obtain the shared representations from the image and attribute views, respectively. The attribute vector can therefore be used to retrieve related face images (attribute-specific cross-modal retrieval). The quantitative results are reported in Table 4, and examples are shown in Fig. 9(a) (in the supplement). Given the specific attributes represented as a vector $l_n$, we obtain the attribute vector $\hat{l}_{nk}$ for the kth most similar retrieved image, which is associated with D binary attributes. Accordingly, for the top K retrieved images, we have

$$\text{accuracy} = \frac{\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{d=1}^{D} \delta(l_{nd}, \hat{l}_{nkd})}{N \times K \times D},$$

where $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise. According to the results in Table 4, the performance of our model is much higher than that of MVAE due to the promising disentanglement. From Fig. 3, we have the following observations, which are consistent with the definition of multi-view disentanglement: (1) by combining the shared and exclusive information, the original image can be fully reconstructed ((d) and (i)), satisfying condition 1 (completeness); (2) the shared and exclusive representations contain different information: with the shared representation, we can reconstruct images ((b) and (g)) with clear digit shapes but without the color information of the original images, while with the exclusive representations we can reconstruct the color information ((c) and (h)) but not the digit shapes, which verifies that condition 2 (intra-view exclusivity) is satisfied; (3) the exclusive representations from different views contain different information: the exclusive representation (c) from view-1 carries the background color, while (h) from view-2 carries the digit color, which verifies that our model satisfies condition 3 (inter-view exclusivity).
(4) The shared representations ((b), (g), (e) and (j)) from different views contain (almost) the same information, i.e., condition 4 (commonality). We verify this by reconstructing digit shapes in view-2 using the shared representations from view-1, and vice versa. Similar experiments are conducted on CelebA (section B.5).
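The retrieval accuracy defined in section 4.1 can be sketched as follows (array and function names are ours): it counts how many binary attribute entries of the top-K retrieved images agree with the query, averaged over all queries, retrieved images, and attributes.

```python
import numpy as np

def retrieval_accuracy(queries, retrieved):
    """Attribute-matching accuracy for top-K cross-modal retrieval.

    queries:   (N, D) binary query attribute vectors l_n.
    retrieved: (N, K, D) binary attribute vectors of the top-K retrieved
               images for each query.
    Returns the fraction of attribute entries that agree, averaged over
    all N queries, K retrieved images, and D attributes.
    """
    N, K, D = retrieved.shape
    matches = (retrieved == queries[:, None, :])  # broadcast each query over its K hits
    return float(matches.sum() / (N * K * D))
```

For instance, a single query [1, 0] whose two retrieved images carry attributes [1, 0] and [0, 0] matches 3 of 4 entries, giving an accuracy of 0.75.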

5. CONCLUSION

In this work, we proposed a formal definition for disentangling multi-view data and, based on it, developed a principled algorithm that automatically disentangles multi-view data into shared and exclusive representations without supervision. Extensive experiments validate that the proposed algorithm provides promising disentanglement, promotes subsequent analysis tasks (e.g., clustering, classification and retrieval), and is effective and flexible for analyzing and manipulating multi-view data. In the future, we will focus on the semi-supervised setting to improve discriminative ability.

A SUPPLEMENTAL MATERIAL FOR METHODS

A.1 SUPPLEMENTAL MATERIAL FOR CONDITIONS 2-3

Inspired by Choi et al. (2018), we introduce a classifier to distinguish the (independent) representations generated by the encoders. The classification loss can be defined as:

$$\min_{q, C} \; \mathbb{E}_{R}\Big[-\int p(z) \log C(z \mid R) \, dz\Big],$$

where C is a classifier that distinguishes representations from different sources (independent representations), R is drawn from a set of representations with different sources (e.g., the private representations e_i and e_j from different views), q corresponds to the encoders of the different views, and z is a label indicating the source of R. Generally, it is difficult to strictly guarantee independence; however, both strategies promote independence between the representations generated by the encoders to a certain extent. We implement both strategies and observe similar results in practice.

A.2 DISCUSSION OF CONDITION 4

Both the KL- and JS-based estimators can maximize the mutual information. However, due to the different properties of the KL and JS divergences, the two estimators suit different scenarios. Since the JS divergence is bounded, in theory it prevents the estimator from overemphasizing the similarity of the two representations of the same sample (even if they are exactly the same, the loss will not be reduced much further). This prevents the encoder from focusing too much on making s_i exactly match s_j at the expense of the overall objective. In contrast, since the KL-based estimator is unbounded, s_i and s_j are forced to be as similar as possible. Although this is not appropriate for most tasks, it helps us observe whether s_i and s_j intuitively have high mutual information; for example, we can replace s_i and s_j with each other to see if they accomplish the same task (demonstrated in the experimental part).
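For discrete source labels, the classifier loss in A.1 reduces to a standard cross-entropy over the classifier's softmax output. A minimal numpy sketch (names are ours; in the model, both the encoders q and the classifier C minimize this loss, encouraging representations from different sources to be distinguishable):

```python
import numpy as np

def source_classification_loss(logits, labels):
    """Cross-entropy of a source classifier C over representation sources.

    logits: (batch, n_sources) classifier scores C(z|R) before softmax.
    labels: (batch,) integer source of each representation
            (e.g., 0 for e_i, 1 for e_j).
    """
    z = logits - logits.max(axis=1, keepdims=True)                 # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log C(z|R)
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

When the classifier is maximally uncertain (uniform logits over two sources), the loss equals log 2; confident correct predictions drive it toward zero.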

B SUPPLEMENTAL EXPERIMENTS B.1 NETWORK ARCHITECTURES

For MNIST and FashionMNIST, two convolutional layers and two fully connected layers are used for the encoder, while we employ two fully connected layers and two deconvolutional layers for the decoder. For the face-image dataset CelebA, we use four convolutional layers and two fully connected layers to build the encoder for the image view, while the decoder is built from two fully connected layers and four deconvolutional layers. For the attribute-vector view, three fully connected layers are used to construct both the encoder and the decoder. Batch normalization Ioffe & Szegedy (2015) and Swish activation functions Ramachandran et al. (2017) are used between the convolutional layers.

B.2 DETAILED EXPERIMENTAL RESULTS OF QUANTITATIVE EXPERIMENTS

Due to space limitations, we report only the means of the clustering and classification experiments in the main text; here we add their standard deviations in Tables 5 and 6.

B.3 ABLATION EXPERIMENTS

To verify the necessity of each criterion in Definition 2.1, we conduct experiments on the MNIST-CBCD dataset. Specifically, we repeat the experiments after removing conditions 2, 3 and 4 from the objective function; the corresponding results are shown in Figs. 4, 5 and 6, respectively. As shown in Fig. 4, due to the removal of intra-view exclusivity (condition 2), there is shared information between s_i and e_i, which is clearly visible in the reconstructed images (g) and (h). Similarly, as shown in Fig. 5, after removing inter-view exclusivity (condition 3), the disentanglement becomes much worse. As shown in Fig. 6, after removing condition 4, we can hardly disentangle the information from the different views due to the significant difference between the shared representations s_i and s_j.

… their genders cannot be easily identified (condition 2: intra-view exclusivity). By combining the shared and exclusive representations, the original images can be accurately reconstructed (condition 1: completeness). Our model recovers the most critical information without emphasizing details, because the current task is to obtain good representations for clustering/retrieval; with image reconstruction as the goal, additional techniques could be used (e.g., deeper networks than the 4 layers in our implementation, or an adversarial strategy). It is worth noting a small difference from MNIST-CBCD: the information of view-2 (attributes) is actually contained in view-1 (images). Therefore, it is rather difficult to reconstruct face images using the exclusive representation from view-2; the reconstructed images are almost all the same (condition 3: inter-view exclusivity). The visualization experiments on CelebA further verify that our disentangled representation satisfies the four conditions of Definition 2.1.

B.6 SUPPLEMENTAL RESULTS FOR ATTRIBUTE-SPECIFIC CROSS-MODAL RETRIEVAL AND EDITING

In this section, we validate the potential use of our multi-view disentangled representation in two real applications: attribute-specific retrieval and attribute-specific editing. First, we verify our disentangled representation on the attribute-specific face retrieval task on the CelebA Liu et al. (2015) dataset. The details are as described in the text, and here we show some examples in Fig. 9 (a). Second, we demonstrate the potential use of our model in attribute-specific face editing by manipulating the shared representations. The shared representation from the image view allows us to 



Figure 1: Illustration of multi-view disentangled representation. (a): The red and white graphics indicate the shared information between different views and the (private) exclusive information within each view, respectively. (b): The exact disentangled representation, in which the shared (gray area) and exclusive (white area) components are separated, is achieved when the four conditions in definition 2.1 are satisfied. (c)(d)(e)(f): The exact disentangled representation cannot be guaranteed when any condition is violated. Intuitively, the proposed four conditions are necessary and sufficient, since any change to (b) violates the definition.

Figure 2: Illustration of our model, which corresponds to the objective in Eq. 1 and the conditions in definition 2.1. Refer to Section 2.1 for PoE (Product of Experts).
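The PoE in Fig. 2 fuses the per-view posteriors over the shared representation. For diagonal Gaussian experts this product has a closed form: precisions add, and means combine precision-weighted; including a standard-normal "prior expert" is common practice in multi-view VAEs. A NumPy sketch under those assumptions (the details of our actual fusion are in Section 2.1):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Fuse diagonal-Gaussian experts N(mu_v, exp(logvar_v)), one per view,
    plus a standard-normal prior expert. The product of Gaussian densities
    is Gaussian with precision equal to the sum of the experts' precisions
    and mean equal to the precision-weighted average of their means."""
    precisions = [np.exp(-lv) for lv in logvars] + [np.ones_like(logvars[0])]
    means = list(mus) + [np.zeros_like(mus[0])]   # prior expert has mean 0
    joint_precision = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(means, precisions)) / joint_precision
    joint_logvar = -np.log(joint_precision)
    return joint_mu, joint_logvar
```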

Figure 3: Visualization of reconstruction with shared and exclusive representations. The top and bottom rows correspond to the reconstruction results from the decoders of view-1 and view-2, respectively. 'Shared', 'Exclusive', and 'S&E' indicate shared, exclusive, and the combination of shared and exclusive representations, respectively. 'View-1' and 'View-2' in the parentheses indicate the view where these representations come from. Note that the images in (e) ((j)) are reconstructed by the decoder of view-1 (view-2) using the shared representation from view-2 (view-1).

Figure 4: Visualization of reconstruction with shared and exclusive representations after removing condition 2 (intra-view exclusivity between s_i and e_i). (Zoom in for best view).

Figure 5: Visualization of reconstruction with shared and exclusive representations after removing condition 3 (inter-view exclusivity between e_i and e_j). (Zoom in for best view).

Figure 6: Visualization of reconstruction with shared and exclusive representations after removing condition 4 (commonality between s_i and s_j). (Zoom in for best view).

Figure 7: Visualization of reconstruction with shared and exclusive representations. 'O', 'S', and 'E' indicate the original image, shared representation, and exclusive representation, respectively. 'View-1' and 'View-2' in the parentheses indicate the view where these representations come from. (Zoom in for best view).

Figure 8: Visualization of image reconstruction with shared and exclusive representations. We use the decoder of the image view to reconstruct images by inputting the shared and exclusive representations, where 'Original' indicates the original images, and 'Shared' and 'Exclusive' indicate the shared and exclusive representations, respectively. 'View-1 S&E' indicates the combination of shared and exclusive representations. Similarly, 'View-2 shared' and 'View-2 exclusive' indicate the shared and exclusive representations from View-2, respectively, which are used as inputs to the decoder. (Zoom in for best view).

PROOF OF $I(x; e, s) = I(x; r)$. First, based on Assumption 2.2, we can get

$$p(r) = \int p(r, x)\,dx = \int p(s, e, x)\,dx = p(s, e),$$

so that

$$I(x; e, s) = \iint p(x)\,p(r|x) \log \frac{p(r|x)}{p(r)}\,dr\,dx = I(x; r).$$

β-VAE introduces a penalty coefficient on the KL term in the ELBO to constrain the capacity of the latent space. Different from β-VAE, several methods (Chen et al.) further decompose the KL term.

Comparison on the clustering task. 'M' and 'F' indicate MNIST-CBCD and FMNIST-CBCD, respectively. The top three results are in bold and marked with superscript.

Comparison on the classification task. 'KNN' and 'LSVM' indicate the K-Nearest Neighbor and linear SVM classifiers, respectively. 'M' and 'F' indicate MNIST-CBCD and FMNIST-CBCD, respectively.

Clustering with exclusive representation.

Cross-modal retrieval.

Comparison between existing multi-modal representation learning methods and ours on the clustering task. 'M' and 'F' indicate the MNIST-CBCD and FMNIST-CBCD datasets, respectively.

Comparison between existing multi-modal representation learning methods and ours on the classification task. 'KNN' and 'LSVM' indicate the K-Nearest Neighbor and linear SVM, respectively.

$\int p(x, s, e) \log p(x|s, e)\,ds\,de\,dx - \int p(x, e) \log p(x|e)\,de\,dx$. Under Assumption 2.1, we can get $p(s, e) = p(s)p(e)$ and $p(s, e|x) = p(s|x)p(e|x)$. Substituting this into Eq. 13 yields Eq. 3 in Section 2.1,


We provide more results of attribute-specific editing for more attributes. The experimental results are shown in Fig. 10.

