MULTI-VIEW DISENTANGLED REPRESENTATION

Abstract

Learning effective representations for data with multiple views is crucial in machine learning and pattern recognition. Recent efforts have focused on learning unified or latent representations that integrate information from different views for specific tasks. These approaches generally assume simple or implicit relationships between views and consequently cannot accurately and explicitly depict the correlations among them. To address this, we first propose a definition of, and conditions for, unsupervised multi-view disentanglement, providing general guidance for disentangling representations across views. Furthermore, we derive a novel objective function that explicitly disentangles multi-view data into a part shared across views and a (private) part exclusive to each view. This explicitly guaranteed disentanglement holds great potential for downstream tasks. Experiments on a variety of multi-modal datasets demonstrate that our objective effectively disentangles information from different views while satisfying the disentangling conditions.

1. INTRODUCTION

Multi-view representation learning (MRL) involves learning representations by effectively leveraging information from different perspectives. The representations produced by MRL are effective when the correlations across views are accurately modeled and thus properly exploited for downstream tasks. One representative algorithm, Canonical Correlation Analysis (CCA) (Hotelling, 1992), aims to maximize the linear correlation between two views under the assumption that factors from different views are highly correlated. Under a similar assumption, extended versions of CCA, including kernelized CCA (Akaho, 2006) and Deep CCA (Andrew et al., 2013), explore more general correlations. Several other methods (Cao et al., 2015; Sublime et al., 2017) instead maximize the independence between different views to enhance complementarity. Going beyond the simple assumptions above, the latent-representation approach encodes different views with a degradation process, implicitly exploiting both consistency and complementarity (Zhang et al., 2019). These existing MRL algorithms are effective; however, the correlations they assume between views are usually simple and thus cannot accurately model, let alone explicitly disentangle, the complex correlations in real-world data, which hinders further improvement and interpretability. Although a few heuristic algorithms (Tsai et al., 2019; Hu et al., 2017) explicitly decompose the multi-view representation into shared and view-specific parts, they are designed specifically for supervised learning tasks, provide no disentangled-representation guarantee, and fall short of formally defining the relationships between the different parts.
To address this issue, we propose to unsupervisedly disentangle the original data from different views into a shared representation across views and an exclusive (private) part within each view, which explicitly depicts the correlations and thus not only enhances performance on existing tasks but could also inspire potential applications. Specifically, we first provide a definition of the multi-view disentangled representation by introducing sufficient and necessary conditions for guaranteeing the disentanglement of different views. Based on these conditions, an information-theory-based algorithm is proposed to accurately disentangle different views. To summarize, the main contributions of our work are as follows:
• To the best of our knowledge, this is the first work to formally study multi-view disentangled representation with strict conditions, which might serve as a foundation for future research on this problem.
• Based on our definition, we propose a multi-view disentangling model, in which information-theory-based multi-view disentangling can accurately decompose the information into shared and (private) exclusive parts.

2. MULTI-VIEW DISENTANGLED REPRESENTATION

Existing multi-view representation learning methods (Wu & Goodman, 2018; Zhang et al., 2019) can obtain a common representation for multi-view data; however, the correlations between different views are not explicitly expressed. Supervised algorithms (Hu et al., 2017; Tan et al., 2019) can decompose multiple views into a common part and private parts, but offer no disentangling guarantee. Therefore, we propose a multi-view disentanglement algorithm that can explicitly separate the shared and exclusive information in unsupervised settings. Formally, we first propose a definition of a multi-view disentangled representation by introducing four criteria, which are considered sufficient and necessary conditions for disentangling multiple views.
The definition is as follows:

Definition 2.1 (Multi-View Disentangled Representation). Given a sample with two views, i.e., X = {x_i}_{i=1}^2, the representation S_dis = {s_i, e_i}_{i=1}^2 is a multi-view disentangled representation if the following conditions are satisfied:
• Completeness: ① The shared representation s_i and exclusive representation e_i should jointly contain all the information of the original representation x_i;
• Exclusivity: ② There is no shared information between the common representation s_i and the exclusive representation e_i, which ensures exclusivity within each view (intra-view). ③ There is no shared information between e_i and e_j, which ensures exclusivity between the private information of different views (inter-view);
• Commonality: ④ The common representations s_i and s_j should contain the same information.

Equipped with the exclusivity constraints, the common representations are guaranteed not only to be identical but also to contain the maximal common information. The necessity of each criterion is illustrated in Fig. 1 (satisfying all four conditions produces exact disentanglement, while violating any one of them may result in an unexpected disentangled representation). Note that existing (single-view) unsupervised disentanglement focuses on learning



Figure 1: Illustration of multi-view disentangled representation. (a): The red and white graphics indicate the information shared between different views and the (private) exclusive information within each view, respectively. (b): The exact disentangled representation can be achieved, with the shared (gray area) and exclusive (white area) components separated, when the four conditions in Definition 2.1 are satisfied. (c)(d)(e)(f): The exact disentangled representation cannot be guaranteed when any condition is violated. Intuitively, the proposed four conditions are necessary and sufficient, since any change to (b) would violate the definition.
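One natural way to formalize the four conditions is in terms of entropy and mutual information; the following is our paraphrase of the criteria, not the paper's stated objective (which is given later in Eq. 1):

```latex
\begin{align*}
&\text{Completeness (①):}           && H(x_i \mid s_i, e_i) = 0, \\
&\text{Intra-view exclusivity (②):} && I(s_i; e_i) = 0, \\
&\text{Inter-view exclusivity (③):} && I(e_1; e_2) = 0, \\
&\text{Commonality (④):}            && I(s_1; s_2) = H(s_1) = H(s_2),
\end{align*}
```

where $H(\cdot)$ denotes entropy and $I(\cdot;\cdot)$ mutual information, for $i \in \{1, 2\}$. Under this reading, ② and ③ force all cross-view redundancy into the shared codes, so ④ makes $s_1$ and $s_2$ carry exactly the maximal common information.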

Figure 2: Illustration of our model, which corresponds to the objective in Eq. 1 and the conditions in Definition 2.1. Refer to the text in Section 2.1 for PoE (Product of Experts).
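For readers unfamiliar with PoE: a product of Gaussian experts is again Gaussian, with precision equal to the sum of the experts' precisions. The following minimal sketch shows this standard combination rule (the function name and the two-expert example are our own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Combine Gaussian experts N(mu_k, var_k) into a single Gaussian.

    The product density is Gaussian with precision = sum of precisions
    and mean = precision-weighted average of the experts' means.
    """
    precisions = np.exp(-np.asarray(logvars))          # 1 / var_k
    var = 1.0 / precisions.sum(axis=0)                 # combined variance
    mu = var * (np.asarray(mus) * precisions).sum(axis=0)
    return mu, np.log(var)

# Two equally confident experts with means 0 and 2: the combined mean is
# their midpoint, and the combined variance is halved.
mu, logvar = product_of_experts(
    [np.array([0.0]), np.array([2.0])],
    [np.array([0.0]), np.array([0.0])],
)
print(mu, np.exp(logvar))  # mean 1.0, variance 0.5
```

In the multi-view setting of Fig. 2, this is how per-view posteriors over the shared code can be fused into a single joint posterior.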

