MULTI-VIEW DISENTANGLED REPRESENTATION

Abstract

Learning effective representations for data with multiple views is crucial in machine learning and pattern recognition. Recent efforts have largely focused on learning unified or latent representations that integrate information from different views for specific tasks. These approaches generally assume simple or implicit relationships between views and consequently cannot accurately and explicitly depict the correlations among them. To address this, we first propose a definition of, and conditions for, unsupervised multi-view disentanglement, providing general guidance for disentangling representations across different views. Furthermore, we derive a novel objective function that explicitly disentangles multi-view data into a part shared across views and a private (exclusive) part within each view. This explicitly guaranteed disentanglement holds great potential for downstream tasks. Experiments on a variety of multi-modal datasets demonstrate that our objective effectively disentangles information from different views while satisfying the disentangling conditions.

1. INTRODUCTION

Multi-view representation learning (MRL) involves learning representations by effectively leveraging information from different perspectives. The representations produced by MRL are effective when correlations across different views are accurately modeled and thus properly exploited for downstream tasks. One representative algorithm, Canonical Correlation Analysis (CCA) (Hotelling, 1992), aims to maximize linear correlations between two views under the assumption that factors from different views are highly correlated. Under a similar assumption, extended versions of CCA, including kernelized CCA (Akaho, 2006) and Deep CCA (Andrew et al., 2013), explore more general correlations. Several other methods (Cao et al., 2015; Sublime et al., 2017) maximize the independence between views to enhance complementarity. Going beyond these simple assumptions, the latent-representation approach encodes different views through a degradation process, implicitly exploiting both consistency and complementarity (Zhang et al., 2019). Although these existing MRL algorithms are effective, the correlations they assume between views are usually too simple to accurately model or explicitly disentangle complex real-world correlations, which hinders further improvement and interpretability. A few heuristic algorithms (Tsai et al., 2019; Hu et al., 2017) do explicitly decompose the multi-view representation into shared and view-specific parts, but they are specifically designed for supervised learning tasks, provide no guarantee of disentangled representations, and fall short of formally defining the relationships between the different parts.
To address this issue, we propose to disentangle, in an unsupervised manner, the original data from different views into a representation shared across views and an exclusive (private) part within each view. This explicitly depicts the correlations and thus not only enhances performance on existing tasks but could also inspire new applications. Specifically, we first provide a definition of the multi-view disentangled representation by introducing sufficient and necessary conditions that guarantee the disentanglement of different views. Based on these conditions, an information-theory-based algorithm is proposed to accurately disentangle the views. To summarize, the main contributions of our work are as follows:

• To the best of our knowledge, this is the first work to formally study multi-view disentangled representation under strict conditions, which might serve as a foundation for future research on this problem.

• Based on our definition, we propose a multi-view disentangling model, in which information-theory-based multi-view disentangling can accurately decompose the information into shared

