IDENTIFIABILITY RESULTS FOR MULTIMODAL CONTRASTIVE LEARNING

Abstract

Contrastive learning is a cornerstone underlying recent progress in multi-view and multimodal learning, e.g., in representation learning with image/caption pairs. While its effectiveness is not yet fully understood, a line of recent work reveals that contrastive learning can invert the data generating process and recover ground truth latent factors shared between views. In this work, we present new identifiability results for multimodal contrastive learning, showing that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously. Specifically, we distinguish between the multi-view setting with one generative mechanism (e.g., multiple cameras of the same type) and the multimodal setting that is characterized by distinct mechanisms (e.g., cameras and microphones). Our work generalizes previous identifiability results by redefining the generative process in terms of distinct mechanisms with modality-specific latent variables. We prove that contrastive learning can block-identify latent factors shared between modalities, even when there are nontrivial dependencies between factors. We empirically verify our identifiability results with numerical simulations and corroborate our findings on a complex multimodal dataset of image/text pairs. Zooming out, our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.

1. INTRODUCTION

Multimodal representation learning is an emerging field whose growth is fueled by recent developments in weakly supervised learning algorithms and by the collection of suitable multimodal datasets. Multimodal data is characterized by the co-occurrence of observations from two or more dependent data sources, such as paired images and captions (e.g., Salakhutdinov and Hinton, 2009; Shi et al., 2019; Radford et al., 2021); more generally, multimodal observations consist of aligned measurements from different types of sensors (Baltrušaitis et al., 2019). Co-occurrence is a form of weak supervision (Shu et al., 2020; Locatello et al., 2020; Chen and Batmanghelich, 2020), in that paired observations can be viewed as proxies (i.e., weak labels) for a shared but unobserved ground truth factor. Among representation learning methods suitable for weakly supervised data, contrastive learning (Gutmann and Hyvärinen, 2010; Oord et al., 2018) stands out because it is designed to leverage co-occurring observations from different views. In practice, contrastive learning achieves promising results for multi-view and multimodal learning; a prominent example is the contribution of CLIP (Radford et al., 2021) to the groundbreaking advancements in text-to-image generation (Ramesh et al., 2021; 2022; Rombach et al., 2022; Saharia et al., 2022).

Despite this empirical success, what makes contrastive learning effective in practice is not sufficiently well understood. Recent works attribute its effectiveness to the recovery of shared latent factors from the underlying causal graph (Gresele et al., 2019; Zimmermann et al., 2021; von Kügelgen et al., 2021). From the perspective of multi-view independent component analysis, it was shown that contrastive learning can invert a nonlinear mixing function (i.e., a nonlinear generative process) applied to a latent variable with mutually independent components (Gresele et al., 2019; Zimmermann et al., 2021). More recently, von Kügelgen et al. (2021) showed that contrastive learning can recover shared factors up to block-wise indeterminacies, even if there are nontrivial causal and statistical dependencies between the latent variables.
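To make this setup concrete, the following is a minimal sketch of the kind of generative process described above: a latent variable shared between modalities and two modality-specific latent variables are passed through two distinct nonlinear mixing mechanisms. All names, dimensions, and the particular mixing functions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
DIM_SHARED, DIM_SPECIFIC, DIM_OBS = 4, 2, 16

# Two distinct mixing mechanisms (e.g., "camera" vs. "microphone"),
# sketched here as simple nonlinear maps with different random weights.
W1 = rng.normal(size=(DIM_OBS, DIM_SHARED + DIM_SPECIFIC))
W2 = rng.normal(size=(DIM_OBS, DIM_SHARED + DIM_SPECIFIC))

def sample_pair():
    """Draw one pair of co-occurring observations (x1, x2)."""
    z = rng.normal(size=DIM_SHARED)      # factors shared between modalities
    m1 = rng.normal(size=DIM_SPECIFIC)   # modality-specific factors, modality 1
    m2 = rng.normal(size=DIM_SPECIFIC)   # modality-specific factors, modality 2
    x1 = np.tanh(W1 @ np.concatenate([z, m1]))  # observation from mechanism 1
    x2 = np.tanh(W2 @ np.concatenate([z, m2]))  # observation from mechanism 2
    return x1, x2
```

The two observations depend on the same shared factors z but are produced by different mechanisms, which is the distinguishing feature of the multimodal setting compared with the multi-view setting.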
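Contrastive learning operates on batches of such co-occurring pairs. Below is a minimal sketch of the symmetric InfoNCE objective (Oord et al., 2018) as used in CLIP-style multimodal training; the function name, the NumPy/SciPy implementation, and the temperature value are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def symmetric_infonce(z1, z2, tau=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    z1, z2: (batch, dim) arrays of encoder outputs for co-occurring
    observations; row i of z1 and row i of z2 form a positive pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # (batch, batch) similarity matrix

    # Diagonal entries are positives; off-diagonal entries act as negatives.
    pos = np.diag(logits)
    loss_12 = -(pos - logsumexp(logits, axis=1)).mean()  # match x1 -> x2
    loss_21 = -(pos - logsumexp(logits, axis=0)).mean()  # match x2 -> x1
    return (loss_12 + loss_21) / 2

# Example: embeddings of 8 pairs, where z2 is a noisy copy of z1.
z1 = np.random.default_rng(1).normal(size=(8, 32))
z2 = z1 + 0.1 * np.random.default_rng(2).normal(size=(8, 32))
print(symmetric_infonce(z1, z2))
```

Minimizing this loss pulls embeddings of matching pairs (the diagonal) together while pushing apart the other in-batch combinations; it is this objective whose minimizers the identifiability results characterize.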

