ON THE IMPORTANCE OF CONTRASTIVE LOSS IN MULTIMODAL LEARNING

Abstract

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved remarkable success in multimodal learning, where the model minimizes the distance between the representations of different views (e.g., an image and its caption) of the same data point, while keeping the representations of different data points away from each other. From a theoretical perspective, however, it is unclear how contrastive learning can efficiently learn to align the representations from different views, especially when the data are not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we reveal a stagewise behavior of the learning process: in the first stage, the model aligns the feature representations using positive pairs, and the condition number of the learned representations grows; in the second stage, the model reduces this condition number using negative pairs.

1. INTRODUCTION

One of the exceptional abilities of humans is to associate data from different modalities (such as text and images). For example, when we hear the words "white dog", we can immediately align them with the image of a white dog. When we hear the loud sound of an engine, we can imagine an expensive sports car passing nearby. Recently, in machine learning, multimodal learning, which trains a model to align data from different modalities, has become an increasingly popular research direction, especially in deep learning (e.g., He & Peng (2017)). The major difference between the contrastive approach and other approaches is that the contrastive loss not only requires the learned representations of the same pair of data (i.e., positive pairs) to be positively aligned, but also requires the data from different pairs (i.e., negative pairs) to be as negatively aligned as possible. The CLIP paper (Radford et al., 2021) also identifies the contrastive loss as the most critical ingredient of its success. Despite the empirical success of this contrastive learning-based method, from a theoretical perspective, the most fundamental questions are still largely open. In particular, how do contrastive pairs help in this multimodal learning approach? How can the non-convex contrastive loss be efficiently minimized to learn features from both modalities? Unlike prior theoretical works on contrastive learning, which mostly focus on extracting features from one modality (e.g., Arora et al. (2019)), one main technical challenge of analyzing contrastive learning in the multimodal setting is understanding how the model can be trained to align the feature representations f_A, f_B from modalities A and B, respectively. Due to the existence of negative pairs, which emphasize negative correlations between f_A and f_B, it is not obvious that the model still has an incentive to align the features from the two modalities.
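As a concrete illustration of the mechanics described above (a standard CLIP-style symmetric objective, not necessarily the exact loss analyzed in this paper; the function names and temperature value are our own choices), the contrastive loss over a batch of paired embeddings can be sketched as follows:

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along the given axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(f_a, f_b, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    f_a, f_b: (n, d) arrays of L2-normalized representations from the two
    modalities; row i of f_a and row i of f_b form a positive pair, and
    every cross-row combination serves as a negative pair.
    """
    n = f_a.shape[0]
    logits = (f_a @ f_b.T) / temperature            # (n, n) similarity matrix
    log_p_ab = logits - logsumexp(logits, axis=1)   # rows of f_a as queries
    log_p_ba = logits - logsumexp(logits, axis=0)   # rows of f_b as queries
    # Positive pairs sit on the diagonal; the loss maximizes their
    # log-probability in both directions.
    return -0.5 * (np.trace(log_p_ab) + np.trace(log_p_ba)) / n
```

Note that mismatching the pairing raises the loss: the diagonal (positive-pair) similarities are rewarded while all off-diagonal (negative-pair) similarities are penalized through the normalizing log-sum-exp terms.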
In this paper, we make preliminary theoretical steps towards answering these fundamental questions about the importance of contrastive loss in multimodal learning. We assume the data from the two modalities are of the form

x_A = A z_A + A_ξ ξ_A  and  x_B = B z_B + B_ξ ξ_B,

respectively, where z_A, z_B are the hidden signals, A, B are linear transformations from the signals to the observations, and A_ξ ξ_A, B_ξ ξ_B are the noises. Similar linear models have also been used in previous works (Tian et al. (2021); Wen & Li (2021)) in the context of single-modal learning (A = B). A positive pair of data points shares the same signal z_A = z_B, but has different noises ξ_A, ξ_B and transformations A, B. The goal is to learn features f_A, f_B that align positive pairs while keeping the representations of negative pairs away from each other. Under this setting, we make the following contributions:

1. We consider the challenging (but more practical) setting where the features in A and B are inhomogeneous, that is, the condition number of A and B can be ω(1). Prior works (Jing et al. (2022); Tian et al. (2021); Wen & Li (2021)) only consider cases where A and B are exactly column-orthonormal matrices, even in the simpler single-modal setting (A = B).

2. We consider feature learners f_A, f_B with normalization, meaning that f_A, f_B are always normalized to have (expected) norm one during training. Output normalization plays a critical role in the practical success of contrastive learning and is also employed in CLIP, but it is rarely considered in theory due to the additional complexity introduced by the division by the norm.

3. We analyze the learning process of stochastic gradient descent from random initialization. We prove that contrastive learning converges efficiently to a nearly optimal solution, which indeed aligns the feature representations f_A and f_B.

4. We also demonstrate the importance of negative pairs by comparing with training only over the positive pairs: we prove that although the latter can also learn to align f_A and f_B, the features learned by contrastive learning with negative pairs are much more uniform, meaning that f_A, f_B can recover all the singular vectors of A and B and normalize them. On the other hand, without negative pairs, the learned representation is close to a rank-one solution, meaning that f_A, f_B will only focus on the top singular direction of A and B.

5. We also perform simulations and more practical experiments to further support our theory.

2. RELATED WORKS

Multimodal learning  Despite the empirical success of multimodal learning, there are very few theoretical results on this topic. The work most related to ours is Huang et al. (2021), in which the authors show that, in certain cases, multimodal methods can provably perform better than single-modal models. However, the authors consider neither contrastive pairs nor the training dynamics. Contrastive learning has also been widely applied to the multimodal setting (Stroud et al. (2020); Radford et al. (2021); Ramesh et al. (2021); Xu et al. (2021); Jia et al. (2021); Wang et al. (2022b)). Among them, the recent work CLIP (Radford et al. (2021)) shows remarkable results on aligning the features of text and images, and empirically outperforms many existing non-contrastive approaches (Grill et al. (2020); Chen & He (2021); He et al.).

Contrastive/Non-contrastive learning theory  Another, much richer, line of research concerns contrastive and non-contrastive methods in the context of single-modal self-supervised learning. Starting from Arora et al. (2019), many recent works have provided various explanations of why the representations learned with contrastive learning are useful in downstream tasks (Chuang et al. (2020); Tosh et al. (2021); Nozawa & Sato (2021); Wang et al. (2022a); HaoChen et al. (2021); Lee et al. (2021); Wang & Isola (2020)). These works mostly focus on the generalization aspect of the problem and do not consider training. Among them, Wang & Isola (2020) also study the problem using the notions of alignment and uniformity, and demonstrate that balanced representations benefit downstream tasks; however, they do not provide guarantees on training. Another related line of research concerns non-contrastive learning, where the necessity of negative examples is questioned. In this line of research, the optimization problem does get considered, as non-contrastive losses have trivial collapsed solutions. Tian et al. (2021) show that, under certain conditions, non-contrastive learning methods can learn non-collapsed solutions. Jing et al. (2022) show that, even with negative examples, contrastive learning can still suffer from another type of collapse, where the learned representations only span a low-dimensional subspace of the embedding space. In Pokle et al. (2022), the authors show that non-contrastive losses have many non-collapsed bad minima that the training algorithm does not avoid. Another related work that takes optimization into consideration is Wen & Li (2021), in which the authors analyze the training dynamics of contrastive learning and show that,
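To make the setting from the introduction concrete, the linear data model x_A = A z_A + A_ξ ξ_A, x_B = B z_B + B_ξ ξ_B with ill-conditioned transformations can be simulated as follows; the dimensions, singular values, and noise scale below are illustrative assumptions of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_signal, d_obs, n = 4, 16, 2000

def random_transform(singular_values):
    """Random linear map with prescribed singular values, and hence a
    prescribed condition number (largest / smallest singular value)."""
    k = len(singular_values)
    u, _ = np.linalg.qr(rng.standard_normal((d_obs, k)))  # orthonormal columns
    v, _ = np.linalg.qr(rng.standard_normal((k, k)))      # orthogonal matrix
    return u @ np.diag(singular_values) @ v.T

# Inhomogeneous ("not isotropic") transformations: condition number 8/1 = 8,
# unlike the column-orthonormal case (condition number 1) of prior works.
A = random_transform([8.0, 4.0, 2.0, 1.0])
B = random_transform([8.0, 4.0, 2.0, 1.0])

# A positive pair shares the hidden signal (z_A = z_B = z) but has
# independent noises xi_A, xi_B and different transformations A, B.
z = rng.standard_normal((n, d_signal))
x_a = z @ A.T + 0.1 * rng.standard_normal((n, d_obs))
x_b = z @ B.T + 0.1 * rng.standard_normal((n, d_obs))

cond_A = np.linalg.cond(A)   # ratio of largest to smallest singular value
```

Under this model, a "uniform" feature map would recover and rescale all four singular directions of A (or B), while a rank-one collapsed solution would capture only the direction with singular value 8.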

