ON THE IMPORTANCE OF CONTRASTIVE LOSS IN MULTIMODAL LEARNING

Abstract

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved great success in multimodal learning, where the model minimizes the distance between the representations of different views (e.g., an image and its caption) of the same data point, while pushing the representations of different data points away from each other. From a theoretical perspective, however, it is unclear how contrastive learning can efficiently learn to align the representations from different views, especially when the data is not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we reveal a stagewise behavior of the learning process: in the first stage, the model aligns the feature representations using positive pairs, and the condition number of the learned representations grows; in the second stage, the model reduces this condition number using negative pairs.
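The condition number referenced above can be made concrete: for a matrix of learned representations, it is the ratio of the largest to the smallest singular value, so isotropic (well-balanced) features have condition number close to 1. The following numpy sketch is our own illustration, not the paper's notation:

```python
import numpy as np

def condition_number(F):
    """Condition number of a representation matrix F of shape (n, d):
    the ratio of its largest to smallest singular value. A large value
    means the learned features are ill-conditioned (far from isotropic)."""
    s = np.linalg.svd(F, compute_uv=False)
    return s.max() / s.min()
```

For example, perfectly isotropic features such as the identity matrix have condition number 1, while stretching a single direction by a factor of 10 raises it to 10.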

1. INTRODUCTION

One of the exceptional abilities of humans is to associate data from different modalities (such as text and images). For example, when we hear the words "white dog", we can immediately align them with the image of a white dog. When we hear the loud sound of an engine, we can imagine an expensive sports car passing nearby.

Recently, in machine learning, multimodal learning methods, which train a model to align data from different modalities, have become an increasingly popular research direction, especially in deep learning (He & Peng (2017); Stroud et al. (2020); Radford et al. (2021); Ramesh et al. (2021); Xu et al. (2021); Jia et al. (2021); Wang et al. (2022b)). Among them, the recent work CLIP (Radford et al. (2021)) shows remarkable results on aligning the features of text and images. The contrastive learning based method CLIP empirically outperforms many existing non-contrastive approaches (Grill et al. (2020); Chen & He (2021); He et al. (2020); Jing et al. (2022); Pokle et al. (2022); Tian et al. (2021); Wen & Li (2022)). The major difference between the contrastive approach and the others is that the contrastive loss not only requires the learned representations of the same pair of data (i.e., positive pairs) to be positively aligned, but also requires the representations of data from different pairs (i.e., negative pairs) to be as negatively aligned as possible. In the paper, the authors also identify the contrastive loss as the most critical component of CLIP's success.

Despite the empirical success of these contrastive learning based methods, from a theoretical perspective the most fundamental questions remain largely open: In particular, how do contrastive pairs help in this new multimodal learning approach? How can the non-convex contrastive loss be efficiently minimized to learn features from both modalities? Unlike prior theoretical works on contrastive learning, which mostly focus on extracting features from a single modality (e.g., Arora et al. (2019; 2021)), one main technical challenge in analyzing contrastive learning in the multimodal setting is how the model can be trained to align the feature representations f_A, f_B from modalities A and B respectively. Due to the existence of negative pairs, which emphasize negative correlations between f_A and f_B, it is unclear why the model still has an incentive to align the features from the two modalities.
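The positive/negative pair structure described above can be sketched as a symmetric, InfoNCE-style contrastive loss: paired representations sit on the diagonal of a similarity matrix, and all off-diagonal entries act as negative pairs. The following numpy sketch is our own illustration of this style of objective (function name and temperature value are our choices, not taken from CLIP or from this paper):

```python
import numpy as np

def clip_contrastive_loss(f_a, f_b, temperature=0.07):
    """Symmetric contrastive loss between two modalities.

    f_a, f_b: (n, d) arrays of representations for n paired data points.
    Row i of f_a and row i of f_b form a positive pair; all cross terms
    (i != j) act as negative pairs.
    """
    # Normalize rows so that inner products are cosine similarities.
    f_a = f_a / np.linalg.norm(f_a, axis=1, keepdims=True)
    f_b = f_b / np.linalg.norm(f_b, axis=1, keepdims=True)

    logits = f_a @ f_b.T / temperature   # (n, n) similarity matrix
    labels = np.arange(f_a.shape[0])     # positive pairs on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the two directions (A -> B and B -> A).
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each positive pair together while pushing each representation away from all negatives in the batch, which is why correctly paired inputs yield a much lower loss than mismatched ones.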

