DESIGN OF THE TOPOLOGY FOR CONTRASTIVE VISUAL-TEXTUAL ALIGNMENT
Anonymous authors
Paper under double-blind review

Abstract

Pre-training on weakly related image-text pairs in the contrastive style shows great power in learning semantically aligned cross-modal models. The common choice for measuring the distance between the feature representations of an image-text pair is the cosine similarity, which can mathematically be considered as the negative inner product distance of features embedded on a sphere. However, empirically, aligning image-text pairs on the spherical topology is vulnerable to the semantic ambiguity phenomenon resulting from the noise in pre-training datasets. Specifically, under noisy training data, instead of reaching the optimal alignment-uniformity solution, the system reaches an equilibrium (a gap between the distances of positive and negative pairs) when the gradients for attraction and repulsion are neutralized. Intuitively, the model should always find this equilibrium given a sufficiently long training schedule. However, since the contrastive loss is implemented with softmax and cross-entropy, the numerical values required for equilibrium become much larger and may fall outside the range of the distance function (e.g., [-1, 1] for the cosine similarity). In the practice of former studies, this problem is partly tackled by introducing a learnable softmax temperature parameter, in other words, by explicitly scaling the range of the distance function. In this work, we alternatively design the topology of the embedding space and its endowed distance function. Motivated by studies that make use of Riemannian geometry for visual tasks, we propose a rather simple solution to the aforementioned equilibrium problem: we map the feature representations onto the oblique manifold endowed with the negative inner product as the distance function. Furthermore, we propose a multi-token implementation of the oblique manifold.
With this configuration, using a ViT-B/16 visual backbone, we outperform the officially released CLIP ViT-L/14 model on the zero-shot image classification and image-to-text retrieval tasks.

1. INTRODUCTION

Learning visual and textual feature representations that are semantically aligned in their embedding space is a fundamental problem in vision-language cross-modal tasks (Frome et al., 2013; Karpathy & Fei-Fei, 2015; Romera-Paredes & Torr, 2015; Wang et al., 2016; Faghri et al., 2017; Xian et al., 2016). In early works that employ feature representations from deep neural networks, e.g., Frome et al. (2013), the alignment is often achieved by a fundamental metric-learning approach with the hinge rank loss. That is, the similarity between a visual feature vector u and a textual feature vector v is calculated as u^T W v, where W contains the learnable weight parameters. Thanks to revolutionary advances in computational power, we can now achieve this with a more effective and practical approach termed contrastive learning, where we align quantities of positive samples and push their negative samples away simultaneously within a large mini-batch (Radford et al., 2021; Singh et al., 2022; Jia et al., 2021; Pham et al., 2021; Yuan et al., 2021). The standard choice of distance measure between an image-text pair for the contrastive learning algorithm is the cosine similarity, in both uni-modal (Chen et al., 2020a; Caron et al., 2020; Chen et al., 2020b) and cross-modal (Radford et al., 2021; Jia et al., 2021; Singh et al., 2022) scenarios. Mathematically, the cosine similarity corresponds to the negative inner product distance between feature representation vectors mapped onto the unit-sphere embedding space. The spherical embedding space is advantageous for aligning visual and textual feature representations in two aspects. First, calculating the inner product consumes few computational resources during both forward and backward propagation. Second, uniformity is properly defined on the sphere. Uniformity is a desired property in optimizing the contrastive loss, by which the feature representations preserve maximal information of the data.
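The identity between cosine similarity and the inner product of sphere-embedded features can be made concrete with a minimal NumPy sketch (the function name is ours, for illustration only):

```python
import numpy as np

def cosine_similarity(u, v):
    # Map both feature vectors onto the unit sphere, then take the inner
    # product; its negative serves as the distance in the contrastive loss.
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    return float(u_hat @ v_hat)

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])
print(cosine_similarity(u, v))  # 0.96
```

Since both normalizations and the dot product are cheap, the forward and backward passes add little overhead on top of the feature extractors, which is the first advantage noted above.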
However, since the data for large-scale contrastive alignment are internet-collected noisy image-text pairs, we often find pairs of semantically related images and texts labeled as "negative" and vice versa, a phenomenon we term "semantic ambiguity". Because of this ambiguity, it is impossible for the system to achieve the perfect alignment and uniformity conditions of sample embeddings. More specifically, during training, the false negative samples are pushed away from each other (repulsion), while the false positive samples are pulled together (attraction). As a consequence, the system gradually finds an equilibrium at which the noisy samples' gradients for attraction and repulsion are neutralized. In other words, the training process converges under the given hyper-parameters. To be more concrete, the gradient is eventually back-propagated from the difference between the positive and negative distances. Given sufficient model capacity, the numerical gap between the distances of positive and negative pairs will be optimized to fit the noise level of the dataset. For instance, if there is a considerable amount of false negative samples, the model learns a smaller positive distance so as not to be punished too hard when encountering false negative samples in another mini-batch. Furthermore, the triangle inequality (or a "relaxed" version of it) of the distance within a group of semantically similar samples will pull the positive pairs of samples away from each other (see Section 3.2). Finally, the model reaches an equilibrium of compromised positive and negative distances, which minimizes the contrastive loss regardless of the semantic ambiguity. Therefore, we say the equilibrium is essentially a property of the dataset, reflecting its noise level, under a given embedding space and its distance function.
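The attraction-repulsion mechanics described above arise from the InfoNCE-style contrastive objective. The following is a minimal NumPy sketch of the image-to-text half of that objective (the symmetric text-to-image term is analogous); the function name and hyper-parameters are illustrative, not the authors' code:

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """Image-to-text half of the contrastive loss over a similarity matrix
    sim[i, j] between image i and text j. Diagonal entries are treated as
    positives; all off-diagonal entries as negatives."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Minimizing this pulls positives together (attraction) and pushes
    # negatives apart (repulsion) through the shared softmax normalizer.
    return float(-np.mean(np.diag(log_probs)))
```

Under noisy labels, false negatives on the off-diagonal and false positives on the diagonal make the two gradient terms work against each other, so the loss bottoms out at a nonzero equilibrium rather than at the perfect alignment-uniformity solution.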
Here, the problem is that the contrastive loss is implemented with the combination of softmax and cross-entropy, which makes the numerical values required for equilibrium exponentially larger. The distances required by the softmax at equilibrium might fall outside the range of the distance function (e.g., [-1, 1] for the cosine similarity). Therefore, in the practice of former studies (Wu et al., 2018; Radford et al., 2021), researchers scale the range with a learnable softmax temperature parameter. Although the learnable temperature largely fixes the range problem, it still has two drawbacks. First, the learnable temperature delays the learning progress: empirically, we observe that the model tends to acquire a proper scaling for the distance range earlier than achieving a good alignment. Second, a large temperature is numerically unstable in back-propagation, especially for low-bit-precision computation. In this work, we alternatively design the topology of the embedding vectors and its endowed distance function. Motivated by the utilization of Riemannian geometry for visual tasks, we propose a relatively simple solution to the aforementioned out-of-range equilibrium problem. Our contributions can be summarized as follows:
1. We reveal that the learnable softmax temperature is essentially a scaling factor for the distance range, which indicates the noise level of the dataset in contrastive visual-textual alignment. We also observe that the model learns a suitable temperature before learning good representations, which degrades the performance.
2. We tackle the out-of-range equilibrium problem resulting from the softmax cross-entropy loss by designing the topology of the embedding space. That is, we employ the oblique manifold endowed with the negative inner product as the distance function.
3. We demonstrate that the proposed approach can be painlessly implemented by changing only two lines of the training code, while improving the baseline performance in the zero-shot image-text retrieval tasks. In the larger-scale experiment, we learn a ViT-B/16 model that outperforms the officially released ViT-L/14 model.
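To picture both the out-of-range problem and the "two-line change": on the sphere the inner product is capped at [-1, 1], which CLIP-style training rescales with a learnable exp-parameterized temperature, whereas normalizing m sub-vectors separately places the embedding on the oblique manifold, whose flattened inner product already spans [-m, m]. The following NumPy sketch is our own illustration under assumed function names, not the authors' released code:

```python
import numpy as np

def embed_sphere(x):
    # Baseline: l2-normalize the whole vector; inner products lie in [-1, 1].
    return x / np.linalg.norm(x)

def embed_oblique(x, m):
    # The illustrative "two-line change": reshape into m sub-vectors and
    # normalize each one, mapping onto the oblique manifold. Inner products
    # of the flattened result lie in [-m, m], widening the reachable range.
    x = x.reshape(m, -1)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x.ravel()

rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)

sim_sphere = float(embed_sphere(u) @ embed_sphere(v))           # in [-1, 1]
sim_oblique = float(embed_oblique(u, 8) @ embed_oblique(v, 8))  # in [-8, 8]

# The prior alternative: stretch the spherical range with a learnable
# temperature, parameterized as exp(t) with t initialized to log(1 / 0.07).
logit = np.exp(np.log(1 / 0.07)) * sim_sphere
```

The oblique embedding thus widens the distance range structurally, instead of relying on a scalar temperature that must be learned and kept numerically stable.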

2. PRELIMINARY

Notations: We start with notation and review the mathematical expressions of the basic building blocks used in our analysis. In this work, we denote scalars by italic letters, e.g., n, m, B, D ∈ R, and denote vectors and higher-order tensors by boldface letters, e.g., x = [x_0, x_1, ..., x_{n-1}]^⊤ ∈ R^n and Y ∈ R^{N×D}. We denote sets by calligraphic letters, e.g., U = {U_1, U_2, ...}. We also employ italic letters to define functions, with subscripts denoting their parameters, e.g., f_θ(·). The operation ∥·∥_p denotes the ℓ_p norm of a vector.
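For concreteness, the ℓ_p norm used throughout is ∥x∥_p = (Σ_i |x_i|^p)^{1/p}; a quick check with our own example values:

```python
import numpy as np

x = np.array([3.0, -4.0])
l1 = np.linalg.norm(x, 1)  # |3| + |-4| = 7
l2 = np.linalg.norm(x, 2)  # sqrt(9 + 16) = 5 (the default Euclidean norm)
print(l1, l2)  # 7.0 5.0
```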




