ON THE INADEQUACY OF OPTIMIZING ALIGNMENT AND UNIFORMITY IN CONTRASTIVE LEARNING OF SENTENCE REPRESENTATIONS

Abstract

Contrastive learning is widely used in areas such as visual representation learning (VRL) and sentence representation learning (SRL). Given the differences between VRL and SRL in negative sample size and evaluation focus, we argue that well-established findings in VRL may not carry over entirely to SRL. In this work, we examine the suitability of the decoupled form of the contrastive loss, i.e., alignment and uniformity, for SRL. We find a performance gap on the STS tasks between sentence representations obtained by jointly optimizing alignment and uniformity and those obtained with the contrastive loss. Further, we find that jointly optimizing alignment and uniformity during training is prone to overfitting, which does not occur with the contrastive loss. By analyzing the variation of gradient norms, we identify a "gradient dissipation" property of the contrastive loss and argue that it is the key to preventing overfitting. We simulate similar "gradient dissipation" in four optimization objectives of two forms and achieve the same or even better performance than the contrastive loss on the STS tasks, confirming our hypothesis.

1. INTRODUCTION

Unsupervised contrastive learning (Wu et al., 2018) originated in visual representation learning (VRL) (Chen et al., 2020; He et al., 2020; Grill et al., 2020), where it has achieved impressive performance. Briefly, contrastive learning forces the representation of an input instance (or "anchor") to be similar to that of an augmented view of the same instance (or "positive example") and to differ from those of other instances (or "negative examples"). One plausible justification of this approach is that minimizing the loss of contrastive learning (or "contrastive loss") is shown to be equivalent to simultaneously minimizing an "alignment loss" and a "uniformity loss", where the former dictates the representation similarity between an instance and its positive examples, and the latter forces the representations of all instances to spread uniformly on the unit sphere in the representation space (Wang & Isola, 2020). Notably, this decomposition of the contrastive loss relies on the condition that the number of negative examples participating in the contrastive loss approaches infinity. For VRL, it is arguable that such a condition holds reasonably well, since a large number of negative examples (e.g., 65536 (He et al., 2020)) is usually used in training. In recent years, contrastive learning has also been adapted to sentence representation learning (SRL) by fine-tuning the representations obtained from a pretrained language model (e.g., BERT (Devlin et al., 2018)) (Yan et al., 2021; Giorgi et al., 2021; Gao et al., 2021; Zhang et al., 2022b). Such approaches have demonstrated great performance not only on downstream classification tasks, but also on semantic textual similarity (STS) tasks.
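To make the decomposition concrete, the following is a minimal numpy sketch (our own naming and implementation, not the paper's code) of the standard InfoNCE contrastive loss for a single anchor, alongside the alignment and uniformity losses in the form given by Wang & Isola (2020):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.05):
    """Standard contrastive (InfoNCE) loss for one anchor.
    All vectors are assumed L2-normalized; tau is the temperature."""
    s_pos = anchor @ positive / tau
    s_neg = negatives @ anchor / tau          # shape: (num_negatives,)
    logits = np.concatenate(([s_pos], s_neg))
    # -log softmax probability of the positive among all candidates
    return -s_pos + np.log(np.exp(logits).sum())

def alignment_loss(anchor, positive, alpha=2):
    """Alignment: distance between a positive pair (averaged over pairs
    in practice)."""
    return np.linalg.norm(anchor - positive) ** alpha

def uniformity_loss(reps, t=2):
    """Uniformity: log of the mean Gaussian potential over all distinct
    pairs of representations."""
    sq_dists = ((reps[:, None, :] - reps[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(reps), k=1)      # distinct pairs only
    return np.log(np.exp(-t * sq_dists[iu]).mean())
```

The decomposition result states that, as the number of negatives goes to infinity, minimizing `info_nce` becomes equivalent to jointly minimizing `alignment_loss` and `uniformity_loss`; with the small negative sample sizes used in SRL, this equivalence is exactly what the paper calls into question.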
It is noteworthy that, in sharp contrast to VRL, which is primarily evaluated using downstream classification tasks (Russakovsky et al., 2015; Krizhevsky et al., 2009), or via "extrinsic" protocols (Chiu et al., 2016; Faruqui et al., 2016), SRL particularly emphasizes the STS tasks, or "intrinsic" protocols, when evaluating the quality of learned sentence representations (Reimers & Gurevych, 2019; Li et al., 2020; Yan et al., 2021; Zhang et al., 2022c). This is because representations obtained from pretrained language models have already shown strong transfer capability to downstream tasks (Devlin et al., 2018; Liu et al., 2019), while their similarities correlate rather poorly with human-rated similarities (Reimers & Gurevych, 2019; Li et al., 2020). Following the justification of contrastive learning in VRL (Wang & Isola, 2020), some works (Gao et al., 2021; Zhang et al., 2022b) attribute the success of contrastive learning on the STS tasks to a good balance between alignment and uniformity. Consequently, alignment and uniformity losses have been widely adopted as key metrics for evaluating the goodness of sentence representations learned by contrastive learning (Gao et al., 2021; Zhang et al., 2022b;c;a; Klein & Nabi, 2022). Noting that contrastive learning for SRL in fact uses only a rather small number of negative examples (e.g., 63 (Gao et al., 2021; Zhang et al., 2022b) or fewer (Zhang et al., 2022c)), in this paper we question whether this "decomposition principle", i.e., jointly optimizing alignment and uniformity, adequately explains the performance gain on the STS tasks brought by contrastive learning. After extensive experiments, we find that optimizing the alignment and uniformity losses yields lower performance on the STS tasks than optimizing the contrastive loss. Moreover, this performance degradation is not reflected by the alignment and uniformity metrics.
Interestingly, we also observe the same phenomenon in contrastive learning with another loss function, Decoupled Contrastive Loss (DCL) (Yeh et al., 2021), whose optimization objectives are very similar to alignment and uniformity. Our further studies also show that training with such decoupled forms of the contrastive loss causes severe overfitting, which does not occur in training with the standard contrastive loss. These observations suggest that the alignment and uniformity losses might not serve as suitable substitutes for the contrastive loss in SRL, and that the success of standard contrastive learning for SRL cannot be adequately explained in terms of alignment and uniformity properties or some delicate balance between the two. This paper focuses on uncovering other important factors, beyond alignment and uniformity, that contribute to the success story of contrastive learning in SRL, and on explaining the performance gap between the training scheme using the contrastive loss and those using a decoupled contrastive loss (in terms of alignment and uniformity or their equivalent). Specifically, we hypothesize that the training dynamics of standard contrastive learning for SRL plays an essential role in its effectiveness. To that end, we decompose the gradient of the contrastive loss into an alignment component and a uniformity component and compare the norms of these two components with their counterparts in training with the decoupled contrastive losses. Interestingly, we observe a distinct "gradient dissipation" phenomenon in training with the standard contrastive loss: the gradient signal quickly drops and vanishes as soon as a negative example is sufficiently farther away from the anchor than the positive example, where the sufficiency appears to be governed by a rather moderate threshold. Notably, such a phenomenon does not appear in the training schemes using a decoupled contrastive loss when the negative sample size is small.
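The dissipation effect can be seen directly in the softmax structure of the InfoNCE gradient. In a minimal single-negative sketch (our own illustration, assuming cosine similarities and a temperature of 0.05, as commonly used in SRL), the weight a negative receives in the anchor's gradient shrinks exponentially in the positive-negative margin, whereas a repulsive term decoupled from the positive keeps a constant weight:

```python
import numpy as np

def neg_weight_infonce(s_pos, s_neg, tau=0.05):
    """Softmax weight that a single negative receives in the InfoNCE
    gradient w.r.t. the anchor: its repulsion is scaled by this weight,
    which competes with the positive's similarity."""
    logits = np.array([s_pos, s_neg]) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[1]                 # weight of the negative

def neg_weight_decoupled(s_neg, tau=0.05):
    """With the positive decoupled (as in alignment+uniformity or DCL),
    a single negative's repulsive weight does not depend on s_pos at
    all; with one negative it is simply constant."""
    return 1.0

# As the positive pulls ahead of the negative, the InfoNCE weight
# collapses roughly as 1 / (1 + exp(margin / tau)): easy negatives
# stop contributing gradient ("gradient dissipation"), while the
# decoupled weight keeps pushing them regardless.
weights = [neg_weight_infonce(0.9, 0.9 - m) for m in (0.0, 0.1, 0.2, 0.3)]
```

Under this sketch, a margin of just 0.2 in cosine similarity already suppresses the negative's gradient weight to below 2%, which matches the paper's description of a "rather moderate threshold".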
This observation led us to believe that "gradient dissipation" plays an essential role in standard contrastive learning. To validate this hypothesis, we construct two new loss functions, both capable of inducing "gradient dissipation" in their training dynamics. We test them experimentally and observe that training with both losses indeed gives comparable or even better performance on the STS tasks than the standard contrastive loss. Interestingly, similar to the contrastive loss, training with these new loss functions also alleviates the overfitting problem observed in training with the decoupled contrastive losses. This confirms that gradient dissipation serves a key role in the success of contrastive learning for SRL and suggests that properly conditioning the training dynamics is more important than optimizing alignment and uniformity when not too many negative examples are used. Along our development, we provide insights as to why gradient dissipation is a desirable property. We also provide additional theoretical justification for the effectiveness of the new loss functions by showing that they are in fact upper bounds of the standard contrastive loss. Due to the page limit, some results, derivations, and discussions are presented in the Appendix.
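The paper's two constructed losses are specified later in the text; purely as an illustrative sketch of the general idea (our own construction, not the paper's), one simple way to graft dissipation onto a decoupled objective is to mask the repulsive term once a negative trails the positive by a margin:

```python
import numpy as np

def dissipating_decoupled_loss(anchor, positive, negatives, margin=0.2):
    """Illustrative only (NOT one of the paper's two losses): an
    alignment term plus a repulsive term that is zeroed for any
    negative whose cosine similarity trails the positive's by more
    than `margin`, mimicking the vanishing gradient that InfoNCE
    applies to easy negatives. All vectors assumed L2-normalized."""
    s_pos = anchor @ positive
    s_neg = negatives @ anchor               # shape: (num_negatives,)
    align = np.linalg.norm(anchor - positive) ** 2
    active = s_neg > s_pos - margin          # hard negatives only
    repel = (s_neg[active] ** 2).sum() if active.any() else 0.0
    return align + repel
```

The point of the sketch is only the masking mechanism: once every negative is comfortably behind the positive, the repulsive gradient disappears entirely, so training cannot keep pushing already-separated negatives apart, which is the behavior the decoupled losses lack.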



The code and checkpoints are released for study at https://github.com/BDBC-KG-NLP/ICLR2023-Gradient-Dissipation.git

