ON THE INADEQUACY OF OPTIMIZING ALIGNMENT AND UNIFORMITY IN CONTRASTIVE LEARNING OF SENTENCE REPRESENTATIONS

Abstract

Contrastive learning is widely used in areas such as visual representation learning (VRL) and sentence representation learning (SRL). Considering the differences between VRL and SRL in terms of negative sample size and evaluation focus, we believe that the solid findings obtained in VRL may not carry over entirely to SRL. In this work, we examine the suitability of the decoupled form of contrastive loss, i.e., alignment and uniformity, in SRL. We find a performance gap on the STS tasks between sentence representations obtained by jointly optimizing alignment and uniformity and those obtained using contrastive loss. Further, we find that jointly optimizing alignment and uniformity during training is prone to overfitting, which does not occur with contrastive loss. Analyzing both based on the variation of the gradient norms, we find that contrastive loss has a property of "gradient dissipation" and believe that it is the key to preventing overfitting. We simulate a similar "gradient dissipation" on four optimization objectives of two forms, and achieve the same or even better performance than contrastive loss on the STS tasks, confirming our hypothesis.

1. INTRODUCTION

Unsupervised contrastive learning (Wu et al., 2018) originated in visual representation learning (VRL) (Chen et al., 2020; He et al., 2020; Grill et al., 2020), where it has achieved impressive performance. Briefly, contrastive learning forces the representation of an input instance (or "anchor") to be similar to that of an augmented view of the same instance (or "positive example") and to differ from those of different instances (or "negative examples"). One plausible justification of this approach is that minimizing the loss of contrastive learning (or "contrastive loss") has been shown to be equivalent to simultaneously minimizing an "alignment loss" and a "uniformity loss", where the former governs the representation similarity between an instance and its positive examples, and the latter forces the representations of all instances to spread uniformly on the unit sphere in the representation space (Wang & Isola, 2020). Notably, this decomposition of the contrastive loss relies on the condition that the number of negative examples participating in the contrastive loss approaches infinity. For VRL, it is arguable that such a condition holds reasonably well, since a large number (e.g., 65536 (He et al., 2020)) of negative examples is usually used in training. In recent years, contrastive learning has also been adapted to sentence representation learning (SRL), by fine-tuning the representations obtained from a pretrained language model (e.g., BERT (Devlin et al., 2018)) (Yan et al., 2021; Giorgi et al., 2021; Gao et al., 2021; Zhang et al., 2022b). Such approaches have demonstrated strong performance not only on downstream classification tasks, but also on semantic textual similarity (STS) tasks.
It is noteworthy that, in sharp contrast to VRL, where representations are primarily evaluated using downstream classification tasks (Russakovsky et al., 2015; Krizhevsky et al., 2009), or via "extrinsic" protocols (Chiu et al., 2016; Faruqui et al., 2016), SRL particularly emphasizes the STS tasks, or "intrinsic" protocols, when evaluating the quality of learned sentence representations (Reimers & Gurevych, 2019; Li et al., 2020; Yan et al., 2021; Zhang et al., 2022c). This is because the representations obtained from the pretrained language models have already shown strong transfer capability to downstream tasks (Devlin et al., 2018; Liu et al., 2019), while their similarities are rather poorly correlated with human-rated similarities (Reimers & Gurevych, 2019; Li et al., 2020). Following the justification of contrastive learning in VRL (Wang & Isola, 2020), some works (Gao et al., 2021; Zhang et al., 2022b) attribute the success of contrastive learning on the STS tasks to a good balance between alignment and uniformity. Consequently, alignment and uniformity losses are widely adopted as the key metrics for evaluating the quality of sentence representations learned by contrastive learning (Gao et al., 2021; Zhang et al., 2022b; c; a; Klein & Nabi, 2022). Noting that contrastive learning for SRL in fact uses only a rather small number of negative examples (e.g., 63 (Gao et al., 2021; Zhang et al., 2022b) or smaller (Zhang et al., 2022c)), in this paper we question whether the "decomposition principle", or jointly optimizing alignment and uniformity, adequately explains the performance gain on the STS tasks brought by contrastive learning. After extensive experiments, we find that optimization using alignment and uniformity losses produces lower performance on the STS tasks than optimization with contrastive loss. Moreover, this performance degradation is not reflected by the alignment and uniformity metrics.
Interestingly, we also observe the same phenomenon in contrastive learning with another loss function, Decoupled Contrastive Loss (DCL) (Yeh et al., 2021), whose optimization objectives are very similar to alignment and uniformity. Our further studies also show that training with such decoupled forms of contrastive loss causes severe overfitting, which does not occur in training with the standard contrastive loss. These observations suggest that alignment and uniformity losses might not serve as suitable substitutes for contrastive loss in SRL, and that the success of standard contrastive learning for SRL cannot be adequately explained in terms of alignment and uniformity properties or some delicate balance between the two. This paper focuses on uncovering other important factors, beyond alignment and uniformity, that contribute to the success of contrastive learning in SRL and explain the performance gap between training with the contrastive loss and training with a decoupled contrastive loss (in terms of alignment and uniformity or their equivalent). Specifically, we hypothesize that the training dynamics of standard contrastive learning for SRL play an essential role in its effectiveness. To that end, we decompose the gradient of the contrastive loss into an alignment component and a uniformity component and compare the norms of these two components with their counterparts in training with the decoupled contrastive losses. Interestingly, we observe a distinct "gradient dissipation" phenomenon in training with the standard contrastive loss: the gradient signal quickly drops and vanishes as soon as the negative example is adequately further away from the anchor than the positive example, where the adequacy appears to be reflected by a rather moderate threshold. Notably, such a phenomenon does not appear in the training schemes using a decoupled contrastive loss when the negative sample size is small.
This observation leads us to believe that "gradient dissipation" plays an essential role in standard contrastive learning. To validate this hypothesis, we construct two new loss functions, both capable of inducing "gradient dissipation" in their training dynamics. We test them experimentally and observe that training with both losses indeed gives comparable or even better performance on the STS tasks than the standard contrastive loss. Interestingly, similar to the contrastive loss, training with these new loss functions also alleviates the overfitting problem observed in training with decoupled contrastive losses. This confirms that gradient dissipation plays a key role in the success of contrastive learning for SRL and suggests that properly conditioning the training dynamics is more important than optimizing alignment and uniformity when not too many negative examples are used. Along the way, we provide insights as to why gradient dissipation is a desirable property. We also provide additional theoretical justification for the effectiveness of the new loss functions by showing that they are in fact upper bounds of the standard contrastive loss. Due to the page limit, some results, derivations and discussions are presented in the Appendix.

2.1. CONTRASTIVE LEARNING IN SENTENCE REPRESENTATION LEARNING

Let $C$ be a set of sentences. For each $x_i \in C$, we use $x'_i$ to denote an augmented view of $x_i$. With respect to anchor $x_i$, $x'_i$ is referred to as a positive example, and any $x_j$ or $x'_j$ ($j \neq i$) is referred to as a negative example. An encoder maps each $x_i$ and $x'_i$ to their representations $h_i$ and $h'_i$, which are vectors in $\mathbb{R}^d$. Considering the metric invariance, we usually normalize them to obtain $\hat{h}_i$ and $\hat{h}'_i$, which are constrained to the unit hypersphere $S^{d-1}$ centered at the origin. The InfoNCE loss (Oord et al., 2018), or "contrastive loss", for anchor $x_i$ is defined by

$$
L_{cl} = -\log \frac{\exp(\hat{h}_i^{T}\hat{h}'_i/\tau)}{\exp(\hat{h}_i^{T}\hat{h}'_i/\tau) + \sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)} \tag{1}
$$

where $\tau$ is the temperature hyperparameter and $N$ represents the number of negative samples. The quality of sentence representations is shown to be insensitive to $N$ (Gao et al., 2021), and a small $N$, such as 63 (Gao et al., 2021; Zhang et al., 2022b) or smaller (Zhang et al., 2022c), has been shown to be as effective as a larger one.
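As an illustration of Equation 1, the following is a minimal NumPy sketch, not the authors' implementation. It assumes in-batch negatives, so that row `i` of `h_prime` is the positive for row `i` of `h` and the remaining rows act as negatives. The function names, the toy data and the default `tau=0.05` are assumptions made for this example.

```python
import numpy as np

def l2_normalize(h, eps=1e-12):
    """Project each row of h onto the unit hypersphere."""
    return h / np.maximum(np.linalg.norm(h, axis=1, keepdims=True), eps)

def info_nce(h, h_prime, tau=0.05):
    """Per-anchor contrastive loss in the form of Equation 1.

    h, h_prime: (B, d) arrays. Row i of h_prime is the positive for row i
    of h; the other B - 1 rows of h_prime are used as negatives.
    """
    sim = l2_normalize(h) @ l2_normalize(h_prime).T / tau   # (B, B) scaled cosines
    sim = sim - sim.max(axis=1, keepdims=True)               # stabilise the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)                                 # one loss value per anchor

# Toy data (hypothetical): positives are lightly perturbed copies of the anchors.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
h_pos = h + 0.1 * rng.normal(size=(8, 16))
losses = info_nce(h, h_pos)
```

Each entry of `losses` is the negative log-probability of the positive under a softmax over one positive and `B - 1` negatives, so it is non-negative by construction.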

2.2. THE DECOUPLED VERSION OF CONTRASTIVE LOSS

Alignment and uniformity (Wang & Isola, 2020), decoupled from contrastive loss, are shown to be two significant properties. The losses based on the two properties are defined as

$$
L_{a\&u} = (1 - \lambda) L_{align} + \lambda L_{uniform} \tag{2}
$$

$$
L_{align}(f; \alpha) \triangleq \mathbb{E}_{(\hat{h}_i, \hat{h}'_i) \sim p_{pos}} \|\hat{h}_i - \hat{h}'_i\|_2^{\alpha}, \quad \alpha > 0 \tag{3}
$$

$$
L_{uniform}(f; t) \triangleq \log \mathbb{E}_{(\hat{h}_i, \hat{h}'_j) \overset{i.i.d.}{\sim} p_{data}} e^{-t\|\hat{h}_i - \hat{h}'_j\|_2^{2}}, \quad t > 0 \tag{4}
$$

where $\alpha$, $t$ and $\lambda$ are three hyperparameters. We also consider another work (Yeh et al., 2021) on decoupled contrastive learning, which removes the positive sample part from the denominator of contrastive loss:

$$
L_{dcl} = -\log \frac{\exp(\hat{h}_i^{T}\hat{h}'_i/\tau)}{\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)} = \underbrace{-\hat{h}_i^{T}\hat{h}'_i/\tau}_{\text{alignment}} + \underbrace{\log\left(\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)\right)}_{\text{uniformity}} \tag{5}
$$

Note that the alignment parts in the two works are recognized as equivalent (Yeh et al., 2021), and we prove that their uniformity parts have the same lower bound (Appendix D), which corresponds to the optimization objective of the Minimum Energy Problem on the hypersphere (Kuijlaars & Saff, 1998; Liu et al., 2018). Therefore, the above two decoupled forms of contrastive loss are treated as equivalent in this paper.
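The decoupled objectives above can be sketched in a few lines of NumPy. This is a hedged illustration under the same in-batch assumption as before (row `i` of `h_prime` is the positive of row `i` of `h`, all other rows are negatives). The function names, toy data, and the values `alpha=2`, `t=2`, `tau=0.5` and `lambda=0.5` are assumptions for this example, not the paper's settings.

```python
import numpy as np

def normalize(h):
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def align_loss(h, h_prime, alpha=2.0):
    """Equation 3: mean of ||h_i - h'_i||_2^alpha over positive pairs."""
    return (np.linalg.norm(normalize(h) - normalize(h_prime), axis=1) ** alpha).mean()

def uniform_loss(h, h_prime, t=2.0):
    """Equation 4: log of the mean Gaussian potential over pairs with j != i."""
    a, b = normalize(h), normalize(h_prime)
    sq_dist = 2.0 - 2.0 * a @ b.T            # squared distance on the unit sphere
    off_diag = ~np.eye(len(a), dtype=bool)
    return np.log(np.exp(-t * sq_dist[off_diag]).mean())

def dcl_loss(h, h_prime, tau=0.05):
    """L_dcl per anchor, returned as (alignment term, uniformity term)."""
    sim = normalize(h) @ normalize(h_prime).T / tau
    off_diag = ~np.eye(len(sim), dtype=bool)
    neg = np.where(off_diag, sim, -np.inf)   # mask out the positive
    top = neg.max(axis=1)
    uniformity = top + np.log(np.exp(neg - top[:, None]).sum(axis=1))  # log-sum-exp
    return -np.diag(sim), uniformity

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
h_pos = h + 0.1 * rng.normal(size=(8, 16))
a_term, u_term = dcl_loss(h, h_pos, tau=0.5)
l_dcl = a_term + u_term
l_au = 0.5 * align_loss(h, h_pos) + 0.5 * uniform_loss(h, h_pos)  # Equation 2, lambda = 0.5
```

With these definitions the contrastive loss of Equation 1 satisfies $L_{cl} = \log(1 + \exp(L_{dcl}))$ for each anchor, which makes explicit that the two differ only through the positive term in the denominator.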

2.3. PERFORMANCE COMPARISON

We compare the performance of contrastive loss and its two decoupled forms based on SimCSE (Gao et al., 2021), and evaluate on seven semantic textual similarity (STS) datasets and seven downstream classification datasets for transfer (TR) tasks from the SentEval toolkit (Conneau & Kiela, 2018). We report the average Spearman correlation for the STS tasks and the average accuracy for the TR tasks in Table 1. Please refer to Appendix A for experimental details. The results show that (1) the performance on both STS and TR tasks is better than that of the original pretrained models after optimizing with any one of the three loss functions; (2) the main improvement over the original pretrained models is on the STS tasks, while the improvement on the TR tasks is relatively weak; (3) the decoupled forms obtain the same or better performance on TR tasks as contrastive loss, but a performance gap remains between them on STS tasks. These observations support, to some extent, the validity of replacing contrastive loss with alignment and uniformity on TR tasks, but they also imply that other factors, neglected in previous studies, play a key role in the improvement on STS tasks.

3.1. PROBLEM ANALYSIS

To gain a deeper understanding of the above performance gap, we record several metrics during training and evaluation, which are plotted in Figure 1. Figure 1a shows the $L_{uniform}$-$L_{align}$ scatterplot of sentence representations on the STS-B (Cer et al., 2017) development set, where the colors represent the average Spearman correlation on the STS tasks. It is important to note that scatterplots of this style are widely used in recent works to show that a proposed method achieves a better balance between alignment and uniformity than other methods (Gao et al., 2021; Zhang et al., 2022a; b; c; Klein & Nabi, 2022). Indeed, the scatterplot shows that sentence representations whose $(L_{uniform}, L_{align})$ is located in the middle of the image perform better than those located in the top left or bottom right. However, the performance gap between different loss functions is not reflected by the locations in this figure. Specifically, the sentence representations trained with $L_{cl}$ and with its decoupled forms can obtain almost the same values of alignment and uniformity, yet show an obvious performance gap on the STS tasks. Figure 1b shows the variation of $L_{cl}$, $L_{align}$ and $L_{uniform}$ on the training set and development set during optimization with $L_{cl}$. All metrics on both sets first decrease and then remain flat, which serves as a reference for a normal training process in this study. Figures 1c and 1d show the same metrics during optimization with $L_{a\&u}$ and $L_{dcl}$. Compared with Figure 1b, the different trends on the training set and the development set imply some degree of overfitting. Specifically, when $L_{a\&u}$ is used for optimization, all metrics except $L_{align}$ first decrease and then increase on the development set, while the same happens to all three metrics when $L_{dcl}$ is used.

3.2. OPTIMIZATION DYNAMICS STUDY

Since $L_{a\&u}$ and $L_{dcl}$ can be treated as equivalent to $L_{cl}$ only when the number of negative samples tends to infinity, we need to examine how exactly they differ when the number of negative samples is small. To gain more insight into their differences, we investigate the gradient properties of contrastive loss and its decoupled forms in the optimization dynamics. The gradient of $L_{cl}$ with respect to the anchor $h_i$ can be split into two terms:

$$
\frac{\partial L_{cl}}{\partial h_i} = -\underbrace{\frac{1}{\tau}\,\frac{\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)\,(\hat{h}'_i - \hat{h}_i)}{\exp(\hat{h}_i^{T}\hat{h}'_i/\tau) + \sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)}\,\frac{I - M_{h_i}}{\|h_i\|}}_{\nabla^{pos}_{cl}} - \underbrace{\frac{1}{\tau}\,\frac{\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)\,(\hat{h}_i - \hat{h}'_j)}{\exp(\hat{h}_i^{T}\hat{h}'_i/\tau) + \sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau)}\,\frac{I - M_{h_i}}{\|h_i\|}}_{\sum_{j \neq i}^{N}\nabla^{neg_j}_{cl}} \tag{6}
$$

where $I$ is the identity matrix and $M_{h_i}$ is the projection matrix onto $h_i$. When training is driven by such a gradient signal, the first term $\nabla^{pos}_{cl}$ points from $\hat{h}_i$ to $\hat{h}'_i$, pulling the anchor and the positive sample together to optimize alignment, while each $\nabla^{neg_j}_{cl}$ in the second term points from $\hat{h}'_j$ to $\hat{h}_i$, pushing the anchor away from the negative samples to optimize uniformity. The gradient norm on alignment is then

$$
\|\nabla^{pos}_{cl}\| = \frac{1}{\tau}\,\frac{\sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)\,\|\hat{h}'_i - \cos\theta_{ii'}\,\hat{h}_i\|}{\exp(\cos\theta_{ii'}/\tau) + \sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} = \frac{1}{\tau}\,\frac{\sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ii'}}{\exp(\cos\theta_{ii'}/\tau) + \sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \tag{7}
$$

where $\theta_{ii'}$ represents the angle between the anchor $\hat{h}_i$ and the positive sample $\hat{h}'_i$, and $\theta_{ij'}$ represents the angle between the anchor $\hat{h}_i$ and the negative sample $\hat{h}'_j$.
Likewise, we can calculate the gradient norms of $L_{align}$ and $L_{dcl}$ on alignment, denoted by $\|\nabla L_{align}\|$ and $\|\nabla^{pos}_{dcl}\|$ respectively:

$$
\left\|\frac{\partial L_{align}}{\partial h_i}\right\| = \|\nabla L_{align}\| = \frac{\alpha\,(2\sin(\theta_{ii'}/2))^{\alpha-2}}{\|h_i\|}\,\|\hat{h}'_i - \cos\theta_{ii'}\,\hat{h}_i\| = \frac{\alpha\,(2\sin(\theta_{ii'}/2))^{\alpha-2}\,\sin\theta_{ii'}}{\|h_i\|} \tag{8}
$$

$$
\|\nabla^{pos}_{dcl}\| = \frac{1}{\tau\,\|h_i\|}\,\|\hat{h}'_i - \cos\theta_{ii'}\,\hat{h}_i\| = \frac{\sin\theta_{ii'}}{\tau\,\|h_i\|} \tag{9}
$$

Comparing Equations 7, 8 and 9, we find that the gradient directions of these three parts are the same, but a key difference is that $\theta_{ij'}$ appears in $\|\nabla^{pos}_{cl}\|$ and in neither $\|\nabla L_{align}\|$ nor $\|\nabla^{pos}_{dcl}\|$, so only the first gradient signal depends on $\theta_{ij'}$. Figure 2a visualizes this difference with heatmaps of $\|\nabla^{pos}_{cl}\|$, $\|\nabla L_{align}\|$ and $\|\nabla^{pos}_{dcl}\|$ as functions of $\theta_{ii'}$ and $\theta_{ij'}$. The leftmost four plots show $\|\nabla^{pos}_{cl}\|$ under different numbers of negative samples, while the rightmost two plots show $\|\nabla L_{align}\|$ and $\|\nabla^{pos}_{dcl}\|$ with 64 negative samples. In the leftmost four plots, the area of the shaded part (the area with weak gradient signals) gradually decreases as the number of negative samples increases. Therefore, it is conceivable that when the number of negative samples tends to infinity, these images will gradually become identical to the two rightmost images. Comparing the leftmost plot with the rightmost two plots, we can clearly see the difference: the former exhibits "gradient dissipation" while the latter do not. Consider what happens during training: $\theta_{ij'}$ will increase and $\theta_{ii'}$ will decrease gradually. For $\|\nabla^{pos}_{cl}\|$, the gradient signals of some anchors will suddenly dissipate at a certain training phase, and only the anchors whose $\theta_{ij'}$ is not much larger than $\theta_{ii'}$ will still receive gradient signals. For $\|\nabla L_{align}\|$ and $\|\nabla^{pos}_{dcl}\|$, since they are not constrained by $\theta_{ij'}$, their gradient signals will not vary along the horizontal axis.
All anchors except those with an absolutely small $\theta_{ii'}$ will receive continuous gradient signals during training. We point out that the above qualitative conclusion still holds if the angle is replaced with a generic distance function $\rho(\cdot, \cdot)$ on the hypersphere. The difference in "gradient dissipation" can then be intuitively described as: "When the number of negative samples is small, contrastive loss tries to keep $\rho(\hat{h}_i, \hat{h}'_i)$ smaller than $\rho(\hat{h}_i, \hat{h}'_j)$ by some value, while its decoupled forms try to keep $\rho(\hat{h}_i, \hat{h}'_i)$ absolutely small". Here, we hypothesize that "gradient dissipation" is a key property behind the performance gap between contrastive loss and its decoupled forms, and more insights on this hypothesis are provided in Section 5. Due to the space limitation of the main text, only the gradients related to the alignment part are analyzed here. Please refer to Appendix B for the analysis of the uniformity part and Appendix C for the derivation of all equations in this section and the subsequent ones.
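The contrast between Equation 7 and the DCL alignment gradient norm can be checked numerically. The sketch below is an illustration under simplifying assumptions that are not in the paper: all negatives are placed at one common angle `theta_neg`, and $\|h_i\| = 1$. The function names and the chosen angles are hypothetical.

```python
import numpy as np

def grad_norm_pos_cl(theta_pos, theta_neg, n_neg=64, tau=0.05, h_norm=1.0):
    """Equation 7, assuming all n_neg negatives sit at the same angle theta_neg."""
    # Dividing numerator and denominator by n_neg * exp(cos(theta_neg) / tau):
    # weight = 1 / (1 + exp((cos(theta_pos) - cos(theta_neg)) / tau) / n_neg)
    gap = (np.cos(theta_pos) - np.cos(theta_neg)) / tau
    weight = 1.0 / (1.0 + np.exp(gap - np.log(n_neg)))
    return weight * np.sin(theta_pos) / (tau * h_norm)

def grad_norm_pos_dcl(theta_pos, tau=0.05, h_norm=1.0):
    """Alignment gradient norm of L_dcl: independent of the negatives."""
    return np.sin(theta_pos) / (tau * h_norm)

theta_pos = np.pi / 6                       # positive at 30 degrees
dcl = grad_norm_pos_dcl(theta_pos)
# Negatives as close as the positive: the two norms nearly coincide.
ratio_near = grad_norm_pos_cl(theta_pos, np.pi / 6) / dcl
# Negatives at 90 degrees: the contrastive signal has dissipated.
ratio_far = grad_norm_pos_cl(theta_pos, np.pi / 2) / dcl
# Same far negatives with a very large negative set: the signal comes back.
ratio_far_many = grad_norm_pos_cl(theta_pos, np.pi / 2, n_neg=10**9) / dcl
```

Under these assumptions the ratio of the two norms is a sigmoid in $(\cos\theta_{ij'} - \cos\theta_{ii'})/\tau + \log N$, which is one way to see why the dissipation region shrinks as the number of negatives grows.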

4.1. VALIDATION VIA THE INTRODUCTION OF $L_{dcl^+}$

To validate our hypothesis, we try to simulate this property with the smallest possible changes. We notice two facts: (1) $L_{dcl}$ gradually becomes negative during training; (2) "gradient dissipation" occurs only when $\theta_{ij'}$ is sufficiently large relative to $\theta_{ii'}$. These two facts inspire us to introduce a ReLU function (Glorot et al., 2011) on top of $L_{dcl}$ to truncate the gradient signal provided to the anchor when $L_{dcl} < 0$:

$$
L_{dcl^+} = \max\left(-\hat{h}_i^{T}\hat{h}'_i/\tau + \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau),\ 0\right) \tag{10}
$$

Likewise, we can obtain its gradient norm associated with alignment:

$$
\|\nabla^{pos}_{dcl^+}\| =
\begin{cases}
\dfrac{\sin\theta_{ii'}}{\tau}\,\dfrac{1}{\|h_i\|}, & -\hat{h}_i^{T}\hat{h}'_i/\tau + \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau) > 0 \\[2ex]
0, & -\hat{h}_i^{T}\hat{h}'_i/\tau + \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau) \le 0
\end{cases} \tag{11}
$$

The leftmost plot of Figure 2b shows $\|\nabla^{pos}_{dcl^+}\|$ with 64 negative samples. Comparing it with the leftmost plot of Figure 2a, we find that the images of $\|\nabla^{pos}_{dcl^+}\|$ and $\|\nabla^{pos}_{cl}\|$ are almost identical, which shows that the added ReLU function can indeed simulate the property of "gradient dissipation". More importantly, the experimental results in Table 1 demonstrate the effectiveness of $L_{dcl^+}$, which obtains a 3%-7% improvement over $L_{dcl}$ on STS tasks across the different pretrained models.
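A minimal NumPy sketch of the truncated loss described above follows. It assumes in-batch negatives, and the function name `dcl_plus`, the returned `active` mask and the toy inputs are assumptions made for illustration. The mask marks the anchors whose pre-ReLU value is positive, i.e. those that would still receive a gradient.

```python
import numpy as np

def dcl_plus(h, h_prime, tau=0.05):
    """ReLU on top of L_dcl. Returns (loss, active) per anchor, where
    active[i] is True while anchor i still receives a gradient signal."""
    a = h / np.linalg.norm(h, axis=1, keepdims=True)
    b = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    sim = a @ b.T / tau
    off_diag = ~np.eye(len(sim), dtype=bool)
    neg = np.where(off_diag, sim, -np.inf)            # mask out the positive
    top = neg.max(axis=1)
    lse_neg = top + np.log(np.exp(neg - top[:, None]).sum(axis=1))
    l_dcl = -np.diag(sim) + lse_neg
    return np.maximum(l_dcl, 0.0), l_dcl > 0

# Well-separated toy batch: positives coincide with anchors, negatives are orthogonal.
loss_sep, active_sep = dcl_plus(np.eye(4), np.eye(4), tau=0.5)
# Badly aligned toy batch: each positive is orthogonal and one negative coincides.
loss_bad, active_bad = dcl_plus(np.eye(4), np.roll(np.eye(4), 1, axis=0), tau=0.5)
```

In the first batch every anchor is already inactive (zero loss, zero gradient), while in the second every anchor is active, mirroring the two branches of the piecewise gradient norm.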

4.2. VALIDATION VIA THE UPPER BOUNDS FOR $L_{cl}$

In fact, $L_{dcl^+}$ is introduced not only because of the above observations, but also because $L_{dcl^+} + \log 2$ can be proved to be an upper bound of $L_{cl}$. Further, we find another upper bound with the property of "gradient dissipation", which can be derived by a second relaxation based on $L_{dcl^+}$:

$$
\begin{aligned}
L_{cl} &\le \log 2 + \max\left(-\hat{h}_i^{T}\hat{h}'_i/\tau + \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau),\ 0\right) = \log 2 + L_{dcl^+} \\
&\le \log 2 + \frac{1}{\tau}\max\left(-\hat{h}_i^{T}\hat{h}'_i + \max_{j \neq i}\hat{h}_i^{T}\hat{h}'_j + \tau\log(N-1),\ 0\right)
\end{aligned} \tag{12}
$$

where $\tau\log(N-1)$ can be replaced by the margin hyperparameter $m$. We then obtain a new loss function:

$$
L_{mpt} = \max\left(-\hat{h}_i^{T}\hat{h}'_i + \max_{j \neq i}\hat{h}_i^{T}\hat{h}'_j + m,\ 0\right) \tag{13}
$$

where the hardest negative sample is selected for optimization. Using the Euclidean distance and the angular distance in place of the inner product gives:

$$
L_{met} = \max\left\{\|\hat{h}_i - \hat{h}'_i\|_2 - \min_{j \neq i}\|\hat{h}_i - \hat{h}'_j\|_2 + m,\ 0\right\} \tag{14}
$$

$$
L_{mat} = \max\left\{\theta_{ii'} - \min_{j \neq i}\theta_{ij'} + m,\ 0\right\} \tag{15}
$$

Their performance on STS and TR tasks is reported in Table 1. We find that these loss functions outperform the decoupled forms of contrastive loss and are comparable to or better than contrastive loss. Likewise, the gradient norm variations associated with alignment, $\|\nabla^{pos}_{mpt}\|$, $\|\nabla^{pos}_{met}\|$ and $\|\nabla^{pos}_{mat}\|$, are plotted in Figure 4c. As we can see, there are some differences among the gradient norm variations due to the choice of distance function, but they do not lead to a significant difference in performance, which suggests that it is the "gradient dissipation" property, and not a specific distance function, that is at work.
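The three margin losses and the chain of upper bounds can be sketched and checked numerically as follows. This is an illustration, not the authors' code: it assumes in-batch negatives, uses the number of negatives in the batch (`B - 1`) for the log term of the relaxed bound, and picks the margins `m_mpt=0.23` (the best value reported in the text), `m_met=0.45` and `m_mat=0.15 * pi` (arbitrary points inside the search ranges listed in the appendix). All function names are hypothetical.

```python
import numpy as np

def _unit(h):
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def margin_losses(h, h_prime, m_mpt=0.23, m_met=0.45, m_mat=0.15 * np.pi):
    """Per-anchor L_mpt, L_met and L_mat, each using the hardest negative."""
    cos = np.clip(_unit(h) @ _unit(h_prime).T, -1.0, 1.0)
    off_diag = ~np.eye(len(cos), dtype=bool)
    pos = np.diag(cos)
    # The most similar negative is also the closest in distance and in angle.
    hardest = np.where(off_diag, cos, -np.inf).max(axis=1)
    chord = lambda c: np.sqrt(2.0 - 2.0 * c)       # Euclidean distance on the unit sphere
    l_mpt = np.maximum(-pos + hardest + m_mpt, 0.0)
    l_met = np.maximum(chord(pos) - chord(hardest) + m_met, 0.0)
    l_mat = np.maximum(np.arccos(pos) - np.arccos(hardest) + m_mat, 0.0)
    return l_mpt, l_met, l_mat

def bound_chain(h, h_prime, tau=0.05):
    """Per anchor: (L_cl, log 2 + L_dcl+, log 2 + relaxed hardest-negative bound)."""
    sim = np.clip(_unit(h) @ _unit(h_prime).T, -1.0, 1.0) / tau
    off_diag = ~np.eye(len(sim), dtype=bool)
    neg = np.where(off_diag, sim, -np.inf)
    top = neg.max(axis=1)
    lse_neg = top + np.log(np.exp(neg - top[:, None]).sum(axis=1))
    l_dcl = -np.diag(sim) + lse_neg
    l_cl = np.logaddexp(0.0, l_dcl)                # L_cl = log(1 + exp(L_dcl))
    n_neg = len(sim) - 1                           # negatives per anchor in this batch
    relaxed = np.maximum(-np.diag(sim) + top + np.log(n_neg), 0.0)
    return l_cl, np.log(2.0) + np.maximum(l_dcl, 0.0), np.log(2.0) + relaxed

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
h_pos = h + 0.5 * rng.normal(size=(8, 16))
l_cl, bound_1, bound_2 = bound_chain(h, h_pos)
l_mpt, l_met, l_mat = margin_losses(h, h_pos)
```

The first inequality follows from $\log(1 + e^{x}) \le \log 2 + \max(x, 0)$, and the second from bounding a log-sum-exp by its largest term plus the log of the number of terms.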

4.3. MECHANISTIC ANALYSIS OF GRADIENT DISSIPATION

To gain a better understanding of the property of "gradient dissipation", we take a closer look at $L_{mpt}$, because its degree of "gradient dissipation" can be adjusted by a single parameter $m$ (Figure 2b) and is independent of the number of negative samples. With this more flexible form, the effect of the margin value on performance can be observed quantitatively. Recall that "gradient dissipation" occurs when $d_{ij}$, i.e., $|\rho(\hat{h}_i, \hat{h}'_i) - \min_{j \neq i}\rho(\hat{h}_i, \hat{h}'_j)|$, goes outside a certain range; we can therefore obtain different final $d_{ij}$ by selecting different $m$ during training. Figure 3a shows the performance on STS tasks and TR tasks with different $d_{ij}$. It can be clearly observed that the performance on STS tasks is much more sensitive to $d_{ij}$ than that on TR tasks. Specifically, a $d_{ij}$ that is too large or too small causes the performance on the STS tasks to drop sharply. In contrast, the performance on TR tasks does not vary significantly with $d_{ij}$, which seems to explain to some extent why the decoupled forms can work well in VRL. Surprisingly, no trade-off is needed to guarantee good performance on both STS and TR tasks: a suitable margin value achieves the best performance on both, which indicates that sentence representations can achieve both intrinsic and extrinsic excellence. For a more in-depth look, we set $m$ to 0.10, 0.23 (corresponding to the best performance) and 0.80 respectively, and record the variation of $L_{cl}$, $L_{align}$ and $L_{uniform}$ on the training and development sets. Comparing Figures 3c and 3d first, we find that $L_{cl}$ drops lower on both the training and validation sets when $m$ is 0.23, and this difference is magnified in $L_{uniform}$, while $L_{align}$ remains almost constant. This observation is consistent with our intuition: when $m$ is too small, the gradient signals may be weak at the beginning of training, leading to no improvement in the uniformity of the representation vectors.
We then turn to Figure 3e, which plots the same metrics when $m$ is 0.80. The curves as a whole are highly similar to Figure 1d, with the same overfitting phenomenon. We interpret this as an excessively large margin making the optimization exceptionally difficult, which prevents "gradient dissipation" from occurring during training. To summarize, the above phenomena further illustrate that the timing of gradient dissipation is very important: "gradient dissipation" that comes too early results in under-tuned models, while "gradient dissipation" that comes too late results in degraded model performance and overfitting during training. We further study the connection between $L_{mpt}$ and the Mixup-based approaches (Kalantidis et al., 2020; Zhang et al., 2022b) in Appendix E.

5. DISCUSSION

In this section, we share some thoughts and observations to explain why we believe that "gradient dissipation" plays a key role in SRL. We start by considering it in relation to the STS tasks. For consistency with practical retrieval scenarios, the STS tasks care about the ordering of semantic similarities rather than their specific values (Reimers & Gurevych, 2019). In the ideal case, samples that are more semantically similar to the anchor should be closer to it, with no need to care about their absolute distance to the anchor. Intuitively, the property of "gradient dissipation" only works to put negative samples further away from the anchor than positive samples, which is consistent with the needs of the STS tasks. On the other hand, alignment and uniformity only describe the relation between the anchor and its samples, while ignoring the relations between the positive samples and the negative samples. In other words, this form of optimization overemphasizes the characteristics of the individual sentence itself and ignores the actual connections between the semantics. In early exploratory trials, we find that $d_{ij}$ is continuously enlarged when the decoupled forms are adopted for training (Figure 3b), far beyond the level reached when training with contrastive loss. Considering the limitations of data augmentation in generating positive samples and the noise in sampling negative samples, we suspect that the large distance gap brought by the decoupled forms compromises the original semantic information of the pretrained models to some extent. Combining the above thoughts and observations, we hypothesize that the "gradient dissipation" property is responsible for the better performance in the STS evaluation.
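The point that the STS evaluation depends on ordering rather than on specific values can be illustrated with a small rank-correlation example. The sketch below is an assumption-laden toy: the data are invented, the helper names are hypothetical, and the rank function ignores ties for brevity.

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of the entries of x (assumes no ties, for brevity)."""
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gold = np.array([0.0, 1.2, 2.5, 3.1, 4.8])            # toy human similarity ratings
cos_sim = np.array([0.10, 0.15, 0.40, 0.45, 0.90])    # toy model similarities
squashed = cos_sim ** 5                                # same ordering, different values

rho_original = spearman(gold, cos_sim)
rho_squashed = spearman(gold, squashed)
r_original = pearson(gold, cos_sim)
r_squashed = pearson(gold, squashed)
```

Because `squashed` is a monotone transform of `cos_sim`, the two Spearman values coincide, whereas the Pearson value changes. A loss that only enforces the relative order of positives and negatives is therefore already aligned with this metric.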

6. RELATED WORK

Sentence representation learning (SRL) (Kiros et al., 2015; Conneau et al., 2017; Reimers & Gurevych, 2019; Li et al., 2020) is one of the fundamental tasks in NLP, aiming at learning semantically rich high-dimensional representations at the sentence level. Good sentence representations need to satisfy both (1) high correlation with human-rated similarities (intrinsic evaluation) and (2) good transferability (extrinsic evaluation) (Chiu et al., 2016; Faruqui et al., 2016). The pretrained language models (Devlin et al., 2018; Liu et al., 2019) were once regarded as the source for obtaining universal sentence representations, and the representations obtained from pretrained models have shown extraordinary performance on transfer tasks. However, these representations perform even worse than averaged GloVe embeddings (Pennington et al., 2014) on semantic textual similarity (STS) tasks (Reimers & Gurevych, 2019). Further studies have found that the word representation spaces of pretrained models, such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018), are anisotropic (Ethayarajh, 2019), with word embeddings concentrated in a high-dimensional cone (Gao et al., 2018). Early works (Li et al., 2020; Su et al., 2021) try to reduce the anisotropy of the pretrained representation space using the whitening transformation (Su et al., 2021) or the flow function (Li et al., 2020). These approaches are easy to implement, but bring limited improvement on STS tasks. Recently, contrastive learning methods based on instance discrimination (Zhang et al., 2022a; b; c) have been introduced to SRL. However, a great many works (Yan et al., 2021; Gao et al., 2021; Zhou et al., 2022; Zhang et al., 2022b) focus on how to obtain or generate better positive and negative samples.
At the same time, we find that even though there are significant differences between current contrastive learning methods in VRL and SRL, such as differences in pretraining and fine-tuning and differences in negative sample sizes (see Section 1 for details), few works have examined the optimization mechanisms of contrastive learning in SRL. Instead, a large number of findings from VRL, such as alignment and uniformity (Wang & Isola, 2020; Gao et al., 2021), momentum updates (He et al., 2020; Wu et al., 2021) and bootstrap (Grill et al., 2020; Cao et al., 2022), have been directly applied. Although a large number of works in SRL (Gao et al., 2021; Zhang et al., 2022b; c; a; Klein & Nabi, 2022) use alignment and uniformity as evaluation metrics, to our knowledge, this work is the first to study their inadequacy as optimization objectives in SRL.

7. CONCLUSION

In this paper, we focus on the performance gap between contrastive loss and its decoupled forms, i.e., alignment and uniformity, in SRL. Our series of new findings contributes to a deeper understanding of how contrastive loss can improve the quality of sentence representations on STS tasks: (1) alignment and uniformity losses or their equivalents are not suitable as alternative loss functions for contrastive loss in SRL, due to their lower performance and overfitting problem; (2) the "gradient dissipation" property of contrastive loss under a small number of negative samples plays a key role in the performance on STS tasks and in preventing overfitting; (3) the "gradient dissipation" property works to maintain a suitable distance gap between anchor-positive and anchor-negative samples, while alignment and uniformity are properties that push for an absolutely large distance between them; (4) two other loss functions with the "gradient dissipation" property can also solve the overfitting problem and obtain the same or even better performance than contrastive loss, even though their "gradient dissipation" properties are not sensitive to the number of negative samples. We hope these findings help build a better understanding of the key properties of contrastive loss in SRL. Further, new loss functions that are not bound to the form of contrastive loss can be designed based on these properties, improving the quality of sentence representations comprehensively.

α ∈ {1.0, 2.0, 3.0}; t ∈ [2, 10], step size is 1
• $L_{dcl^+}$: τ ∈ [0.10, 0.20], step size is 0.01
• $L_{mpt}$: m ∈ [0.20, 0.35], step size is 0.01
• $L_{met}$: m ∈ [0.40, 0.50], step size is 0.01
• $L_{mat}$: m ∈ [0.10π, 0.20π], step size is 0.01π

The optimal parameters are shown in Table 2; they are used to report all results and draw all plots in this paper.

A.4 EVALUATION PROTOCOL

We report the performance of the sentence representations on the STS tasks using Spearman's rank correlation, which has been widely used in recent works. Compared with Pearson's correlation, Spearman's rank correlation focuses on relative ranking instead of absolute scores, which is more in line with practical retrieval applications of text similarity matching (Reimers et al., 2016).

B THE ANALYSIS ON UNIFORMITY PART

Similar to the main text, we study the uniformity part of the gradient norm of all the loss functions mentioned in the main text, and their formulas are shown below:

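$$
\|\nabla^{neg_j}_{cl}\| = \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ij'}}{\exp(\cos\theta_{ii'}/\tau) + \sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \tag{16}
$$

$$
\|\nabla L^{j}_{uniform}\| = \frac{2t\,\exp(2t\cos\theta_{ij'})\,\sin\theta_{ij'}}{\sum_{i}^{N}\sum_{j \neq i}^{N}\exp(2t\cos\theta_{ij'})}\,\frac{1}{\|h_i\|} \tag{17}
$$

$$
\|\nabla^{neg_j}_{dcl}\| = \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ij'}}{\sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \tag{18}
$$

$$
\|\nabla^{neg_j}_{dcl^+}\| =
\begin{cases}
\dfrac{1}{\tau}\,\dfrac{\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ij'}}{\sum_{j \neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\dfrac{1}{\|h_i\|}, & \hat{h}_i^{T}\hat{h}'_i/\tau - \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau) < 0 \\[2ex]
0, & \hat{h}_i^{T}\hat{h}'_i/\tau - \log\sum_{j \neq i}^{N}\exp(\hat{h}_i^{T}\hat{h}'_j/\tau) \ge 0
\end{cases} \tag{19}
$$

$$
\|\nabla^{neg}_{mpt}\| =
\begin{cases}
\dfrac{\sin\theta_{ij'}}{\|h_i\|}, & \hat{h}_i^{T}\hat{h}'_i - \max_{j \neq i}\hat{h}_i^{T}\hat{h}'_j < m \\[2ex]
0, & \hat{h}_i^{T}\hat{h}'_i - \max_{j \neq i}\hat{h}_i^{T}\hat{h}'_j \ge m
\end{cases} \tag{20}
$$

$$
\|\nabla^{neg}_{met}\| =
\begin{cases}
\dfrac{\cos(\theta_{ij'}/2)}{\|h_i\|}, & \min_{j \neq i}\|\hat{h}_i - \hat{h}'_j\|_2 - \|\hat{h}_i - \hat{h}'_i\|_2 < m \\[2ex]
0, & \min_{j \neq i}\|\hat{h}_i - \hat{h}'_j\|_2 - \|\hat{h}_i - \hat{h}'_i\|_2 \ge m
\end{cases} \tag{21}
$$

$$
\|\nabla^{neg}_{mat}\| =
\begin{cases}
\dfrac{1}{\|h_i\|}, & \min_{j \neq i}\theta_{ij'} - \theta_{ii'} < m \\[2ex]
0, & \min_{j \neq i}\theta_{ij'} - \theta_{ii'} \ge m
\end{cases} \tag{22}
$$

In Equations 20 to 22, $j$ denotes the hardest negative sample selected by the loss, and the conditions are written so that the gradient is nonzero exactly when the corresponding loss is positive. The derivation of all the above equations can be found in Appendix C, and their trends as functions of $\theta_{ii'}$ and $\theta_{ij'}$ are plotted in Figure 4. The findings for the uniformity part are highly similar to those for the alignment part:
• $\|\nabla^{neg_j}_{cl}\|$ approaches $\|\nabla L^{j}_{uniform}\|$ and $\|\nabla^{neg_j}_{dcl}\|$ as the negative sample size increases.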
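• When the number of negative samples is small, "gradient dissipation" exists for $\|\nabla^{neg_j}_{cl}\|$.

C.2 THE DERIVATION OF EQUATION 7 AND 16

$$\begin{aligned} \frac{\partial L_{cl}}{\partial h_i} &= -\frac{1}{\tau}\left(\hat{h}'_i-\frac{\exp(\hat{h}_i^T\hat{h}'_i/\tau)\,\hat{h}'_i+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,\hat{h}'_j}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\right)\frac{\partial\hat{h}_i}{\partial h_i} \\ &= -\frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,(\hat{h}'_i-\hat{h}'_j)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|} \\ &= -\frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\left[(\hat{h}'_i-\hat{h}_i)-(\hat{h}'_j-\hat{h}_i)\right]}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|} \\ &= \underbrace{-\frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,(\hat{h}'_i-\hat{h}_i)\,\frac{I-M_{h_i}}{\|h_i\|}}_{\nabla^{pos}_{cl}}\;\underbrace{-\,\frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,(\hat{h}_i-\hat{h}'_j)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|}}_{\sum_{j\neq i}^{N}\nabla^{neg_j}_{cl}} \end{aligned}$$

$$\begin{aligned} \|\nabla^{pos}_{cl}\| &= \frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{\|(\hat{h}'_i-\hat{h}_i)(I-M_{h_i})\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}{\exp(\cos\theta_{ii'}/\tau)+\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{\|\hat{h}'_i-\hat{h}_i\cos\theta_{ii'}\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ii'}}{\exp(\cos\theta_{ii'}/\tau)+\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \end{aligned} \tag{24}$$

$$\begin{aligned} \|\nabla^{neg_j}_{cl}\| &= \left\|\frac{1}{\tau}\,\frac{\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,(\hat{h}_i-\hat{h}'_j)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|}\right\| \\ &= \frac{1}{\tau}\,\frac{\exp(\hat{h}_i^T\hat{h}'_j/\tau)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{\|(\hat{h}_i-\hat{h}'_j)(I-M_{h_i})\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)}{\exp(\cos\theta_{ii'}/\tau)+\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{\|\hat{h}_i\cos\theta_{ij'}-\hat{h}'_j\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ij'}}{\exp(\cos\theta_{ii'}/\tau)+\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \end{aligned} \tag{25}$$

C.3 THE DERIVATION OF EQUATION 8 AND 17

$$L_{align} = \left(\|\hat{h}_i-\hat{h}'_i\|_2^2\right)^{\frac{\alpha}{2}} = \left(2-2\hat{h}_i^T\hat{h}'_i\right)^{\frac{\alpha}{2}} \tag{26}$$

$$\frac{\partial L_{align}}{\partial h_i} = \frac{\alpha}{2}\left(2-2\hat{h}_i^T\hat{h}'_i\right)^{\frac{\alpha-2}{2}}(-2\hat{h}'_i)\,\frac{I-M_{h_i}}{\|h_i\|} = -\alpha\left(2-2\hat{h}_i^T\hat{h}'_i\right)^{\frac{\alpha-2}{2}}\hat{h}'_i\,\frac{I-M_{h_i}}{\|h_i\|} \tag{27}$$

$$\begin{aligned} \left\|\frac{\partial L_{align}}{\partial h_i}\right\| &= \alpha\,(2-2\cos\theta_{ii'})^{\frac{\alpha-2}{2}}\,\frac{\|\hat{h}'_i-\hat{h}_i\cos\theta_{ii'}\|}{\|h_i\|} \\ &= \alpha\left(4\sin^2(\theta_{ii'}/2)\right)^{\frac{\alpha-2}{2}}\sin\theta_{ii'}\,\frac{1}{\|h_i\|} \\ &= \alpha\left(2\sin(\theta_{ii'}/2)\right)^{\alpha-2}\sin\theta_{ii'}\,\frac{1}{\|h_i\|} = \|\nabla L_{align}\| \end{aligned} \tag{28}$$

$$L_{uniform} = \log\frac{1}{N(N-1)}\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right) \tag{29}$$

$$\frac{\partial L_{uniform}}{\partial h_i} = -2t\,\frac{\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right)(\hat{h}_i-\hat{h}'_j)}{\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right)}\,\frac{I-M_{h_i}}{\|h_i\|} = \sum_{j\neq i}^{N}\nabla L^{j}_{uniform} \tag{30}$$

$$\begin{aligned} \|\nabla L^{j}_{uniform}\| &= \left\|2t\,\frac{\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right)(\hat{h}'_j-\hat{h}_i)}{\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right)}\,\frac{I-M_{h_i}}{\|h_i\|}\right\| \\ &= \frac{2t\,\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)}{\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)}\,\frac{\|(\hat{h}'_j-\hat{h}_i)(I-M_{h_i})\|}{\|h_i\|} \\ &= \frac{2t\,\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)}{\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)}\,\frac{\|\hat{h}'_j-\hat{h}_i\cos\theta_{ij'}\|}{\|h_i\|} \\ &= \frac{2t\,\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)\sin\theta_{ij'}}{\sum_{i}^{N}\sum_{j\neq i}^{N}\exp\left(-4t\sin^2(\theta_{ij'}/2)\right)}\,\frac{1}{\|h_i\|} \end{aligned} \tag{31}$$

C.4 THE DERIVATION OF EQUATION 9, 11, 18 AND 19

$$\frac{\partial L_{dcl}}{\partial h_i} = \underbrace{-\frac{\hat{h}'_i-\hat{h}_i}{\tau}\,\frac{I-M_{h_i}}{\|h_i\|}}_{\nabla^{pos}_{dcl}}\;\underbrace{-\,\frac{1}{\tau}\,\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,(\hat{h}_i-\hat{h}'_j)}{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|}}_{\sum_{j\neq i}^{N}\nabla^{neg_j}_{dcl}} \tag{32}$$

$$\|\nabla^{pos}_{dcl}\| = \left\|\frac{1}{\tau}\,(\hat{h}'_i-\hat{h}_i)\,\frac{I-M_{h_i}}{\|h_i\|}\right\| = \frac{1}{\tau}\,\frac{\|(\hat{h}'_i-\hat{h}_i)(I-M_{h_i})\|}{\|h_i\|} = \frac{1}{\tau}\,\frac{\sin\theta_{ii'}}{\|h_i\|} \tag{33}$$

$$\begin{aligned} \|\nabla^{neg_j}_{dcl}\| &= \left\|\frac{1}{\tau}\,\frac{\exp(\hat{h}_i^T\hat{h}'_j/\tau)\,(\hat{h}'_j-\hat{h}_i)}{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{I-M_{h_i}}{\|h_i\|}\right\| \\ &= \frac{1}{\tau}\,\frac{\exp(\hat{h}_i^T\hat{h}'_j/\tau)}{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}\,\frac{\|(\hat{h}'_j-\hat{h}_i)(I-M_{h_i})\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)}{\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{\|\hat{h}'_j-\hat{h}_i\cos\theta_{ij'}\|}{\|h_i\|} \\ &= \frac{1}{\tau}\,\frac{\exp(\cos\theta_{ij'}/\tau)\,\sin\theta_{ij'}}{\sum_{j\neq i}^{N}\exp(\cos\theta_{ij'}/\tau)}\,\frac{1}{\|h_i\|} \end{aligned} \tag{34}$$

C.5 THE DERIVATION OF EQUATION 12

$$\begin{aligned} L_{cl} &= -\log\frac{\exp(\hat{h}_i^T\hat{h}'_i/\tau)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)+\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)} \\ &= \log\left(1+\frac{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)}{\exp(\hat{h}_i^T\hat{h}'_i/\tau)}\right) \\ &= \log\left(1+\exp(-\hat{h}_i^T\hat{h}'_i/\tau)\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\right) \\ &= \log\left(1+\exp\left(-\hat{h}_i^T\hat{h}'_i/\tau+\log\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\right)\right) \\ &\leq \log 2+\max\left(-\hat{h}_i^T\hat{h}'_i/\tau+\log\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau),\,0\right) \\ &\leq \log 2+\max\left(-\hat{h}_i^T\hat{h}'_i/\tau+\log\left((N-1)\max_{j\neq i}\exp(\hat{h}_i^T\hat{h}'_j/\tau)\right),\,0\right) \\ &= \log 2+\max\left(-\hat{h}_i^T\hat{h}'_i/\tau+\max_{j\neq i}\hat{h}_i^T\hat{h}'_j/\tau+\log(N-1),\,0\right) \\ &= \log 2+\frac{1}{\tau}\max\left(-\hat{h}_i^T\hat{h}'_i+\max_{j\neq i}\hat{h}_i^T\hat{h}'_j+\tau\log(N-1),\,0\right) \end{aligned} \tag{35}$$

where the first inequality holds due to the following inequality:

$$f(x) = \log(1+\exp(x)) = \log\left(\sum_{t\in\{0,x\}}\exp(t)\right) \leq \log\left(2\exp\left(\max_{t\in\{0,x\}}t\right)\right) = \log 2+\max(x,0) \tag{36}$$

C.6 THE DERIVATION OF $\|\nabla^{pos}_{mpt}\|$, $\|\nabla^{pos}_{met}\|$, $\|\nabla^{pos}_{mat}\|$, $\|\nabla^{neg}_{mpt}\|$, $\|\nabla^{neg}_{met}\|$ AND $\|\nabla^{neg}_{mat}\|$

Consider a generic form of the optimization objective first:

$$L_{mdt} = \max\left(0,\;\rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j)+m\right) \tag{37}$$

where $\rho(\cdot,\cdot)$ represents a function used to calculate the similarity.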
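Then the gradient of $L_{mdt}$ with respect to $h_i$ is calculated and decomposed:

$$\frac{\partial L_{mdt}}{\partial h_i} = \begin{cases} \underbrace{\dfrac{\partial\rho(\hat{h}_i,\hat{h}'_i)}{\partial\hat{h}_i}\,\dfrac{I-M_{h_i}}{\|h_i\|}}_{\nabla^{pos}_{mdt}}-\underbrace{\dfrac{\partial\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j)}{\partial\hat{h}_i}\,\dfrac{I-M_{h_i}}{\|h_i\|}}_{\nabla^{neg}_{mdt}}, & \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] 0, & \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) \geq m \end{cases} \tag{38}$$

where $\nabla^{neg}_{mdt}$ is contributed only by the negative sample closest to the anchor. By specifying $\rho(\cdot,\cdot)$ as the dot product, the $l_2$-norm and the angle, we obtain the gradient norms of $L_{mpt}$, $L_{met}$ and $L_{mat}$ with respect to the alignment part, respectively:

$$\|\nabla^{pos}_{mdt}\| = \begin{cases} \dfrac{\sin\theta_{ii'}}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_i)=\hat{h}_i^T\hat{h}'_i \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] \dfrac{\cos(\theta_{ii'}/2)}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_i)=-\|\hat{h}_i-\hat{h}'_i\|_2 \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] \dfrac{1}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_i)=-\theta_{ii'} \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] 0, & \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) \geq m \end{cases} \tag{39}$$

Similarly, we can derive the gradient norm of each loss function with respect to the uniformity part:

$$\|\nabla^{neg}_{mdt}\| = \begin{cases} \dfrac{\sin(\min_{j\neq i}\theta_{ij'})}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_j)=\hat{h}_i^T\hat{h}'_j \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] \dfrac{\cos(\min_{j\neq i}\theta_{ij'}/2)}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_j)=-\|\hat{h}_i-\hat{h}'_j\|_2 \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] \dfrac{1}{\|h_i\|}, & \rho(\hat{h}_i,\hat{h}'_j)=-\theta_{ij'} \text{ and } \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) < m \\[2ex] 0, & \rho(\hat{h}_i,\hat{h}'_i)-\min_{j\neq i}\rho(\hat{h}_i,\hat{h}'_j) \geq m \end{cases} \tag{40}$$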
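The three closed-form gradient norms for the alignment part can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it assumes a toy 3-dimensional anchor `h` and a unit-norm positive `p_hat` (both hypothetical values), and compares a central finite-difference estimate of the gradient norm of ρ with respect to the unnormalized `h` against sin θ/‖h‖, cos(θ/2)/‖h‖ and 1/‖h‖. The sign of ρ does not affect the norm, so the unsigned Euclidean distance and angle are used here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def rho_dot(h, p_hat):
    # Dot product between the normalized anchor and the positive.
    return dot(normalize(h), p_hat)

def rho_euclid(h, p_hat):
    # Euclidean distance between the normalized anchor and the positive.
    diff = [x - y for x, y in zip(normalize(h), p_hat)]
    return math.sqrt(dot(diff, diff))

def rho_angle(h, p_hat):
    # Angle between the anchor and the positive (clamped for safety).
    return math.acos(max(-1.0, min(1.0, dot(normalize(h), p_hat))))

def grad_norm(rho, h, p_hat, eps=1e-6):
    # Central finite-difference estimate of ||d rho / d h||.
    squared = 0.0
    for k in range(len(h)):
        plus, minus = list(h), list(h)
        plus[k] += eps
        minus[k] -= eps
        partial = (rho(plus, p_hat) - rho(minus, p_hat)) / (2 * eps)
        squared += partial * partial
    return math.sqrt(squared)

# Hypothetical toy vectors, chosen only for illustration.
h = [0.6, -1.2, 2.0]
p_hat = normalize([1.0, 0.5, 0.3])
h_norm = math.sqrt(dot(h, h))
theta = rho_angle(h, p_hat)

checks = {
    "dot product": (grad_norm(rho_dot, h, p_hat), math.sin(theta) / h_norm),
    "Euclidean": (grad_norm(rho_euclid, h, p_hat), math.cos(theta / 2) / h_norm),
    "angle": (grad_norm(rho_angle, h, p_hat), 1.0 / h_norm),
}
for name, (numeric, closed_form) in checks.items():
    print(f"{name}: numeric={numeric:.6f} closed-form={closed_form:.6f}")
```

The same check applies to the uniformity part by replacing the positive with the nearest negative sample.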

D THE CONNECTION BETWEEN THE TWO DECOUPLED FORMS

In this paper, two important objects of study are the two decoupled forms of contrastive loss. The first is the alignment and uniformity losses proposed by Wang & Isola (2020):

$$L_{align}(f;\alpha) \triangleq \mathbb{E}_{(\hat{h}_i,\hat{h}'_i)\sim p_{pos}}\left[\|\hat{h}_i-\hat{h}'_i\|_2^{\alpha}\right],\quad \alpha>0 \tag{41}$$

$$L_{uniform}(f;t) \triangleq \log\,\mathbb{E}_{(\hat{h}_i,\hat{h}'_j)\overset{i.i.d.}{\sim}p_{data}}\left[\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right)\right],\quad t>0 \tag{42}$$

The second is the decoupled contrastive loss (DCL, $L_{dcl}$) proposed by Yeh et al. (2021):

$$L_{dcl} = -\log\frac{\exp(\hat{h}_i^T\hat{h}'_i/\tau)}{\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau)} = -\hat{h}_i^T\hat{h}'_i/\tau+\log\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau) \tag{43}$$

The first term of equation 43 is acknowledged by DCL's authors to be equivalent to equation 41, while the second term differs from $L_{uniform}$ only in the order of the logarithm and the first summation when the losses are calculated for all samples in the same mini-batch. Their lower bounds can be obtained through Jensen's inequality:

$$\begin{aligned} L_{uniform} &= \log\frac{1}{M(N-1)}\sum_{i}^{M}\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right) \\ &\geq \frac{1}{M}\sum_{i}^{M}\log\frac{1}{N-1}\sum_{j\neq i}^{N}\exp\left(-t\|\hat{h}_i-\hat{h}'_j\|_2^2\right) \\ &\geq -\frac{t}{M(N-1)}\sum_{i}^{M}\sum_{j\neq i}^{N}\|\hat{h}_i-\hat{h}'_j\|_2^2 \end{aligned} \tag{44}$$

For a mini-batch of data, the total loss is the mean of the loss calculated for each anchor in the batch:

$$L^{neg}_{dcl} = \log\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau) \tag{45}$$

$$\begin{aligned} L^{neg}_{dcl}-\log(N-1) &= \frac{1}{M}\sum_{i}^{M}\log\frac{1}{N-1}\sum_{j\neq i}^{N}\exp(\hat{h}_i^T\hat{h}'_j/\tau) \\ &\geq \frac{1}{\tau M(N-1)}\sum_{i}^{M}\sum_{j\neq i}^{N}\hat{h}_i^T\hat{h}'_j \\ &= -\frac{1}{2\tau M(N-1)}\sum_{i}^{M}\sum_{j\neq i}^{N}\|\hat{h}_i-\hat{h}'_j\|_2^2+\frac{1}{\tau} \end{aligned} \tag{46}$$

where M is the batch size and the last step uses $\hat{h}_i^T\hat{h}'_j = 1-\frac{1}{2}\|\hat{h}_i-\hat{h}'_j\|_2^2$ for unit vectors. Comparing equations 44 and 46, their common optimization objective is to maximize the sum of the squared Euclidean distances over all sample pairs. Further, this optimization objective corresponds to a specific form of the minimum energy problem on the hypersphere (Kuijlaars & Saff, 1998; Liu et al., 2018), which is generalized from the traditional Thomson problem (Thomson, 1904) in physics.
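The two Jensen bounds can be checked numerically on random unit vectors. The sketch below is only illustrative: the batch size, dimension, t and τ are assumed values, M is set equal to N, and the vectors are random rather than sentence representations. It evaluates each side of equations 44 and 46 and confirms the ordering, together with the identity relating dot products and squared distances.

```python
import math
import random

def unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)
N, dim, t, tau = 8, 16, 2.0, 0.05          # assumed toy settings, with M = N
anchors = [unit_vector(dim, rng) for _ in range(N)]
views = [unit_vector(dim, rng) for _ in range(N)]
neg = {i: [j for j in range(N) if j != i] for i in range(N)}

# Equation 44: log of the mean >= mean of per-anchor logs >= -t * mean distance.
l_uniform = math.log(mean(
    math.exp(-t * sq_dist(anchors[i], views[j])) for i in neg for j in neg[i]))
per_anchor_uniform = mean(
    math.log(mean(math.exp(-t * sq_dist(anchors[i], views[j])) for j in neg[i]))
    for i in neg)
mean_sq_dist = mean(sq_dist(anchors[i], views[j]) for i in neg for j in neg[i])
bound_uniform = -t * mean_sq_dist

# Equation 46: batch mean of (L_neg_dcl - log(N - 1)) and its lower bound.
centered_neg_dcl = mean(
    math.log(mean(math.exp(dot(anchors[i], views[j]) / tau) for j in neg[i]))
    for i in neg)
bound_dcl = mean(dot(anchors[i], views[j]) for i in neg for j in neg[i]) / tau

print(l_uniform, per_anchor_uniform, bound_uniform)
print(centered_neg_dcl, bound_dcl, 1 / tau - mean_sq_dist / (2 * tau))
```

Both lower bounds depend on the batch only through the mean squared distance, which is the shared objective noted above.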

E RELATION WITH MIXUP-BASED METHODS

We focus on a class of Mixup-based methods (Kalantidis et al., 2020; Zhang et al., 2022b) in contrastive learning whose optimization objectives are similar to $L_{mpt}$. These methods use Mixup (Zhang et al., 2018) to generate hard negative samples for robustness and performance improvement, and have been shown to be effective in VRL (Kalantidis et al., 2020) and SRL (Zhang et al., 2022b). It should be noted that such methods are improvements at the sample level and do not change the "gradient dissipation" property of contrastive loss. Existing works generate mixup negative samples in two ways. The first linearly weights the representations of two hard negative samples (Kalantidis et al., 2020). If we regard $\hat{h}_i$ as the anchor, the mixup negative sample for the anchor can be defined as:

$$h'_k = \lambda\hat{h}'_m+(1-\lambda)\hat{h}'_n,\quad m\neq i,\; n\neq i \tag{47}$$

where $\hat{h}'_m$ and $\hat{h}'_n$ are two hard negative samples, $h'_k$ is the mixup hard negative sample and λ is the weight parameter. The second linearly weights the representations of a positive sample and a random negative sample (Kalantidis et al., 2020; Zhang et al., 2022b):

$$h'_k = \lambda\hat{h}'_i+(1-\lambda)\hat{h}'_k,\quad k\neq i \tag{48}$$

where $\hat{h}'_k$ is a random negative sample and λ should be less than 0.5 to avoid generating pseudo-negative samples with high probability. We then discuss the relation between these two sample generation methods and $L_{mpt}$.
When M mixup negative samples are added to the original contrastive loss, the following inequality holds:

Then we need to perform a secondary relaxation, and the process needs to be discussed under the following sub-conditions:

Condition 1 If $h'_k$ is generated as in equation 47, the maximum of the N + M terms in the log function must be attained among the first N terms, and the following inequality holds:

These observations show the connection between $L_{mpt}$ and the mixup-based contrastive methods: (1) if mixup negative samples are generated with the first method, the optimization objective of $L_{mpt}$ can be made equivalent, to some extent, to that of $L_{mix}$ by appropriately adjusting m; (2) if mixup negative samples are generated with the second method, $L_{mix}$ has two upper bounds with different margins, $\tau\log(N+M)$ and $\frac{\tau}{1-\lambda}\log(N+M)$, during training. In contrast, the margin m in $L_{mpt}$ does not change during training. Therefore, the optimization objective of $L_{mpt}$ can be similar or equal to that of the mixup-based methods when an appropriate m (slightly larger than $\tau\log(N+M)$) is used, which provides a new perspective for explaining the good performance of $L_{mpt}$-like loss functions.
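The two mixup constructions in equations 47 and 48 can be sketched in a few lines. This is an illustrative sketch on random unit vectors, not the implementation of the cited methods: the dimension, the number of negatives and λ = 0.4 are assumptions, and the mixed vector is not renormalized, following the equations as written. It also shows why the maximum in Condition 1 is attained among the original negatives: the similarity of a mixture of two negatives is a convex combination of their similarities.

```python
import math
import random

def unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mix(a, b, lam):
    # Linear interpolation lam * a + (1 - lam) * b, without renormalization.
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

rng = random.Random(0)
dim, lam = 16, 0.4                         # assumed toy settings, lam < 0.5
anchor = unit_vector(dim, rng)
positive = unit_vector(dim, rng)
negatives = [unit_vector(dim, rng) for _ in range(6)]
sims = [dot(anchor, n) for n in negatives]

# Equation 47: mix the two hardest (most similar) negatives.
order = sorted(range(len(negatives)), key=lambda j: sims[j], reverse=True)
mix_hard = mix(negatives[order[0]], negatives[order[1]], lam)
sim_hard = dot(anchor, mix_hard)           # convex combination of two sims

# Equation 48: mix the positive with a random negative.
k = rng.randrange(len(negatives))
mix_pos = mix(positive, negatives[k], lam)
sim_pos = dot(anchor, mix_pos)
sim_pos_expected = lam * dot(anchor, positive) + (1 - lam) * sims[k]

print(sim_hard, max(sims))
print(sim_pos, sim_pos_expected)
```

For the second construction, the similarity of the mixed sample contains a share λ of the anchor-positive similarity, which is what produces the second margin τ/(1−λ)·log(N + M).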



https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt



L uniform -L align scatterplot. The colors of the points represent the average Spearman correlation on STS tasks. Metrics recorded during training via L dcl .

Figure 1: The difference between contrastive loss and its two decoupled forms in the training and evaluation phase. All above images are plotted based on the optimization process for BERT base .

Comparison among contrastive loss and its decoupled forms in the alignment part of the gradient norm. Comparison among the proposed loss functions for validation in the alignment part of the gradient norm.

Figure 2: Gradient norms of contrastive loss and its two decoupled forms with respect to the alignment part. The factor 1/∥h_i∥ in all equations is ignored when the images are plotted. All θ_ij′ in one equation are treated as the same value. All hyper-parameters in the equations are consistent with the columns corresponding to BERT base in Table 2.
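The "gradient dissipation" behaviour in the alignment part can be reproduced from the closed forms of equations 24 and 33 under the same simplifications as Figure 2 (1/∥h_i∥ dropped, all θ_ij′ equal). The sketch below is illustrative only; the angles, τ = 0.05 and the two negative-sample counts are assumed values, not the settings used for the figure.

```python
import math

def grad_pos_cl(theta_pos, theta_neg, n_neg, tau):
    # Equation 24 with all negatives at the same angle and 1/||h_i|| dropped.
    pos = math.exp(math.cos(theta_pos) / tau)
    neg = n_neg * math.exp(math.cos(theta_neg) / tau)
    return (1.0 / tau) * neg / (pos + neg) * math.sin(theta_pos)

def grad_pos_dcl(theta_pos, tau):
    # Equation 33 with 1/||h_i|| dropped: no dependence on the negatives.
    return math.sin(theta_pos) / tau

tau = 0.05
theta_pos, theta_neg = 0.3 * math.pi, 0.5 * math.pi   # assumed angles

small = grad_pos_cl(theta_pos, theta_neg, n_neg=63, tau=tau)
large = grad_pos_cl(theta_pos, theta_neg, n_neg=65535, tau=tau)
reference = grad_pos_dcl(theta_pos, tau)

print(f"contrastive, 63 negatives:    {small:.4f}")
print(f"contrastive, 65535 negatives: {large:.4f}")
print(f"decoupled contrastive:        {reference:.4f}")
```

With few negatives, the alignment gradient of contrastive loss is close to zero once the positive is much closer than the negatives, while the decoupled form keeps a large gradient. As the number of negatives grows, the contrastive gradient moves toward the decoupled one.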

Metrics recorded during training via Lmpt(m = 0.80).

Figure 3: The plots of the loss functions proposed in this work. d_ij in (a) and (b) represents Euclidean distance. We first calculate the average distance of all negative samples to the anchor, then subtract the distance of the positive sample to the anchor from it to get the distance gap, and finally plot the average distance gap of all anchors in the mini-batch. All above images are plotted based on the optimization process for BERT base .

Comparison among contrastive loss and its decoupled forms in the uniformity part of the gradient norm. Comparison among three Triplet-Loss-like loss functions in the uniformity part of the gradient norm. The horizontal axis represents the minimum angle between the anchor and the negative samples.

Figure 4: Gradient norms of all the loss functions studied in the main text with respect to the uniformity part. All plotted images ignore 1/∥h i ∥ and reflect only the gradient contribution of a single negative sample to the anchor.

• $\|\nabla^{neg_j}_{dcl^+}\|$ exhibits a "gradient dissipation" property similar to that of $\|\nabla^{neg_j}_{cl}\|$.
• $\|\nabla^{neg}_{mpt}\|$, $\|\nabla^{neg}_{met}\|$ and $\|\nabla^{neg}_{mat}\|$ also exhibit the "gradient dissipation" property, but differ from $\|\nabla^{neg_j}_{cl}\|$ in the trend of the gradient norms.

C FORMULA DERIVATION

C.1 THE DERIVATION OF EQUATION 6

Condition 2.1 If $h'_k$ is generated as in equation 48 and the maximum of the N + M terms in the log function is attained among the first N terms, inequality 50 still holds.

Condition 2.2 If $h'_k$ is generated as in equation 48 and the maximum of the N + M terms in the log function is attained among the latter M terms, another inequality holds:

$$\begin{aligned} L_{mix} &\leq \log 2+\max\left(-\hat{h}_i^T\hat{h}'_i/\tau+\max_{k}\hat{h}_i^T h'_k/\tau+\log(N+M),\,0\right) \\ &= \log 2+\max\left(-\hat{h}_i^T\hat{h}'_i/\tau+\max_{k}\hat{h}_i^T\left(\lambda\hat{h}'_i+(1-\lambda)\hat{h}'_k\right)/\tau+\log(N+M),\,0\right) \\ &= \log 2+\frac{1-\lambda}{\tau}\max\left(-\hat{h}_i^T\hat{h}'_i+\max_{k}\hat{h}_i^T\hat{h}'_k+\frac{\tau}{1-\lambda}\log(N+M),\,0\right) \end{aligned}$$

Performance of the optimization objectives studied in this paper. All results reported are the average value obtained from three runs. STS.Avg represents the average of Spearman correlation on seven semantic textual similarity datasets. TR.Avg represents the average of accuracy on seven downstream classification datasets.

Since this new loss function is very close in form to Triplet Loss (Weinberger & Saul, 2009), we express this loss function as L mpt (Minimum dot Product Triplet Loss). To further increase the diversity of gradient norm variations, we replace the dot product with the Euclidean distance to obtain L met (Minimum Euclidean distance Triplet Loss) and the angle to obtain L mat (Minimum Angle Triplet Loss):


ACKNOWLEDGEMENT

This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0110700, in part by the Fundamental Research Funds for the Central Universities, and in part by the State Key Laboratory of Software Development Environment.

A.2 TRAINING DETAILS

We validate the effectiveness of all loss functions on top of SimCSE (Gao et al., 2021), as it is effective yet simple enough that performance is not easily influenced by other factors. Following Gao et al. (2021), we observe the following points during training: (1) no data from any STS training set is used for training; (2) 1,000,000 sentences sampled from the English Wikipedia 2 are used for training instead; (3) the Spearman correlation on the STS-B development set is recorded every 125 steps; (4) the checkpoint with the highest Spearman correlation is saved for evaluation; (5) the training period is one epoch for all pretrained models. We implement the code using Python 3.7 and PyTorch 1.12.0 and run the experiments on a single 32G NVIDIA V100 GPU.
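The Spearman correlation used for checkpoint selection is the Pearson correlation of the ranks of the two score lists. The following is a minimal pure-Python sketch with made-up gold and predicted similarity scores (hypothetical values, not taken from STS-B). In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold similarity labels and predicted cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
pred = [0.10, 0.35, 0.30, 0.80, 0.90]

score = spearman(gold, pred)
# A monotone rescaling of the predictions leaves the ranking, and thus the score, unchanged.
rescaled_score = spearman(gold, [p ** 3 for p in pred])
print(score, rescaled_score)
```

Only one pair of items is out of order in this toy example, so the score stays high. The invariance under monotone rescaling illustrates why the metric reflects relative ranking rather than absolute similarity values.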

Table 2: The parameters corresponding to the best results on the STS tasks, which are also used for the reported results and the plotted figures in this paper. Columns: Method, Parameter.

A.3 PARAMETER SETTING

For all loss functions, we perform a grid search on learning rate = {7e-6, 1e-5, 3e-5, 5e-5} and batch size = {64, 128, 256, 512}. For the other hyperparameters of each optimization objective, we first narrow the interval with an extensive search, then conduct the grid search in the following ranges:

• $L_{cl}$ and $L_{dcl}$: τ = {0.03, 0.05, 0.07}
• $L_{a\&u}$: λ = {0.1, 0.3, 0.5, 0.7, 0.9}
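The shared part of this search space can be enumerated as a Cartesian product. The sketch below is illustrative: the dictionary keys are hypothetical names, and combining τ with every learning rate and batch size as a full product is an assumption about how the search was organized.

```python
from itertools import product

learning_rates = [7e-6, 1e-5, 3e-5, 5e-5]
batch_sizes = [64, 128, 256, 512]
taus = [0.03, 0.05, 0.07]              # temperature values for L_cl and L_dcl

# Grid shared by all loss functions.
shared_grid = list(product(learning_rates, batch_sizes))

# Assumed full product with the temperature for L_cl and L_dcl.
cl_grid = [
    {"learning_rate": lr, "batch_size": bs, "tau": tau}
    for lr, bs, tau in product(learning_rates, batch_sizes, taus)
]

print(len(shared_grid), len(cl_grid))
print(cl_grid[0])
```

Each configuration would then be trained for one epoch and compared on the STS-B development set, as described in the training details.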

