SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINED LANGUAGE MODEL FINE-TUNING

Abstract

State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architectures, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and that generalize better to related tasks with limited labeled data.

1. INTRODUCTION

State-of-the-art performance on most existing natural language processing (NLP) classification tasks is achieved by models that are first pre-trained on auxiliary language modeling tasks and then fine-tuned on the task of interest with cross-entropy loss (Radford et al., 2019; Howard & Ruder, 2018; Liu et al., 2019; Devlin et al., 2019). Although ubiquitous, the cross-entropy loss, i.e., the KL-divergence between the one-hot vectors of labels and the distribution of the model's output logits, has several shortcomings. Cross-entropy loss leads to poor generalization performance (Liu et al., 2016; Cao et al., 2019), and it lacks robustness to noisy labels (Zhang & Sabuncu, 2018; Sukhbaatar et al., 2015) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). Effective alternatives have been proposed that modify the reference label distributions through label smoothing (Szegedy et al., 2016; Müller et al., 2019), Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), knowledge distillation (Hinton et al., 2015), or self-training (Yalniz et al., 2019; Xie et al., 2020). Fine-tuning with cross-entropy loss in NLP also tends to be unstable across different runs (Zhang et al., 2020; Dodge et al., 2020), especially when supervised data is limited, a scenario in which pre-training is particularly helpful. To tackle unstable fine-tuning and poor generalization, recent works propose local smoothness-inducing regularizers (Jiang et al., 2020) and regularization methods inspired by trust-region theory (Aghajanyan et al., 2020) to prevent representation collapse. Empirical evidence suggests that fine-tuning for more iterations, reinitializing the top few layers (Zhang et al., 2020), and using a debiased Adam optimizer during fine-tuning (Mosbach et al., 2020) can make the fine-tuning stage more stable.
Inspired by the learning strategy that humans utilize when given a few examples, we seek to find the commonalities between the examples of each class and contrast them with examples from other classes. We hypothesize that a similarity-based loss will be able to hone in on the important dimensions of the multidimensional hidden representations, and hence lead to better few-shot learning results and more stable fine-tuning of pre-trained language models. We propose a novel objective for fine-tuning that includes a supervised contrastive learning (SCL) term, which pushes examples from the same class close together and examples from different classes further apart. The SCL term is similar to the contrastive objectives used in self-supervised representation learning across image, speech, and video domains (Sohn, 2016; Oord et al., 2018; Wu et al., 2018; Bachman et al., 2019; Hénaff et al., 2019; Baevski et al., 2020; Conneau et al., 2020; Tian et al., 2020; Hjelm et al., 2019; Han et al., 2019; He et al., 2020; Misra & Maaten, 2020; Chen et al., 2020a;b). Unlike these methods, however, we use a contrastive objective for supervised learning of the final task, instead of contrasting different augmented views of examples. In few-shot learning settings (20, 100, 1000 labeled examples), the addition of the SCL term to the fine-tuning objective significantly improves performance on several natural language understanding classification tasks from the popular GLUE benchmark (Wang et al., 2019) over the very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss only. Furthermore, pre-trained language models fine-tuned with our proposed objective are not only robust to noise in the fine-tuning training data, but also exhibit improved generalization to related tasks with limited labeled task data.
Our approach does not require any specialized network architectures (Bachman et al., 2019; Hénaff et al., 2019), memory banks (Wu et al., 2018; Tian et al., 2020; Misra & Maaten, 2020), data augmentation of any kind, or additional unsupervised data. To the best of our knowledge, our work is the first to successfully integrate a supervised contrastive learning objective for fine-tuning pre-trained language models. We empirically demonstrate that the new objective has desirable properties across several different settings. Our contributions in this work are as follows:

• We propose a novel objective for fine-tuning pre-trained language models that includes a supervised contrastive learning term, as described in Section 2.

• We obtain strong improvements in the few-shot learning settings (20, 100, 1000 labeled examples), as shown in Table 2, leading to up to 10.7 points of improvement on a subset of GLUE benchmark tasks (SST-2, QNLI, MNLI) in the 20-labeled-example few-shot setting, over a very strong baseline: RoBERTa-Large fine-tuned with cross-entropy loss.

• We demonstrate that our proposed fine-tuning objective is more robust than RoBERTa-Large fine-tuned with cross-entropy loss across augmented noisy training datasets (used to fine-tune the models for the task of interest) with varying noise levels, as shown in Table 3, leading to up to 7 points of improvement on a subset of GLUE benchmark tasks (SST-2, QNLI, MNLI) across augmented noisy training datasets. We use a back-translation model to construct the augmented noisy training datasets of varying noise levels (controlled by the temperature parameter), as described in detail in Section 4.2.

• We show that task models fine-tuned with our proposed objective generalize better to related tasks despite limited availability of labeled task data (Table 7). This leads to a 2.9 point improvement on Amazon-2 over the task model fine-tuned with cross-entropy loss only, and considerably reduces the variance across few-shot training samples when transferring from the source SST-2 sentiment analysis task model.

2. APPROACH

We propose a novel objective that includes a supervised contrastive learning term for fine-tuning pre-trained language models. The loss is meant to capture the similarities between examples of the same class and contrast them with the examples from other classes. For a multi-class classification problem with C classes, we work with a batch of N training examples {x_i, y_i}_{i=1,...,N}. Φ(·) ∈ R^d denotes an encoder that outputs the l_2-normalized final encoder hidden layer before the softmax projection; N_{y_i} is the total number of examples in the batch that have the same label as y_i; τ > 0 is an adjustable scalar temperature parameter that controls the separation of classes; y_{i,c} denotes the label and ŷ_{i,c} the model's output probability of the i-th example belonging to class c; λ is a scalar weighting hyperparameter that we tune for each downstream task and setting. The overall loss is then:

L = (1 − λ) L_CE + λ L_SCL   (1)

L_CE = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} · log ŷ_{i,c}   (2)

L_SCL = Σ_{i=1}^{N} −(1/(N_{y_i} − 1)) Σ_{j=1}^{N} 1_{i≠j} 1_{y_i=y_j} log [ exp(Φ(x_i) · Φ(x_j)/τ) / Σ_{k=1}^{N} 1_{i≠k} exp(Φ(x_i) · Φ(x_k)/τ) ]   (3)

The overall loss is a weighted average of the CE and the proposed SCL loss, as given in equation (1). The canonical definition of the multi-class CE loss that we use is given in equation (2). The novel SCL loss is given in equation (3). This loss can be applied using a variety of encoders Φ(·) ∈ R^d, for example a ResNet for a computer vision application or a pre-trained language model such as BERT for an NLP application. In this work, we focus on fine-tuning pre-trained language models for single-sentence and sentence-pair classification settings. For single-sentence classification, each example x_i consists of a sequence of tokens prepended with the special [CLS] token: x_i = [[CLS], t_1, t_2, ..., t_L, [EOS]]. The length of the sequence L is constrained such that L < L_max.
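To make the objective concrete, the following is a minimal PyTorch sketch of equations (1)-(3). The function names and batch layout are ours, not from any released code; it assumes Φ(x_i) is an l_2-normalized embedding.

```python
import torch
import torch.nn.functional as F

def scl_loss(embeddings, labels, tau=0.3):
    """Supervised contrastive loss, Eq. (3).

    embeddings: (N, d) L2-normalized representations Phi(x_i).
    labels:     (N,) integer class labels y_i.
    tau:        temperature controlling class separation.
    """
    n = embeddings.size(0)
    # Pairwise similarities Phi(x_i) . Phi(x_k) / tau
    sim = embeddings @ embeddings.t() / tau
    # Mask self-similarity: the 1[i != k] indicator in the denominator
    self_mask = torch.eye(n, dtype=torch.bool, device=embeddings.device)
    sim.masked_fill_(self_mask, float('-inf'))
    # log softmax over the batch: sim_ij - log sum_k exp(sim_ik)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: same label, i != j
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1)  # N_{y_i} - 1
    valid = n_pos > 0            # skip anchors with no positive in the batch
    loss_i = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid]
               / n_pos[valid])
    return loss_i.sum()

def combined_loss(logits, embeddings, labels, lam=0.9, tau=0.3):
    """Overall objective, Eq. (1): (1 - lambda) L_CE + lambda L_SCL."""
    ce = F.cross_entropy(logits, labels)
    scl = scl_loss(F.normalize(embeddings, dim=1), labels, tau)
    return (1 - lam) * ce + lam * scl
```

The defaults τ = 0.3 and λ = 0.9 follow the hyperparameter combination reported to work best in Section 4.1.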
Similarly, for sentence-pair classification tasks, each example x_i is a concatenation of two sequences of tokens [t_1, t_2, ..., t_L] and [s_1, s_2, ..., s_M] corresponding to the sentences, with special tokens delimiting them: x_i = [[CLS], t_1, t_2, ..., t_L, [SEP], s_1, s_2, ..., s_M, [EOS]]. The length of the concatenated sequences is constrained such that L + M < L_max. In both cases, Φ(x_i) ∈ R^d uses the embedding of the [CLS] token as the representation for example x_i. These choices follow standard practices for fine-tuning pre-trained language models for classification (Devlin et al., 2019; Liu et al., 2019).

Empirical observations show that both l_2 normalization of the encoded embedding representations and an adjustable scalar temperature parameter τ improve performance. A lower temperature increases the influence of examples that are harder to separate, effectively creating harder negatives. Using hard negatives has previously been shown to improve performance in the context of margin-based loss formulations such as the triplet loss (Schroff et al., 2015). The empirical behavior of the adjustable temperature parameter is consistent with the observations of previous work related to supervised contrastive learning (Chen et al., 2020a; Khosla et al., 2020).

Relationship to Self-Supervised Contrastive Learning. Self-supervised contrastive learning has shown success in learning powerful representations, particularly in the computer vision domain (Chen et al., 2020a; He et al., 2020; Tian et al., 2020; Mnih & Kavukcuoglu, 2013; Gutmann & Hyvärinen, 2012; Kolesnikov et al., 2019). Self-supervised learning methods do not require any labeled data; instead, they sample a mini-batch from unsupervised data and create positive and negative examples from these samples using strong data augmentation techniques such as AutoAugment (Cubuk et al., 2019) or RandAugment (Cubuk et al., 2020) for computer vision. Positive examples are constructed by applying data augmentation to the same example (cropping, flipping, etc. for an image), and negative examples are simply all the other examples in the sampled mini-batch. Intuitively, self-supervised contrastive objectives learn representations that are invariant to different views of positive pairs, while maximizing the distance between negative pairs.
The distance metric used is often the inner product or the Euclidean distance between vector representations of the examples. For a batch of size N, the self-supervised contrastive loss is defined as:

L_self = − Σ_{i=1}^{2N} log [ exp(Φ(x_{2i−1}) · Φ(x_{2i})/τ) / Σ_{k=1}^{2N} 1_{i≠k} exp(Φ(x_i) · Φ(x_k)/τ) ]

where Φ(·) ∈ R^d denotes an encoder that outputs the l_2-normalized final encoder hidden layer before the softmax projection, and τ > 0 is a scalar temperature parameter. A is defined as a data augmentation block that generates two randomly augmented examples, x_{2i} and x_{2i−1}, from each original example x_i: A({x_i, y_i}_{i=1,...,N}) = {x_i, y_i}_{i=1,...,2N}. As an example, A can be RandAugment for a computer vision application, or a back-translation model for an NLP application.
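A compact sketch of L_self, for contrast with the supervised loss; the convention that consecutive rows hold the two augmented views of one original example is our assumption, not something the text fixes.

```python
import torch

def self_contrastive_loss(z, tau=0.5):
    """Self-supervised contrastive loss over 2N augmented views.

    z: (2N, d) L2-normalized embeddings, ordered so that rows (0, 1),
       (2, 3), ... are the two augmented views of the same original example.
    """
    two_n = z.size(0)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float('-inf'))  # the 1[i != k] indicator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Each row's positive partner: 0 <-> 1, 2 <-> 3, ... (XOR with 1)
    partner = torch.arange(two_n) ^ 1
    return -log_prob[torch.arange(two_n), partner].sum()
```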

3. RELATED WORK

Traditional Machine Learning and Theoretical Understanding. Several works have analyzed the shortcomings of the widely adopted cross-entropy loss, demonstrating that it leads to poor generalization performance due to poor margins (Liu et al., 2016; Cao et al., 2019) and lack of robustness to noisy labels (Zhang & Sabuncu, 2018; Sukhbaatar et al., 2015) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). On the other hand, a body of work has explored the performance difference between classifiers trained with discriminative losses (i.e., optimizing for p(y|x), where y is the label and x is the input), such as cross-entropy, and generative losses (i.e., optimizing for p(x|y)). Ng & Jordan (2001) show that classifiers trained with generative losses can outperform their counterparts trained with discriminative losses in the context of Logistic Regression and Naive Bayes. Raina et al. (2003) show that a hybrid discriminative and generative objective outperforms both purely discriminative and purely generative approaches. In the context of contrastive learning, Saunshi et al. (2019) propose a theoretical framework for analyzing contrastive learning algorithms by hypothesizing that semantically similar points are sampled from the same latent class, which allows them to show formal guarantees on the quality of learned representations.

Contrastive Learning. There have been several recent investigations into the use of contrastive objectives for self-supervised, semi-supervised, and supervised learning methods, primarily in the computer vision domain. One such line of work reports results on the CIFAR datasets (Krizhevsky, 2009), outperforming state-of-the-art discriminative and generative classifiers. They also demonstrate improved performance for WideResNet-28-10 on robustness, out-of-distribution detection, and calibration, compared to other state-of-the-art generative and hybrid models.
Finally, Fang & Xie (2020) propose pre-training language models using a self-supervised contrastive learning objective at the sentence level, with back-translation as the augmentation method, followed by fine-tuning by predicting whether two augmented sentences originate from the same sentence, demonstrating improvements over fine-tuning BERT on a subset of GLUE benchmark tasks.

Stability and Robustness of Fine-tuning Pre-trained Language Models. There has been recent work analyzing the stability and robustness of fine-tuning pre-trained language models, since such models have been shown to overfit to the labeled task data during fine-tuning and hence fail to generalize to unseen data when labeled task data is limited (Aghajanyan et al., 2020). To improve generalization performance, Jiang et al. (2020) propose a local smoothness-inducing regularizer to manage the complexity of the model and a Bregman proximal point optimization method, an instance of trust-region methods, to prevent aggressive updating of the model during fine-tuning. They show state-of-the-art performance on the GLUE, SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), and ANLI (Nie et al., 2020) natural language understanding benchmarks. Similarly, Aghajanyan et al. (2020) propose a regularized fine-tuning procedure inspired by trust-region theory that replaces adversarial objectives with parametric noise sampled from a normal or uniform distribution in order to prevent representation collapse during fine-tuning, improving generalization without hurting performance. They show improved performance on a range of natural language understanding and generation tasks including DailyMail/CNN (Hermann et al., 2015), Gigaword (Napoles et al., 2012), Reddit TIFU (Kim et al., 2019), and the GLUE benchmark.
There has also been empirical analysis suggesting that fine-tuning for more epochs, reinitializing the top few layers (Zhang et al., 2020) instead of only the classification head, and using a debiased Adam optimizer instead of BERTAdam (Devlin et al., 2019) during fine-tuning (Mosbach et al., 2020) can make the fine-tuning procedure more stable across different runs.

4.1. DATASETS AND TRAINING DETAILS

We use datasets from the GLUE natural language understanding benchmark (Wang et al., 2019) for evaluation. We include both single-sentence classification tasks and sentence-pair classification tasks to test whether our hypothesis is generally applicable across tasks. We summarize each dataset based on its main task, domain, number of training examples, and number of classes in Table 1. In our few-shot learning experiments, we sample half of the original validation set of the GLUE benchmark and use it as our test set, and sample ∼500 examples for our validation set from the original GLUE validation set, in both cases taking the label distribution of the original validation set into account. For each task, we want the validation set to be small enough to avoid easy overfitting on the validation set, and large enough to avoid high variance when early stopping at various epochs in the few-shot learning experiments. For full-dataset experiments, such as the ones shown in Table 5, Table 6, Table 8, and Table 9, we sample a validation set from the original training set of the GLUE benchmark based on the size of the original GLUE validation set, and report our test results on the original GLUE validation set. We run each experiment with 10 different seeds, and report the average test accuracy and standard deviation, along with p-values with respect to the baseline. We pick the best hyperparameter combination based on the average validation accuracy across the 10 seeds. For few-shot learning experiments, such as the ones shown in Table 2, Table 3, and Table 10, we sample 10 different training sets of the specified total size N from the original training set of the GLUE benchmark, taking the label distribution of the original training set into account. We report the average and standard deviation of the test accuracies of the top 3 models, ranked by validation accuracy, out of the 10 random training set samples.
The best hyperparameter combination is picked based on the average validation accuracy of the top 3 models. We focus on the top 3 models in this setting to reduce the variance across training set samples. We use the fairseq library (Ott et al., 2019) and the open-source RoBERTa-Large model for all of our experiments. During all fine-tuning runs, we use the Adam optimizer with a learning rate of 1e-5, a batch size of 16 (unless specified otherwise), and a dropout rate of 0.1. For each experiment that includes the SCL term, we conduct a grid-based hyperparameter sweep over λ ∈ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0} and τ ∈ {0.1, 0.3, 0.5, 0.7}. We observe that the models with the best test accuracies across all experimental settings overwhelmingly use the hyperparameter combination τ = 0.3 and λ = 0.9.
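The label-distribution-aware sampling used to build the few-shot training sets can be sketched as follows. This is a simplified version under our own rounding convention; the text does not specify the exact procedure.

```python
import random
from collections import defaultdict

def stratified_few_shot_sample(examples, labels, n_total, seed=0):
    """Sample n_total (example, label) pairs, approximately preserving
    the label distribution of the full training set.

    Assumes n_total is small relative to every class size, and that the
    rounding remainder does not exceed the number of classes.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    n = len(examples)
    # Proportional quota per class, at least one example each
    quota = {y: max(1, round(n_total * len(xs) / n))
             for y, xs in by_label.items()}
    # Fix rounding so quotas sum exactly to n_total
    diff = n_total - sum(quota.values())
    for y in sorted(by_label, key=lambda y: -len(by_label[y])):
        if diff == 0:
            break
        step = 1 if diff > 0 else -1
        quota[y] += step
        diff -= step
    sample = []
    for y, k in quota.items():
        sample.extend((x, y) for x in rng.sample(by_label[y], k))
    rng.shuffle(sample)
    return sample
```

Drawing 10 such samples with different seeds reproduces the "10 random training set samples" protocol described above.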

4.2. CONSTRUCTING AUGMENTED NOISY TRAINING DATASETS

Machine learning researchers and practitioners often do not know how noisy their datasets are, as input examples might be corrupted or the ground-truth labeling might not be perfect. It is therefore preferable to use robust training objectives that can extract more information from datasets of varying noise levels, even when there is a limited amount of labeled data. We construct augmented noisy training datasets (used to fine-tune the pre-trained language models for the task of interest) of different noise levels using a back-translation model (Edunov et al., 2018), where we increase the temperature parameter to create noisier examples. Back-translation refers to the procedure of translating an example in language A into language B and then translating it back into language A. It is a commonly used data augmentation procedure for NLP applications, as the new examples obtained through back-translation provide a targeted inductive bias to the model while preserving the meaning of the original example. Specifically, we use WMT'18 English-German and German-English translation models, use random sampling to get more diverse examples, and employ an augmentation ratio of 1:3 for supervised examples to augmented examples. We observe that employing random sampling with a tunable temperature parameter is critical to getting diverse paraphrases of the supervised examples, consistent with previous work (Edunov et al., 2018; Xie et al., 2019), since the commonly used beam search results in very regular sentences that do not add diversity to the existing data distribution. We keep the validation and test sets the same as in the experiments shown in Table 2.
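The noise knob here is temperature-scaled sampling from the translation model's output distribution at each decoding step. A sketch of the mechanism; `en_de` and `de_en` are hypothetical stand-ins for the WMT'18 translation models, not a real API.

```python
import torch

def sample_with_temperature(logits, temperature):
    """Draw the next token id from softmax(logits / T).

    Higher T flattens the distribution, yielding more diverse (noisier)
    outputs; as T -> 0 this approaches greedy decoding, whose outputs,
    like beam search, are very regular.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

def back_translate(sentence, en_de, de_en, temperature):
    """Hypothetical wrapper: translate en -> de -> en, sampling each
    translation with the given temperature."""
    german = en_de(sentence, temperature=temperature)
    return de_en(german, temperature=temperature)
```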

5.1. GLUE BENCHMARK FEW-SHOT LEARNING RESULTS

We proposed adding the SCL term inspired by the learning strategy humans use when given a few examples. In Table 2, we report our few-shot learning results on SST-2, QNLI, and MNLI from the GLUE benchmark with 20, 100, and 1000 labeled training examples. Details of the experimental setup are explained in Section 4. We use a very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss. We observe that the SCL term improves performance over the baseline significantly across all datasets and data regimes, leading to a 10.7 point improvement on QNLI, a 3.4 point improvement on MNLI, and a 2.2 point improvement on SST-2 when we have 20 labeled examples for fine-tuning. This shows that our proposed objective is effective both for binary single-sentence classification tasks such as sentiment analysis and for sentence-pair classification tasks such as textual entailment and paraphrasing, when only a few labeled examples are given for the task. As we increase the number of labeled examples, the performance improvement over the baseline decreases, leading to a 1.9 point improvement on MNLI for 100 examples and a 0.6 point improvement on QNLI for 1000 examples. We also note that the improvements over the baseline when N=1000 on both SST-2 and MNLI are not statistically significant. In addition, we conduct an ablation study investigating the importance of l_2 normalization and temperature scaling, where we replace the SCL loss with a CE loss but keep the l_2 normalization and temperature scaling, as shown in Table 10 in the Appendix under the method name CE+CE.

5.2. ROBUSTNESS ACROSS AUGMENTED NOISY TRAINING DATASETS

In Table 3, we report our results on augmented noisy training sets with varying levels of noise. We have 100 labeled examples for fine-tuning on each task, and we augment the training sets with noisy examples using a back-translation model, as described in detail in Section 4.2. Note that we use the back-translation model to simulate training datasets of varying noise levels, not as a method to boost model performance. The experimental setup follows the few-shot learning setup described in Section 4. T is the temperature of the back-translation model used to augment the training sets; higher temperature corresponds to more noise in the augmented training set. We observe consistent improvements over the RoBERTa-Large baseline with our proposed objective across all datasets and all noise levels, with a 0.4 point improvement on SST-2, a 2.5 point improvement on QNLI, and a 7 point improvement on MNLI on average across the augmented training sets. The improvement is particularly significant for inference tasks (QNLI, MNLI) when the noise levels are higher (higher temperature), leading to a 7.7 point improvement on MNLI when T=0.7, and a 4.2 point improvement on QNLI when T=0.9. We show samples of the augmented examples used in this robustness experiment in Table 4. For T=0.3, examples mostly stay the same, with minor changes in phrasing, while for T=0.9, some grammatical mistakes and factual errors are introduced.

Table 4: Sample of augmented examples with different noise levels for the robustness experiment shown in Table 3. Higher temperature (T) corresponds to more noise in the augmented training set.

MNLI Original: However, the link did not transfer the user to a comment box particular to the rule at issue.
MNLI Augmented (T=0.3): However, the link did not send the user to a comment field specifically for the rule.

MNLI Original: Tenants could not enter the apartment complex due to a dangerous chemical spill.
MNLI Augmented (T=0.9): Tenants were banned from entering the medical property because of a blood positive substance.

5.3. GLUE BENCHMARK FULL DATASET RESULTS

In Table 5, we report results using our proposed objective on six downstream tasks from the GLUE benchmark. We use a very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss, which is currently the standard practice for state-of-the-art NLP classification models. Details of the experimental setup are explained in Section 4. We observe that adding the SCL term to the objective improves performance over the RoBERTa-Large baseline, leading to a 3.1 point improvement on MRPC, a 3.5 point improvement on QNLI, and an average improvement of 1.2 points across all 6 datasets. We conduct these experiments to investigate the effect of the SCL term in high-data regimes, given that it is effective in few-shot learning settings. We acknowledge that only the MRPC and QNLI results are statistically significant, and we report the results on the other datasets as a finding for the sake of completeness. We hypothesize that larger batch sizes lead to better performance, but we leave that for future work as it requires additional engineering effort. We show evidence for this hypothesis in the ablation studies in Table 6, where we conduct the full-dataset experiments for CE+SCL with the same experimental setup described here for Table 5 on SST-2, CoLA, QNLI, and MNLI, for batch sizes 16, 64, and 256, using RoBERTa-Base. We observe that as we increase the batch size, performance improves significantly across all datasets. Specifically, we observe a 0.3 point improvement on SST-2, a 0.8 point improvement on CoLA, a 0.4 point improvement on QNLI, and a 1.3 point improvement on MNLI when we increase the batch size from 16 to 256 for CE+SCL. We also investigate the effect of the SCL term on overall training speed, which we measure with the average-updates-per-second metric, shown in Table 6.
For batch size 16, the batch size we use throughout the paper across all experimental settings, the effect of SCL is negligible, decreasing average updates per second from 15.9 to 15.08. As we increase the batch size, the effect of SCL on training speed becomes more significant, decreasing average updates per second from 2.46 to 1.54 for batch size 256. In addition, we conduct an ablation study investigating the importance of l_2 normalization and temperature scaling, where we replace the SCL loss with a CE loss but keep the normalization and scaling (denoted as CE+CE), both for full-dataset results in Table 8 and for the batch size ablation in Table 9 in the Appendix.

5.4. GENERALIZATION ABILITY OF TASK MODELS

In this experiment, we first fine-tune RoBERTa-Large on SST-2 using its full training set and obtain a task model with and without the SCL term. We then transfer this task model to two related single-sentence binary sentiment classification tasks, Amazon-2 and Yelp-2 (Zhang et al., 2015). For both, we sample 20 labeled examples per class, and follow the few-shot learning experimental setup described in Section 4. In Table 7, we demonstrate that using the SCL term for both the source (SST-2) and target (Amazon-2, Yelp-2) domains leads to better generalization ability, with a 2.9 point improvement on Amazon-2 and a 0.4 point improvement on Yelp-2, along with a significant reduction in variance across training set samples.

6. CONCLUSION

We propose a supervised contrastive learning objective for fine-tuning pre-trained language models and demonstrate significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in the few-shot learning settings. We also show that our proposed objective leads to models that are more robust to different levels of noise in the training data and can generalize better to related tasks with limited labeled task data. Currently, data augmentation methods in NLP and their effects on the downstream tasks are neither as effective nor as well understood as their counterparts in the computer vision domain. In future work, we plan to study principled and automated data augmentation techniques for NLP that would allow extending our supervised contrastive learning objective to both semi-supervised and self-supervised learning settings. 



Figure 1: Our proposed objective includes a cross-entropy term (CE) and a supervised contrastive learning (SCL) term, and it is formulated to push examples from the same class close and examples from different classes further apart. We show examples from the SST-2 sentiment analysis dataset from the GLUE benchmark, where class A (shown in red) is negative movie reviews and class B (shown in blue) is positive movie reviews. Although we show a binary classification case for simplicity, the loss is generally applicable to any multi-class classification setting.

Table 2: Few-shot learning test results on the GLUE benchmark, where we have N=20, 100, 1000 labeled examples for training. Reported results are the mean and standard deviation of the test accuracies of the top 3 models (ranked by validation accuracy) out of 10 random training set samples, along with p-values for each experiment.

Figure 2: tSNE plots of the learned CLS embeddings on the SST-2 test set in the few-shot learning setting of having 20 labeled examples to fine-tune on -comparing RoBERTa-Large fine-tuned with CE only (left) and with our proposed objective CE+SCL (right) for the SST-2 sentiment analysis task. Blue: positive examples; red: negative examples.

Table 1: GLUE benchmark datasets used for evaluation.

In Figure 2, we show tSNE plots of the learned representations of the [CLS] embeddings on the SST-2 test set when RoBERTa-Large is fine-tuned with 20 labeled examples, comparing CE with and without the SCL term. We can clearly see that the SCL term enforces more compact clustering of examples with the same label, while the distribution of the embeddings learned with CE only is close to random. We include a more detailed comparison of CE and CE+SCL, showing the learned representations of examples as tSNE plots with 20 labeled examples, 100 labeled examples, and the full dataset for fine-tuning, respectively, in Figure 3 in the Appendix.
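Plots like Figure 2 can be produced by projecting the extracted [CLS] embeddings to 2-D with scikit-learn's TSNE; a minimal sketch, with a function name of our own choosing:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_of_cls(cls_embeddings, perplexity=5, seed=0):
    """Project [CLS] embeddings to 2-D for visualization.

    cls_embeddings: (n_examples, hidden_dim) array of encoder outputs,
    one row per test example. Perplexity must be < n_examples.
    """
    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=seed, init="random")
    return tsne.fit_transform(np.asarray(cls_embeddings))
```

Coloring the resulting 2-D points by gold label (as in Figure 2, blue for positive, red for negative) makes the clustering effect of the SCL term visible.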

Table 3: Results on the GLUE benchmark for robustness across noisy augmented training sets. Average shows the average performance across augmented training sets.

Table 5: Test results on the validation set of the GLUE benchmark. We compare fine-tuning RoBERTa-Large with CE, with and without SCL. The best hyperparameter configuration is picked based on average validation accuracy. We report the average accuracy across 10 seeds for the model with the best hyperparameter configuration, its standard deviation, and p-values.

Table 6 (fragment): RoBERTa-Base, CE + SCL, batch size 256: 95.2±0.3, 84.5±0.5, 92.9±0.3, 86.6±0.6; Avg ups/sec 1.54.

Table 6: Ablation study on performance and training speed, reported as average updates per second (Avg ups/sec), for fine-tuning RoBERTa-Base with respect to batch size (Bsz).

Table 7: Generalization of the SST-2 task model (fine-tuned using the full training set) to related tasks (Amazon-2, Yelp-2), where there are 20 labeled examples for each class.

Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. Revisiting few-sample BERT fine-tuning. ArXiv, abs/2006.05987, 2020.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NeurIPS, 2015.

Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, 2018.

