SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINED LANGUAGE MODEL FINE-TUNING

Abstract

State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architecture, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and can generalize better to related tasks with limited labeled data.

1. INTRODUCTION

State-of-the-art performance on most existing natural language processing (NLP) classification tasks is achieved by models that are first pre-trained on auxiliary language modeling tasks and then fine-tuned on the task of interest with cross-entropy loss (Radford et al., 2019; Howard & Ruder, 2018; Liu et al., 2019; Devlin et al., 2019). Although ubiquitous, the cross-entropy loss, i.e., the KL-divergence between the one-hot label vectors and the model's output distribution, has several shortcomings. Cross-entropy loss can lead to poor generalization performance (Liu et al., 2016; Cao et al., 2019), and it lacks robustness to noisy labels (Zhang & Sabuncu, 2018; Sukhbaatar et al., 2015) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). Effective alternatives have been proposed that modify the reference label distributions through label smoothing (Szegedy et al., 2016; Müller et al., 2019), Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), knowledge distillation (Hinton et al., 2015), or self-training (Yalniz et al., 2019; Xie et al., 2020). Fine-tuning with cross-entropy loss in NLP also tends to be unstable across different runs (Zhang et al., 2020; Dodge et al., 2020), especially when supervised data is limited, a scenario in which pre-training is particularly helpful. To tackle the issues of unstable fine-tuning and poor generalization, recent works propose local smoothness-inducing regularizers (Jiang et al., 2020) and regularization methods inspired by trust region theory (Aghajanyan et al., 2020) to prevent representation collapse. Empirical evidence suggests that fine-tuning for more iterations, reinitializing the top few layers (Zhang et al., 2020), and using a debiased Adam optimizer during fine-tuning (Mosbach et al., 2020) can make the fine-tuning stage more stable.
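To make the baseline objective concrete, the cross-entropy loss discussed above reduces to the negative log-probability of the correct class when targets are one-hot (since the entropy of a one-hot target is zero, the KL-divergence equals the cross-entropy). A minimal numpy illustration (our own sketch, not code from the paper):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Cross-entropy between one-hot targets and softmax(logits).

    logits: (N, C) raw model outputs; labels: (N,) integer class indices.
    Equals the KL-divergence from the one-hot label distribution to the
    model's output distribution, since one-hot targets have zero entropy.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])
```

With uniform (all-zero) logits over C classes, the loss is log C, the expected value for an uninformed classifier.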
Inspired by the learning strategy that humans utilize when given a few examples, we seek to find the commonalities between the examples of each class and contrast them with examples from other classes. We hypothesize that a similarity-based loss will be able to home in on the important dimensions of the multidimensional hidden representations, and hence lead to better few-shot learning results and more stable fine-tuning of pre-trained language models. We propose a novel objective for fine-tuning that includes a supervised contrastive learning (SCL) term, which pushes examples from the same class close together and examples from different classes further apart. The SCL term is similar to the contrastive objectives used in self-supervised representation learning across image, speech, and video domains (Sohn, 2016; Oord et al., 2018; Wu et al., 2018; Bachman et al., 2019; Hénaff et al., 2019; Baevski et al., 2020; Conneau et al., 2020; Tian et al., 2020; Hjelm et al., 2019; Han et al., 2019; He et al., 2020; Misra & Maaten, 2020; Chen et al., 2020a; b). Unlike these methods, however, we use a contrastive objective for supervised learning of the final task, instead of contrasting different augmented views of examples. In few-shot learning settings (20, 100, 1000 labeled examples), the addition of the SCL term to the fine-tuning objective significantly improves the performance on several natural language understanding classification tasks from the popular GLUE benchmark (Wang et al., 2019) over the very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss only. Furthermore, pre-trained language models fine-tuned with our proposed objective are not only robust to noise in the fine-tuning training data, but also exhibit improved generalization to related tasks with limited labeled task data.
Our approach does not require any specialized network architectures (Bachman et al., 2019; Hénaff et al., 2019), memory banks (Wu et al., 2018; Tian et al., 2020; Misra & Maaten, 2020), data augmentation of any kind, or additional unsupervised data. To the best of our knowledge, our work is the first to successfully integrate a supervised contrastive learning objective for fine-tuning pre-trained language models. We empirically demonstrate that the new objective has desirable properties across several different settings. Our contributions in this work are as follows:

• We propose a novel objective for fine-tuning pre-trained language models that includes a supervised contrastive learning term, as described in Section 2.

• We obtain strong improvements in the few-shot learning settings (20, 100, 1000 labeled examples) as shown in Table 2, leading up to a 10.7-point improvement on a subset of GLUE benchmark tasks (SST-2, QNLI, MNLI) in the 20-labeled-example few-shot setting, over a very strong baseline: RoBERTa-Large fine-tuned with cross-entropy loss.

• We demonstrate that our proposed fine-tuning objective is more robust than RoBERTa-Large fine-tuned with cross-entropy loss across augmented noisy training datasets (used to fine-tune the models for the task of interest) with varying noise levels, as shown in Table 3, leading up to a 7-point improvement on a subset of GLUE benchmark tasks (SST-2, QNLI, MNLI). We use a backtranslation model to construct the augmented noisy training datasets of varying noise levels (controlled by the temperature parameter), as described in detail in Section 4.2.

• We show that task models fine-tuned with our proposed objective generalize better to related tasks despite limited availability of labeled task data (Table 7). This led to a 2.9-point improvement on Amazon-2 over the task model fine-tuned with cross-entropy loss only, and considerably reduced the variance across few-shot training samples when transferring from the source SST-2 sentiment analysis task model.

2. APPROACH

We propose a novel objective that includes a supervised contrastive learning term for fine-tuning pre-trained language models. The loss is meant to capture the similarities between examples of the same class and contrast them with the examples from other classes. For a multi-class classification problem with $C$ classes, we work with a batch of $N$ training examples $\{x_i, y_i\}_{i=1,\ldots,N}$. $\Phi(\cdot) \in \mathbb{R}^d$ denotes an encoder that outputs the $\ell_2$-normalized final encoder hidden layer before the softmax projection; $N_{y_i}$ is the total number of examples in the batch that have the same label as $y_i$; $\tau > 0$ is an adjustable scalar temperature parameter that controls the separation of classes; $y_{i,c}$ denotes the label and $\hat{y}_{i,c}$ denotes the model output for the probability of the $i$th example belonging to class $c$; $\lambda$ is a scalar weighting hyperparameter that we tune for each downstream task and setting. The overall loss is then given in the following:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{SCL}}$$

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$

$$\mathcal{L}_{\mathrm{SCL}} = \sum_{i=1}^{N} -\frac{1}{N_{y_i} - 1} \sum_{j=1}^{N} \mathbb{1}_{i \neq j}\, \mathbb{1}_{y_i = y_j} \log \frac{\exp\!\left(\Phi(x_i) \cdot \Phi(x_j) / \tau\right)}{\sum_{k=1}^{N} \mathbb{1}_{i \neq k} \exp\!\left(\Phi(x_i) \cdot \Phi(x_k) / \tau\right)}$$
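The combined objective can be sketched in numpy as follows. This is an illustrative reimplementation under our own assumptions (not the authors' code): `lam` and `tau` stand for the $\lambda$ and $\tau$ hyperparameters, `features` for the $\ell_2$-normalized encoder outputs $\Phi(x_i)$, and the loop form favors clarity over the vectorized batching a real fine-tuning run would use.

```python
import numpy as np

def combined_loss(features, labels, logits, lam=0.9, tau=0.3):
    """L = (1 - lam) * L_CE + lam * L_SCL.

    features: (N, d) l2-normalized encoder outputs Phi(x_i)
    labels:   (N,)   integer class labels y_i
    logits:   (N, C) model outputs before the softmax projection
    """
    N = features.shape[0]

    # Cross-entropy term over softmax probabilities
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(N), labels])

    # Pairwise similarities exp(Phi(x_i) . Phi(x_j) / tau)
    sim = np.exp(features @ features.T / tau)

    scl = 0.0
    for i in range(N):
        # Positives: other in-batch examples sharing the anchor's label
        positives = [j for j in range(N) if j != i and labels[j] == labels[i]]
        if not positives:  # no positive pair for this anchor
            continue
        denom = sim[i].sum() - sim[i, i]  # sum over all k != i
        # len(positives) == N_{y_i} - 1, the normalizer in the SCL term
        scl -= sum(np.log(sim[i, j] / denom) for j in positives) / len(positives)

    return (1 - lam) * ce + lam * scl
```

Setting `lam=0` recovers plain cross-entropy fine-tuning; larger `lam` weights the contrastive pull between same-class representations more heavily, with `tau` controlling how sharply the separation between classes is enforced.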