SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINED LANGUAGE MODEL FINE-TUNING

Abstract

State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architectures, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and that generalize better to related tasks with limited labeled data.

1. INTRODUCTION

State-of-the-art performance on most existing natural language processing (NLP) classification tasks is achieved by models that are first pre-trained on auxiliary language modeling tasks and then fine-tuned on the task of interest with cross-entropy loss (Radford et al., 2019; Howard & Ruder, 2018; Liu et al., 2019; Devlin et al., 2019). Although ubiquitous, the cross-entropy loss (the KL divergence between the one-hot label vectors and the distribution of the model's output logits) has several shortcomings. Cross-entropy loss leads to poor generalization performance (Liu et al., 2016; Cao et al., 2019), and it lacks robustness to noisy labels (Zhang & Sabuncu, 2018; Sukhbaatar et al., 2015) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). Effective alternatives have been proposed that modify the reference label distributions through label smoothing (Szegedy et al., 2016; Müller et al., 2019), Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), knowledge distillation (Hinton et al., 2015), or self-training (Yalniz et al., 2019; Xie et al., 2020). Fine-tuning with cross-entropy loss in NLP also tends to be unstable across different runs (Zhang et al., 2020; Dodge et al., 2020), especially when supervised data is limited, a scenario in which pre-training is particularly helpful. To tackle the issues of unstable fine-tuning and poor generalization, recent works propose local smoothness-inducing regularizers (Jiang et al., 2020) and regularization methods inspired by trust-region theory (Aghajanyan et al., 2020) to prevent representation collapse. Empirical evidence suggests that fine-tuning for more iterations, reinitializing the top few layers (Zhang et al., 2020), and using a debiased Adam optimizer during fine-tuning (Mosbach et al., 2020) can make the fine-tuning stage more stable.
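To make the cross-entropy/KL relationship above concrete, the following is a minimal NumPy sketch (illustrative only; function names and the toy logits are our own, not from the paper) showing that for a one-hot target the cross-entropy of the model's predicted distribution equals the KL divergence between the one-hot vector and that distribution, since the entropy of a one-hot target is zero:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # CE between a one-hot target at `label` and the predicted distribution.
    probs = softmax(logits)
    return -np.log(probs[label])

def kl_divergence(p, q):
    # KL(p || q), using the convention 0 * log(0/q) = 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Toy example (hypothetical values) with 3 classes.
logits = np.array([2.0, 0.5, -1.0])
label = 0
one_hot = np.eye(3)[label]

ce = cross_entropy(logits, label)
kl = kl_divergence(one_hot, softmax(logits))
# KL(p || q) = CE(p, q) - H(p), and H(p) = 0 for a one-hot p.
assert np.isclose(ce, kl)
```

This identity is why minimizing cross-entropy against hard labels drives the model's output distribution toward a degenerate one-hot target, which is one source of the overconfidence and poor calibration that label smoothing and related alternatives attempt to mitigate.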
Inspired by the learning strategy humans use when given only a few examples, we seek to find the commonalities among the examples of each class and contrast them with examples from other classes. We hypothesize that a similarity-based loss can home in on the important dimensions of the multidimensional hidden representations, and hence lead to better few-shot learning results and more stable fine-tuning of pre-trained language models. We propose a novel fine-tuning objective that includes a supervised contrastive learning (SCL) term, which pulls examples from the same class closer together and pushes examples from different classes further apart. The SCL

