FEW-SHOT TEXT CLASSIFICATION WITH DUAL CONTRASTIVE CONSISTENCY TRAINING

Abstract

In this paper, we explore how to utilize a pre-trained language model to perform few-shot text classification, where only a few annotated examples are given for each class. Since fine-tuning a language model with the traditional cross-entropy loss in this scenario causes serious overfitting and leads to sub-optimal generalization, we adopt supervised contrastive learning on the few labeled data and consistency regularization on vast unlabeled data. Moreover, we propose a novel contrastive consistency to further boost model performance and refine sentence representations. After conducting extensive experiments on four datasets, we demonstrate that our model (FTCC) outperforms state-of-the-art methods and has better robustness.

1. INTRODUCTION

Text classification is a fundamental task in natural language processing with various applications such as question answering (Rajpurkar et al., 2016), spam detection (Shahariar et al., 2019) and sentiment analysis (Chong et al., 2014). With the advancement of deep learning, fine-tuning pre-trained language models (Devlin et al., 2019; Liu et al., 2019) has achieved significant success. However, it still requires a large amount of labeled data to reach optimal generalization. Thus, researchers have gradually focused on semi-supervised text classification, where only a small amount of annotated data is provided. The success of semi-supervised methods results from the use of abundant unlabeled data: unlabeled documents in the training dataset provide natural consistency regularization by constraining the model predictions to be invariant to small noises in the text input (Xie et al., 2019; Miyato et al., 2016; Chen et al., 2020a). Despite mitigating the annotation burden, these methods are highly unstable across different runs and can still easily overfit the very limited labeled data.

Inspired by the success of supervised contrastive learning under few-shot settings (Gunel et al., 2021; Chen et al., 2022), we hypothesize that the learned contrastive representation in this scenario can help us tackle the aforementioned high-variance issue and impose additional constraints on the model. Label information and feature structure can simultaneously be propagated from labeled examples to unlabeled ones. Thus, we devise a novel contrastive consistency schema to further boost model performance. To validate the effectiveness and robustness of FTCC, we conduct extensive experiments on four datasets. The results confirm that FTCC can be leveraged to improve the performance of few-shot text classification.
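The consistency-regularization idea described above can be made concrete with a minimal sketch (an illustration only, not the exact objective of any of the cited methods; the function names are hypothetical): penalize the divergence between the model's predicted class distributions on an unlabeled input and on a noised version of it.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q):
    """KL(p || q); assumes q has strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(logits_original, logits_augmented):
    """Penalty for the model changing its prediction under small input noise.

    The two logit vectors would come from the same classifier applied to an
    unlabeled example and to an augmented (noised) view of that example.
    """
    return kl_divergence(softmax(logits_original), softmax(logits_augmented))
```

If the model's predictions on the two views agree exactly the loss is zero, and it grows as the augmented view's prediction drifts away, which is the invariance constraint the semi-supervised methods above exploit.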
Based on this motivation, the contributions of this paper are as follows:

• We integrate a supervised contrastive learning objective into a consistency-regularized semi-supervised framework to perform text classification under the few-shot scenario.

• We devise a novel contrastive consistency schema to propagate feature structure from labeled data to unlabeled data dynamically.

• We demonstrate our model's superiority over state-of-the-art semi-supervised methods, analyze the contribution of each component of FTCC through an ablation study, and visualize the learned instance representations, showing the necessity of each loss and the advantage of FTCC in representation learning over BERT fine-tuning with cross-entropy.

3.1. PROBLEM FORMULATION

The task of our model is to train a text classifier that efficiently utilizes both annotated and unannotated data in semi-supervised learning. We denote D_l = {(x_i^l, y_i), i = 1, . . . , n} as the labeled training data and D_u = {x_i^u, i = 1, . . . , m} as the unlabeled training data, where n is the number of labeled examples, m is the number of unlabeled examples, and n ≪ m.
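A few-shot split of this kind can be sketched as follows (a hypothetical helper for illustration, not part of the paper's pipeline): keep k labeled examples per class as D_l and strip the labels from the rest to form D_u.

```python
from collections import defaultdict


def few_shot_split(examples, k):
    """Split (text, label) pairs into a small labeled set and a large unlabeled set.

    Keeps the first k examples of each class as labeled data D_l; the labels
    of all remaining examples are discarded to form the unlabeled data D_u.
    """
    per_class = defaultdict(list)
    for text, label in examples:
        per_class[label].append((text, label))

    labeled, unlabeled = [], []
    for label, items in per_class.items():
        labeled.extend(items[:k])                     # D_l: k pairs per class
        unlabeled.extend(text for text, _ in items[k:])  # D_u: texts only
    return labeled, unlabeled
```

With k small and the corpus large, this reproduces the n ≪ m regime assumed above.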

3.2. SUPERVISED CONTRASTIVE LEARNING

Unlike traditional fine-tuning of pre-trained language models, we additionally include a supervised contrastive learning term to fully leverage the supervised signals (Gunel et al., 2021; Khosla et al., 2020). This mechanism takes the samples from the same class as positive samples and the samples from different classes as negative samples. The cross-entropy loss and the contrastive loss are defined as follows:

L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}(y_i \,\|\, p_\theta(y_i | x_i^l))   (1)

L_{scl} = -\sum_{i=1}^{N} \frac{1}{N_{y_i} - 1} \sum_{j=1}^{N} \mathbb{1}_{i \neq j} \mathbb{1}_{y_i = y_j} \log \frac{\exp(x_i^l \cdot x_j^l / \tau_{scl})}{\sum_{k=1}^{N} \mathbb{1}_{i \neq k} \exp(x_i^l \cdot x_k^l / \tau_{scl})}   (2)

where we work with a batch of training examples of size N; x ∈ R^d is the l2-normalized embedding of the [CLS] token from an encoder, used to represent an example; N_{y_i} is the total number of examples in the batch that have the same label as y_i; τ_scl > 0 is an adjustable scalar temperature; and the · symbol denotes the dot product.
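The supervised contrastive term L_scl can be sketched directly from the equation above (a minimal reference implementation over plain lists for clarity; a real training loop would compute this on GPU tensors over encoder embeddings):

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, matching the l2-normalized [CLS] embedding."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, tau=0.3):
    """Supervised contrastive loss over one batch.

    For each anchor i, every in-batch example j with the same label is a
    positive; the denominator ranges over all other examples in the batch.
    N_{y_i} - 1 is the number of positives for anchor i.
    """
    embs = [l2_normalize(e) for e in embeddings]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    n = len(embs)
    loss = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:  # anchors with no positive pair contribute nothing
            continue
        denom = sum(math.exp(dot(embs[i], embs[k]) / tau)
                    for k in range(n) if k != i)
        for j in positives:
            loss -= math.log(math.exp(dot(embs[i], embs[j]) / tau) / denom) \
                    / len(positives)
    return loss
```

As expected from the objective, a batch whose embeddings cluster by label yields a lower loss than the same embeddings with mismatched labels.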



Yan et al., 2021). The basic idea of unsupervised contrastive learning is that, after generating different views of the same example, the model adopts a loss function that pulls an anchor and a positive view closer and pushes the anchor away from other negative examples in the embedding space. Its effectiveness on downstream linear classification is justified through alignment and uniformity (Wang & Isola, 2020; Arora et al., 2019). Learning good generic representations is widely investigated. For example, in natural language processing, Fang & Xie (2020); Wang et al. (2021) adopt data augmentations such as word substitution, back translation, and word reordering to generate different views of samples. Gao et al. (2021b); Yan et al. (2021) adjust the dropout ratio to generate positive pairs. Subsequently, researchers proposed the supervised contrastive learning loss (Khosla et al., 2020), which enforces representations of examples from the same class to be similar and those from different classes to be distinct. Gunel et al. (2021) adopt this loss to fine-tune pre-trained language models, bringing a remarkable improvement in low-resource scenarios. Chen et al. (2022) introduce label-aware data augmentation and utilize supervised contrastive learning to simultaneously obtain discriminative feature representations of input examples and corresponding classifiers in the same space.
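The pull-closer/push-away mechanism described above can be sketched with an NT-Xent-style loss over paired views (a simplified illustration with plain lists, not the implementation of any cited method; `views_a[i]` and `views_b[i]` stand for two augmented views of the same example):

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit sphere before computing similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(views_a, views_b, tau=0.5):
    """Unsupervised contrastive loss over a batch of paired views.

    For each of the 2N views, the only positive is the other view of the
    same example; every remaining view in the batch acts as a negative.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    all_views = [l2_normalize(v) for v in views_a + views_b]
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the paired view of the same example
        denom = sum(math.exp(dot(all_views[i], all_views[k]) / tau)
                    for k in range(2 * n) if k != i)
        total -= math.log(math.exp(dot(all_views[i], all_views[pos]) / tau) / denom)
    return total / (2 * n)
```

Minimizing this loss aligns each anchor with its own augmented view while spreading it away from other examples, which is the alignment-and-uniformity behavior referenced above.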

