FEW-SHOT TEXT CLASSIFICATION WITH DUAL CONTRASTIVE CONSISTENCY TRAINING

Abstract

In this paper, we explore how to utilize pre-trained language models to perform few-shot text classification, where only a few annotated examples are given for each class. Since fine-tuning a language model with the traditional cross-entropy loss in this scenario causes serious overfitting and leads to sub-optimal generalization, we adopt supervised contrastive learning on the few labeled data and consistency regularization on vast unlabeled data. Moreover, we propose a novel contrastive consistency scheme to further boost model performance and refine sentence representations. After conducting extensive experiments on four datasets, we demonstrate that our model (FTCC) outperforms state-of-the-art methods and exhibits better robustness.

1. INTRODUCTION

Text classification is a fundamental task in natural language processing with various applications such as question answering (Rajpurkar et al., 2016), spam detection (Shahariar et al., 2019) and sentiment analysis (Chong et al., 2014). With the advancement of deep learning, fine-tuning pre-trained language models (Devlin et al., 2019; Liu et al., 2019) has achieved significant success. However, it still requires a large amount of labeled data to reach optimal generalization. Thus, researchers have gradually focused on semi-supervised text classification, where only a small amount of annotated data is provided. The success of semi-supervised methods results from the usage of abundant unlabeled data: unlabeled documents in the training set provide natural consistency regularization by constraining the model predictions to be invariant to small noises in the text input (Xie et al., 2019; Miyato et al., 2016; Chen et al., 2020a). Despite mitigating the annotation burden, these methods are highly unstable across different runs and can still easily overfit the very limited labeled data. Inspired by the success of supervised contrastive learning in few-shot settings (Gunel et al., 2021; Chen et al., 2022), we hypothesize that the contrastive representation learned in this scenario can help us tackle the aforementioned high-variance issue and impose additional constraints on the model. Label information and feature structure can simultaneously be propagated from labeled examples to unlabeled ones. Thus, we devise a novel contrastive consistency scheme to further boost model performance. To validate the effectiveness and robustness of FTCC, we conduct extensive experiments on four datasets. The results confirm that FTCC improves the performance of few-shot text classification.
Based on this motivation, the contributions of this paper are as follows:
• We integrate a supervised contrastive learning objective into a consistency-regularized semi-supervised framework to perform text classification in the few-shot scenario.
• We devise a novel contrastive consistency scheme to propagate feature structure from labeled data to unlabeled data dynamically.
• We demonstrate our model's superiority over state-of-the-art semi-supervised methods, analyze the contribution of each component of FTCC through an ablation study, and visualize the learned instance representations, showing the necessity of each loss and the advantage of FTCC over BERT fine-tuning with cross-entropy in representation learning.

2. RELATED WORK

Contrastive learning has achieved great success in computer vision (He et al., 2019; Chen et al., 2020b) and natural language processing (Fang & Xie, 2020; Yan et al., 2021). The basic idea of unsupervised contrastive learning is that, after generating different views of the same example, the model adopts a loss function that pulls an anchor and a positive view closer and pushes the anchor away from other negative examples in the embedding space. Its effectiveness on downstream linear classification is justified through alignment and uniformity (Wang & Isola, 2020; Arora et al., 2019). Learning good generic representations is widely investigated. For example, in the natural language field, Fang & Xie (2020); Wang et al. (2021) adopt data augmentations such as word substitution, back translation and word reordering to generate different views of samples. Gao et al. (2021b); Yan et al. (2021) adjust the dropout ratio to generate positive pairs. Subsequently, researchers proposed the supervised contrastive learning loss (Khosla et al., 2020), which enforces representations of examples from the same class to be similar and those from different classes to be distinct. Gunel et al. (2021) adopts this loss to fine-tune pre-trained language models and brings astonishing improvement in low-resource scenarios. Chen et al. (2022) introduces label-aware data augmentation and utilizes supervised contrastive learning to simultaneously obtain discriminative feature representations of input examples and corresponding classifiers in the same space.

3.1. PROBLEM FORMULATION

The task of our model is to train a text classifier that efficiently utilizes both annotated and unannotated data in semi-supervised learning. We denote D_l = {(x_i^l, y_i), i = 1, ..., n} as the labeled training data and D_u = {x_i^u, i = 1, ..., m} as the unlabeled training data, where n is the number of labeled examples, m is the number of unlabeled examples, and n ≪ m.

3.2. SUPERVISED CONTRASTIVE LEARNING

Unlike traditional fine-tuning of pre-trained language models, we additionally include a supervised contrastive learning term to fully leverage the supervised signals (Gunel et al., 2021; Khosla et al., 2020). This mechanism takes samples from the same class as positive samples and samples from different classes as negative samples. The cross-entropy loss and contrastive loss are defined as follows:

L_ce = -(1/N) Σ_{i=1}^N CE(y_i ∥ p_θ(y_i | x_i^l))    (1)

L_scl = -Σ_{i=1}^N (1 / (N_{y_i} - 1)) Σ_{j=1}^N 1_{i≠j} 1_{y_i=y_j} log [ exp(x_i^l · x_j^l / τ_scl) / Σ_{k=1}^N 1_{i≠k} exp(x_i^l · x_k^l / τ_scl) ]    (2)

where we work with a batch of training examples of size N, and x ∈ R^d is the l2-normalized embedding of the [CLS] token from the encoder, representing an example. N_{y_i} is the total number of examples in the batch that have the same label as y_i; τ_scl > 0 is an adjustable scalar temperature; the · symbol denotes the dot product; y_i denotes the true label, and p_θ(y_i | x_i^l) denotes the classifier's output probability for the i-th example after softmax.
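The supervised contrastive term can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the default temperature are ours:

```python
import numpy as np

def supervised_contrastive_loss(z, labels, tau_scl=0.3):
    # z: (N, d) batch of [CLS] embeddings; labels: (N,) integer class ids.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # l2-normalize
    sim = z @ z.T / tau_scl                                # pairwise dot products
    n = len(z)
    np.fill_diagonal(sim, -np.inf)                         # 1_{i != k}: drop the anchor
    # log-softmax over the remaining N-1 in-batch examples
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # positives: same label as the anchor, excluding the anchor itself
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    n_pos = np.maximum(pos.sum(axis=1), 1)                 # N_{y_i} - 1
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / n_pos
    return float(per_anchor.mean())
```

Embeddings whose same-class members already cluster together yield a lower loss than randomly scattered ones, which is exactly the pressure the term exerts during fine-tuning.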

3.3. CONSISTENCY TRAINING

Back-translation (Sugiyama & Yoshinaga, 2019; Edunov et al., 2018) is a common data augmentation technique that can generate diverse paraphrases while preserving the semantics of the original sentences. We utilize back-translation to paraphrase the unlabeled data and obtain augmented noisy data. Following Xie et al. (2019), we minimize the Kullback-Leibler divergence between the augmented view and the original view of an example to encourage their consistency and enforce the smoothness of the model. FTCC adopts the following loss:

L_con = D_KL( p_θ̄(y | x^u) ∥ p_θ(y | a(x^u)) )    (3)

where a is the back-translation transformation, p_θ̄(y | x^u) is the original example's label distribution from the classifier, and p_θ(y | a(x^u)) is the back-translated example's label distribution. θ̄ denotes the stop gradient of θ, as suggested by VAT (Miyato et al., 2017).

We randomly sample a batch of N examples and perform back-translation on every example, resulting in 2N data points. Given one example and its corresponding back-translated example, we treat the other 2(N-1) examples as negatives. We denote the similarity distribution between the example x^u and the negatives n_i (i ∈ {1, ..., 2(N-1)}) in one batch as:

P(i) = exp(x^u · n_i / τ_cc) / Σ_{j=1}^{2(N-1)} exp(x^u · n_j / τ_cc)    (4)

Likewise, the similarity distribution between the corresponding augmented example a(x^u) and the negatives n_i is:

Q(i) = exp(a(x^u) · n_i / τ_cc) / Σ_{j=1}^{2(N-1)} exp(a(x^u) · n_j / τ_cc)    (5)

where τ_cc is a temperature hyperparameter different from τ_scl, and every embedding is l2-normalized. Using the KL divergence as the measure of disagreement, we impose consistency between the probability distributions P and Q:

L_cc = D_KL(P ∥ Q)    (6)
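A sketch of the consistency objective in Eq. (3), assuming a standard softmax classifier head (names are illustrative, not from the authors' code):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_orig, logits_aug, eps=1e-12):
    # KL(p_thetabar(y | x^u) || p_theta(y | a(x^u))), averaged over the batch.
    # In a real training loop, logits_orig would be computed under stop-gradient
    # so that only the augmented branch receives gradients.
    p = softmax(logits_orig)   # treated as a fixed target
    q = softmax(logits_aug)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl.mean())
```

When the two predictions agree, the loss is zero; any disagreement between the original and back-translated views is penalized, which is what enforces the smoothness of the classifier.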

3.5. OVERALL TRAINING OBJECTIVE

Above all, the overall training objective is:

L = L_ce + λ_1 L_scl + λ_2 L_con + α λ_3 L_cc    (7)

Motivated by Huang et al. (2022), we dynamically release the contrastive consistency signal. During the first half of the training epochs, α is set to 0 to avoid false feature-structure propagation. After half of the training epochs, the model becomes more stable at representing features, and we gradually increase α as the epoch grows: α = (2t - T) / T, where T is the total number of epochs and t is the current epoch number. Figure 1 shows an overview of our FTCC framework.
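The release schedule for α can be written as a small helper (the function name is ours):

```python
def cc_weight(t, T):
    # alpha = 0 during the first half of training, then a linear ramp
    # (2t - T) / T that reaches 1 at the final epoch t = T.
    return 0.0 if t < T / 2 else (2.0 * t - T) / T
```

For example, with T = 10 total epochs the weight stays at 0 through epoch 4, switches on at epoch 5, and grows linearly to 1 by epoch 10.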

4.1. SET UP

We conduct our experiments on the following four benchmark text classification datasets. SST2 (Socher et al., 2013) is a sentiment classification dataset of movie reviews. SUBJ (Pang & Lee, 2004) is a review dataset with sentences labelled as subjective or objective. PC (Ganapathibhotla & Liu, 2008) is a binary sentiment classification dataset that includes Pros and Cons data. IMDB (Maas et al., 2011) is a sentiment classification dataset of IMDB movie reviews. The dataset statistics are shown in Table 1. All datasets are in English. In the few-shot learning scenario, we set the number of labeled training examples to K = 10 per class. Following Gao et al. (2021a), we randomly split the training dataset into labeled data, unlabeled data and development data with three different samples. During splitting, we take the label distribution and the size of the dataset into account and do not use all unlabeled data. Three models are trained on these splits, and we report their average performance on the given test set. We choose German as the intermediate language to perform data augmentation with the Fairseq toolkit (Ott et al., 2019). We use the pre-trained language model BERT as our backbone to encode all examples and use the input format "[CLS] sentence [SEP]" for all models. For SST2, SUBJ and PC, the maximum lengths are set to the first 128, 128 and 64 tokens, respectively. For IMDB, we retain the last 256 tokens, as suggested by Xie et al. (2019). We use Adam (Kingma & Ba, 2014) as the optimizer with a linearly decaying schedule. For simplicity, we set λ_1, λ_2, λ_3 to 1. For all datasets, the batch size of labeled data is 8, and the batch sizes of unlabeled data are 32, 24, 32 and 32, respectively. Detailed hyperparameter settings can be found in Table 2. We run experiments on one NVIDIA RTX A6000 GPU.
Compared to UDA, FTCC does not rely on extra training heuristics (e.g., entropy minimization (Grandvalet & Bengio, 2004) or Training Signal Annealing) and instead uses supervised contrastive learning and contrastive consistency to prevent overfitting and achieve better and more stable performance. Compared to MixText, FTCC does not modify the pre-trained language model's hidden representations between layers but adopts a simple yet effective fine-tuning method, which demonstrates that learning quality feature representations is more effective and efficient in low-resource scenarios and that FTCC is more robust and capable of fully leveraging both labeled and unlabeled data.

5. CONCLUSION

In this paper, we propose FTCC, which effectively utilizes unlabeled data in the few-shot scenario and simultaneously enforces more compact clustering of sentence embeddings with similar semantic meanings. Moreover, we propose a contrastive consistency scheme that improves the quality of feature representations and the generalization ability of the model. Our model achieves competitive performance on four text classification datasets, outperforming previous methods. Since supervised contrastive fine-tuning of pre-trained language models is a simple but effective training technique, in the future we hope to incorporate this method into weakly-supervised or unsupervised NLP frameworks to maximally reduce human annotation cost.



3.4. CONTRASTIVE CONSISTENCY

Inspired by Xie et al. (2019); Wei et al. (2021), we conjecture that consistency regularization not only propagates label information from labeled data to unlabeled data, but also propagates feature structure in their latent space. Thus, we propose to encourage consistency between the feature patterns of original examples and their corresponding augmented examples, which further refines deep clustering and improves classification accuracy on augmented data.
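The contrastive consistency loss L_cc = KL(P ∥ Q) over the in-batch negatives can be sketched as follows (an illustration under our own naming; inputs are assumed to be l2-normalized):

```python
import numpy as np

def contrastive_consistency_loss(x, x_aug, negatives, tau_cc=0.5):
    # x, x_aug: (d,) original and back-translated views of one example;
    # negatives: (2(N-1), d) embeddings of the other examples in the batch.
    p_logits = negatives @ x / tau_cc           # similarities to the original view
    q_logits = negatives @ x_aug / tau_cc       # similarities to the augmented view
    P = np.exp(p_logits - p_logits.max()); P /= P.sum()   # Eq. P(i)
    Q = np.exp(q_logits - q_logits.max()); Q /= Q.sum()   # Eq. Q(i)
    return float((P * (np.log(P) - np.log(Q))).sum())     # KL(P || Q)
```

Intuitively, the two views of the same example should stand in the same geometric relation to every other point in the batch; the loss vanishes when their similarity profiles over the negatives coincide.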

Figure 1: Left: The overall framework of FTCC. FTCC utilizes limited labeled data to compute cross-entropy and supervised contrastive learning loss and performs augmentation on unlabeled data to compute consistency training and contrastive consistency loss. Right: The illustration of contrastive consistency. FTCC computes the distances between the original or augmented views and the negatives in one batch to enforce their consistency and enhance the representation ability of the model.

We compare FTCC to five baselines: (1) BERT-FT: fine-tuning the pre-trained BERT-base-uncased model on the labeled texts directly. (2) BERT-SCL (Gunel et al., 2021): fine-tuning BERT with cross-entropy loss and supervised contrastive loss. (3) DualCL (Chen et al., 2022): a dual contrastive learning framework that learns feature representations of input examples and corresponding classifiers in the same space; it classifies documents based on their similarity to the classifier representation. (4) UDA (Xie et al., 2019): Unsupervised Data Augmentation uses limited labeled examples for supervised training and encourages the model to make consistent predictions between unlabeled examples and their augmented counterparts. (5) MixText (Chen et al., 2020a): MixText creates multiple augmented training examples by interpolating text in hidden space and then performs consistency training. To make a fair comparison, we reproduce the results with the best hyperparameter configurations for all baselines.

Figure 2: t-SNE plots of learned sentence embeddings on four test datasets. Different colors denote different classes. In every pair, the left is cross-entropy, the right is FTCC.

Table 1: Statistics of the four text datasets. The unlabeled and development sets have an even label distribution.



Performance (test accuracy, %) comparison with baselines. We use 10 labeled examples for each class. The results are averaged over three random seeds to show significance (Dror et al.). ± denotes the standard error of the mean.

Ablation study of FTCC (test accuracy, %). See abbreviation meanings in the text. Columns follow the dataset order SST2, SUBJ, PC, IMDB.

            SST2          SUBJ          PC            IMDB
w/o SCL     73.51±9.26    87.07±1.16    90.13±2.50    86.64±4.37
w/o CC      85.62±1.53    90.14±0.14    92.73±0.49    88.08±4.96
w/o CON     78.56±2.25    84.88±5.73    92.40±0.17    88.84±3.12

