CLUSTER & TUNE: ENHANCE BERT PERFORMANCE IN TOPICAL TEXT CLASSIFICATION WHEN LABELED DATA IS SCARCE

Abstract

In cases where labeled data is scarce, the common practice of fine-tuning BERT for a target text classification task is prone to producing poor performance. In such scenarios, we suggest performing an unsupervised classification task prior to fine-tuning on the target task. Specifically, as such an intermediate task, we perform unsupervised clustering, training BERT to predict the cluster labels. We test this hypothesis on various datasets, and show that this additional classification step can significantly reduce the demand for labeled examples, mainly for topical classification tasks. We further discuss under which conditions this task is helpful and why.

1. INTRODUCTION

One of the most practical NLP use cases is the task of text classification, where the goal is to automatically assign a new text instance into a subset of pre-specified categories. Text classification applications include topic detection, sentiment analysis, and spam filtering, to name just a few examples. The standard paradigm relies on supervised learning, where it is well known that the size and quality of the labeled data strongly impact the performance of the trained classifier. Hence, as with many other supervised learning tasks, developing a text classification scheme in practice typically requires making the most out of a relatively small set of annotated examples.

The emergence of transformer-based pretrained language models such as BERT (Devlin et al., 2018) has reshaped the NLP landscape, leading to significant advances in the performance of most NLP tasks, text classification included (e.g., Nogueira & Cho, 2019; Ein-Dor et al., 2020). These models typically rely on pretraining a transformer-based neural network on massive and heterogeneous corpora with a general Masked Language Modeling (MLM) task, i.e., predicting a word that is masked in the original text. Later on, the obtained model is fine-tuned to the actual task of interest, termed here the target task, using the labeled data available for this task. Thus, pretrained models serve as general sentence encoders which can be adapted to a variety of tasks (Lacroix et al., 2019; Wang et al., 2020a).

Our focus in this work is on a challenging yet common scenario, where the textual categories are not trivially separated and, furthermore, the available labeled data is scarce. There are many real-world cases in which data cannot be sent for massive labeling by the crowd (e.g., due to the confidentiality of the data, or the need for very specific expertise) and the availability of experts is limited. In such a setting, fine-tuning a pretrained model is expected to yield far from optimal performance.
To overcome this, one may take a gradual approach composed of several steps. One possibility is to further pretrain the model with the self-supervised MLM task over unlabeled data taken from the target task domain (Whang et al., 2019). Alternatively, one can train the pretrained model on a supervised intermediate task that is different in nature from the target task, and for which labeled data is more readily available (Pruksachatkun et al., 2020; Wang et al., 2019a; Phang et al., 2018). Each of these steps is expected to provide a better starting point - in terms of the model parameters - for the final fine-tuning step, performed over the scarce labeled data available for the target task, aiming to end up with improved performance.

Following these lines, here we propose a simple strategy that exploits unsupervised text clustering as the intermediate task towards fine-tuning a pretrained language model for text classification. Our work is inspired by the use of clustering to obtain labels for training deep networks in computer vision (Gidaris et al., 2018; Kolesnikov et al., 2019). Specifically, we use an efficient clustering technique, relying on simple Bag Of Words (BOW) representations, to partition the unlabeled training data into relatively homogeneous clusters of text instances. Next, we treat these clusters as labeled data for an intermediate text classification task, and train BERT - with or without additional MLM pretraining - on this multi-class problem, prior to the final fine-tuning over the actual target-task labeled data.
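As a toy illustration of the pseudo-labeling step, the snippet below builds Bag-of-Words vectors and assigns k-means cluster indices to unlabeled texts; these indices can then serve as class labels for the intermediate classification task. This is a hedged sketch of the general idea, not the authors' exact implementation (the paper only specifies BOW representations and an efficient clustering algorithm); all function names here are our own.

```python
import numpy as np

def bow_vectors(texts):
    """Build simple Bag-of-Words count vectors over a whitespace vocabulary."""
    vocab = {}
    for t in texts:
        for w in t.lower().split():
            vocab.setdefault(w, len(vocab))
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            X[i, vocab[w]] += 1
    return X

def kmeans_pseudo_labels(X, k, iters=20, seed=0):
    """Run a basic k-means and return the cluster index of each instance.

    The returned indices play the role of (noisy) class labels for the
    intermediate BERT classification task.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```

In the actual method, each (text, cluster index) pair is then fed to BERT as a labeled example of a multi-class classification problem.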
Extensive experimental results demonstrate the practical value of this strategy on a variety of benchmark data, most prominently when the training data available for the target task is relatively small and the classification task is of topical nature. We further analyze the results to gain insights as to when this approach would be most valuable, and propose future directions to expand the present work.

2. INTERMEDIATE TRAINING USING UNSUPERVISED CLUSTERING

A pretrained transformer model, such as BERT, is typically developed in consecutive phases. First, the model is pretrained over massive general corpora with the MLM task. 1 The obtained model is referred to henceforth as BERT. Second, BERT is fine-tuned in a supervised manner with the available labeled examples for the target task at hand. This standard flow is represented via Path-1 in Fig. 1. An additional phase can be added between these two, referred to next as intermediate training, or inter-training in short. In this phase, the model is exposed to the corpus of the target task, or a corpus of the same domain, but still has no access to labeled examples for this task. A common example of such an intermediate phase is to continue to inter-train BERT using the self-supervised MLM task over the corpus or the domain of interest, sometimes referred to as further or adaptive pre-training (e.g., Gururangan et al., 2020). This flow is represented via Path-2 in Fig. 1, and the resulting model is denoted BERT IT:MLM, standing for Intermediate Task: MLM.
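The core of MLM inter-training is the construction of (input, label) pairs from unlabeled domain text. The sketch below illustrates this data preparation in a simplified form: each token is masked with some probability, and only masked positions contribute to the loss. It is our own illustration, not the Devlin et al. implementation; in particular, real BERT training also replaces some selected tokens with random tokens or keeps them unchanged, and the token id 103 for [MASK] and the -100 ignore index follow common BERT conventions rather than anything specified in this paper.

```python
import random

MASK_ID = 103   # [MASK] token id in the standard BERT vocabulary (assumption)
IGNORE = -100   # label value conventionally ignored by the cross-entropy loss

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """Return (masked input ids, labels) for one masked-language-modeling step.

    Positions not selected for masking get the IGNORE label, so the loss is
    computed only on the masked tokens, whose original ids must be predicted.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # model must recover the original token here
            labels.append(tid)
        else:
            inputs.append(tid)      # unchanged token, excluded from the loss
            labels.append(IGNORE)
    return inputs, labels
```

Inter-training then simply continues optimizing BERT's MLM head on such pairs built from the target-domain corpus, before the final supervised fine-tuning.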



1 BERT was originally also pretrained with a "next sentence prediction" task (Devlin et al., 2018); however, later works such as Yang et al. (2019) and Liu et al. (2019b) have questioned the contribution of this additional task and focused on MLM.



Figure 1: BERT phases - circles are training steps which produce models, represented as rectangles. In the pre-training phase, only general corpora are available. The inter-training phase is exposed to target domain data, but not to its labeled instances. Those are only available at the fine-tuning phase.

