PREFER TO CLASSIFY: IMPROVING TEXT CLASSIFIERS VIA PAIR-WISE PREFERENCE LEARNING

Abstract

The development of large human-annotated benchmarks has driven the success of deep neural networks in various NLP tasks. These benchmarks are collected by aggregating decisions made by different annotators on the target task. Aggregating the annotated decisions via majority voting remains common practice, despite the inevitable information loss incurred by such simple aggregation. In this paper, we establish a novel classification framework based on task-specific human preferences between pairs of samples, which provide an informative training signal for capturing fine-grained and complementary task information through pair-wise comparison. It thereby improves upon the existing instance-wise annotation scheme by enabling better task modeling through learning the relation between samples. Specifically, we propose a novel multi-task learning framework, called prefer-to-classify (P2C), to effectively learn human preferences in addition to the given classification task. We collect human preference signals in two ways: (1) extracting relative preferences implicitly from annotation records (for free) or (2) collecting subjective preferences explicitly from (paid) crowd workers. On various text classification tasks, we demonstrate that both extractive and subjective preferences effectively improve the classifier under our preference learning framework. Interestingly, we find that subjective preferences yield more significant improvements than extractive ones, revealing the effectiveness of explicitly modeling human preferences. Our code and preference dataset will be publicly available upon acceptance.

1. INTRODUCTION

The recent success of natural language processing (NLP) systems has been driven by, among other things, the construction of large human-annotated benchmarks, such as GLUE (Wang et al., 2019) or SQuAD (Rajpurkar et al., 2016). Nevertheless, current NLP benchmarks often suffer from problems introduced during their construction, such as annotation artifacts (Gururangan et al., 2018) or spurious patterns (Kaushik & Lipton, 2018). To alleviate these issues, various approaches have recently been proposed to construct more robust and effective benchmarks via human-and-model-in-the-loop annotation (Kiela et al., 2021; Yuan et al., 2021; Liu et al., 2022) or adversarial sample mining (Kaushik et al., 2020; Nie et al., 2020; Potts et al., 2021).

Despite such careful selection of samples to annotate, how to aggregate the annotations and assign labels, so as to fully exploit these benchmarks, remains relatively under-explored. For example, most NLP data collection still follows a long-standing custom called majority voting (Snow et al., 2008; Hovy et al., 2013), which aggregates multiple annotators' judgments into a single majority-voted label. Labeling by majority voting, however, inevitably discards valuable information embedded in the annotators' assessments and their disagreements, such as the inherent difficulty of an instance (Pavlick & Kwiatkowski, 2019) or the uncertainty stemming from a task's subjectivity (Alm, 2011). As modern NLP systems extend their reach to a greater variety of social issues and subjective tasks (Uma, 2021), such as humor detection (Simpson et al., 2019) and racist language detection (Larimore et al., 2021), the capability to model the fine-grained, distributional opinions of multiple annotators becomes more important. To address the limitations of this simple labeling method, various approaches have recently been proposed, such as label smoothing (Fornaciari et al., 2021; Leonardelli et al., 2021).
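To make concrete what majority voting throws away, the following minimal sketch (ours, not from the paper) aggregates one sample's annotator votes into a hard label and also computes the soft label, i.e., the empirical vote distribution that majority voting discards:

```python
from collections import Counter

def aggregate(annotations):
    """Aggregate one sample's annotator votes.

    Returns the majority-voted hard label and the soft label
    (empirical vote distribution) that majority voting discards.
    """
    counts = Counter(annotations)
    hard_label = counts.most_common(1)[0][0]
    total = len(annotations)
    soft_label = {label: n / total for label, n in counts.items()}
    return hard_label, soft_label

# Three annotators disagree on a sentiment label (0=negative, 1=positive):
# the hard label is 1, while the soft label retains the 2:1 disagreement.
hard, soft = aggregate([1, 1, 0])
print(hard, soft)
```

Both samples with a 3:0 vote and samples with a 2:1 vote collapse to the same hard label, which is exactly the information loss discussed above.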
However, they are still limited by the discretized annotation space and the limited number of annotators, resulting in coarse-grained modeling of the task. This inspires us to investigate a new and complementary direction for capturing fine-grained task information: relatively ordering a pair of texts, and better calibrating the model with respect to the task, using human preference.

Contribution. In this paper, we establish a new classification framework based on human preference between a pair of samples, e.g., which text is more positive for sentiment classification (see Figure 1(a)). Specifically, we propose a novel multi-task learning framework, coined prefer-to-classify (P2C), to effectively train the model on both the classification and preference learning tasks. We introduce multiple diverse preference heads beside the classification head of the model to learn from preference labels. We then apply a consistency regularization between them, encouraging the model to be more confident in classifying the preferred sample of each pair. We also develop two advanced sampling schemes to select more informative text pairs during training.

To train P2C, we collect two types of human preference labels: extractive preference and subjective preference. Extractive preference is constructed from the existing annotation records in datasets, without additional cost: if one sample received fewer votes for the label than the other, we treat the latter as relatively more preferred. One may argue that such extracted preferences are somewhat artificial (albeit free) and implicit signals, as they are not obtained from direct comparison by humans. To address this, we also collect subjective preferences for 5,000 pairs of texts from (paid) crowd workers by directly asking them which text is more preferred with respect to the task label.
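The extractive-preference rule described above can be sketched as follows. This is an illustrative reading of the rule (the function name and tie handling are ours), assuming each sample's annotation record provides per-class vote counts:

```python
def extractive_preference(votes_a, votes_b, label):
    """Derive a free pair-wise preference from annotation records.

    votes_a, votes_b: dicts mapping class -> number of annotator votes,
                      for two samples that share the same task label.
    label: the task label assigned to both samples.

    Returns  1 if sample A is preferred (more votes for `label`),
            -1 if sample B is preferred, and 0 for a tie.
    """
    a, b = votes_a.get(label, 0), votes_b.get(label, 0)
    if a > b:
        return 1
    if a < b:
        return -1
    return 0

# Sample A received 5/5 "pos" votes while sample B received only 3/5,
# so A is treated as more preferred with respect to the "pos" label.
print(extractive_preference({"pos": 5}, {"pos": 3, "neg": 2}, "pos"))  # 1
```

Because the signal comes from records that already exist, it costs nothing extra; the trade-off, as noted above, is that annotators never directly compared the two texts.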
We demonstrate the effectiveness of preference learning via P2C, in addition to the given task-specific labels, with both extractive and subjective preference labels. On six text classification datasets, P2C with extractive preference achieves 7.59% and 4.27% relative test error reduction on average, compared to training with majority voting and to the previous best method for learning from annotation records, respectively. Moreover, our newly collected subjective preference labels show clear advantages over the extractive ones, not only improving task performance but also yielding better calibration and task modeling; for example, 6.09% expected calibration error versus 9.19% from the same number of task labels. Overall, our work highlights the effectiveness of pair-wise human preference for better task learning; we suggest that NLP benchmarks should include full annotation records, instead of providing only majority-voted labels, or collect human preferences directly.
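For reference, expected calibration error (ECE), the metric quoted above, bins predictions by confidence and takes the weighted average of the per-bin gap between accuracy and confidence. A minimal sketch (using the standard equal-width binning; the paper's exact evaluation setup may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight * calibration gap
    return ece

# An overconfident toy model: the 0.6-confidence prediction is wrong.
print(expected_calibration_error([0.95, 0.9, 0.6], [1, 1, 0]))  # ~0.25
```

A perfectly calibrated model (confidence matching empirical accuracy in every bin) would score 0; lower is better, so 6.09% vs. 9.19% is a meaningful gain.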

2. IMPROVING TEXT CLASSIFICATION VIA PREFERENCE LEARNING

In this section, we present prefer-to-classify (P2C), a new preference learning framework for text classification. Our main idea is to take advantage of the preference between two input samples, in addition to the task annotations (see Figure 1(a)). Learning with human preference has been demonstrated in multiple domains, including reinforcement learning (Christiano et al., 2017) and generative models (Ziegler et al., 2019), by training the model to follow human behavior and achieve complex goals through better modeling of the task. While this direction is under-explored in the classification regime, preference learning can provide an informative training signal by capturing complementary task information through pair-wise comparison that cannot be captured by instance-wise evaluation (see Figure 1(b)). Hence, it can effectively improve the classifier, as shown in Figure 1(c).

Figure 1: (a) Example of a pair-wise preference in sentiment classification. (b) Effect of preference learning: the classifier captures fine-grained task information, e.g., its predictions become more aligned with human annotations. Test samples are divided into Hard, Normal, and Easy based on the annotators' disagreement. (c) Improvements from the collected preferences and P2C in various aspects, e.g., better accuracy and calibration. More results are presented in Section 4.2.
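A minimal PyTorch-style sketch of this multi-task setup may help fix ideas. All module and loss names here are ours; we simplify to a single preference head with a Bradley-Terry-style pair-wise loss and a hinge-style consistency term, whereas the paper's exact heads, sampling schemes, and regularizer may differ:

```python
import torch
import torch.nn.functional as F

class P2CModel(torch.nn.Module):
    """Shared encoder with a classification head and a preference head."""
    def __init__(self, in_dim=8, hidden_dim=32, num_classes=2):
        super().__init__()
        self.encoder = torch.nn.Linear(in_dim, hidden_dim)  # stand-in encoder
        self.cls_head = torch.nn.Linear(hidden_dim, num_classes)
        self.pref_head = torch.nn.Linear(hidden_dim, 1)     # scalar preference score

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.cls_head(h), self.pref_head(h).squeeze(-1)

def p2c_loss(model, x_a, x_b, y, pref):
    """Classification + pair-wise preference + consistency terms.

    pref = 1 if x_a is preferred over x_b for label y, else 0.
    """
    logits_a, score_a = model(x_a)
    logits_b, score_b = model(x_b)
    # (1) usual cross-entropy on both samples of the pair
    cls = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
    # (2) Bradley-Terry-style loss: preferred sample gets the higher score
    pref_loss = F.binary_cross_entropy_with_logits(score_a - score_b, pref.float())
    # (3) consistency: classifier should be more confident on the preferred sample
    conf_a = logits_a.softmax(-1).gather(1, y[:, None]).squeeze(1)
    conf_b = logits_b.softmax(-1).gather(1, y[:, None]).squeeze(1)
    sign = 2 * pref.float() - 1            # +1 if x_a preferred, -1 otherwise
    consistency = F.relu(-sign * (conf_a - conf_b)).mean()
    return cls + pref_loss + consistency

# Toy usage: random features and labels for a batch of 4 text pairs.
model = P2CModel()
x_a, x_b = torch.randn(4, 8), torch.randn(4, 8)
y, pref = torch.randint(0, 2, (4,)), torch.randint(0, 2, (4,))
loss = p2c_loss(model, x_a, x_b, y, pref)
loss.backward()  # all three terms are differentiable end-to-end
```

The consistency term is what couples the two tasks: it penalizes pairs where the classifier assigns lower confidence (on the shared label) to the sample humans preferred, which matches the intuition of Figure 1(b).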

