PRESERVING PRE-TRAINED FEATURES HELPS CALIBRATE FINE-TUNED LANGUAGE MODELS

Abstract

Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning. However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings. In this paper, we tackle the problem of calibrating fine-tuned language models. We demonstrate that PLMs are well-calibrated on the masked language modeling task, with robust predictive confidence under domain shift, yet fine-tuned models fail to retain this property due to catastrophic forgetting, which hurts calibration on the downstream classification task. In light of these observations, we evaluate the calibration of several methods that preserve pre-trained features and show that preserving pre-trained features can improve the calibration of fine-tuned language models. Among these methods, our proposed method, which encourages the fine-tuned model to learn generative representations with an auxiliary language modeling objective, achieves competitive accuracy and the lowest expected calibration error compared to several strong baselines under both in-domain and out-of-domain settings on three downstream NLU tasks.

1. INTRODUCTION

Fine-tuning pre-trained language models (PLMs) is the dominant paradigm for natural language understanding (NLU), achieving state-of-the-art results on a variety of NLU tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; He et al., 2021a). Fine-tuned language models have been applied to decision-making in real-world applications such as the healthcare domain (He et al., 2020) and safety-critical domains (Sandagiri et al., 2020), where classification networks need to be highly accurate and provide calibrated confidence for their predictions to improve the safety and trustworthiness of the models (Guo et al., 2017). For example, suppose a medical language inference LM that predicts a disease given a description of symptoms is well-calibrated, i.e., the model's posterior probabilities (or confidence) align well with the true correctness likelihood. In that case, wrong predictions are easier for human doctors to detect and correct because they come with low predictive confidence. However, like other modern neural networks, fine-tuned LMs are shown to suffer from overconfidence (Desai & Durrett, 2020; Jiang et al., 2021), which creates obstacles and concerns for their deployment in real-world applications. Uncertainty estimation for fine-tuned models is challenging due to the small amount of data available for fine-tuning, especially under out-of-domain settings (Desai & Durrett, 2020; Guo et al., 2021).
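Calibration as described above is commonly quantified by the expected calibration error (ECE), which the paper reports as its main metric. As a minimal illustrative sketch (not the paper's evaluation code), the standard binned estimator can be written as:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: partition predictions by confidence into equal-width
    bins and average the |accuracy - confidence| gap per bin, weighted
    by the fraction of samples in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue  # skip empty bins
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece
```

A perfectly calibrated model (e.g., 95% confidence on predictions that are right 95% of the time) yields ECE near zero; an overconfident model yields a large positive value.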
While prior work illustrates that simple calibration techniques such as temperature scaling (Guo et al., 2017) and label smoothing (Szegedy et al., 2016) are not sufficient to calibrate fine-tuned LMs under both in-domain (ID) and out-of-domain (OD) settings (Desai & Durrett, 2020; Park & Caragea, 2022), several approaches with strong regularization have been developed to calibrate fine-tuned models on NLU tasks, including knowledge distillation from deep ensembles (Guo et al., 2021), stochastic network architectures (Fan et al., 2020; Zhang et al., 2021), and Mixup (Park & Caragea, 2022). However, these existing works mostly apply general calibration methods for supervised learning, while specific properties of the pre-training & fine-tuning paradigm remain largely neglected. In this work, we tackle the calibration of fine-tuned models from the perspective of better leveraging the powerful PLMs. Through a carefully designed empirical study of both pre-trained and fine-tuned models, we first observe that PLMs themselves are actually well-calibrated on the masked language modeling (MLM) task and robust to higher levels of input perturbation, which suggests that PLMs can model predictive uncertainty well across different domains. However, the pre-trained features are only used as initialization and are distorted by fully discriminative fine-tuning. This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017; Howard & Ruder, 2018). We show that such forgetting can make fine-tuned language models fail to hold proper predictive confidence toward OD and outlier samples, which leads to miscalibration on the downstream tasks. Based on these observations, we hypothesize that preserving the pre-trained features helps calibrate fine-tuned LMs.
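Temperature scaling, the simplest of the post-hoc techniques mentioned above, rescales validation logits by a single scalar before the softmax. A minimal sketch (grid search over T rather than the gradient-based fit used by Guo et al., 2017; the grid range is an assumption for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimizing negative log-likelihood on a
    held-out validation set. Dividing logits by T > 1 softens
    overconfident probabilities without changing the argmax, so
    accuracy is unaffected."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(logits, T)
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

For an overconfident model the fitted T exceeds 1; because the rescaling is monotone, predictions themselves never change, which is also why temperature scaling alone cannot fix confidence rankings that are wrong under domain shift.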
To validate our hypothesis, we first evaluate the calibration of several existing methods that preserve pre-trained features, including (1) parameter-efficient tuning (Houlsby et al., 2019; Hu et al., 2021; Li & Liang, 2021), (2) pre-trained weight decay, and (3) Mixout (Lee et al., 2020). Although these methods were originally designed to improve performance rather than uncertainty estimation, our experiments demonstrate that they outperform vanilla fine-tuning in terms of calibration, especially under out-of-domain settings. Based on our observation that PLMs are well-calibrated on the MLM task while fine-tuned LMs that forget the pre-trained features struggle with overconfidence under domain shift, we propose a simple baseline that uses the MLM objective to maintain consistency between the pre-trained and fine-tuned models. The proposed method achieves the lowest expected calibration error and competitive accuracy compared to existing calibration methods in both ID and OD settings on three NLU tasks, including natural language inference, paraphrase detection, and commonsense reasoning, showing that preserving pre-trained features is an effective approach to improving the calibration of fine-tuned LMs.
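The proposed baseline can be pictured as combining the downstream cross-entropy loss with an auxiliary MLM term. The sketch below is a hypothetical numpy rendering of that combination; the weighting hyperparameter `lam` and the per-position averaging are assumptions for illustration, and the paper's exact formulation may differ:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def joint_loss(cls_logits, label, mlm_logits, masked_ids, lam=1.0):
    """Hypothetical combined objective: task cross-entropy plus an
    auxiliary MLM loss on the masked positions, weighted by lam.

    cls_logits: (num_classes,) logits from the task head
    mlm_logits: (num_masked, vocab_size) logits from the LM head
    masked_ids: (num_masked,) vocabulary ids of the original tokens
    """
    l_cls = -log_softmax(cls_logits)[label]
    lp = log_softmax(mlm_logits)
    l_mlm = -lp[np.arange(len(masked_ids)), masked_ids].mean()
    return l_cls + lam * l_mlm
```

With `lam = 0` this reduces to vanilla fine-tuning; increasing `lam` forces the shared encoder to keep producing features that support the generative MLM task, which is the feature-preservation effect the hypothesis relies on.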

2.1. MASKED LANGUAGE MODELS

Masked language models generally consist of a transformer-based text encoder $f_\phi$ parameterized by $\phi$ and a linear language modeling head $g_\theta$ parameterized by $\theta$. In the pre-training phase, the model handles the masked language modeling task (Devlin et al., 2019). Assume we have unsupervised sequence inputs $x$ sampled from large-scale corpora $p_u(x)$. A subset of $x$ is first masked by a corruption function or distribution. Denote the indices of the masked tokens as $\mathcal{M}$, the set of masked tokens as $x_{\mathcal{M}}$, and the observed unmasked input as $x_{\backslash\mathcal{M}}$. The model is trained to recover the masked tokens $x_{\mathcal{M}}$. In particular, the masked language model first uses the text encoder to obtain a hidden representation of the input, denoted as $f_\phi(x_{\backslash\mathcal{M}})$. Then the language modeling head $g_\theta$ with a softmax function is applied to $f_\phi(x_{\backslash\mathcal{M}})$ to obtain a conditional categorical distribution $p_{\mathrm{mlm}}(x_i \mid x_{\backslash\mathcal{M}})$ over the vocabulary $\mathcal{V}$ for each masked position $i \in \mathcal{M}$. The masked language modeling objective is:

$$\mathcal{L}_{\mathrm{mlm}} = -\mathbb{E}_{p_u(x)}\Big[\sum_{i \in \mathcal{M}} \log p_{\mathrm{mlm}}(x_i \mid x_{\backslash\mathcal{M}}; \phi, \theta)\Big] \tag{1}$$

In the fine-tuning phase, assume we have labeled data in the form of $(x, y)$ sampled from a data distribution $p_d$, where $x$ corresponds to the text input and $y$ to the label. For classification tasks, a task-specific head $h_\varphi$ parameterized by $\varphi$ is applied to the hidden representation of the input to obtain a logit for each class. The predictive posterior distribution $q(y \mid x)$ is given by the logits after a softmax operation. In standard fine-tuning, the pre-trained encoder $f_\phi$ and the task-specific head $h_\varphi$ are jointly optimized using the cross-entropy loss:

$$\mathcal{L}_{\mathrm{cls}} = -\mathbb{E}_{p_d(x,y)}\big[\log q(y \mid x; \phi, \varphi)\big] \tag{2}$$

which is also known as full fine-tuning (Full-FT) (Peters et al., 2018; Devlin et al., 2019).
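The corruption step that produces $x_{\backslash\mathcal{M}}$ can be sketched as follows. This is a simplified illustration, not the paper's implementation: BERT's full scheme also leaves some selected tokens unchanged or replaces them with random tokens, while this sketch applies only the mask-token replacement, and the masking probability is the commonly used 15% default, assumed here.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Corrupt a token sequence for MLM: each position is independently
    selected with probability mask_prob and replaced by mask_token.
    Returns (corrupted tokens, masked indices M, original targets x_M)."""
    rng = rng if rng is not None else random.Random(0)
    corrupted, masked_idx, targets = list(tokens), [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = mask_token
            masked_idx.append(i)   # index set M
            targets.append(tok)    # masked tokens x_M to recover
    return corrupted, masked_idx, targets
```

The model then receives the corrupted sequence and is trained to predict each entry of `targets` at the corresponding index, which is exactly the per-position term inside the sum of equation (1).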

