PRESERVING PRE-TRAINED FEATURES HELPS CALIBRATE FINE-TUNED LANGUAGE MODELS

Abstract

Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning. However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings. In this paper, we tackle the problem of calibrating fine-tuned language models. We demonstrate that PLMs are well-calibrated on the masked language modeling task, with robust predictive confidence under domain shift, yet fine-tuned models fail to retain this property due to catastrophic forgetting, which harms calibration on the downstream classification task. In light of these observations, we evaluate the calibration of several methods that preserve pre-trained features and show that preserving pre-trained features can improve the calibration of fine-tuned language models. Among these methods, our proposed approach, which encourages the fine-tuned model to learn generative representations with an auxiliary language modeling objective, achieves competitive accuracy and the lowest expected calibration error compared to several strong baselines under both in-domain and out-of-domain settings on three downstream NLU tasks.

1. INTRODUCTION

Fine-tuning pre-trained language models (PLMs) is the dominant paradigm for natural language understanding (NLU), with state-of-the-art results on a variety of NLU tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; He et al., 2021a). Fine-tuned language models have been applied to decision-making in real-world applications such as the healthcare domain (He et al., 2020) and safety-critical domains (Sandagiri et al., 2020), where classification networks need to be highly accurate and provide calibrated confidence for their predictions to improve the safety and trustworthiness of the models (Guo et al., 2017). For example, suppose a medical language inference LM that predicts a disease from a description of symptoms is well-calibrated, i.e., the model's posterior probabilities (or confidence) align well with the true correctness likelihood. In that case, wrong predictions are easier for human doctors to detect and correct because they come with low predictive confidence. However, like other modern neural networks, fine-tuned LMs are shown to suffer from overconfidence (Desai & Durrett, 2020; Jiang et al., 2021), which creates obstacles and concerns for their deployment in real-world applications. Uncertainty estimation of fine-tuned models is challenging due to the small amount of data available for fine-tuning, especially under out-of-domain settings (Desai & Durrett, 2020; Guo et al., 2021).
While prior works illustrate that simple calibration techniques such as temperature scaling (Guo et al., 2017) and label smoothing (Szegedy et al., 2016) are not sufficient to calibrate fine-tuned LMs under both in-domain (ID) and out-of-domain (OD) settings (Desai & Durrett, 2020; Park & Caragea, 2022), several approaches with strong regularization have been developed to calibrate fine-tuned models on NLU tasks, including knowledge distillation from deep ensembles (Guo et al., 2021), stochastic network architectures (Fan et al., 2020; Zhang et al., 2021), and Mixup (Park & Caragea, 2022). However, these existing works mostly apply general calibration methods for supervised learning, while specific properties of the pre-training & fine-tuning paradigm remain largely neglected. In this work, we tackle the calibration of fine-tuned models from the perspective of better leveraging the powerful PLMs. Through a carefully designed empirical study of both pre-trained and fine-tuned models, we first observe that PLMs themselves are actually well-calibrated on the masked language modeling (MLM) task and robust to higher levels of perturbation to the inputs, which suggests that PLMs can model predictive uncertainty well across different domains. However, the pre-trained features are only used as initialization and are distorted by fully discriminative fine-tuning. This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017; Howard & Ruder, 2018). We show that such forgetting can make fine-tuned language models fail to hold proper predictive confidence toward OD and outlier samples, which leads to miscalibration on the downstream tasks. Based on these observations, we hypothesize that preserving the pre-trained features helps calibrate the fine-tuned LMs.
To validate our hypothesis, we first evaluate the calibration of several previous methods that can preserve pre-trained features, including (1) parameter-efficient tuning (Houlsby et al., 2019; Hu et al., 2021; Li & Liang, 2021), (2) pre-trained weight decay, and (3) Mixout (Lee et al., 2020). Although these methods were originally designed for purposes other than uncertainty estimation, our experiments demonstrate that they outperform vanilla fine-tuning in terms of calibration, especially under out-of-domain settings. Based on our observation that the PLMs are well-calibrated on the MLM task yet the fine-tuned LMs that forget the pre-trained features struggle with overconfidence under domain shift, we propose a simple baseline that utilizes the MLM objective to maintain consistency between the pre-trained and fine-tuned models. The proposed method achieves the lowest expected calibration error and competitive accuracy compared to existing calibration methods in both ID and OD settings on three NLU tasks, including natural language inference, paraphrase detection, and commonsense reasoning, showing that preserving the pre-trained features is an effective approach for improving the calibration of fine-tuned LMs.

2.1. MASKED LANGUAGE MODELS

Masked language models generally consist of a transformer-based text encoder f_φ parameterized by φ and a linear language modeling head g_θ parameterized by θ. In the pre-training phase, the model handles the masked language modeling task (Devlin et al., 2019). Assume we have unsupervised sequence inputs x sampled from large-scale corpora p_u(x). A subset of x is first masked by a corruption function or distribution. Denote the indices of the masked tokens as M, the set of masked tokens as x_M, and the observed unmasked input as x_\M. The model is trained to recover the masked tokens x_M. In particular, the masked language model first uses the text encoder to obtain a hidden representation of the input, denoted as f_φ(x_\M). Then the language modeling head g_θ with a softmax function is applied to f_φ(x_\M) to obtain a conditional categorical distribution p_mlm(x_i | x_\M) over the vocabulary V for each masked position i ∈ M. The masked language modeling objective is:

L_mlm = -E_{p_u(x)} [ Σ_{i∈M} log p_mlm(x_i | x_\M; φ, θ) ]    (1)

In the fine-tuning phase, assume we have labeled data in the form of (x, y) sampled from a data distribution p_d, where x corresponds to the text input and y to the label. For classification tasks, a task-specific head h_ϕ parameterized by ϕ is applied to the hidden representation of the input to obtain a logit for each class. The predictive posterior distribution q(y|x) is given by the logits after the softmax operation. In standard fine-tuning, the pre-trained encoder f_φ and the task-specific head h_ϕ are jointly optimized with the cross-entropy loss:

L_cls = -E_{p_d(x,y)} [ log q(y|x; φ, ϕ) ]    (2)

which is also known as full fine-tuning (Full-FT) (Peters et al., 2018; Devlin et al., 2019).
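As an illustrative sketch (not the paper's released implementation), the two objectives above can be written in NumPy. The arrays `logits`, `target_ids`, and the boolean `mask` are hypothetical placeholders for the LM-head outputs, the original token ids, and the masked positions M:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mlm_loss(logits, target_ids, mask):
    """L_mlm: mean negative log-likelihood over masked positions only.

    logits:     (seq_len, |V|) outputs of the LM head g_theta
    target_ids: (seq_len,)     original token ids
    mask:       (seq_len,)     bool, True at the indices in M
    """
    probs = softmax(logits[mask])                          # (n_masked, |V|)
    nll = -np.log(probs[np.arange(mask.sum()), target_ids[mask]])
    return nll.mean()

def cls_loss(logits, label):
    """L_cls: negative log-likelihood of the label under q(y|x)."""
    return -np.log(softmax(logits)[label])
```

A uniform predictive distribution yields an MLM loss of log |V|, which is a useful sanity check when wiring up the objective.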

2.2. CONFIDENCE CALIBRATION

The framework of confidence calibration under the supervised classification setting can be expressed as a joint distribution P(ŷ, p̂) over the label prediction ŷ ∈ Y and the corresponding confidence p̂ ∈ [0, 1]. A perfectly calibrated model satisfies P(ŷ = y | p̂ = p) = p (Guo et al., 2017). One way to evaluate calibration from finite samples is the expected calibration error, i.e., ECE (Naeini et al., 2015). To compute ECE, the model's predictive confidences are first grouped into M equal-width bins. Denote B_m as the set of indices of samples whose confidences fall in the interval ((m-1)/M, m/M]. Given N samples, the ECE is the weighted average of the difference between confidence and accuracy in each bin:

acc(B_m) = (1/|B_m|) Σ_{i∈B_m} 1(ŷ_i = y_i),   conf(B_m) = (1/|B_m|) Σ_{i∈B_m} p̂_i

ECE = Σ_{m=1}^{M} (|B_m| / N) |acc(B_m) - conf(B_m)|

In this work, we set M = 10 following Desai & Durrett (2020).
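The binned ECE computation above can be sketched in a few lines of NumPy (a minimal reference implementation, assuming `confidences`, `predictions`, and `labels` are parallel arrays):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with M equal-width confidence bins ((m-1)/M, m/M], M=10 by default."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    n = len(confidences)
    ece = 0.0
    for m in range(1, n_bins + 1):
        lo, hi = (m - 1) / n_bins, m / n_bins
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += in_bin.sum() / n * abs(acc - conf)
    return ece
```

For instance, ten predictions all made with confidence 0.9 but only 5 correct give an ECE of |0.5 - 0.9| = 0.4, while ten predictions with confidence 0.8 and 8 correct are perfectly calibrated (ECE = 0).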

3. A CLOSER LOOK AT THE PRE-TRAINED AND FINE-TUNED LANGUAGE MODELS IN CALIBRATION

In this section, we explore the connection between the pre-trained and fine-tuned language models in terms of calibration by examining: (1) The calibration of the pre-trained language models themselves. (2) How fine-tuning affects calibration on the downstream classification tasks.

3.1. WHAT WE FORGET AFTER FINE-TUNING THE PRE-TRAINED LANGUAGE MODELS

Pre-trained language models have demonstrated their ability to capture informative linguistic features from large corpora (Tenney et al., 2019; Jawahar et al., 2019; Ethayarajh, 2019). Intuitively, the pre-trained features learned on diverse corpora should be capable of performing uncertainty estimation well. In this subsection, we validate that the pre-trained language models are indeed well-calibrated on the MLM task, which suggests that the predominant full fine-tuning method, which forgets such pre-trained features, is suboptimal.

Setup: We evaluate the calibration of the pre-trained RoBERTa_BASE (Liu et al., 2019) through the MLM task. We use the test split of WikiText-103 (Merity et al., 2016), one of the corpora used in the pre-training phase, and six downstream datasets from different domains (see §5.1 and A.1 for details) as sequence inputs for masked language modeling. The inputs are masked and corrupted at three mask probabilities of 15%, 30%, and 50% with the same masking approach (i.e., the 80-10-10 strategy) as in the pre-training phase (Devlin et al., 2019).

Results and Analysis: Table 1 and Figure 1 show the ECE and the reliability diagram (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005) of the pre-trained RoBERTa_BASE on the MLM task. The results suggest that the PLM is relatively well-calibrated across different domains, where the model needs to recover each corrupted position by choosing among |V| candidates. Moreover, as the mask probability grows beyond that of the pre-training phase, the ECE of the PLM increases only by a relatively small amount, which indicates that the PLM's predictive confidence p_mlm(x | x_\M) on the MLM task is robust to higher corruption levels. Figure 2 demonstrates that although the hidden representations of 50% masked inputs (visualized by t-SNE (Van der Maaten & Hinton, 2008)) have shifted significantly from the original inputs, the PLM can still make calibrated predictions for them.
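The 80-10-10 corruption strategy referenced above can be sketched as follows (an illustrative token-level version; the helper names, the toy vocabulary, and the `<mask>` placeholder are hypothetical, not the paper's code):

```python
import random

def corrupt(tokens, mask_token="<mask>", vocab=None, p_mask=0.15, seed=0):
    """BERT-style corruption: each token is selected with probability p_mask.
    A selected token becomes <mask> 80% of the time, a random vocabulary token
    10% of the time, and is kept unchanged the remaining 10% (80-10-10).
    Returns (corrupted_tokens, indices_of_masked_positions M)."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "a", "dog", "cat", "runs"]  # toy vocabulary
    out, masked = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p_mask:
            masked.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token          # 80%: replace with <mask>
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: random token
            # else: 10%: keep the original token (still predicted)
    return out, masked
```

Raising `p_mask` to 0.30 or 0.50 reproduces the higher corruption levels used in the study, leaving the 80-10-10 split within the selected positions unchanged.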
Intuitively, the calibrated confidence on the MLM task suggests that the pre-trained features of PLMs are good at modeling the samples under large domain shifts, which may benefit the calibration of the downstream classification task under OD settings. However, this property is less likely to be retained by the fine-tuned LM due to catastrophic forgetting caused by full fine-tuning with the discriminative objective (Howard & Ruder, 2018) . 

3.2. HOW FINE-TUNING AFFECTS CALIBRATION ON THE DOWNSTREAM TASKS

Although previous works have shown that fine-tuned language models can outperform non-pre-trained models in terms of calibration (Desai & Durrett, 2020), the fine-tuned LMs' calibration performance is still far from satisfactory, especially under the OD settings (Desai & Durrett, 2020; Guo et al., 2021). To study why fine-tuned LMs are miscalibrated under the OD settings, we conduct a case study on the LM fine-tuned on the QQP dataset, a typical failure case in which the fine-tuned model exhibits a dramatic disparity between ID and OD ECE, as shown in Figure 3.

Setup: We fine-tune the pre-trained RoBERTa_BASE on the QQP training set following the default configuration of the Huggingface Transformers library (Wolf et al., 2020). Compared to the pre-trained model, we visualize the hidden representations given by the fine-tuned model for the same inputs as §3.1. Besides, we evaluate the average confidence of the fine-tuned model's predictions on several datasets, including the in-domain QQP validation set, the out-of-domain TwitterPPDB validation set, and the outlier WikiText-103 validation set, which does not hold any particular attributes of the downstream classification task.

Results and Analysis: As shown in Figure 2, compared to the pre-trained model, fine-tuning changes the hidden representations of the LM in two ways: (1) For inputs within the same domain, fine-tuning enlarges the differences between the corresponding hidden representations, which aligns with the quantitative results of previous work (Zhou & Srikumar, 2022). (2) For inputs across different domains, fine-tuning makes the hidden representations from different domains much harder to distinguish by projecting them onto a simpler data manifold, which causes the fine-tuned model to fail to give proper predictive confidence for OD and outlier samples.
As shown in Figure 3, the average predictive confidence on the ID, OD, and outlier validation sets increases as training proceeds. The gap between the average predictive confidence and the correctness likelihood (i.e., classification accuracy) is relatively larger in the OD setting than in the ID setting, which results in a larger OD ECE. More crucially, the average confidence on the outlier samples is higher than in both the ID and OD settings throughout training and increases to nearly 100% after three training epochs, which cannot be fixed by simple techniques such as temperature scaling and early stopping. Ideally, the model should be uncertain about outlier samples that deviate significantly from the training samples. However, the fine-tuned LM exhibits overconfidence toward the OD and outlier samples, which implies that strong regularization methods are needed to improve confidence modeling for OD and outlier samples. Based on the observations in §3.1 that the pre-trained features of the PLMs can model predictive uncertainty well across different domains yet are distorted by fine-tuning, we hypothesize that preserving the pre-trained features of PLMs helps the fine-tuned LMs better model predictive confidence and improves calibration on downstream classification tasks.

4. METHODS

To validate the hypothesis that preserving the pre-trained features helps the calibration of fine-tuned LMs, we examine existing methods that preserve the pre-trained features in different ways. These methods were not originally designed to enhance uncertainty estimation: parameter-efficient tuning aims at a better trade-off between the number of tunable parameters and model performance, while pre-trained weight decay and Mixout aim to improve classification accuracy and stability. Nevertheless, we anticipate that they may improve calibration by mitigating catastrophic forgetting, and we evaluate their effectiveness in calibration in §5.

4.1. PARAMETER-EFFICIENT TUNING

Parameter-efficient tuning is a family of fine-tuning methods that keep the pre-trained parameters φ of the text encoder frozen and update only a small number of extra parameters φ_Δ and the task-specific head h_ϕ, while preserving competitive performance with Full-FT. Since the pre-trained knowledge is encoded in the model's parameters and only a small number of extra parameters are trained, parameter-efficient tuning methods can preserve more pre-trained features than Full-FT. In this work, we choose three mainstream parameter-efficient tuning methods: (1) Adapter (Houlsby et al., 2019), which adds a lightweight bottleneck module after the output of each sub-layer in the transformer block; (2) LoRA (Hu et al., 2021), which updates the attention weight matrices with a low-rank reparameterization; (3) Prefix Tuning (Li & Liang, 2021), which prepends tunable prefix vectors to the keys and values of the multi-head attention layers.
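To make the low-rank reparameterization concrete, the following is a minimal NumPy sketch of a LoRA-style linear layer (hypothetical class and hyperparameters chosen for illustration, not the OpenDelta implementation used in the experiments). The frozen pre-trained weight W_0 plays the role of φ, and the trainable factors A, B play the role of φ_Δ:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: y = x W0^T + (alpha/r) * (x A^T) B^T.

    The pre-trained weight w0 stays frozen; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained. B is zero-initialized,
    so the layer starts out identical to the frozen pre-trained layer.
    """
    def __init__(self, w0, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w0.shape
        self.w0 = w0                                 # frozen
        self.A = rng.normal(0.0, 0.01, (r, d_in))    # trainable
        self.B = np.zeros((d_out, r))                # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen pre-trained path plus the scaled low-rank update.
        return x @ self.w0.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the update term vanishes at initialization, which is one reason such methods can only drift gradually away from the pre-trained features.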

4.2. REGULARIZATION WITH PRE-TRAINED WEIGHT

Introducing regularization terms based on the pre-trained weights can also better leverage the pre-trained features during fine-tuning. In this work, we adopt two common regularization techniques:

Pre-trained Weight Decay: Traditional weight decay adds a regularization term (λ/2)||w||² that penalizes large weights to improve generalization (Krogh & Hertz, 1991), where λ is a regularization coefficient. As an alternative, performing weight decay toward the pre-trained weight w_0 by adding (λ/2)||w - w_0||² to the task loss has been shown by previous works to mitigate catastrophic forgetting caused by fine-tuning (Wiese et al., 2017) and to improve performance on the downstream task (Chen et al., 2020).

Mixout: To explicitly prevent deviation from the pre-trained weight w_0, Lee et al. (2020) propose Mixout, which stochastically replaces the model parameters with their pre-trained counterparts with probability p at each training iteration. Mixout has been shown to improve the stability of fine-tuning and enhance the classification accuracy of fine-tuned LMs on downstream tasks.
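Both techniques can be sketched in a few lines (illustrative helpers under simplifying assumptions; in particular, the Mixout sketch omits the 1/(1-p) rescaling used in the original paper, and the real experiments use the RecAdam optimizer rather than a hand-written gradient):

```python
import numpy as np

def pwd_gradient(grad_task, w, w0, lam=0.01):
    """Gradient of L_task + (lam/2)||w - w0||^2: the decay term pulls the
    weights toward the pre-trained w0 instead of toward zero."""
    return grad_task + lam * (w - w0)

def mixout(w, w0, p=0.1, seed=0):
    """Mixout (simplified): at each iteration, swap each parameter back to
    its pre-trained value with probability p."""
    rng = np.random.default_rng(seed)
    swap = rng.random(w.shape) < p
    return np.where(swap, w0, w)
```

Setting p = 0 or λ = 0 recovers vanilla fine-tuning; increasing either tightens the constraint toward the pre-trained weights, which is the strength/accuracy trade-off discussed in §5.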

4.3. JOINT LEARNING WITH MLM OBJECTIVE

Besides fixing or constraining the model parameters to their pre-trained counterparts, we can enforce consistency between the pre-trained and fine-tuned LMs through the MLM objective. Previous works have demonstrated that performing the MLM task before or during fine-tuning can yield better performance on downstream tasks (Sun et al., 2019; Wiedemann et al., 2020; Ma et al., 2021). In this work, to better preserve the pre-trained features, we jointly optimize the MLM objective in the fine-tuning phase:

L_joint = α_mlm L_mlm + L_cls    (3)

where α_mlm is the scaling factor of the MLM loss. In addition to introducing the MLM objective, we propose three simple techniques to further strengthen the connection between the pre-trained model and our fine-tuned model:

Utilizing the corpus of the pre-training phase: The dataset for the MLM task does not need to be labeled like the downstream task's, so we can choose any text dataset. In this work, we present two criteria for selecting p_u(x) for the MLM task. The first uses the dataset of the downstream task, referred to as JL-D. The second introduces the corpus of the pre-training phase, referred to as JL-P. We expect that using the pre-training corpus helps preserve the pre-trained features, while its greater data diversity could enhance the model's uncertainty estimation under OD settings.

Distillation from the pre-trained model: Knowledge distillation (Hinton et al., 2015) is an effective technique for reaping the benefits of a powerful teacher model. Beyond preserving accuracy, Guo et al. (2021) illustrate that the calibration performance of the teacher model can be distilled into the student model and propose distilling from an ensemble of fine-tuned LMs for better calibration. Inspired by this, we perform knowledge distillation from the pre-trained language model when performing the MLM task.
Specifically, instead of computing the MLM loss with the original tokens as hard labels, we use the KL divergence D_KL( p_mlm(x_i | x_\M; φ_0, θ_0) ‖ p_mlm(x_i | x_\M; φ, θ) ) between the predictive distribution p_mlm(x_i | x_\M; φ_0, θ_0) of the pre-trained language model and that of our model.

Regularization on the contextualized representation: Zhou & Srikumar (2022) validate that fine-tuning enlarges the distance in feature space between samples from different classes. This behavior may increase the deviation of the fine-tuned model from the pre-trained model. To address this problem, we introduce a heuristic regularization by adding the L2 norm of each training example's contextualized representation f_φ(x) with regularization coefficient β_L2. Putting L_joint together with the knowledge distillation and the regularization term, we obtain our final joint learning objective:

L_JL = α_mlm E_{p_u(x)} [ Σ_{i∈M} D_KL( p_mlm(x_i | x_\M; φ_0, θ_0) ‖ p_mlm(x_i | x_\M; φ, θ) ) ] - E_{p_d(x,y)} [ log q(y|x; φ, ϕ) ] + β_L2 ||f_φ(x)||    (4)

Following previous works (Desai & Durrett, 2020; Park & Caragea, 2022), we also apply label smoothing (Szegedy et al., 2016) on the classification task, which mitigates overconfident predictions by distributing a σ fraction of the probability mass of the ground-truth label equally over the non-ground-truth classes.
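Under the simplifying assumption that the logits of the frozen pre-trained MLM head, the fine-tuned MLM head, and the classification head are already computed, the joint objective can be sketched as (hypothetical helper names and shapes, not the released training code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) along the last axis, for the distillation term."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def smoothed_ce(logits, label, sigma=0.1):
    """Cross-entropy with label smoothing: a sigma fraction of probability
    mass is spread equally over the non-ground-truth classes."""
    k = logits.shape[-1]
    target = np.full(k, sigma / (k - 1))
    target[label] = 1.0 - sigma
    return -np.sum(target * np.log(softmax(logits)))

def joint_loss(mlm_logits_ft, mlm_logits_pt, cls_logits, label,
               feats, alpha_mlm=1.0, beta_l2=0.01, sigma=0.1):
    """Per-example L_JL sketch: KD from the frozen pre-trained MLM head
    (logits mlm_logits_pt, at masked positions) to the fine-tuned head,
    plus the smoothed classification loss, plus the L2 norm of the
    contextualized representation f_phi(x)."""
    kd = kl(softmax(mlm_logits_pt), softmax(mlm_logits_ft)).mean()
    return (alpha_mlm * kd
            + smoothed_ce(cls_logits, label, sigma)
            + beta_l2 * np.linalg.norm(feats))
```

When the fine-tuned MLM head agrees with the pre-trained one, the KD term vanishes and the objective reduces to smoothed cross-entropy plus the representation-norm penalty.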

5.1. GENERAL SETUP

Datasets: We conduct experiments on three natural language understanding tasks: natural language inference (NLI), paraphrase detection (PD), and commonsense reasoning (CR). Each task consists of a pair of in-domain (ID) and out-of-domain (OD) datasets. Specifically, SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) are ID and OD datasets for NLI; QQP (Shankar et al., 2017) and TwitterPPDB (Lan et al., 2017) are ID and OD datasets for PD; SWAG (Zellers et al., 2018) and HellaSWAG (Zellers et al., 2019) are ID and OD datasets for CR. We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020) . For each task, we fine-tune the model using the ID training set and evaluate the model's performance with both ID and OD test sets. We use WikiText-103 (Merity et al., 2016) as the corpus of the pre-training phase for JL-P. The detailed statistics for each dataset can be found in Appendix A.1.

Setup:

We follow the general training configuration provided by the Huggingface Transformers library (Wolf et al., 2020). For parameter-efficient tuning methods, we use the default hyperparameter configuration provided by the OpenDelta (Ding et al., 2022) library for all three methods and conduct a grid search for learning rates on different tasks. For pre-trained weight decay (PWD), we follow the implementation of RecAdam (Chen et al., 2020), which integrates the quadratic penalty between the model parameters and the pre-trained parameters into the Adam optimizer (Kingma & Ba, 2015). We tune the regularization strength λ_PWD for each task. For Mixout, we tune the mixout probability p_mixout. For joint learning methods, we use a Bernoulli corruption distribution. We conduct a hyperparameter search for the mask probability p_mask, the scaling factor α_mlm of the MLM loss, and the regularization coefficient β_L2 on the contextualized representation. We also tune the hyperparameter σ_ls for label smoothing (LS). We search all hyperparameters on the validation set of each task independently. Complete setup details for each method on each task can be found in Appendix A.2. All experiments are run for 3 training epochs on a single NVIDIA A40 48G GPU; fine-tuning a single model takes under 3 hours.

Evaluation:

In this section, we use the pre-trained RoBERTa_BASE (Liu et al., 2019) for all experiments. For each NLU task, we report accuracy and expected calibration error (ECE) on both ID and OD test sets. We evaluate "out-of-the-box" calibration of our method, i.e., without applying any post-hoc calibration method such as temperature scaling (Guo et al., 2017).

Table 2: Out-of-the-box calibration results of different fine-tuning methods on in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets. We report the accuracy and ECE averaged across five random fine-tuning runs, with the corresponding standard deviation in subscripts.


5.2. MAIN RESULTS

We summarize our results in Table 2. Our baselines include full fine-tuning and two advanced methods based on pre-trained language models: Bayesian Attention Belief Networks (BABN) (Zhang et al., 2021) and Mixup (Park & Caragea, 2022). We copy the results for the same setting from the original papers. We also report our full fine-tuning runs following the default behavior of the training scripts provided by the Huggingface Transformers library, and present our experimental results for pre-trained weight decay, Mixout, JL-D (w/o KD), and JL-P (w/ KD). As shown in Table 2, fully fine-tuned models generally have lower in-domain ECE than out-of-domain ECE, which illustrates that fine-tuned language models tend to be overconfident under OD settings. BABN exhibits better generalization than deterministic methods, but it has a limited effect on OD calibration. Mixup significantly lowers the ECE under both ID and OD settings while preserving accuracy comparable to vanilla fine-tuning. Moreover, all of the methods that preserve the pre-trained features in different ways are generally better calibrated than full fine-tuning, especially under the OD settings, which matches our expectation that pre-trained features help to better model the predictive confidence for OD samples. In addition to achieving competitive performance, parameter-efficient tuning methods have advantages in calibration over full fine-tuning. Fine-tuning with regularization toward the pre-trained model in parameter space also significantly improves calibration. However, Table 2 and Table 5 show that this requires maintaining a relatively strong constraint toward the pre-trained weights throughout the fine-tuning process, which leads to a large loss of raw quality in some cases, e.g., Mixout on QQP/TwitterPPDB.
Notably, the proposed joint learning (JL) methods outperform previous calibration methods in ECE across all three tasks under both ID and OD settings, which suggests that it may be more effective to encourage fine-tuned models to be consistent with pre-trained models in function space. In addition to being well-calibrated, JL models achieve the best accuracy in both ID and OD settings on the NLI and CR tasks and stay within a <1% accuracy drop compared to vanilla fine-tuning on the PD task. We also highlight that the JL models have relatively low standard deviations for both accuracy and ECE compared to other methods. As shown in Figure 4, compared to full fine-tuning, the hidden representations for OD samples under the methods described in §4 are more consistent with the PLMs, showing that they preserve the pre-trained features and mitigate catastrophic forgetting. The representations of outlier samples are also more distinguishable from the OD samples, as with the pre-trained model. Figure 5 illustrates that preserving the pre-trained features helps calibrate the fine-tuned models by mitigating the overconfident tendency toward OD and outlier samples discussed in §3.2. Specifically, parameter-efficient tuning, pre-trained weight decay, and JL-D slightly alleviate the overconfidence toward OD and outlier samples, while JL-P and Mixout significantly improve the fine-tuned models' ability to model predictive confidence for OD and outlier samples. Among these methods, JL-P with knowledge distillation proves the most effective regularization, achieving low ECE and competitive raw quality at the same time. Nevertheless, it requires access to the corpus of the pre-training phase, which may not be available in some cases.

5.3. PRESERVING PRE-TRAINED FEATURES HELPS CALIBRATE FINE-TUNED LMS

There is clearly room to further improve these methods. Mixout exhibits a promising ability to properly model the confidence of OD and outlier samples and to improve OD generalization (e.g., the result on HellaSWAG) by taking advantage of the pre-trained models. However, Mixout fails to balance the preservation of pre-trained features and the learning of the downstream task in some cases, which leads to low accuracy and high ECE. Figure 4 demonstrates that outlier samples can be easily distinguished from the hidden representations of the JL-D models, yet JL-D does not provide predictive confidence as reasonable as JL-P's, as shown in Figure 7. We leave these questions for future work.

6. RELATED WORK

Uncertainty estimation of PLMs. Previous works have demonstrated that PLMs can improve robustness and calibration on downstream tasks compared to non-pre-trained models (Hendrycks et al., 2020; Desai & Durrett, 2020). However, PLMs can still fail to model their predictive uncertainty on downstream tasks. For example, Desai & Durrett (2020); Kong et al. (2020); Guo et al. (2021) have shown that fine-tuned masked language models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)) are overconfident on text classification tasks, while Jiang et al. (2021) has shown that powerful generative PLMs (e.g., T5 (Raffel et al., 2020), BART (Lewis et al., 2020), and GPT-2 (Radford et al., 2019)) are poorly calibrated on QA tasks. In this work, we study a specific failure case of fine-tuned LMs in which the models are overconfident on OD and outlier samples due to catastrophic forgetting.
Calibration of fine-tuned LMs. Prior works apply regularization techniques such as Mixup (Zhang et al., 2018) to calibrate fine-tuned language models and exhibit effectiveness in calibration under both ID and OD settings. We tackle the problem of calibrating fine-tuned LMs from a new perspective by focusing specifically on the pre-training & fine-tuning paradigm, and validate that preserving the pre-trained features is an effective way to improve the fine-tuned LMs' calibration.
Benefit from mitigating catastrophic forgetting. Previous works have shown that mitigating catastrophic forgetting of PLMs can be helpful for various aspects of downstream tasks. For example, Chen et al. (2020); Lee et al. (2020) show that constraining the model's parameters to stay close to the pre-trained ones can improve the training stability and performance of fine-tuned LMs on downstream tasks. Xie et al. (2021) validate that standard fine-tuning can destroy the output structure of a pre-trained generative denoiser such as BART and show that preserving pre-trained features via lightweight fine-tuning can improve out-of-distribution generalization on downstream generation tasks. Dong et al. (2021) show that the pre-trained features of PLMs are beneficial for a robust objective model and improve the adversarial robustness of fine-tuned language models by maximizing the mutual information between the hidden representations of the pre-trained and fine-tuned models during the whole fine-tuning process. Our work focuses specifically on uncertainty estimation of fine-tuned LMs and makes a complementary contribution: the calibration of fine-tuned LMs can be improved by mitigating catastrophic forgetting.

7. CONCLUSIONS

In this work, we show that PLMs pre-trained on large corpora are inherently well-calibrated on the MLM task, while fine-tuned LMs suffer from overconfidence due to catastrophic forgetting. Our experimental results validate that preserving the pre-trained features can better calibrate the fine-tuned LMs. We hope our work draws more attention to deeper exploitation of the pre-trained features learned by PLMs and contributes to building safe and reliable NLP systems for real-world applications.

REPRODUCIBILITY STATEMENT

We have provided detailed setups for all experiments in §3.1, §3.2, §5.1, A.1, and A.2, and we submit our code as supplementary material. The information provided is sufficient to reproduce our results.

A APPENDIX

A.1 DATASET DETAILS

The general information of the in-domain and out-of-domain datasets of the three NLU tasks is given below:

Natural Language Inference: Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) requires the model to learn textual entailment by predicting the relationship between a given premise and hypothesis as entailment, contradiction, or neutral. Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018) shares the same task form as SNLI, but its samples come from more diverse domains.

Paraphrase Detection: Quora Question Pairs (QQP) (Shankar et al., 2017) contains question pairs from Quora. The model needs to discriminate whether the given pairs are semantically equivalent. TwitterPPDB (Lan et al., 2017) is the out-of-domain dataset, which collects sentence pairs shared with the same URLs.

Commonsense Reasoning: Situations With Adversarial Generations (SWAG) (Zellers et al., 2018) is a commonsense reasoning task that requires the model to choose the most plausible continuation of a sentence among four candidates. HellaSWAG (Zellers et al., 2019), built with Adversarial Filtering, is designed as a more challenging commonsense reasoning task for pre-trained language models.

The statistics of the datasets for both the MLM and NLU tasks are shown in Table 3. For the datasets of the NLU tasks (SNLI/MNLI, QQP/TwitterPPDB, SWAG/HellaSWAG), we use the version published by Desai & Durrett (2020). For WikiText-103, we use the version provided by the Huggingface Datasets (Lhoest et al., 2021) library. Note that the training splits of the OD datasets (MNLI, TwitterPPDB, HellaSWAG) are not used.
Dataset                               Train     Validation  Test      Labels
MNLI (Williams et al., 2018)          392,702   4,908       4,907     3
QQP (Shankar et al., 2017)            363,871   20,216      20,217    2
TwitterPPDB (Lan et al., 2017)        46,667    5,060       5,060     2
SWAG (Zellers et al., 2018)           73,547    10,004      10,004    4
HellaSWAG (Zellers et al., 2019)      39,905    5,021       5,021     4
WikiText-103 (Merity et al., 2016)

We conduct the experiments with the Huggingface Transformers library (Wolf et al., 2020). For parameter-efficient tuning, we use the implementations from the OpenDelta (Ding et al., 2022) library for the three parameter-efficient tuning methods (Adapter, LoRA, Prefix Tuning). We use the default hyperparameters provided by the OpenDelta library for each method across all three tasks. All fine-tuning methods are trained with the AdamW optimizer (Loshchilov & Hutter, 2019). For full fine-tuning, we use a learning rate of 1e-5 across all tasks. For parameter-efficient tuning methods, we search the learning rate among {1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4} and use the one with the best ID accuracy on the validation set, which matches the widespread application scenario for parameter-efficient tuning methods.

Effect of the MLM objective: As shown in Figure 6, compared to vanilla fine-tuning (α_mlm = 0), introducing the MLM objective effectively lowers the ECE in both ID and OD settings with a relatively small effect on accuracy. As the magnitude of the MLM loss increases, the features of the fine-tuned models move closer to those of the pre-trained models, reflected both in the geometry of the feature space (Figure 4) and in the Euclidean distance between the representations of the pre-trained and fine-tuned models (Figure 6), and the ECE of the fine-tuned models decreases. However, when the weight of the MLM loss becomes too large relative to the classification loss, performance on the NLU task is noticeably damaged.
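The joint objective described above combines the downstream classification loss with the MLM loss scaled by α_mlm. A minimal PyTorch sketch of this combination (the function name and tensor shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, cls_labels, mlm_logits, mlm_labels, alpha_mlm=1.0):
    """Classification loss plus the auxiliary MLM loss scaled by alpha_mlm.

    Positions in `mlm_labels` set to -100 (non-masked tokens) are ignored,
    following the usual Hugging Face labeling convention.
    """
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),  # (batch * seq_len, vocab)
        mlm_labels.view(-1),                       # (batch * seq_len,)
        ignore_index=-100,
    )
    return loss_cls + alpha_mlm * loss_mlm
```

With alpha_mlm = 0 this reduces to vanilla fine-tuning; larger values pull the representations toward the pre-trained ones, trading some task accuracy for calibration as discussed above.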
As shown in Figure 7, the overconfidence toward OD samples and outlier samples illustrated in §3.2 can be mitigated with a relatively small magnitude of α_mlm, and increasing it has a more significant effect.

Effect of introducing the corpus of the pre-training phase: From Table 2 and Figure 6, we observe that performing the MLM task with the corpus of the pre-training phase yields lower OD ECE on all three tasks. This confirms our belief that using the corpus of the pre-training phase can preserve more helpful features from the PLMs. We also notice that JL-P sometimes degrades ID calibration (e.g., calibration on SWAG in Table 2); however, applying label smoothing on downstream tasks can relieve this negative effect. An interesting observation is that although the hidden representations given by both JL-D and JL-P can distinguish outlier samples easily, JL-D holds higher confidence toward outlier samples than JL-P. As shown in Figure 7, JL-D is more confident than JL-P on samples from the BookCorpus (Zhu et al., 2015) dataset, which is seen by neither model in the fine-tuning phase for a fair comparison. This suggests that utilizing the pre-training corpora, which are more diverse than the training datasets, helps the fine-tuned LMs better model the confidence of outlier samples.

Effect of knowledge distillation from the pre-trained model: Table 7 shows that applying knowledge distillation has a limited effect on JL-D but enhances the accuracy and ECE of JL-P in most cases. We hypothesize that performing the MLM task with a corpus distinct from the downstream datasets hurts downstream performance relatively more, and that distilling from the pre-trained model's predictive distribution can be a smoother and more effective regularization.

Table 8: Samples generated by JL-D models on the SNLI and QQP test sets using the Mask-Predict algorithm with rejection sampling.
For SNLI, the model generates hypotheses given class labels {none (-), entailment, contradiction, neutral} and premise prefixes. For QQP, the model generates sentences given class labels {none (-), non-paraphrase (non-para), paraphrase (para)} and question prefixes. We mark the generated samples whose assigned label is consistent with the given label using [ ] and mark the failure cases where the assigned label is not consistent with the given label using [ ]. We also report the corresponding confidence of the class of the generated text after the label.

Text Prefix: A mountain biker rides up a hill on a red bicycle.
[ - ] A mountain biker rides a bike on a hill.
[ Entailment, 99% ] A mountain biker rides a bike up a hill.
[ Contradiction, 99% ] A mountain biker rides downhill on a blue bicycle.
[ Neutral, 65% ] A mountain biker is trying to climb a hill.

Text Prefix: A man plays the french horn as his pianist plays the supporting melody on stage.
[ - ] A man is playing a french horn for a concert.
[ Entailment, 97% ] A man is playing a french horn on a stage.
[ Contradiction, 99% ] A man is playing a flute for a crowd.
[ Neutral, 91% ] A man is playing a song on a concert stage.

Text Prefix: Two young men in unusual clothing are jumping in a gym.
[ - ] Two men are playing basketball.
[ Entailment, 96% ] Two men are jumping around.
[ Contradiction, 90% ] Two men are jumping outside.
[ Neutral, 44% ] Two men are playing basketball.

Text Prefix: a blue and gray race car driving on a dirt track.
[ - ] A race car is driving on a dirt track.
[ Entailment, 98% ] A race car is driving on a dirt track.
[ Contradiction, 99% ] A race car is parked on a dirt track.
[ Neutral, 66% ] A race car is racing on a dirt track.

Text Prefix: A boy poses in karate form and uniform.
[ - ] A boy is practicing karate.
[ Entailment, 75% ] A boy is practicing karate.
[ Contradiction, 2% ] A boy is in a costume.
[ Neutral, 25% ] A boy is practicing karate.

Text Prefix: Four females wearing helments are riding on an ATV.
[ - ] Four females are wearing helments on an ATV.
[ Entailment, 98% ] Four females are wearing helments on an ATV.
[ Contradiction, 99% ] Four females are riding on a horse in the park.

Text Prefix: What is a good data analysis book?
[ - ] What is a good free data analysis book?
[ Non-Paraphrase, 78% ] What is a good book for analysis books?
[ Paraphrase, 67% ] What is the best free data analysis book?

Text Prefix: Feeling bored. What do I do?
[ - ] What to do to get bored?
[ Non-Paraphrase, 97% ] What should I do to myself?
[ Paraphrase, 28% ] What should I start to do?



For the MLM objective in the joint learning method, we only perform the mask operation instead of the 80-10-10 strategy used in the pre-training phase. We present the results of JL-D without knowledge distillation and JL-P with knowledge distillation in Table 2; we show the full results for JL in A.3.1. The default hyperparameters can be found in the definition of each class in the documentation: https://opendelta.readthedocs.io/en/latest/modules/deltas.html. The NLI and PD tasks correspond to the text classification setting, while the CR task corresponds to the multiple-choice setting of the Huggingface Transformers library.
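The mask-only corruption mentioned above differs from BERT's 80-10-10 strategy only in what happens at the selected positions. A small sketch contrasting the two (the helper name and arguments are hypothetical, not the authors' code):

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, p_mask=0.15,
                bert_style=False, rng=None):
    """Corrupt a token sequence for MLM training.

    With bert_style=False, every selected position becomes [MASK]
    (the mask-only variant); with bert_style=True, the standard
    80-10-10 strategy is applied instead.
    """
    rng = rng or random.Random(0)
    corrupted = list(input_ids)
    labels = [-100] * len(input_ids)      # -100 = ignored by the MLM loss
    for i, tok in enumerate(input_ids):
        if rng.random() >= p_mask:
            continue                       # position not selected
        labels[i] = tok                    # predict the original token here
        if not bert_style:
            corrupted[i] = mask_id         # mask-only variant
        else:
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                     # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)   # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels
```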



Figure 3: Average predictive confidence (top) and ECE (bottom) on the validation splits of QQP (ID), TwitterPPDB (OD), and the WikiText-103 (outlier) dataset at different training steps.

Figure 4: t-SNE visualization for hidden representations of the sampled inputs from different domains given by different models fine-tuned on QQP.

Calibrating fine-tuned LMs. Several approaches have been developed to calibrate fine-tuned LMs on NLU tasks. For instance, Desai & Durrett (2020) demonstrate that temperature scaling and label smoothing can improve the calibration of the models in ID and OD settings, respectively. He et al. (2021b) introduce a new discriminative objective under the noise contrastive estimation (NCE) framework to jointly train an energy-based model defined on the classifier, which leads to better ID calibration. Fan et al. (2020); Zhang et al. (2021) model the attention weights as random variables and design a series of methods to optimize the stochastic attention layer with variational inference, which yields better accuracy and calibration compared to vanilla deterministic attention layers. Kong et al. (2020); Park & Caragea (2022) adopt Mixup
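The two post-hoc recipes from Desai & Durrett (2020) mentioned above are simple to state. A minimal NumPy sketch of both (a generic illustration of the standard definitions, not any paper's exact implementation):

```python
import numpy as np

def temperature_scale(logits, T):
    """Divide logits by a scalar temperature T (fit on held-out data)
    before the softmax; T > 1 softens overconfident probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def smooth_labels(labels, num_classes, sigma):
    """Label smoothing: (1 - sigma) on the gold class plus
    sigma / num_classes spread uniformly over all classes."""
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    return (1.0 - sigma) * one_hot + sigma / num_classes
```

Temperature scaling changes only the confidence, never the predicted class, while label smoothing alters the training targets and thus the learned model itself.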

Figure 6: Accuracy, ECE, and L2 distance to the pre-trained hidden representations (||f_φ(x) - f_φ0(x)||_2) of JL-D (top) and JL-P (bottom) over different scaling factors α_mlm of the MLM loss in ID and OD settings.

Table 8 (continued):

[ Neutral, 99% ] Four females are riding a ATV in the desert.

Text Prefix: How does cloud computing work?
[ - ] How does cloud computing in India work?
[ Non-Paraphrase, 95% ] How does Google think cloud computing work?
[ Paraphrase, 74% ] How does I understand cloud computing work?

Text Prefix: Do you think time travel is possible?
[ - ] Do you think space and time travel is possible?
[ Non-Paraphrase, 73% ] Do you think gravity can make time travel possible?
[ Paraphrase, 98% ] Do you think it is possible to time travel?



Table 3: The size of the training, validation, and test splits and the number of labels for all datasets.

Table 4 shows the learning rates used to fine-tune all the methods. For other training hyperparameters, we follow the default setup of the Huggingface Transformers library.

ACKNOWLEDGMENTS

We thank Peng Cui, Zhijie Deng, Wenbo Hu, Weize Chen, and Xu Han for valuable discussions. This work was supported by the National Key Research and Development Program of China (2020AAA0106302); NSF of China Projects (Nos. 62061136001, 61620106010, 62076145, U19B2034, U1811461, U19A2081, 6197222); a grant from Tsinghua Institute for Guo Qiang; and the High Performance Computing Center, Tsinghua University. J.Z. was also supported by the XPlorer Prize.


Published as a conference paper at ICLR 2023

Specifically, we set a batch size of 32, a maximum sequence length of 256, and a weight decay of 0.1. We use a linear-decay learning-rate scheduler without warmup and do not apply gradient clipping. Note that the reported baselines for full fine-tuning and Mixup (Desai & Durrett, 2020; Park & Caragea, 2022) in Table 2 do not use a learning-rate scheduler and use a gradient clip of 1.0, unlike our runs. There are also some minor differences between our fine-tuning setup and theirs, such as the padding strategy for input texts.

Table 4: Learning rate of fine-tuning methods on each task. The models are fine-tuned on the training split of SNLI for natural language inference (NLI), QQP for paraphrase detection (PD), and SWAG for commonsense reasoning (CR).

We conduct the hyperparameter search with the ID/OD validation sets for the models that jointly learn with the MLM objective (JL). Specifically, we set a learning rate of 1e-5 and a batch size of 32 across all three tasks, as Desai & Durrett (2020) do, except for fine-tuning JL-P with label smoothing on SWAG, where we use a larger learning rate of 5e-5. For other training parameters, we adopt the same setup described in A.2.1, except for fine-tuning JL-P on QQP, where we do not use a learning-rate scheduler. For the hyperparameters of the JL models, we search the scaling factor of the MLM loss α_mlm ∈ {0.1, 0.3, 0.5, 1, 2, 3, 4, 5}, the coefficient of the regularization term on the contextualized representation β_L2 ∈ {1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4}, the masking probability for a sentence p_mask ∈ {0.05, 0.15, 0.3, 0.4, 0.5, 0.6}, and the label smoothing hyperparameter σ_ls ∈ {0.01, 0.03, 0.05} for each task. We also search the maximum sequence length of the MLM task for JL-P and the batch size of the MLM task for both JL-D and JL-P.
In detail, we use a batch size for the MLM task of 32 on the NLI and PD tasks and of 8 on the CR task. The maximum sequence lengths of the MLM task for JL-P are set to 32/32/64 for the NLI, PD, and CR tasks, respectively. Training a single model for 3 epochs can be done in two hours with a single NVIDIA A40 48G GPU. We present the detailed hyperparameter setup of each method on each task in Table 5.

Mixout: We tune the mixout probability p_mixout ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and use p_mixout = 0.9 for all three NLU tasks. For other training parameters, we use the same default setup described in A.2.2.
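Mixout (Lee et al., 2020), tuned above via p_mixout, stochastically swaps fine-tuned parameters back to their pre-trained values with a rescaling that keeps the expected parameter unchanged. A NumPy sketch of the per-tensor operation, under our reading of the original formulation (not the authors' or any library's code):

```python
import numpy as np

def mixout(w, w_pre, p, rng):
    """Mixout: replace each entry of w with its pre-trained value w_pre
    with probability p, then rescale so that E[output] == w."""
    if p == 0.0:
        return np.array(w, dtype=float)
    m = rng.random(np.shape(w)) < p            # Bernoulli(p) mask
    mixed = np.where(m, w_pre, w)
    return (mixed - p * np.asarray(w_pre, dtype=float)) / (1.0 - p)
```

At p = 0 this is ordinary fine-tuning; large p (such as the 0.9 used here) keeps the weights anchored close to the pre-trained model on average.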

A.2.4 COMPARISON OF TRAINING TIME

We compare the training time of different methods using a single NVIDIA A40 48G GPU, with full fine-tuning as the baseline (1x). The time cost of 3 training epochs with full fine-tuning is 1/0.8/0.3 GPU hours for SNLI/QQP/SWAG. We present the time cost of different methods in Table 6.

Effect of regularization on the contextualized representation: As shown in Figure 8, the introduced heuristic regularization term can improve the ECE of JL-D models under both ID and OD settings by smoothing the models' predictive confidence. The effect of this regularization term is similar to applying temperature scaling on the ID validation set, where large magnitudes can lead to overly conservative predictive confidence. We also find that a proper choice of β_L2 can marginally benefit accuracy under ID and OD settings. However, the effect of this term on Full-FT models is limited.

Table 8 shows the samples generated by JL-D models with the sampling algorithm described above. The JL-D models are able to generate text consistent with the given labels, which can serve as a diagnostic for the model. For example, the models prefer to generate contradictory hypotheses by changing the object's entity or adjectives with high confidence, and tend to copy the prefix to generate positive textual entailment, which may expose some spurious correlations that the models rely on when performing the NLU tasks (Tu et al., 2020).
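The ECE numbers reported throughout this appendix follow the standard binned estimator: a weighted average of the gap between mean confidence and accuracy within each confidence bin. For reference, a minimal sketch of that generic definition (not our exact evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE over per-example top-class confidences and 0/1
    correctness indicators, using equal-width bins on (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()                          # fraction of samples
        gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
        ece += weight * gap
    return ece
```

A perfectly calibrated model (confidence equal to accuracy in every bin) attains an ECE of zero; overconfident models accumulate positive gaps in the high-confidence bins.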

