REVISITING FEW-SAMPLE BERT FINE-TUNING

Abstract

This paper is a study of fine-tuning of BERT contextual representations, with a focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined and small number of training iterations. We empirically test the impact of these factors and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe that the impact of these methods diminishes significantly with our modified process.

1. INTRODUCTION

Fine-tuning self-supervised pre-trained models has significantly boosted state-of-the-art performance on natural language processing (NLP) tasks (Liu, 2019; Yang et al., 2019a; Wadden et al., 2019; Zhu et al., 2020; Guu et al., 2020). One of the most effective models for this process is BERT (Devlin et al., 2019). However, despite significant success, fine-tuning remains unstable, especially when using the large variant of BERT (BERT-Large) on small datasets, where pre-training stands to provide the most significant benefit. Identical learning processes with different random seeds often result in significantly different, and sometimes degenerate, models following fine-tuning, even though only a few seemingly insignificant aspects of the learning process are impacted by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). As a result, practitioners resort to multiple random trials for model selection. This increases model deployment costs and time, and makes scientific comparison challenging (Dodge et al., 2020).

This paper is a study of different aspects of the few-sample fine-tuning optimization process. Our goal is to better understand the impact of common choices with regard to the optimization algorithm, model initialization, and the number of fine-tuning training iterations. We identify suboptimalities in common community practices: the use of a non-standard optimizer introduces bias into the gradient estimation; the top layers of the pre-trained BERT model provide a bad initialization point for fine-tuning; and the use of a pre-determined but commonly adopted number of training iterations hurts convergence. We study these issues and their remedies through experiments on multiple common benchmarks, focusing on few-sample fine-tuning scenarios. Once these suboptimal practices are addressed, we observe that degenerate runs are eliminated and performance becomes much more stable.
This makes it unnecessary to execute numerous random restarts as proposed in Dodge et al. (2020). Our experiments show that the remedies we experiment with for each issue have overlapping effects. For example, allocating more training iterations can eventually compensate for using the non-standard biased optimizer, while the combination of a bias-corrected optimizer and re-initializing some of the pre-trained model parameters can reduce fine-tuning computational costs. This empirically highlights how different aspects of fine-tuning influence the stability of the process, at times in a similar manner. In light of our observations, we re-evaluate several techniques (Phang et al., 2018; Lee et al., 2020; Howard & Ruder, 2018) that were recently proposed to increase few-sample fine-tuning stability, and show a significant decrease in their impact. Our work furthers the empirical understanding of the fine-tuning process, and the optimization practices we outline identify impactful avenues for the development of future methods.
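The non-standard optimizer referenced above is the so-called BERTAdam variant, which omits Adam's bias-correction terms. The following scalar sketch contrasts the two updates; the function and variable names are ours, and the hyperparameter values are illustrative:

```python
import math

def adam_step(m, v, g, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-6,
              bias_correction=True):
    """One Adam update for a single scalar parameter.

    With bias_correction=False this mimics the non-standard BERTAdam
    variant: early in training, the second-moment estimate is strongly
    biased toward zero, which inflates the effective step size.
    """
    m = b1 * m + (1 - b1) * g          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment (variance) estimate
    if bias_correction:
        m_hat = m / (1 - b1 ** t)      # debias both moment estimates
        v_hat = v / (1 - b2 ** t)
    else:
        m_hat, v_hat = m, v            # BERTAdam: use biased estimates
    return m, v, lr * m_hat / (math.sqrt(v_hat) + eps)

# At the first step (t=1), the biased variant takes a much larger step;
# bias correction acts like a warmup on the effective learning rate.
_, _, corrected = adam_step(0.0, 0.0, g=1.0, t=1, bias_correction=True)
_, _, biased = adam_step(0.0, 0.0, g=1.0, t=1, bias_correction=False)
```

Here `corrected` is approximately the raw learning rate, while `biased` is roughly three times larger, illustrating why the biased variant can destabilize the earliest updates.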

2. BACKGROUND AND RELATED WORK

BERT The Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) model is a Transformer encoder (Vaswani et al., 2017) trained on raw text using masked language modeling and next-sentence prediction objectives. For each input token, it generates an embedding vector contextualized through a stack of Transformer blocks. BERT prepends a special [CLS] token to the input sentence or sentence pair. The embedding of this token is used as a summary of the input for classification tasks, and is computed by passing the final [CLS] representation through an additional fully-connected layer with a tanh non-linearity, commonly referred to as the pooler.
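As a toy illustration of the pooler described above, the following NumPy sketch applies a fully-connected layer with a tanh non-linearity to the [CLS] position. The dimensions and random weights are illustrative stand-ins, not BERT's actual parameters:

```python
import numpy as np

def pooler(cls_embedding, W, b):
    """BERT-style pooler: a fully-connected layer followed by tanh,
    applied to the [CLS] token's final-layer embedding."""
    return np.tanh(W @ cls_embedding + b)

hidden = 8                                   # toy size; BERT-Large uses 1024
rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, hidden))      # 5 token embeddings; index 0 is [CLS]
W = 0.1 * rng.normal(size=(hidden, hidden))  # stand-in pooler weights
b = np.zeros(hidden)

summary = pooler(sequence[0], W, b)          # summary vector used for classification
```

The tanh keeps every coordinate of the summary vector in (-1, 1); a task-specific classifier is then applied on top of this vector.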

Fine-tuning

The common approach for using the pre-trained BERT model is to replace the original output layer with a new task-specific layer and fine-tune the complete model. This includes learning the new output layer parameters and modifying all the original weights, including the weights of the word embeddings, Transformer blocks, and the pooler. For example, for sentence-level classification, an added linear classifier projects the [CLS] embedding to an unnormalized probability vector over the output classes. This process introduces two sources of randomness: the weight initialization of the new output layer and the data order in the stochastic fine-tuning optimization. Existing work (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020) shows that these seemingly benign factors can influence the results significantly, especially on small datasets (i.e., < 10K examples). Consequently, practitioners often conduct many random trials of fine-tuning and pick the best model based on validation performance (Devlin et al., 2019).

BERT Representation Transferability BERT pre-trained representations have been widely studied using probing methods, showing that the pre-trained features from intermediate layers are more transferable (Tenney et al., 2019b;a; Liu et al., 2019a; Hewitt & Manning, 2019; Hewitt & Liang, 2019) or applicable (Zhang et al., 2020) to new tasks than features from later layers, which change more after fine-tuning (Peters et al., 2019; Merchant et al., 2020). Our work is inspired by these findings, but focuses on studying how the pre-trained weights influence the fine-tuning process. Li et al. (2020) propose to re-initialize the final fully-connected layer of a ConvNet and show performance gains for image classification. Concurrently with our work, Tamkin et al. (2020) adopt a similar methodology of weight re-initialization (Section 5) to study the transferability of BERT (this concurrent work was published shortly after our study was posted). In contrast to our study, their work emphasizes pinpointing the layers that contribute the most to transfer learning, and the relation between probing performance and transferability.
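The weight re-initialization methodology mentioned above can be sketched as follows. This is a simplified stand-in in which each "block" is a dict of NumPy arrays; BERT's actual initializer is a truncated normal with standard deviation 0.02, approximated here by a plain Gaussian:

```python
import numpy as np

def reinit_top_layers(layers, k, init_std=0.02, seed=0):
    """Re-initialize the top-k Transformer blocks with fresh Gaussian
    draws, leaving the lower blocks' pre-trained weights untouched."""
    rng = np.random.default_rng(seed)
    out = list(layers)
    for i in range(len(layers) - k, len(layers)):
        # Replace every weight array in this block with a fresh draw
        out[i] = {name: rng.normal(0.0, init_std, size=w.shape)
                  for name, w in layers[i].items()}
    return out

# Four toy "blocks", each holding a single weight matrix of ones
pretrained = [{"weight": np.ones((3, 3))} for _ in range(4)]
reinit = reinit_top_layers(pretrained, k=2)
```

After the call, the bottom two blocks retain their pre-trained weights while the top two start from a fresh initialization, discarding the pre-trained values closest to the output.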

3. EXPERIMENTAL METHODOLOGY

Data We follow the data setup of previous studies (Lee et al., 2020; Phang et al., 2018; Dodge et al., 2020) to study few-sample fine-tuning using eight datasets from the GLUE benchmark (Wang et al., 2019b). The datasets cover four tasks: natural language inference (RTE, QNLI, MNLI), paraphrase detection (MRPC, QQP), sentiment classification (SST-2), and linguistic acceptability (CoLA). Appendix A provides dataset statistics and a description of each dataset. We primarily






Fine-tuning Instability The instability of the BERT fine-tuning process has been known since its introduction (Devlin et al., 2019), and various methods have been proposed to address it. Phang et al. (2018) show that fine-tuning the pre-trained model on a large intermediate task stabilizes later fine-tuning on small datasets. Lee et al. (2020) introduce a new regularization method to constrain the fine-tuned model to stay close to the pre-trained weights and show that it stabilizes fine-tuning. Dodge et al. (2020) propose an early stopping method to efficiently filter out random seeds likely to lead to bad performance. Concurrently to our work, Mosbach et al. (2020) also show that BERTADAM leads to instability during fine-tuning. Our experiments studying the effect of training longer are related to previous work studying this question in the context of training models from scratch (Popel & Bojar, 2018; Nakkiran et al., 2019).
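The multiple-random-trial practice that these methods aim to make unnecessary can be sketched as follows; `fake_fine_tune` is a hypothetical stand-in for a full fine-tuning run, with the degenerate-run rate chosen for illustration only:

```python
import random

def best_of_n_trials(fine_tune, num_trials=20):
    """Run fine-tuning with several random seeds and keep the model
    with the best validation score, as practitioners commonly do."""
    results = [fine_tune(seed) for seed in range(num_trials)]
    return max(results, key=lambda r: r[1])

def fake_fine_tune(seed):
    """Hypothetical stand-in for a fine-tuning run: roughly 30% of
    seeds produce a degenerate, near-chance model."""
    rng = random.Random(seed)
    if rng.random() < 0.3:
        return f"model-{seed}", 0.5           # degenerate run
    return f"model-{seed}", rng.uniform(0.85, 0.92)

best_model, best_score = best_of_n_trials(fake_fine_tune)
```

Each trial here costs a full fine-tuning run, which is why eliminating degenerate runs at the source, rather than filtering seeds after the fact, reduces overall computation.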


