REVISITING FEW-SAMPLE BERT FINE-TUNING

Abstract

This paper is a study of fine-tuning BERT contextual representations, with a focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network to downstream tasks; and the prevalent practice of using a pre-determined, small number of training iterations. We empirically test the impact of these factors and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we revisit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe that the impact of these methods diminishes significantly with our modified process.

* Equal contribution. Work done at ASAPP.
¹ Fine-tuning instability is also receiving significant practitioner attention.

1. INTRODUCTION

Fine-tuning self-supervised pre-trained models has significantly boosted state-of-the-art performance on natural language processing (NLP) tasks (Liu, 2019; Yang et al., 2019a; Wadden et al., 2019; Zhu et al., 2020; Guu et al., 2020). One of the most effective models for this process is BERT (Devlin et al., 2019). However, despite significant success, fine-tuning remains unstable, especially when using the large variant of BERT (BERT-Large) on small datasets, where pre-training stands to provide the most significant benefit. Identical learning processes with different random seeds often result in significantly different, and sometimes degenerate, models following fine-tuning, even though only a few, seemingly insignificant aspects of the learning process are impacted by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020).¹ As a result, practitioners resort to multiple random trials for model selection. This increases model deployment costs and time, and makes scientific comparison challenging (Dodge et al., 2020).

This paper is a study of different aspects of the few-sample fine-tuning optimization process. Our goal is to better understand the impact of common choices with regard to the optimization algorithm, model initialization, and the number of fine-tuning training iterations. We identify suboptimalities in common community practices: the use of a non-standard optimizer introduces bias in the gradient estimation; the top layers of the pre-trained BERT model provide a bad initialization point for fine-tuning; and the use of a pre-determined, commonly adopted number of training iterations hurts convergence. We study these issues and their remedies through experiments on multiple common benchmarks, focusing on few-sample fine-tuning scenarios. Once these suboptimal practices are addressed, we observe that degenerate runs are eliminated and performance becomes much more stable.
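The non-standard optimizer in question is an Adam variant that omits Adam's bias-correction terms. The following minimal sketch of a single scalar Adam step, with the correction toggled, illustrates why this matters early in training; the function name, defaults, and scenario are illustrative, not the paper's implementation:

```python
import math

def adam_step(m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update for a scalar parameter; returns (delta, m, v).

    With bias_correction=False this mimics the non-standard variant,
    which skips the debiasing of the first- and second-moment
    estimates m and v.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)   # debias first moment
        v_hat = v / (1 - beta2 ** t)   # debias second moment
    else:
        m_hat, v_hat = m, v
    delta = lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

# Very first step (t=1) with a unit gradient: the bias-corrected update
# moves by roughly the full learning rate, while the uncorrected one
# takes a noticeably larger step.
corrected, _, _ = adam_step(0.0, 0.0, 1.0, t=1)
uncorrected, _, _ = adam_step(0.0, 0.0, 1.0, t=1, bias_correction=False)
```

Because the first and second moments are shrunk by different factors (1 - β₁ᵗ vs. 1 - β₂ᵗ), dropping the correction inflates early updates, which is one source of instability in short fine-tuning runs.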
This makes it unnecessary to execute numerous random restarts, as proposed in Dodge et al. (2020). Our experiments show that the remedies for each issue have overlapping effects. For example, allocating more training iterations can eventually compensate for using the non-standard biased optimizer, while the combination of a bias-corrected optimizer and re-initializing some of the pre-trained model parameters can reduce fine-tuning computational costs. This empirically highlights how different aspects of fine-tuning influence the stability of the process, at times in a similar manner. In light of our observations, we re-evaluate several techniques (Phang et al., 2018; Lee et al., 2020; Howard & Ruder, 2018) that
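The re-initialization step mentioned above can be sketched as follows. The helper below is a hypothetical stand-in that treats an encoder stack as a list of per-layer parameter dicts; it replaces the top-k layers with a fresh Gaussian initialization while leaving the lower, pre-trained layers untouched:

```python
import random

def reinit_top_layers(layers, k, std=0.02):
    """Re-initialize the top-k layers with N(0, std^2) weights.

    `layers` is a list of {name: list-of-floats} parameter dicts,
    ordered bottom to top -- a toy stand-in for a transformer encoder
    stack. Lower layers keep their pre-trained values.
    """
    for layer in layers[-k:]:
        for name, weights in layer.items():
            layer[name] = [random.gauss(0.0, std) for _ in weights]
    return layers

# Toy stack of 4 "layers", each with one weight vector; re-initialize
# only the top 2, keeping the bottom 2 pre-trained.
stack = [{"w": [1.0, 1.0, 1.0]} for _ in range(4)]
stack = reinit_top_layers(stack, k=2)
```

With a real model (e.g. a Hugging Face `transformers` BERT), the analogous operation would apply the library's weight-initialization routine to the last few modules of `model.bert.encoder.layer`; the exact hook varies by library version, so the snippet above only conveys the idea.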


