BETTER FINE-TUNING BY REDUCING REPRESENTATIONAL COLLAPSE

Abstract

Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. This paper presents a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampled from either a normal or a uniform distribution), thereby discouraging representation change during fine-tuning where doing so does not hurt performance. We also introduce a new analysis to motivate the use of trust region methods more generally by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.

1. INTRODUCTION

Pre-trained language models (Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Lewis et al., 2019; 2020) have been shown to capture a wide array of semantic, syntactic, and world knowledge (Clark et al., 2019), and provide the de facto initialization for modeling most existing NLP tasks. However, fine-tuning them for each task is a highly unstable process, with many hyper-parameter settings producing failed fine-tuning runs, unstable results (considerable variation between random seeds), over-fitting, and other unwanted consequences (Zhang et al., 2020; Dodge et al., 2020). Recently, trust-region or adversarial-based approaches, including SMART (Jiang et al., 2019) and FreeLB (Zhu et al., 2019), have been shown to increase the stability and accuracy of fine-tuning by adding constraints that limit how much fine-tuning changes the initial parameters. However, these methods are significantly more computationally and memory intensive than the commonly adopted simple gradient-based approaches. This paper presents a lightweight fine-tuning strategy that matches or improves performance relative to SMART and FreeLB while requiring just a fraction of the computational and memory overhead and no additional backward passes. Our approach is motivated by trust region theory, yet reduces to simply regularizing the model with respect to parametric noise applied to the original pre-trained representations. We show uniformly better performance, setting a new state of the art for RoBERTa fine-tuning on GLUE and reaching state of the art on XNLI using no novel pre-training approaches (Liu et al., 2019; Wang et al., 2018; Conneau et al., 2018). Furthermore, the low overhead of our family of fine-tuning methods allows our method to be applied to generation tasks, where we consistently outperform standard fine-tuning, setting state of the art on summarization tasks.
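To make the noise-based regularization concrete, the following is a minimal, self-contained sketch of the core idea: perturb the input representations with parametric noise and penalize the symmetric KL divergence between the model's output distributions on the clean and perturbed inputs. All names here (`r3f_regularizer`, the toy linear `model`, the epsilon value) are hypothetical illustrations, not the paper's actual implementation, which operates on transformer embeddings in PyTorch.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def r3f_regularizer(model, embeddings, eps=0.1, rng=random):
    """Symmetric KL between model outputs on clean vs. noise-perturbed inputs.

    `model` maps an embedding vector to logits; noise is drawn from
    U(-eps, eps), mirroring the uniform-noise variant mentioned in the text
    (a normal distribution could be substituted). Toy setup, not the paper's code.
    """
    noisy = [e + rng.uniform(-eps, eps) for e in embeddings]
    p = softmax(model(embeddings))
    q = softmax(model(noisy))
    return kl(p, q) + kl(q, p)

# Toy linear "model": logits are dot products against fixed class weights.
weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]
def model(emb):
    return [sum(w_i * e_i for w_i, e_i in zip(w, emb)) for w in weights]

random.seed(0)
reg = r3f_regularizer(model, [1.0, 2.0, -0.5])
# During fine-tuning this term would be scaled and added to the task loss;
# it is always >= 0 and shrinks as the model grows invariant to small
# perturbations of its input representations.
```

Because the penalty needs only one extra forward pass on the noisy input (rather than the inner adversarial optimization loop of SMART or FreeLB), it adds no additional backward passes, which is the source of the efficiency gain claimed above.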

