BETTER FINE-TUNING BY REDUCING REPRESENTATIONAL COLLAPSE

Abstract

Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. This paper presents a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.

1. INTRODUCTION

Pre-trained language models (Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Lewis et al., 2019; 2020) have been shown to capture a wide array of semantic, syntactic, and world knowledge (Clark et al., 2019), and provide the de facto initialization for modeling most existing NLP tasks. However, fine-tuning them for each task is a highly unstable process: many hyper-parameter settings produce failed fine-tuning runs, unstable results (considerable variation between random seeds), over-fitting, and other unwanted consequences (Zhang et al., 2020; Dodge et al., 2020). Recently, trust-region or adversarial approaches, including SMART (Jiang et al., 2019) and FreeLB (Zhu et al., 2019), have been shown to increase the stability and accuracy of fine-tuning by adding constraints that limit how much fine-tuning changes the initial parameters. However, these methods are significantly more computationally and memory intensive than the more commonly adopted simple gradient-based approaches. This paper presents a lightweight fine-tuning strategy that matches or improves performance relative to SMART and FreeLB while requiring just a fraction of the computational and memory overhead and no additional backward passes. Our approach is motivated by trust region theory while reducing to simply regularizing the model relative to parametric noise applied to the original pre-trained representations. We show uniformly better performance, setting a new state of the art for RoBERTa fine-tuning on GLUE and reaching state of the art on XNLI using no novel pre-training approaches (Liu et al., 2019; Wang et al., 2018; Conneau et al., 2018). Furthermore, the low overhead of our family of fine-tuning methods allows them to be applied to generation tasks, where we consistently outperform standard fine-tuning, setting state of the art on summarization tasks.
We also introduce a new analysis to motivate the use of trust-region-style methods more generally, by defining a new notion of representational collapse and introducing a new methodology for measuring it during fine-tuning. Representational collapse is the degradation of generalizable representations of pre-trained models during the fine-tuning stage. We empirically show that standard fine-tuning degrades generalizable representations through a series of probing experiments on GLUE tasks. Furthermore, we attribute this phenomenon to the use of standard gradient descent algorithms for the fine-tuning stage. We also find that (1) recently proposed fine-tuning methods rooted in trust region, i.e., SMART, can alleviate representational collapse, and (2) our methods alleviate representational collapse to an even greater degree, manifesting in better performance across almost all datasets and models.

Our contributions in this paper are the following:

• We propose a novel approach to fine-tuning rooted in trust region theory, which we show directly alleviates representational collapse at a fraction of the cost of other recently proposed fine-tuning methods.
• Through extensive experimentation, we show that our method outperforms the standard fine-tuning methodology following recently proposed best practices from Zhang et al. (2020). We improve various SOTA models, from sentence prediction to summarization, and from monolingual to cross-lingual settings.
• We further define and explore the phenomenon of representational collapse in fine-tuning and directly correlate it with generalization on tasks of interest.

2. LEARNING ROBUST REPRESENTATIONS THROUGH REGULARIZED FINE-TUNING

We are interested in deriving methods for fine-tuning representations that provide guarantees on the movement of representations, in the sense that the model does not forget the original pre-trained representations when it is fine-tuned for new tasks (see Section 4 for more details). We introduce a new fine-tuning method rooted in an approximation to trust region, which provides guarantees for stochastic gradient descent algorithms by bounding a divergence between the model at update t and at update t + 1 (Pascanu & Bengio, 2013; Schulman et al., 2015b; Jiang et al., 2019).

Let f : ℝ^(m×n) → ℝ^p be a function which returns some pre-trained representation, parameterized by θ_f, from m tokens embedded into fixed vectors of size n. Let the learned classification head g : ℝ^p → ℝ^q be a function, parameterized by θ_g, which takes an input from f and outputs a valid probability distribution in q dimensions, and let X be our dataset. In the case of generation, we can assume the classification head is simply an identity function or a softmax, depending on the loss function. Let L(θ) denote a loss function given by θ = [θ_f, θ_g]. We are interested in minimizing L with respect to θ such that each update step is constrained by movement in the representational density space p(f). More formally, given an arbitrary ε:

    arg min_{Δθ} L(θ + Δθ)   s.t.   KL( p(f(·; θ_f)) ‖ p(f(·; θ_f + Δθ_f)) ) ≤ ε

This constrained optimization problem is equivalent to doing natural gradient descent directly over the representations (Pascanu & Bengio, 2013). Unfortunately, we do not have direct access to the density of representations; therefore, it is not trivial to bound this quantity directly. Instead, we propose to do natural gradient descent over g ∘ f with an additional constraint that g is at most 1-Lipschitz (which naturally constrains the change of representations; see Section A.1 in the Appendix). Traditional computation of the natural gradient is computationally prohibitive due to the need to invert the Hessian.
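Since the representational density is inaccessible, the practical form of this idea (as described in the abstract) regularizes the output distribution of g ∘ f against parametric noise applied to the input embeddings, penalizing divergence between the clean and perturbed predictions. The NumPy sketch below illustrates the shape of such a regularizer; the toy encoder f, head g, noise scale, and weight lam are all hypothetical stand-ins for illustration, not the paper's actual architecture or hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical stand-ins: f pools m token embeddings and projects to p dims,
# g maps the representation to a probability distribution over q classes.
W_f = rng.normal(size=(8, 4))   # "encoder" weights (n=8 -> p=4)
W_g = rng.normal(size=(4, 3))   # "head" weights (p=4 -> q=3)

def f(embeds):                  # embeds: (m tokens, n=8)
    return embeds.mean(axis=0) @ W_f

def g(rep):
    return softmax(rep @ W_g)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

embeds = rng.normal(size=(5, 8))                    # m=5 token embeddings
noise = rng.normal(scale=1e-2, size=embeds.shape)   # parametric (normal) noise

p = g(f(embeds))            # prediction on clean embeddings
q = g(f(embeds + noise))    # prediction under noisy embeddings

# Symmetric KL regularizer added to the task loss; lam is an arbitrary weight.
lam = 1.0
reg = lam * (kl(p, q) + kl(q, p))
# total_loss = task_loss + reg
```

Because the perturbation is small, the divergence term stays near zero unless an update makes the output distribution sharply sensitive to input noise, which is exactly the movement the trust-region constraint is meant to discourage.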
An alternative formulation of natural gradient can be stated through mirror descent, using Bregman divergences (Raskutti & Mukherjee, 2015; Jiang et al., 2019). This method primarily serves as a robust regularizer by preventing large updates in the model's probability space. This family of methods is classically known as trust-region methods (Pascanu & Bengio, 2013; Schulman et al., 2015a).
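The link between mirror descent and movement in the model's probability space rests on a standard fact: the Bregman divergence generated by the negative-entropy function is exactly the KL divergence, so a Bregman proximal penalty in this geometry is a KL penalty on the output distributions. A small NumPy check of this identity (the two distributions are chosen arbitrarily):

```python
import numpy as np

def neg_entropy(p):
    # Generating function psi(p) = sum_i p_i log p_i
    return float(np.sum(p * np.log(p)))

def grad_neg_entropy(q):
    # Gradient of psi: log q_i + 1
    return np.log(q) + 1.0

def bregman(p, q):
    # D_psi(p, q) = psi(p) - psi(q) - <grad psi(q), p - q>
    return neg_entropy(p) - neg_entropy(q) - float(np.dot(grad_neg_entropy(q), p - q))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

diff = abs(bregman(p, q) - kl(p, q))  # identical up to floating-point error
```

For probability vectors the inner-product term's "+1" contribution cancels (both sum to 1), leaving exactly Σ p_i log(p_i/q_i), i.e., KL(p‖q).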

