PREDICTING INDUCTIVE BIASES OF PRE-TRAINED MODELS

Abstract

Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and explain when it fails. So far, such analyses have produced two sets of apparently contradictory results. Work that analyzes the representations that result from pre-training (via "probing classifiers") finds evidence that rich features of linguistic structure can be decoded with high accuracy, but work that analyzes model behavior after fine-tuning (via "challenge sets") indicates that decisions are often based not on such structure but rather on spurious heuristics specific to the training set. In this work, we test the hypothesis that the extent to which a feature influences a model's decisions can be predicted using a combination of two factors: the feature's extractability after pre-training (measured using information-theoretic probing techniques), and the evidence available during fine-tuning (defined as the feature's co-occurrence rate with the label). In experiments with both synthetic and naturalistic data, we find strong evidence (statistically significant correlations) supporting this hypothesis.

1. INTRODUCTION

Large pre-trained language models (LMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) have demonstrated impressive empirical success on a range of benchmark NLP tasks. However, analyses have shown that such models are easily fooled when tested on distributions that differ from those they were trained on, suggesting they are often "right for the wrong reasons" (McCoy et al., 2019). Recent research that attempts to understand why such models behave in this way has primarily made use of two analysis techniques: probing classifiers (Adi et al., 2017; Hupkes et al., 2018), which measure whether or not a given feature is encoded by a representation, and challenge sets (Cooper et al., 1996; Linzen et al., 2016; Rudinger et al., 2018), which measure whether model behavior in practice is consistent with use of a given feature. The results obtained via these two techniques currently suggest different conclusions about how well pre-trained representations encode language. Work based on probing classifiers has consistently found evidence that models contain rich information about syntactic structure (Hewitt & Manning, 2019; Bau et al., 2019; Tenney et al., 2019a), while work using challenge sets has frequently revealed that models built on top of these representations do not behave as though they have access to such rich features; rather, they fail in trivial ways (Dasgupta et al., 2018; Glockner et al., 2018; Naik et al., 2018). In this work, we attempt to link these two contrasting views of feature representations. We assume the standard recipe in NLP, in which linguistic representations are first derived from large-scale self-supervised pre-training intended to encode broadly useful linguistic features, and then are adapted for a task of interest via transfer learning, or fine-tuning, on a task-specific dataset. We test the

