PREDICTING INDUCTIVE BIASES OF PRE-TRAINED MODELS

Abstract

Most current NLP systems are based on a pre-train-then-fine-tune paradigm, in which a large neural network is first trained in a self-supervised way designed to encourage the network to extract broadly-useful linguistic features, and then fine-tuned for a specific task of interest. Recent work attempts to understand why this recipe works and explain when it fails. Currently, such analyses have produced two sets of apparently-contradictory results. Work that analyzes the representations that result from pre-training (via "probing classifiers") finds evidence that rich features of linguistic structure can be decoded with high accuracy, but work that analyzes model behavior after fine-tuning (via "challenge sets") indicates that decisions are often not based on such structure but rather on spurious heuristics specific to the training set. In this work, we test the hypothesis that the extent to which a feature influences a model's decisions can be predicted using a combination of two factors: The feature's extractability after pre-training (measured using information-theoretic probing techniques), and the evidence available during fine-tuning (defined as the feature's co-occurrence rate with the label). In experiments with both synthetic and naturalistic data, we find strong evidence (statistically significant correlations) supporting this hypothesis.

1. INTRODUCTION

Large pre-trained language models (LMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) have demonstrated impressive empirical success on a range of benchmark NLP tasks. However, analyses have shown that such models are easily fooled when tested on distributions that differ from those they were trained on, suggesting they are often "right for the wrong reasons" (McCoy et al., 2019). Recent research that attempts to understand why such models behave in this way has primarily made use of two analysis techniques: probing classifiers (Adi et al., 2017; Hupkes et al., 2018), which measure whether or not a given feature is encoded by a representation, and challenge sets (Cooper et al., 1996; Linzen et al., 2016; Rudinger et al., 2018), which measure whether model behavior in practice is consistent with use of a given feature. The results obtained via these two techniques currently suggest different conclusions about how well pre-trained representations encode language. Work based on probing classifiers has consistently found evidence that models contain rich information about syntactic structure (Hewitt & Manning, 2019; Bau et al., 2019; Tenney et al., 2019a), while work using challenge sets has frequently revealed that models built on top of these representations do not behave as though they have access to such rich features, but rather fail in trivial ways (Dasgupta et al., 2018; Glockner et al., 2018; Naik et al., 2018).

In this work, we attempt to link these two contrasting views of feature representations. We assume the standard recipe in NLP, in which linguistic representations are first derived from large-scale self-supervised pre-training intended to encode broadly-useful linguistic features, and then are adapted for a task of interest via transfer learning, or fine-tuning, on a task-specific dataset.
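To make the probing-classifier methodology concrete, a minimal sketch: train a simple linear classifier on frozen representations to predict a feature, and treat held-out accuracy as a measure of the feature's extractability. The representations, feature, and dimensions below are synthetic stand-ins for illustration, not any particular LM's embeddings:

```python
import numpy as np

# Hypothetical setup: "representations" are 32-d vectors in which a binary
# feature is (noisily) encoded along a fixed linear direction.
rng = np.random.default_rng(0)
n, d = 2000, 32
feature = rng.integers(0, 2, size=n)             # the property being probed
direction = rng.normal(size=d)
reps = rng.normal(size=(n, d)) + 2.0 * feature[:, None] * direction

def train_probe(X, y, lr=0.1, steps=500):
    """A linear probe: logistic regression via gradient descent on frozen reps."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)          # clip logits for stability
        p = 1.0 / (1.0 + np.exp(-z))
        g = p - y                                # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

w, b = train_probe(reps[:1500], feature[:1500])
pred = (reps[1500:] @ w + b) > 0
acc = (pred == feature[1500:]).mean()
print(f"probe accuracy: {acc:.2f}")              # near 1.0: highly extractable
```

High probe accuracy here only shows the feature is decodable from the representation; as the body of the paper argues, it does not by itself tell us whether a fine-tuned model will actually base its decisions on that feature.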
We test the hypothesis that the extent to which a fine-tuned model uses a given feature can be explained as a function of two metrics: The extractability of the feature after pre-training (as measured by probing classifiers) and the evidence available during fine-tuning (defined as the rate of co-occurrence with the label). We first show results on a synthetic task, and then using state-of-the-art pre-trained LMs on language data. Our results suggest that probing classifiers can be viewed as a measure of the pre-trained representation's inductive biases: The more extractable a feature is after pre-training, the less statistical evidence is required for the model to adopt the feature during fine-tuning.

Contribution. This work establishes a relationship between two widely-used techniques for analyzing LMs. Currently, the question of how models' internal representations (measured by probing classifiers) influence model behavior (measured by challenge sets) remains open (Belinkov & Glass, 2019; Belinkov et al., 2020). Understanding the connection between these two measurement techniques can enable more principled evaluation of and control over neural NLP models.

For example, McCoy et al. (2019) find that a fine-tuned model assumes that the sentence "the lawyer followed the judge" entails "the judge followed the lawyer" purely because all the words in the latter appear in the former. While this heuristic is statistically favorable given the model's training data, it is not infallible: 90% of the training examples containing lexical overlap had the label "entailment", but the remaining 10% did not. Moreover, the results of recent studies based on probing classifiers suggest that more robust features are extractable with high reliability from BERT representations.
For example, given the example "the lawyer followed the judge"/"the judge followed the lawyer", if the model can represent that "lawyer" is the agent of "follow" in the first sentence, but is the patient in the second, then the model should conclude that the sentences have different meanings. Such semantic role information can be recovered at > 90% accuracy from BERT embeddings (Tenney et al., 2019b) . Thus, the question is: Why would a model prefer a weak feature over a stronger one, if both features are extractable from the model's representations and justified by the model's training data?
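The "evidence" side of this question can be made concrete: the co-occurrence rate is simply the empirical probability of the label given the spurious feature. Below is a hypothetical toy illustration, with lexical overlap as the spurious feature; the sentence pairs and labels are invented for this sketch, not drawn from any real NLI dataset:

```python
# Invented toy examples: (premise, hypothesis, label), label 1 = entailment.
examples = [
    ("the lawyer followed the judge", "the judge followed the lawyer", 0),
    ("the actor saw the doctor",      "the actor saw the doctor",      1),
    ("the doctor helped the actor",   "the doctor helped",             1),
    ("the judge met the lawyer",      "the lawyer met the judge",      1),
    ("the doctor called the judge",   "the judge slept",               0),
]

def lexical_overlap(premise: str, hypothesis: str) -> bool:
    """Spurious feature s: every hypothesis word also appears in the premise."""
    return set(hypothesis.split()) <= set(premise.split())

# Co-occurrence rate of s with the label: P(entailment | lexical overlap).
with_s = [label for p, h, label in examples if lexical_overlap(p, h)]
cooccurrence = sum(with_s) / len(with_s)
print(f"P(entailment | lexical overlap) = {cooccurrence:.2f}")  # prints 0.75
```

On this toy sample the heuristic is right 75% of the time, loosely mirroring the ~90% rate McCoy et al. (2019) report for MNLI: statistically favorable, but with counterexamples present in the training data.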

2. SETUP AND TERMINOLOGY

Abstracting over details, we distill the basic NLP task setting described above into the following, to be formalized in Section 2.2. We assume a binary sequence classification task in which a target feature t perfectly predicts the label (e.g., the label is 1 iff t holds). Here, t represents features which actually determine the label by definition, e.g., whether one sentence semantically entails another. Additionally, there exists a spurious feature s that frequently co-occurs with t in training but is not guaranteed to generalize outside of the training set. Here, s (often called a "heuristic" or "bias" elsewhere in the literature) corresponds to features like lexical overlap, which are predictive of the label in some datasets but are not guaranteed to generalize.

Assumptions. In this work, we assume there is a single t and a single s; in practice there may be many s features. Still, our definition of a feature accommodates multiple spurious or target features. In fact, some of our spurious features already encompass multiple features: the lexical feature, for example, is a combination of several individual-word features, because it holds if any one of a set of words is in the sentence. This type of spurious feature is common in real datasets: E.g., the hypothesis-only baseline in NLI is a disjunction of lexical features (with semantically unrelated words like "no", "sleeping", etc.) (Poliak et al., 2018b; Gururangan et al., 2018). We assume that s and t frequently co-occur, but that only s occurs in isolation. This assumption reflects realistic NLP task settings, since datasets always contain some heuristics, e.g., lexical cues, cultural biases, or artifacts from crowdsourcing (Gururangan et al., 2018). Thus, our experiments focus on manipulating the occurrence of s alone, but not t alone: This means giving the model evidence against relying on s.
This is in line with prior applied work that attempts to influence model behavior by increasing the evidence against s during training (Elkahky et al., 2018; Zmigrod et al., 2019; Min et al., 2020).
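The synthetic setting just described can be sketched as follows. The vocabulary size, sequence length, and token conventions (token 1 stands in for the target feature t, token 2 for the spurious feature s) are illustrative choices for this sketch, not the paper's exact construction:

```python
import random

# Target t = "token 1 appears"; spurious s = "token 2 appears".
# The label is 1 iff t holds; s co-occurs with t on every positive example,
# and s also appears alone (with label 0) at a controllable rate --
# the "evidence" against relying on s.
def make_example(vocab_size=50, length=10, p_s_alone=0.1):
    label = random.random() < 0.5
    seq = [random.randrange(3, vocab_size) for _ in range(length)]
    if label:
        seq[0] = 1                   # t holds => label 1
        seq[1] = 2                   # s co-occurs with t
    elif random.random() < p_s_alone:
        seq[1] = 2                   # s alone: counterexample to the heuristic
    return seq, int(label)

random.seed(0)
data = [make_example() for _ in range(1000)]
s_alone = sum(1 for seq, y in data if 2 in seq and y == 0)
negatives = sum(1 for _, y in data if y == 0)
print(f"{s_alone}/{negatives} negative examples contain s alone")
```

Raising `p_s_alone` injects more evidence against s without ever presenting t in isolation, which matches the manipulation described above: the model is pushed away from s while t remains a perfect predictor of the label.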

Our motivation comes from McCoy et al. (2019), which demonstrated that, when fine-tuned on a natural language inference task (Williams et al., 2018, MNLI), a model based on a state-of-the-art pre-trained LM (Devlin et al., 2019, BERT) categorically fails on test examples that violate a "lexical overlap heuristic".

