ORCA: INTERPRETING PROMPTED LANGUAGE MODELS VIA LOCATING SUPPORTING EVIDENCE IN THE OCEAN OF PRETRAINING DATA

Abstract

Prompting large pretrained language models leads to strong performance on a variety of downstream tasks. However, it is still unclear where the model learns task-specific knowledge from, especially in zero-shot setups. In this work, we propose ORCA, a novel method to identify evidence of the model's task-specific competence in prompt-based learning. Through an instance attribution approach to model interpretability, iteratively using gradient information related to the downstream task, ORCA locates a very small subset of the pretraining data that directly supports the model's predictions on a given task; we call this subset supporting data evidence. We show that supporting data evidence offers new insights about prompted language models. For example, on the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus (the smaller of BERT's two pretraining corpora), as well as on pretraining examples that mask out synonyms of the task labels used in prompts.[1]

1. INTRODUCTION

Large language models (LLMs) are trained on massive text corpora from the web, referred to as the pretraining data (e.g., Devlin et al., 2019; Raffel et al., 2020). Due to their volume, pretraining data typically cannot be inspected manually and are prone to spelling/logic errors, domain mismatch w.r.t. target tasks, social biases, and other unexpected artifacts (Bender et al., 2021). Yet, LLMs pretrained with such noisy data attain surprisingly good performance on numerous downstream tasks, with little or no task-specific tuning (Petroni et al., 2019; Brown et al., 2020).

There are several hypotheses explaining the power of pretrained LLMs. One hypothesis is that the pretraining data is huge and the model might be shallowly memorizing patterns in the data (Bender et al., 2021; Carlini et al., 2021; Razeghi et al., 2022). An alternative hypothesis is that LLMs might be learning to reason by combining observed patterns in the pretraining data in novel ways (McCoy et al., 2021). However, evidence for these conjectures, especially on arbitrary downstream tasks, remains underexplored. Such evidence is useful as it can help explain model decisions, surface problematic patterns in data or model behavior, and shed new light on how to improve the model and data (Zhong et al., 2019; Han & Tsvetkov, 2020; 2021; Pruthi et al., 2022). Moreover, it can improve the trustworthiness of the models (Lipton, 2018; Jacovi et al., 2021).

In this work, we develop a methodology to provide such evidence. Our hypothesis is that within the enormous pretraining corpora, there is a subset of pretraining data that contributes to the model's behavior on a downstream task more than the rest of the pretraining data. Therefore, our task is to locate a task-specific evidence set: a very small amount of pretraining data that particularly helps the model's performance on the task. We call it supporting data evidence (SDE).
Such SDE can help interpret the model if we analyze its task-relevant patterns against the rest of the corpora. A related line of interpretability research focuses on instance attribution (Koh & Liang, 2017; Yeh et al., 2018; Pruthi et al., 2020; Han et al., 2020), where the goal is to find which training examples are most influential to the model's decision on individual test examples. In this work, however, we are interested in locating sets of pretraining data that influence the whole task (i.e., a full test set, rather than individual test instances). We seek such "global" evidence for the task because, given the scale of the pretraining and task data, it could be inefficient or even infeasible to find and inspect the evidence for each of the task examples.[2]

We first formulate the problem of finding SDE in the pretraining data by upweighting the SDE set and measuring its impact on model performance (§2.1). In §2.2, we propose a novel method, ORCA,[3] that effectively identifies the SDE by iteratively using task-specific gradient information. On two classification tasks, sentiment analysis and textual entailment, in a prompt-based setup (§3), we show the effectiveness of the SDE discovered by ORCA compared to random data subsets and nearest-neighbor data in an embedding space (§4). Our analyses of the discovered SDE (§5) show that our base language model, BERT (Devlin et al., 2019), has an interestingly high reliance on the smaller of its two pretraining corpora (BookCorpus; Zhu et al., 2015). Moreover, the pretraining examples in the SDE typically mask out synonyms of the task verbalizers (i.e., the words mapped to the task labels in the prompts; Schick & Schütze, 2021).
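ORCA's selection procedure is specified in §2.2; as a rough illustration of the general idea of scoring pretraining examples by task-specific gradient information, the sketch below ranks the examples of a toy logistic-regression model by the cosine similarity between each example's loss gradient and the aggregate gradient over the downstream task data. The model, the data, and the scoring rule are illustrative assumptions for exposition, not the paper's exact algorithm.

```python
import math

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one example (x, y), y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def rank_by_task_alignment(w, pretrain, task):
    """Score each pretraining example by how well its gradient aligns
    with the summed gradient over the downstream task data."""
    task_grad = [0.0] * len(w)
    for x, y in task:
        g = grad_logistic(w, x, y)
        task_grad = [a + b for a, b in zip(task_grad, g)]
    scored = []
    for i, (x, y) in enumerate(pretrain):
        g = grad_logistic(w, x, y)
        scored.append((cosine(g, task_grad), i))
    return sorted(scored, reverse=True)  # most task-aligned first

w = [0.1, -0.2]
pretrain = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([-1.0, 0.5], 0)]
task = [([1.0, 0.1], 1)]
ranking = rank_by_task_alignment(w, pretrain, task)
# the top-ranked pretraining example is the one whose gradient points
# in the same direction as the task gradient
```

In the actual method the "examples" are masked-language-modeling instances and the gradients are taken w.r.t. the pretrained LM's parameters; the iterative aspect of ORCA is omitted here.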

2. ORCA

We develop a method to explain the competence of large pretrained language models used in zero- or few-shot prompt-based classification (Petroni et al., 2019; Brown et al., 2020).[4] Without conventional finetuning, model decisions rely on knowledge learned from the pretraining data, and our goal is to identify what supporting data evidence (SDE) in the pretraining data facilitates the model's competence on a specific downstream task.
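As a schematic of the prompt-based classification setup formalized in §2.1, the sketch below wraps an input in a cloze template and maps masked-token scores to labels through a verbalizer. The toy scorer is a stand-in for the pretrained masked LM, and the template, verbalizer tokens, and cue heuristic are illustrative assumptions, not the paper's actual prompts.

```python
# Toy stand-in for a masked LM: returns a score for each candidate
# token at the [MASK] position of the prompt. Illustrative only.
def toy_mlm_score(prompt, candidates):
    positive_cues = {"love", "great", "wonderful"}
    cue = any(w in prompt.lower() for w in positive_cues)
    # crude heuristic: "good" scores high iff a positive cue appears
    return {tok: 1.0 if (tok == "good") == cue else 0.0 for tok in candidates}

def template(x):
    """Cloze template for sentiment: append a masked verdict."""
    return f"{x} It was [MASK]."

# verbalizer: maps each task label to a token in the LM's vocabulary
VERBALIZER = {"positive": "good", "negative": "bad"}

def prompt_classify(x):
    """Pick the label whose verbalizer token scores highest at [MASK]."""
    scores = toy_mlm_score(template(x), list(VERBALIZER.values()))
    return max(VERBALIZER, key=lambda label: scores[VERBALIZER[label]])

print(prompt_classify("I love this movie."))   # positive
print(prompt_classify("A dull, tired plot."))  # negative
```

With a real model, `toy_mlm_score` would be replaced by the masked LM's probabilities p_θ(token | template(x)); the label decision rule is otherwise the same.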

2.1. PROBLEM FORMULATION

Assume $\theta^{\mathrm{PT}}$, an LLM pretrained on a dataset $D^{\mathrm{PT}} \ni (x^{\mathrm{PT}}_{\mathrm{context}}, y^{\mathrm{PT}}_{\mathrm{masked}})$. For example, for a masked language model, $x^{\mathrm{PT}}_{\mathrm{context}}$ is a block of text with certain tokens masked, and $y^{\mathrm{PT}}_{\mathrm{masked}}$ are the masked tokens in their original forms, to be reconstructed. $\theta^{\mathrm{PT}}$ is trained to minimize a loss $\mathcal{L}$ over the pretraining examples: $\theta^{\mathrm{PT}} = \arg\min_\theta \mathcal{L}(D^{\mathrm{PT}}; \theta)$.

The LLM can be applied to many downstream tasks without finetuning, via prompting (Schick & Schütze, 2021; Liu et al., 2021). Given a dataset for a downstream classification task, $D^{\mathrm{task}} \ni (x^{\mathrm{task}}, y^{\mathrm{task}})$, the LLM is applied by measuring $p_\theta(\mathrm{verbalizer}(y^{\mathrm{task}}) \mid \mathrm{template}(x^{\mathrm{task}}))$. The template supplies a task-tailored prompt to the model, and the verbalizer maps the output of the language model to the task's label space (more details in §3.2).

We interpret the model's decisions by finding the SDE $S \subset D^{\mathrm{PT}}$ w.r.t. the task data $D^{\mathrm{task}}$. The size of $S$ should be very small (e.g., a few hundred examples) compared to the whole pretraining data, $|S| \ll |D^{\mathrm{PT}}|$, to facilitate further manual or semi-automatic analyses. More importantly, $S$ should "contribute" significantly to the performance of the model on the downstream task. However, defining this contribution is a non-trivial problem. Prior work in instance attribution, such as influence functions (Koh & Liang, 2017), adopts a "leave-one-out" perspective (Cook, 1977). In our case, this would mean removing $S$ from $D^{\mathrm{PT}}$, retraining a new LLM from scratch, and testing it on $D^{\mathrm{task}}$. This is prohibitively expensive.[5]

We instead adopt an "upweighting" perspective. Rather than leave-one-out, we upweight certain pretraining examples (e.g., $S$) by training the model on them for one additional epoch. The resulting change to the model should be small to prevent overfitting. Specifically, we randomly batch $S$ into mini-batches, updating the model via a very small number of optimizer steps:

$$\theta^{\mathrm{PT}}_{\mathrm{new}} \leftarrow \theta^{\mathrm{PT}} + \mathrm{updates}_{\theta,\mathcal{L}}(S)$$
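The upweighting step above can be sketched as one extra epoch of mini-batch gradient steps on the candidate evidence set S, starting from the pretrained weights. The toy squared-error "pretraining loss", the learning rate, and all names below are illustrative assumptions; in the paper the updates are applied to the pretrained LM under its masked-language-modeling objective.

```python
import random

def upweight(theta, S, grad_fn, lr=0.1, batch_size=2, seed=0):
    """One extra epoch over the candidate evidence set S, applied as a
    small number of mini-batch gradient steps to the pretrained weights."""
    rng = random.Random(seed)          # randomly batch S into mini-batches
    examples = list(S)
    rng.shuffle(examples)
    theta = list(theta)
    for i in range(0, len(examples), batch_size):
        batch = examples[i:i + batch_size]
        grads = [grad_fn(theta, ex) for ex in batch]
        # average gradient over the mini-batch, then one SGD step
        avg = [sum(g[j] for g in grads) / len(grads) for j in range(len(theta))]
        theta = [t - lr * a for t, a in zip(theta, avg)]
    return theta

# toy stand-in for the pretraining loss: squared error of a 1-d linear model
def grad_fn(theta, ex):
    x, y = ex
    pred = theta[0] * x
    return [2 * (pred - y) * x]

theta_pt = [0.0]                       # "pretrained" weights
S = [(1.0, 2.0), (1.0, 2.0), (2.0, 4.0)]
theta_new = upweight(theta_pt, S, grad_fn)
# theta_new moves from 0.0 toward the slope (2.0) implied by the evidence set
```

Because only a handful of optimizer steps are taken, the updated model stays close to the pretrained one; the change in downstream task performance is then attributed to S.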



Footnotes

[1] Code and data will be released at ANONYMIZED upon publication.
[2] Directly applying instance attribution methods at the task level has also been shown to yield negative results (Kocijan & Bowman, 2020).
[3] Named after the marine mammal for nO paRtiCular reAson.
[4] While in this work we focus on text classification, the framework is also adaptable to generation problems.
[5] Moreover, the definition of influence functions and even leave-one-out can sometimes be arguable, especially in non-convex models (Basu et al., 2021; K & Søgaard, 2021).

