ORCA: INTERPRETING PROMPTED LANGUAGE MODELS VIA LOCATING SUPPORTING EVIDENCE IN THE OCEAN OF PRETRAINING DATA

Abstract

Prompting large pretrained language models leads to strong performance on a variety of downstream tasks. However, it remains unclear where the model learns task-specific knowledge, especially in zero-shot setups. In this work, we propose a novel method, ORCA, to identify evidence of the model's task-specific competence in prompt-based learning. Through an instance attribution approach to model interpretability, iteratively using gradient information related to the downstream task, ORCA locates a very small subset of pretraining data that directly supports the model's predictions on a given task; we call this subset supporting data evidence. We show that supporting data evidence offers new insights into prompted language models. For example, on the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus, the smaller of BERT's two pretraining corpora, as well as on pretraining examples that mask out synonyms of the task labels used in prompts. 1

1. INTRODUCTION

Large language models (LLMs) are trained on massive text corpora from the web, referred to as pretraining data (e.g., Devlin et al., 2019; Raffel et al., 2020). Due to their volume, pretraining data typically cannot be inspected manually and are prone to spelling/logic errors, domain mismatch w.r.t. target tasks, social biases, and other unexpected artifacts (Bender et al., 2021). Yet, LLMs pretrained on such noisy data attain surprisingly good performance on numerous downstream tasks, with little or no task-specific tuning (Petroni et al., 2019; Brown et al., 2020).

There are several hypotheses explaining the power of pretrained LLMs. One hypothesis is that the pretraining data is huge and the model might be shallowly memorizing patterns in the data (Bender et al., 2021; Carlini et al., 2021; Razeghi et al., 2022). An alternative hypothesis is that LLMs might be learning to reason over patterns observed in the pretraining data in novel ways (McCoy et al., 2021). However, evidence for these conjectures, especially on arbitrary downstream tasks, remains underexplored. Such evidence is useful, as it can help explain model decisions, surface problematic patterns in data or model behavior, and shed new light on how to improve the model and data (Zhong et al., 2019; Han & Tsvetkov, 2020; 2021; Pruthi et al., 2022). Moreover, it can foster the trustworthiness of the models (Lipton, 2018; Jacovi et al., 2021).

In this work, we develop a methodology to provide such evidence. Our hypothesis is that within the enormous pretraining corpora there is a subset of pretraining data that contributes to the model's behavior on a downstream task more than the rest of the pretraining data. Our task, therefore, is to locate a task-specific evidence set: a very small amount of pretraining data that particularly helps the model's performance on the task. We call it supporting data evidence (SDE).
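The core idea of locating SDE, scoring candidate pretraining examples by how well their loss gradients align with the gradients of a downstream task, can be illustrated with a minimal sketch. This is not the paper's ORCA implementation; it is a toy gradient dot-product attribution on logistic regression, with all names and the model being our own illustrative assumptions:

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss at weights w for one example (x, y), y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def attribution_scores(w, pretrain_set, task_set):
    """Score each candidate 'pretraining' example by the dot product of its
    loss gradient with the mean gradient over the downstream task examples.
    Higher score = reducing this example's loss also reduces the task loss."""
    task_grad = np.mean([grad_logistic(w, x, y) for x, y in task_set], axis=0)
    return [float(grad_logistic(w, x, y) @ task_grad) for x, y in pretrain_set]

def top_k_evidence(w, pretrain_set, task_set, k):
    """Return indices of the k highest-scoring candidates (a toy analogue of SDE)."""
    scores = attribution_scores(w, pretrain_set, task_set)
    return sorted(range(len(pretrain_set)), key=lambda i: -scores[i])[:k]
```

In this sketch, an example whose gradient points in the same direction as the task gradient receives a positive score and is kept as evidence; a real method over an LLM would compute such alignment in the model's full parameter (or a projected) gradient space and, as in the paper, iterate the selection using task-related gradient information.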
Such SDE can help interpret the model if we analyze its task-relevant patterns compared to the rest of the corpora. A related line of interpretability research focuses on instance attribution (Koh & Liang, 2017; Yeh et al., 2018; Pruthi et al., 2020; Han et al., 2020), where the goal is to find which training examples

1 Code and data will be released at ANONYMIZED upon publication.

