INTERPRETABILITY IN THE WILD: A CIRCUIT FOR INDIRECT OBJECT IDENTIFICATION IN GPT-2 SMALL

Abstract

Research in mechanistic interpretability seeks to explain behaviors of ML models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task that requires logical reasoning: indirect object identification (IOI). Our explanation encompasses 28 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches including causal interventions and projections. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work is a case study demonstrating a first step toward a better understanding of pre-trained language models, opening opportunities to scale to both larger models and more complex tasks.

1. INTRODUCTION

Transformer-based language models (Vaswani et al., 2017; Brown et al., 2020) have demonstrated an impressive suite of capabilities, but largely remain black boxes. Understanding these models is difficult because they employ complex non-linear interactions in densely-connected layers and operate in a high-dimensional space. Despite this, they are already deployed in high-impact settings, underscoring the urgency of understanding and anticipating possible model behaviors. Some researchers have even argued that interpretability is necessary for the safe deployment of advanced machine learning systems (Hendrycks & Mazeika, 2022).

Work in mechanistic interpretability aims to discover, understand, and verify the algorithms that model weights implement by reverse-engineering model computation into human-understandable components (Olah, 2022; Meng et al., 2022; Geiger et al., 2021; Geva et al., 2020). By understanding underlying mechanisms, we can better predict out-of-distribution behavior (Mu & Andreas, 2020), identify and fix model errors (Hernandez et al., 2021; Vig et al., 2020), and understand emergent behavior (Nanda & Lieberum, 2022; Barak et al., 2022; Wei et al., 2022).

In this work, we aim to understand how GPT-2 small (Radford et al., 2019) implements a natural language task. To do so, we locate components of the network that produce specific behaviors, and study how they compose to complete the task. We use circuit analysis (Räuker et al., 2022), identifying an induced subgraph of the model's computational graph that is human-understandable and responsible for completing the task. We employed a number of techniques, most notably activation patching, knockouts, and projections, which we believe are useful, general techniques for circuit discovery.
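As a rough illustration of the activation-patching intervention pattern (not the paper's implementation), consider a toy two-layer model in which an intermediate activation from a clean run is spliced into a run on a corrupted input. The model, weights, and inputs below are made up for illustration only.

```python
import numpy as np

# Toy two-layer "model": output depends on the input only through the
# hidden activation h. All weights and inputs are illustrative stand-ins.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))

def forward(x, patch_hidden=None):
    """Run the toy model, optionally overwriting the hidden activation.

    Passing patch_hidden implements the activation-patching intervention:
    the intermediate node is replaced by a cached value from another run.
    """
    h = np.tanh(W1 @ x)
    if patch_hidden is not None:
        h = patch_hidden  # intervene on the intermediate node
    return W2 @ h, h

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

y_clean, h_clean = forward(x_clean)      # reference ("clean") run
y_corrupt, _ = forward(x_corrupt)        # corrupted run
# Patch the clean hidden activation into the corrupted run; in this toy
# model the output depends only on h, so the clean output is fully restored.
y_patched, _ = forward(x_corrupt, patch_hidden=h_clean)
```

In a real transformer, the cached value would be a specific head's or layer's activation at a specific token position, and the degree to which the patched run restores the clean behavior indicates how much that component matters for the task.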


We focus on understanding a non-trivial, algorithmic natural language task that we call Indirect Object Identification (IOI). In IOI, sentences such as 'When Mary and John went to the store, John gave a drink to' should be completed with 'Mary'. We chose this task because it is linguistically meaningful and admits a complex but interpretable algorithm (Section 3).

We discover a circuit of 28 attention heads (1.5% of the total number of (head, token position) pairs) that completes this task. The circuit uses 7 different categories of heads (see Figure 2) to implement the algorithm. Together, these heads route information between different name tokens, to the end position, and finally to the output. Our work provides, to the best of our knowledge, the most detailed attempt at reverse-engineering a natural end-to-end behavior in a transformer-based language model.

Explanations for model behavior can easily be misleading or non-rigorous (Jain & Wallace, 2019; Bolukbasi et al., 2021). To remedy this problem, we formulate three criteria to help validate our circuit explanations: faithfulness (the circuit can perform the task as well as the whole model), completeness (the circuit contains all the nodes used to perform the task), and minimality (the circuit doesn't contain nodes irrelevant to the task). Our circuit shows significant improvements compared to a naïve (but faithful) circuit, but fails to pass the most challenging tests.

In summary, our main contributions are: (1) We identify a large circuit in GPT-2 small that performs indirect object identification on a specific distribution (Figure 2 and Section 3); (2) Through example, we identify useful techniques for understanding models, as well as surprising pitfalls; (3) We present criteria that ensure structural correspondence (in the computational graph abstraction) between the circuit and the model, and check experimentally whether our circuit meets this standard (Section 4).
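To make the task format concrete, the following sketch generates IOI-style prompts from a template. The single template and the small name/place/item lists here are illustrative stand-ins, not the paper's actual 15 templates or its vocabulary.

```python
import random

# Hypothetical single template; the paper uses 15 such templates with
# random single-token names, places, and items.
TEMPLATE = "When {A} and {B} went to the {place}, {B} gave a {item} to"
NAMES = ["Mary", "John", "Tom", "Anna", "James", "Laura"]
PLACES = ["store", "park", "school"]
ITEMS = ["drink", "book", "ball"]

def make_ioi_sample(rng):
    """Sample one IOI prompt: the subject (S) appears twice, the indirect
    object (IO) once, and the correct completion is IO."""
    io, s = rng.sample(NAMES, 2)  # two distinct names
    prompt = TEMPLATE.format(
        A=io,
        B=s,
        place=rng.choice(PLACES),
        item=rng.choice(ITEMS),
    )
    return {"prompt": prompt, "IO": io, "S": s}

rng = random.Random(0)
sample = make_ioi_sample(rng)
```

Each sample records which name plays the IO and S roles, so that model predictions can later be scored against the correct (non-repeated) name.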

2. BACKGROUND

In this section, we introduce the IOI task (an original contribution of this work) and the transformer architecture, define circuits more formally, and describe a technique for "knocking out" model nodes.

Task description. In indirect object identification (IOI), two names (the indirect object (IO) and the first occurrence of the subject (S1)) are introduced in an initial dependent clause (see Figure 1). A main clause then introduces the second occurrence of the subject (S2), who is usually exchanging an item. The task is to complete the main clause, which always ends with the token 'to', with the non-repeated name (IO). We create many dataset samples for IOI (p_IOI) using 15 templates (see Appendix A) with random single-token names, places, and items.

We investigate the performance of GPT-2 small on this task. We study the original model from Radford et al. (2019), pretrained on a large corpus of internet text and without any fine-tuning. To quantify GPT-2 small's performance on the IOI task, we use the logit difference between the logit values placed on the two names, where a positive score means the correct name (IO) has a higher probability. This is also the difference in loss the model would receive in training if IO was correct compared to if S was correct. We report this metric averaged over p_IOI throughout the paper. GPT-2 small has a mean logit difference of 3.55, averaged across 100,000 dataset examples.
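The logit-difference metric can be sketched as follows. The token indices and logit values here are toy stand-ins, not GPT-2's vocabulary or actual outputs.

```python
def logit_diff(logits, io_token, s_token):
    """Logit of the correct name (IO) minus logit of the repeated subject (S).

    Positive values mean the model prefers the correct completion. Because
    cross-entropy loss is -log softmax and the log-partition term cancels
    in the subtraction, this equals the loss the model would receive if S
    were the label minus the loss if IO were the label.
    """
    return logits[io_token] - logits[s_token]

# Toy 5-entry "vocabulary": suppose index 2 is ' Mary' (IO) and
# index 4 is ' John' (S). Values are illustrative only.
logits = [0.1, -1.2, 3.0, 0.5, 1.0]
diff = logit_diff(logits, io_token=2, s_token=4)  # 3.0 - 1.0 = 2.0
```

In practice the logits would come from the model's output at the final token position, with `io_token` and `s_token` being the vocabulary indices of the two names.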

