ABSTRACTING INFLUENCE PATHS FOR EXPLAINING (CONTEXTUALIZATION OF) BERT MODELS

Abstract

While "attention is all you need" may be proving true, we do not yet know why: attention-based transformer models such as BERT are superior, but how they contextualize information, even for simple grammatical rules such as subject-verb number agreement (SVA), remains uncertain. We introduce multi-partite patterns, abstractions of sets of paths through a neural network model. Patterns quantify and localize the effect of an input concept (e.g., a subject's number) on an output concept (e.g., the corresponding verb's number) to paths passing through a sequence of model components, thus surfacing how BERT contextualizes information. We describe guided pattern refinement, an efficient search procedure for finding sufficient and sparse patterns representative of concept-critical paths. We discover that patterns generate succinct and meaningful explanations for BERT, highlighted by "copy" and "transfer" operations implemented by skip connections and attention heads, respectively. We also show how pattern visualizations help us understand how BERT contextualizes various grammatical concepts, such as SVA across clauses, and why it makes errors in some cases while succeeding in others.

1. INTRODUCTION

Recent advancements in NLP have been spurred by contextualized representations created in deep neural models such as BERT (Devlin et al., 2019). These contextualized representations, which are designed to be sensitive to the context in which they appear (Ethayarajh, 2019), are also shown to capture many grammatical concepts (Lin et al., 2019; Tenney et al., 2019a), including subject-verb agreement (SVA) and reflexive anaphora (RA) (Goldberg, 2019). However, the exact mechanism of contextualization in BERT, i.e., the process of developing contextualized representations from representations of the individual input words in the sentence context, remains unclear. For example, in the sentence the pilots that the architect likes is/are short, choosing the correct verb are over is to agree with the subject requires contextualizing the verb with the subject's plurality information. In this paper, we answer the central question: How is contextualization realized in BERT for grammatical concepts such as SVA and RA? Specifically, can we identify sub-components of BERT that are (a) sufficient for representing those concepts but also (b) sparse enough to legibly show how BERT contextualizes the concepts across layers and whether the contextualization follows correct grammatical rules?

Prior works on explaining contextualization in BERT rely on the analysis of layer representations and attention components. Representation analyses, either by training a probing classifier (Lin et al., 2019; Tenney et al., 2019a) or by finding parse trees embedded in the representations (Hewitt & Manning, 2019; Reif et al., 2019), demonstrate that relevant linguistic concepts are associated with the activations of BERT components (e.g., a subject's number associated with the activations of a certain head at a certain layer, or the subject's representation lying closer to the verb's under certain transformations), but they do not tell us how those representations come about inside the model.
Meanwhile, inspecting attention weights as indicators of the flow of information between BERT layers (Clark et al., 2019) requires subjective inference of function (e.g., inferring that a certain head is involved because of high attention weights between positions at the subject and at the verb), an approach found to be problematic in other contexts (Brunner et al., 2020; Jain & Wallace, 2019). Attention analysis further disregards the role of skip connections, which do not involve attention at all. Neither approach allows us to track a concept as a causal chain from input to output, or to distinguish helpful from hindering representations and flows (hindering information, such as the contextualization of confounding inputs like unrelated nouns in a sentence, leads to errors on SVA).

To answer the central question while overcoming these limitations, we introduce multi-partite¹ patterns, abstractions of sets of paths through a neural model (a graph). Patterns quantify and localize the effect of an input concept (e.g., a subject's number) on an output concept (e.g., the corresponding verb's number) to a collection of paths passing through a sequence of model nodes and/or edges. We describe guided pattern refinement, a search procedure for finding patterns representative of concept-critical paths that lets us selectively explore the importance of chosen aspects of a model (e.g., in BERT, we can refine patterns showing the criticality of certain heads into paths that also show whether this is due to skip connections or due to attention). To demonstrate the contextualization process, we further extend the experimental framework to integrate the impacts of multiple words towards a given concept (as opposed to the impact of a single word, e.g., the subject, on SVA).
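As an illustrative sketch (not the paper's implementation), a pattern over a layered model viewed as a multi-partite graph can be represented as a per-layer choice of node subsets: the pattern abstracts every path formed by picking one node from each subset, and a refinement step restricts a layer to its most influential nodes. All node names and the toy influence scores below are hypothetical.

```python
import itertools

# A fully coarse pattern: every node of every layer is admitted.
pattern_full = [
    ["x_subj", "x_other"],          # input embeddings
    ["head_1", "head_2", "skip"],   # one transformer layer's components
    ["y_mask"],                     # output at the [MASK] position
]

def paths_of(pattern):
    """Enumerate the concrete paths abstracted by a pattern,
    i.e. the Cartesian product of its per-layer node sets."""
    return [list(p) for p in itertools.product(*pattern)]

def refine(pattern, layer, influence):
    """One greedy refinement step: restrict `layer` to its single
    most influential node (influence: node name -> score)."""
    best = max(pattern[layer], key=lambda n: influence[n])
    return pattern[:layer] + [[best]] + pattern[layer + 1:]

influence = {"head_1": 0.1, "head_2": 0.7, "skip": 0.2}  # toy scores
refined = refine(pattern_full, 1, influence)

assert len(paths_of(pattern_full)) == 6   # 2 * 3 * 1 paths abstracted
assert len(paths_of(refined)) == 2        # sparser: 2 * 1 * 1 paths
assert refined[1] == ["head_2"]
```

Guided refinement in the paper additionally measures influence along the surviving paths at each step; this sketch only shows the abstraction-and-restriction mechanics.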

Contributions:

1) We describe multi-partite patterns for explaining model-wide contextualization in neural models like BERT, and guided pattern refinement (GPR) to discover influential patterns focusing on model elements of interest.
2) We visualize BERT's contextualization of grammatical concepts, including subject-verb agreement (SVA) and reflexive anaphora (RA), and qualitatively show how BERT encodes these concepts using grammatically correct or incorrect cues.
3) We validate the sufficiency and sparsity of derived patterns with model-compression and concentration metrics, respectively.

We begin with a summary of requisite techniques in Sec. 2, describe the core elements of our methodology in Sec. 3, and exemplify them for understanding BERT in Sec. 4. We elaborate on related works in Sec. 5 and conclude in Sec. 6.

2. BACKGROUND

We introduce the basics of the BERT architecture (Fig. 1) and the learning task subject to our work. We then discuss existing explanation devices and how they motivate the methods that follow.

BERT. In BERT, let L be the number of Transformer encoder layers, H the hidden dimension of embeddings at each layer, and A the number of attention heads. The list of input word embeddings is x def= [x_1, x_2, ..., x_N], x_i ∈ R^d. We denote the output of the l-th layer as h^l_{1:N}; the first layer's inputs are h^0_{1:N} def= x_{1:N}. We use a^{l,i}_j to denote the j-th attention head from the i-th embedding at the l-th layer, and s^l_j to denote the skip connection that "copies" the input embedding from the previous layer and is combined with the attention output.

We focus on Masked Language Modeling (MLM), used in BERT pretraining: predict a masked word, represented by [MASK], in a context sentence. The MLM task has been used to evaluate whether BERT learns linguistic concepts such as SVA by measuring whether it assigns a higher probability to the correct verb (e.g., are in Fig. 2) than to the incorrect verb (e.g., is) at the [MASK] position (Goldberg, 2019).

Distributional Influence. To explain a DNN's behavior, distributional influence attributes to each input a measure of impact on the model output. Saliency (Baehrens et al., 2010), for example, defines influence as the gradient of the output w.r.t. the input. In the generalized framework of Leino et al. (2018), influence quantifies the impact of each input feature towards a concept (e.g., SVA) by instrumenting the model's inputs with a distribution of interest (DoI) and its output with a quantity of interest (QoI).

Definition 1 (Distributional Influence). Given a model f : R^d → R^n, an input x, a DoI D(x), and a QoI q : R^n → R, the distributional influence g_q(x) quantifies the impact of the input concept defined by D(x) on the output concept q as the expected gradient of q over the DoI: g_q(x) def= E_{z ~ D(x)} [∇_z q(f(z))].
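Distributional influence can be estimated by Monte Carlo: sample points from the DoI and average the gradients of the QoI at those points. The sketch below uses a toy linear model and finite-difference gradients so it is self-contained; the model M, the Gaussian DoI, and the logit-difference QoI are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def distributional_influence(f, q, doi_samples, eps=1e-5):
    """Estimate g_q(x) = E_{z ~ D(x)}[ grad_z q(f(z)) ] by averaging
    central finite-difference gradients over DoI samples."""
    d = doi_samples.shape[1]
    total = np.zeros(d)
    for z in doi_samples:
        grad = np.zeros(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = eps
            grad[i] = (q(f(z + e)) - q(f(z - e))) / (2 * eps)
        total += grad
    return total / len(doi_samples)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))        # toy "model": 4 logits from 3 inputs
f = lambda z: M @ z
q = lambda y: y[0] - y[1]          # QoI: correct-vs-incorrect logit gap
x = rng.normal(size=3)
samples = x + 0.1 * rng.normal(size=(8, 3))  # Gaussian DoI around x

g = distributional_influence(f, q, samples)
# For a linear model the gradient is constant, so the estimate matches
# the analytic gradient M[0] - M[1] up to floating-point error.
assert np.allclose(g, M[0] - M[1], atol=1e-4)
```

With DoI a point mass at x and q the model output itself, this reduces to plain saliency.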



¹ Multi-partite, because patterns abstract sets of paths in neural models viewed as multi-partite graphs.



Probability scores for candidates of [MASK] are denoted by y_i def= softmax(W h^L_i), W ∈ R^{C×H}, where C is the vocabulary size. We denote the index of [MASK] by m. The layered architecture is presented in Fig. 1 (left) and a detailed view of the transformer layer in Fig. 1 (right). For further details, refer to Vaswani et al. (2017) and Devlin et al. (2019).
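The MLM scoring above, and the SVA check of Goldberg (2019), can be sketched as follows. The sizes, weights, and vocabulary ids here are toy stand-ins, not BERT's actual parameters.

```python
import numpy as np

def mlm_scores(h_mask, W):
    """Probability scores over the vocabulary at the [MASK] position:
    y_m = softmax(W h^L_m), with a numerically stable softmax."""
    logits = W @ h_mask
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(1)
H, C = 8, 5                      # toy hidden size and vocabulary size
W = rng.normal(size=(C, H))      # output projection, W in R^{C x H}
h_mask = rng.normal(size=H)      # final-layer embedding at index m

y = mlm_scores(h_mask, W)
assert np.isclose(y.sum(), 1.0)  # a proper distribution over the vocab

# SVA evaluation a la Goldberg (2019): the model is "correct" on an
# example if it scores the agreeing verb form above the other form.
ARE, IS = 2, 3                   # hypothetical vocabulary ids
model_prefers_are = y[ARE] > y[IS]
```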

