ABSTRACTING INFLUENCE PATHS FOR EXPLAINING (CONTEXTUALIZATION OF) BERT MODELS

Abstract

While "attention is all you need" may be proving true, we do not yet know why: attention-based transformer models such as BERT are superior performers, but how they contextualize information, even for simple grammatical rules such as subject-verb number agreement (SVA), is uncertain. We introduce multi-partite patterns, abstractions of sets of paths through a neural network model. Patterns quantify and localize the effect of an input concept (e.g., a subject's number) on an output concept (e.g., the corresponding verb's number) to paths passing through a sequence of model components, thus surfacing how BERT contextualizes information. We describe guided pattern refinement, an efficient search procedure for finding sufficient and sparse patterns representative of concept-critical paths. We discover that patterns generate succinct and meaningful explanations for BERT, highlighted by "copy" and "transfer" operations implemented by skip connections and attention heads, respectively. We also show how pattern visualizations help us understand how BERT contextualizes various grammatical concepts, such as SVA across clauses, and why it makes errors in some cases while succeeding in others.

1. INTRODUCTION

Recent advancements in NLP have been spurred by contextualized representations created in deep neural models such as BERT (Devlin et al., 2019). These contextualized representations, which are designed to be sensitive to the context in which they appear (Ethayarajh, 2019), are also shown to capture many grammatical concepts (Lin et al., 2019; Tenney et al., 2019a), including subject-verb agreement (SVA) and reflexive anaphora (RA) (Goldberg, 2019). However, the exact mechanism of contextualization in BERT, i.e., the process of developing contextualized representations from representations of individual input words in the sentence context, remains unclear. For example, in the sentence the pilots that the architect likes is/are short, choosing the correct verb are over the incorrect is to agree with the subject requires contextualizing the verb with plurality information of the subject. In this paper, we answer the central question: How is contextualization realized in BERT for grammatical concepts such as SVA and RA? Specifically, can we identify sub-components of BERT that are a) sufficient for representing those concepts but also b) sparse enough to legibly show how BERT contextualizes the concepts across layers and whether the contextualization follows correct grammatical rules?

Prior works on explaining contextualization in BERT rely on the analysis of layer representations and attention components. Representation analyses, either by training a probing classifier (Lin et al., 2019; Tenney et al., 2019a) or by finding parse trees embedded in the representations (Hewitt & Manning, 2019; Reif et al., 2019), demonstrate that relevant linguistic concepts are associated with the activations of BERT components (e.g., the subject's number associated with the activations of a certain head at a certain layer, or the subject's representation closer to the verb's under certain transformations), but do not tell us how representations come about inside the model.
Meanwhile, inspection of attention weights as indicators of the flow of information between BERT layers (Clark et al., 2019) requires subjective inference of the relevant function (e.g., inferring that a certain head may be involved because of high attention weights between positions at the subject and at the verb), a practice found to be problematic in other contexts (Brunner et al., 2020; Jain & Wallace, 2019). Analysis of attention further disregards the role of skip connections, which do not involve attention at all. Neither approach allows us to track a concept as a causal chain from input to output, or to distinguish helpful from hindering representations or flows (hindering information, such as contextualization of confounding inputs like unrelated nouns in a sentence, leads to errors on SVA). To answer the central question while overcoming these limitations, we introduce multi-partite patterns, abstractions of sets of paths through a neural model (a graph). Patterns quantify and localize the effect of an input concept (e.g., a subject's number) on an output concept (e.g., the corresponding verb's number) to a collection of paths passing through a sequence of model nodes and/or edges. We describe guided pattern refinement, a search procedure for finding patterns representative of concept-critical paths that lets us selectively explore the importance of chosen aspects of a model (e.g., in BERT, we can refine patterns showing the criticality of certain heads into paths also showing whether this is due to skip connections or to attention). To demonstrate the contextualization process, we further extend the experimental framework to integrate the impacts of multiple words towards a given concept (as opposed to the impact of a single word, e.g., the subject, on SVA).

Contributions:

1) We describe multi-partite patterns for explaining model-wide contextualization in neural models like BERT, and guided pattern refinement (GPR) to discover influential patterns focusing on model elements of interest. 2) We visualize BERT's contextualization of grammatical concepts including subject-verb agreement (SVA) and reflexive anaphora (RA), and qualitatively show how BERT encodes these concepts using grammatically correct or incorrect cues. 3) We validate the sufficiency and sparsity of derived patterns with model compression and concentration metrics, respectively.

We begin with a summary of requisite techniques in Sec. 2. We describe the core elements of our methodology in Sec. 3 and exemplify them for understanding BERT in Sec. 4. We elaborate on related work in Sec. 5 and conclude in Sec. 6.

2. BACKGROUND

We introduce the basics of the BERT architecture (Fig. 1) and the learning task subject to our work. We then discuss existing explanation devices and how they motivate the methods that follow.

BERT. In BERT, let L be the number of Transformer encoder layers, H the hidden dimension of embeddings at each layer, and A the number of attention heads. The list of input word embeddings is x := [x_1, x_2, ..., x_N], x_i ∈ R^H. We denote the output of the l-th layer as h^l_{1:N}; the first layer's inputs are h^0_{1:N} := x_{1:N}. We use a^{l,i}_j to denote the j-th attention head from the i-th embedding at the l-th layer, and s^l_i to denote the skip connection that "copies" the input embedding from the previous layer and combines it with the attention output. Probability scores for candidates of [MASK] are denoted by y_i := softmax(W h^L_i), W ∈ R^{C×H}, where C is the vocabulary size. We denote the index of [MASK] as m. The layered architecture is presented in Fig. 1 (left) and a detailed view of the transformer layer in Fig. 1 (right). For further details, refer to Vaswani et al. (2017) and Devlin et al. (2019).

We focus on Masked Language Modeling (MLM) as used in BERT pretraining: predict a masked word represented by [MASK] in a context sentence. The MLM task has been used to evaluate whether BERT learns linguistic concepts such as SVA by measuring whether it assigns a higher probability to the correct verb (e.g. are in Fig. 2) than to the incorrect verb (e.g. is) at the [MASK] position (Goldberg, 2019).

Distributional Influence. To explain a DNN's behavior, distributional influence attributes to each input a measure of impact on the model output. Saliency (Baehrens et al., 2010), as an example, defines influence as the gradient of the output w.r.t. the input. In the generalized framework of Leino et al. (2018), influence quantifies the impact of each input feature towards a concept (e.g.
SVA) by instrumenting a model's inputs with a distribution of interest (DoI) and the output with a quantity of interest (QoI).

Definition 1 (Distributional Influence) Given a model f : R^d → R^n, an input x, a DoI D(x), and a QoI q : R^n → R, the distributional influence g_q(x) quantifies the impact of the input concept defined by the DoI on the output concept defined by the QoI:

g_q(x) := E_{z ∼ D(x)} [ ∂q(f(z)) / ∂z ]

Instantiations defining SVA and RA concepts in BERT models are found in Sec. 4. Examples of DoI include Gaussian distributions with mean x (Smilkov et al., 2017), or uniform distributions over a path c = {x_b + α(x − x_b), α ∈ [0, 1]} from a user-defined baseline input x_b to the target input x (Sundararajan et al., 2017). We use the latter in the rest of the paper; we approximate the expectation in Def. 1 by sampling discrete points in the uniform distribution (Sundararajan et al., 2017) (see Appendix B.1 for an analysis of the accuracy of this approximation).

Explaining Contextualization with Influence Paths. While it can highlight relevant inputs, distributional influence cannot show if or how they are contextualized internally to form higher-level concepts. Influence Paths (Lu et al., 2020) localize an input influence measurement to paths in a neural model and thus can be used to show how the influence of the input representations flows internally from one internal representation to another. A computation graph G := (V, F, E) is a set of nodes, activation functions, and edges, respectively. In this paper, we assume the graph is directed, acyclic, and contains at most one edge per adjacent pair of nodes. A path p in G is a sequence of graph-adjacent nodes [p_1, p_2, ..., p_ℓ]. We denote the Jacobian of the output of node n_i w.r.t. the output of a connected (not necessarily direct) predecessor node n_j, evaluated at x, as ∂n_i(x)/∂n_j(x). We write ∇_x p for the component of the Jacobian passing through path p, evaluated at input x, as per the chain rule: ∇_x p := ∏_{i=2}^{ℓ} ∂p_i(x)/∂p_{i−1}(x).
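A minimal numerical sketch of Def. 1 under the uniform path DoI may help fix ideas. The toy model `f` below is a hypothetical stand-in for BERT (the QoI is the identity), and gradients are taken by central finite differences; only the shape of the computation follows the definition:

```python
# Distributional influence g_q(x) ~= mean over sampled z on the baseline-to-input
# path of dq(f(z))/dz, for a toy scalar model f with q the identity.
def f(x):
    return x[0] * x[0] + 2.0 * x[0] * x[1]   # hypothetical toy model

def grad(f, z, eps=1e-5):
    # central finite-difference gradient of f at z
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        g.append((f(zp) - f(zm)) / (2 * eps))
    return g

def distributional_influence(f, x, x_b, resolution=50):
    # uniform DoI over the line from baseline x_b to input x (Def. 1),
    # approximated by `resolution` midpoint samples
    n = len(x)
    acc = [0.0] * n
    for k in range(resolution):
        a = (k + 0.5) / resolution
        z = [x_b[i] + a * (x[i] - x_b[i]) for i in range(n)]
        g = grad(f, z)
        for i in range(n):
            acc[i] += g[i] / resolution
    return acc

x, x_b = [1.0, 2.0], [0.0, 0.0]
g_q = distributional_influence(f, x, x_b)
# Completeness (Sundararajan et al., 2017): sum_i (x_i - x_b_i) * g_q_i = f(x) - f(x_b)
total = sum((x[i] - x_b[i]) * g_q[i] for i in range(2))
```

For this f, the path-averaged gradients are (3, 2), so `total` recovers f(x) − f(x_b) = 5, illustrating the completeness property referenced in Appendix B.1.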
Definition 2 (Individual Path Influence) Given a path p of a computation graph G, the individual path influence for an input x, written χ(x, p), is:

χ(x, p) := E_{z ∼ D(x)} [ ∇_z p ]

Lu et al. (2020) use individual path influence to decompose distributional influence into paths and explain internal LSTM behaviour under SVA via the most influential path arg max_{p ∈ P} χ(x, p), where P is the set of all paths from an input to a particular output (normally a QoI). The influence path approach relies on enumerating P.
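On a tiny linear graph the chain-rule product in Def. 2 can be computed by hand; the sketch below uses hypothetical scalar edge derivatives (the DoI is a point mass, so the expectation is trivial), and checks that the path influences sum to the total derivative:

```python
# Individual path influence on a toy graph: x -> h_a -> y and x -> h_b -> y,
# with h_a = 2x, h_b = 3x, y = h_a + 5*h_b. Each path's influence is the
# chain-rule product of edge derivatives along the path.
edges = {("x", "h_a"): 2.0, ("x", "h_b"): 3.0,
         ("h_a", "y"): 1.0, ("h_b", "y"): 5.0}

def path_influence(path):
    prod = 1.0
    for u, v in zip(path, path[1:]):
        prod *= edges[(u, v)]
    return prod

chi_a = path_influence(["x", "h_a", "y"])   # 2 * 1
chi_b = path_influence(["x", "h_b", "y"])   # 3 * 5
total_dydx = chi_a + chi_b                  # all paths together give dy/dx
```

The two paths carry influences 2 and 15, and their sum equals dy/dx = 17, the decomposition property the influence-path approach relies on.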

3. PATH ABSTRACTION

Directly applying the individual influence paths of Lu et al. (2020) to transformer-based models like BERT has computational and conceptual problems. BERT is denser in terms of model connections: each node at each layer integrates information from all nodes of the prior layer (as opposed to the pair of short-term and long-term connections in an LSTM). This results in an intractable number of influence paths to enumerate, even for the simplest of BERT variants. Our approach is three-fold: first, we employ abstractions of sets of paths as the localization and influence quantification instrument; second, we discover influential patterns with a greedy search procedure that refines abstract patterns into more concrete ones while keeping influence high; and third, we consider the collection of influence patterns from every word in a sentence to the quantity of interest. We begin with the pattern abstraction:

Definition 3 (Multi-partite pattern) A multi-partite pattern π is a sequence of nodes [π_1, π_2, ..., π_ℓ] such that for any pair of nodes π_i, π_{i+1} adjacent in the sequence (not necessarily adjacent in the graph), there exists a path from π_i to π_{i+1}. A pattern π abstracts a set of paths, written γ(π), that follow the given sequence of nodes but are free to traverse the graph between those nodes in any way. Interpreting paths and patterns as sets of nodes, we define γ(π) := {p ∈ P : π ⊆ p}, where P is the set of all paths from π_1 to π_ℓ. If every sequence-adjacent pair of nodes is directly connected, then the pattern abstracts a single path.
Definition 4 (Pattern influence) Given a computation graph and a DoI D, the influence of a multi-partite influence pattern π, written I(x, π), is the total influence of all the paths abstracted by the pattern:

I(x, π) := Σ_{p ∈ γ(π)} χ(x, p) = E_{z ∼ D(x)} [ ∏_{i=2}^{ℓ} ∂π_i(z)/∂π_{i−1}(z) ]

Note that the influence of individual paths may be positive or negative, so cancellation is possible in the influence of a pattern, which aggregates paths.

Computation Graphs for BERT. A given DNN can be expressed by many computational graphs. For computational and interpretability reasons, an ideal graph contains as few nodes and edges as possible while exposing structures of interest. For BERT in particular, we propose the embedding-level graph G_e, corresponding to the nodes and edges shown in Fig. 1 (left), to explain how the influence of input embeddings flows from one Transformer layer to another and to the eventual prediction of [MASK]; and the attention-level graph G_a ⊃ G_e, which additionally includes head nodes as in Fig. 1 (right), a finer decomposition demonstrating how influence from the input embedding flows through the attention block within each layer. BERT's semantics are modeled using the computational graph's activation functions, which we omit here. As the attention-level graph contains a superset of the nodes of the embedding-level graph, we can interpret embedding-level patterns as abstracting paths in both the embedding-level graph and the attention-level graph. Furthermore, a concrete path in G_e is a pattern in G_a, as it contains G_a-nonadjacent nodes and thus abstracts multiple paths in G_a. For a given pattern π of G_e we can thus write γ_a(π) for the set of paths it abstracts in G_a, with:

γ_a(π) := ∪_{p ∈ γ_e(π)} γ_a(p)

Guided Pattern Refinement (GPR). Instead of enumerating the path space P to discover influential paths, we approximate the search by greedily refining patterns while maximizing their influence.
Starting with source and target nodes s and t along with a pattern π_0 = [s, t] representing all paths between s and t, we construct π_1 by adding a node from a guiding set E_0 that maximizes the influence of the resulting pattern. At the first iteration and subsequently, the guiding set defines a cut of the (multi-partite) graph between two sequence-adjacent nodes (initially just s and t). The procedure is repeated with additional refinement. At iteration i + 1, a guiding set E_i defines a cut between nodes s_i and t_i, and the cut node that refines the pattern to maximal influence is selected:

e_i := arg max_{e ∈ E_i} I(x, π_i[s_i, t_i \ s_i, e, t_i]),    π_{i+1} := π_i[s_i, t_i \ s_i, e_i, t_i]

Above, π[a, c \ a, b, c] denotes the pattern π in which sequence-adjacent nodes a, c are replaced with a, b, c in their position in the sequence. Repeating the procedure for some number of steps, or until some stopping criterion is reached, produces a sequence of patterns with decreasing abstraction: γ(π_{i+1}) ⊆ γ(π_i). Once a pattern is produced that abstracts a single path, no more refinement can be done, though it might not be desirable to continue refinement to that point for interpretability reasons. Also, the choice of guiding sets E_i at each iteration can have an impact on the resulting patterns, both in terms of their influence significance and the computational requirements of each iteration: smaller sets require fewer options to enumerate but are likely to lead to less influential patterns. In our experiments we employ a layer-ordered strategy for the embedding-level pattern refinement and then refine the resulting pattern in the attention-level graph. In the embedding-level analysis, at iteration i, we focus on layer i. The guiding set E_i is the cut:

(embedding-level guiding set)    E_i := {h^i_j}_j

The refinement thus proceeds for L iterations (the input layer can be skipped).
If the input node is denoted x and the quantity of interest q, the refinement process results in a pattern π_e := [x, h^1_{j_1}, h^2_{j_2}, ..., h^L_{j_L}, q], where the j_i are indices designating which embeddings at each layer i the abstracted paths traverse. The attention-level refinement starts with the embedding-level pattern π_e and exposes the attention heads to cut the flow of influence in that starting pattern, also in order of the layers. At iteration i, the cut E_i is:

(attention-level guiding set)    E_i := {a^{i,k}_{j_i}}_k ∪ {s^i_{j_i}}

That is, the cut separates embedding nodes h^i_{j_i} and h^{i+1}_{j_{i+1}} with the attention heads {a^{i,k}_{j_i}}_k and a skip edge modeled as a node s^i_{j_i}. As the attention-level analysis refines the embedding-level analysis, the produced attention-level pattern π_a abstracts a strict subset of the paths of the attention-level graph that the embedding-level pattern π_e abstracts; that is, π_e ⊂ π_a and therefore γ_a(π_e) ⊃ γ_a(π_a). In our experiments, we perform GPR independently for each input word, refining with the most positively influential cut nodes for positively influential words (g_q(x_i) ≥ 0) and with the most negatively influential cut nodes for negative (g_q(x_i) < 0) words. In the following section, we use π^i for the extracted pattern of individual input word i, Π for the set of patterns for all words, and Π_+ for the set of patterns for all positively influential words. These may be further decorated with a or e to denote attention-level or embedding-level results.
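The layer-ordered refinement loop above can be sketched under simplifying assumptions: a hypothetical layered graph whose edge "Jacobians" are scalars, so a path's influence is the product of its edge weights and a pattern's influence (Def. 4) is the sum over the paths it abstracts. This is a toy of GPR's structure, not the BERT computation:

```python
from itertools import product

# Toy layered graph: s -> layer 1 -> layer 2 -> t, two nodes per hidden layer.
# Edge weights stand in for scalar Jacobians (hypothetical values).
layers = [["s"], ["h1a", "h1b"], ["h2a", "h2b"], ["t"]]
w = {("s", "h1a"): 1.0, ("s", "h1b"): 0.1,
     ("h1a", "h2a"): 2.0, ("h1a", "h2b"): 0.2,
     ("h1b", "h2a"): 0.3, ("h1b", "h2b"): 0.4,
     ("h2a", "t"): 1.5, ("h2b", "t"): 0.1}

def pattern_influence(pattern):
    # sum of chain-rule products over every full path that visits all pattern nodes
    total = 0.0
    for path in product(*layers):
        if all(n in path for n in pattern):
            prod = 1.0
            for u, v in zip(path, path[1:]):
                prod *= w[(u, v)]
            total += prod
    return total

def gpr(source="s", target="t"):
    # layer-ordered guided pattern refinement: the guiding set at step i is the
    # i-th hidden layer; keep the cut node that maximizes pattern influence
    pattern = [source, target]
    for guiding_set in layers[1:-1]:
        best = max(guiding_set,
                   key=lambda n: pattern_influence(pattern[:-1] + [n, target]))
        pattern = pattern[:-1] + [best, target]
    return pattern

pi = gpr()   # refines [s, t] -> [s, h1a, t] -> [s, h1a, h2a, t]
```

Each refinement strictly shrinks the abstracted path set while (greedily) keeping the influence high: here the fully refined pattern retains 3.0 of the 3.069 total influence of [s, t].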

4. EVALUATION

We apply GPR to discover BERT patterns at the embedding and attention levels. We begin with a summary of the linguistic tasks, datasets, models, and hyper-parameters. We evaluate patterns for their sparsity and sufficiency in Sec. 4.1, as measured by the metrics of concentration and compression accuracy (Lu et al., 2020), respectively. Finally, in Sec. 4.2 we visualize example patterns and discuss how they help explain the contextualization of SVA in BERT.

Tasks. We consider two linguistic tasks: subject-verb agreement (SVA) and reflexive anaphora (RA). We explore different forms of sentence stimuli in each task: object relative clause (Obj.), subject relative clause (Subj.), within sentence complement (WSC), and across prepositional phrase (APP) in SVA; number agreement (NA) and gender agreement (GA) in RA. The SVA and RA datasets (Marvin & Linzen, 2018; Lin et al., 2019) are evaluated with MLM in the same way as prior work (Goldberg, 2019). We sample 200 sentences evenly distributed across different sentence types (e.g. singular/plural subject & singular/plural intervening noun) with a fixed sentence structure for each task; sentence length and the word types in each position are consistent across samples. Examples of each task are found in Appendix A.

QoI and Distributional Influence. We use the same QoI as Lu et al. (2020), where q(y_m) := y_{m,correct} − y_{m,wrong}, e.g. y_{m,IS} − y_{m,ARE} for she [MASK] happy. We select D as a uniform distribution over a linear path from x_b to x in the input space for each word, with the baseline x_b defined as the input embedding of [MASK]; we view it as a neutral word carrying no information.

Model. We evaluate our methods with a BERT model with L = 6, A = 8, referred to hereafter as BERT_SMALL, from Turc et al. (2019), instead of larger models such as BERT_BASE used in the original BERT paper (Devlin et al., 2019), because 1) we find BERT_BASE is not significantly better than BERT_SMALL on the tasks of interest, as shown in Appendix A; 2) when approximating the expectation in Def. 1 with finitely many points, we find more than 2000 samples are required for BERT_BASE for an acceptable margin, while 50 samples suffice for BERT_SMALL (see Appendix B.1); and 3) visualizations of abstracted patterns from BERT_SMALL are easier on human interpreters.
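As a toy numerical illustration of this QoI (the vocabulary, indices, and scores below are hypothetical stand-ins, not taken from BERT), the correct-minus-wrong margin at the [MASK] position can be sketched as:

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary and final-layer scores W h^L_m at the [MASK] position
vocab = {"is": 0, "are": 1, "dog": 2}
logits = [1.0, 2.5, -0.5]            # hypothetical values, not BERT outputs
y_m = softmax(logits)

# SVA quantity of interest: margin of correct verb over the incorrect one,
# e.g. q = y_are - y_is for a plural subject
q = y_m[vocab["are"]] - y_m[vocab["is"]]
```

A positive q means the model (here, the toy scores) prefers the correct verb; the influence analyses ask which inputs and internal paths drive q up or down.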

4.1. QUANTITATIVE ANALYSIS

Recall that π_e and π_a denote the abstracted patterns returned by GPR on the embedding-level graph and the attention-level graph, respectively. The quantitative evaluation in this section aims to verify that influential patterns in BERT are 1) sparse: inputs influence the QoI largely through π_e or π_a, and 2) sufficient: BERT retains high task accuracy if only π_e or π_a is evaluated at inference time. First, we introduce concentration to evaluate sparsity:

Definition 5 (Concentration) Given an input x with N words/embeddings, their distributional influences g_q(x) and pattern influences I(x, π^i) for a set of patterns Π = {π^i}_i, the concentration of the positive (resp. negative) pattern influence, C_{Π+} (resp. C_{Π−}), is the patterns' share of positive (resp. negative) influence as compared to the total positive (resp. negative) distributional influence:

C_{Π+} := [ Σ_{i=1}^N I(x, π^i) · 1[I(x, π^i) ≥ 0] ] / [ Σ_{i=1}^N g_q(x)_i · 1[g_q(x)_i ≥ 0] ]
C_{Π−} := [ Σ_{i=1}^N I(x, π^i) · 1[I(x, π^i) < 0] ] / [ Σ_{i=1}^N g_q(x)_i · 1[g_q(x)_i < 0] ]

To evaluate sufficiency, we employ a compression study previously used to verify other explanation devices (Dabkowski & Gal, 2017; Ancona et al., 2018; Leino et al., 2018; Lu et al., 2020). For each example, we compress BERT down to a specific pattern: we retain only the nodes from Π_e+ (or Π_a+) while replacing all other nodes, layer by layer. Starting from the first layer, the embedding nodes not in Π_e+ (or the attention/skip-connection nodes not in Π_a+) are replaced by the embedding of [MASK] (or zero vectors for attention/skip nodes), while the nodes in Π_e+ (or Π_a+) remain untouched. The retained and replaced nodes together are forward-passed to the next layer using the original model parameters until a new set of nodes needs to be retained or replaced. We then compare the accuracies of predicting labels of [MASK] between the original model and the compressed model; we refer to the latter's accuracy as the compressed accuracy.
Explanation of Results. Concentration and compressed accuracy of π_e and π_a are shown in Table 1 using the setup in Sec. 4. Replacing * in the second row with e or a corresponds to the results for the abstracted embedding-level patterns π_e and the attention-level patterns π_a, respectively.

Figure 2: Significant patterns π_a extracted by GPR from the attention-level graph for the task SVA across object relative clauses (Goldberg, 2019; Marvin & Linzen, 2018), in two examples with attractors. Left: bar plots of the distributional influence g(x_i) (yellow), I(x_i, π_e^i) (purple) and I(x_i, π_a^i) (blue) for each word at position i. Right: significant patterns π_a^i from each input word at position i to the quantity of interest (verb number correctness). Square nodes denote input embeddings and circles denote internal contextualized embeddings. Dashed lines correspond to skip connections in the attention block, while solid lines correspond to connections through (any) attention heads. Attention connections with high influence flow are marked with the corresponding attention head number (ranging from 1 to 8). Line colors represent the sign of influence (red negative, green positive).

Sparsity with concentration. By the gradient chain rule, the total influence of all individual paths from the input to the QoI equals the distributional influence; therefore 0 < C_{Π+}, C_{Π−} < 1. As shown in the first two columns of Table 1, there are about e^14 and e^27 individual paths in the embedding-level and attention-level graphs, respectively. However, the abstracted patterns (shown in the 3rd to 6th columns) account for a large portion of both positive and negative influence across all tasks. The embedding-level (* = e) abstracted patterns contribute around 30% of the total influence, indicating that the concept is concentrated in individual contextualized embeddings in each layer, instead of being dispersed across many words.
Zooming in on the attention level (* = a), concentration suggests that between the contextualized embeddings of adjacent layers, influence is also concentrated in either a single attention head or the skip connection.

Sufficiency with compressed accuracy. We show the original accuracies of the model on different tasks in the last column. The compressed model retains high accuracy (shown in the 9th and 10th columns) with only a tiny portion of the model (shown in the 7th and 8th columns) retained. As a comparison, we report the compressed accuracy with random patterns in the 11th and 12th columns: compressing the model by retaining the same number of nodes as π_e+, but chosen at random, yields performance close to 50%, effectively a random guess; randomly chosen patterns of the same size do not abstract the concept at all.
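The concentration metric of Def. 5 reduces to a few sums; the sketch below computes both shares on hypothetical per-word influence values (the numbers are illustrative, not measured from BERT):

```python
# Concentration (Def. 5): the patterns' share of the total positive (negative)
# distributional influence, on toy per-word values.
g_q_words = [0.50, -0.20, 0.30, -0.05]     # distributional influence per word (toy)
I_patterns = [0.20, -0.12, 0.10, -0.01]    # influence of each word's pattern (toy)

c_pos = sum(v for v in I_patterns if v >= 0) / sum(v for v in g_q_words if v >= 0)
c_neg = sum(v for v in I_patterns if v < 0) / sum(v for v in g_q_words if v < 0)
```

Here the patterns capture 37.5% of the positive and 52% of the negative distributional influence; both ratios lie in (0, 1) because the total path influence equals the distributional influence by the chain rule.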

4.2. EXPLAINING CONTEXTUALIZATION OF SVA ACROSS OBJECT RELATIVE CLAUSE

In this section, we explain contextualization between internal representations of BERT by visualizing the significant patterns π_e and π_a found by GPR for two examples of SVA across object relative clauses, as seen in Figure 2. Results on other tasks are included in Appendix B.2. First, we observe that in both sentences of Figure 2, both words in the subject phrase ("the" and the nouns) exert a positive input influence on the correct prediction of the verb, and the intervening noun (attractor) exerts negative influence; the same holds for both I(x, π_e^i) and I(x, π_a^i).

"Copy and Transfer". We observe many horizontal dashed lines in Figure 2, indicating that significant influence flows through layers at the same word position using skip connections. Zooming in on π_e^i and π_a^i, we observe that the subject phrase travels through skip connections across the lower layers, and only through attention head 5 in the last layer. This "copy and transfer" procedure indicates that BERT mostly picks up the signal from the subject input embedding without much contextualization; exactly how it overcomes (or fails to overcome) the comparable signals from the attractors is explained in the next section. In addition, we speculate that the reason attention can be effectively pruned without compromising performance in prior work (Michel et al., 2019) is that some concepts do not travel through the attention block at all: they are simply "copied" to the next layer through the skip connections. In Appendix B.2 we observe that all the above conclusions are also prevalent in other tasks (such as the contextualization of prepositions in the APP task), though different heads may be used for the "transfer" operations in different tasks.

The Role of that. Comparing the two sentences of Figure 2, we observe that the influence from the singular subject is weaker than that of the plural subject, especially compared to the negative influence from attractors.
The key difference is that that behaves differently for singular and plural subjects. that in Figure 4a behaves as a singular noun (since that can also serve as a singular pronoun in English), flowing through the same straightforward pattern as the subject (skip connections + attention head 5); that in Figure 2a, however, behaves more like a grammatical marker (relativizer): the pattern from that travels to the subject in the second-to-last layer through a different attention head. We speculate that that in plural-subject sentences encodes the syntactic boundary of the clause and helps identify the main subject while ignoring the intervening noun. As a result, attractors in PS sentences have a smaller negative influence, compared to the high negative influence from attractors in SP sentences (a similar comparison is also observed in the PP and SS cases). This discrepancy in the behavior of that also corroborates the lower SVA accuracy in the SP case than in the PS case (see Appendix A). We observe this difference consistently across all instances, as shown in an aggregated visualization in Appendix B.2, and in other tasks as well.

5. RELATED WORK

Previous work has shown the encoding of syntactic dependencies such as subject-verb agreement (SVA) in RNN language models (Linzen et al., 2016), as well as explanations for such encoding (Hupkes et al., 2018; Lakretz et al., 2019; Jumelet et al., 2019). More extensive work has since been done on transformer-based architectures such as BERT (Devlin et al., 2019). Diagnostic classifiers trained on output and internal representations discover that BERT encodes many types of linguistic knowledge (Elazar et al., 2020; Hewitt & Liang, 2019; Tenney et al., 2019a;b; Jawahar et al., 2019; Klafka & Ettinger, 2020; Liu et al., 2019; Lin et al., 2019), ranging from syntactic concepts to more complicated semantic ones. Goldberg (2019) discovers that SVA and RA in complex clausal structures are better represented in BERT than in an RNN model. This is partially explained by Reif et al. (2019) and Hewitt & Manning (2019), who show that contextual embeddings in BERT can encode syntactic structures hierarchically, comparable to those represented in a dependency tree. However, all these analyses are done on frozen contextual embedding layers; the exact causal mechanism by which a concept is encoded from input to output is not explored. Another line of work in interpreting BERT concerns analyzing its self-attention weights (Clark et al., 2019; Vig & Belinkov, 2019; Lin et al., 2019), where attention heads are found to have direct correspondences with specific dependency relations. However, attention weights as interpretation devices have been controversial (Serrano & Smith, 2019), and empirical analysis has shown that attention can be perturbed or pruned while retaining the same or even better performance (Kovaleva et al., 2019; Michel et al., 2019; Voita et al., 2019).
More importantly, our work demonstrates that attention mechanisms are only part of the BERT computation graph, with each attention block complemented by additional components such as dense layers and skip connections. The strong influence passing through skip connections also corroborates the findings of Brunner et al. (2020), who find that input tokens mostly retain their identity. Besides pruning attention, other works (Prasanna et al., 2020; Sanh et al., 2019; Jiao et al., 2019) also show that BERT is overparametrized and can be greatly compressed. Our work corroborates that point to some extent by pointing to the sparse gradient flow, while employing model compression only to verify the sufficiency of the extracted patterns. Recent work introducing influence paths (Lu et al., 2020) offers another form of explanation: it decomposes an attribution into path-specific quantities, localizing the implementation of a given concept to paths through a model. The authors demonstrated that for LSTM models a single path is responsible for most of the input-output effect defining SVA, and explored the effects of unhelpful nouns that showed negative influence on SVA. We describe the limitations of this methodology when applied to BERT in Sec. 3.

6. CONCLUSION

We have demonstrated how to use multi-partite influence patterns to localize a DNN model's handling of a concept of interest and, along with a pattern refinement method, how BERT handles subject-verb number agreement and reflexive anaphora. We quantitatively validated the sufficiency and sparsity of influence patterns in BERT by way of compression experiments and the influence concentration of discovered patterns. We qualitatively and visually demonstrated BERT's contextualization in the two tasks using our methodology. Our formalism and methods are general enough to apply to the analysis of other aspects of BERT and of other models.

B APPENDIX: EXPERIMENT DETAILS

B.1 CONVERGENCE CHECK

When the DoI of the distributional influence g_q(x) is a uniform distribution on a linear path from a baseline input x_b to the target input x, the completeness axiom (Sundararajan et al., 2017) gives q(f(x)) − q(f(x_b)) = Σ_i (x_i − x_{b,i}) · g_q(x)_i, where q is the selected Quantity of Interest. However, when a finite summation is used to approximate the expectation in practice, the RHS of the completeness axiom does not always converge to the LHS easily. In Fig. 3, we plot the relative deviation [q(f(x)) − q(f(x_b)) − Σ_i (x_i − x_{b,i}) · g_q(x)_i] / [q(f(x)) − q(f(x_b))] against the resolution, the number of samples drawn from the distribution in the summation. Due to the limits of our GPU memory (12 GB) and the computational cost, we find the maximum number of batched samples to be 50. At that resolution, BERT_SMALL has lower approximation error than BERT_BASE. The much harder approximation for the larger BERT model is likely due to its more complicated decision boundaries, making the output sensitive to small perturbations.
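The convergence behavior can be reproduced in miniature: with a toy model standing in for BERT (hypothetical, with an analytic gradient) and left-endpoint sampling of the path DoI, the relative completeness deviation shrinks as the resolution grows:

```python
# Completeness-deviation check on a toy model: relative gap between
# q(f(x)) - q(f(x_b)) and the finite-sample influence sum, vs. resolution.
def f(z):
    return z[0] * z[0] + 2.0 * z[0] * z[1]          # toy model; q is the identity

def grad_f(z):
    return [2.0 * z[0] + 2.0 * z[1], 2.0 * z[0]]    # analytic gradient of f

def deviation(x, x_b, resolution):
    # left-endpoint Riemann approximation of g_q(x) along the path DoI
    acc = [0.0, 0.0]
    for k in range(resolution):
        a = k / resolution
        z = [x_b[i] + a * (x[i] - x_b[i]) for i in range(2)]
        g = grad_f(z)
        for i in range(2):
            acc[i] += g[i] / resolution
    influence_sum = sum((x[i] - x_b[i]) * acc[i] for i in range(2))
    exact = f(x) - f(x_b)
    return abs(exact - influence_sum) / abs(exact)

x, x_b = [1.0, 2.0], [0.0, 0.0]
devs = [deviation(x, x_b, n) for n in (10, 50, 200)]
```

For this toy model the deviation decays like 1/resolution (0.1 at 10 samples, 0.02 at 50); the paper's point is that for BERT_BASE the analogous curve decays far more slowly than for BERT_SMALL.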

B.2 AGGREGATED INFLUENCE GRAPHS FOR ALL TASKS

In this section, we show an aggregated visualization across all examples by superimposing the visualizations of individual instances, such as the ones in Figure 2, while adjusting each line's width to be proportional to the frequency of that flow across all examples. The words within parentheses represent one instance of the word in that position.



Multi-partite because patterns abstract sets of paths in neural models viewed as multi-partite graphs. The single-edge restriction is for notational convenience in what follows; if a given neural model does have more than one edge between adjacent nodes, we can replace duplicate edges with length-2 paths through dummy identity nodes to satisfy this requirement without affecting its semantics.



Figure 1: BERT Transformer architecture (left) and details of a transformer layer (right).

singular subject + plural intervening noun (SP)

Figure 3: Convergence analysis for calculating the distributional influence (IG) for SVA across object relative clauses, for BERT_SMALL (used in this paper) and BERT_BASE from Devlin et al. (2019). X-axis: the number of samples used to approximate the influence; Y-axis: the percentage deviation of the approximation from the true influence value.

Figure 4: SVA Across Object Relative Clause.

Figure 5: SVA Across Subject Relative Clause.

Figure 6: SVA Within Sentence Complements.

Figure 7: SVA Across Prepositional Phrase.

Figure 8: RA: Number Agreement

Table 1: Pattern sufficiency, sparsity, and related metrics on various linguistic tasks. Metrics are shown in the 1st row while the 2nd row indicates graph levels: * denotes e or a, corresponding to abstracted embedding-level patterns π_e or attention-level patterns π_a, respectively. ln|P_*|: natural log of the number of possible paths; C_{Π*+}, C_{Π*−}: positive/negative concentrations; |P_*|: percentage of paths in the abstracted patterns over the total number of paths; acc.(π_*+): the compressed accuracy of abstracted patterns; acc.(rand._*): the compressed accuracy of randomly compressed models; acc.(ori.): the accuracy of the original model BERT_SMALL.

