ATTENTION FLOWS FOR GENERAL TRANSFORMERS

Abstract

In this paper, we study the computation of how much an input token in a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend it to general Transformer architectures, including an auto-regressive decoder. We show that running a maxflow algorithm on the flow network construction yields Shapley values, which determine a player's impact in cooperative game theory. By interpreting the input tokens in the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We show the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has been the dominant machine learning architecture in recent years, finding application in NLP (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), or LaMDA (Collins and Ghahramani, 2021)), computer vision (see Khan et al. (2021) for a survey), mathematical reasoning (Lample and Charton, 2019; Han et al., 2021), and even code and hardware synthesis (Chen et al., 2021; Schmitt et al., 2021). The Transformer relies on attention (Bahdanau et al., 2015), which mimics cognitive attention by setting the focus of computation on a few concepts at a time. In this paper, we rigorously formalize the construction of a flow network out of attention values (Abnar and Zuidema, 2020) and generalize it to models that include a decoder. While this construction yields a Shapley value (Shapley, 1953) quite trivially, we show that it results in meaningful explanations of the input tokens' influence on the total flow affecting a Transformer's prediction.

Its applicability in various domains has made the Transformer architecture incredibly popular. Models are easily accessible to developers around the world, for example at huggingface.co (Wolf et al., 2019). However, blindly using or fine-tuning these models might lead to mispredictions and unwanted biases, which can have a considerable negative effect on their application domains. The sheer size of Transformer models makes it impossible to analyze the networks by hand. Explainability and visualization methods, e.g., Vig (2019), aid the machine learning practitioner and researcher in finding the cause of a misprediction or revealing unwanted biases; the training method or the dataset can then be adjusted accordingly. Abnar and Zuidema (2020) introduced Attention Flow as a post-processing interpretability technique that treats the self-attention weight matrices of a Transformer encoder as a flow network.
This technique allows analyzing the flow of attention through the Transformer encoder: computing the maxflow for an input token determines the impact of this token on the total attention flow. Ethayarajh and Jurafsky (2021) discussed a possible relation of the maxflow computation through the encoder flow network to Shapley values, a concept that determines a player's impact in cooperative game theory and can be applied to measure the importance of a model's input features. However, the lack of a clear formalization of the underlying flow network has made it difficult to assess the validity of their claims, which we aim to address in this work.

We extend our formalization of the approach to a Transformer-model-agnostic technique, covering general encoder-decoder Transformers and decoder-only Transformers such as GPT models (Radford et al., 2018). While the encoder processes the input tokens as a whole (after applying a positional encoding), the decoder layers operate auto-regressively, i.e., a sequence of tokens is predicted step-by-step, and already predicted tokens are fed back as input to the decoder. This results in a significantly different shape of the flow network and, in particular, requires a normalization to account for the bias towards tokens that were predicted later than others. We account for the auto-regressive nature of the decoder by ensuring positional independence of the computed maxflow values. We implemented our constructions as a Python library, which we will publish under the MIT license.

In summary, our contributions are the following. We formalize encoder-only attention flow and generalize the approach to encoder-decoder and decoder-only Transformers in Section 2. Furthermore, we use the formalization to construct an explicit algorithm for attention flow computation and analyze its complexity. In Section 3, we show that the computed attention flow values are Shapley values for all three architectures.
Section 4 introduces a tool to compute and visualize attention flow for arbitrary Transformers. We report on qualitative and quantitative experiments that show the effectiveness of our approach, including token bias and single-head attention analyses.

Related Work. We would like to emphasize the work on which we build: Abnar and Zuidema (2020), who introduced attention flow for Transformer encoders, and Ethayarajh and Jurafsky (2021), who drew a possible connection between encoder attention flows and Shapley values. Overviews of explainability are given by Samek et al. (2017) and Burkart and Huber (2021). Sundararajan and Najmi (2020) give an overview of Shapley value formulations for machine learning models; these formulations are not restricted to Transformer models and do not include attention flow (Lindeman, 1980; Grömping, 2007; Owen, 2014; Owen and Prieur, 2017; Štrumbelj et al., 2009; Štrumbelj and Kononenko, 2014; Datta et al., 2016; Lundberg and Lee, 2017; Lundberg et al., 2018; Aas et al., 2019; Sun and Sundararajan, 2011; Sundararajan et al., 2017; Agarwal et al., 2019). Shapley values are also used for the valuation of machine learning data (Ghorbani and Zou, 2019). Raw attention values can be visualized with the tools of, e.g., Vig (2019) and Wang et al. (2021). Chefer et al. (2021) assign local relevance based on the Deep Taylor Decomposition principle (Montavon et al., 2017).

2. ATTENTION FLOW

Attention Flow (Abnar and Zuidema, 2020) is a post-processing interpretability technique that treats the self-attention weight matrices of the Transformer encoder as a flow network and returns the maximum flow through each input token. Formally, a flow network is defined as follows.

Definition 1 (Flow Network). Given a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges, a flow network is a tuple (G, c, s, t), where c : E → R ∪ {∞} is the capacity function and s and t are the source and terminal (sink) nodes, respectively. A flow is a function f : E → R satisfying the following two conditions. Flow conservation: ∀v ∈ V \ {s, t}. x_f(v) = 0, where x_f : V → R is defined as x_f(u) = Σ_{v∈V} f(v, u); capacity constraint: ∀e ∈ E. f(e) ≤ c(e). The value of a flow f is the amount of flow from the source node s to the terminal node t: |f| = Σ_{v:(s,v)∈E} f_{sv}. For a given set K of nodes, we define |f(K)| as the flow value from s to t passing only through nodes in K: |f(K)| = Σ_{v:(s,v)∈E, v∈K} f_{sv}. We write |f_o(v)| for the total outflow value of a node v and |f_i(v)| for the total inflow value of a node v. In optimization theory, the maximum flow problem max |f| (Harris and Ross, 1955) asks for a flow, denoted f_max, that pushes the maximum possible flow value |f| from the source node s to the terminal node t.
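The maxflow problem from Definition 1 can be sketched in a few lines of pure Python. The following is a minimal Edmonds-Karp implementation (BFS augmenting paths) on a toy network; the graph, capacities, and function name are illustrative and not taken from the paper's library.

```python
# Minimal maxflow sketch for Definition 1 (toy capacities, illustrative names).
from collections import deque

def edmonds_karp(cap, s, t):
    """Maximum flow via shortest augmenting paths. cap: dict[u][v] -> capacity."""
    # residual capacities, initialised from cap, with reverse edges at 0
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0.0)
    flow = 0.0
    while True:
        # BFS for a shortest augmenting path s -> t in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # collect the path and its bottleneck capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b
        flow += b

cap = {"s": {"a": 0.6, "b": 0.4}, "a": {"t": 0.5}, "b": {"t": 1.0}, "t": {}}
print(edmonds_karp(cap, "s", "t"))  # 0.9: path s-a-t saturates at 0.5, s-b-t at 0.4
```

The paper's implementation uses NETWORKX for this computation (see Section 4); the sketch above only illustrates the underlying algorithm.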

2.1. ENCODER ATTENTION FLOW

Given an encoder-only Transformer model, such as the BERT (Devlin et al., 2019) model family, with H attention heads, L layers, M input tokens I = {i_1, . . . , i_M}, and the resulting self-attention tensor A^E ∈ R^{H×L×M×M}. For some X ∈ N, we define [X] as the set {1, . . . , X}. For a set of positions J, a subset of input tokens I′ ⊆ I, and a subset of heads H′ ⊆ [H], we construct a flow network F_enc(A^E, I′, J) = (G, c, s, t) as follows:

V := (I × [L + 1]) ∪ {s, t},
E := {((i_j, l), (i_k, l + 1)) | i_j, i_k ∈ I ∧ l ∈ [L]}
   ∪ {((i_j, L + 1), t) | i_j ∈ I ∧ j ∈ J}
   ∪ {(s, (i′, 1)) | i′ ∈ I′},
c((i_j, l), v′) := (1 / |H′|) Σ_{h∈H′} A^E_{h,l,k,j}   if v′ = (i_k, l + 1),
c((i_j, L + 1), t) := ∞.

We visualize this flow network translation in Figure 1a. The flow network consists of L + 1 columns of nodes and L columns of edges. The attention values are encoded as capacities on the edges; thus, the underlying graph of the flow network requires one additional column of nodes. Computing the maximum flow through this network determines the contribution of the input tokens I′ to the attention flow towards the final encoder embeddings given by J. Note that the nodes in columns greater than 1 correspond to encoder embeddings and cannot be interpreted as input tokens anymore. Residual connections can be taken into account as proposed by Abnar and Zuidema (2020), i.e., by adding an identity matrix I and re-normalizing the attention matrix as 0.5A + 0.5I. By successively setting the source s to singleton sets containing only a single input token and connecting all final embeddings to t, we can compute the encoder flow for every encoder input token as introduced by Abnar and Zuidema (2020). The encoder flow network construction can also be used for models with a classification task (see Section 4): to determine the influence of input tokens on the attention flow towards deciding the class, the terminal node t is connected only to the final embedding of the classification token.
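The construction of F_enc can be sketched directly from the definition. In the following toy example, the attention tensor, token and head counts, and variable names are all illustrative (0-based column indices for convenience); the capacity function is represented as a plain dictionary.

```python
# Sketch of the encoder flow-network construction F_enc (Section 2.1).
# Toy attention tensor and sizes; all names are illustrative.
H, L, M = 2, 1, 2                      # heads, layers, tokens
# A[h][l][k][j]: attention of position k in column l+1 to position j in column l
A = [
    [[[0.6, 0.2], [0.4, 0.8]]],        # head 0
    [[[0.2, 0.4], [0.8, 0.6]]],        # head 1
]
heads = [0, 1]                         # subset H' of heads (here: all)
I_src = [0]                            # I': tokens connected to the source s
J = [0, 1]                             # positions whose final embeddings reach t

INF = float("inf")
cap = {}                               # the capacity function c on edges
for l in range(L):                     # L edge columns between L+1 node columns
    for j in range(M):
        for k in range(M):
            cap[((j, l), (k, l + 1))] = sum(A[h][l][k][j] for h in heads) / len(heads)
for j in J:
    cap[((j, L), "t")] = INF           # final embeddings in J connect to t
for i in I_src:
    cap[("s", (i, 0))] = INF           # source connects to tokens in I'

print(cap[((0, 0), (1, 1))])           # head-averaged attention: (0.4 + 0.8) / 2
```

Running a maxflow algorithm from `"s"` to `"t"` on `cap` then yields the contribution of the tokens in I′, as described in the text.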

2.2. DECODER ATTENTION FLOW

Generative Transformer models that involve a decoder require a significantly different shape of flow network. We begin by investigating decoder-only models with H attention heads, L layers, N "output" tokens O = {o_1, . . . , o_N}, and the self-attention tensor A^D ∈ R^{H×L×N×N}. Since we consider decoder-only models, a prefix subset O_input ⊆ O is given as problem input to the neural network model. Note that the first output token is always a special start token. For a set of output tokens O′ ⊆ O, the position n of output token o_n ∈ O, and a subset of heads H′ ⊆ [H], the construction of a flow network F_dec(A^D, O′, n) = (G, c, s, t) follows the structure of the decoder self-attention:

V := (O × [L + 1]) ∪ {s},
E := {((o_j, l), (o_k, l + 1)) | o_j, o_k ∈ O ∧ l ∈ [L] ∧ j ≤ k}
   ∪ {(s, (o′, 1)) | o′ ∈ O′},
c((o_j, l), (o_k, l + 1)) := (1 / |H′|) Σ_{h∈H′} A^D_{h,l,k,j},
c(s, (o′, 1)) := ∞,
t := (o_{n−1}, L + 1).

We visualize the construction in Figure 1b. Because of the auto-regressive nature of the Transformer decoder, we compute the maxflow to the last embedding of the decoder, as this embedding is used in the Transformer to predict the next token. The auto-regression, however, requires a normalization to account for the bias towards tokens that were predicted later than others (later predicted tokens have more incoming edges). Intuitively, we require the maxflow computation for any sub flow network F′ constructed from the decoder flow network F to be independent of the absolute position of F′ in F. Formally, assuming A^D to have the same value c for every entry, i.e., the capacity of every edge in the resulting flow network is fixed to c, we require for every position n that ∀o_m ∈ O. maxflow(F_dec(A^D, {o_m}, n)) = c, which we call positional independence. We ensure this by dividing the result of a maxflow computation for a given start token o_m and end token o_n by 1 + (|O| − (n − m)) − m. For a subset O′ ⊆ O, a position n (where ∀o′_m ∈ O′. m < n), and heads H′, we can thus compute the influence of the token set O′ on the total attention flow towards the embedding that predicts the n-th token, no matter whether a token served as part of the problem input or is an already predicted output token.
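The two decoder-specific ingredients, the causal edge constraint j ≤ k and the positional-independence divisor, can be sketched as follows. Sizes, the constant attention value, and the helper name `positional_norm` are illustrative, not part of the paper's library.

```python
# Sketch of the decoder flow-network construction F_dec (Section 2.2):
# causal edges (j <= k) plus the positional-independence normalisation.
# Toy sizes and a constant attention value; 0-based indices for convenience.
L, N = 2, 4                        # layers, output tokens
c_const = 0.5                      # constant head-averaged attention A^D[l][k][j]

cap = {}
for l in range(L):                 # L edge columns between L+1 node columns
    for k in range(N):
        for j in range(k + 1):     # auto-regressive mask: only j <= k
            cap[((j, l), (k, l + 1))] = c_const

def positional_norm(raw_flow, num_tokens, m, n):
    # divide the maxflow from start token o_m towards position n by
    # 1 + (|O| - (n - m)) - m, as required for positional independence
    return raw_flow / (1 + (num_tokens - (n - m)) - m)

print(len(cap))                    # L * N * (N + 1) / 2 = 20 causal edges
```

Later tokens have more incoming edges (token k has k + 1 predecessors per layer), which is the bias the divisor compensates for.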

2.3. ENCODER-DECODER ATTENTION FLOW

For Transformer models consisting of an encoder and a decoder, we combine both flow network translations with the encoder-decoder attention. Figure 2 shows the structure of the flow network for a Transformer model with an encoder (top) and a decoder (bottom). Following the Transformer architecture, the nodes of the flow network corresponding to the final embeddings of the encoder are connected to every node column of the network corresponding to the decoder. We omit some encoder-decoder edges for better visualization. Given a Transformer with H attention heads, L layers, M input tokens I = {i_1, . . . , i_M}, N output tokens O = {o_1, . . . , o_N}, a resulting encoder self-attention tensor A^E ∈ R^{H×L×M×M}, a decoder self-attention tensor A^D ∈ R^{H×L×N×N}, and an encoder-decoder attention tensor A^C ∈ R^{H×L×N×M}. For a set of input tokens I′, the position n of output token o_n, and a subset of heads H′ ⊆ [H], we construct a flow network F(A^E, A^D, A^C, I′, n) = (G, c, s, t) from the flow networks F_enc(A^E, I′, ∅) = ((V_enc, E_enc), c_enc, s_enc, t_enc) and F_dec(A^D, ∅, n) = ((V_dec, E_dec), c_dec, s_dec, t_dec) as follows:

V := V_enc ∪ V_dec ∪ {s},
E := E_enc ∪ E_dec ∪ {((i_j, L + 1), v) | i_j ∈ I ∧ v ∈ V_dec}
   ∪ {(s, (o_m, 1)) | o_m ∈ O ∧ m < n},
t := (o_n, L + 1),
c(v, v′) := c_enc(v, v′)   if v = (i_j, l), v′ = (i_k, l′) with i_j, i_k ∈ I,
c(v, v′) := c_dec(v, v′)   if v = (o_j, l), v′ = (o_k, l′) with o_j, o_k ∈ O,
c(v, v′) := (1 / |H′|) Σ_{h∈H′} A^C_{h,l,k,j}   if v = (i_j, L + 1), v′ = (o_k, l) with i_j ∈ I, o_k ∈ O,
c(v, v′) := ∞   if v = s.

Again, we have to normalize to account for the auto-regressive bias, i.e., to guarantee positional independence.
For a given set of input tokens I′ and heads H′, we can thus assess the contribution of this set to the total attention flow towards the embedding that predicts the n-th token by computing the maxflow through this network. If one is interested in the influence of an already predicted output token o_m, where m < n, on the prediction of o_n, the construction for the decoder-only case in Section 2.2 applies.
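The new ingredient of the combined construction is the set of cross edges from the final encoder column into the decoder. The following sketch builds only these edges; sizes, the A^C values, and the node-naming scheme are illustrative.

```python
# Sketch of the encoder-decoder cross edges from Section 2.3: every final
# encoder embedding (i_j, L+1) connects to every decoder node (o_k, l),
# with head-averaged A^C capacities. Toy sizes and values; 0-based indices.
L, M, N = 2, 2, 3                 # layers, input tokens, output tokens
# AC[l][k][j] is assumed to be already averaged over the head subset H'
AC = [[[0.1 * (k + 1) for j in range(M)] for k in range(N)] for l in range(L)]

cross = {}
for l in range(L):                # decoder node columns fed by cross attention
    for j in range(M):            # final encoder embedding of token i_j
        for k in range(N):        # decoder node (o_k, l)
            cross[(("enc", j, L), ("dec", k, l))] = AC[l][k][j]

print(len(cross))                 # L * M * N cross edges
```

These edges are merged with the encoder and decoder edge sets (and the source edges) to obtain the full network F.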

2.4. ALGORITHM

Algorithm 1: Attention flow.
  Input: A^E, A^D, A^C, I, O
  Output: f : O × I → R
  for o ∈ O do
    for i ∈ I do
      f(o, i) ← EdmondsKarp(F(A^E, A^D, A^C, {i}, o))
  return f

The flow network constructions can be used directly in an algorithm that computes the attention flow for input tokens. Algorithm 1 computes the attention flow for every input-output token pair: we build the flow network for every pair and compute the maximum flow in the corresponding network with the Edmonds-Karp algorithm (Edmonds and Karp, 1972). The runtime of Edmonds-Karp is in O(|V||E|^2), where the numbers of nodes and edges are determined by the number of layers and input/output tokens. Since we run this algorithm for every input-output pair (only partially rebuilding the flow network), we additionally incur a factor that is linear in the number of input tokens and linear in the number of output tokens. We evaluate the implementation of this algorithm and its variations for encoder-only and decoder-only Transformers in Sec. 4.
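As a runnable sketch of Algorithm 1, the following implements the encoder-only variant with NETWORKX (which the paper's implementation also relies on, see Section 4). The 2-layer, 3-token attention values are made up for illustration, and the helper name `enc_flow_network` is not part of the paper's library.

```python
# Runnable sketch of Algorithm 1 (encoder-only variant) using networkx's
# Edmonds-Karp. Toy head-averaged attention values; 0-based indices.
import networkx as nx

L, M = 2, 3
# A[l][k][j]: head-averaged attention of position k (column l+1) to j (column l)
A = [[[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.1, 0.4, 0.5]],
     [[0.3, 0.3, 0.4], [0.2, 0.7, 0.1], [0.5, 0.2, 0.3]]]

def enc_flow_network(source_tokens, sink_positions):
    G = nx.DiGraph()
    for l in range(L):
        for j in range(M):
            for k in range(M):
                G.add_edge((j, l), (k, l + 1), capacity=A[l][k][j])
    for j in sink_positions:
        G.add_edge((j, L), "t")          # missing capacity attr = infinite
    for i in source_tokens:
        G.add_edge("s", (i, 0))
    return G

# One maxflow per input token, as in Algorithm 1's inner loop
flow = {}
for i in range(M):
    G = enc_flow_network({i}, range(M))
    flow[i], _ = nx.maximum_flow(G, "s", "t",
                                 flow_func=nx.algorithms.flow.edmonds_karp)
print(flow)  # e.g. {0: 0.9, 1: 1.0, 2: 1.1} (up to float rounding)
```

Here each token's flow is bounded by its first-column outgoing capacities, which the deeper layers can fully forward in this toy example.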

2.5. OPTIMIZATIONS

The flow network constructions apply to subsets of heads, in particular single heads. The results of the individual head computations are joined by a linear projection, so each head has access to the computations of all heads in the previous layers. The task of a head in layer l can therefore be independent of its task in previous layers l′ < l. In practice, however, heads are biased towards keeping their respective tasks, and we also found good interpretability results by considering the attention flow of attention heads independently (see Section 4). A flow network for a single head is obtained by following the above constructions and setting H′ to the respective singleton. If the computation time of the maxflow for large Transformer models exceeds time limits, relaxations of the flow network are possible. First, note that the flow network only needs to be constructed once. As expected, the computation time of the maxflow in the network constructions increases with larger input and output sequences. Running time can be traded against heuristically shrinking the size of the flow network, which can be done in two dimensions. Following the practical observation that heads often keep their tasks throughout subsequent layers, the first option is to shrink the flow network along the x-axis. This can be done by simply skipping some of the inner layers of the network or by merging layers, taking the average of the raw attention values across layers as capacities. The network can also be shrunk along the y-axis by similarly grouping input and output tokens; for example, the tokens "predict" and "ed" can be combined into one node.
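The x-axis relaxation, merging layers by averaging their raw attention matrices, can be sketched as follows. The tensor, merge factor, and helper name `merge_layers` are illustrative.

```python
# Sketch of the layer-merging relaxation from Section 2.5: average groups
# of `factor` adjacent attention matrices elementwise. Toy tensor A[l][k][j].
L, M = 4, 2
A = [[[0.1 * (l + 1)] * M for _ in range(M)] for l in range(L)]

def merge_layers(A, factor):
    """Average groups of `factor` consecutive layer matrices elementwise."""
    merged = []
    for start in range(0, len(A), factor):
        group = A[start:start + factor]
        K, J = len(group[0]), len(group[0][0])
        merged.append([[sum(g[k][j] for g in group) / len(group)
                        for j in range(J)] for k in range(K)])
    return merged

A2 = merge_layers(A, 2)
print(len(A2))  # 2 merged layers instead of 4
```

The merged tensor yields a flow network with half as many edge columns, at the cost of blurring per-layer head behavior.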

3. SHAPLEY VALUE EXPLANATIONS

In this section, we show how the extended flow network constructions over the Transformer decoder, F_dec(A^D, O′, n) and F(A^E, A^D, A^C, I′, n), induce Shapley value explanations for the tokens of the input sequence. The Shapley value (Shapley, 1953) is a solution concept determining the impact of a player in cooperative game theory and an increasingly popular concept for determining the influence of input features on a model's decision.

Definition 2. A game with transferable utility (TU) is a pair (P, v), with P = {1, . . . , p} a finite set of players and v : 2^P → R the payoff function. A subset S ⊆ P is called a coalition. The payoff function v assigns every coalition of players S a real number v(S) ∈ R with v(∅) = 0. The share of a player i of the allocated payoff is φ_i(v).

The encoding of the attention values as a flow network is a TU game: a node in the flow network represents a player, and the total flow through the network represents the total payoff (Ethayarajh and Jurafsky, 2021). The Shapley values of the players in a TU game are formally defined as follows.

Definition 3 (Shapley Value). Let Π(P) be the set of all player permutations and let π ∈ Π(P) be a permutation of players. Let all players ahead of a player i be defined as P_{<i}(π) := {j ∈ P : π(j) < π(i)}. The Shapley value φ is defined as the share of payoff for a given player i ∈ P:

φ_i(P, v) := (1 / p!) Σ_{π∈Π(P)} (v(P_{<i}(π) ∪ {i}) − v(P_{<i}(π))).

From a game-theoretic viewpoint, Shapley values are well-suited for determining the payoff share that players deserve, as they satisfy the desirable properties of efficiency, symmetry, null player, and additivity. The mathematical definitions of these properties can be found in App.
A. These properties are also responsible for making Shapley values an attractive approach for explaining a model's decisions: features that do not contribute to the accuracy of a model should be null players, and features that contribute equally should satisfy symmetry.

Proposition 1 (Decoder-Only Flow Is a Shapley Value). Consider a Transformer decoder with H attention heads, L layers, N "output" tokens O = {o_1, . . . , o_N}, and the self-attention tensor A^D ∈ R^{H×L×N×N}. Let f_max^o be the maxflow computed in the flow network F_dec(A^D, {o}, n) as defined in the previous section. Consider the TU game (P, v), where the players p ∈ P = {1, . . . , N} correspond to the nodes (o_p, 1) from the first column of the Transformer decoder's flow network. For a given coalition S ⊆ P, let the value function be v(S) = Σ_{s∈S} f_max^{o_s}, i.e., the sum of maxflows of the nodes corresponding to S. Then the maxflow f_max^{o_p} for some p ∈ P is its Shapley value.

The proof follows immediately from the fact that every maxflow f_max^o of a node is an independent computation and the payoff of a coalition is defined as the sum of these independent contributions, which trivially qualifies as a Shapley value. Although this theoretical correspondence to a Shapley value is trivial, we show in our experiments in the following section that the maxflow computation indeed yields meaningful explanations of the network's attention flow. Note that our line of reasoning differs significantly from Ethayarajh and Jurafsky (2021); in particular, we compute a separate maxflow for every token in the set of players. This is because key assumptions about flow networks made in their proof do not hold. They argue that as long as nodes come from the same layer, blocking flow through some of these nodes does not change the possible flow through the others, from which they deduce that the utility a player adds when joining a coalition is independent of the identity of the players already in the coalition. However, this is not the case: several nodes from the same layer can compete for capacity downstream in the network even if they have no direct connection, e.g., if we have two tokens o_1, o_2 in one layer, each attended to with 0.5 attention by a node o_3 that is itself only attended to with 0.5 attention. Now, the utility o_1 adds upon joining a coalition as defined by Ethayarajh and Jurafsky (2021) does depend on whether o_2 is already part of it. We deduce from this discussion that their construction may violate the symmetry property of a Shapley value, as the payoff for o_1 and o_2 can be unequally allocated.

The ideas outlined for Proposition 1 also apply to the encoder-decoder attention flow. In the following, let f_max^i be the maxflow computed in the flow network construction F(A^E, A^D, A^C, {i}, n) over a Transformer with H attention heads, L layers, M input tokens I = {i_1, . . . , i_M}, N output tokens O = {o_1, . . . , o_N}, encoder self-attention tensor A^E ∈ R^{H×L×M×M}, decoder self-attention tensor A^D ∈ R^{H×L×N×N}, and encoder-decoder attention tensor A^C ∈ R^{H×L×N×M}.

Corollary 2 (Encoder-Decoder Flow Is a Shapley Value). Consider the TU game (P, v), where the players p ∈ P = {1, . . . , M} correspond to the nodes (i_p, 1) from the first column. Let the value function for a given coalition S ⊆ P be defined as v(S) = Σ_{s∈S} f_max^{i_s}, i.e., the sum of maxflows of the nodes corresponding to S. Then the maxflow f_max^{i_p} for some p ∈ P is its Shapley value.
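The competition example above can be made concrete with a few lines of networkx. The sketch below defines a coalition payoff v(S) as the joint maxflow of S (the utility notion discussed for Ethayarajh and Jurafsky (2021)) and evaluates Definition 3 by enumerating permutations; the graph, capacities, and function names are illustrative. Note that the paper's own value function instead sums independent per-token maxflows.

```python
# Toy illustration of Definition 3 on the downstream-competition example:
# o1 and o2 are each attended with capacity 0.5 by o3, which is itself
# attended with only 0.5. The joint-maxflow payoff v is not additive.
from itertools import permutations
import networkx as nx

def v(S):
    """Joint maxflow when exactly the tokens in S are connected to s."""
    if not S:
        return 0.0                         # v(empty set) = 0
    G = nx.DiGraph()
    for o in S:
        G.add_edge("s", o)                 # missing capacity attr = infinite
    G.add_edge("o1", "o3", capacity=0.5)
    G.add_edge("o2", "o3", capacity=0.5)
    G.add_edge("o3", "t", capacity=0.5)
    return nx.maximum_flow(G, "s", "t")[0]

players = ("o1", "o2")
# not additive: each token alone pushes 0.5, but together still only 0.5
print(v({"o1"}), v({"o2"}), v({"o1", "o2"}))

def shapley(i):
    """Definition 3: average marginal contribution over all permutations."""
    perms = list(permutations(players))
    return sum(v(set(pi[:pi.index(i)]) | {i}) - v(set(pi[:pi.index(i)]))
               for pi in perms) / len(perms)

print([shapley(i) for i in players])       # [0.25, 0.25]: they split v(P) = 0.5
```

A token's marginal contribution here depends on who joined before it (0.5 if first, 0.0 if second), which is exactly the order dependence the text describes.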

4. EXPERIMENTS

In this section, we report on natural language processing and logical reasoning experiments. We implemented the algorithm from Section 2. The architectural details of the models are shown in Table 1b. We visualize the maxflow attention values in heatmaps, lineplots, and violinplots (see, for example, Figure 3b). The maxflow is computed with NETWORKX (Hagberg et al., 2008), and the heatmaps comparing the attention flow from input/predicted tokens to the currently predicted token are visualized with SEABORN (Waskom, 2021). The heatmaps either show only the attention flow from input tokens if the model is encoder-only (enc.), are separated into different heatmaps for input tokens and auto-regressive tokens for encoder-decoder models (enc. + dec.), or show one heatmap for all tokens if the architecture is decoder-only (dec.). Higher values represent higher attention flow.

Text generation with GPT-2. We computed the attention flow of a GPT-2 model (Radford et al., 2019) while decoding the predicted tokens. The input sequence to this decoder-only model was "My name is John, my profession is". Figure 3a depicts the attention flow after decoding the first, fourth, and sixth tokens. The resulting flow network can be found in Figure 13 in the appendix. Generally, GPT-2 models attend to the first token the most (cf. Section 4.3). The differences in the attention flow are clearly visible, as the attention flow on previous tokens differs for each decoding step. Most notably, the attention flow shifts heavily toward the token "profession" when predicting the token "doctor". We observed such heavy shifts in decoder attention flow values throughout our experiments, which is why this approach is a valuable addition to existing analysis methods. The computation of the flow values for this example took only 1.38, 1.50, and 2.09 seconds, respectively.

Satisfying assignments for SAT. In this experiment, we considered the problem of computing a satisfying assignment to a propositional logical formula.
A formula in propositional logic is constructed from variables and the Boolean connectives ¬ (not), ∨ (or), ∧ (and), → (implication), and ↔ (equivalence). For example, consider the propositional formula b ∨ (a ∧ ¬a). A satisfying assignment is a mapping from variables to truth values such that the formula evaluates to true; for the formula above, one satisfying assignment is {b → 1, a → 0}. The variable a, however, has no impact on the truth value of the formula: as long as b is set to 1, a can be predicted either as 1 or 0. We conducted an experiment to detect parts of a propositional formula that have no impact on predicted assignments. We trained a Transformer with an encoder and a decoder to predict satisfying assignments. The attention flow values for the following two propositional formulas are depicted in Figure 3b: PropSAT1 := b ∨ (a ∧ ¬a), in tokens b|(a&!a), and PropSAT2 := (a ∧ ¬a) ∨ b, in tokens (a&!a)|b. The disjunct (a ∧ ¬a) plays no role in any satisfying assignment, since any mapping of a makes this subformula false. Regardless of its position in the formula, the flow computation detects it as unimportant: the encoder inputs a and ¬a have significantly less influence on the total attention flow than b.

4.2. HEAD TASK ANALYSIS

LTL trace prediction. We experimented with predicting satisfying traces for linear-time temporal logic (LTL) (Pnueli, 1977), using a Transformer trained on this task by Hahn et al. (2021). LTL generalizes propositional logic with temporal operators such as X (next) or U (until) and is used to specify the behavior of systems that interact with their environments over time. An LTL formula is satisfied by a trace, which is an infinite sequence of propositions that hold at discrete timesteps. We finitely represent satisfying traces of LTL formulas as a prefix followed by a loop, denoted by curly brackets. For example, the LTL formula X(a ∧ X ¬a) denotes that in the second position a must be true, and in the third position a must be false. The model correctly predicts the trace, where the first position and the loop are arbitrary and hence set to true: trace: 1; a; ¬a; {1}. Analyzing two attention heads for this example, the left head focuses on the right conjunct X ¬a, which determines the third position of the trace, where a is not allowed. The right head focuses on the left conjunct a, which must appear at the second position of the trace (see Appendix B for another example).

Translation. In this experiment, we used the OPUS-MT-EN-DE model (Tiedemann and Thottingal, 2020) for translating between English and German. The input sentence is "The pilot lost her suitcase.", which is translated to "Der Pilot hat ihren Koffer verloren". The computed flow network can be found in Figure 12. While the meaning of the original sentence is ambiguous, as "the pilot" could be male or female, the translated sentence is not, since the German phrase Der Pilot denotes a male pilot. It has been conjectured that such gender-biased translations can reinforce problematic stereotypes (Bolukbasi et al., 2016). Our analysis technique allows further insight into the internal mechanics of the Transformer model in such a scenario. We analyze the tasks of the heads; two of them are shown in Figure 6.
By computing the attention flow for the encoder and decoder, we can observe that the depicted heads solve opposing tasks. The head on the left-hand side attends pilot lost her in the encoder and Der Pilot in the decoder, which is the one-to-one translation, but without a corresponding possessive pronoun. The head on the right-hand side attends pilot and suitcase in the encoder and Pilot hat as well as Koffer in the decoder. Hence, from the attention flow, we can see that the second head has little influence on the biased translation, as neither her, nor Der and ihren (the German pronoun corresponding to her), receive significant attention. This approach therefore gives us a helpful hint that we have to analyze the first head to get to the root of this biased translation.

Head attention. We analyze the influence of each head of GPT-2 based on its contribution to the attention flow. Figure 4a shows the attention flow for each token and head for the input and output sentence "My name is John, my profession is to be a doctor. I am a doctor of medicine.". Heads 0, 1, and 2 show high and diverse attention flow values for different tokens, whereas all other heads have shallow and stable attention flow values. To explore this further, Figure 4b shows the accumulated attention flows over all tokens for each head for 300 random samples. It supports the claim that the first three heads have higher attention flow values than all other heads.

Sentiment analysis. We further analyzed a RoBERTa model for sentiment classification. The resulting flow network can be found in Figure 11 in the appendix. While the first two sentences, "John is a killer." and "John is a good killer.", are correctly labeled with negative sentiment (even with the adjective "good" in the sentence), having an emoji in the sentence immediately shifts the sentiment to be (falsely) labeled as positive. The computation of the attention flow is visualized in Figure 10 in the appendix. For the first two sentences, the attention on killer is the highest among non-special tokens.
Although the same holds for the third sentence, i.e., the attention flow identifies killer as the most important word, the weakly attended smiley changes the sentiment to positive. When computing the attention flow for each head individually, we observe heads with an attention flow of 1.0 to the emoji (see Figure 9 in the appendix). One should be aware of this bias when applying this model outside of similar domains.

5. LIMITATIONS AND CONCLUSION

The main limiting factor of this approach is that the attention flow in a Transformer is the largest, but not the only, factor in deciding the next token prediction. In addition to the many residual connections (which can be incorporated into the flow networks; see Section 2), Transformer models contain feed-forward networks as intermediate steps. Another minor caveat is that flow values cannot be compared across different model architectures, as their absolute values have no meaning; the values can only be compared to those of other tokens in the same layer of the same model. This approach should thus be seen as a valuable addition (not a replacement) to the large toolbox for interpreting machine learning models. It generalizes the efforts in visualizing and interpreting raw attention values and attention rollout. During our experiments, we found the attention flow values computed with the presented approach instrumental in analyzing models, finding biases, and fixing the respective datasets.

To conclude, we formalized and extended the technique of constructing a flow network from the attention values of encoder-only Transformer models to general Transformer models, including an auto-regressive decoder. Running a maxflow algorithm on these constructions returns Shapley values that determine the impact of a token on the total attention flow leading to the decoder's decision. We provide an implementation of our approach that can be applied to arbitrary Transformer models. Our experiments show this analysis method's applicability in various application domains. We hope that our implementation and the constructions presented in this paper will aid machine learning practitioners and researchers in designing reliable and interpretable Transformer models.

B HEAD TASK ANALYSIS: LTL UNTIL-OPERATOR

In this experiment, we provide another LTL example, where one of the heads focuses on the temporal operator in the formula and another focuses solely on the propositions of the formula (see Figure 7). The input formula is a U b ∧ 1 U a, where 1 U a denotes that finally an a must occur. The network correctly outputs the following trace: trace: a ∧ b; {1}.

Single Head Attention Flow in RoBERTa. Figure 8 depicts the attention flow of the first head in the RoBERTa model. Intuitively, the word killer dominates the sentiment of the sentence. However, the output of RoBERTa is a positive sentiment, although the attention flow is mainly on the word killer (see Figure 10). Analyzing the individual heads, one can observe that head 0 attends the smiley with its maximal value (1.0), which could be one explanation for the output of the model.

Bias in DialogPT.

Figure 9 shows the attention flow from each token to the current output. While we observe slight changes in the computed attention flow for each token, the first input token The is highly attended, receiving more than twice the attention flow of any other token. Note that this observation does not directly translate into a bias in the model; it solely shows that the distribution of attention is biased.

Encoder-Decoder. Figure 12 shows the flow network for OPUS-MT-EN-DE. The underlying architecture consists of an encoder and a decoder with 8 layers each, connected by the cross-attention edges in between. For each input token and auto-regressive token, we compute the attention flow to each predicted token. In Figure 12, the attention flow for the third predicted token is computed.

Decoder-Only. Figure 13 shows the flow network for GPT-2 with its underlying decoder-only architecture. The model has 12 layers; attention can only flow from previous auto-regressive tokens, including the input tokens. We start computing the attention flow for the first output token, which is connected to the terminal node in Figure 13.
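In the decoder-only construction, causal masking restricts the edge set: a node at position i in layer l+1 can only receive flow from positions j ≤ i in layer l. A simplified sketch of this restricted edge set (residuals again folded in as 0.5·A + 0.5·I; function and variable names are illustrative, not the library's API):

```python
def causal_flow_edges(attn):
    """Edges of a decoder-only attention flow network.
    attn: [layers][seq][seq] with attn[l][i][j] == 0 for j > i (causal
    masking), so flow can only arrive from previous or same positions.
    Returns a list of ((layer, key_pos), (layer+1, query_pos), weight)."""
    edges = []
    n_layers, n = len(attn), len(attn[0])
    for l in range(n_layers):
        for i in range(n):
            for j in range(i + 1):  # j <= i: only previous tokens
                w = 0.5 * attn[l][i][j] + (0.5 if i == j else 0.0)
                if w > 0:
                    edges.append(((l, j), (l + 1, i), w))
    return edges

# Toy causal attention: 1 layer, 3 tokens, uniform over the prefix.
attn = [[[1.0, 0.0, 0.0],
         [0.5, 0.5, 0.0],
         [1 / 3, 1 / 3, 1 / 3]]]
edges = causal_flow_edges(attn)
assert all(u[1] <= v[1] for u, v, _ in edges)  # no flow from the future
```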






With |L|+1 node columns and |L| edge columns, I′ = {i_5} and J = {1, 3, 5, 6}. The red node depicts the input token for which the maximum flow is computed. The blue node represents the terminal node t. Input token set O′ = {o_2} and embedding t, where the output token o_5 is currently predicted. The first "input" token, i.e., o_1, is the special start token of the decoder.

Figure 1: Encoder attention in a flow network (left) and decoder attention in a flow network (right).

which we call positional independence. We ensure this by dividing the result of a max flow computation for a given start token o_m and end token o_n by 1 + (|O| − (n − m)) − m. For a subset O′ ⊆ O and a position n (where ∀ o′_m ∈ O′. m < n
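The normalization can be stated directly in code (a small sketch; this assumes the divisor reads |O| as the number of output tokens, and the function name is illustrative):

```python
def normalize_flow(raw_flow, num_outputs, m, n):
    """Normalize the max-flow value for start token o_m and end token o_n
    so that flow values at different positions become comparable
    (positional independence): divide by 1 + (|O| - (n - m)) - m."""
    divisor = 1 + (num_outputs - (n - m)) - m
    return raw_flow / divisor

# Example: 5 output tokens, raw flow 2.0 from o_1 to o_3.
print(normalize_flow(2.0, 5, 1, 3))
```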

Figure 2: Sketch of the encoder-decoder attention flow network for input token i_5 and embedding t, which is used to predict o_5. Encoder-decoder connections are sketched for the first node.

Figure 3: Heatmap of the attention flow of the GPT-2 model after 1, 4, and 6 predicted tokens in (a) and a heatmap depicting the attention flow for unimportant-token detection in SAT assignments in (b).

Figure 4: The attention flow for every head of GPT-2 separately in (a) and the sum of all attention flow values per head for 300 sampled input queries on GPT-2 in (b).

Figure 5: Violin plot of the distribution of attention flow of GPT-2 for 500 samples in (a) and heatmaps for two heads of the LTLSat model, each attending a different timestep, in (b).

Results of the sentiment analysis in (a) and the parameter overview of the models in (b).

Figure 6: Heatmap for two heads, divided into encoder and decoder. The left head attends to the pilot, the right head to the suitcase.

Token bias. While analyzing the attention flow of the decoder-only Transformers DialogPT (Zhang et al., 2020) and GPT-2 (Radford et al., 2019), we observed a heavy bias toward the first decoded token (see Figure 3a and Figure 9 in the appendix). We computed the attention flow for 500 random samples of the OPUS-MT-EN-DE test set. The results are visualized in Figure 5a. The first token contributes the most to the total attention flow regardless of the input tokens. Since the DialogPT model was trained on a dataset mined from reddit.com, it might be beneficial to over-attend the first token, as many conversations on reddit.com consist of concise sentences or even single words.
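Such a positional bias can be surfaced by averaging the per-position flow values over many samples, as in the sampled analysis above. A minimal sketch (names and data are illustrative; inputs are assumed to be pre-aligned per-position flow values):

```python
def mean_flow_per_position(samples):
    """Average attention flow per token position over many samples,
    used to spot positional biases such as an over-attended first token.
    samples: list of per-sample flow lists, all of equal length."""
    n = len(samples[0])
    return [sum(s[p] for s in samples) / len(samples) for p in range(n)]

# Toy data: in every sample the first position carries most of the flow.
samples = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.7, 0.1, 0.2]]
means = mean_flow_per_position(samples)
print(means)  # the first position dominates on average
```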

Figure 7: Heatmap of the attention flow for two heads: The left head focuses on the until-operator and the right head focuses on the propositions.

Figure 8: Heatmap showing head 0 of RoBERTa for the example in Figure 10.

Figure 9: Heatmap showing the bias towards the first token in DialogPT.

Figure 10: Heatmap showing the attention flow for 3 variations of the same sentence in RoBERTa.

Figure 11: The flow network of the encoder-only network RoBERTa for the example in Figure 10.

Figure 12: The flow network of the encoder-decoder architecture OPUS-MT-EN-DE for the input "The father cooked dinner." and the predicted tokens "Der Vater kochte Abendessen".



6. REPRODUCIBILITY STATEMENT

The supplementary material of this submission includes Python notebooks to reproduce the figures presented in this paper with their underlying data. The code, datasets, models, and our notebooks for the reproduction of the experiments will be made publicly available once the double-blind reviewing process ends.

