ATTENTION FLOWS FOR GENERAL TRANSFORMERS

Abstract

In this paper, we study how much an input token of a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend it to general Transformer architectures, including the auto-regressive decoder. We show that running a maxflow algorithm on this flow network construction yields Shapley values, a concept from cooperative game theory that determines a player's impact. By interpreting the input tokens of the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We demonstrate the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has been the dominant machine learning architecture in recent years, finding application in NLP (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), or LaMDA (Collins and Ghahramani, 2021)), computer vision (see Khan et al. (2021) for a survey), mathematical reasoning (Lample and Charton, 2019; Han et al., 2021), and even code and hardware synthesis (Chen et al., 2021; Schmitt et al., 2021). The Transformer relies on the attention mechanism (Bahdanau et al., 2015), which mimics cognitive attention by setting the focus of computation on a few concepts at a time. In this paper, we rigorously formalize the construction of a flow network out of attention values (Abnar and Zuidema, 2020) and generalize it to models that include a decoder. While the construction yields a Shapley value (Shapley, 1953) almost trivially, we show that it results in meaningful explanations of the input tokens' influence on the total flow affecting a Transformer's prediction.

Its applicability in various domains has made the Transformer architecture incredibly popular. Models are easily accessible to developers around the world, for example at huggingface.co (Wolf et al., 2019). However, blindly using or fine-tuning these models may lead to mispredictions and unwanted biases, which can have a considerable negative effect on their application domains. The sheer size of Transformer models makes it impossible to analyze the networks by hand. Explainability and visualization methods, e.g., Vig (2019), aid machine learning practitioners and researchers in finding the cause of a misprediction or revealing unwanted biases; the training method or the dataset can then be adjusted accordingly. Abnar and Zuidema (2020) introduced Attention Flow as a post-processing interpretability technique that treats the self-attention weight matrices of a Transformer encoder as a flow network.
This technique allows analyzing the flow of attention through the Transformer encoder: computing the maxflow for an input token determines the impact of this token on the total attention flow. Ethayarajh and Jurafsky (2021) discussed a possible relation between the maxflow computation through the encoder flow network and Shapley values, a concept from cooperative game theory that determines a player's impact and can be applied to measure the importance of a model's input features. However, the lack of a clear formalization of the underlying flow network has made it difficult to assess the validity of their claims, which we aim to address in this work.

We extend our formalization of the approach to a Transformer-model-agnostic technique, covering general encoder-decoder Transformers and decoder-only Transformers such as GPT models (Radford et al., 2018). While the encoder, after applying a positional encoding, processes the input tokens as a whole, the decoder layers operate auto-regressively, i.e., a sequence of tokens is predicted step by step, and already predicted tokens are fed back as input to the decoder. This results in a significantly different shape of the flow network and, in particular, requires normalization to account for the bias towards tokens that were predicted later than others. We account for the auto-regressive nature of the decoder by ensuring positional independence of the computed maxflow values. We implemented our constructions as a Python library, which we will publish under the MIT license.

In summary, our contributions are the following. We formalize encoder-only attention flow and generalize the approach to encoder-decoder and decoder-only Transformers in Section 2. Furthermore, we use the formalization to construct an explicit algorithm for attention flow computation and analyze its complexity. In Section 3, we show that the computed attention flow values are Shapley values for all three architectures.
Section 4 introduces a tool to compute and visualize attention flow for arbitrary Transformers. We report on qualitative and quantitative experiments that show the effectiveness of our approach, including token bias and single-head attention analyses.

Related Work. We would like to emphasize the work on which we build: Abnar and Zuidema (2020) introduced attention flow for Transformer encoders, and Ethayarajh and Jurafsky (2021) drew a possible connection between encoder attention flows and Shapley values. An overview of explainability is given by Samek et al. (2017) and Burkart and Huber (2021). An overview of Shapley value formulations for machine learning models is given by Sundararajan and Najmi (2020); these formulations are not restricted to Transformer models and do not include attention flow (Lindeman, 1980; Grömping, 2007; Owen, 2014; Owen and Prieur, 2017; Štrumbelj et al., 2009; Štrumbelj and Kononenko, 2014; Datta et al., 2016; Lundberg and Lee, 2017; Lundberg et al., 2018; Aas et al., 2019; Sun and Sundararajan, 2011; Sundararajan et al., 2017; Agarwal et al., 2019). Shapley values are also used for the valuation of machine learning data (Ghorbani and Zou, 2019). Raw attention values can be visualized directly, e.g., by Vig (2019) and Wang et al. (2021). Chefer et al. (2021) assign local relevance based on the Deep Taylor Decomposition principle (Montavon et al., 2017).
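To make the setting concrete before the formalization in Section 2, the following sketch shows how per-layer attention matrices can be prepared as edge capacities. This is not our library's API; the function name, the head-averaged input format, and the equal-weight residual mixing (following Abnar and Zuidema (2020)) are illustrative assumptions.

```python
import numpy as np

def attention_to_capacities(attentions, residual=True):
    """Turn head-averaged per-layer attention matrices into capacity matrices.

    `attentions` is a list of (n, n) arrays, one per encoder layer, where
    row i holds the attention of position i over all positions of the
    previous layer. Residual connections are modelled by mixing in the
    identity matrix before re-normalizing each row, as proposed by
    Abnar and Zuidema (2020).
    """
    capacities = []
    for A in attentions:
        A = np.asarray(A, dtype=float)
        if residual:
            A = 0.5 * A + 0.5 * np.eye(A.shape[0])
        A = A / A.sum(axis=1, keepdims=True)  # rows sum to 1 again
        capacities.append(A)
    return capacities
```

Each entry capacities[l][i, j] then serves as the capacity of the edge from position j in layer l to position i in layer l + 1.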

2. ATTENTION FLOW

Attention Flow (Abnar and Zuidema, 2020) is a post-processing interpretability technique that treats the self-attention weight matrices of the Transformer encoder as a flow network and returns the maximum flow through each input token. Formally, a flow network is defined as follows.

Definition 1 (Flow Network). Given a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges, a flow network is a tuple (G, c, s, t), where c : E → R≥0 ∪ {∞} is the capacity function and s and t are the source and terminal (sink) nodes, respectively. A flow is a function f : E → R satisfying the following two conditions. Flow conservation: ∀v ∈ V \ {s, t}. x_f(v) = 0, where x_f : V → R is defined as x_f(u) = Σ_{v∈V} f(v, u),



Figure 1: Encoder attention in a flow network (left) and decoder attention in a flow network (right). Left: a network with |L| + 1 node columns and |L| edge columns, I′ = {i5}, and J = {1, 3, 5, 6}; the red node depicts the input token for which the maximum flow is computed, and the blue node represents the terminal node t. Right: input token set O′ = {o2} and embedding t, where the output token o5 is currently predicted; the first "input" token, i.e., o1, is the special start token of the decoder.

and capacity constraint: ∀e ∈ E. f(e) ≤ c(e). The value |f| of a flow is the amount of flow from the source node s to the terminal node t: |f| = Σ_{v : (s,v) ∈ E} f(s, v). For a given set K of nodes, we define |f(K)| as the flow value from s to t passing only through nodes in K: |f(K)| = Σ_{v : (s,v) ∈ E, v ∈ K} f(s, v). We define |f_o(v)| to be the total outflow value of a node v and |f_i(v)| to be the total inflow value of a node v. In optimization theory, the maximum flow problem max(|f|) (Harris and Ross, 1955) is to find a flow that pushes the maximum possible flow value |f| from the source node s to the terminal node t; we denote such a flow by f_max.
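Under these definitions, the per-token attention flow is a standard maxflow computation on a layered graph. The following is a minimal sketch using networkx; the node naming, the super-sink construction, and the edge orientation are our illustrative choices, not part of the formalization.

```python
import networkx as nx
import numpy as np

def token_max_flow(capacities, token):
    """Maximum flow from one input token to a super-sink t.

    Nodes are pairs (layer, position); capacities[l][i, j] is the
    capacity of the edge from position j in node column l to position i
    in node column l + 1. The last column is connected to a super-sink
    with unbounded capacity, mirroring the terminal node t.
    """
    n = np.asarray(capacities[0]).shape[0]
    G = nx.DiGraph()
    for l, A in enumerate(capacities):
        A = np.asarray(A, dtype=float)
        for i in range(n):        # position in column l + 1
            for j in range(n):    # position in column l
                G.add_edge((l, j), (l + 1, i), capacity=A[i, j])
    for i in range(n):            # last column feeds the sink
        G.add_edge((len(capacities), i), "t", capacity=float("inf"))
    value, _ = nx.maximum_flow(G, (0, token), "t")
    return value
```

Computing token_max_flow for every input position yields the per-token flow values whose interpretation as Shapley values is established in Section 3.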

