ATTENTION FLOWS FOR GENERAL TRANSFORMERS

Abstract

In this paper, we study how to quantify the influence of an input token on a Transformer model's prediction. We formalize a method to construct a flow network from the attention values of encoder-only Transformer models and extend it to general Transformer architectures, including an auto-regressive decoder. We show that running a maxflow algorithm on the constructed flow network yields Shapley values, which determine a player's impact in cooperative game theory. By interpreting the input tokens in the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We demonstrate the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has been the dominant machine learning architecture of recent years, finding application in NLP (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), or LaMDA (Collins and Ghahramani, 2021)), computer vision (see Khan et al. (2021) for a survey), mathematical reasoning (Lample and Charton, 2019; Han et al., 2021), and even code and hardware synthesis (Chen et al., 2021; Schmitt et al., 2021). The Transformer relies on attention (Bahdanau et al., 2015), a mechanism that mimics cognitive attention by focusing computation on a few concepts at a time. In this paper, we rigorously formalize the construction of a flow network from attention values (Abnar and Zuidema, 2020) and generalize it to models that include a decoder. While this construction yields a Shapley value (Shapley, 1953) almost trivially in theory, we show that it produces meaningful explanations of the input tokens' influence on the total flow affecting a Transformer's prediction.

Its applicability across these domains has made the Transformer architecture immensely popular, and models are easily accessible to developers around the world, for example at huggingface.co (Wolf et al., 2019). However, blindly using or fine-tuning these models can lead to mispredictions and unwanted biases, with considerable negative effects on their application domains. The sheer size of Transformer models makes it impossible to analyze the networks by hand. Explainability and visualization methods, e.g., Vig (2019), aid machine learning practitioners and researchers in finding the cause of a misprediction or revealing unwanted biases; the training method or the dataset can then be adjusted accordingly. Abnar and Zuidema (2020) introduced Attention Flow, a post-processing interpretability technique that treats the self-attention weight matrices of a Transformer encoder as a flow network.
This technique allows analyzing the flow of attention through the Transformer encoder: computing the maxflow for an input token determines the impact of this token on the total attention flow. Ethayarajh and Jurafsky (2021) discussed a possible relation between the maxflow computation through the encoder flow network and Shapley values, a concept from cooperative game theory that determines a player's impact and can be applied to measure the importance of a model's input features. However, the lack of a clear formalization of the underlying flow network has made it difficult to assess the validity of their claims, which we aim to address in this work. We extend our formalization of the approach to a Transformer-model-agnostic technique, covering general encoder-decoder Transformers as well as decoder-only Transformers such as GPT models (Radford et al., 2018). While the encoder processes the input tokens as a whole after applying a positional encoding, the decoder layers operate auto-regressively, i.e., a sequence of tokens is predicted step-by-step, and already predicted tokens are fed back as input to the decoder. This results in a
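To make the game-theoretic reading concrete, the following toy sketch (our own illustration, not the paper's method) computes exact Shapley values by enumerating player orderings. The characteristic function v(S) is the maxflow achievable when only the tokens in the coalition S act as sources; the tiny network and all names are hypothetical, and the enumeration is exponential in the number of tokens, so it is only feasible for small examples.

```python
import math
from itertools import permutations
import networkx as nx

def shapley_values(players, v):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings of the players."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = []
        prev = v(frozenset())
        for p in order:
            coalition.append(p)
            cur = v(frozenset(coalition))
            phi[p] += cur - prev
            prev = cur
    n_fact = math.factorial(len(players))
    return {p: phi[p] / n_fact for p in players}

# Hypothetical toy network: two input tokens feeding a shared bottleneck.
G = nx.DiGraph()
G.add_edge("t0", "mid", capacity=0.6)
G.add_edge("t1", "mid", capacity=0.6)
G.add_edge("mid", "sink", capacity=1.0)

def v(coalition):
    """Maxflow when only the tokens in the coalition act as sources."""
    if not coalition:
        return 0.0
    H = G.copy()
    for t in coalition:
        H.add_edge("src", t)  # uncapacitated super-source edges
    return nx.maximum_flow(H, "src", "sink")[0]

phi = shapley_values(["t0", "t1"], v)
```

Here v({t0}) = v({t1}) = 0.6 but v({t0, t1}) = 1.0 because of the shared bottleneck, so by symmetry each token receives a Shapley value of 0.5, and the values sum to the grand coalition's flow, as Shapley's efficiency axiom requires.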

