HOLISTICALLY EXPLAINABLE VISION TRANSFORMERS

Abstract

Transformers increasingly dominate the machine learning landscape across many tasks and domains, which increases the importance of understanding their outputs. While their attention modules provide partial insight into their inner workings, the attention scores have been shown to be insufficient for explaining the models as a whole. To address this, we propose B-cos transformers, which inherently provide holistic explanations for their decisions. Specifically, we formulate each model component, such as the multi-layer perceptrons, attention layers, and the tokenisation module, to be dynamic linear, which allows us to faithfully summarise the entire transformer via a single linear transform. We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed B-cos ViTs, are highly interpretable and perform competitively to baseline ViTs on ImageNet.

1. INTRODUCTION

Convolutional neural networks (CNNs) have dominated the last decade of computer vision. Recently, however, they are often surpassed by transformers (Vaswani et al., 2017), which, if the current development is any indication, will replace CNNs for ever more tasks and domains. Transformers are thus bound to impact many aspects of our lives: from healthcare and judicial decisions to autonomous driving. Given the sensitive nature of such areas, it is of utmost importance to ensure that we can explain the underlying models, which remains a challenge for transformers.

To explain transformers, prior work has often focused on the models' attention layers (Jain & Wallace, 2019; Serrano & Smith, 2019; Abnar & Zuidema, 2020; Barkan et al., 2021), as they inherently compute their output in an interpretable manner. However, as transformers consist of many additional components, explanations derived from attention alone have been found insufficient to explain the full models (Bastings & Filippova, 2020; Chefer et al., 2021).

To address this, our goal is to develop transformers that inherently provide holistic explanations for their decisions, i.e. explanations that reflect all model components. These components are: a tokenisation module, a mechanism for providing positional information to the model, multi-layer perceptrons (MLPs), as well as normalisation and attention layers, see Fig. 2a. By addressing the interpretability of each component individually, we obtain transformers that inherently explain their decisions, see, for example, Fig. 1 and Fig. 2b. In detail, our approach is based on the idea of designing each component to be dynamic linear, such that it computes an input-dependent linear transform. This renders the entire model dynamic linear, cf. Böhle et al. (2021; 2022), such that it can be summarised by a single linear transform for each input.
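To make the dynamic-linear idea concrete, the following minimal numpy sketch illustrates the core mechanism (the function name `bcos_dynamic_weight` and the unconstrained weight matrices are illustrative, not the paper's exact parameterisation): each layer computes an input-dependent effective weight matrix, the forward pass is a plain matrix product with that matrix, and composing the per-layer matrices yields a single linear transform W(x) that exactly reproduces the model's output for that input.

```python
import numpy as np

def bcos_dynamic_weight(W, x, B=2.0, eps=1e-9):
    # Sketch of a B-cos-style dynamic linear layer: each unit scales its
    # weight row by |cos(angle(w_i, x))|^(B-1), so the layer output equals
    # W_eff(x) @ x. B and eps are illustrative hyperparameter choices.
    cos = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)
    return (np.abs(cos) ** (B - 1))[:, None] * W

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(6, 8))
W2 = rng.normal(size=(4, 6))

# Forward pass: every layer is exactly linear *given its own input*.
W1_eff = bcos_dynamic_weight(W1, x)
h = W1_eff @ x
W2_eff = bcos_dynamic_weight(W2, h)
y = W2_eff @ h

# The whole model collapses to one input-dependent matrix W(x),
# which faithfully summarises the computation for this input x.
W_total = W2_eff @ W1_eff
assert np.allclose(y, W_total @ x)
```

The rows of W_total are what such a summary exposes: per-output linear contributions of every input dimension, which is what enables explanations like W(x) in Fig. 1.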



Fig. 1: Inherent explanations (cols. 2+3) of B-cos ViTs vs. attention explanations (cols. 4+5) for the same model. Note that W(x) faithfully reflects the whole model and yields more detailed and class-specific explanations than attention alone. For a detailed discussion, see supplement.

