MULTI-HEAD ATTENTION: COLLABORATE INSTEAD OF CONCATENATE

Abstract

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models has allowed significant improvements in both fields, but once trained, these networks show symptoms of over-parameterization. For instance, it is known that many attention heads can be pruned without impacting accuracy. This work aims to enhance current understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. For instance, by allowing heads to collaborate on a neural machine translation task, we can reduce the key dimension by 4× without any loss in performance. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Even without retraining, collaborative multi-head attention manages to reduce the size of the key and query projections by half without sacrificing accuracy. Our code is public.

1. INTRODUCTION

Since the invention of attention (Bahdanau et al., 2014) and its popularization in the transformer architecture (Vaswani et al., 2017), multi-head attention (MHA) has become the de facto architecture for natural language understanding tasks (Devlin et al., 2019) and neural machine translation. Attention mechanisms have also gained traction in computer vision following the work of Ramachandran et al. (2019) and Bello et al. (2019). Nevertheless, despite their wide adoption, we currently lack a solid theoretical understanding of how transformers operate. In fact, many of their modules and hyperparameters are derived from empirical evidence that is possibly circumstantial. The uncertainty is amplified in multi-head attention, where both the roles of and interactions between heads are still poorly understood. Empirically, it is well known that using multiple heads can improve model accuracy. However, not all heads are equally informative, and it has been shown that certain heads can be pruned without impacting model performance. For instance, Voita et al. (2019) present a method to quantify head utility and prune redundant members. Michel et al. (2019) go further and question the utility of multiple heads by testing the effect of heavy pruning in several settings. On the other hand, Cordonnier et al. (2020) prove that multiple heads are needed for self-attention to perform convolution, specifically requiring one head per pixel in the filter's receptive field. Beyond the number of heads, finding the adequate head dimension is also an open question. Bhojanapalli et al. (2020) find that dividing the key/query projections between heads gives rise to a low-rank bottleneck in the expressivity of each attention head, which can be fixed by increasing the head sizes. In contrast, our approach increases head expressivity by leveraging the low rank across heads to share common query/key dimensions.
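The contrast between per-head projections and shared key/query dimensions can be made concrete with a small sketch. The snippet below is a minimal NumPy illustration, not the paper's exact parametrization: the per-head mixing vector `M[h]` and the shared dimension `d_shared` are hypothetical names chosen for this example, which only shows how sharing one pair of key/query projections across heads shrinks the parameter count while keeping one attention matrix per head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 64, 8, 8   # toy sizes (hypothetical)
d_shared = 32                                # shared key/query dim < n_heads * d_head = 64

X = rng.standard_normal((T, d_model))

# Standard MHA: every head owns independent d_head-dimensional key/query projections.
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
A_std = np.stack([
    softmax((X @ Wq[h]) @ (X @ Wk[h]).T / np.sqrt(d_head))
    for h in range(n_heads)
])

# Collaborative variant (sketch): all heads share one pair of key/query
# projections; each head selects its own mix of the d_shared dimensions
# through a diagonal re-weighting by the mixing vector M[h].
Wq_sh = rng.standard_normal((d_model, d_shared))
Wk_sh = rng.standard_normal((d_model, d_shared))
M = rng.standard_normal((n_heads, d_shared))
A_col = np.stack([
    softmax((X @ Wq_sh * M[h]) @ (X @ Wk_sh).T / np.sqrt(d_shared))
    for h in range(n_heads)
])

# Key/query parameter counts: independent heads vs. shared projections.
params_std = 2 * n_heads * d_model * d_head               # 8192
params_col = 2 * d_model * d_shared + n_heads * d_shared  # 4352
print(A_std.shape, A_col.shape, params_std, params_col)
```

Both variants produce `n_heads` attention matrices of shape `(T, T)`, but the collaborative version stores far fewer key/query parameters, which is the redundancy the paper sets out to exploit.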
This work aims to better detect and quantify head redundancy by asking whether independent heads learn overlapping or distinct concepts. This question relates to work on CNN compression that uses a Tucker decomposition to factorize common filters in a trained convolutional network (Kim et al., 2016). In attention models, we discover that some key/query projected dimensions are redundant, as trained



https://github.com/...

