TRANSFORMERS WITH COMPETITIVE ENSEMBLES OF INDEPENDENT MECHANISMS

Anonymous

Abstract

An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.

1. INTRODUCTION

A major theme throughout the history of deep learning has been the introduction of inductive biases in neural architectures, more recently with a focus on the ability to dynamically keep distinct types of information separated. While an MLP has one large hidden representation at each layer, a convnet keeps the representations of different spatial positions separated by default. This separation enables more appropriate reuse of parameters, improving generalization (e.g. compared with a fully connected MLP) by ensuring that the parts of the hidden representation which capture some aspects of the data can remain unchanged when other aspects change. Additionally, it is important to reuse parameters in all situations where they are relevant, and to avoid applying them in positions where they are irrelevant; this is where attention mechanisms can be very useful. While dividing information between different positions (for example, time steps or spatial positions) is already very useful, it has been recognized since the earliest deep learning work on the notion of disentangling (Bengio, 2009; Glorot et al., 2011; Rifai et al., 2012; Mathieu et al., 2016; Achille & Soatto, 2018) that other features of the data could advantageously be kept well-separated, even over overlapping sets of positions. This suggests that a model can be decomposed into multiple components, often called modules, each operating on a different set of features. Modularity has been identified as an essential ingredient for generalization in machine learning (Ronco et al., 1997; Alet et al., 2018; Goyal et al., 2019). The motivating intuition is that if the relationship between the modules changes between training and evaluation, then a model which keeps these modules sufficiently separate, but can adapt how they are combined, could be more robust.
It can even be robust to changes where the overall data distribution differs between training and evaluation. This has been studied in the causality literature through the notion of "Independent Mechanisms" (Peters et al., 2018; Parascandolo et al., 2018) or causal modules, which can be flexibly re-combined, re-used, and re-purposed. While modularity and independent mechanisms are closely related ideas, the latter places a special focus on the notion that mechanisms should have the ability to remain unchanged when unrelated aspects of the world are changed. In that sense it is a more specific idea which builds on the more general concept of modularity. While the study of independent mechanisms in the context of deep architectures is relatively recent (Goyal et al., 2019; Mittal et al., 2020), a few ideas are considered central. One is that mechanisms are separately parameterized (or dynamically parameterized, with the possibility of separation), which means that the function computed by a mechanism remains the same even as other mechanisms change. Another central idea is specialization between mechanisms: each mechanism should seek to model only some parts of the world. One way to help accomplish this is by forcing the mechanisms to compete to explain different positions (in time or space), such that some mechanisms would not be used by the model on positions where they are less relevant. In this work we explore how the idea of independent mechanisms can be beneficial in the Transformer architecture. Transformers (Vaswani et al., 2017) are based on information sharing across positions controlled dynamically by a soft-attention mechanism (Bahdanau et al., 2014), while still using a fully-connected MLP to process the extracted feature vectors (concatenated over a set of attention heads) at each position.
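The competition idea described above can be made concrete: each mechanism produces a relevance score for every position, and a softmax is taken across mechanisms (rather than across positions), so that a mechanism can only gain weight at a position by out-scoring the others there. The following is a minimal illustrative sketch under our own assumptions, not the paper's implementation; all names are hypothetical.

```python
import numpy as np

def competition_weights(scores):
    """Normalize per-position relevance scores across mechanisms.

    scores: array of shape (num_positions, num_mechanisms), where
    scores[t, m] is mechanism m's claim on position t. The softmax is
    taken across mechanisms, so mechanisms compete for each position.
    """
    z = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Two mechanisms competing over two positions: each position is claimed
# strongly by a different mechanism, encouraging specialization.
scores = np.array([[4.0, 0.0],   # position 0: mechanism 0 dominates
                   [0.0, 4.0]])  # position 1: mechanism 1 dominates
weights = competition_weights(scores)
```

Because the normalization runs across mechanisms rather than positions, mechanisms are pushed to specialize over positions instead of all modeling everything.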
An important way in which this attention-based sharing improves over convnets is that, if the attention becomes sufficiently sparse, it gains the ability to keep information well-separated between different positions. At the same time, at each position the Transformer stores a single monolithic hidden representation, over which it applies its entire set of parameters. For example, in a generative model of images of animals in a field, some of the parameters, such as those describing how animals have symmetric eyes or a certain number of feet, are relevant only at the positions in the image where an animal is present. A standard Transformer, however, applies the same parameters to the entire hidden representation at all spatial positions. Additionally, if sources of information need to be accessed over multiple positions, it has no way to keep that information well-separated between parts of the hidden representation, unless a large fraction of the parameters is set to exactly zero. In practice, models tend not to learn such highly sparse parameter matrices, as sparsity is not necessary in order to fit the training set. Different underlying factors thus tend to be freely blended together rather than disentangled: we hypothesize, and show empirically, that this leads to deteriorated generalization when something about some of these factors changes. Our newly proposed technique, which we call Transformers with Competitive Independent Mechanisms (TIM), seeks to address this limitation of the Transformer by dividing the hidden representation and parameters into multiple distinct mechanisms. These mechanisms perform self-attention (over input elements) separately, and information is exchanged sparingly between the mechanisms using

Figure 1: We show a simplified version of the model at a single position to illustrate the difference between heads and mechanisms (left). Heads allow for parallel attention, but the differentiation between heads is transient: it begins with the projection layer and ends immediately after the attention. As a result, most of the parameters are not head-specific. With independent mechanisms, the information is kept well-separated throughout the entire layer, and all of the layer's parameters are specific to a single mechanism. Competition patterns of an unsupervised image Transformer with 2 mechanisms on CIFAR images (top right) show that mechanisms learn to specialize over foreground and background patterns in an early layer (center right) and become more confident in a later layer (bottom right).
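To make the heads-versus-mechanisms distinction concrete, the sketch below divides the hidden state at each position into equal slices, gives each slice its own self-attention parameters, and gates each slice's residual update with a competition softmax across mechanisms. This is a minimal sketch under our own assumptions (equal-sized slices, a learned scoring vector per mechanism), not the paper's released code; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class Mechanism:
    """Self-attention with its own parameters, applied only to one slice
    of the hidden state. (Contrast with heads, where only the initial
    projections are head-specific and the rest of the layer is shared.)"""
    def __init__(self, d):
        scale = 1.0 / np.sqrt(d)
        self.Wq = rng.standard_normal((d, d)) * scale
        self.Wk = rng.standard_normal((d, d)) * scale
        self.Wv = rng.standard_normal((d, d)) * scale

    def __call__(self, h):                        # h: (T, d)
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        att = softmax(q @ k.T / np.sqrt(h.shape[1]), axis=-1)
        return att @ v

def tim_layer(h, mechanisms, score_vecs):
    """h: (T, d_model), split into len(mechanisms) equal slices.
    score_vecs[m]: (d_slice,) vector scoring mechanism m's relevance at
    each position; a softmax across mechanisms gates each update."""
    n = len(mechanisms)
    slices = np.split(h, n, axis=1)               # per-mechanism states
    scores = np.stack([s @ w for s, w in zip(slices, score_vecs)], axis=1)
    c = softmax(scores, axis=1)                   # compete across mechanisms
    out = [s + c[:, m:m + 1] * mech(s)            # gated residual update
           for m, (s, mech) in enumerate(zip(slices, mechanisms))]
    return np.concatenate(out, axis=1)            # (T, d_model)
```

The sparing inter-mechanism information exchange (attention between mechanisms) is omitted here for brevity; the sketch only shows the separation of parameters and the position-wise competition.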

