TRANSFORMERS WITH COMPETITIVE ENSEMBLES OF INDEPENDENT MECHANISMS

Anonymous

Abstract

An important development in deep learning since the earliest MLPs has been the move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism retains the same processing even as irrelevant aspects of the world change. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) dynamically learn which combination of positions to process. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms that exchange information only through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement, and find evidence for semantically meaningful specialization as well as improved performance.
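To make the competition idea concrete, the following is a minimal numpy sketch of one simplified update step: the hidden state at each position is split into several mechanisms, each with its own parameters, and a softmax across mechanisms gates how strongly each one writes at each time step. The names (`tim_competition_step`, `W_score`, `W_ffn`, `n_mech`) are illustrative assumptions, and the sketch deliberately omits the inter-mechanism attention that the full TIM layer uses; it is not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tim_competition_step(h, W_score, W_ffn):
    """One simplified, illustrative TIM-style update (hypothetical names).

    h:       (T, n_mech, d)    hidden states split into n_mech mechanisms
    W_score: (n_mech, d)       per-mechanism scoring weights for competition
    W_ffn:   (n_mech, d, d)    separate feed-forward parameters per mechanism
    """
    # Competition: each mechanism scores each time step; the softmax is
    # taken ACROSS mechanisms, so mechanisms compete for each position.
    scores = np.einsum('tkd,kd->tk', h, W_score)   # (T, n_mech)
    c = softmax(scores, axis=1)                    # (T, n_mech)
    # Each mechanism applies its own parameters; its update is gated by c,
    # so the "winning" mechanism at a step does most of the writing there.
    update = np.maximum(0.0, np.einsum('tkd,kde->tke', h, W_ffn))
    return h + c[..., None] * update
```

Because each mechanism has its own parameter tensors and the gate is a softmax over mechanisms (not over positions), specialization is encouraged: a mechanism that scores a time step highly suppresses the others' updates at that step.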

1. INTRODUCTION

A major theme throughout the history of deep learning has been the introduction of inductive biases in neural architectures, more recently with a focus on the ability to dynamically keep distinct types of information separated. While an MLP has one large hidden representation at each layer, a convnet keeps the representations of different spatial positions separated by default. This separation enables more appropriate reuse of parameters, improving generalization (e.g., compared with a fully connected MLP) by ensuring that the parts of the hidden representation capturing some aspects of the data can remain unchanged when other aspects change. It is also important to reuse parameters in all situations where they are relevant, and to avoid applying them in positions where they are irrelevant; this is where attention mechanisms can be very useful. While dividing information between different positions (for example, time steps or spatial positions) is already very useful, it has been recognized since the earliest deep learning work on disentangling (Bengio, 2009; Glorot et al., 2011; Rifai et al., 2012; Mathieu et al., 2016; Achille & Soatto, 2018) that other features of the data could advantageously be kept well-separated, even over overlapping sets of positions. This suggests that a model can be decomposed into multiple components, often called modules, each operating on a different set of features. Modularity has been identified as an essential ingredient for generalization in machine learning (Ronco et al., 1997; Alet et al., 2018; Goyal et al., 2019). The motivating intuition is that if the relationship between the modules changes between training and evaluation, then a model which keeps these modules sufficiently separate, but can adapt how they are combined, could be more robust. It can even be robust to changes where the overall data distribution differs between training and evaluation. This has been studied in the causality literature through the notion of "Independent Mechanisms"

