GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION FOR VISION

Abstract

We provide a general self-attention formulation to impose group equivariance under arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.

1. INTRODUCTION

Recent advances in Natural Language Processing have been largely attributed to the rise of the Transformer (Vaswani et al., 2017). Its key difference from previous methods, e.g., recurrent neural networks and convolutional neural networks (CNNs), is its ability to query information from all input words simultaneously. This is achieved via the self-attention operation (Bahdanau et al., 2015; Cheng et al., 2016), which computes the similarity between representations of words in the sequence in the form of attention scores. The representation of each word is then updated based on the words with the highest attention scores. Inspired by the capacity of transformers to learn meaningful inter-word dependencies, researchers have started applying self-attention to vision tasks. It was first adopted into CNNs via channel-wise attention (Hu et al., 2018) and non-local spatial modeling (Wang et al., 2018). More recently, it has been proposed to replace CNNs with self-attention networks either partially (Bello et al., 2019) or entirely (Ramachandran et al., 2019). Contrary to discrete convolutional kernels, weights in self-attention are not tied to particular positions (Fig. A.1), yet self-attention layers are able to express any convolutional layer (Cordonnier et al., 2020). This flexibility allows leveraging long-range dependencies under a fixed parameter budget.

An arguably orthogonal advancement to deep learning architectures is the incorporation of symmetries into the model itself. Translation equivariance is key to the success of CNNs: it describes the property that if a pattern is translated, its numerical descriptors are also translated, but not otherwise modified. The seminal work of Cohen & Welling (2016) provides a recipe to extend the translation equivariance of CNNs to other symmetry groups in order to further improve generalization and sample efficiency (see §2).
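To make the role of positional information concrete, the following is a minimal numpy sketch of scaled dot-product self-attention (not the paper's implementation). Without any positional encoding, the operation is permutation-equivariant: permuting the inputs permutes the outputs identically, which is why positional encodings are the natural place to encode geometric structure.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of n feature vectors.

    x: (n, d) input features; w_q, w_k, w_v: (d, d) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])            # pairwise attention scores
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    return attn @ v                                   # weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Permuting the input set permutes the output in exactly the same way.
perm = rng.permutation(n)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
assert np.allclose(out[perm], out_perm)
```

Weight matrices and variable names here are illustrative; in practice the projections are learned and the operation is applied per attention head.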
In this work, we introduce group self-attention, a self-attention formulation that grants equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings invariant to the action of the group considered. In addition to the generalization and sample-efficiency improvements provided by group equivariance, group equivariant self-attention networks (GSA-Nets) bring important benefits over group convolutional architectures: (i) Parameter efficiency: contrary to conventional discrete group convolutional kernels, where weights are tied to particular positions of neighborhoods on the group, group equivariant self-attention leverages long-range dependencies on functions on the group under a fixed parameter budget, yet it is able to express any group convolutional kernel. This allows for very expressive networks with a low parameter count. (ii) Steerability: since the group acts directly on the positional encoding, GSA-Nets are steerable (Weiler et al., 2018b) by nature. This allows us to go beyond group discretizations that live on the grid without introducing interpolation artifacts.
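The mechanism behind group-invariant positional encodings can be illustrated with a toy sketch (this is not the paper's construction; the distance-only encoding below is far more restrictive than the encodings developed in §5, and all names are hypothetical). If the positional term of the attention scores depends on relative positions only through quantities invariant under the group action, then acting on the positions with a group element, here a planar rotation, leaves the attention scores unchanged:

```python
import numpy as np

def attention_scores(pos, feats, w, enc):
    """Content term plus a positional term built from pairwise relative positions."""
    rel = pos[None, :, :] - pos[:, None, :]   # (n, n, 2) relative offsets
    return feats @ w @ feats.T + enc(rel)

def rot_invariant_enc(rel):
    # Toy encoding depending only on the distance between positions,
    # hence invariant under any planar rotation of the point set.
    return -np.linalg.norm(rel, axis=-1)

rng = np.random.default_rng(1)
n = 6
pos = rng.normal(size=(n, 2))     # 2-D positions of the n elements
feats = rng.normal(size=(n, 4))   # their feature vectors

w = rng.normal(size=(4, 4))
theta = np.pi / 3                 # rotate the whole point set by 60 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

s = attention_scores(pos, feats, w, rot_invariant_enc)
s_rot = attention_scores(pos @ R.T, feats, w, rot_invariant_enc)
assert np.allclose(s, s_rot)      # scores unchanged under rotation of positions
```

Invariant attention scores of this kind are what make the subsequent aggregation commute with the group action, and since the group acts on continuous positions rather than on a sampled grid, no interpolation is needed.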

Contributions:

• We provide an extensive analysis of the equivariance properties of self-attention (§4).
• We provide a general formulation to impose group equivariance on self-attention (§5).
• We provide instances of self-attentive architectures equivariant to several symmetry groups (§6).
• Our results demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks (§6).

