GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION FOR VISION

Abstract

We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.

1. INTRODUCTION

Recent advances in Natural Language Processing have been largely attributed to the rise of the Transformer (Vaswani et al., 2017). Its key difference with previous methods, e.g., recurrent neural networks and convolutional neural networks (CNNs), is its ability to query information from all the input words simultaneously. This is achieved via the self-attention operation (Bahdanau et al., 2015; Cheng et al., 2016), which computes the similarity between representations of words in the sequence in the form of attention scores. Next, the representation of each word is updated based on the words with the highest attention scores. Inspired by the capacity of transformers to learn meaningful inter-word dependencies, researchers have started applying self-attention to vision tasks. It was first adopted into CNNs via channel-wise attention (Hu et al., 2018) and non-local spatial modeling (Wang et al., 2018). More recently, it has been proposed to replace CNNs with self-attention networks either partially (Bello et al., 2019) or entirely (Ramachandran et al., 2019). Contrary to discrete convolutional kernels, weights in self-attention are not tied to particular positions (Fig. A.1), yet self-attention layers are able to express any convolutional layer (Cordonnier et al., 2020). This flexibility allows leveraging long-range dependencies under a fixed parameter budget.

An arguably orthogonal advancement to deep learning architectures is the incorporation of symmetries into the model itself. Translation equivariance is key to the success of CNNs: if a pattern is translated, its numerical descriptors are also translated, but not modified. The seminal work of Cohen & Welling (2016) provides a recipe to extend the translation equivariance of CNNs to other symmetry groups in order to improve generalization and sample-efficiency further (see §2).
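The translation equivariance property described above can be verified numerically. The following is a minimal sketch (not from the paper) that checks that a 2D convolution commutes with translation; circular boundary handling is used so the identity holds exactly rather than only up to boundary effects:

```python
import numpy as np

def conv2d_circular(x, k):
    """2D cross-correlation with circular (periodic) boundaries, so
    translations are exact group actions and equivariance holds exactly."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += k[a, b] * x[(i + a) % H, (j + b) % W]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))   # input "image"
k = rng.normal(size=(3, 3))   # convolutional kernel

shift = (2, 3)
lhs = conv2d_circular(np.roll(x, shift, axis=(0, 1)), k)  # convolve the shifted input
rhs = np.roll(conv2d_circular(x, k), shift, axis=(0, 1))  # shift the convolved output

assert np.allclose(lhs, rhs)  # translation equivariance: both paths agree
```

The same two-path check (transform-then-map versus map-then-transform) is the template used throughout this work, with translation replaced by the action of a more general symmetry group.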
In this work, we introduce group self-attention, a self-attention formulation that grants equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings invariant to the action of the group considered. In addition to the generalization and sample-efficiency improvements provided by group equivariance, group equivariant self-attention networks (GSA-Nets) bring important benefits over group convolutional architectures: (i) Parameter efficiency: contrary to conventional discrete group convolutional kernels, where weights are tied to particular positions of neighborhoods on the group, group equivariant self-attention leverages long-range dependencies on group functions under a fixed parameter budget, yet it is able to express any group convolutional kernel. This allows for very expressive networks with low parameter count. (ii) Steerability: since the group acts directly on the positional encoding, GSA-Nets are steerable (Weiler et al., 2018b) by nature. This allows us to go beyond group discretizations that live on the grid without introducing interpolation artifacts.
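The steerability argument rests on how a positional encoding behaves under the group action. The sketch below is an illustrative construction (hypothetical, not the paper's exact encoding): for the rotation group C4 acting on the plane, a relative-position vector is "lifted" by stacking its image under every group element. A further rotation of the input then only cyclically permutes the stack, so any function of the lifted encoding that is permutation-insensitive is invariant, which is the property that makes the attention scores equivariant:

```python
import numpy as np

def rot(theta):
    """2x2 planar rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# The cyclic rotation group C4 = {0°, 90°, 180°, 270°}.
C4 = [rot(k * np.pi / 2) for k in range(4)]

def lifted_encoding(delta):
    """Stack the relative position delta under every group element: (4, 2)."""
    return np.stack([R @ delta for R in C4])

delta = np.array([1.0, 2.0])          # a relative position between two pixels
enc = lifted_encoding(delta)
enc_rot = lifted_encoding(C4[1] @ delta)  # encode after rotating the input by 90°

# Rotating the input only cyclically permutes the lifted encoding,
# since R_k R_1 = R_{k+1} in C4:
assert np.allclose(enc_rot, np.roll(enc, -1, axis=0))
```

Because the group acts on the encoding through exact matrix rotations rather than grid resampling, no interpolation artifacts arise, even for group elements that do not map the pixel grid onto itself.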

Contributions:

• We provide an extensive analysis of the equivariance properties of self-attention (§4).
• We provide a general formulation to impose group equivariance to self-attention (§5).
• We provide instances of self-attentive architectures equivariant to several symmetry groups (§6).
• Our results demonstrate consistent improvements of GSA-Nets over non-equivariant ones (§6).

2. RELATED WORK

Several approaches exist which provide equivariance to various symmetry groups. The translation equivariance of CNNs has been extended to additional symmetries ranging from planar rotations (Dieleman et al., 2016; Marcos et al., 2017; Worrall et al., 2017; Weiler et al., 2018b; Li et al., 2018; Cheng et al., 2018; Hoogeboom et al., 2018; Bekkers et al., 2018; Veeling et al., 2018; Lenssen et al., 2018; Graham et al., 2020) to spherical rotations (Cohen et al., 2018; 2019b; Worrall & Brostow, 2018; Weiler et al., 2018a; Esteves et al., 2019a;b; 2020), scaling (Marcos et al., 2018; Worrall & Welling, 2019; Sosnovik et al., 2020; Romero et al., 2020b) and more general symmetry groups (Cohen & Welling, 2016; Kondor & Trivedi, 2018; Tai et al., 2019; Weiler & Cesa, 2019; Cohen et al., 2019a; Bekkers, 2020; Venkataraman et al., 2020). Importantly, all these approaches utilize discrete convolutional kernels, and thus tie weights to particular positions in the neighborhood on which the kernels are defined. As group neighborhoods are (much) larger than conventional ones, the number of weights discrete group convolutional kernels require increases proportionally. This phenomenon is further exacerbated by attentive group equivariant networks (Romero & Hoogendoorn, 2019; Diaconu & Worrall, 2019; Romero et al., 2020a). Since attention is used to leverage non-local information to aid local operations, non-local neighborhoods are required. However, as attention branches often rely on discrete convolutions, they effectively tie specific weights to particular positions on a large non-local neighborhood on the group. As a result, attention comes at the cost of growth in model size, and thus of reduced statistical efficiency. Differently, group self-attention is able to attend over arbitrarily large group neighborhoods under a fixed parameter budget.
In addition, group self-attention is steerable by nature (§5.1), a property primarily exhibited by works carefully designed to that end. Another way to detach weights from particular positions comes from parameterizing convolutional kernels as (constrained) neural networks (Thomas et al., 2018; Finzi et al., 2020). Introduced to handle irregularly-sampled data, e.g., point clouds, networks parameterizing convolutional kernels receive relative positions as input and output the kernel values at those positions. In contrast, our mappings change as a function of the input content. Most relevant to our work are the SE(3) and Lie Transformers (Fuchs et al., 2020; Hutchinson et al., 2020). However, while we obtain group equivariance via a generalization of positional encodings, Hutchinson et al. (2020) do so via operations on the Lie algebra, and Fuchs et al. (2020) via irreducible representations. In addition, our work prioritizes applications on visual data and extensively analyses theoretical aspects and properties of group equivariant self-attention.

3. STAND-ALONE SELF-ATTENTION

In this section, we recall the mathematical formulation of self-attention and emphasize the role of the positional encoding. Next, we introduce a functional formulation of self-attention which will allow us to analyze and generalize its equivariance properties.

Definition. Let X ∈ R^{N×C_in} be an input matrix consisting of N tokens of C_in dimensions each.¹ A self-attention layer maps an input matrix X ∈ R^{N×C_in} to an output matrix Y ∈ R^{N×C_out} as:

Y = SA(X) := softmax_[N,:](A) X W^val,

with W^val ∈ R^{C_in×C_h} the value matrix, A ∈ R^{N×N} the attention scores matrix, and softmax_[N,:](A) the attention probabilities, obtained by applying the softmax function to every row of A. The matrix A is computed as:

A := X W^qry (X W^key)^⊺,

parameterized by query and key matrices W^qry, W^key ∈ R^{C_in×C_h}. In practice, it has been found beneficial to apply multiple self-attention operations, also called heads, in parallel, such that different heads can attend to different parts of the input.
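The definition above can be sketched directly in NumPy. The following minimal single-head implementation follows the shapes in the text (X is N×C_in, the projection matrices are C_in×C_h); it also checks the property underlying §4's analysis, namely that without a positional encoding, self-attention is equivariant to permutations of the tokens:

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    A = A - A.max(axis=-1, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_qry, W_key, W_val):
    """Single-head self-attention: Y = softmax(A) X W_val,
    with A = X W_qry (X W_key)^T."""
    A = (X @ W_qry) @ (X @ W_key).T  # attention scores, (N, N)
    P = softmax_rows(A)              # attention probabilities
    return P @ (X @ W_val)           # output, (N, C_h)

rng = np.random.default_rng(42)
N, C_in, C_h = 5, 8, 4
X = rng.normal(size=(N, C_in))
W_qry, W_key, W_val = (rng.normal(size=(C_in, C_h)) for _ in range(3))

Y = self_attention(X, W_qry, W_key, W_val)
assert Y.shape == (N, C_h)

# Without a positional encoding, self-attention is permutation equivariant:
# permuting the input tokens permutes the output rows identically.
perm = rng.permutation(N)
Y_perm = self_attention(X[perm], W_qry, W_key, W_val)
assert np.allclose(Y_perm, Y[perm])
```

A positional encoding breaks this full permutation symmetry down to a smaller one; the generalization of §5 is precisely to choose the encoding so that the remaining symmetry is the desired group.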



¹ We consequently consider an image as a set of N discrete objects i ∈ {1, 2, ..., N}.



Figure 1: Behavior of feature representations in group self-attention networks. An input rotation induces a rotation plus a cyclic permutation of the intermediate feature representations of the network. Additional examples for all the groups used in this work, as well as their usage, are provided in repo/demo/.

