GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION FOR VISION

Abstract

We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.

1. INTRODUCTION

Recent advances in Natural Language Processing have been largely attributed to the rise of the Transformer (Vaswani et al., 2017). Its key difference with previous methods, e.g., recurrent neural networks or convolutional neural networks (CNNs), is its ability to query information from all input words simultaneously. This is achieved via the self-attention operation (Bahdanau et al., 2015; Cheng et al., 2016), which computes the similarity between representations of words in the sequence in the form of attention scores. The representation of each word is then updated based on the words with the highest attention scores. Inspired by the capacity of transformers to learn meaningful inter-word dependencies, researchers have started applying self-attention to vision tasks. It was first adopted into CNNs via channel-wise attention (Hu et al., 2018) and non-local spatial modeling (Wang et al., 2018). More recently, it has been proposed to replace CNNs with self-attention networks either partially (Bello et al., 2019) or entirely (Ramachandran et al., 2019). Contrary to discrete convolutional kernels, weights in self-attention are not tied to particular positions (Fig. A.1), yet self-attention layers are able to express any convolutional layer (Cordonnier et al., 2020). This flexibility allows leveraging long-range dependencies under a fixed parameter budget. An arguably orthogonal advancement to deep learning architectures is the incorporation of symmetries into the model itself. Translation equivariance is key to the success of CNNs: it describes the property that if a pattern is translated, its numerical descriptors are also translated, but not modified. The seminal work of Cohen & Welling (2016) provides a recipe to extend the translation equivariance of CNNs to other symmetry groups so as to further improve generalization and sample efficiency (see §2).
In this work, we introduce group self-attention, a self-attention formulation that grants equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings invariant to the action of the group considered. In addition to generalization and sample-efficiency improvements provided by group equivariance, group equivariant self-attention networks (GSA-Nets) bring important benefits over group convolutional architectures: (i) Parameter efficiency: contrary to conventional discrete group convolutional kernels, where weights are tied to particular positions of neighborhoods on the group, group equivariant self-attention leverages long-range dependencies on group functions under a fixed parameter budget, yet it is able to express any group convolutional kernel. This allows for very expressive networks with low parameter count. (ii) Steerability: since the group acts directly on the positional encoding, GSA-Nets are steerable (Weiler et al., 2018b) by nature. This allows us to go beyond group discretizations that live in the grid without introducing interpolation artifacts.

Contributions:

• We provide an extensive analysis of the equivariance properties of self-attention (§4).
• We provide a general formulation to impose group equivariance to self-attention (§5).
• We provide instances of self-attentive architectures equivariant to several symmetry groups (§6).
• Our results demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks (§6).

Additional examples for all the groups used in this work as well as their usage are provided in repo/demo/.

2. RELATED WORK

Several approaches exist which provide equivariance to various symmetry groups. The translation equivariance of CNNs has been extended to additional symmetries ranging from planar rotations (Dieleman et al., 2016; Marcos et al., 2017; Worrall et al., 2017; Weiler et al., 2018b; Li et al., 2018; Cheng et al., 2018; Hoogeboom et al., 2018; Bekkers et al., 2018; Veeling et al., 2018; Lenssen et al., 2018; Graham et al., 2020) to spherical rotations (Cohen et al., 2018; 2019b; Worrall & Brostow, 2018; Weiler et al., 2018a; Esteves et al., 2019a; b; 2020), scaling (Marcos et al., 2018; Worrall & Welling, 2019; Sosnovik et al., 2020; Romero et al., 2020b) and more general symmetry groups (Cohen & Welling, 2016; Kondor & Trivedi, 2018; Tai et al., 2019; Weiler & Cesa, 2019; Cohen et al., 2019a; Bekkers, 2020; Venkataraman et al., 2020). Importantly, all these approaches utilize discrete convolutional kernels, and thus tie weights to particular positions in the neighborhood on which the kernels are defined. As group neighborhoods are (much) larger than conventional ones, the number of weights required by discrete group convolutional kernels increases proportionally. This phenomenon is further exacerbated by attentive group equivariant networks (Romero & Hoogendoorn, 2019; Diaconu & Worrall, 2019; Romero et al., 2020a). Since attention is used there to leverage non-local information to aid local operations, non-local neighborhoods are required. However, as attention branches often rely on discrete convolutions, they effectively tie specific weights to particular positions on a large non-local neighborhood on the group. As a result, attention comes bound to a growth in model size and, thus, to a reduction in statistical efficiency. Differently, group self-attention is able to attend over arbitrarily large group neighborhoods under a fixed parameter budget.
In addition, group self-attention is steerable by nature (§5.1), a property primarily exhibited by works carefully designed to that end. Another way to detach weights from particular positions is to parameterize convolutional kernels as (constrained) neural networks (Thomas et al., 2018; Finzi et al., 2020). Introduced to handle irregularly-sampled data, e.g., point-clouds, networks parameterizing convolutional kernels receive relative positions as input and output the kernel values at those positions. In contrast, our mappings change as a function of the input content. Most relevant to our work are the SE(3)-Transformer and the LieTransformer (Fuchs et al., 2020; Hutchinson et al., 2020). However, whereas we obtain group equivariance via a generalization of positional encodings, Hutchinson et al. (2020) do so via operations on the Lie algebra, and Fuchs et al. (2020) via irreducible representations. In addition, our work prioritizes applications on visual data and extensively analyzes theoretical aspects and properties of group equivariant self-attention.

3. STAND-ALONE SELF-ATTENTION

In this section, we recall the mathematical formulation of self-attention and emphasize the role of the positional encoding. Next, we introduce a functional formulation of self-attention which will allow us to analyze and generalize its equivariance properties.

Definition. Let X ∈ R^(N×C_in) be an input matrix consisting of N tokens of C_in dimensions each. A self-attention layer maps an input matrix X ∈ R^(N×C_in) to an output matrix Y ∈ R^(N×C_out) as:

Y = SA(X) := softmax(A) X W_val,   (1)

with W_val ∈ R^(C_in×C_h) the value matrix, A ∈ R^(N×N) the matrix of attention scores, and softmax(A) the row-wise softmax yielding attention probabilities. The matrix A is computed as:

A := X W_qry (X W_key)^⊤,   (2)

parameterized by query and key matrices W_qry, W_key ∈ R^(C_in×C_h). In practice, it has been found beneficial to apply multiple self-attention operations, also called heads, in parallel, such that different heads are able to attend to different parts of the input. In this multi-head self-attention formulation, the outputs of H heads of output dimension C_h are concatenated and projected to C_out as:

MHSA(X) := concat_{h∈[H]}[SA^(h)(X)] W_out + b_out,   (3)

with a projection matrix W_out ∈ R^(HC_h×C_out) and a bias term b_out ∈ R^(C_out).
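As a concrete reference, Eqs. 1-3 can be sketched in a few lines of NumPy. This is a minimal single-example sketch under our own naming conventions, not the authors' implementation:

```python
import numpy as np

def softmax(a, axis=-1):
    # row-wise softmax: turns attention scores into attention probabilities
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_qry, W_key, W_val):
    # Eq. 2: A := X W_qry (X W_key)^T, then Eq. 1: Y = softmax(A) X W_val
    A = (X @ W_qry) @ (X @ W_key).T      # (N, N) attention scores
    return softmax(A) @ (X @ W_val)      # (N, C_h)

def mhsa(X, heads, W_out, b_out):
    # Eq. 3: concatenate H heads and project; heads = [(W_qry, W_key, W_val), ...]
    Y = np.concatenate([self_attention(X, *W) for W in heads], axis=-1)
    return Y @ W_out + b_out             # (N, C_out)

# toy usage: N = 5 tokens, C_in = 8, two heads of C_h = 4, C_out = 6
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Y = mhsa(X, heads, rng.normal(size=(8, 6)), rng.normal(size=6))
```

Note that the parameter count is independent of N: the same W_qry, W_key, W_val are applied to every token.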

3.1. THE ROLE OF THE POSITIONAL ENCODING

Note that the self-attention operation defined in Eq. 3 is equivariant to permutations of the input rows of X. That is, a permutation of the rows of X produces the same output Y up to this permutation. Hence, self-attention is blind to the order of its inputs, i.e., it is a set operation. Illustratively, an input image is processed as a bag of pixels and its structural content is not considered. To alleviate this limitation, the input representations in self-attention are often enriched with a positional encoding that provides positional information about the set elements.

Absolute positional encoding. Vaswani et al. (2017) introduced a (learnable) positional encoding P ∈ R^(N×C_in) for each input position, which is added to the inputs when computing the attention scores:

A := (X + P) W_qry ((X + P) W_key)^⊤.   (4)

More generally, P can be substituted by any function that returns a vector representation of the position, and can be incorporated by means of addition or concatenation, e.g., Zhao et al. (2020). This positional encoding injects additional structural information about the tokens into the model, which makes it susceptible to changes in the tokens' positions. Unfortunately, since absolute positional encodings are unique to each position, the model must learn to recognize similar patterns at every position independently. This undesired data inefficiency is addressed by relative positional encodings.

Relative positional encoding. Introduced by Shaw et al. (2018), relative encodings consider the relative distance between the query token i (the token we compute the representation of) and the key token j (the token we attend to). The calculation of the attention scores (Eq. 2) then becomes:

A^rel_{i,j} := X_i W_qry ((X_j + P_{x(j)−x(i)}) W_key)^⊤,   (5)

where P_{x(j)−x(i)} ∈ R^(1×C_in) is a vector representation of the relative shift and x(i) is the position of token i as defined in §3.2.
Consequently, similar patterns can be recognized at arbitrary positions, as relative query-key distances always remain equal.
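This invariance to global shifts can be checked directly. The sketch below computes the scores of Eq. 5 and verifies that they are unchanged when every token position is translated by the same amount; the lookup table P_rel and the 1D integer positions are our illustrative assumptions, not the authors' code:

```python
import numpy as np

def rel_scores(X, pos, P_rel, W_qry, W_key):
    # Eq. 5: A^rel_ij = X_i W_qry ((X_j + P_{x(j)-x(i)}) W_key)^T
    # P_rel: dict mapping a relative 1D offset to its (C_in,) encoding
    N = X.shape[0]
    A = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            A[i, j] = (X[i] @ W_qry) @ ((X[j] + P_rel[pos[j] - pos[i]]) @ W_key)
    return A

rng = np.random.default_rng(1)
N, C_in, C_h = 4, 8, 5
X = rng.normal(size=(N, C_in))
W_qry, W_key = rng.normal(size=(C_in, C_h)), rng.normal(size=(C_in, C_h))
P_rel = {d: rng.normal(size=C_in) for d in range(-N + 1, N)}
pos = np.arange(N)

A = rel_scores(X, pos, P_rel, W_qry, W_key)
A_shifted = rel_scores(X, pos + 7, P_rel, W_qry, W_key)  # translate all positions
```

Since only the offsets pos[j] − pos[i] enter the encoding, A and A_shifted coincide exactly.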

3.2. A FUNCTIONAL FORMULATION TO SELF-ATTENTION

Notation. We denote by [n] the set {1, 2, . . . , n}. Given a set S and a vector space V, L_V(S) denotes the space of functions {f : S → V}. Square brackets are used when functions are arguments.

Let S = {i}_{i=1}^N be a set of N elements. A matrix X ∈ R^(N×C_in) can be interpreted as a vector-valued function f : S → R^(C_in) that maps set elements i ∈ S to C_in-dimensional vectors: f : i ↦ f(i). Consequently, a matrix multiplication X W_y^⊤ of matrices X ∈ R^(N×C_in) and W_y ∈ R^(C_out×C_in) can be represented as a function ϕ_y : L_{V_C_in}(S) → L_{V_C_out}(S), ϕ_y : f(i) ↦ ϕ_y(f(i)), parameterized by W_y, between the functional spaces L_{V_C_in}(S) = {f : S → R^(C_in)} and L_{V_C_out}(S) = {f : S → R^(C_out)}. Following this notation, we can represent the position-less attention score calculation (Eq. 2) as:

A_{i,j} = α[f](i, j) = ⟨ϕ_qry(f(i)), ϕ_key(f(j))⟩.   (6)

The function α[f] : S × S → R maps pairs of set elements i, j ∈ S to the attention score of j relative to i. Therefore, self-attention (Eq. 1) can be written as:

Y_{i,:} = ζ[f](i) = Σ_{j∈S} σ_j(α[f](i, j)) ϕ_val(f(j)) = Σ_{j∈S} σ_j(⟨ϕ_qry(f(i)), ϕ_key(f(j))⟩) ϕ_val(f(j)),   (7)

where σ_j = softmax_j and ζ[f] : S → R^(C_h). Finally, multi-head self-attention (Eq. 3) can be written as:

MHSA(X)_{i,:} = m[f](i) = ϕ_out(⋃_{h∈[H]} ζ^(h)[f](i)) = ϕ_out(⋃_{h∈[H]} Σ_{j∈S} σ_j(⟨ϕ^(h)_qry(f(i)), ϕ^(h)_key(f(j))⟩) ϕ^(h)_val(f(j))),   (8)

where ⋃ is the functional equivalent of the concatenation operator concat, and m[f] : S → R^(C_out).

Local self-attention. Recall that α[f] assigns an attention score to every other set element j ∈ S relative to the query element i. The computational cost of self-attention is often reduced by restricting its calculation to a local neighborhood N(i) around the query token i, analogous in nature to the local receptive field of CNNs (Fig. A.1a). Consequently, local self-attention can be written as:

m[f](i) = ϕ_out(⋃_{h∈[H]} Σ_{j∈N(i)} σ_j(⟨ϕ^(h)_qry(f(i)), ϕ^(h)_key(f(j))⟩) ϕ^(h)_val(f(j))).   (9)
Note that Eq. 9 is equivalent to Eq. 8 for N(i) = S, i.e., when considering global neighborhoods.

Absolute positional encoding. The absolute positional encoding is a function ρ : S → R^(C_in) that maps set elements i ∈ S to a vector representation of their position: ρ : i ↦ ρ(i). Note that this encoding does not depend on functions defined on the set but only on the set itself. Consequently, absolute position-aware self-attention (Eq. 4) can be written as:

m[f, ρ](i) = ϕ_out(⋃_{h∈[H]} Σ_{j∈N(i)} σ_j(⟨ϕ^(h)_qry(f(i) + ρ(i)), ϕ^(h)_key(f(j) + ρ(j))⟩) ϕ^(h)_val(f(j))).   (10)

The function ρ can be decomposed into two functions, ρ = ρ_P ∘ x: (i) the position function x : S → X, which provides the position of set elements in the underlying space X (e.g., pixel positions), and (ii) the positional encoding ρ_P : X → R^(C_in), which provides vector representations of elements in X. This distinction will be of utmost importance when we pinpoint where exactly (group) equivariance must be imposed on the self-attention operation (§4.3, §5).

Relative positional encoding. Here, positional information is provided in a relative manner. That is, we now provide vector representations ρ(i, j) := ρ_P(x(j) − x(i)) of the relative positions among pairs (i, j), i ∈ S, j ∈ N(i). Consequently, relative position-aware self-attention (Eq. 5) can be written as:

m^r[f, ρ](i) = ϕ_out(⋃_{h∈[H]} Σ_{j∈N(i)} σ_j(⟨ϕ^(h)_qry(f(i)), ϕ^(h)_key(f(j) + ρ(i, j))⟩) ϕ^(h)_val(f(j))).   (11)
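The neighborhood-restricted, functional form of Eq. 11 can be sketched directly on a set equipped with a position function. The sketch below (single head, 1D positions, a dictionary-based encoding ρ_P; all names are ours, not the authors' code) also illustrates locality: tokens outside N(i) cannot influence the output at i.

```python
import numpy as np

def local_rel_sa(f, x, rho_P, neighborhoods, W_qry, W_key, W_val):
    # Single-head version of Eq. 11: for each set element i, attend only over
    # j in N(i), with keys shifted by the relative encoding rho_P(x(j) - x(i)).
    out = {}
    for i, nbhd in neighborhoods.items():
        q = f[i] @ W_qry
        scores = np.array([q @ ((f[j] + rho_P(x[j] - x[i])) @ W_key) for j in nbhd])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax over the neighborhood
        out[i] = sum(p * (f[j] @ W_val) for p, j in zip(probs, nbhd))
    return out

rng = np.random.default_rng(2)
C = 6
f = {i: rng.normal(size=C) for i in range(6)}     # set function f: S -> R^C
x = {i: i for i in range(6)}                      # position function x: S -> Z
table = {d: rng.normal(size=C) for d in (-1, 0, 1)}
rho_P = lambda d: table[d]
nbhd = {i: [j for j in range(6) if abs(x[j] - x[i]) <= 1] for i in range(6)}
W = [rng.normal(size=(C, C)) for _ in range(3)]

y = local_rel_sa(f, x, rho_P, nbhd, *W)
f2 = dict(f)
f2[5] = rng.normal(size=C)                        # perturb a token far from i = 0
y2 = local_rel_sa(f2, x, rho_P, nbhd, *W)
```

Token 5 lies outside N(0) = {0, 1}, so the output at i = 0 is unaffected by the perturbation, while the output at i = 4 (whose neighborhood contains 5) changes.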

4. EQUIVARIANCE ANALYSIS OF SELF-ATTENTION

In this section we analyze the equivariance properties of self-attention. Since the analysis largely relies on group theory, we provide all concepts required for proper understanding in Appx. C.

4.1. GROUP EQUIVARIANCE AND EQUIVARIANCE FOR FUNCTIONS DEFINED ON SETS

First we provide the general definition of group equivariance and next refine it to relevant groups. Additionally, we define the property of unique equivariance to restrict equivariance to a given group.

Definition 4.1 (Group equivariance). Let G be a group (Def. C.1), S, S′ be sets, V, V′ be vector spaces, and L_g[⋅], L′_g[⋅] be the induced (left regular) representations (Def. C.4) of G on L_V(S) and L_{V′}(S′), respectively. We say that a map ϕ : L_V(S) → L_{V′}(S′) is equivariant to the action of G, or G-equivariant, if it commutes with the action of G. That is, if:

ϕ[L_g[f]] = L′_g[ϕ[f]],   ∀f ∈ L_V(S), ∀g ∈ G.

Example 4.1.1 (Permutation equivariance). Let S = S′ = {i}_{i=1}^N be a set of N elements, and G = S_N be the group of permutations on sets of N elements. A map ϕ : L_V(S) → L_{V′}(S) is said to be equivariant to the action of S_N, or permutation equivariant, if:

ϕ[L_π[f]](i) = L′_π[ϕ[f]](i),   ∀f ∈ L_V(S), ∀π ∈ S_N, ∀i ∈ S,

where L_π[f](i) := f(π⁻¹(i)), and π : S → S is a bijection from the set to itself. The element π(i) indicates the index to which the i-th element of the set is moved as an effect of the permutation π. In other words, ϕ is said to be permutation equivariant if it commutes with permutations π ∈ S_N; that is, if permutations of its argument produce equivalent permutations of its response.

Several of the transformations of interest, e.g., rotations and translations, are not defined on sets. Luckily, as we consider sets gathered from homogeneous spaces X on which these transformations are well-defined, e.g., R² for pixels, there exists an injective map x : S → X, the position function, that associates a position in X to each set element. In Appx. D we show that the action of G on such a set is well-defined and induces a group representation on functions on it. With this in place, we are now able to define equivariance of set functions to groups whose actions are defined on homogeneous spaces.
Definition 4.2 (Equivariance of set functions to groups acting on homogeneous spaces). Let G be a group acting on two homogeneous spaces X and X′, let S, S′ be sets and V, V′ be vector spaces. Let x : S → X and x′ : S′ → X′ be injective maps. We say that a map ϕ : L_V(S) → L_{V′}(S′) is equivariant to the action of G, or G-equivariant, if it commutes with the action of G. That is, if:

ϕ[L_g[f]] = L′_g[ϕ[f]],   ∀f ∈ L_V(S), ∀g ∈ G,

where L_g[f](i) := f(x⁻¹(g⁻¹ x(i))) and L′_g[f](i) := f(x′⁻¹(g⁻¹ x′(i))) are the induced (left regular) representations of G on L_V(S) and L_{V′}(S′), respectively. In other words, ϕ is said to be G-equivariant if a transformation g ∈ G of its argument produces a corresponding transformation of its response.

Example 4.2.1 (Translation equivariance). Let S, S′ be sets and let x : S → X and x′ : S′ → X′ be injective maps from the sets S, S′ to the corresponding homogeneous spaces X, X′ on which they are defined, e.g., R^d and G. With (X, +) the translation group acting on X, we say that a map ϕ : L_V(S) → L_{V′}(S) is equivariant to the action of (X, +), or translation equivariant, if:

ϕ[L_y[f]](i) = L′_y[ϕ[f]](i),   ∀f ∈ L_V(S), ∀y ∈ X,

with L_y[f](i) := f(x⁻¹(x(i) − y)) and L′_y[f](i) := f(x′⁻¹(x′(i) − y)). In other words, ϕ is said to be translation equivariant if a translation of its argument produces a corresponding translation of its response.
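For pixels on a periodic 1D grid with x(i) = i, the induced representation L_y[f](i) = f(x⁻¹(x(i) − y)) is simply a circular shift, and the defining group properties can be checked numerically. This is a minimal sketch under the assumption of periodic boundary conditions, which the definition itself does not require:

```python
import numpy as np

def L(y, f):
    # induced representation of a translation y on a function sampled on a
    # periodic 1D grid of size N: L_y[f](i) = f((i - y) mod N)
    return np.roll(f, y)

f = np.arange(8.0)
# representation property: L_{y1} o L_{y2} = L_{y1 + y2}
assert np.array_equal(L(2, L(3, f)), L(5, f))
# identity element acts trivially, and L_{-y} inverts L_y
assert np.array_equal(L(0, f), f)
assert np.array_equal(L(-3, L(3, f)), f)
```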

4.2. EQUIVARIANCE PROPERTIES OF SELF-ATTENTION

In this section we analyze the equivariance properties of self-attention. The proofs of all propositions stated in the main text are provided in Appx. G.

Proposition 4.1. The global self-attention formulation without positional encoding (Eqs. 3, 8) is permutation equivariant. That is, it holds that: m[L_π[f]](i) = L_π[m[f]](i).

Note that permutation equivariance only holds for global self-attention. The local variant proposed in Eq. 9 reduces permutation equivariance to the smaller set of permutations under which neighborhoods are conserved, i.e., S̃_N = {π ∈ S_N : j ∈ N(i) ⇒ π(j) ∈ N(π(i)), ∀i ∈ S}. Permutation equivariance induces equivariance to important (sub)groups. Consider the cyclic group of order 4, Z₄ = {e, r, r², r³}, which induces planar rotations by 90°. As every rotation in Z₄ effectively induces a permutation of the token positions, it can be shown that Z₄ is a subgroup of S_N, i.e., S_N ≥ Z₄. Consequently, maps equivariant to permutations are automatically equivariant to Z₄. However, as the permutation equivariance constraint is stronger than that of Z₄-equivariance, imposing Z₄-equivariance by means of permutation equivariance is undesirable in terms of expressivity. Consequently, Ravanbakhsh et al. (2017) introduced the concept of unique G-equivariance to describe the family of functions equivariant to G but not equivariant to larger groups G′ ≥ G:

Definition 4.3 (Unique G-equivariance). Let G be a subgroup of G′, G ≤ G′ (Def. C.2). We say that a map ϕ is uniquely G-equivariant iff it is G-equivariant but not G′-equivariant for any G′ ≥ G.

In the following sections, we show that we can enforce unique equivariance not only to subgroups of S_N, e.g., Z₄, but also to other interesting groups not contained in S_N, e.g., groups of rotations finer than 90 degrees. This is achieved by enriching set functions with a proper positional encoding.

Proposition 4.2. Absolute position-aware self-attention (Eqs. 4, 10) is neither permutation nor translation equivariant, i.e., m[L_π[f], ρ](i) ≠ L_π[m[f, ρ]](i) and m[L_y[f], ρ](i) ≠ L_y[m[f, ρ]](i).

Though absolute positional encodings do disrupt permutation equivariance, they are unable to provide translation equivariance. We show next that translation equivariance is obtained via relative encodings.

Proposition 4.3. Relative position-aware self-attention (Eq. 11) is translation equivariant. That is, it holds that: m^r[L_y[f], ρ](i) = L_y[m^r[f, ρ]](i).
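Both Prop. 4.1 and Prop. 4.3 are easy to verify numerically. The sketch below (single head, global neighborhood, periodic 1D grid; all helper names are ours) checks that position-less self-attention commutes with permutations and that relative position-aware self-attention commutes with circular shifts:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sa(X, Wq, Wk, Wv, P=None):
    # position-less SA (Eq. 8) if P is None; relative SA (Eq. 11) otherwise,
    # with P[i, j] the encoding of the offset (j - i) mod N on a periodic grid
    N = X.shape[0]
    K = np.broadcast_to(X, (N, *X.shape)) if P is None else X[None, :, :] + P
    A = np.einsum('ic,ijc->ij', X @ Wq, K @ Wk)   # A_ij = <q_i, k_ij>
    return softmax(A) @ (X @ Wv)

rng = np.random.default_rng(3)
N, C = 6, 5
X = rng.normal(size=(N, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

# Prop. 4.1: permutation equivariance of position-less global self-attention
perm = rng.permutation(N)
assert np.allclose(sa(X, Wq, Wk, Wv)[perm], sa(X[perm], Wq, Wk, Wv))

# Prop. 4.3: translation equivariance of relative self-attention
table = rng.normal(size=(N, C))                   # one encoding per offset
idx = (np.arange(N)[None, :] - np.arange(N)[:, None]) % N
P = table[idx]                                    # P[i, j] = table[(j - i) mod N]
y = 2
Xs = np.roll(X, y, axis=0)                        # L_y[f]
assert np.allclose(sa(Xs, Wq, Wk, Wv, P), np.roll(sa(X, Wq, Wk, Wv, P), y, axis=0))
```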

4.3. WHERE EXACTLY IS EQUIVARIANCE IMPOSED IN SELF-ATTENTION?

In the previous section we saw two examples of successfully imposing group equivariance on self-attention. Specifically, we saw that no positional encoding allows for permutation equivariance and that a relative positional encoding allows for translation equivariance. For the latter, as shown in the proof of Prop. 4.3 (Appx. G), this comes from the fact that, for all shifts y ∈ X,

ρ(x⁻¹(x(i) + y), x⁻¹(x(j) + y)) = ρ_P(x(j) + y − (x(i) + y)) = ρ_P(x(j) − x(i)) = ρ(i, j).   (12)

That is, from the fact that the relative positional encoding is invariant to the action of the translation group: L_y[ρ](i, j) = ρ(i, j), ∀y ∈ X. Similarly, the absence of a positional encoding, more precisely, the use of a constant positional encoding, is what allows for permutation equivariance (Prop. 4.1, Appx. G). Specifically, constant positional encodings ρ_c(i) = c, ∀i ∈ S, are invariant to the action of the permutation group: L_π[ρ_c](i) = ρ_c(i), ∀π ∈ S_N. From these observations, we conclude that G-equivariance is obtained by providing positional encodings that are invariant to the action of the group G, i.e., such that L_g[ρ] = ρ, ∀g ∈ G. Furthermore, unique G-equivariance is obtained by providing positional encodings that are invariant to the action of G but not invariant to the action of any other group G′ ≥ G. This is the key insight that allows us to provide (unique) equivariance to arbitrary symmetry groups, which we present next.

5. GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION

In §4.3 we concluded that unique G-equivariance is induced in self-attention by introducing positional encodings that are invariant to the action of G but not invariant to the action of larger groups G′ ≥ G. However, this constraint does not provide any information about the expressivity of the mapping we have just made G-equivariant. Let us first illustrate why this is important. Consider the case of imposing rotation and translation equivariance on an encoding defined on R². Since translation equivariance is desired, a relative positional encoding is required. For rotation equivariance, we must further impose the positional encoding to be equal for all rotations: L_θ[ρ](i, j) = ρ(i, j), ∀θ ∈ [0, 2π], where L_θ[ρ](i, j) := ρ_P(θ⁻¹x(j) − θ⁻¹x(i)), and θ⁻¹ depicts a rotation by −θ. This constraint leads to an isotropic positional encoding unable to discriminate among orientations, which in turn enforces rotation invariance instead of rotation equivariance. This is alleviated by lifting the underlying function on R² to a space where rotations are explicitly encoded (Fig. B.1). To this end, one performs self-attention operations with positional encodings L_θ[ρ] of varying values θ and indexes the responses by the corresponding θ value. Next, as rotations are now explicitly encoded, a positional encoding can be defined on this space which is able to discriminate among rotations (Fig. B.2). This in turn allows for rotation equivariance instead of rotation invariance. It has been shown both theoretically (Ravanbakhsh, 2020) and empirically (Weiler & Cesa, 2019) that the most expressive class of G-equivariant functions is given by functions that follow the regular representation of G. In order to obtain feature representations that behave this way, we introduce a lifting self-attention layer (Fig. B.1, Eq. 14) that receives an input function on R^d and produces a feature representation on G.
Subsequently, arbitrarily many group self-attention layers (Fig. B.2, Eq. 16), interleaved with optional point-wise non-linearities, can be applied. At the end of the network, a feature representation on R^d can be obtained by pooling over H. In short, we provide a pure self-attention analogue of Cohen & Welling (2016). However, as the group acts directly on the positional encoding, our networks are steerable as well (Weiler et al., 2018b). This allows us to go beyond group discretizations that live in the grid without introducing interpolation artifacts (§5.1). Though theoretically sound, neural architectures using regular representations are unable to handle continuous groups directly in practice. This is a result of the summation over group elements in Eq. 15, which becomes an integral for continuous groups. Interestingly, using discrete groups does not seem to be detrimental in practice. Our experiments indicate that performance saturates for fine discrete approximations of the underlying continuous group (Tab. 2). In fact, Weiler & Cesa (2019, Tab. 3) show via extensive experiments that networks using regular representations and fine enough discrete approximations consistently outperform networks handling continuous groups via irreducible representations. We conjecture this is a result of the networks receiving discrete signals as input: as the action of several group elements falls within the same pixel, no further improvement can be obtained.

5.1. GROUP SELF-ATTENTION IS A STEERABLE OPERATION

Convolutional filters are commonly parameterized by weights on a discrete grid, which approximate, at the grid locations, the function implicitly described by the filter. Unfortunately, for groups whose action does not live in this grid, e.g., rotations by 45°, the filter must be interpolated. This is problematic, as these filters are typically small and the resulting interpolation artifacts can be severe (Fig. 2a). Steerable CNNs tackle this problem by parameterizing convolutional filters on a continuous basis on which the action of the group is well-defined, e.g., circular harmonics (Weiler et al., 2018b) or B-splines (Bekkers, 2020). Group self-attention is steerable by construction: as the positional encoding lives on a continuous space, it can be transformed at an arbitrary grade of precision without interpolation (Fig. 2b).
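A minimal illustration of why a continuous positional encoding is steerable: the relative offset is rotated exactly in R² before the encoding is evaluated, so no grid interpolation is ever needed. The Fourier-feature form of ρ_P below is our own assumption for illustration; the encoding actually used by the model may differ:

```python
import numpy as np

def rho_P(offset, freqs):
    # continuous encoding of a 2D relative offset; defined for ANY real-valued
    # offset, hence for offsets rotated by arbitrary angles
    proj = freqs @ offset
    return np.concatenate([np.sin(proj), np.cos(proj)])

def rot(theta):
    # exact rotation matrix acting on R^2
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(4)
freqs = rng.normal(size=(3, 2))
offset = np.array([1.0, 0.0])

# a 45-degree rotation leaves the pixel grid, yet the encoding is evaluated
# exactly at the rotated offset, with no interpolation step
enc = rho_P(rot(-np.pi / 4) @ offset, freqs)
```

A full turn acts trivially on the encoding, as expected of a well-defined group action.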

5.2. LIFTING AND GROUP SELF-ATTENTION

Lifting self-attention (Fig. B.1). Let G = R^d ⋊ H be an affine group (Def. C.3) acting on R^d. The lifting self-attention m^r_{G↑}[f, ρ] : L_V(R^d) → L_{V′}(G) is a map from functions on R^d to functions on G, obtained by modifying the relative positional encoding ρ(i, j) by the action of group elements h ∈ H: {L_h[ρ](i, j)}_{h∈H}, with L_h[ρ](i, j) = ρ_P(h⁻¹x(j) − h⁻¹x(i)). It corresponds to the concatenation of multiple self-attention operations (Eq. 11) indexed by h with varying positional encodings L_h[ρ]:

m^r_{G↑}[f, ρ](i, h) = m^r[f, L_h[ρ]](i)   (13)
= ϕ_out(⋃_{h∈[H]} Σ_{j∈N(i)} σ_j(⟨ϕ^(h)_qry(f(i)), ϕ^(h)_key(f(j) + L_h[ρ](i, j))⟩) ϕ^(h)_val(f(j))).   (14)

Proposition 5.1. Lifting self-attention is G-equivariant. That is, it holds that: m^r_{G↑}[L_g[f], ρ](i, h) = L_g[m^r_{G↑}[f, ρ]](i, h).

Group self-attention (Fig. B.2). Let G = R^d ⋊ H be an affine group acting on itself and let f(i, h) ∈ L_V(G), i ∈ S, h ∈ H, be a function defined on a set endowed with the structure of the group G, that is, enriched with a positional encoding ρ((i, h), (j, ĥ)) := ρ_P((x(j) − x(i), h⁻¹ĥ)), i, j ∈ S, h, ĥ ∈ H. The group self-attention m^r_G[f, ρ] : L_V(G) → L_{V′}(G) is a map from functions on G to functions on G, obtained by modifying the group positional encoding by the action of group elements h̃ ∈ H: {L_h̃[ρ]((i, h), (j, ĥ))}_{h̃∈H}, with L_h̃[ρ]((i, h), (j, ĥ)) = ρ_P(h̃⁻¹(x(j) − x(i)), h̃⁻¹(h⁻¹ĥ)). It corresponds to the concatenation of multiple self-attention operations (Eq. 11) indexed by h̃ with varying positional encodings L_h̃[ρ], followed by a summation over the output domain along h̄:

m^r_G[f, ρ](i, h̃) = Σ_{h̄∈H} m^r[f, L_h̃[ρ]](i, h̄)   (15)
= ϕ_out(⋃_{h∈[H]} Σ_{h̄∈H} Σ_{(j,ĥ)∈N(i,h̄)} σ_{j,ĥ}(⟨ϕ^(h)_qry(f(i, h̄)), ϕ^(h)_key(f(j, ĥ) + L_h̃[ρ]((i, h̄), (j, ĥ)))⟩) ϕ^(h)_val(f(j, ĥ))).   (16)

In contrast to vanilla and lifting self-attention, the group self-attention neighborhood N(i, h) is now defined on the group.
This allows distinguishing across group transformations, e.g., rotations.

Proposition 5.2. Group self-attention is G-equivariant. That is, it holds that: m^r_G[L_g[f], ρ](i, h) = L_g[m^r_G[f, ρ]](i, h).

Non-unimodular groups, i.e., groups that modify the volume of the objects they act upon, such as the dilation group, require special treatment. This treatment is provided in Appx. E.
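To make the lifting construction (Prop. 5.1) concrete, the following sketch implements a single-head lifting self-attention for H = C4 (rotations by 90°) on a 3x3 grid with global neighborhoods, and verifies equivariance numerically. The dictionary-based ρ_P and all names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

coords = [(px, py) for px in (-1, 0, 1) for py in (-1, 0, 1)]
rot90 = lambda p: (-p[1], p[0])                 # action of r on R^2

def r_pow(p, k):
    # apply r^k to a point (k may be negative; C4 has order 4)
    for _ in range(k % 4):
        p = rot90(p)
    return p

def lift_sa(f, rho_P, Wq, Wk, Wv):
    # Eqs. 13-14, single head: one self-attention map per h in C4, with the
    # relative positional encoding rotated by h^{-1}; output lives on the group.
    out = {}
    for p in coords:
        for h in range(4):
            q = f[p] @ Wq
            s = np.array([q @ ((f[pj] + rho_P[r_pow((pj[0] - p[0], pj[1] - p[1]), -h)]) @ Wk)
                          for pj in coords])
            a = np.exp(s - s.max())
            a /= a.sum()
            out[p, h] = sum(w * (f[pj] @ Wv) for w, pj in zip(a, coords))
    return out

rng = np.random.default_rng(5)
C = 4
f = {p: rng.normal(size=C) for p in coords}
offsets = {(pj[0] - pi[0], pj[1] - pi[1]) for pi in coords for pj in coords}
rho_P = {d: rng.normal(size=C) for d in offsets}
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

y = lift_sa(f, rho_P, Wq, Wk, Wv)
f_rot = {p: f[r_pow(p, -1)] for p in coords}    # L_r[f](p) = f(r^{-1} p)
y_rot = lift_sa(f_rot, rho_P, Wq, Wk, Wv)
# Prop. 5.1 on this discretization: rotating the input rotates the response
# on the group: m[L_r f](p, h) = m[f](r^{-1} p, (h - 1) mod 4)
assert all(np.allclose(y_rot[p, h], y[r_pow(p, -1), (h - 1) % 4])
           for p in coords for h in range(4))
```

Note that the check holds exactly (up to float error) because the positional encoding, not the weights, carries the group action.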

5.3. GROUP SELF-ATTENTION IS A GENERALIZATION OF THE GROUP CONVOLUTION

We have demonstrated that, in order to enforce G-equivariance, it is sufficient to define self-attention as a function on the group G and to ensure that L_g[ρ] = ρ, ∀g ∈ G. Interestingly, this observation is in line with the main statement of Kondor & Trivedi (2018) for (group) convolutions: "the group convolution on G is the only (unique) G-equivariant linear map". In fact, our finding can be formulated as a generalization of Kondor & Trivedi (2018)'s statement: "linear mappings on G whose positional encoding is G-invariant are G-equivariant". This statement is more general than that of Kondor & Trivedi (2018), as it holds for data structures on which (group) convolutions are not well-defined, e.g., sets, and it is equivalent to Kondor & Trivedi (2018)'s statement for structures on which (group) convolutions are well-defined. It is also congruent with results complementary to Kondor & Trivedi (2018) (Cohen et al., 2019a; Bekkers, 2020) as well as with several works on group equivariance handling set-like structures such as point-clouds (Thomas et al., 2018; Defferrard et al., 2020; Finzi et al., 2020; Fuchs et al., 2020) and symmetric sets (Maron et al., 2020). In addition, we can characterize the expressivity of group self-attention. It holds that (i) group self-attention generalizes the group convolution, and (ii) regular global group self-attention is an equivariant universal approximator. Statement (i) follows from the fact that any convolutional layer can be described as a multi-head self-attention layer provided enough heads (Cordonnier et al., 2020), yet self-attention often uses larger receptive fields. As a result, self-attention is able to describe a larger set of functions than convolutions, e.g., Fig.

6. EXPERIMENTS

We perform experiments on three image benchmark datasets for which particular forms of equivariance are desirable. We evaluate our approach by contrasting GSA-Nets equivariant to multiple symmetry groups. Additionally, we conduct a study on rotMNIST to evaluate the performance of GSA-Nets as a function of the neighborhood size. All our networks follow the structure shown in Fig. F.1 and vary only in the number of blocks and channels. We emphasize that both the architecture and the number of parameters of GSA-Nets are left unchanged regardless of the group used. Our results illustrate that GSA-Nets consistently outperform equivalent non-equivariant attention networks. We further compare GSA-Nets with convolutional architectures. Though our approach does not build upon these networks, this comparison provides a fair view of the gap still present between self-attention and convolutional architectures in vision tasks, which is also present in their group equivariant counterparts.

Efficient implementation of lifting and group self-attention. Our self-attention implementation takes advantage of the fact that the group action only affects the positional encoding P to reduce the total computational cost of the operation. Specifically, we calculate the self-attention scores w.r.t. the content X once, and reuse them for all transformed versions of the positional encoding {L_h[ρ]}_{h∈H}.

Model designation. We refer to translation equivariant self-attention models as Z2 SA. Reflection equivariant models receive the keyword M, e.g., Z2M SA, and rotation equivariant models the keyword Rn, where n depicts the angle discretization. For example, R8 SA depicts a model equivariant to rotations by 45 degrees. Specific model architectures are provided in Appx. F.

RotMNIST. The rotated MNIST dataset (Larochelle et al., 2007) is a classification dataset often used as a standard benchmark for rotation equivariance.
It consists of 62k gray-scale 28x28 uniformly rotated handwritten digits, divided into training, validation and test sets of 10k, 2k and 50k images. First, we study the effect of the neighborhood size on classification accuracy and convergence time. We train R4 SA networks for 300 epochs with vicinities NxN of varying size (Tab. 1, Fig. 3). Since GSA-Nets optimize where to attend, the complexity of the optimization problem grows as a function of N. Consequently, models with big vicinities are expected to converge more slowly. However, as the family of functions describable by big vicinities contains those describable by small ones, models with big vicinities are expected to be at least as good upon convergence. Our results show that models with small vicinities do converge much faster (Fig. 3). However, though some models with large vicinities do outperform models with small ones, e.g., 7x7 vs. 3x3, a clear trend is not apparent. We conjecture that 300 epochs are insufficient for all models to converge equally well. Unfortunately, due to computational constraints, we were not able to perform this experiment for a larger number of epochs. We consider an in-depth study of this behavior an important direction for future work. Next, we compare GSA-Nets equivariant to translation and rotation at different angle discretizations (Tab. 2). Based on the results of the previous study, we select a 5x5 neighborhood, as it provides the best trade-off between accuracy and convergence time. Our results show that finer discretizations lead to better accuracy, but accuracy saturates around R12. We conjecture that this is due to the discrete resolution of the images in the dataset, which causes finer angle discretizations to fall within the same pixel. CIFAR-10. The CIFAR-10 dataset (Krizhevsky et al., 2009) consists of 60k real-world 32x32 RGB images uniformly drawn from 10 classes, divided into training, validation and test sets of 40k, 10k and 10k images.
Since reflection is a symmetry that appears ubiquitously in natural images, we compare GSA-Nets equivariant to translation and reflection in this dataset (Tab. 2). Our results show that reflection equivariance indeed improves the classification performance of the model. PCam. The PatchCamelyon dataset (Veeling et al., 2018) consists of 327k 96x96 RGB image patches of tumorous/non-tumorous breast tissues extracted from Camelyon16 (Bejnordi et al., 2017) . Each patch is labeled as tumorous if the central region (32x32) contains at least one tumour pixel. As cells appear at arbitrary positions and poses, we compare GSA-Nets equivariant to translation, rotation and reflection (Tab. 2). Our results show that incorporating equivariance to reflection in addition to rotation, as well as providing finer group discretization, improve classification performance.
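The efficient implementation described above, where content attention scores are computed once and reused for every transformed positional encoding, can be sketched as follows. This is an illustrative NumPy sketch under the decomposition of attention logits into a content term and a positional term, not the actual implementation; all tensor and function names are made up:

```python
import numpy as np

def group_attention_logits(x, pos_encodings, W_qry, W_key):
    """Attention logits for all group-transformed positional encodings.

    x:             (N, C)  content features of the N set elements.
    pos_encodings: (G, N, N, C)  L_h[rho] for every group element h.

    The content term <q_i, k_j> is computed once and reused for all G
    transformed encodings; only the positional term depends on h.
    """
    q = x @ W_qry                        # (N, C)
    content = q @ (x @ W_key).T          # (N, N), computed once
    k_pos = pos_encodings @ W_key        # (G, N, N, C)
    positional = np.einsum('ic,gijc->gij', q, k_pos)
    return content[None] + positional    # (G, N, N)
```

If all G positional encodings were identical, the resulting logits would be identical across group elements, since the shared content term carries all input dependence.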

7. DISCUSSION AND FUTURE WORK

Though GSA-Nets perform competitively with G-CNNs on some tasks, G-CNNs still outperform our approach in general. We conjecture that this is due to the harder nature of the optimization problem in GSA-Nets and to the carefully crafted architecture designs, initializations and optimization procedures developed for CNNs over the years. Though our theoretical results indicate that GSA-Nets can be more expressive than G-CNNs (§5.3), further research in terms of design, optimization, stability and generalization is required. These are in fact open questions for self-attention in general (Xiong et al., 2020; Liu et al., 2020; Zhao et al., 2020) and developments in this direction are of utmost importance. The main drawback of our approach is the quadratic memory and time complexity typical of self-attention. This is an active area of research, e.g., Kitaev et al. (2020); Wang et al. (2020); Zaheer et al. (2020); Choromanski et al. (2020), and we believe that efficiency advances to vanilla self-attention can be seamlessly integrated in GSA-Nets. Our theoretical results indicate that GSA-Nets have the potential to become the standard solution for applications exhibiting symmetries, e.g., medical imagery. In addition, as self-attention is a set operation, GSA-Nets provide straightforward solutions for set-like data types, e.g., point-clouds, graphs, symmetric sets, which may benefit from additional geometrical information, e.g., Fuchs et al. (2020); Maron et al. (2020). Finally, we hope our theoretical insights serve as a support point to further explore and understand the construction of equivariant maps for graphs and sets, which often come equipped with spatial coordinates: a type of positional encoding.

C CONCEPTS FROM GROUP THEORY

Definition C.1 (Group). A group is an ordered pair (G, ⋅) where G is a set and ⋅ : G × G → G is a binary operation on G, such that (i) the set is closed under this operation, (ii) the operation is associative, i.e., (g₁ ⋅ g₂) ⋅ g₃ = g₁ ⋅ (g₂ ⋅ g₃), g₁, g₂, g₃ ∈ G, (iii) there exists an identity element e ∈ G s.t. ∀g ∈ G we have e ⋅ g = g ⋅ e = g, and (iv) for each g ∈ G, there exists an inverse g⁻¹ s.t. g ⋅ g⁻¹ = e.

Definition C.2 (Subgroup). Let (G, ⋅) be a group. A subset H of G is a subgroup of G if H is nonempty and closed under the group operation and inverses (i.e., h₁, h₂ ∈ H implies that h₁⁻¹ ∈ H and h₁ ⋅ h₂ ∈ H). If H is a subgroup of G, we write H ≤ G.

Definition C.3 (Semi-direct product and affine groups). In practice, one is mainly interested in the analysis of data defined on R^d and, consequently, in groups of the form G = R^d ⋊ H, resulting from the semi-direct product (⋊) between the translation group (R^d, +) and an arbitrary (Lie) group H that acts on R^d, e.g., rotation, scaling, mirroring, etc. This family of groups is referred to as affine groups, and their group product is defined as: g₁ ⋅ g₂ = (x₁, h₁) ⋅ (x₂, h₂) = (x₁ + h₁ ⊙ x₂, h₁ ⋅ h₂), (17) with g₁ = (x₁, h₁), g₂ = (x₂, h₂) ∈ G, x₁, x₂ ∈ R^d and h₁, h₂ ∈ H. The operator ⊙ denotes the action of h ∈ H on x ∈ R^d, and it describes how a vector x ∈ R^d is modified by elements h ∈ H.

Definition C.4 (Group representation). Let G be a group and L₂(X) be a space of functions defined on some vector space X. The (left) regular group representation of G is a linear transformation L : G × L₂(X) → L₂(X), (g, f) ↦ L_g[f] := f(g⁻¹ ⊙ x), that shares the group structure via: L_{g₁} L_{g₂}[f] = L_{g₁⋅g₂}[f] (18) for any g₁, g₂ ∈ G, f ∈ L₂(X). That is, concatenating two such transformations, parameterized by g₁ and g₂, is equivalent to a single transformation parameterized by g₁ ⋅ g₂ ∈ G.
If the group G is affine, the group representation L g can be split as: L g [f ] = L x L h [f ], with g = (x, h) ∈ G, x ∈ R d and h ∈ H. Intuitively, the representation of G on a function f describes how the function as a whole, i.e., f (x), ∀x ∈ X, is transformed by the effect of group elements g ∈ G.
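To make Definitions C.1–C.3 concrete, the following sketch implements the affine group product and inverse for the discrete roto-translation group p4 = Z² ⋊ C4 (translations plus rotations by multiples of 90°) and lets the group axioms be checked directly. The element encoding as a pair (x, r) is an illustrative choice:

```python
import numpy as np

# Elements of p4 = Z^2 ⋊ C4 encoded as (x, r): translation x in Z^2 and a
# rotation index r, where r acts on x via the 90°-rotation matrix R^r.
def rot(r):
    R = np.array([[0, -1], [1, 0]])
    return np.linalg.matrix_power(R, r % 4)

def product(g1, g2):
    """Semi-direct product: (x1, h1)·(x2, h2) = (x1 + h1 ⊙ x2, h1·h2)."""
    (x1, r1), (x2, r2) = g1, g2
    return (tuple(np.array(x1) + rot(r1) @ np.array(x2)), (r1 + r2) % 4)

def inverse(g):
    """g^{-1} = (-(h^{-1} ⊙ x), h^{-1}), so that g · g^{-1} = e."""
    x, r = g
    return (tuple(-(rot(-r) @ np.array(x))), (-r) % 4)
```

Associativity, the identity e = ((0, 0), 0) and inverses can all be verified numerically with a few sample elements.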

D ACTIONS AND REPRESENTATIONS OF GROUPS ACTING ON HOMOGENEOUS SPACES FOR FUNCTIONS DEFINED ON SETS

In this section we show that the action of a group G acting on a homogeneous space X is well-defined on sets S gathered from X, and that it induces a group representation on functions defined on S. Let S = {i} be a set and X be a homogeneous space on which the action of G is well-defined, i.e., gx ∈ X, ∀g ∈ G, ∀x ∈ X. Since S has been gathered from X, there exists an injective map x : S → X that maps set elements i ∈ S to unique elements x_i ∈ X. That is, there exists a map x : i ↦ x_i that assigns a unique value x_i ∈ X to each i ∈ S, corresponding to the coordinates from which the set element has been gathered. Since the action of G is well-defined on X, it follows that the left regular representation (Def. C.4) L_g[f_X](x_i) := f_X(g⁻¹x_i) of functions f_X ∈ L_Y(X) exists and is well-defined. Since x is injective, the left regular representation can be expressed uniquely in terms of set indices as L_g[f_X](x_i) = f_X(g⁻¹x_i) = f_X(g⁻¹x(i)). Furthermore, the inverse x⁻¹ : X → S, x⁻¹ : x_i ↦ i, also exists and is well-defined. As a consequence, points x_i ∈ X can be expressed uniquely in terms of set indices as i = x⁻¹(x_i), i ∈ S. Consequently, functions f_X ∈ L_Y(X) can be expressed in terms of functions f ∈ L_Y(S) by means of the equality f_X(x_i) = f(x⁻¹(x_i)) = f(i). As a result, we see that the group representation can be described in terms of functions f ∈ L_Y(S) as: L_g[f_X](x_i) = f_X(g⁻¹x_i) = f_X(g⁻¹x(i)) = f(x⁻¹(g⁻¹x(i))) = L_g[f](i), with a corresponding group representation on L_Y(S) given by L_g[f](i) = f(x⁻¹(g⁻¹x(i))), and an action of group elements g ∈ G on set elements i given by gi := x⁻¹(gx(i)).
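The induced representation L_g[f](i) = f(x⁻¹(g⁻¹x(i))) can be illustrated on a 3×3 image viewed as a set of 9 indexed pixels. The coordinate map and the choice of a 90° rotation about the grid center are illustrative:

```python
# A 3x3 image seen as a set S = {0, ..., 8} with a coordinate map x: S -> Z^2.
coords = [(i, j) for i in range(3) for j in range(3)]    # x(i)
index_of = {c: s for s, c in enumerate(coords)}          # x^{-1}

def rot90(c):
    """Action of a 90° rotation about the grid center on coordinates."""
    i, j = c
    return (j, 2 - i)

def L_g(f):
    """Induced representation: L_g[f](i) = f(x^{-1}(g^{-1} x(i))).

    Here g is the 90° rotation; g^{-1} is obtained by applying rot90 thrice.
    """
    def g_inv(c):
        for _ in range(3):
            c = rot90(c)
        return c
    return [f[index_of[g_inv(c)]] for c in coords]
```

Applying L_g four times returns the original function on the set, mirroring the group structure of Eq. 18 for the cyclic group of 90° rotations.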

E THE CASE OF NON-UNIMODULAR GROUPS: SELF-ATTENTION ON THE DILATION-TRANSLATION GROUP

The lifting and group self-attention formulations provided in §5.2 are only valid for unimodular groups, that is, for groups whose action does not change the volume of the objects they act upon, e.g., rotation, mirroring, etc. Non-unimodular groups, however, do modify the volume of the acted objects (Bekkers, 2020). The most relevant non-unimodular group for this work is the dilation group H = (R>0, ×). To illustrate why this distinction is important, consider the following example: imagine we have a circle in R² of area πr². If we rotate, mirror or translate the circle, its size is kept constant. If we increase its radius by a factor h ∈ R>0, however, its size increases by a factor h². Imagine that we have an application for which we would like to recognize this circle regardless of any of these transformations by means of self-attention. For this purpose, we define a neighborhood N in which the original circle fits perfectly. Since the size of the circle is not modified by any translated, rotated or mirrored version of it, we would still be able to detect the circle regardless of these transformations. If we scale the circle by a factor of h > 1, however, the circle would fall outside of our neighborhood N and, hence, we would not be able to recognize it. A solution to this problem is to scale our neighborhood N proportionally. That is, if the circle is scaled by a factor h ∈ R>0, we scale our neighborhood by the same factor h: N → hN. As a result, the circle falls within the neighborhood for any scale factor h ∈ R>0. Unfortunately, there is a problem: self-attention utilizes summations over its neighborhood. Since ∑_{i∈hN} i > ∑_{i∈N} i for h > 1, and ∑_{i∈hN} i < ∑_{i∈N} i for h < 1, the result of the summations would still differ across scales. Specifically, this result would always be bigger for larger versions of the neighborhood.
This is problematic, as the response produced by the same circle would still differ across scales. In order to handle this problem, one utilizes a normalization factor proportional to the change of size of the neighborhood considered, which ensures that the responses are equivalent for any scale h ∈ R>0. That is, one normalizes all summations proportionally to the size of the neighborhood. As a result, we obtain that ∑_{i∈h₁N} h₁⁻² i = ∑_{i∈h₂N} h₂⁻² i, ∀h₁, h₂ ∈ R>0.⁶ In the example above we have provided an intuitive description of the (left-invariant) Haar measure dµ(h). As its name indicates, it is a measure defined on the group which is invariant over all group elements h ∈ H. For several unimodular groups, the Haar measure corresponds to the Lebesgue measure, as the volume of the objects the group acts upon is kept equal, i.e., dµ(h) = dh.⁷ For non-unimodular groups, however, the Haar measure requires a normalization factor proportional to the change of volume of these objects. Specifically, the Haar measure corresponds to the Lebesgue measure times a normalization factor 1/h^d, where d is the dimensionality of the space R^d the group acts upon (Bekkers, 2020; Romero et al., 2020b), i.e., dµ(h) = (1/h^d) dh. In conclusion, in order to obtain group equivariance to non-unimodular groups, the lifting and group self-attention formulations provided in Eqs. 14, 16 must be modified via normalization factors proportional to the group elements h̃ ∈ H. Specifically, they are redefined as:

m^r_{G↑}[f, ρ](i, h̃) = ϕ_out ⋃_{h∈[H]} ∑_{j∈h̃N(i)} (1/h̃^d) σ_j⟨ϕ^(h)_qry(f(i)), ϕ^(h)_key(f(j) + L_h̃[ρ](i, j))⟩ ϕ^(h)_val(f(j)) (20)

m^r_G[f, ρ](i, h̃) = ϕ_out ⋃_{h∈[H]} ∑_{h̄∈H} ∑_{(j,ĥ)∈h̃N(i,h̄)} (1/h̃^{d+1}) σ_{j,ĥ}⟨ϕ^(h)_qry(f(i, h̃)), ϕ^(h)_key(f(j, ĥ) + L_h̃[ρ]((i, h̄), (j, ĥ)))⟩ ϕ^(h)_val(f(j, ĥ)). (21)

The factor d + 1 in Eq. 21 results from the fact that the summation is now performed on the group G = R^d ⋊ H, a space of dimensionality d + 1.
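The role of the Haar-measure normalization can be checked numerically: with the 1/h² factor (d = 2), the response accumulated over a scaled neighborhood hN is independent of h. The bump function and grid resolution below are arbitrary choices for illustration:

```python
import numpy as np

def normalized_sum(h, radius=1.0, n=200):
    """Riemann-sum approximation of (1/h^2) ∫_{hN} f(x/h) dx over the scaled
    neighborhood hN = [-h*radius, h*radius]^2, with f an arbitrary bump.

    By the substitution u = x/h this equals ∫_N f(u) du for every h > 0,
    which is exactly the scale invariance the 1/h^d factor provides.
    """
    side = np.linspace(-h * radius, h * radius, n)
    dx = side[1] - side[0]
    X, Y = np.meshgrid(side, side)
    f = np.exp(-((X / h) ** 2 + (Y / h) ** 2))   # f evaluated at x/h
    return (f * dx * dx).sum() / h ** 2
```

Without the 1/h² factor, the same computation would grow quadratically with the scale h, which is precisely the discrepancy described in the circle example above.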
An interesting case emerges when global neighborhoods are considered, i.e., s.t. N(i) = S, ∀i ∈ S. Since hN(i) = N(i) = S for any h > 1, approximation artifacts are introduced. It is not clear if it is better to introduce normalization factors in these situations or not. An in-depth investigation of this phenomenon is left for future research.

E.1 CURRENT EMPIRICAL ASPECTS OF SCALE EQUIVARIANT SELF-ATTENTION

Self-attention suffers from quadratic memory and time complexity proportional to the size of the neighborhood considered. This constraint is particularly important for the dilation group, for which these neighborhoods grow as a result of the group action. We envisage two possible solutions to this limitation, which we leave for future research. The most promising solution is to incorporate recent advances in efficient self-attention into group self-attention, e.g., Kitaev et al. (2020); Wang et al. (2020); Zaheer et al. (2020); Katharopoulos et al. (2020); Choromanski et al. (2020). By reducing the quadratic complexity of self-attention, the current computational constraints of scale equivariant self-attention can be (strongly) reduced. Importantly, the resulting architectures would be comparable to Bekkers (2020); Sosnovik et al. (2020); Romero et al. (2020b) in terms of their functionality and the group discretizations they can manage. The second option is to draw a self-attention analogue of Worrall & Welling (2019), where scale equivariance is implemented via dilated convolutions. One might construct an analogue of dilated convolutions via "sparse" dilations of the self-attention neighborhood. As a result, scale equivariance can be implemented while retaining an equal computational cost for all group elements. Importantly, however, this strategy is viable for a dyadic set of scales only, i.e., a set of scales given by {2^j}_{j=0}^{j_max}, and is thus less general than the scale-equivariant architectures listed before.
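The second option can be sketched as follows. The helper below is hypothetical and only illustrates how a 3×3 attention neighborhood could be "sparsely dilated" at dyadic scales while keeping 9 attended positions per query, analogous to dilated convolutions:

```python
# The default 3x3 offset pattern around a query position (an assumption for
# illustration; any base pattern could be used).
BASE = tuple((di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1))

def dilated_neighborhood(j, base=BASE):
    """Sparse neighborhood for the dyadic scale 2**j: the base offsets are
    stretched by 2**j while the number of attended positions stays fixed,
    so the per-query cost is constant across scales."""
    d = 2 ** j
    return [(d * di, d * dj) for (di, dj) in base]
```

Every scale attends to exactly 9 positions; only their spacing changes, which is why this construction keeps computational cost equal across group elements but is restricted to dyadic scales.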

F EXPERIMENTAL DETAILS

In this section we provide extended details of our implementation as well as the exact architectures and optimization schemes used in our experiments. All our models follow the structure shown in Fig. F.1 and vary only in the number of blocks and channels. All self-attention operations utilize 9 heads. We utilize PyTorch for our implementation. Any missing specification can be safely considered to be the PyTorch default value. Our code is publicly available at https://github.com/dwromero/g_selfatt.

F.2 CIFAR-10

For CIFAR-10 we use a group self-attention network composed of 6 attention blocks with 96 channels for the first two blocks and 192 channels for the rest. The attention dropout rate and value dropout rate are both set to 0.1. We use dropout on the input with a rate of 0.3 and additional dropout blocks of rate 0.2 followed by spatial max-pooling after the second and fourth blocks. We did not use automatic mixed precision training for this dataset as it made all models diverge. We perform training for 350 epochs and utilize stochastic gradient descent with a momentum of 0.9 and a cosine learning rate scheduler with base learning rate 0.01 (Loshchilov & Hutter, 2016). We utilize a batch size of 24, weight decay of 0.0001 and He's initialization.
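The cosine learning rate schedule used above (Loshchilov & Hutter, 2016, here without restarts and with a minimum learning rate of 0, which is an assumption) can be written as a small helper. This is a sketch of the schedule itself, not of our training loop:

```python
import math

def cosine_lr(epoch, base_lr=0.01, total_epochs=350):
    """Cosine learning-rate schedule: decays from base_lr at epoch 0
    to 0 at `total_epochs`, following 0.5 * (1 + cos(pi * t / T))."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

In PyTorch, the equivalent behavior is obtained with `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around the SGD optimizer described above.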

F.3 PATCHCAMELYON

For PatchCamelyon we use a group self-attention network composed of 5 attention blocks with 12 channels for the first block, 24 channels for the second block, 48 channels for the third and fourth blocks and 96 channels for the last block. The attention dropout rate and value dropout rate are both set to 0.1. We use an additional max-pooling block after the lifting block to reduce memory requirements. We did not use automatic mixed precision training for this dataset as it made all models diverge. We perform training for 100 epochs, utilize stochastic gradient descent with a momentum of 0.9 and a cosine learning rate scheduler with base learning rate 0.01 (Loshchilov & Hutter, 2016). We utilize a batch size of 8, weight decay of 0.0001 and He's initialization.

G PROOFS

Proof of Proposition 4.1. If the self-attention formulation provided in Eqs. 3, 8 is permutation equivariant, then it must hold that m[L_π[f]](i) = L_π[m[f]](i). Consider a permuted input signal L_π[f](i) = f(π⁻¹(i)). The self-attention operation on L_π[f] is given by:

m[L_π[f]](i) = ϕ_out ⋃_{h∈[H]} ∑_{j∈S} σ_j⟨ϕ^(h)_qry(L_π[f](i)), ϕ^(h)_key(L_π[f](j))⟩ ϕ^(h)_val(L_π[f](j))
= ϕ_out ⋃_{h∈[H]} ∑_{j∈S} σ_j⟨ϕ^(h)_qry(f(π⁻¹(i))), ϕ^(h)_key(f(π⁻¹(j)))⟩ ϕ^(h)_val(f(π⁻¹(j)))
= ϕ_out ⋃_{h∈[H]} ∑_{π(j̄)∈S} σ_{π(j̄)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{j̄∈S} σ_{j̄}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄))⟩ ϕ^(h)_val(f(j̄))
= m[f](ī) = m[f](π⁻¹(i)) = L_π[m[f]](i)

Here we have used the substitutions ī = π⁻¹(i) and j̄ = π⁻¹(j). Since the summation is defined over the entire set, we have that ∑_{π(j̄)∈S}[⋅] = ∑_{j̄∈S}[⋅]. Conclusively, we see that m[L_π[f]](i) = L_π[m[f]](i). Hence, permutation equivariance indeed holds.

Proof of Claim 4.2. Permutation equivariance. If the self-attention formulation provided in Eq. 10 is permutation equivariant, then it must hold that m[L_π[f], ρ](i) = L_π[m[f, ρ]](i). Consider a permuted input signal L_π[f](i) = f(π⁻¹(i)). The self-attention operation on L_π[f] is given by:

m[L_π[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(L_π[f](i) + ρ(i)), ϕ^(h)_key(L_π[f](j) + ρ(j))⟩ ϕ^(h)_val(L_π[f](j))

As discussed in §3.2, since there exist permutations in S_N able to send elements j in N(i) to elements outside of N(i), it is trivial to show that Eq. 10 is not equivariant to S_N. Consequently, in order to provide a more interesting analysis, we consider global attention here, i.e., cases where N(i) = S. As shown for Proposition 4.1, this self-attention instantiation is permutation equivariant. Consequently, by considering this particular case, we are able to explicitly analyze the effect of introducing absolute positional encodings into the self-attention formulation.
We then have that:

m[L_π[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j∈S} σ_j⟨ϕ^(h)_qry(f(π⁻¹(i)) + ρ(i)), ϕ^(h)_key(f(π⁻¹(j)) + ρ(j))⟩ ϕ^(h)_val(f(π⁻¹(j)))
= ϕ_out ⋃_{h∈[H]} ∑_{π(j̄)∈S} σ_{π(j̄)}⟨ϕ^(h)_qry(f(ī) + ρ(π(ī))), ϕ^(h)_key(f(j̄) + ρ(π(j̄)))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{j̄∈S} σ_{j̄}⟨ϕ^(h)_qry(f(ī) + ρ(π(ī))), ϕ^(h)_key(f(j̄) + ρ(π(j̄)))⟩ ϕ^(h)_val(f(j̄))

Here we have used the substitutions ī = π⁻¹(i) and j̄ = π⁻¹(j). Since the summation is defined over the entire set, we have that ∑_{π(j̄)∈S}[⋅] = ∑_{j̄∈S}[⋅]. Since ρ(π(ī)) ≠ ρ(ī) and ρ(π(j̄)) ≠ ρ(j̄), we are unable to reduce the expression further towards the form of m[f, ρ](ī). Consequently, we conclude that absolute position-aware self-attention is not permutation equivariant.

Translation equivariance. If the self-attention formulation provided in Eq. 10 is translation equivariant, then it must hold that m[L_y[f], ρ](i) = L_y[m[f, ρ]](i). Consider a translated input signal L_y[f](i) = f(x⁻¹(x(i) − y)). The self-attention operation on L_y[f] is given by:

m[L_y[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(L_y[f](i) + ρ(i)), ϕ^(h)_key(L_y[f](j) + ρ(j))⟩ ϕ^(h)_val(L_y[f](j))
= ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(f(x⁻¹(x(i) − y)) + ρ(i)), ϕ^(h)_key(f(x⁻¹(x(j) − y)) + ρ(j))⟩ ϕ^(h)_val(f(x⁻¹(x(j) − y)))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī) + ρ(x⁻¹(x(ī) + y))), ϕ^(h)_key(f(j̄) + ρ(x⁻¹(x(j̄) + y)))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī) + ρ_P(x(ī) + y)), ϕ^(h)_key(f(j̄) + ρ_P(x(j̄) + y))⟩ ϕ^(h)_val(f(j̄))

Here, we have used the substitutions ī = x⁻¹(x(i) − y) ⇒ i = x⁻¹(x(ī) + y) and j̄ = x⁻¹(x(j) − y) ⇒ j = x⁻¹(x(j̄) + y). Since the area of summation remains equal for any translation y ∈ R^d, we have: ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))}[⋅] = ∑_{x⁻¹(x(j̄))∈N(x⁻¹(x(ī)))}[⋅] = ∑_{j̄∈N(ī)}[⋅].
Hence, we can further reduce the expression above as:

m[L_y[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j̄∈N(ī)} σ_{j̄}⟨ϕ^(h)_qry(f(ī) + ρ_P(x(ī) + y)), ϕ^(h)_key(f(j̄) + ρ_P(x(j̄) + y))⟩ ϕ^(h)_val(f(j̄))

Since ρ_P(x(ī) + y) ≠ ρ_P(x(ī)) and ρ_P(x(j̄) + y) ≠ ρ_P(x(j̄)), we are unable to reduce the expression further towards the form of m[f, ρ](ī). Consequently, we conclude that the absolute positional encoding does not allow for translation equivariance either.

Proof of Claim 4.3. If the self-attention formulation provided in Eq. 11 is translation equivariant, then it must hold that m^r[L_y[f], ρ](i) = L_y[m^r[f, ρ]](i). Consider a translated input signal L_y[f](i) = f(x⁻¹(x(i) − y)). The self-attention operation on L_y[f] is given by:

m^r[L_y[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(L_y[f](i)), ϕ^(h)_key(L_y[f](j) + ρ(i, j))⟩ ϕ^(h)_val(L_y[f](j))
= ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(f(x⁻¹(x(i) − y))), ϕ^(h)_key(f(x⁻¹(x(j) − y)) + ρ(i, j))⟩ ϕ^(h)_val(f(x⁻¹(x(j) − y)))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ(x⁻¹(x(ī) + y), x⁻¹(x(j̄) + y)))⟩ ϕ^(h)_val(f(j̄))

Here, we have used the substitutions ī = x⁻¹(x(i) − y) ⇒ i = x⁻¹(x(ī) + y) and j̄ = x⁻¹(x(j) − y) ⇒ j = x⁻¹(x(j̄) + y). By using the definition of ρ(i, j), we can further reduce the expression above as:

= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ_P(x(j̄) + y − (x(ī) + y)))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ_P(x(j̄) − x(ī)))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))} σ_{x⁻¹(x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ(ī, j̄))⟩ ϕ^(h)_val(f(j̄))

Since the area of summation remains equal for any translation y ∈ R^d, we have that: ∑_{x⁻¹(x(j̄)+y)∈N(x⁻¹(x(ī)+y))}[⋅] = ∑_{x⁻¹(x(j̄))∈N(x⁻¹(x(ī)))}[⋅] = ∑_{j̄∈N(ī)}[⋅].
Resultantly, we can further reduce the expression above as:

m^r[L_y[f], ρ](i) = ϕ_out ⋃_{h∈[H]} ∑_{j̄∈N(ī)} σ_{j̄}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ(ī, j̄))⟩ ϕ^(h)_val(f(j̄)) = m^r[f, ρ](ī) = m^r[f, ρ](x⁻¹(x(i) − y)) = L_y[m^r[f, ρ]](i)

We see that indeed m^r[L_y[f], ρ](i) = L_y[m^r[f, ρ]](i). Consequently, we conclude that the relative positional encoding allows for translation equivariance. We emphasize that this is a consequence of the fact that ρ(x⁻¹(x(ī) + y), x⁻¹(x(j̄) + y)) = ρ(ī, j̄), ∀y ∈ R^d. In other words, it comes from the fact that the relative positional encoding is invariant to the action of the translation group.

Proof of Claim 5.1. If the lifting self-attention formulation provided in Eq. 14 is G-equivariant, then it must hold that m^r_{G↑}[L_g[f], ρ](i, h) = L_g[m^r_{G↑}[f, ρ]](i, h). Consider a g-transformed input signal L_g[f](i) = L_y L_h̄[f](i) = f(x⁻¹(h̄⁻¹(x(i) − y))), g = (y, h̄), y ∈ R^d, h̄ ∈ H. The lifting group self-attention operation on L_g[f] is given by:

m^r_{G↑}[L_y L_h̄[f], ρ](i, h) = ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(L_y L_h̄[f](i)), ϕ^(h)_key(L_y L_h̄[f](j) + L_h[ρ](i, j))⟩ ϕ^(h)_val(L_y L_h̄[f](j))
= ϕ_out ⋃_{h∈[H]} ∑_{j∈N(i)} σ_j⟨ϕ^(h)_qry(f(x⁻¹(h̄⁻¹(x(i) − y)))), ϕ^(h)_key(f(x⁻¹(h̄⁻¹(x(j) − y))) + L_h[ρ](i, j))⟩ ϕ^(h)_val(f(x⁻¹(h̄⁻¹(x(j) − y))))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(h̄x(j̄)+y)∈N(x⁻¹(h̄x(ī)+y))} σ_{x⁻¹(h̄x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + L_h[ρ](x⁻¹(h̄x(ī) + y), x⁻¹(h̄x(j̄) + y)))⟩ ϕ^(h)_val(f(j̄))

Here we have used the substitutions ī = x⁻¹(h̄⁻¹(x(i) − y)) ⇒ i = x⁻¹(h̄x(ī) + y) and j̄ = x⁻¹(h̄⁻¹(x(j) − y)) ⇒ j = x⁻¹(h̄x(j̄) + y).
By using the definition of ρ(i, j), we can further reduce the expression above as:

= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(h̄x(j̄)+y)∈N(x⁻¹(h̄x(ī)+y))} σ_{x⁻¹(h̄x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ_P(h⁻¹((h̄x(j̄) + y) − (h̄x(ī) + y))))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(h̄x(j̄)+y)∈N(x⁻¹(h̄x(ī)+y))} σ_{x⁻¹(h̄x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + ρ_P(h⁻¹h̄(x(j̄) − x(ī))))⟩ ϕ^(h)_val(f(j̄))
= ϕ_out ⋃_{h∈[H]} ∑_{x⁻¹(h̄x(j̄)+y)∈N(x⁻¹(h̄x(ī)+y))} σ_{x⁻¹(h̄x(j̄)+y)}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + L_{h̄⁻¹h}[ρ](ī, j̄))⟩ ϕ^(h)_val(f(j̄))

Since, for unimodular groups, the area of summation remains equal for any g ∈ G, we have that: ∑_{x⁻¹(h̄x(j̄)+y)∈N(x⁻¹(h̄x(ī)+y))}[⋅] = ∑_{x⁻¹(h̄x(j̄))∈N(x⁻¹(h̄x(ī)))}[⋅] = ∑_{x⁻¹(x(j̄))∈N(x⁻¹(x(ī)))}[⋅] = ∑_{j̄∈N(ī)}[⋅]. Resultantly, we can further reduce the expression above as:

m^r_{G↑}[L_y L_h̄[f], ρ](i, h) = ϕ_out ⋃_{h∈[H]} ∑_{j̄∈N(ī)} σ_{j̄}⟨ϕ^(h)_qry(f(ī)), ϕ^(h)_key(f(j̄) + L_{h̄⁻¹h}[ρ](ī, j̄))⟩ ϕ^(h)_val(f(j̄)) = m^r_{G↑}[f, ρ](ī, h̄⁻¹h) = m^r_{G↑}[f, ρ](x⁻¹(h̄⁻¹(x(i) − y)), h̄⁻¹h) = L_y L_h̄[m^r_{G↑}[f, ρ]](i, h).

We see indeed that m^r_{G↑}[L_y L_h̄[f], ρ](i, h) = L_y L_h̄[m^r_{G↑}[f, ρ]](i, h). Consequently, we conclude that the lifting group self-attention operation is group equivariant. We emphasize once more that this is a consequence of the fact that L_g[ρ](i, j) = ρ(i, j), ∀g ∈ G. In other words, it comes from the fact that the positional encoding used is invariant to the action of elements g ∈ G.

Proof of Claim 5.2. If the group self-attention formulation provided in Eq. 16 is G-equivariant, then it must hold that m^r_G[L_g[f], ρ](i, h) = L_g[m^r_G[f, ρ]](i, h). Consider a g-transformed input signal L_g[f](i, h) = L_y L_h̃[f](i, h) = f(x⁻¹(h̃⁻¹(x(i) − y)), h̃⁻¹h), g = (y, h̃), y ∈ R^d, h̃ ∈ H.
The group self-attention operation on L g [f ] is given by: m r G L y Lh[f ], ρ (i, h) = ϕ out ⋃ h∈[H] h∈H (j, ĥ)∈N(i, h) σ j, ĥ ⟨ϕ (h) qry (L y Lh[f ](i, h)), ϕ key (L y Lh[f ](j, ĥ) + L h [ρ]((i, h), (j, ĥ))⟩ ϕ (h) val (L y Lh[f ](j, ĥ)) = ϕ out ⋃ h∈[H] h∈H (j, ĥ)∈N(i, h) σ j, ĥ ⟨ϕ (h) qry (f (x -1 ( h-1 (x(i)y)), h-1 h)), ϕ key (f (x -1 ( h-1 (x(j)y)), h-1 ĥ) + L h [ρ]((i, h), (j, ĥ))⟩ ϕ (h) val (f (x -1 ( h-1 (x(j)y)), h-1 ĥ)) = ϕ out ⋃ h∈[H] hh′ ∈H (x -1 ( hx( j)+y), hĥ′ )∈N(x -1 ( hx( ī)+y), hh′ ) σ x -1 ( hx( j)+y), hĥ′ ⟨ϕ (h) qry (f ( ī, h′ )), ϕ (h) key (f ( j, ĥ′ ) + L h [ρ]((x -1 ( hx( ī) + y), h h′ ), (x -1 ( hx( j) + y), h ĥ′ ))⟩ ϕ (h) val (f ( j, ĥ′ )) Here we have used the substitutions ī = x -1 ( h-1 (x(i)y)) ⇒ i = x -1 ( hx( ī) + y)), h′ = h-1 h, and j = x -1 ( h-1 (x(j)y)) ⇒ i = x -1 ( hx( ī) + y)), ĥ′ = h-1 ĥ. By using the definition of ρ((i, h), (j, ĥ)) we can further reduce the expression above as: = ϕ out ⋃ h∈[H] hh′ ∈H (x -1 ( hx( j)+y), hĥ′ )∈N(x -1 ( hx( ī)+y), hh′ ) σ x -1 ( hx( j)+y), hĥ′ ⟨ϕ (h) qry (f ( ī, h′ )), ϕ key (f ( j, ĥ′ ) + ρ P (h -1 ( hx( j) + y -( hx( ī) + y)), h -1 h h′-1 ĥ′ ))⟩ ϕ (h) val (f ( j, ĥ′ )) = ϕ out ⋃ h∈[H] hh′ ∈H (x -1 ( hx( j)+y), hĥ′ )∈N(x -1 ( hx( ī)+y), hh′ ) σ x -1 ( hx( j)+y), hĥ′ ⟨ϕ (h) qry (f ( ī, h′ )), ϕ key (f ( j, ĥ′ ) + ρ P (h -1 h(x( j)x( ī), h′-1 ĥ′ )))⟩ ϕ (h) val (f ( j, ĥ′ )) = ϕ out ⋃ h∈[H] hh′ ∈H (x -1 ( hx( j)+y), hĥ′ )∈N(x -1 ( hx( ī)+y), hh′ ) σ x -1 ( hx( j)+y), hĥ′ ⟨ϕ (h) qry (f ( ī, h′ )), ϕ key (f ( j, ĥ′ ) + Lh-1 h [ρ](( ī, h′ ), ( j, ĥ′ )))⟩ ϕ (h) val (f ( j, ĥ′ )) Furthermore, since for unimodular groups the area of summation remains equal for any transformation g ∈ G, we have that: (x -1 ( hx( j)+y), hĥ′ )∈N(x -1 ( hx( ī)+y), hh′ ) [⋅] = (x -1 ( hx( j)), hĥ′ )∈N(x -1 ( hx( ī)), hh′ ) [⋅] = (x -1 (x( j)), ĥ′ )∈N(x -1 (x( ī)), h′ ) [⋅] = We see that indeed m r G [L y Lh[f ], ρ](i, h) = L y Lh[m r G [f, ρ]](i, h). 
Consequently, we conclude that the group self-attention operation is group equivariant. We emphasize once more that this is a consequence of the fact that L_g[ρ]((i, h), (j, ĥ)) = ρ((i, h), (j, ĥ)), ∀g ∈ G. In other words, it comes from the fact that the positional encoding used is invariant to the action of elements g ∈ G.
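The equivariance and non-equivariance statements proven above can be verified numerically on small random inputs. The sketch below makes several simplifying assumptions for illustration: a single attention head with the value map taken as the identity, a circular 1-D grid so that translation is a cyclic shift, and ρ_P taken as the identity map for the lifting encoding:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(f, pos_logits=0.0):
    """Single-head global self-attention with identity value map;
    `pos_logits` is an optional (N, N) matrix of positional logits."""
    logits = f @ f.T + pos_logits
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    return att @ f

N = 8
f = rng.normal(size=(N, 3))

# Proposition 4.1: content-only self-attention is permutation equivariant.
perm = rng.permutation(N)
assert np.allclose(attention(f[perm]), attention(f)[perm])

# Claims 4.2 / 4.3 on a circular 1-D grid, where translation is np.roll.
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
rho_abs = rng.normal(size=N)[None, :].repeat(N, axis=0)  # absolute: rho(j)
rho_rel = rng.normal(size=N)[(j - i) % N]                # relative: rho_P(x(j) - x(i))
shift = 3
f_t = np.roll(f, shift, axis=0)                          # L_y[f]
# relative encoding -> translation equivariant
assert np.allclose(attention(f_t, rho_rel),
                   np.roll(attention(f, rho_rel), shift, axis=0))
# absolute encoding -> equivariance fails
assert not np.allclose(attention(f_t, rho_abs),
                       np.roll(attention(f, rho_abs), shift, axis=0))

# Claim 5.1 key step: transforming both points by g = (y, hbar) maps the
# lifting encoding at h to the encoding at hbar^{-1} h.
R = np.array([[0.0, -1.0], [1.0, 0.0]])                  # 90° rotation

def rot(h):
    return np.linalg.matrix_power(R, h % 4)

def rho_lift(a, b, h):                                   # L_h[rho](a, b)
    return rot(-h) @ (b - a)

xi, xj, y = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)
hbar = 1
ga, gb = rot(hbar) @ xi + y, rot(hbar) @ xj + y
for h in range(4):
    assert np.allclose(rho_lift(ga, gb, h), rho_lift(xi, xj, h - hbar))
```

All assertions pass, mirroring the algebraic steps of the proofs: the summation substitutions correspond to the array permutations and rolls, and the key invariance L_g[ρ] = ρ appears as the final loop.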



We consequently consider an image as a set of N discrete objects i ∈ {1, 2, ..., N}. Illustratively, one can think of this as a function returning a vector representation of pixel positions in a grid. Regardless of any transformation performed to the image, the labeling of the grid itself remains exactly equal.

e represents a 0° rotation, i.e., the identity. The remaining elements r^j represent rotations by (90⋅j)°.

This phenomenon arises from the fact that R² is a quotient of the roto-translation group. Consequently, imposing group equivariance in the quotient space is equivalent to imposing an additional homomorphism of constant value over its cosets. Conclusively, the resulting map is of constant value over the rotation elements and, thus, is not able to discriminate among them. See Ch. 3.1 of Dummit & Foote (2004) for an intuitive description.

The squared factor in h₁² and h₂² appears as a result of the fact that the neighborhood growth is quadratic in R². This is why this subtlety is often left out in group equivariance literature.



Figure 1: Behavior of feature representations in group self-attention networks. An input rotation induces a rotation plus a cyclic permutation to the intermediary feature representations of the network. Additional examples for all the groups used in this work as well as their usage are provided in repo/demo/.

Figure 2: Steerability analysis of discrete convolutions and group self-attention (Bekkers, 2020). In group self-attention, the action of the group leaves the content of the image intact and only modifies the positional encoding (Figs. B.1, B.2). As the positional encoding lives on a continuous space, it can be transformed at an arbitrary level of precision without interpolation (Fig. 2b).

Figure A.1: Parameter usage in convolutional kernels (Fig. A.1a) and self-attention (Fig. A.1b). Given a budget of 9 parameters, a convolutional filter ties these parameters to specific positions. Subsequently, these parameters remain static regardless of (i) the query input position and (ii) the input signal itself. Self-attention, on the other hand, does not tie parameters to any specific positions at all. Contrarily, it compares the representations of all tokens falling in its receptive field. As a result, provided enough heads, self-attention is more general than convolutions, as it can represent any convolutional kernel, e.g., Fig. A.1a, as well as several other functions defined on its receptive field.


Figure B.2: Lifting self-attention on the roto-translation group for discrete rotations by 90 degrees (also called the Z4 group). The Z4 group is defined as H = {e, h, h^2, h^3}, where h denotes a 90° rotation. Analogous to lifting self-attention (Fig. B.1), group self-attention corresponds to a concatenation of |H| = 4 self-attention operations between the input f and h-transformed versions of the positional encoding L_h[ρ], ∀h ∈ H. However, in contrast to lifting self-attention, both f and ρ are now defined on the group G. Consequently, an additional sum over h is required during the operation (c.f., Eq. 16). Since Z4 is a cyclic group, i.e., h^4 = e, functions on Z4 are often represented as responses on a ring (right side of the image). This is the self-attention analogue of the regular group convolution broadly utilized in the group equivariant learning literature, e.g., Cohen & Welling (2016); Romero et al. (2020a).
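The Z4 structure used above can be sketched in a few lines of numpy: the group is generated by a single 90° rotation matrix, cyclicity (h^4 = e) holds exactly, and acting with h on a function defined on Z4 amounts to a cyclic shift of its four responses on the ring, which is the permutation behavior shown in Fig. 1.

```python
import numpy as np

# The cyclic group Z4: powers of a 90-degree planar rotation h.
h = np.array([[0, -1], [1, 0]])  # exact integer 90-degree rotation
Z4 = [np.linalg.matrix_power(h, k) for k in range(4)]  # {e, h, h^2, h^3}

# Cyclicity: h^4 = e.
assert np.array_equal(np.linalg.matrix_power(h, 4), np.eye(2, dtype=int))

# A function on Z4 is a vector of 4 responses "on a ring". The left
# action of h cyclically permutes these responses.
f_on_Z4 = np.array([0.1, 0.7, -0.3, 1.2])
acted = np.roll(f_on_Z4, 1)  # one step around the ring
```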

Figure F.1: Graphical description of group self-attention networks. Dotted blocks depict optional blocks. Linear layers are applied point-wise across the feature map. Swish non-linearities (Ramachandran et al., 2017) and layer normalization (Ba et al., 2016) are used throughout the network. The GlobalPooling block consists of a max-pool over group elements followed by a spatial mean-pool.
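The GlobalPooling block can be sketched as follows, assuming a (channels, |H|, height, width) feature layout (this layout and the function name are our assumptions, not from the paper). Because max-pooling commutes with the cyclic permutation over the group axis and mean-pooling is invariant to spatial rotation, the resulting descriptor is invariant to the group action.

```python
import numpy as np

def global_pooling(feature_map):
    """Max-pool over the group axis, then mean-pool over space.

    feature_map: array of shape (channels, |H|, height, width),
    i.e., features defined on the group.
    Returns a per-channel invariant descriptor of shape (channels,).
    """
    pooled_group = feature_map.max(axis=1)   # (C, H, W)
    return pooled_group.mean(axis=(1, 2))    # (C,)

# Invariance check: a 90-degree input rotation acts on group features
# by rotating the spatial axes and cyclically permuting the group axis.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8, 8))
x_rot = np.roll(np.rot90(x, k=1, axes=(2, 3)), 1, axis=1)
assert np.allclose(global_pooling(x), global_pooling(x_rot))
```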

$\sum_{(\bar{j}, \bar{h}h') \in \mathcal{N}(\bar{i}, \bar{h}h')}[\,\cdot\,] = \sum_{(\bar{j}, h') \in \mathcal{N}(\bar{i}, h')}[\,\cdot\,]$. Additionally, we have that $\sum_{\bar{h}h' \in H}[\,\cdot\,] = \sum_{h' \in H}[\,\cdot\,]$. Consequently, we can further reduce the expression above to:
$$
\begin{aligned}
m^{r}_{G}\big[\mathcal{L}_{y}\mathcal{L}_{\bar{h}}[f], \rho\big](i, \tilde{h})
&= \phi_{\text{out}}\Big(\bigcup_{h \in [H]} \sum_{h' \in H} \sum_{(\bar{j}, h') \in \mathcal{N}(\bar{i}, h')} \sigma_{\bar{j}, h'}\big\langle \phi^{(h)}_{\text{qry}}\big(f(\bar{i}, h')\big),\ \phi^{(h)}_{\text{key}}\big(f(\bar{j}, h') + \mathcal{L}_{\bar{h}^{-1}\tilde{h}}[\rho]\big((\bar{i}, h'), (\bar{j}, h')\big)\big)\big\rangle\, \phi^{(h)}_{\text{val}}\big(f(\bar{j}, h')\big)\Big) \\
&= m^{r}_{G}[f, \rho]\big(\bar{i}, \bar{h}^{-1}\tilde{h}\big)
= m^{r}_{G}[f, \rho]\big(x^{-1}\big(\bar{h}^{-1}(x(i) - y)\big), \bar{h}^{-1}\tilde{h}\big)
= \mathcal{L}_{y}\mathcal{L}_{\bar{h}}\big[m^{r}_{G}[f, \rho]\big](i, \tilde{h}).
\end{aligned}
$$
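The reindexing step above relies only on left multiplication by a group element being a bijection of the group onto itself, so a sum over the translated set equals the sum over the group. A small check for Z4, encoding elements as integers mod 4 with composition as addition mod 4 (an illustrative encoding, not from the paper):

```python
# Elements of Z4 and their composition law.
H = list(range(4))
compose = lambda a, b: (a + b) % 4

# Left translation by any fixed element hbar only permutes H ...
for hbar in H:
    assert {compose(hbar, hp) for hp in H} == set(H)

# ... hence any sum of per-element terms is unchanged by the reindexing.
vals = {0: 1.0, 1: -2.0, 2: 0.5, 3: 3.0}
assert sum(vals[compose(1, hp)] for hp in H) == sum(vals[hp] for hp in H)
```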

A.1. Cordonnier et al. (2020)'s statement can be seamlessly extended to group self-attention by incorporating an additional dimension corresponding to H in their derivations, and defining neighborhoods in this new space with a proportionally larger number of heads. Statement (ii) stems from the finding of Ravanbakhsh (2020) that functions induced by regular group representations are equivariant universal approximators provided full kernels, i.e., global receptive fields. Global receptive fields are required to guarantee that the equivariant map is able to model any dependency among input components. They are readily provided by our proposed regular global group self-attention and, given enough heads, one can ensure that any such dependency is properly modelled.

Accuracy vs. neighborhood size.

Classification results. All convolutional architectures use 3×3 filters.

ACKNOWLEDGMENTS

We gratefully acknowledge Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, and Hyunjik Kim for useful discussions, and Robert-Jan Bruintjes, Fabian Fuchs, Erik Bekkers, Andreas Loukas, Mark Hoogendoorn, and our anonymous reviewers for their valuable comments on early versions of this work, which greatly helped us improve its quality. David W. Romero is financed as part of the Efficient Deep Learning (EDL) programme (grant number P16-25), partly funded by the Dutch Research Council (NWO) and Semiotic Labs. Jean-Baptiste Cordonnier is financed by the Swiss Data Science Center (SDSC). Both authors thankfully acknowledge everyone involved in funding this work. This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

