INFUSING LATTICE SYMMETRY PRIORS IN NEURAL NETWORKS USING SOFT ATTENTION MASKS

Abstract

Infusing inductive biases and knowledge priors in artificial neural networks is a promising approach for achieving sample efficiency in current deep learning models. Core knowledge priors of human intelligence have been studied extensively in developmental science, and recent work has postulated that research on artificial intelligence should revolve around the same basic priors. As a step in this direction, we introduce LATFORMER, a model that incorporates lattice geometry and topology priors in attention masks. Our study of the properties of these masks motivates a modification to the standard attention mechanism, where attention weights are scaled using soft attention masks generated by a convolutional neural network. Our experiments on ARC and on synthetic visual reasoning tasks show that LATFORMER requires two orders of magnitude less data than standard attention and transformers on these tasks. Moreover, our results on ARC tasks that incorporate geometric priors provide preliminary evidence that deep learning can tackle this complex dataset, which is widely viewed as an important open challenge for AI research.

1. INTRODUCTION

Infusing inductive biases and knowledge priors in neural networks is regarded as a critical step to improve their sample efficiency (Battaglia et al., 2018; Bengio, 2017; Lake et al., 2017; Lake & Baroni, 2018; Bahdanau et al., 2019). The Core Knowledge priors for human intelligence have been studied extensively in developmental science (Spelke & Kinzler, 2007), following the theory that humans are endowed with a small number of separable systems of core knowledge, so that new flexible skills and belief systems can build on these core foundations. Recent research in artificial intelligence (AI) has postulated that the same priors should be incorporated in AI systems (Chollet, 2019), but it is an open question how to incorporate these priors in neural networks. Following this line of thought, the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019) was proposed as an AI benchmark built on top of the Core Knowledge priors from developmental science. Chollet (2019) posits that ARC "cannot be meaningfully approached by current machine learning techniques, including Deep Learning". Further, he argues that developing a domain-specific approach based on the Core Knowledge priors is a challenging first step and that "solving this specific subproblem is critical to general AI progress".

An important category of Core Knowledge priors includes geometry and topology priors. Indeed, significant attention has been devoted to incorporating such priors in deep learning architectures by rendering neural networks invariant (or equivariant) to transformations represented through group actions (Bronstein et al., 2021). Group invariant learning helps to build models that systematically ignore specific transformations applied to the input (such as translations or rotations). We take a complementary perspective and aim to help neural networks learn functions that incorporate geometric transformations of their input (rather than being invariant to such transformations).
In particular, we focus on group actions that belong to the symmetry group of a lattice. These transformations are pervasive in machine learning applications, as basic transformations of sequences, images, and other higher-dimensional regular grids fall in this category. While attention and transformers can in principle learn these kinds of group actions, we show that they require a significant amount of training data to do so.

Figure 1: We consider problems that involve learning a geometric transformation on the input data as a sub-problem. The displayed task (taken from ARC) entails learning to map, for each pair, the left to the right image. We investigate how to solve such tasks more sample-efficiently by imbuing self-attention with the ability to exploit lattice symmetry priors.

To address this sample complexity issue, we introduce LATFORMER, a model that relies on attention masks in order to learn actions belonging to the symmetry group of a lattice, such as translation, rotation, reflection, and scaling, in a differentiable manner. We show that, for any such action, there exists an attention mask such that an untrained self-attention mechanism initialized to the identity function performs that action. We further prove that these attention masks can be expressed as convolutions of the identity, which motivates a modification to the standard attention module where the attention weights are modulated by a mask generated by a convolutional neural network (CNN).

Our paper focuses on ARC and its variants; we see the extension of LATFORMER to other tasks as a promising avenue for future research. Accordingly, we evaluated our approach on synthetic tasks, ARC, and the recently proposed LARC (Acquaviva et al., 2021). First, to probe the sample efficiency of our method, we compared its ability to learn synthetic geometric transformations against Transformers and attention modules.
Then, we annotated ARC tasks based on the knowledge priors they require, and we evaluated LATFORMER on the ARC and LARC tasks requiring geometric knowledge priors. Our results provide evidence that LATFORMER can learn geometric transformations with two orders of magnitude less training data than transformers and attention. We also provide the first neural network reaching good performance on a subset of ARC, suggesting that this kind of problem is not out of reach for deep learning models.

2. FORMALIZING THE GROUP-ACTION LEARNING PROBLEM

We are interested in helping neural networks learn lattice transformations sample-efficiently by infusing knowledge priors in the model. Motivated by ARC, we focus on learning geometric transformations that belong to the symmetry group of a lattice. This pertains to the more general problem of learning group actions given the input and the output of the transformation. Concretely, we consider input-output transformations involving a group element $g$ taken from some known group $G$ that can be expressed under the general formulation:

$$y = f(g \circ x, x) \quad \text{for some } g = g(x) \in G \qquad \text{(group-action learning)}$$

Above, $x \in \mathbb{R}^{d_{in}}$ and $y \in \mathbb{R}^{d_{out}}$ are input and output examples, $f$ and $g$ are unknown functions, and $\circ$ denotes the application of a group action. As seen, the group element $g$ can depend on the input data itself. More generally, the function $f$ may depend on more than one transformation of $x$, based on elements belonging to various groups of interest.

A simple instance of the group-action learning problem is presented in Figure 1. The example task is borrowed from ARC (Chollet, 2019) and entails learning to fill out the yellow patches in the leftmost image (input) so that the resulting image satisfies a $90^\circ$ rotation symmetry. The learner is given only a small set of input-output pairs (the ARC tasks have 3.3 training examples on average) and the prior knowledge of discrete two-dimensional point groups, one of which is the cyclic group of 4-fold rotations $C_4$. Though the task is challenging for a general neural network (due to the small number of samples), under the rotation prior it can be easily solved by the composition of a shallow neural network, a group action, and a non-linear activation: 1) mapping yellow to zero, 2) rotating each image $x$ by some $g \in C_4$, and 3) taking a pixel-wise max.
It is important to stress that group-action learning is the exact antithesis of the typical group invariant and equivariant learning problems (Bronstein et al., 2021):

$$y = f(g \circ x) \quad \text{for every } g \in G \qquad \text{(invariant learning)}$$

$$g \circ y = f(g \circ x) \quad \text{for every } g \in G \qquad \text{(equivariant learning)}$$

Intuitively speaking, whereas in group-action learning one aims to learn functions that involve specific (and data-dependent) transformations of the data by actions of the group, in invariant and equivariant learning the goal is to build models that are oblivious to such group actions in a systematic manner.
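The three-step solution to the Figure 1 task described in this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the color index `HOLE`, the helper names, and the synthetic example generator are hypothetical, and for simplicity the sketch applies every rotation in $C_4$ rather than learning which one is needed.

```python
import numpy as np

HOLE = 4  # hypothetical color index standing in for the yellow patches


def complete_with_c4(grid, hole=HOLE):
    """Fill the hole cells of a square grid so that it is invariant under 90-degree rotations."""
    x = np.where(grid == hole, 0, grid)            # 1) map the hole color to zero
    rotated = [np.rot90(x, k) for k in range(4)]   # 2) apply every element of C4
    return np.maximum.reduce(rotated)              # 3) pixel-wise max


def make_example(size=6, seed=0):
    """Build a C4-symmetric target and an input whose top-left 2x2 patch is masked out."""
    rng = np.random.default_rng(seed)
    base = rng.integers(1, 4, size=(size, size))   # colors 1..3, never 0 or HOLE
    # The pixel-wise max over all four rotations is itself rotation-symmetric.
    target = np.maximum.reduce([np.rot90(base, k) for k in range(4)])
    masked = target.copy()
    masked[:2, :2] = HOLE
    return masked, target


masked, target = make_example()
completed = complete_with_c4(masked)
print(np.array_equal(completed, target))
```

The sketch works because the masked patch and its three rotated images do not overlap, so at least one rotation always supplies the missing value.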

3. ATTENTION MASKS FOR CORE GEOMETRY PRIORS

This section lays the theoretical groundwork for LATFORMER, our approach to learning the transformations of lattice symmetry groups in the form of attention masks. The section defines attention masks and explains how they can be leveraged to incorporate geometry priors when solving group-action learning problems on sequences and images.

3.1. MODULATING ATTENTION WEIGHTS WITH SOFT MASKING

Consider the scaled dot-product attention mechanism as defined in Vaswani et al. (2017). In our formulation, we consider real-valued masks $M \in [0, 1]^{n_Q \times n_K}$ that rescale attention weights:

$$A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) \odot M,$$

where $Q \in \mathbb{R}^{n_Q \times d}$ is the query parameter of the attention mechanism, $K \in \mathbb{R}^{n_K \times d}$ is the key, $d$ is the dimensionality of the model, $n_Q$ and $n_K$ are the sizes of the sets encoded by the query and key matrices respectively, and $\odot$ is the Hadamard product. Attention masks have been widely used to constrain the values of the attention weights and are usually binary masks applied before the softmax activation (Vaswani et al., 2017; Sartran et al., 2022). However, as we aim to learn $M$, we apply the mask after the softmax operation in order to avoid squashing the gradient. Since the masked weights no longer sum to 1, we renormalize them when calculating the output of the attention mechanism:

$$\mathrm{MaskedAttention}(Q, K, V; M) = \frac{A}{A \cdot 1_{n_K} 1_{n_K}^\top} V,$$

where the division is element-wise, $1_n$ is a vector of ones of size $n$, and $V \in \mathbb{R}^{n_K \times d}$ is the value parameter of the attention mechanism. Though masking can also be applied in cross-attention, in the following we primarily focus on self-attention, where $Q = K = V = X$. For ease of notation, we write $\mathrm{MaskedAttention}(X; M)$ whenever the query, key and value are the same matrix $X$.
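The masked attention defined in this section can be sketched in NumPy as follows. It is a minimal illustration, not the authors' implementation: there are no learned projections, and the small constant `eps` guarding against all-zero mask rows is an assumption added here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def masked_attention(Q, K, V, M, eps=1e-12):
    """Scaled dot-product attention with a soft mask applied after the softmax."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d)) * M           # Hadamard product with the mask
    A = A / (A.sum(axis=-1, keepdims=True) + eps)   # renormalize each row (eps is an assumption)
    return A @ V


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
M = np.roll(np.eye(5), 1, axis=0)                   # circulant permutation mask
out = masked_attention(X, X, X, M)
print(np.allclose(out, np.roll(X, 1, axis=0)))      # the mask shifts the rows of X by one
```

With a mask that has a single 1 per row, the renormalization cancels the softmax weight and the output reduces to `M @ X`, regardless of the content of `X`.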

3.2. EXISTENCE OF ATTENTION MASKS FOR LATTICE SYMMETRY ACTIONS

This section discusses group actions that can be represented by attention masks. To develop intuition, let us first consider the simple example of translation in a one-dimensional lattice. Supposing that $x = (x_1, \dots, x_n)^\top$ is a vector of $n$ elements, we have:

$$\mathrm{MaskedAttention}(x; M) = (x_n, x_1, \dots, x_{n-1})^\top, \qquad M = \begin{pmatrix} 0 & 0 & \cdots & 1 \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}.$$

Hence, when $M$ is the circulant permutation matrix shown above, the mask shifts the input $x$ by one element to the right. Beyond translation, it is natural to ask what kinds of group actions we can perform with attention masks on data with a higher-dimensional topological structure. The following theorem provides existence statements for data whose underlying topological space is a hypercubic lattice (such as sequences, images and higher-dimensional regular grids).

Transformation          Fourier shift                                                        Lattice size
Identity                $o(g_i)_k = 0$                                                       $n = l_1$
Translation (by δ)      $o(g_i)_k = -\delta$                                                 $n = l_1$
Reflection              $o(g_i)_1 = n - 1$, $o(g_i)_k = o(g_i)_{k-1} - 2$                    $n = l_1$
Rotation ($90^\circ$)   $o(g_i)_k = k \cdot (l_1 - 1) - \lfloor (k-1)/l_1 \rfloor$           $n = l_1 \times l_2$
Upscaling (by h)        $o(g_i)_k = ((k-1) \bmod h) + (h-1) \cdot \lfloor (k-1)/h \rfloor$   $n = l_1$

Table 1: Fourier shifts for the transformations on the 1-dimensional and square lattices. We denote with $o(g_i)_k$ the $k$-th component of the vector $o(g_i) \in \mathbb{R}^n$, for $k = 1, \dots, n$. As stated in Theorem 2, attention masks for higher-dimensional lattices can be obtained by the Kronecker product of primitive masks defined over the 1-dimensional and square lattices. Composition of actions is given by matrix multiplication of the masks.

Theorem 1 (Existence). Let $G_m$ be the symmetry group of the $m$-dimensional hypercubic lattice, including translational symmetry, 4-fold rotational symmetry and vertical, horizontal and diagonal reflections. Let $X \in \mathbb{R}^{n \times d}$ be a vectorized representation of an $m$-dimensional tensor $\mathcal{X} \in \mathbb{R}^{l_1 \times \cdots \times l_m}$, with $n = l_1 \cdot \ldots \cdot l_m$. For any group action $g \in G_m$, there exists an attention mask $M_g \in \{0, 1\}^{n \times n}$ such that:

$$\mathrm{MaskedAttention}(X; M_g) = g \circ X.$$

In other words, Theorem 1 states that any translation, rotation or reflection can be expressed in terms of an attention mask. Figure 2 shows some examples of masks corresponding to translation, rotation and reflection operations on square lattices. In the following, we adopt the convention of writing $M_g$ to mean the mask that implements action $g$. For more details and for a proof of Theorem 1, we refer the reader to Appendix D.
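The existence statement of Theorem 1 can be illustrated numerically on a small square lattice. The sketch below is an assumption-laden illustration rather than the construction used in the proof: the helper `mask_from_transform` is hypothetical, and it relies on the fact (shown in Appendix D.1) that a mask with a single 1 per row acts on the input as the plain matrix product.

```python
import numpy as np


def mask_from_transform(transform, shape):
    """Binary mask M such that M @ x.ravel() equals transform(x).ravel().

    `transform` must only move cells around and preserve the grid shape.
    """
    n = int(np.prod(shape))
    index_grid = np.arange(n).reshape(shape)
    source = transform(index_grid).ravel()     # source[k] = input cell copied to output cell k
    M = np.zeros((n, n))
    M[np.arange(n), source] = 1.0              # exactly one 1 per row
    return M


shape = (4, 4)
grid = np.random.default_rng(0).integers(0, 10, size=shape)

M_rot = mask_from_transform(np.rot90, shape)
M_flip = mask_from_transform(np.fliplr, shape)
M_shift = mask_from_transform(lambda g: np.roll(g, (1, 1), axis=(0, 1)), shape)

print(np.array_equal((M_rot @ grid.ravel()).reshape(shape), np.rot90(grid)))
# Composition of actions corresponds to the matrix product of the masks.
both = (M_rot @ M_flip @ grid.ravel()).reshape(shape)
print(np.array_equal(both, np.rot90(np.fliplr(grid))))
```

Each mask is a permutation matrix over the $n = 16$ cells of the vectorized grid, which is why products of masks are again valid masks.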

3.3. REPRESENTING ATTENTION MASKS FOR LATTICE TRANSFORMATIONS

To facilitate the learning of lattice symmetries, one needs to determine methods to parameterize the set of feasible group elements. Fortunately, as made precise in the following theorem, the attention masks considered in Theorem 1 can be expressed conveniently under the same general formulation.

Theorem 2 (Representation). Let $G_m$ be the symmetry group of the $m$-dimensional hypercubic lattice and $g \in G_m$ be an action on a tensor $\mathcal{X} \in \mathbb{R}^{l_1 \times \cdots \times l_m}$. Then, there exist some primitive attention masks $M_{g_i} \in \{0, 1\}^{n_i \times n_i}$ such that

$$M_g = \bigotimes_i M_{g_i} \quad \text{and} \quad \mathcal{F}(M_{g_i}) = \mathcal{F}(I_{n_i}) \exp\left(-\frac{2\pi j}{n_i} \, o(g_i) \, r_{n_i}^\top\right),$$

where $M_g \in \{0, 1\}^{n \times n}$ is an attention mask implementing $g$, $g_i \in G_{m_i}$ for some $m_i \in \{1, 2\}$ is an action on the one-dimensional or square lattice, $\otimes$ is the Kronecker product, $\mathcal{F}$ is the Fourier transform applied column-wise, $I_{n_i}$ is the $n_i \times n_i$ identity matrix, $j$ is the imaginary unit, $r_{n_i} = (1, 2, \dots, n_i)^\top$, and $o(g_i)$ is defined as in Table 1.

To obtain an intuitive understanding of Theorem 2, it helps to revisit the example of translation by $\delta = 1$ of a sequence $x \in \mathbb{R}^n$ on the 1-dimensional lattice ($m = 1$). Consulting Table 1, we find that $o(g)$ is a vector containing $-1$ at every position, and we know that $M_g$ is the circulant permutation matrix of Section 3.2. Indeed, by the time-shifting property of the Fourier transform, $M_g$ can be obtained by shifting the rows of the identity by $-1$. In general, the vector $o(g)$ has a convenient intuitive interpretation, as its $k$-th component represents the relative position (with respect to $k$) of the element that the $k$-th row of $X$ attends to. For instance, in the one-dimensional example of translation by one element to the right, each element attends to the one immediately before. Hence, we have $o(g)_k = -1$ for any $k = 1, \dots, n$. For higher-dimensional lattices, attention masks can be expressed as the Kronecker product of the attention masks for lower-dimensional cases.
For instance, on the square lattice, a translation by 1 pixel on both dimensions is the Kronecker product of the two circulant matrices corresponding to a translation by 1 pixel on the one-dimensional lattice, as shown in Figure 2a. On more than one dimension, we can additionally define 4-fold rotations, still following the same formulation, with $o(g_i)$ defined as in Table 1. Although strictly not a symmetry operation, scaling transformations of the lattice can also be defined in terms of attention masks under the same general formulation of Theorem 2, as reported in Table 1. Therefore, for completeness, we will consider scaling transformations as well in our experiments.

Notice that Theorem 2 allows us to derive a way to calculate the attention masks. In particular, we can express our attention masks as a convolution operation on the identity, as stated below.

Corollary 1. Let $G_m$ be the symmetry group of the $m$-dimensional hypercubic lattice and let $M_g \in \mathbb{R}^{n \times n}$ be an attention mask implementing action $g \in G_m$. Then:

$$M_g[:, i] = \mathcal{F}^{-1}\left(\exp\left(-\frac{2\pi j}{n} \cdot o(g) \cdot r_n^\top\right)\right)[:, i] \star I_n[:, i],$$

where $\star$ denotes the convolution operation. In other words, we can represent any mask in our framework as a convolution of the identity matrix with predefined kernels. This motivates us to design a convolutional neural network that produces our attention masks by successive convolutions of the identity.
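The interpretation of $o(g)$ as the relative position each row attends to can be made concrete with a short NumPy sketch. It builds masks directly from the translation, reflection and rotation shifts of Table 1 by placing the single 1 of row $k$ at column $k + o(g)_k$. Wrapping that column index modulo $n$ is an assumption of this sketch, as are all function names; the upscaling row of Table 1 is not covered.

```python
import numpy as np


def mask_from_shifts(o):
    """Row k (1-indexed) attends to column k + o[k], wrapped into 1..n (wrapping is assumed)."""
    o = np.asarray(o)
    n = len(o)
    k = np.arange(1, n + 1)
    cols = (k + o - 1) % n                 # convert to 0-indexed columns
    M = np.zeros((n, n))
    M[k - 1, cols] = 1.0
    return M


def translation_shifts(n, delta):
    return np.full(n, -delta)              # o_k = -delta


def reflection_shifts(n):
    return (n - 1) - 2 * np.arange(n)      # o_1 = n - 1, o_k = o_{k-1} - 2


def rotation_shifts(l):
    k = np.arange(1, l * l + 1)
    return k * (l - 1) - (k - 1) // l      # 90-degree rotation on an l x l lattice


x = np.arange(1, 6)
print(mask_from_shifts(translation_shifts(5, 1)) @ x)   # shift right by one
print(mask_from_shifts(reflection_shifts(5)) @ x)       # reversed sequence

# Translation by one pixel along both axes as a Kronecker product of 1-D masks.
l = 4
grid = np.arange(l * l).reshape(l, l)
T = mask_from_shifts(translation_shifts(l, 1))
shifted = (np.kron(T, T) @ grid.ravel()).reshape(l, l)
print(np.array_equal(shifted, np.roll(grid, (1, 1), axis=(0, 1))))
```

Under the row-major vectorization used here, the rotation shifts reproduce the counter-clockwise rotation computed by `np.rot90`.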

4. THE LATFORMER ARCHITECTURE

While in principle the problem of inferring group actions from input-output pairs can be solved via search over finite groups, in practice the size of the group for lattice symmetry actions makes this approach unfeasible. Moreover, we are interested in learning unknown functions jointly with the transformation, which cannot be solved by searching on the space of group actions. Using a neural agent to search the space of possible actions would be a viable alternative, but this would make the problem non-differentiable and we would need to resort to reinforcement learning methods. In this work, we aim to solve the problem in a differentiable way. Inspired by the observations above, we introduce LATFORMER, which incorporates the insights of Section 3 into a neural architecture that leverages our attention masks. We propose to use gated CNNs to parameterize the masks and we introduce an additional smoothing technique for easier optimization.

4.1. LATTICE MASK EXPERTS AS CONVOLUTIONAL NEURAL NETWORKS

Attention modules in neural networks usually include an attention mechanism with learnable linear transformations of the inputs, followed by a feed-forward network (FFN), as in the Transformer encoder layer (Vaswani et al., 2017). To infuse core geometry priors in the attention module, we propose to modulate the attention weights with a mask generated by an additional layer, as shown in Figure 3a. We refer to this layer as Lattice mask expert, as it specializes towards specific transformations of the lattice. To understand the purpose of this layer, it is useful to remember that, by the analysis conducted in Section 3, even if the attention and FFN layers are initialized to the identity function, the mask expert can generate attention masks that produce precise geometric transformations of the input. By Corollary 1, we know that each group action on the lattice can be represented by a mask that is a convolution of the identity, and we have an analytical expression to calculate the kernels of the convolution. We can leverage this notion to design CNNs that produce attention masks corresponding to specific group actions by following the general formulation:

$$M_0 = I \quad \text{and} \quad M_{l+1} = \alpha_l \, \mathrm{Conv}(M_l, K_l) + (1 - \alpha_l) M_l \quad \text{for } l = 0, \dots, L - 1,$$

where $M_L$ is the predicted mask, $\alpha_l = \sigma_l(X; \theta) = \mathrm{FFN}_l(X, \theta)$ is the output of a gating function, $\theta$ is a learnable parameter, and $K_l$ is the kernel of the $l$-th convolutional layer, whose weights are determined based on Corollary 1 and Table 1. As an example, Figure 3b shows an architecture that generates translation masks. Following Theorem 2, the expert computes the translations along the two dimensions separately and then aggregates the resulting masks via the Kronecker product. Hence, a Lattice translation expert with $L$ convolutional layers for each dimension can generate any translation up to $\delta = 2^L - 1$ elements per dimension.
At inference time, the values of the gates can be discretized, in such a way that the generated mask provably performs a meaningful group action. Similarly to the expert in Figure 3b, we can define gated CNNs for transformations like rotation, reflection, and scaling. The product of experts (i.e., the combination of multiple actions) can be obtained by either chaining the experts or multiplying the attention masks generated by different experts. For more details, we refer the reader to Appendix A.
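The gated recursion of this section can be sketched for a one-dimensional translation expert as follows. Several choices here are assumptions made for illustration: layer $l$ is taken to shift the mask by $2^l$ positions (which is consistent with the stated range of $2^L - 1$ but is not spelled out in the text), the convolution with a one-hot kernel is implemented as a circular shift via `np.roll`, and the gate values are passed in directly instead of being produced by an FFN from the input.

```python
import numpy as np


def translation_expert(gates, n):
    """Apply M_{l+1} = a_l * Conv(M_l, K_l) + (1 - a_l) * M_l starting from the identity.

    Conv(M_l, K_l) is assumed to be a circular shift of the mask rows by 2**l.
    """
    M = np.eye(n)
    for l, alpha in enumerate(gates):
        shifted = np.roll(M, 2 ** l, axis=0)     # one-hot kernel acts as a shift
        M = alpha * shifted + (1.0 - alpha) * M
    return M


n = 8
x = np.arange(n)

# Discretized gates read as a binary number: 1 + 4 = translation by 5.
hard = translation_expert([1, 0, 1], n)
print(hard @ x)

# Soft gates give a convex combination of translation masks.
soft = translation_expert([0.9, 0.2, 0.7], n)
print(soft.sum(axis=1))                           # every row still sums to 1

# Two 1-D experts combined with the Kronecker product give a 2-D translation mask.
mask_2d = np.kron(translation_expert([1, 1, 0], n), translation_expert([0, 1, 0], n))
```

Rounding the soft gates to the nearest integer recovers a binary mask that performs an exact translation, mirroring the discretization step described for inference time.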

4.2. MASK SMOOTHING FOR EASIER TRAINING

The framework described so far parameterizes discrete transformations of a lattice in a differentiable manner. Nevertheless, to improve the training of LATFORMER, we found it beneficial to also apply a smoothing operation on the attention masks w.r.t. the intrinsic metric of the group in question. Our approach entails defining an adjacency relation between group elements and applying graph convolution with a heat kernel on the corresponding graph. This encourages the optimizer to favor weight updates that change the masks in a smooth manner w.r.t. the geodesic distance implied by the graph. Concretely, we define the neighbors of each element $g_i$ on the lattice as those $g_j = e \circ g_i$ reachable by an application of a primitive action $e$, such as translation by a single pixel in one dimension, rotation by $90^\circ$, and vertical/horizontal reflection. The notion of neighborhood gives rise to a graph whose vertex set is the lattice group and that contains one edge for every pair of neighboring actions. As before, it helps to consider different kinds of transformations separately. For instance, as shown in Figure 4, for 2D rotations the underlying graph is a cycle with 4 elements due to the underlying point group for 4-fold rotations being the cyclic group $C_4$. Performing heat diffusion can be achieved by repeated neighborhood averaging over the cycle and yields a smoothed rotation mask that performs all rotations at the same time (rightmost image in Figure 4). We can extend the same approach to all lattice transformations: for instance, in the case of translation, the underlying graph is a grid and the smoothing operation is akin to convolution with a Gaussian kernel.

Figure 4: Rotational smoothing can be obtained by performing heat diffusion over a cyclic graph where each node corresponds to a rotation mask. Panel label: Smooth rotation mask.

To train LATFORMER with smoothed masks, we compute two predictions: one with the non-smooth mask predicted by the model and one with a smoothed version of the same mask. The final loss is the sum of two terms, one for the original prediction and one for the smoothed prediction. For smooth predictions, in all our experiments we utilize a mean squared error (MSE) loss, whereas for the original non-smooth predictions we use the cross-entropy or MSE loss depending on whether the output data are categorical or continuous.
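The rotational smoothing described in this section can be sketched as diffusion of a weight vector over the four nodes of the $C_4$ cycle, followed by mixing the corresponding rotation masks. This is a hedged illustration: representing a mask by its weights over the four rotation masks, the lazy averaging step with `rate=0.5`, and the number of steps are all assumptions of the sketch, not details given in the text.

```python
import numpy as np


def rotation_masks(l):
    """Stack the four rotation masks of an l x l lattice (row-major vectorization)."""
    n = l * l
    idx = np.arange(n).reshape(l, l)
    masks = np.zeros((4, n, n))
    for k in range(4):
        masks[k, np.arange(n), np.rot90(idx, k).ravel()] = 1.0
    return masks


def diffuse_on_cycle(w, steps=3, rate=0.5):
    """Repeated neighborhood averaging on the cycle graph (rate and steps are assumed)."""
    w = np.asarray(w, dtype=float)
    for _ in range(steps):
        neighbors = 0.5 * (np.roll(w, 1) + np.roll(w, -1))
        w = (1.0 - rate) * w + rate * neighbors
    return w


def smooth_rotation_mask(w, l, steps=3, rate=0.5):
    """Mix the rotation masks with the diffused weights."""
    return np.tensordot(diffuse_on_cycle(w, steps, rate), rotation_masks(l), axes=1)


one_hot = [0.0, 1.0, 0.0, 0.0]            # the mask for a single 90-degree rotation
print(diffuse_on_cycle(one_hot, steps=2))
smooth = smooth_rotation_mask(one_hot, l=3)
print(smooth.sum(axis=1))                 # rows still sum to 1
```

With enough diffusion steps the weights approach the uniform distribution over $C_4$, which corresponds to a mask that blends all four rotations equally.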

5. EXPERIMENTS

To evaluate our method, we first developed a set of synthetic tasks in order to compare LATFORMER to attention modules and Transformers with respect to sample efficiency in learning basic geometric transformations. Then, we annotated the ARC tasks based on the knowledge priors they require, and we assessed the performance of our method on this challenging dataset. Finally, we experimented with the LARC (Acquaviva et al., 2021) dataset and compared our method to stronger baselines based on neural program synthesis. We report additional experimental results in Appendix B.3.

5.1. SAMPLE EFFICIENCY ON GEOMETRIC TRANSFORMATIONS

As a preliminary study, we probed the ability of LATFORMER to learn geometric transformations efficiently. To this end, we compared the performance of our model to a transformer (Vaswani et al., 2017) and an attention module (the same architecture as our approach, without the mask expert) on synthetic tasks with an increasing number of examples. Inspired by ARC, we generated a set of tasks where the model needs to infer a geometric transformation from input-output pairs. The input is a grid taken from the ARC tasks and the output is either a translation, rotation, reflection or scaling of the input. The specific transformation applied to the input grid defines the task and is consistent across all examples in the same task. We evaluated the models based on the mean accuracy across tasks. Figure 5 shows the accuracy of our model compared to the baselines and to a version of LATFORMER without smoothing. The plots show that LATFORMER can generalize better and from fewer examples than transformers and attention modules, both with absolute positional encodings and with relative positional encodings (Shaw et al., 2018). Additionally, our results show that the smoothing operation described in Section 4.2 is helpful for larger groups. More details on this experiment are reported in Appendix B.1.

5.2. VISUAL REASONING ON ARC TASKS WITH GEOMETRIC KNOWLEDGE PRIORS

To assess the ability of our approach to learn efficiently on a more challenging use case, we focused on a subset of the ARC dataset (Chollet, 2019) requiring geometric priors for which our method could be a principled solution. To this end, we annotated the ARC tasks based on the knowledge priors they require, using the list of priors provided by Chollet (2019) as a reference. Appendix B.2 provides more details about the annotation of ARC and Figure 7a in the Appendix shows the knowledge priors that we considered and their distribution across the ARC tasks. 

6. RELATED WORK

Our work was inspired by a previous investigation of self-attention layers which identified sufficient conditions such that they can perform convolution when equipped with relative positional encodings (Cordonnier et al., 2020; Andreoli, 2019). Rather than relying on relative encodings, we show here how soft-masking can be used to sample-efficiently learn more general input transformations, such as rotation, reflection, and scaling. To the best of our knowledge, the group-action learning problem has not been explicitly and generally formulated in previous work. That being said, many previous works have focused on specific instances, such as learning to sort (Graves et al., 2014; Reed & De Freitas, 2015; Li et al., 2020) by selecting an element of the permutation group $S_n$, docking/folding by roto-translating objects according to an action in the special Euclidean group SE(3) (Sverrisson et al., 2022; Stärk et al., 2022; Jumper et al., 2021), and graph spectrum generation where the learned actions belong to the Stiefel manifold (Martinkus et al., 2022). Our work is similar in spirit to recent efforts in neuro-symbolic visual reasoning (Johnson et al., 2017a;b; Goyal et al., 2017; Mao et al., 2019; Higgins et al., 2018). Many approaches based on attention mechanisms have been proposed in the past few years (Hudson & Manning, 2018; 2019). Our work differs from these in that we aim to learn basic geometric reasoning in a sample-efficient way, rather than modeling relationships between high-level concepts. Finally, some recent works reached the same conclusion as ours on the advantages of using attention masks to incorporate prior knowledge in neural networks. As an example, Yan et al. (2020) focus on the task of learning subroutines (e.g., sorting algorithms) and use a CNN to generate an attention mask for a Transformer encoder. They show that learning the attention mask allows them to generalize to longer sequences than the ones provided at training time.
Similarly, Sartran et al. (2022) used precomputed attention masks to incorporate syntactic compositional biases in language models.

7. CONCLUSION

This paper focused on how to help deep learning models to learn geometric transformations efficiently. Specifically, we proposed to incorporate lattice symmetry biases into attention mechanisms by modulating the attention weights using learned soft masks. We have shown that attention masks implementing the actions of the symmetry group of a hypercubic lattice exist, and we provided a way to represent these masks. This motivated us to introduce LATFORMER, a model that generates attention masks corresponding to lattice symmetry priors using a CNN. Our results on synthetic tasks show that our model can generalize better than the same attention modules without masking and Transformers. Moreover, the performance of our method on a subset of ARC provides the first evidence that deep learning can be used on this dataset, which is widely considered as an important challenge for AI research.

A ADDITIONAL DETAILS ON THE MODEL

This section describes the LATFORMER architecture providing additional details that were not covered in Section 4.1. As mentioned in Section 4.1, it is possible to design convolutional neural networks that perform all considered transformations of the lattice. Figure 6 shows the architecture of the four expert models that generate translation, rotation, reflection and scaling masks.

Figure 6: Panels: Lattice Translation Expert, Lattice Rotation Expert, Lattice Reflection Expert, Lattice Scaling Expert. $M_S$ denotes an attention mask implementing an upscaling by $h$ along one dimension.

Using Corollary 1, we can derive the kernels of the convolutional layers shown in Figure 6. These kernels are frozen at training time; the model only learns the gating function, denoted as $\sigma$ in the figure. Notice that all the models follow the same overall structure. However, for scaling, we also learn an additional gate, denoted as $\sigma(M_S, M_S^\top)$ in Figure 6. This gate allows the model to transpose the mask and serves the purpose of implementing down-scaling operations (down-scaling is the transpose of up-scaling). The composition of multiple actions can be obtained by combining different experts. This can be done either by chaining the experts or by matrix multiplication of the masks. In preliminary experiments, we did not notice any significant difference in performance between the two options and we rely on the latter in our implementation.

B ADDITIONAL EXPERIMENTS AND DETAILS ON THE EXPERIMENTAL SETUP

This section provides additional details on the experimental setup of all our experiments, including further information on the generation of the synthetic tasks and the data annotation process for ARC.

B.1 EXPERIMENTS ON SYNTHETIC DATA

We considered four categories of tasks, namely translation, rotation, reflection and scaling. Each task is defined in terms of input-output pairs, which are sampled from the set of all ARC grids and padded to the size of 30 × 30 cells. To each input grid, a synthetic transformation is applied in order to obtain the corresponding output grid. For each task in each category, we generated 2048 training pairs and 100 test pairs. For translation tasks, we have a total of 900 possible translations in a 30 × 30 grid. However, generating data and training models on 900 tasks is computationally expensive, so we randomly sampled 5 translations in the interval [1, 29] × [1, 29], obtaining a total of 100 translation tasks. Rotation tasks include all 4-fold rotations except the identity. Similarly, reflection tasks involve horizontal, vertical and diagonal reflections. Scaling tasks include all possible up/down scaling transformations of the input grid by factors of [2, 5] × [2, 5], for a total of 32 scaling tasks. The models are evaluated based on the mean accuracy on each category. For each task, we compute the accuracy on the test set based on how many of the predicted images exactly match the ground truth.

In order to experiment with ARC, we first performed an annotation of the dataset to identify the underlying knowledge priors for each task. To this end, we built a user interface where the annotator could browse the tasks and label them by selecting any combination of the available knowledge priors. Figure 7b shows the user interface provided to the annotator, whereas Figure 7a shows the distribution of knowledge priors across the ARC tasks. Most tasks fall into more than one of the categories represented in Figure 7a.
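The synthetic-task setup and the exact-match accuracy described earlier in this section can be sketched as follows. The sketch makes several assumptions that the text does not specify: grids are padded at the bottom and right with zeros, translations wrap around circularly, and the random grids below merely stand in for grids sampled from ARC. All function names are hypothetical.

```python
import numpy as np

SIZE = 30  # grids are padded to 30 x 30 cells


def pad_grid(grid, size=SIZE):
    """Zero-pad a grid to size x size (padding at the bottom and right is assumed)."""
    padded = np.zeros((size, size), dtype=int)
    padded[: grid.shape[0], : grid.shape[1]] = grid
    return padded


def make_translation_task(grids, dy, dx):
    """One task = one fixed translation applied to every input grid."""
    pairs = []
    for grid in grids:
        x = pad_grid(grid)
        y = np.roll(x, (dy, dx), axis=(0, 1))   # circular translation (assumed)
        pairs.append((x, y))
    return pairs


def exact_match_accuracy(predictions, targets):
    """Fraction of predicted grids that match the ground truth exactly."""
    hits = [np.array_equal(p, t) for p, t in zip(predictions, targets)]
    return float(np.mean(hits))


rng = np.random.default_rng(0)
grids = [rng.integers(0, 10, size=(rng.integers(3, 12), rng.integers(3, 12))) for _ in range(4)]
task = make_translation_task(grids, dy=2, dx=5)
targets = [y for _, y in task]
inputs_as_predictions = [x for x, _ in task]
print(exact_match_accuracy(targets, targets))
print(exact_match_accuracy(inputs_as_predictions, targets))
```

Exact matching is a strict criterion: a prediction with a single wrong cell counts as a miss.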

B.2 EXPERIMENTS ON ARC

ARC can be regarded as a meta-learning benchmark, as it provides a set of training tasks and a set of unseen tasks to evaluate the performance of the model learned on the meta-training data. It is important to emphasize that we do not target this use case, as we instead use the same setup as in the synthetic data and learn each task from scratch using only its training set. Though simple and elegant, the supervised-learning formulation prevents our models from reusing knowledge that can be shared between different tasks. In order to mitigate this issue, we rely on a data-augmentation strategy. All models are evaluated based on the ratio of solved tasks and a task is considered solved if the model can predict the correct output grid for all examples in the test set.
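The evaluation criterion stated above, where a task counts as solved only if every test example is predicted correctly, can be written as a short sketch. The function names and the input layout (one list of predictions and one list of targets per task) are assumptions made for illustration.

```python
import numpy as np


def task_is_solved(predictions, targets):
    """A task is solved only if every predicted grid equals its target grid."""
    return len(predictions) == len(targets) and all(
        np.array_equal(p, t) for p, t in zip(predictions, targets)
    )


def solved_ratio(tasks):
    """tasks: list of (predictions, targets) pairs, one pair per task."""
    return sum(task_is_solved(p, t) for p, t in tasks) / len(tasks)


a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 0], [0, 0]])
tasks = [
    ([a, b], [a, b]),   # both test grids correct: solved
    ([a, a], [a, b]),   # one test grid wrong: not solved
]
print(solved_ratio(tasks))
```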

B.3 ADDITIONAL EXPERIMENTS ON IMAGE REGISTRATION

As an additional experiment, to assess the applicability of LATFORMER to natural images, we performed experiments on multimodal image registration, namely the problem of spatially aligning images from different modalities. Image registration is a well-studied problem in computer vision and we do not aim to establish state-of-the-art performance. The main purpose of this experiment is to give an indication of the applicability of our method to natural images beyond ARC. We refer the reader to SuperGlue (Sarlin et al., 2020) and COTR (Jiang et al., 2021) for examples of approaches specifically designed for this task. Popular approaches to multimodal image registration work in two stages: first, they learn a model that converts one modality into the other (or transfers both modalities into the same representation, as proposed by Pielawski et al. (2020)), then they align the two images using traditional techniques. We follow the experimental setup of Lu et al. (2021) and experiment with two datasets, one containing aerial views of an urban neighborhood and one containing cytological images. The images we employ are views of the same scene, but they are taken with different modalities and they are translated with respect to one another. We use the code of the authors to generate data involving only translations. Lu et al. (2021) additionally consider small rotations, but these transformations are not actions in the symmetry group of a lattice, so we are not interested in resolving them. We employ several state-of-the-art methods for modality translation and we compare our method to α-AMD (Lindblad & Sladoje, 2014) and SIFT (Lowe, 1999) based on the success rate metric defined by Lu et al. (2021). A registration is considered successful if the relative registration error (i.e., the residual distance between the reference patch and the transformed patch after registration, normalized by the height and width of the patch) is below 2%.
Table 4 reports our results on the image registration tasks and shows that our approach performs well on both datasets coupled with different methods for modality translation. We use the same models of Lu et al. (2021) for the modality translation task. Then, in order to solve the image registration task with LATFORMER, we divide each image into 30 × 30 patches and we run our model to predict the translation from one patch in an image to its counterpart in the corresponding image.

C LIMITATIONS AND FUTURE WORK

Although we believe our results are interesting and promising for learning group actions with neural networks, we would like to point out some limitations of our approach. First, our method is limited to actions on the symmetry group of the hypercubic lattice and it is not immediately extendable to other groups. For instance, though permutation matrices are still convolutions of the identity and they can be generated by a CNN, providing an architecture with predefined kernels that can compute any permutation matrix is not feasible. Second, the model is hard to fine-tune: we noticed that once the gates of the CNN have been trained, it is hard for the model to adapt to different actions. We believe that both limitations can be addressed while keeping the same overall idea of modulating attention weights using soft attention masks, possibly with a different parametrization of the masks. Future work will focus on this research direction and on extending our work to cover a wider set of the ARC tasks.

D DEFERRED PROOFS

We prove both Theorems 1 and 2 by induction on the dimensionality $m$ of the hypercubic lattice.

D.1 BASE CASE FOR THEOREMS 1 AND 2

First, it is useful to notice that whenever $M \in \{0, 1\}^{n \times n}$ has exactly a single 1 per row, in other words $M \cdot 1_n = 1_n$, then, for any $X \in \mathbb{R}^{n \times d}$,

\[
\mathrm{MaskedAttention}(X; M) = \frac{A}{A \cdot 1_n 1_n^\top} X = \frac{\mathrm{softmax}\left(\frac{XX^\top}{\sqrt{d}}\right) \odot M}{\left(\mathrm{softmax}\left(\frac{XX^\top}{\sqrt{d}}\right) \odot M\right) \cdot 1_n 1_n^\top} X = M \cdot X.
\]

Here the division is element-wise: each row of $A$ has a single nonzero entry, so normalizing by the row sums recovers $M$. In order to prove the theorem, we need to show that, for any action $g \in G_1$, including translations, reflections and rotations, there exists a mask $M_g$ such that:

\[
\mathrm{MaskedAttention}(X; M_g) = g \cdot X.
\]

Let us consider different families of actions separately.

Translation. As mentioned in Section 3.2, in the 1-dimensional case, a translation by one element to the right for a vector $x = (x_1, x_2, \ldots, x_n)^\top$ is given by the circulant permutation matrix:

\[
M = M_T^{(1)} = \begin{pmatrix}
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{pmatrix}.
\]

This holds because $M_T^{(1)} \cdot 1_n = 1_n$, so:

\[
\mathrm{MaskedAttention}(x; M_T^{(1)}) = M_T^{(1)} \cdot x = (x_n, x_1, x_2, \ldots, x_{n-1})^\top.
\]

In general, a translation by $\delta$ elements is given by the circulant matrix $M_T^{(\delta)} = (M_T^{(1)})^{\delta}$. Therefore, masks implementing translation operations exist in the 1-dimensional case and they are circulant permutation matrices. This is enough for a base case for Theorem 1. This comes directly from the time-shifting property of the Fourier transform.

Reflection. In the 1-dimensional case, the reflection of a vector $x = (x_1, x_2, \ldots, x_n)^\top$ is:

\[
\mathrm{MaskedAttention}(x; M_F) = M_F \cdot x = (x_n, x_{n-1}, \ldots, x_2, x_1)^\top
\quad \text{with} \quad
M_F = \begin{pmatrix}
0 & \cdots & 0 & 0 & 1 \\
0 & \cdots & 0 & 1 & 0 \\
0 & \cdots & 1 & 0 & 0 \\
\vdots & \cdots & \vdots & \vdots & \vdots \\
1 & \cdots & 0 & 0 & 0
\end{pmatrix}.
\]

The attention mask $M_F$ can be obtained by shifting the rows of the identity matrix by:

\[
o_F = \begin{pmatrix} n-1 \\ n-3 \\ n-5 \\ \vdots \\ 1 \end{pmatrix}.
\]

Therefore, by the time-shifting property of the Fourier transform we have:

\[
M_F = \mathcal{F}^{-1}\left( \mathcal{F}(I_n) \exp\left(-\frac{2\pi j}{n}\, o_F\, r_n^\top\right) \right).
\]

Rotation. Rotation (4-fold) is not defined in one dimension, so for a base case we need to consider the square lattice. Let $X \in \mathbb{R}^{l_1 \cdot l_2}$ be a vectorized representation of an $l_1 \times l_2$ matrix, with $n = l_1 \cdot l_2$. We need to define a vector $o_R^{(90)} \in \mathbb{R}^n$ such that:

\[
M_R^{(90)} = \mathcal{F}^{-1}\left( \mathcal{F}(I_n) \exp\left(-\frac{2\pi j}{n}\, o_R^{(90)}\, r_n^\top\right) \right)
\]

is a rotation mask. Since rotation is a permutation of the identity, we know the vector exists. As $X$ is vectorized, $o_R^{(90)}$ needs to take into account the size of the first dimension $l_1$. For example, in order to perform a rotation on a vectorized representation, we need to map the first element of $X$ to the position $(l_1 - 1)$. The reader can check that the vector given by

\[
(o_R^{(90)})_k = k \cdot (l_1 - 1) - \lfloor (k-1)/l_1 \rfloor
\]

satisfies the equation above.

Scaling. Although scaling is not a group action of the symmetry group of the lattice, we pointed out that it can still be defined within the same general formulation as the other transformations. We can take the 1-dimensional lattice as a base case and consider a vector $x = (x_1, x_2, \ldots, x_n)^\top$. Let $h \in \mathbb{N}$ be a parameter specifying the filter size of the scaling operation. As an example, for $h = 2$ we have:

\[
\mathrm{MaskedAttention}(x; M_S^{(h)}) = M_S^{(h)} \cdot x = (x_1, x_1, x_2, x_2, \ldots, x_{\lfloor n/2 \rfloor})^\top,
\]

where:

\[
M_S^{(h)} = \begin{pmatrix}
1 & 0 & \cdots & 0 & 0 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0
\end{pmatrix}.
\]

This kind of matrix can also be obtained by shifting the rows of the identity as follows:

\[
M_S^{(h)} = \mathcal{F}^{-1}\left( \mathcal{F}(I_n) \exp\left(-\frac{2\pi j}{n}\, o_S^{(h)}\, r_n^\top\right) \right),
\quad \text{where} \quad
(o_S^{(h)})_k = ((k-1) \bmod h) + (h-1) \cdot \lfloor (k-1)/h \rfloor.
\]
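The base case can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `masked_attention` and `mask_from_offsets` are hypothetical, indices are 0-based, and the sketch assumes that a positive offset shifts a row of the identity cyclically to the right. Under that assumed convention the translation mask corresponds to a constant offset of -1 and the scaling offsets are negated.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max subtraction for stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, M):
    # A = softmax(X X^T / sqrt(d)) * M, then each row is renormalized by its sum.
    d = X.shape[1]
    A = softmax(X @ X.T / np.sqrt(d)) * M
    return (A / A.sum(axis=1, keepdims=True)) @ X

def mask_from_offsets(o):
    # Shift row k of I_n cyclically by o[k] positions (assumed: to the right)
    # using the shift property of the discrete Fourier transform.
    n = len(o)
    r = np.arange(n)
    phase = np.exp(-2j * np.pi / n * np.outer(o, r))
    M = np.fft.ifft(np.fft.fft(np.eye(n), axis=1) * phase, axis=1)
    return np.rint(M.real)

n, d, h = 6, 4, 2
k = np.arange(n)                                          # 0-based row index
M_T = mask_from_offsets(-np.ones(n))                      # translation by one element
M_F = mask_from_offsets(n - 1 - 2 * k)                    # reflection: (n-1, n-3, ...)
M_S = mask_from_offsets(-(k % h + (h - 1) * (k // h)))    # scaling with h = 2

X = np.random.default_rng(0).normal(size=(n, d))
for M in (M_T, M_F, M_S):
    # Each mask has a single 1 per row, so masked attention reduces to M @ X.
    assert np.allclose(masked_attention(X, M), M @ X)
```

The final loop illustrates the identity at the start of this section: whenever the mask has exactly one 1 per row, the attention scores cancel in the normalization and only the permutation (or replication) encoded by the mask remains.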



The size of the groups we consider grows polynomially in n and exponentially in m. For simplicity, we omitted linear transformations in the definition of MaskedAttention in Section 3.1.



Figure 2: Examples of attention masks implementing transformations in two dimensions, including: (a) translation by 1 pixel on both axes, (b) rotation by 90° counterclockwise, (c) vertical reflection and (d) horizontal reflection around the center. White represents value 1 and black 0.

Figure 3: A LATFORMER layer (a) and an architecture for a Lattice translation expert (b). The LATFORMER layer (a) is a standard Transformer encoder layer augmented with a Lattice mask expert constrained to generate attention masks corresponding to a geometric transformation of the input. The Lattice translation expert (b) is a particular instance of a Lattice mask expert that produces translation masks. In the architecture above, every convolutional layer is meant to shift the input by a power of 2 and can be skipped by a gating function (denoted as σ).
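The gating scheme described in the caption of Figure 3 can be read as a binary decomposition of the shift. The sketch below is only an illustration under stated assumptions: the function name `translation_expert` is hypothetical, the convolutional layers with predefined kernels are replaced by `np.roll`, and each gate is assumed to interpolate linearly between the shifted and the unshifted mask.

```python
import numpy as np

def translation_expert(n, gates):
    # Start from the identity and apply gated cyclic shifts by 1, 2, 4, ...
    M = np.eye(n)
    for k, g in enumerate(gates):
        shifted = np.roll(M, 2 ** k, axis=0)   # stand-in for the k-th shifting layer
        M = g * shifted + (1.0 - g) * M        # g = 1 applies the shift, g = 0 skips it
    return M

# Hard gates (1, 0, 1) select shifts 1 and 4, i.e. a translation by 5.
hard = translation_expert(8, [1.0, 0.0, 1.0])

# Soft gates yield a convex combination of translation masks.
soft = translation_expert(8, [0.5, 0.2, 0.9])
```

With K layers and binary gates, this construction reaches every shift between 0 and 2^K - 1, while soft gates keep every row of the mask summing to one.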

Figure 6: Model architecture of all the mask experts that we considered. All models are CNNs applied to the identity matrix. In the figure, we use the following notation:
• $M_T^{(\delta)}$ denotes an attention mask implementing a translation by $\delta$ along one dimension;
• $M_R^{(90)}$ denotes an attention mask implementing a rotation by 90°;
• $M_F$ denotes an attention mask implementing a reflection along one dimension;
• M (h)

Figure 7: Distribution of the considered core knowledge priors across the ARC tasks (a) and user interface built to annotate the dataset (b).

Theorem 2, simply notice that:M (δ) T = F -1 F(I n ) exp(-

Comparison of LATFORMER with neural program synthesis methods with access to both input-output pairs and natural language descriptions on LARC.

Not requiring a carefully designed DSL comes at the expense of being restricted to tasks involving geometric priors, whereas program synthesis approaches can be used on a wider set of tasks. We also observe that the natural language descriptions marginally helped our model on one category of tasks. Our findings corroborate those of Acquaviva et al. (2021) in this respect.

At training time, for each model and every iteration, we augment each grid 10 times by mapping each color to a different color (using the same mapping across training examples). The rationale behind this data-augmentation strategy is that (1) we assume that tasks involving only geometric knowledge priors are not affected by color mapping and (2) all models (including LATFORMER) need to learn a function from d-dimensional color representations to categorical variables, hence it is beneficial if all colors are represented in the training set.

Results of the experiment on image registration. The rows represent different models trained to translate images from modality A to B (A → B) or vice versa (B → A).
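The color-remapping augmentation used at training time can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `augment_colors` is hypothetical, grids are assumed to be integer arrays with color indices in the range 0 to 9, and the mapping is assumed to be a random permutation of all colors (which may leave some colors unchanged).

```python
import numpy as np

def augment_colors(grids, num_colors=10, num_augmentations=10, seed=0):
    # grids: list of integer arrays with values in [0, num_colors).
    rng = np.random.default_rng(seed)
    augmented = []
    for _ in range(num_augmentations):
        # One random color mapping, shared by all grids of the task.
        perm = rng.permutation(num_colors)
        augmented.append([perm[g] for g in grids])
    return augmented

grids = [np.array([[0, 1], [2, 1]]), np.array([[1, 1, 3]])]
copies = augment_colors(grids)
```

Sharing one mapping across all grids of a task keeps input-output pairs consistent, so the geometric relation between them is unchanged while the colors vary.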


Figure 5: Sample efficiency of our method compared to the baselines on synthetic tasks on translation (a), rotation (b), reflection (c) and scaling (d). The y axis denotes the mean accuracy across tasks belonging to the same category, whereas the shaded error band is the standard deviation.

We assessed the performance of our model on the tasks that require only knowledge priors corresponding to the basic geometric transformations that we addressed in this work, namely translation, rotation, reflection and scaling. Table 2 shows our results compared to neural baselines, including CNNs, attention with relative positional encodings (Shaw et al., 2018), PixelCNN (Gul et al., 2019), and Transformers (Vaswani et al., 2017). We additionally compared to a Transformer model that has access to precomputed transformations of the input (Transformer + data augmentation). Precomputing all group actions is only feasible for smaller groups (rotation, reflection and scaling). Though we restrict ourselves to a subset of the tasks and there is room for improvement even on these tasks, we reach considerably better performance than the baselines. Therefore, we believe our results advocate for the applicability of end-to-end differentiable models even on problems requiring sample-efficient abstract reasoning. To the best of our knowledge, this is the first evidence of a neural network achieving this performance on ARC tasks.

5.3. COMPARISON WITH NEURAL PROGRAM SYNTHESIS ON LARC

Recently, Acquaviva et al. (2021) introduced the Language-complete Abstraction and Reasoning Corpus (LARC), which provides natural language descriptions of 88% of the ARC tasks, generated by human participants who were asked to communicate to other humans a set of precise instructions to solve a task. Acquaviva et al. (2021) evaluated several models based on neural program synthesis on LARC. All models generate symbolic programs from a carefully designed domain-specific language (DSL). LARC (IO) is a model that only has access to input-output pairs, like LATFORMER. LARC (IO + NL) has access to the natural language descriptions as well and uses a pre-trained T5 model (Raffel et al., 2020) to represent the text. LARC (IO + NL pseudo) uses pseudo-annotation to encourage the learning of compositional relationships between language and programs: during training, the model is given additional synthetic language-to-program pairs generated by annotating primitive examples in the DSL with linguistic comments.

In order to compare to the work of Acquaviva et al. (2021), we evaluated their models on the set of LARC tasks that correspond to ARC tasks in our subset requiring geometric knowledge priors. Additionally, following Acquaviva et al. (2021), we allowed LATFORMER to access the textual descriptions by using a pre-trained T5 model to generate a representation of the text. This embedding is provided as input both to the Lattice Mask Expert and the FFN layers of LATFORMER. We refer to this model as LatFormer + NL. Table 3 shows the results of our experiments on the LARC dataset. The program-synthesis methods require a training stage on a portion of the tasks. Therefore, the LATFORMER models were only evaluated on the same testing tasks of LARC, using the same train-test split as Acquaviva et al. (2021).
Overall, our results show that LATFORMER performs better than program synthesis on the subset of tasks requiring geometric priors, with no need for a carefully designed DSL.

D.2 INDUCTIVE STEP FOR THEOREMS 1 AND 2

Suppose that $M_{g_1} \in \{0, 1\}^{n_1 \times n_1}$ and $M_{g_2} \in \{0, 1\}^{n_2 \times n_2}$ are attention masks implementing actions $g_1 \in G_{m_1}$ and $g_2 \in G_{m_2}$ on some tensors. We have: Now notice that: and similarly Therefore, we conclude that performing masked attention with the mask $M_{g_1} \otimes M_{g_2}$ on $X$ is equivalent to applying $g_1$ on the first $m_1$ dimensions and $g_2$ on the last $m_2$ dimensions of $X$. This provides a way for building attention masks for higher-dimensional lattices using the primitive masks defined in Section D.1, proving both Theorems 1 and 2.
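The role of the Kronecker product in the inductive step can be illustrated numerically. The sketch below is a small NumPy check under stated assumptions, not part of the original proof: it assumes a row-major vectorization of a 2-dimensional grid and picks a reflection for the first axis and a translation for the second axis as example actions.

```python
import numpy as np

n1, n2 = 3, 4
X = np.arange(n1 * n2, dtype=float).reshape(n1, n2)

M_g1 = np.eye(n1)[::-1]                   # reflection along the first axis
M_g2 = np.roll(np.eye(n2), 1, axis=0)     # translation by one along the second axis

# Combined mask acting on the row-major vectorization of X.
M = np.kron(M_g1, M_g2)
out = (M @ X.reshape(-1)).reshape(n1, n2)

# Applying the two actions axis by axis gives the same result.
expected = np.roll(X[::-1, :], 1, axis=1)
assert np.array_equal(out, expected)

# The combined mask still has exactly one 1 per row.
assert np.array_equal(M.sum(axis=1), np.ones(n1 * n2))
```

Because the combined mask keeps a single 1 per row, the identity from Section D.1 applies to it directly, so masked attention with this mask again reduces to a multiplication by the mask.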

D.3 PROOF OF COROLLARY 1

The proof of Corollary 1 follows immediately from Theorem 2 and from the property of the Fourier transform according to which multiplying in the Fourier domain implements a convolution in the original domain.
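The Fourier property invoked here can be checked on a small example. The following NumPy sketch only illustrates the convolution theorem for cyclic signals; the helper name `circular_convolve` is hypothetical and the kernel is assumed to be a delta located at position 2, so that the convolution amounts to a cyclic shift by two elements.

```python
import numpy as np

def circular_convolve(x, kernel):
    # Direct cyclic convolution: y[i] = sum_j x[(i - j) mod n] * kernel[j].
    n = len(x)
    return np.array([sum(x[(i - j) % n] * kernel[j] for j in range(n))
                     for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.zeros(5)
kernel[2] = 1.0   # delta at position 2

# Pointwise product in the Fourier domain, then inverse transform.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)).real
direct = circular_convolve(x, kernel)
```

Both routes produce the same cyclically shifted signal, which is the sense in which a mask defined by a multiplication in the Fourier domain corresponds to a convolution of the identity.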

