INFUSING LATTICE SYMMETRY PRIORS IN NEURAL NETWORKS USING SOFT ATTENTION MASKS

Abstract

Infusing inductive biases and knowledge priors in artificial neural networks is a promising approach for achieving sample efficiency in current deep learning models. Core Knowledge priors of human intelligence have been studied extensively in developmental science, and recent work has postulated the idea that research on artificial intelligence should revolve around the same basic priors. As a step in this direction, in this paper we introduce LATFORMER, a model that incorporates lattice geometry and topology priors in attention masks. Our study of the properties of these masks motivates a modification to the standard attention mechanism, where attention weights are scaled using soft attention masks generated by a convolutional neural network. Our experiments on ARC and on synthetic visual reasoning tasks show that LATFORMER requires two orders of magnitude less data than standard attention and transformers on these tasks. Moreover, our results on ARC tasks that incorporate geometric priors provide preliminary evidence that deep learning can tackle this complex dataset, which is widely viewed as an important open challenge for AI research.

1. INTRODUCTION

Infusing inductive biases and knowledge priors in neural networks is regarded as a critical step towards improving their sample efficiency (Battaglia et al., 2018; Bengio, 2017; Lake et al., 2017; Lake & Baroni, 2018; Bahdanau et al., 2019). The Core Knowledge priors of human intelligence have been studied extensively in developmental science (Spelke & Kinzler, 2007), following the theory that humans are endowed with a small number of separable systems of core knowledge, so that new flexible skills and belief systems can build on these core foundations. Recent research in artificial intelligence (AI) has postulated that the same priors should be incorporated in AI systems (Chollet, 2019), but how to incorporate these priors into neural networks remains an open question.

Following this line of thought, the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019) was proposed as an AI benchmark built on top of the Core Knowledge priors from developmental science. Chollet (2019) posits that ARC "cannot be meaningfully approached by current machine learning techniques, including Deep Learning". Further, he argues that developing a domain-specific approach based on the Core Knowledge priors is a challenging first step and that "solving this specific subproblem is critical to general AI progress".

An important category of Core Knowledge priors comprises geometry and topology priors. Indeed, significant attention has been devoted to incorporating such priors into deep learning architectures by rendering neural networks invariant (or equivariant) to transformations represented through group actions (Bronstein et al., 2021). Group-invariant learning helps to build models that systematically ignore specific transformations applied to the input (such as translations or rotations). We take a complementary perspective and aim to help neural networks learn functions that apply geometric transformations to their input (rather than being invariant to such transformations).
In particular, we focus on group actions that belong to the symmetry group of a lattice. These transformations are pervasive in machine learning applications, as basic transformations of sequences, images, and other higher-dimensional regular grids fall into this category. While attention and transformers can in principle learn this kind of group action, we show that they require a significant amount of training data to do so.

Figure 1: We consider problems that involve learning a geometric transformation of the input data as a sub-problem. The displayed task (taken from ARC) entails learning to map, for each pair, the left image to the right image. We investigate how to solve such tasks more sample-efficiently by imbuing self-attention with the ability to exploit lattice symmetry priors.

To address this sample-complexity issue, we introduce LATFORMER, a model that relies on attention masks in order to learn actions belonging to the symmetry group of a lattice, such as translation, rotation, reflection, and scaling, in a differentiable manner. We show that, for any such action, there exists an attention mask such that an untrained self-attention mechanism initialized to the identity function performs that action. We further prove that these attention masks can be expressed as convolutions of the identity, which motivates a modification to the standard attention module in which the attention weights are modulated by a mask generated by a convolutional neural network (CNN).

Our paper focuses on ARC and its variants; we see the extension of LATFORMER to other tasks as a promising avenue for future research. We evaluated our approach on synthetic tasks, on ARC, and on the recently proposed LARC (Acquaviva et al., 2021). First, to probe the sample efficiency of our method, we compared its ability to learn synthetic geometric transformations against that of Transformers and attention modules.
Then, we annotated ARC tasks according to the knowledge priors they require, and we evaluated LATFORMER on the ARC and LARC tasks requiring geometric knowledge priors. Our results provide evidence that LATFORMER can learn geometric transformations with two orders of magnitude less training data than transformers and attention. We also provide the first neural network to reach good performance on a subset of ARC, suggesting that this kind of problem does not lie beyond the reach of deep learning models.
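The mask-modulated attention described above can be sketched in a few lines. The following is a minimal NumPy sketch under our own simplifying assumptions: the mask is supplied explicitly (whereas in LATFORMER it is generated by a CNN), we use a hard permutation mask rather than a learned soft one, and we renormalize the modulated weights so that each row sums to one.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, M, eps=1e-9):
    # Standard scaled dot-product attention whose weights are scaled
    # element-wise by a soft mask M with entries in [0, 1].
    d = Q.shape[-1]
    W = softmax(Q @ K.T / np.sqrt(d))
    W = W * M
    # Renormalize so each row of the modulated weights sums to ~1
    # (an illustrative assumption, not necessarily the paper's choice).
    W = W / (W.sum(axis=-1, keepdims=True) + eps)
    return W @ V

# An untrained attention module (zero queries/keys give uniform weights)
# combined with a hard translation mask acts as a cyclic shift of V.
V = np.arange(8.0).reshape(4, 2)
shift_mask = np.roll(np.eye(4), 1, axis=1)  # row i attends to position (i+1) mod 4
out = masked_attention(np.zeros((4, 2)), np.zeros((4, 2)), V, shift_mask)
```

Here `out` equals `V` cyclically shifted up by one row, illustrating the claim that a fixed mask can make an untrained attention module perform a lattice translation.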

2. FORMALIZING THE GROUP-ACTION LEARNING PROBLEM

We are interested in helping neural networks learn lattice transformations sample-efficiently by infusing knowledge priors into the model. Motivated by ARC, we focus on learning geometric transformations that belong to the symmetry group of a lattice. This pertains to the more general problem of learning group actions given the input and the output of the transformation. Concretely, we consider input-output transformations involving a group element g taken from some known group G that can be expressed under the general formulation:

    y = f(g • x, x) for some g = g(x) ∈ G    (group-action learning)

Above, x ∈ R^d_in and y ∈ R^d_out are input and output examples, f and g are unknown functions, and • denotes the application of a group action. As seen, the group element g can depend on the input data itself. More generally, the function f may depend on more than one transformation of x based on elements belonging to various groups of interest.

A simple instance of the group-action learning problem is presented in Figure 1. The example task is borrowed from ARC (Chollet, 2019) and entails learning to fill in the yellow patches in the leftmost image (input) so that the resulting image satisfies a 90-degree rotation symmetry. The learner is given only a small set of input-output pairs (the ARC tasks have 3.3 training examples on average) and the prior knowledge of discrete two-dimensional point groups, one of which is the cyclic group of 4-fold rotations C4. Though the task is challenging for a general neural network (due to the small number of samples), under the rotation prior it can be easily solved by the composition of a shallow


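To make the connection between lattice symmetries and attention masks concrete, the sketch below constructs the hard mask for a 90-degree rotation of a flattened n×n grid: the rotation is a permutation of pixels, so it is representable as a 0/1 attention mask over the n^2 positions. The helper name and the counter-clockwise convention (matching NumPy's rot90) are our illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def rotation90_mask(n):
    # Hypothetical helper: permutation matrix mapping a flattened n x n
    # grid to its 90-degree counter-clockwise rotation. Each output
    # pixel (row of M) attends to exactly one input pixel.
    M = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            # A counter-clockwise rotation sends input pixel (i, j)
            # to output pixel (n-1-j, i).
            M[(n - 1 - j) * n + i, i * n + j] = 1.0
    return M

x = np.arange(9.0).reshape(3, 3)
M = rotation90_mask(3)
y = (M @ x.reshape(-1)).reshape(3, 3)  # equals np.rot90(x)
```

Applying this mask to the flattened image reproduces the rotated grid, so the C4 rotation prior of the example task reduces to selecting (or generating) the right mask.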