MATRIX SHUFFLE-EXCHANGE NETWORKS FOR HARD 2D TASKS

Abstract

Convolutional neural networks have become the main tools for processing two-dimensional data. They work well for images, yet convolutions have a limited receptive field that prevents their application to more complex 2D tasks. We propose a new neural model, called the Matrix Shuffle-Exchange network, that can efficiently exploit long-range dependencies in 2D data and has speed comparable to a convolutional neural network. It is derived from the Neural Shuffle-Exchange network and has O(log n) layers and O(n^2 log n) total time and space complexity for processing an n × n data matrix. We show that the Matrix Shuffle-Exchange network is well-suited for algorithmic and logical reasoning tasks on matrices and dense graphs, exceeding convolutional and graph neural network baselines. Its distinct advantage is that it retains full long-range dependency modelling when generalizing to larger instances, far larger than could be processed with models equipped with a dense attention mechanism.

1. INTRODUCTION

Data often comes in the form of two-dimensional matrices. Neural networks are frequently used to process such data, usually with convolution as the primary operation. But convolutions are local, capable of analyzing only neighbouring positions in the data matrix. That is adequate for images, since neighbouring pixels are closely related, but not for data with more distant relationships. In this paper, we consider the problem of how to process 2D data efficiently in a way that allows both local and long-range relationship modelling, and propose a new neural architecture, called the Matrix Shuffle-Exchange network, to this end. The complexity of the proposed architecture is O(n^2 log n) for processing an n × n data matrix, which is significantly lower than the O(n^4) one would incur by applying attention (Bahdanau et al., 2014; Vaswani et al., 2017) in its pure form. The architecture is derived from the Neural Shuffle-Exchange networks (Freivalds et al., 2019; Draguns et al., 2020) by lifting their architecture from 1D to 2D. We validate our model on tasks with differently structured 2D input/output data. It can handle the complex data inter-dependencies present in algorithmic tasks on matrices such as transposition, rotation, arithmetic operations and matrix multiplication. Our model reaches perfect accuracy on test instances of the size it was trained on and generalizes to much larger instances. In contrast, a convolutional baseline can be trained only on small instances and does not generalize. Generalization is an important criterion for algorithmic tasks: it indicates that the model has learned an algorithm rather than merely fitting the training data. Our model can also be used for processing graphs by representing a graph with its adjacency matrix.
It has a significant advantage over graph neural networks (GNNs) in the case of dense graphs with additional data associated with graph edges (for example, edge length), since GNNs typically attach data to vertices, not edges. We demonstrate that the proposed model can infer important local and non-local graph concepts by evaluating it on component labelling, triangle finding and transitivity tasks. It reaches perfect accuracy on test instances of the size it was trained on and generalizes to larger instances, while the GNN baseline struggles to learn these concepts even on small graphs. The model can also perform the complex logical reasoning required to solve Sudoku puzzles. It achieves 100% correct solutions on easy puzzles and 96.6% on hard puzzles, which is on par with a state-of-the-art deep learning model specifically tailored for logical reasoning tasks.
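The graph-to-matrix encoding described above can be illustrated with a short sketch. The exact feature layout used by the model is not specified here, so the function below is only our assumption of a minimal encoding: edge data such as lengths is stored directly in the matrix cells, which is exactly the kind of information that vertex-centric GNNs handle awkwardly.

```python
import numpy as np

def graph_to_matrix(n, edges):
    """Encode an undirected weighted graph on n vertices as an n x n matrix.

    `edges` is a list of (u, v, weight) triples. Each edge attribute
    (here, a single weight such as an edge length) lives in the matrix
    cell for that vertex pair; absent edges are left as zero.
    """
    m = np.zeros((n, n), dtype=np.float32)
    for u, v, w in edges:
        m[u, v] = w
        m[v, u] = w  # undirected graph: keep the matrix symmetric
    return m

# Hypothetical 4-vertex graph with edge lengths as the edge data.
adj = graph_to_matrix(4, [(0, 1, 2.5), (1, 2, 1.0), (0, 3, 4.0)])
```

Richer edge data generalizes naturally: with d features per edge the input simply becomes an n × n × d tensor.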

2. RELATED WORK

Convolutional Neural Networks (CNNs) are the primary tools for processing data with a 2D grid-like topology. For instance, VGG (Simonyan & Zisserman, 2014) and ResNet (He et al., 2016a) enable high-accuracy image classification. Convolution is an inherently local operation, which limits CNN use for long-range dependency modelling. The problem can be mitigated by using dilated (atrous) convolutions, which have an expanded receptive field. Such an approach works well for image segmentation (Yu & Koltun, 2015) but is not suitable for algorithmic tasks where generalization to larger inputs is crucial. The attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017) is a widespread way of addressing the long-range dependency problem in sequence tasks. Unfortunately, its application to 2D data is limited by its high O(n^4) time complexity for an n × n input matrix. Various sparse attention mechanisms have been proposed to deal with the quadratic complexity of dense attention by attending only to a small predetermined subset of locations (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020). Reformer (Kitaev et al., 2020) uses locality-sensitive hashing to approximate attention in O(n log n) time. Linformer (Wang et al., 2020) approximates the original attention in linear time by creating a low-rank factorization of the attention matrix. Sparse attention achieves great results on language modelling tasks, yet its application to complex data, where attending to the entire input is required, is limited. Graph Convolutional Neural Networks (Micheli, 2009; Atwood & Towsley, 2016) generalize the convolution operation from grids to graphs. They have emerged as powerful tools for processing graphs with complex relations (see Wu et al. (2019) for an extensive survey). Such networks have successfully been applied to image segmentation (Gong et al., 2019), program reasoning (Allamanis et al., 2018), and combinatorial optimization (Li et al., 2018) tasks.
Nonetheless, Xu et al. (2018) have shown that Graph Convolutional Networks may fail to distinguish some simple graph structures. To alleviate this problem, they introduce the Graph Isomorphism Network, which is as powerful as the Weisfeiler-Lehman graph isomorphism test and is the most expressive among graph neural network models. Neural algorithm synthesis and induction is a widely explored topic for 1D sequence problems (Abolafia et al., 2020; Freivalds et al., 2019; Freivalds & Liepins, 2018; Kaiser & Sutskever, 2015; Draguns et al., 2020), but for 2D data only a few works exist. Shin et al. (2018) have proposed Karel program synthesis from input-output image pairs and an execution trace. The Differentiable Neural Computer (Graves et al., 2016), which employs external memory, has been applied to the SHRDLU puzzle game and to shortest-path finding and traversal tasks on small synthetic graphs. Several neural network architectures have been developed for learning to play board games (Silver et al., 2018), including chess (David et al., 2016) and Go (Silver et al., 2016; 2017), often using complex architectures or reinforcement learning.

3. 1D SHUFFLE-EXCHANGE NETWORKS

Here we review the Neural Shuffle-Exchange (NSE) network for sequence-to-sequence processing, recently introduced by Freivalds et al. (2019) and revised by Draguns et al. (2020). This architecture offers an efficient alternative to the attention mechanism and allows modelling of long-range dependencies in sequences of length n in O(n log n) time. The NSE network is a neural adaptation of the well-known Shuffle-Exchange and Beneš interconnection networks, which allow linking any two devices using a logarithmic number of switching layers (see Dally & Towles (2004) for an excellent introduction). The NSE network works on sequences of length n = 2^k, where k ∈ Z+, and consists of alternating Switch and Shuffle layers. Although all the layers have the same structure, a network formed of 2k − 2 Switch and Shuffle layers can learn a broad class of functions, including arbitrary permutations of elements. Such a network is called a Beneš block. A deeper and more expressive network may be obtained by stacking several Beneš blocks; for most tasks, two blocks are enough. The first k − 1 Switch and Shuffle layers of the Beneš block form the Shuffle-Exchange block; the remaining k − 1 Switch and Inverse Shuffle layers form its mirror counterpart. In the Beneš block, only Switch layers have learnable parameters, and weight sharing is employed between layers of the same
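The Shuffle and Inverse Shuffle layers apply fixed permutations with no learnable parameters; the standard perfect-shuffle (riffle) formulation of the Shuffle-Exchange topology can be sketched in a few lines of Python. This is an illustration of the permutation itself, not the authors' implementation:

```python
def shuffle(seq):
    """Perfect riffle shuffle: interleave the two halves of seq.

    Equivalently, for len(seq) = 2^k, the element at position i moves
    to the position given by cyclically rotating i's k-bit binary
    representation left by one bit.
    """
    h = len(seq) // 2
    return [x for pair in zip(seq[:h], seq[h:]) for x in pair]

def inverse_shuffle(seq):
    """Undo shuffle: even positions form the first half, odd positions the second."""
    return seq[0::2] + seq[1::2]

seq = list(range(8))                         # n = 8, so k = 3
assert shuffle(seq) == [0, 4, 1, 5, 2, 6, 3, 7]
assert inverse_shuffle(shuffle(seq)) == seq  # the mirror half of a Benes block undoes it
```

After each Shuffle layer a Switch layer pairs adjacent elements, so repeating the pattern k − 1 times lets any position interact with any other, which is the source of the O(n log n) complexity stated above.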

