CHORDMIXER: A SCALABLE NEURAL ATTENTION MODEL FOR SEQUENCES WITH DIFFERENT LENGTHS

Abstract

Sequential data naturally have different lengths in many domains, and some sequences are very long. As an important modeling tool, neural attention should capture long-range interactions in such sequences. However, most existing neural attention models admit only short sequences, or they have to employ chunking or padding to enforce a constant input length. Here we propose a simple neural network building block called ChordMixer, which can model attention for long sequences with variable lengths. Each ChordMixer block consists of a position-wise rotation layer without learnable parameters and an element-wise MLP layer. Repeatedly applying such blocks forms an effective network backbone that mixes the input signals towards the learning targets. We have tested ChordMixer on the synthetic adding problem, long document classification, and DNA sequence-based taxonomy classification. The experimental results show that our method substantially outperforms other neural attention models.¹

1. INTRODUCTION

Sequential data appear widely in data science. In many domains, the sequences have a diverse distribution of lengths. For example, text information can be as short as an SMS limited to 160 characters or as long as a novel with over 500,000 words². In biology, the median human gene length is about 24,000 base pairs (Fuchs et al., 2014), while the shortest is 76 (Sharp et al., 1985) and the longest is at least 2,300,000 (Tennyson et al., 1995). Meanwhile, long-range interactions between DNA elements are common and can be up to 20,000 bases away (Gasperini et al., 2020). Modeling interactions in such sequences is a fundamental problem in machine learning and brings great challenges to attention approaches based on deep neural networks.

Most existing neural attention methods cannot handle long sequences with different lengths. For efficient batch processing, architectures such as Transformer and its variants usually assume a constant input length. Otherwise, they have to use chunking, resampling, or padding to enforce the same input length. However, these enforcing approaches either lose much information or cause substantial waste in storage and computation. Even though some architectures, such as scaled dot-product attention (Vaswani et al., 2017), can deal with short sequences with variable lengths, they are not scalable to very long sequences.

In this paper, we propose a novel neural attention model called ChordMixer to overcome the above drawbacks. The new neural network takes a sequence of any length as input and outputs a tensor of the same size. Moreover, ChordMixer is scalable to very long sequences (we demonstrate lengths up to 1.5M in our experiments). ChordMixer comprises several modules or blocks with a simple and identical architecture. Each block has a Multi-Layer Perceptron (MLP) layer over the sequence channels and a multi-scale rotation layer over the element positions. The rotation has no learnable parameters, and therefore the model size of each block is independent of the sequence length. After $\log_2 N$ ChordMixer blocks, every number in the output has a full receptive field of all input numbers for a sequence of length $N$.

We compared ChordMixer with Transformer and many of its variants in three tasks over sequential data: the synthetic adding problem, long document classification, and DNA sequence-based taxonomy classification. Our method achieves the best results in nearly all tasks, which indicates that ChordMixer mixes the signals well in long sequences with variable lengths and can serve as a transformation backbone in place of conventional neural attention models.

The next section briefly reviews the definitions and related work. We present ChordMixer, including its design, properties, and implementation details, in Section 3. Settings and results of three groups of experiments are provided in Section 4. In Section 5, we conclude the paper and discuss future work.

2. BACKGROUND AND RELATED WORK

A sequential data instance, or a sequence, is a one-dimensional array of sequence elements or tokens. In this work, tokens are represented by vectors of the same dimensionality. Therefore a sequence can be treated as a matrix $x \in \mathbb{R}^{d \times N}$, where $N$ is the sequence length and $d$ is the token dimensionality (or the number of channels). Each sequence in a data set may have a different length, and the distribution of $N$ can span a wide range.

Neural attention or mixing is a basic function of a neural network, which transforms a data tensor into another tensor toward the learning targets. In the self-attention setting, the input and output tensors usually have the same size. To avoid losing information, neural attention should have a full receptive field; that is, each output number can receive information from all input numbers. However, naively connecting each pair of input and output numbers is infeasible. Self-attention of a sequence $x$ with the naive implementation requires $N^2 d^2$ connections, which is too expensive when $Nd$ is large.

Most existing neural attention methods employ two-stage mixing to relieve the expense due to the full connections. For example, the widely used Transformer model (Vaswani et al., 2017) alternates the token-wise and position-wise mixing steps. The connections are limited in each step, but the receptive field becomes full after one alternation. However, Transformer has a cost that is quadratic in the sequence length because it fully connects every token pair.

Numerous approximation methods of Transformer have been proposed to reduce the quadratic cost. For example, Longformer (Beltagy et al., 2020) and ETC (Ainslie et al., 2020) use a learnable side memory module that can access multiple tokens at once; Nyströmformer (Xiong et al., 2021) uses a few landmarks as surrogates to the massive tokens; downsampling methods also include Perceiver (Jaegle et al., 2021) and Swin Transformer (Liu et al., 2021; 2022); Performer (Choromanski et al., 2020) and Random Feature Attention (Peng et al., 2021) approximate the softmax kernel by low-rank matrix products; Switch Transformer (Fedus et al., 2021) and Big Bird (Zaheer et al., 2020) use sparse dot-products at multiple layers. A more thorough survey can be found in Tay et al. (2022). However, the approximation methods still follow the scaled dot-product approach in the original Transformer, and thus their performance remains mediocre or inferior (Gu et al., 2022; Khalitov et al., 2022; Yu et al., 2022a).
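The cost argument in this section can be made concrete with a rough count of dense connections. The sketch below is illustrative only: the function names are hypothetical, and the two-stage count is a deliberately crude model (one dense token-wise step plus one dense position-wise step), not a FLOP count of any particular Transformer implementation.

```python
def naive_full_connections(n: int, d: int) -> int:
    """Connections when each of the N*d input numbers feeds each of the N*d outputs."""
    return (n * d) ** 2


def two_stage_connections(n: int, d: int) -> int:
    """Connections for one token-wise step followed by one position-wise step.

    Assumed simplification: the token-wise step densely mixes the d channels
    of every token (d*d per token), and the position-wise step densely mixes
    all N positions of every channel (N*N per channel).
    """
    token_wise = n * d * d
    position_wise = d * n * n
    return token_wise + position_wise


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    for n, d in [(1_000, 64), (2_000, 64)]:
        print(n, d, naive_full_connections(n, d), two_stage_connections(n, d))
```

Under these assumptions, N = 1000 and d = 64 give about 4.1 × 10^9 connections for the naive scheme and about 6.8 × 10^7 for the two-stage scheme. The two-stage count is dominated by its position-wise term, which still grows quadratically with N.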

3. CHORDMIXER

In this work, we go beyond the scaled dot-product approach and aim to develop a neural attention model with the following properties:

• full-receptive field: every output number is mixed directly or indirectly from all input numbers of a sequence;
• scalability: the new method can give accurate predictions for very long sequences;
• decentrality: the new design is decentralized, where no sequence element or position is more central or closer to the output;
• length flexibility: the model can handle sequences of diverse lengths without extra preprocessing such as chunking, resampling, or padding.
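The full-receptive-field property can be checked on toy sizes with a small reachability simulation. This is a sketch under assumptions that this part of the text does not state: the sequence length is a power of two, the rotation layer shifts channel groups circularly by 0, 1, 2, 4, ..., N/2 positions, and the element-wise MLP mixes all channel groups at each position. The function name `blocks_until_full` is hypothetical, and the code tracks only which inputs can influence which outputs, not actual values.

```python
def blocks_until_full(n: int) -> int:
    """Count rotate-then-mix blocks until every output can see every input.

    ASSUMPTIONS (illustrative, not taken from the text): n is a power of two
    and the rotation shifts are 0, 1, 2, 4, ..., n/2, applied circularly.
    """
    shifts = [0] + [2 ** k for k in range(n.bit_length() - 1)]
    # reach[i] is the set of input positions that can influence position i.
    reach = [{i} for i in range(n)]
    blocks = 0
    while any(len(r) < n for r in reach):
        # Rotation: the group shifted by s brings position (i + s) % n to i.
        # Element-wise mixing: position i takes the union over all groups.
        reach = [
            set().union(*(reach[(i + s) % n] for s in shifts))
            for i in range(n)
        ]
        blocks += 1
    return blocks


if __name__ == "__main__":
    for n in (8, 16):
        print(n, blocks_until_full(n))
```

With these assumed shifts, every position is treated identically (no position is closer to the output), and the simulation reaches a full receptive field after 3 blocks for N = 8 and 4 blocks for N = 16, matching the $\log_2 N$ bound stated in the introduction.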



¹ Code is publicly available at https://github.com/RuslanKhalitov/ChordMixer
² https://en.wikipedia.org/wiki/List_of_longest_novels

