CHORDMIXER: A SCALABLE NEURAL ATTENTION MODEL FOR SEQUENCES WITH DIFFERENT LENGTHS

Abstract

Sequential data naturally have different lengths in many domains, and some sequences are very long. As an important modeling tool, neural attention should capture long-range interactions in such sequences. However, most existing neural attention models admit only short sequences, or they have to employ chunking or padding to enforce a constant input length. Here we propose a simple neural network building block called ChordMixer which can model attention for long sequences with variable lengths. Each ChordMixer block consists of a position-wise rotation layer without learnable parameters and an element-wise MLP layer. Repeatedly applying such blocks forms an effective network backbone that mixes the input signals towards the learning targets. We have tested ChordMixer on the synthetic adding problem, long document classification, and DNA sequence-based taxonomy classification. The experimental results show that our method substantially outperforms other neural attention models.

1. INTRODUCTION

Sequential data appear widely in data science. In many domains, the sequences have a diverse distribution of lengths. For example, a text can be as short as an SMS limited to 160 characters or as long as a novel with over 500,000 words. In biology, the median human gene length is about 24,000 base pairs (Fuchs et al., 2014), while the shortest is 76 (Sharp et al., 1985) and the longest is at least 2,300,000 (Tennyson et al., 1995). Meanwhile, long-range interactions between DNA elements are common and can span up to 20,000 bases (Gasperini et al., 2020). Modeling interactions in such sequences is a fundamental problem in machine learning and poses great challenges to attention approaches based on deep neural networks.

Most existing neural attention methods cannot handle long sequences with different lengths. Architectures such as the Transformer and its variants usually assume a constant input length for efficient batch processing. Otherwise, they have to use chunking, resampling, or padding to enforce the same input length. However, these approaches either lose much information or cause substantial waste in storage and computation. Even though some architectures, such as scaled dot-product attention (Vaswani et al., 2017), can deal with short sequences of variable lengths, they do not scale to very long sequences.

In this paper, we propose a novel neural attention model called ChordMixer to overcome the above drawbacks. The new neural network takes a sequence of any length as input and outputs a tensor of the same size. Moreover, ChordMixer is scalable to very long sequences (we demonstrate lengths up to 1.5M in our experiments). ChordMixer comprises several modules, or blocks, with a simple and identical architecture. Each block has a Multi-Layer Perceptron (MLP) layer over the sequence channels and a multi-scale rotation layer over the element positions. The rotation has no learnable parameters, and therefore the model size of each block is independent of the sequence length. After log_2 N ChordMixer blocks, every number in the output has a full receptive field covering all input numbers of a length-N sequence.
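The receptive-field claim can be checked with a small dependency-tracking sketch. It rests on assumptions that are not stated in the text above: N is a power of two, the channels are split into log_2 N + 1 tracks, track 0 stays in place, and track j >= 1 is circularly shifted by 2^(j-1) positions. The function names `receptive_fields` and `blocks_until_full` are hypothetical and chosen only for illustration. No actual MLP is computed; the code records only which input positions can influence each output position.

```python
def receptive_fields(n, num_blocks):
    """Return, for each position, the set of input positions it depends on.

    Assumed rotation scheme (illustrative, not taken from the text):
    log2(n) + 1 tracks, with shifts 0, 1, 2, 4, ..., n / 2.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two, n >= 2")
    num_tracks = n.bit_length()  # equals log2(n) + 1 for powers of two
    shifts = [0] + [2 ** (j - 1) for j in range(1, num_tracks)]

    # Before any block, position i depends only on input i.
    deps = [{i} for i in range(n)]
    for _ in range(num_blocks):
        # Element-wise MLP: mixes channels within one position, so every
        # track at position i carries the full dependency set deps[i].
        # Rotation: track j at position i then receives the content that
        # was at position (i + shift_j) % n. The shift direction is an
        # arbitrary choice here and does not affect the block count.
        deps = [
            set().union(*(deps[(i + s) % n] for s in shifts))
            for i in range(n)
        ]
    return deps


def blocks_until_full(n):
    """Smallest number of blocks after which every position sees all inputs."""
    k = 0
    while any(len(d) < n for d in receptive_fields(n, k)):
        k += 1
    return k


if __name__ == "__main__":
    for n in (8, 16, 64):
        print(n, blocks_until_full(n))
```

Under these assumptions, the number of blocks needed matches log_2 N because any offset between 0 and N - 1 can be written as a sum of at most log_2 N distinct powers of two, one per block.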



Code is publicly available at https://github.com/RuslanKhalitov/ChordMixer
List of longest novels: https://en.wikipedia.org/wiki/List_of_longest_novels

