TRANSFORMING RECURRENT NEURAL NETWORKS WITH ATTENTION AND FIXED-POINT EQUATIONS

Anonymous

Abstract

The Transformer has recently achieved state-of-the-art performance on multiple Natural Language Processing tasks. Yet the Feed-Forward Network (FFN) in a Transformer block is computationally expensive. In this paper, we present a framework to transform Recurrent Neural Networks (RNNs) and their variants into self-attention-style models, based on an approximation derived from the Banach Fixed-Point Theorem. Within this framework, we propose a new model, StarSaber, by solving a set of equations obtained from an RNN via the Fixed-Point Theorem and further approximating the solution with a Multi-Layer Perceptron, which provides a principled view of stacking layers. StarSaber outperforms both the vanilla Transformer and an improved variant called ReZero on three datasets, and it is more computationally efficient because it removes the Transformer's FFN layer. The model has two major parts. The first is a way to encode position information with two different matrices: for every position in a sequence, one matrix operates on the positions before it and another on the positions after it. The second is the introduction of direct paths from the input layer to every hidden layer. Ablation studies show the effectiveness of both parts. We additionally show that other RNN variants, such as gated RNNs, can be transformed in the same way and likewise outperform both kinds of Transformers.

1. INTRODUCTION

The Recurrent Neural Network (RNN) has been widely applied to various tasks in the last decade, such as Neural Machine Translation (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014), Text Classification (Zhou et al., 2016), Named Entity Recognition (Zhang & Yang, 2018; Chiu & Nichols, 2016), Machine Reading Comprehension (Hermann et al., 2015; Kadlec et al., 2016) and Natural Language Inference (Chen et al., 2017; Wang et al., 2017). The models applied to these tasks are usually not vanilla RNNs but two famous variants in which gates play an important role: the Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997). RNNs are hard to parallelize. Nor are they bidirectional: a word cannot utilize information from the words that come after it. A common workaround is to reverse the input sequence and combine the results of two different RNN encoders by concatenation or addition. However, the Transformer (Vaswani et al., 2017) provides a better solution. It is based purely on the attention mechanism, which has been widely used in Neural Machine Translation since Bahdanau et al. (2014). Models based on self-attention are mostly the Transformer and its variants, such as Transformer-XL (Dai et al., 2019), Universal Transformer (Dehghani et al., 2019) and Star-Transformer (Guo et al., 2019). Compared with recurrent units such as GRU and LSTM, self-attention-style models can be computed in parallel, which makes them better suited to large-scale training. But each of these Transformers contains an FFN layer with a very high hidden dimension, which remains the bottleneck for computational efficiency. In this paper, we present a new framework based on the Banach Fixed-Point Theorem to transform the vanilla RNN and its variants into self-attention-style models.
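The core idea can be sketched in a few lines: given a relation matrix over positions, repeatedly apply the recurrence update until the hidden states stop changing. This is an illustrative toy, not the paper's exact equations; the update rule, weight names and contraction condition (a small spectral norm for W so the map is a contraction) are our assumptions.

```python
import numpy as np

def fixed_point_rnn(x, A, W, U, n_iters=50):
    """Iterate h <- tanh(x @ U + (A @ h) @ W) towards a fixed point.

    x: (seq_len, d) inputs; A: (seq_len, seq_len) row-normalized relation
    matrix; W, U: (d, d) weights. If ||A|| * ||W|| < 1, the update is a
    contraction and the Banach Fixed-Point Theorem guarantees convergence.
    """
    h = np.zeros_like(x)
    for _ in range(n_iters):
        h = np.tanh(x @ U + (A @ h) @ W)
    return h
```

Unrolling a fixed, small number of these iterations with distinct weights per iteration is exactly what yields a stack of self-attention-style layers.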
StarSaber, one such transformed model, outperforms both the vanilla Transformer and ReZero (Bachlechner et al., 2020) in our experiments with fewer parameters and thus less computation. To start with, we need a different view of attention. Attention is a way to build a relation graph between words, and the vanilla RNN is simply a model whose relation graph is a chain. This graph is represented by an adjacency matrix, computed by mapping each pair of positions to a positive real number and then normalizing the numbers related to each position (those in the same row of the adjacency matrix) so that they sum to one. The vanilla RNN updates hidden states along a chain: the hidden state at each position depends only on that of the previous position. With a full relation graph, however, the hidden state at each position depends on the hidden states at all other positions in the sequence. This dependency yields a set of equations. In our view, a bidirectional RNN is defined by these equations, and the Banach Fixed-Point Theorem inspires us to iterate according to them. Once we fix the number of iterations and assign distinct weights to each, we obtain a self-attention-style model. In the Transformer, Position Embedding (PE), which captures word-order information by adding a position-dependent matrix to the input, is indispensable. In StarSaber, by contrast, position encoding happens in the aggregation step after the relation graph is constructed. For each position, we sum linear transformations of the hidden states at all positions, weighted by the corresponding entries of the relation matrix, to obtain an attention vector. In this computation we use different linear transformation weights for the "future" and the "past".
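The aggregation step above can be sketched as follows: bilinear scores are row-normalized into a relation matrix, and positions before versus at-or-after each position are aggregated through two separate matrices. The names Wb, Wp, Wf are ours for illustration, and whether the diagonal belongs to "past" or "future" is a choice we assume here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def directional_attention(h, Wb, Wp, Wf):
    """Bilinear attention with direction-aware aggregation (a sketch).

    Scores s_ij = h_i^T Wb h_j are row-normalized into a relation matrix A;
    positions j < i are aggregated through Wp ("past") and positions j >= i
    through Wf ("future"), encoding word order without position embeddings.
    """
    A = softmax(h @ Wb @ h.T, axis=-1)     # (n, n) relation graph, rows sum to 1
    past = np.tril(A, k=-1) @ (h @ Wp)     # contributions from j < i
    future = np.triu(A, k=0) @ (h @ Wf)    # contributions from j >= i
    return past + future                   # attention vectors, shape (n, d)
```

When Wp equals Wf, this collapses to ordinary bilinear attention; the directional split is what carries the position information.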
The hidden vector for a position is then computed from the corresponding attention vector and an input vector, which creates a direct path from the input layer to each hidden layer. We drop the Transformer's FFN layer entirely, yet still achieve competitive and even better results with far fewer parameters on three datasets provided by CLUE (Xu et al., 2020): the AFQMC dataset for Sentence Similarity, the TNEWS dataset for Text Classification and the CMNLI dataset for Natural Language Inference. More importantly, our derivation of StarSaber demonstrates a universal way to transform different RNNs, such as the LSTM and GRU discussed below, providing possibilities beyond Transformers for self-attention models.
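Stacking these layers might look like the sketch below: every layer receives the raw input x in addition to the attention vectors, which is the direct input path described above. The function signature and weight layout are hypothetical.

```python
import numpy as np

def starsaber_stack(x, attn_fn, weights):
    """Unrolled fixed-point iterations as stacked layers (a sketch).

    x: (n, d) inputs; attn_fn maps hidden states (n, d) to attention
    vectors (n, d); weights is one (Wh, Wx) pair per layer. The `x @ Wx`
    term gives every layer a direct path back to the input.
    """
    h = x
    for Wh, Wx in weights:
        h = np.tanh(attn_fn(h) @ Wh + x @ Wx)
    return h
```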

2. RELATED WORK

Gates were first introduced into recurrent networks in the LSTM and were rediscovered and simplified in the GRU. The gate mechanism multiplies an output elementwise by a single sigmoid layer of the input and is often seen as a way to address the gradient-vanishing problem. But if that were its only role, other approaches that address this problem should achieve results similar to LSTM and GRU. In this paper, we show by experiments that gates also improve performance in StarSaber, which does not suffer from this problem.

Attention in sequence modeling is a weighted sum over the outputs at each position of a sequence, mimicking the way a reader distributes attention across its parts. The weights in this sum are given by some function of the inputs, and self-attention computes both the weighted sum and the weights on the same sequence, without any other inputs. There are different types of attention, such as the multi-head, scaled dot-product attention in the Transformer, the additive attention of Bahdanau et al. (2014), and the bilinear attention of Luong et al. (2015). Our model applies bilinear attention in the construction of the word relation graph.

Residual connections were proposed by He et al. (2015) to ease the training of deep neural networks. In Natural Language Processing, residual connections alleviate both the gradient-vanishing problem and the degradation problem of deep networks. Our model uses a weighted residual connection (Bachlechner et al., 2020), which further alleviates the degradation problem. A similar idea is the highway connection (Srivastava et al., 2015). In this paper, we inspect the gate mechanism in our self-attention-style model; note that the highway connection also fits into our framework, as a fixed-point generalization of the GRU.

Pretraining has proved extremely useful since Embeddings from Language Models (ELMo) (Peters et al., 2018).
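The gate mechanism described above is a one-liner in practice: a sigmoid of the input, multiplied elementwise into a candidate output. The parameter names Wg and bg are illustrative, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, candidate, Wg, bg):
    """Multiply an output ("candidate") elementwise by a single sigmoid
    layer of the input x, as in LSTM/GRU-style gating."""
    g = sigmoid(x @ Wg + bg)   # each gate entry lies in (0, 1)
    return g * candidate
```

Because every gate entry lies strictly between 0 and 1, the gate can only attenuate the candidate, letting the network learn how much of each feature to pass through.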
Many works that followed, such as BERT (Devlin et al., 2018), ALBERT (Lan et al., 2020) and XLNet (Yang et al., 2019), have outperformed humans. Pretraining is a training pattern that trains a language model, usually extremely large, on an enormous dataset with one

