TRANSFORMING RECURRENT NEURAL NETWORKS WITH ATTENTION AND FIXED-POINT EQUATIONS

Anonymous

Abstract

The Transformer has recently achieved state-of-the-art performance on multiple Natural Language Processing tasks. However, the Feed-Forward Network (FFN) in a Transformer block is computationally expensive. In this paper, we present a framework for transforming Recurrent Neural Networks (RNNs) and their variants into self-attention-style models, using an approximation based on the Banach Fixed-point Theorem. Within this framework, we propose a new model, StarSaber, obtained by solving a set of equations derived from an RNN with the Fixed-point Theorem and further approximating the solution with a Multi-layer Perceptron; this also provides a view of layer stacking. StarSaber achieves better performance than both the vanilla Transformer and an improved version called ReZero on three datasets, and is more computationally efficient due to the reduction of the Transformer's FFN layer. It has two major components. The first is a way to encode position information with two different matrices: for every position in a sequence, one matrix operates on the positions before it and another matrix operates on the positions after it. The second is the introduction of direct paths from the input layer to all subsequent layers. Ablation studies show the effectiveness of both components. We additionally show that other RNN variants, such as RNNs with gates, can be transformed in the same way, likewise outperforming both kinds of Transformers.
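The fixed-point idea underlying the framework can be illustrated with a minimal sketch. The code below is not the paper's model; it is a hypothetical numpy example of Banach-style iteration, in which a contraction map is applied repeatedly until its state stops changing.

```python
import numpy as np

def fixed_point_iterate(f, h0, tol=1e-6, max_iter=200):
    """Iterate h <- f(h); if f is a contraction, Banach's
    Fixed-point Theorem guarantees convergence to a unique h*."""
    h = h0
    for _ in range(max_iter):
        h_next = f(h)
        if np.max(np.abs(h_next - h)) < tol:
            return h_next
        h = h_next
    return h

# Illustrative contraction: the 0.5 * tanh(.) scaling keeps the
# map's Lipschitz constant below 1, so iteration converges.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))
b = rng.normal(size=4)
f = lambda h: 0.5 * np.tanh(W @ h) + b
h_star = fixed_point_iterate(f, np.zeros(4))
```

At the returned point, h_star is (approximately) unchanged by a further application of f, which is the property the framework exploits when unrolling an RNN's recurrence into a feed-forward approximation.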

1. INTRODUCTION

Recurrent Neural Networks (RNNs) have been widely applied to various tasks over the last decade, such as Neural Machine Translation (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014), Text Classification (Zhou et al., 2016), Named Entity Recognition (Zhang & Yang, 2018; Chiu & Nichols, 2016), Machine Reading Comprehension (Hermann et al., 2015; Kadlec et al., 2016) and Natural Language Inference (Chen et al., 2017; Wang et al., 2017). The models applied to these tasks are usually not vanilla RNNs but two famous variants in which gates play an important role: the Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).

RNNs are hard to parallelize. They are not bidirectional either, meaning that a word cannot use information from the words that come after it. A common way to alleviate this problem is to reverse the input sequence and combine the results of two different RNN encoders with operations such as concatenation or addition.

Transformer (Vaswani et al., 2017), however, provides a better solution. It is based purely on the attention mechanism, which has been widely used in Neural Machine Translation since Bahdanau et al. (2014). Models based on self-attention are mostly the Transformer and its variants, such as Transformer-XL (Dai et al., 2019), Universal Transformer (Dehghani et al., 2019) and Star-Transformer (Guo et al., 2019). Compared with recurrent units such as GRU and LSTM, self-attention-style models can be computed in parallel, which makes them better suited to large-scale training. But each of these Transformers has an FFN layer with a very high vector dimension, which remains a bottleneck for computational efficiency. In this paper, we present a new framework based on the Banach Fixed-point Theorem to transform the vanilla RNN and its variants into self-attention-style models.
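The reverse-and-combine workaround mentioned above can be sketched in a few lines. The code below is an illustrative toy (function names and shapes are ours, not the paper's): one vanilla RNN runs left-to-right, a second runs over the reversed sequence, and the two state sequences are concatenated position by position.

```python
import numpy as np

def rnn_forward(xs, W, U, b):
    """Vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(xs, fwd_params, bwd_params):
    """Run one RNN forward and one over the reversed input,
    then concatenate the aligned hidden states."""
    h_fwd = rnn_forward(xs, *fwd_params)
    h_bwd = rnn_forward(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=-1)

# Toy usage: a sequence of 5 tokens, input dim 3, hidden dim 4,
# giving a concatenated representation of dim 8 per position.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
make = lambda: (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
H = bidirectional_encode(xs, make(), make())
```

Note that the two RNNs must still be computed sequentially along the time axis, which is exactly the parallelization limit that motivates self-attention-style models.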
StarSaber, one such transformed model, outperforms both the vanilla Transformer and ReZero (Bachlechner et al., 2020) in our experiments with fewer parameters and thus less computational cost. To start with,

