UNDERSTANDING AND IMPROVING ENCODER LAYER FUSION IN SEQUENCE-TO-SEQUENCE LEARNING

Abstract

Encoder layer fusion (EncoderFusion) is a technique that fuses all the encoder layers (instead of only the uppermost layer) in sequence-to-sequence (Seq2Seq) models, and it has proven effective on various NLP tasks. However, it is still not entirely clear why and when EncoderFusion works. In this paper, our main contribution is to take a step further in understanding EncoderFusion. Many previous studies attribute the success of EncoderFusion to exploiting surface and syntactic information embedded in the lower encoder layers. In contrast, we find that the encoder embedding layer is more important than the other intermediate encoder layers. In addition, the uppermost decoder layer consistently pays more attention to the encoder embedding layer across NLP tasks. Based on this observation, we propose a simple fusion method, SurfaceFusion, which fuses only the encoder embedding layer into the softmax layer. Experimental results show that SurfaceFusion outperforms EncoderFusion on several NLP benchmarks, including machine translation, text summarization, and grammatical error correction. It obtains state-of-the-art performance on the WMT16 Romanian-English and WMT14 English-French translation tasks. Extensive analyses reveal that SurfaceFusion learns more expressive bilingual word embeddings by building a closer relationship between relevant source and target embeddings.

1. INTRODUCTION

Sequence-to-Sequence (Seq2Seq) learning (Sutskever et al., 2014) has advanced the state of the art in various natural language processing (NLP) tasks, such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017; Wu et al., 2019), text summarization (Wang et al., 2019b; Zhang et al., 2020), and grammatical error correction (Kiyono et al., 2019; Kaneko et al., 2020). Seq2Seq models are generally implemented with an encoder-decoder framework, in which a multi-layer encoder summarizes a source sequence into a sequence of representations and a multi-layer decoder produces the target sequence conditioned on the encoded representations. Recent studies reveal that fusing the intermediate encoder layers (EncoderFusion) is beneficial for Seq2Seq models, e.g. via layer attention (Bapna et al., 2018), layer aggregation (Dou et al., 2018; Wang et al., 2019c), and layer-wise coordination (He et al., 2018). Despite its effectiveness, not much is known about how fusing encoder layer representations works. The intuitive explanation is that fusing encoder layers exploits surface and syntactic information embedded in the lower encoder layers (Belinkov et al., 2017; Peters et al., 2018). However, other studies show that attending to the lower encoder layers (excluding the encoder embedding layer) does not improve model performance (Domhan, 2018), which conflicts with this intuition. It thus remains unclear why and when fusing encoder layers should work in Seq2Seq models.

This paper tries to shed light on the behavior of Seq2Seq models augmented with the EncoderFusion method. To this end, we propose a novel fine-grained layer attention to evaluate the contribution of individual encoder layers. We conduct experiments on several representative Seq2Seq NLP tasks, including machine translation, text summarization, and grammatical error correction. Through a series of analyses, we find that the uppermost decoder layer pays more attention to the encoder embedding layer.
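The layer attention described above can be viewed as a softmax-normalized weighting over the outputs of all encoder layers; inspecting the learned weights then reveals how much each layer (including the embedding layer) contributes. The NumPy sketch below is illustrative only: the function and argument names are our own, and in the actual model the fusion scores are learned jointly with the network rather than passed in as fixed inputs.

```python
import numpy as np

def layer_attention(layer_outputs, scores):
    """Fuse per-layer encoder outputs with softmax-normalized weights.

    layer_outputs: list of (seq_len, d_model) arrays, one per encoder
                   layer, with index 0 being the embedding layer.
    scores: unnormalized fusion scores, shape (num_layers,).

    Returns the fused representation and the normalized weights; the
    weights expose the contribution of each individual encoder layer.
    """
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over the layer axis
    fused = sum(w * h for w, h in zip(weights, layer_outputs))
    return fused, weights
```

With uniform scores every layer contributes equally; a trained model would instead concentrate weight on the most useful layers, which is the quantity our analysis inspects.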
Masking the encoder embedding layer significantly degrades model performance, causing the model to generate hallucinatory (i.e. fluent but unfaithful to the source) predictions. The encoded representations of standard Seq2Seq models (i.e. without fusing encoder layers) may not have enough capacity to model both semantic and surface features (especially those of the encoder embedding layer). We call this problem the source representation bottleneck. Based on this observation, we simplify the EncoderFusion approaches by connecting only the encoder embedding layer to the softmax layer (SurfaceFusion). SurfaceFusion shortens the path distance between source and target embeddings, which helps to learn better bilingual embeddings through direct interactions. Experimental results on several Seq2Seq NLP tasks show that our method consistently outperforms both the vanilla Seq2Seq model and the layer attention model. Extensive analyses reveal that our approach produces more aligned bilingual word embeddings by shortening the path distance between them, which confirms our claim.

Our main contributions are as follows:

• We introduce a fine-grained layer attention method to qualitatively and quantitatively evaluate the contribution of individual encoder layers.

• We demonstrate that the encoder embedding layer is essential for fusing encoder layers, which reconciles the conflicting findings reported by previous studies.

• We propose a simple yet effective SurfaceFusion approach to directly exploit the encoder embedding layer for the decoder, which produces more expressive bilingual embeddings.
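One plausible instantiation of connecting the encoder embedding layer to the softmax layer is sketched below: the decoder state attends over the raw source embeddings, and the resulting surface distribution is interpolated with the standard prediction. Note that the soft-interpolation scheme and the hyper-parameter `gamma` are assumptions made for illustration; they are not necessarily the exact fusion formulation used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def surface_fusion_probs(dec_state, src_embed, out_embed, gamma=0.5):
    """Hypothetical sketch of fusing the encoder embedding layer into
    the output softmax.

    dec_state: (d,) state from the uppermost decoder layer.
    src_embed: (src_len, d) encoder embedding-layer outputs.
    out_embed: (vocab, d) target output embedding matrix.
    gamma: assumed interpolation weight between the two distributions.
    """
    # standard prediction from the decoder state alone
    p_dec = softmax(out_embed @ dec_state)
    # attend over the raw source embeddings to build a "surface" vector,
    # giving a direct path between source and target embeddings
    attn = softmax(src_embed @ dec_state)
    surface = attn @ src_embed
    p_surf = softmax(out_embed @ surface)
    # soft fusion: interpolate the two probability distributions
    return (1 - gamma) * p_dec + gamma * p_surf
```

Because the gradient of `p_surf` flows directly from the target output embeddings to the source word embeddings, the two embedding spaces interact without passing through the full encoder-decoder stack, which is the shortened path the analysis refers to.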

2.1. SEQUENCE-TO-SEQUENCE LEARNING

Seq2Seq learning aims to maximize the log-likelihood of a target sequence y = {y_1, . . . , y_J} conditioned on a source sequence x = {x_1, . . . , x_I}, which is formulated as ŷ = arg max_y log P(y|x). Seq2Seq learning can be implemented with various architectures (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Wu et al., 2019), among which the Transformer (Vaswani et al., 2017) has advanced the state of the art. Without loss of generality, we use the Transformer as the testbed in this paper. The Transformer consists of an encoder E equipped with N identical layers to map the source sequence x into distributed representations, based on which a decoder D equipped with M identical layers generates the target sequence y:

X^n = FFN(ATT(X^{n-1}, X^{n-1}, X^{n-1})),  n = 1, . . . , N,  with X^N = E(X^0)    (1)

Y^m = FFN(ATT(ATT(Y^{m-1}, Y^{m-1}, Y^{m-1}), X^N, X^N)),  m = 1, . . . , M,  with Y^M = D(Y^0, X^N)    (2)

where X^0 denotes the sum of the word embeddings X_emb and position embeddings X_pos of x, Y^0 denotes that of the shifted-right y, FFN(·) denotes a position-wise feed-forward network, and ATT(·) denotes a multi-head dot-product attention network with three arguments: query, key, and value. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are used in each sub-layer, and are suppressed in Equations 1 and 2 for clarity. Finally, the output representation Y^M of the decoder is projected into the probability P(y|x), which is optimized during model training.
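The encoder stack of Equation 1 can be sketched in a few lines of NumPy. The sketch is deliberately simplified: a single attention head, no residual connections or layer normalization (which, as noted above, are suppressed in the equations as well), and toy weight matrices supplied by the caller rather than learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def att(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def ffn(x, w1, w2):
    """Position-wise feed-forward network with a ReLU activation."""
    return np.maximum(x @ w1, 0) @ w2

def encoder(x0, params):
    """Stack of N layers: X^n = FFN(ATT(X^{n-1}, X^{n-1}, X^{n-1})).

    x0: (src_len, d_model) sum of word and position embeddings.
    params: list of (w1, w2) feed-forward weight pairs, one per layer.
    """
    x = x0
    for w1, w2 in params:
        x = ffn(att(x, x, x), w1, w2)  # self-attention, then FFN
    return x  # X^N, the encoded source representation
```

The decoder of Equation 2 follows the same pattern, with an additional cross-attention call ATT(·, X^N, X^N) between the self-attention and the FFN in each layer.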

2.2. EXPERIMENTAL SETUP

To validate the universality of the source representation bottleneck in Seq2Seq models, we conducted experiments on three representative tasks, which vary in the distance between input and output domains and in the scale of training data. Machine translation takes a sentence in one language as input and outputs a semantically equivalent sentence in another language. We conducted experiments on three benchmark datasets: small-scale WMT16 Romanian-English (Ro-En; 0.6M instances), medium-scale WMT14 English-German

Code availability: https://github.com/SunbowLiu/SurfaceFusion

