IOT: INSTANCE-WISE LAYER REORDERING FOR TRANSFORMER STRUCTURES

Abstract

With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, Transformer has achieved great success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe that different data samples actually favor different layer orders. Based on this observation, in this work, we break the assumption of a fixed layer order in Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model different functions through reordered layers, which enables each sample to select the more suitable one to improve model performance under the constraint of an almost unchanged number of parameters. To achieve this, we introduce a light predictor with negligible parameter and inference cost to decide the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method. We further show that our method can also be applied to other architectures beyond Transformer. Our code is released on GitHub 1 .

1. INTRODUCTION

Transformer (Vaswani et al., 2017) has been the dominant architecture in deep learning models (Hassan et al., 2018; Ng et al., 2019; Carion et al., 2020; Radford et al., 2019; Dai et al., 2019; Lee et al., 2019; Devlin et al., 2018; Yang et al., 2019; Cai & Lam, 2019). A Transformer model stacks several identical blocks, and each block consists of sequentially ordered layers: the self-attention (SA), encoder-decoder attention (ED) (decoder only) and feed-forward (FF) layers. Recently, various modifications have been proposed, most of which replace or insert components (e.g., attention layer/layer norm/position encoding) in the standard Transformer (Wu et al., 2019; Lu et al., 2019; Shaw et al., 2018; So et al., 2019; Ahmed et al., 2017).

                Translation                                                                         BLEU↑    TER↓
Reference       and just like that , the iceberg shows you a different side of its personality .
Order 1 Trans   and just so , the iceberg shows a different side of its personality .               77.11    18.75
Order 2 Trans   and just like that , the iceberg shows you a different side of its personality .    100.00   0.00
Order 3 Trans   and just so , the iceberg gives you another side of his personality .               0.00     37.50
Order 4 Trans   and just like this , the iceberg gives you another side of its personality .        38.71    25.00
Order 5 Trans   ans so simply , the iceberg shows another side of his personality .                 30.33    50.00
Order 6 Trans   and just like this , the iceberg shows you another side of his personality .        36.61    25.00

Table 2: Translations (Trans) from all six ordered decoders of Transformer for one example sentence.

We first conduct preliminary experiments. We vary the three layers in the decoder to obtain all six variants (each with a unique order of the three layers) and train these models. Results on IWSLT14 German→English translation are reported in Table 1. As we can see, their performances are similar and none stands out.
The corpus BLEU variance is only 0.0045, which means that simply reordering the layers and training over the whole corpus has little impact. Press et al. (2019) also reported this for machine translation, but stopped there. This seems to be a negative answer. However, we take a further step and ask one more question: do different data samples favor different layer orders? That is, we investigate whether each specific sample has its own preference for one particular order. Intuitively, forcing all data patterns through one order should not be the best choice; for example, harder samples may favor a particular order while easier ones favor another. Thus, for each order, we count the ratio of samples that achieve the best score with that order. In Table 1, we find the ratios almost follow a uniform distribution (e.g., 17.9% of samples achieve the best BLEU with order SA→ED→FF). Besides, we calculate the BLEU variance for each sample and average these variances; the result is 114.76, which is much larger than the corpus variance above (0.0045). Both findings indicate that each sample indeed has its own preference among the orders. In Table 2, we present translations from all ordered decoders for one example, with BLEU and TER scores, as evidence.

Motivated by the above observations, in this work we present the Instance-wise Ordered Transformer (IOT), in which the layer order is determined by the specific data sample through instance-wise learning. To achieve this, we utilize a light predictor to predict the confidence of each order, with the corresponding classification losses as training signals. However, directly training the predictor with the conventional (i.e., NMT) loss tends to converge quickly to one bad order and ignore exploration of the others. Thus, we introduce an exploration loss and an exploitation loss to enable effective training while keeping an unambiguous prediction for each sample, so that the best order can be selected during inference.
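The selection step can be illustrated with a minimal sketch: enumerate the six candidate layer orders and let a predictor's confidence scores pick one per input. The `order_logits` below are a hypothetical stand-in for the output of the light predictor over the input sequence; the exploration and exploitation losses that shape those scores during training are noted only in comments, since their exact forms are not spelled out here.

```python
import itertools
import math

# A decoder block contains three layer types, so there are 3! = 6 candidate orders.
LAYERS = ("SA", "ED", "FF")
ORDERS = list(itertools.permutations(LAYERS))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_order(order_logits):
    """Return the most confident layer order for one input sequence.

    `order_logits` stands in for the light predictor's scores (hypothetical
    placeholder). During training, an exploration loss would keep all six
    probabilities from collapsing too early, while an exploitation loss
    would sharpen the distribution so inference picks one order unambiguously.
    """
    probs = softmax(order_logits)
    best = max(range(len(ORDERS)), key=lambda i: probs[i])
    return ORDERS[best], probs

order, probs = select_order([0.1, 2.0, -0.5, 0.3, 0.0, 0.2])
# → ('SA', 'FF', 'ED'): the second candidate order has the highest logit.
```

At inference time only this argmax is needed, which is why the predictor adds negligible cost.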
We evaluate our approach on 3 sequence generation tasks: neural machine translation (NMT), abstractive summarization (ABS) and code generation (CG). For NMT, we work on 8 IWSLT and 2 WMT tasks, covering both low-resource and rich-resource scenarios, and our method consistently obtains 1.0 BLEU score improvements over Transformer. For ABS, IOT also outperforms Transformer and other baselines on the Gigaword dataset. For CG, IOT surpasses state-of-the-art performance on 2 large-scale real-world code datasets (Java and Python) collected from GitHub. These results all demonstrate the effectiveness of IOT. Furthermore, we provide detailed studies to verify that instance-wise learning and order selection constitute a reasonable and necessary modeling choice. The contributions of this work can be summarized as follows:
• We are the first to leverage instance-wise learning for layer order selection in a Transformer model (with shared parameters), and we demonstrate that instance-wise learning is critical.
• We demonstrate that our learning approach can be universally applied to other structures besides Transformer (e.g., Dynamic Convolutions), as long as there are multiple different layers.
• Experiments on 3 sequence generation tasks and 9 datasets verify the effectiveness of IOT with consistent performance improvements.
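The shared-parameter reordering underlying these contributions can be sketched in a few lines: the same set of layers is applied in whichever order was chosen for the sample. The toy lambdas below are hypothetical placeholders for the real shared SA/ED/FF modules; the point is only that one parameter set models different functions under different orders.

```python
# Toy stand-in layers; in the real model these are the shared
# self-attention, encoder-decoder attention, and feed-forward modules.
layers = {
    "SA": lambda x: x + 1,   # hypothetical placeholder
    "ED": lambda x: x * 2,   # hypothetical placeholder
    "FF": lambda x: x - 3,   # hypothetical placeholder
}

def run_block(x, order, layers):
    """Apply the same set of layers in the instance-specific order."""
    for name in order:
        x = layers[name](x)
    return x

# The same "parameters" compose into different functions under different orders:
run_block(5, ("SA", "ED", "FF"), layers)  # ((5 + 1) * 2) - 3 = 9
run_block(5, ("FF", "ED", "SA"), layers)  # ((5 - 3) * 2) + 1 = 5
```

Because every order reuses the same modules, the six variants add essentially no parameters over a single fixed-order model.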

2. RELATED WORK

Architecture Exploration Inventing novel architectures by human design or automatic search plays an important role in deep learning. Specific to Transformer structures, various modifications have been proposed. For example, human-knowledge-powered designs include DynamicConv (Wu et al., 2019), Macaron Network (Lu et al., 2019), Reformer (Kitaev et al., 2020) and others (Fonollosa et al., 2019; Ahmed et al., 2017; Shaw et al., 2018). As for automatic searching, neural architecture

1 https://github.com/instance-wise-ordered-transformer/IOT

