IOT: INSTANCE-WISE LAYER REORDERING FOR TRANSFORMER STRUCTURES

Abstract

With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, Transformer has achieved great success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe, however, that different data samples actually favor different layer orders. Based on this observation, in this work we break the assumption of a fixed layer order in Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model variant functions with reordered layers, which enables each sample to select the better one to improve model performance under the constraint of an almost unchanged number of parameters. To achieve this, we introduce a light predictor with negligible parameter and inference cost that decides the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements from our method. We further show that our method can also be applied to other architectures beyond Transformer. Our code is released on GitHub.

1. INTRODUCTION

Transformer (Vaswani et al., 2017) has been the dominant architecture in deep learning models (Hassan et al., 2018; Ng et al., 2019; Carion et al., 2020; Radford et al., 2019; Dai et al., 2019; Lee et al., 2019; Devlin et al., 2018; Yang et al., 2019; Cai & Lam, 2019). A Transformer model is a stack of several identical blocks, and each block consists of sequentially ordered layers: the self-attention (SA), encoder-decoder attention (ED) (decoder only), and feed-forward (FF) layers. Recently, various modifications have been proposed, mostly focused on replacing or inserting components (e.g., the attention layer, layer norm, or position encoding) in the standard Transformer (Wu et al., 2019; Lu et al., 2019; Shaw et al., 2018; So et al., 2019; Ahmed et al., 2017). Although these Transformer alternatives have achieved improved performance, one critical element is almost neglected in current models: how to arrange the components within a Transformer network, i.e., the layer order also matters. As pointed out by He et al. (2016b), different orders of ReLU, batch normalization, and residual connection significantly affect the performance of ResNet (He et al., 2016a). Therefore, we ask: What if we reorder the sequential layers in Transformer (e.g., SA→FF or FF→SA for the encoder, SA→FF→ED or FF→ED→SA for the decoder)? What is the best order for these different layers?

Table 2: Translations (Trans) from all ordered decoders of Transformer for one example sentence.

Reference:      and just like that , the iceberg shows you a different side of its personality .
                                                                                        BLEU↑   TER↓
Order 1 Trans:  and just so , the iceberg shows a different side of its personality .            77.11  18.75
Order 2 Trans:  and just like that , the iceberg shows you a different side of its personality . 100.00   0.00
Order 3 Trans:  and just so , the iceberg gives you another side of his personality .             0.00  37.50
Order 4 Trans:  and just like this , the iceberg gives you another side of its personality .     38.71  25.00
Order 5 Trans:  ans so simply , the iceberg shows another side of his personality .              30.33  50.00
Order 6 Trans:  and just like this , the iceberg shows you another side of his personality .     36.61  25.00

We first conduct preliminary experiments. We vary the three layers in the decoder with all six variants (each with a unique order of the three layers) and train these models. Results on IWSLT14 German→English translation are reported in Table 1. As we can see, their performances are similar and none is outstanding. The corpus BLEU variance is only 0.0045, which means that simply reordering the layers and training over the whole corpus has little impact. Press et al. (2019) also reported this for machine translation, but they stopped there. This seems to be a negative answer. However, we take a further step and ask one more question: Does different data favor differently ordered layers? That is, we investigate whether each specific sample has its own preference for one particular order. Intuitively, putting various data patterns through one order should not be the best choice. For example, harder samples may favor a particular order while easier ones favor another. Thus, for each order, we count the ratio of samples that achieve the best score with that order. In Table 1, we find they almost lie on a uniform distribution (e.g., 17.9% of samples achieve the best BLEU with order SA→ED→FF). Besides, we calculate the BLEU variance for each sample and average all these variances; the result is 114.76, which is much larger than the above corpus variance (0.0045). Both observations indicate that the data indeed has its own preference among different orders. In Table 2, we present translations from all ordered decoders for one example, with BLEU and TER scores, as evidence. Motivated by the above observations, in this work we present the Instance-wise Ordered Transformer (IOT), in which the layer order is determined by the specific data through instance-wise learning.
To achieve this, we utilize a light predictor to predict the confidence of each order, given the corresponding classification losses as training signals. However, directly training the predictor with the conventional (i.e., NMT) loss tends to quickly converge to one order and ignore exploration of the others. Thus, we introduce an exploration loss and an exploitation loss to enable effective training while keeping an unambiguous prediction for each sample, so that the best order can be decided during inference. We evaluate our approach on 3 sequence generation tasks: neural machine translation (NMT), abstractive summarization (ABS), and code generation (CG). For NMT, we work on 8 IWSLT and 2 WMT tasks, covering both low-resource and rich-resource scenarios. Our method consistently obtains about a 1.0 BLEU score improvement over Transformer. For ABS, IOT also outperforms Transformer and other baselines on the Gigaword dataset. For CG, the results on 2 large-scale real-world code datasets (Java and Python) collected from GitHub surpass the state-of-the-art performance. These all demonstrate the effectiveness of IOT. Furthermore, we provide detailed studies to verify that instance-wise learning and order selection constitute a reasonable and necessary modeling choice. The contributions of this work can be summarized as follows: • We are the first to leverage instance-wise learning for layer order selection in a Transformer model (with shared parameters), and we demonstrate that instance-wise learning is critical. • We demonstrate that our learning approach can be universally applied to structures besides Transformer (e.g., DynamicConv), as long as there are multiple different layers. • Experiments on 3 sequence generation tasks and 9 datasets verify the effectiveness of IOT with consistent performance improvements.

2. RELATED WORK

Architecture Exploration Inventing novel architectures, by human design or automatic search, plays an important role in deep learning. Specific to Transformer structures, various modifications have been proposed. For example, human-knowledge-powered designs include DynamicConv (Wu et al., 2019), Macaron Network (Lu et al., 2019), Reformer (Kitaev et al., 2020) and others (Fonollosa et al., 2019; Ahmed et al., 2017; Shaw et al., 2018). As for automatic searching, neural architecture search can discover networks with state-of-the-art performance, but often with complicated computation, e.g., the Evolved Transformer (So et al., 2019). The underlying principle is to add or replace some components of Transformer. For instance, Wu et al. (2019) replace self-attention with dynamic convolution, and So et al. (2019) add a separate convolution layer in a new branch. Different from them, we instead focus only on the selection of layer orders for each data sample so as to improve model performance, without heavy modification. Besides, our approach is structure agnostic and can be universally applied to other structures, as long as multiple different layers exist.

Instance-wise Learning Deep learning models are trained over large-scale datasets, and data samples are often treated equally, without modeling the differences between them. Some works attempt to weight each sample with a different importance (Ren et al., 2018; Hu et al., 2019; Chang et al., 2017) or feed data with curriculum learning according to its difficulty (Bengio et al., 2009; Fan et al., 2018). However, they often explicitly manipulate the data during training only, with no distinction at inference, and under one fixed model. Elbayad et al. (2020) take a step further and propose the depth-adaptive Transformer, which can choose different network depths by predicting the required computation for a particular sample. Similarly, Liu et al. (2020) propose a sample-wise adaptive mechanism to dynamically calculate the number of required layers. Both aim at reducing computation cost and speeding up inference. Schwartz et al. (2020), Bapna et al. (2020) and Shazeer et al. (2017) all leverage conditional computation for each sample to control the computation-accuracy tradeoff during inference. Instead, we pay attention to the variant modeling functions and perform instance-wise order selection in order to boost Transformer performance. The most related work is Press et al. (2019), which manually generates randomly ordered Transformer encoders and finds that the Sandwich Transformer can slightly reduce the perplexity of language modeling. However, they find that the Sandwich Transformer pattern has no effect on the NMT task. Besides, it still operates over the whole corpus without considering each specific sample. We, instead, investigate various sequence-to-sequence generation tasks and greatly improve task performance through instance-wise learning, so as to discover the optimally ordered Transformer for each particular sample.

3. INSTANCE-WISE ORDERED TRANSFORMER

The overall framework of IOT is presented in Figure 1. In comparison with the standard Transformer, IOT only incorporates light-weight predictors and reorders the encoder/decoder with weight tying, under the constraint of an almost unchanged number of parameters and exempt from heavy modifications. In this section, we introduce the details of IOT, including training, inference, and discussions.

Notations Sequence-to-sequence learning aims to map one sequence $x = [x_1, x_2, ..., x_{T_x}]$ into another sequence $y = [y_1, y_2, ..., y_{T_y}]$, where $x_i, y_j$ denote the $i$-th and $j$-th tokens of $x$ and $y$, and $T_x$, $T_y$ are the corresponding lengths. Given one sentence pair $(x, y)$ and a learning model $\mathcal{M}$, we define the training objective as minimizing the cross-entropy loss $\mathcal{L}_{\mathcal{M}} = -\sum_{j=1}^{T_y} \log P(y_j \mid y_{<j}, x)$. Besides, $D_{KL}(P \,\|\, Q)$ denotes the Kullback-Leibler (KL) divergence between distributions $P$ and $Q$.
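As a concrete illustration of the training objective above, the sequence-level cross-entropy is simply a sum of per-token negative log-likelihoods. A minimal sketch, with toy probabilities standing in for a decoder's softmax outputs:

```python
import math

def sequence_cross_entropy(token_probs):
    """L_M = -sum_{j=1}^{T_y} log P(y_j | y_<j, x): the sequence loss is
    the sum of per-token negative log-likelihoods."""
    return -sum(math.log(p) for p in token_probs)

# Toy 3-token target; each entry is the probability the decoder assigns
# to the correct reference token at that step.
loss = sequence_cross_entropy([0.9, 0.8, 0.7])  # ≈ 0.685
```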

3.1. INSTANCE-WISE ENCODER/DECODER

IOT intends to break the fixed layer order in Transformer. As shown in the introduction, simply reordering the layers w.r.t. the whole corpus has little impact, while each sample has its own preference among orders. Therefore, IOT incorporates instance-wise learning to select the favorable order for each sample. As shown in Figure 1, both the encoder and decoder in IOT consist of several blocks of SA, ED, and FF layers with dynamic order, and we assume there are M (e.g., M = 2) ordered encoders and N (e.g., N = 6) ordered decoders (with shared weights). Inspired by the fact that a lower training loss implies a higher proficiency confidence for a candidate order, we utilize the cross-entropy loss as the signal to learn the confidence. That is, we calculate confidences $\gamma_m$ and $\lambda_n$ for each encoder $enc_m$ and decoder $dec_n$ (yielding model $\mathcal{M}_{m,n}$), and use them to weight the training loss $\mathcal{L}_{\mathcal{M}_{m,n}}$. To calculate the confidence, we add a simple and light predictor to help distinguish the orders.

Training Given one source sequence $x = [x_1, x_2, ..., x_{T_x}]$, we first map each token into a word embedding $e = [e_1, e_2, ..., e_{T_x}]$, where $e_i \in \mathbb{R}^d$, and then apply one light encoder predictor $\pi_{enc}$ to predict the confidence of encoder orders using the sentence embedding $s_e = \frac{1}{T_x}\sum_{i=1}^{T_x} e_i$. Concretely, $\pi_{enc}$ takes $s_e$ as input and predicts $\gamma_m$ for $enc_m$ by Gumbel-softmax (Jang et al., 2016):

$$\gamma_m = \frac{\exp\left((\log(\pi_{enc_m}) + g_m)/\tau_e\right)}{\sum_{k=1}^{M} \exp\left((\log(\pi_{enc_k}) + g_k)/\tau_e\right)}, \quad \pi_{enc} = \mathrm{softmax}(s_e W_e), \quad (1)$$

where $g_m$ is sampled from the Gumbel distribution, $g_m = -\log(-\log U_m)$, $U_m \sim \mathrm{Uniform}(0, 1)$, $W_e \in \mathbb{R}^{d \times M}$ is a weight matrix, and $\tau_e$ is a constant temperature that controls how closely the distribution approximates a categorical one.
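The Gumbel-softmax relaxation in Eqn (1) can be sketched in a few lines of plain Python. This is a toy stand-in for the actual implementation; the `logits` argument plays the role of $s_e W_e$:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gumbel_softmax(logits, tau=1.0):
    """gamma_m = exp((log pi_m + g_m)/tau) / sum_k exp((log pi_k + g_k)/tau),
    with pi = softmax(logits) and Gumbel noise g_m = -log(-log U_m),
    U_m ~ Uniform(0, 1). A lower tau pushes the sample closer to one-hot."""
    pi = softmax(logits)
    # max(..., 1e-12) guards against the (vanishingly rare) U_m == 0.
    g = [-math.log(-math.log(max(random.random(), 1e-12))) for _ in pi]
    return softmax([(math.log(p) + gi) / tau for p, gi in zip(pi, g)])

gamma = gumbel_softmax([0.5, 1.5], tau=0.5)  # M = 2 encoder orders
```

The Gumbel noise makes the perturbed argmax a sample from the categorical distribution $\pi$, while the softmax keeps the weights differentiable for training.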
Simultaneously, the token embeddings $e$ are fed to the encoder to obtain hidden states $h = [h_1, h_2, ..., h_{T_x}]$; then we calculate the decoder order confidence $\lambda_n$ with a predictor $\pi_{dec}$ in the same way as $\pi_{enc}$:

$$\lambda_n = \frac{\exp\left((\log(\pi_{dec_n}) + g_n)/\tau_d\right)}{\sum_{k=1}^{N} \exp\left((\log(\pi_{dec_k}) + g_k)/\tau_d\right)}, \quad \pi_{dec} = \mathrm{softmax}(s_d W_d), \quad (2)$$

where $s_d = \frac{1}{T_x}\sum_{i=1}^{T_x} h_i$ and $W_d$ is the weight matrix. For each ordered path through $enc_m$ and $dec_n$, we obtain the training loss $\mathcal{L}_{\mathcal{M}_{m,n}}$, and the final cross-entropy loss is the confidence-weighted sum:

$$\mathcal{L}_C = \sum_{m=1}^{M}\sum_{n=1}^{N} (\gamma_m \cdot \lambda_n)\, \mathcal{L}_{\mathcal{M}_{m,n}}. \quad (3)$$

Inference During inference, we directly replace the Gumbel-softmax used in training with argmax, in order to choose the most capable encoder and decoder for each sequence $x$:

$$enc = \operatorname{argmax}(s_e W_e), \quad dec = \operatorname{argmax}(s_d W_d).$$

Discussion The decoding process is almost the same as the standard Transformer, with only a little overhead for order prediction. One may be concerned that the training cost increases under our scheme. As we show in Section 5.1, the cost is actually affordable thanks to fast convergence. Currently, we reorder the layers of one encoder/decoder block and stack the same ordered block L times (see Figure 1). A complex extension is to reorder all L blocks of the encoder/decoder, which we leave as future work.
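A minimal sketch of the confidence-weighted training loss of Eqn (3) and of the hard argmax selection used at inference, with toy numbers in place of real model losses and predictor scores:

```python
def weighted_order_loss(gamma, lam, losses):
    """L_C = sum_m sum_n (gamma_m * lam_n) * L_{m,n}  (Eqn (3))."""
    return sum(gamma[m] * lam[n] * losses[m][n]
               for m in range(len(gamma))
               for n in range(len(lam)))

def select_order(scores):
    """Inference: the Gumbel-softmax is replaced by a hard argmax."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy setting with M = 2 encoder orders and N = 2 decoder orders.
gamma, lam = [0.3, 0.7], [0.2, 0.8]
losses = [[1.0, 2.0], [3.0, 4.0]]  # losses[m][n] = L_{m,n}
lc = weighted_order_loss(gamma, lam, losses)  # ≈ 3.2
```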

3.2. AUXILIARY LOSSES

As described, the predictors are trained in an unsupervised way, and we observe they tend to be lazy, so that all samples quickly converge to the same order during training, without meaningful learning. Thus, to make training and inference effective, we introduce exploration and exploitation losses. (1) Exploration: first, we explore the diverse capabilities of all orders with the help of a loss $\mathcal{L}_D$ that encourages all orders to participate in training. The spirit is the same as encouraging exploration in reinforcement learning. The expected softmax probability $\mathbb{E}_x[\pi_x]$ (encoder/decoder) from the predictor should approximate the uniform distribution $Q = [\frac{1}{N}, \frac{1}{N}, ..., \frac{1}{N}]$ (e.g., over decoder orders), and we achieve this by minimizing the KL divergence between the statistical average $\mathbb{E}_x[\pi_x]$ and $Q$:

$$\mathcal{L}_D = D_{KL}(Q \,\|\, \mathbb{E}_x[\pi_x]) = -\frac{1}{N}\sum_{n=1}^{N}\log\left(\mathbb{E}_x[(\pi_x)_n]\right) - \log N,$$

where $(\pi_x)_n$ is the probability of the $n$-th decoder order for sample $x$. For encoder orders, $(\pi_x)_m$ is processed in the same way. (2) Exploitation: different from $\mathcal{L}_D$, which keeps all orders effectively trained, during inference the output distribution $\pi_x$ for each sample should allow an unambiguous argmax selection. We therefore introduce another loss $\mathcal{L}_S$ that constrains each $\pi_x$ to be far away from the uniform distribution $Q$. Concretely, we maximize the KL divergence between each probability $\pi_x$ and $Q$:

$$\mathcal{L}_S = -\mathbb{E}_x[D_{KL}(Q \,\|\, \pi_x)] = -\mathbb{E}_x\left[-\frac{1}{N}\sum_{n=1}^{N}\log(\pi_x)_n - \log N\right].$$

Note that we clamp the value of the probability $\pi_x$, since the KL value is theoretically unbounded. With the above auxiliary losses, the final training objective is to minimize

$$\mathcal{L} = \mathcal{L}_C + c_1 \mathcal{L}_D + c_2 \mathcal{L}_S,$$

where $c_1$ and $c_2$ are coefficients that trade off $\mathcal{L}_D$ and $\mathcal{L}_S$. In this way, we achieve effective training while keeping the ability to distinguish the favorable order for each sample.

Discussion $\mathcal{L}_D$ and $\mathcal{L}_S$ aim to keep training effective and inference unambiguous. There are several alternatives.
The first is to simply decay the temperature τ in Equations (1) and (2) and remove the auxiliary losses. However, we did not observe an obvious gain. The second is to linearly decay c 1 only and remove L S , which fully trains all orders at the beginning and gradually loosens this constraint. We find this is also beneficial, but our two-loss method performs better.
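The two auxiliary losses can be sketched directly from their KL forms, both built on $D_{KL}(Q \,\|\, p)$ with $Q$ uniform. The batch of predictor outputs and the clamping value `eps` here are illustrative (the paper does not specify a clamping threshold):

```python
import math

def kl_from_uniform(p):
    """D_KL(Q || p), Q uniform over N = len(p) classes:
    -(1/N) * sum_n log p_n - log N."""
    n = len(p)
    return -sum(math.log(pi) for pi in p) / n - math.log(n)

def exploration_loss(batch_probs):
    """L_D: pull the batch-average predictor output toward uniform so
    every order keeps receiving training signal."""
    n = len(batch_probs[0])
    avg = [sum(p[i] for p in batch_probs) / len(batch_probs) for i in range(n)]
    return kl_from_uniform(avg)

def exploitation_loss(batch_probs, eps=1e-6):
    """L_S: push each per-sample distribution away from uniform so the
    argmax at inference is unambiguous (probabilities are clamped since
    the KL value is unbounded)."""
    clamped = [[min(max(pi, eps), 1.0) for pi in p] for p in batch_probs]
    return -sum(kl_from_uniform(p) for p in clamped) / len(batch_probs)

# Three samples, each confidently preferring a different one of N = 3 orders.
probs = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
ld = exploration_loss(probs)   # near 0: the batch average is uniform
ls = exploitation_loss(probs)  # negative: each sample is peaked
```

This toy batch is exactly the regime the two losses cooperate toward: diverse order usage across the batch (small $\mathcal{L}_D$) with confident per-sample choices (small $\mathcal{L}_S$).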

4. EXPERIMENTS

We conduct experiments on 3 sequence generation tasks: neural machine translation (both low-resource and rich-resource), code generation and abstractive summarization. The main settings of each experiment are introduced here, and more details can be found in Appendix A.

4.1. DATASET

Neural Machine Translation For the low-resource scenario, we conduct experiments on IWSLT14 English↔German (En↔De), English↔Spanish (En↔Es), and IWSLT17 English↔French (En↔Fr), English↔Chinese (En↔Zh) translations. The training data includes 160k, 183k, 236k, and 235k sentence pairs for each language pair respectively. For the rich-resource scenario, we work on WMT14 En→De and WMT16 Romanian→English (Ro→En) translations. For WMT14 En→De, we filter out 4.5M sentence pairs for training and concatenate newstest2012 and newstest2013 as the dev set, with newstest2014 as the test set. For WMT16 Ro→En, we concatenate the 0.6M bilingual pairs and 2.0M back-translated data for training; newsdev2016/newstest2016 serve as the dev/test sets.

Code Generation Code generation aims to map natural language sentences to programming language code. We work on one Java (Hu et al., 2018) and one Python dataset (Wan et al., 2018), following Wei et al. (2019) to process the two datasets. The Java dataset is collected from Java projects on GitHub, and the Python dataset is collected by Barone & Sennrich (2017). We split each dataset with ratio 0.8 : 0.1 : 0.1 into training, dev, and test sets.

Abstractive Summarization Abstractive summarization is to summarize one long sentence into a short one. The dataset we utilize is the widely acknowledged Gigaword summarization benchmark, constructed from a subset of the Gigaword corpus (Graff et al., 2003).

4.2. MODEL SETTINGS

Other settings are the same as NMT. For summarization, we take transformer_wmt_en_de, with 6 blocks, embedding size 512 and FFN size 2048. Dropout (Srivastava et al., 2014) is set to 0.3. Other settings are also the same as the NMT task. Implementation is developed on Fairseq (Ott et al., 2019). We first grid search c 1 , c 2 on the IWSLT14 De→En dev set, and then apply them to other tasks. The best setting is c 1 = 0.1, c 2 = 0.01; an importance study of c 1 , c 2 is shown in Appendix B.1.

4.3. EVALUATION

We use multi-bleu.perl to evaluate IWSLT14 En↔De and all WMT tasks for a fair comparison with previous works. For the other NMT tasks, we use SacreBLEU for evaluation. During inference, we follow Vaswani et al. (2017) to use beam size 4 and length penalty 0.6 for WMT14 En→De, and beam size 5 and penalty 1.0 for the other tasks. For code generation, the evaluation is based on two metrics: sentence BLEU, which computes the n-gram precision of a candidate sequence against the reference, and the percentage of valid code (PoV) that can be parsed into an abstract syntax tree (AST). As for summarization, the generated summaries are evaluated by ROUGE-1/2/L F1 scores (Lin, 2004).
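To make the sentence-BLEU metric concrete, here is a toy implementation with clipped n-gram precision and a brevity penalty. The paper's reported scores come from multi-bleu.perl / SacreBLEU, not from this sketch; the smoothing constant is an illustrative choice:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precision for n = 1..max_n
    combined with a brevity penalty. Zero counts are smoothed so log()
    stays defined for short or poor hypotheses."""
    hyp, ref = hyp.split(), ref.split()
    log_p = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, r[g]) for g, c in h.items())  # clip by ref counts
        total = max(sum(h.values()), 1)
        log_p.append(math.log(max(clipped, 1e-9) / total))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_p) / max_n)

ref = "and just like that , the iceberg shows you a different side of its personality ."
exact = sentence_bleu(ref, ref)  # an exact match (like Order 2 in Table 2) scores 100
```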

4.4. MAIN RESULTS

Encoder/Decoder Orders The encoder block only contains SA and FF layers, so the maximum number of encoder layer orders M is 2, while for the decoder the maximum number of order variants N is 6. Therefore, we first evaluate the utilization of encoder orders, decoder orders, and both on IWSLT14 De→En translation, in order to see the impact of different numbers of order candidates and their combinations. In Table 3(a), we can see that 2 ordered encoders improve the result, and 6 ordered decoders achieve more gain. This meets our expectation, since the search space is limited when there are only 2 ordered encoders. However, if we train both encoder and decoder orders (e.g., M = 2, N = 6), the results (e.g., 35.30) cannot surpass those with the 6 decoders alone (35.60). We suspect that the search space becomes too large, making training hard, and that decoder orders play a more important role than encoder orders for sequence generation. Therefore, we turn to investigating different decoder order candidates (refer to Appendix A.3 for detailed combinations) in Table 3(b). Results show that N = 4, 5, 6 achieve similarly strong performances (results on other tasks/datasets are in Appendix A.4). Thus, considering efficiency and improvements, we utilize N = 4 ordered decoders (orders 1, 2, 4, 6 in Table 1) to reduce training cost in later experiments.

NMT Results BLEU scores on 8 IWSLT low-resource tasks are shown in Table 4. As we can see, IOT achieves more than 1.0 BLEU points of improvement on all tasks (e.g., 1.7 on Fr→En). The consistent gains on various language pairs well demonstrate the generalization and effectiveness of our method. We then present a comparison with other works on the IWSLT14 De→En task in Table 5(a), where IOT is also better than several human-designed networks. The results of WMT14 En→De and WMT16 Ro→En are reported in Table 6. We also compare with existing works, such as the unsupervised Ro→En method based on a pre-trained cross-lingual language model (Lample & Conneau, 2019).
Similarly, our method outperforms them and shows our framework can work well on rich-resource scenario.
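The order spaces discussed above follow directly from counting permutations of the block's layers. A small sketch; the mapping from the paper's orders 1, 2, 4, 6 of Table 1 to permutation indices here is an assumption for illustration:

```python
from itertools import permutations

# All layer orders for one encoder block (SA, FF) and one decoder block
# (SA, ED, FF); the chosen block order is repeated for all L blocks.
encoder_orders = list(permutations(["SA", "FF"]))        # M = 2! = 2
decoder_orders = list(permutations(["SA", "ED", "FF"]))  # N = 3! = 6

# IOT's final setting keeps N = 4 decoder orders to reduce training
# cost; which four permutation indices correspond to orders 1, 2, 4, 6
# of Table 1 is hypothetical here.
subset = [decoder_orders[i] for i in (0, 1, 3, 5)]
```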

Code Generation Results

The results are shown in Table 5(b).

Abstractive Summarization Results

The IOT performances on the summarization task are shown in Table 7. From the results, we can see IOT achieves 0.8, 0.7, and 1.0 points of gain on the ROUGE-1, ROUGE-2, and ROUGE-L metrics over the standard Transformer on Gigaword summarization. IOT also surpasses other works such as the reinforcement learning based method (Wang et al., 2018), which again verifies that our approach is simple yet effective. Part of Table 7 (ROUGE-1/2/L F1):

(Wang et al., 2019b)                           37.01  17.10  34.87
RNNSearch+select+MTL+ERAML (Li et al., 2018)   35.33  17.27  33.19
CGU (Lin et al., 2018)                         36.30  18.00  33.80
Reinforced-Topic-ConvS2S (Wang et al., 2018)   36.92  18.29  34.58

5.1. TRAINING COST ANALYSIS

With N = 2 and N = 3 orders respectively, the increased per-epoch cost ratio is about 1.72× and 2.47× (but less than 2.0 and 3.0). (2) However, we find that with the shared parameters between these orders, model convergence also becomes faster. Transformer needs 67 epochs to converge, while our IOT only needs 42 (0.63×) and 39 (0.58×) epochs for N = 2 and N = 3 orders, much fewer than Transformer. (3) The total training cost is actually not increased much. IOT (N = 2) and IOT (N = 3) take about 1.08× and 1.44× the training time of the Transformer baseline (the ratio for IOT (N = 3) is only 1.10 on IWSLT17 En→Es). From these observations, we can see that the increased training cost is affordable due to the fast convergence.
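A quick sanity check that the quoted training-cost ratios are mutually consistent, using the per-epoch times and convergence epochs reported for IWSLT14 En→De:

```python
# Figures quoted in the text / Table 9 for IWSLT14 En→De.
base_sec, base_epochs = 277.1, 67          # Transformer: sec/epoch, epochs
iot_sec = {2: 475.6, 3: 685.4}             # IOT, N -> sec/epoch
iot_epochs = {2: 42, 3: 39}                # IOT, N -> epochs to converge

per_epoch = {n: iot_sec[n] / base_sec for n in (2, 3)}          # ≈1.72x, 2.47x
convergence = {n: iot_epochs[n] / base_epochs for n in (2, 3)}  # ≈0.63x, 0.58x
total = {n: per_epoch[n] * convergence[n] for n in (2, 3)}      # ≈1.08x, 1.44x
```

The total-time ratios (1.08× and 1.44×) are exactly the product of the per-epoch slowdown and the convergence speedup, matching the numbers in the text.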

5.2. CASE VERIFICATION

We perform a study with N = 3 to verify that IOT makes a necessary instance-wise order selection. We first split the IWSLT14 En↔De dev set into 3 subsets according to the prediction of π dec , then decode each subset using all 3 ordered decoders and report the BLEU results. As shown in Figure 2, each subset indeed achieves the best score with its predicted order (outperforming the other orders by 0.2-0.4 BLEU). We also run the same study on the test set, where the predicted order outperforms the others by 0.7-0.8 BLEU. These results show that IOT makes reasonable predictions. Besides, we find that the predicted orders correlate with sentence difficulty. In our case, the set 1 sentences, which belong to decoder 1, achieve a higher BLEU than the other sets, meaning set 1 is relatively simple to translate, and vice versa for samples in set 2. This implies that sentences of different difficulty have different structure preferences. We provide statistics and examples in Appendix B.2.

5.3. APPLY ON ANOTHER STRUCTURE (DYNAMICCONV)

As we discussed, our instance-wise layer reordering is structure agnostic. In this subsection, we evaluate this by applying our approach to the DynamicConv network (Wu et al., 2019) beyond the standard Transformer. IOT improves DynamicConv for N = 2, 3, 4 ordered decoders respectively (nearly a 0.7 point gain). Therefore, this study verifies our claim that our approach can be applied to other structures, as long as multiple different layers exist.

Ensemble Since our framework involves multiple orders (with shared parameters), as ensemble frameworks do, we make a comparison with ensembling. The ensemble method trains multiple models with different parameters separately, in an independent way, while our work trains the orders jointly, with the intention of making them more diverse. More importantly, in terms of time and memory cost, the ensemble framework increases both N-fold, which is totally different from ours. In this sense, our method can be combined with ensembling to further boost performance. The results on the IWSLT14 En↔De test set are shown in Table 10. We can clearly conclude that IOT and ensembling are complementary to each other.

5.4. DISCUSSIONS

Regularization IOT consists of differently ordered blocks with weight tying, which may look like a form of parameter regularization to some extent. However, we show that IOT is more than regularization and can be complementary to other regularization methods. Setting (1): We first train a Transformer model on the IWSLT14 De→En task with all decoder orders sharing parameters, but without instance-wise learning, and test the performance with each order. We find the BLEU scores on the test set are near 34.80 for each order, much worse than IOT, which means that simply regularizing the shared parameters across different orders is not the main contributor to the performance improvement; our instance-wise learning is critical. Setting (2): Another experiment is to train Transformer with LayerDrop (Fan et al., 2019) (details in the appendix).

6. CONCLUSION

In this work, we propose the Instance-wise Ordered Transformer, which leverages instance-wise learning to reorder the layers in Transformer for each sample. Compared with the standard Transformer, IOT only introduces a slightly increased time cost. Experiments on 3 sequence generation tasks and 9 datasets demonstrate the effectiveness of IOT. We also verify that our approach can be universally applied to other structures, such as DynamicConv. In the future, we plan to work on more complicated reordering within each block, as well as other tasks such as multilingual translation and text classification.

Table 19: Samples of each English valid subset on IWSLT14 En→De translation.
Set 1: i will come to it later . / i wasn &apos;t very good at reading things .
Set 2: this is a very little known fact about the two countries . / it &apos;s why they leave lights on around the house .
Set 3: and how can we do that ? / it &apos;s our faith , and we will be lo@@ y@@ al to it .

We further take a look at the data and find the sentences in set 1 are mostly "simple sentences", and set 2 contains many "emphatic sentences", while set 3 is somewhat mixed.

The first experiment is the "ordered Transformer" without instance awareness. That is, all the reordered architectures are trained on the same whole corpus with equal weights, and the parameters of these reordered architectures are shared. More specifically, the decoder block has different ways to order the SA, ED, and FF layers (e.g., FF→SA→ED, SA→ED→FF, etc.), but the parameters of the reordered blocks are shared. Mathematically, the loss function is $\mathcal{L}_C = \sum_{n=1}^{N} \lambda_n \mathcal{L}_{\mathcal{M}_n}$, where $\mathcal{L}_{\mathcal{M}_n}$ is the model loss for the $n$-th ordered decoder. Compared with Eqn (3), the weight $\lambda_n$ is fixed to 1 here. At inference, we first find the best order according to the dev performance and apply it to the test set. We cannot use an instance-wise reordered model in this setting, while our proposed IOT can.
The experiments are conducted with the transformer_iwslt_de_en configuration for IWSLT translations, and the transformer_vaswani_wmt_en_de_big configuration for the WMT16 Ro→En translation. Setting (2): We integrate another regularization technique, LayerDrop (Fan et al., 2019), into both the Transformer baseline and our IOT (N = 4) method, while other settings remain unchanged. The results of these two settings are presented in Table 20. From the results, we draw the same conclusions as discussed in Section 5.4: simply sharing the parameters of different decoders as a regularization cannot boost model performance ("Transformer + (1)" in Table 20), while our IOT can further improve performance on top of other regularization methods.

B.4 ROBUSTNESS

An effect of IOT training besides the performance gain is that the model becomes more robust compared to training with one order only. In Table 21, we provide one example to demonstrate the robustness. We train one Transformer model with decoder order 1 and decode the sentences with all orders at inference. Obviously, only decoding with order 1 leads to good performance, while the other orders cannot achieve reasonable scores, since the layer order is changed and the feature extraction becomes incorrect. As for IOT, the generated sequences remain stable, with high scores for each order.

The converged (smallest) validation loss value seems similar to that of the Transformer baseline, but note that the loss computation of IOT differs from the Transformer baseline: as shown in Eqn (3), the loss function of IOT is a weighted sum of the loss values of each order, while for Transformer it is the loss of a single order. Therefore, when we turn to the comparison of validation BLEU scores, the superiority of our IOT can be clearly verified. From the BLEU score curves in Figure 4, it is obvious that IOT achieves a better BLEU score than the standard Transformer throughout the training epochs, on both the validation and test sets. These visualized results well demonstrate the effectiveness of our IOT approach.



http://data.statmt.org/rsennrich/wmt16_backtranslations/ro-en/.



Figure 1: The IOT framework. Pred means the light predictor introduced in Section 3.1 for order selection. We show two ordered encoders/decoders here. After taking X 1 , X 2 , X 3 , the selected order for Y 2 , Y 3 is the lower encoder and upper decoder, while for Y 1 it is the upper encoder and lower decoder.

(a) BLEU scores on IWSLT14 En→De. (b) BLEU scores on IWSLT14 De→En.

Figure 2: BLEU scores of three subsets (divided by predicted decoder) on all decoders.

B.5 VISUALIZATION

To better understand the difference between IOT and the standard Transformer, we investigate the training process and provide visualization results about model optimization and performance improvements. Specifically, we plot the curves of training loss and validation loss, as well as the validation and test BLEU scores, along the training epochs on the IWSLT14 De→En translation dataset. The loss curves are visualized in Figure 3, and the BLEU curves are presented in Figure 4. From the validation loss curves of Figures 3(b) and 3(c), we can first see that our IOT (N = 3) training converges faster than the Transformer baseline, showing the advantage of IOT, which is consistent with our analysis in Section 5.1.

Figure 3: Comparison of training/validation loss curves along the model training on IWSLT De→En translation. 'IOT' is IOT (N = 3) and 'Baseline' is Transformer. Figure 3(c) is the same curve as 3(b), except the value of y-axis in Figure 3(c) is between 3.9-4.1, while 4.9-9.0 for Figure 3(b).

Figure 4: Comparison of validation/test BLEU curves along the model training on IWSLT De→En translation. 'IOT' is IOT (N = 3) and 'Baseline' is Transformer. Figure 4(c) is the same curve as 4(b), except the value of y-axis in Figure 4(c) is between 33-36, while 0-36 for Figure 4(b).

Results for different decoder orders on IWSLT14 De→En translation.

and first used by Rush et al. (2017). The training data consists of 3.8M article-headline pairs, while the dev and test sets consist of 190k and 2k pairs respectively.

Preliminary results of varied orders on IWSLT14 De→En task.

BLEU scores of IOT on eight IWSLT low-resource translation tasks.

Results on IWSLT14 De→En translation task (a), and on Java and Python code generation tasks (b).

Table 5(b). We can observe that Transformer obtains better results than the LSTM-based work of Wei et al. (2019). Compared with Transformer, IOT further improves the quality of the generated code. Specifically, IOT boosts Transformer by 0.93 BLEU/2.86% PoV on Java generation and by 0.75 BLEU/1.25% PoV on Python generation. Again, these results demonstrate the effectiveness of our method.

ROUGE-1/2/L F1 scores for Gigaword summarization.

As discussed before, our approach adds only negligible parameters and inference-time cost. Here we compare the detailed inference time and model size of our framework with the standard Transformer. The detailed parameter counts and inference times on the IWSLT14 En↔De test sets are shown in Table 8. Since we only add one linear layer followed by a softmax as the predictor, the number of extra parameters is M × hidden_size (encoder predictor) or N × hidden_size (decoder predictor), which is negligible compared with the other model parameters. Therefore, IOT introduces more model diversity and improves performance while keeping almost the same number of parameters. As for inference time, the only difference is the one-pass order prediction, whose cost is extremely low compared with the heavy autoregressive generation process, as can be seen from Table 8.
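The parameter overhead argument above can be made concrete with a quick back-of-the-envelope calculation. This is a sketch with illustrative numbers only: the predictor sizes follow the M × hidden_size / N × hidden_size formula from the text, while the total model size used for the ratio is a hypothetical placeholder, not a figure from Table 8.

```python
def predictor_params(num_orders, hidden_size):
    """Extra parameters added by one light predictor: a single linear
    layer mapping a hidden vector to one score per candidate order
    (the softmax itself adds no parameters)."""
    return num_orders * hidden_size

# Illustrative: M = 3 encoder orders, N = 3 decoder orders, hidden size 512.
extra = predictor_params(3, 512) + predictor_params(3, 512)

# Relative overhead against a hypothetical ~40M-parameter model.
overhead = extra / 40_000_000
```

With these assumed numbers the predictors add about three thousand parameters, i.e. well below 0.01% of the model, which is consistent with the "negligible" characterization above.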

Inference time and model parameter counts for Transformer and our framework on IWSLT14 En↔De. The study is performed on a single Tesla P100 GPU card.

Training cost analysis for Transformer and our IOT on four IWSLT translation tasks. The study is performed on a single Tesla P100 GPU card.

The results are presented in Table 9, from which we can make several observations. Taking IWSLT14 En→De translation as an example: (1) jointly optimizing different orders indeed introduces more training cost per epoch. The Transformer baseline costs 277.1s per training epoch, while our IOT costs 475.6s and 685.4s

Ensemble performances of standard Transformer and IOT.

We apply LayerDrop (Fan et al., 2019), a dropout technique that regularizes the layer parameters. The test BLEU is 35.40, about 0.8 points above the Transformer. After combining IOT with LayerDrop, we obtain further gains over IOT alone (35.62), reaching a BLEU score of 36.13. This demonstrates that IOT is not merely a regularization method and can be smoothly integrated with other regularization techniques. More details and experiments on other tasks are shown in Appendix B.3.
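The core LayerDrop idea can be sketched as randomly skipping whole layers during training. This is a minimal illustration of the mechanism, not the implementation of Fan et al. (2019); the layer functions and drop rate here are hypothetical.

```python
import random

def forward_with_layerdrop(x, layers, drop_prob, training=True, rng=random):
    """Apply a stack of layers, independently skipping each one with
    probability `drop_prob` during training (the LayerDrop idea).
    At inference time (training=False) every layer is applied."""
    for layer in layers:
        if training and rng.random() < drop_prob:
            continue  # skip this layer entirely for this forward pass
        x = layer(x)
    return x

# With drop_prob=0 the stack behaves like a normal forward pass.
double = lambda v: v * 2
out = forward_with_layerdrop(1, [double, double, double], drop_prob=0.0)  # → 8
```

Because each layer is dropped independently, training effectively samples shallow sub-networks, which regularizes the layer parameters; this is orthogonal to IOT's reordering, which is why the two combine smoothly.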

In Table 19, we provide some sentence examples belonging to each subset for further clarification.

B.3 REGULARIZATION

In Section 5.4, we provided an example of regularization experiments on IWSLT14 De→En translation, which demonstrates that our IOT is not merely a regularization method and can be smoothly integrated with other regularization techniques. To give more evidence and details, we extend the regularization experiments to all IWSLT translation tasks (IWSLT14 En↔De, IWSLT14 En↔Es, IWSLT17 En↔Zh, IWSLT17 En↔Fr) and WMT16 Ro→En translation. The two specific experimental settings are as follows. Setting (1):

Transformer + (1): 28.35 ± 0.12 / 34.68 ± 0.05 / 35.9 ± 0.23 / 36.7 ± 0.14

Regularization study on the 8 IWSLT translation tasks and WMT16 Ro→En translation. We study both setting (1): train the Transformer model with all shared decoders but without instance-wise learning, and setting (2): add the LayerDrop (Fan et al., 2019) regularization technique on these tasks.

Robustness study on IWSLT14 De→En translation (dev set). Order 1 is the order trained by Transformer.

Code availability: https://github.com/instance-wise-ordered-transformer/IOT

A EXPERIMENTAL SETTINGS AND MORE RESULTS

A.1 DETAILED DATA SETTINGS

Neural Machine Translation. Following common practice (Ott et al., 2019), we lowercase all words for IWSLT14 En↔De. For IWSLT14 En↔De, En↔Es and IWSLT17 En↔Fr, we use a joint source and target vocabulary with 10k byte-pair-encoding (BPE) (Sennrich et al., 2015) operations, and for IWSLT17 En↔Zh we use separate source and target vocabularies. For all WMT tasks, sentences are encoded with a joint source and target vocabulary of 32k tokens.

Code Generation. In the Java dataset, the numbers of training, validation and test sequences are 69,708, 8,714 and 8,714 respectively; the corresponding numbers for Python are 55,538, 18,505 and 18,502. All samples are tokenized. We use the downloaded Java dataset without further processing, and use the Python standard AST module to further process the Python code. The source and target vocabulary sizes for natural language to Java code generation are 27k and 50k, and those for natural language to Python code generation are 18k and 50k. In this case, following Wei et al. (2019), we do not apply subword tokenization such as BPE to the sequences.

Abstractive Summarization. The Gigaword corpus represents a headline generation task: each source article contains about 31.4 tokens on average, while each target headline contains about 8.3 tokens. The training data consists of 3.8M article-headline pairs, while the validation and test sets consist of 190k and 2k pairs respectively. We preprocess the dataset in the same way as the NMT tasks: the words in the source articles and target headlines are concatenated to build a joint BPE vocabulary. After preprocessing, there are 29k subword tokens in the vocabulary.

Model Configuration

The detailed model configurations are as follows:
• transformer_iwslt_de_en setting: 6 blocks in encoder and decoder, embedding size 512, feed-forward size 1024, 4 attention heads, dropout 0.3, weight decay 0.0001.
• transformer_vaswani_wmt_en_de_big setting: 6 blocks in encoder and decoder, embedding size 1024, feed-forward size 4096, 16 attention heads, dropout 0.3, attention dropout 0.1, ReLU dropout 0.1.
• transformer_wmt_en_de_big setting: 6 blocks in encoder and decoder, embedding size 1024, feed-forward size 4096, 16 attention heads, dropout 0.3.

Optimization. We adopt the default optimization setting in Vaswani et al. (2017): the Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^-9. The learning rate scheduler is inverse sqrt with 4,000 warmup steps, and the default learning rate is 0.0005. Label smoothing (Szegedy et al., 2016) is used with value 0.1. As introduced, to learn the predictors, we clamp the softmax output with value 0.05.
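The inverse sqrt schedule referenced above follows the formula from Vaswani et al. (2017). A minimal sketch (note that fairseq parameterizes the peak rate via a separate `--lr` argument; this sketch uses the original paper's formula directly):

```python
def inverse_sqrt_lr(step, d_model=512, warmup=4000):
    """Learning rate from Vaswani et al. (2017):
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    It increases linearly for `warmup` steps, then decays
    proportionally to 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Both branches of the `min` coincide at `step == warmup`, so the schedule peaks exactly at the end of warmup and decays smoothly afterwards.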

A.3 RESULTS OF ORDER COMBINATIONS

We show in the paper that different numbers of orders (e.g., N = 4 or N = 5) yield varied performance. A natural question is therefore how the particular combination of the N decoders matters. Here, we use the N = 5 IOT model to show results for different order candidates. We first list each ordered decoder in Table 11 again (same as Table 1). For the N = 5 ordered decoders, we report the performance of 5-order combinations selected from all six variants on the dev sets of IWSLT14 De→En and En→De translation. The results are reported in Table 12. The different combinations achieve similarly strong performance, which shows that our approach is robust to the choice of order combination. This also demonstrates that what matters for IOT is the diversity among order candidates, which helps each data sample distinguish among them. For other numbers N of ordered decoders, the patterns are similar. Therefore, we only report the N-order combinations used for the IOT experiments in the paper: IOT (N = 2) combines orders 4 and 6 (ED→SA→FF and FF→ED→SA); IOT (N = 3) uses orders 1, 4, 6; IOT (N = 4) uses orders 1, 2, 4, 6; and IOT (N = 5) uses orders 1, 2, 4, 5, 6.
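The "six variants" referenced above are simply the permutations of the three decoder layer types, which can be enumerated directly:

```python
from itertools import permutations

# All possible decoder layer orders over self-attention (SA),
# encoder-decoder attention (ED) and feed-forward (FF): 3! = 6 variants.
orders = list(permutations(["SA", "ED", "FF"]))
assert len(orders) == 6

# The two orders named in the text for IOT (N = 2) are among them:
assert ("ED", "SA", "FF") in orders  # ED→SA→FF
assert ("FF", "ED", "SA") in orders  # FF→ED→SA
```

For the encoder, only SA and FF exist, giving 2! = 2 possible orders per block by the same enumeration.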

A.4 RESULTS OF DIFFERENT NUMBER OF DECODERS

The results of N = 4 ordered decoders (orders 1, 2, 4, 6) are mainly reported in the paper. Here, we also show results for other values of N on all tasks, along with the Transformer baseline. The results of different N for WMT14 En→De and WMT16 Ro→En translation, the code generation tasks, and Gigaword summarization are reported in Table 14, Table 15 and Table 16 respectively. As we can see, more ordered decoders bring better performance, which supports the effectiveness of our framework and shows that each data sample has its own preference among different orders. Considering efficiency, we do not perform experiments with more than 4 decoders for these tasks.

We conduct another study on the IWSLT14 De→En dev set to investigate the impact of our proposed auxiliary losses controlled by the weights c1 and c2. The values of c1 and c2 are varied over [0.0, 0.05, 0.1, 0.5] and [0.0, 0.005, 0.01, 0.05] respectively, and the results are presented in Table 17. The best configuration is c1 = 0.1 and c2 = 0.01, so we report the main results in the paper with these values. The results also clearly demonstrate that the two additional losses are necessary for our framework to be effective.

B.2 DATA EXAMPLES VERIFICATION

As discussed in Section 5.2, the data split by the corresponding predicted order exhibits different patterns; for example, the difficulty of each subset differs. We therefore analyze the split data and compute statistics for these subsets. Specifically, we first count the number of sentences S, the number of tokens T, and the distinct vocabulary size D_i in each subset. We show these numbers in Table 18, along with the corresponding averaged BLEU scores (see Figure 2). We can see that the vocabulary of set 1 is the smallest and that of set 2 is the largest, which means there are more distinct words in set 2. This makes generation for set 2 harder than for set 1, matching the BLEU ranking among the sets. Besides, we also calculate the token frequency f_ij for each token j within its subset i, and sum the frequencies of the top 20 tokens in each subset, F_i = Σ_{j=1}^{20} f_ij, as further evidence. The results show that F_1 is the highest, which means set 1 contains mostly frequent words and is easier to learn, while set 2 is harder since F_2 is small. Here F_i denotes the summed frequency of the top 20 tokens in set i, where f_ij is the frequency of token j.
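The subset statistics S, T, D_i and F_i described above can be computed as follows. This is a sketch with a toy subset of tokenized sentences; the real subsets come from the predictor's order assignments.

```python
from collections import Counter

def subset_stats(sentences, top_k=20):
    """For one subset of tokenized sentences, compute:
    S = number of sentences, T = number of tokens,
    D = distinct vocabulary size, and
    F = summed frequency of the top_k most frequent tokens."""
    tokens = [tok for sent in sentences for tok in sent]
    counts = Counter(tokens)
    top = sum(c for _, c in counts.most_common(top_k))
    return {"S": len(sentences), "T": len(tokens),
            "D": len(counts), "F": top}

# Toy example with top_k=1 for readability.
stats = subset_stats([["the", "cat"], ["the", "dog"]], top_k=1)
# S=2 sentences, T=4 tokens, D=3 distinct tokens, F=2 ("the" appears twice)
```

Comparing D across subsets reproduces the vocabulary-size argument (larger D suggests harder generation), and comparing F reproduces the frequency argument (larger F suggests more frequent, easier tokens).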

