RELAXED ATTENTION FOR TRANSFORMER MODELS

Abstract

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, complicating the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model, as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks, with clear improvements in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading benchmark LRS3 with a word error rate of 26.31%, and achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE → EN) machine translation task without external language models and with virtually no additional model parameters. Code and models will be made publicly available.

1. INTRODUCTION

Early encoder-decoder models emerged from machine translation, where the encoder compressed the entire source language sentence into a fixed-length embedding vector (Cho et al., 2014b). This is particularly difficult for very long sentences (Cho et al., 2014a), as the fixed-length embedding vector is only a limited-capacity representation. The use of attention, introduced in Bahdanau et al. (2015), enabled the computation of variable-length weight distributions over the input sequence and soon turned out to be advantageous for far more applications than just neural machine translation (NMT), e.g., automatic speech recognition (ASR) (Chorowski et al., 2015; Chan et al., 2016; Bahdanau et al., 2016), language modeling and understanding (Devlin et al., 2019), object detection (Carion et al., 2020), and image classification (Dosovitskiy et al., 2021; Liu et al., 2021b). Soon the most prominent attention-based encoder-decoder (AED) model emerged, namely the transformer (Vaswani et al., 2017). Without the use of any recurrence, it relies entirely on self-attention in the encoder to model temporal dependencies in the input and on cross attention in the decoder to extract relevant timesteps thereof during the autoregressive decoding process. While transformers in language modeling tasks are well-suited for upscaling the model size and depth without any saturation when large amounts of data are present (Devlin et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Fedus et al., 2022), they are also susceptible to overfitting and require strong regularization to learn at all (Xu et al., 2021; Popel & Bojar, 2018; Steiner et al., 2021).
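For reference, the scaled dot-product attention at the core of the transformer (Vaswani et al., 2017) can be sketched as follows; this is a minimal single-head NumPy illustration (function and variable names are our own), not the batched multi-head implementation used in practice:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k) queries, K: (T_k, d_k) keys, V: (T_k, d_v) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T_q, T_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights
```

In self-attention, Q, K, and V all stem from the same sequence; in the decoder's cross attention, Q comes from the decoder states while K and V come from the encoder outputs.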
In a study exclusively on ASR (Lohrenz et al., 2021), it was shown that regularization by smoothing the attention weights in the decoder's cross attention, dubbed relaxed attention, improves performance when the transformer model is combined with an external language model but, for reasons yet to be explored, does not help without a language model. In this work, we take up the idea of relaxed attention and extend it to the self-attention layers in the encoder, regularizing the encoder itself. Thereby, we increase the method's versatility, as it becomes applicable to encoder-only transformers, which are common in several non-sequence tasks such as image classification, as well as to pre-trained bidirectional encoder representations from transformers (BERT, Devlin et al. (2019)) models. Our main contributions are summarized as follows:

• We introduce relaxed self-attention in the transformer encoder to improve generalization and develop fuzzy relaxation as a variant thereof.

• Beyond relaxed self-attention, we extensively investigate the capability of relaxed cross attention in the decoder of sequence-to-sequence transformer models and show that the improvement is due to better external language model integration, as it suppresses the influence of the internal language model.

• We show improvements of the relaxed attention approaches on a variety of tasks, including automatic speech recognition, lip-reading, machine translation, and image classification. On the lip-reading and machine translation tasks we report a new state-of-the-art and a top-performing result, respectively.

The paper is structured as follows: After a summary of related work in Section 2, we introduce the relaxed attention approach in Section 3, followed by the experimental evaluation including results and discussion in Section 4. Section 5 concludes the paper.
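The relaxed attention idea described above (formally introduced in Section 3) amounts to smoothing each attention weight distribution toward a uniform distribution over the key positions. A minimal sketch, where the blending coefficient gamma and its default value are placeholders of our own, not taken from the original formulation:

```python
import numpy as np

def relax(weights, gamma=0.1):
    # weights: (T_q, T_k) attention weights, each row summing to 1
    # blend each row with the uniform distribution over T_k key positions
    T_k = weights.shape[-1]
    uniform = np.full_like(weights, 1.0 / T_k)
    return (1.0 - gamma) * weights + gamma * uniform
```

Note that the relaxed rows remain valid probability distributions, and gamma = 0 recovers standard attention, so the smoothing can be applied during training only or during both training and inference.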

2. RELATED WORK

Regularization of transformer models In this work, we introduce a regularization method for the self-attention function (Vaswani et al., 2017), which is fundamental to transformer models. Several regularization approaches proposed for such networks in the past relate to the network output of transformer models by modifying the loss computation, either through label smoothing (Müller et al., 2020) or by introducing additional loss terms. This could be a CTC loss computed on the encoder outputs (Karita et al., 2019; Chen et al., 2021) for monotonic tasks such as ASR, or a divergence term between the output softmax distributions of two forward passes with different dropout masks (Liang et al., 2021; Shen et al., 2020). Related to the network input, several, mostly application-dependent, data augmentation approaches such as spectral augmentation for ASR (Park et al., 2019b) or cutoff for machine translation (Shen et al., 2020) have proven to be effective. Another set of regularization methods, specific to transformer models, adds a loss term to encourage attention heads to yield diverse outputs (Li et al., 2018; Audhkhasi et al., 2022), or is based on the dropout technique (Srivastava et al., 2014) and randomly masks attention heads (Zhou et al., 2020; Sun et al., 2020) or entire en-/decoder block layers (LayerDrop) (Fan et al., 2020) during training. It was also observed that only a subset of specialized attention heads contributes to model performance, while other heads remain useless and can be pruned (Voita et al., 2019). Relaxed attention in Lohrenz et al. (2021) was used to prevent overly narrow attention weight distributions in the cross attention during training, which only yielded improvements with an external language model in ASR. We, however, apply this approach to the self-attention function to reduce overfitting already in the encoder and investigate whether relaxed self-attention also helps when applied during both training and test (matched inference).
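Label smoothing, one of the output-side regularizers mentioned above, softens one-hot targets toward a uniform distribution over the classes, structurally similar to how relaxed attention smooths attention weight distributions. A minimal sketch (the function name and the value of eps are our own illustrative choices):

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    # targets: (N,) integer class indices
    # returns (N, num_classes) soft targets: one-hot blended with uniform
    onehot = np.eye(num_classes)[targets]
    return (1.0 - eps) * onehot + eps / num_classes
```

Training against these soft targets with a cross-entropy loss discourages the model from producing overconfident output distributions.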
In addition, we include a variety of the aforementioned regularization methods, which have proven to be effective, as baselines and show that relaxed attention is able to further improve performance, being complementary to other regularization methods. When attention-based encoder-decoder networks were first applied to ASR, Chorowski et al. (2015) proposed a modified softmax function to smooth the attention weights in the cross attention between encoder and decoder by replacing the exponential function in the standard softmax with a sigmoid. Thereby, they compressed the probability-like outputs, but did not take into account the input sequence length, despite the authors' observation that longer sentences require less smoothing of the attention weights. Even though this method, dubbed smooth focus, has so far only been applied to recurrent neural network (RNN)-based AED models, we include it as a reference method in our simulations, as it is the closest to the relaxed attention approach.

Internal language model handling For many sequence-to-sequence tasks, the integration of language models (LMs) into AED models is of dual use: First, LMs leverage large amounts of additional text-only data to improve performance. Second, LMs can be utilized to adapt acoustic models to domains which differ from the original acoustic training data domain. Several techniques exist to combine language models with AED models, such as shallow fusion (Gülçehre et al., 2015), deep fusion (Gülçehre et al., 2015), and cold fusion (Sriram et al., 2018), with shallow fusion still being the most common solution due to its simplicity and flexibility. However, AED models tend to learn an internal language model in the autoregressive decoder (McDermott et al., 2019), which can either be suppressed by subtracting an additional LM trained only on text transcriptions from the

