RELAXED ATTENTION FOR TRANSFORMER MODELS

Abstract

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder that complicates the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights that yields a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model, as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks, with clear improvements in combination with recent benchmark approaches. Specifically, we improve on the former state-of-the-art word error rate of 26.90% on the largest public lip-reading benchmark, LRS3, reaching a word error rate of 26.31%, and we achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE → EN) machine translation task without external language models and with virtually no additional model parameters. Code and models will be made publicly available.

1. INTRODUCTION

Early encoder-decoder models emerged from machine translation, where the encoder compressed the entire source language sentence into a fixed-length embedding vector (Cho et al., 2014b). This is particularly difficult for very long sentences (Cho et al., 2014a), as the fixed-length embedding vector is only a limited-capacity representation. The use of attention, introduced in Bahdanau et al. (2015), enabled the computation of variable-length weight distributions over the input sequence and soon turned out to be advantageous for far more applications than just neural machine translation (NMT), e.g., automatic speech recognition (ASR) (Chorowski et al., 2015; Chan et al., 2016; Bahdanau et al., 2016), language modeling and understanding (Devlin et al., 2019), object detection (Carion et al., 2020), and image classification (Dosovitskiy et al., 2021; Liu et al., 2021b). Soon, the most prominent attention-based encoder-decoder (AED) model emerged: the transformer (Vaswani et al., 2017). Without the use of any recurrence, it relies entirely on self-attention in the encoder to model temporal dependencies in the input, and on cross attention in the decoder to extract relevant encoder timesteps during the autoregressive decoding process. While transformers in language modeling tasks are well-suited for upscaling the model size and depth without any saturation when large amounts of data are present (Devlin et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Fedus et al., 2022), they are also prone to overfitting and require strong regularization to learn at all (Xu et al., 2021; Popel & Bojar, 2018; Steiner et al., 2021).
In a study exclusively on ASR (Lohrenz et al., 2021), it was shown that regularization by smoothing the attention weights in the decoder's cross attention, dubbed relaxed attention, improves performance when the transformer model is combined with an external language model but, for reasons yet to be explored, does not help without a language model. In this work, we take up the idea of relaxed attention and extend it to the self-attention layers in the encoder, thereby regularizing the encoder itself. This increases the method's versatility, as it becomes applicable to encoder-only transformers, which are common in several non-sequence tasks such as image classification or in pre-trained bidirectional encoder representations from transformers (BERT, Devlin et al. (2019)) models. Our main contributions are summarized as follows:
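Conceptually, relaxed attention smooths each row of attention weights toward a uniform distribution over the attended positions. The sketch below illustrates this interpolation; the smoothing-coefficient name `gamma` and its default value are illustrative assumptions, not taken from this paper:

```python
import numpy as np

def relax_attention(weights, gamma=0.1):
    """Interpolate attention weights with a uniform distribution.
    weights: (..., T) rows summing to 1 (softmax output);
    gamma: smoothing strength in [0, 1]; gamma=0 recovers
    standard attention (name `gamma` is a hypothetical choice here)."""
    T = weights.shape[-1]
    return (1.0 - gamma) * weights + gamma / T

# A peaked attention row over 3 positions is flattened but stays normalized
w = np.array([[0.7, 0.2, 0.1]])
w_rel = relax_attention(w, gamma=0.2)
```

Because the uniform term is constant, the operation adds no trainable parameters, consistent with the paper's claim of virtually no additional model parameters.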

