LEARNING MONOTONIC ALIGNMENTS WITH SOURCE-AWARE GMM ATTENTION

Abstract

Transformers with soft attention have been widely adopted in various sequence-to-sequence (Seq2Seq) tasks. Whereas soft attention is effective for learning semantic similarities between queries and keys based on their contents, it does not explicitly model the order of elements in sequences, which is crucial for monotonic Seq2Seq tasks. Learning monotonic alignments between input and output sequences can benefit long-form and online inference applications, which remain challenging for the conventional soft attention algorithm. Herein, we focus on monotonic Seq2Seq tasks and propose a source-aware Gaussian mixture model attention in which the attention scores are calculated monotonically, considering both the content and the order of the source sequence. We experimentally demonstrate that the proposed attention mechanism improves performance on online and long-form speech recognition problems without degrading offline, in-distribution speech recognition.

1. INTRODUCTION

In recent years, transformer models with soft attention have been widely adopted in various sequence generation tasks (Raffel et al., 2019; Vaswani et al., 2017; Parmar et al., 2018; Karita et al., 2019). Soft attention does not explicitly model the order of elements in a sequence and attends to all encoder outputs at each decoder step. However, the order of elements is crucial for monotonic sequence-to-sequence (Seq2Seq) tasks, such as automatic speech recognition (ASR), video analysis, and lip reading. Learning monotonic alignments enables the model to attend to a subset of the encoder outputs without performance degradation in these tasks. Moreover, soft attention is not suitable for streaming inference applications because the softmax operation must wait until all encoder outputs are produced. Figure 1(b) shows the attention plot for soft attention. Soft attention learns the alignments between queries and keys based on their similarities; it therefore requires all encoder tokens prior to the attention score calculation. Furthermore, soft attention cannot easily decode long-form sequences that are not represented in the training corpus.

Gaussian mixture model (GMM) attention (Graves, 2013; Battenberg et al., 2020; Chiu et al., 2019) has been proposed for learning the monotonic mapping between encoder and decoder states for long-form sequence generation. GMM attention is a purely location-aware algorithm in which encoder contents are not considered during the attention score calculation. However, each element in the encoder output sequence contains a different amount of information and should be attended to according to its content. In Figure 1(c), GMM attention fails to learn detailed alignments and attends to many tokens simultaneously. In this study, we adapt the GMM attention mechanism to the modern transformer architecture and propose Source-Aware Gaussian Mixture Model (SAGMM) attention, which considers both the contents and the order of the source sequence. Each component in the SAGMM is multi-modal and discards non-informative tokens in the attention window. For online inference, we propose a truncated SAGMM (SAGMM-tr) that discards the long tail of the attention scores in the SAGMM. To the best of our knowledge, this is the first
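To make the location-aware baseline concrete, the following is a minimal NumPy sketch of one decoder step of Graves-style GMM attention: the mixture means advance by non-negative increments, which is what makes the alignment monotonic, and no encoder content enters the score. The function name and the choice of parameters passed in are illustrative, not the paper's exact formulation.

```python
import numpy as np

def gmm_attention_weights(prev_mu, w, delta, sigma, T):
    """One decoder step of location-based GMM attention (illustrative sketch).

    prev_mu: (K,) component means from the previous decoder step
    w:       (K,) non-negative mixture weights
    delta:   (K,) non-negative mean increments (enforce e.g. via softplus)
    sigma:   (K,) component standard deviations
    T:       number of encoder positions
    """
    mu = prev_mu + delta                       # means only move forward -> monotonic
    j = np.arange(T)[None, :]                  # encoder positions, shape (1, T)
    # Each component places a Gaussian window over positions; content is ignored.
    phi = w[:, None] * np.exp(-0.5 * ((j - mu[:, None]) / sigma[:, None]) ** 2)
    alpha = phi.sum(axis=0)                    # mix the K components
    return alpha / alpha.sum(), mu             # normalized scores, updated means
```

Because the scores depend only on positions, every encoder frame under a window is weighted regardless of how informative it is, which is the shortcoming the source-aware variant is meant to address.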


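For the online setting, the idea behind truncation can be pictured as dropping the negligible tail of the attention distribution so that only a bounded window of encoder frames is needed per step. The thresholding rule below is a hypothetical stand-in for illustration; the paper's SAGMM-tr criterion may differ.

```python
import numpy as np

def truncate_attention(alpha, eps=1e-3):
    """Zero out the long tail of attention scores and renormalize.

    alpha: (T,) normalized attention scores
    eps:   illustrative cutoff, not the paper's actual truncation rule
    """
    trunc = np.where(alpha >= eps, alpha, 0.0)  # discard negligible tail mass
    return trunc / trunc.sum()                  # renormalize surviving scores
```

With the tail removed, the decoder only ever touches the frames inside the surviving window, which is what permits streaming inference without waiting for the full encoder output.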