LEARNING MONOTONIC ALIGNMENTS WITH SOURCE-AWARE GMM ATTENTION

Abstract

Transformers with soft attention have been widely adopted in various sequence-to-sequence (Seq2Seq) tasks. Whereas soft attention is effective for learning semantic similarities between queries and keys based on their contents, it does not explicitly model the order of elements in sequences, which is crucial for monotonic Seq2Seq tasks. Learning monotonic alignments between input and output sequences may be beneficial for long-form and online inference applications that remain challenging for the conventional soft attention algorithm. Herein, we focus on monotonic Seq2Seq tasks and propose a source-aware Gaussian mixture model attention in which the attention scores are monotonically calculated considering both the content and order of the source sequence. We experimentally demonstrate that the proposed attention mechanism improves performance on online and long-form speech recognition problems without performance degradation in offline in-distribution speech recognition.

1. INTRODUCTION

In recent years, transformer models with soft attention have been widely adopted in various sequence generation tasks (Raffel et al., 2019; Vaswani et al., 2017; Parmar et al., 2018; Karita et al., 2019). Soft attention does not explicitly model the order of elements in a sequence and attends to all encoder outputs at each decoder step. However, the order of elements is crucial for understanding monotonic sequence-to-sequence (Seq2Seq) tasks, such as automatic speech recognition (ASR), video analysis, and lip reading. Learning monotonic alignments enables the model to attend to a subset of the encoder output without performance degradation in these tasks. Moreover, soft attention is not suitable for streaming inference applications because the softmax operation must wait until all encoder outputs are produced. Figure 1 (b) shows the attention plot for soft attention. Soft attention learns the alignments between queries and keys based on their similarities; it requires all encoder tokens prior to the attention score calculation. Furthermore, soft attention cannot easily decode long-form sequences that are not covered by the training corpus. Gaussian Mixture Model (GMM) attention (Graves, 2013; Battenberg et al., 2020; Chiu et al., 2019) has been proposed for learning the monotonic mapping between encoder and decoder states for long-form sequence generation. GMM attention is a purely location-aware algorithm in which encoder contents are not considered during attention score calculation. However, each element in the encoder output sequence contains a different amount of information and should be attended to according to its content. In Figure 1 (c), GMM attention fails to learn detailed alignments and attends to many tokens simultaneously.
In this study, we adapt the GMM attention mechanism to the modern transformer architecture and propose the Source-Aware Gaussian Mixture Model (SAGMM) attention, which considers both the contents and the order of source sequences. Each component in the SAGMM is multi-modal and discards non-informative tokens in the attention window. For online inference, we propose a truncated SAGMM (SAGMM-tr) that discards the long tail of the attention score in the SAGMM. To the best of our knowledge, this is the first attempt to adopt a GMM-based attention for online sequence generation tasks. Learning accurate monotonic alignments enables the SAGMM-tr to attend to a relevant subset of the sequence at each decoder step and improves the performance of the model on streaming and long-form sequence generation tasks. Figure 1 (d) shows the monotonic alignments learned by the SAGMM-tr, enabling online inference. Experiments on streaming and long-form ASR showed substantial performance improvements compared with conventional algorithms, without performance degradation in offline in-distribution ASR. Furthermore, we tested the SAGMM-tr on a machine translation task and demonstrated the performance of the proposed algorithm on non-monotonic tasks.

2. SOURCE-AWARE GMM ATTENTION

2.1. SOFT ATTENTION

Herein, we abbreviate the head index h during the attention score calculation for simplicity. In dot-product multi-head soft attention (Vaswani et al., 2017) without relative positional encoding, the attention score α_Soft is derived from the query matrix Q ∈ R^(I×d) and key matrix K ∈ R^(J×d) as follows:

α_Soft = softmax(QKᵀ / √d),  (1)

where d is the feature dimension, and I and J are the decoder and encoder sequence lengths, respectively. The attention context matrix from the h-th head, H_h, and the multi-head output M are expressed as

H_h = α^h_Soft V_h,  (2)
M = concat(H_1; ...; H_{n_h}) W^O,  (3)

where α^h_Soft denotes the α_Soft for the h-th head, V_h ∈ R^(J×d) the value matrix for the h-th head, and n_h the number of heads. In this study, we adopt relative positional encoding (Shaw et al., 2018; Dai et al., 2019), which provides a stronger baseline for long-form sequence generation in self-attention layers.
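The per-head soft-attention computation above can be sketched in NumPy. This is a minimal illustration under assumed conventions (per-head Q, K, V stacked along a leading head axis, no masking, no relative positional encoding), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_soft_attention(Q, K, V, W_O):
    """Dot-product multi-head soft attention.

    Q: (n_h, I, d) queries, K: (n_h, J, d) keys, V: (n_h, J, d) values,
    W_O: (n_h * d, d_model) output projection.
    Returns the multi-head output M (I, d_model) and scores alpha (n_h, I, J).
    """
    n_h, I, d = Q.shape
    # alpha_Soft = softmax(Q K^T / sqrt(d)), computed per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)     # (n_h, I, J)
    alpha = softmax(scores, axis=-1)
    H = alpha @ V                                      # (n_h, I, d)
    # M = concat(H_1; ...; H_{n_h}) W^O
    M = H.transpose(1, 0, 2).reshape(I, n_h * d) @ W_O
    return M, alpha
```

Each row of alpha sums to one over the J encoder positions, which is precisely why every decoder step depends on the full encoder output.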

2.2. GMM ATTENTION

Previous studies on GMM attention (Battenberg et al., 2020; Chiu et al., 2019) were based on early content-based attention (Cho et al., 2014). Li et al. (2020) applied GMM attention to the transformer framework but did not provide detailed descriptions. Herein, we adopt the V2 model of Battenberg et al. (2020), which improves on the original GMM attention mechanism (Graves, 2013). We define GMM attention as a variant of multi-head attention by treating each Gaussian distribution component as the attention score of a single head in a multi-head mechanism. In the study by Battenberg et al. (2020), the value matrix was shared across all Gaussian components, whereas in this study the multi-head value matrices are multiplied by the probabilities from the corresponding components so as to attend to information from different representation subspaces (Vaswani et al., 2017). Hence, the multi-head GMM attention introduced here is a more general algorithm than the early GMM attention. Let us denote the i-th row of Q as Q_i ∈ R^(1×d). The normal distribution parameters for the i-th step are expressed as

(Δ_i, σ_i, φ_i) = (ζ(Q_i W^Δ), ζ(Q_i W^σ), Q_i W^φ),  (4)
μ_i = Δ_i + μ_{i−1},  (5)
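To make the recursion in Eqs. (4)-(5) concrete, the following sketch computes one decoder step of multi-head GMM attention scores. It assumes ζ is a softplus nonlinearity (an assumption; the paper defines ζ elsewhere) and uses hypothetical per-head projection vectors W_delta, W_sigma, W_phi in place of the paper's projection matrices; the mixture-weight logits φ_i are normalized with a softmax over components. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def softplus(x):
    # Assumed form of zeta: positive outputs keep Delta_i and sigma_i valid.
    return np.log1p(np.exp(x))

def gmm_attention_step(query, mu_prev, W_delta, W_sigma, W_phi, J):
    """One decoder step of GMM attention scores.

    query: (n_h, d) per-head query Q_i; mu_prev: (n_h,) previous means mu_{i-1}.
    W_delta, W_sigma, W_phi: (n_h, d) per-head projection vectors (hypothetical).
    Returns alpha: (n_h, J) scores over key positions, and mu: (n_h,) new means.
    """
    delta = softplus((query * W_delta).sum(-1))   # Delta_i = zeta(Q_i W^Delta)
    sigma = softplus((query * W_sigma).sum(-1))   # sigma_i = zeta(Q_i W^sigma)
    phi = (query * W_phi).sum(-1)                 # phi_i = Q_i W^phi
    w = np.exp(phi - phi.max())                   # mixture weights over components
    w = w / w.sum()
    # mu_i = Delta_i + mu_{i-1}: the means can only move forward, which is
    # what enforces monotonic alignment across decoder steps.
    mu = delta + mu_prev
    j = np.arange(J)                              # key positions 0..J-1
    gauss = np.exp(-0.5 * ((j[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
    alpha = w[:, None] * gauss                    # (n_h, J)
    return alpha, mu
```

Because softplus is strictly positive, Δ_i > 0 and hence μ_i > μ_{i−1} at every step, so each component's attention window slides monotonically over the encoder sequence regardless of the query contents.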



Figure 1: Input speech waveform (a) and attention plots between the speech waveform and transcription learned by (b) soft attention, (c) GMM attention, and (d) SAGMM-tr attention.

