HIDDEN MARKOV TRANSFORMER FOR SIMULTANEOUS MACHINE TRANSLATION

Abstract

Simultaneous machine translation (SiMT) outputs the target sequence while still receiving the source sequence, so learning when to start translating each target token is the core challenge of the SiMT task. However, learning the optimal moment among the many possible moments of starting translating is non-trivial, since these moments always hide inside the model and can only be supervised with the observed target sequence. In this paper, we propose the Hidden Markov Transformer (HMT), which treats the moments of starting translating as hidden events and the target sequence as the corresponding observed events, thereby organizing them as a hidden Markov model. HMT explicitly models multiple moments of starting translating as candidate hidden events, and then selects one to generate the target token. During training, by maximizing the marginal likelihood of the target sequence over multiple moments of starting translating, HMT learns to start translating at those moments at which target tokens can be generated more accurately. Experiments on multiple SiMT benchmarks show that HMT outperforms strong baselines and achieves state-of-the-art performance.

1. INTRODUCTION

Recently, with the increase of real-time scenarios such as live broadcasting, video subtitles, and conferences, simultaneous machine translation (SiMT) has attracted more attention (Cho & Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019), as it requires the model to receive source tokens one by one and simultaneously generate the target tokens. To achieve high-quality translation under low latency, a SiMT model needs to learn when to start translating each target token (Gu et al., 2017), thereby making a wise decision between waiting for the next source token (i.e., a READ action) and generating a target token (i.e., a WRITE action) during the translation process. However, learning when to start translating target tokens is not trivial for a SiMT model, as the moments of starting translating always hide inside the model and we can only supervise the model with the observed target sequence (Zhang & Feng, 2022a). Existing SiMT methods divide into fixed and adaptive approaches to deciding when to start translating. Fixed methods decide when to start translating according to pre-defined rules instead of learning them (Dalvi et al., 2018; Ma et al., 2019; Elbayad et al., 2020). Such methods ignore the context and thus sometimes force the model to start translating even when the source content is insufficient (Zheng et al., 2020a). Adaptive methods dynamically decide READ/WRITE actions, for example by predicting a variable that indicates the READ/WRITE action (Arivazhagan et al., 2019; Ma et al., 2020; Miao et al., 2021). However, due to the lack of a clear correspondence between READ/WRITE actions and the observed target sequence (Zhang & Feng, 2022c), it is difficult to learn precise READ/WRITE actions only with the supervision of the observed target sequence (Alinejad et al., 2021; Zhang & Feng, 2022a; Indurthi et al., 2022).
To seek the optimal moment of starting translating each target token, which hides inside the model, an ideal solution is to clearly correspond the moments of starting translating to the observed target sequence, and further learn to start translating at those moments at which target tokens can be generated more accurately. To this end, we propose the Hidden Markov Transformer (HMT) for SiMT, which treats the moments of starting translating as hidden events and the translation results as the corresponding observed events, thereby organizing them in the form of a hidden Markov model (Baum & Petrie, 1966; Rabiner & Juang, 1986; Wang et al., 2018). As illustrated in Figure 1, HMT explicitly produces a set of states for each target token, where the multiple states represent starting translating the target token at different moments (i.e., after receiving different numbers of source tokens). Then, HMT judges whether to select each state, from low latency to high latency. Once a state is selected, HMT generates the target token based on the selected state. For example, HMT produces 3 states for the first target token, representing starting translating after receiving the first 1, 2, and 3 source tokens respectively (i.e., h ≤1 , h ≤2 and h ≤3 ). During the judgment, the first state is not selected, the second state is selected to output 'I', and the third state is no longer considered; thus HMT starts translating 'I' after receiving the first 2 source tokens. During training, HMT is optimized by maximizing the marginal likelihood of the target sequence (i.e., the observed events) over all possible selection results (i.e., the hidden events).
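The inference-time judgment described above can be sketched as a simple greedy loop. Note that `select_prob`, the `threshold`, and the fallback to the last state are illustrative assumptions for this sketch, not the paper's exact parameterization; HMT learns the selection decisions rather than thresholding fixed probabilities.

```python
def simulate_selection(select_prob, K, threshold=0.5):
    """Greedy sketch of HMT-style state selection (hypothetical interface).

    select_prob: for each target token, a list of K selection probabilities,
                 ordered from the lowest-latency state to the highest.
    Judges states from low latency to high latency and takes the first one
    whose probability clears the threshold; the highest-latency state serves
    as a fallback so a token is always emitted.
    """
    chosen = []
    for probs in select_prob:
        k = K - 1  # fallback: highest-latency state
        for j, p in enumerate(probs):
            if p >= threshold:
                k = j  # first accepted state; later states are not considered
                break
        chosen.append(k)
    return chosen

# Toy run matching the Figure 1 example for the first target token:
# state 1 rejected, state 2 selected, state 3 never considered.
print(simulate_selection([[0.2, 0.8, 0.9]], K=3))  # → [1]
```

The left-to-right order matters: once a lower-latency state is accepted, higher-latency states for the same token are skipped, which is what keeps the policy's latency low.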
In this way, the states (moments) that generate the target token more accurately are selected with higher probability, so HMT effectively learns when to start translating under the supervision of the observed target sequence. Experiments on English→Vietnamese and German→English SiMT benchmarks show that HMT outperforms strong baselines under all latency levels and achieves state-of-the-art performance.
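Marginalizing the target likelihood over all hidden selection sequences is the standard HMM forward algorithm. The sketch below is a minimal illustration under assumed inputs: `emit_logp` and `trans_logp` are hypothetical log-probability tables standing in for HMT's learned translation and selection distributions, and the transition table is assumed to already encode the monotonicity constraint (a `-inf` entry wherever latency would decrease).

```python
import math

def marginal_log_likelihood(emit_logp, trans_logp):
    """Forward-algorithm sketch for HMT-style training (hypothetical shapes).

    emit_logp[i][k]     = log p(y_i | state k of target token i)
    trans_logp[i][j][k] = log p(state k at token i | state j at token i-1);
                          for i = 0, row j = 0 holds the initial distribution.
    Returns the log of the target likelihood summed over every possible
    sequence of state selections (the hidden events).
    """
    I = len(emit_logp)      # number of target tokens
    K = len(emit_logp[0])   # number of candidate states per token
    # Initialize with the first token's state prior and emission.
    alpha = [trans_logp[0][0][k] + emit_logp[0][k] for k in range(K)]
    for i in range(1, I):
        # alpha[k] accumulates all selection paths ending in state k at step i.
        alpha = [
            math.log(sum(math.exp(alpha[j] + trans_logp[i][j][k])
                         for j in range(K)))
            + emit_logp[i][k]
            for k in range(K)
        ]
    return math.log(sum(math.exp(a) for a in alpha))
```

Because every selection path contributes to the objective, gradient descent raises the selection probability of exactly those states whose emissions explain the observed target tokens well, which is the learning effect described above. A practical implementation would use a numerically stable log-sum-exp rather than `exp`/`log` directly.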

2. RELATED WORK

Learning when to start translating is the key to SiMT. Recent SiMT methods fall into fixed and adaptive categories. Among fixed methods, Ma et al. (2019) proposed the wait-k policy, which first READs k source tokens and then READs/WRITEs one token alternately. Elbayad et al. (2020) proposed an efficient training method for the wait-k policy, which randomly samples different k between batches. Zhang & Feng (2021a) proposed a char-level wait-k policy. Zheng et al. (2020a) proposed adaptive wait-k, which heuristically integrates multiple wait-k models during inference. Guo et al. (2022) proposed post-evaluation for the wait-k policy. Zhang et al. (2022) proposed the wait-info policy, which improves the wait-k policy by quantifying token information. Zhang & Feng (2021c) proposed MoE wait-k, which learns multiple wait-k policies via multiple experts. Among adaptive methods, Gu et al. (2017) trained a READ/WRITE agent via reinforcement learning. Zheng et al. (2019b) trained the agent with golden actions generated by rules. Zhang & Feng (2022c) proposed GMA, which decides when to start translating according to predicted alignments. Arivazhagan et al. (2019) proposed MILk, which uses a variable to indicate READ/WRITE and jointly trains the variable with monotonic attention (Raffel et al., 2017). Ma et al. (2020) proposed MMA, which implements MILk on the Transformer. Liu et al. (2021) proposed CAAT for SiMT, which adapts RNN-T to the Transformer architecture. Miao et al. (2021) proposed GSiMT, which generates READ/WRITE actions, implicitly considering all READ/WRITE combinations during training and taking one combination during inference. Zhang & Feng (2022d) proposed an information-transport-based policy for SiMT. Compared with these previous methods, HMT explicitly models multiple possible moments of starting translating in both training and inference, and integrates the two key issues in SiMT, 'learning when to start translating' and 'learning translation', into a unified framework via the hidden Markov model.

Figure 1: Illustration of the hidden Markov Transformer. HMT explicitly produces K = 3 states for each target token, representing starting translating that token after receiving different numbers of source tokens (where h ≤n means starting translating after receiving the first n source tokens). Then, HMT judges whether to select each state, from low latency to high latency (i.e., from left to right). Once a state is selected, HMT translates the target token based on the selected state.

