FINE: FUTURE-AWARE INFERENCE FOR STREAMING SPEECH TRANSLATION

Abstract

A popular approach to streaming speech translation is to employ a single offline model together with a wait-k policy to support different latency requirements, which is simpler than training multiple online models under different latency constraints. However, there is an apparent mismatch in using a model trained on complete utterances to translate partial streaming speech during online inference. We demonstrate that speech representations extracted at the end of a streaming input differ significantly from their counterparts at the same positions when the complete utterance is available. Building on our observation that this problem can be alleviated by introducing a few frames of future speech signals, we propose Future-aware inference (FINE) for streaming speech translation, with two different methods to make the model aware of the future. The first, FINE-Mask, incorporates future context through a trainable masked speech model. The second, FINE-Wait, simply waits for more actual future audio frames at the cost of extra latency. Experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks show that both methods are effective and achieve better trade-offs between translation quality and latency than strong baselines, and that a hybrid approach combining the two yields further improvement. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch between offline training and online inference.

1. INTRODUCTION

Streaming speech translation (ST) systems consume audio frames incrementally and generate real-time translations, unlike their offline counterparts, which have access to the complete utterance before starting to translate. Because of this streaming nature, streaming ST models commonly use unidirectional encoders (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021) and are trained with some wait-k policy (Ma et al., 2019) that determines whether to wait for more speech frames or emit target tokens. In real-world applications, however, it is costly to train and maintain multiple models to satisfy different latency requirements (Zhang & Feng, 2021). Recent work (Papi et al., 2022; Dong et al., 2022) shows that offline models can be adapted to streaming scenarios with wait-k policies to meet different latency requirements and achieve comparable or better performance, partially due to their use of more powerful bidirectional encoders. However, there is an inherent mismatch in using a model trained with complete utterances on incomplete streaming speech during online inference (Ma et al., 2019). Intuitively, speech representations extracted from streaming inputs (Figure 1(b)) are less informative than those obtained with full speech encoding (Figure 1(a)). Two questions arise naturally: how large is the difference in speech representations between the two inference modes, and is it significant enough to cause problems? We analyze the gap in speech representations, measured by cosine similarity, at different positions in the streaming input compared to using the full speech (Section 3). We find a significantly greater gap for representations closer to the end of a streaming segment, with an average similarity score as low as 0.2 for the last frame, and the gap quickly narrows for frames further away.
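This gap can be illustrated with a toy sketch. The moving-average "encoder" below is purely a hypothetical stand-in for a bidirectional speech encoder (none of these functions come from the paper); it shows why context-dependent representations of a streaming prefix diverge from full-utterance representations only near the prefix end:

```python
import numpy as np

def toy_bidirectional_encode(frames, window=3):
    # Hypothetical stand-in for a bidirectional encoder: each output position
    # averages a window of surrounding frames, so representations near the end
    # of the input depend on future context that a streaming prefix lacks.
    T = len(frames)
    out = np.empty_like(frames)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        out[t] = frames[lo:hi].mean(axis=0)
    return out

def positional_similarity(frames, prefix_len):
    # Per-position cosine similarity between representations extracted from a
    # streaming prefix and those of the same positions under full-speech
    # encoding -- the quantity analysed in Section 3.
    full = toy_bidirectional_encode(frames)[:prefix_len]
    partial = toy_bidirectional_encode(frames[:prefix_len])
    num = (full * partial).sum(axis=1)
    den = np.linalg.norm(full, axis=1) * np.linalg.norm(partial, axis=1)
    return num / den

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))
sims = positional_similarity(frames, prefix_len=30)
# Positions whose window lies entirely inside the prefix match full-speech
# encoding exactly; the last few positions diverge for lack of future context.
```

Any real bidirectional Transformer encoder mixes context far more globally than this windowed average, so the divergence pattern in practice extends over the last several frames rather than a fixed window.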
Moreover, we observe more degradation in translation quality for utterances with the greatest gap in speech representations between online and offline inference. Based on these findings, we hypothesize that the lack of future context at the end of streaming inputs is detrimental to streaming speech translation. To address this, we propose two novel Future-aware inference (FINE) strategies for streaming speech translation: FINE-Mask and FINE-Wait, as shown in Figures 1(c) and 1(d). In FINE-Mask, we append a few mask embeddings to the end of the current streaming speech tokens as additional input to the acoustic feature extractor, which, based on its masked-modeling capability, can implicitly estimate and construct future contexts in the corresponding hidden representations and thus extract more accurate representations for the frames in the streaming input. Since we find that only the speech representations of the last few positions in the streaming input are severely affected by the mismatch problem, the closest future context could provide the most improvement. Thus, in FINE-Wait, we simply wait for a few extra speech tokens during streaming encoding and use them as the future context to extract improved representations for the frames in the original streaming segment. FINE-Wait incurs additional latency because it waits for more oracle future context, but it achieves significant improvement in translation quality and leads to a better trade-off. We conduct experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks. The results show that our methods outperform several strong baselines on the trade-off between translation quality and latency. In particular, in the lower latency range (when AL is less than 1000 ms), we achieve improvements of 8 BLEU on EnDe, 12 BLEU on EnEs, and 6 BLEU on EnFr.
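Assuming a generic frame-level feature extractor, the two inference-time input constructions could be sketched as follows; the function names, the mask embedding, and the counts of appended or waited tokens are illustrative placeholders, not the paper's exact configuration:

```python
import numpy as np

def fine_mask_input(prefix_frames, mask_embedding, num_mask=4):
    # FINE-Mask (sketch): append copies of a trained [MASK] embedding to the
    # streaming prefix so that a masked-speech-model feature extractor can
    # implicitly reconstruct the missing future context. `num_mask` and the
    # embedding itself are illustrative placeholders.
    future = np.tile(mask_embedding, (num_mask, 1))
    return np.concatenate([prefix_frames, future], axis=0)

def fine_wait_input(all_frames, prefix_len, num_wait=4):
    # FINE-Wait (sketch): instead wait for `num_wait` real future frames,
    # trading extra latency for oracle future context.
    return all_frames[: prefix_len + num_wait]

prefix = np.zeros((30, 8))                                 # 30-step prefix
x_mask = fine_mask_input(prefix, np.ones(8), num_mask=4)   # 30 real + 4 mask
x_wait = fine_wait_input(np.zeros((100, 8)), prefix_len=30, num_wait=4)
```

In both cases, after feature extraction only the representations at the original prefix positions would be kept for translation; the appended mask tokens or waited frames serve purely as future context.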
Extensive analyses demonstrate that introducing future context reduces the representation gap between the full speech encoding and the partial streaming encoding.

2. BACKGROUND

Speech translation systems can be roughly categorized into non-streaming (offline) and streaming (online) depending on the inference mode. Regardless of the inference mode, speech translation models typically employ the encoder-decoder architecture and are trained on an ST corpus D = {(x, z, y)}, where x = (x_1, ..., x_T) denotes an audio sequence, and z = (z_1, ..., z_I) and y = (y_1, ..., y_J) denote the corresponding source transcription and target translation, respectively.

Non-Streaming Speech Translation. For the non-streaming ST task, the encoder maps the entire input audio x to the speech representations h, and the decoder generates the j-th target token y_j conditioned on the full representations h and the previously generated tokens y_{<j}. The decoding process of non-streaming ST is defined as p(y | x) = \prod_{j=1}^{J} p(y_j | x, y_{<j}). A significant amount of work has focused on non-streaming ST, including pre-training (Wang et al., 2020a; Dong et al., 2021a; Tang et al., 2022; Ao et al., 2022), multi-task learning (Liu et al., 2020; Indurthi et al., 2020; 2021), data augmentation (Pino et al., 2019; Di Gangi et al., 2019b; McCarthy et al., 2020), knowledge distillation (Dong et al., 2021b; Zhao et al., 2021; Du et al., 2022), and cross-modality representation learning (Tang et al., 2021; Fang et al., 2022; Ye et al., 2022).

Streaming Speech Translation. A streaming ST model generates the j-th target token y_j based on the streaming audio prefix x_{≤g(j)} and the previous tokens y_{<j}, where g(j) is a monotonic non-
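The wait-k read/write schedule g(j) can be sketched as follows. This is a minimal illustration under the assumption of a fixed per-token stride (speech wait-k policies typically read fixed-size segments rather than single frames); it is not the paper's exact policy:

```python
def wait_k_policy(j, k, src_len, stride=1):
    # Hypothetical wait-k schedule g(j): the number of source steps that must
    # be read before emitting target token j (1-indexed) -- k initial steps
    # plus one per emitted token, capped at the source length. For speech,
    # `stride` models a decision step spanning several audio frames.
    return min((k + j - 1) * stride, src_len)

def simulate_actions(src_len, tgt_len, k):
    # READ/WRITE action sequence implied by g(j) for a toy utterance.
    actions, read = [], 0
    for j in range(1, tgt_len + 1):
        while read < wait_k_policy(j, k, src_len):
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions

actions = simulate_actions(src_len=5, tgt_len=4, k=2)
```

With k = 2, the policy reads two source steps before the first write and then alternates, which is the characteristic diagonal read/write pattern of wait-k inference.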



Figure 1: (a) and (b) illustrate the input mismatch between offline training and streaming inference. (c) and (d) depict the proposed methods FINE-Mask and FINE-Wait, respectively. "M" in (c) denotes the mask token. Our methods introduce more informative future context to mitigate the mismatch.

