FINE: FUTURE-AWARE INFERENCE FOR STREAMING SPEECH TRANSLATION

Abstract

A popular approach to streaming speech translation is to employ a single offline model together with a wait-k policy to support different latency requirements, which is simpler than training multiple online models, each with a different latency constraint. However, there is an apparent mismatch between training such a model on complete utterances and applying it to partial streaming speech during online inference. We demonstrate that the speech representations extracted at the end of a streaming input differ significantly from their counterparts at the same positions when the complete utterance is available. Based on the observation that this mismatch can be alleviated by introducing a few frames of future speech, we propose Future-aware inference (FINE) for streaming speech translation, with two methods to make the model aware of the future. The first, FINE-Mask, incorporates future context through a trainable masked speech model; the second, FINE-Wait, simply waits for more actual future audio frames at the cost of extra latency. Experiments on the MuST-C En-De, En-Es, and En-Fr benchmarks show that both methods achieve better trade-offs between translation quality and latency than strong baselines, and that a hybrid approach combining the two yields further improvement. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch between offline training and online inference.

1. INTRODUCTION

Streaming speech translation (ST) systems consume audio frames incrementally and generate translations in real time, unlike their offline counterparts, which have access to the complete utterance before starting to translate. Because of this streaming nature, streaming ST models commonly use unidirectional encoders (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021) and are trained with a wait-k policy (Ma et al., 2019) that determines whether to wait for more speech frames or to emit target tokens. In real-world applications, however, training and maintaining multiple models to satisfy different latency requirements is costly (Zhang & Feng, 2021). Recently, several works (Papi et al., 2022; Dong et al., 2022) have shown that offline models can be adapted to streaming scenarios with wait-k policies to meet different latency requirements and achieve comparable or better performance, partially due to their use of more powerful bidirectional encoders. However, there is an inherent mismatch in applying a model trained on complete utterances to incomplete streaming speech during online inference (Ma et al., 2019). Intuitively, speech representations extracted from streaming inputs (Figure 1(b)) are less informative than those obtained with full speech encoding (Figure 1(a)).

Two questions arise naturally: how large is the difference in speech representations between the two inference modes, and is it significant enough to cause problems? We analyze the gap in speech representations, measured by cosine similarity, at different positions in the streaming input compared to using the full speech (Section 3). We find that the gap is significantly larger for representations closer to the end of a streaming segment, with an average similarity score as low as 0.2 for the last frame, and that it quickly narrows for frames further away.
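The measurement described above can be sketched as follows. Note that this is a minimal illustration, not the paper's actual setup: the moving-average "encoder" below is a toy stand-in for a bidirectional speech encoder, chosen only because it exhibits the same property that representations near the end of a prefix lack the right-context they would see offline; all function names and parameters here are our own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def toy_bidirectional_encode(frames, window=3):
    """Toy stand-in for a bidirectional encoder: each output frame is the
    mean of its neighbours within `window` positions on both sides, so
    outputs near the end of a prefix are missing right-context."""
    n = len(frames)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        out.append(frames[lo:hi].mean(axis=0))
    return np.stack(out)

def prefix_vs_full_gap(frames, prefix_len):
    """Per-position cosine similarity between representations computed
    from a streaming prefix and from the complete utterance."""
    full = toy_bidirectional_encode(frames)
    prefix = toy_bidirectional_encode(frames[:prefix_len])
    return [cosine_similarity(prefix[i], full[i]) for i in range(prefix_len)]

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))  # 50 pseudo speech frames, dim 16
sims = prefix_vs_full_gap(frames, prefix_len=20)
# Positions whose context window lies entirely inside the prefix match the
# offline encoding exactly; the last few positions diverge because their
# right-context is missing.
```

With a real encoder the same comparison is run by encoding the prefix and the full utterance separately and comparing hidden states at aligned positions; the toy version simply makes the boundary effect visible without any model weights.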
Moreover, we observe greater degradation in translation quality for utterances with the largest gap in speech representations between online and offline inference. Based on these findings, we hypothesize that the lack of future context at the end of streaming inputs is detrimental to streaming speech translation. To this end, we propose two novel

