DUAL-MODE ASR: UNIFY AND IMPROVE STREAMING ASR WITH FULL-CONTEXT MODELING

Abstract

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR benefit significantly from weight sharing and joint training with full-context ASR, especially with in-place knowledge distillation during training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public LibriSpeech corpus and the large-scale MultiDomain dataset. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and the recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
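To make the shared-weight training and in-place knowledge distillation concrete, the following is a minimal, hypothetical PyTorch-style sketch of one training step. The interfaces `model(x, streaming=...)` and `rnnt_loss`, the `distill_weight` parameter, and the simple soft-label KL distillation are illustrative assumptions, not the paper's actual API or exact loss formulation.

```python
import torch.nn.functional as F

def dual_mode_train_step(model, x, targets, rnnt_loss, distill_weight=1.0):
    """One training step with in-place knowledge distillation (sketch).

    The same network (shared weights) is run twice on the same batch:
    once in full-context mode and once in streaming mode. The full-context
    predictions act as an in-place teacher for the streaming mode.
    `model`, `rnnt_loss`, and `distill_weight` are assumed interfaces.
    """
    logits_full = model(x, streaming=False)    # y_t = f(x_{1:T})
    logits_stream = model(x, streaming=True)   # y_t = f(x_{1:t})

    # Supervised losses for both modes (e.g., an RNN-T loss).
    loss_full = rnnt_loss(logits_full, targets)
    loss_stream = rnnt_loss(logits_stream, targets)

    # In-place distillation: detach the teacher so gradients flow only
    # through the streaming (student) pass of the shared weights.
    teacher = F.softmax(logits_full.detach(), dim=-1)
    student = F.log_softmax(logits_stream, dim=-1)
    loss_kd = F.kl_div(student, teacher, reduction="batchmean")

    return loss_full + loss_stream + distill_weight * loss_kd
```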

1. INTRODUCTION

"Ok Google. Hey Siri. Hi Alexa." have featured a massive boom of smart speakers in recent years, unveiling a trend towards ubiquitous and ambient Artificial Intelligence (AI) for better daily lives. As the communication bridge between human and machine, low-latency streaming ASR (a.k.a., online ASR) is of central importance, whose goal is to emit each hypothesized word as quickly and accurately as possible on the fly as they are spoken. On the other hand, there are some scenarios where full-context ASR (a.k.a., offline ASR) is sufficient, for example, offline video captioning on video-sharing platforms. While low-latency streaming ASR is generally preferred in most of the speech recognition scenarios, it often has worse prediction accuracy as measured in Word Error Rate (WER), due to the lack of future context compared with full-context ASR. Improving both WER and emission latency has been shown to be highly challenging (He et al., 2019; Li et al., 2020a; Sainath et al., 2020) in streaming ASR systems. Since the acoustic, pronunciation, and language model (AM, PM, and LM) of a conventional ASR system have been evolved into a single end-to-end (E2E) all-neural network, modern streaming and full-context ASR models share most of the neural architectures and training recipes in common, such as, Mel-spectrogram inputs, data augmentations, neural network meta-architectures, training objectives, model regularization techniques and decoding methods. The most significant difference is that streaming ASR encoders are auto-regressive models, with the prediction of the current timestep conditioned on previous ones (no future context is permitted). Specifically, let x and y be the input and output sequence, t as frame index, T as total length of frames. Streaming ASR encoders model the output y t as a function of input x 1:t while full-context ASR encoders model the output y t as a function of input x 1:T . Streaming ASR encoders can be built with uni-directional LSTMs, causal convolution and left-context attention layers in streaming ASR encoders (Chiu & Raffel, 2018; Fan et al., 2018; Han et al., 2020; Gulati et al., 2020; Huang et al., 2020; Moritz et al., 2020; Miao † equal contribution 

