DUAL-MODE ASR: UNIFY AND IMPROVE STREAMING ASR WITH FULL-CONTEXT MODELING

Abstract

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits until a full speech utterance is complete before emitting its hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with inplace knowledge distillation during training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public dataset LibriSpeech and the large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and the recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.

1. INTRODUCTION

Wake phrases like "Ok Google", "Hey Siri", and "Hi Alexa" have accompanied a massive boom of smart speakers in recent years, signaling a trend towards ubiquitous and ambient Artificial Intelligence (AI) in daily life. As the communication bridge between human and machine, low-latency streaming ASR (a.k.a. online ASR) is of central importance: its goal is to emit each hypothesized word as quickly and accurately as possible, on the fly, as it is spoken. On the other hand, there are scenarios where full-context ASR (a.k.a. offline ASR) is sufficient, for example, offline video captioning on video-sharing platforms. While low-latency streaming ASR is generally preferred in most speech recognition scenarios, it often has worse prediction accuracy as measured by Word Error Rate (WER), due to the lack of future context compared with full-context ASR. Improving both WER and emission latency has been shown to be highly challenging (He et al., 2019; Li et al., 2020a; Sainath et al., 2020) in streaming ASR systems.

Since the acoustic, pronunciation, and language models (AM, PM, and LM) of a conventional ASR system have evolved into a single end-to-end (E2E) all-neural network, modern streaming and full-context ASR models share most of their neural architectures and training recipes in common, such as Mel-spectrogram inputs, data augmentation, neural network meta-architectures, training objectives, model regularization techniques, and decoding methods. The most significant difference is that streaming ASR encoders are auto-regressive models, with the prediction at the current timestep conditioned on previous ones (no future context is permitted). Specifically, let x and y be the input and output sequences, t the frame index, and T the total number of frames. Streaming ASR encoders model the output y_t as a function of the input x_{1:t}, while full-context ASR encoders model the output y_t as a function of the input x_{1:T}.
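The distinction between x_{1:t} and x_{1:T} conditioning can be made concrete with a toy self-attention example. The sketch below is illustrative only (the function name and toy dimensions are our assumptions, not part of any production ASR system): a causal mask restricts each frame to past context, while the unmasked version attends over the whole utterance using the same weights.

```python
import numpy as np

def attention_weights(q, k, causal):
    """Toy scaled dot-product attention weights over T frames.

    With causal=True (streaming), frame t may only attend to frames 1..t,
    so its output is a function of x_{1:t} only; with causal=False
    (full-context), every frame attends to all T frames, i.e. x_{1:T}.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Mask out future frames: entries above the diagonal get -inf.
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    # Row-wise softmax over the key axis; exp(-inf) = 0 removes the future.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))

w_stream = attention_weights(q, k, causal=True)   # streaming mode
w_full = attention_weights(q, k, causal=False)    # full-context mode
```

Note that switching modes changes only the mask, not the parameters `q` and `k` are derived from; this is the weight-sharing property that Dual-mode ASR exploits at the level of full encoder layers.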
Streaming ASR encoders can be built with uni-directional LSTMs, causal convolutions, and left-context attention layers (Chiu & Raffel, 2018; Fan et al., 2018; Han et al., 2020; Gulati et al., 2020; Huang et al., 2020; Moritz et al., 2020; Miao et al., 2020; Tsunoo et al., 2020; Zhang et al., 2020; Yeh et al., 2019). Recurrent Neural Network Transducers (RNN-T) (Graves, 2012) are commonly used as the decoder in both streaming and full-context models; the decoder predicts the token of the current input frame based on all previous tokens using uni-directional recurrent layers. Figure 1 illustrates a simplified example of the similarities and differences between streaming and full-context ASR models with E2E neural networks. Despite these similarities, streaming and full-context ASR models are usually developed, trained, and deployed separately.

In this work, we propose Dual-mode ASR, a framework to unify streaming and full-context speech recognition networks with shared weights. Dual-mode ASR comes with many immediate benefits, including reduced model download and storage on devices and simplified development and deployment workflows. To accomplish this goal, we first introduce Dual-mode Encoders, which can run in both streaming mode and full-context mode. Dual-mode encoders are designed to reuse the same set of model weights for both modes with zero or near-zero parameter overhead. We propose the design principles of a dual-mode encoder and show examples of how to design dual-mode convolution, dual-mode pooling, and dual-mode attention layers. We also



† equal contribution



Figure 1: A simplified illustration of the similarities and differences between streaming ASR and full-context ASR networks. Modern end-to-end streaming and full-context ASR models share most of their neural architectures and training recipes in common, with the most significant difference in the ASR encoder (highlighted). Streaming ASR encoders are auto-regressive models, with each prediction at the current timestep conditioned on previous ones (no future context). We show examples of the feed-forward, convolution, and self-attention layers in the encoders of streaming and full-context ASR, respectively. With Dual-mode ASR, we unify them without parameter overhead.
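The dual-mode convolution idea from the figure can be sketched minimally as follows (the function `dual_mode_conv1d`, the toy input, and the kernel are illustrative assumptions, not the paper's implementation): one shared kernel serves both modes, and only the padding differs.

```python
import numpy as np

def dual_mode_conv1d(x, kernel, streaming):
    """Apply one shared depthwise 1-D conv kernel in either mode.

    The same weights `kernel` (odd length K) serve both modes; only the
    padding changes. Streaming mode left-pads by K-1 zeros (causal: the
    output at t depends on inputs <= t only), while full-context mode pads
    (K-1)//2 zeros on each side (the output at t sees a symmetric window).
    """
    K = len(kernel)
    if streaming:
        padded = np.concatenate([np.zeros(K - 1), x])          # causal padding
    else:
        h = (K - 1) // 2
        padded = np.concatenate([np.zeros(h), x, np.zeros(h)])  # symmetric padding
    return np.array([padded[t:t + K] @ kernel for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.25, 0.5, 0.25])  # one kernel shared between both modes

y_stream = dual_mode_conv1d(x, kernel, streaming=True)
y_full = dual_mode_conv1d(x, kernel, streaming=False)
```

In streaming mode, the output at frame t never depends on inputs after t, so hypotheses can be emitted as frames arrive; the full-context mode reuses the identical kernel, which is why the parameter overhead of supporting both modes is zero.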

investigate different training algorithms for Dual-mode ASR, specifically, randomly sampled training and joint training. We show that joint training significantly outperforms randomly sampled training in terms of model quality and training stability. Moreover, motivated by Inplace Knowledge Distillation (Yu & Huang, 2019b), in which a large model is used to supervise a small model, we propose to distill knowledge from the full-context mode (teacher) into the streaming mode (student) on the fly during training, within the same Dual-mode ASR model, by encouraging consistency of the predicted token probabilities. We demonstrate that the emission latency and prediction accuracy of streaming ASR significantly benefit from weight sharing and joint training with the full-context mode, especially with inplace knowledge distillation during training. We present extensive experiments with two state-of-the-art ASR networks, the convolution-based ContextNet (Han et al., 2020) and the conv-transformer hybrid Conformer (Gulati et al., 2020), on two datasets, a widely used public dataset LibriSpeech (Panay-

