REVISITING THE ENTROPY SEMIRING FOR NEURAL SPEECH RECOGNITION

Abstract

In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.

1. INTRODUCTION

Modern automatic speech recognition (ASR) systems deploy a single neural network trained in an end-to-end differentiable manner on a paired corpus of speech and text (Graves et al., 2006; Graves, 2012; Chan et al., 2015; Sak et al., 2017; He et al., 2019). For many applications, such as providing closed captions in online meetings or understanding natural language queries for smart assistants, it is imperative that an ASR model operates in a streaming fashion with low latency. This means that before the full audio stream becomes available, the model has to produce partial recognition outputs that correspond to the already given speech. Ground truth alignments that annotate sub-sequences of speech with sub-sequences of text are hard to collect, and rarely available in sufficient quantities to be used as training data. Thus, ASR models have to learn alignments from paired examples of unannotated speech and text in a completely self-supervised way. The two most popular alignment models used for neural speech recognition today are Connectionist Temporal Classification (CTC) (Graves et al., 2006) and Recurrent Neural Network Transducer (RNN-T) (Graves, 2012; He et al., 2019). They formulate a probabilistic model over the alignment space, and are trained with a negative log-likelihood (NLL) criterion. Despite the widespread use of CTC and RNN-T, ASR models tend to converge to peaky or sub-optimal alignment distributions in practice (Miao et al., 2015; Liu et al., 2018; Yu et al., 2021). Prior work outside of ASR has discovered that the standard NLL loss generally leads to over-confident predictions (Pereyra et al., 2017; Xu et al., 2020). Even within ASR, there is strong theoretical evidence that sub-optimal alignments are an inevitable consequence of the NLL training criterion (Zeyer et al., 2021; Blondel et al., 2021). A common remedy for mitigating over-confident predictions is to impose an entropy regularizer to encourage diversification.
For example, label smoothing leads to more calibrated representations and better predictions (Müller et al., 2019; Meister et al., 2020). Another popular technique is
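To make the semiring reduction mentioned in the abstract concrete, the following is a minimal illustrative sketch (not the paper's open-source implementation; all names are ours) of computing the entropy of a path distribution over a small acyclic lattice. Each arc carries the entropy-semiring weight (p, -p log p); a single forward pass then yields, at the final state, the total path mass Z and the unnormalized entropy term r, from which H = r/Z + log Z.

```python
import math

def semiring_add(a, b):
    # (p1, r1) + (p2, r2) = (p1 + p2, r1 + r2): sums over alternative paths.
    return (a[0] + b[0], a[1] + b[1])

def semiring_mul(a, b):
    # (p1, r1) x (p2, r2) = (p1 * p2, p1 * r2 + p2 * r1): extends a path by an arc.
    # Inductively, a path with probability p accumulates the value -p * log p.
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def lattice_entropy(arcs, n_states):
    # arcs: list of (src, dst, prob) with prob > 0; states are assumed to be
    # numbered in topological order, with 0 the start and n_states - 1 the
    # final state. Runs in time linear in the number of arcs.
    zero, one = (0.0, 0.0), (1.0, 0.0)
    alpha = [zero] * n_states
    alpha[0] = one
    for src, dst, p in sorted(arcs):  # sorting by src respects the topological order
        w = (p, -p * math.log(p))     # per-arc entropy-semiring weight
        alpha[dst] = semiring_add(alpha[dst], semiring_mul(alpha[src], w))
    Z, r = alpha[n_states - 1]        # Z = sum of path probs, r = sum of -p * log p
    # Entropy of the normalized path distribution q = p / Z.
    return r / Z + math.log(Z)
```

For numerical stability on long lattices, a practical implementation would carry (log p, r/p)-style representations instead of raw probabilities; this toy version keeps raw probabilities for clarity.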

