REVISITING THE ENTROPY SEMIRING FOR NEURAL SPEECH RECOGNITION

Abstract

In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.

1. INTRODUCTION

Modern automatic speech recognition (ASR) systems deploy a single neural network trained in an end-to-end differentiable manner on a paired corpus of speech and text (Graves et al., 2006; Graves, 2012; Chan et al., 2015; Sak et al., 2017; He et al., 2019). For many applications like providing closed captions in online meetings or understanding natural language queries for smart assistants, it is imperative that an ASR model operates in a streaming fashion with low latency. This means that before the full audio stream becomes available, the model has to produce partial recognition outputs that correspond to the already given speech. Ground truth alignments that annotate sub-sequences of speech with sub-sequences of text are hard to collect, and rarely available in sufficient quantities to be used as training data. Thus, ASR models have to learn alignments from paired examples of un-annotated speech and text in a completely self-supervised way. The two most popular alignment models used for neural speech recognition today are Connectionist Temporal Classification (CTC) (Graves et al., 2006) and Recurrent Neural Network Transducer (RNN-T) (Graves, 2012; He et al., 2019). They formulate a probabilistic model over the alignment space, and are trained with a negative log-likelihood (NLL) criterion.

Despite the widespread use of CTC and RNN-T, ASR models tend to converge to peaky or sub-optimal alignment distributions in practice (Miao et al., 2015; Liu et al., 2018; Yu et al., 2021). Prior work outside of ASR has discovered that the standard NLL loss generally leads to over-confident predictions (Pereyra et al., 2017; Xu et al., 2020). Even within ASR, there is strong theoretical evidence that sub-optimal alignments are an inevitable consequence of the NLL training criterion (Zeyer et al., 2021; Blondel et al., 2021). A common remedy for mitigating over-confident predictions is to impose an entropy regularizer to encourage diversification.
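As a toy illustration of this kind of entropy regularization, the sketch below implements the confidence penalty of Pereyra et al. (2017) for a single softmax prediction: the entropy of the predictive distribution is subtracted from the NLL, so that over-confident (low-entropy) predictions incur a higher loss. The function name and the value of β are ours, and the alignment-level analogue over the exponential space of paths is precisely what requires the semiring machinery discussed later:

```python
import numpy as np

def confidence_penalized_nll(logits, target, beta=0.1):
    # Stable log-softmax via the LogSumExp trick.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    probs = np.exp(log_probs)
    nll = -log_probs[target]
    # Subtracting the entropy rewards less confident (higher-entropy)
    # predictions, counteracting the over-confidence of plain NLL.
    entropy = -(probs * log_probs).sum()
    return nll - beta * entropy

# Uniform logits over 3 classes: NLL = log(3) and entropy = log(3),
# so the penalized loss is (1 - beta) * log(3).
loss_uniform = confidence_penalized_nll(np.array([0.0, 0.0, 0.0]), 0)
```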
For example, label smoothing leads to more calibrated representations and better predictions (Müller et al., 2019; Meister et al., 2020). Another popular technique is knowledge distillation, where we minimize the relative entropy between a teacher's and a student's soft predictions instead of training on hard labels (Hinton et al., 2015; Stanton et al., 2021). However, it is not straightforward to calculate the entropy of the alignment distribution of a neural speech model. This is because the total number of possible alignments is exponential in the lengths of the acoustic and text sequences, which makes naive calculation intractable.

Fortunately, classical results from Eisner (2001); Cortes et al. (2006) show that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer. Their approach is based on the semiring framework for dynamic programming (Mohri, 1998), which generalizes classical algorithms like forward-backward, inside-outside, Viterbi, and belief propagation. The unifying algebraic structure of all these classical algorithms is that state transitions and merges can be interpreted as generalized multiplication and addition operations on a semiring. Eisner (2001); Cortes et al. (2006) ingeniously constructed a semiring that corresponds to the computation of entropy. While these results have long been established, open-source implementations like the OpenFST library (Allauzen et al., 2007) have not been designed with modern automatic differentiation and deep learning libraries in mind. This is why, thus far, supervising training with alignment entropy regularization or distillation has not been part of the standard neural speech recognition toolbox. In fact, implementing the entropy semiring on top of modern ASR lattices like CTC or RNN-T is highly non-trivial.
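To make the construction concrete, here is a minimal plain-Python sketch of the entropy semiring of Eisner (2001); Cortes et al. (2006) (class and function names are ours): weights are pairs ⟨p, r⟩, addition is componentwise, and multiplication is ⟨p1p2, p1r2 + p2r1⟩. Lifting each edge probability p to ⟨p, -p log p⟩ makes the semiring's "sum over paths" compute exactly the terms of the entropy:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class EntropyWeight:
    # A pair <p, r>: p is a probability mass and r accumulates
    # the -p*log(p) terms needed for the entropy.
    p: float
    r: float

def eplus(a, b):
    # Semiring addition (paths merging): componentwise sum.
    return EntropyWeight(a.p + b.p, a.r + b.r)

def etimes(a, b):
    # Semiring multiplication (edges chained along a path):
    # <p1, r1> x <p2, r2> = <p1*p2, p1*r2 + p2*r1>.
    return EntropyWeight(a.p * b.p, a.p * b.r + b.p * a.r)

def edge(p):
    # Lift an edge probability into the semiring.
    return EntropyWeight(p, -p * math.log(p))

# Two competing paths with probabilities 0.6 and 0.4: chaining
# edges along a path yields <P, -P*log(P)>, and summing over
# paths yields <Z, sum over paths of -P*log(P)>.
total = eplus(edge(0.6), edge(0.4))
Z, r = total.p, total.r
entropy = math.log(Z) + r / Z  # reduces to r when Z == 1
```

When the alignment distribution is normalized (Z = 1), the second component is exactly the Shannon entropy of the distribution over paths; otherwise H = log Z + r/Z.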
During training, we need to compute not only the entropy in the forward pass, but also its gradients in the backward pass, which necessitates a highly numerically stable implementation. Note that even for simple operations like calculating the cross entropy of a softmax distribution, naive implementations that do not use the LogSumExp trick are plagued by numerical inaccuracies. When it comes to calculating the entropy of a sequence that might be thousands of tokens long, such inaccuracies accumulate quickly, leading to NaNs that make training impossible. Moreover, it is crucial that the addition of alignment entropy supervision does not incur additional forward passes through the ASR lattice beyond the one already done to compute the NLL.
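To illustrate why a log-space representation helps, one can keep both components of ⟨p, r⟩ in log space; this simple stabilization is in the same spirit as, but not necessarily identical to, the isomorphic variant proposed in this work. It is valid whenever edge probabilities are at most 1, since then every r term is nonnegative and its logarithm is well defined:

```python
import numpy as np

def log_edge(logp):
    # Lift a log-probability into the log-space entropy semiring:
    # represent <p, r> with r = -p*log(p) as (log p, log r).
    # r >= 0 as long as p <= 1, so log r is well defined.
    return (logp, logp + np.log(-logp))

def log_etimes(a, b):
    # <p1, r1> x <p2, r2> = <p1*p2, p1*r2 + p2*r1>, in log space.
    (lp1, lr1), (lp2, lr2) = a, b
    return (lp1 + lp2, np.logaddexp(lp1 + lr2, lp2 + lr1))

def log_eplus(a, b):
    # Componentwise addition becomes componentwise logaddexp.
    return (np.logaddexp(a[0], b[0]), np.logaddexp(a[1], b[1]))

# A path of 2001 edges, each with probability 0.5. The direct
# representation underflows to zero, but log space stays finite.
assert 0.5 ** 2001 == 0.0
w = log_edge(np.log(0.5))
for _ in range(2000):
    w = log_etimes(w, log_edge(np.log(0.5)))
log_Z, log_r = w

# With a single path there is no alignment uncertainty, so the
# entropy H = log Z + r/Z recovers (approximately) zero.
entropy = log_Z + np.exp(log_r - log_Z)
```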

Our Contributions

We contribute an open-source implementation of CTC and RNN-T in the semiring framework that is both numerically stable and highly parallel. Regarding numerical stability, we find that the vanilla entropy semiring from Eisner (2001); Cortes et al. (2006) produces unstable outputs not just at the final step, but also during intermediate steps of the dynamic program. Thus, a naive implementation will result in instability in both the forward and backward pass, since automatic differentiation re-uses activations produced during the intermediate steps. To address this, we propose a novel variant of the entropy semiring that is isomorphic to the original, but is designed to be numerically stable for both the forward and backward pass. Regarding parallelism, our implementation allows for efficient plug-and-play computations of arbitrary semirings in the dynamic programming graphs of CTC and RNN-T. Thus, when outputs from more than one semiring are desired, we can compute them in parallel using a single pass over the same data, by simply plugging in a new semiring that is formed via the concatenation of existing semirings.
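A minimal sketch of what plug-and-play semirings and their concatenation can look like (the `Semiring`, `forward`, and `concat` names are illustrative, not our library's actual API): a generic forward pass over a toy lattice is parameterized by the semiring, and concatenating the probability and entropy semirings yields both outputs in a single pass over the same data:

```python
import math
from typing import Callable, NamedTuple

class Semiring(NamedTuple):
    zero: object    # additive identity
    plus: Callable  # used when paths merge
    times: Callable # used when edges are chained
    lift: Callable  # maps an edge probability into the semiring

# Probability (sum-product) semiring: computes the total mass Z.
prob = Semiring(0.0, lambda a, b: a + b, lambda a, b: a * b, lambda p: p)

# Entropy semiring: pairs <p, r> with r accumulating -p*log(p) terms.
entropy = Semiring(
    (0.0, 0.0),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    lambda a, b: (a[0] * b[0], a[0] * b[1] + b[0] * a[1]),
    lambda p: (p, -p * math.log(p)),
)

def concat(s1, s2):
    # Concatenation of two semirings: run both over the same lattice
    # traversal by pairing their weights componentwise.
    return Semiring(
        (s1.zero, s2.zero),
        lambda a, b: (s1.plus(a[0], b[0]), s2.plus(a[1], b[1])),
        lambda a, b: (s1.times(a[0], b[0]), s2.times(a[1], b[1])),
        lambda p: (s1.lift(p), s2.lift(p)),
    )

def forward(edges, n_states, sr):
    # Generic forward pass: edges are (src, dst, prob) triples in
    # topological order; state 0 is initial, state n_states-1 final.
    alpha = [sr.zero] * n_states
    alpha[0] = sr.lift(1.0)  # multiplicative identity for these semirings
    for src, dst, p in edges:
        alpha[dst] = sr.plus(alpha[dst], sr.times(alpha[src], sr.lift(p)))
    return alpha[-1]

# Toy lattice with two paths, 0->1->3 (prob 0.6) and 0->2->3 (prob 0.4).
edges = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 1.0), (2, 3, 1.0)]
Z, (Z2, r) = forward(edges, 4, concat(prob, entropy))
```

Swapping in a different semiring changes what the same traversal computes, which is the sense in which the computation is plug-and-play.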



Figure 1: Example of a loop-skewed RNN-T lattice annotated with transition probabilities. In this example, there are 3 text tokens to be produced and 4 acoustic tokens to be consumed, which results in a total of (3+4)!/(3!4!) = 35 alignments and 2×3×4 + 3 + 4 = 31 transitions. The naive calculation of entropy requires 35×(3+4+1) = 280 multiplications and 35 − 1 = 34 additions, while the entropy semiring calculation requires 31×3 = 93 multiplications and 2×3×4 + 31 = 55 additions.
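The counts in the caption can be checked with a few lines, assuming the standard RNN-T lattice in which blank transitions consume an acoustic token and label transitions produce a text token:

```python
from math import comb, factorial

U, T = 3, 4  # text tokens to produce, acoustic tokens to consume

# Monotonic alignments are interleavings of U label steps and T time
# steps, i.e. (U+T)! / (U! * T!) paths through the lattice.
n_alignments = factorial(U + T) // (factorial(U) * factorial(T))

# Lattice transitions: T blank edges on each of the U+1 label rows,
# plus U label edges in each of the T+1 time columns.
n_transitions = T * (U + 1) + U * (T + 1)  # equals 2*U*T + U + T

# Operation counts quoted in the caption.
naive_mults = n_alignments * (U + T + 1)
naive_adds = n_alignments - 1
semiring_mults = n_transitions * 3
semiring_adds = 2 * U * T + n_transitions
```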

