IN-SITU TEXT-ONLY ADAPTATION OF SPEECH MODELS WITH LOW-OVERHEAD SPEECH IMPUTATIONS

Abstract

Fast and accurate adaptation of automatic speech recognition (ASR) systems using only text data in the target domain is a problem of long-standing practical relevance. Text-only adaptation was easy in traditional cascaded ASR systems with completely decoupled acoustic and language models. Recently, the RNN-Transducer (RNN-T) has emerged as a popular ASR model because of its high accuracy, low latency, and capability of supporting streaming input. However, text-only adaptation of the RNN-T model is significantly more challenging due to its tight integration of acoustic and language models and end-to-end training. Existing approaches for text-only adaptation of RNN-Ts either entail significant modification to the network or introduce high latency during decoding. We propose a new approach (TOLSTOI) that uses text to impute speech representations internal to the ASR model and performs in-situ adaptation, resulting in higher adaptation accuracy without any runtime overhead during decoding. Our imputation model is a function of the labeled data and trained parameters of the ASR model and, as we show, is more effective at controlling catastrophic forgetting than existing methods. We establish the effectiveness of TOLSTOI using three target domains and two ASR models of varying complexity. We achieve up to 35% relative reduction in word error rate with text-only adaptation, while forgetting the least among existing adaptation approaches. Our method is easy to implement and can be applied to existing RNN-T models without requiring ASR model training from scratch.

1. INTRODUCTION

Text-only adaptation of end-to-end (E2E) automatic speech recognition (ASR) systems to new target domains is of much practical interest since in many situations, e.g., on mobile phones, it is easier to obtain target-specific text data than the corresponding audio. Efficient and effective text-only adaptation remains an open problem, in large part due to the nature of E2E ASR systems, which use a single model to jointly learn both a mapping from speech to text and a language model (LM), thus rendering traditional LM adaptation techniques for ASR (Bellegarda, 2004) ineffective. RNN-Transducer (RNN-T) models (Graves, 2012) are one of the most popular E2E ASR architectures; they achieve high accuracy and enable real-time decoding of speech, making them the predominant choice for ASR on mobile devices (He et al., 2019). Customizing RNN-T models using text-only data has gathered momentum in recent years. For any ASR application using RNN-T models, running in real time is a critical requirement. Thus, we seek simple and accurate text-only adaptation techniques that do not increase model complexity. In this work, we propose such an approach, TOLSTOI, which is simple in its design and works with pretrained RNN-T models while enabling fast and accurate adaptation to the target domain.

We first review existing approaches to text-only adaptation to better contextualize TOLSTOI. A popular solution for text-only adaptation of E2E ASR systems is shallow fusion (Hannun et al., 2014), where scores from the E2E model are combined during beam search decoding with scores from an external LM trained on the target text. While simple in its design, this technique significantly increases decoding time due to the reliance on an external LM during inference. More recent work on adapting RNN-T models using only text aims at directly updating the parameters of the prediction network (Pylkkonen et al., 2021; Chen et al., 2022a).
However, such techniques do not yield very accurate adaptation to the target text and also involve architectural changes to the RNN-T that necessitate training the model from scratch. Text-only adaptation can also be tackled by generating corresponding speech via text-to-speech (TTS) synthesis. The main limitations of TTS-based adaptation are significant computational costs and the reliance on high-quality TTS systems, which are available only for a small subset of high-resource languages and accents.

From these prior works, the key requirements that emerge for practical text-only adaptation of ASR are: i) the model should adapt to target-domain text-only data with high accuracy, ii) the adaptation should be applicable to existing pretrained models without any retraining, iii) inference should be fast and inexpensive, and iv) the adaptation should not lead to catastrophic forgetting (Goodfellow et al., 2013; Takashima et al., 2022). We propose TOLSTOI, which addresses all four requirements. Starting from text in the target domain, we impute speech representations as would have been produced by the transcription network of a pretrained RNN-T model. Our imputation model is a simple feedforward network (with roughly 200K parameters) that incurs minimal training overhead by harnessing forced alignments and representations from the ASR model. Using the trained imputation model, we generate sequences of speech representations for all the text in the target domain, which are then used for in-situ adaptation of the RNN-T ASR model. TOLSTOI can be used with any existing pretrained RNN-T. We introduce no new parameters in the RNN-T and rely on no external LMs, thus incurring no additional latency overhead at inference time. Along with yielding fast and accurate adaptation to the target domain, TOLSTOI also safeguards against forgetting, since the imputation model is trained to mimic representations from the source distribution.
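To make the imputation step concrete, the sketch below gives a minimal, hypothetical rendering of such a model: a small feedforward network that maps frame-level label IDs (obtained from forced alignment on labeled source data) to the representations the transcription network would have produced. The class name, layer sizes, and vocabulary size are our illustrative choices, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class SpeechImputer(nn.Module):
    """Illustrative feedforward imputation model: maps a frame-level
    label to the speech representation the transcription network of a
    pretrained RNN-T would have emitted. Sizes are purely illustrative."""

    def __init__(self, vocab_size: int = 128, emb_dim: int = 64,
                 hidden_dim: int = 256, enc_dim: int = 320):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, enc_dim),
        )

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        # labels: (T,) frame-level label IDs from forced alignment
        return self.net(self.embed(labels))  # (T, enc_dim)

# Training would regress these outputs onto the pretrained encoder's
# actual representations on labeled source data; at adaptation time,
# target-domain text (frame-expanded) is pushed through the imputer to
# produce pseudo-speech inputs for in-situ RNN-T fine-tuning.
imputer = SpeechImputer()
frames = imputer(torch.randint(0, 128, (50,)))
assert frames.shape == (50, 320)
```

Because the imputer is trained against representations of the source-domain encoder, the imputed sequences stay close to the source distribution, which is what limits forgetting during adaptation.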
TOLSTOI yields up to 35% relative word error rate (WER) reduction on a new target domain, while maintaining the same decoding latency as the base RNN-T model and ensuring minimal forgetting of its source information when compared to three other competitive baselines. We also present a detailed ablation study to justify the various design choices of TOLSTOI.

2. RELATED WORK

LM adaptation in traditional ASR systems. Unlike end-to-end models, traditional ASR systems adopt a cascaded structure with the LM completely decoupled from the acoustic model (Mohri et al., 2002). This enables easier adaptation of the LM to a target domain (Hori et al., 2003; Bellegarda, 2004; Neubig et al., 2009; Gangireddy et al., 2016) and also allows for ASR lattice rescoring with an external LM (Park et al., 2010; Xu et al., 2018).

LM fusion. A popular approach for text-only adaptation of end-to-end ASR is "shallow fusion", where an external LM is log-linearly interpolated with the RNN-T output during beam decoding (Kannan et al., 2018). For RNN-T models, another recent approach is to extract internal LM probabilities and discount the fused score using the ratio of external to internal LM probabilities (McDermott et al., 2019; Meng et al., 2021a;b; Udagawa et al., 2022). These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight for the external LM.

Synthesizing audio. Another approach to text-only adaptation is to synthesize audio using text-to-speech (TTS) synthesis (Zheng et al., 2021; Deng et al., 2021; Joshi & Singh, 2022; Hayashi et al., 2018; Hori et al., 2019; Baskar et al., 2021; Chen et al., 2022c). However, generation is slow and relies on access to high-quality TTS (Shen et al., 2018), which is absent for most languages. To address these issues, recent work on text-only adaptation has investigated generating simpler pseudo-speech representations called "textograms" by repeating one-hot encodings of the output labels for a fixed duration (Thomas et al., 2022). The input to the RNN-T is augmented to accept a textogram as an additional channel. This model requires training the RNN-T from scratch and also negatively impacts decoding latency.
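The textogram construction described above reduces to a few lines of code: each output label is one-hot encoded and the encoding is repeated for a fixed number of frames. The function name, vocabulary size, and repetition count below are illustrative choices, not the settings of Thomas et al. (2022).

```python
import numpy as np

def textogram(label_ids, vocab_size: int = 30, repeat: int = 4) -> np.ndarray:
    """Build a pseudo-speech feature matrix from a label sequence by
    repeating each label's one-hot encoding for `repeat` frames.
    Returns an array of shape (len(label_ids) * repeat, vocab_size)."""
    eye = np.eye(vocab_size, dtype=np.float32)
    frames = [eye[i] for i in label_ids for _ in range(repeat)]
    return np.stack(frames)

# Three labels repeated twice each -> 6 pseudo-speech frames.
tg = textogram([3, 7, 7], vocab_size=10, repeat=2)
assert tg.shape == (6, 10)
```

The fixed repetition count is what stands in for speech duration; since the RNN-T must accept this extra channel alongside acoustic features, the architecture changes and the model must be trained from scratch, which is the drawback noted above.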
Fine-tuning RNN-T model parameters. Recent approaches exploit the inherent structure of the RNN-T to perform in-situ text-only adaptation. Pylkkonen et al. (2021) add a separate LM output head to the prediction network of an RNN-T (which handles text-only inputs), and both are jointly fine-tuned using the target text. Chen et al. (2022a) first factorize the prediction network into two networks that separately handle "blank" tokens (capturing alignment with the audio) and the output vocabulary tokens, before adapting the latter with the target text. This technique requires retraining the RNN-T model and does not yield accurate adaptation.
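One way to picture such a factorization, in our own illustrative notation (the module names, sizes, and head layout below are ours, not the exact architecture of Chen et al. (2022a)): the blank and vocabulary predictions come from separate branches, and text-only adaptation updates only the vocabulary branch.

```python
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    """Illustrative factorized prediction network: a blank branch
    (kept frozen, tied to audio alignment) and a vocabulary branch
    (the part fine-tuned on target-domain text)."""

    def __init__(self, vocab_size: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.blank_head = nn.Linear(hidden, 1)           # blank logit
        self.vocab_head = nn.Linear(hidden, vocab_size)  # label logits

    def forward(self, prev_labels: torch.Tensor) -> torch.Tensor:
        h = self.embed(prev_labels)
        # Concatenate blank and vocabulary logits per step.
        return torch.cat([self.blank_head(h), self.vocab_head(h)], dim=-1)

pred = FactorizedPredictor()
# Text-only adaptation would touch only the vocabulary branch:
for p in pred.blank_head.parameters():
    p.requires_grad_(False)
logits = pred(torch.tensor([5, 9]))
assert logits.shape == (2, 129)  # 1 blank + 128 vocabulary logits
```

Freezing the blank branch is what lets the adapted model keep its audio-alignment behavior while the vocabulary branch absorbs the target-domain text; the cost, as noted above, is that the factorized structure must be trained into the RNN-T from the start.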

