IN-SITU TEXT-ONLY ADAPTATION OF SPEECH MODELS WITH LOW-OVERHEAD SPEECH IMPUTATIONS

Abstract

Fast and accurate adaptation of automatic speech recognition (ASR) systems using only text data in the target domain is a problem of long-standing practical relevance. Text-only adaptation was easy in traditional cascaded ASR systems with completely decoupled acoustic and language models. Recently, the RNN-Transducer (RNN-T) has emerged as a popular ASR model because of its high accuracy, low latency, and capability of supporting streaming input. However, text-only adaptation of the RNN-T model is significantly more challenging due to its tight integration of acoustic and language models and end-to-end training. Existing approaches for text-only adaptation of RNN-Ts either entail significant modification to the network or introduce high latency during decoding. We propose a new approach (TOLSTOI) that uses text to impute speech representations internal to the ASR model and performs in-situ adaptation, yielding higher adaptation accuracy without any runtime overhead during decoding. Our imputation model is a function of the labeled data and trained parameters of the ASR model and, as we show, is more effective at controlling catastrophic forgetting than existing methods. We establish the effectiveness of TOLSTOI on three target domains and two ASR models of varying complexity, achieving up to a 35% relative reduction in word error rate with text-only adaptation while forgetting the least among existing adaptation approaches. Our method is easy to implement and can be harnessed on existing RNN-T models without requiring ASR model training from scratch.

1. INTRODUCTION

Text-only adaptation of end-to-end (E2E) automatic speech recognition (ASR) systems to new target domains is of much practical interest since in many situations, e.g., on mobile phones, it is easier to get target-specific text data than the corresponding audio. Efficient and effective text-only adaptation remains an open problem, in large part due to the nature of E2E ASR systems, which use a single model to jointly learn both a mapping from speech to text and a language model (LM), thus rendering traditional LM adaptation techniques for ASR (Bellegarda, 2004) ineffective. RNN-Transducer (RNN-T) models (Graves, 2012) are one of the most popular E2E ASR architectures that achieve high accuracy and enable real-time decoding of speech, making them the predominant choice for ASR on mobile devices (He et al., 2019). Customizing RNN-T models using text-only data has gathered momentum in recent years. For ASR applications using RNN-T models, running in real time is a critical requirement. We therefore seek simple and accurate text-only adaptation techniques that do not increase model complexity. In this work, we propose such an approach, TOLSTOI, that is simple in its design and works with pretrained RNN-T models while enabling fast and accurate adaptation to the target domain. We first review existing approaches to text-only adaptation to better contextualize TOLSTOI. A popular solution for text-only adaptation of E2E ASR systems is shallow fusion (Hannun et al., 2014), where scores from the E2E model are combined with scores from an external LM trained on the target text during beam search decoding. While simple in its design, this technique significantly increases decoding time due to the reliance on an external LM during inference. More recent work on adapting RNN-T models using only text aims at directly updating the parameters of the prediction network (Pylkkonen et al., 2021; Chen et al., 2022a).
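The shallow-fusion rescoring described above can be sketched as a per-step log-linear interpolation; a minimal sketch, with illustrative scores and an assumed interpolation weight (not values from the paper):

```python
def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear interpolation of E2E ASR and external LM scores.

    asr_log_prob: log P_ASR(token | speech, history) from the RNN-T
    lm_log_prob:  log P_LM(token | history) from the external target-text LM
    lm_weight:    interpolation weight, tuned on a development set
    """
    return asr_log_prob + lm_weight * lm_log_prob

# During beam search, every candidate extension is rescored with the LM,
# which is the source of the extra decoding latency (illustrative numbers):
candidates = {"mitosis": (-1.2, -0.8), "my thesis": (-1.0, -2.5)}
scored = {w: shallow_fusion_score(a, l) for w, (a, l) in candidates.items()}
best = max(scored, key=scored.get)
```

Because the external LM must be queried once per candidate per decoding step, the overhead scales with beam size and LM size.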
However, such techniques do not yield very accurate adaptation to the target text and also involve architectural changes to the RNN-T that necessitate training the model from scratch. Text-only adaptation can also be tackled by generating corresponding speech via text-to-speech (TTS) synthesis. The main limitations of TTS-based adaptation are significant computational costs and the reliance on high-quality TTS systems that are available only for a small subset of high-resource languages and accents. From these prior works, the key requirements that emerge for practical text-only adaptation of ASR are: i) the model should adapt to target-domain text-only data with high accuracy; ii) the adaptation should apply to existing pretrained models without any retraining; iii) inference should be fast and inexpensive; and iv) the adaptation should not lead to catastrophic forgetting (Goodfellow et al., 2013; Takashima et al., 2022). We propose TOLSTOI, which addresses all four requirements. Starting from text in the target domain, we impute speech representations as would have been produced by the transcription network of a pretrained RNN-T model. Our imputation model is a simple feedforward network (with roughly 200K parameters) that incurs minimal training overhead by harnessing forced alignments and representations from the ASR model. Using the trained imputation model, we generate sequences of speech representations for all the text in the target domain, which are used for in-situ adaptation of the RNN-T ASR model. TOLSTOI can be used with any existing pretrained RNN-T. We do not introduce any new parameters into the RNN-T and do not rely on any external LMs, thus incurring no additional latency overhead at inference time. Along with yielding fast and accurate adaptation to the target domain, TOLSTOI also safeguards against forgetting, since the imputation model is trained to mimic representations from the source distribution.
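The impute-then-adapt idea above can be sketched as follows. All names and sizes here (f_imp, rep_dim, the one-hidden-layer shape) are illustrative assumptions standing in for the paper's roughly 200K-parameter feedforward imputation network, not its actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: token embeddings -> hidden layer -> speech-representation
# dimension of the pretrained transcription network (all assumed values).
vocab, embed_dim, hidden, rep_dim = 1000, 64, 256, 512
E  = rng.normal(size=(vocab, embed_dim)) * 0.01   # token embedding table
W1 = rng.normal(size=(embed_dim, hidden)) * 0.01
W2 = rng.normal(size=(hidden, rep_dim)) * 0.01

def f_imp(token_ids):
    """Map target-domain text tokens to imputed speech representations.

    In TOLSTOI these would mimic the transcription network's outputs,
    learned from (token, representation) pairs obtained via forced
    alignment on the labeled source data.
    """
    x = E[token_ids]                      # (B, U, embed_dim)
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP -> (B, U, rep_dim)

target_text = rng.integers(0, vocab, size=(8, 12))  # text-only target batch
h_imputed = f_imp(target_text)
# h_imputed then plays the role of speech representations during
# in-situ fine-tuning of the RNN-T; no new parameters are added to it.
```

The key property is that the RNN-T itself is unchanged, so decoding latency after adaptation is identical to the base model.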
TOLSTOI yields up to 35% relative word error rate (WER) reduction on a new target domain, while maintaining the same decoding latency as the base RNN-T model and ensuring minimal forgetting of its source information when compared to three other competitive baselines. We also present a detailed ablation study to justify the various design choices of TOLSTOI.

2. RELATED WORK

LM adaptation in traditional ASR systems. Unlike end-to-end models, traditional ASR systems adopt a cascaded structure with the LM completely decoupled from the acoustic model (Mohri et al., 2002). This enables easier adaptation of the LM to a target domain (Hori et al., 2003; Bellegarda, 2004; Neubig et al., 2009; Gangireddy et al., 2016) and also allows for ASR lattice rescoring with an external LM (Park et al., 2010; Xu et al., 2018).

LM fusion. A popular approach for text-only adaptation of end-to-end ASR is "shallow fusion", where an external LM is log-linearly interpolated with the RNN-T output during beam decoding (Kannan et al., 2018). For RNN-T models, another recent approach is to extract internal LM probabilities and discount the output scores with the ratio of external and internal LM probabilities (McDermott et al., 2019; Meng et al., 2021a;b; Udagawa et al., 2022). These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight used for the external LM.

Synthesizing audio. Another approach to text-only adaptation is to synthesize audio using text-to-speech (TTS) synthesis (Zheng et al., 2021; Deng et al., 2021; Joshi & Singh, 2022; Hayashi et al., 2018; Hori et al., 2019; Baskar et al., 2021; Chen et al., 2022c). However, this is a slow generation process and relies on access to high-quality TTS (Shen et al., 2018), which is absent for most languages. To address these issues, recent work on text-only adaptation has investigated generating simpler pseudo-speech representations called "textograms" by repeating one-hot encodings of the output labels for a fixed duration (Thomas et al., 2022). The input to the RNN-T is augmented to accept a textogram as an additional channel. This model requires training the RNN-T from scratch and also negatively impacts the decoding latency.
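The textogram construction described above (Thomas et al., 2022) is simple enough to sketch directly; the vocabulary size and repeat factor below are illustrative assumptions:

```python
import numpy as np

def textogram(token_ids, vocab_size, repeat=4):
    """Build a textogram: one-hot label encodings repeated for a fixed
    number of frames, mimicking the time axis of real speech features."""
    one_hot = np.eye(vocab_size)[token_ids]    # (U, V)
    return np.repeat(one_hot, repeat, axis=0)  # (U * repeat, V)

# 3 output labels, each held for 4 pseudo-frames over a 5-symbol vocabulary
tg = textogram([2, 0, 3], vocab_size=5, repeat=4)
```

The resulting (U * repeat, V) matrix is fed to the RNN-T as an extra input channel, which is why that approach requires retraining the model from scratch.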
Fine-tuning RNN-T model parameters. Recent approaches exploit the inherent structure of the RNN-T to perform in-situ text-only adaptation. Pylkkonen et al. (2021) add a separate LM output head to the prediction network of an RNN-T (which handles text-only inputs), and both are jointly fine-tuned using the target text. Chen et al. (2022a) first factorize the prediction network into two networks that separately handle "blank" tokens (capturing alignment with the audio) and the output vocabulary tokens, before adapting the latter with the target text. This technique requires retraining the RNN-T model and does not yield accurate adaptation.

[Figure 1: The standard RNN-T model and three prior approaches alongside TOLSTOI: an external LM fused at decoding (significantly slows inference time), a textogram input with an added LM head (expensive RNN-T retraining), and our imputation model f_IMP feeding the unchanged RNN-T, which computes P(y_u | x_1,...,t, y_1,...,u-1) from speech vectors h_t and prediction-network vectors g_u.]

Figure 1 shows a technique from each of the above-mentioned categories and how it is integrated within the standard RNN-T architecture. (A recent line of work focuses on learning shared speech and text representations, thus allowing for the use of unpaired text (Bapna et al., 2021; Ao et al., 2021; Tang et al., 2022; Chen et al., 2022b). An extension of these ideas to streaming RNN-T models (Sainath et al., 2023) was concurrent to our work and would be interesting to explore further.)

3. OUR APPROACH (TOLSTOI)

We are given an ASR model M trained on a labeled dataset D = {(x^1, y^1), ..., (x^N, y^N)} where x^i ∈ X, the space of speech utterances, and y^i ∈ Y, the space of text transcripts. Each speech utterance x^i comprises a variable number of frames x^i_1, ..., x^i_{T_i}, where each x^i_t is a fixed-length real-valued vector denoting features such as the spectrogram of the audio frame. Each text sequence comprises a variable number of tokens y^i = (y^i_1, ..., y^i_{U_i}), where each y^i_u ∈ V, the output vocabulary. Popular choices for the output vocabulary are characters and subwords (Sennrich et al., 2015). Typically the number of text tokens U_i ≪ T_i, the number of audio frames. Let P(X, Y) denote the distribution of speech and text from which the training data is sampled. Our goal is to deploy the ASR model M on a target domain whose distribution P̃(X, Y) differs from the training distribution. For the target domain, we only have text data D̃ = {ỹ^1, ..., ỹ^k}, where the number of text samples in D̃ is generally much smaller than the size of the training set D. Since we are only given text samples from the target distribution, we assume that the training and target distributions differ only in the text marginals P(Y) and that the distribution of speech given text stays unchanged, that is, P̃(X|Y) = P(X|Y). We seek to use D̃ to fine-tune the parameters of M so that the updated model M̃ is accurate on speech corresponding to new text from the target distribution P̃, without catastrophically deteriorating accuracy on samples from the training distribution P. We propose to perform in-situ adaptation without introducing new layers or external LMs during deployment.
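Under this assumption, the target joint distribution factorizes through the unchanged speech-given-text conditional; one way to write this (a sketch in the notation above):

\[
\tilde{P}(X, Y) \;=\; \tilde{P}(Y)\,\tilde{P}(X \mid Y) \;=\; \tilde{P}(Y)\,P(X \mid Y),
\]

so, in principle, target-domain training pairs can be simulated by taking target text \(\tilde{y} \sim \tilde{P}(Y)\) and drawing speech, or internal speech representations, from the source-domain conditional \(P(X \mid Y = \tilde{y})\), which is exactly what the imputation model approximates.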
As mentioned earlier, we focus on adapting the RNN-Transducer (RNN-T) architecture (Graves, 2012; Graves et al., 2013) as the ASR model since it has recently emerged as a popular choice, particularly on mobile devices, because of its high accuracy, low latency, and capability of supporting streaming input. We first present a brief background of RNN-Ts.

Background of RNN-T. The RNN-T network comprises three modules: (1) a speech module M_S with parameters θ_S that converts speech frames x = x_1, ..., x_T to vectors h_1, ..., h_T; (2) a text module M_L that converts the token history into vectors g_0, ..., g_U; and (3) a joint module M_J that combines h_t and g_u to produce the output distribution P(y_u | x_1,...,t, y_1,...,u-1).

[Figure 2: The best alignment path through the T × U RNN-T lattice, with speech vectors h_1, ..., h_6 on one axis, prediction-network vectors g_0, ..., g_3 on the other, and blank (ε) versus label (y_1, y_2, y_3) transitions.]
X / 8 V r r r e r S T b A I r s p M L e i y 6 M Z l B X v B d i i Z N N O G J p k h y Q h l m B d w 7 V a f w Z 2 4 9 S 1 8 B N / C t J 2 F t j 0 Q + D j n / 0 l y g p g z b V z 3 2 1 l b 3 9 j c 2 i 7 s F H f 3 9 g 8 O S 0 f H L R 0 l i t A m i X i k O g H W l D N J m 4 Y Z T j u x o l g E n L a D 8 e 0 0 b z 9 R p V k k H 8 w k p r 7 A Q 8 l C R r C x 1 m M v E O k o 6 6 e 1 r F 8 q u x V 3 J r Q M X g 5 l y N X o l 3 5 6 g 4 g k g k p D O N a 6 6 7 m x 8 V O s D C O c Z s V e o m m M y R g P a d e i x I J q P 5 2 9 O E P n 1 h m g M F L 2 S I N m 7 t + N F A u t J y K w k w K b k V 7 M p u b K L B C r 7 G 5 i w m s / Z T J O D J V k f n + Y c G Q i N G 0 F D Z i i x P C J B U w U s 1 9 A Z I Q V J s Z 2 V 7 T d e I t N L E O r W v E u K 9 X 7 W r l + k 7 d U g F M 4 g w v w 4 A r q c A c N a A I B C S / w C m / O s / P u f D i f 8 9 E 1 J 9 8 5 g X 9 y v n 4 B D O + a K g = = < / l a t e x i t > h 5 < l a t e x i t s h a 1 _ b a s e 6 4 = " u q i / G + 4 S C i a U g w m u l C S + H x A k p X I = " > A A A C C H i c b Z D L S g M x G I X / 8 V r r r e r S T b A I r s p M V X R Z d O O y g r 1 g O 5 R M m m l D k 8 y Q Z I Q y z A u 4 d q v P 4 E 7 c + h Y + g m 9 h 2 s 5 C 2 x 4 I f J z z / y Q 5 Q c y Z N q 7 7 7 a y s r q 1 v b B a 2 i t s 7 u 3 v 7 p Y P D p o 4 S R W i D R D x S 7 Q B r y p m k D c M M p + 1 Y U S w C T l v B 6 H a S t 5 6 o 0 i y S D 2 Y c U 1 / g g W Q h I 9 h Y 6 7 E b i H S Y 9 d L L r F c q u x V 3 K r Q I X g 5 l y F X v l X 6 6 / Y g k g k p D O N a 6 4 7 m x 8 V O s D C O c Z s V u o m m M y Q g P a M e i x I J q P 5 2 + O E O n 1 u m j M F L 2 S I O m 7 t + N F A u t x y K w k w K b o Z 7 P J u b S L B D L 7 E 5 i w m s / Z T J O D J V k d n + Y c G Q i N G k F 9 Z m i x P C x B U w U s 1 9 A Z I g V J s Z 2 V 7 T d e P N N L E K z W v H O K 9 X 7 i 3 L t J m + p A M d w A m f g w R X U 4 A 7 q 0 A A C E l 7 g F d 6 c Z + f d + X A + Z 6 M r T r 5 z B P / k f P 0 C D o u a K w = = < / l a t e x i t > h 6 < l a t e x i t s h a 1 _ b a 
s e 6 4 = " 3 K n j 3 y x f k Q g l x r M R o R M 3 E z g I i q o = " > A A A C C H i c b Z D L S g M x G I X / 8 V r r r e r S T b A I r s p M F X V Z d O O y g r 1 g O 5 R M m m l D k 8 y Q Z I Q y z A u 4 d q v P 4 E 7 c + h Y + g m 9 h 2 s 5 C 2 x 4 I f J z z / y Q 5 Q c y Z N q 7 7 7 a y s r q 1 v b B a 2 i t s 7 u 3 v 7 p Y P D p o 4 S R W i D R D x S 7 Q B r y p m k D c M M p + 1 Y U S w C T l v B 6 H a S t 5 6 o 0 i y S D 2 Y c U 1 / g g W Q h I 9 h Y 6 7 E b i H S Y 9 d L L r F c q u x V 3 K r Q I X g 5 l y F X v l X 6 6 / Y g k g k p D O N a 6 4 7 m x 8 V O s D C O c Z s V u o m m M y Q g P a M e i x I J q P 5 2 + O E O n 1 u m j M F L 2 S I O m 7 t + N F A u t x y K w k w K b o Z 7 P J u b S L B D L 7 E 5 i w m s / Z T J O D J V k d n + Y c G Q i N G k F 9 Z m i x P C x B U w U s 1 9 A Z I g V J s Z 2 V 7 T d e P N N L E K z W v H O K 9 X 7 i 3 L t J m + p A M d w A m f g w R X U 4 A 7 q 0 A A C E l 7 g F d 6 c Z + f d + X A + Z 6 M r T r 5 z B P / k f P 0 C E C e a L A = = < / l a t e x i t > g 1 < l a t e x i t s h a 1 _ b a s e 6 4 = " o q E Y G I i h l j u g 0 6 w 7 7 L z j x B + 1 J S 4 = " > A A A C C H i c b Z D L S g M x G I X / q b d a b 1 W X b o J F c F V m q q D L o h u X F e w F 2 6 F k 0 k w b m m S G J C O U Y V 7 A t V t 9 B n f i 1 r f w E X w L 0 3 Y W 2 v Z A 4 O O c / y f J C W L O t H H d b 6 e w t r 6 x u V X c L u 3 s 7 u 0 f l A + P W j p K F K F N E v F I d Q K s K W e S N g 0 z n H Z i R b E I O G 0 H 4 9 t p 3 n 6 i S r N I P p h J T H 2 B h 5 K F j G B j r c d e I N J h 1 k + 9 r F + u u F V 3 J r Q M X g 4 V y N X o l 3 9 6 g 4 g k g k p D O N a 6 6 7 m x 8 V O s D C O c Z q V e o m m M y R g P a d e i x I J q P 5 2 9 O E N n 1 h m g M F L 2 S I N m 7 t + N F A u t J y K w k w K b k V 7 M p u b K L B C r 7 G 5 i w m s / Z T J O D J V k f n + Y c G Q i N G 0 F D Z i i x P C J B U w U s 1 9 A Z I Q V J s Z 2 V 7 L d e I t N L E O r V v U u q r X 7 y 0 r 9 J m + p C C d w C u f g w R X U 4 Q 4 a 0 A Q C E l 7 g F d 6 c Z + f d + 
X A + 5 6 M F J 9 8 5 h n 9 y v n 4 B B n u a J g = = < / l a t e x i t > g 2 < l a t e x i t s h a 1 _ b a s e 6 4 = " + x r A H 9 n k n m I R j b Q Z V g 6 1 q h s V Q H Y = " > A A A C C H i c b Z D L S g M x G I X / q b d a b 1 W X b o J F c F V m q q D L o h u X F e w F 2 6 F k 0 k w b m m S G J C O U Y V 7 A t V t 9 B n f i 1 r f w E X w L 0 3 Y W 2 v Z A 4 O O c / y f J C W L O t H H d b 6 e w t r 6 x u V X c L u 3 s 7 u 0 f l A + P W j p K F K F N E v F I d Q K s K W e S N g 0 z n H Z i R b E I O G 0 H 4 9 t p 3 n 6 i S r N I P p h J T H 2 B h 5 K F j G B j r c d e I N J h 1 k 9 r W b 9 c c a v u T G g Z v B w q k K v R L / / 0 B h F J B J W G c K x 1 1 3 N j 4 6 d Y G U Y 4 z U q 9 R N M Y k z E e 0 q 5 F i Q X V f j p 7 c Y b O r D N A Y a T s k Q b N 3 L 8 b K R Z a T 0 R g J w U 2 I 7 2 Y T c 2 V W S B W 2 d 3 E h N d + y m S c G C r J / P 4 w 4 c h E a N o K G j B F i e E T C 5 g o Z r + A y A g r T I z t r m S 7 8 R a b W I Z W r e p d V G v 3 l 5 X 6 T d 5 S E U 7 g F M 7 B g y u o w x 0 0 o A k E J L z A K 7 w 5 z 8 6 7 8 + F 8 z k c L T r 5 z D P / k f P 0 C C B e a J w = = < / l a t e x i t > g 3 < l a t e x i t s h a 1 _ b a s e 6 4 = " s y S y y h X q 4 f O w O R u G n u M f P A b 5 c 4 Y = " > A A A C C H i c b Z D N T g I x F I X v 4 B / i H + r S T S M x c U V m w E S X R D c u M R E w w o R 0 S g c a 2 s 6 k 7 Z i Q y b y A a 7 f 6 D O 6 M W 9 / C R / A t L D A L B U 7 S 5 M s 5 9 6 b t C W L O t H H d b 6 e w t r 6 x u V X c L u 3 s 7 u 0 f l A + P 2 j p K F K E t E v F I P Q R Y U 8 4 k b R l m O H 2 I F c U i 4 L Q T j G + m e e e J K s 0 i e W 8 m M f U F H k o W M o K N t R 5 7 g U i H W T + t Z / 1 y x a 2 6 M 6 F l 8 H K o Q K 5 m v / z T G 0 Q k E V Q a w r H W X c + N j Z 9 i Z R j h N C v 1 E k 1 j T M Z 4 S L s W J R Z U + + n s x R k 6 s 8 4 A h Z G y R x o 0 c / 9 u p F h o P R G B n R T Y j P R i N j V X Z o F Y Z X c T E 1 7 5 K Z N x Y q g k 8 / v D h C M T o W k r a M A U J Y Z P L G C i m P 0 C I i O s M D G 2 u 5 L t x l t 
s Y h n a t a p X r 9 b u L i q N 6 7 y l I p z A K Z y D B 5 f Q g F t o Q g s I S H i B V 3 h z n p 1 3 5 8 P 5 n I 8 W n H z n G P 7 J + f o F C b O a K A = = < / l a t e x i t > (h 1 , g 0 ) h 2 (h 2 , g 1 ) h 3 (h 3 , g 2 ) h 4 (h 4 , g 2 ) h 5 (h 5 , g 3 ) h 6 < l a t e x i t s h a 1 _ b a s e 6 4 = " Y z S A W e P x / W l l / P O Z 7 D N u c Q S 3 S a 4 = " > A A A C z n i c f Z J b S 8 M w H M X T e p v z N v X R l + A Q F G S 0 3 b w 8 D n 3 x c Y L b x H W U N M 2 2 s K T t k l Q Y p f j q p / P d j + C 3 M J u V T S v + I X A 4 5 x d C T u L H j E p l W e + G u b K 6 t r 5 R 2 i x v b e / s 7 l X 2 D z o y S g Q m b R y x S D z 6 S B J G Q 9 J W V D H y G A u C u M 9 I 1 x / f z v L u M x G S R u G D m s a k z 9 E w p A O K k d K W V 3 k 7 d X 2 e j j L P P o c z N c y 8 1 M r O o D u Z J C i A e e h A 1 y 1 / k 8 6 C t A t k f Z m s L 0 i n Q D a W y c Z / 5 M U y e b E g 6 w X y 0 q t U r Z o 1 H 1 g U d i 6 q I J + W V / l w g w g n n I Q K M y R l z 7 Z i 1 U + R U B Q z k p X d R J I Y 4 T E a k p 6 W I e J E 9 t N 5 7 x k 8 0 U 4 A B 5 H Q K 1 R w 7 i 7 v S B G X c s p 9 T X K k R v J 3 N j P / z H z + l 9 1 L 1 O C 6 n 9 I w T h Q J 8 d f 5 g 4 R B F c H Z 2 8 K A C o I V m 2 q B s K D 6 C h C P k E B Y 6 R 9 Q 1 t 3 Y v 5 s o i o 5 T s + s 1 5 7 5 R b d 7 k L Z X A E T g G p 8 A G V 6 A J 7 k A L t A E 2 G s a T g Y 3 A b J n P Z m a + f K G m k e 8 5 B D / G f P 0 E e U 3 c q w = = < / l a t e x i t > (h t 1 , g u ) < l a t e x i t s h a 1 _ b a s e 6 4 = " / 4 z v V F e V D H 0 I X N e u J I B R 7 2 U v t c s = " > A A A C G n i c b V D L S s N A F J 3 U V 6 2 v q D v d D B a h g p a k C r o s u n F Z w T 6 g D W E y n b a D M 5 M w M x F K C P g h r t 3 q N 7 g T t 2 7 8 B P / C S Z u F t j 1 w 4 d x z 7 u V y T x A x q r T j f F u F p e W V 1 b X i e m l j c 2 t 7 x 9 7 d a 6 k w l p g 0 c c h C 2 Q m Q I o w K 0 t R U M 9 K J J E E 8 Y K Q d P N x k f v u R S E V D c a / H E f E 4 G g o 6 o B h p I / n 2 Q a U X 8 G S U + o k + c 9 N T m H V 
D 0 8 X p i W + X n a o z A Z w n b k 7 K I E f D t 3 9 6 / R D H n A i N G V K q 6 z q R 9 h I k N c W M p K V e r E i E 8 A M a k q 6 h A n G i v G T y Q w q P j d K H g 1 C a E h p O 1 L 8 b C e J K j X l g J j n S I z X r Z e J C L + C L 5 G 6 s B 1 d e Q k U U a y L w 9 P 4 g Z l C H M M s J 9 q k k W L O x I Q h L a l 6 A e I Q k w t q k W T L Z u L N J z J N W r e q e V 2 t 3 F + X 6 d Z 5 S E R y C I 1 A B L r g E d X A L G q A J M H g C L + A V v F n P 1 r v 1 Y X 1 O R w t W v r M P / s H 6 + g U u + K C M < / l a t e x i t > h t < l a t e x i t s h a 1 _ b a s e 6 4 = " Z h V p k y 9 R 3 K g e n P L 9 G L I o G h e o d d U = " > A A A C B n i c b Z D N S g M x F I X v + F v r X 9 W l m 8 E i u C o z V d B l 0 Y 3 L C v Y H 2 q F k 0 k w b m m S G 5 I 5 Q h u 5 d u 9 V n c C d u f Q 0 f w b c w b W e h b Q 8 E P s 6 5 l y Q n T A Q 3 6 H n f z t r 6 x u b W d m G n u L u 3 f 3 B Y O j p u m j j V l D V o L G L d D o l h g i v W Q I 6 C t R P N i A w F a 4 W j u 2 n e e m L a 8 F g 9 4 j h h g S Q D x S N O C V q r 3 Q 1 l N p z 0 s F c q e x V v J n c Z / B z K k K v e K / 1 0 + z F N J V N I B T G m 4 3 s J B h n R y K l g k 2 I 3 N S w h d E Q G r G N R E c l M k M 3 e O 3 H P r d N 3 o 1 j b o 9 C d u X 8 3 M i K N G c v Q T k q C Q 7 O Y T c 2 V W S h X 2 Z 0 U o 5 s g 4 y p J k S k 6 v z 9 K h Y u x O + 3 E 7 X P N K I q x B U I 1 t 1 9 w 6 Z B o Q t E 2 V 7 T d + I t N L E O z W v E v K 9 W H q 3 L t N m + p A K d w B h f g w z X U 4 B 7 q 0 A A K A l 7 g F d 6 c Z + f d + X A + 5 6 N r T r 5 z A v / k f P 0 C l F e Z X g = = < / l a t e x i t > g 0 < l a t e x i t s h a 1 _ b a s e 6 4 = " x + g / i 4 Z G 8 0 s i h P N S R A S 9 f j x S 0 S 4 = " > A A A C B n i c b Z D L S g M x G I X / e K 3 1 V n X p J l g E V 2 W m C r o s u n F Z w V 6 g H U o m z b S h S W Z I M k I Z u n f t V p / B n b j 1 N X w E 3 8 K 0 n Y W 2 P R D 4 O O f / S X L C R H B j P e 8 b r a 1 v b G 5 t F 3 a K u 3 v 7 B 4 e l o + O m i V N N W Y P G I t b t k B g m u G I N y 
6 1 g 7 U Q z I k P B W u H o b p q 3 n p g 2 P F a P d p y w Q J K B 4 h G n x D q r 3 Q 1 l N p j 0 v F 6 p 7 F W 8 m f A y + D m U I V e 9 V / r p 9 m O a S q Y s F c S Y j u 8 l N s i I t p w K N i l 2 U 8 M S Q k d k w D o O F Z H M B N n s v R N 8 7 p w + j m L t j r J 4 5 v 7 d y I g 0 Z i x D N y m J H Z r F b G q u z E K 5 y u 6 k N r o J M q 6 S 1 D J F 5 / d H q c A 2 x t N O c J 9 r R q 0 Y O y B U c / c F T I d E E 2 p d c 0 X X j b / Y x D I 0 q x X / s l J 9 u C r X b v O W C n A K Z 3 A B P l x D D e 6 h D g 2 g I O A F X u E N P a N 3 9 I E + 5 6 N r K N 8 5 g X 9 C X 7 8 l j Z k Z < / l a t e x i t > f IMP < l a t e x i t s h a 1 _ b a s e 6 4 = " f 7 F w Q x j Y S J k o W R K 9 k q m 7 o Y W t l h 0 = " > A A A C D H i c b Z D L S g M x G I U z X m u 9 V V 2 6 C R b B V Z m p g i 6 L b n Q h V L A X a G v J p J k 2 N M k M y T 9 i G e Y V X L v V Z 3 A n b n 0 H H 8 G 3 M G 1 n o W 0 P B D 7 O + X + S H D 8 S 3 I D r f j t L y y u r a + u 5 j f z m 1 v b O b m F v v 2 7 C W F N W o 6 E I d d M n h g m u W A 0 4 C N a M N C P S F 6 z h D 6 / G e e O R a c N D d Q + j i H U k 6 S s e c E r A W g 9 B N 2 k D e 4 L k 5 r a a p t 1 C 0 S 2 5 E + F 5 8 D I o o k z V b u G n 3 Q t p L J k C K o g x L c + N o J M Q D Z w K l u b b s W E R o U P S Z y 2 L i k h m O s n k 1 S k + t k 4 P B 6 G 2 R w G e u H 8 3 E i K N G U n f T k o C A z O b j c 2 F m S 8 X 2 a 0 Y g o t O w l U U A 1 N 0 e n 8 Q C w w h H j e D e 1 w z C m J k g V D N 7 R c w H R B N K N j + 8 r Y b b 7 a J e a i X S 9 5 p q X x 3 V q x c Z i 3 l 0 C E 6 Q i f I Q + e o g q 5 R F d U Q R R q 9 o F f 0 5 j w 7 7 8 6 H 8 z k d X X K y n Q P 0 T 8 7 X L 1 n A m / g = < / l a t e x i t > g 0 g 1 g 2 g 3 h 1 h 2 h 3 h 4 h 5 y 1 y 2 y 3 U T h 6 ? 
< l a t e x i t s h a _ b a s e = " I a n u d g e B a Y i J V l R J V q I v t D A = " > A A A C C X i c b Z D L S g M x G I U z V b H W W l m A R X J W Z u t C N W H T j s o K w H Q o m T T T h u Y y J J l C G f o E r t q w i d w J d + A j i I / g W p p e F t j Q + D j n / l y o o R R b T z v m t r K t b + Q a t n d w v B X c t U Y V L D k k n V j J A m j A p S M Q w k w U Q T x i p B H b Z Y C U p l L c m F C Q o g s Y U I O t o D V A S k j T o L b L h S k j c R X A R / B s W r T / c y e f l y q + C T s j c c q J M J g h r Q P f S y Y I W U o Z m T k t l J N E o T q E s C i w J x o s N s u Q R P L F O B Z S S M M n L h / N z L E t R y y E y Z H p P h u b S O I L O D M Q X Y U Z F k h o i P T + O G X Q S D i u B X a o I t i w o Q W E F b V f g L i H F M L G l u f a b v z J h a h X i Z X y n V e

s X I O p u A I H I N T I N z U A G o A p q A A M J H s E T e H Y e n F f n z X m f j u a c c h + C f n x e C p < / l a t e x i t >

? < l a t e x i t s h a 1 _ b a s e 6 4 = " 0 I a n 6 u d g e B a Y i J V l R J V q 9 4 I v t D A = " > A A A C C X i c b Z D L S g M x G I U z 9 V b H W 9 W l m 2 A R X J W Z u t C N W H T j s o K 9 w H Q o m T T T h u Y y J J l C G f o E r t 3 q w i d w J 2 5 d + A j i I / g W p p e F t j 0 Q + D j n / 0 l y o o R R b T z v 2 8 m t r K 6 t b + Q 3 3 a 3 t n d 2 9 w v 5 B X c t U Y V L D k k n V j J A m j A p S M 9 Q w 0 k w U Q T x i p B H 1 b 8 Z 5 Y 0 C U p l L c m 2 F C Q o 6 6 g s Y U I 2 O t o D V A S k j T o 6 L b L h S 9 k j c R X A R / B s W r T / c y e f l y q + 3 C T 6 s j c c q J M J g h r Q P f S 0 y Y I W U o Z m T k t l J N E o T 7 q E s C i w J x o s N s 8 u Q R P L F O B 8 Z S 2 S M M n L h / N z L E t R 7 y y E 5 y Z H p 6 P h u b S 7 O I L 7 O D 1 M Q X Y U Z F k h o i 8 P T + O G X Q S D i u B X a o I t i w o Q W E F b V f g L i H F M L G l u f a b v z 5 J h a h X i 7 5 Z 6 X y n V e s X I O p 8 u A I H I N T 4 I N z U A G 3 o A p q A A M J H s E T e H Y e n F f n z X m f j u a c 2 c 4 h + C f n 4 x e 4 C p 3 3 < / l a t e x i t > ? 
< l a t e x i t s h a 1 _ b a s e 6 4 = " 0 I a n 6 u d g e B a Y i J V l R J V q 9 4 I v t D A = " > A A A C C X i c b Z D L S g M x G I U z 9 V b H W 9 W l m 2 A R X J W Z u t C N W H T j s o K 9 w H Q o m T T T h u Y y J J l C G f o E r t 3 q w i d w J 2 5 d + A j i I / g W p p e F t j 0 Q + D j n / 0 l y o o R R b T z v 2 8 m t r K 6 t b + Q 3 3 a 3 t n d 2 9 w v 5 B X c t U Y V L D k k n V j J A m j A p S M 9 Q w 0 k w U Q T x i p B H 1 b 8 Z 5 Y 0 C U p l L c m 2 F C Q o 6 6 g s Y U I 2 O t o D V A S k j T o 6 L b L h S 9 k j c R X A R / B s W r T / c y e f l y q + 3 C T 6 s j c c q J M J g h r Q P f S 0 y Y I W U o Z m T k t l J N E o T 7 q E s C i w J x o s N s 8 u Q R P L F O B 8 Z S 2 S M M n L h / N z L E t R 7 y y E 5 y Z H p 6 P h u b S 7 O I L 7 O D 1 M Q X Y U Z F k h o i 8 P T + O G X Q S D i u B X a o I t i w o Q W E F b V f g L i H F M L G l u f a b v z 5 J h a h X i 7 5 Z 6 X y n V e s X I O p 8 u A I H I N T 4 I N z U A G 3 o A p q A A M J H s E T e H Y e n F f n z X m f j u a c 2 c 4 h + C f n 4 x e 4 C p 3 3 < / l a t e x i t > ? 
< l a t e x i t s h a 1 _ b a s e 6 4 = " 0 I a n 6 u d g e B a Y i J V l R J V q 9 4 I v t D A = " > A A A C C X i c b Z D L S g M x G I U z 9 V b H W 9 W l m 2 A R X J W Z u t C N W H T j s o K 9 w H Q o m T T T h u Y y J J l C G f o E r t 3 q w i d w J 2 5 d + A j i I / g W p p e F t j 0 Q + D j n / 0 l y o o R R b T z v 2 8 m t r K 6 t b + Q 3 3 a 3 t n d 2 9 w v 5 B X c t U Y V L D k k n V j J A m j A p S M 9 Q w 0 k w U Q T x i p B H 1 b 8 Z 5 Y 0 C U p l L c m 2 F C Q o 6 6 g s Y U I 2 O t o D V A S k j T o 6 L b L h S 9 k j c R X A R / B s W r T / c y e f l y q + 3 C T 6 s j c c q J M J g h r Q P f S 0 y Y I W U o Z m T k t l J N E o T 7 q E s C i w J x o s N s 8 u Q R P L F O B 8 Z S 2 S M M n L h / N z L E t R 7 y y E 5 y Z H p 6 P h u b S 7 O I L 7 O D 1 M Q X Y U Z F k h o i 8 P T + O G X Q S D i u B X a o I t i w o Q W E F b V f g L i H F M L G l u f a b v z 5 J h a h X i 7 5 Z 6 X y n V e s X I O p 8 u A I H I N T 4 I N z U A G 3 o A p q A A M J H s E T e H Y e n F f n z X m f j u a c 2 c 4 h + C f n 4 x e 4 C p 3 3 < / l a t e x i t > ? 
< l a t e x i t s h a 1 _ b a s e 6 4 = " 0 I a n 6 u d g e B a Y i J V l R J V q 9 4 I v t D A = " > A A A C C X i c b Z D L S g M x G I U z 9 V b H W 9 W l m 2 A R X J W Z u t C N W H T j s o K 9 w H Q o m T T T h u Y y J J l C G f o E r t 3 q w i d w J 2 5 d + A j i I / g W p p e F t j 0 Q + D j n / 0 l y o o R R b T z v 2 8 m t r K 6 t b + Q 3 3 a 3 t n d 2 9 w v 5 B X c t U Y V L D k k n V j J A m j A p S M 9 Q w 0 k w U Q T x i p B H 1 b 8 Z 5 Y 0 C U p l L c m 2 F C Q o 6 6 g s Y U I 2 O t o D V A S k j T o 6 L b L h S 9 k j c R X A R / B s W r T / c y e f l y q + 3 C T 6 s j c c q J M J g h r Q P f S 0 y Y I W U o Z m T k t l J N E o T 7 q E s C i w J x o s N s 8 u Q R P L F O B 8 Z S 2 S M M n L h / N z L E t R 7 y y E 5 y Z H p 6 P h u b S 7 O I L 7 O D 1 M Q X Y U Z F k h o i 8 P T + O G X Q S D i u B X a o I t i w o Q W E F b V f g L i H F M L G l u f a b v z 5 J h a h X i 7 5 Z 6 X y n V e s X I O p 8 u A I H I N T 4 I N z U A G 3 o A p q A A M J H s E T e H Y e n F f n z X m f j u a c 2 c 4 h + C f n 4 x e 4 C p 3 3 < / l a t e x i t > Figure 2 : Training the imputation model of TOLSTOI using RNN-T alignments. ically using a Transformer or recurrent network, (2) A language module M L with parameters θ L that converts text tokens y = y 1 , . . . , y U to contextual vectors g 1 , . . . , g U using a recurrent network so that g u summarizes the tokens before y u , and (3) A thin joint network M J with parameters θ J that combines vectors h t and g u and outputs a softmax distribution spanning the vocabulary V plus the blank symbol ∅. To generate a token sequence y, the RNN-T implicitly aligns each token y u to one frame t where it is output. An example of a valid alignmentfoot_0 appears in Figure 2 . Even though the network is modular, all parameters are trained jointly end-to-end by maximizing the likelihood of the output y i given speech x i over all (x i , y i ) ∈ D. 
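The thin joint module (3) can be sketched as follows. This is a hypothetical minimal illustration in numpy, not the paper's exact joint network: the additive combination, the tanh, and all dimensions are assumptions.

```python
import numpy as np

def joint_network(h_t, g_u, W, b):
    """Combine a speech vector h_t and a language-context vector g_u
    into a softmax distribution over the vocabulary plus blank.
    W, b project the combined vector to |V| + 1 logits."""
    z = np.tanh(h_t + g_u)              # simple additive combination (assumed)
    logits = W @ z + b                  # project to |V| + 1 scores
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()                  # last entry plays the role of blank

rng = np.random.default_rng(0)
d, vocab = 256, 30
W, b = rng.normal(size=(vocab + 1, d)), np.zeros(vocab + 1)
p = joint_network(rng.normal(size=d), rng.normal(size=d), W, b)
# p is a proper distribution over V plus the blank symbol
```

At decoding time, one such distribution is produced at every lattice node (t, u).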
During training, the log-likelihood of the target sequence y_i is computed by marginalizing over all possible alignments of the T_i frames of x_i and the U_i tokens of y_i using an efficient forward-backward algorithm; we will refer to this objective as the RNNT-loss (Graves, 2012):

    min_{θ_L, θ_S, θ_J}  Σ_{(x_i, y_i) ∈ D}  RNNT-loss( y_i, { M_J(· | g_u = M_L(y_1, ..., y_{u-1}), h_t = M_S(x_1, ..., x_T, t) ) } )

During inference, beam search finds the best possible alignment (t, u), and the predicted sequence is the concatenation of the non-blank tokens at each aligned (t, u) (Graves, 2012; Saon et al., 2020).

Motivation of our approach Figure 1 presents a schematic diagram of our imputation-based adaptation. Given only text data D for adaptation, we propose to update the language parameters θ_L and joint parameters θ_J while keeping the speech parameters θ_S fixed. This is challenging since, even though the network architecture is modular, the training is not. Treating M_L as a language model and updating part of the network using text-only data, as proposed in (Pylkkonen et al., 2021; Chen et al., 2022a), has the potential of deteriorating performance as the output vector of M_L becomes incompatible with the vector from M_S. We therefore propose to first augment the missing speech data in D by imputing it using a separate generator. However, training a full-fledged TTS model requires substantial training data and resources, and high-quality TTS models are not available for low-resource languages. Our key insight is to impute, not the full raw speech x, but the h vectors from the last layer of the speech module M_S. Generating the last-layer vectors h is significantly easier than generating the raw audio signals x. In fact, only a thin joint layer separates h from the output character distribution, so the h vectors are expected to be "closer" to text than speech. We are able to design a very simple and low-overhead model for imputing the h vectors from the text.
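The marginalization over alignments inside the RNNT-loss can be illustrated with a toy forward recursion over the (t, u) lattice. This is our own illustrative sketch, not the paper's implementation: constant blank/token probabilities are used so the result can be checked against brute-force path enumeration.

```python
import numpy as np

def rnnt_forward(p_blank, p_tok, T, U):
    """Forward variable alpha[t, u]: total probability of all partial
    alignments that have consumed frames 1..t and emitted tokens y_1..y_u.
    p_blank[t, u] / p_tok[t, u] are P(blank) / P(y_{u+1}) at node (t, u)."""
    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t, u] += alpha[t - 1, u] * p_blank[t - 1, u]  # advance a frame
            if u > 0:
                alpha[t, u] += alpha[t, u - 1] * p_tok[t, u - 1]    # emit a token
    return alpha[T - 1, U] * p_blank[T - 1, U]  # a final blank ends the alignment

T, U = 2, 1
p_blank = np.full((T, U + 1), 0.6)
p_tok = np.full((T, U + 1), 0.3)
# Two valid alignments exist: (y1, blank, blank) and (blank, y1, blank),
# each with probability 0.3 * 0.6 * 0.6 = 0.108, so the total is 0.216.
likelihood = rnnt_forward(p_blank, p_tok, T, U)
```

Real implementations run this recursion in log space over per-node distributions from M_J; the toy constants merely make the marginalization checkable by hand.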
In Section 3.1 we describe the design and training of the imputation model. Once the imputation model is trained, for any new target domain we attach to each text y ∈ D its imputed h values to create a proxy parallel dataset for fine-tuning M_J and M_L. In Section 3.2 we describe how we perform this fine-tuning.

3.1. IMPUTATION MODEL

Let H denote the space of vectors from the last layer of M_S. The goal of our imputation model is to directly model P(H|Y) instead of the P(X|Y) that TTS models attempt. The imputation model generates proxy output vectors h_1, ..., h_T of the speech module M_S given only a text sequence y_1, ..., y_U, so as to mimic the output of M_S(x_1, ..., x_T) without having access to the real audio frames x_1, ..., x_T. A default option is to train a full-fledged sequence-to-sequence model.

(3) WER on Source-Target Mixture: The test sets for each of the target domains are small. In any ASR system deployed in the real world, it is important for the adapted model not to be significantly worse than the unadapted model on sentences outside its narrow domain because of catastrophic forgetting during adaptation. Therefore, we also measure the average WER over the target and source test datasets. The test data of the source domain comprises 4458 (speech utterance, text) pairs from Hub5'00 and CallHome.

We compare TOLSTOI against three existing methods of text-only adaptation of RNN-Ts. Shallow Fusion (Kannan et al., 2018) is a standard method which uses an external LM trained on the target-domain text to interpolate the RNN-T probabilities during inference. For the external LM, we use the same configuration as our prediction network and set the interpolation parameter λ to 0.3 in all cases. Since we do not assume access to any validation set from the target domain, tuning this hyper-parameter is not an option. NN-LM In this method, the prediction network is viewed as an auto-regressive language model. To get the LM scores, a linear layer is added to project the prediction network representations to the output vocabulary space V and fine-tuned on the target domain text for one epoch.
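The score interpolation used by the shallow fusion baseline can be sketched as follows. This is a hypothetical numpy illustration of log-linear score fusion during beam search; λ = 0.3 follows the setup above, while the shapes and the choice to leave the blank symbol unscored by the LM are assumptions.

```python
import numpy as np

def shallow_fusion_scores(log_p_rnnt, log_p_lm, lam=0.3):
    """Add lambda-weighted external-LM log-probabilities to the RNN-T
    log-probabilities of the non-blank tokens during beam search.
    The blank symbol (last entry) receives no LM score."""
    fused = log_p_rnnt.copy()
    fused[:-1] += lam * log_p_lm   # LM covers only the real vocabulary
    return fused

vocab = 5
rng = np.random.default_rng(1)
log_p_rnnt = np.log(rng.dirichlet(np.ones(vocab + 1)))  # V tokens + blank
log_p_lm = np.log(rng.dirichlet(np.ones(vocab)))        # external LM over V
fused = shallow_fusion_scores(log_p_rnnt, log_p_lm)
best = int(np.argmax(fused))   # candidate the fused score would extend
```

The extra LM forward pass per beam hypothesis is what inflates decode time, as discussed in the comparisons below.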
Textogram (Thomas et al., 2022) In this method, the transcription network is modified to accept two input modalities: (i) standard acoustic features and (ii) one-hot encodings of the units in the text, with the encoding of each unit repeated a fixed number of times so that each feature has a duration similar to the spectrograms. Other methods We also compared with two other methods: (1) The factorized model proposed in Chen et al. (2022a). However, their factorized RNN-T model had significantly worse accuracy than our unadapted RNN-T model both before and after text-only adaptation. (2) TTS, where we used a proprietary IBM synthesis engine (Kons et al., 2019; Fernandez et al., 2022) based on Tacotron (Shen et al., 2018) to generate audio for the text-only inputs. This gives competitive reductions in target-only/source-target mixed WERs (3.2/6.5 and 8.6/9.8 on the ATIS and HVB test sets, respectively). However, it is worth noting that all three target domains consist of US-accented English speech, and the TTS system we employ is carefully tuned to produce natural-sounding US-accented English samples. Building TTS of similarly high quality for low-resource languages is a non-trivial task (Ogayo et al., 2022). In contrast, TOLSTOI relies only on access to the ASR pretraining data to train the imputation model and can potentially scale well to low-resource languages.

Table 1 presents the decode time (RTF), target-only WER, and source-target mixed WER of different adaptation methods on two ASR models adapted to three different target domains. If we focus only on target WERs, shallow fusion provides good reductions in WER. However, this comes at the cost of a 2.5-times blow-up in decode time (RTF), which could be unacceptable in many applications. Also, the external LM fine-tuned on the target text leads to significant catastrophic forgetting, as observed by the significantly increased WER on the mixed corpus.
TOLSTOI provides the best WER when measured on the mixed test set while being competitive with shallow fusion on the target-only set. The decode time of TOLSTOI is exactly the same as that of the basic RNN-T since we do not modify the RNN-T architecture. NN-LM does not provide much gain beyond the unadapted model. Textogram is more successful at adaptation but also suffers significant deterioration on the mixed corpus. Moreover, Textogram modifies the network architecture of the RNN-T, resulting in significantly larger decode time compared to the basic RNN-T.
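The decode-time metric compared here is straightforward to compute; the following trivial helper is our own sketch, with made-up example durations.

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF: average seconds of compute needed to decode one second of audio.
    RTF < 1 means faster than real time; shallow fusion's roughly 2.5x
    blow-up multiplies this number by about 2.5."""
    return decode_seconds / audio_seconds

# e.g. 120 s of audio decoded in 30 s gives an RTF of 0.25
rtf = real_time_factor(30.0, 120.0)
```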

4.2. OVERALL COMPARISONS

Table 2 presents two illustrative examples from ATIS comparing predictions using shallow fusion and TOLSTOI. Shallow fusion does not handle disfluencies well if the target text used to train the external LM is devoid of them. For example, the false start "I need to reset" and the filler phrase "by the way" are omitted. More anecdotal examples are shown in Table 7 in Appendix A.

Table 3 presents ablations of the various design choices governing TOLSTOI. We investigate three dimensions of the imputation model: (1) choice of loss function, (2) choice of alignment to sample at test time, and (3) choice of model architecture. On the choice of loss function, we replaced the L1 loss in Eqn 2 with an L2 loss and a contrastive loss. We used the contrastive loss from Chen et al. (2020) for the imputed features in each batch. We also tried a distillation loss where a generated h_t is combined with the corresponding g_u via M_J to generate a probability distribution over the vocabulary, and a KL-divergence loss is imposed between this predicted distribution and the probability distribution from the real h_t. These losses were all worse than L1. On the alignment generation model, we tried the FixedGram model with 1 blank or 4 blanks before each output token instead of the default of 3. We also tested a feed-forward regression model for length that takes g_u as its input and outputs the number of speech frames learnt from the alignment data (c.f., Section 3.1). All these alignment generation models fared worse than using a FixedGram model with 3 blanks as in TOLSTOI. We next tried different tweaks to the architecture of the imputation model: changing its input context by adding g_{u+1}, adding output characters y_{u-1}, y_u, y_{u+1}, or omitting h_{t-1}. We also tried a Transformer-based (Vaswani et al., 2017) encoder-decoder imputation model to transform the complete g_u sequence to a full h_t sequence. All these variants performed worse than TOLSTOI.
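The FixedGram alignment used by TOLSTOI can be sketched as follows. This is our own minimal illustration, assuming exactly 3 blank frames precede each output token; the function and variable names are hypothetical.

```python
def fixedgram_alignment(tokens, blanks_per_token=3):
    """Deterministic RNN-T alignment: each output token y_u is preceded
    by a fixed number of blank (frame-consuming) steps, so a text of U
    tokens yields T = blanks_per_token * U speech frames to impute.
    Returns the lattice path as a list of '<blank>' / token steps."""
    path = []
    for tok in tokens:
        path.extend(["<blank>"] * blanks_per_token)  # consume frames
        path.append(tok)                             # emit the token
    return path

path = fixedgram_alignment(["c", "a", "t"])
num_frames = path.count("<blank>")   # frames whose h vectors get imputed
```

Under this schedule the imputation model is queried once per blank step, conditioned on the g_u of the upcoming token.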
During RNN-T fine-tuning, we found it useful to mix an equal number of utterances from the pretraining corpus with the imputed data (Zhu et al., 2022). Omitting this step resulted in the small WER degradations shown in the last row of Table 3. Figure 4 shows the WERs on ATIS and HVB as a function of decreasing amounts of target text. As expected, WERs increase as the amount of target text is reduced. However, the degradation in WER when using only 50% of the data versus the complete dataset is not substantial. This suggests that TOLSTOI could work well even in target domains with very limited amounts of text-only data.

5. CONCLUSION

In this paper we presented TOLSTOI, a low-overhead method of adapting RNN-T models using text-only data in a target domain. Our key insight is to impute speech features from the last layer of the transcription network, which allows accurate fine-tuning of a subset of the RNN-T parameters. We proposed a very simple design for the imputation model by leveraging existing text-speech representations and alignments from the trained RNN-T model. Unlike existing methods, TOLSTOI does not modify the base RNN-T architecture and can adapt existing pre-trained ASR models without increasing inference time during decoding. Via experiments on three target domains and two ASR models, we show that TOLSTOI provides the best accuracy on a mixed source-target test set since it is least subject to catastrophic forgetting, while reducing target WER by up to 35%. With a detailed ablation over more complicated models, we justify the effectiveness of our simple imputation model. As part of future work, we would like to experiment with our approach on low-resource languages and explore techniques for online adaptation of the ASR model.

6. REPRODUCIBILITY STATEMENT

All our experiments are performed on publicly available datasets such as Switchboard, ATIS, HarperValleyBank, and Librispeech. We also use the published train/test splits specified for each of these datasets, thus enabling reproducibility. We have provided sufficient implementation details of our baseline models, our imputation model, and the RNN-T fine-tuning process to help reproduce our main results.

Published as a conference paper at ICLR 2023.

Table 8: Anecdotal examples showing the mistakes with TOLSTOI and shallow fusion. Errors are underlined and highlighted in red. When the language model context is not clear (e.g., for the word "topeka"), TOLSTOI produces acoustically similar predictions, as opposed to the Shallow Fusion and Unadapted models.



In a valid alignment, at each step the output token is either y_u or ∅, so that the concatenation of the tokens matches y after ignoring the ∅s.



Figure 1: The figure on the left shows a schematic diagram of the standard RNN-T architecture (in blue). Three existing approaches for text-only adaptation are shown as appendages to the RNN-T model (in red): (1) Textogram increases the dimensionality of the input to the transcription network M_S, thus requiring expensive model retraining. (2) Shallow fusion uses an external LM during inference, which significantly degrades test-time latency. (3) A separate LM head can be added and jointly trained with the prediction network on the target text; this, however, does not result in very accurate adaptation. The figure on the right shows TOLSTOI, which uses a lightweight imputation model to generate speech representations corresponding to the target text.

During the training of the base model, either the acoustic features or the textogram features are used for training with the RNN-T objective. During fine-tuning, only the textogram features for each text from the target domain are used for adaptation. Unlike NN-LM, Textogram uses the RNN-T loss for fine-tuning the joint network and prediction network parameters, much like our method.
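The textogram features described above can be sketched as follows. This is a hypothetical numpy construction; the repetition factor and vocabulary size are made-up values, not those of Thomas et al. (2022).

```python
import numpy as np

def textogram(token_ids, vocab_size, repeat=4):
    """One-hot encode each text unit and repeat it a fixed number of
    times along the time axis, so the resulting feature sequence has a
    duration comparable to a spectrogram of the spoken text."""
    one_hot = np.eye(vocab_size)[token_ids]      # (U, vocab_size)
    return np.repeat(one_hot, repeat, axis=0)    # (U * repeat, vocab_size)

feats = textogram([2, 0, 1], vocab_size=5, repeat=4)
# 3 units, each repeated 4 times -> 12 pseudo-frames of width 5
```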

Figure 4: WER vs. Adaptation Text(%)

Reference: we are going to fly listen we are going to fly over saint louis
Unadapted: we are going to fly listen we are going to fly over saint louis
Shallow Fusion: we are going to fly listen we are going to fly over saint louis
TOLSTOI: we are going to fly listen we are going to fly over saint louis

Reference: he went moved to texas
Unadapted: he went moved to texas
Shallow Fusion: he went and moved to texas
TOLSTOI: he went moved to texas

Reference: i would like to transfer money between my accounts
Unadapted: i would like to transfer money between my accountants
Shallow Fusion: i would like to transfer money between my account
TOLSTOI: i would like to transfer money between my accounts

Reference: yeah we played softball last night we were clobbered
Unadapted: yeah we played softball last night we were clobbered
Shallow Fusion: yeah we played softball last night we were clovered
TOLSTOI: yeah we played softball last night we were clobbered

Reference: they plan to get married in topeka since most of their friends are there
Unadapted: they plan to get married in to figure since most of their friends are there
Shallow Fusion: they plan to get married into since most of their friends are there
TOLSTOI: they plan to get married in to peak us since most of their friends are there

Reference: show me weekday flights from milwaukee to orlando one way
Unadapted: show me weak day flights from milwaukee to orlando one way
Shallow Fusion: show me weekday flights from milwaukee to orlando one way
TOLSTOI: show me weakday flights from milwaukee to orlando one way

Adaptation Datasets  For the adaptation experiments, we choose three diverse datasets with different amounts of adaptation text, drawn from three different domains. (1) ATIS (Hemphill et al., 1990) consists of roughly 5K sentences from the airline-reservation domain for training and 893 (speech, text) utterance pairs for testing. (2) HarperValleyBank (HVB) (Wu et al., 2020) consists of roughly 15K sentences from the banking domain for training and 2797 (speech, text) utterance pairs for testing. (3) Librispeech (Panayotov et al., 2015) consists of roughly 29K sentences from audiobooks for training and 2619 (speech, text) utterance pairs for testing. Note that, in the true spirit of text-only adaptation, we do not assume the availability of a parallel validation dataset in the target domain.

Imputation Model  The imputation model of TOLSTOI is a 2-layer feed-forward network with a tanh non-linearity. The first layer projects the 512-d concatenated input (h_{t-1}, g_u) down to 256-d, and the second layer outputs the 256-d h_t. The imputation model contains roughly 200K parameters, in contrast to the 56 million parameters of the ASR model. It is trained using the same learning rate, optimizer, and learning-rate schedule as the baseline training, with a batch size of 2048. The SWB 300H training data was aligned with the corresponding ASR model, yielding a total of 42 million (h_t, g_u) pairs to train the imputation model.

Fine-tuning details  For fine-tuning M_L using the imputation model, we keep the same optimizer and learning-rate scheduler as the original RNN-T training, except that the maximum learning rate used for fine-tuning was 5e-5. We fine-tune for a fixed number of 2000 updates, since we do not have a validation set for target-specific hyper-parameter selection.

Metrics  We evaluate on three metrics: (1) RTF: Real-Time Factor, which measures the average time (in seconds) to decode one second of audio (50 audio frames in our case); it is computed by averaging over the entire test set on a single CPU. (2) WER on Target-Only: word error rate (WER) on the test set of the target domain. (3) WER on Mixed: WER on an equal mixture of source-domain and target-domain test data, which measures catastrophic forgetting.
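The imputation model above is small enough to sketch end to end. The following is a NumPy toy with random, untrained weights (the real model is trained with an L1 reconstruction loss, and these weight initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-layer feed-forward imputation model: (h_{t-1}, g_u) -> h_t.
# Layer 1 projects the 512-d concatenated input down to 256-d (tanh);
# layer 2 emits the 256-d predicted audio encoding h_t.
W1 = rng.normal(scale=0.05, size=(512, 256))
b1 = np.zeros(256)
W2 = rng.normal(scale=0.05, size=(256, 256))
b2 = np.zeros(256)

def f_imp(h_prev, g_u):
    x = np.concatenate([h_prev, g_u])        # 256-d + 256-d = 512-d input
    return np.tanh(x @ W1 + b1) @ W2 + b2    # 256-d output h_t

h_t = f_imp(np.zeros(256), rng.normal(size=256))
assert h_t.shape == (256,)

# Parameter count is ~200K, tiny next to the 56M-parameter RNN-T:
n_params = W1.size + b1.size + W2.size + b2.size
assert n_params == 512 * 256 + 256 + 256 * 256 + 256  # = 197,120
```

At roughly 0.2M parameters, the model adds negligible training and storage cost on top of the existing RNN-T.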

Comparison of decode time (RTF) and WERs on target-only and target-source mixed data for different adaptation methods on two ASR models trained on SWB 2000H (top) and SWB 300H (bottom), adapted to three different domains. WERs on the SWB test set are 8.8 and 12.7 using SWB-2000 and SWB-300, respectively. When deployed on an equal mixture of source and target data, TOLSTOI provides the highest reduction in WER.

Reference: i need to reset i would like to reset my password
Shallow Fusion: i need to reset i would like to reset my password
TOLSTOI: i need to reset i would like to reset my password

Reference: hi i lost my debit card can you send me a new one my name is robert davis by the way
Shallow Fusion: hi i lost my debit card can you send me a new one my name is robert davis by the way
TOLSTOI: hi i lost my debit card can you send me a new one my name is robert davis by the way

Anecdotal examples comparing TOLSTOI with shallow fusion. Deletion errors are underlined and highlighted in red.

On the choice of the alignment

Ablation of various design choices of TOLSTOI. More complicated loss functions, alignment models, or imputation models all performed worse than TOLSTOI's simple design choices.

A.3 EVALUATION OF TOLSTOI ON RNN-T MODEL TRAINED ON EXTREMELY LARGE DATASET

In this section, we compare the effectiveness of TOLSTOI on an RNN-T model trained on an extremely large dataset of proprietary data totaling close to 56K hours of labeled speech. The accompanying table compares WERs on target-only and target-source mixed data for different adaptation methods on this model, adapted to the ATIS and HVB domains. We empirically observe that FixedGram-based TOLSTOI works well even when used with very large amounts of training data. Our results at this scale are consistent with our results on the smaller SWB 300H and SWB 2000H benchmarks.

Anecdotal examples on the mixed test set comparing TOLSTOI with shallow fusion. Deletion errors are underlined and highlighted in red.

7. ACKNOWLEDGEMENTS

We would like to thank George Saon, Samuel Thomas and Jeff Kuo for insightful discussions. The authors from IIT Bombay gratefully acknowledge support from IBM Research, specifically the IBM AI Horizon Networks-IIT Bombay initiative.

One option is a sequence-to-sequence model that takes as input text tokens y_1, . . . , y_U and generates h_1, . . . , h_T using a standard encoder-decoder network. However, such an approach would require training a heavy-weight model from scratch. Instead, we developed a simple low-overhead model that we present next.

Designing the Low-overhead Imputation Model  Our approach is to leverage the existing trained RNN-T model M to reduce the overhead of generating the h sequence given a token sequence y. First, we avoid encoding the discrete tokens y_1, . . . , y_U from scratch by starting from the output g_1, . . . , g_U of the language model M_L. Second, we avoid the cross-attention training of a full seq2seq approach by pre-aligning the h sequence to the token sequence y. We denote an alignment as A, a sequence of (t, u) pairs indicating that the LM state g_u was aligned with audio frame h_t in a valid execution of the RNN-T. We discuss below how we generate such alignments during training and deployment of the imputation model. Given an alignment A, we factorize the generator of the h sequence as:

Pr(h_1, . . . , h_T | y, A) = ∏_{(t,u) ∈ A} Pr(h_t | h_{t-1}, g_u)

In the above, we have further assumed that h_t depends only on h_{t-1} and is independent of other prior h_r's given h_{t-1}. We model this distribution using an imputation model f_IMP, a simple feed-forward neural network that takes as input the language-model encoding g_u and the last audio encoding h_{t-1}, and produces the next audio encoding h_t. Figure 2 presents an overview.

Training the Imputation Model  To create the training data for the imputation model f_IMP, we use the training data D of the ASR model.
For each utterance x_i in D, we trace the lattice of the highest-probability beam to find the best valid alignment A_i, which consists of a sequence of h_t's and the aligned g_u's, as shown in Figure 2. We use the alignment-length synchronous decoding algorithm (Saon et al., 2020) for this. From these alignments we extract training instances for the imputation model of the form {((h_{t-1}, g_u), h_t)} for each (t, u) ∈ A_i. When multiple g_u's align with the same h_t in the best alignment output, we select the last g_u. We train the parameters of f_IMP to minimize the reconstruction loss of h_t given input (h_{t-1}, g_u):

min_f  Σ_i Σ_{(t,u) ∈ A_i} ‖ f(h_{t-1}, g_u) − h_t ‖_1

where h_0 = 0 and f ranges over standard feed-forward networks as described above. For the loss function, we considered several candidates and found the L1 distance to provide the best results, as we show in our ablation studies.
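The pair-extraction step, including the "keep the last g_u per frame" rule, can be sketched as follows (a toy version with our own variable names; frames and labels are 1-indexed to match the text):

```python
def extract_pairs(alignment, h, g, dim=256):
    """Build imputation training instances ((h_{t-1}, g_u), h_t) from a best
    alignment A = [(t, u), ...] given in lattice order. When several g_u's
    align to the same frame t, only the last one is kept."""
    last_u = {}
    for t, u in alignment:            # later pairs overwrite earlier ones
        last_u[t] = u
    h0 = [0.0] * dim                  # h_0 = 0, per the training objective
    return [((h.get(t - 1, h0), g[u]), h[t])
            for t, u in sorted(last_u.items())]

# Toy alignment: frame 1 aligns to u=1 then u=2, so g_2 is kept for frame 1.
h = {1: [0.1] * 256, 2: [0.2] * 256}
g = {1: [1.0] * 256, 2: [2.0] * 256}
pairs = extract_pairs([(1, 1), (1, 2), (2, 2)], h, g)
assert len(pairs) == 2
assert pairs[0][0][1] == g[2]         # last g_u wins for frame 1
assert pairs[0][0][0] == [0.0] * 256  # h_0 is the zero vector
```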

3.2. RNN-T FINETUNING USING THE IMPUTATION MODEL

In this section we describe how we fine-tune the RNN-T model M on the target text D using the trained f_IMP model. For each text y^i ∈ D, we first sample an alignment A_i as elaborated in Section 3.2.1. Next, we use A_i and f_IMP to sample a sequence of vectors h^i = h^i_1, . . . , h^i_{|A_i|}, as shown in the alignment example in Figure 2. We first feed the token sequence y to the language module to get a sequence of outputs g_1, . . . , g_U. Next, we invoke the imputation model f_IMP to generate, for each (t, u) ∈ A in increasing order of t, h_t ∼ f_IMP(h_{t-1}, g_u). This gives us a pseudo-labeled training set D_Imp = {(h^i, y^i)} that we use to fine-tune the parameters θ_L and θ_J of the RNN-T with the same loss function of maximizing the likelihood of y over all possible alignments. Since the generated h come from the Pr(h|y) distribution, and only the Pr(y) distribution changes in the target domain, this fine-tuning step adapts to the target distribution. We tried several light-weight methods of generating alignments for a given text token sequence y. A simple method is to generate a fixed number B of blanks (∅) before each token y_u; we call this the FixedGram model. A second option is to train a distribution over the number of blanks for each token v ∈ V. Finally, a more involved method would be a state-based model, much like the imputation model, that takes as input g_u and h_{t-1} and decides whether the next token should be a blank. We will show that the simple FixedGram model was adequate for the adaptation task. Figure 3 presents an overview of our steps during fine-tuning under the FixedGram alignment model. Algorithm 1 presents the overall pseudocode of TOLSTOI. Note that the training of the imputation model is performed once and is independent of the adaptation dataset.
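Under the FixedGram model, alignment generation is trivial. A sketch follows; the exact convention for which LM state the blank frames pair with is our assumption here (we pair every frame with the index of the upcoming token):

```python
def fixedgram_alignment(y, B=2):
    """FixedGram: emit B blank frames before each token y_u. Returns (t, u)
    pairs aligning every frame t to the index u of the next token emitted."""
    pairs, t = [], 0
    for u in range(1, len(y) + 1):
        for _ in range(B + 1):        # B blank frames, then the token frame
            t += 1
            pairs.append((t, u))
    return pairs

# For y = "hi" and B = 2 the alignment spans T = 2 * (2 + 1) = 6 frames:
align = fixedgram_alignment("hi", B=2)
assert len(align) == 6
assert align == [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2)]
```

Because it needs no learned parameters, FixedGram keeps the adaptation pipeline entirely free of extra training beyond the one-time imputation model.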

4. EXPERIMENTS

We present an extensive evaluation of TOLSTOI against three existing approaches with two ASR models and three target domains.

4.1. EXPERIMENTAL SETUP

We perform adaptation experiments on two RNN-T models of very different capacity: (i) SWB 300H, trained on the Switchboard 300H dataset (Godfrey et al., 1992) containing over 300 hours of labelled training data, and (ii) SWB 2000H, trained on 262 hours of Switchboard-300, 1698 hours of Fisher data (Cieri et al., 2004), and 15 hours of CallHome data (Martin & Przybocki, 2000). The transcription network of the RNN-T model consists of 6 bidirectional LSTM layers with a hidden size of 640 cells and a projection layer that projects the encoder embeddings down to 256 dimensions. The prediction network consists of a single-layer unidirectional LSTM with a hidden size of 768 cells and a projection layer reducing the prediction network embeddings to 256 dimensions. The joint network first combines the transcription network embeddings (256-d) and prediction network embeddings (256-d) via a Hadamard-product joint operation (⊙) (Saon et al., 2021), followed by a tanh non-linearity and a final output softmax layer that produces a probability distribution over 45 non-blank characters (V) and a blank character (∅). The total number of parameters in the RNN-T model is roughly 56 million. The audio features are mean- and variance-normalized log-Mel filterbank features computed for every 10ms of the audio. These features are further appended with delta-spectral and double-delta-spectral features (Mason & Zhang, 1991; Picone, 1993), and every two consecutive frames are stacked, resulting in 240-dimensional (240-d) input vectors with a receptive field of 20ms. Speed and tempo perturbations are applied to each input with factors of 0.9 and 1.1 to augment the training data. We also use SpecAugment (Park et al., 2019) for additional augmentation.
The RNN-T models were trained for 20 epochs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a maximum learning rate of 5e-4 and the OneCycleLR policy (Smith & Topin, 2019), consisting of a linear warmup phase from 5e-5 to 5e-4 followed by a linear annealing phase to 0. A batch size of 64 was used to train the PyTorch models on V100 GPUs.
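The warmup-then-anneal schedule can be written as a small piecewise-linear function (a sketch; the warmup fraction is our illustrative assumption, not a value stated in the paper):

```python
def one_cycle_lr(step, total_steps, lr_min=5e-5, lr_max=5e-4, warmup_frac=0.3):
    """OneCycleLR-style schedule: linear warmup from lr_min to lr_max over
    the first warmup_frac of training, then linear annealing down to 0."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    return lr_max * (1 - (step - warmup) / (total_steps - warmup))

assert one_cycle_lr(0, 100) == 5e-5                  # start of warmup
assert abs(one_cycle_lr(30, 100) - 5e-4) < 1e-12     # peak at end of warmup
assert one_cycle_lr(100, 100) == 0.0                 # annealed to zero
```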

A APPENDIX A.1 EVALUATION OF TOLSTOI USING A CONFORMER MODEL

In this section, we compare the effectiveness of TOLSTOI using a Conformer (Gulati et al., 2020) encoder in the RNN-T. While FixedGram-based TOLSTOI works well with the bidirectional-LSTM encoder, as shown in Section 4.2, the purpose of this experiment is to evaluate its effectiveness with a Conformer encoder, which distributes acoustic information across neighboring frames. The Conformer-Transducer model uses an encoder network of 10 Conformer blocks (512-dimensional feed-forward module, convolution block with kernel size 31, and 8 attention heads of dimension 64). We train the model on the SWB 300H dataset for this experiment. All other model-architecture choices and hyperparameters remain the same. We empirically observe that FixedGram-based TOLSTOI works well with the Conformer-based RNN-T model: the reduction in WER on the target domains is significant, and catastrophic forgetting is minimal compared to shallow fusion.

A.2 EVALUATION OF TOLSTOI ON RNN-T MODEL WITH BPE UNITS

In this section, we compare the effectiveness of TOLSTOI on an RNN-T model trained with subword-based BPE units (Sennrich et al., 2015). While FixedGram-based TOLSTOI works well on the bidirectional-LSTM encoder with character units, the purpose of this experiment is to evaluate the effectiveness of the model with subword units. To this end, we train an SWB 300H model using subword units with the same model configuration as the character model, except that the output vocabulary now comprises 1000 BPE units as opposed to 45 characters. The subword encoder is learned on the text of the training set. We show results for TOLSTOI using the FixedGram approach. We empirically observe that FixedGram-based TOLSTOI works well even with subword units: the reduction in WER on the target domains is significant, and catastrophic forgetting is minimal compared to shallow fusion. Additionally, we observe that catastrophic forgetting under shallow fusion increases for the BPE-based model relative to the character-based model.

