ADAPT-AND-ADJUST: OVERCOMING THE LONG-TAIL PROBLEM OF MULTILINGUAL SPEECH RECOGNITION

Abstract

One crucial challenge of real-world multilingual speech recognition is the long-tailed distribution problem, where a few resource-rich languages like English have abundant training data, while a long tail of low-resource languages have varying but limited amounts of training data. To overcome this problem, we propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition. The A2 framework addresses the long-tail problem with three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and (3) class-imbalance adjustments, which either impose class priors in the loss during training or adjust the logits of the softmax output during inference. Extensive experiments on the CommonVoice corpus show that A2 significantly outperforms conventional approaches.

1. INTRODUCTION

Deploying a single Automatic Speech Recognition (ASR) model to recognize multiple languages is highly desirable but very challenging in real-world multilingual ASR scenarios due to the well-known long-tailed distribution problem: some resource-rich languages like English have abundant training data, while the majority of languages are low-resource with varying amounts of limited training data. The recent popular end-to-end (E2E) monolingual ASR architectures (Graves et al., 2013; Chan et al., 2015; Vaswani et al., 2017) achieve state-of-the-art performance for resource-rich languages but suffer dramatically on the long tail of low-resource languages due to the lack of training data. This paper investigates an end-to-end multilingual ASR framework in which a single model is trained end-to-end on a pooled dataset of all target languages to improve the overall performance of multilingual ASR, especially for low-resource languages.

The long-tailed data distribution makes building an end-to-end multilingual ASR system notoriously challenging: the skewed training data poses a multitude of open challenges for multi-task training. These challenges stem from two aspects. First, very limited audio samples are available for low-resource languages, such as Kyrgyz, Swedish, and Turkish, while vast amounts of data exist for high-resource languages, such as English, French, and Spanish. Second, grapheme or subword labels themselves follow a long-tailed distribution in ASR, since some labels appear significantly more frequently than others, even in a monolingual setting. Furthermore, a multilingual system may include languages with writing scripts other than the Latin alphabet, such as Chinese or Cyrillic, which further worsens the skewness.
To further illustrate the long-tail distribution in our study, Figure 1 shows the frequencies of sentence piece tokens in the curated multilingual dataset from CommonVoice (Ardila et al., 2020). While a standard end-to-end multilingual training approach can improve overall performance compared with monolingual end-to-end approaches, it does not address the long-tail problem explicitly. One key challenge is the class imbalance issue, which biases the multilingual model towards the dominant languages. A straightforward remedy is to resample the training data during batch assembly (Kannan et al., 2019; Pratap et al., 2020). However, such an ad-hoc approach does not fully resolve the underlying long-tail distribution problem, and only a marginal improvement is obtained in practice. Another challenge is how to robustly model the languages with limited training data. In this paper, the "long-tail problem" is therefore twofold: 1) the long-tailed class distribution arising from the skewed multilingual data and sentence piece distribution; and 2) the robust modelling of languages with limited training data, i.e., tail languages. To this end, we propose the Adapt-and-Adjust (A2) framework for multilingual speech recognition using a speech transformer to address this twofold long-tail problem. Firstly, for better language modeling, a distilled mBERT (Devlin et al., 2019) is converted into an autoregressive transformer decoder that jointly explores the multilingual acoustic and text spaces to improve the performance of low-resource languages. Secondly, to adapt the multilingual network to specific languages with minimal additional parameters, both language-specific and language-agnostic adapters are used to augment each encoder and decoder layer.
While the language-specific adapters focus on adapting the shared network weights to a particular language, a common adapter is proposed to learn shared, language-agnostic knowledge for better knowledge transfer across languages. Lastly, to increase the relative margin between the logits of rare versus dominant languages, we perform class imbalance adjustments during multilingual model training or inference, revisiting the classic idea of logit adjustment (Zhou & Liu, 2006). Class imbalance adjustment (Collell et al., 2016; Cui et al., 2019; Menon et al., 2020) is applied by adjusting the logits of the softmax input with the class priors. We conduct experiments and establish a benchmark on the CommonVoice corpus with a realistic long-tailed distribution of different languages. The extensive experiments show that A2 significantly outperforms conventional approaches for end-to-end multilingual ASR. Our key contributions are as follows:

• We propose Adapt-and-Adjust (A2), a novel end-to-end transformer-based framework for real-world multilingual speech recognition that overcomes the "long-tail problem".

• We demonstrate the effectiveness of utilizing a pretrained multilingual language model as a speech decoder to improve multilingual text representations, and of language adapters to better share the learned information across all languages. To the best of our knowledge, this work is the first to adapt a pretrained multilingual language model for multilingual ASR.

• We show that incorporating class priors during training or inference is effective and essential for addressing the long-tail distribution issue in multilingual training.

• We establish a reproducible multilingual speech recognition benchmark with long-tailed distributions of 11 languages from different language families for the research community.

2. ADAPT-AND-ADJUST FRAMEWORK

2.1. OVERVIEW

Figure 2 gives an overview of the proposed A2 framework for end-to-end multilingual ASR. A2 is built on a transformer-based sequence-to-sequence model with three key novel components: (1) an mBERT-based decoder, (2) language adapters, and (3) class-imbalance adjustments. Firstly, the vanilla transformer decoder is replaced with mBERT for better language modeling, particularly for low-resource languages. Secondly, common and language-specific adapters are added to each encoder and decoder layer to learn both shared and language-specific information for better acoustic modelling. Finally, we perform class imbalance adjustments during training or inference, where the logits are adjusted with the class priors estimated from the training data.

2.2. BASE MODEL: HYBRID CTC-ATTENTION SPEECH TRANSFORMER

A sequence-to-sequence speech transformer (Dong et al., 2018; Kim et al., 2016; Karita et al., 2019b) based on the hybrid CTC-attention network is used for acoustic modeling. It takes acoustic features x ∈ R^(T×F) as input and outputs sentence piece tokens y, where T and F denote the sequence length and feature dimension. The encoder consists of several 2D convolution layers followed by self-attention layers; the convolution layers extract more robust features before they are sent to the transformer. Each decoder layer has two attention mechanisms, one for self-attention and the other attending over the encoder output. The network is trained in an autoregressive manner by predicting the next token given the current output. In addition, a CTC layer (Graves et al., 2006) is added on top of the encoder output to serve as a regularizer for the attention model.

Training. A multi-task loss L_MTL (Watanabe et al., 2018; Karita et al., 2019b), combining the CTC loss (Graves et al., 2006) and the attention loss L_ATTN, is used to train the speech transformer. It is computed as an interpolation of the two losses with a hyper-parameter λ (0 ≤ λ ≤ 1):

L_ATTN = KL(p_attn || p_y),    L_MTL = -λ log p_ctc(y | h_enc) + (1 - λ) L_ATTN,

where p_y is the label distribution after label smoothing (Müller et al., 2019), which prevents the model from making over-confident predictions. The Kullback-Leibler divergence (KL) (Kullback & Leibler, 1951) is used for the attention loss.

Decoding. Beam search is used to predict the sentence pieces without any additional language model. The decoding score is computed as a weighted sum of the CTC and attention network probabilities, with the decoding parameter β balancing the two (Karita et al., 2019a):

ŷ = argmax_{y ∈ Y*} { β p_ctc(y | h_enc) + (1 - β) p_attn(y | h_enc, y') },

where y' is the sequence decoded so far.
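As a minimal sketch of how the two training objectives are interpolated and how the two scores are combined during beam search (plain Python; the loss and probability values below are placeholders rather than outputs of a real model):

```python
def multitask_loss(ctc_nll, attn_kl, lam=0.3):
    """L_MTL = lam * L_CTC + (1 - lam) * L_ATTN, where L_CTC is the
    negative log-likelihood -log p_ctc(y | h_enc) from the CTC head
    and L_ATTN is KL(p_attn || p_y) after label smoothing."""
    return lam * ctc_nll + (1.0 - lam) * attn_kl


def joint_decoding_score(p_ctc, p_attn, beta=0.5):
    """Weighted sum of CTC and attention scores used during beam search."""
    return beta * p_ctc + (1.0 - beta) * p_attn
```

With λ = 0.3 and β = 0.5 (the values used in Section 3.1), a hypothesis's decoding score is the simple average of its CTC and attention probabilities.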

2.3. MULTILINGUAL BERT AS TRANSFORMER DECODER

For better language modeling, especially for low-resource languages, mBERT is used as the transformer decoder. Since mBERT is pre-trained on text data only, its text space may diverge significantly from the acoustic space of the encoder output; it is therefore essential to augment each mBERT layer with a cross-attention layer over the encoder output. The cross-attention and self-attention layers are trained to "align" the acoustic and text spaces for speech recognition.

Autoregressive mBERT. Figure 3 depicts the adaptation of mBERT into an autoregressive transformer decoder. We copy the embeddings and self-attention parameters of mBERT into the decoder layers. Let t denote the current decoding step. The autoregressive decoder takes the current input token y_t to predict the next token y_(t+1). The mBERT embedding layer converts the input token to a vector representation. Subsequently, the cross-attention layer takes the encoder output h_enc as the key and value and the self-attention output as the query, and computes the attention output.

Figure 3: Parameter transfer from a pre-trained multilingual language model to a speech recognition decoder. The dotted line shows the transfer direction from a specific module of the model.
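The cross-attention added to each mBERT layer can be sketched as a scaled dot-product attention in which the encoder output serves as both key and value (a single-head NumPy sketch; the learned projection matrices of a real attention layer are omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, h_enc):
    """Single-head scaled dot-product cross-attention: the decoder
    self-attention output acts as the query, while the encoder output
    h_enc serves as both key and value."""
    d = query.shape[-1]
    scores = query @ h_enc.T / np.sqrt(d)  # (T_dec, T_enc) alignment scores
    return softmax(scores) @ h_enc         # weighted sum of encoder frames
```

A zero query attends uniformly over all encoder frames, so its output is the mean encoder frame; trained queries instead concentrate the attention weights on the acoustically relevant frames.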

Vocabulary Mapping

The vocabulary size of the vanilla mBERT (119,547 tokens) is too large for training an end-to-end speech recognition system. Therefore, vocabulary mapping is performed to reduce the number of targets for the speech transformer. In this work, sentence pieces (SP) (Kudo, 2018) are used as the target tokens. The SP models are trained on the transcriptions with a preset vocabulary size. We use a shared set of 5,237 tokens as the multilingual system's vocabulary. For the monolingual systems, the sentence piece vocabulary size is 150, except for Chinese with 2,265 tokens. The generated sentence piece tokens are then matched against the mBERT token set. During training, the embeddings of all tokens found in the mBERT vocabulary are initialized with the pretrained mBERT embeddings.
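The mapping step can be sketched as follows (a NumPy sketch; the function name and the random initialization scale are illustrative assumptions, not details from the paper):

```python
import numpy as np

def map_vocab_embeddings(sp_tokens, mbert_vocab, mbert_emb, dim=768, seed=0):
    """Initialize the ASR decoder embedding table from mBERT.

    sp_tokens: list of sentence piece tokens used as ASR targets.
    mbert_vocab: dict mapping an mBERT token to its row in mbert_emb.
    mbert_emb: (|V_mbert|, dim) pretrained embedding matrix.
    Tokens found in the mBERT vocabulary copy the pretrained row;
    the remaining tokens keep a small random initialization.
    """
    rng = np.random.default_rng(seed)
    table = rng.normal(0.0, 0.02, size=(len(sp_tokens), dim))
    for i, tok in enumerate(sp_tokens):
        if tok in mbert_vocab:
            table[i] = mbert_emb[mbert_vocab[tok]]
    return table
```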

2.4. LANGUAGE ADAPTERS

Similar to Kannan et al. (2019), lightweight residual language adapters are used for better acoustic modelling with minimal language-specific parameters, increasing the model's robustness for languages with limited resources. As shown in Figure 4, in addition to the language-specific adapter for capturing language-intrinsic knowledge, a common adapter is also trained to learn language-agnostic information in the multilingual data; we call these Dual-Adapters. The language-specific and common adapters are denoted A_lang and A_com, respectively. Each adapter at layer l consists of a down-projection layer W^l_d, followed by a ReLU activation, and an up-projection layer W^l_u. The adapters take h^l, the self-attention output of layer l, as input. For both the language-specific and common adapters, the output is computed as:

Adapter(h^l) = W^l_u ReLU(W^l_d LayerNorm(h^l)) + h^l.

The final adapter output is computed as o^l = o^l_lang + o^l_com, and o^l is then used as the input to the next encoder or decoder layer. We create a language mask to select the language-specific adapter.
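A minimal NumPy sketch of the dual-adapter forward pass (bias terms are omitted and the bottleneck dimension is arbitrary; only the structure follows the equation above):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu = h.mean(-1, keepdims=True)
    return (h - mu) / (h.std(-1, keepdims=True) + eps)

def adapter(h, W_d, W_u):
    """Residual bottleneck adapter: W_u ReLU(W_d LN(h)) + h."""
    z = np.maximum(layer_norm(h) @ W_d, 0.0)  # down-projection + ReLU
    return z @ W_u + h                        # up-projection + residual

def dual_adapter(h, lang_params, common_params):
    """Dual-adapter output o = o_lang + o_com (Section 2.4)."""
    o_lang = adapter(h, *lang_params)
    o_com = adapter(h, *common_params)
    return o_lang + o_com
```

Because each branch carries its own residual connection, a freshly initialized (near-zero) adapter pair leaves the sum close to 2h, i.e., the adapters start as an almost neutral transformation of the self-attention output.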

2.5. SENTENCE PIECE CLASS IMBALANCE ADJUSTMENTS

The sentence piece class imbalance problem is addressed by incorporating the class priors during training or inference via logit adjustments. As derived from a Bayesian point of view in Menon et al. (2020) for computer vision tasks, a softmax classifier with adjusted logits as input minimizes the balanced error across all classes. A natural adjustment is to scale the logits f_y(x) by the inverse of the corresponding class prior π_y. In the log domain, the adjustment is:

f^adj_y(x) = f_y(x) - τ · log π_y,

where τ > 0 is a hyper-parameter. The adjustment can be viewed as applying a class-dependent offset that re-weights each logit according to its class prior.
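The adjustment is a one-line operation (NumPy sketch; the function name is illustrative):

```python
import numpy as np

def adjust_logits(logits, priors, tau=0.3):
    """f_adj_y = f_y - tau * log(pi_y): classes with small priors are
    penalized less, enlarging the relative margin of rare classes."""
    return logits - tau * np.log(priors)
```

For two classes with equal raw logits but priors 0.9 and 0.1, the rare class ends up with the larger adjusted logit, which is exactly the margin-widening effect described above.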

Class priors

The class priors are the natural frequencies of the sentence piece tokens, estimated from the multilingual training data. To form a valid prior distribution, smoothing is applied to the raw counts (Equation 6) to handle zero-occurrence tokens:

π_y = c_i / C - 1 / ((N - n_0) · C),   if c_i > 0,
π_y = 1 / (n_0 · C),                   otherwise,

where C is the total count over all labels, n_0 is the number of labels with zero occurrences, N is the number of classes, and c_i is the raw count of class i.
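The smoothing moves a total mass of 1/C from the observed classes to the unseen ones, so the result is a valid distribution. A NumPy sketch of the computation:

```python
import numpy as np

def smoothed_priors(counts):
    """Smoothed class priors for sentence piece tokens (Section 2.5).

    counts: raw token counts per class. Zero-count classes receive a
    small uniform mass 1/(n0*C); the same total mass is subtracted
    evenly from the observed classes so that the priors sum to one.
    """
    counts = np.asarray(counts, dtype=float)
    C = counts.sum()               # total count over all labels
    N = counts.size                # number of classes
    n0 = int((counts == 0).sum())  # classes with zero occurrences
    pi = np.empty(N)
    seen = counts > 0
    pi[seen] = counts[seen] / C - 1.0 / ((N - n0) * C)
    pi[~seen] = 1.0 / (n0 * C) if n0 > 0 else 0.0
    return pi
```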

Training phase class imbalance adjustments

To incorporate the priors during training, the logits f^dec_(y_t) of the last decoder layer are adjusted before the softmax:

f^dec_(y_t) = w^T_y · Decoder(h_enc, Embedding(y_(t-1))),
f^adj_(y_t) = f^dec_(y_t) - τ · log π_(y_t),
p^adj_(y_t) = exp(f^adj_(y_t)) / Σ_(y'_t ∈ [N]) exp(f^adj_(y'_t)).

The adjusted softmax output p^adj_y of the sequence is used to compute the KL loss and perform backpropagation to update the model. Note that y_(t-1) is the previous ground-truth label, available only during training. To reduce the training/inference discrepancy, scheduled sampling (Bengio et al., 2015) is commonly used for sequential classification tasks like speech recognition. During later training iterations, instead of the ground-truth label, y_(t-1) is chosen from the maximum prediction of the current model to simulate inference:

y_(t-1) = argmax_y p^adj_(y_(t-1)).

If scheduled sampling is used, the adjusted logits at step t influence all of the following tokens in the current sequence; this is a crucial difference from the image classification task in Menon et al. (2020). If τ is set to 1, the training phase logit adjustment becomes similar to the widely used label smoothing technique (Müller et al., 2019). However, in conventional label smoothing, the prior π_y is usually a uniform distribution independent of the data. Logit adjustment instead applies a class-specific "smoothing" based on the class prior, and has been shown to be superior to the baseline with standard label smoothing.

Inference phase class imbalance adjustments. Alternatively, the class priors can be incorporated during inference via logit adjustments. The decoding score is computed as:

ŷ = argmax_(y ∈ Y*) { β p_ctc(y | h_enc) + (1 - β) p^adj_y }.

During beam search, the attention decoding scores p^adj_y are computed from the adjusted logits in the same way as in scheduled sampling.
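The training phase adjustment and the scheduled-sampling choice of the next decoder input can be sketched as follows (NumPy sketch; function names are illustrative, and a real implementation would operate on batched decoder outputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def adjusted_posterior(logits, priors, tau=0.3):
    """p_adj = softmax(f - tau * log(pi)), used to compute the KL loss."""
    return softmax(logits - tau * np.log(priors))

def next_input_token(p_adj, gold_token, scheduled_sampling):
    """Scheduled sampling: in later iterations, feed the model's own
    argmax prediction instead of the ground-truth previous label."""
    return int(np.argmax(p_adj)) if scheduled_sampling else gold_token
```

Note that with uniform priors the adjustment shifts all logits by the same constant and the posterior is unchanged, which is why the effect only appears under a skewed class distribution.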

3.1. EXPERIMENTAL SETUP

Dataset. The CommonVoice dataset (Ardila et al., 2020) is a multilingual corpus collected by Mozilla. Similar to Conneau et al. (2020a), we use 11 languages: English (en), Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt), and Chinese (zh). The dataset is split into training, dev, and eval sets according to the ESPnet recipe. The transcriptions are tokenized using the SentencePiece model with the unigram algorithm. We train our SentencePiece model on the speech transcriptions; the vocabulary sizes are shown in Table 10. We then add special tokens, such as <unk>, <sos>, <eos>, and a blank token for the CTC objective. The detailed data split is shown in Table 6 of Appendix C.

Network configurations

We use six transformer encoder layers with a hidden size of 2048 units and eight attention heads, each layer with an attention dimension of 256. For the decoder, distil-mBERT (Sanh et al., 2019) is used. The mBERT decoder consists of six transformer decoder layers with a hidden size of 3072, an attention dimension of 768, and four attention heads. We train our models with a batch size of 32 and accumulate the gradient over two steps to simulate a larger batch size on a single NVIDIA V100 16GB GPU. The models are trained with the Adam optimizer with 25,000 warm-up steps. In particular, for balanced sampling, we take six samples per language and construct a balanced batch by accumulating the gradient 11 times.

Training and Evaluation. We evaluate our models using beam search with a beam width of 10, λ = 0.3, and β = 0.5. The hyper-parameter τ is set to 0.3 for both the training and inference phase class imbalance adjustments. The multilingual models are trained for 150K iterations, and we average the last ten checkpoints to obtain the decoding model. For the monolingual setting, we stop after 100 epochs of training. Models are evaluated using the character error rate (CER) to simplify evaluation with a universal metric for all languages.

Baselines. We consider the following baselines: Monolingual: we train one model per language; SMT (Standard Multilingual Training): we randomly sample each batch from the data distribution; BS (Balanced Sampling): we sample the same number of utterances for each language in a batch so that languages contribute roughly equally to training; LAN-Specific Adapters: the language-specific adapters proposed by Kannan et al. (2019); and LID: (language ID) conditioning with one-hot language vectors, as proposed by Li et al. (2018).

3.2. RESULTS

In Table 1, we present the test results on the CommonVoice dataset. Compared to the monolingual models, even the SMT models improve the performance of the low-resource languages significantly; in other words, SMT is a decent multilingual baseline. We conjecture that this is because multilingual models can capture common sub-phonetic articulatory features (Kirchhoff et al., 2002; Metze, 2005; Livescu et al., 2007; Wang & Sim, 2014) that are shared across languages and benefit low-resource language recognition.

Balanced Sampling

We observe the same trend as Kannan et al. (2019): compared to SMT, the tail language performance is significantly boosted. However, the performance of the head languages suffers due to fewer occurrences during training. The model clearly over-fits the tail languages due to up-sampling; for example, the training-set CERs of "ky" and "sv" are significantly lower than the evaluation CERs (3.4% and 4.2% in training vs. 13.4% and 22.8% in evaluation). Consequently, the overall performance is the same as SMT. In fact, even after balanced sampling, the sentence piece tokens still follow a long-tailed distribution, as shown in Appendix C.

Language Adapters. We next compare the language adaptation techniques: the LAN-Specific Adapters (Kannan et al., 2019), the one-hot language vector (Li et al., 2018), and the Dual-Adapters. Note that all adapters are built on BS + mBERT, which performs better than the BS-only model. Adding the language-specific adapters without common adapters significantly outperforms the BS baseline, with a 0.9% absolute performance gain. Another way of injecting language information is to augment a one-hot language vector. Interestingly, applying the sentence piece class imbalance adjustment (LID + Adjust-Train) to the language vector system significantly improves the CER.

Sentence Piece Class Imbalance Adjustment. Both the training and inference phase adjustments provide a significant performance gain over the LAN-Specific Adapters, with a 1% absolute CER reduction. The gains come mostly from the improved performance of the head languages, although the tail languages also benefit from the logit adjustments. More importantly, the gap between the monolingual and multilingual performance for the head languages is greatly reduced, leading to a better "balanced error". This strongly justifies the importance of class imbalance adjustments.
Compared to BS, A2 also avoids over-fitting the tail languages: the training-set CERs on "ky" and "sv" are 8.2% and 23.6%, much closer to the evaluation CERs. Compared to SMT with random sampling, A2 achieves a significantly better average CER at a modest cost for the two head languages "fr" and "en". Example transcriptions and a detailed analysis of the different A2 modules are given in Appendix E.

3.3.1. MULTILINGUAL BERT

The effectiveness of mBERT is presented in Table 2. The performance of mBERT depends heavily on the quality of the acoustic model. Without adapters or logit adjustments, the improvement over BS is marginal, and mBERT even hurts SMT. This may indicate that, with better acoustic models as in A2, the text space of the vanilla mBERT is better aligned with the acoustic space, which leads to improved performance across all languages, especially the low-resource ones. It is also interesting to note that, even without adapters, "SMT + mBERT + Adjust-Train" yields the same overall CER as the best adapter system (BS + mBERT + Dual-Adapters). To study the impact of the pretrained language model, the more advanced XLM-R (Conneau et al., 2020b) is used in place of distil-mBERT. Although XLM-R has better multilingual language generation capability than mBERT, this does not translate into a final performance gain for the multilingual ASR task. We believe this is because the increased model complexity of XLM-R makes it more difficult to align the text and acoustic spaces. Given that XLM-R has more parameters than distil-mBERT and the gain is not significant (it only slightly improves "zh" and "fr"), we do not recommend it.

3.3.2. LANGUAGE ADAPTERS

The results and parameter sizes of different adapters are given in Table 3. Generally speaking, adapters in the decoder layers are not as effective as those in the encoder layers, indicating that adapting the acoustic space is much more effective than adapting the text space. Therefore, considering the extra computation and parameters, it is advisable to apply only the encoder adapters. We investigate the effectiveness of the common language adapters in Table 4. The Dual-Adapters significantly outperform the language-specific adapters, with a 0.5% absolute CER reduction, indicating that knowledge transfer through the common adapter is effective.

In addition to individual language adapters, we also divide the languages into groups to allow sharing of adapters within a group. By written script, the 11 languages can be divided into groups such as Latin, Chinese characters, and Cyrillic. They can also be grouped by language family, e.g., Romance, Chinese, Turkic, and Germanic; this grouping focuses more on similarities in lexica, grammar, and pronunciation, which are usually subsumed under end-to-end multilingual architectures. The two grouping schemes do not coincide: languages in the same cluster under one scheme do not necessarily share a cluster under the other. For example, Tatar and Turkish are both Turkic languages, yet Tatar uses the Cyrillic script while Turkish uses the Latin alphabet. All languages in the same group share the same dual-adapters, and the adapters are trained with all group members. In general, grouping by language family is better than grouping by written script, because it is more consistent with the encoder adapters that adapt the acoustic space, which are more effective than decoder adapters per Table 3.
Compared to individual language adapters, sharing language adapters by language family helps the low-resource languages, e.g., "sv" in the Germanic group and "ky" and "tr" in the Turkic group, because more data are used to train the group adapters. However, this comes at a cost to the resource-rich languages compared to "Individual Dual-Adapters". Therefore, individual language adapters are advised, considering that the adapters' parameter sizes are much smaller than those of the encoder and decoder attention weights.

3.3.3. SENTENCE PIECE CLASS-IMBALANCE ADJUSTMENTS: TRAINING VS. INFERENCE

The two logit adjustments are compared in Table 5. For the SMT systems, training phase adjustment shows a clear advantage over inference phase adjustment. Under a convexity assumption, the two adjustment approaches share the same solution; however, deep neural network optimization is non-convex, so they may converge to different local minima. Under SMT, the model is heavily biased towards the head classes due to random sampling. Training phase class imbalance adjustment helps the training place more focus on the tail classes, leading to a lower and better-balanced error. With better acoustic models, e.g., with language adapters, the inference phase adjustment can better calibrate the raw classification scores and yields performance similar to the training phase adjustment. Lastly, we show the effect of τ on inference phase logit adjustment in Appendix D.

4. RELATED WORK

Long-Tail. Conventional approaches to the long-tail problem focus on data sampling methods (Kubat & Matwin, 1997; Chawla et al., 2002; Wallace et al., 2011). Recently, the long-tail distribution issue has regained interest for neural network models (Menon et al., 2020), and several approaches have been proposed, such as weight normalization (Kang et al., 2019), adaptive margin (Cao et al., 2019), and equalized loss (Tan et al., 2020).

Adapters. Adapters were first proposed to learn domain-specific representations in computer vision in a parameter-efficient way (Rebuffi et al., 2017). They were subsequently adopted for NLP tasks to avoid fine-tuning a new model for each task, by training an adapter module per task while sharing the pre-trained language model parameters (Houlsby et al., 2019; Lin et al., 2020). Invertible adapters were proposed in Pfeiffer et al. (2020a;b) to effectively adapt an existing pretrained multilingual model to unseen languages for multi-task cross-lingual transfer.

Multilingual ASR. E2E architectures like LAS (Toshniwal et al., 2018) and the Recurrent Neural Transducer (Kannan et al., 2019) have been used to build multilingual ASR systems for a group of Indian languages. In Kannan et al. (2019), language adapters are used to tackle the data imbalance problem, although the improvement from language adapters is marginal compared to language vector augmentation. Acoustic vector quantization is also used in the recent work by Conneau et al. (2020a) on multilingual ASR. A massive multilingual ASR study with more than 50 languages and more than 16,000 hours of speech is presented in Pratap et al. (2020); its two main techniques, data resampling and language clusters, bear some similarity to our balanced sampling and language adapters. Unfortunately, their datasets are not publicly available.

5. CONCLUSION

In this paper, we introduce Adapt-and-Adjust (A2), an end-to-end transformer-based framework to address the crucial challenge of long-tail data distribution issues in multilingual speech recognition. A2 consists of three major components, namely, language adapters, class imbalance adjustments, and a pretrained multilingual language model. Extensive experiments on the CommonVoice corpus show that A2 significantly outperforms conventional approaches for end-to-end multilingual speech recognition due to 1) the better acoustic modeling with adapters and class imbalance adjustments; and 2) the better language modeling with pretrained multilingual BERT.

A A2 ALGORITHM

The full algorithm of the A2 framework with training phase sentence piece class imbalance adjustments is outlined in Algorithm 1. Each training iteration proceeds as follows:

1. Generate the language adapter mask m using the language tag in x.
2. Compute the encoder hidden states h_enc from x and m by encoder forwarding.
3. Compute the logits f^dec_(y_t) from h_enc and m by decoder forwarding.
4. Adjust the logits according to Equation 9.
5. Compute the CTC posteriors p_ctc(y | h_enc).
6. Compute the attention loss L_ATTN (Equation 1).
7. Compute the multi-task loss L_MTL from p_ctc(y | h_enc), L_ATTN, and λ (Equation 2).
8. Update the model: θ ← θ - α ∇_θ L_MTL.

These steps repeat until convergence.

B ASR MODEL STRUCTURE

The encoder and decoder layers of the transformer-based speech recognition model used in this study are presented here.

Encoder Layer. Figure 5 shows the structure of the encoder layer. An adapter layer is added after the layer norm and self-attention module. We also apply two residual connections, one after the self-attention layer and one after the adapter layer:

o = SelfAttn(LayerNorm(h^(l-1)_enc)) + h^(l-1)_enc,
h^l_enc = FeedForward(Adapter(LayerNorm(o)) + o),

where h^(l-1)_enc is the encoder hidden state of the previous layer l-1 and h^l_enc is the output of the current encoder layer.

Decoder Layer. Figure 6 shows the structure of the decoder layer with the adapter. We place the adapter layer after the cross-attention module:

o_1 = SelfAttn(LayerNorm(h^(l-1)_dec)) + h^(l-1)_dec,   (14)
o_2 = CrossAttn(h_enc, LayerNorm(o_1)) + o_1,   (15)
h^l_dec = FeedForward(Adapter(LayerNorm(o_2)) + o_2),

where h^(l-1)_dec is the decoder hidden state of the previous layer and h^l_dec is the output of the current layer.
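The encoder-layer wiring can be sketched with the sub-modules passed in as callables (a NumPy sketch; real self-attention, adapter, and feed-forward modules would replace the placeholders used in the example):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu = h.mean(-1, keepdims=True)
    return (h - mu) / (h.std(-1, keepdims=True) + eps)

def encoder_layer(h_prev, self_attn, adapter, feed_forward):
    """Encoder layer with the adapter placement from Appendix B:
    o = SelfAttn(LN(h)) + h; h' = FFN(Adapter(LN(o)) + o)."""
    o = self_attn(layer_norm(h_prev)) + h_prev       # first residual connection
    return feed_forward(adapter(layer_norm(o)) + o)  # second residual connection
```

As a sanity check, with zeroed self-attention and adapter outputs and an identity feed-forward, the two residual connections reduce the layer to a pass-through of its input.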

C DATASET DISTRIBUTION

Table 6 shows the data split used for training, validation, and testing. Table 7 shows the label frequencies in the original data and our estimates under balanced sampling. Only a few labels occur more than 50k times (around 1%). Under balanced sampling, the class distribution shifts towards the 10k-50k range; thus, more labels sit in the middle of the distribution.

E TRANSCRIPTION EXAMPLES

We provide sample transcriptions for two languages generated by different models: a resource-limited language, zh (traditional Chinese script, Mandarin as spoken in Taiwan), and a resource-rich language, en (English). The reference in Table 8 roughly translates to English as "Obstacles and challenges are the sources of power that motivate us." The monolingual output is rather poor: not only does it miss the important character 們 that distinguishes "my" from "our", but the pronunciation of the decoded sentence (Fu4 Neng2 Yi2 Piao4 Zhan4 Shi4 Wo3 De0 Yuan2 Dong4 Li4) is also quite different from the reference. Standard multilingual training (SMT) improves on monolingual training by predicting the character 與 (Yu3) in place of 一 (Yi2). In addition, its output 調站 (Tiao2 Zhan4) sounds almost the same as the reference 挑戰 (Tiao3 Zhan4), whereas the monolingual output 票佔 (Piao4 Zhan4) neither makes sense in meaning nor matches the reference pronunciation. This shows that multilingual training helps improve acoustic modelling over monolingual training when the monolingual training data are limited. The mBERT model further helps the language modeling by correcting 可能與調站 (literally "possibility and changing stations") to 困難與挑戰 ("obstacles and challenges"). However, mBERT also replaces 機發 (Ji1 Fa1) with 資料 (Zi1 Liao4), since 資料 ("documents/materials") is a more common word than 機發, a wrong word whose pronunciation matches the reference. Lastly, the adapters successfully convert 資料 (Zi1 Liao4) back to 機發 (Ji1 Fa1), so the output sounds the same as the reference with one wrong character, demonstrating the adapters' capability for better acoustic modelling. Finally, the remaining errors, including the wrong characters in 機發 (Ji1 Fa1) and 員動力 (Yuan2 Dong4 Li4), could easily be corrected by a decent external language model during decoding, e.g., an RNN-LM trained on the training transcriptions.
Table 9 shows the transcriptions produced by the different models for an English test utterance. Unlike Table 8, where only limited training data are available, mBERT provides no improvement over SMT here, except that it makes the sentence more grammatical by adding the subject "I". The dual adapters improve over mBERT by successfully decoding the sentence piece "ENT" in "SPENT", which is closer to "TWENTY" in terms of characters. Lastly, the class-imbalance adjustment system yields the best CER. 
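The CER used to compare these outputs is the character-level Levenshtein (edit) distance normalized by the reference length. A minimal pure-Python sketch of the standard computation (not the evaluation script used in the paper):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance between the two
    character sequences, normalized by the reference length."""
    r, h = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j].
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / len(r)

# "SPENT" vs. the reference word "TWENTY": the shared suffix "ENT"
# keeps the edit distance at 3, giving a CER of 3/6 = 0.5.
assert cer("TWENTY", "SPENT") == 0.5
```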

H LANGUAGE GROUP ADAPTERS

The language groups used in the experiment are shown in Table 11. We also present detailed CER results in Table 12 with sentence piece class-imbalance adjustments. The inference-phase logit adjustment is slightly better than the training-phase adjustment.
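One common formulation of the two adjustment variants compared above can be sketched as follows: during training, the log class priors are added to the logits before the loss, while at inference the priors are subtracted from the logits before the softmax. The priors, temperature tau, and toy values below are illustrative assumptions, not the paper's exact recipe.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def adjust_logits_training(logits, priors, tau=1.0):
    """Training-phase adjustment: add tau * log(prior) to the logits
    before the cross-entropy loss, so head classes are penalized."""
    return [z + tau * math.log(p) for z, p in zip(logits, priors)]

def adjust_logits_inference(logits, priors, tau=1.0):
    """Inference-phase adjustment: subtract tau * log(prior) from the
    logits, removing the prior's bias toward head classes."""
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]

# A tail class with the same raw logit as a head class gains
# probability once the prior is compensated for at inference.
logits = [2.0, 2.0]           # [head class, tail class]
priors = [0.9, 0.1]
probs = softmax(adjust_logits_inference(logits, priors))
# probs[1] > probs[0]: the tail class wins after the adjustment.
```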



Figure 1: The long-tail distribution of sentence piece tokens in the curated multilingual dataset. Head classes are tokens with high frequency; the remaining tokens are classified as tail classes.

Figure 2: Overview of the Adapt-and-Adjust framework. The layer norm is omitted to save space. p_ctc is the connectionist temporal classification (CTC) output, p_attn is the decoder output, and y_{t-1} is the previous token. The three key modules are (1) pre-trained mBERT-based decoding to improve language modelling; (2) dual adapters to improve acoustic modelling, both particularly for tail languages; and (3) class-imbalance adjustment of the logits f_dec(y_{t-1}, h_enc) by the class priors π_y.

Figure 4: Dual adapters. The orange box shows the active language-specific adapter and the blue box shows the common adapter. The figure on the right shows the structure of a language adapter.

Figure 5: Encoder structure.

Figures 7 and 8 visualize the long-tail distribution in more detail. For better visualization, we omit classes with fewer than 20k occurrences.


Figure 10: Validation losses.

Figure 11: Closer look at validation losses.

Test results in terms of CER (%) on the CommonVoice dataset.

Ablation study on mBERT for multilingual speech recognition.

Ablation study on language adapters' impacts on BS + mBERT models

Ablation study of language adapters.

Training and inference phase logit adjustments with different A2 models.

Dataset split information

Class frequency in the original data and after applying balanced sampling

Transcription of a testing traditional Chinese utterance with different models

Transcription of a testing English utterance with different models. Reference: FOUND A LITTLE CROWD OF ABOUT TWENTY PEOPLE SURROUNDING THE HUGE HOLE

Vocabulary coverage

I FEW-SHOT SETTING

We also investigate the effectiveness of the A2 framework in a few-shot setting, taking five hours of speech data per language as the training data. The test results are shown in Table 13. Using mBERT as the speech decoder consistently improves performance in the low-resource monolingual scenario. Interestingly, for the Chinese language "zh", the model seems unable to learn useful information without the mBERT decoder, and adding it yields an absolute CER reduction of 17.2%. For multilingual training, marginal performance gains are observed with the mBERT decoder, indicating that the multilingual language model is complementary to the shared acoustic knowledge learned from the multilingual speech data. The language adapters lead to consistent and significant performance gains across all languages. Note that the sentence piece class-imbalance adjustments do not yield as significant a performance boost as in the long-tail multilingual setting in the main text. We believe this is largely because the class imbalance is less severe when the same amount of training data is used for each language. 

J HYPER-PARAMETER SETTING

The hyper-parameters used for training are shown in Table 14. 

