ADAPT-AND-ADJUST: OVERCOMING THE LONG-TAIL PROBLEM OF MULTILINGUAL SPEECH RECOGNITION

Abstract

One crucial challenge of real-world multilingual speech recognition is the long-tailed distribution problem: resource-rich languages like English have abundant training data, while a long tail of low-resource languages have varying amounts of limited training data. To overcome the long-tail problem, in this paper we propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition. The A2 framework addresses the long-tail problem via three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and (3) overcoming the class imbalance, either by imposing class priors in the loss during training or by adjusting the logits of the softmax output during inference. Extensive experiments on the CommonVoice corpus show that A2 significantly outperforms conventional approaches.

1. INTRODUCTION

Deploying a single Automatic Speech Recognition (ASR) model to recognize multiple languages is highly desirable but very challenging in real-world multilingual ASR scenarios due to the well-known long-tailed distribution challenge: some resource-rich languages like English have abundant training data, while the majority of low-resource languages have varying amounts of training data. The recently popular end-to-end (E2E) monolingual ASR architectures (Graves et al., 2013; Chan et al., 2015; Vaswani et al., 2017) can achieve state-of-the-art performance for resource-rich languages but suffer dramatically on the long tail of low-resource languages due to the lack of training data. This paper investigates an end-to-end multilingual ASR framework in which a single model is trained end-to-end on a pooled dataset of all target languages to improve the overall performance of multilingual ASR, especially for low-resource languages.

The long-tailed data distribution makes building an end-to-end multilingual ASR system notoriously challenging. This imbalanced setting poses a multitude of open challenges for multi-task training because the distribution of the training data is very skewed. These challenges stem from two aspects. First, very limited audio samples are available for low-resource languages, such as Kyrgyz, Swedish, and Turkish, while vast amounts of data exist for high-resource languages, such as English, French, and Spanish. Second, grapheme or subword labels follow a long-tailed distribution in ASR since some labels appear significantly more frequently than others, even in a monolingual setting. Furthermore, a multilingual system may include languages with writing scripts other than the Latin alphabet, such as Chinese or Cyrillic, which further worsens the skewness.
To further illustrate the long-tail distribution in our study, Figure 1 shows the frequencies of sentence piece tokens in the curated multilingual dataset from CommonVoice (Ardila et al., 2020). While a standard end-to-end multilingual training approach can improve overall performance compared with monolingual end-to-end approaches, it does not address the long-tail problem explicitly. One key challenge is the class imbalance issue, which biases the multilingual model towards the dominant languages. One straightforward remedy is to resample the training data (Kannan et al., 2019; Pratap et al., 2020) during batch assembly. However, such an ad-hoc approach does not fully resolve the underlying long-tail distribution problem, and only a marginal improvement is obtained in practice. Another challenge is how to robustly model the languages with limited training data. In this paper, the "long-tail problem" is therefore twofold: 1) the long-tailed class distribution arising from the skewed multilingual data and sentence piece distribution; and 2) the robust modelling of languages with limited training data, i.e., tail languages.

To this end, we propose the Adapt-and-Adjust (A2) framework for multilingual speech recognition using a speech transformer to address the twofold long-tail problem. Firstly, for better language modeling, a distilled mBERT (Devlin et al., 2019) is converted into an autoregressive transformer decoder that jointly explores the multilingual acoustic and text space to improve the performance of low-resource languages. Secondly, to adapt the multilingual network to specific languages with minimal additional parameters, both language-specific and language-agnostic adapters are used to augment each encoder and decoder layer.
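The resampling baseline mentioned above is commonly realized as temperature-based sampling over languages during batch assembly; a minimal sketch follows. The temperature value and corpus sizes are illustrative assumptions, not figures from this paper:

```python
import numpy as np

def temperature_sampling_probs(num_utterances, temperature=5.0):
    """Per-language sampling probabilities for batch assembly.

    With temperature T > 1 the empirical data distribution is flattened,
    upweighting low-resource languages (T -> infinity gives uniform sampling;
    T = 1 recovers the original skewed distribution).
    """
    counts = np.asarray(num_utterances, dtype=np.float64)
    probs = counts / counts.sum()
    flattened = probs ** (1.0 / temperature)
    return flattened / flattened.sum()

# Hypothetical corpus: one head language and two tail languages.
probs = temperature_sampling_probs([500_000, 20_000, 5_000], temperature=5.0)
```

Even with such resampling, the label-level (sentence piece) imbalance within each language remains, which is why the paper argues resampling alone yields only marginal gains.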
While the language-specific adapters focus on adapting the shared network weights to a particular language, a common adapter is proposed to learn shared, language-agnostic knowledge for better knowledge transfer across languages. Lastly, to increase the relative margin between the logits of rare versus dominant languages, we perform class imbalance adjustments during multilingual model training or inference by revisiting the classic idea of logit adjustment (Zhou & Liu, 2006). Class imbalance adjustment (Collell et al., 2016; Cui et al., 2019; Menon et al., 2020) is applied by adjusting the logits of the softmax input with the class priors. We conduct experiments and establish a benchmark on the CommonVoice corpus with a realistic long-tailed distribution over languages. Extensive experiments show that A2 significantly outperforms conventional approaches to end-to-end multilingual ASR. Our key contributions are as follows:

• We propose Adapt-and-Adjust (A2), a novel end-to-end transformer-based framework for real-world multilingual speech recognition that overcomes the "long-tail problem";
• We demonstrate the effectiveness of utilizing a pretrained multilingual language model as a speech decoder to improve multilingual text representations, and of language adapters to better share the learned information across all languages. To the best of our knowledge, this work is the first to adapt a pretrained multilingual language model for multilingual ASR;
• We show that incorporating class priors during training or inference is effective and essential to addressing the long-tail distribution issue in multilingual training;
• We establish a reproducible multilingual speech recognition benchmark with long-tailed distributions of 11 languages from different language families for the research community.
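The logit adjustment idea above can be sketched concretely. This is a simplified NumPy illustration in the spirit of Menon et al. (2020), not the paper's exact implementation; the scaling factor tau and the prior values are assumptions:

```python
import numpy as np

def train_time_adjusted_logits(logits, priors, tau=1.0):
    """Training-time adjustment: add tau * log(prior) to each class logit
    before the softmax cross-entropy. Rare classes receive a large negative
    offset, so the model must learn a larger margin for them to be predicted."""
    return logits + tau * np.log(priors)

def inference_time_adjusted_logits(logits, priors, tau=1.0):
    """Post-hoc adjustment: subtract tau * log(prior) from the logits of a
    conventionally trained model, compensating for the prior bias towards
    dominant classes."""
    return logits - tau * np.log(priors)

# Hypothetical 3-class example with one dominant and one rare class.
priors = np.array([0.70, 0.25, 0.05])   # estimated from training data
logits = np.array([2.0, 1.0, 1.0])      # raw softmax inputs from the model
```

Only one of the two adjustments is applied at a time, matching the paper's choice of imposing class priors either in the training loss or at inference.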

2. ADAPT-AND-ADJUST FRAMEWORK

2.1 OVERVIEW

Figure 2 gives an overview of the proposed A2 framework for end-to-end multilingual ASR. A2 is built on a transformer-based sequence-to-sequence model with three key novel contributions: (1) an mBERT-based decoder, (2) language adapters, and (3) class-imbalance adjustments. Firstly, the vanilla transformer decoder is replaced with mBERT for better language modeling, particularly for low-resource languages. Secondly, common and language-specific adapters are added to each encoder and decoder layer to learn both the shared and language-specific information for better acoustic modelling. Finally, we perform class imbalance adjustments during training or inference, where the logits are adjusted with the class priors estimated from the training data.
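As a concrete illustration of the second component, the dual-adapter computation for one layer can be sketched as below. The bottleneck architecture, the parallel residual combination, and all dimensions are our assumptions for illustration, not necessarily the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_adapter(d_model, bottleneck):
    """A lightweight bottleneck adapter: down-projection -> ReLU -> up-projection.
    The parameter count is ~2 * d_model * bottleneck, small relative to a layer."""
    return {
        "down": rng.standard_normal((d_model, bottleneck)) * 0.02,
        "up": rng.standard_normal((bottleneck, d_model)) * 0.02,
    }

def apply_adapter(x, adapter):
    return np.maximum(x @ adapter["down"], 0.0) @ adapter["up"]

def dual_adapter_layer(x, common, lang_adapters, lang_id):
    """Residual dual adaptation of one encoder/decoder layer output: the common
    adapter captures language-agnostic knowledge shared across languages, while
    the language-specific adapter specializes the shared weights to one language."""
    return x + apply_adapter(x, common) + apply_adapter(x, lang_adapters[lang_id])
```

Because the adapters sit inside a residual connection, they start near the identity and add only minimal parameters per language, which matches the framework's goal of adapting the shared network cheaply.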



Figure 1: The long-tail distribution of sentence piece tokens in the curated multilingual dataset. Tokens with high frequency form the head classes; the remaining tokens are the tail classes.

