ADAPT-AND-ADJUST: OVERCOMING THE LONG-TAIL PROBLEM OF MULTILINGUAL SPEECH RECOGNITION

Abstract

One crucial challenge of real-world multilingual speech recognition is the long-tailed distribution problem, where some resource-rich languages like English have abundant training data, but a long tail of low-resource languages have only limited, varying amounts of training data. To overcome the long-tail problem, in this paper we propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition. The A2 framework addresses the long-tail problem via three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) proposing dual adapters, consisting of both language-specific and language-agnostic adaptation, with minimal additional parameters; and (3) overcoming the class imbalance by either imposing class priors in the loss during training or adjusting the logits of the softmax output during inference. Extensive experiments on the CommonVoice corpus show that A2 significantly outperforms conventional approaches.
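The prior-based logit adjustment in technique (3) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the class priors, the scaling factor `tau`, and the toy logits below are illustrative assumptions, showing only the general idea of subtracting scaled log-priors at inference so that frequent (head) classes no longer dominate the argmax.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def adjust_logits(logits, class_priors, tau=1.0):
    """Subtract scaled log-priors from the raw logits at inference time,
    counteracting the model's bias toward frequent classes."""
    return logits - tau * np.log(class_priors)

# Toy example: class 0 is 90x more frequent than class 2 in training.
priors = np.array([0.90, 0.09, 0.01])
logits = np.array([2.0, 1.9, 1.8])   # raw scores slightly favor the head class

print(softmax(logits).argmax())                          # 0: head class wins
print(softmax(adjust_logits(logits, priors)).argmax())   # 2: tail class recovered
```

The same log-prior term can instead be added inside the training loss (a prior-aware cross-entropy), which corresponds to the "class priors in the loss" variant mentioned above.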

1. INTRODUCTION

Deploying a single Automatic Speech Recognition (ASR) model to recognize multiple languages is highly desirable but very challenging in real-world multilingual ASR scenarios due to the well-known long-tailed distribution problem: some resource-rich languages like English have abundant training data, while the majority of languages are low-resource with varying, limited amounts of training data. Recent popular end-to-end (E2E) monolingual ASR architectures (Graves et al., 2013; Chan et al., 2015; Vaswani et al., 2017) achieve state-of-the-art performance for resource-rich languages but degrade dramatically on the long tail of low-resource languages due to the lack of training data. This paper investigates an end-to-end multilingual ASR framework in which a single model is trained end-to-end on a pooled dataset of all target languages, with the goal of improving the overall performance of multilingual ASR, especially for low-resource languages.

The long-tailed data distribution makes building an end-to-end multilingual ASR system notoriously challenging: the heavily skewed training data poses a multitude of open problems for multi-task training. These challenges stem from two aspects. First, very limited audio is available for low-resource languages such as Kyrgyz, Swedish, and Turkish, while vast amounts of data exist for high-resource languages such as English, French, and Spanish. Second, grapheme or subword labels themselves follow a long-tailed distribution in ASR, since some labels appear far more frequently than others, even in a monolingual setting. Furthermore, a multilingual system may include languages with writing scripts other than the Latin alphabet, such as Chinese or Cyrillic, which further worsens the skewness.
To further illustrate the long-tailed distribution in our study, Figure 1 shows the frequencies of sentence-piece tokens in the curated multilingual dataset from CommonVoice (Ardila et al., 2020). While standard end-to-end multilingual training can improve overall performance compared with monolingual end-to-end approaches, it does not address the long-tail problem explicitly. One key challenge is the class imbalance issue, which biases the multilingual model towards the dominant languages. A straightforward remedy is to resample the training data during batch assembly (Kannan et al., 2019; Pratap et al., 2020). However, such an ad-hoc approach does not fully resolve the underlying long-tail distribution problem, and only a marginal improvement is obtained in practice. Another challenge is how to model the languages
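One common form of the resampling mentioned above is temperature-based sampling over languages, where language l is drawn with probability proportional to (n_l / N)^(1/T). The sketch below is illustrative, not the cited papers' exact recipes; the per-language utterance counts are hypothetical.

```python
import numpy as np

def sampling_probs(utterance_counts, temperature=5.0):
    """Language sampling distribution p_l proportional to (n_l / N) ** (1/T).
    T = 1 recovers the natural (skewed) data distribution; larger T flattens
    it toward uniform, up-sampling low-resource languages in each batch."""
    counts = np.asarray(utterance_counts, dtype=float)
    p = counts / counts.sum()          # natural distribution
    p = p ** (1.0 / temperature)       # temperature flattening
    return p / p.sum()                 # renormalize

# Hypothetical corpus sizes (number of utterances per language).
counts = {"en": 500_000, "fr": 200_000, "ky": 5_000}
probs = sampling_probs(list(counts.values()), temperature=5.0)
# With T = 5, Kyrgyz's sampling share rises from under 1% of batches
# to roughly 18%, at the cost of repeating its utterances more often.
```

As the comment notes, flattening trades head-language coverage for tail-language exposure, which is why such resampling alone yields only marginal gains: the tail data is repeated, not enriched.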

