UNSUPERVISED CROSS-LINGUAL REPRESENTATION LEARNING FOR SPEECH RECOGNITION

Anonymous authors
Paper under double-blind review

Abstract

This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative to a comparable system. Our approach enables a single multilingual speech recognition model that is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.

1. INTRODUCTION

Cross-lingual learning aims to build models that leverage data from other languages to improve performance. This has been a long-standing interest in the speech community (Byrne et al., 2000; Le & Besacier, 2009; Ghoshal et al., 2013; Huang et al., 2013; Gales et al., 2017; Cho et al., 2018; Seki et al., 2018), which includes systems able to transcribe multiple languages (Burget et al., 2010; Bourlard et al., 2011; Heigold et al., 2013; Toshniwal et al., 2018; Kannan et al., 2019). However, the vast majority of work in speech processing has focused on supervised cross-lingual training, which requires labeled data in multiple languages. Transcribed speech is often much scarcer than unlabeled speech and requires non-trivial human annotation. Unsupervised representation learning, or pretraining, does not require labeled data and has received a lot of recent attention in computer vision (Tian et al., 2019; He et al., 2019; Chen et al., 2020) after much success in natural language processing (Peters et al., 2018; Devlin et al., 2018). For the latter, cross-lingual pretraining has been shown to be very effective, particularly for low-resource languages (Lample & Conneau, 2019; Conneau et al., 2019). In speech processing, most work in this area has focused on monolingual unsupervised representation learning (van den Oord et al., 2018; Chung & Glass, 2018; Schneider et al., 2019; Chung et al., 2019; Baevski et al., 2020b; Harwath et al., 2020; Jiang et al., 2019; Tjandra et al., 2019; Eloff et al., 2019; Baevski et al., 2020a). In this paper, we focus on the cross-lingual setting by learning representations on unlabeled data that generalize across languages. We build on the pretraining approach of Baevski et al. (2020c), which jointly learns contextualized speech representations as well as a discrete vocabulary of latent speech representations.
The latter serves to effectively train the model with a contrastive loss (§2), and the discrete speech representations are shared across languages (Figure 1). In contrast to recent work on unsupervised cross-lingual pretraining, we fine-tune the Transformer part of the model instead of freezing all pretrained representations (Rivière et al., 2020) or feeding them to a separate downstream model (Kawakami et al., 2020). We extend the work of Rivière et al. (2020) by pretraining on multiple languages instead of just English, and we experiment on top of a stronger baseline. We evaluate XLSR on 14 languages of the BABEL benchmark (Gales et al., 2014), which is conversational telephone data, and ten languages of CommonVoice (Ardila et al., 2019), a corpus of read speech (§3). Multilingual pretraining outperforms monolingual pretraining in most cases, except for resource-rich languages, and we show that increased model capacity significantly closes the gap. We
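To make the contrastive objective concrete, the following is a minimal numpy sketch of an InfoNCE-style loss in the spirit of wav2vec 2.0: a context vector at a masked position must identify the true quantized latent among a set of distractors. All names, dimensions, and the temperature value are illustrative; the actual model operates on batched Transformer outputs and learns the quantized latents with a Gumbel-softmax codebook, which is omitted here.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """InfoNCE-style loss: the context representation at a masked position
    should score the true quantized latent higher than the distractors."""
    candidates = [true_latent] + list(distractors)
    logits = np.array([cosine_sim(context, q) / temperature for q in candidates])
    # log-softmax over candidates; the loss is the negative log-probability
    # assigned to the true latent (index 0)
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
context = rng.normal(size=16)
distractors = [rng.normal(size=16) for _ in range(5)]

# When the context aligns with the true latent, the loss is low;
# when the distractors align with it instead, the loss is high.
loss_aligned = contrastive_loss(context, context, distractors)
loss_confused = contrastive_loss(context, rng.normal(size=16), [context] * 5)
```

Minimizing this loss pushes the context network to predict the quantized latent at each masked position, which is what lets the shared codebook act as a discrete vocabulary common to all pretraining languages.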

