XLA: A ROBUST UNSUPERVISED DATA AUGMENTATION FRAMEWORK FOR CROSS-LINGUAL NLP

Abstract

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose XLA, a novel data augmentation framework for self-supervised learning in zero-resource transfer learning scenarios. In particular, XLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training labels in the target language. At its core, XLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on zero-resource cross-lingual transfer tasks for Named Entity Recognition (NER), Natural Language Inference (NLI), and paraphrase identification on Paraphrase Adversaries from Word Scrambling (PAWS). XLA achieves SoTA results on all the tasks, outperforming the baselines by a good margin. Through an in-depth framework dissection, we demonstrate the cumulative contributions of different components to XLA's success.

1. INTRODUCTION

Self-supervised learning in the form of pretrained language models (LMs) has been the driving force in developing state-of-the-art natural language processing (NLP) systems in recent years. These methods typically follow two basic steps, where supervised task-specific fine-tuning follows large-scale LM pretraining (Devlin et al., 2019; Radford et al., 2019). However, obtaining annotated data for every target task in every target language is difficult, especially for low-resource languages. Recently, the pretrain-finetune paradigm has also been extended to multilingual setups to train effective multilingual models that can be used for zero-shot cross-lingual transfer. Jointly trained deep contextualized multilingual LMs like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), coupled with supervised fine-tuning in the source language, have been quite successful in transferring linguistic and task knowledge from one language to another without using any task labels in the target language. Joint pretraining on multiple languages allows these models to generalize across languages. Despite their effectiveness, recent studies (Pires et al., 2019; K et al., 2020) have highlighted one crucial limiting factor for successful cross-lingual transfer: the cross-lingual generalization ability of the model is limited by the (lack of) structural similarity between the source and target languages. For example, when transferring mBERT from English, K et al. (2020) report an accuracy drop of about 23.6% in Hindi (structurally dissimilar) compared to a 9% drop in Spanish (structurally similar) on cross-lingual natural language inference (XNLI). Transfer becomes even more difficult if the (dissimilar) target language is low-resourced, as the joint pretraining step may not have seen many instances from this language in the first place.
In our experiments (§4.2) on cross-lingual NER (XNER), we report F1 reductions of 28.3% in Urdu and 30.4% in Burmese for XLM-R, which is trained on a much larger multilingual dataset than mBERT. One attractive way to improve cross-lingual generalization is to perform data augmentation (Simard et al., 1998), training the model (e.g., XLM-R) on examples that are similar to but different from the labeled data in the source language. Formalized by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001), such data augmentation methods have recently shown impressive results in computer vision (Zhang et al., 2018; Berthelot et al., 2019; Li et al., 2020a). These methods enlarge the support of the training distribution by generating new data points from a vicinity distribution around each training example. For images, the vicinity of a training image can be defined by a set of operations like rotation and scaling, or by linear mixtures of features and labels (Zhang et al., 2018). When it comes to text, however, such methods have rarely been successful. The main reason is that, unlike images, linguistic units (e.g., words, phrases) are discrete, and a smooth change in their embeddings may not yield a plausible linguistic unit with a similar meaning. In NLP, the most successful data augmentation method has so far been back-translation (Sennrich et al., 2016), which generates paraphrases of an input sentence through round-trip translations. However, it requires parallel data to train effective machine translation systems, and acquiring such data can be more expensive for low-resource languages than annotating target language data with task labels. Furthermore, back-translation is only applicable in a supervised setup and to tasks where it is possible to align the original labeled entities with the back-translated ones, such as question answering (Yu et al., 2018; Dong et al., 2017).
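For intuition, the "linear mixtures of features and labels" augmentation mentioned above (mixup; Zhang et al., 2018) can be sketched as follows. This is a generic illustration of the vision-style technique, not part of XLA; the function name and toy inputs are ours:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Create a virtual example as a convex combination of two training
    examples and their one-hot labels (Zhang et al., 2018)."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient in (0, 1)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix

# Two toy feature vectors with one-hot labels.
x_a, y_a = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x_v, y_v = mixup(x_a, y_a, x_b, y_b)
```

Because words are discrete, such an interpolation of text embeddings generally does not correspond to a plausible sentence, which is exactly the limitation discussed above.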
In this work, we propose XLA, a robust unsupervised cross-lingual augmentation framework for improving the cross-lingual generalization of multilingual LMs. XLA augments data from the unlabeled training examples in the target language as well as from virtual input samples (sentences) generated from the vicinity distribution of the source and target language sentences. With the augmented data, it performs simultaneous self-learning with an effective distillation strategy to learn a strongly adapted cross-lingual model from noisy (pseudo) labels for the target language task. We propose novel ways to generate virtual sentences using a multilingual masked LM (Conneau et al., 2020), and to obtain reliable task labels by simultaneous multilingual co-training. This co-training employs a two-stage co-distillation process to ensure robust transfer to dissimilar and/or low-resource languages. We validate the effectiveness and robustness of XLA through extensive experiments on three different zero-resource cross-lingual transfer tasks, XNER, XNLI, and PAWS-X, which pose different sets of challenges. We experiment with many different language pairs (14 in total), comprising languages that are similar, dissimilar, or low-resourced. XLA yields impressive results on XNER, setting a new SoTA in all tested languages and outperforming the baselines by a good margin. In particular, the relative gains for XLA are higher for structurally dissimilar and/or low-resource languages, where the base model is weaker: 28.54%, 16.05%, and 9.25% absolute improvements for Urdu, Burmese, and Arabic, respectively. On XNLI, with only 5% of the labeled data in the source, XLA obtains results comparable to the baseline that uses all the labeled data, and it surpasses the standard baseline by 2.55% on average when using all the labeled data in the source. We have similar findings on PAWS-X. We provide a comprehensive analysis of the factors that contribute to XLA's performance.
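The "self-learning from noisy pseudo labels with unsupervised sample selection" idea can be illustrated with a generic confidence-thresholded pseudo-labeling step. This is a minimal sketch of the general technique, not XLA's actual two-stage co-distillation; the function name and threshold are illustrative assumptions:

```python
import numpy as np

def select_pseudo_labeled(probs, threshold=0.9):
    """Generic confidence-based sample selection for self-training:
    keep unlabeled examples whose maximum predicted class probability
    exceeds a threshold, and take the argmax as the pseudo label."""
    conf = probs.max(axis=1)        # model confidence per example
    keep = conf >= threshold        # unsupervised sample selection
    pseudo = probs.argmax(axis=1)   # noisy pseudo labels
    return np.where(keep)[0], pseudo[keep]

# Toy softmax outputs for four unlabeled target-language sentences.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.10, 0.90],
                  [0.80, 0.20]])
idx, labels = select_pseudo_labeled(probs)
```

The selected examples would then be added to the training pool for the next round of fine-tuning, which is the basic loop that distillation strategies refine.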

2. BACKGROUND

Contextual representation and cross-lingual transfer. In recent years, significant progress has been made in learning contextual word representations with pretrained models. Notably, BERT (Devlin et al., 2019) pretrains a Transformer (Vaswani et al., 2017) encoder with a masked language model (MLM) objective, and uses the same model architecture to adapt to new tasks. It also comes with a multilingual version, mBERT, which is trained jointly on 102 languages. RoBERTa (Liu et al., 2019) extends BERT with improved training, while XLM (Lample and Conneau, 2019) extends mBERT with causal LM and translation LM (using parallel data) objectives. Conneau et al. (2020) train the largest multilingual language model, XLM-R, within the RoBERTa framework. Despite the absence of any explicit cross-lingual supervision, mBERT and its variants have been shown to learn cross-lingual representations that generalize well across languages. Wu and Dredze (2019) and Pires et al. (2019) evaluate the zero-shot cross-lingual transferability of mBERT on several tasks and attribute its generalization capability to shared subword units. Pires et al. (2019) also find structural similarity (e.g., word order) to be another important factor for successful cross-lingual transfer. K et al. (2020), however, show that shared subwords make a minimal contribution; instead, structural similarity between languages is more crucial for effective transfer (more in Appendix D).

Vicinal risk minimization (VRM). Data augmentation supported by the VRM principle (Chapelle et al., 2001) can be an effective choice for achieving better out-of-distribution adaptation. In VRM, we minimize the empirical vicinal risk defined as $\mathcal{L}_v(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(f_\theta(\tilde{x}_n), \tilde{y}_n)$, where $f_\theta$ denotes the model parameterized by $\theta$, and $\mathcal{D}_v = \{(\tilde{x}_n, \tilde{y}_n)\}_{n=1}^{N}$ is an augmented dataset constructed by sampling the vicinal distribution $\vartheta(\tilde{x}_i, \tilde{y}_i \mid x_i, y_i)$ around each original training sample $(x_i, y_i)$.
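The empirical vicinal risk above can be sketched in a few lines of code. The Gaussian-noise vicinity used here is our own toy choice for illustration; any label-preserving perturbation could be substituted:

```python
import numpy as np

def vicinal_risk(model, loss, data, vicinity, samples_per_point=5,
                 rng=np.random.default_rng(0)):
    """Empirical vicinal risk L_v(theta): average loss over virtual
    examples (x_tilde, y_tilde) drawn from a vicinity distribution
    around each original training point (x, y)."""
    losses = []
    for x, y in data:
        for _ in range(samples_per_point):
            x_tilde, y_tilde = vicinity(x, y, rng)
            losses.append(loss(model(x_tilde), y_tilde))
    return float(np.mean(losses))

# Toy label-preserving vicinity: small Gaussian jitter on the features.
gaussian_vicinity = lambda x, y, rng: (x + 0.01 * rng.normal(size=x.shape), y)

# Toy usage: identity "model", squared-error loss, one training point.
model = lambda x: x
sq_loss = lambda pred, y: float(np.mean((pred - y) ** 2))
risk = vicinal_risk(model, sq_loss, [(np.zeros(3), np.zeros(3))],
                    gaussian_vicinity)
```

Minimizing this quantity over model parameters recovers standard empirical risk minimization when the vicinity collapses to the original sample itself.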
Defining the vicinity is, however, quite challenging, as it requires generating samples from a distribution without corrupting their labels. Earlier methods apply simple rules like rotation and scaling of images (Simard

