XLA: A ROBUST UNSUPERVISED DATA AUGMENTATION FRAMEWORK FOR CROSS-LINGUAL NLP

Abstract

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose XLA, a novel data augmentation framework for self-supervised learning in zero-resource transfer learning scenarios. In particular, XLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training labels for the target language task. At its core, XLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on zero-resource cross-lingual transfer tasks for Named Entity Recognition (NER), Natural Language Inference (NLI), and paraphrase identification on Paraphrase Adversaries from Word Scrambling (PAWS). XLA achieves SoTA results on all of these tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to XLA's success.

1. INTRODUCTION

Self-supervised learning in the form of pretrained language models (LMs) has been the driving force in developing state-of-the-art natural language processing (NLP) systems in recent years. These methods typically follow two basic steps: large-scale LM pretraining followed by supervised task-specific fine-tuning (Devlin et al., 2019; Radford et al., 2019). However, getting annotated data for every target task in every target language is difficult, especially for low-resource languages. Recently, the pretrain-finetune paradigm has also been extended to multi-lingual setups to train effective multi-lingual models that can be used for zero-shot cross-lingual transfer. Jointly trained deep contextualized multi-lingual LMs like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), coupled with supervised fine-tuning in the source language, have been quite successful in transferring linguistic and task knowledge from one language to another without using any task labels in the target language. The joint pretraining with multiple languages allows these models to generalize across languages.

Despite the effectiveness of these models, recent studies (Pires et al., 2019; K et al., 2020) have also highlighted one crucial limiting factor for successful cross-lingual transfer. These studies agree that the cross-lingual generalization ability of the model is limited by the (lack of) structural similarity between the source and target languages. For example, when transferring mBERT from English, K et al. (2020) report an accuracy drop of about 23.6% in Hindi (structurally dissimilar) compared to a 9% drop in Spanish (structurally similar) in cross-lingual natural language inference (XNLI). The difficulty of transfer is further exacerbated if the (dissimilar) target language is low-resource, as the joint pretraining step may not have seen many instances from this language in the first place. In our experiments (§4.2), in cross-lingual NER (XNER), we report F1 reductions of 28.3% in Urdu and 30.4% in Burmese for XLM-R, which is trained on a much larger multi-lingual dataset than mBERT.

One attractive way to improve cross-lingual generalization is to perform data augmentation (Simard et al., 1998) and train the model (e.g., XLM-R) on examples that are similar to, but different from, the labeled data in the source language. Formalized by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001), such data augmentation methods have shown impressive results recently in computer vision (Zhang et al., 2018; Berthelot et al., 2019; Li et al., 2020a). These methods enlarge the support of the training distribution by generating new data points from a vicinity distribution around each training example. For images, the vicinity of a training image can be defined by a set

