XLA: A ROBUST UNSUPERVISED DATA AUGMENTATION FRAMEWORK FOR CROSS-LINGUAL NLP

Abstract

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is scarce, especially for low-resource languages. We propose XLA, a novel data augmentation framework for self-supervised learning in zero-resource transfer learning scenarios. In particular, XLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training labels in the target language task. At its core, XLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on zero-resource cross-lingual transfer tasks for Named Entity Recognition (NER), Natural Language Inference (NLI), and paraphrase identification on Paraphrase Adversaries from Word Scrambling (PAWS). XLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to XLA's success.

1. INTRODUCTION

Self-supervised learning in the form of pretrained language models (LM) has been the driving force in developing state-of-the-art natural language processing (NLP) systems in recent years. These methods typically follow two basic steps, where a supervised task-specific fine-tuning follows a large-scale LM pretraining (Devlin et al., 2019; Radford et al., 2019) . However, getting annotated data for every target task in every target language is difficult, especially for low-resource languages. Recently, the pretrain-finetune paradigm has also been extended to multi-lingual setups to train effective multi-lingual models that can be used for zero-shot cross-lingual transfer. Jointly trained deep contextualized multi-lingual LMs like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) coupled with supervised fine-tuning in the source language have been quite successful in transferring linguistic and task knowledge from one language to another without using any task label in the target language. The joint pretraining with multiple languages allows these models to generalize across languages. Despite their effectiveness, recent studies (Pires et al., 2019; K et al., 2020) have also highlighted one crucial limiting factor for successful cross-lingual transfer. They all agree that the cross-lingual generalization ability of the model is limited by the (lack of) structural similarity between the source and target languages. For example, for transferring mBERT from English, K et al. (2020) report about 23.6% accuracy drop in Hindi (structurally dissimilar) compared to 9% drop in Spanish (structurally similar) in cross-lingual natural language inference (XNLI). The difficulty level of transfer is further exacerbated if the (dissimilar) target language is low-resourced, as the joint pretraining step may not have seen many instances from this language in the first place. 
In our experiments ( §4.2), in cross-lingual NER (XNER), we report F1 reductions of 28.3% in Urdu and 30.4% in Burmese for XLM-R, which is trained on a much larger multi-lingual dataset than mBERT. One attractive way to improve cross-lingual generalization is to perform data augmentation (Simard et al., 1998) , and train the model (e.g., XLM-R) on examples that are similar but different from the labeled data in the source language. Formalized by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001) , such data augmentation methods have shown impressive results recently in computer vision (Zhang et al., 2018; Berthelot et al., 2019; Li et al., 2020a) . These methods enlarge the support of the training distribution by generating new data points from a vicinity distribution around each training example. For images, the vicinity of a training image can be defined by a set of operations like rotation and scaling, or by linear mixtures of features and labels (Zhang et al., 2018) . However, when it comes to text, such methods have rarely been successful. The main reason is that unlike images, linguistic units (e.g., words, phrases) are discrete and a smooth change in their embeddings may not result in a plausible linguistic unit that has similar meanings. In NLP, the most successful data augmentation method has so far been back-translation (Sennrich et al., 2016) which generates paraphrases of an input sentence through round-trip translations. However, it requires parallel data to train effective machine translation systems, acquiring which can be more expensive for low-resource languages than annotating the target language data with task labels. Furthermore, back-translation is only applicable in a supervised setup and to tasks where it is possible to find the alignments between the original labeled entities and the back-translated entities, such as in question answering (Yu et al., 2018; Dong et al., 2017) . 
In this work, we propose XLA, a robust unsupervised cross-lingual augmentation framework for improving cross-lingual generalization of multilingual LMs. XLA augments data from the unlabeled training examples in the target language as well as from the virtual input samples (sentences) generated from the vicinity distribution of the source and target language sentences. With the augmented data, it performs simultaneous self-learning with an effective distillation strategy to learn a strongly adapted cross-lingual model from noisy (pseudo) labels for the target language task. We propose novel ways to generate virtual sentences using a multilingual masked LM (Conneau et al., 2020) , and get reliable task labels by simultaneous multilingual co-training. This co-training employs a two-stage co-distillation process to ensure robust transfer to dissimilar and/or low-resource languages. We validate the effectiveness and robustness of XLA by performing extensive experiments on three different zero-resource cross-lingual transfer tasks -XNER, XNLI, and PAWS-X, which posit different sets of challenges. We have experimented with many different language pairs (14 in total) comprising languages that are similar/dissimilar/low-resourced. XLA yields impressive results on XNER, setting SoTA in all tested languages outperforming the baselines by a good margin. In particular, the relative gains for XLA are higher for structurally dissimilar and/or low-resource languages, where the base model is weaker: 28.54%, 16.05%, and 9.25% absolute improvements for Urdu, Burmese, and Arabic, respectively. For XNLI, with only 5% labeled data in the source, it gets comparable results to the baseline that uses all the labeled data, and surpasses the standard baseline by 2.55% on average when it uses all the labeled data in the source. We also have similar findings in PAWS-X. We provide a comprehensive analysis of the factors that contribute to XLA's performance.

2. BACKGROUND

Contextual representation and cross-lingual transfer: In recent years, significant progress has been made in learning contextual word representations and pretrained models. Notably, BERT (Devlin et al., 2019) pretrains a Transformer (Vaswani et al., 2017) encoder with a masked language model (MLM) objective, and uses the same model architecture to adapt to a new task. It also comes with a multilingual version, mBERT, which is trained jointly on 102 languages. RoBERTa (Liu et al., 2019) extends BERT with improved training, while XLM (Lample and Conneau, 2019) extends mBERT with conditional LM and translation LM (using parallel data) objectives. Conneau et al. (2020) train the largest multilingual language model, XLM-R, with the RoBERTa framework. Despite the absence of any explicit cross-lingual supervision, mBERT and its variants have been shown to learn cross-lingual representations that generalize well across languages. Wu and Dredze (2019) and Pires et al. (2019) evaluate the zero-shot cross-lingual transferability of mBERT on several tasks and attribute its generalization capability to shared subword units. Pires et al. (2019) also found structural similarity (e.g., word order) to be another important factor for successful cross-lingual transfer. K et al. (2020), however, show that shared subwords have a minimal contribution; instead, the structural similarity between languages is more crucial for effective transfer (more in Appendix D).

Vicinal risk minimization (VRM): Data augmentation supported by the VRM principle (Chapelle et al., 2001) can be an effective choice for achieving better out-of-distribution adaptation. In VRM, we minimize the empirical vicinal risk defined as

L_v(θ) = (1/N) ∑_{n=1}^{N} ℓ(f_θ(x̂_n), ŷ_n)

where each virtual pair (x̂_n, ŷ_n) is drawn from a vicinity distribution ϑ(x̂, ŷ | x_n, y_n) around the training example (x_n, y_n) (Simard et al., 1998). Recent work (Zhang et al., 2018; Berthelot et al., 2019) shows impressive results in image classification with simple linear interpolation of data. However, to our knowledge, none of these methods have so far been successful in NLP due to the discrete nature of texts.
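As a toy illustration of the VRM objective, the following self-contained sketch computes the empirical vicinal risk by averaging the loss over virtual samples drawn around each training point. The linear model, squared loss, and Gaussian-jitter vicinity below are illustrative stand-ins (not from the paper), akin to the image-style vicinities mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def vicinal_risk(model_loss, X, Y, vicinity_sample, K=4):
    """Empirical vicinal risk: average loss over K virtual samples
    drawn from a vicinity distribution around each training example."""
    total = 0.0
    for x, y in zip(X, Y):
        for _ in range(K):
            x_hat, y_hat = vicinity_sample(x, y)
            total += model_loss(x_hat, y_hat)
    return total / (len(X) * K)

# Toy setup: linear model, squared loss, Gaussian-jitter vicinity.
w = np.array([1.0, -2.0])
model_loss = lambda x, y: (x @ w - y) ** 2
vicinity = lambda x, y: (x + rng.normal(0, 0.1, size=x.shape), y)

X = rng.normal(size=(8, 2))
Y = X @ w  # noiseless labels: empirical risk is 0, but vicinal risk is not
print(vicinal_risk(model_loss, X, Y, vicinity))
```

Unlike the standard empirical risk, the vicinal risk stays positive here even though the model fits the training points exactly, which is precisely the enlarged-support effect the VRM principle exploits.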

LM-based supervised augmentation

Recently, a number of data augmentation methods have been proposed using contextualized LMs like BERT, e.g., Contextual Augmentation (Kobayashi, 2018), Conditional BERT (Wu et al., 2018), and AUG-BERT (Wu et al., 2018). These approaches use a constrained augmentation method that alters a pretrained LM into a label-conditional LM for a specific task, i.e., they update the parameters of the pretrained LM using the task labels.

3. XLA FRAMEWORK

While recent cross-lingual transfer learning efforts have relied almost exclusively on multi-lingual pretraining and zero-shot transfer of a fine-tuned source model, there is great potential for more elaborate methods that can leverage the unlabeled data better. Motivated by this, we present XLA, our unsupervised data augmentation framework for zero-resource cross-lingual task adaptation. Figure 1 gives an overview of XLA. Let D_s = (X_s, Y_s) and D_t = (X_t) denote the training data for a source language s and a target language t, respectively. XLA augments data from various origins at different stages of training. In the initial stage (epoch 1), it uses the augmented training samples from the target language (D_t) along with the original source (D_s). In later stages (epochs 2-3), it uses virtual (vicinal) sentences generated from the vicinity distribution of source and target examples: ϑ(x̂_n^s | x_n^s) and ϑ(x̂_n^t | x_n^t), where x_n^s ∼ X_s and x_n^t ∼ X_t. It performs self-training on the augmented data to acquire the corresponding pseudo labels. To avoid the confirmation bias of self-training, where the model accumulates its own errors, it simultaneously trains three task models to generate virtual training data through data augmentation and to filter potential label noise via multi-epoch co-teaching (Zhou and Li, 2005). In each epoch, the co-teaching process first performs co-distillation, where two peer task models are used to select "reliable" training examples to train the third model. The selected samples with pseudo labels are then added to the target task model's training data by taking the agreement of the other two models, a process we refer to as co-guessing. The co-distillation and co-guessing mechanisms ensure robustness of XLA to out-of-domain distributions that can occur in a multilingual setup, e.g., due to a structurally dissimilar and/or low-resource target language. Algorithm 1 gives the pseudocode of the overall training method.
Each of the task models in XLA is an instance of XLM-R fine-tuned on the source language task (e.g., English NER), whereas the pretrained masked LM parameterized by θ_mlm (i.e., before fine-tuning) is used to define the vicinity distribution ϑ(x̂_n | x_n, θ_mlm) around each selected example x_n.

Algorithm 1 XLA: a robust unsupervised data augmentation framework for cross-lingual NLP
Input: source (s) and target (t) language datasets: D_s = (X_s, Y_s), D_t = (X_t); task models: θ^(1), θ^(2), θ^(3); pretrained masked LM θ_mlm; mask ratio P; diversification factor δ; sampling factor α; and distillation factor η
Output: models trained on augmented data
 1: θ^(1), θ^(2), θ^(3) = WARMUP(D_s, θ^(1), θ^(2), θ^(3))   ▷ warm up with confidence penalty
 2: for e ∈ [1 : 3] do   ▷ e denotes epoch
 3:   for k ∈ {1, 2, 3} do
 4:     X'_t^(k), Y'_t^(k) = DISTIL(X_t, η_e, θ^(k))   ▷ infer and select target training data for augmentation
 5:     for j ∈ {1, 2, 3} do
 6:       if k == j then continue
 7:       /* source language data augmentation */
 8:       X̃_s = GEN-LM(X_s, θ_mlm, P, δ)   ▷ vicinal example generation
 9:       X̃_s^(k), Ỹ_s^(k) = DISTIL(X̃_s, η_e, θ^(k));  X̃_s^(j), Ỹ_s^(j) = DISTIL(X̃_s, η_e, θ^(j))
10:       D̃_s = AGREEMENT(D̃_s^(k) = (X̃_s^(k), Ỹ_s^(k)), D̃_s^(j) = (X̃_s^(j), Ỹ_s^(j)))
11:       /* target language data augmentation (no vicinity) */
12:       X'_t^(j), Y'_t^(j) = DISTIL(X_t, η_e, θ^(j))
13:       D'_t = AGREEMENT(D'_t^(k) = (X'_t^(k), Y'_t^(k)), D'_t^(j) = (X'_t^(j), Y'_t^(j)))   ▷ see line 4
14:       /* target language vicinal data augmentation */
15:       X̃_t = GEN-LM(X_t, θ_mlm, P, δ)   ▷ vicinal example generation
16:       X̃_t^(k), Ỹ_t^(k) = DISTIL(X̃_t, η_e, θ^(k));  X̃_t^(j), Ỹ_t^(j) = DISTIL(X̃_t, η_e, θ^(j))
17:       D̃_t = AGREEMENT(D̃_t^(k) = (X̃_t^(k), Ỹ_t^(k)), D̃_t^(j) = (X̃_t^(j), Ỹ_t^(j)))
18:       /* train new models on augmented data */
19:       for l ∈ {1, 2, 3} do
20:         if l ≠ j and l ≠ k then
21:           with sampling factor α, train θ^(l) on D, training progressively,
22:           where D = D_s·1(e ∈ {1, 3}) ∪ D'_t·1(e ∈ {1, 3}) ∪ D̃_s·1(e = 3) ∪ D̃_t·1(e ∈ {2, 3})
23: Return {θ^(1), θ^(2), θ^(3)}

Although the data augmentation methods proposed in Contextual Augmentation (Kobayashi, 2018), Conditional BERT (Wu et al., 2018), and AUG-BERT (Wu et al., 2018) also use a pretrained masked LM, there are some fundamental differences between our method and these approaches. Unlike these approaches, our vicinity-based LM augmentation is purely unsupervised, and we do not perform any fine-tuning of the pretrained vicinity model (θ_mlm). The vicinity model in XLA is a disjoint pretrained entity whose weights are not trained on any task objective. This disjoint characteristic gives our framework the flexibility to replace θ_mlm even with a better monolingual LM for a specific target language, which in turn makes XLA extendable to utilize stronger LMs that may come in the future.
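The core co-teaching step of Algorithm 1 can be sketched as follows; the stub models, the `pseudo_label`/`agreement` helper names, and the toy labeling rule are all hypothetical stand-ins for illustration, with the internals of DISTIL and GEN-LM elided (see §3.3):

```python
# Structural sketch of one co-teaching step from Algorithm 1 (epoch body).
class StubModel:
    def __init__(self, name, flip):
        self.name, self.flip = name, flip

    def pseudo_label(self, X):
        # Stand-in for DISTIL: pseudo-label every sample (no filtering here).
        return {x: (x + self.flip) % 3 for x in X}

def agreement(labels_k, labels_j):
    """Co-guessing: keep samples on which the two peer models agree."""
    return {x: y for x, y in labels_k.items() if labels_j.get(x) == y}

X_t = list(range(12))
models = [StubModel("m1", 0), StubModel("m2", 0), StubModel("m3", 1)]

# Peers (m1, m2) build the training set for the held-out model m3.
D_t = agreement(models[0].pseudo_label(X_t), models[1].pseudo_label(X_t))
print(len(D_t))
# train(models[2], D_t)  # training step elided
```

The key structural point is that the model being trained never labels its own data: its training set is the intersection of its two peers' pseudo labels, which is what limits confirmation bias.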
In the following, we describe the steps in Algorithm 1.

3.1. WARM-UP STEP: TRAINING TASK MODELS WITH CONFIDENCE PENALTY

We first train three instances of the XLM-R model (θ^(1), θ^(2), θ^(3)) with an additional task-specific linear layer on the source language (English) labeled data. Each model has the same architecture (XLM-R large) but is initialized with a different random seed. For token-level prediction tasks (e.g., NER), the token-level representations are fed into the classification layer, whereas for sentence-level tasks (e.g., XNLI), the [CLS] representation is used as input to the classification layer.

Training with confidence penalty: Our goal is to train the task models so that they can be used reliably for self-training on a target language that is potentially dissimilar and low-resourced. In such situations, an overly confident (overfitted) model may produce more noisy pseudo labels, and the noise will then accumulate as training progresses. Overly confident predictions may also make it difficult for our distillation methods (§3.3) to isolate good samples from noisy ones. However, maximum likelihood training with the standard cross-entropy (CE) loss may result in overfitted models that produce overly confident (low-entropy) predictions, especially when the class distribution is not balanced. We address this by adding a negative entropy term −H to the CE loss:

L(θ) = −∑_{c=1}^{C} y^c log p_θ^c(x) + ∑_{c=1}^{C} p_θ^c(x) log p_θ^c(x)    (1)

where the first term is the CE loss and the second is the negative entropy −H; x is the representation that goes to the output layer, and y^c and p_θ^c(x) are respectively the ground-truth label and the model prediction for class c. Such a regularizer of the output distribution has been shown to be an effective generalization method for training large models (Pereyra et al., 2017). In our experiments (§4), we report significant gains with the confidence penalty for cross-lingual transfer. Appendix C shows visualizations of why the confidence penalty is helpful for distillation.
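A minimal NumPy sketch of the loss in Eq. 1 follows; the `beta` weight on the penalty term is a hypothetical knob added for illustration (the paper's exact weighting may differ):

```python
import numpy as np

def confidence_penalty_loss(logits, y_onehot, beta=1.0):
    """Cross-entropy plus a negative-entropy term (cf. Eq. 1):
    L = -sum_c y^c log p^c + beta * sum_c p^c log p^c  (= CE - beta*H)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ce = -(y_onehot * np.log(p)).sum(axis=-1)
    neg_entropy = (p * np.log(p)).sum(axis=-1)        # this is -H(p)
    return (ce + beta * neg_entropy).mean()

# A sharply peaked (low-entropy) prediction incurs a larger penalty term
# than a flat one, discouraging overconfident outputs.
logits = np.array([[4.0, 0.0, 0.0]])
y = np.array([[1.0, 0.0, 0.0]])
print(confidence_penalty_loss(logits, y, beta=0.1))
```

Minimizing CE − βH pushes the output distribution toward higher entropy wherever the CE term allows it, which is the calibration effect the distillation step later relies on.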

3.2. VICINITY DISTRIBUTION AND SENTENCE AUGMENTATION

Our augmented sentences come from two different sources: the original target language samples X_t, and the virtual samples generated from the vicinity distribution of the source and target samples: ϑ(x̂_n^s | x_n^s, θ_mlm) and ϑ(x̂_n^t | x_n^t, θ_mlm), where x_n^s ∼ X_s and x_n^t ∼ X_t. It has been shown that contextual LMs pretrained on large-scale datasets capture useful linguistic features and can be used to generate fluent grammatical texts (Hewitt and Manning, 2019). We use the XLM-R masked LM (Conneau et al., 2020) as our vicinity model θ_mlm, which is trained on massive multilingual corpora (2.5 TB of CommonCrawl data in 100 languages). Note that the vicinity model is a disjoint pretrained entity whose parameters are not trained on any task objective. In order to generate samples around each selected example, we first randomly choose P% of the input tokens. Then we successively (i.e., one at a time) mask one of the chosen tokens and ask θ_mlm to predict a token in that masked position, i.e., we compute ϑ(x_m | x, θ_mlm) with m being the index of the masked token. For a specific mask, we sample S candidate words from the output distribution. We then generate novel sentences by following one of two alternative approaches.

• Successive max: In this approach, we take the most probable output token (S = 1) at each prediction step, ô_m = argmax_o ϑ(x_m = o | x, θ_mlm). A new sentence is then constructed with P% newly generated tokens. We generate δ virtual samples for each original example x by randomly masking P% of the tokens each time. Here, δ is the diversification factor.

• Successive cross: In this approach, we divide each original sample x into two parts and use successive max to create two sets of augmented samples of size δ1 and δ2, respectively. We then take the cross-product of these two sets to generate δ1 × δ2 augmented samples.
Augmentation of sentences through successive max or successive cross is carried out within the GEN-LM (generate via LM) module in Algorithm 1. For tasks involving a single sequence (e.g., XNER), we directly use successive max. Pairwise tasks like XNLI and PAWS-X have pairwise dependencies: dependencies between a premise and a hypothesis in XNLI, or dependencies between a sentence and its possible paraphrase in PAWS-X. To model such dependencies, we use successive cross, which takes the cross-product of two successive-max augmentations applied independently to each component.
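A toy sketch of successive max and successive cross follows. The lookup-table "LM" is a purely illustrative stand-in for querying the XLM-R MLM head, and all function names are hypothetical:

```python
import random
from itertools import product

random.seed(0)

# Toy stand-in for the masked LM theta_mlm: given a token sequence and a
# masked position, return a "predicted" token. A real implementation would
# query XLM-R's MLM head here; this table is purely illustrative.
TOY_LM = {"good": ["great", "nice"], "movie": ["film", "show"], "a": ["the"]}

def toy_mlm_predict(tokens, m):
    return TOY_LM.get(tokens[m], [tokens[m]])[0]  # argmax ~ first candidate

def successive_max(tokens, p=0.3, delta=2, predict=toy_mlm_predict):
    """Generate `delta` virtual sentences: each time, pick p% of positions
    and replace them one at a time with the LM's most probable token."""
    out = []
    k = max(1, int(p * len(tokens)))
    for _ in range(delta):
        new = list(tokens)
        for m in random.sample(range(len(tokens)), k):
            new[m] = predict(new, m)   # successive: condition on edits so far
        out.append(new)
    return out

def successive_cross(left, right, **kw):
    """Cross two independently augmented halves (delta1 x delta2 samples)."""
    return list(product(successive_max(left, **kw), successive_max(right, **kw)))

print(successive_max("a good movie".split()))
```

The cross-product variant mirrors the pairwise tasks: `successive_cross(premise, hypothesis)` yields δ1 × δ2 candidate pairs from δ1 + δ2 LM-augmented halves.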

3.3. CO-LABELING OF AUGMENTED SENTENCES THROUGH CO-DISTILLATION

Traditional VRM-based data augmentation methods assume that the samples generated by the vicinity model share the same class, so that the original class labels can be reused for the newly generated data (Chapelle et al., 2001). This approach does not consider the vicinity relation across examples of different classes. Recent methods relax this assumption and generate new images and their labels as simple linear interpolations (Berthelot et al., 2019). However, due to the discrete nature of texts, such linear interpolation methods have not been successful so far in NLP. The meaning of a sentence (e.g., sentiment, word meanings) can change entirely even with minor variations in the original sentence. For example, consider the following example generated by our vicinity model (more in Appendix G).

Original text: EU rejects German call to boycott British lamb.
Masked text: <mask> rejects German call to boycott British lamb.
MLM prediction: Trump rejects German call to boycott British lamb.

Here, EU is an Organization whereas the newly predicted word Trump is a Person (a different name type). Therefore, we need to relabel the augmented sentences no matter whether the original sentence has labels (source) or not (target). However, the relabeling process can induce noise, especially for dissimilar/low-resource languages, since the base task model may not be fully adapted in the early training stages. We propose a two-stage sample distillation process to filter out noisy augmented data.

Sample distillation by single model: The first stage of distillation involves predictions from a single peer model, for which we propose two alternatives. (i) Distillation by model confidence: In this approach, we select samples based on the model's prediction confidence. This method is similar in spirit to the selection method proposed by Ruder and Plank (2018). For sentence-level tasks (e.g., XNLI), the model produces a single class distribution for each training example. In this case, the model's confidence is computed as p̄ = max_{c∈{1,...,C}} p_θ^c(x). For token-level sequence labeling tasks (e.g., NER), the model's confidence is computed as p̄ = (1/T) ∑_{t=1}^{T} max_{c∈{1,...,C}} p_θ^c(x_t), where T is the length of the sequence. The distillation is then done by selecting the top η% of samples with the highest confidence scores. (ii) Sample distillation by clustering: We propose this method based on the finding that large neural models tend to learn good samples faster than noisy ones, leading to a lower loss for good samples and a higher loss for noisy ones (Han et al., 2018; Arazo et al., 2019). We use a 1d two-component Gaussian Mixture Model (GMM) to model the per-sample loss distribution and cluster the samples based on their goodness.
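The confidence-based selection can be sketched in a few lines of NumPy; the helper names below are hypothetical, and the scores illustrate the sentence-level (max over classes) and token-level (mean of per-token maxima) confidences with top-η selection:

```python
import numpy as np

def confidence(probs):
    """Model confidence for one example.
    Sentence-level: probs has shape (C,)    -> max_c p^c.
    Token-level:    probs has shape (T, C)  -> mean over tokens of max_c p^c."""
    p = np.asarray(probs)
    return p.max() if p.ndim == 1 else p.max(axis=-1).mean()

def distil_by_confidence(samples, probs, eta=0.5):
    """Keep the top eta fraction of pseudo-labeled samples by confidence."""
    scores = [confidence(p) for p in probs]
    k = max(1, int(eta * len(samples)))
    order = np.argsort(scores)[::-1][:k]
    return [samples[i] for i in order]

sent_probs = [np.array([0.9, 0.05, 0.05]),
              np.array([0.4, 0.35, 0.25]),
              np.array([0.6, 0.3, 0.1])]
print(distil_by_confidence(["s1", "s2", "s3"], sent_probs, eta=0.67))  # -> ['s1', 's3']
```

Note how the confidence-penalty training in §3.1 matters here: if all predictions were saturated near 1.0, the ranking that this selection depends on would be uninformative.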
GMMs provide flexibility in modeling the sharpness of a distribution and can be easily fit using Expectation-Maximization (EM) (Appendix B). The loss is computed based on the pseudo labels predicted by the model. For each sample x, its goodness probability is the posterior probability p(z = g | x, θ_GMM), where g is the component with the smaller mean loss. Here, the distillation hyperparameter η is the posterior probability threshold based on which samples are selected.

Distillation by model agreement: In the second stage of distillation, we select samples by taking the agreement (co-guess) of two different peer models θ^(j) and θ^(k) to train the third model θ^(l). Formally,

AGREEMENT(D^(k), D^(j)) = {(X^(k), Y^(k)) : Y^(k) = Y^(j)}  s.t.  k ≠ j

3.4. DATA SAMPLES MANIPULATION

XLA uses multi-epoch co-teaching. It uses D_s and D'_t in the first epoch. In epoch 2, it uses D̃_t (target virtual), and finally it uses all four datasets: D_s, D'_t, D̃_t, and D̃_s (line 22 in Alg. 1). The datasets used at different stages can be of different sizes. For example, the number of augmented samples in D̃_s and D̃_t grows polynomially with the successive cross masking method. Also, the co-distillation produces sample sets of variable sizes. To ensure that our model does not overfit on one particular dataset, we employ a balanced sampling strategy. For N datasets {D_i}_{i=1}^{N} with probabilities {p_i}_{i=1}^{N}, we define the following multinomial distribution to sample from:

p_i = f_i^α / ∑_{j=1}^{N} f_j^α,  where  f_i = n_i / ∑_{j=1}^{N} n_j    (2)

where α is the sampling factor and n_i is the total number of samples in the i-th dataset. By tweaking α, we can control how many samples a dataset contributes to the mix.
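The balanced sampling distribution of Eq. 2 is straightforward to compute directly. A small NumPy sketch (the dataset sizes are made up for illustration):

```python
import numpy as np

def sampling_probs(sizes, alpha=0.5):
    """Eq. (2): p_i = f_i^alpha / sum_j f_j^alpha, with f_i = n_i / sum_j n_j.
    alpha < 1 flattens the distribution, so small datasets are upsampled
    relative to their raw share of the data."""
    f = np.asarray(sizes, dtype=float)
    f = f / f.sum()
    p = f ** alpha
    return p / p.sum()

sizes = [90_000, 9_000, 1_000]           # e.g. |D_s|, |D~_s|, |D~_t| (illustrative)
print(sampling_probs(sizes, alpha=1.0))  # raw proportions: [0.9, 0.09, 0.01]
print(sampling_probs(sizes, alpha=0.5))  # flattened: smallest set rises to ~7%
```

With α = 1 the mix follows the raw dataset proportions; pushing α toward 0 moves it toward uniform, which is why the smaller distilled or vicinal sets are not drowned out by the large source set.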

4. EXPERIMENTS

We consider three tasks in the zero-resource cross-lingual transfer setting. We assume labeled training data only in English, and transfer the trained model to a target language. For all experiments, we report the mean score of the three models that use different seeds (variance shown in Appendix F).

4.1. TASKS & SETTINGS

XNER: As a sequence labeling task, XNER evaluates the model's capability to learn task-specific contextual representations that depend on language structure. We use the standard CoNLL datasets (Sang, 2002; Sang and Meulder, 2003) for English (en), German (de), Spanish (es), and Dutch (nl). We also evaluate on Finnish (fi) and Arabic (ar) datasets collected from Bari et al. (2020). Note that Arabic is structurally different from English, and Finnish is from a different language family. To show how the models perform on extremely low-resource languages, we experiment with three structurally different languages from WikiANN (Pan et al., 2017) with different (unlabeled) training data sizes: Urdu (ur; 20K training samples), Bengali (bn; 10K samples), and Burmese (my; 100 samples).

XNLI: XNLI judges the model's ability to extract a reasonable meaning representation of sentences across different languages. We use the standard dataset (Conneau et al., 2018). For a given pair of sentences, the task is to predict the entailment relationship between the two sentences, i.e., whether the second sentence (hypothesis) is an Entailment, Contradiction, or Neutral with respect to the first one (premise). We experiment with Spanish, German, Arabic, Swahili (sw), Hindi (hi), and Urdu.

PAWS-X: The Paraphrase Adversaries from Word Scrambling Cross-lingual task (Yang et al., 2019a) requires the models to determine whether two sentences are paraphrases. We evaluate on all six (typologically distinct) languages: fr, es, de, Chinese (zh), Japanese (ja), and Korean (ko).

Settings: Our goal is to adapt a task model from a source (language) distribution to an unknown target (language) distribution assuming no labeled data in the target. In this scenario, there might be two different distributional gaps: (i) the generalization gap for the source distribution, and (ii) the gap between the source and target language distributions.
We wish to investigate our method in tasks that exhibit such properties. We use the standard task setting for XNER, where we take 100% of the samples from the datasets, as they come from various domains and sizes without any specific bias. Prior work also argues that the translation process can induce subtle artifacts that may have a notable impact on models. Therefore, for XNLI and PAWS-X, we experiment with two different setups. First, to ensure distributional differences and nonparallelism, we use 5% of the training data from the source language and augment a different (nonparallel) 5% dataset for the target language. We use a different seed each time to retrieve the 5% target language data. Second, to compare with previous methods, we also evaluate on the standard 100% setup. However, the evaluation is done on the entire test set in both setups. We will refer to these two settings as 5% and 100%. Details about model settings and hyperparameters are in Appendix E.

4.2. RESULTS

XNER: Table 1 reports the XNER results on the datasets from CoNLL and Bari et al. (2020), where we also evaluate an ensemble by averaging the probabilities from the three models. We observe that after performing warm-up with the confidence penalty, XLM-R performs better than mBERT on average by ∼3.8% across all the languages. On average, XLA gives a sizable improvement of ∼5.5% on five different languages. Specifically, we get absolute improvements of 3.76%, 4.34%, 6.94%, 8.31%, and 4.18% for es, nl, de, ar, and fi, respectively. Interestingly, XLA surpasses the supervised LSTM-CRF for nl and de without using any target labeled data. It also produces comparable results for es. In Table 2, we report the results on the three low-resource languages from WikiANN. From these results and the results for ar and fi in Table 1, we see that XLA is very effective for languages that are structurally dissimilar and/or low-resourced, especially when the base model is weak: 28.54%, 16.05%, and 9.25% absolute improvements for ur, my, and ar, respectively.

XNLI-5%: From Table 3, we see that the performance of XLM-R trained on 5% data is surprisingly good compared to the model trained on full data (XLM-R (our imp.)), lagging by only 5.6% on average. In our single-GPU implementation of XNLI, we could not reproduce the reported results of Conneau et al. (2020). However, our results resemble the reported XLM-R results of XTREME (Hu et al., 2020). We consider XTREME as our standard baseline for XNLI-100%. We observe that with only 5% labeled data in the source, XLA gets comparable results to the XTREME baseline that uses 100% labeled data (lagging behind by only ∼0.7% on average); even for ar and sw, we get 0.22% and 1.11% improvements, respectively. It surpasses the standard 5% baseline by 4.2% on average. Specifically, XLA gets absolute improvements of 3.05%, 3.34%, 5.38%, 5.01%, 4.29%, and 4.12% for es, de, ar, sw, hi, and ur, respectively.
Again, the gains are relatively higher for low-resource and/or dissimilar languages despite the base model being weaker in such cases.

XNLI-100%: Now, considering XLA's performance with the full (100%) labeled source data in Table 3, we see that it achieves state-of-the-art results for all of the languages, with an absolute improvement of 2.55% on average over the XTREME baseline. Specifically, XLA gets absolute improvements of 1.95%, 1.68%, 4.30%, 3.50%, 3.24%, and 1.65% for es, de, ar, sw, hi, and ur, respectively.

PAWS-X: Similar to XNLI, we observe sizable improvements for XLA over the baselines on PAWS-X in both the 5% and 100% settings (Table 4). Specifically, in the 5% setting, XLA gets absolute gains of 5.33%, 5.94%, 5.04%, 6.85%, 7.00%, and 5.45% for de, es, fr, ja, ko, and zh, respectively, while in the 100% setting, it gets 2.21%, 2.36%, 2.00%, 3.99%, 4.53%, and 4.41% improvements, respectively. Overall, we get average improvements of 5.94% and 3.25% in the PAWS-X-5% and PAWS-X-100% settings, respectively. Moreover, our 5% setting outperforms the 100% XLM-R baselines for es, ja, and zh. Interestingly, in the 100% setup, our XLA (ensemble) achieves accuracy almost on par with supervised fine-tuning of XLM-R on the full target language training data.

5. ANALYSIS

In this section, we further analyze XLA by dissecting it and measuring the contribution of its different components. For this, we use the XNER task and analyze the model based on the results in Table 1.

Model confidence vs. clustering

We first analyze the performance of our single-model distillation methods (§3.3) to see which of the two alternatives works better. From Table 5, we see that both perform similarly, with model confidence being slightly better. In our main experiments (Tables 1, 2, 3, and 4) and subsequent analysis, we use model confidence for distillation. However, we should not rule out the clustering method, as it gives a more general solution that can consider distillation features other than model prediction scores (e.g., sequence length, language), which we did not explore in this paper.

Distillation factor η: We next show the results for different values of the distillation factor η in Table 5. Here, 100% refers to the case where no single-model distillation based on model confidence is done. We notice that the best results for each of the languages are obtained for values other than 100%, which indicates that distillation is indeed an effective step in XLA. See Appendix C.2 for more on η.

Two-stage distillation

We now validate whether the second-stage distillation (distillation by model agreement) is needed. In Table 5 , we also compare the results with the model agreement (shown as ∩) to the results without using any agreement (shown as φ). We observe better performance with model agreement in all the cases on top of the single-model distillation, which validates its utility.

5.2. DIFFERENT TYPES OF AUGMENTATION IN DIFFERENT STAGES

Figure 2 presents the effect of the different types of augmented data used in different epochs of our multi-epoch co-teaching framework. We observe that in every epoch, there is a significant boost in F1 scores for each of the languages. Arabic, being structurally dissimilar to English, has a lower base score, but the relative improvements brought by XLA are higher for Arabic, especially in epoch 2, when it gets exposed to the target language virtual data (D̃_t) generated by the vicinity distribution.

5.3. ROBUSTNESS

Table 6 shows the robustness of the fine-tuned XLA model on the XNER task. After fine-tuning on a specific target language, the F1 scores in English remain almost unchanged. For some languages, XLA adaptation on a different language even improves performance. For example, Arabic gets improvements from all XLA-adapted models (compare 50.88 with the others). This indicates that XLA's augmentation does not overfit on a target language.

5.4. EFFECT OF CONFIDENCE PENALTY & ENSEMBLE

For all three tasks, we get reasonable improvements over the baselines by training with the confidence penalty regularizer (§3.1). Specifically, we get 0.56%, 0.74%, 1.89%, and 1.18% improvements in XNER, XNLI-5%, PAWS-X-5%, and PAWS-X-100%, respectively (Tables 1, 3, and 4). The improvements in XNLI-100% are marginal and inconsistent, which we suspect is due to the balanced class distribution. From the results of the ensemble models, we see that ensembling boosts the baseline XLM-R. However, our regular XLA still outperforms the ensemble baselines by a sizeable margin. Moreover, ensembling the trained models from XLA further improves performance. These comparisons confirm that the capability XLA gains through co-teaching and co-distillation goes beyond the ensemble effect.
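A common form of the confidence penalty is cross-entropy minus a scaled entropy bonus, which discourages over-confident (low-entropy) output distributions. We assume Eq. 1 of the main paper follows this shape; `beta` is a hypothetical strength hyperparameter for illustration only.

```python
import numpy as np

def ce_with_confidence_penalty(probs, y, beta=0.1):
    """Cross-entropy loss minus beta * mean predictive entropy.
    Penalizing confidence keeps the output distribution from collapsing
    onto one class, which the paper argues helps robustness on
    out-of-distribution target-language data."""
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12
    ce = -np.log(probs[np.arange(len(y)), y] + eps).mean()
    entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    return ce - beta * entropy

# A maximally uncertain prediction: entropy log(3) is subtracted.
loss = ce_with_confidence_penalty([[1/3, 1/3, 1/3]], [0], beta=0.1)
```

Because the entropy term is subtracted, a prediction with higher entropy incurs a smaller regularized loss than the plain cross-entropy alone.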

6. CONCLUSION

We propose a novel data augmentation framework, XLA, for zero-resource cross-lingual task adaptation. XLA performs simultaneous self-training with data augmentation and unsupervised sample selection. With extensive experiments on three different cross-lingual tasks spanning many language pairs, we have demonstrated the effectiveness of XLA. For the zero-resource XNER task, XLA sets a new SoTA for all the tested languages. For both XNLI and PAWS-X tasks, with only 5% labeled data in the source, XLA gets comparable results to the baseline that uses 100% labeled data. Through an in-depth analysis, we show the cumulative contributions of different components of XLA. 

APPENDIX

Here we provide additional content regarding the XLA framework. In Appendix A, we present the justifications for the design choices of the XLA framework. In Appendix B, we discuss the mathematical details of EM training for the two-component GMM clustering algorithm. In Appendix C, we visualize various effects of the confidence penalty. In Appendix D, we elaborate on the related work. In Appendix E, we present the setup details of our experiments. In Appendix F, we report results with standard deviations and compare with prior research. Finally, in Appendix G, we present examples of augmented samples generated by our vicinity model for XNER, XNLI, and PAWS-X.

A JUSTIFICATIONS FOR DESIGN METHODOLOGY OF XLA FRAMEWORK

Here are our justifications for various design principles of the XLA framework.

Is using three models with different initialization necessary? Yes; different initializations ensure different convergence paths, which results in diversity during inference. Co-labeling (Section 3.3) utilizes this property. There could be other ways to achieve the same effect; for example, our initial attempt with three different heads (sharing a backbone network) did not work well.

Is using three epochs necessary? We utilize different types of datasets in different epochs. While pseudo-labeling may induce noise, the model's predictions for in-domain cross-lingual samples are usually better. For a smooth transition, we therefore apply the vicinal samples in the second epoch. Finally, inspired by the joint training of cross-lingual language models, in the third epoch we use all four datasets. We also include the labeled source data, which ensures that our model does not overfit to the target distribution and preserves the generalization capability learned from the source distribution.

Why combine co-teaching, co-distillation, and co-guessing? The combination of these helps to distill out the noisy samples better.
Efficiency of the method and extra cost for large-scale pretrained models. It is common practice in model selection to train 3-5 disjoint LM-based task models (e.g., XLM-R on NER) with different random seeds and report the ensemble score or the score of the best model on the validation set. In contrast, XLA uses three different models and trains them jointly, where the models assist each other through distillation and co-labeling. In that sense, the extra cost comes from distillation and co-labeling, which is not significant and is compensated by the significant improvements that XLA offers.

B DETAILS ON DISTILLATION BY CLUSTERING

One limitation of the confidence-based (single-model) distillation is that it does not consider task-specific information. Apart from classifier confidence, there could be other important features that can distinguish a good sample from a noisy one. For example, for sequence labeling, sequence length can be an important feature, as models tend to make more mistakes (hence produce noisier pseudo-labels) on longer sequences (Bari et al., 2020). One might also want to consider features like fluency, which can be estimated by a pretrained conditional LM like GPT (Radford et al., 2019). In the following, we introduce a clustering-based method that can use such additional features to separate good samples from bad ones. Our goal here is to cluster the samples based on their goodness. It has been shown in computer vision that deep models tend to learn good samples faster than noisy ones, leading to a lower loss for good samples and a higher loss for noisy ones (Han et al., 2018; Arpit et al., 2017). We propose to model the per-sample loss distribution (along with other task-specific features) with a mixture model, which we fit using an Expectation-Maximization (EM) algorithm. However, contrary to those approaches, which use actual (supervised) labels, we use the model-predicted pseudo-labels to compute the loss for the samples. We use a two-component Gaussian Mixture Model (GMM) due to its flexibility in modeling the sharpness of a distribution (Li et al., 2020a). In the following, we describe the EM training of the GMM for one feature, i.e., the per-sample loss, but it is trivial to extend it to other indicative task-specific features like sequence length or fluency score (see any textbook on machine learning).

EM training for a two-component GMM. Let $x_i \in \mathbb{R}$ denote the loss for sample $i$ and $z_i \in \{0, 1\}$ denote its cluster id.
We can write the 1d GMM model as:

$$p(x_i \mid \theta, \pi) = \sum_{k=0}^{1} \mathcal{N}(x_i \mid \mu_k, \sigma_k)\,\pi_k \qquad (3)$$

where $\theta_k = \{\mu_k, \sigma_k^2\}$ are the parameters of the $k$-th mixture component and $\pi_k = p(z_i = k)$ is the probability (weight) of the $k$-th component, with the condition $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. In EM, we optimize the expected complete-data log-likelihood $Q(\theta, \theta^{t-1})$, defined as:

$$
\begin{aligned}
Q(\theta, \theta^{t-1}) &= \mathbb{E}\Big(\sum_i \log\big[p(x_i, z_i \mid \theta)\big]\Big) && (4)\\
&= \mathbb{E}\Big(\sum_i \sum_k \mathbb{I}(z_i = k)\,\log\big[p(x_i \mid \theta_k)\,\pi_k\big]\Big) && (5)\\
&= \sum_i \sum_k \mathbb{E}\big(\mathbb{I}(z_i = k)\big)\,\log\big[p(x_i \mid \theta_k)\,\pi_k\big] && (6)\\
&= \sum_i \sum_k p(z_i = k \mid x_i, \theta^{t-1})\,\log\big[p(x_i \mid \theta_k)\,\pi_k\big] && (7)\\
&= \sum_i \sum_k r_{i,k}(\theta^{t-1})\,\log p(x_i \mid \theta_k) + r_{i,k}(\theta^{t-1})\,\log \pi_k && (8)
\end{aligned}
$$

where $r_{i,k}(\theta^{t-1})$ is the responsibility that cluster $k$ takes for sample $x_i$, computed in the E-step so that we can optimize $Q(\theta, \theta^{t-1})$ (Eq. 8) in the M-step. The E-step and M-step for a 1d GMM can be written as:

E-step: compute
$$r_{i,k}(\theta^{t-1}) = \frac{\mathcal{N}(x_i \mid \theta_k^{t-1})\,\pi_k^{t-1}}{\sum_{k'} \mathcal{N}(x_i \mid \theta_{k'}^{t-1})\,\pi_{k'}^{t-1}}$$

M-step: optimize $Q(\theta, \theta^{t-1})$ w.r.t. $\theta$ and $\pi$:
$$\pi_k = \frac{\sum_i r_{i,k}}{\sum_i \sum_k r_{i,k}} = \frac{1}{N}\sum_i r_{i,k}, \qquad \mu_k = \frac{\sum_i r_{i,k}\,x_i}{\sum_i r_{i,k}}, \qquad \sigma_k^2 = \frac{\sum_i r_{i,k}\,(x_i - \mu_k)^2}{\sum_i r_{i,k}}$$

Inference. For a sample $x$, its goodness probability is the posterior probability $p(z = g \mid x, \theta)$, where $g \in \{0, 1\}$ is the component with the smaller mean loss. Here, the distillation hyperparameter $\eta$ is the posterior-probability threshold based on which samples are selected.

Relation with distillation by model confidence. Astute readers might have already noticed that the per-sample loss has a direct deterministic relation with the model confidence. Even though they are different, these two distillation methods consider the same source of information. However, as mentioned, the clustering-based method allows us to incorporate other indicative features like length, fluency, etc. For a fair comparison between the two methods, we use only the per-sample loss in our primary (single-model) distillation methods.
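The EM updates for the two-component 1d GMM over per-sample losses can be sketched in a few lines of numpy. This is a minimal illustration of the E-step/M-step recursion and the posterior-threshold selection; the initialization strategy and variance floor are our own assumptions, not the paper's.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=50):
    """EM for a two-component 1d GMM over per-sample losses.
    Returns means, variances, mixture weights, and responsibilities."""
    x = np.asarray(x, dtype=float)
    # Simple initialization: centers at the 25th/75th loss percentiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] from current parameters.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens * pi
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for pi_k, mu_k, sigma_k^2.
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    return mu, var, pi, r

def select_good(x, eta=0.5):
    """Goodness = posterior of the lower-mean-loss component; keep
    samples whose posterior exceeds the threshold eta."""
    mu, var, pi, r = fit_gmm_1d(x)
    g = int(np.argmin(mu))          # component with the smaller mean loss
    return np.where(r[:, g] > eta)[0]

# Four low-loss ("good") samples and three high-loss ("noisy") ones.
losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9]
good = select_good(losses, eta=0.5)
```

On this toy data the EM fit cleanly separates the two loss regimes, so only the four low-loss samples pass the posterior threshold.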

C VISUALIZING THE EFFECT OF CONFIDENCE PENALTY C.1 EFFECT OF CONFIDENCE PENALTY IN CLASSIFICATION

In Figure 3, we present the effect of the confidence penalty (Eq. 1 in the main paper) on target language (Spanish) classification on the XNER dev data (i.e., after training on English NER). We show the class distribution from the final logits (on the target language) using t-SNE plots (van der Maaten and Hinton, 2008). From the figure, it is evident that using the confidence penalty in the warm-up step makes the model more robust to unseen out-of-distribution target language data, yielding better predictions, which in turn also provides a better prior for self-training with pseudo-labels. Figures 5(a) and 5(b) present the loss distribution in a scatter plot, with sentences sorted by length on the x-axis and loss on the y-axis. As we can see, the losses are indeed more scattered when we train the model with the confidence penalty, which indicates higher per-sample entropy, as expected. We can also see that as sentence length increases, there are more wrong predictions; our distillation method should be able to filter out these noisy pseudo-samples. Bari et al. (2020) show that cross-lingual NER inference is heavily dependent on the length distribution of the samples: in general, predictions on shorter samples are more accurate. However, if we only select the shorter samples, we will easily overfit. From these plots, we observe that the confidence penalty also helps the distillation perform better, as more sentences are selected (by the distillation procedure) from the shorter-length range while still covering the entire length distribution. This shows that training with the confidence penalty makes the model more robust. In summary, comparing Figures 4, 5, and 6, we conclude that training without the confidence penalty makes the model more prone to overfitting, resulting in noisier pseudo-labels.
Training with the confidence penalty not only improves pseudo-labeling accuracy but also helps the distillation methods perform better noise filtering.

D RELATED WORK (EXTENDED VERSION)

BERT also comes with a multilingual version, mBERT, which has 12 layers, 12 attention heads, and 768 hidden dimensions, and is trained jointly on 102 languages with a shared vocabulary of 110K subword tokens. Despite the absence of any explicit cross-lingual supervision, mBERT has been shown to learn cross-lingual representations that generalize well across languages. Wu and Dredze (2019) and Pires et al. (2019) evaluate the zero-shot cross-lingual transferability of mBERT on several NLP tasks and attribute its generalization capability to shared subword units. Pires et al. (2019) additionally find structural similarity (e.g., word order) to be another important factor for successful cross-lingual transfer. K et al. (2020), however, show that shared subwords have minimal contribution; rather, the structural similarity between languages is more crucial for effective transfer. Artetxe et al. (2019) further show that joint training may not be necessary and propose an alternative method that transfers a monolingual model to a bilingual model by learning only the word embeddings in the target language. They also identify the vocabulary size per language as an important factor. Lample and Conneau (2019) extend mBERT with conditional LM and translation LM objectives (using parallel data) and a language embedding layer, training a larger model with more monolingual data.

In VRM, we minimize the empirical vicinal risk, defined as:

$$L_v(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell\big(f_\theta(\tilde{x}_n), \tilde{y}_n\big)$$

where $f_\theta$ denotes the model parameterized by $\theta$, and $D_{aug} = \{(\tilde{x}_n, \tilde{y}_n)\}_{n=1}^{N}$ is an augmented dataset constructed by sampling the vicinal distribution $\vartheta(\tilde{x}, \tilde{y} \mid x_i, y_i)$ around each original training sample $(x_i, y_i)$. Defining the vicinity is, however, challenging, as it requires drawing samples from a distribution without corrupting the labels.
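The empirical vicinal risk above reduces to an ordinary averaged loss computed over vicinal samples and their (possibly soft) labels. The following is a minimal sketch of that computation with cross-entropy as the per-sample loss $\ell$; names are illustrative, not from the paper's code.

```python
import numpy as np

def vicinal_risk(model_probs, soft_labels):
    """Empirical vicinal risk: mean cross-entropy between the model's
    predictions f_theta(x~_n) and the vicinal labels y~_n, averaged over
    an augmented dataset drawn from the vicinity distribution."""
    p = np.asarray(model_probs, dtype=float)
    y = np.asarray(soft_labels, dtype=float)
    return -(y * np.log(p + 1e-12)).sum(axis=1).mean()

# One vicinal sample with a hard (one-hot) label; the model is uncertain.
risk = vicinal_risk([[0.5, 0.5]], [[1.0, 0.0]])
```

With a one-hot label this is plain cross-entropy (here $-\log 0.5$); soft vicinal labels, e.g. from interpolation-style augmentation, simply spread the mass across classes.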
Earlier methods apply simple rules like rotation and scaling of images (Simard et al., 1998). More recently, Zhang et al. (2018), Berthelot et al. (2019), and Li et al. (2020a) show impressive results in image classification with simple linear interpolation of data. However, to our knowledge, none of these methods has so far been successful in NLP due to the discrete nature of text.

E SETUP DETAILS

E.1 ZERO-SHOT VS. ZERO-RESOURCE TRANSFER

Previous work on cross-lingual transfer has followed different training-validation standards. Xie et al. (2018) perform cross-lingual transfer of NER from a source language to a target language, where they train their model on translations of the source language training data and validate it (for model selection) on target language development data. They call this an unsupervised setup, as they use an unsupervised word translation model (Conneau et al., 2017). Several other studies (Conneau et al., 2018; Lample and Conneau, 2019; Wang et al., 2019) apply the same setting and select their model based on target language development set performance. On the other hand, Artetxe and Schwenk (2018) and Wu and Dredze (2019) validate their models using source language development data. Bari et al. (2020) show significant performance differences between validating with source vs. target language development data for NER. Later, Conneau et al. (2020) provide a comprehensive analysis of different training-validation setups and encourage validating with the source language development data. It is thus clear that there is no unanimous agreement on the proper setup. Following previous work and the landscape of the problem, we think that different settings should be considered under different circumstances. In pure zero-shot cross-lingual transfer, no target language data should be used either for training or for model selection.
The goal here is to evaluate the generalizability and transferability of a model trained on a known source language distribution to an unknown target language distribution. In this sense, the zero-shot setting is suitable for measuring the cross-lingual transferability of a pretrained model. Our goal in this work is not to propose a new pretraining approach, but rather to propose novel cross-lingual adaptation methods and evaluate their capability on downstream tasks. Our proposed XLA framework performs simultaneous self-training with data augmentation and unsupervised sample selection. As our objective is to evaluate cross-lingual adaptation performance rather than cross-lingual representations, we train our model with the original source and the augmented source and target language data, while validating with target development data for model selection. We refer to this as the zero-resource setup, which is still a minimal-supervision setting for task adaptation because no true target labels are used for training the model. This setup also gives us a way to compare how far we are from the supervised adaptation setting (train and validate on target language data).

E.2 USE OF MBERT VS. XLM-R

From Table 4, we see that mBERT (Devlin et al., 2019) is the smallest multilingual language model (LM) in terms of training data size and number of parameters, while XLM-R is the largest. At its heart, XLA uses the generation capability of a pretrained LM for data augmentation, which could be a bottleneck for XLA's performance. In our initial experiments, we found that the generation quality of mBERT is not as good as that of XLM-R. Using mBERT as the vicinity model can thus generate noisy samples that propagate to the task models and may prevent us from getting the maximum benefit from the XLA framework. Thus, to ensure the generation of better vicinity samples, we choose XLM-R, the best-performing multilingual LM to date, as the vicinity model θ_lm in our framework. For the task models θ^(i), in principle we can use any multilingual model (e.g., mBERT, XLM-R) while using XLM-R as the vicinity model. However, if we use a weaker task model (e.g., mBERT) than the vicinity model, any performance gain may not be easily attributable, i.e., the gain may come from the increased generalization capability of the stronger vicinity model. This, in turn, would prevent us from properly evaluating the XLA framework in terms of its adaptation capability. In addition, from Table 1 and Table 2 (in the main paper), we observe that zero-shot XLM-R outperforms mBERT in the warm-up step by ∼3.8% in NER and ∼13.46% in XNLI. Therefore, we choose XLM-R for both the task models θ^(i) and the vicinity model θ_lm. With this setup, an improvement over the baseline strictly indicates the superior performance of the framework. It is both attractive and challenging to use a single LM (XLM-R) as the vicinity model θ_lm across different languages. Note that the vicinity model in our framework is a disjoint pretrained entity whose weights are not trained on any task objective. This disjoint characteristic gives our framework the flexibility to replace θ_lm with a better monolingual LM for a specific target language, which in turn makes our framework extensible to stronger new LMs that may come in the future.

E.3 DATASETS (EXTENDED VERSION)

XNER. For XNER, we transfer from English (en) to Spanish (es), German (de), Dutch (nl), Arabic (ar), and Finnish (fi). For English and German, we use the datasets from the CoNLL-2003 shared task (Sang and Meulder, 2003), while for Spanish and Dutch, we use the datasets from the CoNLL-2002 shared task (Sang, 2002). We collected the Arabic and Finnish NER datasets from Bari et al. (2020). The NER tags are converted from IOB1 to IOB2 for standardization, and the tokens of each of the six datasets are classified into five categories: Person, Organization, Location, Misc., and Other. Pretrained LMs like XLM-R generally operate at the subword level; as a result, when the labels are at the word level and a word is broken into multiple subwords, we mask the predictions of the non-first subwords. Table 9 presents detailed statistics of the XNER datasets. The datasets for the different languages vary in size, and the class distribution is not balanced; therefore, we use the micro F1 score as the evaluation metric for XNER.

XNLI. The XNLI dataset (Conneau et al., 2018) extends MultiNLI (Williams et al., 2018) to 15 languages. For a given pair of sentences, the task is to predict the entailment relationship between the two sentences, i.e., whether the second sentence (hypothesis) is an Entailment, Contradiction, or Neutral with respect to the first one (premise). For XNLI, we experiment with transferring from English to Spanish (es), German (de), Arabic (ar), Swahili (sw), Hindi (hi), and Urdu (ur). Unlike NER, from Table 6 we see that the dataset sizes are the same for all languages, and the class distribution is balanced; thus, we use accuracy as the evaluation metric for XNLI.

PAWS-X. The task of PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) is to predict whether a pair of sentences are paraphrases of each other. PAWS-X contains six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.
For this task, we experiment with transferring from English to all six of these languages. Table 7 presents detailed statistics of the PAWS-X datasets. As with XNLI, we use accuracy as the evaluation metric for this task.

We present the hyperparameter settings for the XNER and XNLI tasks in the XLA framework in Table 8. In the warm-up step, we train and validate the task models with English data. For cross-lingual adaptation, however, we validate our model (for model selection) on the target language development set. We train our model for a given number of steps instead of a number of epochs; when a number of epochs is specified, we convert it to a total number of steps. For both tasks, we observe that the learning rate is a crucial hyperparameter. In Table 8, lr-warm-up-steps refers to the warm-up steps of the triangular learning rate schedule (Smith, 2015); this hyperparameter is not to be confused with the Warm-up step of the XLA framework. In our experiments, batch size is another crucial hyperparameter, obtained by multiplying the per-GPU training batch size by the number of gradient accumulation steps. We fix the maximum sequence length to 280 tokens for XNER and 128 tokens for XNLI. For each experiment, we report the average score of the three task models θ^(1), θ^(2), θ^(3), which are initialized with different seeds. We perform each experiment on a single GPU with float32 precision.
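The word-to-subword label alignment described for XNER above (keep each word's label on its first subword and mask the rest from the loss) can be sketched as follows. We assume a tokenizer that exposes a per-subword word-index map, as common subword tokenizers do; the function name and the -100 ignore value are illustrative conventions, not the paper's code.

```python
def align_labels_to_subwords(word_labels, word_ids, ignore_index=-100):
    """When a tokenizer splits words into subwords, keep the word's
    label on its first subword and mask non-first subwords and special
    tokens with `ignore_index` so they are excluded from the loss.
    `word_ids` maps each subword position to its word index, with None
    for special tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)   # special token or continuation
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "New York" -> ["[CLS]", "New", "Yo", "##rk", "[SEP]"]
labels = align_labels_to_subwords([1, 2], [None, 0, 1, 1, None])
# → [-100, 1, 2, -100, -100]
```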

G EXAMPLES OF AUGMENTED DATA

We present examples of augmented samples generated by our vicinity model for the XNER, XNLI, and PAWS-X datasets in Tables 12, 13, and 14, respectively.

Table 12: Examples of augmented data from the XNER dataset.

English

Original: Motor-bike registration rose 32.7 percent in the period .
Augmented: Motor-bike sales rose 32.7 percent in the US .

Original: He will be replaced by Eliahu Ben-Elissar , a former Israeli envoy to Egypt and right-wing Likud party politician .
Augmented: He will be led by Eliahu Cohen , a former UN Secretary to Egypt and right-wing opposition party leader .

Original: Israeli-Syrian peace talks have been deadlocked over the Golan since 1991 despite the previous government 's willingness to make Golan concessions .
Augmented: The peace talks have been deadlocked over the Golan since 2011 , despite the Saudi government 's willingness to make Golan concessions .

Spanish

Original: En esto de la comida abunda demasiado la patriotería .
Augmented: En medio de la guerra abunda demasiado la violencia .

Original: Pero debe , cómo no , estar abierta a incorporaciones foráneas .
Augmented: También debe , cómo no , estar abierta a personas diferentes .

Original: Deutsche Telekom calificó esta compra , cuyo precio no especificó , como otro paso hacia su internacionalización mediante adquisiciones mayoritarias destinadas a tener el control de la dirección de esas empresas .
Augmented: Deutsche Bank calificó esta operación , cuyo importe no especificó , como otro paso hacia su expansión mediante acciones mayoritarias destinadas a tener el control de la dirección de las empresas .

Dutch

Original: Onvoldoende om een zware straf uit te spreken , luidt het .
Augmented: Onvoldoende om een zware waarheid uit te leggen , is het .

Original: Dit hof verbindt nu geen straf aan de schuld die ze vaststelt .
Augmented: Dit hof geeft nu de schuld aan de schuld die ze vaststelt .

Original: Wat jaren meeging als een omstreden ' CVP-dossier ' krijgt nu door de rechterlijke uitspraak het cachet van een oude koe in de gracht .
Augmented: Wat jaren begon als een omstreden ' CVP-dossier ' krijgt nu door de rechterlijke macht het cachet van de heilige koe in de gracht .

German

Original: Gleichwohl bleibt diese wissenschaftlich abgeleitete Klassifizierung von Erzähltypen nur äußerlich .
Augmented: Gleichwohl bleibt die daraus abgeleitete Klassifizierung von Erzähltypen nur begrenzt .

Original: Dies führt vielmehr zu anderen grundlegenden Mißverständnissen , die zur Verwischung entscheidender Unterschiede beitragen .
Augmented: Dies führt vielmehr zu sehr großen Mißverständnissen , die zur Verwischung entscheidender Informationen führen .

Original: Die eine Geschichte zerfällt dabei in viele Erzählungen , die wiederum wissenschaftlich genau nach unterschiedlichen Genres klassifiziert werden können .
Augmented: Die ganze Geschichte zerfällt dabei in viele Erzählungen , die nicht ganz genau in verschiedene Genres gestellt werden können .

Table 13: Examples of augmented data from the XNLI dataset.

English

Original: text_a: One of our number will carry out your instructions minutely . text_b: A member of my team will execute your orders with immense precision .
Augmented: text_a: One of our number will carry out your order immediately text_b: A member of my team will execute your orders with immense care .

Original: text_a: my walkman broke so i 'm upset now i just have to turn the stereo up real loud text_b: I 'm upset that my walkman broke and now I have to turn the stereo up really loud .
Augmented: text_a: my stereo broke so i 'm stuck. i just have to turn the stereo up super loud text_b: I 'm upset because my phone broke and now I have to turn the music up really loud .

Spanish

Original: text_a: Bueno , porque lo caliente que quiero decir como en el más frío que se pone en invierno ahí abajo , cuánto es ? text_b: Hace calor todo el tiempo donde vivo , incluido el invierno .
Augmented: text_a: Bueno , pero lo primero que quiero decir como en el caso calor que se pone en invierno ahí arriba , cuánto es ? text_b: Tengo calor todo el tiempo que puedo , incluido el invierno .

Original: text_a: Sí , es como en louisiana donde ese tipo que es como un miembro del ku klux klan algo fue elegido un poco aterrador cuando piensas en eso . text_b: Un miembro del ku klux klan ha sido elegido en louisiana .
Augmented: text_a: Sí , estuvieron en louisiana y ese tipo que aparece como un miembro del ku klux klan algo ha sido un poco aterrador cuando piensas en eso . text_b: Un miembro del kumite klan ha sido detenido en louisiana .

Arabic

(The Arabic original/augmented text_a and text_b examples were not preserved in the text extraction.)



Footnotes. We consider papers that have been published (or accepted) through peer review. There has been some concurrent work that uses pretrained LMs like BERT to craft adversarial examples (Li et al., 2020b); although relevant, these methods have a different objective than ours, and none of them is cross- or multi-lingual. For mBERT details, see github.com/google-research/bert/blob/master/multilingual.md



Figure 1: Training flow diagram of XLA. After training the base task models θ(1), θ(2), and θ(3) on the source labeled data D_s (Warm-up), we use two of them (θ(j), θ(k)) to pseudo-label and co-distill the unlabeled target language data (D_t). A pretrained LM (Gen-LM) is used to generate new (vicinal) training samples for both source and target languages, which are then also pseudo-labeled and co-distilled using the two task models (θ(j), θ(k)) to generate D̃s and D̃t. The third model θ(l) is then progressively trained on these datasets: {D_s, D_t} in epoch 1, D̃t in epoch 2, and all of them in epoch 3.

Figure 2: Validation F1 results in XNER for multi-epoch co-teaching training of XLA.

(a) Without confidence penalty. (b) With confidence penalty.

Figure 3: Effect of training with confidence penalty in the warm-up step on target (Spanish) language XNER classification using t-SNE plots. From the visualization, it can be seen that the model trained with confidence penalty shows better inter-class separation which exhibits robustness of the multilingual model.

Figure 4: Histogram of loss distribution on target (Spanish) language XNER classification.

(a) Without confidence penalty. (b) With confidence penalty.

Figure 5: Scatter plot of loss distribution on target (Spanish) language XNER classification.

Figure 6: Distribution of selected sentence lengths on target (Spanish) language XNER classification.

Huang et al. (2019) propose to use auxiliary tasks such as cross-lingual word recovery and paraphrase detection for pre-training. Recently, Conneau et al. (2020) trained the largest multilingual language model, with a 24-layer transformer encoder, 1024 hidden dimensions, and 550M parameters. Keung et al. (2019) use adversarial fine-tuning of mBERT to achieve a more language-invariant contextual representation for cross-lingual NER and MLDoc document classification.

Vicinal risk minimization. One of the fundamental challenges in deep learning is to train models that generalize well to examples outside the training distribution. The widely used Empirical Risk Minimization (ERM) principle, where models are trained to minimize the average training error, has been shown to be insufficient for generalization to distributions that differ even slightly from the training data (Szegedy et al., 2014; Zhang et al., 2018). Data augmentation, supported by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001), can be an effective choice for achieving better out-of-training generalization.

F1 scores in XNER on datasets from CoNLL and Bari et al. (2020). "-" indicates that no results were reported for the setup.

XNER results on WikiANN

Results in accuracy for XNLI.

Results in accuracy for PAWS-X.

Analysis of distillation on XNER. Results after epoch-1 training that uses {D s , D t }.

F1 scores on XNER for all target languages.

Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298-1308, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1131. URL https://www.aclweb.org/anthology/N19-1131.

Training data size and number of model parameters of Cross-lingual Language Models.

Statistics of training, development and test datasets in different languages for XNER.

Statistics of training, development and test datasets in different languages for XNLI.

Statistics of training, development and test datasets in different languages for PAWS-X.

Results in accuracy for the PAWS-X task (mean ± standard deviation over three seeds; each row lists English first, followed by the six target languages; "-" indicates no result):

XLM-R (our imp.): 95.46 ± 0.36, 90.06 ± 0.59, 89.92 ± 0.54, 90.85 ± 0.71, 79.89 ± 1.17, 79.74 ± 1.47, 82.49 ± 0.82
XLM-R+con-penalty: 95.38 ± 0.15, 90.75 ± 0.29, 90.72 ± 0.56, 91.71 ± 0.31, 81.77 ± 0.63, 82.07 ± 0.54, 84.25 ± 0.36
XLA: -, 92.27 ± 0.75, 92.28 ± 0.16, 92.85 ± 0.35, 83.88 ± 0.49, 84.27 ± 0.23, 86.90 ± 0.35
XLM-R+con-penalty: 91.85 ± 0.70, 86.15 ± 1.37, 86.38 ± 1.02, 85.98 ± 0.44, 76.03 ± 1.51, 75.43 ± 1.32, 79.15 ± 1.14
XLA: -, 89.05 ± 0.85, 90.27 ± 0.38, 90.12 ± 0.28, 80.50 ± 0.73, 79.60 ± 0.43, 82.65 ± 0.44

F RESULTS (EXTENDED VERSION)

We include detailed results for the CoNLL-XNER, XNLI, and PAWS-X datasets to compare with the previous literature. We also provide standard deviations over three different random seeds.

