MODEL SELECTION FOR CROSS-LINGUAL TRANSFER USING A LEARNED SCORING FUNCTION

Anonymous

Abstract

Transformers pre-trained on multilingual text corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer learning results. In the zero-shot cross-lingual transfer setting, only English training data is assumed, and the fine-tuned model is evaluated on another target language. No target-language validation data is assumed in this setting; however, substantial variance has been observed in target-language performance across fine-tuning runs. Prior work has relied on English validation/development data to select among models fine-tuned with different learning rates, numbers of training steps, and other hyperparameters, often resulting in suboptimal choices. In this paper, we show that it is possible to select consistently better models when small amounts of annotated data are available in an auxiliary pivot language. We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments, we find that our approach consistently selects better models than English validation data across five languages and five well-studied NLP tasks, achieving results comparable to using small amounts of target-language development data.

1. INTRODUCTION

Pre-trained Transformers (Vaswani et al., 2017; Devlin et al., 2019) have led to state-of-the-art results on a wide range of NLP tasks, for example named entity recognition, relation extraction, and question answering, often approaching human inter-rater agreement (Joshi et al., 2020a). These models have also been shown to learn effective cross-lingual representations, even without access to parallel text or bilingual lexicons (Wu & Dredze, 2019; Pires et al., 2019). Multilingual pre-trained Transformers, such as mBERT and XLM-RoBERTa (Conneau et al., 2019), support surprisingly effective zero-shot cross-lingual transfer, where training and development data are assumed only in a high-resource source language (e.g., English), and performance is evaluated on another target language. Because no target-language annotations are assumed in this setting, source-language data is typically used to select among models fine-tuned with different hyperparameters and random seeds. However, recent work has shown that English dev accuracy does not always correlate well with target-language performance (Keung et al., 2020). In this paper, we propose an alternative strategy for model selection in the zero-shot setting. Our approach, dubbed Learned Model Selection (LMS), learns a function that scores the compatibility between a fine-tuned multilingual Transformer and a target language. The compatibility score is computed from features of the multilingual model's learned representations and of the target language. A model's features are derived from its own internal representations, aggregated over an unlabeled target-language text corpus. These model-specific features capture how the cross-lingual representations transfer to the target language after fine-tuning on source-language data.
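The aggregation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `encode` function is a toy stand-in for a fine-tuned multilingual encoder, and the pooling strategy (a mean over all token representations) is one simple way to obtain a fixed-size model feature vector from an unlabeled target-language corpus.

```python
import numpy as np

def model_features(encode, batches):
    """Mean-pool an encoder's hidden states over unlabeled target-language
    batches to obtain one fixed-size feature vector for a fine-tuned model."""
    total, n_tokens = None, 0
    for tokens in batches:                  # tokens: (n_tok, dim) input embeddings
        hidden = encode(tokens)             # (n_tok, dim) contextual states
        batch_sum = hidden.sum(axis=0)
        total = batch_sum if total is None else total + batch_sum
        n_tokens += hidden.shape[0]
    return total / n_tokens                 # average over every token in the corpus

# Toy "encoder": one random linear map + tanh, standing in for mBERT layers.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
encode = lambda x: np.tanh(x @ W)

batches = [rng.standard_normal((20, 16)) for _ in range(5)]
feats = model_features(encode, batches)
print(feats.shape)  # (16,)
```

In practice the hidden states would come from one or more layers of the fine-tuned Transformer itself, so that models fine-tuned with different hyperparameters yield different feature vectors on the same unlabeled text.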
In addition to model-specific representations, we also make use of learned language embeddings from the lang2vec package (Malaviya et al., 2017), which have been shown to encode typological information, for example whether a language has prepositions or postpositions. To measure the compatibility between a multilingual model's fine-tuned representations and a target language, the model- and language-specific representations are combined in a bilinear layer. Parameters of the scoring function are optimized to minimize a pairwise ranking loss on a set of held-out models, where the gold ranking is computed using standard performance metrics, such as accuracy or F1, on a set of pivot languages (not including the target language). Our method assumes training data in English and small amounts of annotated data in one or more pivot languages (but not the target language). This corresponds to a scenario where a new multilingual NLP task must quickly be applied to a new language. LMS does not rely on any annotated data in the target language, yet it effectively learns to predict whether fine-tuned multilingual representations are a good match for that language. In experiments on five well-studied NLP tasks (part-of-speech tagging, named entity recognition, question answering, relation extraction, and event argument role labeling), we find that LMS consistently selects models with better target-language performance than those chosen using English dev data. Appendix A.5 demonstrates that our framework supports multi-task learning, which can be helpful in settings where some target-language annotations are available, but not for the desired task. Finally, we show that LMS generalizes to both mBERT and XLM-RoBERTa in Appendix A.4.
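The bilinear scoring function and pairwise ranking loss can be sketched as follows. This is an illustrative sketch under assumed dimensions (16-dimensional model features, 8-dimensional language embeddings) with random vectors standing in for real model features and lang2vec embeddings; the actual system trains on held-out fine-tuned models ranked by their pivot-language metrics.

```python
import numpy as np

rng = np.random.default_rng(1)
d_m, d_l = 16, 8                        # model-feature and language-embedding dims
B = rng.standard_normal((d_m, d_l)) * 0.1   # bilinear parameters to be learned

def score(m, l):
    """Bilinear compatibility between model features m and language embedding l."""
    return m @ B @ l

def hinge(m_good, m_bad, l, margin=1.0):
    """Pairwise ranking loss: the model with the higher pivot-language metric
    should outscore the worse one by at least `margin`."""
    return max(0.0, margin - score(m_good, l) + score(m_bad, l))

# One training pair: features of a better and a worse held-out model,
# plus the embedding of a pivot language (all random here for illustration).
m_good, m_bad = rng.standard_normal(d_m), rng.standard_normal(d_m)
l = rng.standard_normal(d_l)

before = hinge(m_good, m_bad, l)
for _ in range(50):                     # subgradient descent on the hinge loss
    if hinge(m_good, m_bad, l) > 0.0:
        B += 0.01 * np.outer(m_good - m_bad, l)
after = hinge(m_good, m_bad, l)
```

At test time, candidate fine-tuned models are ranked by `score` against the target language's embedding, and the top-scoring model is selected.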

2. BACKGROUND: ZERO-SHOT CROSS LINGUAL TRANSFER

The zero-shot setting considered in this paper works as follows. A Transformer model is first pre-trained using a standard masked language model objective. The only difference from the monolingual approach to contextual word representations (Peters et al., 2018; Devlin et al., 2019) is the pre-training corpus, which contains text written in multiple languages. For example, mBERT is trained on Wikipedia text written in 104 languages. After pre-training on a multilingual corpus, the resulting Transformer encodes language-independent representations that support surprisingly effective cross-lingual transfer, simply by fine-tuning the pre-trained parameters on English training data. For example, after fine-tuning mBERT on the English portion of the CoNLL Named Entity Recognition dataset, the resulting model can be used to perform inference directly on Spanish text, achieving an F1 score of around 75 and outperforming prior work using cross-lingual word embeddings (Xie et al., 2018; Mikolov et al., 2013). A challenge with this approach, however, is the relatively high variance in performance across training runs. Although the mean F1 score on Spanish is 75, the performance of 60 models fine-tuned with different learning rates and random seeds ranges from around 70 to 78 F1 (Figure 3). In zero-shot learning, no validation/development data is assumed in the target language, motivating the need for a machine learning approach to model selection.
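The model-selection problem this creates can be made concrete with a small example. The scores below are synthetic, chosen only to illustrate the failure mode (they are not the paper's numbers): when English dev accuracy does not rank candidate models the same way as target-language performance, picking the English-dev argmax can leave substantial F1 on the table.

```python
import numpy as np

# Synthetic scores for 6 fine-tuning runs (illustrative, not from the paper).
english_dev = np.array([91.0, 90.5, 91.2, 90.8, 91.1, 90.9])  # dev accuracy
spanish_f1  = np.array([74.0, 77.5, 70.9, 76.2, 72.3, 78.0])  # target test F1

pick = int(np.argmax(english_dev))    # standard zero-shot selection rule
oracle = int(np.argmax(spanish_f1))   # best possible choice (unavailable in practice)
regret = spanish_f1[oracle] - spanish_f1[pick]
print(pick, oracle, regret)           # English dev selects a weak target-language model
```

Here the run with the best English dev accuracy is among the worst on the target language, which is exactly the gap a learned scoring function aims to close.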



We will make our code and data available upon publication. lang2vec: https://github.com/antonisa/lang2vec



Figure 1: An illustration of our approach to selecting the best model for zero-shot cross-lingual transfer. (a) Prior work selects the best model using source-language development data. (b) LMS: a learned function scores fine-tuned models based on their hidden-layer representations when encoding unlabeled target-language data.

