MODEL SELECTION FOR CROSS-LINGUAL TRANSFER USING A LEARNED SCORING FUNCTION

Anonymous

Abstract

Transformers pre-trained on multilingual text corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer learning results. In the zero-shot cross-lingual transfer setting, only English training data is assumed, and the fine-tuned model is evaluated on another target language. No target-language validation data is assumed in this setting; however, substantial variance has been observed in target-language performance across different fine-tuning runs. Prior work has relied on English validation/development data to select among models fine-tuned with different learning rates, numbers of steps, and other hyperparameters, often resulting in suboptimal choices. In this paper, we show that it is possible to select consistently better models when small amounts of annotated data are available in an auxiliary pivot language. We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments, we find that our approach consistently selects better models than English validation data across five languages and five well-studied NLP tasks, achieving results comparable to using small amounts of target-language development data.^1

1. INTRODUCTION

Pre-trained Transformers (Vaswani et al., 2017; Devlin et al., 2019) have led to state-of-the-art results on a wide range of NLP tasks, for example named entity recognition, relation extraction, and question answering, often approaching human inter-rater agreement (Joshi et al., 2020a). These models have also been shown to learn effective cross-lingual representations, even without access to parallel text or bilingual lexicons (Wu & Dredze, 2019; Pires et al., 2019). Multilingual pre-trained Transformers, such as mBERT and XLM-RoBERTa (Conneau et al., 2019), support surprisingly effective zero-shot cross-lingual transfer, where training and development data are assumed only in a high-resource source language (e.g. English), and performance is evaluated on another target language. Because no target-language annotations are assumed in this setting, source-language data is typically used to select among models fine-tuned with different hyperparameters and random seeds. However, recent work has shown that English development accuracy does not always correlate well with target-language performance (Keung et al., 2020).

In this paper, we propose an alternative strategy for model selection in the zero-shot setting. Our approach, dubbed Learned Model Selection (LMS), learns a function that scores the compatibility between a fine-tuned multilingual Transformer and a target language. The compatibility score is computed from features of the multilingual model's learned representations and of the target language. A model's features are derived from its own internal representations, by aggregating those representations over an unlabeled target-language text corpus. These model-specific features capture how well the cross-lingual representations transfer to the target language after fine-tuning on source-language data.
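The feature-extraction step described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function name `model_features` is hypothetical, and a random stand-in replaces a real Transformer encoder. The sketch shows the aggregation itself, mean-pooling hidden states over tokens within each sentence and then averaging over the unlabeled target-language corpus to obtain one fixed-size vector per fine-tuned model.

```python
import numpy as np

def model_features(encode, corpus, dim=768):
    """Aggregate a fine-tuned encoder's hidden states over an unlabeled
    target-language corpus into a single fixed-size feature vector:
    mean-pool over tokens within each sentence, then average sentences."""
    pooled = []
    for sentence in corpus:
        hidden = encode(sentence)           # (num_tokens, dim) hidden states
        pooled.append(hidden.mean(axis=0))  # mean-pool over tokens
    return np.stack(pooled).mean(axis=0)    # average over the corpus

# Stand-in encoder: random hidden states in place of a real Transformer.
rng = np.random.default_rng(0)
fake_encode = lambda s: rng.normal(size=(len(s.split()), 768))

corpus = ["ein kleiner Beispielsatz", "noch ein Satz"]
feats = model_features(fake_encode, corpus)
print(feats.shape)  # (768,)
```

In practice, `encode` would be the fine-tuned multilingual model's forward pass (e.g. its final-layer hidden states), so the resulting vector reflects how that particular fine-tuned model represents the target language.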
In addition to model-specific representations, we also make use of learned language embeddings from the lang2vec package (Malaviya et al., 2017),^2 which have been shown to encode typological information, for example whether a language has prepositions or postpositions. To measure compatibility between



^1 We will make our code and data available on publication.
^2 https://github.com/antonisa/lang2vec

