UNIMAX: FAIRER AND MORE EFFECTIVE LANGUAGE SAMPLING FOR LARGE-SCALE MULTILINGUAL PRE-TRAINING

Abstract

Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UNIMAX, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UNIMAX outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UNIMAX sampling.

1. INTRODUCTION

State-of-the-art multilingual models (Xue et al., 2021; 2022; Goyal et al., 2021, inter alia) utilize large-scale self-supervised learning, which involves jointly training on many languages. Because data availability varies greatly across languages, multilingual pretraining can be characterized as multitask learning (or multi-objective optimization) with severe data imbalance. Typically, English is the highest-resource language (or task), often orders of magnitude larger than the lower-resource languages. For example, in the mC4 corpus (Xue et al., 2021), English has roughly 9.7 trillion characters, over 92,000 times more than the lowest-resource language, Yoruba. As a result, a key problem in designing such models is the "language balancing" problem: in what proportions should we balance the pretraining languages? Deriving the optimal balance is a difficult open research problem due to the high cost of pretraining. The standard approach has been to upsample languages with smaller datasets using a temperature hyperparameter τ (Devlin et al., 2019). However, one shortcoming of this approach is that choosing τ based on the desired distribution among higher-resource languages may cause examples from the lowest-resource languages to be repeated excessively. Figure 1a shows the number of epochs covered for each language in the mC4 corpus. With τ = 3.33 and a trillion-token budget (the values used in popular models such as mT5 and ByT5), the lowest-resource languages are repeated over 100 times. This excessive repetition can have several unwanted consequences: (i) it leads to overfitting, which degrades performance on downstream tasks (Raffel et al., 2020; Lee et al., 2022; Hernandez et al., 2022), (ii) it increases the risk of memorizing private or sensitive content (Carlini et al., 2021; Lee et al., 2022), and (iii) it wastes training cycles that could have been devoted to unique examples.
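The effect described above can be sketched numerically. Under temperature sampling, language i is sampled with probability proportional to (n_i / N)^(1/τ), where n_i is its corpus size. The sketch below uses the English and Yoruba figures quoted above; the mid-resource corpus size and the training budget are hypothetical stand-ins, not mC4 statistics.

```python
# Temperature sampling sketch: p_i ∝ n_i ** (1 / tau).
# Only the English and Yoruba sizes come from the text above;
# "hi" and the budget are hypothetical illustrative values.
corpus_chars = {
    "en": 9.7e12,            # ~9.7 trillion characters (from the text)
    "hi": 2.0e11,            # hypothetical mid-resource language
    "yo": 9.7e12 / 92_000,   # Yoruba, ~92,000x smaller than English
}

tau = 3.33           # temperature used by mT5/ByT5
budget_chars = 4e12  # hypothetical training budget, in characters

# Exponentiate sizes by 1/tau, then normalize to a distribution.
weights = {k: v ** (1 / tau) for k, v in corpus_chars.items()}
total = sum(weights.values())
probs = {k: w / total for k, w in weights.items()}

# Epochs over each corpus = characters sampled from it / corpus size.
epochs = {k: budget_chars * probs[k] / corpus_chars[k] for k in corpus_chars}
for lang in corpus_chars:
    print(f"{lang}: p={probs[lang]:.3f}, epochs={epochs[lang]:.1f}")
```

Even at this modest (hypothetical) budget, the lowest-resource language is repeated hundreds of times while English is covered less than once, which is the imbalance Figure 1a illustrates.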
As models continue to grow in scale (Chowdhery et al., 2022; Brown et al., 2020; Smith et al., 2022), these issues with temperature sampling grow more pressing, as larger models benefit from longer training (Hoffmann et al., 2022), overfit more easily, and have a greater capacity to memorize.
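To make the capping idea concrete, here is a minimal sketch of a cap-based budget allocation in the spirit of the approach described above: visit languages from smallest corpus to largest, give each an equal share of the remaining budget, but never more than a fixed number of epochs over its corpus. This is our own illustrative sketch, not the paper's exact procedure; the function name, corpus sizes, and budget are assumptions.

```python
def capped_uniform_budget(corpus_chars, total_budget, max_epochs):
    """Split a character budget as uniformly as possible across languages,
    capping each language at max_epochs passes over its corpus.
    Sketch only; names and numbers are illustrative."""
    alloc = {}
    remaining = total_budget
    # Visit languages from smallest corpus to largest, so caps bind first
    # and the freed-up budget flows to the higher-resource languages.
    langs = sorted(corpus_chars, key=corpus_chars.get)
    for i, lang in enumerate(langs):
        uniform_share = remaining / (len(langs) - i)
        cap = max_epochs * corpus_chars[lang]
        alloc[lang] = min(uniform_share, cap)
        remaining -= alloc[lang]
    return alloc

# Illustrative corpora (characters); only the English figure is from the text.
sizes = {"en": 9.7e12, "hi": 2.0e11, "yo": 1.05e8}
alloc = capped_uniform_budget(sizes, total_budget=4e12, max_epochs=4)
for lang, chars in alloc.items():
    print(f"{lang}: {chars:.2e} chars, {chars / sizes[lang]:.2f} epochs")
```

Under this allocation no language exceeds its epoch cap, so tail corpora are never repeated excessively, while the budget the caps free up is spent on head languages with ample unique data.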

