UNIMAX: FAIRER AND MORE EFFECTIVE LANGUAGE SAMPLING FOR LARGE-SCALE MULTILINGUAL PRE-TRAINING

Abstract

Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance training across languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UNIMAX, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UNIMAX outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UNIMAX sampling.

1. INTRODUCTION

State-of-the-art multilingual models (Xue et al., 2021; 2022; Goyal et al., 2021, inter alia) utilize large-scale self-supervised learning, which involves jointly training on many languages. Because data availability varies greatly across languages, multilingual pretraining can be characterized as multitask learning (or multi-objective optimization) with severe data imbalance. Typically English is the highest-resource language (or task), with orders of magnitude more data than lower-resource languages. For example, in the mC4 corpus (Xue et al., 2021), English has roughly 9.7 trillion characters, over 92,000 times more than the lowest-resource language, Yoruba. As a result, a key problem in designing such models is the "language balancing" problem: in what proportions should we balance the pretraining languages? Deriving the optimal balance is a difficult open research problem due to the high cost of pretraining. The standard approach has been to upsample languages with smaller datasets, using a temperature hyperparameter τ (Devlin et al., 2019). However, one shortcoming of this approach is that choosing τ based on the desired distribution among higher-resource languages may result in examples from the lowest-resource languages being repeated excessively. Figure 1a shows the number of epochs covered for each language in the mC4 corpus. When using τ = 3.33 and a trillion token budget (the values used in popular models such as mT5 and ByT5), the lowest-resource languages are repeated over 100 times. This excessive repetition can have several unwanted consequences: (i) it leads to overfitting, which degrades performance on downstream tasks (Raffel et al., 2020; Lee et al., 2022; Hernandez et al., 2022), (ii) it increases the risk of memorizing private or sensitive content (Carlini et al., 2021; Lee et al., 2022), and (iii) it wastes training cycles that could have been devoted to unique examples.
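To make the mechanics concrete, the following sketch computes temperature-sampling probabilities (exponentiate each language's corpus share by 1/τ, then renormalize) and the implied number of epochs over each corpus under a fixed budget. The corpus sizes and budget here are toy numbers for illustration, not the real mC4 counts.

```python
def temperature_sample_probs(char_counts, tau=3.33):
    """Temperature-based sampling: raise each language's share of the
    corpus to the power 1/tau and renormalize. tau=1 recovers the raw
    data distribution; tau -> infinity approaches uniform sampling."""
    total = sum(char_counts.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in char_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy corpus sizes in characters (illustrative, not the real mC4 counts).
char_counts = {"en": 9_700_000, "zh": 140_000, "yo": 105}
probs = temperature_sample_probs(char_counts)

# Under a fixed training budget, the implied number of epochs over each
# language's corpus is (sampling probability * budget) / corpus size.
budget = 1_000_000
epochs = {lang: probs[lang] * budget / n for lang, n in char_counts.items()}
# The smallest corpus is repeated hundreds of times while English sees
# well under one epoch -- the imbalance described above.
```

Even at this toy scale, the tail language is upsampled by three orders of magnitude relative to its raw share, which is exactly the repetition problem Figure 1a illustrates on real mC4 data.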
As models continue to grow in scale (Chowdhery et al., 2022; Brown et al., 2020; Smith et al., 2022), these issues with temperature sampling grow more pressing, as larger models benefit from longer training (Hoffmann et al., 2022), overfit more easily, and have a greater capacity to memorize.

Figure 1: The x-axis is the rank of each language by character count. "1/8 budget" refers to 250,000 training steps with sequence length 512, one-eighth of the full training budget (1M steps with sequence length 1024), referred to as 1x and matching that of mT5.

This paper proposes a new paradigm for sampling across languages and datasets that ameliorates the above-mentioned problems. We propose UNIMAX (uniform + max), a conceptually simple but highly effective two-pronged sampling approach that yields fairer and more effective language distributions for pretraining multilingual language models, and that works well across model scales. One of our main assumptions is that practical large-scale training jobs operate with a fixed amount of compute, which often translates into a fixed training token budget (Raffel et al., 2020). UNIMAX starts by pre-allocating training tokens to underrepresented datasets based on the maximum number of allowed repeats (N). For the remaining budget, we prioritize "linguistic utility" (Blasi et al., 2022) by allocating uniformly across all languages with sufficient data to avoid exceeding the prescribed number of per-language epochs. Unlike previous approaches, this makes UNIMAX relatively resistant to distribution biases that arise as artifacts of the corpus generation process (i.e., web crawlers). To take a concrete example, the mC4 corpus contains 70× more English than Chinese text.
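Read literally, the two-pronged allocation just described can be sketched as follows. This is a hedged illustration under our own reading of the procedure, with hypothetical corpus sizes and budget; `unimax_budget` is a name we introduce here, not code from the released implementation.

```python
def unimax_budget(char_counts, budget, max_epochs):
    """Allocate a fixed training budget across languages.

    Visiting languages from smallest corpus to largest, each language
    receives an equal share of the budget still remaining, unless that
    share would repeat its corpus more than `max_epochs` times, in which
    case its allocation is capped and the surplus flows to the larger
    languages that follow."""
    alloc = {}
    remaining = budget
    langs = sorted(char_counts, key=char_counts.get)  # ascending by corpus size
    for i, lang in enumerate(langs):
        share = remaining / (len(langs) - i)          # uniform share of what's left
        alloc[lang] = min(share, max_epochs * char_counts[lang])
        remaining -= alloc[lang]
    return alloc

# Hypothetical sizes: "en" is 10x "zh", which is 10x "yo".
alloc = unimax_budget({"en": 1000, "zh": 100, "yo": 10}, budget=300, max_epochs=4)
# "yo" is capped at 4 epochs (40 tokens); "zh" and "en" split the rest
# equally (130 each), despite the 10x difference in their corpus sizes.
```

The cap on the tail language binds, so its unused share is redistributed, while the two head languages with sufficient data receive identical allocations regardless of their raw sizes, which is the uniform-coverage behavior described above.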
While mT5's temperature sampling (τ = 3.33) results in training on 3.4× more English than Chinese, UNIMAX assigns equal training tokens to the two languages, provided this does not repeat the 39 billion available Chinese tokens more than N times. Another key benefit of UNIMAX is that it is robust to model scaling. In considering language sampling strategies at scale, it is important to carefully control how many times a dataset can be repeated during training to avoid overfitting and memorization. Our proposed method explicitly controls the extent of data repeats for any language, providing a direct countermeasure to overfitting on low-resource languages without imposing any reprioritization on higher-resource languages.

Our key contributions are to: (1) Propose UNIMAX, a simple but effective language sampling strategy that provides more uniform coverage of high-resource languages while mitigating overfitting on low-resource languages. (2) Perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. (3) Release an improved and refreshed variant of the mC4 multilingual corpus consisting of 29 trillion characters across 107 languages. (4) Release pretrained model checkpoints using UNIMAX sampling.¹

2. RELATED WORK

While (massively-)multilingual models enjoy the benefits of positive transfer across languages, the sheer number of languages reduces the effective capacity of the model per task. This competition



¹ https://github.com/google-research/t5x/blob/main/docs/models.md



(a) Number of training epochs for each language. Temperature sampling results in a large number of data repeats for low-resource languages, whereas UNIMAX explicitly caps repeats. (b) Pretraining sampling distribution. Temperature sampling results in poorly balanced distributions, whereas UNIMAX provides more uniform distributions without excessive upsampling.

