SAME PRE-TRAINING LOSS, BETTER DOWNSTREAM: IMPLICIT BIAS MATTERS FOR LANGUAGE MODELS

Abstract

Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The (validation) pre-training loss (or perplexity, in autoregressive language modeling) is often used as the evaluation metric when developing language models, since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate that pre-training algorithms/optimizers have an implicit bias: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and we empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove, in a synthetic language setting, that among the models with minimal pre-training loss, the flattest model transfers to downstream tasks.

1. INTRODUCTION

Large language models (LLMs) trained on internet-scale data have improved performance on a wide array of downstream tasks (Devlin et al., 2018; Yang et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). These models are trained with a language modeling pre-training loss to "fill in the blanks": either predicting the next token/word (autoregressive language modeling loss, or perplexity) or masked tokens (masked language modeling (MLM) loss). In common practice, the validation pre-training loss is used to monitor the training process (Brown et al., 2020; Zhang et al., 2022a) and to compare different models, since the pre-training loss is generally strongly correlated with downstream performance (Hernandez et al., 2021). Moreover, theoretical works on understanding LLMs also focus on how the pre-training loss affects downstream performance. Saunshi et al. (2020), Wei et al. (2021), and Xie et al. (2021) show that good pre-training loss, or fitting the language modeling conditional probability well, is a main reason for the downstream success of LLMs. Their analyses generally treat the language models as black boxes and do not take into account how the models represent the conditional probability.

In this paper, we question the conventional wisdom about the correlation between validation pre-training loss and downstream performance in language modeling. Recent works have demonstrated that models with different architectures may have the same pre-training loss but different performance (Saunshi et al., 2022; Tay et al., 2021). Due to the expressivity of modern neural nets, many parameter configurations even within the same architecture can have the same pre-training loss. A priori, it is unclear why all these configurations should have the same downstream performance. We find that different parameter configurations with the same pre-training loss can indeed have different downstream performance, especially when the pre-training loss reaches a near-optimal level.
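As a concrete reference for the metric discussed above, the autoregressive pre-training loss is the mean negative log-likelihood of each next token under the model, and perplexity is the exponential of that loss. A minimal sketch (the function name and inputs are illustrative, not from this paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).

    token_log_probs: log-probabilities the model assigned to the
    observed next tokens on held-out (validation) text.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Sanity check: a uniform model over a vocabulary of size V assigns
# every token log-probability -log(V), so its perplexity is exactly V.
uniform_lp = [math.log(1 / 50)] * 10
print(perplexity(uniform_lp))  # 50.0 (up to floating point)
```

Lower perplexity means the model assigns higher probability to the held-out text; the paper's point is that this single scalar can be identical across models that nonetheless transfer very differently.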
Concretely, using simplified text datasets, we find three situations that demonstrate such a phenomenon:




