SAME PRE-TRAINING LOSS, BETTER DOWNSTREAM: IMPLICIT BIAS MATTERS FOR LANGUAGE MODELS

Abstract

Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The (validation) pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models, since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate the existence of an implicit bias of pre-training algorithms and optimizers: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove, in a synthetic language setting, that among the models with the minimal pre-training loss, the flattest model transfers best to downstream tasks.

1. INTRODUCTION

Large language models (LLMs) trained on internet-scale data have improved performance on a wide array of downstream tasks (Devlin et al., 2018; Yang et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). These models are trained with a language modeling pre-training loss to "fill in the blanks": either predicting the next token/word (autoregressive language modeling loss, or perplexity) or masked tokens (masked language modeling (MLM) loss). In common practice, the validation pre-training loss is used to monitor the training process (Brown et al., 2020; Zhang et al., 2022a) and to compare different models, since the pre-training loss is generally strongly correlated with downstream performance (Hernandez et al., 2021). Moreover, theoretical works on understanding LLMs also focus on how the pre-training loss affects downstream performance. Saunshi et al. (2020); Wei et al. (2021); Xie et al. (2021) show that good pre-training loss, or fitting the language modeling conditional probability well, is a main reason for the downstream success of LLMs. Their analyses generally treat the language models as blackboxes and do not take into account how the models represent the conditional probability.

In this paper, we question the conventional wisdom on the correlation between the validation pre-training loss and downstream performance for language modeling. Recent works have demonstrated that models with different architectures may have the same pre-training loss but different performance (Saunshi et al., 2022; Tay et al., 2021). Due to the expressivity of modern neural nets, many parameter configurations even within the same architecture can still have the same pre-training loss. A priori, it is unclear why all these configurations should have the same downstream performance.
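To make the pre-training metric concrete, the sketch below computes the autoregressive language modeling loss (average negative log-likelihood of the observed next tokens) and the corresponding perplexity. The probabilities here are a hypothetical model output used purely for illustration, not from any model in the paper.

```python
import math

def autoregressive_nll(token_probs):
    """Average negative log-likelihood over a sequence.

    token_probs: the probabilities a (hypothetical) model assigned
    to each observed next token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(autoregressive_nll(token_probs))

# A model that assigns probability 0.25 to every observed next token
# has perplexity 4, i.e., it is as uncertain as a uniform choice over 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))  # → 4.0 (up to floating-point error)
```

The minimal achievable value of this loss is the entropy of the true conditional distribution, which is the "saturation regime" baseline discussed below.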
We find that different parameter configurations with the same pre-training loss can indeed have different downstream performance, especially when the pre-training loss reaches a near-optimal level. Concretely, using simplified text datasets, we find three situations that demonstrate this phenomenon:

• Even after the pre-training loss converges, models from later time steps still tend to perform better.
• Models trained by standard algorithms have better performance than adversarially trained models with the same pre-training loss.
• Larger models tend to perform better downstream than smaller models, even when they have the same pre-training loss.

These situations are most prominent in the saturation regime, where the models are close to the minimal possible pre-training loss (i.e., the entropy of the conditional probability, which can be estimated in our simplified datasets). In the saturation regime, the pre-training loss of all models is almost the same, but the transferability to downstream tasks varies. Interestingly, this phenomenon also holds when linear probing on contextualized representations is used to evaluate downstream performance instead of finetuning. Thus, even though the predicted conditional probabilities of two models are the same (and correct), their contextualized representations can behave differently.

In each of the first two cases above, we find two models with the same pre-training loss and the same architecture, but one performs better than the other; they differ only in the training algorithms used to produce them. This suggests that the training algorithms have an implicit bias toward one of these models: standard algorithms with more training steps bias toward parameter configurations that transfer better to downstream tasks. The third case has a more subtle but similar interpretation.
There exists a hypothetical large model that represents the smaller model with worse performance (by padding zeros to the smaller model). The training algorithm on the large architecture could have chosen it, but did not. This suggests the algorithm has an implicit bias against the hypothetical model (which has an equally good loss).

In supervised settings, optimizers are known to have an implicit bias toward selecting generalizable models among all models with small empirical loss; e.g., see Damian et al. (2021) and Li et al. (2021), which show that SGD implicitly biases toward flatter minima, and references therein. However, the role of implicit bias in self-supervised learning has not been studied and is conceptually different. Unlike in supervised learning, the gap between the empirical and population self-supervised losses is typically small, so implicit bias does not seem to contribute to bridging this gap. Instead, the implicit bias selects local minima of the population self-supervised loss that transfer better to downstream tasks.

Why do the algorithms bias toward some types of models? In Section 3, we provide a first-cut theoretical analysis of the implicit bias in language modeling. Fortunately, despite the conceptual differences, mathematical tools from supervised settings can be straightforwardly adapted to the language modeling setting. We prove that mini-batch SGD prefers flatter minima of the population pre-training loss among all minima in the saturation regime. Interestingly, we obtain cleaner theoretical results for standard mini-batch SGD, without the artificial label noise introduced in prior works (Damian et al., 2021; Li et al., 2021), partly because the mini-batch noise for LLMs does not vanish even at convergence. We corroborate our theory with empirical evidence in Section 4.
We show that for models with the same pre-training loss in the three situations above, the flatness of the model (measured by the trace of the Hessian of the loss, as predicted by the theory) strongly correlates with downstream performance.

Finally, to complement the theory and experiments above, we rigorously formalize the connection between flatness and downstream performance in a simplified Dyck language setting in Section 5. In this setting, we prove that there are many models with good MLM pre-training loss; among them, the flattest model learns the most useful features for downstream tasks. Here, results from the supervised setting cannot be readily adapted, since they are obtained (partially) via generalization bounds (Wei & Ma, 2019a;b), which do not apply to the language modeling setting, where the implicit bias is not related to the gap between the empirical and population loss. Proving the correlation between flatness and downstream performance in more general settings likely requires highly non-trivial and novel theoretical tools, and we hope to motivate future work on this topic.
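Since the trace of the Hessian is used as the flatness measure, it is worth noting that for large models it is typically estimated rather than computed exactly, commonly via Hutchinson's estimator, tr(H) = E[vᵀHv] for random Rademacher vectors v, with Hessian-vector products obtained without forming H. The sketch below illustrates the idea on a toy diagonal quadratic loss (an illustration only, not the paper's experimental setup), using finite-difference Hessian-vector products.

```python
import random

random.seed(0)

# Toy loss L(w) = 0.5 * sum(a_i * w_i^2); its Hessian is diag(a),
# so the exact trace is sum(a) = 10.0. A stand-in for a pre-training loss.
a = [1.0, 2.0, 3.0, 4.0]

def grad(w):
    # Gradient of the toy quadratic loss.
    return [ai * wi for ai, wi in zip(a, w)]

def hvp(w, v, eps=1e-4):
    # Finite-difference Hessian-vector product: (g(w + eps*v) - g(w)) / eps.
    g0 = grad(w)
    g1 = grad([wi + eps * vi for wi, vi in zip(w, v)])
    return [(g1i - g0i) / eps for g1i, g0i in zip(g1, g0)]

def hutchinson_trace(w, n_samples=100):
    # tr(H) = E[v^T H v] when v has i.i.d. Rademacher (+/-1) entries.
    total = 0.0
    for _ in range(n_samples):
        v = [random.choice([-1.0, 1.0]) for _ in a]
        total += sum(vi * hi for vi, hi in zip(v, hvp(w, v)))
    return total / n_samples

w = [0.1] * len(a)
print(hutchinson_trace(w))  # ≈ sum(a) = 10.0
```

In practice, frameworks with automatic differentiation provide exact Hessian-vector products via double backpropagation, so the finite-difference `hvp` here would be replaced accordingly; only a modest number of Rademacher samples is needed for a useful trace estimate.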

2. THE EXISTENCE OF IMPLICIT BIAS IN LANGUAGE MODELING

In this section, we systematically investigate the relationship between pre-training loss and downstream performance through experiments. We find that models with the same pre-training loss but different training procedures can have different downstream performance.

