A MATHEMATICAL EXPLORATION OF WHY LANGUAGE MODELS HELP SOLVE DOWNSTREAM TASKS

Abstract

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are ε-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with O(√ε) error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.

1. INTRODUCTION

The construction of increasingly powerful language models has revolutionized natural language processing (NLP). Using gigantic text corpora and a cross-entropy objective, language models are trained to predict a distribution over the next word to follow a given context (piece of text). Pretrained language models are useful for many downstream NLP tasks, either as initializations (Ramachandran et al., 2017; Howard & Ruder, 2018) or as a source of contextual word embeddings (McCann et al., 2017; Peters et al., 2018). Recent models (Radford et al., 2019; Brown et al., 2020) have even bypassed the need for careful fine-tuning and have demonstrated strong performance on downstream tasks without fine-tuning. This work aims to understand this incredible success of language models.

Since next word prediction is a powerful test of language understanding, at an intuitive level it is believable that doing well on language modeling can help with many diverse NLP tasks. At the same time, it is quite intriguing how improvements in the test perplexity of language models translate to better downstream performance. Attempting to understand this phenomenon naturally raises the following questions: (a) why should training on the next-word prediction task, with the cross-entropy objective, result in useful features for downstream tasks? (b) what role do inductive biases of the model architecture and training algorithms play in this empirical success? Given the nascency of deep learning theory, it is very challenging to say anything mathematically precise about (b) for deep networks. Given these difficulties, this paper focuses on the mathematical study of (a) by exploring if and how quantitative improvements on downstream NLP tasks can be mathematically guaranteed for language models that do well on the cross-entropy objective.
As a first-cut analysis, we restrict attention to text classification tasks and the striking observation that they can be solved fairly well with linear classifiers on top of fixed language model features, i.e. without fine-tuning (Table 1). Although we treat models as black boxes, just the first-order optimality conditions of the cross-entropy objective reveal interesting properties of learned features, leading to an understanding of their success on classification tasks. Insights from the analysis help us construct a simple objective
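The reformulation of classification as sentence completion can be illustrated with a minimal sketch: append a prompt to the input text and compare the language model's next-word probabilities for the label words. The prompt template, label words, and the hard-coded toy distribution below are illustrative assumptions for exposition, not the paper's construction or a real pretrained model.

```python
# Toy sketch: sentiment classification as next-word prediction.
# The prompt template and probabilities are illustrative assumptions.

def next_word_probs(context):
    """Stand-in for a pretrained language model's next-word
    distribution p(word | context); hard-coded toy values."""
    if "loved" in context:
        return {"good": 0.7, "bad": 0.1, "the": 0.2}
    return {"good": 0.1, "bad": 0.6, "the": 0.3}

def classify(review, labels=("good", "bad")):
    """Reformulate classification as sentence completion:
    append a prompt and pick the label word the model assigns
    the higher next-word probability."""
    prompt = review + " Overall, the movie was"
    probs = next_word_probs(prompt)
    return max(labels, key=lambda w: probs.get(w, 0.0))

print(classify("I loved every minute of it."))  # -> good
print(classify("A dull, lifeless film."))       # -> bad
```

With a real language model, `next_word_probs` would be replaced by the model's softmax output over the vocabulary; the zero-shot classifier is then a fixed linear readout over those output probabilities, which is the setting the analysis formalizes.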

