A MATHEMATICAL EXPLORATION OF WHY LANGUAGE MODELS HELP SOLVE DOWNSTREAM TASKS

Abstract

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are ε-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with O(√ε) error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.

1. INTRODUCTION

The construction of increasingly powerful language models has revolutionized natural language processing (NLP). Using gigantic text corpora and a cross-entropy objective, language models are trained to predict a distribution over the next word to follow a given context (piece of text). Pretrained language models are useful for many downstream NLP tasks, either as initializations (Ramachandran et al., 2017; Howard & Ruder, 2018) or as a source of contextual word embeddings (McCann et al., 2017; Peters et al., 2018). Recent models (Radford et al., 2019; Brown et al., 2020) have even bypassed the need for careful fine-tuning and have demonstrated strong performance on downstream tasks without fine-tuning.

This work aims to understand this incredible success of language models. Since next word prediction is a powerful test of language understanding, at an intuitive level it is believable that doing well on language modeling can help with many diverse NLP tasks. At the same time, it is quite intriguing how improvements in the test perplexity of language models translate to better downstream performance. Attempting to understand this phenomenon naturally raises the following questions: (a) why should training on the next-word prediction task, with the cross-entropy objective, result in useful features for downstream tasks? (b) what role do inductive biases of the model architecture and training algorithms play in this empirical success? Given the nascency of deep learning theory, it is very challenging to say anything mathematically precise about (b) for deep networks. This paper therefore focuses on a mathematical study of (a), exploring if and how quantitative improvements on downstream NLP tasks can be mathematically guaranteed for language models that do well on the cross-entropy objective.
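To make the pretraining setup concrete, the following is a minimal NumPy sketch (vocabulary size, feature dimension, and all values are hypothetical, not from the paper's experiments): a context feature f(s) and an output word-embedding matrix Φ parametrize the next-word distribution via softmax, and the cross-entropy objective is the negative log-likelihood of the observed next word.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy sizes: vocabulary V, feature dimension d.
rng = np.random.default_rng(0)
V, d = 50, 8
Phi = rng.normal(size=(V, d))  # output word-embedding matrix

def next_word_distribution(f_s):
    """Softmax parametrization: p(w | s) proportional to exp(<phi_w, f(s)>)."""
    return softmax(Phi @ f_s)

def cross_entropy_loss(f_s, w):
    """Per-example cross-entropy: negative log-probability of next word w."""
    return -np.log(next_word_distribution(f_s)[w])

f_s = rng.normal(size=d)       # feature of some context s (stand-in for a network)
loss = cross_entropy_loss(f_s, w=3)
```

A trained language model minimizes the average of this loss over (context, next word) pairs from the corpus; the feature map f is what gets reused downstream.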
As a first cut analysis, we restrict attention to text classification tasks and the striking observation that they can be solved fairly well with linear classifiers on top of fixed language model features, i.e. without finetuning (Table 1). Although we treat models as black boxes, just first-order optimality conditions of the cross-entropy objective reveal interesting properties of learned features, leading to an understanding of their success on classification tasks. Insights from the analysis help us construct a simple objective (Quad) that provably learns useful features for classification tasks, as we also verify empirically.

We summarize our contributions along with an overview of the paper below. In Section 2, we set up notation and formally describe language modeling and the ubiquitous low-dimensional softmax parametrization, along with a description of the cross-entropy objective and properties of its optimal solutions. We then describe the observation, in Section 3.1, that text classification tasks of interest can be reformulated as sentence completion tasks. Amenability to such a reformulation is mathematically formalized (Section 3.2) as the classification task being a natural task: a task that can be solved linearly using the conditional distribution over words following an input text. Section 4 presents our main results, Theorems 4.1 and 4.2, which use the above formalization to mathematically quantify the utility of language model features on natural tasks: an ε-optimal language model (in cross-entropy) will do O(√ε)-well on such tasks. Theorem 4.2 shows a stronger result for low-dimensional softmax models by leveraging a new tool, conditional mean features (Definition 4.1), which we show (Section 6) to be effective in practice. The usefulness of the language model features themselves is demonstrated by arguing a weak linear relationship between them and conditional mean features.
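The natural-task formalization can be illustrated with a toy sketch (the vocabulary, prompt, and probabilities below are made up for illustration, not taken from the paper): after appending a prompt such as "This movie is" to an input review, a sentiment label is a linear function of the model's conditional next-word distribution, with the linear classifier supported on indicative completion words.

```python
import numpy as np

# Toy vocabulary; v is a linear classifier over next-word probabilities
# that puts +1 on "good" and -1 on "bad" (a natural task is one solvable
# by such a v applied to p(. | input + prompt)).
vocab = ["good", "bad", "the", "a", "film"]
v = np.array([1.0, -1.0, 0.0, 0.0, 0.0])

def classify(p_next):
    """Predict +1/-1 from the conditional next-word distribution."""
    return 1 if v @ p_next >= 0 else -1

# Hypothetical model outputs after "… This movie is" for two reviews.
p_pos = np.array([0.40, 0.10, 0.20, 0.20, 0.10])  # positive review
p_neg = np.array([0.05, 0.50, 0.20, 0.15, 0.10])  # negative review
```

A better language model assigns these completion probabilities more accurately, which is the intuitive bridge from low cross-entropy to low classification error.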
In Section 5.2, we present a new mathematically motivated objective (Quad) that has formal guarantees. Experiments in Section 6 verify the sentence completion reformulation idea and the good performance of conditional mean features on standard benchmarks.
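As a rough sketch of what conditional mean features look like computationally (dimensions and values are hypothetical; see Definition 4.1 for the formal object): given the softmax output word-embedding matrix Φ and the model's conditional distribution p(· | s), the conditional mean feature of a context s is Φᵀ p(· | s), the expected output word embedding under the predicted next-word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
Phi = rng.normal(size=(V, d))     # output word embeddings (V x d)

logits = rng.normal(size=V)       # hypothetical model logits for context s
p = np.exp(logits - logits.max())
p /= p.sum()                      # conditional distribution p(. | s)

# Conditional mean feature: expectation of phi_w under p(. | s).
cond_mean_feature = Phi.T @ p     # d-dimensional representation of s
```

A linear classifier is then trained on these d-dimensional features rather than on the full V-dimensional distribution.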

1.1. RELATED WORK

Text embedding methods: Prior to language models, large text corpora like Wikipedia (Merity et al., 2016) were used to learn low-dimensional embeddings for words (Mikolov et al., 2013b;a; Pennington et al., 2014) and subsequently for sentences (Kiros et al., 2015; Arora et al., 2017; Pagliardini et al., 2018; Logeswaran & Lee, 2018) for downstream task usage. These methods were inspired by the distributional hypothesis (Firth, 1957; Harris, 1954), which posits that the meaning of text is determined in part by the surrounding context. Recent methods like BERT (Devlin et al., 2018) and variants (Lan et al., 2019; Yang et al., 2019; Liu et al., 2019) learn models from auxiliary tasks, such as sentence completion, and are among the top performers on downstream tasks. In this work we consider autoregressive models and make a distinction from masked language models like BERT; Table 2 shows that language model and BERT features have comparable performance.

Language models for downstream tasks: We are interested in language models (Chen & Goodman, 1999), especially those that use neural networks to compute low-dimensional features for contexts and parametrize the next word distribution using softmax (Xu & Rudnicky, 2000; Bengio et al., 2003). Language models have been shown to be useful for downstream tasks as initializations (Ramachandran et al., 2017; Howard & Ruder, 2018) or as learned feature maps (Radford et al., 2017; McCann et al., 2017; Peters et al., 2018). The idea of phrasing classification tasks as sentence completion problems to use language models is motivated by recent works (Radford et al., 2019; Puri & Catanzaro, 2019; Schick & Schütze, 2020) that show that many downstream tasks can be solved by next word prediction with an appropriately conditioned language model.
This idea also shares similarities with works that phrase a suite of downstream tasks as question-answering tasks (McCann et al., 2018) or text-to-text tasks (Raffel et al., 2019), and symbolic reasoning as fill-in-the-blank tasks (Talmor et al., 2019). Our work exploits this prevalent idea of task rephrasing to theoretically analyze why language models succeed on downstream tasks.

Relevant theory: Since the success of early word embedding algorithms like word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014), there have been attempts to understand them theoretically. Levy & Goldberg (2014) argue that the word2vec algorithm implicitly factorizes the PMI matrix. Noise Contrastive Estimation (NCE) theory is used to understand word embeddings (Dyer, 2014) and to show parameter recovery for negative sampling based conditional models (Ma & Collins, 2018). A latent variable model (Arora et al., 2016) is used to explain and unify various word embedding algorithms. Theoretical justification is provided for sentence embedding methods either by using a latent variable model (Arora et al., 2017) or through the lens of compressed sensing (Arora et al., 2018). Also relevant is recent work on theory for contrastive learning (Arora et al., 2019; Tosh et al., 2020b;a; Wang & Isola, 2020) and reconstruction-based methods (Lee et al., 2020), which analyze the utility of self-supervised representations for downstream tasks. Our work is the first to analyze the efficacy of language model features on downstream tasks.

