A MATHEMATICAL EXPLORATION OF WHY LANGUAGE MODELS HELP SOLVE DOWNSTREAM TASKS

Abstract

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $\epsilon$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $O(\sqrt{\epsilon})$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.

1. INTRODUCTION

The construction of increasingly powerful language models has revolutionized natural language processing (NLP). Using gigantic text corpora and a cross-entropy objective, language models are trained to predict a distribution over the next word to follow a given context (piece of text). Pretrained language models are useful for many downstream NLP tasks, either as initializations (Ramachandran et al., 2017; Howard & Ruder, 2018) or as a source of contextual word embeddings (McCann et al., 2017; Peters et al., 2018) . Recent models (Radford et al., 2019; Brown et al., 2020) have even bypassed the need for careful fine-tuning and have demonstrated strong performance on downstream tasks without fine-tuning. This work aims to understand this incredible success of language models. Since next word prediction is a powerful test of language understanding, at an intuitive level it is believable that doing well on language modeling can help with many diverse NLP tasks. At the same time, it is quite intriguing how improvements in the test perplexity of language models translate to better downstream performance. Attempting to understand this phenomenon naturally raises the following questions: (a) why should training on the next-word prediction task, with the cross-entropy objective, result in useful features for downstream tasks? (b) what role do inductive biases of the model architecture and training algorithms play in this empirical success? Given the nascency of deep learning theory, it is very challenging to say anything mathematically precise about (b) for deep networks. Given these difficulties, this paper focusses on the mathematical study of (a) by exploring if and how quantitative improvements on downstream NLP tasks can be mathematically guaranteed for language models that do well on the cross-entropy objective. 
As a first cut analysis, we restrict attention to text classification tasks and the striking observation that they can be solved fairly well with linear classifiers on top of fixed language model features, i.e. without finetuning (Table 1). Although we treat models as black boxes, just the first-order optimality conditions of the cross-entropy objective reveal interesting properties of learned features, leading to an understanding of their success on classification tasks. Insights from the analysis help us construct a simple objective (Quad) that provably learns useful features for classification tasks, as also verified empirically. We summarize our contributions along with an overview of the paper below.

In Section 2, we set up notation and formally describe language modeling and the ubiquitous low-dimensional softmax parametrization, along with a description of the cross-entropy objective and properties of its optimal solutions. We then describe the observation, in Section 3.1, that text classification tasks of interest can be reformulated as sentence completion tasks. Amenability to such a reformulation is mathematically formalized (Section 3.2) as the classification task being a natural task: a task that can be solved linearly using the conditional distribution over words following an input text. Section 4 presents our main results, Theorems 4.1 and 4.2, that use the above formalization to mathematically quantify the utility of language model features on natural tasks: an $\epsilon$-optimal language model (in cross-entropy) will do $O(\sqrt{\epsilon})$-well on such tasks. Theorem 4.2 shows a stronger result for low-dimensional softmax models by leveraging a new tool, conditional mean features (Definition 4.1), which we show (Section 6) to be effective in practice. The usefulness of the language model features themselves is demonstrated by arguing a weak linear relationship between them and conditional mean features.
In Section 5.2, we present a new mathematically motivated objective (Quad) that has formal guarantees. Experiments in Section 6 verify the sentence completion reformulation idea and the good performance of conditional mean features on standard benchmarks.

1.1. RELATED WORK

Text embedding methods: Prior to language models, large text corpora like Wikipedia (Merity et al., 2016) were used to learn low-dimensional embeddings for words (Mikolov et al., 2013b; a; Pennington et al., 2014) and subsequently for sentences (Kiros et al., 2015; Arora et al., 2017; Pagliardini et al., 2018; Logeswaran & Lee, 2018) for downstream task usage. These methods were inspired by the distributional hypothesis (Firth, 1957; Harris, 1954) , which posits that meaning of text is determined in part by the surrounding context. Recent methods like BERT (Devlin et al., 2018) and variants (Lan et al., 2019; Yang et al., 2019; Liu et al., 2019) learn models from auxiliary tasks, such as sentence completion, and are among the top performers on downstream tasks. In this work we consider autoregressive models and make a distinction from masked language models like BERT; Table 2 shows that language model and BERT features have comparable performances. Language models for downstream tasks: We are interested in language models (Chen & Goodman, 1999), especially those that use neural networks to compute low-dimensional features for contexts and parametrize the next word distribution using softmax (Xu & Rudnicky, 2000; Bengio et al., 2003) . Language models have shown to be useful for downstream tasks as initializations (Ramachandran et al., 2017; Howard & Ruder, 2018) or as learned feature maps (Radford et al., 2017; McCann et al., 2017; Peters et al., 2018) . The idea of phrasing classification tasks as sentence completion problems to use language models is motivated by recent works (Radford et al., 2019; Puri & Catanzaro, 2019; Schick & Schütze, 2020 ) that show that many downstream tasks can be solved by next word prediction for an appropriately conditioned language model. 
This idea also shares similarities with work that phrase a suite of downstream tasks as question-answering tasks (McCann et al., 2018) or text-to-text tasks (Raffel et al., 2019) and symbolic reasoning as fill-in-the-blank tasks (Talmor et al., 2019) . Our work exploits this prevalent idea of task rephrasing to theoretically analyze why language models succeed on downstream tasks. Relevant theory: Since the success of early word embedding algorithms like word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) , there have been attempts to understand them theoretically. Levy & Goldberg (2014) argue that word2vec algorithm implicitly factorizes the PMI matrix. Noise Contrastive Estimation (NCE) theory is used to understand word embeddings (Dyer, 2014) and to show parameter recovery for negative sampling based conditional models (Ma & Collins, 2018) . A latent variable model (Arora et al., 2016) is used to explain and unify various word embedding algorithms. Theoretical justification is provided for sentence embedding methods either by using a latent variable model (Arora et al., 2017) or through the lens of compressed sensing (Arora et al., 2018) . Also relevant is recent work on theory for contrastive learning (Arora et al., 2019; Tosh et al., 2020b; a; Wang & Isola, 2020) and reconstruction-based methods (Lee et al., 2020) , which analyze the utility of self-supervised representations learned for downstream tasks. Our work is the first to analyze the efficacy of language model features on downstream tasks.

2. LANGUAGE MODELING AND OPTIMAL SOLUTIONS

We use $S$ to denote the discrete set of all contexts, i.e. complete or partial sentences (prefixes), and $W$ to denote the vocabulary of words, with $V = |W|$ being the vocabulary size. For a discrete set $A$, let $\Delta_A$ denote the set of distributions on $A$. We use $p, p_L \in \Delta_S$ to denote probability distributions over $S$, and $p_{\cdot|s}, p^*_{\cdot|s} \in \Delta_W$ to denote conditional distributions, where $p_{\cdot|s}(w)$ is the predicted probability of word $w$ following context $s$ and $p^*_{\cdot|s}(w)$ denotes the true conditional probability. Boldface $\mathbf{p}_{\cdot|s}, \mathbf{p}^*_{\cdot|s} \in \mathbb{R}^V$ denote the vectors of probabilities for $p_{\cdot|s}, p^*_{\cdot|s} \in \Delta_W$. For $v \in \mathbb{R}^V$, $v(w)$ indexes the coordinate for $w \in W$; in particular, $\mathbf{p}_{\cdot|s}(w)$ is the probability of $w$ according to $p_{\cdot|s}$. We use $\phi_w \in \mathbb{R}^d$ to denote a $d$-dimensional embedding for word $w$; word embeddings are stacked as the columns of $\Phi \in \mathbb{R}^{d \times V}$. We use $f : S \to \mathbb{R}^d$ for a feature map from contexts to $d$-dimensional embeddings, e.g. $f(s)$ can be the output of a Transformer model for input context $s \in S$. For embeddings $\{\theta_s\}_{s \in S}$ with $\theta_s \in \mathbb{R}^D$ (any $D$), we use $\{\theta_s\}$ to denote $g : S \to \mathbb{R}^D$ such that $g(s) = \theta_s$.

2.1. LANGUAGE MODELING USING CROSS-ENTROPY

A language model aims to learn the true distribution of a text corpus, and a popular approach to do so is next word prediction. Given a context (e.g., a sentence $s \in S$), it predicts a distribution $p_{\cdot|s}$ over the word to follow; e.g. for the context "The food was ", the model could place high probabilities on words "delicious", "expensive", "bland", etc. We use $p_L$ to denote the true distribution over the context set $S$ in the language modeling corpus. A standard approach is to minimize the expected cross-entropy loss between the true distribution $p^*_{\cdot|s}$ and the model prediction $p_{\cdot|s}$. We define the cross-entropy loss for a language model with output vectors of probabilities $\{p_{\cdot|s}\}_{s \in S}$ as

$\ell_{\text{xent}}(\{p_{\cdot|s}\}) = \mathbb{E}_{s \sim p_L} \mathbb{E}_{w \sim p^*_{\cdot|s}} \left[-\log(p_{\cdot|s}(w))\right] = \mathbb{E}_{s \sim p_L} \left[\ell_{\text{xent},s}(p_{\cdot|s})\right]$

To understand what language models learn, we look at the optimal solution of the cross-entropy objective. While one cannot practically hope to learn the optimal solution due to optimization, statistical and expressivity limitations, the optimal solution at least tells us the best that language modeling can hope to do. A well-known property of the cross-entropy objective, referred to below as Proposition 2.1, is that its optimal solution is $p_{\cdot|s} = p^*_{\cdot|s}$ for every $s \in \text{support}(p_L)$.
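The optimality property above can be checked numerically for a single context. The following is a minimal sketch, not part of the paper's experiments: the two distributions are made-up values over a hypothetical 4-word vocabulary, and the helper names are ours. It uses the standard identity that the excess cross-entropy of a prediction over the optimum equals the KL divergence from the true distribution, which is non-negative and zero only at the true distribution.

```python
import math

def cross_entropy(p_true, p_model):
    """Per-context cross-entropy E_{w ~ p_true}[-log p_model(w)]."""
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model) if pt > 0)

def kl_divergence(p_true, p_model):
    """KL(p_true || p_model) for distributions given as lists."""
    return sum(pt * math.log(pt / pm) for pt, pm in zip(p_true, p_model) if pt > 0)

# Hypothetical 4-word vocabulary: true and predicted next-word distributions.
p_star = [0.5, 0.3, 0.15, 0.05]
p_model = [0.4, 0.4, 0.1, 0.1]

# Excess cross-entropy of the model over the optimum p_star.
excess = cross_entropy(p_star, p_model) - cross_entropy(p_star, p_star)
kl = kl_divergence(p_star, p_model)
print(f"excess cross-entropy = {excess:.6f}, KL = {kl:.6f}")
```

The two printed quantities coincide up to floating-point error, which is why any prediction other than the true conditional distribution is strictly suboptimal on that context.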

2.2. SOFTMAX PARAMETRIZED LANGUAGE MODELING

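Unlike traditional language models like n-gram models, neural language models parametrize the conditional distribution $p_{\cdot|s}$ as a softmax computed using low dimensional embeddings. For an embedding $\theta \in \mathbb{R}^d$, the softmax distribution over $W$ using word embeddings $\Phi \in \mathbb{R}^{d \times V}$ is $p_{\theta,\Phi}(w) = e^{\theta^\top \phi_w}/Z_\theta$, where $Z_\theta = \sum_{w' \in W} e^{\theta^\top \phi_{w'}}$ is the partition function. While $p_{\theta,\Phi}$ depends on $\Phi$, we will use $p_\theta$ instead whenever $\Phi$ is clear from context. Just like $p^*_{\cdot|s}$, we can interpret $p_\theta \in \mathbb{R}^V$ as a vector of probabilities for the distribution $p_\theta$.

We now describe the abstraction for softmax models that is applicable to most neural models. A language model first embeds a context $s$ into $f(s) \in \mathbb{R}^d$ using a feature map $f : S \to \mathbb{R}^d$ that is parametrized by an architecture of choice (e.g. Transformer (Vaswani et al., 2017)). The output conditional distribution is set to be the softmax distribution induced by the context embedding $f(s)$ and word embeddings $\Phi$, i.e. $p_{\cdot|s} = p_{f(s)}$. The cross-entropy in its familiar form is presented below:

$\ell_{\text{xent}}(f, \Phi) = \mathbb{E}_{s \sim p_L} \mathbb{E}_{w \sim p^*_{\cdot|s}} \left[-\log(p_{f(s)}(w))\right] = \mathbb{E}_{s \sim p_L} \left[ \mathbb{E}_{w \sim p^*_{\cdot|s}} [-f(s)^\top \phi_w] + \log(Z_{f(s)}) \right]$ (2)

We rewrite it as $\ell_{\text{xent}}(f, \Phi) = \mathbb{E}_{s \sim p_L} [\ell_{\text{xent},s}(f(s), \Phi)]$, where $\ell_{\text{xent},s}(\theta, \Phi) = \ell_{\text{xent},s}(p_{\theta,\Phi})$ is the cross-entropy loss for context $s$ using embedding $\theta$. The softmax analogue of Proposition 2.1, Proposition 2.2, states that for a fixed $\Phi$, a minimizer $f^*$ of $\ell_{\text{xent}}(f, \Phi)$, if it exists, satisfies $\Phi p_{f^*(s)} = \Phi p^*_{\cdot|s}$ for every $s \in \text{support}(p_L)$.

Unlike Proposition 2.1, $p_{f^*(s)} \in \mathbb{R}^V$ is only guaranteed to be equal to $p^*_{\cdot|s} \in \mathbb{R}^V$ on the $d$-dimensional subspace spanned by the rows of $\Phi \in \mathbb{R}^{d \times V}$. We may not learn $p^*_{\cdot|s}$ exactly when $d < V$, but this result at least guarantees learning $p^*_{\cdot|s}$ on a linear subspace determined by the word embeddings $\Phi$. This forms the basis for our main results later and is proved by using the first-order optimality condition, i.e. $\nabla_\theta \ell_{\text{xent},s}(f^*(s)) = 0$ for all $s \in S$. The gradient of the cross-entropy is

$\nabla_\theta \ell_{\text{xent},s}(\theta) = -\Phi p^*_{\cdot|s} + \frac{\nabla_\theta Z_\theta}{Z_\theta} = -\Phi p^*_{\cdot|s} + \Phi p_\theta$

Setting it to 0 completes the proof. We use the properties of optimal solutions to understand why language models help with classification tasks.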
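The first-order condition can be illustrated on a toy problem. The sketch below is our own illustration under stated assumptions: the sizes `d` and `V`, the random word embeddings, the Dirichlet-sampled target distribution, and the plain gradient-descent settings are all hypothetical choices, not taken from the paper. It minimizes the per-context cross-entropy over the context embedding and then compares the learned and true distributions, both after projection by the word embeddings and coordinate-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 3, 10                              # illustrative sizes with d < V
Phi = rng.normal(size=(d, V))             # word embeddings as columns
p_star = rng.dirichlet(5 * np.ones(V))    # hypothetical true next-word distribution

def softmax_probs(theta, Phi):
    """Softmax distribution p_theta over the vocabulary."""
    logits = Phi.T @ theta
    logits = logits - logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Gradient descent on the per-context cross-entropy;
# its gradient is Phi p_theta - Phi p_star.
theta = np.zeros(d)
for _ in range(20000):
    grad = Phi @ (softmax_probs(theta, Phi) - p_star)
    theta -= 0.05 * grad

p_theta = softmax_probs(theta, Phi)
print("max |Phi p_theta - Phi p*| :", np.abs(Phi @ p_theta - Phi @ p_star).max())
print("max |p_theta - p*|         :", np.abs(p_theta - p_star).max())
```

In this setup the projected vectors should agree closely after optimization, while the full distributions generally still differ because $d < V$.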

3. USING LANGUAGE MODELS FOR CLASSIFICATION TASKS

Sections 2.1 and 2.2 suggest that language models aim to learn p * •|s , or a low-dimensional projection Φp * •|s . Thus to understand why language models help with downstream tasks, a natural starting point is to understand how access to p * •|s can help with downstream tasks. In a thought experiment, we use oracle access to p * •|s for any s and demonstrate that sentence classification task can be solved by reformulating it as a sentence completion problem and using p * •|s to get completions to predict the label. This sentence completion reformulation is mathematically formalized as natural tasks.

3.1. SENTENCE COMPLETION REFORMULATION

For exposition, we consider the sentence classification task of sentiment analysis, where the inputs are movie reviews (subset of S) and labels belongs to {±1}, denoting positive and negative reviews. Classification task as sentence completion: Can we predict the label for a movie review s by using p * •|s ? One way is to use p * •|s to compare probabilities of ":)" and ":(" following a movie review and to predict sentiment based on which is higher. This seems like a reasonable strategy, since ":)" is likelier than ":(" to follow a positive movie review. One issue, however, is that p * •|s will place much higher probability on words that start sentences, like "The", rather than discriminative words useful for the task. To allow a larger set of grammatically correct completions, we can append a prompt like "This movie is " at the end of all movie reviews and query probabilities of indicative adjectives like good, bad, interesting, boring etc. that are better indicators of sentiment. This approach of adding a prompt can also work for other classification tasks. For the AG news dataset (Zhang et al., 2015) containing news articles from 4 categories (world, science/tech., sports, business), a prompt like "This article is about " can help solve the task. The theoretical and practical relevance of prompts is discussed in Theorem 4.1, and Section 6 respectively. We note that the choice of prompts and completion words is less important than the underlying idea of sentence completion reformulation and its formalization.

Solving tasks using a linear function of $p^*_{\cdot|s}$: The above process is actually a sub-case of using a linear classifier on top of $p^*_{\cdot|s} \in \mathbb{R}^V$. For sentiment analysis, if $w_+ = $ ":)" and $w_- = $ ":(", then the sign of $p^*_{\cdot|s}(w_+) - p^*_{\cdot|s}(w_-)$ can predict the sentiment. This strategy can be expressed as $v^\top p^*_{\cdot|s}$, where the linear classifier $v \in \mathbb{R}^V$ has $v(w_+) = 1$, $v(w_-) = -1$ and $v(w') = 0$ for $w' \in W \setminus \{w_+, w_-\}$. Similarly with the prompt, we can assign positive weights in $v$ to adjectives like "good" and negative weights to adjectives like "boring". Strength of sentiment in different adjectives (e.g., "good" vs "amazing") can be captured through different weights. This equivalence between the sentence completion reformulation and a linear classifier on $p^*_{\cdot|s}$ is further explored in Section D.1. Other tasks can be similarly solved with a different set of words for each class. We verify experimentally that SST and AG news tasks can be solved by a linear function of probabilities of just a small subset of words in Section 6 and for many other classification tasks in Section F.1, thus lending credibility to the sentence completion view.
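The linear-classifier view of sentence completion can be made concrete with a few lines of code. This is a minimal sketch with made-up numbers: the word weights and the two next-word distributions are hypothetical stand-ins for the classifier and for the true conditional distributions after a prompt such as "This movie is ".

```python
# Hypothetical completion weights v(w): positive for positive-sentiment words,
# negative for negative-sentiment words, zero (omitted) for all other words.
v = {"good": 1.0, "amazing": 2.0, "bad": -1.0, "boring": -1.0}

def completion_score(p_next, v):
    """Linear score v^T p, where p_next maps words to next-word probabilities."""
    return sum(weight * p_next.get(word, 0.0) for word, weight in v.items())

def predict_label(p_next, v):
    """Predict +1 (positive) or -1 (negative) from the sign of the score."""
    return 1 if completion_score(p_next, v) >= 0 else -1

# Made-up next-word probabilities following "<review> This movie is ".
p_after_positive_review = {"good": 0.20, "amazing": 0.10, "bad": 0.02, "boring": 0.03, "a": 0.65}
p_after_negative_review = {"good": 0.03, "amazing": 0.01, "bad": 0.15, "boring": 0.20, "a": 0.61}

print(predict_label(p_after_positive_review, v))
print(predict_label(p_after_negative_review, v))
```

Giving "amazing" a larger weight than "good" is one way to encode strength of sentiment; any such weighting is still a linear function of the next-word probabilities.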

3.2. NATURAL CLASSIFICATION TASKS

We now translate the above sentence completion reformulation into a reasonable mathematical characterization for classification tasks of interest. Firstly we formally define text classification tasks and the standard metric for performance of linear classification on fixed features. A binary classification task $T$ is characterized by a distribution $p_T$ over $S \times \{\pm 1\}$, where the input $s$ is a piece of text from $S$ and the label $y$ is in $\{\pm 1\}$. Given a feature map $g : S \to \mathbb{R}^D$ (arbitrary $D$), $T$ is solved by fitting a linear classifier $v \in \mathbb{R}^D$ on top of $g(s)$, and the metric of classification loss is

$\ell_T(g, v) = \mathbb{E}_{(s,y) \sim p_T} \left[\ell(v^\top g(s), y)\right]; \quad \ell_T(g) = \inf_{v \in \mathbb{R}^D} \ell_T(g, v)$

where $\ell$ is a 1-Lipschitz surrogate to the 0-1 loss, like the hinge loss $\ell(\hat{y}, y) = (1 - y\hat{y})_+$ or the logistic loss $\ell(\hat{y}, y) = \log(1 + e^{-y\hat{y}})$. For given embeddings $\{\theta_s\}_{s \in S}$, the classification loss is written as $\ell_T(\{\theta_s\}, v) = \mathbb{E}_{(s,y) \sim p_T} [\ell(v^\top \theta_s, y)]$.

We now formalize classification tasks amenable to the sentence completion reformulation from Section 3.1 as $(\tau, B)$-natural tasks, i.e. tasks that achieve a small classification loss of $\tau$ by using a linear classifier with $\ell_\infty$-norm bounded by $B$ on top of features $p^*_{\cdot|s} \in \mathbb{R}^V$.

Definition 3.1. A classification task $T$ is $(\tau, B)$-natural if $\min_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \ell_T(\{p^*_{\cdot|s}\}, v) \le \tau$.

While we motivated this formalization of linear classification over $p^*_{\cdot|s}$ in Section 3.1, we provide a mathematical justification in Section D.1, along with interpretations for $\tau$ and $B$ that relate them to the Bayes optimal predictor and the probability mass of indicative words respectively. Low dimensional softmax models, however, only learn $p^*_{\cdot|s}$ in the subspace of $\Phi$, per Proposition 2.2. Thus we are also interested in the subset of tasks that this subspace can solve.

Definition 3.2. Task $T$ is $(\tau, B)$-natural w.r.t. $\Phi \in \mathbb{R}^{d \times V}$ if $\min_{v \in \text{row-span}(\Phi), \|v\|_\infty \le B} \ell_T(\{p^*_{\cdot|s}\}, v) \le \tau$.

Note that every task that is $(\tau, B)$-natural w.r.t. $\Phi$ is trivially $(\tau, B)$-natural, though the converse may not hold.
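To see how $\tau$ and $B$ interact in the definition of natural tasks, the following minimal sketch evaluates the empirical hinge loss of one family of classifiers on a toy task. Everything here is an assumption for illustration: the three contexts, their made-up conditional distributions over a 3-word vocabulary, and the choice of classifier $v = (B, -B, 0)$. The loss of any particular $v$ with $\ell_\infty$-norm at most $B$ only upper-bounds the minimum in Definition 3.1.

```python
def hinge_loss(y_hat, y):
    """Hinge loss (1 - y * y_hat)_+, a 1-Lipschitz surrogate to the 0-1 loss."""
    return max(0.0, 1.0 - y * y_hat)

def task_loss(features, labels, v):
    """Empirical classification loss: average hinge loss of the scores v^T g(s)."""
    total = 0.0
    for g_s, y in zip(features, labels):
        score = sum(vi * gi for vi, gi in zip(v, g_s))
        total += hinge_loss(score, y)
    return total / len(labels)

# Made-up true next-word probabilities over the vocabulary [w_plus, w_minus, other].
features = [[0.30, 0.05, 0.65], [0.10, 0.40, 0.50], [0.20, 0.10, 0.70]]
labels = [+1, -1, +1]

def classifier(B):
    """Classifier with infinity-norm exactly B that compares w_plus against w_minus."""
    return [B, -B, 0.0]

loss_B1 = task_loss(features, labels, classifier(1.0))
loss_B5 = task_loss(features, labels, classifier(5.0))
print(f"B = 1: loss {loss_B1:.4f};  B = 5: loss {loss_B5:.4f}")
```

Because the indicative words carry little probability mass in this toy example, a larger norm bound is needed before the hinge loss becomes small, which matches the interpretation of $B$ given in the text.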
However it can be argued that if Φ has some "nice properties", then (τ, B)-natural tasks of interest will roughly also be (τ, B)-natural w.r.t. Φ. Capturing the synonym structure of words can be such a nice property, as discussed in Section D.2. A better understanding of these properties of word embeddings Φ can potentially enable better performance of language models on downstream tasks. In fact, Section 5.2 describes a carefully designed objective that can learn word embeddings with desirable properties like synonyms having similar embeddings. In the subsequent sections, we use the above formalization to show guarantees for language models on natural tasks.

4. GUARANTEES FOR LANGUAGE MODELS ON NATURAL TASKS

We now show guarantees for features from language models on natural tasks in two cases: 1) for an arbitrary language model $\{p_{\cdot|s}\}$ where we use $V$-dimensional features $p_{\cdot|s} \in \mathbb{R}^V$ for downstream tasks and 2) for a softmax language model $(f, \Phi)$ where we use new $d$-dimensional features $\Phi p_{f(s)} \in \mathbb{R}^d$. Since we cannot practically hope to learn the optimal solutions described in Propositions 2.1 and 2.2, we only assume that the language models are $\epsilon$-optimal in cross-entropy. We first define $\ell^*_{\text{xent}}$ to be the minimum achievable cross-entropy and $\ell^*_{\text{xent}}(\Phi)$ to be the minimum achievable cross-entropy by a $d$-dimensional softmax language model using $\Phi$; clearly $\ell^*_{\text{xent}} \le \ell^*_{\text{xent}}(\Phi)$.

$\ell^*_{\text{xent}} = \ell_{\text{xent}}(\{p^*_{\cdot|s}\}), \quad \ell^*_{\text{xent}}(\Phi) = \mathbb{E}_{s \sim p_L} \left[\inf_{\theta \in \mathbb{R}^d} \ell_{\text{xent},s}(\theta, \Phi)\right]$

We first present the results for arbitrary language models with a proof sketch that describes the main ideas, following which we present our main results for softmax language models.

4.1. ARBITRARY LANGUAGE MODELS

We show guarantees for a language model that is $\epsilon$-optimal, i.e. $\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}} \le \epsilon$, on $(\tau, B)$-natural tasks. An important consideration is that the language model distribution $p_L$ of contexts is often a diverse superset of the downstream distribution $p_T$ (defined in Section 3.2) over sentences, thus requiring us to show how guarantees of $p_{\cdot|s} \approx p^*_{\cdot|s}$ on average over the distribution $s \sim p_L$ transfer to guarantees on a subset $p_T$. In the worst case, all of the error in cross-entropy by $\{p_{\cdot|s}\}$ is incurred on sentences from the subset $p_T$, leading to pessimistic bounds. In practice, however, the errors might be more evenly distributed across $p_L$, thus bypassing this worst case bound. As a first step, we present the worst case bound here; stronger guarantees are in Section 5.1. The worst-case coefficient $\gamma(p_T)$, defined below, captures that $p_T$ is a $\gamma(p_T)$-fraction of $p_L$.

$\gamma(p_T) = \sup\{\gamma \in (0, 1] : p_L(s) \ge \gamma p_T(s) \ \forall s \in S\}$ (5)

We now present our result that applies to any language model, regardless of the parametrization (e.g., n-gram models, softmax models). The result suggests that small test cross-entropy (hence test perplexity) is desirable to guarantee good classification performance, thus formalizing the intuition that better language models will be more useful for downstream tasks.

Theorem 4.1. Let $\{p_{\cdot|s}\}$ be a language model that is $\epsilon$-optimal, i.e. $\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}} \le \epsilon$, for some $\epsilon > 0$. For a classification task $T$ that is $(\tau, B)$-natural, we have

$\ell_T(\{p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T)}}$

This upper bounds the classification loss on task $T$ for $V$-dimensional features $\{p_{\cdot|s}\}$ from an $\epsilon$-optimal language model. We discuss factors that lead to a small upper bound and the corresponding intuitions.

• $\epsilon$ is small: the learned language model has smaller cross-entropy (log-perplexity)
• $\tau$ is small: the task can be solved well through a sentence completion reformulation with a set of indicative words as completions, as in Section 3.1, and has small Bayes error (cf. Section D.1)
• $B$ is small: the set of indicative words has high probability mass in $p^*_{\cdot|s}$ (cf. Section D.1). This could potentially explain the superior performance when prompts are added (Section 6).
• $\gamma(p_T)$ is large: $p_T$ is closer to $p_L$; note that $\gamma(p_T) \le 1$ with equality if and only if $p_T = p_L$

Thus the bound captures meaningful intuitions about good performance of language models on downstream tasks. We provide a detailed proof sketch in Section E.1, and a strengthened version of this (Theorem B.1) is presented in Section E.6. Proving this result requires connecting the classification loss with the language modeling cross-entropy loss and dealing with distribution mismatch; we present a rough outline to do so below. Since $T$ is $(\tau, B)$-natural, let $v^*$ be the classifier with $\|v^*\|_\infty \le B$ and $\ell_T(\{p^*_{\cdot|s}\}, v^*) \le \tau$. The result follows from the following 3 inequalities:

$\ell_T(\{p_{\cdot|s}\}, v^*) - \ell_T(\{p^*_{\cdot|s}\}, v^*) \le \sqrt{\mathbb{E}_{s \sim p_T} [(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2]}$ ... Lipschitzness + Jensen's

$\mathbb{E}_{s \sim p_T} [(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2] \le \gamma(p_T)^{-1} \, \mathbb{E}_{s \sim p_L} [(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2]$ ... Transfer $p_T$ to $p_L$

$\forall v \in \mathbb{R}^V, \ (v^\top(p_{\cdot|s} - p^*_{\cdot|s}))^2 \le 2\|v\|_\infty^2 \, (\ell_{\text{xent},s}(p_{\cdot|s}) - \ell_{\text{xent},s}(p^*_{\cdot|s}))$ ... Pinsker's inequality

The first and third inequalities (Lemma E.8 and Lemma E.3) connect the classification loss to the cross-entropy loss in language modeling, while the second inequality deals with distribution mismatch between $p_L$ and $p_T$. We now present a stronger result for softmax models.
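The quantities entering the bound of Theorem 4.1 are simple to compute on a toy example. The sketch below is an illustration under assumed values only: the three context types, their probabilities, and the values of $\tau$, $B$ and $\epsilon$ are hypothetical, and the helper names are ours. It computes the worst-case coefficient, evaluates the right-hand side of the bound, and checks the Pinsker step on one made-up context, using the fact that the per-context excess cross-entropy equals the KL divergence from the true distribution.

```python
import math

def gamma_worst_case(p_L, p_T):
    """Largest gamma in (0, 1] with p_L(s) >= gamma * p_T(s) for all s.

    Assumes every context with p_T(s) > 0 also has p_L(s) > 0.
    """
    ratios = [p_L[s] / p_T[s] for s in p_T if p_T[s] > 0]
    return min(1.0, min(ratios))

def theorem_4_1_bound(tau, B, eps, gamma):
    """Right-hand side tau + sqrt(2 B^2 eps / gamma) of Theorem 4.1."""
    return tau + math.sqrt(2.0 * B ** 2 * eps / gamma)

def kl_divergence(p_true, p_model):
    """KL(p_true || p_model), equal to the per-context excess cross-entropy."""
    return sum(pt * math.log(pt / pm) for pt, pm in zip(p_true, p_model) if pt > 0)

# Hypothetical corpus with three context types; the task only contains reviews.
p_L = {"review": 0.2, "news": 0.3, "code": 0.5}
p_T = {"review": 1.0}
gamma = gamma_worst_case(p_L, p_T)
bound = theorem_4_1_bound(tau=0.1, B=2.0, eps=0.01, gamma=gamma)
print(f"gamma(p_T) = {gamma:.2f}, bound on classification loss = {bound:.4f}")

# Pinsker step on one made-up context with a classifier comparing two words.
p_star = [0.5, 0.3, 0.15, 0.05]
p_model = [0.4, 0.4, 0.1, 0.1]
v = [1.0, -1.0, 0.0, 0.0]
lhs = sum(vi * (pm - ps) for vi, pm, ps in zip(v, p_model, p_star)) ** 2
rhs = 2.0 * max(abs(vi) for vi in v) ** 2 * kl_divergence(p_star, p_model)
print(f"(v^T(p - p*))^2 = {lhs:.4f} <= {rhs:.4f}")
```

With reviews making up a fifth of the corpus, the coefficient is 0.2, so the error term is inflated by a factor of $\sqrt{5}$ relative to the case where the task distribution equals the corpus distribution.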

4.2. SOFTMAX LANGUAGE MODEL WITH CONDITIONAL MEAN FEATURES

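We now consider a softmax language model with feature map $f$ that satisfies $\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi) \le \epsilon$; suboptimality is thus measured relative to the best $d$-dimensional softmax model using $\Phi$, rather than relative to $\ell^*_{\text{xent}}$ as in Theorem 4.1. Inspired by Proposition 2.2, our guarantees are for the $d$-dimensional features $\Phi p_{f(s)}$, called conditional mean features, and for tasks that are natural w.r.t. $\Phi$.

Definition 4.1 (Conditional mean features). For a feature map $f : S \to \mathbb{R}^d$ and $\Phi \in \mathbb{R}^{d \times V}$, we define conditional mean features $\Phi p_f : S \to \mathbb{R}^d$, where $\Phi p_f(s) = \Phi p_{f(s)}$ and $p_{f(s)} \in \mathbb{R}^V$.

We now present the result for softmax language models, which has similar implications as Theorem 4.1 but with the above-mentioned subtle differences.

Theorem 4.2. For a fixed $\Phi$, let $f$ be features from an $\epsilon$-optimal $d$-dimensional softmax language model, i.e. $\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi) \le \epsilon$. For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$,

$\ell_T(\Phi p_f) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T)}}$

This result guarantees good performance of conditional mean features $\Phi p_f$ on some natural tasks, thereby suggesting a novel way to extract features for downstream tasks. We empirically verify the good performance of $\Phi p_f(s)$ on classification tasks (Section 6) and also find an $O(\sqrt{\epsilon})$-like behavior (Section F.5). The proof (Section E.3) is similar to that of Theorem 4.1, the main difference being the use of the following inequality, proved using a softmax variant of Pinsker's inequality (Lemma E.4).

$\forall v \in \text{row-span}(\Phi), \ (v^\top(p_{f(s)} - p^*_{\cdot|s}))^2 \le 2\|v\|_\infty^2 \left( \ell_{\text{xent},s}(p_{f(s)}) - \inf_{f^*(s) \in \mathbb{R}^d} \ell_{\text{xent},s}(p_{f^*(s)}) \right)$

The more general result (Theorem 5.1) replaces $\gamma(p_T)$ with a more refined coefficient (Section 5.1). While guarantees are only for natural tasks w.r.t. $\Phi$, Section D.2 discusses why this might be enough for tasks of interest if the word embeddings $\Phi$ satisfy nice properties.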
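Conditional mean features are the expected word embedding under the model's next-word distribution, which takes only a few lines to compute. The sketch below is a minimal illustration with a hypothetical 3-word vocabulary and 2-dimensional embeddings; the function names and all numbers are our own and not from any released code.

```python
import math

def softmax_probs(theta, word_embeddings):
    """Softmax next-word distribution induced by context embedding theta."""
    logits = [sum(t * x for t, x in zip(theta, phi_w)) for phi_w in word_embeddings]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def conditional_mean_features(theta, word_embeddings):
    """Phi p_theta: the expected word embedding under the softmax distribution."""
    p = softmax_probs(theta, word_embeddings)
    d = len(theta)
    return [sum(p_w * phi_w[i] for p_w, phi_w in zip(p, word_embeddings))
            for i in range(d)]

# Hypothetical word embeddings (the columns of Phi), one list per word.
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# theta = 0 gives a uniform distribution, so the feature is the mean embedding.
features_at_zero = conditional_mean_features([0.0, 0.0], word_embeddings)
# A large theta concentrates mass on the best-aligned word.
features_peaked = conditional_mean_features([5.0, 5.0], word_embeddings)
print(features_at_zero, features_peaked)
```

The features have the same dimension as the context embedding, so a linear classifier on them has the same number of parameters as one on the context embedding itself.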

4.3. Φp_f(s) IS A LINEAR FUNCTION OF f(s)

Theorem 4.2 shows that $\Phi p_f$ is useful for linear classification. However, using the feature map $f$ directly is more standard and performs better in practice (Section 6). Here we argue that there is a linear relation between $f$ and $\Phi p_f$ if the word embeddings $\Phi$ satisfy a certain Gaussian-like property, which we show implies that tasks solvable linearly with $\Phi p_f$ are also solvable linearly using $f$.

Assumption 4.1. There exists a symmetric positive semidefinite matrix $A \in \mathbb{R}^{d \times d}$, a vector $b \in \mathbb{R}^d$ and a constant $c \in \mathbb{R}$ such that $\log(Z_\theta) = \frac{1}{2}\theta^\top A\theta + \theta^\top b + c$ for any $\theta \in \mathbb{R}^d$.

If word embeddings were distributed as Gaussians, i.e. the $V$ columns of $\Phi$ are sampled from $N(\mu, \Sigma)$ independently, it is not hard to show (Lemma E.1) that $\log(Z_\theta) \approx \frac{1}{2}\theta^\top \Sigma\theta + \theta^\top \mu + \log(V)$. While some papers (Arora et al., 2016; Mu & Viswanath, 2018) have noted that word embeddings are fairly random-like in the bulk to argue that the log partition function is constant for $\|\theta\|_2 = 1$, our quadratic assumption is a bit stronger. However, empirically we find the fit to be very good, as evident in Figure 1. Under the above assumption, we can show a linear relation between $f$ and $\Phi p_f$.

Lemma 4.3. Under Assumption 4.1, the feature map $f$ satisfies $\Phi p_f(s) = Af(s) + b$ for all $s \in S$.

Corollary 4.1. Under the same setting as Lemma 4.3 and Theorem 4.2, $\ell_T(f) \le \tau + O(B\sqrt{\epsilon})$.

This shows that $f$ itself is good for natural classification tasks. However, in practice, the linearity between $f$ and $\Phi p_f$ only weakly holds on features from pretrained GPT-2 (Radford et al., 2018). The fractional residual norm of the best linear fit, i.e.

$r = \frac{\mathbb{E}_{s \sim p} \|\Phi p_f(s) - Af(s) - b\|^2}{\mathbb{E}_{s \sim p} \|\Phi p_f(s)\|^2}$

measured for different distributions ($r = 0$ is a perfect fit), is 0.28 for SST, 0.39 for AG News, and 0.18 for IMDb contexts. This non-trivial linear relationship, although surprising, might not completely explain the success of $f$, which usually performs better than $\Phi p_f$; we leave exploring this to future work.
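The Gaussian motivation for Assumption 4.1 and the resulting linear relation can be probed with a small simulation. This is a rough sketch under our own assumptions, not the paper's experiment: word embeddings are sampled from a Gaussian with a hypothetical mean and a diagonal covariance, the sizes `d` and `V` are arbitrary, and the comparison is made at a handful of random context embeddings. Since the gradient of the log partition function is the conditional mean $\Phi p_\theta$, the quadratic form predicts $\Phi p_\theta \approx \Sigma\theta + \mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 50000                            # illustrative sizes
mu = 0.1 * rng.normal(size=d)              # hypothetical embedding mean
sigma2 = np.linspace(0.02, 0.08, d)        # diagonal of the covariance Sigma
Sigma = np.diag(sigma2)
# Columns of Phi are independent samples from N(mu, Sigma).
Phi = mu[:, None] + np.sqrt(sigma2)[:, None] * rng.normal(size=(d, V))

def log_partition(theta):
    """log Z_theta computed with a stable log-sum-exp."""
    logits = Phi.T @ theta
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def conditional_mean(theta):
    """Phi p_theta, the gradient of log Z_theta with respect to theta."""
    logits = Phi.T @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return Phi @ p

max_logZ_err, max_mean_err = 0.0, 0.0
for _ in range(5):
    theta = rng.normal(size=d)
    quad = 0.5 * theta @ Sigma @ theta + theta @ mu + np.log(V)
    max_logZ_err = max(max_logZ_err, abs(log_partition(theta) - quad))
    linear = Sigma @ theta + mu
    max_mean_err = max(max_mean_err, np.abs(conditional_mean(theta) - linear).max())

print("max |log Z - quadratic| :", max_logZ_err)
print("max |Phi p - (A theta + b)| :", max_mean_err)
```

The approximation relies on a large vocabulary and moderate values of $\theta^\top\Sigma\theta$; for real pretrained embeddings the fit has to be checked empirically, as the paper does in Figure 1.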

5.1. BETTER HANDLING OF DISTRIBUTIONAL SHIFT

The bounds in the previous section use the coefficient $\gamma(p_T)$ to transfer guarantees from $p_L$ to $p_T$, and we define a more refined notion of transferability here. The coefficient $\gamma(p_T)$ is independent of the learned model and assumes a worst case distribution of errors. For the refined coefficient, we first define the error made in predicted probabilities by a softmax language model $f$ as $\Delta_{\{p_{f(s)}\}}(s) = p_{f(s)} - p^*_{\cdot|s}$. For any distribution $p \in \Delta_S$, we define the uncentered covariance of a function $g : S \to \mathbb{R}^D$ as $\Sigma_p(g) = \mathbb{E}_{s \sim p} [g(s)g(s)^\top]$. The refined transferability coefficient is then defined as

$\gamma(p; \Phi p_f) := \left( \left\| \Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}})^{-\frac{1}{2}} \, \Sigma_p(\Phi\Delta_{\{p_{f(s)}\}}) \, \Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}})^{-\frac{1}{2}} \right\|_2 \right)^{-1}$

We state the refined result for softmax language models; detailed results are deferred to Section B.

Theorem 5.1 (Simplified). In the same setting as Theorem 4.2,

$\ell_T(\Phi p_f) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \Phi p_f)}}$

It is easy to show that $\gamma(p_T; \Phi p_f) \ge \gamma(p_T)$, so this is indeed a stronger bound. The coefficient $\gamma(p_T; \Phi p_f)$ measures how the average error of $f$ on $p_L$ can propagate to $p_T$. It can potentially be much larger than $\gamma(p_T)$, and the resulting bound much tighter, due to some inductive biases of $f$. For instance, if errors made by the model are random-like, i.e. $\Delta_{\{p_{f(s)}\}}(s) \sim \rho$, independently of $s$, then $\Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}}) \approx \Sigma_p(\Phi\Delta_{\{p_{f(s)}\}}) \approx \mathbb{E}_{\eta \sim \rho}[\eta\eta^\top]$, making $\gamma(p; \Phi p_f) \approx 1$. Independence prevents accumulation of language modeling error on contexts from $p_T$, bypassing the worst case transfer of $\gamma(p_T)$.
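The refined coefficient can be computed directly from per-context error vectors. The following sketch is our own illustration under strong simplifying assumptions: the projected errors $\Phi\Delta(s)$ are replaced by hypothetical i.i.d. Gaussian vectors (the "random-like" case discussed above), $p_L$ is uniform over 200 synthetic contexts, and $p_T$ is uniform over the first 20 of them.

```python
import numpy as np

def uncentered_cov(G, p):
    """Sigma_p(g) = E_{s ~ p}[g(s) g(s)^T], where row i of G is g(s_i)."""
    return (G * p[:, None]).T @ G

def inv_sqrt(M):
    """Inverse square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(M)
    return (U / np.sqrt(w)) @ U.T

def refined_gamma(G, p_L, p):
    """Inverse spectral norm of Sigma_{p_L}^{-1/2} Sigma_p Sigma_{p_L}^{-1/2}."""
    R = inv_sqrt(uncentered_cov(G, p_L))
    return 1.0 / np.linalg.norm(R @ uncentered_cov(G, p) @ R, 2)

rng = np.random.default_rng(0)
n, d = 200, 4
G = rng.normal(size=(n, d))        # hypothetical projected errors, one row per context
p_L = np.full(n, 1.0 / n)          # corpus distribution over contexts
p_T = np.zeros(n)
p_T[:20] = 1.0 / 20                # task distribution supported on 20 contexts

support = p_T > 0
gamma_worst = min(1.0, (p_L[support] / p_T[support]).min())
gamma_refined = refined_gamma(G, p_L, p_T)
print(f"worst-case gamma = {gamma_worst:.3f}, refined gamma = {gamma_refined:.3f}")
```

Because the synthetic errors do not concentrate on the task contexts, the refined coefficient comes out noticeably larger than the worst-case value of 0.1 in this setup, in line with the inequality stated above.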

5.2. QUAD: A NEW OBJECTIVE FUNCTION

In Definition 3.2 we discuss how low dimensional softmax language models learn a linear projection of $p^*_{\cdot|s}$, only solving tasks that lie in the row span of the word embeddings $\Phi$. Although $\Phi$ defines the tasks that language model features can solve, the standard cross-entropy objective does not lend a simple closed form expression for the optimal $\Phi$. This motivates the construction of our Quad objective, which has two nice properties: (1) the optimal feature map $f^*$ is a linear function of $p^*_{\cdot|s}$ and thus can solve some natural tasks, and (2) the optimal $\Phi^*$ has an intuitively meaningful closed-form solution.

$\ell_{\text{quad}}(f, \Phi) = \mathbb{E}_{s \sim p_L} \left[ \mathbb{E}_{w \sim p^*_{\cdot|s}} [-f(s)^\top \phi_w] + \frac{1}{2}\|\Phi^\top f(s)\|^2 \right]$ (6)

The Quad objective is very similar to the cross-entropy objective from Equation (2), with the log partition function replaced by a quadratic function, inspired in part by Assumption 4.1. We can derive the optimal solution $\Phi^*$, which depends on the eigendecomposition of a substitutability matrix.

Definition 5.1. The substitutability matrix is defined to be $\Omega^* := \mathbb{E}_{s \sim p_L} [p^*_{\cdot|s} \, p^{*\top}_{\cdot|s}] \in \mathbb{R}^{V \times V}$. If $\Omega^* = USU^\top$ is the eigendecomposition, then $U_d \in \mathbb{R}^{V \times d}$ is the matrix of the top $d$ eigenvectors of $\Omega^*$.

The matrix $\Omega^*$ captures substitutability between pairs of words. Words $w$ and $w'$ are substitutable if they have identical conditional probabilities for every context $s \in S$ and thus can replace occurrences of each other while still providing meaningful completions. By definition, these words satisfy $\Omega^*[w] = \Omega^*[w']$. Such pairs of words were called "free variants" in the work on distributional semantics (Harris, 1954), and capture the notion of synonyms; more in Section D.2.

Theorem 5.2. Let $f^*, \Phi^* = \arg\min_{f,\Phi} \ell_{\text{quad}}(f, \Phi)$. Then $\Phi^* = BU_d^\top$, for full rank $B \in \mathbb{R}^{d \times d}$. Also, for a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi^*$, we have $\ell_T(f^*) \le \tau$.

Thus $f^*$ excels on natural tasks w.r.t. $\Phi^*$, which, in turn, is the best $d$-dimensional projection of $\Omega^*$.
Thus words $w, w' \in W$ that are synonyms (hence substitutable) will satisfy $\phi^*_w = \phi^*_{w'}$, fulfilling the desired property for word embeddings discussed in Definition 3.2. We train using the Quad objective and compare its performance to a similarly trained GPT-2 language model. The results in Table 3 suggest that Quad performs comparably to $\Phi p_f$ from the cross-entropy objective, which fits our theory since both are linear functions of $p^*_{\cdot|s}$. Section F.3 has more details and experiments. The goal of testing Quad is to demonstrate that theoretical insights can aid the design of provably effective algorithms. Refer to Section C for more details on Quad.

Table 1: Accuracy (%) on k-way linear classification using fixed GPT-2 features. Good performance of features $f(s)$, conditional mean features $\Phi p_f(s)$ and a meaningful subset of ≤ 30 (and ≤ 2k) coordinates of $p_f(s)$ verifies the sentence completion reformulation and main results. The numbers right below the features denote the dimensionality of the features. An asterisk indicates that we added a task-specific prompt. Other baselines are fine-tuning (FT, Section F.2) and random projection of $p_f(s)$ (rand. proj.). The sentence version of SST (train/test: 6.9K/1.8K) is used.
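The closed-form structure of the Quad solution can be illustrated on synthetic data. The sketch below is our own construction under explicit assumptions: the conditional distributions are random Dirichlet samples over a hypothetical 6-word vocabulary in which words 0 and 1 are made perfectly substitutable, the full-rank factor in Theorem 5.2 is taken to be the identity, and the optimal features for fixed $\Phi$ come from our own first-order calculation on Equation (6), namely $\Phi\Phi^\top f(s) = \Phi p^*_{\cdot|s}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V, d = 50, 6, 3                           # contexts, vocabulary size, embedding dim

# Random conditional distributions over V - 1 "word types"; the first type is
# split evenly into words 0 and 1, making them substitutable in every context.
P = rng.dirichlet(np.ones(V - 1), size=n)
P_star = np.column_stack([P[:, 0] / 2, P[:, 0] / 2, P[:, 1:]])   # n x V
p_L = np.full(n, 1.0 / n)                    # uniform distribution over contexts

# Substitutability matrix Omega* = E_s[p* p*^T] and its top-d eigenvectors.
Omega = (P_star * p_L[:, None]).T @ P_star   # V x V
eigvals, eigvecs = np.linalg.eigh(Omega)     # eigenvalues in ascending order
U_d = eigvecs[:, ::-1][:, :d]                # V x d, top-d eigenvectors

# Optimal word embeddings up to a full-rank d x d factor (identity assumed here).
Phi_star = U_d.T                             # d x V

# For fixed Phi, the gradient of the Quad objective in f(s) vanishes when
# Phi Phi^T f(s) = Phi p*, so the optimal features are linear in p*.
F_star = np.linalg.solve(Phi_star @ Phi_star.T, Phi_star @ P_star.T).T   # n x d

print("embedding of word 0:", Phi_star[:, 0])
print("embedding of word 1:", Phi_star[:, 1])
```

In this construction the two substitutable words receive the same embedding, and with orthonormal eigenvectors the optimal features reduce to the projection of the true conditional distribution onto the top eigenvectors.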

6. EXPERIMENTS

We use experiments to verify that (1) linear classification on fixed language model features performs comparably to fine-tuning the features, (2) the sentence completion reformulation (Section 3.1) holds, i.e. tasks can be solved using probabilities of indicative words, and (3) conditional mean features are effective.

Tasks using linear function of $p^*_{\cdot|s}$: We validate our claims from Section 3 that classification tasks can be solved by linear functions of $p^*_{\cdot|s}$. Since $p^*_{\cdot|s}$ is never available, we instead use the output features $f(s)$ and probabilities $p_{\cdot|s} := p_{f(s)}$ from a small pretrained GPT-2 model (Radford et al., 2019). Table 1 demonstrates that on binary and fine-grained Stanford Sentiment Treebank (SST) (Socher et al., 2013) and AG News (Zhang et al., 2015) tasks, probabilities $p_{f(s)}$ of just 30 or so task-relevant tokens (see Section F.1) can solve the tasks. Even just one or two tokens per class ("class words") yield non-trivial performance. Furthermore, we validate the sentence completion reformulation in Section 3.1 by using the probabilities $p_{f(s)}$ after adding a task-specific prompt and consistently observing improved performance, including for fine-tuning (FT) with small datasets.

$\Phi p_f$ and $f$ are good features: We first note that linear classification over fixed features $f(s)$ from the pretrained model performs comparably to the FT baseline. We further validate Theorem 4.2 by verifying that the conditional mean features $\Phi p_f(s)$ also linearly solve downstream tasks fairly well. This performance is comparable to, but always worse than, that of $f(s)$, as seen in columns 3 and 4 of Table 1. We again find that adding a prompt improves performance. Note that a random projection of $p_{f(s)}$ to the same dimension as $\Phi p_f(s)$ performs very poorly. Section E.5 has results for a wider range of classification tasks. Evidence for Assumption 4.1 is provided by learning a quadratic function to fit the log partition function of features from the pretrained GPT-2 model (see Section F.4). Figure 1 demonstrates that the fit holds for its training data and for unseen data (e.g., WebText (Radford et al., 2019)).

7. CONCLUSIONS AND FUTURE WORK

We provide intuitive and mathematical explanations for the success of language model features on classification tasks by reformulating them as sentence completion problems. This reformulation is formalized as natural tasks: those that can be solved linearly using the conditional probability distribution $p^*_{\cdot|s}$. Insights from our analysis help design the Quad objective that provably learns good features for these natural tasks. We hope our analysis will inspire other mathematical insights into language models. While Section 4.3 argues linearity between conditional mean features $\Phi p_f$ and $f$, it is insufficient to explain the observed superiority of $f$ over $\Phi p_f$. We leave exploring this limitation of our analysis to future work. Guarantees for softmax models are for natural tasks w.r.t. $\Phi$, thus knowing the optimal $d$-dimensional word embeddings $\Phi^*$ for $\ell_{\text{xent}}(f, \Phi)$ is also important. Other meaningful directions include providing guarantees for other successful models like BERT (Devlin et al., 2018) and more diverse downstream tasks. Although we would like to show stronger guarantees by exploiting model and algorithmic inductive biases, as well as study the setting of fine-tuning language model features, the lack of a good theory of deep learning is the current bottleneck.

A OVERVIEW

Section B is a more detailed version of Section 5.1 and Section C is a detailed version of Section 5.2. Section D.1 discusses why natural tasks are a reasonable formalization for the sentence completion reformulation and gives interpretations for $\tau$ and $B$ in the definition of natural tasks. Section D.2 discusses desirable properties of word embeddings $\Phi$, like capturing synonym structure in words. Section E contains proofs for all results, including proof sketches for the main results in Section E.1. Lemma E.4 is the softmax variant of Pinsker's inequality that we prove and use for our main results. Section F contains many more experimental findings that consolidate many of our theoretical results. Section F.1 provides information about the subsets of words used for the results in Table 1, additional experiments that test the performance of pretrained language model embeddings $f$ on more downstream tasks, and a verification that conditional mean embeddings $\Phi p_f$ do well on these tasks. In Section F.3, we present additional results for the Quad objective trained on a larger corpus and tested on SST. Section F.4 provides additional details on how $A$, $b$ and $c$ from Assumption 4.1 are learned, and further verification of the assumption on more datasets. Finally, Section F.5 experimentally verifies the $O(\sqrt{\epsilon})$ dependence from Theorem 4.2.

B BETTER HANDLING OF DISTRIBUTIONAL SHIFT

While the bounds above used $\gamma(p_T)$ to transfer from the distribution $p_L$ to $p_T$, we define a more refined notion of transferability here. While $\gamma(p_T)$ only depends on $p_L$ and $p_T$, the more refined notions also depend on the learned language model, thus potentially exploiting some inductive biases. We first define the error made in the predicted probabilities by any predictor $p_{\cdot|s}$ as $\Delta_{\{p_{\cdot|s}\}}(s) = p_{\cdot|s} - p^*_{\cdot|s}$. Thus for any softmax language model $f$ we have $\Delta_{\{p_{f(s)}\}}(s) = p_{f(s)} - p^*_{\cdot|s}$. For any distribution $p \in \Delta_S$, we define the covariance of a function $g : S \to \mathbb{R}^D$ as $\Sigma_p(g) = \mathbb{E}_{s \sim p}\left[g(s) g(s)^\top\right]$. We define three coefficients for the results to follow.

Definition B.1. For any distribution $p \in \Delta_S$, we define the following

$$\gamma(p; \{p_{\cdot|s}\}) := \left( \left\| \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \, \Sigma_p(\Delta_{\{p_{\cdot|s}\}}) \, \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \right\|_2 \right)^{-1} \tag{7}$$

$$\gamma_\Phi(p; \{p_{\cdot|s}\}) := \left( \left\| \Sigma_{p_L}(\Phi\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \, \Sigma_p(\Phi\Delta_{\{p_{\cdot|s}\}}) \, \Sigma_{p_L}(\Phi\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \right\|_2 \right)^{-1} \tag{8}$$

$$\gamma(p; \Phi p_f) := \gamma_\Phi(p; \{p_{f(s)}\}) \tag{9}$$

We notice that $\Sigma_p(\Delta_{\{p_{\cdot|s}\}}) = \mathbb{E}_{s \sim p}\left[(p_{\cdot|s} - p^*_{\cdot|s})(p_{\cdot|s} - p^*_{\cdot|s})^\top\right]$ and $\Sigma_p(\Phi\Delta_{\{p_{\cdot|s}\}}) = \Phi \Sigma_p(\Delta_{\{p_{\cdot|s}\}}) \Phi^\top$. We are now ready to state the most general results.

Theorem B.1 (Strengthened Theorem 4.1). Let $\{p_{\cdot|s}\}$ be a language model that is $\epsilon$-optimal, i.e. $\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}} \le \epsilon$ for some $\epsilon > 0$. For a classification task $T$ that is $(\tau, B)$-natural, we have

$$\ell_T(\{p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \{p_{\cdot|s}\})}}$$

For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$, we have

$$\ell_T(\{p_{\cdot|s}\}) \le \ell_T(\{\Phi p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}}$$

Theorem 5.1 (Strengthened Theorem 4.2). For a fixed $\Phi$, let $f$ be features from an $\epsilon$-optimal $d$-dimensional softmax language model, i.e. $\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi) \le \epsilon$, where $\ell^*_{\text{xent}}(\Phi)$ is defined in Equation (4). For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$, we have

$$\ell_T(\{p_{f(s)}\}) \le \ell_T(\Phi p_f) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \Phi p_f)}}$$

Discussions: It is not hard to show that the coefficients satisfy $\gamma_\Phi(p_T; \{p_{\cdot|s}\}) \ge \gamma(p_T; \{p_{\cdot|s}\}) \ge \gamma(p_T)$ and $\gamma(p_T; \Phi p_f) \ge \gamma(p_T)$, thus showing that these results are strictly stronger than the ones from the previous section. The transferability coefficient measures how guarantees on $p_L$ obtained using a language model can be transferred to another distribution of contexts; it only depends on the distribution of contexts and not on the labels. Unlike $\gamma(p_T)$, the coefficients in Definition B.1 depend on the learned models, either $\{p_{\cdot|s}\}$ or $\{p_{f(s)}\}$, and can potentially be much smaller due to the inductive bias of the learned models. For instance, if the errors made by the model are random-like, i.e. $\Delta_{\{p_{\cdot|s}\}}(s) \sim \rho$ independently of $s$, then $\Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) \approx \Sigma_p(\Delta_{\{p_{\cdot|s}\}}) \approx \mathbb{E}_{\eta \sim \rho}[\eta\eta^\top]$, making $\gamma(p; \{p_{\cdot|s}\}) \approx 1$. Independence prevents the language modeling error from accumulating on contexts from $p_T$, bypassing the worst-case transfer of $\gamma(p_T)$.
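The "random-like errors" remark can be illustrated numerically. The following is a minimal sketch, not the paper's experimental code: it assumes NumPy, replaces the prediction errors $\Delta(s)$ by hypothetical Gaussian vectors in $\mathbb{R}^4$ (so that the covariance is invertible), and estimates the coefficient of Equation (7) from samples. All names (`transfer_coefficient`, `errors_L`, `errors_T`) are illustrative.

```python
import numpy as np


def uncentered_cov(G):
    """Estimate Sigma_p(g) = E_{s~p}[g(s) g(s)^T] from samples stored as rows of G."""
    return G.T @ G / G.shape[0]


def inv_sqrt(M):
    """Inverse square root of a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T


def transfer_coefficient(G_L, G_T):
    """gamma = 1 / || Sigma_L^{-1/2} Sigma_T Sigma_L^{-1/2} ||_2 (spectral norm)."""
    whiten = inv_sqrt(uncentered_cov(G_L))
    return 1.0 / np.linalg.norm(whiten @ uncentered_cov(G_T) @ whiten, ord=2)


rng = np.random.default_rng(0)
n, scales = 50_000, np.array([1.0, 0.5, 2.0, 0.1])

# "Random-like" errors: the same error distribution rho under p_L and p_T
# (hypothetical Gaussian errors, independent of the context).
errors_L = rng.normal(size=(n, 4)) * scales
errors_T = rng.normal(size=(n, 4)) * scales
gamma_random = transfer_coefficient(errors_L, errors_T)

# Contrast: errors twice as large on p_T, so its covariance is 4x larger.
gamma_shifted = transfer_coefficient(errors_L, 2.0 * errors_T)

print(f"gamma (random-like errors): {gamma_random:.3f}")
print(f"gamma (doubled errors):     {gamma_shifted:.3f}")
```

In this toy setting the first coefficient comes out close to 1, while doubling the errors on the task distribution divides it by 4, which is the kind of degradation the coefficient is meant to capture.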

C QUAD: A NEW OBJECTIVE FUNCTION

In Definition 3.2 we discuss how low dimensional softmax language models learn a linear projection of $p^*_{\cdot|s}$, only solving tasks that lie in the row span of word embeddings $\Phi$. Although $\Phi$ defines the tasks that language model features can solve, the standard cross-entropy objective does not lend itself to a simple closed-form expression for the optimal $\Phi$. This motivates the construction of our Quad objective, which has two nice properties: (1) the optimal feature map $f^*$ is a linear function of $p^*_{\cdot|s}$ and thus can solve some natural tasks, and (2) the optimal $\Phi^*$ has an intuitively meaningful closed-form solution.

$$\ell_{\text{quad},s}(\theta, \Phi) = \mathbb{E}_{w \sim p^*_{\cdot|s}}\left[-\theta^\top \phi_w\right] + \frac{1}{2}\left\|\Phi^\top \theta\right\|^2 = -\theta^\top \Phi p^*_{\cdot|s} + \frac{1}{2}\left\|\Phi^\top \theta\right\|^2 \tag{10}$$

$$\ell_{\text{quad}}(f, \Phi) = \mathbb{E}_{s \sim p_L}\left[\ell_{\text{quad},s}(f(s), \Phi)\right] \tag{11}$$

The Quad objective is very similar to the cross-entropy objective from Equation (2), with the log partition function replaced by a quadratic function, inspired in part by Assumption 4.1. We can derive the optimal solution $\Phi^*$, which depends on the eigendecomposition of a substitutability matrix.

Definition 5.1. The substitutability matrix is defined to be $\Omega^* := \mathbb{E}_{s \sim p_L}\left[p^*_{\cdot|s} \, p^{*\top}_{\cdot|s}\right] \in \mathbb{R}^{V \times V}$. If $\Omega^* = U S U^\top$ is the eigendecomposition, then $U_d \in \mathbb{R}^{V \times d}$ is the matrix of top $d$ eigenvectors of $\Omega^*$.

The matrix $\Omega^*$ captures substitutability between pairs of words. Words $w$ and $w'$ are substitutable if they have identical conditional probabilities for every context $s \in S$ and thus can replace occurrences of each other while still providing meaningful completions. By definition, these words satisfy $\Omega^*[w] = \Omega^*[w']$. Such pairs of words were called "free variants" in the work on distributional semantics (Harris, 1954), and capture the notion of synonyms in the distributional hypothesis. We now derive expressions for the optimal solution of the Quad objective described in Equation (11). The proofs of all results from this section are in Section E.5.

Theorem C.1. The optimal solution $f^*, \Phi^* = \arg\min_{f, \Phi} \ell_{\text{quad}}(f, \Phi)$ satisfies

$$\Phi^* = B U_d^\top, \text{ for full rank } B \in \mathbb{R}^{d \times d}$$

$$f^*(s) = (\Phi^* \Phi^{*\top})^{-1/2} \Phi^* p^*_{\cdot|s} = C U_d^\top p^*_{\cdot|s}, \text{ for full rank } C \in \mathbb{R}^{d \times d}$$

If $\Phi$ is fixed, then the optimal solution is $f^*(s) = (\Phi\Phi^\top)^{-1/2} \Phi p^*_{\cdot|s}$.

Theorem 5.2. Let $f^*, \Phi^* = \arg\min_{f, \Phi} \ell_{\text{quad}}(f, \Phi)$. Then $\Phi^* = B U_d^\top$, for full rank $B \in \mathbb{R}^{d \times d}$. Also, for a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi^*$, we have $\ell_T(f^*) \le \tau$.

Thus $f^*$ excels on natural tasks w.r.t. $\Phi^*$, which, in turn, is the best $d$-dimensional projection of $\Omega^*$. Thus words $w, w' \in W$ that are synonyms (hence substitutable) will satisfy $\phi^*_w = \phi^*_{w'}$, fulfilling the desired property for word embeddings discussed in Definition 3.2. We train using the Quad objective and compare its performance to a similarly trained language model, finding Quad to be reasonably effective. The goal of testing Quad is not to obtain state-of-the-art results, but to demonstrate that theoretical insights can aid the design of provably effective algorithms.
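To make the substitutability matrix concrete, here is a small sketch under stated assumptions: it uses NumPy, a hypothetical vocabulary of 6 words, 200 synthetic contexts drawn uniformly (standing in for $p_L$), and takes the full rank matrix $B$ to be the identity. It builds $\Omega^*$ from made-up next-word distributions in which words 0 and 1 are substitutable, and checks that they receive the same row of $\Omega^*$ and the same embedding under $\Phi^* = U_d^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, vocab_size, d = 200, 6, 2

# Hypothetical true next-word distributions p*_{.|s}, one row per context s.
P = rng.random((n_contexts, vocab_size))
P[:, 1] = P[:, 0]                      # words 0 and 1 are substitutable
P /= P.sum(axis=1, keepdims=True)      # normalize each row to a distribution

# Substitutability matrix Omega* = E_s[p p^T] with a uniform context distribution.
Omega = P.T @ P / n_contexts

# Top-d eigenvectors; eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(Omega)
U_d = evecs[:, ::-1][:, :d]

# Word embeddings Phi* = B U_d^T with the (assumed) choice B = I;
# column w of Phi_star is the embedding of word w.
Phi_star = U_d.T

print("rows of Omega equal:   ", np.allclose(Omega[0], Omega[1]))
print("embeddings equal:      ", np.allclose(Phi_star[:, 0], Phi_star[:, 1]))
```

The second check holds because any eigenvector $u$ of $\Omega^*$ with nonzero eigenvalue $\lambda$ satisfies $u_w = (\Omega^* u)_w / \lambda$, so identical rows force identical coordinates.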

D MORE ON NATURAL TASKS

The discussions in this section may not be formal and precise in places; they are meant to provide more intuition for some of the definitions and results.

D.1 SENTENCE COMPLETION REFORMULATION ≡ NATURAL TASK

We provide informal justification for why the sentence completion reformulation can be formalized as a task being solvable by a linear classifier over $p^*_{\cdot|s} \in \mathbb{R}^V$. The analysis will also end up providing some intuition for $\tau$ and $B$ in Definition 3.1 and Theorem 4.1. In particular, we will show that a task that is amenable to the sentence completion reformulation will be $(\tau, B)$-natural, with $\tau = O(\text{Bayes-Error}(T))$, i.e. $\tau$ is small if the Bayes error for the task is small, and $B = O(\alpha(W_{\text{indicative}})^{-1})$, i.e. $B$ is inversely proportional to the probability mass of the set of indicative words for the task. This is formalized in Proposition D.2.

Linear classifier over p * •|s

Consider a binary classification task $T$ that can be solved with a sentence completion reformulation after adding a prompt as in Section 3.1; e.g. sentiment classification can be solved by adding the prompt "This movie is" at the end of every movie review and using the completions to solve the task. Recall that $p_T$ is the distribution over $S \times \{\pm 1\}$ for the task $T$. We abuse notation and use $p_T$ to denote the distribution over inputs where a prompt is added to each input, e.g. "I loved the movie." is transformed to "I loved the movie. This movie is". For any $s \sim p_T$, let $p_T(y = 1|s)$ and $p_T(y = -1|s)$ denote the conditional probabilities of the sentiment of review $s$ (with an added prompt) being positive and negative respectively. By the law of total probability we can write this conditional probability as

$$p_T(y = 1|s) = \sum_{w \in W} \Pr(y = 1|(s, w)) \Pr(w|s) = \sum_{w \in W} \Pr(y = 1|(s, w)) \, p^*_{\cdot|s}(w) \tag{12}$$

For any task $T$ we can roughly partition the vocabulary set $W$ into the following:

• Indicative words $W_{\text{indicative}}$: $w$ can be an indicative completion for the task, like "good", "boring", "trash" etc., after a movie review like $s$ = "I loved the movie. This movie is". In this case the sentence completion reformulation can be interpreted as follows: the completion $w$ after a review $s$ is sufficient to determine the sentiment of the review, i.e. we do not need to know the content of the review $s$ to predict the label if we know the completion $w$. This can be formalized as $\Pr(y = 1|(s, w)) \approx P(y = 1|w)$ for some fixed distribution $P$ for indicative completions $w$.

• Irrelevant words $W_{\text{irrelevant}}$: $w$ can be an irrelevant completion for the task, like "a", "very", "not". In this case the completions do not reveal anything more about the sentiment of the review than $s$ itself, i.e. $\Pr(y = 1|(s, w)) \approx p_T(y = 1|s)$ for irrelevant completions $w$.

Thus from Equation (12), splitting the sum over indicative and irrelevant words and applying the two approximations above, we get

$$p_T(y = 1|s) \approx \sum_{w \in W_{\text{indicative}}} P(y = 1|w) \, p^*_{\cdot|s}(w) + p_T(y = 1|s) \sum_{w \in W_{\text{irrelevant}}} p^*_{\cdot|s}(w) = v_1^\top p^*_{\cdot|s} + p_T(y = 1|s) \, p^*_{\cdot|s}(W_{\text{irrelevant}})$$

where $v_1 \in \mathbb{R}^V$ is defined as $v_1(w) = P(y = 1|w)$ for $w \in W_{\text{indicative}}$ and $v_1(w) = 0$ for $w \in W_{\text{irrelevant}}$. Similarly we can define $v_{-1} \in \mathbb{R}^V$ with $v_{-1}(w) = P(y = -1|w)$ for $w \in W_{\text{indicative}}$ and $v_{-1}(w) = 0$ for $w \in W_{\text{irrelevant}}$. From the earlier calculation, and a similar one for $y = -1$, we get

$$p_T(y = b|s) \approx \frac{1}{1 - p^*_{\cdot|s}(W_{\text{irrelevant}})} v_b^\top p^*_{\cdot|s} = \frac{1}{p^*_{\cdot|s}(W_{\text{indicative}})} v_b^\top p^*_{\cdot|s}, \text{ for } b \in \{\pm 1\}$$

If we assume that $p^*_{\cdot|s}(W_{\text{indicative}}) \approx \alpha(W_{\text{indicative}})$ is roughly the same for all $s$, i.e. the probability mass of indicative words following a modified review is approximately the same, then we get

$$p_T(y = 1|s) - p_T(y = -1|s) \approx v_T^\top p^*_{\cdot|s}, \text{ where } v_T = \frac{1}{\alpha(W_{\text{indicative}})} (v_1 - v_{-1}) \tag{13}$$

Thus we can approximately express the difference in conditional probabilities of the two classes as a linear function of $p^*_{\cdot|s}$. While it is intuitively clear why knowing $p_T(y = 1|s) - p_T(y = -1|s)$ is useful for solving the task, we show precisely why in the next part.
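The construction of $v_T$ can be traced on a toy example. The sketch below is only illustrative: the four-word vocabulary, the probabilities $P(y = 1|w)$ and the next-word distribution are all made-up numbers, and the two modeling assumptions of this section (indicative completions determine the label, irrelevant completions carry no extra information) are taken to hold exactly.

```python
# Hypothetical P(y = 1 | w) for indicative completions of "... This movie is".
prob_pos_given_word = {"good": 0.9, "bad": 0.1}

# Hypothetical next-word distribution p*_{.|s} for one review s with prompt.
p_star = {"good": 0.30, "bad": 0.10, "a": 0.40, "very": 0.20}

# alpha: total probability mass of the indicative words.
alpha = sum(p_star[w] for w in prob_pos_given_word)


def v_T(word):
    """Entry of v_T = (v_1 - v_{-1}) / alpha; zero on irrelevant words."""
    if word not in prob_pos_given_word:
        return 0.0
    pos = prob_pos_given_word[word]
    return (pos - (1.0 - pos)) / alpha


# Linear function of p*_{.|s}: the score v_T^T p*_{.|s}.
score = sum(v_T(w) * p for w, p in p_star.items())

# Posterior implied by the two assumptions: p_T(y=1|s) = v_1^T p* / alpha.
pos_posterior = sum(prob_pos_given_word[w] * p_star[w]
                    for w in prob_pos_given_word) / alpha
margin = pos_posterior - (1.0 - pos_posterior)

print(f"v_T^T p* = {score:.3f}, p_T(y=1|s) - p_T(y=-1|s) = {margin:.3f}")
```

Here the linear score and the posterior margin coincide (both equal 0.4); in practice the approximations only hold roughly, so the two quantities would agree only approximately.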

Interpretation for τ and B

Based on the above discussion, we will show that the task $T$ from earlier is $(\tau, B)$-natural according to Definition 3.1, which will also give us an interpretation for $\tau$ and $B$. First we show that the following predictor from Equation (13) is effective for task $T$:

$$g_T(s) = p_T(y = 1|s) - p_T(y = -1|s) \approx v_T^\top p^*_{\cdot|s} \tag{14}$$

We reuse the notation from Equation (3) and define the task loss for any predictor $g : S \to \mathbb{R}$ as

$$\ell_T(g) = \mathbb{E}_{(s,y) \sim p_T}\left[\ell(g(s), y)\right]$$

Furthermore let $\text{Bayes-Error}(T) := \inf_{g : S \to \mathbb{R}} \mathbb{E}_{(s,y) \sim p_T}\left[\mathbb{1}\{g(s) \neq y\}\right]$ denote the Bayes error of the task $T$, i.e. the optimal 0-1 error achievable on the task.

Proposition D.1. For any task $T$ and for the hinge loss $\ell$, $\ell_T(g_T) \le 4\,\text{Bayes-Error}(T)$, where $g_T(s) = p_T(y = 1|s) - p_T(y = -1|s)$.

Thus if a task is easily solvable, i.e. has small Bayes error, then it will be solvable by the predictor $g_T(s)$. Since we argued above that the sentence completion reformulation implies that $g_T(s)$ is a linear function of $p^*_{\cdot|s}$, we can now show that $T$ is a natural task as formalized in Definition 3.1.

Proposition D.2 (Informal). A task $T$ that can be reformulated as a sentence completion task (described above) is a $(\tau, B)$-natural task w.r.t. the hinge loss, with the following parameters:

$$\tau \le 4\,\text{Bayes-Error}(T) \quad \text{and} \quad B = \alpha(W_{\text{indicative}})^{-1}$$

Here $\text{Bayes-Error}(T)$ is the Bayes error of task $T$ and $\alpha(W_{\text{indicative}})$ is the total mass of the indicative words for the task.

If the task $T$ can be reformulated as sentence completion, then $T$ is $(\tau, B)$-natural where

• $\tau$ is small if the task is unambiguous, i.e. it has small Bayes error

• $B$ is small if the probability mass of the set of indicative words $W_{\text{indicative}}$ is large, i.e. the task depends on a large set of frequent words

Thus the upper bound in Theorem 4.1 is smaller if the task can be reformulated as a sentence completion task with a large and frequent set of completions, and if we can hope to solve it well in the first place (the Bayes error is small). The proofs for the above propositions are in Section D.1.

D.2 NICE PROPERTIES OF WORD EMBEDDINGS Φ

We argue here that if the word embeddings $\Phi$ satisfy certain nice properties, then $(\tau, B)$-natural tasks of interest will be $(\tau', B')$-natural w.r.t. $\Phi$, where we will provide informal quantifications for the nice properties and tasks of interest that lead to a small value for $\tau'$ and $B'$. The nice property will be related to $\Phi$ capturing the semantic meaning (synonym structure) of words, and tasks of interest will be those that try to distinguish word completions (in the sentence completion reformulation) with very different meanings, i.e. that try to distinguish more coarse-grained semantic notions rather than very fine-grained ones. Note that the results here are informal and qualitative, rather than quantitative.

Consider a task $T$ that is $(\tau, B)$-natural and let $v^* \in \mathbb{R}^V$ be the classifier such that $\ell_T(\{p^*_{\cdot|s}\}, v^*) \le \tau$ and $\|v^*\|_\infty \le B$. We want to find properties of $\Phi$ and $v^*$ that will make $T$ $(\tau', B')$-natural w.r.t. $\Phi$ such that $\tau'$ and $B'$ are not too large. We will show that $T$ is $(\tau', B')$-natural w.r.t. $\Phi$ by finding a classifier $v$ such that $v = \Phi^\top \lambda \in \mathbb{R}^V$, $\|v\|_\infty \le B'$ and $\ell_T(\{p^*_{\cdot|s}\}, v) \le \tau'$. First we define $P_\Phi := \Phi^\dagger \Phi \in \mathbb{R}^{V \times V}$ to be the projection matrix for the row span of $\Phi$ and $P_\Phi^\perp := I_V - P_\Phi$ to be the orthogonal projection matrix. We will show that the classifier $v = P_\Phi v^*$ suffices for our case, under some intuitive conditions on $v^*$ and $\Phi$. To compute $B'$, we first look at the $\ell_\infty$ norm of $v = P_\Phi v^*$:

$$B' = \|v\|_\infty = \|P_\Phi v^*\|_\infty = \|v^* - P_\Phi^\perp v^*\|_\infty \le \|v^*\|_\infty + \|P_\Phi^\perp v^*\|_\infty \le B + \|P_\Phi^\perp v^*\|_2$$

To find the upper bound $\tau'$, we upper bound the classification loss of $v = P_\Phi v^*$. We first define the substitutability matrix $\Omega^*_p = \mathbb{E}_{s \sim p}\left[p^*_{\cdot|s} \, p^{*\top}_{\cdot|s}\right]$, similar to the one in Definition 5.1. Then

$$\begin{aligned}
\ell_T(\{p^*_{\cdot|s}\}, v) &= \mathbb{E}_{(s,y) \sim p_T}\left[\ell(v^\top p^*_{\cdot|s}, y)\right] = \mathbb{E}_{(s,y) \sim p_T}\left[\ell((P_\Phi v^*)^\top p^*_{\cdot|s}, y)\right] \\
&\le^{(a)} \mathbb{E}_{(s,y) \sim p_T}\left[\ell(v^{*\top} p^*_{\cdot|s}, y)\right] + \mathbb{E}_{s \sim p_T}\left[\left|(v^* - P_\Phi v^*)^\top p^*_{\cdot|s}\right|\right] \\
&= \ell_T(\{p^*_{\cdot|s}\}, v^*) + \mathbb{E}_{s \sim p_T}\left[\left|v^{*\top} P_\Phi^\perp p^*_{\cdot|s}\right|\right] \\
&\le^{(b)} \tau + \sqrt{\mathbb{E}_{s \sim p_T}\left[(v^{*\top} P_\Phi^\perp p^*_{\cdot|s})^2\right]} = \tau + \sqrt{\mathbb{E}_{s \sim p_T}\left[v^{*\top} P_\Phi^\perp p^*_{\cdot|s} \, p^{*\top}_{\cdot|s} P_\Phi^\perp v^*\right]} \\
&=^{(c)} \tau + \sqrt{v^{*\top} P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp v^*} \le^{(d)} \tau + \|P_\Phi^\perp v^*\|_2 \sqrt{\|P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp\|_2}
\end{aligned}$$

where (a) follows from the 1-Lipschitz property of $\ell$, (b) from Jensen's inequality and the fact that $\ell_T(\{p^*_{\cdot|s}\}, v^*) \le \tau$, (c) from the definition of the substitutability matrix $\Omega^*_{p_T}$ and (d) from the definition of the spectral norm of a symmetric PSD matrix. Thus we have shown that $T$ is $(\tau', B')$-natural w.r.t. $\Phi$, where

$$\tau' = \tau + \|P_\Phi^\perp v^*\|_2 \sqrt{\|P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp\|_2}, \quad B' = B + \|P_\Phi^\perp v^*\|_2$$

We will now show that if $\Phi$ captures the notion of synonyms, then $\|P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp\|_2$ will be small, leading to $\tau'$ being small. Furthermore we also shed some light on what it means for $\|P_\Phi^\perp v^*\|_2$ to be small, which will in turn make $B'$ small and $\tau'$ smaller. We do so with the following arguments: 1) $\Omega^*_{p_T}$ captures the semantic meaning of words and thus its top eigen-directions will capture more dominant semantic concepts, 2) if $\Phi$ captures the "top-$d$" directions of meaning, i.e. the top-$d$ eigen-directions of $\Omega^*_{p_T}$, then $\|P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp\|_2 = O(1/d)$, 3) if additionally $v^*$ cares about the "top-$d$" directions of meaning, i.e. the top-$d$ eigen-directions of $\Omega^*_{p_T}$, then $\|P_\Phi^\perp v^*\|_2$ will be small. We expand on these points below.

1. Substitutability matrix ($\Omega^*_{p_T}$) captures semantic meaning: We use an argument similar to the one in Section 5.2 right after Definition 5.1 that is based on distributional semantics (Harris, 1954). Harris (1954) posits that meaning for elements (words) can be derived from the environments (contexts) in which they occur. Thus Harris (1954) argues that words that occur in an almost identical set of contexts have the same meaning, i.e. are synonyms.
On the other hand, if two words share some contexts but not all, then they have different meanings, and the amount of difference in meaning roughly corresponds to the amount of difference in contexts. In our setting, the similarity of words $w$ and $w'$ can then be determined by the probabilities assigned to them by different contexts $s$. In particular, if $p$ …

$$\Omega^*_{p_T}(w) = \Omega^*_{p_T}(w') \implies \Omega^*_{p_T}(w, w) = \Omega^*_{p_T}(w', w) = \Omega^*_{p_T}(w, w') = \Omega^*_{p_T}(w', w') \tag{17}$$

$$\implies \beta_w^\top \beta_w = \beta_w^\top \beta_{w'} = \beta_{w'}^\top \beta_{w'} \implies \beta_w = \beta_{w'} \tag{18}$$

Thus $\Omega^*_{p_T}$ does indeed capture the synonym structure between words, and its top eigen-directions capture the most significant "semantic meaning" directions.

2. $\Phi$ has nice properties: If $\Phi$ roughly respects this synonym structure by aligning with the top-$d$ eigen-directions of $\Omega^*_{p_T}$, we have

$$\|P_\Phi^\perp \Omega^*_{p_T} P_\Phi^\perp\|_2 \le \lambda_{d+1}(\Omega^*_{p_T}) \le \frac{1}{d+1} \sum_{i=1}^{d+1} \lambda_i(\Omega^*_{p_T}) \le \frac{1}{d+1} \text{tr}(\Omega^*_{p_T}) \tag{19}$$

$$\le \frac{1}{d+1} \mathbb{E}_{s \sim p_T}\left[\text{tr}(p^*_{\cdot|s} \, p^{*\top}_{\cdot|s})\right] \le \frac{1}{d+1} \tag{20}$$

From Equation (16), we then have $\tau' \le \tau + \frac{\|P_\Phi^\perp v^*\|_2}{\sqrt{d}}$.

3. Tasks of interest: It is more likely for a classifier $v^*$ to separate words with big differences in meaning rather than small differences. E.g., it is more likely for a task to separate the word completions "good" and "bad" than "good" and "nice". Since top eigen-directions of $\Omega^*$ …

We now analyze the hinge loss of the predictor $g_{p_T}$ defined in Equation (14). Note that since $|g_{p_T}(s)| \le 1$, the hinge loss is $\ell(g_{p_T}(s), y) = (1 - y g_{p_T}(s))_+ = 1 - y g_{p_T}(s)$ for every $s, y$. Writing $p_1(s) = p_T(y = 1|s)$ and $p_{-1}(s) = p_T(y = -1|s)$, the total loss is

$$\begin{aligned}
\ell_T(g_{p_T}) &= \mathbb{E}_{(s,y) \sim p_T}\left[(1 - y g_{p_T}(s))_+\right] = \mathbb{E}_{(s,y) \sim p_T}\left[1 - y g_{p_T}(s)\right] \\
&=^{(a)} \mathbb{E}_{s \sim p_T}\left[p_1(s)(1 - g_{p_T}(s)) + p_{-1}(s)(1 + g_{p_T}(s))\right] = \mathbb{E}_{s \sim p_T}\left[1 - (p_1(s) - p_{-1}(s)) g_{p_T}(s)\right] \\
&=^{(b)} \mathbb{E}_{s \sim p_T}\left[1 - (p_1(s) - p_{-1}(s))^2\right] = \mathbb{E}_{s \sim p_T}\left[(p_1(s) + p_{-1}(s))^2 - (p_1(s) - p_{-1}(s))^2\right] \\
&= \mathbb{E}_{s \sim p_T}\left[4 p_1(s) p_{-1}(s)\right] = 4\,\mathbb{E}_{s \sim p_T}\left[p_{\min}(s) p_{\max}(s)\right] \le^{(c)} 4\,\mathbb{E}_{s \sim p_T}\left[p_{\min}(s)\right] = 4\,\text{Bayes-Error}(T)
\end{aligned}$$

where (a) follows by splitting the expectation over $y|s$, (b) follows from the definition of $g_{p_T}(s)$ in Equation (14) and (c) follows from $p_{\max}(s) \le 1$. This completes the proof.

Proposition D.2. Let $B = \alpha(W_{\text{indicative}})^{-1}$. We first note the following using the definition of $v_T$ from Equation (13).

$$\|v_T\|_\infty = \alpha(W_{\text{indicative}})^{-1} \max_{w \in W} |v_1(w) - v_{-1}(w)| = B \max_{w \in W} |P(y = 1|w) - P(y = -1|w)| \le B \tag{21}$$

To find the value of $\tau$ that makes the task $(\tau, B)$-natural (Definition 3.1), we observe the following:

$$\min_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \ell_T(\{p^*_{\cdot|s}\}, v) \le^{(a)} \ell_T(\{p^*_{\cdot|s}\}, v_T) = \mathbb{E}_{(s,y) \sim p_T}\left[\ell(v_T^\top p^*_{\cdot|s}, y)\right] =^{(b)} \mathbb{E}_{(s,y) \sim p_T}\left[\ell(g_T(s), y)\right] = \ell_T(g_T) \le^{(c)} 4\,\text{Bayes-Error}(T)$$

where (a) follows from the calculation in Equation (21), (b) follows from Equation (13) and (c) follows from Proposition D.1.
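The chain of equalities in the proof of Proposition D.1 can be checked on a small discrete example. This is a sketch with made-up numbers: three hypothetical contexts with probabilities $p_T(s)$ and posteriors $p_T(y = 1|s)$; the variable names are illustrative.

```python
# Each entry: (p_T(s), p_T(y = 1 | s)) for a hypothetical context s.
contexts = [(0.5, 0.9), (0.3, 0.2), (0.2, 0.6)]


def hinge(score, label):
    """Hinge loss (1 - y * g)_+ for a label in {+1, -1}."""
    return max(0.0, 1.0 - label * score)


hinge_loss = 0.0      # l_T(g_T): expected hinge loss of g_T
product_form = 0.0    # E_s[4 p_1(s) p_{-1}(s)]
bayes_error = 0.0     # E_s[min(p_1(s), p_{-1}(s))]

for prob_s, p_pos in contexts:
    p_neg = 1.0 - p_pos
    g = p_pos - p_neg  # predictor g_T(s) = p_T(y=1|s) - p_T(y=-1|s)
    hinge_loss += prob_s * (p_pos * hinge(g, +1) + p_neg * hinge(g, -1))
    product_form += prob_s * 4.0 * p_pos * p_neg
    bayes_error += prob_s * min(p_pos, p_neg)

print(f"hinge loss of g_T : {hinge_loss:.3f}")
print(f"E[4 p_1 p_-1]     : {product_form:.3f}")
print(f"4 * Bayes error   : {4.0 * bayes_error:.3f}")
```

On these numbers the hinge loss equals $\mathbb{E}_s[4 p_1(s) p_{-1}(s)] = 0.564$, which is below $4\,\text{Bayes-Error}(T) = 0.76$, as the proposition states.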

E PROOFS

E.1 PROOF SKETCH

We first present a sketch of the arguments that help us show our main results, Theorems 4.1 and 4.2. The subsections after the next one contain the full proofs for strengthened versions of these results.

E.1.1 PROOF SKETCH FOR ARBITRARY LANGUAGE MODELS: THEOREM 4.1

Here we want to show guarantees for features $\{p_{\cdot|s}\}$ on a $(\tau, B)$-natural task $T$. From the definition of natural tasks, we know

$$\exists v^* \in \mathbb{R}^V, \|v^*\|_\infty \le B \text{ s.t. } \ell_T(\{p^*_{\cdot|s}\}, v^*) \le \tau \tag{22}$$

We wish to upper bound the classification error $\ell_T(\{p_{\cdot|s}\})$ and do so using the following sequence of inequalities.

$$\begin{aligned}
\ell_T(\{p_{\cdot|s}\}) - \tau &= \inf_{v \in \mathbb{R}^V} \ell_T(\{p_{\cdot|s}\}, v) - \tau \le \ell_T(\{p_{\cdot|s}\}, v^*) - \ell_T(\{p^*_{\cdot|s}\}, v^*) \\
&= \frac{\ell_T(\{p_{\cdot|s}\}, v^*) - \ell_T(\{p^*_{\cdot|s}\}, v^*)}{\sqrt{\mathbb{E}_{s \sim p_T}\left[(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2\right]}} \cdot \sqrt{\frac{\mathbb{E}_{s \sim p_T}\left[(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2\right]}{\mathbb{E}_{s \sim p_L}\left[(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2\right]}} \cdot \sqrt{\mathbb{E}_{s \sim p_L}\left[(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2\right]} \\
&= \underbrace{\frac{\ell_T(\{p_{\cdot|s}\}, v^*) - \ell_T(\{p^*_{\cdot|s}\}, v^*)}{\sqrt{v^{*\top} \Sigma_{p_T}(\Delta_{\{p_{\cdot|s}\}}) v^*}}}_{\alpha_1(v^*)} \cdot \underbrace{\sqrt{\frac{v^{*\top} \Sigma_{p_T}(\Delta_{\{p_{\cdot|s}\}}) v^*}{v^{*\top} \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v^*}}}_{\alpha_2(v^*)} \cdot \underbrace{\sqrt{\mathbb{E}_{s \sim p_L}\left[(v^{*\top}(p_{\cdot|s} - p^*_{\cdot|s}))^2\right]}}_{\alpha_3(v^*)}
\end{aligned} \tag{23}$$

Here $\alpha_1(v^*)$ relates the classification loss to the error covariance on $p_T$ (using Lipschitzness of $\ell$ and Jensen's inequality), $\alpha_2(v^*)$ transfers the error covariance from $p_T$ to $p_L$ (using the transferability coefficient), and $\alpha_3(v^*)$ relates the error covariance to the cross-entropy loss (using a (modified) Pinsker's inequality). $\Sigma_p(g) := \mathbb{E}_{s \sim p}\left[g(s) g(s)^\top\right]$ is the uncentered covariance of $g$ w.r.t. distribution $p \in \Delta_S$, as defined in Section 5.1. We upper bound $\ell_T(\{p_{\cdot|s}\}) - \tau$ by upper bounding each of $\alpha_1(v^*)$, $\alpha_2(v^*)$, $\alpha_3(v^*)$ as follows.

• Classification loss → prediction error covariance: $\alpha_1(v^*)$ is upper bounded by using Lipschitzness of the loss $\ell$ used in the definition of $\ell_T$, e.g. hinge loss or logistic loss, followed by an application of Jensen's inequality.

Lemma E.8 $\implies \alpha_1(v) \le 1$ for all $v \in \mathbb{R}^V$

• Error covariance from $p_T$ → $p_L$: $\alpha_2(v^*)$ handles the mismatch between the distributions $p_T$ and $p_L$ over which the classification loss and the cross-entropy loss are measured respectively. It is upper bounded by the transferability coefficient.

Lemma E.10 and Lemma E.9 $\implies \alpha_2(v) \le \sqrt{\gamma(p_T)^{-1}}$ for all $v \in \mathbb{R}^V$

• Error covariance → cross-entropy loss (arbitrary language models): This is arguably the most important step, connecting the error in prediction to the cross-entropy loss. For the arbitrary language model case, this is proved using Pinsker's inequality and taking an expectation over the distribution $p_L$.

Lemma E.3 $\implies \alpha_3(v) \le \sqrt{2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell_{\text{xent}}(\{p^*_{\cdot|s}\})\right)}$ for all $v \in \mathbb{R}^V$

E.1.2 PROOF SKETCH FOR SOFTMAX LANGUAGE MODELS: THEOREM 4.2

Here we want to show guarantees for features $\Phi p_f = \{\Phi p_{f(s)}\}$ on a task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$. From the definition of natural tasks w.r.t. $\Phi$, we know

$$\exists v^* = \Phi^\top \lambda \in \mathbb{R}^V, \|v^*\|_\infty \le B \text{ s.t. } \ell_T(\{p^*_{\cdot|s}\}, v^*) \le \tau \tag{24}$$

Note that the difference here is that $v^*$ is in the span of $\Phi$ rather than an arbitrary vector in $\mathbb{R}^V$. We wish to upper bound the classification error $\ell_T(\{\Phi p_{f(s)}\})$ and do so using the following sequence of inequalities.

$$\begin{aligned}
\ell_T(\{\Phi p_{f(s)}\}) - \tau &= \inf_{\lambda \in \mathbb{R}^d} \ell_T(\{\Phi p_{f(s)}\}, \lambda) - \tau = \inf_{v = \Phi^\top \lambda \in \mathbb{R}^V} \ell_T(\{p_{f(s)}\}, v) - \tau \\
&\le \ell_T(\{p_{f(s)}\}, v^*) - \ell_T(\{p^*_{\cdot|s}\}, v^*) \le \alpha_1(v^*) \cdot \alpha_2(v^*) \cdot \alpha_3(v^*)
\end{aligned}$$

where the first inequality follows because $v^*$ is in the span of $\Phi$ and the second inequality follows from Equation (23). The bounds for $\alpha_1(v^*)$ and $\alpha_2(v^*)$ are the same as for arbitrary language models. The main difference is the bound on $\alpha_3(v^*)$, which will be a stronger bound for softmax models.

• Error covariance → cross-entropy loss (softmax language models): For softmax language models, we need to prove a modified version of Pinsker's inequality specifically for softmax models. This version gives a bound that only works when $v^*$ is in the span of $\Phi$ and the evaluated model $p_{f(s)}$ computes the softmax using $\Phi$ as well.

Lemma E.4 $\implies \alpha_3(v) \le \sqrt{2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{f(s)}\}) - \inf_{f^*} \ell_{\text{xent}}(\{p_{f^*(s)}\})\right)}$ for all $v = \Phi^\top \lambda \in \mathbb{R}^V$

Thus we suffer the suboptimality of the language model $\{p_{f(s)}\}$ w.r.t. the best softmax model $\{p_{f^*(s)}\}$ rather than the absolute best language model $\{p^*_{\cdot|s}\}$. This is done using the softmax variant of Pinsker's inequality in Lemma E.4. We now present the detailed proofs for all results.
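The third step of the sketch, for arbitrary language models, rests on a per-context inequality: $(v^\top(p - p^*))^2 \le \|v\|_\infty^2 \|p - p^*\|_1^2 \le 2\|v\|_\infty^2 \, \text{KL}(p^* \| p)$, where the second step is Pinsker's inequality, and averaging the KL term over contexts gives the cross-entropy gap. The following sketch checks this inequality on random instances; the vocabulary size, the bound $B$ and the sampling scheme are arbitrary illustrative choices.

```python
import math
import random


def kl_divergence(p_true, p_model):
    """KL(p_true || p_model) in nats; equals the per-context cross-entropy gap."""
    return sum(a * math.log(a / b) for a, b in zip(p_true, p_model) if a > 0)


def random_distribution(rng, size):
    """A random point in the probability simplex with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
vocab_size, B = 8, 3.0
worst_ratio = 0.0

for _ in range(2000):
    p_star = random_distribution(rng, vocab_size)   # stands in for p*_{.|s}
    p_model = random_distribution(rng, vocab_size)  # stands in for p_{.|s}
    v = [rng.uniform(-B, B) for _ in range(vocab_size)]

    # Squared prediction error of the linear classifier v.
    lhs = sum(vi * (q - p) for vi, q, p in zip(v, p_model, p_star)) ** 2
    # Pinsker-type upper bound 2 ||v||_inf^2 KL(p* || p).
    rhs = 2.0 * max(abs(vi) for vi in v) ** 2 * kl_divergence(p_star, p_model)
    worst_ratio = max(worst_ratio, lhs / rhs)

print(f"largest observed lhs / rhs: {worst_ratio:.3f}")
```

The ratio never exceeds 1 on any instance, as the inequality requires; how close it gets to 1 depends on the sampled instances.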

E.2 PROOFS FOR ARBITRARY LANGUAGE MODELS

Theorem B.1 (Strengthened Theorem 4.1). Let $\{p_{\cdot|s}\}$ be a language model that is $\epsilon$-optimal, i.e. $\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}} \le \epsilon$ for some $\epsilon > 0$. For a classification task $T$ that is $(\tau, B)$-natural, we have

$$\ell_T(\{p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \{p_{\cdot|s}\})}}$$

For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$, we have

$$\ell_T(\{p_{\cdot|s}\}) \le \ell_T(\{\Phi p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}}$$

Proof. The proof has two main steps that we summarize by the following two lemmas. The first one upper bounds the downstream performance on natural tasks with the covariance of errors.

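Lemma E.2. For a language model $\{p_{\cdot|s}\}$, if $T$ is $(\tau, B)$-natural,

$$\ell_T(\{p_{\cdot|s}\}) \le \tau + \sup_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma(p_T; \{p_{\cdot|s}\})}}$$

If $T$ is $(\tau, B)$-natural w.r.t. $\Phi \in \mathbb{R}^{d \times V}$,

$$\ell_T(\{\Phi p_{\cdot|s}\}) \le \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}}$$

where $\gamma(\cdot)$ and $\gamma_\Phi(\cdot)$ are from Definition B.1.

The second lemma upper bounds the covariance of error with the suboptimality of the language model.

Lemma E.6. For a language model $\{p_{\cdot|s}\}$ and classifier $v \in \mathbb{R}^V$,

$$v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v \le 2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}}\right)$$

where $\Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) = \mathbb{E}_{s \sim p_L}\left[(p_{\cdot|s} - p^*_{\cdot|s})(p_{\cdot|s} - p^*_{\cdot|s})^\top\right]$ as defined in Section B.

We prove both the above lemmas in Section E.6. We first use these to prove the main result. Combining the two lemmas, we get the following inequality

$$\begin{aligned}
\ell_T(\{p_{\cdot|s}\}) &\le^{(a)} \tau + \sup_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma(p_T; \{p_{\cdot|s}\})}} \\
&\le^{(b)} \tau + \sup_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}}\right)}{\gamma(p_T; \{p_{\cdot|s}\})}} \le^{(c)} \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \{p_{\cdot|s}\})}}
\end{aligned}$$

where (a) uses the first part of Lemma E.2, (b) uses Lemma E.6 and (c) uses the $\epsilon$-optimality of $\{p_{\cdot|s}\}$. This proves the first part of the result. The second part can be proved similarly.

$$\begin{aligned}
\ell_T(\{\Phi p_{\cdot|s}\}) &\le^{(a)} \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}} \\
&\le^{(b)} \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}}\right)}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}} \\
&\le \tau + \sup_{v \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{2\|v\|_\infty^2 \left(\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}}\right)}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}} \le^{(c)} \tau + \sqrt{\frac{2B^2\epsilon}{\gamma_\Phi(p_T; \{p_{\cdot|s}\})}}
\end{aligned}$$

where (a) uses the second part of Lemma E.2, (b) uses Lemma E.6 and (c) uses the $\epsilon$-optimality of $\{p_{\cdot|s}\}$. The proofs of the lemmas can be found in Section E.6.

Theorem 4.1. Let $\{p_{\cdot|s}\}$ be a language model that is $\epsilon$-optimal, i.e. $\ell_{\text{xent}}(\{p_{\cdot|s}\}) - \ell^*_{\text{xent}} \le \epsilon$, for some $\epsilon > 0$. For a classification task $T$ that is $(\tau, B)$-natural, we have

$$\ell_T(\{p_{\cdot|s}\}) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T)}}$$

Proof. This follows from the first part of Theorem B.1 if we can also show that $\gamma(p_T; \{p_{\cdot|s}\})^{-1} \le \gamma(p_T)^{-1}$. For that we use the following lemma that we prove in Section E.6.

Lemma E.9. For any $g : S \to \mathbb{R}^D$ and $p_T \in \Delta_S$, we have

$$\left\|\Sigma_{p_L}(g)^{-\frac{1}{2}} \Sigma_{p_T}(g) \Sigma_{p_L}(g)^{-\frac{1}{2}}\right\|_2 \le \gamma(p_T)^{-1}$$

Instantiating this for $g = \Delta_{\{p_{\cdot|s}\}}$ and using Equation (7), we get $\gamma(p_T; \{p_{\cdot|s}\})^{-1} \le \gamma(p_T)^{-1}$, which completes the proof.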
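To get a feel for how the bound in Theorem 4.1 scales, the following sketch simply evaluates $\tau + \sqrt{2B^2\epsilon / \gamma(p_T)}$ for some hypothetical parameter values; the function name and all numbers are illustrative and not taken from the experiments.

```python
import math


def downstream_bound(tau, B, eps, gamma):
    """Upper bound tau + sqrt(2 B^2 eps / gamma) on the downstream loss."""
    return tau + math.sqrt(2.0 * B ** 2 * eps / gamma)


# Hypothetical task and transfer parameters.
tau, B, gamma = 0.05, 2.0, 0.5

for eps in (0.04, 0.01, 0.0025):
    bound = downstream_bound(tau, B, eps, gamma)
    print(f"eps = {eps:<7} bound = {bound:.3f}  excess over tau = {bound - tau:.3f}")

# Reducing the cross-entropy gap by 4x halves the excess over tau.
excess_large = downstream_bound(tau, B, 0.04, gamma) - tau
excess_small = downstream_bound(tau, B, 0.01, gamma) - tau
```

The excess over $\tau$ shrinks like $\sqrt{\epsilon}$: a fourfold improvement in the cross-entropy suboptimality only halves the guaranteed excess loss, while a smaller transferability coefficient inflates it by $\gamma(p_T)^{-1/2}$.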

E.3 PROOFS FOR SOFTMAX LANGUAGE MODELS

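Theorem 5.1 (Strengthened Theorem 4.2). For a fixed $\Phi$, let $f$ be features from an $\epsilon$-optimal $d$-dimensional softmax language model, i.e. $\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi) \le \epsilon$, where $\ell^*_{\text{xent}}(\Phi)$ is defined in Equation (4). For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$, we have

$$\ell_T(\{p_{f(s)}\}) \le \ell_T(\Phi p_f) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T; \Phi p_f)}}$$

Proof. Instantiating Lemma E.2 for $p_{\cdot|s} = p_{f(s)}$, we get

$$\begin{aligned}
\ell_T(\{\Phi p_{f(s)}\}) &\le \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V, \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{f(s)}\}}) v}{\gamma_\Phi(p_T; \{p_{f(s)}\})}} \\
&=^{(a)} \tau + \sup_{\|\Phi^\top \lambda\|_\infty \le B} \sqrt{\frac{\lambda^\top \Phi \Sigma_{p_L}(\Delta_{\{p_{f(s)}\}}) \Phi^\top \lambda}{\gamma(p_T; \Phi p_f)}} = \tau + \sup_{\|\Phi^\top \lambda\|_\infty \le B} \sqrt{\frac{\lambda^\top \Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}}) \lambda}{\gamma(p_T; \Phi p_f)}}
\end{aligned}$$

where (a) follows from Equation (9), which says $\gamma(p_T; \Phi p_f) = \gamma_\Phi(p_T; \{p_{f(s)}\})$. We now bound the numerator using the following lemma, which we prove in Section E.6.

Lemma E.7. For a fixed $\Phi$ and a softmax language model with features $f$ and $\lambda \in \mathbb{R}^d$,

$$\lambda^\top \Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}}) \lambda \le 2\|\Phi^\top \lambda\|_\infty^2 \left(\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi)\right)$$

where $\Sigma_{p_L}(\Phi\Delta_{\{p_{f(s)}\}}) = \mathbb{E}_{s \sim p_L}\left[(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})^\top\right]$ as defined in Section B.

Using Lemma E.7 directly gives us $\ell_T(\Phi p_f) = \ell_T(\{\Phi p_{f(s)}\}) \le \tau + \sqrt{\frac{2B^2\left(\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi)\right)}{\gamma(p_T; \Phi p_f)}}$, and the $\epsilon$-optimality almost completes the proof. The only thing remaining to show is that $\ell_T(\{p_{f(s)}\}) \le \ell_T(\Phi p_f)$, which follows from the following sequence.

$$\ell_T(\{p_{f(s)}\}) = \inf_{v \in \mathbb{R}^V, b \in \mathbb{R}} \ell_T(\{p_{f(s)}\}, (v, b)) \le \inf_{\Phi^\top \lambda \in \mathbb{R}^V, b \in \mathbb{R}} \ell_T(\{p_{f(s)}\}, (\Phi^\top \lambda, b)) = \inf_{\lambda \in \mathbb{R}^d, b \in \mathbb{R}} \ell_T(\{\Phi p_{f(s)}\}, (\lambda, b)) = \ell_T(\Phi p_f)$$

Theorem 4.2. For a fixed $\Phi$, let $f$ be features from an $\epsilon$-optimal $d$-dimensional softmax language model, i.e. $\ell_{\text{xent}}(f, \Phi) - \ell^*_{\text{xent}}(\Phi) \le \epsilon$, where $\ell^*_{\text{xent}}(\Phi)$ is defined in Equation (4). For a classification task $T$ that is $(\tau, B)$-natural w.r.t. $\Phi$, we have

$$\ell_T(\{p_{f(s)}\}) \le \ell_T(\Phi p_f) \le \tau + \sqrt{\frac{2B^2\epsilon}{\gamma(p_T)}}$$

Proof. This result follows directly from Theorem 5.1, if we can also show that $\gamma(p_T; \Phi p_f)^{-1} \le \gamma(p_T)^{-1}$, just like in the proof of Theorem 4.1. For that we again use Lemma E.9 with $g = \Phi\Delta_{\{p_{f(s)}\}}$ and Equation (9), and this completes the proof.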

E.4 PROOFS FOR SECTION 4.3

We first show why Assumption 4.1 is approximately true when word embeddings are gaussian like. Lemma E.1. Suppose word embeddings φ w are independent samples from the distribution N (µ, Σ).

Then for any

θ ∈ R d such that λ 2 = θ Σθ = O(1) we have that | log(Z θ ) -1 2 θ Σθ -θ µ - log(V )| ≤ with probability 1 -δ for = Õ e λ 2 √ V and δ = 1 -exp(-Ω(log 2 (V ))). Proof. We first note that log(Z θ ) = log w e θ φw = θ µ + log w e θ (φw-µ) , thus we can simply deal with the case where φ w are sampled from N (0, Σ). Furthermore the only random variable of interest is X w = θ φ w which is a gaussian variable N (0, θ Σθ) = N (0, λ 2 ). Thus the problem reduces to showing that for V samples of X w ∼ N (0, λ 2 ), log(Z) is concentrated around λ 2 + log(V ) where Z = w exp(X w ). This can be proved similarly to the proof of Lemma 2.1 in Arora et al. (2016) . It is easy to see that E Xw∼N (0,λ 2 ) [exp(X w )] = e λ 2 . However the variable exp(X w ) is neither sub-gaussian nor sub-exponential and thus standard inequalities cannot be used directly. We use the same technique as Arora et al. (2016) to first observe that E[Z] = V e 1 2 λ 2 and Var[Z] ≤ E[exp(2Xw)] = V e 2λ 2 . After conditioning on the event that X w ≤ 1 2 λ log(V ) and applying Berstein's inequality just like in Arora et al. (2016) completes the proof. We next prove Lemma 4.3 that establishes a linear relationship between Φp f and f (under Assumption 4.1) and also the guarantees for f on natural tasks. Proof. Assumption 4.1 gives us that log(Z θ ) = 1 2 θ Aθ + θ b + c. We prove this lemma by matching the gradients of log(Z θ ) and the quadratic function on the R.H.S. ∇ θ log(Z θ ) = ∇ θ Z θ Z θ = w∈W e φ w θ φ w Z θ = w∈W p θ (w)φ w = Φp θ Whereas the gradient of the quadratic part is ∇ θ [ 1 2 θ Aθ + θ b + c] = Aθ + b. Matching the two for θ = f (s) gives us Φp f (s) = Φp f (s) = Af (s) + b. Corollary 4.1. Using Lemma 4.3, for any -optimal f , as defined in Theorem 4.2, for classification tasks that are (τ, B)-natural w.r.t. Φ we have T (f ) ≤ τ + O( √ ). Proof. The main idea is that Lemma 4.3 gives us that Φp f (s) = Af (s) + b and thus any linear function of Φp f will also be a linear function of f (s). 
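From Theorem 5.1 (or Theorem 4.2), we also know that $\Phi p_f$ will do well on $\mathcal{T}$, i.e. $\ell_{\mathcal{T}}(\Phi p_f) \le \tau + O(B\sqrt{\epsilon})$. We formalize (footnote 6) the intuition as

$$\ell_{\mathcal{T}}(\Phi p_f) = \inf_{\lambda \in \mathbb{R}^d, b'} \ell_{\mathcal{T}}(\Phi p_f, (\lambda, b')) = \inf_{\lambda \in \mathbb{R}^d, b'} \ell_{\mathcal{T}}(A f + b, (\lambda, b')) = \inf_{\lambda \in \mathbb{R}^d, b'} \ell_{\mathcal{T}}(f, (A^\top \lambda, b' + \lambda^\top b)) \ge \inf_{v \in \mathbb{R}^d, b'} \ell_{\mathcal{T}}(f, (v, b')) = \ell_{\mathcal{T}}(f).$$

This shows that $\ell_{\mathcal{T}}(f) \le \ell_{\mathcal{T}}(\Phi p_f) \le \tau + O(B\sqrt{\epsilon})$ and completes the proof.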
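The two facts used above can be checked numerically. The following is a minimal sketch, not the authors' code: it draws toy Gaussian embeddings (the vocabulary size, mean `mu`, identity covariance, and `theta` are all illustrative assumptions), compares a finite-difference gradient of $\log Z_\theta$ with $\Phi p_\theta$, and compares $\log Z_\theta$ with the value $\tfrac{1}{2}\theta^\top\Sigma\theta + \theta^\top\mu + \log V$ predicted by the concentration lemma.

```python
import math
import random


def scores(theta, phi):
    """Inner products theta . phi_w for every word embedding phi_w."""
    return [sum(t * x for t, x in zip(theta, phi_w)) for phi_w in phi]


def log_partition(theta, phi):
    """log Z_theta = log sum_w exp(theta . phi_w), via a stable log-sum-exp."""
    s = scores(theta, phi)
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s))


def conditional_mean(theta, phi):
    """Phi p_theta = sum_w p_theta(w) phi_w, with p_theta = softmax(Phi^T theta)."""
    s = scores(theta, phi)
    m = max(s)
    weights = [math.exp(v - m) for v in s]
    z = sum(weights)
    return [sum(w * phi_w[i] for w, phi_w in zip(weights, phi)) / z
            for i in range(len(theta))]


def finite_difference_gradient(theta, phi, h=1e-5):
    """Central-difference estimate of the gradient of log Z_theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((log_partition(up, phi) - log_partition(down, phi)) / (2 * h))
    return grad


# Toy setup (all values are illustrative assumptions): V Gaussian embeddings
# with mean mu and identity covariance, so theta^T Sigma theta = |theta|^2.
rng = random.Random(0)
V = 20000
mu = [0.1, -0.2, 0.0, 0.3]
phi = [[m_i + rng.gauss(0.0, 1.0) for m_i in mu] for _ in range(V)]
theta = [0.5, -0.3, 0.2, 0.4]

# Proof of Lemma 4.3: the gradient of log Z_theta equals Phi p_theta.
max_grad_gap = max(abs(a - b) for a, b in zip(
    finite_difference_gradient(theta, phi), conditional_mean(theta, phi)))

# Concentration lemma: log Z_theta is close to 0.5 * lambda^2 + theta . mu + log V.
lam_sq = sum(t * t for t in theta)
predicted = 0.5 * lam_sq + sum(t * m_i for t, m_i in zip(theta, mu)) + math.log(V)
log_z_gap = abs(log_partition(theta, phi) - predicted)

print(max_grad_gap, log_z_gap)
```

Both gaps should be small here; the second shrinks as `V` grows, in line with the $\tilde{O}(e^{\lambda^2}/\sqrt{V})$ rate.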

E.5 PROOFS FOR SECTION C

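Theorem C.1. The optimal solution $f^*, \Phi^* = \arg\min_{f,\Phi} \ell_{quad}(f, \Phi)$ satisfies

$$\Phi^* = B U_d^\top, \text{ for full rank } B \in \mathbb{R}^{d \times d},$$

$$f^*(s) = (\Phi^* \Phi^{*\top})^{-1} \Phi^* p^*_{\cdot|s} = C U_d^\top p^*_{\cdot|s}, \text{ for full rank } C \in \mathbb{R}^{d \times d}.$$

If $\Phi$ is fixed, then the optimal solution is $f^*(s) = (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}$.

Proof. From Equations (10) and (11) we know that

$$\ell_{quad,s}(\theta, \Phi) = -\theta^\top \Phi p^*_{\cdot|s} + \tfrac{1}{2}\|\Phi^\top \theta\|^2 \quad \text{and} \quad \ell_{quad}(f, \Phi) = \mathbb{E}_{s \sim p_L}\left[\ell_{quad,s}(f(s), \Phi)\right].$$

For a fixed $\Phi$, let $f^*_\Phi(s) = \arg\min_{\theta \in \mathbb{R}^d} \ell_{quad,s}(\theta, \Phi)$. We use the first-order optimality condition to get $f^*_\Phi(s)$, by using the fact that $\nabla_\theta \ell_{quad,s}(\theta, \Phi) = -\Phi p^*_{\cdot|s} + \Phi\Phi^\top \theta$. Setting the gradient to zero, we get $f^*_\Phi(s) = (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}$ (footnote 7). To get the optimal $\Phi^*$ for this objective, we plug this expression for $f^*_\Phi$ into $\ell_{quad}$ and find $\Phi^* = \arg\min_\Phi \ell_{quad}(f^*_\Phi, \Phi)$:

$$\begin{aligned}
\ell_{quad}(f^*_\Phi, \Phi) &= \mathbb{E}_{s \sim p^*}\left[\ell_{quad,s}(f^*_\Phi(s), \Phi)\right] = \mathbb{E}_{s \sim p^*}\left[-f^*_\Phi(s)^\top \Phi p^*_{\cdot|s} + \tfrac{1}{2}\|\Phi^\top f^*_\Phi(s)\|^2\right] \\
&= \mathbb{E}_{s \sim p^*}\left[-\left((\Phi\Phi^\top)^{-1}\Phi p^*_{\cdot|s}\right)^\top \Phi p^*_{\cdot|s} + \tfrac{1}{2}\|\Phi^\top (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}\|^2\right] \\
&= \mathbb{E}_{s \sim p^*}\left[-p^{*\top}_{\cdot|s} \Phi^\top (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s} + \tfrac{1}{2} p^{*\top}_{\cdot|s} \Phi^\top (\Phi\Phi^\top)^{-1} \Phi\Phi^\top (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}\right] \\
&= \mathbb{E}_{s \sim p^*}\left[-\tfrac{1}{2} p^{*\top}_{\cdot|s} \Phi^\top (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}\right] = -\tfrac{1}{2} \mathbb{E}_{s \sim p^*}\left[\mathrm{tr}\left(p^{*\top}_{\cdot|s} \Phi^\top (\Phi\Phi^\top)^{-1} \Phi p^*_{\cdot|s}\right)\right] \\
&= -\tfrac{1}{2} \mathrm{tr}\left(\Phi^\top (\Phi\Phi^\top)^{-1} \Phi \, \mathbb{E}_{s \sim p^*}\left[p^*_{\cdot|s} p^{*\top}_{\cdot|s}\right]\right) = -\tfrac{1}{2} \left\langle \Phi^\top (\Phi\Phi^\top)^{-1} \Phi, \mathbb{E}_{s \sim p^*}\left[p^*_{\cdot|s} p^{*\top}_{\cdot|s}\right] \right\rangle \\
&= -\tfrac{1}{2} \left\langle \Phi^\top (\Phi\Phi^\top)^{-1} \Phi, \Omega^* \right\rangle,
\end{aligned}$$

where $\Omega^*$ is the substitutability matrix defined in Definition 5.1. Let $\Phi = N T V^\top$ be the SVD. Then the above objective reduces to $\ell_{quad}(f^*_\Phi, \Phi) = -\tfrac{1}{2}\langle V V^\top, \Omega^* \rangle$. Hence learning the optimal $\Phi^*$ reduces to learning an optimal $V^*$ such that

$$V^* = \arg\min_{V \in \mathbb{R}^{V \times d},\ V^\top V = I_d} -\langle V V^\top, \Omega^* \rangle.$$

We will now show that the best such matrix is the matrix of top $d$ eigenvectors of $\Omega^*$, i.e. $V^* = U_d$ (cf. Definition 5.1). Here we will assume that the eigenvalues of $\Omega^*$ are all distinct for simplicity of presentation. First we note that $\langle V V^\top, \Omega^* \rangle = \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2$, where $\Omega^{*\frac{1}{2}} = U S^{\frac{1}{2}} U^\top$, with $U$, $U_d$ and $S$ defined in Definition 5.1.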
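This can be shown by the following sequence of steps:

$$\begin{aligned}
\langle V V^\top, \Omega^* \rangle &= \mathrm{tr}(V V^\top \Omega^*) = \mathrm{tr}(V V^\top V V^\top \Omega^*) = \mathrm{tr}(V V^\top \Omega^* V V^\top) = \mathrm{tr}(V V^\top U S U^\top V V^\top) \\
&= \mathrm{tr}(V V^\top U S^{\frac{1}{2}} U^\top U S^{\frac{1}{2}} U^\top V V^\top) = \mathrm{tr}(V V^\top \Omega^{*\frac{1}{2}} \Omega^{*\frac{1}{2}} V V^\top) \\
&= \langle V V^\top \Omega^{*\frac{1}{2}}, V V^\top \Omega^{*\frac{1}{2}} \rangle = \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2.
\end{aligned}$$

Furthermore, we notice that $\|V V^\top \Omega^{*\frac{1}{2}}\|_F^2 = \|\Omega^{*\frac{1}{2}}\|_F^2 - \|\Omega^{*\frac{1}{2}} - V V^\top \Omega^{*\frac{1}{2}}\|_F^2$, as shown below:

$$\begin{aligned}
\|\Omega^{*\frac{1}{2}} - V V^\top \Omega^{*\frac{1}{2}}\|_F^2 &= \|\Omega^{*\frac{1}{2}}\|_F^2 + \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2 - 2\,\mathrm{tr}(\Omega^{*\frac{1}{2}} V V^\top \Omega^{*\frac{1}{2}}) \\
&= \|\Omega^{*\frac{1}{2}}\|_F^2 + \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2 - 2\,\mathrm{tr}(\Omega^{*\frac{1}{2}} V V^\top V V^\top \Omega^{*\frac{1}{2}}) \\
&= \|\Omega^{*\frac{1}{2}}\|_F^2 + \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2 - 2\|V V^\top \Omega^{*\frac{1}{2}}\|_F^2 = \|\Omega^{*\frac{1}{2}}\|_F^2 - \|V V^\top \Omega^{*\frac{1}{2}}\|_F^2.
\end{aligned}$$

Thus we get

$$\arg\min_{V \in \mathbb{R}^{V \times d},\ V^\top V = I_d} -\langle V V^\top, \Omega^* \rangle = \arg\min_{V \in \mathbb{R}^{V \times d},\ V^\top V = I_d} \|\Omega^{*\frac{1}{2}} - V V^\top \Omega^{*\frac{1}{2}}\|_F^2.$$

Note that $V V^\top \Omega^{*\frac{1}{2}}$ has columns that are the columns of $\Omega^{*\frac{1}{2}}$ projected onto the space spanned by the columns of $V$. It is folklore that the best such subspace $V^*$ is the subspace spanned by the top $d$ eigenvectors of $\Omega^{*\frac{1}{2}}$, which are the same as the top $d$ eigenvectors of $\Omega^*$, thus giving us $V^* V^{*\top} = U_d U_d^\top$. Thus we get $V^* = U_d M$ for $M = U_d^\top V^*$. This tells us that the optimal solution $\Phi^*$ will have an SVD of the form $\Phi^* = N^* T^* V^{*\top}$, thus giving us $\Phi^* = B U_d^\top$ for the matrix $B = N^* T^* M^\top \in \mathbb{R}^{d \times d}$. This directly gives $f^* = f^*_{\Phi^*} = (\Phi^* \Phi^{*\top})^{-1} \Phi^* p^*_{\cdot|s} = N^* T^{*-1} V^{*\top} p^*_{\cdot|s} = C U_d^\top p^*_{\cdot|s}$ for $C = N^* T^{*-1} M^\top$.

E.6 PROOFS FOR SUPPORTING LEMMAS

Lemma E.2. For a language model $\{p_{\cdot|s}\}$, if $\mathcal{T}$ is $(\tau, B)$-natural,

$$\ell_{\mathcal{T}}(\{p_{\cdot|s}\}) \le \tau + \sup_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma(p_{\mathcal{T}}; \{p_{\cdot|s}\})}}.$$

If $\mathcal{T}$ is $(\tau, B)$-natural w.r.t. $\Phi \in \mathbb{R}^{d \times V}$,

$$\ell_{\mathcal{T}}(\{\Phi p_{\cdot|s}\}) \le \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma_\Phi(p_{\mathcal{T}}; \{p_{\cdot|s}\})}},$$

where $\gamma(\cdot)$ and $\gamma_\Phi(\cdot)$ are from Definition B.1.

Proof. We note the following upper bounds on $\ell_{\mathcal{T}}(\{p_{\cdot|s}\})$ and $\ell_{\mathcal{T}}(\{\Phi p_{\cdot|s}\})$:

$$\ell_{\mathcal{T}}(\{p_{\cdot|s}\}) = \inf_{v \in \mathbb{R}^V} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \inf_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \tag{26}$$

$$\ell_{\mathcal{T}}(\{\Phi p_{\cdot|s}\}) = \inf_{v = \Phi^\top \lambda \in \mathbb{R}^V} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \inf_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ b \in \mathbb{R},\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \tag{27}$$

When $\mathcal{T}$ is $(\tau, B)$-natural, by Definition 3.1 we know that $\inf_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) \le \tau$.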
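We now upper bound $\ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v)$ using Lemma E.8, which gives

$$\ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v}.$$

Taking the infimum over $v \in \mathbb{R}^V$ with $\|v\|_\infty \le B$ on both sides,

$$\inf_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \inf_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sup_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v}.$$

This, combined with Equation (26), gives us

$$\ell_{\mathcal{T}}(\{p_{\cdot|s}\}) \le \tau + \sup_{v \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v}. \tag{28}$$

Using Lemma E.10 and the definition of $\gamma(p_{\mathcal{T}}; \{p_{\cdot|s}\})$ in Equation (7), we get that

$$v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v \le \left\| \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \right\|_2 v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v = \frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma(p_{\mathcal{T}}; \{p_{\cdot|s}\})}.$$

We have thus successfully transferred the bound from the distribution $p_{\mathcal{T}}$ to $p_L$. Combining this with Equation (28) completes the proof of the first part of the lemma.

We now prove the second part of the lemma, where we only assume that $\mathcal{T}$ is $(\tau, B)$-natural w.r.t. $\Phi$. Here we instead take the infimum over classifiers in the span of $\Phi$ in Lemma E.8 to get

$$\inf_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ b \in \mathbb{R},\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \inf_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ b \in \mathbb{R},\ \|v\|_\infty \le B} \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v}.$$

This, combined with the definition of a $(\tau, B)$-natural task w.r.t. $\Phi$ and Equation (27), gives us

$$\ell_{\mathcal{T}}(\{\Phi p_{\cdot|s}\}) \le \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v}. \tag{31}$$

For the last term, for any $v = \Phi^\top \lambda$, $\lambda \in \mathbb{R}^d$, we notice that

$$\begin{aligned}
v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v &= \lambda^\top \Phi \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) \Phi^\top \lambda = \lambda^\top \Sigma_{p_{\mathcal{T}}}(\Phi \Delta_{\{p_{\cdot|s}\}}) \lambda \\
&\le^{(a)} \left\| \Sigma_{p_L}(\Phi \Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \Sigma_{p_{\mathcal{T}}}(\Phi \Delta_{\{p_{\cdot|s}\}}) \Sigma_{p_L}(\Phi \Delta_{\{p_{\cdot|s}\}})^{-\frac{1}{2}} \right\|_2 \lambda^\top \Sigma_{p_L}(\Phi \Delta_{\{p_{\cdot|s}\}}) \lambda \\
&= \frac{\lambda^\top \Sigma_{p_L}(\Phi \Delta_{\{p_{\cdot|s}\}}) \lambda}{\gamma_\Phi(p_{\mathcal{T}}; \{p_{\cdot|s}\})} = \frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma_\Phi(p_{\mathcal{T}}; \{p_{\cdot|s}\})},
\end{aligned}$$

where (a) follows from Lemma E.10, as in the first part. Combining this with Equation (31), we get

$$\ell_{\mathcal{T}}(\{\Phi p_{\cdot|s}\}) \le \tau + \sup_{v = \Phi^\top \lambda \in \mathbb{R}^V,\ \|v\|_\infty \le B} \sqrt{\frac{v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v}{\gamma_\Phi(p_{\mathcal{T}}; \{p_{\cdot|s}\})}}.$$

Lemma E.3 (Pinsker's inequality). For discrete distributions $q, q^* \in \Delta_V$, let $q, q^* \in \mathbb{R}^V$ be the corresponding vectors of probabilities. Then we have

$$\max_{\|v\|_\infty \le 1} |v^\top (q - q^*)| \le \sqrt{2 D_{KL}(q^*, q)}.$$

Proof.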
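This basically follows from Pinsker's inequality, which upper bounds the total variation distance between distributions by their KL-divergence:

$$\max_{\|v\|_\infty \le 1} |v^\top (q - q^*)| = \|q - q^*\|_1 = 2\,\mathrm{TV}(q^*, q) \le \sqrt{2 D_{KL}(q^*, q)}.$$

We remind the reader that for an embedding matrix $\Phi \in \mathbb{R}^{d \times V}$, $p_{\theta,\Phi} := \mathrm{softmax}(\Phi^\top \theta)$.

Lemma E.4 (Softmax variant of Pinsker's inequality). Consider a matrix $\Phi \in \mathbb{R}^{d \times V}$ with $d \le V$. For any discrete distribution $q^* \in \Delta_V$ and softmax distribution $p_{\theta,\Phi} = \mathrm{softmax}(\Phi^\top \theta) \in \Delta_V$ for $\theta \in \mathbb{R}^d$, let $q^*, p_{\theta,\Phi} \in \mathbb{R}^V$ be the corresponding vectors of probabilities. Then we have

$$\max_{v = \Phi^\top \lambda,\ \|v\|_\infty \le 1} |v^\top (p_{\theta,\Phi} - q^*)| \le \sqrt{2\left(D_{KL}(q^*, p_{\theta,\Phi}) - \inf_{\theta^* \in \mathbb{R}^d} D_{KL}(q^*, p_{\theta^*,\Phi})\right)}. \tag{32}$$

Pinsker's inequality (Lemma E.3), on the other hand, gives

$$\max_{\|v\|_\infty \le 1} |v^\top (p_{\theta,\Phi} - q^*)| \le \sqrt{2 D_{KL}(q^*, p_{\theta,\Phi})}.$$

Proof. Define the loss $\rho(\theta) := D_{KL}(q^*, p_{\theta,\Phi})$. The statement in Equation (32) to prove reduces to

$$\max_{\|\Phi^\top \lambda\|_\infty \le 1} |\lambda^\top (\Phi p_{\theta,\Phi} - \Phi q^*)| \le \sqrt{2\left(\rho(\theta) - \inf_{\theta^* \in \mathbb{R}^d} \rho(\theta^*)\right)}. \tag{33}$$

To prove this, we compute the gradient and Hessian of $\rho(\theta)$ w.r.t. $\theta$. Up to the entropy $H(q^*)$ of $q^*$, which does not depend on $\theta$ and therefore affects neither the derivatives nor the gap $\rho(\theta) - \inf_{\theta^*} \rho(\theta^*)$, we can simplify $\rho(\theta)$ as follows:

$$\rho(\theta) + H(q^*) = \mathbb{E}_{w \sim q^*}\left[-\log(p_{\theta,\Phi}(w))\right] = \mathbb{E}_{w \sim q^*}\left[-\log \frac{e^{\theta^\top \phi_w}}{\sum_{w'} e^{\theta^\top \phi_{w'}}}\right] = -\theta^\top \Phi q^* + \log \sum_{w'} e^{\theta^\top \phi_{w'}} = -\theta^\top \Phi q^* + \log(Z_\theta).$$

The gradient is

$$\nabla \rho(\theta) = \nabla\left[-\theta^\top \Phi q^* + \log(Z_\theta)\right] = -\Phi q^* + \frac{\nabla Z_\theta}{Z_\theta} = -\Phi q^* + \frac{\nabla \sum_w e^{\theta^\top \phi_w}}{Z_\theta} = -\Phi q^* + \frac{\sum_w e^{\theta^\top \phi_w} \phi_w}{Z_\theta} = -\Phi q^* + \Phi p_{\theta,\Phi}.$$

Similarly, the Hessian can be computed as

$$\begin{aligned}
\nabla^2 \rho(\theta) &= \nabla(\nabla \rho(\theta)) = \nabla\left[-\Phi q^* + \Phi p_{\theta,\Phi}\right] = \nabla \sum_{w \in \mathcal{W}} p_{\theta,\Phi}(w) \phi_w = \sum_{w \in \mathcal{W}} \phi_w \nabla\left(\frac{e^{\theta^\top \phi_w}}{Z_\theta}\right)^\top \\
&= \sum_{w \in \mathcal{W}} \frac{e^{\theta^\top \phi_w}}{Z_\theta} \phi_w \phi_w^\top - \sum_{w \in \mathcal{W}} \frac{e^{\theta^\top \phi_w}}{Z_\theta^2} \phi_w \left(\sum_{w'} e^{\theta^\top \phi_{w'}} \phi_{w'}\right)^\top \\
&= \mathbb{E}_{w \sim p_{\theta,\Phi}}\left[\phi_w \phi_w^\top\right] - \mathbb{E}_{w \sim p_{\theta,\Phi}}[\phi_w]\, \mathbb{E}_{w \sim p_{\theta,\Phi}}[\phi_w]^\top = \mathrm{Cov}_{w \sim p_{\theta,\Phi}}[\phi_w],
\end{aligned}$$

where $\mathrm{Cov}_{w \sim p_{\theta,\Phi}}[\phi_w]$ denotes the covariance of the word embeddings $\phi_w$ when measured w.r.t. the distribution $p_{\theta,\Phi}$. This directly gives us that $\nabla^2 \rho(\theta) \succeq 0$, since a covariance is always psd, and thus $\rho$ is convex in $\theta$. We return to the statement in Equation (33) that we need to prove.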
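With the expression for the gradient of $\rho$ at hand, we can rewrite Equation (33) as trying to prove

$$|\lambda^\top \nabla \rho(\theta)| \le \|\Phi^\top \lambda\|_\infty \sqrt{2\left(\rho(\theta) - \inf_{\theta^* \in \mathbb{R}^d} \rho(\theta^*)\right)}. \tag{34}$$

Furthermore, using the expression for the Hessian, it is not hard to see that for any $\lambda, \tilde{\theta} \in \mathbb{R}^d$,

$$\lambda^\top \nabla^2 \rho(\tilde{\theta}) \lambda = \mathrm{Cov}_{w \sim p_{\tilde{\theta},\Phi}}[\lambda^\top \phi_w] \le \mathbb{E}_{w \sim p_{\tilde{\theta},\Phi}}\left[(\lambda^\top \phi_w)^2\right] \le \|\Phi^\top \lambda\|_\infty^2.$$

Thus we can invoke Lemma E.5 with $\ell = \rho$ and $L = \|\Phi^\top \lambda\|_\infty^2$ to prove Equation (34), thus completing the proof.

Intuitively, Lemma E.5 exploits the smoothness of the function to argue that small suboptimality (i.e. being close to the optimal solution in function value) is sufficient to guarantee a small norm of the gradient, a property that is well known in the optimization literature. We now present this lemma.

Lemma E.5. If a function $\ell : \mathbb{R}^d \to \mathbb{R}$ and $\lambda \in \mathbb{R}^d$ satisfy $\lambda^\top \nabla^2 \ell(\tilde{\theta}) \lambda \le L$ for all $\tilde{\theta} \in \mathbb{R}^d$ ($L$-smoothness in the direction of $\lambda$), and if $\ell^* = \inf_{\theta \in \mathbb{R}^d} \ell(\theta)$, then

$$|\lambda^\top \nabla \ell(\theta)|^2 \le 2L(\ell(\theta) - \ell^*).$$

Proof. This is a variant of a classical result used in optimization, and we prove it here for completeness. For any $\eta \in \mathbb{R}$ we have

$$\ell(\theta) - \ell^* \ge^{(a)} \ell(\theta) - \ell(\theta - \eta\lambda) \ge^{(b)} \ell(\theta) - \left(\ell(\theta) + \langle \nabla \ell(\theta), -\eta\lambda \rangle + \frac{\eta^2}{2} \lambda^\top \nabla^2 \ell(\tilde{\theta}) \lambda\right) \ge^{(c)} \eta\,(\lambda^\top \nabla \ell(\theta)) - \frac{\eta^2 L}{2},$$

where (a) follows from the definition of the infimum, (b) follows from Taylor's expansion for some $\tilde{\theta} \in [\theta - \eta\lambda, \theta]$, and (c) follows from the smoothness condition in the statement of the lemma. Picking $\eta = \frac{\lambda^\top \nabla \ell(\theta)}{L}$ gives us $\ell(\theta) - \ell^* \ge \frac{1}{2L} |\lambda^\top \nabla \ell(\theta)|^2$, thus completing the proof.

Lemma E.6. For a language model $\{p_{\cdot|s}\}$ and classifier $v \in \mathbb{R}^V$,

$$v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v \le 2\|v\|_\infty^2 \left(\ell_{xent}(\{p_{\cdot|s}\}) - \ell^*_{xent}\right),$$

where $\Sigma_{p_L}(g) = \mathbb{E}_{s \sim p_L}[g(s) g(s)^\top]$ and $\Delta_{\{p_{\cdot|s}\}}(s) = p_{\cdot|s} - p^*_{\cdot|s}$ are defined in Section B.

Proof.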
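We first note that

$$\ell_{xent}(\{p_{\cdot|s}\}) - \ell_{xent}(\{p^*_{\cdot|s}\}) = \mathbb{E}_{s \sim p_L} \mathbb{E}_{w \sim p^*_{\cdot|s}}\left[\log \frac{p^*_{\cdot|s}(w)}{p_{\cdot|s}(w)}\right] = \mathbb{E}_{s \sim p_L}\left[D_{KL}(p^*_{\cdot|s}, p_{\cdot|s})\right]. \tag{35}$$

We bound $v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v$ below:

$$v^\top \Sigma_{p_L}(\Delta_{\{p_{\cdot|s}\}}) v = \mathbb{E}_{s \sim p_L}\left[\left(v^\top (p_{\cdot|s} - p^*_{\cdot|s})\right)^2\right] \le^{(a)} \|v\|_\infty^2\, \mathbb{E}_{s \sim p_L}\left[2 D_{KL}(p^*_{\cdot|s}, p_{\cdot|s})\right] =^{(b)} 2\|v\|_\infty^2 \left(\ell_{xent}(\{p_{\cdot|s}\}) - \ell_{xent}(\{p^*_{\cdot|s}\})\right),$$

where (a) follows from Lemma E.3 (Pinsker's inequality) and (b) uses Equation (35).

Lemma E.7. For a fixed $\Phi$, a softmax language model with features $f$ and $\lambda \in \mathbb{R}^d$,

$$\lambda^\top \Sigma_{p_L}(\Phi \Delta_{\{p_{f(s)}\}}) \lambda \le 2\|\Phi^\top \lambda\|_\infty^2 \left(\ell_{xent}(f, \Phi) - \ell^*_{xent}(\Phi)\right),$$

where $\Sigma_{p_L}(\Phi \Delta_{\{p_{f(s)}\}}) = \mathbb{E}_{s \sim p_L}\left[(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})^\top\right]$ as defined in Section B.

Proof. We start by noting that

$$\lambda^\top \Sigma_{p_L}(\Phi \Delta_{\{p_{f(s)}\}}) \lambda = \lambda^\top \mathbb{E}_{s \sim p_L}\left[(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})(\Phi p_{f(s)} - \Phi p^*_{\cdot|s})^\top\right] \lambda = \mathbb{E}_{s \sim p_L}\left[|\lambda^\top (\Phi p_{f(s)} - \Phi p^*_{\cdot|s})|^2\right] = \mathbb{E}_{s \sim p_L}\left[|(\Phi^\top \lambda)^\top (p_{f(s)} - p^*_{\cdot|s})|^2\right].$$

We will use the variant of Pinsker's inequality from Lemma E.4 to bound each term on the right hand side. Notice that $\ell_{xent}(f, \Phi) - \ell^*_{xent}(\Phi) = \mathbb{E}_{s \sim p_L}\left[\ell_{xent,s}(f(s), \Phi) - \inf_{\theta \in \mathbb{R}^d} \ell_{xent,s}(\theta, \Phi)\right]$. Then

$$\begin{aligned}
\lambda^\top \Sigma_{p_L}(\Phi \Delta_{\{p_{f(s)}\}}) \lambda &= \mathbb{E}_{s \sim p_L}\left[|(\Phi^\top \lambda)^\top (p_{f(s)} - p^*_{\cdot|s})|^2\right] \\
&\le^{(a)} 2\|\Phi^\top \lambda\|_\infty^2\, \mathbb{E}_{s \sim p_L}\left[D_{KL}(p^*_{\cdot|s}, p_{f(s),\Phi}) - \inf_{\theta \in \mathbb{R}^d} D_{KL}(p^*_{\cdot|s}, p_{\theta,\Phi})\right] \\
&\le 2\|\Phi^\top \lambda\|_\infty^2\, \mathbb{E}_{s \sim p_L}\left[\ell_{xent,s}(f(s), \Phi) - \inf_{\theta \in \mathbb{R}^d} \ell_{xent,s}(\theta, \Phi)\right] \\
&\le 2\|\Phi^\top \lambda\|_\infty^2 \left(\ell_{xent}(f, \Phi) - \ell^*_{xent}(\Phi)\right),
\end{aligned}$$

where (a) follows from Lemma E.4. This completes the proof.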
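Pinsker's inequality in the form used above (Lemma E.3) is easy to sanity-check numerically. The sketch below uses randomly generated toy distributions (the helper `random_distribution` and all sizes are illustrative assumptions). It checks that the maximum over $\|v\|_\infty \le 1$ is attained at $v = \mathrm{sign}(q - q^*)$ and equals the $\ell_1$ distance, and that this distance never exceeds $\sqrt{2 D_{KL}(q^*, q)}$.

```python
import math
import random


def kl_divergence(q_star, q):
    """D_KL(q*, q) = E_{w ~ q*}[log(q*(w) / q(w))]."""
    return sum(a * math.log(a / b) for a, b in zip(q_star, q) if a > 0)


def random_distribution(rng, size):
    """A strictly positive random probability vector (illustrative helper)."""
    raw = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)
smallest_slack = float("inf")
largest_sign_gap = 0.0
for _ in range(1000):
    q_star = random_distribution(rng, 50)
    q = random_distribution(rng, 50)
    diff = [a - b for a, b in zip(q, q_star)]
    # v = sign(q - q*) is feasible (|v|_inf <= 1) and attains the l1 distance.
    v = [1.0 if x >= 0 else -1.0 for x in diff]
    attained = abs(sum(vi * x for vi, x in zip(v, diff)))
    l1 = sum(abs(x) for x in diff)
    largest_sign_gap = max(largest_sign_gap, abs(attained - l1))
    # Slack of Pinsker's bound: sqrt(2 KL) - l1 should never be negative.
    smallest_slack = min(smallest_slack,
                         math.sqrt(2 * kl_divergence(q_star, q)) - l1)

print(smallest_slack, largest_sign_gap)
```

Squaring the same inequality and averaging over contexts is exactly step (a) in the proof of Lemma E.6.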

E.6.1 CLASSIFICATION LOSS TO COVARIANCE OF ERROR

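Lemma E.8. For any task $\mathcal{T}$, classifier $v \in \mathbb{R}^V$ and predicted probabilities $\{p_{\cdot|s}\}$,

$$\ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) \le \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sqrt{\mathbb{E}_{s \sim p_{\mathcal{T}}}\left[\left(v^\top (p_{\cdot|s} - p^*_{\cdot|s})\right)^2\right]} = \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v},$$

where $\Sigma_{p_{\mathcal{T}}}(g) = \mathbb{E}_{s \sim p_{\mathcal{T}}}[g(s) g(s)^\top]$ and $\Delta_{\{p_{\cdot|s}\}}(s) = p_{\cdot|s} - p^*_{\cdot|s}$ are defined in Section B.

Proof. The following sequence of inequalities proves it:

$$\begin{aligned}
\ell_{\mathcal{T}}(\{p_{\cdot|s}\}, v) &= \mathbb{E}_{(s,y) \sim p_{\mathcal{T}}}\left[\ell(v^\top p_{\cdot|s}, y)\right] \le^{(a)} \mathbb{E}_{(s,y) \sim p_{\mathcal{T}}}\left[\ell(v^\top p^*_{\cdot|s}, y) + |v^\top (p^*_{\cdot|s} - p_{\cdot|s})|\right] \\
&\le^{(b)} \mathbb{E}_{(s,y) \sim p_{\mathcal{T}}}\left[\ell(v^\top p^*_{\cdot|s}, y)\right] + \sqrt{\mathbb{E}_{s \sim p_{\mathcal{T}}}\left[|v^\top (p^*_{\cdot|s} - p_{\cdot|s})|^2\right]} \\
&= \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sqrt{v^\top \mathbb{E}_{s \sim p_{\mathcal{T}}}\left[(p^*_{\cdot|s} - p_{\cdot|s})(p^*_{\cdot|s} - p_{\cdot|s})^\top\right] v} = \ell_{\mathcal{T}}(\{p^*_{\cdot|s}\}, v) + \sqrt{v^\top \Sigma_{p_{\mathcal{T}}}(\Delta_{\{p_{\cdot|s}\}}) v},
\end{aligned}$$

where (a) follows from the 1-Lipschitzness of $\ell$ and (b) follows from Jensen's inequality.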

E.6.2 HANDLING DISTRIBUTION SHIFT

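Lemma E.9. For any $g : \mathcal{S} \to \mathbb{R}^D$ and $p_{\mathcal{T}} \in \Delta_{\mathcal{S}}$, we have

$$\left\| \Sigma_{p_L}(g)^{-\frac{1}{2}} \Sigma_{p_{\mathcal{T}}}(g) \Sigma_{p_L}(g)^{-\frac{1}{2}} \right\|_2 \le \gamma(p_{\mathcal{T}})^{-1}.$$

Proof. By the definition of $\gamma(p_{\mathcal{T}})$, we have that

$$\Sigma_{p_L}(g) = \mathbb{E}_{s \sim p_L}[g(s) g(s)^\top] = \sum_{s \in \mathcal{S}} p_L(s) g(s) g(s)^\top \succeq \gamma(p_{\mathcal{T}}) \sum_{s \in \mathcal{S}} p_{\mathcal{T}}(s) g(s) g(s)^\top = \gamma(p_{\mathcal{T}})\, \mathbb{E}_{s \sim p_{\mathcal{T}}}[g(s) g(s)^\top] = \gamma(p_{\mathcal{T}}) \Sigma_{p_{\mathcal{T}}}(g).$$

Thus $\frac{1}{\gamma(p_{\mathcal{T}})} \Sigma_{p_L}(g) \succeq \Sigma_{p_{\mathcal{T}}}(g)$, and hence $\frac{1}{\gamma(p_{\mathcal{T}})} \Sigma_{p_L}(g)^{-\frac{1}{2}} \Sigma_{p_L}(g) \Sigma_{p_L}(g)^{-\frac{1}{2}} \succeq \Sigma_{p_L}(g)^{-\frac{1}{2}} \Sigma_{p_{\mathcal{T}}}(g) \Sigma_{p_L}(g)^{-\frac{1}{2}}$, which is equivalent to $\frac{1}{\gamma(p_{\mathcal{T}})} I_D \succeq \Sigma_{p_L}(g)^{-\frac{1}{2}} \Sigma_{p_{\mathcal{T}}}(g) \Sigma_{p_L}(g)^{-\frac{1}{2}}$. This finishes the proof.

Lemma E.10. For matrices $X, Y \in \mathbb{R}^{D \times D}$ s.t. $X, Y \succeq 0$ and $Y$ is full rank, we have that

$$\max_{a \in \mathbb{R}^D,\ 0 < \|a\| \le \lambda} \frac{a^\top X a}{a^\top Y a} = \left\| Y^{-\frac{1}{2}} X Y^{-\frac{1}{2}} \right\|_2$$

for any norm $\|\cdot\|$.

Proof. Note that $\frac{a^\top X a}{a^\top Y a}$ is independent of the scaling of $a$. The following sequence of equalities completes the proof:

$$\begin{aligned}
\max_{a \in \mathbb{R}^D,\ 0 < \|a\| \le \lambda} \frac{a^\top X a}{a^\top Y a} &= \max_{a \in \mathbb{R}^D,\ a \ne 0} \frac{a^\top X a}{a^\top Y a} = \max_{a \in \mathbb{R}^D,\ a \ne 0} \frac{a^\top X a}{(Y^{\frac{1}{2}} a)^\top (Y^{\frac{1}{2}} a)} = \max_{a \in \mathbb{R}^D,\ \|Y^{\frac{1}{2}} a\|_2 = 1} a^\top X a \\
&= \max_{b \in \mathbb{R}^D,\ \|b\|_2 = 1} (Y^{-\frac{1}{2}} b)^\top X (Y^{-\frac{1}{2}} b) = \max_{b \in \mathbb{R}^D,\ \|b\|_2 = 1} b^\top Y^{-\frac{1}{2}} X Y^{-\frac{1}{2}} b = \left\| Y^{-\frac{1}{2}} X Y^{-\frac{1}{2}} \right\|_2.
\end{aligned}$$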
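Lemma E.10 can be illustrated on a small example. The sketch below uses two hand-picked $2 \times 2$ matrices `X` and `Y` (illustrative assumptions) and relies on a standard fact not stated in the text: the eigenvalues of $Y^{-\frac{1}{2}} X Y^{-\frac{1}{2}}$ coincide with the generalized eigenvalues of the pair $(X, Y)$, i.e. the roots $\mu$ of $\det(X - \mu Y) = 0$, so the spectral norm is the largest such root. It compares that root with a brute-force maximum of the ratio over directions $a$.

```python
import math


def rayleigh_ratio(a, X, Y):
    """a^T X a / a^T Y a for 2x2 symmetric matrices given as nested lists."""
    def quad(M):
        return (M[0][0] * a[0] * a[0] + 2 * M[0][1] * a[0] * a[1]
                + M[1][1] * a[1] * a[1])
    return quad(X) / quad(Y)


def top_generalized_eigenvalue(X, Y):
    """Largest root mu of det(X - mu * Y) = 0 (2x2 symmetric, Y positive definite)."""
    qa = Y[0][0] * Y[1][1] - Y[0][1] ** 2                       # det(Y)
    qb = -(X[0][0] * Y[1][1] + X[1][1] * Y[0][0] - 2 * X[0][1] * Y[0][1])
    qc = X[0][0] * X[1][1] - X[0][1] ** 2                       # det(X)
    return (-qb + math.sqrt(qb * qb - 4 * qa * qc)) / (2 * qa)


# Hand-picked example matrices: X is PSD, Y is positive definite.
X = [[2.0, 1.0], [1.0, 1.0]]
Y = [[3.0, 1.0], [1.0, 2.0]]

# Brute force over unit directions a = (cos t, sin t); the ratio is scale-free,
# so the norm constraint on a in the lemma plays no role.
steps = 20000
grid_max = max(
    rayleigh_ratio((math.cos(math.pi * k / steps), math.sin(math.pi * k / steps)), X, Y)
    for k in range(steps))
closed_form = top_generalized_eigenvalue(X, Y)

# Scale invariance of the ratio, as used in the first step of the proof.
scale_gap = abs(rayleigh_ratio((3.0, -2.0), X, Y) - rayleigh_ratio((30.0, -20.0), X, Y))

print(grid_max, closed_form)
```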

F EXPERIMENT DETAILS

For all experiments (footnote 8), we use the 117M parameter "small" GPT-2 model proposed in Radford et al. (2019) and implemented in HuggingFace (Wolf et al., 2019). Linear classification experiments (except for the fine-tuning baseline in Table 1) are performed on fixed output features from GPT-2. We note that the binary SST-2 dataset used in all experiments consists of complete sentences, and there are 6,920 train examples and 1,821 test examples. In particular, this dataset is smaller than the version included with the GLUE benchmark (Wang et al., 2018). This smaller version of SST-2 better fits the sentence completion hypothesis we propose.

F.1 SOLVING DOWNSTREAM TASKS USING $f$ AND $\Phi p_f$

The feature $f$ from GPT-2 for any input sequence $(w_1, \dots, w_N)$ is the output embedding of the final token $w_N$ at the final layer, where $N$ is the input length and can be different for different inputs. This is also the embedding that is directly multiplied by the word embeddings to get the softmax distribution for language modeling, as in the theoretical setting. To use a prompt, the same prompt is added at the end of all inputs and the features are extracted for this modified input.

We use the LogisticRegressionCV class from the scikit-learn package to fit linear classifiers to all fixed features (i.e., no finetuning). We use the liblinear solver and one-vs-rest loss function unless it catastrophically fails (e.g., close to random performance) on a particular multi-class task. In that case, we use the stochastic average gradient (SAG) algorithm with multinomial loss. We use 5-fold cross validation for all experiments and test values for the regularization parameter C between 1e-6 and 1e4 for small datasets (i.e., fewer than 10K examples) and between 1e-3 and 1e3 for larger datasets.

Details about word subsets: For all of the results presented in Table 1, we use a pre-trained GPT-2 model. For SST, we use the prompt "This movie is " when indicated.
For AG News, we use the prompt "This article is about " when indicated. We compute the conditional probability of selecting a subset of words to complete the sentence. For AG News, this subset is: 'world', 'politics', 'sports', 'business', 'science', 'financial', 'market', 'foreign', 'technology', 'international', 'stock', 'company', 'tech', 'technologies'. For SST, this subset is: ':)', ':(', 'great', 'charming', 'flawed', 'classic', 'interesting', 'boring', 'sad', 'happy', 'terrible', 'fantastic', 'exciting', 'strong'. For AG News, the class words we use are: 'foreign', 'sports', 'financial', 'scientific'. For SST, the class words we use are ':)' and ':('. We account for BPE tokenization by using the encoding of the word directly and the encoding of the word with a space prepended. We then filter to use only words that encode to a single BPE token.

Tests on additional datasets: We also test the performance of pre-trained GPT-2 embeddings $f$ and the conditional mean embeddings $\Phi p_f$ on the DBPedia (Auer et al., 2007), Yahoo Answers (Zhang et al., 2015), TREC (Li & Roth, 2002), IMDb (Maas et al., 2011), Customer Review (CR) (Hu & Liu, 2004), and MPQA polarity (Wilson & Wiebe, 2003) datasets in Table 2. We limited the training set size to 250K for larger datasets (i.e., DBPedia and Yahoo Answers). For CR and MPQA, we follow Zhang et al. (2015) and average the performance across 10 random 90-10 train-test splits of the dataset. We find that $\Phi p_f$ consistently has comparable performance to $f$ across non-sentiment and sentiment downstream classification tasks. We include baseline results of bag of n-grams (BonG) for most tasks and the mLSTM model (Radford et al., 2017) for sentiment tasks. BonG performs quite well on the larger datasets, but not as well on smaller datasets, due to the high dimensionality of features. For sentiment tasks, adding a prompt almost always boosts performance.
We also demonstrate that much of the performance can be recovered by only looking at "positive" and "negative" or ":)" and ":(" as class words. Using these 2-dimensional features is even more sample-efficient than the standard 768-dimensional ones.

Table 3: Comparing Quad features to cross-entropy features for GPT-2 trained on the IMDb unlabeled corpus (Maas et al., 2011). In this experiment we fix $\Phi$ to be the word embeddings from the pretrained GPT-2 model for the cross-entropy objective. For the Quad objective, we initialize $\Phi$ to be the SVD of the pre-trained embeddings. An asterisk indicates that we added the prompt "This movie is " to each input.

We take the hyperparameter configuration that achieves the best performance on the dev set and then perform fine-tuning using those settings with three different random seeds: 8, 33, and 42. We then report the average performance on the test set in Table 1. We perform the hyperparameter grid search over the standard datasets and then perform fine-tuning using the best settings on the dataset with task-specific prompts added. For SST-2, we use the prompt "This movie is ", and for AG News we use "This article is about ".
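Returning to the word-subset and class-word features of Section F.1, the sketch below illustrates one way such features could be computed from a next-token distribution. It is not the authors' implementation: `toy_vocab`, `toy_encode` and `toy_probs` are hypothetical stand-ins for GPT-2's BPE tokenizer and its softmax output, and summing the probabilities of the two spellings of a word (with and without a leading space) is an assumption, since the text does not say how the two encodings are combined.

```python
def single_token_ids(word, encode):
    """Token ids for `word` and ' ' + word, keeping only single-token encodings."""
    ids = []
    for text in (word, " " + word):
        tokens = encode(text)
        if len(tokens) == 1:
            ids.append(tokens[0])
    return ids


def word_subset_features(next_token_probs, words, encode):
    """One feature per usable word: probability mass of its single-token variants.

    Words with no single-token variant are dropped. Summing the two variants is
    an assumption of this sketch.
    """
    kept, feats = [], []
    for word in words:
        ids = single_token_ids(word, encode)
        if ids:
            kept.append(word)
            feats.append(sum(next_token_probs[i] for i in ids))
    return kept, feats


# Hypothetical toy tokenizer and next-token distribution (not GPT-2's real BPE).
toy_vocab = {":)": 0, " :)": 1, ":(": 2, " :(": 3, " great": 4, "<other>": 5}


def toy_encode(text):
    # Unknown strings are split into characters to mimic a multi-token encoding.
    return [toy_vocab[text]] if text in toy_vocab else [ord(ch) for ch in text]


toy_probs = [0.05, 0.25, 0.02, 0.08, 0.30, 0.30]

kept_words, features = word_subset_features(
    toy_probs, [":)", ":(", "great", "charming"], toy_encode)
print(kept_words, features)
```

In this toy run 'charming' is filtered out because neither spelling is a single token, while 'great' survives through its space-prefixed form; the resulting low-dimensional vector would then be passed to the linear classifier.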

F.3 TESTING QUAD OBJECTIVE

We test two models with the same parametrizations, one trained using our Quad objective and another trained with the standard cross-entropy objective, using the unlabeled IMDb corpus (Maas et al., 2011) and the Amazon product review corpus (McAuley et al., 2015). We slightly modify the standard architecture of GPT-2 to generate Tables 3 and 4. First, we add a single linear layer (that is trained) on top of the output features of the standard Transformer architecture. Furthermore, instead of tying the input and output word (token) embeddings, we learn them separately so that $f$ and $\Phi$ are independent functions; this is more in line with our theoretical setup. We fix the input embeddings and the positional embeddings to be the parameters from the pre-trained GPT-2.

For Quad, we initialize $\Phi$, the output embeddings, using the singular vectors of the pre-trained word embeddings $\Phi$. For the cross-entropy models, we initialize $\Phi$ to be the full pre-trained word embeddings $\Phi$, because we found that initializing with the singular vectors harmed performance. Given our parameterization, initializing with the singular vectors is as expressive as initializing with the pretrained embeddings $\Phi$ themselves; however, it potentially lends a better optimization landscape and speeds up training for our new objective Quad. As described in Section 5.2, we minimize the following objective

$$\ell_{quad}(f, \Phi) = \mathbb{E}_{(s,w)}\left[-f(s)^\top \phi_w + \tfrac{1}{2}\|\Phi^\top f(s)\|^2\right],$$

where $(s, w)$ are sampled from the text corpus. The implementation of the Quad loss is the same as the standard cross-entropy loss, the main difference being the second term: it is $\tfrac{1}{2}\|\Phi^\top f(s)\|^2$ for Quad instead of the log-partition function $\log \sum_{w'} e^{f(s)^\top \phi_{w'}}$ in the cross-entropy objective.

Because IMDb is a smaller dataset, we fix $\Phi$ at its initialization and only train $f$ to generate Table 3. When training on the Amazon dataset, we initialized $\Phi$ the same way as we did for the IMDb dataset, but we allowed $f$ and $\Phi$ to both be trained, since more data was available.
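To make the difference between the two objectives concrete, here is a minimal per-example sketch of both losses. It is an illustration under stated assumptions rather than the training code: the embeddings `phi`, the next-word distribution `p_star`, and the perturbations are toy values, and the example picks $\Phi$ with $\Phi\Phi^\top = I$ so that the closed-form Quad minimiser $(\Phi\Phi^\top)^{-1}\Phi p^*_{\cdot|s}$ from Theorem C.1's proof reduces to $\Phi p^*_{\cdot|s}$.

```python
import math


def scores(f, phi):
    """Phi^T f: one inner product f . phi_w per vocabulary word."""
    return [sum(a * b for a, b in zip(f, phi_w)) for phi_w in phi]


def xent_loss(f, phi, w):
    """-f . phi_w + log sum_{w'} exp(f . phi_{w'})  (softmax cross-entropy)."""
    s = scores(f, phi)
    m = max(s)
    return -s[w] + m + math.log(sum(math.exp(v - m) for v in s))


def quad_loss(f, phi, w):
    """-f . phi_w + 0.5 * ||Phi^T f||^2  (Quad: log-partition term replaced)."""
    s = scores(f, phi)
    return -s[w] + 0.5 * sum(v * v for v in s)


def expected_loss(loss, f, phi, p_star):
    """Average a per-word loss over next words w drawn from p_star."""
    return sum(p * loss(f, phi, w) for w, p in enumerate(p_star))


# Toy example (assumed values): V = 3 words, d = 2, and Phi Phi^T = I.
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # entries are the embeddings phi_w
p_star = [0.5, 0.3, 0.2]                     # true next-word distribution
f_star = [sum(p * phi_w[i] for p, phi_w in zip(p_star, phi)) for i in range(2)]

best = expected_loss(quad_loss, f_star, phi, p_star)
perturbed = [
    expected_loss(quad_loss, [f_star[0] + dx, f_star[1] + dy], phi, p_star)
    for dx, dy in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
xent_at_f_star = expected_loss(xent_loss, f_star, phi, p_star)

print(f_star, best, perturbed)
```

The Quad loss at `f_star` is lower than at each perturbed point, consistent with `f_star` being the minimiser for this fixed `phi`. The two loss functions share the first term and differ only in the second, as described above.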
To train the models, we use the standard learning rate schedule as in Radford et al. (2019). To learn a model on IMDb, we use a context size of 512 BPE tokens, and for the Amazon reviews dataset (McAuley et al., 2015), we use the standard context length of 1,024 BPE tokens. We observe that training using Quad, in both cases, yields performance on the SST task that is comparable to, but always slightly worse than, that of the language model. According to the theory, features $f(s)$ from Quad should learn $p^*_{\cdot|s}$ on a subspace, just like $\Phi p_f$ from cross-entropy models, thus making the comparison between these two important. Furthermore, adding a prompt consistently improves performance for both objectives. While Quad did not beat cross-entropy in either case, its good performance at least demonstrates that insights from the theoretical analysis can translate to practical algorithms. We leave exploring the gap in performance between Quad and cross-entropy, and a more extensive evaluation of Quad, for future work.

F.4 LEARNING THE QUADRATIC APPROXIMATION OF THE LOG-PARTITION FUNCTION

In Assumption 4.1, we assert that there is a quadratic fit for the log partition function, which allows us to show in Lemma 4.3 that a linear relation holds between $f$ and $\Phi p_f$. We validate these theoretical findings by fitting a quadratic function to the log partition function for a subset of embeddings from the IMDb, SST, and AG News datasets (Figure 1). Here, we describe how we learned $A$, $b$ and $c$. To ensure $A$ is symmetric and positive semi-definite as required, we parametrize $A = U U^\top$. As defined earlier, the partition function is $Z_\theta = \sum_w e^{\theta^\top \phi_w}$ and $\Phi p_\theta = \sum_w \frac{e^{\theta^\top \phi_w}}{Z_\theta} \phi_w$ for any $\theta \in \mathbb{R}^d$. We minimize the following objective function:

we perform linear regression on $\sqrt{x - b}$ to find $a$ and $c$. We choose the $a, b, c$ that maximize the r-value of the regression. While Theorem 4.2 only provides an upper bound on the logistic loss, this experiment shows that some square-root trend is observable in practice.
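The square-root trend fit $y \approx a\sqrt{x - b} + c$ described for Section F.5 (grid search over $b$, linear regression on $\sqrt{x - b}$, keep the fit with the largest r-value) can be sketched as follows. This is a hedged illustration, not the original script: the function names, the lower end `b_min` of the grid, and the synthetic data are assumptions, and "valid" values of $b$ are taken to mean $b \le \min(x)$ so that the square root is real.

```python
import math


def linear_fit_with_r(u, y):
    """Least-squares fit y ~ slope * u + intercept, plus the Pearson r-value."""
    n = len(u)
    mean_u, mean_y = sum(u) / n, sum(y) / n
    suu = sum((a - mean_u) ** 2 for a in u)
    syy = sum((b - mean_y) ** 2 for b in y)
    suy = sum((a - mean_u) * (b - mean_y) for a, b in zip(u, y))
    slope = suy / suu
    return suy / math.sqrt(suu * syy), slope, mean_y - slope * mean_u


def fit_sqrt_trend(x, y, b_min, n_grid=100):
    """Grid-search b in [b_min, min(x)]; for each b regress y on sqrt(x - b).

    Returns (r, a, b, c) for the grid point with the largest r-value.
    """
    b_max = min(x)  # sqrt(x - b) must be real for every data point
    best = None
    for k in range(n_grid):
        b = b_min + (b_max - b_min) * k / (n_grid - 1)
        u = [math.sqrt(max(xi - b, 0.0)) for xi in x]  # guard against rounding
        r, a, c = linear_fit_with_r(u, y)
        if best is None or r > best[0]:
            best = (r, a, b, c)
    return best


# Synthetic check (assumed data): points lying exactly on y = 2 sqrt(x - 1) + 0.5.
x = [1.5 + 0.1 * i for i in range(30)]
y = [2.0 * math.sqrt(xi - 1.0) + 0.5 for xi in x]
r, a, b, c = fit_sqrt_trend(x, y, b_min=0.0)
print(r, a, b, c)
```

In the experiment, `x` would be the validation cross-entropy of each checkpoint and `y` the logistic loss of $\Phi p_f$ on SST-2.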



Footnote 0: A finite minimizer may not always exist. This is handled in Section 4, which deals with $\epsilon$-optimal solutions.
Footnote 1: Extending to k-way tasks is straightforward.
Footnote 2: $\|\cdot\|_\infty$ makes sense since $\|p^*_{\cdot|s}\|_1 = 1$ and $\|\cdot\|_\infty$ is the dual norm of $\|\cdot\|_1$.
Footnote 3: For instance, if $p_{\mathcal{T}}$ is a 0.001 fraction of $p_L$, $\{p_{\cdot|s}\}$ could have $1000\epsilon$ error on $p_{\mathcal{T}}$ and 0 error on the rest of $p_L$.
Footnote 4: This is not exactly the covariance since the mean is not subtracted; all results hold even for the usual covariance.
Footnote 5: Note that the converse is trivially true, i.e. a $(\tau, B)$-natural task w.r.t. $\Phi$ is also $(\tau, B)$-natural.
Footnote 6: Note that here we assume that we learn both a linear classifier and an intercept for a downstream classification task. All results in the paper essentially remain the same with an intercept in the definition of classification loss.
Footnote 7: It will be clear later that the optimal solution will have $\Phi$ of as high a rank as possible. All inverses can be replaced by pseudo-inverses for low-rank matrices.
Footnote 8: Link to code: https://github.com/sadhikamalladi/mathematical-exploration-downstream-tasks.



Figure 1: Learned quadratic function v/s log partition function on various datasets for features computed from pre-trained GPT-2 to verify Assumption 4.1. We also plot the y = x line for reference.

p * •|s + p T (y = 1|s)p * •|s (W irrelevant )

$p_{\mathcal{T}}$ capture more dominant semantic meanings, this could correspond to $v^*$ aligning with the top eigendirections of $\Omega^*_{p_{\mathcal{T}}}$. In combination with the above property about $\Phi$, this could suggest that $\|P^\perp_\Phi v^*\|_2$ is small, thus leading to $\tau$ and $B$ being small. Note that the above arguments are informal and qualitative, and we leave exploring desirable properties of $\Phi$ more formally to future work.

D.3 PROOFS FOR SECTION D.1

Proposition D.1. Let $p_b(s) = p_{\mathcal{T}}(y = b|s)$ for $b \in \{\pm 1\}$, $p_{\min}(s) = \min_{b \in \{\pm 1\}} p_b(s)$, $p_{\max}(s) = \max_{b \in \{\pm 1\}} p_b(s)$, and let $g^*(s) = \arg\max_{b \in \{\pm 1\}} p_b(s)$ denote the Bayes optimal predictor. We first notice that there is a simple well-known closed form expression for the Bayes risk: $\text{Bayes-Error}(\mathcal{T}) = \mathbb{E}_{(s,y) \sim p_{\mathcal{T}}}\left[\mathbb{1}\{g^*(s) \ne y\}\right]$

Lemma 4.3. Under Assumption 4.1, any feature map $f : \mathcal{S} \to \mathbb{R}^d$ satisfies $\Phi p_f(s) = A f(s) + b$, for all $s \in \mathcal{S}$.

For a fixed $\Phi$, we define $f^*_\Phi(s) = \arg\min_{\theta \in \mathbb{R}^d} \ell_{quad,s}(\theta, \Phi)$.

Figure 2: Fit of the learned quadratic function to the log partition function on various datasets for features computed by the full, pre-trained GPT-2. We also plot the y = x line for reference. These plots are meant to verify Assumption 4.1.

Proposition 2.2 (Softmax models recover $p^*_{\cdot|s}$ on a subspace). For a fixed $\Phi$, if $f^* \in \arg\min_{f : \mathcal{S} \to \mathbb{R}^d} \ell_{xent}(f, \Phi)$ exists, then $\Phi p_{f^*}(s) = \Phi p^*_{\cdot|s}$ for every $s \in \text{support}(p_L)$.

however, is undesirable. We improve on this by proving a stronger result specifically for softmax models. Inspired by Proposition 2.2, our guarantees are for features Φp f (s) ∈ R d called conditional mean features.

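$p^*_{\cdot|s}(w')$ for all or most $s \in \text{supp}(p_{\mathcal{T}})$, then $w$ and $w'$ have essentially the same meaning w.r.t. the distribution of contexts $p_{\mathcal{T}}$, and the closer $[p^*_{\cdot|s}(w)]_{s \in \text{supp}(p_{\mathcal{T}})}$ and $[p^*_{\cdot|s}(w')]_{s \in \text{supp}(p_{\mathcal{T}})}$ are, the closer the meanings of $w$ and $w'$ are. For the substitutability matrix $\Omega^*_{p_{\mathcal{T}}} = \mathbb{E}$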

Comparing Quad features to cross-entropy features for GPT-2 trained on the Amazon corpus. An asterisk indicates that we added the prompt "This movie is " to each input. Note that the validation loss was still decreasing at the time of measurement.

Table columns: Task | $f(s)$ (xent) | $\Phi p_f(s)$ (xent) | $f(s)$ (Quad, learned $\Phi$)

For both versions of SST-2, we try batch sizes 8, 16, and 32, and for AG News, we try batch sizes 8, 12, and 16. We note that the longer sequence length of AG News inputs required us to use parallelization across multiple GPUs to simulate larger batch sizes, which made batch size 32 prohibitively expensive to test.

acknowledgement

Acknowledgments: Sanjeev Arora, Sadhika Malladi and Nikunj Saunshi are supported by NSF, ONR, Simons Foundation, Amazon Research, DARPA and SRC.

annex

Published as a conference paper at ICLR 2021 Table 2: GPT-2 performance without fine-tuning on downstream task test sets with k classes. We provide the performance of bag of n-grams (BonG) as an approximate baseline for these tasks. AG News, DBPedia and Yahoo performances were reported in Zhang et al. (2015) , and the other tasks were reported in Khodak et al. (2018) . We also include results from mLSTM (Sentiment Neuron) (Radford et al., 2017) for the sentiment-related classification tasks (SST, IMDb, CR, and MPQA) with numbers reported from Khodak et al. (2018) . Furthermore, we include results for BERT (Devlin et al., 2018) features without fine-tuning, where we use the output features for the first position of an input for linear classification. An asterisk indicates we add a standard sentiment prompt "The sentiment is" to each input, but for AG News we used the prompt "This article is about". We also tested the performance of the conditional probability distribution over "positive" and "negative" as well as ":)" and ":(" on the sentiment-related tasks with and without the prompt. We also include results using the pre-trained BERT base cased model (Devlin et al., 2018; Wolf et al., 2019) , using the embedding at the first token as input to the downstream task. We also tried using the mean embedding and last token embedding and found that the first token embedding is often the best. Moreover, the first token embedding is what is extracted in the traditional usage of BERT on downstream tasks, though we note that it is rare to use BERT without fine-tuning.

F.2 FINETUNING EXPERIMENTS

As a strong baseline, we finetune the GPT-2 features along with learning a linear classifier for the SST and AG News classification tasks and report accuracy numbers in Table 1. We use a maximum sequence length of 128 BPE tokens for downstream inputs of SST-2 and a maximum length of 400 BPE tokens for AG News inputs. We use the end of sentence token as the padding token. The datasets are described below. To select the best hyperparameter configuration, we run a grid search over learning rate and batch size. We train each model for 10 epochs. For all datasets, we test learning rates 5e-5, 1e-4, and

In practice, we train only on the regression loss (i.e., $\lambda_1 = 0$, $\lambda_2 = 1$) for the most promising results. Note that the regression term is trying to learn the linear relationship between $\theta$ and $\Phi p_\theta$ that Lemma 4.3 aims to prove. This ends up learning a matrix $A = U U^\top$ and vector $b$ that also satisfy the quadratic form of $\log(Z_\theta)$ from Assumption 4.1.

We use 20,000 examples from a mix of IMDb, SST, and AG News embeddings as the training set. Thus we sample $\theta$ by sampling $s$ from the aforementioned datasets and set $\theta = f(s)$, $f$ being the feature map from pretrained GPT-2. We use the Adam (Kingma & Ba, 2014) optimizer with learning rate 1e-3 for $U$ and learning rate 1e-4 for $b$ and $c$. We decay the learning rate every 50 steps by a factor of 0.1. We use the $U$ obtained after 8 epochs of training. We further demonstrate the quality of the learned fit by plotting the true log partition and estimated log partition function for embeddings from other datasets in Figure 2.

F.5 EXPERIMENTALLY CHECKING THEOREM 4.2

Theorem 4.2 can be informally summarized as stating that an $\epsilon$ suboptimality in the cross-entropy of a $d$-dimensional language model propagates to a $\sqrt{\epsilon}$ increase in the logistic loss. We note that the $\tau$, $B$, and $\gamma(p_{\mathcal{T}})$ factors are fixed for a given pre-training corpus and downstream task, so we can empirically test if this square root relationship holds in practice.
In particular, Theorem 4.2 says

$$\ell_{\mathcal{T}}(\Phi p_f) \le \tau + \sqrt{2 B^2\, \gamma(p_{\mathcal{T}})^{-1} \left(\ell_{xent}(f, \Phi) - \ell^*_{xent}\right)}.$$

Of these, $\tau$, $B$, $\gamma(p_{\mathcal{T}})^{-1}$ and $\ell^*_{xent}$ are independent of the language model $(f, \Phi)$ and only depend on the task $\mathcal{T}$ and the language modeling distribution. Thus we can rewrite this as $\ell_{\mathcal{T}}(\Phi p_f) \le c + a\sqrt{\ell_{xent}(f, \Phi) - b}$ for suitable constants $a, b, c \in \mathbb{R}$. The left hand side, $\ell_{\mathcal{T}}(\Phi p_f)$, is the logistic loss of conditional mean features from language model $(f, \Phi)$ on task $\mathcal{T}$, and $\ell_{xent}(f, \Phi)$ is the cross-entropy loss of the language model, both of which can be measured in practice.

We train a 117M parameter GPT-2 model from scratch on the IMDb and Amazon corpora, described in Section F.3. We maintain checkpoints during training, and for each checkpoint, we measure the cross-entropy of the model on the validation set as well as the performance of the conditional mean features $\Phi p_f$ on SST-2. Plotting these values together yields Figure 3.

We furthermore fit a square root trend, shown in red, to these points. We learn $a, b, c$ such that $y \approx a\sqrt{x - b} + c$, where $y = \ell_{\mathcal{T}}(\Phi p_f)$ is the logistic loss and $x = \ell_{xent}(f, \Phi)$ is the cross-entropy loss. For this, we perform a grid search over 100 evenly spaced valid values of $b$, and for each $b$,

