CONTRASTIVE ESTIMATION REVEALS TOPIC POSTERIOR INFORMATION TO LINEAR MODELS

Anonymous authors
Paper under double-blind review

Abstract

Contrastive learning is an approach to representation learning that utilizes naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. In the context of document classification under topic modeling assumptions, we prove that contrastive learning is capable of recovering a representation of documents that reveals their underlying topic posterior information to linear models. We apply this procedure in a semi-supervised setup and demonstrate empirically that linear classifiers with these representations perform well in document classification tasks with very few training examples.

1. INTRODUCTION

Using unlabeled data to find useful embeddings is a central challenge in representation learning. Classical approaches to this task often start by fitting some type of structure to the unlabeled data, such as a generative model or a dictionary, and then embed future data via inference with the fitted structure (Blei et al., 2003; Raina et al., 2007). While principled, this approach is not without its drawbacks. One issue is that learning structures and performing inference is often hard in general (Sontag & Roy, 2011; Arora et al., 2012). Another issue is that we must a priori choose a structure and method for fitting the unlabeled data, and unsupervised methods for learning these structures can be sensitive to model misspecification (Kulesza et al., 2014).

Contrastive learning (also called noise contrastive estimation, or NCE) is an alternative representation learning approach that tries to capture the latent structure in unlabeled data implicitly. At a high level, these methods formulate a classification problem in which the goal is to distinguish examples that naturally occur in pairs, called positive samples, from randomly paired examples, called negative samples. The particular choice of positive samples depends on the setting. In image representation problems, for example, patches from the same image or neighboring frames from videos may serve as positive examples (Wang & Gupta, 2015; Hjelm et al., 2018). In text modeling, the positive samples may be neighboring sentences (Logeswaran & Lee, 2018; Devlin et al., 2018). The idea is that in the course of learning to distinguish between semantically similar positive examples and randomly chosen negative examples, we will capture some of the latent semantic information.

In this work, we look "under the hood" of contrastive learning and consider its application to document modeling, where the goal is to construct useful vector representations of text documents in a corpus.
In this setting, there is a natural source of positive and negative examples: a positive example is simply a document from the corpus, and a negative example is one formed by pasting together the first half of one document and the second half of another (independently chosen) document. We prove that when the corpus is generated by a topic model, learning to distinguish between these two types of documents yields representations that are closely related to the underlying latent variables.

One potential application of contrastive learning is in a semi-supervised setting, where there is a small amount of labeled data as well as a much larger collection of unlabeled data. In these situations, purely supervised methods that fit complicated models may perform poorly due to the limited amount of labeled data. On the other hand, when the labels are well-approximated by some function of the latent structure, our results show that an effective strategy is to fit linear functions, which can be learned with relatively little labeled data, on top of contrastive representations. In our experiments, we verify empirically that this approach produces reasonable results.

Contributions. The primary goal of this work is to shed light on what contrastive learning techniques uncover in the presence of latent structure. To this end, we focus on the setting of document modeling where latent structure is induced by a topic model. Here, our contrastive learning objective is to distinguish true documents from "fake" documents that are composed by randomly pasting together two document halves from the corpus. We consider two types of architectures, or functional forms of solutions, for this problem, both trained with the logistic loss. The first architecture, on which our theoretical analysis will focus, consists of general functions of the form f(·, ·). Here, we train f so that f(x, x′) indicates the confidence of the model that x and x′ are two halves of the same document.
To embed a new document x using f, we propose a landmark embedding procedure: fix documents l_1, . . . , l_M (our so-called landmarks) and create the embedding φ(x) using a function of the predictions f(x, l_1), . . . , f(x, l_M). In Section 4, we show that the embedding φ(x) is a linear transformation of the underlying topic posterior moments of x. Moreover, under certain conditions this linear relationship is invertible, so that linear functions of φ(x) correspond to polynomial functions of the topic posterior of document x. In Section 5, we show that errors in f on the contrastive learning objective transfer smoothly to errors in φ(x) as a linear transformation of the topic posterior of x. Thus, as the quality of f improves, linear functions of φ(x) more closely approximate polynomial functions of the topic posterior of document x.

Unfortunately, the landmark embedding can require quite a few landmarks before our theoretical results kick in. Moreover, embedding a document requires M evaluations of f, which can be expensive. To circumvent this, in Section 7 we introduce a direct embedding procedure that more closely matches what is done in practice. We use an architecture of the form f_1(x)^T f_2(x′), where f_1, f_2 are functions with d-dimensional outputs, and we train this architecture on the same contrastive learning task as before. To embed a document x, we simply use the evaluation f_1(x). In Section 7, we evaluate this embedding on a semi-supervised learning task and show that it has reasonable performance. Indeed, the direct embedding method outperforms the landmark embedding method, which raises the question of whether anything can be proven theoretically about the direct embedding method. We leave this question to future work.

Related work.

Reducing an unsupervised problem to a synthetically-generated supervised problem is a well-studied technique. In dynamical systems modeling, Langford et al. (2009) showed that the solutions to a few forward prediction problems can be used to track the underlying state of a nonlinear dynamical system. For linear dynamics, the idea is also seen in autoregressive models (Yule, 1927). In anomaly/outlier detection, a useful technique is to learn a classifier that distinguishes between true samples from a distribution and fake samples from some synthetic distribution (Steinwart et al., 2005; Abe et al., 2006). Similarly, estimating the parameters of a probabilistic model can be reduced to learning to classify between true data and randomly generated noise (Gutmann & Hyvärinen, 2010).

In the context of natural language processing, methods such as skip-gram and continuous bag-of-words turn the problem of finding word embeddings into a prediction problem (Mikolov et al., 2013a;b). Modern language representation training algorithms such as BERT and QT also use naturally occurring classification tasks, such as predicting randomly masked elements of a sentence or discriminating whether or not two sentences are adjacent (Devlin et al., 2018; Logeswaran & Lee, 2018). Training these models often employs a technique called negative sampling, in which softmax prediction probabilities are estimated by randomly sampling examples; this bears close resemblance to the way that negative examples are produced in contrastive learning.

Most relevant to the current paper, Arora et al. (2019) gave a theoretical analysis of contrastive learning. They considered the specific setting of minimizing the contrastive loss

L(f) = E_{x, x+, x-}[ ℓ( f(x)^T (f(x+) - f(x-)) ) ],

where ℓ is a surrogate loss such as the logistic loss, (x, x+) is a positive pair, and (x, x-) is a negative pair.
They showed that if there is an underlying collection of latent classes, and positive examples are generated by draws from the same class, then minimizing the contrastive loss over embedding functions f yields good representations for the classification task of distinguishing latent classes. The main difference between our work and that of Arora et al. (2019) is that we adopt a generative modeling perspective and induce the contrastive distribution naturally, while they make no generative assumptions but assume the contrastive distribution is directly induced by the downstream classification task. In particular, our contrastive distribution and supervised learning problem are only indirectly related through the latent variables in the generative model, while Arora et al. assume an explicit connection. The focus of our work is therefore complementary to theirs: we study the types of functions that can be succinctly expressed with the contrastive representation in our generative modeling setup. In addition, our results apply to semi-supervised regression, but it is unclear how to define their contrastive distribution in this setting, which makes it difficult to apply their results here. Finally, Arora et al. point out that the method they study has limitations that arise when the number of latent classes is small and the probability of negative samples having the same class is high. In our setting, class collisions turn out not to be a problem, since our embeddings explicitly utilize conditional probability information from the solution to our contrastive learning objective.

Algorithm 1 Contrastive Estimation with Documents
  Input: Corpus U of unlabeled documents.
  Initialize: S = ∅.
  for i = 1, . . . , n do
    Sample x and x′ independently from unif(U).
    With probability 1/2, add (x^(1), x^(2), 1) to S; otherwise add (x^(1), x′^(2), 0) to S.
  end for
  Solve the optimization problem
    f̂ = argmin_f Σ_{(x^(1), x^(2), y) ∈ S} [ y log(1 + e^{-f(x^(1), x^(2))}) + (1 - y) log(1 + e^{f(x^(1), x^(2))}) ].
  Select landmark documents l_1, . . . , l_M and embed φ̂(x) = (exp(f̂(x, l_i)) : i ∈ [M]).

2. SETUP

Let V denote a finite vocabulary. A topic is a distribution over V. We assume that we have K such topics, and denote the corresponding distributions by O(· | k) for k = 1, . . . , K. To generate a length-m document x, one first draws a vector w from Δ_K, the K-dimensional probability simplex, and then samples each of the m words x_1, . . . , x_m by first sampling a latent variable z_i ∼ Categorical(w) and then drawing x_i ∼ O(· | z_i). We note that documents are allowed to have different lengths.

We will also be interested in the case where each document has an associated label ℓ ∈ R. One natural restriction to place on a label is that it is conditionally independent of the document given the topic distribution of the document. Thus, we will assume that there is a joint distribution D over triples (x, w, ℓ), where (x, w) is generated according to the topic model described above, and ℓ is then drawn from some distribution conditioned on w. One of the goals of this paper is to characterize the functional forms of this conditional distribution that are most suited to contrastive learning.

In the representation learning approach to the semi-supervised setting, we are given a large collection U of documents with no labels and a small collection L of labeled documents. Using U, we learn a feature map φ that will form the basis of our predictions. Then, using L, we learn a simple predictor based on φ, such as a linear function, to predict the label given φ(x).
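The generative process above can be sketched in a few lines of NumPy. The topic matrix O, the vocabulary size, and all parameter values here are illustrative choices, not the ones used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, m = 1000, 5, 30                      # vocabulary size, topics, doc length (illustrative)
O = rng.dirichlet(np.ones(V), size=K)      # O[k] is the word distribution of topic k

def sample_document(w, m):
    """Sample a length-m document given a topic mixture w in the simplex."""
    z = rng.choice(K, size=m, p=w)                       # latent topics z_i ~ Categorical(w)
    return np.array([rng.choice(V, p=O[k]) for k in z])  # word draws x_i ~ O(. | z_i)

w = rng.dirichlet(np.ones(K))              # document-level topic vector
doc = sample_document(w, m)
```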

3. CONTRASTIVE LEARNING ALGORITHM

In contrastive learning, examples come in the form of similar and dissimilar pairs of points, where the exact definition of similar/dissimilar depends on the task at hand. Our similar pairs are formed by randomly splitting a document into two, and our dissimilar pairs consist of halves subsampled from two randomly chosen documents. In the generative modeling setup, since the words are i.i.d. conditional on the topic distribution, a natural way to split a document x into two is to call the first half of the words x^(1) and the second half x^(2). In our experiments, we split the documents by applying a random permutation to the word tokens and partitioning in half, thus effectively ignoring the word ordering (as is common in topic models).

The contrastive representation learning procedure is displayed in Algorithm 1. It uses a finite-sample approximation to the contrastive distribution D_contrast described as follows: (a) sample a document x and partition it into (x^(1), x^(2)); (b) with probability 1/2, output (x^(1), x^(2), 1); (c) with probability 1/2, sample a second document x′, partition it into (x′^(1), x′^(2)), and output (x^(1), x′^(2), 0). For (x, x′, y) ∼ D_contrast, the parts x and x′ are the two halves of a (possibly synthetic) document, and y is the binary label. Our contrastive learning objective is to minimize the binary cross-entropy loss of discriminating between positive and negative examples:

L_contrast(f) := E_{(x, x′, y) ∼ D_contrast}[ y log(1 + e^{-f(x, x′)}) + (1 - y) log(1 + e^{f(x, x′)}) ].  (1)

In our algorithm, we approximate this expectation via sampling and optimize the empirical objective, which yields an approximate minimizer f̂ (chosen from some function class F). To see why optimizing this contrastive learning objective is so useful, let f* be the global minimizer of Eq. (1).
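Steps (a)-(c) of the contrastive sampling procedure can be sketched as follows. This is a minimal illustration assuming documents are plain token lists; the helper names (`split_halves`, `contrastive_dataset`) are our own and not from the paper.

```python
import random

def split_halves(doc, rng):
    """Randomly permute the tokens of a document and split them into two halves."""
    doc = list(doc)
    rng.shuffle(doc)
    mid = len(doc) // 2
    return doc[:mid], doc[mid:]

def contrastive_dataset(corpus, n, seed=0):
    """Draw n labeled pairs: y=1 for two halves of one document, y=0 for halves of two."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, x_prime = rng.choice(corpus), rng.choice(corpus)
        x1, x2 = split_halves(x, rng)
        _, x2_prime = split_halves(x_prime, rng)
        if rng.random() < 0.5:
            data.append((x1, x2, 1))        # positive: (x^(1), x^(2))
        else:
            data.append((x1, x2_prime, 0))  # negative: (x^(1), x'^(2))
    return data
```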
By Bayes' theorem, g* := exp(f*) satisfies the following:

g*(x, x′) := exp(f*(x, x′)) = P(y = 1 | x, x′) / P(y = 0 | x, x′) = P(x^(1) = x, x^(2) = x′) / ( P(x^(1) = x) P(x^(2) = x′) ).

Thus, g*(x, x′) captures the ratio between the probability of x and x′ co-occurring as the first and second halves of the same document and the product of their marginal probabilities. In Eq. (1), we have not imposed any constraints on the functions over which we are optimizing. Thus, we seek to extract a useful embedding from g* using only black-box access to g*. Our approach to this problem is to select some set of fixed documents, which we call landmarks, and to embed by utilizing the predictions of g* on these landmarks. Formally, we select documents l_1, . . . , l_M and represent document x as

φ*(x) := (g*(x, l_1), . . . , g*(x, l_M)).  (2)

This yields the final document-level representation, which can be used for downstream tasks. As we shall see in Section 4, when the documents have an underlying topic structure, φ*(x) is related to the posterior information of the topics by a linear transformation, and this linear transformation is invertible whenever the landmarks l_1, . . . , l_M are sufficiently diverse. In practice, we only have access to an approximate minimizer f̂ of Eq. (1). Thus, our embedding in practice will be given by φ̂(x) = (exp(f̂(x, l_1)), . . . , exp(f̂(x, l_M))). In Section 5 we will see that, under some mild assumptions, our claims about φ* also hold for φ̂ up to some small errors. Finally, we point out that there is nothing special about the binary cross-entropy loss. We may replace this loss in Eq. (1) with any proper scoring rule (Shuford et al., 1966; Buja et al., 2005), so long as the appropriate non-linear transformation is applied to the resulting predictions.
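Given black-box access to any trained pairwise scorer, the landmark embedding of Eq. (2) is a one-liner. The toy scorer below is a hypothetical stand-in for a trained model, used only to make the sketch runnable; the `landmark_embedding` construction itself mirrors the text.

```python
import numpy as np

def landmark_embedding(f, x, landmarks):
    """phi(x) = (exp(f(x, l_1)), ..., exp(f(x, l_M))) for a trained pairwise scorer f."""
    return np.exp(np.array([f(x, l) for l in landmarks]))

def toy_scorer(x, l):
    """Hypothetical stand-in for a trained scorer: a log-odds proxy from token overlap."""
    return float(len(set(x) & set(l))) - 1.0

phi = landmark_embedding(toy_scorer, [1, 2, 3], [[1], [2, 3], [4]])  # M = 3 landmarks
```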

4. RECOVERING TOPIC STRUCTURE

In this section, we focus on the expressivity of the contrastive representation, showing that polynomial functions of the topic posterior can be represented as linear functions of the representation. To do so, we ignore statistical issues and assume that we have access to the oracle representations g*(x, ·); Section 5 addresses the statistical issues. Recall the generative topic model process for a document x. We first draw a topic vector w ∈ Δ_K. Then for each word i = 1, . . . , length(x), we draw z_i ∼ Categorical(w) and x_i ∼ O(· | z_i). We will show that when documents are generated according to the above model, the embedding of a document x in Eq. (2) is closely related to its underlying topic vector w.

4.1. THE SINGLE TOPIC CASE

To build intuition for the embedding in Eq. (2), we first consider the case where each document's probability vector w is supported on a single topic, i.e., w ∈ {e_1, . . . , e_K}, where e_k is the k-th standard basis vector. Then we have the following lemma.

Lemma 1. For any documents x, x′, we can write g*(x, x′) = η(x)^T ψ(x′), where η(x)_k := P(w = e_k | x^(1) = x) is the topic posterior distribution and ψ(x′)_k := P(x^(2) = x′ | w = e_k) / P(x^(2) = x′).

Due to space constraints, all proofs are deferred to Appendix C and Appendix D. The characterization from Lemma 1 shows that g* contains information about the posterior topic distribution η(·). To recover it, we must make sure that the ψ(·) vectors for our landmark documents span R^K. Formally, if l_1, . . . , l_M are the landmarks, and we define the matrix L ∈ R^{K×M} by

L := [ψ(l_1) · · · ψ(l_M)],  (3)

then our representation satisfies φ*(x) = L^T η(x). If our landmarks are chosen so that L has rank K, then (L^T)† φ*(x) = η(x), where † denotes the matrix pseudo-inverse. Thus, there is a linear transformation of φ*(x) that recovers the posterior distribution of w given x.

There are two observations to be made here. The first is that this argument naturally generalizes beyond the single-topic setting to any setting where w can take values in a finite set S, which may include some mixtures of multiple topics, though the number of landmarks needed would grow at least linearly with |S|. The second is that we have made no use of the structure of x^(1) and x^(2), except that they are independent conditioned on w. Thus, this argument applies to more exotic ways of partitioning a document beyond the bag-of-words approach.
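Lemma 1 and the pseudo-inverse recovery can be checked numerically in a tiny single-topic model where both document halves are single words, so every probability is exactly computable. All sizes and the random topic matrix below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 3, 4                                  # tiny sizes so everything is exact
O = rng.dirichlet(np.ones(V), size=K)        # O[k, v] = P(word v | topic k)
prior = np.full(K, 1.0 / K)                  # uniform prior over the single topic w = e_k

# With one-word halves: joint, marginals, and g*(a, b) in closed form.
joint = np.einsum('k,ka,kb->ab', prior, O, O)     # P(x1 = a, x2 = b)
marg = O.T @ prior                                # P(x1 = a) = P(x2 = a)
g_star = joint / np.outer(marg, marg)             # g*(a, b)

# Lemma 1 quantities: eta(a)_k = P(w = e_k | x1 = a), psi(b)_k = O(b | k) / P(x2 = b).
eta = (prior[:, None] * O).T / marg[:, None]      # shape (V, K), rows sum to 1
psi = O / marg[None, :]                           # shape (K, V); column b is psi(b)

# Landmarks: use every word as a landmark, so L = psi and phi*(a) = g_star[a, :].
L = psi
phi = g_star
eta_rec = phi @ np.linalg.pinv(L.T).T             # eta(a) = (L^T)^dagger phi*(a)
```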

4.2. THE GENERAL SETTING

In the general setting, we allow document vectors w to be arbitrary probability vectors in Δ_K, and we do not hope to recover the full posterior distribution over Δ_K. However, the intuition from the single-topic case largely carries over, and we will show that we can still recover the posterior moments.

Let m_max be the length of the longest landmark document. Let S^K_m := {α ∈ Z^K_+ : Σ_k α_k = m} denote the set of non-negative integer vectors that sum to m, and let S^K_{≤ m_max} := S^K_0 ∪ · · · ∪ S^K_{m_max}. Let π(w) denote the vector of monomials in w of degree at most m_max:

π(w) := (w_1^{α_1} · · · w_K^{α_K} : α ∈ S^K_{≤ m_max}).

For a positive integer m and a vector α ∈ S^K_m, we let [m]_α := {z ∈ [K]^m : Σ_{i=1}^m 1[z_i = k] = α_k for all k ∈ [K]}. For a document x of length m, the degree-m polynomial vector ψ_m is defined by

ψ_m(x) := ( Σ_{z ∈ [m]_α} Π_{i=1}^m O(x_i | z_i) : α ∈ S^K_m ),

and ψ_d(x) := 0 for all d ≠ m. The cumulative polynomial vector ψ is given by

ψ(x) := (1 / P(x^(2) = x)) (ψ_0(x), ψ_1(x), . . . , ψ_{m_max}(x)).  (4)

Given these definitions, we have the following general-case analogue of Lemma 1.

Lemma 2. For any documents x, x′, we may write g*(x, x′) = η(x)^T ψ(x′), where η(x) := E[π(w) | x^(1) = x] and ψ is defined in Eq. (4).

Thus, we again have φ*(x) = L^T η(x), but the columns of L are now the vectors ψ(l_i) from Eq. (4). Our analysis so far shows that if we choose the landmarks such that LL^T is invertible, then our representation captures all moments of the topic posterior up to degree m_max. As the next theorem shows, we can ensure that LL^T is invertible whenever each topic has an associated anchor word (Arora et al., 2012), i.e., a word that occurs with positive probability only within that topic. In this case, there is a set of landmarks l_1, . . . , l_M such that any polynomial of η(x) can be expressed as a linear function of φ*(x).

Theorem 3. Suppose that (i) each topic has an associated anchor word, and (ii) the marginal distribution of w places positive probability on the interior of Δ_K. For any d_o ≥ 1, there is a collection of M = O(K^{d_o}) landmark documents l_1, . . . , l_M such that if Q(w) is a degree-d_o polynomial in w, then there is a vector θ ∈ R^M such that ⟨θ, φ*(x)⟩ = E[Q(w) | x^(1) = x] for all documents x.

Coupling Theorem 3 with the Stone-Weierstrass theorem (Stone, 1948) shows that, in principle, the posterior mean of any continuous function of w can be approximated using our representation.
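For concreteness, the index sets S^K_m and the monomial vector π(w) defined above can be enumerated directly; the function names below are our own.

```python
import numpy as np

def multi_indices(K, m):
    """Enumerate S^K_m: all alpha in Z_+^K with alpha_1 + ... + alpha_K = m."""
    if K == 1:
        return [(m,)]
    return [(i,) + rest for i in range(m + 1) for rest in multi_indices(K - 1, m - i)]

def monomial_vector(w, m_max):
    """pi(w): all monomials w_1^a_1 ... w_K^a_K of total degree at most m_max."""
    w = np.asarray(w, dtype=float)
    alphas = [a for m in range(m_max + 1) for a in multi_indices(len(w), m)]
    return np.array([np.prod(w ** np.array(a)) for a in alphas])
```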

5. ERROR ANALYSIS

Given a finite amount of data, we cannot hope to solve Eq. (1) exactly; our solution f̂ will only be an approximation to f*. Since f̂ is the basis of our representation, one may worry that errors incurred in this approximation will cascade and cause the approximate representation φ̂(x) to differ so wildly from φ*(x) that the results of Section 4 do not even approximately hold. In this section, we will show that, under certain conditions, such fears are unfounded. Specifically, we will show that there is an error transformation from the approximation error of f̂ to the approximation error of linear functions in φ̂. That is, if the target function is η(x)^T θ, then we will show that the best mean squared error achievable using our approximate representation φ̂, given by

R(φ̂) := min_v E_{x ∼ μ^(1)} [ (η(x)^T θ - φ̂(x)^T v)^2 ],

is bounded in terms of the approximation quality of f̂ as well as some other terms. Here, μ^(1) is the marginal distribution over first halves of documents drawn from D. Thus, for the specific setting of semi-supervised learning, an approximate solution to Eq. (1) is good enough.

There are a number of reasonable ways to choose landmark documents. Here we consider a simple method: randomly sample them from the marginal distribution μ^(2) of x^(2). We will assume that this distribution satisfies the following regularity property.

Assumption 1. There is a constant σ_min > 0 such that for any δ ∈ (0, 1), there is a number M_0 such that for an i.i.d. sample l_1, . . . , l_M from μ^(2), with M ≥ M_0, with probability at least 1 - δ, the matrix L in Eq. (3) (with ψ as defined in Lemma 1 or Eq. (4)) has minimum singular value at least σ_min √M.

Note that the smallest non-zero singular value of (1/√M) L is the square root of the smallest eigenvalue of a certain empirical second-moment matrix.
Hence, Assumption 1 holds under appropriate conditions on the landmark distribution, for instance via tail bounds for sums of random matrices (Tropp, 2012) combined with matrix perturbation analysis (e.g., Weyl's inequality). In the single-topic setting with anchor words, it can be shown that for long enough documents, σ_min is lower-bounded by a constant when M_0 grows polynomially with K. We defer a detailed proof of this to Appendix D.

We will also assume that the predictions of f̂ and f* are bounded above by some constant.

Assumption 2. There exists some g_max > 0 such that f̂(x, l_i), f*(x, l_i) ≤ log g_max for all documents x and landmarks l_i.

Note that if Assumption 2 holds for f*, then it can be made to hold for f̂ by clipping. Moreover, it holds for f* whenever the vocabulary and document sizes are constants:

f*(x, x′) = log [ P(x^(1) = x, x^(2) = x′) / ( P(x^(1) = x) P(x^(2) = x′) ) ] = log [ P(x^(2) = x′ | x^(1) = x) / P(x^(2) = x′) ] ≤ log [ 1 / P(x^(2) = x′) ].

Since landmarks are sampled from μ^(2), and the number of possible documents is finite, there exists a constant p_min > 0 such that P(x^(2) = l) ≥ p_min for every landmark l. Thus, Assumption 2 holds with g_max = 1/p_min. Given these assumptions, we have the following error transformation guarantee.

Theorem 4. Fix any δ ∈ (0, 1), and suppose Assumption 1 and Assumption 2 hold (with constants M_0, σ_min, and g_max). Let f̂ be the function returned by the contrastive learning algorithm, and let ε := L_contrast(f̂) - L_contrast(f*) denote its excess contrastive loss. If M ≥ M_0, then with probability at least 1 - δ over the random sample of l_1, . . . , l_M,

R(φ̂) ≤ ( ‖θ‖_2^2 (1 + g_max)^4 / σ_min^2 ) ( ε + √(2 log(2/δ) / M) ).

We make a few observations here. First, ‖θ‖_2^2 is a measure of the complexity of the target function. Thus, if the target function is some reasonable function (e.g., a low-degree polynomial) of the posterior document vector, then we can expect ‖θ‖_2^2 to be small. Second, the dependence on g_max is probably not very tight and can likely be improved.
Third, note that M can grow and ε can shrink with the number of unlabeled documents; indeed, none of the terms in Theorem 4 involve labeled data. Finally, it is possible to establish guarantees in a semi-supervised setting using our analysis. If we have n_L i.i.d. labeled examples, and we learn a linear predictor v̂ with the representation φ̂ using ERM (say), then the bias-variance decomposition grants

mse(v̂) := E_{x ∼ μ^(1)} [ (η(x)^T θ - φ̂(x)^T v̂)^2 ] = R(φ̂) + E_{x ∼ μ^(1)} [ (φ̂(x)^T (v* - v̂))^2 ],

where v* is the minimizer of mse(·). The final term E_{x ∼ μ^(1)} [ (φ̂(x)^T (v* - v̂))^2 ] is the excess risk in linear regression, which goes to zero as n_L → ∞.
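The semi-supervised step discussed above amounts to ordinary (ridge-regularized) least squares of the labels on the embeddings. A minimal sketch, with a synthetic sanity check in place of real contrastive embeddings:

```python
import numpy as np

def fit_linear_head(Phi, y, reg=1e-6):
    """Ridge-regularized least squares of labels y on embeddings Phi (n_L x M)."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(M), Phi.T @ y)

# Synthetic sanity check: a label that is exactly linear in the embedding is recovered.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
v_true = np.arange(5.0)
v_hat = fit_linear_head(Phi, Phi @ v_true)
```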

6. TOPIC MODELING SIMULATIONS

To test our theory, we ran simulation experiments with a single-topic generative model in which K = 20 topics are sampled from a symmetric Dirichlet(α) distribution over a vocabulary of size 5k. The Dirichlet parameter α governs the sparsity of the topic distributions, effectively determining the similarity of the topics: as α increases, the prior concentrates on the interior of the simplex, forcing the topic distributions to be more similar. This is visualized in the left panel of Figure 1. In the experiments, we generate a dataset and solve the contrastive optimization problem, and then we construct the landmark embeddings φ̂(x) for each document x using 1k landmark documents, following Section 4. Using the true likelihood matrix L for the landmarks, we infer the MAP topic estimate and measure accuracy as the fraction of test documents for which this prediction matches the generating topic. See Appendix A for additional details.

The results are displayed in the center and right panels of Figure 1, where we vary the network architecture and the amount of training data. The experiment identifies several interesting properties of the contrastive learning approach. First, as a sanity check, the algorithm does accurately predict the latent topics of the test documents in most experimental conditions, and the accuracy is quite high when the problem is relatively easy (e.g., when α is small). Second, the performance degrades as α increases, but this can be mitigated by increasing the model capacity (size of the network) or the resampling rate (which exposes the model to more unlabeled data). Specifically, we consistently see that for a fixed model and α, increasing the resampling rate improves accuracy. A similar trend emerges when we fix α and the resampling rate and increase model capacity. These findings suggest that latent topics can be recovered by the contrastive learning approach, provided we have an expressive enough model and enough unlabeled data.

7. SEMI-SUPERVISED EXPERIMENTS

We also conducted experiments with our document-level contrastive representations in a semi-supervised setting. The goal of these experiments is to demonstrate that the contrastive representations yield non-trivial performance, consistent with the theory. Note that our intention is not to show state-of-the-art performance using contrastive learning; that is beyond the scope of the paper. We discuss the main findings here, with experimental details deferred to Appendix B.

A closely related representation. In the worst case, the guarantees from Section 4 and Section 5 require the number of landmarks to be quite large. To develop a more practical representation, and to more closely mirror what is done in practice, we consider training models of the form f_1, f_2 : X → R^d with pairwise score (x, x′) ↦ f_1(x)^T f_2(x′). Plugging this into Eq. (1), we solve the following bivariate optimization problem:

minimize_{f_1, f_2}  E_{D_contrast}[ y log(1 + exp(-f_1(x)^T f_2(x′))) + (1 - y) log(1 + exp(f_1(x)^T f_2(x′))) ].  (5)

Given f_1, f_2, we can embed a document x according to f_1(x). We call the resulting scheme the direct embedding approach, to distinguish it from the landmark embedding approach of Section 3.

Methodology. We used the AG news topic classification dataset (Zhang et al., 2015), which has 4 classes and 30k training examples per class. We reserve 1k examples per class as labeled training data and use the remaining examples for representation learning. For all methods, we use ℓ2-regularized logistic regression to fit a linear classifier on the labeled data. We compared the representations Landmark-NCE and Direct-NCE against the following baselines: (1) standard bag-of-words (BOW), (2) bag-of-words with dimensionality reduction (BOW+SVD), (3) representations from LDA (LDA), and (4) skip-gram word embeddings (word2vec) (Mikolov et al., 2013b).
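As a sketch of the objective in Eq. (5), the two-tower score and its binary cross-entropy loss can be written directly. Here the towers are stand-in linear maps rather than the neural networks used in the experiments, and all names are our own.

```python
import numpy as np

def bce_logits(scores, y):
    """Binary cross-entropy of the Eq.-(5) form: y log(1+e^-s) + (1-y) log(1+e^s)."""
    return float(np.mean(y * np.logaddexp(0.0, -scores) + (1 - y) * np.logaddexp(0.0, scores)))

def direct_scores(X1, X2, W1, W2):
    """Two-tower score f1(x)^T f2(x') with stand-in linear towers f_i(x) = W_i x."""
    return np.sum((X1 @ W1.T) * (X2 @ W2.T), axis=1)
```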
For the NCE methods, we experiment with different neural network architectures and numbers of landmarks but use standard settings for other training parameters; see Appendix B for details. We note that all of these methods ignore word order in the final document-level representation, and all of them (with the exception of word2vec) ignore word order in their training. In all line plots in Figure 2, the training-examples axis refers to the number of randomly selected labeled examples used to train the linear classifier. The shaded regions denote 95% confidence intervals computed over 10 replicates of this random selection procedure.

Baseline comparison. In the left panel of Figure 2, we visualize the semi-supervised performance of NCE and the baselines. Direct-NCE outperforms all the other methods, with dramatic improvements over all except word2vec in the low labeled-data regime. BOW is quite competitive when there is an abundance of labeled data, but as the dimensionality of this representation is quite large, it performs poorly with limited samples. However, unsupervised dimensionality reduction on this representation appears to be unhelpful and actually degrades performance uniformly. Finally, we point out that word embedding representations (word2vec) perform quite well, but our document-level Direct-NCE procedure is slightly better, particularly when there are few labeled examples. This may reflect some advantage in learning document-level non-linear representations, as opposed to averaging word-level ones.

Visualizing embeddings. For a qualitative perspective, we visualize the embeddings from NCE using t-SNE with the default scikit-learn parameters (van der Maaten & Hinton, 2008; Pedregosa et al., 2011). For comparison, we also used t-SNE to visualize the document-averaged word2vec embeddings. The right panels of Figure 2 show these visualizations on the 7,600 test documents, colored according to their true label.
While qualitative, the visualization of the Direct-NCE embeddings appears to be more clearly separated into label-homogeneous regions than that of word2vec.

Other results. We investigated the effect of the number of landmarks on the performance of Landmark-NCE by embedding with 500, 1k, 4k, 8k, and 16k landmarks. The bottom left panel of Figure 2 displays the results, which suggest that a larger number of landmarks is helpful, with diminishing returns at the higher end of the scale. We also looked into the effect of depth on the performance of Direct-NCE by training networks with one, two, and three hidden layers. In each case, the first hidden layer has 300 nodes and the rest have 256 nodes. The top center panel of Figure 2 displays the results, which suggest that using deeper models for representation learning may improve downstream performance. We also tracked the contrastive loss of the model on a held-out validation contrastive dataset. The bottom center panel of Figure 2 plots how this loss evolves over training epochs. Along with this contrastive loss, we checkpoint the model, train a linear classifier, and evaluate the supervised test accuracy. We see that test accuracy steadily improves as contrastive loss decreases, suggesting that in these settings, contrastive loss (which we can measure using an unlabeled validation set) is a good surrogate for downstream performance (which may not be measurable until we have a task at hand).

A TOPIC MODELING SIMULATIONS

The results of Section 4 show that if a model is trained to minimize the contrastive learning objective, then that model must also recover certain topic posterior information in the corpus. However, a few practical questions remain: can we train such a model, how much capacity should it have, and how much data is needed to train it? Our topic modeling simulations are designed to study these questions.

Simulation setup. We considered a single-topic generative model where the topics $\theta_1, \ldots, \theta_K$ are sampled from a symmetric Dirichlet$(\alpha/K)$ distribution over $\Delta^{|V|}$ and, for each document, the length is drawn from a Poisson$(\lambda)$ and the topic is sampled uniformly from $[K]$. This model can be thought of as a limiting case of the LDA model (Blei et al., 2003; Griffiths & Steyvers, 2004) in which the document-level topic distribution is symmetric Dirichlet$(\beta)$ with $\beta \ll 1$. In our experiments, we set $K = 20$, $|V| = 5000$, and $\lambda = 30$, and we varied $\alpha$ from 1 to 10. Notice that as $\alpha$ increases, the Dirichlet prior becomes more concentrated around the uniform distribution, so the topic distributions are more likely to be similar. Thus, we expect the contrastive learning problem to be more difficult for larger values of $\alpha$.

We used contrastive models of the same form as in Section 7, namely pairs $(f_1, f_2)$ where the final prediction is $f_1(x)^\top f_2(x')$ and $f_1$ and $f_2$ are fully-connected neural networks with three hidden layers. To measure the effect of model capacity, we trained two models: a smaller model with 256 nodes per hidden layer and a larger model with 512 nodes per hidden layer. Both models were trained for 100 epochs. We used the same optimization parameters as in Section 7, with the exception of dropout, which we did not use. To study the effect of training data, we varied the rate $r$ at which we resampled our entire contrastive training set from the ground truth topic model. Specifically, after every $1/r$-th training epoch, we resampled 60,000 new documents and constructed a contrastive dataset from these documents. We varied the resampling rate $r$ from 0.1 to 1.0, where larger values of $r$ imply more training data; the total amount of training data thus varies from 600K to 6M documents.

Using the results from Section 4, we constructed the embedding $\phi(x)$ of a new document $x$ using 1000 landmark documents, each sampled from the same generative model. We constructed the true likelihood matrix $L$ of the landmark documents using the underlying topic model and recovered the model-based posterior $L^\dagger \phi(x)$. We measured accuracy as the fraction of test documents for which the MAP topic under the model-based posterior matched the generating topic. We used 5000 test documents and performed 5 replicates for each setting of parameters.
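The single-topic generative model described above can be sampled in a few lines. This is a sketch with helper names of our choosing; the exact MAP computation below plays the role of the model-based posterior used to score topic recovery.

```python
import numpy as np

def sample_topics(K, V, alpha, rng):
    # K topic distributions theta_k ~ symmetric Dirichlet(alpha/K) over the vocabulary
    return rng.dirichlet(np.full(V, alpha / K), size=K)

def sample_document(topics, lam, rng):
    # single-topic model: topic ~ Uniform([K]), length ~ Poisson(lam),
    # words drawn i.i.d. from the chosen topic's word distribution
    K, V = topics.shape
    k = int(rng.integers(K))
    m = max(1, int(rng.poisson(lam)))
    words = rng.choice(V, size=m, p=topics[k])
    return k, words

def map_topic(topics, words):
    # exact MAP topic under the single-topic model with a uniform topic prior
    log_post = np.log(topics[:, words]).sum(axis=1)
    return int(np.argmax(log_post))
```

With concentrated topics (small $\alpha$), the MAP topic of a 30-word document matches the generating topic almost always, which is the ceiling the contrastive models are compared against.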

B SEMI-SUPERVISED EXPERIMENT DETAILS

Methodology. We conducted semi-supervised experiments on the AG news topic classification dataset as compiled by Zhang et al. (2015). This dataset contains news articles that belong to one of four categories: world, sports, business, and sci/tech. There are 30,000 examples from each class in the training set and 1,900 examples from each class in the test set. We minimally preprocessed the dataset by removing punctuation and words that occurred in fewer than 10 documents, resulting in a vocabulary of approximately 16,700 words. We randomly selected 1,000 examples from each class to serve as our labeled training dataset, and we used the remaining 116,000 examples as our unlabeled dataset for learning representations. After computing representations on the unlabeled dataset, we fit a linear classifier on the labeled training set using logistic regression with cross validation to choose the $\ell_2$ regularization parameter (n_folds = 3). We compared our representations, Landmark-NCE and Direct-NCE, against several representation baselines.

• BOW - The standard bag-of-words representation.
• BOW+SVD - A bag-of-words representation with dimensionality reduction. We first perform SVD on the bag-of-words representations of the unlabeled dataset to compute a low-dimensional subspace, and then train a linear classifier on the projected bag-of-words representations using the labeled dataset.
• LDA - A representation derived from LDA. We fit LDA on the unlabeled dataset using online variational Bayes (Hoffman et al., 2010), and our representation is the inferred posterior distribution over topics given a training document.
• word2vec - Skip-gram word embeddings (Mikolov et al., 2013b). We fit the skip-gram word embedding model on the unlabeled dataset and then averaged the word embeddings in each training document to obtain its representation.

For our representation, to solve the optimization problem in Eq. (5), we considered neural network architectures of various depths. We used fully-connected layers with between 250 and 300 nodes per hidden layer. We used ReLU nonlinearities, dropout probability 1/2, batch normalization, and the default PyTorch initialization (Paszke et al., 2019). We optimized using RMSProp with momentum value 0.009 and weight decay 0.0001, as in Radhakrishnan et al. (2019). We started with learning rate $10^{-4}$, which we halved after 250 epochs, and we trained for 600 epochs. Unless otherwise stated, Landmark-NCE and Direct-NCE use a three-layer architecture, and Landmark-NCE uses 8000 landmarks.

To sample a contrastive dataset, we first randomly partitioned each unlabeled document in half to create the positive pairs. To create the negative pairs, we again randomly partitioned each unlabeled document in half, randomly permuted one set of half-documents, and discarded collisions. This results in a contrastive dataset whose size is roughly twice the number of unlabeled documents. In the course of training our models for the contrastive task, we resampled a contrastive dataset every 3 epochs to prevent overfitting on any one particular dataset.

Additional discussion. In the left panel of Figure 2, we additionally remark that LDA performs quite poorly. This could be for several reasons, including that fitting a topic model directly may be challenging on the relatively short documents in this corpus, or that the document category is not well-expressed by a linear function of the topic proportions.

C PROOFS FROM SECTION 4

C.1 PROOF OF SINGLE TOPIC REPRESENTATION LEMMA

Proof of Lemma 1. Conditioned on the topic vector $w$, the documents $x^{(1)}$ and $x^{(2)}$ are independent. Thus,
$$g^\star(x, x') = \frac{P(x^{(1)} = x,\, x^{(2)} = x')}{P(x^{(1)} = x)\, P(x^{(2)} = x')} = \frac{\sum_{k=1}^K P(w = e_k)\, P(x^{(1)} = x \mid w = e_k)\, P(x^{(2)} = x' \mid w = e_k)}{P(x^{(1)} = x)\, P(x^{(2)} = x')} = \frac{\sum_{k=1}^K P(w = e_k \mid x^{(1)} = x)\, P(x^{(2)} = x' \mid w = e_k)}{P(x^{(2)} = x')} = \frac{\eta(x)^\top \psi(x')}{P(x^{(2)} = x')},$$
where the third equality follows from Bayes' rule.
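The identity in Lemma 1 can be verified numerically on a small single-topic model by computing both sides exactly. This is a sketch with function names of our choosing:

```python
import numpy as np

def lemma1_check(prior, O, x, x2):
    """Compute both sides of Lemma 1's identity on a small single-topic model.
    prior: (K,) topic prior; O: (K, V) per-topic word distributions;
    x, x2: documents given as tuples of word indices."""
    lik = lambda doc: O[:, list(doc)].prod(axis=1)   # psi(doc)_k = P(doc | w = e_k)
    px, px2 = prior @ lik(x), prior @ lik(x2)        # marginal document probabilities
    joint = prior @ (lik(x) * lik(x2))               # P(x^(1) = x, x^(2) = x2)
    lhs = joint / (px * px2)                         # g*(x, x2)
    eta = prior * lik(x) / px                        # eta(x)_k = P(w = e_k | x)
    rhs = (eta @ lik(x2)) / px2                      # eta(x)^T psi(x2) / P(x^(2) = x2)
    return float(lhs), float(rhs)
```

The two quantities agree to machine precision for any choice of prior, word distributions, and documents, since the identity is exact.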

C.2 PROOF OF GENERAL REPRESENTATION LEMMA

Proof of Lemma 2. Fix a document $x$ of length $m$ and a topic proportion vector $w$. Conditioning on the assignment of each word in the document to a topic, the probability of the document factorizes as
$$P(x \mid w) = \sum_{z \in [K]^m} \prod_{i=1}^m w_{z_i} O(x_i \mid z_i) = \sum_{z \in [K]^m} \Big( \prod_{i=1}^m w_{z_i} \Big) \Big( \prod_{i=1}^m O(x_i \mid z_i) \Big) = \pi(w)^\top \psi(x),$$
where the last equality follows from collecting like terms. Using the form of $g^\star$ from above, we have
$$g^\star(x, x') = \frac{P(x^{(1)} = x,\, x^{(2)} = x')}{P(x^{(1)} = x)\, P(x^{(2)} = x')} = \frac{\int_w P(x^{(1)} = x \mid w)\, P(x^{(2)} = x' \mid w)\, dP(w)}{P(x^{(1)} = x)\, P(x^{(2)} = x')} = \frac{\int_w P(x^{(2)} = x' \mid w)\, dP(w \mid x^{(1)} = x)}{P(x^{(2)} = x')} = \frac{\int_w \pi(w)^\top \psi(x')\, dP(w \mid x^{(1)} = x)}{P(x^{(2)} = x')} = \frac{\eta(x)^\top \psi(x')}{P(x^{(2)} = x')}.$$
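The first step of the proof, that $P(x \mid w)$ is obtained by summing over topic assignments, can be checked numerically by comparing the brute-force sum over $z \in [K]^m$ with the equivalent per-word mixture factorization. This is a sketch with hypothetical helper names:

```python
import itertools
import numpy as np

def doc_prob_bruteforce(w, O, x):
    # P(x | w) by summing over all topic assignments z in [K]^m
    K = len(w)
    total = 0.0
    for z in itertools.product(range(K), repeat=len(x)):
        total += float(np.prod([w[zi] * O[zi, xi] for zi, xi in zip(z, x)]))
    return total

def doc_prob_factored(w, O, x):
    # equivalently, each word is drawn independently from the mixture sum_k w_k O(. | k)
    return float(np.prod([w @ O[:, xi] for xi in x]))
```

Collecting like terms in the brute-force sum is exactly what turns it into the polynomial $\pi(w)^\top \psi(x)$ used in the lemma.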

C.3 PROOF OF POLYNOMIAL REPRESENTATION THEOREM

Proof of Theorem 3. By assumption (i), there exists an anchor word $a_k$ for each topic $k = 1, \ldots, K$. By definition, this means that $O(a_k \mid j) > 0$ if and only if $j = k$. For each vector $\alpha \in \mathbb{Z}_+^K$ such that $\sum_k \alpha_k \le d_o$, create a landmark document consisting of $\alpha_k$ copies of $a_k$ for $k = 1, \ldots, K$. This results in $\binom{K + d_o}{d_o}$ landmark documents. Moreover, from assumption (ii), each of these landmark documents has positive probability of occurring under the marginal distribution of $x^{(2)}$ for $(x^{(1)}, x^{(2)}, y) \sim D_{\mathrm{contrast}}$, which implies that $g^\star(x, l)$ is well-defined for all of our landmark documents $l$.

Let $l$ denote one of our landmark documents, let $m$ be its length, and let $\alpha \in \mathbb{Z}_+^K$ be its associated vector. Since $l$ contains only anchor words, $\psi(l)_\beta > 0$ if and only if $\beta = \alpha$. To see this, note that
$$\psi(l)_\alpha = \sum_{z :\, \mathrm{type}(z) = \alpha} \prod_{i=1}^m O(l_i \mid z_i) \ge \prod_{k=1}^K O(a_k \mid k)^{\alpha_k} > 0,$$
where the sum ranges over topic assignments $z \in [K]^m$ of type $\alpha$. On the other hand, if $\beta \neq \alpha$ but $\sum_k \beta_k = \sum_k \alpha_k$, then there exists an index $k$ such that $\beta_k \ge \alpha_k + 1$. Thus, for any assignment $z$ of type $\beta$, more than $\alpha_k$ words in $l$ are assigned to topic $k$. Since every word in $l$ is an anchor word and at most $\alpha_k$ of them correspond to topic $k$, we have $\prod_{i=1}^m O(l_i \mid z_i) = 0$.

Rebinding $\psi(l) = (\psi_0(l), \ldots, \psi_{d_o}(l))$ and forming the matrix $L$ using this definition, we see that $L^\top$ is diagonal after suitably ordering its rows and columns, with positive diagonal entries, and hence invertible. For any target degree-$d_o$ polynomial $Q(w)$, there exists a vector $v$ such that $Q(w) = \langle v, \pi_{d_o}(w) \rangle$, where $\pi_{d_o}(w)$ denotes the vector of monomials of degree at most $d_o$. Thus, we may take $\theta = L^{-1} v$ and obtain, for any document $x$,
$$\langle \theta, g^\star(x, l_{1:M}) \rangle = (L^{-1} v)^\top L^\top \eta(x) = \langle v, \mathbb{E}[\pi_{d_o}(w) \mid x^{(1)} = x] \rangle = \mathbb{E}[Q(w) \mid x^{(1)} = x].$$
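The landmark construction in this proof, one document per vector $\alpha \in \mathbb{Z}_+^K$ with $\sum_k \alpha_k \le d_o$, can be enumerated directly. The sketch below, with names of our choosing, reproduces the $\binom{K + d_o}{d_o}$ count via a stars-and-bars enumeration:

```python
import itertools

def anchor_landmarks(anchors, d_o):
    """One landmark document per alpha in Z_+^K with sum(alpha) <= d_o,
    containing alpha_k copies of anchor word a_k (Theorem 3's construction)."""
    K = len(anchors)
    docs = []
    for total in range(d_o + 1):
        # stars-and-bars enumeration of the weak compositions of `total` into K parts
        for cuts in itertools.combinations(range(total + K - 1), K - 1):
            bounds = (-1,) + cuts + (total + K - 1,)
            alpha = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
            docs.append(tuple(w for k, a_k in enumerate(alpha)
                              for w in [anchors[k]] * a_k))
    return docs
```

Each returned document contains only anchor words, which is what makes the resulting matrix $L$ diagonal (up to ordering) and hence invertible.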

D PROOFS FROM SECTION 5

D.1 PROOF OF ERROR TRANSFORMATION GUARANTEE

We first recall and set up some notation. For $(x^{(1)}, x^{(2)}, y) \sim D_{\mathrm{contrast}}$ (our contrastive distribution defined in Section 3), we let $\mu^{(i)}$ denote the marginal distribution of $x^{(i)}$. Furthermore, recall the contrastive loss, conditional probability, odds ratio, and oracle representation functions:
$$L_{\mathrm{contrast}}(f) := \mathbb{E}_{(x, x', y) \sim D_{\mathrm{contrast}}}\left[ y \log\big(1 + e^{-f(x, x')}\big) + (1 - y) \log\big(1 + e^{f(x, x')}\big) \right],$$
$$f^\star(x, x') := \log \frac{P(y = 1 \mid x^{(1)} = x, x^{(2)} = x')}{P(y = 0 \mid x^{(1)} = x, x^{(2)} = x')}, \qquad g^\star(x, x') := \exp\big(f^\star(x, x')\big) = \frac{P(x^{(1)} = x,\, x^{(2)} = x')}{P(x^{(1)} = x)\, P(x^{(2)} = x')},$$
$$\phi^\star(x) := \big(g^\star(x, l_1), \ldots, g^\star(x, l_M)\big),$$
where $l_1, \ldots, l_M$ are landmark documents. The learned approximation to $f^\star$ is $\hat{f}$, and from it we derive
$$\hat{g}(x, x') := \exp\big(\hat{f}(x, x')\big), \qquad \hat{\phi}(x) := \big(\hat{g}(x, l_1), \ldots, \hat{g}(x, l_M)\big).$$
Let $\eta(x), \psi(x)$ denote the posterior/likelihood vectors from Lemma 1 or the posterior/likelihood polynomial vectors from Lemma 2, and say the length of these vectors is $N \ge 1$. Our goal is to show that linear functions of the representation $\hat{\phi}(x)$ can provide a good approximation to the target function $x \mapsto \eta(x)^\top \theta^\star$, where $\theta^\star \in \mathbb{R}^N$ is some fixed vector. To this end, define
$$R(\hat{\phi}) := \min_v\, \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(\eta(x)^\top \theta^\star - \hat{\phi}(x)^\top v\big)^2 \right],$$
which is the best mean squared error achievable using the representation $\hat{\phi}$. By Lemma 1 or Lemma 2, we know that for any $x, x'$ we have $g^\star(x, x') = \eta(x)^\top \psi(x')$. Recall the matrix $L := \big(\psi(l_1), \ldots, \psi(l_M)\big) \in \mathbb{R}^{N \times M}$. If $L$ has full row rank, then
$$\eta(x)^\top \theta^\star = \eta(x)^\top L L^\dagger \theta^\star = \phi^\star(x)^\top v^\star, \qquad \text{where } v^\star = L^\dagger \theta^\star.$$
Thus, $R(\phi^\star) = 0$. We will show that $R(\hat{\phi})$ can be bounded as well.

Theorem 5 (Restatement of Theorem 4). Suppose the following assumptions hold.
(1) There is a constant $\sigma_{\min} > 0$ such that for any $\delta \in (0, 1)$, there is a number $M_0(\delta)$ such that for an i.i.d. sample $l_1, \ldots, l_M$ with $M \ge M_0(\delta)$, with probability $1 - \delta$, the matrix $L = \big(\psi(l_1) \cdots \psi(l_M)\big)$ has minimum singular value at least $\sigma_{\min} \sqrt{M}$.
(2) There exists a value $g_{\max} > 0$ such that for all documents $x$ and landmarks $l_i$,
$$\max\big\{ |f^\star(x, l_i)|,\, |\hat{f}(x, l_i)| \big\} \le \log g_{\max}.$$
Let $\hat{f}$ be the function returned by the contrastive learning algorithm, and let $\varepsilon := L_{\mathrm{contrast}}(\hat{f}) - L_{\mathrm{contrast}}(f^\star)$ denote its excess contrastive loss. For any $\delta \in (0, 1)$, if $M \ge M_0(\delta/2)$, then with probability at least $1 - \delta$ over the random draw of $l_1, \ldots, l_M$, we have
$$R(\hat{\phi}) \le \frac{\|\theta^\star\|_2^2\, (1 + g_{\max})^4}{\sigma_{\min}^2} \left( \varepsilon + \sqrt{\frac{2 \log(2/\delta)}{M}} \right).$$

Proof. We first condition on two events based on the sample $l_1, \ldots, l_M$. The first is the event that $L$ has full row rank and smallest non-zero singular value at least $\sqrt{M} \sigma_{\min} > 0$; this event has probability at least $1 - \delta/2$. The second is the event that
$$\frac{1}{M} \sum_{j=1}^M \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(p^\star(x, l_j) - \hat{p}(x, l_j)\big)^2 \right] \le \mathbb{E}_{(x, x') \sim \mu^{(1)} \otimes \mu^{(2)}}\left[ \big(p^\star(x, x') - \hat{p}(x, x')\big)^2 \right] + \sqrt{\frac{2 \log(2/\delta)}{M}}, \tag{6}$$
where we make the definitions
$$\hat{g}(x, x') := \exp\big(\hat{f}(x, x')\big), \qquad \hat{p}(x, x') := \frac{1}{1 + e^{-\hat{f}(x, x')}} = \frac{\hat{g}(x, x')}{1 + \hat{g}(x, x')}, \qquad p^\star(x, x') := \frac{1}{1 + e^{-f^\star(x, x')}} = \frac{g^\star(x, x')}{1 + g^\star(x, x')}.$$
By Hoeffding's inequality and the fact that $\hat{p}$ and $p^\star$ have range $[0, 1]$, this event also has probability at least $1 - \delta/2$. By the union bound, both events hold simultaneously with probability at least $1 - \delta$; we condition on them for the remainder of the proof.

Since $L$ has full row rank, via Cauchy-Schwarz, we have
$$R(\hat{\phi}) = \min_v\, \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(\eta(x)^\top \theta^\star - \hat{\phi}(x)^\top v\big)^2 \right] \le \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(\eta(x)^\top \theta^\star - \hat{\phi}(x)^\top v^\star\big)^2 \right] = \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big((\phi^\star(x) - \hat{\phi}(x))^\top v^\star\big)^2 \right] \le \|v^\star\|_2^2 \cdot \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big\|\phi^\star(x) - \hat{\phi}(x)\big\|_2^2 \right].$$
We analyze the two factors on the right-hand side separately.

Analysis of $v^\star$. We have
$$\|v^\star\|_2^2 \le \|L^\dagger\|_2^2\, \|\theta^\star\|_2^2 \le \frac{\|\theta^\star\|_2^2}{M \sigma_{\min}^2},$$
where we have used the fact that $L$ has smallest non-zero singular value at least $\sqrt{M} \sigma_{\min}$.

Analysis of $\phi^\star - \hat{\phi}$. For the other term, we first note that
$$p^\star(x, x') = \frac{1}{1 + e^{-f^\star(x, x')}} = P(y = 1 \mid x^{(1)} = x, x^{(2)} = x').$$
Thus, we have
$$\varepsilon = L_{\mathrm{contrast}}(\hat{f}) - L_{\mathrm{contrast}}(f^\star) = \mathbb{E}_{(x, x', y) \sim D_{\mathrm{contrast}}}\left[ y \log \frac{p^\star(x, x')}{\hat{p}(x, x')} + (1 - y) \log \frac{1 - p^\star(x, x')}{1 - \hat{p}(x, x')} \right] = \mathbb{E}_{(x, x') \sim D_{\mathrm{contrast}}}\left[ p^\star(x, x') \log \frac{p^\star(x, x')}{\hat{p}(x, x')} + (1 - p^\star(x, x')) \log \frac{1 - p^\star(x, x')}{1 - \hat{p}(x, x')} \right] = \mathbb{E}_{(x, x') \sim D_{\mathrm{contrast}}}\left[ \mathrm{KL}\big(p^\star(x, x'),\, \hat{p}(x, x')\big) \right],$$
where $\mathrm{KL}(p, p')$ denotes the KL-divergence between two Bernoulli distributions with biases $p$ and $p'$, respectively. Pinsker's inequality tells us that $\mathrm{KL}(p, p') \ge 2(p - p')^2$. Combining this with the fact that $D_{\mathrm{contrast}}$ is a mixture distribution that places half of its probability mass on $\mu^{(1)} \otimes \mu^{(2)}$ implies
$$\varepsilon \ge 2\, \mathbb{E}_{(x, x') \sim D_{\mathrm{contrast}}}\left[ \big(\hat{p}(x, x') - p^\star(x, x')\big)^2 \right] \ge \mathbb{E}_{(x, x') \sim \mu^{(1)} \otimes \mu^{(2)}}\left[ \big(\hat{p}(x, x') - p^\star(x, x')\big)^2 \right].$$
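Pinsker's inequality for Bernoulli distributions, $\mathrm{KL}(p, p') \ge 2(p - p')^2$, is easy to check numerically. A minimal standalone sketch:

```python
import math

def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)) in nats, for p, q in the open interval (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

Sweeping a grid of biases confirms the quadratic lower bound, which is the step that converts excess contrastive loss into a squared-error bound.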



Strictly speaking, we should first partition $x = (x^{(1)}, x^{(2)})$, only use landmarks that occur as second halves of documents, and embed $x \mapsto (g^\star(x^{(1)}, l_1), \ldots, g^\star(x^{(1)}, l_M))$. For the sake of clarity, we ignore this small technical issue here and in the remainder of the paper.



Figure 1: Topic modeling simulations. Left: Average total variation distance between topics. Right: Topic recovery accuracy for contrastive models. Total number of documents sampled = 6M × rate.


Figure 2: Experiments with the AG news dataset. Top left: Test accuracy of methods as we increase the number of supervised training examples. Bottom left: Landmark-NCE performance as we vary the number of landmarks. Top middle: Direct-NCE performance as we vary network depth. Bottom middle: Relationship between contrastive error and test accuracy for Direct-NCE. Right: t-SNE visualizations of Direct-NCE and word2vec embeddings.


Combining the above with Eq. (6) and the definitions of $\hat{p}, p^\star$, and noting that the boundedness assumption $|f^\star|, |\hat{f}| \le \log g_{\max}$ makes the map $p \mapsto p/(1 - p)$ at most $(1 + g_{\max})^2$-Lipschitz on the relevant range, we have
$$\mathbb{E}_{x \sim \mu^{(1)}}\left[ \big\|\phi^\star(x) - \hat{\phi}(x)\big\|_2^2 \right] = \sum_{j=1}^M \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(g^\star(x, l_j) - \hat{g}(x, l_j)\big)^2 \right] \le (1 + g_{\max})^4 \sum_{j=1}^M \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big(p^\star(x, l_j) - \hat{p}(x, l_j)\big)^2 \right] \le (1 + g_{\max})^4\, M \left( \varepsilon + \sqrt{\frac{2 \log(2/\delta)}{M}} \right).$$
Wrapping up. To conclude, we have
$$R(\hat{\phi}) \le \|v^\star\|_2^2 \cdot \mathbb{E}_{x \sim \mu^{(1)}}\left[ \big\|\phi^\star(x) - \hat{\phi}(x)\big\|_2^2 \right] \le \frac{\|\theta^\star\|_2^2}{M \sigma_{\min}^2} \cdot (1 + g_{\max})^4\, M \left( \varepsilon + \sqrt{\frac{2 \log(2/\delta)}{M}} \right) = \frac{\|\theta^\star\|_2^2\, (1 + g_{\max})^4}{\sigma_{\min}^2} \left( \varepsilon + \sqrt{\frac{2 \log(2/\delta)}{M}} \right).$$

D.2 SATISFYING ASSUMPTION 1

Suppose we are in the single-topic case where $w \in \{e_1, \ldots, e_K\}$. Assume that $\min_k P(w = e_k) \ge w_{\min}$. Further assume that each topic $k$ has an anchor word $a_k$ satisfying $O(a_k \mid z = e_k) \ge a_{\min}$. Then we will show that when $M$ and $m$ are large enough, the matrix $L$ whose columns are $\psi(x)/P(x)$ will have large singular values.

First note that if document $x$ contains $a_k$, then $\psi(x)$ is one-sparse, and the corresponding column satisfies
$$\frac{\psi(x)}{P(x)} = \frac{P(x \mid w = e_k)}{P(w = e_k)\, P(x \mid w = e_k)}\, e_k = \frac{e_k}{P(w = e_k)}.$$
Therefore, the second moment matrix satisfies
$$\mathbb{E}\left[ \frac{\psi(x) \psi(x)^\top}{P(x)^2} \right] \succeq \sum_{k=1}^K \frac{P(w = e_k,\ a_k \in x)}{P(w = e_k)^2}\, e_k e_k^\top.$$
Now, if the number of words per document is $m \ge 1/a_{\min}$, then
$$P(a_k \in x \mid w = e_k) \ge 1 - (1 - a_{\min})^m \ge 1 - \exp(-m a_{\min}) \ge 1 - 1/e.$$
Finally, using the fact that $P(w = e_k) \le 1$, we see that the second moment matrix satisfies
$$\mathbb{E}\left[ \frac{\psi(x) \psi(x)^\top}{P(x)^2} \right] \succeq (1 - 1/e)\, I_{K \times K}.$$
For the empirical matrix, we perform a crude analysis and apply the matrix Hoeffding inequality (Tropp, 2012). We have $\big\| \psi(x) \psi(x)^\top / P(x)^2 \big\|_2 \le K w_{\min}^{-2}$, and so with probability at least $1 - \delta$, we have
$$\left\| \frac{1}{M} \sum_{j=1}^M \frac{\psi(l_j) \psi(l_j)^\top}{P(l_j)^2} - \mathbb{E}\left[ \frac{\psi(x) \psi(x)^\top}{P(x)^2} \right] \right\|_2 \le O\!\left( \sqrt{\frac{K \log(K/\delta)}{w_{\min}^2 M}} \right).$$
If we take $M \ge \Omega(K \log(K/\delta) / w_{\min}^2)$, then the minimum eigenvalue of the empirical second moment matrix will be at least $1/2$.
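The claim that the normalized likelihood columns $\psi(x)/P(x)$ have a well-conditioned empirical second moment can be checked on a small anchor-word model. This is an illustrative sketch: the model parameters and helper names are ours.

```python
import numpy as np

def normalized_likelihood(prior, O, words):
    # column psi(x)/P(x), where psi(x)_k = P(x | w = e_k)
    log_lik = np.log(np.clip(O[:, words], 1e-300, None)).sum(axis=1)
    lik = np.exp(log_lik - log_lik.max())   # psi/P is scale-invariant, so rescaling is safe
    return lik / (prior @ lik)

def empirical_second_moment(prior, O, docs):
    # (1/M) sum_j v_j v_j^T with v_j = psi(doc_j)/P(doc_j)
    cols = np.stack([normalized_likelihood(prior, O, d) for d in docs])
    return cols.T @ cols / len(docs)
```

When documents are long enough to contain their topic's anchor word with high probability, nearly every column is a scaled standard basis vector, so the empirical second moment is close to a well-conditioned diagonal matrix.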

