CONTRASTIVE ESTIMATION REVEALS TOPIC POSTERIOR INFORMATION TO LINEAR MODELS

Anonymous authors
Paper under double-blind review

Abstract

Contrastive learning is an approach to representation learning that utilizes naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. In the context of document classification under topic modeling assumptions, we prove that contrastive learning is capable of recovering a representation of documents that reveals their underlying topic posterior information to linear models. We apply this procedure in a semi-supervised setup and demonstrate empirically that linear classifiers with these representations perform well in document classification tasks with very few training examples.

1. INTRODUCTION

Using unlabeled data to find useful embeddings is a central challenge in representation learning. Classical approaches to this task often start by fitting some type of structure to the unlabeled data, such as a generative model or a dictionary, and then embed future data via inference with the fitted structure (Blei et al., 2003; Raina et al., 2007). While principled, this approach is not without its drawbacks. One issue is that learning structures and performing inference is often hard in general (Sontag & Roy, 2011; Arora et al., 2012). Another issue is that we must a priori choose a structure and method for fitting the unlabeled data, and unsupervised methods for learning these structures can be sensitive to model misspecification (Kulesza et al., 2014).

Contrastive learning (also called noise contrastive estimation, or NCE) is an alternative representation learning approach that tries to capture the latent structure in unlabeled data implicitly. At a high level, these methods formulate a classification problem in which the goal is to distinguish examples that naturally occur in pairs, called positive samples, from randomly paired examples, called negative samples. The particular choice of positive samples depends on the setting. In image representation problems, for example, patches from the same image or neighboring frames from videos may serve as positive examples (Wang & Gupta, 2015; Hjelm et al., 2018). In text modeling, the positive samples may be neighboring sentences (Logeswaran & Lee, 2018; Devlin et al., 2018). The idea is that in the course of learning to distinguish between semantically similar positive examples and randomly chosen negative examples, we will capture some of the latent semantic information.

In this work, we look "under the hood" of contrastive learning and consider its application to document modeling, where the goal is to construct useful vector representations of text documents in a corpus.
In this setting, there is a natural source of positive and negative examples: a positive example is simply a document from the corpus, and a negative example is one formed by pasting together the first half of one document and the second half of another (independently chosen) document. We prove that when the corpus is generated by a topic model, learning to distinguish between these two types of documents yields representations that are closely related to their underlying latent variables.

One potential application of contrastive learning is in a semi-supervised setting, where there is a small amount of labeled data as well as a much larger collection of unlabeled data. In these situations, purely supervised methods that fit complicated models may have poor performance due to the limited amount of labeled data. On the other hand, when the labels are well-approximated by some function of the latent structure, our results show that an effective strategy is to fit linear functions, which may be learned with relatively little labeled data, on top of contrastive representations. In our experiments, we verify empirically that this approach produces reasonable results.

Contributions. The primary goal of this work is to shed light on what contrastive learning techniques uncover in the presence of latent structure. To this end, we focus on the setting of document modeling where latent structure is induced by a topic model. Here, our contrastive learning objective is to distinguish true documents from 'fake' documents that are composed by randomly pasting together two document halves from the corpus. We consider two types of architectures, or functional forms of solutions, for this problem, both trained with logistic loss. The first architecture, on which our theoretical analysis will focus, consists of general functions of the form f(·, ·). Here, we have trained f so that f(x, x') indicates the confidence of the model that x and x' are two halves of the same document.
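The positive/negative construction described above can be sketched as follows. This is a minimal illustration with documents represented as token lists; the function name and sampling details are our own, not the paper's implementation:

```python
import random

def make_contrastive_pairs(corpus, num_negatives):
    """Build (first_half, second_half, label) examples from a corpus.

    A positive example is the two halves of one real document; a
    negative example pastes the first half of one document onto the
    second half of another, independently chosen document.
    """
    pairs = []
    for doc in corpus:
        mid = len(doc) // 2
        pairs.append((doc[:mid], doc[mid:], 1))  # positive: same document
    for _ in range(num_negatives):
        a = random.choice(corpus)
        b = random.choice(corpus)  # chosen independently of a
        pairs.append((a[: len(a) // 2], b[len(b) // 2 :], 0))  # negative: pasted halves
    return pairs
```

The resulting labeled pairs define the binary classification task on which f is trained.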
To embed a new document x using f, we propose a landmark embedding procedure: fix documents l_1, ..., l_M (our so-called landmarks) and create the embedding φ(x) using a function of the predictions f(x, l_1), ..., f(x, l_M). In Section 4, we show that the embedding φ(x) is a linear transformation of the underlying topic posterior moments of x. Moreover, under certain conditions this linear relationship is invertible, so that linear functions of φ(x) correspond to polynomial functions of the topic posterior of document x. In Section 5, we show that errors in f on the contrastive learning objective transfer smoothly to errors in φ(x) as a linear transformation of the topic posterior of x. Thus, as the quality of f improves, linear functions of φ(x) more closely approximate polynomial functions of the topic posterior of document x.

Unfortunately, the landmark embedding can require quite a few landmarks before our theoretical results kick in. Moreover, embedding a document requires M evaluations of f, which can be expensive. To circumvent this, in Section 7 we introduce a direct embedding procedure that more closely matches what is done in practice. We use an architecture of the form f_1(x)^T f_2(x'), where f_1, f_2 are functions with d-dimensional outputs, and we train this architecture on the same contrastive learning task as before. To embed a document x, we simply use the evaluation f_1(x). In Section 7, we evaluate this embedding on a semi-supervised learning task, and we show that it has reasonable performance. Indeed, the direct embedding method outperforms the landmark embedding method, which raises the question of whether or not anything can be theoretically proven about the direct embedding method. We leave this question to future work.
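The two embedding procedures can be sketched schematically as follows, assuming a trained pairwise model f and a trained tower f_1 are given; all names here are illustrative:

```python
import numpy as np

def landmark_embedding(f, x, landmarks):
    """Landmark embedding: represent document x by its pairwise scores
    f(x, l_1), ..., f(x, l_M) against M fixed landmark documents.
    Requires M evaluations of f per document."""
    return np.array([f(x, l) for l in landmarks])

def direct_embedding(f1, x):
    """Direct embedding: a single evaluation of the first tower f_1 of
    an f_1(x)^T f_2(x') architecture, avoiding the M pairwise
    evaluations above."""
    return f1(x)
```

The contrast in cost is visible directly: the landmark procedure scales with the number of landmarks M, while the direct procedure is one forward evaluation.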

Related work.

Reducing an unsupervised problem to a synthetically-generated supervised problem is a well-studied technique. In dynamical systems modeling, Langford et al. (2009) showed that the solutions to a few forward prediction problems can be used to track the underlying state of a nonlinear dynamical system. For linear dynamics, the idea is also seen in autoregressive models (Yule, 1927). In anomaly/outlier detection, a useful technique is to learn a classifier that distinguishes between true samples from a distribution and fake samples from some synthetic distribution (Steinwart et al., 2005; Abe et al., 2006). Similarly, estimating the parameters of a probabilistic model can be reduced to learning to classify between true data and randomly generated noise (Gutmann & Hyvärinen, 2010).

In the context of natural language processing, methods such as skip-gram and continuous bag-of-words turn the problem of finding word embeddings into a prediction problem (Mikolov et al., 2013a;b). Modern language representation training algorithms such as BERT and QT also use naturally occurring classification tasks, such as predicting randomly masked elements of a sentence or discriminating whether or not two sentences are adjacent (Devlin et al., 2018; Logeswaran & Lee, 2018). Training these models often employs a technique called negative sampling, in which softmax prediction probabilities are estimated by randomly sampling examples; this bears close resemblance to the way that negative examples are produced in contrastive learning.

Most relevant to the current paper, Arora et al. (2019) gave a theoretical analysis of contrastive learning. They considered the specific setting of trying to minimize the contrastive loss L(f) = E_{x,x+,x-}[ℓ(f(x)^T (f(x+) - f(x-)))], where ℓ is a surrogate loss such as the logistic loss, (x, x+) is a positive pair, and (x, x-) is a negative pair.
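An empirical version of this objective can be sketched as follows. The logistic surrogate is one common choice considered by Arora et al.; this is an illustrative sketch, not their implementation, and the function names are our own:

```python
import numpy as np

def contrastive_loss(f, triples):
    """Empirical contrastive loss in the style of Arora et al. (2019):
    the average logistic loss of the margin f(x)^T (f(x+) - f(x-))
    over sampled (x, x+, x-) triples."""
    margins = [f(x) @ (f(x_pos) - f(x_neg)) for x, x_pos, x_neg in triples]
    # Logistic surrogate: log(1 + exp(-margin)), small when the positive
    # pair scores higher than the negative pair.
    return float(np.mean(np.log1p(np.exp(-np.array(margins)))))
```

Minimizing this quantity over embedding functions f encourages f(x) to align more with f(x+) than with f(x-).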
They showed that if there is an underlying collection of latent classes and positive examples are generated by draws from the same class, then minimizing the contrastive loss over embedding functions f yields good representations for the classification task of distinguishing latent classes. The main difference between our work and that of Arora et al. (2019) is that we adopt a generative modeling perspective and induce the contrastive distribution naturally, while they do not make generative assumptions but assume the contrastive distribution is directly induced by the downstream

