UNDERSTANDING THE ROBUSTNESS OF SELF-SUPERVISED LEARNING THROUGH TOPIC MODELING

Abstract

Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful representations, and why it is better than traditional approaches such as probabilistic models, are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning: when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform on par with posterior inference using the correct model, while outperforming posterior inference using misspecified models.

1. INTRODUCTION

Recently researchers have successfully trained large-scale models like BERT (Devlin et al., 2018) and GPT (Radford et al., 2018), which offer extremely powerful representations for many NLP tasks (see, e.g., Liu et al. (2021); Jaiswal et al. (2021) and references therein). To train these models, one often starts with sentences in a large text corpus, marks random words as "unknown", and asks the neural network to predict the unknown words. This approach is known as self-supervised learning (SSL).

Why can self-supervised approaches learn useful representations? To understand this we first need to define what "useful representations" are. A recent line of work (Tosh et al., 2021a; Wei et al., 2021) studied self-supervised learning in the context of probabilistic models: assuming the data is generated by a probabilistic model (such as a topic model or a Hidden Markov Model), one can define the representation of observed data as the corresponding hidden variables in the model (such as topic proportions in topic models or hidden states in Hidden Markov Models). These works show that the self-supervised learning approach is as good as explicitly doing inference using such models.

This approach naturally leads to the next question: why can self-supervised learning perform better than traditional inference based on probabilistic models? In this paper we study this question in the context of topic modeling, and highlight one key advantage of self-supervised learning: robustness to model misspecification. Many different models (such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the Correlated Topic Model (CTM) (Blei & Lafferty, 2007), and the Pachinko Allocation Model (PAM) (Li & McCallum, 2006)) have been applied in practice. Traditional approaches would require different ways of doing inference depending on which model is used to generate the data.
On the other hand, we show that no matter which topic model is used to generate the data, if standard self-supervised learning objectives, such as the reconstruction-based objective (Equation (1), similar to the objective used in Pathak et al. (2016); Devlin et al. (2018)) or the contrastive objective (Equation (2), also used in Tosh et al. (2021a)), can be minimized, then they will produce representations that contain useful information about the topic proportions of a document. Self-supervised learning is oblivious to the choice of the probabilistic model, while the traditional approach of probabilistic modeling depends highly on the specific model. Therefore, one would expect self-supervised learning to perform similarly to inference with the correct model, and to outperform inference with a misspecified model. To verify our theory, we run synthetic experiments showing that self-supervised learning indeed outperforms inference with misspecified models. Unlike large-scale models, our self-supervised learning is applied in the much simpler context of topic models, but we also demonstrate that even this simple application can improve over simple baselines on real data.
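To make the model-oblivious nature of the reconstruction-based approach concrete, here is a minimal sketch, assuming numpy: documents are sampled from a toy LDA model, half of each document's words are held out, and a predictor is fit purely from (observed words, held-out words) pairs. The linear least-squares predictor, the toy sizes, and the train-time evaluation are illustrative assumptions for exposition, not the objectives or algorithms analyzed in this paper; the key point is that the learner never touches the topic-word matrix or the Dirichlet prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: k topics, V vocabulary words, n_docs documents.
k, V, n_docs, doc_len = 3, 30, 200, 50

# Topic-word matrix A (V x k): each column is a distribution over words.
A = rng.dirichlet(np.ones(V) * 0.1, size=k).T  # shape (V, k)

def sample_document():
    """Sample topic proportions w ~ Dirichlet(1) and doc_len words from Aw (LDA)."""
    w = rng.dirichlet(np.ones(k))
    words = rng.choice(V, size=doc_len, p=A @ w)
    return words

# Reconstruction-style self-supervised task: hold out half the words of each
# document and predict their empirical distribution from the other half.
X, Y = [], []
for _ in range(n_docs):
    words = sample_document()
    held_in, held_out = words[: doc_len // 2], words[doc_len // 2 :]
    X.append(np.bincount(held_in, minlength=V) / (doc_len // 2))
    Y.append(np.bincount(held_out, minlength=V) / (doc_len // 2))
X, Y = np.array(X), np.array(Y)

# A linear least-squares predictor f(x) = x @ W is oblivious to the generative
# model: it only ever sees (input, target) pairs, never A or the prior.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

mse_pred = np.mean((X @ W - Y) ** 2)
mse_base = np.mean((Y.mean(axis=0) - Y) ** 2)  # constant-prediction baseline
assert mse_pred < mse_base
```

The same training loop would run unchanged if `sample_document` instead drew topic proportions from a correlated or Pachinko-style prior, which is exactly the robustness to model misspecification discussed above.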

1.1. RELATED WORKS

Self-Supervised Learning Self-supervised learning has recently been shown to learn useful representations that can later be used for downstream tasks; see for example Bachman et al. (2019); Caron et al. (2020); Chen et al. (2020a;b;c); Grill et al. (2020); Chen & He (2021); Tian et al. (2020a); He et al. (2020) and references therein. In particular, Devlin et al. (2018) proposed BERT, which shows that self-supervised learning can train large-scale language models and provide powerful representations for downstream natural language processing tasks.

Theoretical Understanding of Self-Supervised Learning Given the recent success of self-supervised learning, many works have tried to provide theoretical understanding of contrastive learning (Arora et al., 2019; Wang & Isola, 2020; Tosh et al., 2021a; Tian et al., 2020b; HaoChen et al., 2021; Wen & Li, 2021; Zimmermann et al., 2021) and reconstruction-based learning (Lee et al., 2020; Saunshi et al., 2020; Teng & Huang, 2021). Also, several papers considered the problem from a multi-view perspective (Tsai et al., 2020; Tosh et al., 2021b), which covers both contrastive and reconstruction-based learning. Moreover, Wei et al. (2020) and Tian et al. (2021) studied the theoretical properties of self-training and of contrastive learning without negative pairs, respectively. Saunshi et al. (2020) investigated the benefits of pre-trained language models for downstream tasks. Most relevant to our paper, Tosh et al. (2021a) considered contrastive learning in the topic model setting. Our theoretical results extend their theory to the reconstruction-based objective (while also removing some assumptions for the contrastive objective), and our empirical results show that the reconstruction-based objective can be effectively minimized.

Theoretical Analysis of Topic Models Many works have proposed provable algorithms for learning topic models, such as method-of-moments approaches (Anandkumar et al., 2012; 2013; 2014; 2015) and anchor-word approaches (Papadimitriou et al., 2000; Arora et al., 2012; 2016a; Gillis & Vavasis, 2013; Bittorf et al., 2012). Much less is known about provable inference for topic models: Sontag & Roy (2011) showed that MAP estimation can be NP-hard even for the LDA model, and Arora et al. (2016b) considered approximate inference algorithms.

1.2. OUTLINE

We first introduce the basic concepts of topic models and our objectives in Section 2. In Section 3 we prove guarantees for the reconstruction-based objective. Section 4 connects the contrastive objective to the reconstruction-based objective, which allows us to prove a stronger guarantee for the former. We then demonstrate the ability of self-supervised learning to adapt to different models through synthetic experiments in Section 5. Finally, in Section 6 we evaluate the reconstruction-based objective on real data to show that, despite the simplicity of the topic modeling context, it extracts reasonable representations.

2. PRELIMINARIES

In this section we first introduce some general notation. Then we briefly describe the topic models we consider in Section 2.1. Finally, we define the self-supervised learning objectives in Section 2.3 and give our main results.

Notation

We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a vector $x \in \mathbb{R}^d$, we write $\|x\|$ for its $\ell_2$ norm and $\|x\|_1$ for its $\ell_1$ norm. For a matrix $A \in \mathbb{R}^{m \times n}$, we use $A_i \in \mathbb{R}^m$ to denote its $i$-th column. When matrix $A$ has full column rank, we denote its left pseudo-inverse by $A^\dagger = (A^\top A)^{-1} A^\top$. For a matrix or general tensor $T \in \mathbb{R}^{d_1 \times \cdots \times d_l}$, we use the vector $\mathrm{vec}(T) \in \mathbb{R}^{d_1 \cdots d_l}$ to represent its vectorization. Let $S_k = \{x \in \mathbb{R}^k \mid \sum_{i=1}^k x_i = 1,\; x_i \ge 0\}$ denote the $(k-1)$-dimensional probability simplex. For two probability vectors $p, q$, their total variation (TV) distance is defined as $\mathrm{TV}(p, q) = \|p - q\|_1 / 2$.
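The notation above can be checked numerically; the following is a small sketch, assuming numpy, with illustrative matrices and vectors chosen only to exercise each definition.

```python
import numpy as np

# Left pseudo-inverse of a full-column-rank matrix: A† = (AᵀA)⁻¹Aᵀ, so A†A = I.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                      # shape (3, 2), full column rank
A_dagger = np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(A_dagger @ A, np.eye(2))

# Vectorization of a tensor T in R^{d1 x ... x dl} into vec(T) in R^{d1...dl}.
T = np.arange(24.0).reshape(2, 3, 4)
vec_T = T.reshape(-1)
assert vec_T.shape == (2 * 3 * 4,)

# Membership in the probability simplex S_k, and the TV distance.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()   # p lies in S_3
tv = np.abs(p - q).sum() / 2
assert np.isclose(tv, 0.3)
```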




