TOWARDS ROBUST AND EFFICIENT CONTRASTIVE TEXTUAL REPRESENTATION LEARNING

Abstract

There has been growing interest in representation learning for text data, based on theoretical arguments and empirical evidence. One important direction involves leveraging contrastive learning to improve learned representations. We propose applying contrastive learning to intermediate textual feature pairs, to explicitly encourage the model to learn more distinguishable representations. To overcome the learner's degeneracy caused by vanishing contrastive signals, we impose Wasserstein constraints on the critic via spectral regularization. Finally, to keep this objective from being overly regularized during training and to enhance learning efficiency, we further leverage, with theoretical justification, an active negative-sample-selection procedure that uses only high-quality contrastive examples. We evaluate the proposed method over a wide range of natural language processing applications, from the perspectives of both supervised and unsupervised learning. Empirical results show consistent improvements over baselines.

1. INTRODUCTION

Representation learning is one of the pivotal topics in natural language processing (NLP), in both supervised and unsupervised settings. It has been widely recognized that some form of "general representation" exists beyond specific applications (Oord et al., 2018). To extract such generalizable features, unsupervised representation models are generally pretrained on large-scale text corpora (e.g., BERT (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; Lagler et al., 2013)) to avoid data bias. In supervised learning, models are typically built on top of these pretrained representations and further fine-tuned on downstream tasks. Representation learning greatly expedites model deployment while also yielding performance gains.

There has been growing interest in exploiting contrastive learning (CL) techniques to refine context representations in NLP (Mikolov et al., 2013a;b). These techniques aim to avoid representation collapse in downstream tasks, i.e., producing similar output sentences for different inputs in conditional generation tasks (Dai & Lin, 2017). Intuitively, these methods carefully engineer features from crafted ("negative") examples to contrast against the features from real ("positive") examples. A feature encoder can then enhance its representational power by characterizing input texts at a finer granularity. Efforts have been made to empirically investigate and theoretically understand the effectiveness of CL in NLP, including noise contrastive estimation (NCE) of word embeddings (Mikolov et al., 2013b) and probabilistic machine translation (Vaswani et al., 2013), with theoretical developments in (Gutmann & Hyvärinen, 2010). More recently, InfoNCE (Oord et al., 2018) further links CL to the optimization of mutual information, which has inspired a series of practical follow-up works (Tian et al., 2020; Hjelm et al., 2019a; He et al., 2020; Chen et al., 2020).
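As a concrete illustration of the InfoNCE objective mentioned above, the following PyTorch sketch treats all other in-batch pairs as negatives. The function name, temperature value, and cosine-similarity critic are illustrative assumptions, not a prescription from any particular paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(u, v, temperature=0.1):
    """InfoNCE: each (u_i, v_i) is a positive pair; every other v_j in the
    batch acts as a negative for u_i."""
    u = F.normalize(u, dim=-1)                 # cosine-similarity critic
    v = F.normalize(v, dim=-1)
    logits = u @ v.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)     # positives on the diagonal
```

Minimizing this cross-entropy maximizes a lower bound on the mutual information between the two views (up to a log batch-size term), which is what links InfoNCE to mutual-information optimization.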
Despite the significant empirical success of CL, there are still many open challenges in its application to NLP: (i) The propagation of stable contrastive signals. An unregularized critic function in CL can suffer from unstable training and vanishing gradients, especially in NLP tasks, due to the discrete nature of text. The inherent differences between positive and negative textual features make those examples easy to distinguish, resulting in a weak learning signal in contrastive schemes (Arora et al., 2019). (ii) Empirical evidence (Wu et al., 2017) shows that it is crucial to compare each positive example with an adequate number of negative examples. However, using abundant negative examples that are not akin to the positive examples can result in sub-optimal performance and unstable training, with additional computational overhead (Ozair et al., 2019; McAllester & Stratos, 2020).

In this paper, we propose two methods to mitigate the above issues. To stabilize training and enhance the model's generalization ability, we propose to use the Wasserstein dependency measure (Ozair et al., 2019) as a substitute for the Kullback-Leibler (KL) divergence in the vanilla CL objective. We further actively select K high-quality negative samples to contrast with each positive sample under the currently learned representations. These supply the training procedure with sufficiently many non-trivial contrastive samples, encouraging the representation network to generate more distinguishable features. Notably, our approach also significantly alleviates the computational burden of processing massive numbers of features compared with previous works (Tian et al., 2020; Hjelm et al., 2019b).

Contributions: (i) We propose a Wasserstein-regularized critic to stabilize training in a generic CL framework for learning better textual representations.
(ii) We further employ an active negative-sample selection method to find high-quality contrastive samples, thus reducing gradient noise and mitigating computational concerns. (iii) We empirically verify the effectiveness of our approach on various NLP tasks, including variational text generation (Bowman et al., 2016), natural language understanding tasks on GLUE under supervised and semi-supervised setups (Wang et al., 2018), and image-text retrieval (Lee et al., 2018).
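The two ingredients above can be sketched as follows. This is a hedged illustration, not the paper's exact implementation: `spectral_norm` stands in for the Wasserstein (Lipschitz) constraint on the critic, `top_k_negatives` illustrates active negative-sample selection, and all layer sizes and the value of K are arbitrary choices:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Critic(nn.Module):
    """Critic g(u, v) on concatenated features; spectral normalization
    bounds each layer's spectral norm, approximately enforcing a
    Lipschitz (Wasserstein-style) constraint on the critic."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(2 * dim, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, u, v):
        return self.net(torch.cat([u, v], dim=-1)).squeeze(-1)

def top_k_negatives(critic, u, v_neg, k):
    """Actively select, for each u_i, the k 'hardest' negatives, i.e.,
    those the current critic scores highest (hardest to distinguish)."""
    scores = critic(u.unsqueeze(1).expand(-1, v_neg.size(0), -1),
                    v_neg.unsqueeze(0).expand(u.size(0), -1, -1))  # (B, N)
    idx = scores.topk(k, dim=1).indices                            # (B, k)
    return v_neg[idx]                                              # (B, k, d)
```

Selecting only the top-K hardest negatives keeps the contrastive signal informative while avoiding the cost of scoring every positive against the full negative pool at gradient time.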

2. BACKGROUND

2.1 NOISE CONTRASTIVE ESTIMATION

Our formulation is inspired by Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which was originally introduced for unnormalized density estimation, where the partition function is intractable. To estimate a parametric distribution p, which we refer to as our target distribution, NCE leverages not only the observed samples A = (a_1, a_2, ..., a_{n_1}) (positive samples), but also samples drawn from a reference distribution q, denoted B = (b_1, b_2, ..., b_{n_2}) (negative samples). Instead of estimating p directly, the density ratio p/q is estimated by training a critic to discriminate between samples from A and B. Specifically, let Z = (z_1, ..., z_{n_1+n_2}) denote the union of A and B. A binary class label C_t is assigned to each z_t, where C_t = 1 if z_t ∈ A and C_t = 0 otherwise. The label probability is therefore

$$P(C=1 \mid z) = \frac{p(z)}{p(z) + \gamma q(z)}, \qquad P(C=0 \mid z) = \frac{\gamma q(z)}{p(z) + \gamma q(z)},$$

where γ = n_2 / n_1 is a balancing hyperparameter accounting for the difference in the number of samples between A and B. In practice, we do not know the analytic form of p; therefore, a classifier g: z → [0, 1] is trained to estimate P(C = 1 | z). To obtain an estimate of the critic function g, NCE maximizes the log-likelihood of the data for a binary classification task:

$$\mathcal{L}(A, B) = \sum_{t=1}^{n_1} \log[g(a_t)] + \sum_{t=1}^{n_2} \log[1 - g(b_t)]. \qquad (2)$$

2.2 CONTRASTIVE TEXTUAL REPRESENTATION LEARNING AND ITS CHALLENGES

Let {w_i}_{i=1}^{n} be the observed text instances. We are interested in finding a vector representation u of the text w, i.e., via an encoder u = Enc(w), that can be repurposed for downstream tasks. A positive pair refers to paired instances a_i = (u_i, v_i) associated with w_i, where we are mostly interested in learning u; v is a feature at a different representation level. In unsupervised scenarios, v_i can be the feature representation at the layer next to the input text w_i.
In supervised scenarios, v_i can be either the feature representation immediately after w_i or immediately before the label y_i that corresponds to the input w_i. We will use π(u, v) to denote the joint distribution of the positive pairs, with π_u(u) and π_v(v) for the respective marginals. Contrastive learning follows the principle of "learning by comparison." Specifically, one designs a negative sample distribution τ(u′, v′), and attempts to distinguish samples from π(u, v) and τ(u′, v′) with a critic function g(u, v). The heuristic is that, using samples from τ as references (i.e., to contrast against), the learner is better positioned to capture important properties that could have been
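The NCE objective from Section 2.1 amounts to a binary classification log-likelihood over positive and negative samples. A minimal sketch, where the critic outputs g(·) ∈ (0, 1) are assumed to be supplied by some classifier and the function name is illustrative:

```python
import torch

def nce_log_likelihood(g_pos, g_neg, eps=1e-8):
    """NCE log-likelihood L(A, B): g_pos holds g(a_t) for positives in A,
    g_neg holds g(b_t) for negatives in B, each in (0, 1).
    Returns the negated log-likelihood, suitable for minimization."""
    ll = torch.log(g_pos + eps).sum() + torch.log(1 - g_neg + eps).sum()
    return -ll
```

A confident classifier (scores near 1 on positives, near 0 on negatives) drives this loss toward zero; uncertain scores leave it large, which is exactly the signal the critic is trained on.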

