TOWARDS ROBUST AND EFFICIENT CONTRASTIVE TEXTUAL REPRESENTATION LEARNING

Abstract

There has been growing interest in representation learning for text data, based on both theoretical arguments and empirical evidence. One important direction leverages contrastive learning to improve learned representations. We propose an application of contrastive learning to intermediate textual feature pairs, explicitly encouraging the model to learn more distinguishable representations. To overcome the learner's degeneracy caused by vanishing contrastive signals, we impose Wasserstein constraints on the critic via spectral regularization. Finally, to keep this objective from over-regularizing training and to enhance learning efficiency, we further leverage, with theoretical justification, an active negative-sample-selection procedure that uses only high-quality contrastive examples. We evaluate the proposed method over a wide range of natural language processing applications, from the perspectives of both supervised and unsupervised learning. Empirical results show consistent improvement over baselines.

Under review as a conference paper at ICLR 2021

In this paper, we propose two methods to mitigate the issues identified in the introduction. To stabilize training and enhance the model's generalization ability, we propose the Wasserstein dependency measure (Ozair et al., 2019) as a substitute for the Kullback-Leibler (KL) measure in the vanilla CL objective. We further actively select K high-quality negative samples to contrast with each positive sample under the currently learned representations. These supply the training procedure with sufficiently many non-trivial contrastive samples, encouraging the representation network to generate more distinguishable features. Notably, our approach also significantly alleviates the computational burden of processing massive numbers of features compared with previous works (Tian et al., 2020; Hjelm et al., 2019b).
Contributions: (i) We propose a Wasserstein-regularized critic to stabilize training in a generic CL framework for learning better textual representations. (ii) We further employ an active negative-sample selection method to find high-quality contrastive samples, thus reducing the gradient noise and mitigating computational concerns. (iii) We empirically verify the effectiveness of our approach on various NLP tasks, including variational text generation (Bowman et al., 2016), natural language understanding, and image-text retrieval.

1. INTRODUCTION

Representation learning is one of the pivotal topics in natural language processing (NLP), in both supervised and unsupervised settings. It has been widely recognized that some form of "general representation" exists beyond specific applications (Oord et al., 2018). To extract such generalizable features, unsupervised representation models are generally pretrained on large-scale text corpora (e.g., BERT (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; Lagler et al., 2013)) to avoid data bias. In supervised learning, models are typically built on top of these pre-trained representations and further fine-tuned on downstream tasks. Representation learning greatly expedites model deployment while yielding performance gains. There has been growing interest in exploiting contrastive learning (CL) techniques to refine context representations in NLP (Mikolov et al., 2013a;b). These techniques aim to avoid representation collapse in downstream tasks, i.e., producing similar output sentences for different inputs in conditional generation tasks (Dai & Lin, 2017). Intuitively, these methods carefully engineer features from crafted ("negative") examples to contrast against the features from real ("positive") examples. A feature encoder can then enhance its representation power by characterizing input texts at a finer granularity. Efforts have been made to empirically investigate and theoretically understand the effectiveness of CL in NLP, including noise contrastive estimation (NCE) of word embeddings (Mikolov et al., 2013b) and probabilistic machine translation (Vaswani et al., 2013), with accompanying theoretical developments (Gutmann & Hyvärinen, 2010). More recently, InfoNCE (Oord et al., 2018) further links CL to the optimization of mutual information, which has inspired a series of practical follow-up works (Tian et al., 2020; Hjelm et al., 2019a; He et al., 2020; Chen et al., 2020).
Despite the significant empirical success of CL, there are still many open challenges in its application to NLP, including: (i) The propagation of stable contrastive signals. An unregularized critic function in CL can suffer from unstable training and gradient vanishing, especially in NLP tasks, due to the discrete nature of text. The inherent differences between positive and negative textual features make those examples easy to distinguish, resulting in a weak learning signal in contrastive schemes (Arora et al., 2019). (ii) Empirical evidence (Wu et al., 2017) shows that it is crucial to compare each positive example with adequate negative examples. However, recent works suggest that using abundant negative examples that are not akin to the positive examples can result in sub-optimal results and unstable training, with additional computational overhead (Ozair et al., 2019; McAllester & Stratos, 2020).

2. BACKGROUND

2.1 NOISE CONTRASTIVE ESTIMATION

Our formulation is inspired by Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which was originally introduced for unnormalized density estimation, where the partition function is intractable. To estimate a parametric distribution p, which we refer to as our target distribution, NCE leverages not only the observed samples A = (a_1, a_2, ..., a_{n_1}) (positive samples), but also samples drawn from a reference distribution q, denoted B = (b_1, b_2, ..., b_{n_2}) (negative samples). Instead of estimating p directly, the density ratio p/q is estimated by training a critic between samples from A and B. Specifically, let Z = (z_1, ..., z_{n_1 + n_2}) denote the union of A and B. A binary class label C_t is assigned to each z_t, where C_t = 1 if z_t ∈ A and C_t = 0 otherwise. The label probability is therefore

P(C = 1 | z) = p(z) / (p(z) + γ q(z)),   P(C = 0 | z) = γ q(z) / (p(z) + γ q(z)),   (1)

where γ = n_2 / n_1 is a balancing hyperparameter accounting for the difference in the number of samples between A and B. In practice, we do not know the analytic form of p; therefore, a classifier g : z → [0, 1] is trained to estimate P(C = 1 | z). To estimate the critic function g, NCE maximizes the log-likelihood of the data for a binary classification task:

L(A, B) = Σ_{t=1}^{n_1} log[g(a_t)] + Σ_{t=1}^{n_2} log[1 - g(b_t)].   (2)
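As a concrete sketch, the binary NCE objective in equation 2 can be written in a few lines of PyTorch. The `critic` here is a hypothetical network producing a logit per sample (names and architecture are illustrative, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def nce_loss(critic, positives, negatives):
    """Negated NCE objective from Eq. (2): maximize log g(a_t) on positive
    samples and log(1 - g(b_t)) on negative samples, with g = sigmoid(logit).
    Returned as a loss (negated), so minimizing it maximizes Eq. (2)."""
    pos_logits = critic(positives)   # shape: (n1,)
    neg_logits = critic(negatives)   # shape: (n2,)
    # log g(a) = logsigmoid(logit); log(1 - g(b)) = logsigmoid(-logit)
    return -(F.logsigmoid(pos_logits).mean() + F.logsigmoid(-neg_logits).mean())
```

Using means rather than sums simply folds the γ = n_2/n_1 balancing factor into the per-sample averaging.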

2.2. CONTRASTIVE TEXTUAL REPRESENTATION LEARNING AND ITS CHALLENGES

Let {w_i}_{i=1}^{n} be the observed text instances. We are interested in finding a vector representation u of the text w, i.e., via an encoder u = Enc(w), that can be repurposed for downstream tasks. A positive pair refers to paired instances a_i = (u_i, v_i) associated with w_i, where we are mostly interested in learning u; v is a feature at a different representation level. In unsupervised scenarios, v_i can be the feature representation at the layer next to the input text w_i. In supervised scenarios, v_i can be either the feature representation layer immediately after w_i or immediately before the label y_i that corresponds to the input w_i. We will use π(u, v) to denote the joint distribution of the positive pairs, with π_u(u) and π_v(v) for the respective marginals. Contrastive learning follows the principle of "learning by comparison." Specifically, one designs a negative sample distribution τ(u', v'), and attempts to distinguish samples from π(u, v) and τ(u', v') with a critic function g(u, v). The heuristic is that, using samples from τ as references (i.e., to contrast against), the learner is better positioned to capture important properties that could otherwise have been missed (Hjelm et al., 2019a; Oord et al., 2018). A popular choice of τ is the product of marginals, i.e., τ ← π_0(u', v') = π_u(u') π_v(v'), where u' and v' are independent of each other, so that b_i = (u'_i, v'_i) ∼ π_0. Inputting the new a_i and b_i into (2), we obtain the new CL loss:

L_NCE = E_{u,v ∼ π}[log g(u, v)] + γ E_{u',v' ∼ π_0}[log(1 - g(u', v'))].   (3)

Note that when g is trained to optimality, g*(u, v) = P(C = 1 | u, v) under π_0, and it establishes a lower bound on the mutual information (MI) between u and v for the positive distribution (Tian et al., 2020; Neyshabur et al., 2018):

MI(π_u, π_v) = KL(π(u, v) || π_u(u) π_v(v)) ≥ E_{u,v ∼ π}[log g*(u, v)] + log γ.   (4)
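In practice, a common way to draw samples from the product of marginals π_u(u')π_v(v') is to shuffle one side of a mini-batch of positive pairs. A minimal sketch (illustrative; not necessarily the paper's exact implementation):

```python
import torch

def shuffle_negatives(u, v):
    """Pair each u_i with a v_j from a (randomly permuted) other row,
    approximating samples from the product of marginals pi_u(u) pi_v(v).
    A few pairs may coincide with positives by chance; this is the standard
    in-batch approximation used by InfoNCE-style objectives."""
    idx = torch.randperm(v.size(0))
    return u, v[idx]
```

The resulting (u, v[idx]) pairs serve as the b_i in equation 3, while the aligned (u, v) pairs serve as the a_i.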
However, there are three concerns as to why directly applying equation 3 may not work well in practice for learning contrastive representations of the input text w.

• Robustness. The first issue concerns the MI's strong sensitivity to small differences in data samples (Ozair et al., 2019; Tschannen et al., 2020). By the definition in equation 4, mutual information is a KL divergence. The KL divergence is well known not to be a metric-aware divergence measure, which implies that a minor difference in representation can induce drastic changes in the mutual information. Consequently, the learned g can be numerically unstable (Ozair et al., 2019), which makes the learned representations less robust and prevents them from generalizing well to downstream tasks, especially when features come from text (Chen et al., 2018).

• Weak/vanishing contrastive signal. With a poor initialization or a poor choice of negative samples, the contrastive signal degrades as π(u, v) and π_u(u)π_v(v) move far apart, delivering a faint and non-smooth gradient for training. In the extreme case where the supports of π(u, v) and π_u(u)π_v(v) do not overlap, the gradient vanishes to zero (Arjovsky et al., 2017).

• Negative-sample selection strategy. Learning MI is generally considered sample-inefficient. This point can be corroborated from several perspectives, ranging from theoretical arguments to practical considerations. To confidently estimate a lower bound on the MI, one needs a sample size exponential in the mutual information (i.e., N ≥ exp(I_π(u, v))) (Ozair et al., 2019; McAllester & Stratos, 2018). Also, both theoretical predictions and empirical evidence suggest that a large ratio γ is needed for good performance (Tian et al., 2020; Hjelm et al., 2019a), imposing potential computational concerns for large training datasets.
On the other hand, some studies report that a large γ can instead deteriorate model performance (Tschannen et al., 2020; Arora et al., 2019). Such a large γ is also believed to be problematic, especially when a strong association is expected between u and v. In that case, the majority of negative samples are so different from the positive samples that the comparisons do not lend effective learning signals, but instead randomly perturb the training (Gutmann & Hyvärinen, 2010).

3. METHOD

3.1 MODEL OVERVIEW

We consider two remedies to mitigate the three issues mentioned in Section 2.2. (i) Regarding the robustness and gradient-vanishing issues, we switch from the MI-based NCE to a Wasserstein-based NCE, by imposing a Wasserstein constraint on the critic function g. The Wasserstein distance yields a continuous discrepancy measure between two distributions, even when they have no overlapping support, and suffers less from numerical instability (Neyshabur et al., 2018). (ii) Regarding the issue of negative samples, we propose an active negative-sample-selection strategy that dynamically selects the most challenging negative examples on-the-fly. This strategy smooths the learning signal (Wu et al., 2017), effectively improving the CL while significantly reducing the computational overhead. To this end, we propose RECLAIM (RElaxed Contrastive Learning with ActIve Memory selection) as a robust and efficient CL framework. Our learning framework is illustrated in Figure 1. Details are explained in the following sections.

3.2. WASSERSTEIN CONSTRAINED CRITIC

In previous work, the critic g is usually chosen to be a naive feed-forward neural network (Tian et al., 2020). However, as discussed in Section 2.2, such a choice of critic function typically leads to a KL-based objective, which suffers from instability and vanishing-gradient issues. Inspired by Ozair et al. (2019), we replace the KL divergence in equation 4 with a Wasserstein distance. Specifically, we ensure a 1-Lipschitz constraint on the critic function g (Arjovsky et al., 2017).

A function f is said to be L-Lipschitz if |f(x) - f(y)| ≤ L‖x - y‖, i.e., the difference in function outputs is controlled by the discrepancy in the inputs. Instead of using a gradient penalty as in (Gulrajani et al., 2017; Ozair et al., 2019), we employ spectral normalization (SN) (Miyato et al., 2018). We use SN because it is efficient and stable, and it provides a stricter 1-Lipschitz constraint than the gradient penalty. Specifically, SN controls the Lipschitz constant of the critic function by constraining each layer g_l in g. Formally, ‖g_l‖_Lip = sup_x δ(∇g_l(x)), where δ(·) is the spectral norm, i.e., the largest singular value of its input. For an affine transformation, such as a linear function g_l(x) = W_l x, the spectral norm is ‖g_l‖_Lip = δ(W_l). When the activation function a_l has Lipschitz norm equal to 1 (such as ReLU (Jarrett et al., 2009) and Leaky ReLU (Maas et al., 2013)), we have the inequality ‖g_1 ∘ g_2‖_Lip ≤ ‖g_1‖_Lip ‖g_2‖_Lip. With this inequality, we obtain the following bound:

‖g‖_Lip = ‖g_1 ∘ a_1 ∘ g_2 ∘ a_2 ∘ ... ∘ g_L‖_Lip ≤ ‖g_1‖_Lip ‖a_1‖_Lip ‖g_2‖_Lip ‖a_2‖_Lip ... ‖g_L‖_Lip = Π_{l=1}^{L} ‖g_l‖_Lip = Π_{l=1}^{L} δ(W_l).   (5)

Applying the spectral normalization operation to each weight W_l, i.e., W_SN = W / δ(W), enforces δ(W_SN) = 1, so that the right-hand side of (5) is upper-bounded by 1. This imposes the 1-Lipschitz constraint on the critic function g, thus stabilizing the learning signal during training. In practice, the spectral normalization can be estimated efficiently using power iteration.

3.3. ACTIVE NEGATIVE-SAMPLE SELECTION (ANS)

Following the discussion in Section 2.2, stable training requires adequate and high-quality negative samples. However, involving too many negative samples far apart from their positive counterparts does not yield effective training, and wastes computational resources. Inspired by the triplet loss in deep metric learning (Wu et al., 2017; Tschannen et al., 2020), we propose to actively select a relatively small set of negative examples that are most challenging to the discriminator at the current step, thus enabling the model to effectively extract features that distinguish the positives from the negatives. Specifically, we use two memory banks B_u and B_v, which store all previously extracted features u and v from seen training instances, respectively. When processing each new training instance, we actively select the top-K nearest neighbors U_nn ⊂ B_u and V_nn ⊂ B_v for u and v via cosine distance. With the QuickSelect algorithm (Hoare, 1961), we can identify the top-K negative samples with time complexity O(KN). Under this setup, equation 3 can be written as:

L_ANS = E_{π(u,v)}[ log g(u, v) + (1/2) Σ_{v' ∈ V_nn} log(1 - g(u, v')) + (1/2) Σ_{u' ∈ U_nn} log(1 - g(u', v)) ].   (6)

When the dataset is large, the feature search can still be time-consuming. Asymmetric Locality Sensitive Hashing (ALSH) (Shrivastava & Li, 2014) can be applied to hash the representations in a proximity-relevant manner, reducing the time complexity of ANS to sub-linear (Shrivastava & Li, 2014). In (Schroff et al., 2015), it was found empirically that relaxing the most challenging negative samples to semi-hard negative samples sometimes leads to better results in supervised tasks like classification. This indicates that an approximate retrieval method like ALSH can still perform well. Our experimental observations are consistent with these findings.
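The nearest-neighbor step of ANS can be sketched as follows. This is a brute-force version for clarity (the paper uses QuickSelect or ALSH for efficiency), and the function name is ours:

```python
import torch
import torch.nn.functional as F

def active_negative_select(query, bank, k):
    """Return the k bank features nearest to `query` under cosine distance,
    i.e. the hardest negatives for the current representation."""
    q = F.normalize(query, dim=-1)   # (d,)
    B = F.normalize(bank, dim=-1)    # (N, d)
    sims = B @ q                     # cosine similarities, (N,)
    _, idx = sims.topk(k)            # highest similarity = hardest negatives
    return bank[idx]
```

Applied once to B_u (with query u) and once to B_v (with query v), this yields the U_nn and V_nn sets used in equation 6.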

3.4. RECLAIM LEARNING PROCEDURE

Momentum update. Before training, the two memory banks B_u, B_v are initialized with Gaussian noise: {u ∼ N(0, I) | ∀u ∈ B_u}, {v ∼ N(0, I) | ∀v ∈ B_v}. Naively, one could directly replace old features in the memory banks with newly processed features corresponding to the same input data. However, such a solution is suboptimal in practice. This is because the target model may change rapidly during the early training stages; such a practice reduces the consistency of the representations and results in noisy learning signals (He et al., 2020; Wu et al., 2018). We therefore apply a momentum update to mitigate this issue. Specifically, assume a seen input instance x reappears. There is a snapshot feature pair for this x previously stored in B_u, B_v, denoted {ũ, ṽ}. We further denote the feature pair newly computed (with the current encoder) for x as {u, v}, and let ρ ∈ (0, 1] be the momentum update parameter. We update the feature pairs in B_u, B_v as:

ũ ← (1 - ρ)u + ρũ,   ṽ ← (1 - ρ)v + ρṽ.   (7)

Note that the features in the memory banks are only updated in this way, and they are detached from the computational graph.

Joint objective optimization. With our ANS module, we obtain the top-K nearest negative features for both u and v, denoted {u'_i}_{i=1}^{K} and {v'_i}_{i=1}^{K}, respectively. To obtain our CL loss, we apply the Wasserstein-distance-based NCE loss by simply imposing the 1-Lipschitz constraint on the critic g in (6). Assuming the task-specific loss is L_task, the total loss is

L_total = L_task - λ L_ANS,   (8)

where λ > 0 is a hyper-parameter controlling the importance of the NCE loss. By minimizing equation 8, we improve the performance of the trained model. The detailed training procedure is listed in Algorithm 1.
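Equation 7 amounts to an in-place exponential moving average over bank entries, kept outside the computational graph. A minimal sketch (the bank layout and names are our assumptions):

```python
import torch

def momentum_update(bank, index, new_feat, rho=0.5):
    """Eq. (7): blend the stored snapshot with the freshly computed feature.
    The bank stays detached from the computational graph, so gradients
    never flow through stored features."""
    with torch.no_grad():
        bank[index] = (1.0 - rho) * new_feat.detach() + rho * bank[index]
```

Larger ρ keeps the bank closer to its historical snapshots, trading freshness for consistency of the stored representations.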

4. RELATED WORK

Connection to Mutual Information Estimation (MIE): MIE represents a family of MI-based representation learning methods, with early works including InfoMax (Linsker, 1988; Bell & Sejnowski, 1995).

Connection to deep metric learning: The triplet loss is a classic approach in deep metric learning, and has been widely used in retrieval models (Lee et al., 2018) and recommendation systems (Chechik et al., 2010). It minimizes the distance between an anchor point and a positive input while maximizing its distance to a negative input. Motivated by the triplet loss, some works enforce constraints on more than one negative example. For example, PDDM (Huang et al., 2016) and Histogram Loss (Ustinova & Lempitsky, 2016) use quadruplets. Moreover, the n-pair loss (Sohn, 2016) and Lifted Structure (Oh Song et al., 2016) define constraints on all images in a batch. These previous works focus on enlarging the batch size, since they sample negative examples only within the batch. Our approach incorporates the advantages of these works and moves beyond them, allowing active sampling of the most challenging negative pairs from all seen instances stored in the memory banks. Such a global sampling strategy keeps the negative pairs selected for the same positive pair more consistent during training, making training more stable.

5. EXPERIMENTS

Three experiments are conducted to evaluate our approach: (i) supervised and semi-supervised natural language understanding (NLU) tasks on the GLUE dataset via BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019); (ii) an image-text cross-domain retrieval task; and (iii) a text generation task within the VAE framework (Bowman et al., 2016). We set λ = 0.1 across all experiments, which are run on two NVIDIA TITAN X GPUs.

5.1. GLUE DATASET CLASSIFICATION

Dataset: The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of 9 datasets for evaluating natural language understanding models. Six of these tasks are either single-sentence classification or paired-sentence classification tasks. Implementation details: We develop our approach on the open-sourced Hugging Face Transformers codebase (Wolf et al., 2019) foot_0 . The hyper-parameter settings (learning rate, batch size, number of epochs, etc.) are all set to the defaults recommended by the official Transformers repository, for fair comparison and reproducibility. We report results on the development sets after fine-tuning the pre-trained models (BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019)) on each downstream task's training data (no ensemble model or multi-task training is used). We utilize the 12-layer architecture (base model) for both BERT and RoBERTa, due to limited computational resources. The features u and v are set to the classification token (CLS) embedding and the embedded input text representation in BERT/RoBERTa, respectively. We present comparisons on fully supervised tasks in Table 1, with K = 100 for RECLAIM. According to this table, RECLAIM consistently improves the performance on all GLUE datasets. For the semi-supervised experiments, we randomly draw {1%, 5%, 10%, 20%, 30%, 50%} from each training dataset in GLUE, and results are provided in Table 2. Note that in this case, we only use the top-10 (K = 10) negative examples, since the training dataset size is largely reduced. It can be seen from Table 2 that our approach generally achieves better results than simply fine-tuning BERT on a small dataset. We note that both the fully supervised and semi-supervised tasks on GLUE serve not only as classic natural language inference (NLI) problems, but also as a testbed for the generalization ability of a representation learner on new tasks.
Presumably, results in Tables 1 and 2 suggest that the large-scale pre-trained BERT knowledge has been better transferred to each task in GLUE in our CL approach, as improving the Wasserstein dependency measure (Ozair et al., 2019) can be seen as encouraging the model to be as "lossless" as possible.

5.2. TEXT-IMAGE RETRIEVAL Dataset:

We evaluate the proposed RECLAIM on both the Flickr30K (Plummer et al., 2015) and MS-COCO (Lin et al., 2014b) datasets. Flickr30K contains 31,000 images, each with five captions. The Flickr30K dataset is split following the same setup as (Karpathy & Fei-Fei, 2015; Faghri et al., 2018): a training set with 29,000 images, a validation set with 1,000 images, and a test set with 1,000 images. MS-COCO contains 123,287 images, with 5 human-annotated descriptions per image. We use the same split as in (Faghri et al., 2018), i.e., the dataset is split into 113,287 training images, 5,000 validation images, and 5,000 test images. Implementation details: For this image-text matching task, we use the Adam optimizer (Kingma & Ba, 2015) to train the models. Note that we developed our approach upon SCAN (Lee et al., 2018) foot_2 , by substituting the triplet loss with our RECLAIM. In this experiment, u is the textual feature extracted from a GRU, and v is the image feature extracted from a ResNet. Training details are provided in the Appendix. The performance of sentence retrieval with an image query, or image retrieval with a sentence query, is measured by recall at T (R@T) (Karpathy & Fei-Fei, 2015), defined as the percentage of queries for which the correct object appears among the top-T items ranked by the model's similarity scores. In this experiment, we further improve the image-text retrieval results while keeping the basic model design of SCAN. The improvement indicates that features extracted by our CL approach can better capture the common salient information between text and image pairs. Table 3: Image-Text retrieval results with Recall@K (R@K). Upper panel: Flickr30K; lower panel: MS-COCO.
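The R@T metric can be computed directly from the model's query-item similarity matrix. A sketch, assuming the ground-truth item for query i sits at column i (the usual aligned-pair convention; function name is ours):

```python
import torch

def recall_at_t(sim, t):
    """R@T: fraction of queries whose ground-truth item (at the matching
    index, sim[i, i]) appears among the top-T similarity scores.

    sim: (n_queries, n_items) similarity matrix produced by the model."""
    topt = sim.topk(t, dim=1).indices                # (n, t)
    gt = torch.arange(sim.size(0)).unsqueeze(1)      # (n, 1) ground truth
    hits = (topt == gt).any(dim=1).float()
    return hits.mean().item()
```

For caption retrieval with five captions per image, the ground-truth column index would instead map each image to its caption block, but the top-T membership test is the same.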

5.3. TEXT GENERATION Dataset:

We further evaluate our model on an unsupervised text generation task. Two commonly used datasets are employed for this task: the Penn Treebank (PTB) (Marcus et al., 1993; Bowman et al., 2016) and the Yelp corpora (Yang et al., 2017; He et al., 2019). PTB is a relatively small dataset with sentences of varying lengths, whereas Yelp contains a larger number of long sentences. Detailed statistics of these two datasets are summarized in the Appendix. Implementation details: To ensure fair comparison and reproducibility between models, we develop our model based on an existing codebase foot_3 . The encoder and decoder are both 1-layer LSTMs, and the hyper-parameter setup follows the instructions in the original codebase. In this task, u is the latent code of the text VAE and v is the word embedding vector of the input text. The most commonly used metrics are applied to evaluate the learned language model, as listed in Table 4. According to Table 4, by simply adding our proposed CL method, the base model can be further improved in terms of most of the automatic metrics. A lower negative ELBO indicates that our approach yields a better language model. A larger KL divergence and more Active Units (AU) (Burda et al., 2015) indicate that the latent space is utilized more fully. We also observed that the posterior-collapse problem is alleviated, with improved mutual information and KL (Fang et al., 2019); this is presumably because we add an additional CL objective on the latent code and the output to improve the MI between them. Table 4: Variational language modeling results on PTB (left) and Yelp (right).

5.4. ABLATION STUDY

Choice of K. We seek to further investigate how the negative sample size K influences the effectiveness of our model. To this end, we choose different K = {1, 10, 100, 300} in ANS for comparison. Besides testing different K with ANS, we also test an alternative approach, where we randomly sample 80% of the features from the memory bank as negative samples instead of applying ANS. We denote this method simply as the 80% Method. In addition, the in-batch method denotes that we only use negative samples within the batch. The results can be found in Table 5. Note that these two tricks can be viewed as two different ways of constructing a vanilla contrastive learning (Vanilla CL) algorithm. Table 5: Results for different choices of K on the GLUE dataset. The 80% Method is also listed for comparison. From Table 5 we observe that when K is small, the improvement is limited. On some tasks, such as MNLI, it is even worse than the BERT-base model. This finding is consistent with arguments from previous works (Wu et al., 2017; Tschannen et al., 2020). For K = 100 and K = 300, comparable results are often observed, and each outperforms the other on certain tasks; K = 300 seems to work better on tasks with larger datasets such as MNLI/QNLI. We hypothesize that this is because larger datasets contain more high-quality contrastive examples than smaller datasets, thus allowing a large K without introducing much noise. Both settings show better results than the 80% Method without ANS. Computational Efficiency: MNLI, the biggest dataset in GLUE, is employed as a running-time benchmark to evaluate the computational efficiency of the different approaches. We record the training time for the original BERT, K = 100, and the 80% Method. Without any contrastive regularization, BERT takes approximately 45 minutes per epoch. For K = 100, RECLAIM needs 47 minutes per epoch. The 80% Method takes 81 minutes per epoch.
The memory usage for BERT-base is around 7.5GB per GPU on a TITAN X. K = 100 takes an additional 200MB, and the 80% Method uses the full 12GB memory capacity. These empirical findings provide evidence that our method is both efficient and effective. Due to space limitations, other ablation studies, including the investigation of different λ choices, are provided in the Appendix.

6. CONCLUSIONS

We have proposed a novel contrastive learning (CL) framework, RECLAIM, for natural language processing tasks. Our approach improves the "contrast" in the feature space, making the features learned by the representation encoder more distinguishable, representative, and informative. We identified the challenges in CL training and proposed several remedies to alleviate them. Extensive experiments show consistent improvements over a variety of NLP tasks, demonstrating the effectiveness of our approach.

Importance of λ choices: We also test the effect of choosing different λ ∈ {0.01, 0.1, 0.5, 1}. As shown in Table 9, when λ ≤ 0.5, RECLAIM consistently outperforms the BERT-base model. This may be because the contrastive loss needs to be re-scaled to the same numerical scale as the task-specific loss.

Experiment on BERT-Large: We also run the GLUE experiments on the BERT-large model, to test whether our proposed algorithm remains effective. Results can be found in Table 10.

B.4 VQA TEST

We also test our approach on the Visual Question Answering (VQA) 2.0 task (Goyal et al., 2017), which contains human-annotated QA pairs on COCO images (Lin et al., 2014a). For each image, an average of 3 questions are collected, with 10 candidate answers per question. The most frequent answer from the annotators is selected as the correct answer. Following previous work (Kim et al., 2018), we take the answers that appear more than 9 times in the training set as candidate answers, resulting in 3129 candidates. Classification accuracy is used as the evaluation metric, defined as min(1, (# humans who provided that ans.) / 3). In this setup, we choose u and v as the question features and image features, respectively. By applying our RECLAIM approach directly to the BAN model, we observe an improvement on the VQA task, as shown in Table 2.
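The soft accuracy metric above can be sketched as a few lines of Python (the function name and the exact-string-match comparison are our assumptions; the official evaluation also normalizes answer strings):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """VQA 2.0 soft accuracy: a prediction is fully correct when at
    least 3 of the 10 annotators gave that answer, i.e.
    min(1, n_agree / 3)."""
    n_agree = sum(a == predicted_answer for a in human_answers)
    return min(1.0, n_agree / 3.0)

# 2 of 10 annotators agree -> accuracy 2/3; 4 or more agree -> 1.0.
acc = vqa_accuracy("cat", ["cat", "cat", "dog", "dog", "dog",
                           "bird", "bird", "fish", "fish", "fox"])
```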

C TRAINING DETAILS

Image-Text Retrieval For the Flickr30K data, we train the model for 30 epochs. The initial learning rate is set to 0.0002, and decays by a factor of 10 after 15 epochs. For the MS-COCO data, we train the model for 20 epochs, with an initial learning rate of 0.0005 that decays by a factor of 10 after 10 epochs. We set the batch size to 128 and threshold the maximum gradient norm to 2.0 for gradient clipping. We also set the dimensions of the GRU and the joint embedding space to 1024, and the dimension of the word embeddings to 300.

GLUE We use a batch size of 32 for all 9 GLUE tasks, with a starting learning rate of 2 × 10^-5. For each task, we perform only 3 epochs, since some datasets, such as RTE, are quite small and can easily overfit.
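The Flickr30K schedule and the gradient clipping above can be sketched as follows (a minimal pure-Python sketch; the 0-indexed epoch convention and the function names are our assumptions, and in practice one would use the framework's built-in scheduler and clipping utilities):

```python
def step_decay_lr(epoch, base_lr=0.0002, decay_epoch=15, factor=10.0):
    """Flickr30K schedule from the text: start at 2e-4 and divide the
    learning rate by 10 after epoch 15 (epochs 0-indexed here)."""
    return base_lr if epoch < decay_epoch else base_lr / factor

def clip_grad_norm(grads, max_norm=2.0):
    """Rescale a flat list of gradient values so their global L2 norm
    does not exceed max_norm (2.0 in our setup)."""
    total = sum(g * g for g in grads) ** 0.5
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

lr_early, lr_late = step_decay_lr(0), step_decay_lr(20)  # 2e-4, then 2e-5
clipped = clip_grad_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 2.0
```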



https://github.com/huggingface/transformers, version 2.5.1, using PyTorch 1.2.0 (Paszke et al., 2017)
https://github.com/kuanghuei/SCAN
https://github.com/fangleai/Implicit-LVM



Figure 1: Illustration of our RECLAIM learning framework. An active negative-sample selection (ANS) module is applied for selecting challenging examples. We also relax the critic with the Wasserstein constraint.

Supervised results on GLUE dataset. Darker color indicates more improvement. Better in color.

Semi-supervised results on GLUE dataset (CoLA, SST-2, STS-B, MNLI). Results for QNLI, MRPC, and QQP are in the Supplementary Material (SM). Darker color indicates more improvement. Better in color.

Semi-supervised results on GLUE dataset (QNLI, MRPC, and QQP).

Variational language modeling results on PTB (left) and Yelp dataset (right).

Results on GLUE dataset.

Results for different λ choices on GLUE dataset.

Supervised results on GLUE dataset with BERT-large model.

VQA validation dataset results.


Dimension of features u and v Note that our RECLAIM formulation does not require u and v to have matching dimensions: contrastive learning compares π(u, v) against π(u)π(v), rather than comparing u directly with v. In practice, however, we often map u and v to matching dimensions via an MLP or RNN to balance their respective contributions to the loss.

Architecture for g We use two different MLP layers to first map u and v into the same dimension, then feed both into a three-layer MLP. Details will be included in our next revision. For instance, in the BERT model, we map word representations and hidden states to dim = 64.

Choice of u and v In the GLUE experiments, u is the word vectors (mapping one-hot tokens to real vectors) and v is the BERT output features with dim = 768. In the VAE setup, u is the word vectors and v is the encoded latent variable z. In the image-text retrieval setup, u is the word vectors and v is the image features.
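The description of g can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the input dimensions (300 for word vectors, 768 for BERT features), the hidden widths, the ReLU activations, and the scalar output are our choices; only the overall shape (two separate projections to a shared 64-dim space, followed by a three-layer MLP) comes from the text, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """A random weight matrix standing in for a trained linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

def relu(x):
    return np.maximum(x, 0.0)

# Two separate projections map u and v into a shared 64-dim space,
# mirroring "map word representations and hidden states to dim = 64".
proj_u = linear(300, 64)   # e.g. 300-dim word vectors
proj_v = linear(768, 64)   # e.g. 768-dim BERT output features

# Three-layer MLP critic g on the concatenated projections.
W1, W2, W3 = linear(128, 128), linear(128, 64), linear(64, 1)

def critic(u, v):
    """Scalar compatibility score g(u, v) for a feature pair."""
    h = np.concatenate([u @ proj_u, v @ proj_v])
    return float(relu(relu(h @ W1) @ W2) @ W3)

score = critic(rng.normal(size=300), rng.normal(size=768))
```

Because u and v are projected separately before being combined, this architecture works even when their raw dimensions differ, which is why the formulation does not require them to match.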

