SEMANTIC RE-TUNING WITH CONTRASTIVE TENSION

Abstract

Extracting semantically useful natural language sentence representations from pre-trained deep neural networks such as Transformers remains a challenge. We first demonstrate that pre-training objectives impose a significant task bias onto the final layers of models, with a layer-wise survey of the Semantic Textual Similarity (STS) correlations for multiple common Transformer language models. We then propose a new self-supervised method called Contrastive Tension (CT) to counter such biases. CT frames the training objective as a noise-contrastive task between the final-layer representations of two independent models, in turn making the final-layer representations suitable for feature extraction. Results on multiple common unsupervised and supervised STS tasks indicate that CT outperforms the previous State Of The Art (SOTA), and when combining CT with supervised data we improve upon previous SOTA results by a large margin.

1. INTRODUCTION

Representation learning concerns the pursuit of automatically learning representations of data that are useful for the future extraction of information (Bengio et al., 2013). Recent work has predominantly focused on training and extracting such representations from various deep neural architectures. However, as these deep models are mostly trained via error minimization of an objective function applied to the final layers (Rumelhart et al., 1988), features residing in layers close to the objective function will be task-specific (Yosinski et al., 2014). Therefore, to reduce a representation's bias towards the objective function, it is common to discard one or several of the final layers, or alternatively to consider features of other intermediate layers, as with AutoEncoders (Rumelhart et al., 1986). One domain where this issue is particularly striking is learning semantic sentence embeddings with deep Transformer networks (Vaswani et al., 2017) pre-trained towards some language modeling task. Although utilizing pre-trained Transformer models such as BERT, XLNet, ELECTRA, and GPT-2 (Devlin et al., 2019; Yang et al., 2019; Clark et al., 2020; Brown et al., 2020) has become the dominant approach within the field of Natural Language Processing (NLP), with current State Of The Art (SOTA) results in virtually all NLP tasks belonging to fine-tuned versions of such models, it has been shown that simply extracting features from the layers of such models does not produce competitive sentence embeddings (Reimers & Gurevych, 2019; Liu et al., 2019a). Our interpretation of this phenomenon, which we demonstrate in this paper, is that the currently used language modeling objectives enforce a task bias at the final layers of the Transformer, and that this bias is not beneficial for the learning of semantic sentence representations.
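The layer-wise STS survey mentioned above can be sketched in a few lines. The following is a minimal illustration under assumptions, not the paper's exact evaluation code: it uses randomly generated stand-in token embeddings (a real survey would pool the hidden states of each Transformer layer, e.g. via the `output_hidden_states` option of a Transformer library), and `mean_pool`, `cosine`, and `spearman` are hypothetical helper names:

```python
import numpy as np

def mean_pool(token_embs, mask):
    """Mean-pool token embeddings (seq_len, dim) over non-padding positions."""
    m = mask[:, None].astype(float)
    return (token_embs * m).sum(axis=0) / m.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman rank correlation (no tie handling): Pearson on the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Stand-in data: 5 sentence pairs, 3 "layers" of token embeddings.
rng = np.random.default_rng(0)
gold = rng.uniform(0, 5, size=5)  # gold STS scores for the pairs
for layer in range(3):
    sims = []
    for _ in range(5):
        a = mean_pool(rng.normal(size=(7, 16)), np.ones(7))
        b = mean_pool(rng.normal(size=(7, 16)), np.ones(7))
        sims.append(cosine(a, b))
    # Per layer: correlate pooled-embedding similarities with gold scores.
    print(f"layer {layer}: spearman = {spearman(np.array(sims), gold):.3f}")
```

Running this per layer, per model, gives the kind of layer-wise STS correlation table the survey reports; with real hidden states, layers close to the pre-training objective would be expected to score worse.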
Reimers & Gurevych (2019) propose to solve this by pooling a fixed-size sentence embedding from the final Transformer layer and fine-tuning towards a Natural Language Inference (NLI) task, an approach that, when applied to Transformers, is known as Sentence-BERT (or S-BERT for short). While Hill et al. (2016a) empirically show that fine-tuning language models towards NLI data yields good results on Semantic Textual Similarity (STS), there exists no convincing argument for why NLI should be preferred over other tasks. Hence, it is unclear whether the impressive improvements of S-BERT are mainly to be attributed to the NLI task itself, or whether the fine-tuning merely trains the model to output sentence embeddings, in turn exposing the semantics learned during pre-training. Since NLI requires labeled data, an alternative method that requires no such labels would be highly valuable. We therefore propose a fully self-supervised training objective that aims to remove the bias posed by the pre-training objective and to encourage the model to output semantically useful sentence representations. Our method trains two separate language models on the task of maximizing the dot product between the two models' representations of identical sentences, and minimizing the dot product between their representations of different sentences. When applied to pre-trained BERT models, our method achieves SOTA results on multiple unsupervised STS tasks, and when applied to the S-BERT model it outperforms previous SOTA by a clear margin. To further demonstrate the robustness of our method, we show that CT drastically improves STS scores for various models, across multiple languages. Additionally, we contribute a layer-wise STS survey of the most common Transformer-based language models, in which we find great variability in performance between different architectures and pre-training objectives.
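The training objective described above can be illustrated with a minimal NumPy sketch. This is a sketch under assumptions, not the paper's exact implementation: we assume a binary cross-entropy loss on the sigmoid of the dot product between the two models' sentence embeddings, with label 1 for pairs encoding the same sentence and 0 for mismatched pairs (`ct_loss` is a hypothetical name):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ct_loss(z1, z2, labels):
    """Binary cross-entropy over pairwise dot products.

    z1: (batch, dim) sentence embeddings from model 1
    z2: (batch, dim) sentence embeddings from model 2
    labels: (batch,) 1.0 if both rows encode the same sentence, else 0.0
    """
    logits = np.einsum('bd,bd->b', z1, z2)  # one dot product per pair
    p = sigmoid(logits)
    eps = 1e-12  # numerical safety for log
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))

# Toy check: identical sentences should be pushed towards high dot products,
# mismatched sentences towards low ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
print(ct_loss(z, z, np.ones(4)))   # same-sentence pairs: loss driven low
print(ct_loss(z, -z, np.ones(4)))  # dissimilar embeddings: loss stays high
```

In training, gradients from this loss would be backpropagated into both models independently; after training, either model (the paper uses one of the two) serves as the sentence encoder.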
Finally, by introducing an alteration to the supervised regression task of S-BERT, we are able to improve upon the supervised STS embedding results for all tested models. In summary, the main contributions of our paper are as follows:

1. A novel self-supervised approach for learning sentence embeddings from pre-trained language models.
2. Analytical results of the layer-wise STS performance for commonly used language models.
3. An improvement to the supervised regression task of S-BERT that yields higher performance for all tested models.

Code and models are available at Github.com/FreddeFrallan/Contrastive-Tension

2. RELATED WORK

Grill et al. (2020) introduced the self-supervised BYOL framework, which attains useful image representations comparable with previous supervised methods. Although their method also utilizes two untied dual networks, the main training objective and the underlying motivation for it differ greatly. Whereas BYOL trains using solely positive samples generated via data augmentation, our method relies heavily on negative examples and uses two networks in order to stabilize the training process. To the best of our knowledge, our work is the first that suggests learning sentence representations by removing the bias imposed by the pre-training objective.



Self-supervised methods typically rely on the assumption that sentences sharing similar adjacent sentences have similar meaning. Utilizing this assumption, Kiros et al. (2015) introduced Skip-Thoughts, which trains an encoder-decoder to reconstruct the surrounding sentences from an encoded passage. Logeswaran & Lee (2018) proposed QuickThoughts, which instead frames the training objective as a sentence context classification task. Recently, and still under peer review, Giorgi et al. (2020) proposed DeCLUTR, which uses a setup similar to QuickThoughts but allows positive sentences to be overlapping or subsuming (one being a subsequence of the other), which further improves results.

Supervised methods utilize labeled datasets to introduce a semantic learning signal. As the amount of explicitly labeled STS data is very limited, supervised methods often rely on various proxy tasks where more labeled data is available. Conneau et al. (2017) introduced InferSent, which learns sentence embeddings via a siamese BiLSTM trained on NLI data. The Universal Sentence Encoder (USE)

