SEMANTIC RE-TUNING WITH CONTRASTIVE TENSION

Abstract

Extracting semantically useful natural language sentence representations from pre-trained deep neural networks such as Transformers remains a challenge. We first demonstrate that pre-training objectives impose a significant task bias onto the final layers of models, via a layer-wise survey of Semantic Textual Similarity (STS) correlations for multiple common Transformer language models. We then propose a new self-supervised method called Contrastive Tension (CT) to counter such biases. CT frames the training objective as a noise-contrastive task between the final-layer representations of two independent models, in turn making these representations suitable for feature extraction. Results on multiple common unsupervised and supervised STS tasks indicate that CT outperforms the previous State Of The Art (SOTA), and that combining CT with supervised data improves upon previous SOTA results by large margins.

1. INTRODUCTION

Representation learning concerns the pursuit of automatically learning representations of data that are useful for subsequent extraction of information (Bengio et al., 2013). Recent work has predominantly focused on training and extracting such representations from various deep neural architectures. However, since these deep models are mostly trained via error minimization of an objective function applied to the final layers (Rumelhart et al., 1988), features residing in layers close to the objective function will be task-specific (Yosinski et al., 2014). Therefore, to reduce a representation's bias towards the objective function, it is common to discard one or several of the final layers, or alternatively to consider features of intermediate layers, as with AutoEncoders (Rumelhart et al., 1986). One domain where this issue is particularly striking is the learning of semantic sentence embeddings with deep Transformer networks (Vaswani et al., 2017) pre-trained towards some language modeling task. Although utilizing pre-trained Transformer models such as BERT, XLNet, ELECTRA and GPT-2 (Devlin et al., 2019; Yang et al., 2019; Clark et al., 2020; Brown et al., 2020) has become the dominant approach within the field of Natural Language Processing (NLP), with current State Of The Art (SOTA) results in essentially all NLP tasks belonging to fine-tuned versions of such models, it has been shown that simply extracting features from the layers of such models does not produce competitive sentence embeddings (Reimers & Gurevych, 2019; Liu et al., 2019a). Our interpretation of this phenomenon, which we demonstrate in this paper, is that the currently used language modeling objectives enforce a task bias at the final layers of the Transformer, and that this bias is not beneficial for learning semantic sentence representations.
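To make the feature-extraction setting concrete, the following is a minimal sketch of pooling a fixed-size sentence embedding from one layer of a Transformer and scoring a sentence pair STS-style with cosine similarity. The token vectors here are random placeholders standing in for the hidden states of some chosen layer, and the helper names `mean_pool` and `cosine_sim` are our own, not part of any specific library.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one layer, ignoring padding positions."""
    mask = attention_mask[:, :, None]           # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)  # sum over real tokens
    counts = mask.sum(axis=1)                    # number of real tokens
    return summed / counts

def cosine_sim(a, b):
    """Cosine similarity between two sentence embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "hidden states" for two sentences (batch=2, seq=4, dim=8);
# in practice these would be the outputs of a chosen Transformer layer.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0],                  # sentence 1: 3 real tokens
                 [1, 1, 1, 1]], dtype=float)    # sentence 2: 4 real tokens

emb = mean_pool(hidden, mask)
score = cosine_sim(emb[0], emb[1])              # STS-style similarity score
```

Evaluating such scores against human similarity judgments, layer by layer, is what the layer-wise STS survey mentioned above measures.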
Reimers & Gurevych (2019) propose to solve this by pooling a fixed-size sentence embedding from the final Transformer layer and fine-tuning towards a Natural Language Inference (NLI) task, an approach that, when applied to Transformers, is known as Sentence-BERT (S-BERT for short). While Hill et al. (2016a) empirically show that fine-tuning language models towards NLI data yields good results on Semantic Textual Similarity (STS), there exists no convincing argument for why NLI should be preferred over other tasks. Hence, it is unclear whether the impressive improvements of S-BERT are mainly to be attributed to the NLI task itself, or whether the fine-tuning merely trains the model to output sentence embeddings, in turn exposing the semantics learned during pre-training. Since NLI requires labeled data, an alternative method that requires no such labels would be highly valuable.
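The label-free alternative alluded to in the abstract can be sketched as follows: two independently parameterized encoders score a sentence pair with the dot product of their embeddings, trained with a binary logistic loss where identical sentences form positives and random pairings form negatives. The encoders below are plain linear maps over placeholder feature vectors purely for illustration; the name `ct_loss` is ours, and the actual CT training setup is specified later in the paper.

```python
import numpy as np

def ct_loss(emb1, emb2, label):
    """Binary logistic loss on the dot product of two sentence embeddings.
    label = 1 for identical sentences, 0 for a random (negative) pair."""
    logit = float(emb1 @ emb2)
    prob = 1.0 / (1.0 + np.exp(-logit))          # sigmoid of the dot product
    eps = 1e-12                                  # guard against log(0)
    return -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))                     # encoder 1 (independent weights)
W2 = rng.normal(size=(8, 4))                     # encoder 2 (independent weights)

x = rng.normal(size=8)                           # placeholder features, sentence A
y = rng.normal(size=8)                           # placeholder features, sentence B

pos = ct_loss(W1.T @ x, W2.T @ x, 1)             # same sentence fed to both models
neg = ct_loss(W1.T @ x, W2.T @ y, 0)             # mismatched (negative) pair
```

Minimizing such a loss pushes the two models' final-layer representations of the same sentence together and those of unrelated sentences apart, without requiring any labels.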

