CONTRASTING DISTINCT STRUCTURED VIEWS TO LEARN SENTENCE EMBEDDINGS

Abstract

We propose a self-supervised method that builds sentence embeddings by combining diverse explicit syntactic structures of a sentence. We assume structure is crucial to building consistent representations, as we expect sentence meaning to be a function of both syntax and semantics. From this perspective, we hypothesize that some linguistic representations may be better suited to a given task or sentence. We therefore propose to jointly learn individual representation functions for different syntactic frameworks. By the same hypothesis, all such functions should encode similar semantic information in different ways and should consequently be complementary for building better sentential semantic embeddings. To assess this hypothesis, we propose an original contrastive multi-view framework that induces an explicit interaction between models during the training phase. We conduct experiments combining various structures, such as dependency, constituency, or sequential schemes, and evaluate our method on standard sentence embedding benchmarks. Our results outperform comparable methods on several tasks.

1. INTRODUCTION

We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures. Such a method aims at improving the ability of models to capture compositional knowledge. In particular, we evaluate the potential of the embeddings to solve downstream tasks. Building generic sentence embeddings remains an open question, and many training methods have been explored: generating the previous and next sentences (Kiros et al., 2015; Hill et al., 2016), discriminating context sentences (Logeswaran & Lee, 2018), or predicting specific relations between pairs of sentences (Conneau et al., 2017; Nie et al., 2019). While all these methods propose efficient training objectives, they all rely on a similar RNN encoder architecture. Model architectures, however, have been subject to extensive work as well (Tai et al., 2015; Zhao et al., 2015; Arora et al., 2017; Lin et al., 2017), and in supervised frameworks, many encoder structures outperform standard RNN networks.

We hypothesize that structure is a crucial element for capturing compositional knowledge. In particular, the heterogeneity of performance across models and tasks leads us to assume that some structures may be better adapted to a given example or task. Combining diverse structures should therefore be more robust for tasks requiring complex word composition to derive sentence meaning. Hence, we aim to evaluate the potential benefit of interactions between pairs of encoders. In particular, we propose a training method in which distinct encoders are learned jointly. We conjecture that this association might improve the generalization power of our embeddings and propose an experimental setup to corroborate this hypothesis. We take inspiration from multi-view learning, which has been successfully applied in a variety of domains. In such a framework, the model learns representations by aligning separate observations of the same object.
Traditionally, views are issued from complementary natural perceptions of the data: for example, a picture and a sound recording of a dog. However, the notion can be extended to any pair of samples that share similar semantic content, such as translations of the same sentence into two different languages, and further to synthetic views derived from the same unimodal data. In our case, we derive multiple views from a single sentence by pairing it with distinct syntactic frameworks. Figure 2 illustrates two views derived from the same input sentence by applying, respectively, a constituency and a dependency parser. As proposed in image processing (Tian et al., 2019; Bachman et al., 2019), we align the different views using a contrastive learning framework. Indeed, contrastive learning is broadly used in NLP (Mikolov et al., 2013b;a; Logeswaran & Lee, 2018). We propose to enhance the sentence embedding framework of Logeswaran & Lee (2018) with a multi-view paradigm. As detailed in Section 2, composing multiple views has demonstrated its effectiveness in many NLP applications. However, as far as we are aware, combining distinct structured models to build standalone embeddings has not yet been explored. This paradigm nevertheless benefits from several structural advantages. As already mentioned, it pairs nicely with contrastive learning and can thus be trained in a self-supervised manner that does not require data annotation. Moreover, contrary to the models presented in Section 2, our method is not specific to a certain kind of encoder architecture: it does not require, for example, attention layers or tree-structured models. More generally, it could be extended to any notion of view, even in domains other than language processing, and our setup could therefore accommodate any encoding function.
Finally, our training method induces an interaction between models during inference and, crucially, during the training phase. Our paper is organized as follows: we detail our contrastive multi-view framework in Section 3. In Section 4, we evaluate our framework on standard benchmarks and propose a qualitative analysis of our embeddings.

2. RELATED WORK

Multi-view learning is effectively used in a broad variety of domains. In image processing, some methods learn representations by filling in the missing part of an image or by solving jigsaw puzzles. For video, Tian et al. (2019) propose to build image tuples using video frames and flow. For audio, van den Oord et al. (2018) maximize the mutual information between embeddings of the signal at different time steps. Regarding NLP, combining different structural views has already proven successful. Kong & Zhou (2011) provide a heuristic to combine dependency and constituency analysis for coreference resolution. Zhou et al. (2016) and Ahmed et al. (2019) combine Tree LSTM and standard sequential LSTM with a cross-attention method and observe improvements on a semantic textual similarity task. Chen et al. (2017a) combine CNN and Tree LSTM using attention methods on a sentiment classification task, and the combination outperforms both Tree LSTM and CNN separately. Finally, Chen et al. (2017b) combine sequential LSTM and Tree LSTM for natural language inference tasks. However, to our knowledge, combining distinct structured models in a contrastive learning setup has not been attempted to build sentence embeddings.

3. METHOD

Given a sentence s, the model aims at discriminating sentences s+ in the neighborhood of s from sentences s- outside of this neighborhood; this is contrastive learning (Section 3.1). The representation of each sentence is obtained using multiple views (Section 3.2).
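The discrimination objective above can be sketched as a softmax classification over candidate sentences. The following is a minimal illustrative sketch, not the paper's exact formulation: the dot-product similarity and all variable names are assumptions made for the example.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives):
    """Softmax classification: identify the positive s+ among candidates.

    anchor, positive: 1-D embedding vectors; negatives: list of 1-D vectors.
    Similarity is a plain dot product (an assumption for illustration).
    """
    candidates = np.stack([positive] + list(negatives))  # (1 + K, d)
    scores = candidates @ anchor                         # (1 + K,)
    scores -= scores.max()                               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])                             # positive sits at index 0

rng = np.random.default_rng(0)
s = rng.normal(size=8)
# Easy case: the positive equals the anchor, negatives are random.
loss_easy = contrastive_loss(s, s, [rng.normal(size=8) for _ in range(5)])
# Hard case: the positive is dissimilar while negatives match the anchor.
loss_hard = contrastive_loss(s, -s, [s, s])
```

Minimizing this loss pulls the anchor toward its neighborhood sentence while pushing it away from the negatives, which is the behavior the section describes.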

3.1. CONTRASTIVE LEARNING

Contrastive learning has been successfully applied in a variety of domains, including audio (van den Oord et al., 2018), image (Wu et al., 2018; Tian et al., 2019), video, and natural language processing, both for word embedding (Mikolov et al., 2013b) and sentence embedding (Logeswaran & Lee, 2018). Some mathematical foundations are detailed in Saunshi et al. (2019). The idea is to build a dataset such that each sample x is paired with another sample x+ that is somehow close. For word or sentence embeddings, close samples are the words or sentences appearing in a given textual context. For image processing, close samples might be two different parts of the same image. Systems are trained to bring close samples together while dispersing negative examples. In particular, such a sentence embedding framework is proposed by Logeswaran & Lee (2018). The method takes inspiration from the distributional hypothesis successfully applied for words, but this time applied at the sentence level.
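In the multi-view variant, the anchor and the candidate sentences can be encoded by two distinct structured encoders, and the contrastive objective is computed across views. The toy sketch below uses random linear maps as stand-ins for the dependency- and constituency-based encoders; all names and the linear encoders are illustrative assumptions, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_in, dim_out = 16, 8

# Two distinct "view" encoders (stand-ins for, e.g., a dependency-based
# and a constituency-based encoder); here simply random linear maps.
W_dep = rng.normal(size=(dim_out, dim_in))
W_con = rng.normal(size=(dim_out, dim_in))

def encode(W, x):
    return W @ x

def multiview_contrastive_loss(x, x_context, x_negatives):
    """Encode the anchor with one view and the candidates with the other
    view, then classify the true context sentence among the candidates."""
    anchor = encode(W_dep, x)
    cands = np.stack([encode(W_con, c) for c in [x_context] + list(x_negatives)])
    scores = cands @ anchor
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])              # context sentence sits at index 0

x = rng.normal(size=dim_in)
ctx = x + 0.01 * rng.normal(size=dim_in)   # a "context" sentence close to x
negs = [rng.normal(size=dim_in) for _ in range(5)]
loss = multiview_contrastive_loss(x, ctx, negs)
```

Because the gradient flows through both encoders, each view is trained to place the same sentence near its context as seen by the other view, which is the explicit interaction between models described in the introduction.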
