CONTRASTING DISTINCT STRUCTURED VIEWS TO LEARN SENTENCE EMBEDDINGS

Abstract

We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures of a sentence. We assume structure is crucial to building consistent representations, as we expect sentence meaning to be a function of both syntactic and semantic aspects. From this perspective, we hypothesize that some linguistic representations might be better adapted to a given task or sentence. We therefore propose to jointly learn individual representation functions for different syntactic frameworks. Again by hypothesis, all such functions should encode similar semantic information differently and, consequently, be complementary for building better sentential semantic embeddings. To assess this hypothesis, we propose an original contrastive multi-view framework that induces explicit interaction between models during the training phase. We run experiments combining various structures, such as dependency, constituency, or sequential schemes, and evaluate our method on standard sentence embedding benchmarks. Our results outperform comparable methods on several tasks.

1. INTRODUCTION

We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures. Such a method aims at improving the ability of models to capture compositional knowledge. In particular, we evaluate the embeddings' potential to solve downstream tasks.

Building generic sentence embeddings remains an open question. Many training methods have been explored: generating the previous and next sentences (Kiros et al., 2015; Hill et al., 2016), discriminating context sentences (Logeswaran & Lee, 2018), or predicting specific relations between pairs of sentences (Conneau et al., 2017; Nie et al., 2019). While all these methods propose efficient training objectives, they all rely on a similar RNN encoder architecture. Nonetheless, model architectures have been the subject of extensive work as well (Tai et al., 2015; Zhao et al., 2015; Arora et al., 2017; Lin et al., 2017), and in supervised frameworks, many encoder structures outperform standard RNN networks.

We hypothesize that structure is a crucial element for capturing compositional knowledge. In particular, the heterogeneity of performance across models and tasks leads us to assume that some structures may be better adapted to a given example or task. Therefore, combining diverse structures should be more robust for tasks requiring complex word composition to derive their meaning. Hence, we aim here to evaluate the potential benefit of interactions between pairs of encoders. In particular, we propose a training method in which distinct encoders are learned jointly. We conjecture this association might improve our embeddings' power of generalization, and we propose an experimental setup to corroborate this hypothesis.

We take inspiration from multi-view learning, which has been successfully applied in a variety of domains. In such a framework, the model learns representations by aligning separate observations of the same object.
Traditionally, views arise from complementary natural perceptions of the data: for example, a picture and a sound recording of a dog. However, the notion can be extended to any pair of samples that share similar semantic content, such as translations of the same sentence into two different languages, and further to synthetic views derived from the same unimodal data. In our case, we derive multiple views from a single sentence by pairing it with distinct structured representations of that sentence.
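To make the multi-view alignment idea concrete, the sketch below shows a standard contrastive (InfoNCE-style) objective between the outputs of two encoders applied to the same batch of sentences, e.g. a dependency-based and a sequential encoder. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the temperature value, and the choice of a plain InfoNCE loss are assumptions for exposition.

```python
import numpy as np

def multiview_contrastive_loss(z_a, z_b, temperature=0.1):
    """Align two views of the same batch of sentences.

    Row i of z_a and z_b are embeddings of the SAME sentence
    produced by two distinct structured encoders (the positive
    pair); every other row in the batch serves as a negative.
    Hypothetical sketch of a generic InfoNCE objective.
    """
    # L2-normalize so dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (z_a @ z_b.T) / temperature  # shape: (batch, batch)

    # Softmax cross-entropy with the diagonal as the positive class.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Usage: perfectly aligned views yield a lower loss than unrelated ones.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
aligned_loss = multiview_contrastive_loss(view_a, view_a)
random_loss = multiview_contrastive_loss(view_a, rng.normal(size=(8, 16)))
```

Minimizing this loss pushes the two encoders to agree on the semantic content of each sentence while remaining free to encode it through different structural inductive biases, which is the interaction during training that the framework relies on.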

