LEARNING LOCALITY AND ISOTROPY IN DIALOGUE MODELING

Abstract

Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of the Transformer and large-scale pre-trained language models. However, recent studies have revealed that the context representations produced by these methods suffer from the problem of anisotropy. In this paper, we find that the generated representations are also not conversational: they lose the conversational structure information during the context modeling stage. To this end, we identify two properties of dialogue modeling, i.e., locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms current state-of-the-art models on three open-domain dialogue tasks with eight benchmarks. More in-depth analyses further confirm the effectiveness of our proposed approach. We release the code at https://github.com/hahahawu/SimDRC.

1. INTRODUCTION

Dialogue modeling (Serban et al., 2016; Mehri et al., 2019; Liu et al., 2021) aims to encode the raw text of an input dialogue into contextual representations. Although Transformer-based dialogue modeling methods (Hosseini-Asl et al., 2020; Liu et al., 2021) have achieved great success on various dialogue tasks, some impediments of these methods remain under-explored. Specifically, recent studies (Ethayarajh, 2019; Su et al., 2022) have revealed that on dialogue generation tasks, the representations produced by existing dialogue modeling methods are anisotropic, i.e., the features occupy a narrow cone in the vector space, which leads to the problem of degeneration. To alleviate this problem, previous solutions such as SimCTG (Su et al., 2021; 2022) encourage the model to learn isotropic token embeddings by pushing apart the representations of distinct tokens. While building a more discriminative and isotropic feature space, these methods still neglect dialogue-specific features, such as inter-speaker correlations and conversational structure information, during the dialogue modeling stage. Therefore, a question naturally arises: are the representations produced by existing dialogue modeling methods really conversational? To answer this question, Figure 1 (a) shows the cosine similarity matrix of token representations produced by BART (Lewis et al., 2020) well trained on the response generation task. First, we can easily observe the phenomenon of anisotropy from the heatmap: the similarities of distinct tokens are relatively high, over 0.5 for most token pairs. Figure 1 (b) then illustrates the similarity heatmap of token representations produced by SimCTG, where the color fades overall, suggesting that the problem of anisotropy is relaxed.
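The anisotropy diagnosis above can be illustrated with a small sketch: when token vectors share a large common component (the "narrow cone"), their pairwise cosine similarities are uniformly high. The toy vectors below are hypothetical stand-ins, not real BART hidden states.

```python
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between token representations."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return [[cos(u, v) for v in vectors] for u in vectors]

# Toy "anisotropic" token vectors: all share a dominant first component,
# so even distinct tokens have off-diagonal similarities well above 0.5.
toy = [
    [5.0, 1.0, 0.0],
    [5.0, 0.0, 1.0],
    [5.0, -1.0, 0.5],
]
sims = cosine_similarity_matrix(toy)
```

In an isotropic space the off-diagonal entries would instead spread toward zero, which is what the faded heatmap of SimCTG reflects.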
However, another critical problem remains: the representations of tokens in different utterances are close to each other, making utterances indistinguishable at the level of token representations. It is undesirable that no conversational features can be captured from the token similarity matrix, even though the matrix is produced by a "dialogue modeling" method trained on a dialogue task with dialogue data. Ideally, we expect the representations of tokens within an utterance to be close, voicing a concentrated idea of that utterance, while the representations of different utterances should be discriminative and isotropic, conveying the maximal information of the dialogue. Accordingly, the ideal similarity matrix of token representations should resemble Figure 1 (c), where tokens within an utterance are compact and different utterances are easily distinguishable by their representations. Our motivation is that, when speaking, humans pay more attention to the central idea of an utterance than to how the utterance is organized word by word, and humans also prefer to express more information with fewer utterances (Woolnough et al., 2021). Based on the above observation and motivation, we identify two properties, i.e., locality and isotropy, in dialogue modeling, and then present SimDRC, a simple dialogue representation calibration method that encourages the model to aggregate the representations of tokens within an utterance and to push apart the representations of distinct utterances. We evaluate our approach on three open-domain dialogue tasks: multi-turn dialogue response generation (Li et al., 2017), conversational response retrieval (Lowe et al., 2015), and conversational semantic role labeling (Xu et al., 2021). The experimental results show that our approach achieves comparable or better performance against the current state-of-the-art methods across the three dialogue tasks on both automatic and human evaluations.
In-depth analyses of the effects of the hyper-parameters and of the measurements of locality and isotropy further verify the effectiveness of our proposed approach.
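The two properties can be made concrete with a minimal sketch of a calibration objective: a locality term pulling each token toward its own utterance's mean representation, and an isotropy term pushing the means of distinct utterances apart. The hinge form and margin here are our illustrative assumptions, not the exact SimDRC loss.

```python
import math

def mean_vec(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def calibration_loss(utterances, margin=0.5):
    """utterances: list of utterances, each a list of token vectors.
    Locality: minimize 1 - cos(token, its utterance mean).
    Isotropy: hinge penalty when distinct utterance means are too similar."""
    means = [mean_vec(u) for u in utterances]
    loc_terms = [1.0 - cos(tok, means[i])
                 for i, u in enumerate(utterances) for tok in u]
    locality = sum(loc_terms) / len(loc_terms)
    iso_terms = [max(0.0, cos(means[i], means[j]) - margin)
                 for i in range(len(means)) for j in range(len(means)) if i != j]
    isotropy = sum(iso_terms) / len(iso_terms) if iso_terms else 0.0
    return locality + isotropy
```

A space like Figure 1 (c), with compact utterances pointing in distinct directions, drives both terms toward zero; a collapsed space where all utterances point the same way is penalized by the isotropy term.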

2. RELATED WORK

2.1. DIALOGUE MODELING

Dialogue modeling transforms the raw text of a dialogue into machine-readable representations, an indispensable step for most dialogue tasks (Li et al., 2017; Liu et al., 2021). To achieve this goal, conventional approaches (Serban et al., 2016; 2017; Xing et al., 2018) based on recurrent neural networks (RNNs) (Hochreiter & Schmidhuber, 1997; Mikolov et al., 2010) prefer to learn the representations hierarchically, owing to the long-distance dependency problem of RNNs. With the remarkable success of the Transformer (Vaswani et al., 2017) and pre-trained language models (Devlin et al., 2019; Raffel et al., 2020) on various NLP tasks, Transformer-based dialogue modeling methods (Hosseini-Asl et al., 2020; Gu et al., 2020; Liu et al., 2021; Wu et al., 2021) are now widely used and significantly outperform the traditional methods on many dialogue tasks, such as response generation (Li et al., 2017) and response retrieval (Lowe et al., 2015). In this work, we concentrate on the standard Transformer-based dialogue modeling method, which directly encodes the flattened dialogue context with a pre-trained language model. By studying the geometry of the model's representation space, we find that the contextual representations produced by existing dialogue modeling methods are neither isotropic nor conversational.
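As a concrete illustration of the "flattened dialogue context" setup, the sketch below concatenates utterances into one token sequence and records an utterance id per token, which is exactly the kind of segmentation an utterance-level objective needs. The speaker tags and [EOS] marker are illustrative assumptions; real pre-trained models use their own tokenizers and special tokens.

```python
def flatten_dialogue(turns, eos="[EOS]"):
    """Flatten a multi-turn dialogue into one token sequence.

    turns: list of (speaker, text) pairs.
    Returns (tokens, utt_ids), where utt_ids[k] is the index of the
    utterance that tokens[k] belongs to.
    """
    tokens, utt_ids = [], []
    for i, (speaker, text) in enumerate(turns):
        # Hypothetical per-utterance speaker tag plus an end marker.
        utt_tokens = [f"[{speaker}]"] + text.split() + [eos]
        tokens.extend(utt_tokens)
        utt_ids.extend([i] * len(utt_tokens))
    return tokens, utt_ids
```

The flat `tokens` sequence is what a Transformer encoder would consume, while `utt_ids` marks the utterance boundaries over which locality and isotropy can be measured.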

2.2. REPRESENTATION CALIBRATION

Outside dialogue modeling, many other representation learning approaches also attempt to normalize their feature distributions from different perspectives. A number of studies theoretically verify that



Figure 1: Illustrations of the token cosine similarity matrix produced by BART (a), SimCTG (Su et al., 2022) (b) and our proposed SimDRC (c).

