DIALOGRAPH: INCORPORATING INTERPRETABLE STRATEGY-GRAPH NETWORKS INTO NEGOTIATION DIALOGUES

Abstract

To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DIALOGRAPH, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DIALOGRAPH explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable prediction of next optimal strategies, given the dialogue context. Our graph-based method outperforms prior state-of-the-art negotiation models both in the accuracy of strategy/dialogue act prediction and in the quality of downstream dialogue response generation. We qualitatively show further benefits of learned strategy-graphs in providing explicit associations between effective negotiation strategies over the course of the dialogue, leading to interpretable and strategic dialogues.

1. INTRODUCTION

Negotiation is ubiquitous in human interaction, from e-commerce to the multi-billion dollar sales of companies. Learning how to negotiate effectively involves deep pragmatic understanding and strategic planning of the dialogue (Thompson; Bazerman et al., 2000b; Pruitt, 2013). Modern dialogue systems for collaborative tasks such as restaurant or flight reservations have made considerable progress by modeling the dialogue history and structure explicitly using semantic content, like slot-value pairs (Larionov et al., 2018; Young, 2006), or implicitly with encoder-decoder architectures (Sordoni et al., 2015; Li et al., 2016). In such tasks, users communicate explicit intentions, enabling systems to map utterances into specific intent slots (Li et al., 2020). However, such a mapping is less clear in complex non-collaborative tasks like negotiation (He et al., 2018) and persuasion (Wang et al., 2019), where user intent and the most effective strategies are hidden. Hence, along with the generated dialogue, the strategic choice of framing and the sequence of chosen strategies play a vital role, as depicted in Figure 1. Indeed, prior work on negotiation dialogues has primarily focused on optimizing dialogue strategies: from high-level task-specific strategies (Lewis et al., 2017), to more specific task execution planning (He et al., 2018), to fine-grained planning of linguistic outputs given strategic choices (Zhou et al., 2019). These studies have confirmed that controlling for the pragmatics of the dialogue is crucial for building effective negotiation systems. To model the explicit dialogue structure, prior work incorporated Hidden Markov Models (HMMs) (Zhai & Williams, 2014; Ritter et al., 2010), Finite State Transducers (FSTs) (Zhou et al., 2020) and RNNs (He et al., 2018; Shi et al., 2019). While RNN-based models lack interpretability, HMM- and FST-based approaches may lack expressivity.
In this paper, we hypothesize that Graph Neural Networks (GNNs) (Wu et al., 2020) can combine the benefits of interpretability and expressivity because of their effectiveness in encoding graph-structured data through message propagation. While being sufficiently expressive to model graph structures, GNNs also provide a natural means for interpretation via intermediate states (Xie & Lu, 2019; Pope et al., 2019). We propose DIALOGRAPH, an end-to-end negotiation dialogue system that leverages Graph Attention Networks (GAT) (Veličković et al., 2018) to model complex negotiation strategies while providing interpretability for the model via intermediate structures. DIALOGRAPH incorporates recently proposed hierarchical graph pooling approaches (Ranjan et al., 2020) to learn the associations between negotiation strategies, including conceptual and linguistic strategies and dialogue acts, and their relative importance in predicting the best sequence. We focus on buyer-seller negotiations in which two individuals negotiate the price of an item through a chat interface, and we model the seller's behavior on the CraigslistBargain dataset (He et al., 2018). We demonstrate that DIALOGRAPH outperforms previous state-of-the-art methods on strategy prediction and downstream dialogue responses. This paper makes several contributions. First, we introduce a novel approach to model negotiation strategies and their dependencies as graph structures, via GNNs. Second, we incorporate these learned graphs into an end-to-end negotiation dialogue system and demonstrate that it consistently improves future-strategy prediction and downstream dialogue generation, leading to better negotiation deals (sale prices). Finally, we demonstrate how to interpret intermediate structures and learned sequences of strategies, opening up the black box of end-to-end strategic dialogue systems.

2. DIALOGRAPH

We introduce DIALOGRAPH, a modular end-to-end dialogue system that incorporates GATs with hierarchical pooling to learn pragmatic dialogue strategies jointly with the dialogue history. DIALOGRAPH is based on a hierarchical encoder-decoder model and consists of three main components: (1) a hierarchical dialogue encoder, which learns a representation for each utterance and encodes its local context; (2) a structure encoder for encoding sequences of negotiation strategies and dialogue acts; and (3) an utterance decoder, which finally generates the output utterance. Formally, our dialogue input consists of a sequence of tuples, D = [(u_1, da_1, ST_1), (u_2, da_2, ST_2), ..., (u_n, da_n, ST_n)], where u_i is the utterance, da_i is the coarse dialogue act and ST_i = {st_{i,1}, st_{i,2}, ..., st_{i,k}} is the set of k fine-grained negotiation strategies for utterance u_i. The dialogue context forms the input to (1), and the previous dialogue acts and negotiation strategies form the input to (2). The overall architecture is shown in Figure 2. In what follows, we describe DIALOGRAPH in detail.
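To make the input representation concrete, the tuple structure D can be sketched as a small data class. This is an illustrative sketch only; the field names and the example act/strategy labels are our own choices, not identifiers from the released code.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One element (u_i, da_i, ST_i) of the dialogue input D."""
    utterance: str      # u_i: the raw utterance text
    dialogue_act: str   # da_i: the coarse dialogue act
    strategies: set     # ST_i: fine-grained negotiation strategies

# A toy two-turn dialogue with hypothetical act/strategy labels.
dialogue = [
    Turn("Morning! Looking for a new pair for $10.", "intro",
         {"propose", "informal", "family"}),
    Turn("I can do $12 if you pick it up.", "counter-price",
         {"propose", "trade_in", "hedge"}),
]
print(len(dialogue))  # 2
```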

2.1. HIERARCHICAL DIALOGUE ENCODER

A dialogue context typically comprises multiple dialogue utterances, which are sequential in nature. We use hierarchical encoders for modeling such sequential dialogue contexts (Jiao et al., 2019). To encode the utterance u_t at time t, we use the pooled representation from BERT (Devlin et al., 2019) to obtain the corresponding utterance embedding e_t. We then pass the utterance embeddings through a GRU to obtain the dialogue context encoding up to time t, denoted by h^U_t.
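The dialogue-level recurrence can be sketched in a few lines of NumPy. In this sketch, random vectors stand in for the BERT pooled utterance embeddings e_t, and a hand-written GRU cell stands in for the learned GRU; all dimensions and parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(h, x, p):
    """One GRU step: gates computed from the previous state h and input x."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 6  # toy sizes; real BERT embeddings are 768-dimensional
p = {k: rng.normal(scale=0.1,
                   size=(d_hid, d_emb if k.startswith("W") else d_hid))
     for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}

# Stand-ins for BERT pooled embeddings e_1 .. e_3 of three utterances.
utterance_embs = [rng.normal(size=d_emb) for _ in range(3)]

h = np.zeros(d_hid)
for e_t in utterance_embs:       # dialogue-level recurrence
    h = gru_cell(h, e_t, p)      # h is the context encoding h^U_t
print(h.shape)  # (6,)
```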

2.2. STRUCTURE ENCODER

Our structure encoder is designed to model the graph representations of the strategies and dialogue acts using GATs and output their structural representations. These structural representations are used to predict the next set of strategies and dialogue acts and to enrich the encoded dialogue representation. Below we describe the structure encoder for negotiation strategies. We model the sequence of negotiation strategies, ST = [ST_1, ST_2, ..., ST_t], by creating a directed graph, where ST_i is the set of k fine-grained negotiation strategies for utterance u_i. Formally, we define a graph G(V, E, Z) with |E| edges and N = |V| nodes, where each node v_i ∈ V represents a particular negotiation strategy for an utterance and has a d-dimensional feature representation denoted by z_i. Z ∈ R^{N×d} denotes the feature matrix of the nodes and A ∈ R^{N×N} the adjacency matrix, where N is the total number of nodes (strategies) that have occurred in the conversation up to that point; each node thus represents a strategy-utterance pair. We define the set of edges as E = {(a, b)}; a, b ∈ V, where a and b denote strategies at utterances u_a and u_b, present at turns t_a and t_b, such that t_b > t_a. In other words, we draw a directed edge from a particular node (a strategy in an utterance) to all subsequent nodes, ensuring a direct connection from all previous strategies to the more recent ones. In the same way, we form a graph from the sequence of dialogue acts. These direct edges and the learned edge attention weights help us interpret the dependence and influence of strategies on each other. To obtain the structural representations from the strategy graphs, we pass them through a hierarchical graph pooling based encoder, which consists of l layers of GAT, each followed by an Adaptive Structure Aware Pooling (ASAP) layer (Ranjan et al., 2020).
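The graph construction above can be sketched directly: each utterance contributes one node per strategy, and a node points to every node at a later turn. The strategy labels below are illustrative.

```python
# Turns and their strategy sets (illustrative labels).
ST = [["informal"], ["positive_sentiment"], ["propose", "trade_in"]]

# One node per (turn, strategy) pair.
nodes = [(t, s) for t, strategies in enumerate(ST) for s in strategies]
N = len(nodes)

# Directed edge a -> b whenever a's turn precedes b's turn.
A = [[1 if nodes[i][0] < nodes[j][0] else 0 for j in range(N)]
     for i in range(N)]

print(N)     # 4 nodes
print(A[0])  # [0, 1, 1, 1]: "informal" at turn 0 feeds all later strategies
```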
As part of the ASAP layer, the model first runs GAT over the input graph representations to obtain structurally informed node representations. A cluster assignment step is then performed, generating a cluster assignment matrix S that indicates which nodes occur in similar structural contexts. The clusters are then ranked, and the graph is pooled by taking the top-ranked clusters as new nodes and forming edges between them using the existing graph. In this way, the size of the graph is reduced at every step, leading to a structurally informed graph representation. We take advantage of the cluster formulation to obtain the associations between negotiation strategies, as identified from the cluster assignment matrix S. These association scores can later be used to interpret which strategies are associated with each other and tend to co-occur in similar contexts. Moreover, we also use the node attention scores from the GAT to interpret the influence of different strategies on the representation of a particular strategy, which essentially gives the dependence information between strategies. In this way, the structure representation is learned and accumulated in a manner that preserves the structural information (Ying et al., 2018; Lee et al., 2019). After each pooling step, the graph representation is summarized using the concatenation of the mean and max of the node representations. The summaries are then added and passed through fully connected layers to obtain the final structural representation of the strategies, h^ST_t. We employ a similar structure encoder for the graph obtained from the sequence of dialogue acts, yielding h^da_t.
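A toy NumPy sketch of the pooling and readout steps follows. Note the real ASAP layer learns the soft cluster assignment matrix S via attention; here S is random and row-normalized purely to show the shapes of the pooled features and the mean‖max summary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 6, 4, 2                 # nodes, feature dim, clusters kept after pooling
Z = rng.normal(size=(N, d))       # node features (e.g., after a GAT layer)

S = rng.random(size=(N, k))
S = S / S.sum(axis=1, keepdims=True)  # soft cluster assignment (rows sum to 1)

Z_pool = S.T @ Z                  # pooled features, one row per cluster
readout = np.concatenate([Z_pool.mean(axis=0),  # mean summary
                          Z_pool.max(axis=0)])  # max summary
print(readout.shape)  # (8,): concatenation of mean and max over d = 4 dims
```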

2.3. UTTERANCE DECODER

The utterance decoder uses the dialogue context representation and the structural representations of dialogue acts and negotiation strategies to produce the dialogue response (next utterance). We enrich the dialogue representation by concatenating the structural representations before passing it to a standard greedy GRU (Cho et al., 2014) decoder. This architecture follows Zhou et al. (2020), who introduced a dynamic negotiation system that incorporates negotiation strategies and dialogue acts via FSTs; we follow their utterance decoder architecture to enable direct baseline comparison. For the j-th word of utterance u_{t+1}, denoted w^j_{t+1}, we condition on the previous word w^{j-1}_{t+1} to calculate the probability distribution over the vocabulary as p_{w^j_{t+1}} = softmax(GRU(h_t, w^{j-1}_{t+1})), where h_t = [h^U_t; h^ST_t; h^da_t] and [;] denotes the concatenation operator. For encoding the price, we replace all price information in the dataset with placeholders representing the fraction of the listed price. For example, we would replace $35 with <price-0.875> if the original selling price is $40. The decoder generates these placeholders, which are then replaced with the calculated price before generating the utterance.
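The price delexicalization can be sketched as a pair of helper functions. The exact placeholder format is an assumption based on the <price-0.875> example in the text; the function names are our own.

```python
def to_placeholder(price, listed_price):
    """Replace a concrete price with a fraction-of-list-price placeholder."""
    return f"<price-{price / listed_price:g}>"

def from_placeholder(token, listed_price):
    """Recover the concrete price a generated placeholder refers to."""
    frac = float(token.strip("<>").split("-", 1)[1])
    return frac * listed_price

print(to_placeholder(35, 40))                  # <price-0.875>
print(from_placeholder("<price-0.875>", 40))   # 35.0
```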

2.4. MODEL TRAINING

We use h^ST_t to predict the next set of strategies ST_{t+1}, a binary vector giving the k-hot representation of the negotiation strategies for the next turn. We compute the probability of the j-th strategy occurring in u_{t+1} as p(st_{t+1,j} | h^ST_t) = σ(h^ST_t)_j, where σ denotes the sigmoid operator, and threshold the probability at 0.5 to obtain the k-hot representation. The loss for the next-strategy prediction task is the weighted negative log-likelihood

L_ST = -Σ_j δ_j log(p(st_{t+1,j})) - Σ_k log(1 - p(st_{t+1,k})),

where the summations over j and k run over the strategies present (st_{t+1,j} = 1) and absent (st_{t+1,k} = 0) in the ground-truth strategy set ST. Here δ_j is the positive weight associated with the particular strategy; we add this weight to the positive examples to trade off precision and recall, setting δ_j = (# of instances not having strategy j) / (# of instances having strategy j). Similarly, we use h^da_t to predict the dialogue act of the next utterance, da_{t+1}. Given the target dialogue act da_{t+1} and the class weights ρ_da for the dialogue acts, we use the class-weighted cross-entropy loss over the set of possible dialogue acts, L_DA = -ρ_da log(softmax(h^da_t)). We pass h_t = [h^U_t; h^ST_t; h^da_t] through a linear layer to predict the negotiation success, measured by the sale-to-list ratio r = (sale price - buyer target price) / (listed price - buyer target price) (Zhou et al., 2019). We split the ratios into 5 negotiation classes of equal size using the training data and use those to predict the success of the negotiation. Therefore, given the predicted probabilities for the target utterance u_{t+1} from §2.3, the target ratio class y_r, and the learnable parameters W_r and b_r, we use cross-entropy losses for the generation task (L_NLG) and the negotiation outcome prediction task (L_R):

L_NLG = -Σ_{w_j ∈ u_{t+1}} log(p_{w^j_{t+1}}) and L_R = -Σ_{r ∈ [1,5]} y_r log(softmax(W_r h_t + b_r)).
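Two of the quantities above are simple enough to verify numerically: the weighted strategy loss L_ST and the sale-to-list ratio r. A minimal sketch (function names are ours; the loss takes per-strategy probabilities p, a k-hot target y, and the positive weights δ):

```python
import math

def strategy_loss(p, y, delta):
    """Weighted NLL L_ST: positive examples (y_j = 1) are scaled by delta_j."""
    loss = 0.0
    for p_j, y_j, d_j in zip(p, y, delta):
        loss += -d_j * math.log(p_j) if y_j else -math.log(1.0 - p_j)
    return loss

def sale_to_list_ratio(sale, listed, buyer_target):
    """r from Section 2.4; higher is better for the seller."""
    return (sale - buyer_target) / (listed - buyer_target)

# A $35 sale on a $40 listing against a $30 buyer target sits halfway.
print(round(sale_to_list_ratio(35, 40, 30), 2))  # 0.5
```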
The L_R loss optimizes the encoding of negotiation strategies to enable accurate prediction of the negotiation outcome. We use hyperparameters α, β and γ to jointly optimize the losses for strategy prediction, dialogue act prediction, utterance generation and outcome prediction, using the Adam optimizer (Kingma & Ba, 2014): L_joint = L_NLG + α L_ST + β L_DA + γ L_R.

3. EXPERIMENTAL SETUP

Dataset:

We use the CraigslistBargain dataset (He et al., 2018) to evaluate our model. The dataset was created using Amazon Mechanical Turk (AMT) in a negotiation setting where two workers were assigned the roles of buyer and seller, respectively, and were tasked with negotiating the price of an item on sale. The buyer was additionally given a target price. Both parties were encouraged to reach an agreement while each worker tried to get a better deal. We remove all conversations with fewer than 5 turns. Dataset statistics are listed in Table 11 in the Appendix. From the dataset, we extract the coarse dialogue acts described by He et al. (2018): a list of 10 utterance dialogue acts, e.g., inform, agree, counter-price. We augment this list with 4 outcome dialogue acts, namely offer, accept, reject and quit, which correspond to the actions taken by the users. Negotiation strategies are extracted from the data following Zhou et al. (2019). These include 21 fine-grained strategies grounded in prior economics/behavioral science research on negotiation (Pruitt, 2013; Bazerman & Neale, 1993; Bazerman et al., 2000a; Fisher et al., 2011; Lax & Sebenius, 2006; Bazerman et al., 2000b), e.g., negotiate side offers, build rapport, show dominance. All dialogue acts and strategies are listed in Appendices A and B. Baselines: DIALOGRAPH refers to our proposed method. To corroborate the efficacy of DIALOGRAPH, we compare it against our implementation of the present state-of-the-art model for the negotiation task: the FST-enhanced hierarchical encoder-decoder model (FeHED) (Zhou et al., 2020), which utilizes FSTs for encoding sequences of strategies and dialogue acts. We also conduct an ablation study and evaluate variants of DIALOGRAPH with different ways of encoding negotiation strategies, namely HED, HED+RNN, and HED+Transformer.
HED completely ignores the strategy and dialogue act information, whereas HED+RNN and HED+Transformer encode it using an RNN and a Transformer (Vaswani et al., 2017), respectively. While HED+RNN is based on the dialogue manager of He et al. (2018), HED+Transformer has not been proposed earlier for this task. For a fair comparison, we use a pre-trained BERT (Devlin et al., 2019) model as the utterance encoder (§2.1) and a common utterance decoder (§2.3) in all the models, and only vary the structure encoders as described above. The strategies and dialogue acts in the RNN- and Transformer-based encoders are fed as sequences of k-hot vectors. Evaluation Metrics: For the next-strategy and next-dialogue-act prediction tasks, we report F1 and ROC AUC scores for all models. For these metrics, the macro scores tell us how well the model performs on less frequent strategies/dialogue acts, while the micro scores tell us how well the model performs overall, taking label imbalance into account. Strategy prediction is a multi-label prediction problem, since each utterance can have multiple strategies. For the downstream task of utterance generation, we compare the models using BLEU score (Papineni et al., 2002) and BERTScore (Zhang et al., 2020). Finally, we also evaluate on the downstream task of predicting the outcome of the negotiation, using the ratio class prediction accuracy (RC-Acc; 1 out of 5 negotiation outcome classes, as described in §2.4). Predicting the sale outcome provides better interpretability of the progression of a sale and, potentially, control to intervene when the predicted outcome is bad. Additionally, being able to predict the sale outcome with high accuracy shows that the model encodes the sequence of negotiation strategies well.
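The macro/micro distinction for the multi-label strategy prediction task can be made concrete with a from-scratch F1 computation (equivalent in spirit to scikit-learn's `average="macro"` and `average="micro"`; this small implementation is ours):

```python
def multilabel_f1(y_true, y_pred):
    """Macro and micro F1 over k-hot label vectors (one vector per utterance)."""
    k = len(y_true[0])
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    for yt, yp in zip(y_true, y_pred):
        for j in range(k):
            tp[j] += yt[j] & yp[j]
            fp[j] += (1 - yt[j]) & yp[j]
            fn[j] += yt[j] & (1 - yp[j])
    def f1(t, p, n):
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(k)) / k  # per-label avg
    micro = f1(sum(tp), sum(fp), sum(fn))                       # pooled counts
    return macro, micro

macro, micro = multilabel_f1([[1, 0], [1, 1]], [[1, 0], [0, 1]])
print(round(macro, 3), round(micro, 3))  # 0.833 0.8
```

A label that is rare contributes to macro F1 with the same weight as a frequent one, which is why macro scores expose performance on infrequent strategies.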

4. RESULTS

We evaluate (1) strategy and dialogue act prediction (intrinsic evaluation), and (2) dialogue generation and negotiation outcome prediction (downstream evaluation). For all metrics, we perform bootstrapped statistical tests (Berg-Kirkpatrick et al., 2012; Koehn, 2004) and bold the best results for each metric in all tables (several results are bolded when their differences are statistically insignificant).

Strategy and Dialogue Act Prediction:

We compare DIALOGRAPH's effectiveness in encoding the explicit sequence of strategies and dialogue acts against the baselines, using the metrics described in §3. DIALOGRAPH performs comparably to HED+Transformer on strategy prediction macro scores and outperforms it on the other metrics. Moreover, both significantly outperform the FST-based method, the prior state-of-the-art. We hypothesize that the lower gains for dialogue acts are due to the limited structural dependencies between them. Conversely, we validate that for negotiation strategies, RNNs are significantly worse than DIALOGRAPH. The higher macro scores also show that DIALOGRAPH and Transformers are able to capture sequences containing the less frequent strategies/dialogue acts. These results support our hypothesis on the importance of encoding the structure in a more expressive model. Moreover, DIALOGRAPH provides interpretable structures, which the other baselines do not. We discuss these findings in §5.

Automatic Evaluation on Downstream Tasks:

In this section, we analyze the impact of DIALOGRAPH on the downstream negotiation dialogue tasks using the automatic evaluation metrics described in §3. In Table 2, we show that DIALOGRAPH helps improve dialogue response generation. Even though DIALOGRAPH attains higher BLEU scores, we note that single-reference BLEU assumes only one possible response, while dialogue systems can have multiple valid responses to the same utterance. BERTScore alleviates this problem by scoring semantically similar responses equally high (Zhang et al., 2020). We also find that both HED+Transformer and DIALOGRAPH have comparable performance on negotiation outcome prediction, which is significantly better than the previously published baselines (FeHED and HED+RNN). Higher performance on this metric demonstrates that our model encodes the strategy sequence better and consequently predicts the negotiation outcome more accurately. Additionally, the ablation results in Table 3 show that both strategy and dialogue act information help DIALOGRAPH improve the dialogue response. The difference in BERTScore F1 between Tables 2 and 3 arises from the different metrics chosen for early stopping; more details are in Appendix D. Although both HED+Transformer and DIALOGRAPH are based on attention mechanisms, DIALOGRAPH has the added advantage of structural attention, which helps encode the pragmatic structure of negotiation dialogues and in turn provides an interpretable interface. The components of our graph-based encoder, such as the GAT and ASAP layers, provide strategy influence and cluster association information that is useful for understanding and controlling negotiation systems, as described in more detail in §5. Though Transformers have self-attention, their architecture does not model the structure/dependence between strategies and thus provides only limited insight.
Further, our results show that DIALOGRAPH maintains or improves performance over strong models like HED+Transformer while offering much more transparent interpretability. We later show that DIALOGRAPH performs significantly better than HED+Transformer in human evaluation. Human Evaluation: Since automatic metrics give only a partial view of the system, we complement our evaluation with a detailed human evaluation. We set up DIALOGRAPH and the baselines on Amazon Mechanical Turk (AMT) and asked workers to role-play the buyer and negotiate with a single bot. After their chat, we asked them to fill out a survey rating the dialogue on how persuasive ("My task partner was persuasive."), coherent ("My task partner's responses were on topic and in accordance with the conversation history."), natural ("My task partner was humanlike.") and understandable ("My task partner perfectly understood what I was typing.") the bot was. Prior research on entrainment has shown that humans tend to get better as they chat (Mizukami et al., 2016; Beňuš et al., 2011), so we restrict each user to chat with just one of the bots. We further prune conversations that were incomplete, potentially due to dropped connections. Finally, we manually inspect the conversations extracted from AMT to determine the agreed sale price and remove conversations that did not attempt to negotiate at all. The results of the human evaluation of the resulting 90 dialogues (about 20 per model) are presented in Table 4. We find that the baselines are more likely to accept unfair offers and apply inappropriate strategies. Additionally, the DIALOGRAPH bot attained a significantly higher sale price ratio, the outcome of negotiation, showing that effectively modeling strategy sequences leads to more effective negotiation systems. Our model also had a higher average number of turns and words-per-turn (for the bots only) compared to all baselines, signifying engagement.
It was also more persuasive and coherent while being more understandable to the user. From qualitative inspection, we observe that the HED model generates utterances that are shorter and less coherent: responses like "Yes it is" sound natural but are generic and contextually irrelevant. We hypothesize that this is because the HED model is not optimized to encode the sequence of negotiation strategies and dialogue acts, and we believe this explains HED's high naturalness score. From manual inspection, we see that HED is not able to produce very persuasive responses. We provide an example dialogue in Appendix F. Although the HED+Transformer model performs well, DIALOGRAPH achieves a better sale price outcome as it repeatedly offers deals to negotiate the price. HED, in contrast, fails to understand user responses well and tends to repeat itself. Both the FeHED and HED baselines tend to agree with the buyer's proposal more readily, whereas HED+Transformer and DIALOGRAPH provide counter-offers and trade-ins to persuade the user.

5. INTERPRETING LEARNED STRATEGY GRAPHS

Table 4: Human evaluation ratings on a scale of 1-5 for various models. We also provide the average sale price ratio (§2.4). A negative ratio means that the average sale price was lower than the buyer's target.

Table 5: Examples of strategies and their least/most associated strategies, based on association scores extracted using the cluster attention scores given by the ASAP layer.

We visualize the intermediate attention scores generated by the GATs while obtaining the strategy node representations. These attention scores tell us which strategies influenced the representation of a particular strategy and can be used to observe the dependence between strategies (cf. Xie & Lu, 2019). We show an example in Figure 3, where for brevity we present a subset of turns and only the top few most relevant edges. For visualization, we re-scale the attention values for all incoming edges of a node (strategy) using min-max normalization; since the range of raw attention values differs with the number of edges, this normalizes any difference in scale and lets us visualize the relative ranking of strategies (Yi et al., 2005; Chen & Liu, 2004). We notice that as soon as the first propose happens at u_5, the strategies change completely and become independent of the strategies before the propose point. From Figure 3, we see that the edge weight from u_4 to u_6 is 0.01, signifying very low influence. We noticed this trend in other examples as well: the influence of strategies before the first propose turn on strategies after it is very low. A similar phenomenon was also observed by Zhou et al. (2019), who study conversations by splitting them into two parts at the first propose turn. Another interesting observation is that the trade-in and propose strategies at u_5 seem to be heavily influenced by informal from u_3. Similarly, the informal of u_5 was influenced by positive sentiment from u_4.
This indicates that the seller was influenced by previous informal interactions to propose and trade-in at this turn, and that sellers tend to be more informal if the conversation partner is positive. In other examples, we see that at a particular utterance, different strategies depend on separate past strategies, and the attention maps usually show the strategy switch as soon as the first propose happens, similar to what has been observed in prior work. These examples demonstrate that DIALOGRAPH can model fine-grained strategies, learn dependence beyond just utterances and give interpretable representations, which previous baselines, including the FSTs, lack. Specifically, each state of the FST is explicitly represented by an action distribution, which can only be used to see the sequence of strategies, not the association or dependence information that DIALOGRAPH provides. We utilize the cluster attention scores from the ASAP pooling layer to observe the association between strategies, which helps us identify strategies with similar contextual behavior and structural co-occurrence. We take the average normalized value of the cluster attention scores between two strategies as the association score between them. In Table 5, we show some examples of strategies and their association scores. We observe that negative sentiment tends to be most associated with propose; we hypothesize that this is because people who disagree more tend to get better deals. People do not tend to associate negative sentiment with trade-in, which is in fact highly associated with positive sentiment, perhaps because people want to remain positive while offering something. Similarly, people tend to give vaguer proposals by hedging, for instance, "I could go lower if you can pick it up", than when suggesting a trade-in.
Concern also seems to be least associated with certainty, and most with politeness-based strategies. Thus, we observe that our model is able to provide meaningful insights which corroborate prior observations, justifying its ability to learn strategy associations well.
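Two of the post-hoc computations described in this section can be sketched briefly. The min-max rescaling of incoming-edge attention follows the text directly; the association score, by contrast, is only loosely specified, so the S·Sᵀ formulation below (strategies assigned to the same soft clusters get a high pairwise score) is our illustrative assumption, not the paper's exact computation.

```python
import numpy as np

def minmax(weights):
    """Rescale a node's incoming-edge attention values to [0, 1] for plotting."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [1.0] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]

# Hypothetical association scores from a soft cluster assignment matrix S:
# rows are strategies, columns are clusters (random here, learned in ASAP).
rng = np.random.default_rng(0)
S = rng.random((4, 2))
S = S / S.sum(axis=1, keepdims=True)
assoc = S @ S.T  # assoc[i, j]: how strongly strategies i and j co-cluster

print(minmax([0.01, 0.2, 0.5])[0])  # 0.0 (the weakest edge maps to zero)
print(assoc.shape)                  # (4, 4)
```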

6. RELATED WORK

Dialogue Systems: Goal-oriented dialogue systems have a long history in the NLP community. Broadly, goal-oriented dialogue can be categorized into collaborative and non-collaborative systems. The aim of agents in a collaborative setting is to achieve a common goal, such as travel and flight reservation (Wei et al., 2018) and information-seeking (Reddy et al., 2019). Recent years have seen a rise in non-collaborative goal-oriented dialogue systems such as persuasion (Wang et al., 2019; Dutt et al., 2020; 2021), negotiation (He et al., 2018; Lewis et al., 2017) and strategy games (Asher et al., 2016), due to the challenging yet interesting nature of the task. Prior work has also focused on decision-making games such as Settlers of Catan (Cuayáhuitl et al., 2015), which mainly involve decision-making skills rather than communication. Lewis et al. (2017) developed the DealOrNoDeal dataset, in which agents had to reach a deal to split a set of items. Extensive work has been done on capturing the explicit semantic history in dialogue systems (Kumar et al., 2020; Vinyals & Le, 2015; Zhang et al., 2018). Recent work has shown the advantage of modeling the dialogue history in the form of belief spans (Lei et al., 2018) and state graphs (Bowden et al., 2017). He et al. (2018) proposed a bargaining scenario that can leverage semantic and strategic history. Zhou et al. (2020) used FSTs learned without supervision to capture the dialogue structure; this approach, although effective in explicitly incorporating pragmatic strategies, does not leverage the expressive power of neural networks. Our model, in contrast, combines the interpretability of graph-based approaches with the expressiveness of neural networks, improving both the performance and the interpretability of negotiation agents.
Graph Neural Networks: The effectiveness of GNNs (Bruna et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2017) has been corroborated in several NLP applications (Vashishth et al., 2019) , including semantic role labeling (Marcheggiani & Titov, 2017) , machine translation (Bastings et al., 2017) , relation extraction (Vashishth et al., 2018) , and knowledge graph embeddings (Schlichtkrull et al., 2018; Vashishth et al., 2020) . Hierarchical graph pooling based structure encoders have been successful in encoding graphical structures (Zhang et al., 2019) . We leverage the advances in GNNs and propose to use a graph-based explicit structure encoder to model negotiation strategies. Unlike HMM and FST based encoders, GNN-based encoders can be trained by optimizing the downstream loss and have superior expressive capabilities. Moreover, they provide better interpretability of the model as they can be interpreted based on observed explicit sequences (Tu et al., 2020; Norcliffe-Brown et al., 2018) . In dialogue systems, graphs have been used to guide dialogue policy and response selection. However, they have been used to encode external knowledge (Tuan et al., 2019; Zhou et al., 2018) or speaker information (Ghosal et al., 2019) , rather than compose dialogue strategies on-the-fly. Other works (Tang et al., 2019; Qin et al., 2020) focused on keyword prediction using RNN-based graphs. Our work is the first to incorporate GATs with hierarchical pooling, learning pragmatic dialogue strategies jointly with the end-to-end dialogue system. Unlike in prior work, our model leverages hybrid end-to-end and modularized architectures (Liang et al., 2020; Parvaneh et al., 2019) and can be plugged as explicit sequence encoder into other models.

7. CONCLUSION

We present DIALOGRAPH, a novel modular negotiation dialogue system that models pragmatic negotiation strategies using Graph Attention Networks with hierarchical pooling and learns an explicit strategy graph jointly with the dialogue history. DIALOGRAPH outperforms strong baselines in downstream dialogue generation, while providing the capability to interpret and analyze the intermediate graph structures and the interactions between different strategies contextualized in the dialogue. As future work, we would like to extend our work to discover successful (e.g., good for the seller) and unsuccessful strategy sequences using our interpretable graph structures.

A DIALOGUE ACTS

Here we provide details about the dialogue acts that we use to annotate the utterances. Ten are taken from He et al. (2018) and four are based on the actions taken by the users. The rule-based acts are extracted using the code provided by He et al. (2018). The details are in Table 6.

B NEGOTIATION STRATEGIES

Here we provide details about the 15 negotiation strategies of Zhou et al. (2019) and the 21 negotiation strategies of Zhou et al. (2020) in Tables 7 and 8. The 21 strategies are used to operationalize the 15 strategies via a rule-based system (https://github.com/zhouyiheng11/augmenting-non-collabrative-dialog/).



Footnotes (consolidated from the main text):
- Code, data, and a demo system are released at https://github.com/rishabhjoshi/DialoGraph_ICLR21
- We focus on the seller's side following Zhou et al. (2019), who devised a set of strategies specific to maximizing the seller's success. Our proposed methodology, however, is general.
- For example, in the utterance "Morning! My bro destroyed my old kit and I'm looking for a new pair for $10", the coarse dialogue act is Introduction, and the finer-grained negotiation strategies include Proposing price, Being informal, and Talking about family for building rapport.
- Appendix C shows an example of the graph obtained from a sequence of strategies.
- https://github.com/stanfordnlp/cocoa/tree/master/craigslistbargain
- We replace the utterance encoder with BERT for a fair comparison. This slightly improved the performance of the FeHED model compared to the results published in Zhou et al. (2020).
- We use the setup of https://github.com/stanfordnlp/cocoa/. Screenshots are in Appendix H.
- https://github.com/zhouyiheng11/augmenting-non-collabrative-dialog/
- The frequency statistics on the train set (5,383 conversations) are given. A detailed description of the rules used by prior work to extract these strategies is out of scope for this work; however, we intend to release the code and extracted strategies, along with the rule-based mapping to the 15 strategies, upon acceptance of this work.



Figure 1: Both options are equally plausible and fluent, but a response with effective pragmatic strategies leads to a better deal.

Figure 2: Overview of DIALOGRAPH. At time t, utterance u_t is encoded using BERT and then passed to the Dialogue Context Encoder to generate the dialogue representation. This representation is enriched with encodings of the explicit strategy and dialogue-act sequences produced by the structure encoders; the enriched representation is then used to condition the utterance decoder. Please refer to §2 for details.

Figure 3: Visualization of the learnt latent strategy sequences in DIALOGRAPH, where bolder edges represent higher influence. Here we present only a few edges for brevity, and visualize min-max normalized attention values as edge weights to analyze the relative ranking of strategies. For example, for family at u_7, informal of u_5 has the most influence, followed by propose. We present the full attention map for this example in Figure 5 in the Appendix.
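The min-max normalization applied to the edge weights in Figure 3 can be sketched as follows (an illustrative helper, not the paper's code): raw attention values are rescaled into [0, 1] per node so that only the relative ranking of incoming influences is shown.

```python
def min_max_normalize(weights):
    """Rescale raw attention values into [0, 1] so that only the
    relative ranking of a node's incoming edges is visualized."""
    lo, hi = min(weights), max(weights)
    if hi == lo:  # all weights equal: no ranking information to show
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]
```

After this rescaling, the strongest incoming edge always maps to 1.0 and the weakest to 0.0, which is why edge boldness in the figure reflects rank rather than absolute attention mass.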

Figure 5: Visualization of the attention map learned by DIALOGRAPH for the example depicted in Figure 3 in the main paper. We only show it for a few turns for brevity. Here the axis labels represent the turn and the strategy. Refer to the Figure 3 in the main paper for description.

Figure 7: Screenshot of the chat window for the human evaluation interface.

Figure 8: Screenshot of the survey for the human evaluation interface.

Table 1: Performance of next-strategy and dialogue-act prediction for various models. We report F1 and ROC AUC scores. Significance tests were performed as described in §4, and the best results (along with all statistically insignificant values) are bolded. Table 1 shows that DIALOGRAPH performs on par with the Transformer-based encoder.

Table 2: Downstream evaluation of negotiation dialogue generation and negotiation outcome prediction. The best results (along with all values statistically insignificant to those) are bolded.

Table 3: DIALOGRAPH ablation analysis, showing that all the different components provide complementary benefits. We also evaluate without BERT for comparison with previously published works.

Table 6: The list of dialogue acts that we use to annotate the data.

Table 7: The details of the 15 negotiation strategies proposed by Zhou et al. (2019).

Table 8: The details of the 21 negotiation strategies (<start> added by us) used by Zhou et al. (2020).

ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers for their invaluable feedback, and to Alissa Ostapenko, Shruti Rijhwani, Ritam Dutt, and members of the Tsvetshop at CMU for their helpful feedback on this work. The authors would also like to thank Yiheng Zhou for helping with negotiation strategy extraction and FeHED model. This material is based upon work supported by the National Science Foundation under Grant No. IIS2007960 and by the Google faculty research award. We would also like to thank Amazon for providing GPU credits.

C STRATEGY-GRAPH VISUALIZATION

A visualization of a strategy-sequence graph; refer to §2.2 for more details. We also provide additional details regarding the number of nodes and edges in our strategy graphs in Table 9. Here we present only a few edges for brevity; for example, there would be two additional edges from u_4 to the strategies of u_5.
Published as a conference paper at ICLR 2021
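The turn-to-turn connectivity described above can be sketched in a few lines (a hypothetical helper under the assumption that every strategy node of turn t connects to every strategy node of turn t + 1, as in the u_4-to-u_5 example; the exact edge construction in DIALOGRAPH is defined in §2.2).

```python
def build_strategy_graph(turns):
    """Build (nodes, edges) from per-turn strategy lists.

    turns : list of per-turn strategy lists,
            e.g. [["informal", "propose"], ["family"], ...]
    Nodes are (turn_index, strategy) pairs; a directed edge connects
    every strategy of turn t to every strategy of turn t + 1.
    """
    nodes = [(t, s) for t, strategies in enumerate(turns) for s in strategies]
    edges = [
        ((t, s), (t + 1, s_next))
        for t in range(len(turns) - 1)
        for s in turns[t]
        for s_next in turns[t + 1]
    ]
    return nodes, edges
```

Under this scheme, a turn with two strategies followed by a turn with one strategy contributes two edges, matching the "two more additional edges from u_4" remark above.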

D HYPERPARAMETERS

We present the hyperparameters for all experiments, their corresponding search spaces, and their final values in Table 10; additional details of our experiments are given below. We use most of the hyperparameters from Zhou et al. (2020). Each training run took at most 3 hours on a single Nvidia GeForce GTX 1080Ti GPU, and all models were saved based on strategy Macro-F1 performance. For the experiments in Tables 1 and 2, we saved the best models based on strategy Macro-F1 (HED being saved on outcome class prediction), because we wanted to prioritize and optimize our final model to capture sequence-structural information, owing to our focus on interpretability. In the ablation studies of Table 3, not all models have structure encoders; hence, for a fair comparison, we chose a metric independent of the individual modules for all ablated models. We use the negotiation outcome class prediction (RC-Acc) scores, as this optimizes the dialogue for a good negotiation outcome, which indirectly helps the model capture the sequence of strategies.

E NEGOTIATION DATASET STATISTICS

In Table 11 we provide the CraigslistBargain dataset statistics, along with data sizes after filtering out conversations with fewer than 5 turns. The maximum and average numbers of turns in a conversation are 47 and 9.2, respectively; the maximum and average numbers of strategies in an utterance are 13 and 3, respectively.

F EXAMPLE CONVERSATIONS

Table 12: Examples of the dialogues generated by various models when the buyer utterances are kept the same. DIALOGRAPH gets the best deal for the same dialogue context and is more persistent, whereas the FeHED and HED models accept offers more readily. We provide more examples of DIALOGRAPH in Table 13.

