LEARNING REASONING PATHS OVER SEMANTIC GRAPHS FOR VIDEO-GROUNDED DIALOGUES

Abstract

Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers information flows among dialogue turns through a semantic graph constructed from the lexical components of each question and answer. The model then learns to predict reasoning paths over this semantic graph. Our path prediction model outputs a path from the current turn through past dialogue turns that contain additional visual cues needed to answer the current question. Our reasoning model sequentially processes both visual and textual information along this reasoning path, and the propagated features are used to generate the answer. Our experimental results demonstrate the effectiveness of our method and provide additional insights into how models use semantic dependencies in dialogue context to retrieve visual cues.

1. INTRODUCTION

Traditional visual question answering (Antol et al., 2015; Jang et al., 2017) involves answering questions about a given image. Extending this line of research, Das et al. (2017) and Alamri et al. (2019) recently added another level of complexity by positioning each question-answer pair in a multi-turn, conversational setting (see Figure 1 for an example). This line of research has promising applications for improving virtual intelligent assistants in multi-modal scenarios (e.g. assistants for people with visual impairment). Most state-of-the-art approaches in this line of research (Kang et al., 2019; Schwartz et al., 2019b; Le et al., 2019) tackle the additional complexity of the multi-turn setting by learning to process dialogue context sequentially, turn by turn. Despite their success, these approaches often fail to exploit dependencies between dialogue turns of long distance, e.g. the 2nd and 5th turns in Figure 1. In long dialogues, this shortcoming becomes more pronounced and necessitates an approach for learning long-distance dependencies between dialogue turns.

To reason over dialogue context with long-distance dependencies, recent research in dialogues discovers graph-based structures at the turn level to predict the speaker's emotion (Ghosal et al., 2019) or to generate sequential questions semi-autoregressively (Chai & Wan, 2020). Zheng et al. (2019) incorporate graph neural models to connect the textual cues between all pairs of dialogue turns. These methods, however, involve a fixed graphical structure over dialogue turns, in which only a small number of nodes contain lexical overlap with the question of the current turn, e.g. the 1st, 3rd, and 5th turns in Figure 1. These methods also fail to account for the temporality of dialogue turns, as the graph structures do not guarantee a sequential ordering among turns. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC).
The PDC model learns a reasoning path that traverses dialogue turns to propagate contextual cues that are densely related to the semantics of the current question. Our approach balances sequential and graphical processing to exploit dialogue information. Our work is related to the long-studied research domain of discourse structures, e.g. (Barzilay & Lapata, 2008; Feng & Hirst, 2011; Tan et al., 2016; Habernal & Gurevych, 2017). One form of discourse structure is argument structure, comprising premises and claims and the relations between them. Argument structures have been studied to assess various characteristics of text, such as coherence, persuasiveness, and susceptibility to attack. However, most efforts are designed for discourse study in monologues, and much less attention has been directed toward conversational data. In this work, we investigate a form of discourse structure through semantic graphs built upon the overlap of component representations among dialogue turns. We further enhance the models with a reasoning path learning model that learns the best information path for generating the next utterance. To learn a reasoning path, we incorporate bridge entities into our method, a concept often seen in reading comprehension research and earlier used in entity-based discourse analysis (Barzilay & Lapata, 2008). In reading comprehension problems, bridge entities denote entities that are common between two knowledge bases, e.g. Wikipedia paragraphs in HotpotQA (Yang et al., 2018b). In discourse analysis, entities and their locations in text are used to learn linguistic patterns that indicate certain qualities of a document. In our method, we first reconstruct each dialogue turn (including question and answer) into a set of component sub-nodes (e.g. entities, action phrases) using common syntactic dependency parsers. Each resulting dialogue turn contains sub-nodes that can be used as bridge entities.
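To illustrate the decomposition step, the sketch below splits a turn into candidate sub-nodes. Note this is a simplified stand-in: the paper relies on syntactic dependency parsers to extract entities and action phrases, whereas this toy version merely keeps maximal runs of content words; the function name and stopword list are our own illustrative choices.

```python
# Toy decomposition of a dialogue turn into component sub-nodes.
# Real implementations would use a dependency parser; here a sub-node
# is approximated as a maximal run of non-stopword tokens.

STOPWORDS = {"is", "it", "just", "in", "the", "a", "an", "he", "she",
             "what", "there", "any", "no", "yes", "can", "you", "tell",
             "if", "on", "his", "her", "and", "of", "to", "at", "while"}

def extract_subnodes(turn: str) -> list[str]:
    """Split a turn into candidate sub-nodes (content-word phrases)."""
    tokens = [t.strip("?.,!").lower() for t in turn.split()]
    subnodes, current = [], []
    for tok in tokens:
        if tok and tok not in STOPWORDS:
            current.append(tok)
        elif current:
            subnodes.append(" ".join(current))
            current = []
    if current:
        subnodes.append(" ".join(current))
    return subnodes

print(extract_subnodes("what is he carrying in his hand ?"))
# -> ['carrying', 'hand']
```

Because both the question and the answer of a turn are decomposed this way, the resulting sub-nodes can act as bridge entities linking turns that mention the same or similar phrases.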
Our reasoning path learning approach consists of two phases: (1) at each dialogue turn, a graph network is constructed at the turn level, where any two turns are connected if they share an overlapping sub-node or if two of their sub-nodes are semantically similar; (2) a path generator is trained to predict a path from the current dialogue turn to past dialogue turns that provide additional, relevant cues for answering the current question. The predicted path is used as a skeleton layout to propagate visual features through each step of the path. Specifically, in PDC, we adopt non-parameterized measures (e.g. cosine similarity) to construct the edges of the graph network, and each sub-node is represented by pre-trained word embedding vectors. Our path generator is a transformer decoder that autoregressively generates the next turn index conditioned on the previously generated turn sequence. Our reasoning model combines a vanilla graph convolutional network (Kipf & Welling, 2017) with a transformer encoder (Vaswani et al., 2017). At each traversal step, we retrieve visual features conditioned on the corresponding dialogue turn and propagate them to the next step. Finally, the propagated multimodal features are fed to a transformer decoder to predict the answer. Our experimental results show that our method improves results on the Audio-Visual Scene-Aware Dialogues (AVSD) generation settings (Alamri et al., 2019), outperforming previous state-of-the-art methods. We evaluate our approach through comprehensive ablation analysis and a qualitative study. The PDC model also provides additional insights into how the inherent contextual cues in dialogue context are learned by neural networks in the form of a reasoning path.
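The two phases above can be sketched as follows. The hand-made three-dimensional embeddings, the 0.9 threshold, and the greedy backward walk are all illustrative placeholders: the paper uses pre-trained word embeddings for sub-nodes and a learned transformer decoder as the path generator, not the simple heuristics shown here.

```python
import math

# Phase 1: turn-level semantic graph. Two turns are connected if any
# pair of their sub-node embeddings is similar under cosine similarity.
EMB = {
    "person": [1.0, 0.1, 0.0],
    "man":    [0.95, 0.15, 0.05],  # deliberately close to "person"
    "phone":  [0.0, 1.0, 0.1],
    "video":  [0.1, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build_edges(turn_subnodes, threshold=0.9):
    """Connect turns i < j if any of their sub-node pairs is similar."""
    edges = set()
    for i in range(len(turn_subnodes)):
        for j in range(i + 1, len(turn_subnodes)):
            for a in turn_subnodes[i]:
                for b in turn_subnodes[j]:
                    if a in EMB and b in EMB and \
                            cosine(EMB[a], EMB[b]) >= threshold:
                        edges.add((i, j))
    return edges

# Phase 2: walk backwards from the current turn through connected past
# turns (a greedy stand-in for the learned transformer path generator).
def greedy_path(current, edges):
    path, turn = [current], current
    for past in range(current - 1, -1, -1):
        if (past, turn) in edges:
            path.append(past)
            turn = past
    return path

turns = [["person"], ["phone"], ["video"], ["man", "phone"]]
edges = build_edges(turns)
print(sorted(edges))            # (0, 3) via person~man, (1, 3) via phone
print(greedy_path(3, edges))    # path from current turn 3 back to turn 1
```

Along the predicted path, the full model would retrieve visual features conditioned on each visited turn and propagate them step by step before decoding the answer.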

2. RELATED WORK

Discourses in monologues. Related to our work is research on discourse structures. A long-studied line of research in this domain focuses on argument mining to identify the structure of arguments, their claims and premises, and the relations between them (Feng & Hirst, 2011; Stab & Gurevych, 2014; Peldszus & Stede, 2015; Persing & Ng, 2016; Habernal & Gurevych, 2017). More recently,



[Figure 1 dialogue example:]
Q: Is it just one person in the video? A: There is one visible person, yes.
Q: What is he carrying in his hand? A: He is looking down at his cellphone and laughing while walking forward in a living room.
Q: Is there any noise in the video? A: No, there is no noise in the video.
Q: Can you tell if he's watching a video on his phone? A: I can't tell what he's watching. He walks into a table from not paying attention.
Q: Does he just walk back and forth in the video? A: He walks towards the back of the living room, and walks right into the table.

Figure 1: Sequential reasoning approaches fail to detect long-distance dependencies between the current turn and the 2nd turn. In graph-based reasoning approaches, signals from all turns are directly forwarded to the current turn, but the 1st and 3rd turns contain little lexical overlap with the current question.

