VILNMN: A NEURAL MODULE NETWORK APPROACH TO VIDEO-GROUNDED LANGUAGE TASKS

Abstract

Neural module networks (NMNs) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs remain largely unexplored in video-grounded language tasks, which extend the complexity of traditional visual tasks with additional temporal variance in the visual input. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Visio-Linguistic Neural Module Network (VilNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VilNMN first decomposes the language components to explicitly resolve entity references and detect the corresponding actions in the question. The detected entities and actions are then used as parameters to instantiate neural modules and extract visual cues from the video. Our experiments show that VilNMN achieves promising performance on two video-grounded language tasks: video QA and video-grounded dialogue.

1. INTRODUCTION

Vision-language tasks have been studied to build intelligent systems that can perceive information from multiple modalities, such as images, videos, and text. Extending image-grounded tasks, e.g. (Antol et al., 2015), Jang et al. (2017); Lei et al. (2018) recently proposed to use video as the grounding modality. This modification poses a significant challenge to previous image-based models due to the additional temporal variance across video frames. Alamri et al. (2019) further extend video-grounded language research into the dialogue domain. In the proposed task, video-grounded dialogue, a dialogue agent is required to answer questions about a video over multiple dialogue turns. Using Figure 1 as an example, to answer questions correctly, a dialogue agent has to resolve references in the dialogue context, e.g. "he" and "it", and identify the original entities, e.g. "a boy" and "a backpack". The agent also needs to identify the actions of these entities, e.g. "carrying a backpack", to retrieve information along the temporal dimension of the video. Current state-of-the-art approaches to video-grounded language tasks, e.g. (Le et al., 2019b; Fan et al., 2019), have achieved remarkable performance by using deep neural networks to retrieve grounding video signals from language inputs. However, these approaches typically assume the reasoning structure, including resolving entity references and detecting the corresponding actions to retrieve visual cues, is implicitly learned. An explicit reasoning structure becomes more beneficial as the task grows more complex in two scenarios: video with complex spatial and temporal dynamics, and language inputs with sophisticated semantic dependencies, e.g. questions positioned in a dialogue context. In these cases, it becomes challenging to interpret model outputs, assess model reasoning capability, and identify errors in neural network models.
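To make the reference-resolution step above concrete, the following is a minimal, purely illustrative sketch (not the paper's model): pronouns in a question are mapped to the most recent compatible entity mentioned in the dialogue history. The entity records, agreement categories, and resolution heuristic are all hypothetical stand-ins for learned components.

```python
# Hypothetical pronoun categories used for agreement matching.
PRONOUNS = {"he": "male", "she": "female", "it": "thing", "they": "plural"}

# Toy entity records from earlier dialogue turns: (surface form, category).
history_entities = [
    ("a boy", "male"),
    ("a backpack", "thing"),
]

def resolve(token, entities):
    """Return the most recent entity compatible with a pronoun, else the token."""
    category = PRONOUNS.get(token.lower())
    if category is None:
        return token  # not a pronoun; leave unchanged
    for surface, cat in reversed(entities):  # most recent antecedent first
        if cat == category:
            return surface
    return token

question = "what does he do with it ?"
resolved = " ".join(resolve(t, history_entities) for t in question.split())
print(resolved)  # "what does a boy do with a backpack ?"
```

A learned model would replace the dictionary lookup with contextual representations, but the interface is the same: a question with pronouns in, a question with resolved entities out.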
Similar challenges have been observed in image-grounded tasks, in which deep neural networks often exhibit shallow understanding as they simply exploit superficial visual cues (Agrawal et al., 2016; Goyal et al., 2017; Feng et al., 2018; Serrano & Smith, 2019). Andreas et al. (2016b) propose neural module networks (NMNs), which decompose a question into a sequence of sub-tasks, called a program, and assemble a corresponding network of neural operations. Motivated by this line of research, we propose an NMN approach to video-grounded language tasks. Our approach benefits from integrating neural networks with a compositional reasoning structure to exploit low-level information signals in video; an example of this reasoning structure is shown on the right side of Figure 1. We propose the Visio-Linguistic Neural Module Network (VilNMN) for video-grounded language tasks. VilNMN leverages entity-based dialogue representations as inputs to neural operations over spatial-level and temporal-level visual features. Previous approaches exploit question-level and token-level representations to extract question-dependent information from video (Jang et al., 2017; Fan et al., 2019; Le et al., 2019b). In complex videos with many entities or actions, these approaches may fail to locate the right features. To exploit object-level features, VilNMN is trained to first identify the relevant entities, and then to extract the relevant temporal steps using the detected actions of these entities. VilNMN is also trained to resolve any co-references in the language inputs, e.g. questions in a dialogue context, to identify the original entities. Previous approaches to video-grounded dialogue often obtain global question representations in relation to the dialogue context. These may be suitable for representing general semantics in open-domain or chit-chat dialogues (Serban et al., 2016; Li et al., 2016), but they are not ideal for detecting fine-grained entity-based information as the dialogue context evolves over time.
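The decompose-then-execute idea can be sketched as follows. This is an illustrative toy, not the paper's implementation: a question is parsed into a "program" of module calls with entity and action parameters, which is then executed over toy per-frame annotations standing in for learned visual features. The module names (`find`, `filter_action`, `describe`) and data structures are assumptions for the sake of the example.

```python
# Toy per-frame annotations standing in for spatial/temporal visual features.
VIDEO = [
    {"frame": 0, "entities": {"boy"}, "actions": {"carry backpack"}},
    {"frame": 1, "entities": {"boy"}, "actions": {"carry backpack", "walk"}},
    {"frame": 2, "entities": {"boy", "dog"}, "actions": {"put down backpack"}},
]

def find(video, entity):
    """Select frames containing the queried entity (spatial grounding)."""
    return [f for f in video if entity in f["entities"]]

def filter_action(frames, action):
    """Keep frames in which the queried action occurs (temporal localization)."""
    return [f for f in frames if action in f["actions"]]

def describe(frames):
    """Read out an answer from the localized frames."""
    return sorted({a for f in frames for a in f["actions"]})

# A "program" decomposed from the question
# "What is the boy doing while carrying a backpack?"
program = [
    (find, "boy"),
    (filter_action, "carry backpack"),
]

state = VIDEO
for module, arg in program:
    state = module(state, arg)

print(describe(state))  # actions co-occurring with "carry backpack"
```

In an NMN, each module is a small neural network and the program is predicted from the question; the composition pattern, however, is exactly this chaining of parameterized operations.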
In summary, we introduce a neural module network approach to video-grounded language tasks through a reasoning pipeline that applies entity and action representations to the spatio-temporal dynamics of video. To cater to complex language inputs, e.g. dialogues, our approach also allows models to resolve entity references and enrich question representations with fine-grained entity information. In our evaluation, we achieve competitive performance on the large-scale Audio-Visual Scene-aware Dialogues (AVSD) benchmark (Alamri et al., 2019). We also adapt VilNMN to video QA and achieve state-of-the-art results on the TGIF-QA benchmark (Jang et al., 2017) across all tasks. Our experiments and ablation analysis indicate a promising direction for developing compositional and interpretable neural models for video-grounded language tasks.

2. RELATED WORK

Video QA has served as a proxy for evaluating a model's joint understanding of language and video, and the task is typically treated as visual information retrieval. Jang et al. (2017); Gao et al. (2018); Jiang et al. (2020) propose to learn attention guided by a global question representation to retrieve spatial-level and temporal-level visual features. Li et al. (2019); Fan et al. (2019); Jiang & Han (2020) model interactions between all pairs of question token-level representations and temporal-level features of the input video. Extending video QA, video-grounded dialogue is an emerging task that combines dialogue response generation with video-language understanding research. Nguyen et al. (2018); Hori et al. (2019); Sanabria et al. (2019); Le et al. (2019a;b) extend traditional QA models by adding neural encoders for dialogue history. Kumar et al. (2019) enhance dialogue features with topic-level representations to express the general topic of each dialogue. Sanabria et al. (2019) treat the task as video summarization, concatenating the question and dialogue history into a single sequence and transferring parameter weights from a large-scale video summarization model. Different from prior work, we dissect the question sequence and explicitly detect and decode any entities and their references. Our models also offer additional insight into how component linguistic inputs are used to extract visual information.

Figure 1: A sample video-grounded dialogue: the inputs are the question, the dialogue history, and the video with its caption, visual, and audio features; the output is the answer to the question. On the right, we show an example of a symbolic reasoning process a dialogue agent can perform to extract textual and visual clues for the answer.

