VILNMN: A NEURAL MODULE NETWORK APPROACH TO VIDEO-GROUNDED LANGUAGE TASKS

Abstract

Neural module networks (NMNs) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs have received very limited attention in video-grounded language tasks. These tasks extend the complexity of traditional visual tasks with additional temporal variance in the visual input. Motivated by recent NMN approaches on image-grounded tasks, we introduce Visio-Linguistic Neural Module Network (VilNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VilNMN first decomposes all language components to explicitly resolve entity references and detect the corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VilNMN can achieve promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.

1. INTRODUCTION

Vision-language tasks have been studied to build intelligent systems that can perceive information from multiple modalities, such as images, videos, and text. Extending image-grounded tasks, e.g. (Antol et al., 2015), Jang et al. (2017); Lei et al. (2018) propose to use video as the grounding features. This modification poses a significant challenge to previous image-based models due to the additional temporal variance through video frames. Recently, Alamri et al. (2019) further extended video-grounded language research into the dialogue domain. In the proposed task, video-grounded dialogues, the dialogue agent is required to answer questions about a video over multiple dialogue turns. Using Figure 1 as an example, to answer questions correctly, a dialogue agent has to resolve references in the dialogue context, e.g. "he" and "it", and identify the original entities, e.g. "a boy" and "a backpack". In addition, the dialogue agent also needs to identify the actions of these entities, e.g. "carrying a backpack", to retrieve information along the temporal dimension of the video. Current state-of-the-art approaches to video-grounded language tasks, e.g. (Le et al., 2019b; Fan et al., 2019), have achieved remarkable performance through the use of deep neural networks to retrieve grounding video signals based on language inputs. However, these approaches often assume the reasoning structure, including resolving entity references and detecting the corresponding actions to retrieve visual cues, is implicitly learned. An explicit reasoning structure becomes more beneficial as the task grows more complex in two scenarios: video with complex spatial and temporal dynamics, and language inputs with sophisticated semantic dependencies, e.g. questions positioned in a dialogue context. In these cases, it becomes challenging to interpret model outputs, assess model reasoning capability, and identify errors in neural network models.
Similar challenges have been observed in image-grounded tasks, in which deep neural networks often exhibit shallow understanding capability as they simply exploit superficial visual cues (Agrawal et al., 2016; Goyal et al., 2017; Feng et al., 2018; Serrano & Smith, 2019). Andreas et al. (2016b) propose neural module networks (NMNs), which decompose a question into sub-sequences called programs and assemble a network of neural operations. Motivated by this line of research, we propose an NMN approach to video-grounded language tasks. Our approach benefits from integrating neural networks with a compositional reasoning structure to exploit low-level information signals in video. An example of the reasoning structure can be seen on the right side of Figure 1.


Figure 1: A sample video-grounded dialogue. Inputs are the question of the current turn, the dialogue history, and the video (caption, visual, and audio input); the output is the answer to the question. On the right side, we demonstrate an example symbolic reasoning process a dialogue agent can perform to extract textual and visual clues for the answer.

We propose Visio-Linguistic Neural Module Network (VilNMN) for video-grounded language tasks. VilNMN leverages entity-based dialogue representations as inputs to neural operations on spatial- and temporal-level visual features. Previous approaches exploit question-level and token-level representations to extract question-dependent information from video (Jang et al., 2017; Fan et al., 2019; Le et al., 2019b). In complex videos with many entities or actions, these approaches might not be optimal for locating the right features. To exploit object-level features, VilNMN is trained to identify relevant entities first, and then to extract the relevant temporal steps using the detected actions of these entities. VilNMN is also trained to resolve any co-references in language inputs, e.g. questions in a dialogue context, to identify the original entities. Previous approaches to video-grounded dialogues often obtain global question representations in relation to the dialogue context. These approaches might be suitable to represent general semantics in open-domain or chit-chat dialogues (Serban et al., 2016; Li et al., 2016), but they are not ideal for detecting fine-grained entity-based information as the dialogue context evolves over time. In summary, we introduce a neural module network approach to video-grounded language tasks through a reasoning pipeline that applies entity and action representations to the spatio-temporal dynamics of video. To cater to complex semantics in language inputs, e.g. dialogues, our approach also allows models to resolve entity references and incorporate fine-grained entity information into question representations.
In our evaluation, we achieve competitive performance on the large-scale benchmark Audio-Visual Scene-aware Dialogues (AVSD) (Alamri et al., 2019). We also adapt VilNMN for video QA and obtain state-of-the-art results on the TGIF-QA benchmark (Jang et al., 2017) across all tasks. Our experiments and ablation analysis indicate a potential direction for developing compositional and interpretable neural models for video-grounded language tasks.

2. RELATED WORK

Video QA has been a proxy for evaluating a model's understanding of language and video, and the task is treated as a visual information retrieval task. Jang et al. (2017); Gao et al. (2018); Jiang et al. (2020) propose to learn attention guided by a global question representation to retrieve spatial-level and temporal-level visual features. Li et al. (2019); Fan et al. (2019); Jiang & Han (2020) model interactions between all pairs of question token-level representations and temporal-level features of the input video. Extended from video QA, video-grounded dialogue is an emerging task that combines dialogue response generation and video-language understanding research. Nguyen et al. (2018); Hori et al. (2019); Sanabria et al. (2019); Le et al. (2019a;b) extend traditional QA models by adding dialogue history neural encoders. Kumar et al. (2019) enhance dialogue features with topic-level representations to express the general topic of each dialogue. Sanabria et al. (2019) treat the task as video summarization, concatenate question and dialogue history into a single sequence, and propose to transfer parameter weights from a large-scale video summarization model. Different from prior work, we dissect the question sequence and explicitly detect and decode entities and their references. Our models also benefit from additional insights into how models learn to use component linguistic inputs for the extraction of visual information. Extending the line of research on neural semantic parsing (Jia & Liang, 2016; Liang et al., 2017), Andreas et al. (2016b;a) introduce NMNs to address visual QA by decomposing questions into linguistic sub-structures, known as programs, to instantiate a network of neural modules. NMN models have achieved significant success in synthetic image domains where a complex reasoning process is required (Johnson et al., 2017b; Hu et al., 2018; Han et al., 2019).
Our work is related to recent work that extends NMN models to real data domains. For instance, Kottur et al. (2018); Jiang & Bansal (2019); Gupta et al. (2020) extend NMNs to visual dialogues and reading comprehension tasks. In this paper, we introduce a new approach that exploits NMNs to learn dependencies between the lexical composition of language inputs and the spatio-temporal dynamics of videos. This is not present in prior NMN models, which are designed to apply to a two-dimensional image input without temporal variance. In a video represented as a sequence of frames, each described by object-level features, applying prior NMN models requires aggregating frame-level features, e.g. through average pooling, resulting in potential loss of information. An alternative is a late-fusion method in which an NMN model executes reasoning programs on sampled video frames only; an object tracking or attention mechanism is then used to fuse the output representations. Instead, we propose to construct a reasoning structure with multi-step interaction between the space-time information in video and the entities and actions detected in text.
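To make the information-loss argument concrete, the following sketch (a toy example with hypothetical shapes, not the authors' code) shows that average-pooling object-level features over the frame axis discards frame order entirely:

```python
import numpy as np

# Hypothetical feature shapes: F frames, O objects per frame, d_vis dims.
F, O, d_vis = 4, 3, 8
rng = np.random.default_rng(0)
Z_obj = rng.normal(size=(F, O, d_vis))  # object-level features per frame

# Aggregating across frames (e.g. average pooling) collapses the
# temporal axis, so frame order and per-frame dynamics are lost.
Z_pooled = Z_obj.mean(axis=0)           # (O, d_vis): no temporal dimension left

# Reversing the frame order changes nothing after pooling,
# illustrating the information loss described in the text.
Z_pooled_rev = Z_obj[::-1].mean(axis=0)
assert np.allclose(Z_pooled, Z_pooled_rev)
```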

3. METHOD

In this section, we present the design of our model, called Visio-Linguistic Neural Module Networks (VilNMN). An overview of the model can be seen in Figure 2. The input to the model consists of a dialogue D grounded on a video V. The input components include the question of the current dialogue turn Q, the dialogue history H, and the features of the input video, including visual and audio input. The output is a dialogue response, denoted as R. Each text input component is a sequence of words w_1, ..., w_m ∈ V_in, the input vocabulary. Similarly, the output response R is a sequence of tokens w_1, ..., w_n ∈ V_out, the output vocabulary. To learn compositional programs, we follow Johnson et al. (2017a); Hu et al. (2017) and consider program generation as a sequence-to-sequence task. Different from prior approaches, our models are trained to fully generate the parameters of component modules in text. This approach is appropriate as reasoning programs in real data domains such as current video-grounded dialogues are usually shorter than those for synthetic data (Johnson et al., 2017a), and thus program generation takes less computational cost. However, module parameters, i.e. entities and actions, contain much higher semantic variance than in synthetic data, and our approach facilitates better transparency and interpretability. We adopt a simple template "param_1 module_1 param_2 module_2 ..." as the target sequence. The resulting target sequences for the dialogue and video understanding programs are denoted P_dial and P_vid respectively.

Text Encoder. Each text sequence is encoded through a token embedding layer φ and a positional encoding PE, normalized by layer normalization Norm (Ba et al., 2016; Vaswani et al., 2017). The embedding and positional representations are combined through element-wise summation. The encoded dialogue history and question of the current turn are defined as H = Norm(φ(H) + PE(H)) ∈ R^{L_H×d} and Q = Norm(φ(Q) + PE(Q)) ∈ R^{L_Q×d}. To decode program and response sequences auto-regressively, a special token "_sos" is prepended as the first token w_0.
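As a rough sketch of the text encoder above (sinusoidal positional encoding and layer normalization stand in for the paper's unspecified PE and Norm choices; shapes and values are illustrative):

```python
import numpy as np

def positional_encoding(L, d):
    # Standard sinusoidal positional encoding (Vaswani et al., 2017).
    pos = np.arange(L)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    # Normalize each token representation over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Hypothetical token embeddings phi(Q) for a question of L_Q tokens.
L_Q, d = 6, 16
rng = np.random.default_rng(0)
phi_Q = rng.normal(size=(L_Q, d))

# Q = Norm(phi(Q) + PE(Q)), as in the text encoder.
Q = layer_norm(phi_Q + positional_encoding(L_Q, d))
assert Q.shape == (L_Q, d)
```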
The decoded token w_1 is then appended to w_0 as input to decode w_2, and so on. Similarly to the input source sequences, at decoding time step j, the input target sequence is encoded to obtain representations for the dialogue understanding program P_dial|_0^{j-1}, the video understanding program P_vid|_0^{j-1}, and the system response R|_0^{j-1}. We combine the vocabularies of input and output sequences and share the embedding matrix E ∈ R^{|V|×d}, where V = V_in ∪ V_out.

Video Encoder. To encode the video input, we use pre-trained models to extract visual and audio features. We denote F as the number of sampled video frames or clips. For object-level visual features, we denote O as the maximum number of objects considered in each frame. The resulting output from a pretrained object detection model is Z_obj ∈ R^{F×O×d_vis}. We concatenate each object representation with its corresponding coordinates projected to d_vis dimensions. We also use a CNN-based pre-trained model to obtain features along the temporal dimension, Z_cnn ∈ R^{F×d_vis}. The audio features are obtained through a pretrained audio model, Z_aud ∈ R^{F×d_aud}. We pass all video features through a linear transformation layer with ReLU activation to the same embedding dimension d.
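The final projection step might look like the following sketch, where the weights and feature tensors are random stand-ins for the pretrained extractor outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
F, O, d_vis, d_aud, d = 5, 4, 12, 10, 8

# Hypothetical pre-extracted features (stand-ins for object-detector,
# 3D-CNN, and audio-model outputs).
Z_obj = rng.normal(size=(F, O, d_vis))
Z_cnn = rng.normal(size=(F, d_vis))
Z_aud = rng.normal(size=(F, d_aud))

def project(Z, W, b):
    # Linear transformation with ReLU to the shared dimension d.
    return np.maximum(Z @ W + b, 0.0)

W_vis, W_aud = rng.normal(size=(d_vis, d)), rng.normal(size=(d_aud, d))
b = np.zeros(d)
V_obj = project(Z_obj, W_vis, b)   # (F, O, d)
V_cnn = project(Z_cnn, W_vis, b)   # (F, d)
V_aud = project(Z_aud, W_aud, b)   # (F, d)
assert V_obj.shape == (F, O, d)
```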

3.2. NEURAL MODULES

We introduce neural modules that are used to assemble an executable program constructed from the sequences generated by the question parsers. We provide an overview of the neural modules in Table 1 and demonstrate the dialogue understanding and video understanding modules in Figures 3 and 4 respectively. Each module parameter, e.g. "a backpack", is extracted from the parsed program. For each parameter, we denote P ∈ R^d as the average pooling of its component token embeddings.

find(P, H) → H_ent. This module handles entity tracing by obtaining a distribution over tokens in the dialogue history. We use an entity-to-dialogue-history attention mechanism applied from an entity P_i to all tokens in the dialogue history. Any neural network that learns to generate attention between two tensors is applicable, e.g. (Bahdanau et al., 2015; Vaswani et al., 2017). The attention matrix normalized by softmax, A_find,i ∈ R^{L_H}, is used to compute the weighted sum of dialogue history token representations. The output is combined with the entity embedding P_i to obtain the contextual entity representation H_ent,i ∈ R^d.

summarize(H_ent, Q) → Q_ctx. Each contextual entity representation H_ent,i, i = 1, ..., N_ent, is projected to L_Q dimensions and combined with the question token embeddings through element-wise summation to obtain an entity-aware question representation Q_ent,i ∈ R^{L_Q×d}. This is fed to a one-dimensional CNN with a max pooling layer (Kim, 2014) to obtain a contextual entity-aware question representation. We denote the final output as Q_ctx ∈ R^{N_ent×d}. While previous models usually focus on global or token-level dependencies (Hori et al., 2019; Le et al., 2019b), this module injects entity-level context into the question representation.

where(P, V) → V_ent. Similar to the find module, this module handles entity-based attention to the video input. However, the entity representation P in this case is parameterized by the original entity in the dialogue rather than in the question (see Section 3.3 for more description).
Each entity P_i is stacked to match the number of sampled video frames/clips F. An attention network is used to obtain the entity-to-object attention matrix A_where,i ∈ R^{F×O}. The attended features are compressed through weighted-sum pooling along the spatial dimension, resulting in V_ent,i ∈ R^{F×d}, i = 1, ..., N_ent.

when(P, V_ent) → V_ent+act. This module follows a similar architecture to the where module. However, the action parameter P_i is stacked to match N_ent dimensions. The attention matrix A_when,i ∈ R^F is then used to compute the visual entity-action representations through a weighted sum along the temporal dimension. We denote the output for all actions P_i as V_ent+act ∈ R^{N_ent×N_act×d}.

describe(P, V_ent+act) → V_ctx. This module is a linear transformation to compute V_ctx = W_desc^T [V_ent+act; P_stack] ∈ R^{N_ent×N_act×d}, where W_desc ∈ R^{2d×d}, P_stack is the representation of the parameter embedding P stacked to N_ent × N_act dimensions, and [;] is the concatenation operation. The exist module is a special case of the describe module where the parameter P is the average-pooled question embedding.

The above where module is applied to object-level features. For temporal-based features such as CNN-based and audio features, the same neural operation is applied along the temporal dimension. Each resulting entity-aware output is then incorporated into the frame-level features through element-wise summation (please refer to Appendix A.1). An advantage of our architecture is that it separates dialogue and video understanding. We adopt a transparent approach to resolve linguistic entity references during the dialogue understanding phase. The resolved entities are fed to the video understanding phase to learn entity-action dynamics in video. We show that our approach is robust as dialogues evolve over many turns and videos extend over time (please see Section 4 and Appendix C).
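A minimal sketch of the where/when pipeline, using plain dot-product attention as a stand-in for the paper's attention networks (all shapes and values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: F frames, O objects, embedding size d.
F, O, d = 5, 4, 8
rng = np.random.default_rng(1)
V = rng.normal(size=(F, O, d))       # object-level video features
P_ent = rng.normal(size=(d,))        # entity parameter embedding
P_act = rng.normal(size=(d,))        # action parameter embedding

# where(P, V): entity-to-object attention, pooled along the spatial axis.
A_where = softmax(V @ P_ent, axis=-1)          # (F, O)
V_ent = (A_where[..., None] * V).sum(axis=1)   # (F, d)

# when(P, V_ent): action-to-frame attention, pooled along the temporal axis.
A_when = softmax(V_ent @ P_act)                # (F,)
V_ent_act = A_when @ V_ent                     # (d,)
assert V_ent.shape == (F, d) and V_ent_act.shape == (d,)
```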

3.3. DECODERS

Question parsers. The parsers decompose questions into sub-sequences to construct compositional reasoning programs for dialogue and video understanding. Each parser is an attention-based Transformer decoder. Given the encoded question Q, to decode the program for dialogue understanding, the contextual signals are integrated through 2 attention layers: one attention on previously generated tokens, and the other on question tokens. To generate programs for video understanding, the contextual signals are learned and incorporated in a similar manner. However, to exploit dialogue contextual cues, the execution output of the dialogue understanding neural modules Q_ctx (see Section 3.2) is incorporated into each vector in P_vid through an additional attention layer. This layer integrates the entity-dependent contextual representations from Q_ctx to explicitly decode the original entities for video understanding programs.

Response Decoder. The system response is decoded by incorporating the dialogue context and video context outputs from the corresponding reasoning programs into the target token representations. We follow a vanilla Transformer decoder architecture (Le et al., 2019b), which consists of 3 attention layers: self-attention on existing tokens, attention to Q_ctx from the dialogue understanding program execution, and attention to V_ctx from the video understanding program execution.

Optimization. We use standard cross-entropy losses for the prediction of dialogue and video understanding programs and output responses. We optimize models by joint training to minimize:

L = α L_dial + β L_vid + L_res = −α Σ_j log(P_dial(P_dial,j)) − β Σ_l log(P_vid(P_vid,l)) − Σ_n log(P_res(R_n))

where P is the probability distribution of an output token. The probability is computed by passing the output representations from the parsers and decoder to a linear layer W ∈ R^{d×|V|} with softmax activation. We share the parameters between W and the embedding matrix E.
The hyper-parameters α ≥ 0 and β ≥ 0 are fine-tuned during training.
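The joint objective can be sketched as follows, with toy uniform distributions standing in for real model outputs:

```python
import numpy as np

def nll(probs, targets):
    # Negative log-likelihood of the target tokens under the predicted
    # per-step probability distributions (rows of `probs`).
    return -np.log(probs[np.arange(len(targets)), targets]).sum()

# Toy distributions over a 4-token vocabulary (illustrative values only).
p_dial = np.full((2, 4), 0.25)   # dialogue-program steps
p_vid  = np.full((3, 4), 0.25)   # video-program steps
p_res  = np.full((5, 4), 0.25)   # response tokens
y_dial, y_vid, y_res = [0, 1], [2, 3, 0], [1, 1, 2, 3, 0]

alpha, beta = 1.0, 1.0           # loss weights, tuned during training
loss = alpha * nll(p_dial, y_dial) + beta * nll(p_vid, y_vid) + nll(p_res, y_res)
# With uniform 0.25 distributions, each of the 10 target tokens
# contributes log(4), so loss == 10 * log(4).
assert np.isclose(loss, 10 * np.log(4))
```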

4. EXPERIMENTS

Datasets. We use the AVSD benchmark from the 7th Dialogue System Technology Challenge (DSTC7) (Hori et al., 2019). In the experiments with AVSD, we consider two settings: one with video summary and one without video summary as input. In the setting with video summary, the summary is concatenated to the dialogue history before the first dialogue turn. We also adapt VilNMN to the video QA benchmark TGIF-QA (Jang et al., 2017). Different from AVSD, TGIF-QA contains a diverse set of tasks which address different visual aspects of video.

Training Procedure. We follow prior approaches (Hu et al., 2017; 2018; Kottur et al., 2018) by obtaining program annotations through a language parser (Hu et al., 2016) and a reference resolution model (Clark & Manning, 2016). During training, we directly use these soft labels of programs and the given ground-truth responses to train the models. The labels are augmented with the label smoothing technique (Szegedy et al., 2016). At inference time, we generate all programs and responses from the given dialogues and videos. We run beam search to enumerate programs for dialogue and video understanding and dialogue responses (please see Appendix B for more details).

AVSD Results. We evaluate model performance with objective metrics based on word overlap, including BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015), between each generated response and 6 reference gold responses. As seen in Table 2, our models outperform most existing approaches. In particular, the performance of our model in the setting without video summary input is comparable to the GPT-based RLM (Li et al., 2020) with a much smaller model size. The Student-Teacher baseline (Hori et al., 2019) specifically focuses on the performance gap between models with and without textual signals from the video summary through a dual network of expert and student models.
Instead, VilNMN reduces this performance gap by efficiently extracting relevant visual/audio information based on fine-grained entity and action signals. We also found that VilNMN applied on object-level features is competitive with the model applied on CNN-based features. The flexibility of VilNMN neural programs can also be seen in the experiment in which the video understanding program is applied on the caption input as a visual feature.

Ablation Analysis. We experiment with several variants of VilNMN (either NMN- or non-NMN-based) in the setting with CNN-based features and video summary input. As can be seen in Table 3, our approach to video and dialogue understanding through compositional reasoning programs exhibits better performance than non-compositional approaches. Compared to the approaches that directly process frame-level features in videos (Row B) or token-level features in dialogues (Rows C, D), our full VilNMN (Row A) considers entity-level and action-level information extraction and thus avoids unnecessary and possibly noisy extraction. Compared to the approaches that obtain dialogue contextual cues through a hierarchical encoding architecture (Rows E, F) such as (Serban et al., 2016; Hori et al., 2019), VilNMN directly addresses the challenge of entity references in dialogues. As mentioned, we hypothesize that the hierarchical encoding architecture is more appropriate for open-domain or chit-chat dialogues that require less fine-grained entity information.

Interpretability. A difference of VilNMN from previous approaches to video-grounded dialogues is model interpretability based on the predicted dialogue and video programs. From Figure 5, we observe that in cases where the predicted dialogue and video programs match or are close to the gold labels, the model can generate generally correct responses. In cases of wrongly predicted responses, we can further inspect how the model understands the questions based on the predicted programs.
In the 3rd turn of example 1, the output response is missing a minor detail compared to the label response because the video program fails to capture the parameter "rooftop". These subtle yet important details can determine whether output responses fully address user queries. Similarly, in example 2, the model answers the question "what room" instead of the question about "an object". For additional qualitative analysis, please see Appendix D.

TGIF-QA Results. In the TGIF-QA experiments, we report results using the L2 loss in the Count task and accuracy in the other tasks. From Table 4, VilNMN outperforms all baseline models in all tasks by a large margin. Compared to the AVSD experiments, the TGIF-QA experiments emphasize the video understanding ability of the models, removing the requirements for dialogue understanding and natural language generation. This is demonstrated by the higher performance gaps between VilNMN with generated programs and with soft-label programs compared to those in the AVSD experiments. We also observe that an attention layer attending to the question is important during the response decoding phase in TGIF-QA, as there is no dialogue context Q_ctx in video QA tasks.

5. CONCLUSION

While conventional neural network approaches have achieved notable successes in video-grounded dialogues and video QA, they often rely on superficial pattern learning between contextual cues from questions/dialogues and videos. In this work, we introduce Visio-Linguistic Neural Module Network (VilNMN). VilNMN consists of dialogue and video understanding neural modules, each of which performs entity- and action-level operations on language and video components. Our comprehensive experiments on the AVSD and TGIF-QA benchmarks show that our models can achieve competitive performance while promoting a compositional and interpretable learning approach.

A ADDITIONAL MODEL DETAILS

A.1 NEURAL MODULES ON TEMPORAL FEATURES

To adapt our neural modules to temporal features, we apply the same neural architectures in all modules except for the where module. On object-level features, this module operates at the object or spatial level. We can apply it to temporal-based features similarly, simply by not stacking the parameter and by pooling the attended features along the temporal dimension. For an entity parameter P_i, the attention matrix in this case is an entity-to-temporal-step matrix A_where,i ∈ R^F, and the resulting pooled feature is V_ent,i ∈ R^d. Before feeding this representation to a when module, we incorporate each V_ent,i into the feature of each temporal step through an MLP layer and element-wise summation, resulting in V_ent,i^stack ∈ R^{F×d}, where F is the number of sampled video frames/clips. An overview of the where module with temporal features can be seen in Figure 6. We adapt this module in a similar manner to other temporal-level features such as audio, and to textual features such as the video caption. We keep the same architecture in the when module. We denote the resulting output from the when module for all actions P_i as V_act ∈ R^{N_act×d}. We concatenate this with the output from the previous where module, V_ent, to obtain V_ent+act ∈ R^{(N_ent+N_act)×d}. This is used as input to the describe or exist module.
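A sketch of the temporal-feature variant of the where module, with an identity mapping standing in for the MLP that projects the pooled entity feature back to each temporal step (shapes and values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical temporal features (e.g. CNN-based or audio), F steps, d dims.
F, d = 6, 8
rng = np.random.default_rng(2)
Z = rng.normal(size=(F, d))
P_ent = rng.normal(size=(d,))

# where on temporal features: entity-to-temporal-step attention, pooled
# along the temporal dimension (no stacking of the parameter).
A_where = softmax(Z @ P_ent)        # (F,)
V_ent = A_where @ Z                 # (d,)

# Broadcast the pooled entity feature back to every temporal step via
# element-wise summation (identity stand-in for the MLP projection).
V_stack = Z + V_ent[None, :]        # (F, d), input to the `when` module
assert V_stack.shape == (F, d)
```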

A.2 QUESTION PARSER

The parsers decompose questions into sub-sequences to construct compositional reasoning programs for dialogue and video understanding. Each parser is an attention-based Transformer decoder. The Transformer attention is a multi-head attention on query, key, and value tensors, denoted as Attention(Query, Key, Value). For each token in the Query sequence, the distribution over tokens in the Key sequence is used to obtain the weighted sum of the corresponding representations in the Value sequence.

Attention(Query, Key, Value) = softmax(Query Key^T / √d_key) Value ∈ R^{L_query×d_query}

Each attention is followed by a feed-forward network applied identically at each position. We exploit the multi-head and feed-forward architecture, which has shown good performance in NLP tasks such as NMT and QA (Vaswani et al., 2017; Dehghani et al., 2019), to efficiently incorporate contextual cues from dialogue components to parse questions into reasoning programs. Given the encoded question Q, to decode the program for dialogue understanding, the contextual signals are integrated through 2 attention layers: one attention on previously generated tokens, and the other on question tokens. At time step j, we denote the outputs of these attention layers as follows.
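The attention operation above can be sketched as follows (single-head, no feed-forward network, random toy tensors):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_key)) V
    d_key = key.shape[-1]
    scores = query @ key.T / np.sqrt(d_key)   # (L_query, L_key)
    return softmax(scores) @ value            # (L_query, d)

rng = np.random.default_rng(3)
L_q, L_k, d = 4, 7, 16
out = attention(rng.normal(size=(L_q, d)),
                rng.normal(size=(L_k, d)),
                rng.normal(size=(L_k, d)))
assert out.shape == (L_q, d)
```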

A_dial^(1) = Attention(P_dial|_0^{j-1}, P_dial|_0^{j-1}, P_dial|_0^{j-1}) ∈ R^{j×d}
A_dial^(2) = Attention(A_dial^(1), Q, Q) ∈ R^{j×d}

Similarly, to generate programs for video understanding, the contextual signals are learned and incorporated in a similar manner. However, to exploit dialogue contextual cues, the execution output of the dialogue understanding neural modules Q_ctx is incorporated into each vector in P_vid through an additional attention layer. This layer integrates the resolved entity information to decode the original entities for video understanding. It is equivalent to a reasoning process that converts the question from its original multi-turn semantics to single-turn semantics.

A_vid^(3) = Attention(A_vid^(2), Q_ctx, Q_ctx) ∈ R^{j×d}

A.3 NON-NMN MODELS

For the ablation analysis, we evaluate several variants of VilNMN, based on the following categories. To test the contribution of our NMN approach for video understanding, we remove the parser for the video understanding program and the related neural modules and replace them with a pure neural network architecture (Model B). Specifically, we remove the neural modules where, when, describe, and exist. We then directly use the video feature embeddings V as V_ctx as input to the original attention layer in the response decoder, similarly to (Hori et al., 2019; Sanabria et al., 2019).

A_res^(3) = Attention(A_res^(2), V, V) ∈ R^{j×d}

To further test the contribution of the NMN architecture for dialogue understanding, we similarly remove the question parser for the dialogue understanding program and the neural modules find and describe. We then directly use the dialogue history embeddings H and question embeddings Q as inputs to the response decoder in two different ways. First, we replace the original attention on the dialogue context Q_ctx with two attention layers that attend on the dialogue history and the question sequentially (Model C). As noted by Le et al. (2019b), the question input contains much more relevant signals than the dialogue history, and its attention operation should be separated from the one on the dialogue history.

A_res^(2a) = Attention(A_res^(1), H, H) ∈ R^{j×d}
A_res^(2b) = Attention(A_res^(2a), Q, Q) ∈ R^{j×d}

Alternatively, we simply concatenate the dialogue history and question embeddings, similarly to (Hori et al., 2019; Sanabria et al., 2019), and use the result as input to the original attention layer (Model D).

A_res^(2) = Attention(A_res^(1), [H; Q], [H; Q]) ∈ R^{j×d}

To use more sophisticated neural models for dialogue understanding, we further adopt a hierarchical encoding architecture with question attention (Li et al., 2016; Serban et al., 2016; Hori et al., 2019). Each dialogue turn H_t, including a pair of human utterance and system response, is processed separately by a word-level RNN such as an LSTM (Model E) or GRU (Model F). A sentence-level RNN is used to sequentially process the last hidden states obtained previously, turn by turn. The output of each recurrent step is fed to an attention layer such as (Bahdanau et al., 2015; Vaswani et al., 2017) to obtain question-aware representations of the dialogue history.

H_t^word = RNN(H_t) ∈ R^d
H_t^sent = RNN(H_t^word) ∈ R^d
H = [H_t^sent]|_{t=1}^{T-1} ∈ R^{d×(T-1)}
Q_ctx = Attention(Q, H, H) ∈ R^{L_Q×d}

where T is the current dialogue turn. The output is treated as Q_ctx and is fed to the corresponding attention layer in the response decoder.
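The hierarchical encoder of Models E/F can be sketched with a minimal Elman RNN standing in for the LSTM/GRU cells (weights and turn lengths are illustrative):

```python
import numpy as np

def simple_rnn(x, W_x, W_h):
    # Minimal Elman RNN returning the last hidden state.
    h = np.zeros(W_h.shape[0])
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h

# Hypothetical: T-1 past dialogue turns, each a sequence of d-dim token embeddings.
d = 8
rng = np.random.default_rng(4)
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1
turns = [rng.normal(size=(L, d)) for L in (5, 3, 6)]   # 3 past turns

# Word-level RNN per turn, then a sentence-level RNN over the turn states.
H_word = [simple_rnn(t, W_x, W_h) for t in turns]       # (T-1) vectors of dim d
H = simple_rnn(np.stack(H_word), W_x, W_h)              # hierarchical summary
assert H.shape == (d,)
```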

B ADDITIONAL EXPERIMENT DETAILS B.1 DATASETS

We use the AVSD benchmark from DSTC7 (Hori et al., 2019), which consists of dialogues grounded on the Charades videos (Sigurdsson et al., 2016). Each dialogue contains up to 10 dialogue turns; each turn consists of a question and an expected response about a given video. For visual features, we use the 3D CNN-based features from a pretrained I3D model (Carreira & Zisserman, 2017) and object-level features from a pretrained Faster R-CNN model (Ren et al., 2015b). The audio features are obtained from a pretrained VGGish model (Hershey et al., 2017). In the experiments with AVSD, we consider two settings: one with video summary and one without video summary as input. In the setting with video summary, the summary is concatenated to the dialogue history before the first dialogue turn. We also adapt VilNMN to the video QA benchmark TGIF-QA (Jang et al., 2017). Different from AVSD, TGIF-QA contains a diverse set of QA tasks:

• Count: open-ended task which counts the number of repetitions of an action
• Action: multiple-choice (MC) task which asks about a certain action occurring a fixed number of times
• Transition: MC task which emphasizes temporal transitions in video
• Frame: open-ended task which can be answered from the visual contents of one of the video frames

For the TGIF-QA benchmark, we use features extracted from a pretrained ResNet model (He et al., 2016).

C ADDITIONAL RESULTS

To evaluate model robustness, we report relative performance by calculating the difference in CIDEr between each experimental setting and the most basic setting. Specifically, we compare against the performance of output responses in the first dialogue turn position (i.e. the 2nd-10th turns vs. the 1st turn), or responses grounded on the shortest video length range (video ranges are intervals of the 0-10th, 10-20th percentile, and so on). We report the results of model variants A, B, and E (see the Ablation Analysis section in the main paper and Appendix A.3 for model descriptions). First, as can be seen in Figure 7, for various dialogue turn positions, we observe that the original VilNMN (model A) suffers less than model E as dialogues extend over time, up to the 8th turn. This shows the contribution of the dialogue understanding modules in resolving entities even as dialogues grow longer. Secondly, compared to model B, we observe that the full VilNMN (model A) is less affected as the videos grounding the dialogues grow longer. The difference is clear as the video length increases up to 33 seconds. We also report the absolute scores and compare model variants. In Table 6a, we compare model variants B and E. We observe that model B generally performs better than model E overall, especially in higher turn positions, i.e. from the 4th to the 8th turn. Interestingly, we note some mixed results in very low turn positions, i.e. the 2nd and 3rd turns, and in a very high turn position, i.e. the 10th turn. Potentially, in very high turn positions, a neural approach such as a hierarchical RNN can better capture the global dependencies within the dialogue context than the entity-based compositional NMN method. In Table 6b, we compare model variants A and B. We note that the performance gap between models A and B is quite distinct, with model A outperforming in 7/10 video ranges.
However, similar to our prior observations in the experiments by dialogue turn, we note some mixed results in the lower video ranges (i.e. 0-23 seconds). There are additional factors that we will need to examine further to explain these results, such as the complexity of the questions for these short- and long-range videos. Potentially, our question parser for the video understanding program needs a more sophisticated composition method to retrieve information from these video ranges.
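The relative scores discussed above reduce to a simple subtraction against the basic setting. As a minimal sketch, the helper below computes ∆CIDEr per setting; the CIDEr values in the example are hypothetical placeholders, not the paper's actual numbers.

```python
def delta_scores(scores, baseline_key):
    """Relative performance: the score in each setting minus the score in
    the most basic setting (e.g. turn 1, or the lowest video-length range).

    `scores` maps a setting (dialogue turn position or video range) to an
    absolute CIDEr value.
    """
    base = scores[baseline_key]
    return {k: round(v - base, 4) for k, v in scores.items() if k != baseline_key}

# Hypothetical CIDEr scores by dialogue turn position (illustrative only).
cider_by_turn = {1: 1.10, 2: 1.05, 3: 1.02, 4: 0.98}
print(delta_scores(cider_by_turn, baseline_key=1))
# e.g. {2: -0.05, 3: -0.08, 4: -0.12}
```

The same helper applies unchanged to the video-length breakdown by keying `scores` on range labels instead of turn indices.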



(Figure content) Caption: a boy and a man walk to the room; the boy carries his backpack while the man... Visual: ... Audio: ... Question: how many people are in the video? Answer: there are a boy and a man. Question: what is the boy doing? Answer: the boy walks downstairs and carries a backpack. Question: what is he doing while carrying it? Predicted answer: he is cleaning a mirror.

Figure 4: where and when neural modules for video understanding

(Figure content: predicted vs. gold dialogue and video understanding programs, e.g. predicted where(two men in the video), where(the scene) → when(doing in the scene) → describe(what) vs. gold where(two men), where(rooftop) → when(doing in the scene) → describe(what), together with the corresponding predicted and gold responses; ✓ and ✘ mark correct and incorrect predictions.)

Figure 5: Interpretability of model outputs from a dialogue in the test split of the AVSD benchmark.

Figure 6: Adaptation of the where module to temporal-based features

Figure 7: Performance of model variants A, B, and E, by dialogue turn position and video length. Performance is calculated relative to the most basic setting, i.e. responses of the first dialogue turn, ∆CIDEr_turn_i = CIDEr_turn_i − CIDEr_turn_1, or responses grounded on the lowest video range (0 to 23 seconds), ∆CIDEr_range_i = CIDEr_range_i − CIDEr_0-23.

Description of the modules and their functionalities. We denote P as the parameter to instantiate each module, H as the dialogue history, Q as the question of the current dialogue turn, and V as the video input.

Module | Inputs | Output | Functionality
find | P, H | H_ent | For related entities in the question, select the relevant tokens from the dialogue history
summarize | H_ent, Q | Q_ctx | Based on contextual entity representations, summarize the question semantics
where | P, V | V_ent | Select the relevant spatial positions corresponding to the original (resolved) entities
when | P, V_ent | V_ent+act | Select the relevant entity-aware temporal steps corresponding to the action parameter
describe | P, V_ent+act | V_ctx | Select visual entity-action features based on non-binary question types
exist | Q, V_ent+act | V_ctx | Select visual entity-action features based on binary (yes/no) question types

3.1 ENCODERS

Text Encoder. A text encoder is shared to encode all text inputs, including dialogue history, questions, and captions. The text encoder converts each text sequence X = w_1, ..., w_m into a sequence of embeddings X ∈ R^{m×d}. We use a trainable embedding matrix to map token indices to vector representations of d dimensions through a mapping function φ. These vectors are then integrated with the ordering information of tokens through a positional encoding function, followed by layer normalization.
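The text encoder step can be sketched as follows: embedding lookup (φ), sinusoidal positional encoding in the style of Vaswani et al. (2017), and layer normalization. This is an illustrative numpy sketch, not the authors' implementation; the toy vocabulary, random weights, and d = 8 are assumptions.

```python
import numpy as np

def positional_encoding(m, d):
    """Sinusoidal positional encodings for m positions and d dimensions."""
    pos = np.arange(m)[:, None]                      # (m, 1) token positions
    i = np.arange(d)[None, :]                        # (1, d) dimension indices
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (m, d)

def layer_norm(x, eps=1e-6):
    """Normalize each embedding to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encode_text(token_ids, embedding):
    """phi: look up embeddings, add positional information, then normalize."""
    x = embedding[token_ids]                         # (m, d)
    x = x + positional_encoding(len(token_ids), embedding.shape[1])
    return layer_norm(x)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))                      # toy vocab of 100, d = 8
X = encode_text([4, 17, 3], emb)
print(X.shape)  # (3, 8)
```

In the paper the embedding matrix is trainable; here it is a fixed random stand-in so the sketch is self-contained.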

Figure 3: find and summarize neural modules for dialogue understanding at the entity level. To encode question features, our modules compress fine-grained question representations at the entity level: the find and summarize modules generate entity-dependent local and global representations of question semantics, respectively. We show that our modularized approach can achieve better performance and transparency than traditional approaches to encoding dialogue context (Serban et al., 2016; Vaswani et al., 2017) (Section 4).
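As an illustration only (not the authors' exact architecture), a find-style module can be realized as scaled dot-product attention that scores dialogue-history tokens against the entity parameter P and returns an entity-dependent summary of the history:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def find(P, H):
    """Hypothetical find module: attend over dialogue-history token
    embeddings H (m, d) with an entity-parameter embedding P (d,) and
    return an entity-dependent history representation H_ent (d,)."""
    scores = H @ P / np.sqrt(H.shape[1])   # (m,) relevance of each token to P
    weights = softmax(scores)              # attention distribution over tokens
    return weights @ H                     # weighted sum of token embeddings

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))                # 6 history tokens, d = 8
P = rng.normal(size=8)                     # entity parameter embedding
print(find(P, H).shape)  # (8,)
```

A summarize-style module could then condition on this H_ent together with the question embeddings to produce the global question representation Q_ctx.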

AVSD test results. The visual features are: I3D (I), ResNeXt-101 (RX), Faster R-CNN (FR), and caption as a video input (C). The audio features are: VGGish (V) and AclNet (A). PT denotes models using pretrained weights and/or additional finetuning. Best and second-best results are bold and underlined, respectively.

Ablation analysis of VilNMN with different model variants on the test split of the AVSD benchmark



Summary of the DSTC7 AVSD and TGIF-QA benchmarks.

We use a training batch size of 32 and an embedding dimension d = 128 in all experiments. Where Transformer attention is used, we fix the number of attention heads to 8 in all attention layers. In neural modules with MLP layers, the MLP network is fixed to 2 linear layers with a ReLU activation in between. In neural modules with a CNN, we adopt a vanilla CNN architecture for text classification (without the last MLP layer), where the number of input channels is 1, the kernel sizes are {3, 4, 5}, and the number of output channels is d. We initialize models with a uniform distribution (Glorot & Bengio, 2010). During training, we adopt the Adam optimizer (Kingma & Ba, 2015) and a decaying learning rate (Vaswani et al., 2017), where we fix the warm-up period to 15K training steps. We employ dropout (Srivastava et al., 2014) of 0.2 in all networks except the last linear layers of the question parsers and the response decoder. We train models for up to 50 epochs and select the best models based on the average loss per epoch on the validation set.
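The decaying learning rate referenced above follows the schedule of Vaswani et al. (2017): linear warm-up followed by inverse-square-root decay. A sketch with the paper's 15K warm-up steps and d = 128:

```python
def noam_lr(step, d_model=128, warmup=15000):
    """Transformer learning-rate schedule (Vaswani et al., 2017):
    linear warm-up for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate rises until step 15000, then decays.
print(noam_lr(1) < noam_lr(15000) and noam_lr(30000) < noam_lr(15000))  # True
```

An optimizer-specific base multiplier, if any, is not stated in this section; the sketch uses the unscaled schedule.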

Performance breakdown in BLEU-4 and CIDEr (a) by dialogue turn between model variants B and E, and (b) by video length between model variants A and B.

D QUALITATIVE ANALYSIS

We extract the predicted programs and responses for some example dialogues in Figures 8, 9, 10, and 11 and report our observations:

• We observe that when the predicted programs are correct, the output responses generally match the ground truth (see the 1st and 2nd turns in Figure 8, and the 1st and 4th turns in Figure 10) or come close to the ground-truth responses (1st turn in Figure 9).
• When the output responses do not match the ground truth, we can understand the model's mistakes by interpreting the predicted programs. For example, in the 3rd turn in Figure 8, the output response describes a room because the predicted video program focuses on the entity "what room" instead of the entity "an object" in the question. Another example is the 3rd turn in Figure 10, where the entity "rooftop" is missing from the video program. These mismatches can skew the information retrieved from the video during video program execution, leading to output responses with wrong visual contents.
• We also note that in some cases, one or both of the predicted programs are incorrect, but the predicted responses still match the ground-truth responses. This might be explained by the predicted module parameters not being exactly the same as the ground truth but being close enough (e.g. the 4th turn in Figure 8). Sometimes, our model predicts programs that are more appropriate than the ground truth. For example, in the 2nd turn in Figure 9, the program includes an additional where module parameterized by the entity "the shopping bag", which was resolved from the reference "them" mentioned in the question.
• We observe that for complex questions that involve more than one query (e.g. the 3rd turn in Figure 10), it becomes more challenging to decode an appropriate video understanding program and generate responses that address all queries.
• In Figure 11, we demonstrate some output examples of VilNMN and compare them with two baselines: Baseline (Hori et al., 2019) and MTN (Le et al., 2019b). We note that VilNMN can include important entities relevant to the current dialogue turn when constructing output responses, while other models might miss some entity details, e.g. "them/dishes" in example A and "the magazine" in example B. These small yet important details can determine the correctness of dialogue responses.

