ANSWER ME IF YOU CAN: DEBIASING VIDEO QUESTION ANSWERING VIA ANSWERING UNANSWERABLE QUESTIONS

Abstract

Video Question Answering (VideoQA) is the task of predicting a correct answer given a question-video pair. Recent studies have shown that most VideoQA models rely on spurious correlations induced by various biases when predicting an answer. For instance, VideoQA models tend to predict 'two' as the answer to questions starting with "How many" without considering the video, since the majority of answers to such questions are 'two'. In causal inference, such a bias (question type), which simultaneously affects the input X ("How many ...") and the answer Y ('two'), is referred to as a confounder Z that hinders a model from learning the true relationship between the input and the answer. The effect of a confounder Z can be removed with a causal intervention P (Y |do(X)) when Z is observed. However, there exist many unobserved confounders affecting questions and videos, e.g., dataset bias induced by annotators who mainly focus on human activities and salient objects, resulting in a spurious correlation between videos and questions. To address this problem, we propose a novel framework that learns unobserved confounders by capturing the bias with unanswerable questions, i.e., artificially constructed VideoQA samples whose video and question come from two different samples, and leverages the learned confounders to debias a VideoQA model through causal intervention. We demonstrate that our confounders successfully capture the dataset bias by investigating which parts of a video or question the confounders attend to. Our experiments on multiple VideoQA benchmark datasets show the effectiveness of the proposed debiasing framework, with an even larger performance gap over biased models under distribution shift.

1. INTRODUCTION

Video Question Answering (VideoQA) is a multi-modal understanding task of finding the correct answer given a question-video pair, which requires understanding both the vision and text modalities along with causal reasoning. However, recent studies (Ramakrishnan et al., 2018; Cadene et al., 2019) point out that the success of VideoQA models is due to their reliance on spurious correlations caused by bias rather than reasonable inference for answer prediction. In other words, the models exploit the co-occurrence between the question (or video) and the answer in the dataset statistics and tend to simply predict the frequent answers. For instance, given a question that starts with "How many", a biased VideoQA model often blindly predicts 'two' as the answer, as depicted in Fig. 1b. Fig. 1a illustrates the statistics of the MSVD-QA dataset, showing that the majority of answers to "How many" questions are 'two'. In this case, the 'question type' acts as a bias simultaneously influencing the input question-video pair and the answer, which hinders the model from learning the true relationship between the input and the answer. In causal inference (Glymour et al., 2016), such a variable, e.g., question type, affecting both the input X and the answer Y is called a confounder Z, which interferes with finding the true causal relationship between X and Y. The causal intervention P (Y |do(X)) intentionally cuts off the relation between X and Z via do-calculus, also called 'deconfounding', to remove the effect of the confounders.¹ Nevertheless, Z must be predefined to apply the causal intervention, yet most confounders are unobserved in the dataset, making the intervention hard to apply. Therefore, we introduce learnable confounder queries, train them to capture the bias, and leverage the learned confounders for debiasing through causal intervention. To achieve this, we force the model to answer unanswerable questions.
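To make the intervention concrete, when a confounder Z is observed and satisfies the backdoor criterion, the backdoor adjustment marginalizes it out: P(Y|do(X)) = Σ_z P(Y|X,z) P(z). The toy sketch below illustrates this for a single observed confounder (question type); the numbers are fabricated purely for illustration and do not come from any dataset discussed here.

```python
# Toy illustration of the backdoor adjustment for an *observed* confounder Z
# (here: question type):  P(Y | do(X)) = sum_z P(Y | X, z) * P(z).
# All probabilities below are fabricated for illustration only.

# P(z): marginal distribution of question types in a hypothetical dataset.
p_z = {"how_many": 0.6, "what_color": 0.4}

# P(y | x, z): answer distribution for one fixed input x under each value of z.
p_y_given_xz = {
    "how_many":   {"two": 0.9, "four": 0.1},
    "what_color": {"two": 0.1, "four": 0.9},
}

def backdoor_adjust(p_y_given_xz, p_z):
    """Deconfounded answer distribution P(Y | do(X)) via backdoor adjustment."""
    p_do = {}
    for z, pz in p_z.items():
        for y, p_yxz in p_y_given_xz[z].items():
            p_do[y] = p_do.get(y, 0.0) + p_yxz * pz
    return p_do

p_do = backdoor_adjust(p_y_given_xz, p_z)
print(p_do)  # ~ {'two': 0.58, 'four': 0.42}
```

Instead of weighting each stratum by P(z|x), which would inherit the dataset's skew toward 'two' for "How many" inputs, the adjustment weights by the marginal P(z), reducing the biased conditional toward the deconfounded answer distribution.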
An unanswerable question refers to an artificially constructed VideoQA sample whose video and question come from two different samples in a mini-batch, along with an answer that corresponds to either the video or the question. When a model answers an unanswerable question, it inevitably learns the bias of the modality that corresponds to the answer, since the other modality is randomly sampled and irrelevant to the answer. To summarize, we propose a novel framework, Debiasing a Video Question Answering Model by Answering Unanswerable Questions (VoidQ), based on causal inference. In order to apply the causal intervention P (Y |do(X)), we introduce learnable confounder queries, which are trained to capture the bias by answering the unanswerable questions. Our framework then leverages the confounder queries and their outputs to debias the VideoQA model via causal intervention. We validate our model on three benchmark VideoQA datasets (TGIF-QA, MSRVTT-QA, and MSVD-QA) and demonstrate the effectiveness of our debiasing strategy. Ablation studies also reveal that the performance gap between conventional biased models and the proposed model grows when the training and test distributions significantly differ, supporting the improved generalization ability of the proposed approach. Lastly, visualization of the confounders via our variant of Grad-CAM shows that the learned confounder queries adequately debias the VideoQA model by accounting for falsely correlated keywords in questions and salient regions in videos. To sum up, our contributions are as follows:

• We propose a novel debiasing framework for VideoQA that predicts correct answers based on causal inference by removing the effect of confounders.

• We present a training scheme encouraging the learnable confounder queries to capture the bias by forcing the model to answer unanswerable questions.
• Our extensive experiments demonstrate that the proposed framework outperforms previous models on various benchmark datasets, with an even larger margin under the distribution shift from which biased models suffer significant performance degradation.

• We verify that our confounders successfully capture the dataset bias by investigating which parts of a video or which words in a question are utilized by the confounder queries to correct the predictions of the VideoQA model.
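The construction of unanswerable questions described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual implementation: the function name, signature, and the self-pairing fix are our own choices, and a real implementation would operate on batched tensors rather than Python lists.

```python
import random

def make_unanswerable(videos, questions, answers, keep="question", seed=0):
    """Pair each question with a video from a *different* sample in the
    mini-batch (or vice versa), producing 'unanswerable' samples.

    The retained modality (`keep`) still matches the answer, so a model
    answering these samples can only exploit the bias of that modality.
    """
    rng = random.Random(seed)
    b = len(videos)
    perm = list(range(b))
    rng.shuffle(perm)
    # Remove accidental self-pairings so the swapped-in modality always
    # comes from a different sample (requires a batch size of at least 2).
    for i in range(b):
        if perm[i] == i:
            j = (i + 1) % b
            perm[i], perm[j] = perm[j], perm[i]
    if keep == "question":
        # Video is irrelevant to the answer: answering exposes question bias.
        return [videos[p] for p in perm], questions, answers
    # Question is irrelevant to the answer: answering exposes video bias.
    return videos, [questions[p] for p in perm], answers
```

With `keep="question"` the answers stay aligned with the questions while every video is swapped, so any supervision signal the model extracts from such samples must come from question-side bias, and vice versa for `keep="video"`.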

2. RELATED WORK

Video Question Answering (VideoQA). VideoQA is the task of inferring a correct answer given a video and a question. While models for the VisualQA task focus on spatial information of an image (Antol et al., 2015; Yang et al., 2016), VideoQA requires reasoning over both temporal and spatial dynamics, making it a more challenging task. Previous works have applied spatio-temporal contextual attention to various scenarios (Jang et al., 2017; Xiao et al., 2022; Zhao et al., 2017).



¹ Additional descriptions of causal inference are in Sec. 3.1.



Figure 1: Dataset statistics of the MSVD-QA dataset and an example of a biased answer. (Left) The majority of answers to "How many" questions are 'two'. (Right) The model outputs the biased answer 'two' instead of the correct answer 'four'.

