ANSWER ME IF YOU CAN: DEBIASING VIDEO QUESTION ANSWERING VIA ANSWERING UNANSWERABLE QUESTIONS

Abstract

Video Question Answering (VideoQA) is the task of predicting the correct answer for a given question-video pair. Recent studies have shown that most VideoQA models rely on spurious correlations induced by various biases when predicting an answer. For instance, VideoQA models tend to predict 'two' as the answer to a question starting with "How many" without considering the video, since the majority of answers to such questions are 'two'. In causal inference, such a bias (question type), which simultaneously affects the input X ("How many ...") and the answer Y ('two'), is referred to as a confounder Z that hinders a model from learning the true relationship between the input and the answer. The effect of a confounder Z can be removed with the causal intervention P (Y |do(X)) when Z is observed. However, there exist many unobserved confounders affecting questions and videos, e.g., dataset bias induced by annotators who mainly focus on human activities and salient objects, resulting in a spurious correlation between videos and questions. To address this problem, we propose a novel framework that learns unobserved confounders by capturing the bias with unanswerable questions, i.e., artificially constructed VideoQA samples that pair a video and a question from two different samples, and leverages the learned confounders to debias a VideoQA model through causal intervention. We demonstrate that our confounders successfully capture the dataset bias by investigating which parts of a video or question the confounders attend to. Experiments on multiple VideoQA benchmark datasets show the effectiveness of the proposed debiasing framework, which yields an even larger performance gap over biased models under distribution shift.
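The unanswerable-question construction described above can be sketched as follows. This is a hedged illustration, not the paper's released code; the sample fields (`video`, `question`) are hypothetical names. Each question is paired with a video drawn from a different sample in the batch, so the resulting pair has no valid answer.

```python
# Hedged sketch (field names are hypothetical, not the paper's exact code):
# build "unanswerable" samples by pairing each question with a video taken
# from a *different* sample, as described in the abstract.
def make_unanswerable(batch):
    """Return mismatched (video, question) pairs from a batch of samples."""
    videos = [s["video"] for s in batch]
    # Rotate the videos by one position so every question loses its own
    # video (guaranteed mismatch whenever the batch has >1 sample).
    rotated = videos[1:] + videos[:1]
    return [{"video": v, "question": s["question"]}
            for v, s in zip(rotated, batch)]

batch = [{"video": "v0", "question": "q0"},
         {"video": "v1", "question": "q1"},
         {"video": "v2", "question": "q2"}]
for fake, real in zip(make_unanswerable(batch), batch):
    assert fake["video"] != real["video"]  # every pair is mismatched
```

A simple rotation is used here for determinism; random shuffling with a mismatch check would serve the same purpose.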

1. INTRODUCTION

Video Question Answering (VideoQA) is a multi-modal understanding task of finding the correct answer for a given question-video pair, which requires an understanding of both the vision and text modalities along with causal reasoning. However, recent studies (Ramakrishnan et al., 2018; Cadene et al., 2019) point out that the success of VideoQA models stems from their reliance on spurious correlations caused by bias rather than from reasonable inference for answer prediction. In other words, the models exploit the co-occurrence between the question (or video) and the answer in the dataset statistics and tend to simply predict the frequent answers. For instance, given a question that starts with "How many", a biased VideoQA model often blindly predicts 'two' as the answer, as depicted in Fig. 1b. Fig. 1a illustrates the statistics of the MSVD-QA dataset, showing that the majority of answers to "How many" questions are 'two'. In this case, the question type acts as a bias that simultaneously influences the input question-video pair and the answer, which hinders the model from learning the true relationship between the input and the answer. In causal inference (Glymour et al., 2016), such a variable, e.g., question type, affecting both the input X and the answer Y is called a confounder Z, which interrupts finding the true causal relationship between X and Y . The causal intervention P (Y |do(X)) intentionally cuts off the relation between X and Z via do-calculus, which is also called 'deconfounding', to remove the effect of the con-
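The deconfounding idea above can be made concrete with a minimal numeric sketch of the backdoor adjustment, P(Y|do(X)) = Σ_z P(Y|X, z)P(z). The numbers below are toy values chosen for illustration, not from the paper: Y depends only on the confounder Z, while Z also drives X, so the observational P(Y|X) shows a spurious association (like the question-type bias) that the intervention removes.

```python
# Toy backdoor adjustment: P(Y|do(X)) = sum_z P(Y|X,z) * P(z).
# X has NO true effect on Y here; Z drives both X and Y, creating a
# spurious X-Y correlation analogous to the question-type bias.
p_z = {0: 0.5, 1: 0.5}                       # prior P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.8}               # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (1, 0): 0.1,    # P(Y=1 | X=x, Z=z):
                (0, 1): 0.9, (1, 1): 0.9}    # independent of x by design

# Observational P(Y=1 | X=1): weights P(Y|X,z) by P(z | X=1), so the
# confounder leaks in through Bayes' rule.
p_x1 = sum(p_x_given_z[z] * p_z[z] for z in p_z)
p_z_given_x1 = {z: p_x_given_z[z] * p_z[z] / p_x1 for z in p_z}
p_y_obs = sum(p_y_given_xz[(1, z)] * p_z_given_x1[z] for z in p_z)

# Interventional P(Y=1 | do(X=1)): weights by the prior P(z) instead,
# severing the Z -> X edge.
p_y_do = sum(p_y_given_xz[(1, z)] * p_z[z] for z in p_z)

print(round(p_y_obs, 2))  # 0.74 -- spurious association via Z
print(round(p_y_do, 2))   # 0.5  -- no causal effect, as constructed
```

Note the only difference between the two estimates is the weighting over z: the posterior P(z|X) versus the prior P(z). This is exactly what do-calculus cuts, and it is why an unobserved Z is problematic: the adjustment requires access to P(z), which the proposed framework learns rather than assumes.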

