ANSWER ME IF YOU CAN: DEBIASING VIDEO QUESTION ANSWERING VIA ANSWERING UNANSWERABLE QUESTIONS

Abstract

Video Question Answering (VideoQA) is a task to predict a correct answer given a question-video pair. Recent studies have shown that most VideoQA models rely on spurious correlations induced by various biases when predicting an answer. For instance, VideoQA models tend to predict 'two' as an answer without considering the video if a question starts with "How many", since the majority of answers to such questions are 'two'. In causal inference, such a bias (question type), which simultaneously affects the input X (How many...) and the answer Y (two), is referred to as a confounder Z that hinders a model from learning the true relationship between the input and the answer. The effect of the confounders Z can be removed with a causal intervention P(Y|do(X)) when Z is observed. However, there exist many unobserved confounders affecting questions and videos, e.g., dataset bias induced by annotators who mainly focus on human activities and salient objects, resulting in a spurious correlation between videos and questions. To address this problem, we propose a novel framework that learns unobserved confounders by capturing the bias using unanswerable questions, i.e., artificially constructed VQA samples pairing a video and a question from two different samples, and leverages the confounders for debiasing a VQA model through causal intervention. We demonstrate that our confounders successfully capture the dataset bias by investigating which parts of a video or question the confounders pay attention to. Our experiments on multiple VideoQA benchmark datasets show the effectiveness of the proposed debiasing framework, which yields an even larger performance gap over biased models under distribution shift.

1. INTRODUCTION

Video Question Answering (VideoQA) is a multi-modal understanding task to find the correct answer given a question-video pair, which requires an understanding of both the vision and text modalities along with causal reasoning. However, recent studies (Ramakrishnan et al., 2018; Cadene et al., 2019) point out that the success of VideoQA models is due to their reliance on spurious correlations caused by bias instead of reasonable inference for answer prediction. In other words, the models concentrate on the co-occurrence between the question (or video) and the answer based on the dataset statistics and tend to simply predict the frequent answers. For instance, given a question that starts with "How many", a biased VideoQA model often blindly predicts 'two' as the answer, as depicted in Fig. 1b. Fig. 1a illustrates the statistics of the MSVD-QA dataset, showing that the majority of answers to the "How many" questions are 'two'. In this case, the 'question type' acts as a bias simultaneously influencing the input question-video pair and the answer, which hinders the model from learning the true relationship between the input and the answer. In causal inference (Glymour et al., 2016), such a variable, e.g., question type, affecting both the input X and the answer Y is called a confounder Z, which interrupts finding the true causal relationship between X and Y. The causal intervention P(Y|do(X)) intentionally cuts off the relation between X and Z via do-calculus, which is also called 'deconfounding', to remove the effect of the confounders. Nevertheless, Z should be predefined to apply the causal intervention, but most confounders are unobserved in the dataset and thus hard to incorporate into the causal intervention. Therefore, we introduce learnable confounder queries, train them to capture the bias, and leverage the learned confounders for debiasing through the causal intervention. To achieve this, we force the model to answer unanswerable questions.
An unanswerable question refers to an artificially constructed VQA sample pairing a video and a question from two different samples in a mini-batch, along with an answer that corresponds to either the video or the question. When a model answers the unanswerable question, it inevitably learns the bias of the specific modality that corresponds to the answer, since the other modality is randomly sampled and irrelevant to the answer. To summarize, we propose a novel framework, Debiasing a Video Question Answering Model by Answering Unanswerable Questions (VoidQ), with causal inference. In order to apply the causal intervention P(Y|do(X)), we introduce learnable confounder queries. The proposed confounder queries are trained to capture the bias by answering the unanswerable questions. Our framework leverages the confounder queries and their outputs to debias our VQA model via causal intervention. We validate our models on three benchmark VideoQA datasets (TGIF-QA, MSRVTT-QA, and MSVD-QA) and demonstrate the effectiveness of our debiasing strategy. Ablation studies also reveal that the performance gap between conventional biased models and the proposed model gets larger when the training and test distributions significantly differ, supporting the improved generalization ability of the proposed approach. Lastly, visualization of the confounders via our variant of Grad-CAM shows that the learned confounder queries adequately debias a VQA model by taking into account falsely correlated keywords in questions or salient regions in videos. To sum up, our contributions are as follows:

• We propose a novel debiasing framework for VideoQA that predicts correct answers based on causal inference by removing the effect of the confounders.

• We present a training scheme encouraging the learnable confounder queries to capture the bias by forcing the model to answer unanswerable questions.

• Our extensive experiments demonstrate that the proposed framework outperforms previous models on various benchmark datasets, with an even larger margin under the distribution shift where biased models suffer significant performance degradation.

• We verify that our confounders successfully capture the dataset bias by investigating which parts in a video or which words in a question are utilized by the confounder queries to correct the predictions of a VQA model.

2. RELATED WORK

Video Question Answering (VideoQA). VideoQA is a task to infer a correct answer given a video and a question. While models for the VisualQA task focus on spatial information of an image (Antol et al., 2015; Yang et al., 2016), VideoQA requires reasoning over both temporal and spatial dynamics, making it a more challenging task. Previous works have applied spatio-temporal contextual attention to various scenarios (Jang et al., 2017; Xiao et al., 2022; Zhao et al., 2017). Another line of research has proposed end-to-end models pretrained on large-scale datasets to improve the performance of various downstream tasks, including VideoQA. Fu et al. (2021) build an additional cross-modal encoder on top of the video and text encoders, and Wang et al. (2022) introduce a token rolling operation to efficiently perform temporal attention in the cross-modal encoder. However, existing models still suffer from dataset bias (Ramakrishnan et al., 2018; Cadene et al., 2019). Therefore, in this paper, we propose a debiasing framework for VideoQA to improve generalization power by reducing the dataset bias, even under a distribution shift.

Debiasing from the biased model. The first approach to alleviate bias is to directly augment the dataset to remove statistical 'hints' and enlarge the size and diversity of the training set. Gokhale et al. (2020), Chen et al. (2020), and Kil et al. (2021) propose to augment input images or questions to generate counterfactual or paraphrased QA pairs. On the other hand, there exist attempts to learn the text or image bias through additional branches trained on a single modality only. Outputs from the biased branches are then utilized to debias training (Ramakrishnan et al., 2018; Zhang et al., 2021; Cadene et al., 2019). However, these approaches require manually designed heuristic rules or separate branches to capture bias from a specific modality.
We propose a novel unified debiasing framework that leverages unanswerable questions, allowing the learnable confounder queries to capture the biases related to both modalities without any heuristics.

Causal Inference. Causal inference, a method to find the true effect of a particular variable on a target variable without interference from other variables, is widely adopted in Visual QA tasks. Previous works proposed augmenting the data to remove the unwanted effects of a specific variable by utilizing the Structural Causal Model (SCM) (Glymour et al., 2016), which defines causal relationships between variables. Specifically, existing works generate counterfactual samples (Tang et al., 2020; Abbasnejad et al., 2020; Yue et al., 2021) or negative samples (Wen et al., 2021; Teney et al., 2020) to measure and remove the effect of a specific confounding variable. Besides, causal intervention is also used to directly remove the effects of a predefined confounding variable that hinders proper reasoning. Unfortunately, since the confounders are usually unobserved, most existing methods manually predefine the confounder sets as object classes (Zhang et al., 2020) or verb-centered relation tuples from caption data (Nan et al., 2021) in order to conduct the causal intervention. Unlike these works, we directly train the confounder queries so that they can capture various types of bias, instead of manually predefining what the confounders should be.

3. METHOD

Bias misleads the model into relying on spurious correlations, resulting in poor generalization ability. Therefore, in this section, we propose a novel framework, Debiasing a Video Question Answering Model by Answering Unanswerable Questions (VoidQ), with causal inference. First, we briefly revisit the basic concepts of VideoQA and causal inference. We then present the debiasing framework with learnable confounder queries based on the causal intervention. Finally, we introduce the training objective using unanswerable questions to let the confounder queries learn the bias.

3.1. PRELIMINARIES

VideoQA. VideoQA is a task to predict the answer Ŷ given a question-video pair X = (x_q, x_v). There are two types of tasks in VideoQA: multi-choice question answering (MCQA) and open-ended question answering (OEQA). For MCQA, the model generally predicts the answer among five options. Each option is concatenated with the question, and the model calculates the similarities between each concatenated text and the video to output the final prediction. In the OEQA setting, the task is mostly converted to a classification task predicting the correct answer from a predefined global vocabulary set containing all candidate answers. For simplicity, we explain the concepts only with OEQA; further details including MCQA are in the supplement. The prediction Ŷ under the OEQA setting given a pair X = (x_q, x_v) can be written as:

Ŷ = P(Y|X) = h(f(X)), (1)

where f is a feature encoder and h is a classifier.

(Figure 3 caption: We construct tokenized text tokens x_q and patchified video tokens x_v from the input question and video, then concatenate them to form X and feed them into the data encoder f. Learnable confounder queries Z are fed to the confounder encoder g and cross-attended with the output features X̄ of f. Z is trained to learn the dataset bias by minimizing L_confounder between the ground truth and a biased prediction Ŷ_g generated from Z̃ through h_g. Causal intervention utilizing the learned confounders Z and Z̃ is applied to generate the final debiased prediction Ŷ_f from X̄ through h_f.)

Causal Inference. To train f and h in Eq. 1, most previous approaches (Fan et al., 2019; Gao et al., 2018; Jiang et al., 2020; Jiang & Han, 2020; Le et al., 2020) have adopted the standard cross entropy (CE) loss as:

min_{f,h} CE(Ŷ, Y). (2)

Eq. 2 aims simply to minimize CE between the ground truth Y and the prediction Ŷ, in which f and h naturally learn the spurious correlation between X and Y.
To address this problem, some recent works (Tang et al., 2020; Zhang et al., 2020; Niu et al., 2021) tried applying causal inference (Glymour et al., 2016) to alleviate the bias. As shown in Fig. 2a, conventional approaches have calculated the likelihood P(Y|X) to predict the answer as:

P(Y|X) = Σ_{z∈Z} P(Y|X, z)P(z|X). (3)

On the other hand, the predicted answer with causal intervention P(Y|do(X)), where the connection between X and Z is cut off, is defined as:

P(Y|do(X)) = Σ_{z∈Z} P(Y|X, z)P(z). (4)

Unlike Eq. 3, the causal intervention removes the relation between X and Z, allowing the model to reason about the real causation of X on Y by considering the prior P(z) in Eq. 4 instead of P(z|X) in Eq. 3. Note that a set of confounders Z must be known in advance to calculate Eq. 4.
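To make the difference between Eq. 3 and Eq. 4 concrete, the following toy sketch evaluates both formulas with hypothetical numbers (the confounder z is a question type, answers are {'two', 'three'}; none of these probabilities come from the paper):

```python
# Toy illustration of likelihood (Eq. 3) vs. causal intervention (Eq. 4).
# All probability tables below are hypothetical, for illustration only.
P_z_given_X = {"howmany": 0.9, "other": 0.1}   # P(z|X): X strongly suggests "How many"
P_z = {"howmany": 0.3, "other": 0.7}           # P(z): dataset-level prior over types
P_Y_given_Xz = {                               # P(Y|X,z)
    "howmany": {"two": 0.8, "three": 0.2},     # under the "How many" confounder, 'two' dominates
    "other":   {"two": 0.3, "three": 0.7},
}

def likelihood(y):       # Eq. 3: sum over z of P(Y|X,z) P(z|X)
    return sum(P_Y_given_Xz[z][y] * P_z_given_X[z] for z in P_z)

def intervention(y):     # Eq. 4: sum over z of P(Y|X,z) P(z)
    return sum(P_Y_given_Xz[z][y] * P_z[z] for z in P_z)

# Deconfounding weakens the spuriously dominant answer 'two':
assert abs(likelihood("two") - 0.75) < 1e-9
assert abs(intervention("two") - 0.45) < 1e-9
```

Replacing P(z|X) with the prior P(z) is exactly what lowers the biased score of 'two' here, which is the effect the intervention is meant to achieve.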

3.2. OVERALL ARCHITECTURE

VoidQ consists of two encoders. A data encoder f follows the Transformer (Vaswani et al., 2017) encoder based on the self-attention mechanism. Unlike f, a confounder encoder g is composed of a cross-attention layer followed by feed-forward networks (FFN). For the input of f, we concatenate text tokens x_q and visual tokens x_v, i.e., X = (x_q, x_v) ∈ R^{N×D}, where N is the number of input tokens and D is the feature dimension. Then, the output feature X̄ can be calculated as:

X̄ = f(X) ∈ R^{N×D}. (5)

For unobserved confounders, we additionally introduce a set of learnable confounder queries Z ∈ R^{M×D} as an input of g, where M is the number of confounders and D is the feature dimension. We also add two different modality encodings to inject information about each token's modality so that the queries can learn modality-specific bias. Concretely, a text-type encoding and a video-type encoding are added to the confounder queries Z[0 : M/2] and Z[M/2 : M], respectively. The output of f, i.e., X̄, is also used as an input of g and cross-attended with Z. In detail, Z is adopted as the query, and X̄ is used as the key and value of the encoder g. Then, the output Z̃ can be written as:

Z̃ = {z̃ | z̃ = g(X̄, z), ∀z ∈ Z} ∈ R^{M×D}. (6)

We additionally introduce two FFNs, h_f and h_g, as prediction heads of f and g, respectively. In short, the data encoder f and h_f perform the main VideoQA task, while the confounder encoder g and h_g learn and encode the bias in the confounder queries Z, which will later be removed via causal intervention. The detailed objective functions to train each component are introduced in Sec. 3.3. Fig. 3 illustrates the overall architecture of our proposed framework.
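The cross-attention in Eq. 6 can be sketched as follows. This is a minimal single-head NumPy version, assuming illustrative shapes (N = 16 tokens, M = 8 confounder queries, D = 32) and omitting the FFN and learned projection matrices of the real encoder g:

```python
import numpy as np

# Minimal single-head sketch of the confounder encoder g (Eq. 6): confounder
# queries Z cross-attend to the data-encoder output X_bar (Z as query,
# X_bar as key and value). Shapes and scales are illustrative assumptions.
rng = np.random.default_rng(0)
N, M, D = 16, 8, 32
X_bar = rng.normal(size=(N, D))        # output of the data encoder f (Eq. 5)
Z = rng.normal(size=(M, D))            # learnable confounder queries

# modality-type encodings: first half text-type, second half video-type
type_enc = np.zeros((M, D))
type_enc[: M // 2] += rng.normal(size=D) * 0.1
type_enc[M // 2 :] += rng.normal(size=D) * 0.1
Zq = Z + type_enc

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(Zq @ X_bar.T / np.sqrt(D))   # (M, N) attention over data tokens
Z_tilde = attn @ X_bar                      # (M, D) data-modulated confounders

assert Z_tilde.shape == (M, D)
assert np.allclose(attn.sum(axis=1), 1.0)
```

Because each row of `attn` is a distribution over the N data tokens, each output z̃ is a data-dependent mixture of token features, which is why the paper calls Z̃ "data-modulated" in contrast to the data-agnostic Z.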

3.3. TRAINING OBJECTIVE

Debiased prediction. Conventional approaches used P(Y|X) in Eq. 3 as the output logit, simply calculated with an additional FFN h on top of the encoder f, i.e., Ŷ = P(Y|X) = h(f(X)). On the other hand, to remove the effect of confounders, we use the logit P(Y|do(X)) with causal intervention. Since P(Y|do(X)) is calculated with the Softmax function, it can be approximated by the Normalized Weighted Geometric Mean (NWGM) (Xu et al., 2015) as follows:

Ŷ_f = P(Y|do(X)) = Σ_{z∈Z} P(Y|X, z)P(z) ≈ P(Y | Σ_{z∈Z} (X̄ + z)P(z)). (7)

Then, the debiased output Ŷ_f in Eq. 7 can be calculated with the FFN h_f:

P(Y | Σ_{z∈Z} (X̄ + z)P(z)) = h_f(Σ_{z∈Z} (X̄ + z + g(X̄, z)) P(z)) = h_f(Σ_{z∈Z} (X̄ + z + z̃) P(z)), (8)

where X̄ = f(X) as defined in Eq. 5. Note that the CLS token of X̄ is used when calculating X̄ + z + z̃ in Eq. 8. In general, confounders are defined to be data-agnostic, i.e., each sample in a dataset shares the same confounders. However, in Eq. 8, we consider not only the data-agnostic confounders Z but also the data-modulated confounders Z̃. While debiasing with the data-agnostic confounders leads the model to alleviate dataset bias, by introducing the additional data-modulated confounders Z̃ we also mitigate in-sample spurious correlations, e.g., models tend to select the option that includes the visually salient 'object' in the video as an answer. We then apply the standard CE loss to perform the VideoQA task using the debiased logit with causal intervention:

L_causal = CE(Ŷ_f, Y). (9)

Training confounders with unanswerable questions. To alleviate the effect of confounders as in Eq. 7, it is important to let the confounder queries Z ∈ R^{M×D} capture the bias during training. To achieve this, we first construct two unanswerable questions, i.e., X_q = (x_q, x'_v) and X_v = (x'_q, x_v), by pairing the text x_q and video x_v from a sample X with label Y with the video x'_v and text x'_q from a different sample X' in the mini-batch having label Y'.
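The unanswerable-pair construction just described can be sketched as follows. The data are placeholder identifiers rather than real features, and the derangement trick for avoiding self-pairing is our illustrative assumption, not necessarily the paper's sampling scheme:

```python
import random

# Sketch of unanswerable-question construction: pair each sample's question
# (or video) with the video (or question) of a different sample in the batch.
random.seed(0)
batch = [("q%d" % i, "v%d" % i, "y%d" % i) for i in range(4)]  # (question, video, answer)

perm = list(range(len(batch)))
random.shuffle(perm)
# avoid pairing a sample with itself (otherwise the pair would be answerable)
perm = [(j + 1) % len(batch) if j == i else j for i, j in enumerate(perm)]

X_q, X_v = [], []
for i, j in enumerate(perm):
    q, v, y = batch[i]
    q2, v2, _ = batch[j]
    X_q.append(((q, v2), y))   # X_q = (x_q, x'_v): answer Y matches the text only
    X_v.append(((q2, v), y))   # X_v = (x'_q, x_v): answer Y matches the video only

assert all(i != j for i, j in enumerate(perm))
```

Forcing h_g to predict Y from `X_q` (where only the question is related to Y) is what pushes the text-type confounder queries toward pure text bias, and symmetrically for `X_v` and video bias.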
We then force the model to predict Y or Y' from these unanswerable questions. Since the model cannot predict the proper answer for an unanswerable pair by relying on a single modality alone but is forced to do so, it inevitably learns the text or video bias. Therefore, the model comes to rely only on the spurious correlations to predict an answer. Fig. 4 shows the loss functions to train confounders with unanswerable questions. In detail, the two unanswerable pairs are forwarded to the encoders f and g with the confounders Z as:

X̄_q, X̄_v = f(X_q), f(X_v),
Z̃_q, Z̃_v = {z̃ | z̃ = g(X̄_q, z), ∀z ∈ Z}, {z̃ | z̃ = g(X̄_v, z), ∀z ∈ Z}, (10)

where Z̃_q, Z̃_v ∈ R^{M×D}. In Eq. 10, Z̃_q and Z̃_v denote the output features of the confounders Z cross-attended with the unanswerable pairs X_q and X_v, respectively. As mentioned above, since the confounders are divided into two parts to learn the text and video bias, we split both Z̃_q and Z̃_v into two parts and feed them to the FFN layer h_g to output modality-biased predictions:

Z̃_{q,q}, Z̃_{v,q}, Z̃_{q,v}, Z̃_{v,v} = Z̃_q[0 : M/2], Z̃_v[0 : M/2], Z̃_q[M/2 : M], Z̃_v[M/2 : M],
Ŷ_g^{(q,q)}, Ŷ_g^{(v,q)}, Ŷ_g^{(q,v)}, Ŷ_g^{(v,v)} = h_g(Z̃_{q,q}), h_g(Z̃_{v,q}), h_g(Z̃_{q,v}), h_g(Z̃_{v,v}). (11)

Here, the former letter in the superscript of Ŷ_g^{(*,*)} denotes the input modality that comes from the original pair X when constructing the unanswerable pair (e.g., the input question of Ŷ_g^{(q,*)} is taken from X_q), and the latter denotes the modality the output should be biased towards. For instance, Ŷ_g^{(v,q)}, an output of g given the input X_v = (x'_q, x_v), is desired to be text-biased, while Ŷ_g^{(v,v)}, also an output of g given the input X_v = (x'_q, x_v), is desired to be video-biased. In other words, Z̃_{*,q} and Ŷ_g^{(*,q)} denote the text bias and a text-biased output from g. Similarly, Z̃_{*,v} and Ŷ_g^{(*,v)} denote the video bias and a video-biased output from g.
Then, the loss function for training the confounders to satisfy the properties mentioned above is as follows:

L_confounder = GCE(Ŷ_g^{(q,q)}, Y) + GCE(Ŷ_g^{(v,q)}, Y') + GCE(Ŷ_g^{(q,v)}, Y') + GCE(Ŷ_g^{(v,v)}, Y), (12)

where GCE is the Generalized Cross Entropy (Zhang & Sabuncu, 2018) loss, discussed further below. For GCE(Ŷ_g^{(q,q)}, Y) in Eq. 12, Ŷ_g^{(q,q)} is a text-biased output desired to match Y given the input X_q = (x_q, x'_v), i.e., it is forced to learn the text bias regardless of the video, since the question x_q and the answer Y come from the same pair, while the irrelevant video x'_v comes from another sample in the mini-batch. In the same way, GCE(Ŷ_g^{(q,v)}, Y'), where the input corresponding to Ŷ_g^{(q,v)} is (x_q, x'_v), is encouraged to learn the video bias because x'_v and Y' come from the same pair. We can therefore train the confounders to learn both the text bias and the video bias with the unanswerable questions in Eq. 12. We can also amplify the bias and train the confounders to be more biased by adopting GCE:

GCE(p(x; θ), y) = (1 − p_y(x; θ)^q) / q, (13)

where p(x; θ) is an output probability parameterized by θ, q ∈ (0, 1] is a smoothing parameter, and y is the ground truth. The gradient of the GCE loss is p_y(x; θ)^q times that of the standard CE loss, i.e., ∂GCE/∂θ = p_y(x; θ)^q · ∂CE/∂θ. Therefore, the GCE loss leads the model to be biased by placing larger weight on 'easier' samples with high confidence scores (Lee et al., 2021; Nam et al., 2020), inducing the model to overfit to easy shortcuts. Further details of GCE are in the supplement. The final loss function of our proposed algorithm is then:

L = L_causal + L_confounder. (14)

At the inference phase, we only use Ŷ_f, the output of the causal intervention, for the prediction.
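The GCE loss in Eq. 13 is a one-liner; the following sketch (with an illustrative probability value) checks its two limiting behaviors, namely that q → 0 recovers the cross entropy −log p_y and that q = 1 gives the linear 1 − p_y:

```python
import math

# GCE(p, y) = (1 - p_y^q) / q, with p_y the probability of the ground-truth
# class (Eq. 13). The value p_y = 0.9 below is an illustrative 'easy' sample.
def gce(p_y, q):
    return (1.0 - p_y ** q) / q

p_y = 0.9
assert abs(gce(p_y, 1e-8) - (-math.log(p_y))) < 1e-6   # q -> 0 limit: cross entropy
assert abs(gce(p_y, 1.0) - (1.0 - p_y)) < 1e-12        # q = 1: linear (MAE-like)
```

Between these extremes, q tunes how much confident samples dominate training, which is exactly the knob the paper uses to amplify bias in the confounder branch.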
[Table 1: accuracy comparison with prior VideoQA methods, including HGA (Jiang et al., 2020), Bridge2Answer (Park et al., 2021), ClipBERT (Lei et al., 2021), VIOLET (Fu et al., 2021), MASN (Seo et al., 2021a), and QESAL (Liu et al., 2021).]

Prior probability P(z). For the prior probability P(z), we introduce a learnable parameter c ∈ R^M, i.e., P(z) = Softmax(c). However, since a large variance of P(z) can make the training unstable, we apply an exponential moving average (EMA) to P(z) to stabilize training. We also apply Dropout (Srivastava et al., 2014) to P(z) to regularize the model from overfitting to particular confounders.
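The EMA stabilization of the prior can be sketched as follows; the decay value, the number of confounders, and the perturbation of the logits c are all illustrative assumptions:

```python
import math

# Sketch of the EMA-stabilized prior P(z) = Softmax(c): the prior used for the
# intervention is a running average of the softmax of the learnable logits c,
# which damps step-to-step variance. Decay 0.99 is a hypothetical choice.
def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

decay = 0.99
c = [0.0, 0.0, 0.0, 0.0]       # learnable logits for M = 4 confounders
P_z_ema = softmax(c)           # initialize the running prior

for step in range(100):        # pretend training updates keep perturbing c
    c = [ci + 0.05 * ((step + i) % 3 - 1) for i, ci in enumerate(c)]
    P_z_ema = [decay * e + (1 - decay) * p for e, p in zip(P_z_ema, softmax(c))]

# a convex combination of distributions remains a distribution
assert abs(sum(P_z_ema) - 1.0) < 1e-9
```

Because each update is a convex combination, the running prior stays a valid distribution while individual softmax fluctuations are averaged out.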

4. EXPERIMENTS

In this section, we evaluate the performance of VoidQ under both the MCQA and OEQA settings on three benchmark VideoQA datasets: TGIF-QA, MSRVTT-QA, and MSVD-QA. In TGIF-QA, TGIF-Action and TGIF-Transition are conducted under the MCQA setting, predicting the proper answer among five options. For OEQA on TGIF-Frame, MSRVTT-QA, and MSVD-QA, we follow the conventional settings (Fu et al., 2021; Lei et al., 2021) to construct the answer candidates. In detail, the answer candidates of TGIF-Frame consist of the 1,540 most frequent answers in the training set; similarly, the 1,500 and 1,000 most frequent answers in the training set are selected as the answer candidates for the MSRVTT-QA and MSVD-QA datasets, respectively. We also perform ablation studies to show that VoidQ is robust and generalizes well under distribution shifts. Our extensive qualitative analyses demonstrate that the learnable confounder queries successfully capture the dataset bias, as illustrated by our variant of Grad-CAM. Descriptions of the datasets and implementation details are in the supplement.

4.1. QUANTITATIVE RESULTS

TGIF-QA, MSVD-QA, and MSRVTT-QA. We compare VoidQ with previous VideoQA methods in Tab. 1. On TGIF-QA, VoidQ outperforms ClipBERT and VIOLET, foundation models pretrained on large-scale datasets, by a margin of 1.0% on TGIF-Action and 1.6% on TGIF-Frame compared to VIOLET. The performance of VoidQ also improves by 0.6% on MSRVTT-QA compared to CoMVT, which is specifically designed for VideoQA tasks. On MSVD-QA, VoidQ obtains a 2.5% improvement over CMCIR, which also conducts causal intervention.

Ablation studies. Tab. 2 presents ablation studies of three components: the confounder encoder g, the unanswerable questions, and the GCE loss. Without the unanswerable questions to train the confounders, adding the confounder encoder g degrades the performance from (a) 43.5% to (b) 41.9%. This result is also evidence that the performance gain of VoidQ does not come solely from increased model complexity. On the other hand, row (c) shows that the unanswerable questions significantly improve performance, by a margin of 3.2% compared to (b). Adopting the GCE loss (d) leads to a further improvement, since the GCE loss helps the model better learn the confounders Z by amplifying the dataset bias.

Table 2: Ablation studies on MSVD. '-' for g, the confounder encoder, denotes that we do not conduct the causal intervention and use the conventional likelihood P(Y|X) for the prediction. UQ denotes the unanswerable questions with the standard CE loss. '✓' on both UQ and GCE stands for L_confounder.

To show the generalizability of VoidQ, we also conduct experiments under distribution/domain shift: we trained models on the MSRVTT training set and evaluated them on the MSVD test set (MSRVTT→MSVD). The Jensen-Shannon Divergence, i.e., JSD(P, Q) = (1/2)(D_KL(P∥R) + D_KL(Q∥R)) where R = (1/2)(P + Q), is adopted to quantify the label-distribution distance between the training and test sets.
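The JSD used above can be computed directly from its definition; the answer-frequency vectors below are toy values, not the actual dataset statistics:

```python
import math

# JSD(P, Q) = (KL(P||R) + KL(Q||R)) / 2 with R = (P + Q)/2, in nats.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

train = [0.6, 0.3, 0.1]   # e.g., toy answer frequencies on a training set
test  = [0.2, 0.3, 0.5]   # a shifted test distribution

assert jsd(train, train) == 0.0                 # identical distributions
assert 0.0 < jsd(train, test) <= math.log(2)    # JSD (nats) is bounded by ln 2
```

The symmetric, bounded nature of JSD is what makes it convenient for comparing train/test label distributions across datasets of different sizes.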
We observe that VoidQ provides a larger performance gain as the distribution shift increases. When both training and evaluating on MSVD, the JSD between the train and test sets is relatively small, JSD(train_MSVD, test_MSVD) = 0.07, and VoidQ obtains a 2.7% improvement from (a) 43.5% to (d) 46.2%. In contrast, the improvement increases to 4.7% when training on MSRVTT but evaluating on MSVD, where JSD(train_MSRVTT, test_MSVD) = 0.26. We conduct additional experiments comparing model performances on new test sets, including the standard test set, constructed by intentionally removing test samples whose answers belong to the top-1, 10, 20, 50, and 100 most frequent answer candidates. Such modification measures how much a model is statistically biased: if a model is highly biased towards the dataset statistics, it performs worse when the frequent answers are removed. We compare the performance of VoidQ against the base VideoQA model without any debiasing scheme, corresponding to (d) and (a) in Tab. 2, respectively. Fig. 5 illustrates the results. The performance gap between the base model and the proposed model increases as the discrepancy between the train and test sets enlarges: the gap of 2.7% between VoidQ and the baseline when the JSD is 0.08 dramatically increases up to 6.79% when the JSD grows to 0.61. Both experiments under distribution shift show that VoidQ effectively alleviates the statistical bias, helping the model perform well on datasets that differ from the training set.
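The frequent-answer-removal protocol above can be sketched as a simple filter; the answers and counts below are toy data chosen to mirror the paper's examples, not the real benchmark statistics:

```python
from collections import Counter

# Drop test samples whose answer is among the top-k most frequent training
# answers, producing progressively more 'shifted' test sets.
train_answers = ["two"] * 5 + ["three"] * 3 + ["ski"] * 1
test_set = [("sample%d" % i, a) for i, a in enumerate(["two", "three", "ski", "two"])]

def filter_topk(test_set, train_answers, k):
    top = {a for a, _ in Counter(train_answers).most_common(k)}
    return [(s, a) for s, a in test_set if a not in top]

assert len(filter_topk(test_set, train_answers, 1)) == 2  # 'two' removed
assert len(filter_topk(test_set, train_answers, 2)) == 1  # 'two' and 'three' removed
```

As k grows, the surviving test set contains only rare answers, so a model that leans on answer frequency loses exactly the samples it could previously shortcut.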

4.2. QUALITATIVE ANALYSES

How is the model debiased? Four examples in Fig. 6 illustrate how models' predictions are corrected via the causal intervention P(Y|do(X)). Without the causal intervention, i.e., P(Y|X) = h_f(f(X)), the model is prone to answer questions based on a single modality. For instance, in Fig. 6a, P(Y|X) predicts 'two' as the answer without considering the input video, since about 84% of answers to the 'How many' questions in the training set are 'two'. In another example, in Fig. 6b, P(Y|X) predicts 'animal' or 'panda' as an answer, focusing on visually salient objects and overlooking the input question. However, after conducting the causal intervention P(Y|do(X)), the outputs are corrected to 'three' and 'milk' by thoroughly considering the previously neglected video and text input, respectively. VoidQ also succeeds in predicting the answer 'ski' in Fig. 6c, although the answer 'play' appeared 51 times more often than 'ski' during training. Finally, Fig. 6d shows that VoidQ plausibly predicts 'stage', overcoming the dataset bias that 'side' and 'field' are the two most frequent answers to the 'Where' questions.

Where do the confounders look? We investigate which parts in a video or which words in a question are taken into account by the confounder queries to debias the predictions of a VideoQA model. To consider the gradient flows through Z̃, we modify Grad-CAM and Counterfactual Grad-CAM (Selvaraju et al., 2017), denoted as ∇(X; Z̃)Ŷ_f^t and ∇^c(X; Z̃)Ŷ_f^t, respectively. They are computed as:

∇(X; Z̃)Ŷ_f^t := ReLU(Σ_{z̃∈Z̃} (∂Ŷ_f^t/∂z̃) · (∂z̃/∂X)), ∇^c(X; Z̃)Ŷ_f^t := ReLU(Σ_{z̃∈Z̃} −(∂Ŷ_f^t/∂z̃) · (∂z̃/∂X)), (15)

where t is the target label in question. ∇(X; Z̃)Ŷ_f^t illustrates where the confounders Z̃ focus to bolster correct predictions. Conversely, ∇^c(X; Z̃)Ŷ_f^t reveals where the confounders Z̃ focus to suppress falsely correlated cues that cause bias. Fig. 7 illustrates ∇(X; Z̃)Ŷ_f^t and ∇^c(X; Z̃)Ŷ_f^t on the two examples shown in Fig.
6a and 6b, which contain text and video bias, respectively. Fig. 7a (left) reveals that "How many" is strongly highlighted by ∇^c(X; Z̃)Ŷ_f^t, implying the phrase negatively influenced the prediction of the correct answer. Interestingly, this matches our assumption that "How many" can be considered the confounder leading a model to predict 'two' as the answer. On the other hand, ∇(X; Z̃)Ŷ_f^t highlights "dancing" most, meaning that VoidQ focuses on the right context to output the number of 'dancing men'. Similarly, in Fig. 7b, which includes video bias, ∇^c(X; Z̃)Ŷ_f^t and ∇(X; Z̃)Ŷ_f^t focus on the 'panda' and the 'milk', respectively, matching our notion that looking at the 'panda' in the video hinders the model from predicting the correct answer, while looking at the 'milk' helps debiasing.
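For a purely linear toy model, the chain-rule sum in Eq. 15 can be computed in closed form, which makes the positive/counterfactual split easy to see. All matrices and values below are hypothetical; the real model obtains these gradients via automatic differentiation:

```python
# Toy linear sketch of the modified Grad-CAM (Eq. 15): with z_tilde = A @ X and
# Y_hat = w . z_tilde, the sum over z of (dY_hat/dz) * (dz/dX_i) is (w @ A)_i.
def relu(v):
    return [max(0.0, vi) for vi in v]

A = [[1.0, -2.0, 0.5],
     [0.0,  1.0, -1.0]]              # dz_tilde/dX for a linear confounder encoder
w = [2.0, -1.0]                      # dY_hat/dz_tilde for a linear prediction head

grad_X = [sum(w[k] * A[k][i] for k in range(len(w))) for i in range(len(A[0]))]
cam = relu(grad_X)                   # input parts that support the target answer
ccam = relu([-g for g in grad_X])    # counterfactual: parts that suppress it

assert cam == [2.0, 0.0, 2.0] and ccam == [0.0, 5.0, 0.0]
```

Note that `cam` and `ccam` highlight disjoint input positions by construction, mirroring how the paper separates bias-suppressing cues (e.g., "How many") from answer-supporting cues (e.g., "dancing").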

5. CONCLUSION

In this work, we propose a novel debiasing framework for VideoQA, dubbed VoidQ, which trains confounder queries by answering unanswerable questions and utilizes the trained confounders to remove the dataset bias via causal intervention. Concretely, we adopt causal intervention to cut off the relation between the confounders Z and the input X so that the model predicts the correct answer Y with the bias removed. Since the causal intervention is not applicable when confounders are unobserved, we additionally introduce a training scheme that leverages unanswerable questions to let the learnable confounder queries capture the dataset bias. We demonstrate the effectiveness of our method by validating the proposed architecture on various benchmark datasets, and provide qualitative analyses showing that the confounders are well learned to capture the dataset bias and are properly removed.

Structural Causal Model (SCM). SCM is a statistical model representing the causal relationships between variables in a graph structure (Glymour et al., 2016). In the causal graph, each variable is denoted by a node, and 'causation' between two different variables is denoted by a directed edge between nodes. Fig. 8a illustrates an example of an SCM in graph form. The edge X → Y in the graph implies that X is the 'cause' of Y. Also, Z in the graph represents a 'confounder', which simultaneously affects both X and Y, making it difficult to find the true effect of X on Y. Such a confounder induces a spurious correlation between X and Y through the backdoor path between them. A backdoor path is formally defined as any path from X to Y that starts with an arrow pointing to X (Yang et al., 2021), such as X ← Z → Y in Fig. 8a. To find the true causal relationship between X and Y, the causal intervention with do-calculus P(Y|do(X)) is applied to cut off the relationship Z → X, as illustrated in Fig. 8b, thereby removing the spurious correlation induced by Z.
Backdoor adjustment is a widely adopted approach to deconfound the effect of the confounders Z using the do-calculus, which we further concretize in the following section.

The backdoor adjustment. Given a directed acyclic graph consisting of X, Y, and Z as in Fig. 8a, the backdoor adjustment can be applied to reveal the true causal effect of X on Y given the confounder Z. By Bayes' theorem, P(Y|X) can be expressed as follows:

P(Y|X) = Σ_{z∈Z} P(Y|X, Z = z)P(Z = z|X).

The causal intervention with do-calculus P(Y|do(X)) mentioned in the previous section is then formally defined as:

P(Y|do(X)) = Σ_{z∈Z} P(Y|X, Z = z)P(Z = z).

Through the backdoor adjustment, the true causal relationship between X and Y, denoted as P(Y|do(X)), is measured without any effect of the confounder Z.

Normalized Weighted Geometric Mean (NWGM). To approximate P(Y|do(X)), we use NWGM. Before dealing with NWGM, we first revisit the definition of the Weighted Geometric Mean (WGM). Given a discrete variable X and its distribution P(X), the expectation of f(x) is defined as:

E_x[f(x)] = Σ_{x∈X} f(x)P(x).

The Weighted Geometric Mean (WGM), an approximation of E_x[f(x)], is defined as follows:

WGM(f(x)) = Π_{x∈X} f(x)^{P(x)}. (19)

If f(x) is a composition of a function g(x) followed by an exponential function, i.e., f(x) = exp(g(x)), Eq. 19 can be reformulated as:

WGM(f(x)) = Π_{x∈X} exp[g(x)]^{P(x)} = Π_{x∈X} exp[g(x)P(x)] = exp(Σ_{x∈X} g(x)P(x)) = exp{E_x[g(x)]}.

Interpreting WGM from the perspective of deep learning, f(x) can be regarded as a neural network whose last activation function is the softmax function. Therefore, Xu et al. (2015) and Yang et al.
(2021) approximate the expectation of f(x) using the WGM as follows:

E_x[f(x)] ≈ WGM(f(x)) = exp{E_x[g(x)]}.

To guarantee that the output logits can be interpreted as a probability, the NWGM, a normalized version of the WGM, is applied so that the output logits sum to one; it is formally defined as:

NWGM(f(x)) = Π_{x∈X} exp(g_j(x))^{P(x)} / Σ_j Π_{x∈X} exp(g_j(x))^{P(x)} = exp(E_x[g_j(x)]) / Σ_j exp(E_x[g_j(x)]) = Softmax(E_x[g(x)]).   (22)

Adopting the WGM defined above in our model, P(Y|do(X)) can be approximated as below, where P(Y|X, z) = Softmax(g(X, z)) ∝ exp(g(X, z)):

P(Y|do(X)) = E_z[P(Y|X, z)] = E_z[exp(g(X, z))] ≈ exp(E_z[g(X, z)]) = exp{Σ_{z∈Z} (f(X) + z + ẑ) P(z)},   (23)

where g(X, z) = f(X) + z + ẑ. Then, we apply the NWGM to normalize Eq. 23 to get the final deconfounded prediction probabilities P(Y|do(X)) as follows:

P(Y|do(X)) ≈ Softmax(E_z[g(X, z)]) = Softmax{Σ_{z∈Z} (f(X) + z + ẑ) P(z)}.

B.2 GENERALIZED CROSS ENTROPY (GCE) LOSS

GCE. The GCE loss was first proposed by Zhang & Sabuncu (2018) as a generalized loss taking advantage of both the Mean Absolute Error (MAE) loss and the Categorical Cross Entropy (CCE) loss. Given an input x, the ground-truth one-hot vector y, and the set of parameters θ of the classifier f, the MAE and CCE losses are formally defined as below in the common case where the classification layer is followed by a softmax:

L_MAE(f(x; θ), y) = ||y - f(x; θ)||_1,
L_CCE(f(x; θ), y) = -Σ_{j=1}^{C} y_j log f_j(x; θ),

where C denotes the number of target classes, and y_j and f_j denote the j-th element of y and the j-th prediction of f, respectively. The gradients of the loss functions with respect to the parameters θ are as follows:

∂L_MAE(f(x; θ), y)/∂θ = -∇_θ f_y(x; θ),
∂L_CCE(f(x; θ), y)/∂θ = -(1/f_y(x; θ)) ∇_θ f_y(x; θ),   (26)

where f_y denotes the element of the output logit corresponding to the ground-truth label. As formulated in Eq. 26, CCE emphasizes samples with larger 1/f_y(x; θ), i.e., smaller f_y(x; θ).
On the contrary, MAE treats every sample with the same weight. The fact that MAE does not place a larger weight on difficult samples makes it robust to noisy labels, but it also makes training harder, since every sample is treated equally and challenging examples are not learned sufficiently. In contrast, optimizing a model with CCE is easier because larger weights are given to challenging samples. However, CCE is sensitive to noisy labels, since the model can easily overfit to noisy samples, which are intrinsically difficult due to label noise. The GCE loss can then be viewed as a generalization interpolating between the MAE and CCE losses, and is formally defined as below:

L_GCE(f(x; θ), y) = (1 - f_y(x; θ)^q) / q,

where q ∈ (0, 1] is a smoothing parameter. The gradient of L_GCE with respect to θ is as follows:

∂L_GCE(f(x; θ), y)/∂θ = f_y(x; θ)^q (-(1/f_y(x; θ)) ∇_θ f_y(x; θ)) = f_y(x; θ)^q ∂L_CCE/∂θ = -f_y(x; θ)^{q-1} ∇_θ f_y(x; θ) = f_y(x; θ)^{q-1} ∂L_MAE/∂θ.

Therefore, compared to the CCE loss, L_GCE additionally weights each sample by a factor of f_y(x; θ)^q, weighting difficult samples less. Compared to the MAE loss, it weights each sample by a factor of f_y(x; θ)^{q-1}, giving larger weight to difficult examples. If q is properly chosen, GCE can therefore act as a generalized loss that is more robust than CCE and easier to train than MAE, achieving a balanced trade-off between the two losses.
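The interpolation above can be checked numerically. The sketch below is a minimal implementation of the GCE loss on the ground-truth probability p_y; the helper names are ours, and the scalar form assumes a one-hot label.

```python
import numpy as np

def gce_loss(p_y, q):
    """Generalized cross entropy: (1 - p_y^q) / q, where p_y is the
    softmax probability assigned to the ground-truth class."""
    return (1.0 - p_y ** q) / q

def cce_loss(p_y):
    """Standard cross entropy on the ground-truth probability."""
    return -np.log(p_y)

# As q -> 0, GCE approaches CCE; at q = 1 it reduces to 1 - p_y,
# which is proportional to the MAE loss for a one-hot label.
p = 0.3
print(gce_loss(p, 1e-6), cce_loss(p))   # nearly equal
print(gce_loss(p, 1.0))                 # 0.7 = 1 - p

# Gradient weighting: |dL/dp| = p^(q-1). Hard samples (small p) get
# less weight than under CCE (p^-1) but more than under MAE (constant).
for q in (0.2, 0.7, 1.0):
    print(q, [p_ ** (q - 1) for p_ in (0.1, 0.5, 0.9)])
```

Smaller q behaves more like CCE (strong emphasis on hard samples); q = 1 behaves like MAE (uniform weighting).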

GCE Loss in Computer Vision

Exploiting the fact that the GCE loss gives smaller weights to 'difficult' examples than the conventional CCE loss, Lee et al. (2021) and Nam et al. (2020) propose capturing bias by training a 'biased network' with the GCE loss: the network overfits to easy samples, which correspond to the 'bias' or 'spurious correlation' existing in the dataset. Both works train the model with the GCE loss so that it becomes biased by focusing on 'easier' samples compared to conventional CCE.

B.3 GRADCAM

The standard GradCAM (Selvaraju et al., 2017) of the prediction Ŷ_f^t with respect to the input X can be calculated as:

∇_X Ŷ_f^t := ReLU(∂Ŷ_f^t/∂X) = ReLU(∂Ŷ_f^t/∂X̃ · ∂X̃/∂X + ∂Ŷ_f^t/∂Ẑ · ∂Ẑ/∂X),   (29)

since the information of the input X is divided into two streams, i.e., X̃ and Ẑ, and merged to make the prediction Ŷ_f^t. Here, t is the target label in question, so the visualization of Eq. 29 illustrates which parts of the input affect predicting the label t. However, Eq. 29 takes into account the gradient flowing through both X̃ and Ẑ, although we only want the flow through the confounders Ẑ in order to visualize where the confounders look. We therefore define and visualize the gradient through Ẑ as:

∇_(X;Ẑ) Ŷ_f^t := ReLU(∂Ŷ_f^t/∂Ẑ · ∂Ẑ/∂X) = ReLU(Σ_{ẑ∈Ẑ} ∂Ŷ_f^t/∂ẑ · ∂ẑ/∂X),

which is consistent with Eq. 15 of the main paper. The counterfactual GradCAM ∇^c_(X;Ẑ) Ŷ_f^t is defined as:

∇^c_(X;Ẑ) Ŷ_f^t := ReLU(Σ_{ẑ∈Ẑ} -∂Ŷ_f^t/∂ẑ · ∂ẑ/∂X),

showing where the confounders Ẑ focus to suppress the prediction Ŷ_f^t.

Algorithm 1 Overall Algorithm
Inputs: sample {X = (x_q, x_v), Y}, negative sample {X′ = (x′_q, x′_v), Y′}, confounder queries Z, number of confounder queries M
Parameters: prior probability c, data encoder f, confounder encoder g, FFNs {h_f, h_g}
1: X_q, X_v ← (x_q, x′_v), (x′_q, x_v)
2: X̃, X̃_q, X̃_v ← f(X), f(X_q), f(X_v)
3: Ẑ, Ẑ_q, Ẑ_v ← {ẑ | ẑ = g(X̃, z), ∀z ∈ Z}, {ẑ | ẑ = g(X̃_q, z), ∀z ∈ Z}, {ẑ | ẑ = g(X̃_v, z), ∀z ∈ Z}
4: Ŷ_f ← h_f(Σ_{z∈Z} (X̃ + z + ẑ) c_z)  ▷ c_z is the prior probability of z
5: Ẑ_{q,q}, Ẑ_{q,v}, Ẑ_{v,q}, Ẑ_{v,v} ← Ẑ_q[0 : M/2], Ẑ_q[M/2 : M], Ẑ_v[0 : M/2], Ẑ_v[M/2 : M]
6: Ŷ_g^{(q,q)}, Ŷ_g^{(q,v)}, Ŷ_g^{(v,q)}, Ŷ_g^{(v,v)} ← h_g(Ẑ_{q,q}), h_g(Ẑ_{q,v}), h_g(Ẑ_{v,q}), h_g(Ẑ_{v,v})
7: L_causal ← CE(Ŷ_f, Y)
8: L_confounder ← GCE(Ŷ_g^{(q,q)}, Y) + GCE(Ŷ_g^{(q,v)}, Y′) + GCE(Ŷ_g^{(v,q)}, Y′) + GCE(Ŷ_g^{(v,v)}, Y)
9: L ← L_causal + L_confounder
10: return L

C EXPERIMENTAL SETTINGS

C.1 DATASET

We validate the proposed model on three benchmark datasets: TGIF-QA (Li et al., 2016;
Jang et al., 2017), MSVD-QA (Chen & Dolan, 2011; Xu et al., 2017), and MSRVTT-QA (Xu et al., 2016; 2017). TGIF-QA consists of 103,913 QA pairs from 56,720 GIFs and includes three multiple-choice VideoQA tasks: repetition count, repeating action, and state transition, along with an open-ended FrameQA task reasoning on a single frame. MSVD-QA and MSRVTT-QA are both open-ended VideoQA datasets with descriptive QA tasks; MSRVTT-QA consists of 10,000 longer and more complex trimmed videos and 243,000 QA pairs, compared to MSVD-QA with 1,970 trimmed videos and 50,500 QA pairs.

C.2 IMPLEMENTATION DETAILS

Model architecture. We adopt the Transformer (Vaswani et al., 2017) architecture with 12 layers for both the data encoder f and the confounder encoder g. Concretely, for the data encoder f, visual tokens x_v and text tokens x_q are concatenated with an additional [CLS] token to form an input X = (x_q, x_v) ∈ R^{N×D}. To build x_v, we sample 3 frames per input video. Each frame has a spatial resolution of 224×224 and is patchified into 14×14 patches of size 16×16 each. For the text tokens x_q, we set 40 as the maximum length of the input text sequence. The input text is then tokenized to have a hidden dimension of D = 768. After concatenating x_q and x_v, a modality encoding is added to the input tokens of each modality. When conducting cross-attention in g, we apply a stop-gradient operation to X̃ so that it is not affected by L_confounder. We use M = 128 confounder query tokens.

Training details. For training, the initial learning rate is set to 10^{-4} with cosine decay, and warmup is applied until 10% of the total training steps are done. We train the models with the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay rate of 0.01. The probability of confounder dropout is 0.15.
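As a sanity check on the token bookkeeping implied by these settings, the arithmetic below derives the number of visual tokens per video and an upper bound on the input sequence length N. Whether special tokens beyond [CLS] are counted is our assumption.

```python
# Token bookkeeping implied by the implementation details above.
frames = 3
resolution = 224
patch = 16
patches_per_side = resolution // patch        # 14
patches_per_frame = patches_per_side ** 2     # 14 x 14 = 196 patches
video_tokens = frames * patches_per_frame     # 3 frames -> 588 visual tokens
max_text_tokens = 40                          # max text sequence length
cls_tokens = 1

# Upper bound on N (extra special tokens, if any, are not counted here).
total = cls_tokens + max_text_tokens + video_tokens
print(patches_per_frame, video_tokens, total)  # 196 588 629
```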
Our backbone encoders are pretrained on WebVid (Bain et al., 2021), YT-Temporal 180M (Zellers et al., 2021), HowTo100M (Miech et al., 2019), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and SBU (Ordonez et al., 2011), as in Fu et al. (2021) and Wang et al. (2022). All experiments are conducted on 4 × Tesla A100 GPUs.

MCQA details. We concatenate each option with the question and insert a [SEP] token between them to construct the text token sequence. To efficiently calculate L_confounder, we only take into account the two negative pairs (X_q, Y) and (X_v, Y′) instead of the four negative pairs including (X_q, Y′) and (X_v, Y), i.e., L_confounder = GCE(Ŷ_g^{(q,q)}, Y) + GCE(Ŷ_g^{(v,q)}, Y′). This is because it is cumbersome to forward all combinations of negative pairs, including the concatenated text token sequences for each option.
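The MCQA input construction described above can be sketched as follows. The exact formatting (spacing, ordering of question and option) is our assumption; only the question-[SEP]-option concatenation is stated in the text.

```python
def build_mcqa_text(question, options, sep="[SEP]"):
    """Sketch of the MCQA text input: each option is concatenated with the
    question, separated by a [SEP] token (one sequence per option)."""
    return [f"{question} {sep} {opt}" for opt in options]

# Toy example with two candidate options.
seqs = build_mcqa_text(
    "What does the man do after he smiles?",
    ["dump ice water on himself", "shake head"],
)
print(seqs[0])
```

Each of the resulting sequences is tokenized and scored independently, which is why forwarding all negative-pair combinations becomes costly in the MCQA setting.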

D OVERALL ALGORITHM

The overall algorithm to train our proposed framework is formulated in Alg. 1.
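Lines 7-9 of Alg. 1 (the loss assembly) can be sketched as below, assuming the debiased logits Ŷ_f and the four biased predictions Ŷ_g^(*,*) have already been computed; all logits and answer indices here are toy placeholders, and the CE/GCE helpers are minimal scalar versions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ce(logits, y):
    """Cross entropy against ground-truth index y."""
    return -np.log(softmax(logits)[y])

def gce(logits, y, q=0.7):
    """Generalized cross entropy against ground-truth index y."""
    return (1.0 - softmax(logits)[y] ** q) / q

rng = np.random.default_rng(0)
num_answers = 5
y, y_neg = 2, 4                          # answers of the sample / negative sample
yf = rng.normal(size=num_answers)        # debiased prediction logits (from h_f)
yg = {k: rng.normal(size=num_answers)    # biased predictions from confounders (h_g)
      for k in ["qq", "qv", "vq", "vv"]}

# Alg. 1, lines 7-9: unanswerable pairs (q,v) and (v,q) are supervised with
# the negative sample's answer Y', answerable ones with Y.
L_causal = ce(yf, y)
L_confounder = (gce(yg["qq"], y) + gce(yg["qv"], y_neg)
                + gce(yg["vq"], y_neg) + gce(yg["vv"], y))
L = L_causal + L_confounder
print(L)
```

The GCE terms keep the confounder branch biased (focusing on easy, spurious cues), while the CE term trains the deconfounded prediction.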

E FURTHER QUALITATIVE ANALYSES

E.1 DEBIASED PREDICTION

Fig. 9 illustrates how the models' predictions are corrected via the causal intervention P(Y|do(X)). We discuss the detected biases for three representative question types in TGIF-QA.

TGIF-Action. As shown in Fig. 9a, the model tends to predict 'shake head' without considering the visual context before the causal intervention. After the causal intervention, the prediction is corrected to 'rub something with fingers'. We believe this case is biased toward text, since 'shake head' co-occurs 141 times more often than 'rub' with the word 'man' in the question. Here, the word 'man' serves as the confounder inducing the text bias.

TGIF-Transition. Fig. 9b also shows text bias. In this case, the 'smile' & 'man' pair co-occurs 169 times more often than the 'dump' & 'man' pair, which leads the model to predict 'smile' considering only the text. However, after the causal intervention with P(Y|do(X)), the model predicts the answer correctly.

As shown in Tab. 4, the model performs best when the number of confounders is near 64 or 128. In our experiments, we used 128 confounder queries.



Additional descriptions of the causal inference are in Sec. 3.1.

The first M/2 confounders Z[0 : M/2] denote text confounders and the latter M/2 confounders Z[M/2 : M] denote video confounders.



Figure 1: Dataset statistics of the MSVD-QA dataset and an example of a biased answer. (Left) The majority of answers to the "How many" questions are 'two'. (Right) The model outputs the biased answer 'two' instead of the right answer 'four'.

Figure 3: VoidQ Architecture. We construct tokenized text tokens x_q and patchified video tokens x_v from the input question and video, then concatenate them to form X and feed them into the data encoder f. Learnable confounder queries Z are fed to the confounder encoder g, where they cross-attend to the output features of f, i.e., X̃. Z is trained to learn the dataset bias by minimizing L_confounder between the ground truth and a biased prediction Ŷ_g generated from Ẑ through h_g. Causal intervention utilizing the learned confounders Z and Ẑ is applied to generate the final debiased prediction Ŷ_f from X̃ through h_f.

Figure 2: Causal Inference. X is a cause, Y is an effect, and Z is a set of confounders.

Figure 4: Objective functions with unanswerable questions. The first letter in the superscripts of the four outputs Ŷ_g^{(*,*)} denotes the mixed input pair (X_q or X_v) and the second denotes the confounder type (text or video).

Figure 5: Distribution shift. X-axis: JSD between the train and modified test sets. Y-axis: accuracy difference between the baseline and ours.

GradCAM on MSVD. We visualize ∇^c_(X;Ẑ) Ŷ_f^t and ∇_(X;Ẑ) Ŷ_f^t on two samples to show where the confounders look in the input.

Figure 8: Illustration of Structural Causal Model (SCM) and do-calculus definition.

Figure 9: Qualitative results on TGIF. Confidence scores of the top-5 predicted answers using conventional likelihood P (Y |X) and causal intervention P (Y |do(X)). Ground-truth answers are colored in red.

Comparison on TGIF-QA, MSVD-QA, and MSRVTT-QA. We report accuracy for all datasets. TGIF-Action and TGIF-Transition are MCQA, while TGIF-Frame, MSVD-QA, and MSRVTT-QA are OEQA.

Ablation study on the type of confounders. Z[0 : M], Z[0 : M/2], and Z[M/2 : M] refer to all confounder queries, text confounder queries, and video confounder queries, respectively.

Ablation study on the number of confounder queries.


TGIF-Frame. Fig. 9c illustrates a video-biased case. P(Y|X) is likely to focus on the visually salient object 'cat' without considering the question. By applying the causal intervention, the video bias is alleviated and the prediction is corrected from 'cat' to 'paw', considering both the video and the question.

E.2 GRADCAM VISUALIZATION

In the main paper, using variants of GradCAM, we investigated which words in a question or which parts of a video are taken into account by the confounder queries to debias the predictions of the VideoQA model, with GradCAM ∇_(X;Ẑ) Ŷ_f^t and counterfactual GradCAM ∇^c_(X;Ẑ) Ŷ_f^t on MSVD-QA. For TGIF-QA, Fig. 10a shows the same QA pair as Fig. 9b, which is biased toward the text. The word 'man' is strongly highlighted by the counterfactual GradCAM ∇^c_(X;Ẑ) Ŷ_f^t, implying that it negatively influences predicting the correct answer. On the other hand, GradCAM ∇_(X;Ẑ) Ŷ_f^t focuses on the word 'bucket' to output the correct answer 'dump ice water on himself'. This is consistent with our observation that the word 'man' acts as the text confounder hindering the model from predicting correctly. In Fig. 10b, the video-biased QA pair from Fig. 9c, the counterfactual GradCAM ∇^c_(X;Ẑ) Ŷ_f^t shows that the object 'cat/kitten', the visually salient object in the video, hinders the model from predicting the proper answer. However, GradCAM ∇_(X;Ẑ) Ŷ_f^t focuses on the object 'paw' in the video, so the model correctly predicts the answer. This indicates that the object 'cat' serves as the video confounder in the video-biased sample.
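The GradCAM variants used above reduce to simple gradient algebra once the per-confounder gradients are available. The sketch below assumes made-up shapes and random gradients (M confounder features of dimension D, N input tokens); it only illustrates how the two maps are complementary, not how the gradients are obtained from the actual model.

```python
import numpy as np

relu = lambda a: np.maximum(a, 0.0)

# Toy shapes: M confounder features of dim D, N input tokens.
M, D, N = 4, 8, 6
rng = np.random.default_rng(1)
dY_dz = rng.normal(size=(M, D))      # gradient of the target logit w.r.t. each z-hat
dz_dX = rng.normal(size=(M, D, N))   # Jacobian of each z-hat w.r.t. the N inputs

# GradCAM through the confounder stream: where Z-hat supports the prediction.
gradcam = relu(sum(dY_dz[m] @ dz_dX[m] for m in range(M)))
# Counterfactual GradCAM: where Z-hat suppresses the prediction.
cf_gradcam = relu(sum(-dY_dz[m] @ dz_dX[m] for m in range(M)))

print(gradcam.shape)  # one relevance score per input token
```

Because the two maps are ReLUs of a signal and its negation, each input token is highlighted by at most one of them, matching the qualitative behavior above ('bucket' vs. 'man', 'paw' vs. 'cat').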

F FURTHER ABLATION STUDIES

F.1 CONFOUNDER QUERIES Z

As shown in Tab. 3, the performance slightly decreases by 0.8% when using only the text confounder queries, whereas it decreases by 1.9% when using only the video confounder queries. This indicates that the dataset has a stronger text bias than video bias.

