COUNTERFACTUAL VISION-LANGUAGE DATA SYNTHESIS WITH INTRA-SAMPLE CONTRAST LEARNING

Abstract

Existing Vision-Language (VL) benchmarks often contain exploitable biases. Most prior work only attempts to mitigate biases in semantically low-level, conventional visual question answering datasets such as VQA and GQA. However, these methods do not generalize to recently emerging, highly semantic VL datasets such as VCR, and they are also difficult to scale due to severe problems such as high labeling cost and drastic disruption of the data distribution. To resolve these problems and to address other biases unique to VCR-like datasets, we first conduct an in-depth analysis and identify important biases in the VCR dataset. We further propose a generalized solution that synthesizes counterfactual image and text data based on the original query's semantic focus while introducing less distortion to the data distribution. To utilize our synthesized data, we also design an innovative intra-sample contrastive training strategy to assist QA learning in Visual Commonsense Reasoning (VCR). Moreover, our synthesized VL data also serve as a highly semantic, debiased benchmark for evaluating future VL models' robustness. Extensive experiments show that our proposed synthesized data and training strategy improve existing VL models' performance on both the original VCR dataset and our proposed debiased benchmark.
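The intra-sample contrastive idea above can be sketched as an InfoNCE-style loss computed within a single sample: the original (factual) image-text pair serves as the positive, while the counterfactual pairs synthesized from that same sample act as negatives. The following minimal NumPy sketch illustrates this under stated assumptions; the function name, embedding inputs, and temperature value are illustrative, not the paper's actual implementation.

```python
import numpy as np

def intra_sample_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE restricted to one sample.

    anchor:    (d,)  embedding of the query (e.g., question representation)
    positive:  (d,)  embedding of the factual image-text pair
    negatives: (k, d) embeddings of counterfactual pairs synthesized
               from the SAME sample (hypothetical inputs for illustration)
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    anchor = normalize(anchor)
    # Row 0 is the factual pair; rows 1..k are the counterfactuals.
    candidates = normalize(np.vstack([positive, negatives]))
    logits = candidates @ anchor / temperature      # cosine similarities / T
    logits -= logits.max()                          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                            # -log p(factual | anchor)
```

The loss pulls the anchor toward the factual pair and pushes it away from the counterfactuals of the same sample, which is what lets the contrast stay focused on the query's semantics rather than on dataset-level statistics.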



Figure 1: An example from VCR and the paired visual Grad-CAM result from a finetuned VL-BERT Su et al. (2019). With training, we expect the VL model to integrate multimodal information and commonsense when selecting the correct answer (labelled by a green check). For instance, to answer this question, we expect the model to focus on [person6] on the left and the people in the center. However, the model fails to pick up the correct visual clue and instead focuses on an irrelevant entity, the window in the background. The orange words are the words that overlap between the correct choice and the question.

1 Introduction

Recently, many works have explored Vision-Language (VL) models' learning of high-level semantics from image and text data. As a result, many VQA-like benchmarks such as GQA Hudson & Manning (2019), VQA Li et al. (2018), VCR Zellers et al. (2019) and SNLI-VE Xie et al. (2019) were proposed to evaluate models' abilities in learning visual commonsense and reasoning. Despite recent

