COUNTERFACTUAL VISION-LANGUAGE DATA SYNTHESIS WITH INTRA-SAMPLE CONTRAST LEARNING

Abstract

Existing Vision-Language (VL) benchmarks often contain exploitable biases. Most prior work has only attempted to mitigate biases in semantically low-level, conventional visual question answering datasets such as VQA and GQA. However, these methods do not generalize to recently emerging, highly semantic VL datasets like VCR, and they are also difficult to scale due to severe problems such as high labeling cost and drastic disruption of the data distribution. To resolve these problems and address other biases unique to VCR-like datasets, we first conduct an in-depth analysis and identify important biases in the VCR dataset. We then propose a generalized solution that synthesizes counterfactual image and text data based on the original query's semantic focus while introducing less distortion to the data distribution. To utilize our synthesized data, we also design an intra-sample contrastive training strategy to assist QA learning in Visual Commonsense Reasoning (VCR). Moreover, our synthesized VL data serve as a highly semantic, debiased benchmark for evaluating future VL models' robustness. Extensive experiments show that our proposed synthesized data and training strategy improve existing VL models' performance on both the original VCR dataset and our proposed debiased benchmark.

1. INTRODUCTION

Many problems like those mentioned above prevail in former methods and prevent them from generalizing to highly semantic VL datasets like VCR. To raise the community's attention to the biases of highly semantic VL datasets like VCR and to counter them, in this work we first conduct an in-depth analysis and identify biases unique to VCR. Second, we propose a generalized Counterfactual Vision-Language Data Synthesis (CDS) method to help counter the identified biases. CDS utilizes adversarial models to modify images and answer choices, creating synthesized positive and negative image and text data without drastically disturbing the data distribution the way direct occlusions do. Further, we show that CDS's synthesized data complement the VCR data to effectively mitigate the identified biases, and can even be integrated into a debiased evaluation benchmark to assess future models' robustness. To better leverage our synthesized data in training, we also propose an Intra-sample Contrastive Learning (ICL) framework that helps existing VL models focus on intra-sample differentiation among answer choices and images. Unlike Chen et al. (2020a) and Gokhale et al. (2020), ICL frees us from creating paired answers for negative synthesized images. With extensive experiments, we demonstrate that ICL with synthesized data can make existing VL models more robust to domain shifts in the data. In conclusion, our contributions are four-fold. Firstly, we identify significant biases in VCR and analyze VL models' over-reliance on text data. Secondly, we propose an innovative counterfactual VL data synthesis method, CDS, to mitigate the dataset biases; this is the first work to propose an adversarial VL data synthesis method for VCR. Thirdly, to better leverage our synthesized data in training, we further propose an intra-sample contrastive learning mechanism to assist the conventional QA learning with cross entropy loss. To
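To make the intra-sample contrastive idea concrete, the sketch below shows a generic InfoNCE-style loss computed within a single sample: the score of the correct (image, answer) pairing is contrasted against scores of counterfactual pairings (e.g., the same answer with a synthesized negative image, or distractor answers). This is a minimal illustration, not the paper's exact formulation; the function name, the temperature value, and the way scores are obtained are all assumptions for the example.

```python
import math

def intra_sample_contrastive_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss over scores from ONE sample.

    pos_score:  similarity score of the correct (image, answer) pairing.
    neg_scores: scores of counterfactual pairings within the same sample
                (synthesized negative image, or distractor answer choices).
    Returns -log softmax probability assigned to the positive pairing.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[0] - log_sum)

# The loss shrinks as the positive pairing scores higher than the
# counterfactual pairings from the same sample.
easy = intra_sample_contrastive_loss(1.0, [0.0, 0.0])
hard = intra_sample_contrastive_loss(0.0, [0.0, 0.0])
```

In practice such a term would be added to the conventional cross-entropy QA loss, so the model is rewarded both for picking the right answer and for separating it from its in-sample counterfactuals.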



Figure 1: An example from VCR and the paired visual Grad-CAM result from a finetuned VL-BERT-L Su et al. (2019). With training, we expect the VL model to integrate multimodal information and commonsense when selecting the correct answer (labelled by a green check). For instance, to answer this question, we expect the model to focus on [person6] on the left and the people in the center. However, the model fails to pick up the correct visual clue and instead focuses on an irrelevant entity, the window in the background. The orange words are the words shared between the correct choice and the question.

Recently, many works have explored Vision-Language (VL) models' learning of high-level semantics from image and text data. As a result, many VQA-like benchmarks such as GQA Hudson & Manning (2019), VQA Li et al. (2018), VCR Zellers et al. (2019) and SNLI-VE Xie et al. (2019) were proposed to evaluate models' abilities in visual commonsense learning and reasoning. Despite recent

