CAUSALITY COMPENSATED ATTENTION FOR CONTEXTUAL BIASED VISUAL RECOGNITION

Abstract

Visual attention does not always capture the essential object representation desired for robust predictions. Attention modules tend to highlight not only the target object but also the common co-occurring context that the module deems helpful during training. The problem is rooted in the confounding effect of the context, which leads to incorrect causalities between objects and predictions and is further exacerbated by visual attention. In this paper, to learn causal object features robust to contextual bias, we propose a novel attention module named Interventional Dual Attention (IDA) for visual recognition. Specifically, IDA adopts two attention layers with multiple-sampling intervention, which compensates the attention against the confounding context. Note that our method is model-agnostic and can thus be implemented on various backbones. Extensive experiments show that our model obtains significant improvements in classification and detection with lower computation. In particular, we achieve state-of-the-art results in multi-label classification on MS-COCO and PASCAL-VOC.

1. INTRODUCTION

The last several years have witnessed the huge success of attention mechanisms in computer vision. The key insight behind different attention mechanisms (Wang et al., 2018; Hu et al., 2018; Woo et al., 2018; Chen et al., 2017; Zhu & Wu, 2021; Dosovitskiy et al., 2020; Liu et al., 2021) is the same, i.e., emphasizing key facts in the input, while different aspects of information, such as feature maps and token queries, are considered. The impressive performance of these works shows that attention is proficient at fitting training data and exploiting valuable features. However, is the information deemed valuable by attention always helpful in real application scenarios? In Fig. 1(a), the Spatial Class-Aware Attention (SCA) considers the context dining table as the cause of the spoon, which helps the model make correct predictions even if the stressed region is wrong, but it fails in cases where the object is absent from the familiar context. Fig. 1(b) illustrates the problem quantitatively: we calculate the mIoU between the attention map (or the activation map for the baseline) and the ground-truth mask. As shown in the figure, although the model with attention gains some improvement measured by common evaluation metrics, the mIoU does not improve, which means attention does not capture more accurate target regions than the baseline. Moreover, Fig. 1(c) illustrates that the attention mechanism more easily attends to wrong context regions when training samples are insufficient. All the above evidence reveals that attention mechanisms may not always improve the performance of the model by attending to meaningful factors and can even be harmful, especially in out-of-distribution scenarios. Fortunately, causality provides a theoretical perspective on this problem (Pearl et al., 2000; Neuberg, 2003).
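The mIoU diagnostic described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name, the per-class array layout, and the 0.5 binarization threshold are our assumptions.

```python
import numpy as np

def attention_miou(attn_maps, gt_masks, threshold=0.5):
    """Mean IoU between thresholded attention maps and ground-truth masks.

    attn_maps: (C, H, W) per-class attention scores in [0, 1] (hypothetical layout)
    gt_masks:  (C, H, W) binary ground-truth object masks
    """
    ious = []
    for attn, mask in zip(attn_maps, gt_masks):
        pred = attn >= threshold                      # binarize the attention map
        inter = np.logical_and(pred, mask).sum()      # overlap with the object
        union = np.logical_or(pred, mask).sum()
        if union > 0:                                 # skip classes absent in both
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

A high classification score with a low `attention_miou` is exactly the symptom in Fig. 1(b): the model predicts correctly while attending to context rather than the object itself.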
The familiar context of an object is a confounder (Yue et al., 2020; Zhang et al., 2020; Wang et al., 2020) that confuses the causality between the object and its prediction. Even though the scene of the dining table is not the root cause of the spoon, the model is fooled into setting up a spurious correlation between them. On the other hand, the problem is tricky because context is naturally biased in the real world, and even common datasets annotated by experts (e.g., MS-COCO (Lin et al., 2014)) suffer from severe contextual bias. As is well known, networks equipped with attention can learn better representations of the datasets. However, the contextual bias in datasets can also be exacerbated when using the attention mechanism.
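Deconfounding of this kind is commonly formalized in the cited causality literature (Pearl et al., 2000) via the backdoor adjustment; this section does not state the formula explicitly, so we reproduce the standard form here. With object features $X$, prediction $Y$, and confounding context $C$, intervening on $X$ replaces passive conditioning with a stratified average over contexts:

\[
P(Y \mid do(X)) \;=\; \sum_{c} P(Y \mid X, C = c)\, P(C = c)
\]

Unlike $P(Y \mid X)$, which lets frequent contexts (e.g., the dining table) dominate, the adjusted estimate weights each context by its prior $P(C=c)$, cutting the spurious path from $C$ to $Y$.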

