ODAM: GRADIENT-BASED INSTANCE-SPECIFIC VISUAL EXPLANATIONS FOR OBJECT DETECTION

Abstract

We propose gradient-weighted Object Detector Activation Maps (ODAM), a visual explanation technique for interpreting the predictions of object detectors. Utilizing the gradients of detector targets flowing into the intermediate feature maps, ODAM produces heat maps that show the influence of regions on the detector's decision for each predicted attribute. Compared to previous class activation map (CAM) works for classification, ODAM generates instance-specific explanations rather than class-specific ones. We show that ODAM is applicable to both one-stage and two-stage detectors with different types of backbones and heads, and produces higher-quality visual explanations than the state-of-the-art, both effectively and efficiently. We next propose a training scheme, Odam-Train, to improve the detector's explanation ability for object discrimination by encouraging consistency between explanations for detections on the same object, and distinct explanations for detections on different objects. Based on the heat maps produced by ODAM with Odam-Train, we propose Odam-NMS, which considers the model's explanation for each prediction to distinguish duplicate detections of the same object. We present a detailed analysis of the visual explanations of detectors and carry out extensive experiments to validate the effectiveness of the proposed ODAM.

1. INTRODUCTION

Significant breakthroughs have been made in object detection and other computer vision tasks due to the development of deep neural networks (DNNs) (Girshick et al., 2014b). However, the unintuitive and opaque decision process of DNNs makes them hard to interpret. As spatial convolution is a frequent component of state-of-the-art vision models, class-specific attention methods have emerged for interpreting CNNs, and have been used to identify failure modes (Agrawal et al., 2016; Hoiem et al., 2012), debug models (Koh & Liang, 2017), and establish appropriate user confidence in models (Selvaraju et al., 2017). These explanation approaches produce heat maps locating the regions in the input image that the model looked at, representing the influence of different pixels on the model's decision. Gradient visualization (Simonyan et al., 2013), perturbation (Ribeiro et al., 2016), and the Class Activation Map (CAM) (Zhou et al., 2016) are three widely adopted approaches for generating visual explanation maps. However, these methods have primarily focused on image classification (Petsiuk et al., 2018; Fong & Vedaldi, 2017; Selvaraju et al., 2017; Chattopadhay et al., 2018; Wang et al., 2020b;a), or its variants, e.g., visual question answering (Park et al., 2018), video captioning (Ramanishka et al., 2017; Bargal et al., 2018), and video activity recognition (Bargal et al., 2018). Generating explanation heat maps for object detectors is an under-explored area. The first work in this area is D-RISE (Petsiuk et al., 2021), which extends RISE (Petsiuk et al., 2018), a method for explaining image classifiers, to object detectors. As a perturbation-based approach, D-RISE first randomly generates a large number of binary masks, resizes them to the image size, and then perturbs the original input to observe the change in the model's prediction.
However, the large number of inference passes makes D-RISE computationally expensive, and the quality of its heat maps is sensitive to the mask resolution (e.g., see Fig. 1b). Furthermore, D-RISE only generates an overall heat map for the predicted object, which cannot show the influence of regions on specific attributes of a prediction, e.g., the class probability or the regressed bounding-box corner coordinates. The popular CAM-based methods for image classification, such as Grad-CAM (Selvaraju et al., 2017) and its variants, generate heat maps via a linear combination of channel weights and activation maps, and are not directly applicable to object detectors.
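To make the computational trade-off concrete, the perturbation procedure described above can be sketched as follows. This is a minimal RISE/D-RISE-style illustration, not the authors' implementation; `detector_score` is a hypothetical stand-in for running the detector and matching the target detection.

```python
import numpy as np

rng = np.random.default_rng(0)

def detector_score(image):
    """Hypothetical stand-in for a detector: returns a scalar score for
    one target detection. Here, a toy score that is high when the
    top-left 4x4 patch of the (masked) image is bright."""
    return image[:4, :4].mean()

def perturbation_saliency(image, n_masks=500, cells=4, p_keep=0.5):
    """Average random binary masks, each weighted by the detector score
    on the correspondingly masked image. Every mask costs one full
    inference pass, which is why this approach is expensive."""
    h, w = image.shape
    heat = np.zeros((h, w))
    for _ in range(n_masks):
        # low-resolution binary mask, upsampled to the image size;
        # the cell resolution directly limits heat-map sharpness
        small = (rng.random((cells, cells)) < p_keep).astype(float)
        mask = np.kron(small, np.ones((h // cells, w // cells)))
        heat += detector_score(image * mask) * mask
    return heat / n_masks

img = np.ones((16, 16))
sal = perturbation_saliency(img)
# Pixels covering the region that drives the score get higher saliency.
```

Note that 500 masks already means 500 detector evaluations for a single explanation, whereas a gradient-based method needs roughly one forward and one backward pass.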


Published as a conference paper at ICLR 2023

However, Grad-CAM provides class-specific explanations, and thus produces heat maps that highlight all objects in a category instead of explaining a single detection (e.g., see Fig. 1a). For object detection, the explanations should be instance-specific rather than class-specific, so as to discriminate each individual object. Exploring the spatial importance of different objects can help interpret the model's decisions and show the important areas in the feature maps for each prediction.

Considering that directly applying existing CAM methods to object detectors is infeasible, and given the drawbacks of the current state-of-the-art D-RISE, we propose gradient-weighted Object Detector Activation Maps (ODAM). ODAM adopts an assumption similar to Grad-CAM's: that feature maps correlate with concepts used for making the final outputs. Thus ODAM uses the gradients w.r.t. each pixel in the feature map to obtain the explanation heat map for each attribute of the object prediction. Compared with the perturbation-based D-RISE, ODAM is more efficient and generates less noisy heat maps (see Fig. 1c), while also explaining each attribute separately.

We also explore a unique explanation task for object detectors, object discrimination, which aims to explain which object was detected. This is different from the traditional explanation task of what features are important for class prediction (i.e., object specification). We propose a training scheme, Odam-Train, to improve the explanation ability for object discrimination by introducing consistency and separation losses. The training encourages the model to produce consistent heat maps for detections on the same object, and distinctive heat maps for detections on different objects (see Fig. 1d).
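The difference between a class-specific Grad-CAM map and an instance-specific, ODAM-style map can be illustrated with a toy example. This is a sketch under our own naming, not the paper's code: Grad-CAM pools the gradients into one weight per channel, while keeping the per-position gradients ties the map to a single detection.

```python
import numpy as np

def grad_cam(acts, grads):
    # class-specific: gradients are global-average-pooled into one
    # weight per channel, so the map inherits the whole activation pattern
    w = grads.mean(axis=(1, 2), keepdims=True)          # (C, 1, 1)
    return np.maximum((w * acts).sum(axis=0), 0.0)      # (H, W)

def odam_like(acts, grads):
    # instance-specific sketch: keep the gradient at every spatial
    # position as an element-wise weight before summing over channels
    return np.maximum((grads * acts).sum(axis=0), 0.0)  # (H, W)

# one feature channel activated by two objects of the same class
acts = np.zeros((1, 8, 8))
acts[0, 1:3, 1:3] = 1.0   # object A
acts[0, 5:7, 5:7] = 1.0   # object B

# gradients of the score for detecting object A only
grads = np.zeros((1, 8, 8))
grads[0, 1:3, 1:3] = 1.0

cam = grad_cam(acts, grads)    # lights up both A and B
odam = odam_like(acts, grads)  # stays on object A
```

Because the pooled channel weight in `grad_cam` is positive wherever the channel is active, both objects are highlighted, which is exactly the class-specific behavior described above; the element-wise weighting suppresses object B.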
We further propose Odam-NMS, which uses the instance-level heat maps from ODAM to aid the non-maximum suppression (NMS) process of removing duplicate predictions of the same object.

The contributions of our paper are summarized as follows:

1. We propose ODAM, a gradient-based visual explanation approach that produces instance-specific heat maps for explaining the prediction attributes of object detectors, and is more efficient and robust than the current state-of-the-art.
2. We demonstrate the generalizability of ODAM by exhibiting explanations on one-stage, two-stage, and transformer-based detectors with different types of backbones and detector heads.
3. We explore a unique explanation task for detectors, object discrimination, which explains which object was detected, and propose Odam-Train to obtain a model with better object discrimination ability.
4. We propose Odam-NMS, which uses the instance-level heat maps generated by ODAM with Odam-Train to remove duplicate predictions during NMS; its effectiveness verifies the object discrimination ability of ODAM with Odam-Train.
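As a rough illustration of the Odam-NMS idea (a deliberate simplification under our own assumptions, not the paper's algorithm), duplicates can be identified by comparing explanation heat maps rather than relying only on box overlap:

```python
import numpy as np

def heat_sim(a, b):
    """Cosine similarity between two flattened heat maps."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def odam_nms_sketch(scores, heats, sim_thr=0.8):
    """Greedy NMS variant: process detections in score order and
    suppress one only when its heat map is similar to an already-kept
    detection, i.e., when the model's explanation says it attends to
    the same object."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(heat_sim(heats[i], heats[j]) < sim_thr for j in keep):
            keep.append(int(i))
    return keep

blob_a = np.zeros((8, 8)); blob_a[1:3, 1:3] = 1.0
blob_b = np.zeros((8, 8)); blob_b[5:7, 5:7] = 1.0
# detections 0 and 1 explain the same object; detection 2 is distinct
heats = [blob_a, blob_a + 0.01, blob_b]
scores = np.array([0.9, 0.8, 0.7])
kept = odam_nms_sketch(scores, heats)  # drops the duplicate, keeps 0 and 2
```

The intended benefit over plain IoU-based NMS is that two heavily overlapping boxes on different objects (e.g., in a crowd) yield dissimilar heat maps and are both kept.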

2. RELATED WORKS

Object detection Object detectors are generally composed of a backbone, a neck, and a head. Based on the type of head, detectors can be divided mainly into one-stage and two-stage methods. Two-stage approaches perform two steps: generating region candidates (proposals), and then using RoI (Region of Interest) features for the subsequent object classification and location regression.
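The second stage described above extracts a fixed-size feature for each proposal. A minimal sketch of such RoI feature extraction is below; this is a toy max-pooling version on a single-channel map, whereas real two-stage detectors typically use RoIAlign with bilinear interpolation over multi-channel features.

```python
import numpy as np

def roi_pool(feature, box, out=2):
    """Crop the proposal region from a (H, W) feature map and max-pool
    it to a fixed out x out grid, so every proposal yields a feature of
    the same size for the classification/regression head."""
    x0, y0, x1, y1 = box
    crop = feature[y0:y1, x0:x1]
    h, w = crop.shape
    ys = np.linspace(0, h, out + 1).astype(int)
    xs = np.linspace(0, w, out + 1).astype(int)
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = crop[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

feat = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_pool(feat, (0, 0, 4, 4))  # a 4x4 proposal -> 2x2 RoI feature
```

The fixed output size is the key design point: it decouples the head's input shape from the proposal's size and aspect ratio.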

