EVALUATING WEAKLY SUPERVISED OBJECT LOCALIZATION METHODS RIGHT? A STUDY ON HEATMAP-BASED XAI AND NEURAL BACKED DECISION TREE

Abstract

Choe et al. have investigated several aspects of Weakly Supervised Object Localization (WSOL) with only image-level labels. They addressed the ill-posed nature of the problem and showed that WSOL has not significantly improved beyond the baseline method, class activation mapping (CAM). We report the results of similar experiments on ResNet50 with some crucial differences: (1) we perform WSOL using heatmap-based eXplainable AI (XAI) methods, and (2) we consider the XAI aspect of WSOL, in which localization is used as the explanation of a class prediction task. Under a similar protocol, we find that XAI methods perform WSOL with very sub-standard MaxBoxAcc scores. The experiment is then repeated for the same model trained with Neural Backed Decision Tree (NBDT), and we find that vanilla CAM yields significantly better WSOL performance after NBDT training.

1. INTRODUCTION

Weakly-supervised object localization (WSOL) aims to use only image-level labels (class labels) to perform localization. Compared to methods that require full annotations, WSOL can be much more resource efficient; it has therefore been widely studied (Choe & Shim, 2019; Singh & Lee, 2017; Zhang et al., 2018a; b; Zhou et al., 2016; Guo et al., 2021; Wei et al., 2021; Babar & Das, 2021; Gao et al., 2021; Xie et al., 2021). Class Activation Mapping (CAM) (Zhou et al., 2016) is a heatmap-based explainable artificial intelligence (XAI) method that enables a Convolutional Neural Network (CNN) to perform WSOL. Other heatmap-based XAI methods have been designed to compute relevance/attribution maps, some of which have been treated as localization maps after some processing; e.g. Saliency (Simonyan et al., 2014) has been used for WSOL using only the gradient (obtained from backpropagation) and minimal post-processing. In this paper, besides Saliency, we also investigate the WSOL capability of several other heatmap-based XAI methods: GradCAM (Selvaraju et al., 2016) (a generalization of CAM), Guided Backpropagation (GBP) (Springenberg et al., 2015) and DeepLift (Shrikumar et al., 2017). Admittedly, there are many other methods that are not included in this paper, e.g. Layerwise Relevance Propagation and its derivatives (Bach et al., 2015; Montavon et al., 2017; Kohlbrenner et al., 2020) and modifications of CAM (Muhammad & Yeasin, 2020; Wang et al., 2020; Jalwana et al., 2021; Kindermans et al., 2018). The main objective of this paper is to measure the WSOL capability of existing heatmap-based XAI methods applied to ResNet50 and to improve them. Fig. 1 shows how existing XAI methods can be very unsuitable for WSOL (e.g. highly granular heatmaps and uneven edge detection). This paper shows that it is possible to modify the aforementioned methods and improve their localization ability beyond baseline CAM. Important clarifications: 1. It is not our intention to invent yet another XAI method.
Instead, we add intermediate steps (based on CAM-like concepts) to existing techniques to improve WSOL performance. 2. We do not claim to attain the best localization. We are in fact testing the metric MaxBoxAcc presented at CVPR 2020 (Choe et al., 2020). In that paper, a dedicated training process is performed to optimize the said metric; in their GitHub repository, this training is simply called the WSOL training. While their training improved WSOL, their classification performance degraded as a trade-off, and we quote them: "There are in general great fluctuations in the classification results (26.8% to 74.5% on CUB)". This accuracy degradation can be found in their appendix, section C2. By contrast, we prioritize interpretability, hence our baseline is CAM without WSOL training. Instead of WSOL training, we use NBDT training (see later). Other XAI methods are tested on the same metric and compared to CAM.
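
To illustrate how a threshold-maximized box accuracy of the MaxBoxAcc kind can be computed, the following is a minimal sketch and not the reference implementation of Choe et al. (2020). It assumes heatmaps already normalized to [0, 1], takes the tightest box around all pixels above the threshold (a simplification of the box-extraction step), and uses hypothetical helper names (`box_from_heatmap`, `iou`, `max_box_acc`) and an assumed IoU criterion `delta = 0.5`.

```python
import numpy as np

def box_from_heatmap(heatmap, tau):
    """Tightest box (x0, y0, x1, y1), inclusive, around pixels with score >= tau.

    Simplifying assumption: all above-threshold pixels are boxed together.
    Returns None when no pixel passes the threshold.
    """
    ys, xs = np.nonzero(heatmap >= tau)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

def box_area(box):
    # Inclusive pixel coordinates, hence the +1 on each side length.
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def iou(a, b):
    """Intersection over union of two inclusive pixel boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = inter_w * inter_h
    return inter / (box_area(a) + box_area(b) - inter)

def max_box_acc(heatmaps, gt_boxes, taus, delta=0.5):
    """Best box accuracy over the candidate thresholds `taus`.

    For each tau, an image counts as a hit when the IoU between its
    predicted box and its ground-truth box is at least `delta`.
    """
    best = 0.0
    for tau in taus:
        hits = 0
        for heatmap, gt in zip(heatmaps, gt_boxes):
            box = box_from_heatmap(heatmap, tau)
            if box is not None and iou(box, gt) >= delta:
                hits += 1
        best = max(best, hits / len(heatmaps))
    return best

# Toy example: a bright 3x3 blob on a dim background.
toy = np.full((6, 6), 0.2)
toy[1:4, 2:5] = 0.9
score = max_box_acc([toy], [(2, 1, 4, 3)], taus=np.linspace(0.0, 1.0, 11))
```

Taking the maximum over thresholds is what removes the dependence on a single hand-picked operating threshold: in the toy example, a low threshold boxes the whole image, while a mid-range threshold isolates the blob.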

Summary of our contributions and results:

1. Vanilla CAM after Neural Backed Decision Tree (NBDT) training yields the highest performance, besting most other methods by a significant margin. Heatmaps derived from the Saliency method applied to the input layer obtain high scores as well, but the method requires a peculiarly low operating threshold.
2. With the proper application of CAM-like concepts, heatmaps obtained from the inner layers of existing XAI methods can perform WSOL that beats the original CAM without NBDT.
3. The original NBDT paper (Wan et al., 2021) trains larger models from scratch. However, we successfully fine-tuned a pre-trained ResNet50 without causing the collapse of predictive performance.

CAM: weighted sum of feature maps. To use a CNN for prediction, usually the last convolutional feature maps $f \in \mathbb{R}^{c \times H \times W}$ are average-pooled and fed into a fully-connected (FC) layer, i.e. $FC(AvgPool(f))$, where $c$ is the number of output channels before the FC layer and $H$ and $W$ are the height and width of the feature maps. Suppose our CNN classifies images into $K$ different categories. Denote the $i$-th feature map before pooling by $f_i$, where $i = 1, \dots, c$. The FC weight is then given by $w \in \mathbb{R}^{c \times K}$. To obtain the CAM for class $k \in \{1, \dots, K\}$, compute the weighted sum of $f$ across the $c$ channels, so that $\mathrm{CAM} = \sum_{i=1}^{c} f_i \, w_{ik}$. In this paper, we tested various weighting schemes, i.e. different $w_{ik}$, and other empirical modifications aimed at yielding better WSOL performance.
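
The weighted sum that defines CAM can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code used in our experiments: the function names `class_activation_map` and `fc_logits` are hypothetical, the FC bias is omitted, and random arrays stand in for real feature maps and weights.

```python
import numpy as np

def class_activation_map(f, w, k):
    """CAM for class k as the weighted sum over channels of the feature maps.

    f: feature maps of shape (c, H, W)
    w: FC weights of shape (c, K)
    k: class index in 0..K-1
    Returns an (H, W) array equal to sum_i f[i] * w[i, k].
    """
    return np.tensordot(w[:, k], f, axes=(0, 0))

def fc_logits(f, w):
    """FC(AvgPool(f)) with the bias omitted (an assumption of this sketch)."""
    pooled = f.mean(axis=(1, 2))  # global average pooling, shape (c,)
    return pooled @ w             # shape (K,)

# Stand-in data: c channels of H x W feature maps, K classes.
rng = np.random.default_rng(0)
c, H, W, K = 8, 7, 7, 5
f = rng.random((c, H, W))
w = rng.standard_normal((c, K))

cam = class_activation_map(f, w, k=2)

# Because pooling and the FC layer are both linear, the spatial mean of
# the CAM equals the bias-free logit of the same class.
assert np.allclose(cam.mean(), fc_logits(f, w)[2])
```

The final check makes the link between CAM and the prediction explicit: the heatmap is a spatial decomposition of the class logit, which is why it can serve as an explanation of the classification.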

2. RELATED WORKS AND LITERATURE REVIEW

WSOL Metric. Popular metrics to evaluate WSOL are Top-1 and Top-5 localization (accuracy or error) and GT-known localization accuracy (Choe & Shim, 2019; Singh & Lee, 2017; Zhang et al., 2018a; b; Zhou et al., 2016; Guo et al., 2021; Wei et al., 2021; Babar & Das, 2021; Gao et al., 2021; Xie et al., 2021). The problems with these simple metrics are well described in Choe et al. (2020). Firstly, WSOL with only image-level labels can be an ill-posed problem. Secondly, the dependence on the operating threshold τ may lead to misleading comparisons. Thus they introduced

