ACAT: ADVERSARIAL COUNTERFACTUAL ATTENTION FOR CLASSIFICATION AND DETECTION IN MEDICAL IMAGING

Abstract

In medical imaging tasks and other settings where only small parts of the image are informative for classification, traditional CNNs can struggle to generalise. Manually annotated Regions of Interest (ROI) are sometimes used to isolate the most informative parts of the image. However, these are expensive to collect and may vary significantly across annotators. To overcome these issues, we propose a method to generate ROIs from saliency maps, obtained from adversarially generated counterfactual images. With this method, we are able to isolate the area of interest in brain and lung CT scans without using any manual annotations. In the task of localising the lesion location out of 6 possible regions in brain CT scans, our saliency maps obtain a score of 65.05%, improving on the 61.29% obtained with the best competing method. We then employ the saliency maps in a framework that refines a classification pipeline: in particular, the saliency maps are used to obtain soft spatial attention masks that modulate the image features at different scales. We refer to our method as Adversarial Counterfactual Attention (ACAT). ACAT increases the baseline classification accuracy of lesions in brain CT scans from 71.39% to 72.55% and of COVID-19 related findings in lung CT scans from 67.71% to 70.84%, exceeding the performance of competing methods.

1. INTRODUCTION

In computer vision classification problems, it is often assumed that an object representing a class occupies a large part of the image. However, in other image domains, such as medical imaging or histopathology, only a small fraction of the image contains information relevant to the classification task (Kimeswenger et al., 2019). With object-centric images, wider contextual information (e.g. planes fly in the sky) and global features can aid the classification decision. In medical images, variations in parts of the image away from the local pathology are often normal, and any apparent signal from such regions is usually spurious and unhelpful in building robust classifiers. Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Szegedy et al., 2017; Huang et al., 2017a) can struggle to generalise well in such settings, especially when training cannot be performed on a very large amount of data (Pawlowski et al., 2019). This is at least partly because the convolutional structure necessitates some additional 'noisy' statistical response to filters away from the informative 'signal' regions. Because the 'signal' region is small and the noise region is potentially large, this can result in a low signal-to-noise ratio in convolutional networks, impacting performance. To aid localisation of the most informative parts of the image in medical imaging applications, Region of Interest (ROI) annotations are often collected (Cheng et al., 2011; Papanastasopoulos et al., 2020). However, these annotations require expert knowledge, are expensive to collect, and opinions on the ROI for a particular case may vary significantly across annotators (Grünberg et al., 2017). Alternatively, attention systems can be applied to locate the critical regions and aid classification.
Previous work has explored the application of attention mechanisms over image features, aiming to capture the spatial relationship between features (Bell et al., 2016; Newell et al., 2016; Santoro et al., 2017), the channel relationship (Hu et al., 2018) or both (Woo et al., 2018; Wang et al., 2017). Other authors employed self-attention to model non-local properties of images (Wang et al., 2018; Zhang et al., 2019). However, in our experiments, attention methods applied to the image features failed to improve the baseline accuracy in the classification of brain and lung CT scans. Other authors employed saliency maps to promote the isolation of the most informative regions during training of a classification network, sometimes relying on ground-truth target maps to generate these saliency maps (Murabito et al., 2018). Moreover, by fusing salient information with the image branch at a single point of the network (Murabito et al., 2018; Flores et al., 2019; Figueroa-Flores et al., 2020), these approaches may miss important information: when the signal is weak, key information could be captured by local features at one stage of the network but not by features at a different scale. We propose to use counterfactual images, obtained with a technique similar to adversarial attacks (Huang et al., 2017b), as a means to acquire saliency maps that highlight useful information about a particular patient's case. In general, counterfactual examples display the change that has to be applied to the input image for the decision of a black-box model to change. Our method achieves good isolation of the area of interest without requiring any annotation masks. In particular, to generate counterfactual examples, we employ an autoencoder and a trained classifier to find the minimal movement in latent space that shifts the input image towards the target class, according to the output of the classifier.
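The counterfactual generation step described above can be sketched as a small optimisation in latent space. The following is an illustrative implementation, not the paper's exact procedure: the perturbation `delta`, its L2 penalty `reg`, and all hyperparameters are assumptions for the sake of the example. A pretrained `encoder`, `decoder` and `classifier` are taken as given.

```python
import torch
import torch.nn.functional as F

def counterfactual_saliency(x, encoder, decoder, classifier, target_class,
                            steps=100, lr=0.05, reg=0.1):
    """Illustrative sketch: search for the minimal latent shift that moves
    the classifier's output for x towards target_class, then take the
    pixel-wise difference between the counterfactual and the input as a
    saliency map. Hyperparameters (steps, lr, reg) are placeholders."""
    z0 = encoder(x).detach()                       # latent code of the input
    delta = torch.zeros_like(z0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_cf = decoder(z0 + delta)                 # candidate counterfactual
        logits = classifier(x_cf)
        # push towards the target class while keeping the latent shift small
        loss = F.cross_entropy(logits, target_class) + reg * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    x_cf = decoder(z0 + delta).detach()
    saliency = (x_cf - x).abs()                    # highlight what changed
    return saliency, x_cf
```

Only `delta` is optimised; the autoencoder and classifier stay frozen, so the saliency map reflects the change the classifier needs to see, not a retrained model.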
These saliency maps can also be used in a classification pipeline, as shown in Figure 1, to obtain soft spatial attention masks that modulate the image features. To capture information at different scales, the attention masks are computed from the saliency features at different stages of the network and are also combined through an attention fusion layer to better inform the final decision of the network. The main contributions of this paper are the following: 1) we introduce a method to generate counterfactual examples, from which we obtain saliency maps that outperform competing methods in isolating small areas of interest in large images, achieving a score of 65.05% in the task of localising the lesion location out of 6 possible regions on brain CT scans (vs. 61.29% for the best competing method); 2) we propose ACAT, a framework that employs these saliency maps as attention mechanisms at different scales, and show that it improves the baseline classification accuracy in two medical imaging tasks (from 71.39% to 72.55% on brain CT scans and from 67.71% to 70.84% on lung CT scans); 3) we show how ACAT can also be used to evaluate saliency generation methods.
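One scale of the soft spatial attention described above could look like the module below. This is a hedged sketch, not the paper's exact layer: the single 1x1 convolution, the sigmoid mask, and the residual `(1 + mask)` gating are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SalientAttentionBlock(nn.Module):
    """Illustrative soft spatial attention at one scale: saliency-branch
    features are projected to a single-channel mask in (0, 1) that
    modulates the image-branch features of the same spatial size.
    The residual gating (1 + mask) is an assumed design choice that
    preserves the original image signal."""
    def __init__(self, sal_ch):
        super().__init__()
        self.to_mask = nn.Sequential(
            nn.Conv2d(sal_ch, 1, kernel_size=1),  # collapse channels
            nn.Sigmoid(),                          # soft mask in (0, 1)
        )

    def forward(self, img_feat, sal_feat):
        mask = self.to_mask(sal_feat)              # (B, 1, H, W)
        return img_feat * (1.0 + mask), mask       # modulated features
```

In a multi-scale pipeline, one such block would sit at each stage of the image branch, and the per-scale masks could then be passed to a fusion layer before the final classification head.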



Figure 1: Architecture of the framework proposed for 3D volumes. The slices of each volume are first processed separately and then combined by applying an attention module over the slices. For each volume we also consider as input the corresponding saliency map. From the saliency branch, we obtain soft spatial attention masks that are used to modulate the image features. The salient attention modules capture information at different scales of the network and are combined through an attention fusion layer to better inform the final classification.
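The slice-combination step in the caption above, where per-slice features are merged by attention over the slices, could be sketched as a learned attention pooling. The scoring function below (a single linear layer with softmax) is an assumption for illustration, not necessarily the module used in the paper.

```python
import torch
import torch.nn as nn

class SliceAttentionPool(nn.Module):
    """Illustrative attention pooling over the slices of a 3D volume:
    each slice's feature vector receives a learned scalar score, and the
    volume representation is the softmax-weighted sum of slice features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # per-slice relevance

    def forward(self, slice_feats):                # (B, S, D)
        w = torch.softmax(self.score(slice_feats), dim=1)  # (B, S, 1)
        return (w * slice_feats).sum(dim=1)        # (B, D)
```

This lets the network weight informative slices (e.g. those containing the lesion) more heavily than uninformative ones, instead of averaging all slices uniformly.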

