ACAT: ADVERSARIAL COUNTERFACTUAL ATTENTION FOR CLASSIFICATION AND DETECTION IN MEDICAL IMAGING

Abstract

In some medical imaging tasks, and in other settings where only small parts of the image are informative for the classification task, traditional CNNs can struggle to generalise. Manually annotated Regions of Interest (ROIs) are sometimes used to isolate the most informative parts of the image. However, these are expensive to collect and may vary significantly across annotators. To overcome these issues, we propose a method to generate ROIs via saliency maps obtained from adversarially generated counterfactual images. With this method, we are able to isolate the area of interest in brain and lung CT scans without using any manual annotations. In the task of localising the lesion among 6 possible regions, our saliency maps obtain a score of 65.05% on brain CT scans, improving on the score of 61.29% obtained with the best competing method. We then employ the saliency maps in a framework that refines a classifier pipeline; in particular, the saliency maps are used to obtain soft spatial attention masks that modulate the image features at different scales. We refer to our method as Adversarial Counterfactual Attention (ACAT). ACAT increases the baseline classification accuracy of lesions in brain CT scans from 71.39% to 72.55% and of COVID-19 related findings in lung CT scans from 67.71% to 70.84%, and exceeds the performance of competing methods.

1. INTRODUCTION

In computer vision classification problems, it is often assumed that an object that represents a class occupies a large part of an image. However, in other image domains, such as medical imaging or histopathology, only a small fraction of the image contains information that is relevant for the classification task (Kimeswenger et al., 2019). With object-centric images, using wider contextual information (e.g. planes fly in the sky) and global features can aid the classification decision. In medical images, variations in parts of the image away from the local pathology are often normal, and any apparent signal from such regions is usually spurious and unhelpful in building robust classifiers. Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Szegedy et al., 2017; Huang et al., 2017a) can struggle to generalise well in such settings, especially when training cannot be performed on a very large amount of data (Pawlowski et al., 2019). This is at least partly because the convolutional structure necessitates some additional 'noisy' statistical response to filters away from the informative 'signal' regions. Because the 'signal' response region is small and the noise region is potentially large, this can result in a low signal-to-noise ratio in convolutional networks, impacting performance. To help localise the most informative parts of the image in medical imaging applications, Region of Interest (ROI) annotations are often collected (Cheng et al., 2011; Papanastasopoulos et al., 2020). However, these annotations require expert knowledge, are expensive to collect, and opinions on the ROI of a particular case may vary significantly across annotators (Grünberg et al., 2017). Alternatively, attention systems could be applied to locate the critical regions and aid classification.
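The signal-to-noise argument above can be illustrated with a small toy simulation. The sketch below is not the ACAT method: it assumes a single 32 x 32 filter response map in which only a 4 x 4 region carries signal, independent Gaussian noise at every position, and a hand-built soft mask (the hypothetical `soft_mask` helper) that stands in for an attention mask. All constants and function names are illustrative assumptions.

```python
import random
import statistics

SIZE = 32          # side of the toy response map (assumed)
ROI = (14, 18)     # rows/cols in [14, 18) hold the "lesion" signal (assumed)
SIGNAL = 1.0       # mean filter response inside the ROI (assumed)
NOISE_STD = 1.0    # std of the response noise at every position (assumed)


def in_roi(r, c):
    return ROI[0] <= r < ROI[1] and ROI[0] <= c < ROI[1]


def soft_mask(background_weight=0.1):
    """Hypothetical soft mask: 1 inside the ROI, a smaller weight outside."""
    return [[1.0 if in_roi(r, c) else background_weight for c in range(SIZE)]
            for r in range(SIZE)]


def pooled_response(rng, mask):
    """Mask-weighted average of one noisy response map."""
    total = weight = 0.0
    for r in range(SIZE):
        for c in range(SIZE):
            value = (SIGNAL if in_roi(r, c) else 0.0) + rng.gauss(0.0, NOISE_STD)
            total += mask[r][c] * value
            weight += mask[r][c]
    return total / weight


def estimate_snr(mask, trials=200, seed=0):
    """Mean over standard deviation of the pooled response across noise draws."""
    rng = random.Random(seed)
    samples = [pooled_response(rng, mask) for _ in range(trials)]
    return statistics.mean(samples) / statistics.stdev(samples)


uniform = soft_mask(background_weight=1.0)   # plain global average pooling
attended = soft_mask(background_weight=0.1)  # down-weights the background

print("SNR, uniform pooling:", round(estimate_snr(uniform), 2))
print("SNR, soft-masked pooling:", round(estimate_snr(attended), 2))
```

Under these assumptions, the pooled signal-to-noise ratio works out analytically to about 0.5 for uniform pooling and about 3.1 for the soft-masked case; the Monte Carlo estimates printed by the script fluctuate around those values. The gap comes entirely from down-weighting positions that contribute noise but no signal, which is the intuition behind using attention to locate the informative region.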
Previous work has explored the application of attention mechanisms over image features, aiming to capture either the spatial relationship between features (Bell et al., 2016; Newell et al., 2016; Santoro et al., 2017), the channel relationship (Hu et al., 2018), or both (Woo et al., 2018; Wang et al., 2017). Other authors employed self-attention to model non-local properties of images (Wang

