SALIENCY IS A POSSIBLE RED HERRING WHEN DIAGNOSING POOR GENERALIZATION

Abstract

Poor generalization is one symptom of models that learn to predict target variables using spuriously correlated image features present only in the training distribution, rather than the true image features that denote a class. It is often assumed that this can be diagnosed visually using attribution (aka saliency) maps. We study whether this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing the information relevant to the prediction. We study multiple methods that take advantage of such auxiliary labels by training networks to ignore distracting features that may be found outside of the region of interest. This mask information is used only during training, and its impact on generalization accuracy depends on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between correcting the attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as "attribution priors", as well as of saliency maps for explainable predictions.

1. INTRODUCTION

A fundamental challenge when applying deep learning models stems from poor generalization due to covariate shift (Moreno-Torres et al., 2012), which arises when the probably approximately correct (PAC) learning i.i.d. assumption (Valiant, 1984) is invalid, i.e., the training distribution differs from the test distribution. One explanation for this is shortcut learning, or incorrect feature attribution, where the model overfits to a set of training-data-specific decision rules that explain the training data instead of modelling the more general causative factors that generated the data (Goodfellow et al., 2016; Reed & Marks, 1999; Geirhos et al., 2020; Hermann & Lampinen, 2020; Parascandolo et al., 2020; Arjovsky et al., 2019; Zhang et al., 2016). In medical imaging, poor generalization due to test-set distribution shifts is common, and this problem is exacerbated by small cohorts. Previous work has hypothesized that this poor generalization is in part due to the presence of confounding variables in the training data, such as acquisition site or other image acquisition parameters, because attribution maps (aka saliency maps; Simonyan et al. (2014)) produced by the trained model do not highlight features that a human expert would use to make a diagnosis (Zech et al., 2018; DeGrave et al., 2020; Badgeley et al., 2019; Zhao et al., 2019; Young et al., 2019). Previous researchers have assumed that saliency maps can demonstrate that the model is not overfit or behaving unexpectedly (Pasa et al., 2019; Tschandl et al., 2020). We started this work under the same assumption, only to find the contradictory evidence we present in this paper.
In this work, we set out to test the hypothesis that models with good generalization properties have attribution maps which only utilize the class-discriminative features to make predictions, by explicitly regularizing the models to ignore confounders using attribution priors (Erion et al., 2019; Ross et al., 2017), i.e., to make predictions using the correct anatomy (as a doctor would). We evaluated whether this regularization would A) improve out-of-distribution generalization, and B) change feature attribution to be more like the attribution priors. If there exists a relationship between the attribution map and generalization performance, we would expect these regularizations to positively impact both generalization and attribution quality simultaneously. To evaluate attribution quality, we define good attribution to be an attribution map that agrees strongly with expert knowledge in the form of a binary mask on the input data. We show that the existing and proposed feature-attribution-aware methods help facilitate generalization in the presence of a train-test distribution shift. However, while feature-attribution-aware methods change the attribution maps relative to baseline, there is no strong correlation between generalization performance and good attribution. This in turn challenges the assumption made in previous works that "incorrect" attribution maps were indicative of poor generalization performance. This suggests A) that efforts to validate model correctness using attribution maps may not be reliable, and B) that efforts to control feature attribution using masks on the input may not function as expected. All code and datasets for this paper are publicly available¹. Our contributions include:
• A synthetic dataset that encourages models to overfit to an easy-to-represent confounder instead of a more complicated counting task.
• A new method for controlling feature attribution based on minimizing activation differences between masked and unmasked images (ActDiff).
• An evaluation of the relationship between generalization improvement and feature attribution in real-life out-of-distribution generalization tasks with traditional classifiers.
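As a rough illustration of the ActDiff idea described above, the sketch below is our own simplification, not the paper's implementation: `feats_unmasked` and `feats_masked` stand for hypothetical precomputed latent activations of the unmasked and masked versions of the same image, and the penalty is the mean squared difference between them.

```python
import numpy as np

def actdiff_penalty(feats_unmasked, feats_masked):
    """Mean squared difference between the latent activations of the
    unmasked and masked versions of the same image; driving this to
    zero encourages the network to ignore features outside the mask."""
    diff = np.asarray(feats_unmasked) - np.asarray(feats_masked)
    return float(np.mean(diff ** 2))

# Identical activations incur no penalty; divergent ones are penalized.
feats = np.ones((4, 8))  # batch of 4 hypothetical latent vectors
zero_penalty = actdiff_penalty(feats, feats)
penalty = actdiff_penalty(feats, np.zeros((4, 8)))
```

In a full training loop this term would be added, with a weighting coefficient, to the usual classification loss.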

2. RELATED WORK

It is a well-documented phenomenon that convolutional neural networks (CNNs), instead of building object-level representations of the input data, tend to find convenient surface-level statistics in the training data that are predictive of class (Jo & Bengio, 2017). Previous work has attempted to reduce a model's proclivity to use confounding features by randomly masking out regions of the input (DeVries & Taylor, 2017), forcing the network to learn representations that are not dependent on a single input feature. However, this regularization approach gives no control over the kinds of representations learned by the model, so we do not include it in our study. Recently, multiple approaches have proposed to control feature representations by penalizing the model for producing saliency gradients outside of a region of interest indicating the class-discriminative feature (Simpson et al., 2019; Zhuang et al., 2019; Rieger et al., 2019). This line of work was introduced by Right for the Right Reasons (RRR), which showed impressive improvements in attribution correctness on synthetic data (Ross et al., 2017). The follow-up work has generally demonstrated a small improvement in generalization accuracy on real data, and much more impressive results on synthetic data. Another feature-attribution-aware regularization approach additionally dealt with class imbalance by increasing the impact of the gradients inside the region of interest of the under-represented class (Zhuang et al., 2019). One alternative to saliency-based methods, which can be noisy due to the ReLU activations allowing irrelevant features to pass through the activation function (Kim et al., 2019), is to leverage methods that aim to produce domain-invariant features in the latent space of the network.
These methods regularize the network such that the latent representations of two or more "domains" are encouraged to be as similar as possible, often by minimizing a distance metric or by employing an adversary that is trained to distinguish between the different domains (Kouw & Loog, 2019; Ganin & Lempitsky, 2015; Tzeng et al., 2015; Liu & Tuzel, 2016). In this work, we view the masked version of the input as the training domain and the unmasked version as the test domain, and compare these approaches with saliency-based approaches for the task of reducing the model's reliance on confounding features. To the best of our knowledge, these strategies have not previously been applied to controlling feature attribution.
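One common form of the adversarial variant can be sketched as follows. This is a toy illustration under our own assumptions, not any specific paper's objective: `domain_probs` stands for a discriminator's predicted probability that each latent code came from the masked domain, and the encoder is rewarded for making those predictions uninformative.

```python
import numpy as np

def domain_confusion_loss(domain_probs):
    """Cross-entropy of the discriminator's predictions against a
    uniform (0.5/0.5) target; minimized when the discriminator cannot
    tell the masked domain from the unmasked one."""
    p = np.clip(np.asarray(domain_probs, dtype=float), 1e-8, 1 - 1e-8)
    return float(np.mean(-0.5 * np.log(p) - 0.5 * np.log(1.0 - p)))

# A perfectly confused discriminator (p = 0.5) yields the minimum loss,
# log(2); a confident one is penalized more heavily.
confused = domain_confusion_loss(np.full(4, 0.5))
confident = domain_confusion_loss(np.full(4, 0.99))
```

The encoder would minimize this term while the discriminator is trained on the opposite objective, in alternation.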



¹ https://github.com/josephdviviano/saliency-red-herring



• Two new tasks constructed from open medical datasets which have a correlation between the pathology and either imaging site (site-pathology correlation; SPC) or view (view-pathology correlation; VPC); we manipulate the nature of this correlation differently in the training and test distributions to create a distribution shift (Figure 5), introducing confounding variables as observed in previous work (Zhao et al., 2019; DeGrave et al., 2020).
• An evaluation of existing methods for controlling feature attribution using mask information: Right for the Right Reasons (RRR; Ross et al. (2017)), GradMask (Simpson et al., 2019), and adversarial domain invariance (Tzeng et al., 2017; Ganin & Lempitsky, 2015).
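The saliency-penalty idea shared by RRR and GradMask, listed above, can be sketched roughly as follows. This is our own simplified illustration, not either paper's exact loss: `input_grad` stands for a precomputed gradient of the model's output with respect to the input image, and `mask` marks the expert-labelled region of interest.

```python
import numpy as np

def saliency_penalty(input_grad, mask, lam=1.0):
    """Squared input-gradient magnitude outside the expert mask
    (mask == 1 marks the relevant region); penalizing it discourages
    the model from attending to confounding regions."""
    outside = (1.0 - np.asarray(mask)) * np.asarray(input_grad)
    return lam * float(np.sum(outside ** 2))

grad = np.array([[0.5, 0.2],
                 [0.0, 0.3]])
mask = np.array([[1.0, 0.0],
                 [1.0, 1.0]])  # only the top-right pixel is outside
penalty = saliency_penalty(grad, mask)  # 0.2 ** 2 = 0.04
```

In practice the gradient would be recomputed each step via backpropagation, and the penalty added to the classification loss with weight `lam`.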

