SALIENCY IS A POSSIBLE RED HERRING WHEN DIAGNOSING POOR GENERALIZATION

Abstract

Poor generalization is one symptom of models that learn to predict target variables using spuriously correlated image features present only in the training distribution, instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We test whether this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information to make the prediction. We study multiple methods that take advantage of such auxiliary labels by training networks to ignore distracting features which may be found outside of the region of interest. This mask information is used only during training, and its impact on generalization accuracy depends on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between the correction of attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as "attribution priors" as well as saliency maps for explainable predictions.

1. INTRODUCTION

A fundamental challenge when applying deep learning models stems from poor generalization due to covariate shift (Moreno-Torres et al., 2012), which arises when the probably approximately correct (PAC) learning i.i.d. assumption is violated (Valiant, 1984), i.e., when the training distribution differs from the test distribution. One explanation for this is shortcut learning, or incorrect feature attribution, where the model overfits to training-data-specific decision rules instead of modelling the more general causative factors that generated the data (Goodfellow et al., 2016; Reed & Marks, 1999; Geirhos et al., 2020; Hermann & Lampinen, 2020; Parascandolo et al., 2020; Arjovsky et al., 2019; Zhang et al., 2016). In medical imaging, poor generalization due to test-set distribution shifts is common, and the problem is exacerbated by small cohorts. Previous work has hypothesized that this poor generalization is in part due to the presence of confounding variables in the training data, such as acquisition site or other image acquisition parameters, because attribution maps (aka saliency maps; Simonyan et al. (2014)) produced by the trained model do not highlight features that a human expert would use to make a diagnosis (Zech et al., 2018; DeGrave et al., 2020; Badgeley et al., 2019; Zhao et al., 2019; Young et al., 2019). Previous researchers have assumed that saliency maps can demonstrate that a model is not overfit or behaving unexpectedly (Pasa et al., 2019; Tschandl et al., 2020). We started this work under the same assumption, only to find the contradictory evidence we present in this paper.
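A gradient-based saliency map of the kind referenced above attributes a prediction to input pixels via the gradient of the class score with respect to the input. The following is a minimal sketch assuming a toy linear classifier with illustrative sizes and random weights (not the setup used in this paper); finite differences stand in for autodiff so the example is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels, n_classes = 16, 3          # a flattened 4x4 "image", 3 classes
W = rng.normal(size=(n_classes, n_pixels))  # hypothetical trained weights
b = np.zeros(n_classes)

def class_score(x, c):
    """Logit of class c for input image x."""
    return W[c] @ x + b[c]

def saliency(x, c):
    """|d score_c / d x|, computed by central finite differences to
    mimic autodiff. For a linear model this equals |W[c]| exactly."""
    eps = 1e-6
    grad = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (class_score(x + e, c) - class_score(x - e, c)) / (2 * eps)
    return np.abs(grad)

x = rng.normal(size=n_pixels)
c = int(np.argmax(W @ x + b))        # predicted class
smap = saliency(x, c).reshape(4, 4)  # saliency map over the 4x4 image
```

In a deep network the gradient depends on the input as well as the weights, so the map varies per image; here the linear model makes the expected answer (|W[c]|) easy to verify by hand.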
In this work, we set out to test the hypothesis that models with good generalization properties have attribution maps which use only the class-discriminative features to make predictions, by explicitly regularizing the models to ignore confounders using attribution priors (Erion et al., 2019; Ross et al., 2017), i.e., to make predictions using the correct anatomy (as a doctor would). We evaluated whether this regularization would A) improve out-of-distribution generalization, and B) shift feature attribution towards the attribution priors. If there exists a relationship between the attribution map and generalization performance, we would expect these

