EXEMPLARY NATURAL IMAGES EXPLAIN CNN ACTIVATIONS BETTER THAN STATE-OF-THE-ART FEATURE VISUALIZATION

Abstract

Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs' inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images by Olah et al. (2017) with a simple baseline visualization, namely exemplary natural images that also strongly activate a specific feature map. Given either synthetic or natural reference images, human participants choose which of two query images leads to strong positive activation. The experiment is designed to maximize participants' performance, and is the first to probe intermediate instead of final layer representations. We find that synthetic images indeed provide helpful information about feature map activations (82 ± 4% accuracy; chance would be 50%). However, natural images, originally intended to be a baseline, outperform these synthetic images by a wide margin (92 ± 2%). Additionally, participants are faster and more confident for natural images, whereas subjective impressions about the interpretability of the feature visualizations by Olah et al. (2017) are mixed. The higher informativeness of natural images holds across most layers, for both expert and lay participants as well as for hand- and randomly-picked feature visualizations. Even if only a single reference image is given, synthetic images provide less information than natural images (65 ± 5% vs. 73 ± 4%). In summary, synthetic images from a popular feature visualization method are significantly less informative for assessing CNN activations than natural images. We argue that visualization methods should improve over this simple baseline.

1. INTRODUCTION

As Deep Learning methods are being deployed across society, academia and industry, the need to understand their decisions becomes ever more pressing. Under certain conditions, a "right to explanation" is even required by law in the European Union (GDPR, 2016; Goodman & Flaxman, 2017). Fortunately, the field of interpretability or explainable artificial intelligence (XAI) is also growing: not only are discussions on goals and definitions of interpretability advancing (Doshi-Velez & Kim, 2017; Lipton, 2018; Gilpin et al., 2018; Murdoch et al., 2019; Miller, 2019; Samek et al., 2020), but the number of explanation methods is rising, their maturity is evolving (Zeiler & Fergus, 2014; Ribeiro et al., 2016; Selvaraju et al., 2017; Kim et al., 2018), and they are tested and used in real-world scenarios like medicine (Cai et al., 2019; Kröll et al., 2020) and meteorology (Ebert-Uphoff & Hilburn, 2020).

We here focus on the popular post-hoc explanation method (or interpretability method) of feature visualizations via activation maximization*. First introduced by Erhan et al. (2009) and subsequently improved by many others (Mahendran & Vedaldi, 2015; Nguyen et al., 2015; Mordvintsev et al., 2015; Nguyen et al., 2016a; 2017), these synthetic, maximally activating images seek to visualize the features that a specific network unit, feature map, or combination thereof is selective for. However, feature visualizations are surrounded by great controversy: how accurately do they represent a CNN's inner workings, or, in short, how useful are they? This is the guiding question of our study. On the one hand, many researchers are convinced that feature visualizations are interpretable (Graetz, 2019) and that "features can be rigorously studied and understood" (Olah et al., 2020b).
Other applications from Computer Vision and Natural Language Processing also support the view that features are meaningful (Mikolov et al., 2013; Karpathy et al., 2015; Radford et al., 2017; Zhou et al., 2014; Bau et al., 2017; 2020) and might be formed in a hierarchical fashion (LeCun et al., 2015; Güçlü & van Gerven, 2015; Goodfellow et al., 2016). Over the past few years, extensive investigations to better understand CNNs have been based on feature visualizations (Olah et al., 2020b; a; Cammarata et al., 2020; Cadena et al., 2018), and the technique is being combined with other explanation methods (Olah et al., 2018; Carter et al., 2019; Addepalli et al., 2020; Hohman et al., 2019). On the other hand, feature visualizations can be equal parts art and engineering as they are science: vanilla methods look noisy, so human-defined regularization mechanisms are introduced. One way to advance this debate is to measure the utility of feature visualizations in terms of their helpfulness for humans. In this study, we therefore design well-controlled psychophysical experiments that aim to quantify the informativeness of the popular visualization method by Olah et al. (2017). Specifically, participants choose which of two natural images would elicit a higher activation.



* Also known as input maximization or maximally exciting images (MEIs).



Figure 1: How useful are synthetic compared to natural images for interpreting neural network activations? A: Human experiment. Given extremely activating reference images (either synthetic or natural), a human participant chooses which of two query images is also a strongly activating image. Synthetic images were generated via feature visualization (Olah et al., 2017). B: Core result. Participants are well above chance for synthetic images, but even better when seeing natural reference images.
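The core idea behind such synthetic images, activation maximization, can be sketched in a few lines: start from noise and follow the gradient of a unit's response with respect to the input pixels. The toy linear "unit" below is purely illustrative and all names are our own; a real implementation (e.g. Olah et al., 2017) backpropagates through a trained CNN and adds regularizers and transformation robustness to obtain interpretable images.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))  # weights of a toy linear "unit" (stand-in for a CNN feature map)

def unit_response(img):
    # Pre-ReLU response of the toy unit; maximizing the pre-activation
    # avoids zero gradients from a dead ReLU.
    return float((img * w).sum())

def visualize(steps=100, lr=0.05):
    """Gradient ascent on the input pixels to maximize the unit's response."""
    img = rng.normal(scale=0.01, size=(8, 8))  # start from small noise
    for _ in range(steps):
        img += lr * w                  # gradient of (img * w).sum() w.r.t. img is w
        img = np.clip(img, -1.0, 1.0)  # keep "pixels" in a valid range
    return img

noise = rng.normal(scale=0.01, size=(8, 8))
synthetic = visualize()  # activates the unit far more strongly than noise
```

For a real network the analytic gradient `w` would be replaced by automatic differentiation, and the regularization choices are exactly where, as discussed above, engineering judgment enters the method.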

But do the resulting beautiful visualizations accurately show what a CNN is selective for? How representative are the seemingly well-interpretable, "hand-picked" (Olah et al., 2017) synthetic images in publications for the entirety of all units in a network, a concern raised by e.g. Kriegeskorte (2015)? What if the features that a CNN is truly sensitive to are imperceptible instead, as might be suggested by the existence of adversarial examples (Szegedy et al., 2013; Ilyas et al., 2019)? Morcos et al. (2018) even suggest that units with easily understandable features play a less important role in a network. Another criticism of synthetic maximally activating images is that they only visualize extreme features, potentially leaving undetected other features that elicit only, say, 70% of the maximal activation. Also, polysemantic units (Olah et al., 2020b), i.e. units that are highly activated by different semantic concepts, as well as the importance of combinations of units (Olah et al., 2017; 2018; Fong & Vedaldi, 2018), already hint at the complexity of how concepts are encoded in CNNs.
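The natural-image baseline, by contrast, requires no optimization: record each dataset image's activation for the feature map of interest and select the strongest exemplars. The sketch below uses random numbers in place of real recorded activations, and all function and variable names are hypothetical; it also shows how one could probe non-extreme images, e.g. those near 70% of the maximal activation, which extreme visualizations miss.

```python
import numpy as np

# Hypothetical stand-in: acts[i] is the recorded activation of one feature
# map for the i-th dataset image (in practice these would come from a
# forward pass of the CNN over a natural-image dataset).
rng = np.random.default_rng(1)
acts = rng.gamma(shape=2.0, scale=1.0, size=10_000)

def top_exemplars(activations, k=9):
    """Indices of the k most strongly activating natural images."""
    return np.argsort(activations)[::-1][:k]

def percentile_exemplars(activations, fraction=0.7, k=9):
    """Indices of the k images whose activation is closest to `fraction`
    of the maximum, i.e. strong but non-extreme exemplars."""
    target = fraction * activations.max()
    return np.argsort(np.abs(activations - target))[:k]

strongest = top_exemplars(acts)        # reference images for the baseline
mid_range = percentile_exemplars(acts) # e.g. the "70% of max" images
```

The simplicity of this selection procedure is part of the argument: if such exemplars already explain activations better than optimized synthetic images, visualization methods should at least beat this baseline.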

