SHAPE OR TEXTURE: UNDERSTANDING DISCRIMINATIVE FEATURES IN CNNS

Abstract

Contrasting previous evidence that neurons in the later layers of a Convolutional Neural Network (CNN) respond to complex object shapes, recent studies have shown that CNNs actually exhibit a 'texture bias': given an image with both texture and shape cues (e.g., a stylized image), a CNN is biased towards predicting the category corresponding to the texture. However, these previous studies conduct experiments on the final classification output of the network and fail to robustly evaluate the bias contained (i) in the latent representations and (ii) on a per-pixel level. In this paper, we design a series of experiments that overcome these issues, with the goal of better understanding what type of shape information contained in the network is discriminative, where shape information is encoded, and when the network learns about object shape during training. We show that a network learns the majority of overall shape information during the first few epochs of training and that this information is largely encoded in the last few layers of a CNN. Finally, we show that the encoding of shape does not imply the encoding of localized per-pixel semantic information. These experimental results and findings provide a more accurate understanding of the behaviour of current CNNs, thus helping to inform future design choices.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved unprecedented performance in various computer vision tasks, such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016), object detection (Ren et al., 2015; He et al., 2017), and semantic segmentation (Long et al., 2015; Chen et al., 2017; Islam et al., 2017). Despite their black-box nature, various studies have shown that early layers in CNNs activate for low-level patterns, like edges and blobs, while deeper layers activate for more complex and high-level patterns (Zeiler & Fergus, 2014; Springenberg et al., 2014). The intuition is that this hierarchical learning of latent representations allows CNNs to recognize complex object shapes and thereby correctly classify images (Kriegeskorte, 2015). In contrast, recent works (Brendel & Bethge, 2019; Hermann & Lampinen, 2020) have argued that CNNs trained on ImageNet (IN) (Deng et al., 2009) classify images mainly according to their texture rather than object shape. These conflicting results have large implications for the field of computer vision, as they suggest that CNNs trained for image classification might be making decisions based largely on spurious correlations rather than a full understanding of different object categories. One example of such a spurious correlation is how the Inception CNN (Szegedy et al., 2015) distinguishes 'Wolf' from 'Husky' based on whether there is snow in the background (Tulio Ribeiro et al., 2016). Recognizing object shapes is important for generalization to out-of-domain examples (e.g., few-shot learning), as shape is more discriminative than texture when
texture-affecting phenomena arise, such as lighting, shading, weather, motion blur, or when switching between synthetic and real data. In addition to performance, identifying the discriminative features that CNNs use for decision making is critical for the transparency and further improvement of computer vision models. While a model may achieve good performance on a certain task, it cannot communicate to the user the reasons behind its predictions. In other words, successful models need to be both good and interpretable (Lipton, 2019). This is crucial for many domains where causal mechanisms should play a significant role in short- or long-term decision making, such as healthcare (e.g., what in the MRI indicates that a patient has cancer?). Additionally, if researchers intend for their algorithms to be deployed, there must be a certain degree of trust in the decision-making algorithm.

One downside of the increasing abstraction capabilities of deep CNNs is the lack of interpretability of the latent representations, since hidden layer activations encode semantic concepts in a distributed fashion (Fong & Vedaldi, 2018). It has therefore been difficult to precisely quantify the type of information contained in the latent representations of CNNs. Some methods analyze the latent representations of CNNs on a per-neuron level. For instance, (Bau et al., 2017) quantify the number of interpretable neurons in a CNN by evaluating the semantic segmentation performance of an individual neuron from an upsampled latent representation. Later work (Fong & Vedaldi, 2018) removed the assumption that each neuron encodes a single semantic concept. These works successfully quantify the number of filters that recognize textures or specific objects in a CNN, but do not identify shape information within these representations. The most similar works to ours are those that aim to directly quantify the shape information in CNNs.
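The neuron-level probing of (Bau et al., 2017) can be pictured with a minimal sketch: upsample a single unit's activation map to image resolution, threshold it, and score its overlap with a binary concept mask via intersection-over-union (IoU). The function names, nearest-neighbour upsampling, and fixed threshold below are illustrative simplifications, not the original method's implementation:

```python
import numpy as np

def upsample_nearest(act, out_h, out_w):
    """Nearest-neighbour upsampling of a 2D activation map (illustrative)."""
    h, w = act.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return act[rows][:, cols]

def neuron_iou(act, concept_mask, threshold):
    """IoU between a thresholded, upsampled unit activation and a binary concept mask."""
    up = upsample_nearest(act, *concept_mask.shape)
    binarized = up > threshold
    inter = np.logical_and(binarized, concept_mask).sum()
    union = np.logical_or(binarized, concept_mask).sum()
    return inter / union if union > 0 else 0.0

# Toy example: a 4x4 activation map scored against an 8x8 concept mask.
act = np.zeros((4, 4)); act[:2, :2] = 1.0                  # unit fires in top-left quadrant
mask = np.zeros((8, 8), dtype=bool); mask[:4, :4] = True   # concept occupies top-left quadrant
print(neuron_iou(act, mask, threshold=0.5))  # -> 1.0 (perfect overlap)
```

A unit would then be deemed interpretable for a concept when this IoU exceeds some cut-off, echoing how (Bau et al., 2017) count interpretable neurons.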
For example, (Geirhos et al., 2018) analyzed the outputs of CNNs on images with conflicting shape and texture cues. Using image stylization (Huang & Belongie, 2017), they generated the Stylized ImageNet (SIN) dataset, where each image has an associated shape and texture label. They then measured the 'shape bias' and 'texture bias' of a CNN by calculating the percentage of images the CNN predicts as the shape or texture label, respectively. They conclude that CNNs are 'texture biased' and make predictions mainly from the texture in an image. This metric has been used in subsequent work exploring shape and texture bias in CNNs (Hermann & Kornblith, 2019); however, the method only compares the output of a CNN and fails to robustly quantify the amount of shape information contained in the latent representations (note that they refer to 'shape' as the entire 3D form of an object, including contours that are not part of the silhouette, while in our work, we define 'shape' as the 2D class-agnostic silhouette of an object). Thus, the method from (Hermann & Kornblith, 2019) cannot answer a question of central focus in our paper: 'What fraction of the object's shape is actually encoded in the latent representation?'. Further, as their metric for shape relies solely on the semantic class label, it precludes them from evaluating the encoded shape and associated categorical information on a per-pixel level. For instance, we show in Fig. 1 that shape-biased models (i.e., trained on stylized images) do not classify images based on the entire object shape: even though the CNN correctly classifies the image as a bird, only a partial binary mask (i.e., 'shape') can be extracted from the latent representations, and the model cannot attribute the correct class label to the entire object region (i.e., the semantic segmentation mask).

Contributions. To address these issues, we perform an empirical study on the ability of CNNs to encode shape information on a per-neuron and per-pixel level.
To quantify these two aspects, we first approximate the mutual information between the latent representations of pairs of semantically related images, which allows us to estimate the number of dimensions in the feature space dedicated to encoding shape and texture. We then propose a simple strategy to evaluate the amount of shape information contained in the internal representations of a CNN on a per-pixel level. The latter technique is used to compare the quality of different shape encodings, regardless of the number of neurons used in each encoding. After demonstrating the efficacy of these two methods, we reveal a number of meaningful properties of CNNs with respect to their ability to encode shape information.
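The shape-bias metric of (Geirhos et al., 2018) discussed above reduces to simple counting over cue-conflict images. As an illustrative sketch (not the authors' released code), assuming each stylized image carries a shape label and a texture label:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape-or-texture decisions that follow the shape cue.

    Following the metric of Geirhos et al. (2018), images whose prediction
    matches neither cue are excluded from the denominator.
    """
    shape_hits = texture_hits = 0
    for pred, s, t in zip(predictions, shape_labels, texture_labels):
        if pred == s:
            shape_hits += 1
        elif pred == t:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else 0.0

# Toy cue-conflict evaluation over four stylized images.
preds  = ['cat', 'elephant', 'bird', 'car']
shapes = ['cat', 'cat',      'bird', 'dog']
texts  = ['dog', 'elephant', 'fish', 'plane']
print(shape_bias(preds, shapes, texts))  # -> 2/3: 'car' matches neither cue
```

Note that this score only inspects the network's final label, which is precisely why it cannot say what fraction of the object's shape is encoded internally.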



Figure 1: A shape-biased model (trained on Stylized ImageNet) makes predictions based on the object's shape, or does it? Extracting binary (3rd column) and semantic (4th col.) segmentation maps with a one-convolutional-layer readout module shows that, while the model classifies the image-level shape label correctly as 'bird', it fails to encode the full object shape (3rd col.) and fails to categorically assign every object pixel to the 'bird' class (4th col.).
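The one-convolutional-layer readout used to extract these segmentation maps can be pictured as a per-pixel linear classifier over frozen features, i.e., a 1x1 convolution mapping each spatial position's channel vector to class logits. The weights and shapes below are toy assumptions for illustration; in practice the readout is trained on top of the frozen CNN:

```python
import numpy as np

def readout_1x1(features, weight, bias):
    """Per-pixel linear readout: a 1x1 convolution over a CxHxW feature map.

    features: (C, H, W) frozen latent representation
    weight:   (K, C) readout weights for K classes
    bias:     (K,) per-class bias
    returns:  (H, W) per-pixel class predictions
    """
    c, h, w = features.shape
    logits = weight @ features.reshape(c, h * w) + bias[:, None]  # (K, H*W)
    return logits.argmax(axis=0).reshape(h, w)

# Toy example: 2 channels, 2 classes ('background' = 0, 'bird' = 1).
feats = np.zeros((2, 4, 4))
feats[1, 1:3, 1:3] = 5.0         # channel 1 fires on the "object" region
W = np.array([[1.0, 0.0],        # class 0 reads channel 0
              [0.0, 1.0]])       # class 1 reads channel 1
b = np.array([0.1, 0.0])         # slight background prior
pred = readout_1x1(feats, W, b)
print(pred)  # 1s inside the 2x2 object region, 0s elsewhere
```

Because the readout is only one layer, any shape it recovers must already be present in the frozen features, which is what makes the partial masks in Fig. 1 informative.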

