SHAPE OR TEXTURE: UNDERSTANDING DISCRIMINATIVE FEATURES IN CNNS

Abstract

In contrast to previous evidence that neurons in the later layers of a Convolutional Neural Network (CNN) respond to complex object shapes, recent studies have shown that CNNs actually exhibit a 'texture bias': given an image with both texture and shape cues (e.g., a stylized image), a CNN is biased towards predicting the category corresponding to the texture. However, these previous studies conduct experiments only on the final classification output of the network, and fail to robustly evaluate the bias contained (i) in the latent representations, and (ii) on a per-pixel level. In this paper, we design a series of experiments that overcome these issues. We do this with the goal of better understanding what type of shape information contained in the network is discriminative, where shape information is encoded, as well as when the network learns about object shape during training. We show that a network learns the majority of overall shape information during the first few epochs of training and that this information is largely encoded in the last few layers of a CNN. Finally, we show that the encoding of shape does not imply the encoding of localized per-pixel semantic information. The experimental results and findings provide a more accurate understanding of the behaviour of current CNNs, thus helping to inform future design choices.
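The cue-conflict evaluation referred to above is typically scored as follows: each stylized image carries both a shape label and a texture label, a prediction is counted only if it matches one of the two, and the shape bias is the fraction of counted predictions that agree with the shape label. A minimal sketch of this metric (the function name and example labels are hypothetical, not from this paper):

```python
# Sketch of the standard cue-conflict "shape bias" score. A prediction is
# counted only when it matches the image's shape label or its texture label;
# by construction the two labels always differ on cue-conflict images.
def shape_bias(predictions, shape_labels, texture_labels):
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    total = shape_hits + texture_hits
    return shape_hits / total if total else 0.0

# Hypothetical example: three cue-conflict images
# (e.g., a cat-shaped image rendered with elephant texture).
preds    = ["elephant", "cat", "dog"]
shapes   = ["cat", "cat", "bird"]
textures = ["elephant", "dog", "dog"]
print(shape_bias(preds, shapes, textures))  # 1 shape-consistent of 3 counted -> 0.333...
```

A score near 1.0 indicates shape-driven decisions, near 0.0 texture-driven ones; predictions matching neither cue are simply excluded from the denominator.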

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved unprecedented performance in various computer vision tasks, such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016), object detection (Ren et al., 2015; He et al., 2017) and semantic segmentation (Long et al., 2015; Chen et al., 2017; Islam et al., 2017). Despite their black-box nature, various studies have shown that early layers in CNNs activate for low-level patterns, like edges and blobs, while deeper layers activate for more complex and high-level patterns (Zeiler & Fergus, 2014; Springenberg et al., 2014). The intuition is that this hierarchical learning of latent representations allows CNNs to recognize complex object shapes and thereby correctly classify images (Kriegeskorte, 2015). In contrast, recent works (Brendel & Bethge, 2019; Hermann & Lampinen, 2020) have argued that CNNs trained on ImageNet (IN) (Deng et al., 2009) classify images mainly according to their texture, rather than object shape. These conflicting results have large implications for the field of computer vision, as they suggest that CNNs trained for image classification might be making decisions based largely on spurious correlations rather than a full understanding of different object categories. One example of such a spurious correlation is how the Inception CNN (Szegedy et al., 2015) distinguishes 'Wolf' from 'Husky' based on whether there is snow in the background (Tulio Ribeiro et al., 2016). Recognizing object shapes is important for the generalization to out-of-domain examples (e.g., few-shot learning), as shape is more discriminative than texture when

