SHAPE-TEXTURE DEBIASED NEURAL NETWORK TRAINING

Abstract

Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks (CNNs) are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degrades model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending to a single cue in representation learning, we augment the training data with images whose shape and texture information conflict (e.g., an image with the shape of a chimpanzee but the texture of a lemon) and, most importantly, provide the corresponding supervision from shape and texture simultaneously. Experiments show that our method improves model performance on several image recognition benchmarks as well as adversarial robustness. For example, trained on ImageNet, it helps ResNet-152 achieve substantial improvements on ImageNet (+1.2%), ImageNet-A (+5.2%), ImageNet-C (+8.3%) and Stylized-ImageNet (+11.1%), and on defending against the FGSM adversarial attack on ImageNet (+14.4%). Our method is also compatible with other advanced data augmentation strategies, e.g., Mixup and CutMix. The code is available here:

1. INTRODUCTION

It is known that both shape and texture serve as essential cues for object recognition. A decade ago, computer vision researchers explicitly designed a variety of hand-crafted features for object recognition, based either on shape (e.g., shape context (Belongie et al., 2002) and inner-distance shape context (Ling & Jacobs, 2007)) or on texture (e.g., textons (Malik et al., 2001)). Moreover, researchers found that properly combining shape and texture can further improve recognition performance (Shotton et al., 2009; Zheng et al., 2007), demonstrating the benefit of possessing both features. Nowadays, as popularized by Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012), the features used for object recognition are automatically learned rather than manually designed. This change not only eases human effort on feature engineering, but also yields much better performance on a wide range of visual benchmarks (Simonyan & Zisserman, 2015; He et al., 2016; Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; Long et al., 2015; Chen et al., 2015). Interestingly, however, as pointed out by Geirhos et al. (2019), the features learned by CNNs tend to be biased towards either shape or texture, depending on the training dataset. We verify that such biased representation learning (towards either shape or texture) weakens CNNs' performance (biased models are acquired similarly to Geirhos et al. (2019); see Section 2.1 for details). Surprisingly, we also find that (1) a model with shape-biased representations and a model with texture-biased representations are highly complementary to each other, e.g., they focus on completely different cues for their predictions (an example is provided in Figure 1); and (2) being biased towards either cue may inevitably limit model performance, e.g., models may not be able to tell the difference between a lemon and an orange without texture information.
These observations altogether deliver a promising message: biased models (e.g., ImageNet-trained (texture-biased) CNNs (Geirhos et al., 2019) or (shape-biased) CNNs (Shi et al., 2020)) are improvable. To this end, we develop a shape-texture debiased neural network training framework to guide CNNs towards learning better representations. Our method is a data-driven approach which lets CNNs automatically figure out, from their training samples, how to avoid being biased towards either shape or texture. Specifically, we apply style transfer to generate cue-conflict images, which break the correlation between shape and texture, for augmenting the original training data. The most important ingredient for training a successful shape-texture debiased model is to provide supervision from both shape and texture on these generated cue-conflict images; otherwise, models remain biased. Experiments show that our proposed shape-texture debiased neural network training significantly improves recognition models. For example, on the challenging ImageNet dataset (Russakovsky et al., 2015), our method helps ResNet-152 gain an absolute improvement of 1.2%, achieving 79.8% top-1 accuracy. Additionally, compared to its vanilla counterpart, this debiased ResNet-152 shows better generalization on ImageNet-A (Hendrycks et al., 2019), ImageNet-C, and Stylized-ImageNet, as well as stronger robustness against FGSM adversarial attacks.
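The dual-supervision recipe described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the helper names (`debiased_targets`, `soft_cross_entropy`) and the equal 0.5/0.5 weighting of the shape and texture labels are our assumptions.

```python
import numpy as np

def debiased_targets(shape_label, texture_label, num_classes, w_shape=0.5):
    """Build a soft target that carries supervision from BOTH cues.

    A cue-conflict image keeps the shape of class `shape_label` but the
    texture of class `texture_label`; instead of picking one label, the
    debiased recipe supervises with both. The equal 0.5/0.5 split is an
    illustrative assumption.
    """
    target = np.zeros(num_classes)
    target[shape_label] += w_shape
    target[texture_label] += 1.0 - w_shape
    return target

def soft_cross_entropy(logits, target):
    """Cross-entropy of a logits vector against a soft label distribution."""
    z = logits - logits.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return float(-(target * log_probs).sum())
```

Supervising with only the shape half (or only the texture half) of this target collapses back to the biased training setup; using both halves is what prevents the model from attending to a single cue.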

2. SHAPE/TEXTURE BIASED NEURAL NETWORKS

The biased feature representations of CNNs mainly stem from the training dataset; e.g., Geirhos et al. (2019) point out that models become biased towards shape if trained on the Stylized-ImageNet dataset. Following Geirhos et al. (2019), we hereby present a similar training pipeline to acquire shape-biased or texture-biased models. By evaluating these two kinds of models, we observe the necessity of possessing both shape and texture representations for CNNs to better recognize objects.

2.1. MODEL ACQUISITION

Data generation. Similar to Geirhos et al. (2019), we apply images with conflicting shape and texture information as training samples to obtain shape-biased or texture-biased models. Different from Geirhos et al. (2019), however, an important change in our cue-conflict image generation procedure is that we override the original texture information with the informative texture patterns from another randomly selected image.
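The pairing step of this data-generation procedure can be sketched as follows; the function name and the in-batch donor sampling are our assumptions, and the actual texture override would be performed by a separate style-transfer model applied to each (content, donor) pair.

```python
import random

def cue_conflict_pairs(labels, rng=random):
    """Pair each content image with a texture donor from a different class.

    Each pair (i, j) means: keep the shape of image i, but override its
    texture with the texture patterns of image j (via style transfer).
    Training on such images with only the shape (resp. texture) label
    yields a shape-biased (resp. texture-biased) model.
    """
    pairs = []
    for i, yi in enumerate(labels):
        donors = [j for j, yj in enumerate(labels) if yj != yi]
        # Fall back to the image itself if the batch contains one class.
        j = rng.choice(donors) if donors else i
        pairs.append((i, j))
    return pairs
```

Restricting donors to a different class guarantees that shape and texture genuinely conflict, so that the shape label and texture label of each generated image disagree.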






Figure 1: Both shape and texture are essential cues for object recognition, and biasing towards either one degrades model performance. As shown above, when classifying this fur coat image, the shape-biased model is confounded by the cloth-like shape and therefore predicts it as a poncho, while the texture-biased model misclassifies it as an Egyptian cat because of the misleading texture. Nonetheless, our debiased model successfully recognizes it as a fur coat by leveraging both shape and texture.


