LEARNING VISUAL REPRESENTATIONS FOR TRANSFER LEARNING BY SUPPRESSING TEXTURE

Abstract

Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information. In self-supervised learning in particular, texture as a low-level cue may provide shortcuts that prevent the network from learning higher-level representations. To address these problems, we propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture. This simple method retains important edge information while suppressing texture. We empirically show that our method achieves state-of-the-art results on object detection and image classification across eight diverse datasets, in both supervised and self-supervised learning tasks such as MoCoV2 and Jigsaw. Our method is particularly effective for transfer learning, and we observe improved performance on five standard transfer learning datasets. The large improvements (up to 11.49%) on the Sketch-ImageNet and Synthetic-DTD datasets, together with additional visual analyses using saliency maps, suggest that our approach helps in learning representations that transfer better.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) can learn powerful visual features that have resulted in significant improvements on many computer vision tasks such as semantic segmentation (Shelhamer et al., 2017), object recognition (Krizhevsky et al., 2012), and object detection (Ren et al., 2015). However, CNNs often fail to generalize well across datasets under domain shift due to varied lighting, sensor resolution, spectral response, etc. One reason for this poor generalization is CNNs' over-reliance on low-level cues like texture (Geirhos et al., 2018). These low-level cues and texture biases have been identified as serious challenges across learning paradigms, from supervised learning (Brendel & Bethge, 2019; Geirhos et al., 2018; Ringer et al., 2019) to self-supervised learning (SSL) (Noroozi & Favaro, 2016; Noroozi et al., 2018; Doersch et al., 2015; Caron et al., 2018; Devlin et al., 2019).

We focus on learning visual representations that are robust to changes in low-level information such as texture cues. Specifically, we propose to suppress texture in images with classical tools, as a form of data augmentation, to encourage deep neural networks to learn representations that are less dependent on textural cues. We use the Perona-Malik non-linear diffusion method (Perona & Malik, 1990), robust anisotropic diffusion (Black et al., 1998), and bilateral filtering (Tomasi & Manduchi, 1998) to augment our training data. These methods suppress texture while retaining structure by preserving boundaries. Our work is inspired by the observation that ImageNet pre-trained models fail to generalize well across datasets (Geirhos et al., 2018; Recht et al., 2019) due to over-reliance on texture and low-level features. Stylized-ImageNet (Geirhos et al., 2018) attempted to modify the texture of images by using style transfer to render images in the style of randomly selected paintings from the Kaggle paintings dataset.
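As context for the Perona-Malik scheme, the following is a minimal NumPy sketch of the classic four-neighbor discretization. The function name, parameter defaults, and the choice of the exponential edge-stopping function are illustrative assumptions, not the paper's exact settings; the point is that the conductance g decays with gradient magnitude, so diffusion smooths textured, low-contrast regions while leaving strong edges nearly untouched.

```python
import numpy as np

def perona_malik(img, n_iter=20, kappa=30.0, dt=0.2):
    """Perona-Malik anisotropic diffusion on a 2-D grayscale image.

    Homogeneous regions are smoothed (suppressing fine texture),
    while the edge-stopping function g(|grad u|) = exp(-(|grad u|/kappa)^2)
    drives the conductance toward zero at strong boundaries,
    so edges are preserved.
    """
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # finite differences toward the four neighbors
        dn = np.roll(u, -1, axis=0) - u  # north
        ds = np.roll(u, 1, axis=0) - u   # south
        de = np.roll(u, -1, axis=1) - u  # east
        dw = np.roll(u, 1, axis=1) - u   # west
        # conductance per direction: near 1 in flat/textured areas,
        # near 0 across strong edges
        cn = np.exp(-(dn / kappa) ** 2)
        cs = np.exp(-(ds / kappa) ** 2)
        ce = np.exp(-(de / kappa) ** 2)
        cw = np.exp(-(dw / kappa) ** 2)
        # explicit Euler update of the diffusion PDE
        u += dt * (cn * dn + cs * ds + ce * de + cw * dw)
    return u
```

In an augmentation setting, such a filter would be applied offline (or with some probability per sample) to produce texture-suppressed copies of the training images; kappa controls which gradient magnitudes count as "edges" to preserve.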
However, this approach offers little control over exactly which cues are removed from the image: the resulting images sometimes retain texture and distort the original shape. In our approach (Fig. 1), we suppress the texture instead of modifying it. We empirically show that this helps in learning better high-level representations and works better than CNN-based stylized augmentation. We also compare our approach with the Gaussian blur augmentation recently used in (Chen et al., 2020).

Overall, we achieve significant improvements on several benchmarks:
• In a set of eight diverse datasets, our method exhibits substantial improvements (as high as +11.49% on Sketch-ImageNet and +10.41% on the Synthetic-DTD dataset) in learning visual representations across domains.
• We also obtain improvements on same-domain visual recognition tasks: ImageNet validation (+0.7%) and a label corruption task (Hendrycks et al., 2019).
• We achieve state-of-the-art results in self-supervised learning on VOC detection and other transfer learning tasks.



Figure 1: An overview of our approach. We propose to augment the ImageNet dataset with a version of the dataset processed by anisotropic diffusion. This augmentation helps the network rely less on texture information and improves performance across diverse experiments.

Figure 2: Examples of images from Sketch-ImageNet. Images have very little or no texture, which implies texture will have little to no impact on object classification.

