ROBUST AND GENERALIZABLE VISUAL REPRESENTATION LEARNING VIA RANDOM CONVOLUTIONS

Abstract

While successful for various computer vision tasks, deep neural networks have been shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. In this work, we show that the robustness of neural networks can be greatly improved through the use of random convolutions as data augmentation. Random convolutions are approximately shape-preserving and may distort local textures. Intuitively, randomized convolutions create an infinite number of new domains with similar global shapes but random local texture. Therefore, we explore using outputs of multi-scale random convolutions as new images or mixing them with the original images during training. When applying a network trained with our approach to unseen domains, our method consistently improves the performance on domain generalization benchmarks and is scalable to ImageNet. In particular, in the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.¹

1. INTRODUCTION

Generalizability and robustness to out-of-distribution samples have been major pain points when applying deep neural networks (DNNs) in real-world applications (Volpi et al., 2018). Though DNNs are typically trained on datasets with millions of training samples, they still lack robustness to domain shift, small perturbations, and adversarial examples (Luo et al., 2019). Recent research has shown that neural networks tend to use superficial features rather than global shape information for prediction, even when trained on large-scale datasets such as ImageNet (Geirhos et al., 2019). These superficial features can be local textures or even patterns imperceptible to humans but detectable to DNNs, as is the case for adversarial examples (Ilyas et al., 2019). In contrast, image semantics often depend more on object shapes than on local textures. For image data, local texture differences are one of the main sources of domain shift, e.g., between synthetic virtual images and real data (Sun & Saenko, 2014). Our goal is therefore to learn visual representations that are invariant to local texture and that generalize to unseen domains. While texture and color may be treated as different concepts, we follow the convention of Geirhos et al. (2019) and include color when talking about texture.

We address the challenging setting of robust visual representation learning from single-domain data. Limited work exists in this setting. Proposed methods include data augmentation (Volpi et al., 2018; Qiao et al., 2020; Geirhos et al., 2019), domain randomization (Tobin et al., 2017; Yue et al., 2019), self-supervised learning (Carlucci et al., 2019), and penalizing the predictive power of low-level network features (Wang et al., 2019a). In the spirit of adding an inductive bias towards global shape information over local textures, we propose using random convolutions to improve robustness to domain shifts and small perturbations.
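To build intuition for why a random convolution can change texture while preserving shape, consider the special case of a 1×1 random convolution: it applies the same random linear map to every pixel's color independently, so pixels that shared a color before still share a color afterwards, and object boundaries survive. The following is a minimal illustrative sketch of this observation (our own toy example, not the paper's implementation):

```python
import numpy as np

# A random 1x1 convolution is a random linear map applied to each pixel's
# color vector independently: pixels with identical colors remain identical
# after the transform, so edges and shapes are preserved while the overall
# color/texture statistics change.
rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))   # toy HxWxC image with values in [0, 1]
img[0, 0] = img[3, 3]         # make two pixels share the same color
w = rng.normal(size=(3, 3))   # random 1x1 conv = C_in x C_out color matrix
out = img @ w                 # per-pixel linear color transform
same_color_preserved = np.allclose(out[0, 0], out[3, 3])
```

Larger kernels only mix colors within a small local neighborhood, so global shape is still approximately preserved while local texture is distorted more strongly.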
To assess generalization to unseen environments, we focus on visual representation learning and examine our approach on visual domain generalization benchmarks. Our method also includes a multi-scale design and a mixing variant. In addition, considering that many computer vision tasks rely on training deep networks from ImageNet-pretrained weights (including some domain generalization benchmarks), we ask: "Can a more robust pretrained model make the finetuned model more robust on downstream tasks?" Different from Kornblith et al. (2019) and Salman et al. (2020), who studied the transferability of pretrained ImageNet representations to new tasks while focusing on in-domain generalization, we explore generalization performance on unseen domains for new tasks.

We make the following contributions:

• We develop RandConv, a data augmentation technique using multi-scale random convolutions to generate images with random texture while maintaining global shapes. We explore using the RandConv output as training images or mixing it with the original images. We show that a consistency loss can further enforce invariance under texture changes.

• We provide insights and justification for why RandConv augments images with different local texture but the same semantics, based on the shape-preserving property of random convolutions.

• We validate RandConv and its mixing variant in extensive experiments on synthetic and real-world benchmarks as well as on the large-scale ImageNet dataset. Our methods outperform single-domain generalization approaches by a large margin on digit recognition datasets and for the challenging case of generalizing to the Sketch domain in PACS and to ImageNet-Sketch.
• We explore whether the robustness/generalizability of a pretrained representation can transfer. We show that transferring a model pretrained with RandConv on ImageNet can further improve domain generalization performance on new downstream tasks on the PACS dataset.
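The core augmentation described above can be sketched in a few lines of NumPy. The following is a minimal, hypothetical implementation of RandConv with its mixing variant; the function name, the He-style filter scale, and the reflect padding are our assumptions, and the paper's actual implementation operates on batched tensors inside a training pipeline:

```python
import numpy as np

def rand_conv(img, kernel_sizes=(1, 3, 5, 7), mix_alpha=None, rng=None):
    """Sketch of RandConv-style augmentation on a single HxWxC image.

    A kernel size k is sampled from `kernel_sizes`, and a random filter bank
    of shape (k, k, C_in, C_out) is drawn from a zero-mean Gaussian whose
    variance follows He-style initialization, keeping the output magnitude
    comparable to the input. If `mix_alpha` is given, the result is blended
    with the original image: alpha * img + (1 - alpha) * conv(img).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, wd, c = img.shape
    k = int(rng.choice(kernel_sizes))                 # multi-scale: random k
    w = rng.normal(0.0, np.sqrt(1.0 / (c * k * k)), size=(k, k, c, c))
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    out = np.zeros((h, wd, c))
    for i in range(k):                                # "same"-padded conv,
        for j in range(k):                            # written out explicitly
            out += padded[i:i + h, j:j + wd, :] @ w[i, j]
    if mix_alpha is not None:
        out = mix_alpha * img + (1.0 - mix_alpha) * out
    return out
```

In training, the augmented output (or its mixture with the original image) simply replaces the input image; the consistency loss mentioned above additionally penalizes divergence between the network's predictions on differently augmented copies of the same image.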

2. RELATED WORK

Domain Generalization (DG) aims at learning representations that perform well when transferred to unseen domains. Modern techniques range from feature fusion (Shen et al., 2019) to meta-learning (Li et al., 2018a; Balaji et al., 2018) and adversarial training (Shao et al., 2019; Li et al., 2018b). Note that most current DG work (Ghifary et al., 2016; Li et al., 2018a; b) requires a multi-source training setting to work well. However, in practice, it might be difficult and expensive to collect data from multiple sources, such as collecting data from multiple medical centers (Raghupathi & Raghupathi, 2014). Instead, we consider the stricter single-source DG setting,



¹ Code is available at https://github.com/wildphoton/RandConv.



Figure 1: Top: Illustration that RandConv randomizes local texture but preserves shapes in the image. Middle: the first column is the input image of size 224²; the following columns are convolution results using random filters of different sizes k. Bottom: mixing results between an image and one of its random convolution results with different mixing coefficients α.

