ROBUST AND GENERALIZABLE VISUAL REPRESENTATION LEARNING VIA RANDOM CONVOLUTIONS

Abstract

While successful for various computer vision tasks, deep neural networks have been shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. In this work, we show that the robustness of neural networks can be greatly improved through the use of random convolutions as data augmentation. Random convolutions are approximately shape-preserving and may distort local textures. Intuitively, randomized convolutions create an infinite number of new domains with similar global shapes but random local textures. We therefore explore using the outputs of multi-scale random convolutions as new images, or mixing them with the original images, during training. When applying a network trained with our approach to unseen domains, our method consistently improves performance on domain generalization benchmarks and is scalable to ImageNet. In particular, in the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.

1. INTRODUCTION

Generalizability and robustness to out-of-distribution samples have been major pain points when applying deep neural networks (DNNs) in real-world applications (Volpi et al., 2018). Though DNNs are typically trained on datasets with millions of training samples, they still lack robustness to domain shift, small perturbations, and adversarial examples (Luo et al., 2019). Recent research has shown that neural networks tend to use superficial features rather than global shape information for prediction, even when trained on large-scale datasets such as ImageNet (Geirhos et al., 2019). These superficial features can be local textures or even patterns imperceptible to humans but detectable to DNNs, as is the case for adversarial examples (Ilyas et al., 2019). In contrast, image semantics often depend more on object shapes than on local textures. For image data, local texture differences are one of the main sources of domain shift, e.g., between synthetic virtual images and real data (Sun & Saenko, 2014). Our goal is therefore to learn visual representations that are invariant to local texture and that generalize to unseen domains. While texture and color may be treated as different concepts, we follow the convention in Geirhos et al. (2019) and include color when talking about texture.

We address the challenging setting of robust visual representation learning from single-domain data. Limited work exists in this setting. Proposed methods include data augmentation (Volpi et al., 2018; Qiao et al., 2020; Geirhos et al., 2019), domain randomization (Tobin et al., 2017; Yue et al., 2019), self-supervised learning (Carlucci et al., 2019), and penalizing the predictive power of low-level network features (Wang et al., 2019a). Following the spirit of adding an inductive bias towards global shape information over local textures, we propose using random convolutions to improve robustness to domain shifts and small perturbations. While recently Lee et al.
(2020) proposed a similar technique for improving the generalization of reinforcement learning agents in unseen environments, we focus on visual representation learning and examine our approach on visual domain generalization benchmarks. Our method also includes a multi-scale design and a mixing variant.

[Figure 1. Top: first column is the input image of size 224²; the following columns are convolution results using random filters of different sizes k. Bottom: mixing results between an image and one of its random convolution results with different mixing coefficients α.]

In addition, considering that many computer vision tasks rely on training deep networks based on ImageNet-pretrained weights (including some domain generalization benchmarks), we ask: "Can a more robust pretrained model make the finetuned model more robust on downstream tasks?" Different from Kornblith et al. (2019) and Salman et al. (2020), who studied the transferability of a pretrained ImageNet representation to new tasks while focusing on in-domain generalization, we explore generalization performance on unseen domains for new tasks.

We make the following contributions:
• We develop RandConv, a data augmentation technique using multi-scale random convolutions to generate images with random texture while maintaining global shapes. We explore using the RandConv output as training images or mixing it with the original images. We show that a consistency loss can further enforce invariance under texture changes.
• We provide insights and justification on why RandConv augments images with different local texture but the same semantics, via the shape-preserving property of random convolutions.
• We validate RandConv and its mixing variant in extensive experiments on synthetic and real-world benchmarks as well as on the large-scale ImageNet dataset.
Our methods outperform single-domain generalization approaches by a large margin on digit recognition datasets and for the challenging cases of generalizing to the Sketch domain in PACS and to ImageNet-Sketch.
• We explore whether the robustness/generalizability of a pretrained representation can transfer. We show that transferring a model pretrained with RandConv on ImageNet can further improve domain generalization performance on new downstream tasks on the PACS dataset.

2. RELATED WORK

Domain Generalization (DG) addresses model generalization to unseen domains. We consider the challenging single-source setting, where we train the model on source data from a single domain and generalize it to new unseen domains (Carlucci et al., 2019; Wang et al., 2019b).

Domain Randomization (DR) was first introduced as a DG technique by Tobin et al. (2017) to handle the domain gap between simulated and real data. As the training data in (Tobin et al., 2017) is synthesized in a virtual environment, it is possible to generate diverse training samples by randomly selecting background images, colors, lighting, and textures of foreground objects. When a simulation environment is not accessible, image stylization can be used to generate new domains (Yue et al., 2019; Geirhos et al., 2019). However, this requires extra effort to collect data and to train an additional model; further, the number of randomized domains is limited by the number of predefined styles.

Data Augmentation has been widely used to improve the generalization of machine learning models (Simard et al., 2003). DR approaches can be considered a type of synthetic data augmentation. To improve performance on unseen domains, Volpi et al. (2018) generate adversarial examples to augment the training data. Representations robust to texture shift can also be learned directly, e.g., by projecting out superficial statistics (Wang et al., 2019b), or by penalizing the predictive power of local, low-level layer features in a neural network via an adversarial classifier (Wang et al., 2019a). Our approach shares the idea that learning representations invariant to local texture helps generalization to unseen domains. However, RandConv avoids searching over many hyperparameters, collecting extra data, and training other networks. It also scales to large-scale datasets since it adds minimal computation overhead.

Random Mapping in Machine Learning. Random projections have also been effective for dimensionality reduction based on the distance-preserving property of the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984). Vinh et al. (2016) applied random projections on entire images as data augmentation to make neural networks robust to adversarial examples. Lee et al.
(2020) recently used random convolutions to help reinforcement learning (RL) agents generalize to new environments. Neural networks with fixed random weights can encode meaningful representations (Saxe et al., 2011) and are therefore useful for neural architecture search (Gaier & Ha, 2019), generative models (He et al., 2016b), natural language processing (Wieting & Kiela, 2019), and RL (Osband et al., 2018; Burda et al., 2019). In contrast, RandConv uses non-fixed, randomly sampled weights to generate images with different local texture.

3. RANDCONV: RANDOMIZE LOCAL TEXTURE AT DIFFERENT SCALES

We propose using a convolution layer with non-fixed random weights as the first layer of a DNN during training. This strategy generates images with random local texture but consistent shapes, and is beneficial for robust visual representation learning. Sec. 3.1 justifies the shape-preserving property of a random convolution layer. Sec. 3.2 describes RandConv, our data augmentation algorithm using a multi-scale randomized convolution layer and input mixing.

3.1. A RANDOM CONVOLUTION LAYER PRESERVES GLOBAL SHAPES

Convolution is the key building block of deep convolutional neural networks. Consider a convolution layer with filters Θ ∈ R^{h×w×C_in×C_out} applied to an input image I ∈ R^{H×W×C_in}, where H and W are the height and width of the input, C_in and C_out are the number of feature channels for the input and output, and h and w are the height and width of the layer's filters. The output (with appropriate input padding) is g = I ∗ Θ with g ∈ R^{H×W×C_out}.

In images, nearby pixels with similar color or texture can be grouped into primitive shapes that represent parts of objects or the background. A convolution layer linearly projects local image patches to features at corresponding locations on the output map using shared parameters. While a convolution with random filters can project local patches to arbitrary output features, the output of a random linear projection approximately preserves the relative similarity between input patches, as proved in Appendix B. In other words, since any two locations within the same shape have similar local textures in the input image, they tend to be similar in the output feature map. Therefore, shapes that emerge in the output feature map are similar to shapes in the input image, provided that the filter size is sufficiently small compared to the size of a typical shape. That is, the size of a convolution filter determines the smallest shape it can preserve. For example, 1×1 random convolutions preserve shapes at the single-pixel level and thus act as a random color mapping; larger filters perturb shapes smaller than the filter size, which are considered local texture of a shape at this larger scale. See Fig. 1 for examples.

3.2. MULTI-SCALE IMAGE AUGMENTATION WITH A RANDOMIZED CONVOLUTION LAYER

Sec. 3.1 discussed how the outputs of randomized convolution layers approximately maintain shape information at a scale larger than their filter sizes. Here, we develop our RandConv data augmentation technique using a randomized convolution layer with C_out = C_in to generate shape-consistent images with randomized texture (see Alg. 1).
Our goal is not to use RandConv to parameterize or represent texture as in previous filter-bank based texture models (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000). Instead, we only use the three-channel outputs of RandConv as new images with the same shape and different "style" (loosely referred to as "texture"). We also note that a convolution layer is different from a convolution operation in image filtering. Standard image filtering applies the same 2D filter on the three color channels separately. In contrast, our convolution layer applies three different 3D filters, each taking all color channels as input and generating one channel of the output. Our proposed RandConv variants are as follows:

RC_img: Augmenting Images with Random Texture. A simple approach is to use the randomized convolution layer outputs, I ∗ Θ, as new images, where Θ are the randomly sampled weights and I is a training image. If the original training data is in the domain D_0, a sampled weight Θ_k generates images with consistent global shape but random texture, forming the random domain D_k. Thus, by random weight sampling, we obtain an infinite number of random domains D_1, D_2, . . . Input image intensities are assumed to follow a standard normal distribution N(0, 1) (which is often approximately true in practice thanks to data whitening). As the outputs of RandConv should follow the same distribution, we sample the convolution weights from N(0, σ²) with σ = 1/√(C_in × h × w), which is commonly applied for network initialization (He et al., 2015). We include the original images for training at a ratio p, a hyperparameter.

RC_mix: Mixing Variant. As shown in Fig. 1, outputs from RC_img can differ significantly from the appearance of the original images. Although generalizing to domains with significantly different local texture distributions is useful, we may not want to sacrifice much performance on domains similar to the training domain.
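As a concrete illustration, the sampling scheme for RC_img can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the channels-last layout, the function name, and the naive sliding-window loop are our own simplifications.

```python
import numpy as np

def rand_conv_img(img, k=3, rng=None):
    """One RC_img sample: a random k x k convolution with C_out = C_in.

    img: (H, W, C) whitened image, roughly N(0, 1) per channel.
    Weights are drawn from N(0, sigma^2) with sigma = 1/sqrt(C * k * k),
    so the output distribution roughly matches the input's.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = img.shape
    sigma = 1.0 / np.sqrt(c * k * k)
    weight = rng.normal(0.0, sigma, size=(k, k, c, c))  # fresh weights per call
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(img)
    for i in range(k):          # naive sliding-window convolution
        for j in range(k):
            # each k x k x C window mixes all input channels into each output channel
            out += padded[i:i + h, j:j + w] @ weight[i, j]
    return out
```

Each call draws fresh weights, so each call maps the training domain D_0 to a new random domain D_k; with k = 1 the layer reduces to a random color mapping.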
Inspired by the AugMix (Hendrycks et al., 2020b) strategy, we propose to blend the original image with the outputs of the RandConv layer via linear convex combinations αI + (1 − α)(I ∗ Θ), where α is a mixing weight uniformly sampled from [0, 1]. In RC_mix, the RandConv outputs provide shape-consistent perturbations of the original images. By varying α, we continuously interpolate between the training domain and the randomly sampled domains of RC_img.

Multi-scale Texture Corruption. As discussed in Sec. 3.1, image shape information at a scale smaller than a filter's size will be corrupted by RandConv. Therefore, we can use filters of varying sizes to preserve shapes at various scales. We uniformly sample a filter size k from a pool K = {1, 3, . . . , n} before sampling the convolution weights Θ ∈ R^{k×k×C_in×C_out} from the Gaussian distribution N(0, 1/(k²C_in)). Fig. 1 shows examples of multi-scale RandConv outputs.

Consistency Regularization. To learn representations invariant to texture changes, we use a loss encouraging consistent network predictions for the same RandConv-augmented image under different random filter samples. Approaches for transform-invariant domain randomization (Yue et al., 2019), data augmentation (Hendrycks et al., 2020b), and semi-supervised learning (Berthelot et al., 2019) use similar strategies. We use the Kullback-Leibler (KL) divergence to measure consistency. However, enforcing prediction similarity between two augmented variants may be too strong. Instead, following (Hendrycks et al., 2020b), we use RandConv to obtain three augmented samples of image I: G_j = RandConv_j(I) for j = 1, 2, 3, and obtain their predictions with a model Φ: y_j = Φ(G_j). We then compute the relaxed loss as λ Σ_{j=1}^{3} KL(y_j ‖ ȳ), where ȳ = Σ_{j=1}^{3} y_j / 3 is the sample average.
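RC_mix and the relaxed consistency loss can likewise be sketched as follows. This is a hedged sketch with our own helper names: the inner `_rand_conv` restates the random-convolution step for self-containment, and `probs` stands for the already-softmaxed predictions y_j.

```python
import numpy as np

def _rand_conv(img, k, rng):
    # random k x k convolution, C_out = C_in, weights ~ N(0, 1/(k*k*C))
    h, w, c = img.shape
    wgt = rng.normal(0.0, 1.0 / np.sqrt(k * k * c), size=(k, k, c, c))
    pad = k // 2
    p = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    return sum(p[i:i + h, j:j + w] @ wgt[i, j]
               for i in range(k) for j in range(k))

def rc_mix(img, sizes=(1, 3, 5, 7), rng=None):
    """RC_mix sample: alpha * I + (1 - alpha) * (I * Theta), alpha ~ U[0, 1],
    with the filter size k drawn uniformly from the multi-scale pool."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(rng.choice(sizes))
    alpha = rng.uniform()
    return alpha * img + (1.0 - alpha) * _rand_conv(img, k, rng)

def consistency_loss(probs, lam=10.0):
    """Relaxed consistency loss lambda * sum_j KL(y_j || y_bar),
    where y_bar is the average of the predicted distributions.
    probs: (3, num_classes) softmax outputs for the RandConv samples."""
    y_bar = probs.mean(axis=0, keepdims=True)
    return lam * float(np.sum(probs * (np.log(probs) - np.log(y_bar))))
```

Identical predictions for all three samples give zero loss; disagreement under different random filters is penalized, which pushes the representation towards texture invariance.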

4. EXPERIMENTS

Secs. 4.1 to 4.3 evaluate our methods on the following datasets: multiple digit recognition datasets, PACS, and ImageNet-Sketch. Sec. 4.4 uses PACS to explore the out-of-domain generalization of a pretrained representation in transfer learning, checking whether pretraining on ImageNet with our method improves domain generalization performance on downstream tasks. All experiments are in the single-domain generalization setting, where training and validation sets are drawn from one domain. Additional experiments with ResNet18 as the backbone are given in the Appendix.

4.1. DIGIT RECOGNITION

The five digit recognition datasets (MNIST (LeCun et al., 1998), MNIST-M (Ganin et al., 2016), SVHN (Netzer et al., 2011), SYNTH (Ganin & Lempitsky, 2014), and USPS (Denker et al., 1989)) have been widely used for domain adaptation and generalization research (Peng et al., 2019a;b; Qiao et al., 2020). Following the setups in (Volpi et al., 2018) and (Qiao et al., 2020), we train a simple CNN with 10,000 MNIST samples and evaluate accuracy on the test sets of the other four datasets. We also test on MNIST-C (Mu & Gilmer, 2019), a robustness benchmark with 15 common corruptions of MNIST, and report the average accuracy over all corruptions.

Selecting Hyperparameters and Ablation Study. Fig. 2(a) shows the effect of the hyperparameter p on RC_img with filter size 1. Adding only 10% RandConv data (p = 0.9) immediately improves the average performance (DG-Avg) on MNIST-M, SVHN, SYNTH, and USPS from 53.53 to 69.19, outperforming all other approaches (see Tab. 1) on every dataset. We choose p = 0.5, which obtains the best DG-Avg. Fig. 2(b) shows results for a multi-scale ablation study. Increasing the pool of filter sizes up to 7 improves DG-Avg performance. We therefore use multi-scale 1-7 to study the consistency loss weight λ, shown in Fig. 2(c). Adding the consistency loss improves both RandConv variants on DG-Avg: RC_mix1-7 favors λ = 10, while RC_img1-7,p=0.5 performs similarly for λ = 5 and λ = 10. We choose λ = 10 for all subsequent experiments.

Results. Tab. 1 compares the performance of RC_img1-7,p=0.5,λ=10 and RC_mix1-7,λ=10 with other state-of-the-art approaches. We show results for the adversarial-training based methods GUD (Volpi et al., 2018), M-ADA (Qiao et al., 2020), and PAR (Wang et al., 2019a). The baseline model is trained with the standard classification loss only.
To show that RandConv is more than a trivial color/contrast adjustment, we also compare to ColorJitter data augmentation (which randomly changes image brightness, contrast, and saturation) and GreyScale (where images are transformed to greyscale for training and testing). We also test data augmentation with a fixed Laplacian of Gaussian filter (Band-Pass) of size 3 and σ = 1, as well as the data augmentation pipeline (Multi-Aug) used in a recently proposed large-scale study on domain generalization algorithms and datasets (Gulrajani & Lopez-Paz, 2020). RandConv and its mixing variant outperform the best competing method (M-ADA) by 17% on DG-Avg, and achieve the best accuracy of 91.62% on MNIST-C. While the difference between the two RandConv variants is marginal, RC_mix1-7,λ=10 performs better on both DG-Avg and MNIST-C. When combined with Multi-Aug, RandConv achieves improved performance except on MNIST-C. Fig. 3 shows t-SNE plots of image features on the unseen domains for the baseline approach and RC_mix1-7,λ=10. The RandConv embeddings suggest better generalization to unseen domains.

4.2. PACS EXPERIMENTS

The PACS dataset (Li et al., 2018b) considers 7-class classification on 4 domains with very different texture styles: photo, art painting, cartoon, and sketch. Most recent domain generalization work studies the multi-source setting on PACS and uses the domain labels of the training data. Although we follow the convention to train on 3 domains and test on the fourth, we simply pool the data from the 3 training domains as in (Wang et al., 2019a), without using domain labels during training.

Baseline and State-of-the-Art. Following (Li et al., 2017), we use Deep-All as the baseline, which finetunes an ImageNet-pretrained AlexNet on 3 domains using only the classification loss and tests on the fourth domain. We test our RandConv variants RC_img1-7,p=0.5 and RC_mix1-7 with and without the consistency loss, and ColorJitter/GreyScale/BandPass/MultiAug data augmentation as for the digit datasets. We also implement PAR (Wang et al., 2019a) and test it combined with MultiAug. Further, we compare to the following state-of-the-art approaches: Jigen (Carlucci et al., 2019) using self-supervision, MLDG (Li et al., 2018a) using meta-learning, and the conditional invariant deep domain generalization method CIDDG (Li et al., 2018c). Note that previous methods used different Deep-All baselines, which makes the final accuracies not directly comparable; moreover, MLDG and CIDDG use domain labels for training.

Results. Tab. 2 shows significant improvements on Sketch for both RandConv variants. Sketch is the most challenging domain, with no color and much less texture than the other 3 domains. The success on Sketch demonstrates that our methods can guide the DNN to learn global, shape-focused representations that are robust to texture changes. Without the consistency loss, RC_mix1-7 achieves the best overall result, improving over Deep-All by ∼4%; adding MultiAug does not further improve performance.
Adding the consistency loss with λ = 10, RC_mix1-7 and RC_img1-7,p=0.5 perform better on Sketch but degrade on the other 3 domains, as do GreyScale and ColorJitter. This observation will be discussed in Sec. 4.4. ImageNet-Sketch (Wang et al., 2019a) is an out-of-domain test set for models trained on ImageNet.

4.3. GENERALIZING AN IMAGENET MODEL TO IMAGENET-SKETCH

We trained AlexNet from scratch with RC_img1-7,p=0.5,λ=10 and RC_mix1-7,λ=10 and evaluated their performance on ImageNet-Sketch. We use an AlexNet model trained without RandConv as our baseline. Tab. 3 compares our models to PAR and its baseline model, and to AlexNet trained with Stylized ImageNet (SIN) (Geirhos et al., 2019), on ImageNet-Sketch. Although PAR uses a stronger baseline, RandConv achieves significant improvements over our baseline and outperforms PAR by a large margin: our methods achieve more than a 7% accuracy improvement over the baseline and surpass PAR by 5%. SIN is an image stylization approach that can modify image texture in a hierarchical and realistic way; however, despite its complexity, it performs only on par with RandConv. Note that image stylization techniques require additional data and heavy precomputation, and the images providing the styles also need to be chosen. In contrast, RandConv is much easier to use: it can be applied to any dataset via a simple convolution layer. We also measure the shape-bias metric proposed by Geirhos et al. (2019) for RandConv-trained AlexNet: RC_img1-7,p=0.5,λ=10 and RC_mix1-7,λ=10 improve the baseline from 25.36% to 48.24% and 54.85%, respectively.

4.4. REVISITING PACS WITH MORE ROBUST PRETRAINED REPRESENTATIONS

A common practice for many computer vision tasks (including the PACS benchmark) is transfer learning, i.e., finetuning a backbone model pretrained on ImageNet. Recently, it has been studied how the ImageNet accuracy (Kornblith et al., 2019) and adversarial robustness (Salman et al., 2020) of the pretrained model affect transfer learning. Instead, we study how out-of-domain generalizability transfers from pretraining to downstream tasks, shedding light on how to better use pretrained models.

Impact of ImageNet Pretraining

A model trained on ImageNet may be biased towards textures (Geirhos et al., 2019). Finetuning ImageNet-pretrained models on PACS may inherit this texture bias, thereby benefiting generalization on the Photo domain (which is similar to ImageNet) but hurting performance on the Sketch domain. Therefore, as shown in Sec. 4.2, using RandConv to correct this texture bias improves results on Sketch but degrades them on Photo. Since pretraining has such a strong impact on transfer performance to new tasks, we ask: "Can the generalizability of a pretrained model transfer to downstream tasks? That is, does a pretrained model with better generalizability improve performance on unseen domains for new tasks?" To answer this, we revisit the PACS tasks using ImageNet-pretrained weights obtained with the two RandConv variants of Sec. 4.3 during ImageNet training. We study whether this changes performance for the Deep-All baseline and for finetuning with RandConv.

Better Performance via a RandConv-pretrained Model. We start by testing the Deep-All baselines using the two RandConv-trained ImageNet models of Sec. 4.3 as initialization. Tab. 4 shows significant improvements on Sketch; the results are comparable to finetuning with RandConv on a normally pretrained model. Art is also consistently improved. Performance drops slightly on Photo, as expected, since texture bias is helpful on the Photo domain and we reduced it in the pretrained model. A similar performance improvement is observed when using the SIN-trained AlexNet as initialization. Using RandConv for both ImageNet training and PACS finetuning, we achieve 76.11% accuracy on Sketch. As far as we know, this is the best performance with an AlexNet baseline.
This approach even outperforms Jigen (Carlucci et al., 2019) (71.35%), which uses a stronger ResNet18 baseline. This experiment confirms that generalizability may transfer: removing texture bias may not only make a pretrained model more generalizable, but may also help generalization on downstream tasks. For target domains similar to the pretraining domain, like Photo and ImageNet, where learning a texture bias may actually be beneficial, performance may degrade slightly.

5. CONCLUSION AND DISCUSSION

Randomized convolution (RandConv) is a simple but powerful data augmentation technique that randomizes local image texture. RandConv helps focus visual representations on global shape information rather than local texture. We theoretically justified the approximate shape-preserving property of RandConv and developed RandConv techniques using multi-scale and mixing designs. We also make use of a consistency loss to encourage texture invariance. RandConv outperforms state-of-the-art approaches by a large margin on the digit recognition benchmarks, on the Sketch domain of PACS, and on ImageNet-Sketch. By finetuning a RandConv-pretrained model on PACS, we showed that the generalizability of a pretrained model may transfer to and benefit new downstream tasks; this resulted in a new state-of-the-art performance on the PACS Sketch domain. RandConv can help computer vision tasks where a shape-biased model is beneficial, e.g., object detection, and can provide a shape-biased pretrained model to improve performance on downstream tasks when generalizing to unseen domains. However, local texture features can be useful for many computer vision tasks, especially for fixed-domain fine-grained visual recognition; in such cases, visual representations invariant to local texture may hurt in-domain performance. Important future work therefore includes learning representations that disentangle shape and texture features, and building models that use such representations in an explainable way. Adversarial robustness of deep neural networks has also received significant recent attention. Interestingly, Zhang & Zhu (2019) find that adversarially-trained models are more shape-biased, and Shi et al. (2020) show that their method for increasing shape bias also helps adversarial robustness, especially when combined with adversarial training. Exploring how RandConv affects the adversarial robustness of models could therefore be interesting future work.
Moreover, recent biologically inspired models for improving adversarial robustness (Dapello et al., 2020) use Gabor filters with fixed random configurations followed by a stochastic layer that adds Gaussian noise to the network input, which may explain the importance of randomness in RandConv. Exploring connections between RandConv and biological mechanisms in the human visual system would also be interesting future work.

APPENDIX

This supplementary material provides additional details. Specifically, in Secs. A and B, we discuss definitions of shapes and textures in images and formally justify why random convolution preserves global shapes and disrupts local texture by proving Theorem 1. This theorem shows that random linear projections are approximately distance-preserving. We also discuss our simulation-based bound, based on the central 80% of the distance rescaling ratios, on real image data. Sec. C provides more experimental details for the different datasets. Sec. D shows experimental results with a stronger backbone architecture and on the new benchmark ImageNet-R (Hendrycks et al., 2020a). Sec. E provides more detailed results on hyperparameter selection and ablation studies. Lastly, Sec. F shows example visualizations of the outputs of RandConv and its mixing variant.

A SHAPES AND TEXTURE IN IMAGES

As discussed in the main text, we define the shapes in images that are preserved by a random convolution layer as primitive shapes: spatial clusters of pixels with similar local texture. An object in an image can be a single primitive shape, but in most cases it is a composition of multiple primitive shapes; e.g., a car includes wheels, body frames, and windshields. Note that texture is not necessarily the opposite of shape, since the texture of a larger shape can include smaller shapes. For example, in Fig. 4, the occluded triangle shape on the left has texture composed of cobblestone shapes, while the cobblestones have their own texture. Random convolution can preserve the large shapes that usually define the image semantics while distorting the small shapes that act as local texture.

To formally define the shape-preserving property, assume (x_1, y_1), (x_2, y_2), and (x_3, y_3) are three locations in an image, and that (x_1, y_1) has closer color and local texture to (x_2, y_2) than to (x_3, y_3); for example, (x_1, y_1) and (x_2, y_2) lie within the same shape while (x_3, y_3) lies in a neighboring shape. Then we have

‖p(x_1, y_1) − p(x_2, y_2)‖ < ‖p(x_1, y_1) − p(x_3, y_3)‖,

where p(x_i, y_i) is the image patch at location (x_i, y_i). A transformation f is shape-preserving if it maintains such relative distance relations for most location triplets, i.e.,

‖f(p(x_i, y_i)) − f(p(x_j, y_j))‖ / ‖p(x_i, y_i) − p(x_j, y_j)‖ ≈ r     (1)

for any two spatial locations (x_i, y_i) and (x_j, y_j), where r ≥ 0 is a constant.

B RANDOM CONVOLUTION IS SHAPE-PRESERVING AS RANDOM LINEAR PROJECTION IS DISTANCE PRESERVING

We can express a convolution layer as a local linear projection: g(x, y) = U p(x, y), where p(x, y) ∈ R^d (d = h × w × C_in) is the vectorized image patch centered at location (x, y), g(x, y) ∈ R^{C_out} is the output feature at location (x, y), and U ∈ R^{C_out×d} is the matrix expressing the convolution layer filters Θ. That is, for each sliding window centered at (x, y), a convolution layer applies a linear transform f : R^d → R^{C_out} projecting the d-dimensional local image patch p(x, y) to its C_out-dimensional feature g(x, y). When Θ is independently randomly sampled, e.g., from a Gaussian distribution, the convolution layer preserves global shapes, since a random linear projection is approximately distance-preserving, as shown by bounding the range of r in Eq. 1 in Theorem 1.

Theorem 1. Suppose we have N data points z_1, . . . , z_N ∈ R^d. Let f(z) = Uz be a random linear projection f : R^d → R^m with U ∈ R^{m×d} and U_{i,j} ∼ N(0, σ²). Then, with r_{i,j} := ‖f(z_i) − f(z_j)‖ / ‖z_i − z_j‖,

P( sup_{i≠j; i,j∈[N]} r_{i,j} > δ_1 ) ≤ ε  and  P( inf_{i≠j; i,j∈[N]} r_{i,j} < δ_2 ) ≤ ε,

where δ_1 := σ √(χ²_{2ε/(N(N−1))}(m)) and δ_2 := σ √(χ²_{1−2ε/(N(N−1))}(m)). Here, χ²_α(m) denotes the α-upper quantile of the χ² distribution with m degrees of freedom.

Thm. 1 tells us that, for any data pair (z_i, z_j) in a set of N points, the distance rescaling ratio r_{i,j} after a random linear projection is bounded by δ_1 and δ_2 with probability 1 − ε. A smaller N and a larger output dimension m give better bounds. E.g., when m = 3, N = 1,000, σ = 1, and ε = 0.1, we get δ_1 = 5.8 and δ_2 = 0.01. Thm. 1 gives a theoretical bound for all N(N−1)/2 pairs. In practice, however, preserving distances for a majority of the N(N−1)/2 pairs is sufficient. To empirically verify this, we test the range of the central 80% of {r_{i,j}} on real image data. Using the same (m, N, σ, ε), 80% of the pairs lie in [0.56, 2.87], which is significantly better than the strict bound [0.01, 5.8].
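The quantile bounds δ_1 and δ_2 can be checked numerically with SciPy's χ² inverse survival function (a sketch; the variable names are ours, with the same m, N, σ, and ε as in the text):

```python
import numpy as np
from scipy.stats import chi2

# Evaluate the Thm. 1 bounds for m = 3, N = 1000, sigma = 1, eps = 0.1.
m, N, sigma, eps = 3, 1000, 1.0, 0.1
alpha = 2 * eps / (N * (N - 1))                   # per-pair tail mass for the union bound
delta1 = sigma * np.sqrt(chi2.isf(alpha, m))      # alpha-upper quantile -> sup bound
delta2 = sigma * np.sqrt(chi2.isf(1 - alpha, m))  # (1 - alpha)-upper quantile -> inf bound
print(f"delta1 = {delta1:.2f}, delta2 = {delta2:.2f}")  # delta1 ≈ 5.8, delta2 ≈ 0.01
```

This reproduces the values quoted above: a loose worst-case interval over all N(N−1)/2 pairs, motivating the empirical 80%-coverage simulation instead.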
A proof of the theorem and simulation details are given in the following. Proof. Let U k represent to the k-th row of U. It is easy to check that v k := U k , z i -z j / z i -z j ∼ N (0, σ 2 ). Therefore, f (z i ) -f (z j ) 2 σ 2 z i -z j 2 = 1 σ 2 (z i -z j ) U U(z i -z j ) z i -z j 2 = m k=1 v 2 k σ 2 ∼ χ 2 (m). Therefore, for 0 < < 1, we have P f (z i ) -f (z j ) 2 σ 2 z i -z j 2 > χ 2 2 N (N -1) (m) ≤ 2 N (N -1) . From the above inequality, we have P sup i =j;i,j∈[N ] f (zi)-f (zj ) 2 zi-zj 2 > σ 2 χ 2 2 N (N -1) (m) = P sup i =j;i,j∈[N ] f (zi)-f (zj ) 2 σ 2 zi-zj 2 > χ 2 2 N (N -1) (m) = P i =j;i,j∈[N ] f (zi)-f (zj ) 2 σ 2 zi-zj 2 > χ 2 2 N (N -1) (m) ≤ i =j;i,j∈[N ] P f (zi)-f (zj ) 2 σ 2 zi-zj 2 > χ 2 2 N (N -1) (m) ≤ , which is equivalent to P sup i =j;i,j∈[N ] f (z i ) -f (z j ) z i -z j > σ χ 2 2 N (N -1) (m) ≤ . Similarly, we have P inf i =j;i,j∈[N ] f (z i ) -f (z j ) z i -z j < σ χ 2 1- 2 N (N -1) (m) ≤ . Simulation on Real Image Data To better understand the relative distance preservation property of random linear projections in practice, we use Algorithm 2 to empirically obtain a bound for real image data. We choose m = 3, N = 1, 000, σ = 1 and = 0.1 as in computing our theoretical bounds. We use M = 1, 000 real images from the PACS dataset for this simulation. Note that the image patch size or d does not affect the bound. We use a patch size of 3 × 3 resulting in d = 27. This simulation tell us that applying linear projections with a randomly sampled U on N local images patches in every image, we have a 1chance that 80% of r i,j is in the range [δ 10% , δ 90% ]. Algorithm 2 Simulate the range of central 80% of r i,j on real image data 1: Input: M images {Ii} M i=1 , number of data points N , projection output dimension m, standard deviation σ of normal distribution, confidence level . 
2: for $m = 1 \to M$ do
3:   Sample image patches in $I_m$ at $N$ locations and vectorize them as $\{z_l^m\}_{l=1}^{N}$
4:   Sample a projection matrix $U \in \mathbb{R}^{m \times d}$ with $U_{i,j} \sim N(0, \sigma^2)$
5:   for $i = 1 \to N$ do
6:     for $j = i+1 \to N$ do
7:       Compute $r_{i,j}^m = \|f(z_i^m) - f(z_j^m)\| / \|z_i^m - z_j^m\|$, where $f(z) = Uz$
8:   $q_{10\%}^m$ = 10% quantile of $\{r_{i,j}^m\}$ for $I_m$
9:   $q_{90\%}^m$ = 90% quantile of $\{r_{i,j}^m\}$ for $I_m$    ▷ Central 80% of $r_{i,j}$ in each image
10: $\delta_{10\%}$ = $\epsilon$ quantile of all $q_{10\%}^m$
11: $\delta_{90\%}$ = $(1-\epsilon)$ quantile of all $q_{90\%}^m$    ▷ Confidence bound for $q_{10\%}^m$ and $q_{90\%}^m$

D MORE EXPERIMENTS WITH RESNET-18

In this section, we demonstrate that RandConv also works with other, stronger backbone architectures, e.g., a Residual Network (He et al., 2016a). Specifically, we run the PACS and ImageNet experiments with ResNet-18 as the baseline and with RandConv. As Table 5 shows, RandConv improves the ResNet-18 baseline on ImageNet-Sketch by 10.5% accuracy. When using a RandConv-pretrained ResNet-18 on PACS, the performance of finetuning with both DeepAll and RandConv improves, as shown in Table 7. The best average domain generalization accuracy is 84.09%, a more than 8% improvement over our initial DeepAll baseline. A model pretrained with RC mix1-7,λ=10 generally performs better than one pretrained with RC img1-7,p=0.5,λ=10. We also provide the ResNet-18 performance of JiGen (Carlucci et al., 2019) on PACS as a reference. Note that JiGen uses extra data augmentation and a different data split than our approach, and it only improves over its own baseline by 1.5%. In addition, we test a RandConv-trained ResNet-18 on ImageNet-R (Hendrycks et al., 2020a), a domain generalization benchmark that contains images of artistic renditions of 200 object classes from the original ImageNet dataset. As Table 6 shows, RandConv also improves generalization performance on ImageNet-R and reduces the gap between the in-domain (ImageNet-200) and out-of-domain (ImageNet-R) performance.
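For reference, the augmentation variants compared in these experiments can be sketched in a few lines (a minimal NumPy sketch of the idea, not the paper's actual PyTorch implementation; the He-style filter scale, the uniform mixing coefficient, and the scale set are assumptions based on the descriptions in this paper):

```python
import numpy as np

def random_conv(img, k, rng):
    """Convolve an HxWxC image with a freshly sampled kxk random filter bank
    (C input -> C output channels), zero-padded to keep the spatial size."""
    H, W, C = img.shape
    sigma = 1.0 / np.sqrt(k * k * C)              # He-style std (assumption)
    w = rng.normal(0.0, sigma, size=(k, k, C, C))
    pad = k // 2
    x = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = x[i:i + k, j:j + k, :]        # kxkxC local patch
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def randconv_augment(img, rng, scales=(1, 3, 5, 7), p=0.5, mix=False):
    """RC_img: with probability p keep the original image, otherwise replace it
    with a random-convolution output at a randomly chosen scale.
    RC_mix: blend original and filtered images with a random coefficient."""
    k = int(rng.choice(scales))
    filtered = random_conv(img, k, rng)
    if mix:
        alpha = rng.uniform()                     # mixing coefficient
        return alpha * img + (1 - alpha) * filtered
    return img if rng.uniform() < p else filtered

rng = np.random.default_rng(0)
img = rng.uniform(size=(8, 8, 3)).astype(float)
aug = randconv_augment(img, rng, mix=True)
print(aug.shape)  # (8, 8, 3)
```

RC img replaces the whole image with probability $1-p$, while RC mix interpolates and so keeps more of the original texture; the scale $k$ controls how strongly local shapes are distorted (cf. Figure 4).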
We provide detailed experimental results for the digits recognition datasets. Table 8 shows results for different hyperparameters $p$ for RC img1. Table 9 shows results of an ablation study of the multi-scale design for RC mix and RC img,p=0.5. Table 10 shows results of the study of the consistency loss weight $\lambda$ for RC mix1-7 and RC img1-7,p=0.5. Tables 8, 9, and 10 correspond to Fig. 2(a), (b), and (c) in the main text, respectively.



Code is available at https://github.com/wildphoton/RandConv. See PyTorch documentation for implementation details; all parameters are set to 0.5. https://github.com/pytorch/examples/tree/master/imagenet



Figure 1: Top: Illustration that RandConv randomizes local texture but preserves shapes in the image. Middle:

Figure 2: Average accuracy and 5-run variance of MNIST model on MNIST-M, SVHN, SYNTH and USPS. Studies for: (a) original data fraction p for RC img ; (b) multiscale design (1-n refers to using scales 1,3,..,n) for RC img,p=0.5 (orange) and RC mix (blue); (c) consistency loss weight λ for RC img1-7,p=0.5 (orange) and RC mix1-7 (blue).

Figure 3: t-SNE feature embedding visualization for digit datasets for models trained on MNIST without (top) and with our RC mix1-7,λ=10 approach (bottom). Different colors denote different classes.

Figure 4: Left: An image with texture and shapes at different scales; Middle: The output of RandConv with a small filter size which largely preserves the shapes of the stones. Right: The output of RandConv with a large filter size distorts the shape of the stones as well.


Volpi et al. (2018) generate adversarial examples to augment the training data; Qiao et al. (2020) extend this approach via meta-learning. As with other adversarial training algorithms, significant extra computation is required to obtain adversarial examples.
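For concreteness, the core of such adversarial augmentation can be illustrated with a generic FGSM-style step on a toy logistic model (an illustrative sketch only, not Volpi et al.'s actual algorithm; the model, loss, and step size `eps` are placeholder assumptions):

```python
import numpy as np

def fgsm_augment(x, y, w, b, eps=0.1):
    """One FGSM-style step on a logistic-regression 'model': perturb x in the
    direction that increases the loss, yielding an extra training sample."""
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))       # sigmoid prediction
    # d(loss)/dx for binary cross-entropy with logits: (p - y) * w
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)   # adversarially perturbed copy

rng = np.random.default_rng(0)
w, b = rng.normal(size=4), 0.0
x, y = rng.normal(size=4), 1.0
x_adv = fgsm_augment(x, y, w, b)
print(np.abs(x_adv - x).max())  # bounded by eps
```

The extra cost noted above comes from computing such gradient steps (often many, for multi-step attacks) for every training batch, in addition to the usual forward/backward pass.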

Average accuracy and 5-run standard deviation (in parenthesis) of MNIST10K model on MNIST-M, SVHN, SYNTH, USPS and their average (DG-avg); and average accuracy of 15 types of corruptions in MNIST-C. Both RandConv variants significantly outperform all other methods.



Accuracy of ImageNet-trained AlexNet on ImageNet-Sketch (IN-S) data. Our methods outperform PAR by 5% and are on par with a model trained on Stylized-ImageNet (SIN). Note that PAR was built on top of a stronger baseline than our model, and both PAR and SIN fine-tuned the baseline model, which helped performance, while we train the RandConv model from scratch.

Generalization results on PACS with RandConv and SIN pretrained AlexNet. ImageNet column shows how the pretrained model is trained on ImageNet (baseline represents training the ImageNet model using only the classification loss); PACS column indicates the methods used for finetuning on PACS. Best and second best accuracy for each target domain are highlighted in bold and underlined.

Accuracy of ImageNet-trained ResNet-18 on ImageNet-Sketch data.

Top 1 Accuracy of ImageNet-trained ResNet-18 on ImageNet-R data. ImageNet-200 are the original ImageNet data with the same 200 classes as ImageNet-R.

Generalization results on PACS with a RandConv-pretrained model using ResNet-18. The ImageNet column shows how the pretrained model is trained on ImageNet (baseline represents training using only the classification loss); the PACS column indicates the methods used for finetuning on PACS. Best and second-best accuracy for each target domain are highlighted in bold and underlined. The performance of JiGen (Carlucci et al., 2019) and its baseline using ResNet-18 is also given.

Ablation study of hyperparameter p for RC img1 on digits recognition benchmarks. DG-Avg is the average performance on MNIST-M, SVHN, SYNTH and USPS. Best results are bold.

Ablation study of multi-scale RandConv on digits recognition benchmarks for RC mix and RC img,p=0.5. Best entries for each variant are bold.

Method | MNIST | MNIST-M | SVHN | USPS | SYNTH | DG-Avg | MNIST-C-Avg
RCmix1 | 98.62 (0.06) | 83.98 (0.98) | 53.26 (2.59) | 80.57 (1.09) | 59.25 (1.38) | 69.26 (1.35) | 88.59 (0.38)
RCmix1-3 | 98.76 (0.02) | 84.66 (1.67) | 55.89 (0.83) | 80.95 (1.15) | 60.07 (1.05) | 70.39 (0.58) | 89.80 (0.94)
RCmix1-5 | 98.76 (0.06) | 84.32 (0.43) | 56.50 (2.68) | 81.85 (1.05) | 60.76 (1.02) | 70.86 (0.86) | 90.06 (0.80)
RCmix1-7 | 98.82 (0.06) | 84.91 (0.68) | 55.61 (2.63) | 82.09 (1.00) | 62.15 (1.30) | 71.19 (1.21) | 90.30 (0.44)
RCmix1-9 | 98.81 (0.12) | 85.13 (0.72) | 54.18 (3.36) | 82.07 (1.28) | 61.85 (1.41) | 70.81 (1.24) | 90.83 (0.52)
RCimg…, p=0.5 | 98.79 (0.07) | 85.36 (1.04) | 55.60 (1.09) | 80.99 (0.99) | 61.26 (0.80) | 70.80 (0.86) | 89.84 (0.70)
RCimg1-5, p=0.5 | 98.83 (0.07) | 86.33 (0.47) | 54.99 (2.48) | 80.82 (1.83) | 62.61 (0.75) | 71.19 (1.25) | 90.70 (0.43)
RCimg1-7, p=0.5 | 98.83 (0.07) | 86.08 (0.27) | 54.93 (1.27) | 81.58 (0.74) | 62.78 (0.86) | 71.34 (0.61) | 91.18 (0.38)
RCimg1-9, p=0.5 | 98.80 (0.12) | 85.63 (0.70) | 52.82 (2.01) | 81.48 (1.22) | 62.55 (0.74) | 70.62 (0.73) | 90.79 (0.48)

Ablation study of consistency loss weight λ on digits recognition benchmarks for RC mix1-7 and RC img1-7,p=0.5. DG-Avg is the average performance on MNIST-M, SVHN, SYNTH and USPS. Best results for each variant are bold.

Variant | λ | MNIST | MNIST-M | SVHN | USPS | SYNTH | DG-Avg | MNIST-C-Avg
RCmix1-7 | 20 | — (0.05) | 87.18 (0.81) | 57.68 (1.64) | 83.55 (0.83) | 63.08 (0.50) | 72.87 (0.47) | 91.14 (0.53)
RCmix1-7 | 10 | 98.85 (0.04) | 87.76 (0.83) | 57.52 (2.09) | 83.36 (0.96) | 62.88 (0.78) | 72.88 (0.58) | 91.62 (0.77)
RCmix1-7 | 5 | 98.94 (0.09) | 87.53 (0.51) | 55.70 (2.22) | 83.12 (1.08) | 62.37 (0.98) | 72.18 (1.04) | 91.46 (0.50)
RCmix1-7 | 1 | 98.95 (0.05) | 86.77 (0.79) | 56.00 (2.39) | 83.13 (0.71) | 63.18 (0.97) | 72.27 (0.82) | 91.15 (0.42)
RCmix1-7 | 0.1 | 98.84 (0.07) | 85.41 (1.02) | 56.51 (1.58) | 81.84 (1.14) | 61.86 (1.44) | 71.41 (0.98) | 90.72 (0.60)
RCmix1-7 | 0 | 98.82 (0.06) | 84.91 (0.68) | 55.61 (2.63) | 82.09 (1.00) | 62.15 (1.30) | 71.19 (1.21) | 90.30 (0.44)
RCimg1-7,p=0.5 | 20 | 98.79 (0.04) | 87.53 (0.79) | 53.92 (1.59) | 81.83 (0.70) | 62.16 (0.37) | 71.36 (0.49) | 91.20 (0.53)
RCimg1-7,p=0.5 | 10 | 98.86 (0.05) | 87.67 (0.37) | 54.95 (1.90) | 82.08 (1.46) | 63.37 (1.58) | 72.02 (1.15) | 90.94 (0.51)
RCimg1-7,p=0.5 | 5 | 98.90 (0.04) | 87.77 (0.72) | 55.00 (1.40) | 82.10 (0.55) | 63.58 (1.33) | 72.11 (0.62) | 90.83 (0.71)
RCimg1-7,p=0.5 | 1 | 98.86 (0.04) | 86.74 (0.32) | 53.26 (2.99) | 81.51 (0.48) | 62.00 (1.15) | 70.88 (0.93) | 91.11 (0.62)
RCimg1-7,p=0.5 | 0.1 | 98.85 (0.14) | 86.85 (0.31) | 53.55 (3.63) | 81.23 (1.02) | 62.77 (0.80) | 71.10 (1.31) | 91.13 (0.69)
RCimg1-7,p=0.5 | 0 | 98.83 (0.07) | 86.08 (0.27) | 54.93 (1.27) | 81.58 (0.74) | 62.78 (0.86) | 71.34 (0.61) | 91.18 (0.38)
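The weight λ swept in this table scales a consistency term that encourages predictions on differently augmented copies of an image to agree. A minimal sketch of one such term (assuming KL divergence of each view's prediction to the mean prediction over augmented views; the exact formulation used in the paper is simplified here):

```python
import numpy as np

def consistency_loss(probs):
    """probs: (n_views, n_classes) prediction distributions for augmented
    copies of one image. Returns the mean KL divergence of each view's
    distribution to the average distribution across views."""
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(mean + 1e-12)), axis=1)
    return kl.mean()

# Identical predictions -> zero loss; disagreement -> positive loss
same = [[0.7, 0.2, 0.1]] * 3
diff = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
print(consistency_loss(same))      # 0.0
print(consistency_loss(diff) > 0)  # True
```

The total training objective would then be the classification loss plus λ times this term; λ = 0 in the table recovers training without the consistency loss.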


Acknowledgments We thank Zhiding Yu for discussions on initial ideas and the experimental setup. We also thank Nathan Cahill for advice on proving the properties of random convolutions.


The following columns are convolution results using random filters of different sizes $k$. We can see that smaller filter sizes help maintain the finer shapes.

