SELFNORM AND CROSSNORM FOR OUT-OF-DISTRIBUTION ROBUSTNESS

Abstract

Normalization techniques are crucial for stabilizing and accelerating the training of deep neural networks. However, they are mainly designed for independent and identically distributed (IID) data and do not satisfy many real-world out-of-distribution (OOD) situations. Unlike most previous works, this paper presents two normalization methods, SelfNorm and CrossNorm, to promote OOD generalization. SelfNorm uses attention to recalibrate statistics (channel-wise mean and variance), while CrossNorm exchanges the statistics between feature maps. SelfNorm and CrossNorm can complement each other in OOD generalization, even though they explore opposite directions in statistics usage. Extensive experiments on different domains (vision and language), tasks (classification and segmentation), and settings (supervised and semi-supervised) show their effectiveness.

1. INTRODUCTION

Normalization methods, e.g., Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016), and Instance Normalization (Ulyanov et al., 2016), play a pivotal role in training deep neural networks. Most of them aim to make training more stable and convergence faster, assuming that training and test data come from the same distribution. However, few studies investigate normalization as a means of improving OOD generalization in real-world scenarios. For example, image corruptions (Hendrycks & Dietterich, 2019), e.g., snow and blur, can push test data outside the clean training distribution. Moreover, training on synthetic data (Richter et al., 2016) to generalize to realistic data can significantly reduce the annotation burden.

This work aims to encourage the interaction between normalization and OOD generalization. Specifically, we manipulate feature mean and variance to make models generalize better to out-of-distribution data. Our inspiration comes from the observation that the channel-wise mean and variance of feature maps carry some style information. For instance, exchanging the RGB means and variances between two instances transfers style between them, as shown in Figure 1 (a). For many tasks, such as CIFAR classification (Krizhevsky et al., 2009), the style encoded by channel-wise mean and variance is usually less critical for recognizing the object than other information such as object shape. Therefore, we propose CrossNorm, which swaps the channel-wise mean and variance of feature maps. CrossNorm augments styles during training, making the model more robust to appearance changes. Furthermore, given one image in different styles, we can reduce the style discrepancy by adjusting the RGB means and variances properly, as illustrated in Figure 1 (b). Intuitively, this style recalibration can reduce appearance variance, which may help bridge distribution gaps between training and unforeseen test data.
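The statistics exchange behind CrossNorm can be sketched in a few lines of NumPy. This is a minimal illustration in the spirit of Figure 1 (a); the function name, shapes, and epsilon are our choices, not the paper's code:

```python
import numpy as np

def crossnorm(x, y, eps=1e-5):
    """Swap channel-wise mean/std between two feature maps.

    x, y: arrays of shape (C, H, W). Returns x re-styled with y's
    channel statistics, and y re-styled with x's.
    """
    mx = x.mean(axis=(1, 2), keepdims=True)          # (C, 1, 1) means
    sx = x.std(axis=(1, 2), keepdims=True) + eps     # (C, 1, 1) stds
    my = y.mean(axis=(1, 2), keepdims=True)
    sy = y.std(axis=(1, 2), keepdims=True) + eps
    x2y = (x - mx) / sx * sy + my   # x's content, y's style statistics
    y2x = (y - my) / sy * sx + mx   # y's content, x's style statistics
    return x2y, y2x
```

After the swap, each channel of `x2y` has (up to the epsilon) the mean and standard deviation of the corresponding channel of `y`, which is exactly the "style exchange" the text describes.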
To this end, we propose SelfNorm, which uses attention (Hu et al., 2018) to adjust channel-wise mean and variance automatically. It is instructive to analyze the distinction and connection between CrossNorm and SelfNorm. At first glance, they take opposite actions (style augmentation vs. style reduction). Even so, they use the same tool, channel-wise statistics, and pursue the same goal, OOD robustness. Additionally, CrossNorm can increase the capacity of SelfNorm through style augmentation: SelfNorm, with the help of CrossNorm, generalizes better to OOD data.
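A minimal NumPy sketch of this recalibration follows, assuming sigmoid attention functions f and g that rescale each channel's mean and standard deviation based on the (mean, std) pair. The per-channel weight vectors `wf` and `wg` are illustrative stand-ins for the paper's learned attention; details such as the exact attention architecture are not specified here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selfnorm(x, wf, wg, eps=1e-5):
    """Recalibrate each channel's mean/std with attention weights.

    x: (C, H, W) feature map. wf, wg: (C, 2) weights of two tiny
    per-channel attention functions f and g acting on (mean, std).
    """
    mu = x.mean(axis=(1, 2))                  # (C,) channel means
    sd = x.std(axis=(1, 2)) + eps             # (C,) channel stds
    stats = np.stack([mu, sd], axis=1)        # (C, 2) statistics
    a = sigmoid((wf * stats).sum(axis=1))     # attention scale for mean
    b = sigmoid((wg * stats).sum(axis=1))     # attention scale for std
    mu2, sd2 = a * mu, b * sd                 # recalibrated statistics
    # standardize, then re-inject the recalibrated statistics
    xn = (x - mu[:, None, None]) / sd[:, None, None]
    return xn * sd2[:, None, None] + mu2[:, None, None]
```

Unlike CrossNorm, which needs a second feature map to swap with, SelfNorm operates on a single feature map, emphasizing or suppressing each channel's own style statistics.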

Concept and Intuition.

The style concept here refers to a family of weak cues associated with the semantic content of interest. For instance, the image style in object recognition can include many appearance-related factors such as color, contrast, and brightness. Style may sometimes help in decision-making, but a robust model should weigh more heavily on more vital content cues. To reduce the style bias rather than discard style entirely, we apply CrossNorm with some probability during training. The insight beneath CrossNorm is that each instance, or feature map, has its own unique style. Further, style cues are not equally important. For example, the yellow color seems more useful than other style cues in recognizing an orange. In light of this, the intuition behind SelfNorm is that attention may help emphasize essential styles and suppress trivial ones.

Assumption. Although we use the channel-wise mean and variance to modify styles, we do not assume that they are sufficient to represent all style cues. Better style representations are available with more complex statistics (Li et al., 2017) or even style transfer models (Ulyanov et al., 2017; Huang & Belongie, 2017). We choose the first- and second-order statistics mainly because they are simple, efficient to compute, and can connect normalization to out-of-distribution generalization. In summary, the key contributions are:

• We propose SelfNorm and CrossNorm, two simple yet effective normalization techniques to enhance out-of-distribution generalization.

• SelfNorm and CrossNorm form a unity of opposites in using feature mean and variance for model robustness.

• They are domain agnostic and can advance state-of-the-art robustness performance across different domains (vision and language), settings (fully and semi-supervised), and tasks (classification and segmentation).

Yue et al. (2019) use style augmentation for domain generalization on segmentation datasets. Their method suffers from the same issues as Stylized-ImageNet, because it also relies on pretrained style transfer models and additional style datasets. By contrast, CrossNorm is more efficient and balances better between the source and target domains' performance. Beyond the vision field, many natural language processing (NLP) applications also face out-of-distribution generalization challenges (Hendrycks et al., 2020b). Benefiting from their domain-agnostic property, SelfNorm and CrossNorm can also improve model robustness in the NLP area.
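As noted above, CrossNorm is applied with some probability during training rather than on every batch, so that clean styles still dominate. A hedged sketch of such a training-time wrapper follows; the random within-batch pairing and the default probability are our illustrative choices, not the paper's exact settings:

```python
import numpy as np

def maybe_crossnorm(batch, p=0.5, eps=1e-5, rng=None):
    """With probability p, swap each sample's channel statistics with
    those of a randomly paired sample in the same batch.

    batch: (N, C, H, W) feature maps or images.
    """
    rng = rng or np.random.default_rng()
    if rng.random() >= p:                 # applied stochastically:
        return batch                      # most batches keep clean styles
    perm = rng.permutation(len(batch))    # random pairing within batch
    m = batch.mean(axis=(2, 3), keepdims=True)      # (N, C, 1, 1)
    s = batch.std(axis=(2, 3), keepdims=True) + eps
    return (batch - m) / s * s[perm] + m[perm]
```

Applying the swap only sometimes is what lets CrossNorm reduce the model's style bias without discarding style cues altogether.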



Figure 1: CIFAR examples of exchanging (Left) and adjusting (Right) RGB mean and variance.

Out-of-distribution generalization. Although current deep models continue to break records on benchmark IID datasets, they still struggle to generalize to OOD data caused by common corruptions (Hendrycks & Dietterich, 2019) and dataset gaps (Richter et al., 2016). To improve robustness against corruption, Stylized-ImageNet (Geirhos et al., 2019) conducts style augmentation to reduce the texture bias of CNNs. Compared to it, CrossNorm has two main advantages. First, CrossNorm is efficient because it transfers styles directly in the feature space of the target CNN, whereas Stylized-ImageNet relies on external style datasets and pretrained style transfer models. Second, CrossNorm can advance performance on both clean and corrupted data, while Stylized-ImageNet hurts clean generalization: unlike the consistent styles within one dataset, external styles can cause massive distribution shifts. Recently, AugMix (Hendrycks et al., 2020c) trains robust models by mixing multiple augmented images based on random image primitives or image-to-image networks (Hendrycks et al., 2020a). Adversarial noise training (ANT) (Rusak et al., 2020) can also improve robustness against corruption. CrossNorm is domain agnostic and orthogonal to AugMix and ANT, making their joint application possible. Moreover, unsupervised domain adaptation is also useful for corruption robustness in some situations (Schneider et al., 2020).

