SELFNORM AND CROSSNORM FOR OUT-OF-DISTRIBUTION ROBUSTNESS

Abstract

Normalization techniques are crucial for stabilizing and accelerating the training of deep neural networks. However, they are mainly designed for independent and identically distributed (IID) data and do not cover many real-world out-of-distribution (OOD) situations. Unlike most previous works, this paper presents two normalization methods, SelfNorm and CrossNorm, to promote OOD generalization. SelfNorm uses attention to recalibrate statistics (channel-wise mean and variance), while CrossNorm exchanges the statistics between feature maps. Although they explore opposite directions in statistics usage, SelfNorm and CrossNorm can complement each other for OOD generalization. Extensive experiments on different domains (vision and language), tasks (classification and segmentation), and settings (supervised and semi-supervised) show their effectiveness.

1. INTRODUCTION

Normalization methods, e.g., Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016), and Instance Normalization (Ulyanov et al., 2016), play a pivotal role in training deep neural networks. Most of them aim to make training more stable and convergence faster, assuming that training and test data come from the same distribution. However, few studies investigate how normalization can improve OOD generalization in real-world scenarios. For example, image corruptions (Hendrycks & Dietterich, 2019), e.g., snow and blur, can push test data outside the clean training distribution. Moreover, training on synthetic data (Richter et al., 2016) to generalize to realistic data can significantly reduce the annotation burden. This work aims to encourage the interaction between normalization and OOD generalization. Specifically, we manipulate feature mean and variance to make models generalize better to out-of-distribution data.

Our inspiration comes from the observation that the channel-wise mean and variance of feature maps carry style information. For instance, exchanging the RGB means and variances between two instances transfers style between them, as shown in Figure 1 (a). For many tasks, such as CIFAR classification (Krizhevsky et al., 2009), the style encoded by channel-wise mean and variance is usually less critical for recognizing the object than other information such as object shape. Therefore, we propose CrossNorm, which swaps the channel-wise mean and variance of feature maps. CrossNorm augments styles during training, making the model more robust to appearance changes. Furthermore, given one image in different styles, we can reduce the style discrepancy by adjusting the RGB means and variances properly, as illustrated in Figure 1 (b). Intuitively, this style recalibration reduces appearance variance, which may help bridge distribution gaps between training and unforeseen test data.
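The statistic exchange underlying CrossNorm can be sketched as follows. This is a minimal NumPy illustration of swapping channel-wise mean and standard deviation between two feature maps, not the paper's exact implementation; the function name, shapes, and the epsilon choice are assumptions made for the sketch:

```python
import numpy as np

def cross_norm(x_a, x_b, eps=1e-5):
    """Exchange channel-wise statistics between two (C, H, W) feature maps.

    Hypothetical sketch: each map is normalized with its own per-channel
    mean/std, then re-styled with the other map's statistics.
    """
    mu_a = x_a.mean(axis=(1, 2), keepdims=True)
    mu_b = x_b.mean(axis=(1, 2), keepdims=True)
    sig_a = x_a.std(axis=(1, 2), keepdims=True) + eps
    sig_b = x_b.std(axis=(1, 2), keepdims=True) + eps
    # Normalize with own stats, then apply the other map's mean/std.
    a_to_b = (x_a - mu_a) / sig_a * sig_b + mu_b
    b_to_a = (x_b - mu_b) / sig_b * sig_a + mu_a
    return a_to_b, b_to_a
```

Applied to RGB channels of raw images, this reproduces the style exchange of Figure 1 (a); inside a network, the same operation on intermediate feature maps acts as style augmentation.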
To this end, we propose SelfNorm, which uses attention (Hu et al., 2018) to adjust channel-wise mean and variance automatically. It is interesting to analyze the distinction and connection between CrossNorm and SelfNorm. At first glance, they take opposite actions (style augmentation vs. style reduction). Even so, they use the same tool, channel-wise statistics, and pursue the same goal, OOD robustness. Additionally, CrossNorm can increase the capacity of SelfNorm through style augmentation, so SelfNorm, with help from CrossNorm, can generalize better to OOD data.
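The recalibration idea behind SelfNorm can be sketched as follows. In this minimal NumPy illustration, two tiny gating functions stand in for the small attention networks that map each channel's (mean, std) pair to rescaling factors; the weight vectors `w_f` and `w_g` and the sigmoid gating form are assumptions made for the sketch, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_norm(x, w_f, w_g, eps=1e-5):
    """Recalibrate the channel-wise statistics of a (C, H, W) feature map.

    Hypothetical sketch: w_f and w_g are (2,) weight vectors standing in
    for learned attention functions; each maps (mu, sigma) to a gate in
    (0, 1) that rescales the corresponding statistic.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    sig = x.std(axis=(1, 2), keepdims=True) + eps
    # Attention gates decide how much of each channel's style to keep.
    f = sigmoid(w_f[0] * mu + w_f[1] * sig)
    g = sigmoid(w_g[0] * mu + w_g[1] * sig)
    mu_new, sig_new = mu * f, sig * g
    # Normalize with the original stats, re-style with recalibrated ones.
    return (x - mu) / sig * sig_new + mu_new
```

With zero weights, both gates output 0.5 and every channel's mean and standard deviation are halved, i.e., the style is uniformly attenuated; training would instead learn weights that attenuate each channel's statistics selectively.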

Concept and Intuition.

The style concept here refers to a family of weak cues associated with the semantic content of interest. For instance, the image style in object recognition can include many appearance-related factors such as color, contrast, and brightness. Style sometimes may help in

