SANDWICH BATCH NORMALIZATION

Abstract

We present Sandwich Batch Normalization (SaBN), a frustratingly easy improvement of Batch Normalization (BN) requiring only a few lines of code changes. SaBN is motivated by addressing the inherent feature distribution heterogeneity that can be identified in many tasks, which can arise from model heterogeneity (dynamic architectures, model conditioning, etc.) or data heterogeneity (multiple input domains). SaBN factorizes the BN affine layer into one shared sandwich affine layer, cascaded by several parallel independent affine layers. Its variants include further decomposing the normalization layer into multiple parallel ones, and extending the same idea to instance normalization. We demonstrate the prevailing effectiveness of SaBN (as well as its variants) as a drop-in replacement in four tasks: neural architecture search (NAS), image generation, adversarial training, and style transfer. Leveraging SaBN immediately boosts two state-of-the-art weight-sharing NAS algorithms significantly on NAS-Bench-201; achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; substantially improves the robust and standard accuracy for adversarial defense; and produces superior arbitrary stylized results. We also provide visualizations and analysis to help understand why SaBN works. All our code and pre-trained models will be released upon acceptance.

1. INTRODUCTION

This paper presents a simple, light-weight, and easy-to-implement modification of Batch Normalization (BN) (Ioffe & Szegedy, 2015), strongly motivated by observations (Zając et al., 2019; Deecke et al., 2018; Xie et al., 2019; Xie & Yuille, 2019) drawn from a number of application fields that BN has trouble standardizing hidden features with very heterogeneous structures, e.g., from a multi-modal distribution. We call this phenomenon feature distribution heterogeneity. Such heterogeneity of hidden features can arise from multiple causes, often application-dependent:

• One straightforward cause is input data heterogeneity. For example, when training a deep network on a diverse set of visual domains that possess significantly different statistics, BN is found to be ineffective at normalizing the activations with only a single mean and variance (Deecke et al., 2018), and often needs to be re-set or adapted (Li et al., 2016).

• Another intrinsic cause is model heterogeneity, i.e., when the training is, or can equivalently be viewed as, performed on a set of different models. For instance, in neural architecture search (NAS) using weight sharing (Liu et al., 2018; Dong & Yang, 2019), training the super-network during the search phase can be considered as training a large set of sub-models (with many overlapping weights) simultaneously. As another example, in conditional image generation (Miyato et al., 2018), the generative model can be treated as a set of category-specific sub-models packed together, one of which is "activated" by the conditional input each time.

The vanilla BN (Figure 1 (a)) fails to perform well when there is data or model heterogeneity. Recent trends split the affine layer into multiple ones and leverage input signals to modulate or select between them (De Vries et al., 2017; Deecke et al., 2018) (Figure 1 (b)); or, even further, utilize several independent BNs to address such disparity (Zając et al., 2019; Xie et al., 2019; Xie & Yuille, 2019; Yu et al., 2018). While those relaxations alleviate the data or model heterogeneity, we suggest that they might be "too loose" in terms of the normalization or regularization effect.

Figure 1: Illustration of (a) the original batch normalization (BN), composed of one normalization layer and one affine layer; (b) Categorical Conditional BN, composed of one normalization layer followed by a set of independent affine layers that intake conditional information; (c) our proposed Sandwich BN, sequentially composed of one normalization layer, one shared sandwich affine layer, and a set of independent affine layers.

Let us take adversarial training (AT) (Madry et al., 2017) as a concrete motivating example to illustrate our rationale. AT is by far the most effective approach to improving a deep model's adversarial robustness. The model is trained on a mixture of the original training set ("clean examples") and its attacked counterpart with some small perturbations applied ("adversarial examples"). Yet, recent works (Xie et al., 2019; Xie & Yuille, 2019) pointed out that clean and adversarial examples behave like two different domains with distinct statistics at the feature level (Li & Li, 2017; Pang et al., 2018). Such data heterogeneity puts vanilla BN in jeopardy for adversarial training, where the two domains are treated as one. Xie et al. (2019) and Xie & Yuille (2019) demonstrated a helpful remedy that improves AT performance by using two separate BNs for clean and adversarial examples respectively, which allows each BN to learn more stable and less noisy statistics over its own focused domain.

But what may be missing? Unfortunately, using two separate BNs ignores one important fact: the two domains, while different, are not totally independent. Considering that all adversarial images are generated by only minimally perturbing their clean counterparts, it is reasonable to hypothesize that the two domains largely overlap (i.e., they still share many hidden features despite their different statistics). To put it simply: while it is oversimplified to normalize the two domains as the "same one", it is also unfair and unnecessary to treat them as "disparate two". Many other applications share this important structural prior, which we (informally) call "harmony in diversity". For instance, weight-sharing NAS algorithms (Liu et al., 2018; Dong & Yang, 2019; Yu et al., 2018) train a large variety of child models, constituting model heterogeneity; yet most child architectures inevitably have many weights in common, since they are sampled from the same supernet. Similarly, while a conditional GAN (Miyato et al., 2018) has to produce diverse image classes, those classes often share the same resolution and many other dataset-specific characteristics (e.g., the object-centric bias of CIFAR images); that is even more true when the GAN is trained to produce classes of one super-category, e.g., dogs and cats.

Our Contributions: Recognizing the need to address feature normalization with "harmony in diversity", we propose a new SaBN, as illustrated in Figure 1 (c). SaBN modifies BN in a "frustratingly simple" way: it is equipped with two cascaded affine layers, a shared unconditional sandwich affine layer, followed by a set of independent affine layers that can be conditioned. Compared to Categorical Conditional BN, the new sandwich affine layer is designed to inject an inductive bias: all re-scaling transformations share a common factor, capturing the commonality across conditions. Experiments on NAS and conditional image generation demonstrate that SaBN addresses the model heterogeneity issue elegantly, and improves performance in a plug-and-play fashion.
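To make the mechanism concrete, the following is a minimal PyTorch-style sketch of a SaBN layer, not the authors' released implementation; the module name and the `num_classes` argument are our own illustrative choices:

```python
import torch
import torch.nn as nn

class SandwichBatchNorm2d(nn.Module):
    """Sketch of SaBN: a normalization layer, then one shared ("sandwich")
    affine transform, then one of several independent conditional affines."""

    def __init__(self, num_features, num_classes):
        super().__init__()
        # Normalization layer without its own affine parameters.
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # Shared sandwich affine layer (one gamma/beta for all conditions).
        self.gamma_sa = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta_sa = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        # Independent per-condition affine layers, selected by the label.
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, y):
        # x: (N, C, H, W) feature maps; y: (N,) integer condition labels.
        h = self.norm(x)
        h = self.gamma_sa * h + self.beta_sa            # shared affine
        g = self.gamma(y).unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        b = self.beta(y).unsqueeze(-1).unsqueeze(-1)
        return g * h + b                                # conditional affine
```

Note the inductive bias: the effective per-condition scale is the product of the shared `gamma_sa` and the condition-specific `gamma`, so all conditions share a common re-scaling factor while retaining their own modulation.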

To better address the data heterogeneity, SaBN can further integrate the idea of split/auxiliary BNs (Zając et al., 2019; Xie et al., 2019; Xie & Yuille, 2019; Yu et al., 2018), decomposing the normalization layer into multiple parallel ones. That yields a new variant called SaAuxBN, which we demonstrate on adversarial training. Lastly, we extend the idea of SaBN to Adaptive Instance Normalization (AdaIN) (Huang & Belongie, 2017), and show that the resulting SaAdaIN improves arbitrary style transfer.

2. RELATED WORK

Batch Normalization (BN) (Ioffe & Szegedy, 2015) made critical contributions to training deep convolutional networks and has since become a cornerstone of such networks for numerous tasks. BN normalizes the input mini-batch of samples by the mean and variance, and then re-scales them with learnable affine parameters. The success of BN was initially attributed to overcoming internal covariate shift
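As a reminder of the computation involved, a minimal NumPy sketch of the vanilla BN forward pass in training mode (per-channel batch statistics, then a learnable affine re-scaling) might look as follows; the function name and argument layout are our own assumptions:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Vanilla BN on an (N, C, H, W) mini-batch: normalize each channel
    by its batch mean/variance, then re-scale with learnable gamma/beta."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)    # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalization layer
    # Affine layer: gamma/beta have shape (C,), broadcast over N, H, W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

SaBN keeps the normalization step unchanged and only restructures the trailing affine step into a shared affine cascaded with condition-specific affines.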

