SANDWICH BATCH NORMALIZATION

Abstract

We present Sandwich Batch Normalization (SaBN), a frustratingly easy improvement of Batch Normalization (BN) requiring only a few lines of code changes. SaBN is motivated by addressing the inherent feature distribution heterogeneity that can be identified in many tasks, arising from model heterogeneity (dynamic architectures, model conditioning, etc.) or data heterogeneity (multiple input domains). SaBN factorizes the BN affine layer into one shared sandwich affine layer cascaded with several parallel, independent affine layers. Its variants include further decomposing the normalization layer into multiple parallel ones, and extending similar ideas to instance normalization. We demonstrate the prevailing effectiveness of SaBN (and its variants) as a drop-in replacement in four tasks: neural architecture search (NAS), image generation, adversarial training, and style transfer. Leveraging SaBN immediately and significantly boosts two state-of-the-art weight-sharing NAS algorithms on NAS-Bench-201; achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; substantially improves the robust and standard accuracy for adversarial defense; and produces superior arbitrarily stylized results. We also provide visualizations and analysis to help understand why SaBN works. All code and pre-trained models will be released upon acceptance.

1. INTRODUCTION

This paper presents a simple, lightweight, and easy-to-implement modification of Batch Normalization (BN) (Ioffe & Szegedy, 2015), strongly motivated by observations (Zając et al., 2019; Deecke et al., 2018; Xie et al., 2019; Xie & Yuille, 2019) drawn from a number of application fields: BN has trouble standardizing hidden features with very heterogeneous structures, e.g., from a multi-modal distribution. We call this phenomenon feature distribution heterogeneity. Such heterogeneity of hidden features can arise from multiple causes, often application-dependent:

• One straightforward cause is input data heterogeneity. For example, when training a deep network on a diverse set of visual domains that possess significantly different statistics, BN is found to be ineffective at normalizing the activations with only a single mean and variance (Deecke et al., 2018), and often needs to be re-set or adapted (Li et al., 2016).

• Another intrinsic cause is model heterogeneity, i.e., when the training is, or can equivalently be viewed as, over a set of different models. For instance, in neural architecture search (NAS) using weight sharing (Liu et al., 2018; Dong & Yang, 2019), training the super-network during the search phase can be considered as training a large set of sub-models (with many overlapping weights) simultaneously. As another example, in conditional image generation (Miyato et al., 2018), the generative model can be treated as a set of category-specific sub-models packed together, one of which is "activated" by the conditional input each time.

The vanilla BN (Figure 1 (a)) fails to perform well when there is data or model heterogeneity.
Recent trends split the affine layer into multiple ones and leverage input signals to modulate or select between them (De Vries et al., 2017; Deecke et al., 2018) (Figure 1 (b)); or, going further, utilize several independent BNs to address such disparity (Zając et al., 2019; Xie et al., 2019; Xie & Yuille, 2019; Yu et al., 2018). While these relaxations alleviate the data or model heterogeneity, we suggest that they might be "too loose" in terms of the normalization or regularization effect. Let us take adversarial training (AT) (Madry et al., 2017) as a concrete motivating example. AT is by far the most effective approach to improve a deep model's adversarial robustness. The model is trained on a mixture of the original training set ("clean examples") and its attacked counterpart with small perturbations applied ("adversarial examples"). Yet, recent works (Xie et al., 2019; Xie & Yuille, 2019) pointed out that clean and adversarial examples behave like two different domains with distinct statistics at the feature level (Li & Li, 2017; Pang et al., 2018). Such data heterogeneity puts vanilla BN in jeopardy for adversarial training, where the two domains are treated as one. Xie et al. (2019) and Xie & Yuille (2019) demonstrated a helpful remedy that improves AT performance by using two separate BNs for clean and adversarial examples respectively, which allows each BN to learn more stable and less noisy statistics over its own domain. But what may be missing? Unfortunately, using two separate BNs ignores the important fact that the two domains, while different, are not totally independent. Considering that all adversarial images are generated by perturbing clean counterparts only minimally, it is reasonable to hypothesize that the two domains overlap largely (i.e., they still share many hidden features despite their different statistics).
To put it simply: while it is oversimplified to normalize the two domains as the "same one", it is also unfair and unnecessary to treat them as a "disparate two". More application examples can be found that share this important structural prior, which we (informally) call "harmony in diversity". For instance, weight-sharing NAS algorithms (Liu et al., 2018; Dong & Yang, 2019; Yu et al., 2018) train a large variety of child models, constituting model heterogeneity; but most child architectures inevitably have many weights in common, since they are sampled from the same super-network. Similarly, while a conditional GAN (Miyato et al., 2018) has to produce diverse image classes, those classes often share the same resolution and many other dataset-specific characteristics (e.g., the object-centric bias of CIFAR images); that is even more true when the GAN is trained to produce classes of one super-category, e.g., dogs and cats.

Our Contributions: Recognizing the need to address feature normalization with "harmony in diversity", we propose the new SaBN, as illustrated in Fig. 1 (c). SaBN modifies BN in a "frustratingly simple" way: it is equipped with two cascaded affine layers: a shared unconditional sandwich affine layer, followed by a set of independent affine layers that can be conditioned. Compared to Categorical Conditional BN, the new sandwich affine layer is designed to inject an inductive bias: all re-scaling transformations have a shared factor, capturing the commonality. Experiments on NAS and conditional generation demonstrate that SaBN addresses the model heterogeneity issue elegantly, improving performance in a plug-and-play fashion. To further address data heterogeneity, SaBN can integrate the idea of split/auxiliary BNs (Zając et al., 2019; Xie et al., 2019; Xie & Yuille, 2019; Yu et al., 2018), decomposing the normalization layer into multiple parallel ones.
That yields the new variant called SaAuxBN, which we demonstrate using adversarial training as the application example. Lastly, we extend the idea of SaBN to Adaptive Instance Normalization (AdaIN) (Huang & Belongie, 2017), and show that the resulting SaAdaIN improves arbitrary style transfer.

2. RELATED WORK

2.1. NORMALIZATION IN DEEP LEARNING

Batch Normalization (BN) (Ioffe & Szegedy, 2015) made critical contributions to training deep convolutional networks and has since become a cornerstone for numerous tasks. BN normalizes the input mini-batch of samples by their mean and variance, and then re-scales them with learnable affine parameters. The success of BN was initially attributed to overcoming internal covariate shift (Ioffe & Szegedy, 2015), but has since raised many open discussions on its effects: improving landscape smoothness (Santurkar et al., 2018); enabling larger learning rates (Bjorck et al., 2018) and reducing gradient sensitivity (Arora et al., 2018); preserving the rank of pre-activation weight matrices (Daneshmand et al., 2020); decoupling feature length and direction (Kohler et al., 2018); capturing domain-specific artifacts (Li et al., 2016); reducing BN's dependency on batch size (Ioffe, 2017; Singh & Krishnan, 2020); preventing the model from elimination singularities (Qiao et al., 2019); and even characterizing an important portion of network expressivity (Frankle et al., 2020).

2.2. BRIEF BACKGROUNDS FOR RELATED APPLICATIONS

We leverage four important applications as test beds. All of them appear to be oversimplified by using the vanilla BN, where feature homogeneity and heterogeneity are not properly handled. We briefly introduce them below, and will concretely illustrate where the heterogeneity comes from and how our methods resolve the bottlenecks in Sec. 3.

Generative Adversarial Networks. Generative adversarial networks (GANs) have been prevailing since their origin (Goodfellow et al., 2014a) for image generation. Many efforts have been made to improve GANs, such as modifying the loss function (Arjovsky et al., 2017; Gulrajani et al., 2017; Jolicoeur-Martineau, 2018), improving the network architecture (Zhang et al., 2018; Karras et al., 2019; Gong et al., 2019), and adjusting the training procedure (Karras et al., 2017). Recent works have also tried to improve generated image quality by proposing new normalization modules, such as Categorical Conditional BN and spectral normalization (Miyato et al., 2018).

Neural Architecture Search (NAS). The goal of NAS is to automatically search for an optimal model architecture for a given task and dataset. It was first proposed in (Zoph & Le, 2016), where a reinforcement learning algorithm iteratively samples, trains, and evaluates candidate models from the search space. Due to its prohibitive time cost, the weight-sharing mechanism was introduced (Pham et al., 2018) and has become a popular strategy to accelerate the training of sampled models (Liu et al., 2018). However, weight sharing causes performance deterioration due to unfair training (Chu et al., 2019). In addition, a few NAS benchmarks (Ying et al., 2019; Dong & Yang, 2020; Zela et al., 2020) were recently released, with ground-truth accuracy for candidate models pre-recorded, enabling researchers to evaluate search methods more easily.

Adversarial Robustness. Deep networks are notorious for their vulnerability to adversarial attacks (Goodfellow et al., 2014b).
In order to enhance adversarial robustness, numerous training approaches have been proposed (Dhillon et al., 2018; Papernot & McDaniel, 2017; Xu et al., 2017; Meng & Chen, 2017; Liao et al., 2018; Madry et al., 2017). Among them, adversarial training (AT) (Madry et al., 2017) is arguably the strongest, training the model over a mixture of clean and perturbed data. Overall, normalization in AT has, to the best of our knowledge, not been studied in depth. A pioneering work (Xie et al., 2019) introduced the auxiliary batch norm (AuxBN) to improve clean-image recognition accuracy.

Neural Style Transfer. Style transfer is a technique for generating a stylized image by combining the content of one image with the style of another. Various improvements to normalization methods have been made in this area. Ulyanov et al. (2016) proposed Instance Normalization, improving the stylized quality of generated images. Conditional Instance Normalization (Dumoulin et al., 2016) and Adaptive Instance Normalization (Huang & Belongie, 2017) were later proposed, enabling networks to perform arbitrary style transfer.

3. SANDWICH BATCH NORMALIZATION

Formulation: Given the input feature x ∈ R^{N×C×H×W} (N denotes the batch size, C the number of channels, H the height, and W the width), the vanilla Batch Normalization (BN) works as: h = γ · (x − µ(x))/σ(x) + β, where µ(x) and σ(x) are the running estimates (or batch statistics) of input x's mean and standard deviation along the (N, H, W) dimensions, and γ and β are the learnable parameters of the affine layer, both of shape C. However, as the vanilla BN has only a single re-scaling transform, it simply treats any latent heterogeneous features as a single distribution. As an improved variant, Categorical Conditional BN (CCBN) (Miyato et al., 2018) was proposed to remedy the heterogeneity issue in conditional image generation, boosting the quality of generated images. CCBN has a set of independent affine layers, one of which is activated by the input domain index. It can be expressed as: h = γ_i · (x − µ(x))/σ(x) + β_i, i = 1, ..., K, where γ_i and β_i are the parameters of the i-th affine layer and K is the number of categories. Concretely, i is the expected output class in the image generation task (Miyato & Koyama, 2018). However, we argue that this "separate/split" modification might be "too loose", ignoring the fact that the distributions of the expected generated images overlap largely (due to similar texture, appearance, illumination, object location, scene layout, etc.). Hence, to better handle both the latent homogeneity and heterogeneity in x, we present Sandwich Batch Normalization (SaBN), equipped with both a shared sandwich affine layer and a set of independent affine layers. SaBN can be concisely formulated as: h = γ_i · (γ_sa · (x − µ(x))/σ(x) + β_sa) + β_i, i = 1, ..., K. As depicted in Fig. 1 (c), γ_sa and β_sa denote the new sandwich affine layer, while γ_i and β_i are the i-th affine parameters, conditioned on categorical inputs. Implementation-wise, SaBN takes only a few lines of code changes compared to vanilla BN: please see the pseudo code in Fig. 6 in the appendix.

This motivates us to unify both homogeneity and heterogeneity for conditional image generation with SaBN. We choose three representative GAN models, SNGAN (Miyato et al., 2018), BigGAN (Brock et al., 2018) and AutoGAN-top1 (Gong et al., 2019), as our backbones. SNGAN and BigGAN are already equipped with Categorical Conditional BN. AutoGAN-top1 originally has no normalization layer and was designed for unconditional image generation, so we manually insert Categorical Conditional BN into its generator to adapt it to conditional image generation. We then construct SNGAN-SaBN, BigGAN-SaBN and AutoGAN-SaBN by replacing all Categorical Conditional BN layers in the above baselines with our SaBN.
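As a concrete illustration, the SaBN forward pass above can be sketched in a few lines of NumPy. This is a minimal re-implementation of the formula, not the authors' released code; the parameter names are ours.

```python
import numpy as np

def sabn(x, gamma_sa, beta_sa, gammas, betas, idx, eps=1e-5):
    """Sandwich BN forward pass with batch statistics.

    x: (N, C, H, W) input feature; gamma_sa, beta_sa: (C,) shared sandwich
    affine; gammas, betas: (K, C) independent affines; idx: conditional index.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # per-channel batch mean
    var = x.var(axis=(0, 2, 3), keepdims=True)    # per-channel batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize
    g_sa = gamma_sa.reshape(1, -1, 1, 1)
    b_sa = beta_sa.reshape(1, -1, 1, 1)
    g_i = gammas[idx].reshape(1, -1, 1, 1)        # conditional affine of class idx
    b_i = betas[idx].reshape(1, -1, 1, 1)
    # h = gamma_i * (gamma_sa * x_hat + beta_sa) + beta_i
    return g_i * (g_sa * x_hat + b_sa) + b_i
```

Note that at inference time the two affines can be merged into a single conditional affine (γ_i · γ_sa, γ_i · β_sa + β_i), so SaBN adds no inference cost over Categorical Conditional BN.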

3.1. UNIFYING HOMOGENEITY & HETEROGENEITY FOR CONDITIONAL IMAGE GENERATION

We test all the above models on the CIFAR-10 dataset (Krizhevsky et al., 2009) (10 categories, resolution 32×32). Furthermore, we test SNGAN and SNGAN-SaBN on high-resolution image generation on ImageNet (Deng et al., 2009), using the subset of all 143 classes belonging to the dog and cat super-classes (cropped to resolution 128×128), following the setting of Miyato et al. (2018). Inception Score (Salimans et al., 2016) (the higher the better) and FID (Heusel et al., 2017) (the lower the better) are adopted as evaluation metrics. We summarize the best performance each model achieved during training in Table 1. We find that SaBN consistently improves the generative quality of all three baseline GAN models. We also visualize the image generation results of all compared GANs in Figs. 21, 22, 23 and 24 in the appendix.

Understanding SaBN by Visualization. One might be curious about the effectiveness of SaBN, since at inference time the shared sandwich affine layer can be multiplied/merged into the independent affine layers, making the inference form of SaBN exactly the same as Categorical Conditional BN. Hence, to better understand how SaBN benefits conditional image generation, we dive into the inductive role played by the shared sandwich affine parameters. We choose SNGAN and SNGAN-SaBN on ImageNet as our testbed.
Specifically, we propose a new measurement called Class-wise Affine Parameters Variance (CAPV), which indicates how much class-specific heterogeneity is introduced by the set of independent affine parameters. For Categorical Conditional BN (CCBN), we define its CAPV for γ as V_CCBN(γ) = Var([avg(γ_1), avg(γ_2), ..., avg(γ_K)]), where K is the number of classes and avg(·) denotes the channel-wise average. Similarly, the CAPV of SaBN is defined as V_SaBN(γ) = Var([avg(γ_sa ∘ γ_1), avg(γ_sa ∘ γ_2), ..., avg(γ_sa ∘ γ_K)]), where ∘ is the channel-wise product. A larger CAPV value implies more heterogeneity. We plot the CAPV values of both SNGAN and SNGAN-SaBN at each layer in Fig. 2 (left). The solid blue line (SaBN) is lower than the orange line (CCBN) in the shallow layers, but surpasses it at the deeper layers. We additionally plot a dashed blue line, obtained by removing the shared sandwich affine layer from the SaBN parameters (i.e., only the independent affines). The comparison indicates that the shared sandwich affine layer injects an inductive bias: compared to Categorical Conditional BN, training with SaBN enforces the shallow layers to preserve more feature homogeneity, while encouraging the deeper layers to introduce more class-specific feature heterogeneity. In plainer words, SaBN seems to make shallower and deeper layers more "focused" on their dedicated roles (common versus class-specific feature extractors). We also plot the value of the shared sandwich affine parameter γ_sa (averaged across channels) along the network depth in Fig. 2 (right), which aligns with the above observation.
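The CAPV measurement is straightforward to compute from the affine parameters; a minimal NumPy sketch under the definitions above (the function names are ours):

```python
import numpy as np

def capv_ccbn(gammas):
    """CAPV for CCBN: variance, over classes, of the channel-wise
    average of each class's gamma. gammas has shape (K, C)."""
    return np.var(gammas.mean(axis=1))

def capv_sabn(gamma_sa, gammas):
    """CAPV for SaBN: the shared sandwich gamma is first folded into each
    class-conditional gamma (their merged inference-time form)."""
    merged = gamma_sa[None, :] * gammas    # (K, C) channel-wise products
    return np.var(merged.mean(axis=1))
```

With γ_sa fixed to all-ones the two definitions coincide, while a uniform sandwich gamma of magnitude c scales the CAPV by c², which is consistent with the amplified class-wise variance observed in the deeper layers of Fig. 2 (left).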

3.2. ARCHITECTURE HETEROGENEITY IN NEURAL ARCHITECTURE SEARCH (NAS)

Recent NAS works formulate the search space as a super-network that contains all candidate operations and architectures, and the goal is to find a single-path sub-network with optimal performance. To support the search over architectures in the super-network, DARTS (Liu et al., 2018) assigns each candidate operation a trainable parameter α, and the search problem is solved by alternately optimizing the architecture parameters α and the model weights via stochastic gradient descent. The architecture parameters α can be treated as the magnitude of each operation, which helps rank the best candidates after the search phase. In Fig. 3, we concretely illustrate the origin of the model heterogeneity in the supernet. To disentangle the mixed model heterogeneity during the search process while maintaining the intrinsic homogeneity, we replace the BN in each operation path with a SaBN in the second layer (and likewise for the first layer if it is also downstream of preceding layers). Ideally, the number of independent affine layers should be set to the total number of unique architectures in the search space, enabling architecture-wise disentanglement. However, this would be impractical, as the search space size (number of unique architectures) is usually larger than 10^4. Thus we adopt a greedy scheme: we only disentangle the previous layer's architecture (operations). Specifically, the number of independent affine layers in the SaBN equals the total number of candidate operation paths in the connected previous layer. The categorical index i of SaBN during search is obtained by multinomial sampling from the softmax of the previous layer's architecture parameters: softmax([α_0, α_1, α_2, ..., α_{n-1}]).
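The index selection described above amounts to one multinomial draw from the softmax of the previous layer's α; a minimal sketch (assuming NumPy's Generator API):

```python
import numpy as np

def sample_condition_index(alphas, rng):
    """Sample the SaBN affine index i ~ Multinomial(softmax(alpha))."""
    logits = np.asarray(alphas, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

During search each forward pass re-samples i, so each candidate operation's independent affine layer is exercised in proportion to its current architecture weight.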
To validate the efficacy of our SaBN, we conduct an ablation study over four settings: 1) no affine (i.e., γ = 1, β = 0, as in vanilla DARTS), where by default the learning of affine parameters of BN in each operation path is disabled (Liu et al., 2018); 2) homogeneity only ("DARTS-affine"), where the learning of affine parameters of BN is enabled in each operation path; 3) heterogeneity only ("DARTS-CCBN"), where the BN in each operation path is replaced by Categorical Conditional BN (Miyato et al., 2018); 4) homogeneity and heterogeneity ("DARTS-SaBN"), where all operation paths' BNs are replaced by SaBN. Following the suggestion of Dong & Yang (2020), all experiments use batch statistics instead of running estimates of the mean and variance in the normalization layer. We conduct our experiments on CIFAR-100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017) using NAS-Bench-201 (Dong & Yang, 2020) (Fig. 4). Early stopping is applied in the search phase, as suggested in (Liang et al., 2019). We observe that SaBN dominates on both CIFAR-100 and ImageNet16-120. Surprisingly, we also notice that simply enabling the affine parameters in the original DARTS ("DARTS-affine") already yields fairly strong improvements. The performance gap between DARTS-affine and DARTS-SaBN demonstrates the effectiveness of the independent affine layers in SaBN. Experiments also show that CCBN helps improve search performance; however, it falls largely behind SaBN, indicating that the shared sandwich affine layer is also vital. In Fig. 15 in the appendix, we observe that the shared sandwich affine layer helps preserve more homogeneity. The ground-truth accuracy of the final searched architectures is summarized in Tab. 2. Besides, we also find SaBN works well with another weight-sharing search method, GDAS (Dong & Yang, 2019). The results are shown in Sec. A.2.3 in the appendix.

3.3. SANDWICH AUXILIARY BATCH NORM IN ADVERSARIAL ROBUSTNESS

AdvProp (Xie et al., 2019) successfully utilized adversarial examples to boost the network's Standard Testing Accuracy (SA) by introducing the Auxiliary Batch Norm (AuxBN). The design is quite simple: an additional BN is added in parallel with the original BN, where the original BN (clean branch) takes clean images as input, while the additional BN (adversarial branch) is fed only adversarial examples during training. This intuitively disentangles the mixed clean and adversarial distributions (data heterogeneity) into two splits, guaranteeing that the normalization statistics and re-scaling are exclusively performed within either domain. However, one thing it misses is that the domains of clean and adversarial images overlap largely, as adversarial images are generated by perturbing clean counterparts only minimally. This inspires us to present SaAuxBN, which leverages domain-specific normalization and affine layers together with a shared sandwich affine layer for homogeneity preservation. SaAuxBN can be defined as: h = γ_i · (γ_sa · (x − µ_i(x))/σ_i(x) + β_sa) + β_i, i = 0, 1, where µ_i(x) and σ_i(x) denote the i-th (moving) mean and standard deviation of the input, with i = 0 for adversarial images and i = 1 for clean images. We use independent normalization layers to decouple the data from the two different distributions, i.e., clean and adversarial. For a fair comparison, we follow the settings in Madry et al. (2017). In adversarial training, we adopt ℓ∞-based 10-step Projected Gradient Descent (PGD) (Madry et al., 2017) with step size α = 2/255 and maximum perturbation magnitude ε = 8/255; for assessing robust accuracy (RA), PGD-20 with the same configuration is adopted. We replace AuxBN with SaAuxBN in AdvProp (Xie et al., 2019) and find it further improves the SA of the network with its clean branch. The experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009) with a ResNet-18 (He et al., 2016) backbone, and the results are presented in Tab. 3.
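The branch-specific statistics of SaAuxBN can be sketched as a small stateful module; this is a minimal NumPy illustration (the class layout and names are ours, not the paper's code):

```python
import numpy as np

class SaAuxBN:
    """SaAuxBN sketch: per-branch running statistics (0 = adversarial,
    1 = clean), one shared sandwich affine, per-branch independent affines."""
    def __init__(self, C, num_branches=2, momentum=0.1, eps=1e-5):
        self.mu = np.zeros((num_branches, C))
        self.var = np.ones((num_branches, C))
        self.gamma_sa, self.beta_sa = np.ones(C), np.zeros(C)
        self.gammas = np.ones((num_branches, C))
        self.betas = np.zeros((num_branches, C))
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, branch, training=True):
        if training:  # batch statistics; update running stats of this branch only
            mu, var = x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))
            self.mu[branch] += self.momentum * (mu - self.mu[branch])
            self.var[branch] += self.momentum * (var - self.var[branch])
        else:         # inference: the branch's own running statistics
            mu, var = self.mu[branch], self.var[branch]
        x_hat = (x - mu.reshape(1, -1, 1, 1)) / np.sqrt(var.reshape(1, -1, 1, 1) + self.eps)
        h = self.gamma_sa.reshape(1, -1, 1, 1) * x_hat + self.beta_sa.reshape(1, -1, 1, 1)
        return (self.gammas[branch].reshape(1, -1, 1, 1) * h
                + self.betas[branch].reshape(1, -1, 1, 1))
```

The key design point is that each branch accumulates its own µ_i, σ_i, while γ_sa and β_sa are updated by both domains, mirroring the "shared statistics, separate domains" structure of the SaAuxBN formula.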
We further conduct an experiment to test the SA and Robust Testing Accuracy (RA) of the network using the adversarial branch of AuxBN and SaAuxBN. The comparison results are presented in Tab. 4: BN still achieves the highest SA, but falls far behind on RA compared with the other methods. Our proposed SaAuxBN is on par with vanilla BN in terms of SA, while having significantly better RA than any other approach. Compared with SaAuxBN, AuxBN suffers from worse SA and RA, indicating that the shared sandwich affine layer is key to disentangling the adversarial domain from the clean domain while properly preserving their shared statistics. The behaviors of AuxBN and SaAuxBN are visualized in Fig. 20 in the appendix, which suggests that the sandwich affine layer here mainly encourages enhanced feature homogeneity. We additionally include ModeNorm (Deecke et al., 2018) as an ablation, which was proposed to deal with multi-modal input distributions, i.e., data heterogeneity. It shares some similarity with AuxBN, as both consider multiple independent norms. ModeNorm achieves fair performance on both SA and RA, but still lower than SaAuxBN. The reason might be that the output of ModeNorm is a summation of two features weighted by a set of learned gating functions, which still mixes the statistics of the two domains, leading to inferior performance under attack.

3.4. STYLE TRANSFER WITH SANDWICH ADAPTIVE INSTANCE NORMALIZATION

Huang & Belongie (2017) achieve arbitrary style transfer by introducing the Adaptive Instance Norm (AdaIN). The AdaIN framework is composed of three parts: Encoder, AdaIN, and Decoder. First, the Encoder extracts a content feature and a style feature from the content and style images. AdaIN then performs style transfer in feature space, producing a stylized content feature. The Decoder learns to decode the stylized content feature into a stylized image. The framework is trained end-to-end with two loss terms, a content loss and a style loss. Concretely, AdaIN first normalizes the content feature, then re-scales the normalized content feature with the style feature's statistics. It can be formulated as: h = σ(y) · (x − µ(x))/σ(x) + µ(y), where y is the style input and x is the content input. Note that µ and σ here differ from those in BN: they are computed along the spatial axes (H, W) for each sample and each channel. Obviously, the style-dependent re-scaling may be "too loose" and might further amplify the intrinsic data heterogeneity brought by the variety of input content images, undermining the network's ability to maintain the content information in the output. To reduce this data heterogeneity, we propose to insert a shared sandwich affine layer after the normalization, which introduces homogeneity into the style-dependent re-scaling transformation. Hereby, we present SaAdaIN: h = σ(y) · (γ_sa · (x − µ(x))/σ(x) + β_sa) + µ(y), where the normalized content feature first passes through the shared, style-independent sandwich affine layer (γ_sa, β_sa) before the style-dependent re-scaling. Our training settings for all models are kept identical to (Huang & Belongie, 2017). We depict the loss curves of the training process in Fig. 5: both the content loss and the style loss of the proposed SaAdaIN are lower than those of AdaIN and Instance-Level Meta Normalization with Instance Norm (ILM+IN) (Jia et al., 2019). This observation demonstrates that the shared sandwich affine layer in SaAdaIN helps the network preserve the semantic information of the original input content, while also better migrating and merging style information. Furthermore, the qualitative visual results are shown in Fig. 26 in the appendix (best zoomed in and viewed in color): the leftmost column displays content images and referenced style images, and the next three columns are the stylized outputs of AdaIN, ILM+IN and SaAdaIN, respectively.
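Under the formulation above, SaAdaIN is essentially a one-line change to AdaIN; a minimal NumPy sketch (names ours, not the authors' code):

```python
import numpy as np

def sa_adain(x, y, gamma_sa, beta_sa, eps=1e-5):
    """SaAdaIN: instance-normalize content x along (H, W), apply the shared
    sandwich affine, then re-scale with style y's per-channel statistics.

    x, y: (N, C, H, W); gamma_sa, beta_sa: (C,). With gamma_sa = 1 and
    beta_sa = 0 this reduces to plain AdaIN."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    sd_x = x.std(axis=(2, 3), keepdims=True) + eps
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    sd_y = y.std(axis=(2, 3), keepdims=True)
    x_hat = (x - mu_x) / sd_x                              # instance norm
    h = gamma_sa.reshape(1, -1, 1, 1) * x_hat + beta_sa.reshape(1, -1, 1, 1)
    return sd_y * h + mu_y                                 # style re-scaling
```

Unlike in SaBN, the outer "conditional" re-scaling here comes from the style feature's statistics rather than a learned per-class affine, so the sandwich layer is the only learned affine in the module.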

4. CONCLUSION

We present SaBN and its variants as plug-and-play normalization modules, motivated by addressing model and data heterogeneity issues. We demonstrate their effectiveness on several tasks, including neural architecture search, adversarial robustness, conditional image generation, and arbitrary style transfer. Future work will investigate the performance of SaBN on more applications, such as semi-supervised learning (Zając et al., 2019), slimmable models, and once-for-all training (Yu et al., 2018).

A APPENDIX

A.1 IMPLEMENTATION DETAILS

Pseudo Python code of BN, SaBN and SaAuxBN is given in Fig. 6.
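The BatchNorm pseudo code in Fig. 6 is truncated in this version of the text; below is a runnable NumPy completion, with the few-line SaBN wrapper on top. The momentum default and helper structure are assumed by us, not taken from the figure.

```python
import numpy as np

def BatchNorm(x, gamma, beta, running_mean, running_var,
              momentum=0.1, eps=1e-5, training=True):
    """Vanilla BN over an (N, C, H, W) batch, updating running statistics
    in place during training and using them at inference."""
    if training:
        mean, var = x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))
        running_mean += momentum * (mean - running_mean)
        running_var += momentum * (var - running_var)
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean.reshape(1, -1, 1, 1)) / np.sqrt(var.reshape(1, -1, 1, 1) + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

def SandwichBatchNorm(x, gamma_sa, beta_sa, gammas, betas, idx,
                      running_mean, running_var, **kw):
    """SaBN: BatchNorm with an identity affine, then the shared sandwich
    affine, then the idx-th independent affine -- only a few extra lines."""
    C = x.shape[1]
    x_hat = BatchNorm(x, np.ones(C), np.zeros(C), running_mean, running_var, **kw)
    h = gamma_sa.reshape(1, -1, 1, 1) * x_hat + beta_sa.reshape(1, -1, 1, 1)
    return gammas[idx].reshape(1, -1, 1, 1) * h + betas[idx].reshape(1, -1, 1, 1)
```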

Analysis of Optimization and Generalization Ability

We found SaBN benefits NAS with respect to both generalization ability and optimization. In DARTS, the goal is to learn the architecture parameters α and the model weights ω jointly. Specifically, α is optimized to minimize the architecture loss L_val(ω, α) on the validation dataset, while the weights ω are optimized to minimize the weight loss L_train(ω, α) on the training dataset. In our experiment, we first visualize the architecture loss as well as the weight loss in Fig. 13. Compared with DARTS, the loss values of DARTS-CCBN are larger in the first several epochs, but become similar later on. This indicates that the independent affine layers slow the optimization at the beginning, due to the additional parameters. Comparing DARTS-SaBN with DARTS-CCBN, we observe that the losses of DARTS-SaBN are lower than those of DARTS-CCBN at the early stage, indicating that the additional sandwich affine layer is beneficial to the optimization of the model. This is achieved by the injected inductive bias of commonality for features from different operations, leading to easier optimization. However, the losses of both DARTS and DARTS-CCBN become lower than those of DARTS-SaBN in the later stage. This is caused by the architecture collapse in the later search stage, observable in Figs. 8 and 9, where the supernet starts to be dominated by skip connections in DARTS and DARTS-CCBN, leading to easier optimization (Zhou et al., 2020). We further visualize the gap between L_val(ω, α) and L_train(ω, α) in Fig. 14, for DARTS, DARTS-CCBN and DARTS-SaBN. The gap of DARTS-SaBN is lower than those of DARTS and DARTS-CCBN, indicating that SaBN can also help improve the generalization ability of the supernet. The inserted shared sandwich affine layer also serves as a regularizer here, since it is updated by data from all previous operations, thus improving the model's generalization ability.
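For intuition, the alternating scheme analyzed above can be sketched on a toy bilevel problem; the quadratic losses below are illustrative stand-ins for L_train and L_val, not the actual DARTS objectives:

```python
def darts_alternating_step(w, a, lr=0.1):
    """One round of DARTS-style alternating gradient descent on a toy problem:
    L_train(w, a) = (w - a)^2,  L_val(w, a) = (a - 1)^2 + (w - a)^2."""
    w = w - lr * 2 * (w - a)                    # weight step on L_train
    a = a - lr * (2 * (a - 1) - 2 * (w - a))    # architecture step on L_val
    return w, a
```

Iterating this step drives both variables toward the joint optimum w = a = 1; in the supernet, the analogous interplay between the two losses is what Figs. 13 and 14 trace.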
A.2.3 THE RESULTS OF GDAS

GDAS (Dong & Yang, 2019) is an advanced extension of DARTS (Liu et al., 2018), where the forward activation of all paths during search is replaced by the activation of a single sampled path. This is achieved by introducing Gumbel-softmax to construct a differentiable one-hot vector.



Figure 1: Illustration of (a) the original batch normalization (BN), composed of one normalization layer and one affine layer; (b) Categorical Conditional BN, composed of one normalization layer followed by a set of independent affine layers to take in conditional information; (c) our proposed Sandwich BN, sequentially composed of one normalization layer, one shared sandwich affine layer, and a set of independent affine layers.

Figure 2: The CAPV value of γ (left) and the value of the shared sandwich parameter γ_sa (right), along the network depth.

Figure 3: We depict two consecutive layers in the super-network. By default, a BN is integrated into each operation in vanilla DARTS, except the Zero and Skip-connection operations. The output of each layer is the sum of all operation paths' outputs, weighted by their associated architecture parameters α. Model heterogeneity is introduced during the summation, due to the differences among the parallel operations. Meanwhile, different operations still maintain an intrinsic homogeneity and are not completely independent, as they all share the same original input, and their gradients are estimated from the same loss function.

Figure 4: Results of architecture search on CIFAR-100 and ImageNet16-120, based on DARTS. At the end of each search epoch, the architecture is derived from the current α values. The x-axis is the search epoch. The y-axis is the ground-truth test accuracy of the current epoch's architecture, obtained by querying NAS-Bench-201. Each experiment is run three times with different random seeds, and each curve is averaged across runs.

Figure 5: The content loss and the style loss of AdaIN, ILM+IN, and SaAdaIN. The noisy light-colored curves are the original data; the foreground smoothed curves are obtained by applying an exponential moving average to the original data. Besides AdaIN, we also include Instance-Level Meta Normalization with Instance Norm (ILM+IN), proposed by Jia et al. (2019), as a task-specific comparison baseline. Its style-independent affine is not only conditioned on style information but also controlled by the input feature. Our training settings for all models are kept identical to those of Huang & Belongie (2017). We depict the loss curves of the training process in Fig. 5. Both the content loss and the style loss of the proposed SaAdaIN are lower than those of AdaIN and ILM+IN. This observation demonstrates that the shared sandwich affine layer in our SaAdaIN helps the network preserve the semantic information of the original input content, while also better migrating and merging the style information.
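The idea of inserting a shared sandwich affine between instance normalization and the style-conditioned affine can be sketched in numpy. This is a rough sketch under stated assumptions: `saadain` is a hypothetical name, the style affine is taken directly from the style feature's per-channel statistics as in AdaIN, and eps placement is simplified.

```python
import numpy as np

def saadain(content, style, gamma_sa, beta_sa, eps=1e-5):
    """Sketch of Sandwich AdaIN on (C, H, W) features: instance-normalize
    the content, apply the shared sandwich affine, then re-scale with the
    style feature's per-channel statistics (as in AdaIN)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    x = (content - c_mean) / c_std                  # instance normalization
    x = gamma_sa[:, None, None] * x + beta_sa[:, None, None]  # sandwich affine
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * x + s_mean                       # style-conditioned affine

rng = np.random.default_rng(1)
content = rng.normal(size=(3, 4, 4))
style = rng.normal(loc=2.0, scale=0.5, size=(3, 4, 4))
out = saadain(content, style, gamma_sa=np.ones(3), beta_sa=np.zeros(3))
```

With an identity sandwich affine this collapses to plain AdaIN, so the output's per-channel statistics match the style feature's; the learned sandwich affine adds a style-shared rescaling on top.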

Figure 6: Pseudo Python code of BN, SaBN, and SaAuxBN with TensorFlow. We highlight the main differences between our approaches and vanilla BN.

Figure 8: Architecture parameters (after softmax) on each edge in DARTS.

Figure 9: Architecture parameters (after softmax) on each edge in DARTS-CCBN.

Figure 10: Architecture parameters (after softmax) on each edge in our DARTS-SaBN.

(a) DARTS' searched architecture (b) DARTS-SaBN's searched architecture (c) DARTS-CCBN's searched architecture

Figure 11: The architectures searched by DARTS are dominated by "skip_connect", and the architecture of DARTS-CCBN is full of both "skip_connect" and "none". In contrast, DARTS-SaBN highly prefers "nor_conv_3x3".

Figure 12: The operation statistics of the searched architecture from DARTS and DARTS-SaBN.

Figure 13: The architecture loss Lval(ω, α) and the weight loss Ltrain(ω, α) of DARTS, DARTS-CCBN and DARTS-SaBN.

Figure 14: The architecture-weight loss gap of DARTS, DARTS-CCBN and DARTS-SaBN.

Figure 15: The CAPV for DARTS-CCBN and DARTS-SaBN. The shared sandwich affine layer learns to reduce the feature heterogeneity across the whole network.

[α*_0, α*_1, α*_2, ..., α*_{n-1}], α*_i ∈ {0, 1}, from the weight vector [α_0, α_1, α_2, ..., α_{n-1}].
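The Gumbel-softmax sampling used to construct this one-hot vector can be sketched as follows. This is an illustrative sketch only: `gumbel_softmax_sample` and the toy logits are assumptions, and the straight-through trick that GDAS uses (hard argmax forward, soft gradient backward) is indicated but not implemented here.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a relaxed (nearly one-hot) sample from categorical logits via
    the Gumbel-softmax trick; smaller tau gives sharper samples."""
    rng = rng or np.random.default_rng()
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1 - 1e-12)
    gumbel = -np.log(-np.log(u))             # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())                  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
probs = gumbel_softmax_sample(np.log(np.array([0.1, 0.3, 0.6])),
                              tau=0.1, rng=rng)
# Straight-through forward pass: activate exactly one path, while the
# soft `probs` would carry the gradient in the backward pass.
hard = np.zeros_like(probs)
hard[probs.argmax()] = 1.0
```

At low temperature the soft sample already concentrates on one entry, so only a single operation path needs to be evaluated in the forward pass.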

Figure 17: The adversarial branch loss L(fadv(xadv), y) and clean branch loss L(fclean(xclean), y) on training set. f, x, y denote model, input and label respectively. The model with SaAuxBN has lower training loss.

Figure 18: The adversarial branch loss L(fadv(xadv), y) and clean branch loss L(fclean(xclean), y) on testing set. The model with SaAuxBN has lower test loss.

Figure 19: The train-test loss gap for adversarial branch loss L(fadv(xadv), y) and clean branch loss L(fclean(xclean), y).

Figure 20: The CAPV value for models with AuxBN and SaAuxBN. We can observe that the shared sandwich affine layer learns to reduce the CAPV, i.e., the feature heterogeneity.

Figure 25: The generator loss of SNGAN and SNGAN-SaBN on ImageNet. SNGAN-SaBN achieves lower loss value.

The best Inception Scores ("IS", ↑) and FIDs (↓) achieved by conditional SNGAN, BigGAN, and AutoGAN-top1, using CCBN and SaBN on CIFAR-10 and ImageNet (dogs & cats).

The top-1 accuracy of the architectures searched by the four methods on NAS-Bench-201. Our proposed approach achieves the highest accuracy with the lowest standard deviation.

Performance (SA) of different BN settings on clean branch.

Performance (SA&RA) of different BN settings. During evaluation, only the adversarial path is activated in AuxBN and SaAuxBN.

A.2 ADDITIONAL RESULTS IN NAS

A.2.1 SEARCH SPACE OF NAS-BENCH-201

Our NAS experiments are conducted on NAS-Bench-201, which adopts a cell-based search space, shown in Fig. 7. The whole search space is composed by stacking several cells. Each cell consists of several layers L^k_j, each of which is composed of several parallel operation paths, as illustrated in Fig. 2 of our paper. L^k_j denotes the k-th layer that takes feature j as input. In Sec. 3.1 of our paper, we concretely illustrated the detailed implementation of SaBN under the circumstance that a layer has only one connected previous layer. However, some layers have multiple connected previous layers. Take L^1_3 as an example: it is connected with L^2_1 and L^1_2. In this case, the number of independent affine layers in SaBN at L^1_3 would be n², where n is the number of parallel operation paths in each layer.

Visualization of architecture parameters in DARTS

We visualize the search curves of the architecture parameters on each edge of the cell in Fig. 8, 9, and 10, for DARTS, DARTS-CCBN, and DARTS-SaBN respectively. In the case of DARTS (Fig. 8), "skip_connect" dominates most edges during the search, yielding bad architectures when the search ends. DARTS-CCBN ends up favoring both the "skip_connect" and "none" operations, as shown in Fig. 9. For our DARTS-SaBN in Fig. 10, the weight of "skip_connect" in the architecture parameters climbs at first but drops after a few epochs, whereas "nor_conv_3x3" takes the lead on most edges.

Visualization of discovered architectures in DARTS

We visualize the discovered architectures of DARTS, DARTS-CCBN, and DARTS-SaBN in Fig. 11. The architecture discovered by DARTS is fully dominated by "skip_connect". We further analyze the operator composition of the searched architectures of DARTS and DARTS-SaBN in Fig. 12, based on their final searched architecture parameters. We can clearly see that "skip_connect" is highly preferred by DARTS, whereas DARTS-SaBN favors "nor_conv_3x3".

We conducted four experiments: the original GDAS, GDAS-affine, GDAS-CategoricalCBN, and GDAS-SaBN, with settings quite similar to those of the DARTS experiments. The only difference is that the category index in CCBN and SaBN is obtained using the Gumbel-softmax instead of multinomial sampling. For all experiments we use batch statistics instead of the running mean and variance (Dong & Yang, 2020). The

