METANORM: LEARNING TO NORMALIZE FEW-SHOT BATCHES ACROSS DOMAINS

Abstract

Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way: it leverages the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches. It can be efficiently implemented by lightweight hypernetworks with low computational cost. We verify its effectiveness by extensive evaluation on representative tasks suffering from the small-batch and domain-shift problems: few-shot learning and domain generalization. We further introduce an even more challenging setting: few-shot domain generalization. Results demonstrate that MetaNorm consistently achieves better, or at least competitive, accuracy compared to existing batch normalization methods.

1. INTRODUCTION

Batch normalization (Ioffe & Szegedy, 2015) is crucial for training neural networks and, together with its variants, e.g., layer normalization (Ba et al., 2016), group normalization (Wu & He, 2018) and instance normalization (Ulyanov et al., 2016), has become an essential part of the deep learning toolkit (Bjorck et al., 2018; Summers & Dinneen, 2020). Batch normalization helps stabilize the distribution of internal activations while a model is being trained. Given a mini-batch B, the normalization is conducted along each individual feature channel for 2D convolutional neural networks. During training, the batch normalization moments are calculated as

µ_B = (1/M) Σ_{i=1}^{M} a_i,    σ²_B = (1/M) Σ_{i=1}^{M} (a_i − µ_B)²,

where a_i denotes the i-th of the M activations in the batch, M = |B| × H × W, and H and W are the height and width of the feature map in each channel. The normalization statistics are then applied to each activation:

a_i ← BN(a_i) ≡ γ â_i + β,    â_i = (a_i − µ_B) / √(σ²_B + ε),

where γ and β are parameters learned during training, ε is a small scalar that prevents division by zero, and operations between vectors are element-wise. At test time, the standard practice is to normalize activations using the moving averages of the mini-batch means µ_B and variances σ²_B. Batch normalization rests on the implicit assumption that the samples in the dataset are independent and identically distributed. However, this assumption does not hold in challenging settings like few-shot learning and domain generalization. In this paper, we strive for batch normalization that works when batches are small and suffer from distribution shifts between source and target domains. Batch normalization for few-shot learning and for domain generalization has so far been considered separately, predominantly in a meta-learning setting.
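For concreteness, the per-channel moment computation and normalization above can be sketched in NumPy (a minimal illustration only, not the implementation used in the paper):

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Per-channel batch normalization for 2D feature maps.

    x: (N, C, H, W); gamma, beta: (C,). Moments are taken over the
    M = N*H*W activations of each channel, as in the equations above.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # µ_B per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # σ²_B per channel
    x_hat = (x - mu) / np.sqrt(var + eps)         # standardize
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(8, 4, 5, 5))
y = batch_norm_2d(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma=1, beta=0, each channel of y has near-zero mean and near-unit variance.
```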
For few-shot meta-learning (Finn et al., 2017; Gordon et al., 2019), most existing methods rely critically on transductive batch normalization, except those based on prototypes (Snell et al., 2017; Allen et al., 2019; Zhen et al., 2020a). However, the nature of transductive learning restricts its application due to the requirement to sample from the test set. To address this issue, Bronskill et al. (2020) propose TaskNorm, which leverages additional statistics from both layer and instance normalization. As a non-transductive normalization approach, it achieves impressive performance and outperforms conventional batch normalization (Ioffe & Szegedy, 2015). However, it does not always perform better than transductive batch normalization. Meanwhile, domain generalization (Muandet et al., 2013; Balaji et al., 2018; Li et al., 2017a;b) suffers from distribution shifts from training to test, which makes it problematic to directly apply statistics calculated on a seen domain to test data from unseen domains (Wang et al., 2019; Seo et al., 2019). Recent works deal with this problem by learning domain-specific normalization (Chang et al., 2019; Seo et al., 2019) or a transferable normalization in place of existing normalization techniques (Wang et al., 2019). We address the batch normalization challenges of few-shot classification and domain generalization in a unified way by learning a new batch normalization under the meta-learning setting. We propose MetaNorm, a simple but effective meta-learning normalization. We leverage the meta-learning setting and learn to infer normalization statistics from data, instead of applying direct calculations or blending various normalization statistics. MetaNorm is a general batch normalization approach, which is model-agnostic and serves as a plug-and-play module that can be seamlessly embedded into existing meta-learning approaches.
We demonstrate its effectiveness for few-shot classification and domain generalization: it learns task-specific statistics from the limited data samples in the support set of each few-shot task, and it learns to generate domain-specific statistics from the seen source domains for unseen target domains. We verify the effectiveness of MetaNorm by extensive evaluation on few-shot classification and domain generalization tasks. For few-shot classification, we experiment with representative gradient-, metric- and model-based meta-learning approaches on fourteen benchmark datasets. For domain generalization, we evaluate the model on three widely-used benchmarks for cross-domain visual object classification. Last but not least, we introduce the challenging new task of few-shot domain generalization, which combines the challenges of both few-shot learning and domain generalization. The experimental results demonstrate the benefit of MetaNorm compared to existing batch normalization methods.

2. RELATED WORKS

Transductive Batch Normalization For conventional batch normalization under supervised settings, the i.i.d. assumption about the data distribution implies that moments estimated from the training set will provide appropriate normalization statistics for test data. In the meta-learning scenario, however, data points are only assumed to be i.i.d. within a specific task. It is therefore critical which moments are used when batch normalization is applied to support and query set data points during meta-training and meta-testing. Hence, in the recent meta-learning literature the running moments are no longer used for normalization at meta-test time, but are replaced with support/query set statistics, which are used for normalization both at meta-train and meta-test time. This approach is referred to as transductive batch normalization (TBN) (Bronskill et al., 2020). Competitive meta-learning methods (e.g., Gordon et al., 2019; Finn et al., 2017; Zhen et al., 2020b) rely on TBN to achieve state-of-the-art performance. However, TBN has two critical problems. First, it is sensitive to the distribution over the query set used during meta-training, and as such is less generally applicable than non-transductive learning. Second, at prediction time TBN uses extra information from multiple test samples, compared to non-transductive batch normalization, which is problematic because a set of test samples is not guaranteed to be available in practical applications. In contrast, MetaNorm is a non-transductive normalization: it generates statistics from the support set only, without relying on query samples, making it more practical.
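The difference in data flow can be sketched as follows. Here the moments are computed directly, purely for illustration; MetaNorm instead learns to infer the support moments with hypernetworks:

```python
import numpy as np

def normalize(x, mu, var, eps=1e-5):
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
support = rng.normal(0.0, 1.0, size=(25, 16))   # support activations, 16 channels
query = rng.normal(0.5, 1.2, size=(75, 16))     # query drawn with a slight shift

# Transductive BN: the query batch is normalized with its OWN moments,
# so each prediction implicitly depends on the other test inputs.
q_tbn = normalize(query, query.mean(axis=0), query.var(axis=0))

# Non-transductive (MetaBN/MetaNorm-style data flow): the query is normalized
# with support-derived moments only, so test inputs stay independent.
q_meta = normalize(query, support.mean(axis=0), support.var(axis=0))
```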

Meta Batch Normalization

To address the problems of transductive batch normalization and improve conventional batch normalization, meta-batch normalization (MetaBN) was introduced (Triantafillou et al., 2020; Bronskill et al., 2020). In MetaBN, the support set alone is used to compute the normalization statistics for both the support and query sets, at both meta-training and meta-test time. MetaBN is non-transductive since the normalization of a test input does not depend on other test inputs in the query set. However, Bronskill et al. (2020) observe that MetaBN performs less well for small support sets, which lead to high-variance moment estimates, similar to the difficulty of using batch normalization with small-batch training (Wu & He, 2018). To address this issue, Bronskill et al. (2020) proposed TaskNorm, which combines statistics from layer normalization and instance normalization with a blending parameter learned at meta-train time. As a non-transductive normalization, TaskNorm achieves impressive performance, outperforming conventional batch normalization. However, it cannot always perform better than transductive batch normalization. TaskNorm indicates that non-transductive batch normalization can estimate proper normalization statistics when learning is involved in the normalization process. We also propose to learn batch normalization within the meta-learning framework, but instead of employing a learnable combination of existing normalization statistics, we directly learn to infer statistics from data. At meta-train time, the model learns to generate statistics from the support set only, and at meta-test time we directly apply the model to infer statistics for new tasks.
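The blending idea behind TaskNorm can be sketched as follows (a simplified illustration of mixing pooled and per-instance moments; the actual TaskNorm also ties the blend weight to the support-set size):

```python
import numpy as np

def tasknorm_i_like(x, alpha, eps=1e-5):
    """Sketch of a TaskNorm-I-style blend: pooled batch moments mixed with
    per-instance moments via a learned weight alpha in [0, 1].

    x: (N, D) activations. alpha=1 recovers plain batch normalization,
    alpha=0 recovers instance-style normalization.
    """
    mu_b, var_b = x.mean(), x.var()                     # pooled (batch) moments
    mu_i = x.mean(axis=1, keepdims=True)                # per-instance moments
    var_i = x.var(axis=1, keepdims=True)
    mu = alpha * mu_b + (1 - alpha) * mu_i
    var = alpha * var_b + (1 - alpha) * var_i
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(3).normal(size=(4, 10))
y = tasknorm_i_like(x, alpha=0.7)
```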
Batch Normalization for Domain Adaptation and Domain Generalization Domain adaptation suffers from a distribution shift between source and target domains, which makes it sub-optimal to directly apply batch normalization (e.g., Bilen & Vedaldi, 2017; Luo et al., 2018a;b; Yang et al., 2019; Jia et al., 2019). Chang et al. (2019) proposed a domain-specific batch normalization layer, which consists of two branches, each in charge of a single domain exclusively. The hope is that, through the normalization, the feature representation will become domain invariant. Nevertheless, these normalization methods are specifically designed for domain adaptation tasks, where data from target domains are available, though often unlabelled. This makes them inapplicable to domain generalization tasks, where data from target domains are inaccessible at training time. Seo et al. (2019) proposed learning to optimize domain-specific normalization for domain generalization tasks. Under the meta-learning setting, a mixture of different normalization techniques is optimized for each domain, where the mixture weights are learned specifically for different domains. Instead of combining different normalization statistics, MetaNorm learns from data to generate adaptive statistics specific to each domain. Moreover, we introduce an even more challenging setting, i.e., few-shot domain generalization, which combines the challenges of few-shot classification and domain generalization. Conditional Batch Normalization de Vries et al. (2017) proposed conditional batch normalization to modulate visual processing by predicting the scalars γ and β of the batch normalization conditioned on language from an early processing stage. Conditional batch normalization has also been applied to align different data distributions for domain adaptation (Li et al., 2016). Oreshkin et al. (2018) apply conditional batch normalization to metric-based models for the few-shot classification task. Tseng et al. (2020) proposed a learning-to-learn method that optimizes the hyper-parameters of feature-wise transformation layers via conditional batch normalization for cross-domain classification. Unlike conditional batch normalization, we use extra data (the query set) under the meta-learning setting to generate the normalization statistics themselves, rather than only the scalars γ and β.

3. METHODOLOGY

We view finding appropriate statistics for batch normalization as a density estimation problem: we need to infer the distribution parameters, such as µ and σ when a Gaussian distribution is presumed, as in existing batch normalization approaches. The motivation behind MetaNorm is to leverage the meta-learning setting and learn from data to generate adaptive normalization statistics. MetaNorm is generic and model-agnostic, addressing batch normalization in a unified way for different settings by minimizing the Kullback-Leibler (KL) divergence, a common measure of the difference between two probability distributions:

D_KL[q_φ(m) || p_θ(m)],

where m is a random variable that represents the distribution of activations, and q_φ(m) and p_θ(m) are defined as Gaussian distributions whose implementations depend on the task of interest, e.g., few-shot classification or domain generalization. We leverage the amortized inference technique (Kingma & Welling, 2013) and implement this with inference networks. Specifically, for each individual channel in each convolutional layer, we infer the moments µ and σ by f_µ(·) and f_σ(·), respectively, realized as multi-layer perceptrons that we call hypernetworks (Ha et al., 2016). Hypernetworks use one network to generate the weights of another network; our hypernetworks generate the statistics from data using amortization techniques. We simply incorporate the D_KL term into the optimization of the existing model with the cross-entropy loss L_CE, resulting in the general loss function

L = L_CE + λ D_KL[q_φ(m) || p_θ(m)],

where λ > 0 is a regularization hyper-parameter.

MetaNorm for Few-Shot Classification In the few-shot classification scenario, we define the C-way K-shot problem using the episodic formulation from Vinyals et al. (2016). Each task T_i is a classification problem sampled from a task distribution p(T).
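The D_KL term above has a simple closed form for the diagonal Gaussians assumed throughout this section. A minimal NumPy sketch:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    summed over dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# The KL is zero iff the two Gaussians coincide.
mu = np.array([0.3, -1.0])
sigma = np.array([0.5, 2.0])
kl_same = kl_diag_gaussians(mu, sigma, mu, sigma)   # → 0.0
```

Because the expression is differentiable in all four arguments, the hypernetwork outputs can be trained directly through this term alongside L_CE.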
The tasks are divided into a training meta-set T_tr, validation meta-set T_val, and test meta-set T_test, each with a disjoint set of target classes (i.e., a class seen during testing is not seen during training). The validation meta-set is used for model selection, and the testing meta-set is used only for final evaluation. Each task instance T_i ∼ p(T) is composed of a support set S and a query set Q, and only contains N classes randomly selected from the appropriate meta-set. We aim to infer statistics from the support set that better match the query set. Therefore, we adopt a straightforward criterion for the inference:

D_KL[q_φ(m|S) || p_θ(m|Q)],

where we define q(m|S) = N(µ_S, σ_S) and p(m|Q) = N(µ_Q, σ_Q), the distributions inferred from the support and query sets of a few-shot learning task. By minimizing the KL term in conjunction with the prime objective of a meta-learning algorithm, we are able to find the appropriate statistics from limited data samples for batch normalization. The KL term admits a closed form, which makes it easy to implement and computationally efficient. p(m|Q) could also be estimated by directly calculating statistics on the query set, which, however, performs worse than inference by optimization. We note that inference from the query set only happens at meta-training time; at meta-test time we use the learned inference network to generate normalization statistics for a new task from its support set. To infer µ_S, we deploy an inference function f_µ(·) that takes the activations of one sample as input; the outputs over all samples are then averaged as the final µ_S:

µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_i),

where a_i ∈ R^{w×h} is the flattened vector of the activation map of the i-th sample in the support set, and w and h are the width and height of the activation map. To infer σ_S, we use the obtained µ_S and deploy a separate inference function f_σ(·):

σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_i − µ_S)².
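The moment-inference step can be sketched with toy hypernetworks. The weights below are random placeholders (in MetaNorm they are meta-learned), the hidden width is an assumption, and the softplus on σ_S is one plausible way to keep the inferred scale positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(w1, w2, x):
    """A tiny two-layer perceptron standing in for a hypernetwork f(·)."""
    return np.tanh(x @ w1) @ w2

d, hidden = 9, 8   # flattened activation size (w*h = 3*3) and hidden width (assumed)
# Hypothetical random weights; in MetaNorm these are meta-learned.
w1_mu, w2_mu = rng.normal(size=(d, hidden)), rng.normal(size=(hidden, 1))
w1_sig, w2_sig = rng.normal(size=(d, hidden)), rng.normal(size=(hidden, 1))

support = rng.normal(size=(5, d))   # activations of 5 support samples

# µ_S: average f_µ over the support samples
mu_s = np.mean([mlp(w1_mu, w2_mu, a) for a in support], axis=0)
# σ_S: average f_σ of the squared, centred samples (one reading of the
# equation above); a softplus keeps the inferred scale positive.
raw = np.mean([mlp(w1_sig, w2_sig, (a - mu_s) ** 2) for a in support], axis=0)
sigma_s = np.log1p(np.exp(raw))
```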
It is worth mentioning that we use each individual sample to infer the statistics and take the average of all inferred statistics as the final normalization statistics. This enables us to fully exploit the samples to generate more accurate statistics. Note that the inference functions f_µ(·) and f_σ(·) are shared by different channels in the same layer, and we learn L pairs of these functions if there are L convolutional layers in the meta-learning model. They are parameterized by feed-forward multi-layer perceptron networks, which we call hypernetworks. Using these hypernetworks, we generate support moments (µ_S, σ_S) and query moments (µ_Q, σ_Q) from the support and query sets, which are used to calculate the KL term in Eq. (5) for optimization at meta-training time. At meta-training time, we apply the statistics inferred from the support set for normalization of both support and query samples:

â = γ (a − µ_S) / √(σ²_S + ε) + β,

where γ and β are jointly learned with the parameters of the hypernetworks at meta-training time and directly applied at meta-test time, as in conventional batch normalization. At meta-test time, given a test task, the hypernetworks take the support set as input to generate the normalization statistics directly used for the query set.

MetaNorm for Domain Generalization

In a similar vein to few-shot classification, we would like to learn to acquire the ability to generate domain-specific statistics from a single example, which can then be applied to unseen domains. We assume that reasonable normalization statistics can be generated from only one sample of a new domain because, intuitively, a single sample already carries sufficient domain information. We use a single example and all the examples in the same domain to infer the domain-specific statistics and minimize the KL term

D_KL[q_φ(m|a_i) || p_θ(m|D^s \ a_i)],

where we define q(m|a_i) = N(µ_a, σ_a) and p(m|D^s \ a_i) = N(µ_D, σ_D), implemented in a similar way as Eq. (6) and Eq. (7), and a_i is an example from its own domain D^s. In both the meta-source and meta-target domains, each example is normalized using the statistics generated from itself, as in Eq. (8), where γ and β are shared across all domains. Minimizing the KL term in Eq. (9) encourages the model to generate domain-specific normalization statistics from only a single example. This enables us to generate domain-specific statistics on target domains that are never seen at meta-training time. In practice, we take the sum over all samples in all source domains:

Σ_{j=1}^{J} Σ_{i=1}^{|D^s_j|} D_KL[q_φ(m|a_i) || p_θ(m|D^s_j \ a_i)],

where D^s_j denotes the j-th of J meta-source domains. The inference networks are first learned at meta-training time and then directly applied to examples from the target domain at meta-test time. Note that on the meta-target domain we do not apply the KL term; instead, we simply rely on each example to generate its own statistics for normalization.
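The leave-one-out objective can be sketched numerically. Here direct moment computation and a fixed unit variance stand in for the hypernetwork-inferred moments, purely to illustrate the summation structure over examples:

```python
import numpy as np

def gauss_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL between two univariate Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

rng = np.random.default_rng(2)
domain = rng.normal(1.0, 2.0, size=20)   # scalar activations from one source domain

total = 0.0
for i in range(len(domain)):
    rest = np.delete(domain, i)          # D^s \ a_i: hold the i-th example out
    # Stand-in moments; MetaNorm infers both sides with hypernetworks instead.
    mu_q, var_q = domain[i], 1.0         # per-example moments (unit variance assumed)
    mu_p, var_p = rest.mean(), rest.var()
    total += gauss_kl(mu_q, var_q, mu_p, var_p)
```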

MetaNorm for Few-Shot Domain Generalization

We introduce an even more challenging setting, i.e., few-shot domain generalization, which combines the challenges of both few-shot classification and domain generalization. Specifically, we aim to learn a model from a set of classification tasks, each of which has only a few samples in a support set for training, and test the model on tasks whose query set comes from a different domain than the support set. As in few-shot classification, the label space is not shared between training and testing. Cross-domain few-shot learning has been explored recently by Tseng et al. (2020) and Guo et al. (2020). However, our few-shot domain generalization setting is different and more challenging, as the support and query sets are from different domains at the meta-test stage and the target domain is also unseen throughout the training stage. An example of the few-shot domain generalization setting is provided in Figure 1. We divide a dataset into the source domains S used for training and the target domains T held out for testing. During training time, data in the source domains S is episodically divided into sets of meta-source D^s and meta-target D^t domains. We sample C-way k-shot data as the support set from each meta-source domain D^s, where k is the number of labelled examples for each of the C classes. To learn the normalization statistics, we minimize

Σ_{i=1}^{|D^s|} D_KL[q_φ(m|a_i) || p_θ(m|D^s)],

where a_i is the activation associated with each sample from the meta-source domain D^s. Likewise, q(m|a_i) and p(m|D^s) are defined as factorized Gaussian distributions. We also adopt γ and β, which are shared across tasks and jointly learned. MetaNorm learns to acquire the ability to generate proper statistics for each sample, and applies it to the samples in the meta-target domain.

4. EXPERIMENTAL RESULTS

We conduct an extensive set of experiments on a total of 17 datasets containing more than 15 million images. We use three representative approaches to meta-learning as our base models, i.e., MAML (Finn et al., 2017), ProtoNets (Snell et al., 2017), and VERSA (Gordon et al., 2019). All details about datasets and implementation settings are provided in the appendix. More experimental results, including a convergence analysis, are also provided in the appendix. Our code will be publicly released.¹

Effect of KL Term We first conduct ablation studies that measure the effectiveness of MetaNorm. The key to MetaNorm is the introduced KL term for learning to learn statistics. We test the performance of MetaNorm without the KL term by directly using the statistics generated from data. In this case, we still use the hypernetworks to generate the moments µ and σ, but simply remove the KL term from the objective function. In Table 1 we present results for few-shot classification on miniImageNet (Vinyals et al., 2016) and for domain generalization on PACS (Li et al., 2017a). The performance of MetaNorm without KL degrades significantly. This is expected: without the KL term, the generation of normalization statistics lacks direct supervision from the target distribution, resulting in improper statistics.

Impact of Target Set Size

The other key parameter in MetaNorm is the size of the target set; that is, the number |Q| of samples in the query set (in few-shot classification) and the number |D^s| of samples in each domain (in domain generalization). This parameter is important when learning normalization statistics because we use the statistics generated from the target set as the 'ground truth'. We evaluate its impact on the performance of MetaNorm in Figure 2. The experimental results show that TBN is not affected by the target size, in both the 5-way, 1-shot and the 5-way, 5-shot tasks. MetaNorm's performance rises as the size of the target set increases and plateaus at a reasonable size. In the few-shot setting, the performance reaches its peak at a size of about 125, slightly larger than the standard size of 75, while in the domain generalization setting the performance plateaus at a size of about 128. This demonstrates that we are able to generate proper statistics with mini-batch gradient descent optimization. In scenarios demanding a very small target set size, we could leverage image synthesis techniques to generate more samples for the target sets.

MetaNorm achieves comparable performance to transductive batch normalization, especially under the 5-way, 1-shot setting, which is challenging since only a few examples are available to generate statistics. Notice that MetaNorm performs well with the standard query set size |Q| of 75 (15 per category): it is slightly better than non-transductive TaskNorm and comparable with TBN. MetaNorm achieves its best performance with a query size |Q| of 125 (25 per category), only slightly larger than the standard size of 75. This demonstrates the benefit of leveraging meta-learning by MetaNorm for batch normalization. We conclude that MetaNorm is general and serves as a plug-and-play module for existing meta-learning models to improve their performance.

As shown in Table 4, MetaNorm achieves the best performance on PACS and Office-Home in terms of average accuracy.
On PACS, MetaNorm consistently outperforms other normalization approaches, including domain-specific normalization (Seo et al., 2019), on all four domains. It is worth mentioning that the baseline normalization uses the statistics from the source domains for the batch normalization of the target domain. As expected, the baseline method produces relatively poor performance on most domains, since the source domains cannot provide proper statistics for target domains due to the distribution shift. We have also done an experiment using standard batch normalization: in the training stage, we compute the ground-truth statistics using all the test data on the meta-target domain D^t instead of using the inferred statistics p(m|D^s \ a_i). MetaNorm is still better on most domains.

5. CONCLUSION

In this paper we present MetaNorm, a meta-learning based batch normalization. MetaNorm tackles the challenging scenarios where the batch size is too small to produce reliable statistics or where training statistics are not directly applicable to test data due to a domain shift. MetaNorm learns to learn adaptive statistics that are specific to tasks or domains. It is generic and model-agnostic, which enables it to be used with various meta-learning algorithms for different applications. We evaluate MetaNorm on two well-known existing tasks, i.e., few-shot classification and domain generalization, and we also introduce the challenging evaluation scenario of few-shot domain generalization, which poses the small-batch and distribution-shift problems simultaneously. An extensive evaluation on 17 datasets reveals that MetaNorm consistently achieves results that are better than, or at least competitive with, other normalization approaches, verifying its effectiveness as a new meta-learning based batch normalization approach.

A ALGORITHMS DESCRIPTIONS

In this Appendix we provide the detailed MetaNorm algorithm descriptions to conduct batch normalization for few-shot classification (Algorithm 1), domain generalization (Algorithm 2) and few-shot domain generalization (Algorithm 3). The dataflow of the implementation is shown in Figure 3 .

Algorithm 1 MetaNorm for Few-Shot Classification

Meta-train:
  Input: values of a over the support set {a_S,i} and query set {a_Q,i}
  γ, β ← initialize parameters
  µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_S,i);    µ_Q = (1/|Q|) Σ_{i=1}^{|Q|} f_µ(a_Q,i)
  σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_S,i − µ_S)²;    σ_Q = (1/|Q|) Σ_{i=1}^{|Q|} f_σ(a_Q,i − µ_Q)²
  â_S,i = γ (a_S,i − µ_S) / √(σ²_S + ε) + β;    â_Q,i = γ (a_Q,i − µ_S) / √(σ²_S + ε) + β
  L_KL = D_KL[N(µ_S, σ_S) || N(µ_Q, σ_Q)]
  return â_S,i = MetaNorm(a_S,i); â_Q,i = MetaNorm(a_Q,i); L_KL

Meta-test:
  Input: values of a over the support set {a_S,i} and query set {a_Q,i}
  µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_S,i)
  σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_S,i − µ_S)²
  â_Q,i = γ (a_Q,i − µ_S) / √(σ²_S + ε) + β
  return â_Q,i = MetaNorm(a_Q,i)
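The meta-train pass of Algorithm 1 can be sketched as runnable code. The hypernetworks f_µ and f_σ below are toy stand-ins (simple fixed functions rather than meta-learned MLPs), used only to show the data flow and the KL computation:

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-5

def f_mu(a):      # toy stand-in for the meta-learned hypernetwork f_µ
    return a.mean()

def f_sigma(a):   # toy stand-in for f_σ; positive by construction
    return np.abs(a).mean()

def metanorm_train_step(a_s, a_q, gamma=1.0, beta=0.0):
    """One meta-train pass of Algorithm 1 with toy hypernetworks."""
    mu_s = np.mean([f_mu(a) for a in a_s])
    mu_q = np.mean([f_mu(a) for a in a_q])
    sigma_s = np.mean([f_sigma(a - mu_s) for a in a_s])
    sigma_q = np.mean([f_sigma(a - mu_q) for a in a_q])
    # Both support and query are normalized with SUPPORT statistics.
    norm_s = gamma * (a_s - mu_s) / np.sqrt(sigma_s ** 2 + EPS) + beta
    norm_q = gamma * (a_q - mu_s) / np.sqrt(sigma_s ** 2 + EPS) + beta
    # Closed-form KL( N(µ_S, σ_S²) || N(µ_Q, σ_Q²) )
    l_kl = 0.5 * (np.log(sigma_q ** 2 / sigma_s ** 2)
                  + (sigma_s ** 2 + (mu_s - mu_q) ** 2) / sigma_q ** 2 - 1.0)
    return norm_s, norm_q, l_kl

a_support = rng.normal(size=(5, 4))    # 5 support activations, 4 dims each
a_query = rng.normal(size=(15, 4))
ns, nq, lkl = metanorm_train_step(a_support, a_query)
```

At meta-test time the query branch of the same function would be reused with support-derived moments only, as in the second half of Algorithm 1.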

Algorithm 2 MetaNorm for Domain Generalization

Train:
  Input: values of a over the meta-source domain {a_S,i} and meta-target domain {a_T,i}
  γ, β ← initialize parameters
  µ_S,i = f_µ(a_S,i);    µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_S,i)
  σ_S,i = f_σ(a_S,i − µ_S,i)²;    σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_S,i − µ_S)²
  µ_T,i = f_µ(a_T,i);    σ_T,i = f_σ(a_T,i − µ_T,i)²
  â_T,i = γ (a_T,i − µ_T,i) / √(σ²_T,i + ε) + β
  L_KL = D_KL[N(µ_S,i, σ_S,i) || N(µ_S, σ_S)]
  return â_T,i = MetaNorm(a_T,i); L_KL

Test:
  Input: values of a over the test domain {a_i}
  µ_i = f_µ(a_i);    σ_i = f_σ(a_i − µ_i)²
  â_i = γ (a_i − µ_i) / √(σ²_i + ε) + β
  return â_i = MetaNorm(a_i)

B DATASETS

We conduct an extensive set of experiments on a total of 17 datasets containing more than 15 million images. All dataset details and settings are provided in this Appendix. miniImageNet. The miniImageNet dataset was originally proposed in (Vinyals et al., 2016) and has been widely used for evaluating few-shot learning algorithms. It consists of 60,000 color images from 100 classes.

Algorithm 3 MetaNorm for Few-Shot Domain Generalization

Meta-train:
  Input: values of a over the meta-source domain support set {a_S,i} and meta-target domain query set {a_Q,i}
  γ, β ← initialize parameters
  µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_S,i);    µ_Q = (1/|Q|) Σ_{i=1}^{|Q|} f_µ(a_Q,i)
  σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_S,i − µ_S)²;    σ_Q = (1/|Q|) Σ_{i=1}^{|Q|} f_σ(a_Q,i − µ_Q)²
  â_S,i = γ (a_S,i − µ_S) / √(σ²_S + ε) + β;    â_Q,i = γ (a_Q,i − µ_S) / √(σ²_S + ε) + β
  L_KL = D_KL[N(µ_S, σ_S) || N(µ_Q, σ_Q)]
  return â_S,i = MetaNorm(a_S,i); â_Q,i = MetaNorm(a_Q,i); L_KL

Meta-test:
  Input: values of a over the support set {a_S,i} and query set {a_Q,i}
  µ_S = (1/|S|) Σ_{i=1}^{|S|} f_µ(a_S,i)
  σ_S = (1/|S|) Σ_{i=1}^{|S|} f_σ(a_S,i − µ_S)²
  â_Q,i = γ (a_Q,i − µ_S) / √(σ²_S + ε) + β
  return â_Q,i = MetaNorm(a_Q,i)

G SENSITIVITY TO DATASET

The complete set of results for each of the thirteen datasets in Meta-Dataset is provided in Table 16.



¹ https://github.com/YDU-AI/MetaNorm



Figure 1: Illustration of the novel few-shot domain generalization scenario using the 5-way, 1-shot setting. The training set in the upper box contains the meta-source domains D^s and the meta-target domain D^t, which are from different domains. Each training task contains meta-source domains with five different classes and one example per class from each meta-source domain, and more than four examples for evaluation in the meta-target domain. The test set is defined in the same way, but with all source domains S covering classes not present in any of the datasets in the training set, and more than four examples used for evaluation in the target domain T.

We sample C classes from the meta-target domain D^t as the query set. At test time, we sample C-way k-shot data as the support set from each of the source domains S. The model learned at meta-training time is then fine-tuned on few-shot tasks sampled from the source domains and tested on the target domain T.

Figure 2: Impact of Target Set Size. The performance increases for larger target sets and plateaus at around 125 for few-shot classification on miniImageNet and around 256 for domain generalization on PACS. TBN here is based on VERSA. MetaNorm generates proper normalization statistics with a reasonable batch size.

Figure 4: Training loss. Results of using the ProtoNets algorithm on miniImageNet, showing training loss versus iterations. Our MetaNorm achieves the fastest training convergence.

Sensitivity to Dataset. Few-shot classification results on Meta-Dataset using ProtoNets. The ± sign indicates the 95% confidence interval over tasks. Best performing methods and any other runs within the 95% confidence margin in bold. Results of other methods provided by Bronskill et al. (2020).


Li et al. (2016) proposed adaptive batch normalization to increase the generalization ability of a deep neural network: by modulating the statistical information of all batch normalization layers in the network, it achieves deep adaptation effects for domain-adaptive tasks. Nado et al. (2020) noted the possibility of accessing small unlabeled batches of the shifted data just before prediction time; to improve model accuracy and calibration under covariate shift, they proposed prediction-time batch normalization. Since the activation statistics obtained during training do not reflect the statistics of the test distribution when testing in an out-of-distribution environment, Schneider et al. (2020) proposed estimating the batch statistics on the corrupted images. Kaku et al. (2020) demonstrated that standard non-adaptive feature normalization fails to correctly normalize the features of convolutional neural networks on held-out data where extraneous variables take values not seen during training. Learning domain-specific batch normalization has been explored (Chang et al., 2019; Wang et al., 2019). Wang et al. (2019) introduced transferable normalization, TransNorm, which normalizes the feature representations from the source and target domains separately using domain-specific statistics. Along a similar vein, Chang et al. (2019) learn a domain-specific batch normalization layer with a separate branch per domain.

In the domain generalization scenario, we adopt the meta-learning setting from (Li et al., 2018a; Balaji et al., 2018; Du et al., 2020) and divide a dataset into the source domains used for training and the target domains held out for testing. At meta-training time, data in the source domains is episodically divided into sets of meta-source D^s and meta-target D^t domains.

al., 2017), ProtoNets (Snell et al., 2017), and VERSA (Gordon et al., 2019), which verifies that MetaNorm is generic, flexible and model-agnostic: a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches. We further compare different normalization methods: transductive batch normalization (TBN); "example", which tests with one example at a time using TBN; "class", which tests with one class at a time using TBN; w/o BN, which does not use batch normalization; CBN, which uses conventional batch normalization; RN (Nichol et al., 2018); MetaBN (Bronskill et al., 2020); TaskNorm-L (Bronskill et al., 2020); and TaskNorm-I (Bronskill et al., 2020).

Effect of KL Term in MetaNorm for few-shot classification with MAML (Finn & Levine, 2018) on miniImageNet and domain generalization on PACS with ResNet-18. More few-shot classification results with ProtoNets (Snell et al., 2017) and VERSA (Gordon et al., 2019), as well as domain generalization results on Office-Home are provided in the appendix. Best performing methods and any other runs within the 95% confidence margin in bold. The KL term is crucial.

Sensitivity to Algorithm. Few-shot results on miniImageNet using different algorithms. Results on Omniglot are provided in the appendix. Best performing methods and any other runs within the 95% confidence margin in bold. Transductive results indicated above dashed line. MetaNorm is a consistent top-performer, regardless of the meta-learning algorithm. Results for MAML and ProtoNets (except w/o BN) provided by (Bronskill et al., 2020), and VERSA with TBN provided by (Gordon et al., 2019). All other results based on our re-implementations.

Sensitivity to Dataset. Few-shot classification on Meta-Dataset using ProtoNets. MetaNorm performs best overall.

Sensitivity to Domains. For this experiment we adopt two widely-used benchmarks for domain generalization of visual object recognition, i.e., PACS (Li et al., 2017a) and Office-Home (Venkateswara et al., 2017). Detailed descriptions of the experimental settings and implementations are provided in the appendix. For fair comparison with prior methods (Balaji et al., 2018; Li et al., 2018b; Seo et al., 2019), we employ ResNet-18 as the backbone network in all experiments. As shown in Table

Sensitivity to Domains. Performance comparison on domain generalization. MetaNorm consistently achieves the best performance among all normalization methods. This is reasonable because ground-truth statistics from the test data do not necessarily reflect the true data distribution. The experimental results demonstrate that MetaNorm can generate reasonable normalization statistics from only one sample in its domain. We conclude that MetaNorm is effective for domain generalization.

Few-Shot Domain Generalization. In our final experiment, we adopt the DomainNet dataset (Peng et al., 2019) and introduce a new, more challenging setting to evaluate performance for few-shot domain generalization. Detailed descriptions of the dataset and experimental settings are provided in the appendix. We conduct experiments with the MAML and ProtoNets algorithms under both 5-way 1-shot and 5-way 5-shot settings, and report the results in Table 5. We implement transductive batch normalization, MetaBN and the variants of TaskNorm for direct comparison. Under both settings, our MetaNorm produces the best performance and surpasses transductive batch normalization by large margins of up to 4.0% in the challenging 5-way 1-shot setting with MAML. MetaNorm also achieves better results than the non-transductive TaskNorm approaches. With ProtoNets, MetaNorm again consistently delivers the best performance and surpasses both transductive and non-transductive normalizations. The performance in the challenging few-shot domain generalization scenario with different meta-learning algorithms again demonstrates the effectiveness of MetaNorm in handling the challenges of batch normalization for small batches and across domains.

Few-Shot Domain Generalization. Comparison with different normalizations using MAML and ProtoNets on the Few-shot DomainNet dataset. Best performing methods and any other runs within the 95% confidence margin denoted in bold. Reported results use "Painting" as the target domain, all based on our implementations. MetaNorm consistently achieves top performance.

Omniglot. Omniglot (Lake et al., 2015) is a few-shot learning dataset consisting of 1,623 handwritten characters (each with 20 instances) derived from 50 alphabets. We follow the pre-processing and training procedure defined in (Vinyals et al., 2016). We resize images to 28×28. The training, validation and test sets consist of a random split of 1,100, 100, and 423 characters.

PACS. PACS (Li et al., 2017a) contains a total of 9,991 images of size 224×224 from 4 domains, i.e., photo, art-painting, cartoon and sketch, which exhibit large domain gaps. Images are from 7 object classes, i.e., dog, elephant, giraffe, guitar, horse, house, and person. We follow the "leave-one-out" protocol in (Li et al., 2017a; 2018b; Carlucci et al., 2019), where the model is trained on any three of the four domains, which we call source domains, and tested on the last (target) domain. The train-val-test splits are the same as in (Li et al., 2017a).

Office-Home. Office-Home (Venkateswara et al., 2017) also has 4 domains: art, product, clipart and real-world. For each domain, the dataset contains images of 65 object categories typically found in office and home settings. We use the same experimental protocol as for PACS.

DomainNet. DomainNet (Peng et al., 2019) contains 6 distinct domains, i.e., clipart, infograph, painting, quickdraw, real, and sketch, for 345 categories. The categories come from 24 divisions: Furniture, Mammal, Tool, Cloth, Electricity, Building, Office, Human Baby, Road Transportation, Food, Nature, Cold Blooded, Music, Fruit, Sport, Tree, Bird, Vegetable, Shape, Kitchen, Water Transportation, Sky Transportation, Insect, and Others.

Meta-Dataset. Meta-Dataset (Triantafillou et al., 2020) is composed of ten (eight train, two test) existing image classification datasets.
These are: ILSVRC-2012 (ImageNet, (Russakovsky et al., 2015)), Omniglot (Lake et al., 2015), Aircraft (Maji et al., 2013), CUB-200-2011 (Birds, (Wah et al., 2011)), Describable Textures (Cimpoi et al., 2014), Quick Draw, Fungi, VGG Flower (Nilsback & Zisserman, 2008), Traffic Signs (Houben et al., 2013) and MSCOCO (Lin et al., 2014). Each episode generated in Meta-Dataset uses classes from a single dataset. Two of these datasets, Traffic Signs and MSCOCO, are fully reserved for evaluation, meaning no classes from these sets participate in the training set. The remaining datasets contribute some classes to each of the training, validation and test splits. In total, Meta-Dataset contains about 14 million images.

C FEW-SHOT DOMAINNET

To construct Few-shot DomainNet, we chose 200 random classes from DomainNet and used 140 for training, 20 for validation and the remaining 40 for testing. Note that these 40 test classes were never seen during training. The dataset consists of 200,000 colour images of size 84×84, with each of the 200 classes having 1,000 examples. Please see Table 6, Table 7 and Table 8 for the training, validation, and test classes.

Training classes of Few-shot DomainNet
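The class-level split above (200 random classes, partitioned 140/20/40 with test classes fully held out) can be sketched as follows; the helper and seed are our own illustration, not the exact sampling code used to build the dataset:

```python
import random

def split_classes(all_classes, n_train=140, n_val=20, n_test=40, seed=0):
    """Randomly choose n_train+n_val+n_test classes and partition them into
    disjoint train/val/test splits; test classes never appear in training."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(all_classes), n_train + n_val + n_test)
    return (chosen[:n_train],
            chosen[n_train:n_train + n_val],
            chosen[n_train + n_val:])

# DomainNet has 345 categories; Few-shot DomainNet uses 200 of them.
classes = [f"class_{i:03d}" for i in range(345)]
train, val, test = split_classes(classes)
```

Splitting at the class level (rather than the image level) is what makes the evaluation genuinely few-shot: test-time classes are novel, not merely unseen images of known classes.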

Inference function f_µ^l(·)

E EXTRA RESULTS FOR EFFECT OF KL TERM

In this appendix we provide extra results for the ablation measuring the effect of the KL term. We report results for few-shot classification on miniImageNet with ProtoNets (Snell et al., 2017) and VERSA (Gordon et al., 2019) in Table 11. We also report domain generalization results on Office-Home in Table 12. In all cases the KL term is crucial.

Effect of KL Term in MetaNorm for few-shot classification on miniImageNet with ProtoNets and VERSA. Best performing methods and any other runs within 95% confidence margin denoted in bold.

Effect of KL Term in MetaNorm for domain generalization on Office-Home.

Sensitivity to Algorithm. Few-shot results on Omniglot using MAML. Best performing methods and any other runs within the 95% confidence margin in bold. Transductive results indicated above dashed line. Results (except w/o BN and our MetaNorm) provided by (Bronskill et al., 2020).

Sensitivity to Algorithm. Few-shot results on Omniglot using VERSA. Best performing methods and any other runs within the 95% confidence margin in bold. Transductive results indicated above dashed line. Results of TBN provided by (Gordon et al., 2019). All other results based on our re-implementations.

Sensitivity to Algorithm. Few-shot results on Omniglot using ProtoNets. Best performing methods and any other runs within 95% confidence margin denoted in bold. Transductive results indicated above dashed line. Results (except w/o BN and our MetaNorm) provided by (Bronskill et al., 2020).

Effect of the number of hidden units in MetaNorm for few-shot classification with MAML (Finn & Levine, 2018) on miniImageNet. The ± sign indicates the 95% confidence interval over tasks. We achieve the best results with hidden layers of 128 units.


In the few-shot learning task, MAML and ProtoNets use a simple CNN containing 4 convolutional layers, each of which is a 3×3 convolution with 32 filters, followed by MetaNorm, a ReLU nonlinearity, and finally 2×2 max-pooling. VERSA uses a CNN containing 5 convolutional layers, each of which is a 3×3 convolution with 64 filters, followed by MetaNorm, a ReLU nonlinearity, and finally 2×2 max-pooling. In the domain generalization task, we rely on ResNet-18 as the backbone for fair comparison with previous work. Each convolutional layer is followed by MetaNorm. The hypernetwork is a 3-layer MLP with 128 units per layer and rectifier nonlinearities. We implemented all models in the TensorFlow framework and tested them on an NVIDIA Tesla V100. All code will be available at: https://github.com/YDU-AI/MetaNorm.
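The spatial dimensions of the backbone above can be checked with a short sketch (assuming, as is common but not stated here, that the 3×3 convolutions use 'same' padding and therefore preserve spatial size, while each 2×2 max-pool with stride 2 halves it, rounding down):

```python
def feature_map_sizes(input_hw, n_blocks):
    """Spatial size of the feature map after each conv block.
    Assumption: 3x3 'same' convolutions keep the size; each 2x2
    max-pool with stride 2 halves it (integer division)."""
    sizes, hw = [], input_hw
    for _ in range(n_blocks):
        hw //= 2  # the pooling step
        sizes.append(hw)
    return sizes

# miniImageNet images are 84x84; after the 4 blocks the map is 5x5x32.
sizes = feature_map_sizes(84, 4)
```

This is only a shape sanity check, but it clarifies why the flattened feature dimension going into the classifier head is small despite the 84×84 input.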

D.1 MAML EXPERIMENTS

For MAML experiments, we used the codebase by Finn (Finn, 2017). We use the Adam optimizer with default parameters, and a meta batch size of 4 tasks. The number of test episodes is set to 600. The number of training iterations is 60,000. We set λ=0.001. The other hyper-parameters are the default MAML parameters. No early stopping was used. We used the first-order approximation of MAML for the experiments.

D.2 PROTONETS EXPERIMENTS

For ProtoNets, we used the codebase by Fatir (Fatir, 2018). For miniImageNet, we used the following ProtoNets options: a learning rate of 0.001, 60,000 training iterations, 200 validation episodes, 600 test episodes and λ=0.0001. We choose the number of hidden units and λ by cross-validation. For Meta-Dataset, we reproduce the code provided by CNAPS (Requeima et al., 2019) in TensorFlow. We simply replace its normalization method with our MetaNorm and add the KL term to the final loss. We are consistent with the dataset configuration and follow the training process specified in (Triantafillou et al., 2020). The number of training iterations is 80,000. We use a constant learning rate of 0.0001. We set λ=0.001. We follow TaskNorm's (Bronskill et al., 2020) options: no feature adaptation is used, and the pre-trained feature extractor weights are allowed to update during the meta-training stage.
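Adding the KL term to the final loss amounts to the following (a schematic of the objective only; `task_loss` and `kl_term` stand in for the actual computed quantities, and the function name is ours):

```python
def metanorm_objective(task_loss, kl_term, lam=0.001):
    """Final meta-training objective: the task loss (e.g. the episode's
    classification loss) plus the KL regularizer weighted by lambda.
    lam=0.001 matches the Meta-Dataset setting described above."""
    return task_loss + lam * kl_term

# Example with placeholder values for the two loss components:
loss = metanorm_objective(task_loss=2.0, kl_term=0.5)
```

The small λ values reported throughout (0.0001 to 0.01, chosen by cross-validation) suggest the KL term acts as a gentle regularizer rather than a dominant loss component.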

D.3 VERSA EXPERIMENTS

For VERSA, we used the codebase by Gordon (Gordon, 2019). For the 5-way 5-shot model, we train with 8 tasks per batch for 100,000 iterations, using a constant learning rate of 0.0001 and λ=0.001. For the 5-way 1-shot model, we train with 8 tasks per batch for 150,000 iterations, using a constant learning rate of 0.00025 and λ=0.01. We set the number of validation episodes to 200 and test episodes to 600. The number of hidden units and λ were chosen by cross-validation.

