REPURPOSING PRETRAINED MODELS FOR ROBUST OUT-OF-DOMAIN FEW-SHOT LEARNING

Abstract

Model-agnostic meta-learning (MAML) is a popular method for few-shot learning but assumes that we have access to the meta-training set. In practice, training on the meta-training set may not always be an option due to data privacy concerns, intellectual property issues, or merely lack of computing resources. In this paper, we consider the novel problem of repurposing pretrained MAML checkpoints to solve new few-shot classification tasks. Because of the potential distribution mismatch, the original MAML steps may no longer be optimal. Therefore, we propose an alternative meta-testing procedure and combine MAML gradient steps with adversarial training and uncertainty-based stepsize adaptation. Our method outperforms "vanilla" MAML on same-domain and cross-domain benchmarks using both SGD and Adam optimizers and shows improved robustness to the choice of base stepsize.

1. INTRODUCTION

Deep learning approaches have shown improvements based on massive datasets and enormous computing resources. Despite their success, it is still challenging to apply state-of-the-art methods in the real world. For example, in semiconductor manufacturing (Nishi & Doering, 2000), collecting each new data point is costly and time-consuming because it requires setting up a new manufacturing process accordingly. Moreover, in the case of a "destructive inspection", the cost is very high because the wafer must be destroyed for measurement. Therefore, learning from small amounts of data is important for practical purposes.

Meta-learning (learning-to-learn) approaches have been proposed for learning under limited data constraints. A meta-learning model optimizes its parameters for the best performance on a distribution of tasks. In particular, few-shot learning (FSL) formulates "learning from limited data" as an n-way k-shot problem, where n is the number of classes and k is the number of labeled samples per class. For each task in FSL, a support set is provided for training, while a query set is provided for evaluation. Ideally, a meta-learning model trained over a set of tasks (meta-training) will generalize well to new tasks (meta-testing).

Model-agnostic meta-learning (MAML) (Finn et al., 2017) is a general end-to-end approach for solving few-shot learning tasks. MAML is trained on the meta-training tasks to learn a model initialization (also known as a checkpoint) such that a few gradient steps on the support set yield the best predictions on the query set. However, in practice it may not always be possible to retrain or finetune on the meta-training set. This situation may arise when the meta-training data is confidential, subject to restrictive licences, or contains private user information or protected intellectual property such as semiconductor manufacturing know-how.
Another reason is that one may not have the computing resources necessary for running large-scale meta-training. In this paper, we consider the novel problem of repurposing MAML checkpoints to solve new few-shot classification tasks, without the option of (re)training on the meta-training set. Since the meta-testing set (new tasks) may differ in distribution from the meta-training set, the implicit assumption, made in end-to-end learning, of identically distributed tasks may not hold, so there is no reason why the meta-testing gradient steps should match the meta-training ones. Therefore, we investigate various improvements over the default MAML gradient steps for test-time adaptation.

Conceptually, our approach consists of collecting information while training the model on a new support set and then proposing ways to use this information to improve the adaptation. In this paper, we consider the variance of model parameters during ensemble training as the source of information. We propose algorithms that use this information both to adapt the stepsizes for MAML and to generate "task-specific" adversarial examples that help robust adaptation to the new task. Our main contributions are the following:

• We motivate the novel problem of repurposing MAML checkpoints to solve cross-domain few-shot classification tasks, in the case where the meta-training set is inaccessible, and propose a method based on uncertainty-based stepsize adaptation and adversarial data augmentation, which has the particularity that meta-testing differs from the meta-training steps.

• Compared to "vanilla" MAML, our method shows improved accuracy and robustness to the choice of base stepsizes on popular cross-domain and same-domain benchmarks, using both the SGD and Adam (Kingma & Ba, 2014) optimizers. Our results also indicate that adversarial training (AT) is helpful in improving model performance during meta-testing.
To the best of our knowledge, our work is the first few-shot learning method to combine the use of ensemble methods for stepsize computation with the generation of adversarial examples from the meta-testing support set for improved robustness. Moreover, our empirical observation of improving over the default meta-testing procedure of MAML motivates further research on alternative ways to leverage published model checkpoints.

2. RELATED WORK

2.1. META-LEARNING AND MODEL-AGNOSTIC META-LEARNING

FSL approaches deal with an extremely small amount of training data and can be classified into three categories. First, metric-based approaches solve few-shot tasks by training a feature extractor that maximizes inter-class similarity and intra-class dissimilarity (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018). Second, memory-based approaches utilize previous tasks for new tasks with external memories (Santoro et al., 2016; Mishra et al., 2018). Third, optimization-based approaches search for good initial parameters during training and adapt the pretrained model to new tasks (Finn et al., 2017; Lee & Choi, 2018; Grant et al., 2018). We focus on optimization-based approaches and propose improved test-time training, especially for the general family of MAML methods.

2.2. UNCERTAINTY FOR MODEL TRAINING

Uncertainty is an important criterion for measuring the robustness of a neural network (NN). Bayesian neural networks (BNNs) (Blundell et al., 2015) obtain model uncertainty by placing prior distributions over the weights p(ω). This uncertainty has been used to adapt the stepsizes during continual learning in Uncertainty-guided Continual BNNs (UCB) (Ebrahimi et al., 2020). For each parameter, UCB scales the stepsize inversely proportional to the uncertainty of that parameter in the BNN, to reduce changes in important parameters while allowing less important parameters to be modified faster in favor of learning new tasks. Our approach also decreases the stepsizes for "uncertain" parameters, but uses a different notion of uncertainty, and operates instead in the context of FSL with a pretrained MAML checkpoint. Deep ensembles (Lakshminarayanan et al., 2017) were shown to produce predictive uncertainty estimates comparable in quality to those of BNNs. Deep ensembles, however, are not directly applicable to FSL tasks, as training randomly initialized parameters from scratch with a limited amount of training data yields poor performance. Our approach is partly inspired by deep ensembles, adapting them to the FSL setting by using parameter perturbations of the MAML checkpoint model rather than random initializations. We use in particular a multiplicative Gaussian perturbation that rescales the parameters, as the information content of the weights is said to be invariant to their scale (Wen et al., 2018).

2.3. ADVERSARIAL TRAINING FOR ATTACK AND DEFENSE

Deep NNs are sensitive to adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016). These methods generate an adversarial sample that fools a trained model, even though the generated image looks identical to the original one to humans. Goodfellow et al. (2015) proposed the fast gradient sign method (FGSM), which generates adversarial examples using the sign of the input gradient. While stronger attacks have been proposed (such as using projected gradient descent (Madry et al., 2018)), we focus on the FGSM approach in this paper for simplicity. Adversarial attack and defense have also been studied in FSL. The ADversarial Meta-Learner (Yin et al., 2018) utilized a one-step adversarial attack for generating adversarial samples during meta-training. There is little consideration, however, for the degradation of accuracy on the original samples in adversarial defense approaches.

Domain adaptation (DA) is also related to our setting; many DA approaches build on generative adversarial networks (GANs) (Goodfellow et al., 2014). GAN-based DA approaches require access to a large amount of unlabeled data from both the source and target datasets (Tzeng et al., 2017; Zhang et al., 2019; Wilson & Cook, 2020). DA methods cannot be directly applied in the FSL scenario due to the limited number of target domain samples (shots). Some domain-adaptive FSL (DA-FSL) methods have been proposed for the case where only very few samples from the target domain are available (Motiian et al., 2017; Zhao et al., 2020). DA and DA-FSL methods cannot be directly applied to our setting because we assume that the meta-training dataset is inaccessible, mirroring real-world situations in which access to the meta-training set is restricted by privacy and confidentiality concerns. Our approach differs from DA and DA-FSL by not requiring access to the meta-training dataset (source domain).
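As a concrete illustration, FGSM's one-step perturbation can be sketched on a toy linear model with a logistic loss (a minimal NumPy sketch of our own; `fgsm_example` and all values are illustrative, not taken from the cited papers, which attack deep networks):

```python
import numpy as np

def fgsm_example(x, y, w, eps):
    """One-step FGSM sketch: perturb x by eps in the direction of the
    sign of the input gradient of a logistic loss for a linear model."""
    # logistic loss l(x) = log(1 + exp(-y * w.x)); its input gradient:
    margin = y * w.dot(x)
    grad_x = -y * w / (1.0 + np.exp(margin))
    return x + eps * np.sign(grad_x)

x = np.array([1.0, -2.0])
w = np.array([0.5, 0.5])
x_adv = fgsm_example(x, y=1.0, w=w, eps=0.1)
# Every input component moves by exactly eps, in the direction that
# increases the loss to first order: x_adv == [0.9, -2.1].
```

Note how the perturbation magnitude is the same for every component; UFGSM (Section 3.2) will modulate it component-wise.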

3. PROPOSED METHOD

At meta-testing time, MAML normally uses the support set to compute fixed gradient steps, which were "calibrated" using end-to-end learning during meta-training. That is, the learned initialization is such that a fixed combination of stepsize and loss yields the desired behavior. However, those stepsizes and losses may be suboptimal on the new task, especially if the new task is out-of-domain with respect to the meta-training tasks. Our method is based on the assumption that the support set can be used to improve the meta-testing procedure itself, beyond merely serving as training examples. We start by leveraging the support set to estimate task-specific uncertainties over the model parameters. Then, we propose two improvements over the "vanilla" MAML gradient steps: we scale the gradient steps using layer-wise stepsizes computed from the support set, and we train using task-specific adversarial examples.
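For reference, the default MAML meta-testing procedure reduces to a few fixed-stepsize gradient steps on the support-set loss, starting from the checkpoint (a minimal NumPy sketch; the quadratic toy loss, `maml_meta_test`, and all values are illustrative placeholders, not the paper's network):

```python
import numpy as np

def maml_meta_test(theta0, loss_grad, alpha, T):
    """Vanilla MAML adaptation sketch: T fixed-stepsize gradient
    steps on the support-set loss, starting from checkpoint theta0."""
    theta = theta0.copy()
    for _ in range(T):
        theta = theta - alpha * loss_grad(theta)
    return theta

# Toy support-set loss L(theta) = ||theta - target||^2 / 2.
target = np.array([1.0, -1.0])
grad = lambda th: th - target
theta_T = maml_meta_test(np.zeros(2), grad, alpha=0.5, T=10)
# theta_T converges toward the toy support-set optimum `target`.
```

Our method modifies exactly two ingredients of this loop: the scalar stepsize `alpha` (replaced by uncertainty-adapted layer-wise stepsizes) and the loss whose gradient is taken (augmented with adversarial terms).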

3.1. MOTIVATING HYPOTHESES

At meta-testing time, we start by assuming that we can estimate task-specific uncertainties over the model parameters. One possibility, which we adopt in Section 3.2, is to train deep ensembles (Lakshminarayanan et al., 2017) on the support set and use the ensemble to estimate variances over model parameters. Each model learns slightly different parameters and yields slightly different gradients. We regard the variances over the parameters and input gradients as task-specific uncertainties. Given the uncertainty estimates, we propose two modifications to the original MAML gradient steps.

Proposal 1 (Task-specific stepsizes). Use lower stepsizes for model parameters with higher variance. In our case, the variance could be further amplified if high-variance components were moved with large stepsizes. Therefore, we carefully move high-variance components with a lower stepsize, in the hope of limiting the variance over the model parameters. This approach relates to the fact that lower stepsizes should be taken for SGD when the gradients are very noisy (Dieuleveut et al., 2020).

Proposal 2 (Task-specific adversarial examples). Use adversarial examples with higher adversarial perturbation on input components with higher variance over the input gradient. The intuition is that if even slightly perturbed models (from the ensemble) disagree on parts of the input gradient, then they disagree on what to learn, and therefore those parts of the input are more vulnerable to attack. Therefore, we propose to use AT with stronger adversarial perturbation on the weak parts of the input, in the hope of improving robustness on those parts. We regard adversarial training as a form of data augmentation or regularization at meta-testing time, which we hope allows the model to overcome the limited size of the support set.
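The uncertainty estimates underlying both proposals can be sketched as follows: perturb the checkpoint multiplicatively, and take parameter-wise standard deviations over the resulting ensemble (a minimal NumPy sketch; `make_ensemble`, the sizes, and the use of sigma as the noise standard deviation are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ensemble(theta0, sigma, M):
    """Multiplicative Gaussian perturbation of the checkpoint:
    theta_m = theta0 * (1 + noise), noise ~ N(0, sigma)."""
    return [theta0 * (1.0 + sigma * rng.standard_normal(theta0.shape))
            for _ in range(M)]

theta0 = np.array([1.0, -2.0, 0.5])
ensemble = make_ensemble(theta0, sigma=0.05, M=5)

# Parameter-wise standard deviation over the ensemble: this plays the
# role of the task-specific uncertainty u used by Proposals 1 and 2.
u = np.std(np.stack(ensemble), axis=0)
```

In the actual method, each ensemble member is additionally trained on the support set before the deviation is measured, so u reflects task-specific disagreement rather than just the injected noise.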

3.2. UNCERTAINTY-BASED GRADIENT STEPS AT TEST-TIME

We improve over the default MAML gradient steps by implementing the ideas presented in the previous section. The resulting approach is detailed in Algorithm 1 and is a combination of uncertainty-based stepsize adaptation (USA, based on Proposal 1) with uncertainty-based fast gradient sign method (UFGSM), adversarial training (AT), and the generation of additional adversarial examples from deep ensembles (EnAug), which are based on Proposal 2.

Denote L(D, θ) = (1/|D|) Σ_{(x,y)∈D} l_θ(x, y) the cross-entropy loss for model θ over the labeled dataset D = {(x, y), . . . }, A_θ(x, y) = x + ε sign(∇_x l_θ(x, y)) the FGSM adversarial example computed from (x, y), and L_AT(D, θ) = (1/|D|) Σ_{(x,y)∈D} l_θ(A_θ(x, y), y) the resulting adversarial cross-entropy, which we refer to as AT. For easy reference, all notations are summarized in Appendix A.1.

Starting from the pretrained MAML checkpoint θ_0, we perturb the model parameters with multiplicative Gaussian noise to create a deep ensemble (θ^m_0)_{1≤m≤M} (lines 2-4). Then, we repeat the following for T steps. At time t, we run gradient descent on each model θ^m_t of the ensemble (line 7), where the loss is a combination of the usual cross-entropy L and the AT loss L_AT as in Lakshminarayanan et al. (2017), but on the support set, and the stepsizes α_adap are updated using USA (details below). We also generate adversarial examples using FGSM and UFGSM (details below) and store them into D_Aug (lines 8 and 11). Finally, we run T gradient steps on the original checkpoint, where the loss is a combination of the cross-entropy on the support set D_Spt, the AT loss on the support set, and the cross-entropy on the ensemble-augmented support set D_Aug (lines 13-16). Note that we recover the original MAML steps for (λ_AT = λ_Aug = 0, λ_α = 1), while we refer to the case (λ_AT = λ_Aug = 1, λ_α = 0) as our full method.

Algorithm 1: Uncertainty-based Gradient Steps at Test-time
Data: New task support set D_Spt = {(x_1, y_1), . . . , (x_{n×k}, y_{n×k})} with n ways and k shots.
Require: Base stepsize α, pretrained weights θ_0, Gaussian variance σ, AT coefficient ε, size of ensemble M, number of gradient steps T, selection coefficients in {0, 1} for: base stepsize λ_α, adversarial loss λ_AT, and augmented cross-entropy λ_Aug.
1:  D_Aug ← ∅, α_adap ← α
2:  for m = 1 to M do
3:      θ^m_0 ← θ_0 (1 + N(0, σ))    ▷ initialize deep ensemble
4:  end for
5:  for t = 1 to T do
6:      for m = 1 to M do
7:          θ^m_t ← θ^m_{t-1} − α_adap ∇_{θ^m_{t-1}} [L(D_Spt, θ^m_{t-1}) + L_AT(D_Spt, θ^m_{t-1})]
8:          D_Aug ← D_Aug ∪ {(A_{θ^m_{t-1}}(x, y), y) | (x, y) ∈ D_Spt}    ▷ EnAug
9:      end for
10:     α_adap ← USA(α, θ^{1:M}_t)    ▷ Algorithm 2
11:     D_Aug ← D_Aug ∪ {UFGSM(θ_0, θ^{1:M}_t, x, y) | (x, y) ∈ D_Spt}    ▷ Algorithm 3
12: end for
13: α ← λ_α α + (1 − λ_α) α_adap
14: for t = 1 to T do
15:     θ_t ← θ_{t-1} − α ∇_{θ_{t-1}} [L(D_Spt, θ_{t-1}) + λ_AT L_AT(D_Spt, θ_{t-1}) + λ_Aug L(D_Aug, θ_{t-1})]
16: end for

USA. We propose uncertainty-based stepsize adaptation (USA), which assigns lower stepsizes to layers with higher uncertainty (Algorithm 2); we loosely call this an "inverse relationship" below. Let α denote the default (scalar) stepsize, θ^{1:M} the parameters of the ensemble models, M the size of the ensemble, and L the number of layers in the model. To compute the adapted layer-wise stepsizes α_adap, we compute u, the parameter-wise standard deviation of the model parameters over the ensemble (line 2), apply an inverse-relationship transformation which flips the max with the min (line 3), average over each layer (line 4), and L1-normalize the result (line 5). The design choices for USA are explained in Appendix A.2. The resulting stepsizes α_adap have an inverse relationship with the variance of each layer. We give an example of applying USA to the 4-ConvNet architecture on miniImageNet (Figure 1). On the left, we plot the layer-wise standard deviations of the parameters and, on the right, the corresponding USA stepsizes, which follow an inverse relationship with the standard deviations.
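Under our reading of the USA steps described above, the computation can be sketched in NumPy as follows (layer boundaries and uncertainty values are illustrative; `usa` is our own hypothetical helper, not the paper's implementation):

```python
import numpy as np

def usa(alpha, u_per_layer):
    """USA sketch. u_per_layer: list of arrays of parameter-wise
    standard deviations, one array per layer. Returns one stepsize per
    layer, inversely related to the layer's uncertainty and normalized
    so that the mean stepsize stays equal to alpha."""
    u = np.concatenate(u_per_layer)
    c = u.max() - u + u.min()          # flip max with min (inverse relation)
    # average the transformed uncertainty over each layer
    splits = np.cumsum([len(v) for v in u_per_layer])[:-1]
    mu = np.array([seg.mean() for seg in np.split(c, splits)])
    return alpha * mu / mu.mean()      # normalize: mean stepsize == alpha

alpha_adap = usa(0.01, [np.array([0.1, 0.3]), np.array([0.05, 0.05])])
# The high-uncertainty first layer gets a lower stepsize than the
# low-uncertainty second layer, and the stepsizes average to 0.01.
```

Note the normalization in the last line, which keeps the average stepsize pinned to the base stepsize α so that USA changes only the distribution of stepsizes across layers, not their overall scale.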
Algorithm 2: USA
1: Function USA(α, θ^{1:M}_t)
2:     u ← Std(θ^{1:M}_t)    ▷ parameter-wise std over the ensemble
3:     c ← Max(u) − u + Min(u)    ▷ inverse relationship
4:     μ_l ← average of c over each layer l
5:     α_adap ← α μ_l / ((1/L) Σ^L_{l=1} μ_l)    ▷ L1-normalize
6:     return α_adap
7: end

Algorithm 3: UFGSM
Define: MinMaxNorm(x) = (x − Min(x)) / (Max(x) − Min(x))
1: Function UFGSM(θ, θ^{1:M}_t, x, y)
2:     u ← Std({∇_x l_{θ^m_t}(x, y) | 1 ≤ m ≤ M})
3:     u ← MinMaxNorm(u)
4:     x′ ← x + ε u sign(∇_x l_θ(x, y))
5:     return x′
6: end

UFGSM. We also propose uncertainty-based FGSM (UFGSM) to generate adversarial examples, with higher adversarial perturbation on input components with higher variance over the input gradient (Algorithm 3). Starting from an image x and label y, we compute the input gradient for each model from the ensemble and calculate the standard deviation u over the ensemble (line 2). Then, we linearly map u between 0 and 1 (line 3), compute the FGSM adversarial example for the pretrained model θ_0, and rescale it using u, so that areas of higher variance get more perturbation (line 4). We give an example of applying UFGSM to miniImageNet in Figure 2.
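A NumPy sketch of the UFGSM computation (illustrative; in the actual method the input gradients come from the network's loss, whereas here they are small hand-picked vectors, and `ufgsm` is our own hypothetical helper):

```python
import numpy as np

def minmax_norm(x):
    return (x - x.min()) / (x.max() - x.min())

def ufgsm(grad_checkpoint, grads_ensemble, x, eps):
    """UFGSM sketch: scale the FGSM perturbation element-wise by the
    min-max normalized std of the input gradient across the ensemble,
    so high-disagreement components receive more perturbation."""
    u = np.std(np.stack(grads_ensemble), axis=0)
    u = minmax_norm(u)
    return x + eps * u * np.sign(grad_checkpoint)

x = np.array([0.2, 0.8, 0.5])
g0 = np.array([1.0, -1.0, 1.0])        # checkpoint input gradient
gs = [np.array([1.0, -1.0, 1.0]),      # ensemble input gradients
      np.array([1.2, -1.0, 0.4])]
x_adv = ufgsm(g0, gs, x, eps=0.05)
# The component where the ensemble disagrees most (index 2) receives
# the full eps; the component with no disagreement (index 1) is untouched.
```

Plain FGSM is the special case u = 1 everywhere, which matches the rightmost panel of Figure 2.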

4. EXPERIMENTAL EVALUATION

We train MAML on the miniImageNet (Vinyals et al., 2016) training split; we then apply our method on the resulting checkpoint. We evaluate our model on the test split of miniImageNet for the same-domain setting, as well as on CUB-200-2011 (Welinder et al., 2010), Traffic Sign (Houben et al., 2013) and VGG Flower (Nilsback & Zisserman, 2008) for the cross-domain setting. These datasets are denoted as Mini, Birds, Signs and Flowers respectively. A desirable feature for an optimizer is to maintain good performance over a broad range of stepsizes (Asi & Duchi, 2019). Therefore, we evaluate our approach not only at the optimal stepsize, but also over a broad range of base stepsizes from 10^-4 to 1. We evaluate the performance with three metrics: All, Top-1 and Top-40%. If two methods have comparable Top-1 performance, but one has better Top-40% performance, then that method is more robust to the choice of base stepsize. The detailed experimental setup and pretrained model selection are included in Appendix A.3. Our code is available at https://github.com/NamyeongK/USA_UFGSM/.

4.1. MAIN RESULTS

Our main results are in Table 1. We compare the default MAML steps, which we take as our baseline (denoted as SGD), with our full method (denoted as SGD+All), which consists of combining USA, UFGSM, EnAug, and AT (λ_AT = λ_Aug = 1, λ_α = 0 in Algorithm 1). In terms of absolute performance (Top-1 row), our method outperforms the baseline on cross-domain tasks (Birds, Flowers, Signs), while the performance is comparable on same-domain tasks (Mini). In terms of robustness to the base stepsize (All and Top-40% rows), our method outperforms the baseline over a large range of stepsizes for 5-way 5-shot and 10-way 1-shot tasks, while for 5-way 1-shot the results are either comparable (All) or better (Top-40%) depending on the metric considered. More results can be found in Appendix A.6.

4.2. DISCUSSION

Ablation Study. We perform an ablation study for the 5-way 1-shot case in Table 2 and plot the accuracy at different base stepsizes for the Flowers dataset in Figure 3. Comparing SGD to SGD+AT in Table 2, we observe that adversarial training is beneficial over the baseline (Top-1 and Top-40%), except for large stepsizes with SGD, which is reflected in the All metric and in the dip in performance in Figure 3a. Notice also how the use of AT flattens the accuracy curve near the optimal stepsize. Comparing SGD to SGD+USA, we observe a very small but consistent improvement over the baseline in the vicinity of the optimal stepsize (Top-1 and Top-40%). Comparing SGD+USA to SGD+USA+UFGSM, we observe improvement over a wide range of stepsizes, which is reflected in the curves and in the All and Top-40% metrics. Comparing SGD+USA+UFGSM+EnAug+AT to SGD+USA+UFGSM+EnAug shows that AT+EnAug is beneficial in terms of absolute performance (Top-1) as well as robustness to the choice of stepsize (Top-40%). The drop in the All metric is explained by the dip in the curve for the largest stepsizes. Overall, the best absolute performance (Top-1) is always obtained through some use of adversarial training, while using the full method consistently results in increased robustness to the choice of stepsize (best Top-40% performance). Appendix A.5 shows various AT results.

Table 2: Ablation study for 5-way 1-shot classification with the SGD optimizer. SGD+AT corresponds to using (λ_AT = λ_α = 1, λ_Aug = 0) in Algorithm 1, SGD+USA to (λ_AT = λ_Aug = λ_α = 0), SGD+USA+UFGSM to (λ_AT = λ_α = 0, λ_Aug = 1) and SGD+USA+UFGSM+EnAug+AT to (λ_AT = λ_Aug = 1, λ_α = 0).

UFGSM vs. FGSM. We compare UFGSM to FGSM (Goodfellow et al., 2015) in Table 3. The results show that our uncertainty-based approach consistently outperforms FGSM, which suggests that the uncertainty information extracted from the support set is useful.

SGD vs. Adam. Our method can be used with both SGD and Adam; however, we have mainly focused on SGD throughout the paper. SGD tends to yield the best results (Top-1 and Top-40%), as shown in Table 4 for the baseline and full method. More results for Adam can be found in Sections A.6 and A.8 of the appendix.

Best checkpoint vs. Overfitted checkpoint. We also consider the problem of repurposing an overfitted checkpoint. We find that our method improves both absolute performance and robustness to stepsize. In fact, the gains are more substantial on the overfitted (worse) checkpoint than on the best checkpoint (Appendix A.8).

Validating Proposals 1 and 2. Our method is built on the hypotheses that we should take small stepsizes on high-uncertainty parameters and add more adversarial perturbation on high-uncertainty input gradients. To validate those choices, we have tried taking large stepsizes on high-uncertainty parameters and using less adversarial perturbation on high-uncertainty input gradients, which resulted in lower accuracy and robustness to the stepsize (Appendix A.7).

Freezing the most uncertain layers. We observed that the batch normalization (BN) layers have the most uncertainty (Figure 1). In Proposal 1, we argued that higher-uncertainty components amplify the error when we use bigger stepsizes. As suggested by an anonymous reviewer, this motivates an additional experiment where we "freeze" the BN layers during meta-testing (i.e., we manually set the stepsize for the BN scale and shift parameters to zero). We performed the BN layer freezing experiment on SGD+All (see detailed results in Table 10 and Figure 12 in Appendix A.9). It turns out that SGD+All (w/ freezing BN) outperforms SGD+All (w/o freezing BN) on the All metric, with most improvement in the higher stepsize range (>0.1). This is consistent with our intuition that updating high-uncertainty layers (such as BN) with large stepsizes can be harmful.

5. CONCLUSION

In this paper we considered the novel problem of repurposing pretrained MAML checkpoints for out-of-domain few-shot learning. Our method uses deep ensembles to estimate model parameter and input gradient uncertainties over the support set, and builds upon the default MAML gradient steps through the addition of uncertainty-based adversarial training and adaptive stepsizes. Our experiments over popular few-shot benchmarks show that our method yields increased accuracy and robustness to the choice of base stepsize. More generally, our results motivate the use of adversarial learning as a data augmentation scheme for improving few-shot generalization. In the future, it would be interesting to apply our method to related settings such as domain adaptation and transfer learning.

A.2 DESIGN CHOICES FOR USA

Here, we introduce the design choices of Algorithm 2.

Line 3: We calculate the maximum Max(u) and minimum Min(u) values among the estimated standard deviations u for each parameter. Following Proposal 1, in order to assign low stepsizes to high uncertainty, the standard deviation of each parameter is subtracted from the maximum value (Max(u) − u), producing an inverse relationship. At this point, the parameter with the largest uncertainty would map to 0; to prevent this, we add the smallest value, giving Max(u) − u + Min(u). Note that all values of u are non-negative. We also tried alternative transformations, such as a non-inverse relationship and the relative standard deviation, but they did not seem to work better. To keep the scale of the uncertainty the same before and after the transformation, we chose the inverse relationship Max(u) − u + Min(u).

Line 4: In this study, we use layer-wise adapted stepsizes. In order to assign the same stepsize within each layer, we average the transformed values over the parameters of each layer and replace each parameter's value with its layer average, repeating this operation for all layers.
Line 5: To more easily compare the effectiveness of USA with the baseline, we set the average stepsize equal to the base stepsize, i.e. (1/|α_adap|) Σ α_adap = α. If USA were applied without this rescaling, then the average stepsize could change, which would make it difficult to distinguish the effect of using an overall different stepsize (as for constant-stepsize SGD) from the effect of our stepsize adaptation.
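A quick numeric check of the Line 3 transformation and the Line 5 rescaling (treating, for simplicity, each parameter as its own layer; all values are made up):

```python
import numpy as np

# Illustrative per-parameter uncertainties (standard deviations).
u = np.array([0.05, 0.2, 0.5])

# Line 3: inverse-relationship transformation. The largest uncertainty
# maps to the smallest value and vice versa, and all values stay
# strictly positive: c == [0.5, 0.35, 0.05].
c = u.max() - u + u.min()

# Line 5: rescaling keeps the average stepsize equal to the base stepsize.
alpha = 0.01
alpha_adap = alpha * c / c.mean()
```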

A.3 EXPERIMENTAL SETUP AND PRETRAINED MODEL SELECTION

Our baseline model was trained using the same hyperparameters as the miniImageNet training of MAML, except for the inner-loop stepsize. The inner-loop stepsize was set to 0.1 to reproduce the 5-way 1-shot accuracy reported in the original MAML paper. We trained the model for 150,000 iterations and used the checkpoint with the highest miniImageNet validation accuracy as the pretrained model θ_0. The highest test accuracy we achieved during meta-training was 49.24%; for the checkpoint that we chose based on the highest validation performance, the test accuracy was 47.58%. Note that the cross-domain performance depended strongly on which checkpoint was selected (see Fig. 4 in Appendix A.4). Figure 4 shows the performance for every 200 iterations. We selected the checkpoint based on miniImageNet validation performance, which is highest at iteration 57,200.

A.5 EFFECT OF ADVERSARIAL TRAINING.

To evaluate the effect of AT, we only applied AT in MAML (λ_α = λ_AT = 1, λ_Aug = 0). Table 5 and Figures 5 and 6 show AT results on 5-way 1-shot with SGD and Adam. Adam is worse than SGD in these results. However, we found that AT and Adam are a good combination for meta-test training.

A.6 ADDITIONAL RESULTS FOR 5-WAY 5-SHOT AND 10-WAY 1-SHOT CLASSIFICATION.

We present additional results for 5-way 5-shot and 10-way 1-shot classification in this section. Table 6 shows the 5-way 5-shot and 10-way 1-shot classification results. In particular, on the Top-40% metric our method increased performance by about 5.84% and 1.83% for 5-way 5-shot and 10-way 1-shot respectively, which increases the probability of selecting an optimal stepsize for a new task. As shown in Figure 7, the curves are flatter near the highest accuracy for both tasks.

A.7 VALIDATING PROPOSALS 1 AND 2 WITH "INVERSE" COUNTERPARTS.
To validate Proposal 1, we evaluate the inverse strategy of USA, denoted I/USA, which assigns higher stepsizes to layers with higher uncertainty. Specifically, we flip the stepsizes by replacing line 3 of Algorithm 2 with c ← u. To measure the effect of USA, we only applied USA in MAML (λ_α = λ_AT = λ_Aug = 0). The results for 5-way 1-shot classification are in Table 7. USA outperforms both the Adam and SGD baselines over broad ranges of stepsizes, while I/USA significantly decreased performance on all metrics. USA therefore successfully injects useful task knowledge into the stepsizes. Figure 8 shows the accuracy by stepsize for verifying Proposal 1: USA shows a flatter curve in the high-accuracy range than SGD and I/USA, while I/USA degrades performance over all ranges of stepsizes.

To verify Proposal 2, we evaluated three methods on top of USA: UFGSM, I/UFGSM, and FGSM. I/UFGSM is the inverse implementation of UFGSM, and FGSM uses examples generated by FGSM instead of UFGSM. To measure the effect of UFGSM, we only applied UFGSM in MAML with USA (λ_α = λ_AT = 0, λ_Aug = 1). Table 7 shows the 5-way 1-shot classification results. UFGSM outperforms all the other methods on every metric. Although I/UFGSM applies stronger adversarial perturbation than UFGSM, its performance was worse. Note that I/UFGSM and FGSM showed almost identical performance. The reason is that most of the input pixels have an uncertainty close to 0; therefore, after inverting and rescaling, most pixels have a value of 1, so the generated examples are almost identical to FGSM adversarial examples. Figure 9 shows the accuracy by stepsize for verifying Proposal 2.

Table 6: 5-way 5-shot and 10-way 1-shot classification results using SGD. Our method outperformed the baseline on both tasks; the improvement is especially significant for 5-way 5-shot classification.
A.8 BEST CHECKPOINT VS. OVERFITTED CHECKPOINT.

As can be seen in Appendix A.4, models overfitted on miniImageNet degraded performance on all datasets. We investigated the 5-way 5-shot classification results on the checkpoint obtained after meta-training with MAML for 150K iterations. Tables 8 and 9 show the results with SGD and Adam. The number in parentheses is the difference from the baseline for each checkpoint (e.g. SGD or Adam). The results using the (Last) checkpoint are worse than those using the (Validation) checkpoint due to overfitting. However, the performance boost from using our method is more substantial on the overfitted (Last) checkpoint than on the best (Validation) checkpoint, both in terms of absolute performance (Top-1) and robustness to the choice of base stepsize (other metrics).

A.9 FREEZING MOST UNCERTAIN LAYERS.

Since some of the BatchNorm parameters have the highest uncertainty (see Figure 1a), we experimented with freezing the BatchNorm parameters during meta-testing (i.e. setting their stepsize to zero). It turns out that SGD+All (w/ freezing BN) outperforms SGD+All (w/o freezing BN) on the All metric (Table 10), with most improvement in the higher stepsize range (>0.1) (Figure 12). We notice a similar trend for 5-way 5-shot SGD, where SGD (w/ freezing BN) also outperforms SGD (w/o freezing BN) on the All metric, with most improvement in the higher stepsize range.

Figure 7: 5-way 5-shot (Top) and 10-way 1-shot (Bottom) classification with MAML(SGD) and our proposed method. In terms of robustness, our method outperforms MAML(SGD). In addition, our method achieves better Top-1 accuracy for Birds, Flowers and Signs, and shows a flatter curve near the highest-accuracy region, which increases the probability of selecting an optimal stepsize.

Figure 8: 5-way 1-shot classification for verifying Proposal 1. We compare USA and I/USA.



In this study, we choose to use layer-wise stepsizes, but it is also possible to use kernel-wise or parameter-wise stepsizes. The motivation for using layer-wise stepsizes is that features from the same layer tend to have the same level of abstraction (Zeiler & Fergus, 2014). We determined the range of stepsizes based on training performance on miniImageNet and kept it the same for the other datasets; we selected the minimum and maximum stepsize beyond which performance decreased drastically. All is the average accuracy over all stepsizes. Top-1 is the accuracy of the best-performing stepsize. Top-40% is the average of the top 40% of accuracies among all the stepsizes.
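The three stepsize-robustness metrics defined above can be computed as follows (a minimal sketch; `stepsize_metrics` and the tie-breaking and rounding conventions are our own assumptions, not the paper's exact implementation):

```python
import numpy as np

def stepsize_metrics(accs):
    """Compute All, Top-1 and Top-40% from a vector of accuracies,
    one accuracy per evaluated base stepsize."""
    accs = np.sort(np.asarray(accs))[::-1]       # descending
    k = max(1, int(round(0.4 * len(accs))))      # top 40% of stepsizes
    return {"All": accs.mean(),
            "Top-1": accs[0],
            "Top-40%": accs[:k].mean()}

m = stepsize_metrics([0.10, 0.45, 0.50, 0.48, 0.20])
# Top-1 is 0.50; Top-40% averages the two best stepsizes (0.50 and
# 0.48), giving 0.49; All averages all five values, giving 0.346.
```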



Figure 1: USA converts uncertainty into layer-wise adapted stepsizes. Here α = 0.01. (a) Standard deviation of the trained ensemble models' weights; each layer has a different uncertainty. (b) Stepsize adapted by USA; each layer has a different stepsize based on its uncertainty.

Figure 2: Applying UFGSM to a 4-ConvNet on miniImageNet with ε = 0.05. Starting from the clean image x, we add the signed gradient sign(∇x lθ(x, y)), rescaled by the uncertainty over the input gradient u, to generate the adversarial example (UFGSM). Note how UFGSM generates a more natural image than FGSM (rightmost, u = 1).
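Taking Figure 2's description at face value, the UFGSM step rescales the FGSM perturbation ε·sign(∇x l) element-wise by a weight u in [0, 1] derived from the ensemble's input-gradient uncertainty, with u = 1 everywhere recovering plain FGSM. The sketch below is an assumption-laden illustration: how u is computed and rescaled from the ensemble gradients is our guess, not the authors' implementation.

```python
import numpy as np

def ufgsm(x, input_grads, eps=0.05):
    """x: clean input. input_grads: (M, ...) input gradients from the M
    ensemble members. The signed mean gradient is rescaled per element by
    u in [0, 1], an (assumed) normalization of the ensemble's gradient
    std; u = 1 everywhere recovers plain FGSM."""
    g = input_grads.mean(axis=0)
    std = input_grads.std(axis=0)
    u = std / (std.max() + 1e-12)  # assumed rescaling to [0, 1]
    return x + eps * u * np.sign(g)

# Toy example: two ensemble members agree on the second coordinate only.
grads = np.array([[1.0, -1.0], [3.0, -1.0]])
adv = ufgsm(np.zeros(2), grads, eps=0.05)
```

Because most of the uncertainty mass sits near zero (Section A.8), this produces a much smaller total perturbation than FGSM, consistent with the more natural-looking image in Figure 2.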

Figure 3: Ablation study for 5-way 1-shot classification on Flowers dataset.

Fig. 4 in Appendix A.4. The ensemble size is M = 5, the same as deep ensembles (Lakshminarayanan et al., 2017). The FGSM parameter is ε = 0.05. The scale of the Gaussian random perturbation for ensemble-model training is σ = 0.05. The number of gradient steps is T = 10, the same as the miniImageNet evaluation of MAML. We do not split the datasets other than miniImageNet, because they are not used for meta-training; for miniImageNet we use the same splits as Ravi & Larochelle (2017). Tseng et al. (2020) used random splits for the cross-domain test, and Triantafillou et al. (2020) used the entire traffic signs dataset for testing.

A.4 CROSS-DOMAIN ACCURACY WHILE META-TRAINING MAML ON MINIIMAGENET
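Under the stated setup (M = 5, σ = 0.05), the ensemble is diversified by adding Gaussian random perturbations to the single pretrained checkpoint. The sketch below shows only that perturbation step, which is our reading of the setup; any training that follows the perturbation is omitted.

```python
import numpy as np

def build_ensemble(checkpoint, M=5, sigma=0.05, seed=0):
    """checkpoint: {name: weight array} from a single pretrained model.
    Returns M copies, each with i.i.d. Gaussian noise of scale sigma
    added to every weight. The per-layer std across these copies is the
    uncertainty consumed by USA and UFGSM."""
    rng = np.random.default_rng(seed)
    return [{name: w + sigma * rng.standard_normal(w.shape)
             for name, w in checkpoint.items()}
            for _ in range(M)]

checkpoint = {"conv1.weight": np.zeros((8, 3, 3, 3))}
ensemble = build_ensemble(checkpoint)
```

Since no meta-training data is needed for this step, it fits the paper's setting of repurposing a checkpoint without access to the original training set.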

Figure 4: Cross-domain performance while meta-training on miniImageNet. X-axis is training iteration on meta-training on miniImageNet and y-axis is classification accuracy.

Figure 5: 5-way 1-shot classification results with AT on SGD.

Figure 6: 5-way 1-shot classification results with AT on Adam.

Figure 9: 5-way 1-shot classification results to verify proposal 2. We compared to UFGSM, I/UFGSM and FGSM.

Figure 11: (a) shows the re-scaled inverse input gradient uncertainty used to generate the UFGSM example. (b) is a flattened plot of the re-scaled inverse input gradient uncertainty. (c) shows a histogram of the uncertainty.

Figure 12: 5-way 1-shot (Top) and 5-way 5-shot (Bottom) classification results with frozen BN layers. We tested freezing BN on SGD+All. SGD+All (w/ BN freezing) outperforms SGD+All (w/o BN freezing) in the higher stepsize range (>0.1).

Main results. We compare the default MAML steps (SGD) with our method (SGD+All) on same-domain and cross-domain benchmarks.

Comparing UFGSM against FGSM (Goodfellow et al., 2015) on 5-way 1-shot tasks. For all metrics and datasets, the proposed uncertainty-based method is better.

Comparing SGD vs. Adam for default MAML steps (baseline) and our method (baseline+all) on 5-way 1-shot classification.

This research was partially supported by the Canada CIFAR AI Chair Program, the NSERC Discovery Grant RGPIN-2017-06936 and a Google Focused Research award. Simon Lacoste-Julien is a CIFAR Associate Fellow in the Learning in Machines & Brains program.

5-way 1-shot classification results. To verify the effectiveness of AT, we applied AT alone as well as the full proposed method.

FGSM and I/UFGSM show very similar trends, and the reason is as follows. As can be seen in Fig. 10, almost all of the input gradient uncertainty is near zero; when inverted, almost all of the values are 1 (see Fig. 11), so the inverse-weighted perturbation nearly matches FGSM. UFGSM improves performance despite applying less adversarial perturbation than I/UFGSM and FGSM, which shows that even a small perturbation can help the model learn if it reflects correct (useful) information.

5-way 1-shot classification results for verifying proposals 1 and 2. I/USA and I/UFGSM degrade performance compared to USA and UFGSM, respectively. Our approach, which applies USA and UFGSM, outperforms SGD on almost all of the metrics.



5-way 5-shot classification results of the selected checkpoint with Adam.

ACKNOWLEDGMENTS

We thank Hugo Larochelle and Reza Babanezhad for insightful discussions, Damien Scieur and Emmanuel Bengio for helpful feedback on the manuscript. We also thank the anonymous reviewers for their comments and suggestions.

