ROBUST META-LEARNING WITH NOISE VIA EIGEN-REPTILE

Abstract

Recent years have seen a surge of interest in meta-learning techniques for tackling the few-shot learning (FSL) problem. However, the meta-learner's initial model is prone to meta-overfitting, as there are only a few available samples, which carry sampling noise. Moreover, when handling data sampled with label noise for FSL, the meta-learner can be extremely sensitive to label noise. To address these two challenges of FSL with sampling and label noise, we first cast the meta-overfitting problem (overfitting on sampling and label noise) as a gradient noise problem, since the few available samples cause the meta-learner to overfit on existing examples (clean or corrupted) of an individual task at every gradient step. We present Eigen-Reptile (ER), which updates the meta-parameters with the main direction of historical task-specific parameters to alleviate gradient noise. Specifically, the main direction is computed by a special mechanism suited to the large size of the parameters. Furthermore, to obtain a more accurate main direction for Eigen-Reptile in the presence of label noise, we propose Introspective Self-paced Learning (ISPL), which constructs a plurality of prior models to determine which samples should be abandoned. We have demonstrated the effectiveness of Eigen-Reptile and ISPL both theoretically and experimentally. Moreover, our experiments on different tasks show that the proposed methods outperform or achieve highly competitive performance compared with state-of-the-art methods with or without noisy labels.

1. INTRODUCTION

Meta-learning, also known as learning to learn, is the key to few-shot learning (FSL) (Vinyals et al., 2016; Wang et al., 2019a). One family of meta-learning methods is gradient-based, which usually optimizes the meta-parameters as an initialization that can quickly adapt to new tasks with few samples. However, fewer samples mean a higher risk of meta-overfitting, as the ubiquitous sampling noise in a mini-batch cannot be ignored. Moreover, existing gradient-based meta-learning methods are fragile with few samples. For instance, a popular recent method, Reptile (Nichol et al., 2018), updates the meta-parameters towards the inner-loop direction, which points from the current initialization to the last task-specific parameters. Nevertheless, as shown by the bold line of Reptile in Figure 1, with the gradient update at the last step, the update direction of the meta-parameters is significantly disturbed, as sampling noise leads the meta-parameters to overfit on the few trained samples at each gradient step. Many prior works have proposed solutions to the meta-overfitting problem, such as using dropout (Bertinetto et al., 2018; Lee et al., 2020) and modifying the loss function (Jamal & Qi, 2019), all of which stay at the model level. This paper casts the meta-overfitting problem as a gradient noise problem arising from sampling noise during the gradient update (Wu et al., 2019). Neelakantan et al. (2015), among others, have shown that adding gradient noise can improve the generalization of neural networks trained on large sample sizes; from the viewpoint of the model complexity penalty, generalization improves as the number of samples grows, and to a certain extent adding gradient noise is equivalent to increasing the sample size. In FSL, however, there are only a few samples per task. In that case, the model not only memorizes the content that needs to be identified but also overfits on the noise (Zhang et al., 2016).
High-quality manually labeled data is often time-consuming and expensive to obtain. Low-cost approaches to collecting low-quality annotated data, such as from search engines, introduce label noise. Moreover, training a meta-learner requires a large number of tasks, so it is not easy to guarantee data quality. Conceptually, the initialization learned by existing meta-learning algorithms can severely degrade in the presence of noisy labels.

Figure 1: Inner-loop steps of Reptile and Eigen-Reptile. Reptile updates meta-parameters towards the last task-specific parameters, which is biased. Eigen-Reptile considers all samples more fairly. Note that the main direction is the eigenvector corresponding to the largest eigenvalue.

Intuitively, as shown in the FSL-with-noisy-labels part of Figure 1, noisy labels cause a large random disturbance in the update direction. This means that label noise (Frénay & Verleysen, 2013) leads the meta-learner to overfit on wrong samples, which can be seen as further aggravating the influence of gradient noise. Furthermore, conventional algorithms for learning with noisy labels require much data for each class (Hendrycks et al., 2018; Patrini et al., 2017). These algorithms therefore cannot be applied to the noisy FSL problem, where few samples are available per class, so it is crucial to propose a method that addresses noisy FSL. In this paper, we propose Eigen-Reptile (ER). In particular, as shown in Figure 1, Eigen-Reptile updates the meta-parameters with the main direction of the task-specific parameters, which effectively alleviates gradient noise. Due to the large scale of neural network parameters, it is unrealistic to compute the eigenvectors of the historical parameters directly. We introduce a fast computation of the main direction into FSL, which computes the eigenvectors of a matrix whose size scales with the number of inner-loop steps instead of with the number of parameters.
Furthermore, we propose Introspective Self-paced Learning (ISPL), which constructs multiple prior models by random sampling; the prior models then discard high-loss samples from the dataset. We combine Eigen-Reptile with ISPL to address the noisy FSL problem, as ISPL improves the main direction computed in the presence of noisy labels. Experimental results show that Eigen-Reptile significantly outperforms the baseline model by 5.35% and 3.66% on corrupted Mini-ImageNet 5-way 1-shot and clean Mini-ImageNet 5-way 5-shot, respectively. Moreover, the proposed algorithms outperform or are highly competitive with state-of-the-art methods on few-shot classification tasks. The main contributions of this paper can be summarized as follows:

• We cast the meta-overfitting issue (overfitting on sampling and label noise) as a gradient noise issue under the meta-learning framework.

• We propose Eigen-Reptile, which alleviates gradient noise effectively. Besides, we propose ISPL, which improves the performance of Eigen-Reptile in the presence of noisy labels.

• The proposed methods outperform or achieve highly competitive performance compared with state-of-the-art methods on few-shot classification tasks.

2. RELATED WORK

There are three main types of meta-learning approaches: model-based meta-learning approaches (Ravi & Larochelle, 2016; Hochreiter et al., 2001; Andrychowicz et al., 2016; Liu et al., 2018; Santoro et al., 2016), metric-based meta-learning approaches (Vinyals et al., 2016; Koch et al., 2015; Mordatch, 2018; Sung et al., 2018; Snell et al., 2017; Oreshkin et al., 2018; Shyam et al., 2017), and gradient-based meta-learning approaches (Finn et al., 2017; Nichol et al., 2018; Jamal & Qi, 2019; Zintgraf et al., 2018; Li et al., 2017; Rajeswaran et al., 2019; Finn et al., 2018). In this paper, we focus on gradient-based meta-learning approaches, which can be viewed as a bi-level loop: the outer loop updates the meta-parameters over a variety of tasks, while the task-specific parameters are learned from only a small amount of data in the inner loop. In addition, some algorithms achieve state-of-the-art results by additionally training a 64-way classification task on the meta-training set (Yang et al., 2020; Hu et al., 2020); for fairness, we do not compare with these algorithms.

Meta-learning with overfitting. Due to too few samples, the meta-learner inevitably tends to overfit in FSL. Zintgraf et al. (2018) introduce additional context parameters into the model, which can prevent meta-overfitting. Furthermore, Bertinetto et al. (2018) find in their prior work that regularization such as dropout can alleviate meta-overfitting; Jamal & Qi (2019) propose a novel paradigm, Task-Agnostic Meta-Learning (TAML), which uses entropy or other measures to minimize the inequality of initial losses across classification tasks to improve the generalizability of the meta-learner. All of these methods stay at the model level; we instead tackle the meta-overfitting problem from the gradient perspective. We propose Eigen-Reptile, which updates the meta-parameters along the main direction of the task-specific parameters to keep the meta-learner from overfitting on noise.
Learning with noisy labels. Learning with noisy labels is a long-standing problem with many proposed solutions, such as denoising loss functions (Hendrycks et al., 2018; Patrini et al., 2017; Jindal et al., 2016; Wang et al., 2019b; Arazo et al., 2019), relabeling (Lin et al., 2014), and so on. Nevertheless, most of these methods require much data for each class. Gao et al. (2019) propose a model for noisy few-shot relation classification, but without good transferability. For noisy FSL, a gradient-based meta-learner is trained to optimize an initialization across various tasks with noisy labels. As there are few samples of each class, traditional algorithms for noisy labels cannot be applied. When existing gradient-based meta-learning algorithms, such as Reptile, update the meta-parameters, they focus on the samples that generate the last gradient step; these samples may be corrupted, which makes the parameters learned by the meta-learner susceptible to noisy labels. For noisy FSL, we propose ISPL, based on the idea of Self-paced Learning (SPL) (Kumar et al., 2010; Khan et al., 2011; Basu & Christensen, 2013; Tang et al., 2012), to learn a more accurate main direction for Eigen-Reptile. ISPL constructs prior models to decide which samples should be discarded when training task-specific models; this process can be regarded as the introspection of the meta-learner. In contrast, a model with SPL learns the samples gradually from easy to complex, and the model itself decides the order.

3. PRELIMINARIES

Gradient-based meta-learning aims to learn a set of initialization parameters φ that can be adapted to new tasks after a few iterations. The dataset D is usually divided into a meta-training set D_meta-train and a meta-testing set D_meta-test, used to optimize the meta-parameters and to evaluate their generalization, respectively. For meta-training, we have tasks {T_i}_{i=1}^B drawn from a task distribution p(T); each task has its own train set D_train and test set D_test, and the tasks in D_meta-test are defined in the same way. Note that there are only a small number of samples for each task in FSL. Specifically, an N-way K-shot classification task provides K examples for each of N classes. Generally, the number of shots used in meta-training should match the one used at meta-test time to obtain the best performance (Cao et al., 2019). In this paper, we increase the sample size appropriately to obtain the main direction of individual tasks during meta-training. To minimize the test loss L(D_test, φ̃) of an individual task, the meta-parameters need to be updated n times to obtain good task-specific parameters φ̃, that is, to minimize

L(D_test, φ̃) = -(1/N) E[ (1/K) Σ_{(x,y)∈D_test} log q(ŷ = y | x, φ̃) ],   where φ̃ = U^n(D_train, φ),   (1)

where U^n represents n inner-loop steps of gradient descent or Adam (Kingma & Ba, 2014) on batches from D_train, and q(ŷ = y | x, φ̃) is the predictive distribution. When updating the meta-parameters in the outer loop, different algorithms use different rules. In the case of Reptile, after n inner-loop steps, the meta-parameters are updated as φ ← φ + β(φ̃ - φ), where β is a scalar stepsize hyperparameter that controls the update rate of the meta-parameters.
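The Reptile baseline update above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the "model" is a linear regressor and each task is a least-squares problem, both hypothetical stand-ins for the convnet and few-shot tasks.

```python
import numpy as np

def inner_loop(phi, task_data, lr=0.1, steps=5):
    """Run n SGD steps on one task, starting from the meta-parameters phi.

    Toy stand-in: the model is a linear regressor w, and the task supplies
    a design matrix X and targets y (hypothetical, for illustration only).
    """
    w = phi.copy()
    X, y = task_data
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def reptile_update(phi, task_data, beta=0.5, **kw):
    """Reptile outer step: phi <- phi + beta * (phi_tilde - phi)."""
    phi_tilde = inner_loop(phi, task_data, **kw)
    return phi + beta * (phi_tilde - phi)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
phi = np.zeros(3)
for _ in range(50):
    phi = reptile_update(phi, (X, y))
# With a single repeated task, phi converges toward that task's solution.
```

In the actual algorithm the task changes at every outer iteration, so φ settles at an initialization that adapts quickly to all of them rather than solving any single one.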

4. EIGEN-REPTILE FOR CLEAN AND CORRUPTED DATA

The proposed Eigen-Reptile alleviates gradient noise, and thereby the meta-learner's overfitting on sampling and label noise. Furthermore, ISPL improves the performance of Eigen-Reptile in noisy FSL.

4.1. THE EIGEN-REPTILE ALGORITHM

To alleviate gradient noise and thus improve the generalizability of the meta-learner, we propose Eigen-Reptile, which updates the meta-parameters with the main direction of the task-specific parameters. We train the task-specific model for n inner-loop steps, starting from the meta-parameters φ, with few examples. Let the i-th column W_{:,i} ∈ R^{d×1} of the parameter matrix W ∈ R^{d×n} be the parameters after the i-th gradient update, e.g., W_{:,i} = U^i(φ), and treat W_{:,i} as a d-dimensional parameter point w_i in parameter space. Let e ∈ R^{d×1} be a unit vector representing the main direction of the n parameter points in W. Intuitively, projecting all parameter points onto e should retain the most information. We represent the parameter points by a straight line of the form w = w̄ + l e, which passes through the mean point w̄; the signed distance of a point w from w̄ is l. This gives the loss function

J(l_1, l_2, ..., l_n, e) = Σ_{i=1}^n ||w̄ + l_i e - w_i||^2.

Determining the signed distance of each point by partially differentiating J with respect to l_i, we get l_i = e^T(w_i - w̄). Plugging this expression for l_i into J, we get

J(e) = -Σ_{i=1}^n e^T(w_i - w̄)(w_i - w̄)^T e + Σ_{i=1}^n ||w_i - w̄||^2 = -e^T S e + Σ_{i=1}^n ||w_i - w̄||^2,   (2)

where S = Σ_{i=1}^n (w_i - w̄)(w_i - w̄)^T is a scatter matrix. According to Eq. (2), minimizing J is equivalent to maximizing e^T S e. Note that e needs to be roughly consistent with the gradient update direction V in the process of learning the task-specific parameters. Using the Lagrange multiplier method,

max e^T S e   s.t. V^T e > 0, e^T e = 1,   where V = (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} (w_{n-i+1} - w_i).   (3)

We get the objective function

g(µ, e, λ, η) = e^T S e - λ(e^T e - 1) + µ(-V^T e + η^2),   where λ ≠ 0, µ ≥ 0,   (4)

and partially differentiating g in Eq. (4) with respect to µ, e, λ, η yields

-V^T e + η^2 = 0,   2Se - 2λe - µV = 0,   e^T e - 1 = 0,   2µη = 0.   (5)

According to Eq. (5), if η = 0 then V and e are orthogonal, which obviously does not meet our expectations. So we get η ≠ 0, and hence µ = 0.
Then Se = λe, so e is an eigenvector of S, and we take the one corresponding to the largest eigenvalue λ, as we need the main direction. It should be noted that even though Eq. (3) does not directly determine e, it must be retained in ER because it fixes the sign of the outer-loop update direction; otherwise, the algorithm will not converge. A concern about Se = λe is that the scatter matrix S ∈ R^{d×d} grows quadratically with the number of parameters d. Given the high dimensionality of parameters typically used in neural networks, computing the eigenvalues and eigenvectors of S could come at a prohibitive cost (the worst-case complexity is O(d^3)). Centralize W by subtracting the mean w̄, so that the scatter matrix is S = WW^T. To avoid computing the eigenvectors of S directly, we consider W^T W instead. Let W^T W e' = λ' e'. Multiplying both sides of the equation by W,

(WW^T)(W e') = λ'(W e').   (6)

It can be seen from Eq. (6) that W^T W ∈ R^{n×n} and WW^T ∈ R^{d×d} have the same eigenvalues, λ = λ', and the eigenvector of WW^T is recovered as e = W e' (normalized to unit length). The main advantage of Eq. (6) is that the intermediate matrix W^T W grows quadratically with the number of inner-loop steps instead. As we are interested in FSL, n is very small, so computing the eigenvector e' of W^T W is much easier, O(n^3), which is negligible (for a detailed analysis, refer to Appendix B). We then obtain the eigenvector e of WW^T from e'. Moreover, we project the parameter update vectors w_{i+1} - w_i, i = 1, 2, ..., n-1, onto e to get the corresponding update stepsize ν, so the meta-parameters φ can be updated as

φ ← φ + βνe,   (7)

where β is a scalar stepsize hyperparameter that controls the update rate of the meta-parameters. The Eigen-Reptile algorithm is summarized in Algorithm 1 of Appendix A. To illustrate the validity of Eigen-Reptile theoretically, we present Theorem 1 as follows:

Theorem 1. Assume that the gradient noise variable x follows a Gaussian distribution (Hu et al., 2017; Jastrzębski et al., 2017; Mandt et al., 2016), x ∼ N(0, σ^2).
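The main-direction computation of Eqs. (3), (6) and (7) can be sketched as follows. This is a minimal numpy sketch, not the paper's code: the parameter trajectory W is synthetic (a drift along one hypothetical direction plus Gaussian noise), and the eigendecomposition is done on the small n × n matrix W^T W rather than the d × d scatter matrix.

```python
import numpy as np

def main_direction(W):
    """Main direction of n parameter points (columns of W, shape d x n).

    Instead of eigendecomposing the d x d scatter matrix W W^T, decompose
    the small n x n matrix W^T W and lift its top eigenvector back with
    e = W e', as in Eq. (6).
    """
    d, n = W.shape
    Wc = W - W.mean(axis=1, keepdims=True)      # mean-center the points
    small = Wc.T @ Wc                           # n x n instead of d x d
    vals, vecs = np.linalg.eigh(small)          # eigenvalues in ascending order
    e = Wc @ vecs[:, -1]                        # lift the top eigenvector to R^d
    e /= np.linalg.norm(e)
    # Align the sign of e with the overall inner-loop direction V (Eq. 3)
    half = n // 2
    V = sum(W[:, n - i - 1] - W[:, i] for i in range(half)) / half
    if e @ V < 0:
        e = -e
    # Step size: total projection of the update vectors onto e (Eq. 7)
    nu = sum((W[:, j + 1] - W[:, j]) @ e for j in range(n - 1))
    return e, nu

# Synthetic trajectory: unit drift along one coordinate, plus noise
rng = np.random.default_rng(1)
d, n = 1000, 7
true_dir = np.zeros(d); true_dir[0] = 1.0
W = np.stack([i * true_dir + 0.05 * rng.normal(size=d) for i in range(n)], axis=1)
e, nu = main_direction(W)
# e recovers the drift direction; nu is roughly the total drift length.
```

The eigh call here costs O(n^3) on a 7 × 7 matrix, which is exactly the saving that Eq. (6) provides over decomposing the 1000 × 1000 scatter matrix.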
Furthermore, x and the neural network parameter variables are assumed to be uncorrelated. The observed covariance matrix C equals the noiseless covariance matrix C_t plus the gradient noise covariance matrix C_x. Then, we get

C = C_t + C_x = P_t(Λ_t + Λ_x)P_t^T = P_t(Λ_t + σ^2 I)P_t^T = P_t Λ P_t^T = P Λ P^T,   (8)

where P_t and P are the orthonormal eigenvector matrices of C_t and C respectively, Λ_t and Λ are the corresponding diagonal eigenvalue matrices, and I is an identity matrix. It can be seen from Eq. (8) that C and C_t have the same eigenvectors. We defer the proof to Appendix C. Theorem 1 shows that the eigenvectors are not affected by gradient noise, so Eigen-Reptile can find a more generalizable starting point for new tasks without overfitting on noise.

Self-paced learning (SPL) learns samples from low losses to high losses, which has proven beneficial for generalization (Khan et al., 2011; Basu & Christensen, 2013; Tang et al., 2012); the losses of the samples are determined by the model itself. Nevertheless, in the meta-learning setting, the meta-learner is trained on various tasks, and the initial model may have lower losses for trained classes and higher losses for unseen classes or noisy samples. For this reason, we cannot train the task-specific model in the same way as SPL to solve the noisy FSL problem. In this paper, we use multiple prior models to vote on samples and decide which should be abandoned. As shown in Figure 2, even though the two categories of yellow and green show an excellent distribution that can be well separated, some samples are marked wrong. To address this noisy label problem, we build three prior models. Specifically, we randomly sample three times, and model 1 is trained with a corrupted label. As the prior models learn different samples, building multiple models to vote on the data yields more accurate losses.
Such a learning process is similar to human introspection, and we call it Introspective Self-paced Learning (ISPL). Samples with losses above a certain threshold are discarded. Furthermore, following SPL, we add a hidden variable v ∈ {0, 1}, decided by the Q prior models, in front of the loss of each sample to control whether the sample should be abandoned. So we get the task-specific loss as

4.2. THE INTROSPECTIVE SELF-PACED LEARNING

L_ISPL(φ, v) = Σ_{i=1}^h v_i L(x_i, y_i, φ),   where v_i = argmin_{v_i} (v_i/Q) Σ_{j=1}^Q L_j(x_i, y_i, φ_j) - γ v_i,   (9)

where h is the number of samples from the dataset D_train, γ is the sample selection parameter, which gradually decreases, and the parameters of prior model j are φ_j = U^n(D_j, φ), D_j ⊂ D_train. Note that we update the meta-parameters with the model trained on the h samples from D_train. ISPL is summarized in Algorithm 2 of Appendix A. Intuitively, it is difficult to say whether discarding high-loss samples, which contain both correct and wrong samples, will improve the accuracy of the eigenvector, so we use Theorem 2 to prove the effectiveness of ISPL.

Theorem 2. Let W^o be the parameter matrix generated by the corrupted samples. Computing the eigenvalues and eigenvectors of the expected observed parameter matrix gives

(1/λ) E(C^tr) e = P_o(I - Λ_o/λ)P_o^T e ≈ P_o(I - (λ_o/λ) I)P_o^T e > P_o(I - ((λ_o - ξ)/(λ - ξ)) I)P_o^T e,

where C^tr is the covariance matrix generated by the true samples, λ is the observed largest eigenvalue, and λ_o is the largest eigenvalue in the corrupted diagonal eigenvalue matrix Λ_o. According to this inequality, the smaller λ_o/λ is, the more accurate the observed eigenvector e is. Assume that the discarded high-loss samples have the same contribution ξ to λ and λ_o, which represent the observed and corrupted main directional variances, respectively. Note that these two kinds of data have the same effect on the gradient updates of the model, so this assumption is relatively reasonable. Furthermore, it is easy to see that (λ_o - ξ)/(λ - ξ) is smaller than λ_o/λ. Theorem 2 shows that discarding high-loss samples helps improve the accuracy of the observed eigenvector of the parameter matrix learned with corrupted labels. So ISPL can improve the performance of Eigen-Reptile, as it discards high-loss samples in D_train.
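The closed-form solution of the inner minimization in Eq. (9) is simple: since the objective for each v_i is v_i(L̄_i - γ) with L̄_i the average loss over the Q prior models, the minimizer keeps a sample exactly when its average voted loss falls below γ. A minimal sketch, assuming hypothetical loss values for Q = 3 prior models:

```python
import numpy as np

def ispl_select(losses_per_model, gamma):
    """ISPL hidden variables v (Eq. 9): keep sample i iff the average loss
    over the Q prior models is below the selection threshold gamma.

    losses_per_model: array of shape (Q, h), loss of each of the Q prior
    models on each of the h candidate samples (hypothetical values here).
    """
    avg_loss = losses_per_model.mean(axis=0)  # (1/Q) sum_j L_j(x_i, y_i, phi_j)
    # Minimizing v_i * avg_loss_i - gamma * v_i over v_i in {0, 1}
    # gives v_i = 1 exactly when avg_loss_i < gamma.
    return (avg_loss < gamma).astype(int)

# Q = 3 prior models voting on h = 5 samples; the last two samples have
# consistently high loss (e.g. likely corrupted labels) and are dropped.
losses = np.array([[0.2, 0.5, 0.9, 3.0, 4.1],
                   [0.3, 0.4, 1.1, 2.8, 3.9],
                   [0.1, 0.6, 1.0, 3.2, 4.3]])
v = ispl_select(losses, gamma=2.0)  # -> keeps samples 0, 1, 2
```

As γ decreases over iterations (Algorithm 2), the threshold tightens, so fewer and fewer high-loss samples survive the vote.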

5. EXPERIMENTAL RESULTS AND DISCUSSION

In our experiments, we aim to (1) evaluate the effectiveness of Eigen-Reptile in alleviating gradient noise (sampling and label noise), (2) determine whether Eigen-Reptile can alleviate gradient noise in a realistic problem, (3) evaluate the improvement ISPL brings to Eigen-Reptile in the presence of noisy labels, and (4) validate the theoretical analysis through numerical simulations. The code and data for the proposed model are provided for research purposes.

5.1. META-LEARNING WITH NOISE ON REGRESSION

In this experiment, we evaluate Eigen-Reptile on the 1D sine wave K-shot regression problem (Nichol et al., 2018). Each task is defined by a sine curve y(x) = A sin(x + b), where the amplitude A ∼ U([0.1, 5.0]) and the phase b ∼ U([0, 2π]) vary between tasks. The goal of each task is to fit a sine curve with data points sampled from the corresponding y(x). The loss is

∫_{-5.0}^{5.0} ||ŷ(x) - y(x)||^2 dx,   (11)

which we approximate using 50 equally spaced points from the whole interval [-5.0, 5.0] for each task, where ŷ(x) is the predicted function that starts from the initialization learned by the meta-learner. The K-shot regression task fits a selected sine curve through K points; here K = 10. For the regressor, we use a small neural network, the same as Nichol et al. (2018) except that the activation functions are Tanh. Specifically, the network has an input layer of size 1, followed by two hidden layers of size 64, and an output layer of size 1. In this part, we mainly compare Reptile and Eigen-Reptile.

Table 1: Few-shot classification on Mini-ImageNet, N-way K-shot accuracy. The ± shows the 95% confidence interval over tasks.

Algorithm                                | 5-way 1-shot   | 5-way 5-shot
MAML (Finn et al., 2017)                 | 48.70 ± 1.84%  | 63.11 ± 0.92%
Relation Network (Sung et al., 2018)     | 50.44 ± 0.82%  | 65.32 ± 0.70%
CAML (512) (Zintgraf et al., 2018)       | 51.82 ± 0.65%  | 65.85 ± 0.55%
TAML (VL + Meta-SGD) (Jamal & Qi, 2019)  | 51.77 ± 1.86%  | 65.60 ± 0.93%
Meta-dropout (Lee et al., 2019)          | 51.93 ± 0.67%  | 67.42 ± 0.52%
Warp-MAML (Flennerhag et al., 2019)      | 52.30 ± 0.8%   | 68.4 ± 0.6%
MC (128) (Park & Oliva, 2019)            | 54.08 ± 0.93%  | 67.99 ± 0.73%
ARML (Yao et al., 2020)                  | 50.42 ± 1.73%  | -
ModGrad (64) (Simon et al., 2020)        | 53.20 ± 0.86%  | 69.17 ± 0.69%
Reptile (32) (Nichol et al., 2018)       | 49.97 ± 0.32%  | 65.99 ± 0.58%
Eigen-Reptile (32)                       | 51.80 ± 0.9%   | 68.10 ± 0.50%
Eigen-Reptile (64)                       | 53.25 ± 0.45%  | 69.85 ± 0.85%
Both meta-learners use the same regressor and are trained for 30,000 iterations with 5 inner-loop steps, batch size 10, and a fixed inner-loop learning rate α = 0.02. We report the results of Reptile and Eigen-Reptile in Figure 3. The curve fitted by Eigen-Reptile is closer to the true green curve, which shows that Eigen-Reptile performs better. According to Jamal & Qi (2019), an initial model with larger entropy before adapting to new tasks better alleviates meta-overfitting. As shown in Figure 3, from 1 to 30,000 iterations, Eigen-Reptile is more generalizable than Reptile, as the initial blue line of Eigen-Reptile is closer to a straight line, which indicates that the initialization learned by Eigen-Reptile is less affected by gradient noise. Furthermore, Figure 4 shows that Eigen-Reptile converges faster and reaches a lower loss than Reptile.
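The sine-wave benchmark above can be reproduced from its description alone. A minimal sketch of the task generator and the discretized loss of Eq. (11); the function and variable names are our own, not from the paper's code:

```python
import numpy as np

def sample_sine_task(rng):
    """Sample one sine regression task y(x) = A sin(x + b),
    with A ~ U[0.1, 5.0] and b ~ U[0, 2*pi], as in Section 5.1."""
    A = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 2 * np.pi)
    return lambda x: A * np.sin(x + b)

def k_shot_batch(task, rng, k=10):
    """K support points drawn uniformly from the interval [-5, 5]."""
    x = rng.uniform(-5.0, 5.0, size=k)
    return x, task(x)

def eval_loss(task, predict):
    """Approximate the integral loss of Eq. (11) with 50 equally spaced
    points on [-5, 5]; `predict` is any fitted regressor."""
    xs = np.linspace(-5.0, 5.0, 50)
    return np.mean((predict(xs) - task(xs)) ** 2)

rng = np.random.default_rng(3)
task = sample_sine_task(rng)
x, y = k_shot_batch(task, rng)            # K = 10 support points
loss_of_truth = eval_loss(task, task)     # the true curve scores zero
```

A meta-learner would fit its regressor on the (x, y) support batch and then be scored by eval_loss on the full interval, so good performance requires extrapolating the curve beyond the ten observed points.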

5.2. META-LEARNING IN REALISTIC PROBLEM

We evaluate our method on a popular few-shot classification dataset, Mini-ImageNet (Vinyals et al., 2016), which contains 100 classes with 600 images each. We follow Ravi & Larochelle (2016) and divide the dataset into three disjoint subsets: a meta-training set, a meta-validation set, and a meta-testing set with 64, 16, and 20 classes, respectively. We follow the few-shot learning protocols of prior work (Vinyals et al., 2016), except that the number of meta-training shots is 15, which is still much smaller than the number of samples required by traditional tasks. Moreover, we run our algorithm on the dataset for different numbers of shots and compare our results to the state of the art. Note that approaches using deeper, residual networks can achieve higher accuracies (Gidaris & Komodakis, 2018), so for a fair comparison we only compare against algorithms that use convolutional networks, as Reptile does. Specifically, our model follows Nichol et al. (2018) and has 4 modules with 3 × 3 convolutions and 64 filters, 2 × 2 max-pooling, etc. The images are downsampled to 84 × 84, and the loss function is the cross-entropy error. We use the Adam optimizer with β_1 = 0 in the inner loop. Our model is trained for 100,000 iterations with a fixed inner-loop learning rate of 0.0005 and 7 inner-loop steps. Analysis of some hyperparameters is deferred to Appendix E. The results of Eigen-Reptile and other meta-learning approaches are summarized in Table 1. The proposed Eigen-Reptile (64 filters) outperforms or achieves highly competitive performance compared with the other algorithms for the 5-shot and 1-shot classification problems, respectively. More specifically, for 1-shot, the result of MC (128 filters), which has a higher-capacity network, is better than that of Eigen-Reptile. However, as a second-order optimization algorithm, MC's computational cost is much higher than Eigen-Reptile's.
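The episode-sampling protocol described above (class-disjoint 64/16/20 splits, N classes per episode, K support and several query examples per class) can be sketched as follows. The class and image identifiers are hypothetical placeholders, not the actual Mini-ImageNet files:

```python
import random

def sample_episode(class_pool, samples_per_class, n_way=5, k_shot=1, q_query=15):
    """Sample one N-way K-shot episode: pick N classes from the split's
    class pool, then K support and q_query query examples per class.

    `samples_per_class` maps class name -> list of example ids
    (hypothetical ids here, standing in for image files).
    """
    classes = random.sample(class_pool, n_way)
    support, query = [], []
    for label, cls in enumerate(classes):       # relabel classes 0..N-1
        examples = random.sample(samples_per_class[cls], k_shot + q_query)
        support += [(ex, label) for ex in examples[:k_shot]]
        query += [(ex, label) for ex in examples[k_shot:]]
    return support, query

# Mini-ImageNet-style split: 100 classes, 600 images each,
# partitioned into 64 / 16 / 20 disjoint classes.
all_classes = [f"class_{i}" for i in range(100)]
meta_train, meta_val, meta_test = all_classes[:64], all_classes[64:80], all_classes[80:]
images = {c: [f"{c}/img_{j}" for j in range(600)] for c in all_classes}

support, query = sample_episode(meta_train, images, n_way=5, k_shot=1)
```

Because the three splits share no classes, every meta-test episode is built entirely from classes the meta-learner has never seen, which is what makes the evaluation a genuine test of fast adaptation.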
Obviously, the result of Eigen-Reptile is much better than Reptile's on each task. Compared with Reptile, Eigen-Reptile uses the main direction to update the meta-parameters, alleviating the meta-overfitting caused by gradient noise. More importantly, Eigen-Reptile outperforms Meta-dropout (Lee et al., 2019), a state-of-the-art regularization-based method for preventing meta-overfitting. This result shows the effectiveness of addressing the meta-overfitting problem from the perspective of alleviating gradient noise.

5.3. META-LEARNING WITH LABEL NOISE

We conduct the 5-way 1-shot experiment with noisy labels generated by corrupting the original labels of Mini-ImageNet. More specifically, in this section we focus only on symmetric label noise, where correct labels are flipped to other labels with equal probability: under symmetric noise of ratio p, a sample retains the correct label with probability 1 - p and becomes each other label with probability p/(N - 1). An example of symmetric noise is shown in Figure 6. Furthermore, the asymmetric label noise experiment is conducted in Appendix F. Note that we only introduce noise into the train set during meta-training, where the number of meta-training shots is 30. All meta-learners with 32 filters are trained for 10,000 iterations, with a learning rate of 0.001 in the inner loop. The sample selection parameter γ = 10 decreases by 0.6 every 1,000 iterations. The other settings of this experiment are the same as in Section 5.2. As shown in Table 3, as the ratio p increases, the performance of Reptile decreases rapidly. When p = 0.5, the initialization learned by Reptile can hardly meet the requirement of quickly adapting to new tasks with few samples. On the other hand, Eigen-Reptile is less affected by noisy labels than Reptile, especially when the noise ratio is high, i.e., p = 0.5. The experimental results also verify the effectiveness of ISPL, as Eigen-Reptile+ISPL achieves better results than Eigen-Reptile when p > 0, and ISPL plays a more significant role when p is higher. However, when p = 0, ISPL harms Eigen-Reptile, as ISPL then only discards correct samples. In addition, ISPL does not significantly improve Reptile, especially when p = 0.5. This is because too many high-loss samples are removed, causing Reptile to fail to converge quickly within the same number of iterations.
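The symmetric corruption process described above is easy to state in code. A minimal sketch, assuming integer class labels; the function name is our own:

```python
import random

def corrupt_symmetric(labels, p, n_classes, rng):
    """Symmetric label noise: each label keeps its true value with
    probability 1 - p and becomes each of the other n_classes - 1 labels
    with probability p / (n_classes - 1)."""
    noisy = []
    for y in labels:
        if rng.random() < p:
            others = [c for c in range(n_classes) if c != y]
            noisy.append(rng.choice(others))  # flip to a *different* label
        else:
            noisy.append(y)
    return noisy

rng = random.Random(0)
labels = [i % 5 for i in range(10000)]            # 5-way labels
noisy = corrupt_symmetric(labels, p=0.3, n_classes=5, rng=rng)
flip_rate = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
# flip_rate is close to p, since flips always produce a different label
```

Note that flipping only to *other* labels is what makes the retention probability exactly 1 - p; sampling uniformly over all N labels would instead retain the true label with probability 1 - p + p/N.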
These experimental results show that Eigen-Reptile and ISPL can effectively alleviate the gradient noise problem caused by noisy labels, thereby alleviating the meta-overfitting on corrupted samples.

6. CONCLUSION

In this paper, we cast the meta-overfitting problem (overfitting on sampling and label noise) as a gradient noise problem. We then propose a gradient-based meta-learning algorithm, Eigen-Reptile, which updates the meta-parameters along the main direction; theory and experiments show that this effectively alleviates gradient noise. Furthermore, to get closer to real-world situations, we introduce noisy labels into the meta-training dataset, and the proposed ISPL constructs prior models that select samples for Eigen-Reptile to obtain a more accurate main direction.

A PSEUDO-CODE

Algorithm 1 Eigen-Reptile
Require: Distribution over tasks P(T), outer step size β
1: Initialize meta-parameters φ
2: while not converged do
3:   W = [ ], ν = 0
4:   Sample batch of tasks {T_i}_{i=1}^B ∼ P(T)
5:   for each task T_i do
6:     φ_i = φ
7:     Sample train set D_train of T_i
8:     for j = 1, 2, 3, ..., n do
9:       φ_i^j = U^j(D_train, φ_i)
10:      Append w_j = flatten(φ_i^j) to W, w_j ∈ R^{d×1}
11:    end for
12:    Mean centering: W = W - w̄, w̄ ∈ R^{d×1}
13:    Compute eigenvalue matrix Λ' and eigenvector matrix P' of W^T W,
14:    with eigenvalues λ_1 > λ_2 > ... > λ_n in Λ'
15:    Compute the eigenvector matrix of WW^T: P = WP'
16:    Let the eigenvector corresponding to λ_1 be a unit vector e_i^1, ||e_i^1||_2^2 = 1
17:    for j = 1, 2, 3, ..., n-1 do
18:      ν = ν + (W_{:,j+1} - W_{:,j})^T e_i^1
19:    end for
20:    e_i^1 = (λ_1 / Σ_{m=1}^n λ_m) × e_i^1
21:    Compute the approximate direction of the task-specific gradient update V:
22:    V = (1/⌊n/2⌋) Σ_{i=1}^{⌊n/2⌋} (W_{:,n-i+1} - W_{:,i})
23:    if e_i^1 · V < 0 then
24:      e_i^1 = -e_i^1
25:    end if
26:  end for
27:  Combine the task directions over the batch into ẽ
28:  Update meta-parameters φ ← φ + β × (ν/B) × ẽ
29: end while

Algorithm 2 Introspective Self-paced Learning
Require: Dataset D_train, initialization φ, batch size b, selection parameter γ, attenuation coefficient µ, the number of prior models Q
1: Initialize network parameters φ* = φ for a sampled task
2: for j = 1, 2, 3, ..., Q do
3:   Sample examples D_j from D_train and train prior model j: φ_j = U^m(D_j, φ*)
4: end for
5: Train task-specific parameters:
6: for i = 1, 2, 3, ..., n do
7:   Compute the hidden variable vector v:
8:   v = argmin_v Σ_{q=1}^b v_q L̄_q - γ Σ_{q=1}^b v_q, where L̄_q = (1/Q) Σ_{j=1}^Q L_j(x_q, y_q, φ_j)
9:   Update the task-specific parameters φ*:
10:  φ* = argmin_{φ*} L_ISPL(φ*, v)
11:  γ = γ - µ
12: end for

B ALGORITHM COMPLEXITY ANALYSIS

For Eigen-Reptile, the cost of a single gradient descent step in the inner loop is O(d), where d is the number of network parameters. The cost of the covariance matrix computation is O(n^2 d), where n is the number of inner-loop steps. Moreover, the worst-case complexity of the eigenvalue decomposition is O(n^3), and the computational complexity of restoring the eigenvector is O(nd). Let T be the maximal number of outer-loop iterations. Hence the overall time complexity is O(T(n^2 d + n^3 + nd)). As n is usually less than 10 in FSL (n = 7 in this paper), the overall time complexity is O(Td). For Reptile, the computational complexity is also O(Td), which means that the time complexity of both Reptile and Eigen-Reptile is much lower than that of second-order optimization algorithms. As for space complexity, Eigen-Reptile needs to store a d × n matrix and an n × n matrix, so its overall space complexity is O(d), the same as Reptile's. Thus, compared to Reptile, Eigen-Reptile has the same space and time complexity, yet its accuracy is much higher.

C THEOREM 1

Gradient update always with gradient noise inserted at every iteration, which caused Reptile, MAML, etc. cannot find accurate directions to update meta-parameters. In this section, we will prove that Eigen-Reptile can alleviate gradient noise. Theorem 3 Assume that the gradient noise variable x follows Gaussian distribution (Hu et al., 2017; Jastrzębski et al., 2017; Mandt et al., 2016) , x ∼ N 0, σ 2 . Furthermore, x and neural network parameter variable are assumed to be uncorrelated. The observed covariance matrix C equals noiseless covariance matrix C t plus gradient noise covariance matrix C x . Then, we get C = C t + C x = P t (Λ t + Λ x )P t = P t (Λ t + σ 2 I)P t = P ΛP = P t ΛP t ( ) where P t and P are the orthonormal eigenvector matrices of C t and C respectively, Λ t and Λ are the corresponding diagonal eigenvalue matrices, and I is an identity matrix. It can be seen from Eq.12 that C and C t has the same eigenvectors. Proof C.1 In the following proof, we assume that the probability density function of gradient noise variable x follows Gaussian distribution, x ∼ N 0, σ 2 . Treat the parameters in the neural network as variables, and the parameters obtained by each gradient update as samples. Furthermore, gradient noise and neural network parameters are assumed to be uncorrelated. For observed parameter matrix W ∈ R d×n , there are n samples, let W i,: ∈ R 1×n be the observed values of the i-th variable W i , and W = [W 1,: , • • • , W i,: , • • • , W d,: ] . Similarly, we denote the noiseless parameter matrix by W t = [(W t 1,: ) , • • • , (W t i,: ) , • • • , (W t d,: ) ] , and W = W t + X (13) Where X = [X 1,: , • • • , X i,: , • • • , X d,: ] is the dataset of noise variables. Then, centralize each variable by W k = W k - 1 n n i=1 W k,: (i) (14) So we get W = [W 1 , • • • , W d ] . Suppose W t is also centralized by the same way and get W t = [W t 1 , • • • , W t d ] . 
Then, we have

$$\bar{W} = \bar{W}^t + \bar{X} \quad (15)$$

Computing the covariance matrix of $\bar{W}$:

$$C = \frac{1}{n}\bar{W}\bar{W}^\top = \frac{1}{n}(\bar{W}^t + \bar{X})(\bar{W}^t + \bar{X})^\top = \frac{1}{n}\left(\bar{W}^t(\bar{W}^t)^\top + \bar{W}^t\bar{X}^\top + \bar{X}(\bar{W}^t)^\top + \bar{X}\bar{X}^\top\right) \quad (16)$$

Since $\bar{W}^t$ and $\bar{X}$ are uncorrelated, $\bar{W}^t\bar{X}^\top$ and $\bar{X}(\bar{W}^t)^\top$ are approximately zero matrices. Thus,

$$C \approx \frac{1}{n}\left(\bar{W}^t(\bar{W}^t)^\top + \bar{X}\bar{X}^\top\right) = C_t + C_x \quad (17)$$

The component $C_x(i, j)$ is the correlation between $X_i$ and $X_j$, which correspond to the $i$-th and $j$-th rows of $\bar{X}$. As any two noise variables are not related to each other, $C_x(i, j) = 0$ for $i \neq j$, so $C_x \in \mathbb{R}^{d \times d}$ is a diagonal matrix with diagonal elements $\sigma^2$. Decompose $C_t$ as

$$C_t = P_t \Lambda_t P_t^\top \quad (18)$$

where $P_t$ is the noiseless orthonormal eigenvector matrix and $\Lambda_t$ is the noiseless diagonal eigenvalue matrix. Then

$$C_x = \Lambda_x P_t P_t^\top = P_t \Lambda_x P_t^\top = P_t C_x P_t^\top \quad (19)$$

where $\Lambda_x = \sigma^2 I$, and $I$ is the identity matrix. Thus,

$$C = C_t + C_x = P_t \Lambda_t P_t^\top + P_t \Lambda_x P_t^\top = P_t(\Lambda_t + \Lambda_x)P_t^\top = P_t \Lambda P_t^\top \quad (20)$$

where $\Lambda = \Lambda_t + \Lambda_x$. It can be seen from Eq. 20 that $C$ and $C_t$ have the same eigenvector matrix; in other words, the eigenvectors are not affected by gradient noise.
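The eigenvector invariance claimed by Eq. 20 is easy to check numerically. The following sketch (an illustrative simulation of our own, with all sizes and scales chosen arbitrarily) builds a noiseless parameter trajectory with one dominant direction of variance, adds isotropic Gaussian gradient noise, and compares the leading eigenvectors of the noiseless and observed covariance matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 30, 2000, 0.1

# noiseless parameter samples with a clear principal direction
direction = rng.normal(size=(d, 1))
direction /= np.linalg.norm(direction)
W_t = 3.0 * direction @ rng.normal(size=(1, n)) + 0.05 * rng.normal(size=(d, n))
X = sigma * rng.normal(size=(d, n))   # isotropic gradient noise, x ~ N(0, sigma^2)
W = W_t + X                           # observed parameters, as in Eq. 13

def top_eigvec(M):
    """Leading eigenvector of the sample covariance of M (variables in rows)."""
    Mc = M - M.mean(axis=1, keepdims=True)
    return np.linalg.eigh(Mc @ Mc.T / M.shape[1])[1][:, -1]

alignment = abs(top_eigvec(W_t) @ top_eigvec(W))
print(round(alignment, 4))  # close to 1: isotropic noise shifts eigenvalues, not eigenvectors
```

With many samples the cross terms of Eq. 16 vanish and the noise contributes $\sigma^2 I$, so the alignment approaches 1; with the very small $n$ of FSL the cross terms only vanish approximately, which is why Eq. 17 is stated with $\approx$.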

D THEOREM 2

In this section, we prove that discarding high-loss samples results in a more accurate main direction in noisy FSL. In Eq. 21, $C^{tr}$ is the covariance matrix generated by the true samples, $\lambda$ is the observed largest eigenvalue, and $\lambda_o$ is the largest eigenvalue in the corrupted diagonal eigenvalue matrix $\Lambda_o$. According to Eq. 21, the smaller $\lambda_o/\lambda$ is, the more accurate the observed eigenvector $e$ is. Assume that the discarded high-loss samples make the same contribution $\xi$ to $\lambda$ and $\lambda_o$, which represent the observed and corrupted main directional variance, respectively. Note that these two kinds of data have the same effect on the gradient updates of the model, so this assumption is relatively reasonable. Furthermore, it is easy to verify that $(\lambda_o - \xi)/(\lambda - \xi)$ is smaller than $\lambda_o/\lambda$.

Proof D.1 Here, we use $w$ to represent the parameter point obtained after a gradient update. For convenience, let $w$ be generated by a single sample, $w \in \mathbb{R}^{d \times 1}$. Then the parameter matrix can be obtained:

$$W = [\,w^{tr}_1 \; w^{tr}_2 \; \cdots \; w^{o}_1 \; \cdots \; w^{o}_m \; \cdots \; w^{tr}_n\,] \quad (22)$$

where $w^o$ represents the parameters generated by a corrupted sample and $w^{tr}$ the parameters generated by a true sample. There are $n$ parameter points generated by $n$ samples, of which $m$ corrupted parameter points are generated by $m$ corrupted samples. Mean-centering $W$, the observed covariance matrix $C$ is

$$C = \frac{1}{n}WW^\top = \frac{1}{n}\left(w^{tr}_1(w^{tr}_1)^\top + \cdots + w^{o}_1(w^{o}_1)^\top + \cdots + w^{o}_m(w^{o}_m)^\top + \cdots + w^{tr}_n(w^{tr}_n)^\top\right) \quad (23)$$

It can be seen from this decomposition of $C$ that the required eigenvector depends both on the parameters obtained from the true samples and on the parameters obtained from the noisy samples. For a single parameter point $w = [a_1, \cdots, a_d]^\top$,

$$ww^\top = \begin{bmatrix} a_1^2 & a_1 a_2 & \cdots & a_1 a_d \\ a_2 a_1 & a_2^2 & \cdots & a_2 a_d \\ \vdots & \vdots & \ddots & \vdots \\ a_d a_1 & a_d a_2 & \cdots & a_d^2 \end{bmatrix} \quad (24)$$

As we discard all high-loss samples that change the model parameters significantly, and the randomly generated noisy labels may push the gradient in any direction, we assume the variance of the $i$-th corrupted parameter point variable is $\delta_i^2$. Computing the expectations of all variables in a corrupted parameter point,

$$\Omega = E(ww^\top) = \begin{bmatrix} \delta_1^2 + E(a_1)^2 & E(a_1 a_2) & \cdots & E(a_1 a_d) \\ E(a_2 a_1) & \delta_2^2 + E(a_2)^2 & \cdots & E(a_2 a_d) \\ \vdots & \vdots & \ddots & \vdots \\ E(a_d a_1) & E(a_d a_2) & \cdots & \delta_d^2 + E(a_d)^2 \end{bmatrix} \quad (25)$$

Let the sum of all corrupted $\frac{1}{n}E(ww^\top)$ be $\Omega_o$; then

$$\Omega_o = \frac{1}{n}\begin{bmatrix} m\delta_1^2 + \sum_{j=1}^{m}E(a_{j1})^2 & \cdots & \sum_{j=1}^{m}E(a_{j1}a_{jd}) \\ \vdots & \ddots & \vdots \\ \sum_{j=1}^{m}E(a_{jd}a_{j1}) & \cdots & m\delta_d^2 + \sum_{j=1}^{m}E(a_{jd})^2 \end{bmatrix} \quad (26)$$

For further analysis, we assume that any two variables are independently and identically distributed with variance $\delta^2$ and expectation $E(a) = \varepsilon$. Thus,

$$\frac{1}{\lambda}\Omega_o = \frac{p}{\lambda}\begin{bmatrix} \delta^2 + \varepsilon^2 & \varepsilon^2 & \cdots & \varepsilon^2 \\ \varepsilon^2 & \delta^2 + \varepsilon^2 & \cdots & \varepsilon^2 \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon^2 & \varepsilon^2 & \cdots & \delta^2 + \varepsilon^2 \end{bmatrix} \quad (31)$$

where $p$ is the proportion of noisy labels, $np = m$. As can be seen from Eq. 31, if $p\varepsilon^2/\lambda \approx 0$, then $\frac{1}{\lambda}\Omega_o$ is a diagonal matrix. According to Proof C.1, the observed eigenvector $e$ is then unaffected by the noisy labels, the corresponding eigenvalue being $\frac{p(\delta^2 + \varepsilon^2)}{\lambda}$. Obviously, $\lambda > \lambda_o$, and if we do not discard all samples, then $\lambda > \xi$. So Eq. 30 is positive, which means discarding high-loss samples reduces $\lambda_o/\lambda$. Therefore, discarding high-loss samples can improve the accuracy of the eigenvector in the presence of noisy labels.
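The mechanism behind Eq. 30 can also be observed in simulation. In the sketch below (a toy setup of our own, not the paper's experiment), parameter points from true samples vary mainly along one direction while corrupted points move erratically with large variance; discarding the corrupted points visibly improves the recovered main direction:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 20, 100, 20   # n parameter points, m of them from corrupted (noisy-label) samples

true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
# true points vary mainly along true_dir; corrupted points scatter in all directions
true_pts = np.outer(true_dir, rng.normal(size=n - m)) + 0.1 * rng.normal(size=(d, n - m))
bad_pts = 3.0 * rng.normal(size=(d, m))

def top_eigvec(M):
    """Leading eigenvector of the sample covariance of M (variables in rows)."""
    Mc = M - M.mean(axis=1, keepdims=True)
    return np.linalg.eigh(Mc @ Mc.T / M.shape[1])[1][:, -1]

kept_all = abs(top_eigvec(np.hstack([true_pts, bad_pts])) @ true_dir)
discarded = abs(top_eigvec(true_pts) @ true_dir)
print(kept_all, discarded)  # discarding the high-loss points raises alignment with true_dir
```

With only a few but high-variance corrupted points, their sample covariance dominates the top eigenvalue, so the observed main direction drifts away from the true one; removing them restores it, in line with the ratio comparison of Eq. 30.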

E HYPERPARAMETERS OF EIGEN-REPTILE

In this section, we follow Lee et al. (2019) and Cao et al. (2019) in varying the number of inner loops and the number of corresponding training shots to show the robustness of Eigen-Reptile. The other hyperparameters are the same as in Section 5.2. As shown in Figure 5, after the number of inner loops $i$ reaches 7, the test accuracy tends to be stable, which shows that changing the number of inner loops within a certain range has little effect on Eigen-Reptile; that is, Eigen-Reptile is robust to this hyperparameter. As for the train shot, to make the trained task-specific parameters as unbiased as possible, we specify that the number of train shots roughly satisfies $i \times \text{batch\_size}/N + 1$, where $N$ is the number of classes. So when $i = 7$, the number of train shots is 15. It is important to note that in our experiments, Reptile uses the hyperparameters of the original implementation: the number of inner loops is 8, the number of train shots is 15, and the corresponding accuracy is 65.99%.

F THE ASYMMETRIC LABEL NOISE EXPERIMENT

Since asymmetric noise is as common as symmetric noise, we focus on asymmetric noise in this section. As illustrated in the asymmetric-noise part of Figure 6, we randomly flip the labels of one class to the labels of another class, without duplication, in the meta-training dataset (64 classes).
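One way to implement this construction (a sketch under our reading of "without duplication", i.e. a random derangement of the 64 classes; all function names here are ours) is:

```python
import random

def asymmetric_flip_map(num_classes, seed=0):
    """Pair every class with a distinct other class: no class maps to itself
    and no two classes share a target (a random derangement of the labels)."""
    rng = random.Random(seed)
    while True:
        perm = list(range(num_classes))
        rng.shuffle(perm)
        if all(src != dst for src, dst in enumerate(perm)):  # reject fixed points
            return dict(enumerate(perm))

def corrupt(labels, flip, p, seed=0):
    """Flip each label to its paired class with probability p."""
    rng = random.Random(seed)
    return [flip[y] if rng.random() < p else y for y in labels]

flip = asymmetric_flip_map(64)                 # the 64 meta-training classes
noisy = corrupt(list(range(64)), flip, p=0.5)  # half the labels flipped on average
```

Because every class has exactly one flip target, the resulting noise transition matrix is shared across tasks, which is the property the discussion of Table 3 relies on.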



Code is included in the supplemental material and will be released upon acceptance of the paper.



Figure 2: Randomly sample examples to build different prior models.

Figure 3: Eigen-Reptile and Reptile training process on the regression toy test. Panels (a), (b), (c) and (d), (e), (f) show the results after 0, 8, 16, 24, and 32 gradient updates based on the initialization learned by Eigen-Reptile and Reptile, respectively.

Figure 4: Loss of the 10-shot regression.

Let $W^o$ be the parameter matrix generated by the corrupted samples. Compute the eigenvalues and eigenvectors of the expected observed parameter matrix:

$$\frac{1}{\lambda}E(C^{tr})e = P_o\left(I - \frac{\Lambda_o}{\lambda}\right)P_o^\top e \approx P_o\left(I - \frac{\lambda_o}{\lambda}I\right)P_o^\top e < P_o\left(I - \frac{\lambda_o - \xi}{\lambda - \xi}I\right)P_o^\top e \quad (21)$$

Let the sum of all true $\frac{1}{n}ww^\top$ be $C^{tr}$. So the expectation of $C$ can be written as

$$E(C) = E(C^{tr}) + \Omega_o \quad (27)$$

Treating the eigenvector and eigenvalue as definite values, we get

$$(\Omega_o + E(C^{tr}))e = \lambda e \quad (28)$$

where $e$ is the observed eigenvector and $\lambda$ the corresponding eigenvalue. Dividing both sides of the equation by $\lambda$,

$$\frac{1}{\lambda}E(C^{tr})e = \left(I - \frac{\Omega_o}{\lambda}\right)e = P_o\left(I - \frac{\Lambda_o}{\lambda}\right)P_o^\top e \quad (29)$$

where $\lambda_o$ is the largest eigenvalue in the corrupted diagonal eigenvalue matrix $\Lambda_o$ and $P_o$ is the orthonormal eigenvector matrix of $\Omega_o$. According to Eq. 29, if $\lambda_o/\lambda$ is smaller, $e$ is more accurate. We discard some samples with the largest losses, which may contain true samples and noisy samples. Assume that the discarded high-loss samples make the same contribution $\xi$ to $\lambda$ and $\lambda_o$, as these two kinds of data have the same effect on the gradient updates of the model. Comparing the ratio of eigenvalues before and after discarding, we get

$$\underbrace{\frac{\lambda_o}{\lambda}}_{before} - \underbrace{\frac{\lambda_o - \xi}{\lambda - \xi}}_{after} = \frac{\xi(\lambda - \lambda_o)}{\lambda(\lambda - \xi)} > 0 \quad (30)$$

Figure 5: The number of inner-loop and accuracy of 5-way 5-shot task on Mini-Imagenet.

Average test accuracy of 5-way 1-shot on the Mini-Imagenet with symmetric label noise.


We compare our algorithms with Reptile in Table 3. We observe that the meta-learning algorithms produce closer results under asymmetric noise than under symmetric noise. As the meta-learner is trained on tasks that share the same noise transition matrix, which allows the meta-learner to learn more useful information, the results are higher than those with symmetric noise. Similar to the symmetric-noise results, Eigen-Reptile outperforms Reptile in all tasks, and ISPL plays a more significant role when $p$ is higher. The experimental results also show that, for a fixed number of iterations, ISPL does not significantly improve, and may even degrade, the results of Reptile. On the contrary, ISPL can provide Eigen-Reptile with a more accurate main direction in difficult tasks, e.g., $p = 0.2, 0.5$.

G META-LEARNING ON CIFAR-FS

Bertinetto et al. (2018) propose CIFAR-FS (CIFAR100 few-shots), which is randomly sampled from CIFAR-100 (Krizhevsky et al., 2009) and contains images of size 32 × 32. The settings of Eigen-Reptile in this experiment are the same as in Section 5.2. Moreover, we do not compare against algorithms with additional tricks, such as higher way (Bertinetto et al., 2018; Snell et al., 2017). It can be seen from Table 4 that on CIFAR-FS, the performance of Eigen-Reptile is still far better than that of Reptile without any parameter adjustment.

Table 4: Few-shot classification on CIFAR-FS, N-way K-shot accuracy. The ± shows the 95% confidence interval over tasks.

Algorithm | 5-way 1-shot | 5-way 5-shot
MAML (Finn et al., 2017) | 58.90 ± 1.90% | 71.50 ± 1.00%
PROTO NET (Snell et al., 2017) | 55.50 ± 0.70% | 72.00 ± 0.60%
GNN (Satorras & Estrach, 2018) | 61.90% | 75.30%
Embedded Class Models (Ravichandran et al., 2019) | 55.14 ± 0.48% | 71.66 ± 0.39%
Reptile (Nichol et al.,

H NEURAL NETWORK ARCHITECTURES

This section shows the performance of Eigen-Reptile on Mini-Imagenet when using a larger network, as CAML etc. do. Note that in this section we only compare our algorithm with gradient-based meta-learning algorithms. As shown in Table 5, when Eigen-Reptile uses a larger convolutional neural network (CNN), higher accuracy can be obtained, which shows that Eigen-Reptile benefits from increased model expressiveness.

