PUTTING THEORY TO WORK: FROM LEARNING BOUNDS TO META-LEARNING ALGORITHMS

Abstract

Most existing deep learning models rely on excessive amounts of labeled training data to achieve state-of-the-art results, even though such data can be hard or costly to obtain in practice. One attractive alternative is to learn with little supervision, commonly referred to as few-shot learning (FSL), and, in particular, meta-learning, which learns to learn with few data from related tasks. Despite the practical success of meta-learning, many of the algorithmic solutions proposed in the literature are based on sound intuitions but lack a solid theoretical analysis of the expected performance on the test task. In this paper, we review recent advances in meta-learning theory and show how they can be used in practice both to better understand the behavior of popular meta-learning algorithms and to improve their generalization capacity. The latter is achieved by integrating the theoretical assumptions ensuring efficient meta-learning as regularization terms into several popular meta-learning algorithms, for which we provide a large study of their behavior on classic few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of meta-learning theory into practice for the task of few-shot classification.

1. INTRODUCTION

Since the very beginning of the machine learning field, its algorithmic advances have inevitably been followed or preceded by accompanying theoretical analyses establishing the conditions required for the corresponding algorithms to learn well. Such a synergy between theory and practice is reflected in numerous concepts and learning strategies that took their origins in statistical learning theory: for instance, the famous regularized risk minimization approach is directly related to the minimization of the complexity of the hypothesis space, as suggested by the generalization bounds established for supervised learning (Vapnik, 1992), while most of the adversarial algorithms in transfer learning (e.g., DANN from Ganin & Lempitsky (2015)) follow the theoretical insights provided by the seminal theory of domain adaptation (Ben-David et al., 2010). Even though many machine learning methods now enjoy a solid theoretical justification, some more recent advances in the field are still in a preliminary state, which requires the hypotheses put forward by the theoretical studies to be implemented and verified in practice. One such notable example is the emerging field of meta-learning, also called learning to learn (LTL), where the goal is to produce a model on data coming from a set of (meta-train) source tasks and to use it as a starting point for successfully learning a new, previously unseen (meta-test) target task with little supervision. This kind of approach comes in particularly handy when training deep learning models, as their performance crucially depends on the amount of training data, which can be difficult and/or expensive to get in some applications. Several theoretical studies (Baxter, 2000; Pentina & Lampert, 2014; Maurer et al., 2016; Amit & Meir, 2018; Yin et al., 2020) provided probabilistic meta-learning bounds that require both the amount of data in the meta-train source tasks and the number of meta-train tasks to tend to infinity for efficient meta-learning.
While capturing the underlying general intuition, these bounds do not suggest that all the source data is useful in such a learning setup, due to the additive relationship between the two terms mentioned above. To tackle this drawback, two very recent studies (Du et al., 2020; Tripuraneni et al., 2020) aimed at finding deterministic assumptions that lead to faster learning rates, allowing meta-learning algorithms to benefit from all the source data. Contrary to probabilistic bounds, which have been used to derive novel learning strategies for meta-learning algorithms (Amit & Meir, 2018; Yin et al., 2020), there has been no attempt to verify in practice the validity of the assumptions leading to the fastest known learning rates, or to enforce them through an appropriate optimization procedure. In this paper, we bridge meta-learning theory with practice by harvesting the theoretical results of Tripuraneni et al. (2020) and Du et al. (2020), and by showing how they can be implemented algorithmically and integrated, when needed, into popular existing meta-learning algorithms used for few-shot classification (FSC). This latter task consists in classifying new data having seen only a few training examples, and represents one of the most prominent settings where meta-learning has been shown to be highly efficient. More precisely, our contributions are three-fold:

1. We identify two common assumptions from the theoretical works on meta-learning and show how they can be verified and enforced via a novel regularization scheme.
2. We investigate whether these assumptions are satisfied for popular meta-learning algorithms and observe that some of them naturally satisfy them, while others do not.
3. With the proposed regularization strategy, we show that enforcing the assumptions in practice leads to better generalization of the considered algorithms.

The rest of the paper is organized as follows.
After presenting preliminary knowledge on the meta-learning problem in Section 2, we detail the existing meta-learning theoretical results with their corresponding assumptions and show how the latter can be enforced via a general regularization technique in Section 3. Then, we provide an experimental evaluation of several popular few-shot learning (FSL) methods in Section 4 and highlight the different advantages brought by the proposed regularization in practice. Finally, we conclude and outline future research perspectives in Section 5.

2. PRELIMINARY KNOWLEDGE

We start by formally defining the meta-learning problem following the model described in Du et al. (2020). To this end, we assume having access to $T$ source tasks characterized by their respective data generating distributions $\{\mu_t\}_{t=1}^T$ supported over the joint input-output space $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. We further assume that these distributions are observed only through finite samples of size $n_1$ grouped into matrices $\mathbf{X}_t = (\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n_1})^\top \in \mathbb{R}^{n_1 \times d}$ and vectors of outputs $\mathbf{y}_t = (y_{t,1}, \dots, y_{t,n_1})^\top \in \mathbb{R}^{n_1}$, $\forall t \in [[T]] := \{1, \dots, T\}$. Given this set of tasks, our goal is to learn a shared representation $\phi$ belonging to a certain class of functions $\Phi := \{\phi \mid \phi : \mathcal{X} \to \mathcal{V},\ \mathcal{V} \subseteq \mathbb{R}^k\}$ and linear predictors $\mathbf{w}_t \in \mathbb{R}^k$, $\forall t \in [[T]]$, grouped in a matrix $\mathbf{W} \in \mathbb{R}^{T \times k}$. More formally, this is done by solving the following optimization problem:

$$\hat\phi, \hat{\mathbf{W}} = \operatorname*{arg\,min}_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{T \times k}} \frac{1}{2Tn_1} \sum_{t=1}^{T} \sum_{i=1}^{n_1} \ell\big(y_{t,i}, \langle \mathbf{w}_t, \phi(\mathbf{x}_{t,i}) \rangle\big), \qquad (1)$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function. Once such a representation is learned, we want to apply it to a new previously unseen target task observed through a pair $(\mathbf{X}_{T+1} \in \mathbb{R}^{n_2 \times d}, \mathbf{y}_{T+1} \in \mathbb{R}^{n_2})$ containing $n_2$ samples generated by the distribution $\mu_{T+1}$. We expect that a linear classifier $\mathbf{w}$ learned on top of the obtained representation leads to a low true risk over the whole distribution $\mu_{T+1}$. More precisely, we first use $\hat\phi$ to solve the following problem:

$$\hat{\mathbf{w}}_{T+1} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathbb{R}^{k}} \frac{1}{n_2} \sum_{i=1}^{n_2} \ell\big(y_{T+1,i}, \langle \mathbf{w}, \hat\phi(\mathbf{x}_{T+1,i}) \rangle\big).$$

Then, we define the true target risk of the learned linear classifier $\hat{\mathbf{w}}_{T+1}$ as

$$\mathcal{L}(\hat\phi, \hat{\mathbf{w}}_{T+1}) = \mathbb{E}_{(\mathbf{x},y) \sim \mu_{T+1}}\big[\ell\big(y, \langle \hat{\mathbf{w}}_{T+1}, \hat\phi(\mathbf{x}) \rangle\big)\big]$$

and want it to be small and as close as possible to the ideal true risk $\mathcal{L}(\phi^*, \mathbf{w}^*_{T+1})$, where $\forall t \in [[T+1]]$ and $(\mathbf{x}, y) \sim \mu_t$,

$$y = \langle \mathbf{w}^*_t, \phi^*(\mathbf{x}) \rangle + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \qquad (2)$$

Equivalently, most of the works found in the literature seek to upper-bound the excess risk, defined as $\mathrm{ER}(\hat\phi, \hat{\mathbf{w}}_{T+1}) := \mathcal{L}(\hat\phi, \hat{\mathbf{w}}_{T+1}) - \mathcal{L}(\phi^*, \mathbf{w}^*_{T+1})$, with quantities involved in the learning process.
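The two-stage procedure above can be sketched on synthetic data. The following is a minimal illustration under our own simplifying assumptions, not the paper's actual pipeline: the representation is linear, $\phi(\mathbf{x}) = \mathbf{B}^\top \mathbf{x}$, the loss is least-squares, and a crude SVD of the per-task least-squares solutions stands in for solving Eq. 1 (the paper learns $\phi$ with a deep network). All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, n1, n2, sigma = 20, 5, 50, 30, 10, 0.01

B_star = np.linalg.qr(rng.standard_normal((d, k)))[0]  # ground-truth phi*
W_star = rng.standard_normal((T + 1, k))               # rows = optimal heads w_t*

# Source data generated as in Eq. 2: y = <w_t*, phi*(x)> + eps
Xs = rng.standard_normal((T, n1, d))
Ys = np.einsum('tnd,dk,tk->tn', Xs, B_star, W_star[:T]) \
     + sigma * rng.standard_normal((T, n1))

# Crude representation recovery: top-k span of the per-task least-squares
# solutions (a stand-in for jointly solving Eq. 1 over all source tasks).
V = np.stack([np.linalg.lstsq(Xs[t], Ys[t], rcond=None)[0] for t in range(T)])
B_hat = np.linalg.svd(V.T, full_matrices=False)[0][:, :k]

# Target task: with the representation fixed, only a k-dimensional head is fit,
# so n2 = 10 < d = 20 samples can suffice.
Xt = rng.standard_normal((n2, d))
yt = Xt @ B_star @ W_star[T] + sigma * rng.standard_normal(n2)
w_hat = np.linalg.lstsq(Xt @ B_hat, yt, rcond=None)[0]

# Excess-risk proxy on fresh target data (squared loss, noiseless labels)
Xe = rng.standard_normal((1000, d))
err = np.mean((Xe @ B_hat @ w_hat - Xe @ B_star @ W_star[T]) ** 2)
print(err)  # typically small: the source tasks did the heavy lifting
```

The point of the sketch is the sample-size asymmetry: the target head has only $k$ parameters once the representation is transferred, which is what the excess-risk bounds in Section 3 quantify.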
Remark 1. We note that many popular meta-learning algorithms used for FSL do not follow exactly the approach described above. However, we believe that the exact way this is done algorithmically (with or without a support set, with or without learning episodes) does not change the underlying statistical challenge, which is to learn a model that can provably generalize with little supervision. Supervised learning theory tells us that generalization in this case is poor (there is not enough target data, and it is difficult to rely on data coming from different probability distributions), while the theoretical works we build upon suggest that, if the assumptions described below are satisfied, source data may contribute to improving the generalization of the learned model alongside the target data.

3. FROM THEORY TO PRACTICE

In this section, we highlight the main theoretical contributions that provably ensure the success of meta-learning in improving the performance on a previously unseen target task as the number of source tasks and the amount of data available for them increase. We then concentrate on the most recent theoretical advances leading to the fastest learning rates and show how the assumptions used to obtain them can be enforced in practice through a novel regularization strategy.

3.1. WHEN DOES META-LEARNING PROVABLY WORK?

One requirement for meta-learning to succeed in FSC is that a representation learned on meta-train data should be useful for learning a good predictor on the meta-test data set. This is reflected by bounding the excess target risk by a quantity that involves the number of samples in both the meta-train and meta-test samples and the number of available meta-train tasks. To this end, the first studies in the context of meta-learning relied on a probabilistic assumption (Baxter, 2000; Pentina & Lampert, 2014; Maurer et al., 2016; Amit & Meir, 2018; Yin et al., 2020) stating that the meta-train and meta-test task distributions are all sampled i.i.d. from the same random distribution. This assumption, however, is considered unrealistic, as in FSL source and target tasks' data are often given by different draws (without replacement) from the same dataset. In this setup, the above-mentioned works obtained bounds of the following form:

$$\mathrm{ER}(\hat\phi, \hat{\mathbf{w}}_{T+1}) \leq O\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{T}}\right).$$

This guarantee implies that not only the number of source data points but also the number of tasks should be large in order to drive the second term to 0. An improvement was then proposed by Du et al. (2020) and Tripuraneni et al. (2020), who obtained bounds on the excess risk behaving as

$$O\left(\frac{kd}{\sqrt{n_1 T}} + \frac{k}{\sqrt{n_2}}\right) \quad \text{and} \quad \tilde{O}\left(\frac{kd}{n_1 T} + \frac{k}{n_2}\right),$$

respectively, where $k \ll d$ is the dimensionality of the learned representation and $\tilde{O}(\cdot)$ hides logarithmic factors. Both results show that all the source and target samples are useful in minimizing the excess risk: in the FSL regime where target data is scarce, all source data helps to learn well. From the sets of assumptions made by the authors in both of these works, we note the following two:

Assumption 1. The matrix of optimal predictors $\mathbf{W}^*$ should cover all the directions in $\mathbb{R}^k$ evenly. More formally, this can be stated as

$$R_\sigma(\mathbf{W}^*) = \frac{\sigma_1(\mathbf{W}^*)}{\sigma_k(\mathbf{W}^*)} = O(1),$$

where $\sigma_i(\cdot)$ denotes the $i$-th singular value of $\mathbf{W}^*$.
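The ratio in Assumption 1 is easy to check numerically. A minimal sketch (function and variable names are ours): a matrix of task predictors whose rows point in diverse directions has a small ratio, while predictors concentrated around one direction make it blow up.

```python
import numpy as np

def singular_ratio(W: np.ndarray) -> float:
    """R_sigma(W): ratio of largest to smallest singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)  # returned in decreasing order
    return s[0] / s[-1]

rng = np.random.default_rng(0)
k, T = 16, 500

# Rows drawn isotropically: directions of R^k are covered evenly.
W_diverse = rng.standard_normal((T, k))

# Nearly rank-one rows: all predictors share (almost) one direction.
W_aligned = rng.standard_normal((T, 1)) @ rng.standard_normal((1, k))
W_aligned += 1e-3 * rng.standard_normal((T, k))

print(singular_ratio(W_diverse))  # moderate, O(1)
print(singular_ratio(W_aligned))  # very large: Assumption 1 violated
```

Intuitively, a small $R_\sigma$ means every direction of the embedding space is exercised by some source task, which is exactly the task-diversity reading given above.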
As pointed out by the authors, such an assumption can be seen as a measure of diversity between the source tasks, which are expected to be complementary to each other in order to provide a meaningful representation for a previously unseen target task.

Assumption 2. The norm of the optimal predictors $\mathbf{w}^*_t$ should not increase with the number of tasks seen during meta-training. This assumption says that the classification margin of the linear predictors should remain constant, thus avoiding over- or under-specialization to the seen tasks.

While highly insightful, the authors did not provide any experimental evidence suggesting that enforcing these assumptions in practice helps to learn more efficiently in the considered learning setting. To bridge this gap, we propose a general regularization scheme that allows us to enforce these assumptions when learning the matrix of predictors in several popular meta-learning algorithms.

3.2. PUTTING THEORY TO WORK

As the assumptions mentioned above are stated for the optimal predictors that are inherently linked to the data generating process, one may wonder what happens when the latter do not satisfy them. To this end, we aim to answer the following question: given $\mathbf{W}^*$ such that $R_\sigma(\mathbf{W}^*) \gg 1$, can we learn $\hat{\mathbf{W}}$ with $R_\sigma(\hat{\mathbf{W}}) \approx 1$ while solving the underlying classification problems equally well? It turns out that we can construct an example, illustrated in Fig. 1, for which the answer to this question is positive. To this end, let us consider a binary classification problem over $\mathcal{X} \subseteq \mathbb{R}^3$ with labels $\mathcal{Y} = \{-1, 1\}$ and two source tasks generated, for $k, \varepsilon \in\, ]0, 1]$, as follows:

1. $\mu_1$ is uniform over $\{(1-k\varepsilon,\ k,\ 1)\} \times \{1\} \cup \{(1+k\varepsilon,\ k,\ -1)\} \times \{-1\}$;
2. $\mu_2$ is uniform over $\{(1+k\varepsilon,\ k,\ \tfrac{k-1}{\varepsilon})\} \times \{1\} \cup \{(-1+k\varepsilon,\ k,\ \tfrac{1+k}{\varepsilon})\} \times \{-1\}$.

We now define the optimal representation and the two optimal predictors for each distribution as the solution to Eq. 1 over the two data generating distributions and $\Phi = \{\phi \mid \phi(\mathbf{x}) = \boldsymbol{\Phi}^\top \mathbf{x},\ \boldsymbol{\Phi} \in \mathbb{R}^{3 \times 2}\}$:

$$\phi^*, \mathbf{W}^* = \operatorname*{arg\,min}_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{2 \times 2}} \sum_{i=1}^{2} \mathbb{E}_{(\mathbf{x},y) \sim \mu_i}\, \ell\big(y, \langle \mathbf{w}_i, \phi(\mathbf{x}) \rangle\big). \qquad (4)$$

One solution to this problem is given by

$$\boldsymbol{\Phi}^* = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}^\top, \quad \mathbf{W}^* = \begin{pmatrix} 1 & \varepsilon \\ 1 & -\varepsilon \end{pmatrix},$$

where $\phi^*$ projects the data generated by $\mu_i$ to a two-dimensional space by discarding its third dimension, and the linear predictors satisfy the data generating process from Eq. 2 with $\varepsilon = 0$. One can verify that in this case $\mathbf{W}^*$ has singular values $\sqrt{2}$ and $\sqrt{2}\varepsilon$, so that $R_\sigma(\mathbf{W}^*) = \frac{1}{\varepsilon}$: when $\varepsilon \to 0$, the optimal predictors make the ratio arbitrarily large, thus violating Assumption 1. Let us now consider a different problem where we want to solve Eq. 4 with a constraint that forces the linear predictors to satisfy Assumption 1:

$$\hat\phi, \hat{\mathbf{W}} = \operatorname*{arg\,min}_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{2 \times 2}} \sum_{i=1}^{2} \mathbb{E}_{(\mathbf{x},y) \sim \mu_i}\, \ell\big(y, \langle \mathbf{w}_i, \phi(\mathbf{x}) \rangle\big), \quad \text{s.t. } R_\sigma(\mathbf{W}) \approx 1. \qquad (5)$$

Its solution is different and is given by

$$\hat{\boldsymbol{\Phi}} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\top, \quad \hat{\mathbf{W}} = \begin{pmatrix} 0 & 1 \\ 1 & -\varepsilon \end{pmatrix}.$$
Similarly to $\boldsymbol{\Phi}^*$, $\hat{\boldsymbol{\Phi}}$ projects to a two-dimensional space, this time by discarding the first dimension of the data generated by $\mu_i$. The learned predictors in this case also satisfy Eq. 2 with $\varepsilon = 0$ but, contrary to $\mathbf{W}^*$,

$$R_\sigma(\hat{\mathbf{W}}) = \sqrt{\frac{2 + \varepsilon^2 + \varepsilon\sqrt{\varepsilon^2 + 4}}{2 + \varepsilon^2 - \varepsilon\sqrt{\varepsilon^2 + 4}}}$$

tends to 1 when $\varepsilon \to 0$. Several remarks are in order here. First, this shows that even when $\mathbf{W}^*$ does not satisfy Assumption 1 in the space induced by $\phi^*$, it may still be possible to learn a new representation space $\hat\phi$ such that the optimal predictors in this space satisfy Assumption 1. This can be done either by considering the constrained problem from Eq. 5, or by using the more common strategy of adding $R_\sigma(\mathbf{W})$ directly as a regularization term:

$$\hat\phi, \hat{\mathbf{W}} = \operatorname*{arg\,min}_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{T \times k}} \frac{1}{2Tn_1} \sum_{t=1}^{T} \sum_{i=1}^{n_1} \ell\big(y_{t,i}, \langle \mathbf{w}_t, \phi(\mathbf{x}_{t,i}) \rangle\big) + \lambda_1 R_\sigma(\mathbf{W}). \qquad (6)$$

Below, we explain how to implement this idea in practice for popular meta-learning algorithms.

Ensuring Assumption 1. We propose to compute the singular values of $\mathbf{W}$ during the meta-training stage and follow their evolution across the learning episodes. In practice, this can be done by performing a Singular Value Decomposition (SVD) of $\mathbf{W} \in \mathbb{R}^{T \times k}$ with a computational cost of $O(Tk^2)$ floating-point operations (flops). However, as $T$ is typically quite large, we propose a more computationally efficient solution: take into account only the last batch of $N$ predictors (with $N \ll T$), grouped in the matrix $\mathbf{W}_N \in \mathbb{R}^{N \times k}$, which captures the latest dynamics of the learning process. We further note that $\sigma_i(\mathbf{W}_N^\top \mathbf{W}_N) = \sigma_i^2(\mathbf{W}_N)$, $\forall i \in [[N]]$, implying that we can calculate the SVD of $\mathbf{W}_N \mathbf{W}_N^\top$ (or $\mathbf{W}_N^\top \mathbf{W}_N$ for $k \leq N$) and retrieve the singular values from it afterwards. We now want to verify whether the learned linear predictors $\mathbf{w}_t$ cover all directions in the embedding space by tracking the evolution of the ratio of singular values $R_\sigma(\mathbf{W}_N)$ during the training process. For the sake of conciseness, we write $R_\sigma$ instead of $R_\sigma(\mathbf{W}_N)$ hereafter.
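The toy example above can be verified numerically. The sketch below builds $\mathbf{W}^*$ and $\hat{\mathbf{W}}$ for several values of $\varepsilon$ and checks the two claimed ratios (helper names are ours):

```python
import numpy as np

def r_sigma(W: np.ndarray) -> float:
    """R_sigma(W) = sigma_max / sigma_min."""
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / s[-1]

for eps in (0.5, 0.1, 0.02):
    W_star = np.array([[1.0,  eps],
                       [1.0, -eps]])   # optimal predictors in the phi* space
    W_hat  = np.array([[0.0,  1.0],
                       [1.0, -eps]])   # predictors in the constrained phi space
    # Singular values of W* are sqrt(2) and sqrt(2)*eps, hence R_sigma = 1/eps.
    assert np.isclose(r_sigma(W_star), 1.0 / eps)
    print(eps, r_sigma(W_star), r_sigma(W_hat))

# R_sigma(W_hat) = sqrt((2+eps^2+eps*sqrt(eps^2+4))/(2+eps^2-eps*sqrt(eps^2+4)))
# approaches 1 as eps -> 0, while R_sigma(W*) = 1/eps blows up.
```

Both matrices separate the same toy tasks perfectly, yet only $\hat{\mathbf{W}}$ satisfies Assumption 1, which is exactly the motivation for regularizing $R_\sigma$.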
According to the theory, we expect $R_\sigma$ to decrease during training, thus improving the generalization of the learned predictors and preparing them for the target task. To enforce such behavior in practice, we propose to use $R_\sigma$ as a regularization term in the training loss of popular meta-learning algorithms. Alternatively, as the smallest singular value $\sigma_N(\mathbf{W}_N)$ can be close to 0 and lead to numerical errors, we propose to replace the ratio of singular values by the entropy of the vector of singular values:

$$H_\sigma(\mathbf{W}_N) = -\sum_{i=1}^{N} \mathrm{softmax}(\sigma(\mathbf{W}_N))_i \cdot \log\, \mathrm{softmax}(\sigma(\mathbf{W}_N))_i,$$

where $\mathrm{softmax}(\cdot)_i$ is the $i$-th output of the softmax function. As with $R_\sigma$, we write $H_\sigma$ instead of $H_\sigma(\mathbf{W}_N)$ from now on. Since the uniform distribution has the highest entropy, regularizing with $R_\sigma$ or $-H_\sigma$ leads to a better coverage of $\mathbb{R}^k$ by ensuring a nearly identical importance for every direction. We refer the reader to the Supplementary materials for the derivations ensuring the existence of subgradients for these terms.

Ensuring Assumption 2. In addition to the full coverage of the embedding space by the linear predictors, the meta-learning theory assumes that the norm of the linear predictors does not increase with the number of tasks seen during meta-training, i.e., $\|\mathbf{w}_t\|_2 = O(1)$ or, equivalently, $\|\mathbf{W}\|_F^2 = O(T)$. If this assumption does not hold in practice, we propose to regularize the norm of the linear predictors during training or to normalize the obtained linear predictors directly: $\bar{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|_2}$. The final meta-training loss with the theory-inspired regularization terms is given as:

$$\min_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{T \times k}} \frac{1}{2Tn_1} \sum_{t=1}^{T} \sum_{i=1}^{n_1} \ell\big(y_{t,i}, \langle \mathbf{w}_t, \phi(\mathbf{x}_{t,i}) \rangle\big) + \lambda_1 R_\sigma(\mathbf{W}_N) + \lambda_2 \|\mathbf{W}_N\|_F^2, \qquad (7)$$

where, depending on the considered algorithm, we can replace $R_\sigma$ by $-H_\sigma$ and/or replace $\mathbf{w}_t$ by $\bar{\mathbf{w}}_t$ instead of regularizing with $\|\mathbf{W}_N\|_F^2$. In what follows, we set $\lambda_1 = \lambda_2 = 1$ and refer the reader to the Supplementary materials for more details and experiments with other values.
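A minimal numpy sketch of the three quantities entering Eq. 7 is given below, including the cheaper Gram-matrix route to the singular values. We use numpy purely for illustration of the computations; in an actual training loop these would be evaluated on the autograd graph (e.g., with `torch.linalg.svdvals`), and all names here are ours.

```python
import numpy as np

def reg_terms(W_N: np.ndarray):
    """Return (R_sigma, H_sigma, squared Frobenius norm) for W_N in R^{N x k}."""
    N, k = W_N.shape
    # SVD of the smaller Gram matrix: sigma_i(W^T W) = sigma_i(W)^2.
    G = W_N.T @ W_N if k <= N else W_N @ W_N.T
    s = np.sqrt(np.linalg.svd(G, compute_uv=False))  # singular values of W_N
    r_sigma = s[0] / s[-1]                   # Assumption 1, ratio form
    p = np.exp(s - s.max()); p /= p.sum()    # numerically stable softmax
    h_sigma = -(p * np.log(p)).sum()         # Assumption 1, entropy form
    return r_sigma, h_sigma, (W_N ** 2).sum()

def normalize_rows(W_N: np.ndarray) -> np.ndarray:
    """Assumption 2 by direct normalization: w_t <- w_t / ||w_t||_2."""
    return W_N / np.linalg.norm(W_N, axis=1, keepdims=True)

rng = np.random.default_rng(0)
W_N = rng.standard_normal((64, 16))          # last batch of N = 64 predictors
r, h, fro = reg_terms(W_N)
print(r, h, fro)  # the loss would add lambda1 * r (or -h) + lambda2 * fro
print(np.linalg.norm(normalize_rows(W_N), axis=1))  # all rows have unit norm
```

Note that $-H_\sigma$ is bounded (by $-\log$ of the number of singular values) even when $\sigma_{\min} \to 0$, which is why it is the numerically safer choice over $R_\sigma$.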
To the best of our knowledge, such regularization terms, based on insights from recent advances in meta-learning theory, have never been used in the literature before. We further use the basic quantities involved in the proposed regularization terms as indicators of whether a given meta-learning algorithm naturally satisfies the assumptions ensuring efficient meta-learning in practice.

3.3. RELATED WORK

Below, we discuss several related studies aiming at improving the general understanding of meta-learning, and mention other regularization terms specifically designed for meta-learning.

Understanding meta-learning. While a complete theory for meta-learning is still lacking, several recent works have aimed to shed light on phenomena commonly observed in meta-learning by evaluating different intuitive heuristics. For instance, Raghu et al. (2020) investigated whether the popular gradient-based MAML algorithm relies on rapid learning, with significant changes in the representations when deployed on the target task, or on feature reuse, where the learned representation remains almost intact. They establish that the latter factor is dominant and propose a new variation of MAML that freezes all but the task-specific layers of the neural network when learning new tasks. In another study (Goldblum et al., 2020), the authors explain the success of meta-learning approaches by their capability either to cluster classes more tightly in feature space (task-specific adaptation approach), or to search for meta-parameters that lie close in weight space to many task-specific minima (full fine-tuning approach). Finally, the effect of the number of shots on the classification accuracy was studied theoretically and illustrated empirically in Cao et al. (2020) for the popular metric-based PROTONET algorithm. Our paper is complementary to the works mentioned above, as it investigates a new aspect of meta-learning that has not been studied before, while following a sound theory. We also provide a more complete experimental evaluation, as the three different approaches to meta-learning (based on gradients, metrics or transfer learning), separately presented in Raghu et al. (2020), Cao et al. (2020) and Goldblum et al. (2020), are now compared together.
Other regularization strategies. Regularization is a common tool for reducing model complexity during learning for better generalization, and variations of its two most famous instances, weight decay (Krogh & Hertz, 1992) and dropout (Srivastava et al., 2014), are commonly used as a basis in the meta-learning literature as well. In general, regularization in meta-learning is applied to the weights of the whole neural network (Balaji et al., 2018; Yin et al., 2020), to the predictions (Jamal & Qi, 2019; Goldblum et al., 2020), or is introduced via a prior hypothesis in biased regularized empirical risk minimization (Pentina & Lampert, 2014; Kuzborskij & Orabona, 2017; Denevi et al., 2018a;b; 2019). Our proposal differs from all the approaches mentioned above for the following reasons. First, contrary to the methods of the first group, and more specifically the famous weight decay approach (Krogh & Hertz, 1992), we do not regularize the whole weight matrix learned by the neural network but only the linear predictors of its last layer. The purpose of the regularization in our case is also completely different: weight decay is used to improve generalization by shrinking the weights in order to avoid overfitting, while our goal is to keep the classification margin unchanged during training to avoid over-/under-specialization to some source tasks. Similarly, the spectral normalization proposed by Miyato et al. (2018) to satisfy a Lipschitz constraint in GANs, through dividing the values of $\mathbf{W}$ by $\sigma_{\max}(\mathbf{W})$, does not affect the ratio between $\sigma_{\max}(\mathbf{W})$ and $\sigma_{\min}(\mathbf{W})$ and serves a completely different purpose. Second, we regularize the singular values (entropy or ratio) of the matrix of linear predictors instead of the predictions, as done by the methods of the second group (e.g., using information-theoretic quantities in Jamal & Qi (2019) and Yin et al. (2020)).
Finally, the works of the last group are related to the online setting with convex loss functions only and, similarly to the algorithms of the second group, do not specifically target the spectral properties of the learned predictors. Last but not least, contrary to previous works, our proposal is built upon the most recent advances in the meta-learning field leading to faster learning rates.

4. PRACTICAL RESULTS

In this section, we use extensive experimental evaluations to answer the following two questions: Q1) Do popular meta-learning methods naturally satisfy the assumptions behind the learning bounds? Q2) Does ensuring these assumptions help to (meta-)learn more efficiently? For Q1, we run the original implementations of popular meta-learning methods to see what their natural behavior is. For Q2, we study the impact of forcing them to closely follow the theoretical setup.

4.1. EXPERIMENTAL SETUP

Datasets & baselines. We consider the few-shot image classification problem on three benchmark datasets, namely: 1) Omniglot (Lake et al., 2015), consisting of 1,623 classes with 20 images per class of size 28 × 28; 2) miniImageNet (Ravi & Larochelle, 2017), consisting of 100 classes with 600 images per class of size 84 × 84; and 3) tieredImageNet (Ren et al., 2018), consisting of 779,165 images divided into 608 classes. For each dataset, we follow the commonly adopted experimental protocol used in Finn et al. (2017) and Chen et al. (2019) and use a four-layer convolutional backbone (Conv-4) with 64 filters, as done by Chen et al. (2019). On Omniglot, we perform 20-way classification with 1 shot and 5 shots, while on miniImageNet and tieredImageNet we perform 5-way classification with 1 shot and 5 shots. Finally, we evaluate four FSL methods: two popular meta-learning strategies, namely MAML (Finn et al., 2017), a gradient-based method, and Prototypical Networks (PROTONET) (Snell et al., 2017), a metric-based approach; and two popular transfer learning baselines, termed BASELINE and BASELINE++ (Ravi & Larochelle, 2017; Gidaris & Komodakis, 2018; Chen et al., 2019). Even though these baselines are trained within the standard supervised learning framework, such training can also be seen as learning a single task in the LTL framework.

Implementation details. Enforcing Assumptions 1 and 2 for MAML is straightforward, as it closely follows the LTL framework of episodic training. For each task, the model learns a batch of linear predictors, and we can directly take them as $\mathbf{W}_N$ to compute its SVD. Since the linear predictors are the weights of our model and change slowly, regularizing the norm $\|\mathbf{W}_N\|_F$ and the ratio of singular values $R_\sigma$ does not cause instabilities during training. Meanwhile, metric-based methods do not use linear predictors but compute a similarity between features. In the case of PROTONET, the similarity is computed with respect to class prototypes (i.e., the mean features of the images of each class). Since the prototypes act as linear predictors, a first idea would be to regularize the norm and the ratio of singular values of the prototypes. Unfortunately, this strategy hinders the convergence of the network and leads to numerical instabilities, most likely because prototypes are computed from image features, which change rapidly across batches. Consequently, to ensure Assumption 1 while avoiding instabilities during training, we regularize the entropy of singular values $H_\sigma$ instead of the ratio $R_\sigma$, and to ensure Assumption 2 we normalize the prototypes by replacing $\mathbf{w}_t$ with $\bar{\mathbf{w}}_t$ in Eq. 7. For the transfer learning methods BASELINE and BASELINE++, the last layer of the network is discarded and linear predictors are learned during meta-testing. Thus, we only regularize the norm $\|\mathbf{W}_N\|_F$ of the predictors learned during the fine-tuning phase of meta-testing. Similarly to MAML, we compute $R_\sigma$ with the last layer of the network during the training and fine-tuning phases.

Remark 2. We choose well-established meta-learning algorithms for our comparison, but the proposed regularization can be integrated similarly into their recent variations (Park & Oliva, 2019; Lee et al., 2019) (see Supplementary materials for results obtained with the method of Park & Oliva (2019)). Finally, using models that do not rely on linear predictors is also possible, but it might be more difficult, as it would require upstream work to understand which part of the model acts as predictors (as done for PROTONET in this paper) and how to compute and track the desired quantities.
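The PROTONET adaptation described above can be sketched in a few lines: prototypes are class means of embedded support images, which we then treat as the rows of $\mathbf{W}_N$. This is a hedged numpy illustration of the monitored quantities only, not the training loop, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_way, n_shot, k = 5, 5, 16
support = rng.standard_normal((n_way, n_shot, k))  # embedded support images

# Class prototypes = mean embedding per class; Assumption 2 via normalization.
protos = support.mean(axis=1)                              # shape (n_way, k)
protos /= np.linalg.norm(protos, axis=1, keepdims=True)    # w_t <- w_t/||w_t||

# Assumption 1 via the entropy of the singular values (H_sigma, not R_sigma,
# since prototypes change quickly across batches).
s = np.linalg.svd(protos, compute_uv=False)
p = np.exp(s - s.max()); p /= p.sum()
H_sigma = -(p * np.log(p)).sum()
print(H_sigma)  # bounded above by log(n_way); -H_sigma is added to the loss
```

Maximizing $H_\sigma$ (i.e., adding $-H_\sigma$ to the loss) pushes the prototype directions toward even coverage of the embedding space without ever dividing by a vanishing $\sigma_{\min}$.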

4.2. INSIGHTS

Q1 – Verifying the assumptions. According to the theory, $\|\mathbf{W}_N\|_F$ and $R_\sigma$ should remain constant or converge toward a constant value when monitoring the last $N$ tasks. From Fig. 2 (a), we can see that for MAML (Fig. 2 (a), top), both $\|\mathbf{W}_N\|_F$ and $R_\sigma$ increase with the number of tasks seen during training, whereas PROTONET (Fig. 2 (a), bottom) naturally learns prototypes with a good coverage of the embedding space and minimizes their norm. This behavior is rather peculiar, as neither of the two methods specifically controls the theoretical quantities of interest, and still PROTONET manages to do so implicitly. As for the transfer learning baselines (Fig. 2 (b), top and bottom), we expect them to learn features that cover the embedding space, with $R_\sigma$ rapidly converging toward a constant value. As can be seen in Fig. 2 (b), similarly to PROTONET, BASELINE++ naturally learns linear predictors that cover the embedding space. As for BASELINE, it learns a good coverage for the Omniglot dataset, but fails to do so for the more complicated tieredImageNet dataset. The observed behavior of these different methods leads to the conclusion that some meta-learning algorithms are inherently more explorative of the embedding space.

Q2 – Ensuring the assumptions. Armed with our regularization terms, we now aim to force the considered algorithms to verify the assumptions when this is not done naturally. According to our results for Q1, regularizing the singular values of the prototypes through the entropy $H_\sigma$ is not necessary. Based on the obtained results, we can make the following conclusions. First, from Fig. 2 (a) (left, middle) and Fig. 2 (b) (left), we note that for all the methods considered, the proposed methodology used to enforce the theoretical assumptions works as expected and leads to the desired behavior during the learning process.
This means that the differences in the results presented in Table 1 are fully explained by this particular addition to the optimized objective function. Second, from the shape of the accuracy curves provided in Fig. 2 (a) (right) and the accuracy gaps obtained when enforcing the assumptions, given in Table 1, we can see that respecting the assumptions leads to several significant improvements related to different aspects of learning. On the one hand, we observe that the final validation accuracy improves significantly on all benchmarks for the meta-learning methods and in most experiments for BASELINE (except on Omniglot, where BASELINE already learns to regularize its linear predictors). In accordance with the theory, we attribute the improvements to the fact that we fully utilize the training data, which leads to a tighter bound on the excess target risk and, consequently, to better generalization performance. On the other hand, we also note that our regularization reduces the sample complexity of learning the target task, as indicated by the faster increase of the validation accuracy from the very beginning of meta-training. Roughly speaking, less meta-training data is necessary to achieve a performance comparable to that obtained without the proposed regularization using more tasks. Finally, we note that the BASELINE++ and PROTONET methods naturally satisfy some of the assumptions: both learn diverse linear predictors by design, while BASELINE++ also normalizes the weights of its linear predictors. Thus, these methods do not benefit from the additional regularization, as explained before.

5. CONCLUSION

In this paper, we studied the validity of the theoretical assumptions made in recent papers when applied to popular meta-learning algorithms and proposed practical ways of enforcing them. On the one hand, we showed that, depending on the problem and the algorithm, some models can naturally fulfill the theoretical conditions during training: some algorithms offer a better coverage of the embedding space than others. On the other hand, when the conditions are not verified, learning with our proposed regularization terms allows the methods to learn faster and improves their generalization capabilities; the theoretical framework studied in this paper explains the observed performance gain. Note that no specific hyperparameter tuning was performed, as we aim to show the effect of ensuring the learning bounds assumptions rather than to compare the raw performance of the methods. Absolute accuracy results are detailed in the Supplementary materials. While this paper proposes an initial approach to bridging the gap between theory and practice in meta-learning, some questions remain open regarding the inner workings of these algorithms. In particular, being able to take better advantage of the particularities of the training tasks during meta-training could help improve the effectiveness of these approaches. Self-supervised meta-learning and multiple-target-task prediction are also important future perspectives for the application of meta-learning.



Footnotes:
1. We omit other works on meta-learning via online convex optimization (Finn et al., 2019; Balcan et al., 2019; Khodak et al., 2019; Denevi et al., 2019) as they concern a different learning setup.
2. While not stated as a separate assumption, Du et al. (2020) assume this in order to derive Assumption 1 mentioned above; see p. 5 and the discussion after Assumption 4.3 in their pre-print.
3. The effect of entropic regularization on PROTONET is detailed in the Supplementary materials.



Figure 1: Illustration of the example from Section 3.2 with ε = 0.02.

Figure 2: (a) Evolution of $\|\mathbf{W}_N\|_F$ (left), $R_\sigma$ (middle) and validation accuracy (right) when training MAML (top) and PROTONET (bottom) on miniImageNet (1 shot for MAML, 5 shots for PROTONET). (b) Evolution of $R_\sigma$ (left) and validation accuracy (right) when training BASELINE (top) and BASELINE++ (bottom) on Omniglot (dashed lines) and tieredImageNet (solid lines). All training curves are averaged over 4 different random seeds. For MAML, $\|\mathbf{W}_N\|_F$ and $R_\sigma$ increase during training and violate Assumptions 1-2. PROTONET prototypes naturally cover the embedding space while minimizing their norms. $R_\sigma$ converges during training on both datasets for BASELINE++ (similarly to PROTONET), whereas it diverges for BASELINE on tieredImageNet. With our regularization, $\|\mathbf{W}_N\|_F$ and $R_\sigma$ remain constant during training, in accordance with the theory.

Table 1: Accuracy gap (in p.p.) of the considered algorithms when using the regularization (or normalization in the case of PROTONET) enforcing the theoretical assumptions. All accuracy results are averaged over 2,400 test episodes and 4 different seeds. Statistically significant results (outside confidence intervals) are marked with *. Exact performances are on par with those found in the literature and are reported in the Supplementary materials.

