LOCALIZED META-LEARNING: A PAC-BAYES ANALYSIS FOR META-LEARNING BEYOND GLOBAL PRIOR

Abstract

Meta-learning methods learn meta-knowledge across various training tasks and aim to promote the learning of new tasks under a task similarity assumption. Such meta-knowledge is often represented as a fixed distribution; this, however, may be too restrictive to capture task-specific information, because the discriminative patterns in the data may change dramatically across tasks. In this work, we aim to equip the meta learner with the ability to model and produce task-specific meta-knowledge and, accordingly, present a localized meta-learning framework based on PAC-Bayes theory. In particular, we propose a Local Coordinate Coding (LCC) based prior predictor that allows the meta learner to generate local meta-knowledge for specific tasks adaptively. We further develop a practical algorithm for deep neural networks based on the bound. Empirical results on real-world datasets demonstrate the efficacy of the proposed method.



1. INTRODUCTION

Recent years have seen a resurgence of interest in meta-learning, or learning-to-learn (Thrun & Pratt, 2012), especially for empowering deep neural networks with the capability of fast adaptation to unseen tasks, just as humans do (Finn et al., 2017; Ravi & Larochelle, 2017). More concretely, the networks are trained on a sequence of datasets associated with different tasks sampled from a meta-distribution (also called a task environment (Baxter, 2000; Maurer, 2005)). The principal aim of the meta learner is to extract transferable meta-knowledge from the observed tasks and to facilitate the learning of new tasks sampled from the same meta-distribution. Performance is measured by the generalization ability from a finite set of observed tasks, evaluated by learning related unseen tasks. For this reason, there has been considerable interest in theoretical bounds on the generalization of meta-learning algorithms (Denevi et al., 2018b;a). One typical line of work (Pentina & Lampert, 2014; Amit & Meir, 2018) uses PAC-Bayes bounds to analyze the generalization behavior of the meta learner and to quantify the relation between the expected loss on new tasks and the average loss on the observed tasks. This setup formulates meta-learning as hierarchical Bayes. For each task, the base learner produces a posterior based on the associated task data and a prior. Each prior is a reference distribution over base models that is generated by the meta learner and must be chosen before observing the task data. Accordingly, meta-knowledge is formulated as a global distribution over all possible priors. Initially it is called a hyperprior, since it is chosen before observing the training tasks. To learn versatile meta-knowledge across tasks, the meta learner observes a sequence of training tasks and adjusts its hyperprior into a hyperposterior distribution over the set of priors. A prior generated by the hyperposterior is then used to solve new tasks.
Despite its great success, such a global hyperposterior is rather generic and typically not well tailored to specific tasks. In contrast, in many scenarios the related tasks may require task-specific meta-knowledge, so traditional meta-knowledge may lead to sub-optimal performance on any individual prediction task. As a motivating example, suppose we have two different tasks: distinguishing motorcycle versus bicycle and distinguishing motorcycle versus car. Intuitively, each task uses distinct discriminative patterns, and the desired meta-knowledge is thus required to extract these patterns simultaneously. Representing it with a global hyperposterior can be challenging, since the most significant patterns in the first task could be irrelevant or even detrimental to the second task. The figure schematically illustrates this notion. Therefore, customized meta-knowledge, such that the patterns are most discriminative for a given task, is highly desirable. Can the meta-knowledge be adaptive to tasks? How can one achieve this? Intuitively, we can implement this idea by reformulating the meta-knowledge as a mapping function: leveraging the samples in the target task, the meta model produces task-specific meta-knowledge. Naturally yet interestingly, one can see quantitatively how customized prior knowledge improves generalization, in light of the PAC-Bayes literature on data distribution-dependent priors (Catoni, 2007; Parrado-Hernández et al., 2012; Dziugaite & Roy, 2018). Specifically, PAC-Bayes bounds control the generalization error of Gibbs classifiers. They usually depend on a trade-off between the empirical error of the posterior $Q$ and a KL-divergence term $\mathrm{KL}(Q\|P)$, where $P$ is the prior.
Since this KL-divergence term forms part of the generalization bound and is typically large in standard PAC-Bayes approaches (Lever et al., 2013), the choice of posterior is constrained by the need to minimize the KL-divergence between the prior $P$ and the posterior $Q$. Thus, choosing an appropriate prior for each task that is close to the corresponding posterior can yield improved generalization bounds. This encourages the study of data distribution-dependent priors for PAC-Bayes analysis and gives rise to principled approaches to localized PAC-Bayes analysis. Related prior work is discussed mainly in Appendix A. Inspired by this, we propose a Localized Meta-Learning (LML) framework that formulates meta-knowledge as a conditional distribution over priors. Given the task data distribution, we allow the meta learner to adaptively generate an appropriate prior for a new task. The challenges of developing this model are three-fold. First, the task data distribution is not explicitly given; our only perception of it is via the associated sample set. Second, the model should be permutation invariant: its output should not change under any permutation of the elements in the sample set. Third, the learned model should be usable for solving unseen tasks. To address these problems, we develop a prior predictor using Local Coordinate Coding (LCC) (Yu et al., 2009). In particular, if the classifier in each task is specialized to a parametric model, e.g., a deep neural network, the proposed LCC-based prior predictor predicts the base model parameters from the task sample set.
The main contributions include: (1) a localized meta-learning framework which provides a means to tighten the original PAC-Bayes meta-learning bound (Pentina & Lampert, 2014; Amit & Meir, 2018) by minimizing the task-complexity term through a data-dependent choice of prior; (2) an LCC-based prior predictor, an implementation of the conditional hyperposterior, which generates local meta-knowledge for a specific task; (3) a practical algorithm for probabilistic deep neural networks derived by minimizing the bound (though the optimization method applies to a large family of differentiable models); (4) experimental results which demonstrate improved performance over existing meta-learning methods in this field.

2. PRELIMINARIES

Our prior predictor is implemented with Local Coordinate Coding (LCC), and the LML framework is inspired by PAC-Bayes theory for meta-learning. In this section we briefly review the related definitions and formulations.

2.1. LOCAL COORDINATE CODING

Definition 1 (Lipschitz smoothness, Yu et al. (2009)). A function $f(x)$ on $\mathbb{R}^d$ is $(\alpha, \beta)$-Lipschitz smooth w.r.t. a norm $\|\cdot\|$ if $\|f(x) - f(x')\| \le \alpha \|x - x'\|$ and $\|f(x') - f(x) - \nabla f(x)^\top (x' - x)\| \le \beta \|x - x'\|^2$.

Definition 2 (Coordinate coding, Yu et al. (2009)). A coordinate coding is a pair $(\gamma, C)$, where $C \subset \mathbb{R}^d$ is a set of anchor points (bases), and $\gamma$ maps each $x \in \mathbb{R}^d$ to $[\gamma_u(x)]_{u \in C} \in \mathbb{R}^{|C|}$ such that $\sum_u \gamma_u(x) = 1$. It induces the following physical approximation of $x$ in $\mathbb{R}^d$: $\tilde{x} = \sum_{u \in C} \gamma_u(x)\, u$.

Definition 3 (Latent manifold, Yu et al. (2009)). A subset $\mathcal{M} \subset \mathbb{R}^d$ is called a smooth manifold with intrinsic dimension $|C| := d_{\mathcal{M}}$ if there exists a constant $c_{\mathcal{M}}$ such that, given any $x \in \mathcal{M}$, there exist $|C|$ anchor points $u_1(x), \ldots, u_{|C|}(x) \in \mathbb{R}^d$ so that for all $x' \in \mathcal{M}$: $\inf_{\gamma \in \mathbb{R}^{|C|}} \big\|x' - x - \sum_{j=1}^{|C|} \gamma_j u_j(x)\big\|_2 \le c_{\mathcal{M}} \|x' - x\|_2^2$, where $\gamma = [\gamma_1, \ldots, \gamma_{|C|}]$ are the local codings w.r.t. the anchor points.

Definitions 2 and 3 imply that any point in $\mathbb{R}^d$ can be expressed as a linear combination of a set of anchor points. Later, we will show that a high-dimensional nonlinear prior predictor can be approximated by a simple linear function w.r.t. the coordinate coding, and that the approximation quality is ensured by the locality of the coding (each data point can be well approximated by a linear combination of its nearby anchor points).
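As an illustration of Definition 2, the following numpy sketch computes a sum-to-one coding for a point via equality-constrained least squares and forms the physical approximation $\tilde{x} = \sum_{u\in C}\gamma_u(x)u$. The anchor set and solver here are illustrative only, not the LCC optimizer used later in the paper:

```python
import numpy as np

def coordinate_coding(x, anchors):
    """Solve min_g ||x - anchors.T @ g||^2  s.t.  sum(g) = 1
    via the KKT system of the equality-constrained least squares."""
    k = anchors.shape[0]
    A = anchors @ anchors.T                     # k x k Gram matrix of anchors
    ones = np.ones(k)
    # KKT system: [[A, 1], [1^T, 0]] [g; lam] = [anchors @ x; 1]
    K = np.block([[A, ones[:, None]], [ones[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([anchors @ x, [1.0]])
    sol = np.linalg.lstsq(K, rhs, rcond=None)[0]
    return sol[:k]

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 3))          # 8 anchor points in R^3
x = rng.normal(size=3)
g = coordinate_coding(x, C)
x_tilde = C.T @ g                    # physical approximation of x
print(abs(g.sum() - 1.0) < 1e-8)     # True: coding weights sum to one
```

Since 8 generic anchors affinely span $\mathbb{R}^3$, the approximation $\tilde{x}$ here recovers $x$ exactly; locality (using only nearby anchors) is what the LCC optimization later enforces.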

2.2. PAC-BAYES REGULAR META-LEARNING

To present the advances proposed in this paper, we recall some definitions from PAC-Bayes theory for single-task learning and meta-learning (Catoni, 2007; Baxter, 2000; Pentina & Lampert, 2014; Amit & Meir, 2018). In the context of classification, we assume all tasks share the same input space $\mathcal{X}$, output space $\mathcal{Y}$, space of classifiers (hypotheses) $\mathcal{H} \subset \{h: \mathcal{X} \to \mathcal{Y}\}$ and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. The meta learner observes $n$ tasks in the form of sample sets $S_1, \ldots, S_n$. The number of samples in task $i$ is denoted by $m_i$. Each observed task $i$ consists of a set of i.i.d. samples $S_i = \{(x_j, y_j)\}_{j=1}^{m_i}$, drawn from a data distribution, $S_i \sim \mathcal{D}_i^{m_i}$. Following the meta-learning setup in (Baxter, 2000), we assume that each data distribution $\mathcal{D}_i$ is generated i.i.d. from the same meta-distribution $\tau$. Let $h(x)$ be the prediction for $x$; the goal of each task is to find a classifier $h$ that minimizes the expected loss $\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(h(x), y)$. Since the underlying 'true' data distribution $\mathcal{D}_i$ is unknown, the base learner receives a finite set of samples $S_i$ and produces an "optimal" classifier $h = A_b(S_i)$ with a learning algorithm $A_b(\cdot)$, which is then used to predict the labels of unseen inputs. PAC-Bayes theory studies the properties of a randomized classifier, called the Gibbs classifier. Let $Q$ be a posterior distribution over $\mathcal{H}$. To make a prediction, the Gibbs classifier samples a classifier $h \in \mathcal{H}$ according to $Q$ and then predicts a label with the chosen $h$. The expected error under the data distribution $\mathcal{D}$ and the empirical error on the sample set $S$ are then given by averaging over $Q$, namely $\mathrm{er}(Q) = \mathbb{E}_{h\sim Q}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(h(x), y)$ and $\widehat{\mathrm{er}}(Q) = \mathbb{E}_{h\sim Q}\,\frac{1}{m}\sum_{j=1}^m \ell(h(x_j), y_j)$, respectively. In the context of meta-learning, the goal of the meta learner is to extract meta-knowledge from the observed tasks that will be used as prior knowledge for learning new tasks. In each task, the prior knowledge $P$ takes the form of a distribution over the classifiers $\mathcal{H}$.
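The Gibbs errors $\mathrm{er}(Q)$ and $\widehat{\mathrm{er}}(Q)$ average the loss over classifiers drawn from $Q$. A minimal Monte Carlo sketch for linear classifiers under 0-1 loss, with toy data and a hypothetical Gaussian posterior, could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # linearly separable toy labels

w_mean, sigma = np.array([1.0, 1.0]), 0.1    # posterior Q = N(w_mean, sigma^2 I)

def gibbs_empirical_error(w_mean, sigma, X, y, n_draws=1000, rng=rng):
    """Monte Carlo estimate of er_hat(Q) = E_{h~Q} (1/m) sum_j l(h(x_j), y_j)
    for linear classifiers h_w(x) = 1[w.x > 0] under 0-1 loss."""
    errs = []
    for _ in range(n_draws):
        w = rng.normal(w_mean, sigma)        # sample one classifier from Q
        pred = (X @ w > 0).astype(int)
        errs.append(np.mean(pred != y))
    return float(np.mean(errs))

err = gibbs_empirical_error(w_mean, sigma, X, y)
print(err)   # small here, since Q concentrates on a separating direction
```

The same estimator, averaged additionally over priors drawn from the hyperposterior, yields the meta-level quantities introduced next.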
The base learner produces a posterior $Q = A_b(S, P)$ over $\mathcal{H}$ based on a sample set $S$ and a prior $P$. All tasks are learned through the same learning procedure. The meta learner treats the prior $P$ itself as a random variable and assumes the meta-knowledge takes the form of a distribution over all possible priors. Let the hyperprior $\mathcal{P}$ be an initial distribution over priors; the meta learner uses the observed tasks to adjust its original hyperprior $\mathcal{P}$ into a hyperposterior $\mathcal{Q}$ during the learning process. Given this, the quality of the hyperposterior $\mathcal{Q}$ is measured by the expected task error of learning new tasks using priors generated from it, which is formulated as:

$$\mathrm{er}(\mathcal{Q}) = \mathbb{E}_{P\sim\mathcal{Q}}\,\mathbb{E}_{(\mathcal{D},m)\sim\tau,\, S\sim\mathcal{D}^m}\,\mathrm{er}\big(Q = A_b(S, P)\big). \quad (1)$$

Accordingly, the empirical counterpart of the above quantity is given by:

$$\widehat{\mathrm{er}}(\mathcal{Q}) = \mathbb{E}_{P\sim\mathcal{Q}}\,\frac{1}{n}\sum_{i=1}^n \widehat{\mathrm{er}}\big(Q_i = A_b(S_i, P)\big). \quad (2)$$

In each task, the base learner outputs the mean of the posterior, $\mathbf{w}_{Q_i} = A_b(S_i, P)$, as shown in Figure 2 (left). Then, we can derive the following PAC-Bayes meta-learning bound.

Theorem 1. Consider the regular meta-learning framework with hyperprior $\mathcal{P} = \mathcal{N}(0, \sigma_w^2 I_{d_w})$. Then for any hyperposterior $\mathcal{Q}$, any $c_1, c_2 > 0$ and any $\delta \in (0, 1]$, with probability at least $1-\delta$ we have

$$\mathrm{er}(\mathcal{Q}) \le c_1' c_2'\, \widehat{\mathrm{er}}(\mathcal{Q}) + \Big(\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2} + \frac{c_1'}{2 c_1 n \sigma_w^2}\Big) \|\mathbf{w}_{\mathcal{Q}}\|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2}\, \mathbb{E}_{\mathbf{w}^P} \|\mathbf{w}_{Q_i} - \mathbf{w}_{\mathcal{Q}}\|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2}\Big(\frac{1}{2} + \log\frac{2n}{\delta}\Big) + \frac{c_1'}{c_1 n \sigma_w^2}\log\frac{2}{\delta}, \quad (3)$$

where $c_1' = \frac{c_1}{1 - e^{-c_1}}$ and $c_2' = \frac{c_2}{1 - e^{-c_2}}$. To get a better understanding, we can simplify the notation and obtain

$$\mathrm{er}(\mathcal{Q}) \le c_1' c_2'\, \widehat{\mathrm{er}}(\mathcal{Q}) + \Big(\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2} + \frac{c_1'}{2 c_1 n \sigma_w^2}\Big) \|\mathbf{w}_{\mathcal{Q}}\|^2 + \underbrace{\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2}\, \mathbb{E}_{\mathbf{w}^P} \|\mathbf{w}_{Q_i} - \mathbf{w}_{\mathcal{Q}}\|^2}_{\text{task-complexity}} + \mathrm{const}(\delta, n, m_i, \sigma_w, c_1, c_2). \quad (4)$$

See Appendix D.4 for the proof. Notice that the expected task generalization error is bounded by the empirical multi-task error plus two complexity terms, which measure the environment-complexity and the task-complexity, respectively.

3. PAC-BAYES LOCALIZED META-LEARNING

3.1. MOTIVATION AND OVERALL FRAMEWORK

Our motivation stems from a core challenge in the PAC-Bayes meta-learning bound in (4): the task-complexity term $\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2}\, \mathbb{E}\|\mathbf{w}_{Q_i} - \mathbf{w}_{\mathcal{Q}}\|^2$, whose minimizer $\mathbf{w}_{\mathcal{Q}}$ is an average of the posterior means $\mathbb{E}\,\mathbf{w}_{Q_i}$ across tasks (weighted by the coefficients $\frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2}$). This solution for the global hyperposterior is required to satisfy the task similarity assumption that the optimal posteriors for the tasks are close together and lie within a small subset of the model space. Under this circumstance, there exists a global hyperposterior from which a good prior for any individual task is reachable. However, if the optimal posteriors for the tasks are unrelated or even mutually exclusive, i.e., one optimal posterior has a negative effect on another task, the global hyperposterior may impede the learning of some tasks. Moreover, this complexity term can be inevitably large and incur large generalization error. Note that $\mathbf{w}_{\mathcal{Q}}$ is the mean of the hyperposterior $\mathcal{Q}$, and this complexity term naturally indicates the divergence between the mean of the prior $\mathbf{w}^{P_i}$ sampled from the hyperposterior $\mathcal{Q}$ and the mean of the posterior $\mathbf{w}_{Q_i}$ in each task. Therefore, we propose to adaptively choose the mean of the prior $\mathbf{w}^{P_i}$ according to task $i$. Obviously, the complexity term vanishes if we set $\mathbf{w}^{P_i} = \mathbf{w}_{Q_i}$, but the prior $P_i$ in each task has to be chosen independently of the sample set $S_i$. Fortunately, the PAC-Bayes theorem allows us to choose the prior based upon the data distribution $\mathcal{D}_i$. We therefore propose a prior predictor $\Phi: \mathcal{D}^m \to \mathbf{w}^P$ which receives the task data distribution and outputs the mean of the prior $\mathbf{w}^P$. In this way, the generated priors can focus locally on those regions of the model parameter space that are of particular interest for solving specific tasks. The prior predictor is parameterized as $\Phi_v$ with $v \in \mathbb{R}^{d_v}$. We assume $v$ to be a random variable distributed first according to the hyperprior $\mathcal{P}$, which we reformulate as $\mathcal{N}(0, \sigma_v^2 I_{d_v})$, and later according to the hyperposterior $\mathcal{Q}$, which we reformulate as $\mathcal{N}(\mathbf{v}_{\mathcal{Q}}, \sigma_v^2 I_{d_v})$.
Given a new task $i$, we first sample $v$ from the hyperposterior $\mathcal{N}(\mathbf{v}_{\mathcal{Q}}, \sigma_v^2 I_{d_v})$ and estimate the mean of the prior with the prior predictor, $\mathbf{w}^{P_i} = \Phi_v(\mathcal{D}_i^m)$. The base learner then utilizes the sample set $S_i$ and the prior $P_i = \mathcal{N}(\mathbf{w}^{P_i}, \sigma_w^2 I_{d_w})$ to produce a mean posterior $\mathbf{w}_{Q_i} = A_b(S_i, P_i)$, as shown in Figure 2 (right). To make $\mathbf{w}^P$ close to $\mathbf{w}_Q$ in each task, what properties is the prior predictor expected to exhibit? Importantly, it is required to (i) uncover the tight relationship between the sample set and the model parameters. Intuitively, in classification problems, features and parameters yield similar local and global structures in their respective spaces. Features in the same category tend to be spatially clustered together while maintaining separation between different classes. Take linear classifiers as an example: let $\mathbf{w}_k$ be the parameters w.r.t. category $k$; the separability between classes is implemented as $x \cdot \mathbf{w}_k$, which also explicitly encourages intra-class compactness. A reasonable choice of $\mathbf{w}_k$ is to maximize the inner-product similarity with the input features of the same category and to minimize it for the input features of non-belonging categories. Besides, the prior predictor should be (ii) category-agnostic, since it will be used continuously as new tasks, and hence new categories, become available. Lastly, it should be (iii) invariant under permutations of its inputs.
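Property (iii) holds automatically for a mean-pooled set encoder, which is the design adopted in the next subsection. A quick numpy check, using a stand-in tanh embedding in place of the learned $\phi_v$:

```python
import numpy as np

def prior_mean(S, phi):
    """Mean-pooled set encoder: permutation-invariant by construction,
    and category-agnostic since it only sees raw feature vectors."""
    return np.mean([phi(x) for x in S], axis=0)

phi = np.tanh                               # stand-in for the embedding phi_v
rng = np.random.default_rng(2)
S = list(rng.normal(size=(10, 4)))          # a sample set of 10 points in R^4

w1 = prior_mean(S, phi)
perm = rng.permutation(len(S))
w2 = prior_mean([S[i] for i in perm], phi)  # the same set, shuffled order
print(np.allclose(w1, w2))                  # True: the order does not matter
```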

3.2. LCC-BASED PRIOR PREDICTOR

There exist many implementations, such as the set transformer (Lee et al., 2018), relation network (Rusu et al., 2019), and task2vec (Achille et al., 2019), that satisfy the above conditions. We follow the idea of the nearest class mean classifier (Mensink et al., 2013), which represents a class parameter by averaging its feature embeddings. This idea has been explored in transductive few-shot learning (Snell et al., 2017; Qiao et al., 2018). Snell et al. (2017) learn a metric space across tasks such that, in this embedding, the prototype (centroid) of each class can be used for label prediction in a new task. Qiao et al. (2018) directly predict the classifier weights from the activations, exploiting the close relationship between the parameters and the activations in a neural network associated with the same category. In summary, the classification problem of each task is transformed into a generic metric learning problem that is shared across tasks. Once this mapping has been learned on the observed tasks, due to its structure-preserving property, it can easily generalize to new tasks. Formally, consider each task as a $K$-class classification problem, and denote the parameters of the classifier in task $i$ as $\mathbf{w}_i = [\mathbf{w}_i[1], \ldots, \mathbf{w}_i[k], \ldots, \mathbf{w}_i[K]]$. The prior predictor for class $k$ is defined as:

$$\mathbf{w}_i^P[k] = \Phi_v(\mathcal{D}_{ik}^{m_{ik}}) = \mathbb{E}_{S_{ik}\sim\mathcal{D}_{ik}^{m_{ik}}}\,\frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\phi_v(x_j), \quad (5)$$

where $\phi_v(\cdot): \mathbb{R}^d \to \mathbb{R}^{d_w}$ is the feature embedding function, $m_{ik}$ is the number of samples belonging to category $k$, and $S_{ik}$ and $\mathcal{D}_{ik}$ are the sample set and data distribution for category $k$ in task $i$. We call this function the expected prior predictor. Since the data distribution $\mathcal{D}_{ik}$ is unknown and our only insight into it is through the sample set $S_{ik}$, we approximate the expected prior predictor by its empirical counterpart.
Note that if the prior predictor is relatively stable to perturbations of the sample set, then the generated prior still reflects the underlying task data distribution rather than the particular data, resulting in a generalization bound that still holds, perhaps with smaller probability (Dziugaite & Roy, 2018). Formally, the empirical prior predictor is defined as:

$$\hat{\mathbf{w}}_i^P[k] = \hat{\Phi}_v(S_{ik}) = \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\phi_v(x_j). \quad (6)$$

Although we could implement the embedding function $\phi_v(\cdot)$ with a multilayer perceptron (MLP), both the input $x \in \mathbb{R}^d$ and the model parameters $\mathbf{w} \in \mathbb{R}^{d_w}$ are high-dimensional, making the empirical prior predictor $\hat{\Phi}_v(\cdot)$ difficult to learn. Inspired by the local coordinate coding method, if the anchor points are sufficiently localized, the embedding $\phi_v(x_j)$ can be approximated by a linear function w.r.t. a set of codings $[\gamma_u(x_j)]_{u\in C}$. Accordingly, we propose an LCC-based prior predictor, defined as:

$$\tilde{\mathbf{w}}_i^P[k] = \tilde{\Phi}_v(S_{ik}) = \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\sum_{u\in C}\gamma_u(x_j)\,\phi_v(u), \quad (7)$$

where $\phi_v(u) \in \mathbb{R}^{d_w}$ is the embedding of the corresponding anchor point $u \in C$. As such, the parameters of the LCC-based prior predictor w.r.t. category $k$ can be represented as $\mathbf{v}_k = [\phi_{v_k}(u_1), \phi_{v_k}(u_2), \ldots, \phi_{v_k}(u_{|C|})]$. Lemma 1 characterizes the approximation error between the empirical prior predictor and the LCC-based prior predictor.

Lemma 1 (Empirical prior predictor approximation). Given the definitions of $\hat{\mathbf{w}}_i^P[k]$ and $\tilde{\mathbf{w}}_i^P[k]$ in Eq. (6) and Eq. (7), let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$ and $\phi_v(\cdot)$ be an $(\alpha, \beta)$-Lipschitz smooth function. Then for all $x \in \mathbb{R}^d$,

$$\|\hat{\mathbf{w}}_i^P[k] - \tilde{\mathbf{w}}_i^P[k]\| \le O_{\alpha,\beta}(\gamma, C), \quad \text{where} \quad O_{\alpha,\beta}(\gamma, C) = \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\alpha\|x_j - \tilde{x}_j\| + \beta\sum_{u\in C}\|\tilde{x}_j - u\|^2\Big)$$

and $\tilde{x}_j = \sum_{u\in C}\gamma_u(x_j)u$. See Appendix D.1 for the proof. Lemma 1 shows that a good LCC-based prior predictor should make $x$ close to its physical approximation $\tilde{x}$ and should be localized. The complexity of the LCC coding scheme depends on the number of anchor points $|C|$. We follow the optimization method of Yu et al. (2009) to find the coordinate coding $(\gamma, C)$, as presented in Appendix B.
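A minimal numpy sketch of the LCC-based prior predictor $\tilde{\Phi}_v$: class samples are coded over the anchors, and the class prior mean is linear in the learnable anchor embeddings, so only $|C| \times d_w$ parameters per class need to be learned. The softmax-over-distances coding below is an illustrative stand-in for the LCC solver of Appendix B, and all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_w, n_anchor = 4, 6, 5
C = rng.normal(size=(n_anchor, d))            # anchor points u in C
V = rng.normal(size=(n_anchor, d_w))          # learnable rows phi_v(u), one per anchor

def soft_coding(x, C, tau=1.0):
    """Illustrative sum-to-one coding: softmax over negative squared distances
    to the anchors, kept local by the temperature tau."""
    logits = -np.sum((C - x) ** 2, axis=1) / tau
    g = np.exp(logits - logits.max())
    return g / g.sum()

def lcc_prior_predictor(S_k, C, V):
    """w~P[k] = (1/m_k) sum_j sum_u gamma_u(x_j) phi_v(u): linear in the
    anchor embeddings V once the codings are fixed."""
    G = np.stack([soft_coding(x, C) for x in S_k])    # m_k x |C| codings
    return G.mean(axis=0) @ V                         # d_w prior mean for class k

S_k = rng.normal(size=(12, d))                        # samples of one class
w_prior = lcc_prior_predictor(S_k, C, V)
print(w_prior.shape)                                  # (6,)
```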

3.3. PAC-BAYES LOCALIZED META-LEARNING BOUND WITH GAUSSIAN RANDOMIZATION

To derive a PAC-Bayes generalization bound for localized meta-learning, we first bound the approximation error between the expected prior predictor and the LCC-based prior predictor.

Lemma 2. Given the definitions of $\mathbf{w}^P$ and $\tilde{\mathbf{w}}^P$ in Eq. (5) and (7), let $\mathcal{X}$ be a compact set with radius $R$, i.e., $\forall x, x' \in \mathcal{X}$, $\|x - x'\| \le R$. For any $\delta \in (0, 1]$, with probability at least $1-\delta$ we have

$$\|\mathbf{w}^P - \tilde{\mathbf{w}}^P\|^2 \le \sum_{k=1}^K \Big(\frac{\alpha R}{\sqrt{m_{ik}}}\Big(1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}\Big) + O_{\alpha,\beta}(\gamma, C)\Big)^2. \quad (8)$$

See Appendix D.2 for the proof. Lemma 2 shows that the approximation error between the expected prior predictor and the LCC-based prior predictor depends on (i) the concentration of the prior predictor and (ii) the quality of the LCC coding scheme. The first term implies that the number of samples per category should be larger for better approximation. This is consistent with the results on estimating the center of mass (Cristianini & Shawe-Taylor, 2004). Based on Lemma 2, using the same Catoni-type bound, we obtain the following PAC-Bayes LML bound.

Theorem 2. Consider the localized meta-learning framework with hyperprior $\mathcal{P} = \mathcal{N}(0, \sigma_v^2 I_{d_v})$. Then for any hyperposterior $\mathcal{Q}$, any $c_1, c_2 > 0$ and any $\delta \in (0, 1]$, with probability at least $1-\delta$ we have

$$\mathrm{er}(\mathcal{Q}) \le c_1' c_2'\,\widehat{\mathrm{er}}(\mathcal{Q}) + \Big(\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_v^2} + \frac{c_1'}{2 c_1 n \sigma_v^2}\Big)\|\mathbf{v}_{\mathcal{Q}}\|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2}\,\mathbb{E}_v\|\mathbf{w}_{Q_i} - \tilde{\Phi}_{\mathbf{v}_{\mathcal{Q}}}(S_i)\|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2}\Big[\frac{1}{\sigma_w^2}\sum_{k=1}^K\Big(\frac{\alpha R}{\sqrt{m_{ik}}}\Big(1 + \sqrt{\tfrac{1}{2}\log\tfrac{4n}{\delta}}\Big) + O_{\alpha,\beta}(\gamma, C)\Big)^2 + d_w K\Big(\frac{\sigma_v}{\sigma_w}\Big)^2\Big] + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2}\log\frac{4n}{\delta} + \frac{c_1'}{2 c_1 n \sigma_v^2}\log\frac{2}{\delta}, \quad (9)$$

where $c_1' = \frac{c_1}{1-e^{-c_1}}$ and $c_2' = \frac{c_2}{1-e^{-c_2}}$. To get a better understanding, we can simplify the notation and obtain

$$\mathrm{er}(\mathcal{Q}) \le c_1' c_2'\,\widehat{\mathrm{er}}(\mathcal{Q}) + \Big(\sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_v^2} + \frac{c_1'}{2 c_1 n \sigma_v^2}\Big)\|\mathbf{v}_{\mathcal{Q}}\|^2 + \underbrace{\sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2}\,\mathbb{E}_v\|\mathbf{w}_{Q_i} - \tilde{\Phi}_{\mathbf{v}_{\mathcal{Q}}}(S_i)\|^2}_{\text{task-complexity}} + \mathrm{const}(\alpha, \beta, R, \delta, n, m_i, \sigma_v, \sigma_w, c_1, c_2).$$

See Appendix D.3 for the proof.
Similarly to the regular meta-learning bound in Theorem 1, the expected task error $\mathrm{er}(\mathcal{Q})$ is bounded by the empirical task error $\widehat{\mathrm{er}}(\mathcal{Q})$ plus task-complexity and environment-complexity terms. The main innovation here is to exploit the potential to choose the mean of the prior $\mathbf{w}^P$ adaptively, based on the task data $S$. Intuitively, if the LCC-based prior predictor is chosen appropriately, it narrows the divergence between the mean of the prior $\mathbf{w}^{P_i}$ sampled from the hyperposterior $\mathcal{Q}$ and the mean of the posterior $\mathbf{w}_{Q_i}$ in each task. The bound can therefore be tighter than those for regular meta-learning (Pentina & Lampert, 2014; Amit & Meir, 2018). Our empirical study in Section 4 illustrates that the algorithm derived from this bound reduces the task-complexity and thus achieves better performance than methods derived from regular meta-learning bounds. When choosing the number of anchor points $|C|$, there is a balance between accuracy and simplicity of the prior predictor. Increasing $|C|$ increases the expressive power of $\tilde{\Phi}_v(\cdot)$ and reduces the task-complexity term $\mathbb{E}_v\|\mathbf{w}_Q - \tilde{\Phi}_{\mathbf{v}_{\mathcal{Q}}}(S)\|^2$; at the same time, however, it increases the environment-complexity term $\|\mathbf{v}_{\mathcal{Q}}\|^2$ and loosens the bound. If we set $|C|$ to 1, the framework degenerates to regular meta-learning.

3.4. LOCALIZED META-LEARNING ALGORITHM

Since the bound in (9) holds uniformly w.r.t. $\mathcal{Q}$, the guarantees of Theorem 2 also hold for the learned hyperposterior $\mathcal{Q} = \mathcal{N}(\mathbf{v}_{\mathcal{Q}}, \sigma_v^2 I_{d_v})$, so the mean of the prior $\mathbf{w}^P$ sampled from the learned hyperposterior works well for future tasks. The PAC-Bayes localized meta-learning bound in (9) can be compactly written as

$$\sum_{i=1}^n \mathbb{E}_v\,\widehat{\mathrm{er}}_i\big(Q_i = A_b(S_i, P)\big) + \alpha_1\|\mathbf{v}_{\mathcal{Q}}\|^2 + \sum_{i=1}^n \frac{\alpha_2}{m_i}\,\mathbb{E}_v\|\mathbf{w}_{Q_i} - \tilde{\Phi}_{\mathbf{v}_{\mathcal{Q}}}(S_i)\|^2,$$

where $\alpha_1, \alpha_2 > 0$ are hyperparameters. For task $i$, the learning algorithm $A_b(\cdot)$ can be formulated as $\mathbf{w}_i = \arg\min_{\mathbf{w}_{Q_i}} \mathbb{E}_v\,\widehat{\mathrm{er}}_i\big(Q_i = \mathcal{N}(\mathbf{w}_{Q_i}, \sigma_w^2 I_{d_w})\big)$. To make a fair comparison and to guarantee that the benefit of the proposed LML does not come from an improved optimization method, we follow the same learning algorithm as in (Amit & Meir, 2018). Specifically, we jointly optimize the parameters $v$ of the LCC-based prior predictor and the parameters $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$ of the classifiers in each task:

$$\arg\min_{v, \mathbf{w}_1, \ldots, \mathbf{w}_n}\ \sum_{i=1}^n \mathbb{E}_v\,\widehat{\mathrm{er}}_i(\mathbf{w}_i) + \alpha_1\|\mathbf{v}_{\mathcal{Q}}\|^2 + \sum_{i=1}^n \frac{\alpha_2}{m_i}\,\mathbb{E}_v\|\mathbf{w}_{Q_i} - \tilde{\Phi}_{\mathbf{v}_{\mathcal{Q}}}(S_i)\|^2.$$

We can optimize $v$ and $\mathbf{w}$ via mini-batch SGD; the details of the localized meta-learning algorithm are given in Appendix F. The expectation over the Gaussian distribution and its gradient can be estimated efficiently.

Results. The main comparison is reported in Figure 3. The results demonstrate the importance of using a tight generalization bound. Moreover, our proposed LML significantly outperforms the baselines, which validates the effectiveness of the proposed LCC-based prior predictor. This confirms that the LCC-based prior predictor is a more suitable representation for meta-knowledge than the traditional global hyperposterior in ML-A, ML-AM, and ML-PL. Finally, we observe that if a pre-trained feature extractor is provided, all of these methods do better than meta-training from random initialization. This is because the pre-trained feature extractor can be regarded as a data-dependent hyperprior, which is closer to the hyperposterior than a randomly initialized hyperprior.
Therefore, it is able to reduce the environment-complexity term and improve generalization performance. In Figure 4 (b), we show the divergence between the mean of the generated prior $\mathbf{w}^P$ from the meta model and the mean of the learned posterior $\mathbf{w}_Q$ for LML and ML-A. This further validates the effectiveness of the LCC-based prior predictor, which narrows the divergence term and thus tightens the bound. In Figure 4 (a), we vary the number of anchor points $|C|$ in the LCC scheme from 4 to 256; the optimal value is around 64 on both datasets. This indicates that LML is sensitive to the number of anchor points $|C|$, which affects the quality of the LCC-based prior predictor and hence the performance of LML.
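The compact training objective of Section 3.4, whose last term is the prior-posterior divergence plotted in Figure 4 (b), can be sketched numerically as follows (toy placeholder values; `a1` and `a2` stand for the hyperparameters $\alpha_1$ and $\alpha_2$, and the empirical errors and prior means are stand-ins for quantities a real run would compute):

```python
import numpy as np

def lml_objective(emp_errs, w_posts, prior_means, v_q, m, a1=0.01, a2=0.1):
    """Compact LML objective:
    sum_i er_hat_i + a1 * ||v_Q||^2 + sum_i (a2 / m_i) * ||w_Qi - Phi_v(S_i)||^2,
    where prior_means[i] is the LCC prior predictor's output on task i."""
    loss = float(np.sum(emp_errs)) + a1 * float(v_q @ v_q)
    for w_q, w_p, m_i in zip(w_posts, prior_means, m):
        loss += (a2 / m_i) * float(np.sum((w_q - w_p) ** 2))
    return loss

rng = np.random.default_rng(5)
v_q = rng.normal(size=8)                           # hyperposterior mean v_Q
w_posts = [rng.normal(size=4) for _ in range(2)]   # posterior means w_Qi
priors = [rng.normal(size=4) for _ in range(2)]    # Phi_v(S_i), one per task
loss = lml_objective([0.2, 0.3], w_posts, priors, v_q, m=[50, 80])
print(loss > 0.5)   # True: the penalty terms only add to the empirical error
```

In the actual algorithm all three terms are differentiable, so `v_q` and the `w_posts` are updated jointly by mini-batch SGD as described in Appendix F.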

5. CONCLUSION

This work contributes a novel localized meta-learning framework from both theoretical and computational perspectives. To tailor meta-knowledge to individual tasks, we formulate the meta model as a mapping function that leverages the samples of the target task and produces task-specific meta-knowledge as a prior. Quantitatively, this idea provides a means to theoretically tighten the PAC-Bayes meta-learning generalization bound. We propose an LCC-based prior predictor that outputs localized meta-knowledge using task information, and we further develop a practical algorithm with deep neural networks by minimizing the generalization bound. An interesting topic for future work is to explore other principles for constructing the prior predictor and to apply the localized meta-learning framework to more realistic scenarios where tasks are sampled non-i.i.d. from an environment. Another challenging problem is to extend our techniques to derive localized meta-learning algorithms for regression and reinforcement learning problems.

the prior term in a differentially private way. In summary, these methods construct quantities that reflect the underlying data distribution, rather than the sample set, and then choose the prior $P$ based on these quantities. These works, however, are only applicable to single-task problems and cannot transfer knowledge across tasks in the meta-learning setting.

B OPTIMIZATION OF LCC

We minimize the inequality in (8) to obtain a set of anchor points. Following Yu et al. (2009), we simplify the localization error term by assuming $\tilde{x} = x$, and then optimize the following objective:

$$\arg\min_{\gamma, C}\ \sum_{i=1}^n \sum_{x_j\in S_i}\Big(\alpha\|x_j - \tilde{x}_j\|^2 + \beta\sum_{u\in C}\|x_j - u\|^2\Big) \quad \text{s.t.} \quad \forall x,\ \sum_{u\in C}\gamma_u(x) = 1,$$

where $\tilde{x} = \sum_{u\in C}\gamma_u(x)u$. In practice, we update $C$ and $\gamma$ by alternately solving a LASSO problem and a least-squares regression problem, respectively.
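The alternating scheme can be sketched in numpy as below. For brevity, the LASSO step is replaced here by a sum-to-one constrained least-squares coding (an assumption made for this sketch; the paper uses a LASSO solver), while the anchor update is the least-squares regression step:

```python
import numpy as np

def _coding(x, C):
    # Sum-to-one least-squares coding of x over anchors C via the KKT system
    # (a simplified stand-in for the LASSO step of the alternating scheme).
    k = len(C)
    K = np.block([[C @ C.T, np.ones((k, 1))], [np.ones((1, k)), np.zeros((1, 1))]])
    sol = np.linalg.lstsq(K, np.concatenate([C @ x, [1.0]]), rcond=None)[0]
    return sol[:k]

def fit_lcc(X, n_anchor=4, n_iter=10, seed=0):
    """Alternating optimization for (gamma, C): with C fixed, fit the codings;
    with the codings fixed, update the anchors by least-squares regression."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_anchor, replace=False)].copy()
    for _ in range(n_iter):
        G = np.stack([_coding(x, C) for x in X])        # n x |C| codings
        C = np.linalg.lstsq(G, X, rcond=None)[0]        # min_C ||X - G C||_F^2
    G = np.stack([_coding(x, C) for x in X])
    return G, C

X = np.random.default_rng(1).normal(size=(40, 2))
G, C = fit_lcc(X)
recon = np.linalg.norm(X - G @ C) / np.linalg.norm(X)
print(recon < 1e-6)   # True here: 4 generic anchors affinely span R^2
```

A real LCC fit would additionally penalize non-local codings (the $\beta$ term) so that each point relies only on nearby anchors.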

C NOTATIONS

Under review as a conference paper at ICLR 2021

Let $\phi_v(\cdot): \mathbb{R}^d \to \mathbb{R}^{d_w}$ be the feature embedding function. $m_{ik}$ denotes the number of samples belonging to category $k$. $S_{ik}$ and $\mathcal{D}_{ik}$ are the sample set and data distribution for category $k$ in task $i$, respectively. Then, the expected prior predictor w.r.t. class $k$ in task $i$ is defined as:

$$\mathbf{w}_i^P[k] = \Phi_v(\mathcal{D}_{ik}^{m_{ik}}) = \mathbb{E}_{S_{ik}\sim\mathcal{D}_{ik}^{m_{ik}}}\,\frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\phi_v(x_j).$$

The empirical prior predictor w.r.t. class $k$ in task $i$ is defined as:

$$\hat{\mathbf{w}}_i^P[k] = \hat{\Phi}_v(S_{ik}) = \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\phi_v(x_j).$$

The LCC-based prior predictor w.r.t. class $k$ in task $i$ is defined as:

$$\tilde{\mathbf{w}}_i^P[k] = \tilde{\Phi}_v(S_{ik}) = \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\sum_{u\in C}\gamma_u(x_j)\,\phi_v(u).$$

D.1 PROOF OF LEMMA 1

This lemma bounds the error between the empirical prior predictor $\hat{\mathbf{w}}_i^P[k]$ and the LCC-based prior predictor $\tilde{\mathbf{w}}_i^P[k]$.

Lemma 1. Given the definitions of $\hat{\mathbf{w}}_i^P[k]$ and $\tilde{\mathbf{w}}_i^P[k]$ in Eq. (6) and Eq. (7), let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$ and $\phi_v$ be an $(\alpha, \beta)$-Lipschitz smooth function. Then for all $x \in \mathbb{R}^d$,

$$\|\hat{\mathbf{w}}_i^P[k] - \tilde{\mathbf{w}}_i^P[k]\| \le \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\alpha\|x_j - \tilde{x}_j\| + \beta\sum_{u\in C}\|\tilde{x}_j - u\|^2\Big) = O_{\alpha,\beta}(\gamma, C),$$

where $\tilde{x}_j = \sum_{u\in C}\gamma_u(x_j)u$.

Proof. Let $\tilde{x}_j = \sum_{u\in C}\gamma_u(x_j)u$. We have

$$\begin{aligned}
\|\hat{\Phi}_v(S_{ik}) - \tilde{\Phi}_v(S_{ik})\|_2 &= \frac{1}{m_{ik}}\Big\|\sum_{x_j\in S_{ik}}\Big(\phi_v(x_j) - \sum_{u\in C}\gamma_u(x_j)\phi_v(u)\Big)\Big\|_2 \\
&\le \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\|\phi_v(x_j) - \phi_v(\tilde{x}_j)\|_2 + \Big\|\sum_{u\in C}\gamma_u(x_j)\big(\phi_v(u) - \phi_v(\tilde{x}_j)\big)\Big\|_2\Big) \\
&= \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\|\phi_v(x_j) - \phi_v(\tilde{x}_j)\|_2 + \Big\|\sum_{u\in C}\gamma_u(x_j)\big(\phi_v(u) - \phi_v(\tilde{x}_j) - \nabla\phi_v(\tilde{x}_j)^\top(u - \tilde{x}_j)\big)\Big\|_2\Big) \\
&\le \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\|\phi_v(x_j) - \phi_v(\tilde{x}_j)\|_2 + \sum_{u\in C}|\gamma_u(x_j)|\,\big\|\phi_v(u) - \phi_v(\tilde{x}_j) - \nabla\phi_v(\tilde{x}_j)^\top(u - \tilde{x}_j)\big\|_2\Big) \\
&\le \frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\alpha\|x_j - \tilde{x}_j\|_2 + \beta\sum_{u\in C}\|\tilde{x}_j - u\|_2^2\Big) = O_{\alpha,\beta}(\gamma, C).
\end{aligned}$$

In the above derivation, the first inequality holds by the triangle inequality. The second equality holds since $\sum_{u\in C}\gamma_u(x_j) = 1$ for all $x_j$, so the linear term $\nabla\phi_v(\tilde{x}_j)^\top(u - \tilde{x}_j)$ sums to zero over $u$. The last inequality uses the assumption of $(\alpha, \beta)$-Lipschitz smoothness of $\phi_v(\cdot)$. This implies the desired bound.

This lemma demonstrates that the quality of the LCC approximation is bounded by two terms: the first term $\|x_j - \tilde{x}_j\|_2$ indicates that $x$ should be close to its physical approximation $\tilde{x}$; the second term $\|\tilde{x}_j - u\|$ implies that the coding should be localized. According to the Manifold Coding Theorem in Yu et al. (2009), if the data points $x$ lie on a compact smooth manifold $\mathcal{M}$, then given any $\epsilon > 0$ there exist anchor points $C \subset \mathcal{M}$ and a coding $\gamma$ such that

$$\frac{1}{m_{ik}}\sum_{x_j\in S_{ik}}\Big(\alpha\|x_j - \tilde{x}_j\|_2 + \beta\sum_{u\in C}\|\tilde{x}_j - u\|_2^2\Big) \le \big[\alpha c_{\mathcal{M}} + (1 + 5\sqrt{d_{\mathcal{M}}})\beta\big]\epsilon^2.$$

This shows that the approximation error of local coordinate coding depends on the intrinsic dimension of the manifold instead of the dimension of the input.

D.2 PROOF OF LEMMA 2

To prove Lemma 2, we first introduce a relevant theorem.

Theorem 3 (Vector-valued extension of McDiarmid's inequality, Rivasplata et al. (2018)). Let $X_1, \ldots, X_m \in \mathcal{X}$ be independent random variables and $f: \mathcal{X}^m \to \mathbb{R}^{d_w}$ be a vector-valued mapping function. If, for all $i \in \{1, \ldots, m\}$ and for all $x_1, \ldots, x_m, x_i' \in \mathcal{X}$, the function $f$ satisfies

$$\sup_{x_i, x_i'} \big\|f(x_{1:i-1}, x_i, x_{i+1:m}) - f(x_{1:i-1}, x_i', x_{i+1:m})\big\| \le c_i, \quad (15)$$

then $\mathbb{E}\,\big\|f(X_{1:m}) - \mathbb{E}[f(X_{1:m})]\big\| \le \sqrt{\sum_{i=1}^m c_i^2}$, and for any $\delta \in (0, 1)$, with probability at least $1-\delta$,

$$\big\|f(X_{1:m}) - \mathbb{E}[f(X_{1:m})]\big\| \le \sqrt{\sum_{i=1}^m c_i^2} + \sqrt{\frac{\sum_{i=1}^m c_i^2}{2}\log\frac{1}{\delta}}.$$

The theorem indicates that bounded differences in norm imply concentration of $f(X_{1:m})$ around its mean in norm, i.e., $\|f(X_{1:m}) - \mathbb{E}[f(X_{1:m})]\|$ is small with high probability. We now bound the error between the expected prior predictor $\mathbf{w}_i^P$ and the empirical prior predictor $\hat{\mathbf{w}}_i^P$.

Lemma 3. Given the definitions of $\mathbf{w}_i^P[k]$ and $\hat{\mathbf{w}}_i^P[k]$ in (5) and (6), let $\mathcal{X}$ be a compact set with radius $R$, i.e., $\forall x, x' \in \mathcal{X}$, $\|x - x'\| \le R$. For any $\delta \in (0, 1]$, with probability at least $1-\delta$ we have

$$\|\mathbf{w}_i^P[k] - \hat{\mathbf{w}}_i^P[k]\| \le \frac{\alpha R}{\sqrt{m_{ik}}}\Big(1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}\Big).$$

Proof. According to the definition of $\hat{\Phi}_v(\cdot)$ in (6), for all points $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{m_k}, x_j'$ in the sample set $S_{ik}$, we have

$$\sup_{x_j, x_j'} \big\|\hat{\Phi}_v(x_{1:j-1}, x_j, x_{j+1:m_k}) - \hat{\Phi}_v(x_{1:j-1}, x_j', x_{j+1:m_k})\big\| = \frac{1}{m_{ik}}\sup_{x_j, x_j'}\|\phi_v(x_j) - \phi_v(x_j')\| \le \frac{1}{m_{ik}}\sup_{x_j, x_j'}\alpha\|x_j - x_j'\| \le \frac{\alpha R}{m_{ik}},$$

where $R$ is the radius of the domain $\mathcal{X}$. The first inequality follows from the Lipschitz smoothness of $\phi_v(\cdot)$ and the second from the definition of the domain $\mathcal{X}$. Applying Theorem 3, for any $\delta \in (0, 1]$, with probability at least $1-\delta$ we have

$$\|\mathbf{w}_i^P[k] - \hat{\mathbf{w}}_i^P[k]\| = \big\|\hat{\Phi}_v(S_{ik}) - \mathbb{E}[\hat{\Phi}_v(S_{ik})]\big\| \le \frac{\alpha R}{\sqrt{m_{ik}}}\Big(1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}\Big).$$

This implies the bound.
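Lemma 3's concentration statement can be checked by simulation. The sketch below uses tanh as a 1-Lipschitz embedding ($\alpha = 1$) on the domain $[-1, 1]^d$, whose radius is $R = 2\sqrt{d}$; all the constants are illustrative, and the true mean is approximated by a large reference sample:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, trials, delta = 3, 50, 200, 0.1
R = 2 * np.sqrt(d)                      # radius of the domain [-1, 1]^d
alpha = 1.0                             # tanh is 1-Lipschitz

phi = np.tanh
# reference estimate of the expected embedding E[phi(x)]
mu = phi(rng.uniform(-1, 1, size=(200000, d))).mean(axis=0)

# high-probability deviation bound of Lemma 3: (alpha R / sqrt(m)) (1 + sqrt(log(1/delta)/2))
bound = (alpha * R / np.sqrt(m)) * (1 + np.sqrt(0.5 * np.log(1 / delta)))
fails = 0
for _ in range(trials):
    S = rng.uniform(-1, 1, size=(m, d))
    dev = np.linalg.norm(phi(S).mean(axis=0) - mu)
    fails += dev > bound
print(fails / trials)   # should be at most delta; the bound is loose, so typically 0
```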
Lemma 3 shows that the bounded difference of the function $\hat{\Phi}_v(\cdot)$ implies its concentration, which can be further used to bound the difference between the empirical prior predictor $\hat{w}_{P_i}[k]$ and the expected prior predictor $w_{P_i}[k]$. We now bound the error between the expected prior predictor $w_{P_i}$ and the LCC-based prior predictor $\tilde{w}_{P_i}$.

Lemma 2 Given the definitions of $w_{P_i}$ and $\tilde{w}_{P_i}$ in (5) and (7), let $\mathcal{X}$ be a compact set with radius $R$, i.e., $\forall x, x' \in \mathcal{X}$, $\| x - x' \| \le R$. For any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have
$$\big\| w_{P_i} - \tilde{w}_{P_i} \big\|^2 \le \sum_{k=1}^K \left( \frac{\alpha R}{\sqrt{m_{ik}}} \Big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{1}{\delta}} \Big) + O_{\alpha,\beta}(\gamma, C) \right)^2.$$

Proof. According to the definitions of $w_{P_i}$, $\tilde{w}_{P_i}$ and $\hat{w}_{P_i}$, we have
$$\begin{aligned}
\big\| w_{P_i} - \tilde{w}_{P_i} \big\|^2 &= \sum_{k=1}^K \big\| w_{P_i}[k] - \tilde{w}_{P_i}[k] \big\|^2 = \sum_{k=1}^K \big\| \mathbb{E}[\hat{\Phi}_v(S_{ik})] - \hat{\Phi}_v(S_{ik}) + \hat{\Phi}_v(S_{ik}) - \tilde{\Phi}_v(S_{ik}) \big\|^2 \\
&\le \sum_{k=1}^K \Big( \big\| \mathbb{E}[\hat{\Phi}_v(S_{ik})] - \hat{\Phi}_v(S_{ik}) \big\|^2 + \big\| \hat{\Phi}_v(S_{ik}) - \tilde{\Phi}_v(S_{ik}) \big\|^2 + 2 \big\| \mathbb{E}[\hat{\Phi}_v(S_{ik})] - \hat{\Phi}_v(S_{ik}) \big\| \, \big\| \hat{\Phi}_v(S_{ik}) - \tilde{\Phi}_v(S_{ik}) \big\| \Big).
\end{aligned}$$
Substituting Lemma 3 and Lemma 1 into the above inequality, we derive
$$\mathbb{P}_{S_{ik} \sim D_k^{m_k}} \left\{ \big\| w_{P_i} - \tilde{w}_{P_i} \big\|^2 \le \sum_{k=1}^K \left( \frac{\alpha R}{\sqrt{m_{ik}}} \Big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{1}{\delta}} \Big) + O_{\alpha,\beta}(\gamma, C) \right)^2 \right\} \ge 1 - \delta.$$
This gives the assertion. Lemma 2 shows that the approximation error between the expected prior predictor and the LCC-based prior predictor depends on the number of samples in each category and on the quality of the LCC coding scheme.
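The concentration behind Lemma 3 is easy to verify numerically. The sketch below (our own toy setup, not from the paper: a 1-Lipschitz feature map on a compact 1-D domain, so $\alpha = R = 1$) checks that the empirical feature mean deviates from its expectation at the $\alpha R / \sqrt{m}$ scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """A 1-Lipschitz vector-valued feature map (so alpha = 1)."""
    return np.stack([np.sin(x), np.cos(x)], axis=-1)

m = 400                                   # samples per trial
# Closed-form E[phi(U)] for U ~ Uniform[0, 1].
true_mean = np.array([1 - np.cos(1.0), np.sin(1.0)])

deviations = []
for _ in range(200):                      # 200 independent trials
    S = rng.uniform(0.0, 1.0, size=m)     # compact domain of radius R = 1
    deviations.append(np.linalg.norm(phi(S).mean(axis=0) - true_mean))

# Lemma-3-style bound with alpha = R = 1 and delta = 0.01.
bound = (1 / np.sqrt(m)) * (1 + np.sqrt(0.5 * np.log(1 / 0.01)))
assert max(deviations) < bound
```

Every trial lands well inside the bound, illustrating why the $\alpha R / \sqrt{m_{ik}}$ term in Lemma 2 shrinks as the per-category sample size grows.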

D.3 PROOF OF THEOREM 2

Theorem 2 Let $Q_i$ be the posterior of the base learner, $Q_i = \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})$, and let $P_i$ be the prior $\mathcal{N}(\tilde{\Phi}_v(S_i), \sigma_w^2 I_{d_w})$. The mean of the prior is produced by the LCC-based prior predictor $\tilde{\Phi}_v(S_i)$ in Eq. (7), whose parameter $v$ is sampled from the hyperposterior of the meta learner $\mathcal{Q} = \mathcal{N}(v_Q, \sigma_v^2 I_{d_v})$. Given the hyperprior $\mathcal{P} = \mathcal{N}(0, \sigma_v^2 I_{d_v})$, for any hyperposterior $\mathcal{Q}$, any $c_1, c_2 > 0$ and any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have
$$\begin{aligned}
er(\mathcal{Q}) \le\; & c_1' c_2' \, \hat{er}(\mathcal{Q}) + \Big( \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_v^2} + \frac{c_1'}{2 c_1 n \sigma_v^2} \Big) \| v_Q \|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2} \, \mathbb{E}_v \big\| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \big\|^2 \\
& + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i} \left[ \frac{1}{\sigma_w^2} \sum_{k=1}^K \left( \frac{\alpha R}{\sqrt{m_{ik}}} \Big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{4n}{\delta}} \Big) + O_{\alpha,\beta}(\gamma, C) \right)^2 + d_w K \Big( \frac{\sigma_v}{\sigma_w} \Big)^2 \right] \\
& + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i} \log\frac{4n}{\delta} + \frac{c_1'}{c_1 n} \log\frac{2}{\delta},
\end{aligned}$$
where $c_1' = \frac{c_1}{1 - e^{-c_1}}$ and $c_2' = \frac{c_2}{1 - e^{-c_2}}$. We can simplify the notation and obtain
$$er(\mathcal{Q}) \le c_1' c_2' \, \hat{er}(\mathcal{Q}) + \Big( \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_v^2} + \frac{c_1'}{2 c_1 n \sigma_v^2} \Big) \| v_Q \|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2} \, \mathbb{E}_v \big\| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \big\|^2 + \mathrm{const}(\alpha, \beta, R, \delta, n, m_i).$$

Proof. Our proof contains two steps. First, we bound the error within the observed tasks due to observing a limited number of samples. Then we bound the error at the task-environment level due to observing a finite number of tasks. Both steps utilize Catoni's classical PAC-Bayes bound Catoni (2007) to measure the error. We give here a general statement of Catoni's classical PAC-Bayes bound.

Theorem 4 (Classical PAC-Bayes bound, general notation). Let $\mathcal{X}$ be a sample space and $X$ some distribution over $\mathcal{X}$, and let $\mathcal{F}$ be a hypothesis space of functions over $\mathcal{X}$. Define a loss function $g(f, X) : \mathcal{F} \times \mathcal{X} \to [0, 1]$, and let $X_1^G \triangleq \{X_1, \ldots, X_G\}$ be a sequence of $G$ independent random variables distributed according to $X$. Let $\pi$ be some prior distribution over $\mathcal{F}$ (which must not depend on the samples $X_1, \ldots, X_G$).
For any $\delta \in (0, 1]$ and any $c > 0$, the following bound holds uniformly for all posterior distributions $\rho$ over $\mathcal{F}$ (even sample dependent):
$$\mathbb{P}_{X_1^G \sim X^G} \left\{ \mathbb{E}_{X \sim X} \, \mathbb{E}_{f \sim \rho} \, g(f, X) \le \frac{c}{1 - e^{-c}} \cdot \frac{1}{G} \sum_{g=1}^G \mathbb{E}_{f \sim \rho} \, g(f, X_g) + \frac{KL(\rho \| \pi) + \log\frac{1}{\delta}}{(1 - e^{-c})\, G}, \;\; \forall \rho \right\} \ge 1 - \delta.$$

First step. We utilize Theorem 4 to bound the generalization error in each of the observed tasks. Let $i \in \{1, \ldots, n\}$ be the index of a task. For task $i$, we substitute the following definitions into Catoni's PAC-Bayes bound. Specifically, $X_g \triangleq (x_{ij}, y_{ij})$ and $G \triangleq m_i$ denote the samples, and $X \triangleq D_i$ denotes the data distribution. We instantiate the hypotheses with a hierarchical model $f \triangleq (v, w)$, where $v \in \mathbb{R}^{d_v}$ and $w \in \mathbb{R}^{d_w}$ are the parameters of the meta learner (prior predictor) $\Phi_v(\cdot)$ and the base learner $h(\cdot)$, respectively. The loss function only considers the base learner and is defined as $g(f, X) \triangleq \ell(h_w(x), y)$. The prior over the model parameters is represented as $\pi \triangleq (\mathcal{P}, P_i) = (\mathcal{N}(0, \sigma_v^2 I_{d_v}), \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w}))$, a Gaussian distribution (hyperprior of the meta learner) centered at $0$ and a Gaussian distribution (prior of the base learner) centered at $w_{P_i}$, respectively. We set the posterior to $\rho \triangleq (\mathcal{Q}, Q_i) = (\mathcal{N}(v_Q, \sigma_v^2 I_{d_v}), \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w}))$, a Gaussian distribution (hyperposterior of the meta learner) centered at $v_Q$ and a Gaussian distribution (posterior of the base learner) centered at $w_{Q_i}$. According to Theorem 4, the generalization bound holds for any posterior distribution, including the one generated in our localized meta-learning framework. Specifically, we first sample $v$ from the hyperposterior $\mathcal{N}(v_Q, \sigma_v^2 I_{d_v})$ and estimate $w_{P_i}$ by leveraging the expected prior predictor $w_{P_i} = \Phi_v(D_i)$. The base-learner algorithm $A^b(S_i, P_i)$ utilizes the sample set $S_i$ and the prior $P_i = \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})$ to produce a posterior $Q_i = A^b(S_i, P_i) = \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})$. Then we sample the base-learner parameter $w$ from the posterior $\mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})$ and compute the incurred loss $\ell(h_w(x), y)$. On the whole, the meta-learning algorithm $A^m(S_1, \ldots$
$, S_n, \mathcal{P})$ observes a series of tasks $S_1, \ldots, S_n$ and adjusts its hyperprior $\mathcal{P} = \mathcal{N}(v_P, \sigma_v^2 I_{d_v})$ (with $v_P = 0$ in our setting) into the hyperposterior $\mathcal{Q} = A^m(S_1, \ldots, S_n, \mathcal{P}) = \mathcal{N}(v_Q, \sigma_v^2 I_{d_v})$. The KL divergence term between the prior $\pi$ and the posterior $\rho$ is computed as follows:
$$\begin{aligned}
KL(\rho \| \pi) &= \mathbb{E}_{f \sim \rho} \log\frac{\rho(f)}{\pi(f)} = \mathbb{E}_{v \sim \mathcal{N}(v_Q, \sigma_v^2 I_{d_v})} \mathbb{E}_{w \sim \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})} \log\frac{\mathcal{N}(v_Q, \sigma_v^2 I_{d_v}) \, \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})}{\mathcal{N}(0, \sigma_v^2 I_{d_v}) \, \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})} \\
&= \mathbb{E}_{v} \log\frac{\mathcal{N}(v_Q, \sigma_v^2 I_{d_v})}{\mathcal{N}(0, \sigma_v^2 I_{d_v})} + \mathbb{E}_{v} \mathbb{E}_{w} \log\frac{\mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})}{\mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})} = \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \mathbb{E}_{v} \Big[ \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 \Big].
\end{aligned}$$
In our localized meta-learning framework, in order to make $KL(Q_i \| P_i)$ small, the center of the prior distribution $w_{P_i}$ is generated by the expected prior predictor $w_{P_i} = \Phi_v(D_i)$. However, the data distribution $D_i$ is unknown, and our only insight into $D_{ik}$ is through the sample set $S_{ik}$. In this work, we approximate the expected prior predictor $\Phi_v(D_i)$ with the LCC-based prior predictor $\tilde{w}_{P_i} = \tilde{\Phi}_v(S_i)$. Writing $\mathbb{E}_v$ for $\mathbb{E}_{v \sim \mathcal{N}(v_Q, \sigma_v^2 I_{d_v})}$ for convenience, we have
$$\begin{aligned}
\mathbb{E}_v \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 &= \mathbb{E}_v \frac{1}{2\sigma_w^2} \| w_{Q_i} - \tilde{w}_{P_i} + \tilde{w}_{P_i} - w_{P_i} \|^2 \\
&= \mathbb{E}_v \frac{1}{2\sigma_w^2} \big[ \| w_{Q_i} - \tilde{w}_{P_i} \|^2 + \| \tilde{w}_{P_i} - w_{P_i} \|^2 + 2 (w_{Q_i} - \tilde{w}_{P_i})^\top (\tilde{w}_{P_i} - w_{P_i}) \big] \\
&\le \mathbb{E}_v \frac{1}{2\sigma_w^2} \big[ \| w_{Q_i} - \tilde{w}_{P_i} \|^2 + \| \tilde{w}_{P_i} - w_{P_i} \|^2 + 2 \| w_{Q_i} - \tilde{w}_{P_i} \| \, \| \tilde{w}_{P_i} - w_{P_i} \| \big] \\
&\le \frac{1}{\sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_v(S_i) \|^2 + \frac{1}{\sigma_w^2} \mathbb{E}_v \| \tilde{w}_{P_i} - w_{P_i} \|^2,
\end{aligned}$$
where the last step uses $2ab \le a^2 + b^2$. Since $\tilde{w}_{P_i} = \tilde{\Phi}_v(S_i) = [\tilde{\Phi}_v(S_{i1}), \ldots, \tilde{\Phi}_v(S_{ik}), \ldots, \tilde{\Phi}_v(S_{iK})]$, we have
$$\begin{aligned}
\mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_v(S_i) \|^2 &= \sum_{k=1}^K \mathbb{E}_v \| w_{Q_i}[k] - \tilde{\Phi}_v(S_{ik}) \|^2 \\
&= \sum_{k=1}^K \Big( \mathbb{E}_v \| w_{Q_i}[k] \|^2 - 2 (\mathbb{E}_v w_{Q_i}[k])^\top \tilde{\Phi}_{v_Q}(S_{ik}) + \| \tilde{\Phi}_{v_Q}(S_{ik}) \|^2 + \mathbb{V}_v[\tilde{\Phi}_v(S_{ik})] \Big) \\
&= \sum_{k=1}^K \Big( \mathbb{E}_v \| w_{Q_i}[k] - \tilde{\Phi}_{v_Q}(S_{ik}) \|^2 + \frac{d_v}{|C|} \sigma_v^2 \Big) = \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 + d_w K \sigma_v^2,
\end{aligned}$$
where $\mathbb{V}_v[\tilde{\Phi}_v(S_{ik})]$ denotes the variance of $\tilde{\Phi}_v(S_{ik})$. The last equality uses the fact that $d_v = |C| \, d_w$.
Combining Lemma 2, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have
$$\mathbb{E}_v \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 \le \frac{1}{\sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 + d_w K \Big( \frac{\sigma_v}{\sigma_w} \Big)^2 + \frac{1}{\sigma_w^2} \sum_{k=1}^K \left( \frac{\alpha R}{\sqrt{m_{ik}}} \Big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{1}{\delta}} \Big) + O_{\alpha,\beta}(\gamma, C) \right)^2.$$
Then, according to Theorem 4, for any $\delta_i / 2 > 0$,
$$\begin{aligned}
\mathbb{P}_{S_i \sim D_i^{m_i}} \Bigg\{ \mathbb{E}_{(x,y) \sim D_i} \mathbb{E}_{v} \mathbb{E}_{w \sim \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})} \ell(h_w(x), y) \le\; & \frac{c_2}{1 - e^{-c_2}} \cdot \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbb{E}_{v} \mathbb{E}_{w} \, \ell(h_w(x_j), y_j) \\
& + \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \mathbb{E}_v \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 + \log\frac{2}{\delta_i} \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \frac{\delta_i}{2}
\end{aligned}$$
for all observed tasks $i = 1, \ldots, n$. Applying the preceding inequality with confidence $\delta_i / 2$ as well and taking a union bound over the two events, we obtain
$$\begin{aligned}
\mathbb{P}_{S_i \sim D_i^{m_i}} \Bigg\{ \mathbb{E}_{(x,y) \sim D_i} \mathbb{E}_{v} \mathbb{E}_{w} \, \ell(h_w(x), y) \le\; & \frac{c_2}{1 - e^{-c_2}} \cdot \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbb{E}_{v} \mathbb{E}_{w} \, \ell(h_w(x_j), y_j) + \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \frac{1}{\sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 \\
& + \log\frac{2}{\delta_i} + d_w K \Big( \frac{\sigma_v}{\sigma_w} \Big)^2 + \frac{1}{\sigma_w^2} \sum_{k=1}^K \Big( \frac{\alpha R}{\sqrt{m_{ik}}} \big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{2}{\delta_i}} \big) + O_{\alpha,\beta}(\gamma, C) \Big)^2 \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta_i.
\end{aligned}$$
Using the notations in Section 3, the above bound can be simplified as
$$\begin{aligned}
\mathbb{P}_{S_i \sim D_i^{m_i}} \Bigg\{ \mathbb{E}_{v \sim \mathcal{N}(v_Q, \sigma_v^2 I_{d_v}),\, w_{P_i} = \Phi_v(D_i),\, P_i = \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})} \, er(A^b(S_i, P_i)) \le\; & \frac{c_2}{1 - e^{-c_2}} \, \mathbb{E}_{v,\, w_{P_i},\, P_i} \, \hat{er}(A^b(S_i, P_i)) \\
& + \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \frac{1}{\sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 + \log\frac{2}{\delta_i} \\
& + d_w K \Big( \frac{\sigma_v}{\sigma_w} \Big)^2 + \frac{1}{\sigma_w^2} \sum_{k=1}^K \Big( \frac{\alpha R}{\sqrt{m_{ik}}} \big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{2}{\delta_i}} \big) + O_{\alpha,\beta}(\gamma, C) \Big)^2 \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta_i.
\end{aligned}$$

Second step. Next we bound the error due to observing a limited number of tasks from the environment. We reuse Theorem 4 with the following substitutions. The samples are $(D_i, m_i, S_i)$, $i = 1, \ldots, n$, where $(D_i, m_i)$ are sampled from the meta distribution $\tau$ and $S_i \sim D_i^{m_i}$. The hypothesis is parameterized as $\Phi_v(D)$ with meta-learner parameter $v$. The loss function is $g(f, X) \triangleq \mathbb{E}_{(x,y) \sim D} \mathbb{E}_{w \sim \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})} \ell(h_w(x), y)$, where $w_{Q_i} = A^b(S_i, P_i)$.
Let $\pi \triangleq \mathcal{N}(0, \sigma_v^2 I_{d_v})$ be the prior over the meta-learner parameter. The following holds for any $\delta_0 > 0$:
$$\mathbb{P}_{(D_i, m_i) \sim \tau,\, S_i \sim D_i^{m_i},\, i=1,\ldots,n} \Bigg\{ er(\mathcal{Q}) \le \frac{c_1}{1 - e^{-c_1}} \cdot \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{v,\, w_{P_i} = \Phi_v(D_i),\, P_i} \, er(A^b(S_i, P_i)) + \frac{1}{(1 - e^{-c_1})\, n} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \log\frac{1}{\delta_0} \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta_0.$$
Finally, by employing the union bound with $\delta_i = \delta / (2n)$ for the intra-task events and $\delta_0 = \delta / 2$ for the inter-task event, we bound the probability of the intersection of the events:
$$\begin{aligned}
\mathbb{P} \Bigg\{ er(\mathcal{Q}) \le\; & \frac{c_1 c_2}{(1 - e^{-c_1})(1 - e^{-c_2})} \cdot \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{v,\, w_{P_i},\, P_i} \, \hat{er}(A^b(S_i, P_i)) \\
& + \frac{c_1}{1 - e^{-c_1}} \cdot \frac{1}{n} \sum_{i=1}^n \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \frac{1}{\sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 + \log\frac{4n}{\delta} \\
& \qquad + \frac{1}{\sigma_w^2} \sum_{k=1}^K \Big( \frac{\alpha R}{\sqrt{m_{ik}}} \big( 1 + \sqrt{\tfrac{1}{2} \log\tfrac{4n}{\delta}} \big) + O_{\alpha,\beta}(\gamma, C) \Big)^2 + d_w K \Big( \frac{\sigma_v}{\sigma_w} \Big)^2 \Big] \\
& + \frac{1}{(1 - e^{-c_1})\, n} \Big[ \frac{1}{2\sigma_v^2} \| v_Q \|^2 + \log\frac{2}{\delta} \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta.
\end{aligned}$$
We can further simplify the notation and obtain
$$\mathbb{P} \Big\{ er(\mathcal{Q}) \le c_1' c_2' \, \hat{er}(\mathcal{Q}) + \Big( \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_v^2} + \frac{c_1'}{2 c_1 n \sigma_v^2} \Big) \| v_Q \|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i \sigma_w^2} \mathbb{E}_v \| w_{Q_i} - \tilde{\Phi}_{v_Q}(S_i) \|^2 + \mathrm{const}(\alpha, \beta, R, \delta, n, m_i), \; \forall \mathcal{Q} \Big\} \ge 1 - \delta,$$
where $c_1' = \frac{c_1}{1 - e^{-c_1}}$ and $c_2' = \frac{c_2}{1 - e^{-c_2}}$. This completes the proof.

D.4 PROOF OF THEOREM 1

Theorem 1 Let $Q_i$ be the posterior of the base learner, $Q_i = \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})$, and let $P_i$ be the prior $\mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})$. The mean of the prior is sampled from the hyperposterior of the meta learner $\mathcal{Q} = \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})$. Given the hyperprior $\mathcal{P} = \mathcal{N}(0, \sigma_w^2 I_{d_w})$, for any hyperposterior $\mathcal{Q}$, any $c_1, c_2 > 0$ and any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have
$$er(\mathcal{Q}) \le c_1' c_2' \, \hat{er}(\mathcal{Q}) + \Big( \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2} + \frac{c_1'}{2 c_1 n \sigma_w^2} \Big) \| w_Q \|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{2 c_2 n m_i \sigma_w^2} \mathbb{E}_{w_P} \| w_{Q_i} - w_Q \|^2 + \sum_{i=1}^n \frac{c_1' c_2'}{c_2 n m_i} \Big( \frac{1}{2} + \log\frac{2n}{\delta} \Big) + \frac{c_1'}{c_1 n} \log\frac{2}{\delta},$$
where $c_1' = \frac{c_1}{1 - e^{-c_1}}$ and $c_2' = \frac{c_2}{1 - e^{-c_2}}$.
Proof. Instead of generating the mean of the prior with a prior predictor, the vanilla meta-learning framework directly produces the mean of the prior $w_P$ by sampling from the hyperposterior $\mathcal{Q} = \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})$. The base-learner algorithm $A^b(S_i, P_i)$ then utilizes the sample set $S_i$ and the prior $P_i = \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})$ to produce a posterior $Q_i = A^b(S_i, P_i) = \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})$. Similarly to the two-step proof of Theorem 2, we first get an intra-task bound by using Theorem 4. For any $\delta_i > 0$, we have
$$\begin{aligned}
\mathbb{P}_{S_i \sim D_i^{m_i}} \Bigg\{ \mathbb{E}_{(x,y) \sim D_i} \mathbb{E}_{w_P \sim \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})} \mathbb{E}_{w \sim \mathcal{N}(w_{Q_i}, \sigma_w^2 I_{d_w})} \ell(h_w(x), y) \le\; & \frac{c_2}{1 - e^{-c_2}} \cdot \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbb{E}_{w_P} \mathbb{E}_{w} \, \ell(h_w(x_j), y_j) \\
& + \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_w^2} \| w_Q \|^2 + \mathbb{E}_{w_{P_i}} \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 + \log\frac{1}{\delta_i} \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta_i.
\end{aligned}$$
The term $\mathbb{E}_{w_{P_i} \sim \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})} \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2$ can be simplified as
$$\mathbb{E}_{w_{P_i}} \frac{1}{2\sigma_w^2} \| w_{Q_i} - w_{P_i} \|^2 = \frac{1}{2\sigma_w^2} \Big( \mathbb{E}_{w_P} \| w_{Q_i} \|^2 - 2 (\mathbb{E}_{w_P} w_{Q_i})^\top w_Q + \| w_Q \|^2 + \mathbb{V}_{w_{P_i}}[w_{P_i}] \Big) = \frac{1}{2\sigma_w^2} \Big( \mathbb{E}_{w_P} \| w_{Q_i} - w_Q \|^2 + \sigma_w^2 \Big),$$
where $\mathbb{V}_{w_{P_i}}[w_{P_i}]$ denotes the variance of $w_{P_i}$. Then we get an inter-task bound by applying Theorem 4 at the environment level, as in the second step above. Finally, taking the union bound with $\delta_i = \delta / (2n)$ and $\delta_0 = \delta / 2$, we obtain
$$\begin{aligned}
\mathbb{P} \Bigg\{ er(\mathcal{Q}) \le\; & \frac{c_1 c_2}{(1 - e^{-c_1})(1 - e^{-c_2})} \cdot \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{w_{P_i} \sim \mathcal{N}(w_Q, \sigma_w^2 I_{d_w}),\, P_i = \mathcal{N}(w_{P_i}, \sigma_w^2 I_{d_w})} \, \hat{er}(A^b(S_i, P_i)) \\
& + \frac{c_1}{1 - e^{-c_1}} \cdot \frac{1}{n} \sum_{i=1}^n \frac{1}{(1 - e^{-c_2})\, m_i} \Big[ \frac{1}{2\sigma_w^2} \| w_Q \|^2 + \frac{1}{2\sigma_w^2} \mathbb{E}_{w_P} \| w_{Q_i} - w_Q \|^2 + \frac{1}{2} + \log\frac{2n}{\delta} \Big] \\
& + \frac{1}{(1 - e^{-c_1})\, n} \Big[ \frac{1}{2\sigma_w^2} \| w_Q \|^2 + \log\frac{2}{\delta} \Big], \; \forall \mathcal{Q} \Bigg\} \ge 1 - \delta.
\end{aligned}$$



Figure 1: Illustration of the localized meta-learning framework. Instead of using global meta-knowledge for all tasks, we tailor the meta-knowledge to each specific task.

which measures the closeness between the mean of the posterior and the mean of the global hyperposterior for each task, is typically vital to the generalization bound. Finding the tightest possible bound generally depends on minimizing this term. It is obvious that the optimal w

estimated by using the re-parameterization trick Kingma & Welling (2014); Rezende et al. (2014). For example, to sample $w$ from the posterior $Q = \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})$, we first draw $\xi \sim \mathcal{N}(0, I_{d_w})$ and then apply the deterministic function $w = w_Q + \xi \odot \sigma$, where $\odot$ is an element-wise multiplication.

4 EXPERIMENTS

Datasets and Setup. We use CIFAR-100 and Caltech-256 in our experiments. CIFAR-100 Krizhevsky (2009) contains 60,000 images from 100 fine-grained categories and 20 coarse-level categories. As in Zhou et al. (2018), we use 64, 16, and 20 classes for meta-training, meta-validation, and meta-testing, respectively. Caltech-256 Griffin et al. (2007) has 30,607 color images from 256 classes. Similarly, we split the dataset into 150, 56, and 50 classes for meta-training, meta-validation, and meta-testing. We consider the 5-way classification problem. Each task is generated by randomly sampling 5 categories, and each category contains 50 samples. The base model uses the convolutional architecture in Finn et al. (2017), which consists of 4 convolutional layers, each with 32 filters, and a fully-connected layer on top mapping to the number of classes. High-dimensional data often lie on some low-dimensional manifold. We utilize an auto-encoder to extract the semantic information of the image data and then construct the LCC scheme based on the embeddings. The parameters of the prior predictor and the base model are randomly perturbed in the form of Gaussian distributions. We design two different meta-learning environment settings to validate the efficacy of the proposed method. The first one uses a pre-trained base model as initialization, which utilizes all the meta-training classes (64-way classification in the CIFAR-100 case) to train the feature extractor. The second one uses random initialization. We compare the proposed LML method with the ML-PL method Pentina & Lampert (2014), the ML-AM method Amit & Meir (2018), and ML-A, which is derived from Theorem 1.
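The re-parameterized sampling step described above can be sketched as follows (a minimal NumPy illustration; the dimension and mean vector are arbitrary placeholders, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_w = 4
w_Q = np.array([0.5, -1.0, 2.0, 0.0])    # hypothetical posterior mean
sigma = 0.01 * np.ones(d_w)              # per-coordinate std (sigma_w = 0.01)

def sample_posterior(w_Q, sigma, rng):
    """Reparameterized draw w = w_Q + xi * sigma with xi ~ N(0, I).
    The randomness lives in xi, so gradients can flow through w_Q and sigma."""
    xi = rng.standard_normal(w_Q.shape)
    return w_Q + xi * sigma              # element-wise multiplication

w = sample_posterior(w_Q, sigma, rng)
assert w.shape == (d_w,)
```

Because the sample is a deterministic function of $(w_Q, \sigma)$ and an independent noise draw, the expected loss can be differentiated w.r.t. the distribution parameters, which is what makes the PAC-Bayes objective trainable by backpropagation.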
In all these methods, we use their main theorems about the generalization upper bound to derive the objective of the algorithm. To ensure a fair comparison, all approaches adopt the same network architecture and pre-trained feature extractor (more details can be found in Appendix E).

Figure 3: Average test accuracy of learning a new task for varied numbers of training tasks (|C| = 64).

Figure 4: (a) The impact of the number of anchor points |C| in LCC. (b) The divergence value (normalized) between the mean generated prior w P and the mean of learned posterior w Q .

Localized Meta-Learning: A PAC-Bayes Analysis for Meta-Learning Beyond Global Prior

This supplementary document contains the discussion of previous work, the technical proofs of the theoretical results, and details of the experiments. It is structured as follows: Appendix A gives a detailed discussion of previous work. Appendix B presents the optimization method for LCC. Appendix C presents notations for the prior predictor. Appendix D gives the proofs of the main results. Appendix D.1 and D.2 show the approximation error between the LCC-based prior predictor and the empirical and expected prior predictors, respectively; they are used in the proof of Theorem 2. Next, Appendix D.3 shows the PAC-Bayes generalization bound of localized meta-learning in Theorem 2, and Appendix D.4 provides the PAC-Bayes generalization bound of regular meta-learning in Theorem 1. Details of the experiments and more empirical results are presented in Appendix E. Finally, we summarize the localized meta-learning algorithm in Appendix F.

A RELATED WORK

Meta-Learning. The meta-learning literature commonly considers the empirical task error by directly optimizing a loss of the meta learner across tasks in the training data. Recently, this has been successfully applied in a variety of models for few-shot learning Ravi & Larochelle (2017); Snell et al. (2017); Finn et al. (2017); Vinyals et al. (2016). Although Vuorio et al. (2018); Rusu et al. (2019); Zintgraf et al. (2019); Wang et al. (2019) consider task adaptation when using meta-knowledge for specific tasks, none of them is based on generalization error bounds, which is the focus of our work. Meta-learning in the online setting has regained attention recently Denevi et al. (2018b;a; 2019); Balcan et al. (2019), in which online-to-batch conversion results can imply generalization bounds. Galanti et al.
(2016) analyzes transfer learning in neural networks with PAC-Bayes tools. Most related to our work are Pentina & Lampert (2014); Amit & Meir (2018), which provide PAC-Bayes generalization bounds for the meta-learning framework. In contrast, neither work provides a principled way to derive localized meta-knowledge for specific tasks.

Localized PAC-Bayes Learning. There has been a prosperous line of research on learning priors to improve PAC-Bayes bounds Catoni (2007); Guedj (2019). Parrado-Hernández et al. (2012) showed that priors can be learned by splitting the available training data into two parts, one for learning the prior and one for learning the posterior. Lever et al. (2013) bounded the KL divergence by a term independent of the data distribution and derived an expression for the overall optimal prior, i.e., the prior distribution resulting in the smallest bound value. Recently, Rivasplata et al. (2018) bounded the KL divergence by investigating the stability of the hypothesis. Dziugaite & Roy (2018) optimized



with $w \in \mathbb{R}^{d_w}$. The prior and posterior are distributions over the set of all possible parameters $w$. We choose both the prior $P$ and the posterior $Q$ to be spherical Gaussians, i.e., $P = \mathcal{N}(w_P, \sigma_w^2 I_{d_w})$ and $Q = \mathcal{N}(w_Q, \sigma_w^2 I_{d_w})$.

We demonstrate the average test error of learning a new task as a function of the number of training tasks, together with the standard deviation, in different settings (with or without a pre-trained feature extractor). The performance continually improves as we increase the number of training tasks for all methods. This is consistent with the generalization bounds: the complexity term converges to zero if a large number of tasks is observed. ML-A consistently outperforms ML-PL and ML-AM, since the single-task bound used in Theorem 1 (ML-A)


Similarly, we can further simplify the notation and obtain the stated bound, where $c_1' = \frac{c_1}{1 - e^{-c_1}}$ and $c_2' = \frac{c_2}{1 - e^{-c_2}}$. This completes the proof.

E DETAILS OF EXPERIMENTS

While the theorems assume a bounded loss, we use an unbounded loss in our experiments. Nevertheless, the theoretical guarantees hold for a variant of the loss clipped to [0, 1]; moreover, in practice the loss function is almost always smaller than one.
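The clipped variant mentioned above is a one-liner; the sketch below (our own illustration, not code from the paper) shows the surrogate to which the bounded-loss theorems apply:

```python
import numpy as np

def clipped_loss(loss):
    """Clip an unbounded loss (e.g. cross-entropy) to [0, 1] so that the
    bounded-loss PAC-Bayes theorems apply to the clipped variant."""
    return np.clip(loss, 0.0, 1.0)

# Losses below 1 pass through unchanged, which is the common case in practice.
vals = clipped_loss(np.array([0.4, 2.3, -0.1]))
assert vals.tolist() == [0.4, 1.0, 0.0]
```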

E.1 DATA PREPARATION

We use the 5-way 50-shot classification setup, where each task instance involves classifying images from 5 different categories sampled randomly from one of the meta-sets. We did not employ any data augmentation or feature averaging during meta-training, nor any data apart from the corresponding training and validation meta-sets.

E.2 NETWORK ARCHITECTURE

Auto-Encoder for LCC For CIFAR-100, the encoder has 7 layers with 16-32-64-64-128-128-256 channels. Each convolutional layer is followed by a LeakyReLU activation and a batch normalization layer. The 1st, 3rd, and 5th layers have stride 1 and kernel size (3, 3). The 2nd, 4th, and 6th layers have stride 2 and kernel size (4, 4). The 7th layer has stride 1 and kernel size (4, 4). The decoder mirrors the encoder with the layers in reverse order. The input is resized to 32 × 32. For Caltech-256, the encoder has 5 layers with 32-64-128-256-256 channels. Each convolutional layer is followed by a LeakyReLU activation and a batch normalization layer. The first 4 layers have stride 2 and kernel size (4, 4). The last layer has stride 1 and kernel size (6, 6). The decoder mirrors the encoder with the layers in reverse order. The input is resized to 96 × 96.
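As a sanity check on the CIFAR-100 encoder dimensions above, one can trace the spatial size through the listed strides and kernels. The padding values below are our assumption (p = 1 for the first six layers, p = 0 for the last), chosen so the size halves at each stride-2 layer:

```python
def conv_out(n, k, s, p):
    """Output spatial size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# CIFAR-100 encoder as described: alternating stride-1 k=3 and stride-2 k=4
# convs, then a final k=4 stride-1 layer. (kernel, stride, assumed padding):
layers = [(3, 1, 1), (4, 2, 1), (3, 1, 1), (4, 2, 1), (3, 1, 1), (4, 2, 1), (4, 1, 0)]

n = 32                                   # input resized to 32 x 32
for k, s, p in layers:
    n = conv_out(n, k, s, p)

# The spatial map shrinks 32 -> 16 -> 8 -> 4 -> 1, so each image is
# encoded as a 1 x 1 x 256 embedding under these padding assumptions.
assert n == 1
```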

Base Model

The network architecture used for the classification task is a small CNN with 4 convolutional layers, each with 32 filters, and a linear output layer, similar to Finn et al. (2017). Each convolutional layer is followed by a batch normalization layer, a LeakyReLU layer, and a max-pooling layer. For CIFAR-100, the input is resized to 32 × 32. For Caltech-256, the input is resized to 96 × 96.

E.3 OPTIMIZATION

Auto-Encoder for LCC As the optimizer we used Adam Kingma & Ba (2015) with β1 = 0.9 and β2 = 0.999. The initial learning rate is 1 × 10^-4. The number of epochs is 100. The batch size is 512.

LCC Training

We alternately train the coefficients and bases of LCC with Adam (β1 = 0.9, β2 = 0.999). Specifically, for both datasets, we alternately update the coefficients for 60 iterations and then update the bases for 60 iterations. The number of training epochs is 3. The number of bases is 64. The batch size is 256.
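The alternating scheme can be sketched with plain gradient steps on a least-squares reconstruction objective (a toy NumPy stand-in for the actual Adam-based LCC optimization; the sizes, learning rate, and objective are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((256, 8))        # toy data: m samples, d features
U = rng.standard_normal((64, 8))         # |C| = 64 bases (rows)
G = rng.standard_normal((256, 64)) * 0.1 # coding coefficients gamma

def rel_err(G, U, X):
    return np.linalg.norm(G @ U - X) / np.linalg.norm(X)

err0 = rel_err(G, U, X)                  # reconstruction error at init
lr = 0.05
for _ in range(3):                       # 3 epochs of alternation
    for _ in range(60):                  # update coefficients, bases fixed
        G -= lr * (G @ U - X) @ U.T / len(X)
    for _ in range(60):                  # update bases, coefficients fixed
        U -= lr * G.T @ (G @ U - X) / len(X)

err = rel_err(G, U, X)
assert err < err0                        # alternation reduces the objective
```

Each half-step is gradient descent on a convex subproblem, so the reconstruction error decreases monotonically, which is the property the alternating schedule relies on.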

Pre-Training of Feature Extractor

We use a 64-way classification task on CIFAR-100 and a 150-way classification task on Caltech-256 to pre-train the feature embedding only on the meta-training dataset. For both CIFAR-100 and Caltech-256, an L2 regularization term of 5 × 10^-4 was used. We used the Adam optimizer with an initial learning rate of 1 × 10^-3, β1 = 0.9, and β2 = 0.999. The number of epochs is 50. The batch size is 512.

Meta-Training

We use the cross-entropy loss as in Amit & Meir (2018). Although this is inconsistent with the bounded-loss setting of our theoretical framework, we still have a guarantee on a variant of the loss clipped to [0, 1]; in practice, the loss is almost always smaller than one. For CIFAR-100 and Caltech-256, the number of epochs in the meta-training phase is 12, and the number of epochs in the meta-testing phase is 40. The batch size is 32 for both datasets. As the optimizer we used Adam with β1 = 0.9 and β2 = 0.999. In the setting with a pre-trained base model, the learning rate is 1 × 10^-5 for the convolutional layers and 5 × 10^-4 for the linear output layer. In the setting without a pre-trained base model, the learning rate is 1 × 10^-3 for the convolutional layers and 5 × 10^-3 for the linear output layer. The confidence parameter is chosen as δ = 0.1. The variance hyper-parameters for the prior predictor and the base model are σ_w = σ_v = 0.01. The hyper-parameters α1, α2 in LML and ML-A are set to 0.01.

E.4 MORE EXPERIMENTAL RESULTS

We also compare with two typical few-shot meta-learning methods: MAML (Finn et al., 2017) and MatchingNet (Vinyals et al., 2016). Both methods use the Adam optimizer with an initial learning rate of 0.0001. In the meta-training phase, we randomly split the samples of each class into a support set (5 samples) and a query set (45 samples). The number of epochs is 100. For MAML, the learning rate of the inner update is 0.01. In Figure 5, we demonstrate the average test error of learning a new task as a function of the number of training tasks, together with the standard deviation, in different settings (with or without a pre-trained feature extractor). We find that all PAC-Bayes baselines outperform MAML and MatchingNet. Note that MAML and MatchingNet adopt the episodic training paradigm to solve the few-shot learning problem. Their meta-training process requires millions of tasks, each containing limited samples, which is not the case in our experiments. Scarce tasks in meta-training lead to severe meta-overfitting. In our method, the learned prior serves both as an initialization of the base model and as a regularizer that restricts the solution space in a soft manner while allowing variation based on the specific task data. It yields a model with smaller error than its unbiased counterpart when applied to a similar task.

F PSEUDO CODE

while not converged do
    Evaluate the gradient of J w.r.t. {v, w_1, . . . , w_n} using backpropagation.
    Take an optimization step.
end while
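The loop above can be sketched in NumPy. The objective J below is a toy quadratic stand-in (our own placeholder, not the paper's PAC-Bayes meta-objective), but the structure of jointly descending in the meta parameter v and the per-task parameters w_1, ..., w_n is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tasks, d = 3, 5
v = rng.standard_normal(d)               # meta-learner parameter
W = rng.standard_normal((n_tasks, d))    # per-task base-learner parameters

def grad_J(v, W):
    """Gradients of a toy J = (1/2) * sum_i ||w_i - v||^2.
    In the real algorithm, backpropagation would supply these gradients
    for the PAC-Bayes meta-objective instead."""
    gW = W - v                           # dJ/dw_i
    gv = -(W - v).sum(axis=0)            # dJ/dv
    return gv, gW

lr = 0.1
for _ in range(200):                     # "while not converged do"
    gv, gW = grad_J(v, W)
    v -= lr * gv                         # optimization step on the meta learner
    W -= lr * gW                         # optimization step on the base learners

# At the optimum of this toy J, every task parameter agrees with v.
assert np.allclose(W, v, atol=1e-3)
```

For the toy objective, joint gradient descent drives each w_i toward the shared v; in the localized framework the coupling instead comes through the LCC-based prior predictor, but the alternating gradient/step structure is identical.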

