PUTTING THEORY TO WORK: FROM LEARNING BOUNDS TO META-LEARNING ALGORITHMS

Abstract

Most existing deep learning models rely on large amounts of labeled training data to achieve state-of-the-art results, even though such data can be hard or costly to obtain in practice. An attractive alternative is to learn with little supervision, commonly referred to as few-shot learning (FSL), and, in particular, meta-learning, which learns to learn with few data from related tasks. Despite the practical success of meta-learning, many of the algorithmic solutions proposed in the literature are based on sound intuitions but lack a solid theoretical analysis of the expected performance on the test task. In this paper, we review recent advances in meta-learning theory and show how they can be used in practice both to better understand the behavior of popular meta-learning algorithms and to improve their generalization capacity. The latter is achieved by integrating the theoretical assumptions ensuring efficient meta-learning, in the form of regularization terms, into several popular meta-learning algorithms, for which we provide a large study of their behavior on classic few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of meta-learning theory into practice for the task of few-shot classification.

1. INTRODUCTION

Since the very beginnings of the machine learning field, its algorithmic advances have inevitably been followed or preceded by accompanying theoretical analyses establishing the conditions required for the corresponding algorithms to learn well. Such a synergy between theory and practice is reflected in numerous concepts and learning strategies that took their origins in statistical learning theory: for instance, the famous regularized risk minimization approach is directly related to the minimization of the complexity of the hypothesis space, as suggested by the generalization bounds established for supervised learning (Vapnik, 1992), while most adversarial algorithms in transfer learning (e.g., DANN from Ganin & Lempitsky (2015)) follow the theoretical insights provided by the seminal theory of that domain (Ben-David et al., 2010). Even though many machine learning methods now enjoy a solid theoretical justification, some more recent advances in the field are still in a preliminary state, which requires the hypotheses put forward by the theoretical studies to be implemented and verified in practice. One such notable example is the emerging field of meta-learning, also called learning to learn (LTL), where the goal is to produce a model on data coming from a set of (meta-train) source tasks so as to use it as a starting point for successfully learning a new, previously unseen (meta-test) target task with little supervision. This kind of approach comes in particularly handy when training deep learning models, as their performance crucially depends on the amount of training data, which can be difficult and/or expensive to obtain in some applications. Several theoretical studies (Baxter, 2000; Pentina & Lampert, 2014; Maurer et al., 2016; Amit & Meir, 2018; Yin et al., 2020)¹ provided probabilistic meta-learning bounds that require both the amount of data in each meta-train source task and the number of meta-train tasks to tend to infinity for efficient meta-learning.
While capturing the underlying general intuition, these bounds do not suggest that all the source data is useful in such a learning setup, due to the additive relationship between the two terms mentioned above. To tackle this drawback, two very recent studies (Du et al., 2020; Tripuraneni et al., 2020) aimed at finding deterministic assumptions that lead to faster learning rates, allowing meta-learning algorithms to benefit from all the source data. Contrary to probabilistic bounds, which have been used to derive novel learning strategies for meta-learning algorithms (Amit & Meir, 2018; Yin et al., 2020), there has been no attempt to verify in practice the validity of the assumptions leading to the fastest known learning rates, or to enforce them through an appropriate optimization procedure. In this paper, we bridge meta-learning theory with practice by harvesting the theoretical results of Tripuraneni et al. (2020) and Du et al. (2020), and by showing how they can be implemented algorithmically and integrated, when needed, into popular existing meta-learning algorithms used for few-shot classification (FSC). The latter task consists in classifying new data having seen only a few training examples, and represents one of the most prominent applications where meta-learning has been shown to be highly efficient. More precisely, our contributions are three-fold:

1. We identify two common assumptions from the theoretical works on meta-learning and show how they can be verified and enforced via a novel regularization scheme.

2. We investigate whether these assumptions are satisfied by popular meta-learning algorithms and observe that some of them naturally satisfy them, while others do not.

3. With the proposed regularization strategy, we show that enforcing the assumptions to be valid in practice leads to better generalization of the considered algorithms.

The rest of the paper is organized as follows.
After presenting preliminary knowledge on the meta-learning problem in Section 2, we detail the existing meta-learning theoretical results with their corresponding assumptions and show how the latter can be enforced via a general regularization technique in Section 3. Then, we provide an experimental evaluation of several popular few-shot learning (FSL) methods in Section 4 and highlight the different advantages brought by the proposed regularization in practice. Finally, we conclude and outline future research perspectives in Section 5.
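To make the idea of enforcing a theoretical assumption through regularization concrete before the formal treatment, the sketch below shows one generic recipe: add a soft penalty over the matrix of linear predictors to the multi-task training loss. The specific "task diversity" surrogate (a condition-number-style penalty on the singular values of the predictor matrix), the function names, and the trade-off weight are our illustrative assumptions here, not the concrete scheme proposed in this paper.

```python
import numpy as np

def task_loss(W, Phi, Y):
    """Mean squared error of the linear heads W (T, k) applied to
    per-task features Phi (T, n, k) against targets Y (T, n)."""
    preds = np.einsum('tnk,tk->tn', Phi, W)
    return np.mean((preds - Y) ** 2)

def diversity_penalty(W):
    """Smaller when the rows of W cover the representation space evenly;
    a condition-number-style surrogate for a 'task diversity' assumption.
    This particular penalty is an illustrative choice, not the paper's."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.max() / (s.min() + 1e-8)

def regularized_meta_objective(W, Phi, Y, lam=0.1):
    """Empirical multi-task risk plus an assumption-enforcing penalty."""
    return task_loss(W, Phi, Y) + lam * diversity_penalty(W)
```

The penalty only has to be differentiable almost everywhere to be usable with gradient-based meta-training; its minimum is attained when all singular values of the predictor matrix are equal, i.e., when the source tasks' predictors span the representation space uniformly.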

2. PRELIMINARY KNOWLEDGE

We start by formally defining the meta-learning problem following the model described in Du et al. (2020). To this end, we assume having access to $T$ source tasks characterized by their respective data-generating distributions $\{\mu_t\}_{t=1}^{T}$ supported over the joint input-output space $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. We further assume that these distributions are observed only through finite samples of size $n_1$ grouped into matrices $\mathbf{X}_t = (\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n_1})^\top \in \mathbb{R}^{n_1 \times d}$ and vectors of outputs $\mathbf{y}_t = (y_{t,1}, \dots, y_{t,n_1})^\top \in \mathbb{R}^{n_1}$, $\forall t \in [[T]] := \{1, \dots, T\}$. Given this set of tasks, our goal is to learn a shared representation $\phi$ belonging to a certain class of functions $\Phi := \{\phi \mid \phi : \mathcal{X} \to \mathcal{V},\ \mathcal{V} \subseteq \mathbb{R}^k\}$, and linear predictors $\mathbf{w}_t \in \mathbb{R}^k$, $\forall t \in [[T]]$, grouped in a matrix $\mathbf{W} \in \mathbb{R}^{T \times k}$. More formally, this is done by solving the following optimization problem:

$$\hat{\phi}, \hat{\mathbf{W}} = \operatorname*{arg\,min}_{\phi \in \Phi,\, \mathbf{W} \in \mathbb{R}^{T \times k}} \frac{1}{2Tn_1} \sum_{t=1}^{T} \sum_{i=1}^{n_1} \ell\big(y_{t,i}, \langle \mathbf{w}_t, \phi(\mathbf{x}_{t,i}) \rangle\big), \qquad (1)$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function. Once such a representation is learned, we want to apply it to a new, previously unseen target task observed through a pair $(\mathbf{X}_{T+1} \in \mathbb{R}^{n_2 \times d}, \mathbf{y}_{T+1} \in \mathbb{R}^{n_2})$ containing $n_2$ samples generated by the distribution $\mu_{T+1}$. We expect that a linear classifier $\mathbf{w}$ learned on top of the obtained representation leads to a low true risk over the whole distribution $\mu_{T+1}$. More precisely, we first use $\hat{\phi}$ to solve the following problem:

$$\hat{\mathbf{w}}_{T+1} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathbb{R}^k} \frac{1}{n_2} \sum_{i=1}^{n_2} \ell\big(y_{T+1,i}, \langle \mathbf{w}, \hat{\phi}(\mathbf{x}_{T+1,i}) \rangle\big).$$

Then, we define the true target risk of the learned linear classifier $\hat{\mathbf{w}}_{T+1}$ as

$$\mathcal{L}(\hat{\phi}, \hat{\mathbf{w}}_{T+1}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mu_{T+1}}\big[\ell\big(y, \langle \hat{\mathbf{w}}_{T+1}, \hat{\phi}(\mathbf{x}) \rangle\big)\big],$$

and want it to be small and as close as possible to the ideal true risk $\mathcal{L}(\phi^*, \mathbf{w}^*_{T+1})$, where

$$\forall t \in [[T+1]],\ (\mathbf{x}, y) \sim \mu_t, \quad y = \langle \mathbf{w}^*_t, \phi^*(\mathbf{x}) \rangle + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \qquad (2)$$
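For intuition, this two-stage procedure admits a simple toy instantiation with a linear representation $\phi(\mathbf{x}) = \mathbf{B}^\top \mathbf{x}$ and the squared loss: the joint problem over heads and representation can then be approximately solved by alternating least squares, after which the target head is fitted on top of the frozen representation. This is an illustrative sketch under our own choices of dimensions, noise level, and optimization scheme, not the algorithm studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, n1, n2 = 10, 3, 5, 50, 20  # input dim, repr. dim, tasks, sample sizes

# Ground-truth shared linear representation and per-task heads (model (2)).
B_true = rng.normal(size=(d, k))
W_true = rng.normal(size=(T + 1, k))

def make_task(t, n):
    """Sample n points from task t: y = <w*_t, B*^T x> + noise."""
    X = rng.normal(size=(n, d))
    y = X @ B_true @ W_true[t] + 0.01 * rng.normal(size=n)
    return X, y

source = [make_task(t, n1) for t in range(T)]

# Alternating least squares for the joint problem with phi(x) = B^T x.
B = rng.normal(size=(d, k))
for _ in range(50):
    # Heads given representation: per-task least squares on features X @ B.
    W = np.stack([np.linalg.lstsq(X @ B, y, rcond=None)[0] for X, y in source])
    # Representation given heads: x^T B w_t is linear in the entries of B,
    # with row-major vec(B) coefficients given by kron(x, w_t).
    A = np.vstack([np.kron(X, W[t][None, :]) for t, (X, _) in enumerate(source)])
    b = np.concatenate([y for _, y in source])
    B = np.linalg.lstsq(A, b, rcond=None)[0].reshape(d, k)

# Target task: fit only a k-dimensional head on the frozen representation.
X_new, y_new = make_task(T, n2)
w_new = np.linalg.lstsq(X_new @ B, y_new, rcond=None)[0]
mse = np.mean((X_new @ B @ w_new - y_new) ** 2)
```

Note that the target stage estimates only $k$ parameters instead of $d$, which is exactly why a good shared representation makes few-shot learning ($n_2 \ll n_1$) feasible.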



¹ We omit other works on meta-learning via online convex optimization (Finn et al., 2019; Balcan et al., 2019; Khodak et al., 2019; Denevi et al., 2019), as they concern a different learning setup.

