FEW-SHOT LEARNING VIA LEARNING THE REPRESENTATION, PROVABLY

Abstract

This paper studies few-shot learning via representation learning, where one uses T source tasks with n1 data per task to learn a representation in order to reduce the sample complexity of a target task for which there are only n2 (≪ n1) data. Specifically, we focus on the setting where there exists a good common representation between source and target tasks, and our goal is to understand how large a reduction in sample size is possible. First, we study the setting where this common representation is low-dimensional and provide a risk bound of Õ(dk/(n1T) + k/n2) on the target task for the linear representation class; here d is the ambient input dimension and k (≪ d) is the dimension of the representation. This result bypasses the Ω(1/√T) barrier under the i.i.d. task assumption, and captures the desired property that all n1T samples from the source tasks can be pooled together for representation learning. We further extend this result to a general representation function class and obtain a similar guarantee. Next, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say, in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural networks, and show that representation learning can fully utilize all n1T samples from the source tasks.
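For intuition (this comparison is implicit in the bound above, not stated verbatim): a standard rate for learning the target task from scratch with a d-dimensional linear predictor is on the order of d/n2, so representation learning wins whenever

    dk/(n1T) + k/n2 ≪ d/n2,

which holds as soon as k ≪ d and n1T ≫ k·n2, i.e., when the representation is genuinely low-dimensional and the pooled source data dwarfs the target data.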

1. INTRODUCTION

A popular scheme for few-shot learning, i.e., learning in a data-scarce environment, is representation learning: one first learns a feature extractor, or representation (e.g., the last layer of a convolutional neural network), from different but related source tasks, and then uses a simple predictor (usually a linear function) on top of this representation in the target task. The hope is that the learned representation captures the common structure across tasks, making a linear predictor sufficient for the target task. If the learned representation is good enough, a few samples may suffice for the target task, far fewer than would be required to learn it from scratch.

While representation learning has achieved tremendous success in a variety of applications (Bengio et al., 2013), its theoretical study remains limited. In existing theoretical work, the most natural algorithm explicitly searches for the representation that, combined with a separate linear predictor on top for each task, achieves the smallest cumulative training error on the source tasks. Of course, the representation found is not guaranteed to be useful for the target task unless one makes assumptions that characterize the connections between tasks. Existing work often imposes a probabilistic assumption: each task is sampled i.i.d. from an underlying distribution. Under this assumption, Maurer et al. (2016) showed an Õ(1/√T + 1/√n2) risk bound on the target task, where T is the number of source tasks, n1 is the number of samples per source task, and n2 is the number of samples from the target task.¹ Unsatisfactorily, this bound necessarily requires the number of tasks T to be large, and it
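As a concrete toy illustration of this two-stage scheme in the low-dimensional linear setting, the sketch below learns a shared k-dimensional linear representation from T source tasks and then fits only a k-dimensional head on the target task. All dimensions, noise levels, and the SVD-based stage-1 estimator are illustrative choices, not the exact algorithm analyzed here (which is the joint empirical risk minimizer over the representation and per-task predictors):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, n1, n2 = 20, 2, 40, 30, 10  # ambient dim, rep dim, tasks, source/target samples

# Ground truth: a shared low-dimensional representation B* (d x k)
# and a separate k-dimensional head for each source task.
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))
W_src = rng.standard_normal((T, k))

# Source data: n1 samples per task, all generated through the same B*.
Xs = rng.standard_normal((T, n1, d))
Ys = np.einsum('tnd,dk,tk->tn', Xs, B_star, W_src) + 0.1 * rng.standard_normal((T, n1))

# Stage 1 (representation learning): fit each source task by least squares,
# stack the T coefficient vectors, and take their top-k principal directions
# as the learned representation -- a simple stand-in with the same low-rank idea.
coef = np.stack([np.linalg.lstsq(Xs[t], Ys[t], rcond=None)[0] for t in range(T)])
_, _, Vt = np.linalg.svd(coef, full_matrices=False)
B_hat = Vt[:k].T  # d x k learned representation

# Stage 2 (few-shot target): freeze B_hat and estimate only a k-dimensional
# head from the n2 target samples.
w_tgt = rng.standard_normal(k)
X2 = rng.standard_normal((n2, d))
y2 = X2 @ B_star @ w_tgt + 0.1 * rng.standard_normal(n2)
w_hat, *_ = np.linalg.lstsq(X2 @ B_hat, y2, rcond=None)

# Baseline: learn the target task from scratch (all d parameters from n2 samples).
beta_scratch, *_ = np.linalg.lstsq(X2, y2, rcond=None)

# Evaluate both predictors on fresh target data.
X_test = rng.standard_normal((5000, d))
y_test = X_test @ B_star @ w_tgt
err_rep = np.mean((X_test @ B_hat @ w_hat - y_test) ** 2)
err_scratch = np.mean((X_test @ beta_scratch - y_test) ** 2)
print(f"with learned representation: {err_rep:.3f}, from scratch: {err_scratch:.3f}")
```

With n2 < d the from-scratch least-squares problem is under-determined, whereas the k-parameter head on top of the learned representation is well-posed; running the sketch shows the representation-based error is much smaller, matching the k/n2 versus d/n2 intuition.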



¹ We only focus on the dependence on T, n1, and n2 in this paragraph. Note that Maurer et al. (2016) only considered n1 = n2, but their approach does not give a better result even when n1 > n2.

