THE CURSE OF LOW TASK DIVERSITY: ON THE FAILURE OF TRANSFER LEARNING TO OUTPERFORM MAML AND THEIR EMPIRICAL EQUIVALENCE

Anonymous

Abstract

Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks, raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by (1) proposing a novel metric, the diversity coefficient, to measure the diversity of tasks in a few-shot learning benchmark, and (2) comparing Model-Agnostic Meta-Learning (MAML) and transfer learning under fair conditions (same architecture, same optimizer, and all models trained to convergence). Using the diversity coefficient, we show that the popular Mini-ImageNet and CIFAR-FS few-shot learning benchmarks have low diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions in the regime of low diversity under a fair comparison. Specifically, we empirically find that a low diversity coefficient correlates with a high similarity between transfer learning and MAML learned solutions, both in accuracy at meta-test time and in classification-layer similarity (using feature-based distance metrics such as SVCCA, PWCCA, CKA, and OPD). To further support our claim, we find that this equivalence in meta-test accuracy holds even as the model size changes. Therefore, we conclude that in the low-diversity regime, MAML and transfer learning have equivalent meta-test performance when both are compared fairly. We also hope our work inspires more thoughtful constructions and quantitative evaluations of meta-learning benchmarks in the future.

1. INTRODUCTION

The success of deep learning in computer vision (Krizhevsky et al., 2012; He et al., 2015), natural language processing (Devlin et al., 2018; Brown et al., 2020), game playing (Silver et al., 2016; Mnih et al., 2013; Ye et al., 2021), and more keeps motivating a growing body of applications of deep learning to an increasingly wide variety of domains. In particular, deep learning is now routinely applied to few-shot learning, a research challenge that assesses a model's ability to learn to adapt to new tasks, new distributions, or new environments. This has been the main research area where meta-learning algorithms have been applied, since such a strategy seems promising in the small-data regime due to its potential to learn to learn or learn to adapt. However, it was recently shown (Tian et al., 2020) that a transfer learning model with a fixed embedding can match and outperform many modern, sophisticated meta-learning algorithms on numerous few-shot learning benchmarks (Chen et al., 2019; 2020; Dhillon et al., 2019; Huang and Tao, 2019). This growing body of evidence, coupled with these surprising results in meta-learning, raises the question of whether researchers are applying meta-learning with the right inductive biases (Mitchell, 1980; Shalev-Shwartz, 2014) and designing appropriate benchmarks for meta-learning. Our evidence suggests this is not the case. Our work is motivated by the inductive bias that when the diversity of tasks in a benchmark is low, a meta-learning solution should provide no advantage over a minimally meta-learned algorithm, e.g., one that only fine-tunes the final layer. Therefore, in this work, we quantitatively show that when the task diversity, a novel measure of variability across tasks, is low, then MAML (Model-Agnostic Meta-Learning) (Finn et al., 2017) learned solutions have the same accuracy as transfer learning (i.e., a supervised learned model with a fine-tuned final linear layer).
We want to emphasize the importance of doing such an analysis fairly: with the same architecture, the same optimizer, and all models trained to convergence. We hypothesize this was lacking in previous work (Chen et al., 2019; 2020; Dhillon et al., 2019; Huang and Tao, 2019). This empirical equivalence remained true even as the model size changed, further suggesting that this equivalence is more a property of the data than of the model. Therefore, we suggest taking a problem-centric approach to meta-learning and applying Marr's levels of analysis (Hamrick and Mohamed, 2020; Marr, 1982) to few-shot learning to identify the family of problems suitable for meta-learning. Marr emphasized the importance of understanding the computational problem being solved, and not only analyzing the algorithms or hardware that attempt to solve it. An example given by Marr is marveling at the rich structure of bird feathers without also understanding that the problem they solve is flight. Similarly, there have been analyses of MAML solutions and transfer learning without putting the problem such solutions should solve into perspective (Raghu et al., 2020; Tian et al., 2020). Therefore, in this work, we hope to clarify some of these results by placing the current state of affairs in meta-learning, at least partially, within a problem-centric view. In addition, an important novelty of our analysis is that we make the intrinsic properties of the data the driving force. Our contributions are summarized as follows:

1. We propose a novel metric that quantifies the intrinsic diversity of the data in a few-shot learning benchmark. We call it the diversity coefficient. It enables analysis of meta-learning algorithms through a problem-centric framework. It also goes beyond counting the number of classes, the number of data points, or the number of concatenated data sets, and instead quantifies the expected diversity/variability of tasks in a few-shot learning benchmark. We also show its strong correlation with the ground truth diversity.
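To make the idea concrete, here is a minimal sketch of how an expected-pairwise-distance diversity coefficient can be estimated. This is an illustrative formulation, not the paper's exact procedure: the embedding function, the distance (cosine here), and the number of sampled pairs are all assumptions.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two task embeddings."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def diversity_coefficient(task_embeddings, num_pairs=1000, seed=0):
    """Estimate the expected distance between embeddings of distinct task pairs.

    `task_embeddings` is an (n_tasks, dim) array of task representations,
    e.g. Task2Vec-style vectors computed with a fixed probe network
    (hypothetical input here; any task embedding would do for the sketch).
    """
    rng = np.random.default_rng(seed)
    n = len(task_embeddings)
    dists = []
    for _ in range(num_pairs):
        i, j = rng.choice(n, size=2, replace=False)  # sample a pair of distinct tasks
        dists.append(cosine_distance(task_embeddings[i], task_embeddings[j]))
    return float(np.mean(dists))
```

Under this sketch, a value near zero means sampled tasks look nearly identical to the embedding, i.e., a homogeneous (low-diversity) benchmark.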

2. We analyze the two most prominent few-shot learning benchmarks, MiniImagenet and Cifar-fs, and show that their diversity is low. These results are consistent across different ways of measuring the diversity coefficient, suggesting that our approach is robust. In addition, we quantitatively show that the tasks sampled from them are highly homogeneous.

3. With this context, we partially clarify the surprising results of (Tian et al., 2020) by comparing their transfer learning method against models trained with MAML (Finn et al., 2017). In particular, under a fair comparison, the transfer learning method with a fixed feature extractor fails to outperform MAML. We define a fair comparison as one where the two methods use the same architecture (backbone), the same optimizer, and all models are trained to convergence. We also show that their final layers make similar predictions according to distance-based neural network comparison techniques: Singular Value Canonical Correlation Analysis (SVCCA), Projection Weighted CCA (PWCCA), Linear Centered Kernel Alignment (CKA), and Orthogonal Procrustes Distance (OPD). This equivalence holds even as the model size increases.

4. Interestingly, we also find that even in the regime where task diversity is low (in MiniImagenet and Cifar-fs), the features extracted by supervised learning and MAML are different, implying that the mechanism by which they function is different despite the similarity of their final predictions.
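Of the representation-comparison metrics listed above, linear CKA is the simplest to write down. The sketch below uses the standard linear-CKA formula on centered activation matrices; it is a generic formulation and not necessarily the exact variant used in the experiments.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_examples, d1) and Y: (n_examples, d2) are layer activations for the
    same n examples from two networks (or two layers). Returns a similarity
    in [0, 1]; 1 - CKA can be used as a distance between representations.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)
```

A useful sanity check is that linear CKA is invariant to orthogonal transformations of the features, so two layers that differ only by a rotation score 1.0.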

5. As an actionable conclusion, we provide a metric that can be used to analyze the intrinsic diversity of the data in a few-shot learning benchmark and therefore to build more thoughtful environments to drive research in meta-learning. In addition, our evidence suggests the following test to predict the empirical equivalence of MAML and transfer learning: if the task diversity is low, then transfer-learned solutions might fail to outperform meta-learned solutions. This test is easy to run because our diversity coefficient can be computed with the Task2Vec method (Achille et al., 2019) using a pre-trained neural network. In addition, our synthetic experiments, which also probe the high-diversity regime, provide preliminary evidence that the diversity coefficient might be predictive of the difference in performance between transfer learning and MAML. We hope that this line of work inspires a problem-centric-first approach to meta-learning, which appears to be especially sensitive to the properties of the problem in question. Therefore, we hope future work takes a more thoughtful and quantitative approach to benchmark creation, instead of focusing only on making huge data sets.
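For readers unfamiliar with the transfer learning baseline being compared against MAML, a rough sketch follows: the feature extractor is frozen and only the final linear layer is adapted to each few-shot task. The ridge-regression-to-one-hot head below is a stand-in for the paper's actual classifier, and the frozen-feature inputs are assumed to come from any pre-trained backbone.

```python
import numpy as np

def adapt_final_layer(features_support, labels_support, n_classes, reg=1e-3):
    """Fit only a final linear layer on top of frozen features, as in
    transfer-learning baselines (here, ridge regression to one-hot targets)."""
    X = np.hstack([features_support, np.ones((len(features_support), 1))])  # append bias
    Y = np.eye(n_classes)[labels_support]  # one-hot targets
    # ridge-regularized least squares: W = (X^T X + reg*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def predict(features_query, W):
    """Classify query examples with the adapted linear head."""
    X = np.hstack([features_query, np.ones((len(features_query), 1))])
    return np.argmax(X @ W, axis=1)
```

Note the contrast with MAML: here no gradient steps touch the feature extractor at adaptation time, which is exactly why the comparison in this paper isolates the value (or lack thereof) of meta-learned adaptation when task diversity is low.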

