UNDERSTANDING TRAIN-VALIDATION SPLIT IN META-LEARNING WITH NEURAL NETWORKS

Abstract

The goal of meta-learning is to learn a good prior model from a collection of tasks such that the learned prior can adapt quickly to new tasks without requiring much data from them. A common practice in meta-learning is to perform a train-validation split on each task, where the training set is used to adapt the model parameters to that specific task and the validation set is used to learn a prior model shared across all tasks. Despite its success and popularity in multitask learning and few-shot learning, the understanding of the train-validation split is still limited, especially when neural network models are used. In this paper, we study the benefit of the train-validation split for classification problems with neural network models trained by gradient descent. For first-order model-agnostic meta-learning (FOMAML), we prove that the train-validation split is necessary to learn a good prior model when the noise in the training samples is large: the train-validation method succeeds while the train-train method fails. We validate our theory by conducting experiments on both synthetic and real datasets. To the best of our knowledge, this is the first work towards a theoretical understanding of the train-validation split in meta-learning with neural networks.

1. INTRODUCTION

In recent years, meta-learning has gained increasing popularity and has been successfully applied to a wide range of problems, including few-shot learning (Ren et al., 2018; Li et al., 2017; Rusu et al., 2018; Snell et al., 2017), reinforcement learning (Gupta et al., 2018b;a), neural machine translation (Gu et al., 2018), and neural architecture search (NAS) (Liu et al., 2018; Real et al., 2019). A popular idea is to formulate meta-learning as a bi-level optimization problem, where the inner level computes the parameter adaptation to each task and the outer level minimizes the meta-training loss. This bi-level formulation has proven empirically effective for learning new tasks quickly from only a few examples with the aid of past experience. Following this idea, meta-learning algorithms such as model-agnostic meta-learning (MAML) (Finn et al., 2017) have achieved remarkable success in many applications. Due to the nature of bi-level optimization, meta-learning algorithms can often take advantage of a train-validation split of the dataset, so that the inner and outer levels of the algorithm use different data points (Finn et al., 2017; Rajeswaran et al., 2019; Bai et al., 2021; Fallah et al., 2020). It is believed that the train-validation split helps meta-learning algorithms achieve better performance. There have been several attempts to understand the importance of the train-validation split in meta-learning for linear models (Wang et al., 2021; Bai et al., 2021; Saunshi et al., 2021). Specifically, Wang et al. (2021) showed that when learning linear models, the train-train method performs much worse than the train-validation method if the sample size is small and the noise is large. They also showed that the train-train method performs well on linear models when the sample size is large enough. Bai et al. (2021) considered the linear centroid model introduced in Denevi et al.
(2018b) and showed that the train-validation method outperforms the train-train method in an agnostic setting, while in the realizable noiseless setting, the train-train method can asymptotically achieve a strictly smaller mean squared error than the train-validation method as the sample size and dimension go to infinity at a fixed ratio. Saunshi et al. (2021) took a representation learning perspective and demonstrated that the train-validation split encourages the learned representation to be low-rank, while the train-train method encourages high-rank representations. However, all of these works focus on the linear regression setting, and the advantage of the train-validation split remains elusive for meta-learning with neural networks. Based on the above observations, we raise the following question:

How does the train-validation split affect meta-learning with neural networks?

In this paper, we answer this question via a case study of few-shot binary classification using a two-layer convolutional neural network. We consider a learning problem where the data contain large noise and only a limited number of samples are available. Under this setting, we theoretically compare the performance of first-order MAML (FOMAML (Finn et al., 2017)), a simplification of MAML that ignores the Hessian terms, with a train-validation split (the train-validation method) and without one (the train-train method). We summarize our contributions as follows:

1. We show that under our setting, despite the complex bi-level structure of the FOMAML loss and its non-convex landscape, both the train-validation method and the train-train method are guaranteed to train a two-layer CNN to a global minimum of the training loss with high probability.

2. We also demonstrate that there is a significant performance gap on new test data. Specifically, we show that the neural network trained by the train-validation method achieves a test loss that decreases exponentially fast as the number of training tasks increases. In contrast, the train-train method can at best achieve a constant-level test loss.

3. Our study demonstrates the importance of the train-validation split in learning neural networks. To the best of our knowledge, this is the first theoretical work studying the train-validation split in meta-learning with neural networks. Notably, the learning problem we consider is linearly realizable, a setting in which Bai et al. (2021) showed that the train-train method can asymptotically achieve a better MSE than the train-validation method under a linear model. However, our results give the opposite conclusion for learning CNNs: the train-validation method still outperforms the train-train method even for linearly realizable learning problems. Therefore, our results indicate that the train-validation split may have a more significant advantage when learning complicated prediction models.

4. We perform experiments on both synthetic and real datasets with neural networks as the backbone model to corroborate our theoretical results. In particular, even when the data and the network architecture do not meet our theoretical assumptions, the experimental results still support our theory to a certain extent, demonstrating the practical value of our analysis.
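To make the bi-level structure and the role of the split concrete, the following is a minimal NumPy sketch of one FOMAML meta-update on a linear model with squared loss, contrasting the train-validation split with the train-train method. All names (`fomaml_step`, `split`), the learning rates, and the toy data model are illustrative assumptions for exposition; the paper's actual setting uses a two-layer CNN for binary classification.

```python
import numpy as np

rng = np.random.default_rng(0)

def fomaml_step(prior_w, tasks, inner_lr=0.1, outer_lr=0.01, split=True):
    """One FOMAML meta-update on a linear model with squared loss.

    With split=True (train-validation method), the first half of each
    task's data adapts the prior (inner level) and the second half
    computes the outer gradient. With split=False (train-train method),
    the same samples are reused at both levels.
    """
    outer_grad = np.zeros_like(prior_w)
    for X, y in tasks:
        n = len(y)
        if split:
            tr, va = (X[: n // 2], y[: n // 2]), (X[n // 2 :], y[n // 2 :])
        else:
            tr = va = (X, y)
        # Inner level: one gradient step adapting the prior to this task.
        g_in = tr[0].T @ (tr[0] @ prior_w - tr[1]) / len(tr[1])
        w_task = prior_w - inner_lr * g_in
        # Outer level (first-order): gradient of the validation loss at
        # the adapted parameters, ignoring the Hessian term of full MAML.
        outer_grad += va[0].T @ (va[0] @ w_task - va[1]) / len(va[1])
    return prior_w - outer_lr * outer_grad / len(tasks)

# Toy usage: tasks share a common linear signal plus label noise.
d, n_per_task = 5, 8
w_star = np.ones(d)
tasks = []
for _ in range(4):
    X = rng.normal(size=(n_per_task, d))
    y = X @ w_star + 0.5 * rng.normal(size=n_per_task)
    tasks.append((X, y))

w = np.zeros(d)
for _ in range(100):
    w = fomaml_step(w, tasks, split=True)
```

Setting `split=False` turns this into the train-train method analyzed in the paper; the only algorithmic difference between the two methods is which samples the outer gradient is evaluated on.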

Notation. For an integer k, we denote [k] = {1, 2, ..., k}. Given two sequences {x_n} and {y_n} with y_n > 0, we denote x_n = O(y_n) if |x_n|/y_n is upper bounded by a constant for all n. Similarly, we denote x_n = Ω(y_n) if |x_n|/y_n is lower bounded by a positive constant. We denote x_n = Θ(y_n) if x_n = O(y_n) and x_n = Ω(y_n). Finally, we use Õ(·), Ω̃(·), and Θ̃(·) to hide logarithmic factors.

2. ADDITIONAL RELATED WORK

Optimization and generalization guarantees for meta-learning. A number of recent works have studied optimization guarantees for meta-learning algorithms. Finn & Levine (2017) proved the universality of gradient-based meta-learning. Wang et al. (2020) studied global optimality conditions for MAML with a nonconvex objective. Fallah et al. (2020) studied convergence guarantees for MAML with nonconvex loss functions and proposed Hessian-Free MAML, which enjoys the same theoretical guarantees as MAML without accessing second-order information. Finn et al. (2019); Balcan et al. (2019); Khodak et al. (2019); Denevi et al. (2019; 2018a) studied online meta-learning using online convex optimization. Another line of work has studied the generalization error and sample complexity of meta-learning methods. Specifically, Amit & Meir (2018) extended the PAC-Bayes argument to the meta-learning setting and established a generalization error bound for

