Published as a conference paper at ICLR 2021

META-LEARNING WITH NEGATIVE LEARNING RATES

Abstract

Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or learning to learn a distribution of tasks, where learning to learn is represented by an outer loop, and learning by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and that simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where a zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and to nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative values. These results help clarify under what circumstances meta-learning performs best.

1. INTRODUCTION

Deep learning models represent the state of the art in several machine learning benchmarks (LeCun et al. (2015)), and their performance does not seem to stop improving as more data and computing resources are added (Rosenfeld et al. (2020), Kaplan et al. (2020)). However, they require a large amount of data and compute to begin with, which are often not available to practitioners. The approach of fine-tuning has proved very effective at addressing this limitation: pre-train a model on a source task, for which a large dataset is available, and use this model as the starting point for a quick additional training (fine-tuning) on the small dataset of the target task (Pan & Yang (2010), Donahue et al. (2014), Yosinski et al. (2014)). This approach is popular because pre-trained models are often made available by institutions that have the resources to train them.

In some circumstances, multiple source tasks are available, each with scarce data, as opposed to a single source task with abundant data. This case is addressed by meta-learning, in which a model gains experience over multiple source tasks and uses it to improve its learning of future target tasks. The idea of meta-learning is inspired by the ability of humans to generalize across tasks without having to train on any single task for a long time. A meta-learning problem is solved by a bi-level optimization procedure: an outer loop optimizes meta-parameters across tasks, while an inner loop optimizes parameters within each task (Hospedales et al. (2020)). Meta-learning has gained some popularity, but a few recent papers argue that a simple alternative, in which the inner loop is removed entirely, is good enough (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Other studies find the opposite (Goldblum et al. (2020), Collins et al. (2020), Gao & Sener (2020)). It is hard to resolve the debate because there is little theory available to explain these findings.

In this work, using random matrix theory and exact solutions of linear models, we derive an algebraic expression for the average test loss of MAML, a simple and successful meta-learning algorithm (Finn et al. (2017)), as a function of its hyperparameters. In particular, we study its performance as a function of the inner-loop learning rate during meta-training. Setting this learning rate to zero is equivalent to removing the inner loop, as advocated by recent work (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Surprisingly, we find that the optimal learning rate is negative, so performance can be increased by reducing the learning rate below zero. In particular, we find the following:

• In the problem of mixed linear regression, we prove that the optimal learning rate is always negative in overparameterized models. The same result holds in underparameterized models, provided that the optimal learning rate is small in absolute value. We validate the theory by running extensive experiments.

• We extend these results to the case of nonlinear regression and wide neural networks, in which the output can be approximated by a linear function of the parameters (Jacot et al. (2018), Lee et al. (2019)). While in this case we cannot prove that the optimal learning rate is always negative, preliminary experiments suggest that the result holds here as well.
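The setting above can be made concrete with a minimal numpy sketch of MAML on mixed linear regression. All names, dimensions, and hyperparameter values here are illustrative assumptions, not the paper's exact experimental setup: tasks are random weight vectors, the inner loop is a single gradient step with learning rate alpha (which may be zero or negative), and the outer update differentiates exactly through the inner step, which is cheap for linear models.

```python
import numpy as np

dim, n_train, n_val, noise = 20, 10, 10, 0.1   # illustrative sizes, not the paper's
rng = np.random.default_rng(0)

def sample_task():
    # Each task tau is a weight vector w; labels are y = x @ w + noise.
    return rng.normal(size=dim)

def sample_data(w, n):
    x = rng.normal(size=(n, dim))
    return x, x @ w + noise * rng.normal(size=n)

def meta_gradient(theta, alpha, n_tasks=16):
    """Exact MAML meta-gradient for squared loss on mixed linear regression.
    `alpha` is the inner-loop learning rate; it may be zero (no inner loop)
    or negative, the regime the paper argues is optimal at training time."""
    g = np.zeros(dim)
    for _ in range(n_tasks):
        w = sample_task()
        x_tr, y_tr = sample_data(w, n_train)
        x_val, y_val = sample_data(w, n_val)
        H = x_tr.T @ x_tr / n_train                  # inner-loop Hessian
        grad_tr = H @ theta - x_tr.T @ y_tr / n_train
        theta_ad = theta - alpha * grad_tr           # one inner gradient step
        grad_val = x_val.T @ (x_val @ theta_ad - y_val) / n_val
        g += (np.eye(dim) - alpha * H) @ grad_val    # chain rule through inner step
    return g / n_tasks

# Outer loop: gradient descent on the meta-loss, with a negative
# inner-loop learning rate during meta-training (alpha_train < 0).
theta = np.zeros(dim)
beta, alpha_train = 0.05, -0.05
for _ in range(100):
    theta -= beta * meta_gradient(theta, alpha_train)
```

At evaluation time one would adapt `theta` to a new task with a positive learning rate, matching the paper's distinction between the training and adaptation values of the inner-loop rate.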

2. RELATED WORK

The field of meta-learning includes a broad range of problems and solutions; see Hospedales et al. (2020) for a recent review.

3. META-LEARNING AND MAML

In this work, we follow the notation of Hospedales et al. (2020), and we use MAML (Finn et al. (2017)) as the meta-learning algorithm. We assume the existence of a distribution of tasks τ and, for each task, a loss function L_τ and a distribution of data points D_τ = {x_τ, y_τ} with input x_τ and label y_τ. We assume that the loss function is the same for all tasks, L_τ = L, but each task is characterized by a different distribution of the data. The empirical meta-learning loss is evaluated on a sample of tasks and a sample of data points for each task.
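This empirical meta-learning loss can be written compactly. The following is a standard MAML formulation with a single inner gradient step, using generic symbols (α for the inner-loop learning rate, M for the number of sampled tasks, D_τ^tr and D_τ^val for the per-task training and validation samples); the paper's exact notation may differ:

```latex
\mathcal{L}_{\mathrm{meta}}(\theta)
  \;=\; \frac{1}{M}\sum_{\tau=1}^{M}
  \mathcal{L}\!\Big(
      \theta \;-\; \alpha\,\nabla_{\theta}\,
      \mathcal{L}\big(\theta;\, D_{\tau}^{\mathrm{tr}}\big)
      ;\;\; D_{\tau}^{\mathrm{val}}\Big)
```

Setting α = 0 collapses the inner step, so the meta-loss reduces to the ordinary loss of a single shared model; the question studied here is how performance varies as α moves around zero, including into negative values.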

