METAMD: PRINCIPLED OPTIMISER META-LEARNING FOR DEEP LEARNING

Abstract

Optimiser design influences learning speed and generalisation in training machine learning models. Several studies have attempted to learn more effective gradient-descent optimisers by solving a bi-level optimisation problem in which generalisation error is minimised with respect to optimiser parameters. However, most existing neural-network-oriented optimiser learning methods are intuitively motivated, without clear theoretical support, and focus on learning implicit biases that improve generalisation rather than speed of convergence. We take a different perspective, starting from mirror descent rather than gradient descent, and meta-learning the corresponding Bregman divergence. Within this paradigm, we formalise a novel meta-learning objective of optimising the rate of convergence. The resulting framework, termed Meta Mirror Descent (MetaMD), learns to accelerate optimisation speed. Unlike many meta-learned neural network optimisers, it also supports convergence guarantees, and uniquely does so without requiring validation data. We empirically evaluate our framework on a variety of tasks and architectures in terms of convergence rate and generalisation error, and demonstrate strong performance.

1. INTRODUCTION

Gradient-based optimisation algorithms, such as stochastic gradient descent (SGD), are fundamental building blocks of many machine learning algorithms, notably those for training linear models and deep neural networks. These methods are typically developed to solve a broad class of problems, so their developers make as few assumptions about the target problem as possible. This leads to a variety of general-purpose optimisation techniques, but such generality often comes with slower convergence. By exploiting more information about the target problem, one can typically design more efficient, but less general, optimisation algorithms. Another challenge in the non-convex deep learning context is that many of the empirically fastest optimisers, such as Adam (Kingma & Ba, 2015), lack convergence guarantees. While one line of research hand-designs optimisers to exploit known properties of particular problems, a complementary line of research focuses on situations where optimisation problems come in families. This allows the use of meta-learning techniques to fit an optimiser to the given problem family, with the goal of maximising convergence speed or generalisation performance. For example, in the many-shot regime, Andrychowicz et al. (2016) and Wichrowska et al. (2017) learn black-box neural optimisers to accelerate training of neural networks, while Bello et al. (2017) learn symbolic gradient-based optimisers to improve generalisation. MAML (Finn et al., 2017) and Meta-SGD (Li et al., 2017) learn an initialisation and learning rate for SGD training of neural networks with good generalisation performance in the few-shot regime. Later generalisations focused on learning problem-family-specific curvature information (Park & Oliva, 2019; Flennerhag et al., 2020). Nevertheless, most existing learned optimisers (Andrychowicz et al., 2016; Wichrowska et al., 2017; Flennerhag et al., 2020; Bello et al., 2017) cannot provide convergence guarantees.

In this work, we revisit the optimiser learning problem from the perspective of mirror descent. Mirror descent introduces a Bregman divergence that regularises the distance between the current and next iterate, yielding a strongly convex sub-problem that can be optimised exactly. In mirror descent, the choice of Bregman divergence determines the optimisation dynamics. In a meta-learning context, the Bregman divergence thus provides an interesting representation of an optimisation strategy that can be fit to a given family of optimisation problems, leading to our learned optimiser, termed Meta Mirror Descent (MetaMD).
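To make the role of the Bregman divergence concrete, the standard mirror descent update (textbook formulation; notation here is illustrative rather than this paper's) can be written as:

```latex
w_{t+1} = \arg\min_{w} \; \eta \, \langle \nabla L(w_t),\, w \rangle + D_\phi(w, w_t),
\qquad
D_\phi(w, u) = \phi(w) - \phi(u) - \langle \nabla \phi(u),\, w - u \rangle ,
```

where $\phi$ is a strongly convex potential and $\eta$ a step size. With $\phi(w) = \tfrac{1}{2}\|w\|_2^2$, $D_\phi$ is the squared Euclidean distance and the update reduces to plain gradient descent; choosing $\phi$, or equivalently $D_\phi$, is precisely the degree of freedom that MetaMD meta-learns.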
Many existing learned optimisers do not have a formal notion of convergence rate, and in practice typically optimise a meta-objective reflecting training or validation loss after a fixed number of iterations. In contrast, MetaMD is directly trained to optimise the convergence rate bound for mirror descent. Importantly, this means we can adapt theoretical guarantees from mirror descent to provide convergence guarantees for MetaMD, an important property not provided by most learned optimisers, nor by many hand-designed optimisers widely used in deep learning. An important issue in meta-learning mirror descent is specifying the family of Bregman divergences to learn. Meta-learning with general Bregman divergences leads to an intractable tri-level optimisation problem. We therefore seek a family of divergences for which the innermost optimisation has a closed-form solution. The chosen parameterisation should be complex enough to exhibit interesting optimisation dynamics, simple enough to provide a closed-form solution, and always yield a valid Bregman divergence. We provide a parameterisation that meets all these desiderata in the form of a Kronecker-factorised preconditioner (Zehfuss, 1858; Martens & Grosse, 2015). Preconditioners have been successfully exploited in meta-learning methods such as Meta-Curvature (Park & Oliva, 2019) and WarpGrad (Flennerhag et al., 2020), and are richer than parameter-wise learning rates (Li et al., 2017; Khodak et al., 2019), which are a special case. Importantly, we also show how to train the divergence efficiently with implicit gradient computation (Lorraine et al., 2020). Uniquely, this means our framework can effectively train learning rates (as a special case of preconditioning) with implicit gradients, an idea previously suggested to be impossible (Lorraine et al., 2020). Empirically, we demonstrate that we can train MetaMD given a model architecture and a suite of training tasks. We then deploy it to novel testing tasks and show that MetaMD provides fast convergence and good generalisation compared to several existing hand-designed and learned optimisers.
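The efficiency of a Kronecker-factorised preconditioner comes from a standard linear algebra identity: preconditioning the vectorised gradient by the large matrix $B \otimes A$ is equivalent to two small matrix products. A minimal numpy sketch (illustrative only; the function name and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def kron_preconditioned_step(W, G, A, B, lr=0.1):
    """One descent step with a Kronecker-factorised preconditioner.

    Applying (B kron A) to vec(G) is equivalent to the cheap product
    A @ G @ B.T, so the full Kronecker matrix is never materialised.
    """
    return W - lr * (A @ G @ B.T)

# Sanity check of the identity vec(A G B^T) = (B kron A) vec(G),
# using column-major (Fortran-order) vectorisation.
rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, m)); A = A @ A.T + m * np.eye(m)  # SPD factor
B = rng.standard_normal((n, n)); B = B @ B.T + n * np.eye(n)  # SPD factor
G = rng.standard_normal((m, n))                               # gradient matrix

fast = A @ G @ B.T
full = (np.kron(B, A) @ G.flatten(order="F")).reshape(m, n, order="F")
assert np.allclose(fast, full)
```

For an $m \times n$ weight matrix this reduces the preconditioning cost from $O(m^2 n^2)$ storage for the dense matrix to $O(m^2 + n^2)$ for the two factors.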

2. RELATED WORK

Meta-learning aims to extract some notion of 'how to learn' from a task or distribution of tasks (Hospedales et al., 2020), such that new learning trials are better or faster. These two stages are often called meta-training and meta-testing, respectively. Key dichotomies include: meta-learning from a single task vs a task distribution (as formalised, e.g., by Baxter (2000)); the type of meta-knowledge to be extracted; and long- vs short-horizon meta-learning. For few-shot problems with short optimisation horizons, the seminal model-agnostic meta-learning (MAML) of Finn et al. (2017) learns an initial condition from which only a few optimisation steps are required to solve a new task. Meta-SGD (Li et al., 2017) and Meta-Curvature (Park & Oliva, 2019) extend MAML by learning a parameter-wise learning rate and a preconditioning matrix, respectively. Another group of methods focuses on larger-scale problems in terms of dataset size and optimisation horizon. For example, neural architecture search (NAS) (Real et al., 2019; Zoph & Le, 2017) discovers effective neural architectures, and MetaReg (Balaji et al., 2018) meta-learns regularisation parameters to improve domain generalisation.

Several studies focus specifically on optimiser meta-learning for many-shot batch learning problems, which we address here. In this case, the extracted meta-knowledge spans learning rates for SGD (Micaelli & Storkey, 2021), symbolic gradient-descent rules (Bello et al., 2017), neural network gradient-descent rules (Andrychowicz et al., 2016; Li & Malik, 2017), and gradient-free optimisers (Sandler et al., 2021; Chen et al., 2017). Differently from these gradient-descent-based methods, we start from the perspective of mirror descent, where mirror descent's Bregman divergence provides a target for meta-learning. This perspective has several benefits, notably the ability to derive a learned optimiser with convergence and generalisation guarantees. Our suggested practical instantiation is to meta-learn a Kronecker-factorised preconditioner, which is effective and efficient. However, the framework is general and can be used to learn particular special cases such as element-wise learning rates (Li et al., 2017; Micaelli & Storkey, 2021). Beyond supporting this richer representation, we provide convergence guarantees, do not rely on a validation set, and demonstrate cross-dataset generalisation, enabling us to amortise the cost of meta-learning.
In contrast, the single-task meta-learner of Micaelli & Storkey (2021) must repeat meta-learning on each specific dataset to optimise per-dataset validation performance. In addition to the work on deep meta-learning of optimisers and optimiser hyperparameters, there is also a rich literature on the principled development of online meta-learning algorithms that aim to minimise various notions of meta-regret (Finn et al., 2019; Denevi et al., 2019b;a; 2018; Balcan et al., 2019; Khodak et al., 2019). Several of these build on online mirror descent, but there is an important distinction between their work and ours: we focus on the batch setting, with a view to ensuring the resulting algorithm can be tractably applied to deep neural networks. ARUBA (Khodak et al., 2019) is one such online mirror-descent-based approach.
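To make the "special case" relationship above concrete, element-wise learning rates (as in Meta-SGD) correspond to restricting the learned preconditioner to a diagonal matrix. A minimal sketch (illustrative, not from the paper):

```python
import numpy as np

# A per-parameter learning-rate vector `lr` applied element-wise to a
# gradient `g` is exactly the diagonal preconditioner P = diag(lr):
rng = np.random.default_rng(1)
d = 5
g = rng.standard_normal(d)            # gradient
lr = rng.uniform(0.01, 0.1, size=d)   # per-parameter learning rates
P = np.diag(lr)                       # equivalent (diagonal) preconditioner

assert np.allclose(P @ g, lr * g)     # preconditioned step == element-wise step
```

A general (or Kronecker-factorised) preconditioner additionally mixes gradient coordinates through off-diagonal entries, which is why it is a strictly richer representation.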



