METAMD: PRINCIPLED OPTIMISER META-LEARNING FOR DEEP LEARNING

Abstract

Optimiser design influences learning speed and generalisation in training machine learning models. Several studies have attempted to learn more effective gradient-descent optimisers by solving a bi-level optimisation problem in which generalisation error is minimised with respect to optimiser parameters. However, most existing neural-network-oriented optimiser learning methods are intuitively motivated, without clear theoretical support, and focus on learning implicit biases that improve generalisation rather than speed of convergence. We take a different perspective, starting from mirror descent rather than gradient descent, and meta-learning the corresponding Bregman divergence. Within this paradigm, we formalise a novel meta-learning objective of optimising the rate of convergence. The resulting framework, termed Meta Mirror Descent (MetaMD), learns to accelerate optimisation speed. Unlike many meta-learned neural network optimisers, it also supports convergence guarantees, and uniquely does so without requiring validation data. We empirically evaluate our framework on a variety of tasks and architectures in terms of convergence rate and generalisation error, and demonstrate strong performance.

1. INTRODUCTION

Gradient-based optimisation algorithms, such as stochastic gradient descent (SGD), are fundamental building blocks of many machine learning algorithms, notably those for training linear models and deep neural networks. These methods are typically developed to solve a broad class of problems, and so their designers make as few assumptions about the target problem as possible. This leads to a variety of general-purpose optimisation techniques, but such generality often comes at the cost of slower convergence. By exploiting more information about the target problem, one can typically design more efficient, but less general, optimisation algorithms. A further challenge in the non-convex deep learning setting is that many of the empirically fastest optimisers, such as Adam (Kingma & Ba, 2015), lack convergence guarantees. While one line of research hand-designs optimisers to exploit known properties of particular problems, a complementary line of research focuses on situations where optimisation problems come in families. This allows the use of meta-learning techniques to fit an optimiser to the given problem family, with the goal of maximising convergence speed or generalisation performance. For example, in the many-shot regime, Andrychowicz et al. (2016) and Wichrowska et al. (2017) learn black-box neural optimisers to accelerate training of neural networks, while Bello et al. (2017) learn symbolic gradient-based optimisers to improve generalisation. MAML (Finn et al., 2017) and Meta-SGD (Li et al., 2017) learn an initialisation and learning rate for SGD training of neural networks with good generalisation performance in the few-shot regime. Later generalisations focus on learning problem-family-specific curvature information (Park & Oliva, 2019; Flennerhag et al., 2020). Nevertheless, most existing learned optimisers (Andrychowicz et al., 2016; Wichrowska et al., 2017; Flennerhag et al., 2020; Bello et al., 2017) cannot provide convergence guarantees.

In this work, we revisit the optimiser learning problem from the perspective of mirror descent. Mirror descent introduces a Bregman divergence that regularises the distance between the current and next iterate, yielding a strongly convex sub-problem that can be solved exactly. In mirror descent, the choice of Bregman divergence determines the optimisation dynamics. In a meta-learning context, the Bregman divergence thus provides an interesting representation of an optimisation strategy that can be fit to a given family of optimisation problems, leading to our learned optimiser, termed Meta Mirror Descent (MetaMD). Many existing learned optimisers do not have a formal notion of convergence
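To make the mirror descent machinery concrete, the following is a minimal NumPy sketch of a single mirror descent step, not the MetaMD algorithm itself (which meta-learns the Bregman divergence rather than fixing one by hand). The function names (`mirror_descent_step`, `project_simplex`) and the toy objective are illustrative assumptions. Choosing the negative-entropy mirror map recovers the classic exponentiated-gradient update on the probability simplex, while the mirror map ½‖x‖² would recover plain gradient descent.

```python
import numpy as np

def mirror_descent_step(x, grad, lr, mirror_map_grad, mirror_map_grad_inv):
    """One mirror descent update: map the iterate to the dual space via the
    gradient of the mirror map, take a gradient step there, and map back."""
    dual = mirror_map_grad(x) - lr * grad
    return mirror_map_grad_inv(dual)

# Negative-entropy mirror map psi(x) = sum_i x_i log x_i:
# grad psi(x) = log(x) + 1, with inverse y -> exp(y - 1).
neg_entropy_grad = lambda x: np.log(x) + 1.0
neg_entropy_grad_inv = lambda y: np.exp(y - 1.0)

def project_simplex(x):
    # The simplex constraint turns the update into exponentiated gradient:
    # the unconstrained dual step followed by renormalisation.
    return x / x.sum()

# Toy problem: minimise f(x) = 0.5 * ||x - target||^2 over the simplex,
# whose minimiser is `target` itself (an interior point).
target = np.array([0.7, 0.2, 0.1])
x = np.ones(3) / 3
for _ in range(200):
    grad = x - target
    x = project_simplex(
        mirror_descent_step(x, grad, 0.5, neg_entropy_grad, neg_entropy_grad_inv)
    )
```

After the loop, `x` is close to `target`; the iterates stay strictly inside the simplex without any explicit clipping, which is exactly the implicit bias that the choice of Bregman divergence (here, the KL-inducing negative entropy) bakes into the optimiser.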

