BAYESIAN ONLINE META-LEARNING

Abstract

Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem for large-scale supervised classification, little has been done to overcome catastrophic forgetting for few-shot classification problems. Few-shot meta-learning algorithms often require all few-shot tasks to be readily available in a batch for training. The popular gradient-based model-agnostic meta-learning (MAML) algorithm is a typical algorithm that suffers from these limitations. This work introduces a Bayesian online meta-learning framework to tackle both the catastrophic forgetting problem and the sequential few-shot tasks problem. Our framework incorporates MAML into a Bayesian online learning algorithm with Laplace approximation or variational inference. It enables few-shot classification on a range of sequentially arriving datasets with a single meta-learned model, as well as training on sequentially arriving few-shot tasks. Experimental evaluations demonstrate that our framework can effectively prevent catastrophic forgetting and is capable of online meta-learning in various few-shot classification settings.

1. INTRODUCTION

Image classification models and algorithms often require an enormous number of labelled examples to achieve state-of-the-art performance, and labelled examples can be expensive and time-consuming to acquire. Human visual systems, on the other hand, are able to recognise new classes after being shown only a few labelled examples. Few-shot classification (Miller et al., 2000; Li et al., 2004; 2006; Lake et al., 2011) tackles this issue by learning to adapt to unseen classes (known as novel classes) with very few labelled examples from each class. Recent works show that meta-learning provides promising approaches to few-shot classification problems (Santoro et al., 2016; Finn et al., 2017; Li et al., 2017; Ravi & Larochelle, 2017). Meta-learning or learning-to-learn (Schmidhuber, 1987; Thrun & Pratt, 1998) takes the learning process a level deeper: instead of learning from the labelled examples in the training classes (known as base classes), meta-learning learns the example-learning process. The training process in meta-learning that utilises the base classes is called the meta-training stage, and the evaluation process that reports the few-shot performance on the novel classes is known as the meta-evaluation stage. Despite being a promising solution to few-shot classification problems, meta-learning methods suffer from several limitations:

1. Unable to continually learn from sequential few-shot tasks: all base classes must be readily available for meta-training, and such meta-learning algorithms often require sampling a number of few-shot tasks in every iteration for optimisation.

2. Unable to retain few-shot classification ability on sequential datasets that have an evident distributional shift: a meta-learned model is restricted to performing few-shot classification on a specific dataset, in the sense that the base and novel classes have to originate from the same dataset distribution.
A meta-learned model loses its few-shot classification ability on previous datasets as new ones arrive subsequently for meta-training. We emphasise that the tasks mentioned in this paper refer to few-shot tasks for meta-learning. This paper considers meta-learning a single model for few-shot classification in the sequential datasets and sequential few-shot tasks settings respectively. We introduce a Bayesian online meta-learning framework that can train a few-shot learning model under the sequential few-shot tasks setting, and that overcomes catastrophic forgetting to remain applicable across a broader scope of few-shot classification datasets. We extend the Bayesian online learning (BOL) framework (Opper, 1998) to the meta-learning setting. An important reason to implement Bayesian inference over non-Bayesian methods for an online meta-learning setting is that BOL provides a grounded framework that suggests using the previous posterior as the prior recursively. Bayesian inference inherits an advantage for robust meta-learning (Yoon et al., 2018) to overcome the training instability problems addressed by Antoniou et al. (2019). BOL implicitly keeps a memory of previous knowledge via the posterior, in contrast to recent online meta-learning methods that explicitly accumulate previous data in a task buffer (Finn et al., 2019; Zhuang et al., 2019). Explicitly keeping a memory of previous data often triggers an important question: how should the carried-forward data be processed in future task rounds in order to accumulate knowledge? Finn et al. (2019) update the meta-parameters at each iteration using previous few-shot tasks in the task buffer. This defeats the purpose of online learning, which by definition means updating the parameters each round using only the new data encountered. Having to re-train on previous data to avoid forgetting also increases the training time as the data accumulate (Finn et al., 2019; He et al., 2019).
Certainly one can clamp the amount of stored data at some maximal limit and sample from the buffer, but the final performance of such an algorithm would depend on the samples being informative and of good quality, which may vary across different seed runs. In contrast to memorising the datasets, having an implicit memory via the posterior automatically deals with the question of how to process carried-forward data and allows previous experiences to be carried forward more effectively. Below are the contributions we make in this paper:

• We develop the Bayesian online meta-learning (BOML) framework for sequential few-shot classification problems. Under this framework we introduce the algorithms Bayesian online meta-learning with Laplace approximation (BOMLA) and Bayesian online meta-learning with variational inference (BOMVI).

• We propose a simple approximation to the Fisher information matrix for BOMLA that carries over the desirable block-diagonal Kronecker-factored structure from the Fisher approximation in the non-meta-learning setting.

• We demonstrate that BOML can overcome catastrophic forgetting in the sequential few-shot datasets setting with an apparent distributional shift across the datasets.

• We demonstrate that BOML can continually learn to few-shot classify novel classes in the sequential meta-training few-shot tasks setting.
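For reference, the implicit memory via the posterior rests on the standard BOL recursion, in which the posterior from the previous round becomes the prior for the next (a standard Bayesian update; the notation here is ours, not the paper's):

```latex
p(\theta \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1})
```

Since the exact posterior is intractable for neural networks, BOMLA approximates it via the Laplace approximation and BOMVI via variational inference, as named in the contributions above.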

2. META-LEARNING

Most meta-learning algorithms comprise an inner loop for example-learning and an outer loop that learns the example-learning process. Such algorithms often require sampling a meta-batch of tasks at each iteration, where a task is formed by sampling a subset of classes from the pool of base classes or novel classes during meta-training or meta-evaluation respectively. The N-way K-shot task, for instance, refers to sampling N classes and using K examples per class for few-shot quick adaptation. An offline meta-learning algorithm learns a few-shot classification model only for a specific dataset D_{t+1}, where all base classes of D_{t+1} have to be readily available for meta-training. For notational convenience, we drop the t+1 subscript in this section, as there is only one dataset involved in offline meta-learning. The dataset D is divided into a set of base classes for meta-training and a set of novel classes for meta-evaluation. Upon completing meta-training on the base classes, the meta-learned model is evaluated on few-shot tasks sampled from the novel classes.
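For concreteness, N-way K-shot task formation as described above might be sketched as follows. This is an illustrative sketch only; the function name, the dictionary-based class pool, and the per-task relabelling convention are our own assumptions, not an implementation from the paper:

```python
import random

def sample_task(class_pool, n_way=5, k_shot=1, k_query=15, seed=None):
    """Sample an N-way K-shot task from a pool mapping class name -> examples.

    Returns a support set (N*K examples for few-shot quick adaptation) and a
    query set (N*k_query examples for evaluating the adapted model)."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_pool), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):  # relabel the N classes 0..N-1 per task
        examples = rng.sample(class_pool[cls], k_shot + k_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# During meta-training, tasks are drawn from the pool of base classes;
# during meta-evaluation, from the pool of novel classes.
```

A meta-batch is then simply a list of such tasks sampled at each iteration.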



We extend the BOL framework to a Bayesian online meta-learning framework using the model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017). MAML finds a good model parameter initialisation (called the meta-parameters) that can quickly adapt to novel classes using very few labelled examples, while BOL provides a principled framework for finding the posterior of the model parameters. Our framework combines BOL and MAML to find the posterior of the meta-parameters. Our work builds on Ritter et al. (2018a), which combines the BOL framework and Laplace approximation with a block-diagonal Kronecker-factored Fisher approximation, and Nguyen et al. (2018), which uses variational inference with BOL to overcome catastrophic forgetting in large-scale supervised classification.
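To illustrate the inner/outer loop structure that BOML builds on, here is a minimal first-order MAML sketch in numpy. The toy regression task distribution, learning rates, and all function names are our own assumptions for illustration, not the paper's implementation (which uses full MAML on few-shot classification):

```python
import numpy as np

def loss_and_grad(theta, X, y):
    # Squared-error loss for a linear model y ~ X @ theta, and its gradient.
    err = X @ theta - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.1, inner_steps=1):
    """One first-order MAML outer update over a meta-batch of tasks.

    Each task is a tuple (X_support, y_support, X_query, y_query)."""
    meta_grad = np.zeros_like(theta)
    for Xs, ys, Xq, yq in tasks:
        phi = theta.copy()
        for _ in range(inner_steps):        # inner loop: quick adaptation
            _, g = loss_and_grad(phi, Xs, ys)
            phi -= inner_lr * g
        _, gq = loss_and_grad(phi, Xq, yq)  # outer loss on the query set
        meta_grad += gq                     # first-order approximation (no
                                            # second derivatives through phi)
    return theta - outer_lr * meta_grad / len(tasks)

rng = np.random.default_rng(0)
w0 = np.array([1.0, -2.0, 0.5])             # shared structure across tasks

def sample_task(k=5):
    # Toy task: a linear function whose weights are a perturbation of w0.
    w = w0 + 0.1 * rng.normal(size=3)
    Xs, Xq = rng.normal(size=(k, 3)), rng.normal(size=(k, 3))
    return Xs, Xs @ w, Xq, Xq @ w

theta = np.zeros(3)
for _ in range(200):
    theta = maml_step(theta, [sample_task() for _ in range(4)])
```

The meta-parameters theta drift toward an initialisation from which one inner gradient step solves any task in the family; BOML places a posterior over these meta-parameters rather than a point estimate.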

