ENERGY-BASED MODELS FOR CONTINUAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via external memory, growing models, or regularization, EBMs naturally support a dynamically growing number of tasks and classes while interfering less with previously learned tasks. We show that EBMs adapt to a more general continual learning setting in which the data distribution changes without explicitly delineated tasks. We also find that EBMs outperform baseline methods by a large margin on several continual learning benchmarks. These observations point towards EBMs as a class of models naturally inclined towards the continual learning regime.

1. INTRODUCTION

Humans are able to rapidly learn new skills and continuously integrate them with prior knowledge. The field of Continual Learning (CL) seeks to build artificial agents with the same capabilities (Parisi et al., 2019). In recent years, CL has seen increased attention, particularly in the context of classification problems. A crucial characteristic of continual learning is the ability to learn new data without forgetting prior data. Models must also be able to incrementally learn new skills, without necessarily having a notion of an explicit task identity. However, standard neural networks (He et al., 2016; Simonyan & Zisserman, 2014; Szegedy et al., 2015) suffer from catastrophic forgetting and perform poorly in this setting. Different approaches have been proposed to mitigate catastrophic forgetting, but many rely on external memory (Lopez-Paz & Ranzato, 2017; Li & Hoiem, 2017), additional models (Shin et al., 2017), or auxiliary objectives and regularization (Kirkpatrick et al., 2017; Schwarz et al., 2018; Zenke et al., 2017; Maltoni & Lomonaco, 2019), which can limit the wide applicability of these methods.

In this work, we focus on classification tasks. Such tasks are usually tackled with a normalized probability distribution (i.e., a softmax output layer) trained with a cross-entropy objective. In this paper, we argue that by viewing classification through the lens of training an un-normalized probability distribution, we can significantly improve continual learning performance in classification problems. In particular, we interpret classification as learning an Energy-Based Model (EBM) across separate classes (Grathwohl et al., 2019). Training becomes a wake-sleep process in which the energy of an input and its ground-truth label is decreased while the energy of the input paired with another selected class is increased. This offers the freedom to choose which classes to update during the CL process.
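This wake-sleep update can be sketched in a few lines; this is a minimal illustration under our own assumptions (a linear per-class energy and a single sampled negative class), not the paper's actual architecture:

```python
import random

# Toy sketch of the EBM classification update described above.
# E(x, y) is a scalar energy; the linear form and all names are
# illustrative assumptions.

DIM, NUM_CLASSES, LR = 4, 3, 0.1

# One weight vector per class; E(x, y) = -w_y . x
weights = [[0.0] * DIM for _ in range(NUM_CLASSES)]

def energy(x, y):
    return -sum(w * xi for w, xi in zip(weights[y], x))

def ebm_step(x, y_true, seen_classes):
    # Sample a single negative class instead of normalizing over all
    # classes, so only two rows of `weights` are touched per update.
    y_neg = random.choice([c for c in seen_classes if c != y_true])
    for i in range(DIM):
        weights[y_true][i] += LR * x[i]   # decrease E(x, y_true)
        weights[y_neg][i]  -= LR * x[i]   # increase E(x, y_neg)
    return y_neg

x = [1.0, 0.5, -0.2, 0.3]
before = energy(x, 0)
ebm_step(x, y_true=0, seen_classes=[0, 1, 2])
after = energy(x, 0)
print(after < before)  # energy of the true pair went down
```

Note that a softmax cross-entropy update would instead touch every class's parameters on every step; here only the true class and one sampled negative are modified.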
By contrast, the cross-entropy objective reduces the likelihood of all negative classes for every new input, producing updates that lead to forgetting. The energy function, which maps a data-and-class pair to a scalar energy, also provides a way for the model to select and filter the portions of the input that are relevant to the classification at hand. We show that this makes EBM training updates for new data interfere less with previous data. In particular, our formulation of the energy function computes the energy of an input by learning a conditional gain, based on the input label, that serves as an attention filter to select the most relevant information. When a new class arrives, a new conditional gain can be learned. These benefits apply across a range of continual learning tasks.

Most existing works on continual learning (Kirkpatrick et al., 2017; Zhao et al., 2020) learn a sequence of distinct tasks with clear task boundaries (Boundary-Aware). Many of these methods depend on knowing the task boundaries, which provide proper moments to consolidate knowledge. However, this scenario is not very common in the real world; a more natural scenario is the Boundary-Agnostic setting (Zeno et al., 2018; Rajasegaran et al., 2020), in which the data distribution changes gradually without a clear notion of task boundaries. This setting has also been used as a standard evaluation in continual reinforcement learning (Al-Shedivat et al., 2017; Nagabandi et al., 2018). Many common CL methods are not applicable to the Boundary-Agnostic scenario because the task boundaries are unknown or undefined. In contrast, EBMs apply to this setting without any modification and support both the Boundary-Aware and Boundary-Agnostic settings.

There are four primary contributions of our work. First, we introduce energy-based models for classification CL problems in both boundary-aware and boundary-agnostic regimes.
Second, we use the standard contrastive-divergence training procedure and show that it significantly reduces catastrophic forgetting. Third, we propose to learn new conditional gains during training, which causes EBM parameter updates to interfere less with old data. Finally, we show that in practice EBMs bring a significant improvement on four standard CL benchmarks: split MNIST, permuted MNIST, CIFAR-10, and CIFAR-100. These observations point towards EBMs as a class of models naturally inclined towards the CL regime.
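The conditional-gain mechanism described above can be sketched as follows; this is a minimal illustration under our own assumptions (a fixed shared feature map, a gain vector per class, and a linear readout), not the paper's actual parameterization:

```python
# Sketch of the conditional-gain energy function: a shared feature map
# is modulated by a per-class gain vector, and the energy is read out
# from the gated features. All names and sizes are illustrative.

FEAT_DIM = 4
shared_w = [0.5, -0.3, 0.8, 0.1]   # shared "feature" weights (fixed here)
gains = {}                          # class label -> gain vector

def add_class(y):
    # A new class only allocates a fresh gain vector; nothing learned
    # for old classes is overwritten.
    gains[y] = [1.0] * FEAT_DIM

def energy(x, y):
    feats = [w * xi for w, xi in zip(shared_w, x)]
    gated = [g * f for g, f in zip(gains[y], feats)]  # attention-like filter
    return -sum(gated)

add_class("cat")
add_class("dog")    # growing the label set is just one new gain vector
x = [1.0, 0.0, 0.5, -1.0]
print(sorted(gains))  # ['cat', 'dog']
```

The design point is that supporting a new class requires only allocating a new gain vector, leaving the parameters used by old classes untouched.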

2.1. CONTINUAL LEARNING SETTINGS

Boundary-aware versus boundary-agnostic. In most existing continual learning studies, models are trained in a "boundary-aware" setting, in which a sequence of distinct tasks with clear task boundaries is given (e.g., Kirkpatrick et al., 2017; Zenke et al., 2017; Shin et al., 2017). Typically there are no overlapping classes between any two tasks; for example, task 1 has data with ground-truth class labels "1, 2" and task 2 has data with ground-truth class labels "3, 4". In this setting, models are first trained on the entire first task and then move on to the second one. Moreover, models are typically told when there is a transition from one task to the next. However, it is arguably more realistic for tasks to change gradually and for models not to be explicitly informed about the task boundaries. Such a boundary-agnostic setting has been explored in (Zeno et al., 2018; Rajasegaran et al., 2020; Aljundi et al., 2019). In this setting, models learn in a streaming fashion and the data distribution gradually changes over time; for example, the percentage of "1"s presented to the model might gradually decrease while the percentage of "2"s increases. Importantly, most existing continual learning approaches are not applicable to the boundary-agnostic setting, as they require the task boundaries to decide when to consolidate knowledge (Zeno et al., 2018). In this paper, we show that our proposed approach can also be applied to the boundary-agnostic setting.

Task-incremental versus class-incremental learning. Another important distinction in continual learning is between task-incremental learning and class-incremental learning (van de Ven & Tolias, 2019; Prabhu et al., 2020). In task-incremental learning, also referred to as the multi-head setting (Farquhar & Gal, 2018), models have to predict the label of an input by choosing only from the labels in the task the input came from.
In class-incremental learning, on the other hand, also referred to as the single-head setting, models have to choose between the classes from all tasks seen so far when asked to predict the label of an input. Class-incremental learning is substantially more challenging than task-incremental learning, as it requires models to select the correct labels from a mixture of new and old classes. So far, only methods that store data or use replay have been shown to perform well in the class-incremental learning scenario (Rebuffi et al., 2017; Rajasegaran et al., 2019). In this paper, we tackle class-incremental learning without storing data or using replay.
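The difference between the two evaluation protocols can be made concrete with a small sketch; the score values and label names here are illustrative assumptions (any per-class score, such as a negative energy, would play the same role):

```python
# Contrast of task-incremental (multi-head) versus class-incremental
# (single-head) prediction. Scores and labels are made up for illustration.

scores = {"1": 0.2, "2": 0.9, "3": 1.5, "4": 0.1}
task_labels = {"task1": ["1", "2"], "task2": ["3", "4"]}

def predict_task_incremental(scores, task_id):
    # Multi-head: the given task identity restricts the candidate labels.
    return max(task_labels[task_id], key=scores.get)

def predict_class_incremental(scores):
    # Single-head: choose among all classes seen so far.
    return max(scores, key=scores.get)

print(predict_task_incremental(scores, "task1"))  # "2"
print(predict_class_incremental(scores))          # "3"
```

With identical scores, the two protocols can disagree: the multi-head prediction is "2" because candidates are restricted to task 1, while the single-head prediction is "3", since class "3" from a later task scores highest overall.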

2.2. CONTINUAL LEARNING APPROACHES

In recent years, numerous methods have been proposed for CL. Here we broadly partition these methods into three categories: task-specific, regularization, and replay-based approaches.

Task-specific methods. One way to reduce interference between tasks is to use different parts of a neural network for different problems. For a fixed-size network, such specialization can be achieved by learning a separate mask for each task (Fernando et al., 2017; Serra et al., 2018), by a priori defining a different, random mask for every task to be learned (Masse et al., 2018), or by using a different set of parameters for each task (Zeng et al., 2019; Hu et al., 2019). Other methods let a neural network grow or recruit new resources when it encounters new tasks; examples are progressive neural networks (Rusu et al., 2016) and dynamically expandable networks (Yoon et al., 2017). Although these task-specific approaches are generally successful in reducing catastrophic forgetting, an important disadvantage is that they require knowledge of the task identity at both training and test time. These methods are therefore not suitable for class-incremental learning.
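The random-mask idea above can be sketched in a few lines; this is a toy illustration under our own assumptions (a fixed hidden size, element-wise gating, and masks drawn once per task), not any particular published implementation:

```python
import random

# Toy sketch of mask-based task-specific methods: each task gets its own
# randomly drawn binary mask over a shared hidden layer, so different
# tasks use different sub-networks. Sizes and the gating rule are
# assumptions for illustration.

HIDDEN = 8
random.seed(0)
task_masks = {}

def register_task(task_id, keep_prob=0.5):
    # Draw a fixed binary mask once per task and reuse it thereafter.
    task_masks[task_id] = [1 if random.random() < keep_prob else 0
                           for _ in range(HIDDEN)]

def forward(hidden_acts, task_id):
    # Only the hidden units selected by the task's mask contribute.
    return [h * m for h, m in zip(hidden_acts, task_masks[task_id])]

register_task("task1")
register_task("task2")
acts = [1.0] * HIDDEN
out1 = forward(acts, "task1")
out2 = forward(acts, "task2")
print(out1 != out2)  # different tasks gate different sub-networks
```

The sketch also makes the stated drawback visible: `forward` needs `task_id` as an argument, i.e., the task identity must be known at test time, which is exactly what class-incremental learning does not provide.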

