CONTEXTUAL TRANSFORMATION NETWORKS FOR ONLINE CONTINUAL LEARNING

Abstract

Continual learning methods with fixed architectures rely on a single network to learn models that can perform well on all tasks. As a result, they often capture only the common features of those tasks and neglect each task's specific features. Dynamic architecture methods, on the other hand, can devote a separate network to each task, but they are too expensive to train and do not scale in practice, especially in online settings. To address this problem, we propose a novel online continual learning method named "Contextual Transformation Networks" (CTN) that efficiently models task-specific features while incurring negligible complexity overhead compared to other fixed architecture methods. Moreover, inspired by the Complementary Learning Systems (CLS) theory, we propose a novel dual memory design and an objective for training CTN that address catastrophic forgetting and knowledge transfer simultaneously. Our extensive experiments show that CTN is competitive with a large-scale dynamic architecture network and consistently outperforms other fixed architecture methods under the same standard backbone. Our implementation can be found at https://github.com/phquang/Contextual-

1. INTRODUCTION

Continual learning is a promising framework for building AI models that learn continuously over time, acquiring new knowledge while retaining the ability to perform already learned skills (French, 1999; 1992; Parisi et al., 2019; Ring, 1997). Online continual learning is particularly interesting because it resembles the real world: the model has to quickly obtain new knowledge on the fly by leveraging its learned skills. This problem is important for deep neural networks because optimizing them in the online setting has been shown to be challenging (Sahoo et al., 2018; Aljundi et al., 2019a). Moreover, while it is crucial to acquire new information, the model must also retain its acquired skills. Balancing between preventing catastrophic forgetting and facilitating knowledge transfer is imperative when learning on a stream of tasks, which is ubiquitous in realistic scenarios. Thus, in this work, we focus on continual learning in an online fashion, where both tasks and the data of each task arrive sequentially (Lopez-Paz & Ranzato, 2017).

In the literature, fixed architecture methods employ a shared feature extractor and a set of classifiers, one for each task (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; b; Aljundi et al., 2019a). Although using a shared feature extractor has achieved promising results, the common, global features are rather generic and not well tailored to each specific task. This problem is even more severe when old data are limited while learning new tasks. As a result, the common feature extractor loses its ability to extract previous tasks' features, resulting in catastrophic forgetting. On the other hand, while dynamic architecture methods such as Rusu et al. (2016); Li et al. (2019); Xu & Zhu (2018) alleviate this problem by having a separate network for each task, they suffer from unbounded growth in the number of parameters.
Moreover, designing the subnetworks is non-trivial and requires extensive resources (Rusu et al., 2016; Li et al., 2019), which is not practical in many applications. These limitations motivated us to develop a novel method that facilitates continual learning with a fixed architecture by modeling task-specific features. To achieve this goal, we first revisit a well-known result in learning multiple tasks: each task's features are centered around a common vector (Evgeniou & Pontil, 2004; Aytar & Zisserman, 2011; Pentina & Lampert, 2014; Liu et al., 2019b). This result motivates our framework of Contextual Transformation Networks (CTN), which consists of a base network that learns the common features of a given input and a controller that, given a task identifier, efficiently transforms the common features to become task-specific. While one can train CTN using experience replay, doing so does not explicitly aim at a good trade-off between stability and plasticity. Therefore, we propose a novel dual memory system and a learning method that address alleviating forgetting and facilitating knowledge transfer simultaneously. Particularly, we maintain two distinct memories: an episodic memory and a semantic memory, associated with the base model and the controller, respectively. The base model is trained by experience replay on the episodic memory, while the controller is trained to learn task-specific features that generalize to the semantic memory. As a result, CTN achieves a good trade-off between preventing catastrophic forgetting and facilitating knowledge transfer because the task-specific features generalize well to all past and current tasks. Figure 1 gives an overview of the proposed Contextual Transformation Networks (CTN).
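As a rough illustration of this dual-memory scheme, the sketch below alternates a replay update for the base model on current data plus the episodic memory with a generalization update for the controller on the held-out semantic memory. The toy linear model, the simple alternating first-order updates, and all tensor shapes are assumptions made for illustration only; they are not the paper's actual architecture or objective.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
phi = torch.randn(16, 8, requires_grad=True)    # base model parameters (toy linear map)
theta = torch.ones(16, requires_grad=True)      # controller parameters (toy feature scaling)

def loss_fn(x, y):
    feats = x @ phi.t()          # common features from the base model
    feats = theta * feats        # task-specific transformation by the controller
    return F.cross_entropy(feats, y)

opt_phi = torch.optim.SGD([phi], lr=0.1)
opt_theta = torch.optim.SGD([theta], lr=0.1)

# Current-task batch plus the two replay memories (random toy data here).
x_cur, y_cur = torch.randn(4, 8), torch.randint(0, 16, (4,))
x_epi, y_epi = torch.randn(4, 8), torch.randint(0, 16, (4,))  # episodic memory
x_sem, y_sem = torch.randn(4, 8), torch.randint(0, 16, (4,))  # semantic memory

# (1) Base model: experience replay on current data and the episodic memory.
opt_phi.zero_grad()
(loss_fn(x_cur, y_cur) + loss_fn(x_epi, y_epi)).backward()
opt_phi.step()

# (2) Controller: trained so the transformed features generalize
#     to the distinct semantic memory.
opt_theta.zero_grad()
loss_fn(x_sem, y_sem).backward()
opt_theta.step()
```

The key point the sketch conveys is the separation of concerns: only the semantic memory, never seen by the base model's replay step, drives the controller's update.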
Interestingly, the designs of our CTN and dual memory are partially related to the Complementary Learning Systems (CLS) theory in neuroscience (McClelland et al., 1995; Kumaran et al., 2016). Particularly, the controller acts as a neocortex that learns the structured knowledge of each task. In contrast, the base model acts as a hippocampus that performs rapid learning to acquire new information from the current task's training data. Following the naming convention of memory in neuroscience, our CTN is equipped with two types of replay memory: (i) the episodic memory (associated with the hippocampus) caches a small amount of past tasks' training data, which is replayed when training the base network; (ii) the semantic memory (associated with the neocortex) stores another, distinct set of old data used only to train the controller so that the task-specific features generalize well across tasks. Moreover, the CLS theory also suggests that the interplay between the neocortex and the hippocampus underlies the ability to recall knowledge and generalize to novel experiences (Kumaran & McClelland, 2012). Our proposed learning approach closely mirrors these properties: the base model focuses on acquiring new knowledge from the current task, while the controller uses the base model's knowledge to generalize to novel samples. In summary, our work makes the following contributions. First, we propose CTN, a novel continual learning method that models task-specific features while incurring negligible complexity overhead compared to fixed architecture methods (please refer to Table 4). Second, we propose a novel objective for training CTN that improves the trade-off between alleviating forgetting and facilitating knowledge transfer. Third, we conduct extensive experiments on continual learning benchmarks to demonstrate the efficacy of CTN compared to a suite of baselines.
Finally, we provide a comprehensive analysis investigating the complementarity of CTN's components.

2. METHOD

Notations. We denote by φ the parameters of the base model that extracts global features from the input, and by θ the parameters of the controller, which modifies the features from φ given a task identifier t. The task identifier can be a set of semantic attributes about the objects of that task (Lampert et al., 2009) or simply the index of the task, which we use in this work as a one-hot vector. A prediction is given as g_ϕt(h_φ,θ(x, t)), where g_ϕt(·) is task T_t's classifier with parameters ϕ_t, such as a fully connected layer.



Figure 1: Overview of the Contextual Transformation Networks (CTN). CTN consists of a controller θ that modifies the features of the base model φ. The base model is trained using experience replay on the episodic memory, while the controller is trained to generalize to the semantic memory, addressing both alleviating forgetting and facilitating knowledge transfer. Best viewed in color.
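To make the notation concrete, here is a minimal forward-pass sketch in PyTorch. The FiLM-style scale-and-shift transformation of the base features, the layer sizes, and all class and variable names are illustrative assumptions; the source specifies only that the controller transforms the base model's features given a one-hot task identifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTNSketch(nn.Module):
    """Toy model: base phi, controller theta, and per-task classifiers ϕ_t."""
    def __init__(self, in_dim, feat_dim, n_tasks, n_classes):
        super().__init__()
        self.n_tasks = n_tasks
        self.base = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())  # phi
        # Controller maps the one-hot task id to a (scale, shift) pair --
        # an assumed FiLM-style transformation of the common features.
        self.controller = nn.Linear(n_tasks, 2 * feat_dim)                 # theta
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_classes) for _ in range(n_tasks)])      # ϕ_t

    def forward(self, x, t):
        h = self.base(x)                                  # common features
        onehot = F.one_hot(torch.tensor(t), self.n_tasks).float()
        gamma, beta = self.controller(onehot).chunk(2)
        h = gamma * h + beta                              # h_φ,θ(x, t)
        return self.heads[t](h)                           # g_ϕt(·)

model = CTNSketch(in_dim=8, feat_dim=16, n_tasks=3, n_classes=2)
logits = model(torch.randn(4, 8), t=1)    # batch of 4 inputs, task identifier t=1
```

Note that only the small controller and the classifier heads depend on the task, which is why the per-task overhead stays small relative to the shared base network.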

