CONTEXTUAL TRANSFORMATION NETWORKS FOR ONLINE CONTINUAL LEARNING

Abstract

Continual learning methods with fixed architectures rely on a single network to perform well on all tasks. As a result, they often accommodate only the common features of those tasks while neglecting each task's specific features. Dynamic architecture methods, on the other hand, can maintain a separate network for each task, but they are too expensive to train and do not scale in practice, especially in online settings. To address this problem, we propose a novel online continual learning method named "Contextual Transformation Networks" (CTN) that efficiently models task-specific features while incurring negligible complexity overhead compared to other fixed architecture methods. Moreover, inspired by the Complementary Learning Systems (CLS) theory, we propose a novel dual memory design and an objective for training CTN that address catastrophic forgetting and knowledge transfer simultaneously. Our extensive experiments show that CTN is competitive with a large-scale dynamic architecture network and consistently outperforms other fixed architecture methods under the same standard backbone. Our implementation can be found at https://github.com/phquang/Contextual-

1. INTRODUCTION

Continual learning is a promising framework for building AI models that can learn continuously through time, acquiring new knowledge while retaining the ability to perform already learned skills (French, 1999; 1992; Parisi et al., 2019; Ring, 1997). Online continual learning is particularly interesting because it resembles the real world: the model has to quickly obtain new knowledge on the fly by leveraging its learned skills. This problem is important for deep neural networks because optimizing them in the online setting has been shown to be challenging (Sahoo et al., 2018; Aljundi et al., 2019a). Moreover, while it is crucial to obtain new information, the model must remain able to perform its acquired skills. Balancing between preventing catastrophic forgetting and facilitating knowledge transfer is imperative when learning on a stream of tasks, which is ubiquitous in realistic scenarios. Thus, in this work, we focus on the continual learning setting in an online learning fashion, where both tasks and the data of each task arrive sequentially (Lopez-Paz & Ranzato, 2017). In the literature, fixed architecture methods employ a shared feature extractor and a set of classifiers, one for each task (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; b; Aljundi et al., 2019a). Although using a shared feature extractor has achieved promising results, the common, global features are rather generic and not well tailored towards each specific task. This problem is even more severe when old data are limited while learning new tasks. As a result, the common feature extractor loses its ability to extract previous tasks' features, resulting in catastrophic forgetting. On the other hand, while dynamic architecture methods such as Rusu et al. (2016); Li et al. (2019); Xu & Zhu (2018) alleviate this problem by having a separate network for each task, they suffer from the unbounded growth of their parameters.
Moreover, the subnetworks' design is not trivial and requires extensive resource usage (Rusu et al., 2016; Li et al., 2019), which is not practical in many applications. These limitations motivated us to develop a novel method that can facilitate continual learning with a fixed architecture by modeling the task-specific features.
Figure 1: Contextual Transformation Networks (CTN). CTN consists of a controller θ that modifies the features of the base model φ. The base model is trained using experience replay on the episodic memory, while the controller is trained to generalize to the semantic memory, which addresses both alleviating forgetting and facilitating knowledge transfer. Best viewed in color.
To achieve this goal, we first revisit a popular result in learning multiple tasks: each task's features are centered around a common vector (Evgeniou & Pontil, 2004; Aytar & Zisserman, 2011; Pentina & Lampert, 2014; Liu et al., 2019b). This result motivates us to develop the novel framework of Contextual Transformation Networks (CTN), which consists of a base network that learns the common features of a given input and a controller that efficiently transforms the common features to become task-specific, given a task identifier. While one can train CTN using experience replay, doing so does not explicitly aim at achieving a good trade-off between stability and plasticity. Therefore, we propose a novel dual memory system and a learning method that encapsulate alleviating forgetting and facilitating knowledge transfer simultaneously. Particularly, we propose two distinct memories: the episodic memory and the semantic memory, associated with the base model and the controller, respectively. The base model is trained by experience replay on the episodic memory, while the controller is trained to learn task-specific features that can generalize to the semantic memory.
As a result, CTN achieves a good trade-off between preventing catastrophic forgetting and facilitating knowledge transfer because the task-specific features generalize well to all past and current tasks. Figure 1 gives an overview of the proposed Contextual Transformation Network (CTN). Interestingly, the designs of our CTN and dual memory are partially related to the Complementary Learning Systems (CLS) theory in neuroscience (McClelland et al., 1995; Kumaran et al., 2016). Particularly, the controller acts as a neocortex that learns the structured knowledge of each task. In contrast, the base model acts as a hippocampus that performs rapid learning to acquire new information from the current task's training data. Following the naming convention of memory in neuroscience, our CTN is equipped with two replay memory types: (i) the episodic memory (associated with the hippocampus) caches a small amount of past tasks' training data, which is replayed when training the base network; (ii) the semantic memory (associated with the neocortex) stores another, distinct set of old data used only to train the controller, so that the task-specific features can generalize well across tasks. Moreover, the CLS theory also suggests that the interplay between the neocortex and the hippocampus contributes to the ability to recall knowledge and generalize to novel experiences (Kumaran & McClelland, 2012). Our proposed learning approach closely characterizes such properties: the base model focuses on acquiring new knowledge from the current task, while the controller uses the base model's knowledge to generalize to novel samples. In summary, our work makes the following contributions. First, we propose CTN, a novel continual learning method that can model task-specific features while enjoying negligible complexity overhead compared to fixed architecture methods (please refer to Table 4).
Second, we propose a novel objective that can improve the trade-off between alleviating forgetting and facilitating knowledge transfer to train CTN. Third, we conduct extensive experiments on continual learning benchmarks to demonstrate the efficacy of CTN compared to a suite of baselines. Finally, we provide a comprehensive analysis to investigate the complementarity of each CTN's component.

2. METHOD

Notations. We denote φ as the parameters of the base model, which extracts global features from the input, and θ as the parameters of the controller, which modifies the features from φ given a task identifier t. The task identifier can be a set of semantic attributes about the objects of that task (Lampert et al., 2009) or simply the index of the task, which we use in this work as a one-hot vector. A prediction is given as g_{ϕ_t}(h_{φ,θ}(x, t)), where g_{ϕ_t}(·) is task T_t's classifier with parameters ϕ_t, such as a fully connected layer with softmax activation, and h_{φ,θ}(x, t) is the final feature after transformation by the controller. We denote D^tr_t as the training data of task T_t, and M^em_t and M^sm_t as the episodic memory and the semantic memory of task T_t, respectively. The episodic memory and semantic memory maintain two distinct sets of data obtained from task T_t. The episodic memory of tasks T_1, ..., T_{t-1} is denoted as M^em_{<t}; similarly, M^sm_{<t} denotes the semantic memory of the first t-1 tasks. Remark. Both M^em_t and M^sm_t are obtained from D^tr_t through the learner's internal memory management strategy and contain distinct samples, such that their combined size does not exceed a pre-defined budget.

2.1. LEARNING TASK-SPECIFIC FEATURES FOR CONTINUAL LEARNING

Given a backbone network, one could implement task-specific features by employing a set of task-specific filters and applying them to the backbone's output. However, this trivial approach is not scalable, even for small networks: in the worst case, it amounts to storing an additional network per task, which violates the fixed architecture constraint. Since we want to obtain task-specific features with minimal parameter overhead, we propose to use a feature-wise transformation (Perez et al., 2018) to efficiently extract the task-specific features h̃(x, t) from the common features ĥ(x) as follows:

h̃(x, t) = (γ_t / ‖γ_t‖_2) ⊗ ĥ(x) + β_t / ‖β_t‖_2, where {γ_t, β_t} = c_θ(t),    (1)

where ⊗ denotes the element-wise multiplication operator, and c_θ(t) is the controller, implemented as a linear layer with parameters θ that predicts the transformation coefficients {γ_t, β_t} given the task identifier t. Since the task identifiers are one-hot vectors, which are sparse and make training the controller difficult, we also introduce an embedding layer that maps the task identifiers to dense, low-dimensional vectors. For simplicity, we use θ to refer to both the embedding and the linear layer parameters. In addition, instead of storing a set of coefficients {γ_t, β_t} for each task, we only need a fixed set of parameters θ to predict these coefficients, which keeps the number of controller parameters fixed. The coefficients {γ_t, β_t} are ℓ2-normalized and then transform the common features ĥ(x) into the task-specific features h̃(x, t). Finally, both feature types are combined by a residual connection before being passed to the corresponding classifier g_{ϕ_t}(·) to make the final prediction:

g_{ϕ_t}(σ(h(x, t))), where h(x, t) = ĥ(x) + h̃(x, t),    (2)

where σ(·) is a nonlinear activation function such as ReLU. Importantly, when the task-specific features are removed, i.e., h̃(x, t) = 0, Eq. 2 reduces to the traditional experience replay model.
Lastly, for each incoming task, CTN has to allocate a new classifier, which is the same for all continual learning methods, and a new embedding vector, which is usually low dimensional, e.g. 32 or 64. Therefore, CTN enjoys almost the same parameter growth as existing continual learning methods.
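As a concrete illustration, the feature-wise transformation and residual combination described in this section can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the Controller class, its random initialization, and all dimensions are illustrative assumptions.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit L2 norm (the paper normalizes gamma_t and beta_t)."""
    return v / (np.linalg.norm(v) + eps)

class Controller:
    """Hypothetical minimal controller: task id -> embedding -> (gamma_t, beta_t)."""
    def __init__(self, n_tasks, emb_dim, n_channels, rng):
        self.E = rng.normal(size=(n_tasks, emb_dim))         # task embedding table
        self.W = rng.normal(size=(emb_dim, 2 * n_channels))  # single linear layer
    def __call__(self, t):
        coeff = self.E[t] @ self.W
        gamma, beta = np.split(coeff, 2)
        return l2_normalize(gamma), l2_normalize(beta)

def ctn_features(h_common, t, controller):
    """Combine common and task-specific features with a residual connection (Eq. 2)."""
    gamma, beta = controller(t)
    h_specific = gamma * h_common + beta   # feature-wise (FiLM-style) transform
    return h_common + h_specific           # residual combination
```

Setting `h_specific` to zero recovers the plain shared feature extractor, matching the observation that CTN reduces to traditional experience replay in that case.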

2.2. TRAINING THE CONTROLLER

While one can train CTN with experience replay (ER), doing so does not explicitly address the trade-off between facilitating knowledge transfer and alleviating catastrophic forgetting. This motivates us to develop a novel training method that simultaneously addresses both problems by leveraging the controller's task-specific features. First, we introduce a dual memory system consisting of the semantic memory M^sm_t, associated with the controller, and the episodic memory M^em_t, associated with the base model. We propose to train only the base model using experience replay with the episodic memory to obtain new knowledge from incoming tasks. The controller is trained so that the task-specific features generalize to the samples stored in the semantic memory, which are unseen by the base model. As a result, the task-specific features can generalize to both previous and current tasks, which simultaneously encapsulates alleviating forgetting and facilitating knowledge transfer. Formally, given the current batch of data B_t for task T_t, the training of CTN can be formulated as the following bilevel optimization problem (Colson et al., 2007):

Outer problem: min_θ L_ctrl({φ*, θ}; M^sm_{<t+1})
Inner problem: s.t. φ* = argmin_φ L_tr({φ, θ}, B_t ∪ M^em_{<t}),    (3)

where φ* denotes the optimal base model corresponding to the current controller θ. Since every CTN prediction always involves both the controller and the base model, we use L_tr({φ, θ}, B_t ∪ M^em_{<t}) to denote the training loss of the pair {φ, θ} on the data B_t ∪ M^em_{<t}. Similarly, L_ctrl(·) denotes the controller's loss. For simplicity, we omit the dependency of the losses on the classifiers' parameters and imply that the classifiers are jointly updated with the base model. Since we do not know the optimal transformation coefficients for any task, the controller is trained to minimize the classification loss of the samples via φ.
We implement both the training and controller losses as the cross-entropy loss. Notably, Eq. 3 characterizes two nested optimization problems: the outer problem trains the controller to generalize, and each controller parameter θ parameterizes an inner problem that trains the base model to acquire new knowledge via experience replay. Moreover, only φ is trained in the inner problem, while only θ is updated in the outer problem. Bilevel optimization objectives such as Eq. 3 have been successfully applied in other machine learning disciplines, such as hyperparameter optimization, meta learning (Franceschi et al., 2018; Finn et al., 2017), and AutoML (Liu et al., 2019a). In this work, we extend this framework to continual learning to train the controller. However, unlike existing works (Franceschi et al., 2018; Finn et al., 2017; Liu et al., 2019a), our Eq. 3 has to be solved incrementally as new data samples arrive. Therefore, we consider Eq. 3 as an online learning problem and optimize it using the follow-the-leader principle (Hannan, 1957). Particularly, we relax the optimal solutions of both the inner and outer problems to be the solutions obtained after a few gradient steps. When new training data arrives, we first train the base model φ using experience replay for a few SGD steps with an inner learning rate α, each of which is implemented as:

φ ← φ − α∇_φ L_tr({φ, θ}, B_t ∪ M^em_{<t}).    (4)

Then, we optimize the controller θ such that it improves φ's performance on the semantic memory:

θ ← θ − β∇_θ L_ctrl({φ, θ}, M^sm_{<t+1}),    (5)

where β is the outer learning rate. As a result, Eq. 3 is implemented as an alternating update procedure involving several outer updates to train θ, each of which includes an inner update to train φ. Moreover, performing several updates per incoming sample does not violate the online assumption, since we will not revisit that sample in the future unless it is stored in the memories.
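The alternating inner/outer updates above can be sketched as follows. This is an illustrative skeleton under simplifying assumptions, not the authors' code: the callbacks `grad_phi` and `grad_theta` stand in for backpropagation through L_tr and L_ctrl, and the batch and memories are plain lists of samples.

```python
def inner_step(phi, theta, batch, lr, grad_phi):
    """One ER step on the base model: phi <- phi - alpha * dL_tr/dphi."""
    return phi - lr * grad_phi(phi, theta, batch)

def outer_step(theta, phi, sem_mem, lr, grad_theta):
    """One controller step on the semantic memory: theta <- theta - beta * dL_ctrl/dtheta."""
    return theta - lr * grad_theta(phi, theta, sem_mem)

def ctn_update(phi, theta, batch, epi_mem, sem_mem, alpha, beta,
               grad_phi, grad_theta, n_outer=2):
    """Alternating relaxation of the bilevel problem: each outer update on theta
    is preceded by an inner experience-replay update on phi."""
    replay = list(batch) + list(epi_mem)  # B_t U M^em_{<t}
    for _ in range(n_outer):
        phi = inner_step(phi, theta, replay, alpha, grad_phi)
        theta = outer_step(theta, phi, sem_mem, beta, grad_theta)
    return phi, theta
```

With scalar parameters and convex losses, repeatedly calling `ctn_update` drives both losses down, which is the behaviour the follow-the-leader relaxation relies on.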

2.3. TRAINING THE BASE NETWORK

Despite using task-specific features, the base network may still forget previous tasks because of the small episodic memory. To further alleviate catastrophic forgetting in φ, we regularize the training loss L_tr(·) with a behavioral cloning (BC) strategy based on knowledge distillation (Hinton et al., 2015; van de Ven & Tolias, 2018). Let ŷ be the logits of the model's prediction before the softmax layer π(·); we regularize the training loss on the episodic memory data in Eq. 4 as:

L_tr({φ, θ}, (x, y, k)) = L(π(ŷ), y) + λ D_KL(π(ŷ/τ) ‖ π(ŷ_k/τ)),    (6)

where λ is the trade-off parameter, τ is the softmax temperature, and ŷ_k is a snapshot of the model's prediction on the sample x taken at the end of task T_k. While the behavioral cloning strategy requires storing ŷ_k, the memory increase is minimal since ŷ_k is a vector whose dimension is bounded by the total number of classes, which is much smaller than the dimension of the image x. Importantly, the behavioral cloning strategy is used to alleviate catastrophic forgetting, which happens only in the base model, not in the controller. Particularly, the controller's inputs are task identifiers such as one-hot vectors, which are fully available during learning. In summary, our episodic memory stores the input image x, its corresponding label y, and the soft label ŷ_k, while the semantic memory stores only the input-label pair (x, y).
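A minimal sketch of this behavioral-cloning replay loss follows. It is illustrative, not the authors' implementation: the helper names and the default values of λ and τ are assumptions, and the KL direction matches the formula above (current prediction first, stored snapshot second).

```python
import numpy as np

def softmax(z, temp=1.0):
    """Numerically stable temperature softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / temp
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two probability vectors."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def replay_loss(logits, label, old_logits, lam=1.0, temp=2.0):
    """Cross-entropy on the true label plus a behavioral-cloning term that
    pulls the current prediction toward the logits snapshot y_k stored with
    the memory sample (lam and temp values are illustrative)."""
    ce = -float(np.log(softmax(logits)[label] + 1e-12))
    bc = kl_div(softmax(logits, temp), softmax(old_logits, temp))
    return ce + lam * bc
```

Deviating from the stored snapshot inflates only the BC term, so the regularizer penalizes drift on memory samples without touching the classification term.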

3.1. CONTINUAL LEARNING

Prior works in continual learning can be grouped into three main categories: (1) regularization methods, (2) episodic memory based methods, and (3) dynamic architecture methods. Regularization approaches (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018; Ritter et al., 2018) penalize changes to parameters that are important for previous tasks, using a variant of knowledge distillation (Li & Hoiem, 2017) or quadratic constraints. However, such methods usually isolate parameters or seek a common solution for all tasks, limiting the model's capacity. Episodic memory based approaches store a small amount of data from previous tasks and interleave it with data from the current task; the old data can be used as a constraint when optimizing the model (Lopez-Paz & Ranzato, 2017). Dynamic architecture approaches address catastrophic forgetting by having a subnetwork for each task (Rusu et al., 2016; Serra et al., 2018; von Oswald et al., 2020) or by growing their structure over time (Yoon et al., 2018; Li et al., 2019; Xu & Zhu, 2018; Hung et al., 2019). Notably, von Oswald et al. (2020) employ a hypernetwork (Ha et al., 2017) to generate a whole prediction network for each task, and catastrophic forgetting is avoided by performing experience replay in the hypernetwork's output space. However, this approach requires storing the hypernetwork's output for each task, which is equivalent in size to a prediction network's parameters. Therefore, while Serra et al. (2018) and von Oswald et al. (2020) have achieved promising results, they require more memory and might not be suitable for the online setting.

3.2. FEATURE-WISE TRANSFORMATION

Early works (Bertinetto et al., 2016; Rebuffi et al., 2017a) showed that instead of using a task-specific network on the input, one can employ a set of 1 × 1 filters to extract task-specific features from the common features. However, such approaches still incur a complexity overhead quadratic in the number of channels, which can be expensive. Another compelling solution is the feature-wise transformation FiLM (Perez et al., 2018), which requires only linear complexity. Thanks to its efficiency, FiLM has been successfully applied to many problems, including meta learning (Requeima et al., 2019; Zintgraf et al., 2019), visual reasoning (Perez et al., 2018), and other fields (Dumoulin et al., 2018). Notably, CNAPs (Requeima et al., 2019) proposed an adaptation network to generate the FiLM parameters and quickly adapt to new tasks. CNAPs has shown promising results when given access to a large number of tasks to pre-train the common features. However, this setting differs from continual learning, where the learner has to obtain new knowledge on the fly. CNAPs thus principally differs from CTN in that CNAPs assumes access to a well-pretrained knowledge source and uses FiLM to quickly adapt this knowledge to a new task, whereas CTN uses FiLM to accelerate knowledge acquisition while learning progressively. Lastly, we emphasize that the CTN design is general: if a larger budget is allowed, CTN is readily compatible with the aforementioned feature transformation methods, such as Rebuffi et al. (2017a), by adjusting the controller's output dimension.

3.3. META LEARNING

Meta learning (Schmidhuber, 1987), also known as learning to learn, refers to a learning paradigm where an algorithm learns to improve the performance of another algorithm. Our CTN design is related to such learning-to-learn architectures in that the controller is trained to improve the base model's performance. Importantly, we note that there exist other continual learning variants that intersect with meta learning, such as meta-continual learning (Javed & White, 2019) and continual-meta learning (He et al., 2019; Caccia et al., 2020). However, they consider different goals and problem settings, such as meta pre-training (Javed & White, 2019) or rapidly recovering performance at test time when a finetuning step is allowed before inference (He et al., 2019), which is not the conventional online continual learning problem (Lopez-Paz & Ranzato, 2017) we focus on in this study. Meta learning has also been an appealing solution for learning a good initialization from a large number of tasks (Finn et al., 2017), even in an online manner: Online Meta Learning (OML) (Finn et al., 2019). However, we emphasize that OML fundamentally differs from our CTN in two aspects. First, OML requires all data of previous tasks and aims to improve the performance of future tasks, which differs from continual learning. Second, OML learns an initialization and requires finetuning at test time, which is not practical, especially when testing on already learned tasks. In contrast, CTN is a continual learning method that maximizes the performance of the current task as well as all previous tasks. Moreover, CTN can make a prediction at any time without requiring an additional finetuning step.
Benchmarks. Split Mini ImageNet (Split miniIMN) (Chaudhry et al., 2019a): similarly to Split CIFAR, we split the miniIMN dataset (Vinyals et al., 2016) into 20 disjoint tasks. Finally, we consider the CORe50 benchmark by constructing a sequence of 10 tasks from the original CORe50 dataset (Lomonaco & Maltoni, 2017).

4. EXPERIMENTS

Throughout the experiments, we compare CTN with a suite of baselines: GEM (Lopez-Paz & Ranzato, 2017), AGEM (Chaudhry et al., 2019a), MER (Riemer et al., 2019), ER-Ring (Chaudhry et al., 2019b), and MIR (Aljundi et al., 2019a). We also consider the independent model (Lopez-Paz & Ranzato, 2017), a dynamic architecture method that maintains a separate network for each task, each with the same number of parameters as the other baselines' networks. While the independent model is unrealistic, it is highly competitive and was used as an upper bound for a state-of-the-art dynamic architecture method in Hung et al. (2019). Finally, we include the Offline model, which does not follow the continual learning setting and performs multitask training on all tasks' data. Due to space constraints, we provide the results of less competitive methods in Appendix C.2. We use a multilayer perceptron with two hidden layers of size 256 for pMNIST, a reduced ResNet18 with three times fewer filters (Lopez-Paz & Ranzato, 2017) for Split CIFAR and Split miniIMN, and a full ResNet18 for CORe50. Following Lopez-Paz & Ranzato (2017), we use a Ring buffer as the memory structure for all methods and random sampling to select data from memory, including the episodic and semantic memories of CTN. The exceptions are MER (Riemer et al., 2019), which uses reservoir sampling, and MIR (Aljundi et al., 2019a), which uses the sampling strategy proposed by its authors. For CTN, the episodic memory and semantic memory are implemented as two Ring buffers whose sizes equal 80% and 20% of the total budget, respectively; this configuration is cross-validated on the validation tasks. For each incoming batch of data, we randomly push 80% of the samples to the current task's episodic memory and the other 20% to the current task's semantic memory. We follow the procedure proposed in Chaudhry et al. (2019a) to cross-validate all hyperparameters using the first three tasks.
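The 80/20 dual-memory design above can be sketched as follows. The RingBuffer and DualMemory classes, the per-sample random routing, and the seed are illustrative assumptions, not the authors' implementation.

```python
import random

class RingBuffer:
    """FIFO per-task buffer: once full, the oldest sample is overwritten."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, []
    def push(self, sample):
        if len(self.data) == self.capacity:
            self.data.pop(0)
        self.data.append(sample)

class DualMemory:
    """Split a per-task budget between episodic (80%) and semantic (20%)
    buffers; each incoming sample is routed to exactly one buffer, so the
    two memories hold distinct samples as required by the method."""
    def __init__(self, budget, ratio=0.8, seed=0):
        epi_cap = int(budget * ratio)
        self.episodic = RingBuffer(epi_cap)
        self.semantic = RingBuffer(budget - epi_cap)
        self.ratio = ratio
        self.rng = random.Random(seed)
    def push(self, sample):
        target = self.episodic if self.rng.random() < self.ratio else self.semantic
        target.push(sample)
```

Because a sample is pushed to exactly one buffer, the combined memory never exceeds the budget and the two sets stay disjoint.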
Then, the best configuration is selected to perform continual learning on the remaining tasks. During continual learning, the task identifier is given to all methods. We optimize all models using SGD with a mini-batch of size ten over one epoch. We run each experiment five times, each with the same task order but a different initialization seed, and report the following metrics: Averaged Accuracy (Lopez-Paz & Ranzato, 2017): ACC(↑) (higher is better); Forgetting Measure (Chaudhry et al., 2018): FM(↓) (lower is better); and Learning Accuracy (Riemer et al., 2019): LA(↑) (higher is better).

4.2. RESULTS OF CONTINUAL LEARNING BENCHMARKS

Table 1 reports the evaluation metrics on the four continual learning benchmarks considered, with 50 memory samples per task. We observe that CTN is comparable even with the independent method and outperforms the other baselines by a large margin. We remind the reader that the independent method has T times more parameters than the remaining methods, where T is the total number of tasks. Moreover, CTN can exploit relationships across tasks via the task identifiers to improve its performance. For example, learning to classify "man" and "woman" may help classify "boy" and "girl" because they belong to the same superclass "people". Finally, CTN significantly outperforms the baselines by achieving a better trade-off between alleviating catastrophic forgetting and facilitating knowledge transfer, as shown by its lower FM(↓) and higher LA(↑). Overall, CTN achieves state-of-the-art results, comparable even with a large-scale dynamic architecture method, while enjoying negligible model complexity overhead compared to fixed architecture methods.
ACC(↑) as a function of the episodic memory size. We study the models' performance as the memory size increases. We consider the Split CIFAR100 and Split miniIMN benchmarks and train CTN, ER, MIR, and GEM with the total memory size per task increasing from 50 to 200. Fig. 2 plots the ACC(↑) curves as a function of the memory size. Generally, the performance of all methods increases with larger memory sizes. Overall, CTN consistently outperforms the competitors across all memory sizes. Notably, on both benchmarks, CTN achieves performance comparable to the Offline model when the memory size per task is only 175. These results show that CTN not only excels in the low-memory regime but also scales remarkably well when a larger memory budget is allowed.
Performance with limited training data. We also evaluate the methods when only a fraction of each task's training data is available; Table 2 shows the results of this experiment. When the training data are scarce, the baselines' performance drops significantly, even below 50% ACC(↑) in three settings.
CTN, on the other hand, consistently outperforms the baselines by a large margin, from 8% to 10% across benchmarks, even in the challenging Reduced Split CIFAR 10% setting. Moreover, the three baselines have similarly low LA, showing that they struggle to acquire new knowledge when the training data of each task are limited. In contrast, CTN can leverage information about the task-specific features to improve knowledge transfer and the learning outcomes. It is worth noting that even with 25% of the training data and 50 memory slots per task, CTN already outperforms several baselines trained with full data, as can be seen by cross-referencing the results with Table 1.

4.4. ABLATION STUDY

We study the contribution of each component of CTN to its overall performance, considering the Split CIFAR and Split miniIMN benchmarks with an episodic memory of 50 samples per task. Particularly, we are interested in how (1) the controller, (2) the bilevel optimization, and (3) the behavioral cloning strategy contribute to the base model. We implement variants of CTN with different combinations of these components and report the results in Table 3. Notably, CTN with only the controller (C) is equivalent to training the base network and the controller using the vanilla experience replay approach. Even so, the controller offers significant improvements over ER: over 5% ACC(↑) on Split miniIMN. When the controller is optimized with our proposed bilevel objective (C + BO), the performance further improves, showing that the bilevel objective achieves a better trade-off between alleviating forgetting and facilitating knowledge transfer. Lastly, the behavioral cloning strategy helps alleviate forgetting and further strengthens the results. Overall, each of the proposed components contributes positively to the base model, and together they work as a holistic method that achieves state-of-the-art results in continual learning.

4.5. COMPLEXITY ANALYSIS

In this section, we study CTN's complexity with the backbones used in our experiments and report the results in Table 4. In all cases, the controller adds only minimal additional parameters, almost negligible in complex deep architectures such as ResNets (He et al., 2016; Lopez-Paz & Ranzato, 2017). Therefore, we can safely compare CTN with other fixed architecture methods using the same backbone, because they have nearly the same number of parameters. Table 5 reports the averaged running time (in seconds) of the considered methods. All methods are implemented using PyTorch (Paszke et al., 2019) version 1.5 and CUDA 10.2. Experiments are conducted on a single K80 GPU, and all methods are allowed up to four gradient steps per sample. Clearly, ER-Ring has the most efficient time complexity thanks to its simplicity. On the other hand, GEM has high computational costs because of its quadratic constraints. MIR also exhibits a high running time because of its virtual update, which doubles the total number of gradient updates. CTN, in general, is slightly faster than MIR and more efficient than GEM. Overall, CTN achieves a great trade-off between model/computational complexity and performance: CTN performs significantly better than the considered baselines with only minimal memory and computational overhead.

5. CONCLUSION

In this work, we study the online continual learning problem and propose Contextual Transformation Networks (CTN), in which a fixed architecture network can model both the common features and the specific features of each task. CTN works by employing a controller that modifies the features of the base network conditioned on the task identifiers. To optimize CTN, we further propose a novel dual memory system equipped with a bilevel optimization objective that can efficiently transfer knowledge and alleviate forgetting simultaneously. Moreover, we discuss the relationship of CTN to the Complementary Learning Systems theory in neuroscience and to meta learning, showing that CTN is connected to other disciplines. Through extensive experiments, our results demonstrate that CTN consistently outperforms fixed architecture methods and achieves state-of-the-art results. Moreover, CTN is even comparable with a large-scale dynamic architecture network while enjoying almost no additional model complexity.

A EVALUATION METRICS AND PROTOCOLS

To measure model performance, we adopt three standard metrics: Average Accuracy ACC(↑) (Lopez-Paz & Ranzato, 2017), Forgetting Measure FM(↓) (Chaudhry et al., 2019a), and Learning Accuracy LA(↑) (Riemer et al., 2019). Denote a_{i,j} as the model's accuracy evaluated on the test set D^te_j after it has been trained on the most recent sample of task T_i's dataset D_i. The metrics are defined as:
• Average Accuracy (higher is better): the average accuracy over all observed tasks: ACC(↑) = (1/T) Σ_{i=1}^{T} a_{T,i}.
• Forgetting Measure (lower is better): the average forgetting over all previous tasks: FM(↓) = (1/(T−1)) Σ_{j=1}^{T−1} [ max_{l∈{1,...,T−1}} a_{l,j} − a_{T,j} ].
• Learning Accuracy (higher is better): the performance of the model on a task right after it finishes training on that task: LA(↑) = (1/T) Σ_{i=1}^{T} a_{i,i}.
In the literature, there exist several different continual learning protocols.
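As a concrete check of these definitions, a minimal sketch computing the three metrics from the accuracy matrix a_{i,j} (the matrix values in the test are made up for illustration):

```python
def continual_metrics(a):
    """Compute ACC, FM, LA from a T x T accuracy matrix, where a[i][j] is the
    accuracy on task j's test set after finishing task i (0-indexed)."""
    T = len(a)
    acc = sum(a[T - 1][i] for i in range(T)) / T
    fm = sum(max(a[l][j] for l in range(T - 1)) - a[T - 1][j]
             for j in range(T - 1)) / (T - 1)
    la = sum(a[i][i] for i in range(T)) / T
    return acc, fm, la
```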
Here we categorize them based on two questions: (i) Is information about the task of a sample given during training and testing? (ii) Is training within each task performed online? For question (i), when we do not know which task a sample belongs to, the evaluation is called "single-head" and a shared classifier is used for all tasks (Aljundi et al., 2019b). For question (ii), the data of a task can either be fully available when the task changes or arrive sequentially. When all the task data is available, training within a task can be done in an offline fashion with multiple epochs through the data. The protocol used in this work was proposed in Lopez-Paz & Ranzato (2017): data of each task arrives sequentially and the task identifier is given. Moreover, hyperparameter cross-validation is an important problem in continual learning regardless of the protocol considered; in particular, we must not use data of future tasks when searching for hyperparameters. Here we follow Chaudhry et al. (2019a) and assume access to a small number of tasks prior to continual learning. Such tasks will not be encountered again during actual continual learning and are only used for cross-validation.

In this section, we provide the implementation details of CTN on the two feedforward network bases that we use in our experiments. We implement the context model as a single regression layer. Moreover, we share the parameters of the scale and shift models γ, β, resulting in one set of parameters that takes a task embedding as input and outputs both the scale and shift values for a particular layer of the base network. Next, we describe our implementation of CTN with an MLP and a ResNet (He et al., 2016) as the base network. For CTN, we use ĥ for the original features, h̃ for the task-specific features, and h for the combined features.

CTN with Multilayer Perceptron. Consider an L-layer MLP of the form:

h_0 = x
h_l = ReLU(W_l h_{l−1}),  ∀l = 1, ..., L−1
h_L = g_t = Softmax(W_{L,t} h_{L−1})

where the last layer is the task-specific softmax classifier g_t. Since the last classification layer is already conditioned on the task information, we are interested in conditioning the intermediate layers h_{l<L}. CTN with an MLP is implemented as:

ĥ_0 = x
ĥ_l = ReLU(W_l h_{l−1}),  ∀l = 1, ..., L−1
h̃_l = ReLU(γ_t ⊗ W_l h_{l−1} + β_t),  ∀l = 1, ..., L−1
h_l = ĥ_l + h̃_l
h_L = g_t = Softmax(W_{L,t} h_{L−1})

We condition each hidden layer of the MLP by using one context network per layer. The context networks do not share parameters across layers; however, the scale and shift models within one layer are shared.

CTN with Deep Residual Network. Unlike the MLP, we apply the task conditioning after the residual blocks instead of after each convolution layer. Particularly, given a residual block defined as:

ĥ_1 = ReLU(BN(conv(x)))
ĥ_2 = BN(conv(ĥ_1))
ĥ_3 = BN(conv(x))
h̃ = ĥ_2 + ĥ_3

where ĥ_3 is the shortcut branch, the task-conditioned residual block is computed as:

h = ReLU(h̃) + ReLU(γ_t ⊗ h̃ + β_t)

While in principle it is possible to have a context network for each residual block, we empirically found that this does not offer significant improvements over using only one controller on the last residual block. Therefore, we only use one controller on the last residual block in all experiments that use a ResNet.
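The task-conditioned MLP layer above can be sketched in a few lines of NumPy. Function and variable names are our own; the controller is the single linear regression layer described earlier, mapping a task embedding e(t) to (γ_t, β_t):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def controller(e_t, theta):
    """Single linear layer predicting (gamma_t, beta_t) from the task embedding."""
    out = theta @ e_t           # theta has shape (2C, e)
    C = out.shape[0] // 2
    return out[:C], out[C:]

def ctn_mlp_layer(h_prev, W, gamma_t, beta_t):
    """One task-conditioned hidden layer: combined = original + task-specific."""
    z = W @ h_prev
    h_hat = relu(z)                       # original (shared) features
    h_tilde = relu(gamma_t * z + beta_t)  # task-specific features
    return h_hat + h_tilde                # combined features
```

With γ_t = 1 and β_t = 0 the two branches coincide, so the layer simply doubles the shared ReLU features; the controller learns per-task deviations from this neutral modulation.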

C EXPERIMENT DETAILS C.1 DATASET SUMMARY

We summarize the datasets used in our experiments in Table 6.

Dataset                               | Classes | Train   | Test   | Image size
CIFAR100 (Krizhevsky & Hinton, 2009)  | 100     | 50,000  | 10,000 | 3×32×32
miniIMN (Vinyals et al., 2016)        | 100     | 50,000  | 10,000 | 3×84×84
CORe50 (Lomonaco & Maltoni, 2017)     | 50      | 119,894 | 44,971 | 3×84×84

For each benchmark, we normalize the pixel values to [0, 1] by dividing them by 255.0, as in Lopez-Paz & Ranzato (2017); no other data preprocessing steps are performed.

C.2 ADDITIONAL BASELINES

In Table 7, we provide a more comprehensive comparison with additional baselines on the four benchmarks considered: Permuted MNIST, Split CIFAR100, Split miniIMN, and CORe50. Some of these baselines are less competitive and thus were not included in the main paper due to space constraints. We provide a brief description of each baseline in the following.

• BCL (Pham et al., 2020): a bilevel-optimization method using the Reptile update (Nichol et al., 2018) such that the base model can generalize to a separate memory unit. Unlike BCL, our CTN can model the task-specific features and does not need approximations to solve the bilevel optimization problem.

Here h(x) is a feature map of dimension (C × H × W), where C is the number of channels and H and W are the spatial dimensions of this feature map. A feature-wise affine transformation γ_t, β_t is only required to have dimension (C × 1 × 1) for each of γ_t and β_t. In our implementation, we predict both γ_t and β_t from the task embedding e(t) with a parameter θ. As a result, letting e be the task embedding dimension, the embedding layer costs (T × e) parameters and the controller (a linear regression model) costs (2C × e), resulting in a total of (T × e + 2C × e) parameters in the controller for all tasks. In practice, this total is dominated by the 2C × e term because T and e, the number of tasks and the embedding dimension, are quite small. When a new task arrives, we only need to allocate e additional parameters in the embedding matrix. Overall, CTN offers significant performance improvements with only minimal memory overhead.
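The parameter count above is simple arithmetic; the following sketch makes it concrete. The example values (T = 20 tasks, e = 16, C = 512 channels) are illustrative assumptions, not the paper's actual settings:

```python
def ctn_controller_params(T, e, C):
    """Total controller parameters, per the accounting in the text.

    T: number of tasks, e: task-embedding dimension, C: number of channels
    being modulated. The embedding table costs T*e; the shared linear model
    mapping e(t) to (gamma_t, beta_t) costs 2*C*e (biases omitted).
    """
    return T * e + 2 * C * e

# Illustrative numbers (assumed, not from the paper):
total = ctn_controller_params(T=20, e=16, C=512)
per_new_task = 16  # adding a task only adds one embedding row of e parameters
```

With these numbers the controller holds 20·16 + 2·512·16 = 16,704 parameters, negligible next to a ResNet backbone with millions of parameters, which is the point of the complexity argument.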

C.4 EFFECT OF THE SEMANTIC MEMORY SIZE

We study how the semantic memory size affects CTN's performance. For this experiment, we consider the validation tasks in the Split CIFAR-100 benchmark (the first three tasks) and vary the semantic and episodic memory sizes such that their total size equals 50 samples per task. Fig. 3 reports the results of this experiment. We can see that when the semantic memory size is 10 (20% of the total memory), CTN achieves the highest ACC(↑) and LA(↑) and the lowest FM(↓), and these metrics degrade as the semantic memory size increases further. Generally, we have to balance the amount of memory allotted to the controller and to the base network. Since the controller is only a simple model, it requires only a small amount of data in the semantic memory.

D VARIANTS OF CTN

In this section, we explore alternative strategies for alleviating catastrophic forgetting in CTN's inner optimization problem, which uses experience replay (ER) to train the base model φ. Particularly, instead of the behavioural cloning strategy in Eq. 6, we consider two strategies to alleviate forgetting in ER by combining ER with EWC (Kirkpatrick et al., 2017) and with GEM (Lopez-Paz & Ranzato, 2017). Table 8 shows the results of this experiment on the Split CIFAR100 and Split miniIMN benchmarks. We can see that the behavioural cloning strategy significantly outperforms its competitors, EWC and GEM. Notably, using CTN with EWC requires a larger episodic memory to store the previous tasks' parameters and their importance weights. Moreover, using CTN with GEM results in slower running times, since GEM has the slowest training time, as shown in Table 5. The results show that the behavioural cloning strategy is more suitable for alleviating forgetting in ER, while enjoying lower memory overhead and faster running time than the alternatives.



Figure 1: Overview of the Contextual Transformation Networks (CTN). CTN consists of a controller θ that modifies the features of the base model φ. The base model is trained using experience replay on the episodic memory while the controller is trained to generalize to the semantic memory, which addresses both alleviating forgetting and facilitating knowledge transfer. Best viewed in colors.

BENCHMARK DATASETS AND BASELINES

We consider four continual learning benchmarks in our experiments. Permuted MNIST (pMNIST) (Lopez-Paz & Ranzato, 2017): each task is a random but fixed permutation of the original MNIST. We generate 23 tasks with 1,000 images for training each; the testing set has the same number of images as the original MNIST data. Split CIFAR-100 (Split CIFAR) (Lopez-Paz & Ranzato, 2017) is constructed by splitting the CIFAR100 (Krizhevsky & Hinton, 2009) dataset into 20 tasks, each of which contains 5 different classes sampled without replacement from the total of 100 classes.
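Permuted MNIST task generation can be sketched as follows. This is a minimal sketch under common conventions, not the paper's exact code; in particular, keeping the identity permutation for the first task is an assumption on our part:

```python
import numpy as np

def make_permuted_tasks(x, n_tasks, seed=0):
    """Generate Permuted MNIST-style tasks from flattened images.

    x: array of shape (N, D) with flattened images (D = 784 for MNIST).
    Each task applies one random but fixed pixel permutation to every
    image; task 0 keeps the identity permutation (our assumption).
    """
    rng = np.random.RandomState(seed)
    tasks = []
    for t in range(n_tasks):
        perm = np.arange(x.shape[1]) if t == 0 else rng.permutation(x.shape[1])
        tasks.append(x[:, perm])
    return tasks
```

Because the permutation is fixed per task, every image within a task is scrambled the same way, so each task is as learnable as the original MNIST while sharing no obvious spatial structure with the other tasks.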

Figure 2: ACC(↑) as a function of the episodic memory size on the Split CIFAR-100 and Split miniIMN benchmarks. Best viewed in colors.

Figure 3: Effect of memory size on CTN's performance. For every semantic memory size m × 50, the corresponding episodic memory size is (1 -m) × 50.

Such methods approximate training a full, separate network per task while reducing the number of additional parameters. However, most of them require growing the backbone network during training (Rusu et al., 2016; Yoon et al., 2018; Xu & Zhu, 2018; Li et al., 2019) or extensive resource usage (Rusu et al., 2016; Li et al., 2019), which is not scalable and is undesirable for many applications. Notably, the idea of conditioning on the task identifiers was explored in Serra et al. (2018); von Oswald et al. (2020). However, Serra et al. (2018) uses the task identifiers to gate the network's activations, which limits the representation capability. On the other hand, von Oswald et al. (

Evaluation metrics on continual learning benchmarks considered. All methods use the same backbone network and 50 memory slots per task, * denotes a dynamic architecture method that has a separate network per task

Evaluation metrics on the Small Split CIFAR benchmarks, M denotes the memory per task

ACC(↑) of each component in CTN on Split CIFAR and Split miniImagenet with 50 memory slots per task. BC: behavioral cloning (Eq. 6), C: controller, BO: bilevel optimization (Eq. 3)

This setting is much more challenging because it tests the learner's ability to quickly acquire knowledge with only limited training samples by utilizing its past experiences. In this experiment, we explore how different memory-based methods perform with only limited training samples per task and limited memory size. We consider the Split CIFAR benchmark; however, we reduce the amount of training data per task significantly. Particularly, we only consider 25% and 10% of the original data per task while the test data remains the same, and we name the new benchmarks Reduced Split CIFAR 25% and Reduced Split CIFAR 10%, respectively. Notably, Reduced Split CIFAR 10% has only five samples per class, which is extremely challenging. We compare CTN with GEM, ER, and MIR on these benchmarks with memory sizes of 50 and 25 samples per task.

Model complexity of CTN with various backbone architectures

We provide the detailed algorithm of our CTN and its subroutines in Alg. 1. For simplicity, we drop the dependency of the losses on the parameters and use L^tr(B_n) to denote L^tr(φ, ϕ, B_n; θ) and L(B_n) to denote L(φ, ϕ, B_n; θ).

φ ← φ − ∇_φ L^tr(B_n)   // inner update of the base model φ
ϕ ← ϕ − ∇_ϕ L^tr(B_n)   // inner update of the classifier ϕ
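The two inner steps of Alg. 1 are plain gradient descent on the replayed batch. A minimal sketch, assuming the gradients of L^tr over batch B_n have already been computed by the surrounding training loop (the learning rate and function names are our own):

```python
import numpy as np

def inner_update(phi, varphi, grad_phi, grad_varphi, lr=0.1):
    """One inner step of the algorithm (sketch).

    phi: base model parameters, varphi: classifier parameters.
    grad_phi / grad_varphi: gradients of L_tr(B_n) w.r.t. each, supplied
    by the caller since computing them depends on the model and loss.
    """
    phi = phi - lr * grad_phi          # phi <- phi - eta * grad_phi L_tr(B_n)
    varphi = varphi - lr * grad_varphi # varphi <- varphi - eta * grad_varphi L_tr(B_n)
    return phi, varphi
```

Note that only φ and ϕ are updated in the inner problem; the controller parameters θ enter L^tr as fixed context and are optimized in the outer problem on the semantic memory.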

Summary of datasets used in our experiments

• Independent* (Lopez-Paz & Ranzato, 2017): maintains a separate model for each task, each with the same number of parameters as the other methods. While unrealistic, this model was used as the upper bound in Hung et al. (2019) thanks to its impressive performance.
• Offline: an upper bound model that performs multitask training on all data. Note that this model does not follow the continual learning setting. We implement the offline model by training the network for three epochs over all data of all tasks.

C.3 CTN MODEL COMPLEXITY AND COMPUTATION COST

Model complexity. Recall that the interaction between the controller and the base model is described as:

Alternative strategies to reduce forgetting in CTN's inner optimization. BC: behavioural cloning strategy in Eq. 6. Columns: Split CIFAR ACC(↑), FM(↓), LA(↑); Split miniIMN ACC(↑), FM(↓), LA(↑).

CTN (BC)  | 33±0.70    | 73.43±0.45 | 65.82±0.59 | 3.02±1.13 | 67.73±1.73
CTN-EWC   | 60.33±1.44 | 9.33±1.55  | 68.78±0.24 | 57.69±0.96 | 5.59±0.45 | 61.53±1.38
CTN-GEM   | 64.40±2.52 | 8.06±1.92  | 71.49±0.46 | 60.65±0.80 | 5.83±0.84 | 64.42±0.46

• λ (CTN): [1, 10, 25, 50, 100]
• γ (GEM): [0, 0.5, 1]
• Semantic memory size as a percentage of total memory (CTN): [10%, 20%, 30%, 40%]

APPENDIX

This Appendix is organized as follows. In Appendix A, we provide the details of the continual learning protocol and evaluation metrics used in this work. Appendix B provides pseudo-code of CTN and its implementation on standard deep learning architectures such as MLPs and residual networks. Appendix C provides additional experiment details, including the summary of our benchmarks, results of additional baselines, model and computational complexity, and hyperparameter settings.

A CONTINUAL LEARNING PROTOCOLS

Continual learning, a.k.a. lifelong learning (McCloskey & Cohen, 1989; Thrun & Mitchell, 1995; Ring, 1997), has been extensively studied over the past decades. In this work, we consider the problem of online continual learning studied by Lopez-Paz & Ranzato (2017); Chaudhry et al. (2019a). Specifically, at time step t, a learner receives an input pair (x, t) and makes a prediction y = f(x, t; w) by a predictor f(·) parameterized by some parameter w. Here the input x belongs to an underlying task T_t, whose identifier is also given to the learner. Each task T_t comprises a training dataset D^tr_t, whose data are sequentially presented to the learner, and a separate testing set D^te_t. Following Chaudhry et al. (2019a), we also assume access to a small number of tasks prior to learning for hyperparameter validation, and an episodic memory M can be used. We assume that the stream of data {(x_i, t_i), y_i}_{i=1}^∞ arrives sequentially, and the goal is to optimize a model that performs well on all tasks observed so far.

Published as a conference paper at ICLR 2021

• Finetune: a naive method that learns sequentially without any regularization.
• LwF (Li & Hoiem, 2017): prevents forgetting via a distillation loss of the previous model on the current data.
• EWC (Kirkpatrick et al., 2017): penalizes changes to parameters important to previous tasks to prevent forgetting.
• GEM (Lopez-Paz & Ranzato, 2017): uses an episodic memory to store some data and prevents the losses of old tasks from increasing while learning new tasks.
• KDR (Hou et al., 2018): uses knowledge distillation and task-specific experts to balance learning new tasks and alleviating forgetting.
• AGEM (Chaudhry et al., 2019a): an efficient version of GEM obtained by averaging the constraints in GEM.
• MER (Riemer et al., 2019): maximizes the gradient inner product between every sample pair in the memory via a variant of the Reptile algorithm. We use MER's Algorithm 6 with a mini batch of size 10 for consistency with the remaining methods.
• ER-Ring (Chaudhry et al., 2019b): simply mixes data of previous and current tasks during training and optimizes a multitask loss.
• MIR (Aljundi et al., 2019a): a variant of ER that replays the samples in the episodic memory that the model would forget the most, i.e., those whose loss would increase most under the pending update.

C.5 HYPERPARAMETER SELECTION

We provide the hyper-parameter values of the methods considered in our task-aware experiments. For brevity, we use MNIST to denote both the Permuted MNIST and Rotated MNIST benchmarks. The Small Split CIFAR experiments use the same hyper-parameter settings as the original Split CIFAR100. For each method, we use the same hyper-parameter notation and description as in the corresponding original paper. Each hyper-parameter is cross-validated using grid search on the three validation tasks, which are not encountered again during continual learning. The grid for each hyper-parameter is provided below.

• Learning rate, including the inner, outer (CTN), and across-batch (MER) learning rates:
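The cross-validation protocol described above amounts to an exhaustive grid search scored only on the held-out validation tasks. A minimal sketch (function names and the `evaluate` callback are our own; `evaluate` would train on the validation tasks and return validation ACC):

```python
import itertools

def grid_search(grids, evaluate):
    """Exhaustive grid search over per-hyper-parameter value lists.

    grids: dict mapping hyper-parameter name -> list of candidate values.
    evaluate: callable(config_dict) -> validation score (higher is better),
    assumed to use only the validation tasks, never future task data.
    """
    names = list(grids)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(grids[n] for n in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Keeping the search confined to the validation tasks is what makes the protocol honest: the selected configuration is fixed before the actual continual learning stream begins.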

