TASK-AWARE INFORMATION ROUTING FROM COMMON REPRESENTATION SPACE IN LIFELONG LEARNING

Abstract

Intelligent systems deployed in the real world suffer from catastrophic forgetting when exposed to a sequence of tasks. Humans, on the other hand, acquire, consolidate, and transfer knowledge between tasks in a way that rarely interferes with consolidated knowledge. Accompanied by self-regulated neurogenesis, continual learning in the brain is governed by a rich set of neurophysiological processes that harbor different types of knowledge, which are then integrated by conscious processing. Inspired by the Global Workspace Theory of conscious information access in the brain, we propose TAMiL, a continual learning method that entails task-attention modules to capture task-specific information from the common representation space. We employ simple, undercomplete autoencoders to create a communication bottleneck between the common representation space and the global workspace, admitting only task-relevant information into the global workspace and thus greatly reducing task interference. Experimental results show that our method outperforms state-of-the-art rehearsal-based and dynamic sparse approaches and bridges the gap between fixed-capacity and parameter-isolation approaches while remaining scalable. We also show that our method effectively mitigates catastrophic forgetting while being well-calibrated with reduced task-recency bias.

1. INTRODUCTION

Deep neural networks (DNNs) deployed in the real world are often required to learn multiple tasks sequentially and are exposed to non-stationary data distributions. Throughout their lifespan, such systems must acquire new skills without compromising previously learned knowledge. However, continual learning (CL) over multiple tasks violates the i.i.d. (independent and identically distributed) assumption on the underlying data, leading to overfitting on the current task and catastrophic forgetting of previous tasks. Catastrophic forgetting arises from the stability-plasticity dilemma: the extent to which the system must be stable to retain consolidated knowledge yet plastic enough to assimilate new information (Mermillod et al., 2013). As a consequence of catastrophic forgetting, performance on previous tasks often drops significantly; in the worst case, previously learned information is completely overwritten by new information (Parisi et al., 2019). Humans, however, excel at CL by incrementally acquiring, consolidating, and transferring knowledge across tasks (Bremner et al., 2012). Although humans forget gracefully, learning new information rarely causes catastrophic forgetting of consolidated knowledge (French, 1999). CL in the brain is governed by a rich set of neurophysiological processes that harbor different types of knowledge, and conscious processing integrates them coherently (Goyal & Bengio, 2020). Self-regulated neurogenesis in the brain expands the knowledge bases in which task-related information is stored without catastrophic forgetting (Kudithipudi et al., 2022). The Global Workspace Theory (GWT) (Baars, 1994; 2005; Baars et al., 2021) posits that one such knowledge base is a common representation space of fixed capacity from which information is selected, maintained, and shared with the rest of the brain.
When addressing the current task, the attention mechanism creates a communication bottleneck between the common representation space and the global workspace, admitting only relevant information into the global workspace (Goyal & Bengio, 2020). Such a system enables efficient CL in humans with systematic generalization across tasks (Bengio, 2017). Several approaches have been proposed in the literature that mimic one or more neurophysiological processes in the brain to address catastrophic forgetting in DNNs. Experience rehearsal (Ratcliff, 1990) is one of the most prominent approaches and mimics the association of past and present experiences in the brain. However, the performance of rehearsal-based approaches is poor under low-buffer regimes, as it is commensurate with the buffer size (Bhat et al., 2022a). On the other hand, parameter-isolation methods (Rusu et al., 2016) represent an extreme case of neurogenesis in which a new subnetwork is initialized for each task, greatly reducing task interference. Nevertheless, these approaches exhibit poor parameter reusability and are not scalable, since they add a large number of parameters per task. The right combination of the aforementioned mechanisms, governed by GWT, could therefore unlock effective CL in DNNs while simultaneously encouraging reusability and mitigating catastrophic forgetting. To this end, we propose Task-specific Attention Modules in Lifelong learning (TAMiL), a novel CL approach that encompasses both experience rehearsal and self-regulated, scalable neurogenesis. Specifically, TAMiL learns from current task samples and a memory buffer that represents data from all previously seen tasks. Additionally, each task entails a task-specific attention module (TAM) that captures task-relevant information, similar to self-regulated neurogenesis in the brain.
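As a rough illustration of how such a memory buffer can be maintained over a task stream (a common choice in rehearsal-based CL, though the paper does not prescribe this exact scheme), the sketch below uses reservoir sampling so that every example seen so far has equal probability of residing in the fixed-size buffer. The class and method names are hypothetical.

```python
import random

class ReservoirBuffer:
    """Minimal sketch of a fixed-capacity rehearsal buffer filled via
    reservoir sampling: each stream example is retained with equal
    probability, approximating data from all previously seen tasks."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Insert directly until full, then replace a random slot with
        probability capacity / n_seen (classic reservoir rule)."""
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        """Draw a rehearsal mini-batch of up to k stored examples."""
        return self.rng.sample(self.data, min(k, len(self.data)))
```

In practice each stored entry would be an (input, label) pair, possibly alongside cached logits for consistency regularization.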
Reminiscent of the conscious information access proposed in GWT, each TAM acts as a bottleneck when transmitting information from the common representation space to the global workspace, thus reducing task interference. Unlike self-attention in Vision Transformers, we propose using a simple, undercomplete autoencoder as a TAM, rendering TAMiL scalable even under longer task sequences. Our contributions are as follows:
• We propose TAMiL, a novel CL approach that entails both experience rehearsal and self-regulated, scalable neurogenesis to further mitigate catastrophic forgetting in CL.
• Inspired by the GWT of conscious information access in the brain, we propose TAMs to capture task-specific information from the common representation space, greatly reducing task interference in Class- and Task-Incremental Learning scenarios.
• We also show a significant effect of task attention on other rehearsal-based approaches (e.g. ER, FDR, DER++). The effectiveness of TAMs across algorithms reinforces the applicability of GWT to computational models of CL.
• We also show that TAMiL is scalable and well-calibrated, with reduced task-recency bias.
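To make the bottleneck idea concrete, the following is a minimal sketch (not the paper's implementation) of a TAM as an undercomplete autoencoder: features of dimension d from the common representation space are squeezed through a code of dimension k < d and projected back, so only the information the task-specific module has learned to reconstruct passes toward the global workspace. All names, the tanh nonlinearity, and the initialization scale are illustrative assumptions.

```python
import numpy as np

class TaskAttentionModule:
    """Hypothetical sketch of a TAM: a per-task undercomplete
    autoencoder acting as a communication bottleneck between the
    common representation space (dim d) and the global workspace."""

    def __init__(self, d, k, rng):
        assert k < d, "undercomplete: bottleneck must be narrower than input"
        # Encoder compresses d -> k; decoder expands k -> d.
        self.W_enc = rng.standard_normal((d, k)) * 0.1
        self.W_dec = rng.standard_normal((k, d)) * 0.1

    def forward(self, h):
        """Pass features h (shape (d,)) through the bottleneck and
        return the reconstructed, task-filtered features (shape (d,))."""
        z = np.tanh(h @ self.W_enc)  # narrow task-specific code
        return z @ self.W_dec        # features forwarded to the workspace
```

In a full system, one such module would be instantiated per task, and its weights trained (e.g. with a reconstruction objective alongside the task loss) so that task-irrelevant directions of the representation are attenuated.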

2. RELATED WORKS

Rehearsal-based Approaches: Continual learning over a sequence of tasks has been a longstanding challenge, since learning a new task causes large weight changes in DNNs, resulting in overfitting on the current task and catastrophic forgetting of older tasks (Parisi et al., 2019). Similar to experience rehearsal in the brain, early works attempted to address catastrophic forgetting through Experience Replay (ER; Ratcliff (1990); Robins (1995)) by explicitly storing and replaying previous task samples alongside current task samples. Function Distance Regularization (FDR; Benjamin et al. (2018)), Dark Experience Replay (DER++; Buzzega et al. (2020)), and CLS-ER (Arani et al., 2022) leverage soft targets in addition to ground-truth labels to enforce consistency regularization across previous and current model predictions. In addition to rehearsal, DRI (Wang et al., 2022) utilizes a generative model to augment rehearsal under low-buffer regimes. On the other hand, Co2L (Cha et al., 2021), TARC (Bhat et al., 2022b), and ER-ACE (Caccia et al., 2021a) modify the learning objective to prevent representation drift when new classes are encountered. Given sufficient memory, replay-based approaches mimic the association of past and present experiences in humans and are fairly successful in challenging CL scenarios. However, in scenarios where buffer size is limited, they suffer from overfitting (Bhat et al., 2022a), exacerbated representation drift (Caccia et al., 2021b), and prior information loss (Zhang et al., 2020), resulting in aggravated forgetting of previous tasks.
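The soft-target consistency regularization used by methods such as FDR and DER++ can be sketched as follows. This is an illustrative reconstruction of the general DER++-style objective, not code from any of the cited works: a task loss on the current batch, an MSE term pulling the model's logits on buffered samples toward the logits stored at rehearsal time, and a cross-entropy term on the buffered labels. The function names and the alpha/beta values are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def derpp_style_loss(cur_logits, cur_y,
                     buf_logits, buf_soft_targets, buf_y,
                     alpha=0.1, beta=0.5):
    """Sketch of a DER++-style objective:
    task CE on the current batch
    + alpha * MSE between current logits on buffered samples and the
      soft targets (logits) cached when those samples were stored
    + beta * CE on the buffered ground-truth labels."""
    task = cross_entropy(cur_logits, cur_y)
    consistency = np.mean((buf_logits - buf_soft_targets) ** 2)
    replay = cross_entropy(buf_logits, buf_y)
    return task + alpha * consistency + beta * replay
```

The consistency term is what distinguishes these methods from plain ER: it regularizes the function (its outputs) rather than only replaying labels, which helps keep predictions on old tasks stable.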

