MULTIPLE MODES FOR CONTINUAL LEARNING

Abstract

Adapting model parameters to incoming streams of data is crucial to the scalability of deep learning. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace in order to remember old tasks; parameters that drift away from this subspace forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely subpopulation, domain, and task shift.

1. INTRODUCTION

As the world changes, so must our models of it. The premise of continual (or incremental, or lifelong) learning is to build adaptive systems that enable a model to return accurate predictions as the test-time distribution changes, such as a change in domain or task. Training sequentially on multiple different task distributions tends to result in catastrophic forgetting (McCloskey & Cohen, 1989), where parameter updates that benefit inference on the new task may worsen it on prior tasks. Alleviating this is the motivation for our work.

To enable flexibility in adoption, we do not assume parameter adaptation with respect to task-specific information, and we assume access to model parameters alone (no conditioning inputs, query sets, rehearsal or replay buffers, N-shot metadata, or any historical data). This carries positive implications for adoption in online learning settings, and for robustness to different distribution shifts (e.g. subpopulation, domain, and task shifts).

Interestingly, prior non-rehearsal methods (notably regularization and parameter-isolation methods) tend to "anchor" parameter updates with respect to a local parameter subspace. These methods begin with a model initialization and update the model with respect to the first task; all subsequent parameter updates on new tasks are then computed with respect to this local subspace (usually by minimizing the number of parameter-value changes). The key question we ask is: what happens when we consider the global geometry of the parameter space?

Our pursuit of an adaptation method leveraging global geometry is supported by several initial observations. When learning tasks 1, ..., T, a multi-task learner tends to drift a large distance away from its previous parameters optimized for tasks 1, ..., T-1, indicating that, when given information on all prior tasks, a multi-task learner would tend to move to a completely different subspace (Figure 3; Mirzadeh et al. (2020)).
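The notion of "drift" above can be made concrete as the distance between flattened parameter vectors before and after an update. A minimal sketch, assuming L2 distance as the drift metric and toy parameter values (the names `drift_distance`, `theta_anchored`, and `theta_multitask` are illustrative, not from the paper):

```python
import numpy as np

def drift_distance(theta_old, theta_new):
    # L2 distance between flattened parameter vectors,
    # used as a simple proxy for "parameter drift".
    return float(np.linalg.norm(theta_new - theta_old))

theta_t1 = np.zeros(1000)          # parameters after task 1 (toy values)
theta_anchored = theta_t1 + 0.01   # small, anchored update on task 2
theta_multitask = theta_t1 + 1.0   # large move to a different subspace

# An anchored update stays near the old subspace; a multi-task
# learner given all prior tasks moves much further away.
print(drift_distance(theta_t1, theta_anchored))   # small
print(drift_distance(theta_t1, theta_multitask))  # large
```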
Catastrophic forgetting is intricately linked to parameter drift: unless it drifts towards a multi-task-optimal subspace, a model whose new parameters drift further from the old parameter subspace is expected to lose accuracy on all prior tasks, while one that does not drift sufficiently retains performance on prior tasks but fails on the new task. Coordinating parameter updates between multiple parameter modes tends to keep the average parameter drift distance low (Figure 3).

Contributions. Grounded in these findings, we introduce a new rehearsal-free continual learning algorithm (Algorithm 1). We initialize from pre-trained parameters and maximize the distance between modes on the first task; on subsequent tasks, we optimize each mode's parameters based on the loss with respect to the modes' joint probability distribution as well as each mode's drift from its prior position (and reinforce with backtracking). Evaluated on forgetting per unit of capacity, MOTA tends to outperform baseline algorithms (Table 3), and adapts parameters to subpopulation, domain, and task shifts (Tables 1, 2).
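The drift trade-off described above can be sketched as a loss-plus-drift-penalty update applied to several modes in parallel. This is a minimal illustration, not the MOTA algorithm itself: the quadratic `task_loss`, the `drift_coef` penalty weight, and the function names are all hypothetical stand-ins, and the distance-maximization and backtracking steps are omitted.

```python
import numpy as np

def task_grad(theta, task_center):
    # Gradient of a toy quadratic task loss 0.5 * ||theta - task_center||^2,
    # standing in for the gradient of a real training loss.
    return theta - task_center

def update_modes(modes, task_center, drift_coef=1.0, lr=0.1, steps=200):
    # Update each mode on the new task while penalizing its drift
    # from its own previous position (the anchoring trade-off):
    # grad = task gradient + drift_coef * (mode - anchor).
    anchors = [m.copy() for m in modes]
    for _ in range(steps):
        for i, m in enumerate(modes):
            g = task_grad(m, task_center) + drift_coef * (m - anchors[i])
            modes[i] = m - lr * g
    return modes

# Two modes, two sequential "tasks" with different optima.
rng = np.random.default_rng(0)
modes = [rng.normal(size=4) for _ in range(2)]
task_a, task_b = np.ones(4), -np.ones(4)
modes = update_modes(modes, task_a)
modes = update_modes(modes, task_b)
```

With `drift_coef > 0`, each mode converges to a point between its anchor and the new task optimum (for this quadratic loss, the midpoint when `drift_coef = 1`), so no single mode drifts arbitrarily far; allocating tasks across modes then trades capacity against forgetting.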

