MULTIPLE MODES FOR CONTINUAL LEARNING

Abstract

Adapting model parameters to incoming streams of data is a crucial factor in deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely subpopulation, domain, and task shift.

1. INTRODUCTION

As the world changes, so must our models of it. The premise of continual (or incremental or lifelong) learning is to build adaptive systems that enable a model to return accurate predictions as the test-time distribution changes, such as a change in domain or task. Training sequentially on multiple different task distributions tends to result in catastrophic forgetting (McCloskey & Cohen, 1989), where parameter updates benefiting the inference of the new task may worsen that of prior tasks. Alleviating this is the motivation for our work.

To enable flexibility in adoption, we do not assume parameter adaptation with respect to task-specific information, and we assume access to the model parameters alone (no conditioning inputs, query sets, rehearsal or replay buffers, N-shot metadata, or any historical data). This carries positive implications for adoption in online learning settings, and for robustness towards different distribution shifts (e.g. sub-population, domain, and task shifts).

Interestingly, prior work on non-rehearsal methods (notably regularization and parameter isolation methods) tends to "anchor" parameter updates with respect to a local parameter subspace. These methods begin with a model initialization, update the model with respect to the first task, and henceforth compute all future parameter updates on new tasks with respect to this local subspace (usually minimizing the number of parameter value changes). The key question we ask here is: what happens when we consider the global geometry of the parameter space?

Our pursuit of an adaptation method leveraging global geometry is supported by several initial observations. When learning tasks 1, ..., T, a multi-task learner tends to drift a large distance away from its previous parameters optimized for tasks 1, ..., T-1, indicating that when given information on all prior tasks, a multi-task learner would tend to move to a completely different subspace (Figure 3; Mirzadeh et al. (2020)).
Catastrophic forgetting is intricately linked to parameter drift: unless the new parameters drift towards a multi-task-optimal subspace, drifting further from the old parameter subspace is expected to drop accuracy on all prior tasks, while not drifting sufficiently retains performance on prior tasks but fails on the new task. Coordinating parameter updates between multiple parameter modes tends to keep the average parameter drift distance low (Figure 3).

Contributions. Grounded on these findings, we introduce a new rehearsal-free continual learning algorithm (Algorithm 1). We initialize pre-trained parameters, maximize the distance between the parameters on the first task, then on subsequent tasks optimize each parameter based on the loss with respect to their joint probability distribution as well as each parameter's drift from its prior position (and reinforce with backtracking). Evaluating forgetting per capacity, MOTA tends to outperform baseline algorithms (Table 3), and adapts parameters to sub-population, domain, and task shifts (Tables 1, 2).

Related Work. Lange et al. (2019) taxonomized continual learning algorithms into replay, regularization, and parameter isolation methods. Replay (or rehearsal) methods store previous task samples to supplement retraining with the new task, such as iCaRL (Rebuffi et al., 2017), ER (Ratcliff, 1990; Robins, 1995; Riemer et al., 2018; Chaudhry et al., 2019), and A-GEM (Chaudhry et al., 2018b). Regularization methods add regularization terms to the loss function to consolidate prior task knowledge, such as EWC (Kirkpatrick et al., 2017b), SI (Zenke et al., 2017), and LwF (Li & Hoiem, 2016). These methods rely on no task-specific information or supporting data other than the model weights alone.
Parameter isolation methods allocate different models or subnetworks within a model to different tasks, such as PackNet (Mallya & Lazebnik, 2017), HAT (Serrà et al., 2018), SupSup (Wortsman et al., 2020), BatchEnsemble (Wen et al., 2020), and WSN (Kang et al., 2022). Task oracles may be required to activate the task-specific parameters. Ensembling strategies in this category may either require task indices to switch to a specific task model (e.g. Wen et al. (2020)), or update all ensemble models on all tasks but risk losing the task-optimal properties of each parameter's subspace (e.g. Doan et al. (2022)). The loss landscape changes after each task (Figure 3). Prior work either anchors to the local subspace of the first task, anchors each task to its specific local subspace, or anchors the entire parameter space to the last seen task. We are the first to leverage the global geometry of a loss landscape changing with tasks without compromising the task-optimal properties of each subspace or requiring any task-specific information.
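The contributed update described above (each mode descends its task loss plus a penalty on drift from its prior position, with tasks allocated across modes) can be illustrated with a toy numpy sketch. Everything here (the quadratic task, the greedy allocation rule, and the coefficients) is a hypothetical stand-in for Algorithm 1, not the paper's implementation, and the backtracking step is omitted.

```python
import numpy as np

def task_loss(theta, target):
    # Toy quadratic task: the loss is the squared distance to a
    # task-specific optimum `target` (a stand-in for a minibatch loss).
    return float(np.sum((theta - target) ** 2))

def update_mode(theta, target, anchor, lr=0.2, drift_coef=0.5, steps=100):
    # Descend the task loss plus a penalty on drift from the mode's
    # previous (anchored) parameters -- a toy stand-in for the
    # loss-plus-drift update sketched in the Contributions.
    for _ in range(steps):
        grad = 2.0 * (theta - target) + drift_coef * 2.0 * (theta - anchor)
        theta = theta - lr * grad
    return theta

def allocate_and_update(modes, target):
    # Greedily allocate the incoming task to the mode whose current
    # parameters already fit it best, then update only that mode.
    i = min(range(len(modes)), key=lambda j: task_loss(modes[j], target))
    modes[i] = update_mode(modes[i], target, anchor=modes[i].copy())
    return i, modes
```

On a real model, theta would be the flattened network parameters and task_loss a minibatch loss; the greedy argmin allocation here is only a proxy for MOTA's task-allocation optimization.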

2. TRADE-OFF BETWEEN MULTIPLE MODES AND TASK ALLOCATION

First we introduce the problem set-up of continual learning, with assumptions extendable to broader online learning settings. Then we share the observations that motivate our study into multiple modes. Finally we present a trade-off, which motivates our proposed learning algorithm.

Problem Setup. A base learner receives T tasks (or batches) sequentially. D_t = {x_t, y_t} denotes the dataset of the t-th task. In the continual learning setting, given loss function L, a neural network f(θ; x) optimizes its parameters θ such that it performs well on the t-th task while minimizing the performance drop on the previous (t-1) tasks: θ* := arg min_θ ∑_{t=1}^{T} L(f(θ; x_t), y_t). We assume the only information available at test-time is the model parameters and the new task's data points. The learner cannot access any prior data points from previous tasks, and capacity is not permitted to increase after each task. Additionally, we do not assume parameter adaptation at test-time can be conditioned on task boundaries or conditioning inputs (task index, replay, K-shot query data, etc.).

As shown in Figure 3, a multi-task learner has a consistently higher average drift between tasks than EWC, even when both begin from a shared starting point (init → task 1). Given visibility into prior tasks, a multi-task learner departs the subspace of the previous parameters and drifts far. This contradicts the notion of forcing parameters to reside in the subspace of the previous parameters. Regularization-based methods essentially anchor all future parameters to the first task observed. Results in mode connectivity (Garipov et al., 2018; Fort & Jastrzebski, 2019; Draxler et al., 2019) show that a single task can have multiple parameters ("modes") that manifest functional diversity. We explored computing multiple modes with respect to task 1 to incorporate the broader geometry of the parameter space beyond the subspace of one mode.
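The failure mode that the objective above guards against, catastrophic forgetting under naive sequential fine-tuning, can be reproduced in a few lines. The linear-regression tasks and the sequential_train helper below are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def mse(theta, X, y):
    # Mean-squared error of a linear model on one task's data.
    return float(np.mean((X @ theta - y) ** 2))

def sequential_train(theta, tasks, lr=0.1, steps=200):
    # Naive continual learning: fit each task in turn by gradient descent,
    # with no access to earlier tasks' data (matching the setup above).
    history = []
    for X, y in tasks:
        for _ in range(steps):
            grad = 2 * X.T @ (X @ theta - y) / len(y)
            theta = theta - lr * grad
        history.append(theta.copy())
    return theta, history
```

With two tasks whose optima conflict, the final parameters fit the last task while the first task's loss climbs back up, which is exactly the forgetting the summed objective is meant to prevent.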
To attain performance gains and capacity efficiency, we further derive a trade-off between the number of modes, the number of tasks allocated per mode, and capacity (Theorem 1). We denote θ_init as the initialization parameters, θ_MTL(1,...,T) as the multi-task parameters trained on tasks 1, ..., T, and θ_{i,t} as the parameters of mode index i updated on task t.
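Given this notation, the average parameter drift plotted in Figure 3 can be measured per trajectory. This is a minimal sketch assuming Euclidean distance on flattened parameter vectors; the exact metric used in the paper's figure is not specified here.

```python
import numpy as np

def average_drift(trajectory):
    # trajectory[0] is theta_init; trajectory[t] is the parameter vector
    # after task t. Returns the mean Euclidean distance between
    # consecutive entries, i.e. the average per-task drift.
    steps = [np.linalg.norm(trajectory[t + 1] - trajectory[t])
             for t in range(len(trajectory) - 1)]
    return float(np.mean(steps))
```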



Figure 1: A diagram of the different parameter trajectories demonstrating that, rather than anchoring all subsequent learning on mode θ_{0,1}, we can leverage the functional diversity of other modes for optimal task allocation.

