LINEAR MODE CONNECTIVITY IN MULTITASK AND CONTINUAL LEARNING

Abstract

Continual (sequential) training and multitask (simultaneous) training often attempt to solve the same overall objective: to find a solution that performs well on all considered tasks. The main difference lies in the training regimes, where continual learning can only access one task at a time, which for neural networks typically leads to catastrophic forgetting. That is, the solution found for a subsequent task no longer performs well on the previous ones. However, the relationship between the different minima that the two training regimes arrive at is not well understood. What sets them apart? Is there a local structure that could explain the difference in performance achieved by the two schemes? Motivated by recent work showing that different minima of the same task are typically connected by very simple curves of low error, we investigate whether multitask and continual solutions are similarly connected. We empirically find that such connectivity can indeed be reliably achieved and, more interestingly, that it can be achieved by a linear path, conditioned on having the same initialization for both. We thoroughly analyze this observation and discuss its significance for the continual learning process. Furthermore, we exploit this finding to propose an effective algorithm that constrains the sequentially learned minima to behave as the multitask solution. We show that our method outperforms several state-of-the-art continual learning algorithms on various vision benchmarks.
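The linear connectivity check described above can be illustrated with a minimal sketch: evaluate a task's loss at points along the straight line between two parameter vectors and measure whether any point rises above the endpoints (a "barrier"). The toy quadratic losses, the specific parameter vectors, and all function names below are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

# Toy quadratic standing in for a task's empirical risk (illustrative only):
# a bowl whose minimizer is `center`.
def task_loss(w, center):
    return float(np.sum((w - center) ** 2))

def losses_along_linear_path(w_a, w_b, center, num_points=11):
    """Evaluate the task loss at points (1 - t) * w_a + t * w_b, t in [0, 1]."""
    losses = []
    for t in np.linspace(0.0, 1.0, num_points):
        w = (1.0 - t) * w_a + t * w_b
        losses.append(task_loss(w, center))
    return losses

# Hypothetical solutions: w_cl from sequential training, w_mtl from multitask.
w_cl = np.array([1.0, 0.0])
w_mtl = np.array([0.5, 0.5])
center_task1 = np.array([0.5, 0.5])  # minimizer of the first task's toy loss

losses = losses_along_linear_path(w_cl, w_mtl, center_task1)
# A non-positive barrier means no point on the linear path exceeds the
# endpoint losses, i.e. the two minima are linearly connected for this task.
barrier = max(losses) - max(losses[0], losses[-1])
print(f"barrier height along the linear path: {barrier:.4f}")
```

For a convex toy loss the barrier is trivially zero; the paper's empirical question is whether the same holds for the highly non-convex losses of deep networks when both endpoints share an initialization.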

1. INTRODUCTION

One major consequence of learning multiple tasks in a continual learning (CL) setting, where tasks are learned sequentially and the model can only access one task at a time, is catastrophic forgetting (McCloskey & Cohen, 1989). This is in contrast to multitask learning (MTL), where the learner has simultaneous access to all tasks and generally learns to perform well on all of them without suffering from catastrophic forgetting. This limitation hinders the ability of the model to learn continually and efficiently. Recently, several approaches have been proposed to tackle this problem. They have mostly tried to mitigate catastrophic forgetting by using different approximations of the multitask loss. For example, some regularization methods take a quadratic approximation of the loss of previous tasks (e.g. Kirkpatrick et al., 2017; Yin et al., 2020). As another example, rehearsal methods attempt to directly use compressed past data, either by selecting a representative subset (e.g. Chaudhry et al., 2019; Titsias et al., 2019) or by relying on generative models (e.g. Shin et al., 2017; Robins, 1995). In this work, we depart from the literature and start from the non-conventional question of understanding "What is the relationship, potentially in terms of local geometric properties, between the multitask and the continual learning minima?". Our work is inspired by recent work on mode con-

