CPR: CLASSIFIER-PROJECTION REGULARIZATION FOR CONTINUAL LEARNING

Abstract

We propose a general, yet simple patch that can be applied to existing regularization-based continual learning methods, called classifier-projection regularization (CPR). Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability distribution. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by the classifier's output onto the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. In extensive experiments, we apply CPR to several state-of-the-art regularization-based continual learning methods and benchmark their performance on popular image recognition datasets. Our results demonstrate that CPR indeed promotes wide local minima and significantly improves both accuracy and plasticity while simultaneously mitigating the catastrophic forgetting of the baseline continual learning methods. The code and scripts for this work are available at https://github.com/csm9493/CPR_CL.

1. INTRODUCTION

Catastrophic forgetting (McCloskey & Cohen, 1989) is a central challenge in continual learning (CL): when training a model on a new task, performance (e.g., accuracy) on previous tasks may degrade when the updated model is applied to them. At the heart of catastrophic forgetting is the stability-plasticity dilemma (Carpenter & Grossberg, 1987; Mermillod et al., 2013), in which a model exhibits high stability on previously trained tasks but suffers from low plasticity when integrating new knowledge (and vice versa). Attempts to overcome this challenge in neural network-based CL can be grouped into three main strategies: regularization methods (Li & Hoiem, 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2018; Ahn et al., 2019; Aljundi et al., 2019), memory replay (Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Rebuffi et al., 2017; Kemker & Kanan, 2018), and dynamic network architectures (Rusu et al., 2016; Yoon et al., 2018; Golkar et al., 2019). Among these, regularization methods that constrain model weights have the longest history, owing to their simplicity and efficiency in controlling the stability-plasticity trade-off for a fixed model capacity. In parallel, several recent methods seek to improve the generalization of neural networks trained on a single task by promoting wide local minima (Keskar et al., 2017; Chaudhari et al., 2019; Pereyra et al., 2017; Zhang et al., 2018). Broadly speaking, these works have experimentally shown that models trained with wide-local-minima-promoting regularizers achieve better generalization and higher accuracy (Keskar et al., 2017; Pereyra et al., 2017; Chaudhari et al., 2019; Zhang et al., 2018), and can be more robust to weight perturbations (Zhang et al., 2018) than models trained with standard methods. Despite these promising results, methods that promote wide local minima have yet to be applied to CL.
In this paper, we make a novel connection between wide local minima in neural networks and regularization-based CL methods. Typical regularization-based CL aims to preserve weight parameters that were important for past tasks by penalizing large deviations from them when learning new tasks. As shown in the top of Fig. 1, a popular geometric intuition for such CL methods (first given in EWC (Kirkpatrick et al., 2017)) is to consider the (uncertainty) ellipsoid of parameters around each local minimum. When learning new tasks, parameter updates are selected so as not to significantly hinder model performance on past tasks. Our intuition is that promoting wide local minima, which conceptually correspond to local minima with flat, rounded uncertainty ellipsoids, can be particularly beneficial for regularization-based CL methods: they permit diverse update directions for new tasks (improving plasticity) without hurting past tasks (retaining stability). As shown in the bottom of Fig. 1, when the ellipsoid containing low-error parameters is wider, i.e., when wide local minima exist, there is more flexibility in finding parameters that perform well for all tasks after learning a sequence of new tasks. We provide further details in Section 2.1. Based on the above intuition, we propose a general, yet simple patch that can be applied to existing regularization-based CL methods, dubbed Classifier-Projection Regularization (CPR).
Our method adds a regularization term that promotes wide local minima by maximizing the entropy of the classifier's output distribution. Furthermore, from a theoretical standpoint, we observe that our CPR term can be interpreted in terms of information-projection (I-projection) formulations (Cover & Thomas, 2012; Murphy, 2012; Csiszár & Matus, 2003; Walsh & Regalia, 2010; Amari et al., 2001; Csiszár & Shields, 2004) found in information theory. Namely, we argue that applying CPR corresponds to projecting a classifier's output onto a Kullback-Leibler (KL) divergence ball of finite radius centered at the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. Through extensive experiments on several benchmark datasets, we demonstrate that applying CPR can significantly improve state-of-the-art regularization-based CL methods: our simple patch improves both stability and plasticity and hence achieves better average accuracy almost uniformly across the tested algorithms and datasets, confirming our intuition on wide local minima in Fig. 1. Furthermore, we use feature map visualizations comparing models trained with and without CPR to further corroborate the effectiveness of our method.
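Concretely, the CPR term amounts to the KL divergence between the classifier's softmax output and the uniform distribution, which equals log K minus the output entropy, so minimizing it maximizes entropy. The NumPy sketch below is a minimal illustration of this quantity under our own naming (`cpr_term`), not the authors' implementation:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cpr_term(logits):
    """KL(p || u) between classifier output p and the uniform distribution u.

    Up to the additive constant log(K), this equals the negative entropy of p,
    so minimizing this term maximizes the output entropy.
    """
    p = softmax(logits)
    k = p.shape[-1]
    return np.sum(p * (np.log(p) - np.log(1.0 / k)), axis=-1)
```

A uniform output incurs zero penalty, while an overconfident output is penalized, which is exactly the direction that flattens the classifier's predictions.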

2. CPR: CLASSIFIER-PROJECTION REGULARIZATION FOR WIDE LOCAL MINIMUM

In this section, we elaborate on the core motivation outlined in Fig. 1 and then formalize CPR as the combination of two regularization terms: one stemming from prior regularization-based CL methods, and one that promotes wide local minima. Moreover, we provide an information-geometric interpretation (Csiszár, 1984; Cover & Thomas, 2012; Murphy, 2012) of the performance gain observed when applying CPR to CL.
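To make the two-term structure concrete, the sketch below combines a generic EWC-style quadratic weight penalty with an entropy bonus on the classifier output. This is a toy illustration under our own naming and default coefficients, not the exact loss of any specific baseline; the per-method CL penalty varies across the algorithms discussed above:

```python
import numpy as np

def quadratic_cl_penalty(theta, theta_prev, importance):
    # EWC-style penalty: weights deemed important for past tasks (high
    # importance values) are anchored to their previous-task optimum.
    return np.sum(importance * (theta - theta_prev) ** 2)

def output_entropy(p, eps=1e-12):
    # Mean Shannon entropy of the classifier's softmax outputs.
    return -np.sum(p * np.log(p + eps), axis=-1).mean()

def total_loss(task_loss, theta, theta_prev, importance, p, lam=1.0, beta=0.1):
    # Task loss + CL weight regularizer - beta * entropy (the CPR term).
    # Subtracting the entropy maximizes it, pulling outputs toward uniform.
    return (task_loss
            + lam * quadratic_cl_penalty(theta, theta_prev, importance)
            - beta * output_entropy(p))
```

With `theta == theta_prev` the CL penalty vanishes, and a uniform output maximizes the entropy bonus, so the two terms act on disjoint parts of the trade-off: the first preserves past tasks, the second widens the minimum for the current one.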



Figure 1: In typical regularization-based CL (top), when the low-error ellipsoid around a local minimum is sharp and narrow, the space of candidate model parameters that perform well on all tasks (i.e., the intersection of the ellipsoids for each task) quickly becomes very small as learning continues; thus, an inevitable trade-off between stability and plasticity occurs. In contrast, when wide local minima exist for each task (bottom), the ellipsoids are more likely to overlap significantly even as learning continues; hence, finding a model that performs well on all tasks becomes more feasible.

