DISCRIMINATIVE REPRESENTATION LOSS (DRL): A MORE EFFICIENT APPROACH THAN GRADIENT RE-PROJECTION IN CONTINUAL LEARNING

Abstract

The use of episodic memories in continual learning has been shown to be effective in terms of alleviating catastrophic forgetting. In recent studies, several gradientbased approaches have been developed to make more efficient use of compact episodic memories, which constrain the gradients resulting from new samples with those from memorized samples, aiming to reduce the diversity of gradients from different tasks. In this paper, we reveal the relation between diversity of gradients and discriminativeness of representations, demonstrating connections between Deep Metric Learning and continual learning. Based on these findings, we propose a simple yet highly efficient method -Discriminative Representation Loss (DRL) -for continual learning. In comparison with several state-of-theart methods, DRL shows effectiveness with low computational cost on multiple benchmark experiments in the setting of online continual learning.

1. INTRODUCTION

In the real world, we are often faced with situations where data distributions are changing over time, and we would like to update our models by new data in time, with bounded growth in system size. These situations fall under the umbrella of "continual learning", which has many practical applications, such as recommender systems, retail supply chain optimization, and robotics (Lesort et al., 2019; Diethe et al., 2018; Tian et al., 2018) . Comparisons have also been made with the way that humans are able to learn new tasks without forgetting previously learned ones, using common knowledge shared across different skills. The fundamental problem in continual learning is catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017) , i.e. (neural network) models have a tendency to forget previously learned tasks while learning new ones. There are three main categories of methods for alleviating forgetting in continual learning: i) regularization-based methods which aim in preserving knowledge of models of previous tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2018) ii) architecture-based methods for incrementally evolving the model by learning task-shared and task-specific components (Schwarz et al., 2018; Hung et al., 2019) ; iii) replay-based methods which focus in preserving knowledge of data distributions of previous tasks, including methods of experience replay by episodic memories or generative models (Shin et al., 2017; Rolnick et al., 2019) , methods for generating compact episodic memories (Chen et al., 2018; Aljundi et al., 2019) , and methods for more efficiently using episodic memories (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; Riemer et al., 2019; Farajtabar et al., 2020) . Gradient-based approaches using episodic memories, in particular, have been receiving increasing attention. The essential idea is to use gradients produced by samples from episodic memories to constrain the gradients produced by new samples, e.g. by ensuring the inner product of the pair of gradients is non-negative (Lopez-Paz & Ranzato, 2017) as follows: g t , g k = ∂L(x t , θ) ∂θ , ∂L(x k , θ) ∂θ ≥ 0, ∀k < t (1) where t and k are time indices, x t denotes a new sample from the current task, and x k denotes a sample from the episodic memory. Thus, the updates of parameters are forced to preserve the performance on previous tasks as much as possible. In Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017), g t is projected to a direction that is closest to it in L 2 -norm whilst also satisfying Eq. ( 1): min g 1 2 ||g t -g|| 2 2 , s.t. g, g k ≥ 0, ∀k < t. Optimization of this objective requires a high-dimensional quadratic program and thus is computationally expensive. Averaged-GEM (A-GEM) (Chaudhry et al., 2019a) alleviates the computational burden of GEM by using the averaged gradient over a batch of samples instead of individual gradients of samples in the episodic memory. This not only simplifies the computation, but also obtains comparable performance with GEM. Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) projects g t to the direction that is perpendicular to the surface formed by {g k |k < t}. Moreover, Aljundi et al. (2019) propose Gradient-based Sample Selection (GSS), which selects samples that produce most diverse gradients with other samples into episodic memories. Here diversity is measured by the cosine similarity between gradients. Since the cosine similarity is computed using the inner product of two normalized gradients, GSS embodies the same principle as other gradient-based approaches with episodic memories. Although GSS suggests the samples with most diverse gradients are important for generalization across tasks, Chaudhry et al. (2019b) show that the average gradient over a small set of random samples may be able to obtain good generalization as well. In this paper, we answer the following questions: i) Which samples tend to produce diverse gradients that strongly conflict with other samples and why are such samples able to help with generalization? ii) Why does a small set of randomly chosen samples also help with generalization? iii) Can we reduce the diversity of gradients in a more efficient way? Our answers reveal the relation between diversity of gradients and discriminativeness of representations, and further show connections between Deep Metric Learning (DML) (Kaya & Bilge, 2019; Roth et al., 2020) and continual learning. Drawing on these findings we propose a new approach, Discriminative Representation Loss (DRL), for classification tasks in continual learning. Our methods show improved performance with relatively low computational cost in terms of time and RAM cost when compared to several state-of-theart (SOTA) methods across multiple benchmark tasks in the setting of online continual learning.

2. A NEW PERSPECTIVE OF REDUCING DIVERSITY OF GRADIENTS

According to Eq. ( 1), negative cosine similarities between gradients produced by current and previous tasks result in worse performance in continual learning. This can be interpreted from the perspective of constrained optimization as discussed by Aljundi et al. (2019) . Moreover, the diversity of gradients relates to the Gradient Signal to Noise Ratio (GSNR) (Liu et al., 2020) , which plays a crucial role in the model's generalization ability. Intuitively, when more of the gradients point in diverse directions, the variance will be larger, leading to a smaller GSNR, which indicates that reducing the diversity of gradients can improve generalization. This finding leads to the conclusion that samples with the most diverse gradients contain the most critical information for generalization which is consistent with in Aljundi et al. (2019) .

2.1. THE SOURCE OF GRADIENT DIVERSITY

We first conducted a simple experiment on classification tasks of 2-D Gaussian distributions, and tried to identify samples with most diverse gradients in the 2-D feature space. We trained a linear model on the first task to discriminate between two classes (blue and orange dots in Fig. 1a ). We then applied the algorithm Gradient-based Sample Selection with Interger Quadratic Programming (GSS-IQP) (Aljundi et al., 2019) to select 10% of the samples of training data that produce gradients with the lowest similarity (black dots in Fig. 1a ), and denote this set of samples as M = min M i,j∈M gi,gj ||gi||•||gj || . It is clear from Fig. 1a that the samples in M are mostly around the decision boundary between the two classes. Increasing the size of M results in the inclusion of samples that trace the outer edges of the data distributions from each class. Clearly the gradients can be strongly opposed when samples from different classes are very similar. Samples close to decision boundaries are most likely to exhibit this characteristic. Intuitively, storing the decision boundaries of previously learned classes should be an effective way to preserve classification performance on those classes. However, if the episodic memory only includes samples representing the learned boundaries, it may miss important information when the model is required to incrementally learn new classes. We show this by introducing a second task -training the model above on a third class (green dots). We display the decision boundaries (which split the feature space in a one vs. all manner) learned by the model after

