IMPROVING CONTINUAL LEARNING BY ACCURATE GRADIENT RECONSTRUCTIONS OF THE PAST

Anonymous

Abstract

Knowledge reuse is essential for continual learning, and current methods attempt to realize it through regularization or experience replay. These two strategies have complementary strengths: for example, regularization methods are compact, while replay methods can mimic batch training more accurately. At present, little has been done to find principled ways to combine the two, and current heuristics can give suboptimal performance. Here, we provide a principled approach to combine and improve them by using a recently proposed principle of adaptation, where the goal is to reconstruct the "gradients of the past", i.e., to mimic batch training by estimating gradients from past data. Using this principle, we design a prior that provably gives better gradient reconstructions by utilizing two types of replay and a quadratic weight-regularizer. This improves performance on standard benchmarks such as Split CIFAR, Split TinyImageNet, and ImageNet-1000. Our work shows that a good combination of replay- and regularizer-based methods can be very effective in reducing forgetting, and can sometimes even completely eliminate it.

1. INTRODUCTION

Continual learning (Parisi et al., 2019) aims for accurate incremental training over a large number of individual tasks/examples. This can potentially reduce the frequency of retraining in deep learning, making algorithms easier to use and deploy, while also reducing their environmental impact (Diethe et al., 2019; Paleyes et al., 2020). The main challenge in continual learning is to remember past knowledge and reuse it to continue to adapt to new data. This can be difficult because the future is unknown and can interfere with past knowledge (Sutton, 1986; Mermillod et al., 2013; Kirkpatrick et al., 2017). Performance, therefore, heavily depends on the strategies used to represent and reuse past knowledge.

Two popular strategies of knowledge reuse are based on regularization and experience replay, and they have complementary strengths. For example, the well-known Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), which regularizes the new weight-vector to keep it close to the old one, is compact and requires storing only two vectors: one containing the weights and the other their importance (often the empirical Fisher). A variety of other such regularizers have been proposed (Schwarz et al., 2018; Zenke et al., 2017b; Li & Hoiem, 2018; Nguyen et al., 2018). This is very different from experience replay (Robins, 1995; Shin et al., 2017), where past examples are simply added during future training. The memory cost here can be substantial, but replay can boost accuracy if the memory represents the past well.

Clearly, combining the two approaches can strike a good balance between performance and memory size. At present, little has been done to find principled ways to combine the two strategies. Some works have used knowledge distillation (Rebuffi et al., 2017; Buzzega et al., 2020) or functional regularization (Titsias et al., 2020; Pan et al., 2020), where predictions evaluated at the examples in memory are regularized.
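To make the quadratic weight-regularizer idea concrete, the following is a minimal sketch of an EWC-style penalty in PyTorch: the new weights are pulled toward the stored old weights, with each deviation weighted by a per-parameter importance estimate (e.g., the empirical Fisher). The function name, argument names, and the choice of a dictionary layout are illustrative, not the paper's notation.

```python
import torch

def ewc_penalty(model, old_params, importance, lam=1.0):
    """EWC-style quadratic regularizer (illustrative sketch).

    old_params:  dict mapping parameter name -> stored old weight tensor
    importance:  dict mapping parameter name -> importance tensor
                 (e.g., a diagonal empirical-Fisher estimate)
    lam:         regularization strength
    """
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        # Penalize deviation from the old weights, scaled by importance.
        penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```

In training, this term would simply be added to the loss on the new task, so only the two stored vectors (weights and importances) are needed rather than the past data itself.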
Such approaches are promising, but it is not clear why the specific choices of regularizers and memory work well, or whether there are better choices that lead to further improvements. In this paper, we aim to fix this issue by providing a principled approach to combine and improve the two strategies. Our approach is based on a recently proposed principle of adaptation (Khan & Swaroop, 2021), where a prior called the Knowledge-adaptation prior (K-prior) is used to reconstruct the gradients of the past

