A UNIFIED BAYESIAN FRAMEWORK FOR DISCRIMINATIVE AND GENERATIVE CONTINUAL LEARNING

Abstract

Continual learning is a paradigm in which a learning system is trained on a sequence of tasks. The goal is to perform well on the current task without suffering a performance drop on previous tasks. Two notable directions among recent advances in continual learning with neural networks are (1) variational-Bayes-based regularization, where priors are learned from previous tasks, and (2) learning the structure of deep networks to adapt to new tasks. So far, these two approaches have been orthogonal. We present a novel Bayesian framework for continual learning based on learning the structure of deep neural networks, addressing the shortcomings of both approaches. The proposed framework learns a deep structure for each task by learning which weights to use, and supports inter-task transfer through the overlap among the sparse subsets of weights learned by different tasks. An appealing aspect of our framework is that it is applicable to both discriminative (supervised) and generative (unsupervised) settings. Experimental results on supervised and unsupervised benchmarks show that our model performs comparably to or better than recent advances in continual learning.

1. INTRODUCTION

Continual learning (CL) (Ring, 1997; Parisi et al., 2019) is the learning paradigm in which a single model is subjected to a sequence of tasks. At any point in time, the model is expected to (i) make predictions for the tasks it has seen so far, and (ii) when presented with training data for a new task, adapt to it by leveraging past knowledge if possible (forward transfer) and, ideally, benefit the previous tasks as well (backward transfer). While the desirable aspects of mainstream transfer learning (sharing of bias between related tasks (Pan & Yang, 2009)) may reasonably be expected here too, the principal challenge is to retain predictive power on older tasks even after learning new ones, thus avoiding so-called catastrophic forgetting. Real-world applications, for example in robotics or time-series forecasting, are rife with this challenging scenario, since the ability to adapt to dynamically changing environments or evolving data distributions is essential in these domains. Continual learning is also desirable in unsupervised learning problems (Smith et al., 2019; Rao et al., 2019b), where the goal is to learn the underlying structure or latent representation of the data. Moreover, as a skill innate to humans (Flesch et al., 2018), it is naturally an interesting scientific problem to reproduce the same capability in artificial predictive modeling systems.

Existing approaches to continual learning are mainly based on three foundational ideas. The first is to constrain the parameter values to not deviate significantly from their previously learned values, using some form of regularization or trade-off between previous and newly learned weights (Schwarz et al., 2018; Kirkpatrick et al., 2017; Zenke et al., 2017; Lee et al., 2017). A natural way to accomplish this is to train a model using online Bayesian inference, whereby the posterior of the parameters learned from task t serves as the prior for task t + 1, as in Nguyen et al.
(2018) and Zeno et al. (2018). This new informed prior helps forward transfer and also prevents catastrophic forgetting by penalizing large deviations from itself. In particular, VCL (Nguyen et al., 2018) achieves state-of-the-art results by applying this simple idea to Bayesian neural networks. The second idea is to perform incremental model selection for every new task. For neural networks, this is done by evolving the structure as newer tasks are encountered (Golkar et al., 2019; Li et al., 2019). Structural learning is a sensible direction for continual learning because a new task may require a different network structure than old, unrelated tasks, and even when tasks are highly related, their lower-layer representations can be very different. Another advantage of structural learning is that, while retaining a shared set of parameters (which can be used to model task relationships), it also allows task-specific parameters that can improve performance on the new task while avoiding the catastrophic forgetting caused by forced parameter sharing. The third idea is to invoke a form of 'replay', whereby selected or generated samples representative of previous tasks are used to retrain the model after new tasks are learned.

In this work, we introduce a novel Bayesian nonparametric approach to continual learning that incorporates the ability of structure learning into the simple yet effective framework of online Bayes. In particular, our approach models each hidden layer of the neural network using an Indian Buffet Process (Griffiths & Ghahramani, 2011) prior, which enables us to learn the network structure as new tasks arrive continually. We leverage the fact that any particular task t uses a sparse subset of the connections of a neural network, and different related tasks use different (albeit possibly overlapping) subsets.
Thus, in the continual learning setting, it is more effective if the network can dynamically accommodate changes in its connections to adapt to a newly arriving task. Moreover, our model performs automatic model selection, where each task can select the number of nodes in each hidden layer. All of this is done under the principled framework of variational Bayes and a nonparametric Bayesian modeling paradigm. Another appealing aspect of our approach is that, in contrast to some recent state-of-the-art continual learning models (Yoon et al., 2018; Li et al., 2019) that are specific to supervised learning problems, our approach applies both to deep discriminative networks (supervised learning), where each task can be modeled by a Bayesian neural network (Neal, 2012; Blundell et al., 2015), and to deep generative networks (unsupervised learning), where each task can be modeled by a variational autoencoder (VAE) (Kingma & Welling, 2013).
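To illustrate the kind of sparsity the IBP prior induces, the following minimal sketch (our own illustrative code, not the paper's implementation) samples binary weight-usage masks from the IBP's stick-breaking construction: each "task" activates a sparse subset of nodes, with earlier nodes more likely to be shared across tasks and later nodes increasingly task-specific.

```python
import numpy as np

def sample_ibp(num_tasks, alpha, max_features=100, rng=None):
    # Stick-breaking construction of the Indian Buffet Process:
    # v_j ~ Beta(alpha, 1) and pi_k = prod_{j<=k} v_j, giving a
    # decreasing probability that feature (here: network node) k is active.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(alpha, 1.0, size=max_features)
    pi = np.cumprod(v)
    # Each task activates node k independently with probability pi_k.
    Z = (rng.random((num_tasks, max_features)) < pi).astype(int)
    return Z

# Binary mask matrix: row t is the set of nodes task t uses.
Z = sample_ibp(num_tasks=5, alpha=3.0)
```

Because the activation probabilities decay with the node index, tasks tend to overlap on the first few nodes (enabling transfer) while occasionally recruiting fresh nodes of their own, which is the behavior the model exploits for structure learning.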

2. PRELIMINARIES

Bayesian neural networks (Neal, 2012) are discriminative models whose goal is to capture the relationship between inputs and outputs via a deep neural network with parameters w. The network parameters are assumed to have a prior p(w), and the goal is to infer the posterior given the observed data D. Exact posterior inference is intractable in such models. One approximate inference scheme is Bayes-by-Backprop (Blundell et al., 2015), which uses a mean-field variational posterior q(w) over the weights. Reparameterized samples from this posterior are used to approximate the variational lower bound via Monte Carlo sampling. Our goal in the continual learning setting is to learn such Bayesian neural networks for a sequence of tasks by inferring the posterior q_t(w) for each task t, without forgetting the information contained in the posteriors of previous tasks.

Variational autoencoders (Kingma & Welling, 2013) are generative models whose goal is to model a set of inputs {x_n}_{n=1}^N in terms of stochastic latent variables {z_n}_{n=1}^N. The mapping from each z_n to x_n is defined by a generator/decoder model (a deep neural network with parameters θ), and the reverse mapping is defined by a recognition/encoder model (another deep neural network with parameters φ). Inference in VAEs is done by maximizing the variational lower bound on the marginal likelihood. It is customary to do point estimation for the decoder parameters θ and posterior inference for the encoder parameters φ. In the continual learning setting, however, it is more desirable to infer the full posterior q_t(w) over each task's encoder and decoder parameters w = {θ, φ}, while not forgetting information about previous tasks as more tasks are observed. Our proposed continual learning framework addresses this aspect as well.

Variational Continual Learning (VCL) Nguyen et al.
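To make the Bayes-by-Backprop recipe concrete, here is a minimal NumPy sketch (our own toy example on a single-weight model, not the authors' code): the mean-field Gaussian q(w) is reparameterized as w = mu + sigma * eps, the expected log-likelihood is estimated by Monte Carlo over those samples, and the KL term to the Gaussian prior is computed in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise, modeled by a single weight w so that y ≈ w * x.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

prior_mu, prior_sigma = 0.0, 1.0   # prior p(w) = N(0, 1)
noise_var = 0.01                   # assumed observation noise variance

def softplus(a):
    return np.log1p(np.exp(a))

def elbo_estimate(mu, rho, n_samples=10):
    """Monte Carlo ELBO for q(w) = N(mu, softplus(rho)^2)."""
    sigma = softplus(rho)          # positivity via softplus, as in Bayes-by-Backprop
    eps = rng.normal(size=n_samples)
    w = mu + sigma * eps           # reparameterized posterior samples
    # E_q[log p(D | w)], estimated by averaging over the samples
    # (constants independent of w are dropped)
    loglik = np.mean([-0.5 * np.sum((y - wi * x) ** 2) / noise_var for wi in w])
    # KL(q(w) || p(w)) between two univariate Gaussians, in closed form
    kl = (np.log(prior_sigma / sigma)
          + (sigma ** 2 + (mu - prior_mu) ** 2) / (2 * prior_sigma ** 2) - 0.5)
    return loglik - kl
```

In practice mu and rho are updated by gradient ascent on this estimate using automatic differentiation; the sketch only illustrates how the reparameterization makes the objective a differentiable function of the variational parameters.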
(2018) is a recently proposed approach to continual learning that combats catastrophic forgetting in neural networks by modeling the network parameters w in a Bayesian fashion and by setting p_t(w) = q_{t-1}(w); that is, each task reuses the previous task's posterior as its prior. VCL solves the following KL divergence minimization problem:

q_t(w) = argmin_{q ∈ Q} KL( q(w) || (1/Z_t) q_{t-1}(w) p(D_t | w) )

While offering a principled approach that applies to both supervised (discriminative) and unsupervised (generative) learning settings, VCL assumes that the model structure is held fixed throughout,
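The recursion p_t(w) = q_{t-1}(w) can be illustrated in a conjugate toy case (our own sketch, not VCL's neural-network implementation), where the KL minimization above has a closed-form solution: when estimating a Gaussian mean with a Gaussian prior, the minimizing q is the exact posterior, and each task's posterior becomes the next task's prior.

```python
import numpy as np

def vcl_gaussian_update(prior_mu, prior_var, data, noise_var=1.0):
    # Conjugate analogue of the VCL update: with a Gaussian prior and a
    # Gaussian likelihood, the minimizer of KL(q || (1/Z) * prior * likelihood)
    # over all Gaussians q is the exact posterior, obtained by adding precisions.
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return post_mu, post_var

# Sequence of "tasks": each task's posterior becomes the next task's prior.
mu, var = 0.0, 10.0  # broad initial prior p(w)
for task_data in ([1.0, 1.2], [0.9], [1.1, 1.0, 0.8]):
    mu, var = vcl_gaussian_update(mu, var, np.array(task_data))
```

In this conjugate setting, the sequential (online) updates recover exactly the posterior one would obtain from all the data in a single batch, which is the property VCL approximates for neural networks with variational inference.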

