A UNIFIED BAYESIAN FRAMEWORK FOR DISCRIMINATIVE AND GENERATIVE CONTINUAL LEARNING

Abstract

Continual learning is a paradigm in which a learning system is trained on a sequence of tasks. The goal is to perform well on the current task without suffering a performance drop on previous tasks. Two notable directions among recent advances in continual learning with neural networks are (1) variational-Bayes-based regularization, which learns priors from previous tasks, and (2) learning the structure of deep networks to adapt to new tasks. So far, these two approaches have been orthogonal. We present a novel Bayesian framework for continual learning based on learning the structure of deep neural networks, addressing the shortcomings of both approaches. The proposed framework learns a deep structure for each task by learning which weights to use, and supports inter-task transfer through the overlap of the sparse subsets of weights learned by different tasks. An appealing aspect of our framework is that it applies to both discriminative (supervised) and generative (unsupervised) settings. Experimental results on supervised and unsupervised benchmarks show that our model performs comparably to or better than recent advances in continual learning.

1. INTRODUCTION

Continual learning (CL) (Ring, 1997; Parisi et al., 2019) is the learning paradigm in which a single model is subjected to a sequence of tasks. At any point in time, the model is expected to (i) make predictions for the tasks it has seen so far, and (ii) when subjected to training data for a new task, adapt to the new task by leveraging past knowledge if possible (forward transfer) and benefit the previous tasks if possible (backward transfer). While the desirable aspects of more mainstream transfer learning (sharing of bias between related tasks (Pan & Yang, 2009)) might reasonably be expected here too, the principal challenge is to retain predictive power on older tasks even after learning new tasks, thus avoiding so-called catastrophic forgetting. Real-world applications, for example in robotics or time-series forecasting, are rife with this challenging learning scenario: the ability to adapt to dynamically changing environments or evolving data distributions is essential in these domains. Continual learning is also desirable in unsupervised problems (Smith et al., 2019; Rao et al., 2019b), where the goal is to learn the underlying structure or latent representation of the data. Moreover, as a skill innate to humans (Flesch et al., 2018), it is naturally an interesting scientific problem to reproduce the same capability in artificial predictive modeling systems.

Existing approaches to continual learning are mainly based on three foundational ideas. The first is to constrain the parameter values so that they do not deviate significantly from their previously learned values, using some form of regularization or trade-off between previous and newly learned weights (Schwarz et al., 2018; Kirkpatrick et al., 2017; Zenke et al., 2017; Lee et al., 2017). A natural way to accomplish this is to train the model using online Bayesian inference, whereby the posterior over the parameters learned from task t serves as the prior for task t + 1, as in Nguyen et al.
(2018) and Zeno et al. (2018). This new, informed prior helps forward transfer, and it also prevents catastrophic forgetting by penalizing large deviations from itself. In particular, VCL (Nguyen et al., 2018) achieves state-of-the-art results by applying this simple idea to Bayesian neural networks. The second idea is to perform incremental model selection for every new task. For neural networks, this is done by evolving the structure as newer tasks are encountered (Golkar et al., 2019; Li et al., 2019). Structural learning is a sensible direction for continual learning because a new task may require a different network structure than old, unrelated tasks, and even when tasks are highly related, their lower-layer representations can be very different. Another advantage of structural learning is that, while retaining a shared set of parameters (which can be used to model task relationships), it also allows task-specific parameters that can improve performance on the new task while avoiding the catastrophic forgetting caused by forced sharing of parameters. The third idea is to invoke a form of 'replay', whereby selected or generated samples representative of previous tasks are used to retrain the model after new tasks are learned.

In this work, we introduce a novel Bayesian nonparametric approach to continual learning that incorporates the ability to learn structure into the simple yet effective framework of online Bayes. In particular, our approach models each hidden layer of the neural network using an Indian Buffet Process (Griffiths & Ghahramani, 2011) prior, which enables us to learn the network structure as new tasks arrive continually. We leverage the fact that any particular task t uses a sparse subset of the connections of the network, and different related tasks use different (albeit possibly overlapping) subsets.
Thus, in the setting of continual learning, it is more effective if the network can accommodate changes in its connections dynamically to adapt to a newly arriving task. Moreover, our model performs automatic model selection, in that each task can select the number of nodes in each hidden layer. All this is done under the principled framework of variational Bayes and a nonparametric Bayesian modeling paradigm. Another appealing aspect of our approach is that, in contrast to some recent state-of-the-art continual learning models (Yoon et al., 2018; Li et al., 2019) that are specific to supervised learning problems, our approach applies both to deep discriminative networks (supervised learning), where each task can be modeled by a Bayesian neural network (Neal, 2012; Blundell et al., 2015), and to deep generative networks (unsupervised learning), where each task can be modeled by a variational autoencoder (VAE) (Kingma & Welling, 2013).

2. PRELIMINARIES

Bayesian neural networks (Neal, 2012) are discriminative models in which the relationship between inputs and outputs is modeled via a deep neural network with parameters $w$. The network parameters are assumed to have a prior $p(w)$, and the goal is to infer the posterior given the observed data $\mathcal{D}$. Exact posterior inference is intractable in such models. One approximate inference scheme is Bayes-by-Backprop (Blundell et al., 2015), which uses a mean-field variational posterior $q(w)$ over the weights; reparameterized samples from this posterior are then used to approximate the variational lower bound via Monte Carlo sampling. Our goal in the continual learning setting is to learn such Bayesian neural networks for a sequence of tasks, inferring the posterior $q_t(w)$ for each task $t$ without forgetting the information contained in the posteriors of previous tasks. Variational autoencoders (Kingma & Welling, 2013) are generative models in which a set of inputs $\{x_n\}_{n=1}^{N}$ is modeled in terms of stochastic latent variables $\{z_n\}_{n=1}^{N}$. The mapping from each $z_n$ to $x_n$ is defined by a generator/decoder model (a deep neural network with parameters $\theta$), and the reverse mapping by a recognition/encoder model (another deep neural network with parameters $\phi$). Inference in VAEs is done by maximizing the variational lower bound on the marginal likelihood. It is customary to do point estimation for the decoder parameters $\theta$ and posterior inference for the encoder parameters $\phi$. In the continual learning setting, however, it is more desirable to infer the full posterior $q_t(w)$ over each task's encoder and decoder parameters $w = \{\theta, \phi\}$, while not forgetting information about the previous tasks as more and more tasks are observed. Our proposed continual learning framework addresses this aspect as well. Variational Continual Learning (VCL) Nguyen et al.
(2018) is a recently proposed approach to continual learning that combats catastrophic forgetting in neural networks by modeling the network parameters $w$ in a Bayesian fashion and setting $p_t(w) = q_{t-1}(w)$; that is, each task reuses the previous task's posterior as its prior. VCL solves the following KL divergence minimization problem:

$q_t(w) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\left( q(w) \,\Big\|\, \frac{1}{Z_t}\, q_{t-1}(w)\, p(\mathcal{D}_t \mid w) \right)$  (1)

While offering a principled approach applicable to both supervised (discriminative) and unsupervised (generative) learning settings, VCL assumes that the model structure is held fixed throughout, which can be limiting in continual learning, where the number of tasks and their complexity are usually unknown beforehand. This necessitates adaptively inferring the model structure, so that it can potentially adapt with each incoming task. Another limitation of VCL is that its unsupervised version, based on performing CL on VAEs, does so only for the decoder model's parameters (shared by all tasks). It uses completely task-specific encoders and, consequently, is unable to transfer information across tasks in the encoder model. Our approach addresses both these limitations in a principled manner.
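To make the online-Bayes recursion concrete, the following minimal NumPy sketch shows the closed-form KL term for diagonal Gaussian posteriors and the posterior-becomes-prior update that VCL performs after each task. The "learning" step inside the loop is a stand-in for actual ELBO optimization, included only for illustration:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over independent weights."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Online-Bayes recursion: the posterior fitted on task t becomes the prior for task t+1.
prior_mu, prior_var = np.zeros(10), np.ones(10)          # p_1(w) = N(0, I)
for t in range(3):                                       # hypothetical 3-task stream
    # In the real model, (post_mu, post_var) minimize
    #   E_q[-log p(D_t | w)] + gaussian_kl(post, prior)   (the VCL objective).
    post_mu = prior_mu + 0.1                             # stand-in "learning" step
    post_var = prior_var * 0.9
    prior_mu, prior_var = post_mu, post_var              # posterior -> next prior
```

The KL between diagonal Gaussians is available in closed form, so the regularizer toward the previous task's posterior costs no extra sampling.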

3. BAYESIAN STRUCTURE ADAPTATION FOR CONTINUAL LEARNING

In this section, we present a Bayesian model for continual learning that can potentially grow and adapt its structure as more and more tasks arrive. Our model also extends seamlessly to unsupervised learning. For brevity of exposition, in this section we mainly focus on the supervised setting, where each task has labeled data with known task identity $t$ (task-incremental). We then briefly discuss the unsupervised extension (based on VAEs) in Sec. 3.3, where task boundaries may or may not (task-agnostic) be available, and provide further details in the appendix (Sec. I). Our approach uses a basic primitive that models each hidden layer using a nonparametric Bayesian prior (Fig. 1a shows an illustration and Fig. 1b a schematic diagram). We can use these hidden layers to model feedforward connections in Bayesian neural networks or VAE models. For simplicity, assume a single hidden layer: the first task activates as many hidden nodes as required and learns the posterior over the subset of edge weights incident on each active node. Each subsequent task reuses some of the edges learned by previous tasks, using the posteriors over the previously learned weights as priors. Additionally, it may activate some new nodes and learn the posterior over some of their incident edges. It thus learns a posterior over a subset of weights that may overlap with those learned by previous tasks. When making predictions, a task uses only the connections it has learned. Giving later tasks more slack in terms of model size (allowing them to create new nodes) indirectly lets each task learn better without deviating too much from the prior (here, the posterior of the previous tasks), further reducing the chance of catastrophic forgetting (Kirkpatrick et al., 2017).

3.1. GENERATIVE STORY.

Omitting the task id $t$ for brevity, consider modeling the $t$-th task using a neural network with $L$ hidden layers. We model the weights in layer $l$ as $W^l = B^l \odot V^l$, a point-wise multiplication of a real-valued matrix $V^l$ (with a Gaussian prior $\mathcal{N}(0, \sigma_0^2)$ on each entry) and a task-specific binary matrix $B^l$. This ensures sparse connection weights between the layers. Moreover, we model $B^l \sim \mathrm{IBP}(\alpha)$ using the Indian Buffet Process (IBP) prior (Griffiths & Ghahramani, 2011), where the hyperparameter $\alpha$ controls the number of nonzero columns in $B^l$ and its sparsity. The IBP prior thus enables learning the size of $B^l$ (and consequently of $V^l$) from data; as a result, the number of nodes in each hidden layer is learned adaptively. The output-layer weights are denoted $W_{\text{out}}$, each with a Gaussian prior $\mathcal{N}(0, \sigma_0^2)$. The outputs are

$y_n \sim \mathrm{Lik}(W_{\text{out}}\, \phi_{NN}(x_n)), \quad n = 1, \ldots, N$  (2)

Here $\phi_{NN}$ is the function computed (using parameter samples) up to the last hidden layer of the network thus formed, and Lik denotes the likelihood model for the outputs. Similar priors on network weights have been used in other recent works to learn sparse deep neural networks (Panousis et al., 2019; Xu et al., 2019). However, these works assume a single task to be learned; in contrast, our focus is to leverage such priors in the continual learning setting, where we must learn a sequence of tasks while avoiding catastrophic forgetting. Henceforth, we suppress the superscript denoting the layer number for simplicity; the discussion holds identically for all hidden layers. When adapting to a new task, the posterior over $V$ learned from previous tasks is used as the prior. A new $B$ is learned afresh, ensuring that each task learns only the subset of weights relevant to it. Stick-Breaking Construction.
As described above, to adaptively infer the number of nodes in each hidden layer, we use the IBP prior (Griffiths & Ghahramani, 2011), whose truncated stick-breaking construction (Doshi et al., 2009) for each entry of $B$ is

$\nu_k \sim \mathrm{Beta}(\alpha, 1), \quad \pi_k = \prod_{i=1}^{k} \nu_i, \quad B_{d,k} \sim \mathrm{Bernoulli}(\pi_k)$  (3)

for $d \in \{1, \ldots, D\}$, where $D$ denotes the number of input nodes for this hidden layer, and $k \in \{1, 2, \ldots, K\}$, where $K$ is the truncation level and $\alpha$ controls the effective value of $K$, i.e., the number of active hidden nodes. Note that the prior probability $\pi_k$ that weights incident on hidden node $k$ are nonzero decreases monotonically with $k$, until, beyond some number of nodes, no further nodes have incoming edges with nonzero weights from the previous layer, which amounts to their being switched off in the structure. Moreover, due to the cumulative-product construction of the $\pi_k$'s, an implicit ordering is imposed on the nodes being used. This ordering is preserved across tasks, and the allocation of nodes to a task follows it, facilitating the reuse of weights. The truncated stick-breaking approximation is a practically plausible and intuitive choice for continual learning, since a fundamental tenet of continual learning is that model complexity should not grow unboundedly as more tasks are encountered. Suppose we fix a budget on the maximum allowed size of the network (number of hidden nodes per layer) after it has seen, say, $T$ tasks; this budget corresponds exactly to the truncation level for each layer. For each task, nodes are then allocated conservatively from this total budget, in a fixed order, conveniently controlled by the $\alpha$ hyperparameter. In the appendix (Sec. D), we also discuss a dynamic expansion scheme that avoids specifying a truncation level (and provide experimental results).
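To make the construction concrete, the following NumPy sketch draws a single-layer mask via the truncated stick-breaking process and applies it to shared weights. This is a sample-based illustration of the generative model only, not the variational inference procedure; note how $\pi_k$ decays monotonically, so later columns of $B$ are increasingly likely to be all-zero (inactive nodes):

```python
import numpy as np

def sample_ibp_masks(D, K, alpha, rng):
    """Truncated stick-breaking draw (Eq. 3):
       nu_k ~ Beta(alpha, 1),  pi_k = prod_{i<=k} nu_i,  B[d,k] ~ Bernoulli(pi_k)."""
    nu = rng.beta(alpha, 1.0, size=K)
    pi = np.cumprod(nu)                                  # monotonically non-increasing in k
    B = (rng.uniform(size=(D, K)) < pi).astype(float)    # broadcast pi over the D rows
    return B, pi

rng = np.random.default_rng(0)
D, K = 20, 50                                            # input dim, truncation level
B, pi = sample_ibp_masks(D, K, alpha=5.0, rng=rng)

V = rng.normal(0.0, 1.0, size=(D, K))                    # shared real-valued weights, prior N(0, sigma_0^2)
W = B * V                                                # W = B (.) V: sparse, task-specific weights
active = B.any(axis=0)                                   # hidden nodes with any nonzero incoming weight

x = rng.normal(size=(3, D))                              # a small batch of inputs
h = np.maximum(0.0, x @ W)                               # forward pass uses only masked connections
```

Because only `B` is task-specific while `V` is shared, tasks whose masks overlap reuse the same underlying weights, which is the mechanism for inter-task transfer described above.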

3.2. INFERENCE

Exact inference is intractable in this model due to non-conjugacy, so we resort to variational inference (Blei et al., 2017). We employ a structured mean-field approximation (Hoffman & Blei, 2015), which performs better than the commonly used fully factorized mean-field approximation, since it captures dependencies between the approximate posteriors of $B$ and $v$. In particular, we use $q(V, B, v) = q(V)\, q(B \mid v)\, q(v)$, where $q(V) = \prod_{d=1}^{D} \prod_{k=1}^{K} \mathcal{N}(V_{d,k} \mid \mu_{d,k}, \sigma^2_{d,k})$ is a mean-field Gaussian approximation for the network weights. Corresponding to the Beta-Bernoulli hierarchy of Eq. 3, we use the conditionally factorized variational posterior family $q(B \mid v) = \prod_{d=1}^{D} \prod_{k=1}^{K} \mathrm{Bern}(B_{d,k} \mid \theta_{d,k})$, where $\theta_{d,k} = \sigma(\rho_{d,k} + \mathrm{logit}(\pi_k))$, and $q(v) = \prod_{k=1}^{K} \mathrm{Beta}(v_k \mid \nu_{k,1}, \nu_{k,2})$. Thus $\Theta = \{\nu_{k,1}, \nu_{k,2}, \{\mu_{d,k}, \sigma_{d,k}, \rho_{d,k}\}_{d=1}^{D}\}_{k=1}^{K}$ is the set of learnable variational parameters. Each column of $B$ represents the binary mask for the weights incident on a particular node. Note that although the binary variables in a single column of $B$ share a common prior, their posteriors differ, allowing a task to selectively choose a subset of the weights, with the common prior controlling the degree of sparsity. The ELBO is

$\mathcal{L} = \mathbb{E}_{q(V, B, v)}[\ln p(Y \mid V, B, v)] - \mathrm{KL}(q(V, B, v) \,\|\, p(V, B, v))$  (4)

and its Monte Carlo estimate with $S$ samples decomposes as

$\mathcal{L} \approx \frac{1}{S} \sum_{i=1}^{S} \big[ f(V_i, B_i, v_i) - \mathrm{KL}[q(B \mid v_i) \,\|\, p(B \mid v_i)] \big] - \mathrm{KL}[q(V) \,\|\, p(V)] - \mathrm{KL}[q(v) \,\|\, p(v)]$  (5)

Bayes-by-backprop (Blundell et al., 2015) is a common choice for performing variational inference in this context. Eq. 4 defines the evidence lower bound (ELBO) in terms of a data-dependent likelihood term and data-independent KL terms, which further decompose under the mean-field factorization. The expectation terms are optimized using unbiased reparameterized gradients from the respective posteriors. All the KL divergence terms in Eq. 5 have closed-form expressions; using these directly, rather than estimating them from Monte Carlo samples, alleviates both the approximation error and the computational overhead to some extent. The log-likelihood term decomposes as $f(V, B, v) = \log \mathrm{Lik}(Y \mid V, B, v) = \log \mathrm{Lik}(Y \mid W_{\text{out}}\, \phi_{NN}(X; V, B, v))$, where $(X, Y)$ is the training data. For regression, Lik can be Gaussian with some noise variance, while for classification it can be Bernoulli with a probit or logistic link. Details of gradient computation for terms involving Beta and Bernoulli random variables are provided in the appendix (Sec. F).
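Reparameterized gradients through the discrete Bernoulli mask variables are commonly obtained with the binary Concrete (Gumbel-softmax) relaxation, whose temperature annealing is mentioned in the appendix hyperparameter settings. A hedged NumPy sketch of the sampler (the learning-side gradient flow would be handled by an autodiff framework in practice):

```python
import numpy as np

def relaxed_bernoulli(theta, temperature, rng):
    """BinConcrete relaxation: a differentiable surrogate for B ~ Bernoulli(theta).
    As temperature -> 0, samples concentrate near {0, 1}."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=np.shape(theta))
    logistic_noise = np.log(u) - np.log(1.0 - u)          # Logistic(0, 1) sample
    logits = np.log(theta) - np.log(1.0 - theta)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))

rng = np.random.default_rng(0)
theta = np.full(1000, 0.99)                               # posterior inclusion probabilities
soft = relaxed_bernoulli(theta, temperature=10.0, rng=rng)   # smooth, far from {0, 1}
hard = relaxed_bernoulli(theta, temperature=0.01, rng=rng)   # nearly binary
```

Annealing the temperature from a high value toward a small one (10.0 down to 0.25 in the appendix settings) trades early gradient smoothness for late sample fidelity to the discrete mask.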

3.3. UNSUPERVISED CONTINUAL LEARNING

Our discussion thus far has focused on continual learning where each task is a supervised learning problem. Our framework, however, readily extends to unsupervised continual learning (Nguyen et al., 2018; Smith et al., 2019; Rao et al., 2019b), where each task involves learning a deep generative model, commonly a VAE. In this case, each input observation $x_n$ has an associated latent variable $z_n$. Collectively denoting all inputs as $X$ and all latent variables as $Z$, we can define an ELBO similar to Eq. 4 as

$\mathcal{L} = \mathbb{E}_{q(Z, V, B, v)}[\ln p(X \mid Z, V, B, v)] - \mathrm{KL}(q(Z, V, B, v) \,\|\, p(Z, V, B, v))$  (7)

Note that, unlike the supervised case, this ELBO also involves an expectation over $Z$. As with Eq. 5, it can be approximated using Monte Carlo samples, where each $z_n$ is sampled from the amortized posterior $q(z_n \mid V, B, v, x_n)$. In addition to learning the model size adaptively, as shown in the schematic diagram (Fig. 1b (ii)), our model learns shared weights and task-specific masks for both the encoder and decoder models. In contrast, VCL uses a fixed-size model with entirely task-specific encoders, which prevents knowledge transfer across the encoders.
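The Monte Carlo approximation of this ELBO follows the standard VAE recipe: reparameterized samples of $z$ give an unbiased estimate of the reconstruction term, while the Gaussian KL over $z$ is available in closed form. A minimal NumPy sketch, with a hypothetical linear decoder standing in for the masked network:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, enc_mu, enc_logvar, decode, n_samples=8):
    """Per-datapoint Monte Carlo ELBO:
       (1/S) sum_s log p(x | z_s) - KL( q(z|x) || N(0, I) ),
    with z_s ~ q(z|x) drawn via the reparameterization trick."""
    var = np.exp(enc_logvar)
    kl = 0.5 * np.sum(var + enc_mu ** 2 - 1.0 - enc_logvar)   # closed-form Gaussian KL
    recon = 0.0
    for _ in range(n_samples):
        z = enc_mu + np.sqrt(var) * rng.normal(size=enc_mu.shape)
        recon += -0.5 * np.sum((x - decode(z)) ** 2)          # Gaussian likelihood, up to a constant
    return recon / n_samples - kl

W_dec = rng.normal(size=(4, 12)) * 0.1        # hypothetical linear decoder weights
x = rng.normal(size=12)
elbo = elbo_estimate(x, enc_mu=np.zeros(4), enc_logvar=np.zeros(4),
                     decode=lambda z: z @ W_dec)
```

In the full model, the decoder weights themselves would be the masked $B \odot V$ parameters with their own KL terms, as in Eq. 5.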

3.4. OTHER KEY CONSIDERATIONS

Task-Agnostic Setting. Our framework extends to task-agnostic continual learning as well, where task boundaries are unknown. Following Lee et al. (2020), we use a gating mechanism (Eq. 8, where $t_n$ denotes the task identity of the $n$-th sample $x_n$):

$p(t_n = k \mid x_n) = \frac{p(x_n \mid t_n = k)\, p(t_n = k)}{\sum_{k'=1}^{K} p(x_n \mid t_n = k')\, p(t_n = k')}$  (8)

and decompose the marginal log-likelihood as

$\log p(X) = \mathbb{E}_{q(t)}\left[\log \frac{p(X, t \mid \theta)}{q(t)}\right] + \mathrm{KL}\left(q(t) \,\|\, p(t \mid X, \theta)\right)$

where $q(t)$ is the variational posterior over the task identity. As in the E-step of Expectation Maximization (Moon, 1996), we can drive the KL divergence term to zero, and the M-step becomes

$\arg\max_{\theta} \log p(X) = \arg\max_{\theta} \mathbb{E}_{p(t \mid X, \theta_{\text{old}})}\left[\log p(X \mid t, \theta)\right]$

Here, $\log p(X \mid t = k)$ is intractable but can be replaced by its variational lower bound (Eq. 7). We use Monte Carlo sampling to approximate $p(x_n \mid t_n = k)$. Samples from a new task are detected using a threshold (Rao et al., 2019a) on the evidence lower bound (Appendix Sec. J).

Masked Priors. Using the previous task's posterior as the prior for the current task (Nguyen et al., 2018) may introduce undesired regularization from partially trained parameters that did not contribute to previous tasks, and may thereby promote catastrophic forgetting. Also, choosing a Gaussian initial prior leads to the creation of more nodes than required, due to over-regularization. To address this, we mask the prior for the next task $t$ with the initial prior, defining

$p_t(V_{d,k}) = B^o_{d,k}\, q_{t-1}(V_{d,k}) + (1 - B^o_{d,k})\, p_0(V_{d,k})$

where $B^o = B^1 \cup B^2 \cup \cdots \cup B^{t-1}$ is the combined mask of all previously learned tasks, $q_{t-1}$ and $p_t$ are the previous posterior and the current prior, respectively, and $p_0$ is the prior used for the first task. A standard choice for the initial prior $p_0$ is a uniform distribution.
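Because $B^o$ is binary, the masked prior above reduces to an elementwise selection between the previous posterior's parameters and the initial prior's. A sketch, assuming diagonal Gaussian distributions throughout (the paper's $p_0$ may instead be uniform; the shapes and mask values here are illustrative):

```python
import numpy as np

def masked_prior(prev_masks, q_mu, q_var, p0_mu, p0_var):
    """p_t(V) = B_o * q_{t-1}(V) + (1 - B_o) * p_0(V), elementwise, where
    B_o is the union of the binary masks of all previously learned tasks."""
    B_o = np.logical_or.reduce(prev_masks).astype(float)
    prior_mu = B_o * q_mu + (1.0 - B_o) * p0_mu     # reuse posterior where any task used the weight
    prior_var = B_o * q_var + (1.0 - B_o) * p0_var  # fall back to p_0 for untouched weights
    return prior_mu, prior_var, B_o

masks = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 0]])]   # masks of tasks 1 and 2
mu, var, B_o = masked_prior(masks,
                            q_mu=np.full((2, 2), 0.7), q_var=np.full((2, 2), 0.2),
                            p0_mu=np.zeros((2, 2)), p0_var=np.ones((2, 2)))
```

Weights never used by any previous task keep the uninformative initial prior, so they are free to adapt to the new task instead of being regularized toward an arbitrary, untrained posterior.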

4. RELATED WORK

One of the key challenges in continual learning is preventing catastrophic forgetting, typically addressed through regularization of the parameter updates, preventing them from drastically changing from the values learned on previous task(s). Notable methods based on this strategy include EwC (Kirkpatrick et al., 2017), SI (Zenke et al., 2017), and LP (Smola et al., 2003). Superseding these methods is the Bayesian approach, a natural remedy for catastrophic forgetting: for any task, the posterior of the model learned from the previous task serves as the prior for the current task, which is canonical online Bayes. This approach is used in recent works such as VCL (Nguyen et al., 2018) and task-agnostic variational Bayes (Zeno et al., 2018) for learning Bayesian neural networks in the CL setting. Our work is most similar in spirit to, and builds upon, this body of work. Another key ingredient in CL methods is replay, where some samples from previous tasks are used to fine-tune the model after learning a new task (thus refreshing its memory and avoiding catastrophic forgetting). Works using this idea include Lopez-Paz et al. (2017), which solves a constrained optimization problem at each task, the constraint being that the loss should decrease monotonically on a heuristically selected replay buffer; Hu et al. (2019), which uses a partially shared parameter space for inter-task transfer and generates replay samples through a data-generative module; and Titsias et al. (2020), which learns a Gaussian process for each task with a shared mean function in the form of a feedforward neural network, the replay buffer being the set of inducing points typically used to speed up GP inference. For VCL and our work, the coreset serves as a replay buffer (Appx. C); but we emphasize that it is not the primary mechanism to overcome catastrophic forgetting in these cases, rather an additional mechanism to prevent it.
Recent work in CL has investigated allowing the structure of the model to change dynamically with newly arriving tasks. Among these, strong evidence in support of our assumptions can be found in Golkar et al. (2019), which also learns different sparse subsets of the weights of each layer for different tasks; the sparsity is enforced by a combination of weighted L1 regularization and threshold-based pruning. There are also methods that learn, rather than subsets of weights, the subset of hidden-layer nodes to be used for each task; such a strategy is adopted either by using evolutionary algorithms to select the node subsets (Fernando et al., 2017) or by training the network with task-embedding-based attention masks (Serrà et al., 2018). One recent approach, Adel et al. (2020), instead of using binary masks, adapts network weights at different scales for different tasks; it is, however, designed only for discriminative tasks. Another recent approach (2020) amortizes the network parameters directly from input samples, which is a promising direction that can be adapted in future research. For non-stationary data, online variational Bayes is not directly applicable, as it assumes independently and identically distributed (i.i.d.) data; as a result, the variance of the Gaussian posterior approximation shrinks as the amount of training data grows. Kurle et al. (2020) proposed the use of Bayesian forgetting, which could naturally be applied to our approach to handle non-stationary data, though it would require some modifications for the task-agnostic setup. We have not explored this extension here, leaving it for future work.

5. EXPERIMENTS

We perform experiments on both supervised and unsupervised CL and compare our method with relevant state-of-the-art methods. In addition to the quantitative (accuracy/log-likelihood comparisons) and qualitative (generation) results, we also examine the network structures learned by our model. Some details (e.g., experimental settings) have been moved to the appendix.

5.1. SUPERVISED CONTINUAL LEARNING

We first evaluate our model on standard supervised CL benchmarks, comparing with existing approaches such as Pure Rehearsal (Robins, 1995), EwC (Kirkpatrick et al., 2017), IMM (Lee et al., 2017), DEN (Yoon et al., 2018), RCL (Xu & Zhu, 2018), and a "Naïve" baseline that learns a shared model for all tasks. We evaluate on five supervised CL benchmarks: Split MNIST, Split notMNIST (small), Permuted MNIST, Split fashionMNIST, and Split Cifar100. The last-layer heads (Appx. E.1) were kept separate for each task for fair baseline comparison. For Split MNIST, Split notMNIST, and Split fashionMNIST, each dataset is split into 5 binary classification tasks. For Split Cifar100, the dataset is split into 10 multiclass classification tasks. For Permuted MNIST, each task is a multiclass classification problem with a fixed random permutation applied to the pixels of every image; we generated 5 such tasks for our experiments. Fig. 2 shows the mean test accuracies on all supervised benchmarks as new tasks are observed. As shown, the average test accuracy of our method (both without and with coresets) is better than that of the compared baselines (here, we have used the random point selection method for coresets). Moreover, the accuracy drops much more slowly than for the other baselines, showing the efficacy of our model in preventing catastrophic forgetting thanks to the adaptively learned structure. In Fig. 3, we show the accuracy on the first task as new tasks arrive and compare specifically with VCL.
In this case too, we observe that our method yields relatively stable first-task accuracies compared to VCL. We note that for Permuted MNIST, the accuracy on the first task increases as new tasks are trained, showing the presence of backward transfer, another desideratum of CL. We also report the performance of our dynamically growing network variant (for details, refer to Appx. Sec. D). As shown in Fig. 3 (Network Used), the IBP prior concentrates weights on very few nodes and learns sparse structures. Moreover, most newer tasks tend to allocate fewer weights and yet perform well, implying effective forward transfer. Another important observation from Fig. 3 is that the weight sharing between similar tasks, as in notMNIST, is higher than that between dissimilar tasks, as in Permuted MNIST. Note that new tasks show higher weight sharing irrespective of similarity; this is an artifact induced by the IBP (Sec. 3.1), which tends to allocate more active weights on the upper side of the matrix. We therefore conclude that although a new task tends to share weights learned by old tasks, the new connections it creates are indispensable for its performance. Intuitively, the more unrelated a task is to previously seen ones, the more new connections it will make, thus reducing negative transfer (an unrelated task adversely affecting other tasks) between tasks.

5.2. UNSUPERVISED CONTINUAL LEARNING

We next evaluate our model on generative tasks in the CL setting, comparing with existing approaches such as Naïve, EwC, and VCL. We do not include the other methods from the supervised setup, as their implementations do not incorporate generative modeling. We perform continual learning experiments for deep generative models using a VAE-style network on two datasets, MNIST and notMNIST. For MNIST, the tasks are a sequence of single-digit generation problems from 0 to 9; similarly, for notMNIST, each task is the generation of one character from A to J. Note that, unlike VCL and the other baselines, where all tasks have separate encoders and a shared decoder, our model (as discussed in Sec. 3.3) uses a shared encoder for all tasks, with a task-specific mask for each task (cf. Fig. 1b (ii)). This enables transfer of knowledge, while the task-specific masks effectively prevent catastrophic forgetting. Generation: As shown in Fig. 5, the modeling innovation we introduce for the unsupervised setting results in much improved log-likelihood on held-out sets. In each individual figure in Fig. 4, each row shows generated samples from all previously seen tasks and the current task. We see that, compared to the baselines, the quality of generated samples does not deteriorate as more tasks are encountered. This shows that our model can efficiently perform generative modeling by reusing subsets of the network and creating a minimal number of new nodes for each task. Task-Agnostic Learning: Fig. 5 shows a particular case where nine tasks were inferred out of 10 classes, with high correlation between classes 4 and 9 due to their visual similarity. Since each task uses its own set of network connections, this result reinforces our model's ability to capture task relationships through network sharing.
Further, the log-likelihood obtained in the task-agnostic setting is comparable to that of our model with known task boundaries, suggesting that our approach can be used effectively in task-agnostic settings as well. Representation Learning: Table 1 reports the quality of the representations learned by our unsupervised continual learning approach. For this experiment, we use the learned representations to train a KNN classifier with different values of K. We note that, despite having task-specific encoders, VCL and the other baselines fail to learn good latent representations, while the proposed model learns good representations when task boundaries are known and is comparable to the state-of-the-art baseline CURL (Rao et al., 2019a) in the task-agnostic setting.

6. CONCLUSION

We have unified structure learning in neural networks with variational inference in the setting of continual learning, demonstrating competitive performance with state-of-the-art models on both discriminative (supervised) and generative (unsupervised) learning problems. In this work, we have experimented with task-incremental continual learning in the supervised setup and sequential generation tasks in the unsupervised setting. We believe that our task-agnostic setup can be extended to the class-incremental learning scenario, where samples from a set of classes arrive sequentially and the model is expected to classify over all observed classes. It would also be interesting to generalize this idea to more sophisticated network architectures, such as recurrent or residual neural networks, possibly by also exploring improved approximate inference methods. A few more interesting extensions lie in semi-supervised continual learning and continual learning with non-stationary data. Adapting other sparse Bayesian structure learning methods, e.g., Ghosh et al. (2018), to the continual learning setting is also a promising avenue, and adapting the depth of the network is a more challenging endeavour that might be undertaken. We leave these extensions for future work.

A DATA

The datasets used in our experiments, with train/test split information, are listed in the table below. The MNIST dataset comprises 28 × 28 monochromatic images of handwritten digits from 0 to 9. The notMNIST dataset comprises glyphs of the letters A to J in different fonts, with a configuration similar to MNIST. Fashion MNIST is also monochromatic, comprising 10 classes (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot) in a format similar to MNIST. The Cifar100 dataset contains RGB images with 600 images per class.

B.1 SUPERVISED CONTINUAL LEARNING: HYPERPARAMETER SETTINGS

For the MNIST, notMNIST, and fashionMNIST datasets, our model uses a single-hidden-layer neural network with 200 hidden units. For RCL (Xu & Zhu, 2018) and DEN (Yoon et al., 2018), two hidden layers were used, with initial sizes of 256 and 128 units, respectively. For the Cifar100 dataset, we used an AlexNet-like structure with three convolutional layers of 128, 256, and 512 channels with 4 × 4, 3 × 3, and 2 × 2 kernels, followed by two dense layers of 2048 units each. For the convolutional layers, batch-norm layers were kept separate for each task. We use the Adam optimizer, with a learning rate of 0.01 for the IBP posterior parameters and 0.001 for the others; this avoids the vanishing-gradient problem introduced by the sigmoid function. For selective finetuning, we use a learning rate of 0.0001 for all parameters. The temperature of the Gumbel-softmax reparameterization for the Bernoulli variables is annealed from 10.0 down to a minimum of 0.25. The value of α is initialized to 30.0 for the first task, and to the maximum of the obtained posterior shape parameters for each subsequent task. Similar to VCL, we initialize our model with maximum-likelihood training for the first task. For all datasets, we train our model for 5 epochs and then selectively finetune it for 5 more epochs. For experiments including coresets, we use a coreset size of 50; coreset selection is done using the random and k-center methods (Nguyen et al., 2018). For our model with dynamic expansion, we initialize the network with 50 hidden units.

B.2 UNSUPERVISED CONTINUAL LEARNING: HYPERPARAMETER SETTINGS

For all datasets, our model uses 2 hidden layers of 500 units each for the encoder, mirrored for the decoder, with a latent dimension of 100 units. For other approaches such as Naive, EWC, and VCL (Kirkpatrick et al., 2017; Nguyen et al., 2018), we use task-specific encoders with 3 hidden layers of 500 units each and a latent size of 100 units, and a symmetrically reversed decoder, with the last two decoder layers shared among all tasks and the first layer specific to each task. We use the Adam optimizer with a learning-rate configuration similar to that of the supervised setting. The temperature for the Gumbel-softmax reparameterization is annealed from 10 to 0.25. For the first task, we initialize the α values of the encoder hidden layers to 40 each, mirrored in the decoder, and initialize α to 20 for the latent layers. We update the α's for subsequent tasks in the same fashion as in the supervised setting. For the unsupervised learning experiments, we did not use coresets.

C CORESET METHOD EXPLANATION

A coreset-specific predictive model is learnt after every new task, and discarded thereafter. Intuitively this makes sense, as newly learnt weights for future tasks can help older tasks perform better (backward transfer) at test time. Coreset selection can be done either through random selection or the K-center greedy algorithm (Gonzalez, 1985). Next, the posterior is decomposed as p(θ | D_{1:t}) ∝ p(θ | D_{1:t} \ C_t) p(C_t | θ) ≈ q̃_t(θ) p(C_t | θ), where q̃_t(θ) is the variational posterior obtained using the current task's training data, excluding the current coreset.
Applying this trick recursively, we can write p(θ | D_{1:t} \ C_t) ∝ p(θ | D_{1:t-1} \ C_{t-1}) p(D_t ∪ C_{t-1} \ C_t | θ) ≈ q̃_{t-1}(θ) p(D_t ∪ C_{t-1} \ C_t | θ). We then approximate this posterior variationally as q̃_t(θ) = proj(q̃_{t-1}(θ) p(D_t ∪ C_{t-1} \ C_t | θ)). Finally, a projection step is performed using the coreset data before prediction: q_t(θ) = proj(q̃_t(θ) p(C_t | θ)). This final projection onto the coreset data before prediction mitigates any residual forgetting. Algorithm 1 summarizes the training procedure for our model in the setting with known task boundaries.
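The K-center greedy selection (Gonzalez, 1985) referenced above repeatedly picks the point farthest from its nearest already-selected center. A minimal sketch on raw feature vectors (function and variable names are ours, not from the paper's code):

```python
import math

def k_center_greedy(points, k, first=0):
    """Greedily select k indices so the chosen points cover the set:
    each new center is the point farthest from its nearest center."""
    selected = [first]
    # distance from every point to its nearest selected center
    dists = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: dists[i])
        selected.append(nxt)
        # update nearest-center distances with the new center
        for i, p in enumerate(points):
            dists[i] = min(dists[i], math.dist(p, points[nxt]))
    return selected
```

In a continual learning loop, `points` would be the (embedded) training examples of the tasks seen so far and the returned indices would form the coreset C_t; random selection is the trivial alternative mentioned in the text.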

D DYNAMIC EXPANSION METHOD

Although our inference scheme uses a truncation-based approach for the IBP posterior, it is possible to do inference in a truncation-free manner. One possibility is to greedily grow the layer width until performance saturates; however, we found that this leads to bad optima (low peaks of the likelihood). Instead, we leverage the fact that, given a sufficiently large number of columns, the last columns of the IBP matrix tend to be all zeros. We therefore increase the number of hidden nodes after every iteration so as to keep the number of such empty trailing columns equal to a constant T_l, as follows:

C^l_j = C^l_{j+1} ∏_{i=1}^{D^l} I(B^l_{ij} = 0),    G^l = T_l − Σ_{j=1}^{K_l} C^l_j

where l is the current layer index, B^l is the sampled IBP mask for the current task, C^l_j indicates whether all columns from the j-th column onward are empty, and G^l is the number of hidden units by which to expand the current layer. The per-task loop body of Algorithm 1 is: fix the IBP parameters and learned mask; Θ_t ← arg min L_t; p_new ← q_t(Θ); p_new ← Mask(p_new) using Eq. 11; perform prediction for the given test set. Separate heads dramatically improve performance in continual learning; therefore, in the supervised setting, we use a generalized linear model on the embeddings from the last hidden layer, with the parameters up to the last layer involved in transfer and adaptation. We also report a comparison of single-head models in Sec. H.2.
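The growth rule above can be sketched as follows, where `mask` stands for the sampled binary IBP matrix B^l of the current task and `target_empty` for T_l (names are ours; clamping the growth at zero when there are already enough empty columns is our assumption, since the paper only grows the layer):

```python
def expansion_size(mask, target_empty):
    """Count trailing all-zero columns of the binary mask and return how
    many hidden units to add so that target_empty trailing columns remain
    empty (clamped at 0 if enough columns are already empty)."""
    n_cols = len(mask[0])
    empty_trailing = 0
    for j in range(n_cols - 1, -1, -1):
        if any(row[j] for row in mask):
            break  # column j is used; columns before it cannot count
        empty_trailing += 1
    return max(0, target_empty - empty_trailing)
```

For example, a mask whose last column is unused but whose other columns are active would trigger an expansion of `target_empty - 1` units.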

E.2 SPACE COMPLEXITY

The proposed scheme entails storing a binary matrix for each layer of each task, which amounts to one bit per weight parameter; this is not prohibitive and can be stored efficiently as sparse matrices. Moreover, each task uses a very limited number of columns of the IBP matrix, and hence does not pose any significant overhead. The space complexity grows slowly with the number of tasks T, as O(M + T log_2(M)), where M is the number of parameters.

E.3 ADJUSTING BIAS TERMS

The IBP selection acts on the weight matrix only. For the hidden nodes not selected by a task, the corresponding biases need to be removed as well. In principle, the bias vector for a hidden layer should be multiplied by a binary vector u, with u_i = I[∃d : B_{d,i} = 1]. In practice, we simply scale each bias component by the maximum reparameterized Bernoulli value in its column.
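The binary vector u described above can be computed directly from the task mask B (a sketch with names of our choosing; `mask` is B with rows indexed by input dimension d and columns by hidden unit i):

```python
def bias_mask(mask):
    """u_i = 1 if any incoming weight of hidden unit i is selected."""
    n_cols = len(mask[0])
    return [1 if any(row[i] for row in mask) else 0 for i in range(n_cols)]

def adjust_biases(biases, mask):
    """Zero out the biases of hidden units the task does not use."""
    u = bias_mask(mask)
    return [b * ui for b, ui in zip(biases, u)]
```

This implements the in-principle rule; the paper's practical variant instead scales each bias by the maximum reparameterized (near-binary) Bernoulli value in the column, which converges to the same behaviour as the masks become binary.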

E.4 SELECTIVE FINETUNING

While training with the Gumbel-softmax reparameterization, the sampled masks are close to binary but not exactly binary; replacing them with fully binary masks at test time therefore slightly reduces performance. To restore performance, we finetune the network with the masks fixed. Algorithm 1 summarizes our model's training procedure. The coreset update method we use is the one proposed in Nguyen et al. (2018).

F ADDITIONAL INFERENCE DETAILS

Sampling Methods. We obtain unbiased reparameterized gradients for all the parameters of the variational posterior distributions. For the Bernoulli-distributed variables, we employ the Gumbel-softmax trick (Jang et al., 2017), also known as CONCRETE (Maddison et al., 2017). For the Beta-distributed v's, the Kumaraswamy reparameterization gradient technique (Nalisnick & Smyth, 2017) is used. For the real-valued weights, the standard location-scale trick for Gaussians is used. Inference over parameters φ that involve a random (stochastic) node Z, i.e., Z ∼ q_φ(Z), cannot be done in a straightforward way when the objective is a Monte Carlo expectation with respect to that random variable, L = E_{q_φ(Z)}[L(Z)], because one cannot back-propagate through a random node. To overcome this issue, Kingma & Welling (2013) introduced the reparameterization trick: the random variable is written as a deterministic map Z = f(φ, ε) of a new random variable ε, so that the expectation can be rewritten as L = E_{q(ε)}[L(f(φ, ε))], where ε is sampled instead of Z. In this section, we discuss the reparameterization tricks we used.

F.1 GAUSSIAN DISTRIBUTION REPARAMETERIZATION

The weights of our Bayesian neural network are assumed to be Gaussian with diagonal variances, i.e., V_k ∼ N(V_k | μ_{V_k}, σ²_{V_k}). We reparameterize using the location-scale trick as V_k = μ_{V_k} + σ_{V_k} · ε, ε ∼ N(0, 1), where k indexes the parameter being sampled. With this reparameterization, the gradients with respect to μ_{V_k} and σ_{V_k} can be computed using back-propagation.
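As a concrete sketch of the location-scale trick (in training code ε would be resampled at every forward pass; here it can be passed in explicitly to make the deterministic map visible):

```python
import random

def sample_gaussian_weight(mu, sigma, eps=None):
    """Location-scale reparameterization: V = mu + sigma * eps,
    with eps ~ N(0, 1). The map is deterministic in (mu, sigma),
    so gradients flow through both parameters."""
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps
```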

F.2 BETA DISTRIBUTION REPARAMETERIZATION

The Beta distribution over the parameters ν in the IBP posterior can be reparameterized using the Kumaraswamy distribution (Nalisnick & Smyth, 2017), since the Kumaraswamy and Beta distributions coincide when either of their shape parameters equals 1. The Kumaraswamy distribution is defined as p(ν; a, b) = a b ν^{a-1} (1 − ν^a)^{b-1}, which can be reparameterized as ν = (1 − u^{1/b})^{1/a}, u ∼ U(0, 1), where U denotes the uniform distribution. The KL divergence between the Kumaraswamy posterior and the Beta prior can be written as

KL(q(ν; a, b) || p(ν; α, β)) = ((a − α)/a)(−γ − Ψ(b) − 1/b) + log(ab) + log B(α, β) − (b − 1)/b + (β − 1) b Σ_{m=1}^{∞} (1/(m + ab)) B(m/a, b)

where γ is the Euler-Mascheroni constant, Ψ is the digamma function, and B is the Beta function. As described in Nalisnick & Smyth (2017), the infinite sum above can be approximated well by its first 11 terms.
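The Kumaraswamy reparameterization above is a one-line inverse-CDF transform; a sketch:

```python
import random

def sample_kumaraswamy(a, b, u=None):
    """Sample nu ~ Kumaraswamy(a, b) via the inverse CDF:
    nu = (1 - u**(1/b))**(1/a), with u ~ Uniform(0, 1)."""
    if u is None:
        u = random.random()
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)
```

Because the transform is differentiable in a and b, gradients of the ELBO with respect to the variational shape parameters flow through the sample, which is exactly what the Beta distribution itself does not allow.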

F.3 BERNOULLI DISTRIBUTION REPARAMETERIZATION

For the Bernoulli distribution over the masks in the IBP posterior, we employ the continuous relaxation of discrete distributions proposed in categorical reparameterization with Gumbel-softmax (Jang et al., 2017), also known as the CONCRETE distribution (Maddison et al., 2017). A Concrete random variable on the probability simplex is sampled as

B_k = exp((log α_k + g_k)/λ) / Σ_{i=1}^{K} exp((log α_i + g_i)/λ),   g_k ∼ Gumbel(0, 1)

where λ ∈ (0, ∞) is a temperature hyperparameter, α_k is the posterior parameter representing the discrete probability of the k-th class, and g_k is a random sample from the Gumbel distribution. For binary Concrete variables, the sampling reduces to

Y_k = (log α_k + log(u_k/(1 − u_k)))/λ,   u_k ∼ U(0, 1),   B_k = σ(Y_k)

where σ is the sigmoid function and u_k is a sample from the uniform distribution U. To guarantee a lower bound on the ELBO, both the prior and posterior Bernoulli distributions need to be replaced by Concrete distributions; the KL divergence can then be calculated as the difference of the log densities of the two distributions. The log density of the binary Concrete distribution is given by

log q(B_k; α_k, λ) = log λ − λ Y_k + log α_k − 2 log(1 + exp(−λ Y_k + log α_k))

With all the reparameterization techniques discussed above, we approximate the ELBO via Monte Carlo sampling, with a sample size of 10 during training and 100 at test time.
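A sketch of the binary-Concrete sampler described above (function name is ours):

```python
import math
import random

def sample_binary_concrete(alpha, lam, u=None):
    """Binary Concrete relaxation of a Bernoulli with location alpha:
    Y = (log alpha + logit(u)) / lam, B = sigmoid(Y), u ~ Uniform(0, 1)."""
    if u is None:
        u = random.random()
    y = (math.log(alpha) + math.log(u / (1.0 - u))) / lam
    return 1.0 / (1.0 + math.exp(-y))
```

As the temperature λ approaches 0, the samples concentrate near {0, 1}, recovering near-binary masks; this is why the annealing schedule drives λ down to 0.25 during training.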

G IBP HYPERPARAMETER α

In this section, we discuss how to set the IBP prior hyperparameter α. We found that using a sufficiently large value of α without tuning performs reasonably well in practice. However, we also experimented with alternatives; for example, we tried adapting α with respect to the previous posterior as α = max(α, max(a_ν)) for each layer, where a_ν is the shape parameter of the Beta posterior. Several other considerations can also inform its choice.

G.1 SCHEDULING ACROSS TASKS

Intuitively, α should be incremented for every new task according to some schedule. Information about task relatedness can help in formulating this schedule: smaller increments of α discourage the creation of new nodes and encourage more sharing of existing connections across tasks.

G.2 LEARNING α

Although not investigated in this work, a viable alternative to choosing α by cross-validation is to learn it. This can be accommodated in our variational framework by imposing a Gamma prior on α and using a suitably parameterized Gamma variational posterior. The only difference in the objective would be in the KL terms: the KL divergence for v would then also have to be estimated by Monte Carlo approximation (because of the dependency on α in the prior). Also, since the Gamma distribution does not have a closed-form KL divergence, the Weibull distribution can be a suitable alternative (Zhang et al., 2018).

H ADDITIONAL RESULTS: SUPERVISED CONTINUAL LEARNING

In this section, we provide additional experimental results for the supervised continual learning setup. Table 2 shows the final mean accuracies (with deviations) over 5 tasks obtained by all approaches on the various datasets; our model performs comparably to or better than the baselines. We include several additional models in this comparison, namely HIBNN (Kessler et al., 2020), UCL (Ahn et al., 2019), HAT (Serrà et al., 2018), and A-GEM (Chaudhry et al., 2019). Note that coreset-based replay does not help much in our case. In VCL, the use of coresets helps because all parameters are forced to be shared, leading to catastrophic forgetting; our method suffers very little catastrophic forgetting, so coresets do not improve performance significantly. However, in cases where we do not grow the model dynamically and keep feeding it tasks after it has reached capacity (forcing it to share more parameters), forgetting will occur and the use of coresets may then help, as it does for VCL.

H.1 LEARNED NETWORK STRUCTURES

In this section, we analyse the network structures learned by our model. (The accuracies in Table 2 are over 10 runs, except for some with zero deviations over 1-2 runs; deviations are rounded to one decimal place, with very small deviations reported as 0.1.) The learned masks for the MNIST dataset (Fig. 6a) have high values for a small subset of weights and zeros elsewhere, showing that our model adapts to the data complexity and only uses those weights that are required for the task. Due to the IBP prior, the active weights tend to concentrate on the first few nodes of the first hidden layer. This supports our idea of using the IBP prior to learn the model structure based on data complexity. Similar behaviour can be seen for notMNIST and fashionMNIST in Fig. 6(b and c). Fig. 7 (left) shows the sharing of weights between subsequent tasks of different datasets: tasks that are similar at the input level of representation have more overlap/sharing of parameters (e.g., split MNIST) than tasks that are less similar (e.g., permuted MNIST). Fig. 7 (right) shows that the fraction of total network capacity used by our model differs per task, indicating that complex tasks require more parameters than easy ones. Since the network size is fixed, the cumulative network usage over all previous tasks tends towards 100 percent; this promotes parameter sharing but also introduces forgetting, since the network is forced to share parameters and cannot allocate new nodes.

H.2 ADDITIONAL PERMUTED MNIST RESULT

We ran our experiments with separate heads for each task of permuted MNIST. Some approaches use a single head for the permuted MNIST task and do not require task labels at test time. Here we compare some of the baselines that support a single head with our model (single head) on Permuted MNIST for 10 tasks. We also report the number of epochs and the average running time for a rough comparison of the time complexity of each model.

Network hidden layer sizes    Avg accuracy (5 tasks)
[200]                         98.180 ± 0.187
[100, 50]                     98.188 ± 0.163
[250, 100, 50]                98.096 ± 0.152

Table 5 shows that our approach is comparable to strong baselines such as HAT and VCL on complex tasks like cifar-10 and cifar-100 classification, suggesting that it generalizes to more complex task settings. For split cifar-100 (20 tasks), each task is a 5-class classification task; split cifar-10 consists of 2-class classification tasks.

H.4 OTHER METRICS

We quantified the forward and backward transfer of our model and VCL using the three metrics given in Díaz-Rodríguez et al. (2018) on the Permuted MNIST dataset. Let R_{i,j} denote the test classification accuracy of the model on task t_j after observing the last sample from task t_i, and let N be the number of tasks. ACCURACY is the overall model performance averaged over all task pairs:

Acc = Σ_{i≥j} R_{i,j} / (N(N+1)/2)

FORWARD TRANSFER is the ability of previously learnt tasks to improve performance on new tasks:

FWT = Σ_{i<j} R_{i,j} / (N(N−1)/2)

BACKWARD TRANSFER is the effect that newly learned tasks have on the performance of previous tasks:

BWT = Σ_{i=2}^{N} Σ_{j=1}^{i−1} (R_{i,j} − R_{j,j}) / (N(N−1)/2)

We compare our model with VCL and other baselines on these three metrics in Table 6. Backward transfer for our model is higher than for most baselines, showing that our approach also suffers less forgetting. On the other hand, forward transfer is close to random accuracy (0.1), because the model is asked to predict labels of a task it has not yet been trained on. This metric is therefore not very informative here; an alternative would be to train a linear classifier for a future task on the representations learned after each observed task.
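The three metrics can be computed directly from the matrix R of test accuracies. A sketch (0-indexed, R[i][j] = accuracy on task j after training on task i, assuming the pair-count normalizations of Díaz-Rodríguez et al. (2018)):

```python
def cl_metrics(R):
    """Return (ACC, FWT, BWT) from an N x N accuracy matrix R."""
    N = len(R)
    # accuracy: average over all pairs with i >= j (tasks already trained)
    acc = sum(R[i][j] for i in range(N) for j in range(i + 1)) / (N * (N + 1) / 2)
    # forward transfer: accuracy on not-yet-trained tasks (i < j)
    fwt = sum(R[i][j] for i in range(N) for j in range(i + 1, N)) / (N * (N - 1) / 2)
    # backward transfer: change on old tasks relative to just-trained accuracy
    bwt = sum(R[i][j] - R[j][j]
              for i in range(1, N) for j in range(i)) / (N * (N - 1) / 2)
    return acc, fwt, bwt
```

Negative BWT indicates forgetting; a BWT near zero, as reported for our model, indicates that old tasks are largely preserved.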

I UNSUPERVISED CONTINUAL LEARNING

Here we describe the complete generative model for our unsupervised continual learning approach. The generative story for the unsupervised setting is as follows (for brevity we omit the task id t):

B^l ∼ IBP(α),  V^l_{d,k} ∼ N(0, σ²_0),  W^l = B^l ⊙ V^l,  W^out_{d,k} ∼ N(0, σ²_0),  Z_n ∼ N(μ_z, σ²_z),  X_n ∼ Bernoulli(σ(W^out φ_NN(W, Z_n)))

where μ_z, σ²_z are the prior parameters of the latent representation (they can either be fixed or learned), and σ is the sigmoid function. The stick-breaking construction of the IBP prior remains the same here. For inference, we again resort to a structured mean-field assumption: q(Z, V, B, v) = q(Z | B, V, ν, X) q(V) q(B | v) q(v).

REPRESENTATION LEARNING. In the t-SNE plots, the latent space for the MNIST dataset is more clearly separated than for the notMNIST dataset, which can be attributed to the abundance of data and the lower variation in MNIST compared to notMNIST. We further analyzed the representations learned by our model by performing K-nearest-neighbour classification in the latent space. Table 7 shows the KNN test error of our model and several benchmarks on the MNIST and notMNIST datasets, for three different values of K. As shown in the table, the representations learned by the other baselines are not very useful (as evidenced by the large test errors), since their latent spaces are not shared among tasks, whereas our model uses a shared latent space (modulated for each task by the learned task-specific mask), which results in effective latent representation learning.
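A minimal sketch of the masked generative forward pass X ∼ Bernoulli(σ(W^out φ(W z))) for a single hidden layer with W = B ⊙ V, on toy values; taking φ as a ReLU is our assumption, since the text does not name the nonlinearity:

```python
import math

def masked_decoder_probs(z, B, V, W_out):
    """p(X_d = 1) = sigmoid( sum_k W_out[d][k] * relu( sum_j (B*V)[k][j] * z[j] ) )."""
    hidden = []
    for Brow, Vrow in zip(B, V):
        pre = sum(b * v * zj for b, v, zj in zip(Brow, Vrow, z))
        hidden.append(max(0.0, pre))  # ReLU nonlinearity (assumed)
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in W_out]
```

Hidden units whose mask row in B is all zeros contribute nothing, which is exactly how a task-specific mask modulates the shared weights V.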

J TASK AGNOSTIC SETTING

We extended our unsupervised continual learning model to a generative mixture model, where each mixture component is treated as a task distribution, i.e., p(X) = Σ_{k=1}^{K} p(X | t = k) p(t = k), with t representing the task identity. Here, p(t = k) could be assumed uniform, but that fails to account for the degree to which each mixture component is used. Therefore, we keep a count of the number of instances belonging to each task and use that as the prior.

SELECTIVE TRAINING. Note that training this mixture model requires all task-specific variational parameters to be present at every time step, unlike the earlier settings, where we only need to store the masks and can discard the variational parameters of previously seen tasks. This would cause storage problems, since the number of parameters would grow linearly with the number of tasks. To overcome this issue, we fix the task-specific mask parameters and prior parameters before the network is trained on new task instances. After the task-specific parameters have been fixed, the arrival of data belonging to a previously seen task t_prev is handled by training the network parameters with the task-specific masks B_prev.

REPRESENTATION LEARNING. It makes more sense to learn representations when no target class labels or task labels are available. As discussed, we trained our model using a gating mechanism with a threshold value of −130.
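The count-based task prior described above can be sketched as follows (names are ours):

```python
def task_prior(counts):
    """p(t = k) proportional to the number of instances assigned to task k,
    instead of a uniform prior over mixture components."""
    total = sum(counts)
    return [c / total for c in counts]
```

With counts of, say, 30 and 10 instances for two tasks, the prior becomes [0.75, 0.25], weighting the more frequently used component accordingly.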



The code for our model can be found at this link: https://github.com/npbcl/icml20



Figure 1: (a) Evolution of the network structure over 3 consecutive tasks. The weights used by each task are shown in its colour; note that structure can overlap between tasks. (b) Schematic of our model: θ_S are parameters shared across all tasks, θ_M are task-specific mask parameters, and θ_H are the separate last-layer head parameters. In our exposition, we collectively denote these parameters by W = B ⊙ V, with the masks being B and the other parameters being V.

Figure 2: Mean test accuracies of tasks seen so far as newer tasks are observed on multiple benchmarks

Figure 3: Test accuracy on the first task (left) and network capacity used (middle) as new tasks are observed, and percentage of network connections shared among different tasks in the first hidden layer (right)

Figure 4: Sequential generation for the MNIST (left) and notMNIST (right) datasets. The i-th row is generated after observing the i-th task (the Appendix contains more illustrations and zoomed-in versions)

Figure 5: Avg. log-likelihoods (left) for sequential generation (MNIST), confusion matrix (right) representing test samples mapped to each generative task learned in TA (task-agnostic) setting

Figure 6: Learned masks for the input-to-first-hidden-layer weights on split MNIST (left), notMNIST (middle) and fashionMNIST (right). Darker colors represent active weights.

Figure 7: Percentage of weights shared between tasks (left); percentage of network capacity already used by previous tasks (right).

Figure 8: On MNIST dataset (left) Reconstruction of images after all tasks have been observed. (right) t-SNE plot of each class after all tasks have been observed.

Figure 11: Generated Samples

Figure 12: t-SNE plot of latent space of VCL model on notMNIST (left) and MNIST (right) datasets

Figure 13: Reconstructed MNIST samples and t-SNE plots for our task-agnostic setting

Fig. 13 qualitatively shows the t-SNE plots and reconstructions for each class. Based on these results, we conclude that the task boundaries are well understood and separated by our model.

MNIST K-NN test error rates obtained in the latent space for both the task-agnostic and known-task settings.

Proposed in Nguyen et al. (2018) as a method for cleverly sidestepping the issue of catastrophic forgetting, the coreset comprises representative training samples from all tasks. Let M^(t-1) denote the posterior state of the model before learning task t. When the t-th task arrives with data D_t, a coreset C_t is created comprising the choicest examples from tasks 1...t. Using data D_t \ C_t and prior M^(t-1), the new model posterior M^t is learnt. For predictive purposes at this stage (the test data comes from tasks 1...t), a new posterior M^t_pred is learnt with M^t as prior and data C_t. Note that M^t_pred is used only for predictions at this stage, and plays no role in the subsequent learning of, say, M^(t+1).

E.1 SEGREGATING THE HEAD

It has been shown in prior work on supervised continual learning (Zeno et al., 2018) that using separate last layers (commonly referred to as "heads") for different tasks dramatically improves performance.

Algorithm 1 (Nonparametric Bayesian CL). Input: initial prior p_0(Θ). Initialize the network parameters and coresets, and set p_new ← p_0(Θ); then, for each task t = 1, ..., T, execute the per-task steps given in Appendix D.

Table 2: Comparison of final mean accuracies on the test set obtained using different methods over 10 runs

To justify the choice of a single hidden layer for the MNIST-like experiments, we compare our model on the Permuted MNIST experiment across multiple network depths, with separate heads. From Table 3 we conclude that a single hidden layer is sufficient for obtaining good results. Further, to analyse the performance decrease and the generality of the approach as the number of tasks grows, we ran the Permuted MNIST experiment with separate heads and a single hidden layer of 200 units for different numbers of tasks; Table 4 shows the results.

Table 3: Comparing performance on Permuted MNIST under different network configurations. Table 4 shows that the model is quite stable and that performance does not drop much, even with a large number of tasks for a fixed model size.

Table 4: Comparison of model performance over different numbers of tasks for the Permuted MNIST experiment

H.3 ADDITIONAL CIFAR RESULTS

MNIST experiments are comparatively easy to model, and an approach may not generalize to more complex datasets such as natural images or text. This section includes additional results on the cifar-10 and cifar-100 datasets, with comparisons to some very strong baselines, to observe performance in more complex settings.

Table 5: Avg. accuracy after all tasks have been observed on the cifar-10 and cifar-100 datasets

Table 6: Comparison on other metrics for the Permuted MNIST dataset

Table 7: Unsupervised learning benchmark comparison with sampled latents: K-nearest-neighbour errors for each baseline, for multiple values of K

