CONTINUAL UNSUPERVISED DISENTANGLING OF SELF-ORGANIZING REPRESENTATIONS

Abstract

Limited progress has been made in continual unsupervised learning of representations, especially in reusing, expanding, and continually disentangling learned semantic factors across data environments. We argue that this is because existing approaches treat continually-arrived data independently, without considering how they are related based on their underlying semantic factors. We address this by a new generative model describing a topologically-connected mixture of spikeand-slab distributions in the latent space, learned end-to-end in a continual fashion via principled variational inference. The learned mixture automatically discovers the active semantic factors underlying each data environment, and to accordingly accumulate their relational structure. This distilled knowledge can further be used for generative replay and guiding continual disentangling of sequentially-arrived semantic factors. We tested the presented method on a split version of 3DShapes to provide the quantitative disentanglement evaluation of continually learned representations, and further demonstrated its ability to continually disentangle new representations and improve shared downstream tasks in benchmark datasets.

1. INTRODUCTION

The progress in continual learning has been mostly made for supervised discriminative learning, whereas continual unsupervised representation learning remains relatively under-explored (Ramapuram et al., 2020; Achille et al., 2018; Rao et al., 2019) . The few existing works have primarily focused on battling catastrophic forgetting in the generative performance of a model: for instance, a common approach known as generative-replay synthesizes past samples using a snapshot of the generative model trained from past data, and then continually trains the model to generate both new data and synthesized past samples (Achille et al., 2018; Rao et al., 2019; Ramapuram et al., 2020) . There is however another important yet under-explored question in continual unsupervised representation learning: how to reuse, expand, and continually disentangle latent semantic factors across different data environments? These are inherent in the human learning process: while learning from new data (e.g., learning cars after bicycles), we are naturally able to reuse shared semantic factors without re-learning (e.g., wheels), expand and disentangle new semantic factors (e.g., the shape of cars), while accumulating knowledge about the relationship among data environments based on these semantic factors (e.g., bicycles and cars both have wheels but are different in shapes). Disentangled representation learning, as a long-standing research topic, has demonstrated various benefits in generative modeling and downstream tasks (Higgins et al., 2017; Kumar et al., 2017; Kim & Mnih, 2018; Liu et al., 2021; Rhodes & Lee, 2021; Horan et al., 2021) . With increasing recent interests in unsupervised representation learning in a continual learning setting (Rao et al., 2019; Madaan et al., 2021) , it is important to investigate the challenges and solutions to achieve disentanglement of sequentially-arrived semantic factors in streaming data. Reusing latent dimensions for learned semantic factors has mainly been attempted by a teacher-student like approach where the student model is taught to infer and generate similarly to a snapshot of the past models (teacher) on replayed data (Achille et al., 2018; Ramapuram et al., 2020) . In Achille et al. (2018) , this is further facilitated by explicitly masking out latent dimensions that are not actively used in a data environment. Such masks however have to be heuristically defined before training on Expanding learned semantic factors, in part, results naturally from continually optimizing a generative model to newly arrived data (Achille et al., 2018; Ramapuram et al., 2020) , or even progressively increasing the model's latent capacity on the same data (Burgess et al., 2017; Li et al., 2020) . These alone however would not uncover the relational structure among data environments in the latent space. In Rao et al. (2019) , a mixture of Gaussians has been used and continually expanded such that newly arrived data are clustered to an existing or new mixture component. While this captures the expanding data distributions, it does not consider the reuse of semantic factors among data clusters. Continually disentangling semantic factors, until now, is limited to the native disentangling ability inherent in VAE, or promoting the reusing of shared semantic factors (Achille et al., 2018; Ramapuram et al., 2020) . While the common strategy of generative replay teaches a model what latent dimensions to use for shared semantic factors on the replayed data (Achille et al., 2018; Ramapuram et al., 2020) , no such guidance is available on new data. As a result, as we will show, none of the existing approaches can prevent newly-learned semantic factors to be entangled with re-used ones. In this work, we show that the above limitations boil down to a fundamental bottleneck in existing continual unsupervised learning of representations: that the learner is asked to treat continually-arrived data independently, without knowing how they are related based on the underlying semantic factors. To overcome this, we argue that the model needs to learn two critical knowledge: latent dimensions explaining active semantic factors underlying each data environment, and the relationship among the latter based on the former. We present Continual Unsupervised Disentangling of self-Organizing representations (CUDOS) that is able to accumulate the relational structure of continually-arrived data based on their underlying active semantic factors, and exploiting this knowledge to guide disentangling of sequentially-arrived semantic factors. As illustrated in Fig. 1 , to accumulate the relational structure of the data, we model the latent representations with a topologically-connected mixture of distributions via Bayesian self-organizing maps (SOM) (Kohonen, 1990; Yin & Allinson, 2001) . To automatically discover active semantic factors underlying each data environment, we model each component of the SOM mixture with a spike-and-slab distribution (Titsias & Lazaro-Gredilla, 2011; Tonolini et al., 2020) , such that the sparse spike variable identifies latent dimensions explaining active semantic factors. This results in a generative model with a self-organizing mixture of slaband-spike distributions, where the distilled knowledge -the relational structure of data environments and their associated active semantic factors -supports 1) mixture-based generative replay and 2) continual disentangling of sequentially-arrived semantic factors. We evaluated CUDOS on both benchmark datasets for continual representation learning, and a split version of 3DShapes (Burgess & Kim, 2018) designed for quantitative evaluation of disentangling sequentially-arrived semantic factors. In comparison to existing works, we showed that CUDOS not only addressed catastrophic forgetting, but also improved -both quantitatively and qualitativelycontinual disentanglement of latent semantic factors and thereby downstream discriminative tasks. Deep Learning with SOM:The use of SOM has been explored within deep learning of image classification (Liu et al., 2015) , image clustering (Manduchi et al., 2019; Forest et al., 2000) , and time-series prediction (Jin et al., 2015; Fortuin et al., 2019) . None of these works, however, considered continual learning of representations. In the context of continual learning, SOM was used to first learn the relationship among all discriminative tasks, and then used to mask the neurons of a fullyconnected layer for each task (Bashivan et al., 2018) . A dual-memory self-organizing architecture was also presented for learning object instances and categories in life-long object recognition (Parisi et al., 2018) . While both works shared our motivation to leverage the ability of SOM to maintain and expand a memory of data distributions across environments, neither considered continual unsupervised representation learning, or utilized SOM as a topologically-connected Bayesian mixture model. Inferring Active Semantic Factors in VAE: Learning meaningful representations is a vital task for VAE, where large latent space often leads to latent dimensions that carry little information (Chen et al., 2017) . Sparse coding and discrete latent space have proved to be elegant solutions (Oord et al., 2017; Tonolini et al., 2020) . Non-parametric discrete densities have been explored for stochastic latent activation to improve disentanglement (Gyawali et al., 2019) . A discrete latent space based on vector quantization was shown to solve posterior collapse (Oord et al., 2017) , whereas sparsity was directly modeled in a continuous latent space using spike-and-slab priors (Tonolini et al., 2020) . None of these concepts has been extended to continual unsupervised learning of representations. Continual Unsupervised Representation Learning: Most existing works in continual unsupervised representation learning relied on enhanced generative replay to combat catastrophic forgetting of generation using learned semantic factors (Achille et al., 2018; Ramapuram et al., 2020; Ye & Bors, 2020) . The two most related works are those presented in Achille et al. (2018) and Rao et al. (2019) . In Achille et al. (2018) , to reuse latent dimensions explaining semantic factors shared with past data, a mask for active latent dimensions specific to each data environment was used. This mask was heuristically determined before training on the new data, which will affect the learning and sharing of latent dimensions on the new data. Furthermore, while this strategy protects latent dimensions specific to past data environments (not shared and thus turned off), it does not prevent new semantic factors from being entangled with the shared dimensions. CUDOS presents fundamental solutions to these problems by 1) principled variational inference of active latent dimensions leveraging slab-and-spike priors, and 2) guiding continual disentangling of sequentially-arrived semantic factors by exploiting the relational structure of data. In Rao et al. (2019) , a mixture of Gaussians was continually expanded as new data arrive. This shares our intuition that continual learning of representations can benefit from accumulating an evolving summary of data in the latent space. How the different data clusters are related in terms of shared and distinct semantic factors, however, was not exploited in Rao et al. (2019) . In contrast, CUDOS exploits the relationship among data clusters based on their underlying active semantic factors, and uses that to facilitate continual reuse and disentangling of sequentially-arrived semantic factors.

3. METHODS

We first establish the foundation for CUDOS: a VAE with a self-organizing mixture of spike-and-slab priors to learn the relational structure of data based on their active semantic factors (Section 3.1). We then describe how to use this data summary for generative replay (Section 3.2.2), and use the relation between new and past data to improve continual disentangling of representations (Section 3.2.3).

3.1. LEARNING THE RELATIONAL STRUCTURE OF DATA VIA ACTIVE SEMANTIC FACTORS

Fig. 1 outlines the foundation model underlying CUDOS. Given data x, we are interested in learning representations of meaningful semantic factors within a latent vector z ∈ R J . While doing so we encourage sparse coding to discover latent dimensions explaining active semantic factors in VAE (Tonolini et al., 2020) , while learning the relational structure of data based on these semantic factors. Generative Model: To accumulate the relational structure of data, we model the latent space by a mixture of distributions using Bayesian-SOM (Yin & Allinson, 2001) with K nodes. Additionally, to discover the active semantic factors for each data environment, we model each mixture component as a spike-and-slab distribution that encourages sparsity in the latent dimensions (Tonolini et al., 2020) . This gives rise to the following generative process: w ∼ Cat(π), p ψ k (z|w k = 1) = J j=1 [α j k N (z j ; µ j k , (σ j k ) 2 ) + (1 -α j k )δ(z j )], p ψ (z) = K k=1 p(z|w k )p(w k ), x ∼ p θ (x|z), p θ,ψ (x, z, w) = p θ (x|z)p ψ (z|w)p(w). ( ) where * j denotes the j-th element of * , and δ( * ) Dirac delta function centered at zero. The mixing prior p(w) is parameterized by π, π i ≥ 0 and π i = 1. The latent variable z from the k-th mixture component is parametrized by ψ k = {µ k , σ 2 k , α k , π k }, where parameter α k introduces sparsity to mask out inactive dimensions. Data x are generated from z via a neural network parametrized by θ. Inference Model: We define variational approximations of the posterior density p(z, w|x) as: q(z, w|x) = q ϕ (z|x)p ψ (w|z), q ϕ (z|x) = J j=1 [α j N (z j ; μj , (σ j ) 2 ) + (1 -αj )δ(z j )], p ψ (w k = 1|z) = p(z|w k , ψ k )p(w k = 1) K k ′ =1 p(z|w k ′ , ψ k ′ )p(w k ′ = 1) . ( ) where p ψ (w k = 1|z) is the posterior probability of the k-th mixture component. For parameters μ and log σ2 of the spike-and-slab distribution, their inference is amortized as the output of an encoding network parameterized by ϕ. For parameter α, considering that similar data should share latent dimensions for common semantic factors, we infer it at a set level as inspired by the neural statistician (Edwards & Storkey, 2017; Hewitt et al., 2018) . We discuss how to determine the set in Section 3.2.1. Variational Inference: The parameters ψ, θ, and ϕ are optimized by the ELBO loss as: log p(x) ≥ L ELBO = E q ϕ (z,w|x) [log p θ,ψ (z, x, w) q ϕ (z, w|x) ] = E q ϕ (z,w|x) [log p θ (x|z)p ψ (z|w)p(w) q ϕ (z|x)p ψ (w|z) ] = E q ϕ (z|x) [log p θ (x|z)] -E p ψ (w|z) [D KL (q ϕ (z|x)||p ψ (z|w))] -E q ϕ (z|x) [D KL (p ψ (w|z)||p(w))]. The first reconstruction term is similar to that in the vanilla VAE. The second term measures the KL-divergence between q ϕ (z|x) and its conditional prior, measured over the posterior distribution of the SOM mixture p ψ (w|z). Specifically, we estimate the expectation over p ψ (w|z) as: E p ψ (w|z) [D KL (q ϕ (z|x)||p ψ (z|w))] = K k=1 p ψ (w k = 1|z)D KL [q ϕ (z|x)||p ψ (z|w k = 1)], where p ψ (w k |z) can be computed in a batch during forward propagation, and D KL [q ϕ (z|x)||p ψ (z|w k = 1) ] can be derived following (Tonolini et al., 2020) as: J j [α j (log σ j k σj + (μ j -µ j k ) 2 + (σ j ) 2 2(σ j k ) 2 - 1 2 ) + (1 -αj ) log 1 -αj 1 -α j k + αj log αj α j k ] This KL loss encourages q ϕ (z|x) to follow a mixture density with the mixing probability determined by the posterior probability of each component given z. Note that the latter automatically considers sharing of semantic factors between the inferred z and each mixture component k (via spike variable α and α k ). The third term in Eqn. (3) measures the KL-divergence between the posterior density of w and its prior (set to be uniform in this work). The expectation is estimated by Monte Carlo samples. Iterative Optimization: We maximize the ELBO loss as defined in Eqn. (3) by iterative optimization. In each iteration, we first fix ψ of the SOM mixture and maximize Eqn. (3) with respect to the VAE's parameters θ and ϕ by stochastic gradient descent with reparameterization trick (Kingma & Welling, 2014; Tonolini et al., 2020) : at the first iteration, the SOM-mixture is initialized as a uniform mixture of ψ = {0, I, 0.2} and the optimization becomes a standard ELBO with spike-and-slab priors. With the updated θ and ϕ, we then maximize Eqn. (3) with respect to the SOM-mixture parameter ψ which, as derived in Appendix-I, amounts to maximizing the expectation of the log-likelihood of p ψ (z) over the variational posterior distribution of q ϕ (z|x): ψ * = argmax ψ E q ϕ (z|x) [log p ψ (z)] We follow the theory in Gepperth & Pfülb (2021) to optimize ψ using stochastic gradient descent.

3.2. CONTINUAL LEARNING WITH CUDOS

We now consider a setting of continual unsupervised learning where the label of the underlying data environments is unknown. Following existing approaches (Achille et al., 2018; Rao et al., 2019; Ramapuram et al., 2020) , we maintain a snapshot of the model parameters [ψ old , θ old , ϕ old ] every τ training steps. These model snapshots are used to guide (1) synthesizing replayed samples to teach the model to perform consistently on past data (Section 3.2.2), and (2) continually disentangling sequentially-arrived semantic factors using the relationship between past and new data (3.2.3).

3.2.1. INFERRING α FOR STREAMING DATA

As mentioned in Section 3.1, we choose to infer α at a set level to leverage shared information underpinning a set of data. If the boundary between data environments is known, α can be shared for all data within the same data environment. Alternatively, if the boundary of data environments is not known, we assume α to be shared within each mini-batch x batch . Specifically, we maintain ϕ α = { α1 , α2 , • • • , αd } for d number of sets with distinct underlying semantic factors. Given a new batch of x batch , we first determine if it can be described by an existing α, or a new αd+1 has to be allocated if the existing ϕ α fails to describe x batch well: αnew = α, if E p(z| μ, σ2 , α) [L r ] ≤ T α , αd+1 , otherwise, where α = argmin α∈ϕα E p(z| μ, σ2 , α) [L r ], T α is a threshold for the reconstruction error L r of x batch averaging over pixels. Note that Equation ( 5) only determines how a data batch x batch is associated with a variable α. The values of these variables are optimized during variational inference.

3.2.2. GENERATIVE REPLAY WITH CUDOS

To combat forgetting, we synthesize data samples following the generation process as defined in Eqn. (1), using the snapshot of the SOM mixture with parameters ψ old . As illustrated in Fig. 2 (a), on the synthesized samples xold and their corresponding latent samples z p , we encourage the model to be consistent with its past snapshot (Ramapuram et al., 2020; Achille et al., 2018; Lezama, 2019) : L old = L c [S(p θ (x|z p )), xold ] + L c [S(q ϕ (z|x old )), z p ], where L c is mean squared error and S(•) is a sampling process. This strategy combines the two key existing concepts in replay mechanism: as in generative replay (Shin et al., 2017; Ramapuram et al., 2020) , data are synthesized quickly and readily with limited burden on storage; as in core-set methods (Nguyen et al., 2018; Borsos et al., 2020) , the SOM-mixture provides a representative summary of data ensuring that more important mixture components are more frequently re-used in future training.

3.2.3. GUIDING CONTINUAL DISENTANGLEMENT

Generative replay lacks mechanisms to teach the model how to disentangle newly-arrived semantic factors from latent dimensions already used for previously-learned semantic factors. CUDOS provides a unique opportunity to address this issue by its ability to describe the relation between new and past data. For a new data x, its shared latent dimensions with SOM summary of past data (parameterized by ψ old ) is determined by a mask s ψold computed by a scaled and displaced Sigmoid function: s j ψ old = Sigmoid(b(α j • α j vBMU,ψold -0.5)) (7) where α is the spike variable inferred from x and α vBMU,ψold the spike variable of the best-matching unit v BMU (the component with the largest posterior) on the snapshot of the past SOM. b scales and sharpens the Sigmoid function towards a gated function (Tonolini et al., 2020) . Note that neither s ψold or v BMU is fixed; they are functions of the unknown spike variables α and α vBMU,ψold . We now use this to (1) maintain consistency on latent dimensions for shared semantic factors (if any), and (2) prevent entangling new semantic factors into the shared latent dimensions. Reusing Shared Semantic Factors: To teach the model to reuse latent dimensions corresponding to previously-learned semantic factors on new data x, we ask the VAE encoder q ϕ to be consistent with its past snapshot q ϕold in the shared dimensions when inferring from the new data: L newz = L c [S(q ϕ (z|x)) ⊙ s ψ old , S(q ϕold (z|x)) ⊙ s ψ old ], where ⊙ is element-wise multiplication. Disentangling New Semantic Factors: Now since the past model does not necessarily know how to generate new data x, how does it teach the new model? Intuitively, given the relation between x and the past data as determined in Eqn. ( 7), even though the past model does not know how to generate the new x, it would know how to generate from those shared semantic dimensions in x. Therefore, as illustrated in Fig. 2 (b), if we take a sample z vBMU,ψold from the mixture component v BMU , and replace its latent values with those inferred from x at the shared latent dimensions, the model should know how to generate from this combined latent vector z com in a way consistent to the past generator: z vBMU,ψold = S(p ψold (z|v BMU )), z com = z vBMU,ψold ⊙ (1 -s ψ old ) + S(q ϕ (z|x)) ⊙ s ψ old , L newx = L c [S(p θold (x|z com )), S(p θ (x|z com ))]. Intuitively, if the model entangles new semantic factors into the shared dimensions, it will be penalized as the past decoder does not know how to generate from these new factors. As we will demonstrate, this constraint -uniquely made possible by CUDOS -is critical in continual disentanglement. Summary: In summary, the overall loss for CUDOS is: L = L ELBO + γ 1 L old + γ 2 L newz + γ 3 L newx , where γ 1 , γ 2 , and γ 3 are weighting hyperparameters. This extends the foundation objective function to continual learning settings, which promotes the sharing of shared semantic factors between new and past data environments, and the disentangling of new semantic factors.

4. EXPERIMENTS AND RESULTS

We evaluated CUDOS on (1) a split version of 3DShapes (Burgess & Kim, 2018) for quantitative evaluation of continual disentanglement, (2) MNIST(LeCun et al., 1998) , Fashion-MNIST (Xiao et al., 2017) , and their moving versions in (Achille et al., 2018) , and (3) split-CelebA (Liu et al., 2015) .

Split-3DShapes:

We quantitatively evaluated the continual disentanglement of past and new representations in a split version of 3Dshapes with two sub-sets: The first one only had red floor and wall, and the second added all floor colors except red. A successful continual learning of representations is expected to continually learn the new factor of floor color while reusing the others learned in the first set. We compared CUDOS to four groups of baselines: (1) naive VAE (Kingma & Welling, 2014), naive TC-VAE (Chen et al., 2018) , and VAE with gradually increased capacity (Burgess et al., 2017) without an explicit mechanism to combat catastrophic forgetting, (2) unsupervised continual learning reliant on generative replay (Achille et al., 2018; Ramapuram et al., 2020) and heuristically-defined masks of active latent dimensions (Achille et al., 2018) , (3) unsupervised continual learning using a mixture of Gaussian in the latent space (Rao et al., 2019) Table 1 : Quantitative metrics for disentanglement. MIG: mutual information gap (Chen et al., 2018) . MIG-sup: supplement mutual information gap (Li et al., 2020) . I(z past ; factors new ): mutual information between already active dimensions and the new factors. ∆ reconstruction loss: the relative change for reconstructing past data. Choosing suitable disentanglement metrics is vital as different metrics may measure different aspects of the disentanglement (Locatello et al., 2020; Zaidi et al., 2022) . Here we consider two types of metrics. First, to form a complete measurement of the one-to-one relationship between latent dimensions and semantic factors, we chose MIG-sup (Li et al., 2020) that penalizes learning multiple semantic factors into the same dimensions in combination with the complementary MIG (Chen et al., 2018) . Second, to focus on the disentanglement of sequentially-arrived semantic factors, we compute the mutual information I(z past ; factors new ) between active dimensions for past data and new factors in the new data. Ideally, if a model manages to disentangle new semantic factors into dimensions not used by previous data, I(z past ; factors new ) should be close to 0. Finally, we also compute the relative change of reconstruction loss for past data to measure forgetting. As shown in Fig. 3 (a), major semantic factors such as object shape, size, and color were learned from the first subset. When a new data stream is introduced (shown in Fig. 3 (b)), while all models were able to reuse most of the latent dimensions corresponding to previously-learned semantic factors, all baseline models entangled new semantic factors -the floor color -with these dimensions. In contrast, CUDOS was able to not only reuse shared latent dimensions, but also disentangle new ones into previously inactive dimensions. Note that we include results of Ramapuram et al. (2020) in Fig. 3 as an example of the baseline models that have constraints on replayed data, annotated by L old . Visual traversing results of other comparison models can be found in Appendix.D. Additional results on different sequences of split-3DShapes can be found in Appendix I. Quantitatively, as summarized in sup. This carefully designed experiment provided clear evidence that CUDOS is able to continually disentangle new semantic factors without entangling them with shared ones learned from the past. It also showed that this cannot be achieved by naively using generative replay to extend existing disentangling or prototype-based VAE (e.g., TC-VAE and VQ-VAE) into continual versions. More discussion of differences between static and continual disentanglement can be found in Appendix J. 

Moving MNIST & Fashion-MNIST:

We then tested CUDOS on sequences of Fashion-MNIST, MNIST, and moving versions of them similar to that presented in Achille et al. (2018) . Fig. 5 presented results for one sequence: Moving Fashion-MNIST → Moving MNIST → Fashion-MNIST. As shown, CUDOS was able to disentangle new semantic factors (green boxes), while re-using those learned from the past (red boxes). For instance, the positional semantic factors learned from Moving Fashion-MNIST were reused while learning Moving MNIST. CelebA: We further designed experiments for continually learning over split versions of CelebA data. Fig. 6 provide examples of results obtained by two different splits: by age (top), and by bangs (bottom). Traversing results on selected dimensions showed that CUDOS was able to reuse latent dimensions for previous semantic factors while learning new ones related to the new attributes.

Benefits on downstream tasks:

To evaluate the usefulness of the disentangled representations learnt by CUDOS, we focused on shared tasks related to the continually-learned semantic factors shared among past and new data environments, including predicting the scale and orientation of 3DShapes, and predicting the X-and Y-positions of Moving-MNIST. For each task, we identified the corresponding active latent dimensions after learning the first data environment, and continually trained linear regressors to predict the ground-truth labels using these active latent dimensions: the 2 summarizes the R 2 scores of each task obtained on new data along with their changes ∆ from the R 2 scores obtained on old data prior to continual learning. As shown, CUDOS achieved best final R 2 score for scale and orientation regression, along with minimum drop of performance over the course of continual learning. For Moving-MNIST, CUDOS and baselines achieved similar results on X-Y position regression, with in general less significant performance drop in comparison to 3DShapes. We argue that this is because X-Y position regression on a clean background is an easier task, than scale and orientation regression on a more complex data environment like 3DShapes. Additionally, Table 4 (top) shows digit-classification performance on Moving-MNIST using active and inactive dimensions, which suggested that active semantic factors for the digit data environments are learned properly. Ablation study: Table . 4 (bottom) presents a detailed ablation study on the contribution brought by the different ingredients within CUDOS. We did not include results from VAE + L old because they are represented by the work of Ramapuram et al. (2020) as reported in Table . 1. As shown, the sparsity introduced by spike-and-slab distribution plays a significant role in improving the disentanglement ability of CUDOS. While SOM alone does not appear to improve the general disentangling ability of the model, it does seem to help disentangle new semantic factors from previously learned ones; more importantly, it is a necessary component for learning the relational structure of data to guide disentanglement, i.e., for enabling L newx in Equation (10). Indeed, the combined introduction of SOM and L newx brings significant improvement in the ability of CUDOS to disentangle new semantic factors from previously learned ones, as it is evident in the improvement achieved in MIG-sup and I(z past ; factors new ). More implementation details can be found in Appendix.C.

Conclusion:

In this paper, we demonstrated that an overlooked key ingredient to continual unsupervised learning of representations is to exploit the relational structure of data based on their underlying active semantic factors. We presented CUDOS, a novel VAE with self-organizing spike-and-slab mixtures, to address this. A limitation of the present form of CUDOS is that continual disentangling partly depends on the teacher-student strategy, leaving the snapshot timing a hyper-parameter. We will investigate sequential Bayesian inference as a potential solution to this problem as a future work.

A OPTIMIZING ψ FROM ELBO

To make the partial derivative of ψ clearer, we first rewrote the last two terms of ELBO in Eqn. (3) as: -E p ψ (w|z) [D KL (q ϕ (z|x)||p ψ (z|w))] -E q ϕ (z|x) [D KL (p ψ (w|z)||p(w))] = p ψ (w|z) p ψ (w|z) q ϕ (z|x)log p ψ (z|w) q ϕ (z|x) dzdw + E q ϕ (z|x) p ψ (w|z)log p(w) p ψ (w|z) dw = q ϕ (z|x) q ϕ (z|x) p ψ (w|z)logp ψ (z|w)dzdwp ψ (w|z) p ψ (w|z) q ϕ (z|x)logq ϕ (z|x)dzdw + E q ϕ (z|x) p ψ (w|z)log p(w) p ψ (w|z) dw = E q ϕ (z|x) p ψ (w|z)log p ψ (z|w)p(w) p ψ (w|z) dw -E p ψ (w|z),q ϕ (z|x) [logq ϕ (z|x)] = E q ϕ (z|x) [logp ψ (z)] -E p ψ (w|z),q ϕ (z|x) [logq ϕ (z|x)] ) Therefore the optimized ψ * can be solved by: Regarding weights in Eqn 11, in our experiments, generally, we are trying to avoid certain parts of the loss function becoming too large or small, and we found a rule-of-thumb weight value described above across different datasets. The most tricky part is the weighting for the constraints of old data and new data. We found a slightly higher weight on new data can achieve better continual disentanglement results. We reason that the constraint on new factors is more difficult compared with reconstructing old data, and therefore emphasizing more on that (higher weight) can be helpful. ψ * = argmax ψ ELBO = argmax ψ E q ϕ (z|x) [logp ψ (z)] In general, we didn't find obvious differences within around 30% percent changes of each weight. Putting them too high (larger than the ELBO term) will affect the original continual learning of new data. Metrics in Table . 1 are calculated in a setting where the boundary of data environments is known. Network-architecture: Code is available at https://github.com/Zhiyuan1991/CUDOS_release.

D TRAVERSING RESULTS ON SPLIT-3DSHAPES

Here we presented additional traversing results for baseline methods, along with their I(z past ; factors new ) value. ) is able to re-use latent dimensions corresponding to shared factors, and 3) prevents new ones to be entangled into the shared ones. None of these challenges are addressed by state-ofthe-art (SOTA) disentangling VAEs such as FactorVAE (Kim & Mnih, 2018) and TC-VAE (Chen et al., 2018) . The continual setting sees significant challenges such as forgetting (including forgetting learned semantic factors) and, as in other continual learning problems, cannot be expected to see similar disentanglement performance to those reported on static settings. That's why the non-continual baselines in Table 1 , including naive VAE, naive TC-VAE, and Burgess et al. (2017) , all showed weaker performance than what would have been expected in a static learning setting. Furthermore, because these models are not designed for continual learning, they will also face challenges such as catastrophic forgetting. We can take a closer look at this by taking naive TC-VAE versus continual TC-VAE (with generative replay mechanism) as an example. As shown in Table 1 , naive TC-VAE achieved overall better MIG and MIG-sup scores because they can forget about previous data and focus on disentangling the new data (in split 3DShapes, the two sequentially-presented datasets share many latent factors but one). Therefore, forgetting previous semantic factors did not create a large performance drop in MIG and MIG-sup scores. This catastrophic forgetting however is reflected in the reconstruction loss, which increased from 2365 to 3207 (around 35% change) on the previous data. In addition, the I(z past ; factors new ) metric further shows that naive TC-VAE is not able to separate new semantic factors from previously-learned latent dimensions, but rather achieved relatively high disentangling by simply forgetting previous factors and learning on the new data itself. This is similar to naive VAE's performance. By extending TC-VAE to a continual learning setting with generative replay, the continual TC-VAE was able to remember how to reconstruct the previous data, where the reconstruction loss only increased from 2376 to 2395 (0.7% change). However, its disentanglement performance including MIG and MIG-sup dropped. This demonstrates the aforementioned challenges of disentangling sequentially-arrived semantic factors (while remembering the previous factors at the same time) that are fundamentally different from disentangling all seen factors at once. It also shows that such challenges cannot be addressed by either naive SOTA disentangling VAEs, or simply extending these SOTA disentangling VAEs into a continual setting (via standard techniques such as generative replay). We believe these provide further evidence for the contribution of the presented work.



Figure 1: Through a self-organizing spike-and-slab mixture, CUDOS continually distills knowledge about the relational structure of data environments with their shared and distinct semantic factors.

Figure 2: (a) Combating catastrophic forgetting by generative replays from SOM-mixture, and (b) Continually disentangling by using the relation between new and past data based on their underlying shared semantic dimensions.

Figure 3: Continually learning a split version of 3DShapes where the variations of floor colors were absent initially and appeared later. Black bounding boxes annotate in which latent dimensions the new semantic factors are learnt. (a): Traversing each latent dimension after training on data with no floor color variations. (b): Traversing each dimension after training on the data with floor color variations, where comparison of CUDOS with baseline methods shows the improved ability to disentangle new semantic factors into previously inactive dimensions.

Figure 4: SOM results on split-3DShapes. (a) SOM prototypes learnt after the first (left) and second (right) data environments. Black boxes label prototypes associated with current (black) and replayed data samples (blue), while the rest of the prototypes are mixture-interpolated. (b) Average spike parameter α k . Prototypes of the second data environment (black) shared active dimensions from the first data environment (blue) with added dimensions. Gray represents inactive dimensions.

Figure 5: Continual learning of Moving Fashion-MNIST and MNIST.

Fig. 4 (a)  shows that the SOM-mixture continually updated a summary of the relation between past (blue box) and new data environments (black box) based on their active latent dimensions (b). T-SNE plots of the learned latent representations after continual learning as an alternative way to visualize the relationships among data are provided in Appendix G.

Figure 6: CelebA split by age (top row) and bangs (bottom row). The green boxes highlight the newly learned semantic factors, and the red boxes highlight the reused ones.moving fashion moving MNIST fashion active 0.74 0.85 0.86 inactive 0.4 0.17 0.35 Method MIG ↑ MIG-sup ↑ I(zpast; factorsnew) ↓ ∆ reconstruction ↓ VAE+SOM+Lold 0.128±0.026 0.125±0.029 0.969±0.238 0.011±0.001 VAE+SS+Lold 0.302±0.121 0.293±0.119 0.782±0.765 0.005±0.103 VAE+SOM+SS+Lold 0.217±0.072 0.240±0.082 0.648±0.268 0.024±0.004 VAE+SOM+SS+Lold+Lnewz 0.216±0.036 0.320±0.086 0.072±0.050 0.024±0.004 CUDOS (above+Lnewx) 0.242±0.047 0.326±0.053 0.024±0.029 0.021±0.013 Table 4: Top: Continual digit classification accuracy (testing) based on active or inactive dimensions. Bottom: Ablation study. SS: spike-and-slab density. L old /L new : constraints on replayed/new data.

) B MOVING MNIST AND MOVING FASHION-MNIST DATASET Based on the original MNIST and fashion-MNIST dataset, we created the moving version of them by: (1) Resize the original 28*28 images to 36*36 images, (2) Place the original 36*36 images image at the top-left of a 64*64 black background. (2) Apply translation for both x and y axis with values[5,10,15,20,25].C EXPERIMENTS DETAILHyper-parameters settings and implementation strategy: We set γ 1 = 0.25, γ 2 = 1, γ 3 = 0.35, b = 10 in all experiments. Snapshot of the model is updated every τ = 1500 iteration steps.

Figure 7: Burgess et al. (2017), I(z past ; factors new ): 1.185

Figure 9: Ramapuram et al. (2020), I(z past ; factors new ): 0.877

Figure 11: Prototypes learnt for continual learning on Moving-Fanshion-MNIST to Moving-MNIST to MNIST.

Figure 12: Mapping density on SOM of the first data environment of split-3DShapes during training.

Figure 13: Mapping density on SOM of the second data environment of split-3DShapes during training. Dots without black boundaries are replay data.

Figure 14: T-SNE plots for 3DShapes factors.

Figure 15: Burgess et al. (2017).

Figure 16: Achille et al. (2018).

Figure 17: Ramapuram et al. (2020).

Figure 18: Continual VQ-VAE.

Figure 19: Continually learning a split version of 3DShapes where the variations of floor and wall colors were absent initially and appeared later. Each row of images is traversing each latent dimension after training on data environments titled above. Black bounding boxes annotate where the new semantic factors are learnt.

Figure 20: Continually learning a split version of 3DShapes where the variations of scale and wall colors were absent initially and appeared later. Each row of images is traversing each latent dimension after training on data environments titled above. Black bounding boxes annotate where the new semantic factors are learnt.

Figure 21: Continually learning a reversed sequence of split-3DShape where all semantic factors were presented in the beginning but later the variation of floor color was removed (only red remained). Each row of images is traversing each latent dimension after training on data environments titled above.

Table. 5 shows the network architecture for all our experiments, where BN stands for batch normalization, Conv stands for convolution layer, T-Conv stands for transposed convolution layer, FC stands for fully-connected layer, ELU stands for Exponential linear unit activation.

Network architecture

ACKNOWLEDGEMENT

This work was supported by National Heart, Lung, and Blood Institute (NHLBI) grant R01HL145590, and National Institute of Nursing Research (NINR) grant R01NR018301.

