IMPROVING VAES' ROBUSTNESS TO ADVERSARIAL ATTACK

Abstract

Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods proposed to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We ameliorate this by applying disentangling methods to hierarchical VAEs. The resulting models produce high-fidelity autoencoders that are also adversarially robust. We confirm their capabilities on several different datasets and with current state-of-the-art VAE adversarial attacks, and also show that they increase the robustness of downstream tasks to attack.

1. INTRODUCTION

Variational autoencoders (VAEs) are a powerful approach to learning deep generative models and probabilistic autoencoders (Kingma & Welling, 2014; Rezende et al., 2014). However, previous work has shown that they are vulnerable to adversarial attacks (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018): an adversary attempts to fool the VAE into producing reconstructions similar to a chosen target by adding distortions to the original input, as shown in Fig 1. This kind of attack can be harmful when the encoder's output is used downstream, as in Xu et al. (2017); Kusner et al. (2017); Theis et al. (2017); Townsend et al. (2019); Ha & Schmidhuber (2018); Higgins et al. (2017b). As VAEs are often themselves used to protect classifiers from adversarial attack (Schott et al., 2019; Ghosh et al., 2019), ensuring VAEs are robust to adversarial attack is an important endeavour. Despite these vulnerabilities, little progress has been made in the literature on how to defend VAEs from such attacks. The aim of this paper is to investigate and introduce possible strategies for defence. We seek to defend VAEs in a manner that maintains reconstruction performance. Further, we are also interested in whether methods for defence increase the robustness of downstream tasks that use VAEs.

Our first contribution is to show that regularising the variational objective during training can lead to more robust VAEs. Specifically, we leverage ideas from the disentanglement literature (Mathieu et al., 2019) to improve VAEs' robustness by learning smoother, more stochastic representations that are less vulnerable to attack. In particular, we show that the total correlation (TC) term used to encourage independence between latents of the learned representations (Kim & Mnih, 2018; Chen et al., 2018; Esmaeili et al., 2019) also serves as an effective regulariser for learning robust VAEs. Though a clear improvement over the standard VAE, a severe drawback of this approach is that the gains in robustness are coupled with drops in reconstruction performance, due to the increased regularisation. Furthermore, we find that the achievable robustness with this approach can be limited (see Fig 1) and thus potentially insufficient for particularly sensitive tasks.

To address this, we apply TC regularisation to hierarchical VAEs. By using a richer latent space representation than a standard VAE, the resulting models are not only more robust to adversarial attacks than single-layer models with TC regularisation, but can also provide reconstructions which are comparable to, and often even better than, the standard (unregularised, single-layer) VAE.
Figure 1: Latent-space adversarial attacks on CelebA are effective on vanilla VAEs, less effective on β-TCVAEs, and close to ineffective on our proposed Seatbelt-VAE. Clockwise within each plot we show the initial input, its reconstruction, the best adversarial input the adversary could produce, the adversarial distortion that was added to make the adversarial input, the adversarial input's reconstruction, and the target image. We are trying to make the initial input (Hugh Jackman) look like the target (Anna Wintour). The adversarial reconstruction for the vanilla VAE looks substantially like Wintour, indicating a successful attack. The β-TCVAE adversarial reconstruction does not look like Wintour, so the attack has not been successful, but it is not Jackman either. Our proposed model, the Seatbelt-VAE, is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour.
Here we start with an image of Hugh Jackman and introduce an adversary that tries to produce reconstructions that look like Anna Wintour. This is done by applying a distortion to the original image to produce an adversarial input. We can see that the adversarial reconstruction for the vanilla VAE looks substantially like Wintour, indicating a successful attack. Adding a regularisation term using the β-TCVAE produces an adversarial reconstruction that does not look like Wintour, but it is also far from a successful reconstruction. The hierarchical version of a β-TCVAE (which we call the Seatbelt-VAE) is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour.

To summarise: we provide insights into what makes VAEs vulnerable to attack and how we might go about defending them. We unearth novel connections between disentanglement and adversarial robustness. We demonstrate that regularised VAEs, trained with an up-weighted total correlation, are much more robust to attacks than vanilla VAEs. Building on this, we develop regularised hierarchical VAEs that are more robust still and offer improved reconstructions. Finally, we show that robustness to adversarial attack also confers increased robustness to downstream tasks.

2. BACKGROUND: ATTACKING VAES

In adversarial attacks an agent is trying to manipulate the behaviour of some model towards a goal of their choosing (Akhtar & Mian, 2018; Gilmer et al., 2018). For many deep learning models, very small changes in the input can produce large changes in the output. Attacks on VAEs have been proposed where the adversary looks to apply small input distortions that produce reconstructions close to a target adversarial image (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018). An example is shown in Fig 1a: here we are trying to turn Hugh Jackman (Original, top left) into Anna Wintour (Target, bottom left). By adding a well-chosen distortion (Distortion, bottom right), the reconstruction of Jackman goes from looking like a somewhat blurry version of the input (Original rec., top middle) to a somewhat blurry version of Wintour (Adversarial rec., bottom middle). The adversary has achieved their goal.

Unlike more established adversarial settings, only a small number of such VAE attacks have been suggested in the literature. The current known most effective mode of attack is a latent space attack (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018). This aims to find a distorted image $x^* = x + d$ such that its posterior $q_\phi(z|x^*)$ is close to that of the agent's chosen target image, $q_\phi(z|x_t)$, under some metric. This then implies that the likelihood $p_\theta(x_t|z)$ is high when given draws from the posterior of the adversarial example. It is particularly important to be robust to this attack if one is concerned with using the encoder network of a VAE as part of a downstream task. For a VAE with a single stochastic layer, the latent-space adversarial objective is

$$\Delta_r(x, d, x_t; \lambda) = r\big(q_\phi(z|x + d),\, q_\phi(z|x_t)\big) + \lambda \|d\|_2, \qquad (1)$$

where $r(\cdot, \cdot)$ is some divergence or distance, commonly the KL divergence $D_{KL}$ (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018). We also penalise the $L_2$ norm of $d$, so as to aim for attacks that change the input less. We can then simply optimise over $d$ to find a good distortion. Alternatively, we can aim to directly increase the ELBO for the target datapoint (Kos et al., 2018), minimising

$$\Delta_{\text{output}}(x, d, x_t; \lambda) = -\mathbb{E}_{q_\phi(z|x+d)}\big[\log p_\theta(x_t|z)\big] + D_{KL}\big(q_\phi(z|x+d)\,\|\,p(z)\big) + \lambda \|d\|_2. \qquad (2)$$
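To make the latent space attack concrete, the following is a minimal sketch of optimising Eq (1) with $r = D_{KL}$. It assumes a hypothetical `encoder(x)` that returns the mean and log-variance of a diagonal-Gaussian $q_\phi(z|x)$; we use Adam here for brevity, whereas the papers cited optimise with L-BFGS-B.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))))
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )

def latent_attack(encoder, x, x_target, lam=1.0, steps=1000, lr=1e-2):
    """Find a distortion d minimising KL(q(z|x+d) || q(z|x_t)) + lam * ||d||^2."""
    mu_t, logvar_t = encoder(x_target)            # target posterior (held fixed)
    mu_t, logvar_t = mu_t.detach(), logvar_t.detach()
    d = torch.zeros_like(x, requires_grad=True)   # distortion to optimise
    opt = torch.optim.Adam([d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mu_a, logvar_a = encoder(x + d)           # attacked posterior
        loss = gaussian_kl(mu_a, logvar_a, mu_t, logvar_t) + lam * d.pow(2).sum()
        loss.backward()
        opt.step()
    return d.detach(), loss.item()
```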

3. DEFENDING VAES

This problem was not considered by prior work. To address it, we first need to consider what makes VAEs vulnerable to adversarial attacks. We argue that two key factors dictate whether we can perform a successful attack on a VAE: a) whether we can induce significant changes in the encoding distribution $q_\phi(z|x)$ through only small changes in the data $x$, and b) whether we can induce significant changes in the reconstructed images through only small changes to the latents $z$. The first of these relates to the smoothness of the encoder mapping, the latter to the smoothness of the decoder mapping.

Consider, for the sake of argument, the case where the encoder-decoder process is almost completely noiseless. Here successful reconstruction places no direct pressure for similar encodings to correspond to similar images: given sufficiently powerful networks, very small changes to embeddings $z$ can imply very large changes to the reconstructed image; there is no ambiguity in the "correct" encoding of a particular datapoint. In essence, we can have lookup-table style behaviour: nearby realisations of $z$ do not necessarily relate to each other and very different images can have very similar encodings. Such a model will be very vulnerable to adversarial attacks: small input changes can lead to large changes in the encoding, and small encoding changes can lead to large changes in the reconstruction. It will also tend to overfit and have gaps in the aggregate posterior, $q_\phi(z) = \frac{1}{N}\sum_{n=1}^{N} q_\phi(z|x_n)$, as each $q_\phi(z|x_n)$ will be sharply peaked. These gaps can then be exploited by an adversary.

There are two mechanisms by which we can reduce this lookup-table behaviour, thereby reducing gaps in the aggregate posterior. First, we can regulate the level of noise in the per-datapoint posterior covariance, to obtain smoothness in the overall embeddings. Having a stochastic encoding creates uncertainty in the latent that gives rise to a particular image, forcing similar latents to correspond to similar images. Adding noise forces the VAE to smooth the encode-decode process: similar images lead to similar embeddings in the latent space, ensuring that small changes in the input result in small changes in the latent space and, in turn, small changes in the decoded outputs. This proportional input-output change is what we refer to as a 'simple' encode-decode process, and encouraging such simplicity is the second mechanism that can reduce lookup-table behaviour.

The fact that the VAE is vulnerable to adversarial attack suggests that its standard setup does not obtain sufficiently smooth and simple representations to provide an adequate defence. Introducing additional regularisation to enforce simplicity or increased posterior covariance thus provides a prospect for defending VAEs. We could attempt to obtain this by direct regularisation of the networks (e.g. weight decay). Here, however, we focus on macro-level regularisation approaches, as discussed in the next section. The reason for this is that the macroscopic behaviour of the networks can be difficult to control, and in particular difficult to calibrate, through low-level regularisation. Further, as the most effective current attacks on VAEs target the latent space, regularisation methods that directly act on the properties of the latent space form a good place to start. A simple empirical probe of the first factor, encoder smoothness, is sketched below.
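A minimal sketch of such a probe, reusing `gaussian_kl` from the attack sketch in Section 2 and the same hypothetical `encoder` interface: it measures how far the posterior moves under small random input perturbations. Large KL values for small $\epsilon$ are symptomatic of the lookup-table behaviour described above.

```python
import torch

def encoder_sensitivity(encoder, x, eps=1e-2, n_probes=32):
    """Average KL(q(z|x) || q(z|x + eps * noise)) over random perturbations.

    A crude smoothness probe: encoders exhibiting lookup-table behaviour
    move their posterior a long way for tiny input changes.
    """
    mu, logvar = encoder(x)
    kls = []
    for _ in range(n_probes):
        noise = eps * torch.randn_like(x)
        mu_p, logvar_p = encoder(x + noise)
        kls.append(gaussian_kl(mu, logvar, mu_p, logvar_p))  # from earlier sketch
    return torch.stack(kls).mean()
```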

3.1. DISENTANGLING METHODS AND ROBUSTNESS

Recent research into disentangling VAEs (Higgins et al., 2017a; Siddharth et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Esmaeili et al., 2019; Mathieu et al., 2019) and the information bottleneck (Alemi et al., 2017; 2018) has looked to regularise the ELBO with the hope of providing more interpretable embeddings. These regularisers also influence the smoothness and stochasticity of the embeddings learned. Of particular relevance, Mathieu et al. (2019) introduce the notion of overlap in the embedding of a VAE: the level of overlap between per-datapoint posteriors as they combine to form the aggregate posterior. Controlling this is critical to achieving a smoothly varying latent embedding. Overlap encapsulates both the level of uncertainty in the encoding process and the locality of this uncertainty. To learn a smooth representation we not only need our encoder distribution to have an appropriate entropy, we also want the different possible encodings to be similar to each other. Critically, Mathieu et al. (2019) show that many methods proposed for disentangling, and in particular the β-VAE (Higgins et al., 2017a; Alemi et al., 2017), provide a mechanism for directly controlling this overlap. Returning to our previous arguments, we see that controlling this overlap may also provide a mechanism for improving VAEs' robustness. This observation hints at an interesting question: can we use methods initially proposed to encourage disentanglement to encourage robustness instead?

It is important to note here that disentangling can be difficult to achieve in practice, typically requiring precise choices of the model's hyperparameters and of the weighting of the added regularisation term, and often also a fair degree of luck (Locatello et al., 2019; Mathieu et al., 2019; Rolinek et al., 2019). As such, we are not suggesting that inducing disentangled representations induces robustness, nor that disentangled representations should be any more robust. Rather, as highlighted above, we are interested in whether the regularisers traditionally used to encourage disentanglement reliably lead to adversarially robust VAEs. Indeed, we will find that though our approaches, based on these regularisers, provide reliable and significant improvements in robustness, these improvements are not generally due to any noticeable improvement in disentanglement itself (see Appendix E.1).

Regularising for Robustness There are a number of different disentanglement methods that one might consider using to train robust VAEs. Perhaps the simplest would be to use a β-VAE (Higgins et al., 2017a), wherein we up-weight the $D_{KL}$ term in the VAE's ELBO by a factor β ≥ 1. However, as mentioned previously, the β-VAE only increases overlap at the expense of substantial reductions in reconstruction quality, as the data likelihood term has, in effect, been down-weighted (Kim & Mnih, 2018; Chen et al., 2018; Mathieu et al., 2019). Because of these shortfalls, we instead propose to regularise through penalisation of a total correlation (TC) term (Kim & Mnih, 2018; Chen et al., 2018). As discussed in Section A.1, this directly forces independence across the different latent dimensions, such that the aggregate posterior $q_\phi(z)$ factorises across dimensions. This approach has been shown to have a smaller deleterious effect on reconstruction quality than found in β-VAEs (Chen et al., 2018). As seen in Fig 2, this method also gives greater overlap by increasing posterior variance.
To summarise, the greater overlap and lesser degradation of reconstruction quality induced by β-TCVAEs make them highly suitable for our purposes.
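As a sketch of how the TC penalty enters training, the following implements the minibatch-weighted estimator of Chen et al. (2018) for the aggregate-posterior terms (see Appendix A.1 and C for the decomposition and the estimator's bias); the `[B, D]`-shaped inputs and `dataset_size` bookkeeping are our assumptions about the training loop. The β-TCVAE loss is then the negative ELBO plus $(\beta - 1)$ times this estimate.

```python
import math
import torch

def log_gaussian(z, mu, logvar):
    # Elementwise log N(z; mu, diag(exp(logvar)))
    return -0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / logvar.exp())

def tc_estimate(z, mu, logvar, dataset_size):
    """Minibatch-weighted estimate of TC(q(z)), after Chen et al. (2018).

    z: [B, D] posterior samples, one per datapoint in the batch;
    mu, logvar: [B, D] posterior parameters for the same batch.
    """
    B = z.size(0)
    # log q(z_i | x_j) for all batch pairs, per dimension: [B, B, D]
    log_qz_pairs = log_gaussian(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    # log q(z_i) ~= logsumexp_j sum_d log q(z_{i,d} | x_j) - log(N * B)
    log_qz = torch.logsumexp(log_qz_pairs.sum(-1), dim=1) - math.log(dataset_size * B)
    # sum_d log q(z_{i,d}) with the same minibatch weighting, per dimension
    log_qz_marginals = (torch.logsumexp(log_qz_pairs, dim=1)
                        - math.log(dataset_size * B)).sum(-1)
    return (log_qz - log_qz_marginals).mean()
```

As noted in the appendix, this is a biased (nested-expectation) estimator, so large batch sizes are needed for it to behave as intended.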

3.2. ADVERSARIAL ATTACKS ON TC-PENALISED VAES

We now consider attacking these TC-penalised VAEs and demonstrate one of the key contributions of the paper: that empirically this form of regularisation makes adversarial attacks on VAEs harder to carry out. To do this, we first train them under the β-TCVAE objective (i.e. Eq (15)), jointly optimising θ, φ for a given β. Once trained, we then attack the models using the latent-space attack method outlined in Section 2, finding an input distortion $d$ that minimises the latent attack loss ∆ as per Eq (1) with $r(\cdot,\cdot) = D_{KL}(\cdot\|\cdot)$.

One possible metric for how successful such attacks have been is the value reached by the attack loss $\Delta_{KL}$. If the latent space distributions for the original input and for the distorted input match closely for a small distortion, then $\Delta_{KL}$ is small and the model has been successfully fooled: reconstructions from samples from the attacked posterior would be indistinguishable from those from the target posterior. Meanwhile, the larger the converged value of the attack loss, the less similar these distributions are and the more different the reconstructed image is from the adversarial target image.

We carry out these attacks for dSprites (Matthey et al., 2017), Chairs (Aubry et al., 2014) and 3D Faces (Paysan et al., 2009), for a range of β and λ values. We pick values of λ following standard methodology (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018), and use L-BFGS-B for the optimisation (Byrd et al., 1995); a minimal sketch follows below. We also varied the dimensionality of the latent space of the model, $d_z$, but found it had little effect on the effectiveness of the attack.

In Fig 3 we show the effect on the attack loss $\Delta_{KL}$ of varying β, averaged over different original input-target pairs and values of $d_z$. Note that the plot is logarithmic in the loss. We see a clear pattern for each dataset: the loss values reached by the adversary increase as we increase β from the standard VAE (i.e. β = 1). This analysis is also borne out by visual inspection of the effectiveness of these attacks, for example as shown in Fig 1. We return to further experimental results in Section 5. An interesting aspect of Fig 3 is that in many cases the adversarial loss starts to decrease if β is too large: as β increases there is less pressure in the objective to produce good reconstructions.
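A minimal sketch of driving the attack with L-BFGS-B via SciPy, reusing `gaussian_kl` and the hypothetical `encoder` interface from the Section 2 sketch; CPU tensors and pixel values in [0, 1] are assumed, with box bounds keeping the adversarial input in range.

```python
import numpy as np
import torch
from scipy.optimize import minimize

def attack_lbfgsb(encoder, x, x_target, lam, steps=100):
    """Latent attack optimised with L-BFGS-B, as in Tabacof et al. (2016)."""
    mu_t, logvar_t = [t.detach() for t in encoder(x_target)]
    x_np = x.detach().numpy().ravel()

    def loss_and_grad(d_flat):
        d = torch.tensor(d_flat.reshape(x.shape), dtype=x.dtype, requires_grad=True)
        mu_a, logvar_a = encoder(x + d)
        loss = gaussian_kl(mu_a, logvar_a, mu_t, logvar_t) + lam * d.pow(2).sum()
        loss.backward()
        return loss.item(), d.grad.numpy().ravel().astype(np.float64)

    bounds = [(-xi, 1.0 - xi) for xi in x_np]  # ensure 0 <= x + d <= 1
    res = minimize(loss_and_grad, np.zeros_like(x_np), jac=True,
                   method="L-BFGS-B", bounds=bounds, options={"maxiter": steps})
    return res.x.reshape(x.shape), res.fun
```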

4. HIERARCHICAL TC-PENALISED VAES

We are now armed with the fact that penalising the TC in the ELBO induces robustness in VAEs. However, TC penalisation in single-layer VAEs comes at the expense of model reconstruction quality (Chen et al., 2018), albeit less so than in β-VAEs. Our aim is to develop a model that is robust to adversarial attack while mitigating this trade-off between robustness and sample quality. To achieve this, we now consider instead using hierarchical VAEs (Rezende et al., 2014; Sønderby et al., 2016; Kingma et al., 2016; Zhao et al., 2017; Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2021). These are known for their superior modelling capabilities and more accurate reconstructions. As these gains stem from using more complex hierarchical latent spaces, rather than less noisy encoders, this suggests they may be able to produce better reconstructions and generative capabilities, while also remaining robust to adversarial attacks when appropriately regularised.

The simplest hierarchical extension of conditional stochastic variables in the generative model is the Deep Latent Gaussian Model (DLGM) of Rezende et al. (2014). Here the forward model factorises as a chain, $p_\theta(x, \bar{z}) = p_\theta(x|z^1) \prod_{i=1}^{L-1} p_\theta(z^i|z^{i+1})\, p(z^L)$, where each $p_\theta(z^i|z^{i+1})$ is a Gaussian distribution with mean and variance parameterised by deep nets, while $p(z^L)$ is an isotropic Gaussian. Unfortunately, we found that naively applying TC penalisation to DLGM-style VAEs did not confer the improved robustness we observed in single-layer VAEs. We postulate that this observed weakness is inherent to the chain factorisation of the generative model: the data likelihood depends solely on $z^1$, the bottom-most latent variable, so attackers only need to manipulate $z^1$ to produce a successful attack.

To account for this, we instead use a generative model in which the likelihood $p_\theta(x|\bar{z})$ depends on all the latent variables in the chain $\bar{z}$, rather than just the bottom layer $z^1$, as has been done in Kingma et al. (2016); Maaløe et al. (2019). This leads to the following factorisation of the generative structure:

$$p_\theta(x, \bar{z}) = p_\theta(x|\bar{z}) \prod_{i=1}^{L-1} p_\theta(z^i|z^{i+1})\, p(z^L). \qquad (3)$$

To construct the ELBO, we must further introduce an inference network $q_\phi(\bar{z}|x)$. On the basis of simplicity, and because it produces effective empirical performance, we use the factorisation

$$q_\phi(\bar{z}|x) = q_\phi(z^1|x) \prod_{i=1}^{L-1} q_\phi(z^{i+1}|z^i, x), \qquad (4)$$

where each conditional distribution $q_\phi(z^{i+1}|z^i, x)$ takes the form of a Gaussian. Again, marginalising out intermediate $z^i$ layers, $q_\phi(z^L|x)$ is a non-Gaussian, highly flexible distribution. To defend this model against adversarial attack, we apply a TC regularisation term as per the last section. We refer to the resulting models as Seatbelt-VAEs. We obtain a decomposition of the ELBO for this model, revealing the existence of a TC term for the top-most layer (see Appendix B for the proof).

Theorem 1.
The evidence lower bound, for a hierarchical VAE with forward model as in Eq (3) and amortised variational posterior as in Eq (4), can be decomposed to reveal the total correlation (see Definition A.1) of the aggregate posterior of the top-most layer of latent variables:

$$\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q(\bar{z},x)}\big[\log p_\theta(x|\bar{z})\big] + R + S_a + S_b - D_{KL}\Big(q(z^L) \,\Big\|\, \prod_j q(z^L_j)\Big), \qquad (5)$$

where the last term is the required TC term and, using $j$ to index over the coordinates in $z^L$,

$$R = \int \mathrm{d}x \prod_{i=1}^{L} (\mathrm{d}z^i)\; q_\phi(\bar{z}|x)\, q(x) \log \frac{\prod_{k=1}^{L-1} p_\theta(z^k|z^{k+1})}{q_\phi(z^1|x) \prod_{m=1}^{L-2} q_\phi(z^{m+1}|z^m, x)} \qquad (6)$$

$$S_a = -\mathbb{E}_{q_\phi(z^{L-1})}\, D_{KL}\big(q_\phi(z^L, x|z^{L-1}) \,\big\|\, q_\phi(z^L)\, q(x)\big) \qquad (7)$$

$$S_b = -\sum_j D_{KL}\big(q_\phi(z^L_j) \,\big\|\, p(z^L_j)\big). \qquad (8)$$

In other words, following the Factor- and β-TCVAEs, we up-weight the TC term for $z^L$. We can up-weight this term and then recombine the decomposed parts of the ELBO, giving the following compact form of the objective.

Definition 1. A Seatbelt-VAE is a hierarchical VAE with forward model as in Eq (3) and amortised variational posterior as in Eq (4), trained wrt its parameters θ, φ to maximise the objective

$$\mathcal{L}_{\text{Seatbelt}}(\theta, \phi; \beta, \mathcal{D}) := \mathbb{E}_{q_\phi(\bar{z},x)}\Big[\log \frac{p_\theta(x, \bar{z})}{q_\phi(\bar{z}|x)}\Big] - (\beta - 1)\, D_{KL}\Big(q(z^L) \,\Big\|\, \prod_j q(z^L_j)\Big). \qquad (9)$$

We see that, when L = 1, a Seatbelt-VAE reduces to a β-TCVAE. We use the β = 1 case as a baseline in our experiments: for L = 1 it corresponds to a vanilla VAE, and for L > 1 it produces a hierarchical model with a likelihood function conditioned on all latents. As with the β-TCVAE, training $\mathcal{L}_{\text{Seatbelt}}(\theta, \phi; \beta, \mathcal{D})$ using stochastic gradient ascent with minibatches of the data is complicated by the presence of aggregate posteriors $q_\phi(z)$, which depend on the entire dataset. To deal with this, in Appendix C we derive a minibatch estimator for TC-penalised hierarchical VAEs, building off that used for β-TCVAEs (Chen et al., 2018). We note that, as in Chen et al. (2018), large batch sizes are generally required to provide accurate TC estimates.

Attacking Hierarchical TC-Penalised VAEs In the above hierarchical model the likelihood over data is conditioned on all layers, so manipulations to any layer have the potential to be significant. We focus on simultaneously attacking all layers, noting that, as shown in Appendix D, this is more effective than just targeting the top or base layers individually. Hence our adversarial objective for latent-space attacks on Seatbelt-VAEs is the following generalisation of that introduced in Tabacof et al. (2016); Gondim-Ribeiro et al. (2018); Kos et al. (2018), attacking all the layers at the same time:

$$\Delta^{\text{Seatbelt}}_r(x, d, x_t; \lambda) = \lambda \|d\|_2 + \sum_{i=1}^{L} r\big(q_\phi(z^i|x + d),\, q_\phi(z^i|x_t)\big). \qquad (10)$$
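A minimal sketch of the all-layer attack loss in Eq (10), with $r = D_{KL}$, reusing `gaussian_kl` from the Section 2 sketch; the `encoder_layers(x)` interface, returning one `(mu, logvar)` pair per stochastic layer, is our assumption.

```python
import torch

def seatbelt_latent_attack_loss(encoder_layers, x, d, x_target, lam):
    """Eq (10): attack all L layers of a hierarchical posterior at once.

    encoder_layers(x) is assumed to return a list of (mu, logvar) pairs,
    one per stochastic layer z^1 ... z^L, for a given input.
    """
    attacked = encoder_layers(x + d)
    target = [(m.detach(), lv.detach()) for m, lv in encoder_layers(x_target)]
    loss = lam * d.pow(2).sum()
    for (mu_a, lv_a), (mu_t, lv_t) in zip(attacked, target):
        loss = loss + gaussian_kl(mu_a, lv_a, mu_t, lv_t)  # r = D_KL here
    return loss
```

Swapping `gaussian_kl` for a 2-Wasserstein distance between the Gaussians gives the $W_2$ variant used in Section 5.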

5. EXPERIMENTS

Expanding on the brief experiments in Section 3.2, we perform a battery of adversarial attacks on each of the introduced models. We do this for three different adversarial attacks: first (as in Section 3.2), a latent attack, Eqs (1, 10), using the $D_{KL}$ divergence between attacked and target posteriors; secondly, an attack via the model's output, aiming to make the target maximally likely under the attacked model, as in Eq (2); finally, a new latent attack method as per Eqs (1, 10) where we use $r(\cdot,\cdot) = W_2(\cdot,\cdot)$, the 2-Wasserstein distance between attacked and target posteriors.

We then evaluate the effectiveness of these attacks in three ways. First, as in Fig 1, we can plot the attacks themselves, to see how effective they are at fooling us. Secondly, we can measure the adversary's loss under the attack objective. Thirdly, we give the negative adversarial likelihood of the target image $x_t$ given an attacked latent representation $z^*$. Larger, more positive, values of $-\log p_\theta(x_t|z^*)$ correspond to less successful attacks, as they correspond to large distances between the target and the adversarial reconstruction; lower values correspond to successful attacks, as they correspond to a small distance between the adversarial target and the reconstruction. We also measure the reconstruction quality of these models as a function of the degree of regularisation. Finally, we measure how downstream tasks that use the output of these models perform under attack: we train classifiers on the reconstructions and on the latent representations, and see how robust their performance is when the upstream VAE is attacked.

We demonstrate that hierarchical TC-penalised VAEs (Seatbelt-VAEs) confer superior robustness to β-TCVAEs and standard VAEs, while preserving the ability to reconstruct inputs effectively. Through this, we demonstrate that they are a powerful tool for learning robust deep generative models.

Following previous work (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018) we randomly sample 10 input-target pairs for each dataset, and for each image pair we consider 50 different values of λ geometrically distributed from $2^{-20}$ to $2^{20}$ (sketched below). Thus each individual trained model undergoes 500 attacks for each attack mode. As before, we use L-BFGS-B (Byrd et al., 1995). We perform these experiments on Chairs (Aubry et al., 2014), 3D Faces (Paysan et al., 2009), and CelebA (Liu et al., 2015). Details of neural architectures and training are given in Appendix G.
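The λ grid is straightforward to construct; a minimal sketch:

```python
import numpy as np

# 50 penalty weights geometrically spaced from 2^-20 to 2^20, following
# Tabacof et al. (2016); Gondim-Ribeiro et al. (2018).
lambdas = 2.0 ** np.linspace(-20, 20, 50)
```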

5.1. VISUAL APPRAISAL OF ATTACKS

We first visually appraise the effectiveness of attacks that use the $D_{KL}$ divergence on vanilla VAEs, β-TCVAEs, and Seatbelt-VAEs. As mentioned in Section 1, Fig 1 shows the results of latent space attacks on three models trained on CelebA. It is apparent that the β-TCVAE provides additional resilience to the attacks compared with the standard VAE. Furthermore, this figure shows that Seatbelt-VAEs are sufficiently robust to almost completely thwart the adversary: the adversarial reconstruction still resembles the original input. Moreover, this is achieved while also producing a clearer non-adversarial reconstruction.

One might expect attacks targeting a single generative factor underpinning the data to be easier. However, we find that these models protect effectively against this as well. For example, see Fig 4 for plots showing an attacker attempting to rotate a dSprites heart. In both figures we follow the method of Gondim-Ribeiro et al. (2018) to plot attacks. Those shown are representative of the adversarial inputs the attacker was able to find over the 50 different values of λ. The Seatbelt-VAE input only undergoes a small perturbation because the model is sufficiently robust that the attacker is unable to make the reconstruction look meaningfully more like the target image, so the optimiser never drifts far from the initial input. Note that the β-TCVAE is also robust here: the attacker is unable to induce the desired adversarial reconstruction, even though the attack may be of large magnitude. In contrast, attacks on vanilla VAEs are able to move through the latent space and find a perturbation that reconstructs to the adversary's target image.

5.2. QUANTITATIVE ANALYSIS OF ROBUSTNESS

Having ascertained perceptually that Seatbelt-VAEs offer the strongest protection against adversarial attack, we now demonstrate this quantitatively. Our results show that TC penalisation protects against adversarial attacks, and that the hierarchical extension confers much greater protection than a single-layer β-TCVAE. As we go to the largest values of β for both Chairs and 3D Faces, the adversarial loss $\Delta_{KL}$ grows by a factor of ≈ $10^7$ and $-\log p_\theta(x_t|z^*)$ for those attacks doubles for the Seatbelt-VAE. For all attacks, TC-penalised models outperform standard VAEs (β = 1), and Seatbelt-VAEs outperform single-layer VAEs. β-TCVAEs do not experience such a large uptick in adversarial loss and negative adversarial likelihood. These results show that the hierarchical approach can offer very strong protection from the adversarial attacks studied. In Appendix D we provide plots detailing these metrics for a range of L values. In Appendix E we also calculate the $L_2$ distance between target images and adversarial outputs and show that the loss of effectiveness of adversarial attacks is not due to the degradation of reconstruction quality from increasing β.

We also test VAE robustness to random noise: we noise the inputs and evaluate the model's ability to reconstruct the original input, that is, its ability to denoise. See Appendix F for an illustration of this for TC-penalised models. It is plausible that the ability of these models to denoise is linked to their robustness to attacks.

ELBO and Reconstructions Though Seatbelt-VAEs offer better protection against adversarial attack than β-TCVAEs, we also motivate their utility by way of their reconstruction quality. In Fig 6 we plot the ELBO of the two TC-penalised models, calculated without the β penalisation that was applied during training. We further show the effect of depth and TC penalisation on CelebA reconstructions. These plots show that Seatbelt-VAEs' reconstructions are more resilient to increasing β than β-TCVAEs'.

5.3. PROTECTION TO DOWNSTREAM TASKS

Finally, we consider the protection that Seatbelt-VAEs might provide to downstream tasks, noting that VAEs are often used as subcomponents in larger ML systems (Higgins et al., 2017b), or as a mechanism to protect another model from attack (Schott et al., 2019; Ghosh et al., 2019). Table 1 shows results for classification tasks using 2-layer MLPs and fully-convolutional nets trained on the reconstructions or on the embeddings. It shows the drop in accuracy caused by an adversary that picks a target with a different label and attacks the VAE's embedding using the attack objective with λ = 1. We see that Seatbelt-VAEs produce significantly better accuracies under these attacks; a sketch of this evaluation follows below.

Table 1: Classifier accuracy under attack (change relative to the un-attacked model in parentheses).

                 Vanilla VAE     β-TCVAE         Seatbelt-VAE
p_MLP(y|x)       0.17 (-0.32)    0.25 (-0.21)    0.38 (-0.09)
p_Conv(y|x)      0.07 (-0.37)    0.32 (-0.10)    0.34 (-0.07)
p_MLP(y|z)       0.16 (-0.41)    0.26 (-0.23)    0.39 (-0.09)
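A minimal sketch of the downstream evaluation, reusing `latent_attack` from the Section 2 sketch; the classifier-on-the-posterior-mean interface and unbatched encoder inputs are our assumptions.

```python
import torch

def accuracy_drop_under_attack(encoder, classifier, xs, ys, xs_target, lam=1.0):
    """Accuracy of a latent-space classifier before and after attacking the VAE.

    xs_target are attack targets chosen to have labels different from ys;
    the encoder is assumed to accept unbatched inputs inside latent_attack.
    """
    def predict(x):
        mu, _ = encoder(x)                       # classify on the posterior mean
        return classifier(mu).argmax(dim=-1)

    clean_acc = (predict(xs) == ys).float().mean()
    xs_adv = torch.stack([
        x + latent_attack(encoder, x, x_t, lam=lam)[0]
        for x, x_t in zip(xs, xs_target)
    ])
    attacked_acc = (predict(xs_adv) == ys).float().mean()
    return clean_acc.item(), attacked_acc.item()
```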

6. CONCLUSION

We have shown that VAEs can be rendered more robust to adversarial attacks by regularising the evidence lower bound. This increase in robustness can be strengthened by extending these regularisation methods to hierarchical VAEs, forming Seatbelt-VAEs, which use a generative structure in which the likelihood makes use of all the latent variables. Designing robust VAEs is becoming pressing as they are increasingly deployed as subcomponents in larger pipelines. As we have shown, methods typically used for disentangling, motivated by their ability to provide interpretable representations, also confer robustness. Studying the beneficial effects of these methods is starting to come to the fore of VAE research.

A VARIATIONAL AUTOENCODERS

Variational autoencoders (VAEs) are a variety of generative model suitable for high-dimensional data like images (Kingma & Welling, 2014; Rezende et al., 2014). They introduce a joint distribution over data $x$ and latent variables $z$: $p_\theta(x, z) = p_\theta(x|z)\, p(z)$, where $p_\theta(x|z)$ is an appropriate distribution given the form of the data, the parameters of which are represented by deep nets with parameters θ, and $p(z) = \mathcal{N}(0, I)$ is a common choice for the prior. As exact inference is intractable, one performs amortised stochastic variational inference by introducing an inference network for the latent variables, $q_\phi(z|x)$, which often also takes the form of a Gaussian, $\mathcal{N}(z|\mu_\phi(x), \Sigma_\phi(x))$. We can then perform gradient ascent, with respect to both θ and φ, on the evidence lower bound (ELBO)

$$\mathcal{L}(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \qquad (11)$$

using the reparameterisation trick to take gradients through Monte Carlo samples from $q_\phi(z|x)$.

A.1 DISENTANGLING VAES

When learning disentangled representations (Bengio et al., 2013) in a VAE, one attempts to establish a one-to-one correspondence between dimensions of the learnt latent space and some interpretable aspect of the data (Higgins et al., 2017a; Burgess et al., 2017; Chen et al., 2018; Mathieu et al., 2019). One dimension of the latent space could, for instance, encode the rotation of a face. Mathieu et al. (2019) offer a broader perspective, in which disentangling can be interpreted as a particular case of decomposition. In decomposition, models have the right degree of overlap between their latent posteriors such that the aggregate posterior matches the prior well throughout the latent space $\mathcal{Z}$.

Disentangling is often enforced by adding a penalisation to the VAE ELBO that acts akin to a regularisation method. Because of this, disentangling can be difficult to achieve in practice, and often requires precisely choosing the hyperparameters of the model and the weighting of the added regularisation term (Locatello et al., 2019; Mathieu et al., 2019; Rolinek et al., 2019). That disentangling relies on forms of soft supervision renders the task of learning disentangled representations potentially problematic (Khemakhem et al., 2020). When viewed as a purely unsupervised task, it can be hard to establish a direct correspondence between a disentangling VAE's training objective and the learning of a disentangled latent space. Nevertheless, models trained under disentangling objectives have other beneficial properties. For example, the encoders of some disentangled VAEs have been used as the perceptual part of deep reinforcement learning models to create agents more robust to variation in their environment (Higgins et al., 2017b). Thus, regardless of the presence of disentangled generative factors, these regularisation methods can be useful for downstream tasks. In this paper we show that methods developed to obtain disentangled representations have the benefit of conferring robustness to adversarial attack.

A commonly used disentangling method is the β-VAE (Higgins et al., 2017a), in which a free parameter β multiplies the $D_{KL}$ term in the evidence lower bound $\mathcal{L}(x)$. This objective remains a lower bound on the evidence:

$$\mathcal{L}_\beta(x) := \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \beta D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big).$$

The β-VAE, though it offers a simple method for obtaining potentially disentangled representations, does so at the expense of model quality: models trained with large β penalisation suffer from poor quality reconstructions and a lower ELBO.
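As a concrete sketch of the two objectives above: a minimal Gaussian VAE for binarised, flattened image data, where `loss()` is the negative β-ELBO $\mathcal{L}_\beta(x)$ and β = 1 recovers the standard ELBO. The architecture and layer sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal Gaussian VAE; loss() is the negative beta-ELBO, L_beta(x)."""

    def __init__(self, x_dim, z_dim, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 2 * z_dim))   # -> (mu, logvar)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))       # -> Bernoulli logits

    def loss(self, x, beta=1.0):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)    # reparameterisation
        log_px_z = -F.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none").sum(-1)           # Bernoulli likelihood
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return -(log_px_z - beta * kl).mean()                   # beta=1: standard ELBO
```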
For more discussion of the theoretical aspects of β-VAEs, see Kumar & Poole (2020). Other methods seek to offset this degradation in model quality by decomposing the ELBO and more precisely targeting the regularisation used to obtain disentangled representations. We can gain more insight into VAEs by defining the evidence lower bound not per datapoint, but over the dataset $\mathcal{D}$ of size N, $\mathcal{D} = \{x_n\}$, so that we have $\mathcal{L}(\theta, \phi; \mathcal{D})$ (Hoffman & Johnson, 2016; Makhzani et al., 2016; Kim & Mnih, 2018; Chen et al., 2018; Esmaeili et al., 2019). From this, Esmaeili et al. (2019) give a decomposition of the dataset-level evidence lower bound:

$$\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q_\phi(z,x)}\Big[\log \frac{p_\theta(x, z)}{q_\phi(z, x)}\Big] \qquad (12)$$

$$= \mathbb{E}_{q_\phi(z,x)}\Big[\underbrace{\log \frac{p_\theta(x|z)}{p_\theta(x)}}_{(1)} - \underbrace{\log \frac{q_\phi(z|x)}{q_\phi(z)}}_{(2)}\Big] - \underbrace{D_{KL}\big(q(x) \,\|\, p_\theta(x)\big)}_{(3)} - \underbrace{D_{KL}\big(q_\phi(z) \,\|\, p(z)\big)}_{(4)}, \qquad (13)$$

where, under the assumption that $p(z)$ factorises, we can further decompose term (4):

$$D_{KL}\big(q_\phi(z) \,\|\, p(z)\big) = \underbrace{\mathbb{E}_{q_\phi(z)}\Big[\log \frac{q_\phi(z)}{\prod_j q_\phi(z_j)}\Big]}_{A} + \underbrace{\sum_j D_{KL}\big(q_\phi(z_j) \,\|\, p(z_j)\big)}_{B}, \qquad (14)$$

where $j$ indexes over coordinates in $z$, $q_\phi(z, x) = q_\phi(z|x)\, q(x)$, and $q(x) := \frac{1}{N}\sum_{n=1}^{N} \delta(x - x_n)$ is the empirical data distribution. $q_\phi(z) := \frac{1}{N}\sum_{n=1}^{N} q_\phi(z|x_n)$ is called the aggregate posterior. Term A is the total correlation (TC) of $q_\phi(z)$.

Definition A.1. The total correlation (TC) is a generalisation of mutual information to multiple variables (Watanabe, 1960) and is often used as the objective in Independent Component Analysis (Bell & Sejnowski, 1995). The TC of a variable $s \in \mathbb{R}^d$ with joint distribution $p(s)$ is defined as the KL divergence from $p(s)$ to the independent distribution over the dimensions of $s$, $p(s_1)\, p(s_2) \dots p(s_d)$. Formally: $\mathrm{TC}(s) = D_{KL}\big(p(s) \,\|\, \prod_{j=1}^{d} p(s_j)\big)$.

With this mean-field $p(z)$, Factor- and β-TCVAEs up-weight the TC of the aggregate posterior, giving the objective

$$\mathcal{L}_{\beta\mathrm{TC}}(\theta, \phi; \mathcal{D}) = (1) - (2) - (3) - B - \beta A. \qquad (15)$$

Up-weighting the penalisation associated with the TC term promotes the learning of independent latent factors, one of the key objectives of disentangling. Chen et al. (2018) show empirically that the learnt representations are disentangled when the hyperparameters of the model are well chosen. They also give a differentiable, stochastic approximation to $\mathbb{E}_{q_\phi(z)} \log q_\phi(z)$, rendering this decomposition simple to use as a training objective with stochastic gradient descent. However, this is a biased estimator: it involves a nested expectation, for which unbiased, finite-variance estimators do not generally exist (Rainforth et al., 2018). Consequently, it needs large batch sizes to exhibit the desired behaviour; for small batch sizes its practical behaviour mimics that of the β-VAE (Mathieu et al., 2019).

So, why do these regularisation methods increase overlap? Why can up-weighting the penalisation of the total correlation, demanding that the aggregate posterior be well-approximated by the product of its marginals, be expected to increase overlap? And why does it do so in a manner superior to a β-VAE's up-weighting of $D_{KL}(q_\phi(z|x) \| p(z))$? Recall that in Fig 2 we showed that the $L_2$ norm of the standard deviation of the encoder concentrates at a particular value for β-VAEs, but for β-TCVAEs it takes a broader range of values, values above the saturation point of β-VAEs. In a β-VAE with large β we are asking that the amortised posterior be close to the prior for all inputs: for $p(z) = \mathcal{N}(0, I)$ we are forcing $\mu_\phi(x)$ to 0 and $\sigma_\phi(x)$ to 1.
Naturally this will lead the aggregate posterior to have a high degree of overlap between its constituent mixture components, because all of them are being driven to be the same. And with all per-datapoint posteriors driven to be the same, information about the input data is necessarily lost in these representations. For a β-TCVAE, however, the demand that the aggregate posterior be well-approximated by the product of its marginals does not in itself entail a fixed scale, nor does it push all the per-datapoint posteriors towards the prior. Rather, we are directly asking for statistical independence between coordinate directions. Holes in the aggregate posterior are (as long as they are off-axis) a form of dependency between the latent variables. By demanding that the aggregate posterior factorises, we are thus asking the model to 'smooth out' any holes (or peaks) that do not lie along the axes of the latent space. Intuitively, and as shown in Fig 2, this can be achieved without causing as strong a degradation in model quality, as measured by the fidelity of reconstructions and the values of the (β = 1) ELBO.

To give a more direct understanding, we perform some toy experiments on 'Swiss Roll' data, Fig A.7. We train 2D-latent-space VAEs: vanilla, β-VAEs, and β-TCVAEs. We plot the aggregate posterior and the reconstructions (the means of the likelihood conditioned on a sample from each per-datapoint posterior). Clearly the amount of overlap increases with β for both kinds of model, but the β-TCVAEs do this in a more structured way and, unlike the β-VAE, do not suffer from (eventually catastrophic) degradation in model quality for large β.
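To make Definition A.1 concrete, a small worked example under the assumption of a zero-mean Gaussian, for which the TC has a closed form.

```python
import numpy as np

def gaussian_tc(cov):
    """TC of a zero-mean Gaussian: KL(N(0, cov) || prod_j N(0, cov_jj)).

    Closed form: 0.5 * (sum_j log cov_jj - log det cov); zero iff cov is diagonal.
    """
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

# Worked example: a correlated 2D Gaussian has positive TC.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
print(gaussian_tc(cov))        # ~0.51: dependent coordinates
print(gaussian_tc(np.eye(2)))  # 0.0: independent coordinates
```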

A.3 HIERARCHICAL VAES

In a hierarchical VAE we have a set of L layers of latent variables: $\bar{z} = \{z^i\}_{i=1}^{L}$. However, training DLGMs is challenging: the latent variables furthest from the data can fail to learn anything informative (Sønderby et al., 2016; Zhao et al., 2017). Due to the factorisation of $q_\phi(\bar{z}|x)$ and $p_\theta(x, \bar{z})$ in a DLGM, it is possible for a single-layer VAE to train in isolation within a hierarchical model: each $p_\theta(z^i|z^{i+1})$ distribution can become a fixed distribution not depending on $z^{i+1}$, such that each $D_{KL}$ divergence present in the objective between corresponding $z^i$ layers can still be driven to a local minimum. Zhao et al. (2017) give a proof of this separation for the case where the model is perfectly trained, i.e. $D_{KL}(q_\phi(\bar{z}, x) \,\|\, p_\theta(x, \bar{z})) = 0$. This is the hierarchical version of the collapse of $z$ units in a single-layer VAE (Burda et al., 2016).

B PROOF OF THEOREM 1

Here we prove that the ELBO for a hierarchical VAE with forward model as in Eq (3) and amortised variational posterior as in Eq (4) can be decomposed to reveal a total correlation in the top-most latent variable. Specifically, now considering the ELBO for the whole dataset and using $q(x)$ to indicate the empirical data distribution, we will obtain, denoting $z^0 = x$:

$$\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q_\phi(\bar{z},x)}\big[\log p_\theta(x|\bar{z})\big] - \mathbb{E}_{q_\phi(\bar{z}|x)q(x)}\Big[\sum_{i=1}^{L-1} D_{KL}\big(q_\phi(z^i|z^{i-1}, x) \,\big\|\, p_\theta(z^i|z^{i+1})\big)\Big] - \mathbb{E}_{q_\phi(z^{L-1})}\, D_{KL}\big(q_\phi(z^L, x|z^{L-1}) \,\big\|\, q_\phi(z^L)\, q(x)\big) - \sum_j D_{KL}\big(q_\phi(z^L_j) \,\big\|\, p(z^L_j)\big) - \beta D_{KL}\Big(q_\phi(z^L) \,\Big\|\, \prod_j q_\phi(z^L_j)\Big). \qquad (16)$$

We start with the forms of $p$ and $q$ given in Theorem 1; the likelihood is conditioned on all $\bar{z}$ layers, $p_\theta(x|\bar{z})$:

$$\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q_\phi(\bar{z},x)}\Big[\log \frac{p_\theta(x, \bar{z})}{q_\phi(\bar{z}, x)}\Big] \qquad (17)$$

$$= \mathbb{E}_{q_\phi(\bar{z},x)}\big[\log p_\theta(x|\bar{z})\big] - \mathbb{E}_{q(x)}\big[D_{KL}\big(q_\phi(\bar{z}|x) \,\|\, p_\theta(\bar{z})\big)\big] - \mathbb{E}_{q(x)}\big[\log q(x)\big] \qquad (18)$$

$$= \mathbb{E}_{q(\bar{z},x)}\big[\log p_\theta(x|\bar{z})\big] + H(q(x)) + \mathbb{E}_{q(\bar{z},x)}\Big[\log \frac{p_\theta(\bar{z})}{q_\phi(\bar{z}|x)}\Big] \qquad (19)$$

$$= \mathbb{E}_{q(\bar{z},x)}\big[\log p_\theta(x|\bar{z})\big] + H(q(x)) + \underbrace{\int \mathrm{d}x\, \mathrm{d}z^1 \prod_{i=2}^{L}\big(\mathrm{d}z^i\, q_\phi(z^i|z^{i-1}, x)\big)\, q_\phi(z^1|x)\, q(x) \log \frac{p(z^L) \prod_{k=1}^{L-1} p_\theta(z^k|z^{k+1})}{q_\phi(z^1|x) \prod_{m=1}^{L-1} q_\phi(z^{m+1}|z^m, x)}}_{W}. \qquad (20)$$

So here we have three terms: an expectation over the data likelihood, the entropy of the empirical data distribution (a constant), and $W$. We can expand $W$ into a term involving the prior for the latent $z^L$ and a term involving the conditional distributions from the generative model for the remaining components of $\bar{z}$:

$$W = \underbrace{\int \mathrm{d}x \prod_{i=1}^{L}(\mathrm{d}z^i)\, q_\phi(\bar{z}|x)\, q(x) \log \frac{\prod_{k=1}^{L-1} p_\theta(z^k|z^{k+1})}{q_\phi(z^1|x) \prod_{m=1}^{L-2} q_\phi(z^{m+1}|z^m, x)}}_{R} + \underbrace{\int \mathrm{d}x \prod_{i=1}^{L}(\mathrm{d}z^i)\, q_\phi(\bar{z}|x)\, q(x) \log \frac{p(z^L)}{q_\phi(z^L|z^{L-1}, x)}}_{S}. \qquad (21)$$

The first part, $R$, the part of $W$ not involving the prior for the top-most latent variable $z^L$, is the first subject of our attention. We split out of $R$ the generative and posterior terms for the latent variable closest to the data, $z^1$, and the rest:

$$R = \underbrace{\int \mathrm{d}x \prod_{i=1}^{L}(\mathrm{d}z^i)\, q_\phi(\bar{z}|x)\, q(x) \log \frac{p_\theta(z^1|z^2)}{q_\phi(z^1|x)}}_{R_a} + \underbrace{\sum_{m=2}^{L-1} \int \mathrm{d}x \prod_{i=1}^{L}(\mathrm{d}z^i)\, q_\phi(\bar{z}|x)\, q(x) \log \frac{p_\theta(z^m|z^{m+1})}{q_\phi(z^m|z^{m-1}, x)}}_{R_b}. \qquad (22)$$

The first of these terms, $R_a$, is an expectation over a $D_{KL}$:

$$R_a = -\mathbb{E}_{q_\phi(z^2, x)}\, D_{KL}\big(q_\phi(z^1|x) \,\big\|\, p_\theta(z^1|z^2)\big). \qquad (23)$$

And the rest, $R_b$, provides the $D_{KL}$ divergences in the ELBO for all latent variables other than $z^L$ and $z^1$. It reduces to a sum of expectations over $D_{KL}$ divergences, one per latent variable.
R_b = \sum_{m=2}^{L-1} \mathbb{E}_{q_\phi(\bar{z}, x)}\left[\log \frac{p_\theta(z^m|z^{m+1})}{q_\phi(z^m|z^{m-1}, x)}\right]   (23)

 = -\sum_{m=2}^{L-1} \mathbb{E}_{q_\phi(z^{m+1}, z^{m-1}, x)} D_{KL}(q_\phi(z^m|z^{m-1}, x) \| p_\theta(z^m|z^{m+1})),   (24)

where in (24) we have marginalised out all layers other than z^{m-1}, z^m, and z^{m+1} in each summand. Now we have:

\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q(\bar{z}, x)}[\log p_\theta(x|\bar{z})] + H(q(x)) + R_a + R_b + S.   (25)

We wish to apply the TC decomposition to the top-most latent variable z^L. S is an expectation over the D_{KL} divergence between q_\phi(z^L|z^{L-1}, x) and p(z^L):

S = -\mathbb{E}_{q_\phi(z^{L-1}, x)} D_{KL}(q_\phi(z^L|z^{L-1}, x) \| p(z^L)).   (26)

Applying the decomposition, with j indexing over units in z^L:

S = -\mathbb{E}_{q_\phi(z^L, z^{L-1}, x)}\left[\log q_\phi(z^L|z^{L-1}, x) - \log p(z^L) + \log q_\phi(z^L) - \log q_\phi(z^L) + \log \prod_j q_\phi(z^L_j) - \log \prod_j q_\phi(z^L_j)\right]

 = -\mathbb{E}_{q_\phi(z^L, z^{L-1}, x)}\left[\log \frac{q_\phi(z^L|z^{L-1}, x)}{q_\phi(z^L)}\right] - \mathbb{E}_{q_\phi(z^L)}\left[\log \frac{q_\phi(z^L)}{\prod_j q_\phi(z^L_j)}\right] - \mathbb{E}_{q_\phi(z^L)}\left[\log \frac{\prod_j q_\phi(z^L_j)}{p(z^L)}\right]

 = -\mathbb{E}_{q_\phi(z^L, z^{L-1}, x)}\left[\log \frac{q_\phi(z^L|z^{L-1}, x) q(x)}{q_\phi(z^L) q(x)}\right] - \mathbb{E}_{q_\phi(z^L)}\left[\log \frac{q_\phi(z^L)}{\prod_j q_\phi(z^L_j)}\right] - \sum_j \mathbb{E}_{q_\phi(z^L_j)}\left[\log \frac{q_\phi(z^L_j)}{p(z^L_j)}\right]

 = \underbrace{-\mathbb{E}_{q_\phi(z^{L-1})} D_{KL}(q_\phi(z^L, x|z^{L-1}) \| q_\phi(z^L) q(x))}_{S_a} \, \underbrace{- \sum_j D_{KL}(q_\phi(z^L_j) \| p(z^L_j))}_{S_b} \, \underbrace{- D_{KL}\Big(q_\phi(z^L) \,\Big\|\, \prod_j q_\phi(z^L_j)\Big)}_{S_c}   (27)


Here we have used p(z^L) = \prod_j p(z^L_j) for our chosen generative model, a product of independent unit-variance Gaussian distributions. Thus

\mathcal{L}(\theta, \phi; \mathcal{D}) = \mathbb{E}_{q(\bar{z}, x)}[\log p_\theta(x|\bar{z})] + H(q(x)) + R_a + R_b + S_a + S_b + S_c,   (28)

giving us a decomposition of the evidence lower bound that reveals the TC term in z^L, as required. Multiplying this term by a chosen pre-factor β gives the required form.

Figure: results for … (c,d) and 3D Faces (d,e) on: the L2 distance from the adversarial target x_t to its reconstruction when given as input (target-recon distance); the L2 distance between the adversarial input x* and x_t (adversarial-target distance); and the adversarial objectives Δ. We also include these metrics for "output" attacks (Gondim-Ribeiro et al., 2018), which we find to be generally less effective; in such attacks the attacker directly tries to reduce the L2 distance between the reconstructed output and the target image. For latent attacks the adversarial-target L2 distance grows more rapidly than the target-recon distance (i.e. the degradation of reconstruction quality) as we increase β. This effect is much less clear for output attacks. This makes it apparent that the robustness of the β-TCVAE to latent-space adversarial attacks is not due to the degradation in reconstruction quality we see as β increases. It is also apparent that increasing β increases the adversarial loss for both latent and output attacks.

E.1 DISENTANGLING AND ROBUSTNESS?

Although we use regularisation methods that were initially proposed to encourage disentangled representations, we are interested here in their effect on robustness, not in whether the representations we learn are in fact disentangled. This is not least due to the questions that have arisen about the hyperparameter tuning required for disentangled representations (Locatello et al., 2019; Rolinek et al., 2019). For us the β pre-factor is just the degree of regularisation imposed. However, it may be of interest to see what relationship, if any, exists between the ease of attacking a model and how disentangled it is. Here we show the MIG score (Chen et al., 2018) against the achieved adversarial loss on the Faces data for β-TCVAEs. MIG measures the degree to which representations are disentangled, and larger adversarial losses correspond to a less successful attack. Shading is over the range of β and d_z values. There does not seem to be any simple correspondence between increased MIG and increased adversarial loss (indicative of a less successful attack).
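For concreteness, the latent attacks referred to throughout (Eq (1) with r = D_KL) can be sketched as follows. This is a minimal sketch, assuming a PyTorch encoder that returns the mean and log-variance of a diagonal Gaussian posterior; the names encode, lam, and n_steps are ours, not from the original implementation, and the exact penalty form here (λ‖d‖²) is an illustrative choice.

import torch

def kl_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b):
    # KL( N(mu_a, var_a) || N(mu_b, var_b) ) for diagonal Gaussians, summed over dims.
    var_a, var_b = logvar_a.exp(), logvar_b.exp()
    return 0.5 * (logvar_b - logvar_a + (var_a + (mu_a - mu_b) ** 2) / var_b - 1).sum()

def latent_attack(encode, x, x_t, lam=1.0, n_steps=1000, lr=1e-2):
    """Find a distortion d pulling q(z|x+d) towards q(z|x_t), with penalty lam*||d||^2."""
    with torch.no_grad():
        mu_t, logvar_t = encode(x_t)            # fixed target posterior
    d = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([d], lr=lr)
    for _ in range(n_steps):
        mu, logvar = encode(x + d)
        loss = kl_diag_gaussians(mu, logvar, mu_t, logvar_t) + lam * d.pow(2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return d.detach()

Sweeping lam, as in the λ grids of the figures above, trades off the size of the distortion against how closely the attacked posterior matches the target's.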



We note that the earliest version of this work appeared in June 2019 (Willetts et al., 2019); this paper extends it. Since then other works, e.g. Camuto et al. (2020), Cemgil et al. (2020), and Barrett et al. (2021), have built on our own to consider this problem of VAE robustness, including investigating it from a more theoretical standpoint.




Figure 1: Adversarial attacks on CelebA for different models. Here we start with the image of Hugh Jackman and introduce an adversary that tries to produce reconstructions that look like Anna Wintour. This is done by applying a distortion (third column) to the original image to produce an adversarial input (second column). We can see that the adversarial reconstruction for the vanilla VAE looks substantially like Wintour, indicating a successful attack. Adding a regularisation term using the β-TCVAE produces an adversarial reconstruction that does not look like Wintour, but it is also far from a successful reconstruction. The hierarchical version of a β-TCVAE (which we call Seatbelt-VAE) is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour.


Figure 2: [Left] Density plot of ||σ_φ(x)||_2 (the norm of the encoder standard deviation) for a VAE, a β-VAE, and a β-TCVAE, each trained on CelebA with β = 10. The β-VAE's posterior variance saturates, while the β-TCVAE's does not, and as such it is able to induce more overlap. [Right] The likelihood (log p_θ(x|z)) and the ELBO for both as functions of β. Clearly model quality degrades to a lesser degree for the TC-penalised models under increasing β.

Figure 3: Attacker's achieved loss ∆_KL (i.e. Eq (1) with r = D_KL) for β-TCVAEs, for different β values and datasets. Higher loss indicates more robustness. Shading corresponds to the 95% CI produced by attacking 20 images for each combination of d_z ∈ {4, 8, 16, 32, 64, 128} and taking 50 geometrically distributed values of λ between 2^{-20} and 2^{20} (giving 1000 total trials). Note that the loss axis is logarithmic. β > 1 clearly induces a much larger loss for the adversary relative to β = 1 for all datasets.

Figure 4: Latent-space attacks (via D_KL) on the rotation of a heart-shaped dSprite, for β-TCVAEs (d_z = 64) and Seatbelt-VAEs (L = 2) with β ∈ {1, 2}. The attacks are conducted by applying a distortion (third column of each image) to the original image (top, first column) to produce an adversarial input (bottom, second column of each image), trying to cause the model to output the target image (bottom, first column). Here we show the most successful adversarial distortion, in terms of adversarial loss, for each model. It is apparent that Seatbelt-VAEs are the most resilient to attack. Note that the distortion plots (bottom right) are scaled to [0, 1] for ease of viewing.

Figure 5: Plots showing the robustness of Seatbelt-VAE (L = 4) and β-TCVAE models for different values of β under three different attack methods: a) latent-space attack via D_KL, as in Eqs (1, 10); b) attack via the model output, as in Eq (2); and c) latent-space attack via the 2-Wasserstein (W_2) distance, as in Eqs (1, 10). Note that the β-TCVAE with β = 1 corresponds to a vanilla VAE, and that L > 1, β = 1 models correspond to hierarchical baselines. We show the negative adversarial likelihood of a target image x_t given an attacked latent representation z* for Faces (1st col) and Chairs (3rd col) respectively. Larger values of -log p_θ(x_t|z*) mean less successful adversarial attacks. We also show the adversarial loss ∆ in the 2nd and 4th cols, which have logarithmic axes. Shading corresponds to the 95% CI over variation for 10 images for each combination of d_z ∈ {4, 8, 16, 32, 64, 128}, with λ taking 50 geometrically distributed values between 2^{-20} and 2^{20}.

Figure 6: Effect of varying β on the reconstructions of TC-penalised models. In sub-figures (a) and (b) we plot the final ELBO of TC-penalised models trained on Chairs and 3D Faces, calculated without the β penalisation applied during training. Shading gives the 95% CI over variation due to d_z ∈ {32, 64, 128} for β-TCVAE, and additionally L ∈ {2, 3, 4, 5} for Seatbelt. As β increases, the ELBO L degrades more slowly for the Seatbelt-VAE relative to the β-TCVAE. (c) serves as a visual confirmation of these results. The top row shows CelebA input data. The bottom row, reconstructions from a Seatbelt-VAE with L = 4 and β = 20, clearly maintains facial identity better than those from a β-TCVAE (middle row): many of the individuals' finer facial features lost by the β-TCVAE are maintained by the Seatbelt-VAE.

Figure D.8: -log p_θ(x_t|z), z ∼ q(z|x + d), where d is some adversarial distortion, for Seatbelt-VAEs trained on (a) 3D Faces and (b) Chairs, over β and L values for latent attacks. We attack the bottom layer (z^1), the top layer (z^L), and finally show the effect of attacking all layers (z̄). Larger values of -log p_θ(x_t|z) correspond to less successful adversarial attacks. Generally, attacking all layers seems to give the attacker a slight advantage (as seen by the slightly lower -log p_θ(x_t|z) values for Faces and Chairs).

Figure E.11: Adversarial attack loss reached vs MIG score for β-TCVAEs trained on Faces and Chairs presented for a range of β = {1, 2, 4, 6, 8, 10} and d z = {8, 32} values.



Robustness of downstream classification tasks under adversarial attack. We consider classifiers trained either on the reconstructed image (denoted p(y|x)) or on the latent representation (p(y|z)). We show accuracy when the model is attacked, resulting in perturbed embeddings z* and reconstructions x*. Parentheses show the drop in accuracy resulting from the attack; the smaller the drop in magnitude, the better.


ACKNOWLEDGEMENTS

This research was directly funded by the Alan Turing Institute under Engineering and Physical Sciences Research Council (EPSRC) grant EP/N510129/1. MW was supported by EPSRC grant EP/G03706X/1. AC was supported by an EPSRC Studentship. SR gratefully acknowledges support from the UK Royal Academy of Engineering and the Oxford-Man Institute. CH was supported by the Medical Research Council, the Engineering and Physical Sciences Research Council, Health Data Research UK, and the Li Ka Shing Foundation. We thank Tomas Lazauskas, Jim Madge, and Oscar Giles from the Alan Turing Institute's Research Engineering team for their help and support.


Recall from the discussion in § 3 that it is gaps, or holes, in the aggregate posterior that adversaries can exploit. We want to close up these holes without degrading the model too much. Rezende & Viola (2018) observed that in regions of Z where the aggregate posterior places no density, the decoder is unconstrained by the ELBO. It is these regions, with their associated unconstrained decoder behaviour, that enable adversaries to have an easy time attacking the model. Thus our aim in making robust VAEs is to have an aggregate posterior that is smooth in the sense of having relatively flat density across Z, and therefore no holes. This is equivalent to overlap, as introduced in Mathieu et al. (2019).
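To make the notion of a hole concrete, one can estimate the aggregate posterior density at any point in Z by Monte Carlo, since q_φ(z) = E_{q(x)} q_φ(z|x). A minimal sketch, assuming a PyTorch encoder returning diagonal-Gaussian parameters; the names encode and xs are ours:

import math
import torch

def log_aggregate_posterior(z, xs, encode):
    """log q(z) ~= log (1/N) sum_n q(z|x_n), for a point z and a sample xs of N datapoints."""
    mu, logvar = encode(xs)                         # [N, d_z] each
    dist = torch.distributions.Normal(mu, (0.5 * logvar).exp())
    log_q_z_given_x = dist.log_prob(z).sum(-1)      # [N]; diagonal Gaussian => sum over dims
    return torch.logsumexp(log_q_z_given_x, dim=0) - math.log(len(xs))

Regions where this estimate is very low relative to the prior density are candidate holes; the regularisation discussed above acts to smooth them out, at least off-axis.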

C MINIBATCH WEIGHTED SAMPLING

As in Chen et al. (2018), applying the β-TC decomposition requires us to estimate terms of the form E_{q_φ(z^i)}[log q_φ(z^i)]. The i = 1 case is covered in the appendix of Chen et al. (2018). First we repeat the argument for i = 1 as made in Chen et al. (2018), but in our notation, and then we cover the case i > 1 for models with the q_φ(z̄|x) factorisation of Seatbelt-VAEs.

C.1 MINIBATCH WEIGHTED SAMPLING FOR β-TCVAES

We denote B_M = {x_1, x_2, ..., x_M}, a minibatch of datapoints drawn uniformly iid from q(x). For any minibatch we have p(B_M) = (1/N)^M. Chen et al. (2018) introduce r(B_M|x), the probability of a sampled minibatch given that one of its members is x and the remaining M - 1 points are sampled iid from q(x), so r(B_M|x) = (1/N)^{M-1}. So then during training, one samples a minibatch {x_1, x_2, ..., x_M} and can estimate E_{q_φ(z^1)}[log q_φ(z^1)] as

E_{q_φ(z^1)}[log q_φ(z^1)] ≈ (1/M) Σ_{i=1}^{M} [ log Σ_{j=1}^{M} q_φ(z^1_i | x_j) - log(NM) ],

where z^1_i is a sample from q_φ(z^1|x_i).
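A minimal sketch of this estimator, assuming a PyTorch encoder with diagonal-Gaussian posteriors; the names z, mu, logvar, and dataset_size are ours:

import math
import torch

def mws_log_qz(z, mu, logvar, dataset_size):
    """Minibatch-weighted-sampling estimate of E[log q(z)] (Chen et al., 2018).

    z:          [M, d] samples, z[i] ~ q(z|x_i)
    mu, logvar: [M, d] per-datapoint posterior parameters for the same minibatch
    """
    M = z.shape[0]
    dist = torch.distributions.Normal(mu.unsqueeze(0), (0.5 * logvar.unsqueeze(0)).exp())
    # log q(z_i | x_j): [M, M] matrix, summing the diagonal-Gaussian log-density over dims
    log_q_zi_xj = dist.log_prob(z.unsqueeze(1)).sum(-1)
    # log q(z_i) ~= logsumexp_j log q(z_i|x_j) - log(N*M)
    log_qz = torch.logsumexp(log_q_zi_xj, dim=1) - math.log(dataset_size * M)
    return log_qz.mean()

The same estimator applies to the marginals q_φ(z_j) by skipping the sum over dimensions, giving the Σ_j E[log q_φ(z^L_j)] terms of the TC decomposition; for the hierarchical case in C.2 below, one replaces the conditioning datapoints x_j by ancestrally sampled (z^{L-1}_j, x_j) pairs.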

C.2 MINIBATCH WEIGHTED SAMPLING FOR SEATBELT-VAES

Here we have the q_φ(z̄, x) factorisation of Seatbelt-VAEs given above. Now, instead of a minibatch of datapoints, we have a minibatch of draws of z^{i-1}:

B^{i-1}_M = {z^{i-1}_1, ..., z^{i-1}_M},

each member of which is the result of sequentially sampling along a chain, starting with some particular datapoint x_m ∼ q(x). For i > 2, members of B^{i-1}_M are drawn z^{i-1}_m ∼ q_φ(z^{i-1}|z^{i-2}_m, x_m), and for i = 2, z^1_m ∼ q_φ(z^1|x_m). Thus each member of this batch B^{i-1}_M is the descendant of a particular datapoint that was sampled in an iid minibatch B_M as defined above. We similarly define r(B^{i-1}_M|z^{i-1}, x) as the probability of selecting a particular minibatch B^{i-1}_M of these values out of our set {(x_n, z^{i-1}_n)} (of cardinality N), given that we have selected into our minibatch one particular pair of values (x, z^{i-1}) from these N values. As above, r(B^{i-1}_M|z^{i-1}, x) = (1/N)^{M-1}.

Now we can consider E_{q_φ(z^i)}[log q_φ(z^i)] for i > 1, following the same steps as in the previous subsection. During training, one samples a minibatch {z^{i-1}_1, z^{i-1}_2, ..., z^{i-1}_M}, where each member is constructed by sampling ancestrally. Then one can estimate E_{q_φ(z^i)}[log q_φ(z^i)] as

E_{q_φ(z^i)}[log q_φ(z^i)] ≈ (1/M) Σ_{k=1}^{M} [ log Σ_{l=1}^{M} q_φ(z^i_k | z^{i-1}_l, x_l) - log(NM) ],

where z^i_k is a sample from q_φ(z^i|z^{i-1}_k, x_k). In our approach we only need terms of this form for i = L, so we have

E_{q_φ(z^L)}[log q_φ(z^L)] ≈ (1/M) Σ_{k=1}^{M} [ log Σ_{l=1}^{M} q_φ(z^L_k | z^{L-1}_l, x_l) - log(NM) ],

where z^L_k is a sample from q_φ(z^L|z^{L-1}_k, x_k).

Figure F.12: Here we measure the robustness of both β-TCVAE and Seatbelt-VAE when Gaussian noise is added to Chairs. Within each plot a range of β values is shown. We evaluate each model's ability to decode a noisy embedding back to the original non-noised data x by measuring the distribution of log p_θ(x|z) for z ∼ q_φ(z|x + aε), where a is a scaling factor taking values in {0.1, 0.5, 1} and ε ∼ N(0, 1); higher values indicate better denoising. We show these likelihood values as density plots for the β-TCVAE in (a) and for the Seatbelt-VAE with L = 4 in (b), taking β ∈ {1, 2, 4, 6, 8, 10}. Note the axis scalings are different for each subplot. We see that for both models using β > 1 produces autoencoders that are better at denoising their inputs. Namely, the mean of the density, i.e. E_{q_φ(z|x+aε)}[log p_θ(x|z)], shifts dramatically to higher values for β > 1 relative to β = 1. In other words, for both these models, the likelihood of the dataset in the noisy setting is much closer to that of the non-noisy dataset when β > 1, across all noise scales.

G IMPLEMENTATION DETAILS

All runs were done on the Azure cloud system on NC6 GPU machines.

G.1 ENCODER AND DECODER ARCHITECTURES

We used the same convolutional network architectures as Chen et al. (2018). For the encoders of all our models (q(·|x)) we used purely convolutional networks with 5 convolutional layers. When training on single-channel (binary/greyscale) datasets such as dSprites, 3D Faces, or Chairs, the 5 layers took the following numbers of filters in order: {32, 32, 64, 64, 512}. For more complex RGB datasets, such as CelebA, the layers had the following numbers of filters in order: {64, 64, 128, 128, 512}. The mean and variance of the amortised posteriors are the output of dense layers acting on the output of the purely convolutional network, where the number of neurons in these layers is equal to the dimensionality of the latent space Z.

Similarly, for the decoders (p(x|z)) of all our models we also used purely convolutional networks, with 6 deconvolutional layers. When training on single-channel datasets (dSprites, 3D Faces, or Chairs), the 6 layers took the following numbers of filters in order: {512, 64, 64, 32, 32, 1}. For CelebA the layers had the following numbers of filters in order: {512, 128, 128, 64, 64, 3}. The mean of the likelihood p(x|·) was directly encoded by the final de-convolutional layer. The variance of the decoder, σ, was fixed to 0.1.

For β-TCVAE the range of d_z values used was {4, 6, 8, 16, 32, 64, 128}. For Seatbelt-VAEs the number of units in each layer z^i decreases sequentially. There is a list of z sizes for each dataset; for a model of L layers, the last L entries give d_{z,i}, i ∈ {1, ..., L}:

{d_z}_Chairs = {96, 48, 24, 12, 6}
{d_z}_3DFaces = {96, 48, 24, 12, 6}
{d_z}_CelebA = {256, 128, 64, 32}

For Seatbelt-VAEs we also have the mappings q_φ(z^{i+1}|z^i, x) and p_θ(z^i|z^{i+1}). These are amortised as MLPs with 2 hidden layers, with batchnorm and Leaky-ReLU activations. The dimensionality of the hidden layers also decreases as a function of the layer index i: d_h(q_φ(z^{i+1}|z^i, x)) = h_sizes[i], with h_sizes = [1024, 512, 256, 128, 64].   (48)

To train the models we used ADAM (Kingma & Ba, 2015) with default parameters, a cosine-decaying learning rate of 0.001, and a batch size of 1024. All data was pre-processed to fall on the interval -1 to 1. CelebA and Chairs were both downsampled and cropped as in Chen et al. (2018) and Kulkarni et al. (2015) respectively. We find that using free-bits regularisation (Kingma et al., 2016) greatly ameliorates the optimisation challenges associated with DLGMs.
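A sketch of the single-channel encoder described above, in PyTorch. The filter counts follow the text; the kernel sizes, strides, and the assumption of 64x64 inputs are ours, as the text does not specify them:

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """5-conv-layer encoder with filter counts {32, 32, 64, 64, 512} (single-channel variant),
    followed by dense heads for the posterior mean and log-variance.
    Kernel sizes and strides are illustrative assumptions, not taken from the paper."""
    def __init__(self, d_z, in_channels=1):
        super().__init__()
        filters = [32, 32, 64, 64]
        layers, c = [], in_channels
        for f in filters:                      # assume 64x64 inputs, halved at each conv
            layers += [nn.Conv2d(c, f, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            c = f
        layers += [nn.Conv2d(c, 512, kernel_size=4), nn.ReLU()]  # 4x4 -> [B, 512, 1, 1]
        self.conv = nn.Sequential(*layers)
        self.fc_mu = nn.Linear(512, d_z)       # dense heads sized to the latent space
        self.fc_logvar = nn.Linear(512, d_z)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

For example, mu, logvar = ConvEncoder(d_z=64)(x) for x of shape [B, 1, 64, 64]; the RGB variant would swap in the {64, 64, 128, 128, 512} filter counts and in_channels=3.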

