IMPROVING VAES' ROBUSTNESS TO ADVERSARIAL ATTACK

Abstract

Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods proposed to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We ameliorate this by applying disentangling methods to hierarchical VAEs. The resulting models produce high-fidelity autoencoders that are also adversarially robust. We confirm their capabilities on several different datasets and with current state-of-the-art VAE adversarial attacks, and also show that they increase the robustness of downstream tasks to attack.

1. INTRODUCTION

Variational autoencoders (VAEs) are a powerful approach to learning deep generative models and probabilistic autoencoders (Kingma & Welling, 2014; Rezende et al., 2014). However, previous work has shown that they are vulnerable to adversarial attacks (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018): an adversary attempts to fool the VAE into producing reconstructions similar to a chosen target by adding distortions to the original input, as shown in Fig 1. This kind of attack can be harmful when the encoder's output is used downstream, as in Xu et al. (2017); Kusner et al. (2017); Theis et al. (2017); Townsend et al. (2019); Ha & Schmidhuber (2018); Higgins et al. (2017b). As VAEs are often themselves used to protect classifiers from adversarial attack (Schott et al., 2019; Ghosh et al., 2019), ensuring VAEs are robust to adversarial attack is an important endeavour. Despite these vulnerabilities, little progress has been made in the literature on how to defend VAEs from such attacks. The aim of this paper is to investigate and introduce possible strategies for defence. We seek to defend VAEs in a manner that maintains reconstruction performance. Further, we are also interested in whether methods for defence increase the robustness of downstream tasks using VAEs. Our first contribution is to show that regularising the variational objective during training can lead to more robust VAEs. Specifically, we leverage ideas from the disentanglement literature (Mathieu et al., 2019) to improve VAEs' robustness by learning smoother, more stochastic representations that are less vulnerable to attack. In particular, we show that the total correlation (TC) term used to encourage independence between latents of the learned representations (Kim & Mnih, 2018; Chen et al., 2018; Esmaeili et al., 2019) also serves as an effective regulariser for learning robust VAEs.
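For reference, the TC term can be made explicit. Following the standard decomposition of the KL regulariser (Chen et al., 2018), with aggregate posterior $q_\phi(z) = \mathbb{E}_{p(x)}\left[q_\phi(z|x)\right]$, the β-TCVAE objective can be written as below; this is the usual presentation from the disentanglement literature rather than anything specific to this paper:

```latex
\mathcal{L}_{\beta\text{-TC}}(x) =
  \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]
  \;-\; I_q(x; z)
  \;-\; \beta \, D_{\mathrm{KL}}\!\Big(q_\phi(z) \,\Big\|\, \textstyle\prod_j q_\phi(z_j)\Big)
  \;-\; \sum_j D_{\mathrm{KL}}\!\left(q_\phi(z_j) \,\|\, p(z_j)\right)
```

The middle KL term is the total correlation of the aggregate posterior; setting β > 1 up-weights it, encouraging independence between latent dimensions.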
Though a clear improvement over the standard VAE, a severe drawback of this approach is that the gains in robustness are coupled with drops in the reconstruction performance, due to the increased regularisation. Furthermore, we find that the achievable robustness with this approach can be limited (see Fig 1) and thus potentially insufficient for particularly sensitive tasks. To address this, we apply TC-regularisation to hierarchical VAEs. By using a richer latent space representation than a standard VAE, the resulting models are not only more robust still to adversarial attacks than single-layer models with TC regularisation, but can also provide reconstructions which are comparable to, and often even better than, the standard (unregularised, single-layer) VAE.
See Figure 1.a) for an example of a successful attack on a vanilla VAE. Here we are trying to turn Hugh Jackman (Original, top left) into Anna Wintour (Target, bottom left). We can see that, by adding a well-chosen distortion (Distortion, bottom right), the reconstruction of Jackman goes from looking like a somewhat blurry version of the input (Original rec., top middle) to a somewhat blurry version of Wintour (Adversarial rec., bottom middle). The adversary has achieved their goal. Adding a regularisation term using the β-TCVAE produces an adversarial reconstruction that no longer looks like Wintour, but it is also far from a faithful reconstruction of Jackman. The hierarchical version of a β-TCVAE that we propose, the Seatbelt-VAE, is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour.

To summarise our contributions: we provide insights into what makes VAEs vulnerable to attack and how we might go about defending them, and we unearth novel connections between disentanglement and adversarial robustness.
We demonstrate that regularised VAEs, trained with an up-weighted total correlation, are much more robust to attacks than vanilla VAEs. Building on this, we develop regularised hierarchical VAEs that are more robust still and offer improved reconstructions. Finally, we show that robustness to adversarial attack also confers increased robustness on downstream tasks.

2. BACKGROUND: ATTACKING VAES

In adversarial attacks an agent is trying to manipulate the behaviour of some model towards a goal of their choosing (Akhtar & Mian, 2018; Gilmer et al., 2018). For many deep learning models, very small changes in the input can produce large changes in the output. Recall that a VAE defines a deep latent variable model p_θ(x, z) = p_θ(x|z)p(z) over data x and latents z, trained by amortised stochastic variational inference with an approximate posterior q_φ(z|x) = N(µ_φ(x), Σ_φ(x)) to maximise the evidence lower bound (ELBO), E_{q_φ(z|x)} log p_θ(x|z) − D_KL(q_φ(z|x)||p(z)) ≤ log p(x). Attacks on VAEs have been proposed where the adversary looks to apply small input distortions that produce reconstructions close to a target adversarial image (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018); an example is shown in Fig 1. Unlike in more established adversarial settings, only a small number of such VAE attacks have been suggested in the literature. The current known most effective mode of attack is the latent space attack (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018; Kos et al., 2018). This aims to find a distorted image x* = x + d such that its posterior q_φ(z|x*) is close to that of the adversary's chosen target image, q_φ(z|x^t), under some metric. This then implies that the likelihood p_θ(x^t|z) is high when given draws from the posterior of the adversarial example. It is particularly important to be robust to this attack if one is concerned with using the encoder network of a VAE as part of a downstream task. For a VAE with a single stochastic layer, the latent-space adversarial objective is

∆_r(x, d, x^t; λ) = r(q_φ(z|x + d), q_φ(z|x^t)) + λ||d||_2,    (1)

where r(·, ·) is some divergence or distance, commonly D_KL (Tabacof et al., 2016; Gondim-Ribeiro et al., 2018). We penalise the L_2 norm of d too, so as to aim for attacks that change the image less. We can then simply optimise over the distortion d. Alternatively, the adversary can aim to directly increase the ELBO for the target datapoint (Kos et al., 2018):

∆_output(x, d, x^t; λ) = E_{q_φ(z|x+d)} log p_θ(x^t|z) − D_KL(q_φ(z|x + d)||p(z)) + λ||d||_2.    (2)
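To make objective (1) concrete, the sketch below runs a latent-space attack against a toy linear-Gaussian encoder. Everything here (the encoder q(z|x) = N(Wx, σ²I), its weights, and the use of a squared rather than plain L2 penalty for smooth gradients) is an illustrative assumption, not a model or attack implementation from this paper; with equal covariances the KL between posteriors reduces to a scaled squared distance between means, so the whole objective has a closed-form gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained VAE encoder: q_phi(z|x) = N(W x, sigma2 * I).
# W and sigma2 are illustrative assumptions; any differentiable encoder
# could be attacked the same way (with autodiff instead of a hand gradient).
D, Z = 16, 4                             # data and latent dimensions
W = rng.normal(size=(Z, D)) / np.sqrt(D)
sigma2 = 0.1

def posterior_kl(x_a, x_b):
    """D_KL(q(z|x_a) || q(z|x_b)); equal covariances leave only the mean term."""
    diff = W @ x_a - W @ x_b
    return 0.5 * float(diff @ diff) / sigma2

def latent_space_attack(x, x_t, lam=0.01, lr=0.02, steps=500):
    """Gradient descent on KL(q(z|x+d), q(z|x_t)) + lam * ||d||^2 over d."""
    d = np.zeros(D)
    for _ in range(steps):
        # Gradient of the KL term plus the (squared) distortion penalty.
        grad = W.T @ (W @ (x + d) - W @ x_t) / sigma2 + 2.0 * lam * d
        d -= lr * grad
    return d

x = rng.normal(size=D)    # original input
x_t = rng.normal(size=D)  # adversary's target
d = latent_space_attack(x, x_t)

print(f"KL to target before attack: {posterior_kl(x, x_t):.3f}")
print(f"KL to target after attack:  {posterior_kl(x + d, x_t):.3f}")
```

Running this, the posterior of the distorted input is driven close to the target's posterior while the λ penalty keeps the distortion small, which is exactly the trade-off the adversary in (1) optimises.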

3. DEFENDING VAES

This problem was not considered by prior works.¹ To address it, we first need to consider what makes VAEs vulnerable to adversarial attacks. We argue that two key factors dictate whether we can perform a successful attack on a VAE: a) whether we can induce significant changes in the encoding distribution q_φ(z|x) through only small changes in the data x, and b) whether we can induce significant changes in the reconstructed images through only small changes to the latents z. The first of these relates to the smoothness of the encoder mapping, the latter to the smoothness of the decoder mapping.
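Factor a) can be probed empirically with a rough local-sensitivity estimate of the encoder mean around a datapoint: the larger the ratio of encoding change to input change, the easier it is to move q_φ(z|x) with a small distortion. The sketch below is a minimal illustration on a toy two-layer encoder; the network, its random weights, and the finite-difference probe are illustrative assumptions, not a diagnostic used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy encoder mean function mu(x) for a posterior q_phi(z|x) = N(mu(x), ...).
# The two-layer tanh network and its weights are illustrative assumptions.
D, H, Z = 16, 32, 4
W1 = rng.normal(size=(H, D)) / np.sqrt(D)
W2 = rng.normal(size=(Z, H)) / np.sqrt(H)

def mu(x):
    return W2 @ np.tanh(W1 @ x)

def encoder_sensitivity(x, eps=1e-3, trials=200):
    """Estimate local sensitivity: max over random small perturbations of
    ||mu(x + delta) - mu(x)|| / ||delta||, with ||delta|| fixed to eps."""
    worst = 0.0
    for _ in range(trials):
        delta = rng.normal(size=D)
        delta *= eps / np.linalg.norm(delta)   # normalise to length eps
        worst = max(worst, float(np.linalg.norm(mu(x + delta) - mu(x)) / eps))
    return worst

x = rng.normal(size=D)
s = encoder_sensitivity(x)
print(f"estimated local encoder sensitivity at x: {s:.3f}")
```

A smoother (and, relative to the posterior scale, more stochastic) encoder drives this ratio down, which is the intuition behind why TC-style regularisation helps robustness.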



¹ We note that the earliest version of this work appeared in June 2019 (Willetts et al., 2019) and is here extended. Since then, other works, e.g. Camuto et al. (2020); Cemgil et al. (2020); Barrett et al. (2021), have built on our own to consider this problem of VAE robustness, including investigating it from a more theoretical standpoint.



Figure 1. Latent-space adversarial attacks on CelebA for different models: a) Vanilla VAE, b) β-TCVAE, c) our proposed Seatbelt-VAE. Clockwise within each plot we show the initial input, its reconstruction, the best adversarial input the adversary could produce, the adversarial distortion that was added to make the adversarial input, the adversarial input's reconstruction, and the target image. We are trying to make the initial input (Hugh Jackman) look like the target (Anna Wintour). The adversarial reconstruction for the Vanilla VAE looks substantially like Wintour, indicating a successful attack. The β-TCVAE adversarial reconstruction does not look like Wintour, so the attack has not been successful, but it is not Jackman either. Our proposed model, the Seatbelt-VAE, is sufficiently hard to attack that the output under attack still looks like Jackman, not Wintour.
