IMPROVING MODEL ROBUSTNESS WITH LATENT DISTRIBUTION LOCALLY AND GLOBALLY

Abstract

We propose a novel adversarial training method which leverages both local and global information to defend against adversarial attacks. Existing adversarial training methods usually generate adversarial perturbations locally in a supervised manner and fail to consider the data manifold information in a global way. Consequently, the resulting adversarial examples may corrupt the underlying data structure and are typically biased towards the decision boundary. In this work, we exploit both the local and global information of the data manifold to generate adversarial examples in an unsupervised manner. Specifically, we design our novel framework via an adversarial game between a discriminator and a classifier: the discriminator is learned to differentiate the latent distributions of the natural data and the perturbed counterpart, while the classifier is trained to recognize accurately the perturbed examples as well as to enforce the invariance between the two latent distributions. We conduct a series of analyses on model robustness and verify the effectiveness of our proposed method empirically. Experimental results show that our method substantially outperforms the recent state-of-the-art (i.e. Feature Scattering) in defending against adversarial attacks by a large accuracy margin (e.g. 17.0% and 18.1% on the SVHN dataset, 9.3% and 17.4% on the CIFAR-10 dataset, and 6.0% and 16.2% on the CIFAR-100 dataset for defending against PGD20 and CW20 attacks, respectively).

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved impressive performance on a broad range of datasets (LeCun et al., 2015; He et al., 2016; Gers et al., 1999), yet can be easily fooled by adversarial examples or perturbations. Adversarial examples have been shown to be ubiquitous across different tasks such as image classification (Goodfellow et al., 2014), segmentation (Fischer et al., 2017), and speech recognition (Carlini & Wagner, 2018). Overall, adversarial examples raise great concerns about the robustness of learning models and have drawn enormous attention over recent years. To defend against adversarial examples, great efforts have been made to improve model robustness (Kannan et al., 2018; You et al., 2019; Wang & Zhang, 2019; Zhang & Wang, 2019). Most of them are based on adversarial training, i.e. training the model with adversarially-perturbed samples rather than clean data (Goodfellow et al., 2014; Madry et al., 2017; Lyu et al., 2015). In principle, adversarial training is a min-max game between the adversarial perturbations and the classifier: the indistinguishable adversarial perturbations are designed to mislead the output of the classifier, while the classifier is trained to produce accurate predictions for these perturbed inputs. Currently, the adversarial perturbations are mainly computed by enforcing output invariance in a supervised manner (Madry et al., 2017). Despite its effectiveness in some scenarios, it has recently been observed that these approaches may still be limited in defending against adversarial examples. In particular, we argue that current adversarial training approaches are typically conducted in a local and supervised way and fail to consider the overall data manifold information globally; such information, however, proves crucially important for attaining better generalization.
As a result, the generated adversarial examples may corrupt the underlying data structure and are typically biased towards the decision boundary. Therefore, the well-generalizing features inherent to the data distribution might be lost, which limits the ability of DNNs to defend against adversarial examples even when adversarial training is applied (Ilyas et al., 2019a; Schmidt et al., 2018). For illustration, we show a toy example in Figure 1: the adversarial examples generated by PGD, one of the most successful adversarial training methods, corrupt the data manifold, which would inevitably lead to poor performance if training were conducted on these perturbed examples. On the other hand, the current state-of-the-art method Feature Scattering (Zhang & Wang, 2019) can partially alleviate this problem, but still corrupts the data manifold. To address this limitation, we propose a novel method called Adversarial Training with Latent Distribution (ATLD), which additionally considers the data distribution globally in an unsupervised fashion. In this way, the data manifold can be well preserved, which is beneficial for attaining better model generalization. Moreover, since label information is not required when computing the adversarial perturbations, the resulting adversarial examples are not biased towards the decision boundary. This can be clearly observed in Figure 1(d). Our method can be divided into two steps. First, we train the deep model with adversarial examples that maximize the divergence between the latent distributions of the clean data and the adversarial counterpart, rather than maximizing the loss function; we reformulate this as a minimax game between a discriminator and a classifier.
The adversarial examples are crafted by the discriminator to implicitly separate the latent distributions of clean and perturbed data, while the classifier is trained to decrease the discrepancy between these two latent distributions as well as to promote accurate classification of the adversarial examples, as Figure 2 shows. Then, during the inference procedure, we generate specific perturbations through the discriminator network to diminish the impact of the adversarial attack, as shown in Figure 6 in the Appendix. On the empirical front, with toy examples, we show that our proposed method can preserve more information of the original distribution and learn a better decision boundary than existing adversarial training methods. We also test our method on three different datasets, CIFAR-10, CIFAR-100 and SVHN, with the well-known PGD, CW and FGSM attacks. Our ATLD method outperforms the state-of-the-art methods by a large margin, e.g. ATLD improves over Feature Scattering (Zhang & Wang, 2019) by 17.0% and 18.1% on SVHN for PGD20 and CW20 attacks. Our method also shows a large superiority over the conventional adversarial training method (Madry et al., 2017), boosting the performance by 32.0% and 30.7% on SVHN for PGD20 and CW20 attacks.

2. RELATED WORK

Adversarial Training. Adversarial training is a family of techniques for improving model robustness (Madry et al., 2017; Lyu et al., 2015). It trains DNNs with adversarially-perturbed samples instead of clean data. Some approaches extend conventional adversarial training by injecting adversarial noise into the hidden layers to boost the robustness of the latent space (Ilyas et al., 2019b; You et al., 2019; Santurkar et al., 2019; Liu et al., 2019). All of these approaches generate adversarial examples by maximizing the loss function with label information. However, the structure of the data distribution is destroyed, since the perturbed samples can be highly biased towards the non-optimal decision boundary (Zhang & Wang, 2019). Our proposed method follows a similar training scheme, replacing clean data with the perturbed counterpart. Nevertheless, our method generates the adversarial perturbations without label information, which weakens the impact of the non-optimal decision boundary and retains more information of the underlying data distribution.

Manifold-based Defense. Some methods leverage the manifold or distribution structure for defense (et al., 2018; Miyato et al., 2017). Others are designed to enforce local smoothness around the natural examples by penalizing the difference between the outputs of adversarial examples and their clean counterparts (Kannan et al., 2018; Chan et al., 2020; Jakubovitz & Giryes, 2018). All of these methods leverage only the local information of the distribution or manifold. Differently, our method generates perturbations that additionally consider the structure of the distribution globally.

Unsupervised Domain Adversarial Training. Domain Adversarial Training shares a training scheme similar to our method where the classifier and discriminator compete with each other (Odena et al., 2017; Long et al., 2018; Ganin et al., 2016) . However, its objective is to reduce the gap between the source and target distributions in the latent space. The discriminator is used to measure the divergence between these two distributions in the latent space. The training scheme of our method is also based on competition between the classifier and discriminator. Different from the previous framework, the discriminator of our method is used to capture the information of distributions of adversarial examples and clean counterparts in the latent space which helps generate the adversarial perturbations.

GAN-based Adversarial Training Methods. Several GAN-based methods leverage GANs to learn the clean data distribution and purify adversarial examples by projecting them onto the clean data manifold before classification (Meng & Chen, 2017; Metzen et al., 2017). The GAN framework can also be used to generate adversarial examples (Baluja & Fischer, 2018): the generator produces adversarial examples to deceive both the discriminator and the classifier, while the discriminator and classifier attempt to differentiate the adversaries from clean data and to produce the correct labels, respectively. Some adversary detector networks, in which a pretrained network is augmented with a binary detector network, have also been proposed to detect adversarial examples and can be well aligned with our method (Gong et al., 2017; Grosse et al., 2017).

3.1. ADVERSARIAL TRAINING

Let us first introduce the widely-adopted adversarial training method for defending against adversarial attacks. Specifically, it solves the following minimax optimization problem through training:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{x' \in S_x} L(x', y; \theta) \right], \quad (1)$$

where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$ are respectively the clean data samples and the corresponding labels drawn from the dataset $\mathcal{D}$, and $L(\cdot)$ is the loss function of the DNN with model parameters $\theta \in \mathbb{R}^m$. Furthermore, we denote the clean data distribution as $Q_0$, i.e. $x \sim Q_0$, and denote $x' \in \mathbb{R}^n$ as a perturbed sample in the feasible region $S_x \triangleq \{z : z \in B(x, \epsilon) \cap [-1.0, 1.0]^n\}$, with $B(x, \epsilon) \triangleq \{z : \|x - z\|_\infty \leq \epsilon\}$ being the $\ell_\infty$-ball centered at $x$ with radius $\epsilon$. By defining $f_\theta(\cdot)$ as the mapping from the input layer to the last latent layer, we can also rewrite the loss of the DNN as $l(f_\theta(x), y)$, where $l(\cdot)$ denotes the loss computed from the last hidden layer, e.g. the cross-entropy loss as typically used in DNNs. Whilst the outer minimization can be conducted by training to find the optimal model parameters $\theta$, the inner maximization essentially generates the strongest adversarial attack for a given set of model parameters $\theta$. In general, the solution to the minimax problem can be found by training a network to minimize the loss on worst-case adversarial examples, so as to attain adversarial robustness. Given a set of model parameters $\theta$, the commonly adopted solutions to the inner maximization problem lead to either one-step (e.g., FGSM) or multi-step (e.g., PGD) approaches (Madry et al., 2017). In particular, for a given single point $x$, the strongest adversarial example $x'$ at the $(t{+}1)$-th iteration can be obtained iteratively by the following update rule:

$$x'_{t+1} = \Pi_{S_x}\left(x'_t + \alpha \cdot \mathrm{sgn}\left(\nabla_x L(x'_t, y; \theta)\right)\right), \quad (2)$$

where $\Pi_{S_x}(\cdot)$ is a projection operator that projects its input onto the region $S_x$, $\mathrm{sgn}(\cdot)$ is the sign function, and $\alpha$ is the update step size.
For the initialization, $x'_0$ can be generated by random sampling in $B(x, \epsilon)$. It appears in (1) that each perturbed sample $x'$ is obtained individually by leveraging its loss function $L(x', y; \theta)$ with its label $y$. However, without considering the inter-relationship between samples, we may lose the global knowledge of the data manifold structure, which proves highly useful for attaining better generalization. This issue has been studied in a recent work (Zhang & Wang, 2019), where a new method named Feature Scattering made a first step towards considering the inter-sample relationship within a batch; unfortunately, this approach does not take full advantage of the global knowledge of the entire data distribution. In addition, relying on the maximization of the loss function, the adversarially-perturbed data samples may be highly biased towards the decision boundary, which potentially corrupts the structure of the original data distribution, especially when the decision boundary is non-optimal (see Figure 1 again for the illustration).
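To make the update rule concrete, here is a minimal NumPy sketch of the PGD inner maximization on a toy quadratic loss. The loss, its gradient, and all constants are illustrative stand-ins, not the paper's actual model:

```python
import numpy as np

def pgd_attack(x, grad_fn, epsilon=0.3, alpha=0.1, steps=20, seed=0):
    """Multi-step PGD: ascend the (surrogate) loss gradient, then project
    back onto S_x, the epsilon-ball around x intersected with [-1, 1]^n."""
    rng = np.random.default_rng(seed)
    # Random initialization inside B(x, epsilon), as described in the text.
    x_adv = x + rng.uniform(-epsilon, epsilon, size=x.shape)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Projection Pi_{S_x}: clip to the epsilon-ball, then to [-1, 1].
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, -1.0, 1.0)
    return x_adv

# Toy loss L(x') = 0.5 * ||x' - t||^2 with gradient (x' - t); ascending it
# drives x' away from t while the projection keeps it inside S_x.
t = np.array([0.9, -0.9])
x = np.array([0.0, 0.0])
x_adv = pgd_attack(x, grad_fn=lambda z: z - t)
```

Note that the projection is what distinguishes PGD from plain gradient ascent: however many steps are taken, the result stays inside the feasible region $S_x$.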

3.2. DIVERGENCE ESTIMATION

To measure the discrepancy between two distributions, statistical divergence measures (e.g., Kullback-Leibler and Jensen-Shannon divergence) have been proposed. In general, given two distributions $P$ and $Q$ with continuous density functions $p(x)$ and $q(x)$ respectively, the $f$-divergence is defined as $D_f(P\|Q) \triangleq \int_X q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$. The exact computation of the $f$-divergence is challenging, and its estimation from samples has attracted much interest. For instance, leveraging variational methods, Nguyen et al. (2010) propose a method for estimating the $f$-divergence from samples only; Nowozin et al. (2016) extend this method by estimating the divergence while learning the parameters of a discriminator. Specifically, the $f$-divergence between two distributions $P$ and $Q$ can be lower-bounded using the Fenchel conjugate and Jensen's inequality (Nowozin et al., 2016):

$$D_f(P\|Q) = \int_X q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t\, \frac{p(x)}{q(x)} - f^*(t) \right\} dx \geq \sup_{T \in \tau} \left( \int_X p(x) T(x)\, dx - \int_X q(x) f^*(T(x))\, dx \right) = \sup_W \left( \mathbb{E}_{x \sim P}[g_f(V_W(x))] + \mathbb{E}_{x \sim Q}[-f^*(g_f(V_W(x)))] \right), \quad (3)$$

where $V_W : X \to \mathbb{R}$ is a discriminator network with parameters $W$, and $g_f : \mathbb{R} \to \mathrm{dom}_{f^*}$ is an output activation function determined by the type of divergence; $\tau$ is an arbitrary class of functions $T : X \to \mathbb{R}$; $f$ is a convex lower semi-continuous function and $f^*$ is its conjugate, defined by $f^*(t) = \sup_{u \in \mathrm{dom}_f}[ut - f(u)]$. The discriminator objective of GANs is a special case of (3) with the activation function $g_f(t) = -\log(1 + e^{-t})$ and $f^*(g) = -\log(2 - e^g)$; it approximates the Jensen-Shannon divergence between the real and fake distributions. Arjovsky et al. (2017) also develop a method to estimate the Wasserstein distance with a neural network. In this paper, these methods are used to estimate the Jensen-Shannon divergence between the latent distributions induced by adversarial and clean examples.
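The sample-based estimator can be sketched as follows, using the familiar GAN form $\mathbb{E}_P[\log D] + \mathbb{E}_Q[\log(1 - D)]$ with $D = \sigma(V_W)$. This is a minimal NumPy sketch with a fixed, hand-picked discriminator rather than a learned one; at the optimum the quantity equals $2\,\mathrm{JS}(P\|Q) - \log 4$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_js_bound(samples_p, samples_q, v):
    """Sample-based variational bound E_P[log D] + E_Q[log(1 - D)],
    with D = sigmoid(V). Maximizing over V approaches 2*JS(P, Q) - log 4."""
    d_p = sigmoid(v(samples_p))
    d_q = sigmoid(v(samples_q))
    return np.mean(np.log(d_p)) + np.mean(np.log(1.0 - d_q))

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 10000)
q = rng.normal(0.0, 1.0, 10000)
# With identical distributions the best discriminator is D = 1/2 (V = 0),
# at which the bound attains its maximum value -log 4 (i.e. JS = 0).
val = gan_js_bound(p, q, v=lambda x: np.zeros_like(x))
```

In practice `v` would be the discriminator network $V_W$ and the bound would be maximized over $W$ by gradient ascent; the constant discriminator here only illustrates the estimator itself.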

4. ADVERSARIAL TRAINING WITH LATENT DISTRIBUTION

As discussed in Section 3.1, conventional adversarial training methods rely on the knowledge of data labels. As a result, the local information used to generate adversarial examples may be biased towards the decision boundary, and such individual adversarial example generation does not capture the global knowledge of the data manifold. To alleviate these limitations, we propose a novel method that computes the perturbed samples by leveraging the global knowledge of the whole data distribution, disentangling them from the data labels and the loss function. Generally speaking, the perturbations are generated to enlarge the discrepancy between the latent distributions induced by clean and adversarial data. Formally, we try to identify the set of adversarial examples $X^{adv}$ that induce in the latent space, through $f_\theta(\cdot)$, a distribution $P^*_\theta$ that is most different from the latent distribution $Q_\theta$ induced by the clean samples $X^{org} = \{x : x \sim Q_0\}$, without resorting to the corresponding labels $Y$. In other words, the resulting adversarial examples can be deemed manifold adversarial examples, which 'deceive' the manifold rather than fool the classifier, as in the definition of traditional adversarial examples. It is noted that the latent space to be perturbed could be any hidden layer, though in this paper it is the last hidden layer before the softmax of the DNN. The optimization problem of the proposed adversarial training can then be reformulated as follows:

$$\min_\theta \left\{ \mathbb{E}_{f_\theta(x^{adv}) \sim P^*_\theta} \left[\, l(f_\theta(x^{adv}), y) \,\right] + D_f(P^*_\theta \| Q_\theta) \right\} \quad (4)$$

$$\text{s.t.} \quad P^*_\theta = \arg\max_{P_\theta \in \mathbb{P}} D_f(P_\theta \| Q_\theta), \quad (5)$$

where $l(\cdot)$ and $y$ are defined as before, and $D_f(\cdot\|\cdot)$ is an $f$-divergence measure between two distributions. $\mathbb{P} = \{P : f_\theta(x') \sim P \text{ subject to } \forall x \sim Q_0,\ x' \in B(x, \epsilon)\}$ is the feasible region for the latent distribution $P_\theta$, which is induced by the set of perturbed examples $X'$ through $f_\theta(\cdot)$.
$f_\theta(x')$ and $f_\theta(x^{adv})$ represent the latent features of the perturbed example $x'$ and the adversarial example $x^{adv}$, respectively. Intuitively, we try to obtain the worst-case latent distribution $P^*_\theta$, induced by $X^{adv}$ through $f_\theta(\cdot)$ within the region $\mathbb{P}$, while the model parameters $\theta$ are learned to minimize both the classification loss on the latent features $f_\theta(x^{adv}) \sim P^*_\theta$ (or equivalently the adversarial examples $x^{adv} \in X^{adv}$) and the $f$-divergence between the latent distributions $P^*_\theta$ and $Q_\theta$ induced by the adversarial examples $X^{adv}$ and the clean data $X^{org}$. Solving the above optimization problem is challenging, since both the objective function and the constraint are entangled with the adversarial examples $X^{adv}$ and the model parameters $\theta$. To make the problem more tractable, we propose a novel Adversarial Training with Latent Distribution (ATLD) method. In the next subsection, by taking into account the entire data distribution globally, we first focus on the constraint and identify the adversarial examples $X^{adv}$ through the maximization problem. We then solve the minimization of the objective function with the adversarial training procedure. To further enhance the performance, we add specific perturbations, in a procedure named Inference with Manifold Transformation (IMT, Section 4.2), to input samples to push them towards the more separable natural data manifold. Finally, we classify the transformed data points with the adversarially-trained model.

4.1. GENERATING ADVERSARIAL EXAMPLES FOR TRAINING

First, we optimize the constraint (5) to generate the adversarial examples, or equivalently the induced distribution $P^*_\theta$, for training. Intuitively, the adversarial examples $X^{adv}$ are crafted to maximize the divergence between the latent distributions induced by the natural examples $X^{org}$ and the adversarial counterpart $X^{adv}$ in an unsupervised fashion, since no knowledge of the labels $Y$ is required. Together with the objective function in (4), our proposed adversarial training method minimizes this divergence as well as the classification error on the adversarial examples $X^{adv}$. However, evaluating the divergence between two latent distributions is a challenging task. To make it more tractable, we leverage a discriminator network to estimate the Jensen-Shannon divergence between the two distributions $P^*_\theta / P_\theta$ and $Q_\theta$, following Section 3.2. It is noted again that class label information is not used for generating the adversarial examples; hence the adversarial examples are still generated in an unsupervised way. Then, by using (3), the optimization problem in (4) and (5) can be approximated in a tractable way as follows:

$$\min_\theta \sum_{i=1}^N \underbrace{L(x_i^{adv}, y_i; \theta)}_{L_f} + \sup_W \sum_{i=1}^N \underbrace{\left[ \log D_W(f_\theta(x_i^{adv})) + \log\left(1 - D_W(f_\theta(x_i))\right) \right]}_{L_d} \quad (6)$$

$$\text{s.t.} \quad x_i^{adv} = \arg\max_{x'_i \in B(x_i, \epsilon)} \left[ \log D_W(f_\theta(x'_i)) + \log\left(1 - D_W(f_\theta(x_i))\right) \right],$$

where $N$ denotes the number of training samples and $D_W$ denotes the discriminator network with a sigmoid output activation and parameters $W$; $f_\theta(x_i)$ is the latent feature of the clean sample $x_i$. $D_W$ is used to determine whether a latent feature comes from the adversarial manifold (i.e. it outputs the manifold label of the latent feature). For ease of description, we denote the two components of Eq. (6) as $L_f$ and $L_d$: $L_d$ is the manifold loss and $L_f$ is the loss of the classifier network.

We now interpret the above optimization problem. By comparing Eq. (6) with Eq. (4), it can be observed that the Jensen-Shannon divergence between $P^*_\theta$ and $Q_\theta$ is approximated by the supremum term $\sup_W \sum_{i=1}^N L_d$. However, when the latent distributions are multi-modal, which is the realistic scenario due to the nature of multi-class classification, it is challenging for the discriminator to measure the divergence between such distributions. Several works reveal a high risk of failure when discriminator networks measure only a fraction of the components underlying different distributions (Arjovsky & Bottou, 2017; Che et al., 2016). Ma (2018) also shows that two different distributions are not guaranteed to be identical even if the discriminator is fully confused. To alleviate this problem, we additionally train the discriminator $D_W$ to predict the class labels of the latent features, as in (Odena et al., 2017; Long et al., 2018). Problem (6) can then be reformulated as:

$$\min_\theta \sum_{i=1}^N \underbrace{L(x_i^{adv}, y_i; \theta)}_{L_f} + \sup_W \sum_{i=1}^N \underbrace{\left[ \log D^0_W(f_\theta(x_i^{adv})) + \log\left(1 - D^0_W(f_\theta(x_i))\right) \right]}_{L^0_d} + \min_W \sum_{i=1}^N \underbrace{\left[ l(D^{1:C}_W(f_\theta(x_i)), y_i) + l(D^{1:C}_W(f_\theta(x_i^{adv})), y_i) \right]}_{L^{1:C}_d} \quad (7)$$

$$\text{s.t.} \quad x_i^{adv} = \arg\max_{x'_i \in B(x_i, \epsilon)} \left[ \log D^0_W(f_\theta(x'_i)) + \log\left(1 - D^0_W(f_\theta(x_i))\right) \right].$$

Here $D^0_W$ is the first dimension of the discriminator output, which indicates the manifold label of a latent feature; $D^{1:C}_W$ denotes the remaining $C$ dimensions, used to output the class label of the latent feature; $C$ is the number of classes; and $L^0_d$ and $L^{1:C}_d$ are the manifold loss and the classification loss of the discriminator network, respectively. (Detailed derivations of Eq. (6) and Eq. (7) can be found in the Appendix.) The detailed training procedure of our framework is depicted in Figure 2.

Remarks. It is worth noting that no label information is required for generating the adversarial examples.
Therefore, our method prevents the perturbed examples from being highly biased towards the decision boundary, and more information of the original distribution structure is preserved. In addition, since the discriminator is trained on the whole dataset (both clean and adversarial examples), it captures the global information of the data manifold. Consequently, by training with adversarial examples generated according to the manifold loss of the discriminator, our method can improve model robustness against adversarial examples with the global structure of the data distribution.
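A minimal sketch of this unsupervised perturbation step may help. Here the latent map $f_\theta$ and the manifold head $D^0_W$ are tiny hand-made stand-ins, autograd is replaced by a finite-difference gradient, and the single sign-step follows the closed-form $\ell_\infty$ solution derived in the Appendix; none of this is the paper's implementation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def numerical_grad(fn, x, h=1e-5):
    """Central-difference gradient; stands in for autograd in this sketch."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (fn(x + e) - fn(x - e)) / (2 * h)
    return g

# Hypothetical stand-ins for the latent map f_theta and manifold head D^0_W.
W = np.array([[1.0, -0.5], [0.3, 0.8]])
w = np.array([0.7, -1.2])

def manifold_logprob(x):
    # log D^0_W(f_theta(x)): log-probability that the latent code of x
    # lies on the adversarial manifold.
    return np.log(sigmoid(w @ np.tanh(W @ x)))

def atld_perturb(x, epsilon=8.0 / 255.0):
    # One-step unsupervised perturbation: ascend the manifold loss with a
    # single sign-step; no class label is involved anywhere.
    return x + epsilon * np.sign(numerical_grad(manifold_logprob, x))

x = np.array([0.2, -0.1])
x_adv = atld_perturb(x)
```

The key property illustrated is that the perturbation direction comes entirely from the discriminator's manifold head, never from a class label or the classification loss.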

4.2. INFERENCE WITH MANIFOLD TRANSFORMATION

To enhance the generalization of ATLD, we further develop a new inference method with manifold transformation. Although adversarially-trained models can recognize adversarial examples well, there are still potential examples which are easily misclassified, especially unseen data. In other words, generalization to adversarial examples is hard to achieve due to the more complex distribution of adversarial examples (Schmidt et al., 2018; Zhai et al., 2019). To alleviate this problem, we transform the input sample before classification so that it moves towards the natural data manifold. Specifically, the input sample is fed into our adversarially-trained model, and the discriminator outputs the probability that the sample lies on the adversarial manifold. If this probability is higher than a certain threshold, we compute the transformed example $x^t$ by adding a specific perturbation $r^*$ to the input sample $x$ to reduce this probability. The perturbation is computed as

$$r^* = \arg\min_{\|r\|_\infty \leq \epsilon} \log D^0_W(f_\theta(x + r)).$$

Intuitively, reducing the probability that a data point lies on the adversarial manifold means that the point moves towards the benign-example manifold after the perturbation $r^*$ is added. In other words, it becomes more separable, since the benign-example manifold is further away from the decision boundary. When the probability of the input lying on the adversarial manifold is lower than the threshold, we still add such a perturbation to make the input more separable, but with a smaller magnitude. In the experimental part, we show that this perturbation can move adversarial examples away from the decision boundary. The whole inference procedure is shown in Figure 5 in the Appendix.
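The inference-time transformation can be sketched analogously, descending the manifold log-probability with one sign-step and shrinking the step when the sample already looks natural. As before, the latent map and discriminator head are hypothetical stand-ins, and a finite-difference gradient replaces autograd:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def numerical_grad(fn, x, h=1e-5):
    """Central-difference gradient; stands in for autograd in this sketch."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (fn(x + e) - fn(x - e)) / (2 * h)
    return g

# Hypothetical discriminator manifold head, as in the Section 4.1 sketch.
W = np.array([[1.0, -0.5], [0.3, 0.8]])
w = np.array([0.7, -1.2])

def manifold_logprob(x):
    return np.log(sigmoid(w @ np.tanh(W @ x)))

def imt_transform(x, epsilon=8.0 / 255.0, threshold=0.5):
    """One-step IMT: descend log D^0_W(f_theta(x)) to push x towards the
    natural-example manifold; use a smaller step when the sample already
    looks natural (adversarial-manifold probability below the threshold)."""
    prob_adv = sigmoid(w @ np.tanh(W @ x))
    step = epsilon if prob_adv > threshold else epsilon / 2.0
    return x - step * np.sign(numerical_grad(manifold_logprob, x))

x = np.array([0.2, -0.1])
x_t = imt_transform(x)
```

The transformed point `x_t` is then fed to the adversarially-trained classifier; the sign-descent mirrors the sign-ascent used to craft training perturbations, just with the opposite goal.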

5. EXPERIMENTS

We conduct experiments on three widely-used datasets: CIFAR-10, SVHN, and CIFAR-100. Following the Feature Scattering method (Zhang & Wang, 2019), we adopt WideResNet (Zagoruyko & Komodakis, 2016) as the basic structure of both our classifier and discriminator. During the training phase, the initial learning rate is empirically set to 0.1 for all three datasets. We train our model for 400 epochs with transition epochs 60 and 90 and a decay rate of 0.1. The input perturbation budget is set to $\epsilon = 8$ with a label smoothing rate of 0.5. We use $\ell_\infty$ perturbations throughout this paper, for both training and evaluation. We evaluate the various models under white-box and black-box attacks.
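For concreteness, the stated schedule (initial rate 0.1, decayed by a factor of 0.1 at transition epochs 60 and 90) can be written as a small helper. This is our reading of the text, not the authors' code:

```python
def learning_rate(epoch, base_lr=0.1, transitions=(60, 90), decay=0.1):
    """Piecewise-constant schedule: start at base_lr and multiply by the
    decay factor at each transition epoch that has been reached."""
    lr = base_lr
    for t in transitions:
        if epoch >= t:
            lr *= decay
    return lr
```

So epochs 0-59 train at 0.1, epochs 60-89 at 0.01, and epochs 90-399 at 0.001.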

5.1. DEFENDING WHITE-BOX ATTACKS

We show the classification accuracy under several white-box attacks on CIFAR-10, CIFAR-100 and SVHN in this section. We first report the accuracy on CIFAR-10 in Table 1 with attack iterations T = 20, 40, 100 for PGD (Madry et al., 2017) and CW (Carlini & Wagner, 2017). We also conduct further experiments to evaluate the robustness of our proposed method against more recent attacks, e.g. AutoAttack (Croce & Hein, 2020) and RayS (Chen & Gu, 2020), as shown in Appendix B.2. As observed, our proposed method overall achieves a clear superiority over all the defense approaches on both clean data and adversarial examples (except that it is slightly inferior to Feature Scattering under FGSM). We also observe one exception: the standard model performs best on clean data. Our approach performs much better than the other baseline models under the PGD and CW attacks. In particular, we improve over the recent state-of-the-art method Feature Scattering by 3.1% and 5.2% under the PGD20 and CW20 attacks, respectively. With Inference with Manifold Transformation (IMT), our approach (ATLD-IMT) is 8.9% and 17.4% higher than Feature Scattering under the PGD20 and CW20 attacks, respectively. However, the performance on clean data declines from 93.3% to 86.4%, since IMT appears to have a negative effect on classifying clean data. To reduce the impact of IMT on natural data, a threshold is used to limit the perturbation of IMT based on the output of the discriminator: the perturbation is halved if the output of the discriminator lies within the range [0.3, 0.7] (ATLD-IMT+). Under this setting, our approach achieves high performance under adversarial attacks without sacrificing accuracy on clean data. Similarly, the accuracies on CIFAR-100 and SVHN are reported in the corresponding tables.

It deserves our attention that ATLD-IMT appears to have a negative impact under black-box attacks, though it still performs much better than PGD.
This may be explained in several aspects. On one hand, the distributions of adversarial examples produced by different models may differ significantly in the latent space; on the other hand, our discriminator lacks the ability to deal with the unseen distributions since the discriminator only distinguishes one type of adversarial examples from the natural data during training. We will leave the investigation of this topic as future work. 

6. CONCLUSION

We have developed a novel adversarial training method which leverages both local and global information to defend against adversarial attacks, whereas existing adversarial training methods mainly generate adversarial perturbations in a local and supervised fashion, which could limit the model's generalization. We have established our novel framework via an adversarial game between a discriminator and a classifier: the discriminator is learned to differentiate globally the latent distributions of the natural data and the perturbed counterpart, while the classifier is trained to recognize accurately the perturbed examples as well as to enforce the invariance between the two latent distributions. Extensive empirical evaluations have shown the effectiveness of our proposed model compared with the recent state-of-the-art in defending against adversarial attacks in both the white-box and black-box settings.

B.2 MODEL ROBUSTNESS AGAINST AUTOATTACK AND RAYS

As shown in (Croce & Hein, 2020; Chen & Gu, 2020), several models (such as Feature Scattering) achieve sufficiently high robustness against the PGD and CW attacks but may fail to defend against stronger attacks. To further evaluate model robustness in this regime, we evaluate our proposed method against the AutoAttack (Croce & Hein, 2020) and RayS (Chen & Gu, 2020) attacks with $\ell_\infty$ budget $\epsilon = 8$ on CIFAR-10 and CIFAR-100. We first compare the accuracy of the proposed ATLD-IMT+ with several competitive methods on CIFAR-10 in Table 4. We also report the accuracy of ATLD-IMT+ together with state-of-the-art methods on CIFAR-100 in Table 5 against AutoAttack (AA). Our proposed method again achieves significantly better performance than all the other defense approaches (without data augmentation) on both the clean data and the AA-attacked examples. Furthermore, it is noted that, while our ATLD-IMT+ method is just slightly inferior to Gowal et al. (2020) (which is trained with additional data), it is substantially ahead of the normal version of Gowal et al. (2020).

B.3 BLACK-BOX RESULTS ON SVHN AND CIFAR-100

We conduct more evaluations on transfer-based black-box attacks on SVHN and CIFAR-100 and report the results in Table 6. It can be observed that our proposed method overall outperforms Feature Scattering in most cases on SVHN. Surprisingly, the Adversarial Training method, i.e. PGD, performs better than both our method and Feature Scattering in three cases. This partially reveals the more challenging nature of defending against black-box attacks compared with white-box attacks. On CIFAR-100, our method and Feature Scattering are comparable: the performance of the two methods differs little, though our method outperforms Feature Scattering significantly under PGD20 and CW20 against adversarial attacks generated from the Feature Scattering model. Overall, though the proposed ATLD method may not lead to remarkably higher performance than the current state-of-the-art algorithms in defending against black-box attacks (as it does for white-box attacks), it still achieves overall better or comparable performance. We will again leave further exploration of defending against black-box attacks to future work.

We elaborate the inference procedure of our IMT in this section. The overall architecture of ATLD-IMT is plotted in Figure 5. A test sample $x$ is fed into the classifier, and the discriminator outputs a prediction. A specific perturbation in IMT is then computed from the discriminator loss $D_W$ and added back to $x$; in this way, the sample is pushed towards the manifold of natural samples, which is supposed to be further away from the decision boundary. The prediction of the adversarially-trained classifier on the transformed sample $x^t$ is then output as the label of $x$. To clearly illustrate the effect of our ATLD-IMT, we conduct additional toy experiments, as shown in Figure 6, where we plot the clean or natural data, the perturbed data attacked by PGD, and the data adjusted by ATLD-IMT in (a), (b), and (c), respectively.
Moreover, the decision boundary is given by ATLD in all the three sub-figures. In (a), it deserves our attention that the boundary learned by ATLD could classify natural data well compared to the PGD and Feature Scattering as shown in Section A.3. As observed in (b), the perturbations generated by PGD will push the natural samples toward or even cross the decision boundary. Our proposed IMT can push the samples towards the manifold of natural examples as observed in (c). Since the manifold of natural examples would be more separable, this may further increase the classification performance as observed in the experiments. We start with minimizing the largest f -divergence between latent distributions P θ and Q θ induced by perturbed example x and natural example x. And we denote their corresponding probability density functions as p(z) and q(z). According to Eq. (3), we have min θ max Q θ D f (P θ ||Q θ ) = min θ max q(z) Z q(z) sup t∈domf * {t p(z) q(z) -f * (t)}dx ≥ min θ max q(z) sup T ∈τ ( Z p(z)T (z)dz - Z q(z)f * (T (z))dz) = min θ max Q θ sup W E z∼P θ [g f (V W (z))] + E z∼Q θ [-f * (g f (V W (z)))] = min θ sup W E x∼D max x ∈B(x, ) [g f (V W (f θ (x )))] + [-f * (g f (V W (f θ (x))))] To compute the Jensen-Shannon divergence between P θ and Q θ , we set g f (t) = -log(1 + e -t ) and f * (g) = -log(2 -e g ). Then, we have [log D 0 W (f θ (x i )) + (1 -log D 0 W (f θ (x i )) L 0 d ] C.2 COMPUTATION FOR ADVERSARIAL EXAMPLE AND TRANSFORMED EXAMPLE To compute the adversarial example, we need to solve the following problem: x adv i = arg max x i ∈B(xi, ) [log D 0 W (f θ (x i )) + (1 -log D 0 W (f θ (x i )) L 0 d ] It can be reformulated as computing the adversarial perturbation as follows: r adv i = arg max r ∞≤ [L 0 d (x i + r i , θ)] We first consider the more general case r p ≤ and expand (13) with the first order Taylor expansion as follows: r adv i = arg max r p ≤ [L 0 d (x i , θ)] + ∇ x F T r i where F = L(x i , θ). 
The problem (14) can be reduced to:
\[
\max_{\|r_i\|_p \le \epsilon} \nabla_x F^T r_i \tag{15}
\]
We solve it with the Lagrange multiplier method,
\[
\mathcal{L}(r_i, \lambda) = \nabla_x F^T r_i - \lambda\big(\|r_i\|_p - \epsilon\big) \tag{16}
\]
Setting the first derivative with respect to r_i to zero gives
\[
\nabla_x F = \lambda \frac{r_i^{\,p-1}}{\big(\sum_j (r_i^j)^p\big)^{1-\frac{1}{p}}} \tag{17}
\]
Collecting the scalar factors into \(\lambda_p := \lambda \big(\sum_j (r_i^j)^p\big)^{\frac{1}{p}-1}\), we obtain
\[
\nabla_x F = \lambda_p\, r_i^{\,p-1}
\quad\Longrightarrow\quad
(\nabla_x F)^{\frac{p}{p-1}} = \lambda_p^{\frac{p}{p-1}}\, r_i^{\,p} \tag{18}
\]
Summing both sides over the coordinates and using \(\|r_i\|_p = \epsilon\) (the constraint is active at the optimum), we have
\[
\|\nabla_x F\|_{p^*}^{p^*} = \lambda_p^{p^*}\, \epsilon^p \tag{19, 20}
\]
where p* is the dual exponent of p, i.e. \(\frac{1}{p} + \frac{1}{p^*} = 1\). Hence
\[
\lambda_p = \frac{\|\nabla_x F\|_{p^*}}{\epsilon^{p/p^*}} \tag{21}
\]
By combining (18) and (21), we have
\[
r^*_i = \epsilon\, \mathrm{sgn}(\nabla_x F)\left(\frac{|\nabla_x F|}{\|\nabla_x F\|_{p^*}}\right)^{\frac{1}{p-1}}
      = \epsilon\, \mathrm{sgn}(\nabla_x L^0_d)\left(\frac{|\nabla_x L^0_d|}{\|\nabla_x L^0_d\|_{p^*}}\right)^{\frac{1}{p-1}} \tag{22}
\]
In this paper, we set p to ∞, so that p* → 1 and the exponent \(\frac{1}{p-1} \to 0\). Then we have
\[
r^*_i = \lim_{p\to\infty} \epsilon\, \mathrm{sgn}(\nabla_x L^0_d)\left(\frac{|\nabla_x L^0_d|}{\|\nabla_x L^0_d\|_{p^*}}\right)^{\frac{1}{p-1}}
      = \epsilon\, \mathrm{sgn}(\nabla_x L^0_d)\left(\frac{|\nabla_x L^0_d|}{\|\nabla_x L^0_d\|_{1}}\right)^{0}
      = \epsilon\, \mathrm{sgn}(\nabla_x L^0_d) \tag{23}
\]
Then we can obtain the adversarial example:
\[
x^*_i = x_i + \epsilon\, \mathrm{sgn}(\nabla_x L^0_d) \tag{24}
\]
To compute the transformed example, we need to solve the following problem:
\[
r^* = \arg\min_{\|r\|_\infty \le \epsilon} \log D^0_W(f_\theta(x + r)) \tag{25}
\]
With the same method, we can easily obtain the transformed example x_t:
\[
x_t = x - \epsilon\, \mathrm{sgn}(\nabla_x \log D^0_W) \tag{26}
\]
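The closed-form solution above can be verified numerically. The sketch below (the function names are ours) implements the general-p solution of Eq. (22) with NumPy and checks that, as p grows, it approaches the sign solution of Eq. (23):

```python
import numpy as np

def lp_optimal_perturbation(grad, eps, p):
    """Closed-form solution of max_{||r||_p <= eps} grad^T r, Eq. (22):
    r* = eps * sgn(grad) * (|grad| / ||grad||_{p*})^{1/(p-1)},
    where p* = p/(p-1) is the dual exponent of p."""
    p_star = p / (p - 1.0)
    scale = np.linalg.norm(grad, ord=p_star)
    return eps * np.sign(grad) * (np.abs(grad) / scale) ** (1.0 / (p - 1.0))

def linf_optimal_perturbation(grad, eps):
    """p -> inf limit, Eq. (23): the exponent 1/(p-1) -> 0, leaving the sign."""
    return eps * np.sign(grad)

grad = np.array([0.5, -2.0, 0.1, 3.0])  # toy gradient of L^0_d w.r.t. x
eps = 8.0 / 255.0

# As p grows, the l_p solution approaches the sign solution.
r_large_p = lp_optimal_perturbation(grad, eps, p=200.0)
r_inf = linf_optimal_perturbation(grad, eps)
assert np.allclose(r_large_p, r_inf, atol=1e-2)
```

For any finite p, the returned perturbation satisfies the constraint with equality, i.e. its l_p norm is exactly eps, which is consistent with the constraint being active at the optimum.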



Figure 1: Illustrative example of different perturbation schemes. (a) Original data; perturbed data generated by (b) PGD (a supervised adversarial generation method), (c) Feature Scattering, and (d) the proposed ATLD method. The overlaid boundary is from the model trained on clean data.

The maximization of the discriminator loss in (6) is given by max_W Σ_{i=1}^N L_d, and the minimization of the classification loss on adversarial examples is given by min_θ Σ_{i=1}^N L_f. The problem (6) is optimized by updating the parameters θ and W and crafting the adversarial examples {x^adv_i}_{i=1}^N iteratively. The whole training procedure can be viewed as a game among three players: the classifier, the discriminator, and the adversarial examples. The discriminator D_W is learned to differentiate the latent distributions of the perturbed examples and the clean data by maximizing the loss L_d, while the classifier f_θ is trained to (1) enforce the invariance between these two distributions to confuse the discriminator D_W by minimizing the loss L_d, and (2) classify the adversarial examples as accurately as possible by minimizing L_f. In each training iteration, the adversarial examples are crafted to make the adversarial latent distribution deviate from the natural one by maximizing L_d. Although D_W cannot measure the divergence between the two latent distributions exactly during the first several training steps, it can evaluate the divergence between the distributions induced by the perturbed examples and the clean ones once the parameters W converge.
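To make the alternation concrete, the three-player game above can be sketched as a toy training loop. This is a minimal PyTorch illustration under our own simplifying assumptions (tiny stand-in networks, hypothetical sizes and learning rates, and a single sign-gradient step for crafting x^adv), not the exact training code used in the experiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins: f_theta maps inputs to latent features; D_W has C+1 outputs,
# D^0 (natural vs. perturbed score) and D^{1:C} (class logits).
f_theta = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 16))
num_classes = 3
D_W = nn.Linear(16, 1 + num_classes)

opt_theta = torch.optim.SGD(f_theta.parameters(), lr=0.1)
opt_W = torch.optim.SGD(D_W.parameters(), lr=0.1)
eps = 0.1

def manifold_loss(x_adv, x):
    """L_d: GAN-style loss on D^0 separating perturbed from natural latents."""
    d_adv = torch.sigmoid(D_W(f_theta(x_adv))[:, 0])
    d_nat = torch.sigmoid(D_W(f_theta(x))[:, 0])
    return (torch.log(d_adv + 1e-8) + torch.log(1 - d_nat + 1e-8)).mean()

x = torch.randn(8, 10)
y = torch.randint(0, num_classes, (8,))

for step in range(3):
    # 1) Craft adversarial examples that maximize L_d (one sign-gradient step).
    x_req = x.clone().requires_grad_(True)
    manifold_loss(x_req, x).backward()
    x_adv = (x + eps * x_req.grad.sign()).detach()

    # 2) Discriminator ascent on L_d: tell the two latent distributions apart.
    opt_W.zero_grad()
    (-manifold_loss(x_adv, x)).backward()
    opt_W.step()

    # 3) Classifier descent: confuse D^0 (minimize L_d) and classify x_adv (L_f).
    opt_theta.zero_grad()
    L_f = F.cross_entropy(D_W(f_theta(x_adv))[:, 1:], y)
    (manifold_loss(x_adv, x) + L_f).backward()
    opt_theta.step()
```

Each iteration thus plays the three roles in turn: the crafted examples push the two latent distributions apart, the discriminator sharpens its estimate of their divergence, and the classifier both closes that gap and fits the adversarial labels.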

Figure 2: Overall architecture of ATLD and its training procedure. 1) The natural example is fed into the network, and the discriminator outputs its prediction. The manifold loss L^0_d is computed from the prediction and the true label, and the adversarial example x^adv is generated (blue arrow). 2) Both the clean and adversarial samples are fed into the network to train the classifier (green arrow) and the discriminator (yellow arrow) iteratively.

Figure 3: Model performance under PGD and CW attacks with different attack budgets.

defend the AutoAttack (AA) and RayS attacks, including: (1) traditional Adversarial Training with PGD (AT) (Madry et al., 2017); (2) TRADES: trading adversarial robustness off against accuracy (Zhang et al., 2019); (3) Feature Scattering: generating adversarial examples by considering the inter-sample relationship (Zhang & Wang, 2019); (4) Robust Overfitting: improving model adversarial robustness by simply using early stopping (Rice et al., 2020); (5) Pretraining: improving model adversarial robustness with pre-training (Hendrycks et al., 2019); (6) WAR: mitigating the perturbation stability deterioration on wider models (Wu et al., 2020); (7) RTS: achieving high robust accuracy with a semi-supervised learning procedure (self-training) (Carmon et al., 2019); and (8) Gowal et al. (2020): achieving state-of-the-art results by combining larger models, Swish/SiLU activations, and model weight averaging. These comparison algorithms attain the most competitive performance in defending the AA attack. As observed, our proposed method overall achieves a clear superiority over all the defence approaches on both the clean data and the adversarial examples (except that on clean data ours is slightly inferior to Gowal et al. (2020), which is however trained with additional data). Note that Pretraining, WAR, and Gowal et al. (2020), marked with a footnote, require additional data for training (e.g., unlabeled data or pre-training).

ILLUSTRATION OF THE OVERLAID BOUNDARY CHANGE OF DIFFERENT METHODS

We conduct a toy example in Figure 4 to illustrate how the various methods affect the decision boundary after adversarial training is applied. In Figure 4, (a) shows the decision boundary trained with clean data; (b) shows the decision boundary adversarially trained with the samples perturbed by PGD; (c) presents the decision boundary given by the adversarial training of Feature Scattering; and (d) illustrates the decision boundary trained with our proposed ATLD. Clearly, both PGD (Figure 4(b)) and FS (Figure 4(c)) vary the original decision boundary significantly. Moreover, it can be observed that the adversarial training with PGD corrupts the data manifold completely. On the other hand, FS appears able to partially retain the data manifold information since it considers the inter-sample relationship locally. Nonetheless, its decision boundary appears non-smooth, which may hence degrade the performance. In contrast, as shown in Figure 4(d), our proposed method considers retaining the data manifold globally, which varies the original decision boundary only slightly.

Figure 4: The overlaid decision boundaries after the various adversarial training methods are applied.

Figure 5: Detailed Procedure of IMT. 1) The natural example or adversarial example x is fed into the network, and the discriminator outputs its prediction. The loss log D W is computed and the transformed example x t (red arrow) is then generated. 2) The transformed sample is fed into the network and classified by the adversarially-trained network.

\[
\min_\theta \max_{P_\theta \in \mathbb{P}} D_{JS}(P_\theta \| Q_\theta)
\ge \min_\theta \sup_W \mathbb{E}_{x \sim \mathcal{D}} \max_{x' \in B(x,\epsilon)} [\log D_W(f_\theta(x'))] + [\log(1 - D_W(f_\theta(x)))] \tag{10}
\]
where \(D_W(x) = 1/(1 + e^{-V_W(x)})\). Optimizing (10) is equivalent to optimizing the lower bound of the Jensen-Shannon divergence between P_θ and Q_θ. By disentangling the computation of adversarial examples from Eq. (10) and further considering the classification loss l for the classifier, L_f, and for the discriminator, L^{1:C}_d, we can obtain the final objective:
\[
\min_\theta \sup_W \sum_{i=1}^N \big[ L^0_d + l(D^{1:C}_W(f_\theta(x_i)), y_i) + l(D^{1:C}_W(f_\theta(x^{adv}_i)), y_i) \big]
\]

Manifold-based Adversarial Training. Song et al. (2017) propose to generate adversarial examples by projecting onto a proper manifold. Zhang & Wang (2019) leverage the manifold information in the form of the inter-sample relationship within a batch to generate adversarial perturbations. Virtual Adversarial Training and Manifold Adversarial Training are proposed to improve model generalization and robustness against adversarial examples by ensuring the local smoothness of the data distribution (Zhang et al., 2019).

Accuracy under different white-box attacks on CIFAR-10

Accuracy under different white-box attacks on CIFAR-100 and SVHN. Four different models are used for generating the test-time attacks: the Vanilla Training model, the Adversarial Training with PGD model, the Feature Scattering Training model, and our model. As demonstrated by the results in Table 3, our proposed approach achieves competitive performance in almost all the cases. Specifically, ATLD outperforms Feature Scattering significantly in 8 cases while demonstrating comparable or slightly worse accuracy in the other 3 cases.

Accuracy under black-box attack on CIFAR-10

Accuracy under AutoAttack (AA) and RayS on CIFAR-10

Accuracy under AutoAttack (AA) on CIFAR-100

Accuracy under black-box attack on SVHN and CIFAR-100

APPENDIX A LIST OF MAJOR NOTATION

For clarity, we list the major notations that are used in our model.
• X_org = {x : x ∼ Q_0}: the set of clean data samples, where Q_0 is its underlying distribution;
• X_p = {x' : x' ∈ B(x, ε), ∀x ∼ Q_0}: the set of perturbed samples; each element x' ∈ X_p lies in the ε-neighborhood of a clean example x ∼ Q_0;
• f_θ: the mapping from the input to the latent features of the last hidden layer (i.e., the layer before the softmax layer);
• Q_θ: the underlying distribution of the latent features f_θ(x) for all x ∈ X_org;
• P_θ: the underlying distribution of the latent features f_θ(x') for all x' ∈ X_p;
• P: the feasible region of the latent distribution P_θ, defined as P ≜ {P : f_θ(x') ∼ P subject to ∀x ∼ Q_0, x' ∈ B(x, ε)};
• X_adv: the set of worst-case perturbed samples or manifold adversarial examples; each element x^adv ∈ X_adv lies in the ε-neighborhood of a clean example x ∼ Q_0;
• P*_θ: the worst-case latent distribution within the feasible region P, i.e. the one leading to the largest divergence, or equivalently the underlying distribution of the latent features f_θ(x^adv) for all x^adv ∈ X_adv.

ROBUSTNESS UNDER DIFFERENT ATTACK BUDGETS

We further evaluate the model robustness against PGD and CW attacks under different attack budgets with a fixed number of 20 attack steps. The results are shown in Figure 3. It is observed that the performance of Adversarial Training with PGD (AT) drops quickly as the attack budget increases. The Feature Scattering method (FS) improves the model robustness across a wide range of attack budgets. The proposed approach ATLD-IMT further boosts the performance over Feature Scattering by a large margin under different attack budgets, especially under the CW attack, except that our ATLD-IMT is slightly inferior to Feature Scattering under the PGD attack with budget ε = 20 on CIFAR-10.

