ADVERSARIAL FEATURE DESENSITIZATION

Abstract

Deep neural networks can now perform many tasks that were once thought to be feasible only for humans. While reaching impressive performance under standard settings, such networks are known to be vulnerable to adversarial attacks: slight but carefully constructed perturbations of the inputs which drastically decrease network performance. Here we propose a new way to improve network robustness against adversarial attacks, focusing on robust representation learning based on the adversarial learning paradigm; we call this approach Adversarial Feature Desensitization (AFD). AFD desensitizes the representation via an adversarial game between the embedding network and an additional adversarial discriminator, which is trained to distinguish between clean and perturbed inputs from their high-level representations. Our method substantially improves the state of the art in robust classification on the MNIST, CIFAR10, and CIFAR100 datasets. More importantly, we demonstrate that AFD has better generalization ability than previous methods, as the learned features maintain their robustness across a wide range of perturbations, including perturbations not seen during training. These results indicate that reducing feature sensitivity is a promising approach for ameliorating the problem of adversarial attacks in deep neural networks.

1. INTRODUCTION

Despite remarkable recent progress in deep learning that has allowed neural networks to achieve near human-level performance across a range of complex tasks (He et al., 2016; Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019), a number of important open challenges remain. For example, deep networks are known to be highly vulnerable to adversarial attacks (Szegedy et al., 2013), i.e. small but precise perturbations of the inputs that result in high-confidence predictions which diverge critically from human judgement. Many prior works on adversarial robustness have tackled the robust classification problem by forcing the classifier to output the correct label for the perturbed inputs (Madry et al., 2017; Kannan et al., 2018; Zhang et al., 2019b). These approaches essentially push the representations of samples from different categories away from the decision boundary. For example, the Adversarial Training procedure (Madry et al., 2017) trains a network to minimize the classification loss on the distribution of perturbed input samples. Another recent approach (Zhang et al., 2019b) augments the regular classification loss with an auxiliary term that encourages the network to assign matching labels to clean and perturbed inputs (Figure 1a). More recently, several other works have tried to improve classification robustness by enhancing the smoothness of the classification loss (Wu et al., 2019; Qin et al., 2020) or the saliency of the Jacobian matrix (Chan et al., 2020b). These methods have been shown to further improve robust performance compared to prior approaches that do not consider the gradient landscape of the network. However, despite all these efforts, most of these defenses remain vulnerable to other forms of attack that were not used during training, or even to slightly stronger perturbations of the same kind (Schott et al., 2018; Sitawarin et al., 2020).
One reason for the above could be an insufficient focus on the robustness of the representations learned by the model. It has been shown that many adversarial perturbations, while often small in magnitude, lead to large deviations in the high-level features of deep neural networks (Liao et al., 2018; Yoon et al., 2019). In addition, previous work (Ilyas et al., 2019) demonstrated that adversarial patterns often rely on specific learned features which generalize even on large datasets such as ImageNet (Deng et al., 2009). However, these features are highly sensitive to input changes, yielding a potential vulnerability that can be exploited by adversarial attacks. While humans can also experience altered perception in response to particular visual patterns (e.g., visual illusions¹), they are seemingly insensitive to this particular class of perturbations, and often unaware of the subtle image changes resulting from adversarial attacks. This in turn suggests that current deep neural networks may rely on features that are still considerably different from those giving rise to perception in primates (and, particularly, in humans), even despite many recent studies highlighting their remarkable similarities (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Bashivan et al., 2019). It is therefore reasonable to hypothesize that a deep network may become more robust to such adversarial attacks if its higher-level representations are more robust to input perturbations, similar to those used by our brains. One way to approach the issue of robust classification is to consider the classifier as a relatively simple mapping (e.g. a linear transformation) that produces predictions based on a learned representation.
In this case, if the learned representation is robust, then the predictions of the simple classifier would consequently be robust too (Garg et al., 2018; Zhu et al., 2020). Here, instead of focusing on robust classification, we turn our attention to the robustness of the learned features from which the categories are inferred (e.g. using a simple linear classifier). Our goal is to learn representations that remain stable in the presence of adversarial attacks. We propose to learn robust representations via an adversarial game between two agents: i) an attacker that searches for performance-degrading perturbations given the embedding function, and ii) a discriminator function that distinguishes between clean and perturbed inputs from their high-level representations. The parameters of the embedding and the adversarial discriminator functions are then tuned via an adversarial game between the two (Figure 1b). This setup is similar to the adversarial learning paradigm widely used in image generation and transformation (Goodfellow et al., 2014a; Karras et al., 2019; Zhu et al., 2017), unsupervised and semi-supervised learning (Miyato et al., 2018b), video prediction (Mathieu et al., 2015; Lee et al., 2018), domain adaptation (Ganin & Lempitsky, 2015; Tzeng et al., 2017), active learning (Sinha et al., 2019), and continual learning (Ebrahimi et al., 2020). While some prior works have also applied adversarial learning to the problem of adversarial examples, they have typically used it to learn the distribution of adversarial images (Wang & Yu, 2019; Matyasko & Chau, 2018) or of the input gradients (Chan et al., 2020b; a).

The main contributions of this work are:

• We propose a novel method to learn adversarially robust representations through an adversarial game between the embedding function and an adversarial discriminator that distinguishes between the natural and perturbed representations.

• We theoretically show that our proposed adversarial approach leads to a flat loss function in the vicinity of the training samples, thereby making the overall representation more stable against adversarial attacks.

• We perform extensive empirical evaluations against many prior methods, on three datasets, with eight types of attacks across a wide range of attack strengths, and show that our proposed approach performs similarly or better (often significantly better) than most previous defense methods under most tested circumstances.

2. METHODS

Let E_θ(x): X → H, where X ⊆ R^{N_i} and H ⊆ R^{N_e}, be an embedding function (e.g. a neural network with parameters θ) mapping the input x ∈ X into a representation h ∈ H, and let Dc_φ: H → Y, where Y ⊆ R^{N_c}, be a linear decoding function with parameters φ (e.g., the last linear layer of a neural network before the softmax). The likelihood of each class i from a set of N_c classes, C = {1, ..., N_c}, given the input x, is computed as l_i(x) = softmax(Dc_φ(E_θ(x)))_i, i ∈ C. Let π(x, ε) denote a perturbation function (an adversarial attack) which computes a perturbed input x′ within the ε-neighborhood of x: ∀x ∈ X: π(x, ε) = x′ ∈ B(x, ε); B(x, ε) = {x′ ∈ X : ‖x′ − x‖ < ε}, such that argmax_{i∈C} l_i(x) ≠ argmax_{i∈C} l_i(x′), i.e. the attack changes the predicted class label of a sample x. It has been shown that adversarial examples can be attributed to the presence of non-robust features which are predictive of the categories but are not shared with human perception (Ilyas et al., 2019). Naturally, reducing the sensitivity of the learned features could potentially enhance the network's classification robustness against adversarial attacks. Given a perturbation vector δ ∈ R^{N_i}, ‖δ‖ ≤ ε, we could simply define the sensitivity of a representation as the empirical average (over n input samples) of the maximum norm change in the representation due to an input perturbation (attack): S_e = (1/n) Σ_x max_δ ‖E(x) − E(x + δ)‖, and formulate robust representation learning as an optimization problem that minimizes the representation sensitivity S_e. However, such an approach may negatively affect the empirical risk objective, i.e. the classification accuracy (as we show later in the empirical section). We therefore seek a more precise formulation that is less disruptive to the classification objective of the network.
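As a concrete illustration, the sensitivity measure S_e can be sketched in a few lines of numpy. Everything here is illustrative: the toy embedding E(x) = tanh(Wx), the weights W, and the Monte-Carlo approximation of the inner maximum by random sampling are our own choices (a real attack would search the ε-ball rather than sample from it):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 10))
E = lambda x: np.tanh(W @ x)          # toy embedding function E(x)

def sensitivity(X, eps=0.1, n_trials=64):
    """Monte-Carlo estimate of S_e: average over samples of the maximum
    norm change of E under perturbations drawn from the eps-ball
    (a real attack would search the ball instead of sampling it)."""
    total = 0.0
    for x in X:
        deltas = rng.uniform(-eps, eps, size=(n_trials, x.size))
        total += max(np.linalg.norm(E(x) - E(x + d)) for d in deltas)
    return total / len(X)

X = rng.normal(size=(5, 10))
print(sensitivity(X))
```

Minimizing this quantity directly is exactly the naive objective the text warns about: it says nothing about preserving class information, which motivates the distributional formulation of the next section.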

2.1. ADVERSARIAL FEATURE DESENSITIZATION

Instead of minimizing the empirical average of the representation sensitivity across all samples in the dataset (as formulated in the previous section), we focus on minimizing the representation sensitivity at the level of distributions, which we expect to be less disruptive to the classification objective. For this, we propose an adversarial learning procedure similar to Generative Adversarial Networks (GAN) (Goodfellow et al., 2014a), in which the generator network is replaced by an embedding network E_θ that learns to map the clean and perturbed inputs into representations that are indistinguishable from each other. Similar to the original GAN setup, a discriminator network Da_ψ is trained to distinguish between the representations of clean and perturbed inputs. The training procedure involves three loss functions that are optimized sequentially. First, the parameters of the embedding function E_θ and the decoder Dc_φ are tuned to minimize the classification softmax cross-entropy loss (on clean inputs). Second, the parameters ψ of the adversarial discriminator Da_ψ are tuned to minimize the cross-entropy loss associated with discriminating natural from perturbed inputs, conditioned on the natural labels. Lastly, the parameters of the embedding function E_θ are adversarially tuned to maximize the cross-entropy loss from the second step. Algorithm 1 summarizes the proposed approach (also, see Figure 1b).

Algorithm 1: AFD training procedure
Input: Attack policy π, mini-batch B of size m, encoding network E_θ, adversarial discriminator network Da_ψ, decoder network Dc_φ, softplus function S, and learning rates α, β, and γ.
repeat
    Read mini-batch B = {(x_1, y_1), ..., (x_m, y_m)}
    x′ ← π(x, ε)
    L_EDc = −(1/m) Σ_{i=1}^{m} log softmax(Dc_φ(E_θ(x_i)))_{y_i}
    L_Da = (1/m) Σ_{i=1}^{m} [S(−Da_ψ(E_θ(x_i), y_i)) + S(Da_ψ(E_θ(x′_i), y_i))]
    L_E = (1/m) Σ_{i=1}^{m} S(−Da_ψ(E_θ(x′_i), y_i))
    (θ, φ) ← (θ, φ) − α ∇_{θ,φ} L_EDc
    ψ ← ψ − β ∇_ψ L_Da
    θ ← θ − γ ∇_θ L_E
until training converged

The adversarial training framework involves a two-player minimax game (Chrysos et al., 2019) between E_θ and Da_ψ, with value function V(E_θ, Da_ψ):

V(E_θ, Da_ψ) = E_{p(y)} E_{p(x|y)} [S(−Da_ψ(E_θ(x), y))] + E_{q(y)} E_{q(x|y)} [S(Da_ψ(E_θ(x), y))],   (2)

where p and q correspond to the natural and perturbed distributions, and S denotes the softplus function. Chrysos et al. (2019) prove that the global minimum of the adversarial training criterion V(E_θ, Da_ψ) is achieved if and only if p = q; in our setting, p = P(E_θ(x), y) and q = P(E_θ(x′), y), i.e. achieving the global minimum in eq. 2 would imply that the representations of natural and perturbed images, conditioned on the class label, belong to the same probability distribution. In that case, a Bayes optimal classifier would achieve the same error rate on the perturbed inputs as on the natural inputs. We use this fact below to prove that, when V(E_θ, Da_ψ) is at its global minimum, the gradient of the likelihood function becomes equal to zero, i.e. the adversarial attack fails to change the class likelihoods. Let π(x, ε) be a policy which computes the perturbed input x′ within the ε-neighborhood of x: π(x, ε) = x − ∂l_t/∂x = x′ ∈ B(x, ε), where t denotes the target (ground-truth) class index; and let S(Da_ψ) be a discriminator function H → {0, 1} that distinguishes between natural and perturbed representations, where S is the softplus function. The following theorem states this property of our approach, which was not previously addressed, at least not explicitly, by alternative methods.

Theorem 1. If the adversarial optimization of the embedding and discriminator functions E_θ and Da_ψ converges to the global minimum (θ*, ψ*) of the training objective in equation 2, then the gradient of the true-class (t) likelihood with respect to the input x is zero at any x ∈ X, i.e. ∂l_t/∂x = 0.

See appendix (6.2) for the proof.
While convergence to the global optimum is a strong assumption, in practice it is possible to derive a bound on the classifier's robust error in terms of its error on clean inputs and a divergence measure between the clean and perturbed representations (see 6.4 in the appendix).
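The three sequential updates of Algorithm 1 can be sketched in PyTorch. Everything below is a minimal toy: the module sizes, the one-hot conditional discriminator (a simplified stand-in for the projection discriminator used in the experiments), and the single-step FGSM stand-in for the attack policy π are our own illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_in, n_emb, n_cls = 20, 16, 4

E  = nn.Sequential(nn.Linear(n_in, n_emb), nn.ReLU())   # embedding E_theta
Dc = nn.Linear(n_emb, n_cls)                            # linear decoder Dc_phi

class CondDiscriminator(nn.Module):
    """Class-conditional discriminator Da_psi (one-hot concatenation;
    a simplified stand-in for the projection discriminator in the paper)."""
    def __init__(self, n_emb, n_cls, width=64):
        super().__init__()
        self.n_cls = n_cls
        self.net = nn.Sequential(nn.Linear(n_emb + n_cls, width),
                                 nn.LeakyReLU(0.2), nn.Linear(width, 1))
    def forward(self, h, y):
        yh = F.one_hot(y, self.n_cls).float()
        return self.net(torch.cat([h, yh], dim=1)).squeeze(1)

Da = CondDiscriminator(n_emb, n_cls)
opt_EDc = torch.optim.SGD(list(E.parameters()) + list(Dc.parameters()), lr=0.1)
opt_Da  = torch.optim.SGD(Da.parameters(), lr=0.1)
opt_E   = torch.optim.SGD(E.parameters(), lr=0.1)

def attack(x, y, eps=0.1):
    """Single-step FGSM stand-in for the attack policy pi(x, eps)."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(Dc(E(x)), y).backward()
    return (x + eps * x.grad.sign()).detach()

x = torch.randn(8, n_in)
y = torch.randint(0, n_cls, (8,))
x_adv = attack(x, y)

# Step 1: classification loss L_EDc on clean inputs (updates theta, phi)
opt_EDc.zero_grad()
loss_cls = F.cross_entropy(Dc(E(x)), y)
loss_cls.backward(); opt_EDc.step()

# Step 2: discriminator loss L_Da -- clean embeddings vs. perturbed ones
opt_Da.zero_grad()
loss_da = (F.softplus(-Da(E(x).detach(), y)) +
           F.softplus(Da(E(x_adv).detach(), y))).mean()
loss_da.backward(); opt_Da.step()

# Step 3: adversarial loss L_E -- the embedder tries to fool Da
opt_E.zero_grad()
loss_e = F.softplus(-Da(E(x_adv), y)).mean()
loss_e.backward(); opt_E.step()
```

Note the `.detach()` calls in step 2: only the discriminator is updated there, mirroring the separate learning rates β and γ in Algorithm 1.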

3. RESULTS

3.1. ADVERSARIAL ATTACKS

We used a range of adversarial attacks in our experiments, relying on existing implementations in the Foolbox (Rauber et al., 2017) and Advertorch (Ding et al., 2019) packages. We validated the models against different variants of Projected Gradient Descent (PGD) (Madry et al., 2017) (L∞, L2, L1), the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014b), the Momentum Iterative Method (MIM) (Dong et al., 2018), Decoupled Direction and Norm (DDN) (Rony et al., 2019), Deepfool (Moosavi-Dezfooli et al., 2016), and C&W (Carlini & Wagner, 2017) attacks. For each attack, we swept the ε value across a wide range and validated the different models at each value. The specific settings used for each perturbation are listed in Table A2.
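For reference, the core PGD-L∞ loop (Madry et al., 2017) can be sketched in numpy on a toy softmax classifier; this is an illustrative re-implementation under our own toy setup (weights W, step size, iteration count), not the Foolbox/Advertorch code used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))            # toy linear classifier: 5 inputs, 3 classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_loss_wrt_x(x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits W.T @ x."""
    p = softmax(W.T @ x)
    p[y] -= 1.0                        # dLoss/dLogits = p - onehot(y)
    return W @ p

def pgd_linf(x, y, eps=0.3, alpha=0.05, steps=10):
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random start
    for _ in range(steps):
        g = grad_loss_wrt_x(x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)             # gradient-sign ascent
        x_adv = x + np.clip(x_adv - x, -eps, eps)      # project onto eps-ball
    return x_adv

x = rng.normal(size=5)
y = 0
x_adv = pgd_linf(x, y)
```

The projection step after each ascent step is what keeps x′ inside B(x, ε), matching the ε-neighborhood constraint of Section 2.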

3.2. ADVERSARIAL ROBUSTNESS

We validated our proposed approach to learning robust representations on the MNIST (LeCun et al., 1998), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009) datasets. We used the PGD-L∞ attack to perturb the inputs during training; ε was set to 0.3 and 0.031 for the MNIST and CIFAR datasets, respectively. We used the activations before the last linear layer as the high-level representations (H) of the network. In all experiments, the adversarial discriminator network (Da_ψ) consisted of three fully connected layers with Leaky ReLU nonlinearities, followed by a projection discriminator layer that incorporated the labels into the adversarial discriminator through a dot-product operation (Miyato & Koyama, 2018). We compared several variations of the adversarial discriminator architecture and evaluated their effect on robust classification on the MNIST dataset (Table A6). Increasing the depth of the adversarial discriminator and adding the projection discriminator layer drastically improved the robust classification accuracy. We verified that the adversarial discriminator could successfully discriminate between the clean and perturbed embeddings initially, and that this performance was reduced during training (Figure A5). The numbers of hidden units in all layers of Da_ψ were equal (64 for MNIST and 512 for CIFAR). We used spectral normalization (Miyato et al., 2018a) on all layers of Da_ψ. Further training details for each experiment are listed in Table A1. We used three separate learning rates for tuning the embedding E_θ, adversarial discriminator Da_ψ, and decoder Dc_φ parameters. To find the best learning rates, we randomly split the CIFAR10 training set into a training and a validation set (45000 and 5000 images, respectively). We then carried out a grid search over the train-validation split and picked the learning rates with the highest validation performance.
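A sketch of the discriminator just described, assuming PyTorch; the class name is ours, but the ingredients (three spectrally normalized fully connected layers with Leaky ReLU, followed by a projection layer in the style of Miyato & Koyama (2018), with width 512 as for CIFAR) follow the text:

```python
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    """Three spectrally normalized FC layers + projection layer
    (illustrative sketch of the Da architecture described above)."""
    def __init__(self, n_emb=512, n_cls=10, width=512):
        super().__init__()
        sn = nn.utils.spectral_norm
        self.body = nn.Sequential(
            sn(nn.Linear(n_emb, width)), nn.LeakyReLU(0.2),
            sn(nn.Linear(width, width)), nn.LeakyReLU(0.2),
            sn(nn.Linear(width, width)), nn.LeakyReLU(0.2),
        )
        self.head = sn(nn.Linear(width, 1))           # unconditional logit
        self.embed = sn(nn.Embedding(n_cls, width))   # label projection

    def forward(self, h, y):
        z = self.body(h)
        # projection trick: logit + dot product of label embedding and features
        return self.head(z).squeeze(1) + (self.embed(y) * z).sum(dim=1)

D = ProjectionDiscriminator()
h = torch.randn(4, 512)
y = torch.randint(0, 10, (4,))
out = D(h, y)
```

The dot product between the label embedding and the features is the "projection" that conditions the discriminator on the class label.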
As baselines, we used a re-implementation of the adversarial training (AT) method (Madry et al., 2017) and the official code for TRADES² (Zhang et al., 2019b), and denote these results with † in the tables.


Figure 2: Robust accuracy for different strengths of the PGD-L∞ attack on the MNIST, CIFAR10, and CIFAR100 datasets.

Adversarial robustness against the observed attack. We first evaluated our approach against the same class and strength of attack that was used during training (PGD-L∞ with ε = 0.3 and 0.031 for the MNIST and CIFAR datasets, respectively). Table 1 compares the robust classification performance of our proposed approach against the PGD-L∞ (with the same settings as used during training) and FGSM attacks. Training LeNet with AFD was unstable, leading to frequent collapses of the adversarial discriminator accuracy despite an extensive hyperparameter search. For this reason, we also conducted our MNIST experiments using the ResNet18 architecture (He et al., 2016). On all datasets, the AFD-trained network performed much better than alternative methods against both white-box and black-box attacks. The relative improvement was largest on the CIFAR10 and CIFAR100 datasets. We also observed a relatively high variance in the robust accuracy of AFD-trained networks on the CIFAR datasets when trained from different random initializations (standard deviations of 10.78 and 8.30 for CIFAR10 and CIFAR100). We suspect this large variance is due to the additional randomness introduced by the adversarial game between the embedding and the adversarial discriminator networks. Across the three runs, the best-trained models achieved 83.72% and 54.95% accuracy against the white-box PGD-L∞ attack (ε = 0.03) on CIFAR10 and CIFAR100, respectively. Furthermore, AFD retained most of its robustness against a large set of attacks while improving robustness against the C&W and DeepFool attacks when particular weaker attacks (e.g. PGD-L∞ with ε = 4/255 and 5 iterations) were used during training (Figure A6). In addition, we evaluated the AFD model on transfer black-box attacks from the AT and TRADES models, and found it to be more robust against those attacks as well (Table A4).
Robust classification against stronger and unseen attacks. We also validated robust classification against stronger versions of the attack used during training, as well as against a suite of other attacks that were not observed during training. We found that, compared to alternative defense methods, AFD-trained networks continued to perform well against white-box attacks even for very large perturbations, while the performance of other methods dropped to zero relatively quickly (Figures 2, 3, A1, A2). The AFD-trained network also performed remarkably well against most other attacks that were not observed during training (8/8 on MNIST and 6/8 on the CIFAR datasets). To compare different models considering both attack type and perturbation strength, we computed the area under the curve (AUC) over a range of epsilons for each attack and each approach. Table 2 summarizes these values for our approach and two alternative approaches (adversarial training and TRADES). Our results showed that, compared to the baseline methods, AFD-trained networks are robust to a wide range of attacks and strengths. As discussed in the Methods section, unlike most previous defense methods that focus on minimizing the robust classification error, AFD minimizes the representation sensitivity; consequently, the learned representation remains stable over a larger range of attack strengths than with other methods (Figure 4, left). Despite the large gain in robustness against most attacks, AFD-trained networks slightly underperformed against two of the attacks (Deepfool and C&W) when tested on the CIFAR10 and CIFAR100 datasets. Our post hoc analyses showed that the directions of the perturbations in representational space induced by the Deepfool and C&W attacks were more misaligned with those of the PGD-L∞ attack than for other attacks such as DDN, which was comparatively less successful (Table A7).
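The AUC summary can be computed with a simple trapezoidal rule; normalizing by the epsilon range is our own convention in this sketch:

```python
import numpy as np

def robustness_auc(epsilons, accuracies):
    """Trapezoidal area under the robust-accuracy-vs-epsilon curve,
    normalized by the epsilon range (normalization is our convention)."""
    eps = np.asarray(epsilons, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    area = ((acc[1:] + acc[:-1]) / 2.0 * np.diff(eps)).sum()
    return area / (eps[-1] - eps[0])

# hypothetical robust accuracies of a model at increasing attack strengths
print(robustness_auc([0.0, 0.01, 0.02, 0.03], [0.95, 0.90, 0.80, 0.60]))
```

A single scalar per (model, attack) pair makes it easy to compare defenses that trade off differently between weak and strong perturbations.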
Moreover, it has been shown that most adversarial defenses are not guaranteed to transfer to unseen attacks (Maini et al., 2020; Pinot et al., 2020) and that different adversarial training methods might even overfit to the training set (Rice et al., 2020). While we did not observe any sign of overfitting to the PGD-L∞ attack during training, the robustness against the Deepfool and C&W attacks decreased during the later stages of training; in this sense, the network might have overfit to the PGD-L∞ attack (Figure A4).

Representation sensitivity. We compared the robustness of the learned representations derived from training the same architecture with different methods. For this, we measured the normalized sensitivity of the representations in each network as ‖E(x) − E(x′)‖₂ / ‖E(x)‖₂. For all three datasets, we found that AFD-trained networks learn high-level representations that are more robust against input perturbations, as measured by the normalized L2 distance between clean and perturbed representations (Figures A8, A9, A10).

Gradient landscape. To empirically validate the prediction of Theorem 1, we computed the average gradient of the class likelihoods with respect to the input across samples from the test set of each dataset (‖∇_x l_i‖, i ∈ {1, ..., N_c}). We found that, on all datasets, the magnitudes of the gradients in the directions of most non-target classes were much smaller for the AFD-trained network than for the other tested methods. This empirically confirms that AFD stabilizes the representation in a way that significantly reduces the gradients towards most non-target classes. Moreover, the output gradients of the AFD-trained network were highly salient and interpretable (Figure A7).
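The normalized sensitivity measure above is straightforward to compute from batches of clean and perturbed embeddings; a minimal numpy sketch with toy embeddings standing in for E(x) and E(x′):

```python
import numpy as np

def normalized_sensitivity(h_clean, h_pert):
    """Mean of ||E(x) - E(x')||_2 / ||E(x)||_2 over a batch of embeddings
    (rows: samples, columns: embedding dimensions)."""
    num = np.linalg.norm(h_clean - h_pert, axis=1)
    den = np.linalg.norm(h_clean, axis=1)
    return float((num / den).mean())

# toy embeddings standing in for E(x) and E(x') on a batch of 3 samples
h  = np.ones((3, 4))
h2 = h.copy(); h2[:, 0] += 1.0
print(normalized_sensitivity(h, h))    # → 0.0
print(normalized_sensitivity(h, h2))   # → 0.5
```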
Learning a sparse representation. As discussed in the Methods section, we expected the AFD method to find and remove the non-robust features from the learned representation, and hence that the learned representational space would potentially be of lower dimensionality. To test this, we compared the dimensionality of the learned representations using two measures: i) the number of non-zero features over the test set of each dataset, and ii) the number of PCA dimensions that explain more than 99% of the variance in the representation computed over the test set of each dataset. We found that the same network architecture, when trained with the AFD method, gives rise to a much sparser and lower-dimensional representational space (Table A5). The representational spaces learned with AFD on the MNIST, CIFAR10, and CIFAR100 datasets had only 6, 9, and 76 principal components, respectively.

Adversarial vs. L2 optimization. We also ran an additional experiment on the MNIST dataset in which we added a regularization term to the classification loss to directly minimize the representation sensitivity S_e = (1/n) Σ_x ‖E(x) − E(x′)‖ during training. We observed that, although this augmented loss led to learning robust representations, it achieved only modest levels of robustness (∼80%) and showed only weak generalization to stronger and other unseen attacks (Figure A3). This result suggests that enforcing a distributional form of feature desensitization (e.g. AFD) may lead to robust behavior over a larger range of perturbations than directly enforcing feature stability through an Lp-norm measure.
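The PCA-based dimensionality measure (the number of components explaining more than 99% of the representation variance) can be sketched as follows:

```python
import numpy as np

def n_components_99(H):
    """Number of principal components explaining >99% of the variance
    of the representation matrix H (rows: samples, cols: features)."""
    Hc = H - H.mean(axis=0)
    s = np.linalg.svd(Hc, compute_uv=False)
    ratio = np.cumsum(s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(ratio, 0.99) + 1)

rng = np.random.default_rng(0)
low_rank  = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50))   # rank 2
full_rank = rng.normal(size=(100, 50))
print(n_components_99(low_rank))    # at most 2
print(n_components_99(full_rank))   # close to 50
```

Applied to the test-set embeddings of each trained network, this is the kind of computation behind the 6/9/76 component counts reported above.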
Non-obfuscated gradients. Recent literature has pointed out that many defense methods against adversarial perturbations can drive the network into a regime called obfuscated gradients, in which the network appears robust against common iterative adversarial attacks but can easily be broken using black-box or alternative attacks that do not rely on exact gradients (Papernot et al., 2017; Athalye et al., 2018; Carlini et al., 2019). We believe that our results are not due to obfuscated gradients, for several reasons: i) for most perturbations, the model performance continues to decrease with increasing epsilon (Figures 3, A1, A2); ii) iterative perturbations were consistently more successful than single-step ones (Table 1); iii) black-box attacks were significantly less successful than white-box attacks (Table 1); iv) the AFD-trained model performed similarly to or better than alternative methods against the Boundary attack (Brendel et al., 2018), an attack that does not rely on network gradients (Table A3). In addition to these tests, we also evaluated AFD against the B&B (Brendel et al., 2019) and AutoAttack (Croce & Hein, 2020) attacks. On these attacks, AFD was consistently better than or equal to the baseline models on the MNIST and CIFAR10 datasets, but was less robust on the CIFAR100 dataset (Table A3).

4. RELATED WORK

There is an extensive literature on mitigating susceptibility to adversarial perturbations. Adversarial training (Madry et al., 2017) is one of the earliest successful attempts to improve the robustness of the learned representations to potential input perturbations, by solving a "saddle point" problem composed of an inner and an outer adversarial optimization. A number of other works suggest additional losses instead of direct training on the perturbed inputs. TRADES (Zhang et al., 2019b) adds a regularization term to the cross-entropy loss which penalizes the network for assigning different labels to natural images and their corresponding perturbed images. Qin et al. (2020) proposed an additional regularization term (a local linearity regularizer) that encourages the classification loss to behave linearly around the training examples. Wu et al. (2019) proposed to regularize the flatness of the loss to improve adversarial robustness. Our work is closely related to the domain adaptation literature, in which adversarial optimization has recently gained much attention (Ganin & Lempitsky, 2015; Liu et al., 2019; Tzeng et al., 2017). From this viewpoint, one could consider the clean and perturbed inputs as two distinct domains for which a network aims to learn an invariant feature set. However, in our setting: i) the perturbed domain continuously evolves as the parameters of the embedding network are tuned; and ii) unlike the usual setting in domain-adaptation problems, we have access to the labels associated with samples from the perturbed (target) domain. Relatedly, Song et al. (2019) regularized the network to produce similar logit values in response to clean and perturbed inputs and showed that this additional term leads to better robust generalization to unseen perturbations. Similarly, Adversarial Logit Pairing (Kannan et al., 2018) increases robustness by directly matching the logits for clean and adversarial inputs.
Another line of work develops certified defenses: methods with provable bounds within which the network is guaranteed to operate robustly (Zhang et al., 2019c; Zhai et al., 2020; Cohen et al., 2019). While these approaches provide a sense of guarantee about the proposed defenses, they are usually prohibitively expensive to train, drastically reduce the performance of the network on natural images, and the empirical robustness they attain against standard attacks is low.

5. DISCUSSION

We proposed a method to decrease the sensitivity of learned representations in neural networks using adversarial optimization. Decreasing the input-sensitivity of features has long been desired in training neural networks (Drucker & Le Cun, 1992) and has been suggested as a way to improve adversarial robustness (Ros & Doshi-Velez, 2018; Zhu et al., 2020). Our results show that AFD can effectively reduce the input-sensitivity of network features with minimal interference with the classification objective, and can improve robustness against a family of adversarial attacks. Successful feature desensitization depended on having a strong adversarial discriminator and on maintaining a balance between the embedding and discriminator networks throughout training. Regarding computational cost, while AFD requires three SGD updates per batch, the additional cost is not significantly higher than that of many prior methods, considering that most of the computation is spent generating the adversarial examples during training.

6.1. NETWORK ARCHITECTURES

For all experiments, we trained the ResNet18 architecture (He et al., 2016) using the SGD optimizer with momentum 0.9, learning rates as indicated in Table A1, weight decay of 10^-4, and batch size of 128. All learning rates were reduced by a factor of 10 after scheduled epochs.

6.2. PROOF OF THEOREM 1

Proof. Assume Dc_i, i ∈ C is a set of differentiable functions that implement the Bayes optimal classifier from the representation E(x) (we drop the subscripts in E_θ and Da_ψ for simplicity), i.e. ŷ = argmax_i l_i and l_i = P(y_i|x) = softmax(Dc(E(x)))_i, y_i ∈ C. Assuming that, for the perturbed inputs x′ = π(x, ε), the adversarial training of E and Da converges to its global minimum, from Proposition 2 of (Chrysos et al., 2019) we have:

∀x ∈ X, y ∈ Y: P(E(x), y) = P(E(π(x, ε)), y).

Following from Bayes' rule we have:

P(y_i = t | E(x)) P(E(x)) = P(y_i = t | E(x − δ)) P(E(x − δ)), δ = ∂l_t/∂x.

From equation 4, the marginal distributions P(E(x)) and P(E(x − δ)) must be equal, which leads to:

P(y_i = t | E(x)) = P(y_i = t | E(x − δ)),

which can only hold if ∂l_t/∂x = 0.

6.3. ADVERSARIAL ATTACKS

We used a range of adversarial attacks in our experiments. Hyperparameters associated with each attack are listed in the table below. Implementations of these attacks were adapted from the Foolbox (Rauber et al., 2017) and AdverTorch (Ding et al., 2019) packages.

6.4. BOUND ON CLASSIFIER'S ROBUST ERROR

Considering the representation distributions in response to clean and perturbed inputs (of a particular class) as two distinct domains, it is straightforward to use results from the domain adaptation literature to derive a bound on the classifier's robust error (i.e. its error under the perturbed scenario). We can directly adapt Theorem 2 of (Ben-David et al., 2010). Let D_c and D_p be the distributions of representations in response to clean and perturbed inputs of a particular class y_i, respectively, and let U_c and U_p be samples of size m each, drawn from D_c and D_p. Let H be a hypothesis space of VC dimension d. Then for any δ ∈ (0, 1), with probability at least 1 − δ (over the choice of the samples), for every h ∈ H:

ξ_p(h) ≤ ξ_c(h) + (1/2) d̂_{HΔH}(U_c, U_p) + 4 √((2d log(2m) + log(2/δ)) / m) + λ,

where ξ_c and ξ_p are the errors on clean and perturbed inputs, d̂_{HΔH} is the empirical H-divergence (Ben-David et al., 2010), and λ is the combined error of the ideal hypothesis h*: λ = ξ_c(h*) + ξ_p(h*).

Table A3: Comparison of robust accuracy against AutoAttack (Croce & Hein, 2020), the Boundary attack (Brendel et al., 2018) with 5000 steps and ε = 2, and the B&B attack (Brendel et al., 2019). We tested the robust performance of each model on 100 random samples from each dataset's test set.



¹ https://michaelbach.de/ot/
² https://github.com/yaodongyu/TRADES.git



Figure 1: Overview of the proposed AFD approach: (a) visual comparison of several adversarial robustness methods (Adversarial Training (Madry et al., 2017), TRADES (Zhang et al., 2019b), and AFD). The dotted black line corresponds to the decision boundary of the adversarial discriminator; (b) schematic of the proposed AFD paradigm.

Figure 3: Comparison of robust accuracy of different methods against white-box attacks on CIFAR10 dataset with ResNet18 architecture.

Figure 4: (left) Comparison of normalized representation sensitivity on the test-sets of MNIST (top), CIFAR10 (middle), and CIFAR100 (bottom) under the PGD-L∞ attack. Plots show the median (±std) sensitivity over the test-set for each dataset. * denotes a statistically significant difference between the sensitivity distributions for AFD and TRADES. (right) Logarithm of the average gradient magnitude of the class likelihood with respect to the input, evaluated at samples from the test-set of each dataset (log E_{x∼X} ‖∂l_t/∂x‖).
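The gradient-magnitude metric in the right panel can be sketched on a toy model; here the "classifier" is a hypothetical linear-softmax map and the gradient is taken by central finite differences, purely for illustration:

```python
import numpy as np

def true_class_grad_norm(x, W, t, h=1e-5):
    """L2 norm of the finite-difference gradient of the true-class
    likelihood l_t = softmax(W x)_t with respect to the input x.
    Low values indicate a desensitized representation (cf. Theorem 1,
    which predicts this gradient vanishes at the global minimum)."""
    def l_t(v):
        z = W @ v
        z = z - z.max()            # numerical stability
        p = np.exp(z) / np.exp(z).sum()
        return p[t]
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (l_t(x + e) - l_t(x - e)) / (2 * h)
    return np.linalg.norm(g)
```

Averaging `log(true_class_grad_norm(x, W, t))` over test samples gives the quantity plotted in the right panel, up to the choice of model and differentiation method.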

Figure A1: Comparison of robust accuracy of different methods against white-box attacks on MNIST dataset with ResNet18 architecture.

Figure A6: Robust accuracy of AFD-trained models on CIFAR10 dataset against various attacks when using different levels of attack strength during training.

Figure A7: Feature visualization. Pixel values were changed in the direction of the gradients that would maximize either the ground truth class (left column) or a randomly selected class (right column). For each image, the original image (left), transformed image (middle) and gradient map (right) are shown.

Figure A8: Scatter plot of 2-dimensional t-SNE projection (Maaten & Hinton, 2008) of the representations derived from training the ResNet18 architecture on MNIST dataset. (top row) t-SNE projection of representations of clean images for networks trained with different methods. Each point corresponds to the representation of one of the images from the MNIST test-set. (rows 2 to 5) t-SNE projection of the representation of the clean and perturbed MNIST test-set images. Columns are sorted from left to right by the strength of the perturbation (left-most column corresponds to clean images and right-most column to the highest tested perturbation). Perturbations are generated using the PGD-L∞ attack. NT: naturally trained; AT: adversarially trained (Madry et al., 2017); TRADES: (Zhang et al., 2019b); AFD: adversarial feature desensitization.

Figure A9: Scatter plot of 2-dimensional t-SNE projection (Maaten & Hinton, 2008) of the representations derived from training the ResNet5 architecture on CIFAR10 dataset. (top row) t-SNE projection of representations of clean images for networks trained with different methods. Each point corresponds to the embedding of one of the images from the CIFAR10 test-set. (rows 2 to 5) t-SNE projection of the embedding of the clean and perturbed CIFAR10 test-set images. Columns are sorted from left to right by the strength of the perturbation (left-most column corresponds to clean images and right-most column to the highest tested perturbation). NT: naturally trained; AT: adversarially trained (Madry et al., 2017); TRADES: (Zhang et al., 2019b); AFD: adversarial feature desensitization.

Figure A10: Scatter plot of 2-dimensional t-SNE projection (Maaten & Hinton, 2008) of the representation derived from training the ResNet5 architecture on CIFAR100 dataset. (top row) t-SNE projection of representations of clean images for networks trained with different methods. Each point corresponds to the representation of one of the images from the CIFAR100 test-set. (rows 2 to 5) t-SNE projection of the representation of the clean and perturbed CIFAR100 test-set images. Columns are sorted from left to right with the strength of the perturbation (left-most column corresponds to clean images and right-most column with highest tested perturbation). NT: naturally trained; AT: adversarially trained (Madry et al., 2017); TRADES (Zhang et al., 2019b); AFD: adversarial feature desensitization.

Comparison of robust accuracy against various attacks on different datasets. For all attacks we used ε = 0.3 for MNIST and ε = 8/255 for the CIFAR10/CIFAR100 datasets. † indicates replicated results. NT: natural training; AT: adversarial training; AFD: adversarial feature desensitization; WB: white-box attack; BB: black-box attack in which the adversarial examples were produced by running the attack on the NT ResNet18 model. Numbers reported as µ ± σ denote the mean and standard deviation over three independent runs with different random initializations. *RST (Carmon et al., 2019) additionally uses 500K unlabeled images during training.

AUC measures for different perturbations and methods on the MNIST, CIFAR10, and CIFAR100 datasets. AUC values are normalized so that the maximum attainable value is 1. Evaluations of AT and TRADES were made on networks trained using reimplemented or official code.
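A natural reading of this normalization is the area under the robust-accuracy-versus-ε curve divided by the ε range, so that a model that stayed at 100% accuracy across all tested perturbation strengths would score exactly 1. A minimal sketch under that assumption (the paper's exact normalization may differ):

```python
import numpy as np

def normalized_auc(eps, acc):
    """Area under the robust-accuracy-vs-epsilon curve (trapezoid rule),
    normalized by the epsilon range so that a hypothetical model with
    100% accuracy at every tested epsilon scores exactly 1."""
    eps = np.asarray(eps, dtype=float)
    acc = np.asarray(acc, dtype=float)
    # Trapezoid rule over the sampled epsilon grid.
    area = ((acc[1:] + acc[:-1]) / 2.0 * np.diff(eps)).sum()
    return area / (eps[-1] - eps[0])
```

For instance, accuracy decaying linearly from 1.0 to 0.0 over the tested range yields a normalized AUC of 0.5.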

Training hyperparameters for each dataset and network.

Theorem 1. If the adversarial optimization of the embedding and discriminator functions, E_θ and Da_ψ, converges to the global minimum (θ*, ψ*) of the training objective in equation 2, then the gradient of the true-class (t) likelihood with respect to the input x is zero at every x ∈ X, i.e. ∂l_t/∂x = 0.

Attack hyperparameters for each dataset and attack.

Transfer black-box attack from the ResNet18 network trained with adversarial training (AT) and TRADES on different datasets.

Dimensionality of the learned representation space on various datasets using different methods and measures. Units: number of non-zero feature dimensions over the test-set within each dataset. Dims: number of PCA dimensions that account for 99% of the variance across all images within the test-set of each dataset.
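The "Dims" measure above (number of PCA dimensions accounting for 99% of the variance) can be sketched directly from the representation matrix; this is an illustrative implementation, assuming rows are test-set samples:

```python
import numpy as np

def pca_dims(X, var_frac=0.99):
    """Number of principal components needed to explain var_frac of the
    variance of the representation matrix X (rows = samples)."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give per-component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    # First index at which the cumulative ratio reaches var_frac.
    return int(np.searchsorted(ratio, var_frac) + 1)
```

The "Units" measure is simpler still: count the feature dimensions whose activations are non-zero for at least one test-set image.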

Comparison of robust accuracy against PGD-L∞ with ε = 0.3 using different architectures for the adversarial discriminator, tested on the MNIST dataset. Columns: Dataset, Model, Da Architecture, Robust Acc.

Comparison of representation perturbations in response to different attacks. We computed the cosine angle between the representation perturbations produced by each attack and those produced by PGD-L∞. Values are reported in radians.

