WITH FALSE FRIENDS LIKE THESE, WHO CAN HAVE SELF-KNOWLEDGE?

Abstract

Adversarial examples arise from a model's excessive sensitivity. Commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples, to induce misclassification. This paper studies an intriguing, yet far overlooked, consequence of this excessive sensitivity: a misclassified example can be easily perturbed to help the model produce the correct output. Such perturbed examples look harmless, but can actually be maliciously utilized by a false friend to make the model self-satisfied; we thus name them hypocritical examples. With false friends like these, a poorly performing model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the "well-performed" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize the hypocritical risk for the first time and propose a defense method specialized for hypocritical examples, which minimizes the tradeoff between natural risk and an upper bound of hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify the theoretical results and the effectiveness of our proposed methods. (1) This is of scientific interest: hypocritical examples are the opposite of adversarial examples; while adversarial examples are hard test data for a model, hypocritical examples aim to make correct classification easy.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved breakthroughs in a variety of challenging problems such as image understanding (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and automatic game playing (Mnih et al., 2015). Despite these remarkable successes, their pervasive failures in adversarial settings, i.e., the phenomenon of adversarial examples (Biggio et al., 2013; Szegedy et al., 2014), have attracted significant attention in recent years (Athalye et al., 2018; Carlini et al., 2019; Tramer et al., 2020). Small perturbations on inputs crafted by adversaries are capable of causing well-trained models to make big mistakes, which indicates that there is still a large gap between machine and human perception, thus posing potential security concerns for practical machine learning (ML) applications (Kurakin et al., 2016; Qin et al., 2019; Wu et al., 2020b). An adversarial example is "an input to a ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output" (Goodfellow & Papernot, 2017). Following the definition of adversarial examples on classification problems (Goodfellow et al., 2015; Papernot et al., 2016; Elsayed et al., 2018; Carlini et al., 2019; Zhang et al., 2019; Wang et al., 2020b; Zhang et al., 2020; Tramèr et al., 2020), given a DNN classifier f and a correctly classified example x with class label y (i.e., f(x) = y), an adversarial example x_adv is generated by perturbing x such that f(x_adv) ≠ y and x_adv ∈ B_ε(x). The neighborhood B_ε(x) denotes the set of points within a fixed distance ε > 0 of x, as measured by some metric (e.g., the l_p distance), so that x_adv is visually the "same" for human observers. Then, an imperfection of the classifier is highlighted by the performance gap G_adv = Acc(D) − Acc(A) between the accuracy (denoted by Acc(·)) evaluated on a clean set sampled from the data distribution D and on the adversarially perturbed set A.
An adversary could construct a perturbed set A that looks no different from D but can severely degrade the performance of even state-of-the-art DNN models. From direct attacks in the digital space (Goodfellow et al., 2015; Carlini & Wagner, 2017) to robust attacks in the physical world (Kurakin et al., 2016; Xu et al., 2020), from toy classification problems (Chen et al., 2020; Dobriban et al., 2020) to complicated perception tasks (Zhang & Wang, 2019; Wang et al., 2020a), from the high dimensional nature of the input space (Goodfellow et al., 2015; Gilmer et al., 2018) to the framework of (non-)robust features (Jetley et al., 2018; Ilyas et al., 2019), many efforts have been devoted to understanding and mitigating the risk raised by adversarial examples, thus closing the gap G_adv.
Previous works mainly concern the adversarial risk on correctly classified examples. However, they typically neglect a risk on misclassified examples themselves, which will be formalized in this work. In this paper, we first investigate an intriguing, yet far overlooked phenomenon: given a DNN classifier f and a misclassified example x with class label y (i.e., f(x) ≠ y), we can easily perturb x to x_hyp such that f(x_hyp) = y and x_hyp ∈ B_ε(x). Such an example x_hyp looks harmless, but can actually be maliciously utilized by a false friend to fool a model into being self-satisfied. Consider, for example, the testing of ML-based autonomous vehicles (Briefs, 2015; Lei, 2018). An attacker may add imperceptible perturbations to the test examples (e.g., the "stop sign" on the road) stealthily, without human notice, to hypocritically help an ML-based autonomous vehicle pass tests that it might otherwise fail. However, the high performance cannot be maintained on public roads without the help of the attacker. Thus, the potential risk is underestimated, and traffic accidents might happen unexpectedly when the vehicle is driving on public roads. We propose a defense method that improves model robustness against hypocritical perturbations. Specifically, we formalize the hypocritical risk and minimize it via a differentiable surrogate loss (Section 3). Experimentally, we verify the effectiveness of our proposed attack (Section 2.1) and defense (Section 4.1). Further, we study the transferability of hypocritical examples across models trained with various methods (Section 4.2). Finally, we conclude our paper by discussing and summarizing our results (Section 5 and Section 6). Our main contributions are:
• We give a formal definition of hypocritical examples. We demonstrate the unreliability of the standard evaluation process in the existence of false friends and show the potential security risk of deploying a model with high hypocritical performance.
• We formalize the hypocritical risk and analyze its relation to natural risk and adversarial risk. We propose the first defense method specialized for hypocritical examples by minimizing the tradeoff between the natural risk and an upper bound of the hypocritical risk.
• Extensive experiments verify the effectiveness of our proposed methods. We also examine the transferability of hypocritical examples and show that transferability is not always desired by the attackers, depending on their purpose.

2. FALSE FRIENDS AND ADVERSARIES

Better an open enemy than a false friend! Only by being aware of the potential risk of false friends can we prevent it. In this section, we expose a kind of false friend who is capable of manipulating model performance stealthily during the evaluation process, thus making the evaluation results unreliable.

We consider a classification task with data (x, y) ∈ R^d × {1, . . . , C} from a distribution D. Denote by f : R^d → {1, . . . , C} the classifier which predicts the class of an input example x: f(x) = argmax_k p_k(x), where p_k(x) is the kth component of p(x) : R^d → Δ^C (e.g., the output after the softmax activation), in which Δ^C = {u ∈ R^C | 1^T u = 1, u ≥ 0} is the probabilistic simplex.

Adversarial examples are malicious inputs crafted by an adversary to induce misclassification. We first give the commonly accepted definition of adversarial examples as follows:

Definition 1 (Adversarial Examples). Given a classifier f and a correctly classified input (x, y) ∼ D (i.e., f(x) = y), an ε-bounded adversarial example is an input x* ∈ R^d such that f(x*) ≠ y and x* ∈ B_ε(x).

The assumption underlying this definition is that inputs satisfying x* ∈ B_ε(x) preserve the label y of the original input x. As a false friend, a hypocritical example can be generated from a misclassified example by solving max_{x'∈B_ε(x)} 1(f(x') = y), which is equivalent to min_{x'∈B_ε(x)} 1(f(x') ≠ y), where 1(·) is the indicator function. Similar to Madry et al. (2018); Wang et al. (2020b), in practice we leverage the commonly used cross entropy (CE) loss as the surrogate loss of 1(f(x') ≠ y) and minimize it by projected gradient descent (PGD). Note that Equation 2 looks similar to, but conceptually differs from, the known targeted adversarial attack (Carlini & Wagner, 2017), which generates a kind of adversarial example defined on correctly classified clean inputs and targeted to wrong classes.
The hypocritical examples here are defined on misclassified inputs and are targeted to their right classes.
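To make the attack concrete, here is a minimal sketch of the hypocritical PGD attack described above, with a toy linear softmax classifier standing in for a DNN (the model, dimensions, and budget are illustrative, not the paper's setup): PGD takes signed steps that decrease the CE loss toward the true label, projecting back onto the l_∞ ball and the valid input range after every step.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hypocritical_pgd(x, y, W, b, eps, steps=50):
    """PGD that *minimizes* the CE surrogate so a misclassified x is perturbed
    into a correctly classified x_hyp within the l_inf ball B_eps(x)."""
    alpha = eps / 10                 # illustrative step size
    x_hyp = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_hyp + b)
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        grad = W.T @ (p - onehot)                   # gradient of CE w.r.t. input
        x_hyp = x_hyp - alpha * np.sign(grad)       # descend: "help" the model
        x_hyp = np.clip(x_hyp, x - eps, x + eps)    # project onto the eps-ball
        x_hyp = np.clip(x_hyp, 0.0, 1.0)            # stay a valid input
    return x_hyp

# toy 2-class linear model that misclassifies x (true label y = 0)
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.zeros(2)
x = np.array([0.45, 0.55, 0.50, 0.50])
y = 0
x_hyp = hypocritical_pgd(x, y, W, b, eps=0.1)
```

For a real DNN the only change is computing `grad` by backpropagation; the adversarial attack is identical except that the update ascends the loss instead of descending it.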

2.1. ATTACK RESULTS

In this subsection, we demonstrate the power of our proposed hypocritical attack on three benchmark datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015). A randomly initialized (Naive) model can appear invariant to hypocritical perturbations, but that is just a trivial defense: it simply predicts most of the points in the input region as a certain class because of the poor scaling of network weights at initialization (He et al., 2016b; Elsayed et al., 2019). More discussions are in Appendix A.2. Therefore, it is not enough to blindly pursue robustness against hypocritical perturbations while ignoring the performance on clean examples.

3. HYPOCRITICAL RISK

In this section, we formalize the hypocritical risk and analyze the relation between natural risk, adversarial risk, and hypocritical risk. We propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of the hypocritical risk. Moreover, by decomposing an existing method designed for adversarial defense (TRADES (Zhang et al., 2019)), we find that, surprisingly, TRADES minimizes not only the adversarial risk on correctly classified examples, but also a looser upper bound of the hypocritical risk. Our theoretical analysis suggests that TRADES can be another candidate defense method for hypocritical examples.

To characterize the adversarial robustness of a classifier f, Madry et al. (2018) study the adversarial risk
R_adv(f) = E_{(x,y)∼D} [ max_{x'∈B_ε(x)} 1(f(x') ≠ y) ].
The standard measure of classifier performance, known as natural risk, is denoted by
R_nat(f) = E_{(x,y)∼D} [ 1(f(x) ≠ y) ].
Let q(x, y) be the probability density function of the data distribution D. We denote by S+_f the conditional data distribution on correctly classified examples w.r.t. f, with conditional density function q(x, y | E) = q(x, y)/Z(E) if E is true (otherwise q(x, y | E) = 0), where the event E is f(x) = y and Z(E) = ∫ 1(f(x) = y) dq(x, y) is a normalizing constant. We denote by S−_f the conditional data distribution on misclassified examples, with the analogous conditional density function and f(x) ≠ y as the event E. Then we have the following relation between the natural risk and the adversarial risk:

Proposition 1. Denote the adversarial risk on correctly classified examples by R̃_adv(f) = E_{(x,y)∼S+_f} [ max_{x'∈B_ε(x)} 1(f(x') ≠ y) ]. Then we have R_adv(f) = R_nat(f) + (1 − R_nat(f)) R̃_adv(f).

Proposition 1 shows that we can view the adversarial risk R_adv(f) as a tradeoff between R_nat(f) and R̃_adv(f) with the scaling parameter λ = 1 − R_nat(f).
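Proposition 1 is a direct consequence of the law of total expectation; since its proof does not appear in the appendix excerpt below, here is a short derivation sketch (ours, consistent with the definitions above):

```latex
\begin{align*}
\mathcal{R}_{\mathrm{adv}}(f)
  &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{x'\in\mathcal{B}_\epsilon(x)}\mathbb{1}\big(f(x')\neq y\big)\Big] \\
  &= \Pr[f(x)\neq y]\;\mathbb{E}_{(x,y)\sim\mathcal{S}^-_f}\Big[\max_{x'\in\mathcal{B}_\epsilon(x)}\mathbb{1}\big(f(x')\neq y\big)\Big]
   + \Pr[f(x)=y]\;\mathbb{E}_{(x,y)\sim\mathcal{S}^+_f}\Big[\max_{x'\in\mathcal{B}_\epsilon(x)}\mathbb{1}\big(f(x')\neq y\big)\Big] \\
  &= \mathcal{R}_{\mathrm{nat}}(f)\cdot 1
   + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\widetilde{\mathcal{R}}_{\mathrm{adv}}(f),
\end{align*}
```

where the S−_f term equals 1 because x' = x ∈ B_ε(x) already satisfies f(x') ≠ y on misclassified examples.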
The adversarial risk on correctly classified examples R̃_adv(f) is in sharp contrast to the hypocritical risk defined on misclassified examples, formalized as follows:

Definition 3 (Hypocritical Risk). The hypocritical risk on misclassified examples of a classifier f under the threat model of the ε-bounded ball is defined as R̃_hyp(f) = E_{(x,y)∼S−_f} [ max_{x'∈B_ε(x)} 1(f(x') = y) ].

The hypocritical risk R̃_hyp(f) is the proportion of (originally misclassified) examples that become correctly classified after a false friend's attack. When considering the existence of false friends, a good model should have not only low natural risk but also low hypocritical risk, so as to be robust against hypocritical perturbations.
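Definition 3 can be instantiated exactly for a binary linear classifier under the l_∞ threat model, since the scores reachable within B_ε(x) form the interval [s(x) − ε‖w‖₁, s(x) + ε‖w‖₁]. The following sketch is our own illustration (box constraints on inputs are ignored for simplicity):

```python
import numpy as np

def exact_hypocritical_risk(X, y, w, b, eps):
    """Exact hypocritical risk of f(x) = sign(w @ x + b) (labels in {-1, +1})
    under an l_inf ball of radius eps. A misclassified x can be rescued iff
    the correct sign is reachable: y * (w @ x + b) + eps * ||w||_1 > 0."""
    s = X @ w + b
    mis = y * s < 0                                   # misclassified examples
    if not mis.any():
        return 0.0
    rescuable = y[mis] * s[mis] + eps * np.abs(w).sum() > 0
    return float(rescuable.mean())

w = np.array([1.0, -1.0]); b = 0.0
X = np.array([[0.4, 0.5],    # misclassified, margin -0.1: rescuable
              [0.1, 0.6],    # misclassified, margin -0.5: beyond the budget
              [0.8, 0.2]])   # correctly classified
y = np.array([1, 1, 1])
risk = exact_hypocritical_risk(X, y, w, b, eps=0.1)   # eps * ||w||_1 = 0.2
# risk == 0.5: one of the two misclassified points can be rescued
```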

3.1. TRADEOFF BETWEEN NATURAL AND HYPOCRITICAL RISKS

Figure 2: Counterexample given by Equation 4.

Motivated by the tradeoff between natural and adversarial risks (Tsipras et al., 2019; Zhang et al., 2019), we notice that there may also exist an inherent tension between the goals of natural risk minimization and hypocritical risk minimization. To illustrate the phenomenon, we provide a toy example, modified from the example in Zhang et al. (2019), whose risk minimization solutions can be found analytically. Consider the case (x, y) ∈ R × {−1, +1} from a distribution D, where the marginal distribution over the instance space is uniform over [0, 1], and for k = 0, 1, . . . , 1/(2ε) − 1,
η(x) := Pr(y = +1 | x) = 1/4 for x ∈ [2kε, (2k + 1)ε), and η(x) = 1 for x ∈ ((2k + 1)ε, (2k + 2)ε].
See Figure 2 for a visualization of η(x). In this problem, we consider two classifiers: a) the Bayes optimal classifier sign(2η(x) − 1); b) the all-one classifier which always outputs "positive". Table 3 displays the tradeoff between natural and hypocritical risks: the minimal natural risk 1/8 is achieved by the Bayes optimal classifier with large hypocritical risk, while the optimal hypocritical risk 0 is achieved by the all-one classifier with large natural risk.

It is natural then to optimize our models to minimize natural and hypocritical risks at the same time. However, it is hard to optimize R̃_hyp(f) directly. To ease the optimization obstacles, we derive the following upper bounds.

Theorem 1. For any data distribution D and its corresponding conditional distribution on misclassified examples S−_f w.r.t. a classifier f, we have
R̃_hyp(f) = E_{(x,y)∼S−_f} [ 1(f(x_hyp) = y) ] ≤ E_{(x,y)∼S−_f} [ 1(f(x_hyp) ≠ f(x)) ] =: R̄_hyp(f) ≤ E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ],
where x_hyp = argmax_{x'∈B_ε(x)} 1(f(x') = y) and x_rev = argmax_{x'∈B_ε(x)} 1(f(x') ≠ f(x)). Here x_rev pursues reversing a clean example to a different class, from the point of view of the model.
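The risks claimed for the two classifiers in the toy example can be checked numerically. The brute-force sketch below is our own check (the grid resolution is arbitrary); it recovers natural risks 1/8 and 3/8, and hypocritical risks 1 and 0, for the Bayes optimal and the all-one classifier, respectively.

```python
import numpy as np

# x ~ Uniform[0, 1]; eta(x) = Pr(y=+1|x) alternates between 1/4 and 1
# on consecutive eps-width strips, as in the toy example.
eps = 0.05                        # 1/eps = 20 strips
n = 4000                          # grid resolution (200 cells per strip)
xs = (np.arange(n) + 0.5) / n     # midpoints avoid strip-boundary ties
strip = np.floor(xs / eps).astype(int)
eta = np.where(strip % 2 == 0, 0.25, 1.0)

preds = {
    "bayes":   np.where(eta >= 0.5, 1, -1),   # sign(2*eta(x) - 1)
    "all-one": np.ones(n, dtype=int),         # always predicts "+1"
}
radius = int(round(eps * n))      # the eps-ball spans +/- radius grid cells

def risks(pred):
    # natural risk: Pr(f(x) != y)
    nat = np.mean(np.where(pred == 1, 1.0 - eta, eta))
    # hypocritical risk: among misclassified (x, y) mass, the fraction for
    # which some x' within eps is classified as y (brute-force window search)
    mis_mass, rescued = 0.0, 0.0
    for i in range(n):
        window = pred[max(0, i - radius):min(n, i + radius + 1)]
        for y, mass in ((1, eta[i]), (-1, 1.0 - eta[i])):
            if pred[i] != y:                  # (x, y) is misclassified
                mis_mass += mass
                if (window == y).any():       # a rescuing x' exists
                    rescued += mass
    return nat, rescued / mis_mass

nat_b, hyp_b = risks(preds["bayes"])     # 0.125, 1.0
nat_a, hyp_a = risks(preds["all-one"])   # 0.375, 0.0
```

Every misclassified point under the Bayes classifier sits within ε of an oppositely labeled strip, so its hypocritical risk is 1, while the constant classifier can never be flipped.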
The upper bounds in Theorem 1 allow us to optimize the hypocritical risk using surrogate loss functions that are both physically meaningful and computationally tractable. Before moving to algorithmic design, we state a useful proposition, which reveals the internal mechanism behind TRADES.

Proposition 2. R_rev(f) = (1 − R_nat(f)) R̃_adv(f) + R_nat(f) E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ] = E_{(x,y)∼D} [ 1(f(x_rev) ≠ f(x)) ].

Our method minimizes the tradeoff between the natural risk and the tighter upper bound of the hypocritical risk: R_THRM(f) = R_nat(f) + λ R̄_hyp(f), where λ > 0 is a tunable scaling parameter balancing the importance of natural risk and hypocritical risk. We name our method THRM (Tradeoff for Hypocritical Risk Minimization). Optimization over the 0-1 loss in THRM is still intractable. In practice, for the indicator function 1(f(x) ≠ y) in R_nat(f), we adopt the commonly used CE loss as the surrogate loss. Observing that R̄_hyp(f) = (1/R_nat(f)) E_{(x,y)∼D} [ 1(f(x_hyp) ≠ f(x)) ], we absorb the R_nat(f) term into λ and use the KL divergence as the surrogate loss of the indicator function 1(f(x_hyp) ≠ f(x)) (Zheng et al., 2016; Zhang et al., 2019; Wang et al., 2020b). The final objective function is
L_THRM = E_{(x,y)∼D} [ L_CE(p(x), y) + λ L_KL(p(x), p(x_hyp)) ].
Intuition behind the objective L_THRM: the first term encourages the natural risk to be optimized, while the second regularization term encourages the output to be stable against hypocritical perturbations; that is, the classifier should not be overly confident in its predictions, especially when a false friend wants it to be.

To derive the objective function for TRADES, we can minimize the tradeoff between the natural risk and the reversible risk: R_TRADES(f) = R_nat(f) + λ R_rev(f). Similar to THRM, we use the CE loss and the KL divergence as the surrogate losses of 1(f(x) ≠ y) and 1(f(x_rev) ≠ f(x)), respectively. The final objective function becomes
L_TRADES = E_{(x,y)∼D} [ L_CE(p(x), y) + λ L_KL(p(x), p(x_rev)) ],
which is exactly the multi-class classification objective function first proposed in Zhang et al. (2019) for adversarial defense.
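A minimal sketch of the THRM objective for a single example, again with a toy linear softmax model standing in for the DNN (the inner PGD minimizes CE, as in the training procedure of Appendix A.3; the model, step counts, and step sizes here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def thrm_loss(x, y, W, b, eps, lam, steps=10):
    """L_THRM = CE(p(x), y) + lambda * KL(p(x) || p(x_hyp)) for one example.
    Inner step: PGD *minimizing* CE to find the hypocritical perturbation."""
    alpha = eps / 4
    x_hyp = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_hyp + b)
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        grad = W.T @ (p - onehot)                     # d CE / d input
        x_hyp = np.clip(x_hyp - alpha * np.sign(grad), x - eps, x + eps)
        x_hyp = np.clip(x_hyp, 0.0, 1.0)
    p_clean, p_hyp = softmax(W @ x + b), softmax(W @ x_hyp + b)
    ce = -np.log(p_clean[y])                          # natural-risk surrogate
    kl = np.sum(p_clean * (np.log(p_clean) - np.log(p_hyp)))
    return ce + lam * kl, ce, kl

W = np.eye(2); b = np.zeros(2)
x = np.array([0.45, 0.55]); y = 0                     # misclassified example
loss, ce, kl = thrm_loss(x, y, W, b, eps=0.1, lam=1.0)
```

Swapping the inner attack for one that maximizes the KL term against p(x) (i.e., searching for x_rev instead of x_hyp) recovers the TRADES objective described above.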
From the perspective of the hypocritical risk, Proposition 2 reveals an advantage of TRADES: it is capable of minimizing a (looser) upper bound of the hypocritical risk R̄_hyp(f), and thus can be considered a candidate defense method for hypocritical examples. Proposition 2 also implies that there may be a deeper connection between adversarial robustness and hypocritical robustness. We discuss this further and compare our proposed THRM with TRADES in the next section.

4. EXPERIMENTS

In this section, to verify the effectiveness of the methods (THRM and TRADES) suggested in Section 3.3, we conduct experiments on real-world datasets including MNIST and CIFAR-10. 

4.1. WHITE-BOX ANALYSIS

To cover a wide range of the scaling parameter λ, we conduct experiments in parallel over multiple NVIDIA Tesla V100 GPUs. On MNIST, perturbations are bounded in l_∞ norm with ε = 0.2. On CIFAR-10, models are trained against 3 different hypocritical attackers bounded in l_∞ norm with ε = 1/255, ε = 2/255 and ε = 8/255, respectively. Each experiment is conducted 3 times with different random seeds. The hypocritical risk reported here is an approximation of the real value, since the inner optimization problem is NP-hard and we approximately solve it using a surrogate loss and PGD on the test set. Further details about model architecture and training procedure are in Appendix A.3. Note that these experiments are extensive: it takes over 230 GPU days to fully train the models considered in this section. We believe these experiments are beneficial to the ML community for further understanding the tradeoffs and relative merits of THRM and TRADES.

Results on MNIST (ε = 0.2) and CIFAR-10 (ε = 2/255) are shown in Figure 3. Each data point represents a model trained with a different λ. More results, including a comparison with Madry's defense (Madry et al., 2018), are provided in Appendix A.3 due to limited space. First, we observe that, on both datasets, as the regularization parameter λ increases, the natural risk R_nat increases while the hypocritical risk R̃_hyp decreases, which verifies the effectiveness of our proposed method and the theoretical analysis in Proposition 2, where we reveal that TRADES is capable of minimizing a looser upper bound of the hypocritical risk. Second, we show that THRM achieves a better tradeoff on MNIST, since it optimizes a tighter upper bound than TRADES. However, the situation becomes nuanced on CIFAR-10. As shown in Figure 3(b), THRM behaves better when λ is small but is surpassed by TRADES as λ increases.
Overall, optimizing only a tighter upper bound of the hypocritical risk achieves a better tradeoff on the test set when the task is relatively simple (e.g., on MNIST with ε = 0.2), while simultaneously optimizing the hypocritical risk and the adversarial risk achieves a better tradeoff when the task is hard (e.g., on CIFAR-10 with ε = 2/255 and ε = 8/255). The above phenomenon shows that, with finite sample size and finite-time gradient-descent trained classifiers, better adversarial robustness may help the generalization of hypocritical robustness, which conforms to our intuition that they are two sides of the same coin. Interestingly, a contemporary work claims that, on CIFAR-10, TRADES achieves better adversarial robustness than Madry's defense under fair hyperparameter settings (Anonymous, 2021). Thus there may be potential mutual benefits between adversarial robustness and hypocritical robustness. After all, robust training objectives force DNNs to be invariant to signals that humans are invariant to, which may lead to feature representations more similar to those humans use (Salman et al., 2020). A rigorous treatment of this synergism is beyond the scope of the current paper but is an important future direction.

4.2. TRANSFERABILITY ANALYSIS

Transferability of adversarial examples across models is well known (Tramèr et al., 2017; Papernot et al., 2017b; Ilyas et al., 2019), and here we examine the transferability of hypocritical examples on MNIST and CIFAR-10. We observe that hypocritical examples i) transfer easily between naturally trained models, ii) are hard to transfer from randomly initialized models to other models (and vice versa), iii) are hard to transfer from standard models to defended models, and iv) usually have high transferability when generated from THRM models. Experimental details are in Appendix A.4. Better transferability is beneficial for black-box attacks but is not always desired by hypocritical attackers. A hypocritical attacker only expects high transferability to the targeted model the attacker chose to help. If other competing models are available to the deployer, the attacker actually does not want the hypocritical examples to transfer successfully to those competing models. Thus fine-grained attack methods are required; we leave this to future work.

5. DISCUSSION

The false friends considered in this paper are as powerful as typical adversaries: they all know the ground-truth labels of clean examples. Such powerful friends can actually help a model not only correctly classify a misclassified clean example, but also correctly classify an adversarial example crafted by an adversary. One may expect to rely on true friends against adversaries. Unfortunately, an omniscient and faithful friend is unachievable in practical tasks, at least so far. Once it is achieved, the problem of robustness disappears immediately. What we can do at present is to use a relatively more robust model as a surrogate of the true friend to improve the robustness of a weak model. This suggests a promising general method in practice: high-performance models can be employed as true friends to help a weak model without exposing training data and model weights, for the purposes of privacy protection and knowledge transfer (Abadi et al., 2016; Papernot et al., 2017a). Additional discussions are in Appendix C.

A. EXPERIMENTAL DETAILS

A.1. DETAILS IN FIGURE 1

Attack procedure. In adversarial attacks, we perturb clean inputs to maximize the surrogate loss using PGD; in hypocritical attacks, we perturb clean inputs to minimize the surrogate loss using PGD. For the demonstration in Figure 1, for the purpose of imperceptibility, we execute the PGD attack for 100 steps (step size ε/50) with early stopping on ImageNet, and the budget is ε = 2/255.

Training procedure. For all models, we use the default PyTorch initialization, except that we initialize the convolutional weights in Wide ResNet with He initialization (He et al., 2015). We conduct all the experiments using a single NVIDIA Tesla V100 GPU. Each experiment is conducted 3 times with different random seeds, except for the standard models trained on ImageNet, for which we use the pretrained standard models available within PyTorch.

Attack procedure (main evaluation). In adversarial attacks, we perturb clean inputs to maximize the surrogate loss using PGD; in hypocritical attacks, we perturb clean inputs to minimize the surrogate loss using PGD. We execute 50-step PGD attacks (step size ε/10) with 20 random restarts on MNIST and CIFAR-10, and 50-step PGD attacks (step size ε/8) on ImageNet. Other hyperparameter choices did not offer a significant change in accuracy. On MNIST, the hypocritically perturbed set F and the adversarially perturbed set A are constructed by attacking every example in the clean test set sampled from D; both attacks are bounded by an l_∞ ball with radius ε = 0.2. On CIFAR-10, both attacks are bounded by an l_∞ ball with radius ε = 8/255. On ImageNet, F and A are constructed based on its validation set sampled from D; both attacks are bounded by an l_∞ ball with radius ε = 16/255.

Numerical results. The attack results on CIFAR-10 are shown in Table 4. Full results of Table 1, Table 2 and Table 4 are shown in Table 5, Table 6 and Table 7, respectively. Moreover, we show the attack results of 9 Naive models evaluated on ImageNet in Table 6.
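The 50-step, 20-restart evaluation attack can be sketched as follows for a binary linear scorer (a toy stand-in of our own; a logistic surrogate replaces the multi-class CE loss, and the restart keeping the best correct-class margin is kept):

```python
import numpy as np

def hypocritical_pgd_restarts(x, y, w, b, eps, steps=50, restarts=20, seed=0):
    """Hypocritical PGD with random restarts for s(x) = w @ x + b,
    y in {-1, +1}: minimize log(1 + exp(-y * s)) so the prediction flips
    to the correct class; keep the restart with the best margin."""
    rng = np.random.default_rng(seed)
    alpha = eps / 10
    best_margin, best_x = -np.inf, x.copy()
    for _ in range(restarts):
        xp = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
        for _ in range(steps):
            s = w @ xp + b
            grad = -y * (1.0 / (1.0 + np.exp(y * s))) * w   # d loss / d input
            xp = np.clip(xp - alpha * np.sign(grad), x - eps, x + eps)
            xp = np.clip(xp, 0.0, 1.0)
        margin = y * (w @ xp + b)
        if margin > best_margin:
            best_margin, best_x = margin, xp
    return best_x

w = np.array([1.0, -1.0]); b = 0.0
x = np.array([0.4, 0.5]); y = 1        # misclassified: margin is -0.1
x_h = hypocritical_pgd_restarts(x, y, w, b, eps=0.1)
```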
We find that all the Naive models in the VGG family achieve high accuracy on F, while all the Naive models in the ResNet family perform relatively poorly on F. In particular, the Naive (ResNet152) model in Trial 1 is invariant to hypocritical perturbations: even in the presence of a strong false friend, the hypocritical performance is as low as the clean performance (only 0.1%). We carefully examined the Naive (ResNet152) model and found that it is actually a trivial classifier, which classifies almost all the points in the input region [0, 1]^d as a certain class for simple reasons, such as poor scaling of the network weights at initialization. Therefore, it is not enough to blindly pursue robustness against hypocritical perturbations while ignoring the performance on clean examples. Once we train a Naive model on clean examples, the model immediately becomes vulnerable (see Standard (ResNet50)), since the trained weights are better conditioned (Elsayed et al., 2019).

Architecture. For MNIST, a variant of the LeNet model (2 convolutional layers of sizes 32 and 64, and a fully connected layer of size 1024) is adopted. For CIFAR-10, a Wide ResNet (with depth 28 and width factor 10) is adopted.


Training procedure. To cover a wide range of the scaling parameter λ, we conduct experiments in parallel over multiple NVIDIA Tesla V100 GPUs. Each experiment is conducted 3 times with different random seeds. For MNIST, all models (including Standard, Madry, TRADES, THRM) are trained for 80 epochs with the Adam optimizer, batch size 128, and a learning rate of 0.001. Early stopping is done by holding out 1000 examples from the MNIST training set, as suggested in Rice et al. (2020). For CIFAR-10, all models are trained for 150 epochs with the SGD optimizer and batch size 128; the learning rate starts at 0.1 and is divided by 10 at epochs 90 and 125. We apply weight decay of 2e-4 and momentum of 0.9. Early stopping is done by holding out 1000 examples from the CIFAR-10 training set, as suggested in Rice et al. (2020).

Attack procedure. For the inner maximization in the objective function of THRM, we perturb clean inputs to minimize the CE loss as the surrogate loss. For the inner maximization in TRADES, we maximize the KL divergence as the surrogate loss. For the inner maximization in Madry, we maximize the CE loss as the surrogate loss. On MNIST, the training attack is PGD with random start and 10 iterations (step size ε/4). On CIFAR-10, the training attack is PGD with random start and 10 iterations (step size ε/4) when ε = 8/255, and PGD with random start and 7 iterations (step size ε/3) when ε = 1/255 and ε = 2/255. In all experiments, the test attack is 50-step PGD (step size ε/10) with 20 random restarts. Other hyperparameter choices did not offer a significant change in accuracy.

Numerical results.

The natural risk reported here is estimated on the test set. The hypocritical risk reported here is estimated on the test set and is an approximation of the real value, since we approximately solve the optimization problem by PGD on examples from the test set. Results on MNIST (ε = 0.2) and CIFAR-10 (ε = 1/255, ε = 2/255 and ε = 8/255) are shown in Figure 6. Each point in Figure 6 represents one model trained with a certain λ. Full numerical results on MNIST (ε = 0.2) and CIFAR-10 (ε = 1/255, ε = 2/255 and ε = 8/255) can be found in Table 8, Table 9, Table 10 and Table 11, respectively. On MNIST (ε = 0.2), THRM has a better tradeoff than TRADES. However, when the task becomes hard, TRADES performs as well as or better than THRM. On CIFAR-10, as the task becomes harder (the larger the radius, the harder the task), the gap between TRADES and THRM becomes larger. This phenomenon shows that better adversarial robustness may help the generalization of hypocritical robustness, especially when the task is hard. Moreover, we compare our methods with Madry et al. (2018)'s defense designed for adversarial robustness (denoted as Madry, which actually optimizes the adversarial risk in Equation 3 via a surrogate loss) and the standard training method (denoted as Standard). We summarize the results in Table 12. For direct comparison, we pick a certain λ for each model trained by TRADES and THRM in each task. We observe that, in all tasks, Madry's defense has non-negligible robustness to hypocritical examples, although there is no hypocritical risk or its upper bound in its objective function. This indicates that optimizing only the adversarial risk can bring a certain degree of robustness against hypocritical examples.
While these experimental results partly support our hypothesis (i.e., the potential mutual benefit between robustness against adversarial perturbations and robustness against hypocritical perturbations), we do not take the evidence as conclusive, and further exploration is needed. We note that the standard deviation becomes larger as λ grows in TRADES and THRM, which we attribute to optimization difficulty; it results in more significant differences among trials. Reducing the initial learning rate may mitigate this phenomenon. For completeness, we further evaluate the adversarial risk on correctly classified examples for the models trained by THRM and TRADES. Results on MNIST (ε = 0.2) and CIFAR-10 (ε = 2/255) are summarized in Table 13 and Table 14, respectively. One interesting finding is that models trained with THRM manifest noteworthy adversarial robustness, especially on CIFAR-10, although there is no adversarial risk term in the objective function of THRM. These facts also support the hypothesis of potential mutual benefits between robustness against adversarial perturbations and hypocritical perturbations.

B PROOFS OF MAIN RESULTS

In this section, we provide the proofs of our main results.

Theorem 1 (restated). For any data distribution D and its corresponding conditional distribution on misclassified examples S−_f w.r.t. a classifier f, we have
R̃_hyp(f) = E_{(x,y)∼S−_f} [ 1(f(x_hyp) = y) ] ≤ E_{(x,y)∼S−_f} [ 1(f(x_hyp) ≠ f(x)) ] =: R̄_hyp(f) ≤ E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ],
where x_hyp = argmax_{x'∈B_ε(x)} 1(f(x') = y) and x_rev = argmax_{x'∈B_ε(x)} 1(f(x') ≠ f(x)).

Proof. To prove the first inequality, we have
R̃_hyp(f) = E_{(x,y)∼S−_f} [ max_{x'∈B_ε(x)} 1(f(x') = y) ] = E_{(x,y)∼S−_f} [ 1(f(x_hyp) = y) ] ≤ E_{(x,y)∼S−_f} [ 1(f(x_hyp) ≠ f(x)) ],
where the inequality follows from two cases: if f(x_hyp) = y, then since f(x) ≠ y on S−_f, we have 1(f(x_hyp) = y) = 1 = 1(f(x_hyp) ≠ f(x)); if f(x_hyp) ≠ y, then 1(f(x_hyp) = y) = 0 ≤ 1(f(x_hyp) ≠ f(x)).

To prove the second inequality, we have
R̄_hyp(f) = E_{(x,y)∼S−_f} [ 1(f(x_hyp) ≠ f(x)) ] ≤ E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ].
Since (x, y) ∼ S−_f, we have f(x) ≠ y. If there exists an x_hyp such that f(x_hyp) = y, then f(x_hyp) ≠ f(x); taking x' = x_hyp shows that the maximum of 1(f(x') ≠ f(x)) over B_ε(x) is 1, and hence f(x_rev) ≠ f(x). Otherwise, if no x_hyp satisfies f(x_hyp) = y, there may still exist an x_rev such that f(x_rev) ≠ y yet f(x_rev) ≠ f(x). Therefore, the above inequalities hold.

B.3. A PROOF OF PROPOSITION 2

Proposition 2. R_rev(f) = (1 − R_nat(f)) R̃_adv(f) + R_nat(f) E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ] = E_{(x,y)∼D} [ 1(f(x_rev) ≠ f(x)) ].

Proof.
R_rev(f) = (1 − R_nat(f)) R̃_adv(f) + R_nat(f) E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ]
= (1 − R_nat(f)) E_{(x,y)∼S+_f} [ 1(f(x_adv) ≠ y) ] + R_nat(f) E_{(x,y)∼S−_f} [ 1(f(x_rev) ≠ f(x)) ]
= E_{(x,y)∼D} [ 1(f(x_rev) ≠ f(x)) ],
where the last step uses the law of total expectation over S+_f (with probability 1 − R_nat(f)) and S−_f (with probability R_nat(f)), together with the fact that on S+_f we have f(x) = y, so 1(f(x_rev) ≠ f(x)) = 1(f(x_adv) ≠ y).

C ADDITIONAL DISCUSSIONS

We showed that correctly classified examples (hypocritical examples) can be easily found in the vicinity of misclassified clean examples. As a result, a hypocritically perturbed set can be constructed from these hypocritical examples, and the victim model's standard accuracy evaluated on this set becomes higher than that on the clean set. It is natural then to wonder: what about the adversarially robust accuracy (i.e., accuracy under adversarial perturbations) of the victim model on hypocritical examples? It is easy to see that, if the adversary is bounded by the same ε-ball as the false friend, the model's adversarial accuracy evaluated on the hypocritically perturbed set is zero, since a misclassified example exists in the ball around each hypocritical example (by definition). However, if the adversary's power is restricted to a smaller δ-ball with δ < ε, then a robust hypocritical example may exist in the vicinity of a clean example, such that a δ-bounded adversary cannot change the model's prediction on it. In such a case, the model's adversarial accuracy evaluated on the robustly hypocritically perturbed set could be higher than that on the clean set. New attack and defense methods are required to further explore this phenomenon.



CONCLUSION

In this work, we expose a new risk arising from excessive sensitivity: model performance becomes hypocritical in the presence of false friends. By formalizing the hypocritical risk and analyzing its relation to natural risk and adversarial risk, we propose to use THRM and TRADES as defense methods against hypocritical perturbations. Extensive experiments verify the effectiveness of our methods. These findings open new avenues for mitigating and exploiting model sensitivity.



Figure 1: Comparison between adversarial examples and hypocritical examples. Left: Conceptual diagrams for the generation of an adversarial example x_adv and a hypocritical example x_hyp. The input space is (ground-truth) partitioned into the orange lined region (e.g., class "not panda") and the blue dotted region (e.g., class "panda"). The black solid line is the decision boundary of a non-robust model, which classifies the region above the boundary as "panda" and the region below as "not panda". Red and black shading in the ball B_ε(x) mark points that are misclassified and correctly classified, respectively. As we can see, x_adv or x_hyp can be easily found by perturbing a correctly classified x or a misclassified x across the model's decision boundary. Right: A demonstration of adversarial examples and hypocritical examples on real data. Here we choose ResNet50 (He et al., 2016a) trained on ImageNet (Russakovsky et al., 2015) as the victim model. In (a) the correctly classified "panda" can be stealthily perturbed to be misclassified as "tennis ball". In (b) the "panda" (misclassified as "tripod") can be stealthily perturbed to be correctly classified. Perturbations are rescaled for display.

Uesato et al. (2018); Cullina et al. (2018) defined the adversarial risk under the threat model of the bounded ε-ball:

, since $f(x_{\mathrm{hyp}}) \ne f(x)$ implies that the perturbed example has an output distribution different from that of the clean example. Our final objective function for THRM becomes
$$\mathcal{L}_{\mathrm{THRM}} = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}_{\mathrm{CE}}(p(x), y) + \lambda\,\mathcal{L}_{\mathrm{KL}}(p(x), p(x_{\mathrm{hyp}}))\big].$$
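A minimal per-example sketch of this objective is below, assuming `p_x` and `p_xhyp` are the model's softmax outputs on the clean input and on its hypocritical perturbation. The KL direction used (clean distribution against perturbed) is one plausible reading of $\mathcal{L}_{\mathrm{KL}}(p(x), p(x_{\mathrm{hyp}}))$, and `thrm_loss` is a hypothetical helper name, not the authors' implementation.

```python
import numpy as np

def thrm_loss(p_x, p_xhyp, y, lam):
    """Per-example THRM loss: cross-entropy on the clean input plus a
    lambda-weighted KL consistency term between clean and perturbed outputs."""
    ce = -np.log(p_x[y])                                   # cross-entropy term
    kl = np.sum(p_x * (np.log(p_x) - np.log(p_xhyp)))      # KL(p(x) || p(x_hyp))
    return ce + lam * kl

p_x = np.array([0.7, 0.2, 0.1])      # assumed softmax output on x
p_xhyp = np.array([0.6, 0.3, 0.1])   # assumed softmax output on x_hyp
loss = thrm_loss(p_x, p_xhyp, y=0, lam=1.0)
```

When the two output distributions agree, the KL term vanishes and the loss reduces to the clean cross-entropy; any disagreement adds a non-negative penalty scaled by λ.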

On CIFAR-10. Perturbations are bounded by l∞ norm with ε = 2/255.

Figure 3: Tradeoff between natural risk and hypocritical risk on real-world datasets.

More examples. More adversarial examples and hypocritical examples generated on ImageNet using our methods are shown in Figure 5. More hypocritical examples generated on MNIST and CIFAR-10 are shown in Figure 4(a) and Figure 4(b). The victim models are LeNet (Standard) and Wide ResNet (Standard) for MNIST and CIFAR-10, respectively. They are trained with the same procedures described in Appendix A.2. In both attacks, for the purpose of imperceptibility, we execute 100-step PGD attacks (step size ε/50) with early stopping on MNIST and CIFAR-10. The budget ε here is 0.2 for MNIST and 8/255 for CIFAR-10.
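The attack loop described above can be sketched in a few lines. This is an illustrative stand-in, not the authors' implementation: the linear model, data point, and budget below are hypothetical, and the key difference from adversarial PGD is that a false friend *descends* the classification loss, stopping early once the prediction flips to the true label.

```python
import numpy as np

w, b = np.array([1.0, -2.0]), 0.1          # hypothetical linear model

def predict(x):
    return int(w @ x + b > 0)

def hypocritical_pgd(x, y, eps, steps=100):
    """l_inf PGD-style hypocritical attack: descend the logistic loss for the
    true label y (an adversary would ascend it), with early stopping."""
    step = eps / 50.0                      # step size eps/50, as in the text
    x_hyp = x.copy()
    for _ in range(steps):
        if predict(x_hyp) == y:            # early stopping: already "correct"
            break
        p = 1.0 / (1.0 + np.exp(-(w @ x_hyp + b)))   # predicted P(y=1)
        grad = (p - y) * w                 # input gradient of the logistic loss
        x_hyp = x_hyp - step * np.sign(grad)         # signed descent step
        x_hyp = np.clip(x_hyp, x - eps, x + eps)     # project into the eps-ball
    return x_hyp

x, y = np.array([0.2, 0.3]), 1             # misclassified clean point
x_hyp = hypocritical_pgd(x, y, eps=0.3)
```

On this toy model the clean point is misclassified, while the returned `x_hyp` stays inside the ε-ball and is classified correctly.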

Figure 4: Hypocritical examples. In each subfigure, the first column shows the clean examples sampled from the original data distribution, the second column shows the generated perturbations, and the third column shows the perturbed examples. Perturbations are rescaled for display. Red labels and black labels below images denote misclassification and correct classification, respectively.


Full results of accuracy (%) evaluated on ImageNet. Attacks are bounded with ε = 16/255.

On CIFAR-10. Perturbations are bounded by l∞ norm with ε = 1/255.

On CIFAR-10. Perturbations are bounded by l∞ norm with ε = 8/255.

Figure 6: Tradeoff between natural risk and hypocritical risk on real-world datasets.

Proposition 1. $R_{\mathrm{adv}}(f) = R_{\mathrm{nat}}(f) + (1 - R_{\mathrm{nat}}(f))\,\widehat{R}_{\mathrm{adv}}(f)$.

Proof.
$$\begin{aligned} R_{\mathrm{adv}}(f) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{x' \in \mathcal{B}_\epsilon(x)} \mathbb{1}(f(x') \ne y)\Big] \\ &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\mathbb{1}(f(x) \ne y)\cdot\max_{x' \in \mathcal{B}_\epsilon(x)} \mathbb{1}(f(x') \ne y)\Big] + \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\mathbb{1}(f(x) = y)\cdot\max_{x' \in \mathcal{B}_\epsilon(x)} \mathbb{1}(f(x') \ne y)\Big] \\ &= R_{\mathrm{nat}}(f) + (1 - R_{\mathrm{nat}}(f))\,\widehat{R}_{\mathrm{adv}}(f). \end{aligned}$$

B.2 A PROOF OF THEOREM 1

Theorem 1. For any data distribution $\mathcal{D}$ and its corresponding conditional distribution on misclassified examples $\mathcal{S}_f^-$ w.r.t. a classifier $f$, we have $\mathbb{E}_{(x,y)\sim\mathcal{S}_f^-}\big[\mathbb{1}(f(x_{\mathrm{hyp}}) = y)\big]$
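The decomposition in Proposition 1 can also be verified numerically. The finite toy problem below (classifier, labels, and neighborhood are all hypothetical, chosen only so every expectation is a finite sum) lets us compute each side of the identity exactly.

```python
# Inputs are integers 0..9; neighborhood B(x) = {x-1, x, x+1} clipped to [0, 9].
def f(x):                       # hypothetical toy classifier
    return 0 if x < 5 else 1

def label(x):                   # hypothetical ground-truth labels
    return 0 if x < 3 else 1

def nbhd(x):
    return [max(0, x - 1), x, min(9, x + 1)]

xs = list(range(10))
# natural risk: fraction of misclassified examples
r_nat = sum(f(x) != label(x) for x in xs) / len(xs)
# adversarial risk over all examples: some neighbor is misclassified
r_adv = sum(any(f(z) != label(x) for z in nbhd(x)) for x in xs) / len(xs)
# adversarial risk restricted to correctly classified examples (S_f^+)
correct = [x for x in xs if f(x) == label(x)]
r_adv_hat = sum(any(f(z) != label(x) for z in nbhd(x))
                for x in correct) / len(correct)
```

The three quantities satisfy the identity exactly, since any misclassified example contributes 1 to the adversarial risk by itself (take x' = x).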

$\big[\mathbb{1}(f(x) = y)\cdot\mathbb{1}(f(x_{\mathrm{adv}}) \ne y)\big] + \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{1}(f(x) \ne y)\cdot\mathbb{1}(f(x_{\mathrm{rev}}) \ne f(x))\big] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{1}(f(x_{\mathrm{rev}}) \ne f(x))\big].$

Thus we name them hypocritical examples (see Figure 1 for a comparison with adversarial examples). Adversarial examples and hypocritical examples are two sides of the same coin. On one side, a well-performed but sensitive model becomes unreliable in the presence of adversaries. On the other side, a poorly performed but sensitive model behaves well with the help of false friends. With false friends like these, a naturally trained suboptimal model could have state-of-the-art performance; even worse, a randomly initialized model could behave like a well-trained one (see Section 2.1). Does our model truly achieve human-like intelligence, or is it simply that the test data prefers the model?

2. There are practical threats. A variety of nefarious ends may be achievable if the mistakes of ML systems can be covered up by hypocritical attackers. For instance, before allowing autonomous vehicles to drive on public roads, manufacturers must first pass tests in specific environments (closed or open roads) to obtain a license (Administration

In such a case, if the examples used to evaluate a model are falsified by a false friend, the model will appear perfect (on hypocritical examples), while it may actually perform poorly even on clean examples, not to mention adversarial examples. Thus a new imperfection of the classifier can be found in G

The reason for the existence of adversarial examples is that a model is overly sensitive to non-semantic changes. Next, we formalize a complementary phenomenon to adversarial examples, called hypocritical examples. Hypocritical examples are malicious inputs crafted by a false friend to stealthily correct the prediction of a model:



Comparison of Bayes optimal classifier and all-one classifier.

Proposition 2 shows a connection between adversarial risk and hypocritical risk: the adversarial risk on correctly classified examples $\widehat{R}_{\mathrm{adv}}(f)$ and the looser upper bound of the hypocritical risk on misclassified examples $\overline{R}_{\mathrm{hyp}}(f)$ can be seamlessly united into a new risk on all examples, $R_{\mathrm{rev}}(f)$. We name it reversible risk, since minimizing it pursues a model whose predictions cannot be reversed by small perturbations.

Training procedure. i) Models trained with the standard approach using clean examples (Standard). For MNIST, models are trained for 80 epochs with the Adam optimizer, batch size 128, and a learning rate of 0.001. Early stopping is done by holding out 1000 examples from the MNIST training set. For CIFAR-10, models are trained for 150 epochs with the SGD optimizer and batch size 128; the learning rate starts at 0.1 and is divided by 10 at epochs 90 and 125. We apply weight decay of 2e-4 and momentum of 0.9. Early stopping is done by holding out 1000 examples from the CIFAR-10 training set. For ImageNet, we use the pretrained standard models available within PyTorch (torchvision.models). ii) Models that are randomly initialized without training (Naive).
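As a minimal sketch, the CIFAR-10 schedule described above (initial rate 0.1, divided by 10 at epochs 90 and 125) can be written as a step function; `cifar10_lr` is a hypothetical helper name, not code from the paper.

```python
def cifar10_lr(epoch, base_lr=0.1):
    """Step learning-rate schedule: base_lr, divided by 10 at epochs 90 and 125."""
    lr = base_lr
    if epoch >= 90:
        lr /= 10.0
    if epoch >= 125:
        lr /= 10.0
    return lr
```

The same effect can be obtained with a milestone-based scheduler in most deep learning frameworks.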

Accuracy (%) of models evaluated on CIFAR-10. Attacks are bounded with ε = 8/255.

Full results of accuracy (%) evaluated on MNIST. Attacks are bounded with ε = 0.2.

Full results of accuracy (%) evaluated on CIFAR-10. Attacks are bounded with ε = 8/255.

Full results of natural risk (%) and hypocritical risk (%) on MNIST. Attacks are bounded by l∞ norm with ε = 0.2.

Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by l∞ norm with ε = 1/255.

Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by l∞ norm with ε = 2/255.

Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by l∞ norm with ε = 8/255.

Comparison of natural risk (% ± std over 3 random trials) and hypocritical risk (% ± std over 3 random trials) across methods on real-world datasets. Attacks are bounded by l∞ norm.

Evaluated results of natural risk (%) and adversarial risk (%) on MNIST. Attacks are bounded by l∞ norm with ε = 0.2.

Evaluated results of natural risk (%) and adversarial risk (%) on CIFAR-10. Attacks are bounded by l∞ norm with ε = 2/255.

A.4 DETAILS IN SECTION 5

Hypocritical attacks here are executed by 50-step PGD (step size ε/10) on source models. Note that the optimization method used here is not meant to pursue state-of-the-art transferability, but to examine the transferability of hypocritical examples; many methods designed to improve the transferability of adversarial examples may be extended to hypocritical examples (Liu et al., 2017; Dong et al., 2018; Wu et al., 2020a). Figure 7 shows the transferability heatmap of the hypocritical attack over 9 models trained on MNIST. Figure 8

Figure 9: Model decision boundary is the line given by Equation 10 with the threshold b = 0.5. Red and green shading denote regions whose points are misclassified and correctly classified by the model, respectively. The gray lined region denotes points that can be perturbed with little perturbation to reverse the prediction of the model.

D TRADEOFF BETWEEN ADVERSARIAL AND HYPOCRITICAL RISKS

Although the experiments in Section 4.1 and Appendix A.3 showed that, when dealing with finite sample sizes and finite-time gradient-descent-trained classifiers, there may be mutual benefits between adversarial robustness and hypocritical robustness on real-world datasets, we note that, in general, this synergism does not necessarily exist. We illustrate the phenomenon with another toy example, inspired by the precision-recall tradeoff (Buckland & Gey, 1994; Alvarez, 2002).

Consider the case $(x, y) \in \mathbb{R}^2 \times \{-1, +1\}$ from a distribution $\mathcal{D}$, where the marginal distribution over the instance space is the uniform distribution over $[0, 1]^2$. We assume that the decision boundary of the oracle (ground truth) is a circle, $(x_1 - c_1)^2 + (x_2 - c_2)^2 = r^2$, where the centre is $c = (0.5, 0.5)$ and the radius is $r = 0.4$. The points inside the circle are labeled as belonging to the positive class; otherwise they are labeled as belonging to the negative class. We consider the linear classifier $f$ with fixed $w = (0, 1)$ and a tunable threshold $b$; the formulas of these values are visualized in Figure 10. Here we choose the bounded $l_2$ ball $\mathcal{B}_\epsilon(x) = \{x' \in \mathbb{R}^2 : \|x' - x\|_2 \le \epsilon\}$ with $\epsilon = 0.1$ as the threat model.

Figure 11 plots the curves of precision and recall versus the threshold $b$. We can see that there is an obvious precision-recall tradeoff between the two gray dotted lines. Similarly, Figure 12 plots the curves of $\widehat{R}_{\mathrm{adv}}(f)$ and $\widehat{R}_{\mathrm{hyp}}(f)$ versus the threshold $b$. We can see that the tradeoff exists almost everywhere: as the adversarial risk increases, the hypocritical risk decreases, and vice versa.
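The qualitative shape of this tradeoff can be reproduced by Monte Carlo, under the assumption (one reading of the setup, not fully recoverable from the text) that $f$ predicts the positive class iff $x_2 \le b$. For a classifier depending only on $x_2$, the worst case over the $l_2$ ball shifts $x_2$ by at most $\epsilon$, which makes the perturbed predictions easy to enumerate; `toy_risks` is a hypothetical helper name.

```python
import random

# Toy setup from the text: D uniform on [0,1]^2, oracle = disk of radius 0.4
# centred at (0.5, 0.5). Assumed model: predict positive iff x2 <= b.

def toy_risks(b, eps=0.1, n=200_000, seed=0):
    rng = random.Random(seed)
    adv_num = adv_den = hyp_num = hyp_den = 0
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2 <= 0.4 ** 2   # oracle label
        pred = x2 <= b                                       # model prediction
        reach_pos = x2 - eps <= b   # some x' in the eps-ball is predicted positive
        reach_neg = x2 + eps > b    # some x' in the eps-ball is predicted negative
        if pred == y:               # correctly classified: adversarial risk
            adv_den += 1
            adv_num += reach_neg if y else reach_pos
        else:                       # misclassified: hypocritical risk
            hyp_den += 1
            hyp_num += reach_pos if y else reach_neg
    return adv_num / adv_den, hyp_num / hyp_den

r_adv_lo, r_hyp_lo = toy_risks(b=0.3)
r_adv_hi, r_hyp_hi = toy_risks(b=0.7)
```

Under these assumptions, moving the threshold from b = 0.3 to b = 0.7 lowers the adversarial risk while raising the hypocritical risk, matching the tradeoff reported for Figure 12.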

