WITH FALSE FRIENDS LIKE THESE, WHO CAN HAVE SELF-KNOWLEDGE?

Abstract

Adversarial examples arise from a model's excessive sensitivity. The commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples, to induce misclassification. This paper studies an intriguing, yet far overlooked consequence of this excessive sensitivity: a misclassified example can be easily perturbed to help the model produce the correct output. Such perturbed examples look harmless, but can actually be maliciously utilized by a false friend to make the model self-satisfied. Thus we name them hypocritical examples. With false friends like these, a poorly performing model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the "well-performed" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize hypocritical risk for the first time and propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify our theoretical results and the effectiveness of the proposed method.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved breakthroughs in a variety of challenging problems such as image understanding (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and automatic game playing (Mnih et al., 2015). Despite these remarkable successes, their pervasive failures in adversarial settings, i.e., the phenomenon of adversarial examples (Biggio et al., 2013; Szegedy et al., 2014), have attracted significant attention in recent years (Athalye et al., 2018; Carlini et al., 2019; Tramer et al., 2020). Such small perturbations on inputs crafted by adversaries are capable of causing well-trained models to make big mistakes, which indicates that there is still a large gap between machine and human perception, posing potential security concerns for practical machine learning (ML) applications (Kurakin et al., 2016; Qin et al., 2019; Wu et al., 2020b). An adversarial example is "an input to a ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output" (Goodfellow & Papernot, 2017). Following the definition of adversarial examples for classification problems (Goodfellow et al., 2015; Papernot et al., 2016; Elsayed et al., 2018; Carlini et al., 2019; Zhang et al., 2019; Wang et al., 2020b; Zhang et al., 2020; Tramèr et al., 2020), given a DNN classifier f and a correctly classified example x with class label y (i.e., f(x) = y), an adversarial example x_adv is generated by perturbing x such that f(x_adv) ≠ y and x_adv ∈ B_ε(x). The neighborhood B_ε(x) denotes the set of points within a fixed distance ε > 0 of x, as measured by some metric (e.g., the l_p distance), so that x_adv is visually the "same" for human observers. Then, an imperfection of the classifier is highlighted by G_adv = Acc(D) − Acc(A), the performance gap between the accuracy (denoted by Acc(·)) evaluated on a clean set sampled from the data distribution D and on an adversarially perturbed set A.
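The gap G_adv = Acc(D) − Acc(A) can be made concrete with a toy model. The following is a minimal numpy sketch of our own (not the paper's experimental setup): for a linear binary classifier f(x) = sign(w·x + b) with labels in {−1, +1}, the worst-case l_inf perturbation of size ε is a single signed-gradient step, so the perturbed set A and the resulting gap can be computed directly. All names (`accuracy`, `adversarial_set`, the synthetic data) are hypothetical illustrations.

```python
import numpy as np

def accuracy(X, y, w, b):
    """Fraction of points with positive margin y * (w.x + b)."""
    return float(np.mean(y * (X @ w + b) > 0))

def adversarial_set(X, y, w, b, eps):
    """Perturb every point against its margin: x_adv = x - eps*sign(y*w).

    For a linear model the loss gradient w.r.t. x is proportional to
    -y*w, so this single step is the exact l_inf worst case in B_eps(x).
    """
    return X - eps * np.sign(y[:, None] * w[None, :])

# Synthetic data: clean accuracy is high, adversarial accuracy drops,
# and G_adv = Acc(D) - Acc(A) exposes the model's excessive sensitivity.
rng = np.random.default_rng(0)
w, b = np.array([1.0, 1.0]), 0.0
X = rng.normal(0.0, 1.0, size=(200, 2))
y = np.sign(X @ w + b + rng.normal(0.0, 0.3, size=200))

X_adv = adversarial_set(X, y, w, b, eps=0.5)
gap = accuracy(X, y, w, b) - accuracy(X_adv, y, w, b)
```

Even for this trivially simple model, a visually negligible ε produces a large positive gap, which is the quantity the adversarial-robustness literature cited above seeks to close.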
An adversary could construct such a perturbed set A that looks no different from D but can severely degrade the performance of even state-of-the-art DNN models. From direct attacks in the digital space (Goodfellow et al., 2015; Carlini & Wagner, 2017) to robust attacks in the physical world (Kurakin et al., 2016; Xu et al., 2020), from toy classification problems (Chen et al., 2020; Dobriban et al., 2020) to complicated perception tasks (Zhang & Wang, 2019; Wang et al., 2020a), from the high-dimensional nature of the input space (Goodfellow et al., 2015; Gilmer et al., 2018) to the framework of (non-)robust features (Jetley et al., 2018; Ilyas et al., 2019), many efforts have been devoted to understanding and mitigating the risk raised by adversarial examples, thus closing the gap G_adv. Previous works mainly concern the adversarial risk on correctly classified examples. However, they typically neglect a risk on misclassified examples themselves, which we formalize in this work. In this paper, we first investigate an intriguing, yet far overlooked phenomenon: given a DNN classifier f and a misclassified example x with class label y (i.e., f(x) ≠ y), we can easily perturb x to x_hyp such that f(x_hyp) = y and x_hyp ∈ B_ε(x). Such an example x_hyp looks harmless, but can actually be maliciously utilized by a false friend to fool a model into self-satisfaction.
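The construction of x_hyp mirrors that of x_adv: instead of pushing a correct prediction across the decision boundary, a false friend pushes a wrong one back. A minimal sketch, again assuming a hypothetical linear binary classifier f(x) = sign(w·x + b) rather than the paper's DNN models:

```python
import numpy as np

def hypocritical_perturb(x, y, w, eps):
    """x_hyp = x + eps * sign(y * w): the l_inf step of size eps that
    maximally *increases* the margin y * (w.x + b), pushing a wrong
    prediction toward the correct label without leaving B_eps(x).

    Note the sign: an adversarial step would subtract this term."""
    return x + eps * np.sign(y * w)

w, b = np.array([1.0, 1.0]), 0.0
x, y = np.array([-0.3, -0.2]), 1.0   # misclassified: margin is -0.5

x_hyp = hypocritical_perturb(x, y, w, eps=0.6)
margin_before = y * (x @ w + b)      # negative: f(x) != y
margin_after = y * (x_hyp @ w + b)   # positive: f(x_hyp) = y
```

The perturbation budget and step are identical to the adversarial case; only the direction is flipped, which is why the two phenomena stem from the same excessive sensitivity.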



Figure 1: Comparison between adversarial examples and hypocritical examples. Left: Conceptual diagrams for the generation of an adversarial example x_adv and a hypocritical example x_hyp. The input space is (ground-truth) classified into the orange lined region (e.g., class "not panda") and the blue dotted region (e.g., class "panda"). The black solid line is the decision boundary of a non-robust model, which classifies the region above the boundary as "panda" and the region below the boundary as "not panda". Red shadow and black shadow in the ball B_ε(x) denote that the points therein are misclassified and correctly classified, respectively. As we can see, x_adv or x_hyp can be easily found by perturbing a correctly classified x or a misclassified x across the model's decision boundary. Right: A demonstration of adversarial examples and hypocritical examples on real data. Here we choose ResNet50 (He et al., 2016a) trained on ImageNet (Russakovsky et al., 2015) as the victim model. In (a) the correctly classified "panda" can be stealthily perturbed to be misclassified as "tennis ball". In (b) the "panda" (misclassified as "tripod") can be stealthily perturbed to be correctly classified. Perturbations are rescaled for display.

Thus we name them hypocritical examples (see Figure 1 for a comparison with adversarial examples). Adversarial examples and hypocritical examples are two sides of the same coin. On one side, a well-performed but sensitive model becomes unreliable in the presence of adversaries. On the other side, a poorly performing but sensitive model behaves well with the help of false friends. With false friends like these, a naturally trained suboptimal model could attain state-of-the-art performance, and even worse, a randomly initialized model could behave like a well-trained one (see Section 2.1). It is natural then to wonder: why should we care about hypocritical examples? Here we give two main reasons: 1. This is of scientific interest. Hypocritical examples are the opposite of adversarial examples. While adversarial examples are hard test data for a model, hypocritical examples aim to make correct classification easy. Hypocritical examples warn ML researchers to

