WITH FALSE FRIENDS LIKE THESE, WHO CAN HAVE SELF-KNOWLEDGE?

Abstract

Adversarial examples arise from a model's excessive sensitivity. The commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples to induce misclassification. This paper studies an intriguing yet thus far overlooked consequence of this excessive sensitivity: a misclassified example can be easily perturbed so that the model produces the correct output. Such perturbed examples look harmless, but can be maliciously exploited by a false friend to make the model self-satisfied. We therefore name them hypocritical examples. With false friends like these, a poorly performing model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the "well-performing" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize hypocritical risk for the first time and propose a defense method specialized for hypocritical examples that minimizes the tradeoff between natural risk and an upper bound on hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify the theoretical results and the effectiveness of our proposed method.
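To make the threat concrete: a false friend need not change the model at all; it only needs to nudge misclassified inputs back across the decision boundary. Below is a minimal PyTorch-style sketch of such a perturbation, implemented here as PGD-style loss minimization within an ℓ∞ ball. The function name, hyperparameters, and the PGD-style procedure are illustrative assumptions for exposition, not necessarily the exact algorithm studied in this paper.

```python
import torch
import torch.nn.functional as F

def hypocritical_example(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Perturb a (possibly misclassified) input x so the model outputs the
    true label y: PGD-style loss *minimization* within an l_inf eps-ball.
    Illustrative sketch only, not the paper's exact algorithm."""
    x_hyp = x.clone().detach()
    for _ in range(steps):
        x_hyp.requires_grad_(True)
        loss = F.cross_entropy(model(x_hyp), y)
        grad = torch.autograd.grad(loss, x_hyp)[0]
        # Descend the loss (an adversary would ascend it instead).
        x_hyp = x_hyp.detach() - alpha * grad.sign()
        x_hyp = x + (x_hyp - x).clamp(-eps, eps)  # stay inside the eps-ball
        x_hyp = x_hyp.clamp(0, 1)                 # assumes inputs in [0, 1]
    return x_hyp
```

Applied to every test input, such perturbations yield a test set on which even a weak model can report deceptively high accuracy.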

1. INTRODUCTION

Deep neural networks (DNNs) have achieved breakthroughs in a variety of challenging problems such as image understanding (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and automatic game playing (Mnih et al., 2015). Despite these remarkable successes, their pervasive failures in adversarial settings, i.e., the phenomenon of adversarial examples (Biggio et al., 2013; Szegedy et al., 2014), have attracted significant attention in recent years (Athalye et al., 2018; Carlini et al., 2019; Tramèr et al., 2020). Small input perturbations crafted by adversaries can cause well-trained models to make severe mistakes, which indicates that there is still a large gap between machine and human perception, and thus poses potential security concerns for practical machine learning (ML) applications (Kurakin et al., 2016; Qin et al., 2019; Wu et al., 2020b).

An adversarial example is "an input to an ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output" (Goodfellow & Papernot, 2017). Following the definition of adversarial examples for classification problems (Goodfellow et al., 2015; Papernot et al., 2016; Elsayed et al., 2018; Carlini et al., 2019; Zhang et al., 2019; Wang et al., 2020b; Zhang et al., 2020; Tramèr et al., 2020), given a DNN classifier f and a correctly classified example x with class label y (i.e., f(x) = y), an adversarial example x_adv is generated by perturbing x such that f(x_adv) ≠ y and x_adv ∈ B_ε(x). The neighborhood B_ε(x) denotes the set of points within a fixed distance ε > 0 of x, as measured by some metric (e.g., the ℓ_p distance), so that x_adv is visually the "same" for human observers. An imperfection of the classifier is then highlighted by the performance gap G_adv = Acc(D) − Acc(A), where Acc(·) denotes the accuracy evaluated on a clean set sampled from the data distribution D or on an adversarially perturbed set A. An adversary can construct a perturbed set A that looks no different from D yet severely degrades the performance of even state-of-the-art DNN models; the sketch at the end of this section makes this gap concrete.

From direct attacks in the digital space (Goodfellow et al., 2015; Carlini & Wagner, 2017) to robust attacks in the physical world (Kurakin et al., 2016; Xu et al., 2020), from toy classification problems (Chen et al., 2020; Dobriban et al., 2020) to complicated perception tasks (Zhang & Wang, 2019; Wang et al., 2020a), from the high dimensional nature of the input space (Goodfellow et al., 2015; Gilmer et al., 2018) to the
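As a concrete reading of the definitions above, the following sketch crafts x_adv by PGD-style loss ascent (the mirror image of the loss descent used for hypocritical examples) and estimates G_adv = Acc(D) − Acc(A) on a batch. As before, the function names, hyperparameters, and the choice of ℓ∞ PGD are illustrative assumptions rather than the paper's prescribed attack.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """l_inf PGD sketch: ascend the loss so that f(x_adv) != y, projecting
    back into the eps-ball B_eps(x) after every step."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()  # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)      # project into B_eps(x)
        x_adv = x_adv.clamp(0, 1)                     # assumes inputs in [0, 1]
    return x_adv

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def adversarial_gap(model, x, y):
    """G_adv = Acc(D) - Acc(A): clean accuracy minus accuracy on the
    adversarially perturbed counterpart of the same batch."""
    return accuracy(model, x, y) - accuracy(model, pgd_attack(model, x, y), y)
```

The hypocritical counterpart of this gap flips the sign of the perturbation step, inflating rather than deflating the reported accuracy.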

