SMALL INPUT NOISE IS ENOUGH TO DEFEND AGAINST QUERY-BASED BLACK-BOX ATTACKS

Abstract

While deep neural networks show unprecedented performance in various tasks, their vulnerability to adversarial examples hinders their deployment in safety-critical systems. Many studies have shown that attacks are possible even in a black-box setting where an adversary cannot access the target model's internal information. Most black-box attacks are based on queries, each of which obtains the target model's output for an input, and many recent studies focus on reducing the number of required queries. In this paper, we pay attention to an implicit assumption of these attacks: that the target model's output exactly corresponds to the query input. If some randomness is introduced into the model to break this assumption, query-based attacks face tremendous difficulty in both gradient estimation and local search, which are the core of their attack process. From this motivation, we observe that even a small additive input noise can neutralize most query-based attacks, and we name this simple yet effective approach Small Noise Defense (SND). We analyze how SND can defend against query-based black-box attacks and demonstrate its effectiveness against eight different state-of-the-art attacks on the CIFAR-10 and ImageNet datasets. Despite its strong defense ability, SND almost fully maintains the original clean accuracy and computational speed. SND is readily applicable to pre-trained models by adding only one line of code at the inference stage, so we hope that it will be used as a baseline defense against query-based black-box attacks in the future.

1. INTRODUCTION

Although deep neural networks perform well in various areas, it is now well known that small, malicious input perturbations can cause them to malfunction (Biggio et al., 2013; Szegedy et al., 2013). This vulnerability of AI models to adversarial examples hinders their deployment, especially in safety-critical areas. In a white-box setting, where the target model's parameters can be accessed, strong adversarial attacks such as Projected Gradient Descent (PGD) (Madry et al., 2018) can generate adversarial examples using the internal information. However, recent studies have shown that adversarial examples can be generated even in a black-box setting where the model's interior is hidden from adversaries. These black-box attacks can be largely divided into transfer-based attacks and query-based attacks. Transfer-based attacks take advantage of transferability, the property that adversarial examples generated from one network can deceive other networks. Papernot et al. (2017) train a substitute model that mimics the behavior of the target model and show that adversarial examples created from it can successfully disturb different models. However, due to differences in training methods and model architectures, the transferability of adversarial examples can be significantly weakened, and thus transfer-based attacks usually result in lower success rates (Chen et al., 2017). For this reason, most black-box attacks are based on queries, each of which measures the target model's output for an input. Query-based attacks create adversarial examples through an iterative process based on either local search with repetitive small input modifications or optimization with estimated gradients of an adversary's loss with respect to the input. However, requesting many queries takes considerable time and money, and many similar query images can appear suspicious to system administrators. For this reason, researchers have focused on reducing the number of queries required to craft a successful adversarial example. Compared to the increasing number of studies on query-based attacks, the number of defenses against them is still very small (Bhambri et al., 2019). Moreover, existing defenses developed for white-box attacks may not be effective against query-based black-box attacks: Dong et al. (2020) find that existing defenses such as ensemble adversarial training (Tramèr et al., 2018) do not effectively defend against decision-based attacks.

Therefore, it is necessary to develop new defense strategies that respond appropriately to query-based attacks. To defend against query-based black-box attacks, we pay attention to an implicit but important assumption of these attacks: that the target model's output exactly corresponds to the query input. If some randomness is introduced into the model to break this assumption, these attacks can have tremendous difficulty in both gradient estimation and local search, which are the core of their attack process. This intuition is illustrated in Fig. 1a. In this paper, we highlight that simply adding small Gaussian noise to an input image is enough to defeat various query-based attacks by breaking the above core assumption while almost maintaining clean accuracy. One may think that additive Gaussian noise cannot defend against most adversarial attacks unless we introduce large randomness. This idea is valid for white-box attacks (Gu & Rigazio, 2014), but our experimental results show that small noise is surprisingly effective against query-based black-box attacks. Our second intuition, regarding the minimization of clean accuracy loss, is illustrated in Fig. 1b. Dodge & Karam (2017) find that classification accuracy decreases in proportion to the variance of Gaussian noise, but for a sufficiently small variance, the accuracy drop is negligible. Considering that robustness against additive Gaussian noise is positively correlated with the distance to the decision boundary (Gilmer et al., 2019), this observation implies that clean images lie relatively far from the decision boundary. We think an adversarial defense should have the following goals: (1) preventing malfunction of a model against various attacks, (2) minimizing the computational overhead, (3) maintaining the accuracy on clean images, and (4) being easily applicable to existing models.

The proposed defense against query-based attacks meets all of the above objectives, and we name this simple yet effective defense Small Noise Defense (SND). Our contributions can be listed as follows:

• We highlight the effectiveness of adding a small additive noise to the input in defending against query-based black-box attacks. The proposed defense, SND, can be readily applied to pre-trained models by adding only one line of code in the PyTorch framework (Paszke et al., 2019) at the inference stage (x = x + sigma * torch.randn_like(x)) and almost maintains the performance of the model.
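The one-line defense described above can be wrapped around any pre-trained classifier. The sketch below is illustrative, not a prescribed implementation: the wrapper class name and the sigma value are our own placeholders, and the commented torchvision example assumes a standard pre-trained model.

```python
import torch
import torch.nn as nn

class SNDWrapper(nn.Module):
    """Illustrative inference-time wrapper that applies SND to any model."""

    def __init__(self, model, sigma=0.01):
        # sigma is the noise scale; 0.01 is a placeholder small value.
        super().__init__()
        self.model = model
        self.sigma = sigma

    def forward(self, x):
        # The one-line defense: add small Gaussian noise to each query input.
        x = x + self.sigma * torch.randn_like(x)
        return self.model(x)

# Usage with a pre-trained model, e.g.:
# defended = SNDWrapper(torchvision.models.resnet18(pretrained=True))
```

Because the noise is redrawn on every forward pass, two identical queries generally receive different outputs, which is exactly the property that breaks the attacks' core assumption.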

Figure 1: Illustrations of our intuitions. (a) Small noise can effectively disturb the gradient estimation of query-based attacks that use finite differences. (b) Compared to large noise, small noise hardly affects predictions on clean images.
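The intuition in Fig. 1a can be made concrete with a toy example: a two-sided finite-difference gradient estimate evaluates the model at two nearby query points, but under SND each query is perturbed by an independent noise draw, so for a small step size the noise term dominates the estimate. A minimal sketch, using a hypothetical 1-D loss and illustrative values (not the paper's experimental setup):

```python
import numpy as np

def loss(x):
    # Hypothetical 1-D scalar loss standing in for the adversary's
    # objective; its true gradient at x is 2 * x.
    return x ** 2

def fd_gradient(x, h, sigma, rng):
    # Two-sided finite-difference gradient estimate against a model
    # that adds independent Gaussian noise to every query (the SND setting).
    f_plus = loss(x + h + sigma * rng.standard_normal())
    f_minus = loss(x - h + sigma * rng.standard_normal())
    return (f_plus - f_minus) / (2 * h)

rng = np.random.default_rng(0)
x, h = 1.0, 1e-3                      # query point and step size
clean = fd_gradient(x, h, sigma=0.0, rng=rng)
noisy = [fd_gradient(x, h, sigma=0.01, rng=rng) for _ in range(1000)]
print(abs(clean - 2.0))               # near-exact without noise
print(np.std(noisy))                  # spread far larger than the true gradient
```

Because the noise term is divided by the small step 2h, even sigma much smaller than h's scale relative to the image inflates the estimator's variance, which matches the intuition that small input noise is enough to corrupt gradient estimation.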

