SMALL INPUT NOISE IS ENOUGH TO DEFEND AGAINST QUERY-BASED BLACK-BOX ATTACKS

Abstract

While deep neural networks show unprecedented performance in various tasks, their vulnerability to adversarial examples hinders their deployment in safety-critical systems. Many studies have shown that attacks are possible even in a black-box setting where an adversary cannot access the target model's internal information. Most black-box attacks are based on queries, each of which obtains the target model's output for an input, and many recent studies focus on reducing the number of required queries. In this paper, we pay attention to an implicit assumption of these attacks: that the target model's output exactly corresponds to the query input. If some randomness is introduced into the model to break this assumption, query-based attacks may have tremendous difficulty in both gradient estimation and local search, which are the core of their attack process. Motivated by this, we observe that even a small additive input noise can neutralize most query-based attacks, and we name this simple yet effective approach Small Noise Defense (SND). We analyze how SND can defend against query-based black-box attacks and demonstrate its effectiveness against eight different state-of-the-art attacks on the CIFAR-10 and ImageNet datasets. Despite its strong defense ability, SND nearly preserves the original clean accuracy and computational speed. SND is readily applicable to pre-trained models by adding only one line of code at the inference stage, so we hope that it will be used as a baseline defense against query-based black-box attacks in the future.

1. INTRODUCTION

Although deep neural networks perform well in various areas, it is now well known that small, malicious input perturbations can cause them to malfunction (Biggio et al., 2013; Szegedy et al., 2013). This vulnerability of AI models to adversarial examples hinders their deployment, especially in safety-critical areas. In a white-box setting, where the target model's parameters can be accessed, strong adversarial attacks such as Projected Gradient Descent (PGD) (Madry et al., 2018) can generate adversarial examples using the internal information. However, recent studies have shown that adversarial examples can be generated even in a black-box setting where the model's interior is hidden from adversaries. These black-box attacks can be largely divided into transfer-based attacks and query-based attacks. Transfer-based attacks exploit transferability: adversarial examples generated from one network can deceive other networks. Papernot et al. (2017) train a substitute model that mimics the behavior of the target model and show that the adversarial examples created from it can successfully disturb different models. However, due to differences in training methods and model architectures, the transferability of adversarial examples can be significantly weakened, and thus transfer-based attacks usually result in lower success rates (Chen et al., 2017). For this reason, most black-box attacks are based on queries, each of which measures the target model's output for an input. Query-based attacks create adversarial examples through an iterative process based on either local search with repeated small input modifications or optimization with estimated gradients of an adversary's loss with respect to the input. However, issuing many queries costs considerable time and money. Moreover, many similar query images can look suspicious to system administrators. For this reason, researchers have focused on reducing the number of queries required to craft a successful adversarial example.
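To make the interaction between query-based gradient estimation and small input noise concrete, the following is a minimal sketch, not the implementation evaluated in this paper. Everything in it is an assumption chosen for illustration: the target model is a hypothetical softmax-over-linear toy (`toy_model`), the noise is Gaussian with standard deviation `sigma = 0.01`, and the attacker uses central finite differences along random directions with step `delta`. The helper names `snd_query`, `estimate_gradient`, and `compare` are likewise hypothetical.

```python
import numpy as np


def toy_model(x, w):
    """Hypothetical stand-in for a target model: softmax over a linear map."""
    z = w @ x
    z = z - z.max()  # subtract the max logit for numerical stability
    p = np.exp(z)
    return p / p.sum()


def snd_query(x, w, sigma, rng):
    """Defended query (assumed form): add Gaussian input noise, then infer."""
    return toy_model(x + sigma * rng.standard_normal(x.shape), w)


def estimate_gradient(query, x, label, delta, rng, n_samples=200):
    """Estimate d p[label] / d x from model outputs alone, using central
    differences along random Gaussian directions."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        diff = query(x + delta * u)[label] - query(x - delta * u)[label]
        grad += diff / (2.0 * delta) * u
    return grad / n_samples


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare(seed=0, dim=50, n_classes=10, sigma=0.01, delta=1e-4):
    """Return (cosine without noise, cosine with noise) between the true
    gradient and the query-based estimate. All sizes are illustrative."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.standard_normal((n_classes, dim))
    x = rng.uniform(0.0, 1.0, size=dim)
    p = toy_model(x, w)
    label = int(p.argmax())
    # Analytic gradient of the softmax probability p[label] w.r.t. x.
    true_grad = p[label] * (w[label] - p @ w)
    g_clean = estimate_gradient(
        lambda z: toy_model(z, w), x, label, delta, rng)
    g_snd = estimate_gradient(
        lambda z: snd_query(z, w, sigma, rng), x, label, delta, rng)
    return cosine(true_grad, g_clean), cosine(true_grad, g_snd)
```

Calling `compare()` returns two cosine similarities. In this toy setting the noiseless estimate should align with the true gradient much more closely than the estimate obtained through noisy queries, because the output fluctuation induced by the noise is large relative to the finite-difference signal when `sigma` is much larger than `delta`. An attacker could enlarge `delta` or average repeated queries, but in this sketch that comes at the cost of estimation accuracy or query budget.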

