POLICY-DRIVEN ATTACK: LEARNING TO QUERY FOR HARD-LABEL BLACK-BOX ADVERSARIAL EXAMPLES

Abstract

To craft black-box adversarial examples, adversaries need to query the victim model and take proper advantage of its feedback. Existing black-box attacks generally suffer from high query complexity, especially when only the top-1 decision (i.e., the hard-label prediction) of the victim model is available. In this paper, we propose a novel hard-label black-box attack, named Policy-Driven Attack, to reduce the query complexity. Our core idea is to learn promising search directions for the adversarial examples using a well-designed policy network in a novel reinforcement learning formulation, so that the queries become more effective. Experimental results demonstrate that our method significantly reduces the query complexity in comparison with existing state-of-the-art hard-label black-box attacks on various image classification benchmark datasets. Code and models for reproducing our results are available at https://github.com/ZiangYan/pda.pytorch.

1. INTRODUCTION

It is widely known that deep neural networks (DNNs) are vulnerable to adversarial examples, which are crafted by perturbing clean examples to cause the victim model to make incorrect predictions. In a white-box setting where the adversaries have full access to the architecture and parameters of the victim model, gradients w.r.t. network inputs can be easily calculated via back-propagation, and thus first-order optimization techniques can be directly applied to craft adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini & Wagner, 2017; Madry et al., 2018; Rony et al., 2019). However, in black-box settings, input gradients are no longer readily available since all model internals are kept secret. Over the past few years, the community has made massive efforts in developing black-box attacks. In order to attain high attack success rates, delicate queries to the victim model are normally required. Based on the amount of information exposed to the adversaries by the output of the victim model, recent methods can be roughly categorized into score-based attacks (Chen et al., 2017; Ilyas et al., 2018; Nitin Bhagoji et al., 2018; Ilyas et al., 2019; Yan et al., 2019; Li et al., 2020b; Tu et al., 2019; Du et al., 2019; Li et al., 2019; Bai et al., 2020) and hard-label attacks (a.k.a. decision-based attacks) (Brendel et al., 2018; Cheng et al., 2019; Dong et al., 2019; Shi et al., 2019; Brunner et al., 2019; Chen et al., 2020; Rahmati et al., 2020; Li et al., 2020a; Shi et al., 2020; Chen & Gu, 2020). When the prediction probabilities of the victim model are accessible, an intelligent adversary would generally prefer score-based attacks, while in a more practical scenario where only the top-1 class prediction is available, the adversaries have to resort to hard-label attacks.
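To make the white-box baseline concrete, the sketch below applies one step of the fast gradient sign method (FGSM; Goodfellow et al., 2015) to a toy linear softmax classifier, where the input gradient of the cross-entropy loss has the closed form W^T(p - onehot(y)). The linear model merely stands in for back-propagation through a real DNN; the function name `fgsm_step` is our own illustrative choice, not from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgsm_step(W, x, y, eps):
    """One FGSM step on a toy linear softmax classifier f(x) = W x.

    For cross-entropy loss, the gradient of the loss w.r.t. the input is
    W^T (p - onehot(y)), so the white-box attack perturbs x by
    eps * sign(gradient) to increase the loss on the true label y.
    """
    k = W.shape[0]
    p = softmax(W @ x)
    onehot = np.eye(k)[y]
    grad = W.T @ (p - onehot)
    return x + eps * np.sign(grad)
```

With a large enough step size eps, a single step already flips the toy classifier's decision; iterating smaller steps with projection recovers a PGD-style attack (Madry et al., 2018).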
Since less information is exposed by such feedback of the victim model, hard-label attacks often bear higher query complexity than score-based attacks, making their attack process costly and time-intensive. In this paper, we aim at reducing the query complexity of hard-label black-box attacks. We cast the problem of progressively refining the candidate adversarial example (by skillfully querying the victim model and analyzing its feedback) into a reinforcement learning formulation. At each iteration, we search along a set of chosen directions to see whether there exists any new candidate adversarial example that is perceptually more similar to its benign counterpart, i.e., in the sense of requiring less distortion. A reward is assigned to each of these search directions (treated as actions), based on the amount of distortion reduction yielded after updating the adversarial example along that direction. Such a reinforcement learning formulation enables us to learn the non-differentiable mapping from search directions to their potential for refining the current adversarial example, directly and precisely. The policy network is expected to be capable of providing the most promising search direction for updating candidate adversarial examples, so as to reduce the distortion of the adversarial examples from their benign counterparts. As we will show, the proposed policy network can learn not only from the queries that have been performed following the evolving policy but also from peer experience of other black-box attacks. As such, it is possible to pre-train the policy network on a small number of query-reward pairs obtained from the performance log of prior attacks (with or without policy) on the same victim model. Experiments show that our policy-driven attack (PDA) can achieve significantly lower distortions than existing state-of-the-art methods under the same query budgets.
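The reward assignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `is_adversarial` stands in for a hard-label oracle that queries the victim model and reports only whether the top-1 prediction differs from the true label, and the reward for a search direction is the L2 distortion reduction it yields (zero if the step leaves the adversarial region).

```python
import numpy as np

def distortion(x_adv, x_benign):
    """L2 distance between the candidate adversarial example and its benign counterpart."""
    return np.linalg.norm(x_adv - x_benign)

def reward_for_direction(x_adv, x_benign, direction, step_size, is_adversarial):
    """Assign a reward to one search direction (treated as an action).

    Moves the current adversarial example one normalized step along
    `direction`, queries the hard-label oracle `is_adversarial`, and
    rewards the action by the achieved distortion reduction. Returns
    the reward and the (possibly updated) adversarial example.
    """
    candidate = x_adv + step_size * direction / np.linalg.norm(direction)
    if not is_adversarial(candidate):
        return 0.0, x_adv  # the step left the adversarial region: no reward
    gain = distortion(x_adv, x_benign) - distortion(candidate, x_benign)
    if gain <= 0:
        return 0.0, x_adv  # still adversarial, but no distortion reduction
    return gain, candidate
```

In the full method these query-reward pairs are the training signal for the policy network, which learns to propose directions with high expected reward.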

2. RELATED WORK

In this paper, we focus on the hard-label black-box setting where only the top-1 decision of the victim model is available. Since less information (of the victim model) is exposed after each query, attacks in this category generally need to query the victim model more times than those in the white-box or score-based settings. For example, an initial attempt named boundary attack (Brendel et al., 2018) could require on the order of a million queries before convergence. This method starts from an image that is already adversarial and reduces the distortion by walking toward the benign image along the decision boundary. Recent methods in this category focus more on gradient estimation, which can provide more promising search directions while relying only on top-1 class predictions. Ilyas et al. (2018) advocated using NES (Wierstra et al., 2014; Salimans et al., 2017) to estimate gradients over proxy scores, and then mounted a variant of the PGD attack (Madry et al., 2018) with the estimated gradients. Towards improving the efficiency of gradient estimation, Cheng et al. (2019) and Chen et al. (2020) further introduced a continuous optimization formulation and an unbiased gradient estimator with careful error control, respectively. In these works, gradients were estimated via probe queries drawn from a standard Gaussian distribution. To generate probes from more powerful distributions, Dong et al. (2019) proposed to use the covariance matrix adaptation evolution strategy, while Shi et al. (2020) suggested using a customized distribution to model the sensitivity of each pixel. In contrast to these methods, our PDA uses a policy network, learned from prior interactions with the victim model, to advocate promising search directions and thereby reduce the query complexity. We note that some works also proposed to exploit DNN models to generate black-box attacks. For example, Naseer et al.
(2019) used DNNs to promote the transferability of black-box attacks, while several score-based black-box attacks train DNN models to assist the generation of queries (Li et al., 2019; Du et al., 2019; Bai et al., 2020). Our method naturally differs from them in both problem setting (score-based vs. hard-label) and problem formulation. In the field of autonomous systems, Hamdi et al. (2020) proposed to formulate the generation of semantic attacks as a reinforcement learning problem, searching for environment parameters (e.g., camera viewpoint) that can fool the recognition system. To the best of our knowledge, our work is the first to incorporate reinforcement learning into the black-box attack scenario for estimating perturbation directions, and we encourage the community to further explore this principled formulation in the future. In addition to the novel reinforcement learning formulation, we also introduce a specific architecture for the policy network that enjoys superior generalization performance, whereas these prior methods adopted off-the-shelf auto-encoding architectures.

3. OUR POLICY-DRIVEN ATTACK

We study the problem of attacking an image classifier in the hard-label setting. The goal of the adversaries is to perturb a benign image x ∈ R^n so as to fool a k-way victim classifier f : R^n → R^k into making an incorrect decision: arg max_i f(x′)_i ≠ y, where x′ is the adversarial example generated by perturbing the benign image and y is the true label of x. The adversaries would generally prefer

