POLICY-DRIVEN ATTACK: LEARNING TO QUERY FOR HARD-LABEL BLACK-BOX ADVERSARIAL EXAMPLES

Abstract

To craft black-box adversarial examples, adversaries need to query the victim model and take proper advantage of its feedback. Existing black-box attacks generally suffer from high query complexity, especially when only the top-1 decision (i.e., the hard-label prediction) of the victim model is available. In this paper, we propose a novel hard-label black-box attack named Policy-Driven Attack to reduce the query complexity. Our core idea is to learn promising search directions for the adversarial examples using a well-designed policy network in a novel reinforcement learning formulation, in which the queries become more sensible. Experimental results demonstrate that our method can significantly reduce the query complexity in comparison with existing state-of-the-art hard-label black-box attacks on various image classification benchmark datasets. Code and models for reproducing our results are available at https://github.com/ZiangYan/pda.pytorch.

1. INTRODUCTION

It is widely known that deep neural networks (DNNs) are vulnerable to adversarial examples, which are crafted via perturbing clean examples to cause the victim model to make incorrect predictions. In a white-box setting where the adversaries have full access to the architecture and parameters of the victim model, gradients w.r.t. network inputs can be easily calculated via back-propagation, and thus first-order optimization techniques can be directly applied to craft adversarial examples in this setting (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini & Wagner, 2017; Madry et al., 2018; Rony et al., 2019). However, in black-box settings, input gradients are no longer readily available since all model internals are kept secret. Over the past few years, the community has made massive efforts in developing black-box attacks. In order to gain high attack success rates, delicate queries to the victim model are normally required. Based on the amount of information exposed to the adversaries from the output of the victim model, recent methods can be roughly categorized into score-based attacks (Chen et al., 2017; Ilyas et al., 2018; Nitin Bhagoji et al., 2018; Ilyas et al., 2019; Yan et al., 2019; Li et al., 2020b; Tu et al., 2019; Du et al., 2019; Li et al., 2019; Bai et al., 2020) and hard-label attacks (a.k.a. decision-based attacks) (Brendel et al., 2018; Cheng et al., 2019; Dong et al., 2019; Shi et al., 2019; Brunner et al., 2019; Chen et al., 2020; Rahmati et al., 2020; Li et al., 2020a; Shi et al., 2020; Chen & Gu, 2020). When the prediction probabilities of the victim model are accessible, an intelligent adversary would generally prefer score-based attacks, while in a more practical scenario where only the top-1 class prediction is available, the adversaries have to resort to hard-label attacks.
Since less information is exposed from such feedback of the victim model, hard-label attacks often bear a higher query complexity than score-based attacks, making their attack process costly and time-intensive. In this paper, we aim at reducing the query complexity of hard-label black-box attacks. We cast the problem of progressively refining the candidate adversarial example (by skillfully querying the victim model and analyzing its feedback) into a reinforcement learning formulation. At each iteration, we search along a set of chosen directions to see whether there exists any new candidate adversarial example that is perceptually more similar to its benign counterpart, i.e., requires less distortion. A reward is assigned to each such search direction (treated as an action), based on the amount of distortion reduction yielded after updating the adversarial example along that direction. This reinforcement learning formulation enables us to learn the non-differentiable mapping from search directions to their potential for refining the current adversarial example, directly and precisely. The policy network is expected to provide the most promising search direction for updating candidate adversarial examples, so as to reduce the required distortion of the adversarial examples from their benign counterparts. As we will show, the policy network can learn not only from the queries performed while following the evolving policy but also from peer experience of other black-box attacks. As such, it is possible to pre-train the policy network on a small number of query-reward pairs obtained from the performance log of prior attacks (with or without a policy) on the same victim model. Experiments show that our policy-driven attack (PDA) achieves significantly lower distortions than existing state-of-the-art methods under the same query budgets.

2. RELATED WORK

In this paper, we focus on the hard-label black-box setting where only the top-1 decision of the victim model is available. Since less information (of the victim model) is exposed after each query, attacks in this category generally need to query the victim model more times than those in the white-box or score-based settings. For example, an initial attempt named boundary attack (Brendel et al., 2018) can require on the order of a million queries before convergence. It proposed to start from an image that is already adversarial, and tried to reduce the distortion by walking towards the benign image along the decision boundary. Recent methods in this category focused more on gradient estimation, which can provide more promising search directions while relying only on top-1 class predictions. Ilyas et al. (2018) advocated using NES (Wierstra et al., 2014; Salimans et al., 2017) to estimate the gradients over proxy scores, and then mounted a variant of the PGD attack (Madry et al., 2018) with the estimated gradients. Towards improving the efficiency of gradient estimation, Cheng et al. (2019) and Chen et al. (2020) further introduced a continuous optimization formulation and an unbiased gradient estimation with careful error control, respectively. The gradients were estimated via issuing probe queries drawn from a standard Gaussian distribution. To generate probes from more powerful distributions, Dong et al. (2019) proposed to use the covariance matrix adaptation evolution strategy, while Shi et al. (2020) suggested a customized distribution to model the sensitivity of each pixel. In contrast to these methods, our PDA uses a policy network, learned from prior interactions, to advocate promising search directions and thereby reduce the query complexity. We note that some works also proposed to exploit DNN models to generate black-box attacks. For example, Naseer et al.
(2019) used DNNs to promote the transferability of black-box attacks, while several score-based black-box attacks proposed to train DNN models for assisting the generation of queries (Li et al., 2019; Du et al., 2019; Bai et al., 2020). Our method naturally differs from them in both problem setting (score-based vs. hard-label) and problem formulation. In the field of autonomous systems, Hamdi et al. (2020) proposed to formulate the generation of semantic attacks as a reinforcement learning problem, finding parameters of the environment (e.g., camera viewpoint) that can fool the recognition system. To the best of our knowledge, our work is the first to incorporate reinforcement learning into the black-box attacking scenario for estimating perturbation directions, and we encourage the community to consider this principled formulation further in the future. In addition to the novel reinforcement learning formulation, we also introduce a specific architecture for the policy network which enjoys superior generalization performance, whereas these prior methods adopted off-the-shelf auto-encoding architectures.

3. OUR POLICY-DRIVEN ATTACK

We study the problem of attacking an image classifier in the hard-label setting. The goal of the adversaries is to perturb a benign image x ∈ R^n to fool a k-way victim classifier f : R^n → R^k into making an incorrect decision: arg max_i f(x′)_i ≠ y, where x′ is the adversarial example generated by perturbing the benign image and y is the true label of x. The adversaries would generally prefer adversarial examples x′ with smaller distortions ‖x′ − x‖_2 achieved using fewer queries, since these properties make the attack less suspicious and also save cost. In this section, we first briefly review some background information that motivates our method (in Section 3.1), then detail our reinforcement learning formulation (in Section 3.2 and Section 3.3) and the architecture of our policy network (in Section 3.4).

3.1. MOTIVATIONS

Most recent hard-label attacks followed a common pipeline of searching from a starting point which was already adversarial yet not close enough to the benign image. Unlike the white-box and score-based black-box settings, in which input gradients can be calculated and used as the most effective perturbation direction, in the concerned hard-label setting the output of the victim model only flips on the decision boundary and stays constant away from it, making it difficult to evaluate different directions almost everywhere. In this context, the search for promising perturbation directions was restricted to regions near the decision boundary, since these regions are arguably more informative, and binary search was used to reach the decision boundary efficiently. Let us take a very recent attack named HopSkipJumpAttack (Chen et al., 2020) as an example. Given the current estimation x_s of the adversarial example at each iteration, HopSkipJumpAttack first performed binary search to project it onto the decision boundary of the victim model. Denoting by x′ the updated example that was already on the decision boundary, HopSkipJumpAttack then sampled many probes around x′ from an isotropic Gaussian distribution, and issued these probes to the victim model as queries. The feedback of the victim model was utilized to estimate the gradient direction at x′, and x′ was updated along this direction to obtain a new estimation of the adversarial example. This process was repeated until the query budget was exhausted. In comparison with boundary attack (Brendel et al., 2018), HopSkipJumpAttack was in general far more query-efficient, though a large number of queries had to be consumed for probing the local geometry of the decision boundary of the victim model. Its superiority came from using the estimated gradient directions as the search directions, which motivated us to explore even better search directions at each iteration of the attack.
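The boundary-projection step used throughout this pipeline can be sketched as a simple bisection along the segment between the current adversarial point and the benign image. In the sketch below, `is_adversarial` is a hypothetical wrapper around a single hard-label query to the victim model (not part of any method's actual code), and the tolerance is an illustrative choice:

```python
import numpy as np

def binary_search_projection(x_adv, x_benign, is_adversarial, tol=1e-3):
    """Move an adversarial point towards the benign image until it is
    (approximately) on the decision boundary, via bisection.

    `is_adversarial` issues one hard-label query and returns True iff the
    victim's top-1 prediction still differs from the benign label.
    """
    lo, hi = 0.0, 1.0  # fraction of the way from x_adv towards x_benign
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x_mid = (1 - mid) * x_adv + mid * x_benign
        if is_adversarial(x_mid):
            lo = mid  # still adversarial: move closer to the benign image
        else:
            hi = mid  # crossed the boundary: back off
    # Return the last point known to be on the adversarial side.
    return (1 - lo) * x_adv + lo * x_benign
```

Each call consumes roughly log2(1/tol) queries, which is why the reward mechanism in Section 3.3 avoids running a full binary search per action.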
As will be shown in the appendices, drawing on some geometric insights, we found that the gradient directions are in fact not the optimal search directions in the framework of HopSkipJumpAttack. We also found that the task of performing hard-label black-box attacks can be naturally cast into a reinforcement learning task, and thus we explore the possibility of developing a model-based method for predicting the most promising search directions for attacks. Feedback from the victim model can provide supervision, and thus the policy models in our reinforcement learning framework can be trained/fine-tuned on the fly during each attack process, such that few queries are required once the model has been well trained.

3.2. ATTACK AS REINFORCEMENT LEARNING PROBLEM

In this paper, we consider both targeted attacks and untargeted attacks. Given a benign example x, its label y, and the victim model f, an environment E(x, y, f) is naturally formed. The adversaries play the role of the agent, interacting with the environment by issuing queries and collecting feedback under a certain policy. The current example x_t on the decision boundary of the victim model (also called the candidate adversarial example) represents the state at each timestamp t. The agent uses a learnable policy network g, which will be carefully introduced in Section 3.4, to guide its actions, and the action is to update the candidate adversarial example such that less distortion is required to fool the victim model. The action incorporates searching along a promising direction a_t/‖a_t‖, where a_t ∈ R^n is sampled from an isotropic Gaussian distribution whose mean vector is given by the policy network, µ_t = g(x_t, y, y′) ∈ R^n with y′ being the target label, and whose covariance matrix is Σ = σI ∈ R^{n×n}, in which the value of σ ∈ R is set to be gradually increased as the attack on each sample progresses, and I ∈ R^{n×n} is the n × n identity matrix. With a_t, the agent searches along the direction a_t/‖a_t‖ to see whether any better candidate adversarial example can be found. For targeted attacks, the target label y′ is chosen by the agent at the beginning and kept unchanged during the attack process. For untargeted attacks, x_t lies on a decision boundary whose one side corresponds to the ground-truth label y, while the other side can be regarded as the "target label" y′. As will be carefully introduced in Section 3.3, a reward r_t ∈ R, based on the performance of each action and the corresponding a_t, is given to the agent for updating the parameters of the policy network. All details of our PDA are summarized in Algorithm 1.

Algorithm 1 Policy-Driven Attack
1: Input: the environment E(x, y, f); the target label y′; an initial adversarial image x_1 ∈ R^n which lies on the decision boundary; the policy network g.
2: Output: an adversarial example.
3: Initialize the step index t ← 1.
4: while the query count limit is not reached do
5:   // Determine the baseline l_t to evaluate the potential of different actions
6:   µ_t ← g(x_t, y, y′), z ← BS(x_t + δ · µ_t/‖µ_t‖_2, x, f)
     ⋮

Powered by the reinforcement learning framework, we can use policy gradient algorithms to train the policy network g to generate promising search directions in a direct way. For simplicity, we use one-step REINFORCE (Williams, 1992) in the sequel of this paper and leave the exploration of more advanced policy gradient algorithms to future work.
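The Gaussian action sampling and the one-step REINFORCE update can be sketched in a few lines of numpy. Here the policy mean µ_t is represented as a plain vector rather than the output of a network, and the mean-reward baseline is an assumption of this sketch (a common variance-reduction trick), not something stated above:

```python
import numpy as np

def sample_actions(mu, sigma, num_actions, rng):
    """Draw M candidate actions from the isotropic Gaussian N(mu, sigma^2 I)."""
    return mu + sigma * rng.standard_normal((num_actions, mu.size))

def reinforce_grad(mu, sigma, actions, rewards):
    """One-step REINFORCE estimate of d E[r] / d mu for a Gaussian policy.

    For N(a; mu, sigma^2 I), grad_mu log p(a) = (a - mu) / sigma^2, so the
    gradient estimate is the reward-weighted average of these score terms.
    A mean-reward baseline is subtracted to reduce variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = (rewards - rewards.mean())[:, None]
    return (weights * (actions - mu) / sigma**2).mean(axis=0)
```

In PDA itself the gradient would be propagated further into the parameters of the policy network g; this sketch stops at the mean vector for clarity.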

3.3. REWARD AND ACTION

[Figure 1: The reward assignment mechanism of our method. Arcs with colors magenta, yellow, and green correspond to actions with rewards 0, 1, and 2, respectively.]

Figure 1 illustrates how we assign the scalar reward r_t given the current candidate adversarial example x_t and an action a_t.
The decision boundary is illustrated by a horizontal straight line (denoted by B) in the figure, the benign counterpart x is assumed to be below B, and the circle C centered at x_t with a small radius δ shows all possible locations after jumping along the directions of some actions by δ from x_t. As described earlier, the reward r_t should be assigned based on the amount of potential distortion reduction brought by a_t. A direct evaluation can be achieved by jumping along the direction of a_t first and then projecting the updated example back onto the decision boundary via binary search, to see how much improvement is obtained. However, since we evaluate M actions a_{t,i} simultaneously at each iteration (see Algorithm 1) and binary search would need to be performed for each of them, the overall process would be prohibitively (query-)expensive. To assess the performance of an action efficiently, we instead evaluate whether the reduction of distortion obtained by taking that action exceeds particular baselines. Concretely, we first evaluate µ_t = g(x_t, y, y′) as an action directly, using binary search as just described. Supposing that it reduces the required adversarial distortion by l_t, we set up two levels of baselines, ‖x − x_t‖_2 − β_1 · l_t and ‖x − x_t‖_2 − β_2 · l_t, to see whether other actions can lead to adversarial examples closer (than these baselines) to the benign example x, in which β_1 = 0 and β_2 = 0.25. As shown in Figure 1, for an action a ∈ {a_{t,i}}, we first obtain x_t + δ · a/‖a‖_2 and then move it towards x to see how much reward it can obtain. The two arcs V_1 and V_2 indicate where the same progress as the two baselines is achieved; thus we can further project x_t + δ · a/‖a‖_2 onto the arcs to see whether the projections (i.e., x_s^{V_1} and x_s^{V_2}) are still adversarial. In the illustrated case, x_s^{V_1} is still adversarial yet x_s^{V_2} is not, and we assign a reward of 1 to such an action a.
If both projections are still adversarial, we assign a reward of 2, and if neither of them is adversarial, zero reward is assigned. Since x_s^{V_1} not being adversarial implies that x_s^{V_2} is also not adversarial, we query x_s^{V_1} first; this reduces the number of queries for assessing each action to at most 2 (for x_s^{V_1} and x_s^{V_2}) and makes our PDA more query-efficient.
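The reward assignment above can be sketched as follows. `is_adversarial` is again a hypothetical single-query wrapper around the victim model, and the projection onto each arc is computed by moving the jumped point along the ray from the benign image x until the target distance is met:

```python
import numpy as np

def assign_reward(x_t, x, a, delta, l_t, is_adversarial, betas=(0.0, 0.25)):
    """Assign a reward in {0, 1, 2} to action a by testing whether its jump,
    pulled back towards the benign image x, beats the two distortion
    baselines ||x - x_t|| - beta * l_t (the arcs V1 and V2 in Figure 1).

    `is_adversarial` issues one hard-label query to the victim model.
    """
    x_jump = x_t + delta * a / np.linalg.norm(a)
    d_t = np.linalg.norm(x - x_t)
    direction = (x_jump - x) / np.linalg.norm(x_jump - x)
    reward = 0
    for beta in betas:  # V1 (beta = 0) first; stop early if it already fails
        radius = d_t - beta * l_t        # target distance from the benign image
        x_proj = x + radius * direction  # projection onto the corresponding arc
        if not is_adversarial(x_proj):
            break  # the tighter arc V2 would fail as well: no further query
        reward += 1
    return reward
```

The early `break` is what caps the query cost of evaluating an action at two, matching the argument in the text.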

3.4. ARCHITECTURE OF THE POLICY NETWORK

As described in Section 3.2, the goal of our policy network is to predict a direction based on which the optimal adversarial example/candidate can be easily found. Its input is the current example on the decision boundary of the victim model (together with other useful information available to the agent if needed), and its output is expected to be a promising search direction that shares the same dimension as its input. Naïve architecture designs for the policy network include conventional auto-encoders and U-Net (Ronneberger et al., 2015). However, our experimental results suggest that such off-the-shelf auto-encoding architectures often offer degraded performance (see Section 4.3 for more details). We note that predicting a promising search direction differs from the computer vision tasks for which these architectures are widely applied (e.g., predicting a segmentation map). Specifically, a segmentation map often aligns with the visual contents of the input image, while the promising search directions might be less correlated with the semantics of the input examples. We reckon it can be more beneficial to incorporate domain knowledge about adversarial attacks into the architecture of the policy network g. On this point, we propose a new architecture for the policy network of our PDA. First, we know from HopSkipJumpAttack that the gradient direction at x_t, although not theoretically optimal, can provide strong empirical performance when serving as the search direction. Therefore, designing an architecture which outputs the gradient vector at the point of its input seems an appropriate option for the policy network g. Formally, it takes the candidate adversarial example x_t, the ground-truth label y of the benign example, and the target label y′ as input, and maps them to a search direction in R^n. In this spirit, the policy network is designed to own an internal classifier h : R^n → R^k, which performs a k-way classification.
The number k can be the same as the number of prediction classes of the black-box victim model if the adversaries have such information. We hope the internal classifier h can learn to distill knowledge from the victim model if possible, such that we can use the input gradient of h as a descent search direction. Following the logit-diff loss developed for the white-box setting in Carlini & Wagner (2017)'s work, we propose to use

g(x_t, y, y′) = ∇_{x_t} h(x_t)_{y′} − ∇_{x_t} h(x_t)_y + b,    (1)

as the output of the policy network, where a learnable bias vector b ∈ R^n is introduced to improve the capacity and flexibility of the network. The forward process of such a policy network is basically a back-propagation process of the internal classifier h, and if the decision boundary of the internal classifier h is aligned with that of the victim model, the output of the policy network is the gradient direction of the victim model. Since the model is parameterized and has sufficient capacity, it can also learn to explore even better search directions than the gradient directions.
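A minimal PyTorch sketch of this design is given below. The internal classifier here is a small MLP placeholder (the paper uses a VGG-13), and all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network g whose forward pass is a back-propagation through an
    internal k-way classifier h: the output search direction is
    grad_x h(x)_{y'} - grad_x h(x)_y + b, with a learnable bias b."""

    def __init__(self, n, k):
        super().__init__()
        # Placeholder internal classifier; the paper adopts VGG-13.
        self.h = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, k))
        self.b = nn.Parameter(torch.zeros(n))  # learnable bias vector

    def forward(self, x_t, y, y_target):
        x_t = x_t.detach().requires_grad_(True)
        logits = self.h(x_t)
        # Logit difference: target-class logit minus true-class logit.
        diff = logits[y_target] - logits[y]
        # create_graph in training mode keeps the double-backward graph,
        # so the policy's own parameters remain trainable.
        grad = torch.autograd.grad(diff, x_t, create_graph=self.training)[0]
        return grad + self.b
```

Note that the "forward pass" of this module really is a backward pass through h, mirroring the description in the text.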

3.5. (OPTIONAL) PRE-TRAINING OF THE POLICY NETWORK

Note that the learning of the policy network can suffer from high sample complexity and may even fail to converge if its initial outputs are completely unable to cut the distortion. Just like for playing the game of Go (Silver et al., 2016), we can optionally (pre-)train the policy network in a supervised manner to make the initial actions more reasonable in the reinforcement learning process, such that the samples collected in the follow-up steps are more informative and the issue is relieved. In this section, we introduce how such pre-training can be performed for the concerned task. Recall that the goal of the policy network is to predict promising search directions for the candidate adversarial examples; thus we can construct the pre-training set S by collecting the intermediate results of any prior attacks on the same victim model, or we can use a simplified policy and collect its suggested actions for constructing S. Given a small dataset D = {(x, y)} which consists of benign examples and their ground-truth labels, it is easier to first run our PDA without learning a policy network, i.e., using an input-independent policy. Concretely, we can instead sample each direction a_t from the distribution N(b, σ_t I) at timestamp t, where b here is still learnable, just like in Eq. (1). We found that such a simplified policy tends to learn a search direction that is very similar to the gradient direction. This simplified reinforcement learning problem, in which the policy network g is absent, shows more stable training performance, and each of its attack trajectories can be effectively used to pre-train the policy network. Each trajectory is formulated as

T_x = {(x_t, x, y, y_t, a_t) | t ∈ {1, 2, . . . , m_x}},

where m_x is the total number of iterations, and x_t, y_t, a_t are the candidate adversarial example, the target label, and the search direction suggested by the simplified policy at iteration t, respectively. To improve the compactness of the sample set for pre-training, we suggest a simple post-processing strategy to discard the tuples in T_x with less informative candidate adversarial examples: for i ≥ 2, we discard the i-th tuple in T_x if ‖x_i − x‖_2 > 0.99 ‖x_j − x‖_2, where j is the index of the latest previous tuple that has been decided not to be discarded. The set that contains the remaining tuples is denoted by T_x^r, and the final pre-training set S is constructed by gathering such sets from all attack trajectories obtained using the simplified policy, i.e.,

S = ∪_{(x,y)∈D} T_x^r.

It then remains to pre-train the policy network given S for better initialization. First, according to the design of g as introduced in Section 3.4, it is natural to encourage the internal classifier h to perform as a classifier; second, and probably more importantly, the outputs of the policy network are encouraged to align with the effective search directions (i.e., a_t) found by the aforementioned simplified policy. On this point, we introduce the cosine similarity S(·, ·) together with a regularizer Ψ which incorporates the classification loss to achieve these two goals, making the pre-training loss L:

L = (1/|S|) Σ_{(x′, x, y, y′, a′) ∈ S} [ −S(g(x′, y, y′), a′) + λ · Ψ(h(x), h(x′), y, y′) ],

where S(·, ·) calculates the cosine similarity between its two input vectors, λ is the coefficient for regularization, and Ψ serves as a regularizer given by

Ψ(h(x), h(x′), y, y′) = CE(h(x), y) + (1/2) CE(h(x′), y) + (1/2) CE(h(x′), y′),

where CE(·, ·) calculates the cross-entropy loss given logits and labels.
Since x′ lies on the decision boundary of the victim model f, f should assign probability 0.5 to both the benign class y and the adversarial class y′ at x′, which explains the terms (1/2) CE(h(x′), y) and (1/2) CE(h(x′), y′).
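Putting the two objectives together, the pre-training loss can be sketched in plain Python. Here `g` and `h` are passed in as callables; in practice both would be the policy network's neural modules, so this is an illustrative skeleton rather than the paper's implementation:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity S(u, v) between two vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def cross_entropy(logits, label):
    """CE given raw logits and an integer label (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def pretrain_loss(samples, g, h, lam=0.003):
    """L = mean over S of [ -S(g(x', y, y'), a')
                            + lam * Psi(h(x), h(x'), y, y') ],
    with Psi = CE(h(x), y) + 0.5*CE(h(x'), y) + 0.5*CE(h(x'), y')."""
    total = 0.0
    for x_adv, x, y, y_adv, a in samples:
        psi = (cross_entropy(h(x), y)
               + 0.5 * cross_entropy(h(x_adv), y)
               + 0.5 * cross_entropy(h(x_adv), y_adv))
        total += -cosine_sim(g(x_adv, y, y_adv), a) + lam * psi
    return total / len(samples)
```

When g's output exactly matches a′, the cosine term contributes −1 per sample; the Ψ term pulls h(x′) toward an even split between y and y′, matching the boundary argument above.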

4. EXPERIMENTS

In this section, we evaluate the effectiveness of our method on three datasets: MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky & Hinton, 2009), and ImageNet (Russakovsky et al., 2015). We compare our method with Boundary Attack (Brendel et al., 2018) and HopSkipJumpAttack (Chen et al., 2020) in both untargeted and targeted settings, and evaluate the required ℓ2 distortions given a specific query budget, as in recent related work (Brendel et al., 2018; Chen et al., 2020; Li et al., 2020a). For a comprehensive comparison, we report the distortions at {100, 500, 1K, 5K, 10K, 25K} query budgets in all experiments in the paper. The mean distortions for untargeted and targeted attacks are reported in Table 1 and Table 2, respectively, and the median distortions are reported in the appendices. For targeted attacks, we set the target label to y′ = (y + 1) mod 10, where y is the true label of the clean image, and we only show results on MNIST and CIFAR-10, which are faster to evaluate. All experiments are conducted on NVIDIA RTX 2080 Ti GPUs with PyTorch (Paszke et al., 2017).

4.1. EXPERIMENTAL SETTINGS

Victim models. On CIFAR-10, one of the victim models is a ResNet-50 (Engstrom et al., 2019), which was adversarially trained under ℓ2 PGD attacks (ε = 1.0). On ImageNet, we adopt a ResNet-18 (He et al., 2016) from the PyTorch official model zoo as the victim model, which shows a top-1 error rate of 30.24% on the official ImageNet validation set. As for the policy network g, we adopt a VGG-13 (Simonyan & Zisserman, 2015) architecture as its internal classifier h for attacking all these victim models, i.e., the architecture of g and h is by no means similar to that of any of the victim models.

Implementation details. We mostly test our PDA with pre-training of the policy network g; its performance without pre-training is presented in the appendices. When pre-training is to be performed, a training dataset S must be constructed for it.
On MNIST and CIFAR-10, we randomly sample 5,000 images that are confirmed to be correctly classified by the victim model from the official test set to construct S. On ImageNet, we similarly gather 50,000 images from an auxiliary dataset called ImageNetV2 and the official ImageNet validation set to construct S. For each gathered image x, we sample m_x = 500 actions at each iteration t (see Eq. (2)). When constructing S with the simplified policy for pre-training as described in Section 3.5, we choose the SGD optimizer without momentum and use a learning rate of 0.003 in our PDA. Another 500 images are collected to form the validation set for tuning all hyper-parameters on each of the three datasets (i.e., MNIST, CIFAR-10, and ImageNet). After a pre-trained policy network is obtained, the final performance of our PDA is evaluated on a set of 1,000 clean images disjoint from the training/validation sets described above (i.e., these 1,000 images are used neither for training the policy network nor for tuning hyper-parameters). At this stage, we sample only m_x = 25 actions at each iteration for faster convergence in all experiments. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0001, and the cross-entropy regularization in Eq. (5) is applied with a coefficient λ = 0.003. To achieve a better trade-off between exploration and exploitation, we initialize σ in the sampling Gaussian distribution to 0.003, and rescale it at each iteration if necessary, so that the ratio between the average output of the policy network and σ lies in the range [0.01, 0.5]. The value of σ is doubled if all sampled actions at an iteration receive zero reward. The step size during the attack is set to 0.4‖x − x_t‖_2, and the geometric regression strategy suggested by Chen et al. (2020) is also applied.
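The σ-scaling rule just described can be sketched as follows; `mean_policy_output` stands for the average magnitude of the policy network's output, and the names and exact clipping form are our own reading of the description rather than the released code:

```python
def adapt_sigma(sigma, mean_policy_output, rewards, lo=0.01, hi=0.5):
    """Double sigma when every sampled action received zero reward;
    otherwise rescale sigma so that the ratio
    mean_policy_output / sigma stays within [lo, hi]."""
    if rewards and all(r == 0 for r in rewards):
        return 2.0 * sigma
    ratio = mean_policy_output / sigma
    if ratio < lo:   # sigma too large relative to the policy output
        return mean_policy_output / lo
    if ratio > hi:   # sigma too small: exploration would collapse
        return mean_policy_output / hi
    return sigma
```

Keeping the ratio bounded prevents the sampled actions from being dominated either by the learned mean direction or by pure Gaussian noise.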
To make a fair comparison, we also sample 25 probes around each example on the decision boundary for HopSkipJumpAttack, which yields better performance than its default setting of 100 probes. Other hyper-parameters of Boundary Attack and HopSkipJumpAttack are kept as in their original papers. The starting adversarial examples for untargeted attacks are obtained by sampling from a uniform distribution over the input space [0, 1]^n until the adversarial criterion is met; for targeted attacks, we directly select a benign image from the target class as the starting point, since for some victim models it is often hard to find an input from a particular class via random sampling. Once generated, the starting points are shared among all the compared attack methods.
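The untargeted starting-point procedure can be sketched as below; `is_adversarial` is a hypothetical oracle that queries the victim model and checks whether the top-1 prediction differs from the true label:

```python
import random

def untargeted_start(is_adversarial, dim, seed=0, max_queries=10_000):
    """Sample uniformly from the input space [0, 1]^dim until the
    adversarial criterion is met; each draw costs one victim query."""
    rng = random.Random(seed)
    for _ in range(max_queries):
        x0 = [rng.random() for _ in range(dim)]
        if is_adversarial(x0):
            return x0
    raise RuntimeError("no adversarial starting point found")
```

For targeted attacks one would instead return a benign image of the target class, since, as noted above, random sampling rarely lands in a specific class.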

4.2. COMPARISON WITH THE STATE-OF-THE-ARTS

Table 1 compares our PDA with the state-of-the-art methods for hard-label black-box untargeted attacks. We consider a threat model in which a large number of benign examples are to be attacked in a hard-label black-box manner, such that the queries spent on pre-training can be omitted from the comparison. In general, our method outperforms its competitors, especially in the earlier stage of the attacks, which matters most when only low query budgets are permitted. In particular, with only 100 queries, our method yields distortions only one-sixth to one-third of those of HopSkipJumpAttack, the second-best method. With a larger query budget of 500, HopSkipJumpAttack still incurs 1.5 to 3.0 times larger distortions than our method. More interestingly, on CIFAR-10, when attacking the ResNet-50 model guarded with adversarial training, which has proved to be one of the most powerful defenses, the superiority of our PDA is even more significant. This observation is consistent with a phenomenon reported in prior work (Yu et al., 2019; Zhang & Wang, 2019): adversarially trained models often have less sharp peaks and cliffs on their decision boundaries, making them easier for our policy network to capture. Table 2 shows the results for targeted attacks, and our PDA again outperforms the others in general. Performance of our PDA under different pre-training configurations is given in the appendices. In practice, pre-training is recommended, since it is crucial for the superior performance of our method.

4.3. ABLATION STUDY

Table 3: Comparison of different architectures for the policy network. Mean ℓ2 distortions for performing untargeted attacks are evaluated.

We perform an ablation study on how the architecture design of the policy network affects the performance of our PDA. More specifically, we train policy networks with several different architectures on the same set S and then attack the CNN victim model on CIFAR-10 using these policy networks. In addition to our design (introduced in Section 3.4), which leverages the gradient of an internal classifier h, we mainly consider the U-Net (Ronneberger et al., 2015), a popular choice of learning model for assisting adversarial attacks, and compare our proposed architecture with several U-Nets under different configurations. Details of their architectures can be found at https://github.com/milesial/Pytorch-UNet. Note that, for the U-Net models, the cross-entropy regularization is not applied, since they do not perform classification and there are no logits for computing such a loss. The performance of the different policy networks in terms of mean ℓ2 distortion is reported in Table 3. We can see that the proposed architecture significantly outperforms U-Net in the framework of our PDA.

5. CONCLUSION

Existing hard-label black-box attacks often suffer from very high query complexity. In this paper, we have introduced a model-based method (i.e., PDA) for learning from past queries and model feedback, based on a reinforcement learning formulation of the attack. We have developed a novel architecture for the policy network, designed to suggest promising search directions for the adversarial examples. Moreover, we have demonstrated that pre-training of such a policy network, which is crucial for the attack performance, can be performed effectively using prior attack logs on the same victim model. Experimental results on various victim models (including both naturally and adversarially trained ones) trained on different datasets (including MNIST, CIFAR-10, and ImageNet) show that the proposed PDA significantly outperforms existing state-of-the-art methods in terms of query efficiency.

A OPTIMAL SEARCH DIRECTION FOR HOPSKIPJUMPATTACK

We illustrate the optimal search direction for HopSkipJumpAttack in a two-dimensional input space in Figure 2. The decision boundary is depicted as a horizontal straight line (denoted by B), the benign counterpart x of the adversarial example is assumed to lie below B, and the circle C centered at the candidate adversarial example x_t with a small radius δ shows all possible locations after jumping by distance δ from x_t along some direction. Under the locally linear assumption, the gradient direction u_g is vertical; after updating x_t along that direction (by δ) and projecting the updated image back onto the decision boundary B (path marked in blue), we obtain x_t^g. Let the straight line T be the tangent line of C that passes through x, and let u_o be the direction perpendicular to T. If we update x_t along u_o and then project the result back onto B (path marked in green), we obtain x_t^o. Clearly, x_t^o has a smaller distortion than x_t^g, i.e., ‖x_t^o − x‖_2 < ‖x_t^g − x‖_2, indicating that u_o is a better direction than the gradient u_g in the sense of reducing distortion. It is also easy to verify that u_o is the optimal update direction in this two-dimensional case.
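This argument can be checked numerically. The sketch below places the boundary B on the x-axis, the benign point x below it, and the candidate x_t on B; jumping by δ along a unit direction and projecting back onto B along the ray from x, a dense sweep over directions beats the vertical gradient direction u_g. The coordinates are our own illustrative choice, not values from the paper:

```python
import math

def projected_distortion(x, x_t, u, delta):
    """Jump from x_t by delta along unit direction u, project the result
    back onto the boundary (the x-axis, y = 0) along the ray from the
    benign point x, and return the resulting L2 distortion."""
    zx, zy = x_t[0] + delta * u[0], x_t[1] + delta * u[1]
    t = -x[1] / (zy - x[1])          # ray x + t*(z - x) meets y = 0
    px = x[0] + t * (zx - x[0])
    return math.hypot(px - x[0], -x[1])

x, x_t, delta = (0.0, -1.0), (1.0, 0.0), 0.1
d_grad = projected_distortion(x, x_t, (0.0, 1.0), delta)  # vertical u_g
d_best = min(projected_distortion(x, x_t,
                                  (math.cos(a), math.sin(a)), delta)
             for a in (i * math.pi / 1800 for i in range(1, 1800)))
```

The sweep's best direction achieves a strictly smaller projected distortion than the gradient direction, while both improve on the current distortion ‖x_t − x‖_2, consistent with the geometric argument above.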

B MEDIAN DISTORTIONS

We report the median ℓ2 distortions over the perturbed test images at different query count budgets for performing untargeted attacks in Table 4. Our method outperforms existing state-of-the-art methods in all test cases, especially in the earlier stages of the attacks.

C EFFECTS OF PRE-TRAINING

In this section, we study the effects of pre-training in our PDA. We only show results on MNIST with a CNN as the victim model, which is faster to evaluate. We test our PDA under several different pre-training configurations: (a) the policy network does not have the internal classifier h, i.e., its output is input-agnostic and pre-training is thus meaningless in this case; (b) the policy network has a VGG-13 as its internal classifier as in the main paper, and it is pre-trained on datasets of smaller sizes. To this end, we first sample a subset D′ of size |D′| ∈ {0, 50, 500, 5000} from D, which has 5,000 benign images in total and is used to create S in the main paper, and then collect tuples from D′ to form a (possibly smaller) pre-training set. Table 5 summarizes our results: the second column indicates the size of D′, which reflects the cost of pre-training (in particular, |D′| = 0 means no pre-training is applied), and the third column indicates whether the policy network has an internal classifier. When pre-training is not applied, our PDA with a simplified policy (i.e., without h) achieves performance comparable to HopSkipJumpAttack, yet directly incorporating a randomly initialized internal classifier into the policy network leads to much worse results, since in this case the initial search directions suggested by the policy network are nearly random and often completely fail to reduce the distortion. Moreover, as the size of the pre-training set increases, the performance of our PDA gradually improves.

E TRANSFERABILITY OF THE PRE-TRAINED POLICY NETWORK

In this section, we study the transferability of the pre-trained policy network across different victim models on the same dataset. To do so, on CIFAR-10, we use policy networks pre-trained on {CNN, WRN, ResNet-50 Adv.} to attack all three victim models.
Table 7 summarizes our results: the first column is the victim model used to evaluate the attack performance, and the second column is the victim model used to collect the dataset S for pre-training. We see from Table 7 that when a policy network pre-trained on a different victim model is used, the attack performance degrades in all test cases. However, in the early stage of the attack process, our PDA still consistently provides smaller distortions even when the policy network is pre-trained on a different victim model, enabling attackers to benefit from our PDA by using it to provide high-quality starting points for other attacks. More importantly, this transferability allows one to pre-train the policy network on local surrogate models, so the queries consumed in collecting the pre-training dataset can be saved in practice.

F ATTACK PERFORMANCE ON THE PRE-TRAINING DATASET

The untargeted attack performance on the pre-training dataset S is reported in Table 8. Our method performs best on the pre-training images, just as on the test images. Comparing the performance of our method on the pre-training images and on the test images, we see that the "overfitting" of our method is moderate and acceptable in practice.

G COMPUTATION AND MEMORY COMPLEXITY

As a policy network is involved in our proposed attack process, our PDA naturally has higher computation and memory complexity than the baseline methods. In our experiment attacking a ResNet-18 model on ImageNet, for every 100 queries to the victim model, our method spends an extra 947 ms on a single GPU (excluding the inference time of the victim model) to fine-tune the policy network, while HopSkipJumpAttack requires 354 ms. Although our method is thus more computationally intensive, the extra overhead is acceptable considering that the inference of the victim model is also costly (672 ms), and the run-time of our method can be reduced by using multiple GPUs. As for GPU memory consumption, when attacking the ResNet-18 victim with batch size 25, our method needs an additional ~5 GB of GPU memory compared with HopSkipJumpAttack, which needs ~2 GB. The computational and memory cost of our method could be further reduced by compressing the policy network, which we leave to future work.

H COMPARISON TO SIGN-OPT

In this section, we compare our method to a recent hard-label black-box attack named Sign-OPT (Cheng et al., 2020). We directly use its official implementation and hyper-parameters. When evaluating Sign-OPT, we use the same victim networks and initial adversarial images as for our method to make a fair comparison. The untargeted attack performance of Sign-OPT on MNIST and CIFAR-10 is reported in Table 9. In general, our method outperforms Sign-OPT, especially in the earlier stages of the attack.



Footnotes:

In practice, this is performed by randomly sampling until the adversarial constraint is satisfied, i.e., the sample is not classified as y by the victim model, or by directly choosing a benign sample from the adversarial class.

If µ_t is unable to reduce the distortion, or l_t < 0.05·δ, we clip l_t to 0.05·δ for numerical stability.

Pre-trained weights: https://github.com/IBM/Autozoom-Attack

Pre-trained weights: https://github.com/bearpaw/pytorch-classification

https://github.com/cmhcbb/attackbox




Figure 2: The optimal search direction for HopSkipJumpAttack.

Figure 3 provides visualizations of generated adversarial examples on a randomly selected benign image. Each row in the figure represents a pre-training configuration, and the images in each column are candidate adversarial examples under a certain query count budget. The benign example is classified as "7" by the victim model, and all images in the figure are classified as "3". The first column of the figure shows the common initial adversarial example for all configurations, which is generated by sampling from the uniform distribution over [0, 1]^n as described in the main paper. It can be seen that our method provides adversarial images of higher quality, especially in the early stage. Since favorable results can be obtained with light or even no pre-training of the policy network, our PDA is also applicable to scenarios where only a few adversarial examples are to be generated. This paper introduces a novel perspective of viewing hard-label black-box attacks as a reinforcement learning problem, so more advanced policy gradient methods can also be tested to further improve the performance of our PDA, with or without pre-training.

(Algorithm fragment: ... if ‖z − x‖_2 ≤ ‖x_t − x‖_2, otherwise µ_t ← a_{t,i*}, i* = arg max_i r_{t,i}.)

Table captions (table bodies omitted):

Mean ℓ2 distortions for performing untargeted attacks with different query budgets.

Mean ℓ2 distortions for performing targeted attacks with different query budgets.

Median ℓ2 distortions for performing targeted attacks with different query budgets.

Comparison of different pre-training configurations for the policy network. Mean ℓ2 distortions for performing untargeted attacks are evaluated.

Transferability of pre-trained policy networks on CIFAR-10. The last six columns are median ℓ2 distortions for performing untargeted attacks with different query budgets.

Mean ℓ2 distortions for performing untargeted attacks with different query budgets on the pre-training dataset S.

Mean ℓ2 distortions for performing untargeted attacks with different query budgets.

ACKNOWLEDGMENTS

This work is funded by the National Key Research and Development Program of China (No. 2018AAA0100701) and the NSFC 61876095.

D VALUES OF BASELINE LEVELS

In this section, we explain how the values β_1 = 0 and β_2 = 0.25 are selected. These values are tuned under the constraint β_1 < β_2. We try different combinations of values in {0, 0.25, 0.5, 1.0} for β_1 and β_2 on MNIST and find that β_1 = 0, β_2 = 0.25 give the best result on the validation set (containing 500 images, as described in Section 4.1); these values are then applied to all other experiments. In Table 6 we report the mean ℓ2 distortions when attacking a CNN victim on MNIST test images (1,000 images, as described in Section 4.1) at different query count budgets. We see that the performance of our method is fairly robust to the tested values of β_1 and β_2, and that smaller values usually perform better in the late stage of the attacks.
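The search space described above is small and can be enumerated directly; `validation_distortion` stands in for a hypothetical evaluation of one (β1, β2) pair on the 500 validation images:

```python
from itertools import product

def candidate_pairs(values=(0.0, 0.25, 0.5, 1.0)):
    """All ordered pairs with beta1 < beta2 from the tried values."""
    return [(b1, b2) for b1, b2 in product(values, values) if b1 < b2]

def best_pair(validation_distortion, values=(0.0, 0.25, 0.5, 1.0)):
    """Pick the pair with the smallest mean validation distortion."""
    return min(candidate_pairs(values), key=validation_distortion)
```

With four candidate values, the constraint β1 < β2 leaves only six pairs to evaluate, so an exhaustive sweep on the validation set is cheap.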

