HARD-LABEL MANIFOLDS: UNEXPECTED ADVANTAGES OF QUERY EFFICIENCY FOR FINDING ON-MANIFOLD ADVERSARIAL EXAMPLES

Abstract

Recent zeroth-order hard-label attacks on image classification tasks have shown performance comparable to their first-order alternatives. It is well known that in this setting, the adversary must search for the nearest decision boundary in a query-efficient manner. State-of-the-art (SotA) attacks rely on the concept of pixel grouping, or super-pixels, to perform efficient boundary search. It was recently shown in the first-order setting that regular adversarial examples leave the data manifold, while on-manifold examples are generalization errors (Stutz et al., 2019). In this paper, we argue that query efficiency in the zeroth-order setting is connected to the adversary's traversal through the data manifold. In particular, query-efficient hard-label attacks have the unexpected advantage of finding adversarial examples close to the data manifold. We empirically demonstrate that against both naturally and robustly trained models, an efficient zeroth-order attack produces samples with a progressively smaller manifold distance measure. Further, when a normal zeroth-order attack is made query-efficient through the use of pixel grouping, it can achieve up to a two-fold increase in query efficiency and, in some cases, reduce a sample's distance to the manifold by an order of magnitude.

1. INTRODUCTION

Adversarial examples in the context of deep learning models were originally investigated as blind spots in classification (Szegedy et al., 2013; Goodfellow et al., 2014). Formalized methods for discovering these blind spots emerged, referred to as gradient-level attacks, and became the first style to reach widespread attention in the deep learning community (Papernot et al., 2016; Moosavi-Dezfooli et al., 2015; Carlini & Wagner, 2016; 2017). In order to compute the necessary gradient information, such techniques required access to model parameters and a sizeable query budget, needing surrogate information to be competitive (Papernot et al., 2017). This naturally led to the creation of score-level attacks, which only require the confidence values output by machine learning models (Fredrikson et al., 2015; Tramèr et al., 2016; Chen et al., 2017; Ilyas et al., 2018). However, the grand prize of adversarial ML (AML), hard-label attacks, has only been realized in recent years. These methods, which originated from a random walk on the decision boundary (Brendel et al., 2017), have been carefully refined to offer convergence guarantees (Cheng et al., 2019), query efficiency (Chen et al., 2019; Cheng et al., 2020), and simplicity through super-pixel grouping (Chen & Gu, 2020), without ever sacrificing earlier advances. Despite the steady improvements of hard-label attacks, open questions persist about their behavior, and about AML attacks at large. Adversarial examples were originally assumed to lie in rare pockets of the input space (Goodfellow et al., 2014), but this was later challenged by the boundary tilting assumption (Tanay & Griffin, 2016; Gilmer et al., 2018), which adopts a "data-geometric" view of the input space living on a manifold. This is supported by Stutz et al. (2019), who suggest that regular adversarial examples leave the data manifold, while on-manifold adversarial examples are generalization errors.
In this paper, we adopt the boundary-tilting assumption and demonstrate an unexpected benefit of query-efficient zeroth-order attacks: such attacks are more likely to discover on-manifold examples, and this is primarily enabled by the use of down-scaling techniques, such as super-pixel grouping. This is initially counter-intuitive, since down-scaling techniques reduce the search dimension, which artificially limits the search space and can lead to worse (farther-away) adversarial examples. Our results suggest, however, that super-pixels help eliminate the search space of off-manifold adversarial examples, leading to examples which are truly generalization errors. With this knowledge, it is possible to rethink the design of hard-label attacks towards those which resemble attack (b) in Figure 1, rather than (a) or (c). Figure 1, inspired by the boundary-tilting view of Tanay & Griffin (2016), illustrates three styles: a) a zeroth-order attack targeting low-level features, leaving the manifold; b) an efficient zeroth-order attack targeting mostly high-level features, floating along the manifold; and c) a manifold-based zeroth-order attack next to the manifold, but sacrificing similarity. Our specific contributions are as follows:
• Reveal new insights into manifold feedback during query-efficient zeroth-order search. We describe an approach for extending dimension-reduction techniques in the score-level setting (Tu et al., 2019) to hard-label attacks. Afterwards, we propose the use of the FID score (Heusel et al., 2018) as an L_p-agnostic means for estimating distance to the sampled sub-manifold. This measure allows us to empirically demonstrate the connection between query efficiency and manifold feedback from the model, beyond the known convergence rates tied to dimensionality (Nesterov & Spokoiny, 2017). We later tie this result to known behavior in the gradient-level setting (Engstrom et al., 2019), which shows that manifold information is leaked from models in the hard-label setting, regardless of their robustness.
• Attack-agnostic method for super-pixel grouping. We show that bilinear down-scaling of the input space acts as a form of super-pixel grouping, which yields up to 140% and 210% query-efficiency gains for the previously-proposed HSJA (Chen et al., 2019) and Sign-OPT (Cheng et al., 2020) attacks, respectively. These results align with previously observed behavior in the score-level setting (Tu et al., 2019).
• Introduction of a manifold distance oracle. Our analysis covers a comprehensive array of datasets, model regularization methods, and L_p-norm settings from the literature. Regardless of the setting, we observe consistent leveraging of manifold information during query-efficient attacks. Thus we propose an information-theoretic formulation and interpretation of the noisy manifold distance oracle, which enables zeroth-order attacks to craft on-manifold examples. Studying this problem may assist in understanding the fundamental limits and utility of hard-label attacks.

2. RELATED WORK

Since the original discovery of adversarial samples (Szegedy et al., 2013; Goodfellow et al., 2014) and later formulations based on optimization (Carlini & Wagner, 2016; Moosavi-Dezfooli et al., 2015), the prevailing question has been why such examples exist. The original assumption was that adversarial examples lived in low-probability pockets of the input space and were thus never encountered during parameter optimization (Szegedy et al., 2013). This effect was believed to be amplified by the linearity of weight activations in the presence of small perturbations (Goodfellow et al., 2014). These assumptions were later challenged by the manifold assumption, which in summary claims that 1) the train and test sets of a model occupy only a sub-manifold of the true data, while the decision boundary lies close to samples on and beyond the sub-manifold (Tanay & Griffin, 2016), and 2) the "data-geometric" view, where the high-dimensional geometry of the true data manifold enables a low-probability error set to exist (Gilmer et al., 2018). The manifold assumption further describes adversarial samples as leaving the manifold, which has inspired many defenses based on projecting such samples back to the data manifold (Jalal et al., 2019; Samangouei et al., 2018a), and adaptive attacks for foiling these defenses (Carlini et al., 2019; Carlini & Wagner, 2017; Tramer et al., 2020). We investigate the scenario where an adversary uses zeroth-order information to either estimate the gradient direction or find the closest decision boundary, related to previous work in the gradient-level setting by Stutz et al. (2019). In our setting, the adversary uses the top-1 label feedback from the model to reach their goal. They can also use a low-dimensional approximation of the data manifold to improve query efficiency. However, to date it is not completely understood how this affects the traversal through the data manifold, particularly in the zeroth-order setting.
3. ZEROTH-ORDER SEARCH THROUGH THE MANIFOLD

Our primary motivation is to characterize recent zeroth-order attacks as they relate to ideas of manifold traversal (Chen et al., 2019; Cheng et al., 2020; Chen & Gu, 2020). In the most common problem setting, the adversary is interested in attacking a K-way multi-class classification model f : R^d → {1, . . . , K}. Given an original example x_0, the goal is to generate an adversarial example x such that x is close to x_0 and f(x) ≠ f(x_0), where closeness is often approximated by the L_p-norm of x − x_0. The value of this approximation is debated in the literature (Heusel et al., 2018; Tsipras et al., 2018; Engstrom et al., 2019). Likewise we turn to alternative methods, shown later, for measuring closeness. First we step through the formulation for contemporary hard-label attacks, then show how dimension-reduced score-level attacks are extended to the hard-label setting, which enables analysis of zeroth-order decision boundary search.
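The hard-label threat model above can be made concrete with a short sketch. The toy classifier and names below are illustrative stand-ins of ours, not part of any attack from the literature; the only point is that the adversary sees nothing but the top-1 label.

```python
import numpy as np

def is_adversarial(f, x, x0):
    """Hard-label success test: the only feedback available to the
    adversary is the top-1 label f(x), never scores or gradients."""
    return f(x) != f(x0)

# Toy stand-in for a K-way classifier f: R^d -> {1, ..., K}.
# Here: a linear model over d=4 features with K=3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
f = lambda x: int(np.argmax(W @ x)) + 1

x0 = rng.normal(size=4)
x = x0 + 0.5 * rng.normal(size=4)   # a candidate perturbation
print(is_adversarial(f, x, x0))     # boolean label feedback only
```

Every attack discussed below reduces to issuing many such label queries and steering the perturbation from the answers.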

3.1. GRADIENT-LEVEL FORMULATION

For gradient-level attacks, assume that f(x) = argmax_i Z(x)_i, where Z(x) ∈ R^K is the final (logit) layer output and Z(x)_i is the prediction score for the i-th class. The stated goal is then satisfied by the optimization problem

x* := argmin_x { ||x − x_0||_p + c · L(Z(x)) }, (1)

where L is a surrogate classification loss encouraging misclassification and c > 0 trades off distortion against attack success.
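The text leaves the loss L unspecified; a minimal sketch of evaluating the objective above, assuming a hinge-style logit-margin loss (one common choice in this literature, not necessarily the one intended here), could look like:

```python
import numpy as np

def attack_objective(x, x0, logits, y0, c, p=2):
    """Gradient-level objective ||x - x0||_p + c * L(Z(x)).
    L is assumed here to be a hinge loss on the logit margin of the
    true class y0; it drops to zero once x is misclassified."""
    margin = logits[y0] - max(z for i, z in enumerate(logits) if i != y0)
    L = max(margin, 0.0)
    return np.linalg.norm(x - x0, ord=p) + c * L
```

In a real attack, `logits` would be the model output Z(x) and the expression would be minimized over x with a first-order optimizer.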

3.2. SCORE-LEVEL AND HARD-LABEL ATTACKS

In the gradient-level setting, we require the gradient ∇f(·). However, in the score-level setting we are forced to estimate ∂f(x)/∂x without access to ∇f(·), using only evaluations of Z(·). Tu et al. (2019) reformulate the previous problem into a version relying instead on the ranking of class predictions from Z. In practical scenarios, the estimate is found using the random gradient-free (RGF) method, a scaled random gradient estimator of ∇f(x) over q random directions {u_i}_{i=1}^{q}. The score-level setting was extended to several renditions of the hard-label setting, which we clarify below. In each case the goal is to approximate the gradient by ĝ.

OPT-Attack. For a given example x_0, true label y_0, and hard-label black-box function f : R^d → {1, . . . , K}, Cheng et al. (2019) define the objective function g : R^d → R as a function of search direction θ, where g(θ*) is the minimum distance from x_0 to the nearest adversarial example along the direction θ. For the untargeted attack, g(θ) corresponds to the distance to the decision boundary along direction θ, and allows for estimating the gradient as

ĝ = (1/q) Σ_{i=1}^{q} [ (g(θ + β u_i) − g(θ)) / β ] · u_i, (2)

where β is a small smoothing parameter. Notably, g(θ) is continuous even if f is a non-continuous step function.

Sign-OPT. Cheng et al. (2020) later improved the query efficiency by only considering the sign of the gradient estimate,

∇g(θ) ≈ ĝ := Σ_{i=1}^{q} sgn( g(θ + β u_i) − g(θ) ) · u_i.

We focus on the Sign-OPT variant, as the findings are more relevant to the current SotA.

HopSkipJumpAttack. Similar to Sign-OPT, HopSkipJumpAttack (HSJA) (Chen et al., 2019) uses a zeroth-order sign oracle to improve the Boundary Attack proposed by Brendel et al. (2017). HSJA lacks the convergence analysis of OPT-Attack/Sign-OPT and relies on a one-point gradient estimate. Regardless, HSJA is competitive with Sign-OPT for SotA in the L_2 setting.
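The two estimators can be sketched as follows. Here `g` stands in for the boundary-distance function of Cheng et al. (2019), which in practice is itself evaluated through hard-label queries (a detail omitted from this sketch):

```python
import numpy as np

def rgf_estimate(g, theta, beta=0.01, q=20, rng=None):
    """Two-point RGF estimate of the gradient of g at theta (Eq. 2):
    average finite differences along q random Gaussian directions."""
    if rng is None:
        rng = np.random.default_rng()
    g0 = g(theta)
    ghat = np.zeros_like(theta)
    for _ in range(q):
        u = rng.normal(size=theta.shape)
        ghat += (g(theta + beta * u) - g0) / beta * u
    return ghat / q

def signopt_estimate(g, theta, beta=0.01, q=20, rng=None):
    """Sign-OPT variant: keep only the sign of each finite difference,
    discarding magnitude but tolerating a coarse distance oracle."""
    if rng is None:
        rng = np.random.default_rng()
    g0 = g(theta)
    ghat = np.zeros_like(theta)
    for _ in range(q):
        u = rng.normal(size=theta.shape)
        ghat += np.sign(g(theta + beta * u) - g0) * u
    return ghat
```

On a smooth test function such as g(θ) = ||θ||², both estimates align in expectation with the true gradient direction 2θ, which makes the trade-off concrete: the sign variant spends one bit per direction instead of a full finite-difference value.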
RayS. Chen & Gu (2020) propose an alternative method: search for the minimum decision-boundary radius r from a sample x_0 along a ray direction θ. Instead of searching over R^d to minimize g(θ), they propose to perform a ray search over directions θ ∈ {−1, 1}^d, resulting in at most 2^d possible directions. This reduction of the search resolution enables SotA query efficiency in the L_∞ setting with proof of convergence. The search resolution is further reduced by the hierarchical variant of RayS, which performs on-the-fly upscaling of image super-pixels.
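The core ray-search primitive can be sketched as a binary search along one fixed direction; this is a simplified stand-in of ours for the RayS radius search, omitting the hierarchical block scheme and the direction-flipping loop of the actual algorithm:

```python
import numpy as np

def ray_radius(f, x0, theta, r_hi=10.0, tol=1e-3):
    """Binary-search the decision-boundary radius r along a ray
    direction theta in {-1, +1}^d, using only hard-label feedback f.
    Returns float('inf') if no label flip occurs within r_hi."""
    y0 = f(x0)
    d = np.sqrt(theta.size)          # normalize so ||theta / d||_2 = 1
    if f(x0 + r_hi * theta / d) == y0:
        return float('inf')
    lo, hi = 0.0, r_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(x0 + mid * theta / d) == y0:
            lo = mid                 # still original class: go farther
        else:
            hi = mid                 # label flipped: boundary is nearer
    return hi
```

Each halving costs one label query, so locating the boundary to precision tol along a single ray takes roughly log2(r_hi / tol) queries; the attack then compares radii across candidate sign patterns.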

3.3. DIMENSION-REDUCED ZEROTH-ORDER SEARCH

The attacks described so far each represent an improvement in query efficiency under different L_p-norm scenarios. The difficulty in performing a holistic analysis of their behavior lies in each attack's unique design. In order to characterize query-efficient attacks in a consistent way, we rely on a hard-label version of the reduced-dimension scheme proposed by Tu et al. (2019). This scheme allows us to dynamically scale the expected query efficiency up or down in a controlled manner. The reduced-dimension search is feasible since it has been shown that the intrinsic dimensionality of data is often lower than the true dimension (Amsaleg et al., 2017). Likewise, zeroth-order attacks can exploit the known convergence rate of zeroth-order gradient descent, which is tied to the dimensionality d of the vectorized input (Nesterov & Spokoiny, 2017; Liu et al., 2018). In practice this reduction is implemented through an encoding map E : R^d → R^{d′} for reduced dimension d′ < d, and a decoding map D : R^{d′} → R^d. In general the adversarial sample is created by

x = x_0 + g(D(θ′)) · D(θ′) / ||D(θ′)||, (3)

where θ′ ∈ R^{d′} is optimized depending on the respective attack (e.g., Sign-OPT and HSJA), and, as before, g is a measure of distance to the decision boundary in direction D(θ′). The mapping functions can be initialized with either an autoencoder (AE), or a pair of channel-wise bilinear transform functions (henceforth referred to as BiLN) which simply scale the input up or down by a fixed scaling factor. These choices were previously investigated by Tu et al. (2019) as a way to improve query efficiency in score-level attacks, and ultimately performed similarly with respect to query efficiency. For our purposes, these functions represent two distinct methods of synthesizing adversarial samples, which either rely on an approximate description of the manifold (AE), or instead use a deterministic rescaling to achieve efficiency (BiLN).
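A minimal NumPy sketch of the BiLN maps and the update above; the function names are ours, and a real implementation would typically call a library resize rather than hand-rolling the interpolation:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Channel-wise bilinear rescaling (the BiLN map): serves as both
    encoder E (downscale) and decoder D (upscale). img: (H, W, C)."""
    in_h, in_w = img.shape[:2]
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def apply_direction(x0_img, theta_low, g_dist):
    """x = x0 + g * D(theta') / ||D(theta')|| (Eq. 3): upscale the
    reduced-dimension direction, normalize, and step the estimated
    distance g to the boundary."""
    d_theta = bilinear_resize(theta_low, *x0_img.shape[:2])
    return x0_img + g_dist * d_theta / np.linalg.norm(d_theta)
```

Note that the perturbation magnitude ||x − x_0|| equals g exactly by construction; only its spatial pattern is constrained to the low-resolution grid, which is precisely the super-pixel effect discussed below.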
The inclusion of BiLN is important because it allows us to measure the scenario where the adversary has no explicit knowledge of the manifold and relies only on the feedback from the model. Under the AE scenario, the AE is tuned to minimize the reconstruction error of input images. Due to this dependence on labeled data, the output quality of the AE depends on the adversary's ability to collect data, which is a realistic consideration. We model the scenario where the adversary only has access to the test set, which is often considerably less informative than the training set. This manifests as an extra distortion in addition to the adversarial noise. Thus the output of the AE-initialized decoder can be used in different ways, which we discuss briefly in Section A.3 of the Appendix. In the case of BiLN, no additional training is required, which means it synthesizes search directions independently of the adversary's manifold description (i.e., possible extracted knowledge about test samples). This can manifest as a lower overall distortion in the case where only a crude manifold description can be extracted from the test set. Next we describe the exact usage of the mapping functions for each attack scheme.

Sign-OPT & HSJA. In general, for the attacks which rely on the Cheng et al. (2019) formulation, the update in Equation 2 becomes

ĝ = (1/q) Σ_{i=1}^{q} [ (g(θ′ + β u_i) − g(θ′)) / β ] · u_i,

for the reduced-dimension Gaussian vectors {u_i ∈ R^{d′}}_{i=1}^{q} and direction θ′ ∈ R^{d′} with integer d′ < d. The reduced-dimension direction θ′ is initialized randomly as θ′ ∼ N(0, 1) for the untargeted case, or for the targeted case as θ′ = E(x_t), where x_t is a test sample correctly classified as target class t by the victim model. This scheme also applies to HSJA, since HSJA performs a single-point sign estimate. As in the normal variants, ĝ is used to update θ′.

RayS. The intuition behind the RayS attack is to perform a discrete search over at most 2^d directions. Chen & Gu (2020)
also perform a hierarchical search over progressively larger super-pixels of the image, which has the effect of upscaling on-the-fly. The inclusion of RayS in our analysis is important, since it has the unique behavior of performing a discrete search for the decision boundary, rather than an explicit gradient estimate. To achieve an appropriate reduced-dimension version of RayS and test our hypothesis, we modify the calculation of s in Algorithm 3 of Chen & Gu (2020), which either speeds up upscaling by a factor a (i.e., s = s + a), or extends the search through a specific block index by a factor b (increasing the block level at k = 2^s b instead of k = 2^s).

3.4. CAPTURING ZEROTH-ORDER SEARCH DEVIATION

Our analysis follows in the wake of findings presented by Stutz et al. (2019), which show that "regular" gradient-level adversarial examples leave the data manifold, i.e., the sample's distance to the manifold is larger than with an "on-manifold" gradient-level adversarial example. In the score-level and hard-label settings, the manifold can be used to guide the search for the boundary. The benefits of this scenario can be observed in score-level attacks by Tu et al. (2019). Similarly, hard-label attacks can leverage the concept of super-pixels to achieve gains in performance (Chen & Gu, 2020). However, to date the connection between dimension manipulation and manifold traversal has not been investigated. The reduced search fidelity naturally limits the resolution of the search direction, which is the final noise vector applied to the original sample, as evidenced by Equation 3. This introduces our central research question: How does searching over a reduced dimension increase efficiency, if the search resolution is decreased as a side-effect? To approach this question, we observe that certain gradient-level and score-level attacks can leverage a prior manifold description to attack more efficiently (Stutz et al., 2019; Tu et al., 2019). We can rely on a sample's distance to the manifold as a measure of deviation during the search, similar to the work by Stutz et al. (2019). Hereafter, we refer to this distance w.l.o.g. as the manifold distance. Our choice is motivated by two considerations. First, from a data-geometric perspective, manifold distance describes the amount of semantic features preserved during the attack process (Engstrom et al., 2019); likewise, the manifold distance can communicate information about attack behavior better than L_p distortion measurements, which are the common choice in the existing zeroth-order attack literature (Cheng et al., 2020; Chen et al., 2019; Chen & Gu, 2020). Second, the manifold approximation technique by Stutz et al. (2019) is mainly suited to the L_2-norm, whereas hard-label attacks exist for both the L_2 and L_∞-norms. Unfortunately, the real data manifold is difficult to describe. This is an open problem in the study of Generative Adversarial Networks (GANs), as designers must ensure that generator images are on-manifold using a continuous function. This has motivated the recently proposed Fréchet Inception Distance (FID), which acts as a surrogate measure of the manifold distance (Heusel et al., 2018). We can leverage FID by treating the adversarial samples as synthetically generated images. FID is viable as it computes the Fréchet distance of candidate images with respect to images sampled from the true data sub-manifold. Since FID uses an Inception-V3 coding layer to encode images, this distance will correlate with distortion of high-level features, which results from sampling farther from the data manifold. We do not target the Inception-V3 network in any of our experiments, so the FID metric does not rely on any internal aspects of the victim models.
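Concretely, FID reduces to the Fréchet distance between two Gaussians fitted to coding-layer features. A minimal sketch, with random arrays standing in for the Inception-V3 activations (the feature extractor itself is omitted):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets:
    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2).
    feats_*: (n_samples, n_features) coding-layer activations."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    s1h = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1h @ s2 @ s1h)  # symmetric form of (S1 S2)^1/2
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2)
                 - 2.0 * np.trace(covmean))
```

The symmetrized product S1^{1/2} S2 S1^{1/2} is used because it is PSD and shares the trace of (S1 S2)^{1/2}, which keeps the square root numerically well-behaved; identical feature sets give a score of (numerically) zero.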

4.1. METHODOLOGY

Our experimental analysis addresses the following three research questions about zeroth-order attacks: Q1. What is the trade-off between query efficiency and reduced search resolution in the zeroth-order setting, against both natural and "robust" models? Q2. In addition to reducing queries, are there unexpected benefits to performing a query-efficient zeroth-order attack? Q3. Compared to previous results in the score-level setting (Tu et al., 2019), do dimension-reduced hard-label attacks produce a similar reduction in query usage? We answer these questions by comparing three hard-label attacks against their dimension-reduced variants. Some variants are not shown due to incompatibility with the base attack. For example, AE+HSJA is not implemented, as HSJA relies on only a single-point estimate, thereby only allowing it to attack on the manifold directly. This is not practical due to the induced distortion discussed previously in Section 3. RayS can perform a two-point search, but it assumes codependence of input features, which may not hold for the well-defined latent space of an autoencoder. Thus for the AE variant, we rely on Sign-OPT, as it can perform a two-point estimate and does not rely on codependence of features. In the BiLN cases, the implementations follow the discussion in Section 3.3. Experimental Highlights. Our experiments show that query-efficient attacks exhibit unexpected behaviors and benefits, with explanations summarized below: A1. Reduced search resolution increases search efficiency by allowing the discovery of samples not encountered during training. This is especially true for robust models, which we show have a tendency to overfit on the first-order noise characteristics. A2. Surprisingly, query-efficient attacks search closer to the manifold than non-efficient attacks, and thus are more likely to produce on-manifold examples. This occurs because query-efficient search acts as indirect manifold feedback.
By reducing the search fidelity, we are more likely to modify high-level features. This effect is magnified on robust models, which are already known to leak manifold information in gradient-level settings (Engstrom et al., 2019). A3. Dimension-reduced attacks are capable of SotA query-efficiency gains for HSJA, and a two-fold improvement for Sign-OPT. Setup. All attacks run for 25k queries without early stopping. For brevity, we only show results for the untargeted case. The FID score is calculated using the 64-dimensional max-pooling layer of the Inception-V3 deep network for coding (denoted as FID-64 in figures), taken from an open-source implementation. The choice of the 64-dimensional feature layer allows us to calculate full-rank FID without the full 2,048-sample count of the original FID, which is prohibitive given the high time complexity of the robust models chosen for analysis (in particular, certified smoothing on ImageNet). This incurs the cost of losing some high-level features, but due to the position of the chosen coding layer in the network, it is still valuable for direct comparison of manifold distance between attacks. Since the coding layer differs from the original implementation, the magnitudes will differ from those published by Heusel et al. (2018). Image data consists of the CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) classification datasets. This selection allows us to study attack behavior on both small and high-resolution image data. Original samples are chosen from the test set of each dataset using a technique similar to that of Chen et al. (2019). On CIFAR-10, ten random samples are taken from each of ten classes. On ImageNet, ten random classes are chosen with ten random samples taken from each (i.e., 100 total samples on either dataset). We provide further implementation details in Section A.3 of the Appendix.
In addition to natural images, we are interested in the attack behavior for models regularized with some variant of first-order noise. For CIFAR-10, we use the adversarial training technique proposed by Madry et al. (2017). For completeness, we also include ablation results on more recent regularization techniques in Section A.5 of the Appendix. For ImageNet, we compare against the SotA at the time of writing, randomized smoothing proposed by Cohen et al. (2019). We use the pre-trained ResNet-50 weights and implementation provided by Cohen et al., corresponding to smoothing parameters σ = 0.5 and σ = 1.0.

4.2. EXPERIMENTAL DETAILS

On each dataset, we target the L_p-norm under which the robust models were regularized or certified. Likewise, we use the L_∞ versions of each attack for CIFAR-10 and the L_2 versions for ImageNet. CIFAR-10 case study (L_∞). We start by measuring the distortion against the remaining query budget of the adversary in Figure 2a. In general, the normal variants of each attack align with the published results. The BiLN variants of RayS each have minimal effect on overall query efficiency (Insets 2a.i and 2a.iii). This is a result of RayS not relying on explicit gradient estimation. The main improvement is with BiLN+HSJA against the Madry adversarial training model, with average distortion at 4k queries decreasing from 0.09 to 0.07, and the corresponding success rate increasing from 15% to 25%. This improvement aligns with the result of Tu et al. (2019), and is contrary to the minimal effect on the natural model (Inset 2a.ii). AE+Sign-OPT outperforms both regular Sign-OPT and BiLN+Sign-OPT on the robust model, since searching along the manifold can introduce distortions that were not encountered during first-order regularization (Stutz et al., 2019; Jalal et al., 2019). However, the success of AE+Sign-OPT tends to be situational, whereas HSJA and RayS outperform in either scenario. Mainly, the low quality of the adversary's AE does not permit the fine-grained adjustment in latent space that RayS and HSJA can provide in image space. We next focus on Figure 2b, which shows the FID score's trajectory as the search progresses. Every trajectory begins at a value of zero, since the expected score for identical images is zero, and then peaks as the attack initialization is performed. In this scenario, our main observation is the similarity of the FID trajectories of RayS, HSJA, and dimension-reduced Sign-OPT, regardless of the method of model regularization, as shown in Insets 2b.i, 2b.ii, and 2b.iii, respectively.
For example, the magnitudes for AE+Sign-OPT and RayS peak at 0.28 and 0.04, then fall (and stay) at values near 0.004. BiLN+HSJA and BiLN+Sign-OPT both exhibit lower FID scores than their normal variants, by as much as two orders of magnitude in the case of Sign-OPT versus BiLN+Sign-OPT. For a detailed comparison of distortion and FID values for CIFAR-10, we offer Table 1 in Section A.5 of the Appendix. The dimension-reduced variants of RayS show little variation between them (Inset 2b.iv), a side-effect of the adaptive super-pixel search, which can automatically scale the super-pixel size as the search progresses. At a high level, we observe that BiLN+HSJA offers comparable performance to regular RayS, while in the robust case, the BiLN and AE variants tend towards lower FID scores than the regular variants. This can be viewed as the model leaking manifold information through its decisions, as was shown for the first-order gradient by Engstrom et al. (2019). ImageNet case study (L_2). We attack ImageNet in the L_2-norm setting to compare against the certified smoothing technique proposed by Cohen et al. (2019). The label output comes from a smooth classifier, approximated by many rounds of Monte Carlo search, which uses the regular model regularized by Gaussian noise. Thus hard-label attacks are well positioned for attacking the smoothing technique. The distortion results of these attacks are shown in Figure 3a. Dimension reduction has a larger impact when coupled with the large ImageNet resolution. The BiLN+HSJA and BiLN+Sign-OPT attacks benefit the most: at 8k queries, the success rate increases 1.4x and 2.1x for HSJA and Sign-OPT, respectively. As before with CIFAR-10, the RayS dimension reduction is saturated on ImageNet (Insets 3a.i and 3a.ii). Since RayS does not explicitly rely on gradient estimation, it benefits the least from dimension-reduction techniques.
Apart from RayS, the BiLN variants outperform both the AE and regular variants in every scenario, despite the reduced search fidelity. The improvement is largest on the smoothing technique, particularly with HSJA and Sign-OPT (second column of Figure 3b). This is due to 1) the adversary's reduced description of the manifold, from only having access to the test set, and 2) BiLN allowing the attack to search closer to the original sample, since it is a deterministic function independent of the adversary's knowledge. The FID scores in Figure 3b paint a more comprehensive picture. BiLN variants produce adversarial examples closer to the manifold than either the regular or AE variants, highlighted with BiLN+HSJA in Inset 3b.ii. RayS saw no such improvement, as it does not perform an explicit gradient estimate (Inset 3b.i). We interpret this as follows: the BiLN variants of HSJA and Sign-OPT leverage reduced fidelity to increase the probability of finding the decision boundary, and 1) produce a smoother noise distribution, resulting in more spatially correlated distortion, which as a result 2) produces adversarial examples closer to the manifold. Since the manifold description of ImageNet is crude, the BiLN variants can excel, since they search independently of this description. Another key observation is the fluctuation of the FID score towards the end of Sign-OPT and AE+Sign-OPT, which is not present for HSJA or RayS (first column of Figure 3b). Notably, there is no direct signal of manifold distance for the normal or BiLN variants. This indicates that query-efficient attacks do a better job of capturing implicit manifold distance feedback from the model.

5. DISCUSSION

The noisy manifold distance oracle. In Section 4 we observed that query-efficient attacks, i.e., those which leverage the concept of super-pixels to reduce search fidelity, are more likely to produce samples close to the manifold. This generates samples that were unseen by robust models during their first-order adversarial training. However, this takes place without any direct feedback about manifold distance. To approach this, we first consider that the model relies on a sub-sampled manifold of the image space. This manifold can be leaked by the loss landscape of the model, as shown by Engstrom et al. (2019). From an information-theoretic perspective, the zeroth-order adversary observes a noisy gradient, which is leaked as side information by each model decision. Under this explanation, the decision feedback from the model is viewed as a noisy manifold distance (NMD) oracle. The improvement of AE+Sign-OPT on robust CIFAR-10 can be argued as a result of the NMD oracle improving as well. This follows from the data processing inequality (Beaudry & Renner, 2012) applied to the chain M → g → ĝ: the mutual information I(M; ĝ) is bounded above by I(M; g), so when I(M; g) increases, I(M; ĝ) can increase as well, where I is mutual information, M is the manifold, g is the gradient of the loss landscape, and ĝ is the noisy gradient observed by the adversary. In words, the quality of the noisy gradient depends on the quality of the model's loss landscape, which can more closely resemble the manifold under robust regularization. This means a higher-quality loss landscape leads to a higher-quality zeroth-order attack. Qualitative evidence of this effect can be observed in Section A.4 of the Appendix.

"Topology" of hard-label settings. We can view zeroth-order attacks as following a topological hierarchy that is a function of the original data dimension. A very simple version of this idea is illustrated in Figure 1. Each technique illustrated in Figure 1 offers a different traversal distance both along the manifold and away from it.
Efficient attacks, represented by (b), can combine elements of staying near the manifold and traversing it. This is best seen in the results of the BiLN variants and regular RayS. RayS succeeds because it assumes spatial correlation, which can be considered a loose description of the manifold, without making further assumptions about the data. Thus we can expect this behavior from future attacks which do not perform explicit gradient estimation. Similarly, purely traversing close to an ideal manifold description, as in (c), may not be advantageous, because a crude manifold approximation induces its own error. Further, the nearest boundary on the manifold could be far away. Thus we can consider an attack which leverages an ideal "manifold description" (e.g., through an autoencoder), but uses the description as a method for selecting progressively smaller super-pixel groupings. For instance, RayS selects super-pixels as blocks, which are bifurcated until the number of super-pixels equals the original dimension. Notably, RayS displayed little to no benefit from dimension reduction, a byproduct of not relying on explicit gradient estimation. A technique which leverages the manifold description to select super-pixel groups could yield better performance, both in terms of distortion and FID score. To this end, we believe the FID score offers a valuable measure of manifold distance, which can inform the topological behavior and the quality of future hard-label attacks.

6. CONCLUSION

Despite recent progress in zeroth-order attack methods, open questions remain about their precise behavior. We shed light on an unexpected nuance: their ability to produce on-manifold adversarial examples. This occurs despite the absence of manifold distance information, which motivates our proposal of the noisy manifold distance oracle. Future work could create a formal definition of this oracle and attempt to bound the information it reveals. Conversely, with knowledge of the oracle's existence, it is possible to further refine hard-label attacks, so they continue to reveal insights into the weaknesses of learning systems.

A APPENDIX

A.1 MNIST ABLATION

Manifold projection was originally proposed by Samangouei et al. (2018b) as a scheme to defend models in the gradient-level setting, and was partially broken by Athalye et al. (2018) due to the imperfect projection of the defender's decoder. Jalal et al. (2019) later showed it could be circumvented completely, by searching for latent sample pairs which are close to each other in latent space, but far apart on the model's loss landscape. These sample pairs are projected to input space and used for adversarial training, which yields the Robust Manifold defense. In direct comparisons under the first-order setting, Jalal et al. (2019) show that the Robust Manifold scheme defends better than previous baselines, Madry adversarial training (Madry et al., 2017) and TRADES (Zhang & Wang, 2019), which do not leverage the manifold projection. We conduct experiments to directly compare this behavior in the hard-label setting, shown in Figure 4. The hard-label attacks consistently perform better against the Robust Manifold defense, despite this defense outperforming both the Madry and TRADES defenses under gradient-level attacks. This can be considered a result of the Robust Manifold defense overfitting to the manifold projection of the fixed generator (or "spanner," in the vocabulary of Jalal et al.). Further, the dimension-reduced variants yield better exploration, shown by consistently better performance from 0-5000 queries. This supports the results of the main paper, which show that these defenses can overfit to first-order information. The results of the MNIST ablation also support the notion of a dynamic topological hierarchy, which is dependent on the original data dimension d. In the case of HSJA (second row of Figure 4), query-efficient attacks do not benefit the adversary, because MNIST is already sufficiently low-dimensional to yield samples near the manifold with regular hard-label attacks.
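The latent pair search can be sketched as follows; the linear "decoder," the model loss, and the random-search loop are toy stand-ins of our own, not the optimization actually used by Jalal et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): a linear decoder and a scalar model loss.
W = rng.normal(size=(16, 4))            # decoder: latent (4) -> input (16)
decode = lambda z: W @ z
w = rng.normal(size=16)
model_loss = lambda x: float(np.tanh(w @ x) ** 2)   # bounded in [0, 1]

def find_latent_pair(n_trials=2000, eps=0.1):
    """Random search for (z1, z2) that are eps-close in latent space but
    have a large model-loss gap after decoding -- the qualitative
    objective behind the Robust Manifold defense's training pairs."""
    best, best_gap = None, -1.0
    for _ in range(n_trials):
        z1 = rng.normal(size=4)
        z2 = z1 + eps * rng.normal(size=4)   # stay close in latent space
        gap = abs(model_loss(decode(z1)) - model_loss(decode(z2)))
        if gap > best_gap:
            best, best_gap = (z1, z2), gap
    return best, best_gap

(z1, z2), gap = find_latent_pair()
```

In the actual defense, such pairs are decoded to input space and folded into adversarial training.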
We can validate this finding by comparing a notion of manifold distance between the attacks. FID is not suitable for MNIST, as it requires re-training the Inception-V3 network and calibrating it for use as a metric, which is outside the scope of this work. Instead, we project the adversarial samples back to the manifold using the Robust Manifold defender's variational autoencoder (VAE), and measure the pairwise distances between the original and adversarial projections. This distance is formally ||D(E(x_0)) - D(E(x))||_2, for each adversarial sample x and original sample x_0. We use the checkpoint provided by Jalal et al. (2019). Notably, the defender network has the best description of the manifold, since it has access to the full training set. The distribution of pairwise distances for each attack and defense combination is shown in Figure 5. In these plots, we expect the reconstruction distance to be smaller for adversarial samples closer to the manifold, since the VAE captures high-level semantic information of the dataset. Under our hypothesis, successful hard-label attacks should have a smaller distance on natural models than regular white-box attacks; thus we also compare against samples generated with Projected Gradient Descent (PGD), shown in red. The results in Figure 5 support this hypothesis, as both Sign-OPT and HSJA have smaller distances than the PGD samples. The effect is exaggerated for the AE+Sign-OPT variant, due to the adversary's manifold approximation. On robust models (second through last columns), PGD acts as a best case for the adversary, since previous results by Engstrom et al. (2019) establish that gradient-level attacks on robust models leak manifold information. As discussed in the main text, hard-label attacks receive a noisy version of this manifold feedback, which is sufficient to create samples near the manifold.
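A minimal sketch of this distance measure, assuming hypothetical linear stand-ins for the defender's encoder E and decoder D (the paper uses the VAE checkpoint of Jalal et al. (2019)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear stand-ins for the defender VAE's encoder/decoder.
We = rng.normal(size=(8, 64)) / 8.0   # encoder: input (64) -> latent (8)
Wd = rng.normal(size=(64, 8)) / 8.0   # decoder: latent (8) -> input (64)
E = lambda x: We @ x
D = lambda z: Wd @ z

def reconstruction_distance(x0, x_adv):
    """Pairwise manifold-projection distance ||D(E(x0)) - D(E(x_adv))||_2:
    both samples are projected through the autoencoder and compared in
    input space, so off-manifold detail is discarded before measuring."""
    return float(np.linalg.norm(D(E(x0)) - D(E(x_adv))))

x0 = rng.normal(size=64)
x_adv = x0 + 0.05 * rng.normal(size=64)   # a nearby adversarial sample
dist = reconstruction_distance(x0, x_adv)
```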
This is supported by the pairwise distances on robust models, since PGD samples are consistently closer to the regular samples. Due to the dynamics of the topological hierarchy discussed above, dimension reduction is less effective on MNIST; notably, the regular hard-label attack variants (shaded blue) can be closer to the original samples than their dimension-reduced variants. RayS (top row) is consistently farther from the PGD samples, as it does not explicitly rely on gradient estimation like Sign-OPT and HSJA. We further point out that despite the variability of image-space distortion between the regular and BiLN variants in Figure 4, these samples still lie near each other when projected to the manifold in Figure 5. On the Robust Manifold defense, the HSJA and Sign-OPT attacks, regardless of variant, lie closer to the PGD samples. This supports our main result: query-efficient hard-label attacks leverage manifold information to produce samples, particularly against robust models.

A.2 SUCCESS RATE PLOTS

In Figure 6 we provide query vs. success rate to accompany the results in the main text. 

A.3 IMPLEMENTATION DETAILS

We are primarily interested in the effect of reduced search resolution on attack behavior. Thus in this work, given a candidate direction θ and magnitude (or radius) r, the adversarial sample in the AE case is the blending x = (1 - r)x_0 + r·D(E(x_0) + θ). Natural victim architectures consist of a deep convolutional neural network for CIFAR-10 and a ResNet-50 network for ImageNet. The CIFAR-10 network is the same implementation open-sourced by Cheng et al. (2020), while the ResNet-50 network is taken from the PyTorch Torchvision library, including pre-trained weights. For AE attack variants, we implement the same architecture described by Tu et al. (2019); specifically, it uses a fully convolutional network for the encoder and decoder. ImageNet samples are downsized to 128x128 before passing to the encoder, and the output of the decoder is scaled back to 224x224, as described by Tu et al. (2019). Every AE is trained using the held-out test set, as we assume disjoint data between the adversary and victim. To avoid ambiguity, we label each BiLN variant with the spatial dimension after the bilinear transformation, and keep this variable fixed for simplicity.
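The blending step can be sketched as follows, with hypothetical linear stand-ins for the adversary's encoder and decoder:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear stand-ins for the adversary's AE.
We = rng.normal(size=(8, 32)) / 8.0   # encoder: input (32) -> latent (8)
Wd = rng.normal(size=(32, 8)) / 8.0   # decoder: latent (8) -> input (32)
E = lambda x: We @ x
D = lambda z: Wd @ z

def ae_candidate(x0, theta, r):
    """Blend the clean sample with a decoded latent perturbation:
    x = (1 - r) * x0 + r * D(E(x0) + theta). At r = 0 this recovers the
    clean sample; growing r moves along the candidate direction."""
    return (1.0 - r) * x0 + r * D(E(x0) + theta)

x0 = rng.normal(size=32)
theta = rng.normal(size=8)            # candidate direction in latent space
x_r0 = ae_candidate(x0, theta, 0.0)   # r = 0: the clean sample itself
```

Blending, rather than decoding directly, avoids committing to the AE's crude manifold approximation (see the footnotes).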

A.4 VISUAL RESULTS -CIFAR10

We provide visual qualitative results for each attack on CIFAR-10 in Figure 7 . 



Footnotes:
- Experimentally, codependence hurt the AE+RayS variant more than it was practically useful.
- We observed that it is detrimental to set x = D(E(x_0) + rθ) or x = D(rθ) directly. Despite remaining on the data manifold by attacking it directly, the approximation of the data manifold is crude, which results in large distortion (Stutz et al., 2019).
- https://github.com/mseitzer/pytorch-fid
- https://pytorch.org/docs/stable/torchvision/models.html



Figure 1: Our interpretation of zeroth-order attack behavior in the context of boundary tilting (Tanay & Griffin, 2016): a) a zeroth-order attack targeting low-level features, leaving the manifold, b) an efficient zeroth-order attack targeting mostly high-level features, floating along the manifold, and c) a manifold-based zeroth-order attack next to the manifold, but sacrificing similarity.

The attack objective can be written as min_δ ||δ||_p + c·L(x_0 + δ), for the L_p-norm ||·||_p, where L(·) is the loss function corresponding to the goal of the attack and c is a regularization parameter. A popular choice of loss function is the Carlini & Wagner (2016) loss.
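A minimal sketch of the Carlini & Wagner margin loss in a common logit-based, targeted form (`kappa` denotes the confidence margin; the variable names are ours):

```python
import numpy as np

def cw_loss(logits, target, kappa=0.0):
    """Carlini & Wagner (2016)-style margin loss for a targeted attack:
    max(max_{i != t} Z(x)_i - Z(x)_t, -kappa). The loss bottoms out at
    -kappa once the target class t dominates every other logit by at
    least the margin kappa."""
    z = np.asarray(logits, dtype=float)
    other = np.max(np.delete(z, target))   # best non-target logit
    return max(other - z[target], -kappa)

# With kappa = 0 the loss saturates at 0 once the target class wins.
loss_won = cw_loss([0.1, 0.3, 2.0], target=2)
loss_far = cw_loss([2.0, 0.3, 0.1], target=2)
```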

Under review as a conference paper at ICLR

For CIFAR-10, we choose the representative adversarial training

Figure 2: Results across attacks for the CIFAR-10 dataset, corresponding to a) distortion against query usage (dotted red line denotes the value of ε, shaded areas mark standard deviation), and b) FID-64 trajectory against the same query usage.

Figure 3: Results across attacks for the ImageNet dataset, corresponding to a) distortion against query usage (dotted red line denotes the value of ε, shaded areas mark standard deviation), and b) FID-64 trajectory against the same query usage.

Figure 4: Results of image-space distortion for the MNIST ablation on the natural model, the manifold projection defense (Robust Manifold), and non-projection baselines (Madry adversarial training and TRADES). Dotted red line denotes the value of ε; shaded areas mark standard deviation.

Figure 5: Measurement of pairwise reconstruction distance ||D(E(x_0)) - D(E(x))||_2 for the MNIST ablation on the natural model, the manifold projection defense (Robust Manifold), and non-projection baselines (Madry adversarial training and TRADES), using the original VAE checkpoint provided by Jalal et al. (2019) for E and D.

Figure 6: Query vs. success rate plots corresponding to each attack variant in the main text, for a) CIFAR-10 and b) ImageNet.

Figure 7: Visual selection of attack trajectories on CIFAR-10.

Results of distortion for CIFAR-10 with four other choices of robust regularization from the literature. Comparison at certain query intervals between regular and robust models on CIFAR-10.

Comparison at certain query intervals between regular and robust models on ImageNet.

