LURING OF TRANSFERABLE ADVERSARIAL PERTURBATIONS IN THE BLACK-BOX PARADIGM

Abstract

The growing interest in adversarial examples, i.e. maliciously modified examples which fool a classifier, has resulted in many defenses intended to detect them, render them inoffensive, or make the model more robust against them. In this paper, we pave the way towards a new approach to improve the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model and is designed to induce the luring effect, which tricks the adversary into choosing false directions to fool the target model. Training the additional model is achieved thanks to a loss function acting on the order of the logits. Our deception-based method only needs access to the predictions of the target model and does not require a labeled data set. We explain the luring effect thanks to the notion of robust and non-robust useful features and perform experiments on MNIST, SVHN and CIFAR10 to characterize and evaluate this phenomenon. Additionally, we discuss two simple prediction schemes and verify experimentally that our approach can be used as a defense to efficiently thwart an adversary who uses state-of-the-art attacks and is allowed to perform large perturbations.

1. INTRODUCTION

Neural network based systems have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014), i.e. maliciously modified inputs that fool a model at inference time. Many directions have been explored to explain and characterize this phenomenon (Schmidt et al., 2018; Ford et al., 2019; Ilyas et al., 2019; Shafahi et al., 2019), which has become a growing concern and a major brake on the deployment of Machine Learning (ML) models. In response, many defenses have been proposed to protect the integrity of ML systems, predominantly focused on an adversary in the white-box setting (Madry et al., 2018; Zhang et al., 2019; Cohen et al., 2019; Hendrycks et al., 2019; Carmon et al., 2019). In this work, we design an innovative way to limit the transferability of adversarial perturbations towards a model, opening a new direction for robustness in the realistic black-box setting (Papernot et al., 2017). As ML-based online APIs are likely to become increasingly widespread, and given the massive deployment of edge models in a large variety of devices, several instances of a model may be deployed in systems with different environments and security properties. Thus, the black-box paradigm needs to be extensively studied to efficiently protect systems in many critical domains. Considering a target model M that a defender aims to protect against adversarial examples, we propose a method to build a model T, an augmented version of M, such that adversarial examples do not transfer from T to M. Importantly, training T only requires access to M, meaning that no labeled data set is needed, so that our approach can be implemented at a low cost for any already trained model. T is built by augmenting M with an additional component P (with T = M • P) taking the form of a neural network trained with a specific loss function with logit-based constraints.
From the observation that transferability of adversarial perturbations between two models occurs because they rely on similar non-robust features (Ilyas et al., 2019), we design P such that (1) the augmented network exploits useful features of M and (2) non-robust features of T and M are either different or require different perturbations to reach misclassification towards the same class. Our deception-based method is conceptually new as it does not aim at making M rely more on robust features as with proactive schemes (Madry et al., 2018; Zhang et al., 2019), nor does it try to anticipate perturbations which directly target the non-robust features of M as with reactive strategies (Meng & Chen, 2017; Hwang et al., 2019). Our contributions are as follows:
• We present an innovative approach to thwart transferability between two models, which we name the luring effect. This conceptually novel phenomenon opens a new direction for adversarial research.
• We propose an implementation of the luring effect which fits any pre-trained model and does not require a labeled data set. An additional neural network is pasted to the target model and trained with a specific loss function that acts on the order of the logits.
• We experimentally characterize the luring effect, discuss its potential for black-box defense strategies on MNIST, SVHN and CIFAR10, and analyze its scalability on ImageNet (ILSVRC2012). For reproducibility purposes, the code is available at https://anonymous.4open.science/r/3c64e745-927d-4f51-b187-583e64586ff6/.

2.1. NOTATIONS

We consider a classification task where input-label pairs (x, y) ∈ X × Y are sampled from a distribution D. |Y| = C is the cardinality of the label space. A neural network model M_φ : X → Y, with parameters φ, classifies an input x ∈ X to a label M(x) ∈ Y. The pre-softmax output function of M_φ (the logits) is denoted h_M : X → R^C. For the sake of readability, the model M_φ is simply noted M, except when necessary.

2.2. CONTEXT: ADVERSARIAL EXAMPLES IN THE BLACK-BOX SETTING

Black-box settings are realistic use-cases since many models are deployed (in the cloud or embedded in mobile devices) within secure environments and accessible through open or restricted APIs. Contrary to the white-box paradigm where the adversary is allowed to use existing gradient-based attacks (Goodfellow et al., 2015; Carlini & Wagner, 2017; Chen et al., 2018; Dong et al., 2018; Madry et al., 2018; Wang et al., 2019), an attacker in a black-box setting only accesses the output label, confidence scores or logits of the target model. He can still take advantage of gradient-free methods (Uesato et al., 2018; Guo et al., 2019; Su et al., 2019; Brendel et al., 2018; Ilyas et al., 2018; Chen et al., 2020) but, in practice, the number of queries required to mount the attack is prohibitive and may be flagged as suspicious (Chen et al., 2019; Li et al., 2020). In that case, the adversary may take advantage of the transferability property (Papernot et al., 2017) by crafting adversarial examples on a substitute model and then transferring them to the target model.
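As an illustration of the transfer attack described above, the following NumPy sketch (a toy linear softmax substitute of our own, not the paper's experimental setup) crafts an l∞ FGSM example on a substitute model and checks that it respects the perturbation budget:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_on_substitute(W_sub, x, y, eps):
    """Craft an l_inf FGSM example on a linear softmax substitute.

    For logits z = W @ x, the gradient of the cross-entropy loss
    with respect to x is W.T @ (softmax(z) - one_hot(y)).
    """
    p = softmax(W_sub @ x)
    p[y] -= 1.0                        # softmax(z) - one_hot(y)
    grad = W_sub.T @ p                 # d(loss)/dx
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy demonstration: the substitute approximates the target, so a
# perturbation crafted on the substitute is likely to transfer.
rng = np.random.default_rng(0)
W_sub = rng.normal(size=(3, 8))
W_tgt = W_sub + 0.05 * rng.normal(size=(3, 8))
x = rng.random(8)
y = int(np.argmax(W_tgt @ x))          # label the target assigns to x
x_adv = fgsm_on_substitute(W_sub, x, y, eps=0.3)
assert np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9
```

The only query to the target here is the label y; everything else uses the substitute, which is precisely the query-efficiency argument for transfer attacks.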

2.3. OBJECTIVES AND DESIGN

Our objective is to find a novel way to make models more robust against transferable black-box adversarial perturbations without the expensive (and sometimes prohibitive) training cost required by many white-box defense methods. Our main idea is based on classical deception-based approaches for network security (e.g. honeypots) and can be summarized as follows: rather than trying to prevent an attack, let's fool the attacker. Our approach relies on a network P : X → X, pasted to the already trained target network M before the input layer, such that the resulting augmented model answers T(x) = M • P(x) when fed with input x. The additional component P is designed and trained to reach a twofold objective:
• Prediction neutrality: adding P does not alter the decision for a clean example x, i.e. T(x) = M • P(x) = M(x);
• Adversarial luring: for an adversarial example x′ crafted to fool T, M does not output the same label as T (i.e. M • P(x′) ≠ M(x′)) and, in the best case, x′ is inefficient (i.e. M(x′) = y).
To explain the intuition of our method, we follow the feature-based framework proposed in Ilyas et al. (2019), where a feature f is a function from X to R that M has learned to perform its predictions. Considering a binary classification task, f is said to be ρ-useful (ρ > 0) if it satisfies Equation 1a, and is γ-robustly useful if, under the worst perturbation δ chosen in a predefined set of allowed perturbations ∆, f stays γ-useful (Equation 1b). A ρ-useful feature f is said to be non-robust if f is not robust for any γ ≥ 0.

$$\text{(a)}\quad \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[y \cdot f(x)\right] > \rho \qquad\qquad \text{(b)}\quad \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\inf_{\delta\in\Delta} y \cdot f(x+\delta)\right] > \gamma \qquad (1)$$

An adversary who aims at fooling a model will thus perturb the inputs to influence the useful features which are not robust with respect to the perturbation he is allowed to apply. We denote $\mathcal{F}^{*}_{M}$ the set of ρ-useful features learned by M.
We consider a set of allowed perturbations ∆ and γ > 0, and denote $\mathcal{F}^{*,R}_{M}$ and $\mathcal{F}^{*,NR}_{M}$ respectively the sets of γ-robust and non-robust features learned by M (relatively to ∆). An adversary that aims at fooling M • P will alter function compositions of the form f • P with f ∈ $\mathcal{F}^{*}_{M}$. These function compositions are the non-robust useful features of M • P, whose set is denoted $\mathcal{F}^{*,NR}_{M \bullet P}$. Based on the observation that transferability of adversarial perturbations between two models occurs because these models rely on similar non-robust features (Ilyas et al., 2019), we consider f • P ∈ $\mathcal{F}^{*,NR}_{M \bullet P}$ and derive two possibilities which ensure the lowest transferability between M • P and M, depending on the robustness of f. If f is robust, i.e. f ∈ $\mathcal{F}^{*,R}_{M}$, the adversarial perturbation from ∆ is sufficient to flip the augmented feature f • P but is not efficient to directly impact f. This case is the optimal one, since the adversarial example is unsuccessful on the target model (M • P(x′) ≠ y and M(x′) = y). On the other hand, if f ∈ $\mathcal{F}^{*,NR}_{M}$ (i.e. both f and f • P are non-robust), restraining the transferability means that the additional model P impacts the way useful features vary with respect to input alterations, so that the adversarial perturbation leads to two different labels. We encompass these two cases (illustrated in Figure 1) within what we name the luring effect. The adversary is tricked into modifying input values in some way that flips useful and non-robust features of M • P, and these modifications are either without effect on the useful features of M, or flip the non-robust features of M in a different way (and are therefore detectable, as presented in Section 4).
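The usefulness definitions of Equation 1 can be estimated empirically on a finite sample. The following NumPy sketch (hypothetical helper names, labels in {−1, +1}, a finite candidate set standing in for ∆) illustrates how a feature can be ρ-useful and yet lose usefulness under perturbation:

```python
import numpy as np

def usefulness(f, xs, ys):
    """Empirical estimate of E_{(x,y)}[y * f(x)] for labels y in {-1, +1}.
    f is rho-useful on this sample if the returned value exceeds rho."""
    return float(np.mean([y * f(x) for x, y in zip(xs, ys)]))

def robust_usefulness(f, xs, ys, deltas):
    """Empirical estimate of E[inf_{delta in Delta} y * f(x + delta)],
    with the infimum taken over a finite candidate set `deltas`."""
    worst = [min(y * f(x + d) for d in deltas) for x, y in zip(xs, ys)]
    return float(np.mean(worst))

# A feature perfectly correlated with the label is useful but, under
# perturbations comparable to the signal, much less robustly so.
xs = np.array([[1.0], [-1.0], [0.5], [-0.5]])
ys = np.array([1, -1, 1, -1])
f = lambda x: x[0]
deltas = [np.array([-0.6]), np.array([0.6])]
assert usefulness(f, xs, ys) > 0.7          # rho-useful for rho = 0.7
assert robust_usefulness(f, xs, ys, deltas) < usefulness(f, xs, ys)
```

An attack, in this view, is exactly the choice of δ ∈ ∆ that realizes the infimum in the second estimate.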

2.4. TRAINING THE LURING COMPONENT

To reach our two objectives (prediction neutrality and adversarial luring), we propose to train P with constraints based on the order of the predicted labels. For x ∈ X, let α and β be the labels corresponding respectively to the first and second highest confidence scores given to x by M. The training of P is achieved with a new loss function that constrains α to (still) be the first class predicted by M • P (prediction neutrality) and that makes the logits gap between α and β as large as possible for M • P (adversarial luring). To understand the intuition behind this loss function, let us formalize the concepts learned by M and M • P. The prediction given by M corresponds to "class α is predicted, class β is the second possible class". Once P has been trained following this loss function, the prediction given by M • P corresponds to "class α is predicted; the higher the confidence given to class α, the smaller the confidence given to class β". The concepts learned by M and M • P share the same prediction goal, i.e. "class α is predicted", but the relation between class α and class β is forced to be as different as possible. As the learned concepts are essentially different, the useful features learned by M and M • P to reach these concepts are necessarily different, and consequently display different types of sensitivity to the same input pixel modifications. In other words, as the direction of confidence towards classes is forced to be structurally different for M • P and M, we hypothesize that the useful features of the two classifiers should be different and behave differently under adversarial perturbations. The luring loss, designed to induce this behavior, is given in Equation 2 and the complete training procedure is detailed in Algorithm 1. The parameters of P are denoted by θ, x ∈ X is an input and M is the target model. M has already been trained and its parameters are frozen during the process.
h_M(x) and h_{M•P}(x) denote respectively the logits of M and M • P for input x. $h^{M}_{i}(x)$ and $h^{M\bullet P}_{i}(x)$ correspond respectively to the values of h_M(x) and h_{M•P}(x) for class i. The classes a and b correspond to the second maximum value of h_M and h_{M•P} respectively.

$$\mathcal{L}\big(x, M(x)\big) = -\lambda\Big(h^{M\bullet P}_{M(x)}(x) - h^{M\bullet P}_{a}(x)\Big) + \max\Big(0,\; h^{M\bullet P}_{b}(x) - h^{M\bullet P}_{M(x)}(x)\Big) \qquad (2)$$

The first term of Equation 2 optimizes the gap between the logits of M • P corresponding to the first and second biggest unscaled confidence scores (logits) given by M (i.e. M(x) and a). This part formalizes the goal of changing the direction of confidence between M • P and M. The second term is necessary to reach a good classification, since the first part alone does not ensure that $h^{M\bullet P}_{M(x)}(x)$ is the highest logit value (prediction neutrality). The parameter λ > 0, called the luring coefficient, controls the trade-off between ensuring good accuracy and shifting the confidence direction.

Algorithm 1 Training of the luring component P
Require: frozen trained target model M, component P with parameters θ, luring coefficient λ > 0, learning rate η, batch size B
1: for each training epoch do
2:   for each batch (x_1, ..., x_B) do
3:     for i = 1 to B do
4:       compute the logits h_M(x_i) and h_{M•P}(x_i)
5:       (a, b) ← (class of the 2nd max value of h_M(x_i), class of the 2nd max value of h_{M•P}(x_i))
6:       L(x_i, M(x_i)) ← −λ(h^{M•P}_{M(x_i)}(x_i) − h^{M•P}_a(x_i)) + max(0, h^{M•P}_b(x_i) − h^{M•P}_{M(x_i)}(x_i))
7:     end for
8:     θ ← θ − η Σ_{i=1}^{B} ∇_θ L(x_i, M(x_i)) / B
9:   end for
10: end for
11: return P
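For a single example, the luring loss of Equation 2 reduces to a few array operations on the two logit vectors. The following NumPy sketch is our own illustrative code, not the released implementation:

```python
import numpy as np

def luring_loss(h_m, h_mp, lam):
    """Luring loss of Equation 2 for one example (NumPy sketch).

    h_m  : logits of the frozen target model M for input x
    h_mp : logits of the augmented model M . P for the same input
    lam  : luring coefficient (trade-off accuracy vs. confidence shift)
    """
    m = int(np.argmax(h_m))                   # class predicted by M
    a = int(np.argsort(h_m)[-2])              # runner-up class of M
    b = int(np.argsort(h_mp)[-2])             # runner-up class of M . P
    gap = h_mp[m] - h_mp[a]                   # confidence-direction term
    neutrality = max(0.0, h_mp[b] - h_mp[m])  # keeps M(x) on top of M . P
    return -lam * gap + neutrality

# When M . P already predicts M's class with a large margin over a,
# the loss is strongly negative (the desired regime).
h_m  = np.array([3.0, 1.0, 0.5])   # M predicts class 0, runner-up class 1
h_mp = np.array([5.0, 0.2, 1.0])   # M . P predicts class 0, runner-up class 2
loss = luring_loss(h_m, h_mp, lam=1.0)   # -(5.0 - 0.2) + 0 = -4.8
```

Note that only the hinge term can push the loss up; once prediction neutrality holds, training is driven entirely by widening the h_mp[m] − h_mp[a] gap, which is the "confidence direction" the text describes.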

3.1. OBJECTIVE

Our first experiments are dedicated to the characterization of the luring effect by (1) evaluating our objectives in terms of transferability and (2) isolating it from other factors that may influence transferability. For that purpose, we compare our approach (Luring) to the following approaches, which differ from ours in their training procedures and loss functions:
• Stack model: M • P is retrained as a whole with the cross-entropy loss. Stack serves as a first baseline to measure the transferability between the two architectures of M • P and M.
• Auto model: P is an auto-encoder trained separately with the binary cross-entropy loss. Auto serves as a second baseline of a component resulting in a neutral mapping from R^d to R^d.
• C_E model: P is trained with the cross-entropy loss between the confidence score vectors M • P(x) and M(x) in order to mimic the decision of the target model M. This model serves as a comparison between our loss and a loss function which does not aim at maximizing the gap between the confidence scores.
We perform experiments on MNIST (Lecun et al., 1998), SVHN (Netzer et al., 2011) and CIFAR10 (Krizhevsky, 2009). For MNIST, M has the same architecture as in Madry et al. (2018). For SVHN and CIFAR10, we follow an architecture inspired by VGG (Simonyan & Zisserman, 2015). Architectures and training setups for M and P are detailed in Appendices A and B. Table 8 in Appendix C gathers the test set accuracy and agreement rate (on the ground-truth label) between each augmented model and M. We observe that our approach has a limited impact on the test accuracy, with a relative decrease of 1.71%, 4.26% and 4.48% for MNIST, SVHN and CIFAR10 respectively.

3.2. ATTACK SETUP AND METRICS

For the characterization of the luring effect, we attack the model M • P of the four approaches and transfer to M only the adversarial examples that are successful for these four models. We define the disagreement rate, noted DR(X′), as the rate of successful adversarial examples crafted on M • P for which M and M • P do not agree. To measure the best case, where the luring effect leads to unsuccessful adversarial examples when transferred to M, we note IAR(X′) the inefficient adversarial examples rate, i.e. the proportion of adversarial examples that are successful on M • P but not on M. For both metrics (see Equation 3), the higher the value, the better the luring effect at limiting transferable attacks on the target model M.

$$DR(\mathcal{X}') = \frac{\sum_{x'\in\mathcal{X}'} \mathbb{1}_{M\bullet P(x')\neq y,\; M\bullet P(x')\neq M(x')}}{\sum_{x'\in\mathcal{X}'} \mathbb{1}_{M\bullet P(x')\neq y}} \qquad IAR(\mathcal{X}') = \frac{\sum_{x'\in\mathcal{X}'} \mathbb{1}_{M\bullet P(x')\neq y,\; M(x')=y}}{\sum_{x'\in\mathcal{X}'} \mathbb{1}_{M\bullet P(x')\neq y}} \qquad (3)$$

We use the gradient-based attacks FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), and MIM (Dong et al., 2018) in its l∞ (MIM) and l2 (MIML2) versions. We use three l∞ perturbation budgets (ε values): 0.2, 0.3 and 0.4 for MNIST, 0.03, 0.06 and 0.08 for SVHN, and 0.02, 0.03 and 0.04 for CIFAR10. The parameters used to run these attacks are detailed in Appendix D.1.
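The DR and IAR metrics of Equation 3 can be computed directly from three label arrays. A minimal NumPy sketch (hypothetical variable names, not the evaluation code):

```python
import numpy as np

def dr_iar(pred_mp, pred_m, y):
    """DR and IAR of Equation 3, restricted to adversarial examples
    that succeed against M . P (pred_mp != y).

    pred_mp, pred_m, y : integer label arrays over the adversarial set.
    """
    success = pred_mp != y                         # successful on M . P
    n = success.sum()
    dr = np.sum(success & (pred_mp != pred_m)) / n   # M and M . P disagree
    iar = np.sum(success & (pred_m == y)) / n        # M still correct
    return float(dr), float(iar)

# Toy run: 3 of 4 examples fool M . P; among those, 2 disagree with M
# and 2 are still correctly classified by M, so DR = IAR = 2/3.
pred_mp = np.array([1, 2, 0, 3])
pred_m  = np.array([0, 2, 0, 0])
y       = np.array([0, 0, 0, 0])
dr, iar = dr_iar(pred_mp, pred_m, y)
```

Since IAR requires M to recover the true label while DR only requires disagreement, IAR ≤ DR always holds on the same set.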

3.3. RESULTS AND COMPLEMENTARY ANALYSIS

We report results for the DR and IAR in Figure 2. The highest performances, reached with our loss (for every ε), show that our training method is efficient at inducing the luring effect. More precisely, the fact that both metrics decrease much more slowly as ε increases, compared to the other architectures, brings additional confirmation that the non-robust features of M and M • P tend to behave more differently than with the three other considered approaches. As a complementary analysis, we investigate the magnitude of the adversarial perturbations. Results show that our approach leads to l∞ and l2 distortion scales similar to those of the other methods. This indicates that our approach truly impacts the underlying useful features and does not merely change the distortion scale needed to fool both M and M • P. The complete analysis is presented in Appendix D.2. For MNIST, we notice that the l0 distortion (i.e. the number of pixels impacted by the adversarial attack) is significantly higher with our method, as presented in Figure 3 (left). This points out that P leverages more distinct useful features, which are easily identifiable for MNIST because of the basic structure of the images. This effect can also be demonstrated with saliency maps (i.e. by analyzing ∇_x P(x)), as illustrated in Figure 3 (right): for the Luring model, P is sensitive to many additional useless background pixels compared to M. Note that for Auto, modifying P's mapping consists almost exclusively of modifying pixels correctly correlated with the true label. For more complex inputs from SVHN or CIFAR10, which lack the uniform background of MNIST, we observe that the l0 distortion is approximately the same for all the architectures (Figure 3, middle). However, for these two data sets, we analyze the influence of P's mapping through the logits variations. The detailed analysis is presented in Appendix D.3.
We find that logits between M and M • P vary more differently with respect to input modifications when P is trained with our approach, which confirms that our luring loss enables to reach Adversarial Luring.

4. USING THE LURING EFFECT AS A DEFENSE

By fooling an adversary on the way to target a black-box model, the phenomenon characterized in the previous section can be seen as a defense mechanism on its own or as a complement to state-of-the-art approaches. In this section we discuss how to use our approach as a protection.

4.1. THREAT MODEL

Attacker. Our work sets in the black-box paradigm, i.e. we assume that the adversary has access to the inner parameters and architecture of neither the target model M nor the augmented model M • P. This setting is classical when dealing with cloud-based APIs or edge neural networks in secure embedded systems, where users only have access to input/output information with different querying abilities and output precision (i.e. scores or labels). More precisely:
• The adversary's goal is to craft (untargeted) adversarial examples on the model he has access to, i.e. the augmented model T = M • P, and rely on transferability in order to fool M.
• The adversary's knowledge corresponds to a black-box access to an ML system S_A containing the protected model T = M • P, while M stays completely hidden. He can query T (without any limit on the number of queries) and we assume that he is able to get the logits outputs. Moreover, for an even stricter evaluation, we also consider stronger adversaries allowed to use SOTA transferability-designed gradient-based methods to attack T (see 4.2).
• Classically, the adversary's capability is an upper bound on the perturbation ‖x′ − x‖∞.
Defender. For each query, the defender has access to two outputs: M(x) and M • P(x). Therefore, two different inference schemes (illustrated in Appendix E) are viable according to the goals and characteristics of the system to defend, noted S_D:
• (detection scenario) If S_D and S_A are the same, the goal of the defender is to take advantage of the luring effect to detect an adversarial example by comparing the two inference outputs M(x) and M • P(x). An appropriate metric is the rate of adversarial examples which are either detected or well-predicted by M, expressed in Equation 4 and noted DAC for Detection Adversarial Accuracy.
$$DAC = 1 - \frac{1}{|\mathcal{X}'|}\sum_{x'\in\mathcal{X}'} \mathbb{1}_{M\bullet P(x')=M(x'),\; M(x')\neq y} \qquad (4)$$

• (transferability scenario) If S_D is a secure black-box system with limited access that only contains the target model M, the goal of the defender is to thwart attacks crafted from S_A. For this specific scenario, the defender may only rely on the fact that the luring effect likely leads to unsuccessful adversarial examples (M(x′) = y). The appropriate metric is the classical adversarial accuracy, i.e. the standard accuracy measured on the adversarial set X′, noted AC_{M•P} and AC_M respectively for M • P and M.
Related real-world transferability scenarios may be linked to the increasingly widespread deployment of models in a large variety of devices (edge AI) or services (cloud AI) as integral parts of complex systems with different security requirements. For example, this is classically the case in the IoT domain, where a model may be part of a critical system and, simultaneously, be deployed and used in more mainstream connected objects with looser security requirements. The model M is deployed in the critical system with strong access security, not accessible to an attacker, while the augmented protected model is deployed in the other systems, so that attacks that may be crafted more easily will not transfer to M.
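The DAC metric of Equation 4 penalizes only the adversarial examples that both evade detection (the two inference outputs agree) and fool M. A minimal NumPy sketch (illustrative, not the evaluation code):

```python
import numpy as np

def dac(pred_mp, pred_m, y):
    """Detection Adversarial Accuracy (Equation 4): the fraction of
    adversarial examples that are either detected (the two inference
    outputs disagree) or still correctly classified by M."""
    undetected_and_wrong = (pred_mp == pred_m) & (pred_m != y)
    return 1.0 - float(np.mean(undetected_and_wrong))

# Only the first example evades both checks (the models agree on a
# wrong label), so DAC = 1 - 1/3.
pred_mp = np.array([2, 2, 0])
pred_m  = np.array([2, 1, 0])
y       = np.array([0, 0, 0])
score = dac(pred_mp, pred_m, y)
```

Note that DAC is computed over the whole adversarial set, unlike DR and IAR which condition on success against M • P.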

4.2. ATTACKS

In order to evaluate our defense with respect to the threat model, we attack M • P with strong gradient-free attacks:
• The SPSA attack (Uesato et al., 2018) ensures that we consider the strongest adversary in our black-box setting, as it allows the adversary to get the logits of M • P.
• Coherently with an adversary that has no querying limitation, ECO (Moon et al., 2019) is a strong score-based, gradient-estimation-free attack.
To perform an even stricter evaluation, and to anticipate future gradient-free attacks, we report the best results obtained with the state-of-the-art transferability-designed gradient-based attacks MIM, DIM, MIM-TI and DIM-TI (Dong et al., 2018; Xie et al., 2018; Dong et al., 2019), under the name MIM-W. The parameters used to run these attacks are presented in Appendix F.

4.3. RESULTS

The results are presented in Table 1 for SVHN and CIFAR10 (results for MNIST are close to those for SVHN and are presented in Appendix G). For CIFAR10, since none of the autoencoders we tried reaches an acceptable test set accuracy for the Auto approach, we do not consider it. First, we notice that AC_{M•P} is almost always equal to 0 with gradient-based iterative attacks. This is an indication that P does not induce a form of gradient masking (Athalye et al., 2018). Remarkably, on SVHN, for ε = 0.08 (the largest perturbation) and for the worst-case attacks, adversarial examples tuned to achieve the best transferability only reduce AC_M to 0.48 against our approach, compared to almost 0 for the other architectures. The robustness benefits are more observable on SVHN and MNIST, but the results on CIFAR10 are particularly promising in the scope of a defense scheme that only requires a pre-trained model. Indeed, for the common l∞ perturbation value of 0.03, the worst DAC values for the C_E and Luring approaches are respectively 0.19 (against the DIM attack) and 0.39 (against the DIM-TI attack).

4.4. COMPATIBILITY WITH IMAGENET AND ADVERSARIAL TRAINING

We scale our approach to ImageNet (ILSVRC2012). For our experiments, the target model M consists of a MobileNetV2 model (Sandler et al., 2018), reaching 71.3% accuracy on the validation set. For a common l∞ perturbation budget of 4/255, the smallest AC_M and DAC observed against the strong MIM-W attack with our approach equal 0.4 and 0.55, while they equal 0.23 and 0.35 with the C_E approach. Following the results previously observed on the benchmarks used for characterization, these results validate the scalability of our approach to large-scale data sets. Details on the ImageNet experiments and results are presented in Appendix H.1.

Even if very few efforts tackle the issue of thwarting adversarial perturbations transferred from a source model to a target model (transfer black-box attacks), there exist numerous detection and defense schemes intended to protect a target model in a white-box or gray-box setting against adversarial perturbations. As the luring effect is conceptually new, it can be combined with any of these existing methods to provide even more protection. Indeed, our deception-based approach acts on the way M • P performs inference relative to M. As the effort is focused on the design of P, another defense method, this time focusing on M and designed to protect M, can be combined with our scheme. We choose here to consider the combination with adversarial training (Madry et al., 2018), a state-of-the-art approach for robustness in the white-box setting: the model M is trained with adversarial training. Interestingly, we note that the joint use of these defenses clearly improves the detection performance, with DAC values superior to 0.8 for the three data sets, as well as a strong improvement of the AC_M metric for MNIST (0.97) and CIFAR10 (0.85). The detailed experiments are presented in Appendix H.2.

5. DISCUSSION AND RELATED WORK

A good practice in the field of adversarial robustness is to consider an adaptive adversary (Carlini et al., 2019): an attacker using some knowledge of the defense method to bypass it more efficiently. In our case, the only information that could be given to an adversary without breaking the black-box threat model is the supplementary knowledge that the augmented model has been trained with the luring loss. Indeed, M has to stay hidden, otherwise the adversary could use a decision-based attack to thwart the defense (Brendel et al., 2018; Cheng et al., 2020). Moreover, with a more permissive access to T, for example a white-box access, the adversary could simply extract M from the augmented model and then defeat the defense. The additional information of the luring loss would not bring any advantage to the adversary in crafting adversarial perturbations on T which transfer to M: it would require the attacker to invert the effect of the luring loss on the prediction output vector of T, which appears to be a prohibitive effort. Therefore, respecting the threat model, an adaptive adversary would not be more efficient than the adversary considered throughout this work. Our characterization and first results pave the way for a possible extension of the luring effect to more permissive threat models. For now, in a white-box setting, the defense scheme would be defeated since the adversary could use a gradient-based attack to fool both M and M • P. A better solution to exploit the luring effect in a gray-box setting would be to cause the luring effect between two models M and M′ with different architectures. The model M′, trained with the luring loss, would be publicly released. The adversary would then be able to mount gradient-based attacks on M′ without being able to access M. The possible efficiency of such future work is backed by the results on the MIM-W attacks, which correspond to T being attacked by gradient-based attacks.
Defenses in the black-box context are weakly covered in the literature compared to the numerous approaches focused on white-box attacks. In particular, very few approaches deal with query-based black-box attacks, such as the recent BlackLight (Li et al., 2020). To our knowledge, our work is the first to exploit the idea of luring an adversary who takes advantage of the transferability property. However, a honeypot-based approach has recently been proposed (Shan et al., 2019) with a trapdoored model to detect adversarial examples. Even if its threat model (white-box) is different from ours, this approach is conceptually close to our deception approach, and it is interestingly mentioned that transferability between the target and the trapdoor-protected model is very poor. By fitting the trapdoor approach into our black-box threat model, we highlight complementary performances in terms of the AC_M and DAC metrics, meaning that an overall deception-based strategy is a promising direction for future work. The detailed analysis is presented in Appendix I.

6. CONCLUSION

We propose a conceptually innovative approach to improve the robustness of a model against transfer black-box adversarial perturbations, which relies on a deception strategy. Inspired by the notion of robust and non-robust features, we derive and characterize the luring effect, which is implemented via a decoy network built upon the target model and a loss designed to fool the adversary into targeting different non-robust features than those of the target model. Importantly, this approach only relies on the logits of the target model, does not require a labeled data set, and therefore can be applied to any pre-trained model. We show that our approach can be used as a defense, and that a defender may exploit two prediction schemes to detect adversarial examples or enhance adversarial robustness. Experiments on MNIST, SVHN, CIFAR10 and ImageNet demonstrate that exploiting the luring effect makes it possible to successfully thwart an adversary using state-of-the-art optimized attacks, even with large adversarial perturbations.

C TEST SET ACCURACY AND AGREEMENT RATES

For the luring parameter, we set λ = 1 for MNIST and SVHN, and λ = 0.15 for CIFAR10. For CIFAR10, since none of the autoencoders we tried reaches a correct test set accuracy, we exclude the Auto model.

D.1 ATTACK PARAMETERS

For MIML2, we report results when adversarial examples are clipped to respect the threat model with regard to ε. An illustration of a clean image and its adversarial counterpart for the maximum perturbation allowed is presented in Figure 4. We note that the ground-truth label is still clearly recognizable. For PGD, MIM and MIML2, the number of iterations is set to 1000, the step size to 0.01 and µ to 1.0 (MIM and MIML2). For MIML2, the l2 bound is set to 30 on MNIST and 2 on SVHN and CIFAR10.

D.2 ANALYSIS OF THE l2 AND l∞ DISTORTIONS

We investigate the impact of our additional component on the distance between a clean and an adversarial example, compared to the other approaches. We measure the mean l2 norm of the adversarial perturbations needed to find an adversarial example, and verify that our approach is globally at the same level as the other approaches. For that purpose, an appropriate and recommended attack is the Carlini & Wagner l2 attack (hereafter CWL2) (Carlini & Wagner, 2017), which searches for the closest adversarial examples in terms of the l2 distortion. We run CWL2 with the following parameters: 10 binary search steps, a learning rate of 0.1, an initial constant of 0.5 and 500 iterations for MNIST, SVHN and CIFAR10. These parameters have been chosen such that increasing the number of iterations does not lower the final l2 distortion. For each data set, we run the attack on 1,000 test set examples correctly classified, for the four approaches, by both M and M • P. The average l2 and l∞ distortions are reported in Table 9 and show that there is no significant difference between our approach and the other approaches.
This is an additional observation that our luring loss allows training an augmented model which causes the luring effect by predominantly targeting different useful non-robust features, rather than artificially modifying the scale of the adversarial distortion needed to cause misclassification.
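The distortion statistics compared above reduce to per-example norms of the perturbation, averaged over the batch. A minimal NumPy sketch (hypothetical helper name):

```python
import numpy as np

def distortion_stats(x_clean, x_adv):
    """Mean l2 and l_inf distortion over a batch of clean/adversarial pairs.
    Inputs are arrays of shape (batch, ...); images are flattened per example."""
    diff = (x_adv - x_clean).reshape(len(x_clean), -1)
    l2 = float(np.linalg.norm(diff, axis=1).mean())
    linf = float(np.abs(diff).max(axis=1).mean())
    return l2, linf

# Uniform 0.1 perturbation on 4 pixels: l2 = sqrt(4 * 0.01) = 0.2, l_inf = 0.1.
l2, linf = distortion_stats(np.zeros((2, 4)), 0.1 * np.ones((2, 4)))
```

Comparable values of these two statistics across approaches are what rules out the "larger distortion" explanation of the luring effect.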

D.3 LOGITS ANALYSIS

As a complementary analysis of the impact of the mapping induced by the additional component P, we analyze how P acts on the variation of the logits with respect to the input features. For that purpose, we first pick 1000 test set examples correctly classified by all the models (the base model M and the augmented models of the four approaches: Auto, Stack, C_E and Luring). Then, for each augmented model and each class c, we compute the number Ω_c of pixels x_d that have an opposite effect on the predictions of M and M ∘ P (see Equation 5):

Ω_c = |{ d : (∂h^M_c/∂x_d) · (∂h^{M∘P}_c/∂x_d) < 0 }|

Next, we look at the proportion of these examples for which Ω_c is higher for the Luring approach than for the other approaches. Results for each class are presented in Table 10. For both SVHN and CIFAR10, these proportions are strictly higher than 80%, whatever the class. This is a supplementary indication that, with respect to the allowed adversarial perturbation altering x, our approach is more effective than the other approaches (which also perform a mapping within X) at flipping useful non-robust features of M ∘ P and M towards different labels.
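The quantity Ω_c can be sketched with finite-difference gradients; `logit_m` and `logit_mp` are hypothetical callables standing in for the logit functions of M and M ∘ P (this is an illustration of the definition, not the paper's implementation, which would use backpropagation):

```python
def omega_c(logit_m, logit_mp, x, c, h=1e-4):
    """Count input dimensions d on which the class-c logits of M and of
    M∘P vary in opposite directions (finite-difference sketch of Ω_c).
    `logit_m` and `logit_mp` map an input list to a list of logits."""
    count = 0
    for d in range(len(x)):
        xp = list(x)
        xp[d] += h
        g_m = (logit_m(xp)[c] - logit_m(x)[c]) / h     # ∂h^M_c / ∂x_d
        g_mp = (logit_mp(xp)[c] - logit_mp(x)[c]) / h  # ∂h^{M∘P}_c / ∂x_d
        if g_m * g_mp < 0:   # opposite effect on the class-c logit
            count += 1
    return count
```

For two toy linear "models" whose class-0 logits disagree only in the first input dimension, `omega_c` returns 1.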

F.1 GRADIENT-FREE ATTACKS

For the SPSA attack, the learning rate is set to 0.1, the number of iterations to 100 and the batch size to 128. For the ECO attack, the number of queries is set to 20,000 for the three data sets. We did not perform early stopping, as it results in less transferable adversarial examples. The block size is set to 2, 8 and 4 for MNIST, SVHN and CIFAR10 respectively.
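As a reminder of how the gradient-free estimator behind the SPSA attack works, here is a minimal sketch in plain Python (Rademacher perturbations averaged over a batch; the actual attack wraps this estimate in an iterative sign-descent loop, which is not shown):

```python
import random

def spsa_gradient(f, x, sigma=0.01, batch=128, seed=0):
    """One SPSA estimate of the gradient of a scalar loss f at input x.
    Each sample perturbs all coordinates at once with a random ±1
    (Rademacher) vector and uses a symmetric finite difference."""
    rng = random.Random(seed)
    n = len(x)
    grad = [0.0] * n
    for _ in range(batch):
        delta = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        xp = [xi + sigma * di for xi, di in zip(x, delta)]
        xm = [xi - sigma * di for xi, di in zip(x, delta)]
        df = (f(xp) - f(xm)) / (2.0 * sigma)
        for i in range(n):
            grad[i] += df * delta[i]   # delta_i ∈ {±1}, so 1/delta_i == delta_i
    return [g / batch for g in grad]
```

On a linear toy loss the estimate concentrates around the true gradient as the batch size grows, which is why the attack uses a fairly large batch (128 here).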

F.2 GRADIENT-BASED ATTACKS

For DIM and DIM-TI, the optimal p value was searched in {0.1, 0.2, ..., 1.0}. We obtained optimal values of 1.0, 1.0 and 0.6 on MNIST, SVHN and CIFAR10 respectively. For MIM and its variants, the number of iterations is set to 1000, 500 and 100 on MNIST, SVHN and CIFAR10 respectively, and µ = 1.0. The optimal kernel size required by MIM-TI and DIM-TI was searched in {(3,3), (5,5), (10,10), (15,15)}. For MIM-TI and DIM-TI, the kernel sizes resulting in the lowest AC_M and DAC values reported in Section 4 are presented in Tables 11, 12 and 13 for MNIST, SVHN and CIFAR10 respectively.

The ImageNet data set (Deng et al., 2009) is composed of color images, with 1.2 million samples in the training set and 50,000 in the validation set. For our experiments, the target model M is a MobileNetV2 model (Sandler et al., 2018), taking inputs of size 224×224×3 scaled between -1 and 1. The architecture of the defense component is described in Table 15. We choose to present our approach alongside the C_E approach, as the latter represents well the purpose of the method presented in this paper: the C_E approach only requires training the additional component P, which fits the scope of a defense that can be implemented on top of any already trained model M. For both the Luring and C_E approaches, the additional component P is trained for 100 epochs with a batch size of 256, using a momentum of 0.9 and a learning rate starting at 0.01 and decreasing to 0.001 after 80 epochs. The λ parameter of the loss is set to 1.0. The accuracy of the augmented model, as well as the agreement rate (between the model M and the augmented model M ∘ P), are presented in Table 16.

We present the performance of the luring approach when the target model M is trained with adversarial training (Madry et al., 2018), a general state-of-the-art method for adversarial robustness. The l∞ bound constraint for PGD is set to 0.3 for MNIST and 0.03 for both SVHN and CIFAR10.
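For reference, a single l∞-bounded PGD step, the building block of the adversarial training discussed above, can be sketched as follows (plain Python on flat pixel lists; `x0` is the clean input, and the step size and bound are the kind of values quoted in the text):

```python
def pgd_step(x, grad, x0, alpha=0.01, eps=0.3, lo=0.0, hi=1.0):
    """One l_inf PGD step: move along the sign of the loss gradient,
    project back into the eps-ball around the clean input x0, then
    clip to the valid pixel range [lo, hi]."""
    sign = lambda v: (v > 0) - (v < 0)
    out = []
    for xi, gi, oi in zip(x, grad, x0):
        v = xi + alpha * sign(gi)
        v = min(max(v, oi - eps), oi + eps)   # project into the l_inf ball
        out.append(min(max(v, lo), hi))       # clip to valid pixel range
    return out
```

Running this step for the stated number of iterations (e.g. 40 steps on MNIST during training) yields the PGD adversarial example used in the inner maximization.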
[Table: kernel sizes selected for the MIM-TI and DIM-TI attacks across the Stack, Auto, C_E and Luring approaches; 3×3 in all cases, except one 10×10 entry for C_E under MIM-TI.]

G RESULTS WITH ADVERSARIAL TRAINING

During training, PGD is performed for 40 steps with a step size of 0.1 for MNIST, and for 10 steps with a step size of 0.008 for both SVHN and CIFAR10. We report in Table 18 the accuracy and agreement (augmented and target models agree on the ground-truth label) between the target model M and its augmented version T, whether M is trained classically or with adversarial training. In Table 19 we report the AC_M and DAC values corresponding to the same threat model as the one used throughout this paper, against the strong MIM-W attack. Moreover, we add the accuracy of the model M in a white-box setting against the PGD attack with 40 iterations and a step size of 0.01, denoted AC_M,wb. Interestingly, we observe a productive interaction between the two defenses, with DAC values that have greatly increased for the three data sets, as well as the AC_M values for MNIST and CIFAR10.



Here, "best" is from the adversary's point of view: parameters are tuned relative to the adversarial goal of lowering our defense efficiency. Thus, the lowest performances sometimes do not correspond to the lowest accuracy of M ∘ P on the adversarial examples.

https://github.com/Shawn-Shan/dnnbackdoor



Figure 1: Luring effect. (left) x fools M ∘ P by flipping a non-robust feature f ∘ P (toward the green class). (right) However, f can be a robust feature, in which case M is not fooled (still in the blue class), or a non-robust feature flipped differently from f ∘ P (towards the orange class).

Training of the luring component
Input: trained model M, training steps K, learning rate η, batch size B, luring coefficient λ
Output: luring component P, with parameters θ
1: h^{M∘P}_c(x) denotes the logits of M ∘ P for input x and class c
2: Randomly initialize the parameters θ of P
3: for step = 1 ... K do
4:   {x_1, x_2, ..., x_B}: batch of training set examples
5:   for i = 1 ... B do
6:
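The training loop above can be sketched as a runnable toy: M is a frozen toy linear model, P a two-parameter affine component, and the loss is a hypothetical squared-logit-disagreement stand-in (the paper's actual luring loss constrains the logit order and is not reproduced here); gradients are estimated by finite differences instead of backpropagation:

```python
import random

def m_logits(x):
    # Frozen target model M (toy linear classifier with two logits)
    return [x[0] + x[1], x[0] - x[1]]

def luring_loss(theta, x):
    # Toy component P: elementwise affine map with parameters θ = (w, b).
    px = [theta[0] * v + theta[1] for v in x]
    zm, zp = m_logits(x), m_logits(px)
    # HYPOTHETICAL stand-in loss: squared logit disagreement between
    # M and M∘P (the paper's luring loss instead acts on logit order).
    return sum((a - b) ** 2 for a, b in zip(zm, zp))

def train_p(K=200, B=8, eta=0.05, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1), rng.uniform(-1, 1)]   # step 2: random init
    for _ in range(K):                                 # step 3: K steps
        batch = [[rng.uniform(0, 1), rng.uniform(0, 1)]
                 for _ in range(B)]                    # step 4: batch
        h = 1e-5
        grad = []
        for j in range(len(theta)):                    # finite-difference
            tp = list(theta); tp[j] += h               # gradient estimate
            grad.append(sum(luring_loss(tp, x) - luring_loss(theta, x)
                            for x in batch) / (h * B))
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta
```

With this stand-in loss, gradient descent drives P toward the identity map (w ≈ 1, b ≈ 0), since M is frozen throughout, which mirrors the algorithm's structure: only θ is updated.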

Figure 2: Disagreement Rate (solid line) and Inefficient Adversarial examples Rate (dashed line) for different attacks.

Figure 3: l 0 adversarial distortion for MNIST (left) and CIFAR10 (middle). Saliency maps for MNIST (right): (top) clean image and gradient of the cross-entropy loss with respect to input; (bottom) mapping gradients ∇ x P (x) for 3 augmented models.

Figure 4: (top) Clean image, (bottom) adversarial example for the maximum perturbation allowed (left to right: ε = 0.4, 0.08, 0.04).

Figure 5: Illustration of the black-box threat model with two scenarios. (top) In the inner transferability scenario the system to attack and to defend is the same. Because the luring effect induces different behavior when facing adversarial perturbation, the defender is able to detect adversarial examples by comparing M (x) and T (x). (bottom) In the distant transferability scenario, the system to defend may suffer from transferable adversarial examples but the defender takes advantage of the weak transferability between T and M .
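In the inner transferability scenario of Figure 5, the detection scheme reduces to comparing the two predictions; a minimal sketch, with logits given as plain Python lists:

```python
def predict_with_detection(logits_m, logits_t):
    """Inner-transferability scenario: compare the predictions of the
    target model M and the augmented model T = M∘P; reject the input
    as a suspected adversarial example when they disagree, otherwise
    return the agreed-upon label."""
    label_m = max(range(len(logits_m)), key=logits_m.__getitem__)
    label_t = max(range(len(logits_t)), key=logits_t.__getitem__)
    return label_m if label_m == label_t else None
```

Because the luring effect makes adversarial perturbations flip M and T toward different labels, disagreement between the two predictions is a usable rejection signal.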

Table 12: SVHN. Kernel size for the MIM-TI and DIM-TI attacks.



Defense component architecture for CIFAR10.

• Stack — Epochs: 200. Batch size: 32. Optimizer: Adam with learning rate starting at 0.1, decreasing to 0.01 and 0.001 after 80 and 120 epochs respectively.
• C_E — Epochs: 216. Batch size: 256. Optimizer: Adam with learning rate starting at 0.00001, decreasing to 0.000005 and 0.0000008 after 154 and 185 epochs respectively. Dropout is used.
• Luring — Epochs: 216. Batch size: 256. Optimizer: Adam with learning rate starting at 0.00001, decreasing to 0.000005 and 0.0000008 after 154 and 185 epochs respectively. Dropout is used.

Test set accuracy and agreement (augmented and target model agree on the ground-truth label) between each augmented model and the target model M , noted BASE.

Average l 2 and l ∞ distortions between clean and adversarial examples generated with the CWL2 attack for the target model M and the augmented models M • P

Rate of examples (over 1000) for which the logits of M and M ∘ P vary in opposite directions more often for the Luring approach than for the other approaches.

MNIST. Kernel size for the MIM-TI and DIM-TI attacks.

CIFAR10. Kernel size for the MIM-TI and DIM-TI attacks.

MNIST. AC_{M∘P}, AC_M and DAC for different source model architectures.

Defense component architecture for ImageNet.

Test set accuracy and agreement (augmented and target model agree on the ground-truth label) between each augmented model and the target model M, noted BASE. The attacks are performed on 1,000 well-classified images from the validation set, and three common l∞ perturbation budgets are considered: 4/255, 5/255 and 6/255. Results are presented in Table 17. Our approach obtains the highest AC_M and DAC values, which confirms the scalability of our method to large-scale data sets.

ImageNet. AC_{M∘P}, AC_M and DAC for different source model architectures.

Test set accuracy and agreement (augmented and target model agree on the ground-truth label) between an augmented luring model T and a target model M . Base and Base-Luring denote respectively a target model trained classically and the augmented model trained with the luring defense. AdvTrain and AdvTrain-Luring denote respectively a target model trained with adversarial training and the augmented model trained with the luring defense.

AC_M,wb, AC_M and DAC values (among FGSM, MIM, DIM, MIM-TI and DIM-TI) for the luring approach. Base-Luring and AdvTrain-Luring denote the cases where the model M is trained classically and with adversarial training respectively.

A SETUP FOR BASE CLASSIFIERS

The architectures of the models trained on MNIST, SVHN and CIFAR10 are detailed in Tables 2, 3 and 4 respectively. BN, MaxPool(u,v), UpSampling(u,v) and Conv(f,k,k) denote respectively batch normalization, max pooling with window size (u,v), upsampling with sampling factor (u,v) and 2D convolution with f filters and a kernel of size (k,k). For MNIST, we used 5 epochs, a batch size of 28 and the Adam optimizer with a learning rate of 0.01. For SVHN, we used 50 epochs, a batch size of 28 and the Adam optimizer with a learning rate of 0.01. For CIFAR10, we used 200 epochs, a batch size of 32, and the Adam optimizer with a piecewise learning rate starting at 0.1 and decreasing to 0.01 and 0.001 after 80 and 120 epochs respectively.
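The piecewise schedule used for CIFAR10 can be written as a small helper (a sketch; the function name and signature are ours, not from the paper's code):

```python
def piecewise_lr(epoch, rates=(0.1, 0.01, 0.001), milestones=(80, 120)):
    """Piecewise-constant learning-rate schedule used for the CIFAR10
    base model: 0.1 initially, then 0.01 from epoch 80 and 0.001 from
    epoch 120 onward."""
    lr = rates[0]
    for m, r in zip(milestones, rates[1:]):
        if epoch >= m:   # each milestone switches to the next rate
            lr = r
    return lr
```

The same pattern covers the defense-component schedules of Appendix B by swapping in their rates and milestones.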

B TRAINING SETUP FOR DEFENSE COMPONENTS

The architectures for the defense component P on MNIST, SVHN and CIFAR10 are detailed in Tables 5, 6 and 7 respectively. BN, MaxPool(u,v), UpSampling(u,v) and Conv(f,k,k) denote respectively batch normalization, max pooling with window size (u,v), upsampling with sampling factor (u,v) and 2D convolution with f filters and a kernel of size (k,k). The detailed parameters used to perform training are also reported.

Trapdoored examples are built from clean inputs by applying a specific trigger ∆. The target model is then trained by minimizing the cross-entropy on both the clean and trapdoored input-label pairs to obtain the trapdoored model. An input x is detected as a targeted adversarial example thanks to a threshold-based test, which compares a "neuron activation signature" of this input against that of the embedded trapdoors. Note that the authors extend their approach to untargeted adversarial examples. We refer to the article (Shan et al., 2019) for more details. The authors mention that transferability between the target and the trapdoored model is very poor.

Based on the provided code, we train trapdoored models with the architectures we used for M. We observe that transferability is high from M to the trapdoored version but, on the contrary, weak when adversarial examples are crafted on the trapdoored model and transferred to the target model. For the latter scenario, we report in Table 20 the worst AC_M and DAC values for FGSM, MIM, DIM, MIM-TI and DIM-TI, along with those of our method. The parameters used to run the attacks are the same as described in Section 4. These results would correspond to a defense similar to ours where the public model is not the augmented model.
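The threshold-based signature test described above can be sketched as follows; the cosine-similarity metric and the threshold value are our illustrative assumptions, not the exact test of Shan et al. (2019):

```python
import math

def cosine(u, v):
    # Cosine similarity between two activation vectors
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def trapdoor_test(activation, signature, threshold=0.8):
    """Flag an input as a suspected adversarial example when its
    neuron-activation vector is too similar to the embedded trapdoor
    signature (illustrative sketch: metric and threshold are assumed)."""
    return cosine(activation, signature) > threshold
```

In the actual defense, the signature is computed from the recorded activations of trapdoored inputs at a chosen layer, and the threshold is calibrated on clean data.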

