LURING OF TRANSFERABLE ADVERSARIAL PERTURBATIONS IN THE BLACK-BOX PARADIGM

Abstract

The growing interest in adversarial examples, i.e. maliciously modified examples that fool a classifier, has resulted in many defenses intended to detect them, render them inoffensive, or make the model more robust to them. In this paper, we pave the way towards a new approach for improving the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model and is designed to induce the luring effect, which tricks the adversary into choosing false directions to fool the target model. Training the additional model is achieved with a loss function acting on the order of the logits. Our deception-based method only needs access to the predictions of the target model and does not require a labeled data set. We explain the luring effect thanks to the notion of robust and non-robust useful features, and perform experiments on MNIST, SVHN and CIFAR10 to characterize and evaluate this phenomenon. Additionally, we discuss two simple prediction schemes and verify experimentally that our approach can be used as a defense to efficiently thwart an adversary using state-of-the-art attacks and allowed to perform large perturbations.

1. INTRODUCTION

Neural network-based systems have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014), i.e. maliciously modified inputs that fool a model at inference time. Many directions have been explored to explain and characterize this phenomenon (Schmidt et al., 2018; Ford et al., 2019; Ilyas et al., 2019; Shafahi et al., 2019), which has become a growing concern and a major brake on the deployment of Machine Learning (ML) models. In response, many defenses have been proposed to protect the integrity of ML systems, predominantly focused on an adversary in the white-box setting (Madry et al., 2018; Zhang et al., 2019; Cohen et al., 2019; Hendrycks et al., 2019; Carmon et al., 2019). In this work, we design an innovative way to limit the transferability of adversarial perturbations towards a model, opening a new direction for robustness in the realistic black-box setting (Papernot et al., 2017). As ML-based online APIs are likely to become increasingly widespread, and given the massive deployment of edge models in a large variety of devices, several instances of a model may be deployed in systems with different environments and security properties. The black-box paradigm therefore needs to be extensively studied to efficiently protect systems in many critical domains. Considering a target model M that a defender aims to protect against adversarial examples, we propose a method to build a model T, an augmented version of M, such that adversarial examples do not transfer from T to M. Importantly, training T only requires access to M, meaning that no labeled data set is needed, so our approach can be implemented at low cost for any already trained model. T is built by augmenting M with an additional component P (with T = M • P) taking the form of a neural network trained with a specific loss function with logit-based constraints.
From the observation that transferability of adversarial perturbations between two models occurs because they rely on similar non-robust features (Ilyas et al., 2019), we design P such that (1) the augmented network exploits useful features of M and (2) non-robust features of T and M are either different or require different perturbations to reach misclassification towards the same class. Our deception-based method is conceptually new: it does not aim at making M rely more on robust features as with proactive schemes (Madry et al., 2018; Zhang et al., 2019), nor does it try to anticipate perturbations which directly target the non-robust features of M as with reactive strategies (Meng & Chen, 2017; Hwang et al., 2019). Our contributions are as follows:

• We present an innovative approach to thwart transferability between two models, which we name the luring effect. This conceptually novel phenomenon opens a new direction for adversarial research.

• We propose an implementation of the luring effect which fits any pre-trained model and does not require a labeled data set. An additional neural network is pasted to the target model and trained with a specific loss function that acts on the order of the logits.

• We experimentally characterize the luring effect, discuss its potential for black-box defense strategies on MNIST, SVHN and CIFAR10, and analyze its scalability on ImageNet (ILSVRC2012). For reproducibility purposes, the code is available at https://anonymous.4open.science/r/3c64e745-927d-4f51-b187-583e64586ff6/.

2.1. NOTATIONS

We consider a classification task where input-label pairs (x, y) ∈ X × Y are sampled from a distribution D. |Y| = C is the cardinality of the label space. A neural network model M_φ : X → Y, with parameters φ, classifies an input x ∈ X to a label M(x) ∈ Y. The pre-softmax output function of M_φ (the logits) is denoted h_M : X → R^C. For the sake of readability, the model M_φ is simply noted M, except when necessary.
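As a concrete illustration of this notation, the toy sketch below (all names are hypothetical stand-ins, not the paper's models) treats h_M as an arbitrary map from inputs to R^C and recovers M(x) as the argmax over the C logits:

```python
# Toy illustration of the notation: h_M maps an input x to logits in R^C,
# and the classifier's decision M(x) is the index of the largest logit.
C = 3  # |Y|, the number of classes (illustrative value)

def h_M(x):
    # Stand-in logits function: any fixed map from inputs to R^C would do.
    return [x[0] + x[1], x[0] - x[1], -x[0]]

def M(x):
    logits = h_M(x)
    return max(range(C), key=lambda i: logits[i])

print(M([2.0, 1.0]))  # the class whose logit is largest, here class 0
```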

2.2. CONTEXT: ADVERSARIAL EXAMPLES IN THE BLACK-BOX SETTING

Black-box settings are realistic use cases since many models are deployed (in the cloud or embedded in mobile devices) within secure environments and accessible through open or restricted APIs. Contrary to the white-box paradigm, where the adversary is able to use existing gradient-based attacks (Goodfellow et al., 2015; Carlini & Wagner, 2017; Chen et al., 2018; Dong et al., 2018; Madry et al., 2018; Wang et al., 2019), an attacker in a black-box setting only accesses the output label, confidence scores or logits of the target model. They can still take advantage of gradient-free methods (Uesato et al., 2018; Guo et al., 2019; Su et al., 2019; Brendel et al., 2018; Ilyas et al., 2018; Chen et al., 2020) but, in practice, the number of queries required to mount the attack is prohibitive and may be flagged as suspicious (Chen et al., 2019; Li et al., 2020). In that case, the adversary may take advantage of the transferability property (Papernot et al., 2017) by crafting adversarial examples on a substitute model and then transferring them to the target model.
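The transfer attack described above can be sketched numerically. In this minimal, self-contained example (all models and weights are illustrative, not taken from the paper), the attacker computes an FGSM-style perturbation on a white-box linear substitute and applies it to a black-box linear target; because the two models weight their features similarly, the perturbation transfers:

```python
# Hedged sketch of a transfer attack: craft an FGSM perturbation on a
# substitute model (white-box access) and replay it on a black-box target.
# Both "models" are toy linear binary classifiers with hand-picked weights.

def sign(v):
    return [1.0 if t > 0 else -1.0 if t < 0 else 0.0 for t in v]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x):
    return 1 if score(w, x) > 0 else 0

# Substitute and target rely on similar features, so the gradient
# direction computed on one also moves the other's score.
w_sub    = [1.0, -2.0, 0.5]
w_target = [0.9, -1.8, 0.6]

x, y, eps = [0.2, -0.4, 0.1], 1, 0.5
# FGSM on the substitute: for a linear score the input gradient is w_sub;
# y = 1 means the attacker wants to decrease the score, hence the sign flip.
grad = w_sub if y == 0 else [-wi for wi in w_sub]
x_adv = [xi + eps * gi for xi, gi in zip(x, sign(grad))]

# The perturbation crafted on the substitute also fools the target.
print(predict(w_target, x), predict(w_target, x_adv))  # 1 0
```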

2.3. OBJECTIVES AND DESIGN

Our objective is to find a novel way to make models more robust against transferable black-box adversarial perturbations without the expensive (and sometimes prohibitive) training cost required by many white-box defense methods. Our main idea draws on classical deception-based approaches to network security (e.g. honeypots) and can be summarized as follows: rather than trying to prevent an attack, let's fool the attacker. Our approach relies on a network P : X → X, pasted to the already trained target network M before its input layer, such that the resulting augmented model answers T(x) = M • P(x) when fed with input x. The additional component P is designed and trained to reach a twofold objective:

• Prediction neutrality: adding P does not alter the decision for a clean example x, i.e. T(x) = M • P(x) = M(x);

• Adversarial luring: for an adversarial example x′ crafted to fool T, M does not output the same label as T (i.e. M • P(x′) ≠ M(x′)) and, in the best case, x′ is inefficient (i.e. M(x′) = y).
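The two objectives can be made concrete with a deliberately simple numeric sketch. Here M and P are toy stand-ins on scalar inputs (not the paper's trained networks): P keeps clean predictions intact while shifting the decision boundary the attacker sees, so a perturbation sized for T's boundary falls short of M's:

```python
# Minimal sketch of the augmented model T = M . P and its two design goals.
# M and P are hypothetical scalar stand-ins chosen only for illustration.

def M(x):
    # Target classifier with decision boundary at 0.5.
    return 0 if x < 0.5 else 1

def P(x):
    # Stand-in luring network: x -> x**3 preserves decisions on clean
    # inputs far from 0.5, but moves the boundary seen through T to
    # 0.5 ** (1/3) ~ 0.794.
    return x ** 3

def T(x):
    return M(P(x))

# Prediction neutrality: T agrees with M on clean examples.
for x in (0.1, 0.9):
    assert T(x) == M(x)

# Adversarial luring: an x' crafted to just cross T's boundary (~0.794)
# does not cross M's boundary (0.5), so T is fooled while M stays correct.
x, y = 0.9, 1
x_adv = 0.79
print(T(x_adv), M(x_adv))  # 0 1 -> T misclassifies, M still outputs y
```

The key design point this sketch captures is that P reshapes the attack surface rather than hardening M itself: the adversary optimizing against T is lured into perturbation directions (or magnitudes) that are wrong for M.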

