LURING OF TRANSFERABLE ADVERSARIAL PERTURBATIONS IN THE BLACK-BOX PARADIGM

Abstract

The growing interest in adversarial examples, i.e. maliciously modified inputs that fool a classifier, has resulted in many defenses intended to detect them, render them harmless, or make the model more robust against them. In this paper, we pave the way towards a new approach to improve the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model and is designed to induce the luring effect, which tricks the adversary into choosing false directions to fool the target model. Training the additional model is achieved with a loss function acting on the order of the logits. Our deception-based method only needs access to the predictions of the target model and does not require a labeled data set. We explain the luring effect with the notion of robust and non-robust useful features and perform experiments on MNIST, SVHN and CIFAR10 to characterize and evaluate this phenomenon. Additionally, we discuss two simple prediction schemes, and verify experimentally that our approach can be used as a defense to efficiently thwart an adversary using state-of-the-art attacks and allowed to perform large perturbations.

1. INTRODUCTION

Neural-network-based systems have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014), i.e. maliciously modified inputs that fool a model at inference time. Many directions have been explored to explain and characterize this phenomenon (Schmidt et al., 2018; Ford et al., 2019; Ilyas et al., 2019; Shafahi et al., 2019), which has become a growing concern and a major brake on the deployment of Machine Learning (ML) models. In response, many defenses have been proposed to protect the integrity of ML systems, predominantly focused on an adversary in the white-box setting (Madry et al., 2018; Zhang et al., 2019; Cohen et al., 2019; Hendrycks et al., 2019; Carmon et al., 2019). In this work, we design a novel way to limit the transferability of adversarial perturbations towards a model, opening a new direction for robustness in the realistic black-box setting (Papernot et al., 2017). As ML-based online APIs are likely to become increasingly widespread, and given the massive deployment of edge models in a large variety of devices, several instances of a model may be deployed in systems with different environments and security properties. The black-box paradigm therefore needs to be extensively studied to efficiently protect systems in many critical domains.

Considering a target model M that a defender aims to protect against adversarial examples, we propose a method to build a model T, an augmented version of M, such that adversarial examples do not transfer from T to M. Importantly, training T only requires access to M, meaning that no labeled data set is needed, so our approach can be implemented at low cost for any already trained model. T is built by augmenting M with an additional component P (with T = M ∘ P), which takes the form of a neural network trained with a specific loss function with logit-based constraints. From the observation that transferability of adversarial perturbations between two models occurs because they rely on similar non-robust features (Ilyas et al., 2019), we design P such that (1) the augmented network exploits useful features of M and (2) non-robust features of T and M are either different or require different perturbations to reach misclassification towards the same class. Our deception-based method is conceptually new as it does not aim at making M rely more on robust features as with proactive schemes (Madry et al., 2018; Zhang et al., 2019).
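To make the construction of T concrete, the sketch below shows one way the composition T = M ∘ P could be set up in PyTorch: M is kept frozen and only the prepended network P is trained. This is a minimal illustration under our own assumptions; the architecture chosen for P, and the surrogate objective shown (agreement with M's predictions, standing in for the actual loss on the logit order, which is defined later in the paper), are hypothetical placeholders rather than the exact formulation.

```python
# Minimal sketch (assumption: M is an already trained classifier; the choice of P
# and the loss below are illustrative placeholders, not the paper's exact method).
import torch
import torch.nn as nn


class AugmentedModel(nn.Module):
    """T = M o P: the additional component P is prepended to the frozen target M."""

    def __init__(self, target_model: nn.Module, luring_net: nn.Module):
        super().__init__()
        self.M = target_model
        self.P = luring_net
        for param in self.M.parameters():  # only P is trained; M is left untouched
            param.requires_grad = False

    def forward(self, x):
        return self.M(self.P(x))  # logits of the augmented model T


# Hypothetical luring component: an input-to-input mapping for 32x32x3 images
# (e.g. SVHN or CIFAR10); the actual architecture of P is a design choice.
P = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)


def agreement_loss(logits_T: torch.Tensor, logits_M: torch.Tensor) -> torch.Tensor:
    """Surrogate objective: make T agree with M's predictions on clean inputs.

    Training only needs M's outputs (no ground-truth labels); the actual luring
    loss additionally constrains the order of the logits (omitted here).
    """
    pseudo_labels = logits_M.argmax(dim=1)  # M's predicted class as pseudo-label
    return nn.functional.cross_entropy(logits_T, pseudo_labels)
```

In this setup, an adversary attacking T in the black-box transfer scenario computes perturbations against M ∘ P, which is the mechanism the luring effect exploits so that those perturbations fail to transfer back to M alone.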

