ADVERSARIAL FEATURE DESENSITIZATION

Abstract

Deep neural networks can now perform many tasks that were once thought to be feasible only for humans. While reaching this impressive performance under standard settings, such networks are known to be vulnerable to adversarial attacks: slight but carefully constructed perturbations of the inputs which drastically decrease the network performance. Here we propose a new way to improve network robustness against adversarial attacks by focusing on robust representation learning based on the adversarial learning paradigm, which we call Adversarial Feature Desensitization (AFD). AFD desensitizes the representation via an adversarial game between the embedding network and an additional adversarial discriminator, which is trained to distinguish between clean and perturbed inputs from their high-level representations. Our method substantially improves the state of the art in robust classification on the MNIST, CIFAR10, and CIFAR100 datasets. More importantly, we demonstrate that AFD has better generalization ability than previous methods, as the learned features maintain their robustness across a wide range of perturbations, including perturbations not seen during training. These results indicate that reducing feature sensitivity is a promising approach for ameliorating the problem of adversarial attacks in deep neural networks.

1. INTRODUCTION

Despite remarkable recent progress in deep learning that has allowed neural networks to achieve near human-level performance across a range of complex tasks (He et al., 2016; Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019), a number of important open challenges remain. For example, deep networks are known to be highly vulnerable to adversarial attacks (Szegedy et al., 2013), i.e., small but precise perturbations of the inputs that result in high-confidence predictions which diverge critically from human judgement. Many prior works on adversarial robustness have tackled the robust classification problem by forcing the classifier to output the correct label for the perturbed inputs (Madry et al., 2017; Kannan et al., 2018; Zhang et al., 2019b). These approaches essentially push the representations of samples from different categories away from the decision boundary. For example, the Adversarial Training procedure (Madry et al., 2017) trains a network to minimize the classification loss on the distribution of perturbed input samples. Another recent approach (Zhang et al., 2019b) augments the regular classification loss with an auxiliary term that encourages the network to assign matching labels to clean and perturbed inputs (Figure 1a). More recently, several other works have tried to improve classification robustness by enhancing the smoothness of the classification loss (Wu et al., 2019; Qin et al., 2020) or the saliency of the Jacobian matrix (Chan et al., 2020b). These methods have been shown to further improve robust performance compared to prior approaches that do not consider the gradient landscape of the network. However, despite all these efforts, most of these defenses remain vulnerable to other forms of attacks that were not used during training, or even to slightly stronger perturbations of the same kind (Schott et al., 2018; Sitawarin et al., 2020).
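To make the adversarial training objective described above concrete, the following is a minimal sketch of its min-max structure on a toy logistic-regression model, with a single-step FGSM attack as the inner maximization. All names (`fgsm`, `adversarial_train`) and hyperparameters are illustrative assumptions, not taken from the papers cited above, which use multi-step PGD attacks on deep networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # logistic loss for a label y in {-1, +1}
    return np.log1p(np.exp(-y * (w @ x)))

def grad_x(w, x, y):
    # gradient of the logistic loss w.r.t. the input x
    return -y * sigmoid(-y * (w @ x)) * w

def fgsm(w, x, y, eps):
    # inner maximization (one-step L_inf attack):
    # move x in the sign of the input gradient to increase the loss
    return x + eps * np.sign(grad_x(w, x, y))

def adversarial_train(X, Y, eps=0.1, lr=0.1, epochs=50, seed=0):
    # outer minimization: SGD on the loss evaluated at perturbed inputs
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x_adv = fgsm(w, x, y, eps)
            g_w = -y * sigmoid(-y * (w @ x_adv)) * x_adv
            w -= lr * g_w
    return w
```

For this convex toy model the FGSM step provably never decreases the loss, which is why the inner step is a valid (if crude) maximizer; deep networks replace it with iterative attacks.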
One reason for the above could be an insufficient focus on the robustness of the representations learned by the model. It has been shown that many adversarial perturbations, though small in magnitude, lead to large deviations in the high-level features of deep neural networks (Liao et al., 2018; Yoon et al., 2019). In addition, previous work (Ilyas et al., 2019) demonstrated that adversarial patterns often rely on specific learned features which generalize even on large datasets such as ImageNet (Deng et al., 2009). However, these features are highly sensitive to input changes, yielding a potential vulnerability that can be exploited by adversarial attacks. While humans can also experience altered perception in response to particular visual patterns (e.g., visual illusions; see https://michaelbach.de/ot/), they are seemingly insensitive to this particular class of perturbations, and are often unaware of the subtle image changes resulting from adversarial attacks. This in turn suggests that current deep neural networks may rely on features that are still considerably different from those giving rise to perception in primates (and, particularly, in humans), even despite many recent studies highlighting their remarkable similarities (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Bashivan et al., 2019). It is therefore reasonable to hypothesize that a deep network may become more robust to such adversarial attacks if the corresponding higher-level representations, like those in our brains, are themselves robust to input perturbations. One way to approach the issue of robust classification is to consider the classifier as a relatively simple mapping (e.g., a linear transformation) that produces predictions based on a learned representation. In this case, if the learned representation is robust, then the predictions from the simple classifier would consequently be robust too (Garg et al., 2018; Zhu et al., 2020).
Here, instead of focusing on robust classification, we turn our attention to the robustness of the learned features from which the categories are inferred (e.g., using a simple linear classifier). Our goal is to learn representations that remain stable in the presence of adversarial attacks. We propose to learn robust representations via an adversarial game between two agents: i) an attacker that searches for performance-degrading perturbations given the embedding function, and ii) a discriminator function that distinguishes between clean and perturbed inputs from their high-level representations. The parameters of the embedding and the adversarial discriminator functions are then tuned via an adversarial game between the two (Figure 1b). This setup is similar to the adversarial learning paradigm widely used in image generation and transformation (Goodfellow et al., 2014a; Karras et al., 2019; Zhu et al., 2017), unsupervised and semi-supervised learning (Miyato et al., 2018b), video prediction (Mathieu et al., 2015; Lee et al., 2018), domain adaptation (Ganin & Lempitsky, 2015; Tzeng et al., 2017), active learning (Sinha et al., 2019), and continual learning (Ebrahimi et al., 2020). While some prior works have also applied adversarial learning to the problem of adversarial examples, they have typically used it to learn the distribution of the adversarial images (Wang & Yu, 2019; Matyasko & Chau, 2018) or the input gradients (Chan et al., 2020b;a). The main contributions of this work are:
• We propose a novel method to learn adversarially robust representations through an adversarial game between the embedding function and an adversarial discriminator that distinguishes between the natural and perturbed representations.
• We theoretically show that our proposed adversarial approach leads to a flat loss function in the vicinity of the training samples, thereby making the overall representation more stable against adversarial attacks.
• We perform extensive empirical evaluations against many prior methods, on three datasets and eight types of attacks with a wide range of attack strengths, and show that our proposed approach performs similarly to or better than (often, significantly better than) most previous defense methods under most tested circumstances.
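The two-player game described above can be sketched in miniature as follows. This is a toy illustration only, assuming a one-layer tanh embedding and a logistic discriminator with hand-derived gradients; the function names (`embed`, `afd_step`) and the single-sample update are illustrative assumptions and not the paper's actual architecture or training procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embed(W, x):
    # toy embedding network: a single tanh layer
    return np.tanh(W @ x)

def afd_step(W, v, x, x_adv, lr=0.05):
    """One round of the AFD game on a single (clean, perturbed) pair.

    The discriminator v is trained to output 1 on clean embeddings and
    0 on perturbed ones; the embedding W is then updated so that the
    perturbed embedding fools the discriminator (target label 1).
    """
    e_clean, e_adv = embed(W, x), embed(W, x_adv)

    # discriminator update: gradient step on BCE with (clean=1, perturbed=0)
    for e, t in ((e_clean, 1.0), (e_adv, 0.0)):
        v = v - lr * (sigmoid(v @ e) - t) * e

    # embedding update: make the perturbed embedding look clean to v
    e_adv = embed(W, x_adv)
    p = sigmoid(v @ e_adv)
    # backprop BCE(target=1) through the tanh layer:
    # dL/dW_ij = (p - 1) * v_i * (1 - e_i^2) * x_j
    g_e = (p - 1.0) * v * (1.0 - e_adv ** 2)
    W = W - lr * np.outer(g_e, x_adv)
    return W, v
```

In the full method, `x_adv` would come from an attacker run against the embedding plus a task classifier, and both players would train on batches; the sketch keeps only the alternating discriminator/embedding updates that define the game.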






Figure 1: Overview of the proposed AFD approach: (a) visual comparison of several adversarial robustness methods (Adversarial Training (Madry et al., 2017), TRADES (Zhang et al., 2019b), and AFD); the dotted black line corresponds to the decision boundary of the adversarial discriminator. (b) Schematic of the proposed AFD paradigm.

