ADVERSARIAL FEATURE DESENSITIZATION

Abstract

Deep neural networks can now perform many tasks that were once thought to be feasible only for humans. While they reach impressive performance under standard settings, such networks are known to be vulnerable to adversarial attacks: slight but carefully constructed perturbations of the inputs that drastically decrease network performance. Here we propose a new way to improve network robustness against adversarial attacks, called Adversarial Feature Desensitization (AFD), which focuses on learning robust representations through the adversarial learning paradigm. AFD desensitizes the representation via an adversarial game between the embedding network and an additional adversarial discriminator that is trained to distinguish clean from perturbed inputs based on their high-level representations. Our method substantially improves the state of the art in robust classification on the MNIST, CIFAR10, and CIFAR100 datasets. More importantly, we demonstrate that AFD generalizes better than previous methods: the learned features maintain their robustness across a wide range of perturbations, including perturbations not seen during training. These results indicate that reducing feature sensitivity is a promising approach for ameliorating the problem of adversarial attacks in deep neural networks.

1. INTRODUCTION

Despite remarkable recent progress in deep learning that has allowed neural networks to achieve near human-level performance across a range of complex tasks (He et al., 2016; Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019), a number of important open challenges remain. For example, deep networks are known to be highly vulnerable to adversarial attacks (Szegedy et al., 2013), i.e. small but precise perturbations of the inputs that result in high-confidence predictions which diverge critically from human judgement. Many prior works on adversarial robustness have tackled the robust classification problem by forcing the classifier to output the correct label for the perturbed inputs (Madry et al., 2017; Kannan et al., 2018; Zhang et al., 2019b). These approaches essentially push the representations of samples from different categories away from the decision boundary. For example, the Adversarial Training procedure (Madry et al., 2017) trains a network to minimize the classification loss on the distribution of perturbed input samples. Another recent approach (Zhang et al., 2019b) augments the regular classification loss with an auxiliary term that encourages the network to assign matching labels to clean and perturbed inputs (Figure 1a). More recently, several other works have tried to improve classification robustness by enhancing the smoothness of the classification loss (Wu et al., 2019; Qin et al., 2020) or the saliency of the Jacobian matrix (Chan et al., 2020b). These methods have been shown to further improve robust performance compared to prior approaches that do not consider the gradient landscape of the network. However, despite all these efforts, most of these defenses remain vulnerable to other forms of attack that were not used during training, or even to slightly stronger perturbations of the same kind (Schott et al., 2018; Sitawarin et al., 2020).
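The adversarial training loop described above (an inner maximization that crafts worst-case perturbations, followed by an outer minimization of the loss on those perturbed inputs) can be sketched on a toy model. The following is a minimal illustration in the spirit of Madry et al. (2017), using PGD on a 2-D logistic-regression classifier; all names (`pgd_attack`, `adv_train_step`) and hyperparameters are our own illustrative assumptions, not from any published implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(w, b, x, y, eps=0.3, alpha=0.1, steps=10):
    """Inner maximization: iterated signed-gradient ascent on the loss,
    projected onto an L-infinity ball of radius eps around the clean inputs."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)
        grad_x = np.outer(p - y, w)               # d(logistic loss)/dx per sample
        x_adv = x_adv + alpha * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv

def adv_train_step(w, b, x, y, lr=0.1, eps=0.3):
    """Outer minimization: one gradient step on the worst-case inputs."""
    x_adv = pgd_attack(w, b, x, y, eps=eps)
    p = sigmoid(x_adv @ w + b)
    grad_w = x_adv.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b

# Toy data: two Gaussian blobs separated along the first axis.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
x = rng.normal(size=(200, 2))
x[:, 0] += 4.0 * y - 2.0

w, b = np.zeros(2), 0.0
for _ in range(200):
    w, b = adv_train_step(w, b, x, y)

x_adv = pgd_attack(w, b, x, y)
robust_acc = np.mean((sigmoid(x_adv @ w + b) > 0.5) == (y == 1))
print(f"robust accuracy: {robust_acc:.2f}")
```

Note that the classifier is evaluated on adversarially perturbed inputs, mirroring the robust-accuracy metric used throughout this literature; the toy model of course says nothing about how such training behaves for deep networks.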
One reason for the above could be an insufficient focus on the robustness of the representations learned by the model. It has been shown that many adversarial perturbations, although small in magnitude, lead to large deviations in the high-level features of deep neural networks (Liao et al., 2018; Yoon et al., 2019). In addition, previous work (Ilyas et al., 2019) demonstrated that adversarial patterns often rely on specific learned features which generalize even on large datasets such as ImageNet (Deng et al., 2009). However, these features are highly sensitive to input changes, yielding a potential vulnerability that can be exploited by adversarial attacks. While humans can also experience altered
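The feature-sensitivity phenomenon above can be made concrete with a toy example of our own (not from the paper): a linear feature map with a large spectral norm turns a tiny input perturbation into a large feature-space deviation, which is exactly the kind of amplification that adversarial attacks exploit.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
delta = np.array([0.01, 0.0])        # small input perturbation, norm 0.01

W_sensitive = np.diag([50.0, 1.0])   # amplifies the perturbed direction 50x
W_robust = np.eye(2)                 # well-conditioned (isometric) map

def feature_shift(W, x, delta):
    """Norm of the feature deviation caused by perturbing the input."""
    return np.linalg.norm(W @ (x + delta) - W @ x)

print(feature_shift(W_sensitive, x, delta))  # ~0.5: 50x the input change
print(feature_shift(W_robust, x, delta))     # ~0.01: same as the input change
```

Desensitizing features, in this picture, amounts to pushing the map that attacks exploit toward the well-conditioned regime, rather than only moving decision boundaries away from the data.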

