TOWARDS DEFENDING MULTIPLE ADVERSARIAL PERTURBATIONS VIA GATED BATCH NORMALIZATION

Abstract

There is now extensive evidence demonstrating that deep neural networks are vulnerable to adversarial examples, motivating the development of defenses against adversarial attacks. However, existing adversarial defenses typically improve model robustness against individual specific perturbation types. Some recent methods improve model robustness against adversarial attacks in multiple ℓp balls, but their performance against each perturbation type is still far from satisfactory. To better understand this phenomenon, we propose the multi-domain hypothesis, stating that different types of adversarial perturbations are drawn from different domains. Guided by the multi-domain hypothesis, we propose Gated Batch Normalization (GBN), a novel building block for deep neural networks that improves robustness against multiple perturbation types. GBN consists of a gated subnetwork and a multi-branch batch normalization (BN) layer, where the gated subnetwork separates different perturbation types, and each BN branch is in charge of a single perturbation type and learns domain-specific statistics for input transformation. Then, features from different branches are aligned as domain-invariant representations for the subsequent layers. We perform extensive evaluations of our approach on MNIST, CIFAR-10, and Tiny-ImageNet, and demonstrate that GBN outperforms previous defense proposals against multiple perturbation types, i.e., ℓ1, ℓ2, and ℓ∞ perturbations, by large margins of 10-20%.¹
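The abstract's core mechanism — a gate that routes each input to a perturbation-specific BN branch, with each branch keeping its own normalization statistics — can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper's implementation: the class name `GatedBN`, the nearest-mean gating rule, and the shared affine parameters are all hypothetical simplifications (the paper's gated subnetwork is itself learned).

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize x with the given (branch-specific) statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

class GatedBN:
    """Toy gated BN layer: a gate assigns each input to one BN branch;
    each branch holds running statistics for a single perturbation domain,
    so normalization maps all domains to a comparable feature range."""

    def __init__(self, num_branches, num_features):
        self.mean = np.zeros((num_branches, num_features))   # per-branch stats
        self.var = np.ones((num_branches, num_features))
        self.gamma = np.ones(num_features)                   # affine params
        self.beta = np.zeros(num_features)                   # (shared here)

    def gate(self, x):
        # Hypothetical stand-in for the learned gated subnetwork:
        # route each sample to the branch with the closest stored mean.
        dists = ((x[:, None, :] - self.mean[None]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    def forward(self, x):
        branch = self.gate(x)
        out = np.empty_like(x)
        for b in range(self.mean.shape[0]):
            idx = branch == b
            if idx.any():  # normalize each group with its own branch stats
                out[idx] = batch_norm(x[idx], self.mean[b], self.var[b],
                                      self.gamma, self.beta)
        return out
```

Because every branch normalizes toward zero mean and unit variance, inputs from different perturbation domains land in a shared feature distribution, which is one way to read the paper's claim that branch outputs are "aligned as domain-invariant representations."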

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications (Krizhevsky et al., 2012; Bahdanau et al., 2014; Hinton et al., 2012), but they are susceptible to adversarial examples (Szegedy et al., 2013). These elaborately designed perturbations are imperceptible to humans but can easily lead DNNs to wrong predictions, threatening both digital and physical deep learning applications (Kurakin et al., 2016; Liu et al., 2019a). To improve model robustness against adversarial perturbations, a number of adversarial defense methods have been proposed (Papernot et al., 2015; Engstrom et al., 2018; Goodfellow et al., 2014). Many of these defense methods are based on adversarial training (Goodfellow et al., 2014; Madry et al., 2018), which augments training data with adversarial examples. However, most adversarial defenses are designed to counteract a single type of perturbation (e.g., small ℓ∞-noise) (Madry et al., 2018; Kurakin et al., 2017; Dong et al., 2018). These defenses offer no guarantees for other perturbations (e.g., ℓ1, ℓ2), and sometimes even increase model vulnerability to them (Kang et al., 2019; Tramèr & Boneh, 2019). To address this problem, other adversarial training strategies have been proposed with the goal of simultaneously achieving robustness against multiple types of attacks, i.e., ℓ∞, ℓ1, and ℓ2 attacks (Tramèr & Boneh, 2019; Maini et al., 2020). Although these methods improve overall model robustness against adversarial attacks in multiple ℓp balls, the performance for each individual perturbation type is still far from satisfactory. In this work, we propose the multi-domain hypothesis, which states that different types of adversarial perturbations arise in different domains, and thus have separable characteristics.
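To make the adversarial-training setup above concrete, the following sketch generates one FGSM-style ℓ∞ adversarial example (Goodfellow et al., 2014) for a binary logistic-regression model; adversarial training would then mix such perturbed inputs back into each training batch. The function name `fgsm` and the logistic model are illustrative assumptions, chosen only because the input gradient has a closed form here.

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One FGSM step for binary logistic regression: move x by eps in the
    sign of the input gradient of the cross-entropy loss, i.e., the worst
    single step inside the l_inf ball of radius eps."""
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))           # sigmoid prediction
    grad_x = (p - y)[:, None] * w[None]    # d(cross-entropy)/dx, closed form
    return x + eps * np.sign(grad_x)
```

An ℓ1 or ℓ2 attack replaces the `sign` step with a projection onto the corresponding ball, which is precisely why a defense tuned to one ℓp geometry need not transfer to the others.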
Training on data from multiple domains can be regarded as solving the invariant risk minimization problem (Ahuja et al., 2020), in which an invariant predictor is learned that achieves the minimum risk across different environments. For a deep learning model, instance-related knowledge can be stored in the weight matrix



¹ Our code will be available upon publication.

