TOWARDS DEFENDING MULTIPLE ADVERSARIAL PERTURBATIONS VIA GATED BATCH NORMALIZATION

Abstract

There is now extensive evidence demonstrating that deep neural networks are vulnerable to adversarial examples, motivating the development of defenses against adversarial attacks. However, existing adversarial defenses typically improve model robustness against individual specific perturbation types. Some recent methods improve model robustness against adversarial attacks in multiple ℓ_p balls, but their performance against each perturbation type is still far from satisfactory. To better understand this phenomenon, we propose the multi-domain hypothesis, stating that different types of adversarial perturbations are drawn from different domains. Guided by the multi-domain hypothesis, we propose Gated Batch Normalization (GBN), a novel building block for deep neural networks that improves robustness against multiple perturbation types. GBN consists of a gated subnetwork and a multi-branch batch normalization (BN) layer, where the gated subnetwork separates different perturbation types, and each BN branch is in charge of a single perturbation type and learns domain-specific statistics for input transformation. Features from different branches are then aligned as domain-invariant representations for the subsequent layers. We perform extensive evaluations of our approach on MNIST, CIFAR-10, and Tiny-ImageNet, and demonstrate that GBN outperforms previous defenses against multiple perturbation types, i.e., ℓ_1, ℓ_2, and ℓ_∞ perturbations, by large margins of 10-20%.¹

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications (Krizhevsky et al., 2012; Bahdanau et al., 2014; Hinton et al., 2012), but they are susceptible to adversarial examples (Szegedy et al., 2013). These elaborately designed perturbations are imperceptible to humans but can easily lead DNNs to wrong predictions, threatening both digital and physical deep learning applications (Kurakin et al., 2016; Liu et al., 2019a). To improve model robustness against adversarial perturbations, a number of adversarial defense methods have been proposed (Papernot et al., 2015; Engstrom et al., 2018; Goodfellow et al., 2014). Many of these defenses are based on adversarial training (Goodfellow et al., 2014; Madry et al., 2018), which augments the training data with adversarial examples. However, most adversarial defenses are designed to counteract a single type of perturbation (e.g., small ℓ_∞-noise) (Madry et al., 2018; Kurakin et al., 2017; Dong et al., 2018). These defenses offer no guarantees for other perturbations (e.g., ℓ_1, ℓ_2), and sometimes even increase model vulnerability to them (Kang et al., 2019; Tramèr & Boneh, 2019). To address this problem, other adversarial training strategies have been proposed with the goal of simultaneously achieving robustness against multiple types of attacks, i.e., ℓ_∞, ℓ_1, and ℓ_2 attacks (Tramèr & Boneh, 2019; Maini et al., 2020). Although these methods improve overall model robustness against adversarial attacks in multiple ℓ_p balls, the performance for each individual perturbation type is still far from satisfactory. In this work, we propose the multi-domain hypothesis, which states that different types of adversarial perturbations arise in different domains, and thus have separable characteristics.
Training on data from multiple domains can be regarded as solving the invariant risk minimization problem (Ahuja et al., 2020), in which an invariant predictor is learned to achieve the minimum risk across different environments. For a deep learning model, instance-related knowledge can be stored in the weight matrix of each layer, whereas domain-related knowledge can be represented by the batch normalization (BN) layer statistics (Li et al., 2017). Inspired by the multi-domain hypothesis, we propose to improve model robustness against multiple perturbation types by separating domain-specific information for different perturbation types, and by using BN layer statistics to better align data from the mixture distribution and learn domain-invariant representations for multiple types of adversarial examples. In particular, we propose a novel building block for DNNs, referred to as Gated Batch Normalization (GBN), which consists of a gated subnetwork and a multi-branch BN layer. GBN first learns to separate perturbations from different domains on-the-fly and then normalizes them to obtain domain-specific features; each BN branch handles a single perturbation type (i.e., domain). Features computed from different branches are then aligned as domain-invariant representations and aggregated as the input to subsequent layers. Extensive experiments on MNIST, CIFAR-10, and Tiny-ImageNet demonstrate that our method outperforms previous defense strategies by large margins, i.e., 10-20%.
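To make the GBN design concrete, the following NumPy sketch shows how a layer of this kind might combine per-domain BN branches through a softmax gate. This is an illustrative simplification under our own assumptions (fixed per-domain statistics, shared affine parameters, a gate-weighted sum as the aggregation), not the authors' implementation:

```python
import numpy as np

class GatedBatchNormSketch:
    """Illustrative GBN-style layer: one BN branch per perturbation
    domain, with branch outputs combined by a softmax gate.
    All interfaces and parameter choices here are assumptions."""

    def __init__(self, num_features, num_domains, eps=1e-5):
        self.eps = eps
        # Domain-specific statistics: one (mean, var) pair per domain.
        self.means = np.zeros((num_domains, num_features))
        self.vars = np.ones((num_domains, num_features))
        # Shared affine parameters (a simplifying assumption).
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)

    def forward(self, x, gate_logits):
        # gate_logits: (batch, num_domains) scores from a gated
        # subnetwork that predicts which domain each input belongs to.
        g = np.exp(gate_logits - gate_logits.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)  # softmax over domains
        out = np.zeros_like(x)
        for k in range(self.means.shape[0]):
            # Normalize with the k-th domain's statistics.
            x_k = (x - self.means[k]) / np.sqrt(self.vars[k] + self.eps)
            out += g[:, k:k + 1] * x_k     # gate-weighted aggregation
        return self.gamma * out + self.beta
```

With a confident gate (one-hot weights), each input is effectively normalized by the statistics of its own perturbation domain, which is the behavior the multi-domain hypothesis calls for.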

2. BACKGROUND AND RELATED WORK

In this section, we provide a brief overview of existing work on adversarial attacks and defenses, as well as batch normalization techniques.

2.1. ADVERSARIAL ATTACKS AND DEFENSES

Adversarial examples are inputs intentionally designed to mislead DNNs (Szegedy et al., 2013; Goodfellow et al., 2014). Given a DNN f_Θ and an input image x ∈ X with ground truth label y ∈ Y, an adversarial example x_adv satisfies f_Θ(x_adv) ≠ y subject to ‖x − x_adv‖ ≤ ε, where ‖·‖ is a distance metric, commonly the ℓ_p-norm (p ∈ {1, 2, ∞}).

Various defense approaches have been proposed to improve model robustness against adversarial examples (Papernot et al., 2015; Xie et al., 2018; Madry et al., 2018; Liao et al., 2018; Cisse et al., 2017), among which adversarial training has been widely studied and demonstrated to be the most effective (Goodfellow et al., 2014; Madry et al., 2018). Specifically, adversarial training minimizes the worst-case loss within some perturbation region by augmenting the training set {x^(i), y^(i)}_{i=1...n} with adversarial examples. However, these defenses only improve model robustness for one type of perturbation (e.g., ℓ_∞) and typically offer no robustness guarantees against other attacks (Kang et al., 2019; Tramèr & Boneh, 2019; Schott et al., 2019).
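The inner maximization in adversarial training is commonly approximated with projected gradient descent (PGD) (Madry et al., 2018). A minimal ℓ_∞ sketch, where `grad_fn` stands in for the gradient of the model's loss with respect to the input (a hypothetical interface for illustration):

```python
import numpy as np

def pgd_linf(grad_fn, x, eps=0.3, alpha=0.05, steps=10):
    """PGD attack in an l_inf ball (illustrative sketch).

    grad_fn(x) returns the loss gradient w.r.t. the input; each step
    ascends the loss and projects back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # projection
    return x_adv
```

Adversarial training then feeds `pgd_linf(...)` outputs (with true labels) back into the training set, minimizing the loss on these worst-case inputs.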

To address this problem, recent works have attempted to improve robustness against several types of perturbation simultaneously (Schott et al., 2019; Tramèr & Boneh, 2019; Maini et al., 2020). Schott et al. (2019) proposed Analysis by Synthesis (ABS), which uses multiple variational autoencoders to defend against ℓ_0, ℓ_2, and ℓ_∞ adversaries; however, ABS only works on the MNIST dataset. Croce & Hein (2020a) proposed a provable adversarial defense against all ℓ_p-norms for p ≥ 1 using a regularization term; however, it is not applicable to the empirical setting, since it only guarantees robustness for very small perturbations (e.g., ε = 0.1 and 2/255 for ℓ_2 and ℓ_∞ on CIFAR-10). Tramèr & Boneh (2019) defended against multiple perturbation types (ℓ_1, ℓ_2, and ℓ_∞) by combining different types of adversarial examples during adversarial training. Specifically, they introduced two training strategies, "MAX" and "AVG": for each input image, the model is trained either on its strongest adversarial example or on all perturbation types. More recently, Maini et al. (2020) proposed multi steepest descent (MSD) and showed that a simple modification to standard PGD adversarial training improves robustness to ℓ_1, ℓ_2, and ℓ_∞ adversaries. In this work, we follow (Tramèr & Boneh, 2019; Maini et al., 2020) and focus on defending against ℓ_1, ℓ_2, and ℓ_∞ adversarial perturbations, which are the most representative and commonly used perturbations. However, we propose a completely different perspective and solution to the problem.
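The "MAX" and "AVG" training objectives of Tramèr & Boneh (2019) reduce to two ways of combining per-attack losses; a minimal sketch of that combination step (the per-attack losses themselves would come from separately generated ℓ_1, ℓ_2, and ℓ_∞ adversarial examples):

```python
import numpy as np

def avg_and_max_losses(per_attack_losses):
    """Combine one input's losses under several perturbation types,
    as in the "AVG" and "MAX" strategies (illustrative sketch).

    per_attack_losses: loss values under e.g. l1, l2, l_inf attacks."""
    losses = np.asarray(per_attack_losses, dtype=float)
    avg_loss = losses.mean()  # "AVG": train on all perturbation types
    max_loss = losses.max()   # "MAX": train on the strongest adversary
    return avg_loss, max_loss
```

Training then backpropagates through whichever combined loss is chosen; "MAX" approximates the worst case over attack types, while "AVG" balances all of them.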



¹ Our code will be available upon publication.



2.2. BATCH NORMALIZATION

BN (Ioffe & Szegedy, 2015) is typically used to stabilize and accelerate DNN training. Let x ∈ R^d denote the input to a neural network layer. During training, BN normalizes each neuron/channel using the mean and variance computed over the mini-batch, followed by a learnable affine transformation.
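The standard training-time BN forward pass can be written compactly; a minimal NumPy sketch for a fully connected layer (per-feature statistics over the batch axis):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN (Ioffe & Szegedy, 2015) for inputs of shape
    (batch, features): normalize each feature by its mini-batch mean
    and variance, then apply the affine transform gamma*x_hat + beta."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta
```

It is precisely these per-feature statistics (mu, var) that GBN maintains separately per perturbation domain, while a standard BN layer pools them over the whole (mixed) batch.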

