IMPROVING ADVERSARIAL ROBUSTNESS VIA CHANNEL-WISE ACTIVATION SUPPRESSING

Abstract

The study of adversarial examples and their activations has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant channels from being activated by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate-layer activations of DNNs.

1. INTRODUCTION

Deep neural networks (DNNs) have become standard models for solving real-world complex problems, such as image classification (He et al., 2016), speech recognition (Wang et al., 2017), and natural language processing (Devlin et al., 2019). DNNs can approximate extremely complex functions through a series of linear (e.g. convolution) and non-linear (e.g. ReLU activation) operations. Despite their superb learning capabilities, DNNs have been found to be vulnerable to adversarial examples (or attacks) (Szegedy et al., 2014; Goodfellow et al., 2015), where small perturbations on the input can easily subvert the model's prediction. Adversarial examples can transfer across different models (Liu et al., 2017; Wu et al., 2020a; Wang et al., 2021) and remain destructive even in the physical world (Kurakin et al., 2016; Duan et al., 2020), raising safety concerns in autonomous driving (Eykholt et al., 2018) and medical diagnosis (Ma et al., 2021). Existing defense methods against adversarial examples include input denoising (Liao et al., 2018; Bai et al., 2019), defensive distillation (Papernot et al., 2016), gradient regularization (Gu & Rigazio, 2014), model compression (Das et al., 2018) and adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Wang et al., 2019), amongst which adversarial training has demonstrated the most reliable robustness (Athalye et al., 2019; Croce & Hein, 2020b). Adversarial training is a data augmentation technique that trains DNNs on adversarial rather than natural examples. In adversarial training, natural examples are augmented (or perturbed) with the worst-case perturbations found within a small L_p-norm ball around them. This augmentation has been shown to effectively smooth out the loss landscape around the natural examples, and force the network to focus more on the pixels that are most relevant to the class.
Apart from these interpretations, it is still not well understood, from the activation perspective, how small input perturbations accumulate across intermediate layers to subvert the final output, and how adversarial training can help mitigate such an accumulation. The study of intermediate-layer activation has thus become crucial for developing a more in-depth understanding and more robust DNNs. In this paper, we show that, if studied from a channel-wise perspective, strong connections can be established between certain characteristics of intermediate activations and adversarial robustness. Our channel-wise analysis is motivated by the fact that different convolution filters (or channels) learn different patterns, which, when combined, describe a specific type of object. Here, adversarial examples are investigated from the new perspective of channel-wise activations. Different from existing activation works that assume different channels are of equal importance, we focus on the relationship between channels. Intuitively, different channels of an intermediate layer contribute differently to the class prediction, and thus have different levels of vulnerability (or robustness) to adversarial perturbations. Given an intermediate DNN layer, we first apply global average pooling (GAP) to obtain the channel-wise activations, based on which we show that the activation magnitudes of adversarial examples are higher than those of natural examples. This means that adversarial perturbations generally have a signal-boosting effect on the channels. We also find that the channels are activated more uniformly by adversarial examples than by natural examples. In other words, some redundant (or low-contributing) channels that are not activated by natural examples are nevertheless activated by adversarial examples.
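The channel-wise analysis described above can be sketched in a few lines. The following is a minimal, illustrative example (not the paper's actual implementation): global average pooling reduces an (N, C, H, W) feature map to (N, C) channel activations, from which a mean magnitude and a normalized-entropy uniformity score can be compared between batches. The synthetic "adversarial" batch here is simply the natural batch with a constant boost, used purely to illustrate the two statistics; all function names are our own.

```python
import numpy as np

def channel_activation(feature_map):
    """Global average pooling: (N, C, H, W) feature maps -> (N, C) channel activations."""
    return feature_map.mean(axis=(2, 3))

def mean_magnitude(act):
    """Average absolute channel activation, over channels and the batch."""
    return float(np.abs(act).mean())

def activation_uniformity(act, eps=1e-12):
    """Normalized Shannon entropy of each example's channel distribution.

    1.0 means all channels fire equally (perfectly uniform);
    values near 0 mean a few channels dominate.
    """
    p = np.abs(act) + eps
    p = p / p.sum(axis=1, keepdims=True)   # normalize per example
    ent = -(p * np.log(p)).sum(axis=1)     # entropy per example
    return float((ent / np.log(act.shape[1])).mean())

# Toy illustration: 4 examples, 8 channels, 5x5 spatial maps.
rng = np.random.default_rng(0)
natural = rng.gamma(shape=0.5, scale=1.0, size=(4, 8, 5, 5))  # sparse, peaky channels
adversarial = natural + 0.5  # uniformly boosted stand-in for adversarial activations

nat_act = channel_activation(natural)
adv_act = channel_activation(adversarial)
```

On this toy data, the boosted batch shows both a higher mean magnitude and a higher uniformity score, mirroring the two characteristics highlighted in the text.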
We show that adversarial training can effectively address the high-magnitude problem, yet fails to address the uniform channel activation problem; that is, some redundant and low-contributing channels are still activated. This to some extent explains why adversarial training works but its performance remains unsatisfactory. Motivated by this observation, we develop a channel-wise activation suppressing strategy that is generic, effective, and can be easily incorporated into many existing defense methods. We also provide a comprehensive analysis of the benefit of channel-wise activation suppressing for adversarial robustness.

2. RELATED WORK

Adversarial Defense. Many adversarial defense techniques have been proposed since the discovery of adversarial examples. Among them, many were found to cause obfuscated gradients and can be circumvented by Backward Pass Differentiable Approximation (BPDA), Expectation over Transformation (EOT) or Reparameterization (Athalye et al., 2019). Adversarial training (AT) has been demonstrated to be the most effective defense (Madry et al., 2018; Wang et al., 2019; 2020b), which solves the following min-max optimization problem: min_θ max_{x′∈B_ε(x)} L(F(x′, θ), y), where F is a DNN model with parameters θ, x is a natural example with class label y, x′ is the adversarial example within the L_p-norm ball B_ε(x) = {x′ : ‖x′ − x‖_p ≤ ε} centered at x, and ε is the maximum perturbation size.
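The inner maximization of this min-max objective is typically approximated with projected gradient descent (PGD; Madry et al., 2018). Below is a hedged sketch on a toy logistic-regression model, chosen so the input gradient is analytic and no autograd framework is needed; the model, step size, and perturbation budget are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def loss(x, y, w, b):
    """Logistic loss L = log(1 + exp(-y * (w.x + b))) for a label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * (w @ x + b))))

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L_inf PGD inner maximization: ascend the loss in the sign of the input
    gradient, then project back into the eps-ball around the natural example x."""
    x_adv = x.copy()
    for _ in range(steps):
        margin = y * (w @ x_adv + b)
        # d/dx log(1 + exp(-margin)) = -y * sigmoid(-margin) * w
        grad = -y * w / (1.0 + np.exp(margin))
        x_adv = x_adv + alpha * np.sign(grad)     # gradient ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the L_inf ball
    return x_adv

# Toy run: a random linear model and one natural example.
rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.1
x, y = rng.normal(size=5), 1.0
x_adv = pgd_linf(x, y, w, b)
```

In full adversarial training, the outer minimization then updates θ on the loss at x_adv instead of x; here only the inner loop is shown.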



Therefore, we propose a new training strategy named Channel-wise Activation Suppressing (CAS), which adaptively learns (with an auxiliary classifier) the importance of different channels to the class prediction, and leverages the learned channel importance to adjust the channels dynamically. The robustness of existing state-of-the-art adversarial training methods can be consistently improved when combined with our CAS training strategy. Our key contributions are summarized as follows:

• We identify, from a channel-wise activation perspective, two connections between DNN activation and adversarial robustness: 1) the activations of adversarial examples are of higher magnitudes than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. Adversarial training only addresses the first issue of high activation magnitudes, yet fails to address the second issue of uniform channel activation.

• We propose a novel training strategy to train robust DNN intermediate layers via Channel-wise Activation Suppressing (CAS). During training, CAS suppresses redundant channels dynamically by reweighting the channels based on their contributions to the class prediction. CAS is a generic intermediate-layer robustification technique that can be applied to any DNN along with existing defense methods.

• We empirically show that our CAS training strategy can consistently improve the robustness of current state-of-the-art adversarial training methods.
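A minimal sketch of the CAS reweighting idea, as we read it from the description above: an auxiliary linear classifier is attached to the GAP features of an intermediate layer, and its weight row for the ground-truth class (or the predicted class at inference) serves as a per-channel importance mask that reweights the feature map. The numpy code below uses random placeholder weights rather than trained parameters, so it only illustrates the data flow, not the learned suppression.

```python
import numpy as np

def cas_reweight(feature_map, aux_weights, labels=None):
    """Channel-wise Activation Suppressing (CAS) reweighting, sketched in numpy.

    feature_map : (N, C, H, W) intermediate activations
    aux_weights : (num_classes, C) weights of the auxiliary linear classifier
    labels      : (N,) ground-truth labels (training); if None, the auxiliary
                  classifier's own predictions are used (inference).

    Returns the reweighted feature map and the auxiliary logits (whose
    cross-entropy loss is what trains the channel-importance weights).
    """
    gap = feature_map.mean(axis=(2, 3))   # (N, C) channel-wise activations via GAP
    logits = gap @ aux_weights.T          # (N, num_classes) auxiliary logits
    if labels is None:
        labels = logits.argmax(axis=1)    # fall back to predicted class
    importance = aux_weights[labels]      # (N, C) per-class channel importance
    reweighted = feature_map * importance[:, :, None, None]
    return reweighted, logits

# Toy run: 2 examples, 8 channels, 4x4 maps, 3 classes (weights are placeholders).
rng = np.random.default_rng(0)
fmap = rng.random((2, 8, 4, 4))
aux_w = rng.normal(size=(3, 8))
out, logits = cas_reweight(fmap, aux_w, labels=np.array([0, 2]))
```

Each example's channels are thus scaled by how much they support its class: channels with small or negative importance for that class are suppressed rather than left to respond to adversarial perturbations.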

Availability

//github.com/bymavis

