IMPROVING ADVERSARIAL ROBUSTNESS VIA CHANNEL-WISE ACTIVATION SUPPRESSING

Abstract

The study of adversarial examples and their activations has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant channels from being activated by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activations, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate-layer activations of DNNs.
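The two channel-wise characteristics above can be made concrete with a small sketch. The snippet below (an illustration, not the paper's implementation) computes per-channel mean activation magnitudes of a feature map and a normalized-entropy score of how uniformly the channels are activated; under the paper's observations, adversarial examples would yield both larger magnitudes and a uniformity score closer to 1. The function name and the entropy-based uniformity measure are our own assumptions for exposition.

```python
import numpy as np

def channel_activation_stats(feature_map):
    """Summarize a (C, H, W) feature map channel-wise.

    Returns:
        mags: per-channel mean activation magnitude, shape (C,).
        uniformity: entropy of the normalized channel-magnitude
            distribution, scaled to [0, 1]; 1 means all channels are
            activated equally, 0 means a single channel dominates.
    NOTE: the entropy-based uniformity score is an illustrative choice,
    not the measure used in the paper.
    """
    mags = np.abs(feature_map).mean(axis=(1, 2))
    p = mags / (mags.sum() + 1e-12)          # normalize to a distribution
    entropy = -(p * np.log(p + 1e-12)).sum()
    uniformity = entropy / np.log(len(p))    # max entropy is log(C)
    return mags, uniformity
```

For example, a feature map where only one of four channels fires gives a uniformity near 0, while a map with all channels equally active gives a uniformity near 1.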

1. INTRODUCTION

Deep neural networks (DNNs) have become standard models for solving real-world complex problems, such as image classification (He et al., 2016), speech recognition (Wang et al., 2017), and natural language processing (Devlin et al., 2019). DNNs can approximate extremely complex functions through a series of linear (e.g., convolution) and non-linear (e.g., ReLU activation) operations. Despite their superb learning capabilities, DNNs have been found to be vulnerable to adversarial examples (or attacks) (Szegedy et al., 2014; Goodfellow et al., 2015), where small perturbations on the input can easily subvert the model's prediction. Adversarial examples can transfer across different models (Liu et al., 2017; Wu et al., 2020a; Wang et al., 2021) and remain destructive even in the physical world (Kurakin et al., 2016; Duan et al., 2020), raising safety concerns in autonomous driving (Eykholt et al., 2018) and medical diagnosis (Ma et al., 2021). Existing defense methods against adversarial examples include input denoising (Liao et al., 2018; Bai et al., 2019), defensive distillation (Papernot et al., 2016), gradient regularization (Gu & Rigazio, 2014), model compression (Das et al., 2018), and adversarial training (Goodfellow et al., 2015; Madry et al., 2018; Wang et al., 2019), amongst which adversarial training has demonstrated the most reliable robustness (Athalye et al., 2019; Croce & Hein, 2020b). Adversarial training is a data augmentation technique that trains DNNs on adversarial rather than natural examples. In adversarial training, natural examples are augmented (or perturbed) with the worst-case perturbations found within a small L_p-norm ball around them. This augmentation has been shown to effectively smooth out the loss landscape around the natural examples, and force the network to focus more on the pixels that are most relevant to the class. Apart from these interpretations, it is still not well understood,
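The inner step of adversarial training, finding a worst-case perturbation within a small L_inf-norm ball, can be sketched with projected gradient ascent (PGD, in the style of Madry et al., 2018). The snippet below is a minimal illustration on a toy logistic model with an analytic input gradient, standing in for a DNN; the function name and all hyperparameters are assumptions for exposition, not the paper's code.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps, alpha, steps):
    """L_inf PGD on a toy logistic model p = sigmoid(w.x + b).

    Illustrative stand-in for a DNN: the input gradient of the
    cross-entropy loss is available in closed form as (p - y) * w.
    """
    # Random start inside the eps-ball, as in Madry et al. (2018).
    x_adv = x + np.random.uniform(-eps, eps, x.shape)
    for _ in range(steps):
        z = w @ x_adv + b
        p = 1.0 / (1.0 + np.exp(-z))
        grad_x = (p - y) * w                       # dL/dx for cross-entropy
        x_adv = x_adv + alpha * np.sign(grad_x)    # ascend the loss
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project back into the ball
    return x_adv
```

Adversarial training then simply minimizes the training loss on `x_adv` instead of `x`, so the model learns to be correct on the worst-case points inside each example's perturbation ball.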

Code availability: //github.com/bymavis

