REVISITING ACTIVATION FUNCTION DESIGN FOR IMPROVING ADVERSARIAL ROBUSTNESS AT SCALE

Anonymous

Abstract

Modern ConvNets typically use the ReLU activation function. Recently, smooth activation functions have been used to improve their accuracy. Here we study the role of smooth activation functions from the perspective of adversarial robustness. We find that the ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Replacing ReLU with its smooth alternatives allows adversarial training to find harder adversarial training examples and to compute better gradient updates for network optimization. We focus our study on the large-scale ImageNet dataset. On ResNet-50, switching from ReLU to the smooth activation function SiLU improves adversarial robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. Smooth activation functions also scale well to larger networks: they help EfficientNet-L1 achieve 82.2% accuracy and 58.6% robustness, largely outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness.

1. INTRODUCTION

It is well known that convolutional neural networks can be easily fooled by adversarial examples (Szegedy et al., 2014). To improve robustness, many efforts have been made (Papernot et al., 2016; Guo et al., 2018; Xie et al., 2018; Liu et al., 2018; Pang et al., 2019; Schott et al., 2019); among them, adversarial training (Goodfellow et al., 2015; Kurakin et al., 2017; Madry et al., 2018), which trains networks with adversarial examples generated on-the-fly, stands as one of the most effective. Later studies further improve adversarial training by feeding networks harder adversarial examples (Wang et al., 2019b), maximizing the margin of networks (Ding et al., 2020), optimizing a regularized surrogate loss (Zhang et al., 2019), etc. While these methods achieve stronger adversarial robustness, they sacrifice accuracy on clean inputs. It is generally believed that such a trade-off between accuracy and robustness might be inevitable (Tsipras et al., 2019), unless network capacity is enlarged, e.g., by making networks wider or deeper (Madry et al., 2018; Xie & Yuille, 2020). Another popular direction for increasing robustness against adversarial attacks is gradient masking (Papernot et al., 2017; Xie et al., 2018; Samangouei et al., 2018; Song et al., 2018; Ma et al., 2018; Guo et al., 2018). With degenerated gradient quality, attackers cannot successfully optimize the targeted loss and therefore fail to circumvent such defenses. Nonetheless, gradient masking fails to offer robustness once a differentiable approximation of the masking operation is used for generating adversarial examples (Athalye et al., 2018). To build robust models effectively, we hereby rethink the relationship between gradient quality and adversarial robustness, especially in the context of adversarial training, where gradients are applied more frequently than in standard training.
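Adversarial training uses gradients twice per step: once with respect to the input, to craft the adversarial example, and once with respect to the parameters, to update the network. The sketch below illustrates this double use of gradients on a toy single-example logistic-regression "network"; the setup and all names are illustrative, not the actual training pipeline studied in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grads(w, x, y):
    """Logistic-loss gradients w.r.t. parameters and w.r.t. the input."""
    p = sigmoid(w @ x)
    dz = p - y
    return dz * x, dz * w   # dL/dw, dL/dx

def adv_train_step(w, x, y, eps=0.1, lr=0.5):
    # Gradient use #1: craft a single-step adversarial example by
    # perturbing the input along the sign of its loss gradient.
    _, gx = loss_grads(w, x, y)
    x_adv = x + eps * np.sign(gx)
    # Gradient use #2: update the parameters on the adversarial example.
    gw, _ = loss_grads(w, x_adv, y)
    return w - lr * gw

w = np.zeros(3)
x, y = np.array([1.0, -2.0, 0.5]), 1.0
for _ in range(200):
    w = adv_train_step(w, x, y)
# After training, the model classifies even the perturbed input correctly.
```

Because both uses of the gradient flow through the activation functions of the network, any degradation of gradient quality affects both the inner attack and the outer parameter update.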
In addition to computing gradients to update network parameters, adversarial training also requires gradient computation for generating adversarial training samples. Guided by this principle, we identify that ReLU, a widely used activation function in modern ConvNets, significantly weakens adversarial training due to its non-smooth nature: e.g., ReLU's gradient changes abruptly (from 0 to 1) when its input crosses zero (see Figure 1). In this paper, we revisit activation function design for improving adversarial robustness, with a special focus on the large-scale ImageNet dataset (Russakovsky et al., 2015). To fix the issue induced by ReLU, we propose to apply its smooth approximations[1] to improve gradient quality in adversarial training (Figure 1 shows Parametric Softplus, an example of a smooth approximation of ReLU). Our experimental results show that switching from ReLU to its smooth approximations in adversarial training substantially improves adversarial robustness. For instance, by training with the computationally cheap single-step PGD attacker[2] on ImageNet, the smooth activation function SiLU significantly improves the robustness of the ResNet-50 baseline (which uses ReLU) by 9.3%, from 33.0% to 42.3%, while also increasing standard accuracy by 0.9%. We note that this improvement in both robustness and accuracy comes for "free", as the change of activation function incurs no additional computational cost. We next explore the limits of adversarial training with smooth activation functions using larger networks, and obtain our best result with EfficientNet-L1, which achieves 82.2% accuracy and 58.6% robustness, outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness.
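To make the smoothness contrast concrete, the sketch below compares the gradient of ReLU with those of two smooth alternatives near zero. The parametric form used here, (1/α)·log(1 + exp(αx)), is one common smooth surrogate for ReLU and may differ from the exact parameterization plotted in Figure 1; all function names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Discontinuous at zero: jumps from 0 to 1.
    return (np.asarray(x) > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parametric_softplus(x, alpha=10.0):
    # One common smooth surrogate for ReLU:
    #   f(x) = (1 / alpha) * log(1 + exp(alpha * x)),
    # which approaches ReLU as alpha grows.
    return np.log1p(np.exp(np.asarray(x) * alpha)) / alpha

def parametric_softplus_grad(x, alpha=10.0):
    # The derivative, sigmoid(alpha * x), is continuous everywhere
    # (C1 smooth), unlike ReLU's step-function gradient.
    return sigmoid(np.asarray(x) * alpha)

def silu(x):
    # SiLU: x * sigmoid(x), another smooth ReLU alternative.
    return x * sigmoid(x)

# Near zero, ReLU's gradient flips abruptly from 0 to 1,
# while the smooth surrogate's gradient stays near 0.5.
eps = 1e-6
print(relu_grad(-eps), relu_grad(eps))                                # 0.0 1.0
print(parametric_softplus_grad(-eps), parametric_softplus_grad(eps))  # ~0.5 ~0.5
```

Since attack steps ascend the loss gradient with respect to the input, a continuous first derivative gives the attacker (and the subsequent parameter update) a more informative signal around the activation threshold.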
Existing works suggest that, to further improve adversarial robustness, we need to either sacrifice accuracy on clean inputs (Wang et al., 2019b; 2020; Zhang et al., 2019; Ding et al., 2020) or incur additional computational cost (Madry et al., 2018; Xie & Yuille, 2020; Xie et al., 2019). This phenomenon is referred to as the no free lunch in adversarial robustness (Tsipras et al., 2019; Nakkiran, 2019; Su et al., 2018). In this paper, we show that, by using smooth activation functions in adversarial training, adversarial robustness can be improved for "free": no accuracy degradation on clean images and no additional computational cost incurred. Our work is also related to the theoretical study of Sinha et al. (2018), which shows that replacing ReLU with smooth alternatives helps networks obtain a tractable bound when certifying distributional robustness. In this paper, we empirically corroborate that the benefit of smooth activations is also observable in practical adversarial training on a real-world dataset with large networks.

Gradient masking. Besides training models on adversarial data, other ways of improving adversarial robustness include defensive distillation (Papernot et al., 2016), gradient discretization (Buckman et al., 2018; Rozsa & Boult, 2019; Xiao et al., 2019), dynamic network architectures (Dhillon et al., 2018; Wang et al., 2018; 2019a; Liu et al., 2018; Lee et al., 2020; Luo et al., 2020), randomized transformations (Xie et al., 2018; Bhagoji et al., 2018; Xiao & Zheng, 2020; Raff et al., 2019; Kettunen et al., 2019; AprilPyone & Kiya, 2020), adversarial input denoising/purification (Guo et al., 2018; Prakash et al., 2018; Meng & Chen, 2017; Song et al., 2018; Samangouei et al., 2018; Liao et al., 2018; Bhagoji et al., 2018; Pang et al., 2020; Xu et al., 2017; Dziugaite et al., 2016), etc. Nonetheless, most of these defense methods degrade the gradient quality of the protected models and can therefore induce the gradient masking issue (Papernot et al., 2017). As argued by Athalye et al. (2018), defense methods that rely on gradient masking may offer a false sense of adversarial robustness.

In contrast to these works, we aim to improve adversarial robustness by providing networks with better gradients, but in the context of adversarial training.

[1] The term "smooth" here refers to the function satisfying the property of being C¹ smooth, i.e., its first derivative is continuous everywhere.

[2] In practice, we note that single-step PGD adversarial training is only ∼1.5× slower than standard training.

Figure 1: Visualization of ReLU and Parametric Softplus. Left panel: the forward pass for ReLU (blue curve) and Parametric Softplus (red curve). Right panel: the first derivatives for ReLU (blue curve) and Parametric Softplus (red curve). Unlike ReLU, Parametric Softplus is smooth, with continuous derivatives.


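As a concrete illustration of the gradient masking failure mode, the toy sketch below (a hypothetical setup, not from this paper) attacks a non-differentiable input-quantization "defense" by backpropagating through the identity function as its differentiable approximation, following the idea of Athalye et al. (2018).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quantize(x, levels=4):
    # A gradient-masking style input "defense": discretizing the input
    # makes its true gradient zero almost everywhere.
    return np.round(np.asarray(x) * levels) / levels

def model(w, x):
    return sigmoid(w @ x)

def bpda_attack(w, x, y, eps=0.75, steps=10):
    """Attack the quantized model by backpropagating through the
    identity as a differentiable approximation of `quantize` (BPDA)."""
    x_adv = x.copy()
    for _ in range(steps):
        p = model(w, quantize(x_adv))   # forward through the real defense
        gx = (p - y) * w                # backward as if quantize were identity
        x_adv = np.clip(x_adv + (eps / steps) * np.sign(gx), x - eps, x + eps)
    return x_adv

w = np.array([1.0, -1.0])
x = np.array([0.6, -0.6])
x_adv = bpda_attack(w, x, y=1.0)
print(model(w, quantize(x)))      # clean input: classified as the true class
print(model(w, quantize(x_adv)))  # adversarial input: prediction flipped
```

The naive attack through `quantize` receives a zero input gradient and makes no progress, which is exactly the false sense of robustness that gradient masking provides; the differentiable approximation removes it.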