LAYER-WISE ADVERSARIAL DEFENSE: AN ODE PERSPECTIVE

Abstract

Deep neural networks are observed to be fragile against adversarial attacks, which dramatically limits their practical applicability. For improving model robustness, adversarial training techniques have proven effective and have gained increasing attention from the research community. Existing adversarial training approaches mainly focus on perturbations to the inputs, while the effect of perturbations in hidden layers remains underexplored. In this work, we propose layer-wise adversarial defense, which improves adversarial training by a noticeable margin. The basic idea of our method is to strengthen all of the hidden layers with perturbations that are proportional to the back-propagated gradients. To study the layer-wise neural dynamics, we formulate our approach from the perspective of ordinary differential equations (ODEs) and show how it extends conventional adversarial training methods, further tightening the connection between neural networks and ODEs. For implementation, we derive two training algorithms by discretizing the ODE model with the Lie-Trotter and the Strang-Marchuk splitting schemes from operator-splitting theory. Experiments on the CIFAR-10 and CIFAR-100 benchmarks show that our methods consistently improve adversarial robustness on top of widely used strong adversarial training techniques.

1. INTRODUCTION

Recent years have witnessed the prosperity of deep learning in many tasks (Hinton & Salakhutdinov, 2006; Sutskever et al., 2014; He et al., 2016; LeCun et al., 2015; Huang et al., 2017; Vaswani et al., 2017). Stacked with multiple layers, neural networks provide end-to-end solutions to these tasks and prove to be highly effective. However, the seminal study by Szegedy et al. (2013) showed that deep neural networks (DNNs) can be fragile against attacks: minor perturbations on the inputs lead to significant changes in model predictions. In response, a rich body of adversarial defense techniques has been proposed (Athalye et al., 2018a; Goodfellow et al., 2014; Zheng et al., 2016; Madry et al., 2018; Zhang et al., 2019b; Kurakin et al., 2017; Pang et al., 2019a; 2020; 2019b; Raff et al., 2019; Guo et al., 2018; Zhang et al., 2020a; Balunovic & Vechev, 2019; Wong et al., 2020; Chan et al., 2020; Zhang et al., 2020b). Among these techniques, adversarial training algorithms (Madry et al., 2018; Zhang et al., 2019b) incorporate the effect of perturbed inputs into the loss function; they have proven competitive and have had a dominant impact on the adversarial defense research field. While adversarial training has gained increasing attention in the robust deep learning community, most current approaches concentrate on deriving perturbations to the inputs from gradients back-propagated through the loss function. However, since the information flow in a neural network starts from the inputs and passes through the hidden layers, it is essential to robustify both the inputs and the hidden layers. Previous studies have made successful attempts at introducing damping terms (Yang et al., 2020) or stochastic noise (Liu et al., 2020; Wang et al., 2019) into each layer of neural architectures, but they concentrate on improving general model robustness and are less focused on adversarial robustness.
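To make the baseline concrete, the following is a minimal sketch of the kind of input-perturbation adversarial training the paragraph describes: at each step, an FGSM-style perturbation x + eps * sign(∇x L) (Goodfellow et al., 2014) is crafted against the current model, and the weights are then updated on the perturbed inputs. The logistic-regression model, synthetic data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, x, y):
    """Binary cross-entropy for logistic regression.
    Returns the loss, the gradient w.r.t. the weights, and the
    gradient w.r.t. the inputs (used to craft the perturbation)."""
    p = sigmoid(x @ w)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()
    err = (p - y) / len(y)        # dL/dz for each example
    grad_w = x.T @ err            # chain rule through z = x @ w
    grad_x = np.outer(err, w)     # dL/dx for each example
    return loss, grad_w, grad_x

# Toy data: label depends only on the first feature (an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
y = (x[:, 0] > 0).astype(float)
w = np.zeros(5)
eps, lr = 0.1, 1.0

for _ in range(200):
    # Craft the FGSM perturbation against the current model.
    _, _, gx = loss_and_grads(w, x, y)
    x_adv = x + eps * np.sign(gx)
    # Adversarial training: minimize the loss on perturbed inputs.
    _, gw, _ = loss_and_grads(w, x_adv, y)
    w -= lr * gw

clean_acc = ((sigmoid(x @ w) > 0.5) == y).mean()
```

Note that the perturbation here touches only the input x; the layer-wise defense proposed in this work additionally perturbs the hidden activations, which has no analogue in this single-layer sketch.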
We ask the following question: Can we take the hidden layers of neural networks into account to improve adversarial model robustness? In this work, we propose layer-wise adversarial defense to improve adversarial training, which enhances adversarial model robustness by stabilizing both inputs and hidden layers. In our method,

