LAYER-WISE ADVERSARIAL DEFENSE: AN ODE PERSPECTIVE

Abstract

Deep neural networks are known to be fragile under adversarial attacks, which severely limits their practical applicability. Among approaches to improving model robustness, adversarial training techniques have proven effective and have gained increasing attention from the research community. Existing adversarial training approaches mainly focus on perturbing the inputs, while the effect of perturbations in the hidden layers remains underexplored. In this work, we propose layer-wise adversarial defense, which improves adversarial training by a noticeable margin. The basic idea of our method is to strengthen all of the hidden layers with perturbations proportional to the back-propagated gradients. To study the layer-wise neural dynamics, we formulate our approach from the perspective of ordinary differential equations (ODEs) and establish its extended relationship with conventional adversarial training methods, which tightens the connection between neural networks and ODEs. For the implementation, we derive two training algorithms by discretizing the ODE model with the Lie-Trotter and the Strang-Marchuk splitting schemes from operator-splitting theory. Experiments on the CIFAR-10 and CIFAR-100 benchmarks show that our methods consistently improve adversarial model robustness on top of widely used strong adversarial training techniques.

1. INTRODUCTION

Recent years have witnessed the prosperity of deep learning in many tasks (Hinton & Salakhutdinov, 2006; Sutskever et al., 2014; He et al., 2016; LeCun et al., 2015; Huang et al., 2017; Vaswani et al., 2017). Stacked with multiple layers, neural networks provide highly effective end-to-end solutions to these tasks. However, the seminal study by Szegedy et al. (2013) showed that deep neural networks (DNNs) can be fragile under attack: minor perturbations on the inputs lead to significant changes in model predictions. In response, intensive studies on adversarial defense techniques have been conducted (Athalye et al., 2018a; Goodfellow et al., 2014; Zheng et al., 2016; Madry et al., 2018; Zhang et al., 2019b; Kurakin et al., 2017; Pang et al., 2019a; 2020; 2019b; Raff et al., 2019; Guo et al., 2018; Zhang et al., 2020a; Balunovic & Vechev, 2019; Wong et al., 2020; Chan et al., 2020; Zhang et al., 2020b). Among these techniques, adversarial training algorithms (Madry et al., 2018; Zhang et al., 2019b) incorporate the effect of perturbed inputs into the loss function; they have proven highly competitive and have had a dominant impact on the adversarial defense research field. While adversarial training has gained increasing attention in the robust deep learning community, most current approaches concentrate on deriving perturbations of the inputs using gradients back-propagated from the loss function. However, as the information flow in a neural network starts from the inputs and passes through the hidden layers, it is essential to robustify both the inputs and the hidden layers. Previous studies have made successful attempts at introducing damping terms (Yang et al., 2020) or stochastic noise (Liu et al., 2020; Wang et al., 2019) into each layer of neural architectures, but they concentrate on improving general model robustness and are less focused on adversarial model robustness.
We ask the following question: Can we take the hidden layers of neural networks into account to improve adversarial model robustness? In this work, we propose layer-wise adversarial defense, which enhances adversarial model robustness by stabilizing both the inputs and the hidden layers. In our method, layer-wise perturbations are incorporated into the robust optimization framework of adversarial training: we inject scaled back-propagated gradients into the architecture as layer-wise perturbations. Furthermore, we formulate our method from the perspective of ordinary differential equations and propose a novel ODE as its continuous limit in order to study the neural dynamics. Drawing on the rich literature on numerical analysis, we use the Lie-Trotter and the Strang-Marchuk splitting schemes to solve the proposed ODE, and refer to the resulting discrete algorithms as Layer-wise Adversarial Defense (LAD) and LAD-SM, respectively. We also build up the extended relationship between our methods and current natural training and adversarial training techniques by analyzing the second-order dynamics. Our analysis shows that, compared with current adversarial training algorithms, our methods introduce additional perturbations in the first-order initial value of the second-order dynamics. Experiments on the CIFAR-10 and CIFAR-100 benchmarks show that our methods improve adversarial model robustness on top of different widely used strong adversarial training techniques.
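The core idea of injecting scaled back-propagated gradients as layer-wise perturbations can be sketched on a tiny NumPy network. This is an illustrative assumption-laden toy (the two-layer network, MSE loss, and scale eta are ours, not the paper's exact algorithm):

```python
# Illustrative sketch: perturb a hidden layer along its back-propagated
# gradient, scaled by eta. The network, loss, and eta are assumptions.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

def forward(x, hidden_perturb=None):
    h = np.tanh(W1 @ x)              # hidden activation
    if hidden_perturb is not None:   # layer-wise adversarial perturbation
        h = h + hidden_perturb
    z = W2 @ h                       # output logits
    return h, z

def mse_loss(z, y):
    return 0.5 * np.sum((z - y) ** 2)

def hidden_grad(x, y):
    """Back-propagate the loss gradient to the hidden activation h."""
    h, z = forward(x)
    dz = z - y                       # dL/dz for the MSE loss
    return W2.T @ dz                 # dL/dh

x = np.array([0.1, -0.2, 0.3])
y = np.array([1.0, 0.0])
g_h = hidden_grad(x, y)
eta = 0.05                           # assumed perturbation scale

# A gradient-ascent perturbation at the hidden layer increases the loss,
# so training against it stabilizes that layer.
_, z_clean = forward(x)
_, z_pert = forward(x, hidden_perturb=eta * g_h)
```

In the paper's framework these perturbed hidden states would then enter the adversarial training objective; the sketch only shows how the perturbation itself is formed.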
We summarize our contributions as follows:
• We propose layer-wise adversarial defense, which generalizes conventional adversarial training approaches with layer-wise adversarial perturbations (Section 3.1);
• We investigate the continuous limit of our layer-wise adversarial defense methods and propose an ODE that integrates the adjoint state into the forward dynamics (Section 3.2);
• We build up the extended relationship between our methods and current adversarial training approaches by analyzing the second-order neural dynamics in theory; experiments also show the effectiveness of our methods in practice (Section 3.3 and Section 4).
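As a numerical aside on the two discretization schemes named above: the Lie-Trotter scheme is first-order accurate while the Strang-Marchuk scheme is second-order. The toy linear ODE and nilpotent matrices below are our illustrative assumptions (chosen so each sub-flow is exactly computable), not the paper's forward-adjoint dynamics:

```python
# Toy demonstration of splitting-scheme accuracy on dx/dt = (A + B)x.
# A and B are nilpotent and non-commuting, so exp(hA) = I + hA exactly.
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])
I = np.eye(2)

def expA(h):
    return I + h * A

def expB(h):
    return I + h * B

def solve(x0, h, steps, scheme):
    x = x0.copy()
    for _ in range(steps):
        if scheme == "lie_trotter":
            # Lie-Trotter: exp(hA) exp(hB), first-order accurate
            x = expA(h) @ (expB(h) @ x)
        else:
            # Strang-Marchuk: exp(hA/2) exp(hB) exp(hA/2), second-order
            x = expA(h / 2) @ (expB(h) @ (expA(h / 2) @ x))
    return x

x0 = np.array([1.0, 0.0])
T = 1.0
# Exact flow: A + B = [[0,1],[1,0]], so x(T) = [cosh(T), sinh(T)].
exact = np.array([np.cosh(T), np.sinh(T)])

def err(scheme, n):
    return np.linalg.norm(solve(x0, T / n, n, scheme) - exact)

# Halving the step size roughly halves the Lie-Trotter error
# but quarters the Strang-Marchuk error.
ratio_lt = err("lie_trotter", 64) / err("lie_trotter", 128)
ratio_sm = err("strang", 64) / err("strang", 128)
```

The second-order accuracy of Strang-Marchuk is what motivates LAD-SM as the more accurate discretization of the proposed ODE.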

2. RELATED WORK

2.1. ADVERSARIAL MODEL ROBUSTNESS

In this section we review the literature on gradient-based attack and defense approaches in the field of adversarial model robustness. For adversarial attacks, widely used approaches include the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and the Iterated Fast Gradient Sign Method (IFGSM) (Madry et al., 2018). For a given data point, FGSM induces the adversarial example by moving each component by the attack radius along the gradient ascent direction. IFGSM performs FGSM over inner iterations with a smaller step size α. These studies have inspired multiple further adversarial attack techniques (Athalye et al., 2018b; Carlini & Wagner, 2017; Ilyas et al., 2018; Dong et al., 2018; Pang et al., 2018). Adversarial defense techniques can be categorized into training-phase (Athalye et al., 2018a; Goodfellow et al., 2014; Zheng et al., 2016; Madry et al., 2018; Zhang et al., 2019b; Kurakin et al., 2017; Pang et al., 2019a; 2020; Zhang et al., 2020a; Balunovic & Vechev, 2019; Wong et al., 2020; Chan et al., 2020; Zhang et al., 2020b) and inference-phase (Pang et al., 2019b; Raff et al., 2019; Xie et al., 2018; Guo et al., 2018) approaches. A widely used training-phase approach is Projected Gradient Descent (PGD) training (Madry et al., 2018), which integrates the effect of perturbed inputs into its loss function. The current state-of-the-art training-phase defense is TRADES (Zhang et al., 2019b), which additionally introduces the boundary error as a regularization term in its loss function. In our experiments, we select PGD training and TRADES as our baselines. While substantially enhancing adversarial model robustness, the gradient-based perturbations in adversarial training are currently performed only on the inputs. As cascaded hidden layers form the passage for information flow in neural networks, it is essential to stabilize the hidden layers as well.
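The FGSM and IFGSM updates described above can be made concrete on a toy model. The sketch below uses a hand-rolled logistic-regression "network" in NumPy so the input gradient is available in closed form; the weights, data, and hyperparameters are illustrative assumptions:

```python
# Hedged sketch of FGSM and IFGSM on a toy logistic-regression model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w, b):
    """Binary cross-entropy loss and its gradient w.r.t. the input x."""
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w          # dL/dx for logistic regression

def fgsm(x, y, w, b, eps):
    """One step of size eps along the sign of the input gradient."""
    _, g = loss_and_input_grad(x, y, w, b)
    return x + eps * np.sign(g)

def ifgsm(x, y, w, b, eps, alpha, steps):
    """Iterated FGSM: smaller steps alpha, clipped back into the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        _, g = loss_and_input_grad(x_adv, y, w, b)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

w, b = np.array([1.0, -2.0]), 0.0     # assumed fixed model
x, y = np.array([0.5, 0.5]), 1.0      # assumed data point
x_fgsm = fgsm(x, y, w, b, eps=0.1)
x_ifgsm = ifgsm(x, y, w, b, eps=0.1, alpha=0.05, steps=3)

loss_clean, _ = loss_and_input_grad(x, y, w, b)
loss_fgsm, _ = loss_and_input_grad(x_fgsm, y, w, b)
loss_ifgsm, _ = loss_and_input_grad(x_ifgsm, y, w, b)
# Both attacks stay inside the eps-ball and increase the loss.
```

PGD training then minimizes the loss evaluated at such adversarial examples rather than at the clean inputs.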
In our work, we introduce layer-wise gradient-based perturbations to neural architectures to improve adversarial model robustness.

2.2. ODE-INSPIRED ARCHITECTURE DESIGNS

Research on the relationship between neural networks and ODEs started with the continuous-limit formulation of ResNet (E, 2017), which has inspired many novel neural architecture designs (Lu et al., 2018; Zhu et al., 2018; Chang et al., 2018; Haber & Ruthotto, 2017; Chen et al., 2018; Dupont et al., 2019). Regarding model robustness, most prior studies have focused on improving dynamical-system stability via Lyapunov analysis, more stable numerical schemes, or regularization.
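The continuous-limit view mentioned above can be illustrated in a few lines: a stack of residual updates x ← x + h·f(x) is the forward-Euler discretization of dx/dt = f(x). The scalar dynamics f(x) = -x below is our illustrative assumption, chosen because the exact flow x(t) = x0·exp(-t) is known:

```python
# Toy illustration: a residual stack as forward-Euler for dx/dt = -x.
import math

def residual_stack(x0, depth, T):
    """Apply `depth` residual blocks x <- x + h*f(x) over total time T."""
    h = T / depth                    # step size shrinks as depth grows
    x = x0
    for _ in range(depth):
        x = x + h * (-x)             # one residual block with f(x) = -x
    return x

x0, T = 1.0, 1.0
shallow = residual_stack(x0, depth=10, T=T)
deep = residual_stack(x0, depth=1000, T=T)
exact = x0 * math.exp(-T)
# The deeper stack tracks the exact ODE solution more closely.
```

This is the sense in which ResNet's continuous limit is an ODE, and it is the starting point for the ODE formulation of our layer-wise defense.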

