ON INTRIGUING LAYER-WISE PROPERTIES OF ROBUST OVERFITTING IN ADVERSARIAL TRAINING

Abstract

Adversarial training has proven to be one of the most effective methods to defend against adversarial attacks. Nevertheless, robust overfitting is a common obstacle in the adversarial training of deep networks. It is commonly believed that the features learned by different network layers have different properties; however, existing works generally investigate robust overfitting by treating a DNN as a single unit, and hence the impact of different network layers on robust overfitting remains unclear. In this work, we divide a DNN into a series of layers and investigate the effect of different network layers on robust overfitting. We find that different layers exhibit distinct properties with respect to robust overfitting; in particular, robust overfitting is mostly related to the optimization of the latter parts of the network. Based on the observed effect, we propose a robust adversarial training (RAT) prototype: in a mini-batch, we optimize the front parts of the network as usual, and adopt additional measures to regularize the optimization of the latter parts. Based on this prototype, we design two realizations of RAT, and extensive experiments demonstrate that RAT can eliminate robust overfitting and boost adversarial robustness over standard adversarial training.

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied in many fields, such as computer vision (He et al., 2016) and natural language processing (Devlin et al., 2018). Despite this success, recent studies show that DNNs are vulnerable to adversarial examples: well-constructed perturbations of the input images, imperceptible to human eyes, can lead DNNs to a completely different prediction (Szegedy et al., 2013). The security concerns raised by this weakness have motivated numerous works on improving the robustness of DNNs against adversarial examples. Among existing defense techniques, Adversarial Training (AT) (Goodfellow et al., 2014; Madry et al., 2017), which optimizes DNNs with adversarially perturbed data instead of natural data, is the most effective approach (Athalye et al., 2018). However, it has been shown that networks trained with AT do not generalize well (Rice et al., 2020): after a certain point in AT, shortly after the first learning rate decay, the robust test accuracy continues to decrease with further training. Typical regularization practices for mitigating overfitting, such as l_1 and l_2 regularization, weight decay, and data augmentation, are reported to be ineffective compared to simple early stopping (Rice et al., 2020). Many studies have attempted to narrow the robust generalization gap in AT, but most investigate robust overfitting by treating the DNN as a whole. However, DNNs trained on natural images exhibit a common phenomenon: features learned in the first layers appear general and widely applicable, while features computed by the last layers depend on the particular dataset and task (Yosinski et al., 2014). This behavior of DNNs raises a question: do different layers contribute differently to robust overfitting?
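To make the notion of "adversarially perturbed data" concrete, the following is a minimal sketch of a one-step (FGSM-style) perturbation for a toy logistic-regression model. The model, weights, and values are purely illustrative (not from the paper); for this model the input gradient has a closed form, so no autodiff library is needed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, x, y):
    # Binary cross-entropy for a single example, label y in {0, 1}.
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturb(w, x, y, eps):
    # For logistic regression, d(loss)/dx = (p - y) * w in closed form.
    grad_x = (sigmoid(w @ x) - y) * w
    # One-step l_inf attack: move each coordinate eps along the gradient sign.
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])   # toy "trained" weights (illustrative)
x = np.array([0.2, -0.1, 0.4])   # clean input
y = 1.0
x_adv = fgsm_perturb(w, x, y, eps=0.1)
```

The perturbed input stays within an l_inf ball of radius eps around x while increasing the loss, which is exactly the kind of example AT trains on.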
Intuitively, robust overfitting is an unexpected optimization state in adversarial training, and its occurrence may be closely related to the entire network. Nevertheless, the distinct effect of different network layers on robust overfitting remains unclear. Without a detailed understanding of the layer-wise mechanism of robust overfitting, it is difficult to fully demystify the exact underlying cause of the phenomenon.

In this work, we systematically investigate the impact of robust overfitting on different layers. To do this, we first fix the parameters of selected layers, leaving them unoptimized during AT, and optimize the remaining layers normally. We find that robust overfitting is always mitigated when the latter layers are left unoptimized, whereas applying the same treatment to other layers has little effect on robust overfitting, suggesting a strong connection between the optimization of the latter layers and the overfitting phenomenon. Based on the observed effect, we propose a robust adversarial training (RAT) prototype to relieve the issue of robust overfitting. Specifically, RAT works on each mini-batch: it optimizes the front layers as usual, and for the latter layers it applies additional measures to regularize their optimization. RAT is a general adversarial training prototype: the front and latter network layers can be separated by simple test experiments, and the additional measures used to regularize layer optimization can be versatile. For instance, we design two representative realizations of RAT, RAT_LR and RAT_WP, which adopt different strategies to hinder the weight update, i.e., enlarging the learning rate and perturbing the weights, respectively. Extensive experiments show that the proposed RAT prototype effectively eliminates robust overfitting.
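The layer-freezing diagnostic described above can be sketched as an SGD step that skips the named parameter groups. This is a toy illustration, not the paper's implementation; the layer names and shapes are invented for the example.

```python
import numpy as np

def sgd_step(params, grads, lr, frozen=()):
    """One SGD step that leaves the named parameter groups unoptimized.

    Passing e.g. frozen=("fc",) mimics the diagnostic in the text, where the
    latter layers are fixed while the front layers are optimized as usual.
    """
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in params.items()
    }

# Toy parameters: one "front" layer and one "latter" layer (names illustrative).
params = {"layer1": np.ones(4), "fc": np.ones(4)}
grads = {"layer1": np.full(4, 0.5), "fc": np.full(4, 0.5)}
new = sgd_step(params, grads, lr=0.1, frozen=("fc",))
```

A RAT-style realization would replace the hard freeze with a softer regularizer on the latter group, e.g. a rescaled learning rate (RAT_LR) or a perturbation of the weights before the update (RAT_WP).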
The contributions of this work are summarized as follows:

• We provide the first layer-wise diagnosis of robust overfitting, and find that there is a strong connection between the optimization of the latter layers and the robust overfitting phenomenon.

• Based on the observed properties of robust overfitting, we propose the RAT prototype, which adopts additional measures to regularize the optimization of the latter layers and is tailored to prevent robust overfitting.

• We design two different realizations of RAT and verify their effectiveness with extensive experiments on a number of standard benchmarks.

2. RELATED WORK

2.1. ADVERSARIAL TRAINING

Since the discovery of adversarial examples, many defensive methods have attempted to improve the robustness of DNNs against such adversaries, such as adversarial training (Madry et al., 2017), defensive distillation (Papernot et al., 2016), input denoising (Liao et al., 2018), and gradient regularization (Tramèr et al., 2018). So far, adversarial training (Madry et al., 2017) has proven to be the most effective method. Adversarial training comprises two optimization problems: the inner maximization, which constructs adversarial examples by maximizing the loss, and the outer minimization, which updates the weights by minimizing the loss on the adversarial data:

ℓ_AT(w) = min_w Σ_i max_{d(x_i, x'_i) ≤ ϵ} ℓ(f_w(x'_i), y_i),

where f_w is the DNN classifier with weights w, ℓ(·) is the loss function, d(·, ·) specifies the distance between the original input x_i and the adversarial example x'_i, which is usually an l_p-norm ball such as the l_2 or l_∞-norm ball, and ϵ is the maximum perturbation allowed.

2.2. ROBUST GENERALIZATION

An interesting characteristic of deep neural networks (DNNs) is their ability to generalize well in practice (Belkin et al., 2019). In the standard training setting, the test loss is observed to keep decreasing over long periods of training (Nakkiran et al., 2020), so the common practice is to train DNNs for as long as possible. However, this is no longer the case in adversarial training, where robustness degrades the longer training runs (Rice et al., 2020). This phenomenon has been referred to as "robust overfitting" and has shown strong resistance to standard regularization techniques such as l_1 and l_2 regularization and data augmentation (Rice et al., 2020).
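The inner maximization of the adversarial training objective is commonly approximated with projected gradient descent (PGD) (Madry et al., 2017). Below is a minimal numpy sketch for a toy logistic model where the input gradient has a closed form; all names and values are illustrative, not the paper's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_inner_max(w, x, y, eps, alpha, steps):
    """Approximate the inner maximization with l_inf projected gradient ascent.

    For logistic regression the input gradient is (p - y) * w in closed form,
    so no autodiff is needed; eps is the l_inf radius, alpha the step size.
    """
    x_adv = x.copy()
    for _ in range(steps):
        grad_x = (sigmoid(w @ x_adv) - y) * w
        x_adv = x_adv + alpha * np.sign(grad_x)   # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])   # toy weights (illustrative)
x = np.array([0.2, -0.1, 0.4])   # clean input
x_adv = pgd_inner_max(w, x, y=1.0, eps=0.1, alpha=0.05, steps=5)
```

The outer minimization would then take a gradient step on w using the loss at x_adv, completing one iteration of the min-max problem above.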

