IMPROVING ADVERSARIAL ROBUSTNESS BY PUTTING MORE REGULARIZATIONS ON LESS ROBUST SAMPLES

Abstract

Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data that deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to existing algorithms. A novel feature of the proposed algorithm is that it applies more regularization to data vulnerable to adversarial attacks than existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as minimizing a newly derived upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) simultaneously, achieving state-of-the-art performance.

1. INTRODUCTION

It is easy to generate human-imperceptible perturbations that fool the prediction of a deep neural network (DNN). Such perturbed samples are called adversarial examples (Szegedy et al., 2014), and algorithms for generating adversarial examples are called adversarial attacks. It is well known that adversarial attacks can greatly reduce the accuracy of DNNs, for example from about 96% accuracy on clean data to almost zero accuracy on adversarial examples (Madry et al., 2018). This vulnerability of DNNs can cause serious security problems when DNNs are applied to security-critical applications (Kurakin et al., 2017; Jiang et al., 2019) such as medicine (Ma et al., 2020; Finlayson et al., 2019) and autonomous driving (Kurakin et al., 2017; Deng et al., 2020; Morgulis et al., 2019; Li et al., 2020).

Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention. Existing adversarial training algorithms can be categorized into two types. The first type learns prediction models by minimizing the robust risk, i.e., the risk on adversarial examples. PGD-AT (Madry et al., 2018) is the first of its kind, and various modifications, including Zhang et al. (2020); Ding et al. (2020); Zhang et al. (2021), have been proposed since then. The second type minimizes a regularized risk, which is the sum of the empirical risk on clean examples and a regularization term related to adversarial robustness. TRADES (Zhang et al., 2019) decomposes the robust risk into the sum of the natural risk and the boundary risk, where the former is the risk on clean examples and the latter is the remaining part, and replaces them with their upper bounds to obtain the regularized risk. HAT (Rade & Moosavi-Dezfooli, 2022) modifies the regularization term of TRADES by adding an additional term based on helper samples.

The aim of this paper is to develop a new adversarial training algorithm for DNNs that is theoretically well motivated and empirically superior to existing competitors. Our algorithm modifies the regularization term of TRADES (Zhang et al., 2019) to put more regularization on less robust samples. This new regularization term is motivated by an upper bound of the boundary risk. Our proposed regularization term is similar to that used in MART (Wang et al., 2020). The two key differences are that (1) the objective function of MART is the sum of the robust risk and a regularization term while ours is the sum of the natural risk and a regularization term, and (2) our algorithm regularizes less robust samples more, whereas MART regularizes less accurate samples more.
Note that our algorithm is theoretically motivated by an upper bound of the robust risk, while no such theoretical explanation is available for MART. In numerical studies, we demonstrate that our algorithm outperforms both MART and TRADES by large margins.
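To make the contrast with TRADES concrete, the sketch below evaluates a TRADES-style per-sample objective: clean cross-entropy plus a KL regularizer between the clean and adversarial predictive distributions, with a vulnerability-dependent weight on the regularizer. The weight w = 1 − p_adv[y] is a hypothetical stand-in for "less robust samples get more regularization"; the exact weighting in our algorithm is derived later from an upper bound of the boundary risk, so treat this only as an illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, tiny=1e-12):
    # KL divergence between two discrete distributions
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))))

def weighted_trades_objective(logits_clean, logits_adv, y, beta):
    """Natural risk (clean cross-entropy) plus a KL regularizer whose
    weight grows as the adversarial prediction for the true class shrinks
    (illustrative weighting, not the paper's exact form)."""
    total = 0.0
    for z, z_adv, yi in zip(logits_clean, logits_adv, y):
        p, p_adv = softmax(z), softmax(z_adv)
        ce = -np.log(p[yi] + 1e-12)   # natural (clean) cross-entropy
        w = 1.0 - p_adv[yi]           # hypothetical vulnerability weight
        total += ce + beta * w * kl(p, p_adv)
    return total / len(y)
```

Plain TRADES corresponds to dropping the weight w (setting it to 1 for every sample); MART instead ties its weighting to the clean prediction's accuracy.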

1.1. OUR CONTRIBUTIONS

We propose a new adversarial training algorithm. Compared to other existing adversarial training algorithms, it is theoretically well motivated and empirically superior. Our contributions can be summarized as follows:
• We derive an upper bound of the robust risk for multi-class classification problems.
• As a surrogate version of this upper bound, we propose a new regularized risk.
• We develop an adversarial training algorithm that learns a robust prediction model by minimizing the proposed regularized risk.
• By analyzing benchmark data sets, we show that our proposed algorithm improves generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) simultaneously, achieving state-of-the-art performance.
• We illustrate that our algorithm helps improve the fairness of the prediction model, in the sense that the per-class error rates become more similar than under TRADES.

2. PRELIMINARIES

2.1. ROBUST POPULATION RISK

Let X ⊂ R^d be the input space, Y = {1, …, C} be the set of output labels, and f_θ : X → R^C be the score function parameterized by the neural network parameters θ (the vector of weights and biases), such that p_θ(·|x) = softmax(f_θ(x)) is the vector of conditional class probabilities. Let F_θ(x) = argmax_c [f_θ(x)]_c, B_p(x, ε) = {x′ ∈ X : ∥x − x′∥_p ≤ ε}, and let 1(·) be the indicator function. Capital letters X, Y denote random variables or vectors and small letters x, y denote their realizations. The robust population risk used in adversarial training is defined as

R_rob(θ) := E_(X,Y) [ max_{X′ ∈ B_p(X,ε)} 1{F_θ(X′) ≠ Y} ],  (1)

where X and Y are a random vector in X and a random variable in Y, respectively. Most adversarial training algorithms learn θ by minimizing an empirical version of the robust population risk. In turn, most empirical versions of (1) require generating an adversarial example, which is a surrogate of

x_adv := argmax_{x′ ∈ B_p(x,ε)} 1{F_θ(x′) ≠ y}.

Any method of generating an adversarial example is called an adversarial attack.
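As a toy instance of the definitions above, the following numpy sketch estimates an empirical 0-1 robust risk for a linear score function, approximating the inner maximization over B_p(x, ε) by a finite set of candidate perturbations. The linear model and the perturbation grid are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def predict(W, x):
    # F_theta(x) = argmax_c [f_theta(x)]_c, with a linear score f_theta(x) = W @ x
    return int(np.argmax(W @ x))

def empirical_robust_risk(W, X, y, candidate_deltas):
    """Empirical version of R_rob in (1): the inner max over the full ball
    B_p(x, eps) is approximated by the finite set `candidate_deltas`."""
    errs = []
    for xi, yi in zip(X, y):
        # max over perturbations of the 0-1 loss 1{F_theta(x') != y}
        wrong = max(int(predict(W, xi + d) != yi) for d in candidate_deltas)
        errs.append(wrong)
    return float(np.mean(errs))
```

Passing only the zero perturbation recovers the ordinary (natural) empirical risk; enlarging the candidate set can only increase the estimate, mirroring the fact that the robust risk upper-bounds the natural risk.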

2.2. ALGORITHMS FOR GENERATING ADVERSARIAL EXAMPLES

Existing adversarial attacks can be categorized into white-box attacks (Goodfellow et al., 2015; Madry et al., 2018; Carlini & Wagner, 2017; Croce & Hein, 2020a) and black-box attacks (Papernot et al., 2016; 2017; Chen et al., 2017; Ilyas et al., 2018; Papernot et al., 2018). In a white-box attack, the model structure and parameters are known to adversaries, who use this information to generate adversarial examples; in a black-box attack, only the model's outputs for given inputs are available to adversaries. The most popular white-box attack is PGD (Projected Gradient Descent) (Madry et al., 2018). Let η(x′|θ, x, y) be a surrogate loss of 1{F_θ(x′) ≠ y} for given θ, x, y. PGD finds an adversarial example by applying gradient ascent to η and projecting the iterate back onto B_p(x, ε). That is, the update rule of PGD is

x^(t+1) = Π_{B_p(x,ε)} ( x^(t) + ν · sgn(∇_{x^(t)} η(x^(t)|θ, x, y)) ),  (2)
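The sign of the gradient in update (2) corresponds to the ℓ∞ version of PGD, for which the projection is an elementwise clip. A minimal numpy sketch, using an analytic gradient of a toy surrogate loss in place of backpropagation through a DNN (the `grad_fn` interface is an assumption for illustration, not an API from the paper):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, nu, steps):
    """l_inf PGD following update rule (2): gradient ascent on the
    surrogate loss eta, then projection onto B_inf(x, eps)."""
    x_t = x.copy()
    for _ in range(steps):
        x_t = x_t + nu * np.sign(grad_fn(x_t))  # ascent step
        x_t = np.clip(x_t, x - eps, x + eps)    # projection Pi_{B_inf(x, eps)}
    return x_t

# Toy surrogate eta(x') = w . x', whose gradient is the constant vector w.
w = np.array([1.0, -2.0, 0.5])
adv = pgd_linf(np.zeros(3), lambda z: w, eps=0.1, nu=0.05, steps=5)
```

For an ℓ2 ball the projection would instead rescale x^(t) − x to have norm at most ε; the ascent direction would also typically use the normalized gradient rather than its sign.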




