SWITCHING ONE-VERSUS-THE-REST LOSS TO INCREASE LOGIT MARGINS FOR ADVERSARIAL ROBUSTNESS

Abstract

Adversarial training is a promising method for improving robustness against adversarial attacks. To enhance its performance, recent methods place high weights on the cross-entropy loss for important data points near the decision boundary. However, these importance-aware methods are vulnerable to sophisticated attacks such as Auto-Attack. In this paper, we experimentally investigate the cause of this vulnerability through the margins between the logits for the true label and the other labels, since these margins must be large enough to prevent the largest logit from being flipped by an attack. Our experiments reveal that the histogram of the logit margins of naïve adversarial training has two peaks. Thus, the levels of difficulty in increasing logit margins are roughly divided into two groups: difficult samples (small logit margins) and easy samples (large logit margins). In contrast, only one peak near zero appears in the histogram of importance-aware methods, i.e., they reduce the logit margins of easy samples. To increase the logit margins of difficult samples without reducing those of easy samples, we propose switching one-versus-the-rest loss (SOVR), which switches from cross-entropy to one-versus-the-rest loss (OVR) for difficult samples. We derive trajectories of logit margins for a simple problem and prove that OVR yields logit margins twice as large as those of the weighted cross-entropy loss. Thus, SOVR increases the logit margins of difficult samples, unlike existing methods. We experimentally show that SOVR achieves better robustness against Auto-Attack than importance-aware methods.

1. INTRODUCTION

For multi-class classification problems, deep neural networks have become the de facto standard method over the past decade. They classify a data point into the label with the largest logit, which is the input to a softmax function. However, the largest logit is easily flipped, and deep neural networks can misclassify slightly perturbed data points, called adversarial examples (Szegedy et al., 2013). Various methods have been presented to search for adversarial examples, and Auto-Attack (Croce & Hein, 2020) is one of the most successful at finding worst-case attacks. For trustworthy deep learning applications, classifiers should be robust against such worst-case attacks. To improve robustness, many defense methods have been presented (Kurakin et al., 2016; Madry et al., 2018; Wang et al., 2020b; Cohen et al., 2019). Among them, adversarial training is a promising method that empirically achieves good robustness (Carmon et al., 2019; Kurakin et al., 2016; Madry et al., 2018). However, adversarial training is more difficult than standard training; e.g., it requires higher sample complexity (Schmidt et al., 2018; Wang et al., 2020a) and model capacity (Zhang et al., 2021b). To alleviate these difficulties, several methods focus on the difference in importance of data points (Wang et al., 2020a; Liu et al., 2021; Zhang et al., 2021b). These studies hypothesize that data points closer to a decision boundary are more important for adversarial training (Wang et al., 2020a; Zhang et al., 2021b; Liu et al., 2021). To focus on such data points, GAIRAT (Zhang et al., 2021b) and MAIL (Liu et al., 2021) use a weighted softmax cross-entropy loss that controls the weights on the losses on the basis of closeness to the boundary. As the measure of closeness, GAIRAT uses the least number of steps at which iterative attacks make the model misclassify a data point, whereas MAIL uses a measure based on the softmax outputs.
However, these importance-aware methods fail to improve robustness against Auto-Attack. Thus, it is still unclear how to treat the differences among training data points in adversarial training for good robustness. In this paper, we experimentally investigate the cause of their vulnerability through the margins between the logits for the true label and the other labels, since these margins must be large enough to prevent the largest logit from being flipped by an attack. Our experiments show that the histogram of logit margins under naïve adversarial training has two peaks, i.e., small and large logit margins. This indicates that the levels of difficulty in increasing logit margins are roughly divided into two groups: difficult samples and easy samples. In contrast, the logit margins of importance-aware methods concentrate near zero; thus, importance-aware methods reduce the logit margins of easy samples. This implies that the weighted cross-entropy used in importance-aware methods is not very effective at increasing logit margins. To increase the logit margins of difficult samples, we propose switching one-versus-the-rest loss (SOVR), which switches between cross-entropy and one-versus-the-rest loss (OVR) for easy and difficult samples, respectively, instead of weighting cross-entropy. We prove that OVR is always greater than or equal to cross-entropy for any logits. Furthermore, we theoretically derive the trajectories of logit margin losses when minimizing OVR and cross-entropy by using gradient flow on a simple problem, and reveal that OVR yields logit margins twice as large as those of weighted cross-entropy losses. Experiments demonstrate that SOVR increases logit margins more than naïve adversarial training and outperforms GAIRAT (Zhang et al., 2021b), MAIL (Liu et al., 2021), MART (Wang et al., 2020a), MMA (Ding et al., 2020), and EWAT (Kim et al., 2021) in terms of robustness against Auto-Attack.
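As a concrete illustration, the logit margin and the switching rule described above can be sketched as follows. This is a minimal sketch, not the paper's exact formulation: the logistic (softplus-based) form of the OVR loss and the zero-margin switching threshold are illustrative assumptions.

```python
import numpy as np

def logit_margin(z, y):
    """Margin between the true-label logit and the largest other logit."""
    z = np.asarray(z, dtype=float)
    others = np.delete(z, y)
    return z[y] - others.max()

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.logaddexp(0.0, x)

def cross_entropy(z, y):
    """Softmax cross-entropy, -log f_y(z), computed stably."""
    z = np.asarray(z, dtype=float)
    return np.log(np.exp(z - z.max()).sum()) + z.max() - z[y]

def ovr_loss(z, y):
    """One-versus-the-rest loss: one binary logistic loss per class
    (assumed form; +1 target for the true label, -1 for the rest)."""
    z = np.asarray(z, dtype=float)
    t = -np.ones_like(z)
    t[y] = 1.0
    return softplus(-t * z).sum()

def sovr_loss(z, y, margin_threshold=0.0):
    """SOVR idea: use OVR for difficult samples (small logit margin),
    cross-entropy for easy ones (illustrative threshold)."""
    if logit_margin(z, y) < margin_threshold:
        return ovr_loss(z, y)    # difficult sample
    return cross_entropy(z, y)   # easy sample
```

For instance, a misclassified point (negative margin) is routed to OVR, while a confidently correct point keeps the usual cross-entropy.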
In addition, we find that our method improves the performance of other recent methods (Wu et al., 2020; Wang & Wang, 2022) that reduce the generalization gap of adversarial training.

2.1 ADVERSARIAL TRAINING

Given N data points x_n ∈ R^d and class labels y_n ∈ {1, . . . , K}, adversarial training (Madry et al., 2018) attempts to solve the following minimax problem with respect to the model parameter θ ∈ R^m:

min_θ L_AT(θ) = min_θ (1/N) Σ_{n=1}^{N} CE(z(x'_n, θ), y_n),
x'_n = x_n + δ*_n = x_n + argmax_{||δ_n||_p ≤ ε} CE(z(x_n + δ_n, θ), y_n),

where z(x, θ) = [z_1(x, θ), . . . , z_K(x, θ)]^T and z_k(x, θ) is the k-th logit of the model, which is the input to the softmax: f_k(x, θ) = e^{z_k(x, θ)} / Σ_i e^{z_i(x, θ)}. CE is the cross-entropy function, and ||·||_p and ε are the L_p norm and the magnitude of the perturbation δ_n ∈ R^d, respectively. The inner maximization problem is solved by projected gradient descent (PGD) (Kurakin et al., 2016; Madry et al., 2018), which updates the adversarial examples as

δ^t = Π_ε(δ^{t-1} + η sign(∇_{δ^{t-1}} CE(z(x + δ^{t-1}, θ), y)))

for T steps, where η is a step size and Π_ε is a projection operation onto the feasible region {δ | δ ∈ R^d, ||δ||_p ≤ ε}. Note that we focus on p = ∞ since it is a common setting. For trustworthy deep learning, we should improve the true robustness: the robustness against the worst-case attacks in the feasible region. Thus, the evaluation of robustness should use carefully crafted attacks, e.g., Auto-Attack (Croce & Hein, 2020), since PGD often fails to find the adversarial examples misclassified by models.

2.2 IMPORTANCE-AWARE ADVERSARIAL TRAINING

GAIRAT (geometry-aware instance-reweighted adversarial training) (Zhang et al., 2021b) and MAIL (margin-aware instance reweighting learning) (Liu et al., 2021) regard data points closer to the decision boundary of the model f as important samples and assign higher weights to their losses:

L_weight(θ) = (1/N) Σ_{n=1}^{N} w̄_n CE(z(x'_n, θ), y_n),

where w_n ≥ 0 is a weight normalized as w̄_n = w_n / Σ_l w_l so that Σ_n w̄_n = 1. GAIRAT determines the weights through w_n = (1 + tanh(λ + 5(1 − 2κ_n/T)))/2, where κ_n is the least number of PGD steps at which the attack succeeds against the model and λ is a hyperparameter. On the other hand, MAIL uses w_n = sigmoid(−γ(PM_n − β)), where PM_n = f_{y_n}(x'_n, θ) − max_{k≠y_n} f_k(x'_n, θ), and β and γ are hyperparameters. MART (misclassification-aware adversarial training) (Wang et al., 2020a) uses a similar approach.
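The PGD inner maximization can be sketched as follows. This is a minimal sketch assuming a toy linear model z = Wx, so the input gradient of the cross-entropy is available in closed form; in practice the gradient would come from automatic differentiation through a deep network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_x(W, x, y):
    """Gradient of the cross-entropy w.r.t. the input x for z = W x."""
    p = softmax(W @ x)
    p[y] -= 1.0           # dCE/dz = softmax(z) - one_hot(y)
    return W.T @ p        # chain rule through z = W x

def pgd_linf(W, x, y, eps=0.1, eta=0.02, steps=10):
    """L_inf PGD: signed-gradient ascent on the loss, projecting the
    perturbation delta back into the eps-ball after every step."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = ce_grad_x(W, x + delta, y)
        delta = delta + eta * np.sign(g)   # ascent step
        delta = np.clip(delta, -eps, eps)  # projection Pi_eps for p = inf
    return x + delta
```

The clipping step implements the projection Π_ε, which for p = ∞ is simply a coordinate-wise clamp of δ to [−ε, ε].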

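The two importance-weighting schemes above can be sketched as follows. The hyperparameter values (λ, γ, β) are illustrative defaults, not the papers' recommended settings.

```python
import numpy as np

def gairat_weight(kappa, T, lam=0.0):
    """GAIRAT weight from kappa, the least PGD step (out of T) at which
    the attack succeeds: small kappa = near the boundary = large weight."""
    return (1.0 + np.tanh(lam + 5.0 * (1.0 - 2.0 * kappa / T))) / 2.0

def mail_weight(probs, y, gamma=10.0, beta=0.0):
    """MAIL weight from the probability margin PM = f_y - max_{k != y} f_k;
    small margins (hard points) get weights near 1, large margins near 0."""
    others = np.delete(probs, y)
    pm = probs[y] - others.max()
    return 1.0 / (1.0 + np.exp(gamma * (pm - beta)))  # sigmoid(-gamma*(pm-beta))

def normalize(w):
    """Normalize raw weights so they sum to one, as in L_weight."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()
```

Both schemes concentrate the training signal on points near the decision boundary, which is exactly the behavior the logit-margin histograms in this paper examine.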
