SWITCHING ONE-VERSUS-THE-REST LOSS TO INCREASE LOGIT MARGINS FOR ADVERSARIAL ROBUSTNESS

Abstract

Adversarial training is a promising method for improving robustness against adversarial attacks. To enhance its performance, recent methods impose high weights on the cross-entropy loss for important data points near the decision boundary. However, these importance-aware methods are vulnerable to sophisticated attacks, e.g., Auto-Attack. In this paper, we experimentally investigate the cause of their vulnerability via margins between the logits for the true label and the other labels, since these margins should be large enough to prevent the largest logit from being flipped by an attack. Our experiments reveal that the histogram of the logit margins of naïve adversarial training has two peaks. Thus, the levels of difficulty in increasing logit margins are roughly divided into two: difficult samples (small logit margins) and easy samples (large logit margins). In contrast, only one peak near zero appears in the histogram of importance-aware methods, i.e., they reduce the logit margins of easy samples. To increase the logit margins of difficult samples without reducing those of easy samples, we propose switching one-versus-the-rest loss (SOVR), which switches from cross-entropy to the one-versus-the-rest loss (OVR) for difficult samples. We derive the trajectories of logit margins for a simple problem and prove that OVR makes logit margins twice as large as the weighted cross-entropy loss. Thus, SOVR increases the logit margins of difficult samples, unlike existing methods. We experimentally show that SOVR achieves better robustness against Auto-Attack than importance-aware methods.

1. INTRODUCTION

For multi-class classification problems, deep neural networks have become the de facto standard method in this decade. They classify a data point into the label that has the largest logit, which is the input to a softmax function. However, the largest logit is easily flipped, and deep neural networks can misclassify slightly perturbed data points, which are called adversarial examples (Szegedy et al., 2013). Various methods have been presented to search for adversarial examples, and Auto-Attack (Croce & Hein, 2020) is one of the most successful at finding worst-case attacks. For trustworthy deep learning applications, classifiers should be robust against such worst-case attacks. To improve robustness, many defense methods have also been presented (Kurakin et al., 2016; Madry et al., 2018; Wang et al., 2020b; Cohen et al., 2019). Among them, adversarial training is a promising method that empirically achieves good robustness (Carmon et al., 2019; Kurakin et al., 2016; Madry et al., 2018). However, adversarial training is more difficult than standard training; e.g., it requires higher sample complexity (Schmidt et al., 2018; Wang et al., 2020a) and model capacity (Zhang et al., 2021b). To alleviate these difficulties, several methods focus on the difference in importance of data points (Wang et al., 2020a; Liu et al., 2021; Zhang et al., 2021b). These studies hypothesize that data points closer to a decision boundary are more important for adversarial training (Wang et al., 2020a; Zhang et al., 2021b; Liu et al., 2021). To focus on such data points, GAIRAT (Zhang et al., 2021b) and MAIL (Liu et al., 2021) use a weighted softmax cross-entropy loss that controls the weights on the losses on the basis of closeness to the boundary. As the measure of closeness, GAIRAT uses the least number of steps at which iterative attacks make models misclassify the data point, whereas MAIL uses a measure based on the softmax outputs.
However, these importance-aware methods fail to improve the robustness against Auto-Attack. Thus, it is still unclear how to treat the difference among training data points in adversarial training for good robustness. In this paper, we experimentally investigate the cause of their vulnerability via margins between the logits for the true label and the other labels, since these margins should be large enough to prevent the largest logit from being flipped by an attack. Our experiments show that the histogram of the logit margins of naïve adversarial training has two peaks, i.e., small and large logit margins. This indicates that the levels of difficulty in increasing the logit margins are roughly divided into two: difficult samples and easy samples. In contrast, the logit margins of importance-aware methods concentrate near zero; i.e., importance-aware methods reduce the logit margins of easy samples. This implies that the weighted cross-entropy used in importance-aware methods is not very effective at increasing logit margins. To increase the logit margins of difficult samples, we propose switching one-versus-the-rest loss (SOVR), which switches between cross-entropy and the one-versus-the-rest loss (OVR) for easy and difficult samples, instead of weighting cross-entropy. We prove that OVR is always greater than or equal to cross-entropy for any logits. Furthermore, we theoretically derive the trajectories of logit margin losses when minimizing OVR and cross-entropy by using gradient flow on a simple problem, and we reveal that OVR makes logit margins twice as large as weighted cross-entropy losses. Experiments demonstrate that SOVR increases logit margins more than naïve adversarial training and outperforms GAIRAT (Zhang et al., 2021b), MAIL (Liu et al., 2021), MART (Wang et al., 2020a), MMA (Ding et al., 2020), and EWAT (Kim et al., 2021) in terms of robustness against Auto-Attack.
In addition, we find that our method improves the performance of other recent methods (Wu et al., 2020; Wang & Wang, 2022) that reduce the generalization gap of adversarial training.

2. PRELIMINARIES

2.1. ADVERSARIAL TRAINING

Given N data points x_n ∈ R^d and class labels y_n ∈ {1, …, K}, adversarial training (Madry et al., 2018) attempts to solve the following minimax problem with respect to the model parameter θ ∈ R^m:

min_θ L_AT(θ) = min_θ (1/N) Σ_{n=1}^{N} ℓ_CE(z(x'_n, θ), y_n),   (1)
x'_n = x_n + δ*_n = x_n + argmax_{||δ_n||_p ≤ ε} ℓ_CE(z(x_n + δ_n, θ), y_n),   (2)

where z(x, θ) = [z_1(x, θ), …, z_K(x, θ)]^T and z_k(x, θ) is the k-th logit of the model, i.e., the input to the softmax f_k(x, θ) = e^{z_k(x)} / Σ_i e^{z_i(x)}. ℓ_CE is the cross-entropy loss, and ||·||_p and ε are the L_p norm and the magnitude of the perturbation δ_n ∈ R^d, respectively. The inner maximization problem is solved by projected gradient descent (PGD) (Kurakin et al., 2016; Madry et al., 2018), which updates the adversarial examples for K steps as

δ_t = Π_ε(δ_{t-1} + η sign(∇_{δ_{t-1}} ℓ_CE(z(x + δ_{t-1}, θ), y))),   (3)

where η is a step size and Π_ε is the projection onto the feasible region {δ | δ ∈ R^d, ||δ||_p ≤ ε}. Note that we focus on p = ∞ since it is a common setting. For trustworthy deep learning, we should improve the true robustness: the robustness against the worst-case attacks in the feasible region. Thus, the evaluation of robustness should use crafted attacks, e.g., Auto-Attack (Croce & Hein, 2020), since PGD often fails to find the adversarial examples misclassified by models.
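As a concrete illustration, the inner maximization can be sketched in a few lines of NumPy. This is a minimal sketch of the PGD update δ_t = Π_ε(δ_{t-1} + η sign(∇_δ ℓ_CE)) for p = ∞, assuming the caller supplies a `grad_fn` that returns the input gradient of the attacked loss (the function name and signature are ours, not from the paper):

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps=8/255, eta=2/255, steps=10):
    """Sketch of the PGD inner maximization on the L-infinity ball.

    grad_fn(x_adv, y): gradient of the attacked loss (e.g., cross-entropy)
    with respect to the input; eps and eta follow the common setting.
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        # ascend the loss along the sign of the input gradient
        delta = delta + eta * np.sign(grad_fn(x + delta, y))
        # project back into the feasible region {||delta||_inf <= eps}
        delta = np.clip(delta, -eps, eps)
    return x + delta
```

In practice, the gradient comes from backpropagation through the network, and a clip to the valid input range (e.g., [0, 1]) is usually applied as well.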

2.2. IMPORTANCE-AWARE ADVERSARIAL TRAINING

GAIRAT (geometry-aware instance-reweighted adversarial training) (Zhang et al., 2021b) and MAIL (margin-aware instance reweighting learning) (Liu et al., 2021) regard data points closer to the decision boundary of the model f as important samples and assign higher weights to their losses:

L_weight(θ) = (1/N) Σ_{n=1}^{N} w_n ℓ_CE(z(x'_n, θ), y_n),   (4)

where w_n ≥ 0 is a weight normalized as w_n = w̃_n / Σ_l w̃_l so that Σ_n w_n = 1. GAIRAT determines the weights through w̃_n = (1 + tanh(λ + 5(1 - 2κ_n/K)))/2, where κ_n is the least number of steps at which PGD succeeds at attacking the model and λ is a hyperparameter. MAIL instead uses w̃_n = sigmoid(-γ(PM_n - β)), where PM_n = f_{y_n}(x'_n, θ) - max_{k≠y_n} f_k(x'_n, θ), and β and γ are hyperparameters. MART (misclassification-aware adversarial training) (Wang et al., 2020a) uses a similar approach: it regards misclassified samples as important samples and controls the difference between the losses on unimportant and important samples. MMA (max-margin adversarial training) (Ding et al., 2020) also adaptively changes the loss function and ε for each data point, and thus has a similar effect. We collectively call the above methods importance-aware methods. Hitaj et al. (2021); Croce & Hein (2020); Kim et al. (2021) have reported that the robust accuracies of GAIRAT, MART, and MMA are lower than that of naïve adversarial training under logit scaling attacks or Auto-Attack (Croce & Hein, 2020). Since Auto-Attack searches for adversarial examples with an ensemble of attacks, it achieves a higher attack success rate than any single attack, e.g., PGD.
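For concreteness, the two weighting schemes can be sketched as follows. This is a minimal NumPy sketch with illustrative default hyperparameters (the helper names and defaults are ours): a smaller κ_n or a smaller probability margin PM_n means a sample closer to the boundary and hence a larger weight.

```python
import numpy as np

def gairat_weights(kappa, K, lam=-1.0):
    """GAIRAT weights from the least PGD step counts kappa_n, normalized
    so the weights sum to one; lam is the hyperparameter lambda."""
    w = (1 + np.tanh(lam + 5 * (1 - 2 * kappa / K))) / 2
    return w / w.sum()

def mail_weights(probs, y, gamma=10.0, beta=0.0):
    """MAIL weights from the probability margin PM_n, normalized to sum to one.

    probs: (b, K) softmax outputs on adversarial examples; y: (b,) labels.
    """
    n = np.arange(len(y))
    rest = probs.copy()
    rest[n, y] = -np.inf
    pm = probs[n, y] - rest.max(axis=1)           # PM_n
    w = 1 / (1 + np.exp(gamma * (pm - beta)))     # sigmoid(-gamma*(pm-beta))
    return w / w.sum()
```

A sample attacked in fewer PGD steps (small κ_n), or with a small probability margin, receives the larger weight in Eq. (4).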

2.3. VULNERABILITY OF IMPORTANCE-AWARE METHODS TO AUTO-ATTACK

To clarify the vulnerabilities, we individually evaluate the robustness against the components of Auto-Attack and PGD (K = 20) on CIFAR10 (Fig. 1). The training setup is the same as in Section 6, and we add the results of our method (SOVR) as a reference. The figure shows that almost all importance-aware methods improve the robustness against PGD and APGD compared with naïve adversarial training (AT (Madry et al., 2018)). However, they do not improve the true robustness; i.e., their robust accuracies against the worst-case attack are lower than that of AT. Since the reasons for this vulnerability have not been discussed well, we investigate them in the next section.

3. EVALUATION OF ROBUSTNESS VIA LOGIT MARGIN LOSS

We investigate the causes of the vulnerabilities of importance-aware methods by comparing histograms of logit margin losses. First, we explain that logit margin losses determine robustness. Next, we experimentally reveal that the logit margin losses of importance-aware methods concentrate near zero; i.e., their logit margins are smaller than those of AT. We use training data for the empirical evaluation in this section because our goal here is to investigate the effect of importance-aware methods, which modify the loss function on the basis of training data points. Experimental setups are provided in Appendix E.6.

3.1. POTENTIALLY MISCLASSIFIED DATA DETECTED BY LOGIT MARGIN LOSS

To investigate the robustness of models near each data point, we apply the logit margin loss (Ding et al., 2020) to the models trained by importance-aware methods. The logit margin loss is

ℓ_LM(z(x', θ), y) = max_{k≠y} z_k(x') - z_y(x') = z_{k*}(x') - z_y(x'),   (5)

where k* = argmax_{k≠y} z_k(x'). Since the classifier infers the label of x as ŷ = argmax_k z_k(x), it correctly classifies x' if ℓ_LM ≤ 0. Thus, the logit margin loss on a difficult sample in adversarial training takes a value near zero. We refer to the absolute value of a logit margin loss |ℓ_LM| as the logit margin. In contrast to PM_n of MAIL, ℓ_LM is not bounded since z_k(x) can take an arbitrary value in R. To explain the effect of logit margins, we assume that the k-th logit function is Lipschitz with constant L_k, i.e., |z_k(x_1) - z_k(x_2)| ≤ L_k ||x_1 - x_2||_∞. In this case, we have the following inequality:

max_k z_k(x') - z_y(x') ≤ max_k [z_k(x) - z_y(x) + (L_k + L_y)ε] ≤ z_{k*}(x) - z_y(x) + (L_k̂ + L_y)ε,   (6)

where k̂ = argmax_k L_k. From the above, we define potentially misclassified samples:

Definition 3.1. If a data point x satisfies z_{k*}(x) - z_y(x) > -(L_k̂ + L_y)ε, it is a potentially misclassified sample.

Proposition 3.2. Models are guaranteed to correctly classify every adversarial example x' = x + δ with ||δ||_∞ ≤ ε of a data point x that is not potentially misclassified.

All proofs are provided in Appendix A. We can estimate the true robustness of each method by counting the number of potentially misclassified samples. Definition 3.1 and Proposition 3.2 indicate that large logit margins |ℓ_LM| or small Lipschitz constants L_k are necessary for robustness. Thus, the logit margin loss can serve as a metric of robustness, and we evaluate it in Section 3.2. In Section 6.2.1, we provide the estimated number of potentially misclassified samples for each method.
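The logit margin loss itself is a one-liner; below is a minimal NumPy sketch for a single example (the helper name is ours):

```python
import numpy as np

def logit_margin_loss(z, y):
    """l_LM(z, y) = max_{k != y} z_k - z_y; negative iff the label y wins."""
    z = np.asarray(z, dtype=float)
    rest = z.copy()
    rest[y] = -np.inf        # exclude the true label from the max
    return rest.max() - z[y]
```

A correctly classified point gives a negative value (its magnitude is the logit margin), and a misclassified point gives a positive one.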

3.2. HISTOGRAMS OF LOGIT MARGIN LOSS

Since logit margin losses determine the number of potentially misclassified samples, we show their histograms for each method on CIFAR10 at the last epoch in Fig. 2. Comparing AT (Fig. 2(b)) with standard training (ST, Fig. 2(a)), AT has two peaks in the histogram. This indicates that the levels of difficulty in increasing the margins in AT are roughly divided into two: difficult samples (right peak) and easy samples (left peak). Difficult samples correspond to the data close to the boundary, i.e., important samples. Next, comparing AT (Fig. 2(b)) with the importance-aware methods (Figs. 2(c)-(f)), their logit margin losses ℓ_LM concentrate near zero, and their peaks are sharper than that of AT. This indicates that importance-aware methods fail to increase the logit margins |ℓ_LM| not only for difficult samples but also for easy samples, because the weights on easy samples are relatively small. Thus, it is necessary to increase the small logit margins of difficult samples without reducing those of easy samples. Appendix F.1 provides results under various settings, which show similar tendencies.

4. PROPOSED METHOD

In Section 3.2, we observed that (a) training samples are roughly divided into two types via logit margins, difficult samples and easy samples, and (b) importance-aware methods reduce the logit margins of easy samples since they excessively focus on difficult samples. From these observations, our method is based on two ideas: (i) we switch from cross-entropy to an alternative loss for difficult samples by the criterion of the logit margin loss, and (ii) this alternative loss increases the logit margins of difficult samples more than weighted cross-entropy.

4.1. ONE-VERSUS-THE-REST LOSS (OVR)

The logit margin |z_{k*}(x) - z_y(x)| should be large while the Lipschitz constants of the logit functions are kept small. To this end, we need a loss function that penalizes small logit margins. The logit margin loss is an intuitive candidate. However, it only considers the pair of the largest logit z_{k*} and the logit for the true label z_y, which is not sufficient for robustness because k* and k̂ in Eq. (6) are not necessarily the same. Moreover, the logit margin loss does not have a desirable property for multi-class classification: infinite sample consistency (ISC) (Zhang, 2004; Bartlett et al., 2003; Lin, 2002). To consider the logits for all classes and satisfy ISC, our proposed method uses the one-versus-the-rest loss (OVR):

ℓ_OVR(z(x, θ), y) = φ(z_y(x)) + Σ_{k≠y} φ(-z_k(x)).   (7)

When φ is a differentiable non-negative convex function and satisfies φ(z) < φ(-z) for z > 0, OVR satisfies ISC (Zhang, 2004). As such a function, we set φ(z) = log(1 + e^{-z}) and use the following loss:

ℓ_OVR(z(x, θ), y) = log(1 + e^{-z_y(x)}) + Σ_{k≠y} log(1 + e^{z_k(x)}) = -z_y(x) + Σ_k log(1 + e^{z_k(x)}).   (8)

We provide the detailed reason for this selection of φ(z) in Appendix D.
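Eq. (8) can be implemented directly. Below is a minimal NumPy sketch of ℓ_OVR with φ(z) = log(1 + e^{-z}), together with softmax cross-entropy for comparison (helper names ours); `np.logaddexp` keeps the log(1 + e^z) terms numerically stable:

```python
import numpy as np

def ovr_loss(z, y):
    """One-versus-the-rest loss of Eq. (8):
    l_OVR = -z_y + sum_k log(1 + e^{z_k})."""
    z = np.asarray(z, dtype=float)
    return float(-z[y] + np.logaddexp(0.0, z).sum())

def ce_loss(z, y):
    """Softmax cross-entropy, l_CE = -z_y + log sum_k e^{z_k}, for comparison."""
    z = np.asarray(z, dtype=float)
    return float(-z[y] + np.logaddexp.reduce(z))
```

Consistent with Theorem 4.1 in the next section, ℓ_OVR ≥ ℓ_CE for any logits, and both vanish as z_y → +∞ and the other logits → -∞.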

4.2. BEHAVIOR OF LOGIT MARGIN LOSSES BY OVR AND CROSS-ENTROPY

To show the effectiveness of OVR in increasing logit margins, we theoretically discuss the difference between OVR and cross-entropy. First, OVR has the following property compared with cross-entropy:

Theorem 4.1. If we use ℓ_OVR (Eq. (8)) and the softmax f_k(x) = e^{z_k(x)} / Σ_i e^{z_i(x)}, we have

0 ≤ ℓ_CE(z(x), y) ≤ ℓ_OVR(z(x), y), ∀(x, y).   (9)

When z_y(x) → +∞ and z_k(x) → -∞ for k ≠ y, we have ℓ_OVR(z(x), y) → 0 and ℓ_CE(z(x), y) → 0.

Thus, OVR is always larger than or equal to cross-entropy, and both asymptotically approach zero when |ℓ_LM| grows to infinity. In fact, we observed that ℓ_OVR(z, y) is about four times greater than ℓ_CE(z, y) for random logits z ~ N(0, I) and y selected uniformly at random from {1, …, 10}. Thus, we expect OVR to penalize small logit margins more strongly than cross-entropy. Besides this general result, we further investigate the effect of OVR on logit margin losses by using a simple problem. To analyze the behavior of logit margins, we formulate the following problem:

min_z w ℓ_*(z, y),   (10)

where * is set to OVR or CE. z ∈ R^K is a logit vector for a data point x, which we assume we can move directly in this problem. w ∈ R is a weight on the loss, which appears in Eq. (4). To analyze the dynamics of training on Eq. (10), we use the following assumption.

Assumption 4.2. A logit vector z follows the following gradient flow to solve Eq. (10):

dz/dt = -∇_z w ℓ_*(z, y),   (11)

where t is the time step of training. We assume that z is initialized to zeros, z = 0, at t = 0.

Equation (11) is a continuous approximation of gradient descent z_{τ+1} = z_τ - η ∇_z w ℓ_* and matches it in the limit η → 0. It is a commonly used tool for analyzing training dynamics (Kunin et al., 2021; Elkabetz & Cohen, 2021). Under Assumption 4.2, we have the following lemmas about the logits in the training of Eq. (10):

Lemma 4.3. If we use OVR ℓ_OVR(z, y) in Eq. (10), the k-th logit z_k(t) at time t is

z_k(t) = wt + 1 - W(e^{wt+1}) for k = y, and z_k(t) = -wt - 1 + W(e^{wt+1}) otherwise,   (12)

where W is the Lambert W function, which satisfies x = W(xe^x) (Corless et al., 1996).

Lemma 4.4. If we use cross-entropy ℓ_CE(z, y) in Eq. (10), the k-th logit z_k(t) at time t is

z_k(t) = wt + 1/K - ((K-1)/K) W((1/(K-1)) e^{(K/(K-1))wt + 1/(K-1)}) for k = y,
z_k(t) = -(1/(K-1))wt - 1/(K(K-1)) + (1/K) W((1/(K-1)) e^{(K/(K-1))wt + 1/(K-1)}) otherwise.   (13)

These lemmas give the trajectories of logit vectors in the minimization of OVR and cross-entropy, respectively. By using the above lemmas, we derive the trajectory of the logit margin loss:

Theorem 4.5. The logit margin losses for the logit vector z_OVR in the minimization of weighted OVR and the logit vector z_CE in the minimization of weighted cross-entropy at time t are

ℓ_LM(z_OVR(t)) = -2w_1 t - 2 + 2W(e^{w_1 t + 1}),   (14)
ℓ_LM(z_CE(t)) = -(K/(K-1)) w_2 t - 1/(K-1) + W((1/(K-1)) e^{(K/(K-1)) w_2 t + 1/(K-1)}),   (15)

where w_1 ∈ R and w_2 ∈ R are the weights for OVR and cross-entropy, respectively. For large t, they are approximated by

ℓ_LM(z_OVR(t)) ≈ -log(w_1 t + 1)^2,   (16)
ℓ_LM(z_CE(t)) ≈ -log(K w_2 t + 1 - (K-1) log(K-1)),   (17)

and we have lim_{t→∞} ℓ_LM(z_OVR(t)) / ℓ_LM(z_CE(t)) = 2 for any fixed w_1, w_2, and K.

This theorem shows the difference in the trajectories of logit margin losses between OVR and cross-entropy under Assumption 4.2. Regardless of the weights w_1 and w_2, cross-entropy does not increase the logit margins as much as OVR for sufficiently large t. Thus, OVR increases small logit margins more effectively than GAIRAT and MAIL, which use weighted cross-entropy (Eq. (4)). Figure 3(a) plots the trajectory of the logit margin losses ℓ_LM in the minimization of Eq. (10). The figure shows the solutions in Theorem 4.5 (solid lines): we use Eqs. (14) and (15) unless overflow occurs due to the exponential functions and Eqs. (16) and (17) when it does. It also plots numerical solutions of Eq. (11) obtained from the gradients with the Runge-Kutta method as a reference (dashed lines). In Fig. 3(a), Eqs. (14)-(17) exactly match the numerical solutions, and thus, the logit margins follow Theorem 4.5. In addition, Fig. 3(a) shows that OVR decreases logit margin losses more than cross-entropy over t regardless of K and w. Thus, OVR is more suitable for increasing the logit margins of difficult data than previous weighting approaches such as GAIRAT and MAIL. Figure 3(b) plots the trajectories of ℓ_LM in adversarial training (Eq. (1)) on CIFAR10. It shows that the logit margin |ℓ_LM| of OVR is about twice as large as that of cross-entropy at the last epoch (ℓ_LM(z_OVR)/ℓ_LM(z_CE) = 1.87 for w = 1), as in Theorem 4.5. Thus, problem (10) is simple but precise enough to explain the difference in logit margins between OVR and cross-entropy on a real dataset. In fact, ℓ_LM(z_OVR)/ℓ_LM(z_CE) at the last epoch is in [1.5, 2] on other datasets, including CIFAR100 (K = 100) (Appendix F.3). As above, we have shown that OVR is more effective at increasing the logit margins of difficult samples than weighting cross-entropy. In the next section, we compose an objective function that switches between OVR and cross-entropy for difficult and easy samples.
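The gradient-flow comparison is easy to reproduce numerically. The sketch below integrates Eq. (11) with plain Euler steps instead of Runge-Kutta (K = 3, w = 1, true label fixed to index 0; helper names ours) and compares the resulting logit margin losses; consistent with Theorem 4.5, the OVR margin loss is markedly more negative than the cross-entropy one at the same t.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ovr_grad(z):
    """Gradient of l_OVR (Eq. (8)) with respect to z, for y = 0."""
    g = 1 / (1 + np.exp(-z))           # sigmoid(z_k) for k != y
    g[0] = -1 / (1 + np.exp(z[0]))     # -sigmoid(-z_y) for the true label
    return g

def ce_grad(z):
    """Gradient of softmax cross-entropy with respect to z, for y = 0."""
    g = softmax(z)
    g[0] -= 1.0
    return g

def flow_margin(loss_grad, K=3, w=1.0, t_end=20.0, dt=1e-3):
    """Euler integration of dz/dt = -w * grad(z) from z = 0 (Assumption 4.2);
    returns the logit margin loss l_LM at time t_end (label y = 0)."""
    z = np.zeros(K)
    for _ in range(int(t_end / dt)):
        z = z - dt * w * loss_grad(z)
    return np.delete(z, 0).max() - z[0]
```

At t = 20 the closed forms give roughly -5.8 for OVR and -4.0 for cross-entropy (K = 3, w = 1); the asymptotic ratio approaches 2 only as t grows further.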

4.3. PROPOSED OBJECTIVE FUNCTION: SOVR

Our proposed objective function is

L_SOVR(θ) = (1/N) [ Σ_{(x,y)∈S} ℓ_CE(z(x', θ), y) + λ Σ_{(x,y)∈L} ℓ_OVR(z(x', θ), y) ],   (19)

where S is the set of samples whose logit margin losses ℓ_LM are smaller than those in the set L, and |S| + |L| = N. These sets correspond to the easy and difficult samples in Fig. 2(b). In our method, we regard the top M% of data points in each SGD minibatch as the samples in L. λ is a hyperparameter that balances the losses, and x' is an adversarial example generated by Eq. (2). The proposed algorithm is shown in Appendix B. Since we do not generate additional adversarial examples for ℓ_OVR, the overhead of our method is negligible: O(b log((M/100) b)), where b is the minibatch size. In the same way as in Section 3.2, we evaluate the histograms of logit margin losses for SOVR on CIFAR10 in Fig. 4. It shows that SOVR increases the left peak compared with AT (Fig. 2(b)): OVR strongly penalizes the difficult samples in the right peak and moves them into the left peak. Although OVR increases logit margins as explained in Section 4.2, we found that pure OVR (M = 100) is inferior to SOVR because applying OVR to easy samples can cause overfitting. Figure 5 plots the effect of M on ℓ_LM at the last epoch, the generalization gap at the last epoch, and the robust accuracy against Auto-Attack on CIFAR10. It shows that ℓ_LM monotonically decreases, i.e., robustness improves, as M increases. However, the generalization gap grows at the same time, and the robust accuracy takes its largest value at M = 40. Thus, it is necessary to switch losses so as to focus on difficult (important) samples, as in existing importance-aware methods. We provide an evaluation of the effect of λ in Appendix F.4, which shows similar tendencies.
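A minibatch version of the switching rule can be sketched as follows (NumPy; the helper name is ours): rank the batch by the logit margin loss, route the top-M% (the difficult samples, set L) through ℓ_OVR scaled by λ, and the rest (set S) through ℓ_CE.

```python
import numpy as np

def sovr_batch_loss(Z, y, M=40, lam=0.4):
    """SOVR on a minibatch of adversarial logits Z (b x K) with labels y (b,).

    The top-M% samples by logit margin loss l_LM get lam * l_OVR,
    the remaining samples get plain cross-entropy.
    """
    b, K = Z.shape
    n = np.arange(b)
    rest = Z.copy()
    rest[n, y] = -np.inf
    lm = rest.max(axis=1) - Z[n, y]               # logit margin losses
    n_hard = int(round(b * M / 100))
    is_hard = np.zeros(b, dtype=bool)
    is_hard[np.argsort(-lm)[:n_hard]] = True      # largest l_LM = set L
    ce = -Z[n, y] + np.logaddexp.reduce(Z, axis=1)
    ovr = -Z[n, y] + np.logaddexp(0.0, Z).sum(axis=1)
    return float(np.where(is_hard, lam * ovr, ce).mean())
```

With M = 0 this reduces to plain adversarial training, and with M = 100 and λ = 1 it reduces to pure OVR, which the paper reports to be inferior due to overfitting on easy samples.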

4.4. EXTENSION FOR OTHER DEFENSE METHODS

Since SOVR only modifies the objective function, it can be used with optimization algorithms for robustness, e.g., adversarial weight perturbation (AWP) (Wu et al., 2020) or self-ensemble adversarial training (SEAT) (Wang & Wang, 2022), which improve generalization performance in adversarial training. However, SOVR is difficult to use with TRADES (Zhang et al., 2019) because TRADES also modifies the objective function. To combine our method with TRADES, we propose TSOVR, which uses SOVR instead of cross-entropy for clean data:

L_TSOVR(θ) = (1/N) [ L'_SOVR + Σ_{n=1}^{N} β_T max_{||δ_n||_p ≤ ε} KL(f(x_n, θ), f(x_n + δ_n, θ)) ],   (20)
L'_SOVR = Σ_{(x,y)∈S} ℓ_CE(z(x, θ), y) + λ Σ_{(x,y)∈L} ℓ_OVR(z(x, θ), y),

where λ and β_T are hyperparameters. We evaluate the combinations of SOVR with TRADES, AWP (Wu et al., 2020), and SEAT (Wang & Wang, 2022) in the experiments.
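Under the same conventions, the TSOVR objective can be sketched as SOVR on the clean logits plus the TRADES-style KL term between clean and adversarial softmax outputs; the adversarial logits are assumed to come from a separately solved inner maximization (helper names and defaults are ours):

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tsovr_batch_loss(Z_clean, Z_adv, y, M=20, lam=0.8, beta_t=6.0):
    """TSOVR sketch: SOVR on clean logits + beta_T * KL(f(x) || f(x')).

    Z_clean, Z_adv: (b, K) logits on clean / adversarial inputs; y: labels.
    """
    b, K = Z_clean.shape
    n = np.arange(b)
    rest = Z_clean.copy()
    rest[n, y] = -np.inf
    lm = rest.max(axis=1) - Z_clean[n, y]        # logit margin losses
    hard = np.zeros(b, dtype=bool)
    hard[np.argsort(-lm)[:int(round(b * M / 100))]] = True
    ce = -Z_clean[n, y] + np.logaddexp.reduce(Z_clean, axis=1)
    ovr = -Z_clean[n, y] + np.logaddexp(0.0, Z_clean).sum(axis=1)
    p, q = softmax(Z_clean), softmax(Z_adv)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=1)
    return float((np.where(hard, lam * ovr, ce) + beta_t * kl).mean())
```

Since KL is non-negative and vanishes when the adversarial outputs match the clean ones, perturbing the adversarial logits can only increase the objective relative to the unperturbed case.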

5. RELATED WORK

The difference in importance of data points in adversarial training has been investigated in several studies (Wang et al., 2020a; Zhang et al., 2020; Sanyal et al., 2021; Dong et al., 2022). Although OVR has been used for out-of-distribution detection in prior work, the effect of OVR on logit margins has not been discussed. Studies by Hitaj et al. (2021), Croce & Hein (2020), and Kim et al. (2021) have reported the vulnerabilities of some importance-aware methods to logit scaling attacks or Auto-Attack, but few studies discuss the causes. Kim et al. (2021) have pointed out that the cause is high entropy in GAIRAT and MART and presented EWAT (entropy-weighted adversarial training), which imposes higher weights on samples with higher entropy. However, the logit margin is more closely related to robustness than entropy, as discussed in Section 3. Furthermore, weighted cross-entropy in EWAT is less effective than OVR, as shown in Theorem 4.5.

6. EXPERIMENTS

6.1. EXPERIMENTAL SETUP

First, we compare SOVR with AT (Madry et al., 2018), MART (Wang et al., 2020a), GAIRAT (Zhang et al., 2021b), MAIL (Liu et al., 2021), and EWAT (Kim et al., 2021) on three datasets: CIFAR10, SVHN, and CIFAR100 (Krizhevsky & Hinton, 2009; Netzer et al., 2011). Next, we evaluate the combination of SOVR with TRADES (Zhang et al., 2019), AWP (Wu et al., 2020), and SEAT (Wang & Wang, 2022). Our experimental codes are based on the source codes provided by Wu et al. (2020); Wang et al. (2020a); Ding et al. (2020). We used PreActResNet-18 (RN18) (He et al., 2016) for all datasets and WideResNet-34-10 (WRN) (Zagoruyko & Komodakis, 2016) for CIFAR10. We used PGD (K = 10, η = 2/255, ε = 8/255) in training and early stopping by evaluating test robust accuracies against PGD with K = 10. For AWP and SEAT, we use the original public codes (Wu et al., 2020; Wang & Wang, 2022). For AT+AWP and SOVR+AWP, we use the training setup that Wu et al. (2020) use for TRADES+AWP, changing only the losses, because we found that it achieves a better result. We trained models three times and report the average and standard deviation of test accuracies.
For the hyperparameters in SOVR, we set (M, λ) to (40, 0.4) for CIFAR10 (RN18), (30, 0.4) for CIFAR10 (WRN), (50, 0.5) for CIFAR100, and (20, 0.2) for SVHN. We set (M, λ) to (20, 0.8) for TSOVR and (100, 1.2) for TSOVR+AWP. We use Auto-Attack to evaluate the robust accuracy on test data. In Appendices E and F, we provide the details of the setups and additional results, e.g., evaluations using various other attacks. To determine statistically significant differences, we use a t-test with a significance level of 0.05.

6.2. RESULTS

We list the robust accuracy against Auto-Attack on all datasets in Tab. 1. In this table, SOVR outperforms the importance-aware methods and AT in terms of robustness against Auto-Attack. This is because SOVR increases the logit margins |ℓ_LM| by using OVR; in fact, Fig. 4 shows that SOVR increases the logit margins of difficult samples. MART improved robustness on SVHN and CIFAR100, possibly because MART does not just impose weights on the loss; however, its improvement is smaller than that of SOVR. EWAT also achieves higher robust accuracies than AT on several datasets in Tab. 1, but it is not as robust as SOVR because EWAT employs weighted cross-entropy, which is less effective at increasing logit margins than OVR (Theorem 4.5). In Tab. 1, SOVR slightly sacrifices clean accuracies under some settings. We provide the histograms of logit margin losses on all datasets in Appendix F.1, which also show that SOVR increases the margins.

6.2.1. EMPIRICAL EVALUATION OF POTENTIALLY MISCLASSIFIED SAMPLES

As discussed in Section 3.1, we can evaluate the robustness near each data point via the logit margins and the Lipschitz constants of the class-wise logit functions. In this section, we estimate the number of potentially misclassified samples for each method. Since the Lipschitz constants of deep neural networks are difficult to compute exactly, we use the gradient norm of the logit function instead, because the gradient norm satisfies sup_x ||∇_x z_k(x)||_1 = L_k for L_k such that |z_k(x_1) - z_k(x_2)| ≤ L_k ||x_1 - x_2||_∞ (Jordan & Dimakis, 2020), and we have:

Proposition 6.1. If a data point x satisfies

z_{k*}(x) - z_y(x) > -(max_k ||∇_x z_k(x)||_1 + ||∇_x z_y(x)||_1) ε,   (21)

it is a potentially misclassified sample.

Thus, we can empirically estimate the number of potentially misclassified samples for each method from the gradient norms. Figure 6 plots the rate of data points that satisfy Eq. (21), i.e., potentially misclassified samples. Comparing Fig. 6 with Tab. 1, methods with a large rate on test data have low robust accuracies against Auto-Attack. This indicates that the rate is a reasonable metric for estimating robustness even though it uses gradient norms instead of Lipschitz constants. Whereas most importance-aware methods have higher rates than AT due to their small logit margins, SOVR has a lower rate. This is because SOVR increases the logit margins without increasing the gradient norms by using OVR, which is more effective at increasing logit margins than weighted cross-entropy (Section 4.2). In Fig. 6, EWAT does not necessarily increase the rate, though it uses weighted cross-entropy. We discuss why the rate of AT gets close to that of SOVR on CIFAR100 in Appendix F.10.
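The check in Proposition 6.1 is exact for a linear model z = Wx, where ||∇_x z_k||_1 = ||W_k||_1 is precisely the Lipschitz constant L_k under the L_∞ norm; a minimal NumPy sketch (the helper name and the linear-model setting are ours):

```python
import numpy as np

def potentially_misclassified(x, y, Wm, eps=8/255):
    """Proposition 6.1 for a linear model z = Wm @ x: the L1 norm of each
    row of Wm is exactly the gradient norm ||grad_x z_k||_1 = L_k."""
    z = Wm @ x
    rest = z.copy()
    rest[y] = -np.inf
    k_star = int(rest.argmax())                  # runner-up class
    l_max = np.abs(Wm).sum(axis=1).max()         # max_k ||grad z_k||_1
    l_y = np.abs(Wm[y]).sum()                    # ||grad z_y||_1
    return bool(z[k_star] - z[y] > -(l_max + l_y) * eps)
```

For deep networks the gradient norms are computed by backpropagation, and the estimate in Fig. 6 counts how many (training or test) points satisfy the condition.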

6.2.2. EXTENSION FOR OTHER METHODS

To improve the robust accuracy on test data, our method mainly focuses on improving the robustness around training data points rather than on regularization. Even so, SOVR can be used with recent regularization methods (Wu et al., 2020; Wang & Wang, 2022), which sometimes sacrifice training accuracy to reduce the generalization gap. We evaluated the combinations of SOVR with TRADES (TSOVR), with AWP, and with SEAT. Table 2 lists the robust accuracies of the combinations against Auto-Attack and shows that SOVR improves the performance of these recent methods; thus, SOVR and these methods complementarily improve performance. Among AT, SOVR, TRADES, and TSOVR, SOVR achieved the best trade-off: it achieved robust accuracies similar to those of TRADES while achieving better clean accuracy (Appendix F.6 provides a detailed discussion of the trade-off). We compare them on other datasets in Appendix F.9. TSOVR+AWP achieved the best robust accuracy, which is statistically significantly different from that of TRADES+AWP. Note that whereas the time and space overheads of AWP and SEAT, respectively, depend on the model size, the overhead of SOVR depends only on the batch size, and thus it is easy to use with other methods.

7. CONCLUSION

We investigated the reason importance-aware methods fail to improve the robustness against Auto-Attack. Our empirical results showed that they reduce the logit margins of easy samples in addition to those of difficult samples. From this observation, we proposed SOVR, which switches from cross-entropy to OVR by the criterion of the logit margin loss. We theoretically showed that OVR increases logit margins more than cross-entropy for a simple problem, and experimentally showed that SOVR increases the margins and improves robustness.

A PROOFS

A.1 THE PROOF OF PROPOSITION 3.2

Proof. From the definition of Lipschitz constants, we have

|z_k(x + δ) - z_k(x)| ≤ L_k ||x + δ - x||_∞ ≤ L_k ε.   (22)

Thus, z_k(x + δ) ≤ z_k(x) + L_k ε and z_k(x + δ) ≥ z_k(x) - L_k ε. Therefore, the following inequalities hold:

max_{k≠y} z_k(x + δ) - z_y(x + δ) ≤ z_{k'}(x) + L_{k'} ε - (z_y(x) - L_y ε) ≤ z_{k*}(x) + L_k̂ ε - z_y(x) + L_y ε = z_{k*}(x) - z_y(x) + (L_k̂ + L_y) ε,   (23)

where k' = argmax_{k≠y} z_k(x + δ), k* = argmax_{k≠y} z_k(x), and k̂ = argmax_k L_k. Since z_{k*}(x) - z_y(x) ≤ -(L_k̂ + L_y) ε for samples that are not potentially misclassified, Eq. (23) gives max_{k≠y} z_k(x + δ) - z_y(x + δ) ≤ 0 for all δ ∈ {δ | ||δ||_∞ ≤ ε}. Thus, models are guaranteed to classify adversarial examples of these data points accurately.

A.2 THE PROOF OF THEOREM 4.1

Proof. When a model uses a softmax function, ℓ_CE can be written with the logit functions as

ℓ_CE(z(x, θ), y) = -z_y(x) + log Σ_k e^{z_k(x)}.   (24)

Compared with

ℓ_OVR(z(x, θ), y) = -z_y(x) + Σ_k log(1 + e^{z_k(x)}),   (25)

the difference is only the second term. Thus, ℓ_OVR(z(x, θ), y) - ℓ_CE(z(x, θ), y) can be written as

ℓ_OVR(z(x, θ), y) - ℓ_CE(z(x, θ), y) = Σ_k log(1 + e^{z_k}) - log Σ_k e^{z_k} = log Π_k (1 + e^{z_k}) - log Σ_k e^{z_k}.   (26)

Since the logarithm is a strictly increasing function, we have

Π_k (1 + e^{z_k}) - Σ_k e^{z_k} ≥ 0 ⇒ log Π_k (1 + e^{z_k}) - log Σ_k e^{z_k} ≥ 0.   (27)

Since e^{z_k} ≥ 0 for any z_k ∈ R, we have

Π_k (1 + e^{z_k}) - Σ_k e^{z_k} = 1 + Σ_k e^{z_k} + R - Σ_k e^{z_k} = 1 + R ≥ 0,   (28)

where R denotes the second- and higher-order terms in the e^{z_k}, which take non-negative values because e^{z_k} ≥ 0. Thus, the premise of Eq. (27) holds, and we have log Π_k (1 + e^{z_k}) - log Σ_k e^{z_k} ≥ 0. Therefore, ℓ_OVR(z(x, θ), y) - ℓ_CE(z(x, θ), y) ≥ 0; i.e., 0 ≤ ℓ_CE(z(x, θ), y) ≤ ℓ_OVR(z(x, θ), y) since ℓ_CE(z(x, θ), y) ≥ 0.
Next, when z_k(x) → -∞ for k ≠ y, we have e^{z_k} → 0 and

lim_{z_y→+∞, z_k→-∞ for k≠y} ℓ_CE(z(x, θ), y) = lim [-z_y(x) + log Σ_k e^{z_k(x)}] = -z_y + log(e^{z_y}) = 0.   (29)

On the other hand, when z_y(x) → +∞ and z_k(x) → -∞ for k ≠ y, we have log(1 + e^{z_y}) → z_y and log(1 + e^{z_k}) → 0. Thus, we have

lim_{z_y→+∞, z_k→-∞ for k≠y} ℓ_OVR(z(x, θ), y) = lim [-z_y(x) + Σ_k log(1 + e^{z_k(x)})] = -z_y + z_y = 0,   (30)

which completes the proof.

A.3 THE PROOF OF LEMMA 4.3

Proof. From the assumption, we consider the following ordinary differential equation (ODE):
$$\frac{dz_k}{dt} = -w \frac{\partial \ell_{OVR}(z, y)}{\partial z_k},$$
with initial condition $z(0) = 0$. For the correct label $y$, the gradient of OVR is $\frac{\partial \ell_{OVR}(z, y)}{\partial z_y} = -\frac{1}{1 + e^{z_y}}$, so
$$\frac{dz_y}{dt} = \frac{w}{1 + e^{z_y}}, \qquad (1 + e^{z_y})\, dz_y = w\, dt, \qquad z_y + e^{z_y} = wt + c, \qquad e^{z_y} e^{e^{z_y}} = e^{wt + c},$$
where $c$ is a constant determined by the initial condition. Applying the Lambert W function (Corless et al., 1996) to both sides and using $\log W(x) = \log x - W(x)$ for $x > 0$, we obtain $e^{z_y} = W(e^{wt+c})$ and $z_y = wt + c - W(e^{wt+c})$. From $z_y(0) = 0$, $c$ satisfies $c - W(e^c) = 0$. Since $W(x e^x) = x$, we have $c = 1$, and the logit of the correct label is
$$z_y = wt + 1 - W(e^{wt+1}).$$
Next, we consider the logit $z_k$ of another label $k \neq y$. Since the gradient for an incorrect label is $\frac{\partial \ell_{OVR}(z, y)}{\partial z_k} = \frac{e^{z_k}}{1 + e^{z_k}}$, we have
$$\frac{dz_k}{dt} = -\frac{w e^{z_k}}{1 + e^{z_k}}, \qquad (e^{-z_k} + 1)\, dz_k = -w\, dt.$$
Solving in the same way as for $z_y$, we obtain
$$z_k = -wt - 1 + W(e^{wt+1}), \quad k \neq y,$$
which completes the proof.

A.4 THE PROOF OF LEMMA 4.4

Proof. We first solve the ODE for the logit of the correct label $z_y$. The gradient of cross-entropy is
$$\frac{\partial \ell_{CE}(z, y)}{\partial z_k} = -\delta_{ky} + \frac{e^{z_k}}{\sum_m e^{z_m}},$$
which gives the ODE
$$\frac{dz_y}{dt} = w \frac{\sum_{m \neq y} e^{z_m}}{\sum_{m \neq y} e^{z_m} + e^{z_y}}.$$
Since $z = 0$ at $t = 0$, we have $z_i = z_j$ for all $i, j \neq y$ by symmetry. In addition, $\sum_i \frac{dz_i(t)}{dt} = -w \sum_i \frac{\partial \ell_{CE}(z, y)}{\partial z_i} = 0$ for all $t$. Thus, the logits satisfy
$$z_y = -(K - 1) z_k, \quad k \neq y.$$
Substituting this relation, the ODE becomes
$$\frac{dz_y}{dt} = w \frac{(K-1) e^{-\frac{1}{K-1} z_y}}{(K-1) e^{-\frac{1}{K-1} z_y} + e^{z_y}},$$
$$(K - 1 + e^{\frac{K}{K-1} z_y})\, dz_y = w (K-1)\, dt,$$
$$\frac{K}{K-1} z_y + \frac{1}{K-1} e^{\frac{K}{K-1} z_y} = \frac{K}{K-1} wt + c,$$
$$\frac{1}{K-1} e^{\frac{K}{K-1} z_y}\, e^{\frac{1}{K-1} e^{\frac{K}{K-1} z_y}} = \frac{1}{K-1} e^{\frac{K}{K-1} wt + c},$$
$$\frac{1}{K-1} e^{\frac{K}{K-1} z_y} = W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} wt + c}\right),$$
$$\frac{K}{K-1} z_y = \log\left[(K-1)\, W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} wt + c}\right)\right],$$
$$z_y = wt + \frac{K-1}{K} c - \frac{K-1}{K} W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} wt + c}\right),$$
where $c$ is a constant determined by the initial condition.
From the assumption, $z_y(0) = 0$ gives
$$\frac{K-1}{K} c - \frac{K-1}{K} W\!\left(\frac{1}{K-1} e^c\right) = 0, \qquad W\!\left(\frac{1}{K-1} e^c\right) = c.$$
Thus $c = \frac{1}{K-1}$ since $W(x e^x) = x$. Combining this with $z_k = -\frac{1}{K-1} z_y$, we have
$$z_y(t) = wt + \frac{1}{K} - \frac{K-1}{K} W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} wt + \frac{1}{K-1}}\right),$$
$$z_k(t) = -\frac{1}{K-1} wt - \frac{1}{K(K-1)} + \frac{1}{K} W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} wt + \frac{1}{K-1}}\right), \quad k \neq y,$$
which completes the proof.

A.5 THE PROOF OF THEOREM 4.5

Proof. From Lemmas 4.3 and 4.4, we have
$$\ell_{LM}(z_{OVR}(t)) = -2 w_1 t - 2 + 2 W(e^{w_1 t + 1}),$$
$$\ell_{LM}(z_{CE}(t)) = -\frac{K}{K-1} w_2 t - \frac{1}{K-1} + W\!\left(\frac{1}{K-1} e^{\frac{K}{K-1} w_2 t + \frac{1}{K-1}}\right).$$
Since $W(x) = \log x - \log(\log x) + O(1)$ for large $x$ (Hoorfar & Hassani, 2007), we have
$$\ell_{LM}(z_{OVR}(t)) \approx -2 w_1 t - 2 + 2\left(w_1 t + 1 - \log(w_1 t + 1)\right) = -\log(w_1 t + 1)^2,$$
$$\ell_{LM}(z_{CE}(t)) \approx -\frac{K}{K-1} w_2 t - \frac{1}{K-1} - \log(K-1) + \frac{K}{K-1} w_2 t + \frac{1}{K-1} - \log\!\left(\frac{K}{K-1} w_2 t + \frac{1}{K-1} - \log(K-1)\right) = -\log\!\left(K w_2 t + 1 - (K-1)\log(K-1)\right).$$
From the above, we have
$$\lim_{t \to \infty} \frac{\ell_{LM}(z_{OVR}(t))}{\ell_{LM}(z_{CE}(t))} = \lim_{t \to \infty} \frac{\log(w_1 t + 1)^2 + O(1)}{\log(K w_2 t + 1 - (K-1)\log(K-1)) + O(1)} = \lim_{t \to \infty} \frac{2 \log t + 2\log(w_1 + t^{-1}) + O(1)}{\log t + \log(K w_2 + t^{-1}(1 - (K-1)\log(K-1))) + O(1)} = 2,$$
which completes the proof.

Algorithm 1 Switching one-versus-the-rest by the criterion of a logit margin loss
1: Select the minibatch $B$
2: for $x_n \in B$ do
3: &nbsp;&nbsp;Generate adversarial examples $x'_n = \arg\max_{\|x'_n - x_n\|_\infty \le \varepsilon} \ell_{CE}(z(x'_n, \theta), y_n)$ by PGD
4: &nbsp;&nbsp;$\ell_{LM}(z(x'_n, \theta), y_n) = \max_{k \neq y_n} z_k(x'_n) - z_{y_n}(x'_n)$
5: end for
6: Select the top $\frac{M}{100}|B|$ samples $(x'_n, y_n)$ in terms of $\ell_{LM}(z(x'_n, \theta), y_n)$ and add them to $L$
7: $L_{SOVR}(\theta) = \frac{1}{|B|}\left[\sum_{(x', y) \in B \setminus L} \ell_{CE}(z(x', \theta), y) + \lambda \sum_{(x', y) \in L} \ell_{OVR}(z(x', \theta), y)\right]$
8: Update the parameter $\theta \leftarrow \theta - \eta \nabla_\theta L_{SOVR}(\theta)$

A.6 THE PROOF OF PROPOSITION 6.1

Proof.
Since $\sup_x \|\nabla_x z_k(x)\|_q = L_k$ for an $L_k$-Lipschitz function satisfying $|z_k(x_1) - z_k(x_2)| \le L_k \|x_1 - x_2\|_p$ with $1/q + 1/p = 1$ (Jordan & Dimakis, 2020), the following inequalities hold when $z_{k^*}(x) - z_y(x) > -(\max_k \|\nabla_x z_k(x)\|_1 + \|\nabla_x z_y(x)\|_1)\varepsilon$ and $p = \infty$:
$$z_{k^*}(x) - z_y(x) > -\left(\max_k \|\nabla_x z_k(x)\|_1 + \|\nabla_x z_y(x)\|_1\right)\varepsilon \ge -\left(\max_k \sup_x \|\nabla_x z_k(x)\|_1 + \sup_x \|\nabla_x z_y(x)\|_1\right)\varepsilon = -(L_{\bar{k}} + L_y)\varepsilon,$$
because $\|\nabla_x z_k(x)\|_1 \le L_k$ for $p = \infty$. Thus, $z_{k^*}(x) - z_y(x) > -(L_{\bar{k}} + L_y)\varepsilon$ holds under this condition, which completes the proof.
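Propositions 3.2 and 6.1 together suggest a simple screening rule on logits and Lipschitz bounds. The sketch below is our own illustration, not the authors' code; the function name and interface are hypothetical, and per-class Lipschitz upper bounds `lip` are assumed to be available:

```python
def potentially_misclassified(z, y, lip, eps):
    """Illustrative check of the sufficient condition in Proposition 3.2:
    if max_{k != y} z_k(x) - z_y(x) <= -(max_k L_k + L_y) * eps, no l_inf
    perturbation of size eps can flip the largest logit, so the sample is
    NOT potentially misclassified. z: logits; lip: upper bounds L_k on the
    l_inf Lipschitz constants of each logit."""
    k_star = max((k for k in range(len(z)) if k != y), key=lambda k: z[k])
    margin = z[k_star] - z[y]          # logit margin loss (negative if correct)
    slack = (max(lip) + lip[y]) * eps  # worst-case movement of the margin
    return margin > -slack

# large negative logit margin: certified under the budget eps = 8/255
print(potentially_misclassified([1.0, 9.0, 0.5], y=1, lip=[2.0, 2.0, 2.0], eps=8/255))  # False
# margin close to zero: an attack may flip the prediction
print(potentially_misclassified([8.9, 9.0, 0.5], y=1, lip=[2.0, 2.0, 2.0], eps=8/255))  # True
```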

B ALGORITHM

The proposed algorithm is shown in Algorithm 1. We first generate $x'$ in Line 3 and compute $\ell_{LM}$ for each sample in Line 4. In Line 6, we select the top $M\%$ of samples in the minibatch and add them to $L$. Finally, we compute the objective $L_{SOVR}$ and its gradient to update $\theta$. Since we do not generate additional adversarial examples for $\ell_{OVR}$, the overhead of our method is $O(|B| \log \frac{M}{100}|B|)$, the cost of the heap-based selection of $L$ in Line 6. This is negligible in the overall computation since deep models have far more parameters than the minibatch size $|B|$.
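The selection and loss-combination steps (Lines 4–7) can be sketched as follows. This is a minimal illustration of ours, not the authors' implementation: PGD generation (Line 3) is omitted, the logits of the adversarial examples are assumed to be given, and the per-sample losses are passed in as functions.

```python
import heapq
import math

def sovr_loss(logits, labels, M, lam, l_ce, l_ovr):
    """Sketch of Lines 4-7 of Algorithm 1: given logits z(x'_n, theta) on
    adversarial examples, switch the top-M% samples (by logit margin loss)
    from cross-entropy to a lambda-weighted OVR loss."""
    B = len(labels)
    # Line 4: logit margin loss max_{k != y} z_k(x') - z_y(x')
    lm = [max(z[k] for k in range(len(z)) if k != y) - z[y]
          for z, y in zip(logits, labels)]
    # Line 6: top M% hardest samples; O(B log(M*B/100)) with a heap
    m = max(1, round(M / 100 * B))
    hard = set(heapq.nlargest(m, range(B), key=lambda n: lm[n]))
    # Line 7: CE on the easy samples, lambda-weighted OVR on the hard set L
    total = sum(lam * l_ovr(logits[n], labels[n]) if n in hard
                else l_ce(logits[n], labels[n]) for n in range(B))
    return total / B

# logit forms of the two losses for a quick check
ce = lambda z, y: -z[y] + math.log(sum(math.exp(v) for v in z))
ovr = lambda z, y: -z[y] + sum(math.log1p(math.exp(v)) for v in z)
demo = sovr_loss([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]], [0, 2], M=50, lam=0.5,
                 l_ce=ce, l_ovr=ovr)
print(demo)
```

With M = 50 and a batch of two, the sample with the larger logit margin loss (the second, whose true-label logit is not the largest) is switched to OVR.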

C.1 MART (WANG ET AL., 2020A)

MART (Wang et al., 2020a) uses a similar approach to importance-aware methods. It regards misclassified samples as important and minimizes
$$\ell_{MART}(x', y, \theta) = BCE(f(x', \theta), y) + \lambda\, KL(f(x, \theta) \| f(x', \theta)) \cdot (1 - f_y(x, \theta)), \quad (71)$$
where $BCE(f(x', \theta), y) = -\log(f_y(x', \theta)) - \log(1 - \max_{k \neq y} f_k(x', \theta))$ and KL is the Kullback-Leibler divergence. MART controls the difference between the losses on unimportant and important samples via $1 - f_y(x, \theta)$: MART tends to ignore the second term when the model is confident in the true label.

C.2 MMA (DING ET AL., 2020)

MMA (Ding et al., 2020) attempts to maximize the distance between data points and the decision boundary for robustness. MMA regards $\min_\delta \|\delta_n\|_\infty$ subject to $\ell_{LM}(z(x + \delta, \theta), y) \ge 0$ as the distance. Using this distance, MMA minimizes the following loss:
$$L(\theta) = \frac{1}{3} \sum_{n=1}^N \ell_{CE}(z(x_n, \theta), y_n) + \frac{2}{3} L_{MMA}(\theta), \quad (72)$$
$$L_{MMA}(\theta) = \sum_{(x, y) \in S^+ \cap H} \ell_{CE}(z(x + \delta_{MMA}, \theta), y) + \sum_{(x, y) \in S^-} \ell_{CE}(z(x, \theta), y),$$
$$\delta_{MMA} = \arg\min_{\ell_{SLM}(z(x + \delta, \theta), y) \ge 0} \|\delta\|_\infty, \quad (74)$$
$$\ell_{SLM}(z(x, \theta), y) = \log \sum_{k \neq y} e^{z_k(x)} - z_y(x),$$
where $S^+$ is the set of correctly classified data points and $S^-$ is the set of misclassified samples. $H$ is the set of data points whose distance is smaller than a threshold $d_{max}$: $H = \{(x_n, y_n) \mid \min_{\delta_n} \|\delta_n\|_\infty \le d_{max}\}$. Since MMA uses $\delta$ whose magnitude $\|\delta\|_\infty$ depends on the data point as in Eq. (74), we consider that it has effects similar to the importance-aware methods. In MMA, Ding et al. (2020) use $\ell_{SLM}(z(x, \theta), y)$ as a differentiable approximation of the logit margin loss, replacing the max with the differentiable function $\log \sum_{k \neq y} e^{z_k(x)}$. Comparing OVR with $\ell_{SLM}(z(x, \theta), y)$, we have
$$\ell_{SLM}(z(x, \theta), y) \le \ell_{CE}(z(x, \theta), y) \le \ell_{OVR}(z(x, \theta), y).$$
This is because $\ell_{CE}(z(x), y) - \ell_{SLM}(z(x), y) = \log \sum_k e^{z_k(x)} - \log \sum_{k \neq y} e^{z_k(x)} = \log\left(1 + e^{z_y(x)} / \sum_{k \neq y} e^{z_k(x)}\right) \ge 0$, and $\ell_{CE}(z(x), y) \le \ell_{OVR}(z(x), y)$ from Theorem 4.1. Thus, we expect that OVR penalizes small logit margins more strongly than $\ell_{SLM}$.
Note that the training algorithm of MMA also differs from those of the other importance-aware methods (Ding et al., 2020).
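The chain $\ell_{SLM} \le \ell_{CE} \le \ell_{OVR}$ can be verified numerically. A small sketch of our own, using the logit-space definitions above:

```python
import math
import random

def slm(z, y):
    # soft logit margin loss of MMA: log(sum_{k != y} e^{z_k}) - z_y
    return math.log(sum(math.exp(z[k]) for k in range(len(z)) if k != y)) - z[y]

def ce(z, y):
    # cross-entropy in logit form
    return -z[y] + math.log(sum(math.exp(v) for v in z))

def ovr(z, y):
    # OVR in logit form: -z_y + sum_k log(1 + e^{z_k})
    return -z[y] + sum(math.log1p(math.exp(v)) for v in z)

random.seed(1)
for _ in range(1000):
    z = [random.uniform(-4.0, 4.0) for _ in range(5)]
    y = random.randrange(5)
    assert slm(z, y) <= ce(z, y) + 1e-12 <= ovr(z, y) + 2e-12
print("l_SLM <= l_CE <= l_OVR holds on random logits")
```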

C.3 EWAT (KIM ET AL., 2021)

EWAT uses a weighted cross-entropy like GAIRAT and MAIL, but the weight is added to the cross-entropy as
$$L_{weight}(\theta) = \frac{1}{N} \sum_{n=1}^N (1 + \tilde{w}_n)\, \ell_{CE}(z(x'_n, \theta), y_n),$$
where $\tilde{w}_n \ge 0$ is the weight divided by the batch mean of the weights: $\tilde{w}_n = \frac{|B| w_n}{\sum_{l=1}^{|B|} w_l}$. EWAT determines the weights by entropy: $w_n = -\sum_{k=1}^K f_k(x_n, \theta) \log(f_k(x_n, \theta))$, where $f_k(x_n, \theta)$ is the $k$-th softmax output and can thus be regarded as the probability of the $k$-th class label. EWAT is based on the observation that importance-aware methods tend to have high entropy, which causes their vulnerability. Our theoretical results about logit margins and our experiments indicate that the logit margin loss is a more reasonable criterion than entropy for evaluating and improving robustness. Furthermore, Theorem 4.5 shows that the weighted cross-entropy is less effective than OVR at increasing logit margins.
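EWAT's weighting scheme can be sketched as follows; `ewat_weights` is a hypothetical helper name of ours, not from the EWAT code:

```python
import math

def ewat_weights(probs):
    """Entropy-based weights of EWAT: w_n = -sum_k f_k log f_k for each row
    of softmax outputs, normalized so that the batch mean is one."""
    w = [-sum(p * math.log(p) for p in row if p > 0) for row in probs]
    mean = sum(w) / len(w)
    return [wn / mean for wn in w]

probs = [[0.25, 0.25, 0.25, 0.25],   # maximal entropy: uncertain sample
         [0.97, 0.01, 0.01, 0.01]]   # confident prediction: low entropy
w_tilde = ewat_weights(probs)
print(w_tilde)  # the uncertain sample gets weight > 1, the confident one < 1
```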

D SELECTION OF φ IN OVR

D.1 INFINITE-SAMPLE CONSISTENCY

Infinite-sample consistency (ISC, also known as classification calibration or Fisher consistency) is a desirable property for multi-class classification problems (Zhang, 2004; Bartlett et al., 2003; Lin, 2002). We first introduce ISC in this section. Let $f(x)$ be a model and $c$ be the classifier $c(x) = \arg\max_k f_k(x)$. The classification error $\ell^*$ is
$$\ell^*(c(\cdot)) := \mathbb{E}_x \sum_{k=1,\, k \neq c(x)}^K a_k\, p(y = k | x), \quad (79)$$
where $a_k$ is a weight for the $k$-th label and is usually set to one. The optimal classification rule, called a Bayes rule, is given by
$$c^*(x) = \arg\max_{k \in \{1, \ldots, K\}} a_k\, p(y = k | x). \quad (80)$$
Since Eq. (79) is difficult to minimize directly, we use a surrogate loss function $\ell$. In classification problems, we obtain the model $f(\cdot)$ by minimizing the empirical risk:
$$f(\cdot) = \arg\min_{f(\cdot)} \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i).$$
On the other hand, the true risk using $\ell$ is written as
$$\mathbb{E}_{x, y}\, \ell(f, y) = \mathbb{E}_x W(p(\cdot | x), f(x)), \qquad W(q, f) := \sum_{k=1}^K q_k\, \ell(f, k), \quad (82)$$
where $p(\cdot | x) = [p(1|x), \ldots, p(K|x)]$ and $q$ is a vector in the set
$$\Lambda_K := \left\{ q \in \mathbb{R}^K : \sum_{k=1}^K q_k = 1,\; q_k \ge 0 \right\}.$$
$W(q, f)$ is the point-wise true loss of model $f$ with the conditional probability $q$. ISC is then defined as follows:

Definition D.1. (Zhang, 2004) We say that the formulation is infinite-sample consistent (ISC) on a set $\Omega \subseteq \mathbb{R}^K$ with respect to Eq. (79) if the following conditions hold:
- For each $k$, $\ell(\cdot, k) : \Omega \to \mathbb{R}$ is bounded below and continuous.
- For all $q \in \Lambda_K$ and $k \in \{1, \ldots, K\}$ such that $a_k q_k < \sup_i a_i q_i$, we have $W^*(q) := \inf_{f \in \Omega} W(q, f) < \inf\{W(q, f) \mid f \in \Omega,\; f_k = \sup_i f_i\}$.

This definition indicates that the optimal solution of $W(q, \cdot)$ leads to a Bayes rule with respect to the classification error (Zhang, 2004): the minimizer of Eq. (82) becomes the minimizer of the classification error $\ell^*$ (Eq. (79)). Thus, surrogate loss functions $\ell$, e.g., cross-entropy or OVR, should satisfy ISC to minimize the classification error.

D.2 EVALUATION OF φ

It is known that ISC is satisfied when $\varphi$ in Eq. (7) is a differentiable non-negative convex function satisfying $\varphi(z) < \varphi(-z)$ for $z > 0$. Among common nonlinear functions used in deep neural networks, $e^{-z}$ and $\log(1 + e^{-z})$ satisfy this condition. We first evaluated $e^{-z}$ and observed that it causes numerical instability. On the other hand, $\log(1 + e^{-z})$ tends to be numerically stable because it asymptotically approaches $\max(-z, 0)$. In addition, let the conditional probability for class $k$ given $x$ be
$$p(k | x) = \frac{1}{1 + e^{-z_k(x)}} \quad (86)$$
when we choose $\varphi(z) = \log(1 + e^{-z})$. We then have
$$\ell_{OVR}(z(x, \theta), y) = \log(1 + e^{-z_y(x)}) + \sum_{k \neq y} \log(1 + e^{z_k(x)}) = -\log p(y | x) + \sum_{k \neq y} -\log(1 - p(k | x)), \quad (87)$$
and we can regard the model as $K$ independent binary classifiers (Padhy et al., 2020).
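The stability difference between the two choices of $\varphi$ can be seen concretely. Below is a sketch of ours (not the paper's code): $\log(1 + e^{-z})$ is computed in the standard softplus form $\max(t, 0) + \log(1 + e^{-|t|})$, which never exponentiates a large positive number, while $\varphi(z) = e^{-z}$ overflows as soon as a logit is strongly negative.

```python
import math

def softplus(t):
    # numerically stable log(1 + e^t) = max(t, 0) + log1p(e^{-|t|})
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def ovr_stable(z, y):
    # OVR with phi(z) = log(1 + e^{-z}), computed via softplus:
    # phi(z_y) = softplus(-z_y) and log(1 + e^{z_k}) = softplus(z_k)
    return softplus(-z[y]) + sum(softplus(z[k]) for k in range(len(z)) if k != y)

# phi(z) = e^{-z} overflows for a strongly negative true-class logit ...
try:
    phi_exp = math.exp(-(-800.0))
except OverflowError:
    phi_exp = math.inf
# ... while the log(1 + e^{-z}) form stays finite
print(phi_exp, ovr_stable([-800.0, 5.0], y=0))
```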

E EXPERIMENTAL SETUPS

We conducted experiments to evaluate our proposed method. We first compared our method with baseline methods: Madry's AT (Madry et al., 2018), MMA (Ding et al., 2020), MART (Wang et al., 2020a), GAIRAT (Zhang et al., 2021b), MAIL (Liu et al., 2021), and EWAT (Kim et al., 2021) on three datasets: CIFAR10, SVHN, and CIFAR100 (Krizhevsky & Hinton, 2009; Netzer et al., 2011). Next, we evaluated the combination of our method with TRADES (Zhang et al., 2019), AWP (Wu et al., 2020), and SEAT (Wang & Wang, 2022). Our experimental code is based on the source code provided by Wu et al. (2020); Wang et al. (2020a); Ding et al. (2020). We used PreActResNet-18 (RN18) (He et al., 2016) and WideResNet-34-10 (WRN) (Zagoruyko & Komodakis, 2016) following Wu et al. (2020). The $L_\infty$ norm of the perturbation was set to $\varepsilon = 8/255$, and all elements of $x_i + \delta_i$ were clipped to [0, 1]. We used early stopping by evaluating test robust accuracies against 20-step PGD. To evaluate TRADES, AWP, and SEAT, we used the original public code (Wu et al., 2020; Wang & Wang, 2022). We trained models three times and report the average and standard deviation of test accuracies. We used Auto-Attack to evaluate the robust accuracy on test data, and one GPU among NVIDIA V100 and NVIDIA A100 for each training run. For MART, we used the MART loss in the original code (Wang et al., 2020a) as the loss function; λ of MART was set to 6.0. For GAIRAT and MAIL, we also used the loss functions in the original codes (Zhang et al., 2021b; Liu et al., 2021), and thus the hyperparameters of their loss functions were based on them. λ of GAIRAT was set to ∞ until the 50-th epoch and then set to 3.0. (γ, β) of MAIL was set to (10, 0.5). For all settings, the minibatch size was set to 128. The detailed setup for each dataset was as follows.

E.1 MMA

We trained models with MMA based on the original code (Ding et al., 2020). Thus, the learning rate schedules and PGD hyperparameters for MMA differ from those of the other methods because its training algorithm is different. The step size of PGD in MMA was set to 2.5ε/10 in training following Ding et al. (2020). For AN-PGD in MMA, the maximum perturbation length was 1.05 times the hinge threshold, ε_max = 1.05 d_max, and d_max was set to 0.1255. The learning rate of SGD was set to 0.3 at the 0-th parameter update, 0.09 at the 20000-th, 0.03 at the 30000-th, and 0.009 at the 40000-th.

E.2 CIFAR10

For PreActResNet18, the learning rate of SGD was divided by 10 at the 100-th and 150-th epochs except for EWAT, and the initial learning rate was set to 0.05 for SOVR and 0.1 for the others. We tested an initial learning rate of 0.05 for the other methods and found that 0.1 achieved better robust accuracies against Auto-Attack when using RN18. For EWAT, we divided the learning rate of SGD at the 100-th and 105-th epochs following Kim et al. (2021) after finding that dividing at the 100-th and 150-th epochs was worse. When using WideResNet34-10, we set the initial learning rate to 0.1 and divided it by 10 at the 100-th and 150-th epochs. We used momentum of 0.9, weight decay of 0.0005, and early stopping by evaluating test accuracies. We standardized the dataset using mean = [0.4914, 0.4822, 0.4465] and std = [0.2471, 0.2435, 0.2616] as pre-processing. (M, λ) was tuned by grid search over M ∈ [20, . . . , 80, 100] and λ ∈ [0.2, . . . , 0.8, 1.0] for RN18, and tuned coarsely for WRN due to the high computation cost.

E.3 CIFAR100

We used PreActResNet18 for CIFAR100. The learning rate of SGD was divided by 10 at the 100-th and 150-th epochs except for EWAT, and the initial learning rate was set to 0.1. Note that we found this setting better than an initial learning rate of 0.05 for all methods. For EWAT, we divided the learning rate of SGD at the 100-th and 105-th epochs following Kim et al. (2021) after finding that dividing at the 100-th and 150-th epochs was worse. For PGD, we randomly initialized the perturbation and updated it for 10 steps with a step size of 2/255. We used momentum of 0.9, weight decay of 0.0005, and early stopping by evaluating test accuracies. We standardized the dataset using mean = [0.5070751592371323, 0.48654887331495095, 0.4409178433670343] and std = [0.2673342858792401, 0.2564384629170883, 0.27615047132568404] as pre-processing. (M, λ) was set to (0.5, 0.5) based on coarse hyperparameter tuning.

E.4 SVHN

We used PreActResNet18 for SVHN. The learning rate of SGD was divided by 10 at the 100-th and 150-th epochs, and the initial learning rate was set to 0.05 for SOVR and 0.01 for the others. We tested an initial learning rate of 0.05 for the other methods and found that 0.01 achieved better robust accuracies against Auto-Attack. For EWAT, the learning rate of SGD was divided by 10 at the 100-th and 105-th epochs after we found that this setting was better than dividing at the 100-th and 150-th epochs. The hyperparameters for PGD were based on Wu et al. (2020): we randomly initialized the perturbation and updated it for 10 steps with a step size of 1/255. For preprocessing, we standardized data using a mean of [0.5, 0.5, 0.5] and standard deviations of [0.5, 0.5, 0.5]. (M, λ) was set to (0.2, 0.2) based on coarse hyperparameter tuning.

E.5 TRADES, AWP, AND SEAT

For the experimental settings of TRADES and AWP, we followed Wu et al. (2020) and only changed the training loss to SOVR in the training procedure and in the algorithm for computing the perturbation of AWP for SOVR+AWP, TSOVR, and TSOVR+AWP. We used the original code of AWP (Wu et al., 2020). For AWP and AWP+SOVR, we found that models trained under the setup for TRADES+AWP in the original code, where the dataset is not standardized and AWP is applied after 10 epochs, achieve better robust accuracy than those trained under the setup for cross-entropy+TRADES. Thus, we used the code for TRADES+AWP and changed the loss functions. β_T of TRADES and TSOVR was set to 6. γ of AWP was set to 0.01 for AWP and AWP+SOVR, and to 0.005 for TRADES+AWP and TSOVR+AWP. AWP is applied after the 10-th epoch. We used WideResNet34-10 following Wu et al. (2020), and SGD with momentum of 0.9 and weight decay of 0.0005 for 200 epochs. The learning rate was set to 0.1 and divided by 10 at the 100-th and 150-th epochs. For the experimental settings of SEAT, we followed Wang & Wang (2022) and only changed the loss to SOVR in the original code (Wang & Wang, 2022). We did not evaluate SEAT with CutMix in our experiments, but we fairly compare SEAT+SOVR with SEAT under the same conditions. We used SGD with momentum of 0.9 and weight decay of 7 × 10^-4 for 120 epochs. The initial learning rate was set to 0.1 until the 40-th epoch and then linearly reduced to 0.01 and 0.001 at the 60-th and 120-th epochs, respectively. We used WideResNet32-10 following Wang & Wang (2022) for SEAT. (M, λ) was tuned by grid search over M ∈ [20, . . . , 80, 100] and λ ∈ [0.2, . . . , 0.8, 1.0] for SOVR+AWP, and over M ∈ [20, . . . , 80, 100] and λ ∈ [0.2, . . . , 1.0, 1.2] for TSOVR. (M, λ) was set to (0.5, 0.5) for SOVR+SEAT after coarse hyperparameter tuning.

E.6 EXPERIMENTAL SETUPS IN SECTION 3

For the experiments in Section 3, we used the models obtained under the above settings, which are the same as the models used in Section 6. To obtain the histograms of logit margins, we computed the logit margin loss on adversarial examples of the training dataset for each data point at 200 epochs. Thus, the number of data points is 50,000 for CIFAR10 and CIFAR100 and 73,257 for SVHN. We also provide the results of the models obtained by early stopping in Fig. ??.

F.1 HISTOGRAMS OF LOGIT MARGIN LOSSES

We show the additional histograms of logit margin losses in this section (Fig. 7). Gradient norms of the cross-entropy loss, $\|\nabla_x \ell_{CE}(x, y)\|_1$, are relatively small for all methods. This indicates that adversarial training essentially attempts to suppress the gradient norms of the cross-entropy. MMA has the largest gradient norms, which is why MMA is not robust against the components of Auto-Attack except for SQUARE (Fig. 1), which does not use gradients. GAIRAT and MAIL have the smallest and second smallest $\|\nabla_x \ell_{CE}(x, y)\|_1$, which is why they are robust against PGD despite their small logit margins (Fig. 2). On the other hand, $\max_k \|\nabla_x z_k(x)\|_1$ of importance-aware methods is larger than their $\|\nabla_x \ell_{CE}(x, y)\|_1$ and than that of AT. As a result, they can have a larger rate of potentially misclassified samples (Fig. 6). Gradient norms of cross-entropy for the label with the largest logit other than the true label, $\|\nabla_x \ell_{CE}(x, k^*)\|_1$, are smaller than those for randomly selected labels. This implies $k^* = k$, and we need to use a loss that depends on the logits of all classes rather than the logit margin loss, which cares only about $z_{k^*}$ and $z_y$. Gradient norms of EWAT and SOVR are not significantly different from those of AT. Thus, SOVR can reduce the rate of potentially misclassified samples through large logit margins without large gradient norms.

F.3 TRAJECTORIES OF LOGIT MARGIN LOSSES OF OVR AND CROSS-ENTROPY

In this section, we evaluate the trajectories of logit margin losses in adversarial training on real data. The experimental setup is the same as that of Section 6 except for the learning rate on CIFAR10; here we minimize the OVR and cross-entropy losses averaged over the dataset, unlike Eq. (10). While the learning rates are set to 0.1 for cross-entropy and 0.05 for SOVR in Section 6 on CIFAR10, the learning rate is set to 0.05 for both cross-entropy and OVR in this experiment to fairly compare their logit margin losses. Since we could not obtain results on SVHN with the weight w = 5 due to instability, we used w = 2 on SVHN. Fig. 14 plots the logit margin losses averaged over the dataset against epochs in adversarial training with OVR and cross-entropy on CIFAR10, CIFAR100, and SVHN. In Fig. 14, OVR decreases the logit margin losses more than cross-entropy on all datasets. We also evaluate $\ell_{LM}(z_{OVR})/\ell_{LM}(z_{CE})$ at the last epoch, which is expected to be about two from Theorem 4.5. Table 4 lists this ratio at the last epoch, and it is indeed about two. Thus, logit margin losses in adversarial training follow Theorem 4.5 well, even though the theorem assumes a simple problem that considers only one data point and moves the logits directly by the gradient. Since the number of classes K of CIFAR100 is 100 and larger than in the other datasets, the logit margins of cross-entropy are larger than those of OVR at the beginning of training. This result corresponds to the case of CE (K = 100) in Fig. 3(a), and this phenomenon can also be explained by the simple problem of Eq. (10). To the best of our knowledge, this is the first study that explicitly reveals that the logit margins obtained by minimizing cross-entropy depend on the number of classes.
Though the logit margin loss of OVR in Theorem 4.5 does not depend on K, the logit margins of OVR on CIFAR100 are smaller than those on CIFAR10 and SVHN. This is because CIFAR100 is a more difficult dataset than CIFAR10 and SVHN: robust accuracies on CIFAR100 are about 25%, whereas those on CIFAR10 and SVHN are about 50% in Tab. 1.
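The prediction of Theorem 4.5 can also be reproduced outside the closed forms by directly integrating the gradient-flow ODEs of Lemmas 4.3 and 4.4. The sketch below is our own illustration under the theorem's simplified setting (one data point, logits moved by the gradient, true label y = 0); the integration scheme, w, and K are our choices:

```python
import math

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def grad_ovr(z):
    # d l_OVR / dz_k = sigmoid(z_k) for k != y, and sigmoid(z_y) - 1 for k = y
    g = [sigmoid(v) for v in z]
    g[0] -= 1.0
    return g

def grad_ce(z):
    # d l_CE / dz_k = softmax_k(z) - 1[k = y]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    g = [v / s for v in e]
    g[0] -= 1.0
    return g

def margin_loss_after_flow(grad, K, w, steps, dt):
    # Euler-integrate dz/dt = -w * grad(z) from z = 0; return max_{k != y} z_k - z_y
    z = [0.0] * K
    for _ in range(steps):
        g = grad(z)
        z = [zk - dt * w * gk for zk, gk in zip(z, g)]
    return max(z[1:]) - z[0]

K, w, steps, dt = 10, 1.0, 40000, 0.05   # integrates up to t = 2000
lm_ovr = margin_loss_after_flow(grad_ovr, K, w, steps, dt)
lm_ce = margin_loss_after_flow(grad_ce, K, w, steps, dt)
print(lm_ovr, lm_ce, lm_ovr / lm_ce)  # ratio > 1, slowly approaching the limit 2
```

Because the convergence in Theorem 4.5 is logarithmic in t, the ratio at a finite time sits strictly between 1 and 2.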

F.4 EFFECTS OF HYPERPARAMETERS λ

SOVR has hyperparameters (M, λ). In this section, we evaluate the effects of λ. Figure 15 plots $\ell_{LM}$ on CIFAR10 with RN18, the generalization gap, and the robust accuracy against Auto-Attack. We set M to 40. Note that λ = 0 corresponds to training only on the set S, i.e., AT using only 60 percent of the data points in the minibatch when M = 40. First, $\ell_{LM}(x')$ monotonically decreases as λ increases (Fig. 15(a)). However, robust accuracies against Auto-Attack do not monotonically increase with λ (Fig. 15(c)). This is because the generalization gap increases (Fig. 15(b)). Thus, too high weights on difficult samples cause overfitting. SOVR is always superior or comparable to AT in terms of robustness against Auto-Attack for all tested values of (0 < M ≤ 100, 0 < λ ≤ 1).

F.5 INDIVIDUAL EVALUATION OF AUTO-ATTACK COMPONENTS

For importance-aware methods, we evaluate the robust accuracies against all components of Auto-Attack in Section 2.3. In this section, we additionally evaluate EWAT by using the components of Auto-Attack individually and discuss the results of SOVR. Figure 16 plots the results: SOVR is the most robust against t-APGD and t-FAB. In addition, it is more robust against SQUARE than AT and EWAT. Although the robust accuracy of SOVR against PGD-20 is lower than those of AT and EWAT, SOVR outperforms the other methods in robustness against the worst-case attacks, which is the goal of this study. We list robust accuracies against various attacks, FGSM (Goodfellow et al., 2014), 100-step PGD (Madry et al., 2018), 100-step PGD with the CW loss (Madry et al., 2018; Carlini & Wagner, 2017), and SPSA (Uesato et al., 2018), in Tab. 5. The hyperparameters of SPSA are as follows: the number of steps is 100, the perturbation size is 0.001, the learning rate is 0.01, and the number of samples for each gradient estimation is 256. In this table, we repeat the clean accuracies and robust accuracies against Auto-Attack from the table in the main paper. In addition, we list the worst robust accuracies, i.e., the lowest robust accuracy among the attacks in the table for each method. In this table, importance-aware methods tend to fail to improve the robustness against SPSA. Since SPSA does not directly use gradients, this result indicates that importance-aware methods improve robustness by obfuscating gradients (Athalye et al., 2018). Against some attacks, MMA achieves the highest robust accuracy on several datasets. However, our goal is improving the true robustness, i.e., robust accuracies against the worst-case attacks in δ ∈ {||δ||_∞ ≤ 8/255}, and MMA does not improve the robustness against the worst-case attacks (the Worst columns).
We can see that Auto-Attack always achieves the lowest robust accuracies and that SOVR improves them: robust accuracies of SOVR against Auto-Attack are 5.9-12.2 percentage points greater than those of MMA.

We also evaluate the robustness against the logit scaling attack (Hitaj et al., 2021). Hitaj et al. (2021) reveal that GAIRAT tends to be vulnerable to logit scaling attacks, which multiply the logits by α before applying softmax when generating PGD attacks. We set α = [0.1, 1.0, 10, 100]. Fig. 18 plots robust accuracy against α. This figure shows that the robust accuracies of GAIRAT and MAIL tend to decrease as α increases. Though the robust accuracy of SOVR is the lowest on CIFAR10 (RN18), it is higher than its robust accuracy against Auto-Attack; thus, the logit scaling attack is not the worst-case attack. Since the robust accuracy of SOVR does not necessarily decrease as α increases, the results seem to be caused by the high robustness of the other methods against PGD (Tab. 5) rather than by a vulnerability of SOVR to logit scaling. Previous methods tend to be designed to increase robustness against PGD, since Auto-Attack is a relatively recent attack. On the other hand, SOVR is designed to increase robustness against the worst-case attack, which is currently Auto-Attack.

F.6 TRADE-OFF BETWEEN CLEAN ACCURACY AND ROBUST ACCURACY

F.9 COMPARISON WITH TRADES

Section 6 gives the results of TRADES and TSOVR on CIFAR10 with WRN. This section compares our methods with TRADES on the other datasets and with RN18. For TSOVR, (M, λ) was set to (80, 0.8) for CIFAR10 (RN18), (50, 0.5) for CIFAR100, and (20, 0.8) for SVHN. Tab. 6 lists the robust accuracies against Auto-Attack and the clean accuracies. We can see that TSOVR achieves the best robust accuracy against Auto-Attack.

F.10 DEPENDENCE ON THE NUMBER OF CLASSES

Figure 6 shows that the rate for AT gets close to that of SOVR on CIFAR100. This is because the number of classes of CIFAR100 (K = 100) is ten times larger than that of the other datasets (K = 10), and the logit margins of cross-entropy depend on the number of classes K (Eq. (17)). Thus, this result is a piece of evidence that Theorem 4.5 explains the difference in logit margins between OVR and cross-entropy. At a finite time step t (not in the limit), Eqs. (16) and (17) show that the difference between OVR and cross-entropy depends on the number of classes K. Even so, Fig. 14(b) shows that the logit margins of OVR increase faster than those of cross-entropy against epochs. To achieve better performance, we can tune the hyperparameter λ, which corresponds to w_1 in Eq. (16) of Theorem 4.5. When using λ = 0.6, SOVR achieves better robustness than λ = 0.5 on CIFAR100 (Tab. 7). Cross-entropy also has the weight w_2 in Eq. (17), and it is automatically tuned in GAIRAT, MAIL, and EWAT. However, this tuning does not achieve performance comparable to SOVR.
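The K-dependence of the cross-entropy logit margin can be evaluated directly from the closed form of Lemma 4.4. The sketch below is ours, not the authors' code; the Lambert W implementation (Newton's method on $W e^W = x$) and the function names are our choices:

```python
import math

def lambert_w(x):
    """Principal branch W(x) for x > 0 via Newton's method (W e^W = x)."""
    w = math.log(x) - math.log(math.log(x)) if x >= math.e else math.log1p(x)
    for _ in range(60):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w

def lm_ce_closed(t, K, w=1.0):
    # logit margin loss of cross-entropy from Lemma 4.4 / Eq. (17):
    # -K/(K-1) w t - 1/(K-1) + W(e^{K/(K-1) w t + 1/(K-1)} / (K-1))
    a = K / (K - 1)
    u = math.exp(a * w * t + 1.0 / (K - 1)) / (K - 1)
    return -a * w * t - 1.0 / (K - 1) + lambert_w(u)

# at a fixed finite t, larger K yields a more negative CE logit margin loss,
# matching the CE (K = 100) curve in Fig. 3(a)
print(lm_ce_closed(100.0, K=10), lm_ce_closed(100.0, K=100))
```

The exact values also agree closely with the asymptotic form $-\log(K w_2 t + 1 - (K-1)\log(K-1))$ used in the proof of Theorem 4.5.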

G HISTOGRAM OF PROBABILISTIC MARGIN LOSSES

While our study focuses on the logit margin loss, MAIL (Liu et al., 2021) uses the probabilistic margin, $PM_n = f_{y_n}(x'_n, \theta) - \max_{k \neq y_n} f_k(x'_n, \theta)$, to evaluate the difficulty of a data point. In the same way as Fig. 2, Fig. 19 plots the histograms of probabilistic margins on CIFAR10 with PreActResNet18. Since softmax outputs are bounded in [0, 1], PM is bounded in [-1, 1]. As a result, most correctly classified data points concentrate near -1. In addition, since softmax uses exponential functions, the distributions of PM resemble exponential distributions. Due to these effects, histograms of PM make it more difficult to discover that there are two types of data points (easy samples and difficult samples). Since softmax preserves the order of logits and a classifier infers the label using the largest logit, an analysis using PM can underestimate the distribution of difficult samples. Thus, logit margin losses are more suitable for empirically analyzing trained models. Note that, since softmax preserves the order of logits, the probabilistic margin can still be used to determine L and S in SOVR.
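The saturation effect described above is easy to demonstrate: two easy samples with very different logit margins become nearly indistinguishable under PM. A small sketch of ours:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_margin_loss(z, y):
    # max_{k != y} z_k - z_y
    return max(z[k] for k in range(len(z)) if k != y) - z[y]

def probabilistic_margin(z, y):
    # PM = f_y - max_{k != y} f_k on softmax outputs
    f = softmax(z)
    return f[y] - max(f[k] for k in range(len(f)) if k != y)

# two "easy" samples with very different logit margins ...
z1, z2 = [10.0, 0.0, 0.0], [30.0, 0.0, 0.0]
print(logit_margin_loss(z1, 0), logit_margin_loss(z2, 0))   # -10.0 vs -30.0
# ... are nearly indistinguishable by PM, which saturates near 1
print(probabilistic_margin(z1, 0), probabilistic_margin(z2, 0))
```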



Footnotes:
- We could not reproduce the results reported in Wu et al. (2020); Wang & Wang (2022) even though we did not modify their codes. This might be because we report averaged values for reproducibility.
- For clarity, we use the term "margin" only for the distance between the logit of the true label and the largest logit among the other labels, not for the distance between data points and the decision boundary.
- MART: https://github.com/YisenWang/MART
- GAIRAT: https://github.com/zjfheart/Geometry-aware-Instance-reweighted-Adversarial-Training
- MAIL: https://github.com/QizhouWang/MAIL
- MMA: https://github.com/BorealisAI/mma_training
- AWP: https://github.com/csdongxian/AWP
- SEAT: https://github.com/whj363636/Self-Ensemble-Adversarial-Training



Figure 1: Robustness against PGD and components of Auto-Attack on CIFAR10 (Krizhevsky & Hinton, 2009) with PreActResNet18 (RN18). SOVR is our proposed method.

Figure 2: Histogram of $\ell_{LM}$ for training data of CIFAR10 with RN18 at the best (top) and the last (bottom) epoch. The best epoch is the epoch at which models achieved the best robust accuracy against PGD by early stopping. ST denotes standard training, i.e., training on clean data. For standard training, we use $\ell_{LM}$ on clean data x, while we plot it on adversarial examples x' for the other methods. Blue bins are correctly classified data points, and red bins are misclassified samples.

Figure 3: Trajectories of $\ell_{LM}$. CE denotes cross-entropy. In (a), RK denotes the 4-th order Runge-Kutta method with a step size of 0.1. (b) is the trajectory in adversarial training on CIFAR10; its setup is provided in Appendix F.3.

Figure 5: The effect of the rate M of applying OVR. λ is set to 0.4. M = 0 corresponds to the result of AT with cross-entropy. The generalization gap is the gap between training and test robust accuracy against PGD (K=20) at the last epoch. Robust Acc. is the robust accuracy against Auto-Attack.

Figure 7 plots the result of EWAT on training samples of CIFAR10 at the last epoch. Compared with SOVR, EWAT does not increase logit margins for difficult samples (right peak). Figures 8-11 plot the histograms when

Figure 12: Histogram of logit margin losses on CIFAR10 with WRN for TRADES.

Figure 13: Gradient norms with respect to data points. k is a randomly selected label, and $k^* = \arg\max_{k \neq y} z_k(x)$.

Figure 14: Trajectories of logit margin losses L_M in adversarial training using cross-entropy and OVR. RN18 is used on CIFAR100 and SVHN.

Figure 16: Robustness against PGD-20 and Auto-Attack on the test set of CIFAR10. We decompose the robust accuracy against Auto-Attack into the robust accuracy in each phase.

Figure 17: Trade-off between clean accuracy and robust accuracy.

Figure 18: Robust Accuracy against logit scaling attack.

Figure 19: Histogram of probabilistic margin losses for training data of CIFAR10 with PreActResNet18 at the last epoch. ST denotes standard training, i.e., training on clean data. For standard training, we use PM on clean data x, while we plot PM on adversarial examples x for the other methods. Blue bins correspond to correctly classified data points, and red bins to misclassified ones.
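The probabilistic margin is the softmax-space analogue of the logit margin. A minimal sketch, assuming PM is the gap between the true-label probability and the largest other probability (our reading of the softmax-based measure; the exact definition used by MAIL may differ in detail):

```python
import numpy as np

def probabilistic_margin(logits, y):
    """p_y - max_{k != y} p_k, where p = softmax(logits).

    Positive when the point is classified correctly; bounded in (-1, 1),
    unlike the logit margin, which is unbounded.
    """
    z = logits - logits.max()           # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return p[y] - np.delete(p, y).max()
```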

Zhang et al. (2020) investigated the effect of difficult samples on natural generalization performance, i.e., generalization performance on clean data, in adversarial training. They presented friendly adversarial training (FAT), which improves the robustness without compromising the natural generalization performance. Unlike FAT, SOVR focuses on the robust performance. Sanyal et al. (2021) and Dong et al. (2022) have investigated the effect of memorization in adversarial training: memorizing difficult samples hurts the generalization performance of adversarial training. Whereas they focused on reducing the generalization gap by regularization, our method reduces the robust error on test data by reducing the training robust error, i.e., mitigating underfitting rather than overfitting. Cisse et al. (2017); Tsuzuku et al. (2018); Zhang et al. (2021a) used logit margins and Lipschitz constants to present certified defense methods. We present a similar result only to justify our evaluation using the logit margin loss in Section 3. Though Padhy et al.

Robust accuracy against Auto-Attack and clean accuracy on test datasets. Robust accuracies of SOVR are statistically significantly different.

Robust accuracy against various attacks (L∞, ε = 8/255). CLN denotes accuracy on clean data, and AA denotes Auto-Attack. Worst represents the lowest robust accuracy among the attacks in the table for each method.

Robust accuracy against Auto-Attack and clean accuracy for AT, SOVR, TRADES, and TSOVR.
Robust accuracy against Auto-Attack (L∞, ε = 8/255).

Robust accuracy against Auto-Attack on CIFAR100 when tuning λ.
Robust accuracy against Auto-Attack (L∞, ε = 8/255).


using WideResNet and other datasets. SOVR tends to increase the left peak under all conditions; thus, it decreases the logit margin losses L_M and increases the logit margins |L_M|. Figure 10 shows that AT does not have two peaks on SVHN. To investigate the histograms on SVHN in detail, we additionally evaluate the logit margin losses at the 100-th epoch in Fig. 11. This figure shows that the histogram on SVHN has two peaks at the 100-th epoch, but they merge into one peak at the 200-th epoch (Fig. 10). This might cause the optimal (M, λ) for SOVR to be smaller than that for the other datasets. Figure 12 plots the histograms of TRADES and shows that TRADES has two peaks, but they are close to each other. This might be because the objective functions for adversarial examples and for parameters are different. Table 3 lists the averages of the logit margin losses. Since the distributions of the logit margin losses are long-tailed, as shown in the histograms, the differences in the average logit margin losses among methods are small. Even so, SOVR tends to have the lowest logit margin losses under almost all settings.

F.2 EVALUATION OF GRADIENT NORMS

Even though the logit margins of importance-aware methods are very small, they are robust against PGD and some other attacks (Fig. 1). To reveal the cause of this robustness, we additionally evaluate the gradient norms of the loss and logit functions (Fig. 13). In this figure, the gradient norms of cross-

