SELF-ADAPTIVE PERTURBATION RADII FOR ADVERSARIAL TRAINING

Abstract

Adversarial training is among the most popular and effective techniques for protecting models from imperceptible adversarial examples. Despite its success, it is accompanied by significant performance degradation on clean data. To achieve good performance on both clean and adversarial samples, the main line of effort searches for an adaptive perturbation radius for each training sample, which inherently suffers from a conflict between exact search and computational overhead. To address this conflict, in this paper we first show the superiority of adaptive perturbation radii, intuitively with respect to accuracy and theoretically with respect to robustness. We then propose a novel self-adaptive adjustment framework for perturbation radii that avoids tedious searching, and we instantiate this framework on both deep neural networks (DNNs) and kernel support vector machines (SVMs). Finally, extensive experimental results show that our framework improves not only natural generalization performance but also adversarial robustness, while remaining competitive with existing search strategies in terms of running time.

1. INTRODUCTION

The security of machine learning models has long been questioned since most models are vulnerable to perturbations (Papernot et al., 2016). Extremely tiny perturbations may be imperceptible to humans yet cause poor performance of models such as deep neural networks (DNNs) (Goodfellow et al., 2014; Madry et al., 2017; Papernot et al., 2017), support vector machines (SVMs) (Xiao et al., 2012; Biggio et al., 2012; 2014) and logistic regression (LR) (Papernot et al., 2016). Examples attacked by such perturbations are generally called adversarial examples. To learn robust models, adversarial training has now become one of the most effective and widely-used methods, especially on DNNs and SVMs (Zhou et al., 2012; Kurakin et al., 2017; Miyato et al., 2018; Wang et al., 2019; Shafahi et al., 2019; Wu et al., 2021). However, the success of adversarial training comes at a cost (Tsipras et al., 2018; Zhang et al., 2019). Specifically, as stated in (Tsipras et al., 2018), robustness may be at odds with accuracy, which means models after adversarial training may fail to generalize well on unperturbed examples. It is generally believed that this phenomenon is due to the fixed attack strength used throughout the training process, which ignores the fact that every example may have different intrinsic robustness (Cheng et al., 2020; Zhang et al., 2020). Naturally, the main effort to mitigate this issue is to find a suitable perturbation radius ϵ_i for each training sample with explicit or implicit search strategies. Among explicit strategies, IAAT (Balaji et al., 2019) uses brute-force search to find suitable perturbation radii, while MMA (Ding et al., 2018) finds the optimal ϵ*_i via bisection search. As an implicit strategy, Zhang et al. (2020) propose an early-stopped PGD strategy called FAT, which in essence adjusts ϵ implicitly.
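To make the bisection idea behind MMA concrete, the sketch below searches for the smallest L∞ radius at which an optimal attack flips a linear classifier's prediction. The linear model, the toy weights, and the tolerance are illustrative assumptions, not details from MMA itself; for a linear model the minimal radius has a closed form, |w·x + b| / ||w||_1, which the bisection should recover.

```python
import numpy as np

def attack_succeeds(w, b, x, y, eps):
    # Optimal L-inf attack on a linear classifier: shift every coordinate
    # by eps against the label's margin direction.
    x_adv = x - eps * y * np.sign(w)
    return y * (np.dot(w, x_adv) + b) <= 0

def bisect_radius(w, b, x, y, eps_max=1.0, tol=1e-6):
    """Smallest eps in [0, eps_max] whose attack flips the prediction."""
    if not attack_succeeds(w, b, x, y, eps_max):
        return eps_max  # sample is robust up to eps_max
    lo, hi = 0.0, eps_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if attack_succeeds(w, b, x, y, mid):
            hi = mid  # attack already succeeds: shrink the radius
        else:
            lo = mid  # attack fails: grow the radius
    return hi

# Toy example: margin w.x + b = 1.2, so eps* = 1.2 / ||w||_1 = 0.4
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, -0.2]), 1.0
print(bisect_radius(w, b, x, y))  # ~0.4
```

Each bisection step costs one attack evaluation, which is exactly the per-sample overhead that motivates avoiding exact search for deep models.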
Although FAT skillfully skips the step of searching for ϵ*_i, it is sensitive to hyperparameters such as the number of PGD steps τ and the uniform perturbation radius ϵ. Thus, in this paper, we mainly focus on explicit search strategies. We give a brief review of the above algorithms in Table 1, from which we can see that these search strategies have an inherent conflict between exact search and time complexity. To resolve this conflict, we propose a novel self-adaptive adjustment framework (SAAT) for perturbation radii, which achieves a better trade-off between natural generalization performance and adversarial robustness without much computational overhead. Our main contributions are summarized as follows:
• Theoretically, we prove that adaptive perturbation radii lead to a lower expected adversarial risk than a fixed, uniform perturbation radius, which implies higher robustness against adversarial examples.
• Our self-adaptive adversarial training algorithms skillfully assign the optimal perturbation radius to each sample, which avoids the step of exact search and achieves a better trade-off between adversarial robustness and natural accuracy.
• The self-adaptive adversarial training strategy that we propose from the kernel perspective is applicable to both SVMs and DNNs. It efficiently optimizes a minimization problem instead of the conventional minimax one, since we transform the inner maximization into a simplified and equivalent form.
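The SAAT update rule itself is developed later in the paper; as a point of contrast, the following sketch shows the generic shape of per-sample radius adaptation in the spirit of IAAT's heuristic: grow ϵ_i while the adversarial example is still classified correctly, shrink it otherwise. The logistic model, FGSM-style attack, and all hyperparameters (gamma, eps_max, lr) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-cluster binary data with labels in {-1, +1}
X = rng.normal(size=(200, 2)) + np.array([1.5, 1.5])
X[100:] -= 3.0
y = np.concatenate([np.ones(100), -np.ones(100)])

w, b = np.zeros(2), 0.0
eps = np.full(len(X), 0.05)          # per-sample radii (hypothetical init)
gamma, eps_max, lr = 0.01, 0.5, 0.1  # adaptation step, radius cap, learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):
        # FGSM-style L-inf attack at this sample's current radius
        x_adv = X[i] - eps[i] * y[i] * np.sign(w)
        margin = y[i] * (w @ x_adv + b)
        # Heuristic radius update: grow eps_i while the adversarial
        # example is still classified correctly, shrink it otherwise.
        eps[i] = np.clip(eps[i] + (gamma if margin > 0 else -gamma), 0.0, eps_max)
        # SGD step on the logistic loss of the adversarial example
        g = -y[i] / (1.0 + np.exp(margin))
        w -= lr * g * x_adv
        b -= lr * g

clean_acc = np.mean(np.sign(X @ w + b) == y)
print(f"clean accuracy: {clean_acc:.2f}")
```

Note that each sample carries its own radius across epochs, so "easy" samples far from the boundary accumulate large ϵ_i while boundary samples keep small ones; the per-step cost here is one attack per sample, as opposed to the repeated attacks a bisection or brute-force search would need.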

2.1. NOTATIONS

We focus on C-class classification problems, so the dataset can be defined as D = {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the input data and y_i ∈ {1, ..., C} is the label. We will use I{a}, the 0-1 loss, to

Figure 1: Conceptual illustration of standard adversarial training and our self-adaptive adversarial training (i.e., SAAT).

