SQUEEZE TRAINING FOR ADVERSARIAL ROBUSTNESS

Published as a conference paper at ICLR 2023

Abstract

The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted great attention in the machine learning community. The problem is related to the non-flatness and non-smoothness of the loss landscapes obtained by normal training. Training augmented with adversarial examples (a.k.a., adversarial training) is considered an effective remedy. In this paper, we highlight that collaborative examples, which are nearly perceptually indistinguishable from both adversarial and benign examples yet show substantially lower prediction loss, can be utilized to enhance adversarial training. A novel method is therefore proposed to achieve a new state of the art in adversarial robustness.

1. INTRODUCTION

Adversarial examples (Szegedy et al., 2013; Biggio et al., 2013) are crafted by adding imperceptible perturbations to benign examples and can fool DNNs into making incorrect predictions. In this paper, to gain a deeper understanding of DNNs, robust or not, we examine the valleys of their loss landscapes and explore the existence of collaborative examples in the ϵ-bounded neighborhoods of benign examples, i.e., examples that demonstrate substantially lower prediction loss than their neighbors. Somewhat unsurprisingly, the existence of such examples is related to the adversarial robustness of DNNs. In particular, given a model that was trained to be adversarially more robust, it is less likely that a collaborative example can be discovered. Moreover, incorporating such collaborative examples into model training also appears to improve the obtained adversarial robustness. On this point, we advocate squeeze training (ST), in which the adversarial examples and collaborative examples of each benign example are jointly and equally optimized in a novel procedure, such that their maximum possible prediction discrepancy is constrained. Extensive experimental results verify the effectiveness of our method. We demonstrate that ST outperforms state-of-the-art methods remarkably on several benchmark datasets, achieving an absolute robust accuracy gain of >+1.00% on CIFAR-10 without utilizing additional data. It can also be readily combined with a variety of recent efforts, e.g., RST (Carmon et al., 2019) and RWP (Wu et al., 2020b), to further improve the performance.
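To make the notion concrete, a collaborative example can be searched for by descending, rather than ascending, the prediction loss inside the ϵ-ball around a benign example. The following is a minimal sketch of such a search, not the paper's algorithm: the function name, the `grad_fn` callback (returning the gradient of the loss w.r.t. the input), and all hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def find_collaborative(x, y, grad_fn, eps=8/255, alpha=2/255, steps=10):
    """Sign-gradient *descent* inside the eps-ball: the mirror image of a
    gradient-sign attack, seeking a nearby point with lower (not higher) loss."""
    x_t = x.copy()
    for _ in range(steps):
        g = grad_fn(x_t, y)                          # d loss(f(x_t), y) / d x_t
        x_t = x_t - alpha * np.sign(g)               # descend the loss
        x_t = np.clip(x_t, x - eps, x + eps)         # stay in the eps-ball
        x_t = np.clip(x_t, 0.0, 1.0)                 # stay a valid input
    return x_t
```

On a toy differentiable classifier, the returned point stays within the ϵ-ball while attaining a lower cross-entropy than the benign example itself.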

2. BACKGROUND AND RELATED WORK

2.1 ADVERSARIAL EXAMPLES

Let x_i and y_i denote a benign example (e.g., a natural image) and its label from S = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ X and y_i ∈ Y = {0, ..., C−1}. We use B_ϵ[x_i] = {x′ | ∥x′ − x_i∥_∞ ≤ ϵ} to represent the ϵ-bounded l_∞ neighborhood of x_i. A DNN parameterized by Θ can be defined as a function f_Θ(·) : X → R^C. Without ambiguity, we drop the subscript Θ in f_Θ(·) and write it as f(·).

In general, adversarial examples are almost perceptually indistinguishable from benign examples, yet they lead to arbitrary predictions on the victim models. One typical formulation of generating an adversarial example is to maximize the prediction loss in a constrained neighborhood of a benign example. Projected gradient descent (PGD) (Madry et al., 2018) (or the iterative fast gradient sign method, i.e., I-FGSM (Kurakin et al., 2017)) is commonly chosen for achieving this aim. It seeks possible adversarial examples by leveraging the gradient of g = ℓ ∘ f w.r.t. its inputs, where ℓ is a loss function (e.g., the cross-entropy loss CE(·, y)). Given a starting point x^0, an iterative update is performed with:

x^{t+1} = Π_{B_ϵ[x]}(x^t + α · sign(∇_{x^t} CE(f(x^t), y))),

where x^t is the temporary result obtained at the t-th step and the function Π_{B_ϵ[x]}(·) projects its input onto the ϵ-bounded neighborhood of the benign example x. The starting point can be the benign example itself (for I-FGSM) or a random neighbor of it (for PGD).

Recently, ensembles of a variety of attacks have become popular for performing adversarial attacks and evaluating adversarial robustness. One such strong benchmark, called AutoAttack (AA) (Croce & Hein, 2020), consists of three white-box attacks, i.e., APGD-CE, APGD-DLR, and FAB (Croce & Hein, 2019), and one black-box attack, i.e., the Square Attack (Andriushchenko et al., 2019). We adopt it in our experimental evaluations.
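The PGD update above can be sketched as follows for a generic differentiable model. This is an illustrative implementation, not the paper's code: the `grad_fn` callback (returning the gradient of the cross-entropy w.r.t. the input) and the hyper-parameter defaults are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    return -np.log(softmax(logits)[label] + 1e-12)

def pgd_attack(x, y, grad_fn, eps=8/255, alpha=2/255, steps=10, seed=0):
    """PGD: start at a random point in the eps-ball, then repeat sign-gradient
    ascent on CE followed by projection back onto B_eps[x]."""
    rng = np.random.default_rng(seed)
    x_t = x + rng.uniform(-eps, eps, size=x.shape)  # random start (I-FGSM starts at x)
    for _ in range(steps):
        g = grad_fn(x_t, y)                          # d CE(f(x_t), y) / d x_t
        x_t = x_t + alpha * np.sign(g)               # ascent step
        x_t = np.clip(x_t, x - eps, x + eps)         # projection Pi_{B_eps[x]}
        x_t = np.clip(x_t, 0.0, 1.0)                 # keep a valid input
    return x_t
```

On a toy linear classifier with an analytic gradient, the result stays within the ϵ-ball and attains a strictly higher cross-entropy than the benign input.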

2.2. ADVERSARIAL TRAINING (AT)

Among the numerous methods for defending against adversarial examples, adversarial training, which incorporates such examples into model training, is probably one of the most effective. We revisit some representative adversarial training methods in this subsection.

Vanilla AT (Madry et al., 2018) formulates the training objective as a simple min-max game. Adversarial examples are first generated, using for instance PGD, to maximize some loss (e.g., the cross-entropy loss) in the objective, and the model parameters are then optimized to minimize the same loss on the obtained adversarial examples:

min_Θ Σ_i max_{x′_i ∈ B_ϵ[x_i]} CE(f(x′_i), y_i).

Although effective in improving adversarial robustness, vanilla AT inevitably decreases the prediction accuracy on benign examples; several follow-up methods therefore discuss improved and more principled ways to better trade off clean and robust accuracy (Zhang et al., 2019; Kannan et al., 2018; Wang et al., 2020; Wu et al., 2020b). Such methods advocate regularizing the discrepancy between the outputs of each benign example and its adversarial neighbors. With remarkable empirical performance, they are regarded as strong baselines, and we introduce a representative one, TRADES (Zhang et al., 2019).

TRADES advocates a learning objective comprising two loss terms. The first penalizes the cross-entropy loss on benign training samples, and the second regularizes the difference between the benign output and the output of possibly malicious data points. Specifically, the worst-case Kullback-Leibler (KL) divergence between the output of each benign example and that of any suspicious data point in its ϵ-bounded l_∞ neighborhood is minimized in the regularization term:

min_Θ Σ_i (CE(f(x_i), y_i) + β · max_{x′_i ∈ B_ϵ[x_i]} KL(f(x′_i) ∥ f(x_i))). (3)
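Per example, the objective in Eq. (3) combines a clean cross-entropy term with a β-weighted KL term. A minimal numerical sketch follows; the helper names and the β value are illustrative, and in practice `logits_adv` would come from an inner PGD-style maximization of the KL term.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def trades_loss(logits_benign, logits_adv, y, beta=6.0):
    """Per-example TRADES objective: CE(f(x_i), y_i) + beta * KL(f(x'_i) || f(x_i))."""
    p_benign = softmax(logits_benign)
    p_adv = softmax(logits_adv)
    ce = -np.log(p_benign[y] + 1e-12)
    return float(ce + beta * kl_div(p_adv, p_benign))
```

When the adversarial output equals the benign output, the KL term vanishes and only the clean cross-entropy remains; any divergent adversarial output strictly increases the objective.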



* This work was done under the co-supervision of Yiwen Guo and Wangmeng Zuo, who are the corresponding authors.



Adversarial examples, crafted by adding imperceptible perturbations to benign examples, are capable of fooling DNNs into making incorrect predictions. The existence of such adversarial examples has raised security concerns and attracted great attention, and much endeavor has been devoted to improving the adversarial robustness of DNNs. As one of the most effective methods, adversarial training (Madry et al., 2018) introduces powerful and adaptive adversarial examples during model training and encourages the model to classify them correctly.

Besides I-FGSM and PGD, the single-step FGSM (Goodfellow et al., 2015), C&W's attack (Carlini & Wagner, 2017), DeepFool (Moosavi-Dezfooli et al., 2019), and the momentum iterative FGSM (Dong et al., 2018) are also popular and effective for generating adversarial examples. Some work also investigates ways of generating adversarial examples without full knowledge of the victim model, known as black-box attacks (Papernot et al., 2017; Chen et al., 2017; Ilyas et al., 2018; Cheng et al., 2019; Xie et al., 2019; Guo et al., 2020) and no-box attacks (Papernot et al., 2017; Li et al., 2020).

In this paper, we explore the valleys of the loss landscapes of DNNs and study the benefit of incorporating collaborative examples into adversarial training. In an independent paper (Tao et al., 2022), hypocritical examples were explored, as an attack, for concealing the mistakes of a model. These examples also lie in such valleys. Yet, owing to the difference in aim, the studies of hypocritical examples in (Tao et al., 2022) were mainly performed on mis-classified benign examples, per their formal definition, while our work concerns the local landscapes around all benign examples. Other related work includes unadversarial examples (Salman et al., 2021) and assistive signals (Pestana et al., 2021), which design 3D textures that make objects easier to classify.

Code availability: https://github.com/

