COMPOSITE ADVERSARIAL TRAINING FOR MULTIPLE ADVERSARIAL PERTURBATIONS AND BEYOND

Anonymous

Abstract

One intriguing property of deep neural networks (DNNs) is their vulnerability to adversarial perturbations. Despite the plethora of work on defending against individual perturbation models, improving DNN robustness against the combinations of multiple perturbations is still fairly under-studied. In this paper, we propose composite adversarial training (CAT), a novel training method that flexibly integrates and optimizes multiple adversarial losses, leading to significant robustness improvement with respect to individual perturbations as well as their "compositions". Through empirical evaluation on benchmark datasets and models, we show that CAT outperforms existing adversarial training methods by large margins in defending against the compositions of pixel perturbations and spatial transformations, two major classes of adversarial perturbation models, while incurring limited impact on clean inputs.

1. INTRODUCTION

Despite their state-of-the-art performance in tasks ranging from computer vision (Szegedy et al., 2016) to natural language processing (Seo et al., 2017), deep neural networks (DNNs) are inherently susceptible to adversarial examples (Szegedy et al., 2014), maliciously crafted samples designed to deceive target DNNs. A flurry of adversarial attacks has been proposed, crafting adversarial examples via either pixel perturbation (Goodfellow et al., 2015b; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017a) or spatial transformation (Engstrom et al., 2017; Xiao et al., 2018; Alaifari et al., 2019). To defend against such attacks, one line of work attempts to improve DNN robustness by developing new training and inference strategies (Kurakin et al., 2017; Guo et al., 2018; Liao et al., 2018; Tramèr et al., 2018). Yet, most existing defenses are circumvented or penetrated by adaptive attacks (Athalye et al., 2018), while adversarial training (Madry et al., 2018; Shafahi et al., 2019) remains one state-of-the-art defense that still stands against adaptive attacks. Most adversarial training methods, however, are designed for individual attacks that are either fixed (Madry et al., 2018) or selected from a pre-defined pool (Tramèr & Boneh, 2019; Maini et al., 2020); in realistic settings, the adversary is not constrained to individual perturbation models but is free to "compose" multiple perturbation models into more powerful attacks. Despite their robustness against individual attacks, DNNs trained using existing methods often fail to defend against such composite attacks (details in § 2). Moreover, existing adversarial training methods focus on pixel perturbation-based attacks (e.g., those bounded by ℓp-norm balls), while research on training DNNs robust against spatial transformation-based attacks remains limited.
To bridge this striking gap, in this paper we present CAT, a novel adversarial training method able to flexibly integrate and optimize multiple adversarial robustness losses, yielding DNNs robust with respect to multiple individual perturbation models as well as their "compositions". Specifically, CAT assumes an attack model that composes multiple perturbations and, while bounded by the overall perturbation budget, optimally allocates the budget across iterations. To address the computational challenges of this formulation, we extend recent advances on fast projection onto ℓp,1 mixed-norm balls (Liu & Ye, 2010; Sra, 2012; Béjar et al., 2019) to our setting, significantly improving optimization efficiency. We validate the efficacy of CAT on benchmark datasets and models. For instance, on MNIST, CAT outperforms alternative adversarial training methods (Tramèr & Boneh, 2019) by over 44% in adversarial accuracy against attacks that combine pixel perturbation and spatial transformation (details in § 4), with comparable clean accuracy and training efficiency. Our contributions can be summarized as follows. First, we demonstrate that a new class of adversarial attacks, which "compose" multiple perturbations, renders most existing adversarial training methods ineffective; second, we propose CAT, the first adversarial training method designed for multiple perturbation models as well as their compositions; third, we validate the efficacy of CAT by comparing it against alternative methods on benchmark datasets and DNNs; finally, we explore the optimization space of composite perturbations, pointing to several promising research directions.
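The fast mixed-norm projection mentioned above is the key computational primitive. As a simplified sketch (our own illustration, not the paper's algorithm), the classic O(d log d) sort-based Euclidean projection onto a single ℓ1 ball (Duchi et al.-style) conveys the kind of operation such methods build on:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball of the given radius.

    Sort-based O(d log d) algorithm; a single-norm sketch of the
    projection primitive that fast l_{p,1} mixed-norm methods extend.
    """
    if np.abs(v).sum() <= radius:
        return v.copy()                         # already inside the ball
    u = np.sort(np.abs(v))[::-1]                # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)   # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

For example, projecting (3, -1) onto the ℓ1 ball of radius 2 soft-thresholds both coordinates by 1, giving (2, 0), which lies exactly on the ball's boundary.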

2.1. ADVERSARIAL TRAINING

Adversarial training is a class of techniques that train robust DNNs by minimizing the worst-case loss with respect to a given adversarial perturbation model. Formally, let f_θ be a DNN parameterized by θ, ℓ the loss function, and D_train = {(x_i, y_i)}_{i=1}^n the training set. Adversarial training with respect to an ℓp adversary with perturbation magnitude ε is then defined as:

θ* = argmin_θ Σ_i max_{δ ∈ B_p(ε)} ℓ(f_θ(x_i + δ), y_i)    (1)

where B_p(ε) = {δ : ‖δ‖_p ≤ ε} is the ℓp-norm ball of radius ε. Here, the inner maximization problem essentially defines the target adversarial attack. For instance, instantiating it as the ℓ∞ projected gradient descent (PGD) attack leads to the well-known PGD adversarial training. Despite their effectiveness against the considered perturbation models (e.g., ℓ∞ perturbation), existing adversarial training methods often fail to defend against perturbations they are not designed for (Tramèr & Boneh, 2019). Motivated by this, some recent work explores conducting adversarial training with respect to multiple perturbation models simultaneously.

AVG and MAX - Tramèr & Boneh (2019) propose two methods, AVG and MAX, to aggregate multiple perturbations. Specifically, AVG formulates the robustness optimization as:

θ* = argmin_θ Σ_i Σ_{p ∈ A} max_{δ_p ∈ B_p(ε)} ℓ(f_θ(x_i + δ_p), y_i)    (2)

where A = {1, 2, ∞}. Compared with Eq. 1, Eq. 2 aggregates multiple adversarial perturbations in the inner loop. Instead of averaging over the perturbation models, MAX selects the perturbation resulting in the largest loss:

θ* = argmin_θ Σ_i max_{p ∈ A} max_{δ ∈ B_p(ε)} ℓ(f_θ(x_i + δ), y_i)    (3)

If A contains only one perturbation model, Eq. 1, Eq. 2, and Eq. 3 are all equivalent.

MSD - While AVG and MAX achieve varying degrees of robustness to the considered perturbations, it is practically difficult to minimize the worst-case loss with respect to the union of perturbations. To this end, Maini et al.
(2020) propose multiple steepest descent (MSD), which improves MAX (and AVG) in two aspects. First, it selects the largest-loss (or average) perturbation at each inner iteration rather than after the full attack; second, it applies steepest descent instead of projected gradient descent in generating adversarial inputs. Formally, MSD formulates the optimization at the t-th iteration as:

δ_p^{(t+1)} = Proj_{B_p(ε)}(δ^{(t)} + v_p(δ^{(t)}))  for p ∈ A    (4)

δ^{(t+1)} = argmax_{δ_p^{(t+1)}} ℓ(f_θ(x + δ_p^{(t+1)}), y)

where Proj_C(·) is the projection operator onto the convex set C, and v_p(δ) is the steepest descent direction for the ℓp perturbation:

v_p(δ) = argmax_{‖v‖_p ≤ λ} vᵀ ∇_δ ℓ(f_θ(x + δ), y)

where λ is the step size.
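To make these ingredients concrete, the sketch below is our own illustrative implementation, not code from either paper: the inner maximization of Eq. 1 as an ℓ∞ PGD attack on a linear logistic model (the papers' setting uses DNNs; a linear model keeps the gradient analytic), plus the per-norm steepest-ascent directions v_p for p ∈ {1, 2, ∞} used by MSD-style updates. All function and variable names are hypothetical.

```python
import numpy as np

def pgd_linf(x, y, w, eps=0.1, alpha=0.02, steps=10):
    """Approximate the inner max of Eq. 1 under an l_inf budget:
        max_{||delta||_inf <= eps} loss(f(x + delta), y)
    for the toy model f(x) = w.x with logistic loss, y in {-1, +1}.
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        margin = y * w.dot(x + delta)
        grad = -y * w / (1.0 + np.exp(margin))  # d loss / d delta (analytic)
        delta = delta + alpha * np.sign(grad)   # ascent step on the loss
        delta = np.clip(delta, -eps, eps)       # project back onto B_inf(eps)
    return delta

def steepest_ascent_dir(grad, p, lam=0.01):
    """Steepest ascent direction v_p = argmax_{||v||_p <= lam} v.grad,
    i.e. the per-norm step an MSD-style update would take."""
    if p == np.inf:
        return lam * np.sign(grad)              # l_inf: sign step
    if p == 2:
        n = np.linalg.norm(grad)
        return lam * grad / n if n > 0 else np.zeros_like(grad)
    if p == 1:
        v = np.zeros_like(grad)
        i = np.argmax(np.abs(grad))             # l1: best single coordinate
        v[i] = lam * np.sign(grad[i])
        return v
    raise ValueError("p must be 1, 2, or np.inf")
```

An MSD-style loop would take one `steepest_ascent_dir` step per norm in A, project each candidate onto its own ball, and keep the candidate with the largest loss before the next iteration, matching Eq. 4.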

2.2. COMPOSITE ADVERSARIAL ATTACK

While the existing adversarial training methods seem effective against individual perturbation models which are either fixed or selected from a pre-defined pool (i.e., the union of perturbations), in

