COMPOSITE ADVERSARIAL TRAINING FOR MULTIPLE ADVERSARIAL PERTURBATIONS AND BEYOND

Anonymous

Abstract

One intriguing property of deep neural networks (DNNs) is their vulnerability to adversarial perturbations. Despite the plethora of work on defending against individual perturbation models, improving DNN robustness against the combinations of multiple perturbations is still fairly under-studied. In this paper, we propose composite adversarial training (CAT), a novel training method that flexibly integrates and optimizes multiple adversarial losses, leading to significant robustness improvement with respect to individual perturbations as well as their "compositions". Through empirical evaluation on benchmark datasets and models, we show that CAT outperforms existing adversarial training methods by large margins in defending against the compositions of pixel perturbations and spatial transformations, two major classes of adversarial perturbation models, while incurring limited impact on clean inputs.

1. INTRODUCTION

Despite their state-of-the-art performance in tasks ranging from computer vision (Szegedy et al., 2016) to natural language processing (Seo et al., 2017), deep neural networks (DNNs) are inherently susceptible to adversarial examples (Szegedy et al., 2014), which are maliciously crafted samples designed to deceive target DNNs. A flurry of adversarial attacks have been proposed, which craft adversarial examples via either pixel perturbation (Goodfellow et al., 2015b; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017a) or spatial transformation (Engstrom et al., 2017; Xiao et al., 2018; Alaifari et al., 2019). To defend against such attacks, a line of work attempts to improve DNN robustness by developing new training and inference strategies (Kurakin et al., 2017; Guo et al., 2018; Liao et al., 2018; Tramèr et al., 2018). Yet, existing defenses are often circumvented or penetrated by adaptive attacks (Athalye et al., 2018), whereas adversarial training (Madry et al., 2018; Shafahi et al., 2019) has proven to be a state-of-the-art defense that still stands against adaptive attacks. While most adversarial training methods are primarily designed for individual attacks that are either fixed (Madry et al., 2018) or selected from a pre-defined pool (Tramèr & Boneh, 2019; Maini et al., 2020), in realistic settings the adversary is not constrained to individual perturbation models but is free to "compose" multiple perturbation models to construct more powerful attacks. Despite their robustness against individual attacks, DNNs trained using existing methods often fail to defend against such composite attacks (details in § 2). Moreover, existing adversarial training methods focus on pixel perturbation-based attacks (e.g., those bounded by ℓp-norm balls), while research on training DNNs robust against spatial transformation-based attacks remains limited.
To bridge this striking gap, in this paper we present CAT, a novel adversarial training method that flexibly integrates and optimizes multiple adversarial robustness losses, yielding DNNs robust with respect to multiple individual perturbation models as well as their "compositions". Specifically, CAT assumes an attack model that composes multiple perturbations and, while bounded by the overall perturbation budget, optimally allocates the budget to each iteration. To address the computational challenges of this formulation, we extend recent advances on fast projection onto the ℓp,1 mixed-norm ball (Liu & Ye, 2010; Sra, 2012; Béjar et al., 2019) to our setting and significantly improve the optimization efficiency. We validate the efficacy of CAT on benchmark datasets and models. For instance, on MNIST, CAT outperforms alternative adversarial training methods (Tramèr & Boneh, 2019) by over 44% in terms of adversarial accuracy against attacks that combine pixel perturbation and spatial transformation (details in § 4), with comparable clean accuracy and training efficiency.
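To give intuition for the projection step mentioned above, the sketch below shows the standard reduction for the ℓ2,1 case (p = 2): projecting a matrix onto the ℓ2,1 mixed-norm ball reduces to projecting the vector of row ℓ2 norms onto an ℓ1 ball and rescaling each row. This is a minimal illustrative sketch of that known reduction, not the paper's implementation; the function names are hypothetical, and the paper's actual algorithm handles general ℓp,1 balls.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a nonnegative vector v onto the l1 ball
    {u >= 0 : sum(u) <= radius}, via the standard sort-and-threshold method."""
    if v.sum() <= radius:
        return v.copy()
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # largest index rho with u[rho] > (css[rho] - radius) / (rho + 1)
    rho = np.nonzero(u * ks > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)         # soft-threshold at theta

def project_l21_ball(X, radius):
    """Projection of matrix X onto the l2,1 mixed-norm ball
    {Y : sum_i ||y_i||_2 <= radius}: project the row norms onto the
    l1 ball, then rescale each row to its projected norm."""
    norms = np.linalg.norm(X, axis=1)
    new_norms = project_l1_ball(norms, radius)
    scale = np.where(norms > 0, new_norms / np.maximum(norms, 1e-12), 0.0)
    return X * scale[:, None]
```

The cost is dominated by sorting the row norms, so the projection runs in O(n log n) for n rows, which is what makes enforcing a joint budget across multiple perturbation components cheap inside each training iteration.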

