LEARNING PERTURBATION SETS FOR ROBUST MACHINE LEARNING

Abstract

Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline spatial transformations, through common image corruptions, to lighting variations. We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations, while improving generalization on non-adversarial data. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at https://github.com/locuslab/perturbation_learning.

1. INTRODUCTION

Within the last decade, adversarial learning has become a core research area for studying robustness in machine learning. Adversarial attacks have expanded well beyond the original setting of imperceptible noise to more general notions of robustness, and can broadly be described as capturing sets of perturbations that humans are naturally invariant to. These invariants, such as the requirement that facial recognition be robust to adversarial glasses (Sharif et al., 2019) or that traffic sign classification be robust to adversarial graffiti (Eykholt et al., 2018), form the motivation behind many real-world adversarial attacks. However, human invariants can also include notions which are not inherently adversarial; for example, image classifiers should be robust to common image corruptions (Hendrycks & Dietterich, 2019) as well as changes in weather patterns (Michaelis et al., 2019). On the other hand, although there has been much success in defending against small adversarial perturbations, most successful and principled methods for learning robust models are limited to human invariants that can be characterized using mathematically defined perturbation sets, for example perturbations bounded in ℓp norm. After all, established guidelines for evaluating adversarial robustness (Carlini et al., 2019) have emphasized the importance of the perturbation set (or the threat model) as a necessary component for performing proper, scientific evaluations of adversarial defense proposals. However, this requirement makes it difficult to learn models which are robust to human invariants beyond these mathematical sets, where real-world attacks and general notions of robustness can often be virtually impossible to write down as a formal set of equations.
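For concreteness, the mathematically defined perturbation sets referenced above are typically ℓp-norm balls, and robust training is posed as a min-max problem over them. The formulation below is the standard one from the adversarial robustness literature, included here only as context; it is not notation introduced in this paper:

```latex
% Standard l_p threat model: all perturbations within an epsilon-ball
\mathcal{S}_p(x) = \{\, x + \delta \;:\; \|\delta\|_p \le \epsilon \,\}

% Robust (adversarial) training objective over this set
\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \;
  \max_{\|\delta\|_p \le \epsilon} \; \ell\big(f_\theta(x+\delta),\, y\big)
```

Real-world invariants such as graffiti, corruptions, or weather have no such closed-form description of S(x), which is the gap this paper addresses.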
This incompatibility between existing methods for learning robust models and real-world, human invariants raises a fundamental question for the field of robust machine learning: in the absence of a precise mathematical definition, how can we characterize perturbations for robust training and evaluation? In this work we present a general framework for learning perturbation sets from perturbed data. More concretely, given pairs of examples where one is a perturbed version of the other, we propose learning generative models that can "perturb" an example by varying a fixed region of the underlying latent space. The resulting perturbation sets are well-defined and can naturally be used in robust training and evaluation tasks. The approach is widely applicable to a range of robustness settings, as we make no assumptions on the type of perturbation being learned: the only requirement is to collect pairs of perturbed examples. Given the susceptibility of deep learning to adversarial examples, such a perturbation set will undoubtedly come under intense scrutiny, especially if it is to be used as a threat model for adversarial attacks. In this paper, we begin our theoretical contributions with a broad discussion of perturbation sets and formulate deterministic and probabilistic properties that a learned perturbation set should have in order to be a meaningful proxy for the true underlying perturbation set. The necessary subset property ensures that the set captures real perturbations, properly motivating its usage as an adversarial threat model. The sufficient likelihood property ensures that real perturbations have high probability, which motivates sampling from the perturbation set as a form of data augmentation. We then prove the main theoretical result: that a learned perturbation set defined by the decoder and prior of a conditional variational autoencoder (CVAE) (Sohn et al., 2015) satisfies both of these properties, providing a theoretically grounded framework for learning perturbation sets.
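To make the construction concrete, one way to realize such a learned perturbation set is S(x) = { g(z, x) : ‖z‖ ≤ ε }, where g is a conditional generator and z ranges over a norm-bounded region of latent space. The sketch below makes this definition executable with a toy linear generator in NumPy; the class and function names, the linear decoder, and the ε-ball projection are illustrative assumptions standing in for a trained CVAE decoder, not the paper's implementation:

```python
import numpy as np

class ToyConditionalGenerator:
    """Toy stand-in for a CVAE decoder g(z, x).

    The paper's framework uses the decoder of a trained conditional VAE
    here; this linear version exists only to make the set definition
    executable.
    """
    def __init__(self, x_dim, z_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(x_dim, z_dim))

    def __call__(self, z, x):
        # Decode latent code z, conditioned on the unperturbed example x.
        return x + self.A @ z

def sample_perturbation_set(g, x, z_dim, epsilon, n_samples, seed=0):
    """Draw samples from S(x) = { g(z, x) : ||z||_2 <= epsilon }."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        z = rng.normal(size=z_dim)
        n = np.linalg.norm(z)
        if n > epsilon:
            z *= epsilon / n  # keep z inside the latent epsilon-ball
        samples.append(g(z, x))
    return np.stack(samples)
```

Because every sample is decoded from a z inside the ε-ball, membership in S(x) holds by construction; an adversarial attack searches over the same ball rather than over pixel space.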
The resulting CVAE perturbation sets are well motivated, can leverage standard architectures, and are computationally efficient with little tuning required. We highlight the versatility of our approach using CVAEs with an array of experiments, varying the complexity and scale of the datasets, perturbations, and downstream tasks. We first demonstrate how the approach can learn basic ℓ∞ and rotation-translation-skew (RTS) perturbations (Jaderberg et al., 2015) in the MNIST setting. Since these sets can be mathematically defined, our goal is simply to measure how well the learned perturbation set captures the target perturbation set on baseline tasks where the ground truth is known. We next look at a more difficult setting which cannot be mathematically defined, and learn a perturbation set for common image corruptions on CIFAR10 (Hendrycks & Dietterich, 2019). The resulting perturbation set can interpolate between common corruptions, produce diverse samples, and be used in adversarial training and randomized smoothing frameworks. The adversarially trained models have improved generalization performance on both in- and out-of-distribution corruptions and better robustness to adversarial corruptions. In our final setting, we learn a perturbation set that captures real-world variations in lighting using a multi-illumination dataset of scenes captured "in the wild" (Murmann et al., 2019). The perturbation set generates meaningful lighting samples and interpolations while generalizing to unseen scenes, and can be used to learn image segmentation models that are empirically and certifiably robust to lighting changes. All code and configuration files for reproducing the experiments, as well as pretrained model weights for both the learned perturbation sets and the downstream robust classifiers, are at https://github.com/locuslab/perturbation_learning.
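Once a perturbation set is parameterized by a latent ε-ball, the inner maximization of adversarial training can be carried out with projected gradient ascent in latent space rather than pixel space. The following self-contained sketch uses a toy linear generator and linear scorer with a squared loss so the gradient is available in closed form; all names, the loss choice, and the step sizes are illustrative assumptions, not the paper's training code:

```python
import numpy as np

def latent_pgd(x, y, w, A, epsilon, steps=50, lr=0.1):
    """Projected gradient ascent over the latent ball ||z||_2 <= epsilon.

    Toy setup (illustrative only):
      generator  g(z, x) = x + A z     (stand-in for a CVAE decoder)
      predictor  f(x')   = w . x'      (linear scorer)
      loss       L(z)    = (f(g(z, x)) - y)^2, maximized by the attacker
    """
    z = np.zeros(A.shape[1])
    for _ in range(steps):
        pred = w @ (x + A @ z)
        grad = 2.0 * (pred - y) * (A.T @ w)  # closed-form dL/dz
        z = z + lr * grad                    # ascent step on the loss
        n = np.linalg.norm(z)
        if n > epsilon:
            z *= epsilon / n                 # project back onto the ball
    return z
```

An adversarial training loop would then take a parameter gradient step at the perturbed point g(z*, x), while randomized smoothing instead averages predictions over z sampled inside the same ball.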

2. BACKGROUND AND RELATED WORK

Perturbation sets for adversarial threat models Adversarial examples were initially defined as imperceptible examples with small ℓ1, ℓ2, and ℓ∞ norm (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014), forming the earliest known, well-defined perturbation sets, which were eventually generalized to the union of multiple ℓp perturbations (Tramèr & Boneh, 2019; Maini et al., 2019; Croce & Hein, 2019; Stutz et al., 2019). Alternative perturbation sets to the ℓp setting that remain well-defined incorporate more structure and semantic meaning, such as rotations and translations (Engstrom et al., 2017), Wasserstein balls (Wong et al., 2019), functional perturbations (Laidlaw & Feizi, 2019), distributional shifts (Sinha et al., 2017; Sagawa et al., 2019), word embeddings (Miyato et al., 2016), and word substitutions (Alzantot et al., 2018; Jia et al., 2019). Other work has studied perturbation sets that are not necessarily mathematically formulated but are well-defined from a human perspective, such as spatial transformations (Xiao et al., 2018b). Real-world adversarial attacks tend to either remain inconspicuous to the viewer or meddle with features that humans would naturally ignore, such as textures on 3D printed objects (Athalye et al., 2017), graffiti on traffic signs (Eykholt et al., 2018), shapes of objects to avoid LiDAR detection (Cao et al.,

