LEARNING PERTURBATION SETS FOR ROBUST MACHINE LEARNING

Abstract

Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline spatial transformations, through common image corruptions, to lighting variations. We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations, while improving generalization on non-adversarial data. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at https://github.com/locuslab/perturbation_learning.
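To make the idea of a generator-defined perturbation set concrete, the following is a minimal, illustrative PyTorch sketch, not the released implementation: the decoder architecture, latent dimension, and projected-gradient attack over the latent code are all assumptions. It shows a conditional generator g(z, x) whose norm-constrained latent code defines a perturbation set S(x) = {g(z, x) : ||z|| <= eps}, and how an adversary could search within that set.

```python
# Minimal sketch (not the authors' implementation): a learned perturbation set
# S(x) = { g(z, x) : ||z||_2 <= eps }, where g is a conditional decoder
# (e.g. the decoder of a CVAE). Architecture and attack are illustrative.
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Toy decoder g(z, x): maps a latent code z and an input x to a perturbed x."""
    def __init__(self, x_dim=784, z_dim=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Sigmoid(),
        )

    def forward(self, z, x):
        return self.net(torch.cat([z, x], dim=-1))

def project_l2(z, eps):
    """Project latent codes back onto the ball ||z||_2 <= eps."""
    norm = z.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return z * torch.clamp(eps / norm, max=1.0)

def attack_in_learned_set(g, clf, x, y, z_dim=8, eps=1.0, steps=20, lr=0.1):
    """Projected gradient ascent over z: find a perturbation of x inside the
    learned set S(x) that maximizes the classifier's loss."""
    z = torch.zeros(x.size(0), z_dim, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(clf(g(z, x)), y)
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z = project_l2(z + lr * grad, eps)
        z.requires_grad_(True)
    return g(z, x).detach()
```

Under these assumptions, adversarial training against the learned set would simply replace the usual norm-bounded attack with `attack_in_learned_set` when generating training examples.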

1. INTRODUCTION

Within the last decade, adversarial learning has become a core research area for studying robustness in machine learning. Adversarial attacks have expanded well beyond the original setting of imperceptible noise to more general notions of robustness, and can broadly be described as capturing sets of perturbations that humans are naturally invariant to. These invariants, such as the requirement that facial recognition be robust to adversarial glasses (Sharif et al., 2019) or that traffic sign classification be robust to adversarial graffiti (Eykholt et al., 2018), form the motivation behind many real-world adversarial attacks. However, human invariants can also include notions which are not inherently adversarial, for example that image classifiers should be robust to common image corruptions (Hendrycks & Dietterich, 2019) as well as changes in weather patterns (Michaelis et al., 2019). On the other hand, although there has been much success in defending against small adversarial perturbations, most successful and principled methods for learning robust models are limited to human invariants that can be characterized using mathematically defined perturbation sets, for example perturbations bounded in ℓp norm. Indeed, established guidelines for evaluating adversarial robustness (Carlini et al., 2019) have emphasized the importance of the perturbation set (or the threat model) as a necessary component for performing proper, scientific evaluations of adversarial defense proposals. However, this requirement makes it difficult to learn models which are robust to human invariants beyond these mathematically defined sets, since real-world attacks and more general notions of robustness are often virtually impossible to write down as a formal set of equations. This incompatibility between existing methods for learning robust models and real-world, human invariants raises a fundamental question for the field of robust machine learning:

