LEARNING CONTEXTUAL PERTURBATION BUDGETS FOR TRAINING ROBUST NEURAL NETWORKS

Abstract

Existing methods for training robust neural networks generally aim to make models uniformly robust across all input dimensions. However, different input dimensions are not uniformly important to the prediction. In this paper, we propose a novel framework to train certifiably robust models while learning non-uniform perturbation budgets on different input dimensions, in contrast to the popular ℓ∞ threat model. We incorporate a perturbation budget generator into the existing certified defense framework, and perform certified training with the generated perturbation budgets. In place of the radius of the ℓ∞ ball used in previous works, robustness intensity is measured by the robustness volume, defined as the product of the perturbation budgets over all input dimensions. We evaluate our method on the MNIST and CIFAR-10 datasets and show that it achieves lower clean and certified errors at relatively larger robustness volumes, compared to methods using uniform perturbation budgets. Furthermore, with two synthetic datasets constructed from MNIST and CIFAR-10, we demonstrate that the perturbation budget generator can produce semantically meaningful budgets, which implies that the generator captures contextual information and the sensitivity of different features in input images.
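To make the notion of robustness volume concrete, the following is a minimal sketch (not code from the paper; the function names and toy budget values are illustrative) of the quantity defined above: the product of per-dimension perturbation budgets, of which a uniform ℓ∞ ball of radius ε over d dimensions is the special case where every budget equals ε.

```python
import numpy as np

def robustness_volume(budgets):
    # Robustness volume: the product of per-dimension perturbation budgets.
    return float(np.prod(budgets))

def log_robustness_volume(budgets):
    # Products over many dimensions underflow quickly; in practice the
    # log-volume (a sum of logs) is the numerically stable quantity.
    return float(np.sum(np.log(budgets)))

# Toy example with 4 input dimensions (values are illustrative only).
uniform = np.full(4, 0.1)                      # uniform budget eps = 0.1
nonuniform = np.array([0.05, 0.2, 0.1, 0.2])  # tighter budget on sensitive dims

print(robustness_volume(uniform))     # 0.1**4 = 1e-4
print(robustness_volume(nonuniform))  # 2e-4, a larger volume overall
```

Note that the non-uniform budgets span a larger volume than the uniform ball while still keeping the sensitive dimensions tighter, which is exactly the trade-off the abstract describes.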

1. INTRODUCTION

It has been demonstrated that deep neural networks, despite achieving impressive performance on various tasks, are vulnerable to adversarial perturbations (Szegedy et al., 2013). Models with high accuracy on clean, unperturbed data can be fooled into extremely poor performance when input data are adversarially perturbed. The existence of adversarial perturbations raises concerns in safety-critical applications such as self-driving cars, face recognition, and medical diagnosis. A number of methods have been proposed for training robust neural networks that can resist adversarial perturbations to some extent. Among them, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and certified defenses (Wong et al., 2018; Gowal et al., 2018; Zhang et al., 2020) are among the most reliable so far, and most of them aim to make the network robust to any perturbation within an ℓp norm ball. Taking the commonly used ℓ∞-ball defense as an example, robust training methods aim to make the model robust to perturbations on any pixel, which means the model is uniformly robust across all input dimensions. But is this a valid assumption to make? Human perception is non-uniform (humans focus on important features even though these features can be sensitive to small noise) and context-dependent (which part of an image is important depends heavily on its content). We expect a robust model to be close to human perception, rather than to learn to defend against a particular fixed threat model, e.g., the traditional ℓ∞-norm one. Intuitively, a good model should be more sensitive to important features and less sensitive to unimportant ones, and the importance of features should be context-dependent.
Taking the MNIST hand-written digit classification problem as an example, the digit 9 can be transformed into a 4 by modifying just a few pixels at its head, so those pixels should be considered more important, and enforcing robustness to a large perturbation on them may not be correct. On the other hand, the pixels on the border of such an image can be safely modified without changing the ground-truth label. Therefore, a uniform budget in robust training can greatly hamper the performance of

