LEARNING CONTEXTUAL PERTURBATION BUDGETS FOR TRAINING ROBUST NEURAL NETWORKS

Abstract

Existing methods for training robust neural networks generally aim to make models uniformly robust across all input dimensions. However, different input dimensions are not equally important to the prediction. In this paper, we propose a novel framework that trains certifiably robust models while learning non-uniform perturbation budgets on different input dimensions, in contrast to using the popular ℓ∞ threat model. We incorporate a perturbation budget generator into an existing certified defense framework, and perform certified training with the generated perturbation budgets. In place of the radius of the ℓ∞ ball used in previous works, robustness intensity is measured by the robustness volume, defined as the product of the perturbation budgets over all input dimensions. We evaluate our method on the MNIST and CIFAR-10 datasets and show that we can achieve lower clean and certified errors at relatively large robustness volumes, compared to methods using uniform perturbation budgets. Further, with two synthetic datasets constructed from MNIST and CIFAR-10, we also demonstrate that the perturbation budget generator can produce semantically meaningful budgets, which implies that the generator can capture contextual information and the sensitivity of different features in input images.

1. INTRODUCTION

It has been demonstrated that deep neural networks, although achieving impressive performance on various tasks, are vulnerable to adversarial perturbations (Szegedy et al., 2013). Models with high accuracy on clean, unperturbed data can be fooled into extremely poor performance when the input data are adversarially perturbed. The existence of adversarial perturbations raises concerns in safety-critical applications such as self-driving cars, face recognition, and medical diagnosis. A number of methods have been proposed for training robust neural networks that can resist adversarial perturbations to some extent. Among them, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and certified defenses (Wong et al., 2018; Gowal et al., 2018; Zhang et al., 2020) are among the most reliable so far, and most of them try to make the network robust to any perturbation within an ℓp norm ball. Taking the commonly used ℓ∞-ball defense as an example, robust training methods aim to make the model robust to perturbations on any pixel, which means the model is uniformly robust across all input dimensions. But is this a valid assumption to make? As we know, human perception is non-uniform (humans focus on important features even though these features can be sensitive to small noise) and context-dependent (which part of an image is important depends heavily on what is in the image). We expect a robust model to be close to human perception, rather than to learn to defend against a particular fixed threat model, e.g., the traditional ℓ∞-norm one. Intuitively, we expect a good model to be more sensitive to important features and less sensitive to unimportant ones, and the importance of features should be context-dependent.
Taking the MNIST hand-written digit classification problem as an example, the digit 9 can be transformed into a 4 simply by modifying just a few pixels on its head, so those pixels should be considered more important, and enforcing them to be robust to large perturbations may not be correct. On the other hand, the pixels on the frame of such an input image can be safely modified without changing the ground-truth label of the image. Therefore, a uniform budget in robust training can greatly hamper the performance of neural networks on certain tasks, and can force the network to ignore features that are important for classification. Robustness certification with non-uniform perturbation budgets has been discussed in a prior work (Liu et al., 2019), but training robust models while learning context-dependent perturbation budgets has not been addressed in prior works, and it is more challenging and more important for obtaining robust models. A detailed discussion of our differences from Liu et al. (2019) is in Sec. 2.2. In this paper, we propose the first method that can learn context-dependent, non-uniform perturbation budgets in certified robust training, based on prior certified defense algorithms for ℓp-norm threat models (Zhang et al., 2020; Xu et al., 2020). To learn a context-dependent budget without introducing too many learnable parameters, we introduce a perturbation budget generator, an auxiliary neural network that generates context-dependent budgets based on the input image. We also impose constraints on the generator to make the generated budgets satisfy target robustness volumes and budget ranges, where the robustness volume is defined as the product of the budgets over all input dimensions. We then train the classifier with a linear-relaxation-based certified defense algorithm, auto LiRPA (Xu et al., 2020), generalized from CROWN-IBP (Zhang et al., 2020), to minimize the verified error under the given budget constraints.
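The pipeline described above couples a budget generator with certified bounds. The following is a minimal NumPy sketch of the two ingredients, under loudly stated assumptions: the "generator output" is a random stand-in for a real auxiliary network, plain interval bound propagation replaces the linear-relaxation bounds the paper actually uses, the budget-range constraint is omitted, and all shapes and values are illustrative rather than taken from the paper.

```python
import numpy as np

def budgets_from_scores(scores, target_volume):
    """Map raw generator scores to per-pixel budgets eps > 0 whose
    product equals target_volume.

    Shifting the scores in log-space so they sum to log(target_volume)
    guarantees prod(eps) == target_volume, since
    prod(exp(t_i)) = exp(sum t_i).
    """
    d = scores.size
    log_eps = scores - scores.mean() + np.log(target_volume) / d
    return np.exp(log_eps)

def ibp_affine(l, u, W, b):
    """Propagate elementwise bounds l <= x <= u through W @ x + b.

    With center c = (l + u)/2 and radius r = (u - l)/2, the output
    interval is W @ c + b plus/minus |W| @ r.
    """
    c, r = (l + u) / 2, (u - l) / 2
    center, radius = W @ c + b, np.abs(W) @ r
    return center - radius, center + radius

rng = np.random.default_rng(0)
x = rng.uniform(size=8)                      # a toy 8-pixel "image"
scores = rng.normal(size=8)                  # stand-in for the generator's output
eps = budgets_from_scores(scores, target_volume=0.05 ** 8)

# Certified output bounds of one affine layer under the non-uniform
# box perturbation [x - eps, x + eps].
W, b = rng.normal(size=(3, 8)), rng.normal(size=3)
lo, uo = ibp_affine(x - eps, x + eps, W, b)
print(np.isclose(eps.prod(), 0.05 ** 8))     # volume constraint holds
```

Enforcing the product constraint in log-space keeps the budgets positive and differentiable in the scores, which is what would allow gradients from a certified loss to flow back into the generator; this sketch is only meant to make the constraint concrete, not to reproduce the method.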
The gradients of the loss function can be back-propagated to the perturbation budgets, allowing the classification network and the budget generator to be trained jointly in robust training. Our contributions are summarized below:
• We propose a novel algorithm to train robust networks with contextual perturbation budgets rather than uniform ones. We show that it can be incorporated into certified defense methods with linear-relaxation-based robustness verification.
• We demonstrate that our method can effectively train the classifier and the perturbation budget generator jointly, and that we are able to train models at relatively large robustness volumes that outperform those trained with uniform budgets.
• We also show that the learned perturbation budgets are semantically meaningful and align well with the importance of different pixels in the input image. We further confirm this with two synthetic tasks and datasets constructed from MNIST and CIFAR-10, respectively.

2. BACKGROUND AND RELATED WORK

2.1 TRAINING ROBUST NEURAL NETWORKS

Since the discovery of adversarial examples (Szegedy et al., 2013; Biggio et al., 2013), a great number of works have been devoted to improving the robustness of neural networks from both attack and defense perspectives (Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017; Papernot et al., 2016; Moosavi-Dezfooli et al., 2017; Gowal et al., 2019). On a K-way classification task, training an adversarially robust neural network f_w with weights w can generally be formulated as solving the following min-max optimization problem:

min_w E_{(x,y)∼D} max_{δ∈∆} L(f_w(x + δ), y),    (1)

where D is the data distribution, ∆ is a threat model defining the space of perturbations, and L is a loss function. Adversarial training (Goodfellow et al., 2015; Madry et al., 2018) applies adversarial attacks to solve the inner maximization problem and trains the neural network on the generated adversarial examples, with efficiency improved in some recent works (Shafahi et al., 2019; Wong et al., 2020). However, robustness improvements from adversarial training do not have provable guarantees. Some other recent works seek to train networks that have provable robustness, namely certified defense methods. Such methods solve the inner maximization by computing certified upper bounds that provably hold for any perturbation within the threat model, including abstract interpretation (Singh et al., 2018), interval bound propagation (Gowal et al., 2018; Mirman et al., 2018), randomized smoothing (Cohen et al., 2019; Salman et al., 2019; Zhai et al., 2020), and linear-relaxation-based methods (Wong & Kolter, 2018; Mirman et al., 2018; Wong et al., 2018; Zhang et al., 2020; Xu et al., 2020). However, nearly all existing certified defense methods treat all input features equally in the

