PRETRAIN-TO-FINETUNE ADVERSARIAL TRAINING VIA SAMPLE-WISE RANDOMIZED SMOOTHING

Abstract

Developing certified models that can provably defend against adversarial perturbations is important in machine learning security. Recently, randomized smoothing, combined with other techniques (Cohen et al., 2019; Salman et al., 2019), has been shown to be an effective method for certifying models against l2 perturbations. Existing work on certifying l2 perturbations adds the same level of Gaussian noise to each sample; the noise level determines the trade-off between test accuracy and average certified robust radius. We propose to further improve the defense via sample-wise randomized smoothing, which assigns different noise levels to different samples. Specifically, we propose a pretrain-to-finetune framework that first pretrains a model and then adjusts the noise levels for higher performance based on the model's outputs. For certification, we carefully allocate specific robust regions to each test sample. We perform extensive experiments on the CIFAR-10 and MNIST datasets, and the results demonstrate that our method achieves a better accuracy-robustness trade-off in the transductive setting.

1. INTRODUCTION

The vulnerability of neural networks to adversarial examples has attracted considerable attention in safety-critical scenarios. For example, adding visually indistinguishable perturbations to input images can misguide a deep classifier into making wrong predictions (Szegedy et al., 2013; Goodfellow et al., 2014). This intriguing property of neural networks has spawned many works on training robust neural networks and certifying network robustness with theoretical guarantees. Many heuristic defense algorithms have been developed (Szegedy et al., 2013; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Papernot et al., 2016; Kurakin et al., 2016; Carlini & Wagner, 2017; Athalye et al., 2018), together with many empirically robust models intended to defend against certain types of adversarial perturbations. However, many heuristic defenses come with no theoretical guarantee of robustness, and many have subsequently been broken by more carefully designed and more powerful attacks (Athalye et al., 2018). This motivates the development of certifiably robust classifiers whose outputs are guaranteed to be unchanged within an lp-ball of a certain radius, and which can hence defend against any adversarial perturbation smaller than that radius (Hein & Andriushchenko, 2017; Raghunathan et al., 2018; Wong & Kolter, 2017; Gowal et al., 2018; Wong et al., 2018; Zhang et al., 2018; Cohen et al., 2019; Salman et al., 2019). Both Cohen et al. (2019) and Salman et al. (2019) added the same level of Gaussian noise to each sample, and a higher noise level leads to lower accuracy but a larger average robust radius. While it is well known that accuracy and robustness are at odds with each other (Tsipras et al., 2018), we find that we can improve both via sample-wise randomized smoothing, which assigns different noise levels to different samples.
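The trade-off governed by the noise level follows directly from the certified-radius formula of Cohen et al. (2019), R = sigma * Phi^{-1}(p_A), where p_A is a lower confidence bound on the top-class probability under Gaussian noise. A minimal sketch (the function name is ours, for illustration):

```python
from statistics import NormalDist

def certified_radius(p_a_lower: float, sigma: float) -> float:
    """l2 certified radius from Cohen et al. (2019): R = sigma * Phi^{-1}(p_A).

    p_a_lower: lower confidence bound on the probability that the base
    classifier predicts the top class under noise N(0, sigma^2 I).
    Returns 0 when the bound does not exceed 1/2 (the classifier abstains).
    """
    if p_a_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)

# A larger sigma enlarges the radius for the same p_A, but in practice
# heavier noise also lowers p_A, which is the accuracy-robustness trade-off.
print(certified_radius(0.9, 0.25))  # ≈ 0.320
print(certified_radius(0.9, 0.50))  # ≈ 0.641
```

The radius scales linearly with sigma, which is why assigning a per-sample sigma (rather than a single global one) can enlarge the radius for easy samples without sacrificing accuracy on hard ones.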
While the above argument seems intuitive, we note a subtle and challenging issue: a close examination of the proof of the randomized smoothing theorem in Cohen et al. (2019) shows that we cannot assign an arbitrary noise level to a test point; a robustness radius around a point can be certified only if all points within that radius are assigned the same Gaussian variance. To address this issue, we divide the input space into "robust regions", and samples in the same region are assigned the same noise level. To certify a sample-wise randomized smoothed model, we ensure that the certified l2-ball is entirely contained in one of the regions, so that the randomized smoothing theorem can be applied to that region. Precisely speaking, in our method the classification of a test point depends on which "robust region" it lies in, and hence on the other points in the test dataset. We therefore present our results in the transductive setting (the test dataset is known), but our method also works when the test points arrive online in an i.i.d. fashion (see the detailed discussion in Appendix D).

Our contributions can be summarized as follows:

1. We introduce a pretrain-to-finetune framework that first pretrains a model with random noise levels and then finetunes it with specifically selected noise levels, chosen to maximize the robust radius of each training sample based on the pretrained model's output.

2. In the transductive setting, we allocate robust regions and assign noise levels to test samples using linear programming to obtain near-optimal results. If the test points arrive online, we allocate robust regions one by one: if a new test sample falls in an already-allocated region, it uses that region's noise level; otherwise, we allocate a new region for it.

3. We conduct a series of experiments on CIFAR-10 and MNIST to evaluate the performance of our method.
Compared with the state-of-the-art algorithms Smooth-Adv (Salman et al., 2019) and MACER (Zhai et al., 2020), our sample-wise method achieves a 40% improvement in average certified radius on CIFAR-10 and comparable results on MNIST.
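The online region allocation in contribution 2 can be sketched as follows. This is an illustrative simplification, not the paper's exact procedure: the region shape, the per-region sigma choice, and all names are our assumptions; the only constraint it enforces is that points sharing a region share a noise level, so any certified ball inside the region sees a constant sigma.

```python
import math

def allocate_region(regions, x, default_sigma, region_radius):
    """Online robust-region allocation (illustrative sketch).

    regions: list of (center, sigma) pairs; every point within
    region_radius of a center reuses that center's noise level, so a
    certified l2-ball contained in the region sees a single sigma.
    Returns the noise level assigned to the new test point x.
    """
    for center, sigma in regions:
        if math.dist(center, x) <= region_radius:
            return sigma  # x falls in an existing region: reuse its noise level
    # x is not covered: open a new region centered at x
    regions.append((tuple(x), default_sigma))
    return default_sigma

regions = []
allocate_region(regions, (0.0, 0.0), 0.5, 1.0)   # new region
allocate_region(regions, (0.1, 0.1), 0.25, 1.0)  # inside the first region
```

In the transductive setting the paper instead assigns regions and sigmas to the whole test set jointly via linear programming; the one-by-one rule above is the online fallback described in contribution 2.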

2. RELATED WORK

There is a large body of work proposing adversarial defense algorithms, which can be divided into two categories, empirical defenses and provable defenses, depending on whether the resulting model achieves certified robustness.

Empirical defenses. Empirical defenses train classifiers that are robust under specific attacks. Lacking theoretical guarantees, these defenses can usually be broken by stronger attacks. Early defenses added noise to the input sample or to features at inference time, in the hope that the adversarial perturbation would degenerate into random noise that could be easily handled.



Recently, Cohen et al. (2019) utilized a randomized smoothing technique to build robust smoothed classifiers with provable l2-robustness. Specifically, the smoothed classifier outputs the class most likely to be predicted by the base classifier when the input is perturbed by a certain level of Gaussian noise. Building upon this idea, Salman et al. (2019) combined randomized smoothing with adversarial training and achieved a state-of-the-art provable l2-defense.
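The smoothed classifier g(x) = argmax_c P(f(x + eps) = c), eps ~ N(0, sigma^2 I), is intractable exactly and is approximated in practice by Monte Carlo voting over noisy copies of the input. A minimal sketch (function names and the toy base classifier are ours):

```python
import random
from collections import Counter

def smoothed_predict(base_classifier, x, sigma, n_samples=100, seed=0):
    """Monte Carlo approximation of the smoothed classifier
    g(x) = argmax_c P(f(x + eps) = c), eps ~ N(0, sigma^2 I).

    base_classifier maps a feature vector to a class label; the returned
    class is the majority vote of the base classifier over noisy inputs.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    return votes.most_common(1)[0][0]

# Toy base classifier: the sign of the first coordinate.
f = lambda v: int(v[0] > 0)
print(smoothed_predict(f, [1.5, -0.2], sigma=0.5))  # prints 1
```

The full certification procedure of Cohen et al. (2019) additionally replaces the raw vote fraction with a lower confidence bound before computing the radius; the sketch above shows only the prediction step.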

For example, Papernot et al. (2016) used a distillation framework to remove the effect of adversarial examples on the model. Guo et al. (2017) preprocessed the input images before feeding them into the networks, for example with JPEG compression. However, Athalye et al. (2018) found that these empirical defenses could be broken by stronger attacks. So far, adversarial training (Goodfellow et al., 2014; Madry et al., 2017), the process of training a model on adversarial examples generated on the fly, has been one of the most powerful empirical defenses. Madry et al. (2017) directly minimize the classification error on adversarial examples generated by projected gradient descent (PGD), which maximizes the error, in effect solving a min-max optimization problem. The TRADES technique (Zhang et al., 2019) improved the results by minimizing a surrogate loss, consisting of a natural error term and a boundary error term, that provides a tight upper bound on the original loss. Though powerful, adversarial training still lacks a certified guarantee.

Provable defenses. Since empirical defenses are only robust to specific attacks and face potential threats, research focus has shifted to certified defenses whose predictions can be proved robust under perturbations within a certain range, and more certified defenses have been proposed recently. Raghunathan et al. (2018) certified a shallow network based on semidefinite relaxation, Wong & Kolter (2017) made use of a convex outer adversarial polytope for the ReLU activation function, and Zhang et al. (2018) proposed a solution for general activation functions. Though effective, these methods all rely on the structure of the network. A randomized smoothing classifier (Cohen et al., 2019) is a certified robust classifier that is independent of the network structure and relies only on the predictions of the base classifier; in fact, it is a virtual classifier whose prediction is generated from the predictions of the base classifier. Salman et al. (2019) then proposed Smooth-Adv, which directly attacks this virtual classifier and further improves the certified robustness.
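The inner maximization of adversarial training mentioned above can be illustrated with a minimal l_inf PGD sketch on a toy differentiable loss. This is a simplified illustration, not the paper's method or Madry et al.'s exact recipe (no random start, a crude sign rule at zero gradient); the function names and the toy gradient are ours.

```python
def pgd_linf(grad_fn, x, epsilon, alpha, steps):
    """Projected gradient ascent on the loss within an l_inf ball of
    radius epsilon around x (the inner maximization of adversarial
    training). grad_fn returns the loss gradient w.r.t. the input.
    """
    delta = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn([xi + di for xi, di in zip(x, delta)])
        # ascend along the gradient sign, then project back onto the ball
        delta = [max(-epsilon, min(epsilon, di + alpha * (1.0 if gi >= 0 else -1.0)))
                 for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]

# Toy loss L(x) = sum(x_i^2), so grad = 2x; PGD pushes each coordinate
# of [1, 1] to the corner of the epsilon-ball that increases the loss.
adv = pgd_linf(lambda v: [2.0 * vi for vi in v], [1.0, 1.0],
               epsilon=0.1, alpha=0.05, steps=5)  # -> [1.1, 1.1]
```

Smooth-Adv applies the same kind of attack to the smoothed (virtual) classifier rather than to the base classifier, which is what makes the adversarial examples useful for training the smoothed model.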

