DOES THE HALF ADVERSARIAL ROBUSTNESS REPRESENT THE WHOLE? IT DEPENDS ... A THEORETICAL PERSPECTIVE OF SUBNETWORK ROBUSTNESS

Abstract

The adversarial robustness of deep neural networks has been studied extensively and provides security against adversarial attacks and examples. However, adversarially robust training approaches require a training mechanism over the entire deep network, which comes at the cost of efficiency and computational complexity, such as increased runtime. As a pilot study, we develop in this paper a novel theoretical framework that aims to answer the question: how can we make a whole model robust to adversarial examples by making only part of it robust? Toward promoting subnetwork robustness, we propose, for the first time, the concept of semirobustness, which denotes the adversarial robustness of a part of a network. We provide a theoretical analysis showing that if a subnetwork is robust and highly dependent on the rest of the network, then the remaining layers are also guaranteed to be robust. To investigate our theoretical findings empirically, we implement our method at multiple layer depths and across multiple common image classification datasets. Experiments demonstrate that, with sufficient dependency between subnetworks, our method successfully utilizes subnetwork robustness to match the performance of fully-robust models on AlexNet, VGG16, and ResNet50 benchmarks, under the attack types FGSM, I-FGSM, PGD, C&W, and AutoAttack.

1. INTRODUCTION

Deep neural networks (DNNs) have been highly successful in computer vision, particularly in image classification, as well as in speech recognition and natural language processing, where they can often outperform human abilities Mnih et al. (2015); Radford et al. (2015); Goodfellow et al. (2016). Despite this, the reliability of deep learning algorithms is fundamentally challenged by the existence of "adversarial examples": natural images perturbed by small, carefully crafted noise such that networks misclassify them. In the context of image classification, an extremely small perturbation can change the label of a correctly classified image Szegedy et al. (2014); Goodfellow et al. (2014). For this reason, adversarial examples present a major threat to the security of deep-learning systems; a robust classifier, however, can correctly label adversarially perturbed images. For example, an adversary could alter images of the road to fool a self-driving car's neural network into misclassifying traffic signs Papernot et al. (2016a), reducing the car's safety, but a robust network would detect and reject such adversarial inputs Ma et al. (2018); Biggio et al. (2013). The problem of finding perturbed inputs, known as adversarial attacks, has been studied extensively Kurakin et al. (2017); Sharif et al. (2016); Brown et al. (2017); Eykholt et al. (2018). To handle adversarial attacks, two major lines of work have been studied: (1) efficient methods to find adversarial examples Su et al. (2019); Laidlaw & Feizi (2019); Athalye et al. (2018); Liu et al. (2016); Xie et al. (2017); Akhtar & Mian (2018), and (2) adversarial training to make deep neural networks more robust against adversarial attacks Madry et al. (2018); Tsipras et al. (2019); Gilmer et al. (2019); Ilyas et al. (2019); Papernot et al. (2016b). The adversarial perturbations may be applied to the input or to the network's hidden layers Goodfellow et al. (2014); Szegedy et al. (2014), and it has been shown that this strategy is effective at improving a network's robustness Goodfellow et al. (2014). Several theories have been developed to explain the phenomenon of adversarial examples Raghunathan et al. (2018); Xiao et al. (2019); Cohen et al. (2019); Shamir et al. (2019); Fawzi et al. (2016); Carlini & Wagner (2017); Weng et al. (2018); Ma
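To make the attack model above concrete, the simplest of the gradient-based attacks, FGSM Goodfellow et al. (2014), perturbs an input by a single step of size epsilon in the sign of the input gradient of the loss. The following is a minimal sketch using a toy logistic-regression loss; the weights, input, and epsilon below are hypothetical and chosen only for illustration, not taken from our experiments:

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """One FGSM step: move eps in the sign of the loss gradient w.r.t. x."""
    return x + eps * np.sign(grad)

def logistic_loss(w, x, y):
    """L(x) = log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    s = 1.0 / (1.0 + np.exp(y * np.dot(w, x)))  # sigmoid(-y * w.x)
    return -y * s * w

w = np.array([1.0, -2.0, 0.5])   # hypothetical trained weights
x = np.array([0.2, 0.1, -0.3])   # clean input
y = 1.0                          # true label

x_adv = fgsm_perturb(x, input_gradient(w, x, y), eps=0.1)
# The perturbation is L-infinity bounded by eps, yet the loss increases,
# since every coordinate moves in the locally loss-increasing direction.
```

Iterative variants such as I-FGSM and PGD repeat this step with projection back onto the epsilon-ball, which is why they are strictly stronger attacks in practice.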

