DOES THE HALF ADVERSARIAL ROBUSTNESS REPRESENT THE WHOLE? IT DEPENDS ... A THEORETICAL PERSPECTIVE OF SUBNETWORK ROBUSTNESS

Abstract

Adversarial robustness of deep neural networks has been studied extensively and provides security against adversarial attacks. However, adversarially robust training approaches operate on the entire deep network, which comes at a cost in efficiency and computational complexity, such as runtime. As a pilot study, we develop in this paper a novel theoretical framework that aims to answer the question: how can we make a whole model robust to adversarial examples by making only part of the model robust? Toward promoting subnetwork robustness, we propose for the first time the concept of semirobustness, which denotes adversarial robustness of a part of the network. We provide a theoretical analysis showing that if a subnetwork is robust and highly dependent on the rest of the network, then the remaining layers are also guaranteed to be robust. To empirically investigate our theoretical findings, we implemented our method at multiple layer depths and across multiple common image classification datasets. Experiments demonstrate that, with sufficient dependency between subnetworks, our method successfully exploits subnetwork robustness to match the performance of fully robust models on AlexNet, VGG16, and ResNet50 benchmarks under the FGSM, I-FGSM, PGD, C&W, and AutoAttack attacks.

1. INTRODUCTION

Deep neural networks (DNNs) have been highly successful in computer vision, particularly in image classification, as well as in speech recognition and natural language processing, where they can often outperform human abilities Mnih et al. (2015); Radford et al. (2015); Goodfellow et al. (2016). Despite this, the reliability of deep learning algorithms is fundamentally challenged by the existence of "adversarial examples": natural inputs perturbed with carefully crafted, often imperceptible noise such that networks misclassify them. In the context of image classification, an extremely small perturbation can change the label of a correctly classified image Szegedy et al. (2014); Goodfellow et al. (2014). Ilyas et al. (2019) investigated adversarial robustness from a theoretical perspective. The authors address "useful, non-robust features": useful because they help a network improve its accuracy, and non-robust because they are imperceptible to humans and thus not intended to be used for classification. Normally, a model considers robust features to be about as important as non-robust ones, yet adversarial examples encourage it to rely only on non-robust features. Ilyas et al. (2019) introduce a framework to explain the phenomenon of adversarial vulnerability: a feature f is considered a "ρ-useful feature" if it is correlated with the true label in the dataset, and "γ-robustly useful features" are those that remain ρ-useful under a set of adversarial perturbations. While Ilyas et al. (2019) constitutes a fundamental advance in the theoretical understanding of adversarial examples, and opens the way to a thorough theoretical characterization of the relation between network architecture and robustness to adversarial perturbations, little attention has been paid to how robustness throughout the network is guaranteed and whether adversarial training must be applied to the entire network.
In this paper, we develop a new theoretical framework that tracks robustness across the layers of a DNN and shows that if the early layers are adversarially trained and sufficiently connected to the rest of the network, then adversarial robustness of the later layers follows; by connectivity we mean that the early and later layers are highly mutually dependent. These findings raise a fundamental question: How can we make a whole model robust to adversarial inputs by making only part of the model robust? In addition, the vulnerability of models trained using standard methods to adversarial perturbations makes it clear that the paradigm of adversarially robust learning differs from the classic learning setting. In particular, we already know that robustness comes at the cost of computationally expensive training methods (more training time) Zhang et al. (2019), as well as a potential need for more training data and memory capacity Schmidt et al. (2018). Hence, one notable challenge in adversarially robust learning is managing computational complexity while maintaining the desired performance. To this end, by exploiting the possibility that subnetworks can be robust to adversarial attacks, we propose a novel approach that theoretically analyzes adversarial robustness guarantees in a network when only a subset of layers is adversarially trained. This work also pioneers the new concept of "semirobustness", which denotes adversarial robustness of a part of the network. This includes a new perspective on adversarial perturbations and a novel theoretical framework supporting the following claim: if a subnetwork is robust, is highly dependent on the rest of the network, and passes sufficient connectivity toward the last layer, then the remaining layers are also guaranteed to be robust.
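To make the idea of adversarially training only a subset of layers concrete, the following is a minimal NumPy sketch on a toy two-layer network: only the first layer (the "subnetwork") is updated with gradients computed on FGSM-perturbed inputs, while the last layer is updated on clean inputs. The architecture, data, FGSM inner step, and all hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs, labels y in {-1, +1}.
n, d, h = 200, 10, 16
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * 0.5 + rng.normal(size=(n, d))

W1 = rng.normal(scale=0.1, size=(h, d))   # "subnetwork" F^(j): first layer
w2 = rng.normal(scale=0.1, size=h)        # remaining part of the network

def loss_and_grads(X, y, W1, w2):
    Z = X @ W1.T                          # pre-activations of layer 1
    H = np.maximum(Z, 0.0)                # ReLU features
    m = y * (H @ w2)                      # margins y * logit
    loss = np.mean(np.logaddexp(0.0, -m)) # logistic loss
    g = -y / (1.0 + np.exp(m)) / len(y)   # dL/dlogit
    gw2 = H.T @ g                         # gradient for the last layer
    gH = g[:, None] * w2[None, :] * (Z > 0)
    gW1 = gH.T @ X                        # gradient for the subnetwork
    gX = gH @ W1                          # input gradient, used for FGSM
    return loss, gW1, gw2, gX

eps, lr = 0.05, 0.5
for _ in range(200):
    _, _, gw2, gX = loss_and_grads(X, y, W1, w2)
    w2 -= lr * gw2                        # later layer: clean update
    X_adv = X + eps * np.sign(gX)         # FGSM perturbation of the batch
    _, gW1, _, _ = loss_and_grads(X_adv, y, W1, w2)
    W1 -= lr * gW1                        # subnetwork: adversarial update

clean_loss, _, _, gX = loss_and_grads(X, y, W1, w2)
adv_loss, *_ = loss_and_grads(X + eps * np.sign(gX), y, W1, w2)
print(clean_loss, adv_loss)
```

The split of the update rule (adversarial gradients for `W1`, clean gradients for `w2`) is the only point of the sketch; a real implementation would use an autodiff framework and a full attack such as PGD.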

Contributions

To summarize, our contributions in this paper are: (1) We introduce the novel concept of semirobustness in subnetworks and show that a subnetwork is semirobust if and only if all layers within it are semirobust. (2) For the first time, we provide a theoretical framework and prove that, under some assumptions, if the first part of the network is semirobust then robustness of the second part of the network is guaranteed. (3) Experimentally, we demonstrate that, given sufficient mutual dependency between subnetworks, our method attains the same adversarial robustness as regular adversarial training of the full network.

2. SUBNETWORK ROBUSTNESS

Notations We assume that a given DNN has a total of n layers, where F^(n) : X → Y is a function mapping the input space X to a set of classes Y. Let f^(l) denote the l-th layer of F^(n), defined as f^(l)(x_{l-1}) = σ^(l)(w^(l) x_{l-1} + b^(l)), where σ^(l) is the activation function in layer l. We write F^(i,j) := f^(j) ∘ … ∘ f^(i) for a subnetwork, i.e., a group of consecutive layers f^(i), …, f^(j), and F^(j) := F^(1,j) = f^(j) ∘ … ∘ f^(1) for the first part of the network up to layer j. Finally, π(y) denotes the prior probability of class label y ∈ Y. In this section, we define the notion of a Semirobust Subnetwork; we discuss semirobustness further in Section 2.1.

Definition 1 (Semirobust Subnetwork) Suppose input X and label y are sampled from a joint distribution D. For a given distribution D, a subnetwork F^(j) is called γ_j-semirobust if there exists a mapping function G_j : L_j → Y such that

E_{(X,y)∼D} [ inf_{δ∈S_X} y ∘ G_j ∘ F^(j)(X + δ) ] ≥ γ_j,

where S_X denotes the set of admissible perturbations of X.
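Definition 1 suggests a simple empirical estimator: fix a subnetwork F^(j) and a probe G_j, and average each input's worst-case correctness over sampled perturbations δ ∈ S_X. The sketch below uses a random ReLU layer as F^(j) and a nearest-class-mean classifier as G_j; both are illustrative assumptions, and since sampling δ only upper-bounds the true infimum, the resulting estimate of γ_j is optimistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h, eps = 300, 8, 12, 0.1

# Toy data with labels y in {-1, +1}, and a toy subnetwork F^(j):
# a single fixed random ReLU layer.
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * 0.8 + rng.normal(size=(n, d))
W = rng.normal(scale=0.5, size=(h, d))

def F_j(X):                      # subnetwork features
    return np.maximum(X @ W.T, 0.0)

# Probe G_j : L_j -> Y, here a nearest-class-mean rule in feature space.
feats = F_j(X)
mu_pos = feats[y > 0].mean(axis=0)
mu_neg = feats[y < 0].mean(axis=0)

def G_j(H):
    d_pos = np.linalg.norm(H - mu_pos, axis=1)
    d_neg = np.linalg.norm(H - mu_neg, axis=1)
    return np.where(d_pos < d_neg, 1.0, -1.0)

# Monte-Carlo estimate of gamma_j: for each x, take the worst correctness
# over sampled sign perturbations delta in an l_inf ball of radius eps.
worst = np.ones(n)
for _ in range(50):
    delta = eps * np.sign(rng.normal(size=(n, d)))
    correct = (G_j(F_j(X + delta)) == y).astype(float)
    worst = np.minimum(worst, correct)
gamma_hat = worst.mean()
print(gamma_hat)
```

A faithful estimate would replace the random sampling of δ with an inner attack (e.g., PGD) against the composition G_j ∘ F^(j).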



Adversarial examples present a major threat to the security of deep-learning systems; a robust classifier, however, can correctly label adversarially perturbed images. For example, an adversary could alter images of the road to fool a self-driving car's neural network into misclassifying traffic signs Papernot et al. (2016a), reducing the car's safety, but a robust network would detect and reject the adversarial inputs Ma et al. (2018); Biggio et al. (2013). The problem of finding perturbed inputs, known as adversarial attacks, has been studied extensively Kurakin et al. (2017); Sharif et al. (2016); Brown et al. (2017); Eykholt et al. (2018). To handle adversarial attacks, two major directions have been studied: (1) efficient methods to find adversarial examples Su et al. (2019); Laidlaw & Feizi (2019); Athalye et al. (2018); Liu et al. (2016); Xie et al. (2017); Akhtar & Mian (2018), and (2) adversarial training to make deep neural networks more robust against adversarial attacks Madry et al. (2018); Tsipras et al. (2019); Gilmer et al. (2019); Ilyas et al. (2019); Papernot et al. (2016b). Adversarial perturbations may be applied to the input or to the network's hidden layers Goodfellow et al. (2014); Szegedy et al. (2014), and it has been shown that this strategy is effective at improving a network's robustness Goodfellow et al. (2014). Several theories have been developed to explain the phenomenon of adversarial examples Raghunathan et al. (2018); Xiao et al. (2019); Cohen et al. (2019); Shamir et al. (2019); Fawzi et al. (2016); Carlini & Wagner (2017); Weng et al. (2018); Ma et al. (2018).
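As an illustration of the iterative attacks cited above (I-FGSM/PGD), the following sketch runs projected gradient ascent on the input of a simple logistic model, whose input gradient is available in closed form; the weights, ε, step size, and step count are arbitrary choices for demonstration only.

```python
import numpy as np

# Logistic model: p(y=+1 | x) = sigmoid(w . x), with logistic loss
# L(x) = log(1 + exp(-y * w . x)).  The input gradient is analytic:
# dL/dx = -y * sigmoid(-y * w . x) * w.
w = np.array([1.0, -2.0, 0.5])   # illustrative fixed weights
x0 = np.array([0.5, 0.2, -0.1])  # clean input
y = 1.0
eps, alpha, steps = 0.25, 0.1, 10

def loss(x):
    return np.logaddexp(0.0, -y * (w @ x))

def grad_x(x):
    return -y * w / (1.0 + np.exp(y * (w @ x)))

# PGD in the l_inf ball of radius eps around x0:
# ascend the loss, then project back onto the ball.
x = x0.copy()
for _ in range(steps):
    x = x + alpha * np.sign(grad_x(x))     # signed gradient ascent step
    x = x0 + np.clip(x - x0, -eps, eps)    # projection onto the ball

print(loss(x0), loss(x))
```

Running a single step (`steps = 1`) with `alpha = eps` reduces the procedure to FGSM; the multi-step variant is what the robustness experiments in this paper refer to as I-FGSM and PGD.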

